selecting columns in awk discarding corresponding header - awk
How do I properly select columns in awk after some processing? My file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example, let's get the new first (line numbers) and third columns, like this:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but the expected output is (note B is now the third column, as we added linenumber as the first):
linenumber;B
1;6
2;5
3;2
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;B
1;6
2;5
3;2
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why do you print 3 fields (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also with the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods but they don't matter here).
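As a minimal illustration using the question's foo file, the three ways of setting OFS are equivalent here:

awk -F\; '{print $1,$2}' OFS=\; foo          # command line variable assignment
awk -F\; -v OFS=\; '{print $1,$2}' foo       # -v option
awk -F\; 'BEGIN{OFS=";"}{print $1,$2}' foo   # BEGIN block

Each prints the first two input columns joined by ; (A;B, 9;6, and so on).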
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you might as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
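For instance, passing cols='2' (the B column) reproduces the expected output from the question:

$ awk -F\; -v cols='2' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;B
1;6
2;5
3;2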
Related
assigning a var inside AWK for use outside awk
I am using ksh on AIX. I have a file with multiple comma-delimited fields. The value of each field is read into a variable inside the script. The last field in the file may contain multiple |-delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.

Sample value of $principal_diagnosis0:

R65.20|A41.9|G30.9|F02.80

I've tried:

echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'

but I get this message: awk: Field $i is not correct.

My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).

echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'

gets me the output of:

A41.9
G30.9
F02.80

So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk. Thanks!
To answer your specific question:

$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9

The above will work with any awk; you can do it more briefly with GNU awk if you have it:

foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
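To see why the sub(/\n/,"") is there: with RS='|' every |-separated value becomes its own record, and the last record carries the trailing newline that echo appends. A quick illustration:

$ echo 'R65.20|A41.9|G30.9' | awk -v RS='|' '{printf "record %d: [%s]\n", NR, $0}'
record 1: [R65.20]
record 2: [A41.9]
record 3: [G30.9
]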
You can make FS and OFS do all the hard work:

echo "${principal_diagnosis0}" | mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9

Another slightly different variation of the same concept, overwriting fields but leaving OFS as is:

gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9

This works because, when you break it out:

gawk -F'^.*R[^|]+[|]|[|].+$' '
    { print NF }
    $(_=--NF)=$(__=--NF)
    { print _, __, NF, $0 }'
3
1 2 1 A41.9

you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2; but since the final NF is now reduced to just 1, it prints the value once instead of 2 copies of it. Which means you can also make it $0 = $2, as such:

gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9

A third variation, this time using RS instead of FS:

mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9

And if you really don't wanna mess with FS/OFS/RS, use gsub() instead:

nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9
linux csv file concatenate columns into one column
I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through.

I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p

will become:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p

I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row.

Examples (by request):

How many columns are in the row?

sed 's/[^,]//g' | wc -c

Get the first 10 columns:

cut -d, -f1-10

Get the last 4 columns:

rev | cut -d, -f1-4 | rev

Concatenate columns 10 and 11, showing columns 1-10 after that:

awk -F',' 'NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11 }'
Awk solution:

awk 'BEGIN{ FS=OFS="," }
{
    diff = NF - 14
    for (i=1; i <= NF; i++)
        printf "%s%s", $i, (diff > 0 && i >= 10 && i < (10+diff) ? "" : (i == NF ? ORS : ","))
}' file

The guard is diff > 0 so that a 15-column row, which needs exactly one merge, is handled too.

The output:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
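A quick check on a hypothetical 15-column row, where exactly one merge is needed:

$ echo 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o' | awk 'BEGIN{ FS=OFS="," }
{ diff = NF - 14
  for (i=1; i <= NF; i++)
      printf "%s%s", $i, (diff > 0 && i >= 10 && i < (10+diff) ? "" : (i == NF ? ORS : ",")) }'
a,b,c,d,e,f,g,h,i,jk,l,m,n,o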
With GNU awk for the 3rd arg to match() and gensub():

$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
    $0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }

$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
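In case the capture groups are opaque, here is a small sketch (with the repetition count hardcoded to 2, i.e. a 16-column row) of what the array a holds:

$ echo 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p' |
  gawk 'match($0,"(([^,]+,){9})(([^,]+,){2})(.*)",a){print "a[1]=" a[1], "a[3]=" a[3], "a[5]=" a[5]}'
a[1]=a,b,c,d,e,f,g,h,i, a[3]=j,k, a[5]=l,m,n,o,p

a[1] is the first 9 fields (with their trailing commas), a[3] is the run of fields to merge, and a[5] is the rest; gensub then deletes the commas inside a[3].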
If perl is okay - it can be used just like awk for stream processing:

$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4

$ awk -F, '{print NF}' ip.txt
16
18
22

$ perl -F, -lane '$n = $#F - 4; print join ",", (@F[0..8], join("", @F[9..$n]), @F[$n+1..$#F])' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4

-F, -lane : split on , with the results saved in the @F array
$n = $#F - 4 : magic number to ensure the output ends with 14 columns; $#F gives the index of the last element of the array (won't work if the input line has fewer than 14 columns)
join : stitches array elements together with the specified string
@F[0..8] : array slice with the first 9 elements
@F[9..$n] and @F[$n+1..$#F] : the other slices as needed

Borrowing from Ed Morton's regex-based solution:

$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4

$n=$#F-13 : magic number
^([^,]*,){9}\K : first 9 fields
([^,]*,){$n} : fields to change
$&=~tr|,||dr : use tr to delete the commas
e : this modifier allows use of Perl code in the replacement section

This solution also has the added advantage of working even if the input line has fewer than 14 columns.
You can try this GNU sed:

sed -E '
  s/,/\n/9g
  :A
  s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
  tA
  s/\n/,/g
' infile
First variant - with awk:

awk -F, '
{
    for(i = 1; i <= NF; i++) {
        OFS = (i > 9 && i < NF - 4) ? "" : ","
        if(i == NF) OFS = "\n"
        printf "%s%s", $i, OFS
    }
}' input.txt

Second variant - with sed:

sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt

or, more straightforwardly (without a loop) and probably faster:

sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt

Testing

Input:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u

Output:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool.

Source file, copied from one of the other answers:

$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4

Concatenating columns:

$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4
Explanation for a specific command in awk
Can you explain to me what this command is doing: dateA="$dateA"?

awk 'FNR>1 && dateA<=$5' FS='|' dateA="$dateA" "$infile"
awk 'FNR > 1 && dateA <= $5' FS='|' dateA="$dateA" "$infile"

FNR is a variable that gives you the number of records read so far from the current file. Don't confuse it with the variable NR: FNR and NR values will be the same as long as awk reads the first file; for the second file, FNR resets, whereas NR doesn't. This is how FNR and NR work in awk:

$ seq 1 5 >file1
$ seq 1 3 >file2
$ cat file1
1
2
3
4
5
$ cat file2
1
2
3
$ awk '{print "Current line : "$0,"File: "FILENAME,"FNR : ",FNR,"NR : ",NR}' file1 file2
Current line : 1 File: file1 FNR :  1 NR :  1
Current line : 2 File: file1 FNR :  2 NR :  2
Current line : 3 File: file1 FNR :  3 NR :  3
Current line : 4 File: file1 FNR :  4 NR :  4
Current line : 5 File: file1 FNR :  5 NR :  5
Current line : 1 File: file2 FNR :  1 NR :  6
Current line : 2 File: file2 FNR :  2 NR :  7
Current line : 3 File: file2 FNR :  3 NR :  8

FNR > 1 && dateA <= $5

If the number of records read is greater than 1 and the variable dateA is less than or equal to the 5th field/column, we get a boolean true state, so such a line will be printed.

FS='|'

FS is the input field separator. You can also set it like:

awk -F'|' '{ .... }'

or

awk -v FS='|' '{ .... }'

or

awk 'BEGIN{FS="|"}{ .... }'

dateA="$dateA"

dateA is an awk variable whose value is taken from your shell variable $dateA. Similarly, you can set it like:

awk -v dateA="$dateA" '{ .... }'

Your above command can also be rewritten like this:

awk -F'|' -v dateA="$dateA" 'FNR>1 && dateA <= $5' "$infile"

Some people prefer awk 'condition{action}' for better reading, so you can also write it as:

awk -F'|' -v dateA="$dateA" 'FNR>1 && dateA <= $5{ print }' "$infile"

Here, if the condition FNR>1 && dateA <= $5 is true, the action is to print the line; print and print $0 are the same.
Please go through the following explanation and let me know if it helps you.

Explanation: kindly don't run the following awk, it is expanded for explanation purposes only.

awk '
FNR>1 && dateA<=$5  ##FNR denotes the number of the current line in awk, so here 2 conditions joined by AND are being checked:
                    ##1st, is the current line number greater than 1; 2nd, is the value of the variable named dateA less than
                    ##or equal to the 5th field.
                    ##awk works on the method of condition and then action, so if any condition is TRUE then the action will
                    ##happen. Here a condition is present but NO action is defined, so by default the print action will happen:
                    ##printing the current line.
' FS='|'            ##FS denotes the field separator; in awk we can define the field separator ourselves too, so making it | here.
dateA="$dateA"      ##Creating a variable named dateA whose value is equal to the shell variable named dateA. In awk, to assign
                    ##shell variable values to awk variables, we have to create an awk variable and then assign the shell
                    ##variable's value to it.
"$infile"           ##Mentioning the Input_file name here which awk has to go through. Point to be noted here: "$infile" means
                    ##it is a shell variable (as we all know, to print a shell variable's value we have to use "$infile").
AWK allows assigning internal variables in the arguments with the form var=value. Since AWK does not have access to shell variables, dateA="$dateA" is used to "export" dateA to the AWK script.

Note that assignment arguments are processed during file processing, after BEGIN, and can be used in-between files:

$ echo >file1; echo >file2
$ awk -vx=0 '
    BEGIN { print "BEGIN", x }
    { print FILENAME, x }
    END { print "END", x }' x=1 file1 x=2 file2 x=3
BEGIN 0
file1 1
file2 2
END 3
How to get rows with values >=2 in at least 2 columns?
I am trying to extract rows where the value is >=2 in at least two columns. My input file looks like this:

gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2

and my awk script looks like this:

cat input_file | awk 'BEGIN{FS=",";OFS=","};{count>=0;for(i=2; i<4; i++) {if($i!=0) {count++}};if (count>=2){print $0}}'

which doesn't give me the expected output, which should be:

gain,top1,sos1,pho1
ATC3,6,6,0

What is the problem with this script? Thanks.
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file

Or the one below, which prints and goes to the next line immediately after finding 2 values (reasonably faster):

awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file

Explanation:

awk -F, '               # call awk and set the field separator as comma
FNR>1{                  # we want to skip the header, so: only if the no of records read from the current file is greater than 1
  f=0                   # set variable f = 0
  for(i=2; i<=NF; i++)  # start looping from the second field to the no of fields in the record/line/row
  {
    if($i>=2)f++        # if the field value is greater than or equal to 2, increment variable f
    if(f>=2)            # did we get 2 such values?
    {
      print             # print the record/line/row
      next              # we got enough, go to the next line
    }
  }
}FNR==1                 # if the first record is being read then print; in fact if FNR==1 we get boolean true, so the default operation happens: print $0, that is the current record/line/row
' file

Input:

$ cat file
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2

Output-1:

$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0

Output-2 (reasonably faster):

$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
Hacky awk, handles the header as well:

$ awk -F, '($2>=2) + ($3>=2) + ($4>=2) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0

or,

$ awk -F, 'function ge2(x) {return x>=2?1:0} ge2($2) + ge2($3) + ge2($4) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
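This works because each comparison evaluates to 1 (true) or 0 (false), so the sum counts how many columns pass the test; a tiny illustration with a made-up row:

$ echo '5,1,4' | awk -F, '{print ($1>=2), ($2>=2), ($3>=2)}'
1 0 1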
@pali: try this; it should be much faster:

awk '{Q=$0;}(gsub(/,[2-9]/,"",Q)>=2) || FNR==1' Input_file

Here I am putting the line's value into a variable named Q, then globally substituting, in Q, every match of a comma followed by a digit from 2 to 9 with NULL. gsub returns the number of substitutions it made, so I check whether that count is greater than or equal to 2: if it is, or if the line number is 1, the current line is printed.
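A minimal check of the gsub() return value on one of the sample rows (note the pattern only matches single-digit values 2-9, which is enough for this data):

$ echo 'ATC3,6,6,0' | awk '{Q=$0; print gsub(/,[2-9]/,"",Q)}'
2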
Is there a way to completely delete fields in awk, so that extra delimiters do not print?
Consider the following command:

$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv

When I set $2 = $3 = "", the intended effect is to get the same effect as writing:

print $1,$4,$5...$NF

However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.

Is it possible to actually delete $2 and $3?

Note: if this were on Linux in bash, the correct statement would be the following, but Windows does not handle single quotes well in cmd.exe:

$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie. As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.

$ cat test.awk
function rmcol(col, i) {
    for (i=col; i<NF; i++) {
        $i = $(i+1)
    }
    NF--
}
{ rmcol(3) }
1

$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF. So you can shift all the later fields down to overwrite $2 and $3, then decrement NF by two, which erases the last two fields:

$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:

$ cut -f 1,4- file.txt

To emulate cut:

$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf "%s\n", $i; else printf "%s\t", $i } }' file.txt

Similarly:

$ awk -F "\t" '{ delim = ""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf "%s%s", delim, $i; delim = "\t" } printf "\n" }' file.txt

HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:

$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
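One caveat worth noting: the gsub also squeezes fields that were legitimately empty in the input, e.g.:

$ echo '1;;2;3' | awk -F';' '{gsub(FS"+",FS); print}'
1;2;3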
One way could be to remove fields like you do and then remove the extra spaces with gsub:

$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In addition to the answer by Suicidal Steve, I'd like to suggest one more solution, using sed instead of awk. It seems more complicated than the usage of cut, as Steve suggested, but it is the better solution because sed -i allows editing in place:

$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using gensub and regenerate it as follows:

awk -F '\t' 'BEGIN{OFS="\t"}
{
    $0=gensub(/[^\t]*\t/,"",3)
    $0=gensub(/[^\t]*\t/,"",2)
    print
}' Input.tsv
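What makes this work is gensub's third argument, which selects the occurrence to replace ("g" or an occurrence number); a small illustration:

$ echo 'a.b.c.d' | gawk '{print gensub(/\./, "-", 2)}'
a.b-c.d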
The method presented in the answer of ghoti has some problems:

- every assignment of $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
- changing the value of NF manually is not POSIX compliant and leads to undefined behaviour (as is mentioned in the comments).

A somewhat more cumbersome, but stable and robust way to delete a set of columns would be:

a single column:

awk -v del=3 '
  BEGIN{FS=fs;OFS=ofs}
  {
    b=""
    for(i=1;i<=NF;++i)
      if(i!=del) b=(b?b OFS:"") $i
    $0=b
  }
  # do whatever you want to do
' file

multiple columns:

awk -v del=3,5,7 '
  BEGIN{FS=fs;OFS=ofs; del="," del ","}
  {
    b=""
    for(i=1;i<=NF;++i)
      if (del !~ ","i",") b=(b?b OFS:"") $i
    $0=b
  }
  # do whatever you want to do
' file
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:

$ echo "1,2,,,5" | tr -s ','
1,2,5
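The same caveat applies here: -s squeezes every run of the character, so fields that were genuinely empty elsewhere in the record are collapsed too:

$ echo "1,,2,,,5" | tr -s ','
1,2,5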
echo one two three four five six | awk '{
print $0
is3=$3
$3=""
print $0
print is3
}'
one two three four five six
one two  four five six
three