An awk script without hard-coded field information - awk

We have the following awk script that extracts fields 6, 7 and 14 from a CSV file:
awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
The script works beautifully, except that the information about the fields of interest is hard-coded. We would like to be able to pass this information as a single command line argument (or even a series of command line arguments), to make the script more versatile. We tried a few things, including the following, but we keep getting a syntax error:
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file

awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
is not valid awk syntax, which is one reason why
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file
or any variation of it would also give you a syntax error.
This is probably what you're trying to do:
awk -F, -v vals="$2" '
BEGIN { split(vals,tmp); for (i in tmp) arr[tmp[i]] }
{ for (i=1; i<=NF; i++) if (i in arr) printf "%s,", $i; print "" }
' "$input_file"
assuming $2 contains a comma-separated string like 6,7,14 and your input file is a CSV with unquoted fields.
That would still print a trailing , on each line, and looping through all of the fields and discarding the ones you don't want on every input line is an inefficient way to do it. This would solve both of those additional problems:
awk -v vals="$2" '
BEGIN { FS=OFS=","; n=split(vals,arr) }
{ for (i=1; i<=n; i++) printf "%s%s", $(arr[i]), (i<n ? OFS : ORS) }
' "$input_file"

Another option is to not use (g)awk, and use cut:
cut -d "," -f "6,7,14" inputfile
(or: a="6,7,14"; cut -d "," -f "$a" inputfile)
When input contains:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26
output should look like:
f,g,n
6,7,14
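One caveat with cut worth knowing (my own note, not from the answer above): cut always emits the selected fields in their original left-to-right order, no matter how the -f list is written, so it cannot reorder columns:
cut -d "," -f "14,7,6" inputfile   # still prints f,g,n and 6,7,14, not n,g,f
If the fields ever need to come out in the order they were requested, the awk versions above are the way to go.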

Related

awk: transpose column header to first field of row

My input looks like this:
A|B|C
1|2|3
4|5|6
Using awk, I am trying to get:
A|1
B|2
C|3
A|4
B|5
C|6
My code:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x_i=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x_i FS $i
}
}' input
But it keeps only the last iteration of the NR==1 block, even if I use the same loop in the NR>1 block:
C|1
C|2
C|3
C|4
C|5
C|6
Any trick?
EDIT
Thanks to Jose, I needed to change x_i to x[i].
What about if using the same input, I would need to output:
A;B;C|1|2|3
A;B;C|4|5|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{split($0,h);next} {for (i=1;i<=NF;i++) print h[i], $i}' file
A|1
B|2
C|3
A|4
B|5
C|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{gsub(/\|/,";");h=$0;next} {print h, $0}' file
A;B;C|1|2|3
A;B;C|4|5|6
Read Effective Awk Programming, 4th Edition, by Arnold Robbins.
you can try,
awk 'BEGIN{FS=OFS="|"}
NR==1{for(i=1; i<=NF; ++i) d[i]=$i; next}
{for(i=1; i<=NF; ++i) print d[i], $i}
' input
you get
A|1
B|2
C|3
A|4
B|5
C|6
Important Note
Your logic is correct; you just need x[i] instead of x_i:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x[i]=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x[i] FS $i
}
}' input
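For the record (my own illustration of why the original failed): in awk, x_i is a single scalar variable named x_i, so every pass of the loop overwrites it and only the last header field survives, whereas x[i] creates one array element per column:
$ echo 'A|B|C' | awk -F'|' '{for(i=1;i<=NF;i++) x_i=$i; print x_i}'
C
$ echo 'A|B|C' | awk -F'|' '{for(i=1;i<=NF;i++) x[i]=$i; print x[1], x[2], x[3]}'
A B C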
Here is another using split and for:
$ awk 'NR==1 { split($0,a,"|") }
NR>1 { n=split($0,b,"|");
for(i=1;i<=n;i++)
print a[i] "|" b[i] }' file
A|1
B|2
C|3
A|4
B|5
C|6
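The same idea can lean on FS/OFS instead of hard-coding the "|" in split and print (my own variation; the behaviour should be identical):
$ awk 'BEGIN{FS=OFS="|"} NR==1{split($0,a)} NR>1{for(i=1;i<=NF;i++) print a[i], $i}' file
A|1
B|2
C|3
A|4
B|5
C|6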

Calculate average of each column in a file

I have a text file with n rows and columns (comma-separated) and I want to find the average of each column, excluding empty fields.
A sample input looks like:
1,2,3
4,,6
,7,
The desired output is:
2.5, 4.5, 4.5
I tried with
awk -F',' '{ for(i=1;i<=NF;i++) sum[i]=sum[i]+$i;if(max < NF)max=NF;};END { for(j=1;j<=max;j++) printf "%d\t",sum[j]/max;}' input
But it treats consecutive delimiters as one and mixes up the columns.
Any help is much appreciated.
You can use this one-liner:
$ awk -F, '{for(i=1; i<=NF; i++) {a[i]+=$i; if($i!="") b[i]++}}; END {for(i=1; i<=NF; i++) printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)}' foo
2.5 4.5 4.5
Otherwise, you can save this in a file script.awk and run awk -F, -f script.awk your_file:
{
for(i=1; i<=NF; i++) {
a[i]+=$i
if($i!="")
b[i]++}
}
END {
for(i=1; i<=NF; i++)
printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)
}
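For example, running it on the sample input saved as foo should give the same result as the one-liner (my own check):
$ awk -F, -f script.awk foo
2.5 4.5 4.5
One caveat (my own observation): if some column is empty on every line, b[i] stays zero and a[i]/b[i] in the END block triggers a division-by-zero error, so you may want to guard that case.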

tcsh error: while loop

This is a basic program but since I'm a newbie, I'm not able to figure out the solution.
I have a file named rama.xvg in the following format:
-75.635 105.879 ASN-2
-153.704 64.7089 ARG-3
-148.238 -47.6076 GLN-4
-63.2568 -8.05441 LEU-5
-97.8149 -7.34302 GLU-6
-119.276 8.99017 ARG-7
-144.198 -103.917 SER-8
-65.4354 -10.3962 GLY-9
-60.6926 12.424 ARG-10
-159.797 -0.551989 PHE-11
65.9924 -48.8993 GLY-12
179.677 -7.93138 GLY-13
..........
...........
-70.5046 38.0408 GLY-146
-155.876 153.746 TRP-147
-132.355 151.023 GLY-148
-66.2679 167.798 ASN-2
-151.342 -33.0647 ARG-3
-146.483 41.3483 GLN-4
..........
..........
-108.566 0.0212432 SER-139
47.6854 33.6991 MET-140
47.9466 40.1073 ASP-141
46.4783 48.5301 SER-142
-139.17 172.486 LYS-143
58.9514 32.0602 SER-144
60.744 18.3059 SER-145
-94.0533 165.745 GLY-146
-161.809 177.435 TRP-147
129.172 -101.736 GLY-148
I need to extract all the lines containing "ASN-2" in one file all_1.dat and so on for all the 147 residues.
If I run the following command in the terminal, it gives the desired output for ASN-2:
awk '{if( NR%147 == 1 ) printf $0 "\n"}' rama.xvg > all_1.dat
To avoid doing it repeatedly for all the residues, I have written the following code.
#!/bin/tcsh
set i = 1
while ( $i < 148)
echo $i
awk '{if( NR%147 == i ) printf $0 "\n"}' rama.xvg > all_"$i".dat
# i++
end
But this code prints the lines containing GLY-148 in all the output files.
Please let me know what is the error in this code. I think it is related to nesting.
In your awk line, the variable i is an awk variable, not a shell variable! If you want to use the shell variable $i you can do:
awk -v i="$i" '{if( NR%147 == i ) printf $0 "\n"}' rama.xvg > all_"$i".dat
But I think it would be better to put your while-loop into awk:
awk '{for (i=1; i<=147; i++) { if (NR%147==i) {printf $0 "\n" > ("all_" i ".dat") } } }' rama.xvg
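Two side notes (my own additions): in the posted tcsh script, # i++ is a comment, so i never increments; the tcsh way to increment is @ i++. And because the awk variable i is unset, it compares numerically equal to 0, so NR%147 == i only matches every 147th line, which is exactly the GLY-148 lines. Note also that the single awk command above never writes the lines where NR % 147 == 0; a sketch that covers them as well (assuming the file really is whole blocks of 147 lines, ASN-2 through GLY-148):
awk '{ r = NR % 147; if (r == 0) r = 147; print > ("all_" r ".dat") }' rama.xvg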

how to collapse fields with zeros from a datafile

I have a file with data as follows
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
I would like to change the delimiters for col6 through col10 to pipe '|', with each column value preceded by its column name.
Desired Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
I tried using the command
awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
{for (i=1; i<=NF; i++) if ($i>0) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
((i<NF)? ((i>5)?"|":",") : ORS)}' data.csv
but I am not getting the expected result.
Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,col6:1|col8:1|col10:1
1,1,1,1,col6:2|col7:3|5,1,1,col9:1|
The rows that do not contain a zero end with '|' and the next line's data starts right there!
In this example, row 2's data ends with a pipe '|' and row 3's data starts on row 2; row 4's data also ends with a pipe '|'.
Can someone help me fix this, please?
P.S.: For people looking for the reason behind all this work: I'm trying to load the data from a CSV file into a framework. The source data has 10 columns and the destination dataset would have 6 columns - the first 5 from the source as-is and the rest as a map. I also have to make sure that there is no map key whose value is zero, and then start the data analysis on the set.
This post is to get help with making the data set ready for analysis.
$ awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next} {for (i=1; i<=5; i++) printf $i","; b=""; for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";} printf ORS}' data.csv
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
Or, written over multiple lines:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<=5; i++) printf $i","
b=""
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
printf ORS
}
' data.csv
How it works
NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next}
For the first line, NR==1, we save each field, with a trailing colon, into the array hdr. Then the rest of the commands are skipped and we just go to the next line.
for (i=1; i<=5; i++) printf $i","
If we get here, we are working on the second or later lines. In this case, we print the first five fields, each followed by a comma.
b=""
We initialize the variable b to the empty string.
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
For fields 6 to the end, if the field is nonzero, we print b followed by the hdr followed by the field value. After we have encountered the first such nonzero field, b is set to |.
printf ORS
After printing the last field, we print an output record separator (default is a newline).
The above solution is excellent and helps me with a similar issue. However, I need to cater for an all-zero case in columns 6 to 10. See the last line of your data below.
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
5,1,1,0,0,0,0,0,0,0
This might never happen in your data, however if it does you are left with an inconvenient comma at the end of the line:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0,
To get around it I made a change. Here it is, somewhat spread out for clarity:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<5; i++) printf("%s,", $i);
if(i==5) printf("%s", $i);
b="";
for (i=6; i<=NF; i++) {
if ($i>0) {
if(b=="") b=","; else b="|";
printf("%s%s",b, hdr[i] $i);
}
}
printf(ORS);
}
' data.csv
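With that change, the all-zero row should come out without the trailing comma (my own expectation from reading the code, not output taken from the original post):
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0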

Rounding floating-point numbers using AWK

I have a file b.xyz as,
-19.794325 -23.350704 -9.552335
-20.313872 -23.948248 -8.924463
-18.810708 -23.571757 -9.494047
-20.048543 -23.660052 -10.478968
I want to limit each of the entries to three decimal digits.
I tried this one
awk '{ $1=sprintf("%.3f",$1)} {$2=sprintf("%.3f",$2)} {$3=sprintf("%.3f",$3)} {print $1, $2, $3}' b.xyz
it works for three columns, but how can I expand it to apply to all n columns?
If you will always have three fields, then you can use:
$ awk '{printf "%.3f %.3f %.3f\n", $1, $2, $3}' file
-19.794 -23.351 -9.552
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
For an arbitrary number of fields, you can do:
$ awk '{for (i=1; i<=NF; i++) printf "%.3f%s", $i, (i==NF?"\n":" ")}' file
-19.794 -23.351 -9.552
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
It will loop through all the fields and print them. (i==NF?"\n":" ") prints a newline after the last field and a space otherwise.
Or even (thanks Jotne!):
awk '{for (i=1; i<=NF; i++) printf "%.3f %s", $i, (i==NF?RS:FS)}' file
Example
$ cat a
-19.794325 -23.350704 -9.552335 2.13423 23 23223.23 23.23442
-20.313872 -23.948248 -8.924463
-18.810708 -23.571757 -9.494047
-20.048543 -23.660052 -10.478968
$ awk '{for (i=1; i<=NF; i++) printf "%.3f %s", $i, (i==NF?"\n":" ")}' a
-19.794 -23.351 -9.552 2.134 23.000 23223.230 23.234
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
$ awk '{for (i=1; i<=NF; i++) printf "%.3f %s", $i, (i==NF?RS:FS)}' a
-19.794 -23.351 -9.552 2.134 23.000 23223.230 23.234
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
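If the precision itself needs to be a parameter rather than hard-coded (my own extension, not part of the answers above), the format string can be built from a variable, which sidesteps dynamic-precision specifiers that not every awk supports:
$ awk -v p=3 'BEGIN{fmt="%." p "f%s"} {for (i=1; i<=NF; i++) printf fmt, $i, (i==NF?"\n":" ")}' a
-19.794 -23.351 -9.552 2.134 23.000 23223.230 23.234
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479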