Awk variable and summation

I would like to understand the difference in result for the following awk commands.
I have read that awk initializes numeric variables to zero by default, so I would assume that sum=0 is implicit.
However, the first command below gives an incorrect result, while the second is correct.
Aim: Find the total number of lines in a file without using NR
financial.txt
14D 20190503 0.31 0.31 0.295 0.295 117949
14DO 20190503 0.00 0.00 0.00 0.07 0
1AD 20190503 0.18 0.19 0.18 0.19 54370
1AG 20190503 0.041 0.042 0.041 0.042 284890
1AL 20190503 0.00 0.00 0.00 0.88 0
1ST 20190503 0.05 0.05 0.049 0.049 223215
3DP 20190503 0.049 0.054 0.048 0.048 2056379
3PL 20190503 1.055 1.06 1.02 1.05 120685
4CE 20190503 0.00 0.00 0.00 0.009 0
4DS 20190503 0.072 0.076 0.072 0.075 2375896
$ awk 'BEGIN {sum+=1} END {print sum}' financial.txt
1
$ awk 'BEGIN {sum=0}{sum+=1} END {print sum}' financial.txt
5527
Thanks
After reviewing the comments, I found the solution I was looking for, without using BEGIN.
$ awk '{sum+=1}END{print sum}' financial.txt
5527

All awk variables are initialized to zero-or-null: if first used in a numeric context they become 0 at that point, while if first used in a string context they become null at that point. With regard to your code samples, this:
BEGIN {sum+=1} END {print sum}
means:
BEGIN {sum+=1}
END {print sum}
while this:
BEGIN {sum=0}{sum+=1} END {print sum}
means:
BEGIN {sum=0}
<true> {sum+=1}
END {print sum}
See the difference? Add ;print sum before every } to trace how sum is being populated if it's not obvious what's happening.
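For example, tracing the first one-liner (a minimal sketch, run against the same financial.txt) shows that sum is only ever touched in the BEGIN block:
$ awk 'BEGIN {sum+=1; print "after BEGIN:", sum} END {print "at END:", sum}' financial.txt
after BEGIN: 1
at END: 1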

From GNU AWK Manual :
A BEGIN rule is executed once only, before the first input record is read. Likewise, an END rule is executed once only, after all the input is read.
Thus, the following executes the "{sum+=1}" statement only once.
awk 'BEGIN {sum+=1} END {print sum}' financial.txt
But in the second case, "{sum+=1}" is executed for every line read from the file.
awk 'BEGIN {sum=0}{sum+=1} END {print sum}' financial.txt
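The ordering is easy to see with a throwaway input (a minimal sketch; the printf input is made up for illustration):
$ printf 'a\nb\nc\n' | awk 'BEGIN {print "begin"} {print "line", NR} END {print "end"}'
begin
line 1
line 2
line 3
end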

Related

How can I filter by value for a multi-valued column using awk?

I would like to use awk to filter rows by a multi-valued column.
My data has two columns with the delimiter ;. The second column holds three float values separated by whitespace.
randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52
What I want to achieve is the following: keep all rows where the first and second values of the second column are bigger than 200.
randUni18368:1;223.67 219.17 0.00
Update:
With help from the comments, I tried this and it worked:
awk -F ";" '{split($2, a, " "); if (a[1] > 200 && a[2] > 200) print}'
One awk idea:
awk -F';' '{ n=split($2,a,/[[:space:]]+/) # split 2nd field on spaces; place values in array a[]
if (a[1] > 200 && a[2] > 200) # if 1st and 2nd array entries > 200 then ...
print # print current line to stdout
}
' randum.dat
# or as a one-liner
awk -F';' '{ n=split($2,a,/[[:space:]]+/); if (a[1] > 200 && a[2] > 200) print}' randum.dat
# reduced further based on OP's comments/questions:
awk -F';' '{ split($2,a," "); if (a[1] > 200 && a[2] > 200) print}' randum.dat
This generates:
randUni18368:1;223.67 219.17 0.00
I would harness GNU AWK for this task in the following way. Let file.txt content be
randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52
then
awk '(gensub(/.*;/,"",$1)+0)>200&&$2>200' file.txt
gives output
randUni18368:1;223.67 219.17 0.00
Explanation: I use the gensub function on $1 so that anything up to and including ; is replaced by an empty string, i.e. removed, and the result is returned. Observe that gensub does not alter $1 in this process. I then add zero to convert the result into a number, check whether it is more than 200, and check whether the 2nd field is bigger than 200.
(tested in gawk 4.2.1)
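To check the extraction step on its own, here is a minimal sketch (it passes the target and the occurrence count to gensub explicitly; the echoed line is taken from the sample data):
$ echo 'randUni18368:1;223.67 219.17 0.00' | gawk '{print gensub(/.*;/, "", 1, $1)}'
223.67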

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
I would like to print the value of Sigma for id. Note that id occurs twice in the file, so I used a condition based on NRM_V too.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following. I have added awk's exit statement here, which makes the code exit as soon as the first occurrence of the condition is found; it also saves time, since awk no longer reads the whole Input_file.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with columns:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read the file from the last line towards the first line and get its first match, then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with column comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile

Calculate distances of continuous points

I need help!
I have several points ABCDEF... with positions like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
...
I aim to calculate the distances AB, BC, CD, DE, ... and their running sum, with output of this form:
sum_distance(AB)
sum_distance(AB+BC)
sum_distance(AB+BC+CD)
sum_distance(AB+BC+CD+DE)
sum_distance(AB+BC+CD+DE+EF)
....
I found on the internet that awk can do this and tried to apply it to my case. However, no result or error appeared on screen. Could you please help me with this situation?
bash shell, awk
awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) {split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t" } NR--}'
Desired output:
2.934280150
4.728297987
7.470140434
9.682130488
11.92469598
......
You don't need such a complex script for this trivial task. Try this instead:
awk 'NR>1{ printf "%.9f\n",s+=sqrt(($2-x)^2+($3-y)^2+($4-z)^2) }
{ x=$2;y=$3;z=$4 }' file
For every point but A, calculate the distance, add it to the sum s and print the sum. For every point, keep the coordinates in x, y and z for the next calculation. Its output looks like this with gawk:
2.934280150
4.728297987
7.470140434
9.682130488
What is the Cardinal Rule? (never use code off the internet you don't understand...)
The problem with the awk script you are attempting to use is that it does not exactly fit your case. By testing FNR==NR and then using the loop limits (i=FNR+1;i<=NR-1;i++), it expects multiple input files. For your case, you can simplify the script by removing the loop entirely, since you only have a single input file.
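To see why FNR==NR singles out the first file, here is a minimal sketch (f1 and f2 are throwaway files made up for illustration):
$ printf 'a\nb\n' > f1; printf 'c\n' > f2
$ awk '{print FILENAME, FNR, NR}' f1 f2
f1 1 1
f1 2 2
f2 1 3
FNR resets for each file while NR keeps counting, so FNR==NR holds only while the first file is read.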
You need only save the first row, use next to read the next row, compute and output the distance between the prior row and the current one, store the current row in the a[] array, and repeat until you run out of rows, e.g.
awk '{
a[NR]=$0
if (NR == 1)
next
split(a[NR-1],b)
printf "%s\t%s\n", b[1] "-" $1,
sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
a[NR]=$0
}'
Example Input File
$ cat f
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
Example Use/Output
Simply paste the script into a terminal adding the filename at the end, e.g.
$ awk '{
> a[NR]=$0
> if (NR == 1)
> next
> split(a[NR-1],b)
> printf "%s\t%s\n", b[1] "-" $1,
> sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
> a[NR]=$0
> }' f
A-B 2.93428
B-C 1.79402
C-D 2.74184
D-E 2.21199
Look things over and let me know if you have further questions.
Try this:
awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);printf "%.9f\n",sum[FNR];}' file
With file content like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
will provide output like this:
2.934280150
4.728297987
7.470140434
9.682130488
You didn't provide point F, so your last line of output can't be computed here.
The same script, split over several lines:
awk '
function d(a,b){
split(a,x);
split(b,y);
return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);
}
{p[FNR]=$0}
FNR>1{
sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);
printf "%.9f\n",sum[FNR];
}' file
It's quite straightforward: function d computes the distance, and each line reuses the sum from the former line.
And for fun: if you want to calculate the total distance of a complete graph, starting with one point and gradually adding points, i.e.:
sum_distance(AB)
sum_distance(AB+BC+AC)
sum_distance(AB+BC+AC+AD+BD+CD)
...
Then just a little improvement will do, like this:
$ awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1];for(i=FNR-1;i>0;i--)sum[FNR]+=d(p[i],p[FNR]);printf "%.9f\n",sum[FNR];}' file
2.934280150
6.160254691
14.349070561
22.466306583

awk to match two fields in two files

I want to find lines where fields 1 and 2 from file1 match fields 2 and 3 from file2, and then print all fields from file2. There are more lines in file2 than in file1.
File1
rs116801199 720381
rs138295790 16057310
rs131531 16870251
rs131546 16872281
rs140375 16873251
rs131552 16873461
File2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
1 rs12565286 721290 0.028 1.000 1.000 2 0.370 0.934 0.000
1 rs3094315 752566 0.432 1.000 1.000 2 0.678 0.671 0.435
--- rs3131972 752721 0.353 0.906 0.938 0 -1 -1 -1
--- rs61770173 753405 0.481 0.921 0.950 0 -1 -1 -1
I tried something like:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2 > test
But I got a syntax error.
Consider:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2
The option -F expects an argument, but none was provided here. The result is that awk interprets the entirety of the code as the field separator and then takes the first file name as the program text. That is why that code does not run as expected.
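You can watch -F swallow the program in miniature (a minimal sketch):
$ echo hello | awk -F '{print}' '{print "FS is:", FS}'
FS is: {print}
Here the string {print} becomes the field separator, not the program.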
From the problem statement, I didn't see why FNR==1 should be in the code, so I removed it. Once that is done, the parens are unnecessary, and the code further simplifies to:
$ awk 'FNR==NR{a[$1];b[$2];next} $2 in a && $3 in b' file1 file2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1

comparing two columns in two files

I would like to compare two columns in two files.
Here's an example:
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file1 file2
However, I want to print the common second column only if the first column is also the same. More specifically, if it finds 722603 in both files, it must check that the first column is also equal to 1 and only then print it. If the number in the second column is repeated, it is important that it gets printed more than once, with the different values of column 1.
I'd be very thankful if you could guide me through this, thank you.
Like this? I extended your code a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
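With the sample files above (assuming the six-column block is file1 and the three-column block is file2), this should print:
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751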