Calculate distances between consecutive points - awk

I need some help!
I have several points ABCDEF... with positions like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
...
I aim to calculate the distances AB, BC, CD, DE, EF, ... and their running sum, with output of this form:
sum_distance(AB)
sum_distance(AB+BC)
sum_distance(AB+BC+CD)
sum_distance(AB+BC+CD+DE)
sum_distance(AB+BC+CD+DE+EF)
....
I found on the internet that awk can do this and tried to apply it to my case. However, no result or error appeared on screen. Could you please help me with this situation?
bash shell, awk
awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) {split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t" } NR--}'
Desired output:
2.934280150
4.728297987
7.470140434
9.682130488
11.92469598
......

You don't need such a complex script for this trivial task. Try this instead:
awk 'NR>1{ printf "%.9f\n",s+=sqrt(($2-x)^2+($3-y)^2+($4-z)^2) }
{ x=$2;y=$3;z=$4 }' file
For every point except the first (A), calculate the distance to the previous point, add it to the sum s, and print the sum. For every point, keep the coordinates in x, y, z for the next calculation. With gawk, its output looks like this:
2.934280150
4.728297987
7.470140434
9.682130488
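If you also want to see which segment each running total covers, a minimal variant of the same idea (a sketch; p is a new variable I'm introducing to remember the previous point's label):
awk 'NR>1{ printf "%s-%s\t%.9f\n", p, $1, s+=sqrt(($2-x)^2+($3-y)^2+($4-z)^2) }
{ p=$1;x=$2;y=$3;z=$4 }' file
On the sample data this should print A-B 2.934280150, B-C 4.728297987, and so on.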

What is the Cardinal Rule? (never use code off the internet you don't understand...)
The problem with the awk script you are attempting to use is that it was not written for your case. By testing FNR==NR and then using the loop limits (i=FNR+1;i<=NR-1;i++), it expects multiple input files. For your case, you can actually simplify the script by removing the loop entirely, since you only have a single input file.
You need only save each row in the a[] array, using next to skip the computation for the first row, then compute and output the distance between the prior row and the current one, repeating until you run out of rows, e.g.
awk '{
a[NR]=$0
if (NR == 1)
next
split(a[NR-1],b)
printf "%s\t%s\n", b[1] "-" $1,
sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
}'
Example Input File
$ cat f
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
Example Use/Output
Simply paste the script into a terminal adding the filename at the end, e.g.
$ awk '{
> a[NR]=$0
> if (NR == 1)
> next
> split(a[NR-1],b)
> printf "%s\t%s\n", b[1] "-" $1,
> sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
> }' f
A-B 2.93428
B-C 1.79402
C-D 2.74184
D-E 2.21199
Look things over and let me know if you have further questions.
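Should you want the running totals from your question instead of the individual distances, the same approach needs only a running total (a sketch; sum is a variable I'm introducing, accumulated before printing):
awk '{
a[NR]=$0
if (NR == 1)
next
split(a[NR-1],b)
sum += sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
printf "%s\t%s\n", b[1] "-" $1, sum
}' f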

Try this:
awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);printf "%.9f\n",sum[FNR];}' file
With file content like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
will provide output like this:
2.934280150
4.728297987
7.470140434
9.682130488
You didn't provide point F, so the last line of your expected output can't be computed here.
The same script, split across several lines:
awk '
function d(a,b){
split(a,x);
split(b,y);
return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);
}
{p[FNR]=$0}
FNR>1{
sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);
printf "%.9f\n",sum[FNR];
}' file
It's quite straightforward: function d computes the distance between two points, and each line reuses the sum from the previous line.
And for fun: if you want to calculate the total distance over all pairs of points in a graph, starting with one point and gradually adding points to it, i.e.:
sum_distance(AB)
sum_distance(AB+BC+AC)
sum_distance(AB+BC+AC+AD+BD+CD)
...
Then just a little improvement will do, like this:
$ awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1];for(i=FNR-1;i>0;i--)sum[FNR]+=d(p[i],p[FNR]);printf "%.9f\n",sum[FNR];}' file
2.934280150
6.160254691
14.349070561
22.466306583
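As a quick sanity check (computed by hand), the second line is d(A,B) + d(B,C) + d(A,C) ≈ 2.934280 + 1.794018 + 1.431957 ≈ 6.160255, which matches the output above.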

Related

How can I filter by value on a multi-valued column using awk?

I would like to use awk to filter on a multi-valued column.
My data has two columns delimited by ;. The second column holds three float values separated by whitespace.
randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52
What I want to achieve is the following: I want to keep all rows where the first and second values of the second column are both greater than 200.
randUni18368:1;223.67 219.17 0.00
Update:
With help from the comments, I tried this and it worked:
awk -F ";" '{split($2, a, " "); if (a[1] > 200 && a[2] > 200) print}'
One awk idea:
awk -F';' '{ n=split($2,a,/[[:space:]]+/) # split 2nd field on spaces; place values in array a[]
if (a[1] > 200 && a[2] > 200) # if 1st and 2nd array entries > 200 then ...
print # print current line to stdout
}
' randum.dat
# or as a one-liner
awk -F';' '{ n=split($2,a,/[[:space:]]+/); if (a[1] > 200 && a[2] > 200) print}' randum.dat
# reduced further based on OP's comments/questions:
awk -F';' '{ split($2,a," "); if (a[1] > 200 && a[2] > 200) print}' randum.dat
This generates:
randUni18368:1;223.67 219.17 0.00
I would harness GNU AWK for this task in the following way. Let file.txt content be
randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52
then
awk '(gensub(/.*;/,"",1,$1)+0)>200 && $2>200' file.txt
gives output
randUni18368:1;223.67 219.17 0.00
Explanation: I use the gensub function on $1 so that anything up to and including ; is replaced by an empty string, i.e. removed, and the result is returned. Observe that gensub does not alter $1 in this process. Then I add zero to convert the result into a number, check whether it is more than 200, and check whether the 2nd field is bigger than 200.
(tested in gawk 4.2.1)
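If you can't rely on gawk, a portable sketch with POSIX awk's sub on a scratch copy of the field (v is just a temporary variable) gives the same result:
awk '{ v=$1; sub(/.*;/,"",v); if (v+0 > 200 && $2 > 200) print }' file.txt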

How do I print the line where the max value is found using awk?

I have a file called probabilities.txt and it's a two-column file with the first column listing distances and the second column listing probabilities.
The sample data is as follows:
0.2 0.05
0.4 0.10
0.6 0.63
0.8 0.11
1.0 0.03
... ...
10.0 0.01
I would like to print out the line that has the maximum value in column 2. I've tried the following:
awk 'BEGIN{a= 0} {if ($2 > a) a = $2} END{print $1, a}' probabilities.txt
This was the desired output:
0.6 0.63
But this is the output I get:
10.0 0.63
It seems like the code I wrote just finds the max value in column 2 and prints it next to the first field of the last line, rather than printing the line that contains the max value in column 2. Printing $0 also just prints the last line of the file.
I assume one could fix this by saving the whole line rather than a single field, but I'm not really sure how to do that since I'm a beginner. Would appreciate any help.
I had contemplated just leaving the answer as a comment, but given the trouble you had with the command, it's worth writing up. To begin, you don't need BEGIN. In awk, variables are initialized to zero (or the null string) when first used, so you can simply use a max variable and compare against it from the first record.
Note: If your data involves negative numbers (neither distances nor probabilities can be negative), just add a new first rule that seeds max from the first record, e.g. FNR==1 {max=$2; maxline=$0; next}
Next, don't save individual field values when you want to capture the entire line (record) with the largest probability; save the entire record associated with the max value. Then in your END rule, all you need to do is print that record.
Putting it all together, you would have:
awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' file
or, if you prefer:
awk '$2 > max {max=$2; maxline=$0} END {print maxline}' file
Example Use/Output
With your data in the file probabilities.txt you would get:
$ awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' probabilities.txt
0.6 0.63
and, second version same result:
$ awk '$2 > max {max=$2; maxline=$0} END {print maxline}' probabilities.txt
0.6 0.63
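For completeness, a sketch of the negative-safe variant from the note above, which seeds max and maxline from the first record; on this data it should print the same line:
$ awk 'FNR==1 {max=$2; maxline=$0; next} $2 > max {max=$2; maxline=$0} END {print maxline}' probabilities.txt
0.6 0.63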

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
I would like to print the value of Sigma for id. Note that id appears in more than one line of the file, so I used a condition based on NRM_V too.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following. I have added awk's exit statement, which exits as soon as the first occurrence of the condition is found; this also saves time, since awk no longer reads the whole Input_file.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with columns:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read the file from the last line towards the first and get the first value found, then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with columns comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile

Awk variable and summation

I would like to understand the difference in results between the following awk commands.
I have read that awk initializes numeric variables to zero by default, so I assumed that sum=0 would be implicit.
However, the first command below gives an incorrect result, while the second is correct.
Aim: Find the total number of lines in a file without using NR
financial.txt
14D 20190503 0.31 0.31 0.295 0.295 117949
14DO 20190503 0.00 0.00 0.00 0.07 0
1AD 20190503 0.18 0.19 0.18 0.19 54370
1AG 20190503 0.041 0.042 0.041 0.042 284890
1AL 20190503 0.00 0.00 0.00 0.88 0
1ST 20190503 0.05 0.05 0.049 0.049 223215
3DP 20190503 0.049 0.054 0.048 0.048 2056379
3PL 20190503 1.055 1.06 1.02 1.05 120685
4CE 20190503 0.00 0.00 0.00 0.009 0
4DS 20190503 0.072 0.076 0.072 0.075 2375896
$ awk 'BEGIN {sum+=1} END {print sum}' financial.txt
1
$ awk 'BEGIN {sum=0}{sum+=1} END {print sum}' financial.txt
5527
Thanks
After reviewing comments, I found the solution I was looking for without using BEGIN.
$ awk '{sum+=1}END{print sum}' financial.txt
5527
All awk variables are initialized to zero-or-null. If first used in a numeric context they become 0 at that point while if first used in a string context they become null at that point. Wrt your code samples, this:
BEGIN {sum+=1} END {print sum}
means:
BEGIN {sum+=1}
END {print sum}
while this:
BEGIN {sum=0}{sum+=1} END {print sum}
means:
BEGIN {sum=0}
<true> {sum+=1}
END {print sum}
See the difference? Add ;print sum before every } to trace how sum is being populated if it's not obvious what's happening.
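For example, adding the trace to the first command (a sketch) should print 1 twice, once from BEGIN and once from END, showing that the body ran only once:
$ awk 'BEGIN {sum+=1; print sum} END {print sum}' financial.txt
1
1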
From the GNU AWK Manual:
A BEGIN rule is executed once only, before the first input record is read. Likewise, an END rule is executed once only, after all the input is read.
Thus, the following executes the {sum+=1} statement only once.
awk 'BEGIN {sum+=1} END {print sum}' financial.txt
But in the 2nd case, {sum+=1} is executed for every line read from the file.
awk 'BEGIN {sum=0}{sum+=1} END {print sum}' financial.txt

Comparing two columns in two files

I would like to compare two columns in two files.
Here's an example. The first file, file1:
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
And the second file, file2:
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file1 file2
However, I want to print the common second column only if the first column is also the same. More specifically, if it finds 722603 in both files, it must also check that the first column equals 1 and only then print it. If a number in the second column is repeated, it is important that it gets printed more than once, with the different values of column 1.
I'd be very thankful if you could guide me through this, thank you.
Like this? I extended your code a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
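With the two sample files above, this should print the file2 lines whose first two columns also appear in file1:
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751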