I have a file called probabilities.txt; it's a two-column file with the first column listing distances and the second column probabilities.
The sample data is as follows:
0.2 0.05
0.4 0.10
0.6 0.63
0.8 0.11
1.0 0.03
... ...
10.0 0.01
I would like to print out the line that has the maximum value in column 2. I've tried the following:
awk 'BEGIN{a= 0} {if ($2 > a) a = $2} END{print $1, a}' probabilities.txt
This was the desired output:
0.6 0.63
But this is the output I get:
10.0 0.63
It seems like the code I wrote is just getting the max value in each column and then printing it out rather than printing out the line that has the max value in column 2. Printing out $0 also just prints out the last line of the file.
I assume one could fix this by treating the lines as an array rather than a scalar, but I'm not really sure how to do that since I'm a beginner. I would appreciate any help.
I had contemplated just leaving the answer as a comment, but given the trouble you had with the command it's worth writing up. To begin, you don't need BEGIN. In awk, uninitialized variables are treated as 0 (or the empty string) the first time they are used, so you can simply compare against a max variable without setting it first.
Note: If your data could involve negative numbers (neither distances nor probabilities can be negative), just add a new first rule that sets max from the first record, e.g. FNR==1 {max=$2; next}.
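For instance, that first rule in context might look like this (maxline holds the whole record, as explained next):

awk 'FNR==1 {max=$2; maxline=$0; next} $2 > max {max=$2; maxline=$0} END {print maxline}' probabilities.txt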
Next, don't save individual field values when what you want to capture is the entire line (record) with the largest probability; save the entire record associated with the max value. Then in your END rule all you need to do is print that record.
Putting it all together you would have:
awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' file
or, if you prefer:
awk '$2 > max {max=$2; maxline=$0} END {print maxline}' file
Example Use/Output
With your data in the file distprobs.txt you would get:
$ awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' distprobs.txt
0.6 0.63
and the second version gives the same result:
$ awk '$2 > max {max=$2; maxline=$0} END {print maxline}' distprobs.txt
0.6 0.63
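As an aside, if several lines could tie for the maximum and you want all of them printed, a sketch of one possible variant (my addition, not something your data requires) collects the tied records:

awk 'FNR==1 || $2 > max {max=$2; maxlines=$0; next}
     $2 == max {maxlines = maxlines ORS $0}
     END {print maxlines}' probabilities.txt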
I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
This does part of what I want: it prints the unique values and counts how many times each unique value occurs. Now I also want to print the 2nd and 3rd columns for each unique value. For some reason, neither of the following seems to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints the 2nd and 3rd columns of the file's last record, whereas the second prints nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. But if you are going to store the count keyed on field 1 and also have fields 2 and 3 available in the END rule for output, you need to store fields 2 and 3 in arrays indexed by field 1 as well (or by whatever field you are counting). For example, you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Where h[$1] holds the count of the number of times field 1 is seen, using field 1 itself as the array index. i[$1]=$2 captures field 2 indexed by field 1, and j[$1]=$3 captures field 3 indexed by field 1. (Keeping only the last-seen values is fine here, because fields 2 and 3 are constant for any given field 1 in your data.)
Then within END all that is needed is to output field 1 (a, the index into h), i[a] (field 2), j[a] (field 3), and finally h[a], the count of the number of times field 1 was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the desired output (note that for (a in h) visits array indices in an unspecified order, so the rows may not appear in input order). If you prefer to key on all three fields at once, you can use string concatenation to join fields 1, 2 & 3 into a single array index and then print the index followed by the count, e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files, and it makes this trivial to do.
Assuming your file uses tabs to separate columns, as it appears to (the -s flag sorts the input first, which datamash's grouping requires):
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multi-dimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of an unspecified order for this one: when using GNU awk, add PROCINFO["sorted_in"] = "@ind_str_asc" in the BEGIN block; otherwise pipe the output through sort.
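For instance, a sketch of that GNU-awk-only variant (assuming gawk 4.0 or later, where PROCINFO["sorted_in"] is available):

$ awk 'BEGIN { OFS=SUBSEP=","; PROCINFO["sorted_in"] = "@ind_str_asc" }
  { group[$1,$2,$3]++ }
  END { for (g in group) print g, group[g] }' input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2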
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
I need some help!
I have several points ABCDEF... with positions like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
...
I aim to calculate the distances AB, BC, CD, DE, ... and a running sum of them, with output of this form:
sum_distance(AB)
sum_distance(AB+BC)
sum_distance(AB+BC+CD)
sum_distance(AB+BC+CD+DE)
sum_distance(AB+BC+CD+DE+EF)
....
I found on the internet that awk can do this and tried to apply it to my case. However, no result or error appears on screen. Could you please help me with this situation?
bash shell, awk
awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) {split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t" } NR--}'
Desired output:
2.934280150
4.728297987
7.470140434
9.682130488
11.92469598
......
You don't need such a complex script for this trivial task. Try this instead:
awk 'NR>1{ printf "%.9f\n",s+=sqrt(($2-x)^2+($3-y)^2+($4-z)^2) }
{ x=$2;y=$3;z=$4 }' file
For every point but the first (A), calculate the distance from the previous point, add it to the sum s, and print the sum. For every point, keep the coordinates in x, y and z for the next calculation. Its output looks like this with gawk:
2.934280150
4.728297987
7.470140434
9.682130488
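As a quick check, the first value is just the A-to-B distance: sqrt((1.00-0.00)^2 + (2.50-0.50)^2 + (2.00-0.10)^2) = sqrt(1 + 4 + 3.61) = sqrt(8.61) ≈ 2.934280150.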
What is the Cardinal Rule? (never use code off the internet you don't understand...)
The problem with the awk script you are attempting to use is that it is not written for your case. Setting FNR==NR and using the loop limits (i=FNR+1;i<=NR-1;i++) means it expects multiple input files. With a single file, FNR==NR is true for every record, so next fires every time and the second rule never runs, which is why you see no output at all. For your case you can simplify the script considerably by removing the loop entirely, since you only have a single input file.
You need only save the first row, then use next to read the following row; from then on, compute and print the distance between the prior row and the current one, save the current row in the a[] array, and repeat until you run out of rows, e.g.
awk '{
    a[NR]=$0
    if (NR == 1)
        next
    split(a[NR-1],b)
    printf "%s\t%s\n", b[1] "-" $1,
        sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
}'
Example Input File
$ cat f
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
Example Use/Output
Simply paste the script into a terminal adding the filename at the end, e.g.
$ awk '{
>     a[NR]=$0
>     if (NR == 1)
>         next
>     split(a[NR-1],b)
>     printf "%s\t%s\n", b[1] "-" $1,
>         sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
> }' f
A-B 2.93428
B-C 1.79402
C-D 2.74184
D-E 2.21199
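If you also want the running totals from your desired output rather than the individual segment lengths, a minimal variation (accumulating into a sum variable s, which is my addition here) would be:

awk '{
    a[NR]=$0
    if (NR == 1)
        next
    split(a[NR-1],b)
    # add this segment's length to the running sum s
    s += sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
    printf "%.9f\n", s
}' f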
Look things over and let me know if you have further questions.
Try this:
awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);printf "%.9f\n",sum[FNR];}' file
With file content like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
will provide output like this:
2.934280150
4.728297987
7.470140434
9.682130488
You didn't provide point F, so the last line of your expected output can't be computed here.
Here is the same script split across several lines:
awk '
function d(a,b){
split(a,x);
split(b,y);
return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);
}
{p[FNR]=$0}
FNR>1{
sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);
printf "%.9f\n",sum[FNR];
}' file
It's quite straightforward: the function d computes the distance between two points, and each line reuses the sum from the former line.
And for fun: if you want to calculate the total edge length of a graph built by starting with one point and gradually adding points (each new point connecting to all existing ones), i.e.:
sum_distance(AB)
sum_distance(AB+BC+AC)
sum_distance(AB+BC+AC+AD+BD+CD)
...
Then just a little improvement will do, like this:
$ awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1];for(i=FNR-1;i>0;i--)sum[FNR]+=d(p[i],p[FNR]);printf "%.9f\n",sum[FNR];}' file
2.934280150
6.160254691
14.349070561
22.466306583
I would like to subset a file while keeping the separator in the subsetted output, using awk in bash.
Here's what I am using:
The input file is created in R language with:
inp <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
inp <- read.table(text=inp, header=F)
write.table(inp, "inp.txt", col.names=F, row.names=F, quote=F, sep="\t")
(So fields are separated by tabs)
The code in bash:
awk {'print $1 $2 $3'} inp.txt
The result:
AX-11125
AX-22456
AX-333445
Please note that my columns were merged in the awk output (and I would like it to be tab-delimited like the input file). It is probably a simple syntax problem, but I would be grateful for any ideas.
Use
awk -v OFS='\t' '{ print $1, $2, $3 }'
or
awk '{ print $1 "\t" $2 "\t" $3 }'
Written one after another with no operator between them, expressions in awk are concatenated - $1 $2 $3 is no different from $1$2$3 in this respect.
The first solution sets the output field separator OFS to a tab, then uses the comma operator to print separated fields. The second solution simply sprinkles tabs in there directly, and everything is concatenated as it was before.
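A quick demonstration of the difference, using printf to fabricate a single tab-separated line:

$ printf 'a\tb\tc\n' | awk '{ print $1 $2 $3 }'
abc
$ printf 'a\tb\tc\n' | awk -v OFS='\t' '{ print $1, $2, $3 }'
a       b       c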
I would like to compare two columns in two files.
Here's an example. The first file (file1):
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
And the second file (file2):
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file1 file2
However, I want to print the common second column only if the first column is also the same. More specifically, if it finds 722603 in both files, it must check that the first column is also equal to 1, and only then print it. If the number in the second column is repeated, it is important that it gets printed more than once, with the different values of column 1.
I'd be very thankful if you could guide me through this, thank you.
Like this? I extended your code a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
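With your two sample files saved as file1 and file2, this should print the file2 lines whose first two fields also appear in file1:

$ awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751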