comparing two columns in two files - awk

I would like to compare two columns in two files.
Here's an example. file1:
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
file2:
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file1 file2
However, I want to print the common second column only if the first column is also the same. More specifically, if it finds 722603 in both files, it must check that the first column is also equal to 1 and only then print it. If a number in the second column is repeated, it is important that it gets printed more than once, with the different values of column 1.
I'd be very thankful if you could guide me through this, thank you.

Like this? I extended your code a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
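With the sample blocks above saved as file1 and file2, this should print the file2 lines whose first two columns also appear in file1:
$ awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751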

How do I print the line where the max value is found using awk?

I have a file called probabilities.txt and it's a two-column file, with the first column listing distances and the second column listing probabilities.
The sample data is as follows:
0.2 0.05
0.4 0.10
0.6 0.63
0.8 0.11
1.0 0.03
... ...
10.0 0.01
I would like to print out the line that has the maximum value in column 2. I've tried the following:
awk 'BEGIN{a= 0} {if ($2 > a) a = $2} END{print $1, a}' probabilities.txt
This was the desired output:
0.6 0.63
But this is the output I get:
10.0 0.63
It seems like the code I wrote is just getting the max value in each column and then printing it out rather than printing out the line that has the max value in column 2. Printing out $0 also just prints out the last line of the file.
I assume one could fix this by treating the lines as an array rather than a scalar but I'm not really sure how to do that since I'm a beginner. Would appreciate any help
I had contemplated just leaving the answer as a comment, but given the trouble you had with the command it's worth writing up. To begin, you don't need BEGIN. In awk, uninitialized variables act as 0 (or the empty string) in comparisons, so you can simply start comparing against a max variable without initializing it.
Note: If your data involves negative numbers (neither distances nor probabilities can be negative, but just in case), add a new first rule that sets max and the saved line from the first record, e.g. FNR==1 {max=$2; maxline=$0; next}.
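A minimal sketch of that negative-safe variant, run against the same probabilities.txt, would look like:
$ awk 'FNR==1{max=$2; maxline=$0; next} $2>max{max=$2; maxline=$0} END{print maxline}' probabilities.txt
0.6 0.63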
Next, don't save individual field values when you want to capture the entire line (record) with the largest probability; save the whole record associated with the max value. Then in your END rule all you need to do is print that record.
Putting it all together you would have:
awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' file
or, if you prefer:
awk '$2 > max {max=$2; maxline=$0} END {print maxline}' file
Example Use/Output
With your data in the file distprobs.txt you would get:
$ awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' distprobs.txt
0.6 0.63
and, second version same result:
$ awk '$2 > max {max=$2; maxline=$0} END {print maxline}' distprobs.txt
0.6 0.63

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
And I would like to print the value of Sigma for id; note there are two id entries in the file, so I used a condition based on NRM_V too.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following. I have added awk's exit statement here, which exits the program as soon as the first match is found; this also saves time, since the whole Input_file is no longer read.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with columns:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read the file from the last line towards the first and get the first value found in that direction, then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with column comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile

AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words, the number of columns is consistent within a file, but different files have different numbers of columns. The first column is a key; the other columns are arbitrary values.
I need to filter out the rows where the values are not the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the 3rd:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of a known number of columns is simple; this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?
If you can guarantee that no field contains regex-active characters, the first field never matches the second, and there are no blank lines in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and is less extensible than the others.
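As a quick check with the tab-separated sample saved as file.txt (removing every occurrence of $2's value from a good row leaves only the key, so NF becomes 1):
$ awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file.txt
1 A A A
2 B B B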
Another slight twist on the approach. In your case you know you want to compare fields 2 through the last, so you can simply loop from i=3 to NF checking $i against $(i-1); if any pair differs, skip to the next record without printing, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B
Could you please try the following. It compares all columns from the 2nd to the last and checks whether every element is equal. If they are all the same, it prints the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR (hard-coding $2 in the comparison: since $2==$3 and $3==$4 implies $2==$3==$4, every field is intentionally compared against $2 rather than against its previous field $(i-1)):
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
I'd use a counter t with an initial value of 2 and add to it the number of times $i == $(i+1), where i iterates from 2 to NF-1. The line is printed only if t==NF:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
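For the same tab-separated sample this again keeps only the first two rows:
$ awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
1 A A A
2 B B B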
Here is a generalisation of the problem:
Select all lines where a given set of columns c1 c2 c3 c4 ... have the same value, where ci can be any column number.
Assume we want to check columns 2 3 4 11 15:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
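For instance, with the 4-column sample above and the column list reduced to "2 3 4" (the list 2 3 4 11 15 is just an illustration; adapt it to your data), this should keep the first two rows:
$ awk 'BEGIN{n=split("2 3 4",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file.txt
1 A A A
2 B B B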
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file

Calculate distances of continuous point

I need some help!
I have several points ABCDEF... with positions like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
...
I aim to calculate the distances AB, BC, CD, DE, EF, ... and their cumulative sum, with output of this form:
sum_distance(AB)
sum_distance(AB+BC)
sum_distance(AB+BC+CD)
sum_distance(AB+BC+CD+DE)
sum_distance(AB+BC+CD+DE+EF)
....
I found on the internet that awk can do this and tried to apply it to my case. However, no result or error appeared on screen. Could you please help me with this situation?
bash shell, awk
awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) {split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t" } NR--}'
Desired output:
2.934280150
4.728297987
7.470140434
9.682130488
11.92469598
......
You don't need such a complex script for this trivial task. Try this instead:
awk 'NR>1{ printf "%.9f\n",s+=sqrt(($2-x)^2+($3-y)^2+($4-z)^2) }
{ x=$2;y=$3;z=$4 }' file
For every point but the first (A), calculate the distance to the previous point, add it to the running sum s and print the sum. For every point, keep the coordinates in x, y and z for the next calculation. Its output looks like this with gawk:
2.934280150
4.728297987
7.470140434
9.682130488
What is the Cardinal Rule? (never use code off the internet you don't understand...)
The problem with the awk script you are attempting to use is that it was not written for exactly your case. By setting FNR==NR and then using the loop limits (i=FNR+1;i<=NR-1;i++) it expects multiple input files. For your case, you can actually simplify the script by removing the loop entirely, since you only have a single input file.
You need only save the first row, then (using next) read the following row, compute and output the distance between the prior row and the current one, store the current row in the a[] array, and repeat until you run out of rows, e.g.
awk '{
    a[NR]=$0
    if (NR == 1)
        next
    split(a[NR-1],b)
    printf "%s\t%s\n", b[1] "-" $1,
        sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
    a[NR]=$0
}'
Example Input File
$ cat f
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
Example Use/Output
Simply paste the script into a terminal adding the filename at the end, e.g.
$ awk '{
>     a[NR]=$0
>     if (NR == 1)
>         next
>     split(a[NR-1],b)
>     printf "%s\t%s\n", b[1] "-" $1,
>         sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2)
>     a[NR]=$0
> }' f
A-B 2.93428
B-C 1.79402
C-D 2.74184
D-E 2.21199
Look things over and let me know if you have further questions.
Try this:
awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);printf "%.9f\n",sum[FNR];}' file
With file content like this:
A 0.00 0.50 0.10
B 1.00 2.50 2.00
C 0.70 0.88 1.29
D 2.13 2.90 0.11
E 1.99 0.77 0.69
will provide output like this:
2.934280150
4.728297987
7.470140434
9.682130488
You didn't provide point F, so the last line of your desired output can't be computed here.
The same script, split into several lines:
awk '
function d(a,b){
    split(a,x);
    split(b,y);
    return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);
}
{p[FNR]=$0}
FNR>1{
    sum[FNR]=sum[FNR-1]+d(p[FNR-1],p[FNR]);
    printf "%.9f\n",sum[FNR];
}' file
It's quite straightforward: function d computes the distance between two points, and each line reuses the sum from the previous line.
And for fun: if you want to calculate the total distance of a complete graph, starting with one point and gradually adding points to the graph, i.e.:
sum_distance(AB)
sum_distance(AB+BC+AC)
sum_distance(AB+BC+AC+AD+BD+CD)
...
Then just a little improvement will do, like this:
$ awk 'function d(a,b){split(a,x);split(b,y);return sqrt((x[2]-y[2])^2 + (x[3]-y[3])^2 + (x[4]-y[4])^2);} {p[FNR]=$0} FNR>1{sum[FNR]=sum[FNR-1];for(i=FNR-1;i>0;i--)sum[FNR]+=d(p[i],p[FNR]);printf "%.9f\n",sum[FNR];}' file
2.934280150
6.160254691
14.349070561
22.466306583

Data partitioning by columns

I have this big matrix of 50 rows and 1.5M columns. Of these 1.5M columns, the first two are my headers.
I am trying to divide my data by columns into small pieces. So, for example, each small set will be 50 lines and 100 columns, but each small set must also have the first two columns mentioned above as its headers.
I tried
awk '{print $1"\t"$2"\t"}' test | cut -f 3-10
awk '{print $1"\t"$2"\t"}' test | cut -f 11-20
...
or
cut -f 1-2 | cut -f 3-10 test
cut -f 1-2 | cut -f 11-20 test
...
but none of the above is working.
Is there an efficient way of doing this?
One way with awk. I don't know if awk can handle such a large number of columns, but give it a try. It uses the modulus operator to break each line after a specific number of columns.
awk '{
    ## Print the two key columns at the start of the first chunk of this record.
    printf "%s%s%s%s", $1, FS, $2, FS
    ## Count number of data columns printed, from 0 to 100.
    count = 0
    ## Traverse every column but the first two keys.
    for ( i = 3; i <= NF; i++ ) {
        ## Print the key columns again after every 100 data columns.
        if ( count != 0 && count % 100 == 0 ) {
            printf "%s%s%s%s%s", ORS, $1, FS, $2, FS
        }
        ## Print current column and count it.
        printf "%s%s", $i, FS
        ++count
    }
    ## Separator between splits.
    print ORS
}
' infile
I've tested it with two lines and 4 columns instead of 100. Here is the test file:
key1 key2 one two three four five six seven eight nine ten
key1 key2 one2 two2 three2 four2 five2 six2 seven2 eight2 nine2 ten2
And results in:
key1 key2 one two three four
key1 key2 five six seven eight
key1 key2 nine ten
key1 key2 one2 two2 three2 four2
key1 key2 five2 six2 seven2 eight2
key1 key2 nine2 ten2