How to compare pairs of columns in awk? - awk

I have the following dataset from a pairwise analysis (the first row is just the sample IDs):
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
I wish to compare the values of fields 1 and 2, then fields 3 and 4, and print the row number (NR) every time I see a 1-and-2 combination for the pair I am examining.
For example for pairs A and B, I would want the output:
A B 2
For pairs A and C, I would want the output:
A C 3
I would want to proceed row by row so I would likely need the code to include:
for i in {1..3}; do
awk 'NR=="'${i}'" {code}'
done
But I have no idea how to proceed in a pairwise fashion (i.e. compare field 1 with field 2, then field 3 with field 4, etc.).
How can I do this?

It's hard to say with such a minimal example but this MAY be what you want:
$ cat tst.awk
FNR==1 {
    for (i=1; i<=NF; i++) {
        name[i] = $i
    }
    next
}
{
    for (i=1; i<NF; i+=2) {
        if ( ($i == 1) && ($(i+1) == 2) ) {
            print name[i], name[i+1], NR-1
        }
    }
}
$ awk -f tst.awk file
A B 2
A C 3

You certainly should only run the script once; there's no need to run awk more than once. It isn't yet entirely clear how you want multiple matches printed; however, if you're working a line at a time, then the output probably comes a line at a time.
Working on that basis, then:
awk 'NR == 1 {
         for (i = 1; i < NF; i += 2) {
             cols[(i+1)/2,1] = $i
             cols[(i+1)/2,2] = $(i+1)
         }
         next
     }
     {
         for (i = 1; i < NF; i += 2) {
             if ($i == 1 && $(i+1) == 2)
                 print cols[(i+1)/2,1], cols[(i+1)/2,2], NR - 1
         }
     }'
The NR == 1 block of code captures the headings so they can be used in the main printing code; there are plenty of other ways to store the information too. The other block looks at the data lines, checks whether each pair of fields contains 1 2, and prints the control data if there is a match. Because NF will be an even number but the loops count over the odd numbers, the < comparison is OK. Often in awk you use for (i = 1; i <= NF; i++) with a single increment, and then <= is required for correct behaviour.
For your minimal data set, this produces:
A B 2
A C 3
For this larger data set:
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
1 2 4 2
5 3 1 9
7 0 3 2
1 2 1 0
9 0 1 2
1 2 3 2
the code produces:
A B 2
A C 3
A B 4
A B 7
A C 8
A B 9

Related

Awk - Conditionally print an element from a certain row, based on the condition of a different element in a different row

Say I have a lot of files with a consistent number of columns and rows, and a sample one looks like this:
1 2 3
4 5 6
7 8 9
I want to print column 3 of row 2, but only if column 3 of row 3 == 4 (in this case it is 9). I'm using this logic as a means to determine whether the file is valid for my use-case, and to extract the relevant field if it is.
My attempt, based on other answers to people asking how to isolate certain rows was this: awk 'BEGIN{FNR=3} $3=="4"{FNR=2;print $2}'
So you are looking for something like this?
awk 'FNR==2{ x = $3 }FNR==3 && $3=="4"{ print x }' file.txt
cat file.txt
1 2 3
4 5 6
7 8 4
Output:
6
cat file.txt
1 2 3
4 5 6
7 8 9
Output:
Nothing, since column 3 of row 3 is 9.
awk 'FNR==3 && $3==4{print p} {p=$3}' *
This stores each line's $3 in p, so when line 3 passes the test, p still holds $3 from line 2.
Here's another, which doesn't care about the order in which the records appear. In the OP the problem was to print a value (v) from the 2nd record based on a tested value (t) on the 3rd record. This solution allows the test value to appear in an earlier record than the value to be printed:
$ awk '
FNR==2 {         # record holding the value to print
    v=$3
    f=1          # flag: the value v has been read
}
FNR==3 {         # record holding the value to test
    t=$3
    g=1          # flag: the test value t has been read
}
f && g {         # once both the value and the test value are acquired
    if(t==4)     # apply the test
        print v  # output
    exit         # and exit
}' file
6
Record order reversed (FNR values changed in the code):
$ cat file2
1 2 3
7 8 4 # records
4 5 6 # reversed
$ awk 'FNR==3{v=$3;f=1}FNR==2{t=$3;g=1}f&&g{if(t==4)print v;exit}' file2
6
Flags f and g are used instead of testing v and t directly, in case either value should be empty ("").
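To see why the flags matter, here is a small sketch (demo.txt is a made-up file name): line 2 has only two fields, so v is set to "". Testing v directly could not tell "empty value" apart from "record not seen yet", but the f flag can:

```shell
# demo.txt: record 2 has no 3rd field, so v becomes the empty string.
printf '1 2 3\n4 5\n7 8 4\n' > demo.txt
awk 'FNR==2{v=$3;f=1} FNR==3{t=$3;g=1} f&&g{if(t==4)print "[" v "]"; exit}' demo.txt
# -> []  (the empty value is still printed, because f recorded that record 2 was read)
```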

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to do another awk step to delete lines which joined with themselves, i.e. A 1 A 1; B 2 B 2; etc.
The second issue is that it prints both directions of the join:
A 1 C 1 is printed along with C 1 A 1 on another line.
Both these lines display the same relation and I would NOT like to include both; I want to see just one or the other, i.e. "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
Alternative solution using awk with join and sort support:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
      awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product, then eliminate the diagonal and one of each pair of symmetrical lines.
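To see what the awk filter removes, here is the raw output of the join step alone on the sample input (a sketch using plain temporary files instead of process substitution, for portability; the file names are made up):

```shell
# Build the sorted input once, then join it with itself on field 2.
printf 'A 1\nA 2\nA 3\nB 2\nB 3\nC 1\nC 2\nC 3\n' > file
sort -k2 -k1,1 file > sorted
join -j 2 sorted sorted | head -4
# 1 A A   <- diagonal (A joined with itself)
# 1 A C
# 1 C A   <- symmetric duplicate of "1 A C"
# 1 C C   <- diagonal
```

The `$2!=$3` test then drops the diagonal lines, and the `!($3 FS $2 in a)` lookup drops the mirror image of any pair already seen.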
Here is an awk-only solution:
awk 'NR>1{ar[$2]=(ar[$2]$1);}\
END{ for(key in ar){\
for(i=1; i<length(ar[key]); i++) {\
for(j=i+1; j<length(ar[key])+1; j++) {\
print substr(ar[key],i,1), key, substr(ar[key],j,1), key;\
}\
}\
}}' infile
Each number in the second column of the input serves as a key of an awk-array. The value of the corresponding array-element is a sequence of first-column letters (e.g., array[1]=ABC).
Then we build all two-letter combinations for each sequence (e.g., "ABC" gives "AB", "AC" and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of output depends on the order of input (no sorting of letters!). That is, if the second input line were C 1, then array[1]="CAB" and the first output line would be C 1 A 1.
First line of input is ignored due to NR>1
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{ if ($1==lastsel) printf " "; else if (lastsel) printf "\n"
       lastsel=$1; printf "%s %s", $2, $1 }
     END { if (lastsel) printf "\n" }'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)

How to extract data at a specific location from a file containing a grid of data points

I have a file containing a 3D grid (x, y, time), with a property "v" at each grid point. I want to extract the time profile of "v" at a particular x, y point, or more specifically, at the x, y point closest to my desired location (it is unlikely that the desired location will exactly fall on a grid point). Is there an easy awk script for this when the file is in either ASCII or binary format?
Example of file format
X Y Time V
1 1 0 2
1 1 10 3
1 1 20 4
1 2 0 3
1 2 10 8
1 2 20 11
1 3 0 3
Example of desired output if location of interest is x=0.9, y=2.1
1 2 0 3
1 2 10 8
1 2 20 11
$ cat tst.awk
function abs(val) { return (val < 0 ? -val : val) }
BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }   # queue the input file to be read a second time
NR==FNR {                                     # first pass: compute distances and track the minimum
    if (NR>1) {
        dist[NR] = abs(x - $1) + abs(y - $2)
        min = (NR==2 || dist[NR]<min ? dist[NR] : min)
    }
    next
}
FNR==1 || dist[FNR] == min
$ awk -v x=0.9 -v y=2.1 -f tst.awk file
X Y Time V
1 2 0 3
1 2 10 8
1 2 20 11
Just check that the algorithm used to calculate dist[] (Manhattan distance here) is what you need, and tweak it to suit otherwise.
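For example, to rank points by squared Euclidean distance instead of the Manhattan distance (abs(dx) + abs(dy)) used above, only the dist[] line needs to change. A sketch, assuming the same file layout:

```shell
awk -v x=0.9 -v y=2.1 '
function sq(v) { return v*v }
BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }   # queue the input file a second time
NR==FNR {                                     # first pass: track the minimum distance
    if (NR>1) {
        dist[NR] = sq(x - $1) + sq(y - $2)
        min = (NR==2 || dist[NR]<min ? dist[NR] : min)
    }
    next
}
FNR==1 || dist[FNR] == min                    # second pass: header plus closest-point rows
' file
```

On the sample data both metrics pick the same grid point (1, 2), so the output is unchanged.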

awk histogram in buckets

Consider I have a following file..
1 a
1 b
1 a
1 c
1 a
2 a
2 d
2 a
2 d
I want to have a histogram within a bucket... for example if bucket is 1 then the output will be
a 3
b 1
c 1
a 2
d 2
for bucket 2... we have
a 5
b 1
c 1
d 2
I want to do it with awk and I'm literally stuck...
Here is my code:
awk '
{A[$1]} count [$2]++
{for(i in A) {print i,A[i]}
}' test
Any help?
Thanks,
Amir.
Edit: adding a size_of_bucket variable.
awk -v "size_of_bucket=2" '
{
    bucket = int(($1-1)/size_of_bucket);
    A[bucket "," $2]++;
}
END {
    for (i in A) {
        print i, A[i];
    }
}
'
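Running that on the sample input (written to a scratch file named test here) shows the keys come out as bucket,letter pairs; since the order of a for (i in A) loop is unspecified, the output is piped through sort:

```shell
printf '1 a\n1 b\n1 a\n1 c\n1 a\n2 a\n2 d\n2 a\n2 d\n' > test
awk -v "size_of_bucket=2" '
{
    bucket = int(($1-1)/size_of_bucket)
    A[bucket "," $2]++
}
END {
    for (i in A) print i, A[i]
}
' test | sort
# 0,a 5
# 0,b 1
# 0,c 1
# 0,d 2
```

With a bucket size of 2, keys 1 and 2 both fall into bucket 0, which matches the combined counts the question lists for bucket 2.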

Selecting first nth rows by groups using AWK

I have the following file with 4 fields. There are 3 groups in field 2, and the 4th field consists 0's and 1's.
The first field is just the index.
I like to use AWK to do the following task
Select the first 3 rows of group 1 (note that group 1 has only 2 rows).
Select the first 6 rows of group 2.
Select the first 9 rows of group 3.
In each case, the number of rows to select is the number of 1's found in the group's 4th field, times 3.
So 17 rows are selected for the output file.
Thank you for your help.
Input
1 1 TN1148 1
2 1 S52689 0
3 2 TA2081 1
4 2 TA2592 1
5 2 TA4011 0
6 2 TA4246 0
7 2 TA4275 0
8 2 TB0159 0
9 2 TB0392 0
10 3 TB0454 1
11 3 TB0496 1
12 3 TB1181 1
13 3 TC0027 0
14 3 TC1340 0
15 3 TC2247 0
16 3 TC3094 0
17 3 TD0106 0
18 3 TD1146 0
19 3 TD1796 0
20 3 TD3587 0
Output
1 1 TN1148 1
2 1 S52689 0
3 2 TA2081 1
4 2 TA2592 1
5 2 TA4011 0
6 2 TA4246 0
7 2 TA4275 0
8 2 TB0159 0
10 3 TB0454 1
11 3 TB0496 1
12 3 TB1181 1
13 3 TC0027 0
14 3 TC1340 0
15 3 TC2247 0
16 3 TC3094 0
17 3 TD0106 0
18 3 TD1146 0
The key to this awk program is to pass the input file twice: once to count how many rows you want and once to print them.
awk '
NR == FNR {wanted_rows[$2] += 3*$4; next}
--wanted_rows[$2] >= 0 {print}
' input_file.txt input_file.txt
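On the sample input above this selects 2 + 6 + 9 = 17 rows; a quick sanity check (writing the sample to a scratch file first; tr strips the padding some wc implementations add):

```shell
cat > input_file.txt <<'EOF'
1 1 TN1148 1
2 1 S52689 0
3 2 TA2081 1
4 2 TA2592 1
5 2 TA4011 0
6 2 TA4246 0
7 2 TA4275 0
8 2 TB0159 0
9 2 TB0392 0
10 3 TB0454 1
11 3 TB0496 1
12 3 TB1181 1
13 3 TC0027 0
14 3 TC1340 0
15 3 TC2247 0
16 3 TC3094 0
17 3 TD0106 0
18 3 TD1146 0
19 3 TD1796 0
20 3 TD3587 0
EOF
awk '
NR == FNR {wanted_rows[$2] += 3*$4; next}
--wanted_rows[$2] >= 0 {print}
' input_file.txt input_file.txt | wc -l | tr -d ' '
# -> 17
```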
#!/usr/bin/awk -f
# by Dennis Williamson - 2010-12-02
# for http://stackoverflow.com/questions/4334167/selecting-first-nth-rows-by-groups-using-awk
$2 == prev {
    count += $4
    groupcount++
    array[idx++] = $0
}
$2 != prev {
    if (NR > 1) {
        for (i = 0; i < count * 3; i++) {
            if (i == groupcount) break
            print array[i]
        }
    }
    prev = $2
    count = $4        # was "count = 1", which miscounts when a group's first row has 0 in field 4
    groupcount = 1
    split("", array)  # delete the array
    idx = 0
    array[idx++] = $0
}
END {
    for (i = 0; i < count * 3; i++) {
        if (i == groupcount) break
        print array[i]
    }
}