awk to compare multiple columns in 2 files - awk

I would like to compare multiple columns from 2 files and NOT print lines matching my criteria.
An example of this would be:
file1
apple green 4
orange red 5
apple yellow 6
apple yellow 8
grape green 5
file2
apple yellow 7
grape green 10
output
apple green 4
orange red 5
apple yellow 8
I want to remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 AND when $3 from file1 is smaller than $3 from file2.
I can now only do the first part of the job, that is remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 (fields are separated by tabs):
awk -F '\t' 'FNR == NR {a[$1FS$2]=$1; next} !($1FS$2 in a)' file2 file1
Could you help me apply the last condition?
Many thanks in advance!

What you are after is this:
awk '(NR==FNR){a[$1,$2]=$3; next}!(($1,$2) in a) && a[$1,$2] < $3))' <file2> <file1>

Store 3rd field value while building the array and then use it for comparison
$ awk -F '\t' 'FNR==NR{a[$1FS$2]=$3; next} !(($1FS$2 in a) && $3 > a[$1FS$2])' f2 f1
apple green 4
orange red 5
apple yellow 6
grape green 5
Better written as:
awk -F '\t' '{k = $1FS$2} FNR==NR{a[k]=$3; next} !((k in a) && $3 > a[k])' f2 f1

Related

awk Can not Select Column with empty value

i am trying to select a column with its missing value
here is my input file separated by tab
1 2 3
4 5
6
7 8
9
i am trying to select the first column in which output will look like
1
4
7
and the length of my column would be 5 in this case
I have tried
awk '$1!=""{print $1}' ./demo.txt
but it returns
1
4
6
7
9
can anybody help with this I am new in AWK
You can use cut:
$ cut -f 1 file # the default delimiter is a tab
Or with sed:
$ sed 's/[[:blank:]].*$//' file
Or awk:
$ awk '{sub(/[[:blank:]].*$/,"")}1' file
Or:
$ awk 'BEGIN{FS=OFS="\t"} {print $1}' file
All those print the first column and all five lines (blank or not)
Prints:
1
4
7
Tell awk to use a tab (\t) as the input field delimiter (-F):
$ awk -F'\t' '{ print $1 }' demo.txt
1
4
7
If you want to print multiple columns, maintaining the same delimiter for output, another approach using the FS and OFS variables:
$ awk 'BEGIN { FS=OFS="\t" } { print $1,$3 }' demo.txt
1 3
4 5
7
9
With sed something like:
sed 's/^\([^[:blank:]]*\).*/\1/' demo.txt
Using FIELDWIDTHS in gnu-awk you can do this for fixed width separated data:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print $1}' file
1
4
7
For demo purpose:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print NR ":", $1}' file
1: 1
2: 4
3:
4: 7
5:
if they're all single digits in 1st column :
echo \
'1 2 3
4 5
6
7 8
9' |
mawk NF=1 FS=
gcat -n
1 1
2 4
3
4 7
5
that's literally all you need. To play it safe, then do
nawk NF=1 FS='[[:space:]]' # overly-verbose so-called
# "proper" posix form
gawk NF=1 FS='[ \t]' # suffices unless the input
# happens to have uncommon bytes
# like \013 \v or \014 \f
or a very fringe way of fudging NF :
mawk 'NF ^= FS="[ \t]"'

lookup value associate with list items

I am trying to approximate vlookup from excel.
I have two files.
file #1 - list.txt:
green
purple
orange
file #2 - reads.txt:
blue 2
green 3
red 5
purple 6
I am trying to read in the entries of list.txt, then go to reads.txt and pull out the associated value.
the desired output would be:
green 3
purple 6
orange 0
if I write:
awk -F ' ' 'FNR == NR {keys[$1]; next} {if ($1 in keys) print $1,$2}' list.txt reads.txt
I get:
green 3
purple 6
but nothing for orange, and I need the line:
orange 0
If I write
awk -F ' ' 'FNR == NR {keys[$1]; next} {if ($1 in keys) print $1,$2; else print $1,0}' list.txt reads.txt
I get:
blue 0
green 3
red 0
purple 6
any ideas how to fix this?
major newbie here, so any help appreciated!
$ awk 'NR==FNR{map[$1]=$2; next} {print $1, map[$1]+0}' reads.txt list.txt
green 3
purple 6
orange 0
1st solution(with as per shown samples): Could you please try following, written and tested with shown samples.
awk '
FNR==NR{
arr[$1]=$2
next
}
($0 in arr){
print $0,arr[$0]
next
}
{
print $0,0
}
' reads.txt list.txt
Output will be as follows, for shown samples.
green 3
purple 6
orange 0
2nd solution(Generic solution): In case your Input_file named reads.txt has multiple values of same first column and you want to print all values which are present in list.txt then please try following.
awk '
FNR==NR{
++arr[$1]
val[$1 OFS arr[$1]]=$2
next
}
($0 in arr){
for(i=1;i<=arr[$0];i++){
print $0,val[$0 OFS i]
}
next
}
{
print $0,0
}
' reads.txt list.txt
Sample of run: Let's say your sample of reads.txt is this:
cat reads.txt
blue 2
green 3
red 5
purple 6
green 15
green 120
Now after running this Generic solution we will get following.
green 3
green 15
green 120
purple 6
orange 0
If order doesn't matter, I'd use a left join instead of awk:
$ join -a1 -e 0 -o 0,2.2 <(sort list1.txt) <(sort reads.txt)
green 3
orange 0
purple 6
(Assumes a shell like bash that supports <() command redirection syntax).

Join two columns from different files with awk

I want to join two columns from two different files using awk. These files look like (A, B, C, 0, 1, 2, etc are columns)
file1:
A B C D E F
fil2:
0 1 2 3 4 5
And I want to be able to select arbitrary columns on my ouput, something of the form:
Ie, I want the output to be:
A C E 4 5
I've seen a million answers with the following awk code (and very similar ones), offering no explanation. But none of them address the exact problem I want to solve:
awk 'FNR==NR{a[FNR]=$2;next};{$NF=a[FNR]};1' file2 file1
awk '
NR==FNR {A[$1,$3,$6] = $0; next}
($1 SUBSEP $2 SUBSEP $3) in A {print A[$1,$2,$3], $4}
' A.txt B.txt
But none of them seem to do what I want and I am not able to understand them.
So, how can I achieve the desired output using awk? (and please, offer an explanation, I want to actually learn)
Note:
I know I can do this using something like
paste <(awk '{print $1}' file1) <(awk '{print $2}' file2)
As I said, I'm trying to learn and understand awk.
With GNU awk for true multi-dimensional arrays and ARGIND:
$ awk -v flds='1 1 1 3 1 5 2 5 2 6' '
BEGIN{ nf = split(flds,o) }
{ f[ARGIND][1]; split($0,f[ARGIND]) }
NR!=FNR { for (i=2; i<=nf; i+=2) printf "%s%s", f[o[i-1]][o[i]], (i<nf?OFS:ORS) }
' file1 file2
A C E 4 5
The "flds" string is just a series of <file number> <field number in that file> pairs so you can print the fields from each file in whatever order you like, e.g.:
$ awk -v flds='1 1 2 2 1 3 2 4 1 5 2 6' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
A 1 C 3 E 5
$ awk -v flds='2 1 1 2 2 3 1 4 2 5' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
0 B 2 D 4

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close but I suppose it's better to check..
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk. i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0} file2 file1 but that saves only one value of $1 and outputs that value every line. I also tried adding $1 into a[$3], which is obviously wrong thus giving zero output.
Ideally I'd like to get this output:
blue 145 0.119
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88

What are NR and FNR and what does "NR==FNR" imply?

I am learning file comparison using awk.
I found syntax like below,
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
I couldn't understand what is the significance of NR==FNR in this?
If I try with FNR==NR then also I get the same output?
What exactly does it do?
In Awk:
FNR refers to the record number (typically the line number) in the current file.
NR refers to the total record number.
The operator == is a comparison operator, which returns true when the two surrounding operands are equal.
This means that the condition NR==FNR is normally only true for the first file, as FNR resets back to 1 for the first line of each file but NR keeps on increasing.
This pattern is typically used to perform actions on only the first file. It works assuming that the first file is not empty, otherwise the two variables would continue to be equal while Awk was processing the second file.
The next inside the block means any further commands are skipped, so they are only run on files other than the first.
The condition FNR==NR compares the same two operands as NR==FNR, so it behaves in the same way.
Look for keys (first word of line) in file2 that are also in file1.
Step 1: fill array a with the first words of file 1:
awk '{a[$1];}' file1
Step 2: Fill array a and ignore file 2 in the same command. For this check the total number of records until now with the number of the current input file.
awk 'NR==FNR{a[$1]}' file1 file2
Step 3: Ignore actions that might come after } when parsing file 1
awk 'NR==FNR{a[$1];next}' file1 file2
Step 4: print key of file2 when found in the array a
awk 'NR==FNR{a[$1];next} $1 in a{print $1}' file1 file2
Look up NR and FNR in the awk manual and then ask yourself what is the condition under which NR==FNR in the following example:
$ cat file1
a
b
c
$ cat file2
d
e
$ awk '{print FILENAME, NR, FNR, $0}' file1 file2
file1 1 1 a
file1 2 2 b
file1 3 3 c
file2 4 1 d
file2 5 2 e
There are awk built-in variables.
NR - It gives the total number of records processed.
FNR - It gives the total number of records for each input file.
Assuming you have Files a.txt and b.txt with
cat a.txt
a
b
c
d
1
3
5
cat b.txt
a
1
2
6
7
Keep in mind
NR and FNR are awk built-in variables.
NR - Gives the total number of records processed. (in this case both in a.txt and b.txt)
FNR - Gives the total number of records for each input file (records in either a.txt or b.txt)
awk 'NR==FNR{a[$0];}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
a.txt 1 1 a
a.txt 2 2 b
a.txt 3 3 c
a.txt 4 4 d
a.txt 5 5 1
a.txt 6 6 3
a.txt 7 7 5
b.txt 8 1 a
b.txt 9 2 1
lets Add "next" to skip the first matched with NR==FNR
in b.txt and in a.txt
awk 'NR==FNR{a[$0];next}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 8 1 a
b.txt 9 2 1
in b.txt but not in a.txt
awk 'NR==FNR{a[$0];next}{if(!($0 in a))print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 10 3 2
b.txt 11 4 6
b.txt 12 5 7
awk 'NR==FNR{a[$0];next}!($0 in a)' a.txt b.txt
2
6
7
Here is the pseudo code for your interest.
NR = 1
for (i=1; i<=files.length; ++i) {
line = read line from files[i]
FNR = 1
while (not EOF) {
columns = getColumns(line)
if (NR is equals to FNR) { // processing first file
add columns[1] to a
} else { // processing remaining files
if (columns[1] exists in a) {
print columns[1]
}
}
NR = NR + 1
FNR = FNR + 1
line = read line from files[i]
}
}