Match the column of one file against another file - awk

I'm having some trouble with awk. I have two files and am trying to match a column of the 2nd file against the first file, pulling out all matching lines.
file1:
1
2
3
4
5
file2:
apples peaches 3
apples peaches 9
oranges pears 7
apricots figs 1
expected output:
apples peaches 3
apricots figs 1
awk -F"|" '
FNR==NR {f1[$1];next}
($3 in f1)
' file1 file2 > output.txt

It's not clear (to me) what the format of file2 is (eg, is that a space or a tab between fields?), or if a line in file2 could have more than 3 whitespace-delimited strings (eg, apples black raspberries 6), so picking a delimiter for file2 would require more details. Having said that ...
there are no pipes ('|') in the sample files, so the current code (using -F"|") is going to lump the entire line into awk variable $1
we can make this a bit easier by recognizing that we're only interested in the last field from file2
Adding an entry to file2:
$ cat file2
apples peaches 3
apples peaches 9
oranges pears 7
apricots figs 1
apples black raspberries 2
A couple small changes to the current awk code:
awk 'FNR==NR {f1[$1]; next} $(NF) in f1' file1 file2
This generates:
apples peaches 3
apricots figs 1
apples black raspberries 2
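Putting it together against the original sample files (a quick check; the printf commands just recreate the inputs shown above):

```shell
# Recreate the sample inputs
printf '1\n2\n3\n4\n5\n' > file1
printf 'apples peaches 3\napples peaches 9\noranges pears 7\napricots figs 1\n' > file2

# Keep file2 lines whose last field appears in column 1 of file1
awk 'FNR==NR {f1[$1]; next} $NF in f1' file1 file2
# apples peaches 3
# apricots figs 1
```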

This is more of a side note; I suggest using awk, as explained by markp-fuso.
You can use the join command:
join -11 -23 <(sort -k1,1n file1) <(sort -k3,3n file2)
The example above is using join with the help of the shell and the sort command:
Command explanation:
join
-11 # Join based on column 1 of file 1 ...
-23 # and column 3 in file 2
<(sort -k1,1n file1) # sort file 1 based on column 1
<(sort -k3,3n file2) # sort file 2 based on column 3
The <() constructs are so called process substitutions, provided by the shell where you run the command in. The output of the command in parentheses will be treated like a file, and can be used as a parameter for our join command. We don't need to create an intermediate, sorted file.
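For comparison, here is a sketch of the same pipeline without process substitution, spelled out with temporary files (the .sorted names are arbitrary):

```shell
# Sort each input into a temporary file, join, then clean up
sort -k1,1n file1 > file1.sorted
sort -k3,3n file2 > file2.sorted
join -11 -23 file1.sorted file2.sorted
rm file1.sorted file2.sorted
```

Note that join prints the join field first, so each output line here starts with the matched number followed by the remaining fields of file2.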

Related

To print unmatched values when compared from files using awk/diff/sed/grep

I was trying to print the non-matching strings from two files. My first file has strings with integer values separated by spaces; my second file has string values that match some of the strings from the first file, but without any integers in front of them.
Using the awk and diff commands below, I was trying to get the unmatched data of my first file when compared with the second one.
Using awk, the result is the contents of the first file; basically it prints the contents of the last argument passed to the awk command.
awk -F, 'FNR==NR {a[$1];next} !($0 in a)' f2 f1
Using diff, the result is the contents of the second file; here it prints the contents of the first argument passed.
diff --changed-group-format='%<' --unchanged-group-format='' f2 f1
f1
papaya 10
apple 23
Moosumbi 44
mango 32
jackfruit 15
kiwi 60
orange 11
strawberry 67
banana 99
grapes 21
dates 6
f2
apple
mango
kiwi
strawberry
expected result
papaya 10
Moosumbi 44
jackfruit 15
orange 11
banana 99
grapes 21
dates 6
This is a very common thing to do in awk: read the values from f2 into an array, then while processing f1, only print the lines where the first field does not exist in the f2 array:
awk 'NR == FNR {f2[$1]; next} !($1 in f2)' f2 f1
I was also going to suggest the grep command from @shellter, but he beat me to it (you should add the -w option though, to match whole words).
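For reference, the grep variant alluded to might look like this (a sketch, assuming the f1/f2 files shown above): -f reads the patterns from f2, -F treats them as fixed strings, -w matches whole words only, and -v inverts the match.

```shell
# Drop any f1 line containing a whole word listed in f2
grep -vwFf f2 f1
```

Unlike the awk solution, grep matches anywhere in the line rather than only in the first field; for these inputs that makes no difference.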

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers. The goal is to pull out those rows of file1 which fall within the range denoted by any row of file2.
I have a nested-loop solution that works, but it takes >12 hrs to run depending on the lengths of both files. This code is noted below. Alternatively, I have tried to use awk, and looked at other questions posted on Stack Overflow, but I cannot figure out how to adapt the code appropriately.
Loop method:
while IFS= read READ
do
    position=$(echo $READ | awk '{print $4}')
    while IFS= read BED
    do
        St=$(echo $BED | awk '{print $2}')
        En=$(echo $BED | awk '{print $3}')
        if (($position < "$St"))
        then
            break
        else
            if (($position >= "$St" && $position <= "$En"))
            then
                echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
            fi
        fi
    done < file2
done < file1
Blogs with similar questions:
awk: filter a file with another file
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
Find content of one file from another file in UNIX
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
file1: (tab delimited)
AAA BBB 1500
CCC DDD 2500
EEE FFF 2000
file2: (tab delimited)
GGG 1250 1750
HHH 1950 2300
III 2600 2700
Expected output would retain rows 1 and 3 from file1 (in a new file, file3) because these records fall within the ranges given by columns 2 and 3 of rows 1 and 2 of file2. In the actual files, the match is not row-restricted, i.e. I do not want to compare row1 of file1 only against row1 of file2, but against all rows of file2 to get the hit.
file3 (output)
AAA BBB 1500
EEE FFF 2000
One way:
awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000
Read the file2 contents and store them in arrays a and b. When file1 is read, check whether its number falls between a[j] and b[j] for any j, and print if so.
One more option:
$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)' file2 file1
AAA BBB 1500
EEE FFF 2000
File2 is read and each range is expanded into its individual numbers, which are stored as keys of the associative array a. When we read file1, we just need to look up array a.
Another awk. It may or may not make sense depending on the filesizes:
$ awk '
NR==FNR {
    a[$3]=$2                        # hash file2 records, $3 is key, $2 value
    next
}
{
    for(i in a)                     # for each record in file1 go through every element in a
        if($3<=i+0 && $3>=a[i]+0) { # if it falls between (+0 forces numeric comparison)
            print                   # output
            break                   # exit loop once match found
        }
}' file2 file1
Output:
AAA BBB 1500
EEE FFF 2000

awk to compare multiple columns in 2 files

I would like to compare multiple columns from 2 files and NOT print lines matching my criteria.
An example of this would be:
file1
apple green 4
orange red 5
apple yellow 6
apple yellow 8
grape green 5
file2
apple yellow 7
grape green 10
output
apple green 4
orange red 5
apple yellow 8
I want to remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 AND when $3 from file1 is smaller than $3 from file2.
I can now only do the first part of the job, that is remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 (fields are separated by tabs):
awk -F '\t' 'FNR == NR {a[$1FS$2]=$1; next} !($1FS$2 in a)' file2 file1
Could you help me apply the last condition?
Many thanks in advance!
What you are after is this:
awk '(NR==FNR){a[$1,$2]=$3; next} !((($1,$2) in a) && $3 < a[$1,$2])' file2 file1
Store 3rd field value while building the array and then use it for comparison
$ awk -F '\t' 'FNR==NR{a[$1FS$2]=$3; next} !(($1FS$2 in a) && $3 < a[$1FS$2])' f2 f1
apple green 4
orange red 5
apple yellow 8
Better written as:
awk -F '\t' '{k = $1FS$2} FNR==NR{a[k]=$3; next} !((k in a) && $3 < a[k])' f2 f1
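Note the direction of the comparison: the requirement is to drop a line when file1's $3 is smaller than the stored file2 value, so the test inside the negation must be $3 < a[...]. A quick check against the sample files (tabs between fields):

```shell
# Recreate the tab-delimited sample inputs
printf 'apple\tgreen\t4\norange\tred\t5\napple\tyellow\t6\napple\tyellow\t8\ngrape\tgreen\t5\n' > file1
printf 'apple\tyellow\t7\ngrape\tgreen\t10\n' > file2

# Suppress file1 lines whose ($1,$2) key exists in file2 with a larger $3
awk -F '\t' 'FNR==NR{a[$1FS$2]=$3; next} !(($1FS$2 in a) && $3 < a[$1FS$2])' file2 file1
# apple	green	4
# orange	red	5
# apple	yellow	8
```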

Concatenate files based off unique titles in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column includes the unique set of all labels across all files, and the remaining columns include the number for each label from each file. If a label did not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0;next}
{
print (($1 in a)?a[$1] FS $2: $1 FS "0" FS $2)
delete a[$1]
}
END{
for (x in a) print a[x],"0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
You read file1 into an array indexed by column 1 and assign the entire line as its value.
For file2, check if column 1 is present in the array. If it is, print the value from file1 along with the value from file2. If it is not present, print 0 as the value for file1.
Delete the array element as we go along to get only what was unique in file1.
In the END block print what was unique in file1 and print 0 for file2.
Pipe the output to column -t for pretty format.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed and FNR is the number of records that we have processed in the current file. Consequently, the condition FNR==NR is true only when we are processing in the first file. In this case, the associative array a gets all the values from the first file and associative array b gets placeholder, i.e. zero, values. When we process the second file, its values go in array b and we make sure that array a at least has a placeholder value of zero. When we are done with the second file, the data is printed.
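A minimal demonstration of NR versus FNR (the file names here are just examples):

```shell
# NR counts lines across all inputs; FNR resets at the start of each file
printf 'a\nb\n' > x.txt
printf 'c\nd\n' > y.txt
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' x.txt y.txt
# x.txt NR=1 FNR=1
# x.txt NR=2 FNR=2
# y.txt NR=3 FNR=1
# y.txt NR=4 FNR=2
```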
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV                     2
ABC                        100
WEGWET            12        12
KET               45   14   45
AGS                3    5   17
FEW               56    3   56
This code creates a file counter. We know that we are in a new file every time FNR is 1, and a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array. The first dimension of a is the label and the second is the number of the file that we encountered it in. In the end, we just loop over all the labels and all the files, from 1 to n, and print the data. Note that labels missing from a file print as blanks here; the non-GNU variant below converts them to 0.
More than 2 files without GNU Awk
Without requiring GNU's awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56

Common values in 2 columns in 2 files

Suppose I have these 2 tab-delimited files, where the second column of the first file contains values matching the first column of the second file.
FileA:
1 A
2 B
3 C
FileB:
A Apple
C Cinnabon
B Banana
I would like an output like this:
1 Apple
2 Banana
3 Cinnabon
I can write a script for this, but I would like to know how to do it in awk or perl in one line.
awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$1]=$2;next}{$2=a[$2]}1' f2 f1
GNU sed oneliner
sed -r 's:\s*(\S+)\s+(\S+):/\\s*\\S\\+\\s\\+\1/s/\\(\\s*\\S\\+\\s\\+\\)\1/\\1\2/:' fileB | sed -f - fileA
Output:
1 Apple
2 Banana
3 Cinnabon
The command you want is this:
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{$2=a[$2]; print}' file2 file1
1 Apple
2 Banana
3 Cinnabon