how to get the common rows according to the first column in awk - awk

I have two ',' separated files as follow:
file1:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
file2:
A,inf
B,inf
C,0.313559
D,189.5
E,38.6735
I want to compare the 2 files and get the common rows based on the 1st column. So, for the files above, the output would look like this:
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
I am trying to do that in awk and tried this:
awk ' NR == FNR {val[$1]=$2; next} $1 in val {print $1, val[$1], $2}' file1 file2
This code returns these results:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
which is not what I want. Do you know how I can improve it?

$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2}' file1 file2
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
Explained:
$ awk '
BEGIN { FS=OFS="," }    # set separators
NR==FNR {               # first file
    a[$1]=$0            # hash record to a, $1 as index
    next                # move on to next record
}
$1 in a {               # second file: if $1 is in a
    print a[$1],$2      # print stored record from a with $2
}' file1 file2

Your awk code basically works; you are just not telling awk to use , as the field separator. You can fix that by adding BEGIN{FS=OFS=","} to the beginning of the script.
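For completeness, here is the question's own script with only that BEGIN block added, run against the sample files from the question (a minimal sketch; the file names match the examples above):

```shell
# Recreate the sample files from the question
printf 'A,inf\nB,inf\nC,0.135802\nD,72.6111\nE,42.1613\n' > file1
printf 'A,inf\nB,inf\nC,0.313559\nD,189.5\nE,38.6735\n' > file2

# The original script, with only the missing separator settings added
out=$(awk 'BEGIN{FS=OFS=","} NR==FNR{val[$1]=$2; next} $1 in val{print $1, val[$1], $2}' file1 file2)
echo "$out"
# first line: A,inf,inf   last line: E,42.1613,38.6735
```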
But given that the files are sorted as in the examples in your question, you can simply use the join command:
join -t, file1 file2
This will join the files based on the first column. -t, tells join that columns are separated by commas.
If the files are not sorted, you can sort them on the fly like this:
join -t, <(sort file1) <(sort file2)

Related

Substitute patterns using a correspondence file

I am trying to replace some words in a file with others using sed or awk.
My initial fileA has this format:
>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843
I have a second fileB with the correspondences like this:
NP_001006347.1 GeneA
XP_003643123.1 GeneB
I am trying to substitute the name in FileA to get this output:
>Genus_species|GeneA|transcript-60_2900.p1:1-843
I was thinking of using awk or sed to do something like
's/$patternA/$patternB/' inside a while read loop, but how do I indicate which patterns 1 and 2 are in fileB? I also tried this, but it is not working:
sed "$(sed 's/^\([^ ]*\) \(.*\)$/s#\1#\2#g/' fileB)" fileA
Awk may be able to do the job more easily?
Thanks
It is easier to do this in awk:
awk -v OFS='|' 'NR == FNR {
    map[$1] = $2
    next
}
{
    for (i=1; i<=NF; ++i)
        if ($i in map) $i = map[$i]
} 1' file2 FS='|' file1
>Genus_species|GeneA|transcript-60_2900.p1:1-843
Written and tested with your shown samples. Considering that you have only one entry for NP_digits.digits in your Input_fileA, you could try the following too.
awk '
FNR==NR{
    arr[$1]=$2
    next
}
match($0,/NP_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr){
    $0=substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH)
}
1
' Input_fileB Input_fileA
Using awk
awk -F [\|" "] 'NR==FNR { arr[$1]=$2;next } NR!=FNR { OFS="|";$2=arr[$2] }1' fileB fileA
Set the field delimiter to space or |. Process fileB first (NR==FNR), creating an array called arr with the first space-delimited field as the index and the second as the value. Then, for the second file (NR != FNR), look up the second field in the arr array; if there is an entry, replace the second field with the value from the array, and print the lines with the shorthand 1.
You are looking for the join command which can be used like this:
join -11 -22 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1) <(sort -t'|' -k2 fileA)
This performs a join on column 1 of fileB with column 2 of fileA. The tr was used so that fileB also uses | as the delimiter, because join requires the delimiter to be the same in both files.
Note that the output columns are not in the order you specified. You can swap by piping the output into awk.
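For example, restoring fileA's original column order could look like this; a sketch using temporary sorted copies instead of process substitution, with the single sample line from the question:

```shell
# Sample lines from the question
printf '>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843\n' > fileA
printf 'NP_001006347.1 GeneA\n' > fileB

# join needs a common delimiter and inputs sorted on the join fields
tr ' ' '|' < fileB | sort -t'|' -k1,1 > fileB.sorted
sort -t'|' -k2,2 fileA > fileA.sorted

# join on fileB col 1 / fileA col 2, then restore fileA's layout with awk
out=$(join -1 1 -2 2 -t'|' fileB.sorted fileA.sorted |
      awk -F'|' -v OFS='|' '{print $3, $2, $4}')
echo "$out"
# → >Genus_species|GeneA|transcript-60_2900.p1:1-843
```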

Awk command has unexpected results when comparing two files

I am using an awk command to compare the first column in two files.
I want to take col1 of file1 and if there is a match in col1 of file2, update the "date updated" in the last column. If there is no match, I want to append the entire line of file1 to file2 and append a "date updated" value to that line as well. Here is the command I'm currently using:
awk 'FNR == NR { f1[$1] = $0; next }
$1 in f1 { print; delete f1[$1] }
END { for (user in f1) print f1[user] }' file1 file2
File1:
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW
mschrode,161.1,plasma-de+,serv01,datetimeNEW
File2:
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
deleteme,192.0,random,serv01,datetimeOLD #<---- Needs to be removed: WORKS!
Expected Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
mschrode,161.1,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
Current Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
mschrode,161.1,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
The Logic Broken Down:
If $usr/col1 in file2 does NOT exist in file1
remove entire line from file2
(ex: line3 in file2, user: deleteme)
If $usr/col1 in file1 does NOT exist in file2
append entire line to file2
(ex: lines 1-5 in file1)
So the issue is, when there IS a match between the two files, I need to keep the information from file2, not the information from file1. In the output examples above you'll see I need to keep the datetimeOLD from file2 along with the new information from file1.
Set field separator to comma, and read file2 first:
$ awk -F',' 'FNR==NR{a[$1]=$0;next} $1 in a{print a[$1];next} 1' file2 file1
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
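Note this keeps file2's entire line on a match (e.g. jbomba,114.0). If you instead want file1's updated numbers combined with file2's old timestamp, as in the expected output above, a sketch along the same lines (only the last field is taken from file2):

```shell
# Sample data from the question (comments stripped)
cat > file1 <<'EOF'
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW
mschrode,161.1,plasma-de+,serv01,datetimeNEW
EOF
cat > file2 <<'EOF'
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
deleteme,192.0,random,serv01,datetimeOLD
EOF

# Remember file2's last field per user, then print file1 with that field
# swapped in on a match; "deleteme" never appears since only file1 is printed
out=$(awk -F, -v OFS=, 'FNR==NR{date[$1]=$NF; next} $1 in date{$NF=date[$1]} 1' file2 file1)
echo "$out"
```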

AWK: 2 columns 2 files display where second column has unique data

I need to print the lines where the first column of the second file does not match the first column of the first file, while the second column does match; I want to do this with awk on Linux.
I want awk to detect the change with both the first and second column of the first file with the second file.
file1.txt
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
file2.txt
sdsdks ./file.txt <-- different
sdsksd ./example.txt <-- different
jsdjsk ./number.txt <-- same
dfkdfa ./ok.txt <-- different
Expected output:
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Notice how in the second file there may be lines missing and not the same as the first.
As seen above, how can awk display results only where the second column is unique and does not match the first column?
Something like this might work for you:
awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt
Breakdown:
FNR == NR { } # Run on first file as FNR is record number for the file, while NR is the global record number
f[FNR"_"$2] = $1; # Store first column under the name of FNR followed by an underbar followed by the second column
next # read next record and redo
f[FNR"_"$2] && f[FNR"_"$2] != $1 # If the first column doesn't match while the second does, then print the line
A simpler approach which will ignore the second column is:
awk 'FNR == NR { f[FNR"_"$1] = 1; next }
!f[FNR"_"$1]' file1.txt file2.txt
If the records don't have to be in the respective position in the files ie. we compare matching second column strings, this should be enough:
$ awk '{if($2 in a){if($1!=a[$2])print $2}else a[$2]=$1}' file1 file2
Output:
file.txt
In pretty print:
$ awk '{
    if($2 in a) {        # if $2 has been seen
        if($1!=a[$2])    # and $1 doesn'\''t match
            print $2     # output
    } else               # else
        a[$2]=$1         # store
}' file1 file2
Updated:
$ awk '{if($2 in a){if($1!=a[$2])print $1,$2}else a[$2]=$1}' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Basically changed the print $2 to print $1,$2.
The way your question is worded is very confusing but after reading it several times and looking at your posted expected output I THINK you're just trying to say you want the lines from file2 that don't appear in file1. If so that's just:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If your real data has more fields than shown in your sample input but you only want the first 2 fields considered for the comparison then fix your question to show a more truly representative example but the solution would be:
$ awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
if that's not it then please edit your question to clarify what it is you're trying to do and include an example where the above doesn't produce the expected output.
I understand the original problem in the following way:
Two files, file1 and file2 contain a set of key-value pairs.
The key is the filename, the value is the string in the first column
If a matching key is found between file1 and file2 but the value is different, print the matching line of file2
You do not really need advanced awk for this task, it can easily be achieved with a simple pipeline of awk and grep.
$ awk '{print $NF}' file2.txt | grep -wFf - file1.txt | grep -vwFf - file2.txt
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Here, the first grep will select the lines from file1.txt which do have the same key (filename). The second grep will try to search the full matching lines from file1 in file2, but it will print the failures. Be aware that in this case, the lines need to be completely identical.
If you just want to use awk, then the above logic is achieved with the solution presented by Ed Morton. No need to repeat it here.
I think this is what you're looking for
$ awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Print the records from file2 where the field1 value differs for the same field2 value. This script assumes that field2 values are unique within each file, so that they can be used as keys. Since the content looks like file paths, this is a valid assumption. Otherwise, you need to match the records by their corresponding line numbers.
In case you are looking for a more straightforward line-based diff based on the first field of a line being different:
awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1 file2
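Run against the sample files, this position-based variant reports every line whose first field changed (a quick sketch to verify):

```shell
# The sample files from the question
cat > file1.txt <<'EOF'
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
EOF
cat > file2.txt <<'EOF'
sdsdks ./file.txt
sdsksd ./example.txt
jsdjsk ./number.txt
dfkdfa ./ok.txt
EOF

# Compare first fields line by line; print file2 lines where they differ
out=$(awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1.txt file2.txt)
echo "$out"
# → the three "different" lines; jsdjsk ./number.txt is suppressed
```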

awk returning whitespace matches when comparing columns in csv

I am trying to do a file comparison in awk, but it seems to be returning all the lines instead of just the lines that match, due to whitespace matching.
awk -F "," 'NR==FNR{a[$2];next}$6 in a{print $6}' file1.csv fil2.csv
How do I instruct awk not to match the whitespaces?
I get something like the following:
cccs
dert
ssss
assak
this should do
$ awk -F, 'NR==FNR && $2 {a[$2]; next}
$6 in a {print $6}' file1 file2
If your data file includes spaces and numerical fields, as commented below, it is better to change the check from $2 to $2!="" && $2!~/[[:space:]]+/.
Consider cases like $2=<space>foo<space><space>bar in file1 vs $6=foo<space>bar<space> in file2.
Here's how to robustly compare $6 in file2 against $2 of file1 ignoring whitespace differences, and only printing lines that do not have empty or all-whitespace key fields:
awk -F, '
{
key = (NR==FNR ? $2 : $6)
gsub(/[[:space:]]+/," ",key)
gsub(/^ | $/,"",key)
}
key=="" { next }
NR==FNR { file1[key]; next }
key in file1
' file1 file2
If you want to make the comparison case-insensitive then add key=tolower(key) before the first gsub(). If you want to make it independent of punctuation add gsub(/[[:punct:]]/,"",key) before the first gsub(). And so on...
The above is untested of course since no testable sample input/output was provided.
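For instance, the case-insensitive variant mentioned above would look like this. The two sample rows are invented here, since the question provided none: a pair whose keys differ only in case and whitespace.

```shell
# Invented sample rows: $2 of file1.csv vs $6 of file2.csv
printf '1, FOO  Bar ,x\n' > file1.csv
printf 'a,b,c,d,e,foo bar\n' > file2.csv

out=$(awk -F, '
{
    key = (NR==FNR ? $2 : $6)
    key = tolower(key)             # the suggested case-insensitive tweak
    gsub(/[[:space:]]+/," ",key)   # squeeze whitespace runs
    gsub(/^ | $/,"",key)           # trim leading/trailing blanks
}
key=="" { next }
NR==FNR { file1[key]; next }
key in file1
' file1.csv file2.csv)
echo "$out"
# → a,b,c,d,e,foo bar  (both keys normalize to "foo bar")
```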

Awk merging of two files on id

I would like to match the IDs of the first file to the IDs of the second file, so I get, for example, Thijs Al,NED19800616,39. I know this should be possible with AWK, but I'm not really good at it.
file1 (few entries)
NED19800616,Thijs Al
BEL19951212,Nicolas Cleppe
BEL19950419,Ben Boes
FRA19900221,Arnaud Jouffroy
...
file2 (many entries)
38,FRA19920611
39,NED19800616
40,BEL19931210
41,NED19751211
...
Don't use awk, use join. First make sure the input files are sorted:
sort -t, -k1,1 file1 > file1.sorted
sort -t, -k2,2 file2 > file2.sorted
join -t, -1 1 -2 2 file[12].sorted
With awk you can do
$ awk -F, 'NR==FNR{a[$2]=$1;next}{print $2, $1, a[$1] }' OFS=, file2 file1
Thijs Al,NED19800616,39
Nicolas Cleppe,BEL19951212,
Ben Boes,BEL19950419,
Arnaud Jouffroy,FRA19900221,
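The trailing empty fields appear because unmatched IDs look up an empty a[$1]. If you only want rows present in both files (as in the Thijs Al example), test for membership first; a sketch with a cut-down version of the sample data:

```shell
# Abbreviated sample data from the question
printf 'NED19800616,Thijs Al\nBEL19951212,Nicolas Cleppe\n' > file1
printf '38,FRA19920611\n39,NED19800616\n' > file2

# Only print file1 rows whose ID was seen in file2
out=$(awk -F, -v OFS=, 'NR==FNR{a[$2]=$1; next} $1 in a{print $2, $1, a[$1]}' file2 file1)
echo "$out"
# → Thijs Al,NED19800616,39
```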