unix - compare columns of two files - awk

I have two files. The first file is a master list of IDs. The second file is a normal input file.
I'm trying to print only the records of the input file whose ID (column 3) is NOT in the master list (column 1).
sample:
masterlist.txt
111
222
333
input.txt
col1,col2,col3,col4,col5,col6
abc,abc,111,xyz,xyz,xyz
abc,abc,222,xyz,xyz,xyz
abc,abc,333,xyz,xyz,xyz
abc,abc,444,xyz,xyz,xyz
desired output:
col1,col2,col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
I have come up with this code so far but I'm not getting the correct output.
awk -F\| '!b{a[$0]; next}$3 in a {true; next} {print $3","$4","$11","$12}' masterlist.txt b=1 input.txt

Could you please try the following awk and let me know if it helps you.
awk 'FNR==NR{a[$1];next} !($3 in a)' masterlist.txt FS="," input.txt
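For reference, here is the same one-liner spelled out with comments (a sketch; file names as in the question). Note that the FS="," assignment sits between the two file names, so the comma separator only takes effect once input.txt is read:
awk '
FNR==NR { a[$1]; next }   # first file: remember every master-list ID as an array key
!($3 in a)                # second file: print lines whose 3rd field is not a known ID
' masterlist.txt FS="," input.txt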

Countif like function in AWK with field headers

I am looking for a way of counting the number of times a value in a field appears across the rows of a CSV file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 holds the range of values, and column 7 should have the number of times each value appears in column 6, as per below:
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the field name "count" that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous/similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
This is a very similar question to this one, and there is a similar Python-related question, kept here for my reference.
I would harness GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the 1st line. Then, during the first pass, i.e. where the global line number (NR) equals the line number within the current file (FNR), I count the number of occurrences of the values in the 6th field, storing them as values in the array arr, and instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for every line after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it's the sixth field that has the key you're counting, and start the END loop at i=1 so that it skips the empty header slot:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
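For readability, here is that second command pretty-printed with comments (a sketch, equivalent to the one-liner above):
awk -F, -v OFS=, '
NR==1 { print $0, "count" }    # header: print it with the new column name appended
NR>1  { c[$6]++; l[NR]=$0 }    # count the 6th-field values and buffer the data lines
END {
    for (i=1; i++<NR; ) {      # i runs 2..NR, skipping the empty header slot
        split(l[i], s, ",")    # recover the fields of the buffered line
        print l[i], c[s[6]]    # append the final count for its 6th field
    }
}' file3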
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

AWK: 2 columns 2 files display where second column has unique data

I need the first column to check if it doesn't match the first column on the second file. Though, if the second column matches the second column on the second file, to display this data with awk on Linux.
I want awk to detect the change with both the first and second column of the first file with the second file.
file1.txt
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
file2.txt
sdsdks ./file.txt <-- different
sdsksd ./example.txt <-- different
jsdjsk ./number.txt <-- same
dfkdfa ./ok.txt <-- different
Expected output:
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Notice how in the second file there may be lines that are missing or not the same as in the first.
As seen above, how can awk display results only where the second column is unique and does not match the first column?
Something like this might work for you:
awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt
Breakdown:
FNR == NR { } # Run on first file as FNR is record number for the file, while NR is the global record number
f[FNR"_"$2] = $1; # Store first column under the name of FNR followed by an underbar followed by the second column
next # read next record and redo
f[FNR"_"$2] && f[FNR"_"$2] != $1 # If the first column doesn't match while the second does, then print the line
A simpler approach which will ignore the second column is:
awk 'FNR == NR { f[FNR"_"$1] = 1; next }
!f[FNR"_"$1]' file1.txt file2.txt
If the records don't have to be at the same position in the files, i.e. we compare entries with matching second-column strings, this should be enough:
$ awk '{if($2 in a){if($1!=a[$2])print $2}else a[$2]=$1}' file1 file2
Output:
file.txt
In pretty print:
$ awk '{
if($2 in a) { # if $2 was seen before
if($1!=a[$2]) # and $1 doesn't match
print $2 # output
} else # else
a[$2]=$1 # store
}' file1 file2
Updated:
$ awk '{if($2 in a){if($1!=a[$2])print $1,$2}else a[$2]=$1}' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Basically changed the print $2 to print $1,$2.
The way your question is worded is very confusing but after reading it several times and looking at your posted expected output I THINK you're just trying to say you want the lines from file2 that don't appear in file1. If so that's just:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If your real data has more fields than shown in your sample input but you only want the first 2 fields considered for the comparison then fix your question to show a more truly representative example but the solution would be:
$ awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If that's not it, then please edit your question to clarify what it is you're trying to do and include an example where the above doesn't produce the expected output.
I understand the original problem in the following way:
Two files, file1 and file2, contain a set of key-value pairs.
The key is the filename; the value is the string in the first column.
If a matching key is found between file1 and file2 but the value is different, print the matching line of file2
You do not really need advanced awk for this task; it can easily be achieved with a simple pipeline of awk and grep.
$ awk '{print $NF}' file2.txt | grep -wFf - file1.txt | grep -vwFf - file2.txt
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Here, the first grep selects the lines from file1.txt which have the same key (filename). The second grep searches file2.txt for those full file1 lines, but prints the lines that fail to match. Be aware that in this case the lines need to be completely identical.
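The same pipeline broken into separate steps may be easier to follow (a sketch; keys.txt and common.txt are illustrative temporary file names, not part of the original answer):
awk '{print $NF}' file2.txt > keys.txt       # 1. extract the keys (filenames) present in file2
grep -wFf keys.txt file1.txt > common.txt    # 2. keep the lines of file1 whose key also appears in file2
grep -vwFf common.txt file2.txt              # 3. print the lines of file2 matching none of those file1 lines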
If you just want to use awk, then the above logic is achieved with the solution presented by Ed Morton. No need to repeat it here.
I think this is what you're looking for:
$ awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
This prints the records from file2 where the field1 value differs for the same field2 value. The script assumes that field2 values are unique within each file, so that they can be used as keys. Since the content looks like file paths, this is a valid assumption. Otherwise, you would need to match the records by their corresponding line numbers.
In case you are looking for a more straightforward line-based diff, based on the first field on a line being different:
awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1 file2
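Spelled out with comments (a sketch):
awk '
NR==FNR { a[NR] = $1; next }   # first file: record field 1, keyed by line number
a[FNR] != $1                   # second file: print lines whose field 1 differs at the same line number
' file1 file2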

Use Awk to get data from two files

I have two different files with two columns each.
file1.txt
DevId Group
aaa A
bbb B
file2.txt
Group RefId
A 111-222-333
B 444-555-666
I only need DevId and its corresponding RefId.
Required Output
DevId RefId
aaa 111-222-333
bbb 444-555-666
I tried using this syntax but I can't get it to work correctly.
awk -F, -v OFS=, 'NR==FNR{a[$1]=$2;next}{print a[$2],$1}' file2.txt file1.txt
I hope someone can help me.
Here:
awk -v RS="\r\n" 'FNR==NR{a[$1]=$2;next}{ print $1, a[$2]}' file2.txt file1.txt
This was modified from "Awk multiple files", which I suggest you read for the explanation.
Edit: as mentioned by @JamesBrown, added -v RS="\r\n" to handle Windows line endings.
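For reference, the same command spelled out with comments (a sketch; a multi-character RS relies on the awk treating RS as a regex, as GNU awk does):
awk -v RS="\r\n" '
FNR==NR { a[$1]=$2; next }   # first file (file2.txt): map Group -> RefId
{ print $1, a[$2] }          # second file (file1.txt): print each DevId with the RefId for its Group
' file2.txt file1.txt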

summarizing the contents of a text file into another one using awk

I have a big text file with 2 tab-separated fields. As you see in the small example, every 2 lines have a number in common. I want to summarize my text file in this way: look for the lines that have the number in common and sum up the second column of those lines.
small example:
ENST00000054666.6 2
ENST00000054666.6_2 15
ENST00000054668.5 4
ENST00000054668.5_2 10
ENST00000054950.3 0
ENST00000054950.3_2 4
expected output:
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
As you see, the difference is in both columns: in the 1st column there is only one copy of each common number, without the "_2" suffix, and in the 2nd column the value is the sum of both lines (which have the number in common in the input file).
I tried this code but it does not return what I want:
awk -F '\t' '{ col2 = $2, $2=col2; print }' OFS='\t' input.txt > output.txt
Do you know how to fix it?
Solution 1st: the following awk may help you with the same.
awk '{sub(/_.*/,"",$1)} {a[$1]+=$NF} END{for(i in a){print i,a[i]}}' Input_file
Solution 2nd: in case your Input_file is sorted by the 1st field, then the following may help you.
awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(val){print prev,val}}' Input_file
Append > output.txt to the above commands in case you need the output in an output file too.
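For reference, Solution 1st spelled out with comments (a sketch):
awk '
{ sub(/_.*/, "", $1) }                # strip the underscore and everything after it from field 1
{ a[$1] += $NF }                      # sum the last field per base ID
END { for (i in a) print i, a[i] }    # print each ID with its total; for-in order is not guaranteed
' Input_file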
If order is not a concern, the below may also help:
awk -v FS="\t|_" '{count[$1]+=$NF}
END{for(i in count){printf "%s\t%s%s",i,count[i],ORS;}}' file
ENST00000054668.5 14
ENST00000054950.3 4
ENST00000054666.6 17
Edit: if the order of the output does matter, the below approach using a flag helps:
$ awk -v FS="\t|_" '{count[$1]+=$NF;++i;
if(i==2){printf "%s\t%s%s",$1,count[$1],ORS;i=0}}' file
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
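The same flag approach with comments (a sketch; it assumes every ID occupies exactly two consecutive lines, as in the sample):
awk -v FS="\t|_" '
{ count[$1] += $NF; ++i }    # FS splits on tab or underscore, so $1 is the base ID; i counts the pair
i==2 { printf "%s\t%s%s", $1, count[$1], ORS; i=0 }   # after the 2nd line of a pair, emit the sum and reset
' file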

printing multiple NR from one file based on the value from another file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file, followed by the fourth row, and then the first row of the second file. Hence, the output should be the following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following is the code that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the right order: it printed the first row first, then the second, and the fourth row last. How can I print the rows (NR) of the second file in exactly the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to an earlier version of the question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, which helps if your files are huge.