How to find the difference between two files using multiple conditions? - awk

I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Requirement is to fetch the records which are not available in
file1.txt using below condition.
file1.txt file2.txt
col1(date) col1(Date)
col2(number: 4343250019 ) col2(last value of number: 9)
col3(number) col3(number)
col5(alphanumeric) col5(alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line doesn't available in file1.txt but available in
file2.txt after satisfying the matching criteria.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working and I am trying to put all 3 operations in one operation.

awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong, it should be (last field if different: data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# separator for both file first with blank, second with `|`
awk -F'[|]|[[:blank:]]+' '
# for first file
FNR==NR{
# create en index entry based on the 4 field. The forat of filed allow to use them directly without separator (univoq)
E[ $1 ( $2 % 10 ) $3 $5 ]++
# for this line (file) don't go further
next
}
# for next file lines
# if not in the index list of entry, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt

Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # here we check index $1,$2,$3,$5 exists in array a by reading file f2
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR If the number of records read so far in the current file
is equal to the number of records read so far across all files,
condition which can only be true for the first file read.
a[$1,substr($2,length($2)),$3,$5] populate array "a" such that the
indexed by the first
field, last char of second field, third field and fifth field from
current record of file1
next Move on to the next record so we don't do any processing
intended for records from the second file.
!(($1,$2,$3,$5) in a) IF the array a index constructed from the
fields ($1,$2,$3,$5) of the current record of file2 does not exist
in array a, we get boolean true (! Called Logical NOT Operator. It is used to reverse the logical state of its operand. If a condition is true, then Logical NOT operator will make it false and vice versa.) so awk does default operation print $0 from file2
f1 FS="|" f2 read file1(f1), set field separator "|" after
reading first file, and then read file2(f2)
--edit--
When filesize is huge around 60GB(900 millions rows), its not a good
idea to process the file two times. 3rd operation - (replace "N" with
"NULL" from col - 8 ""awk -F'|' '{ gsub ("N","NULL",$8);print}'
OFS="|" output.txt
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one

You can try this awk:
awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms key for an array index. su is last digit of field $2 (su=substr($2,length($2),1)). Then, assigning an 1 as value for this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2

Related

Awk compare 2 files and match 2 columns, get % difference between column 3s

I have 2 files, with the formatting below. I am trying to compare lines where columns 1 and 2 match and then get the difference in the 2 #'s that are in column 3.
if file 2 column 3 is greater than file1 column 3, i would like a + at the end of the row
if file2 column is less than file 1 column 3, i would like a - at the end of the row
if either file column 3 is 0 i would like a * at the end of the row.
I only want to print lines where the difference between the 2 columns is > 15%
file1
abc,1,472
abc,2,536
abc,3,652
abc,4,512
abc,5,474
abc,6,266
abc,7,520
def,1,954
def,9,538
def,10,136
def,11,341
def,12,183
def,13,1209
def,14,365
def,15,536
def,16,979
def,17,0
xyz,1,547
xyz,19,0
xyz,20,0
xyz,21,0
xyz,22,0
xyz,23,0
xyz,24,0
File 2
abc,1,456
abc,2,533
abc,3,643
abc,4,444
abc,5,124
abc,6,255
abc,7,520
def,1,954
def,9,538
def,10,435
def,11,341
def,12,155
def,13,1209
def,14,365
def,15,536
def,16,979
def,17,0
xyz,1,547
xyz,19,124
xyz,20,0
xyz,21,0
xyz,22,0
xyz,23,0
xyz,24,0
expected output
abc,5,474,124,74%,- // (474-124)/474 = 74%
def,10,136,435,31%,+. // (435-136)/474 = 69%
xyz,19,0,124,100%,*. // either file has 0 , print 100% and *
I have tried multiple iterations of this but cannot seem to get the formatting to work.
awk -F, 'FNR==NR{a[$1,$2]; next ;b[$1,$2,$3]; next} $1,$2 in a {if ($3>b[$3]) {Q=((b[$3]/$3) *100)) {print Q,$0 }} else if (b[$3]>$3) {Q=(($3/b[$3]) *100)){print Q,$0 }}' file1 file2
i get this error
^ unexpected newline or end of string
also tried variations on this line but i cannot figure out the division by 0 error
awk -F, 'FNR==NR{a[$1,$2]; next ;b[$1,$2,$3]; next} $1,$2 in a {if ((Q=(b[$3]/$3) > 15) || (Q=($3/b[$3])) > 15 ){print Q,$0}}' file1 file2
awk: cmd. line:1: (FILENAME=file2 FNR=1) fatal: division by zero attempted
you need to handle if the denominator is zero in the base case, since you cannot find the relative change in that case, you need to report absolute change.
$ awk -F, -v OFS=, '{k=$1 FS $2}
FNR==NR {a[k]=$3; next}
k in a {if(a[k]) q=$3/a[k]-1;
else if($3) zero=1
else q=0
plus=q>0.15
minus=q<-0.15
q=q<0?-q:q;
if(zero) plus=minus=0
if(plus || minus || zero)
print k,a[k],$3,(zero?100:int(100*q))"%",(plus?"+":minus?"-":"*")
q=zero=0}' file1 file2
abc,5,474,124,73%,-
def,10,136,435,219%,+
def,12,183,155,15%,-
xyz,19,0,124,100%,*
you can put this in a diff.awk file and run with awk -f diff.awk file1 file2
the file contents should be
BEGIN{FS=OFS=","}
{k=$1 FS $2}
... the code in between
q=zero=0}
note that text body is without the single quotes. You can make it executable with the right shebang but I think this will be simpler.

Awk command to compare specific columns in file1 to file2 and display output

File1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
File2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
Desired output:
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0296,444,0.1
I tried awk NR==FNR command but it’s displaying only the last matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0296,444,0.1
I tried awk NR==FNR command but it’s displaying only the last
matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
Using awk and sort
awk 'BEGIN{
# set input and output field separator
FS=OFS=","
}
# read first file f1
# index key field1 and field3 of file1 (f1)
{
k=$1 FS $3
}
# save 2nd and last field of file1 (f1) in array a, key being k
FNR==NR{
a[k]=(k in a ? a[k] RS:"") $2 OFS $NF;
# stop processing go to next line
next
}
# read 2nd file f2 from here
# 2nd and 3rd field of fiel2 (f2) used as key
{
k=$2 FS $3
}
# if key exists in array a
k in a{
# split array value by RS row separator, and put it in array t
split(a[k],t,RS);
# iterate array t, print and sort
for(i=1; i in t; i++)
print $0,t[i] | "sort -t, -nk5"
}
' f1 f2
Test Results:
$ cat f1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
$ cat f2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
$ awk 'BEGIN{FS=OFS=","}{k=$1 FS $3}FNR==NR{a[k]=(k in a ? a[k] RS:"") $2 OFS $NF; next}{k=$2 FS $3}k in a{split(a[k],t,RS); for(i=1; i in t; i++)print $0,t[i] | "sort -t, -nk5" }' f1 f2
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
Following awk may help you in same.
awk -F, '
FNR==NR{
a[FNR]=$0;
next
}
{
for(i=1;i<=length(a);i++){
print a[i] FS $2 FS $NF
}
}' Input_file2 Input_file1
Adding explanation too for code as follows.
awk -F, ' ##Setting field separator as comma here for all the lines.
FNR==NR{ ##Using FNR==NR condition which will be only TRUE then first Input_file named File2 is being read.
##FNR and NR both indicates the number of lines for a Input_file only difference is FNR value will be RESET whenever a new file is being read and NR value will be keep increasing till all Input_files are read.
a[FNR]=$0; ##Creating an array named a whose index is FNR(current line) value and its value is current line value.
next ##Using next statement will sip all further statements now.
}
{
for(i=1;i<=length(a);i++){##Starting a for loop from variable i value from 1 to length of array a value. This will be executed on 2nd Input_file reading.
print a[i] FS $2 FS $NF ##Printing the value of array a whose index is variable i and printing 2nd and last field of current line.
}
}' File2 File1 ##Mentioning the Input_file names here.
another one with join/awk
$ join -t, -j99 file2 file1 |
awk -F, -v OFS=, '$3==$6 && $4==$8 {print $2,$3,$4,$5,$7,$9}'

How to print all columns after matching on key field

How can I join all fields of each row from both files after matching on a key field? How to generalize this one-liner if the number of fields is unknown in f2?
f2:
a 1 2
b 3 4
c 5 6
f3:
10 a x y z
11 g x y z
12 j x y z
observed:
a 10 x y z
a1 10 x y z
Desired:
a 1 2 10 x y z
These are my best attempts but are incorrect:
awk 'FNR==NR{a[$1]=$2;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'FNR==NR{a[$1]=$2$3;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){print a[$2],$1,$3,$4,$5}' f2.txt f3.txt > f4.txt
save the whole as value and column1 as key, when read 2nd file, check column2 in array a or not, if it is, print a[$2] and the rest columns
A shorter way(the disadvantage of this command is there's one extra space between 10 and x):
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){second=$2; $2="";print a[second],$0}' f2.txt f3.txt > f4.txt
replace $2 of 2nd file with empty string, and print the whole line $0
if your files are sorted in the keys as in your example, join is the tool for this task
join -11 -22 f2.txt f3,txt
#mxttgen31: try:
awk 'FNR==NR{Q=$2;$2="";A[Q]=$0;next} ($1 in A){print $0,A[$1]}' f3 f2
Explanation of above command as follows:
awk 'FNR==NR{ ##### Checking condition FNR==NR here, where FNR and NR both denotes the number of line,
only difference between FNR and NR is as we could read mutiple files from awk,
value of FNR will be RESET on next file's start, where NR's value will be keep on increasing till
it completes reading all the file. so this condition will be TRUE only when first Input_file(which is f3 here) will be TRUE.
Q=$2; ##### Assigning second field's value to variable Q.
$2=""; ##### making second field's value to NULL now.
A[$2]=$0; ##### Create an array named A whose index is $2 and value is current line.
next} ##### putting next(awk's in-built keyword) which skips all next further statements and take the cursor again to starting.
($1 in A) ##### Cursor will come here whenever second Input_file is being read, so here checking $1(first field) is present in array A then do following.
{print $0,A[$1]} ##### print the current line($0) of current file named f2 and print array A's value whose index is $1 of current file f2.
' f3 f2 ##### Mentioning Input_files here.

Printing the compared result under last column header

I'm doing a comparision of 2 files file1,file2 using first column in file1 to first column in file2 and retriving corresponding value from 7 th column .
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
also on next day it will compare and append the result .
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB
cat file2.txt
cat file2.txt
DF,GH,MH,FR,FG,GH,NA
XX,ZZ,XC,EE,RR,BB,OK
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
mv: overwrite `file1.txt'? y
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,NA ---> Header
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB,NA
after adding new row
DM,DF,GR,TH,EW
problem is it is comparing and printing result for header also and result is printed
under header D1 instead of D10 for newly inserted row in file1
How can we print like this, compare should exclude header and result under last column header
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB ,NA
To avoid having header updated, change awk's expression to the following:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; next} {print (($1 in a) ? $0","a[$1] : $0",NA");}'
In this case 1st line of the file1.txt will be printed as is, without any changes.
But don't you also need to have new day (like "D10" in the example) be added to the header on each run? Or you do it elsewhere?
As to the 2nd question (printing new value at the same position in the string for the shorter line as for the longer line), you should further modify awk:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; len=length($0); next} {printf $0; cont=(($1 in a) ? ","a[$1] : ",NA"); for (i=length($0)+1;i<=len-length(cont);i++) printf " " ; print cont;}'

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiples lines it can be clearer (reminder, awk works like this : condition{action} :
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet :
awk '
if (NR == FNR) {arr[NR]=$1}
if (NR != FNR && arr[FNR]) {print $0}
' file1 file2
When awk find a condition alone (without action) like NR!=FNR && arr[FNR], it print by default on STDOUT implicitly is the expressions is TRUE (> 0)
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR is different than FNR on the second file)
arr[NR]=$1 : feeding array arr with indice of the current NR with the first column
if NR!=FNR we are in the next file and if the value of the array if 1, then we print
No as clean as a awk solution
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines, in order to just do a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1