awk to compare two file by identifier & output in a specific format - awk

I have 2 large files i need to compare all pipe delimited
file 1
a||d||f||a
1||2||3||4
file 2
a||d||f||a
1||1||3||4
1||2||r||f
Now I want to compare the files & print accordingly such as if any update found in file 2 will be printed as updated_value#oldvalue & any new line added to file 2 will also be updated accordingly.
So the desired output is: (only the updated & new data)
1||1#2||3||4
1||2||r||f
what I have tried so far is to get the separated changed values:
awk -F '[||]+' 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++)if(a[FNR,i]!=$i)print $i"#"a[FNR,i]}' file1 file2 >output
But I want to print the whole line. How can I achieve that??

I would say:
awk 'BEGIN{FS=OFS="|"}
FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next}
{for (i=1; i<=NF; i+=2)
if (a[FNR,i] && a[FNR,i]!=$i)
$i=$i"#"a[FNR,i]
}1' f1 f2
This stores the file1 in a matrix a[line number, column]. Then, it compares its values with its correspondence in file2.
Note I am using the field separator | instead of || and looping in steps of two to use the proper data. This is because I for example did gawk -F'||' '{print NF}' f1 and got just 1, meaning that FS wasn't well understood. Will be grateful if someone points the error here!
Test
$ awk 'BEGIN{FS=OFS="|"} FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next} {for (i=1; i<=NF; i+=2) if (a[FNR,i] && a[FNR,i]!=$i) $i=$i"#"a[FNR,i]}1' f1 f2
a||d||f||b#a
1||1#2||3||4
1||2||r||f

Related

Trying to read data from two files, and subtract values from both files using awk

I have two files
0.975301988947238963 1.75276754663189283 2.00584
0.0457467532388459441 1.21307648993841410 1.21394
-0.664000617674924687 1.57872850852906366 1.71268
-0.812129324498058969 4.86617859243825635 4.93348
and
1.98005959631337536 -3.78935155011290536 4.27549
-1.04468782080821154 4.99192849476267053 5.10007
-1.47203672235857397 -3.15493073343947694 3.48145
2.68001948430755244 -0.0630730371855307004 2.68076
I want to subtract the two values in column 3 of each file.
My first awk statement was
**awk
'BEGIN {print "Test"} FNR>1 && FNR==NR { r[$1]=$3; next} FNR>1 { print $3, r[$1], (r[$1]-$3)}' zzz0.dat zzz1.dat**
Test
5.10007 -5.10007
3.48145 -3.48145
2.68076 -2.68076
This suggests it does not recognize r[$1]=$3
I created an additional column xyz by
**awk 'NR==1{$(NF+1)="xyz"} NR>1{$(NF+1)="xyz"}1' zzz0.dat**
then
awk 'BEGIN {print "Test"} FNR>1 && FNR==NR { xyz[$4]=$3; next} FNR>1 { print $3, xyz[$4], (xyz[$4]-$3)}' zzz00.dat zzz11.dat
Test
5.10007 4.93348 -0.16659
3.48145 4.93348 1.45203
2.68076 4.93348 2.25272
This now shows three columns, but xyz[$4] is printing only the value in the last column, instead of creating a array.
My real files have thousands of lines. How can I resolve this problem ?
You can do it relatively easily using a numeric index for your array. For example:
awk 'NR==FNR {a[++n]=$3; next} o<n{++o; printf "%lf - %lf = %lf\n", a[o], $3, a[o]-$3}' file1 file2
That way you preserve the ordering of the records across files. Without a numeric index, the arrays are associative and there is no specific ordering preserved.
Example Use/Output
With your files in file1 and file2 respectively, you would have:
$ awk 'NR==FNR {a[++n]=$3; next} o<n{++o; printf "%lf - %lf = %lf\n", a[o], $3, a[o]-$3}' file1 file2
2.005840 - 4.275490 = -2.269650
1.213940 - 5.100070 = -3.886130
1.712680 - 3.481450 = -1.768770
4.933480 - 2.680760 = 2.252720
Let me know if that is what you intended or if you have any further questions. If I missed your intent, drop a comment and I will help further.
if the records are aligned in both files, easiest is
$ paste file1 file2 | awk '{print $3,$6,$3-$6}'
2.00584 4.27549 -2.26965
1.21394 5.10007 -3.88613
1.71268 3.48145 -1.76877
4.93348 2.68076 2.25272
if you're only interested in the difference, change to print $3-$6.

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following are the codes that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the order as it first printed the first row then the second and fourth row is the last. How can I print the NR from the second file as exactly in the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, if your files are huge.

How to get rows with values more than 2 in at least 2 columns?

I am trying to extract row where value is >=2 in atleast two column. My input file look like this
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
and my awk script look like this
cat input_file | awk 'BEGIN{FS=",";OFS=","};{count>=0;for(i=2; i<4; i++) {if($i!=0) {count++}};if (count>=2){print $0}}'
which doesn't give me the expected output that should be
gain,top1,sos1,pho1
ATC3,6,6,0
What is the problem with this script. Thanks.
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
Or below one, print and go to next line immediately after finding 2 values (Reasonably faster)
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
Explanation
awk -F, ' # call awk and set field separator as comma
FNR>1{ # we wanna skip header to be checked so, if no of records related to current file is greater than 1
f=0; # set variable f = 0
for(i=2; i<=NF; i++) # start looping from second field to no of fields in record/line/row
{
if($i>=2)f++; # if field value is greater than 2 increment variable f
if(f>=2) # if we got 2 values ? then
{
print; # print record/line/row
next # we got enough go to next line
}
}
}FNR==1 # if first record being read then print in fact if FNR==1 we get boolean true, so it does default operation print $0, that is current record/line/row
' file
Input
$ cat file
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
Output-1
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
Output-2 (Reasonably faster)
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
hacky awk, handles the header as well
$ awk -F, '($2>=2) + ($3>=2) + ($4>=2) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
or,
$ awk -F, 'function ge2(x) {return x>=2?1:0}
ge2($2) + ge2($3) + ge2($4) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
#pali: #try:
Hope this should be much faster.
awk '{Q=$0;}(gsub(/,[2-9]/,"",Q)>=2) || FNR==1' Input_file
Here I am putting line's value into a variable named Q then, from Q variable globally substituting all the matches , then digits from 2 to 9 to NULL. Then checking it's count if that is greater or equal than 2, if either it's global substitution's value is greater than 2 or line number is 1 then it should print the current line.

awk returning whitespace matches when comparing columns in csv

I am trying to do a file comparison in awk but it seems to be returning all the lines instead of just the lines that match due to whitespace matching
awk -F "," 'NR==FNR{a[$2];next}$6 in a{print $6}' file1.csv fil2.csv
How do I instruct awk not to match the whitespaces?
I get something like the following:
cccs
dert
ssss
assak
this should do
$ awk -F, 'NR==FNR && $2 {a[$2]; next}
$6 in a {print $6}' file1 file2
if you data file includes spaces and numerical fields, as commented below better to change the check from $2 to $2!="" && $2!~/[[:space:]]+/
Consider cases like $2=<space>foo<space><space>bar in file1 vs $6=foo<space>bar<space> in file2.
Here's how to robustly compare $6 in file2 against $2 of file1 ignoring whitespace differences, and only printing lines that do not have empty or all-whitespace key fields:
awk -F, '
{
key = (NR==FNR ? $2 : $6)
gsub(/[[:space:]]+/," ",key)
gsub(/^ | $/,"",key)
}
key=="" { next }
NR==FNR { file1[key]; next }
key in file1
' file1 file2
If you want to make the comparison case-insensitive then add key=tolower(key) before the first gsub(). If you want to make it independent of punctuation add gsub(/[[:punct:]]/,"",key) before the first gsub(). And so on...
The above is untested of course since no testable sample input/output was provided.

sum same column across multiple files using awk ?

I want to add the 3rd column of 5 files such that the new file will have the same 2nd col and the sum of the 3rd col of the 5 files.
I tried something like this:
$ cat freqdat044.dat | awk '{n=$3; getline <"freqdat046.dat";print $2" " n+$3}' > freqtrial1.dat
freqdat048.dat`enter code here`$ cat freqdat044.dat | awk '{n=$3; getline <"freqdat046.dat";print $2" " n+$3}' > freqtrial1.dat
The files names:
freqdat044.dat
freqdat045.dat
freqdat046.dat
freqdat047.dat
freqdat049.dat
freqdat050.dat
And saved in output file the contain only $2 and the new col form the summation of the 3rd
awk '{x[$2] += $3} END {for(y in x) print y,x[y]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat
This does not necessarily print lines as they appear in the first file. If you want to preserve that sorting, then you have to save that ordering somewhere:
awk 'FNR==NR {keys[FNR]=$2; cnt=FNR} {x[$2] += $3} END {for(i=1; i<=cnt; ++i) print keys[i],x[keys[i]]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat