Find the difference between two files - awk

I have the following situation:
The file1.dat is like:
1 2
1 3
1 4
2 1
and the file2.dat is like:
1 2
2 1
2 3
3 4
I want to find the lines in the second file that are not in the first. I tried with grep -v -f file1 file2, but my real files are much bigger than these two, and when I ran it the shell never finished.
The result should be:
2 3
3 4
The files are sorted and they have the same number of elements. Any way to find a solution with awk?

Seems like you want lines in file2 that are not in file1:
$ awk 'FNR==NR{a[$0];next}!($0 in a)' file1 file2
2 3
3 4
However, since your files are sorted (which comm requires), it's simpler to use comm:
$ comm -13 file1 file2
2 3
3 4
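For readers unfamiliar with the FNR==NR idiom, here is the same awk command spelled out and run against the sample data from the question. This is only a sketch: the array name seen, the scratch directory and the result variable are illustrative choices, not part of the original answer.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)" || exit 1
printf '1 2\n1 3\n1 4\n2 1\n' > file1
printf '1 2\n2 1\n2 3\n3 4\n' > file2

# NR counts lines across all inputs, FNR restarts at 1 for each file,
# so FNR == NR is true only while the first file is being read.
result=$(awk '
    FNR == NR { seen[$0]; next }   # file1: remember each whole line as a key
    !($0 in seen)                  # file2: print lines never seen in file1
' file1 file2)
printf '%s\n' "$result"
```

Unlike comm, this does not need sorted input, but it keeps every distinct line of file1 in memory.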

Related

Retrieving lines comparing columns from two files

I'm trying to compare 2 tables and retrieve the matches based on two columns:
file 1
0.736 5 100 T
0.723 1 15 T
0.792 6 100 T
0.634 3 100 T
0.754 7 100 T
0.708 2 100 T
0.722 9 100 T
0.542 1 6 T
File 2
0.736 5
0.634 3
0.542 1
output
0.736 5 100 T
0.634 3 100 T
0.542 1 6 T
When I try this code it tells me that awk is not found, which doesn't make sense because I use awk regularly. Could you help me spot the error here, please?
awk 'FNR==NR{a[$1,$2]=$0;next}{if(b=a[$1,$2]){print b}}' file1 file2> output
You could use grep:
grep -f file2 file1
or awk:
awk 'NR==FNR{A[$1];next}$1 in A' file2 file1
Hope this helps :)
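Both commands give the expected result on the sample data, but the awk version keys on the first column only, and grep -f treats each line of file2 as an unanchored regular expression (so the pattern "0.542 1" would also match a line starting "0.542 15"). If both columns must match exactly, a sketch along these lines may be safer. The array name want and the matches variable are made up for the example.

```shell
cd "$(mktemp -d)" || exit 1
cat > file1 <<'EOF'
0.736 5 100 T
0.723 1 15 T
0.792 6 100 T
0.634 3 100 T
0.754 7 100 T
0.708 2 100 T
0.722 9 100 T
0.542 1 6 T
EOF
printf '0.736 5\n0.634 3\n0.542 1\n' > file2

# Pass 1 (file2): store each (column 1, column 2) pair as a composite key.
# Pass 2 (file1): print only lines whose first two columns form a stored key.
matches=$(awk '
    NR == FNR { want[$1, $2]; next }
    ($1, $2) in want
' file2 file1)
printf '%s\n' "$matches"
```

The keys are compared as strings, so 0.5 and 0.50 would count as different values.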

AWK: Comparing two different columns in two files

I have these two files
File1:
9 8 6 8 5 2
2 1 7 0 6 1
3 2 3 4 4 6
File2: (which has over 4 million lines)
MN 1 0
JK 2 0
AL 3 90
CA 4 83
MK 5 54
HI 6 490
I want to compare field 6 of file1 with field 2 of file2. If they match, append field 3 of file2 to the end of the matching file1 line.
I've looked at other solutions but I can't get it to work correctly.
Desired output:
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
My attempt:
awk 'NR==FNR{a[$2]=$2;next}a[$6]{print $0,a[$6]}' file2 file1
The program just hangs after that.
To print all lines in file1 with match if available:
$ awk 'FNR==NR{a[$2]=$3;next;} {print $0,a[$6];}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
To print only the lines that have a match:
$ awk 'NR==FNR{a[$2]=$3;next} $6 in a {print $0,a[$6]}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
Note that I replaced a[$2]=$2 with a[$2]=$3 and changed the test a[$6] (which is false if the value is zero) to $6 in a.
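The difference between testing the stored value and testing key membership can be seen side by side on the sample data. This is a small sketch; the variable names by_value and by_key are invented for the comparison.

```shell
cd "$(mktemp -d)" || exit 1
printf '9 8 6 8 5 2\n2 1 7 0 6 1\n3 2 3 4 4 6\n' > file1
printf 'MN 1 0\nJK 2 0\nAL 3 90\nCA 4 83\nMK 5 54\nHI 6 490\n' > file2

# Value test: entries whose stored value is 0 are treated as false and dropped.
by_value=$(awk 'NR==FNR { a[$2]=$3; next } a[$6] { print $0, a[$6] }' file2 file1)

# Membership test: a key counts as present whatever value it holds.
by_key=$(awk 'NR==FNR { a[$2]=$3; next } $6 in a { print $0, a[$6] }' file2 file1)

printf 'value test:\n%s\n' "$by_value"
printf 'membership test:\n%s\n' "$by_key"
```

Only the line ending in 490 survives the value test, while the membership test keeps all three lines.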
Your own attempt basically has two bugs, as seen in @John1024's answer:
You use field 2 as both key and value in a, where you should be storing field 3 as the value (since you want to keep it for later), i.e., it should be a[$2] = $3.
The test a[$6] is false when the value in a is zero, even if it exists. The correct test is $6 in a.
Hence:
awk 'NR==FNR { a[$2]=$3; next } $6 in a {print $0, a[$6] }' file2 file1
There may be better approaches, but your specification leaves some things open. For instance, you say that file2 has over 4 million lines, but not whether field 2 has that many unique values. If it does, a will hold that many entries in memory. You also don't say how long file1 is, whether its order must be preserved in the output, or whether every line (even one without a match in file2) should be output.
If it is the case that file1 has many fewer lines than file2 has unique values for field 2, and only matching lines need to be output, and order does not need to be preserved, then you might wish to read file1 first…

Lookup and Replace with two files in awk

I am trying to correct one file with another with a single line of AWK code. I am trying to take $1 from FILE2, look it up in FILE1, and get the corresponding $2 and $4. After I set them as variables I want the program to stop evaluating FILE1, change $10 and $11 in FILE2 to the values of the variables, and print the result.
I am having trouble getting awk to switch from FILE1 to FILE2 after I have extracted the variables. I've tried nextfile, but this resets the program and it tries to extract variables from FILE2. I also tried setting NR to the last record, but it did not switch.
I am also doing a loop to get each line out of FILE1, but if that can be part of the script I am sure it would speed things up not having to reopen awk over and over again.
Here are the parts I have figured out:
for file in `cut -f 1 FILE2`; do
awk -v a=$file '$1=a{s=$2;q=$4; ---GO TO FILE1---}{if ($1==a) {$10=s; $11=q; print 0;exit}' FILE1 FILE2 >> FILEOUT
done
A quick example set. NOTE: despite how I have this written, the two files are not in the same order and are on the order of 8GB in size, so they are a little unwieldy to sort.
FILE1
A 12345 + AJD$JD
B 12504 + DKFJ#%
C 52042 + DSJTJE
FILE2
A 2 3 4 5 6 7 8 9 345 D$J
B 2 3 4 5 6 7 8 9 250 KFJ
C 2 3 4 5 6 7 8 9 204 SJT
OUTFILE
A 2 3 4 5 6 7 8 9 12345 AJD$JD
B 2 3 4 5 6 7 8 9 12504 DKFJ#%
C 2 3 4 5 6 7 8 9 52042 DSJTJE
This is the code I got to work based on Kent's answer below.
awk 'NR==FNR{a[$1]=$2" "$4;next}$1 in a{$9=$9" "a[$1]}{$10="";$11=""}2' f1 f2
try this one-liner:
kent$ awk 'NR==FNR{a[$1]=$2" "$4;next}$1 in a{NF-=2;$0=$0" "a[$1]}7' f1 f2
A 2 3 4 5 6 7 8 9 12345 AJD$JD
B 2 3 4 5 6 7 8 9 12504 DKFJ#%
C 2 3 4 5 6 7 8 9 52042 DSJTJE
No need to loop over the files repeatedly - just read one file and store the relevant fields in arrays keyed on $1, then go through the other file and use those arrays to look up the values you want to insert.
awk '(FILENAME=="FILE1"){y[$1]=$2;z[$1]=$4}; (FILENAME=="FILE2" && $1 in y){$10=y[$1];$11=z[$1];print $0}' FILE1 FILE2
That said, it sounds like you might have a use for the join command here rather than messing about with awk (the above script assumes all your $1/$2/$4 values will fit in memory).
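For comparison, a join-based version might look like the sketch below. It assumes you can afford to sort both files first (sort typically spills to temporary files rather than holding everything in memory); the .sorted file names and the joined variable are invented for the example.

```shell
cd "$(mktemp -d)" || exit 1
cat > FILE1 <<'EOF'
B 12504 + DKFJ#%
C 52042 + DSJTJE
A 12345 + AJD$JD
EOF
cat > FILE2 <<'EOF'
C 2 3 4 5 6 7 8 9 204 SJT
A 2 3 4 5 6 7 8 9 345 D$J
B 2 3 4 5 6 7 8 9 250 KFJ
EOF

# join needs both inputs sorted on the join field under the same collation.
export LC_ALL=C
sort -k1,1 FILE1 > f1.sorted
sort -k1,1 FILE2 > f2.sorted

# -o lists output fields as FILE.FIELD: fields 1-9 of the second input,
# followed by fields 2 and 4 of the first.
joined=$(join -o 2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,1.2,1.4 f1.sorted f2.sorted)
printf '%s\n' "$joined"
```

The output comes out in key order, not in the original order of FILE2, and lines of FILE2 without a partner in FILE1 are dropped.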

Multiply every nth field...elegantly

I have a text file with a series of numbers:
1 2 4 2 2 6 3 4 7 4 4 8 2 4 6 5 5 8
I need to have every third field multiplied by 3, so output would be:
1 2 12 2 2 18 3 4 21 4 4 24 2 4 18 5 5 24
Now, I've hammered out a solution already, but I know there's a quicker, more elegant one out there. Here's what I've gotten to work:
xargs -n1 < input.txt | awk '{printf NR%3 ? "%d " : $0*3" ", $1}' > output.txt
I feel that there must be an awk one-liner that can do this?? How can I make awk look at each field (instead of each record), thus not needing the call to xargs to put every field on a different line? Or maybe sed can do it?
Try:
awk '{for (i=3;i<=NF;i+=3)$i*=3; print}' input.txt > output.txt
I have not tested this yet (posted on my iPod). The print command without parameters prints the whole (partially modified) line. Because a field was assigned, awk rebuilds the line with OFS, which defaults to a single space, so the output stays blank-separated without setting OFS in a BEGIN section.
This line would work too, in an awk that accepts a regular expression as RS (gawk, for example):
awk -v RS="\\n| " -v ORS=" " '!(NR%3){$0*=3}7' file
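The field-loop approach generalizes easily. Here is a sketch with the step and the multiplier passed in as variables; n and k are names chosen for this example, not something from the answers above.

```shell
line='1 2 4 2 2 6 3 4 7 4 4 8 2 4 6 5 5 8'

# Visit fields n, 2n, 3n, ... and multiply each by k.
# The trailing 1 is an always-true pattern, so the rebuilt line is printed.
scaled=$(printf '%s\n' "$line" |
    awk -v n=3 -v k=3 '{ for (i = n; i <= NF; i += n) $i *= k } 1')
printf '%s\n' "$scaled"
```

This works per input line, so a file with several lines of numbers is handled without the xargs step.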

extracting data from a file with awk

I have a data set like below
first 0 1
first 1 2
first 2 3
second 0 1
second 1 2
second 2 3
third 0 1
third 1 2
third 2 3
I need to check this file and extract the third columns for first, second and third and store them in different files.
The output files should contain:
1
2
3
This is pretty straightforward: awk '{print $3>$1}' file, i.e., print the third field and redirect the output to a file whose name is the first field.
Demo:
$ ls
file
$ awk '{print $3>$1}' file
$ ls
file first second third
$ cat first
1
2
3
$ cat second
1
2
3
$ cat third
1
2
3
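If the output files should carry a suffix, the redirection target is best wrapped in parentheses, because an unparenthesized concatenation after > is parsed differently by different awk implementations. A minimal sketch, where the .txt suffix is purely an assumption for illustration:

```shell
cd "$(mktemp -d)" || exit 1
printf 'first 0 1\nfirst 1 2\nfirst 2 3\n'    >  file
printf 'second 0 1\nsecond 1 2\nsecond 2 3\n' >> file
printf 'third 0 1\nthird 1 2\nthird 2 3\n'    >> file

# Parentheses make it unambiguous that the file name is $1 followed by ".txt".
awk '{ print $3 > ($1 ".txt") }' file

first_contents=$(cat first.txt)
printf '%s\n' "$first_contents"
```

With a very large number of distinct first fields, some awk implementations may run into a limit on simultaneously open files, in which case calling close() on finished files becomes necessary.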