AWK: 2 columns 2 files display where second column has unique data - awk

I need the first column to check if it doesn't match the first column on the second file. Though, if the second column matches the second column on the second file, to display this data with awk on Linux.
I want awk to detect the change with both the first and second column of the first file with the second file.
file1.txt
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
file2.txt
sdsdks ./file.txt <-- different
sdsksd ./example.txt <-- different
jsdjsk ./number.txt <-- same
dfkdfa ./ok.txt <-- different
Expected output:
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Notice how in the second file there may be lines missing and not the same as the first.
As seen above, how can awk display results only where the second column is unique and does not match the first column?

Something like this might work for you:
awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt
Breakdown:
FNR == NR { } # Run on first file as FNR is record number for the file, while NR is the global record number
f[FNR"_"$2] = $1; # Store first column under the name of FNR followed by an underbar followed by the second column
next # read next record and redo
f[FNR"_"$2] && f[FNR"_"$2] != $1 # If the first column doesn't match while the second does, then print the line
A simpler approach which will ignore the second column is:
awk 'FNR == NR { f[FNR"_"$1] = 1; next }
!f[FNR"_"$1]' file1.txt file2.txt

If the records don't have to be in the respective position in the files ie. we compare matching second column strings, this should be enough:
$ awk '{if($2 in a){if($1!=a[$2])print $2}else a[$2]=$1}' file1 file2
Output:
file.txt
In pretty print:
$ awk '{
if($2 in a) { # if $2 match processing
if($1!=a[$2]) # and $1 don t
print $2 # output
} else # else
a[$2]=$1 # store
}' file1 file2
Updated:
$ awk '{if($2 in a){if($1!=a[$2])print $1,$2}else a[$2]=$1}' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Basically changed the print $2 to print $1,$2.

The way your question is worded is very confusing but after reading it several times and looking at your posted expected output I THINK you're just trying to say you want the lines from file2 that don't appear in file1. If so that's just:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If your real data has more fields than shown in your sample input but you only want the first 2 fields considered for the comparison then fix your question to show a more truly representative example but the solution would be:
$ awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
if that's not it then please edit your question to clarify what it is you're trying to do and include an example where the above doesn't produce the expected output.

I understand the original problem in the following way:
Two files, file1 and file2 contain a set of key-value pairs.
The key is the filename, the value is the string in the first column
If a matching key is found between file1 and file2 but the value is different, print the matching line of file2
You do not really need advanced awk for this task, it can easily be achieved with a simple pipeline of awk and grep.
$ awk '{print $NF}' file2.txt | grep -wFf - file1.txt | grep -vwFf - file2.txt
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Here, the first grep will select the lines from file1.txt which do have the same key (filename). The second grep will try to search the full matching lines from file1 in file2, but it will print the failures. Be aware that in this case, the lines need to be completely identical.
If you just want to use awk, then the above logic is achieved with the solution presented by Ed Morton. No need to repeat it here.

I think this is what you're looking for
$ awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
print the records from file2 where field1 values are different for the same field2 value. This script assumes that field2 values are unique within each file, so that it can be used as keys. Since the content looks like file paths, this is a valid assumption. Otherwise, you need to match the records perhaps with the corresponding line numbers.

In case you are looking for a more straightforward line-based diff based on the the first field on a line being different.
awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1 file2

Related

Use awk to remove lines based on a column from another file

I have the following code that works to extract lines from the multiple-column file_1 that have a value in the first column that appears in the single-column file_2:
awk 'NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}' file_1 file_2
I got this code from the answer to this question: AWK to filter a file based upon columns of another file
I want to change to code to do the opposite, namely to remove every line from file_1 where the first column matches any value that appears in the single-column file_2. How do I do this?
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1
Process the second file first (NR==FNR) and create an array called arr, with the line ($0) as the key and 1 the value. Then when processing the next file (file1), check if the first space delimited field ($1) exists as a key in the array arr and if it doesn't, print the line.
Direct the output to a file if you want to store the results:
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1 > file3

Awk command has unexpected results when comparing two files

I am using an awk command to compare the first column in two file.
I want to take col1 of file1 and if there is a match in col1 of file2, update the "date updated" in the last column. If there is no match, I want to append the entire line of file1 to file2 and append a "date updated" value to that line as well. Here is the command I'm currently using:
awk 'FNR == NR { f1[$1] = $0; next }
$1 in f1 { print; delete f1[$1] }
END { for (user in f1) print f1[user] }' file1 file2
File1:
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW
mschrode,161.1,plasma-de+,serv01,datetimeNEW
File2:
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
deleteme,192.0,random,serv01,datetimeOLD #<---- Needs to be removed: WORKS!
Expected Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
mschrode,161.1,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
Current Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
mschrode,161.1,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
The Logic Broken Down:
If $usr/col1 in file2 does NOT exist in file1
remove entire line from file2
(ex: line3 in file2, user: deleteme)
If $usr/col1 in file1 does NOT exist in file2
append entire line to file2
(ex: lines 1-5 in file1)
So the issue is, when there IS a match between the two files, I need to keep the information from file2, not the information from file1. In the output examples above you'll see I need to keep the datetimeOLD from file2 along with the new information from file1.
Set field separator to comma, and read file2 first:
$ awk -F',' 'FNR==NR{a[$1]=$0;next} $1 in a{print a[$1];next} 1' file2 file1
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD

how to get the common rows according to the first column in awk

I have two ',' separated files as follow:
file1:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
file2:
A,inf
B,inf
C,0.313559
D,189.5
E,38.6735
I want to compare 2 files ans get the common rows based on the 1st column. So, for the mentioned files the out put would look like this:
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
I am trying to do that in awk and tried this:
awk ' NR == FNR {val[$1]=$2; next} $1 in val {print $1, val[$1], $2}' file1 file2
this code returns this results:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
which is not what I want. do you know how I can improve it?
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2}' file1 file2
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
Explained:
$ awk '
BEGIN {FS=OFS="," } # set separators
NR==FNR { # first file
a[$1]=$0 # hash to a, $1 as index
next # next record
}
$1 in a { # second file, if $1 in a
print a[$1],$2 # print indexed record from a with $2
}' file1 file2
Your awk code basically works, you are just missing to tell awk to use , as the field delimiter. You can do it by adding BEGIN{FS=OFS=","} to the beginning of the script.
But having that the files are sorted like in the examples in your question, you can simply use the join command:
join -t, file1 file2
This will join the files based on the first column. -t, tells join that columns are separated by commas.
If the files are not sorted, you can sort them on the fly like this:
join -t, <(sort file1) <(sort file2)

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following are the codes that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the order as it first printed the first row then the second and fourth row is the last. How can I print the NR from the second file as exactly in the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, if your files are huge.

awk returning whitespace matches when comparing columns in csv

I am trying to do a file comparison in awk but it seems to be returning all the lines instead of just the lines that match due to whitespace matching
awk -F "," 'NR==FNR{a[$2];next}$6 in a{print $6}' file1.csv fil2.csv
How do I instruct awk not to match the whitespaces?
I get something like the following:
cccs
dert
ssss
assak
this should do
$ awk -F, 'NR==FNR && $2 {a[$2]; next}
$6 in a {print $6}' file1 file2
if you data file includes spaces and numerical fields, as commented below better to change the check from $2 to $2!="" && $2!~/[[:space:]]+/
Consider cases like $2=<space>foo<space><space>bar in file1 vs $6=foo<space>bar<space> in file2.
Here's how to robustly compare $6 in file2 against $2 of file1 ignoring whitespace differences, and only printing lines that do not have empty or all-whitespace key fields:
awk -F, '
{
key = (NR==FNR ? $2 : $6)
gsub(/[[:space:]]+/," ",key)
gsub(/^ | $/,"",key)
}
key=="" { next }
NR==FNR { file1[key]; next }
key in file1
' file1 file2
If you want to make the comparison case-insensitive then add key=tolower(key) before the first gsub(). If you want to make it independent of punctuation add gsub(/[[:punct:]]/,"",key) before the first gsub(). And so on...
The above is untested of course since no testable sample input/output was provided.