How to use awk to compare six files for common elements - awk

To compare two fields of two files for common elements and print the common elements, I use this:
awk 'NR==FNR{a[$2];next}$2 in a{print $2}' file1 file2
This compares column 2 of the two files.
Now, I have six such files. How can I use awk to compare column 2 of these six files for common elements, and print out the common elements? Each file has only two fields.
My expectation is an output of only the common elements.
Thank you.

If the values in column 2 of each file are unique within the file, then this is sufficient:
awk '{a[$2]++;} END {for (i in a) if (a[i] > 1) print i;}' \
file1 file2 file3 file4 file5 file6
(assuming you want 'appears in more than one file'; change > 1 to == 6 if you want 'appears in all files').
If the values in column 2 could be repeated within a given file, then you have to work a bit harder, possibly like this:
awk '{ if (f[$2] != FILENAME) a[$2]++; f[$2] = FILENAME; }
END { for (i in a) if (a[i] > 1) print i; }' \
file1 file2 file3 file4 file5 file6
The array f keeps a record of which file last identified the value in $2; if that isn't the current file, then increment the count in array a and record that the current file identified the value in $2. As before, tweak the condition in the for loop to adjust what gets printed.
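As a quick sanity check, here is a minimal, self-contained sketch of the "appears in all files" variant, run on three tiny made-up files (so the condition becomes a[i] == 3 rather than == 6); the output is piped through sort because the traversal order of for (i in a) is unspecified:

```shell
cd "$(mktemp -d)"
printf 'a 1\nb 2\nc 3\n' > f1
printf 'a 2\nb 4\nc 3\n' > f2
printf 'a 2\nb 3\nc 5\n' > f3
# Count, per value of column 2, how many distinct files it appears in,
# then keep only the values seen in all three files.
out=$(awk '{ if (f[$2] != FILENAME) a[$2]++; f[$2] = FILENAME; }
           END { for (i in a) if (a[i] == 3) print i; }' f1 f2 f3 | sort)
printf '%s\n' "$out"
```

Here only the values 2 and 3 occur in column 2 of all three files, so those are the two lines printed.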

Related

AWK: 2 columns 2 files display where second column has unique data

I need to check whether the first column of the first file doesn't match the first column of the second file while the second column does match, and display that data with awk on Linux.
I want awk to detect the change by comparing both the first and second columns of the first file against the second file.
file1.txt
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
file2.txt
sdsdks ./file.txt <-- different
sdsksd ./example.txt <-- different
jsdjsk ./number.txt <-- same
dfkdfa ./ok.txt <-- different
Expected output:
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Notice how in the second file there may be lines missing and not the same as the first.
As seen above, how can awk display only the lines where the second column matches but the first column does not?
Something like this might work for you:
awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt
Breakdown:
FNR == NR { } # Run only for the first file: FNR is the record number within the current file, while NR is the global record number
f[FNR"_"$2] = $1; # Store the first column under a key made of FNR, an underscore, and the second column
next # Read the next record and start over
f[FNR"_"$2] && f[FNR"_"$2] != $1 # If the second column matches but the first column doesn't, print the line
A simpler approach which will ignore the second column is:
awk 'FNR == NR { f[FNR"_"$1] = 1; next }
!f[FNR"_"$1]' file1.txt file2.txt
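For reference, here is a self-contained run of the first snippet against the sample files from the question; it prints the file2 lines whose first column differs from the same line of file1:

```shell
cd "$(mktemp -d)"
printf 'sdsdjs ./file.txt\nsdsksp ./example.txt\njsdjsk ./number.txt\ndfkdfk ./ok.txt\n' > file1.txt
printf 'sdsdks ./file.txt\nsdsksd ./example.txt\njsdjsk ./number.txt\ndfkdfa ./ok.txt\n' > file2.txt
# Key on line number plus second column; print file2 lines whose first
# column differs from the same position in file1.
out=$(awk 'FNR == NR { f[FNR"_"$2] = $1; next }
           f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt)
printf '%s\n' "$out"
```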
If the records don't have to be in the same position in the files, i.e. we compare records with matching second-column strings, this should be enough:
$ awk '{if($2 in a){if($1!=a[$2])print $2}else a[$2]=$1}' file1 file2
Output:
file.txt
In pretty print:
$ awk '{
if($2 in a) {       # if $2 was seen while processing file1
    if($1!=a[$2])   # and $1 doesn't match
        print $2    # output
} else              # else
    a[$2]=$1        # store
}' file1 file2
Updated:
$ awk '{if($2 in a){if($1!=a[$2])print $1,$2}else a[$2]=$1}' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Basically changed the print $2 to print $1,$2.
The way your question is worded is very confusing, but after reading it several times and looking at your posted expected output, I THINK you're just trying to say you want the lines from file2 that don't appear in file1. If so, that's just:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If your real data has more fields than shown in your sample input but you only want the first 2 fields considered for the comparison, then fix your question to show a more truly representative example, but the solution would be:
$ awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If that's not it, then please edit your question to clarify what you're trying to do, and include an example where the above doesn't produce the expected output.
I understand the original problem in the following way:
Two files, file1 and file2 contain a set of key-value pairs.
The key is the filename, the value is the string in the first column
If a matching key is found between file1 and file2 but the value is different, print the matching line of file2
You do not really need advanced awk for this task, it can easily be achieved with a simple pipeline of awk and grep.
$ awk '{print $NF}' file2.txt | grep -wFf - file1.txt | grep -vwFf - file2.txt
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Here, the first grep will select the lines from file1.txt which do have the same key (filename). The second grep will try to search the full matching lines from file1 in file2, but it will print the failures. Be aware that in this case, the lines need to be completely identical.
If you just want to use awk, then the above logic is achieved with the solution presented by Ed Morton. No need to repeat it here.
I think this is what you're looking for
$ awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
print the records from file2 where field1 values differ for the same field2 value. This script assumes that field2 values are unique within each file, so that they can be used as keys. Since the content looks like file paths, this is a valid assumption. Otherwise, you need to match the records, perhaps by their corresponding line numbers.
In case you are looking for a more straightforward line-based diff based on the first field of a line being different:
awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1 file2

How to join two files based on one column in AWK (using wildcards)

I have 2 files, and I need to compare column 2 from file 2 with column 3 from file 1.
File 1
"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"
"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"
"testserver3","testserver3.domain.net","-3.3.3.3-"
File 2
"windows","10.10.10.10","datacenter1"
"linux","2.2.2.2","datacenter2"
"aix","4.4.4.4","datacenter2"
Expected Output
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
All I have been able to find are statements that only work if the columns are identical; I need it to work if column 3 from file 1 contains the value from column 2 from file 2.
I've tried this, but again, it only works if the columns are identical (which I don't want):
awk 'BEGIN {FS = OFS = ","};NR == FNR{f[$3] = $1","$2;next};$2 in f {print f[$2],$0}' file1.csv file2.csv
hacky!
$ awk -F'","' 'NR==FNR {n=split($NF,x,"-"); for(i=2;i<n;i++) a[x[i]]=$1 FS $2; next}
$2 in a {print a[$2] "\"," $0}' file1 file2
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
assumes the lookup is unique, i.e. file1 records are mutually exclusive in that field.
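To see why the loop runs from 2 to n-1: split($NF, x, "-") on a field like -1.1.1.1-10.10.10.10-" leaves x[1] empty and puts the closing quote in x[n], so only x[2]..x[n-1] hold real addresses. A self-contained run on the question's data:

```shell
cd "$(mktemp -d)"
cat > file1 <<'EOF'
"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"
"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"
"testserver3","testserver3.domain.net","-3.3.3.3-"
EOF
cat > file2 <<'EOF'
"windows","10.10.10.10","datacenter1"
"linux","2.2.2.2","datacenter2"
"aix","4.4.4.4","datacenter2"
EOF
# split($NF, x, "-") leaves x[1] empty and x[n] holding the closing
# quote, so only x[2]..x[n-1] are real IP addresses.
out=$(awk -F'","' 'NR==FNR {n=split($NF,x,"-"); for(i=2;i<n;i++) a[x[i]]=$1 FS $2; next}
                   $2 in a {print a[$2] "\"," $0}' file1 file2)
printf '%s\n' "$out"
```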

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file, followed by the fourth row, and then the first row of the second file. Hence, the output should be the following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following are the codes that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the requested order: it prints the first row first, then the second, with the fourth row last. How can I print the records from the second file in exactly the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, if your files are huge.
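A self-contained run against the question's data confirms that this version also preserves the order given in file1:

```shell
cd "$(mktemp -d)"
printf '2\n4\n1\n' > file1.txt
printf 'MANCHKLGO\nkflgklfdg\nfhgjpiqog\nfkfjdkfdg\nfghjshdjs\njgfkgjfdk\nghftrysba\ngfkgfdkgj\njfkjfdkgj\n' > file2.txt
# Remember the requested line numbers (and their order) from file1,
# store only those lines of file2, then print them in the requested order.
out=$(awk '
  NR==FNR {a[$1]; order[n++] = $1; next}
  FNR in a {lines[FNR] = $0}
  END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt)
printf '%s\n' "$out"
```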

How to print two lines of several files to a new file with specific order?

I have a task to do with awk. I am doing sequence analysis for some genes.
I have several files with sequences in order. I would like to extract the first sequence of each file into a new file, then the second sequence of each file, and so on until the last sequence. I only know how to do it for the first line, or for one specific line, with awk.
awk 'FNR == 2 {print; nextfile}' *.txt > newfile
Here I have input like this
File 1
Saureus081.1
ATCGGCCCTTAA
Saureus081.2
ATGCCTTAAGCTATA
Saureus081.3
ATCCTAAAGGTAAGG
File 2
SaureusRF1.1
ATCGGCCCTTAC
SauruesRF1.2
ATGCCTTAAGCTAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
File 3
SaureusN305.1
ATCGGCCCTTACT
SauruesN305.2
ATGCCTTAAGCTAGA
SaureusN305.3
ATCCTAAAGGTAATG
And there are 12 similar files:
File 4
.
.
.
File 12
Required Output
Newfile
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
I guess this task can be done easily with awk, but I'm not getting any idea how to do it for multiple lines.
Based on the modified question, the answer needs some changes.
$ awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0} END{for(i in a)print a[i]}' file1 file2 file3
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SauruesRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
Brief explanation:
Set '.' as the field delimiter
For every odd record (a header line), take the sequence number $2 as the key k of array a
Invoke getline to read the next record and append its $0 as the value corresponding to key k
Print the whole array at the end (note that the traversal order of for (i in a) is unspecified in awk, so the output order may vary)
If your data is very large, I would suggest creating temporary files:
awk 'FNR%2==1 { filename = $1 }
{ print $0 >> filename }' file1 ... filen
Afterwards, you can cat them together:
cat Seq1 ... Seqn > result
This has the additional advantage that it will work if not all sequences are present in all files.
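A minimal sketch with two made-up two-sequence files (using the Seq1/Seq2 headers from the earlier version of the question) shows how the per-sequence files come out:

```shell
cd "$(mktemp -d)"
printf 'Seq1\nAAA\nSeq2\nCCC\n' > file1
printf 'Seq1\nTTT\nSeq2\nGGG\n' > file2
# Every odd line is a header; route each header/sequence pair to a
# temporary file named after the header.
awk 'FNR%2==1 { filename = $1 }
     { print $0 >> filename }' file1 file2
out=$(cat Seq1 Seq2)
printf '%s\n' "$out"
```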
paste + awk solution:
paste File1 File2 | awk '{ p=$2;$2="" }NR%2{ k=p; print }!(NR%2){ v=p; print $1 RS k RS v }'
paste File1 File2 - merge corresponding lines of files
p=$2;$2="" - capture the value of the 2nd field (the corresponding key/value from File2), then clear it
The output:
Seq1
ATCGGCCCTTAA
Seq1
ATCGGCCCTTAC
Seq2
ATGCCTTAAGCTATA
Seq2
ATGCCTTAAGCTAGG
Seq3
ATCCTAAAGGTAAGG
Seq3
ATCCTAAAGGTAAGC
Additional approach for multiple files:
paste Files[0-9]* | awk 'NR%2{ k=$1; n=NF; print k }
!(NR%2){ print $1; for(i=2;i<=n;i++) print k RS $i }'

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
Over multiple lines it can be clearer (reminder: awk works like this: condition{action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet:
awk '
{ if (NR == FNR) { arr[NR] = $1 } }
{ if (NR != FNR && arr[FNR]) { print $0 } }
' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current record to STDOUT if the expression is TRUE (non-zero)
Explanations
NR is the number of the current record from the start of the input
FNR is the ordinal number of the current record within the current file (so NR differs from FNR on the second file)
arr[NR]=$1 : feed array arr at index NR with the first column
if NR!=FNR, we are in the second file, and if the value stored in the array is 1 (true), then we print
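Run against the question's files, the condition-only form prints exactly the rows of file2 whose counterpart in file1 starts with 1:

```shell
cd "$(mktemp -d)"
printf '0 xxx\n1 yyy\n1 zzz\n0 aaa\n' > file1
printf 'A bbb\nB ccc\nC ddd\nD eee\n' > file2
# First pass stores column 1 of file1 by line number; second pass
# prints the file2 line when that stored value is true (non-zero).
out=$(awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2)
printf '%s\n' "$out"
```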
Not as clean as an awk solution:
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines, in order to just do a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
    getline value <"file2"   # read the next line of file2 in lockstep
    if ($1)                  # column 1 of file1 is non-zero (true)
        print value
}' file1