Awk command has unexpected results when comparing two files - awk

I am using an awk command to compare the first column in two file.
I want to take col1 of file1 and if there is a match in col1 of file2, update the "date updated" in the last column. If there is no match, I want to append the entire line of file1 to file2 and append a "date updated" value to that line as well. Here is the command I'm currently using:
awk 'FNR == NR { f1[$1] = $0; next }
$1 in f1 { print; delete f1[$1] }
END { for (user in f1) print f1[user] }' file1 file2
File1:
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW
mschrode,161.1,plasma-de+,serv01,datetimeNEW
File2:
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
deleteme,192.0,random,serv01,datetimeOLD #<---- Needs to be removed: WORKS!
Expected Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
mschrode,161.1,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
Current Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
mschrode,161.1,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
The Logic Broken Down:
If $usr/col1 in file2 does NOT exist in file1
remove entire line from file2
(ex: line3 in file2, user: deleteme)
If $usr/col1 in file1 does NOT exist in file2
append entire line to file2
(ex: lines 1-5 in file1)
So the issue is, when there IS a match between the two files, I need to keep the information from file2, not the information from file1. In the output examples above you'll see I need to keep the datetimeOLD from file2 along with the new information from file1.

Set field separator to comma, and read file2 first:
$ awk -F',' 'FNR==NR{a[$1]=$0;next} $1 in a{print a[$1];next} 1' file2 file1
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD

Related

Use awk to remove lines based on a column from another file

I have the following code that works to extract lines from the multiple-column file_1 that have a value in the first column that appears in the single-column file_2:
awk 'NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}' file_1 file_2
I got this code from the answer to this question: AWK to filter a file based upon columns of another file
I want to change to code to do the opposite, namely to remove every line from file_1 where the first column matches any value that appears in the single-column file_2. How do I do this?
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1
Process the second file first (NR==FNR) and create an array called arr, with the line ($0) as the key and 1 the value. Then when processing the next file (file1), check if the first space delimited field ($1) exists as a key in the array arr and if it doesn't, print the line.
Direct the output to a file if you want to store the results:
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1 > file3

how to get the common rows according to the first column in awk

I have two ',' separated files as follow:
file1:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
file2:
A,inf
B,inf
C,0.313559
D,189.5
E,38.6735
I want to compare 2 files ans get the common rows based on the 1st column. So, for the mentioned files the out put would look like this:
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
I am trying to do that in awk and tried this:
awk ' NR == FNR {val[$1]=$2; next} $1 in val {print $1, val[$1], $2}' file1 file2
this code returns this results:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
which is not what I want. do you know how I can improve it?
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2}' file1 file2
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
Explained:
$ awk '
BEGIN {FS=OFS="," } # set separators
NR==FNR { # first file
a[$1]=$0 # hash to a, $1 as index
next # next record
}
$1 in a { # second file, if $1 in a
print a[$1],$2 # print indexed record from a with $2
}' file1 file2
Your awk code basically works, you are just missing to tell awk to use , as the field delimiter. You can do it by adding BEGIN{FS=OFS=","} to the beginning of the script.
But having that the files are sorted like in the examples in your question, you can simply use the join command:
join -t, file1 file2
This will join the files based on the first column. -t, tells join that columns are separated by commas.
If the files are not sorted, you can sort them on the fly like this:
join -t, <(sort file1) <(sort file2)

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following are the codes that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the order as it first printed the first row then the second and fourth row is the last. How can I print the NR from the second file as exactly in the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, if your files are huge.

How to print two lines of several files to a new file with speicific order?

I have a task to do with awk. I am doing sequence analysis for some genes.
I have several files with sequences in order. I would like to extract first sequence of each file into new file and like till the last sequence. I know only how to do with first or any specific line with awk.
awk 'FNR == 2 {print; nextfile}' *.txt > newfile
Here I have input like this
File 1
Saureus081.1
ATCGGCCCTTAA
Saureus081.2
ATGCCTTAAGCTATA
Saureus081.3
ATCCTAAAGGTAAGG
File 2
SaureusRF1.1
ATCGGCCCTTAC
SauruesRF1.2
ATGCCTTAAGCTAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
File 3
SaureusN305.1
ATCGGCCCTTACT
SauruesN305.2
ATGCCTTAAGCTAGA
SaureusN305.3
ATCCTAAAGGTAATG
And similar files 12 are there
File 4
.
.
.
.File 12
Required Output
Newfile
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
I guess this task can be done easily with awk but not getting any idea how to do for multiple lines
Based on the modified question, the answer shall be done with some changes.
$ awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0} END{for(i in a)print a[i]}' file1 file2 file3
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SauruesRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
Brief explanation,
Set '.' as the delimeter
For every odd record, distinguish k=$2 as the key of array a
Invoke getline to set $0 of next record as the value corresponds to the key k
Print the whole array for the last step
If your data is very large, I would suggest creating temporary files:
awk 'FNR%2==1 { filename = $1 }
{ print $0 >> filename }' file1 ... filen
Afterwards, you can cat them together:
cat Seq1 ... Seqn > result
This has the additional advantage that it will work if not all sequences are present in all files.
paste + awk solution:
paste File1 File2 | awk '{ p=$2;$2="" }NR%2{ k=p; print }!(NR%2){ v=p; print $1 RS k RS v }'
paste File1 File2 - merge corresponding lines of files
p=$2;$2="" - capture the value of the 2nd field which is the respective key/value from File2
The output:
Seq1
ATCGGCCCTTAA
Seq1
ATCGGCCCTTAC
Seq2
ATGCCTTAAGCTATA
Seq2
ATGCCTTAAGCTAGG
Seq3
ATCCTAAAGGTAAGG
Seq3
ATCCTAAAGGTAAGC
Additional approach for multiple files:
paste Files[0-9]* | awk 'NR%2{ k=$1; n=NF; print k }
!(NR%2){ print $1; for(i=2;i<=n;i++) print k RS $i }'

output the record by comparing string in other file using awk script

I want to output the record having matching string in another file using awk script
file1 code
849002|48|1208004|1
849007|28|1208004|1
855003|48|1208004|1
855004|28|1208004|1
855006|28|1208004|1
file2 code :
00990029000000004804470425|ST1400029|0.550|Recurring|1248073|ST1400029
00990029000000008410517183|IM1450029|1.000|Recurring|855003|ST1400029
009900290000000000007800612988|IM3350029|1.000|Recurring|1248063|ST1400029
Notice that 855003 occurs in the middle row of each sample? That's the match I'm looking for, and the output should be:
00990029000000008410517183|IM1450029|1.000|Recurring|855003|ST1400029
Because I want to search $1 of file1 in $5 in file2, if match found then output the line
I tried this but its returning zero record
awk 'NR==FNR{a[$1]=$1;next}a[$5]{print $0}' file2 file1 > outfile
Your help will resolve my issue, I have to search long list of data
Don't forget to set the delimiter using the -F flag:
awk -F "|" 'FNR==NR { a[$1]; next } $5 in a' file1 file2
Results:
00990029000000008410517183|IM1450029|1.000|Recurring|855003|ST1400029
try this (didn't test)
awk 'NR==FNR{a[$1];next}$5 in a' file1 file2