I have two files, File1 and File2. File1 has 100 names and File2 has 1000 names. I want awk to read both files and print only the names that are in File2 but not in File1.
Your help and time are much appreciated.
Example contents of File1 and File2:
File1:
karn
steve
vaithee
File2:
vaithee
Karn
steve
niraj
mana
henry
So the output should be:
niraj
mana
henry
awk 'FNR==NR{a[tolower($1)];next}!(tolower($1) in a)' file1 file2
Input
$ cat file1
karn
steve
vaithee
$ cat file2
vaithee
Karn
steve
niraj
mana
henry
Output
$ awk 'FNR==NR{a[tolower($1)];next}!(tolower($1) in a)' file1 file2
niraj
mana
henry
Explanation
FNR==NR
    True when the number of records read so far in the current file equals the number of records read so far across all files, which can only happen while the first file is being read.
a[tolower($1)]
    Creates an element in array "a" indexed by the first field, in lowercase, of the current record of file1.
next
    Moves on to the next record, skipping the processing intended for records from the second file.
!(tolower($1) in a)
    True when the lowercase first field ($1) of the current file2 record is not an index of array a. The logical NOT operator (!) reverses the result of the "in" test, and since the pattern has no action, awk performs its default operation: print $0.
file1 file2
    Read file1 first, then file2.
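The same one-liner can be spread over several lines with a comment per rule. This is only a sketch: the temporary directory and the array name seen are my own choices, and the sample names are the ones from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' karn steve vaithee > "$tmp/file1"
printf '%s\n' vaithee Karn steve niraj mana henry > "$tmp/file2"

result=$(awk '
    # First file: remember each lower-cased name as an array key.
    FNR == NR { seen[tolower($1)]; next }
    # Second file: print names whose lower-cased form was never seen.
    !(tolower($1) in seen) { print }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with this idiom: if file1 is empty, FNR==NR also holds for file2, so nothing is printed.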
Update: I figured out the reason for the extraneous newline. I created file1 and file2 on a Windows machine. Windows adds <cr><newline> to the end of each line. So, for example, the first record in file1 is not this:
Bill <tab> 25 <newline>
Instead, it is this:
Bill <tab> 25 <cr><newline>
So when I set a[Bill] to $2 I am actually setting it to $2<cr>.
I used a hex editor and removed all of the <cr> symbols in file1 and file2. Now the AWK program works as desired.
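A hex editor is not strictly needed to see or remove the stray <cr>. The sketch below builds a small Windows-style file (the contents are made up to mirror file1), measures the length of the second field with and without the carriage return, and strips it with tr.

```shell
tmp=$(mktemp -d)
# Simulate a file saved on Windows: every line ends in <cr><newline>.
printf 'Bill\t25\r\nJohn\t24\r\n' > "$tmp/file1"

# Without cleanup, the carriage return is counted as part of the last field.
raw_len=$(awk -F'\t' 'NR == 1 { print length($2) }' "$tmp/file1")

# Deleting every carriage return with tr gives the expected field length.
clean_len=$(tr -d '\r' < "$tmp/file1" | awk -F'\t' 'NR == 1 { print length($2) }')

echo "raw: $raw_len, cleaned: $clean_len"
rm -rf "$tmp"
```

Here the raw length is 3 ("25" plus the <cr>) and the cleaned length is 2.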
I have seen the SO posts on using AWK to do a natural join of two files. I took one of the solutions and am trying to get it to work. Alas, I have been unsuccessful. I am hoping you can tell me what I am doing wrong.
Note: I appreciate other solutions, but what I really want is to understand why my AWK program doesn't work (i.e., why/how an extraneous newline is being introduced).
I want to do a join of these two files:
file1 (name, tab, age):
Bill 25
John 24
Mary 21
file2 (name, tab, marital-status)
Bill divorced
Glenn married
John married
Mary single
When joined, I expect to see this (name, tab, age, tab, marital-status):
Bill 25 divorced
John 24 married
Mary 21 single
Notice that file2 has a person named Glenn, but file1 doesn't. No record in file1 joins to it.
My AWK program almost produces that result. But, for reasons I don't understand, the marital-status value is on the next line:
Bill 25
divorced
John 24
married
Mary 21
single
Here is my AWK program:
awk 'BEGIN { OFS = "\t" }
NR == FNR { a[$1] = ($1 in a? a[$1] OFS : "")$2; next }
$1 in a { $0 = $0 OFS a[$1]; delete a[$1]; print }' file2 file1 > joined_file1_file2
You may try this awk solution:
awk 'BEGIN {FS=OFS="\t"} {sub(/\r$/, "")}
FNR == NR {m[$1]=$2; next} {print $0, m[$1]}' file2 file1
Bill 25 divorced
John 24 married
Mary 21 single
Here:
Using sub(/\r$/, "") to remove any DOS line ending
If $1 doesn't exist in mapping m then m[$1] will be an empty string so we can simplify awk processing
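If file1 could contain names that are missing from file2, a variant that prints matched rows only may be closer to a natural join. A minimal sketch, assuming tab-separated input with DOS line endings; the extra name Zoe and the array name status are made up for illustration.

```shell
tmp=$(mktemp -d)
printf 'Bill\t25\r\nJohn\t24\r\nMary\t21\r\nZoe\t30\r\n' > "$tmp/file1"
printf 'Bill\tdivorced\r\nGlenn\tmarried\r\nJohn\tmarried\r\nMary\tsingle\r\n' > "$tmp/file2"

joined=$(awk '
    BEGIN { FS = OFS = "\t" }
    { sub(/\r$/, "") }                      # drop a trailing carriage return, if any
    FNR == NR { status[$1] = $2; next }     # first file (file2): name -> marital status
    $1 in status { print $0, status[$1] }   # second file (file1): matched names only
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Zoe has no record in file2 and Glenn has none in file1, so neither appears in the result.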
My input file (file1) looks like this:
part position col3 col4 info
part1 34 1 1 NAME=Mark;AGE=23;HEIGHT=189
part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167
part2 19 1 1 NAME=Emily;AGE=16;HEIGHT=164
part3 23 1 1 NAME=Owen;AGE=55;HEIGHT=181
part3 99 1 1 NAME=Rachel;AGE=76;HEIGHT=162
I need to retrieve the text after "NAME=" in the info column, but only if the values in the first two columns match another file (file2).
part position
part2 55
part3 23
Then only the 2nd and 4th rows will be considered, and the text after "NAME=" in those rows is put into the output file:
Alice
Owen
I don't need to preserve the order of the original rows, so the following output is equally valid:
Owen
Alice
My (not very good) attempt:
awk -F, 'FNR==NR {a[$1]=$5; next}; $1 in a {print a[$1]}' file1 file2
Something like,
awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}'
Example
$ awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}' file1 file2
Alice
Owen
What does it do?
-F"[ =;]" -F sets the field separator. Here it is a single space, = or ;, which makes it easy to get the name from the first file without calling split.
found[$1" "$2]=$6 This block runs only for file1. It saves the name ($6) in the associative array found, indexed by part and position.
$1" "$2 in found{print found[$1" "$2]} This runs for the second file. It checks whether the part and position are in the array and, if so, prints the name stored there.
With GNU awk, the following does the same (note that the files are given in the opposite order):
awk 'NR>1 && NR==FNR{found[$1","$2];next}\
$1","$2 in found{print gensub(/^NAME=([^;]*).*/,"\\1","1",$NF);}' file2 file1
Output
Alice
Owen
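If both files really start with the header lines shown in the question, it may be worth skipping them explicitly. The sketch below does that without gensub, so it should not need GNU awk. It assumes whitespace-separated columns and that NAME= is the first key in the info column; the array name wanted is my own.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
part position col3 col4 info
part1 34 1 1 NAME=Mark;AGE=23;HEIGHT=189
part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167
part2 19 1 1 NAME=Emily;AGE=16;HEIGHT=164
part3 23 1 1 NAME=Owen;AGE=55;HEIGHT=181
part3 99 1 1 NAME=Rachel;AGE=76;HEIGHT=162
EOF
printf '%s\n' 'part position' 'part2 55' 'part3 23' > "$tmp/file2"

names=$(awk '
    FNR == 1 { next }                    # skip the header line of each file
    FNR == NR { wanted[$1, $2]; next }   # file2: remember part/position pairs
    ($1, $2) in wanted {                 # file1: keep rows whose pair was listed
        name = $NF
        sub(/^NAME=/, "", name)          # drop the leading NAME=
        sub(/;.*/, "", name)             # drop everything from the first ;
        print name
    }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$names"
rm -rf "$tmp"
```

Reading file2 first keeps only the short list of pairs in memory, and the names come out in file1 order.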
I have file 1 as
blah blah cool
fold bold match
ed ted bled
file2 as
blah ha cool
fold bold match
ed ted bled
I want to output rows only if the second field does not match like so
blah ha cool
Instead however, I'm getting this:
blah ha cool
fold bold match
ed ted bled
Here's my attempt:
$ awk -F'\t' 'NR==FNR{a[$1]=$0;next} $1 in a{split(a[$1],r); if (r[2] != 2) print $0 FS "false"; else next;}' file1 file2
My guess is that I'm not incrementing correctly through the associative array.
As I understand it, you want to print rows from file2 for which the second column differs from the second column in the corresponding row in file1. If that is the case:
$ awk 'FNR==NR{a[NR]=$2; next} $2!=a[FNR]' file1 file2
blah ha cool
FNR==NR{a[NR]=$2; next} saves each value of the second field of file1 in array a under the key of its row number. $2!=a[FNR] prints any row from file2 for which the second field differs from the second field of file1 for the same row.
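If the rows should instead be paired by their first field, as the attempt in the question seems to intend, the array can be keyed on $1 rather than on the line number. A sketch under that assumption; the array name second is my own, and the file2 rows are reordered on purpose to show that line order no longer matters.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'blah blah cool' 'fold bold match' 'ed ted bled' > "$tmp/file1"
# Same rows as file2 in the question, deliberately reordered.
printf '%s\n' 'ed ted bled' 'blah ha cool' 'fold bold match' > "$tmp/file2"

diff_rows=$(awk '
    FNR == NR { second[$1] = $2; next }     # file1: first field -> second field
    ($1 in second) && $2 != second[$1]      # file2: same key, different second field
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$diff_rows"
rm -rf "$tmp"
```

This only makes sense if the first field is unique within file1; with duplicate keys the last one wins.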
For a line-by-line comparison, assuming tab-delimited data:
$ paste file1 file2 | awk '$2!=$5'
blah blah cool blah ha cool
To report only the file2 record:
$ paste file1 file2 | awk '$2!=$5' | cut -f4-
blah ha cool
This solution will work for very large files as well.
I have two files, File1 and File2. File1 has 6000 rows and file2 has 3000 rows. I want to match the ids and merge the files based on matches, which is simple. But, the ids in file1 and file2 only match partially. Have a look at the files. For every id (row) in file2 there must be two matching ids (rows) in file 1. Also, not all the ids in file2 are present in file1. I had tried awk but didn't get the desired output.
File1
1_A01_A
1_A01_B
2_B03_A
2_B03_B
1_A02_A
1_A02_B
2_B04_A
2_B04_B
1_A03_A
1_A03_B
2_B05_A
2_B05_B
1_A04_A
1_A04_B
2_B06_A
2_B06_B
1_A06_A
1_A06_B
2_B07_A
2_B07_B
1_A07_A
1_A07_B
2_B08_A
2_B08_B
9_F10_A
9_F10_B
12_D08_A
12_D08_B
5505744243493_F09.CEL_A_A
5505744243493_F09.CEL_B_B
File2
1_A01 14
2_B03 13
1_A02 4
2_B04 14
1_A03 11
2_B05 8
1_A04 18
2_B06 15
1_A06 10
2_B07 4
1_A07 8
2_B08 22
1_A08 5
2_B09 15
1_A09 20
2_B10 17
awk -F" " 'FNR==NR{a[$1]=$2;next}{for(i in a){if($1~i){print $1"|"a[i];next}}}' file1.txt file2.txt
FNR==NR is true while awk reads file1.txt and false while it reads file2.txt, so the block starting with {for(i in a){ runs only for file2.txt. $1~i tests whether the id in $1 matches the stored key i, treated as a regular expression (a "like" condition), and the line is printed with the value of the first key that matches.
Note that I used the opposite file names by mistake: my file1.txt holds the contents of File2 from the problem statement, and vice versa.
Output
1_A01_A|14
1_A01_B|14
2_B03_A|13
2_B03_B|13
1_A02_A|4
1_A02_B|4
2_B04_A|14
2_B04_B|14
1_A03_A|11
1_A03_B|11
2_B05_A|8
2_B05_B|8
1_A04_A|18
1_A04_B|18
2_B06_A|15
2_B06_B|15
1_A06_A|10
1_A06_B|10
2_B07_A|4
2_B07_B|4
1_A07_A|8
1_A07_B|8
2_B08_A|22
2_B08_B|22
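Because $1~i is an unanchored regular-expression match, a key such as 1_A01 would also match an id like 11_A01_A, and the loop over all keys costs time for every row. A possible alternative, assuming every long id is the short id plus one trailing _X part, is to strip that part and do an exact array lookup. The file names ids and values and the id 11_A01_A are made up for this sketch.

```shell
tmp=$(mktemp -d)
# 11_A01_A is a made-up id that a substring match on 1_A01 would wrongly accept.
printf '%s\n' 1_A01_A 1_A01_B 2_B03_A 2_B03_B 11_A01_A 9_F10_A > "$tmp/ids"
printf '%s\n' '1_A01 14' '2_B03 13' '1_A02 4' > "$tmp/values"

merged=$(awk '
    FNR == NR { value[$1] = $2; next }   # first file: short id -> value
    {
        key = $1
        sub(/_[^_]*$/, "", key)          # strip the last _X part: 1_A01_A -> 1_A01
        if (key in value) print $1 "|" value[key]
    }
' "$tmp/values" "$tmp/ids")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Ids with no counterpart in the values file (here 11_A01_A and 9_F10_A) are simply skipped.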
This might work for you (GNU sed):
sed -r 's|^(\S+)(\s+\S+)$|s/^\1.*/\&\2/p|' file2 | sed -nf - file1
This creates a sed script from file2 and then runs it against the data in file1.
N.B. The order of either file is unimportant and file1 is processed only once.
I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition {action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet:
awk '{
if (NR == FNR) {arr[NR]=$1}
if (NR != FNR && arr[FNR]) {print $0}
}' file1 file2
When awk finds a condition on its own (without an action), like NR!=FNR && arr[FNR], it implicitly prints the record to STDOUT if the expression is true (non-zero or a non-empty string).
Explanations
NR is the number of the current record, counted from the start of input
FNR is the number of the current record within the current file (so NR differs from FNR in the second file)
arr[NR]=$1 : stores the first column in array arr under the index NR
if NR!=FNR we are in the second file, and if the array value for that line number is 1 (non-zero), we print the line
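To see the two counters side by side, it can help to print them for every record. A tiny sketch with two-line sample files (shortened from the question):

```shell
tmp=$(mktemp -d)
printf '%s\n' '0 xxx' '1 yyy' > "$tmp/file1"
printf '%s\n' 'A bbb' 'B ccc' > "$tmp/file2"

# Print both counters for every record; FNR restarts at 1 for file2, NR keeps counting.
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' file1 file2)

printf '%s\n' "$counters"
rm -rf "$tmp"
```

The last two lines of the output are "file2 3 1" and "file2 4 2", which is why NR!=FNR identifies the second file.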
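Not as clean as an awk solution, and sed '/0/d' drops any pasted line that contains a 0 anywhere, so it only suits data like the sample: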
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned millions of lines. To make just a single pass through both files, I'd resort to Python. Something like this perhaps (Python 2.7):
from itertools import izip  # lazy; the built-in zip in Python 2 builds the whole list in memory
with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        if not l1.startswith('0'):
            print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1
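A slightly more defensive version of the getline approach might check the return value of getline, so the loop stops cleanly if file2 turns out to be shorter than file1, and compare field 1 with 1 explicitly. This is a sketch; the variable names other and line are my own. Only one line of each file is held in memory at a time.

```shell
tmp=$(mktemp -d)
printf '%s\n' '0 xxx' '1 yyy' '1 zzz' '0 aaa' > "$tmp/file1"
printf '%s\n' 'A bbb' 'B ccc' 'C ddd' 'D eee' > "$tmp/file2"

picked=$(awk -v other="$tmp/file2" '
    {
        # Read the matching line of the other file; stop if it has run out.
        if ((getline line < other) <= 0) exit
        if ($1 == 1) print line          # keep it only when field 1 of file1 is 1
    }
' "$tmp/file1")

printf '%s\n' "$picked"
rm -rf "$tmp"
```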