Compare two files by two column matching - awk

I have two files with columns. I need to print the lines of the second file if their first and second columns both match a line in the first file. Example:
file1
Name1 123 blabla
Name1 456 bla
Name3 777 s
file2
Name1 123 something more
Name2 456 some words
Name4 111 no
Desired output:
Name1 123 something more
I have written this code, but it only works for one column (the second in this case):
awk 'BEGIN{FS=OFS="\t"} NR == FNR {f[$2]; next;} $2 in f{print $0;}' file1 file2
I found something related here: comparing two columns in two files, but I can't work out the correct way. I tried this, but it is not working:
awk 'BEGIN{FS=OFS="\t"} NR == FNR {f[$1 FS $2]; next;} if($1 in f && $2 in f){print $0;}'
Thanks in advance,

You can use:
awk 'NR == FNR { a[$1, $2]++; next } a[$1, $2]' file1 file2
Output:
Name1 123 something more
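a[$1, $2] is different from a[$1 "," $2]. With the comma form, awk joins the subscripts with the value of SUBSEP (by default the non-printing character "\034"), so the combined key is very unlikely to collide with a literal string that appears in the data.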
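To see what the comma form does, here is a small sketch. The scratch directory and the variable names `out` and `sep_check` are made up for illustration, and the sample data is space-separated like the example in the question. It also uses `($1, $2) in a` for the lookup, which tests membership without creating an empty array element.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf 'Name1 123 blabla\nName1 456 bla\nName3 777 s\n' > "$tmp/file1"
printf 'Name1 123 something more\nName2 456 some words\nName4 111 no\n' > "$tmp/file2"

# a[$1, $2] builds the key as $1 SUBSEP $2; SUBSEP defaults to "\034".
# ($1, $2) in a tests membership without creating a new array element.
out=$(awk 'NR == FNR { a[$1, $2]; next } ($1, $2) in a' "$tmp/file1" "$tmp/file2")
echo "$out"

# The same key written out by hand, to show what the comma stands for.
sep_check=$(awk 'BEGIN { a["x", "y"]; if (("x" SUBSEP "y") in a) print "same key" }')
echo "$sep_check"

rm -rf "$tmp"
```

If the real files are tab-separated, as the `FS=OFS="\t"` in the question suggests, adding `-F'\t'` should be the only change needed.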

Related

Retrieve all rows from 2 columns matching from 2 different files

I need to retrieve all rows from a file starting from some column matching from another file.
My first file is:
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,AEV67733.1
2BTO,NP_006073.2,XP_037953971.1
2BTO,XP_037953971.1,XP_037953971.1
The second one is:
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,WP_002465021.1,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
I want to retrieve all rows from the second file, where its file2_col1=file1_col3 and file2_col2=file1_col1
I tried like this but it doesn't print everything
awk -F"," 'FILENAME=="file1"{A[$3$2]=$3$2}
FILENAME=="file2"{if(A[$1$2]){print $0}}' file1 file2 > test
You may use this awk solution, which reads file1 first and then filters file2:
awk -F, 'FNR == NR {seen[$3,$1]; next} FNR == 1 || ($1,$2) in seen' file1 file2
col1,col2,col3,col4,col5
BAA13425.1,2BTO,32.056,287,185
BAA13425.1,2BTO,12.410,641,123
Where input files are:
cat file1
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,BAA13425.1
2BTO,NP_006073.2,BAA13425.1
2BTO,XP_037953971.1,BAA13425.1
cat file2
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,2BTO,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
BAA13425.1,2BTO,12.410,641,123

Count rows and columns for multiple CSV files and make new file

I have multiple large comma separated CSV files in a directory. But, as a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try the following program.
awk '
FNR==1{
  if(cols && rows){
    print file,rows,cols
  }
  rows=cols=file=""
  file=FILENAME
  sub(/\..*/,"",file)
  cols=NF
  next
}
{
  rows=(FNR-1)
}
END{
  if(cols && rows){
    print file,rows,cols
  }
}
' one.csv two.csv
Explanation: the same solution with a comment on every line.
awk '                        ##Start of the awk program.
FNR==1{                      ##If this is the first line of a file, do the following.
  if(cols && rows){          ##If cols AND rows are both set (from the previous file), do the following.
    print file,rows,cols     ##Print the file, rows and cols variables.
  }
  rows=cols=file=""          ##Reset rows, cols and file.
  file=FILENAME              ##Store the current file name in file.
  sub(/\..*/,"",file)        ##Remove everything from the first dot to the end of the value in file.
  cols=NF                    ##Store the number of fields of the first line in cols.
  next                       ##Skip all further statements for this line.
}
{
  rows=(FNR-1)               ##Store FNR-1 (line number within the file, minus the header) in rows.
}
END{                         ##Start of the END block of this program.
  if(cols && rows){          ##If cols AND rows are both set (from the last file), do the following.
    print file,rows,cols     ##Print the file, rows and cols variables.
  }
}
' one.csv two.csv            ##Input file names.
Using GNU awk you can do this in a single awk command:
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5
You will need to iterate over all the CSV files and print the name and dimensions of each one:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also use wc -l to count the lines and grep -o "CSV-separator" | wc -l to count the separators in a line (the number of fields is one more than the number of separators).
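The `wc`/`grep` route could look roughly like this. It is only a sketch under a few assumptions: every file has one header line, the separator is a comma, and no field contains a comma. The sample file and the variable names (`rows`, `cols`, `dims`) are made up for illustration.

```shell
# Build a small sample file in a scratch directory.
tmp=$(mktemp -d)
printf 'a,b\n1,3\n2,2\n3,1\n' > "$tmp/one.csv"

f="$tmp/one.csv"
# Data rows: total number of lines minus the header line.
rows=$(( $(wc -l < "$f") - 1 ))
# Columns: number of commas in the header line, plus one.
cols=$(( $(head -n 1 "$f" | grep -o ',' | wc -l) + 1 ))
dims="$(basename "$f" .csv) $rows $cols"
echo "$dims"

rm -rf "$tmp"
```

Wrapping the middle part in `for f in *.csv; do ... done > dims.txt` would cover a whole directory.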
I would use GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after each file has been processed. I set FS to , assuming that fields are ,-separated and that there is no , inside a field. FILENAME, FNR and NF are built-in awk variables: FNR is the number of the current row in the current file (so in ENDFILE, the number of the last row) and NF is the number of fields (again, of the last row). If your files have headers, use FNR-1; if your rows are prefixed with a row number, use NF-1.
Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
cols=$(awk '{print NF; exit}' "$fn")
rows=$(awk 'END{print NR-1}' "$fn")
printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5

How to join two files based on one column in AWK (using wildcards)

I have 2 files, and I need to compare column 2 from file 2 with column 3 from file 1.
File 1
"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"
"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"
"testserver3","testserver3.domain.net","-3.3.3.3-"
File 2
"windows","10.10.10.10","datacenter1"
"linux","2.2.2.2","datacenter2"
"aix","4.4.4.4","datacenter2"
Expected Output
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
All I have been able to find are statements that only work if the columns are identical. I need it to work if column 3 from file 1 contains the value from column 2 of file 2.
I've tried this, but again, it only works if the columns are identical (which I don't want):
awk 'BEGIN {FS = OFS = ","};NR == FNR{f[$3] = $1","$2;next};$2 in f {print f[$2],$0}' file1.csv file2.csv
hacky!
$ awk -F'","' 'NR==FNR {n=split($NF,x,"-"); for(i=2;i<n;i++) a[x[i]]=$1 FS $2; next}
$2 in a {print a[$2] "\"," $0}' file1 file2
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
This assumes the lookup is unique, i.e. no address appears in more than one file1 record.
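The loop bounds may look odd at first. With `-F'","'` the last field of a file1 line starts with `-` and ends with `-"`, so `split` on `-` returns an empty first element and a lone closing quote as the last one. Looping from 2 to n-1 therefore visits only the addresses. A small sketch on one line from file1 (the variable name `parts` is just for illustration):

```shell
parts=$(printf '%s\n' '"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"' |
  awk -F'","' '{
    n = split($NF, x, "-")      # x[1] is empty, x[n] is the closing quote
    for (i = 2; i < n; i++)     # so only x[2] .. x[n-1] hold addresses
      printf "%s%s", (i > 2 ? " " : ""), x[i]
    print ""
  }')
echo "$parts"
```

This should print the two addresses of testserver1 separated by a space.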

AWK Retrieve text after a certain pattern where the 1st and 2nd columns match the values in the 1st and 2nd columns in an input file

My input file (file1) looks like this:
part position col3 col4 info
part1 34 1 1 NAME=Mark;AGE=23;HEIGHT=189
part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167
part2 19 1 1 NAME=Emily;AGE=16;HEIGHT=164
part3 23 1 1 NAME=Owen;AGE=55;HEIGHT=181
part3 99 1 1 NAME=Rachel;AGE=76;HEIGHT=162
I need to retrieve the text after "NAME=" in the info column, but only if the values in the first two columns match another file (file2).
part position
part2 55
part3 23
Then only the 2nd and 4th rows will be considered and text after "NAME=" in those rows are put into the output file:
Alice
Owen
I don't need to preserve the order of the original rows, so the following output is equally valid:
Owen
Alice
My (not very good) attempt:
awk -F, 'FNR==NR {a[$1]=$5; next}; $1 in a {print a[$1]}' file1 file2
Something like,
awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}'
Example
$ awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}' file1 file2
Alice
Owen
What does it do?
-F"[ =;]" sets the field separator. Here it is a space, = or ;, which makes it easy to get the name from the first file without calling split.
found[$1" "$2]=$6 This block runs only for file1. It saves the name ($6) in the associative array found, indexed by part and position.
$1" "$2 in found{print found[$1" "$2]} This runs for the second file. It checks whether the part and position are in the array and, if so, prints the name stored there.
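To check which field the name ends up in, you can print the field count and the fields around it for one sample line. This is a quick sketch that assumes the columns are separated by exactly one space, as in the example; the variable name `fields` is made up.

```shell
fields=$(printf '%s\n' 'part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167' |
  awk -F'[ =;]' '{ print NF ": " $5 " -> " $6 }')
echo "$fields"
```

With that separator the line splits into 10 fields, `$5` is the literal NAME and `$6` is the value after it. If the real file uses tabs or runs of spaces, the separator would need adjusting (for example `-F'[ \t=;]+'`).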
With GNU awk, the following does the same:
awk 'NR>1 && NR==FNR{found[$1","$2];next}\
$1","$2 in found{print gensub(/^NAME=([^;]*).*/,"\\1","1",$NF);}' file2 file1
Output
Alice
Owen

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition{action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet:
awk '{
    if (NR == FNR) {arr[NR]=$1}
    if (NR != FNR && arr[FNR]) {print $0}
}' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current line to STDOUT if the expression is true (non-zero or a non-empty string).
Explanations
NR is the number of the current record, counted from the start of all input
FNR is the number of the current record within the current file (so NR differs from FNR while reading the second file)
arr[NR]=$1 : stores the first column in the array arr, using the current NR as the index
if NR!=FNR we are in the second file, and if the value stored for that line number is 1, we print the line
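If the NR/FNR distinction is still unclear, printing both counters for two tiny files shows it directly. This is a minimal sketch; the scratch directory and the variable name `counters` are made up for illustration.

```shell
# Two two-line sample files in a scratch directory.
tmp=$(mktemp -d)
printf '0 xxx\n1 yyy\n' > "$tmp/file1"
printf 'A bbb\nB ccc\n' > "$tmp/file2"

# Print the file name and both counters for every input line.
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' file1 file2)
echo "$counters"

rm -rf "$tmp"
```

NR keeps counting across files (1 to 4), while FNR restarts at 1 for file2, so `NR==FNR` is only true while the first file is being read.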
Not as clean as an awk solution:
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned millions of lines. To do just a single pass through the files, I'd resort to Python. Something like this perhaps (Python 2.7):
from itertools import izip  # lazy zip: in Python 2, plain zip() would build the full list in memory

with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        if not l1.startswith('0'):
            print l2.strip()
awk '{
    getline value < "file2"
    if ($1)
        print value
}' file1