Find matching IDs in two big files - awk

I have 2 big files.
file1 has 160 million lines with this format: id:email
file2 has 45 million lines with this format: id:hash
The problem is to find all matching ids and save those pairs to a third file, in the format: email:hash
I tried something like:
awk -F':' 'NR==FNR{a[$1]=$2;next} {print a[$1]":"$2}' test1.in test2.in > res.in
But it's not working :(
Example file1:
9305718:test00#yahoo.com
59287478:login#hotmail.com
file2:
21367509:e90100b1b668142ad33e58c17a614696ec04474c
9305718:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e
Desired result:
test00#yahoo.com:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e

With GNU join and GNU bash:
join -t : -j 1 <(sort -t : -k1,1 file1) <(sort -t : -k1,1 file2) -o 1.2,2.2
Update:
join -t: <(sort file1) <(sort file2) -o 1.2,2.2
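For files of this size it may also be worth presorting once to temporary files and forcing the C locale, so that sort and join agree on collation (a sketch; file1.sorted, file2.sorted and res.out are assumed names):
LC_ALL=C sort -t: -k1,1 file1 > file1.sorted
LC_ALL=C sort -t: -k1,1 file2 > file2.sorted
LC_ALL=C join -t: -o 1.2,2.2 file1.sorted file2.sorted > res.out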

In AWK (not considering the amount of resources you have available):
$ awk -F':' 'NR==FNR{a[$1]=$2;next} a[$1] {print a[$1]":"$2}' test1.in test2.in
test00#yahoo.com:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e
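Since file2 is the smaller of the two (45 million lines versus 160 million), a variant that loads it into the array instead may fit in memory more comfortably; this sketch also uses $1 in a rather than a[$1] as the guard, so ids that map to an empty value still match and no empty array entries are created for misses:
awk -F':' 'NR==FNR{a[$1]=$2; next} $1 in a{print $2 ":" a[$1]}' test2.in test1.in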

Related

Awk: How do I output data from two files

So I'm trying to match file1, which contains emails, to file2, which contains email:address pairs. How do I go about doing that?
I tried awk 'FNR==NR{a[$1]=$0; next}{print a[$1] $0}' but I don't know what I'm doing wrong.
file1:
email#email.email
email#test.test
test#email.email
file2:
email#email.email:addressotest
email#test.club:clubbingson
test#email.email:addresso2
output:
test#email.email:addresso2
email#email.email:addressotest
The following awk may help:
awk 'FNR==NR{a[$0];next} ($1 in a)' FILE_1 FS=":" FILE_2
The FS=":" between the two file names takes effect only when awk reaches FILE_2, so FILE_1 is read with the default whitespace separator while FILE_2 is split on :. The bare ($1 in a) pattern prints each FILE_2 line whose email appears in FILE_1.
join with presorting input files
$ join -t: <(sort file1) <(sort file2)
email#email.email:addressotest
test#email.email:addresso2
Hey, why go for an awk solution when you can simply use the following join command:
join -t':' file1 file2
join, as its name indicates, joins files on a common field; you choose the field separator and, if needed, the join field and the output columns to display (not necessary here).
Tested:
$ more file{1,2}
::::::::::::::
file1
::::::::::::::
email#email.email
email#test.test
test#email.email
::::::::::::::
file2
::::::::::::::
email#email.email:addressotest
email#test.club:clubbingson
test#email.email:addresso2
$ join -t':' file1 file2
email#email.email:addressotest
test#email.email:addresso2
If you need to sort the output as well, change the command to:
join -t':' file1 file2 | sort -t":" -k1,1
or
join -t':' file1 file2 | sort -t":" -k2,2
depending on which column you want to sort on (-k1,1 limits the sort key to the first field; a bare -k1 would sort from field 1 to the end of the line). Optionally add the -r option to sort in reverse order:
join -t':' file1 file2 | sort -t":" -k1,1 -r
or
join -t':' file1 file2 | sort -t":" -k2,2 -r

How to grep the outputs of awk, line by line?

Let's say I have the following text file:
$ cat file1.txt outputs
MarkerName Allele1 Allele2 Freq1 FreqSE P-value Chr Pos
rs2326918 a g 0.8510 0.0001 0.5255 6 130881784
rs2439906 c g 0.0316 0.0039 0.8997 10 6870306
rs10760160 a c 0.5289 0.0191 0.8107 9 123043147
rs977590 a g 0.9354 0.0023 0.8757 7 34415290
rs17278013 t g 0.7498 0.0067 0.3595 14 24783304
rs7852050 a g 0.8814 0.0006 0.7671 9 9151167
rs7323548 a g 0.0432 0.0032 0.4555 13 112320879
rs12364336 a g 0.8720 0.0015 0.4542 11 99515186
rs12562373 a g 0.7548 0.0020 0.6151 1 164634379
Here is an awk command which prints MarkerName if Pos >= 11000000
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt
This command outputs the following:
MarkerName
rs2326918
rs10760160
rs977590
rs17278013
rs7323548
rs12364336
rs12562373
Question: I would like to feed this into a grep statement to parse another text file, textfile2.txt. Somehow, one pipes the output from the previous awk command into grep AWKOUTPUT textfile2.txt
I would like each row of the awk command above to be grepped against textfile2.txt, i.e.
grep "rs2326918" textfile2.txt
## and then
grep "rs10760160" textfile2.txt
### and then
...
Naturally, I would save all resulting rows from textfile2.txt into a final file, i.e.
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt | grep PIPE_OUTPUT_BY_ROW textfile2.txt > final.txt
How does one grep from a pipe line by line?
EDIT: To clarify, the one constraint I have is that file1.txt is actually the output of a previous pipe. (I'm trying to simplify the question somewhat.) How would that change the answer?
awk + grep solution:
grep -f <(awk '$8 >= 11000000{ print $1 }' file1.txt) textfile2.txt > final.txt
-f file - obtain patterns from file, one per line
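Given the edit (file1.txt is itself the output of an earlier pipe), GNU grep can read the pattern list from standard input via -f -; adding -Fw treats each ID as a fixed string matched as a whole word, so e.g. rs977590 cannot match inside a longer ID, and FNR>1 keeps the MarkerName header out of the pattern list. Here previous_command is a placeholder for whatever pipeline produced the file1.txt content:
previous_command | awk 'FNR>1 && $8 >= 11000000{ print $1 }' | grep -Fwf - textfile2.txt > final.txt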
You can use bash to do this:
bash-3.1$ echo "rs2326918" > filename2.txt
bash-3.1$ (for i in `awk '{ if($8 >= 11000000) { print $1 }}' file1.txt |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
bash-3.1$ cat final.txt
rs2326918
Alternatively,
bash-3.1$ cat file1.txt | (for i in `awk '{ if($8 >= 11000000) { print $1 }}' |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
The switch grep -v tells grep to reverse its usual activity and print all lines that do not match the pattern. This switch "inVerts" the match.
awk alone can do this for you:
$ awk 'NR==FNR {if (FNR>1 && $8 >= 11000000) a[$1]++;next} \
{ for(i in a){if($0~i) print}}' file1.txt textfile2.txt > final.txt
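If the marker name is also the first field of textfile2.txt (an assumption; the question doesn't say), a plain hash lookup avoids regex pitfalls such as rs232 matching inside rs2326918, and is faster:
$ awk 'NR==FNR {if (FNR>1 && $8 >= 11000000) a[$1]; next} $1 in a' file1.txt textfile2.txt > final.txt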

awk, how to pass in a list of files based on a condition?

I was wondering if there is any way to pass a file list to awk. The list has thousands of files, and I am using grep -l to find the subset of files I am interested in passing to awk.
E.g.,
$ grep -l id file-*.csv
file-1.csv
file-2.csv
$ cat file-1.csv
id,col_1,col_2
1,abc,100
2,def,200
$ cat file-2.csv
id,col_1,col_2
3,xyz,1000
4,hij,2000
If I do
$ awk -F, '{print $2,$3}' file-1.csv file-2.csv | grep -v col
abc 100
def 200
xyz 1000
hij 2000
it works how I would want, but there are too many files to list manually like this:
file-1.csv file-2.csv
I was wondering if there is a way to pass in the result of the...
grep -l id file-*.csv
Edit:
grep -l id
is the condition. Each file has a header but only some have 'id' in the header so I can't use the file-*.csv wildcard in the awk statement.
If I did an ls on file-*.csv I would end up with more than file-1.csv and file-2.csv.
e.g.,
$ cat file-3.csv
name,col,num
a1,hij,3000
b2,lmn,50000
$ ls -l file-*.csv
-rw-r--r-- 1 tp staff 35 20 Sep 18:50 file-1.csv
-rw-r--r-- 1 tp staff 37 20 Sep 18:51 file-2.csv
-rw-r--r-- 1 tp staff 38 20 Sep 18:52 file-3.csv
$ grep -l id file-*.csv
file-1.csv
file-2.csv
Based on the output you show under "If I do", it sounds like this might be what you're trying to do:
awk -F, 'FNR>1{print $2,$3}' file-*.csv
but your question isn't clear so it's a guess.
Given your updated question, all you need with GNU awk (for nextfile) is:
awk -F, 'FNR==1{if ($1 != "id") nextfile} {print $2,$3}' file-*.csv
and with any awk (but less efficiently than with GNU awk):
awk -F, 'FNR==1{f=($1=="id"?1:0); next} f{print $2,$3}' file-*.csv
awk -F, 'FNR > 1{print $2,$3}' $(grep -l id file-*.csv)
(This will not work if any of your filenames contain whitespace.)
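With GNU grep and GNU xargs, a NUL-separated variant handles such filenames safely:
grep -lZ id file-*.csv | xargs -r0 awk -F, 'FNR>1{print $2,$3}'
If the file list is very long, xargs may run awk more than once; the per-file FNR>1 test makes that harmless.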
To find the files with an id field and merge/output their contents, excluding the header lines:
grep trick:
grep --no-group-separator -hA 1000000 'id' file-*.csv | grep -v 'id'
-h - suppress prefixing of file names on the output
-A num - print num lines of trailing context after each matching line. 1000000 is used as a line count that, presumably, no file will exceed (adjust it if your files really do have more than 1000000 lines).
The output (for 2 sample files from the question):
1,abc,100
2,def,200
3,xyz,1000
4,hij,2000

How to merge these commands, awk then cut

I am using awk in Debian.
input
11.22.33.44#55878:
11.22.33.43#55879:
...
...
(smtp:55.66.77.88)
(smtp:55.66.77.89)
...
...
cpe-33-22-11-99.buffalo.res.rr.com[99.11.22.33]
cpe-34-22-11-99.buffalo.res.rr.com[99.11.22.34]
...
Part of my sh script (running on Debian):
awk '/#/ {print > "file1";next} \
/smtp/ {print > "file2";next} \
{print > "file7"}' input
#
if [ -s file1 ] ; then
#IP type => 11.22.33.44#55878:
cut -d'#' -f1 file1 >> output
rm -f file1
fi
#
if [ -s file2 ] ; then
#IP type => (smtp:55.66.77.88)
cut -d':' -f2 file2 | cut -d')' -f1 >> output
rm -f file2
fi
#
if [ -s file7 ] ; then
#IP type => cpe-33-22-11-99.buffalo.res.rr.com[99.11.22.33]
cut -d'[' -f2 file7 | cut -d']' -f1 >> output
rm -f file7
fi
then output
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34
Is it possible to merge these steps using only awk, something like
awk '/#/ {print | cut -d'#' -f1 > "file1";next} \
/smtp/ {print | cut -d':' -f2 | cut -d')' -f1 > "file2";next} \
{print | cut -d'[' -f2 file7 | cut -d']' > "file7"}' input
I am a newbie and have no idea how to do this.
After searching existing questions, still no help.
Any hint?
Thanks.
Best regards.
$ awk -F'[][()#]|smtp:' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34
To save this in the file output:
awk -F'[][()#]|smtp:' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input >output
How it works
-F'[][()#]|smtp:'
This sets the field separator to (a) any of the characters ][()# or (b) the string smtp:.
/#/{print $1;next}
If the line contains #, then print the first field and skip to the next line.
/smtp/{print $3;next}
If the line contains smtp, then print the third field and skip to the next line.
/\[/{print $2}
If the line contains [, then print the second field.
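A quick way to see how a sample line splits under this separator (just a check, not part of the solution):
$ echo '(smtp:55.66.77.88)' | awk -F'[][()#]|smtp:' '{for(i=1;i<=NF;i++) print i, "[" $i "]"}'
1 []
2 []
3 [55.66.77.88]
4 []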
Variation
There is more than one way to solve this problem. For example, using a slightly different field separator, we can still get the desired output:
$ awk -F'[][()#:]' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34
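If the awk-only constraint can be relaxed, matching the dotted-quad pattern directly gives the same result here, assuming each input line contains exactly one IPv4 address and no other dotted number sequences:
$ grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' input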

Awk merging of two files on id

I would like to match the IDs of the first file to the IDs of the second file, so I get, for example, Thijs Al,NED19800616,39. I know this should be possible with AWK, but I'm not really good at it.
file1 (few entries)
NED19800616,Thijs Al
BEL19951212,Nicolas Cleppe
BEL19950419,Ben Boes
FRA19900221,Arnaud Jouffroy
...
file2 (many entries)
38,FRA19920611
39,NED19800616
40,BEL19931210
41,NED19751211
...
Don't use awk, use join. First make sure the input files are sorted:
sort -t, -k1,1 file1 > file1.sorted
sort -t, -k2,2 file2 > file2.sorted
join -t, -1 1 -2 2 -o 1.2,0,2.1 file[12].sorted
The -o list prints the name, the join field (the ID), and the number, giving the Thijs Al,NED19800616,39 format requested.
With awk you can do
$ awk -F, 'NR==FNR{a[$2]=$1;next}{print $2, $1, a[$1] }' OFS=, file2 file1
Thijs Al,NED19800616,39
Nicolas Cleppe,BEL19951212,
Ben Boes,BEL19950419,
Arnaud Jouffroy,FRA19900221,
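If you only want the lines that actually matched (note the trailing commas above for unmatched IDs), test membership first; a small variant of the same idea:
$ awk -F, 'NR==FNR{a[$2]=$1;next} $1 in a{print $2, $1, a[$1]}' OFS=, file2 file1
Thijs Al,NED19800616,39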