Subsetting GWAS results by matching snp column from another file - awk

I have a GWAS summary estimate file with the following columns (file 1):
1 chr1_1726_G_A 0.023 0.160
1 chr1_20184_GAATA_G 0.033 0.180
1 chr1_791101_T_TGG 0.099 0.170
file 2
chr1_20184_GAATA_G
chr1_791101_T_TGG
I would like to match the column1 of file 2 with column 2 of file1 to create a file 3 such as:
1 chr1_20184_GAATA_G 0.033 0.180
1 chr1_791101_T_TGG 0.099 0.170
By using the below code, I get an empty file3:
awk 'FNR==NR{arr[$2];next} (($2) in arr)' file2 file1 > file3

With your shown samples, please try the following awk code.
awk 'FNR==NR{arr[$0];next} ($2 in arr)' file2 file1
OR
awk 'FNR==NR{arr[$1];next} ($2 in arr)' file2 file1
Explanation: file2 has only one column, so $2 is empty while it is being read and arr never receives the SNP IDs. Use $0 (first solution) or $1 (second solution) as the array key in the first block; the rest of your code already matches the records correctly.
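To see why the original attempt printed nothing, it can help to run both versions side by side on the sample data. This is only a minimal sketch: the temporary directory and the variable names broken and fixed are made up for illustration.

```shell
# Recreate the sample files in a throwaway directory (illustrative paths).
tmp=$(mktemp -d)
printf '%s\n' '1 chr1_1726_G_A 0.023 0.160' \
              '1 chr1_20184_GAATA_G 0.033 0.180' \
              '1 chr1_791101_T_TGG 0.099 0.170' > "$tmp/file1"
printf '%s\n' 'chr1_20184_GAATA_G' 'chr1_791101_T_TGG' > "$tmp/file2"

# Original attempt: $2 is empty in file2, so no file1 row can match.
broken=$(awk 'FNR==NR{arr[$2];next} ($2 in arr)' "$tmp/file2" "$tmp/file1")

# Keyed on $1 instead, the two wanted rows are printed.
fixed=$(awk 'FNR==NR{arr[$1];next} ($2 in arr)' "$tmp/file2" "$tmp/file1")

printf 'original attempt: [%s]\n' "$broken"
printf '%s\n' "$fixed"
rm -rf "$tmp"
```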

Related

Retrieve all rows from 2 columns matching from 2 different files

I need to retrieve all rows from a file starting from some column matching from another file.
My first file is:
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,AEV67733.1
2BTO,NP_006073.2,XP_037953971.1
2BTO,XP_037953971.1,XP_037953971.1
The second one is:
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,WP_002465021.1,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
I want to retrieve all rows from the second file, where its file2_col1=file1_col3 and file2_col2=file1_col1
I tried like this but it doesn't print everything
awk -F"," 'FILENAME=="file1"{A[$3$2]=$3$2}
FILENAME=="file2"{if(A[$1$2]){print $0}}' file1 file2 > test
I want to retrieve all rows from the second file, where its file2_col1=file1_col3 and file2_col2=file1_col1
You may use this 2 pass awk solution:
awk -F, 'FNR == NR {seen[$3,$1]; next} FNR == 1 || ($1,$2) in seen' file1 file2
col1,col2,col3,col4,col5
BAA13425.1,2BTO,32.056,287,185
BAA13425.1,2BTO,12.410,641,123
Where input files are:
cat file1
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,BAA13425.1
2BTO,NP_006073.2,BAA13425.1
2BTO,XP_037953971.1,BAA13425.1
cat file2
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,2BTO,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
BAA13425.1,2BTO,12.410,641,123
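The comma in seen[$3,$1] makes awk store the pair as a single key (the two values joined by SUBSEP), and ($1,$2) in seen then tests for exactly that pair. A minimal sketch on a reduced version of the sample inputs, with an illustrative temporary directory and a made-up variable name:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'col1,col2,col3' \
              '1TF4,WP_110462952.1,AEV67733.1' \
              '2BTO,NP_006073.2,BAA13425.1' > "$tmp/file1"
printf '%s\n' 'col1,col2,col3,col4,col5' \
              'BAA13425.1,SDD02770.1,38.176,296,175' \
              'BAA13425.1,2BTO,32.056,287,185' \
              'BBE42932.1,AEG17356.1,40.909,110,64' \
              'BAA13425.1,2BTO,12.410,641,123' > "$tmp/file2"

# First file: remember each (col3, col1) pair.
# Second file: keep the header line and every row whose (col1, col2) pair was seen.
rows=$(awk -F, 'FNR == NR {seen[$3,$1]; next} FNR == 1 || ($1,$2) in seen' \
       "$tmp/file1" "$tmp/file2")
printf '%s\n' "$rows"
rm -rf "$tmp"
```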

How to join two files based on one column in AWK (using wildcards)

I have 2 files, and I need to compare column 2 from file 2 with column 3 from file 1.
File 1
"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"
"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"
"testserver3","testserver3.domain.net","-3.3.3.3-"
File 2
"windows","10.10.10.10","datacenter1"
"linux","2.2.2.2","datacenter2"
"aix","4.4.4.4","datacenter2"
Expected Output
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
All the statements I have been able to find only work if the columns are identical. I need it to work when column 3 from file 1 contains the value from column 2 of file 2.
I've tried this, but again, it only works if the columns are identical (which I don't want):
awk 'BEGIN {FS = OFS = ","};NR == FNR{f[$3] = $1","$2;next};$2 in f {print f[$2],$0}' file1.csv file2.csv
hacky!
$ awk -F'","' 'NR==FNR {n=split($NF,x,"-"); for(i=2;i<n;i++) a[x[i]]=$1 FS $2; next}
$2 in a {print a[$2] "\"," $0}' file1 file2
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
This assumes the lookup is unique, i.e. each IP address appears in only one file1 record. If an IP appears in several records, the last one read overwrites the earlier ones.
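The split step in the command above can be tried in isolation to see which pieces end up as lookup keys. A small sketch using one sample line from file1 (the variable name parts is made up):

```shell
line='"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"'

parts=$(printf '%s\n' "$line" | awk -F'","' '{
    # With FS set to "," (quotes included), $NF is: -2.2.2.2-...-200.200.200.200-"
    n = split($NF, x, "-")
    # x[1] is empty and x[n] is the closing quote, so only x[2]..x[n-1] are IPs.
    for (i = 2; i < n; i++) print x[i]
}')
printf '%s\n' "$parts"
```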

Partial id match and merge multiple to one

I have two files, File1 and File2. File1 has 6000 rows and file2 has 3000 rows. I want to match the ids and merge the files based on matches, which is simple. But, the ids in file1 and file2 only match partially. Have a look at the files. For every id (row) in file2 there must be two matching ids (rows) in file 1. Also, not all the ids in file2 are present in file1. I had tried awk but didn't get the desired output.
File1
1_A01_A
1_A01_B
2_B03_A
2_B03_B
1_A02_A
1_A02_B
2_B04_A
2_B04_B
1_A03_A
1_A03_B
2_B05_A
2_B05_B
1_A04_A
1_A04_B
2_B06_A
2_B06_B
1_A06_A
1_A06_B
2_B07_A
2_B07_B
1_A07_A
1_A07_B
2_B08_A
2_B08_B
9_F10_A
9_F10_B
12_D08_A
12_D08_B
5505744243493_F09.CEL_A_A
5505744243493_F09.CEL_B_B
File2
1_A01 14
2_B03 13
1_A02 4
2_B04 14
1_A03 11
2_B05 8
1_A04 18
2_B06 15
1_A06 10
2_B07 4
1_A07 8
2_B08 22
1_A08 5
2_B09 15
1_A09 20
2_B10 17
awk -F" " 'FNR==NR{a[$1]=$2;next}{for(i in a){if($1~i){print $1"|"a[i];next}}}' file1.txt file2.txt
FNR==NR is true while awk reads the first file and false while it reads the second, so the block starting with {for(i in a){ runs only for the second file. $1~i is a regular-expression match that works like a "contains" test: it checks whether the id on the current line contains the key i, and each matching line is printed together with its value.
Note: by mistake I used the opposite file names. My file1.txt holds the content of File2 from the problem statement, and vice versa.
Output
1_A01_A|14
1_A01_B|14
2_B03_A|13
2_B03_B|13
1_A02_A|4
1_A02_B|4
2_B04_A|14
2_B04_B|14
1_A03_A|11
1_A03_B|11
2_B05_A|8
2_B05_B|8
1_A04_A|18
1_A04_B|18
2_B06_A|15
2_B06_B|15
1_A06_A|10
1_A06_B|10
2_B07_A|4
2_B07_B|4
1_A07_A|8
1_A07_B|8
2_B08_A|22
2_B08_B|22
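Because $1~i is an unanchored regular-expression match, a key such as 1_A01 would also match an id like 11_A01_A, and every line is compared against every key. A possible alternative, assuming each long id is the short id followed by exactly one "_suffix" (true for rows like 1_A01_A in the sample), is to strip that suffix and do a direct array lookup. This is a sketch with made-up file names, and the 11_A01_A row is invented to show the difference:

```shell
tmp=$(mktemp -d)
printf '%s\n' '1_A01 14' '2_B03 13' > "$tmp/ids"        # File2 in the question
printf '%s\n' '1_A01_A' '1_A01_B' '2_B03_A' '9_F10_A' '11_A01_A' > "$tmp/samples"   # File1

merged=$(awk '
    FNR == NR { val[$1] = $2; next }    # first file: short id -> value
    {
        key = $1
        sub(/_[^_]*$/, "", key)         # drop the last "_suffix" from the long id
        if (key in val) print $1 "|" val[key]
    }' "$tmp/ids" "$tmp/samples")
printf '%s\n' "$merged"
rm -rf "$tmp"
```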
This might work for you (GNU sed):
sed -r 's|^(\S+)(\s+\S+)$|s/^\1.*/\&\2/p|' file2 | sed -nf - file1
This creates a sed script from file2 and then runs it against the data in file1.
N.B. The order of either file is unimportant and file1 is processed only once.

awk to match two fields in two files

I want to find lines where fields 1 and 2 from file1 match fields 2 and 3 from file2, and then print all fields from file2. There are more lines in file2 than in file1
File1
rs116801199 720381
rs138295790 16057310
rs131531 16870251
rs131546 16872281
rs140375 16873251
rs131552 16873461
File2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
1 rs12565286 721290 0.028 1.000 1.000 2 0.370 0.934 0.000
1 rs3094315 752566 0.432 1.000 1.000 2 0.678 0.671 0.435
--- rs3131972 752721 0.353 0.906 0.938 0 -1 -1 -1
--- rs61770173 753405 0.481 0.921 0.950 0 -1 -1 -1
I tried something like:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2 > test
But got a syntax error
Consider:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2
The option -F expects an argument, but none is provided. As a result, awk takes the entire program text as the field separator and treats the next argument, file1, as the program. That is why the code does not run as expected.
From the problem statement, I didn't see why FNR==1 should be in the code. So, I removed it. Once that is done, the parens are unnecessary. If that is the case, then, the code further simplifies to:
$ awk 'FNR==NR{a[$1];b[$2];next} $2 in a && $3 in b' file1 file2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
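One caveat with two separate arrays: a and b are checked independently, so a line also passes when its ID and its position come from different rows of file1. If the two fields have to match as a pair from the same row, a composite key is one option. A minimal sketch, where the second file2 row is invented to show the difference and the variable names are made up:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'rs116801199 720381' 'rs138295790 16057310' > "$tmp/file1"
printf '%s\n' '--- rs116801199 720381 0.026' \
              '1 rs116801199 16057310 0.028' \
              '1 rs3094315 752566 0.432' > "$tmp/file2"

# Independent checks: the invented second row passes as well.
loose=$(awk 'FNR==NR{a[$1];b[$2];next} $2 in a && $3 in b' "$tmp/file1" "$tmp/file2")

# Composite key: ID and position must come from the same file1 row.
strict=$(awk 'FNR==NR{seen[$1,$2];next} ($2,$3) in seen' "$tmp/file1" "$tmp/file2")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```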

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition {action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet :
awk '{
if (NR == FNR) {arr[NR]=$1}
if (NR != FNR && arr[FNR]) {print $0}
}' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current line to STDOUT if the expression is true (nonzero, or a non-empty string).
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR differs from FNR once awk is reading the second file)
arr[NR]=$1 : fills array arr at index NR with the first column
if NR!=FNR we are in the second file, and if the array value for this line number is 1 (true), we print the line
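Not as clean as an awk solution, and the sed filter deletes any pasted line that contains a 0 anywhere, so it only suits data like the sample: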
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
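You mentioned millions of lines. To make just a single pass through both files without holding them in memory, I'd resort to Python. Something like this perhaps (Python 2.7):
from itertools import izip  # lazy pairing; plain zip() in 2.7 would build the whole list in memory
with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        # keep the file2 line unless the matching file1 line starts with 0
        if not l1.startswith('0'):
            print l2.strip()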
awk '{
    getline value <"file2";
    if ($1)
        print value;
}' file1
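The getline version reads the two files in lockstep and keeps nothing in memory, which fits the "millions of lines" constraint. A hedged variant that also checks getline's return value, so it stops cleanly if file2 runs out early. The check, the -v variable other and the shell variable picked are additions for illustration, not part of the original answer:

```shell
tmp=$(mktemp -d)
printf '%s\n' '0 xxx' '1 yyy' '1 zzz' '0 aaa' > "$tmp/file1"
printf '%s\n' 'A bbb' 'B ccc' 'C ddd' 'D eee' > "$tmp/file2"

picked=$(awk -v other="$tmp/file2" '{
    # getline returns 1 on success, 0 at end of file and -1 on error
    if ((getline line < other) <= 0) exit
    # print the file2 line when field 1 of the matching file1 line is 1
    if ($1 == 1) print line
}' "$tmp/file1")
printf '%s\n' "$picked"
rm -rf "$tmp"
```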