awk join two files and delete same columns - awk

I have two files. The first file contains:
rs1210110 1:14096821 C ENSG00000116731 ENST00000505823 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000491815 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000343137 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000485515 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000375499 Transcript
rs3219489 2:45797505 G ENSG00000132781 ENST00000525160 Transcript
The second file contains:
chr1 14096821 rs1210110 T C 100.00 PASS DP=89
chr2 17380497 rs2746462 G T 100.00 PASS DP=158
I would like to join them into one file that looks like this:
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000505823 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000491815 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000343137 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000485515 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000375499 Transcript
chr2 17380497 rs2746462 G G 100.00 PASS DP=158 ENSG00000132781 ENST00000525160 Transcript
The third column of the second file holds the rs code, which matches the first column of the first file. One row of the second file can correspond to several rows of the first file with the same rs code. The third column of the first file should end up as the 5th column of the output.

All done in awk
cat f1
rs1210110 1:14096821 C ENSG00000116731 ENST00000505823 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000491815 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000343137 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000485515 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000375499 Transcript
rs3219489 2:45797505 G ENSG00000132781 ENST00000525160 Transcript
cat f2
chr1 14096821 rs1210110 T C 100.00 PASS DP=89
chr2 17380497 rs2746462 G T 100.00 PASS DP=158
awk 'FNR==NR {a[$2]=$0;next} {split($2,b,":");print a[b[2]],$4,$5,$6 }' OFS="\t" f2 f1
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000505823 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000491815 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000343137 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000485515 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000375499 Transcript
ENSG00000132781 ENST00000525160 Transcript
The last line did not have a match, so it is printed with no information in front. This can be removed if needed.
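To drop rows without a match, one option is to test the key before printing. A minimal sketch, assuming shortened copies of f1 and f2 recreated in a scratch directory so the snippet is self-contained (the variable name `joined` is only for illustration, and the default OFS of a single space is used):

```shell
# Recreate shortened sample inputs in a scratch directory (illustrative only).
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
rs1210110 1:14096821 C ENSG00000116731 ENST00000505823 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000485515 Transcript
rs3219489 2:45797505 G ENSG00000132781 ENST00000525160 Transcript
EOF
cat > "$tmp/f2" <<'EOF'
chr1 14096821 rs1210110 T C 100.00 PASS DP=89
chr2 17380497 rs2746462 G T 100.00 PASS DP=158
EOF

# Same join on the position, but print only when the position from f1 exists in f2.
joined=$(awk 'FNR==NR {a[$2]=$0; next}
              {split($2,b,":")}
              (b[2] in a) {print a[b[2]],$4,$5,$6}' "$tmp/f2" "$tmp/f1")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The `in` test does not create an array element, so unmatched positions are skipped without side effects.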
A slightly different approach with awk, splitting on spaces, tabs and colons at once:
awk -F"[ \t:]+" 'FNR==NR {a[$2]=$0;next} {print a[$3],$5,$6,$7 }' OFS="\t" f2 f1

Use join to join and awk to reorder.
$ cat f1
rs1210110 1:14096821 C ENSG00000116731 ENST00000505823 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000491815 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000343137 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000485515 Transcript
rs2746462 2:17380497 T ENSG00000117118 ENST00000375499 Transcript
rs3219489 2:45797505 G ENSG00000132781 ENST00000525160 Transcript
$ cat f2
chr1 14096821 rs1210110 T C 100.00 PASS DP=89
chr2 17380497 rs2746462 G T 100.00 PASS DP=158
$ join -1 1 -2 3 f1 f2 | awk '{print $7, $8, $1, $9, $10, $11, $12, $13, $4, $5, $6}'
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000505823 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000491815 Transcript
chr1 14096821 rs1210110 T C 100.00 PASS DP=89 ENSG00000116731 ENST00000343137 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000485515 Transcript
chr2 17380497 rs2746462 G T 100.00 PASS DP=158 ENSG00000117118 ENST00000375499 Transcript
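Note that join expects both inputs to be sorted on the join field. The sample files happen to satisfy this. For unsorted data, a sketch along these lines could sort first (the files here are deliberately shuffled, shortened copies of the samples; temporary files are used instead of process substitution so it also runs in plain sh):

```shell
tmp=$(mktemp -d)
# Deliberately unsorted, shortened copies of the sample data (illustrative).
cat > "$tmp/f1" <<'EOF'
rs2746462 2:17380497 T ENSG00000117118 ENST00000485515 Transcript
rs1210110 1:14096821 C ENSG00000116731 ENST00000505823 Transcript
EOF
cat > "$tmp/f2" <<'EOF'
chr2 17380497 rs2746462 G T 100.00 PASS DP=158
chr1 14096821 rs1210110 T C 100.00 PASS DP=89
EOF

# Sort each file on its join field; LC_ALL=C keeps sort and join on the same collation.
LC_ALL=C sort -t ' ' -k1,1 "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort -t ' ' -k3,3 "$tmp/f2" > "$tmp/f2.sorted"

reordered=$(LC_ALL=C join -1 1 -2 3 "$tmp/f1.sorted" "$tmp/f2.sorted" |
  awk '{print $7, $8, $1, $9, $10, $11, $12, $13, $4, $5, $6}')
printf '%s\n' "$reordered"
rm -rf "$tmp"
```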

Related

Using cbind and rbind in for loop to make data pairs

I am attempting to use rbind and cbind to make pairs with two identical dataframes.
Here's the data:
bed_A <- data.frame(chr = c(rep("chr1", 5), rep("chr10", 5)),
start = c(105000, 125000, 145000, 165000, 185000, 800000, 825000, 850000, 875000, 900000),
end = c( 107000, 127000, 147000, 167000, 187000, 802000, 827000, 852000, 877000, 902000))
bed_B <- bed_A
> bed_A
chr start end
1 chr1 105000 107000
2 chr1 125000 127000
3 chr1 145000 147000
4 chr1 165000 167000
5 chr1 185000 187000
6 chr10 800000 802000
7 chr10 825000 827000
8 chr10 850000 852000
9 chr10 875000 877000
10 chr10 900000 902000
I want to make pairs with different "chr" values. So, no chr1-chr1 or chr10-chr10 pairs. I know that there should be 25 unique pairs in my output, since each row with "chr1" can be paired with 5 rows with "chr10".
I am stuck trying to actually make the pairs.
My desired output would look like this (spaces just included for visual ease), where each row with "chr1" has been paired with each row with "chr10".
chr1 105000 107000 chr10 800000 802000
chr1 105000 107000 chr10 825000 827000
chr1 105000 107000 chr10 850000 852000
chr1 105000 107000 chr10 875000 877000
chr1 105000 107000 chr10 900000 902000
chr1 125000 127000 chr10 800000 802000
chr1 125000 127000 chr10 825000 827000
chr1 125000 127000 chr10 850000 852000
chr1 125000 127000 chr10 875000 877000
chr1 125000 127000 chr10 900000 902000
chr1 145000 147000 chr10 800000 802000
chr1 145000 147000 chr10 825000 827000
chr1 145000 147000 chr10 850000 852000
chr1 145000 147000 chr10 875000 877000
chr1 145000 147000 chr10 900000 902000
chr1 165000 167000 chr10 800000 802000
chr1 165000 167000 chr10 825000 827000
chr1 165000 167000 chr10 850000 852000
chr1 165000 167000 chr10 875000 877000
chr1 165000 167000 chr10 900000 902000
chr1 185000 187000 chr10 800000 802000
chr1 185000 187000 chr10 825000 827000
chr1 185000 187000 chr10 850000 852000
chr1 185000 187000 chr10 875000 877000
chr1 185000 187000 chr10 900000 902000
Here's what I have so far:
table_bed_A <- table(bed_A[,1])
table_bed_B <- table(bed_B[,1])
for(i in names(table_bed_A)){
for(j in names(table_bed_B)){
df <- cbind(
do.call("rbind", replicate( 5, bed_A[bed_A$chr == i , ], simplify = FALSE)),
bed_B[bed_B$chr == j , ] )
}
}
This outputs a dataframe with 25 rows, but the pairs are incorrect. I would like to keep the double for loop, cbind, and rbind structure and use base R. I know it appears redundant to have the identical dataframes, but eventually I hope to use this structure on dissimilar dataframes, which is why I have it structured this way. Any help is greatly appreciated!

Awk fixed width columns and left leaning columns

I have a file named file1 consisting of 4350 lines and 12 columns, as shown below.
ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 74.984 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 74.926 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 1.886 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 62.517 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 59.442 35.851 2.791 0.00 0.00 HAC1
I am using awk -v d="74.106" '{$7=sprintf("%0.3f", $7+d)} 1' file1 > file2 to add a value d to the 7th column of file1. After this, my file2 does not retain the correct formatting. A section of file2 is shown below.
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I need my file2 to keep the same formatting as my file1, where only columns 2, 8, and 9 are left leaning.
I have tried to use awk -v FIELDWIDTHS="7 6 4 4 4 5 8 8 8 6 6 10" '{print $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12}' to specify the maximum width for each of the 12 columns. This command does not change my file2. Moreover, I cannot find a way to make columns 2, 8, and 9 left leaning as in file1.
How can I achieve these two things?
I appreciate any guidance. Thank you!
Well, with the default FS, awk rebuilds the record with a single space (OFS) between fields as soon as you modify a field, so the runs of spaces are lost.
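A quick way to see this effect, using a toy line with made-up spacing rather than the real PDB layout:

```shell
# Three fields separated by runs of spaces (illustrative spacing).
line='ATOM      1  CE1'

# Assigning to any field makes awk rebuild $0 with OFS (a single space by default).
rebuilt=$(printf '%s\n' "$line" | awk '{$2 = $2} 1')

# Without a field assignment the record is printed untouched.
untouched=$(printf '%s\n' "$line" | awk '1')

printf '[%s]\n[%s]\n' "$rebuilt" "$untouched"
```

The first bracketed line shows the collapsed spacing, the second keeps the original layout.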
What you need to do first is to understand your ATOM record format:
COLUMNS    DATA TYPE      CONTENTS
1 - 6      Record name    "ATOM "
7 - 11     Integer        Atom serial number.
13 - 16    Atom           Atom name.
17         Character      Alternate location indicator.
18 - 20    Residue name   Residue name.
22         Character      Chain identifier.
23 - 26    Integer        Residue sequence number.
27         AChar          Code for insertion of residues.
31 - 38    Real(8.3)      Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)      Occupancy.
61 - 66    Real(6.2)      Temperature factor (Default = 0.0).
73 - 76    LString(4)     Segment identifier, left-justified.
77 - 78    LString(2)     Element symbol, right-justified.
79 - 80    LString(2)     Charge on the atom.
Then you can use substr for generating a modified record:
awk -v d="74.106" '
/^ATOM / {
xCoord = sprintf( "%8.3f", substr($0,31,8) + d )
$0 = substr($0,1,30) xCoord substr($0,39)
}
1
' file.pdb
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
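Using awk with sub(), which edits the record in place instead of assigning to a field. Note that sub() treats $7 as a regular expression and replaces its first match in the line, so this assumes the value does not occur earlier in the record: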
$ awk -v d=74.106 '/ATOM/{sub($7,sprintf("%0.3f", $7+d))}1' input_file
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1

Convert values of an entire column in a dataframe pandas

I have the following dataframe:
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 G nan
2 chr19 45867080 G ERCC2
3 chr3 52436341 C BAP1
4 chr7 151875065 G KMT2C
5 chr19 1206633 CGGGT STK11
and I'd like to convert the entire 'end_position' column to contain the values of the 'start_position'+len('end_position'), the results should be:
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 24128946 nan
2 chr19 45867080 45867081 ERCC2
3 chr3 52436341 52436342 BAP1
4 chr7 151875065 151875066 KMT2C
5 chr19 1206633 1206638 STK11
I have written the below script:
patient_vcf_to_df.apply(pd.to_numeric, errors='ignore')
patient_vcf_to_df['end_position'] = patient_vcf_to_df['end_position'].map(lambda x: patient_vcf_to_df['start_position'] + len(x))
but I got the error:
TypeError: must be str, not int
Does anyone know how I can fix the problem?
Thanks a lot!
First, I'd read your CSV so that row 0 becomes the header (column names):
df = pd.read_csv(filename, header=1)
to get the following DF:
Chr Position Ref Gene_Name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
as a positive side-effect:
In [99]: df.dtypes
Out[99]:
chr object
position int64 # <--- NOTE
ref object
gene_name object
dtype: object
if you want to lower-case your columns:
In [97]: df.columns = df.columns.str.lower()
In [98]: df
Out[98]:
chr position ref gene_name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
to make sure that position column is of a numeric dtype:
df['position'] = pd.to_numeric(df['position'], errors='coerce')
and then:
In [101]: df['end_position'] = df['position'] + df['ref'].str.len()
In [102]: df
Out[102]:
chr position ref gene_name end_position
0 chr22 24128945 G NaN 24128946
1 chr19 45867080 G ERCC2 45867081
2 chr3 52436341 C BAP1 52436342
3 chr7 151875065 G KMT2C 151875066
4 chr19 1206633 CGGGT STK11 1206638

Repetitive grabbing with awk

I am trying to compare two files and combine different columns of each. The example files are:
1.txt
chr8 12 24 . . + chr1 11 12 XX4 -
chr3 22 33 . . + chr4 60 61 XXX9 -
2.txt
chr1 11 1 X1 X2 11 12 2.443 0.843 +1 SXSD 1.3020000
chr1 11 2 X3 X4 11 12 0.888 0.833 -1 XXSD -28.887787
chr1 11 3 X5 X6 11 12 0.888 0.843 +1 XXSD 2.4909883
chr1 12 4 X7 X8 11 12 0.888 0.813 -1 CMKY 0.0009223
chr1 12 5 X9 X10 11 12 0.888 0.010 -1 XASD 0.0009223
chr1 12 6 X11 X12 11 12 0.888 0.813 -1 XUPS 0.10176998
I want to compare the 1st, 6th and 7th columns of 2.txt with the 7th, 8th and 9th columns of 1.txt, and if there is a match, I want to print the whole line of 1.txt with the 3rd and 12th columns of 2.txt.
The expected output is :
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
My trial is with awk:
awk 'NR==FNR{ a[$1,$6,$7]=$3"\t"$12; next } { s=SUBSEP; k=$7 s $8 s $9 }k in a{ print $0,a[k] }' 2.txt 1.txt
It outputs only the last match and I cannot make it print all matches:
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
How can I repetitively search and print all matches?
You're making it much harder than it has to be by reading the 2nd file first.
$ cat tst.awk
NR==FNR { a[$7,$8,$9] = $0; next }
($1,$6,$7) in a { print a[$1,$6,$7], $3, $12 }
$ awk -f tst.awk 1.txt 2.txt
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
Extended AWK solution:
awk 'NR==FNR{ s=SUBSEP; k=$1 s $6 s $7; a[k]=(k in a? a[k]"#":"")$3"\t"$12; next }
{ s=SUBSEP; k=$7 s $8 s $9 }
k in a{ len=split(a[k], b, "#"); for (i=1;i<=len;i++) print $0,b[i] }' 2.txt 1.txt
s=SUBSEP; k=$1 s $6 s $7 - constructing key k value comprised of the 1st, 6th and 7th fields of the file 2.txt
a[k]=(k in a? a[k]"#":"")$3"\t"$12 - concatenate the $3"\t"$12 sequences with custom separator # within the same group (group presented by k)
s=SUBSEP; k=$7 s $8 s $9 - constructing key k value comprised of the 7th, 8th and 9th fields of the file 1.txt
len=split(a[k], b, "#"); - split previously accumulated sequences into array b by separator #
The output:
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
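Another way to keep every match, sketched here with a per-key counter instead of a joined string, so no separator character such as # has to be reserved. The array names `n` and `v` are illustrative, and the inputs are shortened copies of the sample files:

```shell
tmp=$(mktemp -d)
cat > "$tmp/1.txt" <<'EOF'
chr8 12 24 . . + chr1 11 12 XX4 -
chr3 22 33 . . + chr4 60 61 XXX9 -
EOF
cat > "$tmp/2.txt" <<'EOF'
chr1 11 1 X1 X2 11 12 2.443 0.843 +1 SXSD 1.3020000
chr1 11 2 X3 X4 11 12 0.888 0.833 -1 XXSD -28.887787
chr1 12 4 X7 X8 11 12 0.888 0.813 -1 CMKY 0.0009223
EOF

all_matches=$(awk '
  # First file (2.txt): store each value under (key, running index).
  NR==FNR { k = $1 SUBSEP $6 SUBSEP $7; v[k, ++n[k]] = $3 "\t" $12; next }
  # Second file (1.txt): replay every stored value for the matching key.
  { k = $7 SUBSEP $8 SUBSEP $9
    for (i = 1; i <= n[k]; i++) print $0, v[k, i] }
' "$tmp/2.txt" "$tmp/1.txt")
printf '%s\n' "$all_matches"
rm -rf "$tmp"
```

Lines of 1.txt whose key never appeared in 2.txt have no counter, so the loop body simply does not run for them.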

How to subset a tsv file based on a pattern?

I have two files. One is a tab-separated file containing multiple columns; the other is a list of gene names. I have to extract only those rows of file 1 whose gene is listed in file 2.
I tried the command below, but it extracts all the rows:
awk 'NR==FNR{a[$0]=1;next} {for(i in a){if($10~i){print;break}}}' File2 file1
File1:
Input line ID Chrom Position Strand Ref. base(s) Alt. base(s) Sample ID HUGO symbol Sequence ontology Protein sequen
3 VAR113_NM-02_TUMOR_DNA chr1 11082255 + G T NM-02_TUMOR_DNA TARDBP MS K263N . PASS het 3 25
4 VAR114_NM-02_TUMOR_DNA chr1 15545868 + G T NM-02_TUMOR_DNA TMEM51 MS V131F . PASS het 3 13
6 VAR116_NM-02_TUMOR_DNA chr1 20676680 + C T NM-02_TUMOR_DNA VWA5B1 SY S970S . PASS het 4 34
7 rs149021429_NM-02_TUMOR_DNA chr1 21554495 + C A NM-02_TUMOR_DNA ECE1 SY S570S . PASS het 3
16 VAR126_NM-02_TUMOR_DNA chr1 39905109 + C T NM-02_TUMOR_DNA MACF1 SY V4069V . PASS het 4 17
21 VAR131_NM-02_TUMOR_DNA chr1 101387378 + G T NM-02_TUMOR_DNA SLC30A7 MS A275S . PASS het 4 45
24 VAR134_NM-02_TUMOR_DNA chr1 113256156 + C A NM-02_TUMOR_DNA PPM1J MS S135I . PASS het 3 9
25 rs201097299_NM-02_TUMOR_DNA chr1 145326106 + A T NM-02_TUMOR_DNA NBPF10 MS M1327L . PASS het 5
26 VAR136_NM-02_TUMOR_DNA chr1 149859281 + T C NM-02_TUMOR_DNA HIST2H2AB SY E62E . PASS het 11
27 VAR137_NM-02_TUMOR_DNA chr1 150529801 + C A NM-02_TUMOR_DNA ADAMTSL4 SY S679S . PASS het 3
28 rs376491237_NM-02_TUMOR_DNA chr1 150532649 + C A NM-02_TUMOR_DNA ADAMTSL4 SY R1068R . PASS het
34 VAR144_NM-02_TUMOR_DNA chr1 160389277 + T A NM-02_TUMOR_DNA VANGL2 SY L226L . PASS het 3 6
35 VAR145_NM-02_TUMOR_DNA chr1 161012389 + C A NM-02_TUMOR_DNA USF1 MS D44Y . PASS het 3 32
37 VAR147_NM-02_TUMOR_DNA chr1 200954042 + G T NM-02_TUMOR_DNA KIF21B MS R1250S . PASS het 3 21
41 rs191896925_NM-02_TUMOR_DNA chr1 207760805 + G T NM-02_TUMOR_DNA CR1 MS G1869W . PASS het 3
42 VAR152_NM-02_TUMOR_DNA chr1 208218427 + C A NM-02_TUMOR_DNA PLXNA2 SY G1208G . PASS het 3 13
43 VAR153_NM-02_TUMOR_DNA chr1 222715425 + A G NM-02_TUMOR_DNA HHIPL2 SY Y349Y . PASS het 10 41
44 VAR154_NM-02_TUMOR_DNA chr1 222715452 + T A NM-02_TUMOR_DNA HHIPL2 SY G340G . PASS het 5 46
45 VAR155_NM-02_TUMOR_DNA chr1 223568296 + G A NM-02_TUMOR_DNA C1orf65 SY G493G . PASS het 3 25
48 VAR158_NM-02_TUMOR_DNA chr2 8931258 + G A NM-02_TUMOR_DNA KIDINS220 MS P458L . PASS het 3 13
51 VAR161_NM-02_TUMOR_DNA chr2 37229656 + C A NM-02_TUMOR_DNA HEATR5B MS G1704C . PASS het 4 9
60 VAR170_NM-02_TUMOR_DNA chr2 84775506 + G T NM-02_TUMOR_DNA DNAH6 MS Q427H . PASS het 3 20
63 VAR173_NM-02_TUMOR_DNA chr2 86378563 + C A NM-02_TUMOR_DNA IMMT MS A420S . PASS het 6 29
64 VAR174_NM-02_TUMOR_DNA chr2 86716546 + G T NM-02_TUMOR_DNA KDM3A MS C1140F . PASS het 3 18
65 VAR175_NM-02_TUMOR_DNA chr2 96852612 + C A NM-02_TUMOR_DNA STARD7 SY L323L . PASS het 2 2
67 VAR177_NM-02_TUMOR_DNA chr2 121747740 + C A NM-02_TUMOR_DNA GLI2 MS P1417H . PASS het 2 2
71 rs199770435_NM-02_TUMOR_DNA chr2 130872871 + C T NM-02_TUMOR_DNA POTEF SY G184G . PASS het 8
72 rs199695856_NM-02_TUMOR_DNA chr2 132919171 + A G NM-02_TUMOR_DNA ANKRD30BL SY H36H . PASS het
73 rs111295191_NM-02_TUMOR_DNA chr2 132919192 + G A NM-02_TUMOR_DNA ANKRD30BL SY N29N . PASS het
76 VAR186_NM-02_TUMOR_DNA chr2 167084231 + T A NM-02_TUMOR_DNA SCN9A SY A1392A . PASS het 3 19
77 VAR187_NM-02_TUMOR_DNA chr2 168100115 + C G NM-02_TUMOR_DNA XIRP2 MS T738S . PASS het 9 49
78 VAR188_NM-02_TUMOR_DNA chr2 179343033 + G T NM-02_TUMOR_DNA FKBP7 MS A65D . PASS het 3 7
79 VAR189_NM-02_TUMOR_DNA chr2 179544108 + G C NM-02_TUMOR_DNA TTN MS P11234A . PASS het 3 17
82 VAR192_NM-02_TUMOR_DNA chr2 220074164 + G T NM-02_TUMOR_DNA ZFAND2B MS E92D . PASS het 2 2
83 VAR193_NM-02_TUMOR_DNA chr2 220420892 + C A NM-02_TUMOR_DNA OBSL1 MS G1487W . PASS het 3 9
84 rs191578275_NM-02_TUMOR_DNA chr2 233273263 + C A NM-02_TUMOR_DNA ALPPL2 MS P279Q . PASS het 3
86 VAR196_NM-02_TUMOR_DNA chr2 241815391 + G T NM-02_TUMOR_DNA AGXT SY L272L . PASS het 3 10
88 VAR198_NM-02_TUMOR_DNA chr3 9484995 + C T NM-02_TUMOR_DNA SETD5 SG R361* . PASS het 3 18
96 VAR206_NM-02_TUMOR_DNA chr3 49848502 + G T NM-02_TUMOR_DNA UBA7 MS P382H . PASS het 5 38
102 VAR212_NM-02_TUMOR_DNA chr3 58302669 + G T NM-02_TUMOR_DNA RPP14 MS L89F . PASS het 3 30
103 VAR213_NM-02_TUMOR_DNA chr3 63981750 + C A NM-02_TUMOR_DNA ATXN7 MS T751K . PASS het 3 13
104 rs146577101_NM-02_TUMOR_DNA chr3 97868656 + C T NM-02_TUMOR_DNA OR5H14 MS R143W . PASS het 4
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
115 VAR225_NM-02_TUMOR_DNA chr3 183753779 + C A NM-02_TUMOR_DNA HTR3D MS P91T . PASS het 4 48
File2:
FBN1
HELZ
RALGPS2
DYNC1I2
NFE2L2
POSTN
INO80
I want those rows that contain these genes.
If I am following you correctly, you just want to look up $9 of file1 in the list of genes in file2. Maybe:
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1
If I add MYLK to the list in file2, I get:
**empty line** (included because `MYLK` was added after a line break)
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
To remove the empty line from the output:
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1 | awk '!/^$/'
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
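The empty line can also be avoided in a single awk pass by ignoring blank lines in both files. A sketch with shortened, hypothetical copies of the two files (the gene list contains a blank line on purpose, and fields are split on whitespace as in the command above):

```shell
tmp=$(mktemp -d)
# Gene list with a blank line before MYLK (illustrative).
printf 'FBN1\nHELZ\n\nMYLK\n' > "$tmp/file2"
cat > "$tmp/file1" <<'EOF'
104 rs146577101_NM-02_TUMOR_DNA chr3 97868656 + C T NM-02_TUMOR_DNA OR5H14 MS R143W . PASS het 4

107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
EOF

# NF is 0 on blank lines, so they are neither stored as keys nor printed.
subset=$(awk 'NR==FNR { if (NF) genes[$1]; next } NF && ($9 in genes)' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$subset"
rm -rf "$tmp"
```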