print lines matching criteria from two files with different formats - awk

I am trying to print those lines in file1 that match file2. $1 in file1 holds the value that is stored in array c and then looked up in file2 as $1:$2. That is the first criterion used to match the lines, but not the only one: $5 of file1 must also match $4 of file2. If those two criteria are met, and $2 in file1 is SNV or INDEL and $3 in file1 is exonic, then the matching line from file1 is printed. If the lines do not match, they are skipped. The awk below executes but produces no output; in this example there should be one line. My actual data is several thousand lines, all in the format below. Thank you :).
file 1
##reference=hg19
##referenceURI=hg19
# locus type location function coding
chr1:11184539 CNV
chr1:11184573 REF exonic
chr1:11189845 REF exonic
chr2:47630550 SNV intronic
chr4:55141050 SNV exonic synonymous c.1701A>G
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG
file2
chr4 55141050 COSM742 c.1696_1713delAGCCCAGATGGACATGAAinsCGC p.Ser566_Glu571delinsArg
chr4 55141050 COSM12417 c.1697_1711delGCCCAGATGGACATG p.Ser566_Glu571delinsLy
awk
awk 'FNR==3 {next}
# skip first three lines in file1
{FS = OFS = "\t"}
# define input and output as tab-delimited
NR==FNR{c[$1]; next} ($1":"$2) in c && NR==FNR{c[$5];next}$4 in c && $2 ~ /SNV|INDEL/ && $3=="exonic"' file1 file2
# process each line in file matching on criteria
desired output
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG

Awk solution:
awk 'NR==FNR{ a[$1":"$2]=substr($4,1,6); next }
NF>=5 && $2~/SNV|INDEL/ && $3=="exonic" &&
($1 in a) && a[$1]==substr($5,1,6)' file2 OFS='\t' file1
The output:
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG
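For comparison, here is a minimal sketch of a variant that compares the whole coding string rather than its first six characters, and keeps every file2 entry for a locus instead of only the last one. It assumes whitespace-delimited input as shown in the question and that an exact match between $5 of file1 and $4 of file2 is what is wanted; the array name seen and the temporary directory are only for illustration.

```shell
# Sample data copied from the question (whitespace-delimited here).
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
##reference=hg19
##referenceURI=hg19
# locus type location function coding
chr1:11184539 CNV
chr1:11184573 REF exonic
chr1:11189845 REF exonic
chr2:47630550 SNV intronic
chr4:55141050 SNV exonic synonymous c.1701A>G
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG
EOF
cat > "$dir/file2" <<'EOF'
chr4 55141050 COSM742 c.1696_1713delAGCCCAGATGGACATGAAinsCGC p.Ser566_Glu571delinsArg
chr4 55141050 COSM12417 c.1697_1711delGCCCAGATGGACATG p.Ser566_Glu571delinsLy
EOF

result=$(awk '
    # file2: remember every (locus, coding change) pair
    NR == FNR { seen[$1 ":" $2, $4]; next }
    # file1: print lines that pass the type/location filter and have a stored pair
    $2 ~ /^(SNV|INDEL)$/ && $3 == "exonic" && (($1, $5) in seen)
' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"
rm -r "$dir"
```

Keying on the pair means two file2 rows with the same position do not overwrite each other.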

Related

awk to add text to file based on coordinates

In the below awk (which executes but produces an empty output) I am using the $4 in file1 as a unique id and reading each $1, $2, and $3 value into a variable chr, min, and max.
The $4 is then split on the _ in file2 and read into an array. Each value in the split will match a $4 id in file1. The chr needs to match the $1, and the min and max must be between the $2 and $3 values in file2.
An exact match is not needed rather just that the min or max variables falls within $2 and $3. If that is true then exon is printed in $5 of file1, if it is not true then intron is printed in $5.
The desired output has the exon/intron added to it but there is another part where the values in $2 or $3 are adjusted but I am trying to script that before I ask. I am not sure if the below is the best way but hopefully it is a start. Thank you :).
file1 tab delimited except for whitespace after $3 and $4
chr7 94027591 94027701 COL1A2
chr6 31980068 31980074 TNXB
file2 tab delimited
chr7 94027059 94027070 COL1A2_cds_1_0_chr7_94027060_f 0 +
chr7 94027693 94027708 COL1A2_cds_2_0_chr7_94027694_f 0 +
chr6 32009125 32009227 TNXB_cds_0_0_chr6_32009126_r 0 -
chr6 32009547 32009711 TNXB_cds_1_0_chr6_32009548_r 0 -
desired output
chr7 94027683 94027701 COL1A2 exon
chr6 31980068 31980074 TNXB intron
awk w/ comments
awk '
FNR==NR{ # open block, process matching line in file1 and file2
a[$4]; # use as a key with unique id
chr[$4]=$1; # store $1 value in chr
min[$4]=$2; # store $2 value in min
max[$4]=$3; # store $3 value in max
next # process next line
} # close block
{ # open block
split($4,array,"_"); # split $4 on underscore
print $0,(array[1] in a) && ($2<=min[array[1]] && $2<=max[array[1] && $1=chr[array[1]])?"exon":"intron"
}' file1 OFS="\t" file2 > output # close block, mention input with field separators and output
IMHO, the final output you show does not look correct by your stated logic, since Input_file2 has multiple entries per gene and Input_file1 has only one (I am going by the samples shown only). Could you please check it? If your expected output or logic changes, please state it clearly.
awk '
BEGIN{
SUBSEP=","
}
FNR==NR{
max[$1,$NF]=$3
min[$1,$NF]=$2
next
}
{
split($4,array,"_")
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>min[array[5],array[1]] && $3<max[array[5],array[1]])){
print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
next
}
}
{
print $0,"intron"
}' Input_file1 Input_file2 | column -t
This command checks whether Input_file2's 2nd or 3rd field falls within the range given by Input_file1's 2nd and 3rd fields. If either one does, it prints the Input_file1 values with exon appended; otherwise it prints the Input_file2 line with intron appended.
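If one output line per file1 record is wanted instead (as in the desired output), a rough sketch is to load the file2 intervals first and then test each file1 line for overlap. This assumes the gene id is everything before the first underscore in $4 of file2, that an overlap of any size counts as exon, and it does not attempt the coordinate adjustment mentioned in the question. The array names n, chr, lo and hi are made up for the example.

```shell
dir=$(mktemp -d)
printf 'chr7\t94027591\t94027701\tCOL1A2\nchr6\t31980068\t31980074\tTNXB\n' > "$dir/file1"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr7 94027059 94027070 COL1A2_cds_1_0_chr7_94027060_f 0 + \
    chr7 94027693 94027708 COL1A2_cds_2_0_chr7_94027694_f 0 + \
    chr6 32009125 32009227 TNXB_cds_0_0_chr6_32009126_r 0 - \
    chr6 32009547 32009711 TNXB_cds_1_0_chr6_32009548_r 0 - > "$dir/file2"

result=$(awk '
BEGIN { OFS = "\t" }
NR == FNR {                          # file2: collect intervals per gene
    split($4, part, "_")
    gene = part[1]
    n[gene]++
    chr[gene, n[gene]] = $1
    lo[gene, n[gene]]  = $2 + 0      # force numeric storage
    hi[gene, n[gene]]  = $3 + 0
    next
}
{                                    # file1: label each region
    label = "intron"
    for (i = 1; i <= n[$4]; i++)
        if ($1 == chr[$4, i] && $2 <= hi[$4, i] && $3 >= lo[$4, i]) {
            label = "exon"
            break
        }
    print $1, $2, $3, $4, label
}' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"
rm -r "$dir"
```

Note that file2 is read first here, the reverse of the order used above.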

filtering in a text file using awk

I have a tab-separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
In the new file that I will make:
1- I want to select rows based on the 4th column: if the value of the 4th column is at least 10, I keep the entire row; otherwise it is filtered out.
2- In the next step, the 4th column is removed.
The result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
To get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. Do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
Here is another, which might work better if you have many more fields (note that \w is a GNU awk extension):
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file
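A variant of the same idea that avoids \w, as a small sketch: it assumes tab-delimited input whose last column is the numeric filter column, and strips that column with a bracket expression instead.

```shell
dir=$(mktemp -d)
printf 'chr1\t%s\t%s\t%s\n' \
    100499714 100499715 1 \
    100502177 100502178 10 \
    100502181 100502182 2 \
    100502191 100502192 18 \
    100502203 100502204 45 > "$dir/input.txt"

# Keep rows whose last field is at least 10, then drop that last field.
result=$(awk -F'\t' '$NF >= 10 { sub(/\t[^\t]*$/, ""); print }' "$dir/input.txt")
printf '%s\n' "$result"
rm -r "$dir"
```

Because only the trailing field is removed from $0, the remaining tabs are left exactly as they were.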

awk to update file based on matching lines with split

In the below awk I am trying to match $2 in file1, up until the ., with $4 in file2 up to the first underscore _. If a match is found, then that portion of file2 is updated with the matching $1 value in file1. I think it is close, but I am not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the format below, and a match may not always be found. The awk as is does execute, but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
The following awk may help you with this. You can also set the field separator per input file, e.g. if Input_file1 is space-delimited put FS=" " before its name, and if Input_file2 is tab-delimited put FS="\t" before its name.
awk '
FNR==NR{
val=$2;
sub(/\..*/,"",val);
a[val]=$1;
next
}
{
split($4,array,"_")
}
((array[1]"_"array[2]) in a){
sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);
print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
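Since the desired output is tab-delimited, one possible tweak is to set OFS so that the rebuilt line uses tabs. The sketch below assumes the accession in $4 is always the first two underscore-separated pieces (e.g. NM_003243) and that lines without a match should be dropped, as in the desired output; the array name name and the variable key are only illustrative.

```shell
dir=$(mktemp -d)
printf '%s\n' 'TGFBR1 NM_004612.3' 'TGFBR2 NM_003242.5' 'TGFBR3 NM_003243.4' > "$dir/file1"
printf '%s\t%s\t%s\t%s\n' \
    chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r \
    chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r \
    chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r \
    chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f \
    chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f \
    chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f \
    chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f > "$dir/file2"

result=$(awk '
BEGIN { OFS = "\t" }                 # default FS splits both files; output uses tabs
FNR == NR {                          # file1: map accession (version stripped) to gene
    acc = $2
    sub(/\..*$/, "", acc)
    name[acc] = $1
    next
}
{                                    # file2: swap the accession prefix for the gene name
    split($4, part, "_")
    key = part[1] "_" part[2]
    if (key in name) {
        $4 = name[key] substr($4, length(key) + 1)
        print
    }
}' "$dir/file1" "$dir/file2")
printf '%s\n' "$result"
rm -r "$dir"
```

Assigning to $4 makes awk rebuild the record with OFS, which is why the output comes out tab-separated.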

awk to print fields that match using conditions and a default value for non-matching in two files

Trying to use AWK to match the contents of each line in file with $2 in list. Both files are tab-delimited and there may be a space or special character in the name being matched in list, for example in file the name is BRCA1 but in list the name is BRCA 1 or in file name is BCR but in list the name is BCR/ABL.
If there is a match and $4 of list has full gene sequence in it, then $2 and $1 are printed separated by a tab. If there is no match found then the name that was not matched and 14 are printed separated by a tab. The awk below does execute, but no output results. Thank you :).
file
BRCA1
BCR
SCN1A
fbn1
list
List code gene gene name methodology
81 DMD dystrophin deletion analysis and duplication analysis
811 BRCA 1 BRCA2 full gene sequence and full deletion/duplication analysis
70 ABL1 ABL1 gene analysis variants in the kinse domane
71 BCR/ABL t(9;22) full gene sequence
awk
awk -F'\t' -v OFS="\t" 'FNR==NR{A[$1]=$0;next} ($2 in A){if($4=="full gene sequence"){print A[$2],$1}} ELSE {print A[$2],"14"}' file list
desired output
BRCA1 811
BCR 71
SCN1A 14
fbn1 85
edit
List code gene gene name methodology
85 fbn1 Fibrillin full gene sequencing
95 FBN1 fibrillin del/dup
result
85 fbn1 Fibrillin full gene sequencing
since only this line has full gene sequencing in it, only this is printed.
You can try,
awk 'BEGIN{FS=OFS="\t"}
FNR==NR{
if(NR>1){
gsub(" ","",$2) #removing white space
n=split($2,v,"/")
d[v[1]] = $1 #from split, first element as key
}
next
}{print $1, ($1 in d?d[$1]:14)}' list file
you get,
BRCA1 811
BCR 71
SCN1A 14
awk 'FNR==NR{
a[$2]=$1;
next
}
{
for(i in a){
if($1 ~ i || i ~ $1){ print $1, a[i] ; next }
}
print $1,14
}' list file
Input
$ cat list
List code gene gene name methodology
81 DMD dystrophin deletion analysis and duplication analysis
811 BRCA 1 BRCA2 full gene sequence and full deletion/duplication analysis
70 ABL1 ABL1 gene analysis variants in the kinse domane
71 BCR/ABL t(9;22) full gene sequence
$ cat file
BRCA1
BCR
SCN1A
Output
$ awk 'FNR==NR{
a[$2]=$1;
next
}
{
for(i in a){
if($1 ~ i || i ~ $1){ print $1, a[i] ; next }
}
print $1,14
}' list file
BRCA1 811
BCR 71
SCN1A 14
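Neither version above checks the methodology column. A possible sketch that also requires "full gene sequence" in $4 is shown below. It assumes list really is tab-delimited with four columns (the column boundaries in the sample rows are my guess from the flattened text), normalizes $2 by removing spaces and anything after a slash, and uses a made-up array name code.

```shell
dir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    'List code' 'gene' 'gene name' 'methodology' \
    '81'  'DMD'     'dystrophin' 'deletion analysis and duplication analysis' \
    '811' 'BRCA 1'  'BRCA2'      'full gene sequence and full deletion/duplication analysis' \
    '70'  'ABL1'    'ABL1'       'gene analysis variants in the kinse domane' \
    '71'  'BCR/ABL' 't(9;22)'    'full gene sequence' > "$dir/list"
printf '%s\n' BRCA1 BCR SCN1A > "$dir/file"

result=$(awk '
BEGIN { FS = OFS = "\t" }
FNR == NR {                              # list: keep only full-gene-sequence rows
    if (FNR > 1 && $4 ~ /full gene sequence/) {
        key = $2
        gsub(/ /, "", key)               # "BRCA 1"  -> "BRCA1"
        sub(/\/.*/, "", key)             # "BCR/ABL" -> "BCR"
        code[key] = $1
    }
    next
}
{ print $1, ($1 in code ? code[$1] : 14) }   # file: matched code or default 14
' "$dir/list" "$dir/file")
printf '%s\n' "$result"
rm -r "$dir"
```

The lookup is an exact, case-sensitive match on the normalized name, so it is stricter than the regex comparison used in the loop version.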

awk not printing header in output file

The below awk seems to work great with one issue: the header line does not print in the output. I have been staring at this a while with no luck. What am I missing? Thank you :).
awk
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} FNR==1 || ($7 in a)' /home/panels/file1 test.txt |
awk '{split($2,a,"-"); print a[1] "\t" $0}' |
sort |
cut -f2- > /home/panels/test_filtered.vcf
test.txt (used in the awk to give the filtered output --only a small portion of the data but the tab delimited format is shown)
Chr Start End Ref Alt
chr1 949608 949608 G A
current output (has no header)
chr1 949608 949608 G A
desired output (has header)
Chr Start End Ref Alt
chr1 949608 949608 G A
It looks like the header is going to sort, and getting mixed in with your data. A simple solution is to do:
... | { IFS= read -r line; printf '%s\n' "$line"; sort; } |
to prevent the first line from going to sort.
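To see the header-preserving idea in isolation, here is a tiny sketch on made-up tab-delimited rows (the sample values are not from the real data). It reads the header with IFS= read -r and prints it with printf, so tabs and backslashes in the header survive, and only the remaining lines reach sort.

```shell
# First line is passed through untouched; everything after it is sorted.
result=$(printf 'Chr\tStart\tEnd\nchr2\t5\t6\nchr1\t1\t2\n' |
    { IFS= read -r header; printf '%s\n' "$header"; sort; })
printf '%s\n' "$result"
```

An unquoted echo $line would instead collapse the tabs in the header into single spaces.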
You can also combine your scripts, move the sort into awk, and handle the header this way:
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i]; next}
FNR==1{print "dummy\t" $0; next}
$7 in a{split($2,b,"-"); print b[1] "\t" $0 | "sort" }' file1 file2 |
cut -f2-