awk to update file based on matching lines with split - awk

In the below awk I am trying to match $2 in file1 up until the ., with $4 in file2 up to the first undescore _. If a match is found then that portion of file2 is up dated with the matching $1 value in file1. I think it is close but not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the below format and a match may not always be found. The awk as is does execute but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2

Following awk may help you on same. Also you could change you FS field separator as per your Input_file(s) too, eg--> Input_file1 is space delimited then use FS=" " before it and Input_file2 is TAB delimited then use FS="\t" before its name.
awk '
FNR==NR{
val=$2;
sub(/\..*/,"",val);
a[val]=$1;
next
}
{
split($4,array,"_")
}
((array[1]"_"array[2]) in a){
sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);
print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f

Related

awk to split field twice using two deliminator with condition

In the awk I am splitting on the space or : after the chrxx (it is not consistent so I added both as FS, then splitting on the -. I can not seem to duplicate $2 if there is no - after it. Lines 2,3 are examples. If there is a - after the number then the value to the right of it is $3 in the ouput. The awk seems close but isn't duplicating the value. Thank you :).
in
chr17 7124137-7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
awk
awk -F"[ :-]" '$3=$3?$3:$2' OFS='\t' in
current
chr17 7124137 7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
desired output
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2
If number of fields is three, copy 3rd field to 4th, and 2nd to 3rd. Force recomputing of whole record to make output tab separated regardless of what's done before.
awk -F'[ :-]' 'NF==3{$4=$3;$3=$2} {$1=$1} 1' OFS='\t' in
$ perl -lane 'if($F[1]=~/\-/){$F[1]=~s/-/ /}else{splice #F, 1, 0, $F[1];}print "#F" ' temp
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2
[netcrk#o2uk1061 infinys_root]$

isolating the rows based on values of fields in awk

I have a text file like this small example:
small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows base on the 3rd column. in fact if the 3rd column is gene i will keep the entire row and filter out the other rows. here is the expected output:
expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command but the results is empty. do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt
Use double quotes in the script:
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could just:
$ awk '$3 == "gene"'

filtering in a text file using awk

i have a tab separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
in the new file that I will make:
1- I want to select the rows based on the 4th column meaning in the value of 4th column is at least 10, I will keep the entire row otherwise will be filtered out.
2- in the next step the 4th column will be removed.
the result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
to get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
here is another, might work better if you have many more fields
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file

print lines matching criteria from two files with different formats

I am trying to print those lines in file1 that match file2. $1 in file1 has the value that in stored in array c, then looked up in file2 $1:$2. That is the first criteria used to match the lines but not the only one. $5 of file1 must match $4 of file2, if those two criteria are met and $2 in file1 is SNV or INDEL and $3
in file1 is exonic then the matching line from file1 is printed. If the lines do not match then they are skipped. The below awk executes but no output results and in this example, there should be one line. My actual data is several thousand lines all in the below format. Thank you :).
file 1
##reference=hg19
##referenceURI=hg19
# locus type location function coding
chr1:11184539 CNV
chr1:11184573 REF exonic
chr1:11189845 REF exonic
chr2:47630550 SNV intronic
chr4:55141050 SNV exonic synonymous c.1701A>G
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG
file2
chr4 55141050 COSM742 c.1696_1713delAGCCCAGATGGACATGAAinsCGC p.Ser566_Glu571delinsArg
chr4 55141050 COSM12417 c.1697_1711delGCCCAGATGGACATG p.Ser566_Glu571delinsLy
awk
awk 'FNR==3 {next}
# skip first three lines in file1
{FS = OFS = "\t"}
# define input and output as tab-delimited
NR==FNR{c[$1]; next} ($1":"$2) in c && NR==FNR{c[$5];next}$4 in c && $2 ~ /SNV|INDEL/ && $3=="exonic"' file1 file2
# process each line in file matching on criteria
desired output
chr4:55141050 INDEL exonic synonymous c.1697_1711delGCCCAGATGGACATG
Awk solution:
awk 'NR==FNR{ a[$1":"$2]=substr($4,1,6); next }
NF>=5 && $2~/SNV|INDEL/ && $3=="exonic" &&
($1 in a) && a[$1]==substr($5,1,6)' file2 OFS='\t' file1
The output:
chr4:55141050 INDEL exonic nonsynonymous c.1697_1711delGCCCAGATGGACATG

awk to update unknown values in file using range in another

I am trying to modify an awkkindly provided by #karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range of $1 of file1. If there is already a value in $6 other then unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown vales are not updated and I can not seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
here is another script (it's inefficient since does a linear scan instead of more efficient search approaches) but works and simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.