In the awk I am splitting on the space or : after the chrxx (it is not consistent so I added both as FS, then splitting on the -. I can not seem to duplicate $2 if there is no - after it. Lines 2,3 are examples. If there is a - after the number then the value to the right of it is $3 in the ouput. The awk seems close but isn't duplicating the value. Thank you :).
in
chr17 7124137-7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
awk
awk -F"[ :-]" '$3=$3?$3:$2' OFS='\t' in
current
chr17 7124137 7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
desired output
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2
If number of fields is three, copy 3rd field to 4th, and 2nd to 3rd. Force recomputing of whole record to make output tab separated regardless of what's done before.
awk -F'[ :-]' 'NF==3{$4=$3;$3=$2} {$1=$1} 1' OFS='\t' in
$ perl -lane 'if($F[1]=~/\-/){$F[1]=~s/-/ /}else{splice #F, 1, 0, $F[1];}print "#F" ' temp
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2
[netcrk#o2uk1061 infinys_root]$
Related
I have 2 tab separated file like these 2 examples:
file1:
chr10 103912167 103917248 NOLC1 ENST00000603742.1
chr16 18573197 18558622 NOMO2 ENST00000543392.1
chr1 120611947 120572610 NOTCH2 ENST00000256646.2
file2:
chr16 18573197 18558622 NOMO2 ENST00000543392.1
chr1 120611947 120572610 NOTCH2 ENST00000256646.2
chr1 145209308 145248834 NOTCH2NL ENST00000344859.3
based on the 4th column, I want to isolate the rows in the first file that are not present in the 2nd file. here is the expected output:
expected output:
chr10 103912167 103917248 NOLC1 ENST00000603742.1
I am doing that in AWK using the following command:
awk 'NR==FNR{a[$4]!=$4;next}a[$4]' file1 file2 > results.txt
but it does not return what I want. do you know how to fix the command?
awk 'NR==FNR{a[$4]=1;next}!a[$4]' file2 file1
#=> chr10 103912167 103917248 NOLC1 ENST00000603742.1
Since you want output content from file1 based on file2, you should read file2 first.
Note if file2 could be empty, you should change to different file checking methods, like ARGIND==1 for GNU awk, or FILENAME=="file2" etc.
In the below awk I am trying to match $2 in file1 up until the ., with $4 in file2 up to the first undescore _. If a match is found then that portion of file2 is up dated with the matching $1 value in file1. I think it is close but not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the below format and a match may not always be found. The awk as is does execute but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
Following awk may help you on same. Also you could change you FS field separator as per your Input_file(s) too, eg--> Input_file1 is space delimited then use FS=" " before it and Input_file2 is TAB delimited then use FS="\t" before its name.
awk '
FNR==NR{
val=$2;
sub(/\..*/,"",val);
a[val]=$1;
next
}
{
split($4,array,"_")
}
((array[1]"_"array[2]) in a){
sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);
print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
I am trying to modify an awkkindly provided by #karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range of $1 of file1. If there is already a value in $6 other then unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown vales are not updated and I can not seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
here is another script (it's inefficient since does a linear scan instead of more efficient search approaches) but works and simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.
I am trying to filter a file_to_filter by using another filter_file, which is just a list of strings in $1. I think I am close but can not seem to include the header row in the output. The file_to_filter is tab delimited as well. Thank you :).
file_to_filter
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 160098543 160098543 G A exonic ATP1A2
chr1 172410967 172410967 G A exonic PIGC
filter_file
PIGC
desired output (header included)
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 172410967 172410967 G A exonic PIGC
awk with current output (header not included)
awk -F'\t' 'NR==1{A[$1];next}$7 in A' file test
chr1 172410967 172410967 G A exonic PIGC
Assuming your fields really are tab-separated:
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' filter_file file_to_filter
To start learning awk, read the book Effective Awk Programing, 4th Edition, by Arnold Robbins.
I have an awk that seemed straight-forward, but I seem to be having a problem. In the file below if $5 starts with a ( then to that string a ) is added at the end. However if$5does not start with a(then nothing is done. The out is separated by a tab. Theawkis almost right but I am not sure how to add the condition to only add a)if the field starts with a(`. Thank you :).
file
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1
awk tried
awk -v OFS='\t' '{print $1,$2,$3,$4,""$5")"}' file
current output
chr7 100490775 100491863 chr7:100490775-100491863 ACHE)
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769)
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
desired output (line 1 and 2 nothing is done but line 3 and 4 have a ) added to the end)
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
$ awk -v OFS='\t' '{p = substr($5,1,1)=="(" ? ")" : ""; $5=$5 p}1' mp.txt
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
Check the first character of the 5th field. If it is ( append a ) to the end, otherwise append the empty string.
By appending something (where one of the somethings is "nothing" :) in all cases, we force awk to reconstitute the record with the defined (tab) output separator, which saves us from having to print the individual fields. The trailing 1 acts as an always-true pattern whose default action is simply to print the reconstituted line.