remove field from tab-separated file using awk

I am trying to clean up some tab-delimited files and thought that the awk below would remove field 18, Otherinfo, from the file. I also tried cut but cannot seem to get the desired output. Thank you :).
file
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common Otherinfo
chr1 949654 949654 A G exonic ISG15 . synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . . 1 3825.28 624 chr1 949654 . A G 3825.28 PASS AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
desired output
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common
chr1 949654 949654 A G exonic ISG15 0 synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . .
awk (runs but doesn't remove field 18)
awk '{ $18=""; print }' file1
cut (removes all field except 18)
cut -f18 file1

By default, awk splits input on runs of blanks and joins output fields with a single space. Therefore, you have to tell it to use tabs on both input (FS) and output (OFS):
awk 'BEGIN{FS=OFS="\t"}{$18=""; gsub(/\t\t/,"\t")}1' file1
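If the goal is to delete field 18 entirely rather than leave an empty slot, another common approach is to shift every later field left by one and decrement NF. This is a sketch, not the answer above: decrementing NF is technically unspecified by POSIX but works in gawk and mawk, shown here on a small three-column stand-in deleting field 2 (for the real file the loop would start at i=18):

```shell
printf 'a\tb\tc\n' |
awk 'BEGIN{FS=OFS="\t"}      # read and write tab-separated
     {for(i=2;i<NF;i++)      # shift each later field left by one
        $i=$(i+1)
      NF--}                  # drop the now-duplicated last field
     1'                      # print the rebuilt record: a<TAB>c
```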

awk to skip lines up to and including pattern [duplicate]

I am trying to use awk to skip all lines up to and including one matching the pattern /^#CHROM/ and start processing on the line below it. The awk does execute but currently returns all lines of the tab-delimited file. Thank you :).
file
##INFO=<ID=ANN,Number=1,Type=Integer,Description="My custom annotation">
##source_20170530.1=vcf-annotate(r953) -d key=INFO,ID=ANN,Number=1,Type=Integer,Description=My custom annotation -c CHROM,FROM,TO,INFO/ANN
##INFO=<ID=,Number=A,Type=Float,Description="Variant quality">
#CHROM POS ID REF ALT
chr1 948846 . T TA NA NA
chr2 948852 . T TA NA NA
chr3 948888 . T TA NA NA
awk
awk -F'\t' -v OFS="\t" 'NR>/^#CHROM/ {print $1,$2,$3,$4,$5,"ID=1"$6,"ID=2"$7}' file
desired output
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA
awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' file
See https://stackoverflow.com/a/17914105/1745001 for details on this and other awk search idioms. Yours is a variant of "b" on that page.
Use the following awk approach:
awk -v OFS="\t" '/^#CHROM/{ r=NR }r && NR>r{ $6="ID=1"$6; $7="ID=2"$7; print }' file
The output:
chr1 948846 . T TA ID=1NA ID=2NA
chr2 948852 . T TA ID=1NA ID=2NA
chr3 948888 . T TA ID=1NA ID=2NA
/^#CHROM/{ r=NR } captures the line number of the pattern line; r && NR>r then restricts processing to the lines after it.
An alternative approach would look like this:
awk -v OFS="\t" '/^#CHROM/{ f=1; next }f{ $6="ID=1"$6; $7="ID=2"$7; print }' file
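To see the flag idiom in isolation, here is a self-contained run on an abbreviated stand-in for the file (the three input lines are shortened versions of the sample):

```shell
printf '##meta\n#CHROM\tPOS\nchr1\t948846\n' |
awk 'BEGIN{FS=OFS="\t"}
     f{print $1, $2}       # f is only true after the header line
     /^#CHROM/{f=1}'       # flag is set *after* the print block,
                           # so the #CHROM line itself is skipped
```

This prints only the chr1 data row, never the ## or #CHROM lines.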

awk to update unknown values in file using range in another

I am trying to modify an awk kindly provided by @karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range in $1 of file1. If there is already a value in $6 other than unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown values are not updated, and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
Here is another script; it is inefficient, since it does a linear scan instead of a more efficient search approach, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since it's possible to have more than one match, this one uses the first one found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.
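As a quick end-to-end check, the script can be exercised against miniature versions of the two files, built here from one line of each sample (the temp-dir file names are just for the demo):

```shell
tmp=$(mktemp -d)
# one space-delimited range line from file1
printf 'chr1:4714792-4852594 AJAP1\n' > "$tmp/file1"
# one tab-delimited "unknown" line from file2 that falls in that range
printf 'chr1\t4736396\t4736516\tchr1:4736396-4736516\t.\tunknown\n' > "$tmp/file2"

awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
    rstart[k,c[k]]=a[2]; rend[k,c[k]]=a[3]; value[k,c[k]]=$2; next}
  $6=="unknown" && ($1 in c) {k=$1;
    for(i=1; i<=c[k]; i++)
      if($2>=rstart[k,i] && $3<=rend[k,i]) {$6=value[k,i]; break}}1' \
  "$tmp/file1" "$tmp/file2"    # prints the line with AJAP1 in $6
rm -r "$tmp"
```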

awk to add header to output file

I am trying to filter a file_to_filter by using another filter_file, which is just a list of strings in $1. I think I am close but cannot seem to include the header row in the output. The file_to_filter is tab-delimited as well. Thank you :).
file_to_filter
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 160098543 160098543 G A exonic ATP1A2
chr1 172410967 172410967 G A exonic PIGC
filter_file
PIGC
desired output (header included)
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 172410967 172410967 G A exonic PIGC
awk with current output (header not included)
awk -F'\t' 'NR==1{A[$1];next}$7 in A' file test
chr1 172410967 172410967 G A exonic PIGC
Assuming your fields really are tab-separated:
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' filter_file file_to_filter
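A self-contained run of that command on the sample data (the files are recreated in a temp dir purely for the demo):

```shell
tmp=$(mktemp -d)
printf 'PIGC\n' > "$tmp/filter_file"
printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n' >  "$tmp/file_to_filter"
printf 'chr1\t160098543\t160098543\tG\tA\texonic\tATP1A2\n'     >> "$tmp/file_to_filter"
printf 'chr1\t172410967\t172410967\tG\tA\texonic\tPIGC\n'       >> "$tmp/file_to_filter"

# FNR==1 of the second file is the header, so it always prints;
# data rows print only when $7 was listed in filter_file.
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' \
  "$tmp/filter_file" "$tmp/file_to_filter"
rm -r "$tmp"
```

The output is the header line followed by the PIGC row, with the ATP1A2 row filtered out.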
To start learning awk, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

awk to add closing parenthesis if field begins with opening parenthesis

I have an awk that seemed straight-forward, but I seem to be having a problem. In the file below, if $5 starts with a ( then a ) is added to the end of that string. However, if $5 does not start with a ( then nothing is done. The output is separated by a tab. The awk is almost right but I am not sure how to add the condition to only add a ) if the field starts with a (. Thank you :).
file
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1
awk tried
awk -v OFS='\t' '{print $1,$2,$3,$4,""$5")"}' file
current output
chr7 100490775 100491863 chr7:100490775-100491863 ACHE)
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769)
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
desired output (line 1 and 2 nothing is done but line 3 and 4 have a ) added to the end)
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
$ awk -v OFS='\t' '{p = substr($5,1,1)=="(" ? ")" : ""; $5=$5 p}1' mp.txt
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
Check the first character of the 5th field: if it is (, append a ) to the end; otherwise append the empty string.
By appending something (where one of the somethings is "nothing" :) in all cases, we force awk to reconstitute the record with the defined (tab) output separator, which saves us from having to print the individual fields. The trailing 1 acts as an always-true pattern whose default action is simply to print the reconstituted line.
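An equivalent phrasing guards the assignment with a pattern instead of a ternary; since the input is already tab-separated, lines the guard skips print unchanged. A sketch on two abbreviated rows:

```shell
printf 'chr1\tx\ty\tz\t(ACKR1\nchr7\tx\ty\tz\tACHE\n' |
awk 'BEGIN{FS=OFS="\t"}
     $5 ~ /^\(/ { $5 = $5 ")" }   # append ) only when $5 starts with (
     1'                           # print every line
```

The (ACKR1 row gains a closing parenthesis; the ACHE row passes through untouched.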

separate number range from a file using awk

I have a file with 5 columns and I want to separate the rows using a number range as the criterion. Example:
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 4715104 4837854 NM_001042478 0
chr1 4715104 4843851 NM_018836 0
chr1 3728644 3773797 NM_014704 4.61425
chr1 3773830 3801993 NM_004402 4.39674
chr1 3773830 3801993 NM_001282669 0
chr1 6245079 6259679 NM_000983 75.1769
chr1 6304251 6305638 NM_001024598 0
chr1 6307405 6321035 NM_207370 0.273874
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6266188 6281359 NM_207396 0
chr1 6281252 6296044 NM_012405 14.0752
I want to remove the 0 rows from the list, then sort out the numbers between 0.01 and 0.27, and so on.
I am new to shell programming; can someone help with awk?
Thanks.
As you are new to shell programming, you may not be aware of grep and sort, which would be simpler for this job.
If you are hell-bent on awk as your tool of choice, please just disregard my answer.
I would do it like this:
grep -v '\s0$' file | sort -k 5,5 -g
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6307405 6321035 NM_207370 0.273874
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 3773830 3801993 NM_004402 4.39674
chr1 3728644 3773797 NM_014704 4.61425
chr1 6281252 6296044 NM_012405 14.0752
chr1 6245079 6259679 NM_000983 75.1769
The grep -v inverts the match, keeping only lines that do not end in whitespace followed by a lone zero. The sort then orders the data on column 5, using a general numeric sort (-g) so the decimal values compare correctly.
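The same pipeline on a three-line stand-in (note that \s in a basic regex is a GNU grep extension; [[:space:]], used here, is the portable spelling, and -g is a GNU/BSD sort extension):

```shell
printf 'chr1 1 2 NM_a 0\nchr1 1 2 NM_b 4.6\nchr1 1 2 NM_c 0.27\n' |
grep -v '[[:space:]]0$' |   # drop rows whose 5th column is exactly 0
sort -k 5,5 -g              # general numeric sort on column 5
```

The zero row disappears and the survivors come out in ascending order of column 5.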
If you are trying to select the rows in which $5 is non-zero and within a certain range, then indeed awk makes sense, and the following may be close to what you're after:
awk -v min=0.01 -v max=0.27 '
$5 == 0 { next }
min <= $5 && $5 <= max { print }'
Here, the call to awk has been parameterized to suggest how these few lines can be adapted for more general usage.
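For example, feeding three of the sample rows through it keeps only the one whose fifth column falls inside [0.01, 0.27]; the zero row is discarded by the first rule before the range test runs:

```shell
printf 'chr1 6161846 6240194 NM_015557 0.0149477\nchr1 6307405 6321035 NM_207370 0.273874\nchr1 4715104 4837854 NM_001042478 0\n' |
awk -v min=0.01 -v max=0.27 '
  $5 == 0                { next }    # discard zero rows outright
  min <= $5 && $5 <= max { print }'  # keep rows inside the range
```

Only the NM_015557 row survives: 0.273874 exceeds max, and the 0 row is skipped.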