remove field from tab-separated file using awk

I am trying to clean up some tab-delimited files and thought that the awk below would remove field 18, Otherinfo, from the file. I also tried cut but cannot seem to get the desired output. Thank you :).
file
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common Otherinfo
chr1 949654 949654 A G exonic ISG15 . synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . . 1 3825.28 624 chr1 949654 . A G 3825.28 PASS AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
desired output
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common
chr1 949654 949654 A G exonic ISG15 0 synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . .
awk (runs but doesn't remove field 18)
awk '{ $18=""; print }' file1
cut (removes all field except 18)
cut -f18 file1

By default, awk splits input on runs of blanks and joins output fields with a single space. Therefore, you have to tell it to use tabs on both input (FS) and output (OFS):
awk 'BEGIN{FS=OFS="\t"}{$18=""; gsub(/\t\t/,"\t")}1' file1
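If the goal is to delete field 18 entirely rather than leave an empty slot, another common approach is to shift every later field left by one and decrement NF. This is a sketch, not the answer above: decrementing NF is technically unspecified by POSIX but works in gawk and mawk, shown here on a small three-column stand-in deleting field 2 (for the real file the loop would start at i=18):

```shell
printf 'a\tb\tc\n' |
awk 'BEGIN{FS=OFS="\t"}      # read and write tab-separated
     {for(i=2;i<NF;i++)      # shift each later field left by one
        $i=$(i+1)
      NF--}                  # drop the now-duplicated last field
     1'                      # print the rebuilt record: a<TAB>c
```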

awk to skip lines up to and including pattern [duplicate]

I am trying to use awk to skip all lines up to and including one matching the pattern /^#CHROM/ and start processing on the line below it. The awk does execute but currently returns all lines of the tab-delimited file. Thank you :).
file
##INFO=<ID=ANN,Number=1,Type=Integer,Description="My custom annotation">
##source_20170530.1=vcf-annotate(r953) -d key=INFO,ID=ANN,Number=1,Type=Integer,Description=My custom annotation -c CHROM,FROM,TO,INFO/ANN
##INFO=<ID=,Number=A,Type=Float,Description="Variant quality">
#CHROM POS ID REF ALT
chr1 948846 . T TA NA NA
chr2 948852 . T TA NA NA
chr3 948888 . T TA NA NA
awk
awk -F'\t' -v OFS="\t" 'NR>/^#CHROM/ {print $1,$2,$3,$4,$5,"ID=1"$6,"ID=2"$7}' file
desired output
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA
awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' file
See https://stackoverflow.com/a/17914105/1745001 for details on this and other awk search idioms. Yours is a variant of "b" on that page.
Use the following awk approach:
awk -v OFS="\t" '/^#CHROM/{ r=NR }r && NR>r{ $6="ID=1"$6; $7="ID=2"$7; print }' file
The output:
chr1 948846 . T TA ID=1NA ID=2NA
chr2 948852 . T TA ID=1NA ID=2NA
chr3 948888 . T TA ID=1NA ID=2NA
/^#CHROM/{ r=NR } captures the line number of the pattern line; r && NR>r then restricts processing to the lines after it.
An alternative approach would look like this:
awk -v OFS="\t" '/^#CHROM/{ f=1; next }f{ $6="ID=1"$6; $7="ID=2"$7; print }' file
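To see the flag idiom in isolation, here is a self-contained run on an abbreviated stand-in for the file (the three input lines are shortened versions of the sample):

```shell
printf '##meta\n#CHROM\tPOS\nchr1\t948846\n' |
awk 'BEGIN{FS=OFS="\t"}
     f{print $1, $2}       # f is only true after the header line
     /^#CHROM/{f=1}'       # flag is set *after* the print block,
                           # so the #CHROM line itself is skipped
```

This prints only the chr1 data row, never the ## or #CHROM lines.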

awk to update unknown values in file using range in another

I am trying to modify an awk kindly provided by @karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range in $1 of file1. If there is already a value in $6 other than unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown values are not updated, and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
Here is another script; it is inefficient, since it does a linear scan instead of a more efficient search approach, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since it's possible to have more than one match, this one uses the first one found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.
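As a quick end-to-end check, the script can be exercised against miniature versions of the two files, built here from one line of each sample (the temp-dir file names are just for the demo):

```shell
tmp=$(mktemp -d)
# one space-delimited range line from file1
printf 'chr1:4714792-4852594 AJAP1\n' > "$tmp/file1"
# one tab-delimited "unknown" line from file2 that falls in that range
printf 'chr1\t4736396\t4736516\tchr1:4736396-4736516\t.\tunknown\n' > "$tmp/file2"

awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
    rstart[k,c[k]]=a[2]; rend[k,c[k]]=a[3]; value[k,c[k]]=$2; next}
  $6=="unknown" && ($1 in c) {k=$1;
    for(i=1; i<=c[k]; i++)
      if($2>=rstart[k,i] && $3<=rend[k,i]) {$6=value[k,i]; break}}1' \
  "$tmp/file1" "$tmp/file2"    # prints the line with AJAP1 in $6
rm -r "$tmp"
```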

awk to add header to output file

I am trying to filter a file_to_filter by using another filter_file, which is just a list of strings in $1. I think I am close but cannot seem to include the header row in the output. The file_to_filter is tab-delimited as well. Thank you :).
file_to_filter
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 160098543 160098543 G A exonic ATP1A2
chr1 172410967 172410967 G A exonic PIGC
filter_file
PIGC
desired output (header included)
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 172410967 172410967 G A exonic PIGC
awk with current output (header not included)
awk -F'\t' 'NR==1{A[$1];next}$7 in A' file test
chr1 172410967 172410967 G A exonic PIGC
Assuming your fields really are tab-separated:
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' filter_file file_to_filter
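A self-contained run of that command on the sample data (the files are recreated in a temp dir purely for the demo):

```shell
tmp=$(mktemp -d)
printf 'PIGC\n' > "$tmp/filter_file"
printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n' >  "$tmp/file_to_filter"
printf 'chr1\t160098543\t160098543\tG\tA\texonic\tATP1A2\n'     >> "$tmp/file_to_filter"
printf 'chr1\t172410967\t172410967\tG\tA\texonic\tPIGC\n'       >> "$tmp/file_to_filter"

# FNR==1 of the second file is the header, so it always prints;
# data rows print only when $7 was listed in filter_file.
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' \
  "$tmp/filter_file" "$tmp/file_to_filter"
rm -r "$tmp"
```

The output is the header line followed by the PIGC row, with the ATP1A2 row filtered out.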
To start learning awk, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

awk to add closing parenthesis if field begins with opening parenthesis

I have an awk that seemed straight-forward, but I seem to be having a problem. In the file below, if $5 starts with a ( then a ) is added to the end of that string. However, if $5 does not start with a ( then nothing is done. The output is separated by a tab. The awk is almost right but I am not sure how to add the condition to only add a ) if the field starts with a (. Thank you :).
file
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1
awk tried
awk -v OFS='\t' '{print $1,$2,$3,$4,""$5")"}' file
current output
chr7 100490775 100491863 chr7:100490775-100491863 ACHE)
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769)
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
desired output (line 1 and 2 nothing is done but line 3 and 4 have a ) added to the end)
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
$ awk -v OFS='\t' '{p = substr($5,1,1)=="(" ? ")" : ""; $5=$5 p}1' mp.txt
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
Check the first character of the 5th field: if it is (, append a ) to the end; otherwise append the empty string.
By appending something (where one of the somethings is "nothing" :) in all cases, we force awk to reconstitute the record with the defined (tab) output separator, which saves us from having to print the individual fields. The trailing 1 acts as an always-true pattern whose default action is simply to print the reconstituted line.
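An equivalent phrasing guards the assignment with a pattern instead of a ternary; since the input is already tab-separated, lines the guard skips print unchanged. A sketch on two abbreviated rows:

```shell
printf 'chr1\tx\ty\tz\t(ACKR1\nchr7\tx\ty\tz\tACHE\n' |
awk 'BEGIN{FS=OFS="\t"}
     $5 ~ /^\(/ { $5 = $5 ")" }   # append ) only when $5 starts with (
     1'                           # print every line
```

The (ACKR1 row gains a closing parenthesis; the ACHE row passes through untouched.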

separate number range from a file using awk

I have a file with 5 columns and I want to separate the rows using a number range as the criterion. Example:
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 4715104 4837854 NM_001042478 0
chr1 4715104 4843851 NM_018836 0
chr1 3728644 3773797 NM_014704 4.61425
chr1 3773830 3801993 NM_004402 4.39674
chr1 3773830 3801993 NM_001282669 0
chr1 6245079 6259679 NM_000983 75.1769
chr1 6304251 6305638 NM_001024598 0
chr1 6307405 6321035 NM_207370 0.273874
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6266188 6281359 NM_207396 0
chr1 6281252 6296044 NM_012405 14.0752
I want to remove the 0 rows from the list, then sort out the numbers between 0.01 and 0.27, and so on.
I am new to shell programming; can someone help with awk?
Thanks.
As you are new to shell programming, you may not be aware of grep and sort, which would be simpler for this job.
If you are hell-bent on awk as your tool of choice, please just disregard my answer.
I would do it like this:
grep -v '\s0$' file | sort -k 5,5 -g
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6307405 6321035 NM_207370 0.273874
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 3773830 3801993 NM_004402 4.39674
chr1 3728644 3773797 NM_014704 4.61425
chr1 6281252 6296044 NM_012405 14.0752
chr1 6245079 6259679 NM_000983 75.1769
The grep -v inverts the match, keeping only lines that do not end in whitespace followed by a lone zero. The sort then orders the data on column 5, using a general numeric sort (-g) so the decimal values compare correctly.
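The same pipeline on a three-line stand-in (note that \s in a basic regex is a GNU grep extension; [[:space:]], used here, is the portable spelling, and -g is a GNU/BSD sort extension):

```shell
printf 'chr1 1 2 NM_a 0\nchr1 1 2 NM_b 4.6\nchr1 1 2 NM_c 0.27\n' |
grep -v '[[:space:]]0$' |   # drop rows whose 5th column is exactly 0
sort -k 5,5 -g              # general numeric sort on column 5
```

The zero row disappears and the survivors come out in ascending order of column 5.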
If you are trying to select the rows in which $5 is non-zero and within a certain range, then indeed awk makes sense, and the following may be close to what you're after:
awk -v min=0.01 -v max=0.27 '
$5 == 0 { next }
min <= $5 && $5 <= max { print }'
Here, the call to awk has been parameterized to suggest how these few lines can be adapted for more general usage.
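For example, feeding three of the sample rows through it keeps only the one whose fifth column falls inside [0.01, 0.27]; the zero row is discarded by the first rule before the range test runs:

```shell
printf 'chr1 6161846 6240194 NM_015557 0.0149477\nchr1 6307405 6321035 NM_207370 0.273874\nchr1 4715104 4837854 NM_001042478 0\n' |
awk -v min=0.01 -v max=0.27 '
  $5 == 0                { next }    # discard zero rows outright
  min <= $5 && $5 <= max { print }'  # keep rows inside the range
```

Only the NM_015557 row survives: 0.273874 exceeds max, and the 0 row is skipped.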