awk to skip lines up to and including pattern [duplicate] - awk

This question already has answers here:
Printing with sed or awk a line following a matching pattern
(9 answers)
Closed 5 years ago.
I am trying to use awk to skip all lines including a specific pattern /^#CHROM/ and start processing on the line below. The awk does execute but currently returns all lines in the tab-delimited file. Thank you :).
file
##INFO=<ID=ANN,Number=1,Type=Integer,Description="My custom annotation">
##source_20170530.1=vcf-annotate(r953) -d key=INFO,ID=ANN,Number=1,Type=Integer,Description=My custom annotation -c CHROM,FROM,TO,INFO/ANN
##INFO=<ID=,Number=A,Type=Float,Description="Variant quality">
#CHROM POS ID REF ALT
chr1 948846 . T TA NA NA
chr2 948852 . T TA NA NA
chr3 948888 . T TA NA NA
awk
awk -F'\t' -v OFS="\t" 'NR>/^#CHROM/ {print $1,$2,$3,$4,$5,"ID=1"$6,"ID=2"$7}' file
desiered output
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA

awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' file
See https://stackoverflow.com/a/17914105/1745001 for details on this and other awk search idioms. Yours is a variant of "b" on that page.

Use the following awk approach:
awk -v OFS="\t" '/^#CHROM/{ r=NR }r && NR>r{ $6="ID=1"$6; $7="ID=2"$7; print }' file
The output:
chr1 948846 . T TA ID=1NA ID=2NA
chr2 948852 . T TA ID=1NA ID=2NA
chr3 948888 . T TA ID=1NA ID=2NA
/^#CHROM/{ r=NR } - capturing the pattern line number
The alternative approach would look as below:
awk -v OFS="\t" '/^#CHROM/{ f=1; next }f{ $6="ID=1"$6; $7="ID=2"$7; print }' file

Related

filtering in a text file using awk

i have a tab separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
in the new file that I will make:
1- I want to select the rows based on the 4th column meaning in the value of 4th column is at least 10, I will keep the entire row otherwise will be filtered out.
2- in the next step the 4th column will be removed.
the result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
to get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
here is another, might work better if you have many more fields
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file

awk to update file based on matching lines with split

In the below awk I am trying to match $2 in file1 up until the ., with $4 in file2 up to the first undescore _. If a match is found then that portion of file2 is up dated with the matching $1 value in file1. I think it is close but not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the below format and a match may not always be found. The awk as is does execute but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
Following awk may help you on same. Also you could change you FS field separator as per your Input_file(s) too, eg--> Input_file1 is space delimited then use FS=" " before it and Input_file2 is TAB delimited then use FS="\t" before its name.
awk '
FNR==NR{
val=$2;
sub(/\..*/,"",val);
a[val]=$1;
next
}
{
split($4,array,"_")
}
((array[1]"_"array[2]) in a){
sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);
print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f

awk to update unknown values in file using range in another

I am trying to modify an awkkindly provided by #karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range of $1 of file1. If there is already a value in $6 other then unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown vales are not updated and I can not seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
here is another script (it's inefficient since does a linear scan instead of more efficient search approaches) but works and simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.

How awk the filename as a column in the output?

I am trying to perform some grep in contents of several files in a directory and appending my grep match in a single file, in my output I would also want a column which will have the filename as well to understand from which files that entry was picked up. I was trying to use awk for the same but it did not work.
for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done
files names are like this , I have around 50 files
48hrs_CT_merged_peaks_2.5kb.txt
48hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_CT_merged_peaks_2.5kb.txt
5D_CT_merged_peaks_2.5kb.txt
5D_TAMO_merged_peaks_2.5kb.txt
each file contents several lines
chr1 3663275 3663483 14 2.55788 2.99631 1.40767 NM_001011874 -
chr1 4481687 4488063 264 7.85098 28.25170 26.41094 NM_011441 -
chr1 5008006 5013929 243 8.20677 26.17854 24.37907 NM_021374 -
chr1 5578362 5579949 65 3.48568 7.83501 6.57570 NM_011011 +
chr1 5905702 5908002 148 5.84647 16.53171 14.88463 NM_010342 -
chr1 9288507 9290352 77 4.04459 9.12442 7.77642 NM_027671 -
chr1 9291742 9292528 142 5.74749 16.21792 14.28185 NM_027671 -
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_021511 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_175236 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NR_027664 +
When I am getting a match for "NM_001080771" I am printing the entire content of that line to a new file and for each file this operation is being done and appending the match to one output file. I also want to add a column with filename as shown above in the final output so that I know from which file I am getting the entries.
desired output
chr4 21610972 21618492 193 7.28409 21.01724 19.35525 NM_001080771 - 48hrs_CT
chr4 21605096 21618696 76 4.22442 9.32981 7.68131 NM_001080771 - 48hrs_TAMO
chr4 21604864 21618713 12 1.78194 2.36793 1.25883 NM_001080771 - 72hrs_CT
chr4 21610305 21615717 26 2.90579 4.47333 2.65353 NM_001080771 - 72hrs_TAMO
chr4 21609924 21618600 23 2.63778 4.0642 2.33685 NM_001080771 - 5D_CT
chr4 21609936 21618680 30 5.63778 3.0642 8.33685 NM_001080771 - 5D_TAMO
This is not working. I want to basically append a column where the filename should also get added as an entry either first or the last column. How to do that?
or you can do all in awk
awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt
this trims the filename in the desired format
$ awk '/NM_001080771/{sub(/_merged_peaks_2.5kb.txt/,"",FILENAME);
print $0, FILENAME}' *_2.5kb.txt
As long as the number of files is not huge, why not just:
grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'
If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:
#!/bin/sh
for i in *_2.5kb.txt; do
< $i grep "NM_001080771" | \
awk -v where=`basename $i` '{print $0,where}'
done
./thatscript | head > prom_genes_2.5kb.txt
Here we are using awk's -v VAR=VALUE command line feature to pass in the filename (because we are using stdin we don't have anything useful in awk's built-in FILENAME variable).
You can also use such a loop around #karakfa's elegant awk-only approach:
#!/bin/sh
for i in *_2.5kb.txt; do
awk '/NM_001080771/ {print $0, FILENAME}' $i
done
And finally, here's a version with the desired filename munging:
#!/bin/sh
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done
(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
Bonus
Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:
#!/bin/sh
what=${1?Need search string}
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} /${what}/' {print $0, TAG}' $i
done
./thatscript NM_001080771 | head > prom_genes_2.5kb.txt
YET ANOTHER EDIT
Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:
#!/bin/sh
shopt -s nullglob
what="${1?Need search string}"
filematch="*_2.5kb.txt"
trimsuffix="_merged_peaks_2.5kb.txt"
for filename in $filematch; do
awk -v tag="${filename%${trimsuffix}}" \
-v what="${what}" \
'$0 ~ what {print $0, tag}' $filename
done

how to compare the columns in two files using awk

I want to compare two files using awk command, with the File 1 and 2 containing the following information. The File 1 is the nucleotide positions as can be seen in the column 2 of File 2.
Now I need an awk command to compare the column (only one cloumn is present) in File 1 to the column 2 of File 2 and if a match is found, print the whole line in File 2 to File 3
File 1
7113528
8680847
File 2
chromosome01 6765006 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1505||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||262||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 6765043 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1468||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||225||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
chromosome01 7113583 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2202||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3982|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3982|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4488||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 51,0,0
chromosome01 7427540 . C T 22.8 . DP=3;RPB=8.745357e-01;AF1=1;AC1=2;DP4=0,2,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1091|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1113|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2841||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4859|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Aca|P35T|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1091|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1113|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1844||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2862|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||3028||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4863|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4872|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 51,0,0
chromosome01 7427583 . T C 26.1 . DP=3;RPB=-9.668049e-01;AF1=1;AC1=2;DP4=0,1,0,1;MQ=42;FQ=-28;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1134|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1156|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2798||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4902|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),SYNONYMOUS_CODING(LOW|SILENT|ggC/ggG|G20|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1134|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1156|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1801||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2905|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||2985||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4906|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4915|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 55,1,0
You can use this awk:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
To redirect to another file:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2 > f3
Explanation
FNR==NR {a[$1]; next} loop through the first file storing the values in the array a[].
$2 in a if 2nd column of 2nd file is present in the array a[], then this is true and the full line is printed.
Test
$ awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
You can use grep:
grep -f file1 file2 > outputfile
The -f option tells grep to read the patterns from a file, one per line.
Note: Thanks to #fedorqui for pointing out that there can be problems if one of the patterns in file1 appears in another column in file2.