awk to skip lines up to and including pattern [duplicate] - awk

I am trying to use awk to skip all lines up to and including the one matching /^#CHROM/ and start processing on the line below it. The awk does execute but currently returns all lines in the tab-delimited file. Thank you :).
##INFO=<ID=ANN,Number=1,Type=Integer,Description="My custom annotation">
##source_20170530.1=vcf-annotate(r953) -d key=INFO,ID=ANN,Number=1,Type=Integer,Description=My custom annotation -c CHROM,FROM,TO,INFO/ANN
##INFO=<ID=,Number=A,Type=Float,Description="Variant quality">
chr1 948846 . T TA NA NA
chr2 948852 . T TA NA NA
chr3 948888 . T TA NA NA
awk -F'\t' -v OFS="\t" 'NR>/^#CHROM/ {print $1,$2,$3,$4,$5,"ID=1"$6,"ID=2"$7}' file
desired output
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA

awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' file
See for details on this and other awk search idioms. Yours is a variant of "b" on that page.

Use the following awk approach:
awk -v OFS="\t" '/^#CHROM/{ r=NR }r && NR>r{ $6="ID1="$6; $7="ID2="$7; print }' file
The output:
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA
/^#CHROM/{ r=NR } - captures the line number of the pattern line; r && NR>r then selects only the lines after it
An alternative approach would look as below:
awk -v OFS="\t" '/^#CHROM/{ f=1; next }f{ $6="ID1="$6; $7="ID2="$7; print }' file
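To try the flag idiom from the answers above in isolation, here is a minimal sketch. The sample input is an assumption: it includes a #CHROM header line (which the question describes but the excerpt does not show), the column names after ALT are made up, and fields are separated by tabs.

```shell
# Build a small tab-delimited sample; the #CHROM header line and the
# COL6/COL7 column names are assumed for illustration.
sample=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\tALT\tCOL6\tCOL7\n' >> "$sample"
printf 'chr1\t948846\t.\tT\tTA\tNA\tNA\n' >> "$sample"
printf 'chr2\t948852\t.\tT\tTA\tNA\tNA\n' >> "$sample"

# f is unset (false) until the header is seen. The print rule comes before
# the rule that sets f, so the header line itself is never printed.
result=$(awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the two chr lines should come out, with ID1= and ID2= prefixed to the last two fields.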


filtering in a text file using awk

I have a tab-separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
In the new file that I will make:
1- I want to select rows based on the 4th column: if the value in the 4th column is at least 10, I keep the entire row; otherwise it is filtered out.
2- In the next step the 4th column is removed.
The result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
To get such results I have tried the following awk code:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
But I do not know how to implement the filtering step. Do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
Here is another approach, which might work better if you have many more fields (note that \w is a GNU awk extension):
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file
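As a portable variation on the second answer, the sketch below avoids \w by matching "a tab followed by non-tab characters" at the end of the line. The rows are the sample from the question; the temporary file is only for illustration.

```shell
input=$(mktemp)
printf 'chr1\t100499714\t100499715\t1\n'  >  "$input"
printf 'chr1\t100502177\t100502178\t10\n' >> "$input"
printf 'chr1\t100502181\t100502182\t2\n'  >> "$input"
printf 'chr1\t100502191\t100502192\t18\n' >> "$input"
printf 'chr1\t100502203\t100502204\t45\n' >> "$input"

# Keep rows whose last field is at least 10, then strip that last field
# (the final tab plus everything after it) before printing.
filtered=$(awk -F'\t' '$NF >= 10 { sub(/\t[^\t]*$/, ""); print }' "$input")
printf '%s\n' "$filtered"
rm -f "$input"
```

Because the substitution works on the end of the line rather than on named fields, it does not need to change when more columns are added before the last one.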

awk to update file based on matching lines with split

In the awk below I am trying to match $2 in file1, up to the ., with $4 in file2, up to the first underscore _. If a match is found, that portion of file2 is updated with the matching $1 value in file1. I think it is close, but I am not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the format below, and a match may not always be found. The awk as is does execute but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
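The following awk may help. You can also set the field separator per input file on the command line, e.g. FS=" " before Input_file1 (space-delimited) and FS="\t" before Input_file2 (tab-delimited).
awk -v OFS='\t' '
FNR==NR{ split($2,b,/\./); a[b[1]]=$1; next }   # file1: drop the version after the ., map accession to gene
{ split($4,array,"_") }                          # file2: array[1]"_"array[2] is the accession, e.g. NM_003243
((array[1]"_"array[2]) in a){ sub(/^[^_]+_[^_]+/, a[array[1]"_"array[2]], $4); print }
' Input_file1 Input_file2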
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f

awk to update unknown values in file using range in another

I am trying to modify an awk kindly provided by #karakfa to update all the unknown values in $6 of file2 when the $4 value in file2 is within the range given in $1 of file1. If $6 already holds a value other than unknown, the line is skipped and the next line is processed. The final output of my awk attempt below is 6 tab-delimited fields. Currently the awk runs, but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
Here is another script. It is inefficient, since it does a linear scan instead of a faster search, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2]; rend[k,c[k]]=a[3]; value[k,c[k]]=$2; next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.

How to add the filename as a column in the output with awk?

I am trying to grep the contents of several files in a directory and append the matches to a single file. In the output I also want a column with the filename, so that I know which file each entry came from. I tried to use awk for this, but it did not work.
for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done
The file names are like this; I have around 50 files
Each file contains several lines
chr1 3663275 3663483 14 2.55788 2.99631 1.40767 NM_001011874 -
chr1 4481687 4488063 264 7.85098 28.25170 26.41094 NM_011441 -
chr1 5008006 5013929 243 8.20677 26.17854 24.37907 NM_021374 -
chr1 5578362 5579949 65 3.48568 7.83501 6.57570 NM_011011 +
chr1 5905702 5908002 148 5.84647 16.53171 14.88463 NM_010342 -
chr1 9288507 9290352 77 4.04459 9.12442 7.77642 NM_027671 -
chr1 9291742 9292528 142 5.74749 16.21792 14.28185 NM_027671 -
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_021511 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_175236 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NR_027664 +
When I am getting a match for "NM_001080771" I am printing the entire content of that line to a new file and for each file this operation is being done and appending the match to one output file. I also want to add a column with filename as shown above in the final output so that I know from which file I am getting the entries.
desired output
chr4 21610972 21618492 193 7.28409 21.01724 19.35525 NM_001080771 - 48hrs_CT
chr4 21605096 21618696 76 4.22442 9.32981 7.68131 NM_001080771 - 48hrs_TAMO
chr4 21604864 21618713 12 1.78194 2.36793 1.25883 NM_001080771 - 72hrs_CT
chr4 21610305 21615717 26 2.90579 4.47333 2.65353 NM_001080771 - 72hrs_TAMO
chr4 21609924 21618600 23 2.63778 4.0642 2.33685 NM_001080771 - 5D_CT
chr4 21609936 21618680 30 5.63778 3.0642 8.33685 NM_001080771 - 5D_TAMO
This is not working. Basically, I want to append a column with the filename, either as the first or the last column. How can I do that?
Or you can do it all in awk:
awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt
This trims the filename to the desired format (the dots are escaped so they match literally):
$ awk '/NM_001080771/{sub(/_merged_peaks_2\.5kb\.txt$/,"",FILENAME);
print $0, FILENAME}' *_2.5kb.txt
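The awk-only approach can be tried on throwaway data with the sketch below. The two file names and their shortened three-column contents are assumptions modelled on the question. Unlike the command above, it copies FILENAME into a separate variable (tag, a name chosen here) once per file instead of editing the built-in.

```shell
dir=$(mktemp -d)
# Hypothetical files following the *_merged_peaks_2.5kb.txt naming.
printf 'chr4\t21610972\tNM_001080771\n' >  "$dir/48hrs_CT_merged_peaks_2.5kb.txt"
printf 'chr1\t3663275\tNM_001011874\n'  >> "$dir/48hrs_CT_merged_peaks_2.5kb.txt"
printf 'chr4\t21605096\tNM_001080771\n' >  "$dir/48hrs_TAMO_merged_peaks_2.5kb.txt"

tagged=$(cd "$dir" && awk -v OFS='\t' '
  FNR == 1 { tag = FILENAME; sub(/_merged_peaks_2\.5kb\.txt$/, "", tag) }  # runs once per file
  /NM_001080771/ { print $0, tag }                                          # append the trimmed name
' *_2.5kb.txt)
printf '%s\n' "$tagged"
rm -rf "$dir"
```

Computing the tag at FNR == 1 keeps the per-line rule to a plain print, which may matter little for 50 files but keeps the intent easy to read.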
As long as the number of files is not huge, why not just:
grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'
If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:
for i in *_2.5kb.txt; do
< $i grep "NM_001080771" | \
awk -v where=`basename $i` '{print $0,where}'
done
./thatscript | head > prom_genes_2.5kb.txt
Here we are using awk's -v VAR=VALUE command line feature to pass in the filename (because we are using stdin we don't have anything useful in awk's built-in FILENAME variable).
You can also use such a loop around #karakfa's elegant awk-only approach:
for i in *_2.5kb.txt; do
awk '/NM_001080771/ {print $0, FILENAME}' $i
done
And finally, here's a version with the desired filename munging:
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done
(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
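A quick way to see what the % expansion does is to run it on a single name. The file name below is an assumption based on the suffix used in the answers.

```shell
name=48hrs_CT_merged_peaks_2.5kb.txt      # hypothetical file name
tag=${name%_merged_peaks_2.5kb.txt}       # strip the literal suffix
echo "$tag"

# With a glob pattern, % removes the shortest matching suffix and %% the longest.
short=${name%_*}
long=${name%%_*}
echo "$short $long"
```

Here tag becomes 48hrs_CT, short becomes 48hrs_CT_merged_peaks, and long becomes 48hrs.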
Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:
what=${1?Need search string}
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} /${what}/' {print $0, TAG}' $i
done
./thatscript NM_001080771 | head > prom_genes_2.5kb.txt
Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:
shopt -s nullglob
filematch='*_2.5kb.txt'
trimsuffix='_merged_peaks_2.5kb.txt'
what="${1?Need search string}"
for filename in $filematch; do
awk -v tag="${filename%${trimsuffix}}" \
-v what="${what}" \
'$0 ~ what {print $0, tag}' $filename
done

how to compare the columns in two files using awk

I want to compare two files using awk. File 1 contains nucleotide positions, which also appear in column 2 of File 2.
I need an awk command that compares the single column of File 1 with column 2 of File 2 and, if a match is found, prints the whole line of File 2 to File 3.
File 1
File 2
chromosome01 6765006 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1505||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||262||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 6765043 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1468||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||225||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
chromosome01 7113583 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2202||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3982|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3982|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4488||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 51,0,0
chromosome01 7427540 . C T 22.8 . DP=3;RPB=8.745357e-01;AF1=1;AC1=2;DP4=0,2,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1091|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1113|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2841||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4859|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Aca|P35T|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1091|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1113|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1844||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2862|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||3028||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4863|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4872|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 51,0,0
chromosome01 7427583 . T C 26.1 . DP=3;RPB=-9.668049e-01;AF1=1;AC1=2;DP4=0,1,0,1;MQ=42;FQ=-28;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1134|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1156|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2798||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4902|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),SYNONYMOUS_CODING(LOW|SILENT|ggC/ggG|G20|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1134|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1156|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1801||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2905|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||2985||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4906|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4915|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 55,1,0
You can use this awk:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
To redirect to another file:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2 > f3
FNR==NR {a[$1]; next} loops through the first file, storing its values as keys of the array a[].
$2 in a is true when the 2nd column of the 2nd file is present in the array a[]; the full line is then printed.
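The two-file idiom can be checked on a reduced example. The positions in the first file and the shortened four-column records are assumptions made for illustration, since the real File 1 is not shown.

```shell
f1=$(mktemp); f2=$(mktemp)
printf '7113528\n7427583\n' > "$f1"                     # positions to keep (assumed)
printf 'chromosome01\t6765006\tC\tT\n'       >  "$f2"    # abbreviated records
printf 'chromosome01\t7113528\tGACAC\tGAC\n' >> "$f2"
printf 'chromosome01\t7427583\tT\tC\n'       >> "$f2"

# While the first file is read, FNR equals NR, so each position becomes a key.
# For the second file only the bare pattern runs; a true pattern with no
# action prints the line.
matched=$(awk 'FNR==NR {a[$1]; next} $2 in a' "$f1" "$f2")
printf '%s\n' "$matched"
rm -f "$f1" "$f2"
```

One caveat: if the first file is empty, FNR==NR also holds while the second file is read, so nothing is printed.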
$ awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
You can use grep:
grep -f file1 file2 > outputfile
The -f option tells grep to read the patterns from a file, one per line.
Note: Thanks to #fedorqui for pointing out that there can be problems if one of the patterns in file1 appears in another column in file2.
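The problem mentioned in the note can be reproduced with a short pattern. The sketch below uses made-up data and compares plain grep -f with the -w (whole word) and -F (fixed string) options; -w is not in POSIX but is supported by GNU and BSD grep.

```shell
pats=$(mktemp); data=$(mktemp)
printf '6765\n' > "$pats"
printf 'chromosome01\t6765006\tC\tT\n' >  "$data"
printf 'chromosome01\t6765\tC\tT\n'    >> "$data"

loose=$(grep -c -f "$pats" "$data")         # substring match: counts both lines
strict=$(grep -c -w -F -f "$pats" "$data")  # whole-word, literal match: counts one line
echo "loose=$loose strict=$strict"
rm -f "$pats" "$data"
```

Even with -w, grep still looks at every column, so the awk version remains the safer choice when the match must be restricted to column 2.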