isolating the rows based on values of fields in awk


I have a text file like this small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column: if the 3rd column is gene, I keep the entire row and filter out the other rows. Here is the expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt

Use double quotes in the script:
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could simply write:
$ awk '$3 == "gene"' file
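Why the original command prints nothing: the inner single quotes end the shell's quoted string, so awk receives $3 == gene, where gene is an uninitialized variable that compares as an empty string. A minimal sketch of the difference, assuming a scratch copy of the sample data (the tmp directory and the variable names broken and fixed are illustrative only):

```shell
# Recreate the sample input in a scratch directory (illustrative path).
tmp=$(mktemp -d)
cat > "$tmp/infile.txt" <<'EOF'
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
EOF

# Broken: the shell strips the inner single quotes, so awk sees $3 == gene,
# and the unset variable gene is an empty string that never equals $3.
broken=$(awk '{ if ($3 == 'gene') { print } }' "$tmp/infile.txt")

# Fixed: double quotes make "gene" a string literal inside the awk program.
fixed=$(awk '$3 == "gene"' "$tmp/infile.txt")

printf 'broken: [%s]\n' "$broken"
printf 'fixed:  [%s]\n' "$fixed"
```

The first command produces no output, while the second prints the single gene row.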

Related

Print column next to the column matching a pattern

I have this tab separated file:
gene 1 A 6 gene_name TP53 B
exon 6 B 2 2 A gene_name MYC2 10.0 B
transcript 3 B B 4 gene_name ORF1
How can I print the first column plus the column that follows the gene_name column? As you can see, gene_name does not always appear in the same column.
I am not sure how to write the last part of this:
awk 'BEGIN{OFS="\t"} {print $1, ??}' myFile.tsv
So, my expected output is:
gene TP53
exon MYC2
transcript ORF1
Thanks!

With your shown samples, please try the following.
1st solution: In case you have multiple gene_name values in a single line, the following may help.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);i++}}}' Input_file
2nd solution: In case you have only one gene_name per line, use the following.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);next}}}' Input_file
3rd solution: For the very specific case where gene_name always comes in the 3rd field, we could try this one; for generic cases, try the 1st or 2nd solution.
awk 'BEGIN{FS=OFS="\t"} $3=="gene_name"{print $1,$4}' Input_file
OR, if you want to check the second-to-last field and print the last field's value, use:
awk 'BEGIN{FS=OFS="\t"} $(NF-1)=="gene_name"{print $1,$NF}' Input_file
4th solution: With GNU sed (needed for \S, \s and \t), please try the following.
sed -E 's/(\S+).*gene_name\s+(\S+).*/\1\t\2/' Input_file
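To check the field-scanning approach against the sample data, here is a small self-contained sketch. It assumes the input really is tab-separated, so the sample is rebuilt with printf; the tmp directory and the result variable are illustrative names only.

```shell
tmp=$(mktemp -d)
# Build the tab-separated sample with printf so the tabs are explicit.
printf 'gene\t1\tA\t6\tgene_name\tTP53\tB\n' > "$tmp/myFile.tsv"
printf 'exon\t6\tB\t2\t2\tA\tgene_name\tMYC2\t10.0\tB\n' >> "$tmp/myFile.tsv"
printf 'transcript\t3\tB\tB\t4\tgene_name\tORF1\n' >> "$tmp/myFile.tsv"

# Scan the fields; on the first gene_name, print column 1 and the next field,
# then move on to the next line. The loop stops at NF-1 so $(i+1) always exists.
result=$(awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<NF;i++) if($i=="gene_name"){print $1,$(i+1); next}}' "$tmp/myFile.tsv")
printf '%s\n' "$result"
```

This prints the three tab-separated pairs from the expected output: gene TP53, exon MYC2 and transcript ORF1.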

You may use this GNU awk solution:
awk '{print gensub(/^(\S+).*\tgene_name\t(\S+).*/, "\\1\t\\2", "1")}' file
gene TP53
exon MYC2
transcript ORF1

Using GNU grep:
grep -oP '(^\S+)|(\bgene_name\s+\K\S+)' myFile.tsv | paste - -

$ awk -v OFS='\t' '{v=$1; sub(/.* gene_name /,""); print v, $1}' file
gene TP53
exon MYC2
transcript ORF1

And also with awk:
awk -v FS=' .*gene_name | ' '{print $1,$2}' file
gene TP53
exon MYC2
transcript ORF1

filtering in a text file using awk

I have a tab-separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
In the new file that I will make:
1- I want to select the rows based on the 4th column: if the value of the 4th column is at least 10, I keep the entire row; otherwise it is filtered out.
2- In the next step, the 4th column is removed.
The result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
To get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
But I do not know how to implement the filtering step. Do you know how to fix the code?

Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt

Here is another one, which might work better if you have many more fields (\w requires GNU awk):
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file
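A variant without the extra cat process, sketched here under the assumption that the input has exactly four tab-separated columns as in the sample. Setting OFS once replaces the repeated "\t" literals; the tmp directory is an illustrative scratch location.

```shell
tmp=$(mktemp -d)
# Tab-separated sample rows from the question.
printf 'chr1\t100499714\t100499715\t1\n'  >  "$tmp/input.txt"
printf 'chr1\t100502177\t100502178\t10\n' >> "$tmp/input.txt"
printf 'chr1\t100502181\t100502182\t2\n'  >> "$tmp/input.txt"
printf 'chr1\t100502191\t100502192\t18\n' >> "$tmp/input.txt"
printf 'chr1\t100502203\t100502204\t45\n' >> "$tmp/input.txt"

# Keep rows whose 4th column is at least 10 and print only the first three columns.
awk -F'\t' -v OFS='\t' '$4 >= 10 {print $1, $2, $3}' "$tmp/input.txt" > "$tmp/out.txt"
cat "$tmp/out.txt"
```

Because the 4th field looks numeric, awk compares it as a number, so 2 is correctly treated as smaller than 10.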

awk to update file based on matching lines with split

In the awk below I am trying to match $2 in file1, up to the ., with $4 in file2, up to the first underscore _. If a match is found, that portion of file2 is updated with the matching $1 value from file1. I think it is close, but I am not sure how to account for the . in file1. In my real data there are thousands of lines, all in the format below, and a match may not always be found. The awk as written does execute, but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2

The following awk may help. The default field separator splits on both spaces and tabs, so it handles the space-delimited Input_file1 and the tab-delimited Input_file2 as they are; you could also set FS per file, e.g. FS=" " before Input_file1 and FS="\t" before Input_file2. OFS="\t" keeps the output tab-delimited after $4 is modified.
awk '
BEGIN{ OFS="\t" }
FNR==NR{                  # first file: key is the accession without its version
  val=$2
  sub(/\..*/,"",val)
  a[val]=$1
  next
}
{
  split($4,array,"_")     # second file: array[1]"_"array[2] is the accession
}
((array[1]"_"array[2]) in a){
  sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4)
  print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
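The same lookup can be sketched with substr instead of a regex substitution. This is only an illustration on a three-line subset of the sample data; the array name, the key variable and the tmp directory are hypothetical names, and unmatched lines are dropped as in the desired output.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
EOF
printf 'chr1\t92149295\t92149414\tNM_003243_cds_0_0_chr1_92149296_r\n'      >  "$tmp/file2"
printf 'chr9\t101867487\t101867584\tNM_004612_cds_0_0_chr9_101867488_f\n'    >> "$tmp/file2"
printf 'chr9\t101904817\t101904985\tNM_001130916_cds_3_0_chr9_101904818_f\n' >> "$tmp/file2"

result=$(awk '
BEGIN { OFS = "\t" }
FNR == NR {                    # first file: map accession (no version) to gene symbol
    acc = $2
    sub(/\..*/, "", acc)       # NM_004612.3 -> NM_004612
    name[acc] = $1
    next
}
{
    split($4, part, "_")       # NM_003243_cds_... -> part[1]="NM", part[2]="003243"
    key = part[1] "_" part[2]
    if (key in name) {
        # swap the accession prefix for the gene symbol, keep the rest of the field
        $4 = name[key] substr($4, length(key) + 1)
        print
    }
}' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
```

The NM_001130916 line has no entry in file1, so it is not printed.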

awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm now struggling to extract chr1, 11869 and 14409, as this seems to need a different field separator. How is this done on the same line?

Try the following command to extract what you want:
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/ +/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation:
split($1,a,/ +/): split $1 into the array a, using the regex / +/ (one or more spaces) as the separator.
Then print the pieces stored in a, together with $2 and $6, in the order you want.

$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
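Both answers rely on gene_id and gene_name sitting at fixed positions. If the attribute order may vary, one option is to look the attributes up by name. The following is a sketch under that assumption, not a full GTF parser; the tmp directory and the variables id, name and result are illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file.txt" <<'EOF'
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
EOF

# Treat any run of spaces, tabs, double quotes and semicolons as one separator,
# then scan the attribute fields (from field 9 on) for the wanted keys.
result=$(awk -F'[ \t";]+' -v OFS='\t' '{
    id = name = ""
    for (i = 9; i < NF; i++) {
        if ($i == "gene_id")   id   = $(i + 1)
        if ($i == "gene_name") name = $(i + 1)
    }
    print id, name, $1, $4, $5
}' "$tmp/file.txt")
printf '%s\n' "$result"
```

The output is tab-separated in the order requested in the question: gene ID, gene name, chromosome, start and end.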

awk to update unknown values in file using range in another

I am trying to modify an awk kindly provided by @karakfa to update all the unknown values in $6 of file2 if the $4 value in file2 is within the range given in $1 of file1. If $6 already holds a value other than unknown, the line is skipped and the next line is processed. In my awk attempt below, the final output is 6 tab-delimited fields. Currently the awk runs, but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- lines in file1 with matching $2 values are combined, with the first line's rstart[a[1]]=a[2] as the start and the last line's rend[a[1]]=a[3] as the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A

Here is another script. It is inefficient, since it does a linear scan instead of a more efficient search, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since more than one range can match, this one uses the first match found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
Note that the fourth record also matches based on the rules, so it is updated to GPR157.
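For comparison, the edited attempt in the question keys its arrays by chromosome alone, so every file1 line for chr1 overwrites the previous one and only the last range survives. A minimal sketch of that effect, using three of the file1 lines (the tmp directory and the last_range variable are illustrative names):

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr1:4714792-4852594 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783224-15798586 CELA2A
EOF

# split on ":" or "-" gives a[1]=chromosome, a[2]=start, a[3]=end.
# Keying by a[1] only means each chr1 line replaces the one before it.
last_range=$(awk '{split($1,a,/[:-]/); rstart[a[1]]=a[2]; rend[a[1]]=a[3]}
                  END{print rstart["chr1"] "-" rend["chr1"]}' "$tmp/file1")
printf '%s\n' "$last_range"
```

Only the CELA2A range is left for chr1, which neither unknown record falls into. The script above avoids this by storing every range under a (chromosome, counter) key.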
