awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm now struggling to extract chr1, 11869 and 14409, as this seems to need a different field separator. How is this done on the same line?

Try the following command to extract what you want:
awk 'BEGIN {FS="\""; OFS="\t"} {split($1, a, /[ \t]+/); print a[1], a[4], a[5], $2, $6}' file.txt
Brief explanation:
split($1, a, /[ \t]+/): split $1 (everything before the first double quote) into the array a, using runs of spaces or tabs as the separator.
Print the pieces stored in a, followed by the quoted values $2 and $6, in the order you want.

$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
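Positions such as $10 and $16 only work while the attributes always appear in the same order. As a minimal sketch, assuming the attribute column keeps the key "value"; layout shown in the question, the values can be looked up by key name instead. The helper names extract_gene and attr are hypothetical and not part of the answers above.

```shell
# Sample line copied from the question (space-separated here).
line='chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'

# extract_gene (hypothetical helper): print chrom, start, end, gene_id, gene_name.
extract_gene() {
  awk -v OFS='\t' '
    # attr(key): return the quoted value that follows key in the current record
    function attr(key,    re, s) {
      re = key " \"[^\"]*\""
      if (!match($0, re)) return ""
      s = substr($0, RSTART, RLENGTH)   # e.g. gene_id "ENSG..."
      gsub(/^[^"]*"|"$/, "", s)         # strip the key and the surrounding quotes
      return s
    }
    { print $1, $4, $5, attr("gene_id"), attr("gene_name") }
  '
}

printf '%s\n' "$line" | extract_gene
```

The lookup takes the leftmost match, so a key that is a suffix of another key (for example a hypothetical ref_gene_id next to gene_id) would need a stricter pattern.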

Related

Print column next to the column matching a pattern

I have this tab separated file:
gene 1 A 6 gene_name TP53 B
exon 6 B 2 2 A gene_name MYC2 10.0 B
transcript 3 B B 4 gene_name ORF1
How can I print the first column plus the column that follows the gene_name column? As you can see, gene_name does not always appear in the same column.
I am not sure about how to get the last part of this:
awk 'BEGIN{OFS="\t"} {print $1, ??}' myFile.tsv
So, my expected output is:
gene TP53
exon MYC2
transcript ORF1
Thanks!
With your shown samples, please try the following.
1st solution: In case a single line can contain multiple gene_name values, the following may help.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);i++}}}' Input_file
2nd solution: In case each line has only one gene_name, use the following.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);next}}}' Input_file
3rd solution: For the very specific case where gene_name always appears in the 3rd field, we could try this one; for generic cases use the 1st or 2nd solution.
awk 'BEGIN{FS=OFS="\t"} $3=="gene_name"{print $1,$4}' Input_file
Or, if you want to check the second-to-last field and print the last field's value, use:
awk 'BEGIN{FS=OFS="\t"} $(NF-1)=="gene_name"{print $1,$NF}' Input_file
4th solution: With GNU sed, please try the following.
sed -E 's/(\S+).*gene_name\s+(\S+).*/\1\t\2/' Input_file
You may use this GNU awk solution:
awk '{print gensub(/^(\S+).*\tgene_name\t(\S+).*/, "\\1\t\\2", "1")}' file
gene TP53
exon MYC2
transcript ORF1
Using GNU grep:
grep -oP '(^\S+)|(\bgene_name\s+\K\S+)' myFile.tsv | paste - -
$ awk -v OFS='\t' '{v=$1; sub(/.* gene_name /,""); print v, $1}' file
gene TP53
exon MYC2
transcript ORF1
And also with awk:
awk -v FS=' .*gene_name | ' '{print $1,$2}' file
gene TP53
exon MYC2
transcript ORF1
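The loop-based solutions above can be wrapped so that the key is passed in rather than hard-coded. This is only a sketch, assuming tab-separated input as stated in the question; next_to is a hypothetical helper name.

```shell
# next_to (hypothetical helper): print column 1 and the column that follows KEY.
next_to() {
  awk -v key="$1" 'BEGIN { FS = OFS = "\t" }
    {
      for (i = 1; i < NF; i++)          # i < NF: the key must have a successor
        if ($i == key) { print $1, $(i + 1); next }
    }'
}

# Sample rows from the question, rebuilt with real tabs.
printf 'gene\t1\tA\t6\tgene_name\tTP53\tB\nexon\t6\tB\t2\t2\tA\tgene_name\tMYC2\t10.0\tB\ntranscript\t3\tB\tB\t4\tgene_name\tORF1\n' |
  next_to gene_name
```

Stopping the loop at NF-1 means a line that ends in gene_name with no value after it prints nothing instead of an empty column.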

Change field separator with awk or sed of a specific set of columns

I would like to modify a file where both tabs and spaces are used as field separators.
At the beginning we have a file with this type of structure:
chr1 Cufflinks gene_id "XLOC_000001"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XLOC_000012"; oId "XR_001548508";
Doing awk -F' ' '$4=$6 {print $0}' performs what I am looking for (changing the value of the "gene_id" by the value in "oId"):
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
The problem is that it changes the line structure: the tabs \t between chr1, Cufflinks and gene_id disappeared. I tried adding -v OFS=\t but it puts tabs in the gene_id "XLOC_000012"; oId "XR_001548508"; part (which should stay separated by spaces). I also tried with sed something like sed -i 's/ /\t/' but it also put tabs everywhere.
How could I change the field separator of columns 1 to 3 only (and leave columns 3 to 6 unchanged)?
A possibility with awk:
awk -F '[ ]' '{$2 = $4; print}' file
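Using a single space character as the input field separator (as opposed to the default, which splits on runs of spaces and tabs) lets a field be assigned without the tabs being replaced by spaces. The tab-joined chr1, Cufflinks and gene_id then count as one field, so the gene_id value is $2 and the oId value is $4.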
For more complex cases, there is split (but no "join"):
awk 'BEGIN {FS=OFS="\t"} {
  n = split($3, a, " "); a[2] = a[4]
  for (i=1; i<=n; ++i)
    $3 = (i == 1 ? "" : $3 " ") a[i]
} 1' file
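Since awk has no built-in join, the rebuild loop can be moved into a small user-defined function. The following is a sketch under the same assumption as above (tabs between the first three columns, spaces inside the third); copy_oid and join_fields are hypothetical names.

```shell
# copy_oid (hypothetical helper): overwrite the gene_id value with the oId value.
copy_oid() {
  awk 'BEGIN { FS = OFS = "\t" }
    # join_fields(arr, n, sep): concatenate arr[1..n] with sep between elements
    function join_fields(arr, n, sep,    i, s) {
      s = arr[1]
      for (i = 2; i <= n; i++) s = s sep arr[i]
      return s
    }
    {
      n = split($3, a, " ")   # a[1]=gene_id a[2]=value a[3]=oId a[4]=value
      a[2] = a[4]
      $3 = join_fields(a, n, " ")
      print
    }'
}

printf 'chr1\tCufflinks\tgene_id "XLOC_000001"; oId "XR_003076322.1";\n' | copy_oid
```

Because FS and OFS are both a tab, assigning to $3 rebuilds the record with the original tabs, and the spaces inside the third column are restored by the join.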
You may use this sed, which preserves your whitespace:
sed -E $'s/^([ \t]*([^ \t]+[ \t]+){3})[^ \t]+([ \t]+)(([^ \t]+[ \t]+){1})([^ \t]+)/\\1\\6\\3\\4\\6/' ff
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
Explanation for copying 6th field to 4th field:
^: # match start
([ \t]*([^ \t]+[ \t]+){3}): # match first 4-1 fields and capture in group #1
[^ \t]+: # match 4th field
([ \t]+): # match whitespace after 4th field and capture in group #3
(([^ \t]+[ \t]+){1}): # match next (6-4-1) fields and capture in group #4
([^ \t]+): # match 6th field and capture in group #6
\\1\\6\\3\\4\\6: Place back-reference back in substitution
Alternatively this awk also creates a tabular aligned output:
awk '$4=$6' file | column -t
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";

isolating the rows based on values of fields in awk

I have a text file like this small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column: if the 3rd column is gene, I want to keep the entire row and drop all other rows. Here is the expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt
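Use double quotes in the script. The inner single quotes end the shell-quoted program, so awk sees the bare word gene, an unset variable that compares as an empty string: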
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could just write:
$ awk '$3 == "gene"' file
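If the value to match comes from the shell, passing it with -v avoids quoting problems altogether. This is a minimal sketch; keep_feature is a hypothetical helper name and the sample rows are the ones from the question.

```shell
# keep_feature (hypothetical helper): print rows whose 3rd column equals $1.
keep_feature() {
  awk -v feature="$1" '$3 == feature'
}

printf 'chr1 HAVANA exon 13221 13374\nchr1 HAVANA exon 13453 13670\nchr1 HAVANA gene 14363 29806\n' |
  keep_feature gene
```

Note that -v processes backslash escapes in the assigned value, so this form is best suited to plain words such as gene or exon.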

awk to update unknown values in file using range in another

I am trying to modify an awk script kindly provided by @karakfa to update all the unknown values in $6 of file2 if the $4 value in file2 is within the range given in $1 of file1. If $6 already holds a value other than unknown, the line is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs, but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- lines in file1 with matching $2 values are combined, with the first line's rstart[a[1]]=a[2] as the start and the last line's rend[a[1]]=a[3] as the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
Here is another script. It is inefficient, since it does a linear scan over the ranges rather than a faster search, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since more than one range can match, this one uses the first match found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
Note that the fourth record also matches based on the rules.
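The question's "possible solution to issue 2" (combining file1 lines that share a name, from the first line's start to the last line's end) could be sketched as a separate preprocessing step. This assumes that lines for one name are on the same chromosome and appear in the intended order; collapse_ranges is a hypothetical helper name.

```shell
# collapse_ranges (hypothetical helper): one line per name, first start to last end.
collapse_ranges() {
  awk '{
      split($1, a, /[-:]/)            # a[1]=chromosome, a[2]=start, a[3]=end
      name = $2
      if (!(name in start)) {         # first time this name is seen
        order[++n] = name
        chrom[name] = a[1]
        start[name] = a[2]
      }
      stop[name] = a[3]               # keep overwriting: the last line wins
    }
    END {
      for (i = 1; i <= n; i++) {
        name = order[i]
        printf "%s:%s-%s %s\n", chrom[name], start[name], stop[name], name
      }
    }'
}

printf 'chr1:4714792-4852594 AJAP1\nchr1:4714792-4837854 AJAP1\nchr1:9160364-9189229 GPR157\nchr1:9160364-9189229 GPR157\nchr1:15783223-15798586 CELA2A\nchr1:15783224-15798586 CELA2A\n' |
  collapse_ranges
```

On the file1 sample this reproduces the three combined lines listed in the question; whether "first start, last end" is the right merge rule for other data is left open.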

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search the file and extract the value after the string FPKM, like
"1.2028600217"
Can I do it using awk?
If you don't care which column FPKM appears in, you could use:
grep -Po '(?<=FPKM )"[^"]*"' file
You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
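Since the question asks specifically about awk, here is a minimal awk sketch as well. It assumes the attribute is always written as FPKM "value" with a single space, and it keeps the surrounding quotes as in the question's example. fpkm_value is a hypothetical helper name.

```shell
# fpkm_value (hypothetical helper): print the quoted value that follows FPKM.
fpkm_value() {
  # "FPKM " is 5 characters long, so skip it and keep the quoted part.
  awk 'match($0, /FPKM "[^"]*"/) { print substr($0, RSTART + 5, RLENGTH - 5) }'
}

printf '%s\n' 'chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";' |
  fpkm_value
```

Lines without an FPKM attribute produce no output, because match returns 0 and the action is skipped.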