awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm now struggling to extract chr1, 11869 and 14409, as this seems to need a different field separator. How is this done on the same line?

Try the following command to extract what you want:
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/[ \t]+/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation:
split($1,a,/[ \t]+/): splits $1 (everything before the first double quote) into the array a, using runs of spaces or tabs as the separator.
Then print the elements of a that you need, followed by $2 and $6.

$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
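
Both answers above rely on gene_id and gene_name sitting at fixed positions in the attribute list. If the attribute order can vary between lines, one option is to look the values up by key instead. The following is only a sketch: the helper name attr is made up for this example, and it assumes every attribute is written as key "value" with a single space in between.

```shell
# Sample line from the question (written with spaces, as displayed above).
line='chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'

result=$(printf '%s\n' "$line" | awk -v OFS='\t' '
# attr(key): hypothetical helper that returns the quoted value following key.
function attr(key,    s) {
    if (match($0, "(^|[ \t;])" key " \"[^\"]*\"")) {
        s = substr($0, RSTART, RLENGTH)   # e.g.  gene_id "ENSG..."
        sub(/^[^"]*"/, "", s)             # drop everything up to the opening quote
        sub(/"$/, "", s)                  # drop the closing quote
        return s
    }
    return ""                             # key not present on this line
}
# Default whitespace splitting gives chr1, 11869 and 14409 as $1, $4 and $5.
{ print $1, $4, $5, attr("gene_id"), attr("gene_name") }')

printf '%s\n' "$result"
```

This keeps the default field separator for the leading columns and uses match() only for the quoted attributes, so no single FS has to cover both parts of the line.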

Related

Print column next to the column matching a pattern

I have this tab separated file:
gene 1 A 6 gene_name TP53 B
exon 6 B 2 2 A gene_name MYC2 10.0 B
transcript 3 B B 4 gene_name ORF1
How can I print the first column plus the column that follows gene_name? As you can see, gene_name does not always appear in the same column.
I am not sure how to write the last part of this:
awk 'BEGIN{OFS="\t"} {print $1, ??}' myFile.tsv
So, my expected output is:
gene TP53
exon MYC2
transcript ORF1
Thanks!

With the samples you have shown, please try the following.
1st solution: If a single line can contain more than one gene_name value, the following may help.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);i++}}}' Input_file
2nd solution: If each line has only one gene_name, use the following.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);next}}}' Input_file
3rd solution: For the very specific case where gene_name is always the 3rd field, you could try this one. For the general case, use the 1st or 2nd solution.
awk 'BEGIN{FS=OFS="\t"} $3=="gene_name"{print $1,$4}' Input_file
Or, if gene_name is always the second-to-last field, check that field and print the first and last fields:
awk 'BEGIN{FS=OFS="\t"} $(NF-1)=="gene_name"{print $1,$NF}' Input_file
4th solution: With GNU sed, please try the following.
sed -E 's/(\S+).*gene_name\s+(\S+).*/\1\t\2/' Input_file

You may use this GNU awk solution:
awk '{print gensub(/^(\S+).*\tgene_name\t(\S+).*/, "\\1\t\\2", "1")}' file
gene TP53
exon MYC2
transcript ORF1

Using GNU grep:
grep -oP '(^\S+)|(\bgene_name\s+\K\S+)' myFile.tsv | paste - -

$ awk -v OFS='\t' '{v=$1; sub(/.* gene_name /,""); print v, $1}' file
gene TP53
exon MYC2
transcript ORF1

And also with awk:
awk -v FS=' .*gene_name | ' '{print $1,$2}' file
gene TP53
exon MYC2
transcript ORF1

Change field separator with awk or sed of a specific set of columns

I would like to modify a file where both tabs and spaces are used as field separators.
At the beginning we have a file with this type of structure:
chr1 Cufflinks gene_id "XLOC_000001"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XLOC_000012"; oId "XR_001548508";
Doing awk -F' ' '$4=$6 {print $0}' does what I am looking for (replacing the value of "gene_id" with the value of "oId"):
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
The problem is that it changes the line structure: the tabs \t between chr1, Cufflinks and gene_id disappeared. I tried adding -v OFS=\t but it puts tabs in the gene_id "XLOC_000012"; oId "XR_001548508"; part (which should stay separated by spaces). I also tried with sed something like sed -i 's/ /\t/' but it also put tabs everywhere.
How can I change the field separator of columns 1 to 3 only (without changing columns 3 to 6)?

A possibility with awk:
awk -F '[ ]' '{$2 = $4; print}' file
By using the space character for the input field separator (as opposed to spaces and tabs), a field can be assigned to without changing the tab characters to spaces.
For more complex cases, there is split (but no "join"):
awk 'BEGIN {FS=OFS="\t"} {n = split($3, a, " "); a[2] = a[4]; for (i=1; i<=n; ++i)
$3 = (i == 1 ? "" : $3 " ") a[i]
} 1' file
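
To see why the single-space separator in the first command above keeps the tabs, it may help to compare it with default splitting on one line. This is a small sketch that assumes the first two gaps in the line are tabs and the rest are single spaces, as the question describes; the variable names are only for illustration.

```shell
# One line from the question: tabs after chr1 and Cufflinks, spaces elsewhere (assumed layout).
line=$(printf 'chr1\tCufflinks\tgene_id "XLOC_000001"; oId "XR_003076322.1";')

# Default FS: assigning to $4 rebuilds the record with OFS (a space), so the tabs are lost.
default_fs=$(printf '%s\n' "$line" | awk '{$4 = $6; print}')

# FS is exactly one space: the tab-joined prefix is a single field ($1), so it is left untouched.
space_fs=$(printf '%s\n' "$line" | awk -F '[ ]' '{$2 = $4; print}')

printf '%s\n' "$default_fs" "$space_fs"
```

With -F '[ ]' the fields are numbered differently: the quoted gene_id value is $2 and the quoted oId value is $4, which is why the assignment changes from $4 = $6 to $2 = $4.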

You may use this sed that preserves your whitespaces:
sed -E $'s/^([ \t]*([^ \t]+[ \t]+){3})[^ \t]+([ \t]+)(([^ \t]+[ \t]+){1})([^ \t]+)/\\1\\6\\3\\4\\6/' ff
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
Explanation for copying 6th field to 4th field:
^: # match start
([ \t]*([^ \t]+[ \t]+){3}): # match first 4-1 fields and capture in group #1
[^ \t]+: # match 4th field
([ \t]+): # match whitespace after 4th field and capture in group #3
(([^ \t]+[ \t]+){1}): # match next (6-4-1) fields and capture in group #4
([^ \t]+): # match 6th field and capture in group #6
\\1\\6\\3\\4\\6: Place back-reference back in substitution
Alternatively this awk also creates a tabular aligned output:
awk '$4=$6' file | column -t
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";

isolating the rows based on values of fields in awk

I have a text file like this small example:
small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column: if the 3rd column is gene, I keep the entire row, and I filter out all other rows. Here is the expected output:
expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt

Use double quotes in the script:
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could just:
$ awk '$3 == "gene"' file
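
As for why the original command printed nothing: the inner single quotes end and restart the shell's quoting, so awk receives $3 == gene, where gene is an unset awk variable (an empty string) rather than the text "gene". A rough sketch of this, using the three sample rows written to a temporary file (the tab layout and the variable name type are assumptions for the example):

```shell
tmp=$(mktemp)
printf 'chr1\tHAVANA\texon\t13221\t13374\n' >  "$tmp"
printf 'chr1\tHAVANA\texon\t13453\t13670\n' >> "$tmp"
printf 'chr1\tHAVANA\tgene\t14363\t29806\n' >> "$tmp"

# What awk effectively receives from the broken command: gene is an unset variable.
broken=$(awk '{ if ($3 == gene) { print } }' "$tmp")

# Passing the value in with -v keeps shell quoting out of the awk program entirely.
fixed=$(awk -v type="gene" '$3 == type' "$tmp")

printf 'broken: [%s]\nfixed:  [%s]\n' "$broken" "$fixed"
rm -f "$tmp"
```

The -v form is also handy when the value to match comes from a shell variable.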

awk to update unknown values in file using range in another

I am trying to modify an awk script kindly provided by @karakfa to update all the unknown values in $6 of file2 when the $4 value in file2 falls within a range in $1 of file1. If $6 already holds a value other than unknown, the line is skipped and the next line is processed. In my awk attempt below, the final output is 6 tab-delimited fields. Currently the awk runs, but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined, with the first line's start (rstart[a[1]]=a[2]) used as the start and the last line's end (rend[a[1]]=a[3]) used as the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
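
The combining step described above could be sketched in awk roughly as follows. This is only an illustration: it assumes all lines for one name are on the same chromosome, and the array names (chr, start, last, order) are invented for the example.

```shell
file1=$(mktemp)
cat > "$file1" <<'EOF'
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
EOF

merged=$(awk '
{
    split($1, a, "[:-]")          # a[1]=chromosome, a[2]=start, a[3]=end
    if (!($2 in chr)) {           # first line for this name: keep chromosome and start
        order[++n] = $2
        chr[$2]    = a[1]
        start[$2]  = a[2]
    }
    last[$2] = a[3]               # overwritten on every line, so the last line wins
}
END {
    # Print one combined range per name, in order of first appearance.
    for (i = 1; i <= n; i++)
        printf "%s:%s-%s %s\n", chr[order[i]], start[order[i]], last[order[i]], order[i]
}' "$file1")

printf '%s\n' "$merged"
rm -f "$file1"
```

On the file1 sample this reproduces the three combined lines listed above.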

Here is another script. It is inefficient, since it does a linear scan of the ranges rather than a faster search, but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since more than one range can match, this one uses the first match found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
Note that the fourth record also matches under these rules.

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search file and extract the values after string FPKM, like
"1.2028600217"
Can I do it using awk?

If you don't care which column FPKM appears in, you could use:
grep -Po '(?<=FPKM )"[^"]*"' file

You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
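
Since the question asked about awk specifically, here is a rough awk equivalent using match(). It is a sketch that assumes the attribute is always written as FPKM "value" with exactly one space, and it keeps the surrounding quotes, as in the output the question asked for.

```shell
line='chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";'

fpkm=$(printf '%s\n' "$line" | awk 'match($0, /FPKM "[^"]*"/) {
    # RSTART and RLENGTH delimit the text FPKM "..."; skip the 5 characters of the key and space.
    print substr($0, RSTART + 5, RLENGTH - 5)
}')

printf '%s\n' "$fpkm"
```

Lines without an FPKM attribute produce no output, because the match() call is used as the pattern.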
