Print column next to the column matching a pattern - awk

I have this tab separated file:
gene 1 A 6 gene_name TP53 B
exon 6 B 2 2 A gene_name MYC2 10.0 B
transcript 3 B B 4 gene_name ORF1
How can I print the first column plus the column that follows the gene_name column? As you can see, gene_name does not always appear in the same column.
I am not sure about how to get the last part of this:
awk 'BEGIN{OFS="\t"} {print $1, ??}' myFile.tsv
So, my expected output is:
gene TP53
exon MYC2
transcript ORF1
Thanks!

With your shown samples, please try the following.
1st solution: If a single line can contain multiple gene_name values, the following may help.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);i++}}}' Input_file
2nd solution: If each line has only one gene_name, use the following.
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++){if($i=="gene_name"){print $1,$(i+1);next}}}' Input_file
3rd solution: In the very specific case where gene_name is always the 3rd field, you could try this one; for the general case, use the 1st or 2nd solution.
awk 'BEGIN{FS=OFS="\t"} $3=="gene_name"{print $1,$4}' Input_file
OR if you want to check the second-to-last field and print the last field's value, use:
awk 'BEGIN{FS=OFS="\t"} $(NF-1)=="gene_name"{print $1,$NF}' Input_file
4th solution: With sed (GNU sed, since \S, \s and \t are GNU extensions), please try the following.
sed -E 's/(\S+).*gene_name\s+(\S+).*/\1\t\2/' Input_file
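To see the loop-based approach run end to end, here is a minimal sketch that rebuilds the question's sample with real tab characters and applies a slight variation of the 2nd solution. The temporary file and the variable names (tmp, out) are only for illustration, and the loop starts at field 2 and stops at NF-1 on the assumption that gene_name is never the first or last field.

```shell
# Rebuild the sample with real tabs (printf expands \t in its format string).
tmp=$(mktemp)
printf 'gene\t1\tA\t6\tgene_name\tTP53\tB\n' > "$tmp"
printf 'exon\t6\tB\t2\t2\tA\tgene_name\tMYC2\t10.0\tB\n' >> "$tmp"
printf 'transcript\t3\tB\tB\t4\tgene_name\tORF1\n' >> "$tmp"

# Scan fields 2..NF-1 so that $(i+1) always exists; "next" stops at the first match.
out=$(awk 'BEGIN{FS=OFS="\t"} {for(i=2;i<NF;i++) if($i=="gene_name"){print $1,$(i+1); next}}' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

With the three sample lines this prints gene, exon and transcript, each followed by a tab and the matching gene name.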

You may use this GNU awk solution:
awk '{print gensub(/^(\S+).*\tgene_name\t(\S+).*/, "\\1\t\\2", "1")}' file
gene TP53
exon MYC2
transcript ORF1

Using GNU grep:
grep -oP '(^\S+)|(\bgene_name\s+\K\S+)' myFile.tsv | paste - -

$ awk -v OFS='\t' '{v=$1; sub(/.* gene_name /,""); print v, $1}' file
gene TP53
exon MYC2
transcript ORF1

And also with awk:
awk -v FS=' .*gene_name | ' '{print $1,$2}' file
gene TP53
exon MYC2
transcript ORF1

Related

Change field separator with awk or sed of a specific set of columns

I would like to modify a file where both tabs and spaces are used as field separators.
At the beginning we have a file with this type of structure:
chr1 Cufflinks gene_id "XLOC_000001"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XLOC_000012"; oId "XR_001548508";
Doing awk -F' ' '$4=$6 {print $0}' does what I am looking for (replacing the value of "gene_id" with the value of "oId"):
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
The problem is that it changes the line structure: the tabs \t between chr1, Cufflinks and gene_id disappear. I tried adding -v OFS=\t, but that puts tabs in the gene_id "XLOC_000012"; oId "XR_001548508"; part (which should stay separated by spaces). I also tried sed with something like sed -i 's/ /\t/', but it also put tabs everywhere.
How can I keep tabs as the separator for columns 1 to 3 while leaving columns 3 to 6 separated by spaces?
A possibility with awk:
awk -F '[ ]' '{$2 = $4; print}' file
By using the space character for the input field separator (as opposed to spaces and tabs), a field can be assigned to without changing the tab characters to spaces.
For more complex cases, there is split (but no "join"):
awk 'BEGIN {FS=OFS="\t"} {
  n = split($3, a, " "); a[2] = a[4]
  for (i=1; i<=n; ++i)
    $3 = (i == 1 ? "" : $3 " ") a[i]
} 1' file
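As a quick check of the single-space separator idea, the sketch below feeds one line through the -F '[ ]' command. It assumes, as the question describes, that the first three columns are joined by tabs and the rest by single spaces; the variable names line and out are illustrative.

```shell
# One input line: tabs between the first three columns, single spaces afterwards.
line=$(printf 'chr1\tCufflinks\tgene_id "XLOC_000001"; oId "XR_003076322.1";')

# With FS set to a literal space, "chr1<TAB>Cufflinks<TAB>gene_id" is one field ($1),
# so rebuilding the record with the default OFS (a space) never touches the tabs.
out=$(printf '%s\n' "$line" | awk -F '[ ]' '{$2 = $4; print}')

printf '%s\n' "$out"
```

The field numbers shift because the tab-joined prefix counts as a single field, which is why the command assigns $4 to $2 rather than $6 to $4.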
You may use this sed that preserves your whitespaces:
sed -E $'s/^([ \t]*([^ \t]+[ \t]+){3})[^ \t]+([ \t]+)(([^ \t]+[ \t]+){1})([^ \t]+)/\\1\\6\\3\\4\\6/' ff
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
Explanation for copying 6th field to 4th field:
^: # match start
([ \t]*([^ \t]+[ \t]+){3}): # match first 4-1 fields and capture in group #1
[^ \t]+: # match 4th field
([ \t]+): # match whitespace after 4th field and capture in group #3
(([^ \t]+[ \t]+){1}): # match next (6-4-1) fields and capture in group #4
([^ \t]+): # match 6th field and capture in group #6
\\1\\6\\3\\4\\6: Place back-reference back in substitution
Alternatively this awk also creates a tabular aligned output:
awk '$4=$6' file | column -t
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";

awk to split field twice using two delimiters with condition

In the awk below I am splitting on the space or : after the chrxx (it is not consistent, so I added both to FS), then splitting on the -. I cannot seem to duplicate $2 if there is no - after it; lines 2 and 3 are examples. If there is a - after the number, then the value to the right of it is $3 in the output. The awk seems close but isn't duplicating the value. Thank you :).
in
chr17 7124137-7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
awk
awk -F"[ :-]" '$3=$3?$3:$2' OFS='\t' in
current
chr17 7124137 7124146 ACADVL
chr1 229568460 ACTA1
chr10 90708637 ACTA2
desired output
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2
If number of fields is three, copy 3rd field to 4th, and 2nd to 3rd. Force recomputing of whole record to make output tab separated regardless of what's done before.
awk -F'[ :-]' 'NF==3{$4=$3;$3=$2} {$1=$1} 1' OFS='\t' in
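The original attempt, $3=$3?$3:$2, cannot work because on a three-field line $3 already holds the gene name, so it is never empty and $2 is never copied. Below is a small sketch of the NF-based fix run on sample input; the third line uses a colon after the chromosome purely as an assumed illustration of the inconsistent separator mentioned in the question, and the variable names are hypothetical.

```shell
# Sample input; the ":" on the last line is an assumed variant, not from the original data.
input='chr17 7124137-7124146 ACADVL
chr1 229568460 ACTA1
chr10:90708637 ACTA2'

# Three fields means there was no "-": shift the name to $4 and duplicate $2 into $3.
# $1=$1 forces the record to be rebuilt so every line is joined with OFS (a tab).
out=$(printf '%s\n' "$input" | awk -F'[ :-]' -v OFS='\t' 'NF==3{$4=$3; $3=$2} {$1=$1} 1')

printf '%s\n' "$out"
```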
$ perl -lane 'if($F[1]=~/\-/){$F[1]=~s/-/ /}else{splice @F, 1, 0, $F[1];}print "@F" ' temp
chr17 7124137 7124146 ACADVL
chr1 229568460 229568460 ACTA1
chr10 90708637 90708637 ACTA2

isolating the rows based on values of fields in awk

I have a text file like this small example:
small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column: if the 3rd column is gene, I want to keep the entire row and filter out the other rows. Here is the expected output:
expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt
Use double quotes in the script:
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could simply write:
$ awk '$3 == "gene"' file
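The reason the original command prints nothing is that the inner single quotes end the shell quoting, so awk receives $3 == gene, where gene is an uninitialized (empty) variable. The sketch below shows that failure next to a version that passes the value in with -v; the names feature, want, broken and out are made up for the example.

```shell
input='chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806'

# Broken quoting: awk sees  $3 == gene  and compares $3 with an empty variable.
broken=$(printf '%s\n' "$input" | awk '{ if ($3 == 'gene') { print } }')

# Passing the value as an awk variable avoids nested quotes altogether.
feature=gene
out=$(printf '%s\n' "$input" | awk -v want="$feature" '$3 == want')

printf 'broken: [%s]\n' "$broken"
printf '%s\n' "$out"
```

Using -v also makes it easy to filter on a different feature type without editing the awk program itself.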

awk to update file based on matching lines with split

In the awk below I am trying to match $2 in file1, up to the ., with $4 in file2, up to the first underscore _. If a match is found, then that portion of file2 is updated with the matching $1 value in file1. I think it is close, but I am not sure how to account for the . in file1. In my real data there are thousands of lines, but they are all in the format below, and a match may not always be found. The awk as is does execute, but file2 is not updated, I think because the . is not matching. Thank you :).
file 1 space delimited
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file 2 tab-delimited
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
desired output tab-delimited
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
The following awk may help. You can also set the field separator per input file: for example, if Input_file1 is space-delimited, put FS=" " before its name, and if Input_file2 is tab-delimited, put FS="\t" before its name.
awk '
FNR==NR{
  val=$2;
  sub(/\..*/,"",val);
  a[val]=$1;
  next
}
{
  split($4,array,"_")
}
((array[1]"_"array[2]) in a){
  sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);
  print
}
' Input_file1 Input_file2
Output will be as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
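Because assigning to $4 makes awk rebuild the record with OFS, the output is only tab-delimited if OFS is set to a tab. The following is a minimal sketch of the same lookup with per-file separators given on the command line; the temporary directory, the shortened sample data and the names id, name, part and key are assumptions for illustration only.

```shell
tmpdir=$(mktemp -d)
# file1: space-delimited gene name and versioned accession.
printf 'TGFBR1 NM_004612.3\nTGFBR2 NM_003242.5\n' > "$tmpdir/file1"
# file2: tab-delimited; the second line has no matching accession in file1.
printf 'chr3\t30648375\t30648469\tNM_003242_cds_0_0_chr3_30648376_f\n' > "$tmpdir/file2"
printf 'chr9\t101904817\t101904985\tNM_001130916_cds_3_0_chr9_101904818_f\n' >> "$tmpdir/file2"

out=$(awk '
    # First file: strip the version suffix (NM_003242.5 -> NM_003242) and map it to the gene name.
    FNR==NR { id=$2; sub(/\..*/, "", id); name[id]=$1; next }
    {
        split($4, part, "_")              # part[1]="NM", part[2]="003242"
        key = part[1] "_" part[2]
        # Replace only the leading accession; print matched lines, drop the rest.
        if (key in name) { sub(/^[^_]+_[^_]+/, name[key], $4); print }
    }
' "$tmpdir/file1" FS='\t' OFS='\t' "$tmpdir/file2")
rm -rf "$tmpdir"

printf '%s\n' "$out"
```

The FS and OFS assignments placed between the two file names take effect only when awk starts reading the second file, so the first file is still split on spaces.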

awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm now struggling to extract chr1, 11869 and 14409, as this will need a different field separator. How is this done on the same line?
Try the following command to extract what you want:
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/ +/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation:
split($1,a,/ +/): split $1 into the array a, using the regex / +/ (one or more spaces) as the separator.
Print the split content stored in a as desired.
$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
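With -F'[ "]+' every run of spaces and double quotes is one separator, so the quoted values land in fixed positions ($10 and $16 here) only as long as the attributes keep the same order. A hedged alternative, sketched below, looks the values up by key instead; it assumes space-separated input exactly like the sample line, and the variable names id and name are illustrative.

```shell
line='chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'

out=$(printf '%s\n' "$line" | awk -F'[ "]+' -v OFS='\t' '{
    id = name = ""                        # reset per record
    # The attribute keys start at field 9; each value is the field right after its key.
    for (i = 9; i < NF; i++) {
        if ($i == "gene_id")   id = $(i+1)
        if ($i == "gene_name") name = $(i+1)
    }
    print $1, $4, $5, id, name
}')

printf '%s\n' "$out"
```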