Change a pattern only in the column next to the column matching another pattern - awk

These are the first lines of my file:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "chr1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "chr1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
I would like to save exactly the same content to another file, but with the chr pattern removed and M transformed into MT in the column next to the one matching remap_original_location.
So, my desired output is:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
Do you know how I can achieve this?
I am trying some code like this:
awk '{for(i=1;i<=NF;i++){ if($i=="remap_original_location"){print ??? }}}'
But I am not sure how to specify the print part. In addition, as you can see, not all rows contain the pattern remap_original_location, and yet I still want to print them.

With perl:
perl -pe 's/remap_original_location "\Kchr(M)?/$1 ? "MT" : ""/e' ip.txt
remap_original_location " I'm assuming single space to be consistent between fields here and that the next field will always start with " character. You can adjust the regex for other variations if needed
\K preceding portion won't be part of the matched text to be replaced
(M)? optionally match M character
$1 ? "MT" : "" if first capture group isn't empty, use MT as replacement string, else use empty string
empty string is Falsy in Perl
you can also use $1 && "MT" instead of ternary expression in this case, since the Falsy value is same as the alternate value needed
e flag helps to use Perl code in replacement section

You may use this awk:
awk '{gsub(/chr/, ""); for (i=1; i<NF; ++i) if ($i == "remap_original_location") {gsub(/M/, "MT", $(i+1)); break}} 1' file
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
A more readable form:
awk '{
  gsub(/chr/, "")
  for (i=1; i<NF; ++i)
    if ($i == "remap_original_location") {
      gsub(/M/, "MT", $(i+1))
      break
    }
} 1' file
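A minimal check of this program on the trickiest sample line (the chrM one). Note that gsub(/M/, "MT", $(i+1)) replaces every M in that field, which is safe for these samples but worth anchoring (e.g. with sub(/^"M/, "\"MT", $(i+1))) if the field could contain other M characters:

```shell
# After gsub(/chr/, ""), the field is "M:+:226-399"; the inner gsub
# then turns the M into MT in the field following the matched column.
printf '%s\n' 'MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3' |
awk '{gsub(/chr/, ""); for (i=1; i<NF; ++i) if ($i == "remap_original_location") {gsub(/M/, "MT", $(i+1)); break}} 1'
```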

With your shown samples, could you please try the following.
awk '
{
  gsub(/chr/,"")
}
match($0,/remap_original_location "M:/){
  val=substr($0,RSTART,RLENGTH)
  sub(/"M:/,"\"MT:",val)
  $0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1' Input_file
OR, as per Sundeep's comment, one could try the following too:
awk '{gsub(/chr/,""); sub(/remap_original_location "M/, "&T")} 1' Input_file

Related

isolating the rows based on values of fields in awk

I have a text file like this small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column. In fact, if the 3rd column is gene I will keep the entire row and filter out the other rows. Here is the expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt
Use double quotes in the script: with '...gene...', the quotes around gene merely close and reopen the shell's single-quoting, so awk receives the bare word gene, an uninitialized (empty) variable, and the comparison never matches, which is why the output is empty.
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could just:
$ awk '$3 == "gene"'

Replacing a string in one file, with the contents of another file based on a common string

I have two files. I would like to replace a certain string in file 1, with the contents of file 2 based on a common string.
file 1
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
file 2
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
output
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
In file 1 there are multiple instances of the same string; however, file 2 only has it once. I would like all instances of the NM_**** etc. to be replaced by the contents of file 2 when the first column matches. Following this, I would like to completely remove the NM_**** from the file.
I am very new to bash etc. I have looked all over the place for a way to do this, but nothing so far has worked. Also, there are over 5000 lines in file 2, and many more in file 1.
Any help would be much appreciated!
Thanks.
This is a join operation. If the files are sorted on the join key, and if the white space is not significant, the easiest will be:
$ join -19 -21 file1 file2 | cut -d' ' -f2-
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
If the files are not sorted and white space is important, awk will be a better solution:
$ awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next}
$NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
An exercise for you is to understand the code. There are many examples (>100) on this site for exactly this question, with many commented scripts, some of which were written by me.
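The key part is the NR==FNR idiom: NR counts records across all inputs while FNR resets per file, so NR==FNR is true only while the first file (here file2, the lookup table) is being read. A condensed check with one line of each sample file (written to temporary files named file1/file2 for the demo):

```shell
# While reading file2 (NR==FNR), cache each line minus its key in a[key];
# while reading file1, replace the trailing " key" with the cached text.
printf '%s\n' 'NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d' > file2
printf '%s\n' 'Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410' > file1
awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next}
     $NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1
rm -f file1 file2
```

Note that setting $1="" leaves a leading separator in a[k], which is exactly what replaces the space in front of the key in file1.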

awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm struggling now to extract chr1, 11869 and 14409, as this will need a different field separator. How is this done on the same line?
Try the following command to extract what you want:
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/[ ]+/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation:
split($1,a,/[ ]+/): split $1 into the array a, using the regex /[ ]+/ (one or more spaces) as the separator; note that a * quantifier here would let the regex match the empty string, which in gawk splits the string into individual characters
Print the split content stored in a as desired.
$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1

remove field from tab seperated file using awk

I am trying to clean up some tab-delimited files and thought that the awk below would remove field 18 (Otherinfo) from the file. I also tried cut and cannot seem to get the desired output. Thank you :).
file
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common Otherinfo
chr1 949654 949654 A G exonic ISG15 . synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . . 1 3825.28 624 chr1 949654 . A G 3825.28 PASS AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
desired output
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common
chr1 949654 949654 A G exonic ISG15 0 synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . .
awk (runs but doesn't remove field 18)
awk '{ $18=""; print }' file1
cut (removes all fields except 18)
cut -f18 file1
By default, awk uses blanks as delimiters. Therefore, you have to tell it to use tabs as the delimiter for both input (FS) and output (OFS), and then collapse the doubled tab left behind by emptying the field:
awk 'BEGIN{FS=OFS="\t"}{$18=""; gsub(/\t\t/,"\t")}1' file1
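A minimal sketch of the same idea on a smaller tab-separated file (removing field 2 of 4 here, purely for illustration): emptying the field leaves two consecutive tabs, which the gsub then collapses into one.

```shell
# Emptying $2 rebuilds $0 as "a\t\tc\td"; gsub(/\t\t/,"\t") collapses
# the doubled tab. (If the removed field were the last one, a trailing
# tab would remain instead; sub(/\t$/,"") would handle that case.)
printf 'a\tb\tc\td\n' |
awk 'BEGIN{FS=OFS="\t"}{$2=""; gsub(/\t\t/,"\t")}1'
```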

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search file and extract the values after string FPKM, like
"1.2028600217"
Can I do it using awk?
If you don't care which column FPKM shows up in, you could:
grep -Po '(?<=FPKM )"[^"]*"' file
You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
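Since the question asks about awk specifically, a field-loop version is possible too. Note that with the default field separator the field after FPKM is "1.2028600217"; (quotes and trailing semicolon included), so this sketch strips those before printing:

```shell
# Walk the fields, find the literal FPKM token, clean up and print
# the field that follows it.
printf '%s\n' 'chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000";' |
awk '{for (i=1; i<NF; i++) if ($i == "FPKM") {gsub(/[";]/, "", $(i+1)); print $(i+1)}}'
```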