AWK: replace full string in TABLE2 according to TABLE1

I have TABLE1, where the first column is a string that should be replaced in TABLE2 and the second column is the value that should replace it.
TABLE1 looks like this:
g63. MYL9
g5990. PTC7
g6018. POLYUBQ
g17850. NAA50
TABLE2 looks, for example, like this:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "g63.t1"; gene_id "g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "g630.t1"; gene_id "g630.
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . g631.
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . g632.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
So I assembled this awk command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' TABLE1 TABLE2
It works only up to a point: the value MYL9, for example, replaces not only the string g63. but also strings like g630, g631, g632 ... g6300 and so on. So the final table looks like this:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . MYL9
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "MYL9"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . MYL9
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "MYL9t1"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . MYL9
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . MYL9
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
And I need it to edit just g63. and not others like g630. and so on.
I have spent quite a long time on this and now I have to take a break, so if anybody has an idea what's wrong there, I would appreciate it. Thanks

So I solved the problem in a non-elegant way. I realized that the trailing dot in the search string is treated as a regex special character (it matches any character), so I just replaced the dots with underscores.
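A possible alternative that leaves the IDs untouched, sketched under the assumption that every ID has the form g<digits>. as in the samples: locate each ID with match() using an escaped dot, then look it up in the array as a plain string, so the dot is never treated as a wildcard. The temporary files and the names map, rest and out are illustrative, not part of the original command.

```shell
# Sample data shortened from the question.
tmp=$(mktemp -d)
cat > "$tmp/TABLE1" <<'EOF'
g63. MYL9
g5990. PTC7
EOF
cat > "$tmp/TABLE2" <<'EOF'
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "g63.t1"; gene_id "g63.
EOF

result=$(awk '
FNR == NR { map[$1] = $2; next }     # first file: ID -> replacement value
{
    out = ""; rest = $0
    # Assumption: IDs look like g<digits>. and the dot is escaped in the regex.
    while (match(rest, /g[0-9]+\./)) {
        id   = substr(rest, RSTART, RLENGTH)
        # Exact array lookup: g630. is left alone unless it is itself a key.
        out  = out substr(rest, 1, RSTART - 1) ((id in map) ? map[id] : id)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' "$tmp/TABLE1" "$tmp/TABLE2")

printf '%s\n' "$result"
rm -r "$tmp"
```

Because each line is scanned once and every ID found is looked up directly, this also avoids looping over the whole of TABLE1 for every line of TABLE2.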

Related

Filtering with help of two columns from a TSV file

I have the following file:
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 mRNA 63209 63880 . - . ID=mRNA.CA01g00010;Parent=CA01g00010;Note="Detected protein of unknown function"
Pepper1.55ch01 PGA1.55 exon 63209 63300 . - . ID=exon:CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63209 63300 . - 0 ID=CDS:CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 exon 63445 63730 . - . ID=exon:CA01g00010:2;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63445 63730 . - 0 ID=CDS:CA01g00010:2;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 exon 63758 63880 . - . ID=exon:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63758 63880 . - 0 ID=CDS:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=CA01g00020;Name=CA01g00020
Pepper1.55ch01 PGA1.55 mRNA 112298 112938 . - . ID=mRNA.CA01g00020;Parent=CA01g00020;Note="PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]"
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112298 112457 . - 0 ID=CDS:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 exon 112565 112743 . - . ID=exon:CA01g00020:2;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112565 112743 . - 0 ID=CDS:CA01g00020:2;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 exon 112828 112938 . - . ID=exon:CA01g00020:3;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112828 112938 . - 0 ID=CDS:CA01g00020:3;Parent=mRNA.CA01g00020
...
Now I want to extract the ID (e.g. CA01g00010) from column 9 if column 3 is gene. However, the awk/grep commands below returned different numbers of IDs.
> awk '{print $3,$9}' Pepper_1.55.gene_models-1-12.gff3 | grep gene | wc -l
30265
> awk '{print $3}' Pepper_1.55.gene_models-1-12.gff3 | grep gene | wc -l
30242
It appears that column 9 sometimes contains the word gene. What did I miss?
I want to extract ID (e.g. CA01g00010) from column 9 if column 3 is a gene
You may use this awk solution:
awk -F '\t' '$3 == "gene" {gsub(/^ID=|;.*/, "", $9); print $9}' file.tsv
CA01g00010
CA01g00020
Details:
-F '\t': This awk command uses \t (tab) as input field separator.
$3 == "gene": When $3 is gene then take an action
{...} is action block that contains:
gsub(/^ID=|;.*/, "", $9): Remove initial ID= part and anything that comes after ; from $9
print $9: print $9
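The differing counts in the question are consistent with grep matching the text gene anywhere in the printed output, including inside column 9, whereas $3 == "gene" compares one field exactly. A small sketch with made-up two-line input (the file name demo.gff3 and its contents are hypothetical) shows the effect:

```shell
tmp=$(mktemp -d)
# Two tab-separated records: a gene, and an mRNA whose column 9 mentions "gene".
printf 'chr1\tsrc\tgene\t1\t10\t.\t+\t.\tID=g1;Name=g1\n'          >  "$tmp/demo.gff3"
printf 'chr1\tsrc\tmRNA\t1\t10\t.\t+\t.\tID=mRNA.g1;Note=gene_like\n' >> "$tmp/demo.gff3"

# Substring match on the printed text: counts both lines.
grep_count=$(awk '{print $3,$9}' "$tmp/demo.gff3" | grep -c gene)

# Exact comparison on column 3 only: counts just the gene record.
exact_count=$(awk -F'\t' '$3 == "gene" {n++} END {print n+0}' "$tmp/demo.gff3")

echo "grep: $grep_count  exact: $exact_count"
rm -r "$tmp"
```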
With your shown samples, please try following awk code.
awk -F'\t' '$3 == "gene" && $9 ~ /^ID=/ && split($9,array,"[=;]"){print array[2]}' Input_file
Explanation: TAB is set as the field separator for all lines of Input_file. The main program then checks that the 3rd column is gene AND that the 9th column starts with ID=, splits the 9th column into an array named array on the delimiters = and ;, and prints the 2nd element of that array.
Assumptions:
don't have to worry about case-insensitive matches (eg, don't have to match on GENE or Gene)
a match in column 9 can be further stipulated as the column starting with ID=CA01g00010;
OP's current objective appears to be the collection of a count of matching rows; otherwise OP should update question to state the desired output (eg, print the entire line? print a subset of columns?)
Modifying OP's sample input to provide a mix of matches and non-matches:
$ cat input.dat
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 exon 63209 63880 . - . ID=CA01g00010; skipme; Name=CA01g00010
Pepper1.55ch01 PGA1.55 gene 63209 63300 . - . skipme; ID=CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63758 63880 . - 0 ID=CDS:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=CA01g00020;Name=CA01g00020
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=DE03g00230; skipme; Name=CA01g00020
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 exon 63209 63880 . - . ID=CA01g00010; skipme; Name=CA01g00010
One awk idea that replaces OP's current awk|grep|wc code:
awk -F'\t' -v col3="${col3}" -v id="${id}" ' # allow OP to define search strings for column 3 and the "ID=" match in column 9
$3 == col3 && match($9,"ID="id";")==1 { cnt++ } # if we find both matches then increment our counter
END { print cnt+0 } # "+0" to force default value from empty string to 0
' input.dat
For bash variables col3='gene' and id=CA01g00010 we get:
2
For bash variables col3='gene' and id='DE03g00230' we get:
1
For bash variables col3='gene' and id='findme' we get:
0

Change pattern just in next column matching another pattern

This is the header of my file:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "chr1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "chr1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
I would like to save exactly the same content to another file, but with the chr pattern removed and M changed to MT in the column that follows the one matching remap_original_location.
So, my desired output is:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
Do you know how I can achieve this?
I am trying some code like this:
awk '{for(i=1;i<=NF;i++){ if($i=="remap_original_location"){print ??? }}}'
But I am not sure how to specify the print part. In addition, as you can see, not all rows contain the pattern remap_original_location, and yet I still want to print them.
With perl:
perl -pe 's/remap_original_location "\Kchr(M)?/$1 ? "MT" : ""/e' ip.txt
remap_original_location ": I'm assuming a single space between fields here and that the next field always starts with a " character. You can adjust the regex for other variations if needed
\K: the preceding portion won't be part of the matched text to be replaced
(M)?: optionally match the M character
$1 ? "MT" : "": if the first capture group isn't empty, use MT as the replacement string, else use the empty string
the empty string is falsy in Perl
you can also use $1 && "MT" instead of the ternary expression in this case, since the falsy value is the same as the alternate value needed
e flag: allows Perl code in the replacement section
You may use this awk:
awk '{gsub(/chr/, ""); for (i=1; i<NF; ++i) if ($i == "remap_original_location") {gsub(/M/, "MT", $(i+1)); break}} 1' file
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
A more readable form:
awk '{
gsub(/chr/, "")
for (i=1; i<NF; ++i)
if ($i == "remap_original_location") {
gsub(/M/, "MT", $(i+1))
break
}
} 1' file
With your shown samples, could you please try following.
awk '
{
gsub(/chr/,"")
}
match($0,/remap_original_location "M:/){
val=substr($0,RSTART,RLENGTH)
sub(/"M:/,"\"MT:",val)
$0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1' Input_file
OR as per Sundeep's comment one could try following too:
awk '{gsub(/chr/,""); sub(/remap_original_location "M/, "&T")} 1' Input_file
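Two properties of the awk answers above are worth keeping in mind: gsub(/chr/, "") removes chr anywhere on the line, and assigning to a field makes awk rebuild the line with single spaces. If either matters, one possible sketch is a sed command that only touches the text directly after remap_original_location ". It assumes, as the perl answer does, a single space before the opening quote; the file name ip.txt is borrowed from that answer.

```shell
tmp=$(mktemp -d)
cat > "$tmp/ip.txt" <<'EOF'
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
EOF

# First rule: chrM -> MT. Second rule: drop any remaining chr prefix.
# Lines without remap_original_location pass through unchanged.
fixed=$(sed -E \
    -e 's/(remap_original_location ")chrM:/\1MT:/' \
    -e 's/(remap_original_location ")chr/\1/' \
    "$tmp/ip.txt")

printf '%s\n' "$fixed"
rm -r "$tmp"
```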

Replacing a string in one file, with the contents of another file based on a common string

I have two files. I would like to replace a certain string in file 1, with the contents of file 2 based on a common string.
file 1
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
file 2
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
output
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
In file 1 there are multiple instances of the same string; file 2, however, has it only once. I would like all instances of NM_**** etc. to be replaced by the contents of file 2 when the first column matches. After that, I would like to remove the NM_**** completely from the file.
I am very new to bash etc. I have looked all over the place for a way to do this, but nothing so far has worked. Also, there are over 5000 lines in file 2, and many more in file 1.
Any help would be much appreciated!
Thanks.
This is a join operation. If the files are sorted on the join key and the white space is not significant, the easiest way is
$ join -1 9 -2 1 file1 file2 | cut -d' ' -f2-
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
If the files are not sorted and white space is important, awk will be a better solution
$ awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next}
$NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
The exercise for you is to understand the code. There are many examples (>100) on this site for exactly this question, many with commented scripts, some of which were written by me.
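For readers working through that exercise, here is a commented sketch of the same lookup idea. It is a variant rather than the answer's exact code: it assigns to $NF instead of calling sub(), which avoids regex metacharacters in the key but makes awk rebuild the line with single spaces. The array name repl is illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
EOF
cat > "$tmp/file2" <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

merged=$(awk '
NR == FNR {                       # true only while reading the first file (file2)
    key = $1                      # the NM_... accession
    sub(/^[^ \t]+[ \t]+/, "")     # strip the key and the blank after it from $0
    repl[key] = $0                # remember the rest of the line
    next
}
$NF in repl { $NF = repl[$NF] }   # file1: swap the last field if it is a known key
{ print }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$merged"
rm -r "$tmp"
```

Lines of file1 whose last field has no entry in file2 are printed unchanged.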

Split large file according to value in single column (AWK)

I would like to split a large file (10^6 rows) according to the value in the 6th column (about 10*10^3 unique values). However, I can't get it working because of the number of records. It should be easy but it's taking hours already and I'm not getting any further.
I've tried two options:
Option 1
awk '{print > $6".txt"}' input.file
awk: cannot open "Parent=mRNA:Solyc06g051570.2.1.txt" for output (Too many open files)
Option 2
awk '{print > $6; close($6)}' input.file
This doesn't cause an error but the files it creates contain only the last line corresponding to 'grouping' value $6
This is the beginning of my file, however, this file doesn't cause an error because it's so small:
exon 3688 4407 + ID=exon:Solyc06g005000.2.1.1 Parent=mRNA:Solyc06g005000.2.1
exon 4853 5604 + ID=exon:Solyc06g005000.2.1.2 Parent=mRNA:Solyc06g005000.2.1
exon 7663 7998 + ID=exon:Solyc06g005000.2.1.3 Parent=mRNA:Solyc06g005000.2.1
exon 9148 9408 + ID=exon:Solyc06g005010.1.1.1 Parent=mRNA:Solyc06g005010.1.1
exon 13310 13330 + ID=exon:Solyc06g005020.1.1.1 Parent=mRNA:Solyc06g005020.1.1
exon 13449 13532 + ID=exon:Solyc06g005020.1.1.2 Parent=mRNA:Solyc06g005020.1.1
exon 13711 13783 + ID=exon:Solyc06g005020.1.1.3 Parent=mRNA:Solyc06g005020.1.1
exon 14172 14236 + ID=exon:Solyc06g005020.1.1.4 Parent=mRNA:Solyc06g005020.1.1
exon 14717 14803 + ID=exon:Solyc06g005020.1.1.5 Parent=mRNA:Solyc06g005020.1.1
exon 14915 15016 + ID=exon:Solyc06g005020.1.1.6 Parent=mRNA:Solyc06g005020.1.1
exon 22106 22261 + ID=exon:Solyc06g005030.1.1.1 Parent=mRNA:Solyc06g005030.1.1
exon 23462 23749 - ID=exon:Solyc06g005040.1.1.1 Parent=mRNA:Solyc06g005040.1.1
exon 24702 24713 - ID=exon:Solyc06g005050.2.1.3 Parent=mRNA:Solyc06g005050.2.1
exon 24898 25402 - ID=exon:Solyc06g005050.2.1.2 Parent=mRNA:Solyc06g005050.2.1
exon 25728 25845 - ID=exon:Solyc06g005050.2.1.1 Parent=mRNA:Solyc06g005050.2.1
exon 36352 36835 + ID=exon:Solyc06g005060.2.1.1 Parent=mRNA:Solyc06g005060.2.1
exon 36916 38132 + ID=exon:Solyc06g005060.2.1.2 Parent=mRNA:Solyc06g005060.2.1
exon 57089 57096 + ID=exon:Solyc06g005070.1.1.1 Parent=mRNA:Solyc06g005070.1.1
exon 57329 58268 + ID=exon:Solyc06g005070.1.1.2 Parent=mRNA:Solyc06g005070.1.1
exon 59970 60505 - ID=exon:Solyc06g005080.2.1.24 Parent=mRNA:Solyc06g005080.2.1
exon 60667 60783 - ID=exon:Solyc06g005080.2.1.23 Parent=mRNA:Solyc06g005080.2.1
exon 63719 63880 - ID=exon:Solyc06g005080.2.1.22 Parent=mRNA:Solyc06g005080.2.1
exon 64143 64298 - ID=exon:Solyc06g005080.2.1.21 Parent=mRNA:Solyc06g005080.2.1
exon 66964 67191 - ID=exon:Solyc06g005080.2.1.20 Parent=mRNA:Solyc06g005080.2.1
exon 71371 71559 - ID=exon:Solyc06g005080.2.1.19 Parent=mRNA:Solyc06g005080.2.1
exon 73612 73717 - ID=exon:Solyc06g005080.2.1.18 Parent=mRNA:Solyc06g005080.2.1
exon 76764 76894 - ID=exon:Solyc06g005080.2.1.17 Parent=mRNA:Solyc06g005080.2.1
exon 77189 77251 - ID=exon:Solyc06g005080.2.1.16 Parent=mRNA:Solyc06g005080.2.1
exon 80044 80122 - ID=exon:Solyc06g005080.2.1.15 Parent=mRNA:Solyc06g005080.2.1
exon 80496 80638 - ID=exon:Solyc06g005080.2.1.14 Parent=mRNA:Solyc06g005080.2.1
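For Option 2, use ">>" instead of ">" to append. After close($6), reopening the file with ">" truncates it, which is why only the last line of each group survived.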
awk '{print >> $6; close($6)}' input.file
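The append approach opens and closes a file for every input line. An alternative sketch, assuming the row order across groups does not matter: sort on column 6 first, so each output file is opened once and can be closed as soon as the key changes. The variable names workdir, dir and prev are illustrative, and -s (stable sort) is used to keep the original order within each group.

```shell
workdir=$(mktemp -d)
cat > "$workdir/input.file" <<'EOF'
exon 3688 4407 + ID=exon:Solyc06g005000.2.1.1 Parent=mRNA:Solyc06g005000.2.1
exon 9148 9408 + ID=exon:Solyc06g005010.1.1.1 Parent=mRNA:Solyc06g005010.1.1
exon 4853 5604 + ID=exon:Solyc06g005000.2.1.2 Parent=mRNA:Solyc06g005000.2.1
EOF

sort -s -k6,6 "$workdir/input.file" |
awk -v dir="$workdir" '
$6 != prev {                                   # key changed: finish previous file
    if (prev != "") close(dir "/" prev ".txt")
    prev = $6
}
{ print > (dir "/" prev ".txt") }              # only one output file open at a time
'

ls "$workdir"
```

With sorted input, ">" is safe here because no file is ever reopened after it has been closed.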

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search file and extract the values after string FPKM, like
"1.2028600217"
Can I do it using awk?
If you don't care which column FPKM appears in, you could use (grep -P requires GNU grep):
grep -Po '(?<=FPKM )"[^"]*"' file
You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
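Since the question asks specifically about awk, here is one possible awk sketch. It assumes whitespace-separated fields with the value directly after the FPKM field, as in the sample line, and it keeps the surrounding quotes as shown in the question's desired output. The variable name value is illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
EOF

fpkm=$(awk '{
    for (i = 1; i < NF; i++)
        if ($i == "FPKM") {          # the value is the field right after FPKM
            value = $(i + 1)
            sub(/;$/, "", value)     # drop the trailing semicolon, keep the quotes
            print value
            break
        }
}' "$tmp/file")

printf '%s\n' "$fpkm"
rm -r "$tmp"
```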