Replacing a string in one file, with the contents of another file based on a common string - awk

I have two files. I would like to replace a certain string in file 1, with the contents of file 2 based on a common string.
file 1
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
file 2
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
output
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
In file 1 the same NM_**** string occurs multiple times, but file 2 lists each one only once. I would like every instance of an NM_**** string in file 1 to be replaced by the rest of the line in file 2 whose first column matches it, so that the NM_**** identifier itself no longer appears in the output.
I am very new to bash etc. I have looked all over the place for a way to do this, but none so far have worked. Also, there are over 5000 lines in file 2, many more in file 1.
Any help would be much appreciated!
Thanks.

This is a join operation. If both files are sorted on the join key and the exact whitespace is not significant, the easiest approach is join (field 9 of file1 against field 1 of file2):
$ join -1 9 -2 1 file1 file2 | cut -d' ' -f2-
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
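If the files are not already sorted, they can be sorted on the join key first. The following is a minimal sketch under that assumption, not part of the original answer: the temporary directory and the .sorted file names are made up for the demo, and LC_ALL=C is set so that sort and join agree on the collation order.

```shell
# Demo copies of the two files from the question (temp paths are illustrative).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
EOF
cat > "$tmp/file2" <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

# Sort each file on its join field (field 9 of file1, field 1 of file2).
LC_ALL=C sort -k 9b,9 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k 1b,1 "$tmp/file2" > "$tmp/file2.sorted"

# join prints the key first, then the remaining fields of each file;
# cut drops the key so only the annotation is left at the end of the line.
result=$(LC_ALL=C join -1 9 -2 1 "$tmp/file1.sorted" "$tmp/file2.sorted" | cut -d' ' -f2-)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that the output comes out in key order rather than in the original order of file1, which may or may not matter for your data.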
If the files are not sorted, or the whitespace is significant, awk is the better tool:
$ awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next}
$NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
The exercise for you is to understand the code. There are many examples (>100) on this site for exactly this question, many with commented scripts, some of which were written by me.
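For readers who want the lookup logic spelled out, here is a commented sketch of the same idea. It is a variant, not the answerer's exact script: it assigns to the last field instead of calling sub(), which avoids regex and & handling in the replacement text but rebuilds the line with single spaces. The temporary directory and the array name map are assumptions for the demo.

```shell
# Demo data copied from the question; the temp directory is only for illustration.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
EOF
cat > "$tmp/file2" <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

result=$(awk '
    # First file on the command line (file2): build the lookup table.
    NR == FNR {
        key = $1
        sub(/^[^ \t]+[ \t]+/, "")   # drop the key and the whitespace after it
        map[key] = $0               # remember the rest of the line under that key
        next
    }
    # Second file (file1): swap the last field when its value is a known key.
    $NF in map { $NF = map[$NF] }
    { print }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because file2 is read first, the order of the file arguments matters: the lookup file comes before the file being rewritten.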

Related

AWK replace full string in TABLE2 according to TABLE1

I have TABLE1, where the first column is a string that should be replaced in TABLE2 and the second column is the value that should replace it.
TABLE1 looks like this:
g63. MYL9
g5990. PTC7
g6018. POLYUBQ
g17850. NAA50
TABLE2 looks like this, for example:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "g63.t1"; gene_id "g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "g630.t1"; gene_id "g630.
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . g631.
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . g632.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
So I assembled this awk command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' TABLE1 TABLE2
It works only up to a point: MYL9, for example, replaces not only the string g63. but also strings like g630, g631, g632 ... g6300 and so on. So the final table looks like this:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . MYL9
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "MYL9"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . MYL9
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "MYL9t1"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . MYL9
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . MYL9
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
I need it to edit just g63. and not others like g630. and so on.
I have spent quite a long time on this and have to take a break now, so if anybody has an idea what is wrong, I would appreciate it. Thanks.
So I solved the problem in a non-elegant way. I realized that the dot at the end of the strings in the first column is treated as a special character (any symbol), because gsub interprets its first argument as a regular expression, so I just replaced the dots with underscores.
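An alternative that keeps the original keys is to replace them as literal strings with index and substr, so that the dot is never interpreted as a regex metacharacter. This is only a sketch under that assumption: replace_all is a hypothetical helper name (not an awk built-in), map is an arbitrary array name, and the sample rows are a subset of the tables in the question.

```shell
# Demo data copied from the question; the temp directory is only for illustration.
tmp=$(mktemp -d)
cat > "$tmp/TABLE1" <<'EOF'
g63. MYL9
g5990. PTC7
g6018. POLYUBQ
g17850. NAA50
EOF
cat > "$tmp/TABLE2" <<'EOF'
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
EOF

result=$(awk '
    # Hypothetical helper: replace every literal occurrence of "from" in s.
    function replace_all(s, from, to,    out, p) {
        out = ""
        while ((p = index(s, from)) > 0) {
            out = out substr(s, 1, p - 1) to   # text before the match, then the replacement
            s = substr(s, p + length(from))    # continue after the match
        }
        return out s
    }
    FNR == NR { map[$1] = $2; next }           # TABLE1: key -> replacement
    {
        for (k in map) $0 = replace_all($0, k, map[k])
        print
    }
' "$tmp/TABLE1" "$tmp/TABLE2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

A key such as g63. would still be replaced inside g63.t1, because that text literally contains the key. If only whole fields should change, comparing a field against the keys with the in operator is the stricter option.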

Change pattern just in next column matching another pattern

This is the header of my file:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "chr1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "chr1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
I would like to save exactly the same content to another file, but with the chr pattern removed and M changed to MT in the column right after the one matching remap_original_location.
So, my desired output is:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
Do you know how I can achieve this?
I am trying some code like this:
awk '{for(i=1;i<=NF;i++){ if($i=="remap_original_location"){print ??? }}}'
But I am not sure how to specify the print part. In addition, as you can see, not all rows contain the pattern remap_original_location, and I still want to print them.
With perl:
perl -pe 's/remap_original_location "\Kchr(M)?/$1 ? "MT" : ""/e' ip.txt
remap_original_location ": this assumes a single space between fields and that the next field always starts with a " character. You can adjust the regex for other variations if needed
\K: the portion matched before it won't be part of the matched text to be replaced
(M)?: optionally match the M character
$1 ? "MT" : "": if the first capture group isn't empty, use MT as the replacement string, else use an empty string
the empty string is falsy in Perl
you can also use $1 && "MT" instead of the ternary expression in this case, since the falsy value is the same as the alternate value needed
e flag: allows Perl code in the replacement section
You may use this awk:
awk '{gsub(/chr/, ""); for (i=1; i<NF; ++i) if ($i == "remap_original_location") {gsub(/M/, "MT", $(i+1)); break}} 1' file
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
A more readable form:
awk '{
   gsub(/chr/, "")
   for (i=1; i<NF; ++i)
      if ($i == "remap_original_location") {
         gsub(/M/, "MT", $(i+1))
         break
      }
} 1' file
With your shown samples, please try the following:
awk '
{
  gsub(/chr/,"")
}
match($0,/remap_original_location "M:/){
  val=substr($0,RSTART,RLENGTH)
  sub(/"M:/,"\"MT:",val)
  $0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1' Input_file
Or, as per Sundeep's comment, one could also try:
awk '{gsub(/chr/,""); sub(/remap_original_location "M/, "&T")} 1' Input_file
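The awk answers above remove chr from the whole line. If chr could also occur in other columns, a stricter variant might touch only the field that follows remap_original_location. The sketch below assumes that requirement; fix_remap is a hypothetical function name, and the input rows are taken from the question. Lines that are modified get rebuilt with single spaces between fields.

```shell
# Hypothetical wrapper: edit only the value right after remap_original_location.
fix_remap() {
    awk '{
        for (i = 1; i < NF; ++i)
            if ($i == "remap_original_location") {
                # Try the mitochondrial case first, then the generic chr prefix.
                if (sub(/^"chrM:/, "\"MT:", $(i + 1)) == 0)
                    sub(/^"chr/, "\"", $(i + 1))
                break
            }
        print   # every line is printed, with or without the pattern
    }' "$@"
}

result=$(fix_remap <<'EOF'
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
EOF
)
printf '%s\n' "$result"
```

With a real file it could be called as fix_remap input.gtf > output.gtf, where both file names are placeholders.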

Change field separator with awk or sed of a specific set of columns

I would like to modify a file where both tabs and spaces are used as field separators.
At the beginning we have a file with this type of structure:
chr1 Cufflinks gene_id "XLOC_000001"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XLOC_000012"; oId "XR_001548508";
Running awk -F' ' '$4=$6 {print $0}' does what I am looking for (it replaces the value of gene_id with the value of oId):
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
The problem is that it changes the line structure: the tabs (\t) between chr1, Cufflinks and gene_id disappear. I tried adding -v OFS=\t, but that puts tabs in the gene_id "XLOC_000012"; oId "XR_001548508"; part (which should stay separated by spaces). I also tried sed with something like sed -i 's/ /\t/', but it also put tabs everywhere.
How can I set the field separator for columns 1 to 3 only, without changing the separators between columns 3 and 6?
A possibility with awk:
awk -F '[ ]' '{$2 = $4; print}' file
By using the space character for the input field separator (as opposed to spaces and tabs), a field can be assigned to without changing the tab characters to spaces.
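To see that the tabs really survive, the one-liner can be run on a line built with printf and the result piped through a small awk filter that makes tabs visible. This is just a demonstration sketch: the input line is the first sample line from the question, and the <TAB> marker is an arbitrary choice for display.

```shell
# printf expands \t into real tab characters for the demo input.
input=$(printf 'chr1\tCufflinks\tgene_id "XLOC_000001"; oId "XR_003076322.1";')

# With FS set to a single space, the tab-separated prefix stays inside $1,
# so rebuilding the record with the default OFS (a space) leaves the tabs alone.
result=$(printf '%s\n' "$input" | awk -F '[ ]' '{ $2 = $4; print }')

# Show the tabs explicitly.
printf '%s\n' "$result" | awk '{ gsub(/\t/, "<TAB>"); print }'
```

The last command prints chr1<TAB>Cufflinks<TAB>gene_id "XR_003076322.1"; oId "XR_003076322.1"; which confirms that only the space-separated part was rewritten.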
For more complex cases, there is split (but no "join"):
awk 'BEGIN {FS=OFS="\t"} {
   n = split($3, a, " "); a[2] = a[4]
   for (i=1; i<=n; ++i)
      $3 = (i == 1 ? "" : $3 " ") a[i]
} 1' file
You may use this sed that preserves your whitespaces:
sed -E $'s/^([ \t]*([^ \t]+[ \t]+){3})[^ \t]+([ \t]+)(([^ \t]+[ \t]+){1})([^ \t]+)/\\1\\6\\3\\4\\6/' ff
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";
Explanation for copying 6th field to 4th field:
^: # match start
([ \t]*([^ \t]+[ \t]+){3}): # match first 4-1 fields and capture in group #1
[^ \t]+: # match 4th field
([ \t]+): # match whitespace after 4th field and capture in group #3
(([^ \t]+[ \t]+){1}): # match next (6-4-1) fields and capture in group #4
([^ \t]+): # match 6th field and capture in group #6
\\1\\6\\3\\4\\6: Place back-reference back in substitution
Alternatively this awk also creates a tabular aligned output:
awk '$4=$6' file | column -t
chr1 Cufflinks gene_id "XR_003076322.1"; oId "XR_003076322.1";
chr1 Cufflinks gene_id "XR_001548508"; oId "XR_001548508";

awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm now struggling to extract chr1, 11869 and 14409, as this will need a different field separator. How is this done on the same line?
Try the following command to extract what you want:
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/[ \t]+/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation:
split($1,a,/[ \t]+/): split $1 (everything before the first double quote) into the array a, using runs of spaces or tabs as the separator
Print the pieces stored in a, followed by the quoted values $2 and $6.
$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
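Both answers rely on fixed field positions. If the order of the attributes can vary between lines, a lookup by attribute name may be more robust. The following sketch assumes the attributes start at field 9 and that attribute values contain no spaces; extract_gene is a hypothetical function name.

```shell
# Hypothetical helper: pick gene_id and gene_name by name instead of by position.
extract_gene() {
    awk -v OFS='\t' '{
        id = name = ""
        for (i = 9; i < NF; ++i) {
            v = $(i + 1)
            gsub(/[";]/, "", v)              # strip the quotes and the trailing semicolon
            if ($i == "gene_id")   id = v
            if ($i == "gene_name") name = v
        }
        print $1, $4, $5, id, name
    }' "$@"
}

line='chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'
result=$(printf '%s\n' "$line" | extract_gene)
printf '%s\n' "$result"
```

On the sample line this prints the same five tab-separated values as the command above.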

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search the file and extract the value after the string FPKM, like
"1.2028600217"
Can I do it using awk?
If you don't care which column FPKM appears in, you could use:
grep -Po '(?<=FPKM )"[^"]*"' file
You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
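Since the question asks specifically about awk, here is one possible awk version. It is a sketch that assumes FPKM appears as a whitespace-separated field immediately followed by its quoted value; get_fpkm is a hypothetical function name. Unlike the sed command, it keeps the double quotes, as in the output requested in the question.

```shell
# Hypothetical helper: print the quoted value that follows the FPKM field.
get_fpkm() {
    awk '{
        for (i = 1; i < NF; ++i)
            if ($i == "FPKM") {
                v = $(i + 1)
                sub(/;$/, "", v)   # drop the trailing semicolon, keep the quotes
                print v
                break
            }
    }' "$@"
}

line='chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";'
result=$(printf '%s\n' "$line" | get_fpkm)
printf '%s\n' "$result"
```

Lines without an FPKM field produce no output, so the result has at most one value per input row.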