Filtering with help of two columns from a TSV file - awk

I have the following file:
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 mRNA 63209 63880 . - . ID=mRNA.CA01g00010;Parent=CA01g00010;Note="Detected protein of unknown function"
Pepper1.55ch01 PGA1.55 exon 63209 63300 . - . ID=exon:CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63209 63300 . - 0 ID=CDS:CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 exon 63445 63730 . - . ID=exon:CA01g00010:2;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63445 63730 . - 0 ID=CDS:CA01g00010:2;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 exon 63758 63880 . - . ID=exon:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63758 63880 . - 0 ID=CDS:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=CA01g00020;Name=CA01g00020
Pepper1.55ch01 PGA1.55 mRNA 112298 112938 . - . ID=mRNA.CA01g00020;Parent=CA01g00020;Note="PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]"
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112298 112457 . - 0 ID=CDS:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 exon 112565 112743 . - . ID=exon:CA01g00020:2;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112565 112743 . - 0 ID=CDS:CA01g00020:2;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 exon 112828 112938 . - . ID=exon:CA01g00020:3;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 CDS 112828 112938 . - 0 ID=CDS:CA01g00020:3;Parent=mRNA.CA01g00020
...
Now I want to extract the ID (e.g. CA01g00010) from column 9 if column 3 is gene. However, the awk/grep commands below return different numbers of IDs.
> awk '{print $3,$9}' Pepper_1.55.gene_models-1-12.gff3 | grep gene | wc -l
30265
> awk '{print $3}' Pepper_1.55.gene_models-1-12.gff3 | grep gene | wc -l
30242
It appears that column 9 sometimes contains the string gene. What did I miss?

I want to extract ID (e.g. CA01g00010) from column 9 if column 3 is a gene
You may use this awk solution:
awk -F '\t' '$3 == "gene" {gsub(/^ID=|;.*/, "", $9); print $9}' file.tsv
CA01g00010
CA01g00020
Details:
-F '\t': This awk command uses \t (tab) as input field separator.
$3 == "gene": When $3 is gene then take an action
{...} is action block that contains:
gsub(/^ID=|;.*/, "", $9): Remove initial ID= part and anything that comes after ; from $9
print $9: print $9
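As for why the two counts in the question differ: grep gene matches the string anywhere on the printed line, so once $9 is printed, any row whose ninth whitespace-separated field contains gene is counted as well. A minimal sketch of that effect, assuming a made-up three-line tab-separated sample (the Note text in the second row is hypothetical, not taken from the real file):

```shell
# Hypothetical sample: printf reuses the 9-column format for each group of 9 arguments.
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  ch01 PGA gene 1 9 . - . 'ID=CA01g00010;Name=CA01g00010' \
  ch01 PGA mRNA 1 9 . - . 'ID=mRNA.CA01g00010;Note="gene_of_unknown_function"' \
  ch01 PGA exon 1 9 . - . 'ID=exon:CA01g00010:1')

# Loose count: "gene" may appear in column 3 OR in column 9.
loose=$(printf '%s\n' "$sample" | awk '{print $3,$9}' | grep -c gene)

# Strict count: column 3 must be exactly "gene".
strict=$(printf '%s\n' "$sample" | awk -F '\t' '$3 == "gene" {n++} END {print n+0}')

echo "loose=$loose strict=$strict"
```

Comparing $3 inside awk, rather than grepping the printed output, avoids the extra matches.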

With your shown samples, please try following awk code.
awk -F'\t' '$3 == "gene" && $9 ~ /^ID=/ && split($9,array,"[=;]"){print array[2]}' Input_file
Explanation: The field separator is set to TAB for all lines of Input_file. The main program checks that the 3rd column is gene AND that the 9th column starts with ID=, then splits the 9th column into an array named array using = or ; as delimiters, and prints the array's 2nd element, which is the ID value.
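Both solutions above rely on ID= being the first attribute in column 9. If that cannot be guaranteed, one possible variant is to split column 9 on ; and look for the attribute that starts with ID=. This is only a sketch: the second sample row, where ID= is not first, is a hypothetical case and not something shown in the question.

```shell
ids=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  ch01 PGA gene 1 9 . - . 'ID=CA01g00010;Name=CA01g00010' \
  ch01 PGA gene 20 30 . - . 'Name=CA01g00020;ID=CA01g00020' \
  ch01 PGA exon 1 5 . - . 'ID=exon:CA01g00010:1;Parent=mRNA.CA01g00010' |
awk -F '\t' '$3 == "gene" {
  n = split($9, attr, ";")             # one array element per attribute
  for (i = 1; i <= n; i++)
    if (attr[i] ~ /^ID=/) {            # the attribute that carries the ID
      print substr(attr[i], 4)         # drop the leading "ID="
      break
    }
}')
printf '%s\n' "$ids"
```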

Assumptions:
don't have to worry about case-insensitive matches (eg, don't have to match on GENE or Gene)
a match in column 9 can be further stipulated as the column starting with ID=CA01g00010;
OP's current objective appears to be the collection of a count of matching rows; otherwise OP should update question to state the desired output (eg, print the entire line? print a subset of columns?)
Modifying OP's sample input to provide a mix of matches and non-matches:
$ cat input.dat
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 exon 63209 63880 . - . ID=CA01g00010; skipme; Name=CA01g00010
Pepper1.55ch01 PGA1.55 gene 63209 63300 . - . skipme; ID=CA01g00010:1;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 CDS 63758 63880 . - 0 ID=CDS:CA01g00010:3;Parent=mRNA.CA01g00010
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=CA01g00020;Name=CA01g00020
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 gene 112298 112938 . - . ID=DE03g00230; skipme; Name=CA01g00020
Pepper1.55ch01 PGA1.55 exon 112298 112457 . - . ID=exon:CA01g00020:1;Parent=mRNA.CA01g00020
Pepper1.55ch01 PGA1.55 gene 63209 63880 . - . ID=CA01g00010;Name=CA01g00010
Pepper1.55ch01 PGA1.55 exon 63209 63880 . - . ID=CA01g00010; skipme; Name=CA01g00010
One awk idea that replaces OP's current awk|grep|wc code:
awk -F'\t' -v col3="${col3}" -v id="${id}" ' # allow OP to define search strings for column 3 and the "ID=" match in column 9
$3 == col3 && match($9,"ID="id";")==1 { cnt++ } # if we find both matches then increment our counter
END { print cnt+0 } # "+0" to force default value from empty string to 0
' input.dat
For bash variables col3='gene' and id=CA01g00010 we get:
2
For bash variables col3='gene' and id='DE03g00230' we get:
1
For bash variables col3='gene' and id='findme' we get:
0

Related

Change pattern just in next column matching another pattern

This is the header of my file:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "chr1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "chr1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "chrM:+:226-399" * + 3
MT HAVANA * 27 . -
I would like to save exactly the same content to another file, but with the chr pattern removed and M changed to MT in the column that follows the column matching remap_original_location.
So, my desired output is:
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
Do you know how can I achieve this?
I am trying some code like this:
awk '{for(i=1;i<=NF;i++){ if($i=="remap_original_location"){print ??? }}}'
But I am not sure how to specify the print part. In addition, as you can see, not all rows contain the pattern remap_original_location, and yet I still want to print them.
With perl:
perl -pe 's/remap_original_location "\Kchr(M)?/$1 ? "MT" : ""/e' ip.txt
remap_original_location " I'm assuming single space to be consistent between fields here and that the next field will always start with " character. You can adjust the regex for other variations if needed
\K preceding portion won't be part of the matched text to be replaced
(M)? optionally match M character
$1 ? "MT" : "" if first capture group isn't empty, use MT as replacement string, else use empty string
empty string is Falsy in Perl
you can also use $1 && "MT" instead of ternary expression in this case, since the Falsy value is same as the alternate value needed
e flag helps to use Perl code in replacement section
You may use this awk:
awk '{gsub(/chr/, ""); for (i=1; i<NF; ++i) if ($i == "remap_original_location") {gsub(/M/, "MT", $(i+1)); break}} 1' file
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "1:+:11869-14409"
1 HAVANA gene 118569 148409 . + . gene_name "ORF21" remap_original_location "1:+:118569-148409" clinSig 0.59
1 HAVANA transcript 118568 148419 . + . remap_original_location "1:+:118568-148419" clinSig 0.02 M .
MT HAVANA gene 226 399 . + . remap_original_location "MT:+:226-399" * + 3
MT HAVANA * 27 . -
A more readable form:
awk '{
gsub(/chr/, "")
for (i=1; i<NF; ++i)
if ($i == "remap_original_location") {
gsub(/M/, "MT", $(i+1))
break
}
} 1' file
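Note that gsub(/chr/, "") with no target removes chr everywhere on the line, not only in the field after remap_original_location. If other fields might contain chr, a stricter variant could confine both substitutions to that one field. The sketch below assumes single-space-separated input and that the value always starts with "chr; the second sample row (with chrM_like as a gene name) is hypothetical and only there to show that other fields are left alone.

```shell
fixed=$(awk '{
  for (i = 1; i < NF; ++i)
    if ($i == "remap_original_location") {
      # Try the chrM -> MT rewrite first; otherwise just strip the chr prefix.
      if (!sub(/^"chrM:/, "\"MT:", $(i+1)))
        sub(/^"chr/, "\"", $(i+1))
      break
    }
} 1' <<'EOF'
1 HAVANA gene 11869 14409 . + . gene_name "DDX11L1" remap_original_location "chr1:+:11869-14409"
1 HAVANA gene 5 9 . + . gene_name "chrM_like" remap_original_location "chrM:+:5-9"
MT HAVANA * 27 . -
EOF
)
printf '%s\n' "$fixed"
```

Rows without remap_original_location pass through unchanged because of the trailing 1.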
With your shown samples, could you please try following.
awk '
{
gsub(/chr/,"")
}
match($0,/remap_original_location "M:/){
val=substr($0,RSTART,RLENGTH)
sub(/"M:/,"\"MT:",val)
$0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1' Input_file
OR as per Sundeep's comment one could try following too:
awk '{gsub(/chr/,""); sub(/remap_original_location "M/, "&T")} 1' Input_file

isolating the rows based on values of fields in awk

I have a text file like this small example:
chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806
I am trying to filter the rows based on the 3rd column: if the 3rd column is gene, I want to keep the entire row, and filter out all other rows. Here is the expected output:
expected output:
chr1 HAVANA gene 14363 29806
I am trying to do that in awk using the following command, but the result is empty. Do you know how to fix it?
awk '{ if ($3 == 'gene') { print } }' infile.txt > outfile.txt
Use double quotes in the script:
$ awk '{ if ($3 == "gene") { print } }' file
chr1 HAVANA gene 14363 29806
or:
$ awk '{ if ($3 == "gene") print }' file
but you could just:
$ awk '$3 == "gene"' file
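The reason the original command prints nothing is shell quoting: the inner single quotes end and reopen the shell's quoted string, so awk receives $3 == gene, where gene is an unset awk variable that compares as the empty string. A small sketch of both behaviours, using the sample rows from the question; the variable name feature is just an illustrative choice:

```shell
sample='chr1 HAVANA exon 13221 13374
chr1 HAVANA exon 13453 13670
chr1 HAVANA gene 14363 29806'

# Broken: awk sees  $3 == gene  (an unset variable), so no row matches.
broken=$(printf '%s\n' "$sample" | awk '{ if ($3 == 'gene') { print } }')

# Working alternative: pass the value in with -v and compare against it.
fixed=$(printf '%s\n' "$sample" | awk -v feature=gene '$3 == feature')

printf 'broken=[%s]\nfixed=[%s]\n' "$broken" "$fixed"
```

Passing the value with -v also makes it easy to filter on a different feature type without editing the script.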

Replacing a string in one file, with the contents of another file based on a common string

I have two files. I would like to replace a certain string in file 1, with the contents of file 2 based on a common string.
file 1
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
file 2
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
output
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
in file 1 there are multiple instances of the same string, however, file 2 only has it once. I would like all instances of the NM_**** etc. to be replaced by the contents of file 2 when the first column matches. following this, I would like to completely remove the NM_**** from the file.
I am very new to bash etc. I have looked all over the place for a way to do this, but none so far have worked. Also, there are over 5000 lines in file 2, many more in file 1.
Any help would be much appreciated!
Thanks.
this is a join operation. If the files are sorted on the join key, and if the white space is not significant the easiest will be
$ join -19 -21 file1 file2 | cut -d' ' -f2-
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
if the files are not sorted and white space is important awk will be a better solution
$ awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next}
$NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1
Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
exercise for you is to understand the code. There are many examples (>100) on this site exactly for this question and with many commented scripts, some of which are written by me.
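For readers working through that exercise, here is a commented sketch of the same two-file idea. It is a variant, not the exact command above: it assigns to $NF instead of using sub(), which means awk rebuilds each modified line with single spaces between fields. The temporary directory and the array name annot are illustrative choices.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
EOF
cat > "$tmp/file2" <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

merged=$(awk '
  NR == FNR {                  # true only while reading the first file given (file2)
    key = $1                   # the NM_... accession
    sub(/^[^ ]+ +/, "")        # remove the key and the blanks after it from $0
    annot[key] = $0            # remember the remaining annotation text
    next                       # do not fall through to the file1 rules
  }
  $NF in annot { $NF = annot[$NF] }   # file1: swap the last field for its annotation
  1                                   # print every file1 line, changed or not
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$merged"
rm -rf "$tmp"
```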

awk multiple field separators?

I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409.
I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm struggling to now extract chr1, 11869 and 14409, as this will need a different field separator. How is this done on the same line?
Try to use following command to extract what you want,
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/ +/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation,
split($1,a,/ +/): split $1 (everything before the first double quote) into the array a, using one or more spaces as the separator
Print the split content stored in a as you desired.
$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
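Both commands above depend on the attributes appearing in a fixed order ($10 and $16, or $2 and $6). If the order can vary between lines, one option is to look the values up by attribute name with match(). This is a sketch under that assumption; the helper function attr is a made-up name, not part of awk.

```shell
line='chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'

fields=$(printf '%s\n' "$line" | awk -v OFS='\t' '
  # attr(name): return the quoted value that follows  name "..."  (hypothetical helper)
  function attr(name,   re) {
    re = name " \"[^\"]*\""
    if (match($0, re))
      # skip the name, the space and the opening quote; stop before the closing quote
      return substr($0, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
    return ""
  }
  { print $1, $4, $5, attr("gene_id"), attr("gene_name") }')

printf '%s\n' "$fields"
```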

Find the double quotes values and print them using awk

I have a file with 1000 rows in it
For example:
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
I want to search file and extract the values after string FPKM, like
"1.2028600217"
Can I do it using awk?
If you don't care which column FPKM appears in, you could use:
grep -Po '(?<=FPKM )"[^"]*"' file
You can use awk, but this is a simple substitution on a single line so sed is better suited:
$ cat file
chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";
$ sed 's/.*FPKM *"\([^"]*\)".*/\1/' file
1.2028600217
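Since the question asks specifically about awk, here is a minimal awk sketch as well. It assumes exactly one space between FPKM and the opening quote, as in the sample line, and it keeps the surrounding double quotes to match the requested output.

```shell
line='chr1 Cufflinks transcript 34611 36081 1000 - . gene_id "FAM138A"; transcript_id "uc001aak.3"; FPKM "1.2028600217"; frac "1.000000"; conf_lo "0.735264"; conf_hi "1.670456"; cov "0.978610";'

# match() sets RSTART/RLENGTH; skipping 5 characters drops the leading 'FPKM '.
fpkm=$(printf '%s\n' "$line" |
  awk 'match($0, /FPKM "[^"]*"/) { print substr($0, RSTART + 5, RLENGTH - 5) }')

printf '%s\n' "$fpkm"
```

Lines without an FPKM attribute produce no output, because the action only runs when match() succeeds.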