Finding sequence in data

Finding sequence in data - awk

I to use awk to find the sequence of pattern in a DNA data but I cannot figure out how to do it. I have a text file "test.tx" which contains a lot of data and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG and prints them.
for instance, if my text file has data that look like below. I want to find and match all the existing sequence and output as below.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I try something like this, but it only display the rows that starts with ATG. it doesn't actually fix my problem
awk '/^AGT/{print $0}' test.txt

assuming the records are not spanning multiple lines
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
non-greedy match, requires -P switch (to find the first match, not the longest).

Could you please try following.
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file

Related

awk - store first occurrence based on cell

I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
containing duplicates, some without, others with mildly contradicting LAT;LONG coordinates.
I only want to store first unique value of [$1;$2;$3;$4;$5] as output, so desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with proper formating of it... so any help appreciated !

I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to about your example (which simply checks if the unique index of the first three fields combined has been set yet and relies on the default print operation to output the first records having the unique combination):
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
If you say your works, then it is certainly shorter. Let me know if you have further questions.

Found it by stopping to look for creating arrays
created a new $1 being $1,$2,$3, but the other solutions is indeed more elegant, here is the command I came up with after merging the fields in the file (and setting them as new $1), which I then didn't have to do
awk -F';' '!seen[($1)]++' file1.csv > file2.csv

Awk - Grep - Match the exact string in a file

I have a file that looks like this
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I need to match the values in the fourth column exactly as they are written. So if I am searching for "Super" I need it to return the line with "Super" only.
ON,111111,TEN000812,Super,7483747483,767,Free
Likewise, if I'm looking for "Super Man" I need that exact line returned.
ON,454644,FRED84848,Super Man,65757,555,Free
I have tried using grep, but grep will match all instances that contain Super. So if I do this:
grep -i "Super" file.txt
It returns all lines, because they all contain "Super"
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I have also tired with awk, and I believe I'm close, but when I do:
awk '$4==Super' file.txt
I still get output like this:
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
I have been at this for hours, and any help would be greatly appreciated at this point.

You were close, or I should say very close just put field delimiter as comma in your solution and you are all set.
awk 'BEGIN{FS=","} $4=="Super"' Input_file
Also one more thing in OP's attempt while comparison with 4th field with string value, string should be wrapped in "
OR in case you want to mention value to be compared as an awk variable then try following.
awk -v value="Super" 'BEGIN{FS=","} $4==value' Input_file

You are quite close actually, you can try :
awk -F, '$4=="Super" {print}' file.txt
I find this form easier to grasp. Slightly longer than #RavinderSingh13 though
-F is the field separator, in this case comma
Next you have a condition followed by action
Condition is to check if the fourth field has the string Super
If the string is found, print it

awk - How to extract quoted string in space delimited log file

I'm hoping there might be some simple way to do this, as I'm a total novice using awk.
I have a bunch of log files from an AWS load balancer, and I want to extract entries from these logs, where a particular response code was received.
Checking the response code is easy enough, I can do the following...
$9=="403" {print $0}
However what I really want is just the request itself, $13, However this column is quoted, and will contain spaces. It looks like so...
"GET https://[my domain name]:443/[my path] HTTP/2.0"
If I do the following...
$9=="403" {print $13}
I just get...
"GET
So what I think I need to do, is for awk (or some other appropriate utility) to extract the complete column 13, and then be able to break that down into it's individual fields, for method, URL etc.

Could you please try following. I have given inside regex of match 443 as per your sample to match it you could give it as per your need to look for 403 change it to match($0,/\".*403.*\"/) too.
awk 'match($0,/\".*443.*\"/){print substr($0,RSTART,RLENGTH)}' Input_file
IMHO advantage of this approach will be you need NOT to hard code any field number in your awk. 1 more thing I have assumed that your Input_file will have "......403....." kind of section only once and you want to print that only.
1 more additional awk where I am assuming you may have multiple occurrences of "..." so picking only that one where 403|443 is coming.
awk 'match($0,/\".*443[^"]*/){print substr($0,RSTART,RLENGTH+1)}' Input_file
EDIT: Or if your Input_file has "...443..." one time or this text is coming first after starting of line(assuming if other occurrences of ".." will come later) then you could try following.
awk -F'"' '/443/{print $2}' Input_file

newer version gawk has a built-in variable FPAT which you can use to define fields by a regex pattern. For your logs, if no other quoted fields before the field 9 and 13:
awk -v FPAT='[^[:space:]]+|"[^"]*"' '$9 == "403"{print $13}' log_file
REF: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

if $column_A="" equals to delete $column_A in awk?

I wish to delete one column of my data in awk but what I found is using command like $column_A="". Is column_A really deleted in this way?
For example, I wish to delete the second column and I found a solution: awk 'BEGIN{FS="\t";OFS="\t"}!($2="")' which print the result like: $1^0^0$3. It seems that it is the content of the second column is deleted but the second column.

after reading dev-null's comment, I got idea what are you asking...
My answer is: it depends on how do you define "a column is deleted".
see this example:
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{$2="";print}'
foo,,blah
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{print $1,$3}'
foo,blah
You see the difference? If you set the $x="" The column is still there, but it becomes an empty string. So the FS before and after stay. If this is what you wanted, it is fine. Otherwise just skip outputing the target column, like the 2nd example shows.

I would use cut for that:
cut -d$'\t' -f1,3- file
-f1,3- selects the first field, skips field 2 and then selects fields 3 to end.

Delete lines which contain a number smaller/larger than a user specified value

I need to delete lines in a large file which contain a value larger than a user specified number(see picture). For example I'd like to get rid of lines with values larger than 5e-48 (x>5e-48), i. e. lines with 7e-46, 7e-40, 1e-36,.... should be deleted.
Can sed, grep, awk or any other command do that?
Thank you
Markus

With awk:
awk '$3 <= 5e-48' filename
This selects only those lines whose third field is smaller than 5e-48.
If fields can contain spaces (since the data appears to be tab-separated) use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than any whitespace. It does not appear to be necessary with the shown input data, but it is good practice to be defensive about these things (thanks to #tripleee for pointing this out).

In Perl, for example, the solution can be
perl -ane'print unless$F[2]>5e-48'

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Finding sequence in data - awk

Could you please try following. awk 'match($0,/ATG.TAA|ATG.TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file

Related

awk - store first occurrence based on cell

Awk - Grep - Match the exact string in a file

awk - How to extract quoted string in space delimited log file

if $column_A="" equals to delete $column_A in awk?

Delete lines which contain a number smaller/larger than a user specified value

Categories

Resources

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Finding sequence in data - awk

Could you please try following. awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file

Related

awk - store first occurrence based on cell

Awk - Grep - Match the exact string in a file

awk - How to extract quoted string in space delimited log file

if $column_A="" equals to delete $column_A in awk?

Delete lines which contain a number smaller/larger than a user specified value

Categories

Resources

Could you please try following. awk 'match($0,/ATG.TAA|ATG.TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file