Delete lines which contain a number smaller/larger than a user specified value - awk

I need to delete lines in a large file which contain a value larger than a user-specified number. For example, I'd like to get rid of lines with values larger than 5e-48 (x > 5e-48); i.e. lines with 7e-46, 7e-40, 1e-36, and so on should be deleted.
Can sed, grep, awk or any other command do that?

With awk:
awk '$3 <= 5e-48' filename
This selects only those lines whose third field is less than or equal to 5e-48.
If fields can contain spaces (the data appears to be tab-separated), use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than at any whitespace. It does not appear to be necessary with the shown input data, but it is good practice to be defensive about these things (thanks to @tripleee for pointing this out).

In Perl, for example, the solution can be
perl -ane 'print unless $F[2] > 5e-48' filename
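As a sanity check, the awk filter can be exercised on a small fabricated file (the gene names and the exact three-column layout here are invented for illustration; the question's real data is tab-separated):

```shell
# Fabricated sample data; real files from the question use tabs.
cat > sample.txt <<'EOF'
gene1 descA 3e-50
gene2 descB 7e-46
gene3 descC 5e-48
gene4 descD 1e-36
EOF

# Keep only lines whose third field is <= 5e-48.
awk '$3 <= 5e-48' sample.txt
# gene1 descA 3e-50
# gene3 descC 5e-48
```

Note that awk compares the fields numerically here, so scientific notation like 7e-46 is handled correctly without any conversion step.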

Related

Parse strings within quotations

I have a log file that includes lines with the pattern as below. I want to extract the two strings within the quotations and write them to another file, each one in a separate column. (Not all lines have this pattern, but these specific lines come sequentially.)
Input
(multiple lines of header)
Of these, 0 are new, while 1723332 are present in the base dataset.
Warning: Variants 'Variant47911' and 'Variant47910' have the same position.
Warning: Variants 'exm2254099' and 'exm12471' have the same position.
Warning: Variants 'newrs140234726' and 'exm15862' have the same position.
Desired output:
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
This retrieves the lines, but I do not know how to specify the strings that need to be printed.
awk '/Warning: Variants '*'/ Input
Using the single quote as a field delimiter should get you most of the way there, and then you have to have a way to uniquely identify the lines you want to match. Below works for the sample you gave, but might have to be tweaked depending on the lines from the file that we're not seeing.
$ awk -v q="'" 'BEGIN {FS=q; OFS="\t"} /Warning: Variants/ && NF==5 {print $2, $4}' file
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
This might work for you (GNU sed):
sed -En "/Variant/{s/[^']*'([^']*)'[^']*/\1\t/g;T;s/.$//p}" file
For all lines that contain Variant, remove everything except the text between single quotes and tab separate the results.
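The quote-delimiter awk approach above can be checked against a small sample rebuilt from the question (the first line stands in for the header content we are not seeing):

```shell
cat > log.txt <<'EOF'
Of these, 0 are new, while 1723332 are present in the base dataset.
Warning: Variants 'Variant47911' and 'Variant47910' have the same position.
Warning: Variants 'exm2254099' and 'exm12471' have the same position.
EOF

# Split on single quotes; the warning lines have exactly 5 fields,
# with the variant names landing in fields 2 and 4.
awk -v q="'" 'BEGIN {FS=q; OFS="\t"} /Warning: Variants/ && NF==5 {print $2, $4}' log.txt
```

This prints the two variant names per warning line, tab-separated, and silently skips everything else.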

Finding sequence in data

I want to use awk to find sequences matching a pattern in DNA data, but I cannot figure out how to do it. I have a text file "test.txt" which contains a lot of data, and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG, and print them.
For instance, if my text file has the data shown below, I want to find and match all the existing sequences and produce the output below.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I tried something like this, but it only displays the rows that start with ATG; it doesn't actually solve my problem:
awk '/^AGT/{print $0}' test.txt
Assuming the records do not span multiple lines:
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
This is a non-greedy match and requires the -P (Perl-compatible regex) switch, so that the shortest match is found rather than the longest.
Could you please try the following:
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file
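One caveat worth knowing: the POSIX regex used by awk's match() is greedy, so its results can differ from the grep -oP answer above. On a line with several stop codons it runs to the last one, while the non-greedy PCRE stops at the first. A fabricated one-line sequence (invented for illustration) makes the difference visible:

```shell
printf 'AAATGCCCTAAGGGTGA\n' > dna.txt

# Non-greedy PCRE: stops at the first terminator after ATG.
grep -oP 'ATG.*?T(AA|AG|GA)' dna.txt
# ATGCCCTAA

# Greedy POSIX match: runs to the last terminator on the line.
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' dna.txt
# ATGCCCTAAGGGTGA
```

Which behaviour is correct depends on the biology you want: the grep version finds the shortest open reading frame starting at each ATG, the awk version the longest span.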

Find an exact match from a patterns file for another file using awk (patterns contain regex symbols to be ignored)

I have a file which has the following patterns.
NO_MATCH
NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH||NO_MATCH
These should be matched exactly with the 5th column of the target csv. I have tried:
awk 'NR==FNR{a[$0]=$0; next;} NR>FNR{if($5==a[$0])print $0}' pattern.csv input.csv > final_out.csv
But the || in the patterns file result in bad matches. The 5th column in the target csv looks something like this:
"AAAA||AAAA"
"BBBB||BBBB"
"NO_MATCH"
"NO_MATCH||NO_MATCH||NO_MATCH"
"NO_MATCH||BBBB"
I need to extract the 3rd and 4th lines.
Edit: I need exact match such as line 3 & 4. Hope this clears up the issue. The columns in the csv are double quoted as shown, and the quotes around fifth column should be removed.
awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' pattern.csv input.csv > final_out.csv
Keep pattern.csv's contents in an array, enclosing each line in quotes. For each line of input.csv, if the fifth column exists in the array, remove the quotes around it and print the line.
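A minimal reproduction of this approach, with the first four CSV columns invented as placeholders (the question only shows the fifth column):

```shell
cat > pattern.csv <<'EOF'
NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH
EOF

cat > input.csv <<'EOF'
1,a,b,c,"AAAA||AAAA"
2,a,b,c,"NO_MATCH"
3,a,b,c,"NO_MATCH||NO_MATCH||NO_MATCH"
4,a,b,c,"NO_MATCH||BBBB"
EOF

# Store each pattern wrapped in double quotes, then test the fifth
# column with `in` — a literal string lookup, so || needs no escaping.
awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' pattern.csv input.csv
# 2,a,b,c,NO_MATCH
# 3,a,b,c,NO_MATCH||NO_MATCH||NO_MATCH
```

Because `in` is an exact array-key lookup rather than a regex match, the | metacharacters in the pattern file cause no false positives.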

Is there a way to replace all occurrences of certain characters, but only on every nth line?

I am trying to replace all characters that are not C, T, A or G with an N in the sequence part of a fasta file - i.e. every 2nd line
I think some combination of awk and tr is what I would need...
To print every other line:
awk '{if (NR % 2 == 0) print $0}' myfile
To replace these characters with an N
tr YRHIQ- N
...but I don't know how to combine them so that the character replacement is only on every 2nd line but it prints every line
this is the sort of thing I have
>SEQUENCE_1
AGCYGTQA-TGCTG
>SEQUENCE_2
AGGYGTQA-TGCTC
and I want it to look like this:
>SEQUENCE_1
AGCNGTNANTGCTG
>SEQUENCE_2
AGGNGTNANTGCTC
but not like this:
>SENUENCE_1
AGCNGTNANTGCTG
>SENUENCE_2
AGGNGTNANTGCTC
The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
To answer the OP's question: if you just want to process every second line, you want to do:
awk '!(NR%2){gsub(/[^CTAG]/, "N")}1' file.fasta
This method will, however, fail on any of the following cases:
fasta file with a multi-line sequence
multi-fasta file with a possible blank-line between subsequent sequences
A better way would be to exclude the header line and process all other lines:
awk '!/^>/{gsub(/[^CTAG]/, "N")}1' file.fasta
Thanks to #kvantour's explanation on fasta files, here is another sed solution which suits your task better than the old one:
sed '/^>/! s/[^ACTG]/N/g' file.fasta
/^>/!: do the following if this line doesn't begin with a >,
s/[^ACTG]/N/g: replace every character but ACTG with N.
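Running the sed command on the sample from the question confirms that the header lines are left untouched:

```shell
cat > file.fasta <<'EOF'
>SEQUENCE_1
AGCYGTQA-TGCTG
>SEQUENCE_2
AGGYGTQA-TGCTC
EOF

# Substitute only on lines that do NOT start with '>'.
sed '/^>/! s/[^ACTG]/N/g' file.fasta
# >SEQUENCE_1
# AGCNGTNANTGCTG
# >SEQUENCE_2
# AGGNGTNANTGCTC
```

Because the address tests each line for a leading >, this also handles multi-line and blank-line-separated sequences, unlike the every-second-line approach.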
Here is one solution with awk
awk 'NR%2 ==0{gsub(/[^CTAG]/, "N")}1' file
the result
>SEQUENCE_1
AGCNGTNANTGCTG
>SEQUENCE_2
AGGNGTNANTGCTC
Explanation
As OP wanted, the change is applied only to every even row, selected by
NR % 2 == 0
NR is the number of records (rows, here) read so far from the file,
and gsub(/[^CTAG]/, "N") replaces all the characters that are NOT 'C', 'T', 'A' or 'G' with 'N';
in [^CTAG], the ^ is the negation.
An awk program follows the
expression action
format; here the expression is NR % 2 == 0 and the action is replacing the non-CTAG characters with N using gsub.

Getting numerical sub-string of fields using awk

I was wondering how I can get the numerical sub-string of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the lengths of the fields are not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from text part with a single dot (.).
You may try something like this:
$ echo "bcdujcd.2"|awk -F'[^0-9]*' '$0=$2'
If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (there's a leading space on each line for the record).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub, and the leading space will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
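The difference between the two replacement strings can be seen on the hypothetical "foo.88.bar.11" line:

```shell
# Replacement " " keeps the two numbers apart (note: the quotes at both
# ends of the line also become spaces, so the result has surrounding spaces).
echo '"foo.88.bar.11"' | awk '{gsub(/[^[:digit:]]+/, " ")}7'

# Replacement "" deletes all non-digit runs, gluing the numbers together.
echo '"foo.88.bar.11"' | awk '{gsub(/[^[:digit:]]+/, "")}7'
# 8811
```

(The trailing 7 is just an always-true pattern, an idiomatic shorthand for {print}.)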
If you just need the second (period delimited) field of each line of that sort then awk -F. '{print $2}' will do that.
$ awk -F'[".]' '{print $3}' file
1
2
3333
777