Parse strings within quotations - awk

I have a log file that includes lines with the pattern as below. I want to extract the two strings within the quotations and write them to another file, each one in a separate column. (Not all lines have this pattern, but these specific lines come sequentially.)
Input
(multiple lines of header)
Of these, 0 are new, while 1723332 are present in the base dataset.
Warning: Variants 'Variant47911' and 'Variant47910' have the same position.
Warning: Variants 'exm2254099' and 'exm12471' have the same position.
Warning: Variants 'newrs140234726' and 'exm15862' have the same position.
Desired output:
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
This retrieves the lines, but I do not know how to specify the strings that need to be printed.
awk '/Warning: Variants '*'/ Input

Using the single quote as a field delimiter should get you most of the way there, and then you have to have a way to uniquely identify the lines you want to match. Below works for the sample you gave, but might have to be tweaked depending on the lines from the file that we're not seeing.
$ awk -v q="'" 'BEGIN {FS=q; OFS="\t"} /Warning: Variants/ && NF==5 {print $2, $4}' file
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
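To see why the NF==5 test and the $2, $4 fields work, it can help to print every field awk produces for one of the sample lines. This is only an illustrative sketch; the helper name show_fields is made up for this example and is not part of the answer above.

```shell
# Print each field awk sees when the single quote is the field separator.
show_fields() {
  awk -v q="'" 'BEGIN { FS = q } { for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

line="Warning: Variants 'exm2254099' and 'exm12471' have the same position."
fields=$(printf '%s\n' "$line" | show_fields)
printf '%s\n' "$fields"
# 1=[Warning: Variants ]
# 2=[exm2254099]
# 3=[ and ]
# 4=[exm12471]
# 5=[ have the same position.]
```

A line with exactly two quoted strings always splits into five fields, which is what NF==5 relies on.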

This might work for you (GNU sed):
sed -En "/Variant/{s/[^']*'([^']*)'[^']*/\1\t/g;T;s/.$//p}" file
For all lines that contain Variant, remove everything except the text between single quotes and tab separate the results.


Remove pattern and everything before using AWK in fasta file

I searched a lot but could not find a solution to my issue. I have a file that looks like:
>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA
I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.
The desired output should look like:
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
I tried the suggested approaches but could not adapt them to my data.
Thank you in Advance for your help.
If it's not possible for your headers to include GGTT, I suppose the easiest would be:
$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
If your headers might contain GGTT, then awk would probably be better:
$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.
In the awk version, the pattern !/^>/ makes sure that the substitution is only done on lines that do not start with >.
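The difference between the two commands only shows up when a header itself contains GGTT. The following sketch uses a made-up header (>GGTT_sample, not from the question) together with the third sequence from the question to compare them:

```shell
# Hypothetical input: the header contains GGTT, the sequence contains it twice.
sample=$(printf '>GGTT_sample\nCCAGGTTTCGAGGGGTTACGGGGTA\n')

sed_out=$(printf '%s\n' "$sample" | sed 's/.*GGTT//')
awk_out=$(printf '%s\n' "$sample" | awk '!/^>/ { sub(/.*GGTT/, "") } 1')

printf 'sed:\n%s\n' "$sed_out"   # header is damaged: _sample / ACGGGGTA
printf 'awk:\n%s\n' "$awk_out"   # header is kept:    >GGTT_sample / ACGGGGTA
```

In both outputs the sequence line is cut after the last GGTT, which is the greedy behaviour described above.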
Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.
Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.
Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.
# Create the input for testing:
cat > in.fa <<EOF
>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT
GGTT
ACGT
>seq3 without blanks or newlines
ACGTGGTTACGT
>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT
>seq5 lowercase
acgtggttacgt
EOF
# Reformat to single-line fasta, then delete subsequences:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa
Output in file out.fa:
>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output
>seq5 lowercase
acgt
The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
next if /^>/; : Skip the header lines.
s/.*GGTT//i; : Replace everything (.*) up to and including the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.
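For readers who prefer to stay in awk, the same two steps (unwrap, then delete through the last GGTT) can be sketched in a single program. This is an alternative written for illustration, not the answer's own method; the function names unwrap_and_trim and emit are invented here, and the sketch assumes every sequence is preceded by a > header line.

```shell
# Unwrap each FASTA record to one line, then drop everything through the last GGTT.
unwrap_and_trim() {
  awk '
    function emit() {
      if (!have_header) return
      # Greedy .* ends the match at the last GGTT; toupper() makes it case-insensitive.
      if (match(toupper(seq), /.*GGTT/)) seq = substr(seq, RLENGTH + 1)
      print seq
    }
    /^>/ { emit(); print; seq = ""; have_header = 1; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }   # strip blanks and join wrapped lines
    END { emit() }
  ' "$@"
}

trimmed=$(printf '>seq2\nACGT\nGGTT\nACGT\n>seq5\nacgtggttacgt\n' | unwrap_and_trim)
printf '%s\n' "$trimmed"
# >seq2
# ACGT
# >seq5
# acgt
```

As with the Perl version, a sequence that ends in GGTT yields a header followed by an empty line.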
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
Remove line breaks in a FASTA file

Finding sequence in data

I want to use awk to find sequences matching a pattern in DNA data, but I cannot figure out how to do it. I have a text file "test.txt" which contains a lot of data, and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG and print it.
For instance, if my text file has data that looks like the lines below, I want to find all the existing sequences and output them as shown.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I tried something like this, but it only displays the rows that start with the given pattern. It doesn't actually fix my problem.
awk '/^AGT/{print $0}' test.txt
Assuming the records do not span multiple lines:
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
This is a non-greedy match, which requires the -P switch (it stops at the first stop codon after each ATG instead of taking the longest match).
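Could you please try the following. Note that awk regexes are greedy and match() reports one match per line, so this prints the longest ATG...stop span on each line rather than every shortest one.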
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file
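Since awk has no non-greedy quantifier, the shortest-match behaviour of the grep -oP answer can be approximated with a loop: find the next ATG, then the first stop codon after it, print that span, and continue after it. The sketch below is one possible way to do this; the function name find_orfs and the test string are invented for illustration, and like the grep answer it ignores reading frames and assumes records do not span lines.

```shell
# Print every shortest ATG...(TAA|TAG|TGA) span, scanning each line left to right.
find_orfs() {
  awk '{
    s = $0
    while ((start = index(s, "ATG")) > 0) {
      s = substr(s, start)                      # s now begins with ATG
      rest = substr(s, 4)                       # text after the ATG
      if (!match(rest, /TAA|TAG|TGA/)) break    # no stop codon left on this line
      len = 3 + (RSTART - 1) + 3                # ATG + gap + stop codon
      print substr(s, 1, len)
      s = substr(s, len + 1)                    # resume after the reported span
    }
  }' "$@"
}

orfs=$(printf 'CCATGAAATAGGGATGCCTGACC\n' | find_orfs)
printf '%s\n' "$orfs"
# ATGAAATAG
# ATGCCTGA
```

It can be run on a file as find_orfs test.txt.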

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited with two "), it should not replace the ; inside it.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is on the order of tens of GB (max 99 GB), so performance is important.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
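Where GNU awk is not available, a commonly used portable sketch is to split each line on the double quote instead: the odd-numbered fields then lie outside quotes and the even-numbered ones inside. The function name semicolons_to_commas is made up for this example, and the approach assumes quotes are balanced on every line, with no escaped quotes and no newlines inside quoted fields.

```shell
# Replace ; with , only outside double-quoted fields (POSIX awk, no FPAT needed).
semicolons_to_commas() {
  awk -F'"' -v OFS='"' '{
    # Odd-numbered fields are the parts of the line outside the quotes.
    for (i = 1; i <= NF; i += 2) gsub(/;/, ",", $i)
    print
  }' "$@"
}

converted=$(printf 'A;B;C;D\n1;"5cc0;16a4";xpto;\n' | semicolons_to_commas)
printf '%s\n' "$converted"
# A,B,C,D
# 1,"5cc0;16a4",xpto,
```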
If I understand your requirements correctly, one option would be a three-pass approach.
From your comment about hex, I'll assume that a character like # never appears in the input, so you can do (using GNU sed):
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea being to replace the ; when within quotes by something else and write it to a new file, then replace all ; by , and then set back the ; in place within the same file (-i flag of sed).
The three passes can be combined into a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there are probably plenty of CSV parsers which already handle quoted fields and which you could use in the final use case, as I bet this is just an intermediate step for something else later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as replacement separator as there can't be a newline in the text considered line by line.
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

How to extract the first column from a tsv file?

I have a file containing some data and I want to use only the first column as a stdin for my script, but I'm having trouble extracting it.
I tried using this
awk -F"\t" '{print $1}' inputs.tsv
but it only shows the first letter of the first column. I tried some other things but it either shows the entire file or just the first letter of the first column.
My file looks something like this:
Harry_Potter 1
Lord_of_the_rings 10
Shameless 23
....
You can use cut which is available on all Unix and Linux systems:
cut -f1 inputs.tsv
You don't need to specify the -d option because tab is the default delimiter. From man cut:
-d delim
Use delim as the field delimiter character instead of the tab character.
As Benjamin has rightly stated, your awk command is indeed correct. Shell passes literal \t as the argument and awk does interpret it as a tab, while other commands like cut may not.
Not sure why you are getting just the first character as the output.
You may want to take a look at this post:
Difference between single and double quotes in Bash
Try this (better rely on a real csv parser...):
csvcut -c 1 -f $'\t' file
Check csvkit
Output :
Harry_Potter
Lord_of_the_rings
Shameless
Note :
As @RomanPerekhrest said, you should fix your broken sample input (we saw spaces where tabs are expected...)
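Since the posted sample seems to contain spaces rather than tabs, one cautious sketch is to check for a tab first and only then split on it. The function name first_column is hypothetical; the fallback assumes that, in a file without tabs, the first column itself contains no spaces.

```shell
# Print the first column, splitting on tabs if the file has any, else on whitespace.
first_column() {
  tab=$(printf '\t')
  if grep -q "$tab" "$1"; then
    cut -f1 "$1"               # real TSV: tab is the default delimiter of cut
  else
    awk '{ print $1 }' "$1"    # no tabs found: fall back to whitespace splitting
  fi
}
```

Usage would look like first_column inputs.tsv | your_script, where your_script stands for whatever consumes the column.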

Delete lines which contain a number smaller/larger than a user specified value

I need to delete lines in a large file which contain a value larger than a user specified number(see picture). For example I'd like to get rid of lines with values larger than 5e-48 (x>5e-48), i. e. lines with 7e-46, 7e-40, 1e-36,.... should be deleted.
Can sed, grep, awk or any other command do that?
Thank you
Markus
With awk:
awk '$3 <= 5e-48' filename
This selects only those lines whose third field is less than or equal to 5e-48; all other lines are dropped.
If fields can contain spaces (since the data appears to be tab-separated) use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than any whitespace. It does not appear to be necessary with the shown input data, but it is good practice to be defensive about these things (thanks to @tripleee for pointing this out).
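Because the question asks for a user-specified value, the threshold can be passed in with -v instead of being hard-coded. The following is a small sketch under the same assumption as above (the value sits in the third tab-separated column); the function name drop_above and the variable max are invented for illustration.

```shell
# Keep only lines whose third tab-separated field is <= the given threshold.
drop_above() {
  limit=$1      # threshold, for example 5e-48
  shift         # remaining arguments are input files (stdin if none)
  # Adding 0 forces a numeric comparison on both sides.
  awk -F '\t' -v max="$limit" '$3 + 0 <= max + 0' "$@"
}

kept=$(printf 'a\tb\t7e-46\nc\td\t1e-50\ne\tf\t5e-48\n' | drop_above 5e-48)
printf '%s\n' "$kept"   # keeps the 1e-50 and 5e-48 lines, drops 7e-46
```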
In Perl, for example, the solution can be
perl -ane 'print unless $F[2] > 5e-48' filename