replacing fasta headers gives mismatch - awk

This is probably a simple issue, but I can't seem to solve it myself.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but the headers were adjusted in the process. I want to put the original headers back, since they include essential information.
I selected the headers from the subset (subset.fasta) using grep, and used these to match and extract the corresponding headers from the original file, giving 'correct.headers'. There are the same number of headers, in the same order, so this should be fine.
I found the code below, which should do what I want according to its description. I've only just started learning awk, though, so I can't really troubleshoot it.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
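Broken out with comments, the same one-liner reads like this (no change in behaviour intended):
awk '
    NR == FNR { o[n++] = $0; next }    # first file (correct.headers): store every line, in order
    /^>/ && i < n { $0 = ">" o[i++] }  # second file: on a header line, replace it with ">" plus the next stored line
    1                                  # print every line, changed or not
' correct.headers subset.fasta > subset.correct.fasta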
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
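It may be worth double-checking that assumption before the replacement step; a quick sanity check, using the file names from above:
grep -c '^>' subset.fasta    # number of headers in the subset
wc -l < correct.headers      # number of replacement headers; the two counts should be identical
If correct.headers has extra lines, for example because a short subset header matched more than one header in the original file during the grep step, every replacement after the first extra match will be shifted.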
I then used this file to replace the headers in the subset with the original ones using the awk line above, but this gives a mismatch at a certain point and adds additional lines.
Result:
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!

Related

awk replace string with another with new lines (one time) after finding another string

I wanted to replace ___SIGNATURE___ with an HTML signature after the first occurrence of "text/html", and to make only one such replacement. Any remaining ___SIGNATURE___ tags should simply be removed.
I am processing an email message whose header has a multipart boundary, so there are two body parts, one with text/plain and another with text/html, and the ___SIGNATURE___ tag exists in both.
So the relevant part of my script looks like this:
awk -v signature="$(cat $disclaimer_file)" '/text\/html/ {html=1;} html==1 && !swap {swap=sub(/___SIGNATURE___/, signature);} 1' in.$$ > temp.mail && mv temp.mail in.$$
sed -i "s/charset=us-ascii/charset=utf-8/1;s/___SIGNATURE___//" in.$$
It works, but is that an optimal solution?
I have used altermime before, but it was not a good solution for my case.
Without access to sample messages, it's hard to predict what exactly will work, and whether we need to properly parse the MIME structures or if we can just blindly treat the message as text.
In the latter case, refactoring to something like
awk 'NR==FNR { signature = signature ORS $0; next }
    { sub(/charset="?[Uu][Ss]-[Aa][Ss][Cc][Ii][Ii]"?/, "charset=\"utf-8\"") }
    /text\/html/ { html = 1 }
    /text\/plain/ { html = 0 }
    /___SIGNATURE___/ {
        if (html && signature) {
            # substr because there is an ORS before the text
            sub(/___SIGNATURE___/, substr(signature, 2))
            signature = ""
        } else
            sub(/___SIGNATURE___/, "")
    } 1' "$disclaimer_file" "in.$$"
would avoid invoking both Awk and sed (and cat, and the quite pesky temporary file), where just Awk can reasonably and quite comfortably do all the work.
If you need a proper MIME parser, I would look into writing a simple Python script. The email library in Python 3.6+ is quite easy to use and flexible (but avoid copy/pasting old code which uses raw MIMEMultipart etc; you want to use the (no longer very) new EmailMessage class).

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script and get the output file, the number I am interested in ends up in a different position (due to the log nature of the output file).
I have tried several awk, sed, grep commands but I can't get any to work as many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the final one (the last of the three numbers):
Energy initial, next-to-last, final = -5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
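For successive input lines, $(i++%3+6) cycles through fields 6, 7 and 8, i.e. the three numbers after the "=" sign. If only the final value is needed, an alternative sketch, assuming the three numbers sit on the same line as the label (as shown above) and that infile is the output file:
awk '/Energy initial, next-to-last, final =/ { print $NF }' infile
$NF is the last field of the matched line, which here is the final energy.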

Mark duplicate headers in a fasta file

I have a big FASTA file which I want to modify. It basically consists of many sequences with headers that start with ">". My problem is that some of the headers are not unique, even though the sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big FASTA file and append numbers to the duplicates, so that I know which duplicate it is (1,2,3,...,x). When a new duplicate is found (one with a different header), the counter should start again from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
THX in advance for the help.
An awk script:
BEGIN {
    OFS = "\n";
    ORS = RS = ">";
}
{
    name = $1;
    $1 = "";
    suffix = names[name] ? "-" names[name] : "";
    print name suffix $0, "\n";
    names[name]++;
}
The above uses ">" as the record separator and checks the first field (which is the header name that can be duplicated). For each record it prints, it adds a suffix after the header name for each additional time that name has appeared (i.e. '-1' for the first duplicate, '-2' for the second, and so on).
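To apply it to every FASTA file in a directory, a sketch, assuming the script above is saved as mark_dups.awk and the files use a .fasta extension (both names are placeholders):
for f in *.fasta; do
    awk -f mark_dups.awk "$f" > "${f%.fasta}.marked.fasta"   # e.g. sample.fasta -> sample.marked.fasta
done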

awk: How can I use awk to determine if lines in one file of my choosing (lines 8-12, for example) are also present anywhere in another file

I have two files, baseline.txt and results.txt. I need to be able to find whether lines in baseline.txt are also in results.txt, for example whether lines 8-12 are present in results.txt. I need to use awk. Thanks.
Assuming the files are sorted, it looks like comm is more of what you're looking for if you want lines that are present in both files:
comm -12 baseline.txt results.txt
The -12 argument suppresses lines that are unique to baseline.txt and results.txt, respectively, leaving you with only lines that are common to both files ("suppress lines unique to file 1, suppress lines unique to file 2").
If you are dead set on using awk, then perhaps this question can help you.
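If you do want awk, and specifically the lines 8-12 case from the question, a minimal sketch (assuming an exact whole-line match is what counts as "present"):
awk 'NR == FNR { if (FNR >= 8 && FNR <= 12) want[$0]; next }
     $0 in want' baseline.txt results.txt
This reads lines 8-12 of baseline.txt into an array and then prints every line of results.txt that matches one of them; unlike comm, it does not require sorted input.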

Use awk to print variables and static text

I have a text file containing two values per line separated by a space. Field 1 is a URI and field 2 is a full URL.
My account doesn't have enough reputation to post links so please ignore the space after http in each of the examples.
Here's an example:
/uri1 http ://www.example.com/uri1
/uri2 http ://www.example.com/uri2
/uri3/test http ://www.example.com/uri3/index.html
/uri4?id=123 http ://www.example.com/uri4/page1.html
I'd like to create a new file where field 1 is the same, but field 2 is modified to be a combination of field 2, some static text, and then field 1.
Below is an example of the desired output:
/uri1 http ://www.example.abc/uri1/#/www.test.abc/uri1
/uri2 http ://www.example.abc/uri2/#/www.test.abc/uri2
/uri3/test http ://www.example.abc/uri3/index.html/#/www.test.abc/uri3/test
/uri4?id=123 http ://www.example.abc/uri4/page1.html/#/www.test.abc/uri4?id=123
I thought awk would be the easiest way to do this, but it's not working like I expected.
Below is the command I'm trying to use, but it prints the fields over the top of each other, so the output looks jumbled together.
awk '{print $1, $2"/#/www.test.com"$1}' file.txt
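A common cause of fields appearing to print over the top of each other is Windows-style line endings: a trailing carriage return in the last field makes the terminal jump back to the start of the line before the rest is printed. If that is the case here, a sketch of the same command with the carriage return stripped first:
awk '{ sub(/\r$/, ""); print $1, $2 "/#/www.test.com" $1 }' file.txt
You can check whether the file has carriage returns with file file.txt or cat -v file.txt (look for ^M at the ends of lines).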