Mark duplicate headers in a fasta file - awk

I have a big FASTA file which I want to modify. It basically consists of many sequences with headers that start with ">". My problem is that some of the headers are not unique, even though the sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big FASTA file and append numbers to the duplicates, so that I know which duplicate it is (1, 2, 3, ..., x). When a duplicate of a different header is found, the counter should restart from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
Thanks in advance for the help.

An awk script:
BEGIN {
    OFS = "\n";     # rebuilt records get one field (line) per output line
    ORS = RS = ">"; # treat each FASTA entry, ">" to ">", as one record
}
{
    name = $1;                  # the first field is the header name
    $1 = "";                    # clear it; $0 is rebuilt with OFS between fields
    suffix = names[name] ? "-" names[name] : "";
    print name suffix $0, "\n"; # header (with suffix if duplicate), then sequence
    names[name]++;              # remember how often this header has been seen
}
The above uses ">" as the record separator and checks the first field (the header name, which may be duplicated). For each record it prints, it appends a suffix to the header name for each additional time the name appears (i.e. '-1' for the first duplicate, '-2' for the second, ...).
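Since you mention wanting to run this on all files in a directory: a minimal shell sketch (the file dedup.awk holding the script above and the .fasta extension are assumptions):
for f in *.fasta; do
    awk -f dedup.awk "$f" > "${f%.fasta}.dedup.fasta"
done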

Related

replacing fasta headers gives mismatch

This is probably a simple issue, but I cannot seem to solve it.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but the headers were adjusted in the process. I want to restore the original headers, since they include essential information.
I selected the headers from the subset (subset.fasta) using grep, and used these to match and extract the headers from the original file, giving correct.headers. There is the same number of headers, in the same order, so this should be OK.
I found the code below, which according to its description should do what I want. I've only started learning awk, though, so I don't really understand what it does.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
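For reference, here is the same one-liner spread out with comments (no change in behavior):
awk '
NR == FNR { o[n++] = $0; next }    # first file: store every line in o[]
/^>/ && i < n { $0 = ">" o[i++] }  # second file: replace each header with ">" plus the next stored entry
1                                  # print every line, modified or not
' correct.headers subset.fasta > subset.correct.fasta
Note that it prepends ">" to each stored line, so the lines in correct.headers must not already carry the ">" prefix, or the output headers will start with ">>".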
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
I then used this file to replace the headers in the subset with the original ones using the awk line above, but this gives a mismatch at a certain point and adds additional lines.
Result:
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!
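One way to narrow this down (a sanity check, not a confirmed diagnosis): extra output lines suggest the two files do not actually contain the same number of headers, e.g. because a short ID grep-matched more than one header in the original file. Comparing the counts is cheap:
grep -c '^>' subset.fasta    # headers in the subset
wc -l < correct.headers      # replacement headers
If these differ, the mismatch comes from the extraction step, not from the awk line.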

awk search for a paragraph by a word from a file using shell variables

I am trying to filter a file by a name, and previously asked how to use a shell variable in awk:
How to force awk command search for a previously given word
Using the recommendations given there, I now get only the given name, not the whole paragraph. My question is: how can I get the paragraphs which contain the given name? Paragraphs are separated by \n\n.
awk '{print var}' var="$NAME" RS="\n\n" ORS="\n\n" literatur.bib
It worked perfectly this way, but I had to write the name directly into the awk program, which doesn't match the task:
awk '/name/' RS="\n\n" ORS="\n\n" literatur.bib
The input (the name given) would be, for example: Newton
The output should look like this, but with all the paragraphs that contain the name Newton:
@Book{Eilam:2005:reversing,
  author = {Eldad Newton},
  title = {Reversing: Secrets of Reverse Engineering},
  publisher = {Wiley},
  year = 2005 }

@Book{Glatz:2006:Betriebssysteme,
  author = {Eduard Newton},
  title = "{Betriebssysteme}",
  publisher = {dpunkt},
  year = 2006 }
The input file looks exactly like this example, but consists of hundreds of similar paragraphs.
When I search for example for the name Baruah, I get this:
Baruah
Baruah
Baruah
and so on, many more times. It just prints the name without actually searching: even if I give a word that does not occur in the file at all, I still get the same result, that word printed many times.
SOLVED:
Just by using double quotes instead of single quotes, so that the shell expands the variable before awk sees it:
awk "/$NAME/" RS="\n\n" ORS="\n\n" literatur.bib

AIX: remove the last symbols (CRLF) from a file

There is a large file whose last characters are \r\n. I need to remove them. It seems to be equivalent to removing the last line(?).
UPD: no, it's not: the file has only one line, which ends with \r\n.
I know two ways, but neither works on AIX:
sed 's/\r\n$//' file # I don't know why this doesn't work
head -c-2 file # AIX head doesn't work with negative numbers
Is there any solution for AIX? A lot of large files must be processed, so performance is important.
Usually, if I need to edit a file in place from a script, I use ed, for historical reasons. For example:
ed - /tmp/foo.txt <<EOF
g/^$/d
w
q
EOF
ed is more than a bit cantankerous. Note also that the script above does not remove just the empty lines at the bottom of the file, but all of the empty lines. With ed and some practice you can probably delete only the empty lines at the bottom: go to the bottom of the file, search backwards for a non-empty line, then delete from the line after it to the end of the file. ed command scripts act (pretty much) as you would expect.
Also, if the lines really do end with \r\n, then they are not going to be considered empty lines, but rather lines containing a control-M (\r). You may need to adjust your pattern in that case.
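For what it's worth, a sketch of that approach (assuming the file really does end with at least one empty line; otherwise the backward search lands on the wrong line and this deletes the last real line):
ed - /tmp/foo.txt <<'EOF'
?.?+1,$d
w
q
EOF
The backward search ?.? finds the last line containing at least one character, and +1,$d deletes everything after it.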
My answer https://stackoverflow.com/a/46083912/3220113 to the duplicate question should work here too. Another solution is using
awk '(NR > 1) { print s }   # print each saved line once the next line has arrived
{ s = $0 }                  # remember the current line
# at end of input, s holds the last line, still ending in \r (awk consumed
# the final \n as the record separator): print it minus the \r, with no newline
END { printf("%s", substr(s, 1, length(s) - 1)) }
' inputfile

Simplest way to find text by regex and replace by lookup table

A legacy web application needs to be internationalized. Error messages are currently written inside source code in this way:
addErrorMessage("some text here");
These strings can easily be found and extracted using a regex. They should be replaced with something like this:
addErrorMessage(ResourceBundle.getBundle("/Bundle", lcale).getString("key for text here"));
The correspondence between key for text here and some text here will be kept in a .properties file.
According to some Linux guru it can be achieved using awk, but I don't know anything about awk. I could write a small application to do the task, but that might be overkill. Are there IDE plugins or existing applications for this?
awk -v TextOrg='some text here' -v key='key for text here' '
index($0, "addErrorMessage(\"" TextOrg "\")") {
    # the parentheses are escaped so the dynamic regex matches them literally
    gsub( "addErrorMessage\\(\"" TextOrg "\"\\)", \
          "addErrorMessage(ResourceBundle.getBundle(\"/Bundle\", lcale).getString(\"" key "\"))" )
}
1
' YourFile
This is one way to do it for one specific pair. Be careful with:
the value assignments (-v ...): the values are subject to shell interpretation here
gsub uses a regex to find the text, so the text has to be escaped accordingly (ex: "this f***ing text" -> "this f\*\*\*ing text")
You will certainly want to do this for several pairs. Here is a version driven by a file containing the pairs,
assuming that Trad.txt is a file containing a series of 2-line entries: first the original text, then the key (this avoids needing a separator character, which could require complex escape-sequence handling if it occurred in the text).
Example Trad.txt:
some text
key text
other text
other key
Sample code (simple, no exhaustive error handling, ...). Not tested, but it shows the concept in awk:
awk '
# first file only
FNR == NR {
    # odd lines hold the text to change, even lines hold the matching key
    if ( FNR % 2 ) TextOrg = $0
    else {
        # index the key (and the length of the full call) by the text to change
        Key[TextOrg] = $0
        Len[TextOrg] = length( "addErrorMessage(\"" TextOrg "\")" )
    }
    # do not go further in the script for this line
    next
}
# this point and below is reached only by the second file
# if an addErrorMessage("...") call is found
/addErrorMessage\(".*"\)/ {
    # try every pair (a more complex loop checking only the necessary
    # replacements would perform better, but this one does the job)
    for ( TextOrg in Key ) {
        # index() avoids regex interpretation; it returns the position of the match
        # (assuming one replacement per line; otherwise a loop is needed)
        Here = index( $0, "addErrorMessage(\"" TextOrg "\")" )
        if ( Here > 0 ) {
            # got a match: rebuild the full line around the replacement
            $0 = substr( $0, 1, Here - 1 ) \
                 "addErrorMessage(ResourceBundle.getBundle(\"/Bundle\", lcale).getString(\"" Key[TextOrg] "\"))" \
                 substr( $0, Here + Len[TextOrg] )
        }
    }
}
# print the line in its current state (modified or not)
1
' Trad.txt YourFile
Finally, this is a workaround solution, because lots of special cases can occur: a line like ref: function addErrorMessage(" ...") bla bla would be an issue, spaces inside the parentheses are not handled, nor are calls split across several lines, etc.

Reorganizing named fields with AWK

I have to deal with various input files with a number of fields, arbitrarily arranged, but all consistently named and labelled with a header line. These files need to be reformatted such that all the desired fields are in a particular order, with irrelevant fields stripped and missing fields accounted for. I was hoping to use AWK to handle this, since it has done me so well when dealing with field-related dilemmata in the past.
After a bit of mucking around, I ended up with something much like the following (writing from memory, untested):
# imagine a perfectly-functional BEGIN {} block here
NR==1 {
    fldname[1] = "first_name"
    fldname[2] = "last_name"
    fldname[3] = "middle_name"
    maxflds = 3
    # this is just a sample -- my real script went through forty-odd fields
    for (i=1;i<=NF;i++) for (j=1;j<=maxflds;j++) if ($i == fldname[j]) fldpos[j]=i
}
NR!=1 {
    for (j=1;j<=maxflds;j++) {
        if (fldpos[j]) printf "%s",$fldpos[j]
        printf "%s","\t"
    }
    print ""
}
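For concreteness, a hypothetical run (reorder.awk is a made-up name for the script above, and the imagined BEGIN block is assumed to set FS to tab):
$ cat input.tsv
last_name  first_name  middle_name
Smith      John        Q
$ awk -f reorder.awk input.tsv
John  Smith  Q
(Columns are tab-separated; each output line also ends with a trailing tab, since the script prints the separator after every field.)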
Now this solution works fine. I run it, I get my output exactly how I want it. No complaints there. However, for anything longer than three fields or so (such as the forty-odd fields I had to work with), it's a lot of painfully redundant code, which has always bothered me and always will. And the thought of having to insert a new field somewhere into that mess makes me shudder.
I die a little inside each time I look at it.
I'm sure there must be a more elegant solution out there. Or, if not, perhaps there is a tool better suited for this sort of task. AWK is awesome in its own domain, but I fear I may be stretching its limits some with this.
Any insight?
The only suggestion that I can think of is to move the initial array setup into the BEGIN block and read the ordered field names from a separate template file in a loop. Then your awk program consists only of loops with no embedded data. Your external template file would be a simple newline-separated list.
BEGIN {while ((getline < "fieldfile") > 0) fldname[++maxflds] = $0}
You would still read the header line in the same way you are now, of course. However, it occurs to me that you could use an associative array and reduce the nested for loops to a single for loop. Something like (untested):
BEGIN {while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds}
NR==1 {
    # invert the lookup: for each header column, record which input
    # position holds that desired output field
    for (i=1;i<=NF;i++) if ($i in fldname) fldpos[fldname[$i]] = i
}
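Putting the pieces together, a minimal sketch of the whole program under those assumptions (field list in a file named fieldfile, tab-separated output as before):
BEGIN { while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds }
NR==1 {
    for (i=1;i<=NF;i++) if ($i in fldname) fldpos[fldname[$i]] = i
    next
}
{
    for (j=1;j<=maxflds;j++) {
        if (fldpos[j]) printf "%s", $fldpos[j]
        printf "%s", (j < maxflds ? "\t" : "\n")
    }
}
Missing fields still get their separator, so every output row has the same number of columns.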