splitting a multiple FASTA file into separate files keeping their original names - awk

I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the name of the file- with no space. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!

Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?

I think this should work.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

Hope this perl script could help.
#!/usr/bin/perl
open (INFILE, "< your_input.fa")
or die "Can't open file";
while (<INFILE>) {
$line = $_;
chomp $line;
if ($line =~ /\>/) { #if has fasta >
close OUTFILE;
$new_file = substr($line,1);
$new_file .= ".fa";
open (OUTFILE, ">$new_file")
or die "Can't open: $new_file $!";
}
print OUTFILE "$line\n";
}
close OUTFILE;

The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a fasta file it is actually not desired inserting a new line character at its top. Therefore the answer of Pramod is more appropriate. Additionally, the ID can be defined more generally to match only the > character. Consequently, the complete line would be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to mess up your current directory with all the split files you can also output into a subdirectory (subdir):
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with header) a single record and changing awk's default record separator RS (usually a line break) to be the unique (one per record) > symbol used to define the start of a header. As we will want to use the header text as a file name, and as fasta headers cannot contain line-breaks, it is also convenient to reset awk's default field separaor FS (usually white space) to be line breaks.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record will be empty and therefore must be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is filtered to only process records beginning with record number (NR) 2. This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
with the record separator set to > each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as filenames. Note the record separator (>) is no longer part of any field and so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" is printed, followed by the entire record ($0) to the filename fnme just formed, using awk's > redirect:
print ">" $0 > fnme;
lastly, the file is closed to prevent awk exceeding the system limit for the number of open files allowed, if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
"To write numerous files, successively, in the same awk program. If the files aren’t closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html

Related

Print filenames & line number with number of fields greater than 'x'

I am running Ubuntu Linux. I am in need to print filenames & line numbers containing more than 7 columns. There are several hundred thousand files.
I am able to print the number of columns per file using awk. However the output I am after is something like
file1.csv-463 which is to suggest file1.csv has more than 7 records on line 463. I am using awk command awk -F"," '{print NF}' * to print the number of fields across all files.
Please could I request help?
If you have GNU awk with you, try following code then. This will simply check condition if NF is greater than 7 then it will print that particular file's file name along with line number and nextfile will take program to next Input_file which will save our time because we need not to read whole Input_file then.
awk -F',' 'NF>7{print FILENAME,FNR;nextfile}' *.csv
Above will print only very first match of condition to get/print all matched lines try following then:
awk -F',' 'NF>7{print FILENAME,FNR}' *.csv
This might work for you (GNU sed):
sed -Ens 's/\S+/&/8;T;F;=;p' *.csv | paste - - -
If there is no eighth column, break.
Output the file name F, the line number = and print the current line p.
Feed the output into a paste command which prints three lines as one.
N.B. The -s option resets the line numbers for each file, without it, it will number each line for the entire input.

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I still can not add any comment so I ask in here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it will result in three files:
chr22.fasta
chr1.fasta
chr14.fasta
As an example, in chr22.fasta:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand that how awk split the input by the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Could you please go through following and let me know if this helps you.
awk ' ##Starting awk program here.
/^>chr/{ ##Checking condition if a line starts from string chr then do following.
OUT=substr($0,2) ".fa" ##Create variable OUT whose value is substring of current line and starts from letter 2nd to till end. concatenating .fa to it too.
}
{
print >> OUT ##Printing current line(s) in file name whose value is variable OUT.
close(OUT) ##using close to close output file whose value if variable OUT value. Basically this is to avoid "TOO MANY FILES OPENED ERROR" error.
}' Input_File ##Mentioning Input_file name here.
You could take reference from man awk page for used functions of awk too as follows.
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer the the end of the file OUT
append the line $0 followed by ORS to the file OUT
close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should only close the file when you finished writing to it and not every time you write to it. Currently, if you have a fasta record of 100 lines, it will open, close the file 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}'
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.

Returning Nth line from multiple files

Given a folder with multiple .csv files, I want to return the Nth line from each file and write to a new file.
For a single file, I use
awk 'NR==5' file.csv
For multiple files, I figured
ls *.csv | xargs awk 'NR==5'
...however that only returns the 5th line from the first file in the list.
Thanks!
Could you please try following and let me know if this helps you(GNU awk should help I believe):
awk 'FNR==5{print;nextfile}' *.csv
In case you need to take output into a single out file then append > output_file at last of above command too.
Explanation:
FNR==5: Checking here if line number is 5th for current Input_file then do actions mentioned after it.
{print;}: print is awk out of the box keyword which prints the current line so it will print the 5th line only.
nextfile: as by name itself it is clear that nextfile will skip all the lines in current Input_file(since I have given *.csv in end so it means it will pass all csv files to awk 1 by 1) it will save time for us since we DO NOT want to read the entries Input_file we needed only 5th line and we got it.

AWK - splitting a multiple header file into separate files?

I have a file that is unusually structured and I am trying to split it into multiple files based on the first column ($1) being populated. Essentially it is a header for each 'section' of the file.
The column doesn't have any pattern either so hence random_text_1 and random_text_2 as a column name so I can't use pattern matching...
Sample file...
random_text_1,col_2,col_3,col_4,col_5
,1,a,"1, a",abc
,2,,"2",abc
,3,c,"3,c",def
random_text2,col_2,col_3,col_4,col_5
,1,x,"2, x",abcdef
,2,,"2",abcdef
,3,y,"3,y",defefg
I am tying to split the file into 'blocks', i.e.,
One file would be...
random_text_1,col_2,col_3,col_4,col_5
,1,a,"1, a",abc
,2,,"2",abc
,3,c,"3,c",def
And the other...
random_text2,col_2,col_3,col_4,col_5
,1,x,"2, x",abcdef
,2,,"2",abcdef
,3,y,"3,y",defefg
If I do something like this it splits the file but only gives me the contents with the 'headers' removed and if I do $1!="" it just gives me the header in separate files...
awk 'BEGIN { OFS = ","; FPAT = "([^" OFS "]*)|(\"[^\"]+\")"} $1=="" { print $0 > "output-file" $1 } ' test-file.csv
awk -F, '$1!=""{close(out); out="out"++c} {print > out}' file
You don't need the close(out) if you're using gawk as it handles closing files when necessary for you.
Make the output file name assignment out="output-file" $1 to name the output files based on the header lines $1 as in your posted script if you like, I can't tell from your input if that would create unique files or not while I know the above will do so.

Use AWK to search through fasta file, given a second file containing sequence names

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help to fix one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 a.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC