I have a file which looks like this:
>4RYF_1
MAENTKNENITNILTQKLIDTRTVLIYGEINQELAEDVSKQLLLLESISNDPITIFINSQGGHVEAGDTIHDMIKFIKPTVKVVGTGWVASAGITIYLAAEKENRFSLPNTRYMIHQPAGGVQGQSTEIEIEAKEIIRMRERINRLIAEATGQSYEQISKDTDRNFWLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
I want to keep a sequence line and the header line before it only if the sequence has a given length. To select only the sequence lines meeting that condition I use:
awk 'length($0) > 50 && length($0) <= 800' sample.txt
But how can I keep lines starting with > as well if this condition is met?
Yet another awk solution:
awk '/^>/ { header = $0; next } length > 50 && length <= 800 { print header ORS $0 }' sample.txt
It remembers each header line and prints it together with the following sequence line whenever the sequence length falls within the range.
Would you please try the following:
awk -v RS='>' -F'\n' '
length($2) > 50 && length($2) <= 800 {printf ">%s", $0}
' sample.txt
Assigning RS to '>' tells awk to split the file on > into records,
so that a header line and its sequence line end up in the same record.
Assigning FS to '\n' then splits each record into fields,
making $1 the header and $2 the sequence.
As the leading > is chopped off as a delimiter, we need to prepend it
when printing the matched records.
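To see how the records and fields come apart, here is a small diagnostic sketch (the NR > 1 guard skips the empty record awk produces before the first >):
awk -v RS='>' -F'\n' 'NR > 1 { printf "header=%s length=%d\n", $1, length($2) }' sample.txt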
Here is a one-liner:
LANG=C grep -B1 '^.\{51,800\}$' < sample.txt
The command was really slow with LANG=en_US.UTF-8, which I set by default, so I am using LANG=C instead.
man grep tells you that '-B NUM' means "Print NUM lines of leading context before matching lines."
'^' means start of line
'.' means any character
'\{51,800\}' means we want between 51 and 800 of the previous thing (the backslashes are how basic regular expressions write intervals)
'$' means end of line.
Or in other words, we want to match lines that are between 51 and 800 characters long, and print each of them together with the previous line.
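One caveat: when the matched blocks are not contiguous, grep prints a -- separator line between the groups of context. GNU grep can suppress that (the option is GNU-specific):
LANG=C grep --no-group-separator -B1 '^.\{51,800\}$' sample.txt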
A potential solution with AWK is to skip everything except header lines, read the following sequence line with getline, and print both when the length condition holds:
awk '!/^>/ {next}; {getline s}; length(s) > 50 && length(s) <= 800 { print $0 "\n" s }' example.fasta
e.g. if example.fasta contains
>4RYF_1
WLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
>1000_chars
YiJOgeCApTkcJWxIuvooOxuqVnPdSLtOQmUfnzpBvcpYKyCvelFwKgMchYFnlvuZwVxNcnSvGcACsMywDQVvYBAiaIesQkLkYNsExRbqKPZIPnCRMAFHLmIzxIBqLwoNEPSKMZCTpwbbQCNrHSrbDMtCksTjvQsMeAkoudRGUJnPpQTEzwwnKoZBHtpMSIQBfYSPDYHwKktvCiFpewrsdDTQpqBajOWZkKURaKszEqDmdYMkzSAkMtlkXPfHroiTbyxZwzvrrMSXMRSavrBdgVYZanudjacRHWfpErJMkomXpzagXIzwbaeFgAgFnMxLuQHsdvZysqAsngkCZILvVLaFpkWnOpuYensROwkhwqUdngvlTsXBoCBwJUENUFgVdnSnxVOvfksyiabglFPqmSwhGabjNZiWGyvktzSDOQNGlEvoxhJCAOhxVAtZfyimzsziakpzfIszSWYVgKZTHatWSfttHYTkvgafcsVmitfEfQDuyyDAAAoTKpuhLrnHVFKgmEsSgygqcNLQYkpnhOosKiZJKpDolXcxAKHABtALqVXoVcSHpskrpWPrkkZLTpUXkENhnesmoQjonLWxkpcuJrOosXKNTDNuZaWIEtrDILXsIFTjAnrnwJBoirgNHcDURwDIzAXJSLPLmWkurOhWSLPrIOyqNvADBdIFaCGoZeewKleBHUGmKFWFcGgZIGUdOHwwINZqcOClPAjYaLNdLgDsUNCPwKMrOXJEyPvMRLaTJGgxzeoLCggJYTVjlJpyMsoCRZBDrBDckNMhJSQWBAxYBlqSpXnpmLeEJYirwjfCqZGBZdgkHzWGoAMxgNKHOAvGXsIbbuBjeeORhZaIrruBwDfzgTICuwWCAhCPqMqkHrxkQMZbXUIavknNhuIycoDssXlOtbSWsxVXQhWMyDQZWDlEtewXWKBPUcHDYWWgyOerbnoAxrnpsCulOxqxdywFJFoeWNpVGIPMUJSWwvlVDWNkjIBMlXPi
It will only print
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
Edit
The method that I would recommend, as it handles edge cases better, is to use purpose-built bioinformatics software, e.g. seqkit:
seqkit seq -m 50 -M 800 example.fasta
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDI
FLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEI
MIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEA
KDYGLIDDIIINKSGLKGHHHHHH
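Note that seqkit rewraps sequences at 60 characters per line by default; if you want each sequence on a single line as in the input, the -w 0 (line width) option disables wrapping:
seqkit seq -w 0 -m 50 -M 800 example.fasta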
Is perl an option?
perl -nle '$prev && print if length() > 50 and length() <= 800 && print $prev; $prev = $_' input_file
$prev - holds the previous line. When the current line's length is in range, print $prev first outputs the stored header, and $prev && print then outputs the current line.
$prev = $_ assigns the current line to $prev for the next iteration.
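A more straightforward variant of the same idea (a sketch; it assumes headers and sequences strictly alternate, and that header lines are shorter than 50 characters, as in the sample):
perl -nle 'print "$prev\n$_" if $prev && length > 50 && length <= 800; $prev = $_' input_file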
If the upper limit 800 is not essential, could sed be an option?
$ sed -En '/>/ {N;/[a-zA-Z0-9]{50,}/p}' input_file
/>/ - match a header line containing > and read it into the pattern space
N - append the next line (the sequence) to the pattern space
{50,} - match if there is a run of 50 or more alphanumeric characters
p - print the pattern space, i.e. the header and the sequence
Output
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
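If you do want to enforce the upper bound as well, here is a GNU sed sketch that anchors the test to the sequence part of the pattern space (the \n escape and this interval syntax are GNU extensions):
sed -En '/>/ {N;/\n[a-zA-Z0-9]{51,800}$/p}' input_file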
With your shown samples, please try the following awk code, written and tested with GNU awk.
awk -v RS= '
{
  val = ""
  delete arr
  while (match($0, />[^\n]*\n*[^\n]*/)) {
    val = substr($0, RSTART, RLENGTH)
    split(val, arr, "\n")
    if (length(arr[2]) > 50 && length(arr[2]) <= 800) {
      print val
    }
    $0 = substr($0, RSTART + RLENGTH)
  }
}
' Input_file
If only the next line should meet the length restrictions, you can match a line that starts with > and store it in a variable, for example previous.
Then for the next line, check the length and whether the previous line is non-empty.
If it is not empty, print the previous and the current line.
At the end, reset previous to an empty string.
awk '{
  if (/^>/) {
    previous = $0
    next
  }
  if (length(previous) != 0 && length($0) > 50 && length($0) <= 800) {
    print previous ORS $0
  }
  previous = ""
}' sample.txt
How do I properly select columns in awk after some processing? My file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column, since we added linenumber as the first):
linenumber;B
1;6
2;5
3;2
[fixed and revised]
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why do you print 3 (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also with the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods but they don't matter here).
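To see the OFS effect in isolation (a quick sketch):
$ echo 'a;b;c' | awk -F\; '{print $1, $2}'
a b
$ echo 'a;b;c' | awk -F\; -v OFS=\; '{print $1, $2}'
a;b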
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
    BEGIN { OFS = ";"; n = split(cols, c); }
    { printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
      for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
      printf("\n");
    }' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
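For instance, passing cols='2' reproduces the output you originally expected:
$ awk -F\; -v cols='2' '
    BEGIN { OFS = ";"; n = split(cols, c); }
    { printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
      for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
      printf("\n");
    }' foo
linenumber;B
1;6
2;5
3;2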
The Problem
I just need to combine a whole bunch of files, keeping the header (line 1) from the 1st file and stripping it from the rest.
The Data
Here are excerpts (line 1 is the header, and the final line is a totals row) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!
This should do it:
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
It prints the first header, skips the other headers, and lags the output one line behind (via the p buffer), so the last line of each file, the totals row, is never printed.
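The same program spelled out with comments (a sketch, same behavior):
awk '
NR == 1              # print the very first line: the header of the first file
FNR == 1 { next }    # skip the header line of every file (and do not buffer it)
FNR > 2 { print p }  # from line 3 of each file on, print the buffered previous line
{ p = $0 }           # buffer the current line; the totals line ending each file never gets printed
' file{1..3}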
Another awk approach:
awk -F, '
NR == 1 {
    header = $0
    print
    next
}
FNR > 1 && $1 != "\"\""
' *.csv
Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma -F","
If this is the first record awk encounters, it sets variable header to the entire contents of the line and then prints the header NR==1{header=$0; print $0}
If the contents of the current line are not a header and the first field isn't an empty quoted string "" (which indicates a "total" line), then print the line: $0!=header && $1!="\"\""{print $0}
As mentioned in my comment below, if the first field of your records always begins with an 8 digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
Essentially that says: if this is the first record to process, then print it (it's a header), OR (||) if the first field is an 8 digit number surrounded by double quotes, then print it. (Regex intervals such as {8} need a POSIX-aware awk; very old gawk versions required --re-interval.)
I have a task to do with awk. I am doing sequence analysis for some genes.
I have several files with sequences in order. I would like to extract the first sequence of each file into a new file, then the second sequence of each file, and so on up to the last sequence. I only know how to do this for the first or some other specific line with awk:
awk 'FNR == 2 {print; nextfile}' *.txt > newfile
Here I have input like this
File 1
Saureus081.1
ATCGGCCCTTAA
Saureus081.2
ATGCCTTAAGCTATA
Saureus081.3
ATCCTAAAGGTAAGG
File 2
SaureusRF1.1
ATCGGCCCTTAC
SaureusRF1.2
ATGCCTTAAGCTAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
File 3
SaureusN305.1
ATCGGCCCTTACT
SaureusN305.2
ATGCCTTAAGCTAGA
SaureusN305.3
ATCCTAAAGGTAATG
And there are 12 similar files:
File 4
.
.
.
.File 12
Required Output
Newfile
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SaureusN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
I guess this task can be done easily with awk, but I have no idea how to do it for multiple lines across multiple files.
Based on the modified question, the answer needs some changes.
$ awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0} END{for(i in a)print a[i]}' file1 file2 file3
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SaureusN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
Brief explanation:
Set '.' as the field delimiter
For every odd record (a header line), use k=$2, the sequence number after the dot, as the key into array a
Invoke getline to read the next record, and append the header and its sequence to the value stored under key k
Print the whole array in the END block
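One caveat: for (i in a) iterates in unspecified order, so the groups may come out shuffled. With GNU awk you can force numerically sorted keys (a sketch of the same one-liner):
awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0}
END{PROCINFO["sorted_in"]="@ind_num_asc";for(i in a)print a[i]}' file1 file2 file3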
If your data is very large, I would suggest creating temporary files, one per header (this assumes header names such as Seq1 are shared across the input files, as in the original example):
awk 'FNR%2==1 { filename = $1 }
     { print $0 >> filename }' file1 ... filen
Afterwards, you can cat them together:
cat Seq1 ... Seqn > result
This has the additional advantage that it will work if not all sequences are present in all files.
paste + awk solution:
paste File1 File2 | awk '{ p=$2;$2="" }NR%2{ k=p; print }!(NR%2){ v=p; print $1 RS k RS v }'
paste File1 File2 - merge corresponding lines of files
p=$2;$2="" - capture the value of the 2nd field which is the respective key/value from File2
The output:
Seq1
ATCGGCCCTTAA
Seq1
ATCGGCCCTTAC
Seq2
ATGCCTTAAGCTATA
Seq2
ATGCCTTAAGCTAGG
Seq3
ATCCTAAAGGTAAGG
Seq3
ATCCTAAAGGTAAGC
Additional approach for multiple files:
paste Files[0-9]* | awk 'NR%2{ k=$1; n=NF; print k }
!(NR%2){ print $1; for(i=2;i<=n;i++) print k RS $i }'
I have 10 files that have the same tab-delimited column structure. I need to merge columns 8 and 9 from each of the files. I came up with the following AWK code, but it only merges two files at a time. I am looking for help with merging all 10 files at once; I am not sure whether this is feasible.
All my file names following the same pattern (s1s2.txt, s3s4.txt, s5s6.txt, .... s19s20.txt)
#!/bin/bash
awk '
BEGIN {
    # load array with contents of the first file
    while ( (getline < "s1s2.txt") > 0 ) {
        s1s2_counter++
        f1_8[s1s2_counter] = $8
        f1_9[s1s2_counter] = $9
    }
}
{ OFS = "\t" }
{
    # output the columns 8 and 9 from the first file before the second file
    print f1_8[NR], f1_9[NR], $8, $9
}' s3s4.txt
awk -F'\t' '{a[FNR] = a[FNR] (NR==FNR?"":FS) $8 FS $9} END{for (i=1;i<=FNR;i++) print a[i]}' s1s2.txt .... s19s20.txt
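Spelled out with comments, and with the file list written in full per the naming pattern above (a sketch; it assumes all files have the same number of lines):
awk -F'\t' '
{
    # append columns 8 and 9 of the current file to the accumulator for this line;
    # NR==FNR is true only while reading the first file, so no leading separator then
    a[FNR] = a[FNR] (NR == FNR ? "" : FS) $8 FS $9
}
END {
    for (i = 1; i <= FNR; i++)   # FNR here is the line count of the last file
        print a[i]
}' s1s2.txt s3s4.txt s5s6.txt s7s8.txt s9s10.txt s11s12.txt s13s14.txt s15s16.txt s17s18.txt s19s20.txt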
Using getline is usually the wrong approach, see http://awk.info/?tip/getline.