(g)awk next file on partially blank line - awk

The Problem
I just need to combine a whole bunch of files and strip out the header (line 1) from every file except the 1st.
The Data
Here are the last three lines (with line 1: header) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!

This should do it:
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
It prints the first header, then skips the other headers and the last line of each file.
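As a rough illustration of how that one-liner behaves, here is a minimal sketch run against two tiny stand-in files. The file names and their contents are made up for the demo and are not the asker's data.

```shell
# Hypothetical sample files standing in for the real CSVs.
dir=$(mktemp -d)
printf '%s\n' '"H1","H2"' '"a","1"' '"b","2"' '"","3"' > "$dir/f1.csv"
printf '%s\n' '"H1","H2"' '"c","4"' '"","4"' > "$dir/f2.csv"

# NR==1     print the very first line (the header of the first file)
# FNR==1    skip the first line of every file
# FNR>2     print the previous line, now known not to be the file's last
# {p=$0}    remember the current line for the next iteration
out=$(awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' "$dir/f1.csv" "$dir/f2.csv")
printf '%s\n' "$out"
rm -rf "$dir"
```

Because each line is printed one step late, the final line of every file (the totals row) is overwritten in p before it can be printed.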

Another awk approach:-
awk -F, '
NR == 1 {
header = $0
print
next
}
FNR > 1 && $1 != "\"\""
' *.csv
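The comparison against "\"\"" matters because splitting on commas leaves the double quotes inside each field. A small sketch, using one totals line and one data line taken from the question, shows what awk actually sees in $1:

```shell
# With -F, the fields keep their double quotes, so the first field of the
# totals line is the two-character string "" rather than an empty string.
total='"","","","","9.76",""'
data='"20170101","20170131","1","5.49","EUR","5.49"'

len=$(printf '%s\n' "$total" | awk -F, '{ print length($1) }')
kept=$(printf '%s\n' "$total" "$data" | awk -F, '$1 != "\"\""')

printf 'length of first field on the totals line: %s\n' "$len"
printf '%s\n' "$kept"
```

Only the data line survives the filter, since its first field is a quoted date rather than a bare pair of quotes.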

Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma -F","
If this is the first record awk encounters, it sets variable header to the entire contents of the line and then prints the header NR==1{header=$0; print $0}
If the contents of the current line are not a header and the first field isn't just a pair of double quotes (which indicates a "total" line), then print the line $0!=header && $1!="\"\""{print $0}. Note that the fields keep their quotes, so the first field of a "total" line is "" rather than an empty string.
As mentioned in my comment below, if the first field of your records always begins with an 8-digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
Essentially that says if this is the first record to process then print it (it's a header) OR || if the first field is an 8 digit number surrounded by double quotes then print it.

Related

Edit multiple files using awk

I am trying to edit multiple files using awk in the following way:
awk '{F = FILENAME ".inp"; print $0 > F ; if(NR==3) print "%moinp usr/speciale/br/brhgooh/scan/newscan/freqBZ/cas/prov2/",FILENAME > F }' *.xyz
It works well with the first file, but in the rest of the files the change does not appear.
Any suggestions?
You need to use FNR not NR
NR is the record number of all records seen thus far
FNR is the record number of the current file
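A quick way to see the difference is to print both counters side by side. This is only a sketch with two throwaway files (the names one.txt and two.txt are made up for the demo):

```shell
# Two hypothetical two-line files.
dir=$(mktemp -d)
printf '%s\n' a b > "$dir/one.txt"
printf '%s\n' c d > "$dir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each new file.
out=$(awk '{ print NR, FNR, $0 }' "$dir/one.txt" "$dir/two.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

The second file's first line shows NR as 3 but FNR as 1, which is why a test such as NR==3 only ever fires once across all input files.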
Adding some whitespace helps readability:
awk '
BEGINFILE {
close(F)
F = FILENAME ".inp"
}
{print > F}
FNR == 3 {print "%moinp usr/speciale/br/brhgooh/scan/newscan/freqBZ/cas/prov2/" FILENAME > F }
' *.xyz
If your awk does not have BEGINFILE, you can use FNR == 1 instead.
Other changes:
print instead of print $0 ($0 is the default)
print "%mo .../" FILENAME without a comma -- a comma will insert a space after the slash.
close(F) to prevent "too many open files" errors

Keep current and previous line only if current line fulfills a given condition

I have a file which looks like this:
>4RYF_1
MAENTKNENITNILTQKLIDTRTVLIYGEINQELAEDVSKQLLLLESISNDPITIFINSQGGHVEAGDTIHDMIKFIKPTVKVVGTGWVASAGITIYLAAEKENRFSLPNTRYMIHQPAGGVQGQSTEIEIEAKEIIRMRERINRLIAEATGQSYEQISKDTDRNFWLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
I want to keep the sequence and previous line only if the sequence has a given length. For selecting only lines with that condition I use:
awk 'length($0) > 50 && length($0) <= 800' sample.txt
But how can I keep lines starting with > as well if this condition is met?
Yet another awk solution:
awk '/^>/ { header = $0; next } length > 50 && length <= 800 { print header ORS $0 }'
Would you please try the following:
awk -v RS='>' -F'\n' '
length($2) > 50 && length($2) <= 800 {printf ">%s", $0}
' sample.txt
Assigning RS to '>' tells awk to split the file on > into records,
treating the header line and the sequence line in the same record.
Assigning FS to '\n' splits each record into the header and the
sequence, so $1 holds the header and $2 holds the sequence.
As the leading > is chopped off as a delimiter, we need to prepend it
when printing the matched records.
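To make the record splitting concrete, here is a small sketch on a made-up two-entry file. The identifiers, sequences, and the shrunken length limits (more than 5, at most 20) are assumptions chosen only to keep the demo short.

```shell
# Hypothetical miniature FASTA-like file.
file=$(mktemp)
printf '%s\n' '>id1' 'ACGT' '>id2' 'ACGTACGTAC' > "$file"

# RS='>' makes each "header + sequence" pair one record;
# FS='\n' puts the header in $1 and the sequence in $2.
# The record still ends with its newline, so printf needs no "\n".
out=$(awk -v RS='>' -F'\n' 'length($2) > 5 && length($2) <= 20 { printf ">%s", $0 }' "$file")
printf '%s\n' "$out"
rm -f "$file"
```

Only the id2 entry is printed, because its sequence is 10 characters long while the id1 sequence has just 4.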
Here is a one-liner:
LANG=C grep -B1 '^.\{51,800\}$' < sample.txt
The command was really slow with LANG=en_US.UTF-8, which I have set by default, so I use LANG=C instead.
man grep tells you that '-B NUM' means ' Print NUM lines of leading context before matching lines.'.
'^' means start of line
'.' means any character
'{51,800}' means we want between 51 and 800 of the previous thing
'$' means end of line.
Or in other words, we want to match lines that are between 51 and 800 characters, and print it and the previous line.
A potential solution with AWK is:
awk '!/^>/ {next}; {getline s}; length(s) > 50 && length(s) <= 800 { print $0 "\n" s }' example.fasta
e.g. if example.fasta contains
>4RYF_1
WLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
>1000_chars
YiJOgeCApTkcJWxIuvooOxuqVnPdSLtOQmUfnzpBvcpYKyCvelFwKgMchYFnlvuZwVxNcnSvGcACsMywDQVvYBAiaIesQkLkYNsExRbqKPZIPnCRMAFHLmIzxIBqLwoNEPSKMZCTpwbbQCNrHSrbDMtCksTjvQsMeAkoudRGUJnPpQTEzwwnKoZBHtpMSIQBfYSPDYHwKktvCiFpewrsdDTQpqBajOWZkKURaKszEqDmdYMkzSAkMtlkXPfHroiTbyxZwzvrrMSXMRSavrBdgVYZanudjacRHWfpErJMkomXpzagXIzwbaeFgAgFnMxLuQHsdvZysqAsngkCZILvVLaFpkWnOpuYensROwkhwqUdngvlTsXBoCBwJUENUFgVdnSnxVOvfksyiabglFPqmSwhGabjNZiWGyvktzSDOQNGlEvoxhJCAOhxVAtZfyimzsziakpzfIszSWYVgKZTHatWSfttHYTkvgafcsVmitfEfQDuyyDAAAoTKpuhLrnHVFKgmEsSgygqcNLQYkpnhOosKiZJKpDolXcxAKHABtALqVXoVcSHpskrpWPrkkZLTpUXkENhnesmoQjonLWxkpcuJrOosXKNTDNuZaWIEtrDILXsIFTjAnrnwJBoirgNHcDURwDIzAXJSLPLmWkurOhWSLPrIOyqNvADBdIFaCGoZeewKleBHUGmKFWFcGgZIGUdOHwwINZqcOClPAjYaLNdLgDsUNCPwKMrOXJEyPvMRLaTJGgxzeoLCggJYTVjlJpyMsoCRZBDrBDckNMhJSQWBAxYBlqSpXnpmLeEJYirwjfCqZGBZdgkHzWGoAMxgNKHOAvGXsIbbuBjeeORhZaIrruBwDfzgTICuwWCAhCPqMqkHrxkQMZbXUIavknNhuIycoDssXlOtbSWsxVXQhWMyDQZWDlEtewXWKBPUcHDYWWgyOerbnoAxrnpsCulOxqxdywFJFoeWNpVGIPMUJSWwvlVDWNkjIBMlXPi
It will only print
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
Edit
The method that I would recommend to better handle edge-cases is to use purpose-built bioinformatics software, e.g. seqkit
seqkit seq -m 50 -M 800 example.fasta
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDI
FLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEI
MIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEA
KDYGLIDDIIINKSGLKGHHHHHH
Is perl an option?
perl -nle '$prev && print if length() >50 and length() < 800 && print $prev; $prev = $_' input_file
$prev - A variable that holds the previous line. When the current line meets the length condition and there is a previous line in $prev, the command prints $prev (the header) and then the current line.
$prev = $_ Assigns the current line to the prev line variable
If the upper limit 800 is not essential, could sed be an option?
$ sed -En '/>/ {N;/[a-zA-Z0-9]{50,}/p}' input_file
/>/ - Match a line containing >
N; - Append the next line to the pattern space, so the condition runs on the header and the sequence together
[a-zA-Z0-9]{50,} - Match if there is a run of 50 or more alphanumeric characters
p - Print the pattern space (the header and the sequence)
Output
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
With your shown samples, please try following awk code. Written and tested with GNU awk.
awk -v RS= '
{
val=""
delete arr
while(match($0,/>[^\n]*\n*[^\n]*/)){
val=substr($0,RSTART,RLENGTH)
split(val,arr,"\n")
if(length(arr[2])>50 && length(arr[2])<=800){
print val
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
If only the next line should meet the length restrictions, you can match the line that starts with > and store it in a variable, for example previous.
Then for the next line, check the length and check that the previous line is not empty.
If it is not empty, print the previous and the current line.
At the end, set the previous variable to an empty string.
awk '{
if (/^>/) {
previous = $0
next
}
if (length(previous) != 0 && length($0) > 50 && length($0) <= 800) {
print previous ORS $0
}
previous=""
}' sample.txt

Awk Remove lines if one column matches another column, and keep line if max value from another column

I have a file of ~8,000 lines. Among the lines that share the same value in the 5th column (in this case ga2016mldlzd), I am trying to keep only the line with the max value in the 6th column and remove the rest. For example, given this:
-25.559,129.8529,6674.560547,2.0,ga2016mldlzd,6
-25.5596,129.8565,6902.750651,2.0,ga2016mldlzd,7
-25.5450,129.830,969.8079427,2.0,ga2016mldlzd,8
-25.5450,129.834,57.04752604,2.0,ga2016mldlzd,9
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I want to remove all lines except the final one, which has the max value of 10, to get the line below. I'm stumped as to how this could be done in either awk or sed.
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I've tried this:
awk -F, '!a[$5]++'
but I want to keep the line with the largest last column, e.g. the line with '10' rather than the line with '6'. Thanks
Keep track of the max and line associated with that max and print at the end:
awk -F, '
{
if ($6>max[$5]) {
max[$5]=$6
tl[$5]=$0
}
}
END{
for (l in tl) print tl[l]
}' file
Prints:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
The order of the file will be lost; ie, the groups may be reordered compared to the original file.
If you are dealing with a file with many different keys for $5 and not all of them could fit in memory, you could sort into blocks grouped by the fifth field and then by the numeric value of the sixth. Then have awk print the last line every time the fifth field changes. Since it is sorted, that will be the max:
sort -t , -k 5,5 -k 6n file |
awk -F, '
FNR==1{lf=$5;ll=$0}
lf!=$5{print ll}
{ll=$0; lf=$5}
END{print $0}'
# same print out
The second approach will be much slower, but it uses far less memory when $5 has a large number of unique values.
If you want to maintain original order of lines then use this awk:
awk -F, 'NR==FNR {if ($6 > max[$5]) max[$5] = $6; next} $5 in max && max[$5] == $6' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
If you want to filter for ga2016mldlzd while maintaining original order of lines then use this awk:
awk -F, '
NR==FNR {
if ($5 == "ga2016mldlzd" && $6 > max[$5]) {
max[$5] = $6
n = FNR
}
next
}
FNR == n' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10

Duplicate Lines 2 times and transpose from row to column

I would like to print each line twice, once with the value of column 5 and once with the value of column 6 (that is, transpose the values of columns 5 and 6 from columns to rows).
I mean the first copy should end with the value in column 5 and the second copy with the value in column 6.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to print each line twice, but I can't figure out how to handle the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of a line for each of the last N fields where N is 2 in this case:
$ awk -v n=2 '
BEGIN { FS=OFS="," }
{
base = $0
sub("("FS"[^"FS"]+){"n"}$","",base)
for (i=NF-n+1; i<=NF; i++) {
print base, $i
}
}
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case where the last field has to be removed and placed on the last line, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
x = $6 # remember sixth field
NF = 5 # Set field number to 5, so the last one won't be printed
print # print those first five fields
$5 = x # replace value of fifth field with remembered value of sixth
print # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question.
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
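The effect of $1 = $1 is easiest to see in the separator-changing case mentioned above. The following is a minimal sketch with made-up input; the second command reruns the answer's one-liner on a shortened, hypothetical record.

```shell
# print shows $0 untouched until a field assignment forces awk to rebuild
# the record using OFS.
sep=$(printf '%s\n' 'a,b,c' | awk -F, -v OFS=';' '{ print; $1 = $1; print }')
printf '%s\n' "$sep"

# The same trick after shrinking NF, as in the answer above.
split=$(printf '%s\n' '08,1,2,VV,22,27' |
        awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }')
printf '%s\n' "$split"
```

In the first command the record is printed with commas before the assignment and with semicolons after it, since only the assignment causes $0 to be rebuilt from the fields.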
Could you please try the following? It assumes that your Input_file always looks like the sample shown and that you want to print the first four fields each time, followed by each of the remaining fields in turn.
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
This replaces the last comma with a newline followed by the preceding fields minus the penultimate one.

searching multiple patterns in awk

I've a text file of thousands of lines
:ABC:xyz:1234:200:some text:xxx:yyyy:11818:AAA:BBB
:ABC:xyz:6789:200:some text:xxx:yyyy:203450:AAA:BBB
:EFG:xyz:11818:200:some text:xxx:yyyy:154678:AAA:BBB
:HIJ:xyz:203450:200:some text:xxx:yyyy:154678:AAA:BBB
:KLM:xyz:7777:200:some text:xxx:yyyy:11818:AAA:BBB
.....
....
:DEL:xyz:1234:200:some text:xxx:yyyy:203450:AAA:BBB
I need to find the lines whose 9th column value occurs more than once, i.e. the output should show
:ABC:xyz:1234:200:some text:xxx:yyyy:11818:AAA:BBB
:KLM:xyz:7777:200:some text:xxx:yyyy:11818:AAA:BBB
:ABC:xyz:6789:200:some text:xxx:yyyy:203450:AAA:BBB
:DEL:xyz:1234:200:some text:xxx:yyyy:203450:AAA:BBB
I tried:
awk -F ":" '$9 > 2 {split($0,a,":"); print $0}'
this prints all the records.
awk -F':' 'NR==FNR{cnt[$9]++;next} cnt[$9]>1' file file
or if you don't want to parse the file twice:
awk -F':' 'cnt[$9]++{printf "%s", prev[$9]; delete prev[$9]; print; next} {prev[$9]=$0 ORS}' file
This should do it in pure awk:
awk -F":" '{if( s[$9] ){ print } else if( f[$9] ){ print f[$9]; s[$9]=1; print }; f[$9]=$0 }'
Explanation:
The "f" array stores, for each value of the 9th column seen so far, the most recent line that had it.
The "s" array flags values of the 9th column that have occurred twice or more.
If the 9th column has occurred before, print the first occurrence, and this line.
If the 9th column has occurred twice or more before, print this line.
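The same two-array logic can be tried on a smaller input. This sketch keys on the 2nd field instead of the 9th and uses made-up records, purely to keep the demo readable:

```shell
# Hypothetical records; the key is field 2 here rather than field 9.
# f[key] holds the last line seen with that key,
# s[key] is set once the key has been seen at least twice.
out=$(printf '%s\n' 'x:k1:1' 'x:k2:2' 'x:k1:3' 'x:k3:4' 'x:k1:5' |
      awk -F: '{
        if (s[$2]) {
          print                      # third or later occurrence
        } else if (f[$2]) {
          print f[$2]; s[$2] = 1     # second occurrence: emit the stored first line
          print
        }
        f[$2] = $0
      }')
printf '%s\n' "$out"
```

Keys k2 and k3 appear only once, so their lines are stored in f but never printed. Note that the truth test on f[$2] relies on the stored line being a non-empty, non-zero string, which holds for records like these.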
Here is another awk
awk -F: '{++a[$9];b[NR]=$0} END {for (i=1;i<=NR;i++) {split(b[i],c,":");if (a[c[9]]>1) print b[i]}}' file