awk script for finding smallest value from column - awk

I am a beginner in AWK, so please help me learn it. I have a text file named snd, and its values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row where the third column value is the minimum.

This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. The NR == 1 ... test, however, cannot be skipped when the 3rd column is positive: min starts out empty (zero in a numeric comparison), so $3 < min would never be true and nothing would be stored. That shortcut only works when looking for the maximum of positive values.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
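As a quick check of the shorter one-liner, here is a minimal sketch that runs it on a reordered copy of the question's data, so that the minimum is not on the first line. The temporary file and the variable name min_line are only for illustration.

```shell
# Sample rows from the question, deliberately reordered so the minimum is not on line 1.
snd=$(mktemp)
printf '%s\n' '1 2 223' '1 4 280' '1 0 141' '1 3 250' > "$snd"

# Store the line when it is the first one or when its 3rd field beats the current minimum.
min_line=$(awk 'NR == 1 || $3 < min {line = $0; min = $3} END {print line}' "$snd")
echo "$min_line"

rm -f "$snd"
```

Because the comparison is a strict <, a later line with the same minimum would not replace the stored one.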

You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141

I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{x=$3;line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{x=$3;line=$0} -- On each line, compare the third field against our stored value, and if it is smaller, update the stored value and store the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.

A short answer for this would be:
sort -k3,3n temp | head -n 1
Since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.

For calculating the smallest value in any column, say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
This will only print the smallest value of the column.
If the complete line is needed, it is better to use sort (ascending numeric order, so the smallest value comes first):
sort -n -t [delimiter] -k[column] [file name] | head -n 1

awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
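This prints the line with the smallest value in the last field. If several lines share that value, the one encountered last is printed, because the stored line is replaced whenever $NF is not greater than the current minimum.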
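Tie handling is easy to get wrong, so here is a small sketch that makes it explicit. The sample rows are made up for illustration: a strict < keeps the first line holding the minimum, while <= keeps the last.

```shell
# Hypothetical semicolon-separated rows; "a;5" and "c;5" tie for the minimum.
rows=$(printf '%s\n' 'a;5' 'b;9' 'c;5')

# Strict "<" replaces the stored line only when a smaller value appears, so the first tie wins.
first_min=$(printf '%s\n' "$rows" |
  awk -F ';' 'NR == 1 || $NF < min {min = $NF; line = $0} END {print line}')

# "<=" also replaces it on equal values, so the last tie wins.
last_min=$(printf '%s\n' "$rows" |
  awk -F ';' 'NR == 1 || $NF <= min {min = $NF; line = $0} END {print line}')

echo "$first_min"
echo "$last_min"
```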

awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
This prints the minimum and maximum of field 2 for each value of field 1. We use || a[$1]=="" because the first time a value of field 1 is encountered, a[$1] is still empty.
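To see the per-key minimum and maximum logic in action, here is a sketch on made-up key,value pairs. The data and the variable name minmax are assumptions for illustration; the output is piped through sort because the order of a for-in loop is unspecified.

```shell
# Hypothetical key,value pairs; the goal is the min and max value per key.
minmax=$(printf '%s\n' 'x,5' 'y,7' 'x,2' 'y,9' 'x,8' |
  awk 'BEGIN {OFS = FS = ","}
       {
         # a[] holds the minimum per key; the == "" test catches the first value seen for a key.
         if (a[$1] == "" || a[$1] > $2) a[$1] = $2
         # b[] holds the maximum per key; an unset entry compares as 0, so this assumes positive values.
         if (b[$1] < $2) b[$1] = $2
       }
       END {for (i in a) print i, a[i], b[i]}' |
  sort)   # for-in order is unspecified, so sort for stable output
echo "$minmax"
```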

Related

Keep current and previous line only if current line fulfills a given condition

I have a file which looks like this:
>4RYF_1
MAENTKNENITNILTQKLIDTRTVLIYGEINQELAEDVSKQLLLLESISNDPITIFINSQGGHVEAGDTIHDMIKFIKPTVKVVGTGWVASAGITIYLAAEKENRFSLPNTRYMIHQPAGGVQGQSTEIEIEAKEIIRMRERINRLIAEATGQSYEQISKDTDRNFWLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
I want to keep the sequence and previous line only if the sequence has a given length. For selecting only lines with that condition I use:
awk 'length($0) > 50 && length($0) <= 800' sample.txt
But how can I keep lines starting with > as well if this condition is met?
Yet another awk solution:
awk '/^>/ { header = $0; next } length > 50 && length <= 800 { print header ORS $0 }'
Would you please try the following:
awk -v RS='>' -F'\n' '
length($2) > 50 && length($2) <= 800 {printf ">%s", $0}
' sample.txt
Assigning RS to '>' tells awk to split the file on > into records,
treating the header line and the sequence line in the same record.
Assigning FS to '\n' splits the record to the header and
sequence, each assigning $1 to the header and $2 to the sequence.
As the leading > is chopped off as a delimiter, we need to prepend it
when printing the matched records.
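The record-splitting idea can be tried on a tiny made-up file. This is only a sketch: the headers short_1 and long_2 and the 60-character sequence are invented for the test, and it assumes each sequence sits on a single line.

```shell
# Build a tiny hypothetical FASTA file: one 10-residue and one 60-residue sequence.
fasta=$(mktemp)
long_seq=$(awk 'BEGIN {for (i = 0; i < 60; i++) printf "A"}')
printf '>short_1\nMAENTKNENI\n>long_2\n%s\n' "$long_seq" > "$fasta"

# With RS set to ">", each record is "header\nsequence\n"; $1 is the header, $2 the sequence.
# The empty record before the first ">" has no $2, so it is skipped by the length test.
kept=$(awk -v RS='>' -F'\n' 'length($2) > 50 && length($2) <= 800 {printf ">%s", $0}' "$fasta")
echo "$kept"

rm -f "$fasta"
```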
Here is a one-liner:
LANG=C grep -B1 '^.\{51,800\}$' < sample.txt
The command was really slow with LANG=en_US.UTF-8 that I set by default, so using LANG=C instead.
man grep tells you that '-B NUM' means ' Print NUM lines of leading context before matching lines.'.
'^' means start of line
'.' means any character
'{51,800}' means we want between 51 and 800 of the previous thing
'$' means end of line.
Or in other words, we want to match lines that are between 51 and 800 characters, and print it and the previous line.
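A small sketch of the grep approach on invented input follows. It assumes a grep that supports -B (GNU and BSD grep do; it is not part of POSIX), and the headers and sequences are made up.

```shell
# Hypothetical two-record FASTA input: a 10-character and a 60-character sequence.
long_seq=$(awk 'BEGIN {for (i = 0; i < 60; i++) printf "A"}')

# Lines of 51 to 800 characters match; -B1 also prints the line before each match.
matched=$(printf '>short_1\nMAENTKNENI\n>long_2\n%s\n' "$long_seq" |
  LANG=C grep -B1 '^.\{51,800\}$')
echo "$matched"
```

Two things to keep in mind with this approach: a header line longer than 50 characters would match as well, and GNU grep prints a -- separator line between non-adjacent groups of matches.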
A potential solution with AWK is:
awk '!/^>/ {next}; {getline s}; length(s) > 50 && length(s) <= 800 { print $0 "\n" s }' example.fasta
e.g. if example.fasta contains
>4RYF_1
WLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
>1000_chars
YiJOgeCApTkcJWxIuvooOxuqVnPdSLtOQmUfnzpBvcpYKyCvelFwKgMchYFnlvuZwVxNcnSvGcACsMywDQVvYBAiaIesQkLkYNsExRbqKPZIPnCRMAFHLmIzxIBqLwoNEPSKMZCTpwbbQCNrHSrbDMtCksTjvQsMeAkoudRGUJnPpQTEzwwnKoZBHtpMSIQBfYSPDYHwKktvCiFpewrsdDTQpqBajOWZkKURaKszEqDmdYMkzSAkMtlkXPfHroiTbyxZwzvrrMSXMRSavrBdgVYZanudjacRHWfpErJMkomXpzagXIzwbaeFgAgFnMxLuQHsdvZysqAsngkCZILvVLaFpkWnOpuYensROwkhwqUdngvlTsXBoCBwJUENUFgVdnSnxVOvfksyiabglFPqmSwhGabjNZiWGyvktzSDOQNGlEvoxhJCAOhxVAtZfyimzsziakpzfIszSWYVgKZTHatWSfttHYTkvgafcsVmitfEfQDuyyDAAAoTKpuhLrnHVFKgmEsSgygqcNLQYkpnhOosKiZJKpDolXcxAKHABtALqVXoVcSHpskrpWPrkkZLTpUXkENhnesmoQjonLWxkpcuJrOosXKNTDNuZaWIEtrDILXsIFTjAnrnwJBoirgNHcDURwDIzAXJSLPLmWkurOhWSLPrIOyqNvADBdIFaCGoZeewKleBHUGmKFWFcGgZIGUdOHwwINZqcOClPAjYaLNdLgDsUNCPwKMrOXJEyPvMRLaTJGgxzeoLCggJYTVjlJpyMsoCRZBDrBDckNMhJSQWBAxYBlqSpXnpmLeEJYirwjfCqZGBZdgkHzWGoAMxgNKHOAvGXsIbbuBjeeORhZaIrruBwDfzgTICuwWCAhCPqMqkHrxkQMZbXUIavknNhuIycoDssXlOtbSWsxVXQhWMyDQZWDlEtewXWKBPUcHDYWWgyOerbnoAxrnpsCulOxqxdywFJFoeWNpVGIPMUJSWwvlVDWNkjIBMlXPi
It will only print
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
Edit
The method that I would recommend to better handle edge-cases is to use purpose-built bioinformatics software, e.g. seqkit
seqkit seq -m 50 -M 800 example.fasta
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDI
FLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEI
MIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEA
KDYGLIDDIIINKSGLKGHHHHHH
Is perl an option?
perl -nle '$prev && print if length() > 50 and length() <= 800 && print $prev; $prev = $_' input_file
$prev - A variable that holds the previous line. When the length condition is met, evaluating the condition prints $prev (the header line) as a side effect, and then, provided $prev is not empty, the current line is printed.
$prev = $_ Assigns the current line to the prev line variable
If the upper limit 800 is not essential, could sed be an option?
$ sed -En '/>/ {N;/[a-zA-Z0-9]{50,}/p}' input_file
/>/ - Match a line containing > (the header)
N - Append the next line (the sequence) to the pattern space
/[a-zA-Z0-9]{50,}/ - Check whether the pattern space contains a run of 50 or more alphanumeric characters
p - If so, print the pattern space (header and sequence)
Output
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
With your shown samples, please try following awk code. Written and tested with GNU awk.
awk -v RS= '
{
val=""
delete arr
while(match($0,/>[^\n]*\n*[^\n]*/)){
val=substr($0,RSTART,RLENGTH)
split(val,arr,"\n")
if(length(arr[2])>50 && length(arr[2])<=800){
print val
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
If only the next line should meet the length restrictions, you can match and store the line that starts with > in a variable, for example previous
Then for the next line, check for the length and if the previous line is not empty.
If it is not empty, print the previous and the current line.
At the end, set the previous variable to an empty string.
awk '{
if (/^>/) {
previous = $0
next
}
if (length(previous) != 0 && length($0) > 50 && length($0) <= 800) {
print previous ORS $0
}
previous=""
}' sample.txt

Awk Remove lines if one column matches another column, and keep line if max value from another column

I have a file of ~8,000 lines. Where several lines share the same value in the 5th column (in this case ga2016mldlzd), I want to remove them all except the line with the max value in the 6th column. For example, given this:
-25.559,129.8529,6674.560547,2.0,ga2016mldlzd,6
-25.5596,129.8565,6902.750651,2.0,ga2016mldlzd,7
-25.5450,129.830,969.8079427,2.0,ga2016mldlzd,8
-25.5450,129.834,57.04752604,2.0,ga2016mldlzd,9
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I want to remove all lines except the final one, which has the max value 10, to get the line below. I'm stumped as to how this could be done in either awk or sed.
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I tried this:
awk -F, '!a[$5]++'
but I want to keep the last line, i.e. the line with '10', rather than the line with '6'. Thanks
Keep track of the max and line associated with that max and print at the end:
awk -F, '
{
if ($6>max[$5]) {
max[$5]=$6
tl[$5]=$0
}
}
END{
for (l in tl) print tl[l]
}' file
Prints:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
The order of the file will be lost; ie, the groups may be reordered compared to the original file.
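Here is a sketch of the same max-per-key idea on made-up rows with two different keys, to show that each key keeps its own best line. The keys keyA and keyB are invented, and the result is sorted only to make the output order predictable.

```shell
# Hypothetical rows with two different keys in column 5.
best=$(printf '%s\n' \
  '1.0,2.0,3.0,2.0,keyA,6' \
  '1.1,2.1,3.1,2.0,keyB,4' \
  '1.2,2.2,3.2,2.0,keyA,10' \
  '1.3,2.3,3.3,2.0,keyB,3' |
  awk -F, '$6 > max[$5] {max[$5] = $6; tl[$5] = $0}
           END {for (k in tl) print tl[k]}' |
  sort -t, -k5,5)   # for-in order is unspecified, so sort by key
echo "$best"
```

Note that an unset max[$5] compares as 0, so this sketch, like the code it illustrates, assumes the 6th column holds positive numbers.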
If you are dealing with a file with many different keys for $5 and not all of them could fit in memory, you could sort into blocks grouped by the fifth field and then by the numeric value of the sixth. Then have awk print the last line every time the fifth field changes. Since it is sorted, that will be the max:
sort -t , -k 5,5 -k 6n file |
awk -F, '
FNR==1{lf=$5;ll=$0}
lf!=$5{print ll}
{ll=$0; lf=$5}
END{print ll}'
# same output
The second approach will be much slower, but it uses far less memory when there is a large number of unique $5 values.
If you want to maintain original order of lines then use this awk:
awk -F, 'NR==FNR {if ($6 > max[$5]) max[$5] = $6; next} $5 in max && max[$5] == $6' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
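The two-pass technique reads the same file twice, so it needs a real file rather than a pipe. A minimal sketch with invented rows and keys, arranged so that the winning lines appear in keyB, keyA order:

```shell
# Hypothetical input where the winning rows appear in keyB, keyA order.
file=$(mktemp)
printf '%s\n' \
  '1.1,2.1,3.1,2.0,keyB,4' \
  '1.0,2.0,3.0,2.0,keyA,6' \
  '1.3,2.3,3.3,2.0,keyB,3' \
  '1.2,2.2,3.2,2.0,keyA,10' > "$file"

# Pass 1 (NR == FNR) records the maximum per key; pass 2 prints the rows that carry it.
ordered=$(awk -F, 'NR == FNR {if ($6 > max[$5]) max[$5] = $6; next}
                   $5 in max && max[$5] == $6' "$file" "$file")
echo "$ordered"

rm -f "$file"
```

If two lines of the same key share the maximum, both are printed.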
If you want to filter for ga2016mldlzd while maintaining original order of lines then use this awk:
awk -F, '
NR==FNR {
if ($5 == "ga2016mldlzd" && $6 > max[$5]) {
max[$5] = $6
n = FNR
}
next
}
FNR == n' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10

Using awk pattern to file filter data

I have the following file (named /tmp/test99), which contains the rows:
"0","15","wall15"
123132,09808098,"0","15"
I am trying to filter the rows that contain "0" in the 3rd field and "15" in the 4th field (like the second row).
I tried running:
cat /tmp/test99 | awk '/"0","15"/{print>"/tmp/0_15_file.out"} '
but instead of getting only the second row, I get also the first row starting with "0","15".
Could you please help with the pattern ?
Thanks:)
You may check if Fields 3 and 4 are equal to some hardcoded value using
awk -F, '$3=="\"0\"" && $4=="\"15\""'
Set the field separator to a comma and then, if Field 3 is "0" and Field 4 is "15" print the line, else discard.
See the online demo:
s='"0","15","wall15"
123132,09808098,"0","15"'
awk -F, '$3=="\"0\"" && $4=="\"15\""' <<< "$s"
# => 123132,09808098,"0","15"
Could you please try the following. (A comment on your attempt: you do not need cat with awk; awk can read Input_file by itself.)
awk -F, '$3~/^"0"$/ && $4~/^"15"$/' Input_file

Duplicate Lines 2 times and transpose from row to column

I would like to duplicate each line and print the values of columns 5 and 6 separately (transposing them from one row into two) for each line.
I mean the value of column 5 on the first line and the value of column 6 on the second line.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to duplicate the lines, but I can't figure out how to handle the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of a line for each of the last N fields where N is 2 in this case:
$ awk -v n=2 '
BEGIN { FS=OFS="," }
{
base = $0
sub("("FS"[^"FS"]+){"n"}$","",base)
for (i=NF-n+1; i<=NF; i++) {
print base, $i
}
}
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case, where the last field has to be removed and placed on a line of its own, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
x = $6 # remember sixth field
NF = 5 # Set field number to 5, so the last one won't be printed
print # print those first five fields
$5 = x # replace value of fifth field with remembered value of sixth
print # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question.
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
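For reference, here is a short sketch of the explicit-rebuild version on one row from the question, followed by the bare $1 = $1 idiom used to re-join fields with a different OFS. The variable names split_rows and rejoined are only for illustration, and the result was reasoned through for GNU awk and mawk rather than every awk.

```shell
# One row from the question; split its last two fields onto separate lines.
split_rows=$(printf '%s\n' '08,1218864123180000,3201338573,VV,22,27' |
  awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }')
echo "$split_rows"

# $1 = $1 on its own forces $0 to be rebuilt, here with ";" as the new separator.
rejoined=$(printf '%s\n' 'a,b,c' | awk -F , -v OFS=';' '{ $1 = $1; print }')
echo "$rejoined"
```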
Could you please try the following (assuming your Input_file always looks like the sample, and that you want to print the first four fields followed by each of the remaining fields, one per line).
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
Replace the last comma with a newline followed by a copy of the preceding fields, minus the penultimate one.

(g)awk next file on partially blank line

The Problem
I just need to combine a whole bunch of files and strip out the header (line 1) from all but the 1st file.
The Data
Here are the last three lines (with line 1: header) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!
this should do...
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
print first header, skip other headers and last lines.
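The delayed-print trick can be hard to follow at first, so here is a sketch on two small made-up files. The column names and values are invented; each file has a header and a trailing totals row.

```shell
# Two hypothetical CSV files, each with a header and a trailing totals row.
dir=$(mktemp -d)
printf '%s\n' '"DATE","AMOUNT"' '"20170101","5.49"' '"20170102","4.27"' '"","9.76"' > "$dir/file1"
printf '%s\n' '"DATE","AMOUNT"' '"20170201","0.61"' '"","0.61"' > "$dir/file2"

# Print the very first line, skip later headers, and print each data line one step late,
# so the last line of every file is overwritten in p before it is ever printed.
merged=$(awk 'NR == 1; FNR == 1 {next} FNR > 2 {print p} {p = $0}' "$dir/file1" "$dir/file2")
echo "$merged"

rm -rf "$dir"
```

This drops the last line of each file regardless of its content, so it relies on every file ending with a totals row.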
Another awk approach:-
awk -F, '
NR == 1 {
header = $0
print
next
}
FNR > 1 && $1 != "\"\""
' *.csv
Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma -F","
If this is the first record awk encounters, it sets variable header to the entire contents of the line and then prints the header NR==1{header=$0; print $0}
If the contents of the current line are not a header and the first field isn't an empty quoted string "" (which would indicate a "total" line), then print the line $0!=header && $1!="\"\""{print $0}
As mentioned in my comment below, if the first field of your records always begin with an 8 digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
Essentially that says if this is the first record to process then print it (it's a header) OR || if the first field is an 8 digit number surrounded by double quotes then print it.