Finding max value using AWK - awk

I have a file with two columns of data and I want to find the maximum value in the second column and print it.
file = load_measure
11:20,18.03
11:25,17.85
11:30,18.24
11:35,19.19
11:40,18.45
11:45,17.53
11:50,17.56
11:55,17.60
12:00,18.51
12:05,18.50
I tried the code below, but it returns 0:
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print max}' load_measure
0
I also tried printing $max instead of max, but that does not give the real maximum either:
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print $max}' load_measure
12:05,18.50
Can anyone explain what I'm doing wrong?
thank you!

When your fields are separated by something other than white space, you need to tell awk what that something is by populating FS. You also need to set max to the first value read so the script works for all-negative input, and you need to print max+0 in the END section to ensure numeric output even if the input file is empty:
awk -F, 'NR==1{max=$2} $2>max{max=$2} END{print max+0}' file
When max is 2, print max prints the value of max, i.e. 2, while print $max prints the value of the field indexed by the value of max, i.e. $2, which in an END section will either be null or the value of $2 on the last line read (undefined behavior per POSIX, so awk-dependent). In your run max was still 0, so $max was $0, i.e. the whole last line read.
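To make the three points above concrete (FS, all-negative input, empty input), here is a minimal sketch. The function name max_col2 is only an illustration and the rows are a few of the sample lines from the question plus some made-up negative values.

```shell
# Hypothetical helper: print the largest value in column 2 of
# comma-separated input read from stdin.
max_col2() {
    awk -F, 'NR==1 {max=$2} $2>max {max=$2} END {print max+0}'
}

# Rows taken from the question.
printf '%s\n' '11:30,18.24' '11:35,19.19' '11:40,18.45' | max_col2   # 19.19

# All-negative input still works because max starts at the first value, not 0.
printf '%s\n' 'a,-3' 'b,-1.5' 'c,-7' | max_col2                      # -1.5

# Empty input: max was never assigned, and max+0 turns it into 0.
printf '' | max_col2                                                 # 0
```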

You should set FS, the input field separator. It describes how each record is split into fields; it may even be an extended regular expression.
On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>). You can also set it in the BEGIN block.
I normally use the latter method, but that's just a personal preference:
BEGIN {max=0;FS=","} ....
Your problem can also be solved like this:
awk -F, -v m=0 '$2>m {m=$2} END {print m}' file
thus sparing an if statement.
The POSIX-mandated default value of FS is a single space (0x20). That default is special, though: leading and trailing blanks are ignored, and any run of blanks (spaces or tabs) counts as one field separator.
Here is the official documentation for GNU Awk.
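A quick way to see the effect of FS is to print NF for the same line with and without -F. This is just a sketch; the helper name nf is made up for the illustration.

```shell
# Hypothetical helper: print the number of fields awk sees on each line.
# Any arguments (e.g. -F,) are passed straight through to awk.
nf() {
    awk "$@" '{print NF}'
}

# Default FS: the comma-separated line is one single field, so $2 is empty.
printf '11:35,19.19\n' | nf          # 1

# With FS set to a comma the line splits into two fields.
printf '11:35,19.19\n' | nf -F,      # 2

# Default FS: leading blanks are skipped and runs of spaces/tabs count once.
printf '  a   b\tc\n' | nf           # 3

# A single-character FS such as a comma does not merge repeats:
# the empty field between the two commas is counted.
printf 'a,,b\n' | nf -F,             # 3
```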

Related

Get values from the next row and merge- awk

I have a pipe delimited file like this
OLD|123432
NEW|232322
OLD|1234452
NEW|232324
OLD|656966
NEW|232325
I am trying to create a new file where I am trying to merge rows based on the value in the first column (OLD/NEW). First column in the output file will have the new number and the second column will have the old number.
Output
232322|123432
232324|1234452
232325|656966
I looked at the answer to "How to merge every two lines into one from the command line?". I know it is not the exact solution, but I used it as a starting point
and tried to adapt it to this problem. My attempt throws a syntax error:
awk -F "|" 'NR%2{OFS = "|" printf "%s ",$0;next;}1'
You may use this awk:
awk 'BEGIN {FS=OFS="|"} $1 == "NEW" {print $2, old} $1 == "OLD" {old = $2}' file
232322|123432
232324|1234452
232325|656966
$0 holds the whole line. With a pipe as the field separator, the number is in the second column, $2.
If you want to keep the NR%2 approach, another option is to store the value of the second column in a variable, for example v:
awk 'BEGIN{FS=OFS="|"} NR%2{v=$2;next;}{print $2,v}' file
Output
232322|123432
232324|1234452
232325|656966
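As far as I can tell, the syntax error in the original attempt comes from the missing ; between the OFS assignment and printf. The sketch below adds it to show that the attempt then runs but still joins whole lines, and puts the NR%2 answer next to it for comparison. The function names are made up for the illustration.

```shell
# The question's attempt with the missing ';' added: it now parses,
# but it glues the two complete lines together with a space.
fixed_attempt() {
    awk -F'|' 'NR%2 {OFS = "|"; printf "%s ", $0; next} 1'
}

# The NR%2 answer: remember $2 of the odd (OLD) line, print it after
# $2 of the even (NEW) line.
merge_new_old() {
    awk 'BEGIN {FS = OFS = "|"} NR%2 {v = $2; next} {print $2, v}'
}

# First four lines of the sample input.
sample() {
    printf '%s\n' 'OLD|123432' 'NEW|232322' 'OLD|1234452' 'NEW|232324'
}

sample | fixed_attempt   # OLD|123432 NEW|232322
                         # OLD|1234452 NEW|232324
sample | merge_new_old   # 232322|123432
                         # 232324|1234452
```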

awk: Adding a new column based on concatenated value of two columns

I am trying to add a new column to a text file based on the concatenated values of two columns. Value is being inserted in the middle instead of the end of the string.
I am using awk. Here are two sample lines
$ head -1 file.txt
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008-12-17 00:00:00.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
I tried the following.
$ head -1 file.txt | awk -F'|' '{$(NF+1)=$1"-"$6;}1' OFS='|'
I am expecting a new column at the end of the string. But you can see that the concatenated field is being inserted in the middle of the string instead of the end of the string.
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008|8502CC169154-9.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
Your original code works for me using GNU awk but I suspect that not all awks support setting $(NF+1). To avoid that, try:
head -1 file.txt | awk -F'|' '{$0=$0 FS $1"-"$6;}1' OFS='|'
Awk is a surprisingly powerful language and it has all the capabilities that head has, making the pipeline unnecessary. So, for greater efficiency, try the simple command:
awk -F'|' '{print $0 FS $1"-"$6; exit}' file.txt
How it works:
-F'|'
This sets the field separator to a vertical bar.
print $0 FS $1"-"$6
This prints the output line that you want, which consists of the original line, $0, followed by a field separator, FS, followed by the combination of the first field, a dash, and the sixth field.
exit
After the first line is printed, this tells awk to exit. This eliminates the need for head -1.
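Here is the same idea on a shortened, made-up record (seven fields instead of the long line from the question), with the $(NF+1) variant alongside for comparison. The function names and sample values are assumptions for the sketch. Note that printing FS literally only makes sense because FS is a single character here, not a regular expression.

```shell
# Append "field1-field6" to the first line only, then stop reading.
append_key() {
    awk -F'|' '{print $0 FS $1 "-" $6; exit}'
}

# Same result by assigning a new last field; OFS must be set so the
# rebuilt record is joined with '|' instead of a space.
append_key_nf() {
    awk -F'|' -v OFS='|' '{$(NF+1) = $1 "-" $6; print; exit}'
}

printf '%s\n' 'ID42|02|GA|TN|89840|9|RA' 'ID43|03|FL|TN|11111|4|RB' | append_key
# ID42|02|GA|TN|89840|9|RA|ID42-9
printf '%s\n' 'ID42|02|GA|TN|89840|9|RA' | append_key_nf
# ID42|02|GA|TN|89840|9|RA|ID42-9
```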

How to filter the OTU by counts with AWK?

I am trying to filter all the singleton from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this removes the header of every OTU.
Could anyone help me out?
Could you please try the following? It keeps every OTU whose size is at least 2, i.e. everything that is not a singleton.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole header line in prev. When prev_sz is >= 2 and the current line is not a header, print the previous line and the current line. RS is used to print the newline between them.
While all the above methods work, they all assume that the input always looks the same, i.e. that the sequence name in your fasta file has the form:
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; match($0,ere);value=0+substr($0,RSTART+length(ere)); f=(value>limit)}
f' <file>
In the above, ere is a regular expression we try to match. We use it to find the location of the value attached to label xxx. The substring that follows will have non-numeric characters after its value, but by adding 0 to it, it is converted to a number, dropping everything from the first non-numeric character on (i.e. 3;label4=value4; is converted to 3). We check if the value is bigger than our limit, and print the sequence based on that result.
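A small worked example of the numeric variant, assuming made-up headers with extra labels. The function name filter_by_label and the sample records are only for illustration. Unlike the command above, this sketch only computes the value when match() actually finds the label, so headers without it are dropped.

```shell
# Hypothetical helper: keep records whose header carries ;LABEL=N; with N > LIMIT.
# usage: filter_by_label LABEL LIMIT < file.fasta
filter_by_label() {
    awk -v label="$1" -v limit="$2" '
        BEGIN { ere = ";" label "=" }
        />/ {
            f = 0
            if (match($0, ere)) {
                # Text after ";label=" starts with the number; 0+ strips the rest.
                value = 0 + substr($0, RSTART + length(ere))
                f = (value > limit)
            }
        }
        f'
}

printf '%s\n' \
    '>OTU1;sample=A;size=3;' 'ATTC' \
    '>OTU2;sample=B;size=1;' 'ATCC' \
    '>OTU3;size=5;note;'     'GAAC' | filter_by_label size 1
# >OTU1;sample=A;size=3;
# ATTC
# >OTU3;size=5;note;
# GAAC
```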

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script, which takes account of the input field separator, that prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line, rather I see the fields separated by the default whitespace separator, i.e:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
However, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
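If GNU awk is not available, one possible workaround is to walk the record with match() and copy each separator as it is found. This is a different technique from the split() call above, not an equivalent of it, and it is only a sketch that assumes the separators are : and ; and that field 2 is the one to change. The function name is made up.

```shell
# Hypothetical helper: append "!" to field 2 while keeping every
# original ':' or ';' separator in place (POSIX awk only).
bang_field2() {
    awk '{
        rest = $0; out = ""; n = 0
        while (match(rest, /[:;]/)) {
            n++
            field = substr(rest, 1, RSTART - 1)    # text before the separator
            sep   = substr(rest, RSTART, RLENGTH)  # the separator itself
            if (n == 2) field = field "!"
            out  = out field sep
            rest = substr(rest, RSTART + RLENGTH)  # continue after the separator
        }
        n++                                        # the last field has no separator
        if (n == 2) rest = rest "!"
        print out rest
    }'
}

printf 'a:e;i:o;u\n' | bang_field2   # a:e!;i:o;u
```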

Awk with a variable

I have an awk command to print out the total number of times "200" occurred in column 26.
awk '$26 ~ /200/{n++}; END {print n+0}' testfile
How do I modify this statement so I can pass 200 as a variable? e.g. if I have a variable $code with a value of 200
Thanks in advance
awk '$26 ~ code {n++} END {print n+0}' code=200 testfile
If a filename on the command line has the form var=val it is treated as a variable
assignment. The variable var will be assigned the value val.
§ Awk Program Execution
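One practical difference between this var=val form and -v is timing: an assignment given as an operand is only performed when awk reaches it in the argument list, so the variable is not yet set in BEGIN. A minimal sketch with made-up function names; the explicit - operand stands for standard input.

```shell
# -v assigns before the program starts, so BEGIN already sees the value.
show_v() {
    awk -v code="$1" 'BEGIN {print "BEGIN sees [" code "]"}'
}

# An operand assignment is processed after BEGIN, just before the
# next input file (here "-", i.e. stdin) is read.
show_operand() {
    awk 'BEGIN {print "BEGIN sees [" code "]"} {print "main sees [" code "]"}' code="$1" -
}

show_v 200                        # BEGIN sees [200]
printf 'x\n' | show_operand 200   # BEGIN sees []
                                  # main sees [200]
```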
awk -v var="$shellVar" '$26~var{n++} END{print n+0}' file
The line above shows how to pass a shell variable into awk. Some notes on your awk one-liner:
Keep print n+0. If no line matches, n is never assigned, and print n would output an empty line instead of 0.
The ; before END is not needed and can be removed.
I copied the part that checks for 200 from your code, but a regex match is risky here. If $26 holds only a number, consider 1*$26 == 200 or $26 == "200" instead. The regex can give a wrong result: think of a $26 whose value is 20200.
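The difference between the regex match and an exact comparison is easy to see on a few made-up values. For brevity this sketch tests column 1 instead of column 26, and the function names are only for illustration.

```shell
# Count lines whose first field merely contains the code (regex match).
count_regex() {
    awk -v code="$1" '$1 ~ code {n++} END {print n+0}'
}

# Count lines whose first field equals the code.
count_exact() {
    awk -v code="$1" '$1 == code {n++} END {print n+0}'
}

printf '%s\n' 200 20200 404 200 | count_regex 200   # 3 (20200 matches too)
printf '%s\n' 200 20200 404 200 | count_exact 200   # 2
printf '%s\n' 404 500 | count_exact 200             # 0, thanks to n+0
```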