Variable evaluation before assignment in awk
In the following awk statement:
awk '$2 > maxrate {maxrate = $2; maxemp = $1}
END {print "highest hourly rate:", maxrate, "for", maxemp}' pay.data
run on the following data:
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
How does $2 > maxrate work, given that maxrate is evaluated in the comparison before $2 is ever assigned to it?
From the GNU awk manual
By default, variables are initialized to the empty string, which is
zero if converted to a number. There is no need to explicitly
initialize a variable in awk, which is what you would do in C and in
most other traditional languages.
This implicit initialization, typical of scripting languages, is very convenient but also leaves room for mistakes and confusion.
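A quick way to see this dual behavior is to test an unset variable in both string and numeric context; a minimal sketch (the variable name unset is illustrative):

awk 'BEGIN {
  # "unset" has never been assigned: it compares equal to "" as a string
  # and equal to 0 as a number
  msg = (unset == "") ? "equals empty string" : "differs"
  print msg                # -> equals empty string
  msg = (unset == 0) ? "equals zero" : "differs"
  print msg                # -> equals zero
  print unset + 1          # -> 1
}'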
For example, in this case, you can calculate the maximum, with no need to initialise max:
awk '$2 > max{max = $2} END{print "max:", max}' file
max: 5.50
But if you do the same for the min you get the empty string as the result, because min is initially zero as a number and empty as a string, so $2 < min is never true for these positive values and min is never assigned.
awk '$2 < min{min = $2} END{print "min:", min}' file
min:
The max calculation would likewise fail if all the values were negative. So it is safer to assign the first value explicitly:
awk 'NR==1{min=$2; next} $2<min{min = $2} END{print "min:", min}' file
min: 3.75
This approach works for both min and max, for numbers of any range. In general, when scripting, we have to think through every case in which an undefined or uninitialized variable will be initialized, and every case in which it will be tested before it gets a value.
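Putting the two together, a sketch that seeds both variables from the first record and therefore works for values of any sign and range (same file as above):

awk 'NR==1 { min = max = $2; next }
     $2 < min { min = $2 }
     $2 > max { max = $2 }
     END { print "min:", min, "max:", max }' file
min: 3.75 max: 5.50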
By default, if you don't assign a value to a variable in awk, its value is null (you don't need to declare a variable before assigning to it), so on the first line your condition compares $2 against null; the comparison is true, so control enters the block, where maxrate is assigned the 2nd field.
After that very first execution, maxrate holds a 2nd-field value, so from the next line onward the condition compares the stored value against the current line's 2nd field, and it keeps doing so until all lines of Input_file are read. At last, the END section of the code prints the result.
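To make that first comparison visible, here is a small instrumented sketch of the same program; it prints each comparison before performing it (on the pay.data sample, the first line shows maxrate still empty):

awk '{
  # on NR==1, maxrate is uninitialized, so the comparison is numeric: $2 > 0
  printf "NR=%d: is %s > \"%s\"? %s\n", NR, $2, maxrate, ($2 > maxrate ? "yes" : "no")
  if ($2 > maxrate) { maxrate = $2; maxemp = $1 }
}
END { print "highest hourly rate:", maxrate, "for", maxemp }' pay.data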
Related
Value of an assignment in awk, ternary operator parsing in awk
I'm new to awk and playing around with it. When trying to use the ternary operator, at some point I wanted to execute two operations upon a true condition, and as I couldn't find the syntax to do so, I tried to smuggle one of the two operations inside the condition to take advantage of lazy evaluation. I have an input file as follows:

file.csv
A,B,C,D
1,,,
2,,,
3,,,
4,,,
5,,,
6,,,
7,,,
8,,,

And I'd like, for the sake of the exercise, to assign B and C to 0 if A is less than 5; C to 1 if A is 5 or more. I guess the ternary operator is a terrible way to do this, but this is not my point. The question is: why does the following line output that? How does awk parse this expression?

awk '(FNR!=1){$1<5 && $3=0 ? $2=0 : $2=1}{print $0}' FS=, OFS=, file.csv

Output:
1,1,1,
2,1,1,
3,1,1,
4,1,1,
5,,,
6,,,
7,,,
8,,,

I was expecting the $3=0 expression to be executed and evaluated to true, and to be skipped when the first part of the condition ($1<5) is false. Expected result:

1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,

Extra question: can I actually use the ternary operator and have several instructions in it executed depending on the condition's value? Is it only bad practice, or actually impossible?
1st solution: You could go with code like this, written and tested with your shown samples and attempts. It uses ternary operators to check whether the value of the 1st field is less than 5 and, based on that, sets the values of the 2nd and 3rd fields:

awk '
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
{
  $2=($1<5?0:1)
  $3=($1<5?0:$3)
}
1
' Input_file

2nd solution (generic approach): If you have to pass N fields to be checked to the program, it is better to create a function and do the checks and assignments there, again using ternary operators. Where:

threshold is an awk variable assigned the value 5, against which you want to compare the 1st field.
fieldCompare is again an awk variable, containing 1 in this case, since we want to compare the 1st field's value to 5 here.
checkValue is the function to which the field numbers to set (eg, 2 and 3 in this case) are passed as comma-separated values, so they can be checked in a single shot within the function.

awk -v threshold="5" -v fieldCompare="1" '
function checkValue(fields){
  num=split(fields,arr,",")
  for(i=1;i<=num;i++){
    fieldNum = arr[i]
    $fieldNum = ($fieldCompare<threshold?0:$fieldNum)
  }
}
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
checkValue("2,3")
1
' Input_file
If I look at the expected outcome, the 2nd field should be one: set fields 2 and 3 to zero if field 1 is smaller than five, else set field 2 to one. The 1 at the end (}1) evaluates to true and prints the whole line.

awk 'BEGIN{FS=OFS=","}(FNR!=1){($1 < 5) ? $2=$3=0 : $2=1}1' file.csv

Output

A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,
If you want to write cryptic code, this is one way to do it. You don't even need the ternary operator.

$ awk 'BEGIN {FS=OFS=","} NR>1 {$2=$1>=5 || $3=0 }1' file
A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,
"I was expecting the $3=0 expression to be executed and evaluated to true"

The result of an assignment is the value assigned. Zero is false.

"... and being skipped when the first part of the condition ($1<5) is false."

Since && has a higher precedence than ?:, and ?: has a higher precedence than =, awk is doing this:

$1<5 && ($3 = (0 ? $2=0 : $2=1))

When $1 < 5, if 0 is true (it is not) then assign $3 the result of $2 = 0, else assign $3 the result of $2 = 1. When $1 >= 5, do nothing.
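Two facts drive this behavior and can be checked in isolation: an assignment is an expression whose value is the value assigned, and a zero value is false in a condition. A minimal sketch:

$ awk 'BEGIN { print (x = 7) }'     # an assignment evaluates to the assigned value
7
$ awk 'BEGIN { print ((y = 0) ? "truthy" : "falsy") }'
falsy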
Tested and confirmed working on mawk-1, mawk-2, gawk, and nawk; the only difference is the order of precedence in the 3rd section.

{g,n}awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || ($_=_^_<+$!_) || $(_--)=!++_ ""'

or

mawk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || ($_=_^_<+$!_) || $++_ = !--_ ""'

A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,

Concatenating with an empty string ("") at the tail ensures the line prints even when the assigned value is zero.
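The effect of that trailing "" can be checked in isolation; a minimal sketch (any one-line input will do):

$ echo test | awk '($0 = 0)'       # pattern value is the number 0: false, nothing is printed
$ echo test | awk '($0 = 0) ""'    # concatenation yields the string "0": non-empty, so the line prints
0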
How do I print the line where the max value is found using awk?
I have a file called probabilities.txt; it's a two-column file with the first column listing distances and the second column probabilities. The sample data is as follows:

0.2 0.05
0.4 0.10
0.6 0.63
0.8 0.11
1.0 0.03
... ...
10.0 0.01

I would like to print out the line that has the maximum value in column 2. I've tried the following:

awk 'BEGIN{a= 0} {if ($2 > a) a = $2} END{print $1, a}' probabilities.txt

This was the desired output:

0.6 0.63

But this is the output I get:

10.0 0.63

It seems like the code I wrote is just getting the max value in each column and printing that, rather than printing the line that has the max value in column 2. Printing $0 also just prints the last line of the file. I assume one could fix this by treating the lines as an array rather than a scalar, but I'm not really sure how to do that since I'm a beginner. Would appreciate any help.
I had contemplated just leaving the answer as a comment, but given the trouble you had with the command, it's worth writing up.

To begin, you don't need BEGIN. In awk, variables are automatically initialized (numerically, to 0) when first used, so you can simply compare against a max variable without setting it first.

Note: if your data involved negative numbers (neither distances nor probabilities can be negative), just add a new first rule that sets max from the first record, e.g. FNR==1 {max=$2; next} (a complete version is sketched after this answer).

Next, don't save individual field values when you want to capture the entire line (record) with the largest probability; save the entire record associated with the max value. Then in your END rule, all you need to do is print that record.

Putting it all together you would have:

awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' file

or, if you prefer:

awk '$2 > max {max=$2; maxline=$0} END {print maxline}' file

Example Use/Output

With your data in the file distprobs.txt you would get:

$ awk '{if($2 > max) {max=$2; maxline=$0}} END {print maxline}' distprobs.txt
0.6 0.63

and, second version, same result:

$ awk '$2 > max {max=$2; maxline=$0} END {print maxline}' distprobs.txt
0.6 0.63
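The negative-number-safe variant mentioned in the note above would look like this (a sketch; it simply seeds max and maxline from the first record before comparing):

$ awk 'FNR==1 {max=$2; maxline=$0; next} $2 > max {max=$2; maxline=$0} END {print maxline}' distprobs.txt
0.6 0.63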
AWK script not showing data
I'm trying to create a variable that sums columns 26 to 30 and 32. So far I have this code, which prints the header and the output format like I want, but no data is being shown.

#! /usr/bin/awk -f

BEGIN { FS="," }

NR>1 {
  TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
}

{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}

NR==1 { print "EndYear,Rk,G,Date,Years,Days,Age,Tm,HOme,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats" } #header

Input data:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3

Output expected:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,35
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4,34
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,54
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7,38
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2,29
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,36
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3,51

This script will be called like gawk -f script.awk <filename>. Currently, when I call it, the variable seems to be calculated, but the rest of the fields come out empty.
awk is well suited to summing columns:

awk 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file > tmp
mv tmp input-file

That doesn't add a field in the header line, so you might want something like:

awk '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=,
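As a usage note: if GNU awk 4.1 or later is available, the temporary-file shuffle can be replaced with gawk's inplace extension (this assumes gawk specifically; POSIX awk has no in-place editing):

gawk -i inplace 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file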
An explanation of the issues with the current printf output is covered in the second half of this answer (below).

It appears OP's objective is to reformat three of the current fields while also adding a new field at the end of each line. (NOTE: certain aspects of OP's code are not reflected in the expected output, so I'm not 100% sure what OP is looking to generate; regardless, OP should be able to tweak the provided code to generate the desired result.)

Using sprintf() to reformat the three fields, we can rewrite OP's current code as:

awk '
BEGIN { FS=OFS="," }
NR==1 { print $0, "TotalPositiveStats"; next }
{
  TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
  $17 = sprintf("%.3f",$17)                   # FG_PCT
  if ($20 != "") $20 = sprintf("%.3f",$20)    # 3P_PCT
  $23 = sprintf("%.3f",$23)                   # FT_PCT
  print $0, TotalPositiveStats
}
' raw.dat

NOTE: while OP's printf shows a format of "%.2f %" for the 3 fields of interest ($17, $20, $23), the expected output shows that the fields are not actually being reformatted (eg, $17 remains %.3f, $20 is an empty string, $23 remains %.2f); I've opted to leave $20 blank when empty and otherwise reformat all 3 fields as %.3f; OP can modify the sprintf() calls as needed.

This generates:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,40
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1.000,3,2,5,5,2,1,3,4,21,19.4,37
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,57
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1.000,2,2,4,5,3,1,6,5,25,14.7,44
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.750,3,2,5,5,1,1,2,4,17,13.2,31
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,41
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.750,4,4,8,5,3,2,5,2,33,29.3,56

NOTE: in OP's expected output it appears the last/new field (TotalPositiveStats) does not contain the value from $30, hence the mismatch between the expected results and this answer; again, OP can modify the assignment statement for TotalPositiveStats to include/exclude fields as needed.

Regarding the issues with the current printf ...

{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}

... it is referencing (awk) variables that have not been defined (eg, EndYear, Rk, G). [NOTE: one exception is the very last variable in the list, TotalPositiveStats, which has in fact been defined earlier in the script.]
The default value for an undefined variable is the empty string ("") or zero (0), depending on how the awk code references the variable, eg:

printf "%s", EndYear

EndYear is treated as a string and the printed result is an empty string; with an output field delimiter of a comma (,) this empty string shows up as 2 commas next to each other (,,).

printf "%.2f %", FG_PCT

FG_PCT is treated as a number (because of the %f format) and the printed result is 0.00 %.

Where it gets a little interesting is when the (undefined) variable name starts with a digit (eg, 3P): awk parses the 3 as the number 3 and P as a separate (empty) variable concatenated to it, so the whole reference evaluates to 3, eg:

printf "%s", 3P

3P is processed as 3 and the printed result is 3.

This should explain the 5 static values (0.00 %, 3, 3, 3.00 % and 0.00 %) printed in all output lines, as well as the 'missing' values between the rest of the commas (eg, ,,,,). Obviously the last value in each line is an actual number, ie, the value of the awk variable TotalPositiveStats.
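All three cases can be reproduced in one line (a sketch; the variable names are the OP's and are all undefined here, and %% is used to get a literal percent sign):

$ awk 'BEGIN { printf "[%s],[%.2f %%],[%s]\n", EndYear, FG_PCT, 3P }'
[],[0.00 %],[3]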
awk NR returns the wrong total number of lines
When awk NR was used to get the total number of lines of a file, the wrong number was returned. Could you help me find out what happened?

File 'test.txt' contents:

> 2012 09 10 30.0 8 14 fdafadf
> 2013 08 11 05.0 9 1.5 fdafa
> 2011 01 12 02.0 7 1.2 daff

I expected to get the average of the last column of the records beginning with '>'. Code:

awk 'BEGIN{SUM=0}/^> /{SUM=SUM+$6}END{print SUM/NR}' test.txt

With this code, the wrong mean of the last column was obtained instead of the right number, 3. How can I get the right result with awk? Thanks.
Could you please try the following. It takes the sum of the last column of every '>' line and keeps doing so until the Input_file has been read completely. It also counts the number of occurrences of '>' lines, because the average is the sum divided by the count (here, the count of matching lines); in the END block of awk we divide the two to get the average.

awk 'BEGIN{sum=0;count=0}/^>/{sum+=$NF;count++} END{print "avg="sum/count}' Input_file

If you want the average of the 6th column, then use $6 instead of $NF in the above code.

Explanation: adding the following for explanation purposes only.

awk '                     ##Starting awk command/script here.
/^>/{                     ##Checking condition: if a line starts with > then do the following.
  sum+=$NF                ##Adding $NF, the last field of the current line, to a variable named sum.
  count++                 ##Incrementing a variable named count by 1 each time the cursor comes here.
}
END{                      ##END block of awk code here.
  print "avg="sum/count   ##Printing the string avg= followed by sum divided by count.
}
' Input_file              ##Mentioning the Input_file name here.
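One defensive refinement worth considering (my addition, not part of the original answer): guard the division so that a file with no matching '>' lines doesn't trigger a division-by-zero error:

awk '/^>/{sum+=$NF; count++} END{if (count) print "avg=" sum/count; else print "no matching lines"}' Input_file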
setting default numeric format in awk
I wanted to do a simple parse of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation. The file looks like this:

someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391

When I use print instead of printf:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head

I get this result:

dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339

Notice the scientific notation in the last column. I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy; I have to write something like this:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt

This becomes clumsy when I have to parse fileout with another similarly structured file. Is there any way to specify a default numerical output format, so that any string is printed as a string but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The first number, before the decimal point, is the field width, not the "number of digits before the decimal point" (see printf(3)). So you should use %10.6f instead. It can be tested easily in bash:

$ printf "%3.6f\n%3.6f\n%3.6f\n" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f\n" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000

You can see that the latter aligns properly on the decimal point.

As sidharth c nadhan mentioned, you can use the OFMT awk internal variable (see awk(1)). An example:

$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -v OFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000

As I see in your example, the number with the most digits can be 123456.1234567, so the format %15.7f would cover them all and show a nice-looking table. But unfortunately it will not work if a number has no decimal point in it, or even if it does but the value is integral:

$ awk -v OFMT=%15.7f 'BEGIN{print 123.456; print 123; print 123.0; print 0.0+123.0}'
    123.4560000
123
123
123

I even tried gawk's strtonum() function, but integers are treated as non-OFMT strings. See:

awk -v OFMT=%15.7f -v CONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'

It has the same output as before. So I think you have to use printf anyway. The script can be a little shorter and a bit more configurable:

awk -v f='\t%15.7f' 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt

The script will not work properly if there are duplicate IDs in the first file. If that does not happen, then the two conditions can be swapped and the ;next can be left off.
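Related to OFMT: CONVFMT is the analogous variable for number-to-string conversion during concatenation (rather than print), and it has the same exemption for integral values; a quick sketch:

$ awk -v CONVFMT=%15.7f 'BEGIN { s = 123.456 ""; print "[" s "]"; t = 123.0 ""; print "[" t "]" }'
[    123.4560000]
[123]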
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt