Comparing, subtracting and printing using awk
I have a file which looks like:
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
I want to subtract consecutive numbers in the first column. If the absolute value of the difference is less than or equal to some threshold, say 0.01, the script should print the corresponding range from the 3rd column.
For example, the difference between the 2nd and 1st entries of the 1st column is
4.98065923-4.97911047=0.00154876.
Thus it should print
"631-632"
Then the difference between the 3rd and 2nd entries of the 1st column is
5.70441060-4.98065923=0.723751
Thus it should not print anything. Again, the difference between the 4th and 3rd entries of the 1st column is
5.70448527-5.70441060=7.467e-05
Thus it should print
"974-975"...
in this way. The final output should be like:
631-632
974-975
2020-2021
3181-3180
Note: If the difference is less than 0.01 for 3 or more consecutive numbers of the 1st column, then it should also print the total range, like "631-635" say.
The best I could do until now is to use an awk command to compute the corresponding differences:

awk 'NR > 1 { print $0 - prev } { prev = $0 }' < filename
Any help?
Regards
Sitangshu
I would use GNU AWK for this task in the following way. Let file.txt content be
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
then
awk 'NR>1&&($1-prev[1])<=0.01{printf "%d-%d\n",prev[3],$3}{split($0,prev)}' file.txt
gives output
631-632
974-975
2020-2021
3181-3180
Explanation: For each line I save under prev not the whole line (a string) but an array, using the split function, so I can access for example the 3rd field of the previous line as prev[3]. For each line after the 1st one, where the difference between the 1st field of the current line and the 1st field of the previous line is less than or equal to 0.01, I printf the 3rd field of the previous line and the 3rd field of the current line, using %d to get the values formatted as integers.
(tested in gawk 4.2.1)
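The one-liner above prints one pair per adjacent match, so the Note's request (merge 3 or more consecutive close values into one range such as "631-635") is not covered. A possible extension, a sketch of mine rather than part of the answer above, tracks runs of close values and prints each run once:

```shell
# Sketch only (not part of the answer above): merge adjacent matches into
# runs so that 3+ consecutive close values print one combined range.
cat > file.txt <<'EOF'
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
EOF
awk '
NR > 1 {
    d = $1 - prev1; if (d < 0) d = -d     # absolute difference
    if (d <= 0.01) {
        if (!inrun) start = prev3         # a run opens at the previous row
        end = $3; inrun = 1
    } else if (inrun) {                   # run just ended: report it
        printf "%d-%d\n", start, end; inrun = 0
    }
}
{ prev1 = $1; prev3 = $3 }
END { if (inrun) printf "%d-%d\n", start, end }   # flush a trailing run
' file.txt
```

On the sample data no run is longer than two rows, so this prints the same four ranges (631-632, 974-975, 2020-2021, 3181-3180), but a longer run would collapse into a single start-end pair.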
Related
AWK script - Not showing data
I'm trying to create a variable to sum columns 26 to 30 and 32. So far I have this code, which prints the header and the output format like I want, but no data is being shown.

#! /usr/bin/awk -f
BEGIN { FS="," }
NR>1 { TotalPositiveStats= ($26+$27+$28+$29+$30+$32) }
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}
NR==1 { print "EndYear,Rk,G,Date,Years,Days,Age,Tm,HOme,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats" } #header

Input data:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3

Output expected:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,35
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4,34
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,54
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7,38
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2,29
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,36
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3,51

This script will be called like gawk -f script.awk <filename>. Currently, when calling it, this is the output (it seems to be calculating the variable, but the rest of the fields are empty).
awk is well suited to summing columns:

awk 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file > tmp
mv tmp input-file

That doesn't add a field in the header line, so you might want something like:

awk '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=,
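A scaled-down, runnable illustration of the $(NF+1) technique (hypothetical three-column data of mine, not the OP's 34-column file):

```shell
# Hypothetical 3-column data, same technique as above: append a computed
# column via $(NF+1), labelling it in the header row.
printf 'name,a,b\nx,1,2\ny,3,4\n' |
awk '{ $(NF+1) = NR > 1 ? ($2 + $3) : "total" } 1' FS=, OFS=,
# prints:
# name,a,b,total
# x,1,2,3
# y,3,4,7
```

Assigning to $(NF+1) forces awk to rebuild the record with OFS, which is why the new field appears comma-separated.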
An explanation of the issues with the current printf output is covered in the 2nd half of this answer (below).

It appears OP's objective is to reformat three of the current fields while also adding a new field on the end of each line. (NOTE: certain aspects of OP's code are not reflected in the expected output, so I'm not 100% sure what OP is looking to generate; regardless, OP should be able to tweak the provided code to generate the desired result.)

Using sprintf() to reformat the three fields, we can rewrite OP's current code as:

awk '
BEGIN { FS=OFS="," }
NR==1 { print $0, "TotalPositiveStats"; next }
      { TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
        $17 = sprintf("%.3f",$17)                  # FG_PCT
        if ($20 != "") $20 = sprintf("%.3f",$20)   # 3P_PCT
        $23 = sprintf("%.3f",$23)                  # FT_PCT
        print $0, TotalPositiveStats
      }
' raw.dat

NOTE: while OP's printf shows a format of "%.2f %" for the 3 fields of interest ($17, $20, $23), the expected output shows that the fields are not actually being reformatted (eg, $17 remains %.3f, $20 is an empty string, $23 remains %.2f); I've opted to leave $20 blank when it is empty and otherwise reformat all 3 fields as %.3f; OP can modify the sprintf() calls as needed.

This generates:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,40
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1.000,3,2,5,5,2,1,3,4,21,19.4,37
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,57
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1.000,2,2,4,5,3,1,6,5,25,14.7,44
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.750,3,2,5,5,1,1,2,4,17,13.2,31
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,41
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.750,4,4,8,5,3,2,5,2,33,29.3,56

NOTE: in OP's expected output it appears the last/new field (TotalPositiveStats) does not contain the value from $30, hence the mismatch between the expected results and this answer; again, OP can modify the assignment statement for TotalPositiveStats to include/exclude fields as needed.

Regarding the issues with the current printf ...

{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}

... it is referencing (awk) variables that have not been defined (eg, EndYear, Rk, G). (NOTE: one exception is the very last variable in the list - TotalPositiveStats - which has in fact been defined earlier in the script.)

The default value for undefined variables is the empty string ("") or zero (0), depending on how the awk code references the variable, eg:

printf "%s", EndYear    => EndYear is treated as a string and the printed result is an empty string; with an output field delimiter of a comma (,) this empty string shows up as 2 commas next to each other (,,)
printf "%.2f %", FG_PCT => FG_PCT is treated as a numeric (because of the %f format) and the printed result is "0.00 %"

Where it gets a little interesting is when the (undefined) variable name starts with a numeric (eg, 3P), in which case the P is ignored and the entire reference is treated as a number, eg:

printf "%s", 3P => 3P is processed as 3 and the printed result is 3

This should explain the 5 static values (0.00 %, 3, 3, 3.00 % and 0.00 %) printed in all output lines, as well as the 'missing' values between the rest of the commas (eg, ,,,,). Obviously the last value in the line is an actual number, ie, the value of the awk variable TotalPositiveStats.
Print all rows where the day of birth can be found in the phone number using awk
In my assignment I have to create an awk script. The script should print all rows where the day of birth can be found in the phone number. Here are some rows from the input file:

firstname,lastname,city,born,phone,email
Salma,Helin,Hällaryd,2002-07-03,555674792,salma.helin#hallaryd.se
Sanna,Wahlgren,Torhamn,2004-08-02,555493393,sanna.wahlgren#torhamn.se
Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Thilda,Brandt,Holmsjö,1994-06-25,555197921,thilda.brandt#holmsjo.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
Alexis,Sjödin,Ardala,1991-03-04,555190611,alexis.sjodin#ardala.se
Stina,Örn,Gladö kvarn,2010-05-25,555622513,stina.orn#glado_kvarn.se

The desired output is:

Anni Örn, 1994-07-08, 555408537
Teo Stenström, 1994-04-29, 555229873
Stina Örn, 2010-05-25, 555622513

Here is how far I got:

BEGIN { FS="," }
NR == 1 { next }
{
    split($4, d, "-")
    day = d[3]
}

I tried to split the date and extract the day of birth. Now I have to somehow find a match in field 5. For example, 08 exists in 555408537. Now I am stuck. I don't know how to accomplish this assignment. I am open to any suggestion. Thanks in advance.
You can try this awk, which will check if the day matches within the phone number.

awk -F"[,-]" '$7 ~ $6 && /./' input file

$7 ~ $6 will check for a match between the two columns; /./ will remove the blank lines.

Output

Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
Stina,Örn,Gladö kvarn,2010-05-25,555622513,stina.orn#glado_kvarn.se
You might use the index function: it gives the start of the match if found, or 0 otherwise, so in your case it is enough to check >0. Let file.txt content be

firstname,lastname,city,born,phone,email
Salma,Helin,Hällaryd,2002-07-03,555674792,salma.helin#hallaryd.se
Sanna,Wahlgren,Torhamn,2004-08-02,555493393,sanna.wahlgren#torhamn.se
Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Thilda,Brandt,Holmsjö,1994-06-25,555197921,thilda.brandt#holmsjo.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
Alexis,Sjödin,Ardala,1991-03-04,555190611,alexis.sjodin#ardala.se
Stina,Örn,Gladö kvarn,2010-05-25,555622513,stina.orn#glado_kvarn.se

then

awk 'BEGIN{FS=","}NR==1{next}{split($4, d, "-");day = d[3];if(index($5,day)>0){print}}' file.txt

output

Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
Stina,Örn,Gladö kvarn,2010-05-25,555622513,stina.orn#glado_kvarn.se

Note that due to how if behaves, you do not have to compare with 0 explicitly, but might do

BEGIN{FS=","}NR==1{next}{split($4, d, "-");day = d[3];if(index($5,day)){print}}

and get the same result. I left reworking the print-ing to show the desired columns as an exercise. As a side note: rather than skipping unwanted lines using next, you might register the action only for interesting lines, in this case

BEGIN{FS=","}NR>1{split($4, d, "-");day = d[3];if(index($5,day)){print}}

(tested in gawk 4.2.1)
Here is another awk alternative:

$ awk -F, 'NR > 1 && $5 ~ substr($4,9,10)' input
Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
Stina,Örn,Gladö kvarn,2010-05-25,555622513,stina.orn#glado_kvarn.se

Explanation: With the field separator set to the comma, print all records apart from the first (the header) where field number 5 matches the last part of the date in field 4.
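The print-ing rework was left above as an exercise; here is one possible completion (my sketch, assuming the two-digit day is always characters 9-10 of the born field, and using a trimmed copy of the input):

```shell
# Trimmed copy of the question's input (not the full file).
cat > people.csv <<'EOF'
firstname,lastname,city,born,phone,email
Salma,Helin,Hällaryd,2002-07-03,555674792,salma.helin#hallaryd.se
Anni,Örn,Resarö,1994-07-08,555408537,anni.orn#resaro.se
Teo,Stenström,Borgholm,1994-04-29,555229873,teo.stenstrom#borgholm.se
EOF
# Match the day against the phone field, then print
# "firstname lastname, born, phone" as in the desired output.
awk -F, 'NR > 1 && index($5, substr($4, 9, 2)) {
    printf "%s %s, %s, %s\n", $1, $2, $4, $5
}' people.csv
# prints:
# Anni Örn, 1994-07-08, 555408537
# Teo Stenström, 1994-04-29, 555229873
```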
Bash one liner for calculating the average of a specific row of numbers in bash
I just began learning bash. Trying to figure out how to convert a two-liner into a one-liner using bash.

The first line of code:
- searches the first column of input.txt for the word KEYWORD,
- captures every number in this KEYWORD row from column 2 until the last column,
- dumps all these numbers into the values.txt file, placing each number on a new line.

The second line of code calculates the average value of all the numbers in the first column of values.txt, then prints out this value.

awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt

How do I create a one-liner from this?
Something like

awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
    for (n = 2; n <= NF; n++) sum += $n
    count += NF - 1
}
END { print sum/count }' input.txt

Just keep track of the sum and total count of numbers in the first (and only) pass through the file, and then average them at the end instead of printing a value for each matching line.
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.

awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt

This code searches the input file for the row containing "KEYWORD", sums up all the fields from the 3rd column to the last column, then prints out the average value of those numbers (i.e. the mean). Note that it starts at the 3rd field rather than the 2nd as in the original two-liner; adjust the loop start and the divisor if your numbers begin in column 2.
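For a quick sanity check of the one-pass approach on toy data of mine (not the OP's input.txt), averaging columns 2 through NF as in the original two-liner:

```shell
# Toy data: one KEYWORD row with values 2, 4, 6 and one row to be ignored.
# Sum columns 2..NF on matching rows, count them, average in END.
printf 'KEYWORD 2 4 6\nOTHER 1 1 1\n' |
awk '$1 == "KEYWORD" { for (n = 2; n <= NF; n++) sum += $n; count += NF - 1 }
     END { print sum / count }'
# prints 4
```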
Print rows whose last field is negative
The last column of my file contains both negative and positive numbers:

a, b, -1
c, d, 2
e, f, -3

I need to extract the lines whose last field contains a negative number. Currently, I am using the following:

awk '/-/{print}' in.csv > out.csv

The above fails if '-' appears in other columns. I wonder if there is a way to test the last field in each row to see if it is negative and then extract the line.
Just tell awk to do it:

awk -F, '$NF < 0' file

This sets the field separator to the comma (it looks like this is what you need) and then checks if $NF is lower than 0. And what is $NF? The last field, since NF contains the number of fields and $i points to field number i. The line is then printed, because a true condition triggers the default awk action, which consists of printing the current record.
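A runnable demonstration on the sample rows from the question; note that the last field arrives as " -1", " 2", " -3" with a leading space, and awk's numeric comparison handles that:

```shell
# Keep only rows whose last comma-separated field is negative.
printf 'a, b, -1\nc, d, 2\ne, f, -3\n' | awk -F, '$NF < 0'
# prints:
# a, b, -1
# e, f, -3
```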
Awk: printing undetermined number of columns
I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, but want to print them all in only one column with AWK. The format of the file is

col 1   col 2   ...   col n

There are at least 2 columns in each row. Sample:

2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844

So I want to tell awk to print column 2 to column n, for n greater than 2, without printing blank lines when there is no info in column n of that row, all in one column like the following:

901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844

This is the first time I am using awk, so bear with me. I wrote this from the command line, which works:

awk '{i=2; while ($i ~ /[0-9]+/) {
        printf "%s\n", $i
        i++
    }
}' bth.data

It is more of seeking approval than asking a question: is this the right way of doing something like this in AWK, or is there a better/shorter way of doing it? Note that the actual input file could be millions of lines. Thanks
Is this what you want as output?

awk '{for(i=2; i<=NF; i++) print $i}' bth.data

gives

901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844

NF is one of several pre-defined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print out the last field in a line: print $NF. Or of course if you want to iterate through all or part of the fields on a given line to the end of the line.
Seems like awk is the wrong tool. I would do:

cut -f 2- < bth.data | tr -s '\t' '\n'

Note that with -s, this avoids printing blank lines, as stated in the original problem.
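A quick demonstration of the pipeline on a small tab-separated sample of mine (not bth.data): cut drops the first column, and tr turns the remaining tab runs into newlines, with -s squeezing repeats so short rows leave no blank lines.

```shell
# Two rows: one with a single extra column, one with two.
printf '1\t901749095\n2\t901744459\t258789\n' | cut -f 2- | tr -s '\t' '\n'
# prints:
# 901749095
# 901744459
# 258789
```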