not getting sort output in awk - awk

I am using the book "The AWK Programming Language" by Aho, Kernighan, and Weinberger.
On page 20 they have given a program which doesn't work on my system.
emp.data is
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
program they have given is
awk '{ printf("%6.2f %s\n" , $2*$3, $0) }' emp.data | sort
and the output they have given is
But my output is
0.00 Beth 4.00 0
0.00 Dan 3.75 0
100.00 Mark 5.00 20
121.00 Mary 5.50 22
40.00 Kathy 4.00 10
76.50 Susie 4.25 18
So what's happening?

1. Your awk is broken or you have control chars in your input.
2. Your printf syntax is wrong (but would still produce correct output).
To get "2" out of the way: printf is a builtin language construct, not a function. When you do this:
printf("%s",foo)
you are not calling a printf function with 2 arguments, you are invoking the printf builtin with 1 argument which you are constructing from "(" "%s" "," "foo" and ")". The correct syntax is simply:
printf "%s",foo
but you can stick brackets around any of that and it won't add any value but won't break it either. Any of these would "work" in the same way:
printf ("%s"),foo
printf "%s",(foo)
printf ("%s"),(foo)
printf (((((((((("%s",foo))))))))))
More importantly, though, is point "1" above: you're telling awk to produce output formatted as:
"%6.2f ...."
which means that the leading digits should be padded with up to 2 leading spaces on the left but your output has no leading spaces on the first line. That is impacting your "sort" but there's more going on here too since given the strings:
2
10
it doesn't matter if you do a numeric sort or an alphabetic sort because 2 is numerically less than 10 but space is also numerically less than 1 so the result should be the same either way.
Your posted output, though, implies that your sort is sorting alphabetically in such a way that "100" is less than " 40", which just is not the way sort works. Even if a space somehow sorted after "1" alphabetically in your locale, it wouldn't explain why you get the equivalent of:
2
10
3
in your output, i.e. sometimes it treats space as less than one and other times as more.
Since your awk is clearly producing bad output there is definitely a problem with either your awk or your input file, so I think it's unlikely that there's also a problem with your sort tool.
Try these commands and post your result if you'd like help debugging your problem:
$ awk '{ printf "%6.2f\n" , $2*$3 }' emp.data
  0.00
  0.00
 40.00
100.00
121.00
 76.50
$ awk '{ printf "%6.2f\n" , $2*$3 }' emp.data | sort
  0.00
  0.00
 40.00
 76.50
100.00
121.00
I had one other thought - if you messed up the copy/paste of your awk output then maybe it's a locale issue. Try doing this:
export LC_ALL=C
and then running the command again (without the "-n" on sort).
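The effect of the padding on a plain sort can be checked in isolation. The following is only a sketch, not a diagnosis of your system: it formats three of the pay amounts from emp.data (40, 100 and 76.5) with and without the width of 6, and sorts each version in the C locale.

```shell
# Sketch: sort the same three amounts with and without the %6.2f padding.
# The amounts 40, 100 and 76.5 are the non-zero pay values from emp.data.
export LC_ALL=C   # byte-wise collation, so a space sorts before any digit

padded=$(printf '%s\n' 40 100 76.5 | awk '{ printf "%6.2f\n", $1 }' | sort)
unpadded=$(printf '%s\n' 40 100 76.5 | awk '{ printf "%.2f\n", $1 }' | sort)
numeric=$(printf '%s\n' 40 100 76.5 | awk '{ printf "%.2f\n", $1 }' | sort -n)

printf 'padded, plain sort:\n%s\n\n' "$padded"
printf 'unpadded, plain sort:\n%s\n\n' "$unpadded"
printf 'unpadded, sort -n:\n%s\n' "$numeric"
```

With padding, the plain sort already gives 40, 76.50, 100. Without padding, the plain sort gives 100, 40, 76.50, which is the same relative order as in the posted output, so lost padding is one plausible explanation.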

Try sort -n at the end, to do a numerical sort. The default sort would put 10 before 2.

They have assumed that sort sorts numerically, but your sort appears to default to alphabetical order.
Have a look at your sort command line options to see if you can make it numeric.

Related

Variable evaluation before assignment in awk

In the following awk statement:
awk '$2 > maxrate {maxrate = $2; maxemp = $1}
END {print "highest hourly rate:", maxrate, "for", maxemp}' pay.data
run on the following data:
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
How does $2 > maxrate work, since maxrate is evaluated before $2 has been assigned to it?
From the GNU awk manual
By default, variables are initialized to the empty string, which is
zero if converted to a number. There is no need to explicitly
initialize a variable in awk, which is what you would do in C and in
most other traditional languages.
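A quick way to see this default for yourself is the sketch below. The variable name x is arbitrary and is never assigned anywhere in the program.

```shell
# Sketch: x is never assigned, so it acts as "" in string context
# and as 0 in numeric context.
uninit=$(awk 'BEGIN {
    if (x == 0)  print "x == 0 is true"
    if (x == "") print "x == \"\" is true"
    print "x + 1 =", x + 1
}')
echo "$uninit"
```

Both comparisons succeed, and the arithmetic treats x as 0.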
This implicit initialisation, which is common in scripting languages, is very convenient but also leaves room for mistakes or confusion.
For example, in this case, you can calculate the maximum, with no need to initialise max:
awk '$2 > max{max = $2} END{print "max:", max}' file
max: 5.50
But if you do the same for the min, you get the empty string as the result. min starts out as zero as a number and empty as a string, so with positive data $2 < min is never true, min is never assigned, and it still prints as empty.
awk '$2 < min{min = $2} END{print "min:", min}' file
min:
The max calculation could also fail if all the values were negative. So it is better to make sure the variable is assigned from the first record.
awk 'NR==1{min=$2; next} $2<min{min = $2} END{print "min:", min}' file
min: 3.75
This approach should work for both min and max, for numbers of any range. In general, when scripting, we have to think through every case in which a variable that was never defined or initialised might be tested before it gets a value.
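As a sketch of that idea applied to both values at once, the command below seeds min and max from the first record and then updates them on later records. The temporary directory and the file name pay.data are only for illustration; the data are the rows from the question.

```shell
# Sketch: seed min and max from record 1, then compare the remaining records.
dir=$(mktemp -d)
cat > "$dir/pay.data" <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

result=$(awk 'NR == 1 { min = max = $2; next }   # first record initialises both
     $2 < min { min = $2 }
     $2 > max { max = $2 }
     END { print "min:", min, "max:", max }' "$dir/pay.data")
echo "$result"
rm -rf "$dir"
```

Because nothing is compared against an unset variable, the result does not depend on whether the data are positive or negative.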
By default, if you don't assign a value to a variable in awk, its value is null (you can also assign to a variable without declaring it first). So on the first line the condition compares $2 with a null maxrate, which is treated as 0; the condition is true and the block runs, assigning the 2nd field to maxrate.
After that first assignment, each following line compares its own 2nd field with the largest value stored so far in maxrate, and this continues until every line of Input_file has been read. Finally, the END section prints the result.

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
I would like to print the value of Sigma for id. Note that there are two lines containing id in the file, so I also used a condition based on NRM_V.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following? I have added awk's exit statement, which leaves the program as soon as the condition first matches. It also saves time, since awk no longer reads the whole Input_file.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with columns:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read file from last line towards first line and get its first value then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with columns comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile
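The difference between the regex test and the field comparison can be seen on three lines taken from the file in the question. This is a minimal sketch; the temporary file stands in for myfile.

```shell
# Sketch: /id/ is an unanchored regex, so it also matches "damid".
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
EOF

loose=$(awk '/id/ && /NRM_V/ { print $5 }' "$tmp")              # matches both rows
exact=$(awk '$1 == "id" && $2 == "NRM_V" { print $5 }' "$tmp")  # matches only the id row

printf 'regex match:\n%s\n' "$loose"
printf 'field match:\n%s\n' "$exact"
rm -f "$tmp"
```

The regex version prints both 13.480 and 13.845, while the field comparison prints only 13.480.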

sed/awk - return rows that match certain strings at the second column

I have a csv file in HDFS that I am using fread() to read into R. The data looks like this:
Date Code Value TransactionID
2016-01-01 769 123 16U11863C2MS0000337625C1
2016-02-01 2MS-US 456 16U11863C2MS0000337626C1
2016-03-01 E9E-US 789 16U11863C2MS0000337627C1
I would like to use sed to only read in the rows with code "2MS-US" or "769". I am not using grep because it doesn't keep the headers. I tried:
fread("hadoop fs -text /path/to/file.csv | sed -n '1p;/^[0-9]*-[0-9]*-[0-9]* 2MS-US/p; /^[0-9]*-[0-9]*-[0-9]* 769/p'", fill=TRUE)
But this returns zero rows. I am guessing I have the wrong regex, but not sure how to fix that.
I have also tried using awk, but haven't had any luck with just filtering with a condition:
fread("hadoop fs -text /path/to/file.csv | awk '$2 == 2MS-US'", fill=TRUE)
which returns the following error message:
Error in fread(....)
Expecting 2 cols, but line 5293 contains text after processing all cols. Try again with fill=TRUE.
Any suggestion on a fix with sed or awk(if awk will be able to keep the headers) will be much appreciated!
EDIT:
Thanks to the help from @amaurea as well as this thread, I have been able to achieve my goal with the following code:
fread("/usr/bin/hadoop fs -text /path/to/file.csv | awk -F '\"*,\"*' 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'")
Correct me if I am wrong, but it seems to me that when using awk to work with csv files, -F '\"*,\"*' is needed, which is not the case with text files.
It looks like you have an issue with quoting in your awk script. 2MS-US needs to be quoted. It works for me with this awk command:
awk 'FNR==1||$2=="2MS-US"||$2=="769"' hdfs.txt
where hdfs.txt is a file with your example contents. This outputs
Date Code Value TransactionID
2016-01-01 769 123 16U11863C2MS0000337625C1
2016-02-01 2MS-US 456 16U11863C2MS0000337626C1
which I think is what you want. However, since you're calling awk inside a quoted string, you will probably need to escape the double quotes in the awk command to avoid conflicts with the double quotes in your fread, like this:
fread("hadoop fs -text /path/to/file.csv | awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'", fill=TRUE)
Though really, one would expect it to be cleaner to just filter the table directly in R.
Edit: Since you're still having problems, here's a small test case that works for me, which you can run directly in the terminal:
$ cat <<HERE > hdfs.txt
Date Code Value TransactionID
2016-01-01 769 123 16U11863C2MS0000337625C1
2016-02-01 2MS-US 456 16U11863C2MS0000337626C1
2016-03-01 E9E-US 789 16U11863C2MS0000337627C1
HERE
$ cat <<'HERE' > test.r
library(data.table)
fread("awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"' hdfs.txt")
HERE
$ R -q -f test.r
> library(data.table)
> fread("awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"' hdfs.txt")
Date Code Value TransactionID
1: 2016-01-01 769 123 16U11863C2MS0000337625C1
2: 2016-02-01 2MS-US 456 16U11863C2MS0000337626C1
>
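For the comma-separated case mentioned in the question's edit, here is a sketch of what the -F '"*,"*' separator does, run directly in the shell rather than through fread. The quoted CSV rows are an assumption: they are a hypothetical comma-separated version of the sample data, since the real file is not shown.

```shell
# Sketch: the field separator is the regex "*,"* , i.e. a comma together
# with any double quotes directly around it.
# The CSV rows below are hypothetical, modelled on the sample data.
csv=$(printf '%s\n' \
    '"Date","Code","Value"' \
    '"2016-01-01","769","123"' \
    '"2016-02-01","2MS-US","456"' \
    '"2016-03-01","E9E-US","789"')

filtered=$(printf '%s\n' "$csv" | awk -F '"*,"*' 'FNR==1 || $2=="2MS-US" || $2=="769"')
first_field=$(printf '%s\n' "$csv" | awk -F '"*,"*' 'FNR==1 { print $1 }')

printf '%s\n' "$filtered"
printf 'first header field as awk sees it: %s\n' "$first_field"
```

The middle fields come out without quotes, which is why $2=="769" matches. The first and last fields keep their outer quote (the first header field is seen as "Date with a leading quote), so this separator is only a partial CSV parser.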

Use awk to get min-max column values

Given a file with data such as
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
This awk command will return a min value in col 4:
awk 'min=="" || $4 < min {min=$4} END{ print min}' FS=" " 12--December.txt
How do I get it to exclude any line where $4 contains text? I imagine this needs regex but poring over the regex manuals I am lost as to how to do it.
You can use a regular expression comparison on the fourth field as
$4~/[0-9]+/
Test
$ awk '$4~/[0-9]+/ && $4 < min {min=$4} END{print min}' input
-1.29
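Note: this is a shortened version of the command in the question. Dropping the min=="" test works here only because the data contain negative values (an unset min compares as 0), so keep that test if all the values could be positive.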
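A somewhat stricter variant is sketched below, under the assumption that column 4 holds plain decimal numbers with an optional leading minus sign. It anchors the regex so that a field must be entirely numeric, and it keeps the min=="" test so the first numeric value always initialises the result. The temporary file stands in for the input file.

```shell
# Sketch: accept only fields that are entirely a decimal number,
# and initialise min/max from the first such field.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
EOF

minmax=$(awk '$4 ~ /^-?[0-9]+(\.[0-9]+)?$/ {
    # adding 0 forces a numeric comparison
    if (min == "" || $4 + 0 < min + 0) min = $4
    if (max == "" || $4 + 0 > max + 0) max = $4
}
END { print min, max }' "$tmp")
echo "$minmax"
rm -f "$tmp"
```

The "Full Moon" line is skipped because its fourth field, Full, does not match the pattern.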

comparing two columns in two files

I would like to compare two columns in two files.
Here's an example.
file1:
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
file2:
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file1 file2
However, I want to print the common second column only if the first column is also the same. More specifically, if it finds 722603 in both files, it must check that the first column is also equal to 1 and then print it. If the number in the second column is repeated, it is important that it gets printed more than once, with the different values of column 1.
I'd be very thankful if you could guide me through this. Thank you.
Like this? I extended your code a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2
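Here is a sketch of that command run on the sample data. The key is columns 1 and 2 joined with the field separator, so a row of file2 is printed only when both columns also appear together in file1. The last row of file2 (2 722603 0.5) is a made-up addition to show a second column that matches while the first does not.

```shell
# Sketch: build keys "col1 col2" from file1, then print the rows of file2
# whose key was seen in file1.
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
EOF
cat > "$dir/file2" <<'EOF'
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
2 722603 0.5
EOF

matched=$(awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' "$dir/file1" "$dir/file2")
echo "$matched"
rm -rf "$dir"
```

Only the first three rows of file2 are printed. The made-up row 2 722603 is rejected because its key "2 722603" never occurs in file1.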