Counting the number of specific values in a column with awk

I have data (data.csv):
"1",5.1,"s"
"2",3.3,"s"
"3",2.7,"c"
and I want to count the number of lines whose 3rd element is "s" or "c" with AWK (count.awk):
BEGIN{FS=","; s_count=0; c_count=0}
($3=="s"){s_count++}
($3=="c"){c_count++}
END{print s_count; print c_count}
then
$ awk -f count.awk data.csv
but this does not work. Its output is:
0
0
This is not what I expected. Why?
$ awk -V
GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 5.1.2)
Note: I use Awk on cygwin.

The problem is that your target field contains embedded double quotes, so you need to match them too, by including them, backslash-escaped, in the string to compare against:
awk '
BEGIN{FS=","; s_count=0; c_count=0}
($3=="\"s\"") {s_count++}
($3=="\"c\"") {c_count++}
END{ print s_count; print c_count }
' data.csv
As an aside, you can simplify your awk program somewhat:
The parentheses around the patterns are not needed (I have not verified this on cygwin, but since it is awk that interprets the program, I would not expect that to matter).
You don't strictly need to initialize your counters, because awk treats uninitialized variables as 0 in numeric contexts. Note, however, that a counter that was never incremented prints as an empty string; print s_count+0 forces a 0 in that case.
BEGIN{FS=","}
$3 == "\"s\"" {s_count++}
$3 == "\"c\"" {c_count++}
END{ print s_count; print c_count }
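To see why the unquoted comparison fails, it can help to print the third field between brackets. This is a minimal sketch that uses one sample line from the question; the brackets are only there to make the field boundaries visible.

```shell
# Show exactly what awk sees in field 3 when splitting on commas.
field=$(printf '"1",5.1,"s"\n' | awk -F, '{ print "[" $3 "]" }')
printf '%s\n' "$field"   # prints ["s"], with the double quotes included
```

Because the quotes are part of the field, comparing against a bare s never matches.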

This is a job for an array. Here is an awk command:
awk -F, '{gsub(/\"/,"",$3);a[$3]++} END {for (i in a) print i,a[i]}' file
c 1
s 2
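It counts the occurrences of c and s, and of any other values that appear in the third column. Note that the order in which for (i in a) visits the keys is unspecified, so the lines may come out in a different order.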
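Since awk does not guarantee the traversal order of an array, one way to get stable output is to pipe the result through sort. A minimal sketch, assuming the three sample lines from the question; the array name count is chosen here for illustration:

```shell
# Count each distinct value of column 3 after stripping the double quotes.
# The traversal order of "for (k in count)" is unspecified, so sort the result.
counts=$(printf '"1",5.1,"s"\n"2",3.3,"s"\n"3",2.7,"c"\n' |
  awk -F, '{ gsub(/"/, "", $3); count[$3]++ }
           END { for (k in count) print k, count[k] }' | sort)
printf '%s\n' "$counts"
```

With the sample data this prints c 1 followed by s 2.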

Related

How to check if a string contains at least one letter other than 4 given letters using bash or awk

How to check that a sequence has at least one letter that is not A, U, C, G characters using awk or bash?
Can it be done without the typical for loop?
Example of sequence:
AUVGAU
If I give this as input, I should get it back, because it contains V.
The input file looks something like this, so I think awk would be better.
>7A0E_1|
AUVGAU
>7A0E_2|
GUCAU
Expected output
>7A0E_1|
AUVGAU
Here is what I tried:
awk '!/^>/ {next}; {getline s}; s !~ /AUGC/ { print $0 "\n" s }' sample
But obviously /AUGC/ is not right... can someone help me with this regex?
I think awk is the right tool if you want to print the > line only when the record that follows it contains a character other than A, U, G, or C. You can do that with:
awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
In your case that results in:
$ awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
>7A0E_1|
AUVGAU
(Note: you can use print rec; print instead of printf, but the printf above reduces the output to a single call.)
Where you ran into trouble was forgetting to save the current record that began with > and then using getline -- which wasn't needed at all.
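Since the question also mentions bash, a single sequence held in a shell variable can be tested without awk and without a loop over its characters. This is a sketch only: has_other_letter is a hypothetical helper name, and it relies on the POSIX shell pattern [!AUCG], which matches any one character outside the set.

```shell
# has_other_letter is a hypothetical helper name, not from the original post.
# It succeeds when the argument contains a character outside A, U, C, G.
has_other_letter() {
  case $1 in
    *[!AUCG]*) return 0 ;;
    *)         return 1 ;;
  esac
}

for seq in AUVGAU GUCAU; do
  if has_other_letter "$seq"; then
    printf '%s\n' "$seq"   # only AUVGAU is printed
  fi
done
```

For a whole file with > header lines, the awk command above remains the more natural fit.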
How to check that a sequence has at least one letter that is not A, U, C, G characters using awk(...)? Can it be done without the typical for loop?
Yes, GNU AWK can do that. Let file.txt content be
AUVGAU
AUCG
(empty line is intentional) then
awk 'BEGIN{FPAT="[^AUCG]"}{print NF>=1}' file.txt
output
1
0
0
Explanation: FPAT defines a field as any single character that is not one of A, U, C, G, so the number of fields (NF) equals the number of such characters, and NF>=1 prints 1 when there is at least one. Note that this solution redefines what a field is; if that is a problem, you might use patsplit instead:
awk '{patsplit($0,arr,"[^AUCG]");print length(arr)>=1}' file.txt
(tested in gawk 4.2.1)
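If GNU awk is not available, the same 1/0 result can probably be had from a plain regex match, because a match used as an expression evaluates to 1 or 0. A minimal sketch, assuming the same three input lines (the last one empty):

```shell
# A regex match used as an expression evaluates to 1 or 0 in any POSIX awk,
# so neither FPAT nor patsplit (both GNU awk extensions) is required.
flags=$(printf 'AUVGAU\nAUCG\n\n' | awk '{ print ($0 ~ /[^AUCG]/) }')
printf '%s\n' "$flags"
```

This prints 1, 0, 0 on separate lines, matching the FPAT version.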

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command?? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still, it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line so you could get 1 line of all-fields output if you don't have many fields being output or several lines of multi-field output if you do.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a
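The joining behaviour of xargs described above can be checked in isolation. A minimal sketch with two made-up input lines: when no command is given, xargs runs echo on the words it reads, so separate lines end up on one line.

```shell
# With no command given, xargs runs echo on its input words,
# which joins separate lines into one space-separated line.
joined=$(printf 'b\na\n' | xargs)
printf '%s\n' "$joined"   # prints: b a
```

That is the same result the pure-awk version produces with printf and OFS, without the extra process.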

How to control the format of float numbers in gawk?

The following two runs are different. How to make the first run the same as the second run (I still want print without any explicit arguments)? Is there a way to control the number of digits in $1 = 1/3?
$ gawk -v OFMT='%.20g' -e 'BEGIN { $1 = 1/3; print }'
0.333333
$ gawk -v OFMT='%.20g' -e 'BEGIN { print 1/3}'
0.33333333333333331483
EDIT: The following comparison is also unexpected. Ideally, if there is just one field, print $1 and print should be just the same. I think it could be considered as a bug?
$ gawk -v OFMT='%.20g' -e 'BEGIN { $1 = 1/3; print $1}'
0.33333333333333331483
$ gawk -v OFMT='%.20g' -e 'BEGIN { $1 = 1/3; print}'
0.333333
There is a subtlety here. There are two variables, OFMT and CONVFMT. The variable OFMT is used to control how numbers are converted to strings in the print statement while the variable CONVFMT is used to define how numbers are converted to strings in general (outside of the print statement):
Prior to the POSIX standard, awk used the value of OFMT for converting numbers to strings. OFMT specifies the output format to use when printing numbers with print. CONVFMT was introduced in order to separate the semantics of conversion from the semantics of printing. Both CONVFMT and OFMT have the same default value: "%.6g". In the vast majority of cases, old awk programs do not change their behaviour.
source: GNU awk manual
More detailed information about this reasoning can be found in the Rationale section of the POSIX awk specification.
numeric value in print statement:
$ awk 'BEGIN{print 1/3}'
0.333333
$ awk 'BEGIN{OFMT="%.20g"; print 1/3 }'
0.33333333333333331483
$ awk 'BEGIN{CONVFMT="%.20g"; print 1/3 }'
0.333333
variable with a numeric value in print statement:
$ awk 'BEGIN{a=1/3; print a}'
0.333333
$ awk 'BEGIN{OFMT="%.20g"; a=1/3; print a }'
0.33333333333333331483
$ awk 'BEGIN{CONVFMT="%.20g"; a=1/3; print a }'
0.333333
variable with a numeric value converted to string in print statement:
$ awk 'BEGIN{a=1/3; a=a""; print a}'
0.333333
$ awk 'BEGIN{OFMT="%.20g"; a=1/3; a=a""; print a }'
0.333333
$ awk 'BEGIN{CONVFMT="%.20g"; a=1/3; a=a""; print a }'
0.33333333333333331483
I am not sure if it's a bug, but try setting a variable instead of the first field:
gawk -v OFMT='%.20g' -e 'BEGIN { a = 1/3; print a}'
0.33333333333333331483

Using awk command to display lines backwards

I am fairly new to the awk command and still playing with it. I am trying to display multiple lines of a file, let's say lines 3-5, with the words of each line in reverse order. So with the given file:
Hello World
How are you
I love computer science,
I am using awk,
And it is hard.
And it should output:
science, computer love I
awk, using am I
hard. is it And
Any step in the correct direction will be beneficial!!
The following awk command may help; it uses the variables start and end to select only the lines that should be printed.
awk -v start=3 -v end=5 'FNR>=start && FNR<=end{for(;NF>0;NF--){printf("%s%s",$NF,NF==1?RS:FS)}}' Input_file
Output will be as follows.
science, computer love I
awk, using am I
hard. is it And
Explanation:
awk -v start=3 -v end=5 ' ##start and end hold the first and last line numbers to print.
FNR>=start && FNR<=end{ ##FNR is awk's built-in per-file line number; the action runs only for lines start through end.
for(;NF>0;NF--){ ##Loop while fields remain; decrementing NF (the built-in field count) drops the last field on each pass.
printf("%s%s",$NF,NF==1?RS:FS)} ##Print the current last field, $NF, followed by a newline (RS) if it is the only field left, otherwise by a space (FS).
}
' Input_file ##Name of the input file.
$ cat tst.awk
NR>2 && NR<6 {
for (i=NF; i>0; i--) {
printf "%s%s", $i, (i>1?OFS:ORS)
}
}
$ awk -f tst.awk file
science, computer love I
awk, using am I
hard. is it And
You can use the following awk command to reach your goal:
input:
$ cat text
Hello World
How are you
I love computer science,
I am using awk,
And it is hard.
output:
$ awk 'NR<3{print}NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
Hello World
How are you
science, computer love I
awk, using am I
hard. is it And
Explanations:
NR<3{print} prints the first 2 lines unchanged.
NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";} applies from the 3rd line on: a loop runs over all NF fields and prints them from the last to the first ($NF is the last one, $1 is the first one), each followed by a space (so every output line ends with a trailing space). After the loop, an end-of-line character is printed.
Now, if you do not need to print the first 2 lines use:
$ awk 'NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
science, computer love I
awk, using am I
hard. is it And
For files with more lines for which you want to print only a range (3-5) use:
$ awk 'NR>=3 && NR<=5{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
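The range version can be sketched without the trailing space by choosing the separator per field, as the tst.awk answer does. The variable names first and last are hypothetical and chosen for this sketch; the input is the sample text from the question.

```shell
# first and last are hypothetical variable names chosen for this sketch.
# Reverse the fields of lines first..last, joining with OFS and ending with ORS,
# so no trailing space is printed.
reversed=$(printf 'Hello World\nHow are you\nI love computer science,\nI am using awk,\nAnd it is hard.\n' |
  awk -v first=3 -v last=5 '
    NR >= first && NR <= last {
      for (i = NF; i > 0; i--)
        printf "%s%s", $i, (i > 1 ? OFS : ORS)
    }')
printf '%s\n' "$reversed"
```

Passing the bounds with -v keeps the program itself unchanged when a different range is needed.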

Exact string match in awk

I have a file test.txt with the following lines
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the following awk line to get information only about the string 2010TJ
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code prints both lines. I want to know how to get only the line containing the exact string
1997 100 500 2010TJ
The string can be placed in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
Or match the actual separator (a space, a tab, or whatever separates the columns) after the string, here anchored to the start of the line:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, those same using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"' file
$ awk -v s=2010TJ '$0~"^" s " "' file
$ awk -v s=2010TJ '$1==s' file
Note that the first differs from the second and third. The first matches the standalone word 2010TJ anywhere in $0; the second matches only lines that begin with 2010TJ followed by a space; and the third matches only lines whose first field is exactly 2010TJ.
Try this (tests only column 1):
awk '$1 == "2010TJ" {print $0}' test.txt
or, grep-like (all columns):
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note: \< and \> are word boundaries (a GNU awk extension).
Another option, using GNU awk's \y word boundary operator:
awk '/\y2010TJ\y/' file
Note: \y matches at either the beginning or the end of a word; like \< and \>, it is specific to GNU awk.
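For an awk without the GNU word-boundary operators, the field loop shown earlier can be combined with a variable. A minimal sketch, assuming whitespace-separated columns as in test.txt; appending "" to the field is an addition made here to force a string comparison rather than a numeric one.

```shell
# Print lines in which any whole field equals the string in s.
# Appending "" forces a string comparison, so a value such as 1e2
# would not be treated as numerically equal to a field containing 100.
matches=$(printf '1997 100 500 2010TJ\n2010TJXML 16 20 59\n' |
  awk -v s=2010TJ '{ for (i = 1; i <= NF; i++) if (($i "") == s) { print; next } }')
printf '%s\n' "$matches"   # prints: 1997 100 500 2010TJ
```

Unlike the regex forms, this does not interpret characters in s as regex metacharacters.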