I have a tab-delimited file like the following and I am trying to write an awk script
aaa_log-000592 2 p STARTED 7027691 21.7 a1
aaa_log-000592 28 r STARTED 7027815 21.7 a2
aaa_log-000592 33 p STARTED 7032607 21.7 a3
aaa_log-000592 33 r STARTED 7032607 21.7 a4
aaa_log-000592 43 p STARTED 7025709 21.7 a5
aaa_log-000592 43 r STARTED 7025709 21.7 a6
aaa_log-000595 2 r STARTED 7027691 21.7 a7
aaa_log-000598 28 p STARTED 7027815 21.7 a8
aaa_log-000599 13 p STARTED 7033090 21.7 a9
I am trying to count the values in the 3rd column (p or r), grouped by column 1.
The output would look like:
Col1 Count-P Count-R
aaa_log-000592 3 3
aaa_log-000595 0 1
aaa_log-000598 1 0
aaa_log-000599 1 0
I can't find an example that combines an IF condition with a group-by in awk.
awk (more specifically, the GNU variant, gawk) has multi-dimensional arrays that can be indexed using input values (including character strings, as in your example). As such, you can count the values the way you want by doing:
{
    values[$3] = 1      # this line records the values in column three
    counts[$1][$3]++    # and this line counts their frequency
}
The first line isn't strictly required, but it simplifies generating the output.
The only remaining part is to have an END clause that outputs the tabulated results.
END {
    # Print column headings
    printf "Col1 "
    for (v in values) {
        printf " Count-%s", v
    }
    printf "\n"
    # Print tabulated results
    for (i in counts) {
        printf "%-20s", i
        for (v in values) {
            printf " %d", counts[i][v]
        }
        printf "\n"
    }
}
Generating the values array handles the case where the set of values in column three isn't known in advance (e.g., when there's an error in your input).
If you're using a different awk implementation (like the one you might find on macOS, for example), array indexing may be different (e.g., arrays are single-dimensional, but can be indexed by a comma-separated list of indices, which awk joins using the SUBSEP character). This may add some complexity, but the idea is the same.
{
    files[$1] = 1       # record each value seen in column one
    values[$3] = 1      # record each value seen in column three
    counts[$1,$3]++     # count per (column one, column three) pair
}
END {
    # Print column headings
    printf "Col1 "
    for (v in values) {
        printf " Count-%s", v
    }
    printf "\n"
    # Print tabulated results
    for (f in files) {
        printf "%-20s", f
        for (v in values) {
            printf " %d", counts[f,v]
        }
        printf "\n"
    }
}
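As a usage sketch (counts.awk and input.txt are placeholder names, not from the question), save either version to counts.awk and run:

awk -f counts.awk input.txt

(the first version needs gawk for its true multi-dimensional arrays). Note that for (v in values) and for (i in counts) iterate in an unspecified order, so both the rows and the Count-P/Count-R column order may vary between runs and implementations; pipe the output through sort, or in gawk set PROCINFO["sorted_in"] = "@ind_str_asc", if you need a stable ordering.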
Related
How to use an array of numbers in a for loop with awk?
I tried:
awk '{ for (i in [10, 12, 18]) print i }' myfile.txt
But I'm getting a syntax error.
The in operator works on arrays. The way to create an array from a list of numbers like 10 12 18 is to split() a string containing those numbers.
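As a minimal sketch of just that mechanic (standalone, separate from the solutions below):

awk 'BEGIN {
    n = split("10 12 18", a, " ")   # n == 3; a[1]="10", a[2]="12", a[3]="18"
    for (j = 1; j <= n; j++) print a[j]
}'

This prints 10, 12 and 18, one per line.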
To have those numbers stored as values in an array a[] with indices 1 to 3:
awk 'BEGIN{FS=OFS="|"; split("10 12 18",a," ")}
(FNR>2) { for(j in a) { i=a[j]; k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
To have those numbers stored as indices of an array b[] with all values 0-or-null (same as an uninitialized scalar variable):
awk 'BEGIN{FS=OFS="|"; split("10 12 18",a," "); for (j in a) b[a[j]]}
(FNR>2) { for(i in b) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
If you didn't want to create the array once up front for some reason (e.g. the list of numbers you want to split is created dynamically) then you could create it every time you need it, e.g.:
awk 'BEGIN{FS=OFS="|"}
(FNR>2) { split("10 12 18",a," "); for(j in a) { i=a[j]; k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
but obviously creating the same array multiple times is less efficient than creating it once.
I made a very rough emulation of a for-each loop that takes a list directly, without needing a separate initialization call first.
It tries to be as flexible as possible about what the delimiter(s) might be, so
foreach("[CA=MX=JP=FR=SG=AUS=N.Z.]")
would also work.
Despite being shown as a gawk profile below, and despite using the PROCINFO array, you don't need gawk for it to work: it's functional on mawk 1.3.4, mawk 1.9.9.6, GNU gawk 5.1.1, and the built-in awk on macOS.
I just added a Unicode UTF-8 feature, which works regardless of your locale setting or whether you're using gawk, mawk, or nawk; emojis work fine too.
That said, it cannot properly parse CSV, XML, or JSON input (I didn't have the time to make it that fancy).
list 1 :: 10
list 1 :: 12
list 1 :: 18.
list 1 :: 27
list 1 :: 36
list 1 :: pi
list 1 :: hubble
list 1 :: kelvins
list 1 :: euler
list 1 :: higgs
list 1 :: 9.6
list 2 :: CA
list 2 :: MX
list 2 :: JP
list 2 :: FR
list 2 :: SG
list 2 :: AUS
list 2 :: N.Z.
# gawk profile, created Mon May 9 22:06:03 2022
# BEGIN rule(s)
BEGIN {
11 while (i = foreach("[10, 12, 18., 27, 36, pi, hubble, kelvins, euler, higgs, 9.6]")) {
11 print "list 1 :: ", i
}
1 printf ("\n\n\n")
7 while (i = foreach("[CA, MX, JP, FR, SG, AUS, N.Z., ]")) {
7 print "list 2 :: ", i
}
}
# Functions, listed alphabetically
20  function foreach(_, __)
    {
20      if (_ == "") {
            return \
            PROCINFO["foreach", "_"] = \
            PROCINFO["foreach", "#"] = _
        }
20      __ = "\032\016\024"
20      if (_ != PROCINFO["foreach", "_"]) { # 2
 2          PROCINFO["foreach", "_"] = _
 2          gsub("^[ \t]*[[<{(][ \t]*" \
                 "|[ \t]*[]>})][ \t]*$" \
                 "|\\300|\\301", "", _)
            gsub("[^" ( \
                 "\333\222" ~ "[^\333\222]" \
                 ? "\\200-\\277" \
                   "\\302-\\364" \
                 : "" \
                 ) "[:alnum:]" \
                 "\302\200-\337\277" \
                 "\340\240\200-\355\237\277" \
                 "\356\200\200-\357\277\277" \
                 "\360\220\200\200-\364\217\277\277" \
                 ".\42\47#$&%+-]+", __, _)
            gsub("^" (__) "|" \
                 (__) "$", "", _)
 2          PROCINFO["foreach", "#"] = _
        }
20      if ((_ = PROCINFO["foreach", "#"]) == "") { # 2
 2          return _
        }
18      sub((__) ".*$", "", _)
        sub("^[^" (__) "]+(" (__) ")?", "", PROCINFO["foreach", "#"])
18      return _
    }
list 2 :: CA
list 2 :: MX
list 2 :: JP
list 2 :: FR
list 2 :: SG
list 2 :: 눷
list 2 :: 🤡
list 2 :: N.Z.
while (i = foreach("[CA=MX=JP=FR=SG=\353\210\267" \
                   "=\360\237\244\241=N.Z.]")) {
    print "list 2 :: ", i
}
I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
The data consists of data blocks, each with the same format:
The very first line is the header line. It contains the block's ID number and the number of data lines in the block. For example, the first data block's ID is 1 and it has 3 data lines; for the third data block, the ID is 3 and it has 2 data lines. All data blocks have this header line.
Then the "real data" follows; as explained, its number of lines is given by the second integer of the header line.
Accordingly, the total number of lines for each data block is number_of_lines+1. In this example, data block 1 has 4 lines in total, and data block 2 takes 6 lines...
This format repeats for all of the data blocks, up to 10000 of them in my current data, but I can provide this 10000 as an input variable to the bash or awk script. I know the total number of data blocks.
Now, what I wish to do is get the average of each of the two data columns and print it out with the data block's ID number and number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns and 6 decimal places. The result will have 10000 lines, and each line will have the ID, the number of lines, the average of data column 1, and the average of data column 2 for each data block. The result for this example will look like:
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash; this has been answered on Stack Overflow many times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I really have no clue how to approach, or even start on, this kind of averaging over multiple repeating data blocks, each with a different number of lines.
I was trying an awk-based approach, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how I can use NR or FNR to average data when the number of lines varies from block to block.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
To get fixed six-decimal-place numbers, set OFMT like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
s1 += $1
s2 += $2
next
}
{
if (id) {
print id, num, s1/num, s2/num
s1 = s2 = 0
}
id = $1
num = $2
}
END {
print id, num, s1/num, s2/num
}' file
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none"
{
id = $1;
qnt = $2;
s1 = s2 = line = 0;
next
}
{
s1 += $1;
s2 += $2;
++line
}
line == qnt
{
printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt;
qnt="none"
}
When a header line is seen (at the start, or after a data block has been processed), record the header info. Otherwise, add to the sums, and print the result once we have read all the lines in the current block.
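For a quick sanity check (a sketch; the data is fed on stdin or from a file), the one-liner above applied to the sample data should begin with:

1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000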
My input file is like:
a10 otu1 xx 44
b24 otu2 xxx 52
x35 otu3 xy 11
x45 otu3 zz 22
z452 Otu5 rr 78
control1 otu1 w 4
control2 otu2 ee 30
control3 otu3 tt 20
control4 otu4 yy 10
First, I want to separate the control rows from the others based on column 1, and then match the controls' second-column values against the other rows' second column. Where a match is found in the second column, I want to subtract the corresponding values in the fourth column.
The output file would be:
a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu4 zz 12
z452 Otu5 rr 78
Now, to match the second column and subtract the values in the fourth column, I use:
awk 'NR==FNR {a[$2]=$2 in a?a[$2]-$4:$4; next} !b[$2]++ {print $1,$2,$3,a[$2]}' inputfile.txt{,}
How can I feed separate field information (control vs others) in the script?
Could you please try the following.
awk '
!/^control/{
    a[++count1] = $NF                 # last field of each non-control line
    b[count1]   = $1 OFS $2 OFS $3    # and its first three fields
    next
}
{
    c[++count2] = $NF                 # last field of each control line
}
END{
    for (i = 1; i <= count1; i++) {
        print b[i], a[i] - c[i]       # pair the i-th non-control line with the i-th control line
    }
}
' Input_file
A more generic solution: in case you don't want to hardcode the field positions for the first array a, and your file has more than 4 fields, try the following.
awk '
!/^control/{
    a[++count1] = $NF    # save the last field
    $NF = ""             # drop it from the record
    sub(/ +$/,"")        # trim the trailing separator
    b[count1] = $0       # keep everything else, however many fields there are
    next
}
{
    c[++count2] = $NF
}
END{
    for (i = 1; i <= count1; i++) {
        print b[i], a[i] - c[i]
    }
}
' Input_file
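Run against the sample input (saved as Input_file), either version should print:

a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu3 zz 12
z452 Otu5 rr 78

Note that this pairs the i-th non-control row with the i-th control row purely by position, not by matching the otu key in column 2, so it relies on the two groups appearing in the same order.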
$ cat tst.awk
NR == FNR {                     # first pass: save each control value, keyed by column 2
    if ( /^control/ ) {
        control[$2] = $NF
    }
    next
}
!/^control/ {                   # second pass: subtract the matching control value and print
    $NF = $NF - control[$2]
    print
}
$ awk -f tst.awk file file
a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu3 zz 2
z452 Otu5 rr 78
Here's another take on this:
/^control/ {
    a[$2] = a[$2] - $4
    next
}
{
    a[$2] = a[$2] + $4
    b[$2] = $1 OFS $2 OFS $3
}
END {
    for (i in b) print b[i] OFS a[i]
}
This subtracts any values on control lines and adds any values on other lines, storing the running total in the array a[]. It maintains an array of line content, b[].
By storing content in arrays, it's possible for multiple data or control lines to affect a value, and they can appear in any order in your input (since 44 - 4 is the same as -4 + 44).
Note that because our END for loop steps through the array, output is not guaranteed to be in the same order as input.
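If you do need the output in input order, one common pattern (a sketch, not part of the answer above) is to record first-seen key order in a companion array:

/^control/ {
    a[$2] -= $4
    next
}
{
    if (!($2 in b)) order[++n] = $2   # remember the first-seen order of keys
    a[$2] += $4
    b[$2] = $1 OFS $2 OFS $3
}
END {
    for (i = 1; i <= n; i++) print b[order[i]] OFS a[order[i]]
}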
I have written an awk script to analyse my table data: I am calculating a p-value and a log2 odds ratio.
This is an example of data table I have.
Label Value1 Value2
Label1 9 6
Label1 7 6
Label1 1 6
Label2 5 7
Label2 3 7
Label2 8 7
For every label (Label1/2), I count how many times value1 > value2 and divide this count by the total number of times the label was observed; this gives the p-value.
In addition to this, I compare their log2 ratio.
This is my awk script.
awk '{a[$1]=$1}; ($2>=$3) {c++}; {sum+=$2} END
{print c/NR,log($3/(sum/NR))/log(2),a[$1]}'
And this is the result I get:
0.666667 0.0824622 Label1
Column 1 is the p-value; column 2 is the odds ratio; column 3 is the label.
The problem is that I don't know how to apply this calculation to both labels; I am getting a result only for the first one.
My question is: how do I repeat this awk calculation for every unique value in column 1 (Label1/2)?
I assume there are two lines before the first line of data, so I compare NR with 3. The program saves the previous label name ($1), and only when it changes ($1 != label) does it do the calculations and print. The other condition (NR >= 3) just accumulates data while processing the same label.
awk '
NR == 3 { label = $1 }
NR >=3 && $1 != label {
printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
c = l = sum = 0
label = $1
}
NR >= 3 {
if ( $2 >= $3 ) { c++ }
l++
sum += $2
v = $3
}
END {
printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
}
' infile
It yields:
0.666667 0.082462 Label1
0.333333 0.392317 Label2
Another way with awk (using arrays):
awk '
NR>1 && $2>$3 {
times[$1]++
}
{
total[$1]+=$2;
col3[$1]=$3;
seen[$1]++
}
END {
for(label in times) {
print times[label]/seen[label],log(col3[label]/(total[label]/seen[label]))/log(2),label
}
}' inputFile
Output:
0.666667 0.0824622 Label1
0.333333 0.392317 Label2
I have a text file and the last 2 lines look like this...
Uptime: 822832 Threads: 32 Questions: 13591705 Slow queries: 722 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.518
Uptime: 822893 Threads: 31 Questions: 13592768 Slow queries: 732 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.618
How do I find the difference between the two values of each parameter?
The expected output is:
61 -1 1063 10 0 0 0 0.1
In other words, I would like to subtract the earlier uptime value from the current one, find the difference between the Threads and Questions values, and so on.
The purpose of this exercise is to watch this file and alert the user when a difference is too high, e.g. if there are more than 500 slow queries, or the "Questions" parameter is too low (< 100).
(It is MySQL status output, but the question has nothing to do with MySQL itself, so the mysql tag does not apply.)
Just a slight variation on ghostdog74's (original) answer:
tail -2 file | awk '{
    gsub(/[a-zA-Z: ]+/," ")
    m = split($0, a, " ")
    for (i = 1; i <= m; i++)
        if (NR == 1) b[i] = a[i]; else print a[i] - b[i]
}'
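For intuition (a sketch): the gsub collapses every run of letters, colons and spaces into a single space, so the first sample line is reduced to its eight numbers,

 822832 32 13591705 722 81551 59 64 16.518

which split() then loads into a[1] through a[8].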
Here's one way. tail is used to get the last 2 lines, which is especially useful for efficiency if you have a big file.
tail -2 file | awk '
{
gsub(/[a-zA-Z: ]+/," ")
m=split($0,a," ")
if (f) {
for (i=1;i<=m;i++){
print -(b[i]-a[i])
}
# to check for Questions, slow queries etc
if ( -(b[3]-a[3]) < 100 ){
print "Questions parameter too low"
}else if ( -(b[4]-a[4]) > 500 ){
print "Slow queries more than 500"
}else if ( a[1] - b[1] < 0 ){
print "mysql ...... "
}
exit
}
for(i=1;i<=m;i++ ){ b[i]=a[i] ;f=1 }
} '
output
$ ./shell.sh
61
-1
1063
10
0
0
0
0.1
gawk:
BEGIN {
    arr[1] = "0"    # seed the array so length(arr) is defined on the first line
}
length(arr) > 1 {   # from the second line on, print the differences
    print $2-arr[1], $4-arr[2], $6-arr[3], $9-arr[4], $11-arr[5], $14-arr[6], $17-arr[7], $22-arr[8]
}
{                   # remember this line's values for the next one
    arr[1] = $2
    arr[2] = $4
    arr[3] = $6
    arr[4] = $9
    arr[5] = $11
    arr[6] = $14
    arr[7] = $17
    arr[8] = $22
}