Printing numeric values < NF in awk?

I ran a bunch of tests using pgbench, logging the results:
run-params: 1 1 1
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 90 s
number of transactions actually processed: 280465
tps = 3116.254233 (including connections establishing)
tps = 3116.936248 (excluding connections establishing)
run-params: 1 1 2
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 90 s
number of transactions actually processed: 505943
tps = 5621.570463 (including connections establishing)
tps = 5622.811538 (excluding connections establishing)
run-params: 10000 10 3
transaction type: SELECT only
scaling factor: 10000
query mode: simple
number of clients: 10
number of threads: 1
duration: 90 s
number of transactions actually processed: 10
tps = 0.012268 (including connections establishing)
tps = 0.012270 (excluding connections establishing)
I want to extract the values for graphing. Trying to learn AWK at the same time. Here's my AWK program:
/run-params/ { scaling = $2 ; clients = $3 ; attempt = $4 }
/^tps.*excluding/ { print $scaling "," $clients "," $attempt "," $3 }
When I run that, I get the following output:
$ awk -f b.awk -- b.log
tps,tps,tps,3116.936248
tps,tps,=,5622.811538
,,0.012270,0.012270
Which is not what I want.
I understand when scaling = 1, the 1 references field 1, which in this case happens to be tps. When scaling = 10000, because there aren't 10000 fields on the line, then null is returned. I did try assigning scaling and friends using "" $2, to no avail.
How does one use / report numeric values in a subsequent action block?

Simply drop the $ in front of scaling, etc. That is, scaling is a variable reference, whereas $scaling is a field reference: it selects the field whose number is stored in scaling.
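As a minimal sketch of that fix, here is the program from the question with the $ removed from the three variables. The shell wrapper name extract_tps is hypothetical and only there to make the snippet easy to run; the awk program itself is the asker's, otherwise unchanged.

```shell
# Hypothetical helper: print "scaling,clients,attempt,tps" rows from a pgbench log.
extract_tps() {
  awk '
    # Remember the run parameters from the most recent "run-params:" line.
    /run-params/      { scaling = $2; clients = $3; attempt = $4 }
    # Plain variable names (no $) yield the stored values, not fields.
    /^tps.*excluding/ { print scaling "," clients "," attempt "," $3 }
  ' "$@"
}
```

Calling it as extract_tps b.log should print one comma-separated row per run, for example 10000,10,3,0.012270 for the third run in the log above.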

Related

AWK-Get total count of records for numerical grouped column

I have a variable which splits the results of a column based on a condition (group by in other programming languages).
I'm trying to have a variable that counts the NR of each group. If we sum all the groups we should have the NR of the file.
When I try to use NR in the calculation, for example NR[variable that splits], I get a fatal error "you tried to use scalar as matrix".
Any ideas how to use NR as a variable, but not counting all the records, only those from each group?
sex, weight
male,50
female,49
female,48
male,66
male,78
female,98
male,74
male,54
female,65
In this case the NR would be 9 BUT, in reality I want a way to get that NR of male is 5 and 4 for female.
I have the total sum of the weight column but struggle to get the avg:
sex= $(f["sex"])
ccWeight[sex] += $(f["weight"])
avgWeight = ccWeight[sex] / ¿?
Important: I don't need to print the result as of now, just to store this number on a variable.
One awk idea:
awk -F, '
NR>1 { counts[$1]++        # keep count of each distinct sex
       counts_total++      # replace dependency on NR
       weight[$1]+=$2      # keep sum of weights by sex
     }
END  { for (i in counts) {
           printf "%s: (count) %s of %s (%.2f%%)\n",i,counts[i],counts_total,(counts[i]/counts_total*100)
           printf "%s: (avg weight) %.2f ( %s / %s )\n",i,(weight[i]/counts[i]),weight[i],counts[i]
       }
     }
' sample.dat
NOTE:
OP can add additional code to verify total counts and weights are not zero (so as to keep from generating a 'divide by zero' error)
perhaps print a different message if there are no (fe)male records to process?
This generates:
female: (count) 4 of 9 (44.44%)
female: (avg weight) 65.00 ( 260 / 4 )
male: (count) 5 of 9 (55.56%)
male: (avg weight) 64.40 ( 322 / 5 )
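Since the question asks for the average to be stored rather than printed, and the note above suggests guarding against empty input, here is a hedged variant. The function name group_avg, the avg[] array and the "no records to process" message are assumptions for illustration, not part of the answer above.

```shell
# Hypothetical helper: per-group count and average weight, kept in avg[].
group_avg() {
  awk -F, '
    NR > 1 { count[$1]++; weight[$1] += $2; total++ }   # skip the header line
    END {
      # Guard for input that has a header but no data rows.
      if (total == 0) { print "no records to process"; exit }
      for (sex in count) {
        avg[sex] = weight[sex] / count[sex]   # stored for later use
        printf "%s %d %.2f\n", sex, count[sex], avg[sex]
      }
    }
  ' "$@"
}
```

The printf is only there to show the stored values; the order in which for (sex in count) visits the groups is unspecified, so pipe the output through sort if a fixed order matters.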
GNU datamash might be what you are looking for, e.g.:
<infile datamash -Hst, groupby 1 count 1 sum 2 mean 2 | column -s, -t
Output:
GroupBy(sex) count(sex) sum(weight) mean(weight)
female 4 260 65
male 5 322 64.4

Reading fields in previous lines for moving average

Main Question
What is the correct syntax for recursively calling AWK inside of another AWK program, and then saving the output to a (numeric) variable?
I want to call AWK using 2/3 variables:
N -> Can be read from Bash or from container AWK script.
Linenum -> Read from container AWK program
J -> Field that I would like to read
This is my attempt.
Container AWK program:
BEGIN {}
{
...
# Loop in j
...
k=NR
# Call to other instance of AWK
var=(awk -f -v n="$n_steps" linenum=k input-file 'linenum-n {printf "%5.4E", $j}'
...
}
END{}
Background for more general questions:
I have a file for which I would like to calculate a moving average of n (for example 2280) steps.
Ideally, for the first n rows the average is of the values 1 to k,
where k <= n.
For rows k > n the average would be of the last n values.
I will eventually execute the code in many large files, with several columns, and thousands to millions of rows, so I'm interested in streamlining the code as much as possible.
Code Excerpt and Description
The code I'm trying to develop looks something like this:
NR>1
{
# Loop over fields
for (j in columns)
{
# Rows before full moving average is done
if ( $1 <= n )
{
cumsum[j]=cumsum[j]+$j #Cumulative sum
$j=cumsum[j]/$1 # Average
}
#moving average
if ( $1 > n )
{
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}') # Obtain value that will get subtracted from moving average
cumsum[j]=cumsum[j]+$j-last[j] # Cumulative sum adds last step and deleted unwanted value
$j=cumsum[j]/n # Moving average
}
}
}
My input file contains several columns. The first column contains the row number, and the other columns contain values.
For the cumulative sum of the moving average: If I am in row k, I want to add it to the cumulative sum, but also start subtracting the first value that I don't need (k-n).
I don't want to have to create an array of cumulative sums for the last steps, because I feel it could impact performance. I prefer to directly select the values that I want to subtract.
For that I need to call AWK once again (but on a different line). I attempt to do it in this line:
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}'
I am sure that this code cannot be correct.
Discussion Questions
What is the best way to obtain information about a field in a previous line to the one that AWK is working on? Can it be then saved into a variable?
Is this recursive use of AWK allowed or even recommended?
If not, what could be the most efficient way to update the cumulative sum values so that I get an efficient enough code?
Sample input and Output
Here is a sample of the input (second column) and the desired output (third column). I'm using 3 as the number of averaging steps (n)
N VAL AVG_VAL
1 1 1
2 2 1.5
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
If you want to do a running average of a single column, you can do it this way:
BEGIN{n=2280; c=7}
{ s += $c - a[NR%n]; a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
Here we store the last n values in an array a and keep track of the cumulative sum s. Every time we update the sum we correct by first removing the last value from it.
If you want to do this for several columns, you have to keep one sum per column and give each column its own block of n slots in the array:
BEGIN{n=2280; c[0]=7; c[1]=8; c[2]=9}
{ for(i in c) { s[i] += $c[i] - a[n*i + NR%n]; a[n*i + NR%n] = $c[i] } }
{ printf "%s", $0
for(i=0;i<length(c);++i) printf "%s%s", OFS, s[i]/(NR < n ? NR : n)
printf "%s", ORS
}
However, you mentioned that you have to add millions of entries. That is where it becomes a bit more tricky. Summing a lot of values will introduce numeric errors as you lose precision bit by bit (when you add floats). So in this case, I would suggest implementing the Kahan summation.
For a single column you get:
BEGIN{n=2280; c=7}
{ y = $c - a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
or a bit more expanded as:
BEGIN{n=2280; c=7}
{ y = $c - k; t = s + y; k = (t - s) - y; s = t; }
{ y = -a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; }
{ a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
For a multi-column problem, it is now straightforward to adjust the above script. All you need to know is that y and t are temporary values and k is the compensation term which needs to be stored in memory.
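One possible multi-column adjustment is sketched below, following the expanded two-step form above. The wrapper name moving_avg and its arguments (window size, then a space-separated list of column numbers) are assumptions for illustration; s[i] and k[i] are the per-column sum and compensation term, and a[i, slot] holds the last n values of column i.

```shell
# Hypothetical helper: moving_avg N "COLS" reads rows on stdin and appends
# one moving average per listed column to each row.
moving_avg() {
  awk -v n="$1" -v cols="$2" '
    BEGIN { nc = split(cols, c, " ") }
    {
      slot = NR % n
      for (i = 1; i <= nc; i++) {
        v = $(c[i])
        # Kahan step 1: add the new value to the running sum.
        y = v - k[i]; t = s[i] + y; k[i] = (t - s[i]) - y; s[i] = t
        # Kahan step 2: subtract the value that leaves the window.
        y = -a[i, slot] - k[i]; t = s[i] + y; k[i] = (t - s[i]) - y; s[i] = t
        a[i, slot] = v
      }
      printf "%s", $0
      for (i = 1; i <= nc; i++) printf "%s%s", OFS, s[i] / (NR < n ? NR : n)
      printf "%s", ORS
    }
  '
}
```

For example, moving_avg 3 "2" < input-file would average the second column over a window of three rows, which matches the sample in the question.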

labview while loop execution condition

Is there a way to give a while loop a condition that makes it produce an output every tenth execution, while it continues running after that output?
I hope I made myself clear...
Thanks!
Amy
Modulo is useful for this.
As an example: in Swift, you do modulo with the % symbol. Essentially, modulo outputs the remainder of dividing the first term by the second.
So;
Value 1 MODULO Value 2 outputs Remainder.
Furthermore;
6 % 2 = 0
6 % 5 = 1
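6 % 4.5 = 1.5 (mathematically; in current Swift, % only accepts integers, and floating-point remainders use truncatingRemainder(dividingBy:))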
Essentially you want every nth element to output a value, with n being the rate. You need to track how many loops of the while you have gone through already.
The code below will run through the while 1000 times, and print out every 10 times ( for a total of 100 prints of output. )
var execution : Int = 0
while ( execution != 1000 ) {
if ( execution % 10 == 0 ) {
print("output")
}
execution = execution + 1
}
Here is the same answer as given by adam, but in LabVIEW.

How to do multi-row calculations using awk on a large file

I have a big file that is sorted on the first word. I need to add a new column to each line with the proportional value: line value/total value for that group; the group is determined by the first column. In the example below, the total of group "a" is 100, and hence each line gets a proportion of that. The total of group "the" is 1000, and hence each line gets its proportion of that group's total.
I need an awk script to do this.
Sample File:
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
Output:
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 .25
the dog 750 .75
zisty cool 20 1
You describe this as a "big file." Consequently, this solution tries to save memory: it holds no more than one group in memory at a time. When we are done with that group, we print it out before starting on the next group:
$ awk -v i=0 'NR==1{name=$1} $1==name{a[i]=$0;b[i++]=$3;tot+=$3+0;next} {for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1} END{for (j=0;j<i;j++){print a[j],b[j]/tot}}' file
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
How it works
-v i=0
This initializes the variable i to zero.
NR==1{name=$1}
For the first line, set the variable name to the first field, $1. This is the name of the group.
$1==name {a[i]=$0; b[i++]=$3; tot+=$3+0; next}
If the first field matches name, then save the whole line into array a and save the value of column (field) three into array b. Increment the variable tot by the value of the third field. Then, skip the rest of the commands and jump to the next line.
for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1
If we get to this line, then we are at the start of a new group. Print out all the values for the old group and initialize the variables for the start of the next group.
END{for (j=0;j<i;j++){print a[j],b[j]/tot}}
After we get to the last line, print out what we have for the last group.
awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' File
Example:
sdlcb#Goofy-Gen:~/AMD$ cat ff
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
sdlcb#Goofy-Gen:~/AMD$ awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' ff
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
Logic: update an array (a[]), indexed by the first column, with the running total for each group. Save the complete line in array b[], to be used at the end for printing. Similarly, store the first and third column values of each line in arrays c[] and d[]. At the end, loop over all the lines processed, printing each line as is, followed by its proportion value.

predefined preemption points among processes

I have forked many child processes and assigned a priority and a core to each of them. Process A executes with a period of 3 sec and process B with a period of 6 sec. I want them to execute in such a way that higher priority processes preempt lower priority ones only at predefined points, and I tried to achieve this with semaphores. I have used the same code snippet within the 2 processes, with different array values in each.
'bubblesort_desc()' sorts the array in descending order and prints it. 'bubblesort_asc()' sorts in ascending order and prints.
while(a<3)
{
printf("time in sort1.c: %d %ld\n", (int)request.tv_sec, (long int)request.tv_nsec);
int array[SIZE] = {5, 1, 6 ,7 ,9};
semaphore_wait(global_sem);
bubblesort_desc(array, SIZE);
semaphore_post(global_sem);
semaphore_wait(global_sem);
bubblesort_asc(array, SIZE);
semaphore_post(global_sem);
semaphore_wait(global_sem);
a++;
request.tv_sec = request.tv_sec + 6;
request.tv_nsec = request.tv_nsec; //if i add 1ms here like an offset to the lower priority one, it works.
semaphore_post(global_sem);
semaphore_close(global_sem); //close the semaphore file
//sleep for the rest of the time after the process finishes execution until the period of 6
clk = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &request, NULL);
if (clk != 0 && clk != EINTR)
printf("ERROR: clock_nanosleep\n");
}
I get the output like this whenever two processes get activated at the same time. For example at time units of 6, 12,..
time in sort1.c: 10207 316296689
time now in sort.c: 10207 316296689
9
99
7
100
131
200
256
6
256
200
5
131
100
99
1
1
5
6
7
9
One process is not supposed to preempt the other while one set of sorted list is printing. But it's working as if there are no semaphores. I defined semaphores as per this link: http://linux.die.net/man/3/pthread_mutexattr_init
Can anyone tell me what can be the reason for that? Is there a better alternative than semaphores?
It's printf that's causing the ambiguous output. If the results are printed without '\n', we get a more accurate result. But it's always better to avoid printf statements in real-time applications. I used trace-cmd and kernelshark to visualize the behaviour of the processes.