awk, calculate the average for different intervals of time

Can anybody teach me how to calculate the average of the differences between time values? For example:
412.00 560.00
0 0
361.00 455.00 561.00
0 0
0 0
0 0
237.00 581.00
425.00 464.00
426.00 520.00
0 0
In the normal case, you would sum all of the numbers and divide by the total count:
sum/NR
The challenge here:
the number of columns is dynamic, meaning not all lines have the same number of columns.
To calculate the average, say we have this: 361.00 455.00 561.00
The calculation is:
((455-361) + (561 - 455))/2
So the output I'm expecting is like this:
total_time divided_by average
148 1 148
0 1 0
200 2 100
0 1 0
0 1 0
0 1 0
344 1 344
: : :
: : :
: : :
I'm trying to use awk, but I'm stuck...

The intermediate values on lines with three or more time values cancel out -- only the first value, the last value, and the count matter. To see this from your example, note that:
((455-361) + (561 - 455))/2 = (561 - 361) / 2
Thus, you really just need to do something like
cat time_data |
awk '{ printf("%f\t%d\t%f\n", ($NF - $1), (NF - 1), ($NF - $1) / (NF - 1)) }'
For your sample data, this gives the results you specify (although not formatted as nicely as you present it).
This assumes that the time values are sorted on the lines. If not, calculate the maximum and minimum values and replace the $NF and $1 uses, respectively.
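If the time values are not sorted, here is a minimal sketch of that min/max variant (same whitespace-separated input assumed; time_data is a placeholder filename):

```shell
# Scan each line for its minimum and maximum instead of
# relying on $1 and $NF being the extremes.
awk '{
    min = max = $1
    for (i = 2; i <= NF; i++) {
        if ($i + 0 < min + 0) min = $i
        if ($i + 0 > max + 0) max = $i
    }
    printf("%.2f\t%d\t%.2f\n", max - min, NF - 1, (max - min) / (NF - 1))
}' time_data
```

For an unsorted line like 561.00 361.00 455.00 this still prints 200.00, 2 and 100.00, the same result as the sorted case.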

A bash script:
#!/bin/bash
(echo "total_time divided_by average"
while read line
do
arr=($line)
count=$((${#arr[@]}-1))
total=$(bc<<<${arr[$count]}-${arr[0]})
echo "$total $count $(bc<<<$total/$count)"
done < f.txt ) | column -t
Output
total_time divided_by average
148.00 1 148
0 1 0
200.00 2 100
0 1 0
0 1 0
0 1 0
344.00 1 344
39.00 1 39
94.00 1 94

Related

Awk multiply integers in geometric sequence in each cell

I have a dataframe like this
0 1 0 1 0 0....
1 1 1 1 0 0
0 0 1 1 0 1
.
.
.
And I want to multiply each of them with a geometric sequence
1, 10, 100, 1000, 10000 ... 10^(n-1)
so the result will be
0 10 0 1000 0 0....
1 10 100 1000 0 0
0 0 100 1000 0 100000
.
.
.
I have tried with
awk '{n=0 ; x=0 ; for (i = 1; i <= NF; i++) if ($i == 1) {n=10**i ; x = x+n } print x }' test.txt
But the results were not what I expected
With GNU awk:
awk '{for (i=1; i<=NF; i++){if($i==1){n=10**(i-1); $i=$i*n}} print}' test.txt
Output:
0 10 0 1000 0 0
1 10 100 1000 0 0
0 0 100 1000 0 100000
Note: In this answer, we always assume single digits per column
There are a couple of things you have to take into account:
If you have a sequence given by:
a b c d e
Then the final number will be edcba
awk is not aware of integers, but knows only floating point numbers, so there is a maximum number it can reach from an integer perspective, and that is 2^53 (see biggest integer that can be stored in a double). This means that multiplication is not the way forward. If we don't use awk, this is still valid for integer arithmetic, as the maximum value is 2^64-1 in the unsigned version.
Having said that, it is better to just write the number with n places, padding with 0. For example, if you want to compute 3 × 10^4, you can do:
awk 'BEGIN{printf "%0.*d",4+1,3}' | rev
Here we make use of rev to reverse the strings (00003 → 30000)
Solution 1: In the OP, the code alludes to the fact that the final sum is requested (a b c d e → edcba). So we can just do the following:
sed 's/ //g' file | rev
awk -v OFS="" '{$1=$1}1' file | rev
If you want to get rid of possible leading zeros you can do:
sed 's/ //g;s/^0*//' file | rev
Solution 2: If the OP only wants the multiplied columns as output, we can do:
awk '{for(i=NF;i>0;--i) printf("%0.*d"(i==1?ORS:OFS),i,$i)}' file | rev
Solution 3: If the OP only wants the multiplied columns as output and the sum:
awk '{ s=$0; gsub(/ /,"",s); printf s OFS }
{ for(i=NF;i>0;--i) printf("%0.*d"(i==1?ORS:OFS),i,$i) }
' file | rev
What you wrote is absolutely not what you want. Your awk program parses each line of the input and computes only one number per line which happens to be 10 times the integer you would see if you were writing the 0's and 1's in reverse order. So, for a line like:
1 0 0 1 0 1
your awk program computes 10+0+0+10000+0+1000000=1010010. As you can see, this is the same as 10 times 101001 (100101 reversed).
To do what you want you can loop over all fields and modify them on the fly, multiplying each by the corresponding power of 10, as shown in the other answer.
Note: another awk solution, a bit more compact, but strictly equivalent for your inputs, could be:
awk '{for(i=1;i<=NF;i++) $i*=10**(i-1)} {print}' test.txt
The first block loops over the fields and modifies them on the fly by multiplying them by the corresponding power of 10. The second block prints the modified line.
As noted in another answer, there is a potential overflow issue with the pure arithmetic approach. If you have lines with many fields you could hit the maximum exact integer representable in floating point format. It could be that the strange 1024 values in the output you show are due to this.
If there is a risk of overflow, as suggested in the other answer, you could use another approach where the trailing zeroes are added not by multiplying by a power of 10, but by concatenating value 0 represented on the desired number of digits, something that printf and sprintf can do:
$ awk 'BEGIN {printf("%s%0.*d\n",1,4,0)}' /dev/null
10000
So, a GNU awk solution based on this could be:
awk '{for(i=1;i<=NF;i++) $i = $i ? sprintf("%s%0.*d",$i,i-1,0) : $i} 1' test.txt
How about not doing any math at all:
{m,n,g}awk '{ for(_+=_^=___=+(__="");_<=NF;_++) {
$_=$_ ( \
__=__""___) } } gsub((_=" "(___))"+",_)^+_'
=
1 0 0 0 10000 0 0 0 0 1000000000 10000000000
1 0 0 0 10000 0 0 10000000 0 0 10000000000
1 0 0 0 10000 100000 0 0 0 0 10000000000
1 0 0 1000 0 0 1000000 0 100000000 1000000000
1 0 0 1000 10000 0 0 0 0 1000000000 10000000000
1 0 100 0 0 0 1000000 10000000 0 0 10000000000
1 0 100 0 0 100000 1000000 10000000 100000000 1000000000
1 0 100 0 10000 0 1000000 0 100000000
1 0 100 1000 0 100000 0 0 0 0 10000000000
1 0 100 1000 10000 0 0 10000000
1 10 0 0 0 0 1000000 10000000 0 1000000000
1 10 0 1000 0 100000 0 0 100000000
1 10 0 1000 0 100000 0 0 100000000 1000000000 10000000000
1 10 0 1000 0 100000 0 10000000 100000000 1000000000
1 10 100 1000 10000 100000 0 0 0 1000000000

How to awk interval expression and grouping together?

I am trying to get the measure of CPU time spent on user tasks, system tasks, interrupt handling, IO wait etc. by parsing the below output of /proc/stat.
My intent is to retrieve the numerical values in the first line (the one that starts with "cpu ") into separate array elements indexed from 1 through N.
[kcube#myPc ~]$ cat /proc/stat
cpu 70508209 48325 12341967 18807644987 671141 0 11736 0 0 0
cpu0 4350458 319 868828 1175271469 23047 0 2397 0 0 0
cpu1 3944197 277 857728 1175822236 16462 0 1025 0 0 0
cpu2 3919468 538 924717 1175628294 136617 0 2270 0 0 0
cpu3 3763268 441 855219 1175968114 43631 0 733 0 0 0
cpu4 3551196 147 856029 1176198902 18392 0 851 0 0 0
cpu5 5394823 1806 997806 1174089493 120122 0 2056 0 0 0
cpu6 3425023 656 839042 1176324091 58718 0 3 0 0 0
cpu7 3167959 189 811389 1176654383 19218 0 2 0 0 0
cpu8 4454976 5046 625657 1175714502 10447 0 26 0 0 0
cpu9 5049813 5365 655732 1175082394 10511 0 30 0 0 0
cpu10 4746872 4727 630042 1175408141 10959 0 28 0 0 0
cpu11 5367186 4684 659408 1174759103 9992 0 23 0 0 0
cpu12 4744405 5940 704282 1175177246 149934 0 714 0 0 0
cpu13 4689816 5954 650193 1175439255 13494 0 5 0 0 0
cpu14 4872185 5479 699429 1175126266 16945 0 898 0 0 0
cpu15 5066558 6748 706459 1174981089 12643 0 669 0 0 0
I have below awk script.
[kcube#myPc ~]$ cat test.awk
{
if ( match($0,/cpu\s(\s[[:digit:]]+){10}$/, ary) ) {
print ary[0]
print ary[1]
}
}
This always gives me only the last numeric value of the first line in ary[1].
What I am looking for is to have like :
ary[1] = 70508209
ary[2] = 48325
.
.
so on
I have never used interval expressions and grouping together. I tried searching for answers but couldn't find one. Can someone help me out?
I'm using GNU Awk 4.0.2
$ cat tst.awk
match($0,/^cpu\s(\s[[:digit:]]+){10}$/,ary) {
print "bad match:", ary[1]
print "bad match:", ary[2]
}
match($0,/^cpu\s+([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)$/,ary) {
print "good match:", ary[1]
print "good match:", ary[2]
}
/^cpu\s/ && split($0,tmp,/[[:digit:]]+/,ary) {
print "good split:", ary[1]
print "good split:", ary[2]
}
$ awk -f tst.awk file
bad match: 0
bad match:
good match: 70508209
good match: 48325
good split: 70508209
good split: 48325
An interval expression defines how many repetitions of the previous expression must exist for the regexp to match; that is all. It is not part of populating capture groups: that is entirely up to the use of round brackets enclosing regexp segments. To do what you want, you need to either define explicit capture groups for each number, or use split() or similar to create the array based on a regexp that describes each entity you want captured.
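A minimal GNU awk demonstration of that repetition-overwrite behaviour, on a made-up line:

```shell
# gawk: the {3} interval matches three repetitions of the group,
# but each repetition overwrites the capture, so ary[1] ends up
# holding only the last one (" 3").
echo 'x 1 2 3' | gawk '{
    if (match($0, /^x( [0-9]+){3}$/, ary))
        print "captured:" ary[1]
}'
```

This prints "captured: 3": the whole interval matched, yet only the final repetition survives in the group.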
All of the above uses GNU awk - for the 3rd arg to match() and the 4th arg to split(). Note that you can also just do this with GNU awk for FPAT:
$ awk -v FPAT='[0-9]+' '/^cpu /{for (i=1; i<=NF; i++) print i, $i}' file
1 70508209
2 48325
3 12341967
4 18807644987
5 671141
6 0
7 11736
8 0
9 0
10 0
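For awks without FPAT, a portable sketch of the same loop (assuming the /proc/stat format shown above, where every field after the first is a number):

```shell
# POSIX awk: on the aggregate "cpu " line, fields 2..NF are the
# numbers, so print each with its 1-based index.
awk '/^cpu / { for (i = 2; i <= NF; i++) print i - 1, $i }' /proc/stat
```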

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at a time using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example If I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for this word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for row 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So an example looping the awk over row 1 of file 1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1
And finally looping over row 3 would return the numbers 9, 2 and 3
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the total of the numbers returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk as I could just do it separately
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example:
awk 'NR==FNR{a[$1]=a[$2]=a[$3]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work; at least you get the idea.
If you don't know how many columns in your file1, as you said, you want to do a loop:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
update
edit for the new requirement:
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of above one-liner: (take f1 and f2 as your input example file1 file2):
7
15
14
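The NR==FNR idiom in that one-liner, spelled out with comments (same logic, nothing new):

```shell
# While reading the first file argument (file2), NR==FNR is true,
# so only the lookup table is built; for each row of the second
# file (file1) the mapped column-6 values are summed.
awk '
NR == FNR {          # first file: the data table
    val[$1] = $6     # map each name to its 6th column
    next             # do not fall through to the second block
}
{                    # second file: the rows of names
    sum = 0
    for (i = 1; i <= NF; i++)
        sum += val[$i]
    print sum
}' file2 file1
```

With the sample file1 and file2 above, this prints 7, 15 and 14, one sum per row.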

Eliminate lines based on values in multiple columns

I'm trying to remove rows from a big table on the condition that one column has a certain value and another column has certain other values.
So far I've been trying this, but I guess I'm not combining the awk commands properly...
awk '$11 !="1"'| awk '$20==2 || $20==3' infile.txt >out.txt
The code is probably too simple but should work anyways..or not?
Thanks
edit:
This is what the table looks like
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended
5193751 AGTAGCTTGCGCGGA[C/T]GGGGGGGGG consensus_32 227 1 species 0 0 0 T 1 1 T not_recommended 0.75 F 0 -100 T not_recommended
5193254 TAAAAAAAAAAAAAA[G/T]ATTCATCC consensus_26 192 1 species 0 0 0 T 1 0 T not_recommended 0.726 F 0 -100 T neutral
So if I filter based on $11=1 and $20 being "neutral" or "not_recommended", I would get this:
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended
awk '$11!=1 && ($20==2 || $20==3)' infile.txt > out.txt
should do.
UPDATE: based on the input given, you should get two lines in the output for this condition
$ awk '$11==1 && ($20=="not_recommended" || $20=="neutral")' file
5193751 AGTAGCTTGCGCGGA[C/T]GGGGGGGGG consensus_32 227 1 species 0 0 0 T 1 1 T not_recommended 0.75 F 0 -100 T not_recommended
5193254 TAAAAAAAAAAAAAA[G/T]ATTCATCC consensus_26 192 1 species 0 0 0 T 1 0 T not_recommended 0.726 F 0 -100 T neutral
But I guess what you mean is that you want the negation of this, which is different from your original post:
$ awk '$11!=1 || ($20!="not_recommended" && $20!="neutral")' file
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended

Selecting columns using specific patterns then finding sum and ratio

I want to calculate the sum and ratio values from data below. (The actual data contains more than 200,000 columns and 45000 rows (lines)).
For clarity purposes I have given only a simple data format.
#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0
The columns need to be selected with certain criteria.
The column data which I consider is columns with "#O13" only. Below I have given the selected columns from above example.
BMR_49#O13 BMR_59#O13 BMR_10#O13
1 0 1
1 0 1
1 1 0
1 0 1
0 0 0
0 1 0
1 1 0
1 1 0
1 1 1
0 0 0
From the selected column, I want to calculate:
1) the sum of all the "1"s. In this example we get the value 16.
2) the number of rows containing at least one "1". In the above example there are 8 rows that contain at least one "1".
lastly,
3) the ratio of the total of all "1"s to the total lines containing a "1".
That is: (total of all "1"s)/(total rows with an occurrence of "1").
Example: 16/8
As a start, I tried with this command to select only the columns with "#O13"
awk '{for (i=1;i<=NF;i++) if (i~/#O13/); print ""}' $file2
Although this runs, it doesn't show the values.
This should do:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/#O13/) a[i];next} {f=0;for (i in a) if ($i) {s++;f++};if (f) r++} END {print "number of 1="s"\nrows with 1="r"\nratio="s/r}' file
number of 1=16
rows with 1=8
ratio=2
Some more readable:
awk '
NR==1{
for (i=1;i<=NF;i++)
if ($i~/#O13/)
a[i]
next
}
{
f=0
for (i in a)
if ($i=="1") {
s++
f++
}
if (f) r++
}
END {
print "number of 1="s \
"\nrows with 1="r \
"\nratio="s/r
}
' file