sum 4th field data between the pattern - scripting

Suppose my data is:
*dnet *1234 1.2
1 port *12 2.3
3 port1 *34 0.2
7 *15 0.1
*dnet *234 0.2
2 *12 0.1
4 *123 *234 1.2
Fields are separated by spaces.
I want the sum of the 4th fields of the lines inside each *dnet block. Some lines have a 4th field and some do not. I want a separate 4th-field sum for each *dnet.
I tried using awk but could not get it to work. I would be thankful if someone could help.
The output for the above would look like:
*dnet *1234 1.2 2.5
*dnet *234 0.2 1.2

Commented, slightly simplified, version of the comment...
awk '
# look for header line
$1=="*dnet" {
# print any previously calculated sum
if (header) print header, sum
# reset sum for next block of lines
sum = 0
# save new header line
header = $0
# skip remaining actions
next
}
# if we get here, we know this is not a header line
# if there is a 4th field, add it to the sum
$4 {
sum += $4
}
END {
# print the final sum
if (header) print header, sum
}
' datafile

Related

awk script to calculate delay from trace file ns-3

I want to measure time between transmitted and received packets in below trace file.
Input:
+ 0.01 /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/TxQueue/Enqueue
- 0.01 /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/TxQueue/Dequeue
r 0.0200001 /NodeList/0/DeviceList/2/$ns3::PointToPointNetDevice/MacRx
+ 0.11 /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/TxQueue/Enqueue
- 0.11 /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/TxQueue/Dequeue
r 0.12 /NodeList/0/DeviceList/2/$ns3::PointToPointNetDevice/MacRx
+ 0.12 /NodeList/0/DeviceList/3/$ns3::PointToPointNetDevice/TxQueue/Enqueue
- 0.12 /NodeList/0/DeviceList/3/$ns3::PointToPointNetDevice/TxQueue/Dequeue
r 0.120001 /NodeList/2/DeviceList/2/$ns3::PointToPointNetDevice/MacRx
Here + represents transmitted data and r represents received data. 2nd column in the trace file shows the time.
How can I measure time between r and + for the whole file using awk code?
The expected output can be as below:
Output:
0.0100001
0.01
0.000001
I'll be grateful if anyone helps.
I generated my own trace file, called trace, as follows:
+ 0.1 Stuff stuff and more stuff
- 0.2 Yet more stuff
r 0.4 Something new
+ 0.8 Something else, not so new
- 1.6 Jiggery
r 3.2 Pokery
+ 6.4 Higgledy Piggledy
Then, I would approach your question with awk as follows:
awk '/^\+/{tx=$2} /^r/{rx=$2; d=rx-tx; $1=$1 "(d=" d ")"} 1' trace
Sample Output
+ 0.1 Stuff stuff and more stuff
- 0.2 Yet more stuff
r(d=0.3) 0.4 Something new
+ 0.8 Something else, not so new
- 1.6 Jiggery
r(d=2.4) 3.2 Pokery
+ 6.4 Higgledy Piggledy
That says... "If you see a line starting with +, save the second field as variable tx. If you see a line starting with r, save the second field as variable rx. Calculate the difference between rx and tx and save it as d. Rebuild the first field of the line by appending (d=variable d) to the end of whatever it was. The 1 at the end tells awk to do its natural thing - i.e. print the line."
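If only the delays are wanted, as in the expected output of the question, a minimal sketch along the same lines could print rx - tx on each r line instead of rewriting the line. The function name tx_rx_delay is made up for illustration, and the sketch assumes, as the command above does, that every r line belongs to the most recent + line.

```shell
# Hypothetical helper: print the delay between each "r" line and the
# most recent "+" line. Reads the named files, or stdin if none are given.
tx_rx_delay() {
    awk '
        $1 == "+" { tx = $2 }          # remember the latest transmit time
        $1 == "r" { print $2 - tx }    # receive time minus that transmit time
    ' "$@"
}

# Usage with the generated trace data from above:
tx_rx_delay <<'EOF'
+ 0.1 Stuff stuff and more stuff
- 0.2 Yet more stuff
r 0.4 Something new
+ 0.8 Something else, not so new
- 1.6 Jiggery
r 3.2 Pokery
+ 6.4 Higgledy Piggledy
EOF
```

On that data it prints 0.3 and 2.4. Very small differences may come out in exponent notation under awk's default number formatting; printf with an explicit format avoids that.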

combine specific output and display in specific header

I am trying to group lines by the text to the left of the | and output that text in the "Gene" column. The number of lines in each group goes in the "Targets" column, the average of $3 in the "Average Depth" column, and the average of the numbers to the right of the = in the "Average GC" column. I am having some trouble doing this and need some expert help. Thank you :).
input
chr10:79793602-79793721 RPS24|gc=59.7 150.3
chr10:79795083-79795202 RPS24|gc=41.2 111.4
chr10:79797665-79797784 RPS24|gc=37 69.8
chr11:119077113-119077232 CBL|gc=67.9 27.3
chr11:119103143-119103420 CBL|gc=41.9 240.3
chr11:119142430-119142606 CBL|gc=42.6 177.1
chr11:119144563-119144749 CBL|gc=46.2 324.4
current output
Gene TargetsAverage DepthAverage GC
gc 803 0.0 0.0
desired output
ID times depth GC
RPS24 3 110.5 46.0
CBL 4 192.3 49.7
awk
awk -F'[ |=]' '
{
id[$2] += $4
value[$2] += $5
occur[$2]++
}
END{
printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
for (i in id)
printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' input
@Chris - Your editing of the question has not been very helpful, but I can confirm that, except for the first printf statement, the program runs as expected, which is in accordance with the "desired output". I have used three different awks; the only difference between the outputs is (as expected) the ordering of the rows. You may have to be more specific about the version of awk you are using.
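As a rough sketch of how the two remaining issues could be handled, the variant below widens the header and row formats so the longer column titles fit, and records the order in which genes first appear so the rows do not depend on awk's array traversal order. The function name gene_table, the order array, and the chosen column widths are assumptions for illustration, not part of the original program.

```shell
# Hypothetical wrapper around the awk program from the question.
gene_table() {
    awk -F'[ |=]' '
        !($2 in occur) { order[++n] = $2 }   # remember first-seen order of genes
        {
            gc[$2]    += $4                  # number after "gc="
            depth[$2] += $5                  # last column
            occur[$2]++                      # lines seen for this gene
        }
        END {
            printf "%-8s%8s%15s%12s\n", "Gene", "Targets", "Average Depth", "Average GC"
            for (k = 1; k <= n; k++) {
                g = order[k]
                printf "%-8s%8d%15.1f%12.1f\n", g, occur[g], depth[g] / occur[g], gc[g] / occur[g]
            }
        }
    ' "$@"
}
```

It would be called as gene_table input, with input being the file shown in the question.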
Solution in TXR:
$ txr table2.txr data
ID times depth GC
RPS24 3 110.5 46.0
CBL 4 192.3 49.7
Code in table2.txr:
@(output)
ID times depth GC
@(end)
@(repeat)
@ (all)
@nil:@nil-@nil @id|@nil
@ (and)
@ (collect :gap 0)
@nil:@nil-@nil @id|gc=@gc @dep
@ (set (gc dep) (@(tofloat gc) @(tofloat dep)))
@ (end)
@ (bind n @(length gc))
@ (bind avg-gc @(format nil "~,1f"
(/ [reduce-left + gc] n)))
@ (bind avg-dep @(format nil "~,1f"
(/ [reduce-left + dep] n)))
@ (output)
@{id 9} @{n 6} @{avg-dep 13} @{avg-gc}
@ (end)
@ (end)
@(end)
What lumps together the entries with the same ID is the two parallel branches of the all directive. The first branch loosely matches the pattern of a single line, extracting the ID, binding it to the id variable. This variable is visible to the second branch, where its appearance introduces a back-referencing constraint. Here, multiple consecutive (:gap 0) lines are matched at the same position (thus including the one which was matched in the first branch of the all). Only lines with the matching id are processed; the collect ends when a non-matching id is encountered (due to the :gap 0 constraint) or when the input ends.

awk to print out lines for cumulative sum

I want to print out lines of a file until the cumulative sum of the third field is greater than 0.99, then print out only the first line for which the cumulative sum is greater than or equal to 0.99. However, if field 2 of the first line for which cumulative sum of field 3 is greater than or equal to 0.99 matches field 2 of the next line, then both lines should be printed.
My file looks like:
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
rs934367 -4.3376 3.06953e-06
Desired output:
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
In the above example, the cumulative sum of field 3 exceeds 0.99 at line 2, but I print line 3 as well since field 2 of lines 2 and 3 are equal. If these fields had not been equal, I would print out lines 1 and 2 only.
I have the following command, which works for the cumulative sum, but not for comparing field 2 between adjacent lines:
awk '{sum+=$3;print $0;if(sum>=0.99)exit}' file
Can someone modify this to incorporate the above requirements?
The following should work according to your specifications:
Given file containing
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
rs934367 -4.3376 3.06953e-06
The following awk-script
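awk 'hit {if ($2 == key) print; exit} {print; sum += $3} sum >= 0.99 {hit = 1; key = $2}' file
will produce
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
The change in the script consists of remembering field 2 of the line on which the cumulative sum first reaches 0.99 (key) and setting a flag (hit). On the following line, the script prints the line only if its field 2 equals key, and then exits either way. If the two fields differ, only the lines up to and including the one that reached 0.99 are printed.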

How to do multi-row calculations using awk on a large file

I have a big file that is sorted on the first word. I need to add a new column to each line with the proportional value: line value/total value for that group; the group is determined by the first column. In the example below, the total of group "a" is 100 and hence each line gets a proportion of that total. The total of group "the" is 1000 and hence each line gets its proportion of the total of that group.
I need an awk script to do this.
Sample File:
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
Output:
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 .25
the dog 750 .75
zisty cool 20 1
You describe this as a "big file." Consequently, this solution tries to save memory: it holds no more than one group in memory at a time. When we are done with that group, we print it out before starting on the next group:
$ awk -v i=0 'NR==1{name=$1} $1==name{a[i]=$0;b[i++]=$3;tot+=$3+0;next} {for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1} END{for (j=0;j<i;j++){print a[j],b[j]/tot}}' file
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
How it works
-v i=0
This initializes the variable i to zero.
NR==1{name=$1}
For the first line, set the variable name to the first field, $1. This is the name of the group.
$1==name {a[i]=$0; b[i++]=$3; tot+=$3+0; next}
If the first field matches name, then save the whole line into array a and save the value of column (field) three into array b. Increment the variable tot by the value of the third field. Then, skip the rest of the commands and jump to the next line.
for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1
If we get to this line, then we are at the start of a new group. Print out all the values for the old group and initialize the variables for the start of the next group.
END{for (j=0;j<i;j++){print a[j],b[j]/tot}}
After we get to the last line, print out what we have for the last group.
awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' File
Example:
sdlcb@Goofy-Gen:~/AMD$ cat ff
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
sdlcb@Goofy-Gen:~/AMD$ awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' ff
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
Logic: for each line, add the third column to a[], indexed by the first column. Save the complete line in b[] for printing at the end. Similarly, save the first and third columns of each line in c[] and d[]. At the end, a for loop over all processed lines prints each line as it was, followed by its proportion value.
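Both answers buffer lines in memory, one group at a time in the first and the whole file in the second. A different technique, sketched here under the assumption that the input is a regular file that can be read twice (not a pipe), is the two-pass idiom: the first pass only accumulates one total per group, and the second pass prints each line with its proportion. The function name proportions is hypothetical.

```shell
# Hypothetical helper: append value/group-total to every line of a file.
proportions() {
    # The same file is passed twice; NR == FNR holds only while reading it the first time.
    awk '
        NR == FNR { tot[$1] += $3; next }    # pass 1: total per group
        { print $0, $3 / tot[$1] }           # pass 2: line plus its proportion
    ' "$1" "$1"
}
```

Called as proportions file, it keeps only one number per group in memory, at the cost of reading the file twice. It also does not rely on the file being sorted.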

awk - Rounding to 2 decimal places in subtotals

Short version:
Is there a way to tell awk to round to 2 decimal places during the consolidation, not during the printing?
Long version:
I have an incoming file in the format below. I need to get the net balance per currency and, if the net is NOT zero, print the result in two columns: net balances less than zero go to the neg_bal column and positive balances go to the pos_bal column. For some reason, the USD row is still being printed despite netting to zero.
JPY||170
JPY||40
USD|-42.61|
USD|-166.27
USD||42.61|
GBP|-20|
EUR||18.7
USD||174.6|
USD|-8.33||
EUR|-30.6|
GBP||100
JPY|-210|
Here is the code I am using:
#!/bin/awk -f
BEGIN {
FS="|";
}
{
bal[$1]+=$2+$3
ccy[$1]=$1
}
END {
for (i in ccy)
{
if (bal[i] >0 )
{
pos_bal = bal[i]
neg_bal = 0
}
else
{
neg_bal = bal[i]
pos_bal = 0
}
if (bal[i] != 0 )
{
printf "%s|%.2f|%.2f\n",ccy[i],neg_bal,pos_bal
}
}
}
Result (notice JPY is not displayed since it nets to zero):
awk]$ ./scr1 file1
EUR|-11.90|0.00
USD|0.00|0.00
GBP|0.00|80.00
If I increase the decimal places to, say, 20, I see that the USD net amount is not really zero. (Why is this, btw? Even Excel gives a net of -1.59872E-14.)
awk]$ ./scr1 file1
EUR|-11.90000000000000213163|0.00000000000000000000
USD|0.00000000000000000000|0.00000000000001243450
GBP|0.00000000000000000000|80.00000000000000000000
Is there a way to tell awk to round to 2 decimal places during the
consolidation, not during the printing?
Yes: multiply by 100 and convert to int. Then divide by 100 when you're ready to print.
(In other words, count pennies instead of dollars.)
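The residue appears because binary floating point cannot represent most decimal fractions, such as 42.61, exactly, so their sums can be off by a tiny amount. A minimal sketch of the pennies idea follows. The names net_balances and cents are made up for illustration, and the sketch rounds to the nearest cent rather than truncating, since int() truncates and a product such as 1.15 * 100 may land slightly below the intended integer.

```shell
# Hypothetical helper: net balance per currency, accumulated in whole cents.
net_balances() {
    awk -F'|' '
        # Round an amount to whole cents, symmetrically for negative values.
        function cents(x) {
            x += 0                               # force numeric; empty field becomes 0
            return (x < 0) ? -int(-x * 100 + 0.5) : int(x * 100 + 0.5)
        }
        { bal[$1] += cents($2) + cents($3) }     # integer sums are exact
        END {
            for (c in bal) {
                if (bal[c] == 0) continue        # skip currencies that net to zero
                neg = (bal[c] < 0) ? bal[c] : 0
                pos = (bal[c] > 0) ? bal[c] : 0
                printf "%s|%.2f|%.2f\n", c, neg / 100, pos / 100
            }
        }
    ' "$@"
}
```

Called as net_balances file1 on the data from the question, it prints only the EUR and GBP lines; USD and JPY both net to exactly zero cents. The order of the output lines depends on the awk implementation.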