How to append column to existing file in awk? - awk

I have a file named bt.B.1.log that looks like this:
.
.
.
Time in seconds = 260.37
.
.
.
Compiled procs = 1
.
.
.
Time in seconds = 260.04
.
.
.
Compiled procs = 1
.
.
.
and so on for 40 records of Time in seconds and Compiled procs (dots represent useless lines).
How do I add a single column with the value of Compiled procs (which is 1) to the result of the following two commands:
This prints the average of Time in seconds values (thanks to dawg for this one)
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
Desired output:
260.20 1
This prints all of the ocurrences of Time in seconds, but there is a small problem with it; it is printing an extra blank line at the beginning of the list.
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
Desired output:
260.37 1
260.04
.
.
.
In both cases I need the value of Compiled procs to appear only once, preferrably in the first line, and no use of intermediate files.
What I managed to do so far prints all values of Time in seconds with the Compiled procs column appearing in every line and with a strange identation:
awk '/seconds/ {printf $5} {printf " "} /procs/ {print $4}' bt.B.1.log > t1.dat
Please help!
UPDATE
Contents of file bt.B.1.log:
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 16:40:51--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.37
Total processes = 1
Compiled procs = 1
Mop/s total = 2696.83
Mop/s/process = 2696.83
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = (none)
Please send the results of this run to:
NPB Development Team
Internet: npb#nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 16:45:14--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 16:58:50--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.04
Total processes = 1
Compiled procs = 1
Mop/s total = 2700.25
Mop/s/process = 2700.25
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = (none)
Please send the results of this run to:
NPB Development Team
Internet: npb#nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 17:03:12--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
There are 40 entries in the log, but I've provided only 2 for abbreviation purposes.

To fix the first issue, replace:
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
with:
awk 'BEGIN { FS = "[ \t]*=[ \t]*" } /Time in seconds/ { s += $2; c++ } /Compiled procs/ { if (! CP) CP = $2 } END { print s/c, CP }' bt.B.1.log >t1avg.dat
A potential minor issue is that 260.205 1 might be output but the question does not address this as a weakness of the given script. Rounding with something like printf "%.2f %s\n", s/c, CP gives 260.21 1 though. To truncate the extra digit, use something like printf "%.2f %s\n", int (s/c * 100) / 100, CP.
To fix the second issue, replace:
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
with:
awk 'BEGIN { FS = "[ \t]*[=][ \t]" } /Time in seconds/ { printf "%s", $2 } /Compiled procs/ { if (CP) { printf "\n" } else { CP = $2; printf " %s\n", $2 } }' bt.B.1.log > t1.dat
BTW, the "strange indentation" is a result of failing to output a newline when using printf and failing to filter unwanted input lines from processing.

Related

Successive averaging of repeating data but different number of lines

I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then, the "real data" follows. As I explained, the number of lines of "real data" is designated in the second integer of the header line.
Accordingly, the total number of lines for each data block will be number_of_lines+1. In this example, the total number of lines for data block 1 is 4, and data block 2 costs 6 lines...
This format repeats all the way up to 10000 number of data blocks in my current data, but I can provide this 10000 as a variable in the bash or awk script as an input value. I know the total number of data blocks.
Now, what I wish to do is, I want to get the average of data of each two columns and print it out with data block ID number and a total number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns with 6 decimal places format. The result will have 10000 lines, and each line will have ID, number of lines, avg of column 1 of data, and avg of column 2 of data for each data block. The result of this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash. These are already answered in StackOverflow a lot of times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to this using awk or bash. But I really have no clue how can I approach and even start to get this kind of average of multiple repeating data blocks, but with a different number of lines for each data block.
I was trying based on awk, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how can I use NR or FNR for the average of data with a varying number of total lines of data, for each data block.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
If you have gnu awk then use OFMT for getting fixed size decimal numbers like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
s1 += $1
s2 += $2
next
}
{
if (id) {
print id, num, s1/num, s2/num
s1 = s2 = 0
}
id = $1
num = $2
}
END {
print id, num, s1/num, s2/num
}' file
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none"
{
id = $1;
qnt = $2;
s1 = s2 = line = 0;
next
}
{
s1 += $1;
s2 += $2;
++line
}
line == qnt
{
printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt;
qnt="none"
}
After a data block is processed (or at the beginning), record header info.
Otherwise, add to sum and print the result when we've done with all lines in this block.

awk need epoch days logic for data transformation

In excel when I give in the cell A1=1/1/1970 and A2=today(), A2 resolves to 17-Apr-2020.
And when I subtract A2 with A1, I get 18369 which is the epoch days for my data transformation logic.
I'm trying to simulate the same functionality in awk, but not getting the whole number
$ awk ' BEGIN { t=strftime("%F",systime()); print t; gsub("-"," ",t); t2=mktime(t " 0 0 0" ); print t2/(24*60*60) } '
2020-04-17
18368.8
$
Is this translation for awk correct?. What is the reason for 0.2 difference in my awk code. How to fix it.
Converting my comment to answer so that solution is easy to find for future visitors.
You are seeing this difference due to timezone offset between your current local timezone and UTC.
On more recent gnu awk version you can use this to pass an additional utc-flag to mktime function:
awk 'BEGIN {t=strftime("%F",systime()); print t;
gsub("-"," ",t); t2=mktime(t " 0 0 0", 1); print t2/(24*60*60)}'
However if you are on older awk versions then use this work-around to get time computations in UTC:
TZ=UTC awk 'BEGIN {t=strftime("%F",systime()); print t;
gsub("-"," ",t); t2=mktime(t " 0 0 0"); print t2/(24*60*60) } '
The epoch days involve calculating how many multiples of "86400" seconds have passed since 1-Jan-1970. We have a boundary issue in this calculation, i.e we should not take 1st Jan, 1970 as 1 epoch day because the particular day has not completed.
That is the reason the java epoch days function returns 0 for 1-Jan-1970
java.time.LocalDate.parse("1970-01-01").toEpochDay
res2: Long = 0
Now, with this for 17-Apr-2020, we get
java.time.LocalDate.parse("2020-04-17").toEpochDay
res3: Long = 18369
Let's try to get the same answer using awk.
awk -v dt="2020-04-17" '
function epoch_days(dta)
{
yr=dta[0];
for(i=1970;i<strftime("%Y",mktime(dta[1] " 12 31 0 0 0"));i++)
e1+=strftime("%j",mktime(i " 12 31 0 0 0"))+0;
e1+=strftime("%j",mktime(dta[1] " " dta[2] " " dta[3] " 0 0 0"))-1;
return e1
}
BEGIN {
split(dt,dta,"-");
print epoch_days(dta) ;
}'
18369
Lets cross verify for current date i.e 22-Aug-2021
java.time.LocalDate.now
res5: java.time.LocalDate = 2021-08-22
java.time.LocalDate.now.toEpochDay
res6: Long = 18861
And our awk code returns
awk '
function epoch_days(dta)
{
yr=dta[0];
for(i=1970;i<strftime("%Y",mktime(dta[1] " 12 31 0 0 0"));i++)
e1+=strftime("%j",mktime(i " 12 31 0 0 0"))+0;
e1+=strftime("%j",mktime(dta[1] " " dta[2] " " dta[3] " 0 0 0"))-1;
return e1
}
BEGIN {
dt=systime(); # 22-Aug-2021
dt2=strftime("%F",dt);
split(dt2,dta,"-");
print epoch_days(dta) ;
}'
18861

Awk Script to process data from a trace file

I have a table (.tr file) with different rows (events).
**Event** **Time** **PacketLength** PacketId
sent 1 100 1
dropped 2 100 1
sent 3 100 2
sent 4.5 100 3
dropped 5 100 2
sent 6 100 4
sent 7 100 5
sent 8 100 6
sent 10 100 7
And I would like to create a new table as the following and I don't know how to it in AWK.
**SentTime** **PacketLength Dropped**
1 100 Yes
3 100 Yes
4.5 100
6 100
7 100
8 100
10 100
I have a simple code to find dropped or sent packets, time and id but I do not know how to create a column in my table with the results for dropped packets.
BEGIN{}
{
Event = $1;
Time = $2;
Packet = $6;
Node = $10;
id = $11;
if (Event=="s" && Node=="1.0.1.2"){
printf ("%f\t %d\n", $2, $6);
}
}
END {}
You have to save all the information in an array to postprocess it at the end of the file. Obviously, if the file is huge, this could cause memory problems.
BEGIN {
template="#sentTime\t#packetLength\t#dropped";
}
{
print $0;
event = $1;
time = $2;
packet_length = $3;
packet_id = $4;
# save all the info in an array
packet_info[packet_id] = packet_info[packet_id] "#" packet_length "#" time "#" event;
}
END {
# traverse the information of the array
for( time in packet_info )
{
print "the time is: " time " = " packet_info[time];
# for every element in the array (= packet),
# the data has this format "#100#1#sent#100#2#dropped"
split( packet_info[time], info, "#" );
# info[2] <-- 100
# info[3] <-- 1
# info[4] <-- sent
# info[5] <-- 100
# info[6] <-- 2
# info[7] <-- dropped
line = template;
line = gensub( "#sentTime", info[3], "g", line );
line = gensub( "#packetLength", info[2], "g", line );
if( info[4] == "dropped" )
line = gensub( "#dropped", "yes", "g", line );
if( info[7] == "dropped" )
line = gensub( "#dropped", "yes", "g", line );
line = gensub( "#dropped", "", "g", line );
print line;
} # for
}
I would say...
awk '/sent/{pack[$4]=$2; len[$4]=$3}
/dropped/{drop[$4]}
END {print "Sent time", "PacketLength", "Dropped";
for (p in pack)
print pack[p], len[p], ((p in drop)?"yes":"")
}' file
This stores the packages in pack[], the lengths in len[] and the dropped in drop[], so that they are fetched later on.
Test
$ awk '/sent/{pack[$4]=$2; len[$4]=$3} /dropped/{drop[$4]} END {print "Sent time", "PacketLength", "Dropped"; for (p in pack) print pack[p], len[p], ((p in drop)?"yes":"")}' a
Sent time PacketLength Dropped
1 100 yes
3 100 yes
4.5 100
6 100
7 100
8 100
10 100

How substract millisecond with AWK - script

I'm trying to create an awk script to subtract milliseconds between 2 records joined-up for example:
By command line I might do this:
Input:
06:20:00.120
06:20:00.361
06:20:15.205
06:20:15.431
06:20:35.073
06:20:36.190
06:20:59.604
06:21:00.514
06:21:25.145
06:21:26.125
Command:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input
I'll obtain this:
06:20:00.120 06:20:00.361
06:20:15.205 06:20:15.431
06:20:35.073 06:20:36.190
06:20:59.604 06:21:00.514
06:21:25.145 06:21:26.125
To substract milliseconds properly:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input| awk -F':| ' '{print $3, $6}'
And to avoid negative numbers:
awk '{if ($2<$1) sub(/00/, "60",$2); print $0}'
awk '{$3=($2-$1); print $3}'
The goal is get this:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
And finally and average.
I might perform this but command by command. I dunno how to place this into a script.
Please need help.
Using awk:
awk '
BEGIN { cmd = "date +%s.%N -d " }
NR%2 {
cmd $0 | getline var1;
next
}
{
cmd $0 | getline var2;
var3 = var2 - var1;
print "Call " ++i, var3 " ms"
}
' file
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
One way using awk:
Content of script.awk:
## For every input line.
{
## Convert formatted dates to time in miliseconds.
t1 = to_ms( $0 )
getline
t2 = to_ms( $0 )
## Calculate difference between both dates in miliseconds.
tr = (t1 >= t2) ? t1 - t2 : t2 - t1
## Print to output with time converted to a readable format.
printf "Call %d %s ms\n", ++cont, to_time( tr )
}
## Convert a date in format hh:mm:ss:mmm to miliseconds.
function to_ms(time, time_ms, time_arr)
{
split( time, time_arr, /:|\./ )
time_ms = ( time_arr[1] * 3600 + time_arr[2] * 60 + time_arr[3] ) * 1000 + time_arr[4]
return time_ms
}
## Convert a time in miliseconds to format hh:mm:ss:mmm. In case of 'hours' or 'minutes'
## with a value of 0, don't print them.
function to_time(i_ms, time)
{
ms = int( i_ms % 1000 )
s = int( i_ms / 1000 )
h = int( s / 3600 )
s = s % 3600
m = int( s / 60 )
s = s % 60
# time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." ms
time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." sprintf( "%03d", ms )
return time
}
Run the script:
awk -f script.awk infile
Result:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.910 ms
Call 5 0.980 ms
If you're not tied to awk:
to_epoch() { date -d "$1" "+%s.%N"; }
count=0
paste - - < input |
while read t1 t2; do
((count++))
diff=$(printf "%s-%s\n" $(to_epoch "$t2") $(to_epoch "$t1") | bc -l)
printf "Call %d %5.3f ms\n" $count $diff
done

calculate the difference from flat file

I have a text file and the last 2 lines look like this...
Uptime: 822832 Threads: 32 Questions: 13591705 Slow queries: 722 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.518
Uptime: 822893 Threads: 31 Questions: 13592768 Slow queries: 732 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.618
How do I find the difference between the two values of each parameter?
The expected output is:
61 -1 1063 10 0 0 0 0.1
In other words I will like to deduct the current uptime value from the earlier uptime.
Find the difference between the threads and Questions and so on.
The purpose of this exercise is to watch this file and alert the user when the difference is too high. For e.g. if the slow queries are more than 500 or the "Questions" parameter is too low (<100)
(It is the MySQL status but has nothing to do with it, so mysql tag does not apply)
Just a slight variation on ghostdog74's (original) answer:
tail -2 file | awk ' {
gsub(/[a-zA-Z: ]+/," ")
m=split($0,a," ");
for (i=1;i<=m;i++)
if (NR==1) b[i]=a[i]; else print a[i] - b[i]
} '
here's one way. tail is used to get the last 2 lines, especially useful in terms of efficiency if you have a big file.
tail -2 file | awk '
{
gsub(/[a-zA-Z: ]+/," ")
m=split($0,a," ")
if (f) {
for (i=1;i<=m;i++){
print -(b[i]-a[i])
}
# to check for Questions, slow queries etc
if ( -(b[3]-a[3]) < 100 ){
print "Questions parameter too low"
}else if ( -(b[4]-a[4]) > 500 ){
print "Slow queries more than 500"
}else if ( a[1] - b[1] < 0 ){
print "mysql ...... "
}
exit
}
for(i=1;i<=m;i++ ){ b[i]=a[i] ;f=1 }
} '
output
$ ./shell.sh
61
-1
1063
10
0
0
0
0.1
gawk:
BEGIN {
arr[1] = "0"
}
length(arr) > 1 {
print $2-arr[1], $4-arr[2], $6-arr[3], $9-arr[4], $11-arr[5], $14-arr[6], $17-arr[7], $22-arr[8]
}
{
arr[1] = $2
arr[2] = $4
arr[3] = $6
arr[4] = $9
arr[5] = $11
arr[6] = $14
arr[7] = $17
arr[8] = $22
}