In Excel, when I enter A1=1/1/1970 and A2=TODAY(), A2 resolves to 17-Apr-2020.
And when I subtract A1 from A2, I get 18369, which is the epoch-day count my data transformation logic needs.
I'm trying to simulate the same functionality in awk, but I'm not getting a whole number:
$ awk ' BEGIN { t=strftime("%F",systime()); print t; gsub("-"," ",t); t2=mktime(t " 0 0 0" ); print t2/(24*60*60) } '
2020-04-17
18368.8
$
Is this translation to awk correct? What is the reason for the 0.2 difference in my awk code, and how do I fix it?
Converting my comment to an answer so that the solution is easy to find for future visitors.
You are seeing this difference due to the timezone offset between your current local timezone and UTC.
On more recent GNU awk versions, you can pass an additional utc-flag to the mktime function:
awk 'BEGIN {t=strftime("%F",systime()); print t;
gsub("-"," ",t); t2=mktime(t " 0 0 0", 1); print t2/(24*60*60)}'
However, if you are on an older gawk version, use this workaround to force the time computations into UTC:
TZ=UTC awk 'BEGIN {t=strftime("%F",systime()); print t;
gsub("-"," ",t); t2=mktime(t " 0 0 0"); print t2/(24*60*60) } '
Epoch days count how many whole multiples of 86400 seconds have passed since 1-Jan-1970. There is a boundary issue in this calculation: we should not count 1st Jan, 1970 as epoch day 1, because at its own midnight that day has not yet completed.
That is the reason the Java epoch-day function returns 0 for 1-Jan-1970:
java.time.LocalDate.parse("1970-01-01").toEpochDay
res2: Long = 0
Now, with this, for 17-Apr-2020 we get:
java.time.LocalDate.parse("2020-04-17").toEpochDay
res3: Long = 18369
Let's try to get the same answer using awk.
awk -v dt="2020-04-17" '
function epoch_days(dta,    i, e1)
{
    # Full years first: strftime("%j") of 31-Dec gives the day-of-year
    # of the last day, i.e. the length of that year (365 or 366).
    for (i = 1970; i < dta[1] + 0; i++)
        e1 += strftime("%j", mktime(i " 12 31 0 0 0")) + 0;
    # Then the partial year: day-of-year of the target date, minus 1
    # because the target day itself has not completed (so 1-Jan-1970 -> 0).
    e1 += strftime("%j", mktime(dta[1] " " dta[2] " " dta[3] " 0 0 0")) - 1;
    return e1
}
BEGIN {
    split(dt, dta, "-");
    print epoch_days(dta);
}'
18369
Let's cross-verify for the current date, i.e. 22-Aug-2021:
java.time.LocalDate.now
res5: java.time.LocalDate = 2021-08-22
java.time.LocalDate.now.toEpochDay
res6: Long = 18861
And our awk code returns
awk '
function epoch_days(dta,    i, e1)
{
    # Same helper as above: sum the full years, then the partial year.
    for (i = 1970; i < dta[1] + 0; i++)
        e1 += strftime("%j", mktime(i " 12 31 0 0 0")) + 0;
    e1 += strftime("%j", mktime(dta[1] " " dta[2] " " dta[3] " 0 0 0")) - 1;
    return e1
}
BEGIN {
    dt = systime();            # current time, 22-Aug-2021 when this was run
    dt2 = strftime("%F", dt);  # format as YYYY-MM-DD
    split(dt2, dta, "-");
    print epoch_days(dta);
}'
18861
I have a tab-delimited file like the following and I am trying to write an awk script:
aaa_log-000592 2 p STARTED 7027691 21.7 a1
aaa_log-000592 28 r STARTED 7027815 21.7 a2
aaa_log-000592 33 p STARTED 7032607 21.7 a3
aaa_log-000592 33 r STARTED 7032607 21.7 a4
aaa_log-000592 43 p STARTED 7025709 21.7 a5
aaa_log-000592 43 r STARTED 7025709 21.7 a6
aaa_log-000595 2 r STARTED 7027691 21.7 a7
aaa_log-000598 28 p STARTED 7027815 21.7 a8
aaa_log-000599 13 p STARTED 7033090 21.7 a9
I am trying to count the values in the 3rd column (p or r), grouped by column 1.
The output would be like:
Col1 Count-P Count-R
aaa_log-000592 3 3
aaa_log-000595 0 1
aaa_log-000598 1 0
aaa_log-000599 1 0
I can't find an example that combines an IF condition with a group-by in awk.
awk (more specifically, the GNU variant, gawk, version 4.0 or later) has true multi-dimensional arrays ("arrays of arrays") that can be indexed using input values (including character strings like in your example). As such, you can count the values the way you want by doing:
{
    values[$3] = 1     # this line records the values seen in column three
    counts[$1][$3]++   # and this line counts their frequency per column-one group
}
The first line isn't strictly required, but it simplifies generating the output.
The only remaining part is to have an END clause that outputs the tabulated results.
END {
# Print column headings
printf "Col1 "
for (v in values) {
printf " Count-%s", v
}
printf "\n"
# Print tabulated results
for (i in counts) {
printf "%-20s", i
for (v in values) {
printf " %d", counts[i][v]
}
printf "\n"
}
}
Generating the values array handles the case where the set of values in column three is not known in advance (e.g., when there's an error in your input).
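Assuming the two blocks above are saved together in a file, say count.awk (the name is arbitrary), the script reproduces the table from the question. Note that for (v in values) and for (i in counts) iterate in an unspecified order, so the rows and the Count-P/Count-R columns may come out in a different order:
gawk -f count.awk file
Col1  Count-P Count-R
aaa_log-000592        3 3
aaa_log-000595        0 1
aaa_log-000598        1 0
aaa_log-000599        1 0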
If you're using a different awk implementation (like the one you might find in macOS, for example), array indexing is different: arrays are single-dimensional, but can be indexed by a comma-separated list of indices. This adds some complexity, but the idea is the same.
{
files[$1] = 1
values[$3] = 1
counts[$1,$3]++
}
END {
# Print column headings
printf "Col1 "
for (v in values) {
printf " Count-%s", v
}
printf "\n"
# Print tabulated results
for (f in files) {
printf "%-20s", f
for (v in values) {
printf " %d", counts[f,v]
}
printf "\n"
}
}
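A note on the portable form: the comma in counts[$1,$3] does not create a nested array. awk joins the index list into a single string key using the built-in SUBSEP character (by default "\034"), and counts[f,v] in the END block rebuilds the same joined key. A minimal illustration:
awk 'BEGIN { c["a","p"] = 3; for (k in c) { split(k, parts, SUBSEP); print parts[1], parts[2], c[k] } }'
a p 3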
I have a text file with 3 columns, like this one:
2010-01-03 11:00:00 -134
2010-01-03 11:01:00 -131
2010-01-03 11:02:00 -128
...
Now I need the time steps in seconds rather than the existing timestamps.
How can I create a new column between $2 and $3, filled with increasing values (0, 60, 120, ...) until the end of the file?
According to your statement and data, you may need this:
awk '{ print $1, $2, i*60, $3; i++;}' orifile
In connection with luoluo's answer, a slightly shorter version: awk '{ print $1, $2, (NR-1)*60, $3 }' orifile
Assuming the timestamps are not all evenly spaced and you actually have to parse them, with GNU awk you can use mktime to do that:
gawk '{ ts = $1 " " $2; gsub(/[-:]/, " ", ts); t = mktime(ts) } NR == 1 { start = t } { $2 = $2 OFS (t - start); } 1'
This works as follows:
{ # for all lines:
ts = $1 " " $2 # concat first and second fields,
gsub(/[-:]/, " ", ts) # replace - and : with spaces. The result is the
# format mktime expects: "YYYY MM DD HH MM SS"
t = mktime(ts) # convert to seconds since Epoch
}
NR == 1 { # in the first line:
start = t # set the starting point
}
{ # for all lines:
$2 = $2 OFS (t - start) # append the seconds since start to the second field,
# effectively inserting a third
}
1 # then print.
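For instance, with a made-up gap in the third input line, the computed offsets follow the actual timestamps instead of assuming fixed 60-second steps:
2010-01-03 11:00:00 -134
2010-01-03 11:01:00 -131
2010-01-03 11:05:30 -128
becomes
2010-01-03 11:00:00 0 -134
2010-01-03 11:01:00 60 -131
2010-01-03 11:05:30 330 -128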
Another solution to insert the column in awk:
awk '$3=(NR-1)*60 FS $3' file
you get:
2010-01-03 11:00:00 0 -134
2010-01-03 11:01:00 60 -131
2010-01-03 11:02:00 120 -128
I have a file named bt.B.1.log that looks like this:
.
.
.
Time in seconds = 260.37
.
.
.
Compiled procs = 1
.
.
.
Time in seconds = 260.04
.
.
.
Compiled procs = 1
.
.
.
and so on for 40 records of Time in seconds and Compiled procs (the dots represent irrelevant lines).
How do I add a single column with the value of Compiled procs (which is 1) to the result of each of the following two commands?
This prints the average of the Time in seconds values (thanks to dawg for this one):
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
Desired output:
260.20 1
This prints all of the occurrences of Time in seconds, but there is a small problem with it: it prints an extra blank line at the beginning of the list.
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
Desired output:
260.37 1
260.04
.
.
.
In both cases I need the value of Compiled procs to appear only once, preferably on the first line, and without using intermediate files.
What I have managed so far prints all values of Time in seconds, with the Compiled procs column appearing on every line and with strange indentation:
awk '/seconds/ {printf $5} {printf " "} /procs/ {print $4}' bt.B.1.log > t1.dat
Please help!
UPDATE
Contents of file bt.B.1.log:
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 16:40:51--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.37
Total processes = 1
Compiled procs = 1
Mop/s total = 2696.83
Mop/s/process = 2696.83
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = (none)
Please send the results of this run to:
NPB Development Team
Internet: npb@nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 16:45:14--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 16:58:50--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.04
Total processes = 1
Compiled procs = 1
Mop/s total = 2700.25
Mop/s/process = 2700.25
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = (none)
Please send the results of this run to:
NPB Development Team
Internet: npb@nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 17:03:12--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
There are 40 entries in the log, but I've included only 2 for brevity.
To fix the first issue, replace:
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
with:
awk 'BEGIN { FS = "[ \t]*=[ \t]*" } /Time in seconds/ { s += $2; c++ } /Compiled procs/ { if (! CP) CP = $2 } END { print s/c, CP }' bt.B.1.log >t1avg.dat
A potential minor issue is that this may output 260.205 1, though the question does not flag the extra digit as a weakness of the given script. Rounding with something like printf "%.2f %s\n", s/c, CP gives 260.21 1. To truncate the extra digit instead of rounding, use something like printf "%.2f %s\n", int(s/c * 100) / 100, CP.
To fix the second issue, replace:
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
with:
awk 'BEGIN { FS = "[ \t]*[=][ \t]" } /Time in seconds/ { printf "%s", $2 } /Compiled procs/ { if (CP) { printf "\n" } else { CP = $2; printf " %s\n", $2 } }' bt.B.1.log > t1.dat
BTW, the "strange indentation" is a result of failing to output a newline when using printf and failing to filter unwanted input lines from processing.
I'm trying to create an awk script to subtract milliseconds between pairs of records joined up two at a time. At the command line I might do this:
Input:
06:20:00.120
06:20:00.361
06:20:15.205
06:20:15.431
06:20:35.073
06:20:36.190
06:20:59.604
06:21:00.514
06:21:25.145
06:21:26.125
Command:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input
I'll obtain this:
06:20:00.120 06:20:00.361
06:20:15.205 06:20:15.431
06:20:35.073 06:20:36.190
06:20:59.604 06:21:00.514
06:21:25.145 06:21:26.125
To subtract the milliseconds properly:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input| awk -F':| ' '{print $3, $6}'
And to avoid negative numbers:
awk '{if ($2<$1) sub(/00/, "60",$2); print $0}'
awk '{$3=($2-$1); print $3}'
The goal is to get this:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
And finally, an average.
I can do all of this command by command, but I don't know how to put it together into a single script.
Please help.
Using awk:
awk '
BEGIN { cmd = "date +%s.%N -d " }  # GNU date: epoch seconds.nanoseconds
NR % 2 {                           # odd lines: start of a pair
    cmd $0 | getline var1;         # shell out to date for the epoch time
    close(cmd $0);                 # close the command so fds are not leaked
    next
}
{                                  # even lines: end of a pair
    cmd $0 | getline var2;
    close(cmd $0);
    var3 = var2 - var1;            # difference in seconds
    print "Call " ++i, var3 " ms"
}
' file
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
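Note that this shells out to date once for every input line and reads the result back with getline, so it requires GNU date (for %N) and will be slow on large inputs; the pure-awk script below avoids external commands entirely.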
One way using awk:
Content of script.awk:
## For every input line.
{
## Convert the formatted timestamps to times in milliseconds.
t1 = to_ms( $0 )
getline
t2 = to_ms( $0 )
## Calculate the difference between both timestamps in milliseconds.
tr = (t1 >= t2) ? t1 - t2 : t2 - t1
## Print to output with time converted to a readable format.
printf "Call %d %s ms\n", ++cont, to_time( tr )
}
## Convert a timestamp in format hh:mm:ss.mmm to milliseconds.
function to_ms(time, time_ms, time_arr)
{
split( time, time_arr, /:|\./ )
time_ms = ( time_arr[1] * 3600 + time_arr[2] * 60 + time_arr[3] ) * 1000 + time_arr[4]
return time_ms
}
## Convert a time in milliseconds to format hh:mm:ss.mmm. In case of 'hours' or 'minutes'
## with a value of 0, don't print them.
function to_time(i_ms,    time, ms, s, h, m)
{
ms = int( i_ms % 1000 )
s = int( i_ms / 1000 )
h = int( s / 3600 )
s = s % 3600
m = int( s / 60 )
s = s % 60
# time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." ms
time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." sprintf( "%03d", ms )
return time
}
Run the script:
awk -f script.awk infile
Result:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.910 ms
Call 5 0.980 ms
If you're not tied to awk:
to_epoch() { date -d "$1" "+%s.%N"; }
count=0
paste - - < input |
while read t1 t2; do
((count++))
diff=$(printf "%s-%s\n" $(to_epoch "$t2") $(to_epoch "$t1") | bc -l)
printf "Call %d %5.3f ms\n" $count $diff
done
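Neither snippet above computes the requested average. Here is a minimal pure-awk sketch that prints both the per-call differences and the average; it assumes the hh:mm:ss.mmm input format from the question and keeps the question's "ms" labelling, even though the differences are actually in seconds:
awk -F'[:.]' '
function ms(h, m, s, x) { return (h*3600 + m*60 + s) * 1000 + x }  # to milliseconds
NR % 2 { t1 = ms($1, $2, $3, $4); next }   # odd lines: remember the start time
{
    d = ms($1, $2, $3, $4) - t1            # even lines: difference in ms
    printf "Call %d %.3f ms\n", ++n, d / 1000
    sum += d
}
END { if (n) printf "Average %.3f ms\n", sum / 1000 / n }
' input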
I have a text file and the last 2 lines look like this...
Uptime: 822832 Threads: 32 Questions: 13591705 Slow queries: 722 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.518
Uptime: 822893 Threads: 31 Questions: 13592768 Slow queries: 732 Opens: 81551 Flush tables: 59 Open tables: 64 Queries per second avg: 16.618
How do I find the difference between the two values of each parameter?
The expected output is:
61 -1 1063 10 0 0 0 0.1
In other words, I would like to subtract the earlier uptime value from the current one,
find the difference between the Threads and Questions values, and so on.
The purpose of this exercise is to watch this file and alert the user when a difference is too high, e.g. when the slow-query delta is more than 500 or the Questions delta is too low (<100).
(It is MySQL status output, but the question has nothing to do with MySQL itself, so the mysql tag does not apply.)
Just a slight variation on ghostdog74's (original) answer:
tail -2 file | awk '{
    gsub(/[a-zA-Z: ]+/, " ")    # strip the labels, leaving only the numbers
    m = split($0, a, " ")
    for (i = 1; i <= m; i++)
        if (NR == 1) b[i] = a[i]; else print a[i] - b[i]
}'
Here's one way. tail is used to get the last 2 lines, which is especially useful in terms of efficiency if you have a big file:
tail -2 file | awk '
{
gsub(/[a-zA-Z: ]+/," ")
m=split($0,a," ")
if (f) {
for (i=1;i<=m;i++){
print -(b[i]-a[i])
}
# to check for Questions, slow queries etc
if ( -(b[3]-a[3]) < 100 ){
print "Questions parameter too low"
}else if ( -(b[4]-a[4]) > 500 ){
print "Slow queries more than 500"
}else if ( a[1] - b[1] < 0 ){
print "mysql ...... "
}
exit
}
for(i=1;i<=m;i++ ){ b[i]=a[i] ;f=1 }
} '
Output:
$ ./shell.sh
61
-1
1063
10
0
0
0
0.1
gawk:
BEGIN {
    arr[1] = "0"    # initialize; length(arr) stays 1 until the first line is stored
}
length(arr) > 1 {   # true from the second line on: print the deltas
    print $2-arr[1], $4-arr[2], $6-arr[3], $9-arr[4], $11-arr[5], $14-arr[6], $17-arr[7], $22-arr[8]
}
{                   # remember this line's values for the next line
    arr[1] = $2
    arr[2] = $4
    arr[3] = $6
    arr[4] = $9
    arr[5] = $11
    arr[6] = $14
    arr[7] = $17
    arr[8] = $22
}