How to increment timestamps in a text file - awk

I am comparing two long log files which are exactly the same except for the timestamp.
E.g. Log 1
fn1-start 11:10:10
fn2-start 11:10:12
fn2-end 11:10:19
fn1-end 11:11:20
...
A long list
...
Log 2
fn1-start 11:22:11
fn2-start 11:22:13
fn2-end 11:22:20
fn1-end 11:23:41
...
A long list
...
I want to compare two log files like this to find out which function is causing performance degradation using some comparison tool.
What I want is to increment or decrement all the timestamps in one of the log files. The timestamps of the second file start at 11:22:11, so in my case I could add the difference of 00:12:01 to the first log file's timestamps and then compare the logs.
So, increment the log 1 timestamps by 00:12:01.
So Log 1 is now:
fn1-start 11:22:11
fn2-start 11:22:13
fn2-end 11:22:20
fn1-end 11:23:21
...
A long list
...
In this case, fn1 takes 20 seconds longer to complete after the fn2 function call in log 2.
How can I achieve this? Which tools should I use? any alternate methods?

So you want to compare the run times of individual functions, as opposed to offsetting them so that the first one starts at the same time.
This means you need to calculate each function's run time before comparing the two files. This can be done with a rather simple awk script:
function get_durations() {
    awk '
    BEGIN {
        # Split fields on dashes and runs of spaces, so that
        # "fn1-start 11:10:10" becomes $1="fn1", $2="start", $3="11:10:10"
        FS = "[- ]+"
    }
    /start/ {
        start[$1] = $3
    }
    /end/ {
        if ($1 in start)
            end[$1] = $3
        else
            print "No corresponding \"start\" for function " $1 > "/dev/stderr"
    }
    # Convert a timestamp into seconds since the epoch using GNU coreutils date
    function timestamp_to_seconds(ts,    cmd, sec) {
        cmd = sprintf("date \"+%%s\" --date=\"%s\"", ts)
        cmd | getline sec
        close(cmd)
        return sec
    }
    END {
        for (x in start) {
            if (x in end) {
                end_seconds = timestamp_to_seconds(end[x])
                start_seconds = timestamp_to_seconds(start[x])
                printf("%s %s\n", x, end_seconds - start_seconds)
            }
            else {
                printf("%s inf\n", x)
                print "No corresponding \"end\" for function " x > "/dev/stderr"
            }
        }
    }
    ' "${1}"
}
To compare the durations you can proceed in a similar way using awk arrays:
function compare_durations() {
    gawk -P '
    BEGIN {
        print "function,file1_duration,file2_duration,difference"
    }
    # Lines from the second file: the matching duration from the
    # first file is already stored in f, so print the comparison
    ($1 in f) {
        printf("%s,%s,%s,%s\n",
            $1,
            f[$1],
            $2,
            ($2 == "inf" || f[$1] == "inf" ? "inf" : $2 - f[$1]))
    }
    # Lines from the first file: remember each duration
    !($1 in f) {
        f[$1] = $2
    }
    ' "${1}" "${2}"
}
This function takes two files as inputs and prints out a csv with the comparison between the two files.
Finally you can use these functions together to compare the files:
compare_durations <(get_durations input1) <(get_durations input2) > summary.csv
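With the example logs above, summary.csv would contain something like the following (the row order is not guaranteed, because awk's for (x in start) loop iterates in arbitrary order):
function,file1_duration,file2_duration,difference
fn2,7,7,0
fn1,70,90,20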
This solution assumes the function names don't repeat; if they do repeat, you can change the script to add a counter for each occurrence of a function.
The script runs in O(n) time but also uses O(n) space, so if you have really long logs you should find another approach.
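Alternatively, if you really do want to shift every timestamp by a fixed offset as the question describes, here is a minimal awk sketch (an illustration, not a hardened solution: it assumes every line has the form "name HH:MM:SS" and ignores midnight rollover):
awk -v offset="00:12:01" '
BEGIN {
    # convert the offset to seconds once
    split(offset, o, ":")
    shift = o[1]*3600 + o[2]*60 + o[3]
}
{
    # shift the timestamp in field 2 and print it back as HH:MM:SS
    split($2, t, ":")
    secs = t[1]*3600 + t[2]*60 + t[3] + shift
    printf("%s %02d:%02d:%02d\n", $1, int(secs/3600), int((secs%3600)/60), secs%60)
}' log1 > log1_shifted
You could then diff log1_shifted against log2 directly.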

How to return 0 if awk returns null from processing an expression?

I currently have an awk script to check whether or not an expression's output contains more than one line. If it does, it aggregates and prints the sum. For example:
someexpression=$'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
might be a one-liner that DOESN'T yield any information beyond the header. Then,
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
printf "%d\n", a[i]
}
}'
this will yield NULL or an empty return. Instead, I would like to have it return a numeric value of 0 if empty. How can I modify the above to do this?
Nothing in UNIX "returns" anything (despite the unfortunately named keyword for setting the exit status of a function); everything (tools, functions, scripts) outputs X and exits with status Y.
Consider these two similar-looking functions named foo(), one in C and one in shell:
C (x=foo() means set x to the return code of foo()):
int foo() {
    printf("7\n"); // this is outputting 7 from the full program
    return 3;      // this is returning 3 from this function
}
x=foo(); <- 7 is output on screen and x has value '3'
shell (x=$(foo) means set x to the output of foo()):
foo() {
printf "7\n"; # this is outputting 7 from just this function
return 3; # this is setting this functions exit status to 3
}
x=$(foo) <- nothing is output on screen, x has value '7', and '$?' has value '3'
Note that what the return statement does is vastly different in each. Within an awk script, printing and return codes from functions behave the same as they do in C, but a call to the awk tool itself externally behaves like every other UNIX tool and shell script: it produces output and sets an exit status.
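For completeness, here is a quick awk analogue of foo() (a sketch): inside awk, return hands a value back to the awk-level caller, while print/printf produce the program's output, just as in C:
awk '
function foo() {
    printf "7\n"   # output from the awk program as a whole
    return 3       # value handed back to the awk-level caller
}
BEGIN {
    x = foo()      # x is now 3; the 7 went to stdout
}'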
So when discussing anything in UNIX, avoid using the term "return", as it's imprecise and ambiguous: some people will think you mean "output" while others will think you mean "exit status".
In this case I assume you mean "output", BUT you should instead consider setting a non-zero exit status when there's no match, like grep does, e.g.:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
print a[i]
}
exit (NR < 2)
}'
and then your code that uses the above can test for the success/fail exit status rather than testing for a specific output value, just like if you were doing the equivalent with grep.
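For example, a minimal sketch of calling code that branches on the exit status (the variable names are just illustrative):
if out=$(echo "$someexpression" | awk '
        NR>1 {a[$4]++}
        END {
            for (i in a) {
                print a[i]
            }
            exit (NR < 2)
        }'); then
    printf "%s\n" "$out"     # awk exited 0: we had matches
else
    echo "no matches" >&2    # awk exited non-zero, like grep
fi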
You can of course tweak the above to:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
if ( NR > 1 ) {
for (i in a) {
print a[i]
}
}
else {
print "$0$"
exit 1
}
}'
if necessary and then you have both a specific output value and a success/fail exit status.
You may keep a flag inside the for loop to detect whether the loop has executed or not:
echo "$someexpression" |
awk 'NR>1 {
a[$4]++
}
END {
for (i in a) {
p = 1
printf "%d\n", a[i]
}
if (!p)
print "$0$"
}'
0

Can I do a time based progress in awk?

I am currently using awk scripting to censor the console output and I print one dot for each censored line.
I want to update this code to make it avoid printing more than one dot per minute (or something similar). Obviously, if I do not get any progress (new streamed lines), no update is supposed to happen.
Current version of the code is at https://gist.github.com/ssbarnea/f7b72491af524fa364d9fc328cd43f2a
Note: I know that I could print a newline with "mod 10" or similar in order to limit the output, but that approach is not good because the lines are not received at a consistent speed; sometimes I get lots of them, sometimes I get only one or two. Because of this I need a timer-based approach which does something like "print a newline if the last one was printed more than x seconds ago".
With GNU awk for time functions you can print dots no more frequently than once per minute by simply comparing the time in seconds since the epoch when the current input line is being processed with the time when the previous dot was printed:
awk '
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 60 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ print $0; prtDot() }
END { print "" | "cat>&2" }
'
e.g. printing a . every 10 seconds within a stream of numbers:
$ cat tst.awk
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 10 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ printf "%s",$0%10 | "cat>&2"; prtDot() }
END { print "" | "cat>&2" }
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 1; done | awk -f tst.awk
1.2345678901.23456789012.3456789012.34567890123.4567890
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 3; done | awk -f tst.awk
1.2345.6789.0123.4567.8901.2345.6789.0123.4567.8901.2345.6789.0
The slight difference between the digits actually printed and those expected is due to the time that other parts of the while loop add to the overall interval between echos, and to other small imprecisions affecting when the shell loop prints numbers and, consequently, when systime() gets called in awk.
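If your awk is not GNU awk and lacks systime(), a portable (if slower) variant of prtDot() is to shell out to date instead; a sketch, assuming your date supports +%s:
awk '
function prtDot(    cmd, now) {
    cmd = "date +%s"
    cmd | getline now   # read the current epoch time
    close(cmd)          # close so the command reruns on the next call
    if ( (now - prevTime) > 60 ) {
        printf "." | "cat>&2"
        prevTime = now
    }
}
{ print $0; prtDot() }
END { print "" | "cat>&2" }
'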

Print smallest integer from file using awk custom function?

The awk function looks like this, in a file named fun.awk:
{
print small()
}
function small()
{
a[NR]=$0
smal=0
for(i=1;i<=3;i++)
{
if( a[i]<a[i+1])
smal=a[i]
else
smal=a[i+1]
}
return smal
}
The contents of awk.write:
1
23
32
The awk command is:
awk -f fun.awk awk.write
It gives me no result. Why?
I think you are going about this the wrong way: your small() compares each stored line against the next entry, which has not been read yet, so the final comparison is always against an unset (empty) array element, smal ends up as the empty string, and you print blank lines. In awk, one approach might be:
NR == 1 {
small = $0
}
$0 < small {
small = $0
}
END {
print small
}
which simply sets small to the smallest integer we've seen so far on each line, and prints it at the end. (Note: you need to start by initializing small on the first line.)
A simpler approach might just be to sort the lines as numbers with sort, and pick the first one.
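For example, using the file from the question:
sort -n awk.write | head -n 1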

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables and 'if' statements and loops and whatnot, but I've been told it's easier than that given some of awk's features. I'd like to know what that might look like, along with an explanation of why it works.
BEGIN runs before the first line is read from the file; it sets sum and count to 0.
awk 'BEGIN { sum = 0; count = 0 }
{
    if ( /[a-zA-Z]/ ) {
        if (count > 0) { avg = sum / count; print avg }
        count = 0; sum = 0
    } else {
        count++; sum += $1
    }
}
END { if (count > 0) { avg = sum / count; print avg } }' NumbersAndExtras.txt
When there is an alphabetic character on the line, calculate and print the average so far.
And do the same in the END block that runs after processing the whole file.
Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c{print "Average: "a/c;a=c=0}
END{if(c){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3
Another one:
$ awk '
function avg(s, c) { print "Average: ", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file
Note: The value of this answer lies in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define output function that prints an average.
function printAvg() { print "Average: ", sum/count }
# Skip blank lines
NF == 0 { next }
# Is the line non-numeric?
/[[:alpha:]]/ {
# If this line ends a numeric block, print its
# average now and reset the variables to start the next group.
if (count) {
printAvg()
sum = count = 0
}
# Skip to next line.
next
}
# Numeric line: add to the sum and increment the counter.
{ sum += $1; count++ }
# Finally:
END {
# If there is a group whose average has not been printed yet,
# do it now.
if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average: ", sum/count }
NF == 0 { next }
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt

awk | Add new row or update existing row in a file

I want to update file1 on the basis of file2. If a row in file2 is new, it should be added to file1. If a row from file2 is already in file1 (same 2nd field), that row should be updated with the row from file2, but only if the time in file2 is greater.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
2nd field should be unique in the output.
Addition of a new value: for 2nd field "1111111104", the latest entry in file2 is taken, since its date (201312051016, 3rd field) is newer than the other entry's (201312051014).
Update of an existing value: "1111111102" is updated with the newer value on the basis of the date in the 3rd column.
file1 is very LARGE whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated because its entry in file1 already has a later date (201312051014) than the one in file2 (201312041013).
I haven't tried much on this because the conditions are really complex for me as a beginner. This is as far as I got:
BEGIN { FS = OFS = "," }
FNR == NR {
m=$2;
a[m] = $0;
next
}
{
if($2 in a)
{
split(a[$2],datetime,",")
if($3>datetime[3])
print $0;
else
print a[$2]"Old time"
}
else print $0"NOMATCH";
delete a[$2];
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
FS = OFS = ","
}
FILENAME == "input2.csv" {
date[$2] = $3
data[$2] = $0
used[$2] = 0
}
FILENAME == "input1.csv" {
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1, because it uses an array only for the few entries in file2.
The new entries are simply appended to the output; there is no sorting. If sorting is required, it will take extra effort.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR==FNR. However, it only works for sufficiently recent versions of awk which are capable of returning the size of an array as length(array):
BEGIN {
FS = ","
}
{
# The following effectively creates an array entry for each filename found (for "known" filenames existing entries are overwritten).
files[FILENAME] = 1
# check the number of files we have so far
if (length(files) == 1) {
# we are still in the first file
date[$2] = $3
data[$2] = $0
used[$2] = 0
} else {
# we are in the second file (or any other following file)
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Also, if you require your output to be sorted according to the second field, you can replace the call to awk by this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you need to read all of file2 into memory first, dealing with the duplicate values. As a result, you'll have an array indexed by the record number with the new data; you should also have a record of the date for each record in a separate array. Then, as you read the main file, you look up the record number and the date in the arrays, and if you need to, substitute the saved new record for the incoming old record.
Your outline script is most of the way there. It is more complex because you didn't save the dates coming in. This more or less works:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0; } next; }
{ if ($2 in date)
{
if (date[$2] > $3)
print line[$2]
else
print
delete line[$2]
delete date[$2]
}
else
print
}
END { for (l in line) print line[l]; }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.
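A minimal sketch of that buffered variant (an illustration only, assuming field 2 is the key, field 3 the date, and that both files fit in memory):
awk -F, '
FNR == NR {        # file2: remember the newest row per key
    if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 }
    next
}
!($2 in seen) { order[++n] = $2 }    # record first-seen order of file1 keys
{
    seen[$2] = 1
    # keep whichever row for this key carries the newest date (file1 wins ties)
    if (!($2 in date) || $3 >= date[$2]) { date[$2] = $3; line[$2] = $0 }
}
END {
    for (i = 1; i <= n; i++) print line[order[i]]    # file1 order, deduplicated
    for (k in line) if (!(k in seen)) print line[k]  # keys new in file2, appended
}' file2 file1
This prints nothing until the END block, so duplicated field-2 values within file1 collapse to the newest row, at the cost of holding all of file1's surviving rows in memory.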