How to return 0 if awk returns null from processing an expression? - awk

I currently have an awk script that checks whether the output of an expression contains more than one line. If it does, it aggregates and prints the sum. For example:
someexpression=$'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
might be the one-line output (header only) where it DOESN'T yield any information. Then,
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
printf "%d\n", a[i]
}
}'
this will yield NULL or an empty return. Instead, I would like to have it return a numeric value of 0 if empty. How can I modify the above to do this?

Nothing in UNIX "returns" anything (despite the unfortunately named keyword for setting the exit status of a function), everything (tools, functions, scripts) outputs X and exits with status Y.
Consider these 2 identical functions named foo(), one in C and one in shell:
C (x=foo() means set x to the return code of foo()):
foo() {
printf "7\n"; // this is outputting 7 from the full program
return 3; // this is returning 3 from this function
}
x=foo(); <- 7 is output on screen and x has value '3'
shell (x=$(foo) means set x to the output of foo):
foo() {
printf "7\n"; # this is outputting 7 from just this function
return 3; # this is setting this function's exit status to 3
}
x=$(foo) <- nothing is output on screen, x has value '7', and '$?' has value '3'
Note that what the return statement does is vastly different in each. Within an awk script, printing and function return values behave the same as they do in C, but a call to the awk tool behaves externally like every other UNIX tool and shell script: it produces output and sets an exit status.
So when discussing anything in UNIX, avoid the term "return": it's imprecise and ambiguous, so some people will think you mean "output" while others think you mean "exit status".
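A minimal runnable version of the shell half of that comparison, assuming a POSIX-style shell. foo is the same illustrative function as above; the variable name status is introduced here only for the demonstration:

```shell
foo() {
    printf '7\n'    # output: what a command substitution captures
    return 3        # exit status: what $? reports
}

status=0
x=$(foo) || status=$?    # capture the output; record the exit status if it is non-zero
echo "output=$x status=$status"
```

The two values travel on separate channels, which is why a caller can test one without looking at the other.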
In this case I assume you mean "output" BUT you should instead consider setting a non-zero exit status when there's no match like grep does, e.g.:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
print a[i]
}
exit (NR < 2)
}'
and then your code that uses the above can test for the success/fail exit status rather than testing for a specific output value, just like if you were doing the equivalent with grep.
You can of course tweak the above to:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
if ( NR > 1 ) {
for (i in a) {
print a[i]
}
}
else {
print 0
exit 1
}
}'
if necessary and then you have both a specific output value and a success/fail exit status.
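As a sketch of how calling code might consume both the output and the exit status, assuming the awk script is wrapped in a shell function (count_by_user is a hypothetical name, and the header line is the sample from the question):

```shell
# count_by_user: hypothetical wrapper around the awk script above.
count_by_user() {
    awk '
        NR > 1 { a[$4]++ }
        END {
            if (NR > 1) {
                for (i in a) print a[i]
            } else {
                print 0      # explicit output when only the header was seen
                exit 1       # and a failure status, as grep does for no match
            }
        }'
}

header='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
if out=$(printf '%s\n' "$header" | count_by_user); then
    echo "jobs found: $out"
else
    echo "no jobs (output was: $out)"
fi
```

With only the header line as input, the else branch runs and the captured output is 0.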

You may set a flag inside the for loop to detect whether the loop has executed:
echo "$someexpression" |
awk 'NR>1 {
a[$4]++
}
END {
for (i in a) {
p = 1
printf "%d\n", a[i]
}
if (!p)
print 0
}'
0


How to Increment timestamps in text file

I am comparing two long log files which are exactly the same except for the timestamp.
Eg: Log1
fn1-start 11:10:10
fn2-start 11:10:12
fn2-end 11:10:19
fn1-end 11:11:20
...
A long list
...
Log 2
fn1-start 11:22:11
fn2-start 11:22:13
fn2-end 11:22:20
fn1-end 11:23:41
...
A long list
...
I want to compare two log files like this to find out which function is causing performance degradation using some comparison tool.
What I want is to increment or decrement all the timestamps in one of the log files. The timestamps in the second file start at 11:22:11, so in my case I could add 00:12:01 to the timestamps in the first log file and then compare the logs.
So, increment the Log 1 timestamps by 00:12:01.
So Log 1 is now:
fn1-start 11:22:11
fn2-start 11:22:13
fn2-end 11:22:20
fn1-end 11:23:21
...
A long list
...
In this case, fn1 takes 20 seconds longer to complete after the fn2 function call in log 2.
How can I achieve this? Which tools should I use? Are there any alternative methods?
So you want to compare the run times of individual functions, as opposed to offsetting the timestamps so that the first function starts at the same time in both logs.
This means you need to calculate the function run time duration before comparing the two files. This can be done with a rather simple awk script:
function get_durations() {
awk '
BEGIN{
# Split on runs of spaces or dashes
FS="[ -]+"
}
/start/ {
start[$1] = $3
}
/end/ {
if($1 in start)
end[$1] = $3
else
print "No corresponding \"start\" for function " $1 > "/dev/stderr"
}
# Function to convert timestamps into seconds using GNU coreutils date
function timestamp_to_seconds(ts,   cmd, sec) {
cmd = sprintf("date \"+%%s\" --date=\"%s\"", ts)
cmd | getline sec
# close the command itself so the pipe is released and can be rerun
close(cmd)
return sec
}
END {
for (x in start){
if(end[x]){
end_seconds = timestamp_to_seconds(end[x])
start_seconds = timestamp_to_seconds(start[x])
printf("%s %s\n", x, end_seconds - start_seconds)
}
else{
printf("%s inf\n", x)
print "No corresponding \"end\" for function " x > "/dev/stderr"
}
}
}
' "${1}"
}
To compare the durations you can proceed in a similar way using awk arrays:
function compare_durations() {
gawk -P '
BEGIN{
print "function,file1_duration,file2_duration,12_difference"
}
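# First file: remember each function's duration
!($1 in f) {
f[$1] = $2
next
}
# Second file: print the file1 duration, the file2 duration, and file2 minus file1
{
printf("%s,%s,%s,%s\n",
$1,
f[$1],
$2,
($2 == "inf" || f[$1] == "inf" ? "inf" : $2 - f[$1]))
}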
' "${1}" "${2}"
}
This function takes two files as inputs and prints a CSV comparing the two.
Finally you can use these functions together to compare the files:
compare_durations <(get_durations input1) <(get_durations input2) > summary.csv
This solution assumes the function names don't repeat; if they do repeat, you can change the script to add a counter for each function.
The time complexity of the script is O(n) but it uses O(n) space, so if you have really long logs you should find another approach.
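One way the counter idea could look, as a sketch rather than a drop-in replacement: durations is a hypothetical function that numbers each call of a function (fn2#1, fn2#2, ...) and prints a duration as soon as the matching end line arrives. It assumes a function never overlaps with another call of itself, that all timestamps fall on the same day, and it converts HH:MM:SS with plain arithmetic instead of calling date.

```shell
durations() {
    awk '
        # HH:MM:SS -> seconds since midnight (assumes same-day timestamps)
        function to_seconds(ts,    p) {
            split(ts, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        {
            n = split($1, part, "-")
            kind = part[n]                                       # "start" or "end"
            name = substr($1, 1, length($1) - length(kind) - 1)  # text before the last dash
        }
        kind == "start" {
            occ = ++seen[name]                                   # per-function call counter
            start[name, occ] = to_seconds($2)
        }
        kind == "end" {
            occ = seen[name]
            if ((name, occ) in start) {
                printf "%s#%d %d\n", name, occ, to_seconds($2) - start[name, occ]
                delete start[name, occ]                          # keep memory use bounded
            }
        }
    ' "$@"
}
```

Running durations on each log and comparing the two outputs line by line would then follow the same pattern as compare_durations above. Because finished entries are deleted, memory grows only with the number of calls that are still open.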

Can I do a time based progress in awk?

I am currently using an awk script to censor console output, and I print one dot for each censored line.
I want to update this code so that it prints no more than one dot per minute (or something similar). Obviously, if there is no progress (no new lines streamed), nothing should be printed.
Current version of the code is at https://gist.github.com/ssbarnea/f7b72491af524fa364d9fc328cd43f2a
Note: I know that I could print a newline with "mod 10" or similar to limit the output, but that approach is not good because the lines do not arrive at a consistent rate: sometimes I get lots of them, sometimes only one or two. Because of this I need a timer-based approach that does something like "print a newline if the last one was printed more than x seconds ago".
With GNU awk for time functions you can print dots no more frequently than once per minute by simply comparing the time in seconds since the epoch when the current input line is being processed with the time when the previous dot was printed:
awk '
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 60 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ print $0; prtDot() }
END { print "" | "cat>&2" }
'
e.g. printing a . every 10 seconds within a stream of numbers:
$ cat tst.awk
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 10 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ printf "%s",$0%10 | "cat>&2"; prtDot() }
END { print "" | "cat>&2" }
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 1; done | awk -f tst.awk
1.2345678901.23456789012.3456789012.34567890123.4567890
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 3; done | awk -f tst.awk
1.2345.6789.0123.4567.8901.2345.6789.0123.4567.8901.2345.6789.0
The slight difference between the digits actually printed and those expected comes from the time that other parts of the while loop add to the interval between echos, plus other small imprecisions that affect when the shell loop prints numbers and consequently when systime() is called in awk.

Print smallest integer from file using awk custom function?

The awk function looks like this, in a file named fun.awk:
{
print small()
}
function small()
{
a[NR]=$0
smal=0
for(i=1;i<=3;i++)
{
if( a[i]<a[i+1])
smal=a[i]
else
smal=a[i+1]
}
return smal
}
The contents of awk.write:
1
23
32
The awk command is:
awk -f fun.awk awk.write
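It gives me no result. Why?
I think you are going about this the wrong way. Your loop always finishes by comparing a[3] with a[4], which is never set, so the function returns an empty string for every line. In awk, one approach might be: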
NR == 1 {
small = $0
}
$0 < small {
small = $0
}
END {
print small
}
which simply sets small to the smallest integer seen so far on each line, and prints it at the end. (Note: you need to start by initializing small on the first line.)
A simpler approach might just be to sort the lines as numbers with sort, and pick the first one.
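A sketch of that sort-based alternative, assuming one integer per line; smallest is a hypothetical helper name:

```shell
# smallest: print the numerically smallest line of the given files (or of stdin).
smallest() {
    sort -n "$@" | head -n 1
}

printf '%s\n' 1 23 32 | smallest
```

For the file in the question this would be called as smallest awk.write. Sorting costs more than a single awk pass, but for small inputs the difference is unlikely to matter.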

How to detect the last line in awk before END?

I'm trying to concatenate String values and print them, but if the last types are Strings and there is no change of type then the concatenation won't print:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
t = $1; val=$2;
#if string, concatenate its value
if (t == "String") {
tot+=val;
nx=1;
} else {
nx=0;
}
#if type change, add tot to res
if (t != "String" && ant_t == "String") {
res=res tot;
tot=0;
}
ant_t=t;
#if string, go next
if (nx == 1) {
next;
}
res=res"\n"val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect whether awk is reading the last line, so that if there is no change of type it can check whether it is on the last line?
awk reads its input line by line, so it cannot tell whether the current line is the last one. The END block is useful for performing actions once the end of the input has been reached.
To do what you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}'
will produce output as
3
5
2
6
What does it do?
/String/ selects lines that match String, and /Number/ likewise selects lines that match Number.
sum+=$2 accumulates the values of the String lines. When a Number line occurs, the sum is printed and reset to zero.
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
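If reading the input twice is not an option (for example when it comes from a pipe), another common idiom is to delay processing by one line, so that the line still held when END runs is known to be the last. A minimal sketch, with mark_last as a hypothetical name:

```shell
mark_last() {
    awk '
        NR > 1 { print prev }                 # the held line was not the last one
        { prev = $0 }                         # hold the current line until the next arrives
        END { if (NR) print "LAST: " prev }   # whatever is still held is the last line
    ' "$@"
}

printf '%s\n' a b c | mark_last
```

The example prints a, b and then LAST: c.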
Just change the last line to:
END { print res; print tot;}'
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean, and I check at the END if the last pattern was a string and print the sum
You can actually use x as the boolean like nu11p01n73R does which is smarter
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

array over non-existing indices in awk

Sorry for the verbose question, it boils down to a very simple problem.
Assume there are n text files each containing one column of strings (denominating groups) and one of integers (denominating the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls|grep .log|awk -F'.' '{print $1}');
for q in $NAMES;
do
gawk -F' ' -v y=$q 'BEGIN {print "param", y}
{sum1[$1] += $2; N[$1]++}
END {for (key in sum1) {
avg1 = sum1[key] / N[key];
printf "%s %f\n", key, avg1;
} }' $q.log | sort > $q.mean;
done;
However, for the reasons mentioned above, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and the corresponding mean value or empty spaces in the second column, depending on whether this category is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
{sum[$1] += $2; N[$1]++}
END {for (i in arr) {
if (i in sum) {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;}
else {
printf "%s %s\n" i, "";}
}}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[""] (because a is a variable not previously used, so its value is the empty string), then "b" to the same element (because b is likewise unset), then "c", then "d". Clearly, not what you had in mind. The error message itself comes from the missing comma between the format string and its arguments in your printf statements. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf "%s\n", i;
else {
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}
}' xxyz.log > xxyz.mean;
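A quick check of the unquoted-index behaviour described above, wrapped in a hypothetical demo_keys function so it can be run without any input file:

```shell
demo_keys() {
    awk 'BEGIN {
        arr[a] = "a"; arr[b] = "b"   # a and b are unset variables, so both index ""
        arr["c"] = "c"               # a quoted string literal is a distinct key
        n = 0
        for (k in arr) n++
        print n                      # 2 elements: "" and "c"
        print arr[""]                # b, the last value assigned through an unset name
    }'
}

demo_keys
```

The array ends up with two elements rather than three, which is the same collapse that left the original arr with a single entry.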
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
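When the groups are not known in advance, the same result can be sketched in a single awk pass over all the files: collect every group name seen anywhere, then report each file against that full list. means_all_groups is a hypothetical name, and the output format (file, group, mean on one line) is an assumption made so the result can be sorted and split afterwards.

```shell
means_all_groups() {
    awk '
        {
            groups[$1]                    # remember every group seen in any file
            files[FILENAME]               # and every file that was read
            sum[FILENAME, $1] += $2
            cnt[FILENAME, $1]++
        }
        END {
            for (f in files) {
                for (g in groups) {
                    if ((f, g) in cnt) {
                        printf "%s %s %f\n", f, g, sum[f, g] / cnt[f, g]
                    } else {
                        printf "%s %s\n", f, g    # group missing from this file
                    }
                }
            }
        }
    ' "$@" | LC_ALL=C sort
}
```

Called as means_all_groups *.log, this lists every group for every file, leaving the mean empty where a file has no rows for that group.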
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)
for i in *.log; do
for j in "${sorted[@]}"; do
awk -v var="$j" '
{
sum[$1]+=$2
cnt[$1]++
}
END {
print var, (var in cnt ? sum[var]/cnt[var] : "")
}
' "$i" >> "${i/.log/.main}"
done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --