How to detect the last line in awk before END? - awk

I'm trying to concatenate String values and print them, but if the file ends with String lines there is no change of type, so the final concatenation never gets printed:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
    t = $1; val = $2;
    # if string, concatenate its value
    if (t == "String") {
        tot += val;
        nx = 1;
    } else {
        nx = 0;
    }
    # if type change, add tot to res
    if (t != "String" && ant_t == "String") {
        res = res tot;
        tot = 0;
    }
    ant_t = t;
    # if string, go next
    if (nx == 1) {
        next;
    }
    res = res "\n" val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect whether awk is reading the last line, so that when no change of type follows, the final sum still gets printed?

awk reads the input line by line, hence it cannot tell whether it is currently reading the last line. The END block is useful for performing actions once the end of the file has been reached.
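That said, if you genuinely need to act on the last line itself, a common workaround is to process the input one step behind, buffering the previous record so the final one can be handled in END. A minimal sketch of that idea (the variable name prev is my own):
awk '
NR > 1 { print "not last:", prev }   # every line except the final one
{ prev = $0 }                        # buffer the current line
END { print "last:", prev }          # prev now holds the last line
' input.txt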
To produce what you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}'
will produce the output
3
5
2
6
What it does:
/String/ selects lines that match String; likewise /Number/ selects lines that match Number.
sum+=$2 accumulates the values of the String lines. When a Number line occurs, the sum so far is printed and reset to zero; the END block prints any sum left over when the input ends with String lines.

Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
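If you would rather avoid the extra wc process, a hedged alternative (my own variant, not part of this answer) is to let awk read the file twice, counting lines on the first pass and flagging the last one on the second:
awk 'NR == FNR { total = NR; next }   # first pass: count the lines
FNR == total { print "LAST" }         # second pass: flag the final line
1' /etc/hosts /etc/hosts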

Just change the last line to:
END { print res; print tot;}'
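Applied to the original script, that change looks as follows; note this sketch relies on the input ending with String lines (as in the sample), otherwise a stray 0 is printed for tot:
awk '
BEGIN { tot=0; ant_t=""; }
{
    t = $1; val = $2;
    if (t == "String") { tot += val; nx = 1 } else { nx = 0 }
    if (t != "String" && ant_t == "String") { res = res tot; tot = 0 }
    ant_t = t;
    if (nx == 1) next;
    res = res "\n" val;
}
END { print res; print tot; }' input.txt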

awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean; at the END I check whether the last lines were String lines and, if so, print the sum.
You could actually use x itself as the boolean, as nu11p01n73R does, which is smarter.
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

Related

How to return 0 if awk returns null from processing an expression?

I currently have an awk method to check whether an expression's output contains more than one line. If it does, it aggregates and prints the sum. For example:
someexpression=$'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
might be the one-liner where it DOESN'T yield any information. Then,
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
printf "%d\n", a[i]
}
}'
this will yield NULL, i.e. an empty output. Instead, I would like it to output $0$ when the result is empty. How can I modify the above to do this?
Nothing in UNIX "returns" anything (despite the unfortunately named keyword for setting the exit status of a function). Everything (tools, functions, scripts) outputs X and exits with status Y.
Consider these 2 identical-looking functions named foo(), one in C and one in shell:
C (x = foo() means set x to the return value of foo()):
int foo() {
    printf("7\n");   // this is outputting 7 from the full program
    return 3;        // this is returning 3 from this function
}
x = foo();  <- 7 is output on screen and x has value '3'
shell (x=$(foo) means set x to the output of foo()):
foo() {
    printf "7\n"    # this is outputting 7 from just this function
    return 3        # this is setting this function's exit status to 3
}
x=$(foo)  <- nothing is output on screen, x has value '7', and '$?' has value '3'
Note that what the return statement does is vastly different in each. Within an awk script, printing and returning from functions behave the same as they do in C; but as a call to the awk tool, externally it behaves the same as every other UNIX tool and shell script: it produces output and sets an exit status.
So when discussing anything in UNIX, avoid the term "return": it's imprecise and ambiguous, and some people will think you mean "output" while others will think you mean "exit status".
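To make that concrete, here is a small sketch (the function name double is my own) showing all three behaviors in one awk invocation:
awk '
function double(x) { return 2 * x }   # return: hands a value back to the caller, as in C
BEGIN {
    y = double(21)                    # nothing printed yet; y is 42
    print y                           # output: writes 42 to stdout
    exit 3                            # sets the exit status of the awk process
}'
echo "$?"                             # prints 3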
In this case I assume you mean "output" BUT you should instead consider setting a non-zero exit status when there's no match like grep does, e.g.:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
print a[i]
}
exit (NR < 2)
}'
and then your code that uses the above can test for the success/fail exit status rather than testing for a specific output value, just like if you were doing the equivalent with grep.
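For instance, a minimal sketch of such a test (the messages are mine):
if echo "$someexpression" | awk 'NR>1 {a[$4]++} END {for (i in a) print a[i]; exit (NR < 2)}'; then
    echo "got counts"
else
    echo "no data rows"   # awk exited non-zero, as grep would on no match
fi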
You can of course tweak the above to:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
if ( NR > 1 ) {
for (i in a) {
print a[i]
}
}
else {
print "$0$"
exit 1
}
}'
if necessary and then you have both a specific output value and a success/fail exit status.
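For example, a sketch of capturing both at once (the variable names are mine):
counts=$(echo "$someexpression" | awk '
    NR>1 { a[$4]++ }
    END {
        if (NR > 1) { for (i in a) print a[i] }
        else { print "$0$"; exit 1 }
    }')
status=$?
echo "output: $counts, exit status: $status"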
You may set a flag inside the for loop to detect whether the loop has executed or not:
echo "$someexpression" |
awk 'NR>1 {
a[$4]++
}
END
{
for (i in a) {
p = 1
printf "%d\n", a[i]
}
if (!p)
print "$0$"
}'
$0$

Store each of the first 2 blocks of lines in arrays

I've been sorting this with Google Sheets, but it takes a long time, so I'd like to settle it with awk instead.
input.txt (blocks are separated by blank lines):
Column 1
2
2
2
4
4

Column 2
562
564
119
215
12

Range
13455,13457
13161
11409
13285,13277-13269
11409
I've tried this script to rearrange the values:
awk '/Column 1/' RS= input.txt
(as referred in How can I set the grep after context to be "until the next blank line"?)
But it seems to only grab one matched block. The output should pair up the respective lines.
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
It should be something like that: a comma in the Range column repeats the values from Column 1 and Column 2, e.g.:
Range :
13455,13457
Result :
562Value2#13455
562Value2#13457
I don't know what sorting has to do with it, but it seems like this is what you're looking for:
$ cat tst.awk
BEGIN { FS=","; recNr=1; print "Result:" }
!NF { ++recNr; lineNr=0; next }     # blank line: the next block starts
{ ++lineNr }
lineNr == 1 { next }                # skip each block's header line
recNr == 1 { a[lineNr] = $0 }       # first block: Column 1 values
recNr == 2 { b[lineNr] = $0 }       # second block: Column 2 values
recNr == 3 {                        # third block: Range, comma-separated
    for (i=1; i<=NF; i++) {
        print b[lineNr] "Value" a[lineNr] "#" $i
    }
}
$ awk -f tst.awk input.txt
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409

How to find records with minimum value in a specific column in multiple files?

I have 4000 two-column .dat files. From each file, I need to identify the first minimum value of column 2 and print the corresponding row. This should run on all files in the folder and append these values to a new file. I have tried the code below.
File names include common string:
fig_3-28333.dat
^^^^^ file number
awk'BEGIN{min=0}{if(($2)>min) min=($2)}END {print line}' cat >> new.dat
The output file is expected to look like:
file number Column 1 column2
28333 x value first minimum value
28334 x value first minimum value
NOTE: This only works with gawk (which understands the ENDFILE pattern and the three-argument form of match()), and not with regular awk.
Here is my script, min.awk:
BEGIN {
    print "file number Column 1 column2"
}
FNR == 1 {
    min = $2
    first = $1
    second = $2
}
$2 < min {
    min = $2
    first = $1
    second = $2
}
ENDFILE {
    # Extract the file number to a[1]
    match(FILENAME, /.*-([0-9]+)\.dat/, a)
    print a[1], first, second
}
Notes
The BEGIN pattern prints the heading
At the first line of each file (pattern: FNR == 1), establish the minimum value
For those lines whose second value is less than the minimum (pattern: $2 < min), establish the new minimum value
At the end of each file, print out the minimum value for that file
Invoke the script
gawk -f min.awk *.dat
Update
After reviewing my script, I noticed duplicated code, which I can eliminate by combining the two blocks:
BEGIN {
    print "file number Column 1 column2"
}
FNR == 1 || $2 < min {
    min = $2
    first = $1
    second = $2
}
ENDFILE {
    # Extract the file number to a[1]
    match(FILENAME, /.*-([0-9]+)\.dat/, a)
    print a[1], first, second
}
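If you are stuck with an awk that lacks ENDFILE and the three-argument match(), here is a hedged, portable sketch of the same logic (my own variant): it flushes the previous file's result whenever a new file starts, and assumes every file has at least one line:
awk '
BEGIN { print "file number Column 1 column2" }
FNR == 1 { if (NR > 1) report() }    # a new file began: report the previous one
FNR == 1 || $2 < min { min = $2; first = $1; second = $2 }
{ prev = FILENAME }                  # remember which file this line came from
END { report() }                     # report the final file
function report(    num) {
    num = prev
    sub(/\.dat$/, "", num)           # strip the extension
    sub(/.*-/, "", num)              # keep the digits after the last "-"
    print num, first, second
}
' *.dat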

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables and 'if' statements and loops and whatnot, but I've been told it's easier than that given some of awk features. I'd like to know how that might look like, along with an explanation of why it works.
BEGIN runs before reading the first line from the file; set sum and count to 0 there.
awk 'BEGIN { sum=0; count=0 }
{
    if (/[A-Za-z]/) {    # an alphabetic line ends the current numeric run
        if (count > 0) { avg = sum/count; print avg }
        count = 0; sum = 0
    } else {
        count++; sum += $1
    }
}
END { if (count > 0) { avg = sum/count; print avg } }' NumbersAndExtras.txt
When there is an alphabetic character on the line, calculate and print the average so far.
The same is done in the END block, which runs after the whole file has been processed.
Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c&&a{print "Average: "a/c;a=c=0}
END{if(c&&a){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3
Another one:
$ awk '
function avg(s, c) { print "Average: ", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file
Note: The value of this answer is in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define an output function that prints an average.
function printAvg() { print "Average: ", sum/count }

# Skip blank lines.
NF == 0 { next }

# Is the line non-numeric?
/[[:alpha:]]/ {
    # If this line ends a numeric block, print its
    # average now and reset the variables to start the next group.
    if (count) {
        printAvg()
        sum = count = 0
    }
    # Skip to the next line.
    next
}

# Numeric line: add to the sum and increment the counter.
{ sum += $1; count++ }

# Finally:
END {
    # If there is a group whose average has not been printed yet,
    # do it now.
    if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average: ", sum/count }
NF == 0 { next }
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt

array over non-existing indices in awk

Sorry for the verbose question; it boils down to a very simple problem.
Assume there are n text files, each containing one column of strings (denoting groups) and one column of integers (the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical, it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk, one can calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls | grep .log | awk -F'.' '{print $1}')
for q in $NAMES
do
    gawk -F' ' -v y=$q 'BEGIN { print "param", y }
        { sum1[$1] += $2; N[$1]++ }
        END {
            for (key in sum1) {
                avg1 = sum1[key] / N[key]
                printf "%s %f\n", key, avg1
            }
        }' $q.log | sort > $q.mean
done
However, for the aforementioned reasons, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and, in the second column, the corresponding mean value or an empty space, depending on whether that category is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN { arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d" }
    { sum[$1] += $2; N[$1]++ }
    END {
        for (i in arr) {
            if (i in sum) {
                avg = sum[i] / N[i];
                printf "%s %f\n" i, avg;
            } else {
                printf "%s %s\n" i, "";
            }
        }
    }' xxyz.log > xxyz.mean
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[0] (because a is a variable not previously used), then "b" to the same element (because b is a variable not previously used), then "c", then "d". Clearly, not what you had in mind. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i;
else {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}
}' xxyz.log > xxyz.mean;
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)

for i in *.log; do
    for j in "${sorted[@]}"; do
        awk -v var=$j '
            {
                sum[$1] += $2
                cnt[$1]++
            }
            END {
                print var, (var in cnt ? sum[var]/cnt[var] : "")
            }
        ' "$i" >> "${i/.log/.main}"
    done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --