Pre-populate associative array keys in awk?

I've written a munin plugin that uses slurm's sacct to monitor job states on an HPC cluster. I wrote it in sh + awk (rather than my usual tool of choice, Perl).
The script works, but it took me ages to figure out how to pre-populate the associative array of possible states (some or most may not be present in the sacct output, and I want them to default to zero). Google wasn't much help, and the best I could come up with was to use split on a string to produce a temporary array, which I then iterated over.
I came up with this:
BEGIN {
num = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
for (i=1;i<=num;i++) {
states[statenames[i]] = 0
}
}
This works, but seems clumsy compared to how I'd do it in Perl, like this:
foreach (qw(cancelled completed completing failed nodefail pending running suspended timeout)) {
$states{$_} = 0;
}
or this
%states = map {$_ => 0} qw(cancelled completed completing failed nodefail pending running suspended timeout);
My question is: is there a way of doing this in awk that is similar to either of the Perl versions?
[ edited ]
To clarify, here's a sample of the sacct output I'm piping into awk. Note that the only states in this output are RUNNING, COMPLETED, and CANCELLED. The others don't appear (because they haven't occurred today), but I want them in my script's output anyway (in a form usable by munin as "statename.value 0").
# sacct -X -P -o 'state' -n
RUNNING
RUNNING
RUNNING
RUNNING
COMPLETED
RUNNING
COMPLETED
RUNNING
COMPLETED
COMPLETED
CANCELLED by 1000
COMPLETED
[ edited again ]
and here's sample output from my munin plugin:
# ./slurm-sacct
suspended.value 0
pending.value 0
nodefail.value 0
failed.value 0
running.value 6
completing.value 0
completed.value 5
timeout.value 0
cancelled.value 1
The script runs and does what I want, I just wanted to know if there was a better way to initialise the associative array.

You probably don't need to do it at all. Variables in awk are dynamic, which means they're automatically initialized when they are first used (either assigned to or accessed), and this applies to array elements as well.
A variable will be initialized to 0 if it's accessed in a numeric context, or to the empty string otherwise. (This isn't gawk-specific: POSIX specifies that an uninitialized value is both numeric 0 and the empty string.) So if you're doing something like counting the number of jobs that are in each state, the entire program is as simple as something like
{ states[$1]++ }
END {
for (state in states) print state, states[state]
}
Each time the expression states[$1]++ is executed, it will check for the existence of states[$1] and initialize it to 0 if it doesn't already exist.
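To see the auto-initialization in action, here is a small sketch (the array name a and the keys "x" and "y" are made up for illustration). It also shows a side effect worth knowing: merely referencing an element creates it, whereas the in operator tests for existence without creating anything.

```shell
out=$(awk 'BEGIN {
    print (("x" in a) ? "present" : "absent")   # "in" does not create a["x"]
    print a["x"] + 0                            # numeric context: 0
    print "[" a["y"] "]"                        # string context: empty string
    print (("x" in a) ? "present" : "absent")   # the reference above created it
}')
printf '%s\n' "$out"
```

This prints absent, 0, [] and present, one per line.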
EDIT: From your comment I'm guessing you want to print out a line for each possible state, regardless of whether there are any jobs in that state or not. In that case, you need to include all the possible state names, and there is no shortcut notation for doing so as there is in Perl. As far as I know, what you've already found is about as clean as it gets. (Awk is not really designed with that usage in mind)
I'd suggest the following:
{ states[$1]++ }
END {
split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
for (i in statenames) print statenames[i], states[statenames[i]]+0
}

Perhaps Craig could use, instead of:
print "Timeout states " states["timeout"] ".";
this:
print "Timeout states " int(states["timeout"]) ".";
In my case if there is no timeout state in awk input, the first print will give:
Timeout states .
While the second will give:
Timeout states 0.

I think a more natural approach in awk would be to have a separate file of keys. Consider a file keys.txt with one key per line. You could then do something like this:
printf "key1\nkey2\nkey2\nkey5" |
awk '
FILENAME == "keys.txt" {
counts[$0] = 0
next
}
{
counts[$0]++
}
END {
for (key in counts) {
print key, counts[key]
}
}' keys.txt -
With five keys in keys.txt, this produces:
key1 1
key2 2
key3 0
key4 0
key5 1
Although the keys are shown in order here, that's just incidental and shouldn't be relied upon.
For the specific example, you could also skip the associative array altogether. Instead, you could minimally process the lines with awk and use sort | uniq -c to tabulate the counts. The presence of all keys could be ensured using join against a file of keys.
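A minimal sketch of that sort | uniq -c plus join idea, assuming the state list from the question. The temporary file names and the inline sample input are made up for illustration; the real sacct command would be piped in instead of the printf.

```shell
tmp=$(mktemp -d)

# The full list of states, one per line; join needs its inputs sorted.
printf '%s\n' cancelled completed completing failed nodefail pending \
    running suspended timeout | LC_ALL=C sort > "$tmp/keys"

# Stand-in for the sacct output. The first awk keeps only the state name,
# lowercased; sort | uniq -c tabulates it as "<count> <state>"; the second
# awk swaps the columns so the state name is the join field.
printf '%s\n' RUNNING RUNNING COMPLETED 'CANCELLED by 1000' |
    awk '{ print tolower($1) }' |
    LC_ALL=C sort | uniq -c |
    awk '{ print $2, $1 }' > "$tmp/counts"

# -a 1 keeps keys that have no count; -e 0 fills the missing count with 0.
out=$(LC_ALL=C join -a 1 -e 0 -o 0,2.2 "$tmp/keys" "$tmp/counts")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Unlike the for (key in counts) loop, the output here comes out in sorted key order, because join walks the sorted key file.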

awk is somewhat clumsier (I would say "less terse") than Perl.
You could write this (similar to @Michael's answer):
pipeline of data |
awk '
NR == FNR {statenames[$1]=0; next}
{ usual processing }
END { usual output }
' <(printf "%s\n" cancelled completed completing failed nodefail pending running suspended timeout) -

One tweak to @DavidZaslavsky's answer might be to print the states in the order you specified them on the split() line. That would be:
{ states[tolower($1)]++ }
END {
n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
for (i=1; i<=n; i++) {
state = statenames[i]
print state, states[state]+0
}
}
I also converted the input to lower case so it matches your hard-coded values, got rid of the unnecessary 3rd arg to split() and the subsequent null statement (trailing semi-colon).
In case you want to account for finding state names in your input that weren't in your hard-coded set, you could tweak it to:
{ states[tolower($1)]++ }
END {
n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
for (i=1; i<=n; i++) {
state = statenames[i]
print state, states[state]+0
delete states[state]
}
for (state in states) {
printf "WARNING: found new state name %s\n",state | "cat>&2"
print state, states[state]+0
}
}

Related

Ghostscript for PS integrity test: terminate at EOF, return error unless stack is empty

To test the integrity of PostScript files, I'd like to run Ghostscript in the following way:
Return 1 (or other error code) on error
Return 0 (success) at EOF if stack is empty
Return 1 (or other error code) otherwise
I could run gs in the background, and use a timeout to force termination if gs hangs with items left on the stack. Is there an easier solution?
Ghostscript won't hang if you send files as input (unless you write a program which enters an infinite loop or otherwise fails to reach a halting state). Having items on any of the stacks won't cause it to hang.
On the other hand, it won't give you an error if a PostScript program leaves operands on the operand stack (or dictionaries on the dictionary stack, clips on the clip stack or gstates on the graphics state stack). This is because that's not an error, and since PostScript interpreters normally run in a job server loop it's not a problem either. Terminating the job returns control to the job server loop, which does a save and restore round the total job, thereby clearing up anything left behind.
I'd suggest that if you really want to do this you need to adopt the same approach: write a PostScript program which executes the PostScript program you want to 'test', then checks the operand stack (and other stacks if required) to see if anything is left. Note that you will want to execute the test program in a stopped context, as an error in the course of the program will clearly potentially leave stuff lying around.
Ghostscript returns 0 on a clean exit and a value less than 0 for errors, if I remember correctly. You would need to use signalerror in your test framework in order to raise an error if items are left at the end of a program.
[EDIT]
Anything supplied to Ghostscript on the command line by either -s or -d is defined in systemdict, so if we do -sInputFileName=/test.pdf then we will find in systemdict a key /InputFileName whose value is a string with the contents (/test.pdf). We can use that to pass the filename to our program.
The stopped operator takes an executable array as an argument, and returns either true or false depending on whether an error occurred while executing the array (3rd Edition PLRM, p 697).
So we need to run the program contained in the filename we've been given, and do it in a 'stopped' context. Something like this:
{InputFileName run} stopped
{
(Error occurred\n) print flush
%% Potentially check $error for more information.
}{
(program terminated normally\n) print flush
%% Here you could check the various stacks
} ifelse
The following, based 90% on KenS's answer, is 99% satisfactory:
Program checkIntegrity.ps:
{Script run} stopped
{
(\n===> Integrity test failed: ) print Script print ( has error\n\n) print
handleerror
(ignore this error which only serves to force a return value of 1) /syntaxerror signalerror
}{
% script passed, now check the stack
count dup 0 eq {
pop (\n===> Integrity test passed: ) print Script print ( terminated normally\n\n) print
} {
(\n===> Integrity test failed: ) print Script print ( left ) print
3 string cvs print ( item(s) on stack\n\n) print
Script /syntaxerror signalerror
} ifelse
} ifelse
quit
Execute with
gs -q -sScript=CodeToBeChecked.ps checkIntegrity.ps ; echo $?
For the last 1% of satisfaction I would need a replacement for
(blabla) /syntaxerror signalerror
It forces exit with return code 1, but it is very verbose and distracts from the actual error in the checked script that is reported by handleerror. Therefore a cleaner way to exit(1) would be welcome.

AWK: go through the file twice, doing different tasks

I am processing a fairly big collection of Tweets and I'd like to obtain, for each tweet, its mentions (other users' names, prefixed with an @), if the mentioned user is also in the file:
users = new Dictionary()
for each line in file:
username = get_username(line)
userid = get_userid(line)
users.add(key = username, value = userid)
for each line in file:
mentioned_names = get_mentioned_names(line)
mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
print "$line | $mentioned_ids"
I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try and add this to my AWK script. However, I can't find a way to make two passes over the same file, executing different code for each one. Most solutions imply calling AWK several times, but then I'd lose the associative array I made in the first pass.
I could do it in very hacky ways (like cat'ing the file twice, passing it through sed to add a different prefix to all the lines in each cat), but I'd like to be able to understand this code in a couple of months without hating myself.
What would be the AWK way to do this?
PS:
The least terrible way I've found:
function rewind( i)
{
# from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
# shift remaining arguments up
for (i = ARGC; i > ARGIND; i--)
ARGV[i] = ARGV[i-1]
# make sure gawk knows to keep going
ARGC++
# make current file next to get done
ARGV[ARGIND+1] = FILENAME
# do it
nextfile
}
BEGIN {
count = 1;
}
count == 1 {
# first pass, fills an associative array
}
count == 2 {
# second pass, uses the array
}
FNR == 30 {
# handcoded length, horrible
# could also be automated calling wc -l, passing as parameter
if (count == 1) {
count = 2;
rewind(1)
}
}
The idiomatic way to process two separate files, or the same file twice in awk is like this:
awk 'NR==FNR{
# fill associative array
next
}
{
# use the array
}' file1 file2
The total record number NR is only equal to the record number for the current file FNR on the first file. next skips the second block for the first file. The second block is then processed for the second file. If file1 and file2 are the same file, then this passes through the file twice.
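Applied to the tweets question, the idiom might look like the sketch below. The file layout (userid, username, then the tweet text) and the array name id are assumptions made for illustration, since the question does not show the real format. The same file is passed twice on the command line.

```shell
tmp=$(mktemp)
# Hypothetical layout: userid, username, then the tweet text.
printf '%s\n' '1 alice hello @bob and @carol' '2 bob hi @alice' > "$tmp"

out=$(awk '
    NR == FNR { id[$2] = $1; next }     # pass 1: map username -> userid
    {                                   # pass 2: resolve each @mention
        ids = ""
        for (i = 3; i <= NF; i++)
            if (substr($i, 1, 1) == "@") {
                name = substr($i, 2)
                ids = ids (ids == "" ? "" : ",") ((name in id) ? id[name] : "null")
            }
        print $0 " | " ids
    }' "$tmp" "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Using (name in id) rather than id[name] for the test avoids creating empty entries for users that are not in the file.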

How to handle 3 files with awk?

Ok, so after spending 2 days, I am not able to solve it and I am almost out of time now. It might be a very silly question, so please bear with me. My awk script does something like this:
BEGIN{ n=50; i=n; }
FNR==NR {
# Read file-1, which has just 1 column
ids[$1]=int(i++/n);
next
}
{
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
It works fine. But now I want to extend it to read 3 files. Let's say, instead of hard-coding the value of "n", I need to read a properties file and set value of "n" from that. I found this question and have tried something like this:
BEGIN{ n=0; i=0; }
FNR==NR {
# Block A
# Try to read file-0
next
}
{
# Block B
# Read file-1, which has just 1 column
next
}
{
# Block C
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
But it is not working. Block A is executed for file-0, I am able to read the property from properties files. But Block B is executed for both files file-1 and file-2. And Block C is never executed.
Can someone please help me solve this? I have never used awk before and the syntax is very confusing. Also, if someone can explain how awk reads input from different files, that will be very helpful.
Please let me know if I need to add more details to the question.
If you have gawk, just test ARGIND:
awk '
ARGIND == 1 { do file 1 stuff; next }
ARGIND == 2 { do file 2 stuff; next }
' file1 file2
If you don't have gawk, get it.
In other awks though you can just test for the file name:
awk '
FILENAME == ARGV[1] { do file 1 stuff; next }
FILENAME == ARGV[2] { do file 2 stuff; next }
' file1 file2
That only fails if you want to parse the same file twice; in that case you need to add a count of the number of times that file's been opened.
Update: The solution below works, as long as all input files are nonempty, but see @Ed Morton's answer for a simpler and more robust way of adding file-specific handling.
However, this answer still provides a hopefully helpful explanation of some awk basics and why the OP's approach didn't work.
Try the following (note that I've made the indices 1-based, as that's how awk does it):
awk '
# Increment the current-file index, if a new file is being processed.
FNR == 1 { ++fIndex }
# Process current line if from 1st file.
fIndex == 1 {
print "file 1: " FILENAME
next
}
# Process current line if from 2nd file.
fIndex == 2 {
print "file 2: " FILENAME
next
}
# Process current line (from all remaining files).
{
print "file " fIndex ": " FILENAME
}
' file-1 file-2 file-3
Pattern FNR==1 is true whenever a new input file is starting to get processed (FNR contains the input file-relative line number).
Every time a new file starts processing, fIndex is incremented and thus reflects the 1-based index of the current input file. Tip of the hat to @twalberg's helpful answer.
Note that an uninitialized awk variable used in a numeric context defaults to 0, so there's no need to initialize fIndex (unless you want a different start value).
Patterns such as fIndex == 1 can then be used to execute blocks for lines from a specific input file only (assuming the block ends in next).
The last block is then executed for all input files that don't have file-specific blocks (above).
As for why your approach didn't work:
Your 2nd and 3rd blocks are potentially executed unconditionally, for lines from all input files, because they are not preceded by a pattern (condition).
So your 2nd block is entered for lines from all subsequent input files, and its next statement then prevents the 3rd block from ever getting reached.
Potential misconceptions:
Perhaps you think that each block functions as a loop processing a single input file. This is NOT how awk works. Instead, the entire awk program is processed in a loop, with each iteration processing a single input line, starting with all lines from file 1, then from file 2, ...
An awk program can have any number of blocks (typically preceded by patterns), and whether they're executed for the current input line is solely governed by whether the pattern evaluates to true; if there is no pattern, the block is executed unconditionally (across input files). However, as you've already discovered, next inside a block can be used to skip subsequent blocks (pattern-block pairs).
Perhaps you need to consider adding some additional structure like this:
BEGIN { file_number=0 }
FNR==1 { ++file_number }
file_number==3 && /something_else/ { ...}
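Putting the pieces together for the three-file case in the question, a sketch could look like this. The key=value properties format, the sample file contents and the variable name fileNo are assumptions for illustration; only the int(i/n) bucketing is taken from the question. Like the counter approach above, it assumes no input file is empty.

```shell
tmp=$(mktemp -d)
printf 'n=2\n' > "$tmp/props"                  # hypothetical properties file
printf '%s\n' a b c d > "$tmp/ids"             # file-1: one column
printf '%s\n' 'b x y z' 'd x y z' > "$tmp/data"   # file-2: four columns

out=$(awk '
    FNR == 1 { fileNo++ }                   # a new input file starts
    fileNo == 1 {                           # properties file: pick up n
        split($0, kv, "=")
        if (kv[1] == "n") { n = kv[2]; i = n }
        next
    }
    fileNo == 2 {                           # id file: assign buckets of size n
        ids[$1] = int(i / n); i++
        next
    }
    { print $1, ids[$1] }                   # data file: use the buckets
' "$tmp/props" "$tmp/ids" "$tmp/data")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Every block except the last carries both a pattern and a next, so each input line is handled by exactly one block.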

How does this NAWK script work to show the ports being used by a process on Solaris?

I am trying to understand how the following command works (from here):
pfiles /proc/* 2>&- |
nawk 'END {
if (f) print p
}
/^[0-9]/ {
if (f) print p, RS
p = $0
f = 0
}
/INET / {
sub(/.*INET/,"")
p = p ? p RS $0 : $0
f = 1
}'
This command works well (in SOLARIS 5.10) and shows all the ports opened by processes.
I understand that, pfiles /proc/* displays a bunch of output related to all processes by querying the /proc/ filesystem. From the man-page:
pfiles Report fstat(2) and fcntl(2) information
for all open files in each process. In
addition, a path to the file is reported
if the information is available from
/proc/pid/path. This is not necessarily
the same name used to open the file. See
proc(4) for more information.
The output from pfiles is then processed by nawk ('New Awk').
Questions
Could you please explain how NAWK is processing the output of pfiles in the following command? It would be most helpful to know what the variables f, p and $0 mean.
In the first line, what does redirection of standard error to &- mean? Does it mean the standard error stream is being closed ?
I had to read that script once or twice to make sure I got it straight in
my head. It's a little confusing because we see the END at the beginning.
$0 is the entire line.
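The pattern /^[0-9]/ matches the line that starts each process's output
(it begins with the process id). That block first prints whatever was
collected for the previous process, but only if f is set, then stores the
new line in p and resets the sentinel variable f to 0.
The block starting with /INET / matches socket lines, strips everything up
to and including INET (via the sub(..)), and appends what is left (the
address and port) to p. The sentinel value f is set to 1 so that we know
this process has something worth printing. The END block runs only once,
after all input has been read, and prints the data collected for the last
process if f is set.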
BTW, the RS is the Record Separator.
Running the script on just one process might make it a little easier to get
the head around it.
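To trace how f and p behave without a Solaris box, here is a sketch that feeds the same program a made-up, abridged input. The sample lines only mimic the general shape of pfiles output (a line starting with the pid, then indented detail lines) and are not real output; plain awk is assumed to behave like nawk here.

```shell
# Made-up stand-in for pfiles output: three processes, the second one
# has no INET sockets and should therefore not be printed.
sample=$(printf '100:\t/usr/sbin/foo\n   3: S_IFSOCK mode:0666\n\tsockname: AF_INET 0.0.0.0  port: 22\n200:\t/usr/bin/bar\n   0: S_IFCHR mode:0620\n300:\t/usr/bin/baz\n\tsockname: AF_INET 127.0.0.1  port: 80\n')

out=$(printf '%s\n' "$sample" | awk '
    END       { if (f) print p }            # flush the last process
    /^[0-9]/  { if (f) print p, RS          # new process: flush the previous one
                p = $0; f = 0 }             # start collecting, no socket seen yet
    /INET /   { sub(/.*INET/, "")           # keep only address and port
                p = p ? p RS $0 : $0
                f = 1 }                     # this process has a socket
')
printf '%s\n' "$out"
```

Processes 100 and 300 are printed with their address and port lines, while 200 is dropped because f was still 0 when the next header line arrived.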
Sorry, forgot to answer your other question re the redirection.
2>&-
in this context means "close stderr (file descriptor 2) for pfiles", so any
error messages pfiles writes (for example, about processes it cannot examine)
are discarded. It does not change what nawk reads from the pipe.
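The effect of 2>&- can be checked with a small sketch. The noisy function is a made-up stand-in for a command that writes to both streams.

```shell
# A stand-in command that writes one line to stderr and one to stdout.
noisy() { echo "to stderr" >&2; echo "to stdout"; }

# 2>&1 merges stderr into the captured output; 2>&- closes stderr instead,
# so the write to it fails and nothing from that stream is captured or shown.
merged=$(noisy 2>&1)
closed=$(noisy 2>&-)
printf 'merged: %s\nclosed: %s\n' "$merged" "$closed"
```

Redirecting to /dev/null (2>/dev/null) is the more common way to silence errors, since some programs misbehave when a standard descriptor is closed.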

Perl SQL file write delayed

Here is a simple Perl script that fetches data from SQL.
It reads the data and writes it to a file OUTFILE, and prints every 10000th line on screen.
One thing I am curious about is that printing the data on screen finishes very quickly (in 30 seconds), whereas fetching the data and writing it to the file finishes much later (30 minutes later).
The amount of data is not large. The output file is smaller than 100 MB.
while ( my ($a,$b) = $curSqlEid->fetchrow_array() )
{
printf OUTFILE ("%s,%d\n", $a,$b);
$counter ++;
if($counter % 10000 == 0){
printf ("%s,%d\n", $a,$b);
}
}
$curSqlEid->finish();
$dbh->disconnect();
close(OUTFILE);
You are suffering from buffering.
Handles other than STDERR are buffered by default, and most handles use block buffering. That means Perl will wait until there is 8KB* of data to write before sending anything to the system.
STDOUT is special. When it is attached to a terminal (and only then), it uses a different kind of buffering: line buffering. When using line buffering, the data is flushed every time a newline is encountered in the data to write.
You can see this by running
$ perl -e'print "abc"; print "def"; sleep 5; print "\n"; sleep 5;'
[ 5 seconds pass ]
abcdef
[ 5 seconds pass ]
$ perl -e'print "abc"; print "def"; sleep 5; print "\n"; sleep 5;' | cat
[ 10 seconds pass ]
abcdef
The solution is to turn off buffering.
use IO::Handle qw( ); # Not needed on Perl 5.14 or later
OUTFILE->autoflush(1);
* — 8KB is the default. It can be configured when Perl is compiled. It used to be a non-configurable 4KB until 5.14.
I think you are seeing the output file size as 0 while the script is running and displaying on the console. Do not go by that. The file size will show up only once the script has finished. This is due to output buffering.
Anyways, the delay cannot be as large as 30 min. Once the script is done, you should see the output file data.
I tried various things, but my final conclusion is that Python and Perl handle the data flow from the DB in fundamentally different ways. It looks like in Perl it is possible to handle the data line by line while it is being transferred from the DB, whereas in Python you need to wait until the entire result has been downloaded from the server before you can process it.