Non blocking read from GNU awk coprocess?

Non blocking read from GNU awk coprocess? - awk

I would like to implement incremental execution of scripts using gawk in order to interleave script source and script output in a document.
The idea would be to read script lines into awk to print them and also pipe them into an appropriate interpreter. Then, on a queue from the input file, read any output from the coprocess and print it to standard output. But it seems that I must know how much output has been generated before looping over the coprocess output.
Is there any way to do a non-blocking read from the coprocess?
function script_checkpoint() {
while(("python3" |& getline output) > 0)
print output
}
/^# checkpoint/ { script_checkpoint(); next }
{ print; print $0 |& "python3" }
END { script_checkpoint() }
EDIT: I have tried to implement this without using a coprocess by buffering the input lines until a checkpoint and just letting the interpreter print to standard out itself but the interpreter always buffers its output until the stream closes. I don't want to close it until the program ends to preserve its internal state.
EDIT: made it more clear that my first intended use case is running python scripts. Here is a sample input/output pair.
print('first line')
# checkpoint
print('second line')
should result in
print('first line')
first line
print('second line')
second line

The general issue:
while ((interpreter |& getline output) > 0) runs until it sees an EOF but ...
interpreter does not end/terminate/exit, thus no EOF is sent so ...
awk hangs while waiting for interpreter to send more data so ...
we end up with a deadlock situation (awk waiting for input from interpreter; interpreter waiting for input from awk)
Assumptions:
need to maintain a single invocation of interpreter throughout the run (per a comment from OP); net result: awk cannot depend on interpreter sending an EOF
interpreter can be modified (to generate additional output)
the awk script has no way of knowing how many lines of output will be generated by interpreter
One idea is to setup a handshake between awk and interpreter. Within the while ((interpreter |& getline output) > 0) loop we'll test for our handshake and when we see it break out of the loop and return back to the main awk script.
For demo purposes I'll use a simple bash script that does some handshake processing otherwise just prints to stdout whatever it reads from stdin:
$ cat interpreter
#!/usr/bin/bash
while read -r line
do
if [[ "${line}" = 'checkpoint' ]] # received 'checkpoint' handshake?
then
echo "CHECKPOINT" # send "CHECKPOINT" handshake/acknowledgement
continue
else
echo "interpreter: $line"
fi
done
Demo awk code with handshake logic:
awk '
function script_checkpoint() {
while (( cmd |& getline output) > 0) {
if ( output == "CHECKPOINT" ) # received "CHECKPOINT" handshake/acknowledgement?
break
print output
}
}
BEGIN { cmd= "./interpreter" }
/^# checkpoint/ { print "checkpoint" |& cmd # send "checkpoint" handshake
script_checkpoint()
next
}
{ print "awk: " $0
print $0 |& cmd
}
END { print "awk: last checkpoint" # in case last line of input is not "# checkpoint" we will ...
print "checkpoint" |& cmd # send one last "checkpoint" handshake
script_checkpoint()
print "awk: done"
}
' test.dat
Sample input file:
$ cat test.dat
line1
line2
# checkpoint
line3
line4
# checkpoint
line5
Output:
awk: line1
awk: line2
interpreter: line1
interpreter: line2
awk: line3
awk: line4
interpreter: line3
interpreter: line4
awk: line5
awk: last checkpoint
interpreter: line5
awk: done
NOTES:
awk will still hang in the event interpreter crashes and/or fails to send back the CHECKPOINT handshake
if the strings checkpoint and/or CHECKPOINT can show up in the 'normal' data streams then update the code to use strings that are not expected in the data streams

It sounds like you're trying to do something like this:
BEGIN { cmd="/my/python/script/path" }
function script_checkpoint( output) {
close(cmd,"to")
while ( (cmd |& getline output) > 0 ) {
print output
}
close(cmd)
}
/^# checkpoint/ {
script_checkpoint()
next
}
{
print
print |& cmd
}
END { script_checkpoint() }

Related

Trap or evaluate bad regular expression string at runtime in awk script

How can I trap an error if a dynamic regular expression evaluation is bad like:
var='lazy dog'
# a fixed Regex here, but original is coming from ouside the script
Regex='*.'
#try and failed
if (var ~ Regex) foo
The goal is to manage this error as I cannot test the regex itself (it comes from external source). Using POSIX awk (AIX)

Something like this?
$ echo 'foo' |
awk -v re='*.' '
BEGIN {
cmd="awk --posix \047/" re "/\047 2>&1"
cmd | getline rslt
print "rslt="rslt
close(cmd)
}
{ print "got " $0 " but re was bad" }
'
rslt=awk: cmd. line:1: error: Invalid preceding regular expression: /*./
got foo but re was bad
I use gawk so I had to add --posix to make it not just accept that regexp as a literal * followed by any char. You'll probably have to change the awk command being called in cmd to behave sensibly for your needs with both valid and invalid regexps but you get the idea - to do something like an eval in awk you need to have awk call itself via system() or a pipe to getline. Massage to suit...
Oh, and I don't think you can get the exit status of cmd with the above syntax and you can't capture the output of a system() call within awk so you may need to test the re twice - first with system() to find out if it fails but redirecting it's output to /dev/null, and then on a failure run it again with getline to capture the error message.
Something like:
awk -v re='*.' '
BEGIN {
cmd="awk --posix \047/" re "/\047 2>&1"
if ( system(cmd " > /dev/null") ) {
close(cmd " > /dev/null")
cmd | getline rslt
print "rslt="rslt
close(cmd)
}
}
{ print "got " $0 " but re was bad" }
'

Awk: I want to use the input filename to generate an output file with same name different extension

I have a script that looks like this:
#! /bin/awk -f
BEGIN { print "start" }
{ print $0 }
END { print "end" }
Call the script like this: ./myscript.awk test.txt
Pretty simple - takes a file and adds "start" to the start and "end" to the end.
Now I want to take the input filename, lets call it test.txt, and print the output to a file called test.out.
So I tried to print the input filename:
BEGIN { print "fname: '" FILENAME "'" }
But that printed: fname: '' :(
The rest I can figure out I think, I have this following to print to a hard-coded filename:
#! /bin/awk -f
BEGIN { print "start" > "test.out" }
{ print $0 >> "test.out" }
END { print "end" >> "test.out" }
And that works great.
So the questions are:
how do I get the input filename?
Assuming somehow I get the input file name in a variable, e.g. FILENAME which contains "test.txt" how would I make another variable, e.g. OUTFILE, which contains "test.out"?
Note: I will be doing much more awk processing so please don't suggest to use sed or other languages :))

Try something like this:
#! /bin/awk -f
BEGIN {
file = gensub(".txt",".out","g",ARGV[1])
print "start" > file
}
{ print $0 >> file }
END {
print "end" >> file
close(file)
}
I'd suggest to close() the file too in the END{} statement. Good call to Sundeep for pointing out that FILENAME is empty in BEGIN.

$ echo 'foo' > ip.txt
$ awk 'NR==1{op=FILENAME; sub(/\.[^.]+$/, ".log", op); print "start" > op}
{print > op}
END{print "end" > op}' ip.txt
$ cat ip.log
start
foo
end
Save FILENAME to a variable, change the extension using sub and then print as required
From gawk manual
Inside a BEGIN rule, the value of FILENAME is "", because there are no input files being processed yet

If you're using GNU awk (gawk), you can use the patterns BEGINFILE and ENDFILE
awk 'BEGINFILE{
outfile=FILENAME;
sub(".txt",".out",outfile);
print "start" > outfile
}
ENDFILE{
print "stop" >outfile
}' file1.txt file2.txt
You can then use the variable outfile your the main {...} loop.
Doing so will allow you to process more that 1 file in a single awk command.

golang command not working even though manually executing it does

I'm trying to execute an awk command in my go program (the awk command pulls zip codes for a specified city, San Francisco in this case, from a tab delimited file of California zip codes):
cmd := exec.Command(
"awk",
"-F",
"'\\t'",
"'{if ($4 == \"SAN FRANCISCO\") print $0; }'",
"zipcodes_ca.txt",
)
fmt.Println(cmd.Args)
var out bytes.Buffer
var stderr bytes.Buffer
cmd.Stdout = &out
cmd.Stderr = &stderr
err := cmd.Run()
if err != nil {
fmt.Println(fmt.Sprint(err) + ": " + stderr.String())
return
}
This outputs:
[awk -F '\t' '{if ($4 == "SAN FRANCISCO") print $0; }' zipcodes_ca.txt]
exit status 2: awk: syntax error at source line 1
context is
>>> ' <<<
awk: bailing out at source line 1
If I take the printed args from the command and just run that as a command awk -F '\t' '{if ($4 == "SAN FRANCISCO") print $0; }' zipcodes_ca.txt it works. But, running it through my go program seems to be having issues. I'm not sure what I'm doing wrong here. I'm guessing I'm escaping things incorrectly, but nothing I try seems to work.

I don't think that you need single quotes around arguments. They are an artifact of using shell that prevents shell from interpreting argument content. Try without them.

How can I check if a GNU awk coprocess is open, or force it to open without writing to it?

I have a gawk program that uses a coprocess. However, sometimes I don't have any data to write to the coprocess, and my original script hangs while waiting for the output of the coprocess.
The code below reads from STDIN, writes each line to a "cat" program, running as a coprocess. Then it reads the coprocess output back in and writes it to STDOUT. If we change the if condition to be 1==0, nothing gets written to the coprocess, and the program hangs at the while loop.
From the manual, it seems that the coprocess and the two-way communication channels are only started the first time there is an IO operation with the |& operator. Perhaps we can start things without actually writing anything (e.g. writing an empty string)? Or is there a way to check if the coprocess ever started?
#!/usr/bin/awk -f
BEGIN {
cmd = "cat"
## print "" |& cmd
}
{
if (1 == 1) {
print |& cmd
}
}
END {
close (cmd, "to")
while ((cmd |& getline line)>0) {
print line
}
close(cmd)
}

Great question, +1 for that!
Just test the return code of the close(cmd, "to") - it will be zero if the pipe was open, -1 (or some other value) otherwise. e.g.:
if (close(cmd, "to") == 0) {
while ((cmd |& getline line)>0) {
print line
}
close(cmd)
}

Assigning system command's output to variable

I want to run the system command in an awk script and get its output stored in a variable. I've been trying to do this, but the command's output always goes to the shell and I'm not able to capture it. Any ideas on how this can be done?
Example:
$ date | awk --field-separator=! {$1 = system("strip $1"); /*more processing*/}
Should call the strip system command and instead of sending the output to the shell, should assign the output back to $1 for more processing. Rignt now, it's sending output to shell and assigning the command's retcode to $1.

Note: Coprocess is GNU awk specific.
Anyway another alternative is using getline
cmd = "strip "$1
while ( ( cmd | getline result ) > 0 ) {
print result
}
close(cmd)
Calling close(cmd) will prevent awk to throw this error after a number of calls :
fatal: cannot open pipe `…' (Too many open files)

To run a system command in awk you can either use system() or cmd | getline.
I prefer cmd | getline because it allows you to catch the value into a variable:
$ awk 'BEGIN {"date" | getline mydate; close("date"); print "returns", mydate}'
returns Thu Jul 28 10:16:55 CEST 2016
More generally, you can set the command into a variable:
awk 'BEGIN {
cmd = "date -j -f %s"
cmd | getline mydate
close(cmd)
}'
Note it is important to use close() to prevent getting a "makes too many open files" error if you have multiple results (thanks mateuscb for pointing this out in comments).
Using system(), the command output is printed automatically and the value you can catch is its return code:
$ awk 'BEGIN {d=system("date"); print "returns", d}'
Thu Jul 28 10:16:12 CEST 2016
returns 0
$ awk 'BEGIN {d=system("ls -l asdfasdfasd"); print "returns", d}'
ls: cannot access asdfasdfasd: No such file or directory
returns 2

Figured out.
We use awk's Two-way I/O
{
"strip $1" |& getline $1
}
passes $1 to strip and the getline takes output from strip back to $1

gawk '{dt=substr($4,2,11); gsub(/\//," ",dt); "date -d \""dt"\" +%s"|getline ts; print ts}'

You can use this when you need to process a grep output:
echo "some/path/exex.c:some text" | awk -F: '{ "basename "$1"" |& getline $1; print $1 " ==> " $2}'
option -F: tell awk to use : as field separator
"basename "$1"" execute shell command basename on first field
|& getline $1 reads output of previous shell command in substream
output:
exex.c ==> some text

I am using macOS's awk and I also needed exit status of the command. So I extended #ghostdog74's solution to get the exit status too:
Exit if non-zero exit status:
cmd = <your command goes here>
cmd = cmd" ; printf \"\n$?\""
last_res = ""
value = ""
while ( ( cmd | getline res ) > 0 ) {
if (value == "") {
value = last_res
} else {
value = value"\n"last_res
}
last_res = res
}
close(cmd)
# Now `res` has the exit status of the command
# and `value` has the complete output of command
if (res != 0) {
exit 1
} else {
print value
}
So basically I just changed cmd to print exit status of the command on a new line. After the execution of the above while loop, res would contain the exit status of the command and
value would contain the complete output of the command.
Honestly not a very neat way and I myself would like to know if there is some better way.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Non blocking read from GNU awk coprocess? - awk

Related

Trap or evaluate bad regular expression string at runtime in awk script

Awk: I want to use the input filename to generate an output file with same name different extension

golang command not working even though manually executing it does

How can I check if a GNU awk coprocess is open, or force it to open without writing to it?

Assigning system command's output to variable

Categories

Resources