I would like to implement incremental execution of scripts using gawk in order to interleave script source and script output in a document.
The idea would be to read script lines into awk to print them and also pipe them into an appropriate interpreter. Then, on a queue from the input file, read any output from the coprocess and print it to standard output. But it seems that I must know how much output has been generated before looping over the coprocess output.
Is there any way to do a non-blocking read from the coprocess?
function script_checkpoint() {
while(("python3" |& getline output) > 0)
print output
}
/^# checkpoint/ { script_checkpoint(); next }
{ print; print $0 |& "python3" }
END { script_checkpoint() }
EDIT: I have tried to implement this without using a coprocess by buffering the input lines until a checkpoint and just letting the interpreter print to standard out itself but the interpreter always buffers its output until the stream closes. I don't want to close it until the program ends to preserve its internal state.
EDIT: made it more clear that my first intended use case is running python scripts. Here is a sample input/output pair.
print('first line')
# checkpoint
print('second line')
should result in
print('first line')
first line
print('second line')
second line
The general issue:
while ((interpreter |& getline output) > 0) runs until it sees an EOF but ...
interpreter does not end/terminate/exit, thus no EOF is sent so ...
awk hangs while waiting for interpreter to send more data so ...
we end up with a deadlock situation (awk waiting for input from interpreter; interpreter waiting for input from awk)
Assumptions:
need to maintain a single invocation of interpreter throughout the run (per a comment from OP); net result: awk cannot depend on interpreter sending an EOF
interpreter can be modified (to generate additional output)
the awk script has no way of knowing how many lines of output will be generated by interpreter
One idea is to setup a handshake between awk and interpreter. Within the while ((interpreter |& getline output) > 0) loop we'll test for our handshake and when we see it break out of the loop and return back to the main awk script.
For demo purposes I'll use a simple bash script that does some handshake processing otherwise just prints to stdout whatever it reads from stdin:
$ cat interpreter
#!/usr/bin/bash
while read -r line
do
if [[ "${line}" = 'checkpoint' ]] # received 'checkpoint' handshake?
then
echo "CHECKPOINT" # send "CHECKPOINT" handshake/acknowledgement
continue
else
echo "interpreter: $line"
fi
done
Demo awk code with handshake logic:
awk '
function script_checkpoint() {
while (( cmd |& getline output) > 0) {
if ( output == "CHECKPOINT" ) # received "CHECKPOINT" handshake/acknowledgement?
break
print output
}
}
BEGIN { cmd= "./interpreter" }
/^# checkpoint/ { print "checkpoint" |& cmd # send "checkpoint" handshake
script_checkpoint()
next
}
{ print "awk: " $0
print $0 |& cmd
}
END { print "awk: last checkpoint" # in case last line of input is not "# checkpoint" we will ...
print "checkpoint" |& cmd # send one last "checkpoint" handshake
script_checkpoint()
print "awk: done"
}
' test.dat
Sample input file:
$ cat test.dat
line1
line2
# checkpoint
line3
line4
# checkpoint
line5
Output:
awk: line1
awk: line2
interpreter: line1
interpreter: line2
awk: line3
awk: line4
interpreter: line3
interpreter: line4
awk: line5
awk: last checkpoint
interpreter: line5
awk: done
NOTES:
awk will still hang in the event interpreter crashes and/or fails to send back the CHECKPOINT handshake
if the strings checkpoint and/or CHECKPOINT can show up in the 'normal' data streams then update the code to use strings that are not expected in the data streams
It sounds like you're trying to do something like this:
BEGIN { cmd="/my/python/script/path" }
function script_checkpoint( output) {
close(cmd,"to")
while ( (cmd |& getline output) > 0 ) {
print output
}
close(cmd)
}
/^# checkpoint/ {
script_checkpoint()
next
}
{
print
print |& cmd
}
END { script_checkpoint() }
I have a gawk program that uses a coprocess. However, sometimes I don't have any data to write to the coprocess, and my original script hangs while waiting for the output of the coprocess.
The code below reads from STDIN, writes each line to a "cat" program, running as a coprocess. Then it reads the coprocess output back in and writes it to STDOUT. If we change the if condition to be 1==0, nothing gets written to the coprocess, and the program hangs at the while loop.
From the manual, it seems that the coprocess and the two-way communication channels are only started the first time there is an IO operation with the |& operator. Perhaps we can start things without actually writing anything (e.g. writing an empty string)? Or is there a way to check if the coprocess ever started?
#!/usr/bin/awk -f
BEGIN {
cmd = "cat"
## print "" |& cmd
}
{
if (1 == 1) {
print |& cmd
}
}
END {
close (cmd, "to")
while ((cmd |& getline line)>0) {
print line
}
close(cmd)
}
Great question, +1 for that!
Just test the return code of the close(cmd, "to") - it will be zero if the pipe was open, -1 (or some other value) otherwise. e.g.:
if (close(cmd, "to") == 0) {
while ((cmd |& getline line)>0) {
print line
}
close(cmd)
}
I have a file in which I have to merge 2 rows on the basis of:
- Common sessionID
- Immediate next matching pattern (GX with QG)
file1:
session=001,field01,name=GX1_TRANSACTION,field03,field04
session=001,field91,name=QG
session=001,field01,name=GX2_TRANSACTION,field03,field04
session=001,field92,name=QG
session=004,field01,name=GX1_TRANSACTION,field03,field04
session=002,field01,name=GX1_TRANSACTION,field03,field04
session=002,field01,name=GX2_TRANSACTION,field03,field04
session=002,field92,name=QG
session=003,field91,name=QG
session=003,field01,name=GX2_TRANSACTION,field03,field04
session=003,field92,name=QG
session=004,field91,name=QG
session=004,field01,name=GX2_TRANSACTION,field03,field04
session=004,field92,name=QG
I have created an awk (I am new and learnt awk only from This portal only) which created my desired output.
Output1
session=001,field01,name=GX1_TRANSACTION,field03,field04,session=001,field91,name=QG
session=001,field01,name=GX2_TRANSACTION,field03,field04,session=001,field92,name=QG
session=002,field01,name=GX1_TRANSACTION,field03,field04,NOMATCH-QG
session=002,field01,name=GX2_TRANSACTION,field03,field04,session=002,field92,name=QG
session=003,field01,name=GX2_TRANSACTION,field03,field04,session=003,field92,name=QG
session=004,field01,name=GX1_TRANSACTION,field03,field04,session=004,field91,name=QG
session=004,field01,name=GX2_TRANSACTION,field03,field04,session=004,field92,name=QG
Output2: Pending
session=003,field91,name=QG
Awk:
{
if($0~/name=GX1_TRANSACTION/ || $0~/GX2_TRANSACTION/) {
if($1 in ccr)
print ccr[$1]",NOMATCH-QG";
ccr[$1]=$0;
}
if($0~/name=QG/) {
if($1 in ccr) {
print ccr[$1]","$0;
delete ccr[$1];
}
else {
print $0",NOUSER" >> Pending
}
}
}
END {
for (i in ccr)
print ccr[i]",NOMATCH-QG"
}
Command:
awk -F"," -v Pending=t -f a.awk file1
But Issue is my "file1" is really big, So I want to improve the performance of this script. Is their any way by which I can improve its performance?
There are a couple of changes that may lead to small improvements in speed, and if not may give you some ideas for future awk scripts.
Don't "manually" test every line if you don't have to - raise the name= tests to the main awk loop. Currently your script checks $0 up to three times per line for a name= match.
Since you're using , as the FS, test the corresponding field ($3) instead of $0. It only saves a few leading chars of pattern matching in your example data.
Here's a refactored a.awk:
$3~/name=GX[12]_TRANSACTION/ {
if($1 in ccr)
print ccr[$1]",NOMATCH-QG";
ccr[$1]=$0;
}
$3~/name=QG/ {
if($1 in ccr) {
print ccr[$1]","$0;
delete ccr[$1];
}
else {
print $0",NOUSER" >> Pending
}
}
END { for (i in ccr) print ccr[i]",NOMATCH-QG" }
I've also condensed the GX pattern match to one regex. I get the same output as your example.
In any program, IO (e.g. print statements) is usually the most real-time intensive operation. In awk there's an operation that's even slower, though, and that's string concatenation. Because awk doesn't require you to pre-allocate memory for strings, the memory gets allocated dynamically so then when you increase the length of a string, it must get dynamically re-allocated. So, you can speed up your program by removing the string concatenations, e.g. for all those hard-coded ","s you're printing instead of just setting/using the OFS.
I haven't really thought about the logic of your overall approach but there's a couple of other tweaks you could try:
BEGIN{ FS=OFS="," }
NF {
if ($3 ~ /name=GX[12]_TRANSACTION/) {
if($1 in ccr) {
print ccr[$1], "NOMATCH-QG"
}
ccr[$1]=$0
}
else {
if($1 in ccr) {
print ccr[$1], $0
delete ccr[$1]
}
else {
print $0, "NOUSER" >> Pending
}
}
}
END {
for (i in ccr)
print ccr[i], "NOMATCH-QG"
}
Note that by setting FS in the script you no longer need to use -F"," on the command line.
Are you sure you want >> instead of > on the print to "Pending"? Those 2 constructs don't mean the same in awk as they do in shell.
I am trying to append lines to some new files with awk in this way:
#!/usr/bin/awk -f
BEGIN {
FS = "[ \t|]"; }
{
print $5 "\t" $13 "\t" $14 >> "./bed/" $5 ".bed";
}
END {
}
New file is created with filename derived from a field of awk input file (5th field). I am unable to execute this script since it fails with
awk: ./blast2bed.awk:6: (FILENAME=blastout000 FNR=1) fatal: can't redirect to `./bed/AY517392.1.bed' (No such file or directory)
Any hints?
Thanks
The directory bed has to exist so create it first with mkdir bed either before you run your script or in the BEGIN block. You should also add brackets around the output file:
print $5"\t"$13"\t"$14 >> ("./bed/"$5".bed")
Notes: You don't need to end lines with ; if you have a single statement per line and the BEGIN and END blocks are optional.
I would like to read a file like this
1.23213213
0.12321321
-1.12321321
0.23232322
into a variable, or array to use it somewhere in the main {} code.
But I would like to use it if this file exists. How can I check if it already exists or not, and if not, then do not use that variable or array?
I don't understand completely what you want to achieve, but perhaps something like this can be useful to you:
It process the file line by line and saves each one in an array, the key is the line number so you keep the order. In the END section check how many lines were processed and get if the file had content.
awk '{ line[ FNR ] = $0 } END { if ( FNR > 0 ) { print "File" } else { print "NO file" } }' infile
EDIT to comment:
But in awk you can process many files from command line.
BEGIN {
...
}
## Processing of first file in command line.
FNR == NR {
a[ FNR ] = $0
next
}
## Processing of second file in command line
FNR < NR {
## Check if array 'a' has the values you want and use them
## 'for(...)variable += a[i]' or whatever.
}
Run script like:
awk -f script.awk first_file.txt second_file.txt
But if first_file.txt doesn't exists, awk will complain with an error.