Endless recursion in gawk-script - awk

Please pardon me in advance for posting such a big part of my problem, but I just can't put my finger on the part that fails...
I got input-files like this (abas-FO if you care to know):
.fo U|xiininputfile = whatever
.type text U|xigibsgarnich
.assign U|xigibsgarnich
..
..Comment
.copy U|xigibswohl = Spaß
.ein "ow1/UWEDEFTEST.FOP"
.in "ow1/UWEINPUT2"
.continue BOTTOM
.read "SOemthing" U|xttmp
!BOTTOM
..
..
Now I want to recursively follow each .in[put]/.ein[gabe] statement, parse the mentioned file and, if I don't know it yet, add it to an array. My code looks like this:
#!/bin/awk -f
function getFopMap(inputregex, infile, mandantdir, infiles){
    while(getline f < infile){
        #printf "*"
        #don't match if there is a '
        if(f ~ inputregex "[^']"){
            #remove .input-part
            sub(inputregex, "", f)
            #trim right
            sub(/[[:blank:]]+$/, "", f)
            #remove leading and trailing "
            gsub(/(^\"|\"$)/,"" ,f)
            if(!(f in infiles)){
                infiles[f] = "found"
            }
        }
    }
    close(infile)
    for (i in infiles){
        if(infiles[i] == "found"){
            infiles[i] = "parsed"
            cmd = "test -f \"" i "\""
            if(system(cmd) == 0){
                close(cmd)
                getFopMap(inputregex, f, mandantdir, infiles)
            }
        }
    }
}
BEGIN{
    #Matches something like [.input myfile] or [.ein "ow1/myfile"]
    inputregex = "^\\.(in|ein)[^[:blank:]]*[[:blank:]]+"
    #Get absolute path of infile
    cmd = "python -c 'import os;print os.path.abspath(\"" ARGV[1] "\")'"
    cmd | getline rootfile
    close(cmd)
    infiles[rootfile] = "parsed"
    getFopMap(inputregex, rootfile, mandantdir, infiles)
    #output result
    for(infile in infiles) print infile
    exit
}
I call the script (in the same directory the paths are relative to) like this:
./script ow1/UWEDEFTEST.FOP
I get no output. It just hangs. If I remove the comment before the printf "*" command, I see stars without end.
I'd appreciate any help and hints on how to do this better.
My awk:
gawk Version 3.1.7

I don't know if it's your only problem, but you're calling getline incorrectly and consequently will go into an infinite loop in some scenarios. Make sure you fully understand all of the caveats at http://awk.info/?tip/getline and you might want to use the recursion example there as the starting point for your code.
The most important item initially for your code is that when getline fails it can return a negative value, so while(getline f < infile) will create an infinite loop: the failing getline keeps returning a non-zero value, so it keeps being called and keeps failing. You need to use while ( (getline f < infile) > 0 ) instead.
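For what it's worth, here is a minimal sketch of the same recursive descent with the return value checked. The function name collect, the seen array and the simplified regex handling are only illustrative, not a drop-in replacement for the script above:

#!/bin/awk -f
# Sketch: recursively collect the files named on ".in"/".ein" lines.
function collect(file, seen,    line, rc) {
    while ( (rc = (getline line < file)) > 0 ) {            # 1 = a line was read
        if (line ~ /^\.(in|ein)/) {
            sub(/^\.[^[:blank:]]*[[:blank:]]+/, "", line)   # strip the .in/.ein keyword
            gsub(/^"|"$/, "", line)                         # strip surrounding quotes
            if (!(line in seen)) {
                seen[line] = 1
                collect(line, seen)                         # recurse into the named file
            }
        }
    }
    if (rc < 0)                                             # -1 = file could not be read
        print "cannot read " file > "/dev/stderr"
    close(file)
}
BEGIN {
    seen[ARGV[1]] = 1
    collect(ARGV[1], seen)
    for (f in seen) print f
}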

Related

GAWK does not terminate after ENDFILE block with single file

I have a gawk script below that reads a protein FASTA file and only prints the records that don't have an X in their sequence and are within a certain length range. I wanted to modify the file in place, so I had the script write to a temporary file and then rename it to the original file. The BEGINFILE and ENDFILE constructs in gawk seemed convenient for this. However, for some reason, gawk does not exit after executing the code in the ENDFILE block, even when it is given a single file argument. It seems to jump back to another line of code and then just hang. Does anyone know what could cause this to happen? The weird part is that this doesn't happen for every FASTA file, only a few, and I can't tell what is different between the ones that trigger the bug and the ones that don't.
#! /bin/gawk -f
function trim(s) {
    gsub(/^[ \t]+|[ \t]+$/, "", s)
    return s
}
function printFasta(header, seq, outfile, seq_line_max_chars) {
    print ">" header > outfile
    seq_line_max_chars = 80
    start = 1
    end = length(seq)
    while (start <= end) {
        print substr(seq, start, seq_line_max_chars) > outfile
        start += seq_line_max_chars
    }
}
BEGIN {
    min_prot_len = 400
}
BEGINFILE {
    tmp_out = FILENAME ".tmp"
}
/^>.+/ {
    headerStartIdx = index($0, ">") + 1
    header = trim(substr($0, headerStartIdx))
    getline
    sequence = ""
    while ($0 !~ /(^>.+)|(^[[:space:]]*$)/) {
        x_matched = match($0, "X")
        if (x_matched != 0) {
            next
        }
        gsub("*", "")
        sequence = sequence $0
        getline
    }
    if (length(sequence) >= min_prot_len) {
        printFasta(header, sequence, tmp_out)
    }
}
ENDFILE {
    print "move called"
    # system(("mv " tmp_out " " FILENAME))
}
I called the script with
$ ./filter_proteins.awk test.faa
When I run this, move called is printed and then it hangs. I tried stepping through with the debugger and I see that it reaches the ENDFILE block having processed all the lines in the input file, but when I type the next command, it jumps to the getline statement on line 44. After several iterations of next and print $0 it seems that the program is stuck reading the last line of the input file till the end of time. Perhaps this is a bug?
I am using GAWK 5.1.0
Edit
A minimal input file.
https://github.com/CuriousTim/pastebin/blob/main/mb.34.faa
When I run the script with only a few sequences, it works, but when I use the whole file, it hangs. I wasn't sure how to make a minimal example without providing the whole file.
We'll know for sure after you provide sample input to test with, but my money's on the call to getline within the loop reaching the end of the file and so triggering the ENDFILE condition to be true while you're still in the loop.
Look:
$ cat file
foo
bar
$ cat tst.awk
{
    while (1) {
        print "about to execute getline"
        getline
        print "just executed getline:", $0
        if (++c == 5) {
            exit
        }
    }
}
ENDFILE {
    print "*** in ENDFILE ***"
}
$ awk -f tst.awk file
about to execute getline
just executed getline: bar
about to execute getline
*** in ENDFILE ***
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
Calling getline in a loop is not how you want to write an awk script, and what you have in your code is the wrong syntax to use when calling getline at any time - see http://awk.freeshell.org/AllAboutGetline.
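For reference, the safe forms always test getline's return value; a tiny generic sketch (the file name and command here are placeholders, unrelated to the FASTA script):

BEGIN {
    # reading from an explicit file: returns 1 per line, 0 at EOF, -1 on error
    while ( (getline line < "some.file") > 0 ) {
        print "from file:", line
    }
    close("some.file")

    # reading from a command
    cmd = "date"
    while ( (cmd | getline line) > 0 ) {
        print "from cmd:", line
    }
    close(cmd)
}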
I modified my script to remove the getline in case anyone finds it useful.
#! /bin/gawk -f
function printFasta(header, seq, seq_line_max_chars) {
    print ">" header
    seq_line_max_chars = 80
    start = 1
    end = length(seq)
    while (start <= end) {
        print substr(seq, start, seq_line_max_chars)
        start += seq_line_max_chars
    }
}
BEGIN {
    FS = ">"
    min_prot_len = 400
    max_prot_len = 700
}
NF > 1 {
    # header line: flush the previous record if it passed the filters,
    # then start collecting the next one
    if (header && sequence &&
        length(sequence) >= min_prot_len &&
        length(sequence) <= max_prot_len) {
        printFasta(header, sequence)
    }
    header = $2
    sequence = ""
    next
}
{
    if (!header) {
        next
    } else {
        x_in_seq = match($0, "X")
        if (!x_in_seq) {
            gsub(/\*/, "")
            sequence = sequence $0
        } else {
            # sequence contains an X: drop the whole record
            header = ""
        }
    }
}
END {
    # flush the last record
    if (header && sequence &&
        length(sequence) >= min_prot_len &&
        length(sequence) <= max_prot_len) {
        printFasta(header, sequence)
    }
}
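Since this version prints to standard output rather than writing a temp file per input, the in-place replacement from the original question would now be done in the shell; for example (the file names here are only placeholders):

./filter_proteins.awk test.faa > test.faa.tmp && mv test.faa.tmp test.faa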

Else syntax error when nesting array formula

I am receiving a syntax error on "else" for this script:
{for (i=8;i<=NF;i+=3)
{if ($0~"=>") # if-else statement designed to flag file / directory transfers
print "=> flag,"$1"," $2","$3","$4 ","$5","$6","$7"," $(i)","$(i+1)","$(i+2);
{split ($(i+2), array, "/");
for (x in array)
{j++;
a[j] =j;
printf (array[x] ",");}
printf ("%s\n", "");}
else
print "no => flag,"$1"," $2","$3","$4 ","$5","$6","$7"," $(i)","$(i+1)","$(i+2)
}
}
Can't figure out why. If I delete the array block (starting with split()), all is well. But I need to scan the contents of $(i+2), so cutting it does me no good.
Also, if anyone has guidance on a good list of how to interpret error messages, that would be great.
Thanks for your advice.
EDIT: here is the above script laid out with sensible formatting:
{
    for (i=8;i<=NF;i+=3) {
        if ($0~"=>") # if-else statement designed to flag file / directory transfers
            print "=> flag,"$1"," $2","$3","$4 ","$5","$6","$7"," $(i)","$(i+1)","$(i+2);
        {
            split ($(i+2), array, "/");
            for (x in array) {
                j++;
                a[j] =j;
                printf (array[x] ",");
            }
            printf ("%s\n", "");
        }
        else
            print "no => flag,"$1"," $2","$3","$4 ","$5","$6","$7"," $(i)","$(i+1)","$(i+2)
    }
}
First things first: since you didn't post any samples of input and expected output, I couldn't test this at all. Could you please try the following? I hope you are running this as an awk script. Also, these are mostly syntax/cosmetic changes, NOT changes to the logic, since no background was given on the problem.
BEGIN{
    OFS=","
}
{
    for (i=8;i<=NF;i+=3){
        if ($0~/=>/){
            print "=> flag,"$1,$2,$3,$4,$5,$6,$7,$(i),$(i+1),$(i+2)
            split ($(i+2), array, "/");
            for(x in array){
                j++;
                a[j] =j;
                printf (array[x] ",")
            }
            printf ("%s\n", "")
        }
        else{
            print "no => flag",$1,$2,$3,$4,$5,$6,$7,$(i),$(i+1),$(i+2)
        }
    }
}
Problems fixed in OP's attempt:
Opening curly braces (which mark the start of a multi-statement body for an if condition or a for loop) belong at the end of the line that opens the block, not on a line of their own. More importantly, in your attempt the if branch consisted only of the first print statement; the { ... } block that follows it is a separate statement, not part of the if, so when awk reaches the else there is no longer an open if to attach it to - that is the syntax error. Wrapping everything that belongs to the if branch in a single { ... } block, as above, fixes it.
Since you are matching $0 against a regexp, I changed $0~"=>" to $0~/=>/.
Added a BEGIN section where OFS (the output field separator) is set to , so that you do not need to print "," between variables; a plain , between them does the trick (see the one-liner after this list).
Fixed the indentation, so there is no confusion about where each loop/condition closes.
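As a quick illustration of the OFS point (the input here is just a made-up sample):
$ echo "a b c" | awk 'BEGIN{OFS=","} {print $1, $2, $3}'
a,b,c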

Removing Quote From Field For Filename Using AWK

I've been playing around with this for an hour trying to work out how to embed the removal of quotes from a specific field using AWK.
Basically, the file encapsulates text in quotes, but I want to use the second field to name the file and split them based on the first field.
ID,Name,Value1,Value2,Value3
1,"AAA","DEF",1,2
1,"AAA","GGG",7,9
2,"BBB","DEF",1,2
2,"BBB","DEF",9,0
3,"CCC","AAA",1,1
What I want to get out are three files, all with the header row named:
AAA [1].csv
BBB [2].csv
CCC [3].csv
I have got it all working, except for the fact that I can't for the life of me work out how to remove the quotes around the filename!!
So, this command does everything, except that the file is named with quotes around $2; I need to do some kind of transformation on $2 before it goes into evname. In the actual file contents, though, I want to keep the encapsulating quotes.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=$2" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
I've tried to push a gsub into this, but I'm struggling to work out exactly how it should look.
This is, I think, as close as I have got, but it just uses "2" in place of $2 in the filename. I'm not sure if this means I need to escape $2 somehow in the gsub, but trying that doesn't seem to work either, so I'm at a loss as to what I'm doing wrong.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=gsub(""\","", $2)" - Event ID ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
Any help greatly appreciated.
Thanks in advance!!
Gannon
If I understand what you are attempting correctly, then
awk -F, 'NR==1{h=$0;next}!($1 in files){gsub(/"/, "", $2); evname=$2" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
should work. That is
NR == 1 {
    h = $0;
    next
}
!($1 in files) {
    stub = $2             # <-- this is the new bit: make a working copy
                          #     of $2 (so that $2 is unchanged and the line
                          #     is not rebuilt with changes for printing),
    gsub(/"/, "", stub)   # remove the quotes from it, and
    evname = stub " [" $1 "].csv"   # use it to assemble the filename.
    files[$1] = 1;
    print h > evname
}
{
    print > evname
}
You can, of course, use
evname = stub " - Event ID [" $1 "].csv"
or any other format after the substitution (this one seems to be what you tried to get in your second code snippet).
The gsub function returns the number of substitutions made, not the result of the substitutions; that is why evname=gsub(""\","", $2)" - Event ID ["$1"].csv" does not work.
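A quick illustration of the difference (the input is just a made-up sample; the second line is gawk-only since it uses gensub):
$ echo '"AAA",x' | awk -F, '{ n = gsub(/"/, "", $1); print n, $1 }'
2 AAA
$ echo '"AAA",x' | gawk -F, '{ print gensub(/"/, "", "g", $1) }'
AAA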
Things are always clearer with a little white space:
awk -F, '
NR==1 { hdr=$0; next }
!seen[$1]++ {
    evname = $2
    gsub(/"/,"",evname)
    outfile = evname " [" $1 "].csv"
    print hdr > outfile
}
{ print > outfile }
' DataExtract.csv
Aside: It's pretty unusual for someone to WANT to create files with spaces in their names given the complexity that introduces in any later scripts you write to process them. You sure you want to do that?
P.S. here's the gawk version as suggested by #JID below
awk -F, '
NR==1 { hdr=$0; next }
!seen[$1]++ {
    outfile = gensub(/"/,"","g",$2) " [" $1 "].csv"
    print hdr > outfile
}
{ print > outfile }
' DataExtract.csv
Apply the gsub before you make the assignment:
awk -F, 'NR==1{h=$0;next}
!($1 in files){
    gsub("\"","",$2); # Add this line
    evname=$2" ["$1"].csv";files[$1]=1;print...

Awk to escape quotation marks

So I have a file like
select * from tb where start_date = to_date('20131010','yyyymmdd');
p23 VARCHAR2(300):='something something
still part of something above with 'this' between single quotes and close
something to end';
(code goes on)
That would be some automatically generated code which I should be able to execute via sqlplus. But that obviously won't work, since the third line should have had its quotes escaped, like (..) with ''this'' between (...).
I can't access the script that generated that code, but I was trying to get awk to do the job. Notice that the script has to be smart enough not to escape every quote in the code (the to_date('20131010','yyyymmdd') is correct).
I am no expert in awk, so I went as far as:
BEGIN {
    RS=";"
    FS="\n"
}
/\tp[0-9]+/{
    ini = match($0, "\tp[0-9]+")
    fim = match($0, ":='")
    s = substr($0,ini,fim+1)
    txt = substr($0, fim+3, length($0))
    block = substr(txt, 0, length(txt)-1)
    print gensub("'", "''", block)
}
!/\tp[0-9]+/{
    print $0";"
}
but it got way too messy around the print gensub("'", "''", block) call, and it is not working.
Can someone give me a quick way out?
You have forgotten one parameter to gensub. Try:
BEGIN {
    RS=";"
    FS="\n"
}
/^[[:space:]]+p[0-9]+/{
    ini = match($0, "\tp[0-9]+")
    fim = match($0, ":='")
    s = substr($0,ini,fim+1)
    txt = substr($0, fim+3, length($0))
    block = substr(txt, 0, length(txt)-1)
    printf "%s'%s';", s, gensub("'", "''", "g", block)
    next
}
{
    printf "%s;", $0
}
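For reference, gawk's gensub takes the scope of the replacement as its third argument ("g" for all matches, or a number for the nth match) and returns the modified string instead of editing the target in place; a small illustrative one-liner with made-up input (the -v q trick just avoids shell quoting headaches):
$ gawk -v q="'" 'BEGIN { s = "it" q "s ok"; print gensub(q, q q, "g", s); print s }'
it''s ok
it's ok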

Using a variable defined inside AWK

I got this piece of script working. This is what I wanted:
input
3.76023 0.783649 0.307724 8766.26
3.76022 0.764265 0.307646 8777.46
3.7602 0.733251 0.30752 8821.29
3.76021 0.752635 0.307598 8783.33
3.76023 0.79528 0.307771 8729.82
3.76024 0.814664 0.307849 8650.2
3.76026 0.845679 0.307978 8802.97
3.76025 0.826293 0.307897 8690.43
with script
#!/bin/bash
awk -F ', ' '
{
    for (i=3; i<=10; i++) {
        if (i==NR) {
            npc1[i]=sprintf("%s", $1);
            npc2[i]=sprintf("%s", $2);
            npc3[i]=sprintf("%s", $3);
            npRs[i]=sprintf("%s", $4);
            print npc1[i],npc2[i],\
                  npc3[i], npc4[i];
        }
    }
} ' p_walls.raw
echo "${npc1[100]}"
But now I can't use those arrays, like npc1[i], outside of awk. That last echo prints nothing. Isn't it possible, or am I missing something?
awk runs as a separate process; after it finishes, all of its internal data is gone. This is true for all external processes/commands. Bash only sees what bash builtins touch.
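If the goal is to use awk's results back in bash, the usual route is to have awk print them and capture that output in the shell; a minimal hedged sketch (the column choice and variable name are just illustrative, reusing the OP's p_walls.raw):

#!/bin/bash
# capture awk's printed output into a bash array, one element per output line
mapfile -t npc1 < <(awk -F ', ' '{ print $1 }' p_walls.raw)
echo "${npc1[0]}"    # first captured value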
i is never 100, so why do you want to access npc1[100]?
What are you really trying to do? If you rewrite the question we might be able to help...
(Cherry on the cake is always good!)
Sorry, but all of #yi_H's answer and comments above are correct.
But there's really no problem loading 2 sets of data into 2 separate arrays in awk, ie.
awk '{
    if (FILENAME == "file1") arr1[i++]=$0 ;
    # same for file2
}
END {
    f1max=++i; f2max=++j;
    for (i=1;i<f1max;i++) {
        arr1[i]
        # put what you need here for arr1 processing
        #
        # dont forget that you can do things like
        if (arr1[i] in arr2) { print arr1[i]"=arr2[arr1["i"]=" arr2[arr1[i]] }
    }
    for (j=1;j<f2max;j++) {
        arr2[j]
        # and here for arr2
    }
}' file1 file2
You'll have to fill the actual processing for arr1[i] and arr2[j].
Also, get an awk book for the weekend and be up and running by Monday. It's easy. You can probably figure it out from grymoire.com/Unix/awk.html
I hope this helps.