looks like widechar input from getline in awk - awk

I'm having trouble with AWK that I've never seen before.
I'm reading in a file, no special chars, and printing it back out.
When I read a text file, it prints out with a NUL between every char.
Reading an HTML file works exactly as expected and prints out what was read in.
Code snippet:
while ((getline line < In) > 0) {
    print ":0:", line, ":0:" > "out";
}
reads the line "signature1"
and prints
":0: xFFxFEsNULiNULgNULnNULaNULtNULuNULrNULeNUL1NUL/r
NUL :0:/r/n"
as viewed in Notepad++.
"In" is the input filename.
I assume it is some Language setting on my machine, but I can't find anything.
A second print line, redirected to a file, prints every other line in Chinese.
TL;DR: Complete text of the app:
BEGIN { ProcessFile(); }
function ProcessFile() {
    In = "default.txt";
    Works = "NoProblem.html";
    Out = "quote.txt";
    RS = "/n";
    while ((getline textLine < In) > 0) {
        print "*0*", textLine, "*0:*" > "out.txt";
        print textLine > Out; # prints every other line in Chinese ???
    }
    close(In);
    close(Out);
}
Output of the second print line:
signature1
਍猀椀最渀愀琀甀爀攀㈀ഀഀ
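The bytes shown above (xFF xFE at the start, then a NUL after every character) match UTF-16LE text with a byte-order mark, which would also explain why pairs of bytes show up as Chinese characters. Assuming default.txt really is UTF-16 and that iconv is available, one possible workaround is to convert the file to UTF-8 before awk reads it. This is only a sketch: the function name and the sample file are made up for illustration.

```shell
# utf16_to_awk FILE: decode a UTF-16 file (BOM included) to UTF-8, then
# hand the text to awk on standard input. The function name is hypothetical.
utf16_to_awk() {
    iconv -f UTF-16 -t UTF-8 "$1" | awk '{ print "*0*", $0, "*0:*" }'
}

# Stand-in for default.txt: the bytes FF FE, then "sig" and a newline in UTF-16LE.
printf '\377\376s\000i\000g\000\n\000' > sample_utf16.txt
utf16_to_awk sample_utf16.txt
```

Saving the file as UTF-8 or ANSI from the editor should have the same effect without the extra step.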

Related

GAWK does not terminate after ENDFILE block with single file

I have a gawk script below that reads a protein FASTA file and prints only the records that have no X in their sequence and whose length is within a certain range. I wanted to modify the file in place, so I had the script write to a temporary file and then rename it to the original file. The BEGINFILE and ENDFILE constructs in gawk seemed convenient for this. However, for some reason, gawk does not exit after executing the code in ENDFILE, even when it is given a single file argument. It seems to jump back to another line of code and then just hang. Does anyone know what could cause this? The weird part is that this doesn't happen for every FASTA file, only a few, and I can't tell what is different between the ones that trigger the bug and the ones that don't.
#! /bin/gawk -f
function trim(s) {
    gsub(/^[ \t]+|[ \t]+$/, "", s)
    return s
}
function printFasta(header, seq, outfile, seq_line_max_chars) {
    print ">" header > outfile
    seq_line_max_chars = 80
    start = 1
    end = length(seq)
    while (start <= end) {
        print substr(seq, start, seq_line_max_chars) > outfile
        start += seq_line_max_chars
    }
}
BEGIN {
    min_prot_len = 400
}
BEGINFILE {
    tmp_out = FILENAME ".tmp"
}
/^>.+/ {
    headerStartIdx = index($0, ">") + 1
    header = trim(substr($0, headerStartIdx))
    getline
    sequence = ""
    while ($0 !~ /(^>.+)|(^[[:space:]]*$)/) {
        x_matched = match($0, "X")
        if (x_matched != 0) {
            next
        }
        gsub("*", "")
        sequence = sequence $0
        getline
    }
    if (length(sequence) >= min_prot_len) {
        printFasta(header, sequence, tmp_out)
    }
}
ENDFILE {
    print "move called"
    # system(("mv " tmp_out " " FILENAME))
}
I called the script with
$ ./filter_proteins.awk test.faa
When I run this, move called is printed and then it hangs. I tried stepping through with the debugger and I see that it reaches the ENDFILE block having processed all the lines in the input file, but when I type the next command, it jumps to the getline statement on line 44. After several iterations of next and print $0 it seems that the program is stuck reading the last line of the input file till the end of time. Perhaps this is a bug?
I am using GAWK 5.1.0
Edit
A minimal input file.
https://github.com/CuriousTim/pastebin/blob/main/mb.34.faa
When I run the script with only a few sequences, it works, but when I use the whole file, it hangs. I wasn't sure how to make a minimal example without providing the whole file.
We'll know for sure once you provide sample input to test with, but my money's on the call to getline within the loop reaching the end of the file. That makes the ENDFILE condition true while you're still inside the loop, and from then on every getline fails and leaves $0 unchanged, so the loop condition never becomes false.
Look:
$ cat file
foo
bar
$ cat tst.awk
{
    while (1) {
        print "about to execute getline"
        getline
        print "just executed getline:", $0
        if (++c == 5) {
            exit
        }
    }
}
ENDFILE {
    print "*** in ENDFILE ***"
}
$ awk -f tst.awk file
about to execute getline
just executed getline: bar
about to execute getline
*** in ENDFILE ***
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
Calling getline in a loop is not how you want to write an awk script, and the way your code calls it (without testing its return value) is the wrong way to call getline at any time. See http://awk.freeshell.org/AllAboutGetline.
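For comparison, here is a minimal sketch of a loop that does test getline's return value, so it stops at end of input instead of spinning. It assumes any POSIX awk (ENDFILE is not needed for the demonstration); the function name read_rest and the sample file are hypothetical.

```shell
# read_rest FILE: on the first record, read the remaining lines with getline
# and stop as soon as getline reports end of input (0) or an error (-1).
read_rest() {
    awk '
    NR == 1 {
        while ((getline line) > 0) {
            print "read:", line
        }
        print "loop ended"
    }
    ' "$1"
}

printf 'foo\nbar\n' > sample_getline.txt
read_rest sample_getline.txt
```

With the two-line sample this prints "read: bar" and then "loop ended", because the second getline returns 0 and the condition fails.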
I modified my script to remove the getline in case anyone finds it useful.
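#! /bin/gawk -f
function printFasta(header, seq, seq_line_max_chars) {
    print ">" header
    seq_line_max_chars = 80
    start = 1
    end = length(seq)
    while (start <= end) {
        print substr(seq, start, seq_line_max_chars)
        start += seq_line_max_chars
    }
}
BEGIN {
    FS = ">"
    min_prot_len = 400
    max_prot_len = 700
}
# Header line: print the previous record if it passed the filters,
# then start collecting the new one.
NF > 1 {
    if (header &&
        length(sequence) >= min_prot_len &&
        length(sequence) <= max_prot_len) {
        printFasta(header, sequence)
    }
    # Without these assignments header is never set and nothing is printed.
    header = substr($0, index($0, ">") + 1)
    sequence = ""
    next
}
# Sequence line: a record that contains an X is dropped by clearing its header.
{
    if (!header) {
        next
    } else {
        x_in_seq = match($0, "X")
        if (!x_in_seq) {
            gsub(/\*/, "")
            sequence = sequence $0
        } else {
            header = ""
        }
    }
}
# The last record has no following header line, so apply the same checks here.
END {
    if (header &&
        length(sequence) >= min_prot_len &&
        length(sequence) <= max_prot_len) {
        printFasta(header, sequence)
    }
}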

How can I store the length of a line into a var within awk script?

I have this simple awk script with which I attempt to check the number of characters in the first line.
If the first line has more or fewer than 10 characters, I want to store the number of characters in a variable.
Somehow the first print statement works, but storing that result in a variable doesn't.
Please help.
I tried removing the dollar sign, "thelength=(length($0))",
and removing the parentheses, "thelength=length($0)", but it doesn't print anything...
Thanks!
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
    if (NR==1)
        if(length($0)!=10)
        {
            print(length($0))
            thelength=$(length($0))
            print "The length of the first line is: ",$thelength;
            exit 1;
        }
}
END { print "STOP" }' $1
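Two issues, both from mixing ksh and awk syntax:
- $(...) is shell command substitution and is not needed to obtain the length. Inside the awk script, $(length($0)) is a field reference (the field whose number is the line length), which is typically empty; use thelength=length($0)
- awk variables do not take a leading $ when being referenced; use print ... ,thelength
So your code becomes: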
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
    if (NR==1)
        if(length($0)!=10)
        {
            print(length($0))
            thelength=length($0)
            print "The length of the first line is: ",thelength;
            exit 1;
        }
}
END { print "STOP" }' $1
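If the length is also needed on the shell side, command substitution belongs around the whole awk call rather than inside it. A small sketch, assuming a POSIX shell such as ksh; the function name first_line_length and the sample file are made up for illustration.

```shell
# first_line_length FILE: print the length of the first line, or nothing
# if that line is exactly 10 characters long. The function name is hypothetical.
first_line_length() {
    awk 'NR == 1 {
        thelength = length($0)      # plain awk assignment: no $ and no $(...)
        if (thelength != 10) print thelength
        exit
    }' "$1"
}

printf 'abc;def\nsecond line\n' > sample_len.txt
len=$(first_line_length sample_len.txt)   # shell command substitution goes out here
echo "The length of the first line is: $len"
```

Here the awk variable stays inside awk, and the shell variable len only receives what awk printed.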

How to remove space and the specific character in string - awk

Below is the input.
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
I am trying to produce the final result shown below. That is, I want to remove the spaces and the '{' and '}' characters, and check whether the next line is '>' or '<'.
In fact, the input above repeats. I also need to parse the '>' and '<' characters so that I can put the parsed string (YES or NO) into a database.
ID=34,ID=35#YES#NO
ID=99,ID=23#YES#NO
ID=18,ID=87#NO#YES
So I thought I could use the 'sub' function to replace the spaces with nothing, but the result shows:
1#YES#NO
Can you tell me what is wrong?
If possible, show me how to remove '{' and '}' as well.
I would appreciate an awk file version instead of a one-liner.
BEGIN {
    VALUES = ""
    L_EXIST = "NO"
    R_EXIST = "NO"
}
/!/ { VALUES = gsub(" ", "", $0);
    getline;
    if ($1 == ">") L_EXIST = "YES";
    else if ($1 == "<") R_EXIST = "YES";
    print VALUES"#"L_EXIST"#"R_EXIST
}
END {
}
Given your sample input:
$ cat file
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
This script produces the desired output:
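BEGIN { FS="[}{=, \n]+"; RS="!" }
NR > 1 { printf "ID=%d,ID=%d#%s\n", $3, $5, ($6==">"?"YES#NO":"NO#YES") }
The Field Separator is set to consume the spaces, newlines and other characters between the parts of the line that you're interested in. The newline has to be listed explicitly: with RS set to ! and a regular expression as FS, awk does not split fields on newlines by itself, and $6 would otherwise contain the surrounding newlines and never equal ">". The Record Separator is set to !, so that each pair of lines is treated as a single record.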
The first record is empty (the start of the first line, up to the first !), so we only process the ones after that. The output is constructed using printf, with a ternary to determine the last part (I assume that there are only two options, > or <).
Let's say you have this input:
input.txt
!{ID=34, ID2=35}
!{ID=36, ID2=37}
>
You can use the following awk command
awk -F'[!{}, ]' 'NR>1{yn="NO";if($1==">")yn="YES";print l"#"yn}{l=$3","$5}' input.txt
to produce this output:
ID=34,ID2=35#NO
ID=36,ID2=37#YES
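As for the 1#YES#NO in the question: gsub returns the number of substitutions it made and edits its target in place, so VALUES = gsub(" ", "", $0) stores the count (1, for the single space) rather than the text. A minimal sketch of a variant that keeps the edited text and avoids getline, assuming each "!" line is followed by its marker line; the function name parse_pairs is hypothetical, and the output keeps ID2 as it appears in the input.

```shell
# parse_pairs FILE: strip "!", "{", "}" and spaces from each "!" line, then
# print it with YES/NO flags once the following marker line is seen.
parse_pairs() {
    awk '
    /^!/ {
        values = $0
        # gsub edits values in place; its return value (a count) is ignored.
        gsub(/[!{} ]/, "", values)
        next
    }
    $1 == ">" { print values "#YES#NO" }
    $1 == "<" { print values "#NO#YES" }
    ' "$1"
}

printf '!{ID=34, ID2=35}\n>\n!{ID=99, ID2=23}\n>\n!{ID=18, ID2=87}\n<\n' > sample_pairs.txt
parse_pairs sample_pairs.txt
```

The awk program between the single quotes can be saved to its own file and run with awk -f if a script file is preferred over a one-liner.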

Endless recursion in gawk-script

Please pardon me in advance for posting such a big part of my problem, but I just can't put my finger on the part that fails...
I have input files like this (abas-FO if you care to know):
.fo U|xiininputfile = whatever
.type text U|xigibsgarnich
.assign U|xigibsgarnich
..
..Comment
.copy U|xigibswohl = Spaß
.ein "ow1/UWEDEFTEST.FOP"
.in "ow1/UWEINPUT2"
.continue BOTTOM
.read "SOemthing" U|xttmp
!BOTTOM
..
..
Now I want to recursively follow each .in[put]/.ein[gabe] statement, parse the mentioned file and, if I don't know it yet, add it to an array. My code looks like this:
#!/bin/awk -f
function getFopMap(inputregex, infile, mandantdir, infiles){
    while(getline f < infile){
        #printf "*"
        #don't match if there is a '
        if(f ~ inputregex "[^']"){
            #remove .input-part
            sub(inputregex, "", f)
            #trim right
            sub(/[[:blank:]]+$/, "", f)
            #remove leading and trailing "
            gsub(/(^\"|\"$)/,"" ,f)
            if(!(f in infiles)){
                infiles[f] = "found"
            }
        }
    }
    close(infile)
    for (i in infiles){
        if(infiles[i] == "found"){
            infiles[i] = "parsed"
            cmd = "test -f \"" i "\""
            if(system(cmd) == 0){
                close(cmd)
                getFopMap(inputregex, f, mandantdir, infiles)
            }
        }
    }
}
BEGIN{
    #Matches something like [.input myfile] or [.ein "ow1/myfile"]
    inputregex = "^\\.(in|ein)[^[:blank:]]*[[:blank:]]+"
    #Get absolute path of infile
    cmd = "python -c 'import os;print os.path.abspath(\"" ARGV[1] "\")'"
    cmd | getline rootfile
    close(cmd)
    infiles[rootfile] = "parsed"
    getFopMap(inputregex, rootfile, mandantdir, infiles)
    #output result
    for(infile in infiles) print infile
    exit
}
I call the script (in the same directory the paths are relative to) like this:
./script ow1/UWEDEFTEST.FOP
I get no output. It just hangs. If I uncomment the printf "*" command, I see stars without end.
I appreciate any help and hints on how to do it better.
My awk:
gawk Version 3.1.7
I don't know if it's your only problem, but you're calling getline incorrectly and will consequently go into an infinite loop in some scenarios. Make sure you fully understand all of the caveats at http://awk.info/?tip/getline; you might want to use the recursion example there as the starting point for your code.
The most important point for your code is that getline returns a negative value (-1) when it fails, for example when the file cannot be opened. With while(getline f < infile), a failing getline keeps returning a non-zero value, so the loop calls it again, it fails again, and the loop never ends. You need to use while ( (getline f < infile) > 0) instead.
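The difference is easy to see with a file that does not exist. The sketch below assumes a POSIX awk; the function name count_lines and the file names are hypothetical. Because the loop tests for > 0, it ends on both end of file (0) and error (-1).

```shell
# count_lines FILE: count the lines of FILE with a checked getline loop.
count_lines() {
    awk -v infile="$1" '
    BEGIN {
        n = 0
        # getline returns 1 per line read, 0 at end of file and -1 on error,
        # so only a result greater than 0 means a line was actually read.
        while ((getline f < infile) > 0) {
            n++
        }
        close(infile)
        print n
    }'
}

printf 'a\nb\n' > sample_count.txt
count_lines sample_count.txt      # two lines are read
count_lines ./no_such_file_here   # the loop ends at once instead of hanging
```

Dropping the "> 0" test and running the second call would reproduce the endless loop described above.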

Awk to escape quotation marks

So I have a file like
select * from tb where start_date = to_date('20131010','yyyymmdd');
p23 VARCHAR2(300):='something something
still part of something above with 'this' between single quotes and close
something to end';
(code goes on)
That is automatically generated code which I should be able to execute via sqlplus. But that obviously won't work, since the third line should have had its quotes escaped, like (..) with ''this'' between (...).
I can't access the script that generated that code, so I was trying to get awk to do the job. Notice the script has to be smart enough not to escape every quote in the code (the to_date('20131010','yyyymmdd') is correct).
I am no expert in awk, so I got as far as:
BEGIN {
    RS=";"
    FS="\n"
}
/\tp[0-9]+/{
    ini = match($0, "\tp[0-9]+")
    fim = match($0, ":='")
    s = substr($0,ini,fim+1)
    txt = substr($0, fim+3, length($0))
    block = substr(txt, 0, length(txt)-1)
    print gensub("'", "''", block)
}
!/\tp[0-9]+/{
    print $0";"
}
but it got way too messy with the print gensub("'", "''", block) and it is not working.
Can someone give me a quick way out?
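You have forgotten one parameter to gensub. Its arguments are gensub(regexp, replacement, how, target), and your call leaves out the third one; pass "g" there to replace every match. Try: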
BEGIN {
    RS=";"
    FS="\n"
}
/^[[:space:]]+p[0-9]+/{
    ini = match($0, "\tp[0-9]+")
    fim = match($0, ":='")
    s = substr($0,ini,fim+1)
    txt = substr($0, fim+3, length($0))
    block = substr(txt, 0, length(txt)-1)
    printf "%s'%s';", s, gensub("'", "''", "g",block)
    next
}
{
    printf "%s;", $0
}