awk illegal statement at source line 1

I am executing the following awk command:
awk -F'\t' '{ split($4,array,"[- ]"); print > array[1]""array[2]""array[3]}' myFile.txt
but seeing this error:
awk: syntax error at source line 1
context is
{ split($4,array,"[- ]"); print > >>> array[1]"" <<<
awk: illegal statement at source line 1
awk: illegal statement at source line 1
What can be the reason for that? How to fix the script?

Those pairs of double quotes are doing nothing, you could just remove them:
awk -F'\t' '{ split($4,array,"[- ]"); print > array[1] array[2] array[3]}' myFile.txt
An unparenthesized expression on the right side of input or output redirection is undefined behavior per POSIX which is why some awks (e.g. gawk) will interpret your code as you intended:
awk -F'\t' '{ split($4,array,"[- ]"); print > (array[1] array[2] array[3])}' myFile.txt
while others can interpret it as:
awk -F'\t' '{ split($4,array,"[- ]"); (print > array[1]) (array[2] array[3])}' myFile.txt
which is a syntax error in any awk, or as anything else, since the behavior is undefined.
You can fix your syntax error by adding the parens:
awk -F'\t' '{ split($4,array,"[- ]"); print > (array[1] array[2] array[3])}' myFile.txt
but that could have other problems too and the right way to do what you're trying to do depends on whatever it is you're trying to do, which we can't tell just from your code. If you post a new question with sample input and expected output then we can help you write your code the right way.
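As a small self-contained check of the parenthesized form, here is a sketch run on a made-up one-line input (the sample data and the array name parts are only for illustration, not from the question):

```shell
# Made-up sample: field 4 holds a value with "-" and " " separators.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\tb\tc\t2024-01 15\tx\n' > myFile.txt

# The parentheses make the whole concatenation the redirection target,
# so the record is written to a file named after the joined pieces.
awk -F'\t' '{
    split($4, parts, /[- ]/)
    print > (parts[1] parts[2] parts[3])
}' myFile.txt

ls
```

With this input the three pieces are 2024, 01 and 15, so the line should end up in a file called 20240115.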

You need
print > (array[1]""array[2]""array[3])
in many implementations of awk. Note the parentheses around the expression that generates the filename.
You might want to close the file afterwards too, in case there are a lot of possible filenames, and append instead of overwriting:
awk -F'\t' '{ split($4,array,"[- ]")
file = array[1] "" array[2] "" array[3]
print >> file
close(file)
}' myFile.txt
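One thing to keep in mind with >> is that a second run of the script adds to whatever the first run left behind. If you want each run to start clean, a possible variant (only a sketch; the seen array and the sample data are my own additions) empties each file the first time its name comes up and appends after that:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Two made-up records that map to the same output file.
printf 'a\tb\tc\t2024-01 15\tfirst\na\tb\tc\t2024-01 15\tsecond\n' > myFile.txt
# Simulate leftovers from an earlier run.
echo 'stale line from an earlier run' > 20240115

awk -F'\t' '{
    split($4, parts, /[- ]/)
    file = parts[1] parts[2] parts[3]
    if (!(file in seen)) {   # first time this name appears in this run
        printf "" > file     # open once with ">" to empty the file
        close(file)
        seen[file]           # remember the name
    }
    print >> file            # append the record
    close(file)              # release the descriptor right away
}' myFile.txt

cat 20240115
```

Closing after every record is slow on big inputs, so this trades speed for never running out of open files.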

Here's an awk-based solution, verified on 4 awk variants, that requires no array splitting and also closes files along the way.
The pristine $0 is saved first, so performing ++NF with an empty OFS does not truncate the data
(as a matter of fact, saving $0 is only necessary for gawk and nawk).
SETUP and INPUT
removed 'wyx-8979479BCCF-;#%&*[)(]~'
zsh: no matches found: wyx*
1 --------INPUT------------
2 bca 0106 qsr wyx-8979479BCCF-=;#%&*[)(]~ testtail
CODE
{m,n,g}awk '
BEGIN {
OFS = _
FS = "^[^\t]*\t[^\t]*\t[^\t]*\t|[ -]|\t[^\t]*$"
} {
___ = $(_*(__==_?_:close(__)))
print(___) > (__ = $!++NF) }'
# mawk-1/2 specific streamlining
mawk 'BEGIN { FS="^[^\t]*\t[^\t]*\t[^\t]*\t|[ -]|\t[^\t]*$"(OFS=_)
} { print $(_*(__==_?_:close(__))) > (__ = $!++NF) }'
OUTPUT
-rw-r--r-- 1 501 20 50 Jun 19 12:13 wyx8979479BCCF=;#%&*[)(]~
1 bca 0106 qsr wyx-8979479BCCF-=;#%&*[)(]~ testtail

Related

assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk; you can do it more briefly with GNU awk if you have it:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
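If you would rather keep the loop from the question, a sketch along these lines should also work. Three things change: awk has no echo, so print is used; the test is anchored with ^ so only a leading R disqualifies a code; and exit stops at the first hit. An awk variable like primdiag is not visible to the shell, so the value is captured with command substitution instead.

```shell
principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'

primdiag=$(printf '%s\n' "$principal_diagnosis0" |
    awk -F'|' '{
        for (i = 1; i <= NF; i++)
            if ($i !~ /^R/) {   # anchored: only codes starting with R are skipped
                print $i
                exit            # stop after the first match
            }
    }')

echo "$primdiag"
```

For the sample value this prints A41.9.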
you can make FS and OFS do all the hard work :
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
another slightly different variation of the same concept — overwriting fields but leaving OFS as is :
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
this works, because when you break it out :
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2,
but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it
…… which means you can also make it $0 = $2, as such :
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9
——————————————————————————————————————————
a 3rd variation, this time using RS instead of FS :
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead :
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How to correctly use $i inside the {print }?
The following single awk command may help here too:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > "file"i;close("file"i)}}' Input_file
An all awk solution. First test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If there are more fields than you want to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to keep the approach you were trying:
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note:
the -v switch of awk passes a shell value in as an awk variable
$x is the column whose number is held in your variable x
Note 2: this is not the fastest solution; a single awk call is faster, but I am just correcting your logic. Ideally, take time to understand awk; it is never wasted time.
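To see why the original loop failed: inside single quotes the shell never expands $i, so awk looks for an awk variable named i, which is never set. Here is a small runnable sketch of the -v approach on the two-line sample from the other answer (the variable name col is just my choice, and the loop list is written out because {2..99} is not available in every shell):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '11 12 13\n21 22 23\n' > input.txt

for i in 2 3; do
    # -v copies the shell value of i into the awk variable col,
    # so $col means "the field whose number is in col".
    awk -v col="$i" '{ print $1, $col }' input.txt > "data$i"
done

cat data3
```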

awk to output two files based on match or no match

In the awk below I am trying to split the tab-delimited input on whether $2 is the string FP or RFP. The file result should get only the lines that do not have those keywords in $2, while a second file, removed, gets the lines that do. The awk has a syntax error when I try to print two files; if I only print one, the awk runs. Thank you :).
input
12 aaa
123 FP bbb
11 ccc
10 RFP ddd
result
12 aaa
11 ccc
removed
123 FP bbb
10 RFP ddd
awk
awk -F'\t' 'BEGIN{d["FP"];d["RFP"]}!($2 in d) {print > "removed"}; else {print > "result"}' file
awk: cmd. line:1: BEGIN{d["FP"];d["RFP"]}!($2 in d) {print > "removed"}; else {print > "result"}
awk: cmd. line:1: ^ syntax error
else goes with if. Your script didn't have an if, just an else, hence the syntax error. All you need is:
awk -F'\t' '{print > ($2 ~ /^R?FP$/ ? "removed" : "result")}' file
or if you prefer the array approach you are trying to use:
awk -F'\t' '
BEGIN{ split("FP RFP",t,/ /); for (i in t) d[t[i]] }
{ print > ($2 in d ? "removed" : "result") }
' file
Read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn awk syntax and semantics.
Btw when writing if/else code like you show in your question:
if ( !($2 in d) ) removed; else result
THINK about the fact you're using negative (!) logic which makes your code harder to understand right away AND opens you up to potential double negatives. Always try to express every condition in a positive way, in this case that'd be:
if ($2 in d) result; else removed
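Written out as real awk with the positive test first, and sending the matching lines to removed as the expected output in the question asks, a sketch could look like this (the sample input is retyped from the question with tabs between fields):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '12\taaa\n123\tFP\tbbb\n11\tccc\n10\tRFP\tddd\n' > file

awk -F'\t' '
BEGIN { d["FP"]; d["RFP"] }    # keywords to look for in $2
{
    if ($2 in d)
        print > "removed"      # $2 is one of the keywords
    else
        print > "result"       # everything else
}' file

cat removed
```

The if sits inside an action block, which is what gives the else something to attach to.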

How to do calculations over lines of a file in awk

I've got a file that looks like this:
88.3055
45.1482
37.7202
37.4035
53.777
What I have to do is isolate the value from the first line and divide it by the values of the other lines (it's a speedup calculation). I thought of maybe storing the first line in a variable (using NR) and then iterate over the other lines to obtain the values from the divisions. Desired output is:
1,9559
2,3410
2,3608
1,6420
UPDATE
Sorry Ed, my mistake, the desired decimal point is .
I made some small changes to Ed's answer so that awk prints the division of 88.3055 by itself and outputs it to a file speedup.dat:
awk 'NR==1{n=$0} {print n/$0}' tavg.dat > speedup.dat
Is it possible to combine the contents of speedup.dat and the results from another awk command without using intermediate files and in one single awk command?
First command:
awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } /Total processes/ { if (! CP) CP = $2 } END {print CP}' cg.B.".n.".log ".(n == 1 ? ">" : ">>")." processes.dat
This first command outputs:
1
2
4
8
16
Paste of the two files:
paste processes.dat speedup.dat > prsp.dat
which gives the now desired output:
1 1
2 1.9559
4 2.34107
8 2.36089
16 1.64207
$ awk 'NR==1{n=$0;next} {print n/$0}' file
1.9559
2.34107
2.36089
1.64207
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", n/$0}' file
1.9559
2.3411
2.3609
1.6421
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", int(n*10000/$0)/10000}' file
1.9559
2.3410
2.3608
1.6420
$ awk 'NR==1{n=$0;next} {x=sprintf("%.4f",int(n*10000/$0)/10000); sub(/\./,",",x); print x}' file
1,9559
2,3410
2,3608
1,6420
Normally you'd just use the correct locale to have . or , as your decimal point but your input uses . while your output uses , so I don't think that's an option.
awk '{if(n=="") n=$1; else print n/$1}' inputFile
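On the follow-up about combining the two results: one possible way to drop speedup.dat and the paste step is to let a single awk read both files, using the usual NR==FNR test to tell the first file from the second. This is only a sketch and it assumes processes.dat already exists with one count per line, in the same order as the timings; the names procs and base are mine.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Sample data copied from the question.
printf '88.3055\n45.1482\n37.7202\n37.4035\n53.777\n' > tavg.dat
printf '1\n2\n4\n8\n16\n' > processes.dat

awk '
NR == FNR { procs[FNR] = $0; next }   # first file: store process counts by line number
FNR == 1  { base = $0 }               # second file: first timing is the baseline
{ print procs[FNR], base / $0 }       # pair each count with its speedup
' processes.dat tavg.dat > prsp.dat

cat prsp.dat
```

The columns come out separated by a space rather than the tab that paste would use; set OFS to "\t" if that matters.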

gawk FS to split record into individual characters

If the field separator is the empty string, each character becomes a separate field
$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o
However, if FS is a regex that can possibly match zero times, the same behaviour does not occur:
$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?
I'm most interested in understanding why the 2nd case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +"
Interesting question!
I just pulled the gnu-awk 4.1.0 source code; I think the answer can be found in the file field.c.
line 371:
* re_parse_field --- parse fields using a regexp.
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a regular
* expression -- either user-defined or because RS=="" and FS==" "
*/
static long
re_parse_field(lo...
also this line: (line 425):
if (REEND(rp, scan) == RESTART(rp, scan)) { /* null match */
This is the case that <space>* hits in your question. The implementation doesn't increment nf there, that is, it treats the whole line as one single field. Note that this function is used by do_split() too.
First, if FS is the null string, gawk separates each character into its own field. gawk's documentation states this clearly, and we can also see it in the code:
line 613:
* null_parse_field --- each character is a separate field
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is the null string.
*/
static long
null_parse_field(long up_to,
If FS is a single character, awk won't treat it as a regex. This is mentioned in the documentation too, and it also shows in the code:
line 667:
* sc_parse_field --- single character field separator
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a single character
* other than space.
*/
static long
sc_parse_field(l
If we read the function, no regex matching is done there.
In the comments of re_parse_field() and sc_parse_field(), we see that do_split() invokes them too. That explains why we get 1 from the following command instead of 3:
kent$ echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1
Note: to keep the post from getting too long, I didn't paste the complete code here; it can be found at:
http://git.savannah.gnu.org/cgit/gawk.git/
As was mentioned, an empty field separator generates undefined behavior; the same code will give different results on different platforms / flavors of awk. For example (all Mac OSX 10.8.5):
> echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
awk: field separator FS is empty
1,hello
So awk complains, but keeps going.
Let's look at some other examples:
> echo hello | awk -F '.' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
A . by itself is not considered a regular expression
> echo hello | awk -F '[.]' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Still nothing
> echo hello | awk -F '.?' -v OFS=, '{$1 = NF OFS $1} 1'
6,,,,,,
Now we have something like a regex: .? is "zero or one character". It is expanded to one character (which is consumed), so the output is "a whole lot of nothings"
> echo hello | awk -F '*' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Not a regular expression
> echo hello | awk -F '.*' -v OFS=, '{$1 = NF OFS $1} 1'
2,,
A regular expression that consumes the entire thing
> echo hello | awk -F 'l' -v OFS=, '{$1 = NF OFS $1} 1'
3,he,,o
Match the letter l twice - two empty strings
> echo hello | awk -F 'ell' -v OFS=, '{$1 = NF OFS $1} 1'
2,h,o
Match all of ell at once
> echo hello | awk -F '.?|' -v OFS=, '{$1 = NF OFS $1} 1'
awk: illegal primary in regular expression .?| at
input record number 1, file
source line number 1
Attempt to be clever: sometimes an | with empty string on one side will match "anything" but awk's regex engine doesn't like it.
Conclusion - the regular expressions cannot match "empty", and whatever is matched is consumed. Attempts to use (?:.) or even (?=.) generate errors.
It seems to be a special case in gawk.
Traditionally, the behavior of FS equal to "" was not defined. In this
case, most versions of Unix awk simply treat the entire record as only
having one field. (d.c.) In compatibility mode (see Options), if FS is
the null string, then gawk also behaves this way.
What POSIX has to say about this:
If FS is a null string, the behavior is unspecified.
So the gawk behaviour is implementation-specific and sort of explains why your two examples don't yield the same output.
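Since FS="" is unspecified, a script that has to run under more than one awk probably should not rely on it. A minimal sketch of one way around it is to walk the record with substr(), which mirrors the 5,h,e,l,l,o output from the question (the variable names are mine):

```shell
chars=$(echo hello | awk '{
    n = length($0)
    out = n                               # start with the character count
    for (i = 1; i <= n; i++)
        out = out "," substr($0, i, 1)    # append each character in turn
    print out
}')

echo "$chars"
```

This prints 5,h,e,l,l,o without depending on how a particular awk treats an empty FS.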
Another data point: gawk and perl disagree on how to do this:
$ perl -E '$,=","; $s="hello"; $r=qr( *); @s=split($r,$s); say scalar(@s), @s'
5,h,e,l,l,o
$ gawk 'BEGIN {s="hello";r=" *";n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
1 hello
match
$ gawk 'BEGIN {s="hello";r=""; n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
5 o
match