Why can an AWK program's FS variable be set with the -F flag of gawk (or another awk interpreter)?

Let me explain: AWK is a programming language, and gawk is one of many interpreters for it. The gawk interpreter runs the AWK program that is given to it. So why can the FS (field separator) variable be set with gawk's -F flag? I find it kind of unnatural... and how does it technically do that?

My best guess as to "why" is as a convenience. FS is probably the most used/manipulated awk variable, so having a short option to set it is helpful.
Consider:
awk -F, '...' file.csv
# vs
awk 'BEGIN {FS=","} ...' file.csv
"How does it technically do that" -- see https://git.savannah.gnu.org/cgit/gawk.git/tree/main.c#n1586
Historically, -F was already implemented in gawk v1.01, so it would have existed in whatever legacy awk gawk was based on.
Additionally, the POSIX specification mandates -F.
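A quick way to convince yourself that -F is just shorthand is to run the three spellings side by side. The sketch below uses a made-up sample line (not from the original post) and should behave the same in any POSIX awk:

```shell
# Hypothetical sample record, used only for illustration.
line='alice,30,paris'

# Three ways of telling awk to split on commas; each prints field 2.
via_flag=$(printf '%s\n' "$line" | awk -F, '{ print $2 }')
via_v=$(printf '%s\n' "$line" | awk -v FS=, '{ print $2 }')
via_begin=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "," } { print $2 }')

printf '%s %s %s\n' "$via_flag" "$via_v" "$via_begin"
```

All three variables should end up holding 30.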

So why the FS (field separator) variable can be specified with gawk's -F flag?
The awk man page says:
Command line variable assignment is most useful for dynamically
assigning values to the variables AWK uses to control how input is
broken into fields and records. It is also useful for controlling
state if multiple passes are needed over a single data file.
So -F comes in handy when the field separator is not set in stone but computed dynamically, since -F lets you use a shell variable easily. Imagine you were tasked with writing part of a bash script that should output the last field of each line of file.txt, using the character stored in the variable sep as the separator. You could do that as follows:
awk -F "$sep" '{print $NF}' file.txt
find it kind of unnatural
This depends on what you have used before. A cut user who wants to get the 3rd column from a CSV file might do it as follows:
cut -d , -f 3 file.csv
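For comparison, here is the same extraction written with both tools, with the separator held in a shell variable. The variable names and the sample line are assumptions made for illustration; note the double quotes around "$sep", so that the shell passes it as a single word.

```shell
sep=';'                      # separator chosen at run time (hypothetical)
line='red;green;blue'        # made-up sample record

# cut selects field 3 with -d/-f; awk does the same with -F and $3.
third_cut=$(printf '%s\n' "$line" | cut -d "$sep" -f 3)
third_awk=$(printf '%s\n' "$line" | awk -F "$sep" '{ print $3 }')

printf 'cut: %s, awk: %s\n' "$third_cut" "$third_awk"
```

One difference worth keeping in mind: cut -d takes a single literal character, whereas awk treats a separator longer than one character as a regular expression.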

Related

awk command to print columns with column data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using the above command to filter the first three columns from file1 and put them into file2.
But I am only getting the column names and not the column data.
How can I do that?
|~| is the delimiter.
file1.txt contains:
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
My expected output is:
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
With your shown samples, please try the following awk code. You need to set the field separator to |~| and remove leading spaces from the lines, then print the first three fields.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep the spaces (which were in the initial post before the edit), try the following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: After a chat with the user, it turned out that this code was not working for them because gunzip -c file was being used incorrectly: its output was being saved into a variable, and the awk program was then run on that variable. Correcting that command produced the right file, and the awk program ran fine on it. Adding this as a reference for future readers.
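If the doubled backslashes are hard to read, a bracket expression is another way to make each | literal inside the separator regex. This is a sketch that assumes the first two sample lines from the question, fed through a pipe instead of a file:

```shell
# [|] matches a literal pipe, so no backslash escaping is needed.
first_three=$(printf '%s\n' 'a|~|b|~|c|~|d|~|e' '1|~|2|~|3|~|4|~|5' |
    awk -F'[|]~[|]' -v OFS='|~|' '{ print $1, $2, $3 }')

printf '%s\n' "$first_three"
```

This should print a|~|b|~|c followed by 1|~|2|~|3.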
One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces all "|~|" with "," and sets the output field separator to "|~|". All leading whitespace is trimmed with sub().
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.
For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables a fixed-string input delimiter (which is given using the -d option). By default, the output field separator will also be the same as -d when -F is active. -f-3 is similar to the cut syntax for specifying the first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).
Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter with a tab so that awk's default field splitting applies, then print the columns you need.

What does this Awk expression mean

I am working with a bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In still other awks it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line so you could get 1 line of all-fields output if you don't have many fields being output or several lines of multi-field output if you do.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a
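To see the xargs step on its own: with no command given, xargs runs echo on its input words, which is what joins the separate lines into one. A small sketch with made-up input:

```shell
# Three one-word lines in, one space-separated line out.
joined=$(printf 'b\na\nc\n' | xargs)
printf '%s\n' "$joined"
```

With input this short, everything fits on a single echo command line, so the result is the one line b a c.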

awk set command line options in script

I'm curious about how to set command-line options in an awk script, like -F for the field separator. I tried writing the shebang line like
#!/usr/bin/awk -F ":" -f
and get the following error:
awk: 1: unexpected character '.'
For this example, I can make do with
BEGIN {FS=":"}
but I still want to know a way to set all those options. Thanks in advance.
EDIT:
Let's use another example that should be easy to test.
inputfile:
1
2
3
4
test.awk:
#!/usr/bin/awk -d -f
{num += $1}
END { print num}
Running
/usr/bin/awk -d -f test.awk inputfile
prints 10 and generates a file called awkvars.out with some awk global variables in it.
But
./test.awk inputfile
gives
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ unterminated regexp
If I remove '-d' from the shebang line,
./test.awk inputfile
outputs 10 as expected.
My question is whether there is a way to write "-d" in the test.awk file so that it generates the awkvars.out file.
Answering the OP's question beyond just setting FS.
Short answer: you cannot pass multiple options with '#!', and since the one argument you get has to be -f (so that awk reads its program from the script file), you are out of luck.
Long answer:
When using a shebang (#!), there is a limit of a single argument, which is passed to the named program as its 1st argument. So in general:
#! /path/to/prog arg1
input-1
input-2
Running this file as ./script will execute /path/to/prog arg1 ./script, i.e. the path of the script file (whose first line is the shebang itself) is appended after the single argument. This is an oversimplification; the actual rules are more complex and differ between systems, see https://unix.stackexchange.com/questions/87560/does-the-shebang-determine-the-shell-which-runs-the-script
Given this limit of one argument, when executing awk the one required option is '-f', which tells awk to read the program from the file named next (here, the script path that gets appended). You can bundle a few other options that do NOT take an argument in front of it, for example '-Pf' to force POSIX behavior in gawk.
As far as I can tell, all the 'interesting' options (setting FS, RS, ORS, ...) need to be separated from the '-f' by a space, making it impossible to embed them in the shebang line; the alternative is a 'BEGIN { ... }' block or similar in the script.
Bottom line: trying #! /usr/bin/awk -f- -F, is the same as running awk '-f- -F,' and will look for a program file named '- -F,'. Usually not very useful, and it will not set FS.
Let's say the following is our Input_file, which we are going to use for all the solutions mentioned here.
cat Input_file
a,b,c,d
ab,c
1st way of setting the field separator: the simplest way is to set FS in the BEGIN section of the awk program file. The following is our .awk file.
cat file1.awk
BEGIN{
FS=","
}
{
print $1"..."$2
}
Now when we run the code, we get the following output:
/usr/local/bin/awk -f file1.awk Input_file
a...b
ab...c
2nd way of setting the field separator: pass the FS value as an operand before the Input_file, as follows.
/usr/local/bin/awk -f file.awk FS="," Input_file
Example: the following is the file.awk file, which contains the awk code.
cat file.awk
{
print $1".."$2
}
Now when we run it with awk -f as follows, this is the result.
/usr/local/bin/awk -f file.awk FS="," Input_file
a..b
ab..c
Which means it picks up , as the field separator in the above program.
3rd way of setting the field separator: we can set the field separator for awk -f programs the same way we do for ordinary awk programs, using the -F',' option as follows.
/usr/local/bin/awk -F',' -f file.awk Input_file
a..b
ab..c
4th way of setting the field separator: we can pass the field separator as a variable using the -v option on the command line while running the file.awk script, as follows.
/usr/local/bin/awk -v FS=',' -f file.awk Input_file
Never use a shebang to call awk as it robs you of the ability to separate shell arguments into awk arguments and awk variables and do anything else that's better done in shell (e.g. arg parsing with getopts) before calling awk. Just call awk from inside your shell script.
Also, don't name your shell script test.awk as it's a shell script. The fact it's implemented in awk is irrelevant. There's no reason to create a file that you sometimes call as awk file to have awk interpret and other times as just file to have the shell interpret.
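As a sketch of that advice, here is the summing example wrapped in a shell function that parses its own option with getopts and only then calls awk. The function name sum_col and its -s option are hypothetical, chosen for this illustration:

```shell
# sum_col [-s SEP] [FILE...]: sum the first field of every input line.
# The body runs in a subshell, so sep and OPTIND do not leak to the caller.
sum_col() (
    sep=' '                 # a single space is awk's default field separator
    OPTIND=1
    while getopts 's:' opt; do
        case $opt in
            s) sep=$OPTARG ;;
            *) exit 2 ;;    # unknown option: getopts already printed a message
        esac
    done
    shift $((OPTIND - 1))
    awk -F "$sep" '{ num += $1 } END { print num }' "$@"
)

total=$(printf '1:x\n2:y\n3:z\n4:w\n' | sum_col -s :)
printf '%s\n' "$total"
```

Because the shell does the option parsing, adding further options later does not run into the one-argument limit of a shebang line.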

Why is field separator taken into account differently if set before or after the expression?

The code print split("foo:bar", a) prints how many pieces split() produced when cutting on the field separator. Since the default field separator is the space and there is none in "foo:bar", the result is 1:
$ awk 'BEGIN{print split("foo:bar",a)}'
1
However, if the field separator is ":" then the result is obviously 2 ("foo" and "bar"):
$ awk 'BEGIN{FS=":"; print split("foo:bar", a)}'
2
$ awk -F: 'BEGIN{print split("foo:bar", a)}'
2
However, it does not if FS is defined after the Awk expression:
$ awk 'BEGIN{print split("foo:bar", a)}' FS=":"
1
If I print it not in the BEGIN block but when processing a file, the FS is already taken into account:
$ echo "bla" > file
$ awk '{print split("foo:bar",a)}' FS=":" file
2
So it looks like FS set before the expression is already taken into account in the BEGIN block, while it is not if defined after.
Why is this happening? I could not find details on this in GNU Awk User's Guide → 4.5.4 Setting FS from the Command Line. I am working on GNU Awk 5.
This behavior is not specific to GNU awk; it is specified by POSIX.
Calling convention:
The awk calling convention is the following:
awk [-F sepstring] [-v assignment]... program [argument...]
awk [-F sepstring] -f progfile [-f progfile]... [-v assignment]... [argument...]
This shows that any option (flags -F, -v, -f) passed to awk must occur before the program text and any arguments. For example:
# this works
$ awk -F: '1' /dev/null
# this fails
$ awk '1' -F: /dev/null
awk: fatal: cannot open file `-F:' for reading (No such file or directory)
Field separators and assignments as options:
The Standard states:
-F sepstring: Define the input field separator. This option shall be equivalent to: -v FS=sepstring
-v assignment:
The application shall ensure that the assignment argument is in the same form as an assignment operand. The specified variable assignment shall occur prior to executing the awk program, including the actions associated with BEGIN patterns (if any). Multiple occurrences of this option can be specified.
source: POSIX awk standard
So, if you define a variable assignment or declare a field separator using the options, BEGIN will know them:
$ awk -F: -v a=1 'BEGIN{print FS,a}'
: 1
What are arguments?:
The Standard states:
argument: Either of the following two types of argument can be intermixed:
file
A pathname of a file that contains the input to be read, which is matched against the set of patterns in the program. If no file operands are specified, or if a file operand is '-', the standard input shall be used.
assignment
An <snip: extremely long sentence to state varname=varvalue>, shall specify a variable assignment rather than a pathname. <snip: some extended details on the meaning of varname=varvalue> Each such variable assignment shall occur just prior to the processing of the following file, if any. Thus, an assignment before the first file argument shall be executed after the BEGIN actions (if any), while an assignment after the last file argument shall occur before the END actions (if any). If there are no file arguments, assignments shall be executed before processing the standard input.
source: POSIX awk standard
Which means that if you do:
$ awk program FS=val file
BEGIN will not know about the new definition of FS but any other part of the program will.
Example:
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}END{print "END",FS,a,""}' FS=: a=1 /dev/null
BEGIN| ||
END|:|1|
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}
{print "ACTION",FS,a,""}
END{print "END",FS,a,""}' FS=: a=1 <(echo 1) a=2
BEGIN| ||
ACTION|:|1|
END|:|2|
See also:
GNU awk manual: Section Other arguments for an understanding of how GNU awk interprets the above.
Because you can set the variable individually for each file you process, and BEGIN happens before any of that.
bash$ awk '{ print NF }' <(echo "foo:bar") FS=: <(echo "foo:bar")
1
2

Combine grep -f and awk

I am using two commands:
awk '{ print $2 }' SomeFile.txt > Pattern.txt
grep -f Pattern.txt File.txt
With the first command I create a list of desirable patterns. With the second command I extract all lines in File.txt that match the lines in the Pattern.txt
My question is, is there a way to combine awk and grep in a pipeline so that I don't have to generate the intermediate Pattern.txt file?
Thanks!
You can do this all in one invocation of awk:
awk 'NR==FNR{a[$2];next}{for(i in a)if($0~i)print}' SomeFile.txt File.txt
Populate keys in the array a from the second column of the first file. NR==FNR identifies the first file (total record number is equal to this file's record number). next skips the second block for the first file.
In the second block, loop through all the keys in the array and if the line matches any of them, print it. To avoid printing the line more than once if it matches more than one pattern, you could add a next here too, i.e. {for(i in a)if($0~i){print;next}}.
If the "patterns" are actually fixed strings that must equal the whole line, it is even simpler:
awk 'NR==FNR{a[$2];next}$0 in a' SomeFile.txt File.txt
If your shell supports it, you can use process substitution:
grep -f <(awk '{ print $2 }' SomeFile.txt) File.txt
bash and zsh support that; other shells probably do too, but I haven't tested.
Simpler than the above, and supported by all shells, is to use a pipe:
awk '{ print $2 }' SomeFile.txt | grep -f - File.txt
- is used as the argument to -f. - has a special meaning here and stands for stdin. Thanks to Tom Fenech for mentioning that!
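If the patterns should be treated as fixed substrings rather than regular expressions (what grep -F -f would do), awk's index() function can replace the ~ match. A sketch with made-up file contents, written to a temporary directory so that it is self-contained:

```shell
tmp=$(mktemp -d)
printf 'x foo\ny b.r\n' > "$tmp/SomeFile.txt"               # column 2 holds the patterns
printf 'a foo line\nbar\nhas b.r here\n' > "$tmp/File.txt"  # lines to search

# First file: remember column 2. Second file: print a line once if any
# remembered string occurs in it literally (index() does no regex matching).
matches=$(awk 'NR == FNR { pat[$2]; next }
               { for (p in pat) if (index($0, p)) { print; next } }' \
              "$tmp/SomeFile.txt" "$tmp/File.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Here the line bar is not printed, because b.r is compared literally and its dot is not a wildcard. As with the other NR==FNR solutions, this assumes the first file is not empty.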