Different results in awk when using different FS syntax - awk

I have a sample file which contains the following.
logging.20160309.113.txt.log: 0 Rows successfully loaded.
logging.20160309.1180.txt.log: 0 Rows successfully loaded.
logging.20160309.1199.txt.log: 0 Rows successfully loaded.
I am familiar with 2 ways of specifying a field separator in awk, but I am getting different results from them. For the longest time I have used:
the "FS=" syntax when my FS is more than one character;
the "-f" flag when my FS is just one character.
I would like to understand why the FS= syntax is giving me an unexpected result, as seen below. Somehow the 1st record is being left behind.
$ head -3 reload_list | awk -F"\.log\:" '{ print $1 }'
awk: warning: escape sequence `\.' treated as plain `.'
awk: warning: escape sequence `\:' treated as plain `:'
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
$ head -3 reload_list | awk '{ FS="\.log\:" } { print $1 }'
awk: warning: escape sequence `\.' treated as plain `.'
awk: warning: escape sequence `\:' treated as plain `:'
logging.20160309.113.txt.log:
logging.20160309.1180.txt
logging.20160309.1199.txt

The reason you are getting different results is that when you set FS in the awk program, it is not in a BEGIN block. So by the time you've set it, the first record has already been parsed into fields (using the default separator).
Setting with -F
$ awk -F"\\.log:" '{ print $1 }' b.txt
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
Setting FS after parsing first record
$ awk '{ FS= "\\.log:"} { print $1 }' b.txt
logging.20160309.113.txt.log:
logging.20160309.1180.txt
logging.20160309.1199.txt
Setting FS before parsing any records
$ awk 'BEGIN { FS= "\\.log:"} { print $1 }' b.txt
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
I noticed this relevant bit in an awk manual. If perhaps you've seen different behavior previously or with a different implementation, this could explain why:
According to the POSIX standard, awk is supposed to behave as if
each record is split into fields at the time that it is read. In
particular, this means that you can change the value of FS after a
record is read, but before any of the fields are referenced. The value
of the fields (i.e. how they were split) should reflect the old value
of FS, not the new one.
However, many implementations of awk do not do this. Instead,
they defer splitting the fields until a field reference actually
happens, using the current value of FS! This behavior can be
difficult to diagnose.
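You can check which camp your awk is in with a single-record input: change FS in the main action, then reference a field. A POSIX-conformant awk (gawk, for example) still splits on the old separator and prints a:b below; a deferring implementation would print just a:
$ echo 'a:b c:d' | awk '{ FS=":"; print $1 }'
a:b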

-f is for running a script from a file; -F and FS work the same:
$ awk -F'.log' '{print $1}' logs
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
$ awk 'BEGIN{FS=".log"} {print $1}' logs
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
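For completeness, since -f reads the program from a file, the equivalent invocation would look like this (prog.awk is a hypothetical file holding the same program):
$ cat prog.awk
BEGIN { FS = ".log" }
{ print $1 }
$ awk -f prog.awk logs
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt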

Related

awk command to print columns with column data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using the above command to filter the first three columns from file1 and put them into file2.txt.
But I am only getting the column names and not the column data.
How can I do that?
|~| is the delimiter.
file1.txt has values as :
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
my expected output is:
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
With your shown samples, please try the following awk code. You need to set the field separator to |~| and remove the leading space from the lines, then print them.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep the spaces (which were in the initial post, before the edit), then try the following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: I had a chat with the user in a room and learned why this code was not working for them: gunzip -c file was being used incorrectly, and its output was being saved into a variable on which the user then ran the awk program. Correcting that command generated the right file, and the awk program ran fine on it. Adding this as a reference for future readers.
One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces all "|~|" with ",", setting the output field separator to "|~|". All leading whitespace is trimmed with sub().
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.
For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables a fixed-string input delimiter (which is given using the -d option), and by default the output field separator will also be the same as -d when -F is active. -f-3 is similar to cut syntax and specifies the first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports the -z option to automatically handle common compressed formats based on the filename extension (adding this since the OP had an issue with compressed input).
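Assuming the flags compose as described, reading a gzipped input directly might look like this (ip.txt.gz is a hypothetical file name):
hck -z -Ld'|~|' -D'|~|' -f-3 ip.txt.gz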
Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter with a tab, let awk use its default separator, then print the columns you need.

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still, it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on one line, so you could get one line of all-fields output if you don't have many fields being output, or several lines of multi-field output if you do.
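For example, in an awk that splits one character per field when FS is empty (gawk documents this behavior), with xargs then joining the output lines:
$ printf 'foo\nxabc\nyzabc\nbar\n' | gawk -F '' '/abc/{print $3}'
b
a
$ printf 'foo\nxabc\nyzabc\nbar\n' | gawk -F '' '/abc/{print $3}' | xargs
b a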
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Use awk to interpret }{ as RS and output with ORS }\n{

I have data that looks like this:
{"anonymousId":"abc123",{"hello":"world"}}{"anonymousId":"abc456",{"hi": "again"}}
It's as if you took a newline-delimited json file and removed all the newlines.
I'm trying to use awk to convert it to ndjson.
That is, my expected output is this:
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
I don't want to load the entire file into memory (which is why I'm not using sed), so my thought is I should use }{ as row separator. Then, I figure if I use }\n{ as ORS I should get my desired output.
So I tried this:
cat my-file.txt | awk -v RS="}{" -v ORS="}\n{" '{$1=$1}1'
But it doesn't work!
Here's the output I get:
{"anonymousId":"abc123",{"hello":"world"}
{}
{{"anonymousId":"abc456",{"hi": "again"}
{}
{}
{
Apart from the constraint of not loading the entire file into memory, I don't care what bash command is used, but my thinking is awk will be the way. E.g. if tr supported multi-character expressions, that would be fine with me.
Please help me understand why this isn't working as expected and what I need to change.
Thanks!
Update
Following the answers given, will add some learnings.
The TLDR is: don't use macOS if you need to do trickier things like this.
For one, this doesn't work on a Mac: echo -e "a\nb\nc\nd\ne\n" | head -n -2; it complains about an illegal line parameter, but this is valid on a Linux system.
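A portable way to drop the last two lines on both macOS and Linux is to delete the last line twice with sed (a sketch added for reference, not from the original post):
$ echo -e "a\nb\nc\nd\ne\n" | sed '$d' | sed '$d'
a
b
c
d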
The other problem was the way awk was working on my (mac) system.
My awk command was close to correct.
On linux it produces this output:
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}}
{
So I just have to find a way to trim the trailing }\n{ (and as pointed out in the answer, the {$1=$1} is not necessary).
But all of those extraneous newlines were due to the screwy implementation of awk on my system (it wasn't gawk, and I'm not sure what it was).
Doing $1=$1 inside awk -v RS='}{' -v ORS='}\n{' '{$1=$1}1' file isn't useful - it tells awk to recompile the current record, replacing all chains of white space with blanks, but the only white space in your example is the \n at the end of the file and there's no point converting that to a blank. So your script can be reduced to:
awk -v RS='}{' -v ORS='}\n{' '1' file
but RS='}{' means different things to different awk variants.
Use of a multi-char RS with GNU awk (and probably a couple of others now) means that the RS is treated as a regexp to separate the records:
$ awk -v RS='}{' -v ORS='}\n{' '1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
}
{$
Note the extra }\n{ added at the end: because there is no }{ at the end of your input, the end of input itself indicates the end of a record, and so it gets replaced with the ORS value.
Use of a multi-char RS with a POSIX awk means that the 2nd and subsequent chars in the RS get ignored and the first char is taken as the RS, hence the output you reported seeing in your question:
$ awk --posix -v RS='}{' -v ORS='}\n{' '1' file
{"anonymousId":"abc123",{"hello":"world"}
{}
{{"anonymousId":"abc456",{"hi": "again"}
{}
{
}
{$
where every } alone gets treated as matching RS and so gets replaced by ORS.
So you are not using an awk that supports multi-char RS. Your choices are to install one (preferably gawk) and do:
$ awk -v RS='}[{\n]' '{ORS=gensub(/}{/,"}\n{",1,RT)} 1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
otherwise do something like this with any awk:
$ awk --posix -v RS='{' -v ORS= '{print pfx $0; pfx=(/}$/ ? "\n" : "") RS}' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
In the gawk solution above we define the RS as '}[{\n]' to say that records mid-line are terminated by }{ while the record at the end of the line is terminated by }\n. So RT holds }{ for every record except the last one on the line, for which it is }\n if your line ends with \n, or NULL otherwise. We then just have to set ORS to RT, with }{ converted to }\n{ for those records where RT has that value; otherwise ORS simply gets set to }\n when RT has that value, or to NULL if your input didn't have a terminating \n.
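To see which terminator each record had, you can inspect RT per record (gawk only; a minimal sketch with inline sample input):
$ printf '{"a":1}{"b":2}\n' | gawk -v RS='}[{\n]' '{print NR, (RT == "}{" ? "mid-line" : "end-of-line")}'
1 mid-line
2 end-of-line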
An alternative gawk solution that I think I might actually prefer would be:
$ awk -v RS='}{' -v ORS='}\n{' 'NR>1{print prev} {prev=$0} END{printf "%s",prev}' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
EDIT: original answer for posterity before I noticed the OP said they don't want to read the whole file into memory:
Simple substitutions on individual strings like this is what sed is best at:
$ sed 's/}{/}\n{/g' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
otherwise with any awk:
$ awk '{gsub(/}{/,"}\n{")} 1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
Using the record separator will create an extra delimiter at the end of the file; since it's static, we can just remove it afterwards:
$ echo '{"anonymousId":"abc123",{"hello":"world"}}{"anonymousId":"abc456",{"hi": "again"}}' |
awk -v RS='}{' -v ORS='}\n{' 1 | head -n -2
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
If you don't have gawk for multi-char RS support, you can use this workaround:
$ echo ... |
awk -v RS='}' 'NF{printf "%s", $0 RS} !NF{print RS}' | head -n -2
there will be an extra RS, which will be trimmed afterwards.

Why is field separator taken into account differently if set before or after the expression?

The code print split("foo:bar", a) prints how many pieces split() produced when cutting based on the field separator. Since the default field separator is the space and there is none in "foo:bar", the result is 1:
$ awk 'BEGIN{print split("foo:bar",a)}'
1
However, if the field separator is ":" then the result is obviously 2 ("foo" and "bar"):
$ awk 'BEGIN{FS=":"; print split("foo:bar", a)}'
2
$ awk -F: 'BEGIN{print split("foo:bar", a)}'
2
However, it does not if FS is defined after the Awk expression:
$ awk 'BEGIN{print split("foo:bar", a)}' FS=":"
1
If I print it not in the BEGIN block but when processing a file, the FS is already taken into account:
$ echo "bla" > file
$ awk '{print split("foo:bar",a)}' FS=":" file
2
So it looks like FS set before the expression is already taken into account in the BEGIN block, while it is not if defined after.
Why is this happening? I could not find details on this in the GNU Awk User's Guide → 4.5.4 Setting FS from the Command Line. I am using GNU Awk 5.
This behavior is not specific to GNU awk; it is mandated by POSIX.
Calling convention:
The awk calling convention is the following:
awk [-F sepstring] [-v assignment]... program [argument...]
awk [-F sepstring] -f progfile [-f progfile]... [-v assignment]...
[argument...]
This shows that any option (the flags -F, -v, -f) passed to awk must occur before the program definition and any arguments. Hence:
# this works
$ awk -F: '1' /dev/null
# this fails
$ awk '1' -F: /dev/null
awk: fatal: cannot open file `-F:' for reading (No such file or directory)
Field separators and assignments as options:
The Standard states:
-F sepstring: Define the input field separator. This option shall be equivalent to: -v FS=sepstring
-v assignment:
The application shall ensure that the assignment argument is in the same form as an assignment operand. The specified variable assignment shall occur prior to executing the awk program, including the actions associated with BEGIN patterns (if any). Multiple occurrences of this option can be specified.
source: POSIX awk standard
So, if you define a variable assignment or declare a field separator using the options, BEGIN will know about them:
$ awk -F: -v a=1 'BEGIN{print FS,a}'
: 1
What are arguments?:
The Standard states:
argument: Either of the following two types of argument can be intermixed:
file
A pathname of a file that contains the input to be read, which is matched against the set of patterns in the program. If no file operands are specified, or if a file operand is '-', the standard input shall be used.
assignment
An <snip: extremely long sentence to state varname=varvalue>, shall specify a variable assignment rather than a pathname. <snip: some extended details on the meaning of varname=varvalue> Each such variable assignment shall occur just prior to the processing of the following file, if any. Thus, an assignment before the first file argument shall be executed after the BEGIN actions (if any), while an assignment after the last file argument shall occur before the END actions (if any). If there are no file arguments, assignments shall be executed before processing the standard input.
source: POSIX awk standard
Which means that if you do:
$ awk program FS=val file
BEGIN will not know about the new definition of FS but any other part of the program will.
Example:
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}END{print "END",a,""}' FS=: a=1 /dev/null
BEGIN| ||
END|:|1|
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}
{print "ACTION",FS,a,""}
END{print "END",a,""}' FS=: a=1 <(echo 1) a=2
BEGIN| ||
ACTION|:|1|
END|:|2|
See also:
GNU awk manual: Section Other arguments, for an understanding of how GNU awk interprets the above.
Because you can set the variable individually for each file you process, and BEGIN happens before any of that.
bash$ awk '{ print NF }' <(echo "foo:bar") FS=: <(echo "foo:bar")
1
2

Combine grep -f and awk

I am using two commands:
awk '{ print $2 }' SomeFile.txt > Pattern.txt
grep -f Pattern.txt File.txt
With the first command I create a list of desired patterns. With the second command I extract all lines in File.txt that match the lines in Pattern.txt.
My question is, is there a way to combine awk and grep in a pipeline so that I don't have to generate the intermediate Pattern.txt file?
Thanks!
You can do this all in one invocation of awk:
awk 'NR==FNR{a[$2];next}{for(i in a)if($0~i)print}' Somefile.txt File.txt
Populate keys in the array a from the second column of the first file. NR==FNR identifies the first file (total record number is equal to this file's record number). next skips the second block for the first file.
In the second block, loop through all the keys in the array and if the line matches any of them, print it. To avoid printing the line more than once if it matches more than one pattern, you could add a next here too, i.e. {for(i in a)if($0~i){print;next}}.
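Putting that together, the duplicate-safe variant as a complete command (same file names as above):
awk 'NR==FNR{a[$2];next}{for(i in a)if($0~i){print;next}}' Somefile.txt File.txt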
If the "patterns" are actually fixed strings, it is even simpler:
awk 'NR==FNR{a[$2];next}$0 in a' Somefile.txt File.txt
If your shell supports it, you can use process substitution:
grep -f <(awk '{ print $2 }' SomeFile.txt) File.txt
bash and zsh support that; others probably do too, but I haven't tested them.
Simpler than the above and supported by all shells would be to use a pipe:
awk '{ print $2 }' SomeFile.txt | grep -f - File.txt
Here - is used as the argument to -f; it has a special meaning and stands for stdin. Thanks to Tom Fenech for mentioning that!