Using shell variables in gensub in Awk

I tried to answer a question asked here
How do I sed only matched grep line
as below
awk -F":" -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '{OFS=":" ; print gensub(/(M):(A)/,"\\1;S","g")}'
but it did not work, so I am guessing gensub is not able to recognize the shell variables M, A and S passed to awk. Is there a way to use shell variables in gensub in awk?

What you're experiencing is not because you're using gensub(), it's to do with regexps. There are regexp literals, e.g. $0~/foo/, and then there are dynamic regexps which are strings that are converted to regexps when they are evaluated, e.g. $0~"foo" or {var="foo"} $0~var. If you want to use a variable in ANY regexp context then you need a dynamic regexp and so that is the syntax you need to use, not the regexp literal syntax. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
The correct syntax for what you wrote would be:
awk -F":" -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '{OFS=":" ; print gensub("("M"):"A,"\\1;"S,"g")}'
but that has some semantic issues (e.g. it allows partial matches, it sets OFS on every line instead of once in a BEGIN block even though OFS is never even used, it almost certainly uses "g" unnecessarily instead of 1, and it uses regexps at all when you're really trying to do exact matches on strings).
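For illustration, here is a minimal sketch of the exact-string approach (the field layout, i.e. name in $1 and activity in $2, and the file name are assumptions about the original question's input):
awk -F':' -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '
  BEGIN { OFS = FS }               # set the output separator once
  $1 == M && $2 == A { $2 = S }    # exact string comparison, no regexps needed
  1                                # print every (possibly modified) line
' file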


Why AWK program's FS variable can be specified with -F flag of gawk (or other awk) interpreter/command?
Let me explain: AWK is a programming language and gawk is one of many interpreters for AWK; gawk interprets/executes/runs the AWK program that is given to it. So why can the FS (field separator) variable be specified with gawk's -F flag? I find it kind of unnatural... and how does it technically do that?
My best guess as to "why" is convenience: FS is probably the most used/manipulated awk variable, so having a short option to set it is helpful.
Consider
awk -F, '...' file.csv
# vs
awk 'BEGIN {FS=","} ...' file.csv
"How does it technically do that" -- see https://git.savannah.gnu.org/cgit/gawk.git/tree/main.c#n1586
Historically -F was implemented in gawk v1.01 so it would have existed in whatever legacy awk that gawk was based on.
Additionally, the POSIX specification mandates -F.
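In fact, POSIX defines -F sepstring as equivalent to -v FS=sepstring, so these are just different spellings of the same thing (the file name and program body below are only placeholders):
awk -F','          '{print $1}' file.csv
awk -v FS=','      '{print $1}' file.csv
awk 'BEGIN{FS=","} {print $1}' file.csv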
So why can the FS (field separator) variable be specified with gawk's -F flag?
The awk man page claims that
Command line variable assignment is most useful for dynamically
assigning values to the variables AWK uses to control how input is
broken into fields and records. It is also useful for controlling
state if multiple passes are needed over a single data file.
So -F comes in handy when the field separator is not etched in stone but rather computed dynamically, because -F lets you use a bash variable easily. Imagine that you were tasked with developing part of a bash script which should output the last field of each line of file.txt, using the character stored in the variable sep as the separator. You could do that the following way:
awk -F "${sep}" '{print $NF}' file.txt
find it kind of unnatural
This depends on what you have used before; a cut user who wants to get the 3rd column from a csv file might do that the following way:
cut -d , -f 3 file.csv
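For comparison, a (roughly) equivalent awk one-liner would be:
awk -F',' '{print $3}' file.csv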

Attempting to pass an escape char to awk as a variable

I am using this command:
awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
which warns me:
awk: warning: escape sequence `\(' treated as plain `('
which prints:
new[[:blank:]]+File(
I would like the value to be:
new[[:blank:]]+File\(
I've tried amending the command to account for escape chars but always get the same warning.
When you run:
$ awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
awk: warning: escape sequence `\(' treated as plain `('
Regex1 = new[[:blank:]]+File(
you're in shell assigning a string to an awk variable. When you use -v in awk you're asking awk to interpret escape sequences in such an assignment so that \t can become a literal tab char, \n a newline, etc. but the ( in your string has no special meaning when escaped and so \( is exactly the same as (, hence the warning message.
If you want to get a literal \ character you'd need to escape it so that \\ gets interpreted as just \:
$ awk -v regex1='new[[:blank:]]+File\\(' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
You seem to be trying to pass a regexp to awk and in my opinion once you get to needing 2 escapes your code is clearer and simpler if you put the target character into a bracket expression instead:
$ awk -v regex1='new[[:blank:]]+File[(]' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File[(]
If you want to assign an awk variable the value of a literal string with no interpretation of escape sequences then there are other ways of doing so without using -v, see How do I use shell variables in an awk script?.
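For example, one such way is to pass the value through the environment: awk reads ENVIRON[] values with no escape processing at all, so the backslash survives untouched (a sketch reusing the regex1 name from above):
$ regex1='new[[:blank:]]+File\(' awk 'BEGIN{print "Regex1 =", ENVIRON["regex1"]}'
Regex1 = new[[:blank:]]+File\(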
If you use gnu awk then you can use a regexp literal with #/.../ format and avoid double escaping:
awk -v regex1='#/new[[:blank:]]+File\(/' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
I think gawk and mawk 1/2 are also okay with the hideous but fool-proof octal method, like
-v regex1="new[[:blank:]]+File[\\050]" # note the double quotes
once the engine takes out the first \\ layer, the regex being tested against is equivalent to
/new[[:blank:]]+File[\050]/
which is as safe as it gets. The reason it matters is that something like
/new[[:blank:]]+File[\(]/
is something mawk/mawk2 are totally cool with, but gawk will give an annoying warning message. Octals (or [\x28]) get rid of that cross-awk weirdness and allow the same custom string regex to be deployed across all three
(I haven't tested against less popular variants like the original BWK awk or nawk, etc.).
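As a quick check with gawk (a sketch; the printf line and sample text are made up for illustration), no warning is printed and the regex matches:
$ printf 'new File(x)\n' | gawk -v regex1="new[[:blank:]]+File[\\050]" '$0 ~ regex1 {print "match:", $0}'
match: new File(x)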
PS: since I'm on the subject of octal caveats, mawk/mawk2 and gawk in binary mode are cool with square-bracket octals for all bytes, meaning
"[\\302-\\364][\\200-\\277]+" # this happens to be a *very* rough proxy for UTF-8
is valid for all three. If you really want to be the hex guy, that same regex becomes
"[\\xC2-\\xF4][\\x80-\\xBF]+"
However, gawk in Unicode mode will scream about the locale whenever you attempt to put square brackets around any non-ASCII byte. To circumvent that, you'll have to list the bytes out with a bunch of ORs, like:
(\302|\303|\304.....|\364)(\200|\201......|\277)+
This way you can get gawk's Unicode mode to handle any arbitrary byte, handle binary input data (whatever circumstances caused that to happen), and perform full base64 or URI-plus encoding/decoding from within awk (plus anything else you want, like SHA256 or LZMA, etc.). So far I've even managed to get gawk in Unicode mode to base64-encode an MP4 file input without gawk spitting out the "illegal multibyte" error message.
.....and also get gawk and mawk in binary modes to become mostly UTF-8 aware and safe.
The "mostly" caveat being that I haven't implemented the minute details, like directly doing normalization-form conversions from within awk instead of dumping out to python3 and getting results back via getline, or keeping modifier linguistic marks with their intended character when I do a UC-safe-substring string reversal.

Delete string from line that matches regex with AWK

I have a file that contains a lot of data like this, and I have to delete everything that matches this regex: [-]+\d+(.*)
Input:
zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft
Output should be:
zxczxc
qweqweqweqweqwe
asdasdasasdsad
How can I do this with AWK?
sed might be easier...
$ sed -E 's/-+[0-9].*//' file
note that .* covers the +.* part (no extra + is needed after [0-9])
AFAIK awk doesn't support \d, so you could use [0-9] instead. Your regex is otherwise correct; the only thing you need is to put it in the correct awk function:
awk '{sub(/-+[0-9].*/,"")} 1' Input_file
You don't need the extra plus sign after [0-9] as this is covered by the .*
Generally, if you want to delete a string that matches a regular expression, all you need to do is substitute it with an empty string. The most straightforward solution is sed, as presented by karafka; the other solution is using awk, as presented by RavinderSingh13.
The overall syntax would look like this:
sed -e 's/ere//g' file
awk '{gsub(/ere/,"")}1' file
with ere the regular expression representation. Note I use g and gsub here to substitute all non-overlapping strings.
Due to the nature of the regular expression in the OP, i.e. it ends with .*, the g can be dropped. It also allows us to write a different awk solution which works with field separators:
awk -F '-+[0-9]' '{print $1}' file
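For instance, run against the sample input above (assuming it is saved as file), the field-separator variant produces the expected output:
$ awk -F '-+[0-9]' '{print $1}' file
zxczxc
qweqweqweqweqwe
asdasdasasdsad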

For gawk, how to set FS and RS in the same command as an awk script?

I have an awk command that returns the duplicates in an input stream with
awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
However, I want to change the field separator characters and record separator characters before I do that. The command I use for that is
FS='\n' RS='\n\n'
Yet I'm having trouble making that happen. Is there a way to effectively combine these two commands into one? Piping one to the other doesn't seem to work either.
The action of the BEGIN rule is executed before any input is read.
awk 'BEGIN{FS="\n";RS="\n\n"}{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
or you can specify them using command line options like:
awk -F '\n' -v RS='\n\n' '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
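As a side note, and assuming the records really are blank-line-separated paragraphs: setting RS to the empty string puts awk into paragraph mode, which is the portable way to say "records separated by blank lines, fields separated by newlines":
awk 'BEGIN{RS=""; FS="\n"} {a[$0]++} END{for (i in a) if (a[i]>1) print i}' file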

Awk Greater Than Less Than

I am using this command
num1=2.2
num2=4.5
result=$(awk 'BEGIN{print ($num2>$num1)?1:0}')
This always returns 0, whether num2>num1 or num1>num2.
But when I put the actual numbers as such
result=$(awk 'BEGIN{print (4.5>2.2)?1:0}')
I would get a return value of 1. Which is correct.
What can I do to make this work?
The reason it fails when you use variables is that the awk script enclosed in single quotes is evaluated by awk and not by bash; so if you'd like to pass bash variables to awk, you have to specify them with the -v option as follows:
num1=2.2
num2=4.5
result=$(awk -v n1=$num1 -v n2=$num2 'BEGIN{print (n2>n1)?1:0}')
Note that program variables used inside the awk script must not be prefixed with $.
Try doing this :
result=$(awk -v num1=2.2 -v num2=4.5 'BEGIN{print (num2 > num1) ? 1 : 0}')
See :
man awk | less +/'^ *-v'
Because $num1 and $num2 are not expanded by bash -- you are using single quotes. The following will work, though:
result=$(awk "BEGIN{print ($num2>$num1)?1:0}")
Note, however, as pointed out in the comments, that this is poor coding style and mixes bash and awk. Personally, I don't mind such constructs; but in general, especially for complex things, and if you don't remember what gets evaluated by bash inside double quotes, turn to the other answers to this question.
EDIT: Actually, instead of awk, I would use bc:
num1=2.2
num2=4.5
result=$( echo "$num2 > $num1" | bc )
Why? Because it is just a bit clearer... and lighter.
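A quick sanity check of the bc version, with num1 and num2 set as in the question:
echo "$num2 > $num1" | bc    # prints 1, since 4.5 > 2.2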
Or with Perl (because it is shorter, because I like Perl more than awk, and because I like backticks more than $()):
result=`perl -e "print ( $num2 > $num1 ) ? 1 : 0;"`
Or, to be fancy (and probably inefficient):
if [ `echo -e "$num1\n$num2" | sort -n | head -1` != "$num1" ] ; then result=0 ; else result=1 ; fi
(Yes, I know)
I had a brief, intensive, 3-year long exposure to awk, in prehistoric times. Nowadays bash is everywhere and can do loads of stuff (I had sh/csh only at that time) so often it can be used instead of awk, while computers are fast enough for Perl to be used in ad hoc command lines instead of awk. Just sayin'.
This might work for you:
result=$(awk 'BEGIN{print ('$num2'>'$num1')?1:0}')
Think of the '' pairs as poking holes through the awk command to the underlying bash shell.