How to pass a regular expression to a function in AWK - awk

I do not know how to pass a regular expression as an argument to a function.
If I pass a string, it is OK,
I have the following awk file,
#!/usr/bin/awk -f
function find(name){
for(i=0;i<NF;i++)if($(i+1)~name)print $(i+1)
}
{
find("mysql")
}
I do something like
$ ./fct.awk <(echo "$str")
This works OK.
But when I call in the awk file,
{
find(/mysql/)
}
This does not work.
What am I doing wrong?
Thanks,
Eric J.

You cannot (or at least should not) pass a regex constant to a user-defined function. Use a dynamic regex, i.e. a string, instead: find("mysql").
If you write find(/mysql/), awk evaluates the bare regex constant as $0 ~ /mysql/, so what your find(..) function receives is 0 or 1.
See this question for details:
awk variable assignment statement explanation needed
also
http://www.gnu.org/software/gawk/manual/gawk.html#Using-Constant-Regexps
section: 6.1.2 Using Regular Expression Constants

warning: regexp constant for parameter #1 yields boolean value
The regex gets evaluated (matching against $0) before it's passed to the function. You have to use strings.
Note: make sure you do proper escaping: http://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps
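A minimal sketch of the string approach, assuming any POSIX awk on the PATH. The sample line in str and the second pattern are made up for illustration; only find() mirrors the question. Note the doubled backslash: the string "\\." reaches the regex engine as \. (a literal dot).

```shell
# Sample input line (hypothetical data for illustration).
str='apache mysql-server php mysqld 3.14'

# Pass the pattern as a string (dynamic regex) rather than a /.../ constant.
result=$(printf '%s\n' "$str" | awk '
function find(re,    i) {            # i is a local variable
    for (i = 1; i <= NF; i++)
        if ($i ~ re)
            print $i
}
{
    find("^mysql")                   # fields starting with mysql
    find("^[0-9]+\\.[0-9]+$")        # doubled backslash: literal dot
}')

printf '%s\n' "$result"
```

Declaring i as an extra parameter keeps the loop counter from leaking into the global scope, which the original find() does not do.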

If you use GNU awk, you can pass a regular expression as a user-defined function parameter.
You have to write your regex as @/.../.
In your example, you would use it like this:
function find(regex){
    for(i=1;i<=NF;i++)
        if($i ~ regex)
            print $i
}
{
    find(@/mysql/)
}
It's called a strongly typed regexp constant, and it has been available since GNU awk version 4.2 (Oct 2017).
Example here.

Use quotes and treat the regex as a string. This works in mawk, mawk2, and GNU awk. You will also need to double any backslashes, since string processing consumes one level of escaping before the regex engine ever sees the pattern.
In your example, just find("mysql") will suffice.
You can pass an arbitrary regex this way without being confined to GNU awk, as long as you are willing to write it as a string rather than with the @/../ syntax others have mentioned. This is where the number of backslashes makes a difference.
You can even build a regex out of arbitrary bytes, preferably via octal codes. If you use "\342\234\234" as a regex, awk converts the escapes into actual bytes in the regex before matching.
While there's nothing wrong with that approach, if you want to be 100% safe and prefer not to have arbitrary bytes flying around, write it as
"[\\342][\\234][\\234]" ----> ✜
Once awk has read the string and built its internal representation, the regex looks like this:
[\342][\234][\234]
which still matches the same thing (in this case, a cross-shaped dingbat). In gawk's Unicode-aware mode, however, this produces annoying warnings, because it places non-ASCII bytes directly inside square brackets. For that use case,
"\\342\\234\\234" ------(eqv to )---> /\342\234\234/
will keep gawk happy and quiet. Lately I've been filling the gaps in my own code by writing regexes that mimic all the Unicode script classes that Perl enjoys.
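As a rough illustration of the octal-byte form, here is a small sketch assuming a POSIX awk and a printf that understands octal escapes. The sample text around the symbol is invented; the three bytes \342\234\234 are the UTF-8 encoding of U+271C.

```shell
# Build the input with printf octal escapes so this script stays pure ASCII.
line=$(printf 'left \342\234\234 right')

# Octal escapes inside the awk string become raw bytes before matching.
hit=$(printf '%s\n' "$line" | awk '$0 ~ "\342\234\234" { print "match" }')

# A line without those bytes does not match.
miss=$(printf '%s\n' 'left + right' | awk '$0 ~ "\342\234\234" { print "match" }')

echo "with bytes: ${hit:-no match}; without: ${miss:-no match}"
```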

Related

What ends up happening when we try to use regex modifiers in awk?

See this output:
❯ awk '/indubitably/i' /usr/share/dict/words | wc -l
102401
Awk does not complain about invalid syntax or anything like that, and just spits out all lines in the file (words indeed has 102401 words inside).
Since it's very reasonable as an awk newbie to try this as a guess for case insensitivity (I am aware that IGNORECASE=1; is the right way to do it) I'm now curious how awk actually interprets /indubitably/i.
Actually, there's nothing invalid about that syntax.
It is a regex match for "indubitably" anywhere in each input line, concatenated with an uninitialized variable i which, by default, is an empty string (boolean false).
The match itself yields 0 or 1, and concatenating that number with the empty string produces the string "0" or "1". Either way, the result is a non-empty string.
Anything that evaluates to a non-zero number or a non-empty string is true at the pattern level, which then defaults to print as the action. You can put practically anything there:
writing a "1" is just the conventional notation, but you can even place NF, OFMT, FNR, SUBSEP, or RS in the pattern position (as long as the value is not an empty string or numeric zero), and it will work as if you had placed a "1" there.
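To see the parse for yourself, here is a small sketch, assuming a POSIX awk. The three-line input is made up, and the tolower() line is one portable alternative for case-insensitive matching, not something taken from the answer above.

```shell
input='Indubitably
indubitably
other'

# Count the lines selected by the pattern from the question.
all=$(printf '%s\n' "$input" | awk '/indubitably/i { n++ } END { print n + 0 }')

# Show the value the pattern evaluates to on each line: the match result
# (0 or 1) concatenated with the unset variable i.
values=$(printf '%s\n' "$input" | awk '{ v = (/indubitably/) i; printf "%s", v } END { print "" }')

# A portable case-insensitive match: lowercase the line before matching.
ci=$(printf '%s\n' "$input" | awk 'tolower($0) ~ /indubitably/ { n++ } END { print n + 0 }')

echo "selected: $all, pattern values: $values, case-insensitive: $ci"
```

Every line is selected even where the match result is 0, because the pattern value is the string "0" rather than the number 0.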

Why is the order of evaluation of expressions used for concatenation undefined in Awk?

In the GNU Awk User's Guide, I went through section 6.2.2 String Concatenation and found interesting insights:
Because string concatenation does not have an explicit operator, it is often necessary to ensure that it happens at the right time by using parentheses to enclose the items to concatenate.
Then, I was quite surprised to read the following:
Parentheses should be used around concatenation in all but the most common contexts, such as on the righthand side of ‘=’. Be careful about the kinds of expressions used in string concatenation. In particular, the order of evaluation of expressions used for concatenation is undefined in the awk language. Consider this example:
BEGIN {
a = "don't"
print (a " " (a = "panic"))
}
It is not defined whether the second assignment to a happens before or after the value of a is retrieved for producing the concatenated value. The result could be either ‘don't panic’, or ‘panic panic’.
In particular, in my GNU Awk 5.0.0 it behaves like this, retrieving the old value of a before the assignment takes effect:
$ gawk 'BEGIN {a = "dont"; print (a " " (a = "panic"))}'
dont panic
However, I wonder: why isn't the order of evaluation of expressions defined? What are the benefits of having "undefined" outputs that may vary depending on the version of Awk you are running?
This particular example is about expressions with side-effects. Traditionally, in C and awk syntax (closely inspired by C), assignments are allowed inside expressions. How those expressions are then evaluated is up to the implementation.
Leaving the order unspecified is meant to discourage people from relying on potentially confusing or ambiguous language constructs. But that assumes they are aware of the lack of specification.
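If you need a particular result, one way to avoid depending on the evaluation order is to do the assignment in its own statement. A minimal sketch, assuming a POSIX awk; the variable name old is just for illustration:

```shell
# Depends on unspecified evaluation order: may print either result.
awk 'BEGIN { a = "dont"; print (a " " (a = "panic")) }'

# Order made explicit: read the old value, then assign, then concatenate.
explicit=$(awk 'BEGIN {
    a = "dont"
    old = a            # read the current value first
    a = "panic"        # then perform the assignment
    print (old " " a)
}')
echo "$explicit"
```

The second form gives the same output on any implementation, because each side effect completes before the concatenation is evaluated.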

Perl6 regex not matching end $ character with filenames

I've been trying to learn Perl6 from Perl5, but the issue is that the regex works differently, and it isn't working properly.
I am making a test case to list all files in a directory ending in ".p6$"
This code works with the end character
if 'read.p6' ~~ /read\.p6$/ {
say "'read.p6' contains 'p6'";
}
However, if I try to fit this into a subroutine:
multi list_files_regex (Str $regex) {
    my @files = dir;
    for @files -> $file {
        if $file.path ~~ /$regex/ {
            say $file.path;
        }
    }
}
it no longer works. I don't think the issue is with the regex, but with the file name; there may be some attribute I'm not aware of.
How can I get the file name to match the regex in Perl6?
Regexes are a first-class language within Perl 6, rather than simply strings, and what you're seeing here is a result of that.
The form /$foo/ in Perl 6 regex will search for the string value in $foo, so it will be looking, literally, for the characters read\.p6$ (that is, with the dot and dollar sign).
Depending on the situation of the calling code, there are a couple of options:
If you really are receiving regexes as strings, for example read as input or from a file, then use $file.path ~~ /<$regex>/. This means it will treat what's in $regex as regex syntax.
If you will just be passing a range of different regexes in, change the parameter to be of type Regex, and then do $file.path ~~ $regex. In this case, you'd pass them like list_files_regex(/foo/).
Last but not least, dir takes a test parameter, and so you can instead write:
for dir(test => /<$regex>/) -> $file {
say $file.path;
}

What's the difference between parenthesis $() and curly bracket ${} syntax in Makefile?

Are there any differences between referencing variables with the syntax ${var} and $(var)? For instance, in the way the variable will be expanded, or anything else?
There's no difference – they mean exactly the same (in GNU Make and in POSIX make).
I think that $(round brackets) look tidier, but that's just personal preference.
(Other answers point to the relevant sections of the GNU Make documentation, and note that you shouldn't mix the syntaxes within a single expression)
The Basics of Variable References section of the GNU make documentation states no difference:
To substitute a variable's value, write a dollar sign followed by the
name of the variable in parentheses or braces: either $(foo) or
${foo} is a valid reference to the variable foo.
As already correctly pointed out, there is no difference, but be wary of mixing the two kinds of delimiters, as it can lead to cryptic errors like the one in the GNU make example by unomadh.
From the GNU make manual on the Function Call Syntax (emphasis mine):
[…] If the arguments themselves contain other function calls or variable references, it is wisest to use the same kind of delimiters for all the references; write $(subst a,b,$(x)), not $(subst a,b,${x}). This is because it is clearer, and because only one type of delimiter is matched to find the end of the reference.
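For a quick check that the two delimiter styles behave the same when used consistently, here is a sketch that writes a throwaway makefile and runs it. It assumes GNU make is installed (for $(info) and $(subst)); the variable names comma and list and the temporary file are made up for illustration.

```shell
# Hypothetical throwaway makefile; needs GNU make for $(info) and $(subst).
mk=$(mktemp)
cat > "$mk" <<'EOF'
comma := ,
list := a,b,c
$(info $(subst $(comma),-,$(list)))
${info ${subst ${comma},-,${list}}}
all: ;@:
EOF

if make --version 2>/dev/null | grep -q 'GNU Make'; then
    out=$(make -s -f "$mk")
else
    out='(GNU make not available)'
fi
rm -f "$mk"
printf '%s\n' "$out"
```

Both $(info) lines run while make parses the file, so with GNU make present they should print the same substituted list twice.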
The ${} style lets you test the make rules in the shell, if you have the corresponding environment variables set, since that is compatible with bash.
Actually, it seems to be fairly different:
, = ,
list = a,b,c
$(info $(subst $(,),-,$(list))_EOL)
$(info $(subst ${,},-,$(list))_EOL)
outputs
a-b-c_EOL
md/init-profile.md:4: *** unterminated variable reference. Stop.
But so far I have only found this difference when the variable name inside ${...} itself contains a comma. I first thought ${...} was expanding the comma as something other than part of the value, but it turns out I'm not able to hack it this way. I still don't understand this. If anyone has an explanation, I'd be happy to know!
It makes a difference if the expression contains unbalanced brackets:
${info ${subst ),(,:-)}}
$(info $(subst ),(,:-)))
->
:-(
*** insufficient number of arguments (1) to function 'subst'. Stop.
So the choice of delimiter matters for function calls, and for references to variables whose names contain brackets (a bad idea).

How does %NNN$hhn work in a format string?

I am trying out a classic format string vulnerability. I want to know how exactly the following format string works:
"%NNN$hhn" where 'N' is any number.
E.g: printf("%144$hhn",....);
How does it work and how do I use this to overwrite any address I want with arbitrary value?
Thanks and Regards,
Hrishikesh Murali
It's a POSIX extension (not found in C99) which will simply allow you to select which argument from the argument list to use for the source of the data.
With regular printf, each % format specifier grabs the current argument from the list and advances the "pointer" to the next one. That means if you want to print a single value in two different ways, you need something like:
printf ("%c %d\n", chVal, chVal);
By using positional specifiers, you can do this as:
printf ("%1$c %1$d\n", chVal);
because both format specifiers will use the first argument as their source.
Another example on the wikipedia page is:
printf ("%2$d %2$#x; %1$d %1$#x",16,17);
which will give you the output:
17 0x11; 16 0x10
It basically allows you to disconnect the order of the format specifiers from the provided values, letting you bounce around the argument list in any way you want, using the values over and over again, in any arbitrary order.
Now, whether you can use this as a user attack vector, I'm doubtful, since it only adds a means for the programmer to change the source of the data, not where the data is sent to.
It's no less secure than the regular style printf and I can see no real vulnerabilities unless you have the power to change the format string somehow. But, if you could do that, the regular printf would also be wide open to abuse.