Why does awk not remove the BOM from the middle of a line?

I am trying to use awk to remove all byte order marks (BOMs) from a file (it contains many of them):
awk '{sub(/\xEF\xBB\xBF/,"")}{print}' f1.txt > f2.txt
This seems to remove every BOM at the beginning of a line, but BOMs in the middle of a line are left in place. I can verify that with:
grep -U $'\xEF\xBB\xBF' f2.txt
grep returns one line that has a BOM in the middle.

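sub() replaces only the leftmost match in each line, so BOMs later in the same line survive. To replace every occurrence, use gsub(), or gensub() if you are on GNU awk (note that gensub() returns the modified string instead of changing the target in place). For the command in the question, that means:
awk '{gsub(/\xEF\xBB\xBF/,"")}{print}' f1.txt > f2.txt
The relevant definitions from the manual: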
sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest
substring matched by the regular expression regexp. Modify the entire
string by replacing the matched text with replacement. The modified
string becomes the new value of target. Return the number of
substitutions made (zero or one).
gsub(regexp, replacement [, target])
Search target for all of the longest, leftmost, nonoverlapping
matching substrings it can find and replace them with replacement. The
‘g’ in gsub() stands for “global,” which means replace everywhere.
gensub(regexp, replacement, how [, target])
Search the target string target for matches of the regular expression
regexp. If how is a string beginning with ‘g’ or ‘G’ (short for
“global”), then replace all matches of regexp with replacement.
Otherwise, how is treated as a number indicating which match of regexp
to replace.
gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions.
There is plenty more helpful information, with examples, in:
The GNU Awk User's Guide: String Functions / 9.1.3 String-Manipulation Functions
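
For readers who want to see the first-match versus all-matches difference outside awk, here is a minimal sketch in Python (chosen only as a neutral stand-in, not something the answer uses). re.sub with count=1 plays the role of sub(), and re.sub without a count plays the role of gsub(). The sample line is invented for illustration.

```python
import re

# UTF-8 byte order mark, the same three bytes as \xEF\xBB\xBF in the question.
BOM = b"\xef\xbb\xbf"

# Invented sample line: one BOM at the start and one in the middle.
line = BOM + b"first" + BOM + b"second"

# count=1 replaces only the leftmost match (comparable to awk's sub()).
first_only = re.sub(re.escape(BOM), b"", line, count=1)

# No count replaces every match (comparable to awk's gsub()).
everywhere = re.sub(re.escape(BOM), b"", line)

print(first_only)   # the BOM in the middle is still there
print(everywhere)   # no BOM left
```

The first result still contains the middle BOM, which is the symptom described in the question.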

Related

Trino regexp_replace this character in the beginning but not in the middle Trino [duplicate]

I am a complete Reg-exp noob, so please bear with me. Tried to google this, but haven't found it yet.
What would be an appropriate way of writing a Regular expression matching files starting with a dot, such as .buildpath or .htaccess?
Thanks a lot!

In most regex languages, ^\. or ^[.] will match a leading dot.

The ^ matches the beginning of a string in most languages, so the following matches a leading dot. You need to add your filename expression to it.
^\.
Likewise, $ will match the end of a string.

You may need to replace the \ with the escape character of your language. Under PowerShell, the regex I use is: ^(\.)+\/
Test case:
"../NameOfFile.txt" -match '^(\.)+\/'
works, while
"_./NameOfFile.txt" -match '^(\.)+\/'
does not.
What is happening here?
The (\.) matches a literal ., and the + after it repeats that group one or more times.
Finally, the \/ requires a forward slash right after the leading dots, as in a relative path such as ./ or ../.

It depends a bit on the regular expression library you use, but you can do something like this:
^\.\w+
The ^ anchors the match to the beginning of the string, the \. matches a literal period (since an unescaped . in a regular expression typically matches any character), and \w+ matches 1 or more "word" characters (alphanumeric plus _).
See the perlre documentation for more info on Perl-style regular expressions and their syntax.
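
A quick way to try this pattern is a small Python script, assuming Python's re module is an acceptable stand-in for the Perl-style syntax described above. The file names in the list are made up for illustration.

```python
import re

# ^ anchors at the start, \. is a literal dot, \w+ is one or more word characters.
DOTFILE = re.compile(r"^\.\w+")

# Invented sample names.
names = [".buildpath", ".htaccess", "README", "a.txt", "."]

matches = [name for name in names if DOTFILE.search(name)]
print(matches)
```

Only the names that start with a dot followed by at least one word character are kept, so a bare "." is rejected.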

It depends on what characters are legal in a filename, which depends on the OS and filesystem.
For example, in Windows that would be:
^\.[^<>:"/\\\|\?\*\x00-\x1f]+$
The above expression means:
Match a string starting with the literal character .
Followed by at least one character which is not one of (whole class of invalid chars follows)
I used this as reference regarding which chars are disallowed in filenames.
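
The Windows pattern can be checked the same way. This is only a sketch: it copies the expression above into a Python raw string and runs it against invented names, and it does not cover other Windows rules such as reserved device names.

```python
import re

# The expression from the answer, written as a Python raw string.
WINDOWS_DOTFILE = re.compile(r'^\.[^<>:"/\\\|\?\*\x00-\x1f]+$')

# Invented sample names: the third and fourth contain disallowed characters.
candidates = [".htaccess", ".build path", ".bad:name", ".what?", "plain.txt"]

valid = [name for name in candidates if WINDOWS_DOTFILE.match(name)]
print(valid)
```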

To match a string starting with a dot in Java, you can write a simple expression (shown as a Java string literal, so the backslash is doubled):
^\\..*
^ anchors the match at the start of the string
\\. matches a literal "." (the regex engine sees \.)
.* allows the dot to be followed by 0 or more characters

What ends up happening when we try to use regex modifiers in awk?

See this output:
❯ awk '/indubitably/i' /usr/share/dict/words | wc -l
102401
Awk does not complain about invalid syntax or anything like that, and just spits out all lines in the file (words indeed has 102401 words inside).
Since it's very reasonable as an awk newbie to try this as a guess for case insensitivity (I am aware that IGNORECASE=1; is the right way to do it) I'm now curious how awk actually interprets /indubitably/i.

Actually, there's nothing invalid about that syntax.
/indubitably/ on its own is shorthand for $0 ~ /indubitably/, which evaluates to 1 or 0 for each input line. The i after it is an uninitialized variable, which is an empty string by default, and placing two expressions next to each other concatenates them.
So the pattern is the match result concatenated with an empty string, i.e. the string "1" or "0". Concatenation always yields a string, and a non-empty string counts as true, even "0".
Basically, anything that evaluates to a non-zero number or a non-empty string is true at the pattern level, and the action then defaults to print. That is why every line is printed.
Writing a "1" there is just conventional notation. You can even place NF, OFMT, FNR, SUBSEP, or RS at the pattern position (as long as its value is neither zero nor an empty string), and it will work as if you had placed a "1" there.

regex capture middle of url

I'm trying to figure out the base regex to capture the middle of a google url out of a sql database.
For example, a few links:
https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234
https://www.google.com/cars/?year=2014&model=jeep+cherokee+crossover&id=6789
What would be the regex to capture the text to get dodge+durango , or jeep+cherokee+crossover ? (It's alright that the + still be in there.)
My attempts:
1) \b[=.]\W\b\w{5}\b[+.]?\w{7}
This clearly does not work in general, since it is hard-coded and only fits something like the dodge durango example (it would extract "dodge+durango").
2) Using a positive lookahead,
[^+]( ?=&id )
but I am not fully sure how to use this, as it only grabs one character before the & symbol.
How can I extract a string of (potentially) any length, with any number of + delimiters, between the "model=" and "&id" boundaries?

It seems like you could use regexp_replace and access the match groups:
regexp_replace(input, 'model=(.*?)([&\\s]|$)', E'\\1')
From the PostgreSQL documentation:
The regexp_replace function provides substitution of new text for
substrings that match POSIX regular expression patterns. It has the
syntax regexp_replace(source, pattern, replacement [, flags ]). The
source string is returned unchanged if there is no match to the
pattern. If there is a match, the source string is returned with the
replacement string substituted for the matching substring. The
replacement string can contain \n, where n is 1 through 9, to indicate
that the source substring matching the n'th parenthesized
subexpression of the pattern should be inserted, and it can contain \&
to indicate that the substring matching the entire pattern should be
inserted. Write \\ if you need to put a literal backslash in the
replacement text. The flags parameter is an optional text string
containing zero or more single-letter flags that change the function's
behavior. Flag i specifies case-insensitive matching, while flag g
specifies replacement of each matching substring rather than only the
first one.
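
One caveat is worth checking before relying on a substitution: it replaces only the matched part and keeps the rest of the input. The sketch below uses Python's re.sub as a stand-in for regexp_replace (an assumption, since the database's regex engine is not identical to Python's), with the first URL from the question. The anchored variant at the end is one hypothetical way to return just the model value.

```python
import re

url = "https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234"

# Same pattern and replacement as the answer: only the matched span is replaced,
# so the text before "model=" and after "&" stays in the result.
replaced = re.sub(r"model=(.*?)([&\s]|$)", r"\1", url)
print(replaced)

# Hypothetical variant: match the whole string so only the captured group is left.
model_only = re.sub(r"^.*?model=([^&\s]*).*$", r"\1", url)
print(model_only)
```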

I may be misunderstanding, but if you want to get the model, just select everything between model= and the next ampersand (&).
regexp_matches(input, 'model=([^&]*)')
model= matches literally
([^&]*) is the capturing group
[^&] is any character that isn't an ampersand
* repeats it zero or more times
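
The capture-group approach can be tried outside the database with a short Python sketch. Python's re module is used here only for illustration, on the assumption that it treats this simple pattern the same way. The two URLs are the ones from the question.

```python
import re

urls = [
    "https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234",
    "https://www.google.com/cars/?year=2014&model=jeep+cherokee+crossover&id=6789",
]

# "model=" literally, then capture everything up to (not including) the next "&".
MODEL = re.compile(r"model=([^&]*)")

models = [m.group(1) for m in map(MODEL.search, urls) if m]
print(models)
```

Because [^&]* stops at the first ampersand, the pattern works for any length of model name and any number of + signs.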

Regular expression to match specific variations of function

I am trying to construct a regular expression to find the text of the following variations.
NSLocalizedString(@"TEXT")
NSLocalizedStringFromTable(@"TEXT")
NSLocalizedStringWithDefaultValue(@"TEXT")
...
The goal is to extract TEXT. I have been able to construct a regex for each individual function or macro, e.g., (?<=NSLocalizedString)\(@"(.*?)". However, I am looking for a solution that does the job no matter what the name of the function is, as long as it starts with NSLocalizedString.
I assumed it was as simple as (?<=NSLocalizedString\w+)\(@"(.*?)", but that doesn't seem to do the trick.

How about this one?
/NSLocalizedString\w*\(@"(.*)"\)/
Explanation:
NSLocalizedString    'NSLocalizedString'
\w*                  word characters (a-z, A-Z, 0-9, _) (0 or more times, matching as many as possible)
\(                   '('
@"                   '@"'
(                    group and capture to \1:
.*                   any character except \n (0 or more times, matching as many as possible)
)                    end of \1
"                    '"'
\)                   ')'

The main reason your regex doesn't work is that the regex engine doesn't support variable-length lookbehinds. (?<=NSLocalizedString\w+) is variable-length, so it can't be used.
In addition, it needs to be \w*, not \w+, so that your first example string can match.
If you move the \w* outside the lookbehind, as in (?<=NSLocalizedString)\w*, it will work just fine.
Alternatively, since you have to use a capturing group to grab the text value anyway, there's no need for the lookbehind at all. Change the (?<= to (?: and it becomes a non-capturing group (which can be variable-length), and then just grab your text value from group 1.
Your attempt was:
(?<=NSLocalizedString\w+)\(@"(.*?)"
Either of these minor changes should make it work:
(?<=NSLocalizedString)\w*\(@"(.*?)"
(?:NSLocalizedString\w*)\(@"(.*?)"
A solution that extracts exactly TEXT without using any groups is the following, but it is actually not supported in Objective-C:
NSLocalizedString\w*\(@"\K[^"]*
It avoids the need for a lookbehind (which can't be used for the reasons explained above) by using \K, which drops everything matched before it from the match.
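
The lookbehind limitation is easy to reproduce in Python, whose re module also rejects variable-width lookbehinds. This is a sketch for illustration only: the original question targets Objective-C, the sample strings are invented, and it assumes the string literals are written as @"TEXT".

```python
import re

# Invented sample input covering the three variations from the question.
source = '''
NSLocalizedString(@"Hello")
NSLocalizedStringFromTable(@"World")
NSLocalizedStringWithDefaultValue(@"Again")
'''

# The original attempt: Python's re refuses a variable-width lookbehind.
try:
    re.compile(r'(?<=NSLocalizedString\w+)\(@"(.*?)"')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Without the lookbehind, the capturing group alone does the job.
pattern = re.compile(r'NSLocalizedString\w*\(@"(.*?)"')
texts = pattern.findall(source)

print(lookbehind_ok)
print(texts)
```

Using \w* rather than \w+ is what lets the plain NSLocalizedString(...) call match as well.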

How to pass a regular expression to a function in AWK

I do not know how to pass a regular expression as an argument to a function.
If I pass a string, it works.
I have the following awk file:
#!/usr/bin/awk -f
function find(name){
    for(i=0;i<NF;i++)
        if($(i+1)~name)
            print $(i+1)
}
{
    find("mysql")
}
I do something like
$ ./fct.awk <(echo "$str")
This works OK.
But when I make this call in the awk file instead:
{
find(/mysql/)
}
This does not work.
What am I doing wrong?
Thanks,
Eric J.

You cannot (or at least should not) pass a regex constant to a user-defined function. You have to use a dynamic regex in this case, i.e. a string, like find("mysql").
If you write find(/mysql/), awk actually evaluates find($0~/mysql/), so it passes 0 or 1 to your find(..) function.
See this question for details:
awk variable assignment statement explanation needed
Also see
http://www.gnu.org/software/gawk/manual/gawk.html#Using-Constant-Regexps
section 6.1.2, Using Regular Expression Constants

warning: regexp constant for parameter #1 yields boolean value
The regex gets evaluated (matching against $0) before it's passed to the function. You have to use strings.
Note: make sure you do proper escaping: http://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps

If you use GNU awk, you can pass a regular expression as a user-defined function parameter.
You have to write your regex as @/.../.
In your example, you would use it like this:
function find(regex){
    for(i=1;i<=NF;i++)
        if($i ~ regex)
            print $i
}
{
    find(@/mysql/)
}
This is called a strongly typed regexp constant, and it has been available since GNU awk version 4.2 (Oct 2017).
Example here.

Use quotation marks and treat the regex as a string. That way it works in mawk, mawk2, and GNU gawk. You will also need to double the backslashes, since turning the regex into a string eats one level of them right off the bat.
In your example, just find("mysql") will suffice.
You can actually pass any regex you wish this way, without being confined to gawk, as long as you're willing to write it as a string rather than with the @/../ syntax others have mentioned. This is where the number of backslashes makes a difference.
You can even build a regex out of arbitrary bytes, preferably via octal codes. If you use "\342\234\234" as a regex, awk converts the escapes into the actual bytes before matching.
While there's nothing wrong with that approach, if you want to be 100% safe and prefer not to have arbitrary bytes flying around, write it as
"[\\342][\\234][\\234]" ----> ✜
Once awk has read the string and built its internal representation, the regex looks like this:
[\342][\234][\234]
which will still match the same thing (in this case, a cross-shaped dingbat). However, gawk in Unicode-aware mode will print annoying warnings about non-ASCII bytes enclosed directly in square brackets. For that use case,
"\\342\\234\\234" ------(eqv to )---> /\342\234\234/
will keep gawk happy and quiet. Lately I've been filling the gaps in my own code by writing regexes that mimic all the Unicode script classes that Perl enjoys.
