Regex Help for wildcard lookup - sql

I am having issues with this wildcard lookup, not sure why this doesn't work:
I am looking up example Sales Agent 42. As you can imagine being sales, they do not really care about garbage in = garbage out. So their agent codes are usually a mess to sort through.
Valid Examples for Agent 42:
42
30-42-22-holiday
42easter
42-coupon
42coupon-423355
29-42sale-52
Non-Valid Examples that explicitly need to not show up
A4290042
4297901
42cmowc209d
o203f9j42po0
Here is the most successful model I came up with:
SELECT company_id, agent
FROM cust_data
WHERE (agent = ('42') OR agent LIKE ('42%-%') OR agent LIKE ('%-%42') OR agent LIKE ('%-%42%-%') OR agent LIKE ('42[a-z]%-%') OR agent LIKE ('%-%42[a-z]%') OR agent LIKE ('%-%42[a-z]%-%') OR agent LIKE ('42[a-z]%'))
I get most of the valid ones to return and none of the non-valid ones, but I still can't seem to grab the examples like 42easter or 29-42sale-52 even though I am telling it to grab that style...
Any suggestions?

If you need to match 42 that is not surrounded with digits, you can use alternations with anchors (^ standing for the start of string and $ standing for the end of string) and negated character classes:
WHERE agent ~ '(^|[^0-9])42($|[^0-9])'
See the regex demo
Explanation:
(^|[^0-9]) - either the start of the string ^ or a non-digit [^0-9]
42 - literal 42
($|[^0-9]) - end of string $ or a non-digit [^0-9]

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 0r +001 or +0001 and having only digits without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using double-slash: \\+ and \\d.
The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
Your number can start with either + or 00: ^(\+|00)
It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3 then by anything but 7: 3[0-689]
b. It is followed first by anything but 3 then by any number: [0-24-9]\d
After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.

Targeting a string for deletion with grep, sed, awk (or cut)

I am trying to parse some logs to gain the user agent and account id per line. I have already managed to pull the user agent and a string which contains the account id all on the same line.
The next step is to extract the account id from its longer string. I thought this would be fairly simple as I will know the start of the string and there are / slashes for the delimiter but the user agent also contains slashes and have varied number of fields.
The log file currently looks something like the following example but there are hundreds to thousands of lines to parse. Luckily I am working off a partition with plenty of space to spare.
USER_AGENT_PART ACCOUNT_ID_Part_/plus/path/to/stuff/they/access
some user agent/1.3 KnownString1_32d4-56e-009f98/some/stuff/here
user/agent KnownString1_12d3-345e-4c534/more/stuff/here
User/Agent cURL/1.5.0 KnownString2_12d34e56/stuff/things/stuff/stuff
one/User Agent/2.0 KnownString1_12d3_456e_7g8/more/random/stuff/stuff
So the goal is to keep the user agent part and the account id part and drop the path of the stuff they are accessing in the last string. But I can't use / or spaces as general delimiters because many user agents have / and various amounts of spaces in their name.
Also, the different types of user agents is way more than this little sample I have here. There are anywhere from 25 - 50 distinct types depending on the log. So it doesn't seem worth it to target the user agent and try to exclude it.
It seems the logical way to start is by targeting the part of the account ID which is a known string (KnownString1 or KnownString2) and grab everything from there (which is unknown numbers and letters with dashes) up until the first / of that account string.
Then I would delete the first / (In the account ID string) and everything after. I expect I will need to do this in two passes to utilize the two known parts of the user IDs.
This seemed like it would be easy but I just can't wrap my head around how to start targeting that last string. I don't even have a good example of something that is close to working because I don't know how to target the last string by delimiters without catching the same delimiters in the user agent part.
Any ideas?
Edit: Every line will have an account id that starts with one of two common KnownString_ in it but then is followed by a series of unknown digits and dashes until it gets to the first /. So I don't need to search for lines containing that before targeting the string.
Edit2: My original examples of the Account ID did not reflect there were letters mixed in with the numbers.
Edit3: Thanks to the responses from oguz ismail and kesubagu I was able to solve this using egrep. Looks like I was trying to make things more complicated than they were. I also realized I need to revisit grep as its capable of doing far more than what I tend to use it for.
This is what I ended up using which worked in one pass:
egrep -o ".+(KnownString1|KnownString2)_[^/]+" logfile > logfile2
Using grep:
$ grep -o '.*KnownString[^/]*' file
some user agent/1.3 KnownString1_32d4-56e-009f98
user/agent KnownString1_12d3-345e-4c534
User/Agent cURL/1.5.0 KnownString2_12d34e56
one/User Agent/2.0 KnownString1_12d3_456e_7g8
.* matches everything before KnownString, and [^/]* matches everything after KnownString until the first /.
You can use egrep with the -o option which will only output the part of that matches the provided regex, so you could do something like this
cat test | egrep -o ".+(KnownString1|KnownString2)_[_0-9-]+"
where the test file contains the input you've given, the output in this case was
some user agent/1.3 KnownString1_324-56-00998
user/agent KnownString1_123-345-4534
User/Agent cURL/1.5.0 KnownString2_123456
one/User Agent/2.0 KnownString1_123_456_78

regex not working correctly when the test is fine

For my database, I have a list of company numbers where some of them start with two letters. I have created a regex which should eliminate these from a query and according to my tests, it should. But when executed, the result still contains the numbers with letters.
Here is my regex, which I've tested on https://www.regexpal.com
([^A-Z+|a-z+].*)
I've tested it against numerous variations such as SC08093, ZC000191 and NI232312 which shouldn't match and don't in the tests, which is fine.
My sql query looks like;
SELECT companyNumber FROM company_data
WHERE companyNumber ~ '([^A-Z+|a-z+].*)' order by companyNumber desc
To summerise, strings like SC08093 should not match as they start with letters.
I've read through the documentation for postgres but I couldn't seem to find anything regarding this. I'm not sure what I'm missing here. Thanks.
The ~ '([^A-Z+|a-z+].*)' does not work because this is a [^A-Z+|a-z+].* regex matching operation that returns true even upon a partial match (regex matching operation does not require full string match, and thus the pattern can match anywhere in the string). [^A-Z+|a-z+].* matches a letter from A to Z, +,|or a letter fromatoz`, and then any amount of any zero or more chars, anywhere inside a string.
You may use
WHERE companyNumber NOT SIMILAR TO '[A-Za-z]{2}%'
See the online demo
Here, NOT SIMILAR TO returns the inverse result of the SIMILAR TO operation. This SIMILAR TO operator accepts patterns that are almost regex patterns, but are also like regular wildcard patterns. NOT SIMILAR TO '[A-Za-z]{2}%' means all records that start with two ASCII letters ([A-Za-z]{2}) and having anything after (%) are NOT returned and all others will be returned. Note that SIMILAR TO requires a full string match, same as LIKE.
Your pattern: [^A-Z+|a-z+].* means "a string where at least some characters are not A-Z" - to extend that to the whole string you would need to use an anchored regex as shown by S-Man (the group defined with (..) isn't really necessary btw)
I would probably use a regex that specifies want the valid pattern is and then use !~ instead.
where company !~ '^[0-9].*$'
^[0-9].*$ means "only consists of numbers" and the !~ means "does not match"
or
where not (company ~ '^[0-9].*$')
Not start with a letter could be done with
WHERE company ~ '^[^A-Za-z].*'
demo: db<>fiddle
The first ^ marks the beginning. The [^A-Za-z] says "no letter" (including small and capital letters).
Edit: Changed [A-z] into the more precise [A-Za-z] (Why is this regex allowing a caret?)

Regular expression to extract a number of steps

I have a localized string that looks something like this in English:
"
5 Mile(s)
5,252 Step(s)
"
My app is localized both in left-to-right and right-to-left languages so I don't want to make assumptions either about the ordering of the step(s) or about the formatting of the number (e.g. 5,252 can be 5.252 depending on user locale). So I need to account for possibilities that can include things like
Step(s) 5.252
as well as what's above.
A few other caveats
All I know is that if the Step(s) line is in there, it will be on its own line (hence in my regex I require \n at each end of the string)
No guarantee that the Mile(s) information will be in the string at all, let alone whether it will be before or after Step(s)
Here's my attempt at pattern extraction:
NSString *patternString = [NSString stringWithFormat:#"\\n(([0-9,\\.]*)\s*%#|%#\s*([0-9,\\.]*))\\n",
NSLocalizedString(#"Step(s)",nil), NSLocalizedString(#"Step(s)",nil)];
There appear to be two problems with this:
XCode is indicating Unknown escape sequence '\s' for the second \s in the pattern string above
No matches are being found even for strings like the following:
0.2 Mile(s)
1,482 Step(s)
Ideally I would extract the 1,482 out of this string in a way that is localization friendly. How should I modify my regex?
as far as the regex, perhaps this approach might work - it simply matches (with named groups) each couplet of numbers in sequence, with the assumption the first is miles and the second is steps. Decimals in the . or , form are optional:
(?<miles>\d+(?:[.,]\d+)?).*?(?<steps>\d+(?:[.,]\d+)?)
(and i think it should be \\s) - i'm not an ios guy, but if you can use a regex literal it would be way more readable.
regular expression demo
First I'd like to ask - Why is Mile(s) mentioned in the question at all?
And now to my two bits - you could simply use a positive look-ahead:
^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)
It makes sure the expected word is present on the line, and then captures the number on it, allowing for localized, optional, decimal separator and decimals. This way it doesn't matter if the numer is before, or after, the "word".
It doesn't take localization of the "word" into account, but that you seem to have handled by yourself ;)
See it here at regex101.
Your regex is close, although in Obj-C you need to double-escape the \s and (s):
^(([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$
In your NSLocalizedString you likely also need to escape the parentheses enclosing (s):
NSString *patternString = [NSString stringWithFormat:#"^(([\\d,.]+)\\s%#|%#\\s([\\d,.]+))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
If you don't escape (s) then the regex engine is probably going to interpret it as a capture group.
Looking at NSLog you can see what the pattern actually reads like:
NSLog(#"patternString: %#", patternString);
Output:
patternString: ^(([\d,.]+)\sStep\(s\)|Step\(s\)\s([\d,.]+))$
Since you mentioned the Mile(s) part may not be in the string at all I'm assuming it isn't relevant to the regular expression. As I understand from the question, you just need to capture the number of steps and nothing else. On this basis, here's a modified version of your existing regex:
NSString *patternString =
[NSString stringWithFormat:#"^(?:([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
Demo:
https://www.regex101.com/r/Q6ff1b/1
This is based on the following tips/modifications:
Use the m (= UREGEX_MULTILINE) flag option when creating the regex to specify that ^ and $ match the start and end of each line. This is more sophisticated than using \n as it will also handle the start and end of the string where this might not be present. See here.
Always use a double backslash (\\) for regex escaping - otherwise NSString will interpret the single backslash to be escaping the next character and convert it before it gets to the regex.
Literal parentheses need to be escaped - e.g. Step\\(s\\) instead of Step(s).
Characters within a character class (i.e. anything within the [] square brackets) don't need to be escaped - so it would be . rather than \\. - the latter.
If you are using (x|y|...) as a choice and don't need it to be a capturing group, use ?: after the first parenthesis to ensure it doesn't get captured - i.e. (?:x|y|...).

regexp_matches() returns two matches for $ (end of string)

Can somebody explain this odd behavior of regexp_matches() in PostgreSQL 9.2.4 (same result in 9.1.9):
db=# SELECT regexp_matches('test string', '$') AS end_of_string;
end_of_string
---------------
{""}
(1 row)
db=# SELECT regexp_matches('test string', '$', 'g') AS end_of_string;
end_of_string
---------------
{""}
{""}
(2 rows)
-> SQLfiddle demo.
The second parameter is a regular expression. $ marks the end of the string.
The third parameter is for flags. g is for "globally", meaning the the function doesn't stop at the first match.
The function seems to report the end of the string twice with the g flag, but that can only exist once per definition. It breaks my query. :(
Am I missing something?
I would need my query to return one more row at the end, for any possible string. I expected this query to do the job, but it adds two rows:
SELECT (regexp_matches('test & foo/bar', '(&|/|$)', 'ig'))[1] AS delim
I know how to manually add a row, but I want to let the function take care of it.
It looks like it was a bug in PostgreSQL. I verified for sure it is fixed in 9.3.8. Looking at the release notes, I see possible references in:
9.3.4
Allow regular-expression operators to be terminated early by query
cancel requests (Tom Lane)
This prevents scenarios wherein a pathological regular expression
could lock up a server process uninterruptably for a long time.
9.3.6
Fix incorrect search for shortest-first regular expression matches
(Tom Lane)
Matching would often fail when the number of allowed iterations is
limited by a ? quantifier or a bound expression.
Thanks to Erwin for narrowing it down to 9.3.x.
I am not sure about what I am going to say because I don't use PostgreSQL so this is just me thinking out loud.
Since you are trying to match the end of string/line $, then in the first situation the outcome is expected, but when you turn on global match modifier g and because matching the end of line character doesn't actually consume or read any characters from the input string then the next match attempt will start where the first one left off, that is at the end of string and this will cause an infinite loop if it kept going like that so PostgreSQL engine might be able to detect this and stop it to prevent a crash or an infinite loop.
I tested the same expression in RegexBuddy with POSIX ERE flavor and it caused the program to become unresponsive and crash and this is the reason for my reasoning.
the same occurs for example in C# in which I had the same problem recently so I think this is a normal behaviour for regexps
this is because $ doesn't stand for a specific sign but a specific position instead
so $ doesn't really match anything and the position of parser stays in the same position
you need to change your convention a little;
to test for an empty string you can use ^$
This was a bug that has been fixed in Postgres 9.3. See accepted answer.
For Postgres 9.2 or older: A halfway decent workaround for my situation would be to use the expression .$ instead - matches for any string once at the last character:
WITH x(id, t) AS (
VALUES
(1, 'test & foo/bar')
,(2, 'test')
,(3, '') -- empty string
,(4, 'test & foo/') -- other branch as last character
)
SELECT id, (regexp_matches(t, '(&|/|.$)', 'ig'))[1] AS delim
FROM x;
But it fails for empty strings.
And it fails if the last character happens to match another branch. Like: 'foo/bar/'.
And it isn't perfect to have the actual final character returned. An empty string would be much preferable.
-> SQLfiddle.