How to make a regular expression in postgres lazy - sql

I have this sql expression:
regexp_matches(view_definition,'(ms_sub_[0-9]+)(.*group by)','ig')
which is trying to return the text between 'ms_sub_' and 'group by'. It returns all the text to the last occurrence of 'group by' but I only want the text to the next occurrence of group by. I've tried to make 'group by' lazy but can't figure out how to do this.

You need to get rid of the first greedy quantifier, then you may use a lazy *? quantifier with the dot:
regexp_matches(view_definition,'ms_sub_[0-9].*?group by','ig')
See the online demo
When you used [0-9]+, you set the whole expression greediness to "greedy", and the .*? you used later was treated as a greedy, .*, pattern. You might also use \d+? to get the same results. The only difference is what your capturing groups may contain, but in this case, you seem to only want whole match values.
If you care about the captured values, you may use
(ms_sub_\d+?(?!\d))(.*?group by)
where (?!\d) lookahead will make sure you \d+? matches all consecutive digits there are after ms_sub_ while still using a lazy quantifier.
If you expect at least 1 non-digit char after the ms_sub_<digits> between it and group by, you may use (ms_sub_\d+?)(\D.*?group by) where \D matches any char other than digit.

Related

Replacing multiple occurrence by single occurrence in string

I have data like "11223311" and I want all the multiple occurrence to be replaced by single occurrence i.e. the above should turn into '123'. I am working in SAP HANA.
But by using below logic I am getting
'1231' from '11223311'.
SELECT REPLACE_REGEXPR('(.)\1+' IN '11223331' WITH '\1' OCCURRENCE ALL) FROM DUMMY;
Your regular expression only replaces multiple consecutive occurrences of characters; that's what the \1+ directly after it's matching (.) is doing.
You can use look-ahead to remove all characters that also occur somewhere after that match. Note that this keeps the last occurrence, not the first:
SELECT REPLACE_REGEXPR('(.)(?=.*\1)' IN '11223331' WITH '' OCCURRENCE ALL) FROM DUMMY
This returns: 231
If you want to keep the first occurrence, I don't see a possibility just with one regex (I could be wrong though). Using a look-behind in the same way does not work because it would need to be variable-length, which is not supported in HANA and most other implementations. Often \K is recommended as alternative, but something like (.).*\K\1 wouldn't work with replace all, because all characters before \K are still consumed in replace. If you could run the same regex in a loop, it could work but then why not use a non-regex loop (like a user-defined HANA function) in the first place.
Please try this
SELECT REPLACE_REGEXPR(concat(concat('[^','11223331'),']') IN '0123456789' WITH '' OCCURRENCE ALL)
FROM DUMMY;

How to determine whether a varchar field DOES NOT contain characters in set

I need to determine if all rows in varchar column in a db contain any characters outside of the particular set below:
abcdefghijklmonpqrstuvwxyzABCDEFGHIJKLMONPQRSTUVWXYZ.-#,1234567890/\&%();:+#_*?|=''
I tried this but am not sure if it is correct:
select AccName
from Transactions
where AccName not like '%[!abcdefghijklmonpqrstuvwxyzABCDEFGHIJKLMONPQRSTUVWXYZ.-#,1234567890/\&%();:+#_*?|='']%'
Should this work?
Any help appeciated.
You cannot use a regular expression inside an ordinary LIKE condition in a query. If you want to use regular expressions, you will have to use a special operator. In MySQL, you could try the following:
SELECT AccName
FROM Transactions
WHERE AccName REGEXP [!abcdefghijklmonpqrstuvwxyzABCDEFGHIJKLMONPQRSTUVWXYZ.-#,1234567890/\&%();:+#_*?|='']%';
If this doesn't run to boot, then you may have to tidy up the regular expression you gave. And as marc_s asked, the exact regular expression and query will depend on the DB system you are using.
Database management systems vary in their support for matching regular expressions. Examples below use PostgreSQL, which supports POSIX regular expressions, along with other flavors. Examples below also test for case-sensitive matches to avoid sentences like "'Mike' doesn't not match the regular expression".
AFAIK, no DBMS lets you mix the like operator with a regular expression.
A like expression in the form column_name like '%a%' will match 'a' if it appears anywhere in the column. But you need your regular expression to match on the whole value of the column. Anchor the regular expression at the start and end of each value (^ and $), and tell the dbms to match one or more instances (+) of the atom.
select 'Mike' ~ '^[a-zA-Z0-9]+$'; -- 'Mike' matches the regex
Write a failing test.
select 'Mike?' ~ '^[a-zA-Z0-9]+$'; -- 'Mike?' doesn't match the regex
Add the question mark to the regex, and verify the test succeeds.
select 'Mike?' ~ '^[a-zA-Z0-9?]+$'; -- 'Mike?' matches the regex
Repeat failing test and succeeding test for each character. When you've caught all the characters you want, invert the logic using the !~ operator in place of the ~ operator.
When your data is clean move this into a CHECK constraint.
PostgreSQL pattern matching

SQL IN clause RegEx

I am trying to write regural expression to validate the SQL IN clause where values inside bracket are numbers (ids) e.g (23,109,1) but NOT (23,109,) or (23,,) or ().
My current expression is:
^\([0-9,]+\)$
but it allows also the wrong values.
I am not really good at regural expressions, also tried something like:
^\(([0-9]+,)+\)$
but I guess it's not the point.
Any ideas?
Your second try is almost there; the problem is, ^\(([0-9]+,)+\)$ would require trailing comma. Let's try ^\([0-9]+(,[0-9]+)*\)$.
No idea on your regex library/dialect; maybe there's much to be improved (\d for digits; allowing spaces between elements; etc).

Regular expression filter

I have this regular expression in my sql query
DECLARE #RETURN_VALUE VARCHAR(MAX)
IF #value LIKE '%[0-9]%[^A-Z]%[0-9]%'
BEGIN
SET #RETURN_VALUE = NULL
END
I am not sure, but whenever I have this in my row 12 TEST then it gives me the value of 12, but if I have three digit number then it filters out the three digit numbers.How can I modify the regular expression to return me the three digits numbers too.
any help will be appreciated.
SQL doesn't have regular expressions: it has SQL wildcard expressions. They are much simpler than regular expressions and long predate regular expressions. For instance, there is no way to specify alternation (a|b) or repetition ( a*, a+, a?, a{m,n} ) such as you might find in a regular expression.
The 'like expression' that you have
LIKE '%[0-9]%[^A-Z]%[0-9]%'
will match any string containing the following pattern anywhere in the string
zero or more of any character, followed by...
a single decimal digit, followed by...
zero or more of any character, followed by...
a single character other than A–Z (whether it's case sensitive or not depends on the collating sequence in use), followed by...
zero or of any character, followed by...
a single decimal digit, followed by...
zero or more of any character
One should note that the % is likely to match perhaps more than you might like.
Have you tried ([0-9]*). I believe that this will capture every digit for you. However, I am not as strong at regex. When I ran this through rubular, it worked, though :) BTW, rubular is a great way to test out regular expressions
You can easily create a SQL CLR function and use this in your queries. Visual Studio has a project template for this and makes deploying the functions a snap.
Here is more information from Microsoft about how to create the function and how to use it (for boolean matches and for data extraction).
First of all, note that this is not really a "regular expression", it's a SQL-specific form of wildcard matching. You are very limited in what you can accomplish with SQL wildcards. As one example, you cannot "optionally" match a specific character or character set.
Your expression, as you've written it, will match any value that contains two digits with at least one non-letter character in between them, meaning it will match:
111
1^1
1?7
1AAAAAAAAAAA?AAAAAAAAA1
-----------------------5-----------------3-------
And infinitely more items of a similar structure.
Oddly, one string that would not match this pattern is "12 TEST" because there is no character between the 1 and 2. The pattern also won't "give you" the value of 12 back because it's not a parsing expression, just a matching expression: it returns 1 (true) or 0 (false).
There is clearly something else going on in your application, possibly even an actual regular expression, but it has nothing to do with the SQL you've included here.

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9] but Im not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'
If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use CLR user defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.