Replacing multiple occurrence by single occurrence in string - sql

I have data like "11223311" and I want all the multiple occurrence to be replaced by single occurrence i.e. the above should turn into '123'. I am working in SAP HANA.
But by using below logic I am getting
'1231' from '11223311'.
SELECT REPLACE_REGEXPR('(.)\1+' IN '11223331' WITH '\1' OCCURRENCE ALL) FROM DUMMY;

Your regular expression only replaces multiple consecutive occurrences of characters; that's what the \1+ directly after it's matching (.) is doing.
You can use look-ahead to remove all characters that also occur somewhere after that match. Note that this keeps the last occurrence, not the first:
SELECT REPLACE_REGEXPR('(.)(?=.*\1)' IN '11223331' WITH '' OCCURRENCE ALL) FROM DUMMY
This returns: 231
If you want to keep the first occurrence, I don't see a possibility just with one regex (I could be wrong though). Using a look-behind in the same way does not work because it would need to be variable-length, which is not supported in HANA and most other implementations. Often \K is recommended as alternative, but something like (.).*\K\1 wouldn't work with replace all, because all characters before \K are still consumed in replace. If you could run the same regex in a loop, it could work but then why not use a non-regex loop (like a user-defined HANA function) in the first place.

Please try this
SELECT REPLACE_REGEXPR(concat(concat('[^','11223331'),']') IN '0123456789' WITH '' OCCURRENCE ALL)
FROM DUMMY;

Related

How to make a regular expression in postgres lazy

I have this sql expression:
regexp_matches(view_definition,'(ms_sub_[0-9]+)(.*group by)','ig')
which is trying to return the text between 'ms_sub_' and 'group by'. It returns all the text to the last occurrence of 'group by' but I only want the text to the next occurrence of group by. I've tried to make 'group by' lazy but can't figure out how to do this.
You need to get rid of the first greedy quantifier, then you may use a lazy *? quantifier with the dot:
regexp_matches(view_definition,'ms_sub_[0-9].*?group by','ig')
See the online demo
When you used [0-9]+, you set the whole expression greediness to "greedy", and the .*? you used later was treated as a greedy, .*, pattern. You might also use \d+? to get the same results. The only difference is what your capturing groups may contain, but in this case, you seem to only want whole match values.
If you care about the captured values, you may use
(ms_sub_\d+?(?!\d))(.*?group by)
where (?!\d) lookahead will make sure you \d+? matches all consecutive digits there are after ms_sub_ while still using a lazy quantifier.
If you expect at least 1 non-digit char after the ms_sub_<digits> between it and group by, you may use (ms_sub_\d+?)(\D.*?group by) where \D matches any char other than digit.

How to select values around .(dot) using sql

I am running below query in Teradata :
sel requesttext from dbc.tables
where tablename='old_employee_table'
Result:
alter table DB_NAME.employee_table,no fallback ;
I want to get below result using SQL:
DB_NAME.employee_table
Requesttext can be:
create set table DB_NAME.employee_table;
DB Name and table can occur anywhere in the result. Since .(dot) is joining them that's why i want to split with .(dot).
Basically I need sql which can result me surrounding values of .(dot)
I want DBName and Tablename in result.
I'm not a Teradata person, but this should work for both strings given so far, as long as teradata's regexp_substr() supports positive look-behind and positive look-ahead assertions (I might have the Teradata syntax wrong, so a little tweaking may be needed):
SELECT REGEXP_SUBSTR(requesttext, '(?<= )(\w+\.\w+)(?=[,$]?)', 1, 1)
FROM dbc.tables
WHERE tablename='old_employee_table'
See the regex101 example. Hopefully it translates to Teradata easily.
The regex looks for and returns the words either side of and including the period, when preceded by a space, and followed by an optional comma or the end of the line.
You could do this with either regexp_substr() or strtok().
As Jamie Zawinski said:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
So I would go with the strtok() method. Also I'm lazy and regular expressions are hard.
Function strtok() takes three arguments:
The string being split
The delimiter to split the string
The number of the token to grab.
To get at the <database>.<table> from that string that is returned in your query, we can split by a space, grab the third token, then split that by a comma and grab the first token.
That would look like:
SELECT strtok(strtok(requestText,' ',3),',',1)
FROM dbc.tables
WHERE tablename='old_employee_table'

Redshift SQL - Extract numbers from string

In Amazon Redshift tables, I have a string column from which I need to extract numbers only out. For this currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I was trying out REPLACE function, but its not gonna be elegant.
Any thoughts with converting the string into ASCII first and then doing some operation to extract only number? Or any other alternatives.
It is hard here as Redshift do not support functions and is missing lot of traditional functions.
Edit:
Trying out the below, but it only returns 051-a92 where as I need 05192 as output. I am thinking of substring etc, but I only have regexp_substr available right now. How do I get rid of any characters in between
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)
might be late but I was solving the same problem and finally came up with this
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
alternatively, you can create a Python UDF now
Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() { aka STRPOS(), POSITION() }.
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII) then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data though then this is a non-starter.
Better method is to remove all the non-numeric values:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
You can specify "any non digit" that includes non-printable, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1','[\D]')
returns
"1"

Regular expression filter

I have this regular expression in my sql query
DECLARE #RETURN_VALUE VARCHAR(MAX)
IF #value LIKE '%[0-9]%[^A-Z]%[0-9]%'
BEGIN
SET #RETURN_VALUE = NULL
END
I am not sure, but whenever I have this in my row 12 TEST then it gives me the value of 12, but if I have three digit number then it filters out the three digit numbers.How can I modify the regular expression to return me the three digits numbers too.
any help will be appreciated.
SQL doesn't have regular expressions: it has SQL wildcard expressions. They are much simpler than regular expressions and long predate regular expressions. For instance, there is no way to specify alternation (a|b) or repetition ( a*, a+, a?, a{m,n} ) such as you might find in a regular expression.
The 'like expression' that you have
LIKE '%[0-9]%[^A-Z]%[0-9]%'
will match any string containing the following pattern anywhere in the string
zero or more of any character, followed by...
a single decimal digit, followed by...
zero or more of any character, followed by...
a single character other than A–Z (whether it's case sensitive or not depends on the collating sequence in use), followed by...
zero or of any character, followed by...
a single decimal digit, followed by...
zero or more of any character
One should note that the % is likely to match perhaps more than you might like.
Have you tried ([0-9]*). I believe that this will capture every digit for you. However, I am not as strong at regex. When I ran this through rubular, it worked, though :) BTW, rubular is a great way to test out regular expressions
You can easily create a SQL CLR function and use this in your queries. Visual Studio has a project template for this and makes deploying the functions a snap.
Here is more information from Microsoft about how to create the function and how to use it (for boolean matches and for data extraction).
First of all, note that this is not really a "regular expression", it's a SQL-specific form of wildcard matching. You are very limited in what you can accomplish with SQL wildcards. As one example, you cannot "optionally" match a specific character or character set.
Your expression, as you've written it, will match any value that contains two digits with at least one non-letter character in between them, meaning it will match:
111
1^1
1?7
1AAAAAAAAAAA?AAAAAAAAA1
-----------------------5-----------------3-------
And infinitely more items of a similar structure.
Oddly, one string that would not match this pattern is "12 TEST" because there is no character between the 1 and 2. The pattern also won't "give you" the value of 12 back because it's not a parsing expression, just a matching expression: it returns 1 (true) or 0 (false).
There is clearly something else going on in your application, possibly even an actual regular expression, but it has nothing to do with the SQL you've included here.

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9] but Im not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'
If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use CLR user defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.