Substring before first uppercase word excluding first word - sql

The string contains words separated by spaces.
How can I get the substring from the start up to the first uppercase word (uppercase word excluded)? If the string starts with an uppercase word, that word should be included. The search should start from the second word. The first word should always appear in the result.
For example
select substringtiluppercase('Aaa b cC Dfff dfgdf')
should return
Aaa b cC
Can a regexp substring be used, or is there another idea?
Using PostgreSQL 13.2
Uppercase letters are the Latin letters A .. Z and additionally Õ, Ä, Ö, Ü, Š, Ž

Replace everything from the first uppercase letter that starts a word (other than the first word) onwards with an empty string:
regexp_replace('aaa b cc Dfff dfgdf', '(?<!^)\m[A-ZÕÄÖÜŠŽ].*', '')
In the Postgres flavour of regex, \m matches only at the beginning of a word.
(?<!^) is a negative lookbehind asserting that the match is not preceded by the start of input.
FYI, the other Postgres word boundaries are \M at the end of a word, \y at either end (same as the usual \b) and \Y not at a word boundary (same as the usual \B).
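For readers who want to try the idea outside the database, here is a minimal Python sketch of the same approach. It is an approximation, not the Postgres query: Python's re module has no \m, so \b followed by an uppercase letter stands in for it, and the function name substring_til_uppercase is a hypothetical helper borrowed from the question. The final rstrip() is an added step that removes the space left in front of the cut.

```python
import re

# Uppercase letters as listed in the question: A-Z plus Õ, Ä, Ö, Ü, Š, Ž.
UPPER = "A-ZÕÄÖÜŠŽ"

# Python's re has no \m, so \b followed by an uppercase letter is used here
# to mean "an uppercase letter at the start of a word".
# (?<!^) keeps the very first word out of the search.
PATTERN = re.compile(rf"(?<!^)\b[{UPPER}].*")


def substring_til_uppercase(text: str) -> str:
    """Hypothetical helper mirroring the question's substringtiluppercase()."""
    # Drop everything from the first uppercase word onwards, then trim
    # the space that was left in front of it.
    return PATTERN.sub("", text).rstrip()


print(substring_til_uppercase("Aaa b cC Dfff dfgdf"))  # Aaa b cC
print(substring_til_uppercase("aaa b cc Dfff dfgdf"))  # aaa b cc
```

A string with no uppercase word after the first one is returned unchanged.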

substring supports regular expressions in Postgres:
SELECT substring('aaa b cc Dfff dfgdf' from '^[^A-ZÕÄÖÜŠŽ]*')
returns
aaa b cc
To cut at an uppercase letter only when it follows whitespace (the greedy .* picks the last such position):
SELECT reverse(substr(reverse(substring('aaa b ccD Dfff dfgdf' from '.*\s[A-ZÕÄÖÜŠŽ]')),2))
returns
aaa b ccD

Related

how to use positioning/range in regexp

I have a product code where the references always follow this pattern: XX00XX000XX. Characters 1 and 2 are always a combination of 2 letters, 3 to 4 a combination of 2 numbers, 5 to 6 letters, 7 to 9 numbers and 10 to 11 letters again (they're always varying, so it'll never be the same).
I want to do a regexp_contains (or another variant) that matches by position, like: positions 1 - 2 must be [[:alpha:]], 3 - 4 [[:digit:]], and so on.
(I need this to find product codes that match the reference pattern inside sell links, but I can't find any clear explanation on how to use positioning on regex statements...)
You can use character classes for this.
[a-zA-Z][a-zA-Z]\d\d[a-zA-Z][a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]
This regex uses the classes [a-zA-Z] and \d, which match a letter and a digit respectively. It explicitly checks that the first character is a letter, the second character is a letter, the third character is a digit, etc.
A character class matches exactly 1 character from the set specified, so [a-zA-Z] matches any letter, [13579] matches any odd digit, etc.
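As a rough illustration of finding such a code inside a link, here is a small Python sketch. The link is made up, and the {n} repetition counts are just shorthand for writing the same class n times; the engine used in the question may differ in minor syntax.

```python
import re

# Two letters, two digits, two letters, three digits, two letters (XX00XX000XX).
# {n} is shorthand for repeating the preceding class n times.
CODE = re.compile(r"[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}[0-9]{3}[A-Za-z]{2}")

# Made-up link, for illustration only.
link = "https://shop.example/item/AB12CD345EF?ref=home"

match = CODE.search(link)
found = match.group(0) if match else None
print(found)  # AB12CD345EF
```

Because the pattern is not anchored, it finds the code wherever it sits in the link. Add ^ and $ to require that the whole string is a product code.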

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains a string with numeric substrings. See the shaded column of my dataframe.
Rows with values like 0E as a prefix, or 21 (any two-digit number) as a suffix, or 24A (any two-digit number followed by a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but one place needs to be modified, if I understand it correctly.
For value ends with two digits or two digits and a capital character, the regex should be:
.*\d{2}[A-Z]?$
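To see which values the pattern catches without needing a dataframe, here is a plain-Python sketch using the standard re module. The sample values are invented. re.search is used because it looks for the pattern anywhere in the value, which is also how str.contains() treats a regex.

```python
import re

# Same alternation as in the first answer.
pattern = re.compile(r"^0E|\d{2}$|\d{2}[A-Z]$")

# Shorter form: the two suffix cases merged with an optional letter.
short = re.compile(r"^0E|\d{2}[A-Z]?$")

# Made-up sample values, for illustration only.
values = ["0E-Alpha", "Beta21", "Gamma24A", "Delta", "Eps1lon"]

kept = [v for v in values if not pattern.search(v)]
kept_short = [v for v in values if not short.search(v)]

print(kept)        # ['Delta', 'Eps1lon']
print(kept_short)  # ['Delta', 'Eps1lon']
```

Both patterns keep the same rows for these values. The leading .* in the suggested variant is not needed when the pattern is searched for rather than matched from the start.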

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM
Try this regex:
((0000) +)+(0000)
And if it's OK to match any digits, and more or fewer than 4 of them:
(\d+ +)+\d+
Good Luck!
You may use
\b(\d+)(?:\s+\1)+\b
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
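The difference between the two patterns can be reproduced with Python's re module. This is a sketch for illustration only: HANA's REPLACE_REGEXPR is not involved, and the fixed string "7" stands in for the random number.

```python
import re

text = "0000 0000 0000"
replacement = "7"  # stand-in for the random number

# The pattern from the question pairs tokens two at a time, so with three
# tokens the last one is left over.
pairwise = re.sub(r"(\w+)\s+\1", replacement, text)
print(pairwise)  # 7 0000

# Repeating the "whitespace + same token" part consumes the whole run.
whole_run = re.sub(r"\b(\d+)(?:\s+\1)+\b", replacement, text)
print(whole_run)  # 7
```

The first result matches the "RANDOM 0000" output reported in the question. The second collapses the entire run into one replacement.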

Replace space but except last occurrence in Postgres

I need to remove the spaces from a string, but not the last occurrence. I tried some regexes but did not find a solution.
I have a string like 'ABCD;140 0 0 EUR;350 0 0 0 EUR' and I need to turn it into 'ABCD;14000 EUR;350000 EUR'
I tried following ways
select regexp_replace('ABCD;14000 EUR;350 0 0 0 EUR', '\s[^EUR]', '','g');
ABCD;14000 EUR;350 EUR
select regexp_replace('ABCD;14000 EUR;350 0 0 0 EUR', '\s', '', 'g');
ABCD;14000EUR;350000EUR
any suggestion or help ?
-Neelesh
It appears you want to remove all whitespace chars that are not followed with EUR (or any word consisting of exactly 3 uppercase letters) as a whole word (a currency abbreviation).
Use
select regexp_replace('ABCD;140 0 0 EUR;350 0 0 0 EUR', '\s+(?![[:upper:]]{3}\y)', '','g');
Details
\s+ - 1 or more whitespaces...
(?![[:upper:]]{3}\y) - not immediately followed with 3 uppercase letters that are not followed with a word char (\y is a word boundary, \M is equivalent here since it matches the end of word position).
Note that (?![[:upper:]]{3}\y) will remove all but one whitespace before EUR. If you want to keep all whitespace chars before EUR, use \s+(?!\s*[[:upper:]]{3}\y) pattern.
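The negative lookahead can be tried quickly in Python. This is only an approximation of the Postgres pattern: Python has no \y or [[:upper:]], so \b and [A-Z] are used instead, which is an assumption that holds for ASCII currency codes such as EUR.

```python
import re

s = "ABCD;140 0 0 EUR;350 0 0 0 EUR"

# \s+ is removed unless it is immediately followed by exactly three
# uppercase letters ending at a word boundary (the currency code).
cleaned = re.sub(r"\s+(?![A-Z]{3}\b)", "", s)
print(cleaned)  # ABCD;14000 EUR;350000 EUR
```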
You may remove all spaces in front of a digit like this:
select regexp_replace('ABCD;14000 EUR;350 0 0 0 EUR', '\s+(\d)', '\1','g');
which results in:
ABCD;14000 EUR;350000 EUR
'\s+(\d)' matches all spaces followed by a digit, and '\1' is the digit, which was found.
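The same capture-and-reinsert idea, sketched in Python for comparison. The second call shows a lookahead variant that needs no capture group; that variant is an addition for illustration, not part of the answer above.

```python
import re

s = "ABCD;14000 EUR;350 0 0 0 EUR"

# Capture the digit after the spaces and put it back with \1.
with_group = re.sub(r"\s+(\d)", r"\1", s)
print(with_group)  # ABCD;14000 EUR;350000 EUR

# Alternative: only look at the digit, so nothing has to be put back.
with_lookahead = re.sub(r"\s+(?=\d)", "", s)
print(with_lookahead)  # ABCD;14000 EUR;350000 EUR
```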

Extra blank space between words

Please help me with 2 questions on how to do the GREL expression for:
If there are double spaces between 2 words in a column, how can I eliminate 1 space? Example: Robert--Smith to Robert-Smith (the minus character stands for a blank, for illustration).
How can I look for an exact word in a text filter?
Thanks!
1°) try transform---> value.replace("  "," ")
Or, simply common transforms ----> collapse consecutive white spaces
2°) Column ---> text filters and enter your word
Or, do column---> Facet---> Custom text facet and type : value.contains(" your_word ")
or value.contains(/(yourexactword)/)
This will return a True or False facet
H.
@hpiedcoq's is the right answer if you need to do this in GREL. If not, you can just use the point-and-click interface:
for the first question: Select your column and select Edit cells > Common transforms > Collapse consecutive white space
for the second question: select your column > text filter > enter the word you are looking for. You can select case sensitive if you want to take upper and lower case into account in your search.
1.1 transform -- > value.replace("  "," ")
Replaces every double space with a single space.
1.2 transform -- > value.trim()
Deletes whitespace before and after the string (it does not touch whitespace inside the string).
1.3 transform -- > value.replace(/\b  \b/," ")
Replaces with a regular expression; only affects a double space between two words.
Text filter > turn on regular expression and use \b.
Text filter with regular expression: \bWord\b = exact word. \b is a word boundary, so the word does not have to be surrounded by whitespace; punctuation or the start or end of the text also counts.
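The behaviour of these expressions can be checked outside OpenRefine with a short Python sketch. This is an analogue, not GREL: the cell value is made up, and re.sub and str.strip play the roles of value.replace and value.trim.

```python
import re

# Made-up cell value with a double space and padding.
value = "  Robert  Smith "

collapsed = re.sub(r" {2,}", " ", value)  # runs of spaces -> one space
trimmed = collapsed.strip()               # drop leading/trailing whitespace
print(repr(trimmed))  # 'Robert Smith'

# \b marks a word boundary, so "Smith" only matches as a whole word.
whole_word = bool(re.search(r"\bSmith\b", "Robert Smith, Jr."))
inside_word = bool(re.search(r"\bSmith\b", "Smithson"))
print(whole_word, inside_word)  # True False
```

Note that the comma after "Smith" still counts as a boundary, which is why \bWord\b is more forgiving than searching for the word with a space on each side.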