Complex PG regex query - sql

I have the following working function which is used in check constraint (I'll only post the relevant SQL part):
-- a comma should always be followed by a space
-- a period should always be followed by a space, except if it is the last character of the string OR the string contains 'caporal'
-- a question mark should always be followed by a space, except if it is the last character of the string
-- must not contain 2 or more spaces in a row
-- must not contain ((
-- must not contain ))
-- any open parenthesis should be closed: number of '(' should equal to number of ')'
SELECT
($1 !~ ',(?!\s)|\s{2}|[?](?!\s(?!$)|$)|[()]{2,}') AND
((array_length(string_to_array($1, '('), 1) - 1) = (array_length(string_to_array($1, ')'), 1) - 1)) AND
($1 ~ 'caporal' OR $1 !~ '[.](?!\s(?!$)|$)')
Over time I realized that I need to allow a period without a following space in these cases:
.fr
.com
.net
.co.uk
Also, I realized that I need to allow float numbers to be written with comma/period as separator. The following cases should be valid:
2,5cm
10.4l
I tried several things, but it seems I'm just breaking the existing rules instead of adding "exceptions" to them.
My latest attempt was the following:
SELECT
($1 !~ '[[a-zA-Z]àâçéèêëîïôûùüÿæœ],(?!\s)|\s{2}|[?](?!\s(?!$)|$)|[()]{2,}') AND
((array_length(string_to_array($1, '('), 1) - 1) = (array_length(string_to_array($1, ')'), 1) - 1)) AND
($1 ~ 'caporal' OR $1 !~ '[[a-zA-Z]àâçéèêëîïôûùüÿæœ][.](?!\s(?!$)|(?!fr)|(?!com)|$)')
But it clearly isn't what I want. Thank you in advance for any hints and advice!

You should change the first regex to
,(?!\d(?<=\d,\d)|\s)|\s{2}|\?(?!\s(?!$)|$)|[()]{2,}
and the last one to
\.(?!\d(?<=\d\.\d)|(?:fr|com|co\.uk|(?<=\yco\.)uk|net)\y|\s(?!$)|$)
The changes are additions to the negative lookaheads that fail the match if their patterns match immediately to the right of the current location.
In the first regex, ,(?!\d(?<=\d,\d)|\s) matches any comma that is followed neither by whitespace nor by a fractional digit (that is, a digit immediately preceded by a digit and a comma).
In the second regex, a similar restriction is added: \d(?<=\d\.\d) stops \. from matching a period that acts as the decimal separator in a float number (a period between two digits). The (?:fr|com|co\.uk|(?<=\yco\.)uk|net)\y part stops it from matching a . that is followed by fr, com, co.uk or net as whole words (\y is a word boundary), and also the second period in co.uk: the (?<=\yco\.) lookbehind makes sure that a period before uk that is not preceded by co is still matched.
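As a quick way to sanity-check the two patterns outside the database, here is a minimal sketch in Python. It assumes Python's re module as a stand-in for PostgreSQL's regex engine, so \y is written as \b, and is_valid is a hypothetical helper that mirrors the three conditions of the check constraint.

```python
import re

# Assumption: Python's re is used instead of PostgreSQL, so \y becomes \b.
BAD_PUNCT = re.compile(r',(?!\d(?<=\d,\d)|\s)|\s{2}|\?(?!\s(?!$)|$)|[()]{2,}')
BAD_PERIOD = re.compile(
    r'\.(?!\d(?<=\d\.\d)|(?:fr|com|co\.uk|(?<=\bco\.)uk|net)\b|\s(?!$)|$)'
)

def is_valid(text):
    """Hypothetical helper mirroring the three parts of the constraint."""
    if BAD_PUNCT.search(text):
        return False                      # bad comma, double space, ?, (( or ))
    if text.count('(') != text.count(')'):
        return False                      # unbalanced parentheses
    # 'caporal' switches the period rule off, as in the original function
    return 'caporal' in text or BAD_PERIOD.search(text) is None

for sample in ['Length is 2,5cm, width is 10.4l.',
               'Visit example.co.uk or example.fr today',
               'Hello,world',
               'Hello.World',
               'site.uk']:
    print(repr(sample), is_valid(sample))
```

The last sample is rejected because uk is only allowed after a period when that period follows co.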

Related

Extract string between different special symbols

I have the following string in my query
.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt
beginning with a period, from which I need to extract the segment between the final \ and the file extension period, meaning the following expected result
ABC__123_123_123_ABC123
I am fairly new to REGEXP and couldn't find an elegant (or workable) solution in the Q&A here or elsewhere. In all queries the pattern is the same in quantity and order, but to grow my knowledge I'd prefer not to just count and cut.
You can use REGEXP_REPLACE function such as
REGEXP_REPLACE(col,'(.*\\)(.*)\.(.*)','\2')
in order to extract the piece between the last backslash and the last dot. The backslashes in \\ and \. are escape characters that make the regex treat \ and . as literal characters rather than special ones.
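To see how the three groups split the string, here is a small sketch of the same replacement in Python. This is an assumption for illustration only: Python's re.sub stands in for Oracle's REGEXP_REPLACE, and the backreference is written the same way, as \2.

```python
import re

path = r'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt'

# group 1: everything up to and including the last backslash (greedy .*)
# group 2: the file name up to the last dot
# group 3: the extension
name = re.sub(r'(.*\\)(.*)\.(.*)', r'\2', path)
print(name)  # ABC__123_123_123_ABC123
```

The whole string is matched and replaced by the second group, which is why only the file name remains.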
You need just regexp_substr and simple regexp ([^\]+)\.[^.]*$
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'([^\]+)\.[^.]*$',
1, -- position
1, -- occurrence
null, -- match_parameter
1 -- subexpr
) substring
from dual;
([^\]+)\.[^.]*$ means:
([^\]+) - find one or more (+) characters other than a backslash ([] - set, ^ - negation, i.e. "except") and name it group \1 (subexpression #1)
\. - then simple dot (. is a special character which means any character, so we need to "escape" it using \ which is an escape character)
[^.]* - zero or more any characters except .
$ - end of line
So this regexp means: find a substring which consists of one or more characters other than a backslash, followed by a dot, followed by zero or more characters other than a dot, and it must be at the end of the string. The subexpr parameter = 1 tells Oracle to return the first subexpression (i.e. the first matched group in (...)).
Other parameters you can find in the doc.
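The same pattern can be tried quickly in Python, as a sketch rather than an Oracle equivalent. One difference to keep in mind: inside a character class Python needs the backslash escaped ([^\\]), whereas the Oracle pattern above uses [^\]. file_stem is a hypothetical helper name.

```python
import re

def file_stem(path):
    """Return the text between the last backslash and the last dot, or None."""
    m = re.search(r'([^\\]+)\.[^.]*$', path)
    return m.group(1) if m else None

print(file_stem(r'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt'))
print(file_stem(r'dir\archive.tar.gz'))   # only the last extension is dropped
print(file_stem('no_extension'))          # no dot, so no match
```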
Here is a simple example that is fully compatible with Oracle 11g R2, PCRE2 and some other languages.
Oracle 11g R2 using the regexp_substr function (Reference documentation)
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}',
1,
1
) substring
from dual;
Pattern: ((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}
Result: ABC__123_123_123_ABC123
This is about as simple as it can be. Regular expressions follow a minimal common standard, so the pattern is also portable, in case someone else is interested in going the simplest way.
Hopefully, this will help you out!

SQL script to update all column values starting with number and - with blank in Postgresql

I need to update the values of a varchar column.
The values start with a number followed by - and then some letters.
For example: 27-Check This
I need to update this value by removing the starting number and the -.
Expected output: Check This
NB: only the starting number and - should be removed; everything from the first letter onward should not be changed. That is, if a number or - appears after the first letter, it should not be removed.
For example: 27-Check 23-C This
Expected output: Check 23-C This
NB: I am new to SQL, so please help even if this looks simple to you.
You can use regexp_replace to remove the leading digits:
update the_table
set the_column = regexp_replace(the_column, '^[0-9]{1,}\s*-\s*', '')
where the_column ~ '^[0-9]{1,}'
^[0-9]{1,}\s*-\s* in detail:
^ match at the start of the string
[0-9]{1,} at least one digit
\s* followed by zero or more (white) space
- followed by a dash
\s* followed by zero or more (white) space
The where clause ensures that only those rows are changed that need to be changed (e.g. values not starting with a number won't be touched at all).
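The effect of the pattern can be checked on the sample values with a short Python sketch. Python's re.sub is assumed here as a stand-in for PostgreSQL's regexp_replace, and strip_prefix is a hypothetical helper name.

```python
import re

# Same pattern as in the UPDATE statement: leading digits, optional
# whitespace, a dash, optional whitespace.
LEADING = re.compile(r'^[0-9]{1,}\s*-\s*')

def strip_prefix(value):
    """Remove a leading 'number-' prefix; other values pass through unchanged."""
    return LEADING.sub('', value)

print(strip_prefix('27-Check This'))        # Check This
print(strip_prefix('27-Check 23-C This'))   # Check 23-C This
print(strip_prefix('Check 23-C'))           # Check 23-C
```

Because the pattern is anchored with ^, a later 23- inside the value is never touched.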
If you just want everything after the first hyphen when the pattern starts with a number, you can use:
update t
set col = substring(col from '-(.*)')
where col ~ '^[0-9]+-';
substring() with a pattern is a nice implementation of what would be called regexp_substr() in other databases. It simply returns the first match of the pattern in the string. The full pattern is matched, but if there are parentheses, then only that portion is returned.
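A rough Python analogue of this capture-group behaviour, under the assumption that re.search plays the role of substring(col from pattern) and the guard plays the role of the WHERE clause. after_first_hyphen is a hypothetical helper name.

```python
import re

def after_first_hyphen(value):
    """Return the text after the first hyphen for values like '27-...'."""
    if not re.search(r'^[0-9]+-', value):
        return value                      # row not selected by the WHERE clause
    # The whole pattern '-(.*)' is matched, but only the group is returned.
    return re.search(r'-(.*)', value).group(1)

print(after_first_hyphen('27-Check 23-C This'))   # Check 23-C This
print(after_first_hyphen('Check-This'))           # Check-This
```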

Delete specific pattern between commas in text file

I have thousands of SQL queries in Notepad++, one query per line. Every query contains a comma-separated list of columns to be selected from the database. Now we want certain columns that follow a specific pattern/regular expression to be removed from that list. Each SQL query follows a specific pattern:
A trimmed column is selected with the alias 'PK'.
Every query has a date-based where condition at the end.
Sometimes the pattern we wish to remove also occurs in the PK expression, in the where clause, or in both. We don't want to remove it from those places, only from the column selection list.
Below is an example of a SQL query:
select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,TAE_TXT_ACCT_NUM,TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA,TAE_RID_OWNER,TAE_FID_OWNER,TAE_CID_OWNER,TAE_TSP_REC_UPDATE from TABLE_TAX_REP where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'
After removal of the columns/patterns, the query should look like this:
select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,TAE_TXT_ACCT_NUM,TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA from TABLE_TAX_REP where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'
We want to remove the following patterns from every query, between the commas:
.FID.
.RID.
.CID.
.TSP.
If the pattern exists within a TRIM/DATE function, it should not be touched. It should only be removed from the column selection list.
Could somebody please help me with the above? Thanks in advance.
You may use
(?:\G(?!^)|\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$))(?:(?!\sfrom\s).)*?\K,?\s*[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+
Details
(?:\G(?!^)|\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$)) - two alternatives:
\G(?!^) - the end of the previous location, not a position at the start of the line
| - or
\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$) - an as surrounded with single whitespaces that is followed with any 0+ chars other than line break chars and then ', 2 digits, /, 2 digits, /, 4 digits and ' at the end of the line
(?:(?!\sfrom\s).)*? - consumes any char other than a line break char, 0 or more repetitions, as few as possible, that does not start a whitespace, from, whitespace sequence
\K - a match reset operator discarding all text matched so far
,?\s* - an optional comma followed with 0+ whitespaces
[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+ - ASCII uppercase letters and/or _, 1 or more occurrences, followed with _, then either FID, RID or CID, or TSP, then _, and again 1 or more occurrences of ASCII uppercase letters and/or _.
See the regex demo.
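The pattern above relies on \G and \K, which Python's standard re module does not support. For processing the file outside the editor, here is a sketch that swaps in a different technique: split each query into the part up to "as PK", the column list, and the part from " from " onward, then filter the column list. The names QUERY, UNWANTED and strip_columns are hypothetical, and the sketch assumes every line has the shape described in the question.

```python
import re

# head: everything up to the PK alias; cols: the remaining select list;
# tail: from the first " from " to the end of the line.
QUERY = re.compile(r'^(?P<head>.*?\sas\s+PK)(?P<cols>.*?)(?P<tail>\sfrom\s.*)$')
UNWANTED = re.compile(r'[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+')

def strip_columns(query):
    """Drop *_FID_*, *_RID_*, *_CID_* and *_TSP_* columns from the select list."""
    m = QUERY.match(query)
    if m is None:
        return query                      # unexpected shape: leave untouched
    kept = [c for c in m.group('cols').split(',')
            if not UNWANTED.fullmatch(c.strip())]
    return m.group('head') + ','.join(kept) + m.group('tail')

line = ("select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,TAE_TXT_ACCT_NUM,"
        "TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA,TAE_RID_OWNER,TAE_FID_OWNER,"
        "TAE_CID_OWNER,TAE_TSP_REC_UPDATE from TABLE_TAX_REP "
        "where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'")
print(strip_columns(line))
```

Since only the cols part is filtered, the occurrences inside TRIM(...) and DATE(...) are left alone.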

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
You can clearly see that each EMP_IDENTIFIER is connected with an EMP_ID.
I need to pull only the identifier, which is 10 characters long, and insert it into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is any other way to do it without using INSTR and SUBSTR.
EMP_ID (VARCHAR2)   EMP_IDENTIFIER (VARCHAR2)
62049               62049-2162400111
6394                6394-1368000222
64473               64473-1814702333
61598               61598-0876000444
57452               57452-0336503555
5842                5842-0000070666
75778               75778-0955501777
76021               76021-0546004888
76274               76274-0000454999
73910               73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
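For a quick check of what the '-(.{10})' pattern captures, here is a minimal Python sketch. It assumes re.search as a stand-in for REGEXP_SUBSTR with subexpression 1; second_part is a hypothetical helper name.

```python
import re

def second_part(identifier):
    """Return the 10 characters that follow the first dash, or None."""
    m = re.search(r'-(.{10})', identifier)
    return m.group(1) if m else None

for emp_identifier in ['62049-2162400111', '6394-1368000222', '5842-0000070666']:
    print(emp_identifier, '->', second_part(emp_identifier))
```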
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument of the function, the position, is 1); it looks for the second occurrence (the fourth argument, 2); it doesn't do any special matching such as ignoring case (the argument "null" to the function), and when the match is found, it returns the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.
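One caveat may be worth testing before relying on this pattern: each match consumes its trailing -, so if occurrences are counted without overlap, a numeric group that directly follows another numeric group has no leading - left to match. The sketch below reproduces the call with Python's re module, which is only a stand-in (an assumption: Oracle's engine may treat ^ or occurrence counting differently). nth_group is a hypothetical helper, not an Oracle or Python API.

```python
import re

def nth_group(text, pattern, occurrence, group):
    """Return `group` of the n-th non-overlapping match, or None."""
    for i, m in enumerate(re.finditer(pattern, text), start=1):
        if i == occurrence:
            return m.group(group)
    return None

PATTERN = r'(-|^)(\d+)(-|$)'
sample = 'AS23302-ATX-20032-33900293-CWV20-3499-RA'

print(nth_group(sample, PATTERN, 1, 2))
print(nth_group(sample, PATTERN, 2, 2))
print(nth_group('62049-2162400111', PATTERN, 2, 2))
```

In Python this prints 20032, then 3499, then None, because the - after 20032 and the - after 62049 have already been consumed by the first match. If Oracle counts occurrences the same way, a pattern that does not consume the trailing delimiter would be needed, so the query is worth verifying on a real instance with adjacent numeric groups.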

Parse string to get final end result

I'm trying to parse this string 'Smith, Joe M_16282' to get everything before the comma, combined with everything after the underscore.
The resulting string would be: Smith16282
string longName = "Smith, Joe M_16282";
string shortName = longName.Substring(0, longName.IndexOf(",")) + longName.Substring(longName.LastIndexOf("_") + 1);
Notes:
The second "substring" doesn't need a length parameter, because we want everything after the underscore
The LastIndexOf is used instead of IndexOf in case there are other underscores appearing in the name such as "Smith_Jones, Joe M_16282"
This code assumes that there is at least one comma and at least one underscore in the string "longName." If not, the code fails. I will leave that checking to you if you need it.
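The check that the last note leaves open could look like the following sketch, written in Python rather than C# so it can be run standalone. short_name is a hypothetical helper, and requiring the underscore to come after the comma is an extra assumption beyond what the answer states.

```python
def short_name(long_name):
    """Text before the first comma plus text after the last underscore."""
    comma = long_name.find(',')          # like IndexOf(",")
    underscore = long_name.rfind('_')    # like LastIndexOf("_")
    if comma == -1 or underscore < comma:
        # Covers a missing comma, a missing underscore (rfind returns -1),
        # and an underscore that only appears before the comma.
        raise ValueError('expected "<last>, <first>_<id>": %r' % long_name)
    return long_name[:comma] + long_name[underscore + 1:]

print(short_name('Smith, Joe M_16282'))         # Smith16282
print(short_name('Smith_Jones, Joe M_16282'))   # Smith_Jones16282
```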
As others have said, the simple approach for parsing a string like that would be to use the String's various parsing methods, such as IndexOf and Substring. If you want something more powerful and flexible, you may also want to consider using a RegEx replacement. For instance, you could do something like this:
Dim input As String = "Smith, Joe M_16282"
Dim pattern As String = "(.*?),.*?_(.*)"
Dim replacement As String = "$1$2"
Dim output As String = Regex.Replace(input, pattern, replacement)
Or, more simply:
Dim output As String = Regex.Replace("Smith, Joe M_16282", "(.*?),.*?_(.*)", "$1$2")
Here's the meaning of the pattern:
(.*?) - The first group capturing all of the characters before the comma
( - Starts the capturing group
. - This is a wildcard which matches any character
* - Specifies that the previous thing (any character) is repeated any number of times
? - Specifies that the * is non-greedy, meaning it won't match everything until the end of the string--it will only match until it finds the following comma
) - Ends the capturing group
, - The comma to look for
.*? - Says that there will be any number of any characters between the comma and the underscore which we don't care about
. - Any character
* - Any number of times
? - Until you find the underscore
_ - The underscore to look for
(.*) - The second group capturing all of the characters after the underscore
( - Starts the capturing group
. - Any character
* - Any number of times
) - Ends the capturing group
Here's the meaning of the replacement:
$1 - The value of all of the characters found in the first capturing group
$2 - The value of all of the characters found in the second capturing group
RegEx may be overkill for your particular situation, but it is a very handy tool to learn. One major advantage is that you could move the pattern and replacement values out into external settings in the app.config, or somewhere. Then, you could modify the replacement rules without recompiling your application.
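To illustrate keeping the pattern and replacement as plain data, here is a small Python sketch of the same replacement. Python's re.sub is assumed in place of .NET's Regex.Replace, so the replacement is written \1\2 instead of $1$2; PATTERN and REPLACEMENT are hypothetical setting names.

```python
import re

# These two values could come from a settings file instead of the source code.
PATTERN = r'(.*?),.*?_(.*)'
REPLACEMENT = r'\1\2'

print(re.sub(PATTERN, REPLACEMENT, 'Smith, Joe M_16282'))   # Smith16282
print(re.sub(PATTERN, REPLACEMENT, 'Smith, Joe_M_16282'))   # SmithM_16282
```

The second call shows one difference from the LastIndexOf approach: the lazy .*?_ stops at the first underscore after the comma, so anything after that underscore is kept.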