Regex greedy match based on string - regex-lookarounds

I am trying to match a complete block of a string with a regex, but I think there is a problem with greediness. Below is the structure of the PL/SQL under consideration:
ERRORHANDLER
WHEN TRUE THEN
IF SOMETHING THEN
ELSE
END IF;
WHEN FALSE THEN
END;
This is the regex I have written to match it: ^(\s*)ERRORHANDLER((?!FUNCTION).)*END[(\s*)(\w+)|;]
Where:
^(\s*)ERRORHANDLER - matches optional leading whitespace and the static string ERRORHANDLER, which is always the start of the pattern.
((?!FUNCTION).)* - a negative lookahead, repeated with a greedy *.
END[(\s*)(\w+)|;] - meant to match END along with an optional word, ending with a semicolon.
I believe the approach is right, but this regex does not match the block properly.
Expected Output:
complete match for
ERRORHANDLER
WHEN TRUE THEN
IF SOMETHING THEN
ELSE
END IF;
WHEN FALSE THEN
END;

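Since both the start and the end of the block are words, the pattern needs word boundaries. Two things are incorporated:
the starting and ending words;
between them, a quantified class of characters and metacharacters.
So for the question, this regex will match: \bERRORHANDLER[\w|\W]+END\b;
Here [\w|\W] is a character class that matches any character, including newlines, so the greedy + spans multiple lines and runs up to the last END; in the input.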
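As a quick check of that pattern, here is a minimal sketch using Python's re module rather than whatever engine the question targets. The variable names are made up for illustration, and the sample text is the block from the question.

```python
import re

# Sample PL/SQL block copied from the question.
plsql = """ERRORHANDLER
WHEN TRUE THEN
IF SOMETHING THEN
ELSE
END IF;
WHEN FALSE THEN
END;
"""

# [\w|\W] matches any character (newlines included), and + is greedy,
# so the match extends to the last "END;" in the input.
pattern = re.compile(r"\bERRORHANDLER[\w|\W]+END\b;")

match = pattern.search(plsql)
matched_block = match.group(0) if match else None
print(matched_block)
```

Because the quantifier is greedy, an input containing several ERRORHANDLER blocks would be matched as one span from the first ERRORHANDLER to the last END;.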

Related

How do '~' and '^' actually work in PostgreSQL, with practical examples?

I'm trying to solve a case for which a lot of users have used syntax that contains the "~" operator, as below:
select
business_postal_code as zip,
count(distinct case when left(business_address,1) ~ '^[0-9]' then lower(split_part(business_address, ' ', 2))
else lower(split_part(business_address, ' ', 1)) end ) as n_street
from sf_restaurant_health_violations
where business_postal_code is not null
group by 1
order by 2 desc, 1 asc;
Link to access the case: https://platform.stratascratch.com/coding/10182-number-of-streets-per-zip-code?python=
But I couldn't understand how this part of the code actually works: ... ~ '^ ....
Let's simplify the query in your question to the component parts you're asking about. Once we see how they work individually, perhaps the whole query will make more sense.
To start, the ~ (tilde) is the POSIX, case-sensitive regular expression operator. The linked PostgreSQL documentation provides brief descriptions and usage examples of it and its sibling operators:
Operator   Description                                            Example
~          Matches regular expression, case sensitive             'thomas' ~ '.*thomas.*'
~*         Matches regular expression, case insensitive           'thomas' ~* '.*Thomas.*'
!~         Does not match regular expression, case sensitive      'thomas' !~ '.*Thomas.*'
!~*        Does not match regular expression, case insensitive    'thomas' !~* '.*vadim.*'
We can see that each operator has two operands: a constant string on the left, and a pattern on the right. If the string on the left is a match for the pattern on the right, the statement is true, otherwise it is false.
In the given example for the operator you're asking about, 'thomas' is a match for the pattern '.*thomas.*' by standard regular expression rules. The '.*' prefix and suffix each mean "match any character any number of times (zero or more)". The whole pattern then means, "match any character any number of times, then the literal string 'thomas', then any character any number of times". Another string that matches is 'john thomas jones', where 'john ' matches the first '.*' and ' jones' matches the second '.*'.
I don't think this is a great example because it is functionally equivalent to 'thomas' LIKE '%thomas%' which is likely to run faster, among other benefits like being a SQL-standard operator.
A better example is the query in your question, where the pattern '^[0-9]' is used. Setting aside the ^ for now, this pattern means, "match any character in 0-9 (0, 1, 2, ..., 8, 9)", which would be much more verbose if you were to use the LIKE operator: field LIKE '%0%' OR field LIKE '%1%' OR field LIKE '%2%' ....
The ^ operator is not PostgreSQL-specific. Rather it is a special character in regular expressions with one of two meanings (aside from its use as a literal character; more about that in this answer):
1. The match should begin at the start of the line/string.
   For example, the string "Hello, World!" would contain a match for the pattern 'World' since the word "World" appears in it, but would not contain a match for the pattern '^World' since the word "World" is not at the start of the string.
   The string "Hello, World!" would contain a match for both of the following patterns: 'Hello' and '^Hello' since the word "Hello" is at the start of the string.
2. The given character set should be negated when making a match.
   For example, the pattern [^0-9] means, "match any character that is not in the range 0-9". So 'a' would match, '&' would match, and 'G' would match, but '7' would not match since it is in the character set that is being excluded.
The query in your question uses the first of the two meanings. The pattern '^[0-9]' means, "match any character in the range 0-9 starting at the beginning of the string". So '0123' would match since the string starts with "0", but 'a5' would not match since the string starts with "a", which is not in the character set that is being matched.
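The two meanings of ^ can be tried out quickly outside the database. The sketch below uses Python's re module instead of PostgreSQL's ~ operator, on the assumption that these basic constructs (a start anchor and a negated character set) behave the same way in both; the variable names are illustrative only.

```python
import re

# Meaning 1: outside brackets, ^ anchors the match at the start of the string.
world_anywhere = re.search(r"World", "Hello, World!") is not None
world_at_start = re.search(r"^World", "Hello, World!") is not None
hello_at_start = re.search(r"^Hello", "Hello, World!") is not None

# Meaning 2: as the first character inside brackets, ^ negates the set.
non_digit_matches = [ch for ch in ["a", "&", "G", "7"] if re.search(r"[^0-9]", ch)]

# The pattern from the query: a digit at the very start of the string.
starts_with_digit = [s for s in ["0123", "a5"] if re.search(r"^[0-9]", s)]

print(world_anywhere, world_at_start, hello_at_start)
print(non_digit_matches)
print(starts_with_digit)
```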
Back to the query in your question, then. The relevant part reads:
1 count(distinct
2 case
3 when left(business_address, 1) ~ '^[0-9]'
4 then lower(split_part(business_address, ' ', 2))
5 else lower(split_part(business_address, ' ', 1))
6 end
7 ) as n_street
Line 3 contains a regular expression match that will determine if we should use this case in the overall CASE statement. If the string matches the pattern, the expression will be true and we will use this case. If the string does not match the pattern, the expression will be false and we will try the next case.
The string we are matching to the pattern is left(business_address, 1). The LEFT function takes the first n characters from the string. Since n is "1" here, this returns the first character of the field business_address.
The pattern we are trying to match this string to is '^[0-9]' which we have already said means, "match any character in the range 0-9 starting at the beginning of the string". Technically we don't need the ^ regex operator here since LEFT(..., 1) will return at most one character (which will always be the first character in the resulting string).
As an example, if business_address is "123 Jones Street, Anytown, USA", then LEFT(business_address, 1) will return "1" which will match the pattern (and therefore the expression will be true and we will use the first case).
If, instead, business_address were "Jones Plaza, Suite 123, Anytown, USA", then LEFT(business_address, 1) would return "J" which would not match the pattern (since the first character is "J" which is not in the range 0-9). Our expression would be false and we would continue to the next case.
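To make the two branches concrete, here is a rough Python sketch of what the CASE expression computes for a single row. The helper split_part below is only a simplified stand-in for PostgreSQL's function of the same name (positive field numbers only), and street_key is a hypothetical name chosen for illustration.

```python
import re


def split_part(text, delimiter, n):
    """Simplified stand-in for PostgreSQL's split_part (1-based, positive n)."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def street_key(business_address):
    """Mimic the CASE expression: skip a leading house number if there is one."""
    first_char = business_address[:1]  # like left(business_address, 1)
    if re.search(r"^[0-9]", first_char):
        return split_part(business_address, " ", 2).lower()
    return split_part(business_address, " ", 1).lower()


print(street_key("123 Jones Street, Anytown, USA"))
print(street_key("Jones Plaza, Suite 123, Anytown, USA"))
```

Both sample addresses from the explanation end up with the same key, which is why the query can count distinct street names whether or not the address starts with a house number.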

Case when statement SQL

I am facing some difficulties with a data warehouse transformation task. Some source columns arrive in varchar format and contain blanks, dashes (-), and decimal numbers such as 1234.44.
Those columns are declared as number in the target.
I am trying to clean the data with this code, but I keep receiving an invalid number error:
CASE WHEN
LENGTH(TRIM(TRANSLATE(column78input, '-', ' '))) is null then null
WHEN column78input IS NULL THEN 0
else to_number(column78input)
END
In the first WHEN branch I am trying to check whether there is a - in the source and, if one is found, return null (in essence, replacing dashes with nulls).
In the second WHEN branch I am trying to handle the blanks, which I thought might be causing the error.
Finally, in the ELSE branch I want to convert the varchar to a number to load into the target table.
If anyone has a suggestion, please help!
Thanks
Try with
CASE
WHEN INSTR(column78input, '-') > 0 OR column78input IS NULL THEN 0
ELSE TO_NUMBER(REPLACE(column78input, ' '))
END
INSTR returns the first position of a character in a string. So if there is no dash, it would return 0. A value greater than 0 means there is at least one dash in the string.
Here are a few mistakes in your code:
A CASE WHEN expression exits as soon as a condition is met. So you can't remove the dash in the first condition and expect processing to continue in the next condition. In your code, if a string had a dash, the result would be null.
The LENGTH function returns the number of characters in a string. It will return a null value only if the string is null. So it's easier to directly write column78input IS NULL.
Your current first condition is basically this: "After replacing the dashes with spaces and removing all the leading/trailing spaces, if the string is null then". Because you are replacing the dashes in the string, you can't check whether a dash occurs or not.
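The branch order of the suggested CASE can be sketched in Python to see what each kind of input produces. This is only an illustration, not Oracle code. The function name to_number_or_zero is hypothetical, and the handling of empty strings assumes Oracle's behavior of treating '' as NULL.

```python
from decimal import Decimal


def to_number_or_zero(raw):
    """Mirror the suggested CASE: dash or NULL gives 0, otherwise parse the number."""
    # Assumption: an empty string is treated like NULL, as Oracle does.
    if raw is None or raw == "":
        return Decimal(0)
    # INSTR(raw, '-') > 0: any dash anywhere in the value maps to 0.
    if "-" in raw:
        return Decimal(0)
    # REPLACE(raw, ' '): strip all spaces before converting.
    cleaned = raw.replace(" ", "")
    if cleaned == "":
        # Assumption: a blank-only value becomes NULL after REPLACE.
        return None
    return Decimal(cleaned)


for sample in ["1234.44", "-", None, " 12 ", "   "]:
    print(repr(sample), "->", to_number_or_zero(sample))
```

Note that the dash check also catches negative numbers such as -5, so this rule only fits if the source never contains them.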

Detect specific string including wildcards and isolate wildcards in PLPGSQL

Is it possible in PostgreSQL 9.4 (PLPGSQL) to detect whether a string contains a certain string including wildcards, and to get the characters matched by the wildcards? For example:
IF NEW.my_string CONTAINS 'patternXYZ' THEN
NEW.my_values := getXYZ(my_string)
END IF;
This would result in NEW.my_values containing XYZ (which can be any characters in the string, but exactly 3 of them).
SELECT
CASE
WHEN (NEW.my_string like '%patternXYZ%')
THEN substring(NEW.my_string from '+pattern+')
ELSE '00'
END AS data
FROM my_table;
Here, pattern should be supplied as a parameter to the query.
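One way to read the requirement is "find a fixed prefix followed by exactly three arbitrary characters, and return those three characters". A capturing group expresses that directly. The sketch below shows the idea in Python; the function extract_wildcards and its parameters are hypothetical names. In PostgreSQL itself, substring(string from pattern) with a parenthesized group returns the text matched by the first group, so the same regex idea should carry over.

```python
import re


def extract_wildcards(text, prefix="pattern", width=3):
    """Return the `width` characters that follow `prefix`, or None if absent."""
    # re.escape keeps any special characters in the prefix literal.
    regex = re.escape(prefix) + r"(.{%d})" % width
    match = re.search(regex, text)
    return match.group(1) if match else None


print(extract_wildcards("abc patternXYZ def"))
print(extract_wildcards("no match here"))
```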

Teradata substring out of bounds

I'm having issues figuring out the bounds for a substring. For example, for the string 063016_shape_tea_cleanse__emshptea1_ I want to extract emshptea1, but it also has to work for the string 063016_shape_tea_cleanse__emshptea1_TESTDATA_HERE.
Currently I have:
sel SUBSTR('063016_shape_tea_cleanse__emshptea1_',POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_')+2,
POSITION('_' IN SUBSTR('063016_shape_tea_cleanse__emshptea1_',POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_') + 2,CHARACTER_LENGTH('063016_shape_tea_cleanse__emshptea1_') - (POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_') + 2)))-1)
But that errors out because it tries to take a substring from 27 to -1.
You might use a regular expression, this will extract everything between __ and the following _ or end of string:
REGEXP_SUBSTR(col, '(?<=__).+?(?=(_|$))')
'(?<= )' is a look-behind, i.e. search for preceding characters without adding them to the result. Here: search for __
'.+' matches any character, one or more times. On its own this would match up to the end of the string ("greedy"); the trailing '?' makes it "lazy" and prevents that.
'(?= )' is a look-ahead, i.e. search for following characters without adding them to the result.
( | ) The pipe splits an expression into multiple alternatives. Here: either an underscore character or the end of the string ($)
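The same pattern can be tried against both sample strings from the question. This sketch uses Python's re module rather than Teradata's REGEXP_SUBSTR, on the assumption that look-behind, look-ahead, and lazy quantifiers behave the same for this pattern; extract_token is a hypothetical helper name.

```python
import re

# Everything after "__" up to the next "_" or the end of the string.
token_pattern = re.compile(r"(?<=__).+?(?=(_|$))")


def extract_token(value):
    """Return the token following the double underscore, or None if there is none."""
    match = token_pattern.search(value)
    return match.group(0) if match else None


print(extract_token("063016_shape_tea_cleanse__emshptea1_"))
print(extract_token("063016_shape_tea_cleanse__emshptea1_TESTDATA_HERE"))
print(extract_token("063016_shape_tea_cleanse__emshptea1"))
```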

Parse string to get final end result

I'm trying to parse this string 'Smith, Joe M_16282' to get everything before the comma, combined with everything after the underscore.
The resulting string would be: Smith16282
string longName = "Smith, Joe M_16282";
string shortName = longName.Substring(0, longName.IndexOf(",")) + longName.Substring(longName.LastIndexOf("_") + 1);
Notes:
The second "substring" doesn't need a length parameter, because we want everything after the underscore
The LastIndexOf is used instead of IndexOf in case there are other underscores appearing in the name such as "Smith_Jones, Joe M_16282"
This code assumes that there is at least one comma and at least one underscore in the string "longName." If not, the code fails. I will leave that checking to you if you need it.
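For the missing-separator case mentioned in the last note, one possible shape for the check is sketched below in Python instead of C#. The function name short_name is hypothetical, and returning None is just one option; raising an error would work equally well.

```python
def short_name(long_name):
    """Join the text before the first comma with the text after the last underscore."""
    comma_index = long_name.find(",")        # like IndexOf(",")
    underscore_index = long_name.rfind("_")  # like LastIndexOf("_")
    # Both separators must be present; otherwise there is nothing sensible to return.
    if comma_index == -1 or underscore_index == -1:
        return None
    return long_name[:comma_index] + long_name[underscore_index + 1:]


print(short_name("Smith, Joe M_16282"))
print(short_name("Smith_Jones, Joe M_16282"))
print(short_name("no separators"))
```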
As others have said, the simple approach for parsing a string like that would be to use the String class's various parsing methods, such as IndexOf and Substring. If you want something more powerful and flexible, you may also want to consider using a RegEx replacement. For instance, you could do something like this:
Dim input As String = "Smith, Joe M_16282"
Dim pattern As String = "(.*?),.*?_(.*)"
Dim replacement As String = "$1$2"
Dim output As String = Regex.Replace(input, pattern, replacement)
Or, more simply:
Dim output As String = Regex.Replace("Smith, Joe M_16282", "(.*?),.*?_(.*)", "$1$2")
Here's the meaning of the pattern:
(.*?) - The first group, capturing all of the characters before the comma
    ( - Starts the capturing group
    . - This is a wildcard which matches any character
    * - Specifies that the previous thing (any character) is repeated any number of times
    ? - Specifies that the * is non-greedy, meaning it won't match everything until the end of the string; it will only match until it finds the following comma
    ) - Ends the capturing group
, - The comma to look for
.*? - Says that there will be any number of any characters between the comma and the underscore, which we don't care about
    . - Any character
    * - Any number of times
    ? - Until you find the underscore
_ - The underscore to look for
(.*) - The second group, capturing all of the characters after the underscore
    ( - Starts the capturing group
    . - Any character
    * - Any number of times
    ) - Ends the capturing group
Here's the meaning of the replacement:
$1 - The value of all of the characters found in the first capturing group
$2 - The value of all of the characters found in the second capturing group
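The same pattern and replacement can be checked in other regex engines. Here is a minimal Python sketch rather than .NET code. Python writes the group references as \1 and \2 instead of $1 and $2, and shorten is a hypothetical function name.

```python
import re


def shorten(long_name):
    """Keep the text before the first comma and the text after the next underscore."""
    # Same pattern as above; \1 and \2 play the role of $1 and $2.
    return re.sub(r"(.*?),.*?_(.*)", r"\1\2", long_name)


print(shorten("Smith, Joe M_16282"))
print(shorten("no separators here"))
```

If the input does not match the pattern, the string is returned unchanged, which differs from the index-based approach where missing separators cause a failure.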
RegEx may be overkill for your particular situation, but it is a very handy tool to learn. One major advantage is that you could move the pattern and replacement values out into external settings in the app.config, or somewhere. Then, you could modify the replacement rules without recompiling your application.