Find all occurrences of a regex as an array - SQL

I have the following string (it's a Salesforce query, but that's not important):
IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")
and I want to get an array of all substrings that are encapsulated between double quotes like so:
['yahoo',
'gmail',
'protonmail.com,att.net,chpmail.com,smail.com',
'Free Mail',
'Business Email']
In Python I do:
re.findall(r'"(.+?)"', <my string>)
Is there a way to replicate this in Snowflake?
I've tried:
SELECT
REGEXP_SUBSTR('IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")', '"(.+?)"') as emails;
but I get this:
"yahoo"),CONTAINS(EmailDomain,"gmail"

You can use
select split(trim(regexp_replace(regexp_replace(col, '"([^"]+)"|.', '\\1|'),'\\|+','|'), '|'), '|');
Details:
regexp_replace(col, '"([^"]+)"|.', '\\1|') - matches either a string between the closest double quotes, capturing the part inside the quotes into Group 1, or any single character, and replaces each match with the Group 1 contents followed by a | character
regexp_replace(...,'\\|+','|') - collapses each run of consecutive pipe symbols into a single | character
trim(..., '|') - removes | chars on both ends of the string
split(..., '|') - splits the string with a | char.
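The four steps above can be mirrored outside Snowflake to check the logic. This is a minimal Python sketch, not Snowflake code: the function name quoted_substrings is hypothetical, and re.DOTALL is an added assumption so that line breaks in the input are consumed like any other character.

```python
import re


def quoted_substrings(text):
    """Emulate the replace / collapse / trim / split pipeline (illustrative only)."""
    # Step 1: a quoted run becomes its inner text plus "|";
    # any other single character becomes just "|".
    marked = re.sub(
        r'"([^"]+)"|.',
        lambda m: (m.group(1) or "") + "|",
        text,
        flags=re.DOTALL,
    )
    # Step 2: collapse runs of pipes into a single pipe.
    collapsed = re.sub(r"\|+", "|", marked)
    # Steps 3 and 4: trim pipes at both ends, then split on the pipe.
    return collapsed.strip("|").split("|")


formula = 'IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail")),"Free Mail","Business Email")'
print(quoted_substrings(formula))
```

Because the pipe is used as the temporary separator, a quoted string that itself contains a pipe is split into several elements.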

Wiktor's answer works great. I'm adding an alternate answer for anyone who needs to do this and their quoted strings may contain the pipe | character. Using the replacement method on strings containing pipe(s) will split the string into more than one array member. Here's a way (not the only way) to do it that will work in case the quoted strings could potentially contain pipe characters:
set col = $$IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),"Free Mail","Business Email | Other")$$;
create or replace function GET_QUOTED_STRINGS("s" string)
returns array
language javascript
strict immutable
as
$$
var re = /(["'])(?:\\.|[^\\])*?\1/g;
var m;
var out = [];
do {
m = re.exec(s);
if (m) {
out.push(m[0].replace(/['"]+/g, ''));
}
} while (m);
return out;
$$;
select get_quoted_strings($col);
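For comparison with the Python snippet in the question, the same matching idea can be sketched with re.finditer. The function name get_quoted_strings is reused here only for illustration. Unlike the UDF above, which strips every quote character from the match, this sketch returns the captured inner text, so an escaped quote inside a string is preserved.

```python
import re

# An opening quote, then lazily any escaped character or any non-backslash
# character, then the same quote character again (backreference \1).
QUOTED = re.compile(r'''(["'])((?:\\.|[^\\])*?)\1''', re.DOTALL)


def get_quoted_strings(text):
    """Return the inner text of every quoted string, pipes included."""
    return [m.group(2) for m in QUOTED.finditer(text)]


print(get_quoted_strings('IF(x,"Free Mail","Business Email | Other")'))
```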

Related

Remove template text on regexp_replace in Oracle's SQL

I am trying to remove template text like &#x; or &#xx; or &#xxx; from a long string.
Note: x / xx / xxx is a number of unknown length, and the column type is CLOB.
for example:
SELECT 'H&#39;ello wor&#177;ld' FROM dual
Desired result:
Hello world
I know that regexp_replace should be used, but how do you use this function to remove this text?
You can use
SELECT REGEXP_REPLACE(col,'&&#\d+;')
FROM t
where
& is written twice to escape the substitution character
\d matches a digit, and the following + allows one or more of them
; ends the pattern
Alternatively, use a single ampersand ('&#\d+;') as the pattern, as in the demo. Because the ampersand marks a substitution variable in Oracle client tools, using it literally is a bit problematic.
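To see the pattern in isolation, without Oracle's substitution-variable handling getting in the way, here is a small Python sketch. The function name strip_numeric_entities is hypothetical, and the sample assumes the text contains numeric entities such as &#39; and &#177;.

```python
import re


def strip_numeric_entities(text):
    """Remove every '&#<digits>;' sequence from the text."""
    # &# literally, one or more digits, then the closing semicolon.
    return re.sub(r"&#\d+;", "", text)


print(strip_numeric_entities("H&#39;ello wor&#177;ld"))
```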
In case you wanted to remove the entities because you don't know how to replace them by their character values, here is a solution:
UTL_I18N.UNESCAPE_REFERENCE( xmlquery( 'the_double_quoted_original_string' RETURNING content).getStringVal() )
In other words, the original 'H&#39;ello wor&#177;ld' should be passed to XMLQUERY as '"H&#39;ello wor&#177;ld"'.
And the result will be 'H'ello wor±ld'
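The difference between removing the entities and decoding them can be illustrated outside Oracle. A minimal Python sketch, using the standard library's html.unescape as a stand-in for the Oracle call above (a swapped-in technique, not the same function):

```python
import html

# Numeric character references are decoded to the characters they stand for.
decoded = html.unescape("H&#39;ello wor&#177;ld")
print(decoded)
```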

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution, which works for strings of any length: split the string at every empty position that is preceded by the end of the previous match (\\G) plus two digits (\\d{2}), expressed as a lookbehind ((?<=...)); then concatenate the array with ':' and remove the delimiter left at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
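The split-then-join idea can be sketched in Python as well. Python's re module has no \G anchor, so this sketch swaps in plain slicing to cut the string into chunks; the function name add_delimiter and its parameters are hypothetical.

```python
def add_delimiter(text, n=2, sep=":"):
    """Insert sep after every n characters, with no trailing separator."""
    # Cut the string into consecutive chunks of length n (the last may be shorter).
    chunks = [text[i:i + n] for i in range(0, len(text), n)]
    return sep.join(chunks)


print(add_delimiter("112312452343"))
```

Joining the chunks never produces a trailing separator, so the final cleanup step is not needed in this variant.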
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+
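The lookahead approach carries over to other regex engines almost unchanged. A minimal Python sketch, where the whole match is referenced as \g<0> instead of $0 (the function name colonize is hypothetical):

```python
import re


def colonize(text):
    """Append ':' to every pair of characters that is followed by another character."""
    # '..' consumes two characters; '(?=.)' only checks that one more follows.
    return re.sub(r"..(?=.)", r"\g<0>:", text)


print(colonize("112312452343"))
```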

Removing leading special characters in Hive

I am trying to remove leading special characters (could be -"$&^#_)
from "Persi és Levon Cnatówóeez using Hive.
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '[^a-zA-Z0-9]+', '')
but this removes all special characters.
I am expecting an output similar to
Persi és Levon Cnatówóeez
Try this:
select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
I tried it on Hive, and it removes any character that is not a letter (a-zA-Z), a digit (0-9), an accented character (\u00E0-\u00FC), or a space.
0: jdbc:hive2://localhost:10000> select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
+----------------------------+--+
| _c0 |
+----------------------------+--+
| Persi és Levon Cnatówóeez |
+----------------------------+--+
1 row selected (0.104 seconds)
0: jdbc:hive2://localhost:10000>
From the Hive documentation:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
You should do something like this:
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '^[\!-\/\[-\`]+', '')
I don't have Hive available right now to try this code, but the idea should be correct. The second argument must describe what you want to replace, not what you want to keep in your string. In this specific case, it should remove (replace with the empty string '') every consecutive character at the beginning of the line that is in the range from ! to /, or in the range from [ to `, referring to the ASCII table.
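The key point in both answers is the difference between removing special characters everywhere and removing them only at the start, which the ^ anchor controls. A minimal Python sketch of the anchored version follows. It is not Hive code: the function name strip_leading_specials is hypothetical, and it relies on Python's Unicode-aware \W rather than explicit character ranges.

```python
import re


def strip_leading_specials(text):
    """Remove a leading run of non-alphanumeric characters (underscore included)."""
    # ^ anchors the match to the start, so later punctuation and spaces survive.
    # \W is any non-word character; '_' is added because \w treats it as a word character.
    return re.sub(r"^[\W_]+", "", text)


print(strip_leading_specials("\u201cPersi és Levon Cnatówóeez"))
```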

Get string after '/' character

I want to extract the string after the character '/' in a PostgreSQL SELECT query.
The field name is source_path, table name is movies_history.
Data Examples:
Values for source_path:
184738/file1.mov
194839/file2.mov
183940/file3.mxf
118942/file4.mp4
And so forth. All the values for source_path are in this format
random_number/filename.xxx
I need to get 'file.xxx' string only.
If your case is that simple (exactly one / in the string) use split_part():
SELECT split_part(source_path, '/', 2) ...
If there can be multiple /, and you want the string after the last one, a simple and fast solution would be to process the string backwards with reverse(), take the first part, and reverse() again:
SELECT reverse(split_part(reverse(source_path), '/', 1)) ...
Or you could use the more versatile (and more expensive) substring() with a regular expression:
SELECT substring(source_path, '[^/]*$') ...
Explanation:
[...] .. encloses a list of characters to form a character class.
[^...] .. if the list starts with ^ it's the inversion (all characters not in the list).
* .. quantifier for 0-n times.
$ .. anchor to end of string.
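The same regular expression can be tried outside PostgreSQL. A minimal Python sketch comparing it with a plain right-split (both function names are hypothetical):

```python
import re


def after_last_slash_regex(path):
    """Return the trailing run of non-slash characters, as '[^/]*$' does."""
    return re.search(r"[^/]*$", path).group(0)


def after_last_slash_split(path):
    """Same result without a regex: split once from the right, keep the last field."""
    return path.rsplit("/", 1)[-1]


for p in ("184738/file1.mov", "a/b/c.mov", "noslash"):
    print(after_last_slash_regex(p), after_last_slash_split(p))
```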
You need to use the substring function:
SELECT substring('1245487/filename.mov' from '%/#"%#"%' for '#');
Explanation:
%/
This means % (any text) followed by a /.
#"%#"
Each # is the escape character declared by the trailing for '#'; it must be followed by a " to mark one boundary of the part to return.
So the pattern has the form <marker> % <marker>, and the function returns whatever matches between the two markers. In this case that is %, i.e. the rest of the string after the /.
FINAL QUERY:
SELECT substring(source_path from '%/#"%#"%' for '#')
FROM movies_history;
You can use the split_part string function.
Syntax: split_part(string, delimiter, position)
string: for example exx = '2022-06-12'
Note: it can also be '#ertl/eitd/record_4', etc.
delimiter: the separator to split on; for the examples above, '-' or '/'
position: the nth field to return, counting from 1
How it works: the exx string is split into fields on the delimiter,
e.g. position 1 is 2022, position 2 is 06, position 3 is 12,
so the position chooses which field is returned.
Thus, based on your example:
Call: split_part('random_number/filename.xxx', '/', 2)
Output: filename.xxx
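The 1-based field numbering described above can be sketched in Python. This is an illustrative emulation under stated assumptions, not PostgreSQL's implementation: it returns an empty string when the position is past the last field, and it rejects positions below 1 for simplicity.

```python
def split_part(text, delimiter, position):
    """Return the position-th field (counting from 1) of text split on delimiter."""
    if position < 1:
        # Assumption for this sketch: only positive positions are handled.
        raise ValueError("position must be 1 or greater")
    fields = text.split(delimiter)
    # Past the last field, return an empty string instead of failing.
    return fields[position - 1] if position <= len(fields) else ""


print(split_part("2022-06-12", "-", 2))
print(split_part("184738/file1.mov", "/", 2))
```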

Parse string to get final end result

I'm trying to parse this string 'Smith, Joe M_16282' to get everything before the comma, combined with everything after the underscore.
The resulting string would be: Smith16282
string longName = "Smith, Joe M_16282";
string shortName = longName.Substring(0, longName.IndexOf(",")) + longName.Substring(longName.LastIndexOf("_") + 1);
Notes:
The second "substring" doesn't need a length parameter, because we want everything after the underscore
The LastIndexOf is used instead of IndexOf in case there are other underscores appearing in the name such as "Smith_Jones, Joe M_16282"
This code assumes that there is at least one comma and at least one underscore in the string "longName." If not, the code fails. I will leave that checking to you if you need it.
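The check that the last note leaves to the reader could look like the following. It is a minimal sketch in Python rather than C#; the function name short_name and the choice to raise ValueError are assumptions for illustration.

```python
def short_name(long_name):
    """Join the text before the first comma with the text after the last underscore."""
    comma = long_name.find(",")        # first comma, like IndexOf
    underscore = long_name.rfind("_")  # last underscore, like LastIndexOf
    if comma == -1 or underscore == -1:
        # Hypothetical error handling: reject input lacking either separator.
        raise ValueError("expected at least one ',' and one '_' in the input")
    return long_name[:comma] + long_name[underscore + 1:]


print(short_name("Smith, Joe M_16282"))
```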
As others have said, the simple approach for parsing a string like that is to use the String class's parsing methods, such as IndexOf and Substring. If you want something more powerful and flexible, you may also want to consider a Regex replacement. For instance, you could do something like this:
Dim input As String = "Smith, Joe M_16282"
Dim pattern As String = "(.*?),.*?_(.*)"
Dim replacement As String = "$1$2"
Dim output As String = Regex.Replace(input, pattern, replacement)
Or, more simply:
Dim output As String = Regex.Replace("Smith, Joe M_16282", "(.*?),.*?_(.*)", "$1$2")
Here's the meaning of the pattern:
(.*?) - The first group, capturing all of the characters before the comma
    ( - Starts the capturing group
    . - A wildcard which matches any character
    * - Specifies that the previous thing (any character) is repeated any number of times
    ? - Specifies that the * is non-greedy, meaning it won't match everything until the end of the string; it will only match until it finds the following comma
    ) - Ends the capturing group
, - The comma to look for
.*? - Says that there will be any number of any characters between the comma and the underscore, which we don't care about
    . - Any character
    * - Any number of times
    ? - Until you find the underscore
_ - The underscore to look for
(.*) - The second group, capturing all of the characters after the underscore
    ( - Starts the capturing group
    . - Any character
    * - Any number of times
    ) - Ends the capturing group
Here's the meaning of the replacement:
$1 - The value of all of the characters found in the first capturing group
$2 - The value of all of the characters found in the second capturing group
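The same pattern and replacement can be tried in other regex engines. A minimal Python sketch, in which the groups are referenced as \1 and \2 instead of $1 and $2 (the function name shorten is hypothetical):

```python
import re


def shorten(long_name):
    """Keep the text before the comma and the text after the next underscore."""
    # Group 1: lazily up to the first comma; group 2: everything after the
    # first underscore that follows that comma.
    return re.sub(r"(.*?),.*?_(.*)", r"\1\2", long_name)


print(shorten("Smith, Joe M_16282"))
```

If the input contains no comma or no underscore after it, the pattern does not match and the string is returned unchanged.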
RegEx may be overkill for your particular situation, but it is a very handy tool to learn. One major advantage is that you could move the pattern and replacement values out into external settings in the app.config, or somewhere. Then, you could modify the replacement rules without recompiling your application.