Extract all characters before a period with HiveQL regex? - sql

I have a table that looks like:
bl.ah
foo.bar
bar.fight
And I'd like to use HiveQL's regexp_extract to return
bl
foo
bar

The docs describe regexp_extract as follows:
regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:
regexp_extract(description,'^([^\.]+)\.?',1)
The components of the regex are as follows:
^ start of string
([^\.]+) any non-period character one or more times, in a capture group
\.? a period either once or no times
Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.
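To try the pattern outside Hive, here is a minimal sketch using Python's re module. It assumes Python's regex syntax is close enough to Java's for this pattern. The helper name before_period and the choice to return an empty string when nothing matches are illustrative assumptions, not part of Hive's API.

```python
import re

# Same pattern as the Hive answer. Inside a character class the period
# needs no escaping, so [^.] behaves the same as [^\.].
BEFORE_PERIOD = re.compile(r'^([^.]+)\.?')

def before_period(value):
    """Return the text before the first period, or the whole string if it has none."""
    match = BEFORE_PERIOD.search(value)
    # Returning '' when nothing matches is an assumption made for this sketch.
    return match.group(1) if match else ''

for row in ['bl.ah', 'foo.bar', 'bar.fight', 'noperiod']:
    print(row, '->', before_period(row))
```

Group 1 is the only capture group, which mirrors passing 1 as the index argument in the Hive call.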

Related

Snowflake - Check if 1st 3 Characters of string are letters

Am trying to determine how one attempts to identify, in Snowflake SQL, if a product code begins with three letters.
Suggestions?
I did just try: LEFT(P0.PRODUCTCODE,3) NOT LIKE '[a-zA-Z]%' but it didn't work.
Thanks folks
You can use REGEXP_LIKE to return a boolean value indicating whether or not your string matched the pattern you're interested in.
In your case, something like REGEXP_LIKE(string_field_here, '[a-zA-Z]{3}.*')
Breaking down the regular expression pattern:
[a-zA-Z]: Only match letter characters, both upper and lowercase
{3}: Require three of those letters
.*: Allow any number of any characters after those three letters
Note: in many cases, you would need to specifically indicate the beginning/ending of the string in the pattern, but Snowflake's implementation handles that for you. From the docs:
The function implicitly anchors a pattern at both ends (i.e. ''
automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
To match any string starting with ABC, the pattern would be 'ABC.*'.
You can try running these examples:
SELECT REGEXP_LIKE('abc', '[a-zA-Z]{3}.*') AS _abc,
REGEXP_LIKE('123', '[a-zA-Z]{3}.*') AS _123,
REGEXP_LIKE('abc123', '[a-zA-Z]{3}.*') AS _abc123,
REGEXP_LIKE('123abc', '[a-zA-Z]{3}.*') AS _123abc
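The implicit anchoring can be approximated outside Snowflake with Python's re.fullmatch, which also requires the whole string to match. This is only a sketch of the behaviour described in the quoted docs, not Snowflake's implementation, and the function name starts_with_three_letters is made up for the example.

```python
import re

PATTERN = r'[a-zA-Z]{3}.*'

def starts_with_three_letters(code):
    """True if the whole string matches PATTERN, i.e. it begins with three letters."""
    # re.fullmatch anchors at both ends, similar to the implicit ^...$ described above.
    return re.fullmatch(PATTERN, code) is not None

for sample in ['abc', '123', 'abc123', '123abc']:
    print(sample, starts_with_three_letters(sample))
```

Without the trailing .* the fully anchored pattern would accept only strings that are exactly three letters long.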

difference between pandas.Series.str.match and pandas.Series.str.contains

What's the difference between pandas.Series.str.contains and pandas.Series.str.match? Why do they give different results in the case below?
s1 = pd.Series(['house and parrot'])
s1.str.contains(r"\bparrot\b", case=False)
I got True, but when I do
s1.str.match(r"\bparrot\b", case=False)
I got False. Why is that the case?
The documentation for str.contains() states:
Test if pattern or regex is contained within a string of a Series or
Index.
The documentation for str.match() states:
Determine if each string matches a regular expression.
The difference between these two methods is that str.contains() uses re.search, while str.match() uses re.match.
As per documentation of re.match()
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Return None if the string does not match the pattern; note that this
is different from a zero-length match.
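re.match only looks for a match starting at the first character of the string. Your string starts with "house", so \bparrot\b cannot match at position 0 and the expression returns False. A pattern such as \bhouse\b does match at the start of the string, so str.match would return True for it.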
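The same behaviour can be reproduced with the re module alone, without pandas. The variable names below are just for illustration.

```python
import re

text = 'house and parrot'
parrot = re.compile(r'\bparrot\b', re.IGNORECASE)

# re.search scans the whole string (the behaviour str.contains relies on).
found_anywhere = parrot.search(text) is not None

# re.match only tries at position 0 (the behaviour str.match relies on).
found_at_start = parrot.match(text) is not None

# A pattern that does begin at position 0 succeeds with re.match.
house_at_start = re.match(r'\bhouse\b', text, re.IGNORECASE) is not None

print(found_anywhere, found_at_start, house_at_start)
```

If you need str.match to find "parrot" anywhere in the string, one option is to prefix the pattern with .* so that it can consume the leading text.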

What does this SQL query replacing JSON text mean?

I'm trying to understand a part of SQL query but I don't know what's it used for; can anyone help me?
I know it wants to replace something, but what is " ":"(.+)" ", and why the string like "store" can be used in substring()?
replace((
CASE
WHEN(char_length(substring(xxx_json::text FROM 'Name":"(.+)" , "store')) > 0)
THEN substring(xxx_json::text FROM 'Name":"(.+)" , "store')
ELSE substring(xxx_json::text FROM 'Name":"(.+)" , "employees')
END),'\u0016','''')
This appears to be a variant of substring that does regular-expression matching. The first argument, xxx_json::text, is the string to be searched. The second argument is the regular expression to match.
Note that the second argument consists of the entire SQL string literal 'Name":"(.+)" , "store' (in the first two cases). Everything in that string, except for the (.+), should literally match a portion of the string to be searched. The (.+) is regex syntax. A dot matches any character; a + means one or more occurrences; the parentheses define this as a capture group. In this context, the text that matches the capture group is what will be returned by substring.
So for instance if the contents of the string to be searched was a simple JSON expression like this: { "Name":"John Smith" , "store":"London" }, the regular expression would match and the substring would return 'John Smith'.
In short, this is a slightly hacky way of parsing JSON in SQL to extract the value of the Name element (or some element whose key ends with Name).
See section 9.7.3 in https://www.postgresql.org/docs/9.4/static/functions-matching.html for detailed documentation on this form of substring.
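To see what the query does step by step, here is a rough Python equivalent. It is only a sketch: the function name extract_name is made up, and it assumes that '\u0016' in the SQL is the literal six-character text \u0016 as it appears in the JSON text, not an actual control character.

```python
import re

def extract_name(json_text):
    """Mimic the CASE expression: try the 'store' pattern first, then 'employees'."""
    for following_key in ('store', 'employees'):
        # Everything except (.+) is matched literally; group 1 is what gets returned.
        match = re.search(r'Name":"(.+)" , "' + following_key, json_text)
        if match:
            # Assumption: replace the literal text \u0016 with a single quote,
            # as the outer replace(..., '\u0016', '''') does.
            return match.group(1).replace('\\u0016', "'")
    return None  # neither pattern matched (NULL in SQL)

print(extract_name('{ "Name":"John Smith" , "store":"London" }'))
```

Like the SQL, this depends on the exact spacing and key order in the text, which is why parsing the JSON properly is usually the safer choice.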

Pig Nesting STRSPLIT

I have a string in field 'product' in the following form:
";TT_RAV;44;22;"
and am wanting to first split on the ';' and then split on the '_' so that what is returned is
"RAV"
I know that I can do something like this:
parse_1 = foreach {
splitup = STRSPLIT(product,';',3);
generate splitup.$1 as depiction;
};
This will return the string 'TT_RAV' and then I can do another split and project out the 'RAV' however this seems like it will be passing the data through multiple Map jobs -- Is it possible to parse out the desired field in one pass?
This example does NOT work, as the inner STRSPLIT returns tuples, but it shows the logic:
parse_1 = foreach {
splitup = STRSPLIT(STRSPLIT(product,';',3),'_',1);
generate splitup.$1 as depiction;
};
Is it possible to do this in pure piglatin without multiple map phases?
Don't use STRSPLIT. You are looking for REGEX_EXTRACT:
REGEX_EXTRACT(product, '_([^;]*);', 1) AS depiction
If it's important to be able to precisely pick out the second semicolon-delimited field and then the second underscore-delimited subfield, you can make your regex more complicated:
REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction
Here's a breakdown of how that regex works:
^ // Start at the beginning
[^;]* // Match as many non-semicolons as possible, if any (first field)
; // Match the semicolon; now we'll start the second field
[^_;]* // Match any characters in the first subfield
_ // Match the underscore; now we'll start the second subfield (what we want)
( // Start capturing!
[^_;]* // Match any characters in the second subfield
) // End capturing
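Both patterns can be checked quickly against the sample value outside Pig. The sketch below uses Python's re.search and assumes the regex syntax behaves the same way for these simple patterns. The helper names are hypothetical.

```python
import re

product = ';TT_RAV;44;22;'

def extract_simple(value):
    """First underscore, then everything up to the next semicolon."""
    match = re.search(r'_([^;]*);', value)
    return match.group(1) if match else None

def extract_precise(value):
    """Second semicolon-delimited field, second underscore-delimited subfield."""
    match = re.search(r'^[^;]*;[^_;]*_([^_;]*)', value)
    return match.group(1) if match else None

print(extract_simple(product), extract_precise(product))
```

The two differ when an underscore appears in a later field: the simple pattern picks up whichever underscore comes first in the string, while the precise one only looks inside the second field.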
The only time there will be multiple maps is if you have an operator that triggers a reduce (JOIN, GROUP, etc...). If you run an explain on the script you can see if there is more than one reduce phase.

How can I check for a certain suffix in my string?

I got a list of strings. And I want to check for every string in there. Sometimes, a string can have the suffix _anim(X) where X is an integer. If such string has that kind of suffix, I need to check for all other strings that have the same "base" (the base being the part without suffix) and finally group such strings and send them to my function.
So, given the next list:
Man_anim(1)
Woman
Man_anim(3)
Man_anim(2)
My code would discover the base Man has a special suffix, and will then generate a new list grouping all Man objects and arrange them depending on the value inside parenthesis. The code is supposed to return
Man_anim(1)
Man_anim(2)
Man_anim(3)
And send such list to my function for further processing.
My problem is, how can I check for the existence of such suffix, and afterwards, check for the value inside parenthesis?
If you know that the suffix is going to be _anim(X) every time (obviously, with X varying) then you can use a regular expression:
Regex.IsMatch(value, @"_anim\(\d+\)$")
If the suffix isn't at least moderately consistent, then you'll have to look into data structures, like Suffix Trees, which you can use to determine common structures in strings.
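For the grouping and ordering part of the question, capture groups can pull out both the base and the number in one pass. The sketch below is in Python rather than C#, purely for illustration. The function name group_animations and the dictionary return shape are assumptions, not something from the question.

```python
import re
from collections import defaultdict

# Group "base" is everything before the suffix; group "index" is the integer X.
ANIM_SUFFIX = re.compile(r'^(?P<base>.*)_anim\((?P<index>\d+)\)$')

def group_animations(names):
    """Map each base name to its _anim(X) strings, ordered by X."""
    groups = defaultdict(list)
    for name in names:
        match = ANIM_SUFFIX.match(name)
        if match:  # strings without the suffix are skipped
            groups[match.group('base')].append((int(match.group('index')), name))
    # Sort numerically by X, then keep only the original strings.
    return {base: [name for _, name in sorted(items)]
            for base, items in groups.items()}

print(group_animations(['Man_anim(1)', 'Woman', 'Man_anim(3)', 'Man_anim(2)']))
```

Converting X to an integer before sorting matters once the values reach two digits, since "10" sorts before "2" as text.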