Hive convert a string to an array of characters - hive

How can I convert a string to an array of characters, for example
"abcd" -> ["a","b","c","d"]
I know the split method:
SELECT split("abcd","");
#["a","b","c","d",""]
Is the trailing empty string a bug? Or are there any other ideas?

This is not actually a bug. Hive's split function simply calls the underlying Java String#split(String regexp, int limit) method with the limit parameter set to -1, which causes trailing empty strings to be returned.
I'm not going to dig into implementation details on why it's happening since there is already a brilliant answer that describes the issue. Note that str.split("", -1) will return different results depending on the version of Java you use.
A few alternatives:
Use "(?!\A|\z)" as a separator regexp, e.g. split("abcd", "(?!\\A|\\z)"). This will make the regexp matcher skip zero-width matches at the start and at the end positions of the string.
Create a custom UDF that uses either String#toCharArray(), or accepts limit as an argument of the UDF so you can use it as: SPLIT("", 0)
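The effect of that separator can be sketched outside Hive. The snippet below uses Python's re module (3.7 or later, where re.split splits on empty matches) purely as an illustration. Python spells the end anchor \Z where Java uses \z, and its handling of empty matches is not identical to Java's String#split, so treat it as an analogy rather than a reproduction of Hive's output.

```python
import re

text = "abcd"

# An empty pattern matches at every position, including both ends,
# so the result carries an empty string at each edge.
naive = re.split(r"", text)            # ['', 'a', 'b', 'c', 'd', '']

# The negative lookahead rejects the positions where \A (start) or
# \Z (end) would match, leaving only the gaps between characters.
# Note: Python writes the end anchor as \Z; Java and Hive use \z.
inner = re.split(r"(?!\A|\Z)", text)   # ['a', 'b', 'c', 'd']

# Roughly what a UDF built on String#toCharArray() would return.
chars = list(text)

print(naive)
print(inner)
print(chars)
```

The lookahead version and the plain character list agree, which is the result the question asks for.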

I don't know if it is a bug or that's how it works. As an alternative, you could use explode and collect_list, with a WHERE clause to filter out the blanks:
SELECT collect_list(l)
FROM ( SELECT EXPLODE(split('abcd','') ) as l ) t
WHERE t.l <> '';

Related

TERADATA REGEXP_SUBSTR Get string between two values

I am fairly new to teradata, but I was trying to understand how to use REGEXP_SUBSTR
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didn't seem to work.
Can anyone help?
It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.
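A small sketch of why the original attempt returns nothing useful and how the replacement behaves. It uses Python's re module as a stand-in, which is an assumption: Teradata's regex engine may differ in details, but an unescaped ^ is a start-of-string anchor in both.

```python
import re

value = "ABCD^1234567890^1"

# Unescaped ^ is the start-of-string anchor, so this lookaround pair
# can only match the empty string at position 0.
attempt = re.search(r"(?<=^).*?(?=^)", value).group(0)

# Escaping the caret makes it a literal character.
between = re.search(r"(?<=\^).*?(?=\^)", value).group(0)

# The replacement approach from the answer: strip everything up to the
# first ^ and everything from the next ^ to the end.
replaced = re.sub(r"^.*?\^|\^.*$", "", value)

print(repr(attempt))   # ''
print(between)         # 1234567890
print(replaced)        # 1234567890
```

Whether Teradata accepts the escaped lookaround form in REGEXP_SUBSTR is not confirmed here; the replacement form is the one given in the answer.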

How can I add a string character based on a position in OpenRefine?

I have a column in OpenRefine, and I would like to insert a character string into each of its values at a given position.
For example:
I have an 8-character numeric string, 85285296, and would like to add "-" after the fourth character: "8528-5296".
Can anyone help me find the specific function in OpenRefine?
Thanks
Tzipy
The simplest approach is to just use the expression language's built-in string indexing and concatenation:
value[0,4]+'-'+value[4,8]
or more generally, if you don't know that your value is exactly 8 characters long:
value[0,4]+'-'+value[4,999]
Possible solution (not sure if it's the most straightforward):
value.replace(/(\d{4})(.+)/, "$1-$2")
This means: $1 stands for the content of the first parenthesized group in the regular expression and $2 for the content of the second, so each value in the column is replaced with $1-$2.
Some other options:
value.splitByLengths(4,4).join("-")
value.match(/(\d{4})(\d{4})/).join("-")
value.substring(0,4)+"-"+value.substring(4,8)
I think 'splitByLengths' is the neatest, but I might use 'match' instead because it fails with an error if your starting string isn't 8 digits. That means you don't accidentally process data that doesn't conform to your assumption about what is in the column. Alternatively, you could use a facet/filter to check this with any of the other options.
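The trade-off between the lenient and the strict variants can be sketched in Python. The function names are hypothetical and this is not GREL; it only mirrors the logic of the slicing and 'match' expressions above.

```python
import re


def insert_hyphen(value: str, position: int = 4) -> str:
    """Slice-based version: accepts a value of any length."""
    return value[:position] + "-" + value[position:]


def insert_hyphen_strict(value: str) -> str:
    """Regex version: rejects anything other than exactly 8 digits."""
    match = re.fullmatch(r"(\d{4})(\d{4})", value)
    if match is None:
        raise ValueError(f"expected 8 digits, got {value!r}")
    return "-".join(match.groups())


print(insert_hyphen("85285296"))         # 8528-5296
print(insert_hyphen_strict("85285296"))  # 8528-5296
```

The slice-based version silently turns a short value such as "85" into "85-", which is the kind of accidental processing the strict version guards against.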

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig takes the input as a plain string; it is not intelligent enough to tell what is data and what is a delimiter.
PigStorage works as a string tokenizer. So if you want to do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it doesn't seem to solve your problem. But if we write our own PigStorage() method, we might arrive at a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column on the delimiter.
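The role of the limit argument can be illustrated with Python's str.split. This is a sketch under the assumption that STRSPLIT's limit follows Java's String#split semantics (a cap on the number of pieces); the sample line is made up.

```python
line = "foo:bar:baz"

# Splitting on every colon yields three fields.
all_fields = line.split(":")

# Capping the number of splits keeps the remainder intact.
# Python's maxsplit counts splits, while a Java-style limit counts
# pieces, so maxsplit=1 corresponds to a limit of 2.
first, rest = line.split(":", 1)

print(all_fields)   # ['foo', 'bar', 'baz']
print(first, rest)  # foo bar:baz
```

This only helps when the delimiter-bearing value is the last field; it is not a general escaping mechanism.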

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types", which are always separated by '_and_', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the piece of a string starting at a value until the end (or for a particular length)
I'd also utilize the function locate to determine the location of the '-' and '_' in the URL.
As long as there are always three dashes and three underscores, this should be pretty straightforward.
Might need case statements to determine number of dashes and underscores otherwise.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))
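A likely reason the second expression fails: square brackets define a character class, so the parentheses, the | and the letters inside them are literal single characters rather than alternatives. The pattern therefore matches one character that is not in that set, and group 0 is just that character. The Python sketch below mirrors both expressions. It uses '_and_' as the separator, following the example URL in the question (the posted solution has '_to_'), and Python's re as a stand-in for Hive's Java regex.

```python
import re

url = "/Cats-g294078-o303631-Maine_Coon_and_Tabby.html"

# LOCATE('-', s, 19) is 1-based; str.find takes a 0-based start, so 18.
tail = url[url.find("-", 18) + 1:]

# Working approach: remove the separator, the extension and any
# leftover underscores, then lowercase.
cleaned = re.sub(r"(_and_)|(\.html)|_", "", tail).lower()

# Failing approach: a negated character class matches exactly one
# character that is not ( _ t o ) | . h m l, here the leading "M".
first_hit = re.search(r"[^(_to_)|(\.html)|_]", tail).group(0)

print(tail)       # Maine_Coon_and_Tabby.html
print(cleaned)    # mainecoontabby
print(first_hit)  # M
```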

regexp_matches() returns two matches for $ (end of string)

Can somebody explain this odd behavior of regexp_matches() in PostgreSQL 9.2.4 (same result in 9.1.9):
db=# SELECT regexp_matches('test string', '$') AS end_of_string;
end_of_string
---------------
{""}
(1 row)
db=# SELECT regexp_matches('test string', '$', 'g') AS end_of_string;
end_of_string
---------------
{""}
{""}
(2 rows)
-> SQLfiddle demo.
The second parameter is a regular expression. $ marks the end of the string.
The third parameter is for flags. g is for "globally", meaning the function doesn't stop at the first match.
The function seems to report the end of the string twice with the g flag, but by definition that can only exist once. It breaks my query. :(
Am I missing something?
I would need my query to return one more row at the end, for any possible string. I expected this query to do the job, but it adds two rows:
SELECT (regexp_matches('test & foo/bar', '(&|/|$)', 'ig'))[1] AS delim
I know how to manually add a row, but I want to let the function take care of it.
It looks like it was a bug in PostgreSQL. I verified for sure it is fixed in 9.3.8. Looking at the release notes, I see possible references in:
9.3.4
Allow regular-expression operators to be terminated early by query
cancel requests (Tom Lane)
This prevents scenarios wherein a pathological regular expression
could lock up a server process uninterruptably for a long time.
9.3.6
Fix incorrect search for shortest-first regular expression matches
(Tom Lane)
Matching would often fail when the number of allowed iterations is
limited by a ? quantifier or a bound expression.
Thanks to Erwin for narrowing it down to 9.3.x.
I am not sure about what I am going to say because I don't use PostgreSQL, so this is just me thinking out loud.
Since you are matching the end of the string with $, the outcome in the first case is expected. With the global modifier g, things change: matching the end-of-string position doesn't consume any characters from the input, so the next match attempt starts where the first one left off, which is again the end of the string. If the engine kept going like that, it would loop forever, so PostgreSQL may detect this and stop to prevent a crash or an infinite loop.
I tested the same expression in RegexBuddy with the POSIX ERE flavor; it made the program unresponsive and then crash, which is what led me to this reasoning.
The same occurs, for example, in C#, where I had the same problem recently, so I think this is normal behaviour for regexps.
This is because $ doesn't stand for a specific character but for a specific position.
So $ doesn't really consume anything, and the parser stays at the same position.
You need to change your convention a little;
to test for an empty string you can use ^$.
This was a bug that has been fixed in Postgres 9.3. See accepted answer.
For Postgres 9.2 or older: A halfway decent workaround for my situation would be to use the expression .$ instead - matches for any string once at the last character:
WITH x(id, t) AS (
VALUES
(1, 'test & foo/bar')
,(2, 'test')
,(3, '') -- empty string
,(4, 'test & foo/') -- other branch as last character
)
SELECT id, (regexp_matches(t, '(&|/|.$)', 'ig'))[1] AS delim
FROM x;
But it fails for empty strings.
And it fails if the last character happens to match another branch. Like: 'foo/bar/'.
And it isn't perfect to have the actual final character returned. An empty string would be much preferable.
-> SQLfiddle.
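The intended result and the limits of the .$ workaround can be sketched with Python's re module. This is an analogy only: Python's engine is not PostgreSQL's, and the helper name is hypothetical. It does show that a zero-width $ is expected to match exactly once at the end of the string.

```python
import re


def delimiters(text, pattern):
    """Return the text of every match of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# `$` is zero-width: it matches once, at the end, and yields "".
end_only = delimiters("test string", r"$")

# The result the question is after: each delimiter plus one empty
# match for the end of the string.
with_end = delimiters("test & foo/bar", r"&|/|$")

# The `.$` workaround returns the final character instead...
workaround = delimiters("test & foo/bar", r"&|/|.$")

# ...and adds nothing when the string is empty or ends in a delimiter.
empty_case = delimiters("", r"&|/|.$")
trailing_delim = delimiters("test & foo/", r"&|/|.$")

print(end_only)        # ['']
print(with_end)        # ['&', '/', '']
print(workaround)      # ['&', '/', 'r']
print(empty_case)      # []
print(trailing_delim)  # ['&', '/']
```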