from string to map object in Hive - hive

My input is a string that can contain any characters from A to Z (no duplicates, so maximum 26 characters it may have).
For example:-
set Input='ATK';
The characters within the string can appear in any order.
Now I want to create a map object out of this which will have fixed keys from A to Z. The value for a key is 1 if its corresponding character appears in the input string. So in the case of this example (ATK) the map object should look like:-
{"A":1,"B":0,"C":0,"D":0,"E":0,"F":0,"G":0,"H":0,"I":0,"J":0,"K":1,"L":0,"M":0,"N":0,"O":0,"P":0,"Q":0,"R":0,"S":0,"T":1,"U":0,"V":0,"W":0,"X":0,"Y":0,"Z":0}
The code should look like:-
set Input='ATK';
select <some logic>;
It should return a map object (Map<string,int>) with 26 key-value pairs in it. What is the best way to do this without creating any user-defined functions in Hive? The function str_to_map easily comes to mind, but it only works when key-value pairs already exist in the source string, and it only includes the pairs that actually appear in the input.

Maybe not efficient but works:
select str_to_map(
         concat_ws('&', collect_list(
           concat_ws(":", a.dict, case when b.character is null then '0' else '1' end))),
         '&', ':')
from
(
  select explode(split("A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z", ',')) as dict
) a
left join
(
  select explode(split(${hiveconf:Input}, '')) as character
) b
on a.dict = b.character
The result:
{"A":"1","B":"0","C":"0","D":"0","E":"0","F":"0","G":"0","H":"0","I":"0","J":"0","K":"1","L":"0","M":"0","N":"0","O":"0","P":"0","Q":"0","R":"0","S":"0","T":"1","U":"0","V":"0","W":"0","X":"0","Y":"0","Z":"0"}

Related

SparkSQL: How to query a column with datatype: List of Maps

I have a dataframe with a column of arrays (or lists), where each element is a map from String to a complex data type (String, nested map, list, etc.; you may assume the column data type is similar to List[Map[String,AnyRef]]).
Now I want to query this table like:
select * from the tableX where column.<any of the array element>['someArbitaryKey'] in ('a','b','c')
I am not sure how to represent <any of the array element> in Spark SQL. Need help.
The idea is to transform the list of maps into a list of booleans, where each boolean indicates whether the respective map contains the wanted key (k2 in the code below). After that, all we have to do is check whether the boolean array contains at least one true element.
select * from tableX where array_contains(transform(col1, m -> map_contains_key(m, 'k2')), true)
I have assumed that the name of the column holding the list of maps is col1.
The second parameter of the transform function could be replaced by any expression that returns a boolean value. In this example map_contains_key is used, but any check resulting in a boolean value would work.
A bit unrelated: I believe that the data type of the map cannot be Map[String,AnyRef] as there is no encoder for AnyRef available.
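A couple of hedged variants of the same idea: exists() (a higher-order function available since Spark 2.4) can replace the transform/array_contains pair, and for versions before Spark 3.4 (where map_contains_key is unavailable) array_contains(map_keys(m), ...) is a substitute. The second query sketches the value check from the original question; tableX, col1 and the key names are the question's placeholders:
-- key-presence check via exists(); also works on Spark < 3.4
select * from tableX where exists(col1, m -> array_contains(map_keys(m), 'k2'));
-- the original predicate: does any map hold 'a', 'b' or 'c' under the key?
select * from tableX where exists(col1, m -> m['someArbitaryKey'] in ('a', 'b', 'c'));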

How to retrieve the required string in SQL having a variable length parameter

Here is my problem statement:
I have a single-column table with data like:
ROW-1>> 7302-2210177000-XXXX-XXXXXX-XXX-XXXXXXXXXX-XXXXXX-XXXXXX-U-XXXXXXXXX-XXXXXX
ROW-2>> 0311-1130101-XXXX-000000-XXX-XXXXXXXXXX-XXXXXX-XXXXXX-X-XXXXXXXXX-WIPXXX
Here I want to separate these values on '-' and load them into a new table. There are 11 segments in this string separated by '-', and therefore 11 columns. The problem is:
A. The lengths of these values vary; however, I have to keep each value at the length of the standard format, or at the length it actually has,
e.g. 7302- should have four characters, but if the value is shorter than that, keep it as it is (e.g. 73 should populate as 73).
Therefore, I have to separate the values as well as maintain their integrity. The code which I am writing is:
select
SUBSTR(PROFILE_ID,1,(case when length(instr(PROFILE_ID,'-')<>4) THEN (instr(PROFILE_ID,'-') else SUBSTR(PROFILE_ID,1,4) end)
)AS [RQUIRED_COLUMN_NAME]
from [TABLE_NAME];
I am getting a right parenthesis error.
Please help.
I used the regexp_substr SQL function to solve the above issue. Here below is an example:
select regexp_substr('7302-2210177000-XXXX-XXXXXX-XXX-XXXXXXXXXX-XXXXXX-XXXXXX-U-XXXXXXXXX-XXXXXX', '[^-]+', 1, 1);
Output is: 7302 --which is the 1st segment of the string
Similarly, the second segment, which is separated by '-' in the string, can be obtained by just replacing the final 1 with 2 in the above query.
Example: select regexp_substr('7302-2210177000-XXXX-XXXXXX-XXX-XXXXXXXXXX-XXXXXX-XXXXXX-U-XXXXXXXXX-XXXXXX', '[^-]+', 1, 2);
output: 2210177000 which is the 2nd segment of the string
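Putting it together for the load, a sketch in Oracle syntax (profile_id and table_name are placeholders taken from the question; note that '[^-]+' skips empty segments, so this assumes all 11 segments are populated in every row):
select
  regexp_substr(profile_id, '[^-]+', 1, 1)  as seg01,
  regexp_substr(profile_id, '[^-]+', 1, 2)  as seg02,
  regexp_substr(profile_id, '[^-]+', 1, 3)  as seg03,
  regexp_substr(profile_id, '[^-]+', 1, 4)  as seg04,
  regexp_substr(profile_id, '[^-]+', 1, 5)  as seg05,
  regexp_substr(profile_id, '[^-]+', 1, 6)  as seg06,
  regexp_substr(profile_id, '[^-]+', 1, 7)  as seg07,
  regexp_substr(profile_id, '[^-]+', 1, 8)  as seg08,
  regexp_substr(profile_id, '[^-]+', 1, 9)  as seg09,
  regexp_substr(profile_id, '[^-]+', 1, 10) as seg10,
  regexp_substr(profile_id, '[^-]+', 1, 11) as seg11
from table_name;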

How to cast postgres JSON column to int without key being present in JSON (simple JSON values)?

I am working on data in postgresql as in the following mytable with the fields id (type int) and val (type json):
id | val
---+--------
 1 | "null"
 2 | "0"
 3 | "2"
The values in the json column val are simple JSON values, i.e. just strings with surrounding quotes, with no key.
I have looked at the SO post How to convert postgres json to integer and attempted something like the solution presented there
SELECT (mytable.val->>'key')::int FROM mytable;
but in my case, I do not have a key to address the field and leaving it empty does not work:
SELECT (mytable.val->>'')::int as val_int FROM mytable;
This returns NULL for all rows.
The best I have come up with is the following (casting to varchar first, trimming the quotes, filtering out the string "null" and then casting to int):
SELECT id, nullif(trim('"' from mytable.val::varchar), 'null')::int as val_int FROM mytable;
which works, but surely cannot be the best way to do it, right?
Here is a db<>fiddle with the example table and the statements above.
Found the way to do it:
You can access the content via the keypath (see e.g. this PostgreSQL JSON cheatsheet):
Using the #> operator, you can access the json fields through a keypath. Specifying an empty keypath like {} allows you to get the content without a key.
Using double angle brackets (#>>) in the accessor returns the content without the quotes, so there is no need for the trim() function.
Overall, the statement
select id
, nullif(val#>>'{}', 'null')::int as val_int
from mytable
;
will return the contents of the former json column as int, respectively NULL (in postgresql >= 9.4):
id | val_int
---+---------
 1 | NULL
 2 | 0
 3 | 2
See updated db<>fiddle here.
--
Note: As pointed out by @Mike in his comment above, if the column format is jsonb, you can also use val->>0 to dereference scalars. However, if the format is json, the ->> operator will yield null as the result. See this db<>fiddle.
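If the goal is to convert the column itself rather than just reading it, the same expression should work in a USING clause; a minimal sketch, assuming the table and column names from the question:
alter table mytable
  alter column val type int
  using nullif(val#>>'{}', 'null')::int;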

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a comma if there is more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?
Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints, with plain parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern.
And for regexp_matches():
This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). That does not occur with this regexp: every array in the result holds a single element. But future readers should not be led into a trap:
Feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) produces a 2-D array - which is only possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
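To sanity-check any of these before mutating data, the same scalar subquery can be run in a plain SELECT first; a sketch based on query 2:
-- preview old and new values side by side before running the UPDATE
SELECT col,
       (SELECT string_agg(t.arr[1], ',')
        FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)) AS new_col
FROM tbl;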
db<>fiddle here (with extended test case)
You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
  select array_to_string(array_agg(x), ',')
  from (
    select regexp_matches(t2.column, '(?<=href=").+?(?=\")', 'g')
    from mytable t2
    where t2.id = mytable.id
  ) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.
Building on the approach of mu is too short, with a slightly different regex and a COALESCE function to retain values that do not contain href links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

Check subset using either string or array in Impala

I have a table like this
col
-----
A,B
The col could be a comma-separated string or an array; I have flexibility on the storage.
How do I check whether col is a subset of either another string or an array variable? For example:
B,A --> TRUE (order doesn't matter)
A,D,B --> TRUE (other item in between)
A,D,C --> FALSE (missing B)
I have flexibility on the type. The variable is something I cannot store in a table.
Please let me know if you have any suggestion for Impala only (no Hive).
Thanks
Not a pretty method, but perhaps a starting point...
Assuming a table with a unique identifier column id and an array<string> column col, and a string variable with ',' as a separator (and no occurrences of escaped '\,')...
SELECT
yourTable.id
FROM
yourTable,
yourTable.col
GROUP BY
yourTable.id
HAVING
COUNT(DISTINCT CASE WHEN find_in_set(col.item, '${VAR:yourString}') > 0 THEN col.item END)
=
LENGTH(regexp_replace('${VAR:yourString}', '[^,]', '')) + 1
Basically...
Expand the arrays in your table, to one row per array item.
Check if each item exists in your string.
Aggregate back up to count how many of the items were found in the string.
Check that the number of items found is the same as the number of items in the string.
The COUNT(DISTINCT <CASE>) copes with arrays like {'a', 'a', 'b', 'b'}.
Without expanding the string to an array or table (which I don't know how to do) you're dependent on the items in the string being unique. (Because I'm just counting commas in the string to find out how many items there are...)
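For reference, ${VAR:yourString} above relies on impala-shell variable substitution, so a hypothetical invocation (script name assumed) could look like:
impala-shell --var=yourString=B,A -f check_subset.sql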