ARRAY_CONTAINS multiple values in hive - sql

Is there a convenient way to use the ARRAY_CONTAINS function in hive to search for multiple entries in an array column rather than just one? So rather than:
WHERE ARRAY_CONTAINS(array, val1) OR ARRAY_CONTAINS(array, val2)
I would like to write:
WHERE ARRAY_CONTAINS(array, val1, val2)
The full problem is that I need to read val1 and val2 dynamically from command-line arguments when I run the script, and I generally don't know in advance how many values will be conditioned on. So you can think of vals as a comma-separated list (or array) containing the values val1, val2, ..., and I want to write
WHERE ARRAY_CONTAINS(array, vals)
Thanks in advance!

There is a UDF here that will let you take the intersection of two arrays. Assuming your values have the structure
values_array = [val1, val2, ..., valn]
You could then do
where array_intersection(array, values_array)[0] is not null
If they don't have any elements in common, [] will be returned and therefore [][0] will be null
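For instance, a minimal sketch of the full query, assuming the UDF has been registered as array_intersection and the values are passed in via a hivevar (the table and column names here are hypothetical):
-- Sketch only: invoked with something like
--   hive --hivevar vals="array('val1','val2')" -f query.hql
SELECT *
FROM my_table
WHERE array_intersection(array_col, ${hivevar:vals})[0] IS NOT NULL;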

CREATE TABLE tmp_cars AS
SELECT make, COLLECT_LIST(TRIM(model)) AS model_list
FROM default.cars
GROUP BY make;

SELECT array_contains(model_list, CAST('Rainier' AS varchar(40)))
FROM default.tmp_cars t
WHERE make = 'Buick';
Data
[" Rainier"," Rendezvous CX"," Century Custom 4dr"," LeSabre Custom 4dr"," Regal LS 4dr"," Regal GS 4dr"," LeSabre Limited 4dr"," Park Avenue 4dr"," Park Avenue Ultra 4dr"]
Return
True

select *
from my_table  -- placeholder names: `table` and `array` are reserved words in Hive
lateral view explode(array_col) a as arr
where arr in (vals);
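Note that explode can emit the same base row more than once when several array elements match, so here is a hedged sketch of how this might be wired up with a hivevar and deduplicated (names are hypothetical):
-- Invoked with, e.g.: hive --hivevar vals="'val1','val2'" -f query.hql
SELECT DISTINCT t.*
FROM my_table t
LATERAL VIEW explode(array_col) a AS arr
WHERE arr IN (${hivevar:vals});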

Related

Extract information from a json string in BigQuery

I am storing a table in BigQuery with the results of a classification algorithm. The table schema is INT, STRING and looks something like this:
ID   | Output
1001 | {'Apple Cider': 0.7, 'Coffee' : 0.2, 'Juice' : 0.1}
1002 | {'Black Coffee':0.9, 'Tea':0.1}
The problem is how to fetch the first (or second, or any other) element of each string together with its score. It doesn't seem likely that JSON_EXTRACT can work here, and it can most likely be done with JavaScript. I was wondering what an elegant solution would look like.
Consider below:
select ID,
  trim(split(kv, ':')[offset(0)], " '") element,
  cast(split(kv, ':')[offset(1)] as float64) score,
  element_position
from `project.dataset.table` t,
unnest(regexp_extract_all(trim(Output, '{}'), r"'[^':']+'\s?:\s?[^,]+")) kv with offset as element_position
Applied to the sample data in your question, this returns one row per dictionary entry, with the element name, its score, and its position within the string.
Note: you can use a less verbose unnest statement if you wish:
unnest(split(trim(Output, '{}'))) kv with offset as element_position
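For completeness, a sketch of the same query with the shorter variant (it leans on split's default comma delimiter, so it assumes element names never contain commas):
select ID,
  trim(split(kv, ':')[offset(0)], " '") element,
  cast(split(kv, ':')[offset(1)] as float64) score,
  element_position
from `project.dataset.table` t,
unnest(split(trim(Output, '{}'))) kv with offset as element_position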

How do I Process This String?

I have some results in one of my tables and the results vary; each ';' marks the end of an entry, so a single column can hold multiple entries, which I need to split out.
Here is my SQL and the results:
select REGEXP_COUNT(value, ';') as cnt,
       description
from mytable;

1 {Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};
1 {Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};
2 {Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};
Desired output:
R1:
Managed By: xBoss
Time Requested:2009-10-19 07:53:45.0
Time Arrived: 2009-10-19 07:54:46.0
R2:
Managed By:Own Arrangements
Number: x5876523
Time Requested: 2009-10-19 07:57:46.0
Time Arrived:
R3:
Managed By: xBoss
Time Requested:2009-10-19 08:07:27.0
select
  SPLIT_PART(description, '}', 1),
  SPLIT_PART(description, '}', 2),
  SPLIT_PART(description, '}', 3),
  SPLIT_PART(description, '}', 4),
  SPLIT_PART(description, '}', 5) as description_with_tag
from mytable;
This is ok when the count is 1, but when there are multiple ';'-separated records in the description it doesn't give me all the results.
Is it possible to put this into an array based on the count?
First, it's worth pointing out that data in this type of format cannot take advantage of all the benefits that Redshift can offer. Amazon Redshift is a columnar database that can provide amazing performance when data is stored in appropriate columns. However, selecting specific text from a text field will always perform poorly.
Therefore, my main advice would be to pre-process the data into normal rows and columns so that Redshift can provide you the best capabilities.
However, to answer your question, I would recommend making a Scalar User-Defined Function:
CREATE FUNCTION f_extract_curly (s TEXT, key TEXT)
RETURNS TEXT
STABLE
AS $$
    # Drop the trailing semi-colon, then the outer braces;
    # without the rstrip, the last value would keep a stray '}'
    items = s.rstrip(';')[1:-1].split('}{')
    # Dictionary of Key|Value pairs from the items
    entries = {i.split('|')[0]: i.split('|')[1] for i in items}
    # Return the value for the requested key (NULL if absent)
    return entries.get(key, None)
$$ LANGUAGE plpythonu;
I loaded sample data with:
CREATE TABLE foo (
description TEXT
);
INSERT INTO foo values('{Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};');
INSERT INTO foo values('{Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};');
INSERT INTO foo values('{Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};');
Then I tested it with:
SELECT
f_extract_curly(description, 'Managed By'),
f_extract_curly(description, 'Time Requested')
FROM foo
and got the result:
xBoss 2009-04-15 20:47:11.0
Modern Management 2009-04-16 14:01:29.0
xBoss
It doesn't know how to handle lines that have the same field specified twice (with semi-colons between). You did not provide enough sample input and output lines for me to figure out what you wanted in such situations, but feel free to tweak the code for your requirements.
There is no array data type in Redshift. There are 3 options:
1) First split_part by ';', then union the results separately for every index of the first split_part output, then split_part those results by '}' and finally extract what you need (see the sketch below).
2) Create a Python UDF and process these strings with Python. I guess this is the best solution for your use case.
3) Transform your data outside Redshift. From your data structure it seems much better to process it before copying it to Redshift, unnesting the arrays into rows and extracting keys from your objects into columns.
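A minimal sketch of option 1, assuming (as in the sample data) at most two ';'-separated records per row:
-- Sketch only: union the pieces produced by splitting on ';'
SELECT SPLIT_PART(description, ';', 1) AS rec FROM mytable
UNION ALL
SELECT SPLIT_PART(description, ';', 2) AS rec FROM mytable
WHERE SPLIT_PART(description, ';', 2) <> '';
-- each rec can then be broken down further with SPLIT_PART(rec, '}', n)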

hive explode list from json-string

I have a table with JSON strings:
CREATE TABLE TABLE_JSON (
json_body string
);
The JSON has the structure:
{ obj1: { fields ... }, obj2: [array] }
I want to select all the elements from the array, but I can't.
For example, I can get all fields from first object:
SELECT f.fields...
FROM (
SELECT q1.obj1, q1.obj2
FROM TABLE_JSON jt
LATERAL VIEW JSON_TUPLE(jt.json_body, 'obj1', 'obj2') q1 AS obj1, obj2
) as json_table2
LATERAL VIEW JSON_TUPLE(TABLE_JSON.obj1, 'fields...') f AS fields...;
But with the array this method doesn't work.
I've tried to use
...
LATERAL VIEW explode(json_table2.obj2) adTable AS arr;
hive explode doc
But obj2 is a string containing an array. How can I transform the JSON string into an array and explode it?
The json_split UDF from Brickhouse ( http://github.com/klout/brickhouse ) can convert a JSON array to a Hive List, and then you can explode that.
See http://mail-archives.apache.org/mod_mbox/hive-user/201406.mbox/%3CCAO78EnLgSrrUY3Ad_ZWS9zWNKLQRwS9jXrqEE869FhUNiWgCXA#mail.gmail.com%3E and https://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
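A minimal sketch of that approach, assuming the Brickhouse jar is on the classpath (the jar name below is illustrative):
-- Register the Brickhouse UDF, then explode the parsed array
ADD JAR brickhouse.jar;
CREATE TEMPORARY FUNCTION json_split AS 'brickhouse.udf.json.JsonSplitUDF';
SELECT elem
FROM table_json
LATERAL VIEW explode(json_split(get_json_object(json_body, '$.obj2'))) e AS elem;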
You can consider using Hive-JSON SerDe to read the data from JSON.
Refer: https://github.com/rcongiu/Hive-JSON-Serde
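A sketch of what that could look like, assuming the SerDe jar is available (the jar name and column types are illustrative guesses at the schema):
ADD JAR json-serde-with-dependencies.jar;  -- illustrative jar name
CREATE TABLE table_json_typed (
  obj1 map<string,string>,  -- assumed shape of obj1
  obj2 array<string>        -- obj2 becomes a real Hive array
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
-- obj2 can then be exploded directly:
-- SELECT arr FROM table_json_typed LATERAL VIEW explode(obj2) a AS arr;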
This may not be an optimal solution but can help unblock you. For a JSON object which looks like below
'{"obj1":"field1","obj2":["a1","a2","a3"]}'
this query can help you obtain all the items of the array as individual columns, given that the size of the array is constant across all rows.
SELECT split(results, ",")[0] AS arrayItem1,
       split(results, ",")[1] AS arrayItem2,
       regexp_replace(split(results, ",")[2], "[\\]|}]", "") AS arrayItem3
FROM (
  SELECT split(translate(get_json_object(TABLE_JSON.json_body, '$.obj2'), '"\\[|]|\""', ''), "},") AS r
  FROM TABLE_JSON
) t1 LATERAL VIEW explode(r) rr AS results
It produces a result which looks like this:
arrayitem1| arrayitem2| arrayitem3
a1 | a2 | a3
You can scale it to any array size, on the condition that the size is constant across the table.

Extracting Values from Array in Redshift SQL

I have some arrays stored in the Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format is useless to me, I need to do some processing to get the values out. I've tried using a regex to extract them:
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error returned that says
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easy in python or PHP or the like, but I'm interested in a pure SQL solution - partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this:
SELECT REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '') as x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '') as y
FROM transactions;
I tried this with redshift db and this worked for me.
Detailed Explanation:
SPLIT_PART(breakdown,',',1) will give you [50.
SPLIT_PART(breakdown,',',2) will give you 50].
REPLACE(SPLIT_PART(breakdown,',',1),'[','') will replace the [ and will give just 50.
REPLACE(SPLIT_PART(breakdown,',',2),']','') will replace the ] and will give just 50.
I know it's an old post, but if someone needs a much easier way:
select json_extract_array_element_text('[100,101,102]', 2);
Output: 102
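Applied to the question's transactions table, that could look like the below sketch (the ::int casts assume the elements are integers):
SELECT id,
       total,
       json_extract_array_element_text(breakdown, 0)::int AS x,
       json_extract_array_element_text(breakdown, 1)::int AS y
FROM transactions;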

explode function in hive

I have the following sample data and I am trying to explode it in Hive. I used split, but I know I am missing something.
["[[-80.742426,35.23248],[-80.740424,35.23184],[-80.739583,35.231562],[-80.735935,35.23041],[-80.728624,35.228069],[-80.727753,35.227836],[-80.727294,35.227741],[-80.726762,35.227647],[-80.726321,35.227594],[-80.725687,35.227544],[-80.725134,35.227535],[-80.721502,35.227615],[-80.691298,35.216202],[-80.688009,35.215396],[-80.686516,35.215016],[-80.598433,35.234307]]"]
I used the below query
select explode(split(col, ',')) from sample2;
and the result is this
["[[-80.742426
35.23248]
[-80.740424
35.23184]
[-80.739583
35.231562]
[-80.735935
35.23041]
[-80.728624
35.228069]
[-80.727753
35.227836]
[-80.71143
35.227831]
[-80.711007
35.227795]
[-80.710638
35.227741]
[-80.673884
35.21014]
[-80.672358
35.209481]
[-80.672036
35.209356]
[-80.671686
35.209234]
[-80.67124
35.209099]
[-80.670815
35.209006]
[-80.670267
35.208906]
[-80.669612
35.208833]
[-80.668924
35.208806]
[-80.598433
35.234307]]"]
I need it in the below format:
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]
Any help here?
You have your data set as an array of arrays, and you want to explode your data at the first level only, so use LATERAL VIEW explode(colname) to explode at the first level.
Below is the SELECT query with explode():
SELECT col1 FROM sample2 LATERAL VIEW EXPLODE(col) explodeVal AS col1;
The output generated from your input data set is as below:
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]