Hive - Reformat data structure - sql

So I have a sample of Hive data:
Customer | xx_var | yy_var | branchflow
{"customer_no":"239230293892839892","acct":["2324325","23425345"]} | 23 | 3 | [{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]
And I want to transform it into something like this:
Customer_no | acct | xx_var | yy_var | branchflow
239230293892839892 | 2324325 | 23 | 3 | [1,2,3,4,5,6,6,6,4]
239230293892839892 | 23425345 | 23 | 3 | [1,2,3,4,5,6,6,6,99,4]
I have tried the following query, but it produces the wrong output format.
SELECT
customer.customer_no,
acct,
xx_var,
yy_var,
bi_acctno,
values_bi
FROM
struct_test
LATERAL VIEW explode(customer.acct) acct AS acctno
LATERAL VIEW explode(branchflow.acctno) bia as bi_acctno
LATERAL VIEW explode(branchflow.value) biv as values_bi
WHERE bi_acctno = acctno
Does anyone know how to approach this problem?

Use json_tuple to extract the JSON elements. It returns an array as a string too, so for an array remove the square brackets, split the string, and explode the result. See the comments in the demo code.
Demo:
with mytable as (--demo data, use your table instead of this CTE
select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer,
23 xx_var, 3 yy_var,
'[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]' branchflow
)
select c.customer_no,
a.acct,
t.xx_var, t.yy_var,
get_json_object(b.acct_branchflow,'$.value') value
from mytable t
--extract customer_no and acct array
lateral view json_tuple(t.customer, 'customer_no', 'acct') c as customer_no, accts
--remove [] and " and explode array of acct
lateral view explode(split(regexp_replace(c.accts,'^\\[|"|\\]$',''),',')) a as acct
--remove [] and explode array of json
lateral view explode(split(regexp_replace(t.branchflow,'^\\[|\\]$',''),'(?<=\\}),(?=\\{)')) b as acct_branchflow
--this will remove duplicates after lateral view: need only matching acct
where get_json_object(b.acct_branchflow,'$.acctno') = a.acct
Result:
customer_no acct xx_var yy_var value
239230293892839892 2324325 23 3 [1,2,3,4,5,6,6,6,4]
239230293892839892 23425345 23 3 [1,2,3,4,5,6,6,6,99,4]
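To see the bracket-stripping and split step in isolation, here is a minimal sketch on a hypothetical two-element JSON array (the literal below is made up for illustration and is not the asker's data). The lookbehind and lookahead in the split pattern make it split only on commas that sit between `}` and `{`, so commas inside each object are left alone. It assumes a Hive version that allows a subquery without a FROM clause.

```sql
-- Hypothetical sample; in practice the string would come from the branchflow column
select get_json_object(e.element, '$.acctno') as acctno,
       get_json_object(e.element, '$.value')  as value
from (select '[{"acctno":"1","value":[1,2]},{"acctno":"2","value":[3]}]' as branchflow) s
-- strip the outer [ ] and split only on the commas between } and {
lateral view explode(
  split(regexp_replace(s.branchflow, '^\\[|\\]$', ''), '(?<=\\}),(?=\\{)')
) e as element;
```

This should yield one row per array element, with acctno 1 and 2 and the value arrays returned as JSON strings. Note that the split pattern would also break on a `},{` sequence inside a nested object, so it is only safe for flat elements like these.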


Add a column to BigQuery results that adds a description for an ID in a column from the results

I am using BQ to pull some data and I need to add a column to the results that includes a lookup.
SELECT
timestamp_trunc(a.timestamp,day) date,
a.custom_parameter1,
a.custom_parameter2,
a.score,
a.type,
b.ref
FROM
`data-views_batch_20221021` a
left outer join (select client_uuid,STRING_AGG(document_referrer, "," LIMIT 1) ref from `activities_batch_20221021` where app_id="12345" and document_referrer is not null group by client_uuid) b using (client_uuid)
WHERE
a.app_id="12345"
How can I add a column that takes the array in a.type and looks up each value in a dict? I currently do this in Python, where it looks up the values in a dict, but I want to include it in the query.
The dict is:
{23:"Description1", 24:"Description2", 25:"Description3"}
I don't have these values in a table within BQ, can I include it within the query? There are about 14 total descriptions to map.
My end result would look like this:
date | custom_parameter1 | custom_parameter2 | score | types | ref | type_descriptions
Edited to add that types is an array.
Regarding "I don't have these values in a table within BQ, can I include it within the query?":
Yes, you can define them in a CTE, as in the example below:
with dict as (
select 23 type, "description1" type_description union all
select 24, "description2" union all
select 25, "description3"
)
select
timestamp_trunc(a.timestamp,day) date,
a.custom_parameter1,
a.custom_parameter2,
a.score,
a.type,
b.ref,
type_description
from `data-views_batch_20221021` a
left outer join (
select client_uuid, string_agg(document_referrer, "," limit 1) ref
from `activities_batch_20221021`
where app_id="12345" and document_referrer is not null
group by client_uuid
) b using (client_uuid)
left join dict using (type)
where a.app_id="12345"
Regarding "There are about 14 total descriptions to map":
You can add as many rows to the dict CTE as you need.
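Since the question was edited to say that type is an array, a plain join on type would not match individual elements. One possible variation, sketched here under assumptions, unnests the array inside an ARRAY subquery and joins each element to the dict CTE. The `sample` CTE and its `id` column are hypothetical stand-ins for the real table; the inner join drops elements that have no description, because a BigQuery array cannot hold NULL elements.

```sql
with dict as (
  select 23 as type, 'Description1' as type_description union all
  select 24, 'Description2' union all
  select 25, 'Description3'
),
sample as (  -- hypothetical stand-in for `data-views_batch_20221021`
  select 1 as id, [23, 25] as type union all
  select 2, [24]
)
select
  s.id,
  s.type,
  array(
    select d.type_description
    from unnest(s.type) as t with offset as pos
    join dict d on d.type = t   -- elements without a description are dropped
    order by pos                -- keep the original element order
  ) as type_descriptions
from sample s
```

In the real query, the same `array(...)` expression could be added to the select list with `a.type` in place of `s.type`.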

Hive Explode the Array of Struct key: value:

This is the Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
USER_ID string,
DETAIL_DATA array<struct<key:string,value:string>>
)
And this is the data in the table:
11111 [{"key":"client_status","value":"ACTIVE"},{"key":"name","value":"Jane Doe"}]
Is there any way I can get the below output using HiveQL?
**client_status** | **name**
-------------------+----------------
ACTIVE Jane Doe
I tried using explode(), but I get a result like this:
SELECT details
FROM sample_table
lateral view explode(DETAIL_DATA) exploded_table as details;
**details**
-------------------------------------------+
{"key":"client_status","value":"ACTIVE"}
------------------------------------------+
{"key":"name","value":"Jane Doe"}
Use lateral view [outer] inline to get the struct elements already extracted, then use conditional aggregation to collect the values for the keys you need into a single row, grouping by user_id.
Demo:
with sample_table as (--This is your data example
select '11111' USER_ID,
array(named_struct('key','client_status','value','ACTIVE'),named_struct('key','name','value','Jane Doe')) DETAIL_DATA
)
SELECT max(case when e.key='name' then e.value end) as name,
max(case when e.key='client_status' then e.value end) as status
FROM sample_table
lateral view inline(DETAIL_DATA) e as key, value
group by USER_ID
Result:
name status
------------------------
Jane Doe ACTIVE
If you can guarantee the order of the structs in the array (the one with the status always comes first), you can address the nested elements directly:
SELECT detail_data[0].value as client_status,
detail_data[1].value as name
from sample_table
One more approach: if you do not know the order of the structs but the array always has size 2, CASE expressions without explode will give better performance:
SELECT case when DETAIL_DATA[0].key='name' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as name,
case when DETAIL_DATA[0].key='client_status' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as status
FROM sample_table

Finding most common elements of column of arrays in Presto

I would like to find the most common elements within a column of arrays in presto.
For example...
col1
[A,B,C]
[A,B]
[A,D]
with output of...
col1 - col2
A - 3
B - 2
C - 1
D - 1
I have tried using flatten and unnest. I am able to get it into a single array using
select flatten(array_agg(col1))
from tablename;
but I am then not sure how to group and count by the distinct elements. I also am struggling to get this to run on all of my data because of the large amount of memory required.
Thanks for any help!
You can use unnest() to flatten the array and then GROUP BY to count the unique values.
The query below generates the data set for your case. You can replace this part with your own SELECT in the final query:
with dataset AS (
SELECT ARRAY[
ARRAY['A','B','C'],
ARRAY['A','B'],
ARRAY['A','D']
] AS data
)
select dt from dataset
CROSS JOIN UNNEST(data) AS t(dt)
O/P:
------
dt
------
[A,B,C]
------
[A,B]
------
[A,D]
Now, in the final query, we first unnest this data to extract the individual values from all the rows, and then group those values to get each unique value and its count.
FINAL QUERY:
with da AS(
with dataset AS (
SELECT ARRAY[
ARRAY['A','B','C'],
ARRAY['A','B'],
ARRAY['A','D']
] AS data
)
select dt from dataset
CROSS JOIN UNNEST(data) AS t(dt)
)
select daVal,count(*) from da
CROSS JOIN UNNEST(dt) AS t(daVal)
GROUP BY daVal
You can unnest() and aggregate:
select u.col, count(*)
from t cross join
unnest(col1) u(col)
group by u.col;
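For reference, here is a self-contained sketch of the same unnest-and-aggregate idea, with the question's three rows inlined as a hypothetical `tablename` CTE and an ORDER BY added so the most common elements come first. The output names `elem` and `cnt` are illustrative.

```sql
with tablename(col1) as (
  values (array['A','B','C']),
         (array['A','B']),
         (array['A','D'])
)
select u.elem, count(*) as cnt
from tablename
cross join unnest(col1) as u(elem)   -- one output row per array element
group by u.elem
order by cnt desc, u.elem;
-- A 3, B 2, C 1, D 1
```

Because each array is unnested row by row, this avoids building one large array with array_agg first, which is the step the question reports as memory-hungry.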

Return a NULL value if Date not in CTE

I have a query that counts the number of records imported for every day up to the current date. The only problem is that a count is returned only for dates on which records were imported; dates with no records are left out.
I have created a CTE with one column in MSSQL that lists the dates in a certain range, e.g. 2019-01-01 to today.
The query that I currently have looks like this:
SELECT TableName, DateRecordImported, COUNT(*) AS ImportedRecords
FROM Table
WHERE DateRecordImported IN (SELECT * FROM DateRange_CTE)
GROUP BY DateRecordImported
I get the results fine for the dates that exist in the table for example:
TableName DateRecordImported ImportedRecords
______________________________________________
Example 2019-01-01 165
Example 2019-01-02 981
Example 2019-01-04 34
Example 2019-01-07 385
....
but I need a count of 0 returned if a date from the CTE is not in the table. Is there a better alternative that returns a 0 count, or does my method just need altering slightly?
You can use a LEFT JOIN:
SELECT C.Date, COUNT(t.DateRecordImported) AS ImportedRecords
FROM DateRange_CTE C LEFT JOIN
table t
ON t.DateRecordImported = C.Date -- C.Date is a placeholder; use the CTE's actual column name
GROUP BY C.Date; -- same placeholder as above
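The following is a minimal, self-contained T-SQL sketch of the LEFT JOIN approach. The table variable `@Imports`, its rows, the short date range, and the column name `ImportedDate` are all hypothetical, chosen only to make the example runnable. The key point is that COUNT over a column from the right side of the join ignores the NULLs produced for unmatched dates, so those dates get 0.

```sql
-- Hypothetical sample data standing in for the real table
DECLARE @Imports TABLE (DateRecordImported date);
INSERT INTO @Imports VALUES ('2019-01-01'), ('2019-01-01'), ('2019-01-02'), ('2019-01-04');

WITH DateRange_CTE AS (
    -- recursive CTE producing one row per day in the range
    SELECT CAST('2019-01-01' AS date) AS ImportedDate
    UNION ALL
    SELECT DATEADD(day, 1, ImportedDate)
    FROM DateRange_CTE
    WHERE ImportedDate < '2019-01-05'
)
SELECT d.ImportedDate,
       COUNT(i.DateRecordImported) AS ImportedRecords  -- NULLs from unmatched dates are not counted
FROM DateRange_CTE AS d
LEFT JOIN @Imports AS i
       ON i.DateRecordImported = d.ImportedDate
GROUP BY d.ImportedDate
ORDER BY d.ImportedDate
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels for longer ranges
-- 2019-01-01: 2, 2019-01-02: 1, 2019-01-03: 0, 2019-01-04: 1, 2019-01-05: 0
```

Using COUNT(*) here instead would return 1 for the empty dates, since the LEFT JOIN still produces one row for each of them.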
Move the CTE out of the subquery and into the FROM clause:
SELECT T.TableName,
DT.{CTEDateColumn} AS DateRecordImported,
COUNT(T.{TableIDColumn}) AS ImportedRecords
FROM DateRange_CTE DT
LEFT JOIN [Table] T ON DT.{CTEDateColumn} = T.DateRecordImported
GROUP BY T.TableName, DT.{CTEDateColumn}; -- TableName is NULL for dates with no rows
You'll need to replace the values in braces ({}) with your actual column names.
You can also try a FULL JOIN:
SELECT ImportedDate,
case when DateRecordImported is null
then 0
else count(*) end AS ImportedRecords
FROM [Table] full join DateRange_CTE
on [Table].DateRecordImported = DateRange_CTE.ImportedDate
group by DateRecordImported, ImportedDate
(ImportedDate is the name of the CTE's column.)

Split array by portions in PostgreSQL

I need to split an array into pairs of adjacent values.
For example I have following array:
select array[1,2,3,4,5]
And I want to get 4 rows with following values:
{1,2}
{2,3}
{3,4}
{4,5}
Can I do this with an SQL query?
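select a
from (
-- order by the element position; lead() without ORDER BY has no guaranteed order
select array[e, lead(e) over (order by i)] as a, i
from unnest(array[1,2,3,4,5]) with ordinality u(e, i)
) s
where not exists (
select 1
from unnest(a) u (e)
where e is null
)
order by i;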
a
-------
{1,2}
{2,3}
{3,4}
{4,5}
One option is to do this with a recursive cte. Starting from the first position in the array and going up to the last.
with recursive cte(a,val,strt,ed,l) as
(select a,a[1:2] as val,1 strt,2 ed,cardinality(a) as l
from t
union all
select a,a[strt+1:ed+1],strt+1,ed+1,l
from cte where ed<l
)
select val from cte
Here a is the array column of table t.
Another option, if you know the maximum length of the array, is to use generate_series to get all the numbers from 1 to that maximum and cross join them with the array table, keeping only the numbers up to the array's cardinality. Then use lead to get the slices of the array and omit the last one (lead on the last row of a partition is null, so that slice is null).
with nums(n) as (select * from generate_series(1,10))
select a,res
from (select a,t.a[nums.n:lead(nums.n) over(partition by t.a order by nums.n)] as res
from nums
cross join t
where cardinality(t.a)>=nums.n
) tbl
where res is not null
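If the maximum length is not known in advance, a variation on the same idea is to generate the start positions per row from the array's own cardinality. This is a sketch under assumptions: the one-row CTE `t` is a hypothetical stand-in for the real table, and it relies on `cardinality` and `LATERAL`, so it needs PostgreSQL 9.4 or later.

```sql
with t(a) as (values (array[1,2,3,4,5]))  -- hypothetical stand-in for the real table
select a[i:i+1] as pair
from t
-- one start position per adjacent pair: 1 .. length - 1
cross join lateral generate_series(1, cardinality(a) - 1) as g(i)
order by i;
-- {1,2}, {2,3}, {3,4}, {4,5}
```

For an empty or single-element array, generate_series produces no rows, so no pairs are returned for that row.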