Can we flatten column which contain Json as value in Hive table? - sql

I have one Hive column 'events' with JSON values. How can I flatten this JSON to create one Hive table with columns for each key field of the JSON? Is it even possible?
For example, I need the Hive table columns to be events, start_date, id, details with the corresponding values.
| events |
|[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] |
|[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]|

Demo:
select events,
get_json_object(element,'$.id') as id,
get_json_object(element,'$.start_date') as start_date,
get_json_object(element,'$.details') as details
from
(
select '[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}]' as events
union all
select '[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]' as events
) s lateral view outer explode (split(regexp_replace(events, '\\[|\\]',''),'(?<=\\}),(?=\\{)')) e as element
The initial string is split on the commas between curly brackets (see explanation here), the array is exploded with lateral view, and the JSON objects are parsed using get_json_object.
Result:
events id start_date details
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245ret 20201230 Imp
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245rtr 20201228 NoImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245ret 20191230 vImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245rwer 20191228 NoImp

Related

Create a nested json string in bigquery

I want to create a nested JSON string where one phone number returns all product_name values in one row only.
I have tried, but the output of TO_JSON_STRING isn't what I need.
Here is the query that I used:
select cus.phone,
TO_JSON_STRING(STRUCT(
line.product_name
)) as attributes
from `dtm_med.t1_customer` cus
left join `dtm_med.t2_toa_total_sales_line` line on cus.phone = line.phone
left join `med_product.raw_cms_users` u on u.id = line.patient_id
where date_diff(current_date(), date(latest_order_date), week) < 26
and sale_contribution > 3000000
and transaction_count > 2
I want all the product_name values in one row with only one phone number, not duplicated.
Is there a way to do that in BigQuery?
This might help you.
Credit to the helpers of this question:
listagg function alternative in bigquery
You can use STRING_AGG() for csv or ARRAY_AGG() if you want a list-like structure (array). Then GROUP BY the other two columns.
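A minimal sketch of that applied to the query from the question (table, column, and filter names are copied from it; grouping by phone alone is an assumption, and the unused join to raw_cms_users is dropped):
select cus.phone,
STRING_AGG(line.product_name, ', ') as product_names_csv,
ARRAY_AGG(line.product_name IGNORE NULLS) as product_names
from `dtm_med.t1_customer` cus
left join `dtm_med.t2_toa_total_sales_line` line on cus.phone = line.phone
where date_diff(current_date(), date(latest_order_date), week) < 26
and sale_contribution > 3000000
and transaction_count > 2
group by cus.phone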

Unnesting repeated records to a single row in Big Query

I have a dataset that includes repeated records. When I unnest them I get 2 rows, 1 per nested record.
Before the unnest (raw data):
After the unnest, using this query:
SELECT
eventTime,
participant.id
FROM
`public.table`,
UNNEST(people) AS participant
WHERE
verb = 'event'
These are actually 2 rows that are expanded to 4. I've been trying to unnest into a single row so I have 3 columns,
eventTime, buyer.Id, seller.Id.
I've been trying to use REPLACE to build a struct of the unnested content, but I cannot figure out how to do it. Any pointers, documentation, or steps that could help me out?
Consider the approach below. TO_JSON_STRING(t) serves as a per-row key, so the two participants produced by the UNNEST can be grouped back into a single row, and PIVOT then spreads the buyer and seller ids into separate columns.
SELECT * EXCEPT(key) FROM (
SELECT
eventTime,
participant.id,
personEventRole,
TO_JSON_STRING(t) key
FROM `public.table` t,
UNNEST(people) AS participant
WHERE verb = 'event'
)
PIVOT (MIN(id) FOR personEventRole IN ('buyer', 'seller'))
If applied to the sample data in your question, the output is:

Is there a way to check if any items in a string array are in a string in Snowflake/Redshift?

I am looking for a way to check if a string contains any words in another field which is a single string that holds a list of items. Something like this...
id items (STRING)
1 burger;hotdog
I have a second dataset that might look like...
transaction_id description amount
10 cheeseburger 10
Now I need to grab the amount if the description matches any of the items in the first table; in this case it does match the string burger. However, I can't seem to get the SQL right: if I were to use LIKE ANY in Snowflake, I'd need to pass in ('%burger%','%hotdog%'), which are two separate strings, and I can't make those calls explicit because each id/item permutation may be different in the first table. In Redshift, when I try to use
CASE WHEN lower(t.description) SIMILAR TO '%(' || replace(items,';','|') || ')%' then amount END
I get the following error: Specified types or functions (one per INFO message) not supported on Redshift tables.
Thanks in advance!
If you're wanting a Snowflake answer:
WITH keys AS (
SELECT * FROM VALUES (1,'burger;hotdog') a(id,items)
), data AS (
SELECT * FROM VALUES (10,'cheeseburger',10) b(transaction_id, description, amount)
), seq_keys AS (
SELECT s.seq_id, f.value as key
FROM (
SELECT seq8() as seq_id, k.*
FROM keys AS k
) AS s
,lateral flatten(input=>split(s.items,';')) F
)
SELECT d.*, sk.*
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%'||sk.key||'%'
gives:
TRANSACTION_ID DESCRIPTION AMOUNT SEQ_ID KEY
10 cheeseburger 10 0 "burger"
If you take the rows distinct on SEQ_ID you can de-dupe when multiple keys match. I would also be inclined to add an ID to the "data" table.
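A minimal sketch of that de-duplication, replacing the final SELECT of the query above (QUALIFY with ROW_NUMBER is standard Snowflake; keeping one row per transaction_id and ordering by seq_id are assumptions):
SELECT d.*, sk.key
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%'||sk.key||'%'
QUALIFY ROW_NUMBER() OVER (PARTITION BY d.transaction_id ORDER BY sk.seq_id) = 1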

How can I aggregate Jsonb columns in postgres using another column type

I have the following data in a Postgres table, where data is a jsonb column. I would like to get the result as:
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract data from the jsonb column named data. Then aggregate the rows with GROUP BY to get one row for each distinct field_type. Each aggregate also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression so that you get two separate columns for the different counts. On top of that, apply the coalesce function to return 0 instead of NULL when some information is missing; in your example that is briefings for Data Science.
At the outer level of the statement, now that we have the results as a table with those fields, we need to build a jsonb object per row and aggregate them all into one row. For that we use jsonb_build_object, passing it pairs consisting of a field name and its value. That leaves us with 3 rows of data, each holding a separate jsonb object. Since we want only one row (an aggregated JSON array) in the output, we apply jsonb_agg on top of that. This gives the result you're looking for.
Code
Check LIVE DEMO to see how it works.
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).
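For instance, a minimal sketch using jsonb_agg (the table name the_table and its data column are taken from the snippet above):
select jsonb_agg(data) as agg_data
from the_table;
-- returns a single row holding one jsonb array with every value of data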

Select with filters on nested JSON array

Postgres 10: I have a table and a query below:
CREATE TABLE individuals (
uid character varying(10) PRIMARY KEY,
data jsonb
);
SELECT data->'files' FROM individuals WHERE uid = 'PDR7073706'
It returns this structure:
[
{"date":"2017-12-19T22-35-49","type":"indiv","name":"PDR7073706_indiv_2017-12-19T22-35-49.jpeg"},
{"date":"2017-12-19T22-35-49","type":"address","name":"PDR7073706_address_2017-12-19T22-35-49.pdf"}
]
I'm struggling with adding two filters by date and time. Like (illegal pseudo-code!):
WHERE 'type' = "indiv"
or like:
WHERE 'type' = "indiv" AND max('date')
It is probably easy, but I can't crack this nut, and need your help!
Assuming data type jsonb for lack of info.
Use the containment operator @> for the first clause (WHERE 'type' = "indiv"):
SELECT data->'files'
FROM individuals
WHERE uid = 'PDR7073706'
AND data -> 'files' @> '[{"type":"indiv"}]';
Can be supported with various kinds of indexes. See:
Query for array elements inside JSON type
Index for finding an element in a JSON array
The second clause (AND max('date')) is more tricky. Assuming you mean:
Get rows where the JSON array element with "type":"indiv" also has the latest "date".
SELECT i.*
FROM individuals i
JOIN LATERAL (
SELECT *
FROM jsonb_array_elements(data->'files')
ORDER BY to_timestamp(value ->> 'date', 'YYYY-MM-DD"T"HH24-MI-SS') DESC NULLS LAST
LIMIT 1
) sub ON sub.value -> 'type' = '"indiv"'::jsonb
WHERE uid = 'PDR7073706'
AND data -> 'files' @> '[{"type":"indiv"}]' -- optional; may help performance
to_timestamp(value ->> 'date', 'YYYY-MM-DD"T"HH24-MI-SS') is my educated guess on your undeclared timestamp format. Details in the manual here.
The last filter is logically redundant and optional, but it may help performance (a lot) if it is selective (only a few rows qualify) and you have a matching index as advised:
AND data -> 'files' @> '[{"type":"indiv"}]'
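A minimal sketch of such a matching index (the index name and the jsonb_path_ops operator class are assumptions):
CREATE INDEX individuals_files_gin_idx ON individuals USING gin ((data -> 'files') jsonb_path_ops);
-- an expression GIN index that supports the @> containment filter on data -> 'files'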
Related:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
Update nth element of array using a WHERE clause