Spark INLINE vs. LATERAL VIEW EXPLODE differences?

In Spark, for the following use case, I'd like to understand the main differences between using INLINE and EXPLODE. I'm not sure whether there are performance implications, whether one method is preferred over the other, or whether there are use cases where one is appropriate and the other is not.
The use case is to select 2 fields from a complex data type (array of structs); my instinct was to use INLINE, since it explodes an array of structs.
For example:
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank',
                              'age', 40,
                              'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria',
                              'age', 51,
                              'state', 'Georgia')) AS array_of_structs
),
inline_data AS (
    SELECT id,
           INLINE(array_of_structs)
    FROM sample
)
SELECT id,
       name AS person_name,
       age AS person_age
FROM inline_data
And using LATERAL VIEW EXPLODE:
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank',
                              'age', 40,
                              'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria',
                              'age', 51,
                              'state', 'Georgia')) AS array_of_structs
)
SELECT id,
       person.name,
       person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people AS person
The documentation clearly states what each of these does, but I'd like to better understand when to pick one over the other.

The EXPLODE UDTF generates rows of structs (a single column of type struct), and to get the person's name you need to use person.name:
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank',
                              'age', 40,
                              'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria',
                              'age', 51,
                              'state', 'Georgia')) AS array_of_structs
)
SELECT id,
       person.name,
       person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people AS person
Result:
id,name,age
1,frank,40
1,maria,51
The INLINE UDTF generates a row set with N columns (N = the number of top-level elements in the struct), so you do not need the dot notation person.name, because name and the other struct elements are already extracted by INLINE:
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank',
                              'age', 40,
                              'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria',
                              'age', 51,
                              'state', 'Georgia')) AS array_of_structs
)
SELECT id,
       name,
       age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people AS name, age, state
Result:
id,name,age
1,frank,40
1,maria,51
Both INLINE and EXPLODE are UDTFs and require LATERAL VIEW in Hive; in Spark they work fine without a lateral view. The only difference is that EXPLODE returns a dataset of array elements (structs in your case), while INLINE returns the struct elements already extracted. With INLINE you need to list all the struct elements, like this: LATERAL VIEW inline(array_of_structs) exploded_people AS name, age, state
From a performance perspective, INLINE and EXPLODE behave the same; you can use the EXPLAIN command to check the plan. Extracting struct elements in the UDTF or after the UDTF does not affect performance.
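For example, a quick way to compare them (a sketch; the exact plan text varies by Spark version, but the INLINE and EXPLODE variants should show an equivalent Generate node):
-- Sketch: prefix either query with EXPLAIN and compare the physical plans.
EXPLAIN
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank', 'age', 40, 'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria', 'age', 51, 'state', 'Georgia')) AS array_of_structs
)
SELECT id,
       person.name,
       person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people AS person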
INLINE requires you to describe all struct elements (in Hive) while EXPLODE does not, so EXPLODE may be more convenient if you do not need to extract all struct elements, or if you do not need to extract elements at all (see the sketch below). INLINE is convenient when you need all or most of the struct elements.
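For instance, if you do not need to extract any elements, EXPLODE can return the rows as whole structs (a sketch reusing the same sample data):
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank', 'age', 40, 'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria', 'age', 51, 'state', 'Georgia')) AS array_of_structs
)
SELECT id,
       person -- a single column of type struct<name, age, state>, nothing extracted
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people AS person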
Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because a lateral view is required.
In Spark this will also work:
inline_data AS (
    SELECT id,
           EXPLODE(array_of_structs) AS person
    FROM sample
)
And to get the age column you need to use person.age.
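Put together, a minimal end-to-end sketch of that Spark-only form:
WITH sample AS (
    SELECT 1 AS id,
           array(NAMED_STRUCT('name', 'frank', 'age', 40, 'state', 'Texas'),
                 NAMED_STRUCT('name', 'maria', 'age', 51, 'state', 'Georgia')) AS array_of_structs
),
inline_data AS (
    SELECT id,
           EXPLODE(array_of_structs) AS person -- no LATERAL VIEW needed in Spark
    FROM sample
)
SELECT id,
       person.name AS person_name,
       person.age AS person_age
FROM inline_data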

Related

How to pivot table without using aggregate function SQL?

I would like to pivot a table, but PIVOT() requires an aggregation function such as max(), min(), mean(), count(), or sum(). I don't need to use these functions, but I need to transform my table without them.
Source table:

SOURCE  ATTRIBUTE  CATEGORY
GOOGLE  MOVIES     1
YAHOO   JOURNAL    2
GOOGLE  MUSIC      1
AOL     MOVIES     3
The new table should be like this:
ATTRIBUTE  GOOGLE  YAHOO  AOL
MOVIES     1              3
I would be grateful if someone could help me.
The pivot syntax requires an aggregate function. It's not optional.
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
SELECT ... FROM ... PIVOT ( <aggregate_function> ( <pivot_column> )
FOR <value_column> IN ( <pivot_value_1> [ , <pivot_value_2> ... ] ) )
[ ... ]
In the Snowflake documentation, optional syntax is in square brackets. The <aggregate_function> is not in square brackets, so it's required. If you only have one value per cell, any of the aggregate functions you listed except count will work and give the same result.
create or replace table T1("SOURCE" string, ATTRIBUTE string, CATEGORY int);
insert into T1("SOURCE", attribute, category) values
('GOOGLE', 'MOVIES', 1),
('YAHOO', 'JOURNAL', 2),
('GOOGLE', 'MUSIC', 1),
('AOL', 'MOVIES', 3);
select *
from T1
PIVOT ( sum ( CATEGORY )
for "SOURCE" in ( 'GOOGLE', 'YAHOO', 'AOL' ));
ATTRIBUTE  'GOOGLE'  'YAHOO'  'AOL'
MOVIES     1         null     3
MUSIC      1         null     null
JOURNAL    null      2        null
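Since each (ATTRIBUTE, SOURCE) pair has at most one row in this data, swapping sum for max (or min) should give the same result:
-- Sketch: any of these aggregates collapses a single value to itself.
select *
from T1
PIVOT ( max ( CATEGORY )
        for "SOURCE" in ( 'GOOGLE', 'YAHOO', 'AOL' ));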
Here's an alternative approach utilising ARRAY_AGG(), which may be more flexible. For example, if you wanted to pivot the attribute instead of the category: easy peasy. The category here appears to be more of a classification or label than something we'd want summed.
The array doesn't need to be constrained to any particular data type, and this can be extended to pivots with many objects.
WITH CTE AS (
    SELECT 'GOOGLE' SOURCE, 'MOVIES' ATTRIBUTE, 1 CATEGORY UNION ALL
    SELECT 'YAHOO', 'JOURNAL', 2 UNION ALL
    SELECT 'GOOGLE', 'MUSIC', 1 UNION ALL
    SELECT 'AOL', 'MOVIES', 3
)
SELECT
    ATTRIBUTE, GOOGLE[0] GOOGLE, YAHOO[0] YAHOO, AOL[0] AOL
FROM
    CTE PIVOT (ARRAY_AGG(CATEGORY) FOR SOURCE IN ('GOOGLE', 'YAHOO', 'AOL')
    ) AS A (ATTRIBUTE, GOOGLE, YAHOO, AOL);
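And a sketch along the same lines that pivots the attribute instead (note that the GOOGLE cell for category 1 collects both MOVIES and MUSIC, something sum() could never do with strings):
WITH CTE AS (
    SELECT 'GOOGLE' SOURCE, 'MOVIES' ATTRIBUTE, 1 CATEGORY UNION ALL
    SELECT 'YAHOO', 'JOURNAL', 2 UNION ALL
    SELECT 'GOOGLE', 'MUSIC', 1 UNION ALL
    SELECT 'AOL', 'MOVIES', 3
)
SELECT
    CATEGORY, GOOGLE, YAHOO, AOL
FROM
    CTE PIVOT (ARRAY_AGG(ATTRIBUTE) FOR SOURCE IN ('GOOGLE', 'YAHOO', 'AOL')
    ) AS A (CATEGORY, GOOGLE, YAHOO, AOL);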

Aggregate rows into complex json - Athena

I have an Athena query which gives me the table below for given IDs:
ID      ID_2  description  state
First   row   abc          [MN, SD]
Second  row   xyz          [AL, CA]
I'm using the array_agg function to merge states into an array. Within the query itself I want to convert the output into the format below:
ID     ID_2  custom_object
First  row   {'description': 'abc', 'state': ['MN', 'SD']}
I'm looking at the Athena docs but haven't found a function that does just this. I'm experimenting with multimap_agg and map_agg, but this seems too complex to achieve. How can I do this? Please help!
You can do it after aggregation by creating a map and combining casts to json:
-- sample data
WITH dataset (ID, ID_2, description, state) AS (
    VALUES ('First', 'row', 'abc', array['MN', 'SD']),
           ('Second', 'row', 'xyz', array['AL', 'CA'])
)
-- query
select ID,
       ID_2,
       cast(
           map(
               array['description', 'state'],
               array[cast(description as json),
                     cast(state as json)]
           ) as json
       ) custom_object
from dataset
Output:

ID      ID_2  custom_object
First   row   {"description":"abc","state":["MN","SD"]}
Second  row   {"description":"xyz","state":["AL","CA"]}

Are there any direct ways to use GROUP BY with maps without explode in Hive?

I have the following table schema in Hive:
id: string
results: map<int, struct<followers: map<string, array<int>>>>
Sample record:
abcd-1234-efgh-5678 {12: {"followers": {"Richard": [1, 2, 3], "Rick": [99, 10, 100]}}, 88: {"followers": {"Mina": [88], "Ray": [100000, 88, 7213, 3]}}}
To get the minimum per follower per result, I used this query:
SELECT result_id, follower_name, MIN(follower_count)
FROM the_table
LATERAL VIEW explode(results) results AS result_id, result
LATERAL VIEW explode(result.followers) followers AS follower_name, follower_count_list
LATERAL VIEW explode(follower_count_list) follower_count_list AS follower_count
GROUP BY result_id, follower_name;
This works correctly, but it produces redundant values in the first column when a result has many followers.
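For the sample record above, the query should return something like this (note the repeated result_id values, which is the redundancy mentioned):

result_id  follower_name  min
12         Richard        1
12         Rick           10
88         Mina           88
88         Ray            3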
Are there any other ways to use GROUP BY on maps and arrays?

BigQuery: Store semi-structured JSON data

I have data which can have varying JSON keys. I want to store all of this data in BigQuery and then explore the available fields later.
My structure will be like so:
[
    {id: 1111, data: {a: 27, b: 62, c: 'string'}},
    {id: 2222, data: {a: 27, c: 'string'}},
    {id: 3333, data: {a: 27}},
    {id: 4444, data: {a: 27, b: 62, c: 'string'}},
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query and see how often each key appears, and basically run queries over all records treating, for example, a key as though it were its own column.
Side note: this data is coming from URL query strings; maybe it is best to push the full URL and then use functions to run the analysis?
There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string and then use the JSON_EXTRACT function to pull out the values it can find; it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement; this would not be
# needed for a real table.
WITH records AS (
    SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" AS data
    UNION ALL
    SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" AS data
    UNION ALL
    SELECT 3333 AS id, "{\"a\":27}" AS data
    UNION ALL
    SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" AS data
)
# Example query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
    SELECT id,
           CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
           CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue  # Extract & cast as an INT
    FROM records
)
# Results
# Row | aSum | bSum
# 1   | 108  | 124
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error-prone
Cons
Storage costs will be slightly higher, since you have to store all the characters needed to serialize the JSON.
Queries will run slower than using pure native SQL.
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a WITH to create a sample table
WITH records AS (
    SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
        (1111, [("a","27"),("b","62"),("c","string")]),
        (2222, [("a","27"),("c","string")]),
        (3333, [("a","27")]),
        (4444, [("a","27"),("b","62"),("c","string")])
    ])
),
## Using another WITH table to take records and unnest them, to be joined later
recordsUnnested AS (
    SELECT id, key, value
    FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
    SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
    FROM records R
    LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
    LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# Results
# Row | aSum | bSum
# 1   | 108  | 124
As you can see, performing a similar query is still rather complex. You also still have to store values as strings and CAST them to other types when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be less than JSON
Queries will typically execute faster.
Cons
The syntax is more complex, not as straightforward
Hope that helps, good luck.

Oracle SQL aggregate rows into column listagg with condition

I have the following (simplified) layout for tables:
TABLE blocks (id)
TABLE content (id, blockId, order, data, type)
content.blockId is a foreign key to blocks.id. The idea is that in the content table you have many content entries with different types for one block.
I am now looking for a query that can provide me with an aggregation based on a blockId where all the content entries of the 3 different types are concatenated and put into respective columns.
I have already started and found the listagg function, which is working well. The following statement lists all the content entries in a column:
SELECT listagg(c.data, ',') WITHIN GROUP (ORDER BY c.order) FROM content c WHERE c.blockId = 330;
However, the concatenated string contains all the data elements of the block in one column. What I would like to achieve is for them to be put into separate columns based on the type. For example, given the following rows in content:
1, 1, 0, "content1", "FRAGMENT"
2, 1, 1, "content2", "BULK"
3, 1, 3, "content4", "FRAGMENT"
4, 1, 2, "content3", "FRAGMENT"
Now I want to get 2 columns as output, one FRAGMENT and one BULK, where FRAGMENT contains "content1;content3;content4" and BULK contains "content2".
Is there an efficient way of achieving this?
You can use CASE expressions:
SELECT listagg(CASE WHEN c.type = 'FRAGMENT' THEN c.data END, ',') WITHIN GROUP (ORDER BY c.order) AS fragments,
       listagg(CASE WHEN c.type = 'BULK' THEN c.data END, ',') WITHIN GROUP (ORDER BY c.order) AS bulks
FROM content c
WHERE c.blockId = 330;
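With the four sample rows above, this should return a single row along these lines:

FRAGMENTS                   | BULKS
----------------------------+---------
content1,content3,content4  | content2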
As an alternative, if you want it more dynamic, you could pivot the outcome.
Note that this will only work in Oracle 11g Release 2 or later. Here's an example of how it could look:
select * from
    (with dataSet as (
        select 1 idV, 1 bulkid, 0 orderV, 'content1' dataV, 'FRAGMENT' typeV from dual union
        select 2, 1, 1, 'content2', 'BULK' from dual union
        select 3, 1, 3, 'content4', 'FRAGMENT' from dual union
        select 4, 1, 2, 'content3', 'FRAGMENT' from dual)
     select typeV,
            listagg(dataSet.dataV, ',') WITHIN GROUP (ORDER BY orderV) OVER (PARTITION BY typeV) dataV
     from dataSet)
pivot
(
    max(dataV)
    for typeV in ('BULK', 'FRAGMENT')
)
Output:

BULK     | FRAGMENT
---------------------------------------
content2 | content1,content3,content4
The important things here:
OVER (PARTITION BY typeV): this acts like a GROUP BY for the listagg, concatenating everything that has the same typeV.
for typeV in ('BULK', 'FRAGMENT'): this gathers the data for BULK and FRAGMENT and produces separate columns for each.
max(dataV): simply there to provide an aggregate function; otherwise the pivot won't work.