Sorting within collect_list() in Hive

Let's say I have a hive table that looks like this:
ID event order_num
------------------------
A red 2
A blue 1
A yellow 3
B yellow 2
B green 1
...
I'm trying to use collect_list to generate a list of events for each ID. So something like the following:
SELECT ID,
collect_list(event) as events_list
FROM table
GROUP BY ID;
However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:
ID events_list
------------------------
A ["blue","red","yellow"]
B ["green","yellow"]
I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?
Thanks!

So, I found the answer here. The trick is to use a subquery with a DISTRIBUTE BY and SORT BY statement. See below:
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
-- Collect it
SELECT subquery.ID,
collect_list(subquery.event) as events_list
FROM (
SELECT
table1.ID,
table1.event,
table1.order_num
FROM table1
DISTRIBUTE BY
table1.ID
SORT BY
table1.ID,
table1.order_num
) subquery
GROUP BY subquery.ID;

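The function sort_array() sorts the items returned by collect_list(). Note that it orders them by their own values (here, alphabetically by event), not by order_num:
select ID, sort_array(collect_list(event)) as events_list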
from table
group by ID;

This is my first answer on Stack Overflow, but I think the approach is very useful.
Collect (order_num, value) structs, sort the array of structs, then take the second field:
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
select ID
,sort_array(collect_list(struct(order_num, item_score))).col2 as item_list
from (
select ID
,event
,order_num
,concat(event, ':', order_num) as item_score
from table1
) t0
group by ID
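If only the event names are wanted in the list (rather than the "event:order_num" strings built above), the same struct trick can probably be applied to event directly. This is a minimal sketch, assuming a Hive version where sort_array() accepts arrays of structs and where struct() names its fields col1, col2, as the answer above relies on. It reuses the sample table1 data from the question.

```sql
WITH table1 AS (
    SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
    SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
    SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
    SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
    SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
SELECT ID,
       -- Assumption: structs are compared field by field, so putting
       -- order_num first makes it the sort key; .col2 then extracts event.
       sort_array(collect_list(struct(order_num, event))).col2 AS events_list
FROM table1
GROUP BY ID;
```

Because the ordering happens inside each group's array, this does not depend on how rows reach the reducer.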

Try the following:
WITH tmp AS (
SELECT * FROM data DISTRIBUTE BY ID SORT BY ID, order_num
)
SELECT ID, collect_list(event)
FROM tmp
GROUP BY ID

Related

Select number of IDs in more than one table (from three tables)

I need the count of this:
select distinct ID
from (
select ID from A
union all
select ID from B
union all
select ID from C
) ids
GROUP BY ID HAVING COUNT(*) > 1;
but I have no idea how to do it.
Use a subquery:
select count(*)
from (select ID
from (select ID from A
union all
select ID from B
union all
select ID from C
) ids
group by ID
having count(*) > 1
) i;
SELECT DISTINCT is almost never needed with GROUP BY and definitely not in this case.
You just want to count the IDs that appear two or more times across tables A, B, and C. The SQL is below:
select count(1) from (
select
id,
count(1)
from
(
select ID from A
union all
select ID from B
union all
select ID from C
) ids
group by id having(count(1)>1)
) tmp
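One caveat with the queries above: with UNION ALL, an ID that occurs twice inside a single table also passes the count(*) > 1 test. If IDs can repeat within a table (an assumption, since the question does not say), a possible variant tags each row with its source and counts distinct sources instead. The src column and its 'A'/'B'/'C' labels are introduced here purely for illustration.

```sql
SELECT COUNT(*) AS ids_in_multiple_tables
FROM (
    SELECT ID
    FROM (
        SELECT ID, 'A' AS src FROM A   -- src is a hypothetical helper column
        UNION ALL
        SELECT ID, 'B' AS src FROM B
        UNION ALL
        SELECT ID, 'C' AS src FROM C
    ) ids
    GROUP BY ID
    -- count the tables an ID appears in, not the rows
    HAVING COUNT(DISTINCT src) > 1
) i;
```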

SQL assigning incremental ID to subgroups

As the title says, I'm trying to add an extra column to a table that increments every time the string in another column changes.
I would like to do this in a query.
Example:
MyCol GroupID
Cable 1
Cable 1
Foo 2
Foo 2
Foo 2
Fuzz 3
Fizz 4
Tv 5
Tv 5
The GroupID column is what I want to accomplish.
We can be sure that MyCol's strings will be the same in each subgroup (Foo will always be Foo, etc).
Thanks in advance
If I understand correctly, you can use dense_rank():
select t.*, dense_rank() over (order by MyCol) as groupid
from t;
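As a rough, self-contained illustration of the dense_rank() idea, the sketch below builds the question's sample rows in a CTE named t (FROM-less SELECT is assumed to be supported; Oracle would need FROM dual). Note that the IDs follow the sort order of MyCol rather than the order of appearance, so they will not match the question's example exactly.

```sql
WITH t AS (
    SELECT 'Cable' AS MyCol UNION ALL
    SELECT 'Cable' UNION ALL
    SELECT 'Foo'   UNION ALL
    SELECT 'Foo'   UNION ALL
    SELECT 'Foo'   UNION ALL
    SELECT 'Fuzz'  UNION ALL
    SELECT 'Fizz'  UNION ALL
    SELECT 'Tv'    UNION ALL
    SELECT 'Tv'
)
SELECT MyCol,
       -- equal strings share a rank; ranks have no gaps
       DENSE_RANK() OVER (ORDER BY MyCol) AS GroupID
FROM t
ORDER BY GroupID;
```

With an ordinary alphabetical collation, Fizz sorts before Foo and Fuzz, so it gets GroupID 2 here instead of 4.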
You could build a temporary table with the distinct values of MyCol, take the row number in that table as the GroupID, and join the numbered result back to your table.
This is a rough example in Oracle:
WITH data AS
(SELECT 'Cable' MyCol FROM dual
UNION ALL
SELECT 'Cable' FROM dual
UNION ALL
SELECT 'Foo' FROM dual
UNION ALL
SELECT 'Foo' FROM dual
UNION ALL
SELECT 'Foo' FROM dual
UNION ALL
SELECT 'Fuzz' FROM dual
UNION ALL
SELECT 'Fizz' FROM dual
UNION ALL
SELECT 'Tv' FROM dual
UNION ALL
SELECT 'Tv' FROM dual
),
tablename AS
(SELECT * FROM data
),
temp AS
( SELECT DISTINCT mycol FROM tablename
),
temp2 AS
( SELECT mycol, rownum AS groupid from temp
)
SELECT tablename.mycol, temp2.groupid FROM temp2 JOIN tablename ON temp2.mycol = tablename.mycol
You could also look into the Tabibitosan method, adapted to the fact that your grouping column is a string.

Big Query view (table without duplicate rows)

I need to create a view that is pretty much just like some table with some simple transformations and I want to make sure the values in a particular column are not duplicate.
So let's say the table looks like this:
ID, ColumnA, ColumnB
-------------------
1 cars shirts
2 tvs dogs
1 fingers computers
And the resulting view would look like this:
ID, ColumnA, ColumnB
-------------------
1 cars shirts
2 tvs dogs
So, is there an equivalent to SELECT distinct(ID), ColumnA, ColumnB?
What's the most efficient way to do it?
If you just want an arbitrary row for each ID, use ANY_VALUE:
#standardSQL
WITH Input AS (
SELECT 1 AS ID, 'cars' AS ColumnA, 'shirts' AS ColumnB UNION ALL
SELECT 2 AS ID, 'tvs' AS ColumnA, 'dogs' AS ColumnB UNION ALL
SELECT 1 AS ID, 'fingers' AS ColumnA, 'computers' AS ColumnB
)
SELECT
ANY_VALUE(t).*
FROM Input AS t
GROUP BY t.ID;
Or you can use the ARRAY_AGG trick to select the latest row based on a condition.
Below is for BigQuery Standard SQL
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, 'cars' AS columnA, 'shirts' AS columnB UNION ALL
SELECT 2, 'tvs', 'dogs' UNION ALL
SELECT 1, 'fingers', 'computers'
)
SELECT r.*
FROM (
SELECT ARRAY_AGG(t ORDER BY columnA LIMIT 1)[OFFSET (0)] AS r
FROM yourTable t
GROUP BY id
)
-- ORDER BY id
Note: you need some rule for choosing the row with cars over the row with fingers.
The version above, as an example, keeps the first row in ascending order of columnA.
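Since the question asks for a view, the ARRAY_AGG query above could be wrapped in a view definition. This is only a sketch: the project, dataset, and view names (`myproject`, `mydataset`, `dedup_view`) are hypothetical placeholders, and the source table is assumed to live at `myproject.mydataset.yourTable`.

```sql
-- Hypothetical names; replace with your own project, dataset, and table.
CREATE OR REPLACE VIEW `myproject.mydataset.dedup_view` AS
SELECT r.*
FROM (
  -- keep one row per id: the first in ascending order of columnA
  SELECT ARRAY_AGG(t ORDER BY columnA LIMIT 1)[OFFSET(0)] AS r
  FROM `myproject.mydataset.yourTable` AS t
  GROUP BY id
);
```

Changing the ORDER BY inside ARRAY_AGG is how you would encode a different rule for which duplicate survives.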

How can I filter an SQL table to show only keys with exactly N entries?

Let's say I have a table with a column named KEY.
I want to find all KEYs which are in the table exactly 3 times.
How can I do that?
I managed to get a list of how many entries I have for each KEY, like this:
select count(*) from my_table group by KEY;
but how can I filter it to show only the keys whose count is 3?
select KEY
from my_table
group by KEY
having count(*) = 3
The having clause filters after grouping (where filters before).
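To make the WHERE versus HAVING distinction concrete, here is a small sketch on made-up data. The column is called key_ rather than KEY to sidestep reserved-word quoting, and the sample rows are invented for illustration; FROM-less SELECT is assumed to be supported by your database.

```sql
WITH my_table AS (
    SELECT 'a' AS key_ UNION ALL
    SELECT 'a' UNION ALL
    SELECT 'a' UNION ALL
    SELECT 'b' UNION ALL
    SELECT 'b' UNION ALL
    SELECT 'b' UNION ALL
    SELECT 'c' UNION ALL
    SELECT 'c' UNION ALL
    SELECT 'c' UNION ALL
    SELECT 'c'
)
SELECT key_, COUNT(*) AS n
FROM my_table
WHERE key_ <> 'b'       -- row filter: drops 'b' rows before grouping
GROUP BY key_
HAVING COUNT(*) = 3;    -- group filter: keeps groups of exactly 3 rows
```

Here 'b' has three rows but is removed by WHERE before any counting happens, and 'c' is grouped but rejected by HAVING because it has four rows, so only 'a' is returned.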
select `key`
from my_table
group by `KEY`
having count(*) = 3;
select KEY
from my_table
group by KEY
having count(1) = 3
Try the ROW_NUMBER approach:
;WITH Temp_tab AS
( SELECT '1' Key_,'az' Key_Value
UNION SELECT '1' ,'a5'
UNION SELECT '1' ,'a6'
UNION SELECT '2' ,'a1'
UNION SELECT '3' ,'a2'
UNION SELECT '4' ,'a3'
UNION SELECT '1' ,'a4'
UNION SELECT '3' ,'a21'
UNION SELECT '3' ,'a22'),
Tab2 AS
(SELECT *, ROW_NUMBER() over(partition BY key_ ORDER BY key_) count_ FROM Temp_Tab)
-- a key has exactly 3 rows when its highest row number is 3
SELECT key_
FROM tab2
GROUP BY key_
HAVING MAX(count_) = 3
The same code for your table:
;WITH temp_table AS
(SELECT *, ROW_NUMBER() OVER (PARTITION BY key_ ORDER BY key_) count_ FROM my_table)
SELECT key_ FROM temp_table GROUP BY key_ HAVING MAX(count_) = 3

Case on union of multiple unions and issue with alias

I have 2 series of unions which I wish to join by another union. In the first one, I have 3 Selects and in the second one I have 2 different Selects.
Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id
I tried to do a union between these two, but it made things pretty complicated. My biggest issue is with my alias. My second is with the case linked to my value columns, which I wish to name value.
Select (alias).id,
Case
When foo.value= 0 or bar.value=1 THEN 1
Else 0
End as value
from table1 t1
Join (Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
UNION
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id) as (alias)
on ??.id=??.id
I wrote my CASE the way I think it should be written, but normally, when there is more than one column with the same name, SQL reports it as ambiguous. I am still unsure whether I should use UNION or INTERSECT, but I assume either would be written the same way. How should I deal with this?
If I'm reading this right, you probably want something like this:
SELECT ...
FROM ( ... union #1 ) AS u1
JOIN (... union #2 ) AS u2 ON u1.id = u2.id
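Filled in, that shape might look like the sketch below. The question leaves out the FROM clauses of the inner selects, so source_a through source_e are hypothetical placeholder tables, and min_value / max_value are aliases introduced here. Giving each aggregate its own alias is what removes the ambiguity in the CASE.

```sql
SELECT u1.id,
       CASE
           WHEN u1.min_value = 0 OR u2.max_value = 1 THEN 1
           ELSE 0
       END AS value
FROM (
    -- union #1: minimum value per id
    SELECT t1.id, MIN(foo.value) AS min_value
    FROM table1 t1
    JOIN (
        SELECT id, value FROM source_a   -- hypothetical source tables
        UNION
        SELECT id, value FROM source_b
        UNION
        SELECT id, value FROM source_c
    ) AS foo ON foo.id = t1.id
    GROUP BY t1.id
) AS u1
JOIN (
    -- union #2: maximum value per id
    SELECT t1.id, MAX(bar.value) AS max_value
    FROM table1 t1
    JOIN (
        SELECT id, value FROM source_d
        UNION
        SELECT id, value FROM source_e
    ) AS bar ON bar.id = t1.id
    GROUP BY t1.id
) AS u2 ON u1.id = u2.id;
```

An inner join keeps only ids present in both derived tables; if ids missing from one side should still appear, a LEFT or FULL OUTER JOIN would be the thing to try instead.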