How to sort a column in one table based on the rank in another table - hive

I have a table, Table 1, with columns Customer_id and Item_List, where the items are listed in arbitrary order:
Customer_id Item_List
22 1,4,3,2
24 6,3,2,1
23 4,5,7,8
Table 2 holds the rank of each item, based on the highest value:
Item_Id Item_Rank
1 8
2 5
3 3
4 4
5 2
6 7
7 1
8 6
I want to produce a table that has Customer_id with the corresponding item list ordered by the Item_Rank in Table 2:
Customer_id Ranked_Item_List
22 3,4,2,1
24 3,2,6,1
23 7,5,4,8
I don't know any efficient method to do it in hive. Any suggestions?

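I can think of two approaches: write your own UDF to avoid the explode, or explode the list, join it to the rank table, and collect the items back in rank order:
select customer_id, concat_ws(',', collect_list(item_id)) as ranked_item_list
from (
select e.customer_id, e.item_id, r.item_rank
from (select customer_id, item_id
from table1 lateral view explode(split(item_list, ',')) x as item_id) e
join table2 r on cast(e.item_id as int) = r.item_id -- this should be done as a map join if your rank table is not big
distribute by customer_id sort by customer_id, item_rank
) s
group by customer_id;
This relies on collect_list keeping the order produced by sort by, which generally holds in practice but is worth verifying on your Hive version. As I said, depending on the size of your data, you could instead create a UDF that applies the sort at the mapper level using your lookup table.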
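To make the UDF idea concrete, here is a minimal sketch of the per-row logic such a function might apply. It is plain Python rather than an actual Hive UDF; the function name `sort_items_by_rank` and the in-memory dictionary are assumptions for illustration, and the choice to put unranked items last is mine, since the question does not say what should happen to them.

```python
# Hypothetical helper: the core logic a custom UDF (or streaming script) might apply to each row.
def sort_items_by_rank(item_list, rank_by_item):
    """Reorder a comma-separated item list by ascending rank from a lookup table."""
    items = [int(x) for x in item_list.split(",") if x.strip()]
    # Assumption: items missing from the lookup sort last.
    ordered = sorted(items, key=lambda i: rank_by_item.get(i, float("inf")))
    return ",".join(str(i) for i in ordered)

# Table 2 from the question, held as an in-memory lookup (item_id -> item_rank).
rank_by_item = {1: 8, 2: 5, 3: 3, 4: 4, 5: 2, 6: 7, 7: 1, 8: 6}

# Table 1 from the question (customer_id -> item_list).
table1 = {22: "1,4,3,2", 24: "6,3,2,1", 23: "4,5,7,8"}

ranked = {cid: sort_items_by_rank(items, rank_by_item) for cid, items in table1.items()}
print(ranked)
```

Because the lookup table is small, each mapper can hold it in memory and sort its rows locally, so no explode, join, or shuffle is needed.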

Related

What is the proper way to complete cross-tab on the following segment in SQL?

I create frequencies on one column in SQL in a standard way.
My code is
select id , count(*) as counts
from TABLE
group by id
order by counts desc
Suppose the output is as follows for six id
id counts
-- -----
1 3 two id have 3 counts per
2 3
---------
3 6 three id have 6 counts per
4 6
5 6
---------
6 2 one id has 2 counts
How can I produce the following?
nid counts
--- ------
1 2
2 3
3 6
I am writing in a hive environment, but that should be standard SQL.
Thanks in advance for answering.
You want two levels of aggregation:
select count(*) as nid, counts
from (select id , count(*) as counts
from TABLE
group by id
) c
group by counts
order by counts;
I call this a "histogram-of-histograms" query. I usually include min(id) and max(id) in the outer select, so I have examples of ids with given frequencies.
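If you want to try the two-level aggregation on the sample data, the sketch below runs it in an in-memory SQLite database from Python. The table name `t` and the aliases `nid`, `min_id`, and `max_id` are placeholders I picked; the query itself is plain SQL and should carry over to Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")

# Sample data shaped like the question: ids 1-2 occur 3 times,
# ids 3-5 occur 6 times, and id 6 occurs twice.
occurrences = {1: 3, 2: 3, 3: 6, 4: 6, 5: 6, 6: 2}
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(i,) for i, n in occurrences.items() for _ in range(n)],
)

# Inner query: frequency per id. Outer query: how many ids share each frequency.
rows = conn.execute("""
    SELECT COUNT(*) AS nid, counts, MIN(id) AS min_id, MAX(id) AS max_id
    FROM (SELECT id, COUNT(*) AS counts FROM t GROUP BY id) c
    GROUP BY counts
    ORDER BY counts
""").fetchall()

for row in rows:
    print(row)
```

The `min_id` and `max_id` columns give one example id at each end of every frequency bucket.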

Derby DB last x row average

I have the following table structure.
ITEM
ID TITLE
-- -----
1 A
2 B
3 C
4 D
5 E
6 F
TOTAL
ID ITEMID VALUE
-- ------ -----
1 2 6
2 1 4
3 3 3
4 3 8
5 1 2
6 5 4
7 4 5
8 2 8
9 2 7
10 1 3
11 2 2
12 3 6
I am using Apache Derby. I need to calculate an average in SQL: for each item ID, the average VALUE of its last 3 records in TOTAL.
That is, for ITEM.ID 1, I go to the TOTAL table, select the last 3 rows associated with ITEMID 1, and take their average. In Derby, I am able to do this for a given item ID, but I cannot do it without specifying an ID. Here is what I have tried:
SELECT ITEM.ID, AVG(VALUE) FROM ITEM, TOTAL WHERE TOTAL.ITEMID = ITEM.ID GROUP BY ITEM.ID
This SQL gives the average for every item, but it is calculated over all of the item's rows in the TOTAL table. I need the last 3 records only, so I changed the SQL to this:
SELECT AVG(VALUE) FROM (SELECT ROW_NUMBER() OVER() AS ROWNUM, TOTAL.* FROM TOTAL WHERE ITEMID = 1) AS TR WHERE ROWNUM > (SELECT COUNT(ID) FROM TOTAL WHERE ITEMID = 1) - 3
This works if I supply an item ID such as 1 or 2, but I cannot do it for all items without giving an item ID.
I tried the same thing in Oracle using PARTITION BY and it worked, but Derby does not support partitioning. There is WINDOW, but I could not make use of it.
The Oracle version:
SELECT ITEMID, AVG(VALUE) FROM(SELECT ITEMID, VALUE, COUNT(*) OVER (PARTITION BY ITEMID) QTY, ROW_NUMBER() OVER (PARTITION BY ITEMID ORDER BY ID) IDX FROM TOTAL ORDER BY ITEMID, ID) WHERE IDX > QTY -3 GROUP BY ITEMID ORDER BY ITEMID
I need to use derby DB for its portability.
The desired output is this
RESULT
-----------------
ITEMID | AVERAGE
1 (9/3)
2 (17/3)
3 (17/3)
4 (5/1)
5 (4/1)
6 NULL
As you have noticed, Derby's support for the SQL 2003 "OLAP Operations" is incomplete.
There was some initial work (see https://wiki.apache.org/db-derby/OLAPOperations), but that work was only partially completed.
I don't believe anyone is currently working on adding more functionality to Derby in this area.
So yes, Derby has a row_number function, but no, Derby does not (currently) have partition by.
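One possible workaround, which is not part of the answer above and which I have not run against Derby, is to avoid window functions altogether: keep a TOTAL row only if fewer than 3 rows for the same ITEMID have a larger ID, then aggregate. The sketch below checks that idea on the sample data using SQLite from Python. It assumes that "last" means "highest TOTAL.ID", and the aliases `total_value` and `n` are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE total (id INTEGER PRIMARY KEY, itemid INTEGER, value INTEGER);
    INSERT INTO item VALUES (1,'A'),(2,'B'),(3,'C'),(4,'D'),(5,'E'),(6,'F');
    INSERT INTO total VALUES (1,2,6),(2,1,4),(3,3,3),(4,3,8),(5,1,2),(6,5,4),
                             (7,4,5),(8,2,8),(9,2,7),(10,1,3),(11,2,2),(12,3,6);
""")

# A row is among the last 3 for its item when fewer than 3 rows
# with the same itemid have a larger id (correlated subquery).
query = """
    SELECT i.id, SUM(r.value) AS total_value, COUNT(r.value) AS n
    FROM item i
    LEFT JOIN (
        SELECT t.itemid, t.value
        FROM total t
        WHERE (SELECT COUNT(*) FROM total t2
               WHERE t2.itemid = t.itemid AND t2.id > t.id) < 3
    ) r ON r.itemid = i.id
    GROUP BY i.id
    ORDER BY i.id
"""
rows = conn.execute(query).fetchall()
for item_id, total_value, n in rows:
    print(item_id, f"({total_value}/{n})" if n else None)
```

Replacing the SUM and COUNT pair with AVG(r.value) gives the average directly; depending on the column type, the database may need a cast to a decimal type to avoid an integer result. The correlated subquery rescans TOTAL for every row, so an index on (ITEMID, ID) is likely to matter on larger tables.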

SQL - How to Count from multiple tables and add them together

I am trying to do a count across multiple tables, where each table can contain multiple entries per person. Here is some simplified sample data. There are actually more than 3 tables, but this is just so I can understand how to do it:
table1 table2 table3
person_ID person_id person_id
1 1 2
2 1 2
3 2 1
4 2 4
5 4 5
I'm trying to count how many times each person_id appears across all the tables, so the output would be the following. Note that person_id is a key: I don't want to add up the ID values themselves (not 2+2+2+2). I want the number of appearances in each table, added together to give the total number of appearances across all tables. Basically, I'm trying to find the total number of items attached to each person_id.
person_id total
1 4
2 4
3 1
4 3
5 2
Select the ids from all the tables together with union all. That result can then be grouped by id and counted:
select person_id, count(*) as count
from
(
select person_id from table1
union all
select person_id from table2
union all
select person_id from table3
) tmp
group by person_id
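Here is a small sketch that runs this query on the sample data, using an in-memory SQLite database from Python. It assumes the first column in the question belongs to table1. With the rows exactly as listed, person 2 comes out as 5 (once in table1, twice each in table2 and table3), not the 4 shown in the expected output, so one of the two is probably a typo in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sample columns from the question (assuming the first one is table1).
sample = {
    "table1": [1, 2, 3, 4, 5],
    "table2": [1, 1, 2, 2, 4],
    "table3": [2, 2, 1, 4, 5],
}
for name, ids in sample.items():
    conn.execute(f"CREATE TABLE {name} (person_id INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])

# UNION ALL keeps duplicates, so every appearance is counted.
rows = conn.execute("""
    SELECT person_id, COUNT(*) AS total
    FROM (
        SELECT person_id FROM table1
        UNION ALL
        SELECT person_id FROM table2
        UNION ALL
        SELECT person_id FROM table3
    ) tmp
    GROUP BY person_id
    ORDER BY person_id
""").fetchall()
print(rows)
```

Using plain union instead of union all would remove duplicates and report 1 for every id, which is why union all matters here.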

How to create cartesian products between records for each group separately?

Suppose I sell services that span a time interval (days, months or even years). I have a Products table, where each product is listed, together with the Customer_ID and Service_start and Service_end date.
Now I want to list all combinations of pairs (Service_start, Service_end) inside each customer; e.g. (table sorted by Customer_ID)
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 5-May-2014 20-Dec-2014 1
3 7-Jul-2014 9-Sep-2014 1
4 13-Jan-2014 13-Jan-2015 2
.. ... ... ...
I want to turn this into
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 2-Feb-2014 20-Dec-2014 1
3 2-Feb-2014 9-Sep-2014 1
4 5-May-2014 8-Aug-2014 1
5 5-May-2014 20-Dec-2014 1
6 5-May-2014 9-Sep-2014 1
7 7-Jul-2014 8-Aug-2014 1
8 7-Jul-2014 20-Dec-2014 1
9 7-Jul-2014 9-Sep-2014 1
10 13-Jan-2014 13-Jan-2015 2
... ... ... ...
The table is too big to fit into memory.
How can this be achieved in SQL? Or SAS?
You can do this in SAS and SQL. Here is the SQL idea:
select ss.service_start, se.service_end, ss.customer_id
from (select distinct customer_id, service_start from table) ss join
(select distinct customer_id, service_end from table) se
on ss.customer_id = se.customer_id;
This is compatible with SAS proc sql.
In most dialects of SQL, you can add the lp column using row_number() over (order by customer_id, service_start, service_end). In SAS, you can use monotonic() or a data step after proc sql.
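As a quick check of the self-join, the sketch below runs it on the four sample rows using SQLite from Python. The table name `products` comes from the question; writing the dates as ISO strings and numbering the rows with Python's `enumerate` (instead of row_number()) are choices I made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        lp INTEGER, service_start TEXT, service_end TEXT, customer_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "2014-02-02", "2014-08-08", 1),
        (2, "2014-05-05", "2014-12-20", 1),
        (3, "2014-07-07", "2014-09-09", 1),
        (4, "2014-01-13", "2015-01-13", 2),
    ],
)

# Joining only on customer_id pairs every start with every end of the same customer.
pairs = conn.execute("""
    SELECT ss.service_start, se.service_end, ss.customer_id
    FROM (SELECT DISTINCT customer_id, service_start FROM products) ss
    JOIN (SELECT DISTINCT customer_id, service_end FROM products) se
      ON ss.customer_id = se.customer_id
    ORDER BY ss.customer_id, ss.service_start, se.service_end
""").fetchall()

# Number the result rows, standing in for the lp column.
for lp, row in enumerate(pairs, start=1):
    print(lp, *row)
```

Customer 1 has three starts and three ends, so it contributes nine rows, while customer 2 contributes one. The database does the pairing per customer, so the full table never has to be held in client memory.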

SQL - Order by amount of occurrences

It's my first question here so I hope I can explain it well enough,
I want to order my data by amount of occurrences in the table.
My table is like this:
id Daynr
1 2
1 4
2 4
2 5
2 6
3 1
4 2
4 5
And I want to sort it like this:
id Daynr
3 1
1 2
1 4
4 2
4 5
2 4
2 5
2 6
Player #3 has one day in the table, and Player #1 has 2.
My table is named "dayid"
Both id and Daynr are foreign keys, and together they form the primary key.
I hope this explains my problem well enough. Please ask if you need more information; it's my first time here.
Thanks in advance
You can do this by counting the number of times that things occur for each id. Most databases support window functions, so you can do this as:
select id, daynr
from (select t.*, count(*) over (partition by id) as cnt
from dayid t
) t
order by cnt, id;
You can also express this as a join:
select t.id, t.daynr
from dayid as t inner join
(select id, count(*) as cnt
from dayid
group by id
) as tg
on t.id = tg.id
order by tg.cnt, t.id;
Note that both of these include the id in the order by. That way, if two ids have the same count, all rows for the id will appear together.
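To see the join version on the sample data, here is a sketch that runs it in an in-memory SQLite database from Python. The table name `dayid` comes from the question. I added `daynr` as a last sort key so that the rows within each id come out in the order shown in the desired output; that extra key is my addition, not part of the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dayid (id INTEGER, daynr INTEGER, PRIMARY KEY (id, daynr))")
conn.executemany(
    "INSERT INTO dayid VALUES (?, ?)",
    [(1, 2), (1, 4), (2, 4), (2, 5), (2, 6), (3, 1), (4, 2), (4, 5)],
)

# Count rows per id, join the counts back, and sort by count first.
rows = conn.execute("""
    SELECT t.id, t.daynr
    FROM dayid AS t
    INNER JOIN (SELECT id, COUNT(*) AS cnt FROM dayid GROUP BY id) AS tg
      ON t.id = tg.id
    ORDER BY tg.cnt, t.id, t.daynr
""").fetchall()

for row in rows:
    print(row)
```

Ids 1 and 4 both occur twice, and the `t.id` sort key is what keeps their rows grouped together instead of interleaved.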