windowing function avg in Hive with - over (order by colName) - hive

i'm trying to understand how windowing function avg works, and somehow it seems to not be working as i expect.
here is the dataset :
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When i fire the following query ->
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following ->
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
Question is - how is the avg(qty) being calculated.
Since i'm not using partition by, i would expect the avg(qty) to be the same for all rows.
Any ideas ?

if you want to have same avg(qty) to get for all rows then remove order by sellerid in over clause, then you are going to have 19.545454545454547 value for all the rows.
Query to get same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If we include order by sellerid in over clause then you are getting cumulative avg is caluculated for each sellerid.
i.e. for
sellerid 1 you are having 2 records total 2 records with qty as 10,30 so avg would be
(10+30)/2.
sellerid 2 you are having 2 records total 4 records with qty as 20,20 so avg would be
(10+30+20+20)/4 = 20.0
sellerid 3 you are having 5 records total 9 records with qty as so 10,10,15,20,30 avg would be
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4 avg is 19.545454545454547
when we include over clause then this is an expected behavior from hive.

Related

Incremental/Update in hive

I have a hive external table with data say, (version less than 0.14)
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10011 | 10 | 3 | 0 |
| 10012 | 9 | 0 | 40 |
| 10015 | 10 | 3 | 0 |
| 10017 | 9 | 0 | 40 |
+--------+------+------+------+
And I have a delta file having data given below.
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10012 | 50 | 3 | 10 | --> update
| 10013 | 29 | 0 | 40 | --> insert
| 10014 | 10 | 3 | 0 | --> update
| 10013 | 19 | 0 | 40 | --> update
| 10015 | 70 | 3 | 0 | --> update
| 10016 | 17 | 0 | 40 | --> insert
+--------+------+------+------+
How can I update my hive table with the delta file, without using sqoop. Any help on how to proceed will be great! Thanks.
This is because there is duplicates in the file. How do you know which you should keep? The last one?
In that case you can use, for example, the row_number and then get the maximum value. Something like that.
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
( SELECT *, row_number() over( partition by id ) as row_num
,COUNT(*) OVER (PARTITION BY id) AS cnt
FROM temp_table
) tmp
ON initial.id=tmp.id
WHERE row_num=cnt
OR row_num IS NULL;
Output:
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+
You can load the file to a temporary table in hive and then execute a FULL OUTER JOIN between the two tables.
Query Example:
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
temp_table tmp on initial.id=tmp.id;
Output
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 29 | 0 | 40 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+

SQL window excluding current group?

I'm trying to provide rolled up summaries of the following data including only the group in question as well as excluding the group. I think this can be done with a window function, but I'm having problems with getting the syntax down (in my case Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but having trouble with creating the appropriate joining key.
You can do:
select year(date), month(date), product,
count(*) as ct, avg(rating) as avg_rating,
sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
(sum(count(*)) over (partition by year(date), month(date)) - count(*))
) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" is a bit tricky. You need to add everything up and subtract out the current row -- and calculate the average by doing the sum divided by the count.

Get the Id of the matched data from other table. No duplicates of ID from both tables

Here is my table A.
| Id | GroupId | StoreId | Amount |
| 1 | 20 | 7 | 15000 |
| 2 | 20 | 7 | 1230 |
| 3 | 20 | 7 | 14230 |
| 4 | 20 | 7 | 9540 |
| 5 | 20 | 7 | 24230 |
| 6 | 20 | 7 | 1230 |
| 7 | 20 | 7 | 1230 |
Here is my table B.
| Id | GroupId | StoreId | Credit |
| 12 | 20 | 7 | 1230 |
| 14 | 20 | 7 | 15000 |
| 15 | 20 | 7 | 14230 |
| 16 | 20 | 7 | 1230 |
| 17 | 20 | 7 | 7004 |
| 18 | 20 | 7 | 65523 |
I want to get this result without getting duplicate Id of both table.
I need to get the Id of table B and A where the Amount = Credit.
| A.ID | B.ID | Amount |
| 1 | 14 | 15000 |
| 2 | 12 | 1230 |
| 3 | 15 | 14230 |
| 4 | null | 9540 |
| 5 | null | 24230 |
| 6 | 16 | 1230 |
| 7 | null | 1230 |
My problem is when I have 2 or more same Amount in table A, I get duplicate ID of table B. which should be null. Please help me. Thank you.
I think you want a left join. But this is tricky because you have duplicate amounts, but you only want one to match. The solution is to use row_number():
select . . .
from (select a.*, row_number() over (partition by amount order by id) as seqnum
from a
) a left join
(select b.*, row_number() over (partition by credit order by id) as seqnum
from b
)b
on a.amount = b.credit and a.seqnum = b.seqnum;
Another approach, I think simplier and shorter :)
select ID [A.ID],
(select top 1 ID from TABLE_B where Credit = A.Amount) [B.ID],
Amount
from TABLE_A [A]

Join Distinct or First

I have a table structure for SalesItems, and Sales.
SalesItems is setup something like this
| SaleItemID | SaleID | ProductID | ProductType |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 15 | 1 |
| 4 | 2 | 5 | 2 |
| 5 | 3 | 1 | 1 |
| 6 | 3 | 8 | 5 |
And Sales is setup something like this
| Sale | Cash |
| 1 | 1.00 |
| 2 | 10.00 |
| 3 | 28.50 |
I am trying to export a basic 'Daily History' that uses joins to spit out the information like this.
| Date | StoreID | Type1Sales | Type2Sales | ... | Cash Taken |
| 5/2 | 50 | 50 | 40 | ... | 39.50 |
| 5/3 | 50 | 10 | 32.50 | ... | 48.50 |
The issue I'm having is if I do an inner join From Sales to Sales Items, I'll end up with this.
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash |
| 1 | 1 | 1 | 1 | 1 | 1.00 |
| 2 | 1 | 2 | 2 | 1 | 1.00 |
| 3 | 1 | 15 | 1 | 1 | 1.00 |
| 4 | 2 | 5 | 2 | 2 | 10.00 |
| 5 | 3 | 1 | 1 | 3 | 28.50 |
| 6 | 3 | 8 | 5 | 3 | 28.50 |
So if I do a SUM(Cash), then I'll end up returning $70.00, instead of the correct $39.50. I'm not the best with joins, so I've been researching outer joins and such, but none of those seem to work as it's still matching up. Is there a way to only match on the FIRST instance, and return NULL for the rest? For example, something like this
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash |
| 1 | 1 | 1 | 1 | 1 | 1.00 |
| 2 | 1 | 2 | 2 | 1 | NULL |
| 3 | 1 | 15 | 1 | 1 | NULL |
| 4 | 2 | 5 | 2 | 2 | 10.00 |
| 5 | 3 | 1 | 1 | 3 | 28.50 |
| 6 | 3 | 8 | 5 | 3 | NULL |
Or do you have any other suggestions for returning back the correct amount of Cash for each particular day?
Use DISTINCT(SaleID) in your SELECT to return a single row for each Sale ID.

Ask about query in sql server

i have table like this:
| ID | id_number | a | b |
| 1 | 1 | 0 | 215 |
| 2 | 2 | 28 | 8952 |
| 3 | 3 | 10 | 2000 |
| 4 | 1 | 0 | 215 |
| 5 | 1 | 0 |10000 |
| 6 | 3 | 10 | 5000 |
| 7 | 2 | 3 |90933 |
I want to sum a*b where id_number is same, what the query to get all value for every id_number? for example the result is like this :
| ID | id_number | result |
| 1 | 1 | 0 |
| 2 | 2 | 523455 |
| 3 | 3 | 70000 |
This is a simple aggregation query:
select id_number, sum(a*b)
from t
group by id_number
I'm not sure what the first column is for.