I have two tables:
customers
+---------+-------+
| cust_id | name |
+---------+-------+
| 1 | Tom |
+---------+-------+
| 2 | John |
+---------+-------+
| 3 | Lisa |
+---------+-------+
| 4 | Wendy |
+---------+-------+
purchases
+---------------+-------------+---------+
| purchase_date | purchase_id | cust_id |
+---------------+-------------+---------+
| 2021-01-01 | 1 | 1 |
+---------------+-------------+---------+
| 2021-01-01 | 2 | 1 |
+---------------+-------------+---------+
| 2021-01-01 | 3 | 2 |
+---------------+-------------+---------+
| 2021-01-01 | 4 | 1 |
+---------------+-------------+---------+
| 2021-01-01 | 5 | 4 |
+---------------+-------------+---------+
| 2021-01-02 | 6 | 3 |
+---------------+-------------+---------+
| 2021-01-02 | 7 | 3 |
+---------------+-------------+---------+
| 2021-01-02 | 8 | 2 |
+---------------+-------------+---------+
| 2021-01-02 | 9 | 1 |
+---------------+-------------+---------+
| 2021-01-02 | 10 | 4 |
+---------------+-------------+---------+
| 2021-01-03 | 11 | 2 |
+---------------+-------------+---------+
| 2021-01-03 | 12 | 2 |
+---------------+-------------+---------+
| 2021-01-03 | 13 | 3 |
+---------------+-------------+---------+
| 2021-01-03 | 14 | 3 |
+---------------+-------------+---------+
I want to query the count of unique purchasing customers by date (easy) and the cust_id of the customer who made the most purchases by date. If more than one customer made the same number of purchases on the same date, I want to show the lesser cust_id. The results should look like this:
+---------------+------------------+-----------------+
| purchase_date | unique_customers | biggest_spender |
+---------------+------------------+-----------------+
| 2021-01-01 | 3 | 1 |
+---------------+------------------+-----------------+
| 2021-01-02 | 4 | 3 |
+---------------+------------------+-----------------+
| 2021-01-03 | 2 | 2 |
+---------------+------------------+-----------------+
Here is a query for PostgreSQL. It uses mode() to find the biggest spender, that is, the most frequent cust_id for each date in the purchases table. Ordering by cust_id inside WITHIN GROUP is intended to resolve ties toward the lower cust_id.
SELECT p.purchase_date, count(DISTINCT p.cust_id) AS unique_customers, mode() WITHIN GROUP (ORDER BY p.cust_id) AS biggest_spender
FROM purchases p
GROUP BY p.purchase_date
ORDER BY p.purchase_date;
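For engines without mode(), the same result can be sketched with row_number(), which also makes the tie-break explicit. The snippet below is an illustration, not part of the answer: it loads the sample purchases into an in-memory SQLite database through Python's sqlite3 module (an assumption made only so the example is runnable; SQLite 3.25 or later is needed for window functions). The CTE names per_cust and ranked are made up for the example.

```python
import sqlite3

# Sample rows from the purchases table above: (purchase_date, purchase_id, cust_id).
purchases = [
    ("2021-01-01", 1, 1), ("2021-01-01", 2, 1), ("2021-01-01", 3, 2),
    ("2021-01-01", 4, 1), ("2021-01-01", 5, 4), ("2021-01-02", 6, 3),
    ("2021-01-02", 7, 3), ("2021-01-02", 8, 2), ("2021-01-02", 9, 1),
    ("2021-01-02", 10, 4), ("2021-01-03", 11, 2), ("2021-01-03", 12, 2),
    ("2021-01-03", 13, 3), ("2021-01-03", 14, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (purchase_date TEXT, purchase_id INTEGER, cust_id INTEGER)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", purchases)

query = """
WITH per_cust AS (   -- one row per (date, customer) with that customer's purchase count
    SELECT purchase_date, cust_id, COUNT(*) AS n
    FROM purchases
    GROUP BY purchase_date, cust_id
), ranked AS (
    SELECT purchase_date, cust_id,
           COUNT(*) OVER (PARTITION BY purchase_date) AS unique_customers,
           ROW_NUMBER() OVER (PARTITION BY purchase_date
                              ORDER BY n DESC, cust_id) AS rn  -- ties: lower cust_id first
    FROM per_cust
)
SELECT purchase_date, unique_customers, cust_id AS biggest_spender
FROM ranked
WHERE rn = 1
ORDER BY purchase_date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On the sample data this reproduces the three rows of the desired result, including customer 2 winning the tie on 2021-01-03.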
Related
Given two tables, sales_reps and sales:
sales_reps
+--------+-------+
| rep_id | name |
+--------+-------+
| 1 | Tony |
+--------+-------+
| 2 | Jim |
+--------+-------+
| 3 | Laura |
+--------+-------+
| 4 | Sam |
+--------+-------+
sales
+------------+----------+--------+-------------+
| sale_date | sales_id | rep_id | sale_amount |
+------------+----------+--------+-------------+
| 2021-01-01 | 1 | 1 | 2000 |
+------------+----------+--------+-------------+
| 2021-01-01 | 2 | 1 | 4000 |
+------------+----------+--------+-------------+
| 2021-01-01 | 3 | 2 | 3000 |
+------------+----------+--------+-------------+
| 2021-01-01 | 4 | 1 | 1000 |
+------------+----------+--------+-------------+
| 2021-01-01 | 5 | 4 | 5000 |
+------------+----------+--------+-------------+
| 2021-01-02 | 6 | 3 | 10000 |
+------------+----------+--------+-------------+
| 2021-01-02 | 7 | 3 | 10000 |
+------------+----------+--------+-------------+
| 2021-01-02 | 8 | 2 | 4000 |
+------------+----------+--------+-------------+
| 2021-01-02 | 9 | 1 | 6000 |
+------------+----------+--------+-------------+
| 2021-01-02 | 10 | 4 | 2000 |
+------------+----------+--------+-------------+
| 2021-01-03 | 11 | 2 | 8000 |
+------------+----------+--------+-------------+
| 2021-01-03 | 12 | 2 | 1000 |
+------------+----------+--------+-------------+
| 2021-01-03 | 13 | 3 | 4500 |
+------------+----------+--------+-------------+
| 2021-01-03 | 14 | 3 | 4500 |
+------------+----------+--------+-------------+
I want to show how many unique reps made sales by date (easy) and the rep_id and name of the rep who generated the highest total sales amount for each date. If more than one rep generated the same greatest total sales amount for a date, I want to show the lesser rep_id and that rep's name. The results should look like this:
+------------+-------------+----------+----------+
| sale_date | unique_reps | best_rep | rep_name |
+------------+-------------+----------+----------+
| 2021-01-01 | 3 | 1 | Tony |
+------------+-------------+----------+----------+
| 2021-01-02 | 4 | 3 | Laura |
+------------+-------------+----------+----------+
| 2021-01-03 | 2 | 2 | Jim |
+------------+-------------+----------+----------+
Laura and Jim both generated $9,000 in sales on 2021-01-03. But Jim's rep_id is 2, which is less than Laura's rep_id of 3. So Jim is displayed as the rep who generated the greatest sales amount on 2021-01-03.
Postgres has a mode() function, but it doesn't allow you to choose which rep to choose in the case of ties. For that, you can be more explicit:
select distinct on (s.sale_date) s.sale_date,
       count(*) over (partition by s.sale_date) as unique_reps,
       s.rep_id as best_rep, sr.name as rep_name
from sales s join
     sales_reps sr
     on s.rep_id = sr.rep_id
group by s.sale_date, s.rep_id, sr.name
order by s.sale_date, sum(s.sale_amount) desc, s.rep_id;
What is this doing? It aggregates by date and sales rep. Because of the distinct on, though, it keeps only one row per date. In that row:
count(*) over (partition by s.sale_date) counts the number of reps (it counts the rows after the aggregation, and there is one such row per rep and date).
s.rep_id is chosen based on the order by criteria: first the highest total sales, then the lowest rep_id.
sr.name is the name of that sales rep.
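The two steps described above (aggregate by date and rep, then keep the first row per date) can be traced on the sample data. This is only a sketch: SQLite has no DISTINCT ON, so the "first row per date" step is emulated in Python, and the variable names are made up for the example.

```python
import sqlite3

reps = [(1, "Tony"), (2, "Jim"), (3, "Laura"), (4, "Sam")]
sales = [  # (sale_date, sales_id, rep_id, sale_amount)
    ("2021-01-01", 1, 1, 2000), ("2021-01-01", 2, 1, 4000), ("2021-01-01", 3, 2, 3000),
    ("2021-01-01", 4, 1, 1000), ("2021-01-01", 5, 4, 5000), ("2021-01-02", 6, 3, 10000),
    ("2021-01-02", 7, 3, 10000), ("2021-01-02", 8, 2, 4000), ("2021-01-02", 9, 1, 6000),
    ("2021-01-02", 10, 4, 2000), ("2021-01-03", 11, 2, 8000), ("2021-01-03", 12, 2, 1000),
    ("2021-01-03", 13, 3, 4500), ("2021-01-03", 14, 3, 4500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_reps (rep_id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, sales_id INTEGER, rep_id INTEGER, sale_amount INTEGER)"
)
conn.executemany("INSERT INTO sales_reps VALUES (?, ?)", reps)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", sales)

# Step 1: aggregate by date and rep, sorted the same way as the ORDER BY in the answer.
aggregated = conn.execute("""
    SELECT s.sale_date, s.rep_id, sr.name, SUM(s.sale_amount) AS total
    FROM sales s JOIN sales_reps sr ON s.rep_id = sr.rep_id
    GROUP BY s.sale_date, s.rep_id, sr.name
    ORDER BY s.sale_date, total DESC, s.rep_id
""").fetchall()

# Step 2: keep the first row per date (the effect of DISTINCT ON) and count the
# aggregated rows per date (the effect of count(*) over (partition by sale_date)).
reps_per_date = {}
best = {}
for sale_date, rep_id, name, total in aggregated:
    reps_per_date[sale_date] = reps_per_date.get(sale_date, 0) + 1
    best.setdefault(sale_date, (rep_id, name))  # only the first row per date is kept

result = [(d, reps_per_date[d], *best[d]) for d in sorted(best)]
for row in result:
    print(row)
```

For 2021-01-03 the aggregated rows are Jim and Laura with 9000 each, sorted by rep_id, so the first row is Jim.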
TABLE 2 : trip_delivery_sales_lines
+-------+---------------------+------------+----------+------------+-------------+--------+
| Sl no | Order_date          | Partner_id | Route_id | Product_id | Product qty | amount |
+-------+---------------------+------------+----------+------------+-------------+--------+
| 1     | 2020-08-01 04:25:35 | 34567      | 152      | 432        | 2           | 100    |
| 2     | 2021-09-11 02:25:35 | 34572      | 130      | 312        | 4           | 150    |
| 3     | 2020-05-10 04:25:35 | 34567      | 152      | 432        | 3           | 123    |
| 4     | 2021-02-16 01:10:35 | 34572      | 130      | 432        | 5           | 123    |
| 5     | 2020-02-19 01:10:35 | 34567      | 152      | 432        | 2           | 600    |
| 6     | 2021-03-20 01:10:35 | 34569      | 152      | 123        | 1           | 123    |
| 7     | 2021-04-23 01:10:35 | 34570      | 152      | 432        | 4           | 200    |
| 8     | 2021-07-08 01:10:35 | 34567      | 152      | 432        | 3           | 32     |
| 9     | 2019-06-28 01:10:35 | 34570      | 152      | 432        | 2           | 100    |
| 10    | 2018-11-14 01:10:35 | 34570      | 152      | 432        | 5           | 20     |
+-------+---------------------+------------+----------+------------+-------------+--------+
From table 2, we need to find the partners on route 152 and, for each of them, the sum of product_qty over their last 2 sales (selected by order_date descending). The result is shown in table 3. The rows used are:
34567: serial numbers [1, 8]
34570: serial numbers [7, 9]
34569: serial number [6]
TABLE 3 : RESULT OBTAINED FROM TABLE 1,2
+------------+-------+
| Partner_id | count |
+------------+-------+
| 34567      | 5     |
| 34569      | 1     |
| 34570      | 6     |
+------------+-------+
From table 4 we want the leaf count for each of the partner_ids above.
TABLE 4 : coupon_leaf
+------------+-------+
| Partner_id | Leaf  |
+------------+-------+
| 34567      | XYZ1  |
| 34569      | XYZ2  |
| 34569      | DDHC  |
| 34567      | DVDV  |
| 34570      | DVFDV |
| 34576      | FVFV  |
| 34567      | FVV   |
+------------+-------+
From that we get:
34567: 3
34569: 2
34570: 1
TABLE 5: result obtained from TABLE 4
+------------+-------+
| Partner_id | count |
+------------+-------+
| 34567      | 3     |
| 34569      | 2     |
| 34570      | 1     |
+------------+-------+
Now we want to compare tables 3 and 5:
If partner_id count [table 3] > partner_id count [table 5]
Print partner_id
I want a single query that does all of these operations.
The distinct partner_ids can be found from table 1 with:
SELECT DISTINCT ts.partner_id
FROM trip_delivery_sales ts
WHERE ts.route_id = '152';
This answers the original version of the question.
You seem to want to compare totals after aggregating tables 2 and 4. I don't know what table 1 is for; it doesn't seem to contribute anything.
So:
select *
from (select tdsl.partner_id, sum(tdsl.product_qty) as sum_quantity  -- product_qty is the "Product qty" column
      from (select tdsl.*,
                   -- seqnum = 1 is the partner's most recent sale
                   row_number() over (partition by tdsl.partner_id order by tdsl.order_date desc) as seqnum
            from trip_delivery_sales_lines tdsl
            where tdsl.route_id = 152
           ) tdsl
      where seqnum <= 2
      group by tdsl.partner_id
     ) tdsl left join
     (select cl.partner_id, count(*) as leaf_cnt
      from coupon_leaf cl
      group by cl.partner_id
     ) cl
     on cl.partner_id = tdsl.partner_id
where leaf_cnt is null or sum_quantity > leaf_cnt;
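As a check of this approach against the sample rows, here is a sketch that runs an equivalent query in an in-memory SQLite database through Python's sqlite3 module. The column names (sl_no, product_qty and so on) are assumed spellings of the headers in table 2, and the query takes the last two sales by order_date descending on route 152, as the question describes.

```python
import sqlite3

lines = [  # (sl_no, order_date, partner_id, route_id, product_id, product_qty, amount)
    (1, "2020-08-01 04:25:35", 34567, 152, 432, 2, 100),
    (2, "2021-09-11 02:25:35", 34572, 130, 312, 4, 150),
    (3, "2020-05-10 04:25:35", 34567, 152, 432, 3, 123),
    (4, "2021-02-16 01:10:35", 34572, 130, 432, 5, 123),
    (5, "2020-02-19 01:10:35", 34567, 152, 432, 2, 600),
    (6, "2021-03-20 01:10:35", 34569, 152, 123, 1, 123),
    (7, "2021-04-23 01:10:35", 34570, 152, 432, 4, 200),
    (8, "2021-07-08 01:10:35", 34567, 152, 432, 3, 32),
    (9, "2019-06-28 01:10:35", 34570, 152, 432, 2, 100),
    (10, "2018-11-14 01:10:35", 34570, 152, 432, 5, 20),
]
leaves = [(34567, "XYZ1"), (34569, "XYZ2"), (34569, "DDHC"), (34567, "DVDV"),
          (34570, "DVFDV"), (34576, "FVFV"), (34567, "FVV")]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trip_delivery_sales_lines (
    sl_no INTEGER, order_date TEXT, partner_id INTEGER, route_id INTEGER,
    product_id INTEGER, product_qty INTEGER, amount INTEGER)""")
conn.execute("CREATE TABLE coupon_leaf (partner_id INTEGER, leaf TEXT)")
conn.executemany("INSERT INTO trip_delivery_sales_lines VALUES (?, ?, ?, ?, ?, ?, ?)", lines)
conn.executemany("INSERT INTO coupon_leaf VALUES (?, ?)", leaves)

result = conn.execute("""
    SELECT q.partner_id, q.sum_quantity, cl.leaf_cnt
    FROM (SELECT partner_id, SUM(product_qty) AS sum_quantity
          FROM (SELECT l.*,
                       ROW_NUMBER() OVER (PARTITION BY partner_id
                                          ORDER BY order_date DESC) AS seqnum
                FROM trip_delivery_sales_lines l
                WHERE route_id = 152) AS ranked
          WHERE seqnum <= 2              -- the two most recent sales per partner
          GROUP BY partner_id) AS q
    LEFT JOIN (SELECT partner_id, COUNT(*) AS leaf_cnt
               FROM coupon_leaf
               GROUP BY partner_id) AS cl
           ON cl.partner_id = q.partner_id
    WHERE cl.leaf_cnt IS NULL OR q.sum_quantity > cl.leaf_cnt
    ORDER BY q.partner_id
""").fetchall()
for row in result:
    print(row)
```

Partner 34569 is filtered out because its quantity of 1 does not exceed its 2 leaves, which matches the comparison of tables 3 and 5.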
I'm trying to understand how the windowing function avg works, and it does not seem to behave as I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I'm not using partition by, I would expect avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) for all rows, remove order by sellerid from the over clause; every row then gets 19.545454545454547.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If you include order by sellerid in the over clause, you get a cumulative average instead. With an order by and no explicit frame, the window defaults to range between unbounded preceding and current row, so each row's average covers every row whose sellerid is less than or equal to its own:
sellerid 1: 2 records (2 in total so far) with qty 10, 30, so the avg is
(10+30)/2 = 20.0
sellerid 2: 2 records (4 in total so far) with qty 20, 20, so the avg is
(10+30+20+20)/4 = 20.0
sellerid 3: 5 records (9 in total so far) with qty 10, 10, 15, 20, 30, so the avg is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4: 2 records (11 in total) with qty 40, 10, so the avg is
(10+30+20+20+10+10+15+20+30+40+10)/11 = 19.545454545454547
This is the expected behavior in Hive when the over clause contains an order by.
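The arithmetic above can be reproduced outside Hive. The sketch below uses SQLite through Python's sqlite3 module as a stand-in (an assumption for the sake of a runnable example; SQLite applies the same default frame when order by is given without an explicit frame) and compares the windowed average with one computed by hand.

```python
import sqlite3

# (salesid, sellerid, qty) from the winsales sample above, duplicate row included.
winsales = [
    (30001, 3, 10), (10001, 1, 10), (10005, 1, 30), (40001, 4, 40),
    (20001, 2, 20), (40005, 4, 10), (20002, 2, 20), (30003, 3, 15),
    (30004, 3, 20), (30007, 3, 30), (30001, 3, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE winsales (salesid INTEGER, sellerid INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO winsales VALUES (?, ?, ?)", winsales)

# ORDER BY without an explicit frame: running average up to and including all peers.
windowed = dict(conn.execute(
    "SELECT DISTINCT sellerid, AVG(qty) OVER (ORDER BY sellerid) FROM winsales"
).fetchall())

# The same numbers by hand: average over every row with sellerid <= the current one.
by_hand = {}
for seller in sorted({s for _, s, _ in winsales}):
    included = [q for _, s, q in winsales if s <= seller]
    by_hand[seller] = sum(included) / len(included)

# Empty OVER (): a single average over the whole table, identical on every row.
overall = conn.execute("SELECT DISTINCT AVG(qty) OVER () FROM winsales").fetchall()

for seller in sorted(windowed):
    print(seller, windowed[seller], by_hand[seller])
print(overall)
```

Rows with the same sellerid are peers under a range frame, which is why both rows of sellerid 1 show 20.0 rather than 10.0 and then 20.0.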
I'm trying to provide rolled up summaries of the following data including only the group in question as well as excluding the group. I think this can be done with a window function, but I'm having problems with getting the syntax down (in my case Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but I'm having trouble creating the appropriate joining key.
You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(rating)) over (partition by year(date), month(date)) - count(rating))
       ) as avg_rating_other,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as other_ct
from t
group by year(date), month(date), product;
The rating for the "other" products is a bit tricky. You need to add everything up for the month, subtract out the current row's own total, and then calculate the average as that sum divided by the matching count. The divisor uses count(rating) rather than count(*) because avg() ignores NULL ratings, so the "other" average should ignore them too.
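The "total minus own group" idea can be checked on the sample data. This is a sketch only: it uses SQLite via Python's sqlite3 module instead of Hive, aggregates in a CTE first (named g here, a made-up name), extracts year and month with strftime, and multiplies by 1.0 because SQLite would otherwise do integer division.

```python
import sqlite3

ratings = [  # (date, product, rating); None stands for NULL
    ("2018-01-01", "A", 1), ("2018-01-02", "A", 3), ("2018-01-20", "A", 4),
    ("2018-01-27", "A", 5), ("2018-01-29", "A", 4), ("2018-02-01", "A", 5),
    ("2017-01-09", "B", None), ("2017-01-12", "B", 3), ("2017-01-15", "B", 4),
    ("2017-01-28", "B", 4), ("2017-07-21", "B", 2), ("2017-09-21", "B", 5),
    ("2017-09-13", "C", 3), ("2017-09-14", "C", 4), ("2017-09-15", "C", 5),
    ("2017-09-16", "C", 5), ("2018-04-01", "C", 2), ("2018-01-13", "D", 1),
    ("2018-01-14", "D", 2), ("2018-01-24", "D", 3), ("2018-01-31", "D", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, product TEXT, rating INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ratings)

rows = conn.execute("""
    WITH g AS (   -- one row per (year, month, product)
        SELECT CAST(strftime('%Y', date) AS INTEGER) AS yr,
               CAST(strftime('%m', date) AS INTEGER) AS mon,
               product,
               COUNT(*) AS ct, COUNT(rating) AS rated,
               SUM(rating) AS total, AVG(rating) AS avg_rating
        FROM t
        GROUP BY yr, mon, product
    )
    SELECT yr, mon, product, ct, avg_rating,
           -- month total minus own total, divided by month count minus own count
           (SUM(total) OVER (PARTITION BY yr, mon) - total) * 1.0
             / NULLIF(SUM(rated) OVER (PARTITION BY yr, mon) - rated, 0) AS avg_rating_other,
           SUM(ct) OVER (PARTITION BY yr, mon) - ct AS other_ct
    FROM g
    ORDER BY product, yr, mon
""").fetchall()

summary = {(yr, mon, product): (ct, avg, avg_other, other_ct)
           for yr, mon, product, ct, avg, avg_other, other_ct in rows}
for key in sorted(summary):
    print(key, summary[key])
```

Months with a single product end up with a NULL "other" average and an other_ct of 0, as in the desired output.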
I have a table structure for SalesItems and Sales.
SalesItems is set up something like this
| SaleItemID | SaleID | ProductID | ProductType |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 15 | 1 |
| 4 | 2 | 5 | 2 |
| 5 | 3 | 1 | 1 |
| 6 | 3 | 8 | 5 |
And Sales is set up something like this
| Sale | Cash |
| 1 | 1.00 |
| 2 | 10.00 |
| 3 | 28.50 |
I am trying to export a basic 'Daily History' that uses joins to spit out the information like this.
| Date | StoreID | Type1Sales | Type2Sales | ... | Cash Taken |
| 5/2 | 50 | 50 | 40 | ... | 39.50 |
| 5/3 | 50 | 10 | 32.50 | ... | 48.50 |
The issue I'm having is if I do an inner join From Sales to Sales Items, I'll end up with this.
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash |
| 1 | 1 | 1 | 1 | 1 | 1.00 |
| 2 | 1 | 2 | 2 | 1 | 1.00 |
| 3 | 1 | 15 | 1 | 1 | 1.00 |
| 4 | 2 | 5 | 2 | 2 | 10.00 |
| 5 | 3 | 1 | 1 | 3 | 28.50 |
| 6 | 3 | 8 | 5 | 3 | 28.50 |
So if I do a SUM(Cash), then I'll end up returning $70.00, instead of the correct $39.50. I'm not the best with joins, so I've been researching outer joins and such, but none of those seem to work as it's still matching up. Is there a way to only match on the FIRST instance, and return NULL for the rest? For example, something like this
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash |
| 1 | 1 | 1 | 1 | 1 | 1.00 |
| 2 | 1 | 2 | 2 | 1 | NULL |
| 3 | 1 | 15 | 1 | 1 | NULL |
| 4 | 2 | 5 | 2 | 2 | 10.00 |
| 5 | 3 | 1 | 1 | 3 | 28.50 |
| 6 | 3 | 8 | 5 | 3 | NULL |
Or do you have any other suggestions for returning back the correct amount of Cash for each particular day?
Use DISTINCT(SaleID) in your SELECT to return a single row for each Sale ID.
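Another option, offered here as a sketch rather than as part of the answer above, is to collapse SalesItems to one row per sale before joining, so that each Cash value is counted once. The example runs in SQLite through Python's sqlite3 module; because the sample tables contain no date or price columns, it counts items per product type instead of summing sales amounts, and the aliases type1_items and type2_items are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesItems (SaleItemID INTEGER, SaleID INTEGER, "
    "ProductID INTEGER, ProductType INTEGER)"
)
conn.execute("CREATE TABLE Sales (Sale INTEGER, Cash REAL)")
conn.executemany("INSERT INTO SalesItems VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1), (2, 1, 2, 2), (3, 1, 15, 1),
    (4, 2, 5, 2), (5, 3, 1, 1), (6, 3, 8, 5),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?)", [(1, 1.00), (2, 10.00), (3, 28.50)])

# Joining first repeats each sale's Cash once per item, which inflates the sum.
naive_cash = conn.execute(
    "SELECT SUM(s.Cash) FROM Sales s JOIN SalesItems si ON si.SaleID = s.Sale"
).fetchone()[0]

# Aggregating the items to one row per sale first keeps the join one-to-one.
cash, type1_items, type2_items = conn.execute("""
    SELECT SUM(s.Cash), SUM(i.type1_items), SUM(i.type2_items)
    FROM Sales s
    JOIN (SELECT SaleID,
                 SUM(CASE WHEN ProductType = 1 THEN 1 ELSE 0 END) AS type1_items,
                 SUM(CASE WHEN ProductType = 2 THEN 1 ELSE 0 END) AS type2_items
          FROM SalesItems
          GROUP BY SaleID) AS i
      ON i.SaleID = s.Sale
""").fetchone()

print(naive_cash)                      # inflated total from the plain join
print(cash, type1_items, type2_items)  # cash counted once per sale
```

With a date column on Sales, the outer query could group by that date to produce one row per day.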