Getting the top 40% of users by sales - sql

I have a table which has columns date, user_id, sales_amount. The table sample is as below
+------------+---------+--------------+
| date | user_id | sales_amount |
+------------+---------+--------------+
| 2020-01-01 | 1 | 27 |
| 2020-01-01 | 2 | 32 |
| 2020-01-01 | 3 | 17 |
| 2020-01-03 | 1 | 19 |
| 2020-01-03 | 2 | 18 |
| 2020-01-03 | 3 | 40 |
| ………….. | ………….. | ………….. |
| ………….. | ………….. | ………….. |
| ………….. | ………….. | ………….. |
+------------+---------+--------------+
I want to get the top 40% of users by total sales. On MS SQL Server I would have used something like SELECT TOP 40 PERCENT after aggregation, but I am not using SQL Server, so that method is not available.
The approach I know is as follows.
First, get the number of rows with this query:
SELECT MAX(Rn) AS number_of_rows
FROM(
SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
Second, calculate 40% of that value and select the users:
SELECT *
FROM
(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
WHERE Rn <= 0.4* (number_of_rows)
The two steps can be combined as follows:
SELECT *
FROM
(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
WHERE Rn <= 0.4 * (SELECT MAX(Rn) AS number_of_rows
FROM(
SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B)
Is there a more efficient way, or a built-in function, to do this in Hive?

Yes! You can do both in one step:
SELECT u.*
FROM (SELECT user_id, SUM(AMOUNT) as amt,
ROW_NUMBER() OVER (ORDER BY SUM(AMOUNT) DESC) as seqnum,
COUNT(*) OVER () as cnt
FROM t
GROUP BY user_id
) u
WHERE seqnum <= cnt * 0.4;
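To check the idea on the six sample rows, here is a minimal sketch using Python's built-in sqlite3 module. SQLite only stands in for Hive here (it needs SQLite 3.25+ for window functions), the table name `sales` and the helper `top_users` are made up for the example, and the aggregation is done in an inner subquery and sums the `sales_amount` column from the question.

```python
import sqlite3

# The six sample rows from the question: (date, user_id, sales_amount).
ROWS = [
    ("2020-01-01", 1, 27), ("2020-01-01", 2, 32), ("2020-01-01", 3, 17),
    ("2020-01-03", 1, 19), ("2020-01-03", 2, 18), ("2020-01-03", 3, 40),
]

QUERY = """
SELECT user_id, amt
FROM (SELECT user_id, amt,
             ROW_NUMBER() OVER (ORDER BY amt DESC) AS seqnum,  -- rank by total
             COUNT(*)     OVER ()                  AS cnt      -- number of users
      FROM (SELECT user_id, SUM(sales_amount) AS amt
            FROM sales
            GROUP BY user_id) AS totals) AS u
WHERE seqnum <= cnt * ?
ORDER BY seqnum
"""

def top_users(rows, fraction):
    """Return (user_id, total) for the top `fraction` of users by total sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (date TEXT, user_id INTEGER, sales_amount INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY, (fraction,)).fetchall()
    conn.close()
    return result

print(top_users(ROWS, 0.4))
```

With three users, 40% of 3 is 1.2, so only the single best user (user 3, total 57) passes the filter. Note that ROW_NUMBER() breaks ties arbitrarily; if tied users should be kept or dropped together, RANK() would be the function to look at instead.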

Related

SQL - get rid of the nested aggregate select

There is a table Payment, which for example tracks the amount of money user puts into account, simplified as
===================================
Id | UserId | Amount | PayDate |
===================================
1 | 42 | 11 | 01.02.99 |
2 | 42 | 31 | 05.06.99 |
3 | 42 | 21 | 04.11.99 |
4 | 24 | 12 | 05.11.99 |
What I need is a table that also shows the balance just before each payment, e.g.:
===============================================
Id | UserId | Amount | PayDate | Balance |
===============================================
1 | 42 | 11 | 01.02.99 | 0 |
2 | 42 | 31 | 05.06.99 | 11 |
3 | 42 | 21 | 04.11.99 | 42 |
4 | 24 | 12 | 05.11.99 | 0 |
Currently the select statement looks something like
SELECT
Id,
UserId,
Amount,
PayDate,
(SELECT sum(amount) FROM Payments nestedp
WHERE nestedp.UserId = outerp.UserId AND
nestedp.PayDate < outerp.PayDate) as Balance
FROM
Payments outerp
How can I rewrite this select to get rid of the nested aggregate selection? The database in question is SQL Server 2019.
You can use a CTE with some custom logic to handle this type of problem.
WITH PaymentCte
AS (
SELECT ROW_NUMBER() OVER (
PARTITION BY UserId ORDER BY Id
) AS RowId
,Id
,UserId
,PayDate
,Amount
,SUM(Amount) OVER (
PARTITION BY UserId ORDER BY Id
) AS Balance
FROM Payment
)
SELECT X.Id
,X.UserId
,X.Amount
,X.PayDate
,Y.Balance
FROM PaymentCte x
INNER JOIN PaymentCte y ON x.userId = y.UserId
AND X.RowId = Y.RowId + 1
UNION
SELECT X.Id
,X.UserId
,X.Amount
,X.PayDate
,0 AS Balance
FROM PaymentCte x
WHERE X.RowId = 1
This provides the desired output
You can try the following using lag with a cumulative sum
with b as (
select * , isnull(lag(amount) over (partition by userid order by id),0) Amt
from t
)
select Id, UserId, Amount, PayDate,
Sum(Amt) over (partition by userid order by id) Balance
from b
order by Id
Thanks to the other participants' leads, I came up with a query that seems to work:
SELECT
Id,
UserId,
Amount,
PayDate,
COALESCE(sum(Amount) over (partition by UserId
order by PayDate
rows between unbounded preceding and 1 preceding), 0) as Balance
FROM
Payments
ORDER BY
UserId, PayDate
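As a quick check of the window-frame approach, here is a small sketch that runs it on the four sample payments with Python's sqlite3 module. SQLite is used only for illustration (the question targets SQL Server 2019, and SQLite 3.25+ is assumed), the helper name `balances` is made up, and the dates are rewritten in ISO format, reading the question's values as dd.mm.yy, so that they sort correctly as text.

```python
import sqlite3

# Sample payments: (Id, UserId, Amount, PayDate as ISO text).
PAYMENTS = [
    (1, 42, 11, "1999-02-01"),
    (2, 42, 31, "1999-06-05"),
    (3, 42, 21, "1999-11-04"),
    (4, 24, 12, "1999-11-05"),
]

QUERY = """
SELECT Id, UserId, Amount, PayDate,
       COALESCE(SUM(Amount) OVER (PARTITION BY UserId
                                  ORDER BY PayDate
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING), 0) AS Balance
FROM Payments
ORDER BY Id
"""

def balances(payments):
    """Return each payment with the user's balance before that payment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Payments (Id INTEGER, UserId INTEGER, Amount INTEGER, PayDate TEXT)"
    )
    conn.executemany("INSERT INTO Payments VALUES (?, ?, ?, ?)", payments)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

for row in balances(PAYMENTS):
    print(row)
```

The frame ends at 1 PRECEDING, so the current payment is never included in its own balance. For a user's first payment the frame is empty, SUM returns NULL, and COALESCE turns that into 0.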

Filtering consecutive dates ranges using SQL Server

I want to filter categories that only have consecutive dates.
I will explain with an example.
My table is
| ID | Category | Date |
|--------------------|-----------------|---------------------|
| 1 | 1 | 01-04-2021 |
| 2 | 1 | 02-04-2021 |
| 3 | 2 | 01-03-2021 |
| 4 | 2 | 04-03-2021 |
| 5 | 2 | 01-02-2010 |
| 6 | 3 | 02-02-2010 |
| 7 | 3 | 03-02-2010 |
| 8 | 4 | 03-02-2010 |
Expected output:
| Category |
|----------------|
| 1 |
| 3 |
| 4 |
I would like to filter my data so that I keep only the categories whose dates are all consecutive, i.e. categories that contain no gaps between dates.
… for unique dates per category
select category
from mytable
group by category
having max(Date) = dateadd(day, count(*)-1, min(Date))
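The reasoning behind that HAVING clause: if a category has n distinct dates and they are all consecutive, the last date is exactly n - 1 days after the first. A minimal sketch of the same test, run through Python's sqlite3 module, is below. SQLite has no dateadd, so the span is computed with julianday instead; the column name `d`, the helper `consecutive_categories`, and the ISO rewrite of the sample dates (read as dd-mm-yyyy) are assumptions made for the example.

```python
import sqlite3

# Sample rows: (ID, Category, date as ISO text).
ROWS = [
    (1, 1, "2021-04-01"), (2, 1, "2021-04-02"),
    (3, 2, "2021-03-01"), (4, 2, "2021-03-04"), (5, 2, "2010-02-01"),
    (6, 3, "2010-02-02"), (7, 3, "2010-02-03"),
    (8, 4, "2010-02-03"),
]

QUERY = """
SELECT category
FROM mytable
GROUP BY category
-- span in days between first and last date must equal (row count - 1)
HAVING CAST(julianday(MAX(d)) - julianday(MIN(d)) AS INTEGER) = COUNT(*) - 1
ORDER BY category
"""

def consecutive_categories(rows):
    """Return categories whose (unique) dates form one unbroken run of days."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, category INTEGER, d TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = [r[0] for r in conn.execute(QUERY)]
    conn.close()
    return result

print(consecutive_categories(ROWS))
```

As the answer says, this relies on dates being unique within a category; a duplicated date inflates COUNT(*) and makes the comparison fail even when there is no gap.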
Here's one way. You may have to adjust it for your particular flavor of SQL.
WITH a AS (
SELECT
category,
DATEDIFF('days', LAG(date) OVER (PARTITION BY category ORDER BY
date), date) AS days_apart
FROM tbl
),
b AS (
SELECT
category,
MAX(days_apart) AS max_days_apart
FROM a
GROUP BY 1
)
SELECT
category
FROM b
WHERE max_days_apart IS NULL OR max_days_apart = 1
select distinct category
from dates
where category not in (
select distinct category
from (
select category, [date],
row_number() over (partition by category order by [date]) as days_cnt,
min([date]) over (partition by category) as min_date
from dates
group by category, [date]
) as c
where c.[date]<>dateadd(d, c.days_cnt-1, c.min_date))
order by category
Categories where the sequence of dates is the same as the sequence of ids.
with cte as (
select [category],
row_number() over (partition by [category] order by [date], [id])
- row_number() over (partition by [category] order by [id]) drn
from mytable -- placeholder table name
)
select [category]
from cte
group by [category]
having sum(abs(drn)) = 0;

Repeat rows cumulative

I have this table
| date | id | number |
|------------|----|--------|
| 2021/05/01 | 1 | 10 |
| 2021/05/02 | 2 | 20 |
| 2021/05/03 | 3 | 30 |
| 2021/05/04 | 1 | 20 |
I am trying to write a query to have this other table
| date | id | number |
|------------|----|--------|
| 2021/05/01 | 1 | 10 |
| 2021/05/02 | 1 | 10 |
| 2021/05/02 | 2 | 20 |
| 2021/05/03 | 1 | 10 |
| 2021/05/03 | 2 | 20 |
| 2021/05/03 | 3 | 30 |
| 2021/05/04 | 1 | 20 |
| 2021/05/04 | 2 | 20 |
| 2021/05/04 | 3 | 30 |
The idea is that each date should list all the distinct ids seen up to that date, each with its number, and if an id is repeated then only its latest value should be used.
One way is to expand out all the rows for each date. Then take the most recent value using qualify:
with t as (
select date '2021-05-01' as date, 1 as id, 10 as number union all
select date '2021-05-02' as date, 2 as id, 20 as number union all
select date '2021-05-03' as date, 3 as id, 30 as number union all
select date '2021-05-04' as date, 1 as id, 20 as number
)
select d.date, t.id, t.number
from t join
(select date
from (select min(date) as min_date, max(date) as max_date
from t
) tt cross join
unnest(generate_date_array(min_date, max_date, interval 1 day)) date
) d
on t.date <= d.date
where 1=1
qualify row_number() over (partition by d.date, t.id order by t.date desc) = 1
order by 1, 2, 3;
A more efficient method doesn't generate all the rows and then filter them. Instead, it just generates the rows that are needed by generating the appropriate dates within each row. That requires a couple of window functions to get the "next" date for each id and the maximum date in the data:
with t as (
select date '2021-05-01' as date, 1 as id, 10 as number union all
select date '2021-05-02' as date, 2 as id, 20 as number union all
select date '2021-05-03' as date, 3 as id, 30 as number union all
select date '2021-05-04' as date, 1 as id, 20 as number
)
select date, t.id, t.number
from (select t.*,
date_add(lead(date) over (partition by id order by date), interval -1 day) as next_date,
max(date) over () as max_date
from t
) t cross join
unnest(generate_date_array(date, coalesce(next_date, max_date))) date
order by 1, 2, 3;
Consider below [less verbose] approach
select t1.date, t2.id, t2.number
from (
select *, array_agg(struct(date, id,number)) over(order by date) arr
from `project.dataset.table`
) t1, unnest(arr) t2
where true
qualify row_number() over (partition by t1.date, t2.id order by t2.date desc) = 1
# order by date, id
If applied to the sample data in your question, the output matches the desired table shown there.
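The expand-then-keep-latest idea used in these answers can be tried outside BigQuery too. Below is a rough sketch using Python's sqlite3 module (SQLite 3.25+ assumed). SQLite has no QUALIFY or generate_date_array, so the filter goes in an outer WHERE and the set of dates is simply the distinct dates present in the table, which means calendar gaps would not be filled; the table name `t` follows the answers and the helper `carry_forward` is made up.

```python
import sqlite3

# Sample rows from the question: (date, id, number).
ROWS = [
    ("2021-05-01", 1, 10),
    ("2021-05-02", 2, 20),
    ("2021-05-03", 3, 30),
    ("2021-05-04", 1, 20),
]

QUERY = """
SELECT date, id, number
FROM (SELECT d.date AS date, t.id AS id, t.number AS number,
             -- rank each id's rows per target date, newest first
             ROW_NUMBER() OVER (PARTITION BY d.date, t.id
                                ORDER BY t.date DESC) AS rn
      FROM (SELECT DISTINCT date FROM t) AS d
      JOIN t ON t.date <= d.date) AS expanded
WHERE rn = 1
ORDER BY date, id
"""

def carry_forward(rows):
    """For every date, list each id seen so far with its most recent number."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (date TEXT, id INTEGER, number INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

for row in carry_forward(ROWS):
    print(row)
```

On the sample data this yields the nine rows of the desired table, including id 1 switching from 10 to 20 on 2021-05-04.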

PostgreSQL: Filter select query by comparing against other rows

Suppose I have a table of Events that lists a userId and the time the Event occurred:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 2 | 190 | 2020-07-13 20:57:07.138+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
| 4 | 46 | 2020-07-22 10:17:11.104+00 |
| 5 | 97 | 2020-07-13 20:57:07.138+00 |
| 6 | 17 | 2020-07-04 11:33:21.919+00 |
| 7 | 17 | 2020-07-11 09:23:21.919+00 |
+----+--------+----------------------------+
I want to get the list of events that had a previous event on the same day, by the same user. The result for the above table would be:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
+----+--------+----------------------------+
How can I perform a select query that filters results by evaluating them against other rows in the table?
This can be done using an EXISTS condition:
select t1.*
from the_table t1
where exists (select *
from the_table t2
where t2.userid = t1.userid -- for the same user
and t2.time::date = t1.time::date -- on the same day
and t2.time < t1.time); -- but previously on that day
You can use lag():
select t.*
from (select t.*,
lag(time) over (partition by userid, time::date order by time) as prev_time
from t
) t
where prev_time is not null;
Or row_number():
select t.*
from (select t.*,
row_number() over (partition by userid, time::date order by time) as seqnum
from t
) t
where seqnum >= 2;
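To compare the EXISTS and lag() variants on the sample rows, here is a small sketch using Python's sqlite3 module. It is only an illustration: SQLite (3.25+ assumed) replaces PostgreSQL, so `time::date` becomes `date(time)`, the timestamps are stored as text without the +00 offset, the last sample row is given id 7 to keep ids unique, and the helper `event_ids` is a made-up name.

```python
import sqlite3

# Sample events: (id, userId, time as ISO text).
EVENTS = [
    (1, 46, "2020-07-22 11:22:55.307"),
    (2, 190, "2020-07-13 20:57:07.138"),
    (3, 17, "2020-07-11 11:33:21.919"),
    (4, 46, "2020-07-22 10:17:11.104"),
    (5, 97, "2020-07-13 20:57:07.138"),
    (6, 17, "2020-07-04 11:33:21.919"),
    (7, 17, "2020-07-11 09:23:21.919"),
]

EXISTS_QUERY = """
SELECT e1.id
FROM events e1
WHERE EXISTS (SELECT 1
              FROM events e2
              WHERE e2.userId = e1.userId              -- same user
                AND date(e2.time) = date(e1.time)      -- same day
                AND e2.time < e1.time)                 -- strictly earlier
ORDER BY e1.id
"""

LAG_QUERY = """
SELECT id
FROM (SELECT id,
             LAG(time) OVER (PARTITION BY userId, date(time)
                             ORDER BY time) AS prev_time
      FROM events) AS x
WHERE prev_time IS NOT NULL
ORDER BY id
"""

def event_ids(query, events):
    """Run one of the queries above and return the matching event ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, userId INTEGER, time TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    ids = [r[0] for r in conn.execute(query)]
    conn.close()
    return ids

print(event_ids(EXISTS_QUERY, EVENTS))
print(event_ids(LAG_QUERY, EVENTS))
```

Both queries return ids 1 and 3 here. They can differ when a user has two events with exactly the same timestamp: EXISTS requires a strictly earlier event, while lag() treats whichever tied row sorts second as having a predecessor.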
You can use LAG() to find the previous row for a user. Then a simple comparison tells you whether it occurred on the same day or not.
For example:
select *
from (
select
*,
lag(time) over(partition by userId order by time) as prev_time
from t
) x
where time::date = prev_time::date
You can use the ROW_NUMBER() analytic function:
SELECT id, userId, time
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY userId, date_trunc('day', time) ORDER BY time) AS rn,
t.*
FROM Events t
) q
WHERE rn > 1
This returns every event that follows an earlier event by the same user on the same day.

SQL Server: Swap two lines depending on criteria

Suppose a table named Sales with this data in SQL Server
--------------------------------------------
Id | Customer_Id | Rate | Pid
--------------------------------------------
180 | 374 | 1 | A01
277 | 374 | 0 | NULL
346 | 785 | 1 | D03
476 | 785 | 0 | NULL
1821 | 1234 | 0 | E07
25951 | 1951 | 1 | K73
How can I update my table to swap the Rate and Pid values between rows that have the same Customer_Id, so that I get a result like this:
--------------------------------------------
Id | Customer_Id | Rate | Pid
--------------------------------------------
180 | 374 | 0 | NULL
277 | 374 | 1 | A01
346 | 785 | 0 | NULL
476 | 785 | 1 | D03
1821 | 1234 | 0 | E07
25951 | 1951 | 1 | K73
How can I achieve this?
If you always have at most two records per customer then you can use the following query:
SELECT ID, Customer_Id,
CASE
-- 2 records per Customer_id -> swap
WHEN COUNT(*) OVER (PARTITION BY Customer_id) = 2 THEN
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY Customer_id ORDER BY ID) = 1
THEN LEAD(Rate) OVER (PARTITION BY Customer_id ORDER BY ID)
ELSE LAG(Rate) OVER (PARTITION BY Customer_id ORDER BY ID)
END
-- 1 record per Customer_id -> don't swap
ELSE Rate
END,
CASE
WHEN COUNT(*) OVER (PARTITION BY Customer_id) = 2 THEN
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY Customer_id ORDER BY ID) = 1
THEN LEAD(Pid) OVER (PARTITION BY Customer_id ORDER BY ID)
ELSE LAG(Pid) OVER (PARTITION BY Customer_id ORDER BY ID)
END
ELSE Pid
END
FROM Sales
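To see what the swap produces on the sample rows, here is a hedged sketch that runs an equivalent SELECT through Python's sqlite3 module. SQLite (3.25+ assumed) stands in for SQL Server; the window expressions are computed once in a subquery with a named window `w`, which is a restructuring for readability rather than the answer's exact text, and the helper `swapped` is a made-up name.

```python
import sqlite3

# Sample rows: (Id, Customer_Id, Rate, Pid).
SALES = [
    (180, 374, 1, "A01"),
    (277, 374, 0, None),
    (346, 785, 1, "D03"),
    (476, 785, 0, None),
    (1821, 1234, 0, "E07"),
    (25951, 1951, 1, "K73"),
]

QUERY = """
SELECT Id, Customer_Id,
       CASE WHEN cnt <> 2 THEN Rate          -- single row: keep as is
            WHEN rn = 1   THEN next_rate     -- first of a pair: take the second's
            ELSE prev_rate END AS Rate,      -- second of a pair: take the first's
       CASE WHEN cnt <> 2 THEN Pid
            WHEN rn = 1   THEN next_pid
            ELSE prev_pid END AS Pid
FROM (SELECT s.*,
             COUNT(*)     OVER (PARTITION BY Customer_Id) AS cnt,
             ROW_NUMBER() OVER w AS rn,
             LEAD(Rate)   OVER w AS next_rate,
             LAG(Rate)    OVER w AS prev_rate,
             LEAD(Pid)    OVER w AS next_pid,
             LAG(Pid)     OVER w AS prev_pid
      FROM Sales s
      WINDOW w AS (PARTITION BY Customer_Id ORDER BY Id)) AS x
ORDER BY Id
"""

def swapped(sales):
    """Return the rows with Rate and Pid exchanged within two-row customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Sales (Id INTEGER, Customer_Id INTEGER, Rate INTEGER, Pid TEXT)"
    )
    conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", sales)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

for row in swapped(SALES):
    print(row)
```

The explicit row-number test matters because Pid can be NULL: picking "whichever of LEAD or LAG is not NULL" would return the wrong value for the row that is supposed to receive NULL.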
Edit:
If you want to UPDATE then you can wrap the above query in a CTE and do the update on the CTE:
;WITH ToUpdate AS (
SELECT ID, Customer_Id, Rate, Pid,
COUNT(*) OVER (PARTITION BY Customer_id) AS cnt,
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY Customer_id ORDER BY ID) = 1
THEN LEAD(Rate) OVER (PARTITION BY Customer_id ORDER BY ID)
ELSE LAG(Rate) OVER (PARTITION BY Customer_id ORDER BY ID)
END AS NewRate,
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY Customer_id ORDER BY ID) = 1
THEN LEAD(Pid) OVER (PARTITION BY Customer_id ORDER BY ID)
ELSE LAG(Pid) OVER (PARTITION BY Customer_id ORDER BY ID)
END AS NewPid
FROM Sales)
UPDATE ToUpdate
SET Rate = NewRate, Pid = NewPid
WHERE cnt = 2
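This will do it for Rate, with the caveat that you have at most two records with the same Customer_Id and that their Rate values differ. The WHERE EXISTS guard leaves customers with a single record untouched, since the subquery would otherwise return NULL for them. Pid needs a second, analogous assignment.
update Sales
set Rate =
(
select Rate from Sales sls
where sls.Customer_Id = Sales.Customer_Id
and sls.Rate <> Sales.Rate
)
where exists
(
select 1 from Sales sls
where sls.Customer_Id = Sales.Customer_Id
and sls.Rate <> Sales.Rate
)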