Count distinct customers over rolling window partition - sql

My question is similar to "redshift: count distinct customers over window partition", but I have a rolling window partition.
My query looks like this, but Redshift does not support DISTINCT inside a windowed COUNT:
select p_date, seconds_read,
count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x
My goal is to calculate the total number of unique customers up to each date (hence the rolling window).
I tried the dense_rank() approach, but it simply fails because I cannot use the window function like this:
select p_date, max(total_cumulative_customer) over ()
from (select p_date, seconds_read,
dense_rank() over (order by customer_id rows between unbounded preceding and current row) as total_cumulative_customer -- WILL FAIL HERE
from table_x
) t
Any workaround or different approach would be helpful!
EDIT:
INPUT DATA sample
+------+----------+--------------+
| Cust | p_date | seconds_read |
+------+----------+--------------+
| 1 | 1-Jan-20 | 10 |
| 2 | 1-Jan-20 | 20 |
| 4 | 1-Jan-20 | 30 |
| 5 | 1-Jan-20 | 40 |
| 6 | 5-Jan-20 | 50 |
| 3 | 5-Jan-20 | 60 |
| 2 | 5-Jan-20 | 70 |
| 1 | 5-Jan-20 | 80 |
| 1 | 5-Jan-20 | 90 |
| 1 | 7-Jan-20 | 100 |
| 3 | 7-Jan-20 | 110 |
| 4 | 7-Jan-20 | 120 |
| 7 | 7-Jan-20 | 130 |
+------+----------+--------------+
Expected Output
+----------+--------------------------+------------------+--------------------------------------------+
| p_date | total_distinct_cum_cust | sum_seconds_read | Comment |
+----------+--------------------------+------------------+--------------------------------------------+
| 1-Jan-20 | 4 | 100 | total distinct cust = 4 i.e. 1,2,4,5 |
| 5-Jan-20 | 6 | 450 | total distinct cust = 6 i.e. 1,2,3,4,5,6 |
| 7-Jan-20 | 7 | 910 | total distinct cust = 7 i.e. 1,2,3,4,5,6,7 |
+----------+--------------------------+------------------+--------------------------------------------+

For this operation:
select p_date, seconds_read,
count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x;
You can do pretty much what you want with two levels of aggregation:
select min_p_date,
sum(count(*)) over (order by min_p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(p_date) as min_p_date
from table_x
group by customer_id
) c
group by min_p_date;
Summing the seconds read as well is a bit tricky, but you can use the same idea:
select p_date,
sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, p_date, seconds_read,
row_number() over (partition by customer_id order by p_date) as seqnum
from table_x
) c
group by p_date;
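As a rough check of the idea, here is a minimal sketch that runs the same logic on the sample data from the question. It uses Python's built-in sqlite3 module as a stand-in for Redshift (an assumption; SQLite 3.25 or later is needed for window functions), stores dates as ISO strings so they sort correctly, and splits the nested aggregation into two CTEs (flagged and daily, both hypothetical names) for readability.

```python
import sqlite3

# Sample rows from the question: (customer_id, p_date, seconds_read).
# Dates are ISO strings so that they sort chronologically as text.
ROWS = [
    (1, "2020-01-01", 10), (2, "2020-01-01", 20), (4, "2020-01-01", 30),
    (5, "2020-01-01", 40), (6, "2020-01-05", 50), (3, "2020-01-05", 60),
    (2, "2020-01-05", 70), (1, "2020-01-05", 80), (1, "2020-01-05", 90),
    (1, "2020-01-07", 100), (3, "2020-01-07", 110), (4, "2020-01-07", 120),
    (7, "2020-01-07", 130),
]

QUERY = """
WITH flagged AS (
    -- seqnum = 1 marks the first row ever seen for each customer
    SELECT p_date, seconds_read,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY p_date) AS seqnum
    FROM table_x
),
daily AS (
    -- one row per date: seconds read and number of first-time customers
    SELECT p_date,
           SUM(seconds_read) AS seconds_read,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_customers
    FROM flagged
    GROUP BY p_date
)
SELECT p_date,
       SUM(new_customers) OVER (ORDER BY p_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_distinct_customers,
       SUM(seconds_read) OVER (ORDER BY p_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_seconds_read
FROM daily
ORDER BY p_date
"""


def running_totals():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_x (customer_id INTEGER, p_date TEXT, seconds_read INTEGER)"
    )
    conn.executemany("INSERT INTO table_x VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_totals():
        print(row)
```

Each customer contributes exactly one row with seqnum = 1, so the running sum of those flags is the cumulative distinct count.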

One workaround uses a subquery:
select p_date, seconds_read,
(
select count(distinct t1.customer_id)
from table_x t1
where t1.p_date <= t.p_date
) as total_cumulative_customer
from table_x t

I'd like to add that you can also accomplish this with an explicit self join which is, in my opinion, more straightforward and readable than the subquery approaches described in the other answers.
select
t1.p_date,
sum(t2.seconds_read) as sum_seconds_read,
count(distinct t2.customer_id) as distinct_cum_cust_totals
from
(select distinct p_date from table_x) t1 -- one row per date, so t2 rows are not duplicated in the sum
join
table_x t2
on
t2.p_date <= t1.p_date
group by
t1.p_date
Most query planners will reduce a correlated subquery like the ones above to an efficient join like this, so either solution is usually fine. For the general case, though, I believe this is the better solution, since some engines (like BigQuery) restrict correlated subqueries and may force you to define the join explicitly in your query.
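One thing to watch with a self-join is row duplication: the left side should contribute one row per date, otherwise every matching row is summed once per left-hand row. The sketch below illustrates this on the question's sample data, using Python's sqlite3 module as a stand-in engine (an assumption) and ISO date strings. The names TEMPLATE and run_join are hypothetical helpers.

```python
import sqlite3

# Sample rows from the question: (customer_id, p_date, seconds_read).
ROWS = [
    (1, "2020-01-01", 10), (2, "2020-01-01", 20), (4, "2020-01-01", 30),
    (5, "2020-01-01", 40), (6, "2020-01-05", 50), (3, "2020-01-05", 60),
    (2, "2020-01-05", 70), (1, "2020-01-05", 80), (1, "2020-01-05", 90),
    (1, "2020-01-07", 100), (3, "2020-01-07", 110), (4, "2020-01-07", 120),
    (7, "2020-01-07", 130),
]

# {left} is replaced by the left-hand side of the join.
TEMPLATE = """
SELECT t1.p_date,
       SUM(t2.seconds_read) AS sum_seconds_read,
       COUNT(DISTINCT t2.customer_id) AS distinct_cum_cust_totals
FROM {left} t1
JOIN table_x t2 ON t2.p_date <= t1.p_date
GROUP BY t1.p_date
ORDER BY t1.p_date
"""


def run_join(left):
    """Run the self-join with the given left-hand table expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_x (customer_id INTEGER, p_date TEXT, seconds_read INTEGER)"
    )
    conn.executemany("INSERT INTO table_x VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(TEMPLATE.format(left=left)).fetchall()
    conn.close()
    return rows


# One left-hand row per date: sums match the expected output.
deduped = run_join("(SELECT DISTINCT p_date FROM table_x)")
# Raw table on the left: each t2 row is counted once per t1 row on that date.
naive = run_join("table_x")

if __name__ == "__main__":
    print(deduped)
    print(naive)
```

The distinct count is unaffected by the duplication, which is why the problem only shows up in the summed column.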

Related

SQL Server Add row number each group

I am working on a query for SQL Server 2016. I order by serial_no and group by pay_type, and I would like to add a row number as in the example below:
row_no | pay_type | serial_no
1 | A | 4000118445
2 | A | 4000118458
3 | A | 4000118461
4 | A | 4000118473
5 | A | 4000118486
1 | B | 4000118499
2 | B | 4000118506
3 | B | 4000118519
4 | B | 4000118521
1 | A | 4000118534
2 | A | 4000118547
3 | A | 4000118550
1 | B | 4000118562
2 | B | 4000118565
3 | B | 4000118570
4 | B | 4000118572
Help me please..
SELECT
ROW_NUMBER() OVER(PARTITION BY paytype ORDER BY serial_no) as row_no,
paytype, serial_no
FROM table
ORDER BY serial_no
You can assign groups to adjacent pay types that are the same and then use row_number(). For this purpose, the difference of row numbers is a good way to determine the groups:
select row_number() over (partition by pay_type, seqnum - seqnum_2 order by serial_no) as row_no,
t.*
from (select t.*,
row_number() over (order by serial_no) as seqnum,
row_number() over (partition by pay_type order by serial_no) as seqnum_2
from t
) t;
This type of problem is one example of a gaps-and-islands problem. Why does the difference of row numbers work? I find that the simplest way to understand is to look at the results of the subquery.
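To make the subquery results concrete, here is a small sketch that prints them for a shortened version of the data. It assumes Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later for window functions) and uses short serial numbers 1 to 8 instead of the real ones; the grp column is added only for illustration.

```python
import sqlite3

# (pay_type, serial_no): A A A B B A A B, ordered by serial_no.
ROWS = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("A", 6), ("A", 7), ("B", 8)]

QUERY = """
SELECT ROW_NUMBER() OVER (PARTITION BY pay_type, seqnum - seqnum_2
                          ORDER BY serial_no) AS row_no,
       pay_type, serial_no,
       seqnum - seqnum_2 AS grp  -- constant within each run of equal pay_type
FROM (
    SELECT pay_type, serial_no,
           ROW_NUMBER() OVER (ORDER BY serial_no) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY pay_type ORDER BY serial_no) AS seqnum_2
    FROM t
) s
ORDER BY serial_no
"""


def numbered_rows():
    """Return (row_no, pay_type, serial_no, grp) for the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (pay_type TEXT, serial_no INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in numbered_rows():
        print(row)
```

Within a run of identical pay types both row numbers advance by one per row, so their difference stays fixed; when the pay type changes, only the overall counter keeps advancing, so the next run of that pay type gets a different difference.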
Here is a db<>fiddle.
Add this to your select list:
ROW_NUMBER() OVER ( ORDER BY (SELECT 1) )
Since you are already sorting the result, you don't need to sort again in the window function, which consumes less CPU.

Select the highest value of column 2 per column 1

Given the following table P_PROV
+----+-----------+-----------+
| id | date | person_id |
+----+-----------+-----------+
| 1 |19/06/2019 | 1 |
| 2 |18/07/2010 | 2 |
| 3 |19/06/2020 | 1 |
| 4 |17/06/2020 | 2 |
| 5 |28/06/2020 | 3 |
+----+-----------+-----------+
I want this output
+----+-----------+-----------+
| id | date | person_id |
+----+-----------+-----------+
| 3 |19/06/2020 | 1 |
| 4 |17/06/2020 | 2 |
| 5 |28/06/2020 | 3 |
+----+-----------+-----------+
Putting this in words, I want to return per person the maximum date. I tried something like this
SELECT DISTINCT pp.date, pp.id FROM P_PROV pp
WHERE (SELECT MAX(aa.date)
FROM P_PROV aa) = pp.date;
This one is only returning one row (of course, because the MAX will return the maximum date only), but I really don't know how to approach this issue, any kind of help would be appreciated
ROW_NUMBER provides one way to handle this:
SELECT id, date, person_id
FROM
(
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY date DESC) rn
FROM yourTable t
) t
WHERE rn = 1;
Oracle has a fun way to do this using aggregation:
select max(id) keep (dense_rank first order by date desc) as id,
max(date) as date, person_id
from P_PROV
group by person_id;
Given that your ids are increasing, this probably also does what you want:
select max(id) as id, max(date) as date, person_id
from P_PROV
group by person_id;
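For a quick sanity check, the ROW_NUMBER variant can be run against the sample table. This is only a sketch: it assumes Python's sqlite3 module in place of Oracle and rewrites the dates as ISO strings so that ORDER BY date DESC sorts chronologically.

```python
import sqlite3

# (id, date, person_id) from the question, with dates rewritten as ISO strings.
ROWS = [
    (1, "2019-06-19", 1),
    (2, "2010-07-18", 2),
    (3, "2020-06-19", 1),
    (4, "2020-06-17", 2),
    (5, "2020-06-28", 3),
]

QUERY = """
SELECT id, date, person_id
FROM (
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY date DESC) AS rn
    FROM P_PROV p
) ranked
WHERE rn = 1          -- keep only the newest row per person
ORDER BY person_id
"""


def latest_per_person():
    """Return the row with the maximum date for each person_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE P_PROV (id INTEGER, date TEXT, person_id INTEGER)")
    conn.executemany("INSERT INTO P_PROV VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_per_person():
        print(row)
```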

SQL query for selecting multiple records for one product for a single id

My table looks like this. What I'm trying to achieve is to pull out all the records for one user for the product that has the earliest date:
product |type_id| user | Date |Desired ROW_NUMBER as output |
-------+--------+------+-------+---------------------
1 | 1 | A | 0101 | 1
1 | 1 | A | 0102 | 1
2 | 3 | A | 0105 | 2
2 | 5 | A | 0105 | 2
3 | 7 | B | 0101 | 1
3 | 8 | B | 0104 | 1
So I want to pull all the records with "1" in the desired row_num column, but I haven't figured out how to get this without doing another group by. Any help would be appreciated.
You can use window functions:
select t.*
from (select t.*,
rank() over (partition by user order by min_date) as seqnum
from (select t.*,
min(date) over (partition by user, product) as min_date
from t
) t
) t
where seqnum = 1;
Or, with only one subquery:
select t.*
from (select t.*,
min(date) over (partition by user, product) as min_date_up,
min(date) over (partition by user) as min_date_u
from t
) t
where min_date_u = min_date_up;
You can interpret this as "return all rows where the product has the minimum date for the user".
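The single-subquery version can be tried on the question's rows with a sketch like the following. It assumes Python's sqlite3 module as the engine and renames the user column to user_name (a hypothetical name, chosen to avoid clashing with reserved words in other engines).

```python
import sqlite3

# (product, type_id, user_name, date) from the question's table.
ROWS = [
    (1, 1, "A", "0101"),
    (1, 1, "A", "0102"),
    (2, 3, "A", "0105"),
    (2, 5, "A", "0105"),
    (3, 7, "B", "0101"),
    (3, 8, "B", "0104"),
]

QUERY = """
SELECT product, type_id, user_name, date
FROM (
    SELECT t.*,
           MIN(date) OVER (PARTITION BY user_name, product) AS min_date_up,
           MIN(date) OVER (PARTITION BY user_name) AS min_date_u
    FROM t
) x
WHERE min_date_u = min_date_up   -- the product that holds the user's earliest date
ORDER BY user_name, date, type_id
"""


def earliest_product_rows():
    """Return every row of the product with the earliest date, per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (product INTEGER, type_id INTEGER, user_name TEXT, date TEXT)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in earliest_product_rows():
        print(row)
```

Note that if two products share a user's earliest date, both products' rows are returned.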
Here is a db<>fiddle.
SELECT * FROM [tableName] WHERE Desired ROW_NUMBER = 1 ORDER BY Date[DESC, ASC]
Pass the Desired ROW_NUMBER value dynamically as a parameter.

Query item with closest date based on current date

I am trying to get the closest date for item no and price based on the current date. The query is giving me output, but not the way I want.
There is a different price for the same item and it's not filtering.
Here's my query:
SELECT distinct [ITEM_NO]
,min(REQUIRED_DATE) as Date
,[PRICE]
FROM [DATA_WAREHOUSE].[app].[OHCMS_HOPS_ORDERS]
where (REQUIRED_DATE) >= GETDATE() and PRICE is not null
group by ITEM_NO,PRICE
order by ITEM_NO
Any Ideas?
You can use the ROW_NUMBER window function to do this:
SELECT ITEM_NO,
REQUIRED_DATE,
PRICE
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY ITEM_NO ORDER BY REQUIRED_DATE) rn
FROM [DATA_WAREHOUSE].[app].[OHCMS_HOPS_ORDERS]
where REQUIRED_DATE >= GETDATE() and PRICE is not null
)t1
WHERE rn = 1
Could you order by the absolute value of DATEDIFF?
ORDER BY ABS(DATEDIFF(day, REQUIRED_DATE, GETDATE()))
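A rough sketch of this ordering combined with ROW_NUMBER, so that it picks one row per item, might look like the following. It assumes Python's sqlite3 module as the engine, so julianday() differences stand in for SQL Server's DATEDIFF and a :today parameter stands in for GETDATE(). The rows are the small Orders example that appears in the next answer.

```python
import sqlite3

# (Item, RequiredDate, Price) from the example table in the next answer.
ORDERS = [
    ("A", "2019-05-29", 10),
    ("A", "2019-06-01", 20),
    ("A", "2019-06-04", 30),
    ("A", "2019-06-05", 40),
    ("B", "2019-06-01", 80),
]

QUERY = """
SELECT Item, RequiredDate, Price
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (
               PARTITION BY Item
               -- distance in days from :today, past or future
               ORDER BY ABS(julianday(RequiredDate) - julianday(:today))
           ) AS rn
    FROM Orders o
) ranked
WHERE rn = 1
ORDER BY Item
"""


def closest_per_item(today):
    """Return, per item, the row whose date is nearest to `today`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (Item TEXT, RequiredDate TEXT, Price INTEGER)")
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", ORDERS)
    rows = conn.execute(QUERY, {"today": today}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(closest_per_item("2019-06-03"))
```

If two dates are equally close, ROW_NUMBER picks one of them arbitrarily unless a tie-breaker is added to the ORDER BY.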
This seems like a variation of the greatest-n-per-group problem.
I'm not quite certain which constraint you're looking to impose:
1. Largest Date
2. Most Recent Date (but not in the future)
3. Closest Date to today (past or future)
Here's an example table and which row we'd want if queried on 6/3/2019:
| Item | RequiredDate | Price |
|------|--------------|-------|
| A | 2019-05-29 | 10 |
| A | 2019-06-01 | 20 | <-- #2
| A | 2019-06-04 | 30 | <-- #3
| A | 2019-06-05 | 40 | <-- #1
| B | 2019-06-01 | 80 |
But I'm going to guess you're looking for #2
We can identify the most recent date per item by grouping by item and using an aggregate operation like MAX on each group:
SELECT o.Item, MAX(o.RequiredDate) AS MostRecentDt
FROM Orders o
WHERE o.RequiredDate <= GETDATE()
GROUP BY o.Item
Which returns this:
| Item | MostRecentDt |
|------|--------------|
| A | 2019-06-01 |
| B | 2019-06-01 |
However, once we've grouped by that record, the trouble is then in joining back to the original table to get the full row/record in order to select any other information not part of the original GROUP BY statement
Using ROW_NUMBER we can sort elements in a set, and indicate their order (highest...lowest)
SELECT *, ROW_NUMBER() OVER(PARTITION BY Item ORDER BY RequiredDate DESC) rn
FROM Orders o
WHERE o.RequiredDate <= GETDATE()
| Item | RequiredDate | Price | rn |
|------|--------------|-------|----|
| A | 2019-06-01 | 20 | 1 |
| A | 2019-05-29 | 10 | 2 |
| B | 2019-06-01 | 80 | 1 |
Since we've sorted DESC, now we just want to query this group to get the most recent values per group (rn=1)
WITH OrderedPastItems AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY Item ORDER BY RequiredDate DESC) rn
FROM Orders o
WHERE o.RequiredDate <= GETDATE()
)
SELECT *
FROM OrderedPastItems
WHERE rn = 1
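The CTE can be checked against the example table with a sketch like this one. It assumes Python's sqlite3 module instead of SQL Server and passes the query date as a :today parameter in place of GETDATE(), so the result is reproducible for the 6/3/2019 scenario.

```python
import sqlite3

# (Item, RequiredDate, Price) from the example table above.
ORDERS = [
    ("A", "2019-05-29", 10),
    ("A", "2019-06-01", 20),
    ("A", "2019-06-04", 30),
    ("A", "2019-06-05", 40),
    ("B", "2019-06-01", 80),
]

QUERY = """
WITH OrderedPastItems AS (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY Item ORDER BY RequiredDate DESC) AS rn
    FROM Orders o
    WHERE o.RequiredDate <= :today   -- ignore dates in the future
)
SELECT Item, RequiredDate, Price, rn
FROM OrderedPastItems
ORDER BY Item, rn
"""


def ranked_past_items(today):
    """Return past rows per item, ranked from most recent (rn = 1) to oldest."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (Item TEXT, RequiredDate TEXT, Price INTEGER)")
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", ORDERS)
    rows = conn.execute(QUERY, {"today": today}).fetchall()
    conn.close()
    return rows


def most_recent_per_item(today):
    """Keep only the rn = 1 row for each item."""
    return [row[:3] for row in ranked_past_items(today) if row[3] == 1]


if __name__ == "__main__":
    print(ranked_past_items("2019-06-03"))
    print(most_recent_per_item("2019-06-03"))
```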
Here's a MCVE in SQL Fiddle
Further Reading:
SQL selecting rows by most recent date
Select row with most recent date per user

SELECT based on multiple fields in MS-SQL

I have a table with 4 columns:
AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType
There are multiple records for each AcctNumb, with the date that each record was recorded.
What I want to do is grab the most recent date, consumption reading, and reading type for each account.
I have tried using MAX(PeriodEndingDate) and GROUP BY AcctNumb, but I would need to aggregate all the other values, and none of the aggregate functions help me for the WaterConsumption, etc.
Can anyone point me in the right direction?
Thanks
EDIT
Here is a sample table
+----------+------------------+------------------+-------------+
| AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType |
+----------+------------------+------------------+-------------+
| 1000 | 2018-03-31 | 122230 | A |
| 1001 | 2018-03-31 | 24850 | A |
| 1002 | 2018-03-31 | 88540 | A |
| 1000 | 2017-12-31 | 123800 | A |
| 1001 | 2017-12-31 | 3000 | E |
+----------+------------------+------------------+-------------+
The ReadingType is whether it's an actual (A) reading, or an estimate (E).
Try this
SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType
FROM (SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType,
ROW_NUMBER() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS MostrecentRecord
FROM <TableName>) dt
WHERE MostrecentRecord= 1
This can be done using ROW_NUMBER. It has been asked and answered thousands of times, but the query is easier to write than to find a duplicate.
select *
from
(
select *
, RowNum = ROW_NUMBER() over(partition by AcctNumb order by PeriodEndingDate desc)
from YourTable
) x
where x.RowNum = 1
SELECT DQ.* FROM
(SELECT *,
Row_Number() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS RN
FROM YourTable
) AS DQ
WHERE DQ.RN = 1
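The sort direction inside ROW_NUMBER decides which record is kept: descending gives the most recent reading, ascending the oldest. The sketch below shows both on the sample table, assuming Python's sqlite3 module in place of SQL Server; first_reading_per_account is a hypothetical helper.

```python
import sqlite3

# (AcctNumb, PeriodEndingDate, WaterConsumption, ReadingType) from the question.
READINGS = [
    (1000, "2018-03-31", 122230, "A"),
    (1001, "2018-03-31", 24850, "A"),
    (1002, "2018-03-31", 88540, "A"),
    (1000, "2017-12-31", 123800, "A"),
    (1001, "2017-12-31", 3000, "E"),
]

# {direction} is either DESC (most recent first) or ASC (oldest first).
TEMPLATE = """
SELECT AcctNumb, PeriodEndingDate, WaterConsumption, ReadingType
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY AcctNumb
                              ORDER BY PeriodEndingDate {direction}) AS rn
    FROM YourTable r
) dq
WHERE rn = 1
ORDER BY AcctNumb
"""


def first_reading_per_account(direction):
    """Return the rn = 1 row per account for the given sort direction."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE YourTable (AcctNumb INTEGER, PeriodEndingDate TEXT, "
        "WaterConsumption INTEGER, ReadingType TEXT)"
    )
    conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?)", READINGS)
    rows = conn.execute(TEMPLATE.format(direction=direction)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(first_reading_per_account("DESC"))  # most recent reading per account
    print(first_reading_per_account("ASC"))   # oldest reading per account
```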