SQL group by a column but segmented based on another column - sql

I have a table that contains more than 100,000 rows and has 3 columns:
Account_number
Report_date
Outstanding_amount
I need a statement that groups the outstanding amount by account but also splits the groups by date. Sample data for one account:
+----------------+-------------+--------------------+
| account_number | report_date | outstanding_amount |
+----------------+-------------+--------------------+
| 1 | 02/01/2019 | 100 |
| 1 | 03/01/2019 | 100 |
| 1 | 06/01/2019 | 200 |
| 1 | 07/01/2019 | 300 |
| 1 | 10/01/2019 | 200 |
| 1 | 11/01/2019 | 200 |
| 1 | 12/01/2019 | 100 |
+----------------+-------------+--------------------+
So if I run this statement:
select * from (select account_number, min(report_date) mindate, max(report_date) maxdate, outstanding_amount from table1 group by account_number, outstanding_amount) t
The result of this statement should be similar to this:
+----------------+------------+------------+--------------------+
| account_number | mindate | maxdate | outstanding_amount |
+----------------+------------+------------+--------------------+
| 1 | 02/01/2019 | 12/01/2019 | 100 |
| 1 | 06/01/2019 | 11/01/2019 | 200 |
| 1 | 07/01/2019 | 07/01/2019 | 300 |
+----------------+------------+------------+--------------------+
Here I want to split the result so that the days between mindate and maxdate of one row don't overlap the days of the next row. The result I'm looking for is something like this:
+----------------+------------+------------+--------------------+
| account_number | mindate | maxdate | outstanding_amount |
+----------------+------------+------------+--------------------+
| 1 | 02/01/2019 | 03/01/2019 | 100 |
| 1 | 06/01/2019 | 06/01/2019 | 200 |
| 1 | 07/01/2019 | 07/01/2019 | 300 |
| 1 | 10/01/2019 | 11/01/2019 | 200 |
| 1 | 12/01/2019 | 12/01/2019 | 100 |
+----------------+------------+------------+--------------------+
Is it possible to construct this statement?

This is a gaps-and-islands problem. In this case, the simplest solution is probably the difference of row numbers:
select account_number, outstanding_amount,
min(report_date), max(report_date)
from (select t.*,
row_number() over (partition by account_number order by report_date) as seqnum,
row_number() over (partition by account_number, outstanding_amount order by report_date) as seqnum_o
from t
) t
group by account_number, outstanding_amount, (seqnum - seqnum_o)
order by account_number, min(report_date);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will be able to see how the difference of row numbers defines the adjacent rows with the same amount.
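To make the "difference of row numbers" idea concrete, here is a minimal sketch that runs the same query against the sample rows using Python's built-in sqlite3 module. It assumes a SQLite build with window functions (3.25 or later), a table named t as in the answer, and dates rewritten as ISO strings (the originals are assumed to be day/month/year) so that text order matches date order.

```python
import sqlite3

# Sample rows from the question. Dates are rewritten as ISO strings
# (assumption: the originals are day/month/year) so text order is date order.
rows = [
    (1, "2019-01-02", 100),
    (1, "2019-01-03", 100),
    (1, "2019-01-06", 200),
    (1, "2019-01-07", 300),
    (1, "2019-01-10", 200),
    (1, "2019-01-11", 200),
    (1, "2019-01-12", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (account_number int, report_date text, outstanding_amount int)"
)
conn.executemany("insert into t values (?, ?, ?)", rows)

# The inner query on its own: one row number per account,
# one row number per (account, amount).
numbered = conn.execute("""
    select report_date, outstanding_amount,
           row_number() over (partition by account_number
                              order by report_date) as seqnum,
           row_number() over (partition by account_number, outstanding_amount
                              order by report_date) as seqnum_o
    from t
    order by report_date
""").fetchall()

# The difference stays constant while the amount repeats on adjacent rows.
diffs = [seqnum - seqnum_o for _, _, seqnum, seqnum_o in numbered]
print(diffs)

# Grouping by amount and that difference yields one row per island.
islands = conn.execute("""
    select account_number, outstanding_amount,
           min(report_date), max(report_date)
    from (select t.*,
                 row_number() over (partition by account_number
                                    order by report_date) as seqnum,
                 row_number() over (partition by account_number, outstanding_amount
                                    order by report_date) as seqnum_o
          from t) as s
    group by account_number, outstanding_amount, (seqnum - seqnum_o)
    order by account_number, min(report_date)
""").fetchall()
for island in islands:
    print(island)
```

For the sample data the differences come out as 0, 0, 2, 3, 3, 3, 4. The single 300 row and the second run of 200 share the difference 3, which is why outstanding_amount has to be part of the GROUP BY alongside the difference.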

To collapse the data, group it by a calculated rank: flag each row whose amount differs from the previous row, then take a running sum of the flags.
select account_number
, min(report_date) as mindate
, max(report_date) as maxdate
, outstanding_amount
from
(
select q1.*
, sum(flag) over (partition by account_number order by report_date) as rnk
from
(
select t.*
, case when outstanding_amount = lag(outstanding_amount, 1) over (partition by account_number order by report_date) then 0 else 1 end as flag
from table1 t
) q1
) q2
group by account_number, outstanding_amount, rnk
order by account_number, mindate;
A test on db<>fiddle here

Related

Unique cumulative customers by each day

Task: Get the total number of unique cumulative customers by each decline reason and by each day.
Input data sample:
+---------+--------------+------------+------+
| Cust_Id | Decline_Dt | Reason | Days |
+---------+--------------+------------+------+
| A | 08-09-2020 | Reason_1 | 0 |
| A | 08-09-2020 | Reason_1 | 1 |
| A | 08-09-2020 | Reason_1 | 2 |
| A | 08-09-2020 | Reason_1 | 4 |
| B | 08-09-2020 | Reason_1 | 0 |
| B | 08-09-2020 | Reason_1 | 2 |
| B | 08-09-2020 | Reason_1 | 3 |
| C | 08-09-2020 | Reason_1 | 1 |
+---------+--------------+------------+------+
1) Decline_dt - The date on which the payment was declined. (Ignore it for this task)
2) Days - Indicates the number of days after the payment decline on which the customer interacted with the IVR channel.
3) Reason - Indicates the payment decline reason
--Expected Output:
+---------------+-----------+---------------+----------------------------+
| Reason | Days | Unique_mtns | total_cumulative_customers |
+---------------+-----------+---------------+----------------------------+
| Reason_1 | 0 | 2 | 2 |
| Reason_1 | 1 | 2 | 3 |
| Reason_1 | 2 | 2 | 3 |
| Reason_1 | 3 | 1 | 3 |
| Reason_1 | 4 | 1 | 3 |
+---------------+-----------+---------------+----------------------------+
My Hive query:
select a.Reason
, a.days
-- , count(distinct a.cust_id) as unique_mtns
, count(distinct a.cust_id) over (partition by Reason
order by a.days rows between unbounded preceding and current row)
as total_cumulative_customers
from table as a
group by a.reason
, a.days
Output (Incorrect):
+---------------+-----------+----------------------------+
| Reason | Days | total_cumulative_customers |
+---------------+-----------+----------------------------+
| Reason_1 | 0 | 2 |
| Reason_1 | 1 | 2 |
| Reason_1 | 2 | 2 |
| Reason_1 | 3 | 1 |
| Reason_1 | 4 | 1 |
+---------------+-----------+----------------------------+
Ideally, I would expect the window function to be executed without group by.
However, I get an error without group by. When I use group by, I don't get the cumulative customers.
If I follow you correctly, you can use a subquery to compute the first day per customer/reason tuple, and then do conditional aggregation:
select reason, days,
count(distinct cust_id) as unique_mtns,
sum(sum(case when days = min_days then 1 else 0 end))
over(partition by reason order by days) as total_cumulative_customers
from (
select reason, cust_id, days,
min(days) over(partition by reason, cust_id) as min_days
from mytable
) t
group by reason, days
I would recommend using row_number() to enumerate the rows for a given customer and reason. Your code uses count(distinct) on the customer id, suggesting that you might have duplicates on a given day.
This would be:
select reason, days, count(distinct cust_id) as unique_mtns,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by reason order by days) as total_cumulative_customers
from (select t.*,
row_number() over (partition by reason, cust_id order by days) as seqnum
from t
) t
group by reason, days
order by reason, days;
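The "count each customer only on their first day" idea can be checked against the question's sample rows. The sketch below is one possible rendering using Python's sqlite3 module (assuming SQLite 3.25 or later and a table named t). It splits the nested sum(sum(...)) into two steps, a per-day aggregate followed by a running sum, which is an equivalent formulation chosen here for readability and not the answer's exact query.

```python
import sqlite3

# Sample rows from the question: (cust_id, reason, days).
rows = [
    ("A", "Reason_1", 0), ("A", "Reason_1", 1), ("A", "Reason_1", 2),
    ("A", "Reason_1", 4), ("B", "Reason_1", 0), ("B", "Reason_1", 2),
    ("B", "Reason_1", 3), ("C", "Reason_1", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (cust_id text, reason text, days int)")
conn.executemany("insert into t values (?, ?, ?)", rows)

result = conn.execute("""
    select reason, days, unique_mtns,
           -- running total of customers seen for the first time
           sum(new_customers) over (partition by reason order by days)
               as total_cumulative_customers
    from (
        select reason, days,
               count(distinct cust_id) as unique_mtns,
               -- seqnum = 1 marks a customer's first row for this reason
               sum(case when seqnum = 1 then 1 else 0 end) as new_customers
        from (select t.*,
                     row_number() over (partition by reason, cust_id
                                        order by days) as seqnum
              from t) as numbered
        group by reason, days
    ) as per_day
    order by reason, days
""").fetchall()

for row in result:
    print(row)
```

With the sample data this reproduces the expected output table from the question: two customers on day 0, a third on day 1, and no new customers afterwards.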

SQL group by changing column

Suppose I have a table sorted by date as so:
+-------------+--------+
| DATE | VALUE |
+-------------+--------+
| 01-09-2020 | 5 |
| 01-15-2020 | 5 |
| 01-17-2020 | 5 |
| 02-03-2020 | 8 |
| 02-13-2020 | 8 |
| 02-20-2020 | 8 |
| 02-23-2020 | 5 |
| 02-25-2020 | 5 |
| 02-28-2020 | 3 |
| 03-13-2020 | 3 |
| 03-18-2020 | 3 |
+-------------+--------+
I want to group by changes in value within that given date range, and add a value that increments each time as an added column to denote that.
I have tried a number of different things, such as using the lag function:
SELECT value, value - lag(value) over (order by date) as count
GROUP BY value
In short, I want to take the table above and have it look like:
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 01-15-2020 | 5 | 1 |
| 01-17-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-13-2020 | 8 | 2 |
| 02-20-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-25-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
| 03-13-2020 | 3 | 4 |
| 03-18-2020 | 3 | 4 |
+-------------+--------+-------+
I want to eventually have it all in one small table with the earliest date for each.
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
+-------------+--------+-------+
Any help would be very appreciated
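You can use a combination of the Row_number and Dense_rank functions to get the required result. The difference of two row numbers identifies each run of equal values, and Dense_rank over the first date of each run numbers the runs:
;with cte
as
(
select t.DATE,t.VALUE
,Row_number() over(order by t.DATE)
- Row_number() over(partition by t.VALUE order by t.DATE) as grp
from table t
)
,cte2
as
(
select DATE,VALUE
,min(DATE) over(partition by VALUE,grp) as run_start
,Row_number() over(partition by VALUE,grp order by DATE) as r_num
from cte
)
Select DATE,VALUE,Dense_rank() over(order by run_start) as count
from cte2
where r_num = 1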
You can use a lag and cumulative sum and a subquery:
SELECT value,
SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END) OVER (ORDER BY date)
FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) as prev_value
FROM t
) t
Here is a db<>fiddle.
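As a rough check of the lag-plus-cumulative-sum approach, the sketch below runs it on the question's sample data with Python's sqlite3 module (assuming SQLite 3.25 or later). The column names dt and val, the alias grp, and the ISO date strings are choices made for this illustration and are not taken from the original post.

```python
import sqlite3

# Sample data from the question as (dt, val), with ISO dates so that
# text ordering matches chronological ordering (an assumption of this sketch).
rows = [
    ("2020-01-09", 5), ("2020-01-15", 5), ("2020-01-17", 5),
    ("2020-02-03", 8), ("2020-02-13", 8), ("2020-02-20", 8),
    ("2020-02-23", 5), ("2020-02-25", 5),
    ("2020-02-28", 3), ("2020-03-13", 3), ("2020-03-18", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (dt text, val int)")
conn.executemany("insert into t values (?, ?)", rows)

# Step 1: fetch the previous value with lag(); step 2: add 1 to a running
# sum whenever the value changes (the first row has no previous value,
# so the comparison is NULL and the ELSE branch counts it as a change).
labelled_sql = """
    select dt, val,
           sum(case when prev_val = val then 0 else 1 end)
               over (order by dt) as grp
    from (select t.*, lag(val) over (order by dt) as prev_val from t) as s
    order by dt
"""
labelled = conn.execute(labelled_sql).fetchall()
groups = [grp for _, _, grp in labelled]
print(groups)

# The small summary table: earliest date per group.
summary = conn.execute(
    f"select min(dt), val, grp from ({labelled_sql}) as g "
    "group by grp, val order by grp"
).fetchall()
for row in summary:
    print(row)
```

The second run of 5s gets its own group number because the running sum only ever increases, unlike a rank partitioned by value.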
You can use the lag() and then the row_number() analytic functions in succession, and filter on the inequality between the values they return:
WITH t2 AS
(
SELECT LAG(value,1,value-1) OVER (ORDER BY date) as lg,
t.*
FROM t
)
SELECT t2.date,t2.value, ROW_NUMBER() OVER (ORDER BY t2.date) as count
FROM t2
WHERE value - lg != 0
Demo

Number based on condition

I'm trying to generate a number based on a condition.
When there is a 'Yes' in column 'Stop' within the partition of a Client ordered by Start_Date, the dense rank has to start over. I tried several things, but it's still not what I want.
My table with current number and expected number
+-----------+------------+------+------------+-------------+
| Client_No | Start_Date | Stop | Current_No | Expected_No |
+-----------+------------+------+------------+-------------+
| 1 | 1-1-2018 | No | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 1 | 1-2-2018 | No | 2 | 2 |
+-----------+------------+------+------------+-------------+
| 1 | 1-3-2018 | No | 3 | 3 |
+-----------+------------+------+------------+-------------+
| 1 | 1-4-2018 | Yes | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 1 | 1-5-2018 | No | 4 | 2 |
+-----------+------------+------+------------+-------------+
| 1 | 1-6-2018 | No | 5 | 3 |
+-----------+------------+------+------------+-------------+
| 2 | 1-2-2018 | No | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 2 | 1-3-2018 | No | 2 | 2 |
+-----------+------------+------+------------+-------------+
| 2 | 1-4-2018 | Yes | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 2 | 1-5-2018 | No | 3 | 2 |
+-----------+------------+------+------------+-------------+
| 2 | 1-6-2018 | Yes | 2 | 1 |
+-----------+------------+------+------------+-------------+
The query I used so far:
DENSE_RANK() OVER(PARTITION BY Client_No, Stop ORDER BY Start_Date ASC)
This does not seem to be the solution because it keeps counting onward from the 'No' values, but I don't know how to handle this another way.
One way to solve such a Gaps-And-Islands puzzle is to first calculate a rank that starts with the 'Yes' stops.
Then calculate the row_number or dense_rank also over that rank.
For example:
create table test
(
Id int identity(1,1) primary key,
Client_No int,
Start_Date date,
Stop varchar(3)
)
insert into test
(Client_No, Start_Date, Stop) values
(1,'2018-01-01','No')
,(1,'2018-02-01','No')
,(1,'2018-03-01','No')
,(1,'2018-04-01','Yes')
,(1,'2018-05-01','No')
,(1,'2018-06-01','No')
,(2,'2018-02-01','No')
,(2,'2018-03-01','No')
,(2,'2018-04-01','Yes')
,(2,'2018-05-01','No')
,(2,'2018-06-01','Yes')
select *
, row_number() over (partition by Client_no, Rnk order by start_date) as rn
from
(
select *
, sum(case when Stop = 'Yes' then 1 else 0 end) over (partition by Client_No order by start_date) rnk
from test
) q
order by Client_No, start_date
GO
Id | Client_No | Start_Date | Stop | rnk | rn
-: | --------: | :------------------ | :--- | --: | :-
1 | 1 | 01/01/2018 00:00:00 | No | 0 | 1
2 | 1 | 01/02/2018 00:00:00 | No | 0 | 2
3 | 1 | 01/03/2018 00:00:00 | No | 0 | 3
4 | 1 | 01/04/2018 00:00:00 | Yes | 1 | 1
5 | 1 | 01/05/2018 00:00:00 | No | 1 | 2
6 | 1 | 01/06/2018 00:00:00 | No | 1 | 3
7 | 2 | 01/02/2018 00:00:00 | No | 0 | 1
8 | 2 | 01/03/2018 00:00:00 | No | 0 | 2
9 | 2 | 01/04/2018 00:00:00 | Yes | 1 | 1
10 | 2 | 01/05/2018 00:00:00 | No | 1 | 2
11 | 2 | 01/06/2018 00:00:00 | Yes | 2 | 1
db<>fiddle here
The difference between using this:
row_number() over (partition by Client_no, Rnk order by start_date)
versus this:
dense_rank() over (partition by Client_no, Rnk order by start_date)
is that the dense_rank would calculate the same number for the same start_date per Client_no & Rnk.
Below is one approach which gives you the output you want. You can see as a live/working demo here.
The steps involved are:
1. Create an adjusted stop value that marks Stop as 'Yes' on the first row for every client.
2. Create a separate table that contains only the rows where counting should start or restart.
3. For each row in this new table, add an end date: the start date of the client's next row or, for the last row, a date in the future.
4. Join the original data with the new table and run a sequence partitioned by client and stop cycle.
-- 1. Creating adjusted stop value
with data_adjusted_stop as
(
select *,
case when row_number() over(partition by Client_No order by Start_Date asc) = 1 then 'Yes' else Stop end as adjusted_stop
from data
),
-- 2. Extracting the rows where we will want to (re)start the counting
data_with_cycle as
(
select Client_No,
row_number() over(partition by Client_No order by Start_Date asc) adjusted_stop_cycle,
Start_Date
from data_adjusted_stop
where adjusted_stop = 'Yes'
),
-- 3. Adding an End_Date column for each row where we will want to (re)start counting
data_with_end_date as
(
select *,
coalesce(lead(Start_Date) over (partition by Client_No order by Start_Date asc), '2021-01-01') as End_Date
from data_with_cycle
)
-- 4. Running a sequence partitioned by Client_No and the stop cycle
select data.*,
row_number() over(partition by data.Client_No, data_with_end_date.adjusted_stop_cycle order by data.Start_Date asc) as desired_output_sequence
from data
left join data_with_end_date
on data_with_end_date.Client_no = data.Client_no
where data.Start_Date >= data_with_end_date.Start_Date
and data.Start_Date < data_with_end_date.End_Date
order by 1, 2

SQL Row_Number() not sorting dates

I have a query in which I am trying to get the most recent date from my ROW_NUMBER() selection. I have tried both MAX() and DESC in my ORDER BY clause. It does not show the most recent date as RowNum 1.
This is my query:
;WITH cte3 AS
(
SELECT
o.PartNo,
o.JobNo,
MAX(tt.TicketDate) as rawr,
ROW_NUMBER() OVER (PARTITION BY o.JobNo, o.PartNo
ORDER BY tt.TicketDate DESC) as RowNum
FROM
OrderDet AS o
INNER JOIN
TimeTicketDet AS tt ON o.JobNo = tt.JobNo
WHERE
o.Status = 'Open'
GROUP BY
tt.TicketDate, o.JobNo, o.PartNo
)
SELECT *
FROM cte3
When I get it giving me the correct results, I will add a WHERE RowNum = 1 in the cte query.
With my current query, this is the result:
+--------+-------+-----------+--------+
| PartNo | JobNo | rawr | RowNum |
+--------+-------+-----------+--------+
| 1234 | 20 | 5/30/2012 | 1 |
| 1234 | 20 | 5/29/2012 | 2 |
| 1234 | 20 | 5/25/2012 | 3 |
| 1234 | 20 | 5/24/2012 | 4 |
| 1234 | 20 | 5/23/2012 | 5 |
| 1234 | 20 | 5/22/2012 | 6 |
| 1234 | 20 | 5/16/2012 | 7 |
| 1234 | 20 | 5/15/2012 | 8 |
| 1234 | 20 | 5/14/2012 | 9 |
| 1234 | 20 | 5/11/2012 | 10 |
| 1234 | 20 | 5/10/2012 | 11 |
| 1234 | 20 | 5/9/2012 | 12 |
| 1234 | 20 | 3/27/2015 | 13 |
| 1234 | 20 | 1/3/2013 | 14 |
| 1234 | 20 | 1/2/2013 | 15 |
+--------+-------+-----------+--------+
RowNum = 13 is the most recent date. Am I organizing my sorts incorrectly or incorrectly converting my dates?
EDIT:
TimeTicketDet Table Sample Data:
+------------+-------+
| TicketDate | JobNo |
+------------+-------+
| 5/9/2012 | 20 |
| 5/10/2012 | 20 |
| 5/24/2012 | 20 |
| 3/27/2015 | 20 |
| 5/22/2012 | 20 |
| 5/10/2012 | 20 |
| 5/11/2012 | 20 |
| 5/9/2012 | 100 |
| 5/10/2012 | 100 |
| 5/24/2012 | 100 |
| 3/27/2015 | 100 |
| 5/22/2012 | 100 |
| 5/10/2012 | 100 |
| 5/11/2012 | 100 |
+------------+-------+
OrderDet Table Sample Data:
+--------+--------+-------+
| PartNo | Status | JobNo |
+--------+--------+-------+
| 1234 | Open | 20 |
| 1234 | Open | 100 |
+--------+--------+-------+
Desired Result:
+--------+------------+-------+--------+
| PartNo | TicketDate | JobNo | RowNum |
+--------+------------+-------+--------+
| 1234 | 3/27/2015 | 20 | 1 |
| 1234 | 3/27/2015 | 100 | 1 |
+--------+------------+-------+--------+
As I mentioned in my comment, since your TicketDate column is a char, you need to convert it to a datetime in order to sort it by actual date. Right now, you are sorting it by string value which isn't correct.
I'd recommend changing your query to something like this:
;WITH cte3 AS
(
SELECT
o.PartNo,
o.JobNo,
MAX(tt.TicketDate) as rawr,
ROW_NUMBER() OVER (PARTITION BY o.JobNo, o.PartNo
ORDER BY cast(tt.TicketDate as datetime) DESC) as RowNum
FROM
OrderDet AS o
INNER JOIN
TimeTicketDet AS tt ON o.JobNo = tt.JobNo
WHERE
o.Status = 'Open'
GROUP BY
cast(tt.TicketDate as datetime), o.JobNo, o.PartNo
)
SELECT *
FROM cte3
where RowNum = 1;
Here is a demo. By casting your char to a datetime in your row_number you will be sorting the data by date instead of string.
Additionally, you don't really need the max() and the GROUP BY since by casting the TicketDate to a datetime you will return the correct row:
;WITH cte3 AS
(
SELECT
o.PartNo,
o.JobNo,
tt.TicketDate as rawr,
ROW_NUMBER() OVER (PARTITION BY o.JobNo, o.PartNo
ORDER BY cast(tt.TicketDate as datetime) DESC) as RowNum
FROM
OrderDet AS o
INNER JOIN
TimeTicketDet AS tt ON o.JobNo = tt.JobNo
WHERE
o.Status = 'Open'
)
SELECT *
FROM cte3
where RowNum =1;
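The difference between sorting the text and sorting the dates is easy to reproduce outside the database. The short Python sketch below uses a handful of the ticket dates and assumes they are stored zero-padded as MM/DD/YYYY strings. That padding is an assumption made for this illustration, chosen because it is consistent with the ordering shown in the question's result.

```python
from datetime import datetime

# A few ticket dates, assumed to be stored as zero-padded MM/DD/YYYY text.
ticket_dates = [
    "05/09/2012", "05/10/2012", "05/24/2012",
    "03/27/2015", "05/22/2012", "05/11/2012",
]

# Descending sort on the raw strings compares characters left to right,
# so the month digits decide the order and the year is looked at last.
as_text = sorted(ticket_dates, reverse=True)

# Descending sort on parsed dates, analogous to
# ORDER BY cast(TicketDate as datetime) DESC.
as_dates = sorted(
    ticket_dates,
    key=lambda s: datetime.strptime(s, "%m/%d/%Y"),
    reverse=True,
)

print(as_text[0])   # latest by string comparison
print(as_dates[0])  # latest by actual date
```

Sorted as text, the 2015 date lands behind every May 2012 date. Sorted as dates, it comes first, which is the row that RowNum = 1 should pick.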
As Ollie suggests, you can CAST your string to DATETIME, and you don't need the additional GROUP BY.
SQL DEMO
;WITH cte3 AS
(
SELECT
o.PartNo,
o.JobNo,
tt.TicketDate as rawr,
ROW_NUMBER() OVER (PARTITION BY o.JobNo, o.PartNo
ORDER BY cast(tt.TicketDate as datetime) DESC) as RowNum
FROM OrderDet AS o
JOIN TimeTicketDet AS tt
ON o.JobNo = tt.JobNo
WHERE
o.Status = 'Open'
)
SELECT *
FROM cte3
WHERE RowNum = 1

SQL Server 2014 - How to calculate percentage based on last NOT NULL value in a column?

I have the following tables, dates, items and sales, as shown below:
table Dates :
+---------+------------+------------+
| Date_ID | StartDates | EndDates |
+---------+------------+------------+
| 1 | 2016-07-01 | 2016-07-05 |
| 2 | 2016-07-06 | 2016-07-12 |
+---------+------------+------------+
table items :
+--------+----------+---------+
| ITM_ID | ITM_Name | ITM_Qty |
+--------+----------+---------+
| A0001 | Item A | 30 |
| B0001 | Item B | 50 |
+--------+----------+---------+
table sales :
+----------+------------+------------+-----------+
| Sales_ID | Sales_Date | Sales_Item | Sales_Qty |
+----------+------------+------------+-----------+
| S0001 | 2016-07-02 | A | 5 |
| S0002 | 2016-07-04 | A | 15 |
| S0003 | 2016-07-08 | B | 20 |
| S0004 | 2016-07-12 | A | 10 |
+----------+------------+------------+-----------+
I would like to calculate a percentage (the ratio of sales in the current period to sales in the previous period) and the available quantity of each item after the sales.
My expected output would be like this :
+------------+------------+---------+----------+----------+-----------+
| StartDates | EndDates | Item_ID | Sold_Qty | Percents | Available |
+------------+------------+---------+----------+----------+-----------+
| 2016-07-01 | 2016-07-05 | A0001 | 20 | 100 | 10 |
| 2016-07-01 | 2016-07-05 | B0001 | 0 | 0 | 50 |
| 2016-07-06 | 2016-07-12 | A0001 | 10 | 50 | 0 |
| 2016-07-06 | 2016-07-12 | B0001 | 20 | 100 | 30 |
+------------+------------+---------+----------+----------+-----------+
I hope the expected output is possible, but I have not found a working query yet.
As in the table above, the Percents column is the sales of the current period as a percentage of the previous period. For example, item A0001 has a sold_qty of 20 in the first period and 10 in the second, so the percentage for the second period is (10/20) * 100 = 50.
EDIT: for item B0001, the sold_qty of the first period is 0, so the percentage calculation should not take the first period's value into account.
Try the query below. To calculate Percents I have used a CASE expression, which you can simplify based on your requirements.
SELECT *,
CASE
WHEN SOLD_QTY = 0 THEN 0
WHEN LAG(SOLD_QTY)
OVER(
PARTITION BY ITM_ID
ORDER BY STARTDATES) = 0
OR LAG(SOLD_QTY)
OVER(
PARTITION BY ITM_ID
ORDER BY STARTDATES) IS NULL THEN 100
ELSE CONVERT(FLOAT, SOLD_QTY) / NULLIF(LAG(SOLD_QTY)
OVER(
PARTITION BY ITM_ID
ORDER BY STARTDATES), 0) * 100
END PERCENTS
FROM (SELECT T1.STARTDATES,
T1.ENDDATES,
T2.ITM_ID,
ISNULL(SUM(T3.SALES_QTY), 0) SOLD_QTY,
( T2.[ITM_QTY] ) - ISNULL(SUM(T3.SALES_QTY), 0)AS AVIL
FROM #TABLE1 T1
CROSS JOIN #TABLE2 T2
LEFT JOIN #TABLE3 T3
ON T3.[SALES_ITEM] = LEFT(T2.[ITM_ID], 1)
AND T3.SALES_DATE BETWEEN T1.STARTDATES AND T1.ENDDATES
GROUP BY T1.STARTDATES,
T1.ENDDATES,
T2.ITM_ID,
T2.[ITM_QTY])A
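The percentage logic can be tried on the question's sample tables. The sketch below is a SQLite rendering run through Python's sqlite3 module (assuming SQLite 3.25 or later), so it is an approximation of the SQL Server query and not a drop-in replacement. The CTE names period_sales and with_prev are made up for this example, substr and ifnull stand in for LEFT and ISNULL, and the previous period is taken by ordering LAG on the period start date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dates (date_id int, startdates text, enddates text);
    create table items (itm_id text, itm_name text, itm_qty int);
    create table sales (sales_id text, sales_date text, sales_item text, sales_qty int);

    insert into dates values (1, '2016-07-01', '2016-07-05'),
                             (2, '2016-07-06', '2016-07-12');
    insert into items values ('A0001', 'Item A', 30), ('B0001', 'Item B', 50);
    insert into sales values ('S0001', '2016-07-02', 'A', 5),
                             ('S0002', '2016-07-04', 'A', 15),
                             ('S0003', '2016-07-08', 'B', 20),
                             ('S0004', '2016-07-12', 'A', 10);
""")

percents = conn.execute("""
    with period_sales as (
        -- every period/item pair, with 0 when nothing was sold
        select d.startdates, d.enddates, i.itm_id,
               ifnull(sum(s.sales_qty), 0) as sold_qty
        from dates d
        cross join items i
        left join sales s
               on s.sales_item = substr(i.itm_id, 1, 1)
              and s.sales_date between d.startdates and d.enddates
        group by d.startdates, d.enddates, i.itm_id
    ),
    with_prev as (
        -- previous period's quantity for the same item
        select period_sales.*,
               lag(sold_qty) over (partition by itm_id
                                   order by startdates) as prev_qty
        from period_sales
    )
    select startdates, itm_id, sold_qty,
           case when sold_qty = 0 then 0.0
                when prev_qty is null or prev_qty = 0 then 100.0
                else 100.0 * sold_qty / prev_qty
           end as percents
    from with_prev
    order by startdates, itm_id
""").fetchall()

for row in percents:
    print(row)
```

On the sample data the Percents values come out as 100, 0, 50 and 100, in the same row order as the expected output in the question. The Available column is not covered by this sketch.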