Number based on condition - sql

I'm trying to generate a number based on a condition.
When there is a 'Yes' in column 'Stop', within the partition of a Client ordered by Start_Date, the dense rank has to start over. I tried several things, but it's still not what I want.
My table with the current number and the expected number:
+-----------+------------+------+------------+-------------+
| Client_No | Start_Date | Stop | Current_No | Expected_No |
+-----------+------------+------+------------+-------------+
| 1 | 1-1-2018 | No | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 1 | 1-2-2018 | No | 2 | 2 |
+-----------+------------+------+------------+-------------+
| 1 | 1-3-2018 | No | 3 | 3 |
+-----------+------------+------+------------+-------------+
| 1 | 1-4-2018 | Yes | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 1 | 1-5-2018 | No | 4 | 2 |
+-----------+------------+------+------------+-------------+
| 1 | 1-6-2018 | No | 5 | 3 |
+-----------+------------+------+------------+-------------+
| 2 | 1-2-2018 | No | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 2 | 1-3-2018 | No | 2 | 2 |
+-----------+------------+------+------------+-------------+
| 2 | 1-4-2018 | Yes | 1 | 1 |
+-----------+------------+------+------------+-------------+
| 2 | 1-5-2018 | No | 3 | 2 |
+-----------+------------+------+------------+-------------+
| 2 | 1-6-2018 | Yes | 2 | 1 |
+-----------+------------+------+------------+-------------+
The query I used so far:
DENSE_RANK() OVER(PARTITION BY Client_No, Stop ORDER BY Start_Date ASC)
This does not seem to be the solution, because it keeps counting onward across the 'No' values, but I don't know how to handle this another way.

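One way to solve such a Gaps-And-Islands puzzle is to first calculate a running count of the 'Yes' stops per client, which gives every row a group number (rnk) that increases at each 'Yes'.
Then calculate the row_number or dense_rank partitioned by that group number as well.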
For example:
create table test
(
Id int identity(1,1) primary key,
Client_No int,
Start_Date date,
Stop varchar(3)
)
insert into test
(Client_No, Start_Date, Stop) values
(1,'2018-01-01','No')
,(1,'2018-02-01','No')
,(1,'2018-03-01','No')
,(1,'2018-04-01','Yes')
,(1,'2018-05-01','No')
,(1,'2018-06-01','No')
,(2,'2018-02-01','No')
,(2,'2018-03-01','No')
,(2,'2018-04-01','Yes')
,(2,'2018-05-01','No')
,(2,'2018-06-01','Yes')
select *
, row_number() over (partition by Client_no, Rnk order by start_date) as rn
from
(
select *
, sum(case when Stop = 'Yes' then 1 else 0 end) over (partition by Client_No order by start_date) rnk
from test
) q
order by Client_No, start_date
GO
Id | Client_No | Start_Date | Stop | rnk | rn
-: | --------: | :------------------ | :--- | --: | :-
1 | 1 | 01/01/2018 00:00:00 | No | 0 | 1
2 | 1 | 01/02/2018 00:00:00 | No | 0 | 2
3 | 1 | 01/03/2018 00:00:00 | No | 0 | 3
4 | 1 | 01/04/2018 00:00:00 | Yes | 1 | 1
5 | 1 | 01/05/2018 00:00:00 | No | 1 | 2
6 | 1 | 01/06/2018 00:00:00 | No | 1 | 3
7 | 2 | 01/02/2018 00:00:00 | No | 0 | 1
8 | 2 | 01/03/2018 00:00:00 | No | 0 | 2
9 | 2 | 01/04/2018 00:00:00 | Yes | 1 | 1
10 | 2 | 01/05/2018 00:00:00 | No | 1 | 2
11 | 2 | 01/06/2018 00:00:00 | Yes | 2 | 1
The difference between using this:
row_number() over (partition by Client_no, Rnk order by start_date)
versus this:
dense_rank() over (partition by Client_no, Rnk order by start_date)
is that the dense_rank would calculate the same number for the same start_date per Client_no & Rnk.
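To make that difference concrete, here is a minimal sketch that runs both functions side by side through Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added). The data is hypothetical and not taken from the question: one client with a duplicated Start_Date after a 'Yes' stop.

```python
import sqlite3

# Hypothetical rows: the two 2018-03-01 rows share a start date.
rows = [
    (1, "2018-01-01", "No"),
    (1, "2018-02-01", "Yes"),
    (1, "2018-03-01", "No"),
    (1, "2018-03-01", "No"),
    (1, "2018-04-01", "No"),
]

con = sqlite3.connect(":memory:")
con.execute("create table test (Client_No int, Start_Date text, Stop text)")
con.executemany("insert into test values (?, ?, ?)", rows)

query = """
select Start_Date, Stop,
       row_number() over (partition by Client_No, rnk order by Start_Date) as rn,
       dense_rank() over (partition by Client_No, rnk order by Start_Date) as dr
from (
    -- running count of 'Yes' rows = group number per client
    select *,
           sum(case when Stop = 'Yes' then 1 else 0 end)
               over (partition by Client_No order by Start_Date) as rnk
    from test
) q
order by Start_Date, rn
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

On the duplicated date, rn keeps counting (2, 3) while dr repeats (2, 2), so the choice depends on whether rows with the same date should share a number.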

Below is one approach that gives you the output you want.
The steps involved are:
1. Create an adjusted stop value, where we mark Stop as 'Yes' for the first row of every client.
2. Create a separate table that only includes the rows where we want to start or restart counting.
3. For each row in this new table, add an end date, which is the start date of the next row for the same client, or a date in the future for the last row.
4. Join the original data table with the new table and run a sequence based on this new calculation.
-- Assumes the source table is named data
-- 1. Creating adjusted stop value
with data_adjusted_stop as
(
select *,
case when row_number() over(partition by Client_No order by Start_Date asc) = 1 then 'Yes' else Stop end as adjusted_stop
from data
),
-- 2. Extracting the rows where we will want to (re)start the counting
data_with_cycle as
(
select Client_No,
row_number() over(partition by Client_No order by Start_Date asc) adjusted_stop_cycle,
Start_Date
from data_adjusted_stop
where adjusted_stop = 'Yes'
),
-- 3. Adding an End_Date column for each row where we will want to (re)start counting
data_with_end_date as
(
select *,
coalesce(lead(Start_Date) over (partition by Client_No order by Start_Date asc), '2021-01-01') as End_Date
from data_with_cycle
)
-- 4. Running a sequence partitioned by Client_No and the stop cycle
select data.*,
row_number() over(partition by data.Client_No, data_with_end_date.adjusted_stop_cycle order by data.Start_Date asc) as desired_output_sequence
from data
left join data_with_end_date
on data_with_end_date.Client_no = data.Client_no
where data.Start_Date >= data_with_end_date.Start_Date
and data.Start_Date < data_with_end_date.End_Date
order by 1, 2

Related

Unique cumulative customers by each day

Task: Get the total number of unique cumulative customers by each decline reason and by each day.
Input data sample:
+---------+--------------+------------+------+
| Cust_Id | Decline_Dt | Reason | Days |
+---------+--------------+------------+------+
| A | 08-09-2020 | Reason_1 | 0 |
| A | 08-09-2020 | Reason_1 | 1 |
| A | 08-09-2020 | Reason_1 | 2 |
| A | 08-09-2020 | Reason_1 | 4 |
| B | 08-09-2020 | Reason_1 | 0 |
| B | 08-09-2020 | Reason_1 | 2 |
| B | 08-09-2020 | Reason_1 | 3 |
| C | 08-09-2020 | Reason_1 | 1 |
+---------+--------------+------------+------+
1) Decline_dt - The date on which the payment was declined. (Ignore it for this task)
2) Days - Indicates the number of days after the payment decline on which the customer interacted with the IVR channel.
3) Reason - Indicates the payment decline reason
--Expected Output:
+---------------+-----------+---------------+----------------------------+
| Reason | Days | Unique_mtns | total_cumulative_customers |
+---------------+-----------+---------------+----------------------------+
| Reason_1 | 0 | 2 | 2 |
| Reason_1 | 1 | 2 | 3 |
| Reason_1 | 2 | 2 | 3 |
| Reason_1 | 3 | 1 | 3 |
| Reason_1 | 4 | 1 | 3 |
+------------------------------------------------------------------------+
My Hive query:
select a.Reason
, a.days
-- , count(distinct a.cust_id) as unique_mtns
, count(distinct a.cust_id) over (partition by Reason
order by a.days rows between unbounded preceding and current row)
as total_cumulative_customers
from table as a
group by a.reason
, a.days
Output (Incorrect):
+---------------+-----------+----------------------------+
| Reason | Days | total_cumulative_customers |
+---------------+-----------+----------------------------+
| Reason_1 | 0 | 2 |
| Reason_1 | 1 | 2 |
| Reason_1 | 2 | 2 |
| Reason_1 | 3 | 1 |
| Reason_1 | 4 | 1 |
+--------------------------------------------------------+
Ideally, I would expect the window function to be executed without group by.
However, I get an error without group by. When I use group by, I don't get the cumulative customers.
If I follow you correctly, you can use a subquery to compute the first day per customer/reason tuple, and then do conditional aggregation:
select reason, days,
count(distinct cust_id) as unique_mtns,
sum(sum(case when days = min_days then 1 else 0 end))
over(partition by reason order by days) as total_cumulative_customers
from (
select reason, cust_id, days,
min(days) over(partition by reason, cust_id) as min_days
from mytable
) t
group by reason, days
I would recommend using row_number() to enumerate the rows for a given customer and reason. Your code uses count(distinct) on the customer id, suggesting that you might have duplicates on a given day.
This would be:
select reason, days, count(distinct cust_id) as unique_mtns,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by reason order by days) as total_cumulative_customers
from (select t.*,
row_number() over (partition by reason, cust_id order by days) as seqnum
from t
) t
group by reason, days
order by reason, days;
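As a quick check of this idea against the sample data, here is a sketch using Python's built-in sqlite3 module (assuming SQLite 3.25+ for window functions). It follows the row_number() approach but splits it into two steps, so the per-day aggregation and the running total are visible separately. The table name declines and the column new_customers are names made up for this illustration.

```python
import sqlite3

# Sample rows from the question: (cust_id, reason, days)
sample = [
    ("A", "Reason_1", 0), ("A", "Reason_1", 1), ("A", "Reason_1", 2),
    ("A", "Reason_1", 4), ("B", "Reason_1", 0), ("B", "Reason_1", 2),
    ("B", "Reason_1", 3), ("C", "Reason_1", 1),
]

con = sqlite3.connect(":memory:")
con.execute("create table declines (cust_id text, reason text, days int)")
con.executemany("insert into declines values (?, ?, ?)", sample)

query = """
with numbered as (
    -- seqnum = 1 marks the first day a customer shows up for a reason
    select d.*,
           row_number() over (partition by reason, cust_id order by days) as seqnum
    from declines d
),
per_day as (
    select reason, days,
           count(distinct cust_id) as unique_mtns,
           sum(case when seqnum = 1 then 1 else 0 end) as new_customers
    from numbered
    group by reason, days
)
select reason, days, unique_mtns,
       sum(new_customers) over (partition by reason order by days)
           as total_cumulative_customers
from per_day
order by reason, days
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Each customer is counted once, on the first day they appear, and the running sum of those first appearances is the cumulative number of unique customers.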

In Sqlite get top 2 for each name returned in different columns

I have this query, which returns the two most recent dates for each Hipaa_Short. I would like the most recent date in one column and the second most recent in another column, for each Hipaa_Short. Some dates may be missing (so there is only one row for a Hipaa_Short); in that case I still want the row, with the missing value shown as empty. I am using Sqlite3, so I'm sure some 'fancy' stuff won't work.
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Hipaa_Short ORDER BY Meeting_Date DESC) AS rn
FROM Meetings
)
WHERE rn < 3
This is what I get, but it is not what I want:
pk_id Hipaa_Short Meeting_Date rn
+-------|-------------|--------------+-----+
| 2 | LastFirst | 2020-02-01 | 2 |
| 5 | LastFirst | 2020-03-01 | 1 |
| 6 | JoneBob | 2020-03-01 | 2 |
| 7 | JoneBob | 2020-04-01 | 1 |
| 8 | JonesTom | 2020-06-01 | 2 |
| 9 | JonesTom | 2020-07-01 | 1 |
| 10 | NortEdw | 2020-04-01 | 1 |
+-------|-------------|--------------+-----+
Meetings Table:
CREATE TABLE "Meetings" (
"id_pk" INTEGER NOT NULL,
"Hipaa_Short" TEXT NOT NULL,
"Meeting_Date" TEXT NOT NULL,
"MTG_Year" INTEGER,
"MTG_Month" INTEGER,
"MTG_Day" INTEGER,
"CN_Date" TEXT,
"Meeting_Type" TEXT,
"Date_Added" TEXT,
"Annual" TEXT,
"LOCSI_Flag" TEXT,
"Hipaa_RID" TEXT,
PRIMARY KEY("id_pk"),
UNIQUE("Hipaa_Short","Meeting_Date")
)
Sample Data:
pk_id Hipaa_Short Meeting_Date
+-------|-------------|--------------+
| 1 | LastFirst | 2020-01-01 |
| 2 | LastFirst | 2020-02-01 |
| 3 | JoneBob | 2020-02-01 |
| 4 | JonesTom | 2020-02-01 |
| 5 | LastFirst | 2020-03-01 |
| 6 | JoneBob | 2020-03-01 |
| 7 | JoneBob | 2020-04-01 |
| 8 | JonesTom | 2020-06-01 |
| 9 | JonesTom | 2020-07-01 |
| 10 | NortEdw | 2020-04-01 |
+-------|-------------|--------------+
Desired Output:
Hipaa_Short Prior Date Next Date
+-------------|------------+------------+
| LastFirst | 2020-02-01 | 2020-03-01 |
| JoneBob | 2020-03-01 | 2020-04-01 |
| JonesTom | 2020-06-01 | 2020-07-01 |
| NortEdw | | 2020-04-01 |
+-------------|------------|------------+
You can use conditional aggregation on top of your existing query to pivot the resultset:
select
hipaa_short,
max(case when rn = 2 then meeting_date end) prior_date,
max(case when rn = 1 then meeting_date end) next_date
from (
select
m.*,
row_number() over (partition by hipaa_short order by meeting_date desc) as rn
from meetings m
) m
where rn <= 2
group by hipaa_short
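A slightly shorter form of GMB's answer (the conditional aggregation above) for this particular problem is shown here. Note that when a Hipaa_Short has only one meeting, min() and max() return the same date, so prior_date is filled in rather than left empty: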
select hipaa_short, min(meeting_date) as prior_date, max(meeting_date) as next_date
from (select m.*,
row_number() over (partition by hipaa_short order by meeting_date desc) as rn
from meetings m
) m
where rn <= 2
group by hipaa_short
Since you already need to sort the partitions to get just the first row, it's easy (and more efficient) to use the lead() window function to get both dates in a single row without additional aggregation:
WITH cte AS
(SELECT Hipaa_Short
, lead(Meeting_Date) OVER w AS "Prior Date"
, Meeting_Date AS "Next Date"
, row_number() OVER w AS rn
FROM meetings
WINDOW w AS (PARTITION BY Hipaa_Short ORDER BY Meeting_Date DESC))
SELECT Hipaa_Short, "Prior Date", "Next Date"
FROM cte
WHERE rn = 1;
gives
Hipaa_Short Prior Date Next Date
----------- ---------- ----------
JoneBob 2020-03-01 2020-04-01
JonesTom 2020-06-01 2020-07-01
LastFirst 2020-02-01 2020-03-01
NortEdw 2020-04-01
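For anyone who wants to try the lead() approach without setting up a database, here is a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25+, which supports window functions and the WINDOW clause). The table is a trimmed-down version of Meetings with only the three columns from the sample data, and the output columns are renamed prior_date and next_date for convenience.

```python
import sqlite3

# Sample data from the question: (id_pk, Hipaa_Short, Meeting_Date)
sample = [
    (1, "LastFirst", "2020-01-01"), (2, "LastFirst", "2020-02-01"),
    (3, "JoneBob", "2020-02-01"), (4, "JonesTom", "2020-02-01"),
    (5, "LastFirst", "2020-03-01"), (6, "JoneBob", "2020-03-01"),
    (7, "JoneBob", "2020-04-01"), (8, "JonesTom", "2020-06-01"),
    (9, "JonesTom", "2020-07-01"), (10, "NortEdw", "2020-04-01"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table Meetings ("
    " id_pk integer primary key,"
    " Hipaa_Short text not null,"
    " Meeting_Date text not null,"
    " unique (Hipaa_Short, Meeting_Date))"
)
con.executemany("insert into Meetings values (?, ?, ?)", sample)

query = """
with cte as (
    select Hipaa_Short,
           lead(Meeting_Date) over w as prior_date,  -- next row = earlier date
           Meeting_Date as next_date,
           row_number() over w as rn
    from Meetings
    window w as (partition by Hipaa_Short order by Meeting_Date desc)
)
select Hipaa_Short, prior_date, next_date
from cte
where rn = 1
order by Hipaa_Short
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Because the window is sorted in descending order, lead() looks at the next older meeting; for a Hipaa_Short with a single meeting there is no such row, so prior_date comes back as NULL (None in Python).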

SQL group by changing column

Suppose I have a table sorted by date as so:
+-------------+--------+
| DATE | VALUE |
+-------------+--------+
| 01-09-2020 | 5 |
| 01-15-2020 | 5 |
| 01-17-2020 | 5 |
| 02-03-2020 | 8 |
| 02-13-2020 | 8 |
| 02-20-2020 | 8 |
| 02-23-2020 | 5 |
| 02-25-2020 | 5 |
| 02-28-2020 | 3 |
| 03-13-2020 | 3 |
| 03-18-2020 | 3 |
+-------------+--------+
I want to group by changes in value within that given date range, and add a value that increments each time as an added column to denote that.
I have tried a number of different things, such as using the lag function:
SELECT value, value - lag(value) over (order by date) as count
GROUP BY value
In short, I want to take the table above and have it look like:
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 01-15-2020 | 5 | 1 |
| 01-17-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-13-2020 | 8 | 2 |
| 02-20-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-25-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
| 03-13-2020 | 3 | 4 |
| 03-18-2020 | 3 | 4 |
+-------------+--------+-------+
I want to eventually have it all in one small table with the earliest date for each.
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
+-------------+--------+-------+
Any help would be very appreciated
You can use a combination of the Row_number and Dense_rank functions, like below. Note that, because both are partitioned by VALUE, this returns only the first row per distinct value, so a value that reappears later (5 on 02-23-2020) does not get a group of its own:
;with cte
as
(
select t.DATE,t.VALUE
,Dense_rank() over(partition by t.VALUE order by t.DATE) as d_rank
,Row_number() over(partition by t.VALUE order by t.DATE) as r_num
from table t
)
Select cte.Date,cte.Value,cte.d_rank as count
from cte
where r_num = 1
You can use a lag and cumulative sum and a subquery:
SELECT value,
SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END) OVER (ORDER BY date)
FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) as prev_value
FROM t
) t
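The lag-plus-cumulative-sum idea can be taken one step further to produce the small summary table with the earliest date per group. The following is a sketch using Python's built-in sqlite3 module (assuming SQLite 3.25+ for window functions); the dates are rewritten as ISO strings so that they sort correctly as text, and the names flagged, grouped and grp are made up for this illustration.

```python
import sqlite3

# Sample data from the question, with dates as ISO strings
rows = [
    ("2020-01-09", 5), ("2020-01-15", 5), ("2020-01-17", 5),
    ("2020-02-03", 8), ("2020-02-13", 8), ("2020-02-20", 8),
    ("2020-02-23", 5), ("2020-02-25", 5),
    ("2020-02-28", 3), ("2020-03-13", 3), ("2020-03-18", 3),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (date text, value int)")
con.executemany("insert into t values (?, ?)", rows)

query = """
with flagged as (
    select t.*, lag(value) over (order by date) as prev_value
    from t
),
grouped as (
    -- add 1 whenever the value differs from the previous row
    -- (the first row has prev_value NULL, so it also counts as a change)
    select date, value,
           sum(case when prev_value = value then 0 else 1 end)
               over (order by date) as grp
    from flagged
)
select min(date) as first_date, value, grp
from grouped
group by grp, value
order by grp
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

The value 5 appears in two separate groups (1 and 3) because the counter only looks at changes between consecutive rows, not at the value itself.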
You can use the lag() and then the row_number() analytic functions in succession, keeping only the rows whose value differs from the previous one:
WITH t2 AS
(
SELECT LAG(value,1,value-1) OVER (ORDER BY date) as lg,
t.*
FROM t
)
SELECT t2.date,t2.value, ROW_NUMBER() OVER (ORDER BY t2.date) as count
FROM t2
WHERE value - lg != 0

SQL group by a column but segmented based on another column

I have this table, which contains more than 100000 rows and has 3 columns:
Account_number
Report_date
Outstanding_amount
I need to find a statement that groups the outstanding amount by account but also cuts it based on the date. Sample data for 1 account:
+----------------+-------------+--------------------+
| account_number | report_date | outstanding_amount |
+----------------+-------------+--------------------+
| 1 | 02/01/2019 | 100 |
| 1 | 03/01/2019 | 100 |
| 1 | 06/01/2019 | 200 |
| 1 | 07/01/2019 | 300 |
| 1 | 10/01/2019 | 200 |
| 1 | 11/01/2019 | 200 |
| 1 | 12/01/2019 | 100 |
+----------------+-------------+--------------------+
So if I run this statement:
select * from (select account_number, min(report_date) mindate, max(report_date) maxdate, outstanding_amount from table1 group by account_number, outstanding_amount)
The result of this statement should be similar to this:
+----------------+------------+------------+--------------------+
| account_number | mindate | maxdate | outstanding_amount |
+----------------+------------+------------+--------------------+
| 1 | 02/01/2019 | 12/01/2019 | 100 |
| 1 | 06/01/2019 | 11/01/2019 | 200 |
| 1 | 07/01/2019 | 07/01/2019 | 300 |
+----------------+------------+------------+--------------------+
So here I want to separate the result so that the days between mindate and maxdate of one row won't overlap the days in the next row. The result I'm looking for is something like this:
+----------------+------------+------------+--------------------+
| account_number | mindate | maxdate | outstanding_amount |
+----------------+------------+------------+--------------------+
| 1 | 02/01/2019 | 03/01/2019 | 100 |
| 1 | 06/01/2019 | 06/01/2019 | 200 |
| 1 | 07/01/2019 | 07/01/2019 | 300 |
| 1 | 10/01/2019 | 11/01/2019 | 200 |
| 1 | 12/01/2019 | 12/01/2019 | 100 |
+----------------+------------+------------+--------------------+
Is it possible to construct this statement?
This is a gaps-and-islands problem. In this case, the simplest solution is probably the difference of row numbers:
select account_number, outstanding_amount,
min(report_date), max(report_date)
from (select t.*,
row_number() over (partition by account_number order by report_date) as seqnum,
row_number() over (partition by account_number, outstanding_amount order by report_date) as seqnum_o
from t
) t
group by account_number, outstanding_amount, (seqnum - seqnum_o)
order by account_number, min(report_date);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will be able to see how the difference of row numbers defines the adjacent rows with the same amount.
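To see it, here is a sketch that prints the subquery's two row numbers and their difference for the sample account, using Python's built-in sqlite3 module (assuming SQLite 3.25+ for window functions). The dates are rewritten as ISO strings, reading the originals as day/month/year, which is an assumption; the column name diff is made up for this illustration.

```python
import sqlite3

# Sample rows from the question: (account_number, report_date, outstanding_amount)
rows = [
    (1, "2019-01-02", 100), (1, "2019-01-03", 100), (1, "2019-01-06", 200),
    (1, "2019-01-07", 300), (1, "2019-01-10", 200), (1, "2019-01-11", 200),
    (1, "2019-01-12", 100),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table t (account_number int, report_date text, outstanding_amount int)"
)
con.executemany("insert into t values (?, ?, ?)", rows)

query = """
select report_date, outstanding_amount, seqnum, seqnum_o,
       seqnum - seqnum_o as diff
from (
    select t.*,
           row_number() over (partition by account_number
                              order by report_date) as seqnum,
           row_number() over (partition by account_number, outstanding_amount
                              order by report_date) as seqnum_o
    from t
) s
order by report_date
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Within a run of consecutive rows with the same amount, both counters advance by one per row, so their difference stays constant; when another amount interrupts, only seqnum advances and the difference jumps. In this data the 300 row and the second run of 200 happen to share the same difference (3), which is why outstanding_amount has to be part of the GROUP BY alongside the difference.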
To flatten the data, squish it by calculated rank.
select account_number
, min(report_date) as mindate
, max(report_date) as maxdate
, outstanding_amount
from
(
select q1.*
, sum(flag) over (partition by account_number order by report_date) as rnk
from
(
select t.*
, case when outstanding_amount = lag(outstanding_amount, 1) over (partition by account_number order by report_date) then 0 else 1 end as flag
from table1 t
) q1
) q2
group by account_number, outstanding_amount, rnk
order by account_number, mindate;

Aggregation by positive/negative values v.2

I've posted several topics and every query had some problems :( Changed table and examples for better understanding
I have a table called PROD_COST with 5 fields
(ID,Duration,Cost,COST_NEXT,COST_CHANGE).
I need extra field called "groups" for aggregation.
Duration = number of days the price is valid (1 day=1row).
Cost = product price in this day.
Cost_next = lead(cost,1,0).
Cost_change = Cost_next - Cost.
example:
+----+----------+------+-------------+--------+
| ID | Duration | Cost | Cost_change | Groups |
+----+----------+------+-------------+--------+
| 1 | 1 | 10 | -1.5 | 1 |
| 2 | 1 | 8.5 | 3.7 | 2 |
| 3 | 1 | 12.2 | 0 | 2 |
| 4 | 1 | 12.2 | -2.2 | 3 |
| 5 | 1 | 10 | 0 | 3 |
| 6 | 1 | 10 | 3.2 | 4 |
| 7 | 1 | 13.2 | -2.7 | 5 |
| 8 | 1 | 10.5 | -1.5 | 5 |
| 9 | 1 | 9 | 0 | 5 |
| 10 | 1 | 9 | 0 | 5 |
| 11 | 1 | 9 | -1 | 5 |
| 12 | 1 | 8 | 1.5 | 6 |
+----+----------+------+-------------+--------+
Now I need to group (the "Groups" field) by Cost_change. It can have positive, negative, or 0 values.
Some kind guy advised me this query:
select id, COST_CHANGE, sum(GRP) over (order by id asc) +1
from
(
select *, case when sign(COST_CHANGE) != sign(isnull(lag(COST_CHANGE)
over (order by id asc),COST_CHANGE)) and Cost_change!=0 then 1 else 0 end as GRP
from PROD_COST
) X
But there is a problem: if there are 0 values between two positive or two negative values, then it groups them separately, for example:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
| 9.262 | 5777 |
| -9.262 | 5778 |
| 9.262 | 5779 |
| 0.000 | 5779 |
| 9.608 | 5780 |
| -11.231 | 5781 |
| 10.000 | 5782 |
+-------------+--------+
I need to have:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
| 9.262 | 5777 |
| -9.262 | 5778 |
| 9.262 | 5779 |
| 0.000 | 5779 |
| 9.608 | 5779 | -- Here
| -11.231 | 5780 |
| 10.000 | 5781 |
+-------------+--------+
In other words, if there are 0 values between two positive or two negative values, then they should all be in one group, because the sequence MINUS-0-0-MINUS has no rotation. But if I had MINUS-0-0-PLUS, then GROUPS should be 1-1-1-2, because a positive value is alternating with a negative value.
Thank you for attention!
I'm Using Sql Server 2012
I think the best approach is to remove the zeros, do the calculation, and then re-insert them. So:
with pcg as (
-- partition by the sign as well: the row-number difference alone
-- can coincide for neighbouring runs of opposite sign
select pc.*, min(id) over (partition by sign(cost_change), grp) as grpid
from (select pc.*,
(row_number() over (order by id) -
row_number() over (partition by sign(cost_change)
order by id
)
) as grp
from prod_cost pc
where cost_change <> 0
) pc
)
select pc.*, max(pcg.groups) over (order by pc.id) as groups
from prod_cost pc left join
(select pcg.*, dense_rank() over (order by grpid) as groups
from pcg
) pcg
on pc.id = pcg.id;
The CTE assigns a group identifier based on the lowest id in the group, where the groups are bounded by actual sign changes. The subquery turns this into a number. The outer query then accumulates the maximum value, to give a value to the 0 records.
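Here is a sketch of the same three steps run against the seven-row example from the question, using Python's built-in sqlite3 module (assuming SQLite 3.25+ for window functions). Several things differ from the SQL Server query and are assumptions of this illustration: the table holds only id and cost_change, sign() is replaced by a CASE expression so that older SQLite builds work, the steps are written as separate CTEs, the run partition includes the sign, and the output column is called grp_no to stay clear of SQLite's GROUPS keyword.

```python
import sqlite3

# Cost_change values from the question's second example, with ids 1..7
changes = [9.262, -9.262, 9.262, 0.0, 9.608, -11.231, 10.0]

con = sqlite3.connect(":memory:")
con.execute("create table prod_cost (id integer primary key, cost_change real)")
con.executemany(
    "insert into prod_cost values (?, ?)",
    list(enumerate(changes, start=1)),
)

query = """
with nz as (
    -- step 1: drop the zeros and number the remaining rows
    select id,
           case when cost_change > 0 then 1 else -1 end as sgn,
           row_number() over (order by id) as rn_all
    from prod_cost
    where cost_change <> 0
),
runs as (
    -- difference of row numbers is constant within a run of one sign
    select id, sgn,
           rn_all - row_number() over (partition by sgn order by id) as grp
    from nz
),
starts as (
    -- identify each run by its lowest id
    select id, min(id) over (partition by sgn, grp) as grpid
    from runs
),
numbered as (
    -- step 2: turn the run identifiers into consecutive numbers
    select id, dense_rank() over (order by grpid) as grp_no
    from starts
)
-- step 3: put the zeros back; they inherit the latest group number so far
select pc.id, pc.cost_change,
       max(n.grp_no) over (order by pc.id) as grp_no
from prod_cost pc
left join numbered n on n.id = pc.id
order by pc.id
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

The zero row and the positive value after it stay in the same group as the positive value before it, which is the behaviour asked for. A zero that comes before any non-zero row would get NULL, since there is no earlier group to inherit.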