SQL: select data before first occurrence of a certain value - sql

For example, I have order data from customers, like this:
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])\
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()
Each order has a status: 1 means newly created but not finished, and 3 means it has been paid and finished.
Now I want to analyze orders that come from
new customers (who have not made a purchase before)
old customers (who have finished a purchase before)
so I want to add a feature to the data above that marks each order as coming from a new or an old customer.
The logic is: for every customer, every order created up to and including the first order with status 3 counts as coming from a new customer, and every order after that counts as coming from an old customer.
Put another way: select the data before the first occurrence of the value 3 (for each customer's orders, sorted by date ascending).
How can I do this in SQL?
I searched around but didn't find a good solution. In Python, I think I would simply loop over the rows to get the values.

This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
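To check this query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The table name orders comes from the query above; the in-memory database and the helper name label_orders are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the question: (order_id, customer_id, order_status, created_at)
ROWS = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

# Same correlated subquery as the answer, reduced to the two columns we inspect.
QUERY = """
SELECT order_id,
       CASE
           WHEN order_id > (SELECT MIN(order_id) FROM orders
                            WHERE customer_id = o.customer_id AND order_status = 3)
           THEN 'old'
           ELSE 'new'
       END AS customer_status
FROM orders o
ORDER BY order_id
"""


def label_orders(rows):
    """Load rows into an in-memory table named orders and return {order_id: label}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
        "order_status INTEGER, created_at TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


labels = label_orders(ROWS)
print(labels)
```

Customer 3 never reaches status 3, so the subquery returns NULL and all of that customer's orders fall into the ELSE branch as 'new'. The query compares order_id rather than created_at, so it only matches the question's logic when order ids increase with time.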

You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when order_status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by. Note that, unlike the window-function version, this labels customers with no status-3 order as "old":
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t2.customer_id, min(t2.created_at) as min_created_at,
'new' as customer_status
from test t2
where t2.order_status = 3
group by t2.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
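As a rough check of the join version, the sketch below runs it in SQLite through Python's sqlite3 module rather than Spark. It assumes the column is named order_status, as in the question's DataFrame, and that the derived table reads from test; the helper name label_by_join is hypothetical.

```python
import sqlite3

# Sample rows from the question: (order_id, customer_id, order_status, created_at)
ROWS = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

# Orders on or before the customer's first status-3 date join to t3 and get 'new';
# everything else has no match and is coalesced to 'old'.
QUERY = """
SELECT t.order_id, COALESCE(t3.customer_status, 'old') AS customer_status
FROM test t
LEFT JOIN (SELECT customer_id, MIN(created_at) AS min_created_at,
                  'new' AS customer_status
           FROM test
           WHERE order_status = 3
           GROUP BY customer_id) t3
  ON t.customer_id = t3.customer_id AND t.created_at <= t3.min_created_at
ORDER BY t.order_id
"""


def label_by_join(rows):
    """Return {order_id: label} using the join / group by formulation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test (order_id INTEGER, customer_id INTEGER, "
        "order_status INTEGER, created_at TEXT)"
    )
    conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


join_labels = label_by_join(ROWS)
print(join_labels)
```

Two differences from the other approaches show up on this data: customer 3, who has no status-3 order, comes out as 'old', and order 6 is 'new' because it shares its created_at date with the first status-3 order. If created_at carried a time component, the tie would not arise.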

Related

sql that finds records within 3 days of a condition being met

I am trying to find all records that exist within a date range prior to an event occurring. In my table below, I want to pull all records that are 3 days or less from when the switch field changes from 0 to 1, ordered by date, partitioned by product. My solution does not work: it includes the first record, which it should skip because it is outside the 3-day window. I am scanning a table with millions of records; is there a way to reduce the complexity/cost while maintaining my desired results?
http://sqlfiddle.com/#!18/eebe7
CREATE TABLE productlist
([product] varchar(13), [switch] int, [switchday] date)
;
INSERT INTO productlist
([product], [switch], [switchday])
VALUES
('a', 0, '2019-12-28'),
('a', 0, '2020-01-02'),
('a', 1, '2020-01-03'),
('a', 0, '2020-01-06'),
('a', 0, '2020-01-07'),
('a', 1, '2020-01-09'),
('a', 1, '2020-01-10'),
('a', 1, '2020-01-11'),
('b', 1, '2020-01-01'),
('b', 0, '2020-01-02'),
('b', 0, '2020-01-03'),
('b', 1, '2020-01-04')
;
my solution:
with switches as (
SELECT
*,
case when lead(switch) over (partition by product order by switchday)=1
and switch=0 then 'first day switch'
else null end as leadswitch
from productlist
),
switchdays as (
select * from switches
where leadswitch='first day switch'
)
select pl.*
,'lead'
from productlist pl
left join switchdays ss
on pl.product=ss.product
and pl.switchday = ss.switchday
and datediff(day, pl.switchday, ss.switchday)<=3
where pl.switch=0
Desired output, capturing records that occur within 3 days of a switch going from 0 to 1, for each product, ordered by date:
product  switch  switchday
a        0       2020-01-02   lead
a        0       2020-01-06   lead
a        0       2020-01-07   lead
b        0       2020-01-02   lead
b        0       2020-01-03   lead
If I understand correctly, you can just use lead() twice:
select pl.*
from (select pl.*,
lead(switch) over (partition by product order by switchday) as next_switch_1,
lead(switch, 2) over (partition by product order by switchday) as next_switch_2
from productlist pl
) pl
where switch = 0 and
1 in (next_switch_1, next_switch_2);
Here is a db<>fiddle.
EDIT (based on comment):
select pl.*
from (select pl.*,
min(case when switch = 1 then switchday end) over (partition by product order by switchday desc) as next_switch_1_day
from productlist pl
) pl
where switch = 0 and
next_switch_1_day <= dateadd(day, 2, switchday);
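The same idea (for each switch = 0 row, look up the next date on which switch is 1 and compare the gap) can be sketched in plain Python to check it against the question's desired output. The function name rows_before_switch and the max_days parameter are assumptions; max_days defaults to 3 to follow the question's wording, whereas the SQL above adds 2 days.

```python
from datetime import date

# (product, switch, switchday) rows from the question
ROWS = [
    ("a", 0, date(2019, 12, 28)), ("a", 0, date(2020, 1, 2)),
    ("a", 1, date(2020, 1, 3)), ("a", 0, date(2020, 1, 6)),
    ("a", 0, date(2020, 1, 7)), ("a", 1, date(2020, 1, 9)),
    ("a", 1, date(2020, 1, 10)), ("a", 1, date(2020, 1, 11)),
    ("b", 1, date(2020, 1, 1)), ("b", 0, date(2020, 1, 2)),
    ("b", 0, date(2020, 1, 3)), ("b", 1, date(2020, 1, 4)),
]


def rows_before_switch(rows, max_days=3):
    """Keep switch = 0 rows whose next switch = 1 row is at most max_days later."""
    by_product = {}
    for product, switch, day in rows:
        by_product.setdefault(product, []).append((day, switch))

    kept = []
    for product in sorted(by_product):
        next_on = None  # nearest later date with switch = 1, if any
        hits = []
        # Walk backwards in time so next_on always holds the next switch-on date.
        for day, switch in sorted(by_product[product], reverse=True):
            if switch == 1:
                next_on = day
            elif next_on is not None and (next_on - day).days <= max_days:
                hits.append((product, switch, day))
        kept.extend(reversed(hits))  # restore ascending date order
    return kept


kept = rows_before_switch(ROWS)
for row in kept:
    print(row)
```

With max_days=3 this returns the five rows listed in the desired output. With max_days=2 the row for product a on 2020-01-06 drops out, because the next switch is 3 days later.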

Sum last two records including last record of a group

In SQL Server 2017, how do I sum the last two records and show the last record in a single query?
CREATE TABLE Billing
(
Customer CHAR(12),
Month INT,
Amount INT
)
GO
INSERT INTO Billing VALUES ('AAAA', 3, 5)
INSERT INTO Billing VALUES ('AAAA', 2, 0)
INSERT INTO Billing VALUES ('AAAA', 1, 2)
INSERT INTO Billing VALUES ('BBBB', 10, 0)
INSERT INTO Billing VALUES ('BBBB', 12, 1)
INSERT INTO Billing VALUES ('BBBB', 11, 0)
INSERT INTO Billing VALUES ('BBBB', 13, 6)
Expected output:
Customer Total Last 2 Bills Last Bill
-----------------------------------------
AAAA 5 5
BBBB 7 6
I tried using SUM with LAST_VALUE with ORDER BY
You can filter out rows by using the ROW_NUMBER() window function, as in:
select
customer,
sum(amount) as total_last_2_bills,
sum(case when rn = 1 then amount else 0 end) as last_bill
from (
select
*,
row_number() over (partition by customer order by month desc) as rn
from billing
) x
where rn <= 2
group by customer
See SQL Fiddle.
You can use window functions:
select customer, (prev_amount + amount) as total_last_2_bills, amount as last_bill
from (select b.*,
lag(amount) over (partition by customer order by month) as prev_amount,
lead(month) over (partition by customer order by month) as next_month
from billing b
) b
where next_month is null;
If you want to ignore values of 0, then filter:
select customer, (coalesce(prev_amount, 0) + amount) as total_last_2_bills, amount as last_bill
from (select b.*,
lag(amount) over (partition by customer order by month) as prev_amount,
lead(month) over (partition by customer order by month) as next_month
from billing b
where amount <> 0
) b
where next_month is null;
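For comparison, the "latest two rows per customer" logic of the ROW_NUMBER() answer can be sketched in plain Python and checked against the expected output from the question. The function name last_two_summary is an assumption for illustration.

```python
from collections import defaultdict

# (Customer, Month, Amount) rows from the question
BILLING = [
    ("AAAA", 3, 5), ("AAAA", 2, 0), ("AAAA", 1, 2),
    ("BBBB", 10, 0), ("BBBB", 12, 1), ("BBBB", 11, 0), ("BBBB", 13, 6),
]


def last_two_summary(rows):
    """Return {customer: (sum of the latest two amounts, latest amount)}."""
    by_customer = defaultdict(list)
    for customer, month, amount in rows:
        by_customer[customer].append((month, amount))

    summary = {}
    for customer, bills in by_customer.items():
        # Sorting by month descending and slicing plays the role of rn <= 2.
        latest = sorted(bills, reverse=True)[:2]
        amounts = [amount for _month, amount in latest]
        summary[customer] = (sum(amounts), amounts[0])
    return summary


summary = last_two_summary(BILLING)
print(summary)
```

A customer with a single bill still gets a result here, since the slice simply returns one row. That matches the ROW_NUMBER() query, while the plain lag() version would yield NULL for the total in that case.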

Find date of most recent overdue

I have the following problem: from the table of pays and dues, I need to find the date of the last overdue. Here is the table and data for example:
create table t (
Id int
, [date] date
, Customer varchar(6)
, Deal varchar(6)
, Currency varchar(3)
, [Sum] int
);
insert into t values
(1, '2017-12-12', '1110', '111111', 'USD', 12000)
, (2, '2017-12-25', '1110', '111111', 'USD', 5000)
, (3, '2017-12-13', '1110', '122222', 'USD', 10000)
, (4, '2018-01-13', '1110', '111111', 'USD', -10100)
, (5, '2017-11-20', '2200', '222221', 'USD', 25000)
, (6, '2017-12-20', '2200', '222221', 'USD', 20000)
, (7, '2017-12-31', '2201', '222221', 'USD', -10000)
, (8, '2017-12-29', '1110', '122222', 'USD', -10000)
, (9, '2017-11-28', '2201', '222221', 'USD', -30000);
If the value of "Sum" is positive - it means overdue has begun; if "Sum" is negative - it means someone paid on this Deal.
In the example above on Deal '122222' overdue starts at 2017-12-13 and ends on 2017-12-29, so it shouldn't be in the result.
And for the Deal '222221' the first overdue of 25000, which started on 2017-11-20, was completely paid on 2017-11-28, so the last date of current overdue (we are interested in) is 2017-12-31
I've made this selection to sum up all the payments, and stuck here :(
WITH cte AS (
SELECT *,
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
Apparently I need to find (for each Deal) the minimum of the dates if there is no 0 or negative Debt_balance, and otherwise the next date after the last 0 balance.
I would be grateful for any tips and ideas on the subject.
Thanks!
UPDATE
My version of solution:
WITH cte AS (
SELECT ROW_NUMBER() OVER (ORDER BY Deal, [Date]) id,
Deal, [Date], [Sum],
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
SELECT a.Deal,
SUM(a.Sum) AS NET_Debt,
isnull(max(b.date), min(a.date)),
datediff(day, isnull(max(b.date), min(a.date)), getdate())
FROM cte as a
LEFT OUTER JOIN cte AS b
ON a.Deal = b.Deal AND a.Debt_balance <= 0 AND b.Id=a.Id+1
GROUP BY a.Deal
HAVING SUM(a.Sum) > 0
I believe you are trying to use running sum and keep track of when it changes to positive, and it can change to positive multiple times and you want the last date at which it became positive. You need LAG() in addition to running sum:
WITH cte1 AS (
-- running balance column
SELECT *
, SUM([Sum]) OVER (PARTITION BY Deal ORDER BY [Date], Id) AS RunningBalance
FROM t
), cte2 AS (
-- overdue begun column - set whenever running balance changes from l.t.e. zero to g.t. zero
SELECT *
, CASE WHEN LAG(RunningBalance, 1, 0) OVER (PARTITION BY Deal ORDER BY [Date], Id) <= 0 AND RunningBalance > 0 THEN 1 END AS OverdueBegun
FROM cte1
)
-- eliminate groups that are paid i.e. sum = 0
SELECT Deal, MAX(CASE WHEN OverdueBegun = 1 THEN [Date] END) AS RecentOverdueDate
FROM cte2
GROUP BY Deal
HAVING SUM([Sum]) <> 0
Demo on db<>fiddle
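The running-balance logic of this answer can be traced in plain Python on the question's data. This is only a sketch of the same steps (running sum per Deal ordered by date and Id, a flag whenever the balance goes from zero or below to above zero, and dropping deals whose total is 0); the function name recent_overdue_dates is hypothetical.

```python
from collections import defaultdict

# (Id, date, Customer, Deal, Currency, Sum) rows from the question
ROWS = [
    (1, "2017-12-12", "1110", "111111", "USD", 12000),
    (2, "2017-12-25", "1110", "111111", "USD", 5000),
    (3, "2017-12-13", "1110", "122222", "USD", 10000),
    (4, "2018-01-13", "1110", "111111", "USD", -10100),
    (5, "2017-11-20", "2200", "222221", "USD", 25000),
    (6, "2017-12-20", "2200", "222221", "USD", 20000),
    (7, "2017-12-31", "2201", "222221", "USD", -10000),
    (8, "2017-12-29", "1110", "122222", "USD", -10000),
    (9, "2017-11-28", "2201", "222221", "USD", -30000),
]


def recent_overdue_dates(rows):
    """Return {deal: date on which the running balance last turned positive}."""
    by_deal = defaultdict(list)
    for row_id, day, _customer, deal, _currency, amount in rows:
        by_deal[deal].append((day, row_id, amount))

    result = {}
    for deal, entries in by_deal.items():
        balance = 0
        last_start = None
        # ISO date strings sort chronologically; row_id breaks ties like ORDER BY [Date], Id.
        for day, _row_id, amount in sorted(entries):
            previous = balance
            balance += amount
            if previous <= 0 < balance:  # same test as the OverdueBegun flag
                last_start = day
        if balance != 0:  # mirrors HAVING SUM([Sum]) <> 0
            result[deal] = last_start
    return result


overdue = recent_overdue_dates(ROWS)
print(overdue)
```

On this data Deal 122222 is dropped because it nets to zero, Deal 111111 reports 2017-12-12, and Deal 222221 reports 2017-12-20, the date the balance became positive again after the 2017-11-28 payment.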
You can use window functions. These can calculate intermediate values:
Last day when the sum is negative (i.e. last "good" record).
Last sum
Then you can combine these:
select deal, min(date) as last_overdue_start_date
from (select t.*,
first_value(sum) over (partition by deal order by date desc) as last_sum,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where last_sum > 0 and date > max_date_neg
group by deal;
Actually, the value on the last date is not necessary. So this simplifies to:
select deal, min(date) as last_overdue_start_date
from (select t.*,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where date > max_date_neg
group by deal;

Use recursive CTE to handle date logic

At work, one of my assignments is to calculate commission to the sales staff. One rule has been more challenging than the others.
Two sales teams, A and B, work together, each selling different products. Team A can send leads to team B. The same customer might be sent multiple times. The first time a customer is sent (e.g. lead 1)*, a commission is paid to the salesperson in team A who created the lead. The customer is then "locked" for the next 365 days (counting from the date lead 1 was created), meaning that no one can get additional commission for that customer in that period by sending additional leads (e.g. leads 2 and 3 get no commission). After the 365 days have expired, a new lead can be created and receive commission (e.g. lead 4). Then the customer is locked again for 365 days, counting from the day lead 4 was created. Therefore, lead 5 gets no commission. The tricky part is resetting the date that the 365 days are counted from.
* See the tables #LEADS and #DISERED_RESULT.
I have solved the problem in T-SQL using a cursor, but I wonder whether it is possible to use a recursive CTE instead. I have made several attempts; the best one is pasted below. The problem with my solution is that I refer to the recursive table more than once. I have tried to fix that by nesting a CTE inside a CTE, which is not allowed. I have tried using a temporary table inside the CTE, which is not allowed either. I tried several times to recode the recursive part of the CTE so that the recursive table is referenced only once, but then I am not able to get the logic to work.
I am using SQL Server 2008
IF OBJECT_ID('tempdb.dbo.#LEADS', 'U') IS NOT NULL
DROP TABLE #LEADS;
CREATE TABLE #LEADS (LEAD_ID INT, CUSTOMER_ID INT, LEAD_CREATED_DATE DATETIME, SALESPERSON_NAME varchar(20))
INSERT INTO #LEADS
VALUES (1, 1, '2013-09-01', 'Rasmus')
,(2, 1, '2013-11-01', 'Christian')
,(3, 1, '2014-01-01', 'Nadja')
,(4, 1, '2014-12-24', 'Roar')
,(5, 1, '2015-12-01', 'Kristian')
,(6, 2, '2014-01-05', 'Knud')
,(7, 2, '2015-01-02', 'Rasmus')
,(8, 2, '2015-01-08', 'Roar')
,(9, 2, '2016-02-05', 'Kristian')
,(10, 2, '2016-03-05', 'Casper')
SELECT *
FROM #LEADS;
IF OBJECT_ID('tempdb.dbo.#DISERED_RESULT', 'U') IS NOT NULL
DROP TABLE #DISERED_RESULT;
CREATE TABLE #DISERED_RESULT (LEAD_ID INT, DESIRED_COMMISION_RESULT CHAR(3))
INSERT INTO #DISERED_RESULT
VALUES (1, 'YES')
,(2, 'NO')
,(3, 'NO')
,(4, 'YES')
,(5, 'NO')
,(6, 'YES')
,(7, 'NO')
,(8, 'YES')
,(9, 'YES')
,(10, 'NO')
SELECT *
FROM #DISERED_RESULT;
WITH COMMISSION_CALCULATION AS
(
SELECT T1.*
,COMMISSION = 'YES'
,MIN_LEAD_CREATED_DATE AS COMMISSION_DATE
FROM #LEADS AS T1
INNER JOIN (
SELECT A.CUSTOMER_ID
,MIN(A.LEAD_CREATED_DATE) AS MIN_LEAD_CREATED_DATE
FROM #LEADS AS A
GROUP BY A.CUSTOMER_ID
) AS T2 ON T1.CUSTOMER_ID = T2.CUSTOMER_ID AND T1.LEAD_CREATED_DATE = T2.MIN_LEAD_CREATED_DATE
UNION ALL
SELECT T10.LEAD_ID
,T10.CUSTOMER_ID
,T10.LEAD_CREATED_DATE
,T10.SALESPERSON_NAME
,T10.COMMISSION
,T10.COMMISSION_DATE
FROM (SELECT ROW_NUMBER() OVER(PARTITION BY T5.CUSTOMER_ID ORDER BY T5.LEAD_CREATED_DATE ASC) AS RN
,T5.*
,T6.MAX_COMMISSION_DATE
,DATEDIFF(DAY, T6.MAX_COMMISSION_DATE, T5.LEAD_CREATED_DATE) AS ANTAL_DAGE_SIDEN_SIDSTE_COMMISSION
,CASE
WHEN DATEDIFF(DAY, T6.MAX_COMMISSION_DATE, T5.LEAD_CREATED_DATE) > 365 THEN 'YES'
ELSE 'NO'
END AS COMMISSION
,CASE
WHEN DATEDIFF(DAY, T6.MAX_COMMISSION_DATE, T5.LEAD_CREATED_DATE) > 365 THEN T5.LEAD_CREATED_DATE
ELSE NULL
END AS COMMISSION_DATE
FROM #LEADS AS T5
INNER JOIN (SELECT T4.CUSTOMER_ID
,MAX(T4.COMMISSION_DATE) AS MAX_COMMISSION_DATE
FROM COMMISSION_CALCULATION AS T4
GROUP BY T4.CUSTOMER_ID) AS T6 ON T5.CUSTOMER_ID = T6.CUSTOMER_ID
WHERE T5.LEAD_ID NOT IN (SELECT LEAD_ID FROM COMMISSION_CALCULATION)
) AS T10
WHERE RN = 1
)
SELECT *
FROM COMMISSION_CALCULATION;
I have made some assumptions where your description does not fully make sense as written, but the below achieves your desired result:
if object_id('tempdb.dbo.#leads', 'u') is not null
drop table #leads;
create table #leads (lead_id int, customer_id int, lead_created_date datetime, salesperson_name varchar(20))
insert into #leads
values (1, 1, '2013-09-01', 'rasmus')
,(2, 1, '2013-11-01', 'christian')
,(3, 1, '2014-01-01', 'nadja')
,(4, 1, '2014-12-24', 'roar')
,(5, 1, '2015-12-01', 'kristian')
,(6, 2, '2014-01-05', 'knud')
,(7, 2, '2015-01-02', 'rasmus')
,(8, 2, '2015-01-08', 'roar')
,(9, 2, '2016-02-05', 'kristian')
,(10, 2, '2016-03-05', 'casper')
if object_id('tempdb.dbo.#disered_result', 'u') is not null
drop table #disered_result;
create table #disered_result (lead_id int, desired_commision_result char(3))
insert into #disered_result
values (1, 'yes'),(2, 'no'),(3, 'no'),(4, 'yes'),(5, 'no'),(6, 'yes'),(7, 'no'),(8, 'yes'),(9, 'yes'),(10, 'no')
with rownum
as
(
select row_number() over (order by customer_id, lead_created_date) as rn -- This is to ensure an incremental ordering id
,lead_id
,customer_id
,lead_created_date
,salesperson_name
from #leads
)
,cte
as
(
select rn
,lead_id
,customer_id
,lead_created_date
,salesperson_name
,'yes' as commission_result
,lead_created_date as commission_window_start
from rownum
where rn = 1
union all
select r.rn
,r.lead_id
,r.customer_id
,r.lead_created_date
,r.salesperson_name
,case when r.customer_id <> c.customer_id -- If the customer ID has changed, we are at a new commission window.
then 'yes'
else case when r.lead_created_date > dateadd(d,365,c.commission_window_start) -- This assumes the window is 365 days and not one year (ie. Leap years don't add a day)
then 'yes'
else 'no'
end
end as commission_result
,case when r.customer_id <> c.customer_id
then r.lead_created_date
else case when r.lead_created_date > dateadd(d,365,c.commission_window_start) -- This assumes the window is 365 days and not one year (ie. Leap years don't add a day)
then r.lead_created_date
else c.commission_window_start
end
end as commission_window_start
from rownum r
inner join cte c
on(r.rn = c.rn+1)
)
select lead_id
,commission_result
from cte
order by customer_id
,lead_created_date;
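The rule the recursive CTE implements (carry the current window start forward row by row, and reset it whenever a lead falls more than 365 days after it) can be sketched as a simple loop in Python, which makes it easy to compare against the desired result. The function name commission_flags and the lock_days parameter are assumptions for illustration; like the SQL, it treats the window as 365 days rather than one calendar year.

```python
from datetime import date, timedelta

# (lead_id, customer_id, lead_created_date) rows from #LEADS
LEADS = [
    (1, 1, date(2013, 9, 1)), (2, 1, date(2013, 11, 1)),
    (3, 1, date(2014, 1, 1)), (4, 1, date(2014, 12, 24)),
    (5, 1, date(2015, 12, 1)), (6, 2, date(2014, 1, 5)),
    (7, 2, date(2015, 1, 2)), (8, 2, date(2015, 1, 8)),
    (9, 2, date(2016, 2, 5)), (10, 2, date(2016, 3, 5)),
]


def commission_flags(leads, lock_days=365):
    """Return {lead_id: 'YES' or 'NO'} using a per-customer lock window."""
    flags = {}
    window_start = {}  # customer_id -> date the current lock window began
    # Process each customer's leads in date order, like the rn ordering in the CTE.
    for lead_id, customer_id, created in sorted(leads, key=lambda l: (l[1], l[2])):
        start = window_start.get(customer_id)
        if start is None or created > start + timedelta(days=lock_days):
            flags[lead_id] = "YES"
            window_start[customer_id] = created  # reset the window
        else:
            flags[lead_id] = "NO"
    return flags


flags = commission_flags(LEADS)
print(flags)
```

On the sample leads this reproduces the values in #DISERED_RESULT, including lead 8, which earns commission because it falls 368 days after lead 6.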

Count previous consecutive rows in SQL Server

I have a list of attendance data, shown below. I am trying to query a specific date range (01/05/2016 to 07/05/2016) with a Total Present column, which should be calculated from the preceding present (P) records. Suppose today is 04/05/2016: if a person has status 'P' on the 01, 02, 03 and 04, the row for 04-05-2016 should show a total present of 4.
Could you help me find the total present from this result set?
You can check this example, which has the logic to calculate the previous sum value.
declare @t table (employeeid int, datecol date, status varchar(2) )
insert into @t values (10001, '01-05-2016', 'P'),
(10001, '02-05-2016', 'P'),
(10001, '03-05-2016', 'P'),
(10001, '04-05-2016', 'P'),
(10001, '05-05-2016', 'A'),
(10001, '06-05-2016', 'P'),
(10001, '07-05-2016', 'P'),
(10001, '08-05-2016', 'L'),
(10002, '07-05-2016', 'P'),
(10002, '08-05-2016', 'L')
--select * from @t
select * ,
SUM(case when status = 'P' then 1 else 0 end) OVER (PARTITION BY employeeid ORDER BY employeeid, datecol
ROWS BETWEEN UNBOUNDED PRECEDING
AND current row) as TotalPresent
from
@t
Another twist on the same thing via a CTE (since you mentioned SQL Server 2012: this solution only works in SQL Server 2012 and above)
;with cte as
(
select employeeid , datecol , ROW_NUMBER() over(partition by employeeid order by employeeid, datecol) rowno
from
@t where status = 'P'
)
select t.*, cte.rowno ,
case when ( isnull(cte.rowno, 0) = 0)
then LAG(cte.rowno) OVER (ORDER BY t.employeeid, t.datecol)
else cte.rowno
end LagValue
from @t t left join cte on t.employeeid = cte.employeeid and t.datecol = cte.datecol
order by t.employeeid, t.datecol
You could use a subquery to calculate TotalPresent for each row:
SELECT
main.EmployeeID,
main.[Date],
main.[Status],
(
SELECT SUM(CASE WHEN t.[Status] = 'P' THEN 1 ELSE 0 END)
FROM [TableName] t
WHERE t.EmployeeID = main.EmployeeID AND t.[Date] <= main.[Date]
) as TotalPresent
FROM [TableName] main
ORDER BY
main.EmployeeID,
main.[Date]
Here I used a subquery to sum over the records that have the same EmployeeID and a date less than or equal to the date of the current row. Each record with status 'P' adds 1 to the sum and every other record adds 0, so only records with status P are counted.
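To see what the correlated subquery produces, here is a minimal sketch that runs the same pattern in SQLite through Python's sqlite3 module. The table name attendance, the column names employee_id, day and status, and the ISO date format are assumptions standing in for [TableName] and the original columns; the rows are the sample values from the earlier answer.

```python
import sqlite3

# (employee_id, day, status); dates rewritten as ISO strings so they compare correctly
ROWS = [
    (10001, "2016-05-01", "P"), (10001, "2016-05-02", "P"),
    (10001, "2016-05-03", "P"), (10001, "2016-05-04", "P"),
    (10001, "2016-05-05", "A"), (10001, "2016-05-06", "P"),
    (10001, "2016-05-07", "P"), (10001, "2016-05-08", "L"),
    (10002, "2016-05-07", "P"), (10002, "2016-05-08", "L"),
]

# For each row, count the 'P' rows of the same employee up to and including that day.
QUERY = """
SELECT main.employee_id, main.day, main.status,
       (SELECT SUM(CASE WHEN t.status = 'P' THEN 1 ELSE 0 END)
        FROM attendance t
        WHERE t.employee_id = main.employee_id AND t.day <= main.day) AS total_present
FROM attendance main
ORDER BY main.employee_id, main.day
"""


def total_present(rows):
    """Return the query result as a list of (employee_id, day, status, total_present)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE attendance (employee_id INTEGER, day TEXT, status TEXT)")
    conn.executemany("INSERT INTO attendance VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


present = total_present(ROWS)
for row in present:
    print(row)
```

The total is cumulative: it pauses on an 'A' or 'L' day and continues afterwards, so employee 10001 reaches 6 by 2016-05-07. It does not reset after an absence, so it would need extra grouping if only consecutive 'P' days should count.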
Interesting question, this should work:
select *
, (select count(retail) from p g
where g.date <= p.date and g.id = p.id and retail = 'P')
from p
order by ID, Date;
So I believe I understand correctly. You would like to count the occurrences of P per ID, date-wise.
This makes a lot of sense. That is why the first occurrence of ID2 was L and the Total is 0. This query will count P status for each occurrence, pause at non-P for each ID.