I have a table with more than 5 million rows of sales transactions. I would like to find the sum of the date intervals between each customer's three most recent purchases.
Suppose my table looks like this :
CustomerID ProductID ServiceStartDate ServiceExpiryDate
A X1 2010-01-01 2010-06-01
A X2 2010-08-12 2010-12-30
B X4 2011-10-01 2012-01-15
B X3 2012-04-01 2012-06-01
B X7 2012-08-01 2013-10-01
A X5 2013-01-01 2015-06-01
The result that I'm looking for might look like this:
CustomerID IntervalDays
A 802
B 135
I know the query needs to first retrieve the 3 most recent transactions of each customer (based on ServiceStartDate) and then calculate the interval between the StartDate and ExpiryDate of his/her transactions.
You want to calculate the difference between the previous row's ServiceExpiryDate and the current row's ServiceStartDate based on descending dates and then sum up the last two differences:
with cte as
(
select tab.*,
row_number()
over (partition by customerId
order by ServiceStartDate desc
, ServiceExpiryDate desc -- don't know if this 2nd column is necessary
) as rn
from tab
)
select t2.customerId,
sum(datediff(day, t1.ServiceExpiryDate, t2.ServiceStartDate)) as Intervaldays -- t1 is the previous purchase
,count(*) as purchases
from cte as t2 left join cte as t1
on t1.customerId = t2.customerId
and t1.rn = t2.rn+1 -- previous and current row
where t2.rn <= 3 -- last three rows
group by t2.customerId;
Same result using LEAD:
with cte as
(
select tab.*,
row_number()
over (partition by customerId
order by ServiceStartDate desc) as rn
,lead(ServiceExpiryDate)
over (partition by customerId
order by ServiceStartDate desc
) as prevEnd
from tab
)
select customerId,
sum(datediff(day, prevEnd, ServiceStartDate)) as Intervaldays
,count(*) as purchases
from cte
where rn <= 3
group by customerId;
Neither query returns the expected result unless you subtract purchases (or max(rn)) from Intervaldays. But since only two differences are summed, that does not seem correct to me either...
Additional logic must be applied based on your rules regarding:
customer has less than 3 purchases
overlapping intervals
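To see where the difference from the expected result comes from, here is a minimal Python sketch (not part of the answer above; the name gap_days is made up for illustration). It sums the gaps between one purchase's ServiceExpiryDate and the next purchase's ServiceStartDate over each customer's three most recent rows, using the sample data from the question. It assumes no overlapping intervals.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question: (CustomerID, ServiceStartDate, ServiceExpiryDate)
rows = [
    ("A", date(2010, 1, 1), date(2010, 6, 1)),
    ("A", date(2010, 8, 12), date(2010, 12, 30)),
    ("B", date(2011, 10, 1), date(2012, 1, 15)),
    ("B", date(2012, 4, 1), date(2012, 6, 1)),
    ("B", date(2012, 8, 1), date(2013, 10, 1)),
    ("A", date(2013, 1, 1), date(2015, 6, 1)),
]

def gap_days(rows, last_n=3):
    """Sum the gaps between consecutive purchases among the last_n most recent ones."""
    totals = {}
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # by customer, then start date
    for customer, group in groupby(ordered, key=lambda r: r[0]):
        recent = list(group)[-last_n:]
        # Pair each purchase with the next one: gap = next start - current expiry
        totals[customer] = sum(
            (cur[1] - prev[2]).days for prev, cur in zip(recent, recent[1:])
        )
    return totals

print(gap_days(rows))  # {'A': 805, 'B': 138}
```

The totals are 3 higher than the expected 802 and 135, which matches the remark about subtracting the number of purchases.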
Assuming there are no overlaps, I think you want this:
select customerId,
sum(datediff(day, ServiceStartDate, ServiceExpiryDate)) as Intervaldays
from (select t.*, row_number() over (partition by customerId
order by ServiceStartDate desc) as seqnum
from table t
) t
where seqnum <= 3
group by customerId;
Try this:
SELECT dt.CustomerID,
SUM(DATEDIFF(DAY, dt.PrevExpiry, dt.ServiceStartDate)) As IntervalDays
FROM (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ServiceStartDate DESC) AS rn
, (SELECT Max(ti.ServiceExpiryDate)
FROM yourTable ti
WHERE t.CustomerID = ti.CustomerID
AND ti.ServiceStartDate < t.ServiceStartDate) As PrevExpiry
FROM yourTable t )dt
WHERE dt.rn <= 3 -- keep only the three most recent purchases
GROUP BY dt.CustomerID
Result will be:
CustomerId | IntervalDays
-----------+--------------
A | 805
B | 138
Here is an example:
Id|price|Date
1|2|2022-05-21
1|3|2022-06-15
1|2.5|2022-06-19
Needs to look like this:
Id|Date|price
1|2022-05-21|2
1|2022-05-22|2
1|2022-05-23|2
...
1|2022-06-15|3
1|2022-06-16|3
1|2022-06-17|3
1|2022-06-18|3
1|2022-06-19|2.5
1|2022-06-20|2.5
...
Until today
1|2022-08-30|2.5
I tried using lag(price) over (partition by id order by date), but I can't get it right.
I'm not familiar with Azure, but it looks like you need to use a calendar table, or generate missing dates using a recursive CTE.
To get started with a recursive CTE, you can generate row numbers for each id (assuming multiple id values) in the source data, ordered by date. The rows with row number equal to 1 (those with the minimum date value for the corresponding id) are used as the starting point for the recursion. Then you can use the DATEADD function to generate the row for the next day. To use the price values from the original data, you can use a subquery to get the price for this new date, and if there is no such value (no row for this date), use the previous price value from the CTE (use the COALESCE function for this).
For SQL Server, the query can look like this:
WITH cte AS (
SELECT
id,
date,
price
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
FROM tbl
) t
WHERE rn = 1
UNION ALL
SELECT
cte.id,
DATEADD(d, 1, cte.date),
COALESCE(
(SELECT tbl.price
FROM tbl
WHERE tbl.id = cte.id AND tbl.date = DATEADD(d, 1, cte.date)),
cte.price
)
FROM cte
WHERE DATEADD(d, 1, cte.date) <= GETDATE()
)
SELECT * FROM cte
ORDER BY id, date
OPTION (MAXRECURSION 0)
Note that I added OPTION (MAXRECURSION 0) to remove the recursion limit: the default limit is 100 levels, which is not enough to generate every date in the range.
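The forward-fill logic of the recursive CTE can also be sketched outside the database. The Python below is only an illustration under my own assumptions (the name fill_daily is hypothetical, and it handles a single id): it starts at the earliest date, steps one day at a time, and keeps the previous price whenever no row exists for that day, which is what the COALESCE does in the query.

```python
from datetime import date, timedelta

def fill_daily(prices, end):
    """Yield (day, price) for every day from the first known date up to end.

    prices maps date -> price for one id; days without a row reuse the last price.
    """
    day = min(prices)
    current = prices[day]
    while day <= end:
        current = prices.get(day, current)  # same role as COALESCE(subquery, cte.price)
        yield day, current
        day += timedelta(days=1)

# Sample rows from the question for id = 1
prices = {date(2022, 5, 21): 2, date(2022, 6, 15): 3, date(2022, 6, 19): 2.5}
filled = dict(fill_daily(prices, end=date(2022, 6, 20)))
print(filled[date(2022, 6, 14)], filled[date(2022, 6, 15)], filled[date(2022, 6, 20)])
```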
The same approach for MySQL (you need MySQL version 8.0 or later to use CTEs):
WITH RECURSIVE cte AS (
SELECT
id,
date,
price
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
FROM tbl
) t
WHERE rn = 1
UNION ALL
SELECT
cte.id,
DATE_ADD(cte.date, interval 1 day),
COALESCE(
(SELECT tbl.price
FROM tbl
WHERE tbl.id = cte.id AND tbl.date = DATE_ADD(cte.date, interval 1 day)),
cte.price
)
FROM cte
WHERE DATE_ADD(cte.date, interval 1 day) <= NOW()
)
SELECT * FROM cte
ORDER BY id, date
Both queries produce the same results; the only difference is the use of each engine's specific date functions.
For MySQL versions below 8.0, you can use a calendar table, since without CTE support you can't generate the required date range in the query itself.
Assuming the calendar table has a column that stores date values (let's call it date for simplicity), you can use CROSS JOIN to pair each id value in your table with the calendar dates, starting from that id's earliest date. Then you can use a subquery to get the latest price stored for that date or before it.
So the query would look like this:
SELECT
d.id,
d.date,
(SELECT
price
FROM tbl
WHERE tbl.id = d.id AND tbl.date <= d.date
ORDER BY tbl.date DESC
LIMIT 1
) price
FROM (
SELECT
t.id,
c.date
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM tbl) t
WHERE c.date BETWEEN (
SELECT
MIN(date) min_date
FROM tbl
WHERE tbl.id = t.id
)
AND NOW()
) d
ORDER BY id, date
Using my pseudo-calendar table with date values ranging from 2022-05-20 to 2022-05-30 and source data in that range, like so
id|price|date
1|2|2022-05-21
1|3|2022-05-25
1|2.5|2022-05-28
2|10|2022-05-25
2|100|2022-05-30
the query produces the following results
id|date|price
1|2022-05-21|2
1|2022-05-22|2
1|2022-05-23|2
1|2022-05-24|2
1|2022-05-25|3
1|2022-05-26|3
1|2022-05-27|3
1|2022-05-28|2.5
1|2022-05-29|2.5
1|2022-05-30|2.5
2|2022-05-25|10
2|2022-05-26|10
2|2022-05-27|10
2|2022-05-28|10
2|2022-05-29|10
2|2022-05-30|100
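For anyone who wants to reproduce the calendar-table approach locally, here is a small sketch using Python's built-in sqlite3 module rather than MySQL. It is an assumption-laden illustration: the calendar table is filled by hand with 2022-05-20 to 2022-05-30, dates are stored as ISO text, and the upper bound is simply the end of the calendar instead of NOW().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER, price REAL, date TEXT);
INSERT INTO tbl VALUES
  (1, 2, '2022-05-21'), (1, 3, '2022-05-25'), (1, 2.5, '2022-05-28'),
  (2, 10, '2022-05-25'), (2, 100, '2022-05-30');
CREATE TABLE calendar (date TEXT);
""")
# Hypothetical calendar table covering 2022-05-20 .. 2022-05-30
conn.executemany(
    "INSERT INTO calendar VALUES (?)",
    [(f"2022-05-{d:02d}",) for d in range(20, 31)],
)

rows = conn.execute("""
SELECT t.id,
       c.date,
       (SELECT price FROM tbl
         WHERE tbl.id = t.id AND tbl.date <= c.date
         ORDER BY tbl.date DESC
         LIMIT 1) AS price           -- latest price on or before this date
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM tbl) t
WHERE c.date >= (SELECT MIN(date) FROM tbl WHERE tbl.id = t.id)
ORDER BY t.id, c.date
""").fetchall()

for row in rows:
    print(row)
```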
I have a phonelog table that has information about callers' call history. I'd like to find out callers whose first and last call was to the same person on a given day.
Callerid Recipientid DateCalled
1 2 2019-01-01 09:00:00.000
1 3 2019-01-01 17:00:00.000
1 4 2019-01-01 23:00:00.000
2 5 2019-07-05 09:00:00.000
2 5 2019-07-05 17:00:00.000
2 3 2019-07-05 23:00:00.000
2 5 2019-07-06 17:00:00.000
2 3 2019-08-01 09:00:00.000
2 3 2019-08-01 17:00:00.000
2 4 2019-08-02 09:00:00.000
2 5 2019-08-02 10:00:00.000
2 4 2019-08-02 11:00:00.000
Expected Output
Callerid Recipientid Datecalled
2 5 2019-07-05
2 3 2019-08-01
2 4 2019-08-02
I wrote the below query but can't get it to return recipientid. Any help on this will be appreciated!
select pl.callerid,cast(pl.datecalled as date) as datecalled
from phonelog pl inner join (select callerid, cast(datecalled as date) as datecalled,
min(datecalled) as firstcall, max(datecalled) as lastcall
from phonelog
group by callerid, cast(datecalled as date)) as x
on pl.callerid = x.callerid and cast(pl.datecalled as date) = x.datecalled
and (pl.datecalled = x.firstcall or pl.datecalled = x.lastcall)
group by pl.callerid, cast(pl.datecalled as date)
having count(distinct recipientid) = 1
Another dbFiddle option
First, in my prequery (alias PQ), I get, for each caller and day, the MIN and MAX call time, with a HAVING clause to make sure the caller made at least 2 phone calls that day. From that, I join back to the phone log table on the FIRST (MIN) call for that caller and day. Then I join one more time on the LAST (MAX) call for the same caller and day, and make sure the recipient of the first call is the same as the recipient of the last.
I do not have to join on the stripped-down "JustDate" column used for the grouping as the MIN/MAX qualifies the FULL date/time.
select
PQ.JustDate,
PQ.CallerID,
pl1.RecipientID
from
( select
callerID,
convert( date, dateCalled ) JustDate,
min( DateCalled ) minDateCall,
max( DateCalled ) maxDateCall
from
PhoneLog pl
group by
callerID,
convert( date, dateCalled )
having
count(*) > 1) PQ
JOIN PhoneLog pl1
on PQ.CallerID = pl1.CallerID
AND PQ.minDateCall = pl1.dateCalled
JOIN PhoneLog pl2
on PQ.CallerID = pl2.CallerID
AND PQ.maxDateCall = pl2.dateCalled
AND pl1.RecipientID = pl2.RecipientID
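The same idea (group by caller and day, then compare the recipient of the earliest and latest call) can be checked with a few lines of Python. This is only a sketch: the function name same_first_last and the min_calls parameter are my own, and the data is a subset of the rows from the question.

```python
from collections import defaultdict
from datetime import datetime

# Subset of the question's rows: (Callerid, Recipientid, DateCalled)
calls = [
    (1, 2, "2019-01-01 09:00"), (1, 3, "2019-01-01 17:00"), (1, 4, "2019-01-01 23:00"),
    (2, 5, "2019-07-06 17:00"),
    (2, 3, "2019-08-01 09:00"), (2, 3, "2019-08-01 17:00"),
    (2, 4, "2019-08-02 09:00"), (2, 5, "2019-08-02 10:00"), (2, 4, "2019-08-02 11:00"),
]

def same_first_last(calls, min_calls=2):
    """Return (caller, recipient, day) where the day's first and last call share a recipient."""
    by_day = defaultdict(list)
    for caller, recipient, stamp in calls:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        by_day[(caller, when.date())].append((when, recipient))
    result = []
    for (caller, day), entries in sorted(by_day.items()):
        entries.sort()  # chronological order within the day
        # min_calls=2 mirrors HAVING count(*) > 1: skip days with a single call
        if len(entries) >= min_calls and entries[0][1] == entries[-1][1]:
            result.append((caller, entries[0][1], day.isoformat()))
    return result

print(same_first_last(calls))
```

Passing min_calls=1 also returns days with a single call, where the first and last call are trivially the same.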
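It's very easy with window functions:
WITH cte AS (
SELECT Callerid, CAST(DateCalled AS DATE) AS CallDate
-- order by the full timestamp and use the whole partition as the frame,
-- otherwise LAST_VALUE only sees rows up to the current one
,FIRST_VALUE(Recipientid) OVER (PARTITION BY Callerid, CAST(DateCalled AS DATE) ORDER BY DateCalled
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) f
,LAST_VALUE(Recipientid) OVER (PARTITION BY Callerid, CAST(DateCalled AS DATE) ORDER BY DateCalled
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) l
FROM phonelog
)
SELECT DISTINCT Callerid, f AS Recipientid, CallDate AS DateCalled FROM cte
WHERE f=l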
Since SQL Server 2012 you can use the first_value() and last_value() window functions.
SELECT DISTINCT
x1.callerid,
x1.fri,
x1.datecalled
FROM (SELECT pl1.callerid,
pl1.recipientid,
convert(date, pl1.datecalled) datecalled,
first_value(pl1.recipientid) OVER (PARTITION BY pl1.callerid,
convert(date, pl1.datecalled)
ORDER BY pl1.datecalled
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) fri,
last_value(pl1.recipientid) OVER (PARTITION BY pl1.callerid,
convert(date, pl1.datecalled)
ORDER BY pl1.datecalled
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) lri
FROM phonelog pl1) x1
WHERE x1.fri = x1.lri;
In older versions you can use correlated subqueries with TOP 1.
SELECT DISTINCT
x1.callerid,
x1.fri,
x1.datecalled
FROM (SELECT pl1.callerid,
pl1.recipientid,
convert(date, pl1.datecalled) datecalled,
(SELECT TOP 1
pl2.recipientid
FROM phonelog pl2
WHERE pl2.callerid = pl1.callerid
AND pl2.datecalled >= convert(date, pl1.datecalled)
AND pl2.datecalled < dateadd(day, 1, convert(date, pl1.datecalled))
ORDER BY pl2.datecalled ASC) fri,
(SELECT TOP 1
pl2.recipientid
FROM phonelog pl2
WHERE pl2.callerid = pl1.callerid
AND pl2.datecalled >= convert(date, pl1.datecalled)
AND pl2.datecalled < dateadd(day, 1, convert(date, pl1.datecalled))
ORDER BY pl2.datecalled DESC) lri
FROM phonelog pl1) x1
WHERE x1.fri = x1.lri;
If you don't want to return log rows where somebody just made one call on a day, which of course means the first and the last call of the day were to the same person, you can use GROUP BY and HAVING count(*) > 1 instead of DISTINCT.
SELECT x1.callerid,
x1.fri,
x1.datecalled
FROM (...) x1
WHERE x1.fri = x1.lri
GROUP BY x1.callerid,
x1.fri,
x1.datecalled
HAVING count(*) > 1;
You can use a CTE to compute the first and last call of each day by Callerid, and then self-JOIN that CTE to find callers whose first and last calls were to the same Recipientid:
WITH CTE AS (
SELECT Callerid, RecipientId, CONVERT(DATE, Datecalled) AS Datecalled,
ROW_NUMBER() OVER (PARTITION BY Callerid, CONVERT(DATE, Datecalled) ORDER BY Datecalled) AS rna,
ROW_NUMBER() OVER (PARTITION BY Callerid, CONVERT(DATE, Datecalled) ORDER BY Datecalled DESC) AS rnb
FROM phonelog
)
SELECT c1.Callerid, c1.RecipientId, c1.Datecalled
FROM CTE c1
JOIN CTE c2 ON c1.Callerid = c2.Callerid AND c1.Recipientid = c2.Recipientid
WHERE c1.rna = 1 AND c2.rnb = 1
Output:
Callerid RecipientId Datecalled
2 5 2019-07-05
2 3 2019-08-01
2 4 2019-08-02
As I understand it, you want each Callerid/Recipientid pair that occurs more than once on a day, so that there is both a first call and a last call. So you just need to GROUP BY the 3 columns combined with HAVING COUNT(Recipientid) > 1, like this:
SELECT Callerid, Recipientid, CAST(Datecalled AS DATE) AS Datecalled
FROM phonelog
GROUP BY Callerid, Recipientid, CAST(Datecalled AS DATE)
HAVING COUNT(Recipientid) > 1
As I understand it, we have to rank by Caller_id as well as Recipient_id, along with the date.
Below is my solution, which works for this case.
with CTE as
(select *,
row_number() over (partition by callerid, convert(VARCHAR,datecalled,23) order by convert(VARCHAR,datecalled,23)) as first_recipient_id,
row_number() over (partition by receipientid, convert(VARCHAR,datecalled,23) order by convert(VARCHAR,datecalled,23) desc) as last_recipient_id
from activity
)
select t.callerid,t.receipientid,CONVERT(VARCHAR,t.datecalled) as DateCalled from CTE t
where t.first_recipient_id >1 AND t.last_recipient_id>1;
I think we need to identify the first and last call made by a caller on a day and then compare them with the first and last call by that caller to each recipient on that day. The code below computes firstcall and lastcall for the caller on a day, then the first and last call by the caller to the respective recipient, and then compares them.
SELECT DISTINCT
callerid,
recipientid,
CONVERT(date,firstcall)
FROM
(
Select
callerid,
recipientid,
MIN(dateCalled) OVER(PARTITION BY callerid,CONVERT(date,DateCalled)) as firstcall,
MAX(DateCalled) OVER(PARTITION BY callerid,CONVERT(date,DateCalled)) as lastcall,
MIN(DateCalled) OVER(PARTITION BY callerid,recipientid,convert(date,DateCalled)) as recipfirstcall,
MAX(DateCalled) OVER(PARTITION BY callerid,recipientid,convert(date,DateCalled)) as reciplastcall
from phonelog
) as A
where A.firstcall=A.recipfirstcall and A.lastcall=A.reciplastcall
https://www.db-fiddle.com/f/rgLXTu3VysD3kRwBAQK3a4/3
My problem here is that I want the PARTITION BY window to start counting rows only from a certain time range.
In this example, if I add rn = 1 at the end, order_id = 5 is excluded from the results (because the partition is ordered by paid_date and order_id = 6 has an earlier date), but it shouldn't be, as I want the time range for the partition to start from '2019-01-10'.
With the condition rn = 1 added, the expected output should be order_id 3, 5, 11, 15; now it's only 3, 11, 15.
It should include only orders with is_paid = 0 that are the first one within the given time range (if there's a preceding order with is_paid = 1, it shouldn't be counted).
Use a correlated subquery with NOT EXISTS:
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date,order_id) rn
FROM orders o
WHERE paid_date between '2019-01-10'
and '2019-01-15'
) x where rn=1 and not exists (select 1 from orders o1 where x.order_id=o1.order_id
and is_paid=1)
OUTPUT:
order_id customer_id amount is_paid paid_date rn
3 101 30 0 10/01/2019 00:00:00 1
5 102 15 0 10/01/2019 00:00:00 1
11 104 31 0 10/01/2019 00:00:00 1
15 105 11 0 10/01/2019 00:00:00 1
If priority should be given to order_id, put it before paid_date in the ORDER BY clause of the window function; this will solve your issue.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_id,paid_date) rn
FROM orders o
) x WHERE is_paid = 0 and paid_date between
'2019-01-10' and '2019-01-15' and rn=1
Since you need paid_date to be ordered first, you need to apply a WHERE condition inside the subquery that is being partitioned, so that dates outside the range do not affect the row numbering.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date, order_id) rn
FROM orders o
where paid_date between '2019-01-10' and '2019-01-15'
) x WHERE is_paid = 0 and rn=1
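To illustrate why the position of the date filter matters, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). The four rows are made-up test data, not the rows from the question's demo: customer 102 has an earlier paid order (6) outside the range and an unpaid order (5) inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, is_paid INTEGER, paid_date TEXT);
INSERT INTO orders VALUES
  (3, 101, 0, '2019-01-10'),
  (4, 101, 0, '2019-01-12'),
  (5, 102, 0, '2019-01-10'),
  (6, 102, 1, '2019-01-05');
""")

RANKED = """
SELECT order_id FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY paid_date, order_id) AS rn
  FROM orders o {inner_filter}
) x
WHERE is_paid = 0 AND rn = 1 {outer_filter}
ORDER BY order_id
"""
in_range = "paid_date BETWEEN '2019-01-10' AND '2019-01-15'"

# Date filter applied after numbering: order 6 takes rn = 1 for customer 102
filtered_after = [r[0] for r in conn.execute(
    RANKED.format(inner_filter="", outer_filter="AND " + in_range))]

# Date filter applied before numbering: numbering starts inside the range
filtered_before = [r[0] for r in conn.execute(
    RANKED.format(inner_filter="WHERE " + in_range, outer_filter=""))]

print(filtered_after, filtered_before)  # [3] [3, 5]
```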
I have a problem writing a query.
The raw data is as follows:
DATE CUSTOMER_ID AMOUNT
20170101 1 150
20170201 1 50
20170203 1 200
20170204 1 250
20170101 2 300
20170201 2 70
I want to know on which date the sum of amount for each customer_id becomes more than 350.
How can I write this query to get a result like this?
CUSTOMER_ID MAX_DATE
1 20170203
2 20170201
Thanks,
Simply use ANSI/ISO standard window functions to calculate the running sum:
select t.*
from (select t.*,
sum(t.amount) over (partition by t.customer_id order by t.date) as running_amount
from t
) t
where running_amount - amount < 350 and
running_amount >= 350;
If for some reason, your database doesn't support this functionality, you can use a correlated subquery:
select t.*
from (select t.*,
(select sum(t2.amount)
from t t2
where t2.customer_id = t.customer_id and
t2.date <= t.date
) as running_amount
from t
) t
where running_amount - amount < 350 and
running_amount >= 350;
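The crossing condition in both queries (the running total before the row is below 350, and the running total including it is at least 350) can be checked quickly in Python. This is just a sketch with the question's sample rows; the function name crossing_dates is my own.

```python
from itertools import groupby

# Sample rows from the question: (DATE, CUSTOMER_ID, AMOUNT)
rows = [
    ("20170101", 1, 150), ("20170201", 1, 50), ("20170203", 1, 200), ("20170204", 1, 250),
    ("20170101", 2, 300), ("20170201", 2, 70),
]

def crossing_dates(rows, threshold=350):
    """Return {customer: date} for the row where the running total first reaches threshold."""
    result = {}
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # by customer, then date
    for customer, group in groupby(ordered, key=lambda r: r[1]):
        running = 0
        for day, _, amount in group:
            previous = running          # running_amount - amount in the SQL
            running += amount           # running_amount in the SQL
            if previous < threshold <= running:
                result[customer] = day
                break
    return result

print(crossing_dates(rows))  # {1: '20170203', 2: '20170201'}
```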
ANSI SQL
Used for the test: TSQL and MS SQL Server 2012
select
"CUSTOMER_ID",
min("DATE")
FROM
(
select
"CUSTOMER_ID",
"DATE",
(
SELECT
sum(T02."AMOUNT") AMOUNT
FROM "TABLE01" T02
WHERE
T01."CUSTOMER_ID" = T02."CUSTOMER_ID"
AND T02."DATE" <= T01."DATE"
) "AMOUNT"
from "TABLE01" T01
) T03
where
T03."AMOUNT" > 350
group by
"CUSTOMER_ID"
GO
CUSTOMER_ID | (No column name)
----------: | :------------------
1 | 03/02/2017 00:00:00
2 | 01/02/2017 00:00:00
SELECT
tmp.`CUSTOMER_ID`,
MIN(tmp.`DATE`) as MAX_DATE
FROM
(
SELECT
`DATE`,
`CUSTOMER_ID`,
`AMOUNT`,
(
SELECT SUM(`AMOUNT`) FROM tbl t2 WHERE t2.`DATE` <= t1.`DATE` AND `CUSTOMER_ID` = t1.`CUSTOMER_ID`
) AS SUM_UP
FROM
`tbl` t1
ORDER BY
`DATE` ASC
) tmp
WHERE
tmp.`SUM_UP` > 350
GROUP BY
tmp.`CUSTOMER_ID`
Explanation:
First I select all rows and, for each one, use a subselect to SUM the amounts of all rows for the same customer whose DATE is earlier than or equal to the current row's DATE. From this table I select the MIN date that has a running sum > 350.
I don't think this is an easy calculation, so I want to work through it step by step. It may look a little convoluted, but once this works for your scenario I believe the performance can be improved. If anybody can improve my query, please edit my post.
Unfortunately I could not try the solution below on a computer, but I expect it to give the result you want:
-- Get the start date of each customer
SELECT CUSTOMER_ID
,MIN(DATE) AS DATE
INTO #table
FROM [TABLE]
GROUP BY CUSTOMER_ID
-- For every date of a customer, total the amounts from the start date up to that date
SELECT t1.CUSTOMER_ID
,t2.DATE AS DATE
,(SELECT SUM(t3.AMOUNT) FROM [TABLE] t3
WHERE t3.CUSTOMER_ID = t1.CUSTOMER_ID
AND t3.DATE BETWEEN t1.DATE AND t2.DATE) AS total
INTO #tableCalculated
FROM #table t1
INNER JOIN [TABLE] t2 ON t1.CUSTOMER_ID = t2.CUSTOMER_ID
-- Select the earliest date per CUSTOMER_ID where the total is greater than 350
SELECT CUSTOMER_ID, MIN(DATE) AS DATE
FROM #tableCalculated
WHERE total > 350
GROUP BY CUSTOMER_ID
SELECT CUSTOMER_ID, MIN(DATE) AS GOALDATE
FROM ( SELECT cd1.*, (SELECT SUM(AMOUNT)
FROM CustData cd2
WHERE cd2.CUSTOMER_ID = cd1.CUSTOMER_ID
AND cd2.DATE <= cd1.DATE) AS RUNNINGTOTAL
FROM CustData cd1) AS custdata2
WHERE RUNNINGTOTAL >= 350
GROUP BY CUSTOMER_ID
I have written a query like this:
select Customerid,orderDate, OrderNumber,
DENSE_RANK() OVER (PARTITION BY Customerid ORDER BY orderDate) "rank"
from [order]
This produces the result. From it, I want to retrieve only the latest purchase of each customer, like this:
1 2014-04-09 00:00:00.000 543141 6
2 2014-03-04 00:00:00.000 543056 4
3 2014-01-28 00:00:00.000 542986 7
How can I achieve this with a SQL query?
Use a subquery:
select o.*
from (select Customerid,orderDate, OrderNumber,
DENSE_RANK() OVER (PARTITION BY Customerid ORDER BY orderDate DESC) as seqnum
from [order] o
) o
where seqnum = 1;
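One thing to keep in mind with DENSE_RANK is that ties on orderDate all get seqnum = 1. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later) to show the difference from ROW_NUMBER with a tie-breaker. The table is named orders here to avoid quoting a reserved word, and the rows with OrderNumber 543000 and 542987 are invented for the illustration; the other three come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (Customerid INTEGER, orderDate TEXT, OrderNumber INTEGER);
INSERT INTO orders VALUES
  (1, '2014-01-01', 543000),
  (1, '2014-04-09', 543141),
  (2, '2014-03-04', 543056),
  (3, '2014-01-28', 542986),
  (3, '2014-01-28', 542987);
""")

QUERY = """
SELECT Customerid, OrderNumber FROM (
  SELECT o.*, {fn}() OVER (PARTITION BY Customerid ORDER BY {order}) AS seqnum
  FROM orders o
) x
WHERE seqnum = 1
ORDER BY Customerid, OrderNumber
"""

# DENSE_RANK keeps every order that shares the latest date
with_ties = conn.execute(
    QUERY.format(fn="DENSE_RANK", order="orderDate DESC")).fetchall()

# ROW_NUMBER with a tie-breaker returns exactly one order per customer
one_each = conn.execute(
    QUERY.format(fn="ROW_NUMBER", order="orderDate DESC, OrderNumber DESC")).fetchall()

print(with_ties)
print(one_each)
```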