I am trying to write a query that would get the customers with 7 consecutive transactions given a list of CustomerKeys.
I am currently doing a self join on Customer fact table that has 700 Million records in SQL Server 2008.
This is what I came up with, but it's taking a long time to run. I have a clustered index on (CustomerKey, TranDateKey).
SELECT
ct1.CustomerKey,ct1.TranDateKey
FROM
CustomerTransactionFact ct1
INNER JOIN
#CRTCustomerList dl ON ct1.CustomerKey = dl.CustomerKey --temp table with customer list
INNER JOIN
dbo.CustomerTransactionFact ct2 ON ct1.CustomerKey = ct2.CustomerKey -- Same Customer
AND ct2.TranDateKey >= ct1.TranDateKey
AND ct2.TranDateKey <= CONVERT(VARCHAR(8), DATEADD(d, 6, ct1.TranDateTime), 112) -- Consecutive Transactions in the last 7 days
WHERE
ct1.LogID >= 82800000
AND ct2.LogID >= 82800000
AND ct1.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
AND ct2.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
GROUP BY
ct1.CustomerKey,ct1.TranDateKey
HAVING
COUNT(*) = 7
Please help make it more efficient. Is there a better way to write this query in 2008?
You can do this using window functions, which should be much faster. Assuming that TranDateKey is a number, you can subtract a sequential number from it; the difference is then constant for consecutive days.
You can put this in a query like this:
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey,
(ct.TranDateKey -
DENSE_RANK() OVER (PARTITION BY ct.CustomerKey ORDER BY ct.TranDateKey)
) as grp
FROM CustomerTransactionFact ct INNER JOIN
#CRTCustomerList dl
ON ct.CustomerKey = dl.CustomerKey
) t
GROUP BY CustomerKey, grp
HAVING COUNT(*) = 7;
If your date key is something else, there is probably a way to modify the query to handle that, but you might have to join to the dimension table.
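To make the "date minus sequence number" idea concrete, here is a minimal Python sketch of the same grouping logic. The function name and the sample dates are hypothetical. It works on real dates rather than on the key, because if TranDateKey is a yyyymmdd-style integer, plain subtraction is not constant across month boundaries (20140131 and 20140201 are consecutive days but differ by 70).

```python
from datetime import date
from itertools import groupby

def consecutive_runs(days, min_len=7):
    """Return (start, end, length) for each run of consecutive calendar days."""
    # One entry per day, which is what DENSE_RANK achieves in the SQL version.
    unique_days = sorted(set(days))
    # Day ordinal minus position is constant inside a run of consecutive days.
    keyed = [(d.toordinal() - i, d) for i, d in enumerate(unique_days)]
    runs = []
    for _, grp in groupby(keyed, key=lambda pair: pair[0]):
        members = [d for _, d in grp]
        if len(members) >= min_len:
            runs.append((members[0], members[-1], len(members)))
    return runs

# Hypothetical sample: 7 consecutive days, a gap, then 3 more days.
sample = [date(2014, 1, d) for d in range(1, 8)]
sample += [date(2014, 1, d) for d in (10, 11, 12)]
sample.append(date(2014, 1, 3))  # a second transaction on an existing day

runs = consecutive_runs(sample)
```

Note that the sketch counts distinct days; in the SQL version, COUNT(*) counts transactions, so a customer with several transactions on one day may need COUNT(DISTINCT TranDateKey) instead.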
This would be a perfect task for a COUNT(*) OVER (RANGE ...), but SQL Server 2008 supports only a limited syntax for Windowed Aggregate Functions.
SELECT CustomerKey, MIN(TranDateKey), COUNT(*)
FROM
(
SELECT CustomerKey, TranDateKey,
dateadd(d,-ROW_NUMBER()
OVER (PARTITION BY CustomerKey
ORDER BY TranDateKey),TranDateTime) AS dummyDate
FROM CustomerTransactionFact
) AS dt
GROUP BY CustomerKey, dummyDate
HAVING COUNT(*) >= 7
The dateadd calculates the difference between the current TranDateTime and a ROW_NUMBER over all dates per customer. The resulting dummyDate has no actual meaning, but it is the same meaningless date for consecutive dates.
I need to use this SQL query in a piece of software that expects the time in a particular format, hence the Time column. However, I also need the query to insert the months that are missing, using the value from the previous month. This is the query I currently have.
SELECT [accountnumber],SUM([postingamount]) AS Amount, [accountingdate],
convert(varchar(4),year(accountingdate))+'M'+ Format(DATEPART( MONTH, accountingdate) , '00')
AS [Time]
FROM [7 GL Detail MACL]
where [accountingdate]>='2019-01-01'
GROUP BY [accountingdate],[postingamount],[accountnumber]
Current Results
Expected Results
Since you didn't specify the RDBMS system you're using, I can't guarantee that this logic will work because every system uses slightly different SQL syntax.
However, I used the Rasgo datespine function to generate this SQL, since the logic is quite complex to wrap your head around, and I tested it on Snowflake.
The main differences between Snowflake and other systems are: DATEADD and TABLE (GENERATOR())
In case you can't modify this to work in your system, here are the basic steps which you'll want to follow:
1) Select unique accountnumbers.
2) Select unique dates (month beginnings?). This is where Snowflake uses GENERATOR, but other systems might actually have a Calendar table you can select from.
3) Cross join (Cartesian join) these to create every possible combination of accountnumber and date.
4) Outer join #3 to your data (you might have to truncate your date to month-begin).
5) Filter out rows that don't apply. For instance, you might have just inserted a row for 1/1/2019 for an account that didn't even begin until 12/12/2020.
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
DATEADD('MONTH', (INTERVAL_ID - 1), '2019-01-01'::timestamp_ntz) as SPINE_START,
DATEADD('MONTH', INTERVAL_ID, '2019-01-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 42))
),
GROUPS AS (
SELECT
accountnumber,
MIN(DESIRED_INTERVAL) AS LOCAL_START,
MAX(DESIRED_INTERVAL) AS LOCAL_END
FROM [7 GL Detail MACL]
GROUP BY
accountnumber
),
GROUP_SPINE AS (
SELECT
accountnumber,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM GROUPS G
CROSS JOIN LATERAL (
SELECT
SPINE_START, SPINE_END
FROM GLOBAL_SPINE S
WHERE S.SPINE_START >= G.LOCAL_START
)
)
SELECT
G.accountnumber AS GROUP_BY_accountnumber,
GROUP_START,
GROUP_END,
T.*
FROM GROUP_SPINE G
LEFT JOIN {{ your_table }} T
ON DESIRED_INTERVAL >= G.GROUP_START
AND DESIRED_INTERVAL < G.GROUP_END
AND G.accountnumber = T.accountnumber;
You were also doing an aggregation step, but I figure once you get this complicated part down, you can figure out how to finally aggregate it the way you want it.
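The steps listed in this answer, plus the "carry the previous month's value forward" part of the question, can be sketched outside the database as follows. This is only an illustration under assumptions: the function names, the input shape (one already-aggregated amount per account and month) and the sample rows are hypothetical, and the Time label follows the `2019M01` format from the question.

```python
from datetime import date

def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def fill_missing_months(rows, end):
    """rows: (accountnumber, month_start, amount) tuples.

    Builds a month spine per account, starting at that account's first
    month, and carries the last known amount into months with no row.
    """
    by_account = {}
    for acct, month, amount in rows:
        by_account.setdefault(acct, {})[month] = amount
    filled = []
    for acct, months in sorted(by_account.items()):
        last = None
        for m in month_range(min(months), end):
            last = months.get(m, last)  # keep the previous value if missing
            filled.append((acct, f"{m.year}M{m.month:02d}", last))
    return filled

# Hypothetical sample data: account 1000 has no row for February.
rows = [
    ("1000", date(2019, 1, 1), 50.0),
    ("1000", date(2019, 3, 1), 70.0),
    ("2000", date(2019, 2, 1), 10.0),
]
filled = fill_missing_months(rows, end=date(2019, 4, 1))
```

Starting each account's spine at its own first month plays the role of step 5: no rows are produced for months before the account existed.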
I am using Teradata SQL Assistant connected to an enterprise DW. I have written the query below to show an inventory of outstanding items as of a specific point in time. The referenced table loads and stores a new record, by load date, each time an item's state changes (historical records are never deleted). The output of my query is 1 row for the specified date. Can I create a stored procedure or a recursive query of some sort to build a history of these summary rows (with 1 new row per day)? I have not used such features before; links to relevant previously answered questions, or suggestions on how to get on the right track in researching other possible solutions, are totally fine. I'm just trying to bridge this gap in my knowledge.
SELECT
'2017-10-02' as Dt
,COUNT(DISTINCT A.RECORD_NBR) as Pending_Records
,SUM(A.PAY_AMT) AS Total_Pending_Payments
FROM DB.RECORD_HISTORY A
INNER JOIN
(SELECT MAX(LOAD_DT) AS LOAD_DT
,RECORD_NBR
FROM DB.RECORD_HISTORY
WHERE LOAD_DT <= '2017-10-02'
GROUP BY RECORD_NBR
) B
ON A.RECORD_NBR = B.RECORD_NBR
AND A.LOAD_DT = B.LOAD_DT
WHERE
A.RECORD_ORDER =1 AND Final_DT Is Null
GROUP BY Dt
ORDER BY 1 desc
Here is my interpretation of your query:
For the most recent load_dt (up until 2017-10-02) for record_order #1,
return
1) the number of different pending records
2) the total amount of pending payments
Is this correct? If you're looking for this info, but one row for each "Load_Dt", you just need to remove that INNER JOIN:
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
ORDER BY 1 DESC
If you want to get the summary info per record_order, just add record_order as a grouping column:
SELECT
load_Dt,
record_order,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE final_Dt IS NULL
GROUP BY load_Dt, record_order
ORDER BY 1,2 DESC
If you want to get one row per day (if there are calendar days with no corresponding "load_dt" days), then you can SELECT from the sys_calendar.calendar view and LEFT JOIN the query above on the "load_dt" field:
SELECT cal.calendar_date, src.Pending_Records, src.Total_Pending_Payments
FROM sys_calendar.calendar cal
LEFT JOIN (
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
) src ON cal.calendar_date = src.load_Dt
WHERE cal.calendar_date BETWEEN <start_date> AND <end_date>
ORDER BY 1 DESC
I don't have access to a TD system, so you may get syntax errors. Let me know if that works or you're looking for something else.
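Since the calendar join above was not tested on Teradata, here is a small, runnable sketch of the same pattern using Python's built-in sqlite3 module. It only illustrates the LEFT JOIN behaviour: the `calendar` table stands in for sys_calendar.calendar, and the table contents are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (calendar_date TEXT PRIMARY KEY);
INSERT INTO calendar VALUES ('2017-10-01'), ('2017-10-02'), ('2017-10-03');

CREATE TABLE record_history (
    record_nbr INTEGER, load_dt TEXT, pay_amt REAL,
    record_order INTEGER, final_dt TEXT
);
INSERT INTO record_history VALUES
    (1, '2017-10-01', 100.0, 1, NULL),
    (2, '2017-10-01',  50.0, 1, NULL),
    (3, '2017-10-03',  25.0, 1, NULL),
    (4, '2017-10-03',  10.0, 1, '2017-10-03');  -- finalized, so not pending
""")

# One output row per calendar day; days without a load come back as NULLs.
rows = conn.execute("""
SELECT cal.calendar_date, src.pending_records, src.total_pending_payments
FROM calendar cal
LEFT JOIN (
    SELECT load_dt,
           COUNT(DISTINCT record_nbr) AS pending_records,
           SUM(pay_amt)               AS total_pending_payments
    FROM record_history
    WHERE record_order = 1 AND final_dt IS NULL
    GROUP BY load_dt
) src ON cal.calendar_date = src.load_dt
WHERE cal.calendar_date BETWEEN '2017-10-01' AND '2017-10-03'
ORDER BY cal.calendar_date DESC
""").fetchall()
```

In this sample, 2017-10-02 has no load, so it appears with NULL (None in Python) in both summary columns rather than disappearing from the result.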
I need to sum up the values for the last 7 days, so it should be the current day plus the previous 6. This should happen for each row, i.e. in each row the column value would be current + previous 6.
The case:
(Note: I will calculate the hours by summing up the seconds.)
I tried using the query below:
select SUM([drivingTime]) OVER(PARTITION BY driverid ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
from [f.DriverHseCan]
The problem I face is that I have to group by driver and asset for each date.
In that case, the driving time should first be summed up per date, and then the previous 6 rows should be added to it.
I can't do this using rank() because I need the individual rows as well, since I have to show them in the report.
I tried doing this in both SSRS and SQL.
In short, it should add up the total driving time for the current day plus the 6 previous days.
Try the following query
SELECT
s.date
, s.driverid
, s.assetid
, s.drivingtime
, SUM(s2.drivingtime) AS total_drivingtime
FROM f.DriverHseCan s
JOIN (
SELECT date,driverid, SUM(drivingtime) drivingtime
FROM f.DriverHseCan
GROUP BY date,driverid
) AS s2
ON s.driverid = s2.driverid AND s2.date BETWEEN DATEADD(d,-6,s.date) AND s.date
GROUP BY
s.date
, s.driverid
, s.assetid
, s.drivingtime
If you have week start/end dates, there could be better performing alternatives to solve your problem, e.g. use the week number in SSRS expressions rather than do the self join on SQL server
I think aggregation does what you want:
select sum(sum([drivingTime])) over (partition by driverid
order by date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
from [f.DriverHseCan]
group by driverid, date
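One thing to watch with a ROWS frame is that it counts rows, not days, so if a driver has no row for some dates, "6 PRECEDING" reaches back more than 6 calendar days. The date-range join in the earlier answer counts days instead. A minimal Python sketch of the day-based version, with hypothetical function and column names and made-up sample rows:

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_7day_totals(rows):
    """rows: (driverid, assetid, day, seconds) tuples.

    Returns {(driverid, day): seconds driven from day-6 through day}.
    """
    # Step 1: collapse assets into one total per driver and day.
    daily = defaultdict(int)
    for driver, _asset, day, seconds in rows:
        daily[(driver, day)] += seconds
    # Step 2: for each driver/day, add up the 7 calendar days ending on it.
    totals = {}
    for driver, day in daily:
        window = (day - timedelta(days=n) for n in range(7))
        totals[(driver, day)] = sum(daily.get((driver, d), 0) for d in window)
    return totals

# Hypothetical sample: two assets on Jan 1, then a gap before Jan 9.
rows = [
    (1, "A", date(2018, 1, 1), 3600),
    (1, "B", date(2018, 1, 1), 1800),
    (1, "A", date(2018, 1, 2), 3600),
    (1, "A", date(2018, 1, 9), 600),
]
totals = rolling_7day_totals(rows)
```

For Jan 9 the day-based window covers Jan 3 to Jan 9 and gives 600 seconds, whereas a row-based frame over the same three daily rows would also pull in Jan 1 and Jan 2.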
I guess you need to use CROSS APPLY.
Something like the following:
SELECT YT.driverID,
YT.date,
CA.Last6DayDrivingTime
FROM YourTable YT
CROSS APPLY
(
-- total driving time for this driver over the current day and the 6 days before it
SELECT SUM(YT2.drivingTime) AS Last6DayDrivingTime
FROM YourTable YT2
WHERE YT2.driverID = YT.driverID
AND YT2.date BETWEEN DATEADD(DAY,-6,YT.date) AND YT.date
) CA
Edit:
Since you commented that CROSS APPLY slows down the query, another option is to pre-calculate the weekly values in a temp table or a CTE and then use them in your main query.
Struggling to go the extra step with a SQL query I'd like to run.
I have a customer database with a Customer table holding the date/time when each customer joined, and a transaction table with details of their transactions over the years.
What I'd like to do is group by the join date (as a year) and count the number of customers who joined in each year. Then, in the next column, I'd like to count how many of them transacted in a specific year, e.g. 2016, the current year. This way I can show customer retention over the years.
Both tables are linked by a customer URN, but I am struggling to get my head around the most efficient way to show this. I can easily count and group the members by joined year, and I can display the latest transaction date, but I am struggling to bring the two together. I think I need to use subqueries and a left join, but it's eluding me.
Example output column headers with data
Year_Joined = 2009
Joiner_Count = 10
Transact_in_2016 = 5
Here is where I am syntax-wise. I know this is nowhere near complete, as I need to group by DateJoined and then use a subquery to count the customers who have transacted in 2016.
SELECT Customer.URNCustomer,
MAX(YEAR(Customer.DateJoined)),
MAX(YEAR(Tran.TranDate)) As Latest_Tran
FROM Mydatabase.dbo.Customer
LEFT JOIN Mydatabase.dbo.Tran
ON Tran.URNCustomer = Customer.URNCustomer
GROUP BY Customer.URNCustomer
ORDER BY Customer.URNCustomer
The best approach is to do the aggregation before doing the joins. You want to count two different things, so count them individually and then combine them.
The following uses full outer join. This handles the case where there are years with no new customers and years with no transactions:
select coalesce(c.yyyy, t.yyyy) as yyyy,
coalesce(c.numcustomers, 0) as numcustomers,
coalesce(t.numtransactions, 0) as numtransactions
from (select year(c.datejoined) as yyyy, count(*) as numcustomers
from Mydatabase.dbo.Customer c
group by year(c.datejoined)
) c full outer join
(select year(t.trandate) as yyyy, count(*) as numtransactions
from Mydatabase.dbo.Tran t
group by year(t.trandate)
) t
on c.yyyy = t.yyyy;
You may want to try something like this:
SELECT YEAR(Customer.DateJoined),
COUNT( DISTINCT Customer.URNCustomer ),
COUNT( DISTINCT Tran.URNCustomer ) AS NO_ACTIVE_IN_2016
FROM Mydatabase.dbo.Customer
LEFT JOIN Mydatabase.dbo.Tran
ON Tran.URNCustomer = Customer.URNCustomer
AND YEAR(Tran.TranDate) = 2016
GROUP BY YEAR(Customer.DateJoined)
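A small runnable sketch of this retention count, using Python's built-in sqlite3 module and made-up sample rows. It is an illustration only: the transaction table is called `Trans` here, and `strftime` replaces YEAR() because SQLite has no YEAR function. The 2016 customers are de-duplicated before the join, so a customer with several 2016 transactions still contributes one row to the joiner count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (URNCustomer INTEGER PRIMARY KEY, DateJoined TEXT);
CREATE TABLE Trans (URNCustomer INTEGER, TranDate TEXT);

INSERT INTO Customer VALUES
    (1, '2009-03-01'), (2, '2009-07-15'), (3, '2010-01-20');
INSERT INTO Trans VALUES
    (1, '2016-02-01'), (1, '2016-05-01'),   -- customer 1: twice in 2016
    (2, '2015-12-31'),                      -- customer 2: nothing in 2016
    (3, '2016-08-09');
""")

rows = conn.execute("""
SELECT j.year_joined,
       COUNT(*)             AS joiner_count,
       COUNT(a.URNCustomer) AS transact_in_2016
FROM (SELECT URNCustomer,
             strftime('%Y', DateJoined) AS year_joined
      FROM Customer) j
LEFT JOIN (SELECT DISTINCT URNCustomer
           FROM Trans
           WHERE TranDate >= '2016-01-01' AND TranDate < '2017-01-01') a
       ON a.URNCustomer = j.URNCustomer
GROUP BY j.year_joined
ORDER BY j.year_joined
""").fetchall()
```

The half-open date range on TranDate is a design choice: unlike wrapping the column in a year function, it leaves the column bare, which generally lets an index on the date be used.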
I have 2 SQL Tables
unit_transactions
unit_transaction_details
(tables schema here: http://sqlfiddle.com/#!3/e3204/2 )
What I need is to perform an SQL Query in order to generate a table with balances. Right now I have this SQL Query but it's not working fine because when I have 2 transactions with the same date then the balance is not calculated correctly.
SELECT
ft.transactionid,
ft.date,
ft.reference,
ft.transactiontype,
CASE ftd.isdebit WHEN 1 THEN MAX(ftd.debitaccountid) ELSE MAX(ftd.creditaccountid) END as financialaccountname,
CAST(COUNT(0) as tinyint) as totaldetailrecords,
ftd.isdebit,
SUM(ftd.amount) as amount,
balance.amount as balance
FROM unit_transaction_details ftd
JOIN unit_transactions ft ON ft.transactionid = ftd.transactionid
JOIN
(
SELECT DISTINCT
a.transactionid,
SUM(CASE b.isdebit WHEN 1 THEN b.amount ELSE -ABS(b.amount) END) as amount
--SUM(b.debit-b.credit) as amount
FROM unit_transaction_details a
JOIN unit_transactions ft ON ft.transactionid = a.transactionid
CROSS JOIN unit_transaction_details b
JOIN unit_transactions ft2 ON ft2.transactionid = b.transactionid
WHERE (ft2.date <= ft.date)
AND ft.unitid = 1
AND ft2.unitid = 1
AND a.masterentity = 'CONDO-A'
GROUP BY a.transactionid,a.amount
) balance ON balance.transactionid = ft.transactionid
WHERE
ft.unitid = 1
AND ftd.isactive = 1
GROUP BY
ft.transactionid,
ft.date,
ft.reference,
ft.transactiontype,
ftd.isdebit,
balance.amount
ORDER BY ft.date DESC
The result of the query is this:
Any clue on how to write a correct SQL query that will show me the right balances, ordered by transaction date in descending order?
Thanks a lot.
EDIT: THINK OF 2 POSSIBLE SOLUTIONS
The problem occurs when 2 transactions have the same date, so here is what I'm going to do:
Save Date and Time into "date" column. That way there won't be 2 exact dates.
OR
Create a "priority" column and set the priority for each record. So if I found that the date already exists and it has priority = 1 then the current priority will be 2.
What do you think?
There are two ways to do a running sum. I am going to show the syntax on a simpler table, to give you an idea.
Some databases (Oracle, PostgreSQL, SQL Server 2012, Teradata, DB2 for instance) support cumulative sums directly. For this you use the following function:
select sum(<val>) over (partition by <column> order by <ordering column>)
from t
This is a window function that will calculate the running sum of <val> for each group of records identified by <column>. The order of the sum is <ordering column>.
Alas, many databases don't support this functionality, so you would need to do a self join to do this in a single SELECT query in the database:
select t.<column>, t.<ordering column>, sum(tprev.<val>) as cumsum
from t left join
t tprev
on t.<column> = tprev.<column> and
t.<ordering column> >= tprev.<ordering column>
group by t.<column>, t.<ordering column>
There is also the possibility of creating another table and using a cursor to assign the cumulative sum, or of doing the sum at the application level.
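To tie the self-join approach back to the same-date problem in the question, here is a minimal sketch using Python's built-in sqlite3 module. It assumes transactionid can serve as a tie-breaker for transactions on the same date; that is an assumption for illustration, not something the question confirms. The simplified `txn` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (transactionid INTEGER PRIMARY KEY, date TEXT, amount REAL);
INSERT INTO txn VALUES
    (1, '2013-01-01', 100.0),
    (2, '2013-01-02', -30.0),   -- same date as transaction 3
    (3, '2013-01-02',  50.0),
    (4, '2013-01-03', -20.0);
""")

# Each row is joined to every row that comes before it, or is the row itself,
# in (date, transactionid) order, so two rows never share the same prefix.
rows = conn.execute("""
SELECT t.transactionid, t.date, t.amount, SUM(p.amount) AS balance
FROM txn t
JOIN txn p
  ON p.date < t.date
  OR (p.date = t.date AND p.transactionid <= t.transactionid)
GROUP BY t.transactionid, t.date, t.amount
ORDER BY t.date DESC, t.transactionid DESC
""").fetchall()
```

With only `p.date <= t.date` as the join condition, transactions 2 and 3 would both show a balance of 120; the tie-breaker gives 70 and then 120. Storing a full timestamp or a priority column, as proposed in the question's edit, would achieve the same effect by making the ordering unique.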