Assistance with a slow running CTE Query - sql

The query below populates a Power BI report that shows which characteristics (the last six digits of the PARTNO field, RIGHT(dbo.OC_VDATA.PARTNO,6)) have been run in the last 24 hours.
I believe the report has always run poorly, and it has become even slower as new characteristics are added and saved to the database. I am not well versed enough in SQL Server to spot the bottleneck.
WITH FilterCTE AS
(
SELECT
i.ITEM_CODE,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)) as LineID,
CONCAT(i.ITEM_CODE,RTRIM(dbo.OC_VDATA.UDL1),RTRIM(dbo.OC_VDATA.UDL6)) as LinkID,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1), i.ITEM_CODE, RIGHT(PARTNO,6)) as SpecID,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1), i.ITEM_CODE, RIGHT(PARTNO,6), RTRIM(dbo.OC_VDATA.UDL6)) as ControlLimitID,
ROW_NUMBER() OVER (PARTITION BY CONCAT(RTRIM(dbo.OC_VDATA.UDL1),
RTRIM(dbo.OC_VDATA.UDL6))
ORDER BY MAX(CONVERT(DATETIME, dbo.OC_VDAT_AUX.UDL40, 102)) DESC) AS RowNumber
FROM dbo.OC_VDATA INNER JOIN
dbo.OC_VDAT_AUX ON dbo.OC_VDATA.PARTNO = dbo.OC_VDAT_AUX.PARTNOAUX
AND dbo.OC_VDATA.DATETIME = dbo.OC_VDAT_AUX.DATETIMEAUX
INNER JOIN
stagingPLM.dbo.ITEM_CODES i ON LEFT(dbo.OC_VDATA.PARTNO, 12) = i.SPEC_NO
AND LEFT(dbo.OC_VDAT_AUX.PARTNOAUX, 12) = i.SPEC_NO
INNER JOIN
stagingPLM.dbo.PLANTS p ON dbo.OC_VDATA.UDL1 = p.PLANT_CODE
WHERE (CONVERT(DATETIME, dbo.OC_VDAT_AUX.UDL40, 101) > DATEADD(hour, - 24, GETDATE()))
AND (dbo.OC_VDAT_AUX.UDL28 LIKE 'PLASTIC%')
AND RIGHT(dbo.OC_VDATA.PARTNO,6) in ('008400',
'008500',
'036150',
'043300',
'043400',
'043200',
'043202',
'008800',
'008810',
'009600'
)
GROUP BY i.ITEM_CODE,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1),RTRIM(dbo.OC_VDATA.UDL6)),
CONCAT(i.ITEM_CODE,RTRIM(dbo.OC_VDATA.UDL1),RTRIM(dbo.OC_VDATA.UDL6)),
CONCAT(RTRIM(dbo.OC_VDATA.UDL1),i.ITEM_CODE,RTRIM(dbo.OC_VDATA.UDL5)),
CONCAT(RTRIM(dbo.OC_VDATA.UDL1),i.ITEM_CODE,RIGHT(PARTNO,6)),
CONCAT(RTRIM(dbo.OC_VDATA.UDL1),i.ITEM_CODE,RIGHT(PARTNO,6),RTRIM(dbo.OC_VDATA.UDL6))
)
SELECT * FROM FilterCTE
Here is the expected output:
Could someone help me find the bottleneck and explain how you spotted it?

Related

Join and aggregate two huge tables efficiently

I have a huge table with over 1 million transaction records and I need to join this table to itself and pull all similar transactions within 52 weeks prior for each transaction and aggregate them for later use in an ML model.
select distinct a.transref,
a.transdate, a.transamount,
a.transtype,
avg (b.transamount)
over (partition by a.transref,a.transdate, a.transamount,a.transtype) as avg_trans_amount
from trans_table a
inner join trans_table b
on a.transtype = b.transtype
and b.transdate >= dateadd(week, -52, a.transdate)
and b.transdate <= a.transdate
and a.transdate between '2021-11-16' and '2022-11-16'
The transaction table looks like this:
+--------+----------+-----------+---------+
|transref|transdate |transamount|transtype|
+--------+----------+-----------+---------+
|xh123rdk|2022-11-16|112.48     |food & Re|
|g8jegf90|2022-11-04|23.79      |Misc     |
|ulpef32p|2022-10-23|83.15      |gasoline |
+--------+----------+-----------+---------+
The expected output should look like this:
+--------+----------+-----------+---------+----------------+
|transref|transdate |transamount|transtype|avg_trans_amount|
+--------+----------+-----------+---------+----------------+
|xh123rdk|2022-11-16|112.48     |food & Re|180.11          |
|g8jegf90|2022-11-04|23.79      |Misc     |43.03           |
|ulpef32p|2022-10-23|83.15      |gasoline |112.62          |
+--------+----------+-----------+---------+----------------+
Since each transaction may pull in over 10,000 records of the same type, the query is very slow and expensive to run, and SQL Server failed to create the output table.
How can I optimize this query to run efficiently within a reasonable time?
Note: After failing to run the query, I ended up creating a stored procedure to split the original table a into smaller chunks, join it to the big table, aggregate the results and append the results to an output table and repeat this until the entire table a was covered. This way I could manage to do the job, however, it was still slow. I expect there are better ways to do it in SQL without all this manual work.
OK, I think I figured out what's causing the query to run too slowly. The trick is to avoid repetitive and unnecessary calculations by grouping first, before doing the join.
with merch as (
-- one row per distinct (transtype, transdate) pair, with its 52-week window
select transtype,
dateadd(week, -52, transdate) as startdate,
transdate as enddate
from trans_table
group by transtype, transdate),
summary as (
-- aggregate once per window instead of once per transaction
select distinct m.transtype,
m.startdate, m.enddate,
avg(t.transamount) over (partition by
m.transtype, m.startdate, m.enddate) as avg_amt,
percentile_cont(0.5) within group (order by t.transamount) over (partition by
m.transtype, m.startdate, m.enddate) as median_amt
from merch as m
inner join trans_table as t
on m.transtype = t.transtype and
t.transdate between m.startdate and
m.enddate)
select t.*, s.avg_amt, s.median_amt
from trans_table t
inner join summary s
on t.transtype = s.transtype
and t.transdate = s.enddate

SQL select distinct from a concatenated column

This query almost does what I want
SELECT stagingPLM.dbo.ITEM_CODES.ITEM_CODE, MAX(dbo.OC_VDAT_AUX.UDL40) AS SAMPLEDATE,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)) as LinkID
FROM dbo.OC_VDATA
INNER JOIN dbo.OC_VDAT_AUX ON dbo.OC_VDATA.PARTNO = dbo.OC_VDAT_AUX.PARTNOAUX AND dbo.OC_VDATA.DATETIME = dbo.OC_VDAT_AUX.DATETIMEAUX
INNER JOIN stagingPLM.dbo.ITEM_CODES ON LEFT(dbo.OC_VDATA.PARTNO, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
AND LEFT(dbo.OC_VDAT_AUX.PARTNOAUX, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
INNER JOIN stagingPLM.dbo.PLANTS ON dbo.OC_VDATA.UDL1 = stagingPLM.dbo.PLANTS.PLANT_CODE
WHERE (CONVERT(DATETIME, dbo.OC_VDAT_AUX.UDL40) > DATEADD(day, - 30, GETDATE()))
GROUP BY CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)),stagingPLM.dbo.ITEM_CODES.ITEM_CODE
Sample Table generated by query:
The end result that I am trying to achieve is the latest ITEM_CODE per unique LinkID. Note the first and last rows in the table: the last row should not be returned by the query.
How do I modify this query to make that happen?
I have tried various placements of DISTINCT and subqueries in the SELECT and WHERE clauses.
In your case I would use the ROW_NUMBER window function inside a CTE.
The solution could look like this:
WITH FilterCTE AS
(
SELECT stagingPLM.dbo.ITEM_CODES.ITEM_CODE, MAX(dbo.OC_VDAT_AUX.UDL40) AS SAMPLEDATE,
CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)) AS LinkID,
-- DESC so that RowNumber = 1 is the most recent sample for each LinkID
ROW_NUMBER() OVER (PARTITION BY CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)) ORDER BY MAX(dbo.OC_VDAT_AUX.UDL40) DESC) AS RowNumber
FROM dbo.OC_VDATA
INNER JOIN dbo.OC_VDAT_AUX ON dbo.OC_VDATA.PARTNO = dbo.OC_VDAT_AUX.PARTNOAUX AND dbo.OC_VDATA.DATETIME = dbo.OC_VDAT_AUX.DATETIMEAUX
INNER JOIN stagingPLM.dbo.ITEM_CODES ON LEFT(dbo.OC_VDATA.PARTNO, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
AND LEFT(dbo.OC_VDAT_AUX.PARTNOAUX, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
INNER JOIN stagingPLM.dbo.PLANTS ON dbo.OC_VDATA.UDL1 = stagingPLM.dbo.PLANTS.PLANT_CODE
WHERE (CONVERT(DATETIME, dbo.OC_VDAT_AUX.UDL40) > DATEADD(day, - 30, GETDATE()))
GROUP BY CONCAT(RTRIM(dbo.OC_VDATA.UDL1), RTRIM(dbo.OC_VDATA.UDL6)),stagingPLM.dbo.ITEM_CODES.ITEM_CODE
)
SELECT *
FROM FilterCTE
WHERE RowNumber = 1

CTE doesn't work in SSAS Cube. I want to find a solution or convert it to a subquery

I wrote this CTE query; the explanation follows it:
WITH TP AS
(select
c.ID, c.PeriodCId, c.PeriodName, c.Status, c.StatusChangeDate, CAST(c.StartDate AS DATE) AS StartDate, c.EndDate,c.PeriodCode,
c.PeriodType, c.ParentCId, c.MarketId, c.ParentId, c.WD, LEFT(CONVERT(varchar, c.StartDate, 112), 6) AS YEARMONTH,
(select count(*) from dTimePeriod c2 where c2.ParentId = c.ID and c2.Status='actv') as #children
from dTimePeriod c
where (MarketId = 7) ),
TP2 AS
( SELECT *
FROM TP
WHERE #children='12' ),
TP3 AS
(SELECT TP.*, CASE WHEN (TP.WD IS NOT NULL) AND (TP.StartDate <= getdate()) AND TP2.ID=TP.ParentId THEN 18 ELSE NULL END AS WorkingDays
FROM TP LEFT JOIN TP2 ON TP2.ID=TP.ParentId)
select * from TP3
order by ID
and this is the result
I have a recursive table called [dTimePeriod]. It contains different cycles, and each cycle contains a different number of periods; for example, one cycle has 8 periods, another has 12, and so on. If a cycle contains 12 periods, I want each of its periods to get the value 18; the periods of all other cycles should get NULL.
There are some other conditions, but they are not the issue.
When I put the query in the SSAS cube it doesn't work, because the cube doesn't understand the CTE. I tried several workarounds, but none of them worked.
One of them was to put the CTE in a view and call that view from the cube, but the view doesn't work either.
So I started rewriting it as a subquery that the cube can understand,
but I am stuck: I can't express this CTE as a subquery.
This is the subquery where I got stuck:
SELECT c.ID, c.PeriodCId, c.PeriodName, c.Status, c.StatusChangeDate, CAST(c.StartDate AS DATE) AS StartDate, c.EndDate,
c.PeriodCode, c.PeriodType, c.ParentCId, c.MarketId, c.ParentId, c.WD,
LEFT(CONVERT(varchar, c.StartDate, 112), 6) AS YEARMONTH,
CASE WHEN (c.WD IS NOT NULL) AND (c.StartDate <= getdate()) THEN 18
WHEN (c.WD IS NOT NULL) AND (c.StartDate > getdate()) THEN NULL ELSE c.WD END AS WorkingDays,
case when (select sub.* from
(select count(*) as children from dTimePeriod c2 where c2.ParentId = c.ID and c2.Status='actv' ) sub ) = 12
then 18 else null end as WWW
FROM dTimePeriod c
WHERE (c.MarketId = 7)
Instead of using the SQL command directly in the data source for your cube, you can turn this query into a stored procedure and use that result set in the cube. For the SQL statement in the cube, just do an EXEC command as below. The database name isn't necessary if this database is already the initial catalog in the connection string of the data source.
EXEC YourDatabase.YourSchema.YourSP
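As a rough sketch of what that stored procedure might look like, the CTE from the question can be wrapped more or less unchanged. The procedure name dbo.usp_TimePeriodWorkingDays is hypothetical, the column list is trimmed for brevity, and the #children alias is renamed to ChildCount here purely for readability; none of this comes from the original answer.

```sql
-- Hypothetical procedure name; adjust the schema and name to your environment.
CREATE PROCEDURE dbo.usp_TimePeriodWorkingDays
AS
BEGIN
    SET NOCOUNT ON;

    WITH TP AS
    (
        SELECT c.ID, c.ParentId, c.WD,
               CAST(c.StartDate AS DATE) AS StartDate,
               -- number of active child periods under this row
               (SELECT COUNT(*)
                FROM dTimePeriod c2
                WHERE c2.ParentId = c.ID AND c2.Status = 'actv') AS ChildCount
        FROM dTimePeriod c
        WHERE c.MarketId = 7
    ),
    TP2 AS
    (
        -- cycles that have exactly 12 active periods
        SELECT ID FROM TP WHERE ChildCount = 12
    )
    SELECT TP.*,
           CASE WHEN TP.WD IS NOT NULL
                 AND TP.StartDate <= GETDATE()
                 AND TP2.ID IS NOT NULL   -- the parent cycle is in TP2
                THEN 18 ELSE NULL END AS WorkingDays
    FROM TP
    LEFT JOIN TP2 ON TP2.ID = TP.ParentId
    ORDER BY TP.ID;
END
```

With this in place, the cube's data source query would be EXEC dbo.usp_TimePeriodWorkingDays, following the pattern shown above.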

Using a date field for matching SQL Query

I'm having a bit of an issue wrapping my head around the logic of this changing dimension. I would like to associate these two tables below. I need to match the Cost - Period fact table to the cost dimension based on the Id and the effective date.
As you can see - if the month and year field is greater than the effective date of its associated Cost dimension, it should adopt that value. Once a new Effective Date is entered into the dimension, it should use that value for any period greater than said date going forward.
EDIT: I apologize for the lack of detail but the Cost Dimension will actually have a unique Index value and the changing fields to reference for the matching would be Resource, Project, Cost. I tried to match the query you provided with my fields, but I'm getting the incorrect output.
FYI: Naming convention change: EngagementId is Id, Resource is ConsultantId, and Project is ProjectId
I've changed the images below and here is my query
,_cte(HoursWorked, HoursBilled, Month, Year, EngagementId, ConsultantId, ConsultantName, ProjectId, ProjectName, ProjectRetainer, RoleId, Role, Rate, ConsultantRetainer, Salary, amount, EffectiveDate)
as
(
select sum(t.Duration), 0, Month(t.StartDate), Year(t.StartDate), t.EngagementId, c.ConsultantId, c.ConsultantName, c.ProjectId, c.ProjectName, c.ProjectRetainer, c.RoleId, c.Role, c.Rate, c.ConsultantRetainer,
c.Salary, 0, c.EffectiveDate
from timesheet t
left join Engagement c on t.EngagementId = c.EngagementId and Month(c.EffectiveDate) = Month(t.EndDate) and Year(c.EffectiveDate) = Year(t.EndDate)
group by Month(t.StartDate), Year(t.StartDate), t.EngagementId, c.ConsultantName, c.ConsultantId, c.ProjectId, c.ProjectName, c.ProjectRetainer, c.RoleId, c.Role, c.Rate, c.ConsultantRetainer,
c.Salary, c.EffectiveDate
)
select * from _cte where EffectiveDate is not null
union
select _cte.HoursWorked, _cte.HoursBilled, _cte.Month, _cte.Year, _cte.EngagementId, _cte.ConsultantId, _cte.ConsultantName, _cte.ProjectId, _Cte.ProjectName, _cte.ProjectRetainer, _cte.RoleId, _cte.Role, sub.Rate, _cte.ConsultantRetainer,_cte.Salary, _cte.amount, sub.EffectiveDate
from _cte
outer apply (
select top 1 EffectiveDate, Rate
from Engagement e
where e.ConsultantId = _cte.ConsultantId and e.ProjectId = _cte.ProjectId and e.RoleId = _cte.RoleId
and Month(e.EffectiveDate) < _cte.Month and Year(e.EffectiveDate) < _cte.Year
order by EffectiveDate desc
) sub
where _cte.EffectiveDate is null
Example:
I'm struggling with writing the query that goes along with this. At first I attempted to partition by greatest date. However, when I executed the join I got the highest effective date for every single period (even those prior to the effective date).
Is this something that can be accomplished in a query or should I be focusing on incremental updates of the destination table so that any effective date / time period in the past is left alone?
Any tips would be great!
Thanks,
Channing
Try this one:
; with _CTE as(
select p.* , c.EffectiveDate, c.Cost
from period p
left join CostDimension c on p.id = c.id and p.Month = DATEPART(month, c.EffectiveDate) and p.year = DATEPART(year, c.EffectiveDate)
)
select * from _CTE Where EffectiveDate is not null
Union
select _CTE.id, _CTE.Month, _CTE.Year, sub.EffectiveDate, sub.Cost
from _CTE
outer apply (select top 1 EffectiveDate, Cost
from CostDimension as cd
where cd.Id = _CTE.id and cd.EffectiveDate < DATETIMEFROMPARTS(_CTE.Year, _CTE.Month, 1, 0, 0, 0, 0)
order by EffectiveDate desc
) sub
where _Cte.EffectiveDate is null
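A possible single-query variant of the same idea is sketched below. It assumes the Period and CostDimension tables and columns used in the answer above, and at most one effective date per Id per month; under that assumption it should return the same cost for each period as the UNION version. It is an alternative sketch, not part of the original answer.

```sql
-- For each period, pick the most recent cost whose effective date falls
-- on or before the end of that period's month.
SELECT p.Id, p.[Month], p.[Year], c.EffectiveDate, c.Cost
FROM Period AS p
OUTER APPLY (
    SELECT TOP (1) cd.EffectiveDate, cd.Cost
    FROM CostDimension AS cd
    WHERE cd.Id = p.Id
      -- strictly before the first day of the following month
      AND cd.EffectiveDate < DATEADD(month, 1, DATEFROMPARTS(p.[Year], p.[Month], 1))
    ORDER BY cd.EffectiveDate DESC
) AS c;
```

Periods that precede every effective date come back with NULL in EffectiveDate and Cost, because OUTER APPLY keeps the outer row when the inner query returns nothing.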

Group By column throwing off query

I have a query that checks a database to see whether a customer has visited multiple times in a day. If they have, it counts the number of visits and then tells me what times they visited. The problem is that it needs "Tickets.lCustomerID" in the GROUP BY clause, which causes me to miss 5 records (customers without barcodes). How can I change the query below to remove "Tickets.lCustomerID" from the GROUP BY clause? If I remove it, I get an error telling me "Tickets.lCustomerID" is not valid in the select list because it is not part of an aggregate or the GROUP BY clause.
The Query that works:
SELECT Customers.sBarcode, CAST(FLOOR(CAST(Tickets.dtCreated AS FLOAT)) AS DATETIME) AS dtCreatedDate, COUNT(Customers.sBarcode) AS [Number of Scans],
MAX(Customers.sLastName) AS LastName
FROM Tickets INNER JOIN
Customers ON Tickets.lCustomerID = Customers.lCustomerID
WHERE (Tickets.dtCreated BETWEEN #startdate AND #enddate) AND (Tickets.dblTotal <= 0)
GROUP BY Customers.sBarcode, CAST(FLOOR(CAST(Tickets.dtCreated AS FLOAT)) AS DATETIME)
HAVING (COUNT(*) > 1)
ORDER BY dtCreatedDate
The Output is:
sBarcode  dtCreatedDate         Number of Scans  slastname
1234      1/4/2013 12:00:00 AM  2                Jimbo
          1/5/2013 12:00:00 AM  3                Jimbo2
1578      1/6/2013 12:00:00 AM  3                Jimbo3
My current Query with the subquery
SELECT customers.sbarcode,
Max(customers.slastname) AS LastName,
Cast(Floor(Cast(tickets.dtcreated AS FLOAT)) AS DATETIME) AS
dtCreatedDate,
Count(customers.sbarcode) AS
[Number of Scans],
Stuff ((SELECT ', '
+ RIGHT(CONVERT(VARCHAR, dtcreated, 100), 7) AS [text()]
FROM tickets AS sub
WHERE ( lcustomerid = tickets.lcustomerid )
AND ( dtcreated BETWEEN Cast(Floor(Cast(tickets.dtcreated
AS
FLOAT)) AS
DATETIME
)
AND
Cast(Floor(Cast(tickets.dtcreated
AS FLOAT
)) AS
DATETIME
)
+ '23:59:59' )
AND ( dbltotal <= '0' )
FOR xml path('')), 1, 1, '') AS [Times Scanned]
FROM tickets
INNER JOIN customers
ON tickets.lcustomerid = customers.lcustomerid
WHERE ( tickets.dtcreated BETWEEN #startdate AND #enddate )
AND ( tickets.dbltotal <= 0 )
GROUP BY customers.sbarcode,
Cast(Floor(Cast(tickets.dtcreated AS FLOAT)) AS DATETIME),
tickets.lcustomerid
HAVING ( Count(*) > 1 )
ORDER BY dtcreateddate
The Current output (notice the record without a barcode is missing) is:
sBarcode  dtCreatedDate         Number of Scans  slastname  Times Scanned
1234      1/4/2013 12:00:00 AM  2                Jimbo      12:00PM, 1:00PM
1578      1/6/2013 12:00:00 AM  3                Jimbo3     03:05PM, 1:34PM
UPDATE: Based on our "chat", it seems that barcode, not customer id, is the field that uniquely identifies a customer, even though customer id is the primary key.
Therefore, to avoid grouping by customer id, join a second instance of the customers table inside the subquery so that the subquery correlates on barcode instead.
Try this:
SELECT customers.sbarcode,
       MAX(customers.slastname) AS LastName,
       CAST(FLOOR(CAST(tickets.dtcreated AS FLOAT)) AS DATETIME) AS dtCreatedDate,
       COUNT(customers.sbarcode) AS [Number of Scans],
       -- Build the comma-separated list of scan times for this barcode and day.
       -- STUFF(..., 1, 2, '') strips the leading ', ' separator.
       STUFF((SELECT ', ' + RIGHT(CONVERT(VARCHAR, subticket.dtcreated, 100), 7) AS [text()]
              FROM tickets AS subticket
              INNER JOIN customers AS subcustomers
                  ON subcustomers.lcustomerid = subticket.lcustomerid
              -- correlate on barcode, not on customer id
              WHERE subcustomers.sbarcode = customers.sbarcode
                AND subticket.dtcreated BETWEEN CAST(FLOOR(CAST(tickets.dtcreated AS FLOAT)) AS DATETIME)
                                            AND CAST(FLOOR(CAST(tickets.dtcreated AS FLOAT)) AS DATETIME) + '23:59:59'
                AND subticket.dbltotal <= 0
              FOR XML PATH('')), 1, 2, '') AS [Times Scanned]
FROM tickets
INNER JOIN customers
    ON tickets.lcustomerid = customers.lcustomerid
WHERE tickets.dtcreated BETWEEN #startdate AND #enddate
  AND tickets.dbltotal <= 0
GROUP BY customers.sbarcode,
         CAST(FLOOR(CAST(tickets.dtcreated AS FLOAT)) AS DATETIME)
HAVING COUNT(*) > 1
ORDER BY dtCreatedDate
I can't directly solve your problem because I don't understand your data model or what you are trying to accomplish with this query. However, I can give you some advice on how to solve the problem yourself.
First, do you understand exactly what you are trying to accomplish and how the tables fit together? If so, move on to the next step. If not, get this knowledge first; you cannot write complex queries without it.
Next, break what you are trying to accomplish into small steps and make sure each one works before moving on to the rest. In your case you seem to be missing some customers. Start with a new query (I'm pretty sure this one has more than one problem), beginning with the join and the where clauses.
I suspect you may need to start with customers and left join to tickets (which would move the where conditions into the left join, since they are on tickets). This will get you all the customers whether they have tickets or not. If that isn't what you want, then work with the join and the where clauses (and use select * while you are figuring things out) until you are returning the exact set of customer records you need. The reason to use select * at this stage is to see what in the data may be causing the problem you are having. That may tell you how to fix it.
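A minimal sketch of that starting point, using the Customers and Tickets tables from the question, might look like the following. The variable names, the sample dates, and the half-open date range are assumptions for illustration and not part of the original query.

```sql
-- Hypothetical date range for illustration only.
DECLARE @StartDate DATETIME = '20130101',
        @EndDate   DATETIME = '20130201';

SELECT c.lCustomerID, c.sBarcode, c.sLastName, t.dtCreated, t.dblTotal
FROM Customers AS c
LEFT JOIN Tickets AS t
    ON t.lCustomerID = c.lCustomerID
   -- The ticket filters live in the ON clause. In the WHERE clause they
   -- would discard customers with no matching ticket, which turns the
   -- LEFT JOIN back into an inner join.
   AND t.dblTotal <= 0
   AND t.dtCreated >= @StartDate
   AND t.dtCreated <  @EndDate
ORDER BY c.lCustomerID, t.dtCreated;
```

Customers with no qualifying tickets appear once with NULL in the ticket columns, which makes it easy to see who would be dropped by an inner join.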
Usually I start with the join and then add the where clauses one at a time until I know I am getting the right initial set of records. If you have multiple joins, add them one at a time so you know when you suddenly start getting more or fewer records than you would expect.
Then go into the more complex parts. Add each one in turn and check the results. If you suddenly go from 10 records to 5 or 15, then you have probably hit a problem. When you work one step at a time and run into a problem, you know exactly what caused it, which makes it much easier to find and fix.
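The one-step-at-a-time approach could be sketched as a series of row counts against the Tickets and Customers tables from the question. The particular checkpoints below are illustrative assumptions; the point is to compare the count after each added join or filter.

```sql
-- Step 1: baseline, tickets only.
SELECT COUNT(*) AS TicketRows
FROM Tickets;

-- Step 2: add the join. A drop here means some tickets have no matching customer.
SELECT COUNT(*) AS RowsAfterJoin
FROM Tickets AS t
INNER JOIN Customers AS c
    ON t.lCustomerID = c.lCustomerID;

-- Step 3: add one filter at a time and compare with the previous count.
SELECT COUNT(*) AS RowsAfterTotalFilter
FROM Tickets AS t
INNER JOIN Customers AS c
    ON t.lCustomerID = c.lCustomerID
WHERE t.dblTotal <= 0;
```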
GROUP BY is important to understand thoroughly. Every non-aggregated field in the select list must appear in the GROUP BY or the query will not work. Think of this as a law, like the law of gravity: it is not something you can change. However, you can work around it with derived tables or CTEs. Please read up on those if you don't know what they are; they are very useful techniques for complex queries and you should understand them thoroughly. I suspect you will need the derived table approach here: group on only the things you need, then join that derived table to the rest of the query to get the other fields. I'll show a simple example:
SELECT
    t1.table1id
  , t1.field1
  , t1.field2
  , a.field3
  , a.MostRecentDate
FROM table1 t1
JOIN
    -- derived table: group only on the columns the aggregate needs
    (SELECT t1.table1id, t2.field3, MAX(datefield) AS MostRecentDate
     FROM table1 t1
     JOIN table2 t2 ON t1.table1id = t2.table1id
     WHERE t2.field4 = 'test'
     GROUP BY t1.table1id, t2.field3) a
    ON a.table1id = t1.table1id
Hope this approach helps you solve this problem.