Query performance: CTE using ROW_NUMBER() to select first row - sql

We have three environments. In two of them my SQL query takes 30 to 38 seconds to run, but in the third it never completes and I have to cancel it. The query has two parts, a CTE and a very simple SELECT, and both read from the same table.
Could you please tell me why it takes so long, and how I can improve the query?
ALTER VIEW [fact].[vPurchase]
AS
WITH VKPL AS
(
    SELECT *
    FROM
        (SELECT
             iv.[Delivery_FK],
             1 AS column2,
             ROW_NUMBER() OVER(PARTITION BY [Delivery_FK] ORDER BY iv.UpdateDate) AS rk
         FROM
             [fact].[KRMFact] iv
         LEFT JOIN
             [dimension].[Product] pr ON iv.Product_FK = pr.Product_SK
         LEFT JOIN
             [dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
         WHERE
             pr.Product_Key = '740') X
    WHERE
        rk = 1
)
SELECT
    -- .... here are some columns
    Delivery_FK,
    Product_FK,
    CAST(column2 AS VARCHAR) AS column2,
    f.[UpdateDate] AS [Update date]
FROM
    [fact].[KRMFact] f
LEFT JOIN
    VKPL v ON f.Delivery_FK = v.Delivery_FK

This is guesswork.
I guess the environment where this query is slow is the one with lots of production data in it.
I guess some index on your KRMFact table will, maybe, help you. Here's how to figure out what you need: SQL Server Management Studio (SSMS) has a feature to show you a query's execution plan. Put your query (not simplified, please, the actual query) into SSMS, right click and choose "Include Actual Execution Plan." Then run the query. The execution plan display may recommend an index for you to create to get this query to run faster.
I guess you're trying to find rows with the earliest values of UpdateDate.
Your subquery
SELECT *
FROM
    (SELECT
         iv.[Delivery_FK],
         1 AS column2,
         ROW_NUMBER() OVER(PARTITION BY [Delivery_FK] ORDER BY iv.UpdateDate) AS rk
     FROM
         [fact].[KRMFact] iv
     LEFT JOIN
         [dimension].[Product] pr ON iv.Product_FK = pr.Product_SK
     LEFT JOIN
         [dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
     WHERE
         pr.Product_Key = '740') X
WHERE
    rk = 1
looks like it picks out the row with the earliest KRMFact.UpdateDate for each value of KRMFact.Delivery_FK. That's what the ROW_NUMBER() OVER... WHERE rk=1 language does.
If my guess about that is correct you can do that a different way, which may be more efficient.
SELECT *
FROM
    (SELECT
         iv.[Delivery_FK],
         1 AS column2,
         1 AS rk
     FROM
         [fact].[KRMFact] iv
     -- earliest UpdateDate per delivery; match on BOTH the delivery and the date
     JOIN ( SELECT Delivery_FK, MIN(UpdateDate) first_update
            FROM [fact].[KRMFact]
            GROUP BY Delivery_FK
          ) first_update ON iv.Delivery_FK = first_update.Delivery_FK
                        AND iv.UpdateDate = first_update.first_update
     LEFT JOIN
         [dimension].[Product] pr ON iv.Product_FK = pr.Product_SK
     LEFT JOIN
         [dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
     WHERE
         pr.Product_Key = '740') X
WHERE
    rk = 1
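You should try out the old and new versions of the subquery to determine whether they yield the same results. They can differ: if two rows for one Delivery_FK share the earliest UpdateDate, a MIN() join returns both, while ROW_NUMBER() keeps only one. Also, the MIN() here is computed over all products, whereas ROW_NUMBER() in the original numbers only the rows that survive the Product_Key = '740' filter.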
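One way to compare the two approaches without touching production data is a small in-memory test. This is only a sketch: it uses Python's sqlite3 module (which needs a bundled SQLite new enough to support window functions), and the krm_fact table with its three columns is a hypothetical stand-in for KRMFact, not the real schema. The MIN() version here joins on both the delivery key and the date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE krm_fact (delivery_fk INTEGER, product_fk INTEGER, update_date TEXT);
INSERT INTO krm_fact VALUES
  (1, 10, '2020-01-03'),
  (1, 10, '2020-01-01'),
  (2, 10, '2020-02-05'),
  (2, 10, '2020-02-02'),
  (3, 10, '2020-03-01');
""")

# Original style: number the rows per delivery, keep the first one.
row_number_sql = """
SELECT delivery_fk, update_date
FROM (SELECT delivery_fk, update_date,
             ROW_NUMBER() OVER (PARTITION BY delivery_fk
                                ORDER BY update_date) AS rk
      FROM krm_fact) x
WHERE rk = 1
ORDER BY delivery_fk
"""

# Alternative: find the earliest date per delivery, then join back on
# both the delivery key and that date.
min_join_sql = """
SELECT iv.delivery_fk, iv.update_date
FROM krm_fact iv
JOIN (SELECT delivery_fk, MIN(update_date) AS first_update
      FROM krm_fact
      GROUP BY delivery_fk) fu
  ON iv.delivery_fk = fu.delivery_fk
 AND iv.update_date = fu.first_update
ORDER BY iv.delivery_fk
"""

via_row_number = conn.execute(row_number_sql).fetchall()
via_min_join = conn.execute(min_join_sql).fetchall()
print(via_row_number == via_min_join)
```

On this sample data (no tied dates) both queries pick the same row per delivery. It says nothing about speed; for that, compare the actual execution plans.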
If you use the subquery I suggest, this index will help make it run faster by optimizing the new sub-subquery's MIN() ... GROUP BY operation. The GROUP BY column goes first in the index, followed by the column being minimized.
CREATE INDEX x_KRMFact_Delivery_Update
    ON [fact].[KRMFact]
    ([Delivery_FK],[UpdateDate])
By the way, WHERE pr.Product_Key = '740' turns your LEFT JOIN [dimension].[Product] operation into an ordinary inner JOIN.
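To see why that happens, here is a tiny sketch using Python's sqlite3 module. The fact and product tables are made-up stand-ins, not your schema. A row with no matching product gets NULL for product_key, and NULL = '740' is not true, so the WHERE clause throws that row away just as an inner join would. Moving the condition into the ON clause keeps the unmatched rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (id INTEGER, product_fk INTEGER);
CREATE TABLE product (product_sk INTEGER, product_key TEXT);
INSERT INTO fact VALUES (1, 10), (2, 20), (3, NULL);
INSERT INTO product VALUES (10, '740'), (20, '999');
""")

# LEFT JOIN, but the WHERE clause filters on the right-hand table.
left_where = conn.execute("""
SELECT f.id FROM fact f
LEFT JOIN product p ON f.product_fk = p.product_sk
WHERE p.product_key = '740'
ORDER BY f.id
""").fetchall()

# Plain inner join with the same filter.
inner = conn.execute("""
SELECT f.id FROM fact f
JOIN product p ON f.product_fk = p.product_sk
WHERE p.product_key = '740'
ORDER BY f.id
""").fetchall()

# Filter moved into ON: every fact row survives, unmatched ones get NULL.
left_on = conn.execute("""
SELECT f.id, p.product_key FROM fact f
LEFT JOIN product p ON f.product_fk = p.product_sk
                   AND p.product_key = '740'
ORDER BY f.id
""").fetchall()

print(left_where, inner, left_on)
```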

Related

How to convert inline SQL queries to JOINS in SQL SERVER to reduce load time

I need help in optimizing this SQL query.
In the main SELECT statement there are three columns that depend on the outer query result. This is why my query is taking a long time to return data. I have tried using left joins, but that is not working properly.
Can anyone help me to resolve this issue?
SELECT
DISTINCT ou.OrganizationUserID AS StudentID,
ou.FirstName,
ou.LastName,
(
SELECT
STRING_AGG(
(ug.UG_Name),
','
)
FROM
Groups ug
INNER JOIN ApplicantUserGroup augm ON augm.AUGM_UserGroupID = ug.UG_ID
WHERE
augm.AUGM_OrganizationUserID = ou.OrganizationUserID
AND ug.UG_IsDeleted = 0
AND augm.AUGM_IsDeleted = 0
) AS UserGroups,
order1.OrderNumber AS OrderId -- UAT-2455
,
(
SELECT
STRING_AGG(
(CActe.CustomAttribute),
','
)
FROM
CustomAttributeCte CActe
WHERE
CActe.HierarchyNodeID = dpm.DPM_ID
AND CActe.OrganizationUserID = ps.OrganizationUserID
) AS CustomAttributes -- UAT-2455
,
(
SELECT
STRING_AGG(
(CActe.CustomAttributeID),
','
)
FROM
CustomAttributeCte CActe
WHERE
CActe.HierarchyNodeID = dpm.DPM_ID
AND CActe.OrganizationUserID = ps.OrganizationUserID
) AS CustomAttributeID
FROM
ApplicantData acd WITH (NOLOCK)
INNER JOIN ClientPackage ps WITH (NOLOCK) ON acd.ClientSubscriptionID = ps.ClientSubscriptionID
INNER JOIN [ClientOrder] order1 WITH (NOLOCK) ON order1.OrderID = ps.OrderID
AND order1.IsDeleted = 0
INNER JOIN OUser ou WITH (NOLOCK) ON ou.OrganizationUserID = ps.OrganizationUserID
It looks like this query can be simplified, and the dependent subqueries in your SELECT clause removed. Consider your second and third dependent subqueries. You can refactor them into one nondependent subquery with a LEFT JOIN. Using nondependent subqueries is more efficient because the query planner can run them just once, rather than once for each row.
You want two STRING_AGG() results from the same table. This subquery gives those two outputs for every possible combination of HierarchyNodeID and OrganizationUserID values. STRING_AGG() is an aggregate function like SUM() and so works nicely with GROUP BY.
SELECT HierarchyNodeID, OrganizationUserID,
       STRING_AGG((CActe.CustomAttribute), ',') CustomAttributes, -- UAT-2455
       STRING_AGG((CActe.CustomAttributeID), ',') CustomAttributeIDs -- UAT-2455
FROM CustomAttributeCte CActe
GROUP BY HierarchyNodeID, OrganizationUserID
You can run this subquery itself to convince yourself it works.
Now, we can LEFT JOIN that into your query. Like this. (For readability I took out the NOLOCKs and used JOIN: it means the same thing as INNER JOIN.)
SELECT DISTINCT
ou.OrganizationUserID AS StudentID,
ou.FirstName,
ou.LastName,
'tempvalue' AS UserGroups, -- shortened for testing
order1.OrderNumber AS OrderId, -- UAT-2455
uat2455.CustomAttributes, -- UAT-2455
uat2455.CustomAttributeIDs -- UAT-2455
FROM ApplicantData acd
JOIN ClientPackage ps
ON acd.ClientSubscriptionID = ps.ClientSubscriptionID
JOIN ClientOrder order1
ON order1.OrderID = ps.OrderID
AND order1.IsDeleted = 0
JOIN OUser ou
ON ou.OrganizationUserID = ps.OrganizationUserID
LEFT JOIN (
SELECT HierarchyNodeID, OrganizationUserID,
       STRING_AGG((CActe.CustomAttribute), ',') CustomAttributes, -- UAT-2455
       STRING_AGG((CActe.CustomAttributeID), ',') CustomAttributeIDs -- UAT-2455
FROM CustomAttributeCte CActe
GROUP BY HierarchyNodeID, OrganizationUserID
) uat2455
ON uat2455.HierarchyNodeID = dpm.DPM_ID
AND uat2455.OrganizationUserId = ps.OrganizationUserID
See how we collapsed your second and third dependent subqueries to just one, then used it as a virtual table with LEFT JOIN? We transformed the WHERE clauses from the dependent subqueries into an ON clause.
You can test this: run it with TOP(50) and eyeball the results.
When you're happy, the next step is to transform your first dependent subquery the same way.
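Here is the same transformation in miniature, as a sketch you can run anywhere with Python's sqlite3 module. The users and attrs tables are invented for illustration, and SQLite's group_concat() stands in for STRING_AGG(). It checks that the dependent-subquery form and the aggregate-then-LEFT-JOIN form return the same lists, including NULL for a user with no attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attrs (user_id INTEGER, attr TEXT);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO attrs VALUES (1, 'red'), (1, 'blue'), (2, 'green');
""")

# Dependent (correlated) subquery in the SELECT list.
correlated = conn.execute("""
SELECT u.user_id,
       (SELECT group_concat(a.attr, ',')
        FROM attrs a
        WHERE a.user_id = u.user_id)
FROM users u
ORDER BY u.user_id
""").fetchall()

# Aggregate once per user, then LEFT JOIN the result.
joined = conn.execute("""
SELECT u.user_id, g.attr_list
FROM users u
LEFT JOIN (SELECT user_id, group_concat(attr, ',') AS attr_list
           FROM attrs
           GROUP BY user_id) g ON g.user_id = u.user_id
ORDER BY u.user_id
""").fetchall()

def normalize(rows):
    # group_concat makes no promise about element order,
    # so compare each list after sorting its elements.
    return [(uid, sorted(s.split(",")) if s else None) for uid, s in rows]

same = normalize(correlated) == normalize(joined)
print(same)
```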
Pro tip Don't use WITH (NOLOCK), ever, unless a database administration expert tells you to after looking at your specific query. If your query's purpose is a historical report and you don't care whether the most recent transactions in your database are represented exactly right, you can precede your query with this statement. It also allows the query to run while avoiding locks.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Pro tip Be obsessive about formatting your queries for readability. You, your colleagues, and yourself a year from now must be able to read and reason about queries like this.

SQL not efficient enough, tuning assistance required

We have some SQL that is ok on smaller data volumes but poor once we scale up to selecting from larger volumes. Is there a faster alternative style to achieve the same output as below? The idea is to pull back a single unique row to get latest version of the data... The SQL does reference another view but this view runs very fast - so we expect the issue is here below and want to try a different approach
SELECT *
FROM
(SELECT (select CustomerId from PremiseProviderVersionsToday
where PremiseProviderId = b.PremiseProviderId) as CustomerId,
c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
ROW_NUMBER() OVER (PARTITION BY b.PremiseProviderId
ORDER BY a.effectiveDate DESC) AS rowNumber
FROM PremiseMeterProviderVersions a, PremiseProviders b,
PremiseMeterProviders c
WHERE (a.TransactionDateTimeEnd IS NULL
AND a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND b.PremiseProviderId = c.PremiseProviderId)
) data
WHERE data.rowNumber = 1
As Bilal Ayub stated above, the correlated subquery can result in performance issues. See here for more detail. Below are my suggestions:
Change all to explicit joins (ANSI standard)
Use aliases that are more descriptive than single characters (this is mostly to help readers understand what each table does)
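Convert the data subquery to a temp table or CTE (a temp table is materialized once and can perform better on large inputs; in SQL Server a CTE is usually planned like the equivalent subquery, so its benefit is mainly readability)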
Note: normally, you should explicitly create and insert into your temp table but I chose not to do that here as I do not know the data types of your columns.
SELECT d.CustomerId
, c.D3001_MeterId
, b.CoreSPID
, a.EnteredBy
, rowNumber = ROW_NUMBER() OVER(PARTITION BY b.PremiseProviderId ORDER BY a.effectiveDate DESC)
INTO #tmp_RowNum
FROM PremiseMeterProviderVersions a
JOIN PremiseMeterProviders c ON c.PremiseMeterProviderId = a.PremiseMeterProviderId
JOIN PremiseProviders b ON b.PremiseProviderId = c.PremiseProviderId
JOIN PremiseProviderVersionsToday d ON d.PremiseProviderId = b.PremiseProviderId
WHERE a.TransactionDateTimeEnd IS NULL
SELECT *
FROM #tmp_RowNum
WHERE rowNumber = 1
You are running a correlated subquery, which can be evaluated once per row of the outer query, like a loop. That is fast when the table is small but gets slower as it grows. I would suggest changing it to a join on the table in order to get CustomerId.
(select CustomerId from PremiseProviderVersionsToday where PremiseProviderId = b.PremiseProviderId) as CustomerId
Consider derived tables: an aggregate query that calculates the maximum EffectiveDate by PremiseProviderId, and a unit-level query, each using explicit joins (the current ANSI SQL standard) rather than the implicit joins you currently use:
SELECT data.*
FROM
(SELECT t.CustomerId, c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
b.PremiseProviderId, a.EffectiveDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
INNER JOIN PremiseProviderVersionsToday t
ON t.PremiseProviderId = b.PremiseProviderId
) data
INNER JOIN
(SELECT b.PremiseProviderId, MAX(a.EffectiveDate) As MaxEffDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
GROUP BY b.PremiseProviderId
) agg
ON data.PremiseProviderId = agg.PremiseProviderId
AND data.EffectiveDate = agg.MaxEffDate
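One caveat when swapping ROW_NUMBER() for a MAX() join: ties. The sketch below uses Python's sqlite3 module and a made-up versions table (not your real schema) in which one provider has two rows sharing the latest date. ROW_NUMBER() still returns exactly one row per provider, while the MAX() join returns every row that carries the maximum date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE versions (provider_id INTEGER, effective_date TEXT, entered_by TEXT);
INSERT INTO versions VALUES
  (1, '2021-01-01', 'old'),
  (1, '2021-06-01', 'tie_a'),
  (1, '2021-06-01', 'tie_b'),
  (2, '2021-03-01', 'only');
""")

# One row per provider, whichever of the tied rows sorts first.
via_row_number = conn.execute("""
SELECT provider_id, entered_by
FROM (SELECT provider_id, entered_by,
             ROW_NUMBER() OVER (PARTITION BY provider_id
                                ORDER BY effective_date DESC) AS rn
      FROM versions) d
WHERE rn = 1
""").fetchall()

# Every row whose date equals the provider's maximum date.
via_max_join = conn.execute("""
SELECT v.provider_id, v.entered_by
FROM versions v
JOIN (SELECT provider_id, MAX(effective_date) AS max_date
      FROM versions
      GROUP BY provider_id) agg
  ON v.provider_id = agg.provider_id
 AND v.effective_date = agg.max_date
""").fetchall()

print(len(via_row_number), len(via_max_join))
```

If ties are possible in your data and you need a single row, add a tiebreaker column to the ORDER BY or keep the ROW_NUMBER() form.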

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
And I'm trying to refactor and improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For some reason it doesn't quite work as expected (it returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
     generator_rows gr
     ON g.id = gr.generator_id
WHERE gr.generator_id IN (SELECT g.id
                          FROM generators g
                          WHERE g.client_id = 5212 AND
                                g.state = 'enabled'
                         ) AND
      (g.single_use = 'f' OR
       g.single_use IS NULL OR
       gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
      )
GROUP BY gr.generator_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
     generator_rows gr
     ON g.id = gr.generator_id JOIN
     generators gg
     ON g.id = gg.id AND
        gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
     run_generator_rows rgr
     ON gr.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
      g.single_use IS NULL OR
      rgr.generator_row_id IS NULL
GROUP BY gr.generator_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
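One more thing worth checking when replacing NOT IN: how NULLs behave. The sketch below uses Python's sqlite3 module with cut-down versions of the two tables. The NULL in run_generator_rows.generator_row_id is an assumption added purely for illustration; I don't know whether your column allows NULLs. With a NULL in the subquery result, NOT IN matches nothing, while NOT EXISTS and the LEFT JOIN ... IS NULL form both return the rows that have no run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE generator_rows (id INTEGER PRIMARY KEY);
CREATE TABLE run_generator_rows (generator_row_id INTEGER);
INSERT INTO generator_rows VALUES (1), (2), (3);
INSERT INTO run_generator_rows VALUES (2), (NULL);  -- hypothetical NULL
""")

# NOT IN: "id NOT IN (2, NULL)" is never true, so no rows come back.
not_in = conn.execute("""
SELECT gr.id FROM generator_rows gr
WHERE gr.id NOT IN (SELECT generator_row_id FROM run_generator_rows)
ORDER BY gr.id
""").fetchall()

# NOT EXISTS: unaffected by the NULL.
not_exists = conn.execute("""
SELECT gr.id FROM generator_rows gr
WHERE NOT EXISTS (SELECT 1 FROM run_generator_rows r
                  WHERE r.generator_row_id = gr.id)
ORDER BY gr.id
""").fetchall()

# LEFT JOIN anti-join: keep rows where no match was found.
left_join = conn.execute("""
SELECT gr.id FROM generator_rows gr
LEFT JOIN run_generator_rows r ON r.generator_row_id = gr.id
WHERE r.generator_row_id IS NULL
ORDER BY gr.id
""").fetchall()

print(not_in, not_exists, left_join)
```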
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(generator_row_id)
generator_rows(generator_id)
Generally avoid inner selects as in
WHERE ... IN (SELECT ...)
as they are usually slow.
As was already shown for your problem, it's a good idea to think of SQL in terms of set theory.
You do NOT join tables on their identity alone:
Conceptually, SQL takes the set (that is, all rows) of the first table and "multiplies" it with the set of the second table, ending up with n times m rows.
Then the ON clause is used to reduce the result (often strongly) by evaluating its condition for each of those combinations: true (keep) or false (drop). This way you can use any arbitrary logic to select the combinations you want.
Things get trickier with LEFT JOIN and RIGHT JOIN, but you can think of them as taking one side for granted. For each row of that side:
output the combinations of that row for which the logic yields true (if there is at least one), exactly like JOIN does
otherwise output exactly ONE row, with 'the other side' (the right side for LEFT JOIN and vice versa) consisting of NULL in every column.
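This mental model can be checked on toy data. A small sketch using Python's sqlite3 module, with two invented tables a (3 rows) and b (2 rows): the cross product has 3 x 2 rows, an inner join equals the filtered cross product, and a LEFT JOIN adds one NULL-padded row for each row of a that had no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, label TEXT);
CREATE TABLE b (a_id INTEGER, note TEXT);
INSERT INTO a VALUES (1, 'one'), (2, 'two'), (3, 'three');
INSERT INTO b VALUES (1, 'x'), (1, 'y');
""")

# n times m combinations before any condition is applied.
cross_count = conn.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]

# Inner join with ON ...
inner = conn.execute("""
SELECT a.id, b.note FROM a JOIN b ON b.a_id = a.id
ORDER BY a.id, b.note
""").fetchall()

# ... keeps the same rows as filtering the cross product.
filtered_cross = conn.execute("""
SELECT a.id, b.note FROM a CROSS JOIN b WHERE b.a_id = a.id
ORDER BY a.id, b.note
""").fetchall()

# LEFT JOIN: unmatched rows of a appear once, padded with NULL.
left = conn.execute("""
SELECT a.id, b.note FROM a LEFT JOIN b ON b.a_id = a.id
ORDER BY a.id, b.note
""").fetchall()

print(cross_count, inner == filtered_cross, left)
```

The database does not actually build the full cross product; this only describes which rows the result contains.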
COUNT(*) is great too, but if things get complicated don't stick to it: use sub-selects for the keys only, and once all the hard work is done, join the fun stuff to it. Like in
SELECT sub.ID, sub.VALID_COUNT, details.*
FROM (
    SELECT ID, SUM(VALID) AS VALID_COUNT
    FROM (
        SELECT CASE WHEN X THEN 1 ELSE 0 END AS VALID, ID
        FROM ...
    ) AS flagged
    GROUP BY ID
) AS sub
JOIN ... AS details ON sub.ID = details.ID
The difference is that the inner query is executed only once. The outer query usually has no indexes left to work with and will be slow, but if the inner select doesn't make the data explode, this is usually many times faster than SELECT ... WHERE ... IN (SELECT ...) constructs.

SQL - select only newest record with WHERE clause

I have been trying to get some data off our database but got stuck when I needed to only get the newest file upload for each file type. I have done this before using the WHERE clause but this time there is an extra table involved that is needed to determine the file type.
My query looks like this so far, and I am getting six records for this user (2x filetypeNo4 and 4x filetypeNo2).
SELECT db_file.fileID
    ,db_profile.NAME
    ,db_applicationFileType.fileTypeID
    ,db_file.dateCreated
FROM db_file
LEFT JOIN db_applicationFiles
    ON db_file.fileID = db_applicationFiles.fileID
LEFT JOIN db_profile
    ON db_applicationFiles.profileID = db_profile.profileID
LEFT JOIN db_applicationFileType
    ON db_applicationFiles.fileTypeID = db_applicationFileType.fileTypeID
WHERE db_profile.profileID IN ('19456')
    AND db_applicationFileType.fileTypeID IN ('2','4')
I have the WHERE clause looking like this which is not working:
(db_file.dateCreated IS NULL
OR db_file.dateCreated = (
SELECT MAX(db_file.dateCreated)
FROM db_file left join
db_applicationFiles on db_file.fileID = db_applicationFiles.fileID
WHERE db_applicationFileType.fileTypeID = db_applicationFiles.FiletypeID
))
Sorry I am a noob so this may be really simple, but I just learn this stuff as I go on my own..
SELECT
ff.fileID,
pf.NAME,
ff.fileTypeID,
ff.dateCreated
FROM db_profile pf
OUTER APPLY
(
SELECT TOP 1 af.fileTypeID, df.dateCreated, df.fileID
FROM db_file df
INNER JOIN db_applicationFiles af
ON df.fileID = af.fileID
WHERE af.profileID = pf.profileID
AND af.fileTypeID IN ('2','4')
ORDER BY df.dateCreated DESC
) ff
WHERE pf.profileID IN ('19456')
It also looks like all of your joins are actually INNER, unless there may be a profile without files (that's why I used OUTER APPLY instead of CROSS APPLY).
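If the goal is the newest file for each file type rather than one newest file per profile, a different technique, ROW_NUMBER() partitioned by file type, is one option. Below is a sketch with Python's sqlite3 module. The single files table is a simplified, hypothetical flattening of db_file and db_applicationFiles, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_id INTEGER, profile_id INTEGER,
                    file_type_id INTEGER, date_created TEXT);
INSERT INTO files VALUES
  (1, 19456, 2, '2020-01-01'),
  (2, 19456, 2, '2020-03-01'),
  (3, 19456, 4, '2020-02-01'),
  (4, 19456, 4, '2020-01-15'),
  (5, 19456, 7, '2020-05-01');
""")

# Number the files within each (profile, file type) from newest to oldest,
# then keep only the newest one of each group.
newest = conn.execute("""
SELECT file_id, file_type_id, date_created
FROM (SELECT file_id, file_type_id, date_created,
             ROW_NUMBER() OVER (PARTITION BY profile_id, file_type_id
                                ORDER BY date_created DESC) AS rn
      FROM files
      WHERE profile_id = 19456
        AND file_type_id IN (2, 4)) ranked
WHERE rn = 1
ORDER BY file_type_id
""").fetchall()

print(newest)
```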
What about an obvious:
SELECT * FROM
(SELECT * FROM db_file ORDER BY dateCreated DESC) AS files1
GROUP BY fileTypeID ;

How to speed up query with aggregates?

I have the following query that seems to run pretty slow if more than one aggregate is used in the select part. Is there some way to optimize this?
The query returns 168 rows and takes 1 second to complete, but this bogs down when a couple of users load the page at once, and the original query had more aggregates, which add further seconds to the query.
Update: here's a simpler query
Select
gocm.CustomerID,
sum(DISTINCT o.OrderTotal) as TotalOfOrders
from GroupOrder_Customer_Mapping gocm
Left Join [Order] o on o.CreatedForCustomerID = gocm.customerid and o.grouporderid = 8254
where gocm.grouporderid = 8254
group by gocm.CustomerID, invitePath
order by invitepath
Execution Plan
returns following data (sample results)
Possibly this will be helpful for you:
SELECT
gocm.CustomerID
, o.TotalOfOrders
FROM (
SELECT DISTINCT gocm.CustomerID, invitePath
FROM dbo.GroupOrder_Customer_Mapping gocm
WHERE gocm.grouporderid = 8254
) gocm
LEFT JOIN (
SELECT
o.CreatedForCustomerID
, TotalOfOrders = SUM(DISTINCT o.OrderTotal)
FROM dbo.[Order] o
WHERE o.grouporderid = 8254
GROUP BY o.CreatedForCustomerID
) o ON o.CreatedForCustomerID = gocm.customerid
ORDER BY invitepath
If data is not updated frequently you might consider an indexed view.
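To illustrate why aggregating in a derived table before joining matters, here is a small sketch using Python's sqlite3 module. The mapping and orders tables and their rows are invented, and whether two orders with the same total should both count is an assumption about your data. A duplicated mapping row multiplies the joined rows and inflates a plain SUM(), SUM(DISTINCT ...) hides that but also drops a legitimate second order with the same total, and pre-aggregating each side avoids both problems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mapping (customer_id INTEGER, invite_path TEXT);
CREATE TABLE orders (customer_id INTEGER, order_total INTEGER);
INSERT INTO mapping VALUES (1, 'a'), (1, 'a');  -- duplicated mapping row
INSERT INTO orders VALUES (1, 50), (1, 50);     -- two orders, same total
""")

# Join first, then SUM: 2 mapping rows x 2 orders = 4 joined rows.
naive = conn.execute("""
SELECT SUM(o.order_total)
FROM mapping m
LEFT JOIN orders o ON o.customer_id = m.customer_id
GROUP BY m.customer_id
""").fetchone()[0]

# SUM(DISTINCT): collapses the fan-out, but also the second real order.
distinct_sum = conn.execute("""
SELECT SUM(DISTINCT o.order_total)
FROM mapping m
LEFT JOIN orders o ON o.customer_id = m.customer_id
GROUP BY m.customer_id
""").fetchone()[0]

# De-duplicate and aggregate each side first, then join one row to one row.
pre_aggregated = conn.execute("""
SELECT o.total
FROM (SELECT DISTINCT customer_id, invite_path FROM mapping) m
LEFT JOIN (SELECT customer_id, SUM(order_total) AS total
           FROM orders
           GROUP BY customer_id) o ON o.customer_id = m.customer_id
""").fetchone()[0]

print(naive, distinct_sum, pre_aggregated)
```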