I have a SQL 2005 table consisting of around 10 million records (dbo.Logs).
I have another table, dbo.Rollup that matches distinct dbo.Logs.URL to a FileId column in a third table, dbo.Files. The dbo.Rollup table forms the basis of various aggregate reports we run at a later stage.
Suffice to say for now, the problem I am having is in populating dbo.Rollup efficiently.
By definition, dbo.Logs has potentially tens of thousands of rows which all share the same URL field value. In our application, one URL can be matched to one dbo.Files.FileId, i.e. there is a many-to-one relationship between dbo.Logs.URL and dbo.Files.FileId (we parse the values of dbo.Logs to determine what the appropriate FileId is for a given URL).
My goal is to significantly reduce the time taken by the first of three stored procedures, which run in sequence to create meaningful statistics from our raw log data.
What I need is a specific example of how to refactor this SQL query to be much more efficient:
sp-Rollup-Step1:
INSERT INTO dbo.Rollup ([FileURL], [FileId])
SELECT
l.RequestedFile As [URL],
FileId = dbo.fn_GetFileIdFromURL(l.RequestedFile, l.CleanFileName)
FROM
dbo.Logs l (readuncommitted)
WHERE
NOT EXISTS (
SELECT
FileURL
FROM
dbo.Rollup
WHERE
FileUrl = RequestedFile
)
fn_GetFileIdFromURL():
CREATE FUNCTION [dbo].[fn_GetFileIdFromURL]
(
@URL nvarchar(500),
@CleanFileName nvarchar(255)
)
RETURNS uniqueidentifier
AS
BEGIN
DECLARE @id uniqueidentifier
if (exists(select FileURL from dbo.[Rollup] where [FileUrl] = @URL))
begin
-- This URL has been seen before in dbo.Rollup.
-- Retrieve the FileId from the dbo.Rollup table.
set @id = (select top 1 FileId from dbo.[Rollup] where [FileUrl] = @URL)
end
else
begin
-- This is a new URL. Hunt for a matching URL in our list of files,
-- and return a FileId if a match is found.
Set @id = (
SELECT TOP 1
f.FileId
FROM
dbo.[Files] f
INNER JOIN
dbo.[Servers] s on s.[ServerId] = f.[ServerId]
INNER JOIN
dbo.[URLs] u on
u.[ServerId] = f.[ServerId]
WHERE
Left(u.[PrependURLProtocol],4) = left(@URL, 4)
AND @CleanFileName = f.FileName
)
end
return @id
END
Key considerations:
dbo.Rollup should contain only one entry for each DISTINCT/unique URL found in dbo.Logs.
I would like to omit records from being inserted into dbo.[Rollup] where the FileId is NULL.
In my own observations, it seems the slowest part of the query by far is in the stored procedure: the "NOT EXISTS" clause (I am not sure at this point whether that continually refreshes the table or not).
I'm looking for a specific solution (with examples using either pseudo-code or by modifying my procedures shown here) - answer will be awarded to those who provide it!
Thanks in advance for any assistance you can provide.
/Richard.
Short answer: you effectively have a CURSOR here, because the scalar UDF is run once per row of output.
The udf could be 2 LEFT JOINs onto derived tables. A rough outline:
...
COALESCE (F.xxx, L.xxx) --etc
...
FROM
dbo.Logs l (readuncommitted)
LEFT JOIN
(select DISTINCT --added after comment
FileId, FileUrl from dbo.[Rollup]) R ON L.RequestedFile = R.FileUrl
LEFT JOIN
(SELECT DISTINCT --added after comment
f.FileId,
f.FileName,
left(u.[PrependURLProtocol], 4) + '%' AS Left4
FROM
dbo.[Files] f
INNER JOIN
dbo.[Servers] s on s.[ServerId] = f.[ServerId]
INNER JOIN
dbo.[URLs] u on
u.[ServerId] = f.[ServerId]
) F ON L.CleanFileName = F.FileName AND L.RequestedFile LIKE F.Left4
...
I'm also not sure if you need the NOT EXISTS because of how the udf works. If you do, make sure the columns are indexed.
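To make the outline concrete, here is a minimal sketch of the whole insert as one set-based statement. It assumes the table and column names shown in the question, and it relies on the outer NOT EXISTS to skip URLs already in dbo.Rollup, so only the dbo.Files branch of the udf is reproduced. ROW_NUMBER() stands in for the udf's TOP 1 (which match wins is arbitrary in both cases). Treat it as a starting point to test against your data, not a drop-in replacement.

```sql
-- Sketch only: table and column names are taken from the question.
INSERT INTO dbo.Rollup ([FileURL], [FileId])
SELECT
    x.RequestedFile,
    x.FileId
FROM (
    SELECT
        d.RequestedFile,
        f.FileId,
        -- Keep one match per URL, mirroring the TOP 1 inside the udf.
        ROW_NUMBER() OVER (PARTITION BY d.RequestedFile
                           ORDER BY f.FileId) AS rn
    FROM (
        -- Collapse the log to distinct URLs first, so the lookups run
        -- once per URL instead of once per log row.
        SELECT DISTINCT l.RequestedFile, l.CleanFileName
        FROM dbo.Logs l WITH (READUNCOMMITTED)
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.Rollup r
                          WHERE r.FileUrl = l.RequestedFile)
    ) d
    INNER JOIN dbo.Files f
        ON f.FileName = d.CleanFileName
    INNER JOIN dbo.Servers s
        ON s.ServerId = f.ServerId
    INNER JOIN dbo.URLs u
        ON u.ServerId = f.ServerId
       AND LEFT(u.PrependURLProtocol, 4) = LEFT(d.RequestedFile, 4)
) x
WHERE x.rn = 1;
```

Because the lookups are inner joins, URLs with no matching file never reach the insert, which also covers the requirement to omit rows where FileId would be NULL.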
I think your hotspot is located here:
Left(u.[PrependURLProtocol],4) = left(@URL, 4)
This will cause the server to do a scan on the URLs table. You should not apply a function to a column in a join or where clause. Try to rewrite that to something like
... where PrependURLProtocol like left(@URL, 4) + '%'
And make sure you have an index on the field.
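For illustration, such an index could look like the following. The index name is made up, and the sketch assumes PrependURLProtocol is an indexable type (not ntext or nvarchar(max)). Whether the INCLUDE column helps depends on the rest of the query, so compare the execution plans before and after.

```sql
-- Hypothetical index name; column names are taken from the question.
CREATE NONCLUSTERED INDEX IX_URLs_PrependURLProtocol
    ON dbo.URLs (PrependURLProtocol)
    INCLUDE (ServerId);  -- lets the join on ServerId be answered from the index
```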
INSERT INTO dbo.Rollup ([FileURL], [FileId])
SELECT
l.RequestedFile As [URL],
FileId = dbo.fn_GetFileIdFromURL(l.RequestedFile, l.CleanFileName)
FROM dbo.Logs l (readuncommitted) LEFT OUTER JOIN dbo.Rollup
on FileUrl = RequestedFile
WHERE FileUrl IS NULL
The logic here is that if no dbo.Rollup row exists for the given FileUrl, then the left outer join will turn up NULL. The NOT EXISTS now becomes an IS NULL, which may be faster; compare the execution plans, since SQL Server can compile both forms to an anti-join.
I have a recursive function which allows me to give any GUID in the hierarchy, and it pulls back all the values below it. This is used for folder security.
ALTER FUNCTION dbo.ValidSiteClass
(
@GUID UNIQUEIDENTIFIER
)
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
WITH previous
AS ( SELECT
PK_SiteClass,
FK_Species_SiteClass,
CK_ParentClass,
ClassID,
ClassName,
Description,
SyncKey,
SyncState
FROM
dbo.SiteClass
WHERE
PK_SiteClass = @GUID
UNION ALL
SELECT
Cur.PK_SiteClass,
Cur.FK_Species_SiteClass,
Cur.CK_ParentClass,
Cur.ClassID,
Cur.ClassName,
Cur.Description,
Cur.SyncKey,
Cur.SyncState
FROM
dbo.SiteClass Cur,
previous
WHERE
Cur.CK_ParentClass = previous.PK_SiteClass)
SELECT DISTINCT
previous.PK_SiteClass,
previous.FK_Species_SiteClass,
previous.CK_ParentClass,
previous.ClassID,
previous.ClassName,
previous.Description,
previous.SyncKey,
previous.syncState
FROM
previous
)
I have a stored procedure which later needs to figure out which folders have changed in the user's hierarchy, which I use for change tracking. When I try to join it with my change tracking, the query never returns. For example, the following doesn't ever return any results (it just spins; I stop it after 6 minutes):
DECLARE @ChangeTrackerNumber INT = 13;
DECLARE @SelectedSchema UNIQUEIDENTIFIER = '36EC6589-8297-4A82-86C3-E6AAECCC7D95';
WITH validones AS (SELECT PK_SITECLASS FROM ValidSiteClass(@SelectedSchema))
SELECT SiteClass.PK_SiteClass KeyGuid,
'' KeyString,
dbo.GetChangeOperationEnum(SYS_CHANGE_OPERATION) ChangeOp
FROM dbo.SiteClass
INNER JOIN CHANGETABLE(CHANGES SiteClass, @ChangeTrackerNumber) tracking --tracking
ON tracking.PK_SiteClass = SiteClass.PK_SiteClass
INNER JOIN validones
ON SiteClass.PK_SiteClass = validones.PK_SiteClass
WHERE SyncState IN ( 0, 2, 4 );
The only way I can make this work is with a temptable such as:
DECLARE @ChangeTrackerNumber INT = 13;
DECLARE @SelectedSchema UNIQUEIDENTIFIER = '36EC6589-8297-4A82-86C3-E6AAECCC7D95';
CREATE TABLE #temptable
(
[PK_SiteClass] UNIQUEIDENTIFIER
);
INSERT INTO #temptable
(
PK_SiteClass
)
SELECT PK_SiteClass
FROM dbo.ValidSiteClass(@SelectedSchema);
SELECT SiteClass.PK_SiteClass KeyGuid,
'' KeyString,
dbo.GetChangeOperationEnum(SYS_CHANGE_OPERATION) ChangeOp
FROM dbo.SiteClass
INNER JOIN CHANGETABLE(CHANGES SiteClass, @ChangeTrackerNumber) tracking --tracking
ON tracking.PK_SiteClass = SiteClass.PK_SiteClass
INNER JOIN #temptable
ON SiteClass.PK_SiteClass = #temptable.PK_SiteClass
WHERE SyncState IN ( 0, 2, 4 );
DROP TABLE #temptable;
In other words, the CTE doesn't work and I need to call the temptable.
First question: isn't a CTE supposed to be the same thing as (but better than) a temp table?
Second question, does anyone know why this could be so? I have tried inner joins and using a where and in clause also. Is there something different about a recursive query that might cause this odd behavior?
Generally, when you have a table-valued function, you'd just include it like it was a regular table (assuming you have a parameter to pass to it). If you want to pass a series of parameters to it, you'd use outer apply, but that doesn't seem to be the case here.
I think (maybe) this is more like you want (notice no with clause):
select
s.PK_SiteClass KeyGuid,
'' KeyString,
dbo.GetChangeOperationEnum(c.SYS_CHANGE_OPERATION) ChangeOp
from
dbo.ValidSiteClass(@SelectedSchema) v
inner join
SiteClass s
on
s.PK_SiteClass = v.PK_SiteClass
inner join
changetable(changes SiteClass, @ChangeTrackerNumber) c
on
c.PK_SiteClass = s.PK_SiteClass
where
SyncState in ( 0, 2, 4 )
option (force order)
...which I'll admit doesn't look that mechanically different than what you have with the with clause. However, you could be running into an issue with SQL Server just picking a horrible plan not having any other clues. Including the option (force order) makes SQL Server perform the joins according to the order you put them in...and sometimes this makes an incredible difference.
I wouldn't say this is recommended. In fact, it's a hack...just to see WTF. But, play around with the order...and get SQL Server to show you the actual execution plans to see why it might have come up with something so heinous. An inline table-valued function is visible to SQL Server's query plan engine, and it may decide to not treat the function as an isolated thing the way programmers traditionally think about functions. I suspect this is why it took so long to begin with.
Funnily enough, if your function were a so-called multi-statement table-valued function, SQL Server would definitely not have the same type of visibility into it when planning this query...and it might run faster. Again, not a recommendation, just something that might hack a better plan.
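As a rough illustration of that last idea, a multi-statement version could look like the sketch below. The function name dbo.ValidSiteClassKeys is hypothetical, and it returns only the key column because that is all the join needs; the recursion itself is copied from the question's function.

```sql
-- Hypothetical name; a multi-statement variant of the question's function.
CREATE FUNCTION dbo.ValidSiteClassKeys
(
    @GUID UNIQUEIDENTIFIER
)
RETURNS @result TABLE (PK_SiteClass UNIQUEIDENTIFIER PRIMARY KEY)
AS
BEGIN
    -- Same recursion as dbo.ValidSiteClass, but materialised into a
    -- table variable so the optimizer treats the function as opaque.
    WITH previous AS
    (
        SELECT PK_SiteClass
        FROM dbo.SiteClass
        WHERE PK_SiteClass = @GUID
        UNION ALL
        SELECT cur.PK_SiteClass
        FROM dbo.SiteClass cur
        INNER JOIN previous p
            ON cur.CK_ParentClass = p.PK_SiteClass
    )
    INSERT INTO @result (PK_SiteClass)
    SELECT DISTINCT PK_SiteClass
    FROM previous;

    RETURN;
END
```

This is close in effect to the temp table workaround from the question: the hierarchy is resolved once, and the outer query joins to the stored keys.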
I have the code below that does not return any values when it should. Is there anyone that can find anything faulty with it? When I run the script I get 0 values in return (0 rows affected), even though I know that there are rows that should be selected.
This is just a part of the whole script, and when I run everything I get the error message "Warning: Null value is eliminated by an aggregate or other SET operation". But I don't know what this implies for my script. I only have 1 "group by" in the script (which I don't post here since that coding works). Does anyone know?
The code, below, should create a temp-table and to the temp-table insert rows from the table "StatusHistorik" where the following conditions are met:
1) [NyStatus]=4 (NyStatus is a column from the table "StatusHistorik"),
2) The variable "EnhetsId" should not exist in the table "SKL_AdminKontroll_SkaÅtgärdas" (F). Code: "F.EnhetsId is null".
I have used inner joins in both cases. This should result in a number of observations/rows in the temp-table, but returns nothing. Does anyone find any errors with the coding that could explain the absence of results?
I should mention that when I "comment away" the last inner join and the last part of the where-statement, the script works as it is supposed to. So I suspect that it is the last inner join-statement that is wrong somehow.
declare @temp2 table (
EnhetsId varchar(50),
TjanstId Int,
Tabell varchar(50),
Kommentar ntext,
Uppdaterad datetime
);
WITH ENHET_AVSLUT AS
(
SELECT DISTINCT A.[EnhetsId]
FROM [StatistikinlamningDataSKL].[dbo].[StatusHistorik] A
inner join (
select [EnhetsId], max(SenastUppdaterad) as SenastDatum
from [StatistikinlamningDataSKL].[dbo].[StatusHistorik]
group by [EnhetsId]
) B
on A.[EnhetsId] = B.[EnhetsId] and A.[SenastUppdaterad] = B.SenastDatum
INNER JOIN
StatistikinlamningDataSKL.dbo.SKL_AdminKontroll_SkaÅtgärdas F ON A.EnhetsId = F.EnhetsId
WHERE [NyStatus] = 4 AND F.EnhetsId is null
)
insert into @temp2
(EnhetsId, TjanstId, Tabell, Kommentar, Uppdaterad)
SELECT
EnhetsId, 1, 'GR_PS09_1', 'OK', getdate()
from ENHET_AVSLUT
select * from @temp2
Best regards,
Hannes
Because you are saying
The variable "EnhetsId" should not exist in the table "SKL_AdminKontroll_SkaÅtgärdas" (F). Code: "F.EnhetsId is null".
an INNER JOIN only returns rows that have a match in both tables, so F.EnhetsId can never be NULL in its output. Combined with the "F.EnhetsId is null" filter, the result will be zero rows.
So you must use a LEFT JOIN, where every row from the left-hand tables is returned irrespective of the JOIN condition being satisfied, and unmatched rows carry NULL in the columns of F.
I must warn you though, if the table is large it might have performance issues.
Your inner SQL must be something like
SELECT DISTINCT A.[EnhetsId]
FROM [StatistikinlamningDataSKL].[dbo].[StatusHistorik] A
inner join (
select [EnhetsId], max(SenastUppdaterad) as SenastDatum
from [StatistikinlamningDataSKL].[dbo].[StatusHistorik]
group by [EnhetsId]
) B
on A.[EnhetsId] = B.[EnhetsId] and A.[SenastUppdaterad] = B.SenastDatum
LEFT JOIN
StatistikinlamningDataSKL.dbo.SKL_AdminKontroll_SkaÅtgärdas F ON A.EnhetsId = F.EnhetsId
WHERE [NyStatus] = 4 AND F.EnhetsId is null
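The difference between the two join types is easy to check on a toy example. The sketch below uses two made-up table variables (@A and @F are stand-ins, not the real tables) and also shows NOT EXISTS, which expresses the same "no matching row" test and is worth comparing if the LEFT JOIN turns out slow on a large table.

```sql
-- Toy data: E1 exists in both tables, E2 only in @A.
DECLARE @A TABLE (EnhetsId varchar(50));
DECLARE @F TABLE (EnhetsId varchar(50));
INSERT INTO @A (EnhetsId) VALUES ('E1');
INSERT INTO @A (EnhetsId) VALUES ('E2');
INSERT INTO @F (EnhetsId) VALUES ('E1');

-- INNER JOIN keeps only matched rows, so f.EnhetsId is never NULL: 0 rows.
SELECT a.EnhetsId
FROM @A a
INNER JOIN @F f ON a.EnhetsId = f.EnhetsId
WHERE f.EnhetsId IS NULL;

-- LEFT JOIN keeps unmatched rows with NULLs on the right: returns E2.
SELECT a.EnhetsId
FROM @A a
LEFT JOIN @F f ON a.EnhetsId = f.EnhetsId
WHERE f.EnhetsId IS NULL;

-- NOT EXISTS states the same condition directly: returns E2.
SELECT a.EnhetsId
FROM @A a
WHERE NOT EXISTS (SELECT 1 FROM @F f WHERE f.EnhetsId = a.EnhetsId);
```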
I've been trying to assemble a stored procedure on my Azure database that, when it runs the query, returns one output value from one specific column.
The likelihood of multiple results is zero since the table being queried has 3 columns, and the query must match 2. Then it grabs data from another table. The key is I need the first query to output the value in order to commence the second query.
At present I have 2 procedures, I would like to have one.
Query is as such for the moment:
select
customers_catalogs_define.catalog_id
from
customers_catalogs
left outer join
customers_catalogs_define on customers_catalogs.catalog_id = customers_catalogs_define.catalog_id
where
customers_catalogs.catalog_unique_identifier = @catalog_unique
AND customers_catalogs_define.customer_id = @customer_id
The output of course is the catalog_id. From that I take it into another query which I have that does the actual list retrieval. At the very least I would like to add a line that simply states @catalog_id = output
Thanks
You have two options basically to have this work :
Since I haven't seen your second query, just make sure its querying the correct table listed below as CUSTOMERS_CATALOGS_DEFINE
Using a variable as you suggested:
DECLARE @CATALOG_ID INT
SET @CATALOG_ID = (
SELECT CUSTOMERS_CATALOGS_DEFINE.CATALOG_ID
FROM CUSTOMERS_CATALOGS
LEFT OUTER JOIN CUSTOMERS_CATALOGS_DEFINE
ON CUSTOMERS_CATALOGS.CATALOG_ID = CUSTOMERS_CATALOGS_DEFINE.CATALOG_ID
WHERE CUSTOMERS_CATALOGS.CATALOG_UNIQUE_IDENTIFIER = @CATALOG_UNIQUE
AND CUSTOMERS_CATALOGS_DEFINE.CUSTOMER_ID = @CUSTOMER_ID )
SELECT *
FROM CUSTOMERS_CATALOGS_DEFINE
WHERE CATALOG_ID = @CATALOG_ID
Second option would be to do it in one query:
SELECT *
FROM CUSTOMERS_CATALOGS_DEFINE
WHERE CATALOG_ID IN (
SELECT CUSTOMERS_CATALOGS_DEFINE.CATALOG_ID
FROM CUSTOMERS_CATALOGS
LEFT OUTER JOIN CUSTOMERS_CATALOGS_DEFINE
ON CUSTOMERS_CATALOGS.CATALOG_ID = CUSTOMERS_CATALOGS_DEFINE.CATALOG_ID
WHERE CUSTOMERS_CATALOGS.CATALOG_UNIQUE_IDENTIFIER = @CATALOG_UNIQUE
AND CUSTOMERS_CATALOGS_DEFINE.CUSTOMER_ID = @CUSTOMER_ID
)
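Since the goal was a single procedure, the variable approach could be wrapped up roughly as follows. This is a sketch under assumptions: the procedure name is made up, and the parameter types are guesses (only the INT for the catalog id comes from the example above), so adjust them to the real column types.

```sql
-- Hypothetical procedure name; parameter types are assumptions.
CREATE PROCEDURE dbo.usp_GetCatalogDefinition
    @catalog_unique NVARCHAR(50),
    @customer_id    INT,
    @catalog_id     INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Reset first: SELECT @var = ... leaves the variable untouched
    -- when no row matches.
    SET @catalog_id = NULL;

    -- Step 1: resolve the catalog id. The WHERE clause filters on the
    -- joined table, so an INNER JOIN returns the same rows as the
    -- LEFT OUTER JOIN in the original query.
    SELECT @catalog_id = d.catalog_id
    FROM customers_catalogs c
    INNER JOIN customers_catalogs_define d
        ON d.catalog_id = c.catalog_id
    WHERE c.catalog_unique_identifier = @catalog_unique
      AND d.customer_id = @customer_id;

    -- Step 2: run the second query with the value found above.
    SELECT *
    FROM customers_catalogs_define
    WHERE catalog_id = @catalog_id;
END
GO

-- Example call with made-up values; @id receives the catalog id.
DECLARE @id INT;
EXEC dbo.usp_GetCatalogDefinition
    @catalog_unique = N'ABC',
    @customer_id    = 42,
    @catalog_id     = @id OUTPUT;
```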
I have to optimize this query. Can someone help me fine-tune it so it will return data faster?
Currently the output is taking somewhere around 26 to 35 seconds. I also created index based on attachment table following is my query and index:
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
WHERE c.status = '4'
--AND c.correspondence > 0
AND o.organizationlevel = 1
AND (up.site = 'ALL' OR
up.site = up.site)
--AND (@Dept = 'ALL' OR @Dept = up.department)
AND EXISTS (SELECT 1 FROM Attachment a
WHERE a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
ORDER BY o.organizationcode
I can't just change anything in db due to permission issues, any help would be much appreciated.
I believe your headache is coming from this part in specific...like in a where exists can be your performance bottleneck.
AND EXISTS (SELECT 1 FROM Attachment a
WHERE a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
This can be written as a join instead.
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
left join Attachment a on a.contextid = c.correspondenceid
AND a.context = 'correspondence'
and right(a.attachmentname, 4) in ('.doc','.rtf')
....
This eliminates both the like and the where exists. Put your where clause at the bottom. It's a left join, so a.anycolumn is null means the record does not exist and a.anycolumn is not null means a record was found. Where a.anycolumn is not null will be the equivalent of a true in the where exists logic.
Edit to add:
Another thought for you...I'm unsure what you are trying to do here...
AND (up.site = 'ALL' OR
up.site = up.site)
So that is where up.site = 'ALL' or 1=1? Is the or really needed?
And quickly on right...Right(column,integer) gives you the characters from the right of the string (I used a 4, so it'll take the 4 right chars of the column specified). I've found it far faster than a like statement runs.
This is always going to return true (unless up.site is NULL), so you can eliminate it (and maybe the join to up)
AND (up.site = 'ALL' OR up.site = up.site)
If you can live with dirty reads, then add with (nolock).
And I would try Attachment as a join. Might not help but worth a try. Like is relatively expensive, and if it is being evaluated in a loop where it could be done once, that would really help.
Join Attachment a
on a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc')
I know there are some people on SO that insist that exists is always faster than a join. And yes it is often faster than a join but not always.
Another approach is to create a #temp table using
CREATE TABLE #Temp (contextid INT PRIMARY KEY CLUSTERED);
insert into #temp
Select distinct contextid
from Attachment
where context = 'correspondence'
AND ( attachmentname like '%.rtf' or attachmentname like '%.doc')
order by contextid;
go
select ...
from correspondence c
join #Temp
on #Temp.contextid = c.correspondenceid
go
drop table #temp
Especially if correspondenceid is the primary key or part of the primary key on correspondence, creating the PK on #temp will help.
That way you can be sure that the like expression is only evaluated once. If the like is the expensive part and in a loop then it could be tanking the query. I use this a lot where I have a fairly expensive core query and I need those results to pick up reference data from multiple tables. If you do a lot of joins, sometimes the query optimizer goes stupid. But if you give the query optimizer PK to PK then it does not get stupid and is fast. The downside is it takes about 0.5 seconds to create and populate the #temp.
I have a relationship between two tables. Both tables' PKs are int types.
In one table (UserS), I need to supply the Username and get the corresponding ID (which is the PK). This is the standard ASP.NET user table when using forms authentication.
However, in a related table, I want to supply the ID I find from the Users table to get a value out.
Something like:
Run query to get ID for a username (simple select/where query)
Return the result
Run a subquery where I can pass in the above result -
Get value where ID = result
This sounds a lot like dynamic sql. However, there might be a better suited and appropriate way of writing this query (on Sql Server 2k5).
How can I go about doing this and what gotchas lurk?
EDIT: I will try something along the lines of http://www.sqlteam.com/article/introduction-to-dynamic-sql-part-1
EDIT: Thanks for the tips everyone, I wrote this:
SELECT Discount.DiscountAmount
FROM Discount
INNER JOIN aspnet_Users
ON Discount.DiscountCode = aspnet_Users.UserId And aspnet_Users.Username = 's'
Where 's' is to be replaced by a parameter.
Thanks
You don't have to use dynamic SQL for that.
You can use a lookup instead:
DECLARE @ID bigint
SELECT @ID = ID FROM Users WHERE Username = @Username
SELECT
*
FROM
TableB
WHERE
ID = @ID
Then, you could add the parameter @Username to your SqlCommand object, preventing the risks of SQL Injection.
Also, doing the lookup is preferable to a join, since the index is scanned a single time, for TableB.
Right, I would just do this:
SELECT *
FROM TableB
JOIN Users ON Users.Id = TableB.ID
WHERE Users.Username = @Username
Regarding lookup vs joins - while it may seem more intuitive for the novice to use the lookup, you should go with a join. A lookup needs to be evaluated for every row in your primary result set, while a join is evaluated once. Joins are much more efficient, which means that they are faster.
This is just a simple join isn't it?
SELECT x.*
FROM user_table u
INNER JOIN
other_table x
ON u.id = x.user_id
WHERE u.name = @user_name
SELECT [values].value
FROM form_users, [values]
WHERE form_users.username = 'John Doe'
AND [values].[user] = form_users.userid
SELECT
*
FROM
table2 t2
JOIN
UserS t1 ON t2.IDKey = t1.IDKey
WHERE
t1.Username = @Input