Limit query size to control transaction log size

This query causes our transaction log to grow to 25GB. The database is in SIMPLE recovery mode.
INSERT INTO updbl.dbo.PopulationRelatives
( personid,
personsex,
relativeid,
relativesex,
degree,
relationship,
maternalpaternal )
SELECT DISTINCT
personid = relative1,
relative1sex,
relative2,
relative2sex,
degree,
relationship = Rel1Rel2,
maternalpaternal
FROM UPDBwork.dbo.DegreeRelationship
By looping I was able to limit the growth to 8GB.
DECLARE @PID INT, @MaxPID INT, @BatchSize INT, @ROWCOUNT INT
SELECT @PID = 0, @BatchSize = 1000000, @ROWCOUNT = 0
SELECT @MaxPID = MAX(relative1) FROM updbwork.dbo.DegreeRelationship
WHILE @PID < @MaxPID + @BatchSize
BEGIN
INSERT INTO updbl.dbo.PopulationRelatives
( personid,
personsex,
relativeid,
relativesex,
degree,
relationship,
maternalpaternal )
SELECT DISTINCT
personid = relative1,
relative1sex,
relative2,
relative2sex,
degree,
relationship = Rel1Rel2,
maternalpaternal
FROM UPDBwork.dbo.DegreeRelationship
WHERE relative1 BETWEEN @PID+1 AND @PID+@BatchSize
SET @PID = @PID + @BatchSize
CHECKPOINT
END
This isn't the best strategy, as each loop produces a different number of rows depending on the DISTINCT values. Unfortunately there is no good ID to partition the data on. Is there some way I could control the size of each group? I was thinking of adding TOP(X), but the engine would still have to do a large calculation to satisfy the DISTINCT. A cursor would be great but, again, how do I find my DISTINCT values? I am just hoping for some brainstorming here.
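For reference, this is roughly the TOP(X) variant I was considering: each pass inserts the next million distinct rows that are not yet in the target, so the batch size is fixed, but every pass still has to re-evaluate the DISTINCT (sketch only, untested; it assumes personid/relativeid/degree/relationship are enough to identify a target row):
INSERT INTO updbl.dbo.PopulationRelatives
( personid, personsex, relativeid, relativesex, degree, relationship, maternalpaternal )
SELECT DISTINCT TOP (1000000)
relative1, relative1sex, relative2, relative2sex, degree, Rel1Rel2, maternalpaternal
FROM UPDBwork.dbo.DegreeRelationship AS d
WHERE NOT EXISTS
( SELECT 1 FROM updbl.dbo.PopulationRelatives AS p
  WHERE p.personid = d.relative1 AND p.relativeid = d.relative2
    AND p.degree = d.degree AND p.relationship = d.Rel1Rel2 )
-- repeat (e.g. in a WHILE loop with a CHECKPOINT) until @@ROWCOUNT = 0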
Thanks.

Sounds like a bulk operation... if changing the recovery model is an option, temporarily change it to BULK_LOGGED. Here is a link that may be of help: http://technet.microsoft.com/en-us/library/ms175987(v=SQL.105).aspx
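For illustration, the switch would look something like this (database name taken from the question; check your backup strategy before changing the model):
ALTER DATABASE updbl SET RECOVERY BULK_LOGGED;
-- run the large INSERT ... SELECT here; note it is only minimally logged when
-- the minimal-logging requirements are met (for example a TABLOCK hint on the target)
ALTER DATABASE updbl SET RECOVERY SIMPLE;  -- switch back when the load is done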

Related

SQL Spatial Indexing including category field(s)

I have a spatial index set up on a geography field in my SQL Server 2012 database that stores item locations. There are about 15,000 Items.
I need to return a total of Items within a radius of N kilometres of a given Lat/Lng.
I can do this and it's fast.
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres*1000)
SELECT
COUNT(*) AS Total
FROM dbo.Items i
WHERE
i.LatLngGeo.STIntersects(@radius) = 1
However, what I now need to do is filter by several fields, to get items that match a given Category and Price.
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres*1000)
SELECT
COUNT(*) AS Total
FROM dbo.Items i
WHERE
i.LatLngGeo.STIntersects(@radius) = 1 AND
(i.Category = @Category OR @Category IS NULL) AND
(i.Price < @Price OR @Price IS NULL)
This grinds away for about 10+ seconds, and I can find no way of adding varchar or number fields to a spatial index.
What can I do to speed this up?
I would start with something like this:
--Query 1 - use a CTE to split the two filters
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres * 1000);
WITH InRadius AS (
SELECT * FROM dbo.Items WHERE LatLngGeo.STIntersects(#radius) = 1)
SELECT
COUNT(*)
FROM
InRadius
WHERE
ISNULL(@Category, Category) = Category
AND ISNULL(@Price, Price) = Price;
GO
--Query 2 - use a temp table to *definitely* split the two filters
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres * 1000);
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp;
WITH InRadius AS (
SELECT * FROM dbo.Items WHERE LatLngGeo.STIntersects(#radius) = 1)
SELECT * INTO #temp FROM InRadius;
SELECT
COUNT(*)
FROM
#temp
WHERE
ISNULL(@Category, Category) = Category
AND ISNULL(@Price, Price) = Price;
Run those queries a few times, then benchmark them against your original script.
Another trick is to copy your original query in as well, then view the execution plan. What you are looking for is the percentage split per query, which would ideally be something like 98%:1%:1%, i.e. the original query will take 98% of the work, but it will probably look very different indeed.
If this doesn't help, and you are okay with temp tables, then try adding an index on the temp table that matches the criteria you are filtering on. However, with only 15,000 rows the effect of an index should be almost imperceptible.
And finally, you could limit the data being loaded into the temp table to only the columns you are going to filter on, since all you seem to want is a count at the end.
Let's just take a quick recap:
extract the data matching your spatial query (which you say is quick already);
discard anything GEOGRAPHY based from the results, storing them in a temp table;
index the temp table to speed up any filters;
now your COUNT(*) should be just on a sub-set of the data (with no spatial data), and the optimiser will not be able to try and combine it with the proximity filter;
profit!
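Putting that recap together, a rough sketch (the #hits name is made up; column names are taken from the question, and the index only pays off if the radius filter returns a lot of rows):
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres * 1000);
IF OBJECT_ID('tempdb..#hits') IS NOT NULL DROP TABLE #hits;
SELECT Category, Price               -- keep only the columns still needed for filtering
INTO #hits
FROM dbo.Items
WHERE LatLngGeo.STIntersects(@radius) = 1;
CREATE INDEX IX_hits ON #hits (Category, Price);
SELECT COUNT(*) AS Total
FROM #hits
WHERE (@Category IS NULL OR Category = @Category)
AND (@Price IS NULL OR Price < @Price);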

Delete Duplicates in Table with huge amount of rows

I have a table with 19 million records. I want to delete duplicates, but the query I am using takes a very long time and the connection eventually times out.
This is the query I am using:
DELETE FROM [TableName]
WHERE id NOT IN
(SELECT MAX(id) FROM [TableName] GROUP BY field)
where id is the primary key and auto-increments.
I want to delete the rows with duplicate values in field.
Is there a faster alternative to this query?
Any help would be appreciated.
I suggest temporarily adding an index on field to speed things up. Maybe use this statement to delete (even though yours should work fine with the index).
My statement generates a list of ids that should be deleted. Assuming that id, as the primary key, is indexed, this is probably faster; it should also perform a little better than NOT IN.
with candidates as (
SELECT id
, ROW_NUMBER() over (PARTITION by field order by id desc) rn
FROM [TableName]
)
delete
from candidates
where rn > 1
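For the index suggestion above, something along these lines (the index name is made up; drop it again once the cleanup is done):
CREATE NONCLUSTERED INDEX IX_TableName_field_tmp ON [TableName] (field, id DESC);
-- run the duplicate delete here
DROP INDEX IX_TableName_field_tmp ON [TableName];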
My answer is a spin on Brett Schneider's, with a batched approach (including a small wait) to avoid contention and alleviate explosive log file growth.
Set your initial @batchcount to something you think the server can handle -- you can also increase/decrease the wait time as needed. Once @@ROWCOUNT = 0, the loop will terminate.
declare @batchcount int, @totalrows int
set @totalrows = 0
set @batchcount = 10000 -- set this to some initial value
while @batchcount > 0
begin
;with dupes as (
SELECT id
, ROW_NUMBER() over (PARTITION by field order by id desc) rownum
FROM [TableName]
)
delete top (@batchcount) t1
from TableName t1
join dupes c
on c.id = t1.id
and c.rownum > 1
set @batchcount = @@ROWCOUNT --record how many just got nuked
set @totalrows = @totalrows + @batchcount --track progress
print cast(@totalrows as varchar) + ' rows have been deleted' -- show progress
waitfor delay '00:00:05' -- wait 5 seconds for log writes, other queries etc
end
The print statement may not "show" on every loop in SSMS, but every so often you'll see SQL messages appear showing hundreds of iterations completed... be patient.
Create another heap table and insert into it the ids you want to delete. Then delete the records in the main table (WHERE EXISTS in the heap table) in chunks of 1000-5000 each to avoid the timeout. Good luck!
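A rough sketch of that approach, assuming the id/field names from the question (the work-list table name is made up):
-- collect the ids of the duplicate rows once
SELECT id
INTO DuplicateIds                     -- plain heap table used as a work list
FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY field ORDER BY id DESC) AS rn
FROM [TableName]
) x
WHERE rn > 1;
-- then delete from the main table in small chunks until nothing is left
WHILE 1 = 1
BEGIN
DELETE TOP (5000) t
FROM [TableName] t
WHERE EXISTS (SELECT 1 FROM DuplicateIds d WHERE d.id = t.id);
IF @@ROWCOUNT = 0 BREAK;
END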

Fast calculation of partial sums on a large SQL Server table

I need to calculate a total of a column up to a specified date on a table that currently has over 400k rows and is poised to grow further. I found the SUM() aggregate function to be too slow for my purpose, as I couldn't get it faster than about 1500ms for a sum over 50k rows.
Please note that the code below is the fastest implementation I have found so far. Notably, filtering the data from CustRapport and storing it in a temporary table brought me a 3x performance increase. I also experimented with indexes, but they usually made it slower.
I would however like the function to be at least an order of magnitude faster. Any idea on how to achieve that? I have stumbled upon http://en.wikipedia.org/wiki/Fenwick_tree. However, I would rather have the storage and calculation processed within SQL Server.
CustRapport and CustLeistung are Views with the following definition:
ALTER VIEW [dbo].[CustLeistung] AS
SELECT TblLeistung.* FROM TblLeistung
WHERE WebKundeID IN (SELECT WebID FROM XBauAdmin.dbo.CustKunde)
ALTER VIEW [dbo].[CustRapport] AS
SELECT MainRapport.* FROM MainRapport
WHERE WebKundeID IN (SELECT WebID FROM XBauAdmin.dbo.CustKunde)
Thanks for any help or advice!
ALTER FUNCTION [dbo].[getBaustellenstunden]
(
@baustelleID int,
@datum date
)
RETURNS
@ret TABLE
(
Summe float
)
AS
BEGIN
declare @rapport table
(
id int null
)
INSERT INTO @rapport select WebSourceID from CustRapport
WHERE RapportBaustelleID = @baustelleID AND RapportDatum <= @datum
INSERT INTO @ret
SELECT SUM(LeistungArbeit)
FROM CustLeistung INNER JOIN @rapport as r ON LeistungRapportID = r.id
WHERE LeistungArbeit is not null
AND LeistungInventarID is null AND LeistungArbeit > 0
RETURN
END
Execution plan:
http://s23.postimg.org/mxq9ktudn/execplan1.png
http://s23.postimg.org/doo3aplhn/execplan2.png
Here is some general advice I can provide for now, until you provide more information.
I updated my query to pull straight from the tables instead of going through the views.
INSERT INTO @ret
SELECT
SUM(LeistungArbeit)
FROM (
SELECT DISTINCT WebID FROM XBauAdmin.dbo.CustKunde
) Web
INNER JOIN dbo.TblLeistung ON TblLeistung.WebKundeID=web.webID
INNER JOIN dbo.MainRapport ON MainRapport.WebKundeID=web.webID
AND TblLeistung.LeistungRapportID=MainRapport.WebSourceID
AND MainRapport.RapportBaustelleID = @baustelleID
AND MainRapport.RapportDatum <= @datum
WHERE TblLeistung.LeistungArbeit is not null
AND TblLeistung.LeistungInventarID is null
AND TblLeistung.LeistungArbeit > 0
Get rid of the table variable. They have their use, but I switch to temp tables when I get over 100 records; indexed temp tables simply perform better in my experience.
Update your select to the above query and retest performance.
Check and ensure there are indexes on every column referenced in the query. If you use the actual execution plan, SQL Server will help identify where indexes would be useful.
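For example, indexes along these lines would cover the rewritten query above (the index names and column choices are assumptions based on that query, not your actual schema):
CREATE NONCLUSTERED INDEX IX_TblLeistung_Kunde_Rapport
ON dbo.TblLeistung (WebKundeID, LeistungRapportID)
INCLUDE (LeistungArbeit, LeistungInventarID);
CREATE NONCLUSTERED INDEX IX_MainRapport_Baustelle_Datum
ON dbo.MainRapport (RapportBaustelleID, RapportDatum)
INCLUDE (WebKundeID, WebSourceID);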

SQL Server query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
where
(@a is null or a = @a) and
(@b is null or b = @b) and
... (over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but the total execution time is over 300. We tested other solutions, such as a multi-column index over all the columns in the WHERE clause, but the total execution time was over 400.
Do you have any solution to bring the total execution time below 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized (a rough sketch follows this list).
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
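A minimal sketch of the dynamic SQL option (parameter names follow the question; the column types are guesses):
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';
IF @a IS NOT NULL SET @sql += N' AND a = @a';
IF @b IS NOT NULL SET @sql += N' AND b = @b';
-- ...repeat for the remaining searchable columns...
EXEC sp_executesql @sql,
N'@a int, @b int',                -- guessed types; declare every parameter that can appear
@a = @a, @b = @b;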
One other option that I just remembered implementing in a system once. Create a vertical table that includes all of the data you are searching on and build up a query for it. This is easiest to do with dynamic SQL, but could be done using Table Value Parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
OR (AttributeName = 'state' AND AttributeValue = 'MI')
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= @FirstRow)
AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've got several issues, IMHO. One is that you're going to end up with a sequential scan no matter what you do.
But I think the more crucial issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approaches of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of Perf/Tuning & Query Optimisation. The Approach I've found best is to generate Dynamic SQL inside a Stored Proc. Most times you also need to add "with Recompile" on the statement. The Stored Proc helps reduce potential for SQL injection attacks. The Recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
I agree you should also look at points mentioned above like :-
If you commonly only refer to a small subset of the columns you could create non-clustered "Covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype" (a tiny sketch of this follows at the end of this answer). But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them to use an equality or range query.
While it is often good to find RowIDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire. If the first part of the query does a Clustered Index Scan, then it is often faster to get the other columns you need in the select list and save the second table scan.
So always good to try it both ways & see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (it is on my other machine)
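For what it's worth, a tiny sketch of the bitmask idea (the column name, index name, and bit layout are all invented):
-- One BIGINT packs several small enumerated attributes (invented layout:
-- gender in bit 0, profilestate in bits 1-3, and so on).
ALTER TABLE Profile ADD AttributeMask bigint NOT NULL DEFAULT 0;
CREATE NONCLUSTERED INDEX IX_Profile_AttributeMask ON Profile (AttributeMask);
-- Works best when every packed value is known up front, so the search is a
-- plain equality (or range) and the index can be used -- no bitwise functions
-- in the WHERE clause:
SELECT id FROM Profile WHERE AttributeMask = @RequiredMask;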

SQL Server 2005 Full Text forum Search

I'm working on a search stored procedure for our existing forums.
I've written the following code which uses standard SQL full text indexes, however I'm sure there is a better way of doing it and would like a point in the right direction.
To give some info on how it needs to work: the page has one search text box which, when the search is run, will search thread titles, thread descriptions and post text, and should return the results with title matches first, then descriptions, then post data.
Below is what I've written so far, which works but is not elegant or as fast as I would like. To give an example of performance, with 20K threads and 80K posts it takes about 12 seconds to search for 5 random words.
ALTER PROCEDURE [dbo].[SearchForums]
(
--Input Params
@SearchText VARCHAR(200),
@GroupId INT = -1,
@ClientId INT,
--Paging Params
@CurrentPage INT,
@PageSize INT,
@OutTotalRecCount INT OUTPUT
)
AS
--Create Temp Table to Store Query Data
CREATE TABLE #SearchResults
(
Relevance INT IDENTITY,
ThreadID INT,
PostID INT,
[Description] VARCHAR(2000),
Author BIGINT
)
--Create and populate table of all GroupID's This search will return from
CREATE TABLE #GroupsToSearch
(
GroupId INT
)
IF @GroupId = -1
BEGIN
INSERT INTO #GroupsToSearch
SELECT GroupID FROM SNetwork_Groups WHERE ClientId = @ClientId
END
ELSE
BEGIN
INSERT INTO #GroupsToSearch
VALUES(@GroupId)
END
--Get Thread Titles
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.[ThreadId],
(SELECT NULL) AS PostId,
SNetwork_Threads.[Description],
SNetwork_Threads.[OwnerUserId]
FROM
SNetwork_Threads
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Threads.[Name], @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Get Thread Descriptions
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.[ThreadId],
(SELECT NULL) AS PostId,
SNetwork_Threads.[Description],
SNetwork_Threads.[OwnerUserId]
FROM
SNetwork_Threads
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Threads.[Description], @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Get Posts
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.ThreadId,
SNetwork_Posts.PostId,
SNetwork_Posts.PostText,
SNetwork_Posts.[OwnerUserId]
FROM
SNetwork_Posts
INNER JOIN SNetwork_Threads ON SNetwork_Threads.ThreadId = SNetwork_Posts.ThreadId
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Posts.PostText, @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Return Paged Result Sets
SELECT @OutTotalRecCount = COUNT(*) FROM #SearchResults
SELECT
#SearchResults.[ThreadID],
#SearchResults.[PostID],
#SearchResults.[Description],
#SearchResults.[Author]
FROM
#SearchResults
WHERE
#SearchResults.[Relevance] >= (@CurrentPage - 1) * @PageSize + 1 AND
#SearchResults.[Relevance] <= @CurrentPage * @PageSize
ORDER BY Relevance ASC
--Clean Up
DROP TABLE #SearchResults
DROP TABLE #GroupsToSearch
I know it's a bit long-winded, but just a nudge in the right direction would be well appreciated.
In case it helps, 80% of the query time is taken up when searching posts and, according to the query plan, is spent on a "Clustered Index Scan" on the posts table. I can't see any way around this.
Thanks
Gavin
I'd really have to see an explain plan to know where the slow parts were, as I don't see anything particularly nasty in your code. Very first thing - make sure all your indexes are in good shape, they are being used, statistics are up to date, etc.
One other idea would be to do the search on thread title first, then use the results from that to prune the searches on thread description and post text. Similarly, use the results from the thread description search to prune the post text search.
The basic idea here is that if you find the keywords in the thread title, why bother searching the description and posts? I realize this may not work depending on how you are presenting the search results to the user, and it may not make a huge difference, but it's something to think about.
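A rough sketch of that pruning for the post search, reusing the tables from the question (untested; it assumes the title/description inserts have already populated #SearchResults):
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.ThreadId,
SNetwork_Posts.PostId,
SNetwork_Posts.PostText,
SNetwork_Posts.[OwnerUserId]
FROM
SNetwork_Posts
INNER JOIN SNetwork_Threads ON SNetwork_Threads.ThreadId = SNetwork_Posts.ThreadId
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Posts.PostText, @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId AND
NOT EXISTS (SELECT 1 FROM #SearchResults r WHERE r.ThreadID = SNetwork_Threads.ThreadId)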
80k records isn't that much. I'd recommend not inserting the resulting data into your temp table, and instead only inserting the IDs, then joining to that table afterward. This will save on writing to the temp table, as you may store 10,000 ints instead of 10,000 full posts (of which you discard all but one page). This may reduce the amount of time spent scanning posts as well.
It looks like you would need two temp tables, one for threads and one for posts. You would union them in the final select.
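A minimal sketch of that shape (the temp table names are made up; the other names follow the question):
CREATE TABLE #ThreadHits (Relevance INT IDENTITY, ThreadID INT)
CREATE TABLE #PostHits (Relevance INT IDENTITY, ThreadID INT, PostID INT)
-- populate #ThreadHits from the title/description searches and #PostHits from
-- the post search, storing IDs only, then join back for the page being shown:
SELECT h.Relevance, t.ThreadId, NULL AS PostId, t.[Description], t.[OwnerUserId] AS Author
FROM #ThreadHits h
INNER JOIN SNetwork_Threads t ON t.ThreadId = h.ThreadID
UNION ALL
SELECT h.Relevance + 1000000, p.ThreadId, p.PostId, p.PostText, p.[OwnerUserId] AS Author
FROM #PostHits h
INNER JOIN SNetwork_Posts p ON p.PostId = h.PostID
-- (the +1000000 keeps post matches after thread matches; apply the existing
-- Relevance-based paging over this union)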