How to properly index a SQL Server table with 25 million rows

I have created a table in SQL Server 2008 R2 as follows:
CREATE TABLE [dbo].[7And11SidedDiceGame]
(
[Dice11Sides] [INT] NULL,
[Dice7Sides] [INT] NULL,
[WhoWon] [INT] NULL
)
I added the following index:
CREATE NONCLUSTERED INDEX [idxWhoWon]
ON [dbo].[7And11SidedDiceGame] ([WhoWon] ASC)
I then created a WHILE loop to insert 25 million randomly generated rows to tally the results for statistical analysis.
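(For context, a minimal sketch of the kind of loop involved; the dice logic and the WhoWon values here are assumptions for illustration, not the original code.)
DECLARE @i INT = 0;
BEGIN TRAN; -- wrapping the whole loop in one transaction (as described below) is what made it run acceptably
WHILE @i < 25000000
BEGIN
    INSERT INTO [dbo].[7And11SidedDiceGame] (Dice11Sides, Dice7Sides, WhoWon)
    SELECT d11,
           d7,
           CASE WHEN d11 > d7 THEN 11
                WHEN d7 > d11 THEN 7
                ELSE 0
           END
    FROM (SELECT CAST(RAND(CHECKSUM(NEWID())) * 11 AS INT) + 1 AS d11,
                 CAST(RAND(CHECKSUM(NEWID())) * 7  AS INT) + 1 AS d7) AS roll;
    SET @i = @i + 1;
END;
COMMIT TRAN;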
Once I optimized the insert (using BEGIN TRAN and COMMIT TRAN before and after the loop), the WHILE loop ran decently. However, analyzing the data takes a long time. For example, the following statement takes about 4 minutes to run:
DECLARE @TotalRows real
SELECT @TotalRows = COUNT(*)
FROM [test].[dbo].[7And11SidedDiceGame]
PRINT REPLACE(CONVERT(VARCHAR, CAST(@TotalRows AS money), 1),'.00','')
SELECT
WhoWon, COUNT(WhoWon) AS Total,
((COUNT(WhoWon) * 100) / @TotalRows) AS PercentWinner
FROM
[test].[dbo].[7And11SidedDiceGame]
GROUP BY
WhoWon
My question is: how can I better index the table to speed up retrieval of the data? Or do I need to approach pulling the data in a different manner?

I don't think you can do much here.
The query has to read all 25M rows from the index to count them. Still, 25M rows is not that much, and I'd expect it to take less than 4 minutes on modern hardware.
That is only about 100MB of data to read (25M rows times 4 bytes for the int key; OK, in practice it is more, say 200MB with row overhead, but it still should not take 4 minutes to read 200MB off the disk).
Is the server under a heavy load? Are there a lot of inserts into this table?
You could make a small improvement by defining the WhoWon column as NOT NULL in the table. Do you really have NULL values in it?
Then use COUNT(*) instead of COUNT(WhoWon) in the query.
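For example, a sketch of that change (it assumes the column contains no NULLs; the existing index has to be dropped and recreated around the ALTER):
DROP INDEX [idxWhoWon] ON [dbo].[7And11SidedDiceGame];
ALTER TABLE [dbo].[7And11SidedDiceGame]
    ALTER COLUMN [WhoWon] INT NOT NULL; -- fails if any NULLs are present
CREATE NONCLUSTERED INDEX [idxWhoWon]
    ON [dbo].[7And11SidedDiceGame] ([WhoWon] ASC);
-- COUNT(*) can now safely replace COUNT(WhoWon)
SELECT WhoWon, COUNT(*) AS Total
FROM [test].[dbo].[7And11SidedDiceGame]
GROUP BY WhoWon;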
If this query runs often but the data in the table doesn't change too often, you can create an indexed view that would essentially materialise/cache/pre-calculate these counts, so a query running off that view would be much faster.
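A rough sketch of such a view (the view and index names are placeholders; indexed views require SCHEMABINDING, two-part names, and COUNT_BIG(*)):
CREATE VIEW dbo.vWhoWonTotals
WITH SCHEMABINDING
AS
SELECT WhoWon, COUNT_BIG(*) AS Total
FROM dbo.[7And11SidedDiceGame]
GROUP BY WhoWon;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vWhoWonTotals
    ON dbo.vWhoWonTotals (WhoWon);
On editions other than Enterprise/Developer, query the view WITH (NOEXPAND) so the optimizer uses the materialised index.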

You may be able to speed this by using window functions:
SELECT WhoWon, count(*) AS Total,
count(*) * 100.0 / sum(count(*)) over () as PercentWinner
FROM [test].[dbo].[7And11SidedDiceGame]
GROUP BY WhoWon;
This does not provide the separate print statement.
For performance, try an index on (WhoWon).


Why does changing the WHERE clause to use a variable cause the query to be 4 times slower

I am inserting data from the table "Tags" in the "Recovery" database into the table "Tags" in the "R3" database.
Both databases live in the same SQL Server instance on my laptop.
I have built the insert query, and because the Recovery..Tags table holds around 180M records, I decided to break it into smaller subsets (1 million records at a time).
Here is my query (let's call it Query A):
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between 13000001 and 14000000
It takes around 2 minutes, which is OK.
To make things a bit easier for me, I put the iID range in the WHERE clause into a variable, so my query looks like this (let's call it Query B):
declare @i int = 12
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000
But that causes the insert to become very slow (around 10 minutes).
So I tried Query A again and it took around 2 minutes.
I tried Query B again and it took around 8 minutes!
I am attaching the execution plan for each one (at a site that shows an analysis of the query plan): Query A Plan and Query B Plan.
Any idea why this is happening, and how to fix it?
The big difference in time is due to the very different plans that are being created to join Tags and Reps.
Fundamentally, in version A, it knows how much data is being extracted (a million rows) and it can design an efficient query for that. However, because you are using variables in B to define how much data is being imported, it has to define a more generic query - one that would work for 10 rows, a million rows, or a hundred million rows.
In the plans, here are the relevant sections of the query joining Tags and Reps...
... in A
... and B
Note that in A it takes just over a minute to do the join; in B it takes 6 and a half minutes.
The key thing that appears to take the time is that it does a table scan of the Tags table which takes 5:44 to complete. The plan has this as a table scan, as the next time you run the query you may want many more than 1 million rows.
A secondary issue is that the amount of data it reads (or expects to read) from Reps is also way out of whack. In A it expected to read 2 million rows and read 1421; in B it basically read them all (even though technically it probably only needed the same 1421).
I think you have two main approaches to fix this:
Look at indexing, to remove the table scan on Tags - ensure the indexes match what is needed and allow the query to do a scan on that index (it appears that the index at the top of @MikePetri's answer is what you need, or similar). This way, instead of doing a table scan, it can do an index scan, which can start 'in the middle' of the data set (a table scan must start at either the start or the end of the data set).
Separate this into two processes. The first process gets the relevant million rows from Tags and saves them in a temporary table. The second process uses the data in the temporary table to join to Reps (also try using OPTION (RECOMPILE) in the second query, so that it checks the temporary table's size before creating the plan).
You can even put an index or two (and/or a primary key) on that temporary table to make it better for the next step; a sketch of this approach follows below.
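A sketch of that two-step approach, using the names from the question (the column list is abbreviated here; OPTION (RECOMPILE) lets the optimizer see the temp table's actual row count):
DECLARE @i INT = 12;
-- Step 1: pull the relevant slice of Tags into a temp table.
SELECT T.*
INTO #TagsSlice
FROM Recovery..Tags AS T
WHERE T.iID BETWEEN (1000000 * @i) + 1 AND (@i + 1) * 1000000;
CREATE CLUSTERED INDEX CIX_TagsSlice_RepID ON #TagsSlice (RepID);
-- Step 2: join the slice to Reps and insert.
INSERT INTO R3..Tags (iID, DT, RepID, /* ... remaining columns ... */ RepType)
SELECT TS.iID, TS.DT, TS.RepID, /* ... remaining columns ... */ R.RepType
FROM #TagsSlice AS TS
INNER JOIN Recovery..Reps AS R ON TS.RepID = R.RepID
OPTION (RECOMPILE);
DROP TABLE #TagsSlice;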
The reason the first query is so much faster is it went parallel. This means the cardinality estimator knew enough about the data it had to handle, and the query was large enough to tip the threshold for parallel execution. Then, the engine passed chunks of data for different processors to handle individually, then report back and repartition the streams.
With the value as a variable, it effectively becomes a scalar function evaluation, and a query cannot go parallel with a scalar function, because the value has to be determined before the cardinality estimator can figure out what to do with it. Therefore, it runs in a single thread and is slower.
Some sort of looping mechanism might help. Create the included indexes to assist the engine in handling this request. You can probably find a better looping mechanism, since you are familiar with the identity ranges you care about, but this should get you in the right direction. Adjust for your needs.
With a loop like this, it commits the changes with each loop, so you aren't locking the table indefinitely.
USE Recovery;
GO
CREATE INDEX NCI_iID
ON Tags (iID)
INCLUDE (
DT
,RepID
,tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,value
,Deleted
,sKey
);
GO
CREATE INDEX NCI_RepID ON Reps (RepID) INCLUDE (RepType);
USE R3;
GO
CREATE INDEX NCI_iID ON Tags (iID);
GO
DECLARE @RowsToProcess BIGINT
,@StepIncrement INT = 1000000;
SELECT @RowsToProcess = (
SELECT COUNT(1)
FROM Recovery..tags AS T
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
);
WHILE @RowsToProcess > 0
BEGIN
INSERT INTO R3..Tags
(
iID
,DT
,RepID
,Tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,Value
,Deleted
,sKey
,RepType
)
SELECT TOP (@StepIncrement)
T.iID
,T.DT
,T.RepID
,T.Tag
,T.xmiID
,T.iBegin
,T.iEnd
,T.Confidence
,T.Polarity
,T.Uncertainty
,T.Conditional
,T.Generic
,T.HistoryOf
,T.CodingScheme
,T.Code
,T.CUI
,T.TUI
,T.PreferredText
,T.ValueBegin
,T.ValueEnd
,T.Value
,T.Deleted
,T.sKey
,R.RepType
FROM Recovery..tags AS T
INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
ORDER BY
T.iID;
SET @RowsToProcess = @RowsToProcess - @StepIncrement;
END;

How to speed up this simple query

With SourceTable having > 15MM records and Bad_Phrase having > 3K records, the following query takes almost 10 hours to run on SQL Server 2005 SP4.
Update [SourceTable]
Set Bad_Count = (Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%')
In English, this query is counting the number of times that any phrases listed in Bad_Phrase are a substring of the column [Name] in the SourceTable and then placing that result in the column Bad_Count.
I would like some suggestions on how to have this query run considerably faster.
For lack of a better idea, here is one:
I don't know whether SQL Server natively supports parallelizing an UPDATE statement, but you can try to do it yourself by manually partitioning the work that needs to be done.
For instance, if you can run the following two update statements in parallel, manually or by writing a small app, I'd be curious to see whether you can bring down your total processing time.
Update [SourceTable]
Set Bad_Count=(
Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%'
)
where Name < 'm'
Update [SourceTable]
Set Bad_Count=(
Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%'
)
where Name >= 'm'
So the 1st update statement takes care of updating all the rows whose names start with the letters a-l, and the 2nd takes care of m-z.
It's just an idea, and you can try splitting this into smaller chunks and more parallel update statements, depending on the capacity of your SQL Server machine.
Sounds like your query is scanning the whole table. Do your tables have proper indexes on them? Putting an index on columns that appear in a WHERE clause is a good place to start. You can also get the cost of the query in SQL Server Management Studio (Display Estimated Execution Plan, or Include Actual Execution Plan if you're willing to wait); both are buttons in the query window. The cost will provide insight into what is taking forever and possibly steer you toward writing faster queries.
You are updating the table using a subquery against the same table, so every row update scans the whole table, and that can cause excessive execution time. I think it is better to first insert all the counts into a #temp table and then use the #temp table in your UPDATE statement, or JOIN the source table and the temp table.
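A sketch of that idea, assuming SourceTable has a unique key column called ID (adjust the names to the real schema):
-- Compute the counts once into a temp table...
SELECT s.ID,
       COUNT(b.PHRASE) AS Bad_Count
INTO   #Counts
FROM   SourceTable AS s
LEFT JOIN Bad_Phrase AS b
       ON s.Name LIKE '%' + b.PHRASE + '%'
GROUP BY s.ID;
-- ...then update by joining on the key instead of re-running the subquery per row.
UPDATE s
SET    s.Bad_Count = c.Bad_Count
FROM   SourceTable AS s
INNER JOIN #Counts AS c ON c.ID = s.ID;
DROP TABLE #Counts;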

Passing a table to a stored procedure

I have a table with 20 billion rows. The table does not have any indexes, as it was created on the fly for a bulk insert operation. The table is used in a stored procedure which does the following:
Delete A
from master a
inner join (Select distinct Col from TableB ) b
on A.Col = B.Col
Insert into master
Select *
from tableB
group by col1,col2,col3
TableB is the one that has 20 billion rows. I don't want to execute the SP directly because it might take days to complete. Master is also a huge table and has a clustered index on Col.
Can I pass chunks of rows to the stored procedure and perform the operation in batches? That might reduce the log file growth. If yes, how can I do that?
Should I create a clustered index on the table and execute the SP, which might be a little faster? But then again, I think creating a CI on a huge table might take 10 hours to complete.
Or is there any other way to perform this operation fast?
I've used a method similar to this one. I'd recommend putting your DB into Bulk Logged recovery mode instead of Full recovery mode if you can.
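Switching the recovery model is a one-liner (the database name is a placeholder, and you should coordinate this with your backup strategy, since bulk-logged operations break point-in-time restore for the affected log backups):
ALTER DATABASE [YourDatabase] SET RECOVERY BULK_LOGGED;
-- ... run the batched transfer ...
ALTER DATABASE [YourDatabase] SET RECOVERY FULL;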
Blog entry reproduced below to future proof it.
Below is a technique used to transfer a large amount of records from
one table to another. This scales pretty well for a couple reasons.
First, this will not fill up the entire log prior to committing the
transaction. Rather, it will populate the table in chunks of 10,000
records. Second, it’s generally much quicker. You will have to play
around with the batch size. Sometimes it’s more efficient at 10,000,
sometimes 500,000, depending on the system.
If you do not need to insert into an existing table and just need a
copy of the table, it is better to do a SELECT INTO. However for this
example, we are inserting into an existing table.
Another trick you should do is to change the recovery model of the
database to simple. This way, there will be much less logging in the
transaction log.
The WITH (TABLOCK) below only works in SQL 2008.
DECLARE @BatchSize INT = 10000
WHILE 1 = 1
BEGIN
INSERT INTO [dbo].[Destination] --WITH (TABLOCK) -- Uncomment for 2008
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE NOT EXISTS (
SELECT 1
FROM dbo.Destination
WHERE PersonID = s.PersonID
)
IF @@ROWCOUNT < @BatchSize BREAK
END
With the above example, it is important to have at least a non
clustered index on PersonID in both tables.
Another way to transfer records is to use multiple threads. Specifying
a range of records as such:
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 1 AND 5000
GO
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 5001 AND 10000
For super fast performance however, I’d recommend using SSIS.
Especially in SQL Server 2008. We recently transferred 17 million
records in 5 minutes with an SSIS package executed on the same server
as the two databases it transferred between.
SQL Server 2008 has made changes with regard to its logging
mechanism when inserting records. Previously, to do an insert
that was minimally logged, you would have to perform a SELECT.. INTO.
Now, you can perform a minimally logged insert if you can lock the
table you are inserting into. The WITH (TABLOCK) example above shows
this. The exception to this rule is if you have a clustered index on
the table AND the table is not empty. If the table is empty and you
acquire a table lock and you have a clustered index, it will be
minimally logged. However if you have data in the table, the insert
will be logged. Now if you have a non clustered index on a heap and
you acquire a table lock then only the non clustered index will be
logged. It is always better to drop indexes prior to inserting
records.
To determine the amount of logging you can use the following statement
SELECT * FROM ::fn_dblog(NULL, NULL)
Credit for above goes to Derek Dieter at SQL Server Planet.
If you're dead set on passing a table to your stored procedure, you can pass a table-valued parameter to a stored procedure in SQL Server 2008. You might have better luck with some of the other approaches suggested, like partitioning. A SELECT DISTINCT on a table with 20 billion rows might be part of the problem. I wonder if some very basic tuning wouldn't help, too:
Delete A
from master a
where exists (select 1 from TableB b where b.Col = a.Col)
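As for the table-valued parameter route mentioned above, here is a bare-bones sketch (the type and procedure names are made up for illustration):
CREATE TYPE dbo.ColList AS TABLE (Col INT NOT NULL PRIMARY KEY);
GO
CREATE PROCEDURE dbo.ProcessChunk
    @Rows dbo.ColList READONLY -- TVPs must be declared READONLY
AS
BEGIN
    DELETE A
    FROM master AS A
    INNER JOIN @Rows AS B ON A.Col = B.Col;
END;
GO
-- The caller fills a variable of the table type with one chunk and passes it in.
DECLARE @chunk dbo.ColList;
INSERT INTO @chunk (Col)
SELECT DISTINCT Col FROM TableB WHERE Col BETWEEN 1 AND 1000000;
EXEC dbo.ProcessChunk @Rows = @chunk;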

Numbering rows in a view

I am connecting to an SQL database using a PLC, and need to return a list of values. Unfortunately, the PLC has limited memory, and can only retrieve approximately 5,000 values at any one time, however the database may contain up to 10,000 values.
As such I need a way of retrieving these values in 2 operations. Unfortunately the PLC is limited in the query it can perform, and is limited to only SELECT and WHERE commands, so I cannot use LIMIT or TOP or anything like that.
Is there a way in which I can create a view and auto-number every row in that view? I could then query all records < 5,000, followed by a second query for those < 10,000, etc.
Unfortunately it seems that views do not support an identity column, so this would need to be done manually.
Does anyone have any suggestions? My only realistic option at the moment seems to be to create two views, one with the first 5,000 rows and one with the next 5,000...
I am using SQL Server 2000 if that makes a difference...
There are 2 solutions. The easiest is to modify your SQL table and add an IDENTITY column. If that is not a possibility, then you'll have to do something like the query below. For 10,000 rows it shouldn't be too slow, but as the table grows it will perform worse and worse.
SELECT Col1, Col2, (SELECT COUNT(i.Col1)
FROM yourtable i
WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o
While the code provided by Derek does what I asked (i.e. it numbers each row in the view), the performance is really poor: approximately 20 seconds to number 100 rows. As such it is not a workable solution. An alternative is to number the first 5,000 records with a 1, and the next 5,000 with a 2. This can be done with 3 simple queries, and is far quicker to execute.
The code to do so is as follows:
SELECT TOP(5000) BCode, SAPCode, 1 as GroupNo FROM dbo.DB
UNION
SELECT TOP (10000) BCode, SAPCode, 2 as GroupNo FROM dbo.DB p
WHERE ID NOT IN (SELECT TOP(5000) ID FROM dbo.DB)
Although, as pointed out by Andriy M, you should also specify an explicit sort to ensure that you don't miss any records.
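With the explicit sort added, the same idea might look like this (ID is assumed to be the table's unique key; the derived tables are needed because ORDER BY is only allowed together with TOP inside a subquery):
SELECT BCode, SAPCode, 1 AS GroupNo
FROM (SELECT TOP 5000 BCode, SAPCode
      FROM dbo.DB
      ORDER BY ID) AS FirstBlock
UNION
SELECT BCode, SAPCode, 2 AS GroupNo
FROM (SELECT TOP 10000 ID, BCode, SAPCode
      FROM dbo.DB
      ORDER BY ID) AS FirstTwoBlocks
WHERE ID NOT IN (SELECT TOP 5000 ID
                 FROM dbo.DB
                 ORDER BY ID)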
One possibility might be to use a multi-statement table-valued function, such as
CREATE FUNCTION dbo.OrderedBCodeData()
RETURNS #Data TABLE (RowNumber int IDENTITY(1,1),BCode int,SAPCode int)
AS
BEGIN
INSERT INTO #Data (BCode,SAPCode)
SELECT BCode,SAPCode FROM dbo.DB ORDER BY BCode
RETURN
END
And select from this function such as
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5000 AND 10000
I haven't ever used this in production; in fact it was just a quick idea this morning, but it may be worth exploring as a neater alternative.

Unexpected #temp table performance

Bounty open:
Ok people, the boss needs an answer and I need a pay rise. It doesn't seem to be a cold caching issue.
UPDATE:
I've followed the advice below to no avail. However, the client statistics threw up an interesting set of numbers.
@temp vs #temp
Number of INSERT, DELETE and UPDATE statements: 0 vs 1
Rows affected by INSERT, DELETE, or UPDATE statements: 0 vs 7647
Number of SELECT statements: 0 vs 0
Rows returned by SELECT statements: 0 vs 0
Number of transactions: 0 vs 1
The most interesting are the number of rows affected and the number of transactions. To remind you, the queries below return identical result sets, just into different styles of tables.
The following queries are basically doing the same thing. They both select a set of results (about 7,000 rows) and populate it into either a temp table or a table variable. In my mind the table variable @temp should be created and populated quicker than the temp table #temp; however, the table variable in the first example takes 1 min 15 sec to execute, whereas the temp table in the second example takes 16 seconds.
Can anyone offer an explanation?
declare @temp table (
id uniqueidentifier,
brand nvarchar(255),
field nvarchar(255),
date datetime,
lang nvarchar(5),
dtype varchar(50)
)
insert into @temp (id, brand, field, date, lang, dtype)
select id, brand, field, date, lang, dtype
from view
where brand = 'myBrand'
-- takes 1:15
vs
select id, brand, field, date, lang, dtype
into #temp
from view
where brand = 'myBrand'
DROP TABLE #temp
-- takes 16 seconds
I believe this almost completely comes down to table variable vs. temp table performance.
Table variables are optimized for having exactly one row. When the query optimizer chooses an execution plan, it does so on the (often false) assumption that the table variable only has a single row.
I can't find a good source for this, but it is at least mentioned here:
http://technet.microsoft.com/en-us/magazine/2007.11.sqlquery.aspx
Other related sources:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=125052
http://databases.aspfaq.com/database/should-i-use-a-temp-table-or-a-table-variable.html
Run both with SET STATISTICS IO ON and SET STATISTICS TIME ON. Run 6-7 times each, discard the best and worst results for both cases, then compare the two average times.
I suspect the difference is primarily from a cold cache (first execution) vs. a warm cache (second execution). The output from STATISTICS IO would give away such a case, as a big difference in the physical reads between the runs.
And make sure you have 'lab' conditions for the test: no other tasks running (no lock contention), databases (including tempdb) and logs are pre-grown to required size so you don't hit any log growth or database growth event.
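A minimal harness for that measurement, using the objects from the question ([view] stands in for the actual view name):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Variant 1: table variable
declare @temp table (
id uniqueidentifier,
brand nvarchar(255),
field nvarchar(255),
date datetime,
lang nvarchar(5),
dtype varchar(50)
)
insert into @temp (id, brand, field, date, lang, dtype)
select id, brand, field, date, lang, dtype
from [view]
where brand = 'myBrand'
-- Variant 2: temp table
select id, brand, field, date, lang, dtype
into #temp
from [view]
where brand = 'myBrand'
DROP TABLE #temp
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
A large gap in physical reads between the first and second run points at a cold vs. warm cache, as described above.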
This is not uncommon. Table variables can be (and in a lot of cases ARE) slower than temp tables. Here are some of the reasons for this:
SQL Server maintains statistics for queries that use temporary tables but not for queries that use table variables. Without statistics, SQL Server might choose a poor processing plan for a query that contains a table variable.
Non-clustered indexes cannot be created on table variables, other than the system indexes that are created for a PRIMARY KEY or UNIQUE constraint. That can influence query performance when compared to a temporary table with non-clustered indexes (a sketch of this difference appears after this list).
Table variables use internal metadata in a way that prevents the engine from using a table variable within a parallel query (this means that it won't take advantage of multi-processor machines).
A table variable is optimized for one row, by SQL Server (it assumes 1 row will be returned).
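To illustrate the indexing point in particular, a small sketch (the column names are made up):
-- A temp table can get arbitrary non-clustered indexes after it is created...
CREATE TABLE #results (id uniqueidentifier, brand nvarchar(255));
CREATE NONCLUSTERED INDEX IX_results_brand ON #results (brand);
-- ...whereas a table variable (before SQL Server 2014) only gets the indexes
-- that back a PRIMARY KEY or UNIQUE constraint declared inline.
DECLARE @results TABLE (
    id uniqueidentifier PRIMARY KEY,
    brand nvarchar(255)
);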
I'm not 100% sure that this is the cause, but the table variable will not have any statistics whereas the temp table will.
SELECT INTO is a minimally logged operation, which would likely explain most of the performance difference. INSERT writes a log entry for every row.
Additionally, SELECT INTO is creating the table as part of the operation, so SQL Server knows automatically that there are no constraints on it, which may factor in.
If it takes over a full minute to insert 7000 records into a temp table (persistent or variable), then the perf issue is almost certainly in the SELECT statement that's populating it.
Have you run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before profiling? I'm thinking that maybe it's using some cached results for the second query.