I'm creating a database that holds yield values for electric engines. The yield values are stored in an Excel file which I have to transfer to the database. Each test for an engine has 42 rows (torque) and 42 columns (power in kW), with the yield values stored in the cells.
            (kW)   1,0    1,2   ... (42 columns)
                  -----  -----
(rpm) 2000         76,2   77,0
      2100         76,7   77,6
      ...
(42 rows)
Well, I thought of creating a column for engine_id, a column for test_id (each engine can have more than one test), and 42 columns for the corresponding yield values. For each test I would have to add 42 rows for a single engine. This seems neither efficient nor easy to implement to me.
If there are 42 records (rows) for one single engine, the database will soon hold several thousand rows, and searching for a specific engine with the corresponding values will be an exhausting task.
If I create a separate table for each test of a specific engine, after some time I would probably have thousands of tables. So what should I go for: a table with thousands of records, or a table with 42 columns and 42 rows? Either way, I still have redundant records.
A database is definitely the answer: searching through many millions, or hundreds of millions, of rows is pretty easy once you get the hang of SQL (the language for interacting with databases). I would recommend a table structure of
EngineId, TestId, Torque, Power, YieldValue
Which would have values...
Engine1, Test1, 2000, 1.0, 76.2
So only 5 columns. This gives you the flexibility to add more yield results in the future should it be required (and even if it's not, it's just an easier schema anyway). You will need to learn SQL, however, to realise the power of the database over a spreadsheet. Also, there are many techniques for importing Excel data into SQL, so you should investigate that (Google it). If you find yourself transferring all that data by hand then you are doing something wrong (not wrong really, but inefficient!).
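For the Excel import, here is a minimal sketch of one common route, OPENROWSET with the ACE OLE DB provider (this assumes the provider is installed and ad hoc distributed queries are enabled; the file path and sheet name are hypothetical). The 42x42 matrix would still need to be unpivoted into the 5-column shape afterwards.
SELECT *
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 12.0;Database=C:\data\engine_tests.xlsx;HDR=YES',
                'SELECT * FROM [Sheet1$]');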
Further to your comments, here is the exact schema with index (in MS SQL Server)
CREATE TABLE [dbo].[EngineTestResults](
    [EngineId] [varchar](50) NOT NULL,
    [TestId] [varchar](50) NOT NULL,
    [Torque] [int] NOT NULL,
    [Power] [decimal](18, 4) NOT NULL,
    [Yield] [decimal](18, 4) NOT NULL,
    CONSTRAINT [PK_EngineTestResults] PRIMARY KEY CLUSTERED
    (
        [EngineId] ASC,
        [TestId] ASC,
        [Torque] ASC,
        [Power] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
/****** Object: Index [IX_EngineTestResults] Script Date: 01/14/2012 14:26:21 ******/
CREATE NONCLUSTERED INDEX [IX_EngineTestResults] ON [dbo].[EngineTestResults]
(
[EngineId] ASC,
[TestId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
So note that there is no incrementing primary key; the key is (EngineId, TestId, Torque, Power). To get the results for a particular engine you would run a query like the following:
Select * from EngineTestResults where engineId = 'EngineABC' and TestId = 'TestA'
Note that I have added an index for that set of criteria.
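To make the shape concrete, here is a sketch of how the sample cells from the matrix above would land in the table (each spreadsheet cell becomes one row, so a full 42x42 test produces 1,764 rows; the engine/test ids are the hypothetical ones from the query above):
INSERT INTO [dbo].[EngineTestResults] (EngineId, TestId, Torque, Power, Yield)
VALUES ('EngineABC', 'TestA', 2000, 1.0, 76.2),
       ('EngineABC', 'TestA', 2000, 1.2, 77.0),
       ('EngineABC', 'TestA', 2100, 1.0, 76.7),
       ('EngineABC', 'TestA', 2100, 1.2, 77.6);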
The strength of a relational database is the ability to normalize data across multiple tables, so you could have one table for engines, one for tests and one for results. Something like the following:
CREATE TABLE tbl__engines (
`engine_id` SMALLINT UNSIGNED NOT NULL,
`name` VARCHAR(255) NOT NULL,
PRIMARY KEY(engine_id)
);
CREATE TABLE tbl__tests (
`test_id` INT UNSIGNED NOT NULL,
`engine_id` SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(test_id),
FOREIGN KEY(engine_id) REFERENCES tbl__engines(engine_id)
);
CREATE TABLE tbl__test_result (
`result_id` INT UNSIGNED NOT NULL,
`test_id` INT UNSIGNED NOT NULL,
`torque` INT NOT NULL,
`power` DECIMAL(6,2) NOT NULL,
`yield` DECIMAL(6,2) NOT NULL,
FOREIGN KEY(test_id) REFERENCES tbl__tests(test_id)
);
Then you can simply perform a join across these three tables to return the required results. Something like:
SELECT
*
FROM `tbl__engines` e
INNER JOIN `tbl__tests` t ON e.engine_id = t.engine_id
INNER JOIN `tbl__test_result` r ON r.test_id = t.test_id;
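To fetch the results for a single engine, filter the joined set; a minimal sketch (the engine id is hypothetical):
SELECT t.test_id, r.torque, r.power, r.yield
FROM `tbl__tests` t
INNER JOIN `tbl__test_result` r ON r.test_id = t.test_id
WHERE t.engine_id = 1;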
I have the following table structure:
CREATE TABLE [dbo].[TableABC](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[FieldA] [nvarchar](36) NULL,
[FieldB] [int] NULL,
[FieldC] [datetime] NULL,
[FieldD] [nvarchar](255) NULL,
[FieldE] [decimal](19, 5) NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
I do two types of CRUD operations on this table.
SELECT * FROM [dbo].[TableABC] WHERE FieldA = @FieldA
INSERT INTO [dbo].[TableABC](FieldA, FieldB, FieldC, FieldD, FieldE) VALUES (@FieldA, @FieldB, @FieldC, @FieldD, @FieldE)
FieldA has a unique value, but there is no constraint in the table.
Currently there are 6,070,755 rows in the table. As the data grows, performance is getting slower.
Any suggestions on how to improve performance? How can I make the CREATE and READ operations faster?
I now face the problem that SELECT and INSERT take too long, sometimes more than 60 seconds.
Read up on SQL basics, and indices are DEFINITELY one of them. If you have a unique value and no index on the field (the constraint is irrelevant; a unique index is good enough), then yes, that will get slower: SQL Server has to check the whole table.
So:
Add a unique index to FieldA.
Given your 2 statements and the little note "FieldA has a unique value, but there is no constraint in the table," I assume you are trying to enforce unique values by selecting first. This will slow you down.
Instead, make the index and then try/catch the non-unique SQL errors - WAY faster. WAY faster. The index will make the insert a LITTLE slower, but you completely save the very slow select.
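A sketch of that approach (the index name is my own; error 2601 is a duplicate key in a unique index and 2627 a unique constraint violation; THROW requires SQL Server 2012 or later):
CREATE UNIQUE NONCLUSTERED INDEX UX_TableABC_FieldA
    ON [dbo].[TableABC] (FieldA);

BEGIN TRY
    -- insert directly instead of SELECTing first
    INSERT INTO [dbo].[TableABC] (FieldA, FieldB, FieldC, FieldD, FieldE)
    VALUES (@FieldA, @FieldB, @FieldC, @FieldD, @FieldE);
END TRY
BEGIN CATCH
    -- swallow only duplicate-key errors; re-raise everything else
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;
END CATCH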
I have an insert statement that's throwing a primary key error but I don't see how I could possibly be inserting duplicate key values.
First I create a temp table with a primary key.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED -- Note: I've tried COMMITTED and UNCOMMITTED; neither materially affects the behavior. See the screenshots below for proof.
IF (OBJECT_ID('TEMPDB..#P')) IS NOT NULL DROP TABLE #P;
CREATE TABLE #P(idIsbn INT NOT NULL PRIMARY KEY, price SMALLMONEY, priceChangedDate DATETIME);
Then I pull prices from the Price table, grouping by idIsbn, which is the primary key in the temp table.
INSERT INTO #P(idIsbn, price, priceChangedDate)
SELECT idIsbn ,
MIN(lowestPrice) ,
MIN(priceChangedDate)
FROM Price p
WHERE p.idMarketplace = 3100
GROUP BY p.idIsbn
I understand that grouping by idIsbn by definition makes it unique. The idIsbn column in the Price table is: [idIsbn] [int] NOT NULL.
But every once in a while when I run this query I get this error:
Violation of PRIMARY KEY constraint 'PK__#P________AED35F8119E85FC5'. Cannot insert duplicate key in object 'dbo.#P'. The duplicate key value is (1447858).
NOTE: I've gotten a lot of questions about timing. I will select this statement, press F5, and no error will occur. Then I'll do it again and it will fail; then I'll run it again and again and it will succeed a couple of times before failing again. I guess what I'm saying is that I can find no pattern for when it will succeed and when it won't.
How can I be inserting duplicate rows if (A) I just created the table brand new before inserting into it and (B) I'm grouping by the column designed to be the primary key?
For now, I'm solving the problem with IGNORE_DUP_KEY = ON, but I'd really like to know the root cause of the problem.
Here is what I'm actually seeing in my SSMS window. There is nothing more and nothing less:
@@VERSION is:
Microsoft SQL Server 2008 (SP3) - 10.0.5538.0 (X64)
Apr 3 2015 14:50:02
Copyright (c) 1988-2008 Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
Execution Plan:
Here is an example of what it looks like when it runs fine. Here I'm using READ COMMITTED, but it doesn't matter b/c I get the error no matter whether I read it committed or uncommitted.
Here is another example of it failing, this time w/ READ COMMITTED.
Also:
I get the same error whether I'm populating a temp table or a persistent table.
When I add OPTION (MAXDOP 1) to the end of the insert it seems to fail every time, though I can't be exhaustively sure of that b/c I can't run it for infinity. But it seems to be the case.
Here is the definition of the Price table. The table has 25M rows and had 108,529 updates in the last hour.
CREATE TABLE [dbo].[Price](
[idPrice] [int] IDENTITY(1,1) NOT NULL,
[idIsbn] [int] NOT NULL,
[idMarketplace] [int] NOT NULL,
[lowestPrice] [smallmoney] NULL,
[offers] [smallint] NULL,
[priceDate] [smalldatetime] NOT NULL,
[priceChangedDate] [smalldatetime] NULL,
CONSTRAINT [pk_Price] PRIMARY KEY CLUSTERED
(
[idPrice] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [uc_idIsbn_idMarketplace] UNIQUE NONCLUSTERED
(
[idIsbn] ASC,
[idMarketplace] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
And the two non-clustered indexes:
CREATE NONCLUSTERED INDEX [IX_Price_idMarketplace_INC_idIsbn_lowestPrice_priceDate] ON [dbo].[Price]
(
[idMarketplace] ASC
)
INCLUDE ( [idIsbn],
[lowestPrice],
[priceDate]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_Price_idMarketplace_priceChangedDate_INC_idIsbn_lowestPrice] ON [dbo].[Price]
(
[idMarketplace] ASC,
[priceChangedDate] ASC
)
INCLUDE ( [idIsbn],
[lowestPrice]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
You hadn't supplied your table structure.
This is a repro, with some assumed details, that causes the problem at READ COMMITTED. (NB: now that you have supplied the definition, I can see that in your case updates to the priceChangedDate column will move rows around in the IX_Price_idMarketplace_priceChangedDate_INC_idIsbn_lowestPrice index, if that's the one being seeked.)
Connection 1 (Set up tables)
USE tempdb;
CREATE TABLE Price
(
SomeKey INT PRIMARY KEY CLUSTERED,
idIsbn INT IDENTITY UNIQUE,
idMarketplace INT DEFAULT 3100,
lowestPrice SMALLMONEY DEFAULT $1.23,
priceChangedDate DATETIME DEFAULT GETDATE()
);
CREATE NONCLUSTERED INDEX ix
ON Price(idMarketplace)
INCLUDE (idIsbn, lowestPrice, priceChangedDate);
INSERT INTO Price
(SomeKey)
SELECT number
FROM master..spt_values
WHERE number BETWEEN 1 AND 2000
AND type = 'P';
Connection 2
Concurrent data modifications that move a row from the beginning of the seeked range (3100, 1) to the end (3100, 2001) and back again, repeatedly.
USE tempdb;
WHILE 1=1
BEGIN
UPDATE Price SET SomeKey = 2001 WHERE SomeKey = 1
UPDATE Price SET SomeKey = 1 WHERE SomeKey = 2001
END
Connection 3 (Do the insert into a temp table with a unique constraint)
USE tempdb;
CREATE TABLE #P
(
idIsbn INT NOT NULL PRIMARY KEY,
price SMALLMONEY,
priceChangedDate DATETIME
);
WHILE 1 = 1
BEGIN
TRUNCATE TABLE #P
INSERT INTO #P
(idIsbn,
price,
priceChangedDate)
SELECT idIsbn,
MIN(lowestPrice),
MIN(priceChangedDate)
FROM Price p
WHERE p.idMarketplace = 3100
GROUP BY p.idIsbn
END
The plan has no aggregate, as there is a unique constraint on idIsbn (a unique constraint on (idIsbn, idMarketplace) would also work), so the GROUP BY can be optimised out: there are no duplicate values.
But at the READ COMMITTED isolation level, shared row locks are released as soon as the row is read, so it is possible for a row to move places and be read a second time by the same seek or scan.
The index ix doesn't explicitly include SomeKey as a secondary key column, but as the index is not declared unique, SQL Server silently includes the clustering key behind the scenes; hence updating that column's value can move rows around in it.
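One possible mitigation (my assumption, not part of the original answer) is to hold range locks on the source table for the duration of the statement, so rows can't move between reads:
INSERT INTO #P (idIsbn, price, priceChangedDate)
SELECT idIsbn,
       MIN(lowestPrice),
       MIN(priceChangedDate)
FROM Price p WITH (HOLDLOCK) -- serializable semantics for this table only
WHERE p.idMarketplace = 3100
GROUP BY p.idIsbn;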
I have a SQL Server database with a table containing many records. It used to work fine, but now the SQL statement takes a long time to execute.
Sometimes it causes the SQL database to use too much CPU.
This is the Query for the table.
CREATE TABLE [dbo].[tblPAnswer1](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[AttrID] [int] NULL,
[Kidato] [int] NULL,
[Wav] [int] NULL,
[Was] [int] NULL,
[ShuleID] [int] NULL,
[Mwaka] [int] NULL,
[Swali] [float] NULL,
[Wilaya] [int] NULL,
CONSTRAINT [PK_tblPAnswer1] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
And the following is the SQL stored procedure for the statement.
ALTER PROC [dbo].[uspGetPAnswer1](@ShuleID int, @Mwaka int, @Swali float, @Wilaya int)
AS
SELECT ID,
AttrID,
Kidato,
Wav,
Was,
ShuleID,
Mwaka,
Swali,
Wilaya
FROM dbo.tblPAnswer1
WHERE [ShuleID] = @ShuleID
  AND [Mwaka] = @Mwaka
  AND [Swali] = @Swali
  AND [Wilaya] = @Wilaya
What is wrong with my SQL statement? I need help.
Just add an index on the ShuleID, Mwaka, Swali and Wilaya columns. The order of columns in the index should depend on the distribution of the data (the most selective columns, those with the most diverse values, should come first in the index, and so on).
And if you need it super-fast, also include all the remaining columns used in the query, to have a covering index for this particular query.
EDIT: You should probably move the float column (Swali) from the indexed to the included columns; a sketch follows.
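A sketch of that suggestion (the index name and key order are assumptions; tune the key order to your data's selectivity):
CREATE NONCLUSTERED INDEX IX_tblPAnswer1_ShuleID_Mwaka_Wilaya
    ON dbo.tblPAnswer1 (ShuleID, Mwaka, Wilaya)
    INCLUDE (Swali, AttrID, Kidato, Wav, Was);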
Add an index on the ID column and include the ShuleID, Mwaka, Swali and Wilaya columns. That should help improve the speed of the query.
CREATE NONCLUSTERED INDEX IX_ID_ShuleID_Mwaka_Swali_Wilaya
ON tblPAnswer1 (ID)
INCLUDE (ShuleID, Mwaka, Swali, Wilaya);
What is the size of the table? You may need additional indices as you are not using the primary key to query the data. This article by Pinal Dave provides a script to identify missing indices.
http://blog.sqlauthority.com/2011/01/03/sql-server-2008-missing-index-script-download/
It provides a good starting point for index optimization.
I recently discovered included columns in SQL Server indexes. Do included columns in an index take up extra memory, or are they stored on disk?
Also, can someone point me to the performance implications of including columns of differing data types as included columns in a primary key, which in my case is typically an int?
Thanks.
I don't fully understand the question: "Do included columns in an index take up extra memory or are they stored on disk?" Indexes are both stored on disk (for persistence) and in memory (for performance when being used).
The answer to your question is that the non-key columns are stored in the index, and hence are stored both on disk and in memory, along with the rest of the index. Included columns do have a significant performance advantage over key columns in the index. To understand this advantage, you have to understand that key values may be stored more than once in a B-tree index structure: they are used both as "nodes" in the tree and as "leaves" (the latter point to the actual records in the table). Non-key values are stored only in leaves, which potentially provides a big savings in storage.
Such savings mean that more of the index can be stored in memory in a memory-limited environment, and that the index takes up less memory, allowing memory to be used for other things.
The use of included columns is to allow the index to be a "covering" index for queries, with a minimum of additional overhead. An index "covers" a query when all the columns needed for the query are in the index, so the index can be used instead of the original data pages. This can be a significant performance savings.
The place to go to learn more about them is the Microsoft documentation.
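For reference, a minimal sketch of the syntax (the table and column names are hypothetical):
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)        -- key column: used to seek
    INCLUDE (OrderDate, TotalAmount); -- nonkey columns: stored only in the leaf level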
In SQL Server 2005 and later versions, you can extend the functionality of nonclustered indexes by adding nonkey columns to the leaf level of the nonclustered index.
By including nonkey columns, you can create nonclustered indexes that cover more queries. This is because the nonkey columns have the following benefits:
• They can be data types not allowed as index key columns. (All data types are allowed except text, ntext, and image.)
• They are not considered by the Database Engine when calculating the number of index key columns or index key size. You can include nonkey columns in a nonclustered index to avoid exceeding the current index size limitations of a maximum of 16 key columns and a maximum index key size of 900 bytes.
An index with included nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed, resulting in fewer disk I/O operations.
Example:
Create Table Script
CREATE TABLE [dbo].[Profile](
[EnrollMentId] [int] IDENTITY(1,1) NOT NULL,
[FName] [varchar](50) NULL,
[MName] [varchar](50) NULL,
[LName] [varchar](50) NULL,
[NickName] [varchar](50) NULL,
[DOB] [date] NULL,
[Qualification] [varchar](50) NULL,
[Profession] [varchar](50) NULL,
[MaritalStatus] [int] NULL,
[CurrentCity] [varchar](50) NULL,
[NativePlace] [varchar](50) NULL,
[District] [varchar](50) NULL,
[State] [varchar](50) NULL,
[Country] [varchar](50) NULL,
[UIDNO] [int] NOT NULL,
[Detail1] [varchar](max) NULL,
[Detail2] [varchar](max) NULL,
[Detail3] [varchar](max) NULL,
[Detail4] [varchar](max) NULL,
PRIMARY KEY CLUSTERED
(
[EnrollMentId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
Stored procedure script
CREATE Proc [dbo].[InsertIntoProfileTable]
AS
BEGIN
    SET NOCOUNT ON
    Declare @currentRow int
    Declare @Details varchar(Max)
    Declare @dob Date
    set @currentRow = 1;
    set @Details = 'Let''s think about the book. Every page in the book has the page number. All information in this book is presented sequentially based on this page number. Speaking in the database terms, page number is the clustered index. Now think about the glossary at the end of the book. This is in alphabetical order and allow you to quickly find the page number specific glossary term belongs to. This represents non-clustered index with glossary term as the key column. Now assuming that every page also shows "chapter" title at the top. If you want to find in what chapter is the glossary term, you have to lookup what page # describes glossary term, next - open corresponding page and see the chapter title on the page. This clearly represents key lookup - when you need to find the data from non-indexed column, you have to find actual data record (clustered index) and look at this column value. Included column helps in terms of performance - think about glossary where each chapter title includes in addition to glossary term. If you need to find out what chapter the glossary term belongs - you don''t need to open actual page - you can get it when you lookup the glossary term. So included column are like those chapter titles. Non clustered Index (glossary) has addition attribute as part of the non-clustered index. Index is not sorted by included columns - it just additional attributes that helps to speed up the lookup (e.g. you don''t need to open actual page because information is already in the glossary index).'
    while(@currentRow <= 200000)
    BEGIN
        insert into dbo.Profile values(
            'FName' + Cast(@currentRow as varchar),
            'MName' + Cast(@currentRow as varchar),
            'LName' + Cast(@currentRow as varchar),
            'NickName' + Cast(@currentRow as varchar),
            DATEADD(DAY, ROUND(10000*RAND(), 0), '01-01-1980'),
            NULL, NULL, @currentRow % 3, NULL, NULL, NULL, NULL, NULL,
            1000 + @currentRow, @Details, @Details, @Details, @Details)
        set @currentRow += 1;
    END
    SET NOCOUNT OFF
END
GO
Using the above SP you can insert 200,000 records at one time.
You can see that there is a clustered index on the column "EnrollMentId".
Now create a non-clustered index on the "UIDNO" column.
Script
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20140216-223309] ON [dbo].[Profile]
(
[UIDNO] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Now run the following query:
select UIDNO, FName, DOB, MaritalStatus, Detail1 from dbo.Profile -- Takes about 30-50 seconds and returns 200,000 results.
Query 2
select UIDNO,FName,DOB, MaritalStatus, Detail1 from dbo.Profile
where DOB between '01-01-1980' and '01-01-1985'
-- Takes about 10-15 seconds and returns 36,479 records.
Now drop the above non-clustered index and re-create it with the following script:
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20140216-231011] ON [dbo].[Profile]
(
[UIDNO] ASC,
[FName] ASC,
[DOB] ASC,
[MaritalStatus] ASC,
[Detail1] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
It will throw the following error:
Msg 1919, Level 16, State 1, Line 1
Column 'Detail1' in table 'dbo.Profile' is of a type that is invalid for use as a key column in an index.
That's because we cannot use the varchar(max) data type as an index key column.
Now create a non-clustered index with included columns using the following script:
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20140216-231811] ON [dbo].[Profile]
(
[UIDNO] ASC
)
INCLUDE ( [FName],
[DOB],
[MaritalStatus],
[Detail1]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Now run the following query:
select UIDNO, FName, DOB, MaritalStatus, Detail1 from dbo.Profile -- Takes about 20-30 seconds and returns 200,000 results.
Query 2
select UIDNO,FName,DOB, MaritalStatus, Detail1 from dbo.Profile
where DOB between '01-01-1980' and '01-01-1985'
-- Takes about 3-5 seconds and returns 36,479 records.
Included columns provide functionality similar to a clustered index, where the row contents are kept in the leaf node of the primary index. In addition to the key columns of the index, additional attributes are kept in the index's leaf nodes.
This permits immediate access to the column values without having to access another page in the database. There is a trade-off: increased index size and general storage, against the improved response from not having to indirect through a page reference in the index. The impact is likely similar to that of adding multiple indices to a table.
From the SQL Server documentation:
An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed, resulting in fewer disk I/O operations.
I have a database table with 5 million rows. The clustered index is an auto-increment identity column. The PK is a code-generated VARCHAR of length 256, which is a SHA256 hash of a URL; it has a non-clustered index on the table.
The table is as follows:
CREATE TABLE [dbo].[store_image](
[imageSHAID] [nvarchar](256) NOT NULL,
[imageGUID] [uniqueidentifier] NOT NULL,
[imageURL] [nvarchar](2000) NOT NULL,
[showCount] [bigint] NOT NULL,
[imageURLIndex] AS (CONVERT([nvarchar](450),[imageURL],(0))),
[autoIncID] [bigint] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_imageSHAID] PRIMARY KEY NONCLUSTERED
(
[imageSHAID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [autoIncPK] ON [dbo].[store_image]
(
[autoIncID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
imageSHAID is a SHA256 hash of an image URL, e.g. "http://blah.com/image1.jpg"; it is hashed into a varchar of length 256.
imageGUID is a code-generated GUID by which I identify the image (it will be used as an index later, but for now I have omitted this column as an index).
imageURL is the full URL of the image (up to 2000 characters).
showCount is the number of times the image has been shown; it is incremented each time this particular image is shown.
imageURLIndex is a computed column limited to 450 characters; this allows me to do text searches on the imageURL should I choose to, and it is indexable (again, the index is omitted for brevity).
autoIncID is the clustered index, which should allow faster inserting of data.
Periodically I merge from a temp table into the store_image table. The temp table structure is as follows (very similar to the store_image table):
CREATE TABLE [dbo].[store_image_temp](
[imageSHAID] [nvarchar](256) NULL,
[imageURL] [nvarchar](2000) NULL,
[showCount] [bigint] NULL,
) ON [PRIMARY]
GO
When the merge process is run, I write a DataTable to the temp table using the following code:
using (SqlBulkCopy bulk = new SqlBulkCopy(storeConn, SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls, null))
{
bulk.DestinationTableName = "[dbo].[store_image_temp]";
bulk.WriteToServer(imageTableUpsetDataTable);
}
I then run the merge command to update the showCount in the store_image table, merging from the temp table based on the imageSHAID. If the image doesn't already exist in the store_image table, I create it:
merge into store_image as Target using [dbo].[store_image_temp] as Source
on Target.imageSHAID=Source.imageSHAID
when matched then update set
Target.showCount=Target.showCount+Source.showCount
when not matched then insert values (Source.imageSHAID,NEWID(), Source.imageURL, Source.showCount);
I'm typically merging 2k-5k rows from the temp table into the store_image table in any one merge process.
I used to run this DB on an SSD (connected via SATA 1 only) and it was very fast (under 200 ms). I ran out of room on the SSD, so I moved the DB to a 1TB 7200rpm spinning disk; since then, completion times are 6-100 seconds (6,000-100,000 ms). When the bulk insert is running I can see disk activity of around 1-2 MB/sec and low CPU usage.
Is this a typical write time for this amount of data? It seems a little slow to me; what is causing the slow performance? Surely with imageSHAID being indexed we should expect quicker seek times than this?
Any help would be appreciated.
Thanks for your time.
Your UPDATE clause in the MERGE updates showCount. This requires a key lookup on the clustered index.
However, the clustered index is also declared non-unique. This withholds information from the optimiser, even though the underlying column is unique.
So, I'd make these changes (a sketch follows this list):
make the clustered primary key autoIncID
make the current PK on imageSHAID a standalone unique index (not a constraint) and add an INCLUDE for showCount; unique constraints can't have INCLUDEs
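A sketch of those two changes (the new constraint and index names are my own):
-- drop the existing nonclustered PK and the non-unique clustered index
ALTER TABLE dbo.store_image DROP CONSTRAINT PK_imageSHAID;
DROP INDEX autoIncPK ON dbo.store_image;

-- clustered primary key on the identity column
ALTER TABLE dbo.store_image
    ADD CONSTRAINT PK_store_image PRIMARY KEY CLUSTERED (autoIncID);

-- standalone unique index on the hash, covering showCount for the MERGE
CREATE UNIQUE NONCLUSTERED INDEX UX_store_image_imageSHAID
    ON dbo.store_image (imageSHAID)
    INCLUDE (showCount);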
More observations:
you don't need nvarchar for the hash or URL columns. These are not unicode.
A hash is also fixed length, so it can be char(64) (a SHA256 hash is 64 hex characters).
The length of a column defines how much memory is assigned to the query. See this for more: Is there an advantage to varchar(500) over varchar(8000)?
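If you do narrow the types, something like this sketch applies (note, as an assumption about your schema: any index on imageSHAID, and the computed imageURLIndex column that depends on imageURL, would have to be dropped first and recreated afterwards):
ALTER TABLE dbo.store_image ALTER COLUMN imageSHAID char(64) NOT NULL;
ALTER TABLE dbo.store_image ALTER COLUMN imageURL varchar(2000) NOT NULL;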