SQL Server Update Indexing Design

I have a table, IdMapTemp, with the following DDL:
CREATE TABLE tempdb2.dbo.idmaptemp (
OldId varchar(20),
CV_ModStamp datetimeoffset,
NewId varchar(20),
RestoreComplete bit,
RestoreErrorMessage varchar(1000),
OperationType varchar(20)
)
As defined, it already contains a predefined set of rows (about 1 million). When the restore operation completes, I have to update NewId, RestoreComplete, and RestoreErrorMessage on the table.
The statement is:
update tempdb2.dbo.IdMaptemp set NewId = 'xxx', RestoreComplete = 'false', RestoreErrorMessage = 'error' where OldId = 'ABC';
The Java application holds about a million values in memory and has to update them with the above statement. The database connection has autocommit off and the updates are sent in batches (batch size 500).
I have tried two indexing options on the OldId field:
Clustered index - the execution plan lists a clustered index update (100% cost). This occurs because the leaf pages contain the rows being updated, which triggers an index update. Am I right here?
Non-clustered index - the execution plan lists an update (75%) and a seek (25%).
Are there any other speed-ups that can be achieved for a mass update on a database table? The table cannot be cleared and re-inserted because it contains other rows that aren't affected by the updates. With the clustered index and batches of 500 rows, the update took around 7 hours.
Should I go for the Non-Clustered index option?

Changing a large table's clustered index is an expensive proposition. A table's clustered index is defined for the entire table, not for a subset of rows.
If you're leaving OldId as the clustered index and just want to improve batching performance, consider letting the database participate in the batching rather than doing it all in the application/Java layer. Asking the database to update millions of rows one row at a time is expensive. Populating a temp table with one batch's worth of values and then letting SQL Server update the entire batch at once can be a good way of improving performance.
INSERT #tempTable (OldId, NewId)
...

UPDATE T1
SET T1.NewId = T2.NewId
FROM tempdb2.dbo.IdMapTemp T1
JOIN #tempTable T2
    ON T1.OldId = T2.OldId;
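The staging-table tactic is portable beyond T-SQL. Here is a minimal sketch in Python with SQLite (table names and data are simplified stand-ins for the IdMapTemp scenario, not the real schema): the application sends one batch per round trip into a staging table, and a single set-based UPDATE applies the whole batch. The correlated-subquery form is used because it also works on engines without UPDATE ... FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idmap (OldId TEXT PRIMARY KEY, NewId TEXT)")
conn.executemany("INSERT INTO idmap (OldId) VALUES (?)",
                 [("A",), ("B",), ("C",)])

# The application holds the (OldId, NewId) pairs in memory.
pairs = [("A", "1"), ("B", "2")]

# 1) Load one batch into a staging table in a single executemany round trip.
conn.execute("CREATE TEMP TABLE staging (OldId TEXT PRIMARY KEY, NewId TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", pairs)

# 2) Let the database apply the whole batch with one set-based UPDATE.
conn.execute("""
    UPDATE idmap
    SET NewId = (SELECT s.NewId FROM staging s WHERE s.OldId = idmap.OldId)
    WHERE OldId IN (SELECT OldId FROM staging)
""")
conn.execute("DELETE FROM staging")  # staging is now ready for the next batch
conn.commit()
```

Row "C" is untouched because it was not part of the batch, which mirrors the requirement that unaffected rows stay as they are.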
If you can compute the new id, consider another batching tactic.
UPDATE TOP (1000) tempdb2.dbo.IdMapTemp
SET NewId = 'xxx', RestoreComplete = 'false',
    RestoreErrorMessage = 'error'
WHERE NewId IS NULL;
Run it in a loop until it affects zero rows.
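The loop-until-done shape of that tactic can be sketched portably. In this hypothetical Python/SQLite illustration, a rowid subquery with LIMIT plays the role of T-SQL's UPDATE TOP (N), and each iteration commits so every transaction (and its log usage) stays small:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idmap (OldId TEXT PRIMARY KEY, NewId TEXT)")
conn.executemany("INSERT INTO idmap (OldId) VALUES (?)",
                 [(f"id{i}",) for i in range(2500)])

BATCH = 1000
total = 0
while True:
    # Update at most BATCH rows that still need a NewId, then commit.
    cur = conn.execute("""
        UPDATE idmap
        SET NewId = 'xxx'
        WHERE rowid IN (SELECT rowid FROM idmap
                        WHERE NewId IS NULL LIMIT ?)
    """, (BATCH,))
    conn.commit()
    if cur.rowcount == 0:   # nothing left to update: done
        break
    total += cur.rowcount
```

With 2,500 rows and a batch size of 1,000, the loop runs three updating iterations (1000, 1000, 500) plus one final empty pass.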
If you really want to create a new table with NewId as the clustered index:
Create the new table as you like, then backfill it in batches:
INSERT INTO NewTable (...)  -- column list elided
SELECT TOP (10000) O.*
FROM OldTable O
LEFT JOIN NewTable N
    ON O.OldId = N.OldId
WHERE N.OldId IS NULL
When done, drop the old table.
Note: does your id need to be varchar(20)? Typically clustered index keys are either int (4 bytes) or bigint (8 bytes).
If this is a one-time thing, then changing the clustered index on a large persistent table may be worth it. If OldId will always be in the process of acquiring its NewId value, though, and that's just the workflow you have, I wouldn't bother changing the persistent table's clustered index. Just leave OldId as the clustered index; NewId sounds like a surrogate key.

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the column query_time already exists there, and if it does, it doesn't insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) --Dont insert items into history table if they already exist
I think you are missing an extra condition on user_id when inserting into the history table: you should check the combination of user_id and query_time.
Beyond that, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support this kind of historical data. Read about SQL Server Temporal Tables.
If you still want to continue with this approach, I would suggest doing it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE snm.query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would also suggest creating a clustered index on (UserId, Query_Time) if possible (otherwise a non-clustered index), which will improve performance.
Another approach you can think of:
Create a clustered index on (userId, querytime) in the historical table, have (userid, querytime) as the clustered index on the main table as well, and perform a MERGE operation.
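The high-water-mark pattern above (a config row remembering the last processed query_time, so each run only copies newer rows) can be sketched portably. This is a minimal Python/SQLite illustration with made-up data and integer timestamps, not the production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sn_users_main (query_time INT, user_id INT);
    CREATE TABLE sn_users_history (query_time INT, user_id INT);
    CREATE TABLE HistoryConfig (HistoryTableName TEXT,
                                lastProcessedQueryTime INT);
    INSERT INTO HistoryConfig VALUES ('sn_users_history', 0);
""")

def snapshot():
    """Copy only rows newer than the stored high-water mark, then advance it."""
    last = conn.execute(
        "SELECT lastProcessedQueryTime FROM HistoryConfig "
        "WHERE HistoryTableName = 'sn_users_history'").fetchone()[0]
    conn.execute(
        "INSERT INTO sn_users_history "
        "SELECT query_time, user_id FROM sn_users_main WHERE query_time > ?",
        (last,))
    conn.execute(
        "UPDATE HistoryConfig "
        "SET lastProcessedQueryTime = "
        "    (SELECT COALESCE(MAX(query_time), ?) FROM sn_users_history) "
        "WHERE HistoryTableName = 'sn_users_history'", (last,))
    conn.commit()

conn.executemany("INSERT INTO sn_users_main VALUES (?, ?)",
                 [(1, 10), (1, 11)])
snapshot()
snapshot()  # second run inserts nothing: no rows newer than the mark
```

The second snapshot() call is a no-op because the mark has advanced, which is exactly why this avoids the ever-growing NOT EXISTS probe against the history table.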

Firebird SQL index on multiple columns

This is for Firebird 2.5.
I have a table T with an index over two columns, say ColA and ColB. If I run:
SELECT * FROM T WHERE ColA = ...
so the WHERE clause only references column ColA, will Firebird fill in a default value for column ColB and still benefit from the index, or can it not use this index at all?
A bit of context:
I'm doing a db upgrade. Here is what I have:
CREATE TABLE user(
newid BIGINT NOT NULL,
oldid BIGINT NOT NULL,
anotherCol INT);
CREATE INDEX idx ON user(oldid, anotherCol);
CREATE TABLE order(
RefUser BIGINT);
order.RefUser values are oldids, and I need to change them to newids. I do it with this query:
UPDATE order o SET o.refuser = (SELECT u.newid FROM user u WHERE u.oldId = o.refuser);
At this point in time, oldid is still unique, but later on uniqueness will only be guaranteed for (oldid, anotherCol), hence the index and the creation of newid.
The user table has a few million records and the order table a few dozen million: this query takes more than an hour. I would like to improve it (I'm not keen on shutting down a critical service for that long).
Assuming the index statistics are up-to-date, or at least good enough for the optimizer, then Firebird can (and often will) use a multi-column index when not all columns are part of the where-clause. The only restriction is that it can only use it for the first columns (or the 'prefix' of the index).
So with
CREATE INDEX idx ON user(oldid, anotherCol);
Firebird can use the index idx just fine for where oldid = 'something', but not for where anotherCol = 'something'.
And no, Firebird does not "put a default value for column [anotherCol]". It does a range scan on the index and returns all rows that have the matching oldid prefix.
Technically, Firebird creates index keys by combining the columns as described in Firebird for the Database Expert: Episode 1 - Indexes, which means the value in the index is something like:
0<oldid> 1<anotherCol> : row_id
e.g. (simplified, as in real life Firebird also does a prefix compression)
0val1 1other1 : rowid1
0val1 1other2 : rowid4
0val1 1other3 : rowid6
0val2 1other1 : rowid2
...
When using where oldid = 'val1', Firebird will search the index for all entries that start with 0val1 1 (as if it was doing a string search for 0val1 1% on a single column). And in this case it will match rowid1, rowid4 and rowid6.
Although this works, if you query a lot on oldid alone, it might be better to also create a single-column index on oldid, as that index will be smaller and therefore faster to traverse when searching for records. The downside, of course, is that more indices have a performance impact on inserts, updates and deletes.
See also Concatenated Indexes on Use The Index, Luke.
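The prefix rule is easy to verify on any engine that exposes its query plans. As a hypothetical illustration (SQLite here, not Firebird itself, but it applies the same leading-column rule to composite indexes), EXPLAIN QUERY PLAN shows the index being used for the first column and a full scan for the second:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (newid INTEGER, oldid INTEGER, anotherCol INTEGER)")
conn.execute("CREATE INDEX idx ON user(oldid, anotherCol)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether a search uses an index or scans.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

p1 = plan("SELECT * FROM user WHERE oldid = 1")       # leading column: index used
p2 = plan("SELECT * FROM user WHERE anotherCol = 1")  # non-prefix column: full scan
```

p1 mentions the composite index idx; p2 falls back to a table scan, matching the Firebird behavior described above.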

Speedup SQL Query with aggregates on DateTime and group by

I've a large (> 100 million rows) table in my MS SQL database with the following columns:
Id int not null,
ObjectId int not null,
Timestamp datetime not null,
State int not null
Id is the primary key of the table (and has a clustered index on it). I added a non-clustered index on Timestamp and ObjectId (in this order). There are only around 2000 distinct values in ObjectId. I now want to perform the following query:
SELECT ObjectId, MAX(Timestamp) FROM Table GROUP BY ObjectId
It takes something around four seconds, which is too slow for my application. The execution plan says that 97% of the runtime goes to an Index Scan of the non clustered index.
On a copy of the table I created a clustered index on ObjectId and Timestamp. The resulting runtime is the same; the execution plan says it is now doing an Index Scan of the clustered index.
Is there any other possibility to improve the runtime without splitting the table's data into multiple tables?
I can propose another approach: add a boolean column Last, and before inserting a new row for an ObjectId, update that ObjectId's current Last = true row to false. Create an index on (ObjectId, Last). The query then becomes very simple:
SELECT ObjectId, Timestamp FROM Table WHERE Last = true
No more GROUP BY and full scan, but one extra update for each insert.
That said, four seconds is not bad for that kind of work in a DB with more than 100M rows.
You can archive some data daily into another table to preserve history. You can also archive all data into another table and delete old versions of changed objects:
delete from TABLE where Id in (select t1.Id from Table t1, Table t2
where t1.ObjectId = t2.ObjectId and t1.Timestamp < t2.Timestamp )
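The Last-flag idea above can be sketched concretely. This is a hypothetical Python/SQLite illustration (simplified schema, made-up data): each insert first demotes the previous latest row for that ObjectId, so reading the latest value per object becomes a simple indexed filter with no aggregation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (Id INTEGER PRIMARY KEY, ObjectId INT,
                           Timestamp INT, Last INT);
    CREATE INDEX ix_last ON readings(ObjectId, Last);
""")

def insert_reading(object_id, ts):
    # Demote the current latest row for this object, then insert the new one.
    conn.execute("UPDATE readings SET Last = 0 WHERE ObjectId = ? AND Last = 1",
                 (object_id,))
    conn.execute("INSERT INTO readings (ObjectId, Timestamp, Last) VALUES (?, ?, 1)",
                 (object_id, ts))
    conn.commit()

for oid, ts in [(1, 100), (1, 200), (2, 50)]:
    insert_reading(oid, ts)

# The "latest per object" query: no GROUP BY, no MAX, just an indexed filter.
latest = conn.execute(
    "SELECT ObjectId, Timestamp FROM readings WHERE Last = 1 ORDER BY ObjectId"
).fetchall()
```

The trade-off is as stated in the answer: each insert now costs an extra update, in exchange for a much cheaper read.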
For this particular query, an index on (ObjectId, Timestamp) will be optimal. And there is a chance that (ObjectId, Timestamp DESC) will perform even faster.

Insert sorted data to a table with nonclustered index

My db schema:
Table point (point_id int PK, name varchar);
Table point_log (point_log_id int PK, point_id int FK, timestamp datetime, value int)
point_log has an index:
point_log_idx1 (point_id asc, timestamp asc)
I need to insert point log samples into the point_log table. Each transaction only inserts log samples for one point_id, and the samples are already sorted ascending. That means all the log sample data in a transaction is already in index order for point_log_idx1. How can I make SQL Server take advantage of this, to avoid the tree search cost?
The tree search cost is probably negligible compared to the cost of physical writing to disk and page splitting and logging.
1) You should definitely insert data in bulk, rather than row by row.
2) To reduce page splitting of the point_log_idx1 index you can try using ORDER BY in the INSERT statement. It still doesn't guarantee the physical order on disk, but it does guarantee the order in which point_log_id IDENTITY values are generated, and it hopefully hints the engine to process the source data in that order. If the source data is processed in the requested order, the b-tree structure of the point_log_idx1 index may grow without unnecessary, costly page splits.
I'm using SQL Server 2008. I have a system that collects a lot of monitoring data in a central database 24/7. Originally I was inserting data as it arrived, row by row. Then I realized that each insert was a separate transaction and most of the time system spent writing into the transaction log.
Eventually I moved to inserting data in batches using stored procedure that accepts table-valued parameter. In my case a batch is few hundred to few thousand rows. In my system I keep data only for a given number of days, so I regularly delete obsolete data. To keep the system performance stable I rebuild my indexes regularly as well.
In your example, it may look like the following.
First, create a table type:
CREATE TYPE [dbo].[PointValuesTableType] AS TABLE(
point_id int,
timestamp datetime,
value int
)
Then procedure would look like this:
CREATE PROCEDURE [dbo].[InsertPointValues]
-- Add the parameters for the stored procedure here
@ParamRows dbo.PointValuesTableType READONLY
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
BEGIN TRANSACTION;
BEGIN TRY
INSERT INTO dbo.point_log
(point_id
,timestamp
,value)
SELECT
TT.point_id
,TT.timestamp
,TT.value
FROM @ParamRows AS TT
ORDER BY TT.point_id, TT.timestamp;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION;
END CATCH;
END
In practice you should measure for your system what is more efficient, with ORDER BY, or without.
You really need to consider performance of the INSERT operation as well as performance of subsequent queries.
It may be that faster inserts lead to higher fragmentation of the index, which leads to slower queries.
So, you should check the fragmentation of the index after INSERT with ORDER BY or without.
You can use sys.dm_db_index_physical_stats to get index stats.
Returns size and fragmentation information for the data and indexes of
the specified table or view in SQL Server.
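The TVP procedure above is SQL Server-specific, but its core idea (one transaction per batch, rows pre-sorted to match the index key) is generic. A minimal Python/SQLite sketch of that shape, with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE point_log (point_log_id INTEGER PRIMARY KEY,
                            point_id INT, timestamp INT, value INT);
    CREATE INDEX point_log_idx1 ON point_log(point_id, timestamp);
""")

def insert_batch(rows):
    # One transaction for the whole batch; sorting by (point_id, timestamp)
    # matches point_log_idx1 so new index entries arrive in key order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO point_log (point_id, timestamp, value) VALUES (?, ?, ?)",
            ordered)

insert_batch([(7, 300, 1), (7, 100, 2), (7, 200, 3)])
```

Because the batch is sorted before the executemany, the generated point_log_id values follow (point_id, timestamp) order, which is the property the ORDER BY in the stored procedure is after.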
This looks like a good opportunity for changing the clustered index on Point_Log to cluster by its parent point_id Foreign Key:
CREATE TABLE Point_log
(
point_log_id int PRIMARY KEY NONCLUSTERED,
point_id int,
timestamp datetime,
value int
);
And then:
CREATE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id);
Rationale: this will reduce the read IO on point_log when fetching point_log records for a given point_id.
Moreover, given that Sql Server will add a 4 byte uniquifier to a non-unique clustered index, you may as well include the Surrogate PK on the Cluster as well, to make it unique, viz:
CREATE UNIQUE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id, point_log_id);
The non-clustered index point_log_idx1 (point_id asc, timestamp asc) would only need to be retained if you have a large number of point_logs per point and queries filtering on point_log.point_id and point_log.timestamp have good selectivity.

Improve Speed of Script with Indexing

I have the following script that used to perform fine, but since our user base has expanded to almost a million members, it is now very sluggish. I want to improve it and need expert assistance to make it faster, through code changes, new indexes, or both. Here is the code:
IF @MODE = 'CREATEREQUEST'
BEGIN
IF NOT EXISTS (SELECT * FROM FriendRequest WHERE FromMemberID = @FromMemberID AND ToMemberID = @ToMemberID)
AND NOT EXISTS (SELECT * FROM MemberConnection WHERE MemberID = @FromMemberID AND ConnMemberID = @ToMemberID)
AND NOT EXISTS (SELECT * FROM MemberConnection WHERE MemberID = @ToMemberID AND ConnMemberID = @FromMemberID)
BEGIN
INSERT INTO FriendRequest (
FromMemberID,
ToMemberID,
RequestMsg,
OnHold)
VALUES (
@FromMemberID,
@ToMemberID,
@RequestMsg,
@OnHold)
END
BEGIN
UPDATE Member SET FriendRequestCount = FriendRequestCount + 1 WHERE MemberID = @ToMemberID
END
END
Any assistance you can provide would be greatly appreciated.
You can use SQL Server Management Studio to view the indexes on a table. If, for example, your FriendRequest table has a PK on FriendRequestID, you will be able to see that you have a clustered index on that field. You can have only one clustered index per table, and the table records are stored in that order.
You might want to try adding non-clustered indexes to your foreign key fields. You could use the New Index wizard, or else syntax like this:
CREATE NONCLUSTERED INDEX [IX_FromMemberID] ON [dbo].[FriendRequest] (FromMemberID)
CREATE NONCLUSTERED INDEX [IX_ToMemberID] ON [dbo].[FriendRequest] (ToMemberID)
But you should be aware that indexing will generally slow down the INSERT and UPDATE operations you showed in your code, while tending to speed up the SELECT queries that can use the indexed fields (see Execution Plans).
You can try the Database Engine Tuning Advisor to get an idea of some possible indexes and their effect on your application's workload.
Indexing is a large subject and you may wish to take it a small step at a time.
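To see the shape of the guarded insert those indexes support, here is a hypothetical Python/SQLite sketch (schema heavily simplified, only the FriendRequest check shown). Unlike the original script, this version also only bumps the counter when the insert actually happened, which you may or may not want:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FriendRequest (FromMemberID INT, ToMemberID INT);
    CREATE TABLE Member (MemberID INTEGER PRIMARY KEY, FriendRequestCount INT);
    -- Indexes backing the NOT EXISTS probes, as suggested above:
    CREATE INDEX IX_FromMemberID ON FriendRequest(FromMemberID);
    CREATE INDEX IX_ToMemberID ON FriendRequest(ToMemberID);
    INSERT INTO Member VALUES (2, 0);
""")

def create_request(from_id, to_id):
    # Insert only if no matching request exists yet.
    cur = conn.execute("""
        INSERT INTO FriendRequest (FromMemberID, ToMemberID)
        SELECT ?, ?
        WHERE NOT EXISTS (SELECT 1 FROM FriendRequest
                          WHERE FromMemberID = ? AND ToMemberID = ?)
    """, (from_id, to_id, from_id, to_id))
    if cur.rowcount:  # count the request only if one was actually created
        conn.execute("UPDATE Member SET FriendRequestCount = FriendRequestCount + 1 "
                     "WHERE MemberID = ?", (to_id,))
    conn.commit()

create_request(1, 2)
create_request(1, 2)  # duplicate: no insert, counter not bumped
```

With the two indexes in place, each NOT EXISTS probe is a seek rather than a scan of the whole FriendRequest table, which is where the original script loses time at a million members.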