Improve insert performance when checking existing rows


I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the query_time value already exists and, if it does, skips the insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but now that sn_users_history is approaching 50k rows the query takes more than 2 minutes to run, which is too long. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) --Dont insert items into history table if they already exist

I think you are missing an extra condition on user_id when inserting into the history table. You have to check the combination of user_id and query_time.
As for your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support holding this kind of historical data. Read about SQL Server Temporal Tables.
If you still want to continue with this approach, I would suggest doing it in batches:
Create a configuration table to hold the last processed query_time
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts
-- Fall back to an early date when nothing has been processed yet
DECLARE @lastProcessedQueryTime DATETIME = ISNULL((SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig), '19000101')
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE snm.query_time > @lastProcessedQueryTime
Now you can update the configuration again
UPDATE HistoryConfig SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would suggest creating a clustered index on user_id, query_time (if possible; otherwise create a non-clustered index), which will improve performance.
Other approaches you can think of:
Create a clustered index on user_id, query_time in the history table, create the same clustered index on the main table, and perform a MERGE operation.
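The watermark idea above can be tried out end to end with a small sketch. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module instead of SQL Server, trims the tables to three columns, and the names copy_new_sessions, make_demo_db, history_config and ix_history are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the tables in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sn_users_main (query_time TEXT, user_id INTEGER, sn_name TEXT);
        CREATE TABLE sn_users_history (query_time TEXT, user_id INTEGER, sn_name TEXT);
        CREATE INDEX ix_history ON sn_users_history (user_id, query_time);
        CREATE TABLE history_config (history_table_name TEXT,
                                     last_processed_query_time TEXT);
        INSERT INTO history_config VALUES ('sn_users_history', NULL);
        INSERT INTO sn_users_main VALUES ('2024-01-01 10:00:00', 1, 'ann'),
                                         ('2024-01-01 10:00:00', 2, 'bob');
        """
    )
    return conn


def copy_new_sessions(conn):
    """Copy only rows newer than the stored watermark; return how many were copied."""
    cur = conn.cursor()
    # Watermark: the last query_time already copied (NULL on the first run).
    (last,) = cur.execute(
        "SELECT MAX(last_processed_query_time) FROM history_config"
    ).fetchone()
    cur.execute(
        """INSERT INTO sn_users_history (query_time, user_id, sn_name)
           SELECT query_time, user_id, sn_name
           FROM sn_users_main
           WHERE ? IS NULL OR query_time > ?""",
        (last, last),
    )
    copied = cur.rowcount
    # Advance the watermark to the newest query_time now present in history.
    cur.execute(
        """UPDATE history_config
           SET last_processed_query_time =
               (SELECT MAX(query_time) FROM sn_users_history)
           WHERE history_table_name = 'sn_users_history'"""
    )
    conn.commit()
    return copied
```

Calling copy_new_sessions twice on the same data copies the rows once and then nothing, because the second call only compares against the watermark instead of probing the history table row by row.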

Related

SQL Server Update Indexing Design

I have a table, IDMAP, with this DDL:
CREATE TABLE tempdb2.dbo.idmaptemp (
OldId varchar(20),
CV_ModStamp datetimeoffset,
NewId varchar(20),
RestoreComplete bit,
RestoreErrorMessage varchar(1000),
OperationType varchar(20)
)
The table already contains a predefined set of about 1 million rows. When the restore operation is complete, I have to update NewId, RestoreComplete and RestoreErrorMessage on the table.
The statement is:
update tempdb2.dbo.IdMaptemp set NewId = 'xxx', RestoreComplete = 'false', RestoreErrorMessage = 'error' where OldId = 'ABC';
The Java application holds about a million values in memory and has to apply them with the above statement. The connection has autocommit off and the updates are sent in batches (batch size 500).
I have tried two indexing options on the OldId field:
Clustered index - the execution plan shows a clustered index update (100% cost). I assume this is because the leaf level of the index holds the rows being updated, so every update is an index update. Am I right here?
Non-clustered index - the execution plan shows an update (75%) and a seek (25%).
Are there any other speed-ups for a mass update on a database table? The table cannot be cleared and re-inserted because there are other rows that aren't affected by the updates. With the clustered index and 500 rows per batch, the update has taken around 7 hours.
Should I go for the non-clustered index option?

Changing a large table's clustered index is an expensive proposition. A table's clustered index is defined for the entire table and not for a subset of rows.
If you're leaving oldid as the clustered index and just want to improve the batching performance, consider allowing the db to participate in the batching process rather than the application/java layer. Asking the db to update millions of rows 1 row at a time, is an expensive proposition. Populating a temp table with batches worth and then letting SQL update the entire batch at a time can be a good way of improving performance.
insert #temptable (OldId,NewId)
...
update T1
set T1.NewId = T2.NewId
from
T1 join #tempTable T2
on T1.OldId = T2.OldId
If you can compute the new id, consider another batching tactic.
update top (1000) tempdb2.dbo.IdMaptemp set NewId = 'xxx', RestoreComplete = 'false',
RestoreErrorMessage = 'error' where NewId is null;
If you really want to create a new table with NewId as the clustered index:
Create the new table as you like, then copy the rows across in batches, repeating until no more rows are inserted:
insert into NewTable
select top (10000) O.*
from OldTable O
left join NewTable N
on O.OldId = N.OldId
where N.OldId is null
When done, drop the old table.
Note: Does your id need to be 20 bytes? Typically clustered indexes are either int - 4 bytes or bigint - 8 bytes.
If this is a one time thing, then changing the clustered index on a large persistent table will be worth it. If the oldid though will always be in the process of acquiring the newid value, and that's just the workflow you have, I wouldn't bother changing the persistent table's clustered index. Just leave the oldid as the clustered index. NewId sounds like a surrogate key.
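The staging-table tactic described above (load a batch into a temp table, then let the database update the whole batch in one statement) could look roughly like this. It is a sketch under assumptions, not the poster's code: it uses Python's sqlite3 rather than Java and SQL Server, the table is called idmap, and apply_batch, make_idmap and staging are hypothetical names.

```python
import sqlite3


def make_idmap():
    """Small in-memory stand-in for the IdMap table from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE idmap (OldId TEXT PRIMARY KEY, NewId TEXT,
                            RestoreComplete INTEGER, RestoreErrorMessage TEXT);
        INSERT INTO idmap (OldId) VALUES ('A'), ('B'), ('C');
        """
    )
    return conn


def apply_batch(conn, results):
    """Apply (old_id, new_id, complete, error) tuples with one set-based UPDATE."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging (OldId TEXT PRIMARY KEY, "
        "NewId TEXT, RestoreComplete INTEGER, RestoreErrorMessage TEXT)"
    )
    cur.execute("DELETE FROM staging")
    # One round trip to load the batch instead of one UPDATE per row.
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", results)
    cur.execute(
        """UPDATE idmap
           SET (NewId, RestoreComplete, RestoreErrorMessage) =
               (SELECT s.NewId, s.RestoreComplete, s.RestoreErrorMessage
                FROM staging s WHERE s.OldId = idmap.OldId)
           WHERE OldId IN (SELECT OldId FROM staging)"""
    )
    updated = cur.rowcount
    conn.commit()
    return updated
```

Rows that are not in the batch are left untouched because of the WHERE ... IN filter, which matches the requirement that unaffected rows stay as they are.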

Add column to huge table in Postgresql without downtime

I have a very small table (about 1 million rows) and I'm going to drop a constraint and add a new column. The query below hung for about 5 minutes and I had to roll it back.
BEGIN WORK;
LOCK TABLE usertable IN SHARE MODE;
ALTER TABLE usertable ALTER COLUMN email DROP NOT NULL;
COMMIT WORK;
Another approach, suggested in similar questions on the internet:
CREATE TABLE new_tbl
(
field1 int,
field2 int,
...
);
INSERT INTO new_tbl(field1, field2, ...)
(
SELECT FROM ... -- use your new logic here to insert the updated data
);
CREATE INDEX -- add your constraints and indexes to new_tbl
DROP TABLE tbl;
ALTER TABLE new_tbl RENAME TO tbl;
The steps are:
Create the new table
Insert records from the old table into the new one (takes less than a second)
Drop the old table - this query hangs for about 5 minutes. Had to roll back. Does not work for me.
Rename the newly created table to the old name
Dropping the old table simply hangs. However, when I try to drop the newly created table with 1 million rows, it works instantly. Why does dropping the old table take so much time? This is the query I used to look at blocked and blocking sessions:
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
I can see a lot of concurrent reads/writes waiting for my operation. Since I took a lock on the table, I don't think that is the reason I can't drop the old table.
I also ran VACUUM on the old table; it did not help.
Why can't I drop the old table, and why does it take so much longer than dropping a recently created table?

I don't have a lot of experience with PostgreSQL, but my guess is that it keeps a bit to signify when a NULLable column is NULL (as opposed to empty) and when that column is marked as NOT NULL it no longer needs that bit. So, when you change that attribute on a column the system needs to go through the whole table and rearrange the data, moving lots of bits around so that the rows are all correctly structured.
This is much different from a DROP TABLE, which merely needs to mark the disk space as no longer in use and perhaps update a few metadata values.
In short, they're very different actions, so of course they take different amounts of time.

I was not able to drop/rename the original table because of foreign keys from other tables. Once I dropped those, the approach of renaming the table works great.

Selecting the most optimal query

I have a table in an Oracle database called my_table, for example. It is a kind of log table. It has an incrementing column named "id" and a "registration_number" column that is unique per registered user. Now I want to get the latest changes for registered users, so I wrote the queries below:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? The second: occasionally about 70,000 rows are inserted into this table at once, but usually the number of inserted rows is between 0 and 2,000. Is it reasonable to add an index to this table?

An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
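To check that the analytic query and the join version from the question return the same rows before comparing their plans, a small harness along these lines may help. It is only a sketch: it runs on Python's sqlite3 rather than Oracle (and assumes an SQLite build with window functions, 3.25 or later), and make_log_db, latest_rows and the sample data are made up for illustration.

```python
import sqlite3

# Analytic version: rank rows per user by id, keep the newest one.
ANALYTIC_SQL = """
SELECT registration_number, id
FROM (
    SELECT registration_number, id,
           ROW_NUMBER() OVER (PARTITION BY registration_number
                              ORDER BY id DESC) AS id_rank_by_user
    FROM my_table
)
WHERE id_rank_by_user = 1
ORDER BY registration_number
"""

# Join version from the question: join back on MAX(id) per user.
JOIN_SQL = """
SELECT t.registration_number, t.id
FROM my_table t
JOIN (SELECT MAX(id) AS m_id FROM my_table GROUP BY registration_number) t_m
  ON t.id = t_m.m_id
ORDER BY t.registration_number
"""


def make_log_db():
    """In-memory log table with a few changes per registered user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, registration_number INTEGER)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?)",
        [(1, 100), (2, 200), (3, 100), (4, 300), (5, 200)],
    )
    return conn


def latest_rows(conn, sql):
    """Run one of the queries and return (registration_number, id) pairs."""
    return conn.execute(sql).fetchall()
```

Both queries should agree on any data set where id is unique; which one is faster still has to be measured on the real table with its real indexes.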

To find the faster query, check the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions often make this kind of query run much faster.
If you expect this table to grow very large, I would suggest partitioning the table and using local indexes. They should help you write faster queries.
When you insert lots of rows, indexes slow down the insertion because the index also has to be updated with each insert (I would not recommend an index on ID). There are two solutions I can think of for this:
You can drop the index before the insertion and recreate it afterwards.
Use reverse key indexes. Check this link : http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. Reverse key index can impact your query a bit so there will be trade off.

If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last actions for every user with good performance:
select *
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
Rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.
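Here is a rough, runnable version of the same idea that keys last_user_action on the log id instead of rowid, as suggested in the last paragraph. It is a sketch with assumptions: it uses Python's sqlite3 instead of Oracle PL/SQL, the column last_action_id and the helper names are hypothetical, and unlike the demo procedure above it also inserts a row the first time a user is seen.

```python
import sqlite3


def make_action_db():
    """Create the log table and the per-user pointer table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE my_log (id INTEGER PRIMARY KEY,
                             registration_number INTEGER, action_id TEXT);
        CREATE TABLE last_user_action (registration_number INTEGER PRIMARY KEY,
                                       last_action_id INTEGER);
        """
    )
    return conn


def write_log(conn, reg_num, action_id):
    """Append a log row and point the user's last action at it."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO my_log (registration_number, action_id) VALUES (?, ?)",
        (reg_num, action_id),
    )
    log_id = cur.lastrowid  # id of the row just inserted
    cur.execute(
        "UPDATE last_user_action SET last_action_id = ? WHERE registration_number = ?",
        (log_id, reg_num),
    )
    if cur.rowcount == 0:
        # Assumption: first action for this user, so create the pointer row.
        cur.execute("INSERT INTO last_user_action VALUES (?, ?)", (reg_num, log_id))
    conn.commit()
    return log_id


def last_actions(conn):
    """Return (registration_number, action_id) of each user's latest log row."""
    return conn.execute(
        """SELECT lua.registration_number, l.action_id
           FROM last_user_action lua
           LEFT JOIN my_log l ON l.id = lua.last_action_id
           ORDER BY lua.registration_number"""
    ).fetchall()
```

The read side never scans the log: it touches one pointer row per user and one log row by primary key, at the cost of an extra write on every insert.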

Is it possible to add index to a temp table? And what's the difference between create #t and declare @t

I need to do a very complex query.
At one point, this query must have a join to a view that cannot be indexed unfortunately.
This view is also a complex view joining big tables.
View's output can be simplified as this:
PID (int), Kind (int), Date (date), D1,D2..DN
where PID and Date and Kind fields are not unique (there may be more than one row having same combination of pid,kind,date), but are those that will be used in join like this
left join ComplexView mkcs on mkcs.PID=q4.PersonID and mkcs.Date=q4.date and mkcs.Kind=1
left join ComplexView mkcl on mkcl.PID=q4.PersonID and mkcl.Date=q4.date and mkcl.Kind=2
left join ComplexView mkco on mkco.PID=q4.PersonID and mkco.Date=q4.date and mkco.Kind=3
Now, if I just do it like this, execution of the query takes significant time because, I assume, the complex view is run three times, and out of its huge number of rows only some are actually used (say, out of 40000 only 2000 are used).
What I did is declare @temptable and insert into @temptable select * from ComplexView where Date... - once per query I select only the rows I am going to use from my ComplexView, and then I join this @temptable.
This reduced execution time significantly.
However, I noticed something else. If I make a table in my database, add a clustered index on PID,Kind,Date (non-unique clustered) and take data from this table, then deleting everything from this table and inserting into it from the complex view takes a few seconds (3 or 4), and using this table in my query (left joining it three times) cuts the query time in half, from 1 minute to 30 seconds!
So, my question is, first of all - is it possible to create indexes on declared @temptables?
And then - I've seen people talk about "create #temptable" syntax. Maybe this is what I need? Where can I read about the difference between declare @temptable and create #temptable? What should I use for a query like mine? (This query is for an MS Reporting Services report, if it matters.)

#tablename is a physical table, stored in tempdb, that the server will drop automatically when the connection that created it is closed. @tablename is a table stored in memory that lives for the lifetime of the batch/procedure that created it, just like a local variable.
You can only add a (non PK) index to a #temp table.
create table #blah (fld int)
create nonclustered index idx on #blah (fld)

It's not a complete answer, but #table will create a temporary table that you need to drop or it will persist in your database. @table is a table variable that will not persist longer than your script.
Also, I think this post will answer the other part of your question.
Creating an index on a table variable

Yes, you can create indexes on temp tables or table variables. http://sqlserverplanet.com/sql/create-index-on-table-variable/

The @tableName syntax is a table variable. They are rather limited. The syntax is described in the documentation for DECLARE @local_variable. You can kind of have indexes on table variables, but only indirectly by specifying PRIMARY KEY and UNIQUE constraints on columns. So, if your data in the columns that you need an index on happens to be unique, you can do this. See this answer. This may be “enough” for many use cases, but only for small numbers of rows. If you don’t have indexes on your table variable, the optimizer will generally treat table variables as if they contain one row (regardless of how many rows there actually are) which can result in terrible query plans if you have hundreds or thousands of rows in them instead.
The #tableName syntax is a locally-scoped temporary table. You can create these either using SELECT…INTO #tableName or CREATE TABLE #tableName syntax. The scope of these tables is a little bit more complex than that of variables. If you have CREATE TABLE #tableName in a stored procedure, all references to #tableName in that stored procedure will refer to that table. If you simply reference #tableName in the stored procedure (without creating it), it will look into the caller’s scope. So you can create #tableName in one procedure, call another procedure, and in that other procedure read/update #tableName. However, once the procedure that created #tableName runs to completion, that table will be automatically unreferenced and cleaned up by SQL Server. So, there is no reason to manually clean up these tables unless if you have a procedure which is meant to loop/run indefinitely or for long periods of time.
You can define complex indexes on temporary tables, just as if they are permanent tables, for the most part. So if you need to index columns but have duplicate values which prevents you from using UNIQUE, this is the way to go. You do not even have to worry about name collisions on indexes. If you run something like CREATE INDEX my_index ON #tableName(MyColumn) in multiple sessions which have each created their own table called #tableName, SQL Server will do some magic so that the reuse of the global-looking identifier my_index does not explode.
Additionally, temporary tables will automatically build statistics, etc., like normal tables. The query optimizer will recognize that temporary tables can have more than just 1 row in them, which can in itself result in great performance gains over table variables. Of course, this also is a tiny amount of overhead. Though this overhead is likely worth it and not noticeable if your query’s runtime is longer than one second.

To extend Alex K.'s answer, you can create the PRIMARY KEY on a temp table
IF OBJECT_ID('tempdb..#tempTable') IS NOT NULL
DROP TABLE #tempTable
CREATE TABLE #tempTable
(
Id INT PRIMARY KEY
,Value NVARCHAR(128)
)
INSERT INTO #tempTable
VALUES
(1, 'first value')
,(3, 'second value')
-- will cause Violation of PRIMARY KEY constraint 'PK__#tempTab__3214EC071AE8C88D'. Cannot insert duplicate key in object 'dbo.#tempTable'. The duplicate key value is (1).
--,(1, 'first value one more time')
SELECT * FROM #tempTable

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.

"this test/update will be done quite frequently and by many clients"
This could lead to unexpected race conditions during cache generation.
I would suggest:
When a new row is added to your table, add its id to a queue table
Use something like crontab to trigger the cache generation by checking the queue table
Once the new cache has been generated, delete the id from the queue table
Since you stress that the majority of the attempts will bring no data, the above will only trigger when there is a new addition.
The queue table concept can even be extended to updates and deletes.

I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because select count(*) may cause a table scan. Just for clarification, do you never delete rows from this table?

SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index by SN, and looking at very few index nodes may suffice. Don't count items if you don't need the exact total, it's expensive.
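A minimal sketch of the "fetch everything above the highest cached SN" refresh, for illustration only. It assumes Python's sqlite3, a hypothetical table named items with an auto-increment sn primary key, and that rows are only ever appended (never deleted or changed); fetch_new_rows and refresh_cache are made-up names.

```python
import sqlite3


def make_items_db():
    """In-memory table whose primary key plays the role of the SN field."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (sn INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
    conn.execute("INSERT INTO items (payload) VALUES ('a'), ('b'), ('c')")
    conn.commit()
    return conn


def fetch_new_rows(conn, max_cached_sn):
    """Range query on the primary key; an empty result means nothing is new."""
    return conn.execute(
        "SELECT sn, payload FROM items WHERE sn > ? ORDER BY sn",
        (max_cached_sn,),
    ).fetchall()


def refresh_cache(conn, cache):
    """Pull rows newer than the cache's highest SN into the dict cache."""
    max_cached_sn = max(cache, default=0)
    new_rows = fetch_new_rows(conn, max_cached_sn)
    cache.update(new_rows)  # each row is an (sn, payload) pair
    return len(new_rows)
```

When nothing is new, the query returns an empty list after a short index lookup, so a separate COUNT(*) check in front of it adds a round trip without saving work.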

You don't need to use SELECT COUNT(*)
There are two solutions.
You can use a temp table with one field that contains the last count of your table, and create an AFTER INSERT trigger on your table that increments that field.
You can use a temp table with one field that contains the last cached SN of your table, and create an AFTER INSERT trigger on your table that updates that field.

not much to this really
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
