SQL Azure time-consuming query

I have a table in SQL Azure that contains about 6M rows.
I want to create a new index on it. The command looks like this:
CREATE NONCLUSTERED INDEX [INDEX1] ON [dbo].Table1
(
[Column1] ASC,
[Column2] ASC,
[Column3] ASC,
[Column4] ASC
)
INCLUDE ( [Column5],[Column6])
After about 15 minutes, an error occurs:
"Msg 10054, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the
server. (provider: TCP Provider, error: 0 - An existing connection was
forcibly closed by the remote host.)"
I tried several times and got the same error.
But I have executed other time-consuming queries, like:
Insert into table1(Col1,Col2,Col3) select Col1,Col2,Col3 from table2
which took 20 minutes and completed successfully.
The queries were executed in the same SQL Azure DB. I don't know what's going on here. Could anyone help? Thanks!

I had the same problem with a table containing 100M rows and contacted Microsoft Support. This is the reply I got:
The reason why you can't create the index on your table is that you
are facing a platform limitation that prevents transactions larger
than 2 GB.
The creation of an index is a transactional operation that relies on
the transaction log to execute the move of the table pages. More rows
in a table mean more pages to put in the T-log. Since your table
contains 100 million records (which is quite a big number), it is
easy for you to hit this limit.
In order to create the index, we need to change the approach.
Basically, we are going to use a temporary (staging) table to store the
data while you create the index on the source table, which you would
previously have cleared of data.
Action plan:
1. Create a staging table identical to the original table but without any index (this makes the staging table a heap).
2. Move the data from the original table to the staging table (the insert is faster because the staging table is a heap).
3. Empty the original table.
4. Create the index on the original table (this time the transaction should be almost empty).
5. Move the data back from the staging table to the original table (this will take some time, as the table contains indexes).
6. Delete the staging table.
They suggest using BCP to move data between the staging table and the original table.
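For illustration, a hypothetical pair of bcp invocations (server name, database name, and credentials are placeholders; -n uses native format):
bcp YourDb.dbo.Table1 out table1.dat -n -S yourserver.database.windows.net -U youruser -P yourpassword
bcp YourDb.dbo.Table1_staging in table1.dat -n -S yourserver.database.windows.net -U youruser -P yourpassword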
When looking in the event_log table...
select * from sys.event_log
where database_name ='<DBName>'
and event_type <> 'connection_successful'
order by start_time desc
...I found this error message:
The session has been terminated because of excessive transaction log
space usage. Try modifying fewer rows in a single transaction.

Thanks for answering! Actually, I found the root cause as well.
There is a solution: set ONLINE = ON. In online mode, the index creation task is broken into multiple small transactions, so the T-log won't exceed 2 GB.
But there is a limitation: the included columns of the CREATE INDEX command cannot be objects of unlimited size, like nvarchar(max); if they are, the command fails immediately.
So in SQL Azure, for an index creation operation like the following:
CREATE NONCLUSTERED INDEX [INDEX1] ON [dbo].Table1
(
[Column1] ASC,
[Column2] ASC,
[Column3] ASC,
[Column4] ASC
)
INCLUDE ( [Column5],[Column6])
take the following actions if the previous one failed:
1. Create the index using ONLINE = ON (see the sketch after this list).
2. If #1 failed, it means either Column5 or Column6 is nvarchar(max). Query the table size; if it is < 2 GB, create the index directly using ONLINE = OFF.
3. If #2 failed, the table size is > 2 GB, and there is no simple way to create the index without a temporary table involved; you have to take the action described in ahkvk's reply.
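A minimal sketch of step 1, reusing the placeholder names from above:
CREATE NONCLUSTERED INDEX [INDEX1] ON [dbo].Table1
(
[Column1] ASC,
[Column2] ASC,
[Column3] ASC,
[Column4] ASC
)
INCLUDE ([Column5], [Column6])
WITH (ONLINE = ON);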


Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the column query_time already exists, and if it does, it doesn't insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) -- Don't insert items into the history table if they already exist
I think you are missing an extra condition on user_id when you are inserting into the history table. You have to check the combination of user_id and query_time.
For your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support holding this kind of historical data. Read about SQL Server Temporal Tables.
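For illustration, a minimal sketch of a system-versioned temporal table (assumes SQL Server 2016 or later; only a few of the question's columns are shown):
CREATE TABLE sn_users_main
(
user_id int NOT NULL PRIMARY KEY CLUSTERED,
sn_name nvarchar(100),
sn_email nvarchar(100),
-- ... remaining sn_* columns ...
ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.sn_users_history));
With this in place, SQL Server maintains the history table automatically on every UPDATE and DELETE.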
If you still want to continue with this approach, I would suggest you do it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would suggest you create a clustered index on (user_id, query_time) if possible (otherwise create a non-clustered index), which will improve the performance.
Other approaches you can think of:
Create a clustered index on (user_id, query_time) in the history table, have (user_id, query_time) as the clustered index on the main table as well, and perform a MERGE operation, as sketched below.
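A hedged sketch of that MERGE, showing only a few of the question's columns for brevity:
MERGE sn_users_history AS tgt
USING sn_users_main AS src
ON tgt.user_id = src.user_id AND tgt.query_time = src.query_time
WHEN NOT MATCHED BY TARGET THEN
INSERT (query_time, user_id, sn_name) -- plus the remaining sn_* columns
VALUES (src.query_time, src.user_id, src.sn_name);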

optimize sql scan query to get results from postgres db

I'm working on some small SQL logic.
I have one table, Messages, containing message_id and accountid as columns.
Data keeps coming into this table with unique message_ids.
My target is to store this messages table's data in another database [from a Postgres (source) DB to a Postgres (destination) DB].
For this I have set up an ETL job, which is helping me transfer the data.
Here comes the problem. In the Postgres (source) DB where the messages table is located, message_id is not in sorted form, and the data looks like this: ...
My ETL job runs every half an hour. My motive is that whenever the ETL job runs, it takes the data from the source DB to the destination DB on the basis of message_id. In the destination DB, I have one stored procedure which gets the max(message_id) from the messages table and stores that value in another table. In the ETL I use that value in the query I fire against the source DB, to get the rows with a message_id greater than the one I got from the destination DB.
So it's kind of an incremental load process using ETL. But the query I am using to get data from the source DB is like this: http://prnt.sc/b3u5il
SELECT * FROM (SELECT * FROM MESSAGES ORDER BY message_id) as a WHERE message_id >"+context.vid+"
This query scans the whole table every time it runs, so it takes a long time to execute. I'm getting my desired results, but is there any way I could perform this process faster?
Can anyone help me optimize this query (I don't know whether that's possible or not)? Any other suggestions are welcome.
Thanks
The most efficient way to improve performance in your case is to add an INDEX on your sort column, in this case message_id.
This way, your query will perform an index scan instead of the full table scan which hampers performance.
You can create an index by using the following statement:
CREATE INDEX index_name
ON table_name (column_name)
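To verify that the plan actually changed, a hypothetical check (the threshold value is illustrative):
EXPLAIN
SELECT * FROM messages WHERE message_id > 1000000 ORDER BY message_id;
-- expect an "Index Scan ... on messages" node instead of "Seq Scan"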
Yes.
If message_id is not the leading column in the primary key or a secondary index, then create an index:
... ON MESSAGES (message_id)
And eliminate the inline view:
SELECT m.*
FROM MESSAGES m
WHERE m.message_id > ?
ORDER BY m.message_id
Create a B-tree index:
You can adjust the ordering of a B-tree index by including the options ASC, DESC, NULLS FIRST, and/or NULLS LAST when creating the index; for example:
CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);

Insert sorted data to a table with nonclustered index

My db schema:
Table point (point_id int PK, name varchar);
Table point_log (point_log_id int PK, point_id int FK, timestamp datetime, value int)
point_log has an index:
point_log_idx1 (point_id asc, timestamp asc)
I need to insert point log samples into the point_log table; each transaction only inserts log samples for one point_id, and the log samples are already sorted ascendingly. That means all the log sample data in a transaction is in the same order as the index (point_log_idx1). How can I make SQL Server take advantage of this, to avoid the tree search cost?
The tree search cost is probably negligible compared to the cost of physical writing to disk and page splitting and logging.
1) You should definitely insert data in bulk, rather than row by row.
2) To reduce page splitting of the point_log_idx1 index, you can try to use ORDER BY in the INSERT statement. It still doesn't guarantee the physical order on disk, but it does guarantee the order in which point_log_id IDENTITY values are generated, and hopefully that hints SQL Server to process the source data in this order. If the source data is processed in the requested order, then the B-tree structure of the point_log_idx1 index may grow without unnecessary, costly page splits.
I'm using SQL Server 2008. I have a system that collects a lot of monitoring data in a central database 24/7. Originally I was inserting data as it arrived, row by row. Then I realized that each insert was a separate transaction and most of the time system spent writing into the transaction log.
Eventually I moved to inserting data in batches, using a stored procedure that accepts a table-valued parameter. In my case a batch is a few hundred to a few thousand rows. In my system I keep data only for a given number of days, so I regularly delete obsolete data. To keep the system's performance stable, I rebuild my indexes regularly as well.
In your example, it may look like the following.
First, create a table type:
CREATE TYPE [dbo].[PointValuesTableType] AS TABLE(
    point_id int,
    timestamp datetime,
    value int
)
Then procedure would look like this:
CREATE PROCEDURE [dbo].[InsertPointValues]
    -- Add the parameters for the stored procedure here
    @ParamRows dbo.PointValuesTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        INSERT INTO dbo.point_log
            (point_id
            ,timestamp
            ,value)
        SELECT
            TT.point_id
            ,TT.timestamp
            ,TT.value
        FROM @ParamRows AS TT
        ORDER BY TT.point_id, TT.timestamp;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;
END
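A hypothetical usage example, with made-up sample values:
DECLARE @rows dbo.PointValuesTableType;

INSERT INTO @rows (point_id, timestamp, value)
VALUES (1, '2016-01-01T00:00:00', 42),
       (1, '2016-01-01T00:01:00', 43);

EXEC dbo.InsertPointValues @ParamRows = @rows;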
In practice you should measure for your system what is more efficient, with ORDER BY, or without.
You really need to consider performance of the INSERT operation as well as performance of subsequent queries.
It may be that faster inserts lead to higher fragmentation of the index, which leads to slower queries.
So, you should check the fragmentation of the index after INSERT with ORDER BY or without.
You can use sys.dm_db_index_physical_stats to get index stats.
Returns size and fragmentation information for the data and indexes of
the specified table or view in SQL Server.
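For example, a minimal sketch of checking fragmentation for this table (names from the question):
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID('dbo.point_log'), NULL, NULL, 'LIMITED');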
This looks like a good opportunity for changing the clustered index on Point_Log to cluster by its parent point_id Foreign Key:
CREATE TABLE Point_log
(
point_log_id int PRIMARY KEY NONCLUSTERED,
point_id int,
timestamp datetime,
value int
);
And then:
CREATE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id);
Rationale: this will reduce the read IO on point_log when fetching point_log records for a given point_id.
Moreover, given that SQL Server will add a 4-byte uniquifier to a non-unique clustered index, you may as well include the surrogate PK in the clustering key to make it unique, viz:
CREATE UNIQUE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id, point_log_id);
The non-clustered index point_log_idx1 (point_id asc, timestamp asc) would need to be retained if you have a large number of point_logs per point, and assuming good selectivity of queries filtering on point_log.point_id and point_log.timestamp.

Delete statement in SQL is very slow

I have statements like this that are timing out:
DELETE FROM [table] WHERE [COL] IN ( '1', '2', '6', '12', '24', '7', '3', '5')
I tried doing one at a time like this:
DELETE FROM [table] WHERE [COL] IN ( '1' )
and so far it's at 22 minutes and still going.
The table has 260,000 rows in it and has four columns.
Does anyone have any ideas why this would be so slow and how to speed it up?
I do have a non-unique, non-clustered index on the [COL] that I'm doing the WHERE on.
I'm using SQL Server 2008 R2.
Update: I have no triggers on the table.
Things that can cause a delete to be slow:
deleting a lot of records
many indexes
missing indexes on foreign keys in child tables (thank you to @CesarAlvaradoDiaz for mentioning this in the comments)
deadlocks and blocking
triggers
cascade delete (those ten parent records you are deleting could mean
millions of child records getting deleted)
Transaction log needing to grow
Many Foreign keys to check
So your choices are to find out what is blocking and fix it, or to run the deletes in off hours when they won't interfere with the normal production load. You can run the delete in batches (useful if you have triggers, cascade deletes, or a large number of records). You can also drop and recreate the indexes (best if you can do that in off hours too).
Disable CONSTRAINT
ALTER TABLE [TableName] NOCHECK CONSTRAINT ALL;
Disable Index
ALTER INDEX ALL ON [TableName] DISABLE;
Rebuild Index
ALTER INDEX ALL ON [TableName] REBUILD;
Enable CONSTRAINT
ALTER TABLE [TableName] CHECK CONSTRAINT ALL;
Delete again
Deleting a lot of rows can be very slow. Try to delete a few at a time, like:
delete top (10) from YourTable where col in ('1','2','3','4')
while @@rowcount > 0
begin
    delete top (10) from YourTable where col in ('1','2','3','4')
end
In my case the database statistics had become corrupt. The statement
delete from tablename where col1 = 'v1'
was taking 30 seconds even though there were no matching records but
delete from tablename where col1 = 'rubbish'
ran instantly.
Running
update statistics tablename
fixed the issue.
If the table you are deleting from has BEFORE/AFTER DELETE triggers, something in there could be causing your delay.
Additionally, if you have foreign keys referencing that table, additional UPDATEs or DELETEs may be occurring.
Preventive action
Check with the help of SQL Profiler for the root cause of this issue. There may be triggers causing the delay in execution. It can be anything. Don't forget to select the database name and object name while starting the trace, to exclude scanning unnecessary queries:
Database name filtering
Table/stored procedure/trigger name filtering
Corrective action
As you said, your table contains 260,000 records and the IN predicate contains eight values. Each of those 260,000 records is checked against every value in the IN predicate. Instead, it should be an inner join, like below:
Delete K From YourTable1 K
Inner Join YourTable2 T on T.id = K.id
Insert the IN predicate values into a temporary table or local variable, e.g.:
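A hypothetical sketch of staging the values and deleting via a join:
CREATE TABLE #ids (id varchar(10) PRIMARY KEY);

INSERT INTO #ids (id)
VALUES ('1'), ('2'), ('6'), ('12'), ('24'), ('7'), ('3'), ('5');

DELETE t
FROM [table] AS t
INNER JOIN #ids AS i ON i.id = t.[COL];

DROP TABLE #ids;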
It's possible that other tables have FK constraints referencing your [table],
so the DB needs to check those tables to maintain referential integrity.
Even if you have all the indexes corresponding to these FKs, check how many of them there are.
I had a situation where NHibernate incorrectly created duplicated FKs on the same columns, but with different names (which is allowed by SQL Server),
and it drastically slowed down the DELETE statement.
Check the execution plan of this delete statement. Have a look at whether an index seek is used. Also, what is the data type of col?
If you are using the wrong data type, change the delete statement (e.g. from '1' to 1, or to N'1').
If an index scan is used, consider using a query hint.
If you're deleting all the records in the table rather than a select few it may be much faster to just drop and recreate the table.
Is [COL] really a character field that's holding numbers, or can you get rid of the single quotes around the values? @Alex is right that IN is slower than =, so if you can do this, you'll be better off:
DELETE FROM [table] WHERE [COL] = '1'
But better still is using numbers rather than strings to find the rows (sql likes numbers):
DELETE FROM [table] WHERE [COL] = 1
Maybe try:
DELETE FROM [table] WHERE CAST([COL] AS INT) = 1
In either event, make sure you have an index on column [COL] to speed up the table scan.
I read this article; it was really helpful for troubleshooting this kind of problem:
https://support.microsoft.com/en-us/kb/224453
This is a case of waitresource:
KEY: 16:72057595075231744 (ab74b4daaf17)
-- First, use SQL Profiler to find the SPID (session ID).
-- Then identify the problem: check status, open_tran, lastwaittype, waittype, and waittime,
-- and, importantly, note the waitresource value.
select * from sys.sysprocesses where spid = 57
select * from sys.databases where database_id = 16
-- With the waitresource value, check these to obtain the object id:
select * from sys.partitions where hobt_id = 72057595075231744
select * from sys.objects where object_id = 2105058535
After inspecting an SSIS package (due to a SQL Server executing commands really slowly) that was set up at a client of ours about 4-5 years before the time of writing, I found out that it had the following tasks:
1) Insert data from an XML file into a table called [ImportBarcodes].
2) Run a merge command on another target table, using the above-mentioned table as the source.
3) Run "delete from [ImportBarcodes]" to clear the table of the rows that were inserted after the XML file was read by the package's task.
After a quick inspection, all statements (SELECT, UPDATE, DELETE etc.) on the ImportBarcodes table, which had only 1 row, took about 2 minutes to execute.
Extended Events showed a whole lot of PAGEIOLATCH_EX wait notifications.
No indexes were present on the table and no triggers were registered.
Upon close inspection of the properties of the table, in the Storage tab and under the General section, the Data Space field showed more than 6 gigabytes of space allocated in pages.
What happened:
The package ran for a good portion of time each day for the last 4 years, inserting and deleting data in the table, leaving unused pages behind without freeing them up.
So that was the main reason for the wait events captured by the Extended Events session and the slowly executed commands on the table.
Running ALTER TABLE ImportBarcodes REBUILD fixed the issue, freeing up all the unused space. TRUNCATE TABLE ImportBarcodes did a similar thing, with the only difference being that it deleted all pages and data.
An older topic, but one that's still relevant.
Another issue occurs when an index has become fragmented to the extent of being more of a problem than a help. In such a case, the answer would be to rebuild or drop and recreate the index and issue the delete statement again.
As an extension to Andomar's answer above: I had a scenario where the first 700,000,000 records (of ~1.2 billion) processed very quickly, with chunks of 25,000 records processing per second (roughly). But then it started taking 15 minutes to do a batch of 25,000. I reduced the chunk size down to 5,000 records and it went back to its previous speed. I'm not certain what internal threshold I hit, but the fix was to reduce the number of records further to regain the speed.
Open CMD and run these commands:
NET STOP MSSQLSERVER
NET START MSSQLSERVER
This will restart the SQL Server instance.
Try to run your delete command again afterwards.
I have this in a batch script and run it from time to time when I'm encountering problems like this. A normal PC restart will not be the same, so restarting the instance is the most effective approach if you are running into such issues with your SQL Server.

Slow simple update query on PostgreSQL database with 3 million rows

I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 minutes.
Before, I tried running VACUUM and ANALYZE commands on that table, and I also tried to create some indexes (although I doubt this will make any difference in this case), but none of it seems to help.
Any other ideas?
Update:
This is the table structure:
CREATE TABLE myTable
(
id bigserial NOT NULL,
title text,
description text,
link text,
"type" character varying(255),
generalFreq real,
generalWeight real,
author_id bigint,
status_id bigint,
CONSTRAINT resources_pkey PRIMARY KEY (id),
CONSTRAINT author_pkey FOREIGN KEY (author_id)
REFERENCES users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT c_unique_status_id UNIQUE (status_id)
);
I am trying to run UPDATE myTable SET generalFreq = 0;
I have to update tables of 1 or 2 billion rows, with various values for each row. Each run makes ~100 million changes (10%).
My first try was to group them in transactions of 300K updates directly on a specific partition, as PostgreSQL does not always optimize prepared queries if you use partitions.
1. Transactions of a bunch of "UPDATE myTable SET myField=value WHERE myId=id" give 1,500 updates/sec, which means each run would take at least 18 hours.
2. The HOT updates solution, as described here, with FILLFACTOR=50 gives 1,600 updates/sec. I use SSDs, so it's a costly improvement, as it doubles the storage size.
3. Inserting into a temporary table of updated values and merging them afterwards with UPDATE...FROM gives 18,000 updates/sec if I do a VACUUM for each partition, and 100,000 up/s otherwise. Cooool. Here is the sequence of operations:
CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));
Accumulate a bunch of updates in a buffer, depending on available RAM.
When it's full, or when you need to change table/partition, or when you're done:
COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;
That means a run now takes 1.5 hours instead of 18 for 100 million updates, vacuum included. To save time, it's not necessary to run a VACUUM FULL at the end, but even a fast regular vacuum is useful to control your transaction IDs on the database and avoid unwanted autovacuum during rush hours.
Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates
First, start with a better FILLFACTOR, do a VACUUM FULL to force a table rewrite, and check the HOT updates after your UPDATE query:
SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';
HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.
P.S. You need version 8.3 or later.
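A hypothetical sketch of that first step (the FILLFACTOR value is the one discussed above):
ALTER TABLE myTable SET (fillfactor = 50);
VACUUM FULL myTable; -- rewrite the table so the new fillfactor takes effect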
After waiting 35 minutes for my UPDATE query to finish (and it still hadn't), I decided to try something different. So what I did was run the following command:
CREATE TABLE table2 AS
SELECT
  all the fields of myTable except the one I wanted to update, 0 as theFieldToUpdate
FROM myTable
Then add the indexes, drop the old table and rename the new one to take its place. That took only 1.7 minutes to process, plus some extra time to recreate the indexes and constraints. But it did help! :)
Of course, that only worked because nobody else was using the database. I would need to lock the table first if this were a production environment.
Today I've spent many hours on a similar issue. I've found a solution: drop all the constraints/indexes before the update. No matter whether the column being updated is indexed or not, it seems PostgreSQL updates all the indexes for all the updated rows. After the update is finished, add the constraints/indexes back.
Try this (note that generalFreq starts as type REAL, and stays the same):
ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;
This will rewrite the table, similar to a DROP + CREATE, and rebuild all indexes, all in one command. It is much faster (about 2x), and you don't have to deal with dependencies and recreating indexes and other stuff, though it does lock the table (ACCESS EXCLUSIVE, i.e. a full lock) for the duration. Or maybe that's what you want, if you want everything else to queue up behind it. If you aren't updating "too many" rows, this way is slower than a plain update.
How are you running it? If you are looping each row and performing an update statement, you are running potentially millions of individual updates which is why it will perform incredibly slowly.
If you are running a single update statement for all records in one statement it would run a lot faster, and if this process is slow then it's probably down to your hardware more than anything else. 3 million is a lot of records.
The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, ex:
UPDATE myTable SET generalFreq = 0 where generalFreq != 0;
(you might also need an index on generalFreq). Then you'll update fewer rows, though not if the values are all nonzero already; updating fewer rows "can help" because otherwise the rows and all their indexes are updated regardless of whether the value changed or not.
Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create another by just adjusting metadata, in instant time, as sketched below.
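A hedged sketch of the metadata-only route (on Postgres 11+ adding a column with a constant default is a metadata-only change; on older versions, add the column without a default, so it reads as NULL, to stay metadata-only):
ALTER TABLE myTable DROP COLUMN generalFreq;
ALTER TABLE myTable ADD COLUMN generalFreq real NOT NULL DEFAULT 0;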
In my tests I noticed that a big update, of more than 200,000 rows, is slower than two updates of 100,000 rows, even with a temporary table.
My solution is to loop: in each loop, create a temporary table of 200,000 rows, compute my values in that table, then update my main table with the new values, and so on. See the sketch below.
Every 2,000,000 rows I manually run "VACUUM ANALYSE mytable"; I noticed that autovacuum doesn't do its job for such updates.
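A hypothetical sketch of one loop iteration (table and column names reuse the question's; the batch size is the one from this answer):
-- stage one batch of new values
CREATE TEMP TABLE batch AS
SELECT id, 0::real AS newfreq
FROM myTable
WHERE generalFreq <> 0
LIMIT 200000;

-- apply the batch to the main table
UPDATE myTable m
SET generalFreq = b.newfreq
FROM batch b
WHERE m.id = b.id;

DROP TABLE batch;
-- every few batches, outside a transaction:
VACUUM ANALYSE myTable;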
I needed to update more than 1B+ rows on a PostgreSQL table which contains some indexes. I am working with PostgreSQL 12 + SQLAlchemy + Python.
Inspired by the answers here, I wrote a temp table and UPDATE... FROM based updater to see if it makes a difference. The temp table is fed from CSV generated by Python and uploaded over the normal SQL client connection.
The speed-up over the naive approach using SQLAlchemy's bulk_update_mappings is 4x - 5x. Not an order of magnitude, but still considerable, and in my case it means 1 day, not 1 week, of a batch job.
Below is the relevant Python code that does CREATE TEMPORARY TABLE, COPY FROM and UPDATE FROM. See the full example in this gist.
import csv
import logging
from io import StringIO
from typing import List

from sqlalchemy.orm import Session

logger = logging.getLogger(__name__)


def bulk_load_psql_using_temp_table(
    dbsession: Session,
    data_as_dicts: List[dict],
):
    """Bulk update columns in PostgreSQL faster using a temp table.

    Works around speed issues of `bulk_update_mappings()` on PostgreSQL.
    Your mileage and speed may vary, but it is going to be faster.
    The observation was 3x ... 4x faster when doing UPDATEs
    where one of the columns is indexed.

    Contains hardcoded temp table creation and UPDATE FROM statements.
    In our case we are bulk updating three columns.

    - Create a temp table - if not created before
    - Fill it from an in-memory CSV using COPY FROM
    - Then perform UPDATE ... FROM on the actual table from the temp table
    - Between the update chunks, clear the temp table using TRUNCATE

    Why is it faster? I did not get a clear answer from the sources I was
    reading. At least there should be less data uploaded from the client
    to the server, as CSV loading is more compact than bulk updates.

    Further reading

    - `About PSQL temp tables <https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/>`_
    - `Naive bulk_update_mappings approach <https://stackoverflow.com/questions/36272316/using-bulk-update-mappings-in-sqlalchemy-to-update-multiple-rows-with-different>`_
    - `Discussion on UPDATE ... FROM + temp table approach <https://stackoverflow.com/questions/3361291/slow-simple-update-query-on-postgresql-database-with-3-million-rows/24811058#24811058>`_

    :param dbsession:
        SQLAlchemy session.
        Note that we open a separate connection for the bulk update.

    :param data_as_dicts:
        Inbound data, as it would be given to bulk_update_mappings
    """
    # mem table created in sql
    temp_table_name = "temp_bulk_temp_loader"
    # the real table whose data we are filling
    real_table_name = "swap"
    # columns we need to copy
    columns = ["id", "sync_event_id", "sync_reserve0", "sync_reserve1"]
    # how our CSV fields are separated
    delim = ";"
    # Increase temp buffer size for updates
    temp_buffer_size = "3000MB"

    # Dump data to a local mem buffer using CSV writer.
    # No header - this is specifically addressed in copy_from()
    out = StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter=delim)
    writer.writerows(data_as_dicts)

    # Update data over a separate raw connection
    engine = dbsession.bind
    conn = engine.connect()
    try:
        # No rollbacks
        conn.execution_options(isolation_level="AUTOCOMMIT")

        # See https://blog.codacy.com/how-to-update-large-tables-in-postgresql/
        conn.execute(f"""SET temp_buffers = '{temp_buffer_size}';""")

        # Temp table is dropped at the end of the session
        # https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/
        # This must match the data_as_dicts structure.
        sql = f"""
        CREATE TEMP TABLE IF NOT EXISTS {temp_table_name}
        (
            id int,
            sync_event_id int,
            sync_reserve0 bytea,
            sync_reserve1 bytea
        );
        """
        conn.execute(sql)

        # Clean any pending data in the temp table
        # between update chunks.
        # TODO: Not sure why this does not clear itself at conn.close()
        # as I would expect based on the documentation.
        sql = f"TRUNCATE {temp_table_name}"
        conn.execute(sql)

        # Load data from the CSV buffer to the temp table
        # https://www.psycopg.org/docs/cursor.html
        cursor = conn.connection.cursor()
        out.seek(0)
        cursor.copy_from(out, temp_table_name, sep=delim, columns=columns)

        # Fill the real table from the temp table.
        # This copies values from the temp table using
        # UPDATE ... FROM, matching by the row id.
        sql = f"""
        UPDATE {real_table_name}
        SET
            sync_event_id=b.sync_event_id,
            sync_reserve0=b.sync_reserve0,
            sync_reserve1=b.sync_reserve1
        FROM {temp_table_name} AS b
        WHERE {real_table_name}.id=b.id;
        """
        res = conn.execute(sql)
        logger.debug("Updated %d rows", res.rowcount)
    finally:
        conn.close()
I update millions of rows incrementally, in batches with minimal locks, using a single procedure, loop_execute(). It reports the progress of execution as a percentage and predicts the time the work will end!
Try
UPDATE myTable SET generalFreq = 0.0;
Maybe it is a casting issue.