SQLite batch commits vs. a single commit in one shot for huge bulk inserts - sql

I need to dump a huge data set (~10-40 million rows) into a SQLite database. Is there an advantage to doing a commit every n inserts (n could be 50,000, 100,000, etc.) vs. doing a single commit only after all 40 million rows have been inserted?
Obviously, in theory a single commit will be the fastest way to do it. But is there an advantage to committing in batches? In my case it's all or nothing: either all of the data gets inserted or none of it does. Is there any danger in doing an extremely large number of inserts in SQLite before committing (i.e. do I need more disk space for SQLite because it needs larger temp files)?
I'm using Perl DBI to insert the data.

I have had some performance improvements from the following:
Set PRAGMA synchronous = OFF; this stops the SQLite engine from waiting for OS-level writes to complete.
Set PRAGMA journal_mode = MEMORY; this tells the SQLite engine to store the journal in RAM instead of on disk. The only drawback is that the database can't be recovered after an OS crash or power failure.
Next, create your indexes after all the inserts are done. Also, you may issue a commit after every 100,000 records (see the sketch below).
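Here is a minimal sketch of that approach: the PRAGMAs above, one prepared statement, and a commit every 100,000 rows. The question uses Perl DBI, but the same pattern is shown here through the sqlite-jdbc driver; the file name, the table t(a, b), and the batch size are placeholders, not anything from the original post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SqliteBulkInsert {
    public static void main(String[] args) throws Exception {
        // Assumes the xerial sqlite-jdbc driver is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:bulk.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("PRAGMA synchronous = OFF");     // don't wait for OS-level writes
                st.execute("PRAGMA journal_mode = MEMORY"); // keep the rollback journal in RAM
                st.execute("CREATE TABLE IF NOT EXISTS t (a INTEGER, b TEXT)");
            }
            conn.setAutoCommit(false);                      // group inserts into explicit transactions
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO t (a, b) VALUES (?, ?)")) {
                for (int i = 0; i < 40_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();
                    if (i % 100_000 == 0) {                 // commit every 100,000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();                          // flush the last partial batch
                conn.commit();
            }
            try (Statement st = conn.createStatement()) {
                // Build indexes only after the bulk load is finished.
                st.execute("CREATE INDEX IF NOT EXISTS idx_t_a ON t(a)");
            }
            conn.commit();
        }
    }
}

The same structure carries over to Perl DBI: turn AutoCommit off, prepare the INSERT once, and call $dbh->commit every n rows.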

Related

Understanding locks and query status in Snowflake (multiple updates to a single table)

While using the Python connector for Snowflake with queries of the form
UPDATE X.TABLEY SET STATUS = %(status)s, STATUS_DETAILS = %(status_details)s WHERE ID = %(entry_id)s
sometimes I get the following message:
(snowflake.connector.errors.ProgrammingError) 000625 (57014): Statement 'X' has locked table 'XX' in transaction 1588294931722 and this lock has not yet been released.
and soon after that
Your statement 'X' was aborted because the number of waiters for this lock exceeds the 20 statements limit
This usually happens when multiple queries are trying to update a single table. What I don't understand is that when I see the query history in Snowflake, it says the query finished successfully (Succeeded status), but in reality the UPDATE never happened, because the table did not change.
So according to https://community.snowflake.com/s/article/how-to-resolve-blocked-queries I used
SELECT SYSTEM$ABORT_TRANSACTION(<transaction_id>);
to release the lock, but still nothing happened, and even with the Succeeded status the query seems not to have executed at all. So my question is: how does this really work, and how can a lock be released without losing the execution of the query? (Also, what happens to the other 20+ queries that are queued because of the lock? Sometimes it seems that when the lock is released, the next one takes the lock and has to be aborted as well.)
I would appreciate it if you could help me. Thanks!
Not sure if Sergio got an answer to this. The problem in this case is not with the table. Based on my experience with Snowflake, below is my understanding.
In Snowflake, every table operation also involves a change in the metadata table which keeps track of micro-partitions and their min/max values. This metadata table supports only 20 concurrent DML statements by default. So if a table is continuously being updated and hit at the same partition, there is a chance that this limit will be exceeded. In this case, we should look at redesigning the table update/insert logic. In one of our use cases, we increased the limit to 50 after speaking to the Snowflake support team.
UPDATE, DELETE, and MERGE cannot run concurrently on a single table; they will be serialized, as only one can take a lock on a table at a time. The others will queue up in the "blocked" state until it is their turn to take the lock. There is a limit on the number of queries that can be waiting on a single lock.
If you see an update finish successfully but don't see the updated data in the table, then you are most likely not COMMITting your transactions. Make sure you run COMMIT after an update so that the new data is committed to the table and the lock is released.
Alternatively, you can make sure AUTOCOMMIT is enabled so that DML will commit automatically after completion. You can enable it with ALTER SESSION SET AUTOCOMMIT=TRUE; in any session that is going to run an UPDATE.
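A minimal sketch of that advice follows. The question uses the Python connector, but the same pattern is shown here through the Snowflake JDBC driver; the account, credentials, warehouse, and bind values are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class SnowflakeUpdateCommit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");           // placeholder credentials
        props.put("password", "MY_PASSWORD");
        props.put("db", "X");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com"; // placeholder account
        try (Connection conn = DriverManager.getConnection(url, props)) {
            conn.setAutoCommit(false);           // or leave autocommit on and drop the explicit commit
            String sql = "UPDATE X.TABLEY SET STATUS = ?, STATUS_DETAILS = ? WHERE ID = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "DONE");         // placeholder values
                ps.setString(2, "all good");
                ps.setLong(3, 42L);
                ps.executeUpdate();
            }
            conn.commit();                       // makes the update visible and releases the table lock
        }
    }
}

With the Python connector, the equivalent is simply calling conn.commit() after the UPDATE, or running ALTER SESSION SET AUTOCOMMIT=TRUE as shown above.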

Sybase ASE 15.5: Slow insert with JDBC executeBatch()

I'm using Sybase ASE 15.5 and the JDBC driver jconnect 4, and I'm experiencing slow inserts with executeBatch() with a batch size of +/-40 rows on a large table of 400 million rows with columns (integer, varchar(128), varchar(255)), a primary key and clustered index on columns (1,2), and a nonclustered index on columns (2,1). Each batch of +/-40 rows takes +/-200 ms. Is the slowness related to the size of the table? I know that dropping the indexes can improve performance, but unfortunately that is not an option. How can I improve the speed of insertion?
NOTE: This is part of the application's live run, not a one-shot migration, so I won't be using the bcp tool.
EDIT: I have checked this answer for MySQL, but I'm not sure it applies to Sybase ASE: https://stackoverflow.com/a/13504946/8315843
There are many reasons why the inserts could be slow, e.g.:
each insert statement has to be parsed/compiled; the ASE 15.x optimizer attempts to do a lot more work than the previous ASE 11/12 optimizer, with the net result that compiles (generally) take longer to perform
the batch is not wrapped in a single transaction, so each insert has to wait for a separate write to the log to complete
you've got a slow network connection between the client host and the dataserver host
there's some blocking going on
the table has FK constraints that need to be checked for each insert
there's an insert trigger on the table (with the obvious question of what the trigger is doing and how long it takes to perform its operations)
Some ideas to consider re: speeding up the inserts:
use prepared statements; the first insert is compiled into a lightweight procedure (think 'temp procedure'); follow-on inserts (using the prepared statement) benefit from not having to be compiled
make sure a batch of inserts are wrapped in a begin/commit tran wrapper; this tends to defer the log write(s) until the commit tran is issued; fewer writes to the log means less time waiting for the log write to be acknowledged
if you have a (relatively) slow network connection between the application and dataserver hosts, look at using a larger packet size; fewer packets means less time waiting for round-trip packet processing/waiting
look into if/how JDBC supports the bulk-copy libraries (basically implementing bcp-like behavior via JDBC) [I don't work with JDBC, so I'm only guessing this might be available]
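Here is a minimal sketch of the first two ideas combined: a prepared statement plus one begin/commit wrapper around the whole batch. It assumes the jConnect driver is on the classpath; the URL, credentials, and table my_table(id, name, descr) are placeholders for the real environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SybaseBatchInsert {
    public static void main(String[] args) throws Exception {
        // jConnect-style URL; host, port, database, and credentials are placeholders.
        String url = "jdbc:sybase:Tds:dbhost:5000/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            conn.setAutoCommit(false);   // wrap the whole batch in a single transaction
            String sql = "INSERT INTO my_table (id, name, descr) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 40; i++) {          // ~40 rows per batch, as in the question
                    ps.setInt(1, i);
                    ps.setString(2, "name-" + i);
                    ps.setString(3, "description-" + i);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();               // defers the log write(s) to this single point
        }
    }
}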
Some of the above is covered in these SO threads:
Getting ExecuteBatch to execute faster
JDBC Delete & Insert using batch
Efficient way to do batch INSERTS with JDBC

What does ExecuteNonQuery() do during a bulk insert?

I have some working VB.NET bulk-insert code that I have written. It calls ExecuteNonQuery() for each insert and then does a commit() at the end.
My question is: where are these inserts held while waiting for the commit() command? I have not made any changes to support batching yet, so with my existing code a million rows will be inserted before commit() is called. I ask this, obviously, to know whether I will run into memory issues and hence be forced to make changes to my code now.
In the normal rollback journal mode, changes are simply written to the database. However, to allow atomic commits, the previous contents of all changed database pages are written to the rollback journal so that a rollback can restore the previous state.
(When you do so many inserts that new pages need to be allocated, there is no old state for those pages.)
In WAL mode, all changes are written to the write-ahead log.
In either case, nothing is actually written until the amount of data overflows the page cache (which has a size of about 2 MB by default).
So the size of a transaction is not limited by memory, only by disk space.
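If you want to check or tune these settings yourself, the journal mode and page cache size can be read and changed through PRAGMAs like any other statement. A small sketch, assuming the sqlite-jdbc driver and an existing database file; the 64 MB value is purely illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InspectSqliteCache {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:bulk.db");
             Statement st = conn.createStatement()) {
            try (ResultSet rs = st.executeQuery("PRAGMA journal_mode")) {  // delete, wal, memory, ...
                if (rs.next()) System.out.println("journal_mode = " + rs.getString(1));
            }
            try (ResultSet rs = st.executeQuery("PRAGMA cache_size")) {    // pages, or KiB if negative
                if (rs.next()) System.out.println("cache_size = " + rs.getInt(1));
            }
            // A larger cache defers writes for longer, but transaction size is still
            // bounded by disk space, not memory.
            st.execute("PRAGMA cache_size = -65536");  // ~64 MB, illustrative value only
        }
    }
}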
In a bulk insert query, command.ExecuteNonQuery() returns the number of rows affected by the insert, update, or delete statement.
In your case, after each successful insert it will return 1 as an integer.
If you are not using a transaction, Commit doesn't make any sense; if you're not explicitly using a transaction, changes are committed automatically.

How to update lots of rows without locking the table in SQL Server 2008?

I have a job which updates a table for 20 minutes, and during that time I naturally can't update any of its rows.
Is there a way or method to do this?
The job can take longer, but I have to be able to keep updating the table.
On the other hand, the job should roll back if it hits an error while running.
Thanks..
Split the job into separate transactions. The way locks work in a professional DBMS like SQL Server is that they escalate to higher levels as they are required. Once a query has hit a lot of pages, it's only natural that the lock is escalated to a table lock. The only way to circumvent this while keeping transactional integrity is to split the work into smaller jobs.
As Niels has pointed out, you should attempt to update the table in batches, explicitly committing each batch in its own transaction. If you are creating enough locks to warrant a table lock, then chances are you have also ballooned your transaction log. It's probably worth checking it and shrinking it down to a more reasonable size if necessary.
Alternatively, you could try enabling trace flags 1224 or 1211, which "disable lock escalation based on the number of locks": http://msdn.microsoft.com/en-us/library/ms188396.aspx
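A minimal sketch of the batched approach, assuming the Microsoft SQL Server JDBC driver and a hypothetical Processed flag column that lets each pass skip rows that are already done; connection details are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchedUpdateJob {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=MyDb;user=sa;password=secret;encrypt=false";
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement()) {
            conn.setAutoCommit(false);
            while (true) {
                // Update 1000 rows at a time so each transaction holds its locks only briefly.
                int rows = st.executeUpdate(
                    "UPDATE TOP (1000) dbo.MyTable SET Processed = 1 WHERE Processed = 0");
                conn.commit();           // release locks after every batch
                if (rows == 0) break;    // nothing left to update
            }
        }
    }
}

Note that committing per batch trades away the single all-or-nothing rollback asked about in the question, so you would need your own bookkeeping (such as the Processed flag) to resume or undo a partial run.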

Mass Updates and Commit frequency in SQL Server

My database background is mainly Oracle, but I've recently been helping with some SQL Server work. My group has inherited some SQL Server DTS packages that do daily loads and updates of large amounts of data. Currently it is running in SQL Server 2000, but will soon be upgraded to SQL Server 2005 or 2008. The mass updates are running too slowly.
One thing I noticed about the code is that some large updates are done in procedural code in loops, so that each statement only updates a small portion of the table in a single transaction. Is this a sound method to do updates in SQL server? Locking of concurrent sessions should not be an issue because user access to tables is disabled while the bulk loading takes place. I've googled around some, and found some articles suggesting that doing it this way conserves resources, and that resources are released each time an update commits, leading to greater efficiency. In Oracle this is generally a bad approach, and I've used single transactions for very large updates with success in Oracle. Frequent commits slow the process down and use more resources in Oracle.
My question is: for mass updates in SQL Server, is it generally good practice to use procedural code and commit many small SQL statements, or to use one big statement to do the whole update?
Sorry guys,
None of the above answers the question; they are just examples of how you can do things. The answer is: more resources get used with frequent commits; however, the transaction log cannot be truncated until a commit point. Thus, if your single spanning transaction is very big, it will cause the transaction log to grow and possibly fragment, which, if undetected, will cause problems later. Also, in a rollback situation the duration is generally twice as long as the original transaction. So if your transaction fails after half an hour, it will take an hour to roll back, and you can't stop it :-)
I have worked with SQL Server 2000/2005, DB2, and ADABAS, and the above is true for all of them. I don't really see how Oracle could work differently.
You could possibly replace the T-SQL with a bcp command and there you can set the batch size without having to code it.
Issuing frequent commits within a single table scan is preferable to running multiple scans that each process a small number of rows, because if a table scan is required, the whole table will generally be scanned even if you are only returning a small subset.
Stay away from snapshots. A snapshot will only increase the number of I/Os and compete for I/O and CPU.
In general, I find it better to update in batches - typically in the range of 100 to 1000 rows. It all depends on how your tables are structured: foreign keys? Triggers? Or just updating raw data? You need to experiment to see which scenario works best for you.
If I am in pure SQL, I will do something like this to help manage server resources:
SET ROWCOUNT 1000              -- limit each statement to 1000 rows
WHILE 1=1 BEGIN
    DELETE FROM MyTable WHERE ...
    IF @@ROWCOUNT = 0          -- nothing left to delete
        BREAK
END
SET ROWCOUNT 0                 -- restore the default
In this example, I am purging data. This would only work for an UPDATE if you could restrict or otherwise selectively update rows. (Or only insert xxxx number of rows into an auxiliary table that you can JOIN against.)
But yes, try not to update xx million rows at one time. It takes forever and if an error occurs, all those rows will be rolled back (which takes an additional forever.)
Well, everything depends.
But... assuming your DB is in single-user mode or you have table locks (TABLOCKX) against all the tables involved, batches will probably perform worse, especially if the batches are forcing table scans.
The one caveat is that very complex queries will quite often consume resources in tempdb; if tempdb runs out of space (because the execution plan required a nasty, complicated hash join), you are in deep trouble.
Working in batches is a general practice that is quite often used in SQL Server (when it's not in snapshot isolation mode) to increase concurrency and to avoid huge transaction rollbacks caused by deadlocks (you tend to get deadlocks galore when updating a 10-million-row table that is active).
When you move to SQL Server 2005 or 2008, you will need to redo all those DTS packages in SSIS. I think you will be pleasantly surprised to see how much faster SSIS can be.
In general, in SQL Server 2000 you want to run things in batches of records if the whole set ties up the table for too long. If you are running the packages at night when there is no use of the system, you may be able to get away with a set-based insert of the entire dataset. Row-by-row is always the slowest method, so avoid that if possible as well (especially if all the row-by-row inserts are in one giant transaction!). If you have 24-hour access with no downtime, you will almost certainly need to run in batches.