I have nearly 7 billion rows of data in memory (List&lt;T&gt; and SortedList&lt;T,T&gt;) in C#. I want to insert this data into tables in SQL Server. To do this, I define a separate SqlConnection for each collection and turn connection pooling off (Pooling=false).
First, I tried inserting the data in connected mode (ExecuteNonQuery). Even though I used Parallel.Invoke to call the insert methods for the different collections concurrently, it is far too slow and has still not finished (I couldn't see any difference between the sequential and concurrent inserts).
I also tried SqlBulkCopy with a DataTable: I read all the data from the collections once and added it to the DataTable, setting BatchSize = 10000 and BulkCopyTimeout = 0 on the SqlBulkCopy. This is also very slow.
How can I insert a huge amount of data into SQL Server fast?
Look into BULK INSERT. The technique is available in various RDBMSs. Basically, you create a (text) file with one line per record and tell the server to consume that file. This is the fastest approach I can think of; I import 50 million rows in a couple of seconds this way.
You already discovered SqlBulkCopy, but you say it is slow. That is usually for one of two reasons:
Your batches are too small. Try streaming the rows in using a custom IDataReader that you pass to WriteToServer, or just use bigger DataTables (see the sketch below).
Your table has nonclustered indexes. Disable them before the import and rebuild them afterwards.
You can't go faster than with bulk-import, though.
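For what it's worth, here is a minimal C# sketch of the bigger-DataTable/chunked approach; the (int Id, string Name) row shape, the dbo.TargetTable destination, and the 50,000-row chunk size are illustrative assumptions, not details from the question:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void BulkLoad(IEnumerable<KeyValuePair<int, string>> rows, string connectionString)
{
    // DataTable columns must line up with the destination table's columns.
    var buffer = new DataTable();
    buffer.Columns.Add("Id", typeof(int));
    buffer.Columns.Add("Name", typeof(string));

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "dbo.TargetTable"; // assumed table name
            bulk.BatchSize = 50000;                        // bigger batches than 10,000
            bulk.BulkCopyTimeout = 0;                      // no timeout on a long load

            foreach (var row in rows)
            {
                buffer.Rows.Add(row.Key, row.Value);

                // Flush in large chunks instead of building one giant DataTable.
                if (buffer.Rows.Count == 50000)
                {
                    bulk.WriteToServer(buffer);
                    buffer.Clear();
                }
            }

            if (buffer.Rows.Count > 0)
                bulk.WriteToServer(buffer); // final partial chunk
        }
    }
}

A custom IDataReader passed to WriteToServer avoids even this buffering, but the chunked DataTable above is usually enough to see a large improvement.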
After trying a few different packages and methods found online, I have yet to find a solution that works for inserting a dataframe from R into an existing table in SQL Server.
I've had great success doing this with MySQL, but SQL Server seems to be more difficult.
I have managed to write a new table using the DBI package, but I can't find a way to insert into using this method. Looking at the documentation, there doesn't seem to be a way of inserting.
As there are more than 1000 rows of data, using sqlQuery from the RODBC package also seems unfeasible.
Can anybody suggest a working method for inserting large amounts of data from a dataframe into an existing SQL table?
I've had similar needs using R and PostgreSQL with the R-specific Postgres drivers, and I imagine similar issues exist with SQL Server. The best solution I found was to write to a temporary table in the database, using either dbWriteTable or one of the underlying functions that write from a stream, to load very large tables (for Postgres, postgresqlCopyInDataframe, for example). The latter usually requires more work in terms of defining and aligning SQL data types and R class types, whereas dbWriteTable tends to be a bit easier. Once the data is in the temporary table, you then issue an SQL statement to insert into your real table, just as you would within the database environment. Below is an example using high-level DBI library database calls:
dbExecute(conn,"start transaction;")
dbExecute(conn,"drop table if exists myTempTable")
dbWriteTable(conn,"myTempTable",df)
dbExecute(conn,"insert into myRealTable(a,b,c) select a,b,c from myTempTable")
dbExecute(conn,"drop table if exists myTempTable")
dbExecute(conn,"commit;")
My company is cursed by a symbiotic partnership turned parasitic. To get our data from the parasite, we have to use a painfully slow ODBC connection. I did notice recently, though, that I can get more throughput by running queries in parallel (even against the same table).
There is a particularly large table that I want to extract data from and move into one of our local tables. Running queries in parallel gets the data faster, but I also imagine that writing data from multiple queries into the same table at once could cause issues.
What advice can you give me on how to best handle this situation so that I can take advantage of the increased speed of using queries in parallel?
EDIT: I've gotten some great feedback here, but I think I wasn't completely clear about the fact that I'm pulling the data via a linked server (which uses the ODBC drivers). That means I can run normal INSERT statements, and I believe that would provide better performance than either SqlBulkCopy or BULK INSERT (in fact, I don't believe BULK INSERT would even be an option).
Have you read Load 1TB in less than 1 hour?
Run as many load processes as you have available CPUs. If you have 32 CPUs, run 32 parallel loads. If you have 8 CPUs, run 8 parallel loads.
If you have control over the creation of your input files, make them of a size that is evenly divisible by the number of load threads you want to run in parallel. Also make sure all records belong to one partition if you want to use the switch partition strategy.
Use BULK INSERT instead of bcp if you are running the process on the SQL Server machine.
Use table partitioning to gain another 8-10%, but only if your input files are GUARANTEED to match your partitioning function, meaning that all records in one file must be in the same partition.
Use TABLOCK to avoid row-at-a-time locking.
Use ROWS PER BATCH = 2500, or something near this, if you are importing multiple streams into one table (a sketch of these two hints follows this list).
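As a rough illustration of the TABLOCK and ROWS PER BATCH items above, a BULK INSERT with those hints can be issued from .NET through an ordinary SqlCommand. The file path, terminators, and table name below are assumptions, and the path is as seen by the SQL Server machine, not the client:

using System.Data.SqlClient;

static void RunBulkInsert(string connectionString)
{
    // BULK INSERT executes on the server, so the data file must be reachable from there.
    const string sql = @"
        BULK INSERT dbo.TargetTable                -- assumed destination table
        FROM 'D:\loads\partition_01.txt'           -- assumed server-side file path
        WITH (
            FIELDTERMINATOR = '|',
            ROWTERMINATOR   = '\n',
            TABLOCK,                               -- avoid row-at-a-time locking
            ROWS_PER_BATCH  = 2500                 -- per the guidance above
        );";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.CommandTimeout = 0; // large loads easily exceed the default 30 seconds
        connection.Open();
        command.ExecuteNonQuery();
    }
}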
For SQL Server 2008, there are certain circumstances where you can utilize minimal logging for a standard INSERT SELECT:
SQL Server 2008 enhances the methods that it can handle with minimal logging. It supports minimally logged regular INSERT SELECT statements. In addition, turning on trace flag 610 lets SQL Server 2008 support minimal logging against a nonempty B-tree for new key ranges that cause allocations of new pages.
If you're looking to do this in code, i.e. C#, there is the option to use SqlBulkCopy (in the System.Data.SqlClient namespace), and as this article suggests, it's possible to do this in parallel:
http://www.adathedev.co.uk/2011/01/sqlbulkcopy-to-sql-server-in-parallel.html
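In the spirit of that article, here is a hedged sketch of a parallel load; the pre-partitioned DataTables and the dbo.TargetTable heap are assumptions for illustration, not code taken from the article:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

static void ParallelBulkLoad(IEnumerable<DataTable> partitions, string connectionString)
{
    Parallel.ForEach(partitions, partition =>
    {
        // Each task gets its own connection via its own SqlBulkCopy. TABLOCK bulk
        // loads into a heap take bulk-update locks that are compatible with each
        // other, so the parallel loads do not block one another.
        using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
        {
            bulk.DestinationTableName = "dbo.TargetTable"; // assumed heap (no clustered index)
            bulk.BulkCopyTimeout = 0;
            bulk.WriteToServer(partition);
        }
    });
}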
If by any chance you've upgraded to SQL 2014, you can insert in parallel (compatibility level must be 110). See this:
http://msdn.microsoft.com/en-us/library/bb510411%28v=sql.120%29.aspx
I am trying to copy a table in one database into another database on another connection in VB.NET, using OleDb. If they were on the same connection I would just use SELECT INTO, but they are not. I have two different OleDbConnection objects and cannot see an easy way to do this.
Right now I am attempting to copy the source table into a DataTable using an OleDbDataAdapter, and then loop through the DataTable and insert every record into the target database one at a time. This obviously takes a very long time for the large databases I could potentially be dealing with, and I have to deal with escaping strings, null values, etc.
Is there an easier way to do this?
Thanks,
Logan
Edit: just to make this clearer: I have two OleDbConnection objects, one linked directly to a local .mdb file on my computer (Jet), the other linked to a database on our servers (SQLOLEDB). I want to do this:
"SELECT * INTO toDB FROM fromDB"
But I can't, because fromDB and toDB are on different connections, and an OleDbCommand object is only attached to one. The only way I can see to do this is to connect to fromDB, copy it into a DataTable, connect to toDB, and copy all of the data in the DataTable row by row into toDB. I was wondering if there is an easier way to do this.
If you are constrained to this architecture, one idea is to write a stored procedure on the server that accepts a large chunk of row data in one call. It could then write the row data out to a file for a future bulk insert, or it could attempt to insert the rows directly.
This also has the benefit of speeding things up over high latency connections to the server.
Also, if you use parameterized statements, you can avoid having to escape strings etc.
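To make that concrete, here is a hedged C# sketch of a parameterized, chunked copy between the two OleDb connections; the table and column names and the 10,000-row commit size are placeholders, and the same pattern translates directly to VB.NET:

using System;
using System.Data.OleDb;

static void CopyTable(string sourceConnStr, string targetConnStr)
{
    using (var source = new OleDbConnection(sourceConnStr))
    using (var target = new OleDbConnection(targetConnStr))
    {
        source.Open();
        target.Open();

        using (var select = new OleDbCommand("SELECT Id, Name FROM fromTable", source))
        using (var reader = select.ExecuteReader())
        using (var insert = new OleDbCommand("INSERT INTO toTable (Id, Name) VALUES (?, ?)", target))
        {
            // Parameterized insert: no manual escaping of strings or NULL handling.
            var pId = insert.Parameters.Add("@Id", OleDbType.Integer);
            var pName = insert.Parameters.Add("@Name", OleDbType.VarWChar, 255);

            var transaction = target.BeginTransaction();
            insert.Transaction = transaction;
            int pending = 0;

            while (reader.Read())
            {
                pId.Value = reader.IsDBNull(0) ? (object)DBNull.Value : reader.GetValue(0);
                pName.Value = reader.IsDBNull(1) ? (object)DBNull.Value : reader.GetValue(1);
                insert.ExecuteNonQuery();

                // Commit in chunks so a single huge transaction doesn't build up.
                if (++pending == 10000)
                {
                    transaction.Commit();
                    transaction = target.BeginTransaction();
                    insert.Transaction = transaction;
                    pending = 0;
                }
            }

            transaction.Commit();
        }
    }
}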
If you are just copying from one to the other, why don't you do it in SQL?
You can create a Synonym within one database pointing at a table, view or stored proc on another database (on another server). You can then insert into this synonym just like you could into a table in the same db.
http://www.developer.com/db/article.php/3613301/Using-Synonyms-in-SQL-Server-2005.htm
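For illustration, a hedged sketch of the synonym approach driven from C# with plain SqlCommand calls; the linked-server, database, table, and column names are all placeholders:

using System.Data.SqlClient;

static void InsertViaSynonym(string connectionString)
{
    // One-time setup: give the remote table a local name.
    const string createSynonym =
        "CREATE SYNONYM dbo.RemoteCustomers " +
        "FOR [RemoteServer].[RemoteDb].[dbo].[Customers];";

    // Afterwards the synonym can be used exactly like a local table.
    const string copyRows =
        "INSERT INTO dbo.RemoteCustomers (Id, Name) " +
        "SELECT Id, Name FROM dbo.LocalCustomers;";

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var setup = new SqlCommand(createSynonym, connection))
        {
            setup.ExecuteNonQuery();
        }
        using (var copy = new SqlCommand(copyRows, connection))
        {
            copy.CommandTimeout = 0; // the copy itself may run for a while
            copy.ExecuteNonQuery();
        }
    }
}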
I need to select some 100k+ records from a SQL table, do some processing, and then bulk insert them into another table. I am using SqlBulkCopy to do the bulk insert, which runs quickly. For getting the 100k+ records, I am currently using a DataReader.
Problem: sometimes I get a timeout error in the DataReader. I have increased the timeout to a manageable number.
Is there anything like SqlBulkCopy for selecting records in bulk batches?
Thanks!
Bala
It sounds like you should do all your processing inside SQL Server, or split the data into chunks.
A quote from this MSDN page:
Note
No special optimization techniques exist for bulk-export operations. These operations simply select the data from the source table by using a SELECT statement.
However, that same page mentions that the bcp utility can "bulk export" data from SQL Server to a file.
I suggest you try your query with bcp, and see if it's significantly faster. If it's not, I'd give up and try fiddling with your batch sizes, or look harder at moving the processing into SQL Server.
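If you do stay in .NET, here is a hedged sketch of streaming the SELECT straight into SqlBulkCopy without buffering all 100k+ rows on the client; the table and column names are placeholders, and any per-row processing would have to be pushed into the SELECT or into a wrapping IDataReader:

using System.Data.SqlClient;

static void CopyWithStreaming(string sourceConnStr, string targetConnStr)
{
    using (var source = new SqlConnection(sourceConnStr))
    using (var target = new SqlConnection(targetConnStr))
    {
        source.Open();
        target.Open();

        using (var select = new SqlCommand("SELECT Id, Payload FROM dbo.SourceTable", source))
        {
            select.CommandTimeout = 0; // stop the long-running SELECT from timing out

            using (var reader = select.ExecuteReader())
            using (var bulk = new SqlBulkCopy(target))
            {
                bulk.DestinationTableName = "dbo.TargetTable";
                bulk.BulkCopyTimeout = 0;
                bulk.EnableStreaming = true;  // .NET 4.5+: don't buffer the whole reader in memory
                bulk.BatchSize = 10000;
                bulk.WriteToServer(reader);   // rows flow source -> client -> target
            }
        }
    }
}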
I have to INSERT a lot of rows (more than 1,000,000,000) into a SQL Server database. The table has an auto-increment Id, two varchar(80) columns, and a smalldatetime with GETDATE() as the default value. The last one is just for auditing, but necessary.
I'd like to know the best (fastest) way to INSERT the rows. I've been reading about BULK INSERT, but if possible I'd like to avoid it, because the app does not run on the same server where the database is hosted and I'd like to keep them as isolated as possible.
Thanks!
Diego
Another option would be bcp.
Alternatively, if you're using .NET, you can use the SqlBulkCopy class to bulk insert data. This is something I've blogged about with regard to performance, which you may be interested in, as I compared SqlBulkCopy vs. another way of bulk loading data into SQL Server from .NET (using SqlDataAdapter). A basic example loading 100,000 rows took 0.8229s using SqlBulkCopy vs. 25.0729s using the SqlDataAdapter approach.
Create an SSIS package that copies the file to the SQL Server machine, then use a data flow task to import the data from the file into the SQL Server database.
There is no faster or more efficient way than BULK INSERT, and when you're dealing with such a large amount of data, do not even think about anything from .NET, because thanks to the GC, managing millions of objects in memory causes massive performance degradation.