What is the fastest way to insert a huge array (10M elements) from a C# application?
Until now I have used bulk insert: the C# app generates a large text file and I load it with the BULK INSERT command. Out of curiosity, I wrote a simple user-defined CLR table-valued function:
[SqlFunction(Name = "getArray", FillRowMethodName = "FillRow")]
public static IEnumerable getArray(String name)
{
    return my_arrays[name]; // returns the array I want to insert into db
}

public static void FillRow(Object o, out SqlDouble sdo)
{
    sdo = new SqlDouble((double)o);
}
And this query:
INSERT INTO my_table SELECT data FROM dbo.getArray('x');
Runs almost twice as fast as the bulk equivalent. The exact results are:
BULK - 330s (write to disk + insert)
TVF - 185s
Of course, this is due to the write-to-disk overhead, but I don't know whether BULK INSERT has any in-memory equivalent.
So my question is: is the TVF actually better than BULK INSERT (which was created for huge inserts), or am I missing something here? Is there a third alternative?
I use SqlBulkCopy when I really need the last drop of performance; that way you can skip the overhead of first putting it all on disk.
SqlBulkCopy accepts an IDataReader that you have to implement, but only a few methods of the interface. What I always do is create the class MyBulkCopySource : IDataReader, click 'Implement interface', and feed it to the SqlBulkCopy as-is to see which method gets called. Implement that, try again, etc. You only need to implement three or four of them; the rest never get called.
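For illustration, here is a minimal sketch of what such a class could look like for a single column of doubles (the class, column, and table names are made up; only the handful of members SqlBulkCopy actually needs do real work, and the rest are stubs you would fill in if they turn out to get called):

using System;
using System.Data;

public class MyBulkCopySource : IDataReader
{
    private readonly double[] _data;
    private int _row = -1;

    public MyBulkCopySource(double[] data) { _data = data; }

    // The members SqlBulkCopy actually uses for a simple single-column copy:
    public int FieldCount => 1;
    public bool Read() => ++_row < _data.Length;
    public object GetValue(int i) => _data[_row];
    public int GetOrdinal(string name) => 0;        // only needed when mapping columns by name

    // Housekeeping:
    public void Dispose() { }
    public void Close() { }
    public bool IsClosed => _row >= _data.Length;
    public int Depth => 0;
    public int RecordsAffected => -1;
    public bool NextResult() => false;
    public string GetName(int i) => "data";
    public Type GetFieldType(int i) => typeof(double);
    public bool IsDBNull(int i) => false;
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(0);

    // Never called in this scenario - left as stubs:
    public DataTable GetSchemaTable() => throw new NotImplementedException();
    public string GetDataTypeName(int i) => throw new NotImplementedException();
    public int GetValues(object[] values) => throw new NotImplementedException();
    public bool GetBoolean(int i) => throw new NotImplementedException();
    public byte GetByte(int i) => throw new NotImplementedException();
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => throw new NotImplementedException();
    public char GetChar(int i) => throw new NotImplementedException();
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => throw new NotImplementedException();
    public IDataReader GetData(int i) => throw new NotImplementedException();
    public DateTime GetDateTime(int i) => throw new NotImplementedException();
    public decimal GetDecimal(int i) => throw new NotImplementedException();
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => throw new NotImplementedException();
    public Guid GetGuid(int i) => throw new NotImplementedException();
    public short GetInt16(int i) => throw new NotImplementedException();
    public int GetInt32(int i) => throw new NotImplementedException();
    public long GetInt64(int i) => throw new NotImplementedException();
    public string GetString(int i) => throw new NotImplementedException();
}

Feeding it to the bulk copy is then just a matter of setting DestinationTableName on a SqlBulkCopy and calling WriteToServer(new MyBulkCopySource(myArray)).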
AFAIK this is the fastest way to pump data from a C# program into a SqlDB.
GJ
Use SqlBulkCopy
From multiple threads, in blocks of about 30,000 rows each time.
NOT to the final table, but to a temporary table,
from which you copy over, using a connection setting that does not honor locks.
This keeps locking on the final table to a minimum.
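A rough sketch of that staging-table pattern, assuming SQL Server and made-up table/column names (each worker thread would run the bulk-copy part on its own chunk of rows):

using System.Data;
using System.Data.SqlClient;

static void LoadChunk(string connectionString, DataTable chunk)
{
    using (var conn = new SqlConnection(connectionString))
    using (var bulk = new SqlBulkCopy(conn))
    {
        conn.Open();
        bulk.DestinationTableName = "my_table_staging"; // staging table, not the final one
        bulk.BatchSize = 30000;                         // roughly 30,000 rows per batch
        bulk.WriteToServer(chunk);
    }
}

static void CopyToFinalTable(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        // READ UNCOMMITTED is one way to read the staging table without honoring locks
        "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; " +
        "INSERT INTO my_table (data) SELECT data FROM my_table_staging;", conn))
    {
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}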
I need to import millions of records into multiple SQL Server relational tables.
TableA (Aid (PK), Name, Color) -- returns the new Aid using SCOPE_IDENTITY()
TableB (Bid, Aid (FK), Name) -- here we need to insert the Aid (PK) value obtained via SCOPE_IDENTITY()
How can I do a bulk insert of a collection of millions of records using Dapper in one single INSERT statement?
Dapper just wraps raw ADO.NET; raw ADO.NET doesn't offer a facility for this, therefore dapper does not. What you want is SqlBulkCopy. You could also use a table-valued-parameter, but this really feels like a SqlBulkCopy job.
In a pinch, you can use dapper here - Execute will unroll an IEnumerable<T> into a series of commands, one per item - but that will be lots of commands; and unless you explicitly enable async pipelining, it will suffer from per-command latency (the pipelined mode avoids this, but it will still be n commands). But SqlBulkCopy will be much more efficient.
If the input data is an IEnumerable<T>, you might want to use ObjectReader from FastMember; for example:
IEnumerable<SomeType> data = ...
using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(data, "Id", "Name", "Description"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}
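For completeness, a rough sketch of the table-valued-parameter route mentioned above; the table type, table name, and columns here are made up, and the type has to exist on the server first (e.g. CREATE TYPE dbo.SomeTableType AS TABLE (Id int, Name nvarchar(100), Description nvarchar(max))):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class SomeType { public int Id; public string Name; public string Description; }

static void InsertViaTvp(SqlConnection connection, IEnumerable<SomeType> data)
{
    // Copy the items into a DataTable whose columns match the table type.
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("Name", typeof(string));
    table.Columns.Add("Description", typeof(string));
    foreach (var item in data)
        table.Rows.Add(item.Id, item.Name, item.Description);

    using (var cmd = new SqlCommand(
        "INSERT INTO SomeTable (Id, Name, Description) SELECT Id, Name, Description FROM @rows",
        connection))
    {
        var p = cmd.Parameters.AddWithValue("@rows", table);
        p.SqlDbType = SqlDbType.Structured;   // marks the parameter as a TVP
        p.TypeName = "dbo.SomeTableType";     // the user-defined table type
        cmd.ExecuteNonQuery();
    }
}

For millions of rows, though, SqlBulkCopy as shown above is still the better fit.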
I am working on developing an application for my company. From the beginning we were planning on having a split DB with an Access front end, and storing the back-end data on our shared server. However, after doing some research we realized that storing the data in a back-end Access DB on a shared drive isn't the best idea for many reasons (the VPN is very slow to the shared drive from remote offices, Access might not be the best with millions of records, etc.). Anyway, we decided to still use the Access front end, but host the data on our SQL Server.
I have a couple questions about storing data on our SQL server. Right now when I insert a record I do it with something like this:
Private Sub addButton_Click()
    Dim rsToRun As DAO.Recordset
    Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun")
    rsToRun.AddNew
    rsToRun("MemNum").Value = memNumTextEntry.Value
    rsToRun.Update
    memNumTextEntry.Value = Null
End Sub
It seems like it is inefficient to have to use a sql statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun will this take forever to run? Would it be more efficient just to use an insert statement? If so, how do you do it? Our program is still young in development so we can easily make pretty substantial changes. Nobody on my team is an access or SQL expert so any help is really appreciated.
If you're working with SQL Server, use ADO. It handles server access much better than DAO.
If you are inserting data into a SQL Server table, an INSERT statement can have (in SQL 2008) up to 1000 comma-separated VALUES groups. You therefore need only one INSERT for each 1000 records. You can just append additional inserts after the first, and do your entire data transfer through one string:
INSERT INTO ToRun (MemNum) VALUES ('abc'),('def'),...,('xyz');
INSERT INTO ToRun (MemNum) VALUES ('abcd'),('efgh'),...,('wxyz');
...
You can assemble this in a string, then use an ADO Connection.Execute to do the work. It is frequently faster than multiple DAO or ADO .AddNew/.Update pairs. You just need to remember to requery your recordset afterwards if you need it to be populated with your newly-inserted data.
There are actually two questions in your post:
Will OpenRecordset("SELECT * FROM ToRun") immediately load all records?
No. By default, DAO's OpenRecordset opens a server-side cursor, so the data is not retrieved until you actually start to move around the recordset. Still, it's bad practice to select lots of rows if you don't need to. This leads to the next question:
How should I add records in an attached SQL Server database?
There are a few ways to do that (in order of preference):
Use an INSERT statement. That's the most elegant and direct solution: you want to insert something, so you execute INSERT, not SELECT and AddNew. As Monty Wild explained in his answer, ADO is preferred. In particular, ADO allows you to use parameterized commands, which means that you don't have to put-into-quotes-and-escape your strings and correctly format your dates, which is not so easy to do right.
(DAO also allows you to execute INSERT statements (via CurrentDb.Execute), but it does not allow you to use parameters.)
That said, ADO also supports the AddNew syntax familiar to you. This is a bit less elegant but requires fewer changes to your existing code.
And, finally, your old DAO code will still work. As always: If you think you have a performance problem, measure if you really have one. Clean code is great, but refactoring has a cost and it makes sense to optimize those places first where it really matters. Test, measure... then optimize.
It seems like it is inefficient to have to use a sql statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun will this take forever to run?
Yes, you do need to load something from the table in order to get your Recordset, but you don't have to load any actual data.
Just add a WHERE clause to the query that doesn't return anything, like this:
Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun WHERE 1=0")
Both INSERT statements and Recordsets have their pros and cons.
With INSERTs, you can insert many records with relatively little code, as shown in Monty Wild's answer.
On the other hand, INSERTs in the basic form shown there are prone to SQL Injection and you need to take care of "illegal" characters like ' inside your values, ideally by using parameters.
With a Recordset, you obviously need to type more code to insert a record, as shown in your question.
But in exchange, a Recordset does some of the work for you:
For example, in the line rsToRun("MemNum").Value = memNumTextEntry.Value you don't have to care about:
characters like ' in the input, which would break an INSERT query unless you use parameters
SQL Injection
getting the date format right when inserting date/time values
Got about a 400 MB .txt file here that is delimited by '|'. Using a Windows Form with C#, I'm inserting each row of the .txt file into a table in my SQL server database.
What I'm doing is simply this (shortened by "..." for brevity):
while ((line = file.ReadLine()) != null)
{
    string[] split = line.Split(new Char[] { '|' });
    SqlCommand cmd = new SqlCommand("INSERT INTO NEW_AnnualData VALUES (@YR1984, @YR1985, ..., @YR2012)", myconn);
    cmd.Parameters.AddWithValue("@YR1984", split[0]);
    cmd.Parameters.AddWithValue("@YR1985", split[1]);
    ...
    cmd.Parameters.AddWithValue("@YR2012", split[28]);
    cmd.ExecuteNonQuery();
}
Now, this is working, but it is taking a while. This is my first time doing anything with a huge amount of data, so I need to make sure that A) I'm doing this in an efficient manner, and that B) my expectations aren't too high.
Using a SELECT COUNT() while the loop is going, I can watch the number go up and up over time. So I used a clock and some basic math to figure out the speed that things are working. In 60 seconds, there were 73881 inserts. That's 1231 inserts per second. The question is, is this an average speed, or am I getting poor performance? If the latter, what can I do to improve the performance?
I did read something about SSIS being efficient for exactly this purpose. However, I need this action to come from clicking a button in a Windows Form, not going through SSIS.
Oooh - that approach is going to give you appalling performance. Try using BULK INSERT, as follows:
BULK INSERT MyTable
FROM 'e:\orders\lineitem.tbl'
WITH
(
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
)
This is the best solution in terms of performance. There is a drawback, in that the file must be present on the database server. There are two workarounds for this that I've used in the past, if you don't have access to the server's file system from where you're running the process. One is to install an instance of SQL Express on the workstation, add the main server as a linked server to the workstation instance, and then run "BULK INSERT MyServer.MyDatabase.dbo.MyTable...". The other option is to reformat the CSV file as XML (which can be done very quickly), pass the XML to the server in a query, and process it using OPENXML. Both BULK INSERT and OPENXML are well documented on MSDN, and you'd do well to read through the examples.
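Since the requirement is to trigger the load from a button in the Windows Form, note that the BULK INSERT statement itself can simply be sent from C#. A rough sketch (the connection string is a placeholder, and the file path is interpreted on the database server, not on the client):

using System.Data.SqlClient;

static void RunBulkInsert(string connectionString)
{
    const string sql = @"
        BULK INSERT MyTable
        FROM 'e:\orders\lineitem.tbl'
        WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        conn.Open();
        cmd.CommandTimeout = 0;   // large loads can exceed the default 30-second timeout
        cmd.ExecuteNonQuery();
    }
}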
Have a look at SqlBulkCopy on MSDN, or the nice blog post here. For me that goes up to tens of thousands of inserts per second.
I'd have to agree with Andomar. I really quite like SqlBulkCopy. It is really fast (you need to play around with BatchSize to make sure you find one that suits your situation).
For a really in-depth article discussing the various options, check out Microsoft's "Data Loading Performance Guide":
http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
Also, take a look at the C# example of using SqlBulkCopy with CSV Reader. It isn't free, but if you can write a fast and accurate parser in less time, then go for it. At least it'll give you some ideas.
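To make the SqlBulkCopy route concrete for this pipe-delimited file, here is a rough sketch (the column layout and table name are taken from the question; a real CSV parser would handle quoting and malformed rows better, and a very large file could be pushed in chunks rather than as one DataTable):

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void BulkLoad(string connectionString, string path)
{
    // Build a DataTable with the 29 year columns from the question.
    var table = new DataTable();
    for (int year = 1984; year <= 2012; year++)
        table.Columns.Add("YR" + year, typeof(string));

    foreach (var line in File.ReadLines(path))
        table.Rows.Add(line.Split('|'));        // one row per '|'-delimited line

    using (var conn = new SqlConnection(connectionString))
    using (var bulk = new SqlBulkCopy(conn))
    {
        conn.Open();
        bulk.DestinationTableName = "NEW_AnnualData";
        bulk.BatchSize = 10000;                 // worth tuning, as noted above
        bulk.BulkCopyTimeout = 0;               // no timeout for large loads
        bulk.WriteToServer(table);
    }
}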
I have found SSIS to be much faster than this type of method, but there are a bunch of variables that can affect performance.
If you want to experiment with SSIS, use the Import and Export Wizard in Management Studio to generate an SSIS package that will import a pipe-delimited file. You can save the package and run it from a .NET application.
See this article for info on how to run an SSIS package programmatically: http://blogs.msdn.com/b/michen/archive/2007/03/22/running-ssis-package-programmatically.aspx It includes options on how to run it from the client, from the server, or wherever.
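As a minimal illustration of the "run from the client" option (this uses the SSIS managed runtime, so the Microsoft.SqlServer.ManagedDTS assembly must be referenced; the package path is a placeholder):

using System;
using Microsoft.SqlServer.Dts.Runtime;

static void RunImportPackage()
{
    var app = new Application();                               // SSIS runtime application
    Package package = app.LoadPackage(@"C:\packages\ImportPipeFile.dtsx", null);
    DTSExecResult result = package.Execute();                  // runs on the client machine

    if (result != DTSExecResult.Success)
        throw new Exception("SSIS package failed: " + result);
}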
Also, take a look at this article for additional ways you can improve bulk insert performance in general. http://msdn.microsoft.com/en-us/library/ms190421.aspx
I am using ODP.NET (version 2.111.7.0) and C#, with OracleCommand & OracleParameter objects and the OracleCommand.ExecuteNonQuery method.
I was wondering if there is a way to insert a big byte array into an Oracle table that resides in another database, through a DB link. I know that LOB handling through DB links is problematic in general, but I am a bit hesitant to modify code and add another connection.
Will creating a stored procedure that takes a BLOB as a parameter and talks internally via the DB link make any difference? I don't think so...
My current situation is that Oracle gives me "ORA-22992: cannot use LOB locators selected from remote tables" whenever the parameter I pass with the OracleCommand is a byte array with length 0, or with length > 32 KB (I suspect, because 20 KB worked and 35 KB didn't).
I am using OracleDbType.Blob for this parameter.
Thank you.
Any ideas?
I ended up using a second connection, synchronizing the two transactions so that commits and rollbacks are always performed jointly. I also ended up believing that there is a general issue with handling BLOBs through a DB link, so a second connection was the better choice, even though the design of my application was slightly disrupted - I needed to introduce both a second connection and a second transaction. The other option was to insert the BLOB in chunks of 32 KB, using PL/SQL blocks and some form of DBMS_LOB.WRITEAPPEND, but that would have required deeper changes to my code (in my case), so I opted for the easier and more straightforward first solution.
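For anyone hitting the same ORA-22992 issue, a rough sketch of the two-connection approach described above (ODP.NET; the table, column, and connection names are made up, and this is not a true distributed transaction - it merely keeps the two commits and rollbacks together):

using System;
using Oracle.DataAccess.Client;

static void SaveBlob(string localConnStr, string remoteConnStr, byte[] payload)
{
    using (var local = new OracleConnection(localConnStr))
    using (var remote = new OracleConnection(remoteConnStr))
    {
        local.Open();
        remote.Open();
        OracleTransaction localTx = local.BeginTransaction();
        OracleTransaction remoteTx = remote.BeginTransaction();
        try
        {
            // ... local work on the 'local' connection goes here ...

            // The BLOB goes over a direct connection to the remote database, so no DB link is involved.
            using (var cmd = new OracleCommand(
                "INSERT INTO remote_table (id, data) VALUES (:id, :data)", remote))
            {
                cmd.Parameters.Add("id", OracleDbType.Int32).Value = 1;
                cmd.Parameters.Add("data", OracleDbType.Blob).Value = payload;
                cmd.ExecuteNonQuery();
            }

            localTx.Commit();
            remoteTx.Commit();
        }
        catch
        {
            localTx.Rollback();
            remoteTx.Rollback();
            throw;
        }
    }
}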
I need to populate a database with thousands of entries on a daily basis, but my code at the moment manually inserts each one into the database one at a time.
Do While lngSQLLoop < lngCurrentRecord
    lngSQLLoop = lngSQLLoop + 1
    sql = "INSERT INTO db (key1, key2) VALUES ('value1', 'value2');"
    result = bInsertIntoDatabase(sql, True)
    If result = False Then lngFailed = lngFailed + 1
Loop
This works, but takes about 5 seconds for each 100 entries. Would there be a more efficient way to put this into the database? I've tried
INSERT INTO db (key1, key2) VALUES ('value1-1', 'value2-1'), ('value1-2', 'value2-2'), ('value1-3', 'value2-3');
but this fails with a "missing semicolon (;)" error, suggesting it doesn't like the values being listed like that. Is there a way to do this from VBA?
The use of multiple (), () clauses only works with SQL Server 2008.
But you're in luck: you can batch these by simply concatenating your SQL statements and batching the calls to bInsertIntoDatabase.
The only down side to this approach is that if one statement in the batch fails, so will every subsequent statement in the batch.
So, if failure is a regular issue (say, from key collisions), you would need to use another approach. One solution is to:
Insert batches into a temporary table first (without unique indexes, thus avoiding failures initially)
Do a final insert into the main table with a WHERE clause that prevents an error
Get the result count and subtract from the total number of records in the temporary table to get the number of failures.
If the source of your data can be accessed via a database driver (like ODBC) and your database framework supports heterogeneous queries, you should be able to do:
INSERT INTO targetDBtable (key1, key2)
SELECT key1, key2 FROM sourceDBtable;
Using .AddNew and .Update with an updateable recordset seems fast: takes about 0.25 seconds to add 10000 records with no errors, or 1.25 seconds to add 10000 records with 10000 errors, on my system.
Save the data to a CSV file first and then use Access' TransferText method (of the DoCmd object) to load it into the Access table in one go. Remember to delete the CSV file afterwards.
Even if you're running the code from Excel, you can still execute the TransferText method in Access via Automation.