Speed up Python executemany - sql

I'm inserting data from one database to another, so I have 2 connections (Conn1 and Conn2). Below is the code (using pypyodbc).
import pypyodbc
Conn1_Query = "SELECT column FROM Table"
Conn1_Cursor.execute(Conn1_Query)
Conn1_Data = Conn1_Cursor.fetchall()
Conn1_array = []
for row in Conn1_Data:
    Conn1_array.append(row)
The above part runs very quickly.
stmt = "INSERT INTO TABLE(column) values (?)"
Conn2_Cursor.executemany(stmt, Conn1_array)
Conn2.commit()
This part is extremely slow. I've also tried to do a for loop to insert each row at a time using cursor.execute, but that is also very slow. What am I doing wrong and is there anything I can do to speed it up? Thanks for taking a look.
Thought I should also add that the Conn1 data is only ~50k rows. I also have some more setup code at the beginning that I didn't include because it's not pertinent to the question. It takes about 15 minutes to insert. As a comparison, it takes about 25 seconds to write the output to a csv file.

Yes, executemany under pypyodbc sends separate INSERT statements for each row. It acts just the same as making individual execute calls in a loop. Given that pypyodbc is no longer under active development, that is unlikely to change.
However, if you are using a compatible driver like "ODBC Driver xx for SQL Server" and you switch to pyodbc then you can use its fast_executemany option to speed up the inserts significantly. See this answer for more details.
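For example, here is a minimal sketch of the pyodbc version. The connection strings, driver name, and the tuple conversion are placeholders and assumptions, not taken from the question:
import pyodbc

# Placeholder connection strings - adjust driver, server, database and authentication.
src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=Server1;DATABASE=Db1;Trusted_Connection=yes;")
dst = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=Server2;DATABASE=Db2;Trusted_Connection=yes;")

rows = src.cursor().execute("SELECT column FROM Table").fetchall()

dst_cursor = dst.cursor()
# fast_executemany packs the whole parameter array into bulk ODBC calls
# instead of doing one round trip per row.
dst_cursor.fast_executemany = True
dst_cursor.executemany("INSERT INTO TABLE(column) VALUES (?)", [tuple(r) for r in rows])
dst.commit()
For ~50k single-column rows this typically brings the insert down from minutes to a few seconds.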

Related

How does query execution on SQL Server from .NET differ from Management Studio?

I investigated a problem when running a certain set of searches (from a .NET 3.5 application) against a full-text search DB on SQL Server 2008 R2. Using Profiler I extracted the long-running query (120 seconds until the command timeout was reached) and ran it in SQL Server Management Studio. Duration was "0 seconds" and, depending on which one I tried, 0 to 6 rows were returned.
The query looks like follows:
exec sp_executesql
N'SELECT TOP 1000 [DBNAME].[dbo].[FTSTABLE].[ID] AS [Id], [DBNAME].[dbo].[FTSTABLE].[Title], [DBNAME].[dbo].[FTSTABLE].[FirstName], [ABOUT 20 OTHERS]
FROM [DBNAME].[dbo].[FTSTABLE]
WHERE ( (
( Contains(([DBNAME].[dbo].[FTSTABLE].[Title], [DBNAME].[dbo].[FTSTABLE].[FirstName], [ABOUT 10 OTHERS]), @FieldsList1))
AND ( Contains(([DBNAME].[dbo].[FTSTABLE].[Title], [DBNAME].[dbo].[FTSTABLE].[FirstName], [ABOUT 10 OTHERS]), @FieldsList2))
AND ( Contains(([DBNAME].[dbo].[FTSTABLE].[Title], [DBNAME].[dbo].[FTSTABLE].[FirstName], [ABOUT 10 OTHERS]), @FieldsList3))
))'
,N'@FieldsList1 nvarchar(10),@FieldsList2 nvarchar(10),@FieldsList3 nvarchar(16)'
,@FieldsList1=N'"SomeString1*"'
,@FieldsList2=N'"SomeString2*"'
,@FieldsList3=N'"SomeString3*"'
The query looks a little weird as it is generated from an OR mapper, but right now I don't want to optimize the query: in SSMS it runs in less than one second, which shows that the query itself is not really the problem.
I wrote a small test program:
SqlConnection conn = new SqlConnection("EXACTSAMECONNECTIONSTRING_USING_SAME_USER_ETC");
conn.Open();
SqlCommand command = conn.CreateCommand();
command.CommandText = "EXACTLY SAME STRING, LITERALLY, AS ABOVE IN SSMS - exec sp_executesql.....";
command.CommandTimeout = 120;
var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine(reader[0]);
}
From my local PC I also got a SqlException after 120 seconds when the command timeout was exceeded.
The SQL Server was never under more than a few percent load, and there were no blocks on that table at any time during my tests.
I solved it after some time: I reduced the TOP 1000 to TOP 200, and suddenly the query from the .NET code also executed in less than a second.
The questions I have:
Why in general is there such a huge difference between SSMS and simplest SQLCommand .NET code?
Why did reducing to TOP 200 have any effect, especially considering there were at most 6 rows in the result?
This is tied to how query plans are built. When you run it in SSMS, you probably replace the variables manually, so it's not the same.
You can read a full explanation here : http://www.sommarskog.se/query-plan-mysteries.html
Edit: maybe start with the paragraph "The Default Settings" and look at the results when manually enabling or disabling ARITHABORT. This is the most common cause.
So the preliminary answer (not yet fully verified due to its complexity) can be derived from Keorl's answer, or mostly from the link provided therein.
To describe the different symptoms, I'll explain what happens:
SQL Server cached the query against the full-text indexed table, including its execution plan. This means that if the first query to run (the one that puts the plan into the cache) is a very rare query with an absurd execution plan, that plan is cached and used for all subsequent queries, ruining performance for most runs.
One thing I could reproduce in the end: rerunning the full-text indexer/gatherer solved the problem (this time). Here too the explanation is simple: an index update throws away the precompiled/cached plans, so a more typical query could be the first to run afterwards and store a much better overall plan in the cache.
Answer to Q1: Why in general is there such a huge difference between SSMS and simplest SQLCommand .NET code?
So why didn't this happen with SSMS? This too can be extracted from Keorl's answer: SSMS sets the ARITHABORT option, so the query gets its own freshly compiled plan, which is cached separately. Hence the different observations for the same query when using SSMS and code.
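One way to confirm this is to look at the SET options each cached plan was compiled under. A rough diagnostic sketch, here in Python with pyodbc and a placeholder connection string (the inner query is plain DMV T-SQL and works from any client, including SSMS):
import pyodbc

# Placeholder connection string - this only reads plan-cache metadata.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;DATABASE=DBNAME;Trusted_Connection=yes;")
cur = conn.cursor()

# Two entries with the same statement text but different set_options values are the
# SSMS-vs-application symptom: each SET combination (e.g. ARITHABORT on/off) gets its own plan.
cur.execute("""
    SELECT st.text, pa.value AS set_options, qs.execution_count
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    CROSS APPLY sys.dm_exec_plan_attributes(qs.plan_handle) AS pa
    WHERE pa.attribute = 'set_options'
      AND st.text LIKE '%FTSTABLE%'
""")
for text, set_options, execution_count in cur.fetchall():
    print(set_options, execution_count, text[:80])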
Answer to Q2: Why did reducing to TOP 200 have any effect, especially considering there were max 6 rows in the result?
For dynamic SQL as used in the example above, the cache is keyed on a hash of the complete query text. Because the text differs between TOP 200 and TOP 1000, two different plans are compiled and cached. Parameters are not part of the hash, though, so queries that differ only in their parameter values still reuse the same cache entry.
Concluding this: Thanks Keorl for providing the means to find an answer.

Moving from Access back end to SQL Server as back end. Efficiency help needed

I am working on developing an application for my company. From the beginning we were planning on having a split DB with an Access front end, storing the back-end data on our shared server. However, after doing some research we realized that storing the data in a back-end Access DB on a shared drive isn't the best idea for many reasons (the VPN is very slow to the shared drive from remote offices, Access might not be the best with millions of records, etc.). Anyway, we decided to still use the Access front end, but host the data on our SQL Server.
I have a couple questions about storing data on our SQL server. Right now when I insert a record I do it with something like this:
Private Sub addButton_Click()
    Dim rsToRun As DAO.Recordset
    Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun")
    rsToRun.AddNew
    rsToRun("MemNum").Value = memNumTextEntry.Value
    rsToRun.Update
    memNumTextEntry.Value = Null
End Sub
It seems like it is inefficient to have to use an SQL statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun, will this take forever to run? Would it be more efficient just to use an INSERT statement? If so, how do you do it? Our program is still young in development, so we can easily make pretty substantial changes. Nobody on my team is an Access or SQL expert, so any help is really appreciated.
If you're working with SQL Server, use ADO. It handles server access much better than DAO.
If you are inserting data into a SQL Server table, an INSERT statement can have (in SQL 2008) up to 1000 comma-separated VALUES groups. You therefore need only one INSERT for each 1000 records. You can just append additional inserts after the first, and do your entire data transfer through one string:
INSERT INTO ToRun (MemNum) VALUES ('abc'),('def'),...,('xyz');
INSERT INTO ToRun (MemNum) VALUES ('abcd'),('efgh'),...,('wxyz');
...
You can assemble this in a string, then use an ADO Connection.Execute to do the work. It is frequently faster than multiple DAO or ADO .AddNew/.Update pairs. You just need to remember to requery your recordset afterwards if you need it to be populated with your newly-inserted data.
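Purely as an illustration of the batching idea (sketched in Python rather than the VBA/ADO used in this thread, with made-up names), the statements can be assembled in chunks of up to 1000 VALUES groups each. Note the warning in the next answer: concatenating values like this needs careful quoting, so prefer parameters for anything user-supplied.
# Hypothetical helper: build one INSERT per 1000 values.
def build_batched_inserts(values, batch_size=1000):
    statements = []
    for i in range(0, len(values), batch_size):
        chunk = values[i:i + batch_size]
        # Escape embedded single quotes by doubling them.
        groups = ",".join("('{0}')".format(v.replace("'", "''")) for v in chunk)
        statements.append("INSERT INTO ToRun (MemNum) VALUES {0};".format(groups))
    return statements

# 2500 member numbers become three INSERTs (1000 + 1000 + 500 rows).
print(len(build_batched_inserts(["M{0}".format(n) for n in range(2500)])))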
There are actually two questions in your post:
Will OpenRecordset("SELECT * FROM ToRun") immediately load all records?
No. By default, DAO's OpenRecordset opens a server-side cursor, so the data is not retrieved until you actually start to move around the recordset. Still, it's bad practice to select lots of rows if you don't need to. This leads to the next question:
How should I add records in an attached SQL Server database?
There are a few ways to do that (in order of preference):
Use an INSERT statement. That's the most elegant and direct solution: you want to insert something, so you execute INSERT, not SELECT plus AddNew. As Monty Wild explained in his answer, ADO is preferred. In particular, ADO allows you to use parameterized commands, which means that you don't have to put your values into quotes, escape your strings, and correctly format your dates, which is not so easy to get right. (A minimal sketch of a parameterized INSERT follows this list.)
(DAO also allows you to execute INSERT statements (via CurrentDb.Execute), but it does not allow you to use parameters.)
That said, ADO also supports the AddNew syntax you are familiar with. This is a bit less elegant but requires fewer changes to your existing code.
And, finally, your old DAO code will still work. As always: If you think you have a performance problem, measure if you really have one. Clean code is great, but refactoring has a cost and it makes sense to optimize those places first where it really matters. Test, measure... then optimize.
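As a minimal cross-language sketch of what a parameterized INSERT looks like (Python with pyodbc and a placeholder connection string; in Access you would build the equivalent ADODB.Command with parameters): the driver takes care of quoting, escaping and date formatting, so the value is passed through as-is.
import pyodbc

# Placeholder connection string - adjust to your SQL Server.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;DATABASE=MyDb;Trusted_Connection=yes;")
cur = conn.cursor()

mem_num = "O'Brien-123"  # the embedded quote is handled by the driver, not by string concatenation
cur.execute("INSERT INTO ToRun (MemNum) VALUES (?)", mem_num)
conn.commit()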
It seems like it is inefficient to have to use an SQL statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun, will this take forever to run?
Yes, you do need to load something from the table in order to get your Recordset, but you don't have to load any actual data.
Just add a WHERE clause to the query that doesn't return anything, like this:
Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun WHERE 1=0")
Both INSERT statements and Recordsets have their pros and cons.
With INSERTs, you can insert many records with relatively little code, as shown in Monty Wild's answer.
On the other hand, INSERTs in the basic form shown there are prone to SQL Injection and you need to take care of "illegal" characters like ' inside your values, ideally by using parameters.
With a Recordset, you obviously need to type more code to insert a record, as shown in your question.
But in exchange, a Recordset does some of the work for you:
For example, in the line rsToRun("MemNum").Value = memNumTextEntry.Value you don't have to care about:
characters like ' in the input, which would break an INSERT query unless you use parameters
SQL Injection
getting the date format right when inserting date/time values

INSERTing data from a text file into SQL server (speed? method?)

Got about a 400 MB .txt file here that is delimited by '|'. Using a Windows Form with C#, I'm inserting each row of the .txt file into a table in my SQL server database.
What I'm doing is simply this (shortened by "..." for brevity):
while ((line = file.ReadLine()) != null)
{
    string[] split = line.Split(new Char[] { '|' });
    SqlCommand cmd = new SqlCommand("INSERT INTO NEW_AnnualData VALUES (@YR1984, @YR1985, ..., @YR2012)", myconn);
    cmd.Parameters.AddWithValue("@YR1984", split[0]);
    cmd.Parameters.AddWithValue("@YR1985", split[1]);
    ...
    cmd.Parameters.AddWithValue("@YR2012", split[28]);
    cmd.ExecuteNonQuery();
}
Now, this is working, but it is taking a while. This is my first time doing anything with a huge amount of data, so I need to make sure that A) I'm doing this in an efficient manner, and B) my expectations aren't too high.
Using a SELECT COUNT(*) while the loop is going, I can watch the number go up over time. So I used a clock and some basic math to work out the speed. In 60 seconds there were 73881 inserts, which is about 1231 inserts per second. The question is, is this an average speed, or am I getting poor performance? If the latter, what can I do to improve it?
I did read something about SSIS being efficient for exactly this purpose. However, I need this action to come from clicking a button in a Windows Form, not from going through SSIS.
Oooh - that approach is going to give you appalling performance. Try using BULK INSERT, as follows:
BULK INSERT MyTable
FROM 'e:\orders\lineitem.tbl'
WITH
(
FIELDTERMINATOR ='|',
ROWTERMINATOR ='\n'
)
This is the best solution in terms of performance. There is a drawback, in that the file must be present on the database server. There are two workarounds for this that I've used in the past, if you don't have access to the server's file system from where you're running the process. One is to install an instance of SQL Express on the workstation, add the main server as a linked server to the workstation instance, and then run "BULK INSERT MyServer.MyDatabase.dbo.MyTable...". The other option is to reformat the CSV file as XML, which can be processed very quickly, and then pass the XML to the server and process it using OPENXML. Both BULK INSERT and OPENXML are well documented on MSDN, and you'd do well to read through the examples.
Have a look at SqlBulkCopy on MSDN, or the nice blog post here. For me that goes up to tens of thousands of inserts per second.
I'd have to agree with Andomar. I really quite like SqlBulkCopy. It is really fast (you need to play around with BatchSize to make sure you find one that suits your situation).
For a really in-depth article discussing the various options, check out Microsoft's "Data Loading Performance Guide":
http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
Also, take a look at the C# example of using SqlBulkCopy with CSV Reader. It isn't free, but if you can write a fast and accurate parser in less time, then go for it. At least it'll give you some ideas.
I have found SSIS to be much faster than this type of method, but there are a bunch of variables that can affect performance.
If you want to experiment with SSIS, use the Import and Export Wizard in Management Studio to generate an SSIS package that will import a pipe-delimited file. You can save the package and run it from a .NET application.
See this article for info on how to run an SSIS package programmatically: http://blogs.msdn.com/b/michen/archive/2007/03/22/running-ssis-package-programmatically.aspx It includes options for running it from the client, from the server, or wherever.
Also, take a look at this article for additional ways you can improve bulk insert performance in general. http://msdn.microsoft.com/en-us/library/ms190421.aspx

is there a maximum number of inserts that can be run in a batch sql script?

I have a series of simple "INSERT INTO" type statements, but after running about 3 or 4 of them the script stops and I get empty sets when I try selecting from the appropriate tables. Aside from my specific code, I wonder whether there is an ideal way of running multiple INSERT-type queries.
Right now I just have a text file saved as a.sql with normal SQL commands separated by ";".
No, there is not. However, if it stops after 3 or 4 inserts, it's a good bet there's an error in the 3rd or 4th insert. Depending on which SQL engine you use, there are different ways of making it report errors during and after operations.
Additionally, if you have lots of inserts, it's a good idea to wrap them inside a transaction: this basically buffers all the insert commands until it sees the end command for the transaction, and then commits everything to your table. That way, if something goes wrong, your database doesn't get polluted with data that first needs to be deleted again. More importantly, every insert outside a transaction counts as a single transaction of its own, which makes them really slow; doing 100 inserts inside one transaction can be as fast as doing two or three standalone inserts.
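As a rough illustration of the transaction point (sketched in Python with pyodbc and a placeholder connection string; the same idea applies to whichever engine and client you use), run every statement from the script, commit once at the end, and let a failing INSERT surface as an error instead of silently stopping:
import pyodbc

# Placeholder connection string; "a.sql" is the script file mentioned in the question.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;DATABASE=MyDb;Trusted_Connection=yes;", autocommit=False)
cur = conn.cursor()

with open("a.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

try:
    for stmt in statements:
        cur.execute(stmt)   # a bad 3rd or 4th INSERT raises here instead of failing silently
    conn.commit()           # one commit for the whole batch - much faster than one implicit commit per INSERT
except pyodbc.Error:
    conn.rollback()         # nothing half-inserted is left behind
    raise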
Maximum Capacity Specifications for SQL Server
Max Batch size = 65,536 * Network Packet Size
However I doubt that Max Batch size is your problem.

Report on SQL/SSRS 2k5 takes > 10 minutes, query < 3 mins

We have SQL and SSRS 2k5 on a Win 2k3 virtual server with 4 GB on the virtual server. (The server running the virtual server has > 32 GB.)
When we run our comparison report, it calls a stored proc on database A. The proc pulls data from several tables, and from a view on database B.
If I run Profiler and monitor the calls, I see activity:
SQL:BatchStarting
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'),
       COLLATIONPROPERTY(CONVERT(char, DATABASEPROPERTYEX(DB_NAME(), 'collation')), 'LCID')
then it waits several minutes until the actual call of the proc shows up:
RPC:Completed
exec sp_executesql N'exec [procGetLicenseSales_ALS_Voucher]
    @CurrentLicenseYear, @CurrentStartDate, @CurrentEndDate,
    ''Fishing License'', @PreviousLicenseYear, @OpenLicenseAccounts',
    N'@CurrentStartDate datetime, @CurrentEndDate datetime,
      @CurrentLicenseYear int, @PreviousLicenseYear int,
      @OpenLicenseAccounts nvarchar(4000)',
    @CurrentStartDate='2010-11-01 00:00:00:000',
    @CurrentEndDate='2010-11-30 00:00:00:000',
    @CurrentLicenseYear=2010,
    @PreviousLicenseYear=2009,
    @OpenLicenseAccounts=NULL
then more time passes, and usually the report times out. It takes about 20 minutes if I let it run in the Designer.
This Report was working, albeit slowly but still less than 10 minutes, for months.
If I drop the query (captured from profiler) into SQL Server Management Studio, it takes 2 minutes, 8 seconds to run.
Database B just had some changes and data replicated to it (we only read from the data, all new data comes from nightly replication).
Something has obviously changed, but what change broke the report? How can I test to find out why the SSRS part is taking forever and timing out, but the query runs in about 2 minutes?
Added: Please note, the stored proc returns 18 rows... any time. (We only have 18 products to track.)
The report takes those 18 rows, and groups them and does some sums. No matrix, only one page, very simple.
M Kenyon II
Database B just had some changes and data replicated to it (we only read from the data, all new data comes from nightly replication).
Ensure that all indexes survived the changes to Database B. If they still exist, check how fragmented they are and reorganize or rebuild as necessary.
Indexes can have a huge impact on performance.
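For a quick look at fragmentation, something along these lines works (sketched in Python with pyodbc and a placeholder connection string; the inner query is standard DMV T-SQL you can also run directly in SSMS against Database B):
import pyodbc

# Placeholder connection string - point it at Database B.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;DATABASE=DatabaseB;Trusted_Connection=yes;")
cur = conn.cursor()

# List indexes more than 10% fragmented; reorganize (roughly 5-30%) or rebuild (>30%) as needed.
cur.execute("""
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE ips.avg_fragmentation_in_percent > 10
    ORDER BY ips.avg_fragmentation_in_percent DESC
""")
for table_name, index_name, frag in cur.fetchall():
    print(table_name, index_name, round(frag, 1))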
As far as the report taking far longer to run than your query, there can be many reasons for this. Some tricks for getting SSRS to run faster can be found here:
http://www.sqlservercentral.com/Forums/Topic859015-150-1.aspx
Edit:
Here's the relevant information from the link above.
AshMc
I recall some time ago we had the same issue where we were passing parameters from SSRS to a SQL dataset and it would slow everything down compared to doing it in SSMS (minutes compared to seconds, like your issue). It appeared that when SSRS passed in the parameter it was possibly recalculating the value rather than storing it once.
What I did was declare a new TSQL parameter first within the dataset and set it to equal the SSRS parameter and then use the new parameter like I would in SSMS.
eg:
DECLARE @X as int
SET @X = @SSRSParameter
janavarr
Thanks AshMc, this one worked for me. However, my issue now is that it will only work with a single parameter, and the query won't run if I want to pass multiple parameter values.
...
AshMc
I was able to find how I did this previously. I created a temp table, placed the values that we wanted to filter on in it, then did an inner join from the main query to it. We only use the SSRS parameters as a filter on what to put in the temp table.
This saved a lot of report run time.
DECLARE @ParameterList TABLE (ValueA varchar(20))

INSERT INTO @ParameterList
SELECT ValueA
FROM TableA
WHERE ValueA = @ValueB

-- then join the main query to the table variable:
INNER JOIN @ParameterList
    ON ValueC = ValueA
Hope this helps,
--Dubs
Could be parameter sniffing. If you've changed some data or some of the tables, then the cached plan that satisfied the sp for the old data model may not be valid any more.
Answered a very similar thing here:
stored procedure performance issue
Quote:
If you are sure that the SQL is exactly the same and that the params are the same, then you could be experiencing a parameter sniffing problem.
It's a pretty uncommon problem. I've only had it happen to me once and since then I've always coded away the problem.
Start here for a quick overview of the problem:
http://blogs.msdn.com/b/queryoptteam/archive/2006/03/31/565991.aspx
http://elegantcode.com/2008/05/17/sql-parameter-sniffing-and-what-to-do-about-it/
Try declaring some local variables inside the sp and assigning the values of the parameters to them. Then use the local variables in place of the params.
It's a feature, not a bug, but it makes you go #"$#