I need to upload a lot of data to a MySQL db. For most models I use django's ORM, but one of my models will have billions (!) of instances and I would like to optimize its insert operation.
I can't seem to find a way to make executemany() work, and after googling it seems there are almost no examples out there.
I'm looking for the correct sql syntax + correct command syntax + correct values data structure to support an executemany command for the following sql statement:
INSERT INTO `some_table` (`int_column1`, `float_column2`, `string_column3`, `datetime_column4`) VALUES (%d, %f, %s, %s)
Yes, I'm explicitly stating the id (int_column1) for efficiency.
A short example code would be great
Here's a solution that actually uses executemany() !
Basically the idea in the example here will work.
But note that in Django, you need to use the %s placeholder rather than the question mark.
Also, you will want to manage your transactions. I'll not get into that here as there is plenty of documentation available.
from django.db import connection,transaction
cursor = connection.cursor()
query = ''' INSERT INTO table_name
(var1,var2,var3)
VALUES (%s,%s,%s) '''
query_list = build_query_list()
# here build_query_list() represents some function to populate
# the list with multiple records
# in the tuple format (value1, value2, value3).
cursor.executemany(query, query_list)
transaction.commit()
are you serisouly suggesting loading billions of rows (sorry instances) of data via some ORM data access layer - how long do you have ?
bulk load if possible - http://dev.mysql.com/doc/refman/5.1/en/load-data.html
If you need to modify the data, bulk load with load data into a temporary table as is. Then apply modifications with an insert into select command. IME, this is by far the fastest way to get a lot of data into a table.
I'm not sure how to use the executemany() command, but you can use a single SQL INSERT statement to insert multiple records
Related
I am working on developing an application for my company. From the beginning we were planning on having a split DB with an access front end, and storing the back end data on our shared server. However, after doing some research we realized that storing the data in a back end access DB on a shared drive isn’t the best idea for many reasons (vpn is so slow to shared drive from remote offices, access might not be the best with millions of records, etc.). Anyways, we decided to still use the access front end, but host the data on our SQL server.
I have a couple questions about storing data on our SQL server. Right now when I insert a record I do it with something like this:
Private Sub addButton_Click()
Dim rsToRun As DAO.Recordset
Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun")
rsToRun.AddNew
rsToRun("MemNum").Value = memNumTextEntry.Value
rsToRun.Update
memNumTextEntry.Value = Null
End Sub
It seems like it is inefficient to have to use a sql statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun will this take forever to run? Would it be more efficient just to use an insert statement? If so, how do you do it? Our program is still young in development so we can easily make pretty substantial changes. Nobody on my team is an access or SQL expert so any help is really appreciated.
If you're working with SQL Server, use ADO. It handles server access much better than DAO.
If you are inserting data into a SQL Server table, an INSERT statement can have (in SQL 2008) up to 1000 comma-separated VALUES groups. You therefore need only one INSERT for each 1000 records. You can just append additional inserts after the first, and do your entire data transfer through one string:
INSERT INTO ToRun (MemNum) VALUES ('abc'),('def'),...,('xyz');
INSERT INTO ToRun (MemNum) VALUES ('abcd'),('efgh'),...,('wxyz');
...
You can assemble this in a string, then use an ADO Connection.Execute to do the work. It is frequently faster than multiple DAO or ADO .AddNew/.Update pairs. You just need to remember to requery your recordset afterwards if you need it to be populated with your newly-inserted data.
There are actually two questions in your post:
Will OpenRecordset("SELECT * FROM ToRun") immediately load all recordsets?
No. By default, DAO's OpenRecordset opens a server-side cursor, so the data is not retrieved until you actually start to move around the recordset. Still, it's bad practice to select lots of rows if you don't need to. This leads to the next question:
How should I add records in an attached SQL Server database?
There are a few ways to do that (in order of preference):
Use an INSERT statment. That's the most elegant and direct solution: You want to insert something, so you execute INSERT, not SELECT and AddNew. As Monty Wild explained in his answer, ADO is prefered. In particular, ADO allows you to use parameterized commands, which means that you don't have to put-into-quotes-and-escape your strings and correctly format your dates, which is not so easy to do right.
(DAO also allows you to execute INSERT statements (via CurrentDb.Execute), but it does not allow you to use parameters.)
That said, ADO also supports the AddNew syntax familiar to you. This is a bit less elegant but requires less changes to your existing code.
And, finally, your old DAO code will still work. As always: If you think you have a performance problem, measure if you really have one. Clean code is great, but refactoring has a cost and it makes sense to optimize those places first where it really matters. Test, measure... then optimize.
It seems like it is inefficient to have to use a sql statement like SELECT * FROM ToRun and then make a recordset, add to the recordset, and update it. If there are millions of records in ToRun will this take forever to run?
Yes, you do need to load something from the table in order to get your Recordset, but you don't have to load any actual data.
Just add a WHERE clause to the query that doesn't return anything, like this:
Set rsToRun = CurrentDb.OpenRecordset("SELECT * FROM ToRun WHERE 1=0")
Both INSERT statements and Recordsets have their pros and cons.
With INSERTs, you can insert many records with relatively little code, as shown in Monty Wild's answer.
On the other hand, INSERTs in the basic form shown there are prone to SQL Injection and you need to take care of "illegal" characters like ' inside your values, ideally by using parameters.
With a Recordset, you obviously need to type more code to insert a record, as shown in your question.
But in exchange, a Recordset does some of the work for you:
For example, in the line rsToRun("MemNum").Value = memNumTextEntry.Value you don't have to care about:
characters like ' in the input, which would break an INSERT query unless you use parameters
SQL Injection
getting the date format right when inserting date/time values
I was wondering if there was a way that I can do the folloing in one sql statement.
I am parsing an csv file for product insertion into the database. is there a way that I can check to see if an entry in a table with X equaling to N, if so update the rest f the columns else insert it?
regards
Phil
This type of operation is sometimes called an "UPSERT". The SQL standard way to do it is to use the MERGE statement, but unfortunately it is not widely implemented yet.
Some databases have added their own ways to do it, such as the non-standard MySQL extensions REPLACE and INSERT ... ON DUPLICATE KEY UPDATE.
The MERGE statement will get you there, but it's only available on SQL Server 2008 R2. You would otherwise need to treat this as a Slowly Changing Dimension and either bring the data into a temp table and compare or use an integration services package to do the work.
Would something like this work for you?
if(update myTable where x='n')
else{
insert into myTable(x,y,z) values(1,2,3)
}
The update query will run regardless, and will return false if it couldn't update that record, causing it to do the insert.
The only thing I don't have an automated tool for when working with Oracle is a program that can create INSERT INTO scripts.
I don't desperately need it so I'm not going to spend money on it. I'm just wondering if there is anything out there that can be used to generate INSERT INTO scripts given an existing database without spending lots of money.
I've searched through Oracle with no luck in finding such a feature.
It exists in PL/SQL Developer, but errors for BLOB fields.
Oracle's free SQL Developer will do this:
http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html
You just find your table, right-click on it and choose Export Data->Insert
This will give you a file with your insert statements. You can also export the data in SQL Loader format as well.
You can do that in PL/SQL Developer v10.
1. Click on Table that you want to generate script for.
2. Click Export data.
3. Check if table is selected that you want to export data for.
4. Click on SQL inserts tab.
5. Add where clause if you don't need the whole table.
6. Select file where you will find your SQL script.
7. Click export.
Use a SQL function (I'm the author):
https://github.com/teopost/oracle-scripts/blob/master/fn_gen_inserts.sql
Usage:
select fn_gen_inserts('select * from tablename', 'p_new_owner_name', 'p_new_table_name')
from dual;
where:
p_sql – dynamic query which will be used to export metadata rows
p_new_owner_name – owner name which will be used for generated INSERT
p_new_table_name – table name which will be used for generated INSERT
p_sql in this sample is 'select * from tablename'
You can find original source code here:
http://dbaora.com/oracle-generate-rows-as-insert-statements-from-table-view-using-plsql/
Ashish Kumar's script generates individually usable insert statements instead of a SQL block, but supports fewer datatypes.
I have been searching for a solution for this and found it today. Here is how you can do it.
Open Oracle SQL Developer Query Builder
Run the query
Right click on result set and export
http://i.stack.imgur.com/lJp9P.png
You might execute something like this in the database:
select "insert into targettable(field1, field2, ...) values(" || field1 || ", " || field2 || ... || ");"
from targettable;
Something more sophisticated is here.
If you have an empty table the Export method won't work. As a workaround. I used the Table View of Oracle SQL Developer. and clicked on Columns. Sorted by Nullable so NO was on top. And then selected these non nullable values using shift + select for the range.
This allowed me to do one base insert. So that Export could prepare a proper all columns insert.
If you have to load a lot of data into tables on a regular basis, check out SQL Loader or external tables. Should be much faster than individual Inserts.
You can also use MyGeneration (free tool) to write your own sql generated scripts. There is a "insert into" script for SQL Server included in MyGeneration, which can be easily changed to run under Oracle.
I am trying to archive some of my tables into another database on the same server. However the INSERT INTO...SELECT...FROM gives me an error (SQLSTATE=42704) on build. The table exists in the second database.
Can anyone help with this?
It's not clear from your question what version of DB2 is being used. I'll presume that it's the Linux, Unix & Windows version. You look to be using federation to link the two databases.
Does the SELECT part of your query work from LS2DB001? It's worth trying to pin down which database you have the issue with.
Presuming that the problem is on LS2DB001, if the user you have defined the federated link with has permissions on the base tables in the query, check also that they have permissions on the system catalog tables. If not, they would not be able to parse and validate that you can run the query.
We've cracked it! If the following script is used then it works. The LOAD works without having to COMMIT in between batches of rows copied. ('Transaction Log full...' error problem is also solved)
CONNECT TO LS2DB001;
EXPORT TO "C:\temp\TIN_TRIGGER_OUT.IXF" OF IXF
MESSAGES "C:\temp\TIN_TRIGGER_OUT.EXM"
SELECT * FROM LS2USER.TIN_TRIGGER_OUT;
CONNECT RESET;
CONNECT TO LQIFCOLD;
LOAD FROM "C:\temp\TIN_TRIGGER_OUT.IXF" OF IXF
MESSAGES "C:\temp\TIN_TRIGGER_OUT.IMM"
INSERT INTO LS2USER.TIN_TRIGGER_OUT COPY NO INDEXING MODE AUTOSELECT;
COMMIT;
CONNECT RESET;
I found this on http://www.connx.com/products/connx/Connx%208.6%20UserGuide/CONNXCDD32D/DB2_SQL_States.htm:
42704 Undefined object or constraint name. Revise SQL syntax and retry.
For more help try to be more specific, eg paste the full sql statement, the table scheme etc.
You can do
Select 'insert into tblxxxx (blabla,blabal) values(' + fld1 + ',' + fld2 + ',' ...... + ')'
From tblxxxxxx
copy the result as a text script and execute it in the other DB.
The best way to do this would be to create a custom script. Depending on the size of the tables (how many records) you could either do a select of all of the data into memory and then roll over them inserting them into a copy of the table you create first, or you could export the data out as a csv file or some other text based file and then roll over that to insert the data into the other table.
If you do not have some sort of formal backup procedures that could do this already, this would be your best bet.
Note: some db2 databases, such as those on an iSeries do not actually have "databases", they have libraries. With the right user profile you can access two libraries at the same time, joining tables from them together or doing a
create table library/newFilename as
(select * from originallibrary/originalfilename) with data
But this only applies to the iSeries I believe.
I'm writing this response as another answer so I have more space.
I can only suggest breaking the steps down to their components, and working through to see where the error is occuring. Again, I'm assuming you're using federation:
a) In your FROM db, connecting as the user you're using for the federated link, does your select work?
b) In your TO db, using the link, does the select work?
c) In your TO db, using the link via a stored proc, does the select work?
d) In your TO db, using an INSERT...values(x,y,z), can you insert into the table?
e) In your TO db, via a stored proc, using INSERT...values(x,y,z), can you insert?
Without more information, this is the best line of attack I can suggest.
I've been attempting to write embedded SQL statements for DB2 that ultimately gets compiled in C.
I couldn't find a tutorial or manual on the embedded SQL syntax for C for reference. One case I would like to do is to insert data into a table. I know most embedded sql statements need the initalizer EXEC SQL, but that's the extent of my knowledge generally. I'm doing this for an assignment and would appreciate if there are more information regarding this or solution.
Example of a statement to query the database:
EXEC SQL SELECT SNAME, AGE into :sname, :sage
FROM ONE.SAILOR
WHERE sid = :sid;
I like to see what statement allows me to INSERT into the database. I've tried something like the following, but it doesn't work.
EXEC SQL INSERT ....
See IBM's Embedded SQL manual.
Embedded SQL is largely the same no matter what the host language is.
The four dots aren't syntactically valid :-D
The reliable way is the same as with any other INSERT statement: list the columns and the values.
EXEC SQL INSERT INTO SomeTable(Col1, Col2, Col3) VALUES(:hv1, :hv2, :hv3);
Here, the :hv1, :hv2 and :hv3 represent three host variables of types appropriate to the columns in the table. Note that the table could contain other columns than these three as long as those columns have a default specified or accept NULL (which is really just a default default in this case). The unreliable way does not list the columns:
EXEC SQL INSERT INTO SomeTable VALUES(:hv1, :hv2, :hv3);
Now you are dependent on getting the sequence right, and you must provide a value for each column -- there cannot be extra columns in SomeTable.
I just started using sqllite. Besides the good documentation for C++, SQLlist might be a nice thing to have because you can unit-test your code without being dependent on DB2 and it's really easy add with your code.