Better Import Performance on Large Data Sets

Better Import Performance on Large Data Sets - sql

I have a few jobs that insert large data sets from a text file. The data is loaded via .NET's SqlBulkCopy.
Currently, I load all the data into a temp table and then insert it into the production table. This was an improvement over straight importing into production. The T-SQL insert results query was a lot faster. Data is only loaded via this method, there is no other inserts or deletes.
However, I'm back to timeouts because of locks while the job is executing. The job consists of the following steps:
load data into temp table
start transaction
delete current and future dated rows
insert from temp table
commit
This happens once every hour. This portion takes 70 seconds. I need to get that to the smallest number possible.
The production table has about 20 million records and each import is about 70K rows. The table is not accessed at night, so I use this time to do all required maintenance (rebuild stats, index, etc.). Of the 70K, added, ~4K is kept from day-to-day - that is, the table grows by 4k a day.
I'm thinking a 2 part solution:
The job will turn into a copy/rename job. I insert all current data into the temp table, create stats & index, rename tables, drop old table.
Create a history table to break out older data. The "current" table would have a rolling 6 months data, about 990K records. This would make the delete/insert table smaller and [hopefully] more performant. I would prefer not to do this; the table is well designed with the perfect indexes; queries are plenty fast. But eventually it might be required.
Edit: Using Windows 2003, SQL Server 2008
Thoughts? Other suggestions?

Well one really sneaky way is to rename the current table as TableA and set up a second table with the same structure as TableB and the same data. Then set up a view with the same name and the exact fields in the TableA. Now all your existing code will use the view instead of the current table. The view starts out looking at TableA.
In your load process, load to TableB. Refresh the view defintion changing it to look at TableB. Your users are down for less than a second. Then load the same data to TableA and store which table you should start with somewhere in a database table. Next time load first to TableA and then change the view to point to TableA then reload TableB.

The answer should be that your queries that read from this table should READ UNCOMMITTED, since your data load is the only place that changes data. With READ UNCOMMITTED, the SELECT queries won't get locks.
http://msdn.microsoft.com/en-us/library/ms173763.aspx

You should look into using partitioned tables. The basic idea is that you can load all of your new data into a new table, then join that table to the original as a new partition. This is orders of magnitude faster than inserting into the current existing table.
Later on, you can merge multiple partitions into a single larger partition.
More information here: http://msdn.microsoft.com/en-us/library/ms191160.aspx

Get better hardware. Using 3 threads, 35.000 item batches I import around 90.000 items per second using this approach.
Sorry, but at a point hardware decides insert speed. Important: SSD for the logs, mirrored ;)

Another trick you could use is have a delta table for the updates. You'd have 2 tables with a view over them to merge them. One table, TableBig, will hold the old data, the second table, TableDelta, will hold deltas that you add to rows in tableBig.
You create a view over them that adds them up. A simple example:
For instance your old data in the TableBig (20M rows, lot's of indexes etc)
ID Col1 Col2
123 5 10
124 6 20
And you want to update 1 row and add a new one so that afterwards the table looks like this:
ID Col1 Col2
123 5 10 -- unchanged
124 8 30 -- this row is updated
125 9 60 -- this one added
Then in the TableDelta you insert these two rows:
ID Col1 Col2
124 2 10 -- these will be added to give the right number
125 9 60 -- this row is new
and the view is
select ID,
sum(col1) col1, -- the old value and delta added to give the correct value
sum(col2) col2
from (
select id, col1, col2 from TableBig
union all
select id, col1, col2 from TableDelta
)
group by ID
At night you can merge TableDelta into TableBig and index etc.
This way you can leave the big table alone completely during the day, and TableDelta will not have many rows so overall query perf is very good. Getting the data from BigTable benefits from inexing, getting rows from DeltaTable is no issue becuase it's small, summing them up is cheap compared to looking for data on disk. Pumping data into TableDelta is very cheap because you can just insert at the end.
Regards Gert-Jan
Update for text columns:
You could try somehting similar, with two tables, but instead of adding, you would substitute.Like this:
Select isnull(b.ID, o.ID) Id,
isnull(b.Col1, o.Col1) Col1
isnull(b.Col2, o.Col2) col2
From TableBig b
full join TableOverWrite o on b.ID = o.ID
The basic idea is the same: a big table with indexes and a small table for updates that doesn't need them.

Your solution seems sound to me. Another alternative to try would be to create the final data in a temp table, all outside of a transaction and then inside the transaction truncate the target table then load it from your temp table...something along those lines might be worth trying too.

Related

Is high read but absolutely no writing on SQLite query normal?

I've been running a query to add a column to a table by matching ID from another table. Both have about 600 million rows so its reasonable that this would take a while, but what is worrying me is that there are high read speeds (~500MB/s) on the disk but sqlite is writing 0B/s according to iotop. The filesize on my .db file has not changed in hours, but adding a column to a 600 million row table should change AT LEAST one byte, no?
Is this normal behavior from SQLite? The machine is pretty powerful, Ubuntu 16 on quad core i7 with 64GB RAM and NVMe SSD. The queries and table schema are below.
ALTER TABLE tableA ADD address TEXT;
UPDATE tableA SET address = (SELECT address FROM tableB WHERE tableA.ID = tx_out.ID);
Table schema:
CREATE TABLE tableA (
ID TEXT,
column1 INT,
column2 TEXT,
);
CREATE TABLE tx_out (
ID TEXT,
sequence INT,
address TEXT
);

Adding a column changes almost nothing on disk; a row with fewer values than the table has columns is assumed to have NULLs in the missing columns.
The UPDATE is extremely slow because the subquery has to scan through the entire tx_out table for every row in tableA.
You could speed it up greatly with an index on the tx_out.ID column.
When the database has to rewrite all rows anyway, and you have the disk space, it might be a better idea to create a new table:
INSERT INTO NewTable(ID, col1, col2, address)
SELECT ID, col1, col2, address
FROM tableA
JOIN tableB USING (ID); -- also needs an index to be fast

Too big for a comment
I've had this run for days with no change...I think it is prone to locking itself in one manner or another, I killed it after it's 3rd day of what appeared to be no changes. I've encountered very similar issues when attempting to add a new index as well, but that one successfully completed in 2 days before I hit my 3 day kill threshhold ;) Possible 3 days just wasn't enough?
My preference now is to create a second table that has the new column, load the table with the data from the old plus the new column, rename the old table to X_oldtablename, rename the new table to the table name. Run tests and drop x_oldtablename after you are confident the new table is working

updating a very large oracle table

I have a very large table people with 60M rows indexed on id, wish to populate a field newid for every record based on a look up table id_conversion (1M rows) which contains id and newid, indexed on id.
when I run
update people p set p.newid=(select l.newid from id_conversion l where l.id=p.id)
it runs for an hour or so and then I get an archive error ora 00257.
Any suggestions for either running update in sections or better sql command?

To avoid writing to Oracle's undo log if your update statement hits every single row of the table then you are likely better off running a create table as select query which will bypass all undo logs, which is likely the issue you're running into as it is logging the impact across 60 million rows. You can then drop the old table and rename the new table to that of the old table's name.
Something like:
create table new_people as
select l.newid,
p.col2,
p.col3,
p.col4,
p.col5
from people p
join id_conversion l
on p.id = l.id;
drop table people;
-- rebuild any constraints and indexes
-- from old people table to new people table
alter table new_people rename to people;
For reference, read some of the tips here: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
If you are basically creating a new table and not just updating some of the rows of a table it will likely prove the faster method.

I doubt you will be able to get this to run in seconds. Your query, as written, needs to update all 60 million rows.
My first advice is to add an index on id_conversion(id, newid), to make the subquery more efficient. If that doesn't help, then doing the update in batches might be the best way to go.
I should add. Because you are updating all the rows, it might be faster to take the following approach:
Copy the data into a new table with the new values.
Truncate the original table.
Insert the new data into the old table.
Inserts are faster than updates.

In addition to the answers above, which probably will work better in this case, you should know the MERGE statement
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
that is used for updating one table according to another table and is far faster then update according to a select statement

Copying timestamp columns within a Postgres table

I have a table with about 30 million rows in a Postgres 9.4 db. This table has 6 columns, the primary key id, 2 text, one boolean, and two timestamp. There are indices on one of the text columns, and obviously the primary key.
I want to copy the values in the first timestamp column, call it timestamp_a into the second timestamp column, call it timestamp_b. To do this, I ran the following query:
UPDATE my_table SET timestamp_b = timestamp_a;
This worked, but it took an hour and 15 minutes to complete, which seems a really long time to me considering, as far as I know, it's just copying values from one column to the next.
I ran EXPLAIN on the query and nothing seemed particularly inefficient. I then used pgtune to modify my config file, most notably it increased the shared_buffers, work_mem, and maintenance_work_mem.
I re-ran the query and it took essentially the same amount of time, actually slightly longer (an hour and 20 mins).
What else can I do to improve the speed of this update? Is this just how long it takes to write 30 million timestamps into postgres? For context I'm running this on a macbook pro, osx, quadcore, 16 gigs of ram.

The reason this is slow is that internally PostgreSQL doesn't update the field. It actually writes new rows with the new values. This usually takes a similar time to inserting that many values.
If there are indexes on any column this can further slow the update down. Even if they're not on columns being updated, because PostgreSQL has to write a new row and write new index entries to point to that row. HOT updates can help and will do so automatically if available, but that generally only helps if the table is subject to lots of small updates. It's also disabled if any of the fields being updated are indexed.
Since you're basically rewriting the table, if you don't mind locking out all concurrent users while you do it you can do it faster with:
BEGIN
DROP all indexes
UPDATE the table
CREATE all indexes again
COMMIT
PostgreSQL also has an optimisation for writes to tables that've just been TRUNCATEd, but to benefit from that you'd have to copy the data to a temp table, then TRUNCATE and copy it back. So there's no benefit.

#Craig mentioned an optimization for COPY after TRUNCATE: Postgres can skip WAL entries because if the transaction fails, nobody will ever have seen the new table anyway.
The same optimization is true for tables created with CREATE TABLE AS:
What causes large INSERT to slow down and disk usage to explode?
Details are missing in your description, but if you can afford to write a new table (no concurrent transactions get in the way, no dependencies), then the fastest way might be (except if you have big TOAST table entries - basically big columns):
BEGIN;
LOCK TABLE my_table IN SHARE MODE; -- only for concurrent access
SET LOCAL work_mem = '???? MB'; -- just for this transaction
CREATE my_table2
SELECT ..., timestamp_a, timestamp_a AS timestamp_b
-- columns in order, timestamp_a overwrites timestamp_b
FROM my_table
ORDER BY ??; -- optionally cluster table while being at it.
DROP TABLE my_table;
ALTER TABLE my_table2 RENAME TO my_table;
ALTER TABLE my_table
, ADD CONSTRAINT my_table_id_pk PRIMARY KEY (id);
-- more constraints, indices, triggers?
-- recreate views etc. if any
COMMIT;
The additional benefit: a pristine (optionally clustered) table without bloat. Related:
Best way to populate a new column in a large table?

Alternatives to UPDATE statement Oracle 11g

I'm currently using Oracle 11g and let's say I have a table with the following columns (more or less)
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
And this table has about 1 Billion rows. I would want to update the status column with a specific where clause, let's say
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying to do is using CTAS or Insert into select to get the rows that I want to update and put on another table while using AS COLUMN_NAME so the values are already updated on the new/temporary table, which looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal update statement. The problem now is I would want to get the remaining data from the original table which I do not need to update but I do need to be included on my updated table/list.
What I tried to do at first was use DELETE on the same original table using the same where clause so that in theory, everything that should be left on that table should be all the data that i do not need to update, leaving me now with the two tables:
TABLE1 --which now contains the rows that i did not need to update
TABLE1_TEMPORARY --which contains the data I updated
But the delete statement in itself is also too slow or as slow as the orginal UPDATE statement so without the delete statement brings me to this point.
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use in order to get the data that's the opposite of my WHERE clause (take note that the where clause in this example has been simplified so I'm not looking for an answer of NOT EXISTS/NOT IN/NOT EQUALS plus those clauses are slower too compared to positive clauses)
I have ruled out deletion by partition since the data I need to update and not update can exist in different partitions, as well as TRUNCATE since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this using as less REDO/UNDO/LOGGING as possible.
Thanks in advance.

I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table in a way so the rows touched are not totally spread over all partitions but confined to a few partitions.
Ensure your transactions wouldn't use these partitions for now.
Per each partition/subpartition you would normally UPDATE, perform CTAS of all the rows (I mean even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild index partitions.
At the end rebuild global indexes.
If you don't have Oracle Enterprise Edition, you would need to either CTAS entire billion of rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or to prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT etc) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.

How about keeping in the UPDATE in the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is potentially manageable, but doing it all in one chunk is the problem. This approach breaks it into modest-sized pieces.
Doing it this way could, for example, enable other apps to keep running & give other workloads a look in; and would avoid needing a single humungous transaction in the logfile.

Safely insert row data from one table to another - SQL

I need to move some data stored in one table to another using a script, taking into account existing records that may already be in the destination table as well as any relationships that may exist.
I am curious to know the best method of doing this that has a relatively low impact on performance and can be reversed if necessary.
At first I will be moving only one record to ensure the process runs smoothly but then it will be responsible for moving around 1650 rows.
What would be the best approach to take or is there a better alternative?
Edit:
My previous suggestion of using MERGE will not work as I will be operating under the SQL Server 2005 environment, not 2008 like previously mentioned.

the question does not provide any details, so I can't provide any actual real code, just this plan of attack:
step 1 write a query that will SELECT only the rows you need to copy. You will need to JOIN and/or filter (WHERE) this data to only include the rows that don't already exist in the destination table. Make the column list be the exact same as the destination table's columns, in column order and data type.
step 2 turn that SELECT statement into an INSERT by adding INSERT YourDestinationTable (col1, col2, col3..) before the select.
step 3 if you only want to try a single row, add a TOP 1 to the select part of the new INSERET - SELECT command, you can rerun this command as many times as necessary with/without a TOP because it should eliminate any rows you add by the JOINs and WHERE conditions in the SELECT
in the end, you'll have something that looks like:
INSERT YourDestinationTable
(Col1, Col2, Col3, ...)
SELECT
s.Col1, s.Col2, s.Col3, ...
FROM YourSourceTable s
LEFT OUTER JOIN SomeOtherTable x ON s.Col4=x.Col4
WHERE NOT EXISTS (SELECT 1 FROM YourDestinationTable d WHERE s.PK=d.PK)
AND x.Col5='J'
I'm reading the question as only inserting missing rows from a source table to a destination table. If changes need to be migrated as well then prior to the above steps you will need to do an UPDATE of the destination table joining in the source table. This is hard to explain without more specifics of the actual tables, columns, etc.

Yes, the MERGE statement is ideal for bulk imports if you are running SQL Server 2008.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas