Postgres: UPSERT or INCREMENTAL update of a large table - sql

I have the following problem.
We have two systems, an OLTP system and our Data Warehouse, both on PostgreSQL. I need to sync one table from the OLTP system to the DWH so that they are identical. I am using a Foreign Data Wrapper for communication between the databases, so the sync could be straightforward and look like this:
TRUNCATE TABLE target_table;
INSERT INTO target_table SELECT * FROM source_table;
Unfortunately, the table is huge for the hardware we use (hundreds of millions of rows, about 50 GB of data), so this is massively time-consuming.
I have since figured out a way to do it faster with an incremental upsert, based on the following approach (sketched below).
Identify new rows in the source table by taking rows with insert_time > (SELECT max(insert_time) FROM target_table).
Identify rows that need to be updated by joining both tables on id where source_table.update_time > target_table.update_time.
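A minimal sketch of those two steps, assuming id is the primary key and col1/col2 stand in for the real column list:
-- Step 1: insert rows that are new in the source (col1, col2 are placeholders).
INSERT INTO target_table (id, col1, col2, insert_time, update_time)
SELECT s.id, s.col1, s.col2, s.insert_time, s.update_time
FROM source_table s
WHERE s.insert_time > (SELECT max(insert_time) FROM target_table);

-- Step 2: update rows that changed since they were copied.
UPDATE target_table t
SET col1 = s.col1,
    col2 = s.col2,
    update_time = s.update_time
FROM source_table s
WHERE s.id = t.id
  AND s.update_time > t.update_time;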
That worked absolutely fine; however, a new problem arose: rows can be removed from the source table, so I need to remove the corresponding rows from the target table. All approaches I have tried are way too slow, like this one for example:
CREATE VIEW deleted_rows AS
SELECT target.row_id
FROM target
LEFT JOIN source USING (id)
WHERE source.id IS NULL;
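The view would presumably then drive a delete along these lines (a sketch, assuming row_id identifies the rows in target):
DELETE FROM target
WHERE row_id IN (SELECT row_id FROM deleted_rows);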
Any idea how to solve this problem smartly and quickly? Thanks in advance for your responses.

Related

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
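For illustration, a hedged sketch of a CTAS that specifies the distribution and sort settings explicitly, so Redshift does not fall back to the EVEN default (the sort column is a placeholder, and the query body is elided as in the question):
CREATE TABLE tbl
DISTSTYLE ALL
SORTKEY (event_time)   -- placeholder; pick a column that matches your filters
AS
SELECT ...             -- the body of query A goes here
;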
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the count by reading very little data from disk, because it realized that few columns (or perhaps none) needed to be read to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (work that wasn't done as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE AS writes all data returned by the query to disk; the count query does not, which explains the difference. Writing all of the rows is a far more expensive operation than computing a row count.

How to merge 500 million table with another 500 million table

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
CREATE VIEW myview AS SELECT * FROM table1 UNION SELECT * FROM table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days, and if the machine reboots the database will go into recovery mode, and then I have to start from the beginning again.
Why am I merging these tables?
I have a website which provides a search on the person table.
This table has columns like Name, Address, Age, etc.
We got .txt files containing 500 million similar records, which we loaded into another table.
Now we want the website search page to query both tables to see if a person exists in the tables.
We get similar .txt files of 100 million or 20 million records, which we load into this huge table.
How are we currently doing it?
We import the .txt files into separate tables (some columns are different in the .txt files)
Then we arrange the columns and do the data type conversions
Then we insert this staging table into the liveCopy huge table (in a test environment)
We have SQL server 2008 R2
Can we use table partitioning for performance benefits?
Is it ok to create small monthly tables and create a view on top of them?
How can indexing be done in this case?
We only load new data once a month and then run the selects
Does replication help?
Biggest issue I am facing is managing huge tables.
I hope I explained the situation.
Thanks & Regards
1) To achieve more performance, developers often split large tables into smaller ones and call this partitioning (horizontal partitioning, to be more precise, because there is also vertical partitioning). Your view is an example of such partitions combined. It is mostly used to split a large amount of data into ranges of values (for example, table1 contains records with column [col1] < 0, while table2 contains those with [col1] >= 0). But it is fine for unsorted data too, because you get more room for speed improvements - for example, parallel reads if the tables are placed on different storage. So this is a good choice.
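For illustration, a hedged sketch of the range-partitioned view described above, using the col1 example (table and constraint names are placeholders; the CHECK constraints let SQL Server skip the table that cannot satisfy a predicate on col1):
ALTER TABLE table1 ADD CONSTRAINT ck_table1_col1 CHECK (col1 < 0);
ALTER TABLE table2 ADD CONSTRAINT ck_table2_col1 CHECK (col1 >= 0);
GO
CREATE VIEW myview AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2;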
2) Another way is to use MERGE statement supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course you can copy using INSERT+DELETE, but in that case, or when using the MERGE command, do it in small batches. Something like:
SET ROWCOUNT 10000;
DECLARE @Count int = 1;
WHILE @Count > 0
BEGIN
    ... INSERT+DELETE/MERGE transaction ...
    SET @Count = @@ROWCOUNT;
END
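As a variation on that loop, a hedged sketch that uses DELETE TOP with an OUTPUT clause instead of the deprecated SET ROWCOUNT; it moves 10,000 rows per iteration from table2 into table1, and assumes compatible column lists with no conflicting triggers or foreign keys on the target:
DECLARE @Count int = 1;
WHILE @Count > 0
BEGIN
    -- Move one batch: delete from the source and insert the deleted rows into the target.
    DELETE TOP (10000) FROM table2
    OUTPUT DELETED.* INTO table1;
    SET @Count = @@ROWCOUNT;
END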
If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your transaction log got full, although that's just speculation. If you are in Simple recovery mode, throw in a CHECKPOINT after each batch; in Full recovery mode you would need transaction log backups between batches to keep the log from growing.
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.
You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the CRUD operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
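For reference, a minimal sketch of what an indexed view looks like (schema, table, and column names are placeholders; note the restrictions - WITH SCHEMABINDING is required, and indexed views cannot contain UNION or outer joins, so this only applies once the data lives in a single table or inner-joined tables):
CREATE VIEW dbo.PersonSearch
WITH SCHEMABINDING
AS
SELECT PersonId, Name, Address, Age
FROM dbo.Person;
GO
-- The unique clustered index is what materializes and indexes the view.
CREATE UNIQUE CLUSTERED INDEX IX_PersonSearch ON dbo.PersonSearch (PersonId);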
If the two tables are linked one to one, then you are wasting a lot of CPU time on each read, especially since you mentioned that the tables don't change at all. You should have only one table in this case.
Try creating a new table including (at least) the two columns from the two tables.
You can do this by:
SELECT ...          -- the columns you need from both tables
INTO newTable
FROM A
LEFT JOIN B ON A.x = B.y
or (if some people don't have the information from the text file)
SELECT ...          -- the columns you need from both tables
INTO newTable
FROM A
INNER JOIN B ON A.x = B.y
And note that you should at least have indexes on the join fields (to speed up the process).
More details about the fields would help in giving a more precise answer as well.

Delete All / Bulk Insert

First off let me say I am running on SQL Server 2005 so I don't have access to MERGE.
I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file I need to delete them from the database, and if they change or are new I need to update/insert accordingly.
After some testing I've found that, performance-wise, it is dramatically faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert. However, I recently came across some posts discussing mimicking the MERGE functionality of SQL Server 2008 using a temp table and the OUTPUT clause of the UPDATE statement.
I was interested in this because I am looking into how I can eliminate the window of time in my Delete/Bulk Insert method during which the table has no rows. I still think this method will be the fastest, so I am looking for the best way to solve the empty-table problem.
Thanks
I think your fastest method would be to:
Drop all foreign keys and indexes from your table.
Truncate your table.
Bulk insert your data.
Recreate your foreign keys and indexes.
Is the problem that Joe's solution is not fast enough, or that you cannot have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will take a table lock that will be held for the duration of the transaction, like so:
BEGIN TRAN;
TRUNCATE TABLE stage_table;
BULK INSERT stage_table
FROM N'C:\datafile.txt';
COMMIT TRAN;
An alternative solution which would satisfy your requirement of not having "down time" for the table you are updating.
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach that does not involve clearing down the table is as follows:
1) Bulk load the file into a new, separate table (no indexes)
2) Then create the PK on it
3) Run three statements to update the original table from this new (temporary) table (sketched below):
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements
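A hedged sketch of those three statements for SQL Server 2005, assuming main_table is the original table, staging_table is the freshly loaded copy, id is the key, and col1/col2 stand in for the real columns:
-- 1) Delete rows that are no longer in the file.
DELETE m
FROM main_table AS m
WHERE NOT EXISTS (SELECT 1 FROM staging_table AS s WHERE s.id = m.id);

-- 2) Update rows that exist in both tables.
UPDATE m
SET m.col1 = s.col1,
    m.col2 = s.col2
FROM main_table AS m
INNER JOIN staging_table AS s ON s.id = m.id;

-- 3) Insert rows that are new in the file.
INSERT INTO main_table (id, col1, col2)
SELECT s.id, s.col1, s.col2
FROM staging_table AS s
WHERE NOT EXISTS (SELECT 1 FROM main_table AS m WHERE m.id = s.id);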
There is a way to update the table with zero downtime: keep two days' data in the table, and delete the old rows after loading the new ones!
Add a DataDate column representing the date for which your ~150K rows are valid.
Create a one-row, one-column table with "today's" DataDate.
Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
Update the DataDate table. The view updates instantly!
Delete yesterday's rows at your leisure.
SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
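A minimal sketch of the steps above (table, view, and column names are placeholders, and the dates shown are just examples):
-- One-row table holding the currently visible DataDate.
CREATE TABLE dbo.CurrentDataDate (DataDate datetime NOT NULL);
INSERT INTO dbo.CurrentDataDate VALUES ('20240101');
GO
-- Readers query this view, which only exposes rows for the current DataDate.
CREATE VIEW dbo.UserDataCurrent AS
SELECT u.*
FROM dbo.UserData AS u
INNER JOIN dbo.CurrentDataDate AS d ON u.DataDate = d.DataDate;
GO
-- After bulk inserting the new day's rows tagged with the new DataDate:
UPDATE dbo.CurrentDataDate SET DataDate = '20240102';   -- readers switch instantly
DELETE FROM dbo.UserData WHERE DataDate < '20240102';   -- clean up at leisure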
I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes, of course, that having a period of time when the table is empty or doesn't exist is acceptable, which it sounds like could be the case.

How can I efficiently manipulate 500k records in SQL Server 2005?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often run into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from the temporary table (to remove last month's data) (e.g. DELETE FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.
Correlated subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are evaluated once per row of tempTable, which performs poorly at this scale. Think set-based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
INNER JOIN myUsers u ON t.external_id = u.external_id
This will update all rows in one statement and be significantly faster!
Needs more information. I regularly manipulate 3-4 million rows in a 150 million row table and I do not consider that a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems either.
Can you elaborate on your hardware? I assume a "normal desktop PC" or "low-end server", both with an absolutely non-optimal disc layout, and thus tons of IO problems - on updates.
Make sure you have indexes on the tables you are doing the selects from. In your example UPDATE command, you look rows up in the myUsers table by external_id; do you have an index on the external_id column of the myUsers table? The downside of indexes is that they increase the time for inserts/updates. Make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
Look at the KM's answer and don't forget about indexes and primary keys.
Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.
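For example, a hedged sketch of the indexes in question (index names are placeholders):
CREATE INDEX IX_tempTable_external_id ON tempTable (external_id);
CREATE INDEX IX_myUsers_external_id ON myUsers (external_id);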
There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under bcp (the Bulk Copy Program).

SQL - Optimizing performance of bulk inserts and large joins?

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique constraints
Import the data (~100 million records)
Re-create the constraints and run analyze on the fact table.
Importing the data is done by loading from files. For each file:
1) Load the data into a temporary table using COPY (the PostgreSQL bulk upload tool)
2) Update each of the 9 dimension tables with any new data using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The ANALYZE is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates. (Is this advisable or necessary? At a minimum it does not seem to significantly reduce performance.)
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?
I've tried several different approaches to normalizing data coming in from a source like this, and generally I've found the approach you're using now to be my choice. It's easy to follow and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and using "select distinct ... except select" works well for me. Other folks may know better, but I've found your current method to be my preferred method.
During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead I'd recommend creating one sproc to insert into your fact table; e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look-up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is now reduced to a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ... etc.)
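As a hedged illustration, a PL/pgSQL sketch of one such get-or-insert function for the host dimension from the question (the function name and the id column are assumptions, and it ignores concurrency concerns for brevity):
-- Look up the dimension row by natural key, inserting it if missing,
-- and return the surrogate primary key for use in the fact insert.
CREATE OR REPLACE FUNCTION get_or_insert_host_dim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;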
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.