I am trying to populate postgres table from another table, nearly about 24 millions records. but query become too slow it taking 9-10 hours. the update operation only update 1-2 row each second. i cant understand why it slow.
Current benchmark
Query = INSERT INTO .... SELECT FROM .... ON CONFLICT DO UPDATE
Source table has 24 Million records
Destination Already have 560 Millions records with indexes, unique keys, primary and foreign keys
Query(Sample)
INSERT INTO destination_tbl(col1, col2 .... , col22, false AS processed, null AS updated_at)
SELECT (col1, col2 .... , col22) FROM source_tbl
WHERE processed=false
ON CONFLICT (unique_cols...)
DO UPDATE
SET col1 = EXCLUDED.col1
....
col22 = EXCLUDED.col22
processed = false
updated_at = now()
The query performance results that you have mentioned do seem to be according to the query that you have.It is a simple insert query, which uses INSERT ... ON CONFLICT which is one of the ways to UPSERT data.However talking about the performance then it matters a lot if you use ON CONFLICT DO NOTHING or if you use an UPDATE clause.
Generally when a DO NOTHING clause is running , there won't be any dead tuples that have to be cleaned up whereas if you use an UPDATE clause, there will be a dead tuple, and cleaning up these dead tuples may take time which is definitely inclusive in the total query execution time.We know that INSERT ON CONFLICT always performs a read to determine the necessary writes, the UPSERT statement writes without reading, making it faster. For tables with secondary indexes, there is no performance difference between UPSERT and INSERT ON CONFLICT.
Try to check on the above factors and see if batch loads are possible or a query division which would allow a reduction in time of execution and also fillfactor value set should help in time reduction.
Related
I have to execute the following UPDATE statement on a large table, partitioned using a timestamp field with a resolution of one day:
UPDATE my_huge_table
SET column1 = 'an-uuid4'
WHERE customer_id = 'a-customer-uuid4'
AND organization_id = 'an-organization-uuid4'
column1 IN ('another-uuid4', 'yet-another-uuid4', ...)
where customer_id, organization_id and column1 columns are all properly indexed.
That statement is executed by an Aurora database (PostgreSQL) on AWS.
It turns out that, sometimes, that query times out. The reason, according to the performance insights provided by AWS, is that likely an inefficient plan is used to execute the query, as a lot of data must be read in one shot from the file system.
My guess is that the database is loading a huge amount of table records in memory as the column being updated is the same used for the selection.
I wonder whether there's a more efficient way, performance-wise, to accomplish the same task.
In my postgres database, I have a table that is used to map two fields together with a serial primary key, but I will rarely get new mappings. I may get data sent to me 60k times per day (or more) and only get new mappings a handful of times every month. I wrote a query with the on conflict clause:
insert into mapping_table
values
(field1, field2)
on conflict (field1, field2) do nothing
This seems to work as expected! But I'm wondering if running the insert query tens of thousands of times per day when a mapping rarely needs to be added is problematic somehow? And if so, is there a better way to do that?
I'm also not sure if there's a problem with the serial primary key. As expected, the value auto increments even though there is a conflict. I don't have any logic that expects that the primary key won't have gaps, but the numbers will get large very fast, and I'm not sure if that could become a problem in the future.
There are 86,400 seconds in a day. So, 60k times per day is less than once per second.
Determining the conflict is a lookup in a unique index, and then perhaps some additional checking for locking and so on. Under most circumstances, this would be an acceptable load on the system.
on conflict seems like the right tool for this. It is built-in, probably fairly optimized, and implements the logic you want with a straight-forward syntax.
If you want to avoid "burning" sequence numbers though, an alternative is not exists, to check if a row exists before inserting it:
insert into mapping_table (field1, field2)
select *
from (values (field1, field2)) v
where not exists (
select 1
from mapping_table mp
where mp.field1 = v.field1 and mp.field2 = v.field2
)
This query would take advantage of the already existing index on (field1, field2). It will most likely not be faster than on conflict though, and probably slower.
I have a table with about 30 million rows in a Postgres 9.4 db. This table has 6 columns, the primary key id, 2 text, one boolean, and two timestamp. There are indices on one of the text columns, and obviously the primary key.
I want to copy the values in the first timestamp column, call it timestamp_a into the second timestamp column, call it timestamp_b. To do this, I ran the following query:
UPDATE my_table SET timestamp_b = timestamp_a;
This worked, but it took an hour and 15 minutes to complete, which seems a really long time to me considering, as far as I know, it's just copying values from one column to the next.
I ran EXPLAIN on the query and nothing seemed particularly inefficient. I then used pgtune to modify my config file, most notably it increased the shared_buffers, work_mem, and maintenance_work_mem.
I re-ran the query and it took essentially the same amount of time, actually slightly longer (an hour and 20 mins).
What else can I do to improve the speed of this update? Is this just how long it takes to write 30 million timestamps into postgres? For context I'm running this on a macbook pro, osx, quadcore, 16 gigs of ram.
The reason this is slow is that internally PostgreSQL doesn't update the field. It actually writes new rows with the new values. This usually takes a similar time to inserting that many values.
If there are indexes on any column this can further slow the update down. Even if they're not on columns being updated, because PostgreSQL has to write a new row and write new index entries to point to that row. HOT updates can help and will do so automatically if available, but that generally only helps if the table is subject to lots of small updates. It's also disabled if any of the fields being updated are indexed.
Since you're basically rewriting the table, if you don't mind locking out all concurrent users while you do it you can do it faster with:
BEGIN
DROP all indexes
UPDATE the table
CREATE all indexes again
COMMIT
PostgreSQL also has an optimisation for writes to tables that've just been TRUNCATEd, but to benefit from that you'd have to copy the data to a temp table, then TRUNCATE and copy it back. So there's no benefit.
#Craig mentioned an optimization for COPY after TRUNCATE: Postgres can skip WAL entries because if the transaction fails, nobody will ever have seen the new table anyway.
The same optimization is true for tables created with CREATE TABLE AS:
What causes large INSERT to slow down and disk usage to explode?
Details are missing in your description, but if you can afford to write a new table (no concurrent transactions get in the way, no dependencies), then the fastest way might be (except if you have big TOAST table entries - basically big columns):
BEGIN;
LOCK TABLE my_table IN SHARE MODE; -- only for concurrent access
SET LOCAL work_mem = '???? MB'; -- just for this transaction
CREATE my_table2
SELECT ..., timestamp_a, timestamp_a AS timestamp_b
-- columns in order, timestamp_a overwrites timestamp_b
FROM my_table
ORDER BY ??; -- optionally cluster table while being at it.
DROP TABLE my_table;
ALTER TABLE my_table2 RENAME TO my_table;
ALTER TABLE my_table
, ADD CONSTRAINT my_table_id_pk PRIMARY KEY (id);
-- more constraints, indices, triggers?
-- recreate views etc. if any
COMMIT;
The additional benefit: a pristine (optionally clustered) table without bloat. Related:
Best way to populate a new column in a large table?
I'm currently using Oracle 11g and let's say I have a table with the following columns (more or less)
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
And this table has about 1 Billion rows. I would want to update the status column with a specific where clause, let's say
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying to do is using CTAS or Insert into select to get the rows that I want to update and put on another table while using AS COLUMN_NAME so the values are already updated on the new/temporary table, which looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal update statement. The problem now is I would want to get the remaining data from the original table which I do not need to update but I do need to be included on my updated table/list.
What I tried to do at first was use DELETE on the same original table using the same where clause so that in theory, everything that should be left on that table should be all the data that i do not need to update, leaving me now with the two tables:
TABLE1 --which now contains the rows that i did not need to update
TABLE1_TEMPORARY --which contains the data I updated
But the delete statement in itself is also too slow or as slow as the orginal UPDATE statement so without the delete statement brings me to this point.
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use in order to get the data that's the opposite of my WHERE clause (take note that the where clause in this example has been simplified so I'm not looking for an answer of NOT EXISTS/NOT IN/NOT EQUALS plus those clauses are slower too compared to positive clauses)
I have ruled out deletion by partition since the data I need to update and not update can exist in different partitions, as well as TRUNCATE since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this using as less REDO/UNDO/LOGGING as possible.
Thanks in advance.
I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table in a way so the rows touched are not totally spread over all partitions but confined to a few partitions.
Ensure your transactions wouldn't use these partitions for now.
Per each partition/subpartition you would normally UPDATE, perform CTAS of all the rows (I mean even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild index partitions.
At the end rebuild global indexes.
If you don't have Oracle Enterprise Edition, you would need to either CTAS entire billion of rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or to prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT etc) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.
How about keeping in the UPDATE in the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is potentially manageable, but doing it all in one chunk is the problem. This approach breaks it into modest-sized pieces.
Doing it this way could, for example, enable other apps to keep running & give other workloads a look in; and would avoid needing a single humungous transaction in the logfile.
What I'm trying to implement here is a condition wherein a sqlite database holds only the most recent 1000 records. I have timestamps with each record.
One of the inefficient logic which strikes right away is to check the total number of records. If they exceed 1000, then simply delete the ones which fall out of the periphery.
However, I would have to do this check with each INSERT which makes things highly inefficient.
What could be a better logic? Can we do something with triggers?
Some related questions which follow the same logic I thought of are posted on SO:-
Delete oldest records from database
SQL Query to delete records older than two years
You can use an implicit "rowid" column for that.
Assuming you don't delete rows manually in different ways:
DELETE FROM yourtable WHERE rowid < (last_row_id - 1000)
You can obtain last rowid using API function or as max(rowid)
If you don't need to have exactly 1000 records (e.g. just want to cleanup old records), it is not necessary to do it on each insert. Add some counter in your program and execute cleanup f.i. once every 100 inserts.
UPDATE:
Anyway, you pay performance either on each insert or on each select. So the choice depends on what you have more: INSERTs or SELECTs.
In case you don't have that much inserts to care about performance, you can use following trigger to keep not more than 1000 records:
CREATE TRIGGER triggername AFTER INSERT ON tablename BEGIN
DELETE FROM tablename WHERE timestamp < (SELECT MIN(timestamp) FROM tablename ORDER BY timestamp DESC LIMIT 1000);
END
Creating unique index on timestamp column should be a good idea too (in case it isn't PK already). Also note, that SQLITE supports only FOR EACH ROW triggers, so when you bulk-insert many records it is worth to temporary disable the trigger.
If there are too many INSERTs, there isn't much you can do on database side. You can achieve less frequent trigger calls by adding trigger condition like AFTER INSERT WHEN NEW.rowid % 100 = 0. And with selects just use LIMIT 1000 (or create appropriate view).
I can't predict how much faster that would be. The best way would be just measure how much performance you will gain in your particular case.