How to avoid data fragmentation in an "INSERT once, UPDATE once" table? - sql

I have large number of tables that are "INSERT once", and then read-only after that. ie: After the initial INSERT of a record, there are never any UPDATE or DELETE statements. Due to this, the data fragmentation on disk for the tables is minimal.
I'm now considering adding an needs_action boolean field to every table. This field will only ever get updated once, and it will be done on a slow/periodic basis. As a result of MVCC, when VACUUM comes along (with an even slower schedule) after the UPDATE, the table becomes quite fragmented as it clears out the originally inserted tuples, and they get subsequently backfilled by new insertions.
In short: The addition of this single "always updated once" field changes the table from being being minimally fragmented by design, to being highly fragmented by design.
Is there some method of effectively achieving this single needs_action record flagging that can avoid the resulting table fragmentation?
.
.
.
.
<And now for some background/supplemental info...>
Some options considered so far...
At the risk of making this question massive (and therefore overlooked?), below are some options that have been considered so far:
Just add the column to each table, do the UPDATE and don't worry about resulting fragmentation until it actually proves to be an issue.
I'm conscious of premature optimization here, but with some of the tables getting large (>1M, and even > 1B) I'd rather get the design right up front.
Make an independent tracking table (for every table) only containing A) the PK from the master table, and B) the needs_action flag. Create a record in the tracking table with an AFTER INSERT trigger in the master table
This will preserve "INSERT only" minimal fragmentation levels on the master table... at the expense of adding (significant?) up-front write overhead
putting the tracking tables in a separate schema would also neatly separate the functionality from the core tables
Forcing the needs_action field to be a HOT update to avoid tuple replication
Needing an index on WHERE needs_action = TRUE seems to rule out this option, but maybe there's another way to find them quickly?
Using table fillfactor (at 50?) to leave space for the inevitable UPDATE
eg: set fillfactor to 50 to leave room for the UPDATE, and therefore keep it in the same page
But... with only one UPDATE it seems like this will leave the table packing fraction eternally at 50% and take twice the storage space? I don't 100% understand this option yet... still learning.
Find a special/magical field/bit in the master table records that can be twiddled without MVCC impact.
This does not seem to exist in postgres. Even if it did, it would need to be indexed (or have some other quick finding mechanism akin to a WHERE needs_action = TRUE partial index)
Being able to optionally suppress MVCC operation on specific columns seems like it would be nice here (although surely fraught with peril)
Storing needs_action outside of postgres (eg: as a <table_name>:needs_copying list of PKs in redis) to evade fragmentation due to mvcc.
I have concerns about keeping this atomic, though. Maybe using redis_fdw (or some other fdw?) in an AFTER INSERT trigger can keep it atomic? I need to learn more about fdw capabilities... seems like all fdw's I can find are all read-only, though.
Run a fancy view with background defrag/compacting, as described in this fantastic article
Seems like a bit much to do for all tables.
Simply track ids/PKs that need copying in a postgres table
just stash ids that need action as records to a fast inert table (eg: no PK), and DELETE the records when action is completed
similar to RPUSHing to an offline redis list (but definitely ACID)
This seems like the best option at the moment.
Are there other options to consider?
More on the specific use case driving this...
I'm interested in the general case how to avoid this fragmentation, but here's a bit more on the current use case:
Read performance is much more important than write performance on all tables (but avoiding crazy slow writes is clearly desirable)
Some tables will reach millions of rows. A few may hit billions of rows.
SELECT queries will be across wide table ranges (not just recent data) and can range from single result records to 100k+
Table design can be done from scratch... no need to worry about existing data
PostgreSQL 9.6

I would just lower the fillfactor below the default value of 100.
Depending on the size of the row, use a value like 80 or 90, so that a few new rows will still fit into the block. After the update, the old row can be “pruned” and defragmented by the next transaction so that the space can be reused.
A value of 50 seems way too low. True, this would leave room for all rows in a block being updated at the same time, but that is not your use case, right?

You can try to use inherited tables. This approach does not directly support PK for tables, but it might be resolved by triggers.
CREATE TABLE data_parent (a int8, updated bool);
CREATE TABLE data_inserted (CHECK (NOT updated)) INHERITS (data_parent);
CREATE TABLE data_updated (CHECK (updated)) INHERITS (data_parent);
CREATE FUNCTION d_insert () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = false;
INSERT INTO data_inserted VALUES (NEW.*);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_insert BEFORE INSERT ON data_parent FOR EACH ROW EXECUTE PROCEDURE d_insert();
CREATE FUNCTION d_update () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = true;
INSERT INTO data_updated VALUES (NEW.*);
DELETE FROM data_inserted WHERE (data_inserted.*) IN (OLD);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_update BEFORE INSERT ON data_inserted FOR EACH ROW EXECUTE PROCEDURE d_update();
-- GRANT on d_insert to regular user
-- REVOKE insert / update to regular user on data_inserted/updated
INSERT INTO data_parent (a) VALUES (1);
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
INSERT 0 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | f
(1 row)
a | updated
---+---------
(0 rows)
UPDATE data_parent SET updated = true;
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
UPDATE 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | t
(1 row)

Related

Proper way of updating a whole table in Redshift, drop table + create table vs. truncate + insert into table

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this by a combination of DROP TABLE IF EXIST some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know what is the "best-practice" or proper way of doing so.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding its usage, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from Redshift docs, but I'm not sure if it is the best option, since I could have not only to remove records, but keeping and inserting as well.
If your desire is to completely erase a table and replace the data then the general pattern you are following in fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Insert. Of these Delete/Create/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (not delete them) and adds new valid rows. This doubles the size of the table, wasting space, and needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other approaches but this only matters in certain situations. Go down this approach only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast but there is some small overhead in managing all the block links. Since the table definition is unchanged all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur which means that until the table is repopulated with new data other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.
Lastly there is Drop/Create/Insert. This approach is marginally (very slightly and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name) so truncate and drop are about the same speed unless the table is large. Since Drop can be inside of a transaction block the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also since this table is "going away" the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their bench and goes home for the night. Since the tables needs to be recreated your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code - 1) You are not using a transaction block (as far as I can tell) so there is a window when others will see that the tables doesn't exists or is empty. This may or may not be important to you but be aware. 2) "Create table As" doesn't define the DDL of the table in performant structure (and possibly incorrectly). You should always specify your permanent tables fully. Sort and Dist keys matter as do varchar lengths, data types etc. This is a time bomb waiting to go off.
Per request for an example of drop/create/insert:
As I mentioned there are lock dependency issues that can arise with this method so I like to use a "swap & drop" approach to this path. This makes the new information visible to users at the "swap" so even if the "drop" gets blocks things get published on time. This doesn't remove the lock risk as a lock can still prevent the process (session) from completing, it just makes it so that the new data is visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with benches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
I think you will need to use the Update command. I understand that drooping a table is a risky move, as you might loose all of your data from your database.
Update some_table_name s set
s.Id="whatever you want to update",
s.Name="whatever you want to update",
s.LastName="whatever you want to update",
s.OtherTableColumn="whatever you want to update"
From
some_table_name s
In above code I assumed your table had for columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or less columns then I would adjust accordingly.
I would also write a update procedure for this (and each table) so if you need to update somewhat frequently you just use the procedure; I think its quicker. Below would be my procedure:
Create Proc sp_UpdateSome_table_name
#Id int,
#name nvarchar(255),
#lastname nvarchar(255),
#OtherTableColumn int
AS
BEGIN
Update s some_table_name
s.Name="whatever you want to update",
s.LastName="whatever you want to update",
s.OtherTableColumn="whatever you want to update"
From
some_table_name s
Where
s.Id=#Id
END
You want to make sure that each column in your table is defined with correct data type in the procedure. For example I assumed above that #Id was int, Name was nvarchar(255) etc. If you want to allow yourself not to enter any data (allowing null) in certain table columns when updating then after the data type you can write Null; for example if you write #Id int Null, then you can update is as null; but if you are not sure what this is, simply ignore this sentence for now.
Once you assured above paragraph is good (data types are correct), then select the entire procedure and then execute (F5). This will store this procedure.
Then I will write the procedure every time you want to update your table shown as below:
Exec sp_UpdateSome_table_name 1,John,Smith,77
If you highlight the above command and execute (f5) it then it will update the table which has Id=1 and it will make the name John, last name Smith and the other column 77 from whatever it was before. If there is no data in the table with Id=1 then you can execute.
Keep in mind the last rows of the codes might not have a comma. The above codes are written correctly, just pointing it out as you might put a comma out of habit.

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 Columns and about 400 million rows. Once a week new data is inserted in this table and before the insert I need to set the values of 2 columns to null for specific partitions.
Obvious approach would be something like the following where COLUMN_1 is the partition criteria:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, SET COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition that I need to set those two columns to null and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn’t work because it´s a Composite-Partition.
I could use the same approach with subpartitions but then I would have to use CATS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code-review.
May somebody has another idea how to performantly solve this?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (switch partitions), so this lets us with only DML to consider.
I don't think that it's actually that bad an update with a table so heavily partitioned. You can easily split the update in 8k mini updates (each a single tiny partition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain 15k rows to be updated on average so the update would be relatively tiny.
While it still represents a very big amount of work, it should be easy to set to run in parallel, hopefully during hours where database activity is very light. Also the individual updates are easy to restart if one of them fails (rows locked?) whereas a 120M update would take such a long time to rollback in case of error.
If I were to update almost 90% of rows in table, I would check feasibility/duration of just inserting to another table of same structure (less redo, no row chaining/migration, bypass cache and so on via direct insert. drop indexes and triggers first. exclude columns to leave them null in target table), rename the tables to "swap" them, rebuild indexes and triggers, then drop the old table.
From my experience in data warehousing, plain direct insert is better than update/delete. More steps needed but it's done in less time overall. I agree, partition swap is easier said than done when you have to process most of the table and just makes it more complex for the ETL developer (logic/algorithm bound to what's in the physical layer), we haven't encountered need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between these two tablespaces (insert to 2nd drop table from 1st, vice-versa in next run, resize empty tablespace to reclaim space).

How to speed up a slow UPDATE query

I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
id uuid NOT NULL,
url character varying(1024) NOT NULL,
firstcrawled timestamp with time zone NOT NULL,
lastcrawled timestamp with time zone NOT NULL,
recipeid uuid,
html text NOT NULL,
lasterror character varying(1024),
missingings smallint,
CONSTRAINT pages_pkey PRIMARY KEY (id ),
CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
ON indexer.pages
USING btree
(missingings )
WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
ON indexer.pages
USING btree
(recipeid )
WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.
What #kgrittn posted as comment is the best answer so far. I am merely filling in details.
Before you do anything else, you should upgrade PostgreSQL to a current version, at least to the last security release of your major version. See guidelines on the project.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:
Redundant data in update statements
If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreate them later.
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACCUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - same as with dropping / recreating indexes or triggers)
Lastly, disable & enable triggers that are not needed for the update.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
UPDATE Indexer.Pages
SET LastError=NULL
;
The where clause is not needed since the NULL fields are already NULL, so it won't harm to set them to NULL again (I don't think this would affect performance significantly).
Given your number_of_rows = 500K and your table size=46G, I conclude that your average rowsize is 90KB. That is huge. Maybe you could move {unused, sparse} columns of your table to other tables?
Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.

Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero,
Is this the best approach?
Thanks,
The amount of updates to the redo and undo logs will not at all be reduced if you break up the UPDATE in multiple runs of, say 1000 records. On top of it, the total query time will be most likely be higher compared to running a single large SQL.
There's no real way to address the UNDO/REDO log issue in UPDATEs. With INSERTs and CREATE TABLEs you can use a DIRECT aka APPEND option, but I guess this doesn't easily work for you.
Depends on the percent of rows almost as much as the number. And it also depends on if the update makes the row longer than before. i.e. going from null to 200bytes in every row. This could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint contrainst, rebuild triggers, recompile packages, etc.
you can avoid a lot of logging this way.
Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.

Update VERY LARGE PostgreSQL database table efficiently

I have a very large database table in PostgresQL and a column like "copied". Every new row starts uncopied and will later be replicated to another thing by a background programm. There is an partial index on that table "btree(ID) WHERE replicated=0". The background programm does a select for at most 2000 entries (LIMIT 2000), works on them and then commits the changes in one transaction using 2000 prepared sql-commands.
Now the problem ist that I want to give the user an option to reset this replicated-value, make it all zero again.
An update table set replicated=0;
is not possible:
It takes very much time
It duplicates the size of the tabel because of MVCC
It is done in one transaction: It either fails or goes through.
I actually don't need transaction-features for this case: If the system goes down, it shall process only parts of it.
Several other problems:
Doing an
update set replicated=0 where id >10000 and id<20000
is also bad: It does a sequential scan all over the whole table which is too slow.
If it weren't doing that, it would still be slow because it would be too many seeks.
What I really need is a way of going through all rows, changing them and not being bound to a giant transaction.
Strangely, an
UPDATE table
SET replicated=0
WHERE ID in (SELECT id from table WHERE replicated= LIMIT 10000)
is also slow, although it should be a good thing: Go through the table in DISK-order...
(Note that in that case there was also an index that covered this)
(An update LIMIT like Mysql is unavailable for PostgresQL)
BTW: The real problem is more complicated and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible
It's PostgresQL 7.4 unfortunately.
The amount of rows I'm talking about is e.g. 90000000. The size of the databse can be several dozend gigabytes.
The database itself only contains 5 tables, one is a very large one.
But that is not bad design, because these embedded boxes only operate with one kind of entity, it's not an ERP-system or something like that!
Any ideas?
How about adding a new table to store this replicated value (and a primary key to link each record to the main table). Then you simply add a record for every replicated item, and delete records to remove the replicated flag. (Or maybe the other way around - a record for every non-replicated record, depending on which is the common case).
That would also simplify the case when you want to set them all back to 0, as you can just truncate the table (which zeroes the table size on disk, you don't even have to vacuum to free up the space)
If you are trying to reset the whole table, not just a few rows, it is usually faster (on extremely large datasets -- not on regular tables) to simply CREATE TABLE bar AS SELECT everything, but, copied, 0 FROM foo, and then swap the tables and drop the old one. Obviously you would need to ensure nothing gets inserted into the original table while you are doing that. You'll need to recreate that index, too.
Edit: A simple improvement in order to avoid locking the table while you copy 14 Gigabytes:
lock ;
create a new table, bar;
swap tables so that all writes go to bar;
unlock;
create table baz as select from foo;
drop foo;
create the index on baz;
lock;
insert into baz from bar;
swap tables;
unlock;
drop bar;
(let writes happen while you are doing the copy, and insert them post-factum).
While you cannot likely fix the problem of space usage (it is temporary, just until a vacuum) you can probably really speed up the process in terms of clock time. The fact that PostgreSQL uses MVCC means that you should be able to do this without any issues related to newly inserted rows. The create table as select will get around some of the performance issues, but will not allow for continued use of the table, and takes just as much space. Just ditch the index, and rebuild it, then do a vacuum.
drop index replication_flag;
update big_table set replicated=0;
create index replication_flag on big_table btree(ID) WHERE replicated=0;
vacuum full analyze big_table;
This is pseudocode. You'll need 400MB (for ints) or 800MB (for bigints) temporary file (you can compress it with zlib if it is a problem). It will need about 100 scans of a table for vacuums. But it will not bloat a table more than 1% (at most 1000000 dead rows at any time). You can also trade less scans for more table bloat.
// write all ids to temporary file in disk order
// no where clause will ensure disk order
$file = tmpfile();
for $id, $replicated in query("select id, replicated from table") {
if ( $replicated<>0 ) {
write($file,&$id,sizeof($id));
}
}
// prepare an update query
query("prepare set_replicated_0(bigint) as
update table set replicated=0 where id=?");
// reread this file, launch prepared query and every 1000000 updates commit
// and vacuum a table
rewind($file);
$counter = 0;
query("start transaction");
while read($file,&$id,sizeof($id)) {
query("execute set_replicated_0($id)");
$counter++;
if ( $counter % 1000000 == 0 ) {
query("commit");
query("vacuum table");
query("start transaction");
}
}
query("commit");
query("vacuum table");
close($file);
I guess what you need to do is
a. copy the 2000 records PK value into a temporary table with the same standard limit, etc.
b. select the same 2000 records and perform the necessary operations in the cursor as it is.
c. If successful, run a single update query against the records in the temp table. Clear the temp table and run step a again.
d. If unsuccessful, clear the temp table without running the update query.
Simple, efficient and reliable.
Regards,
KT
I think it's better to change your postgres to version 8.X. probably the cause is the low version of Postgres. Also try this query below. I hope this can help.
UPDATE table1 SET name = table2.value
FROM table2
WHERE table1.id = table2.id;