I have a scenario where I have a central server and a node. Both server and node are capable of running PostgreSQL but the storage space on the node is limited. The node collects data at a high speed and writes the data to its local DB.
The server needs to replicate the data from the node. I plan on accomplishing this with Slony-I or Bucardo.
The node needs to be able to delete all records from its tables at a set interval in order to minimize disk space used. Should I use pgAgent with a job consisting of a script like
DELETE FROM tablex, tabley, tablez;
where the actual batch file to run the script would be something like
#echo off
C:\Progra~1\PostgreSQL\9.1\bin\psql -d database -h localhost -p 5432 -U postgres -f C:\deleteFrom.sql
?
I'm just looking for opinions if this is the best way to accomplish this task or if anyone knows of a more efficient way to pull data from a remote DB and clear that remote DB to save space on the remote node. Thanks for your time.
The most efficient command for you is the TRUNCATE command.
With TRUNCATE, you can chain up tables, like your example:
TRUNCATE tablex, tabley, tablez;
Here's the description from the postgres docs:
TRUNCATE quickly removes all rows from a set of tables. It has the same effect as an unqualified DELETE on each table, but since it does not actually scan the tables it is faster. Furthermore, it reclaims disk space immediately, rather than requiring a subsequent VACUUM operation. This is most useful on large tables.
You may also add CASCADE as a parameter:
CASCADE Automatically truncate all tables that have foreign-key references to any of the named tables, or to any tables added to the group due to CASCADE.
The two best options, depending on your exact needs and workflow, would be truncate, as #Bohemian suggested, or to create a new table, rename, then drop.
We use something much like the latter create/rename/drop method in one of our major projects. This has an advantage where you need to be able to delete some data, but not all data, from a table very quickly. The basic workflow is:
Create a new table with a schema identical to the old one
CREATE new_table LIKE ...
In a transaction, rename the old and new tables simultaneously:
BEGIN;
RENAME table TO old_table;
RENAME new_table TO table;
COMMIT;
[Optional] Now you can do stuff with the old table, while the new table is happily accepting new inserts. You can dump the data to your centralized server, run queries on it, or whatever.
Delete the old table
DROP old_table;
This is an especially useful strategy when you want to keep, say, 7 days of data around, and only discard the 8th day's data all at once. Doing a DELETE in this case can be very slow. By storing the data in partitions (one for each day), it is easy to drop an entire day's data at once.
Related
We have a Postgres database in a test system which we want to 'refresh' with the data from the production system. However, there are some tables with test configuration data that I want to preserve in the test database. Note that the tables I want to preserve are referred to with foreign key constraints in other tables that are not preserved.
To refresh the test database, we usually rename it to '..._old' and then re-create the database from a dump of the production data.
Now there's a few ways to try to preserve the test configuration data, but I'm wondering if anyone has any brilliant ideas that are better/faster. Hoping that we can script this somehow to make it easy each time we do this.
A straight pg_dump/pg_restore won't work, because it will only INSERT not UPDATE the matching records. Or am I missing something?
I had thought about doing it by:
Renaming the tables involved with '..._test'
Using pg_dump to dump just those renamed tables to a file
Re-create the 'refreshed' database
Restore the renamed tables from file into the new database
Perform an UPDATE table_a SET (......) = (SELECT * FROM table_a_test) to overwrite refreshed data with preserved test data
Note that the number of records in the production data and the test data may not be the same.
The content of these tables is not huge, so I had also thought about generating UPDATE scripts for all the records within the preserved data.
Can anyone think of a better way to do this?
We are working on a data warehouse using IBM DB2 and we wanted to load data by partition exchange. That means we prepare a temporary table with the data we want to load into the target table and then use that entire table as a data partition in the target table. If there was previous data we just discard the old partition.
Basically you just do "ALTER TABLE target_table ATTACH PARTITION pname [starting and ending clauses] FROM temp_table".
It works wonderfully, but only for one operation at a time. If we do multiple loads in parallel or try to attach multiple partitions to the same table it's raining deadlock errors from the database.
From what I understand, the problem isn't necessarily with parallel access to the target table itself (locking it changes nothing), but accesses to system catalog tables in the background.
I have combed through the DB2 documentation but the only reference to the topic of concurrent DDL statements I found at all was to avoid doing them. The answer to this question, can't be to simply not attempt it?
Does anyone know a way to deal with this problem?
I tried to have a global, single synchronization table to lock if you want to attach any partitions, but it didn't help either. Either I'm missing something (implicit commits somewhere?) or some of the data catalog updates even happen asynchronously, which makes the whole problem much worse. If that is the case, is there are any chance at all to query if the attach is safe to perform at any given moment?
I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.
In PGSQL there is a wide table with ID (non-unique) and few columns. Every few seconds Python script takes a batch of 10k records changed in the PGSQL and uploads them to MDB.
The process of upload to MDB is as follows:
Create staging table in MDB
Use COPY command to upload 10k records into the staging table.
DELETE from destination table all IDs that are in staging table.
INSERT to the destination table all rows from staging table.
So, it is basically a DELETE & INSERT. I cannot use MERGE statement, because I do not have a PK - one ID can have multiple values in the destination. So I need to do a delete and full insert for all IDs currently synced.
Now to the problem: the DELETE is slow.
When I do a DELETE on a destination table, deleting 10k records in table of 25M rows, it will take 500ms.
However! If I run simple SELECT * FROM destination WHERE id = 1 and THEN do a DELETE, it takes 2ms.
I think that it has something to do with automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this problem of "pre-heating" by doing the lookup myself and it works - but only for the first DELETE after pre-heat.
Once I do DELETE and INSERT, the next DELETE gets again slow. And doing the pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500ms.
Is there any way on how to sync data to MDB without breaking auxiliary indices already built? Or make the DELETE faster without pre-heat? Or should I use some different technique to sync data into MDB without PK (does MERGE has the same problem?).
Thanks!
What happens when I runned:
zcat /mnt/Postgres/restoreFile.gz | psql my_db
on the working database and after doing ALTER TABLE and other standard things there were problems with duplicated keys. When I stopped it and tried to insert into database then I got duplicates key error because of sequences and constraints. Seems like all data is in but what about the sequences. What really happend with that database?
A normal Postgres backup consists of table design (like create table) and data (like insert) statements. If you run it twice, most design statements will fail. The insert statements would succeed in so far as the data definition allows for duplicate rows.
So restoring a database to a production server would typically result in a lot of duplicate rows in tables without a primary key. Some design changes made after the backup (like changing the owner of a table) may be undone.
I want to create a temp table so as to be able to join it to a few tables because joining those tables with the content of the proposed temporary table takes a lot of time (fetching the content of the temporary table is time consuming.Repeating it over and over takes more and more time). I am dropping the temporary table when my needs are accomplished.
I want to know if these temporary tables would be visible over other client session(my requirement is to make them visible only for current client session). I am using postgresql. It would be great if you could suggest better alternatives to the solution I am thinking of.
PostgreSQL then is the database for you. Temporary tables done better than the standard. From the docs,
Although the syntax of CREATE TEMPORARY TABLE resembles that of the SQL standard, the effect is not the same. In the standard, temporary tables are defined just once and automatically exist (starting with empty contents) in every session that needs them. PostgreSQL instead requires each session to issue its own CREATE TEMPORARY TABLE command for each temporary table to be used. This allows different sessions to use the same temporary table name for different purposes, whereas the standard's approach constrains all instances of a given temporary table name to have the same table structure.
Pleas read the documentation.
Temporary tables are only visible in the current session and are automatically dropped when the database session ends.
If you specify ON COMMIT, the temporary table will automatically be dropped at the end of the current transaction.
If you need good table statistics on a temporary table, you have to call ANALYZE explicitly, as these statistics are not collected automatically.
By default , temporary tables are visible for current session only and temporary tables are dropped automatically after commit or closing that transaction, so you don't need to drop explicitly.
The auto vacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables. For this reason, appropriate vacuum and analyze operations should be performed via session SQL commands. For example, if a temporary table is going to be used in complex queries, it is wise to run ANALYZE on the temporary table after it is populated.
Optionally, GLOBAL or LOCAL can be written before TEMPORARY or TEMP. This presently makes no difference in PostgreSQL and is deprecated