Automated way of deleting millions of rows from Postgres tables

Postgres Version: PostgreSQL 10.9 (Ubuntu 10.9-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
Before I ask my question, I would like to explain why I'm looking into this. We have a history table which has more than 5 million rows and is growing every hour.
As the table grows, the select queries are becoming slower, even though we have a proper index. So ideally our first choice is to delete the old records which are unused.
Approach #1
We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false
This took a very long time.
Approach #2
Create a script which would delete the rows using a cursor-based approach.
This also takes a very long time.
Approach #3
Create a new unlogged table.
Create an index on the new table.
Copy the contents from the old table to the new table.
Set the new table to logged.
Rename the master table as a backup and give the new table its name (sketched below).
The issue with this approach is that it requires some downtime.
On live production instances, this would result in missing data and failures.
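A rough sketch of approach #3 in Postgres, with the table name and the "rows to keep" rule as assumptions; any writes hitting the old table during the copy are lost, which is where the downtime requirement comes from:

BEGIN;
CREATE UNLOGGED TABLE history_new (LIKE history INCLUDING ALL);
INSERT INTO history_new
SELECT * FROM history
WHERE is_active OR created_date >= now() - interval '90 days';  -- hypothetical retention rule
ALTER TABLE history_new SET LOGGED;
ALTER TABLE history RENAME TO history_backup;
ALTER TABLE history_new RENAME TO history;
COMMIT;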
Approach #4
On further investigation, the most performant way to delete unused rows is to partition the table (https://www.postgresql.org/docs/10/ddl-partitioning.html), so that we can drop an entire partition immediately.
The questions I have with the above approach are:
How can I create a partition on the existing table?
Will that require downtime?
How can we configure Postgres to create partitions automatically? We can't really create partitions manually every day, right?
Any other approaches are also welcome; the thing is, I really want this to be automated rather than manual, because I would extend this to multiple tables.
Please let me know your thoughts; that would be very helpful.

I would go for approach 4, table partitioning (a minimal sketch follows the list below).
Create partitions
New data goes directly to the correct partition
Move old data (manually / scripted) to the correct partition
Set up a cron job to create partitions for the next X days, if they don't exist already
No downtime needed
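A minimal sketch of declarative range partitioning in Postgres 10, assuming a hypothetical history table partitioned by day on created_date:

CREATE TABLE history (
    id           bigserial,
    created_date timestamptz NOT NULL,
    is_active    boolean NOT NULL DEFAULT true,
    payload      text
) PARTITION BY RANGE (created_date);

CREATE TABLE history_2019_08_01 PARTITION OF history
    FOR VALUES FROM ('2019-08-01') TO ('2019-08-02');

-- Dropping a day of old data is then a fast metadata-level operation:
DROP TABLE history_2019_08_01;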

We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false. This took a very long time.
Surely you mean <, not > there? So what if it took a long time? How fast do you need it to be? Did it cause problems? How much data were you trying to delete at once?
5 million rows is pretty small. I probably wouldn't use partitioning in the first place on something that size.
There is no easy and transparent way to migrate to having partitioned data if you don't already have it. The easiest way is to have a bit of downtime as you populate the new table.
If you do want to partition, your partitioning scheme would have to include is_active, not just created_date. And partitioning by day seems way too fine; you could do it by month and pre-create a few years' worth up front.

Answering specifically the part below:
How can we configure Postgres to create partition automatically, we can't really create partitions manually everyday right?
As you are using Postgres 10, you could use https://github.com/pgpartman/pg_partman for automatic management of partitions.
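As a rough sketch only (the exact arguments vary between pg_partman versions, and the table name is an assumption), the setup looks roughly like this:

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;
SELECT partman.create_parent('public.history', 'created_date', 'native', 'daily');
-- then run partman.run_maintenance() periodically (cron or the background worker)
-- to pre-create future partitions and detach/drop expired ones.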

Related

avoiding write conflicts while re-sorting a table

I have a large table that I need to re-sort periodically. I am partly basing this on a suggestion I was given to stay away from using cluster keys since I am inserting data ordered differently (by time) from how I need it clustered (by ID), and that can cause re-clustering to get a little out of control.
Since I am writing to the table on an hourly basis, I am wary of causing problems with these two processes conflicting: if I CTAS to a newly sorted temp table and then swap the table name, it seems like I am opening the door to having a write on the source table not make it to the temp table.
I figure I can trigger a flag when I am re-sorting that causes the ETL to pause writing, but that seems a bit hacky and maybe fragile.
I was considering leveraging locking and transactions, but this doesn't seem to be the right use case for that, since I don't think I'd be locking the table I am copying from while I write to a new table. Any advice on how to approach this?
I've asked some clarifying questions in the comments regarding the clustering that you are avoiding, but in regards to your sort, have you considered creating a nice 4XL warehouse and leveraging the INSERT OVERWRITE option back into itself? It'd look something like:
INSERT OVERWRITE INTO table SELECT * FROM table ORDER BY id;
Assuming that your table isn't hundreds of TB in size, this will complete rather quickly (inside an hour, I would guess), and any inserts into the table during that period will queue up and wait for it to finish.
There are some reasons to avoid the automatic reclustering, but they're basically all the same reasons why you shouldn't set up a job to re-cluster frequently. You're making the database do all the same work, but without the built-in management of it.
If your table is big enough that you are seeing performance issues with the clustering by time, and you know that the ID column is the main way that this table is filtered (in JOINs and WHERE clauses) then this is probably a good candidate for automatic clustering.
So I would recommend at least testing out a cluster key on the ID and then monitoring/comparing performance.
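If you test that route, a minimal Snowflake sketch would be the following, with the table and column names assumed from the question:

ALTER TABLE my_table CLUSTER BY (id);
-- check how well the table is clustered on id:
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(id)');
-- and remove the key again if the test doesn't pay off:
ALTER TABLE my_table DROP CLUSTERING KEY;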
To give a brief answer to the question about resorting without conflicts as written:
I might recommend using a time column to re-sort records older than a given time (probably in a separate table). While it's sorting, you may get some new records, but you will be able to use that time column to marry up those new records with the now-sorted older records.
You might consider revoking INSERT, UPDATE, DELETE privileges on the original table within the same script or procedure that performs the CTAS creating the newly sorted copy of the table. After a successful swap you can re-enable the privileges for the roles that are used to perform updates.
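In Snowflake terms that could look roughly like the sketch below, where the table and role names are assumptions:

REVOKE INSERT, UPDATE, DELETE ON TABLE my_table FROM ROLE etl_role;
CREATE TABLE my_table_sorted AS SELECT * FROM my_table ORDER BY id;
ALTER TABLE my_table_sorted SWAP WITH my_table;   -- atomic name swap
GRANT INSERT, UPDATE, DELETE ON TABLE my_table TO ROLE etl_role;
DROP TABLE my_table_sorted;                       -- now holds the old, unsorted copy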

Most efficient way to delete records from a huge table

I have a table tblcalldatastore which produces around 4,000,000 records daily. I want to create a daily job to delete any record older than 24 hours. What is the most efficient and least time-consuming way? The query below is my requirement.
delete from [tblcalldatastore]
where istestcase=0
and datediff(hour,receiveddate,GETDATE())>24
The better approach is to avoid delete entirely by using partitions on your table. Instead of deleting records, drop partitions.
For example, you can create a partition for each hour. Then you can drop the entire partition for the 25th hour in the past. Or you can basically have two partitions by day and drop the older one after 24 hours.
This approach has a big performance advantage, because partition drops are not logged at the record level, saving lots of time. They also do not invoke triggers or other checks, saving more effort.
The details are in the SQL Server documentation on partitioned tables and indexes.
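A rough T-SQL sketch of the daily variant; the partition function/scheme names are assumptions, and TRUNCATE ... WITH (PARTITIONS ...) needs SQL Server 2016+ (on older versions you would SWITCH the old partition out to a staging table and truncate that):

CREATE PARTITION FUNCTION pf_calls_daily (datetime)
    AS RANGE RIGHT FOR VALUES ('2019-08-01', '2019-08-02', '2019-08-03');
CREATE PARTITION SCHEME ps_calls_daily
    AS PARTITION pf_calls_daily ALL TO ([PRIMARY]);
-- tblcalldatastore would be created ON ps_calls_daily(receiveddate);
-- removing a whole day is then a metadata-level operation:
TRUNCATE TABLE tblcalldatastore WITH (PARTITIONS (1));
ALTER PARTITION FUNCTION pf_calls_daily() MERGE RANGE ('2019-08-01');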
You might not want to go down the Partitions route.
It looks like you will typically be deleting approx half the data in your table every day.
Deletes are very expensive...
A much faster way to do this is to:
SELECT INTO a new table (the data you want to keep)
Rename (or drop) your old table
Then rename your new table to the old table name (a sketch follows below)
This should work out quicker, unless you have heaps of indexes and FKs...
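Roughly, with the staging name and the "rows to keep" predicate as assumptions:

SELECT *
INTO   tblcalldatastore_new
FROM   tblcalldatastore
WHERE  istestcase <> 0
   OR  receiveddate >= DATEADD(hour, -24, GETDATE());   -- rows to keep

EXEC sp_rename 'tblcalldatastore', 'tblcalldatastore_old';
EXEC sp_rename 'tblcalldatastore_new', 'tblcalldatastore';
-- recreate indexes/constraints on the new table, then drop tblcalldatastore_old.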

Truncate and insert new content into table with the least amount of interruption

Twice a day, I run a heavy query and save the results (40MBs worth of rows) to a table.
I truncate this results table before inserting the new results such that it only ever has the latest query's results in it.
The problem is that while the update to the table is being written, there is technically no data and/or there is a lock. When that is the case, anyone interacting with the site could experience an interruption. I haven't experienced this yet, but I am looking to mitigate it in the future.
What is the best way to remedy this? Is it proper to write the new results to a table named results_pending, then drop the results table and rename results_pending to results?
Two methods come to mind. One is to swap partitions for the table. To be honest, I haven't done this in SQL Server, but it should work at a low level.
I would normally have all access go through a view. Then I would create the new day's data in a separate table and change the view to point to the new table. The view change is close to "atomic". Well, actually, there is a small period of time when the view might not be available.
Then, at your leisure you can drop the old version of the table.
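A minimal T-SQL sketch of the view approach, with every object name assumed; the application queries dbo.results_v and never the tables directly:

SELECT *
INTO   dbo.results_20190801
FROM   dbo.heavy_source;                  -- placeholder for the heavy twice-daily query
GO
ALTER VIEW dbo.results_v AS
    SELECT * FROM dbo.results_20190801;   -- near-atomic switch to the new data
GO
DROP TABLE dbo.results_20190731;          -- yesterday's copy, dropped at leisure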
TRUNCATE is a DDL operation, which causes problems like this. If you are using snapshot isolation with row versioning and want users to see either the old or the new data, then use a single transaction to DELETE the old records and INSERT the new data.
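Under snapshot isolation that can be as simple as the following sketch (table names assumed); readers keep seeing the old rows until the commit:

BEGIN TRANSACTION;
DELETE FROM dbo.results;
INSERT INTO dbo.results
SELECT * FROM dbo.results_pending;   -- the freshly computed result set
COMMIT TRANSACTION;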
Another option if a lot of the data doesn't actually change is to UPDATE / INSERT / DELETE only those records that need it and leave unchanged records alone.

Truncate and Insert in ClickHouse Database

I have a particular scenario where I need to truncate and batch insert into a table in the ClickHouse DBMS every 30 minutes or so. I could find no reference to a truncate option in ClickHouse.
However, I could find suggestions that we can indirectly achieve this by dropping the old table, creating a new table with the same name, and inserting data into it.
With respect to that, I have a few questions.
How is this achieved? What is the sequence of steps in this process?
What happens to other queries, such as SELECTs, while the table is being dropped and recreated?
How long does it usually take for a table to be dropped and recreated in ClickHouse?
Is there a better and cleaner way this can be achieved?
How is this achieved? What is the sequence of steps in this process?
TRUNCATE is supported. There is no need to drop and recreate the table now.
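For example (table name assumed):

TRUNCATE TABLE IF EXISTS events;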
What happens to other queries, such as SELECTs, while the table is being dropped and recreated?
That depends on which table engine you use. For the MergeTree family you get snapshot-like behavior for SELECTs.
How long does it usually take for a table to be dropped and recreated in ClickHouse?
I would assume it depends on how fast the underlying file system can handle file deletions. A large table might contain millions of data part files, which leads to slow truncation. However, in your case I wouldn't worry much.
Is there a better and cleaner way this can be achieved?
I suggest using partitions keyed on a (DateTime / 60) expression (i.e. per minute), along with a user script that constantly harvests out-of-date partitions.
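A rough sketch of that layout; the table definition and the partition value to drop are entirely assumed:

CREATE TABLE events
(
    ts      DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY intDiv(toUnixTimestamp(ts), 60)   -- one partition per minute
ORDER BY ts;

-- A periodic script can then drop whole minutes that are out of date, e.g.:
ALTER TABLE events DROP PARTITION 26113370;    -- partition value = unix timestamp / 60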

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 columns and about 400 million rows. Once a week, new data is inserted into this table, and before the insert I need to set the values of 2 columns to NULL for specific partitions.
The obvious approach would be something like the following, where COLUMN_1 is the partition criterion:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition where I need to set those two columns to NULL and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn't work because it's a composite partition.
I could use the same approach with subpartitions, but then I would have to use CTAS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code review.
Maybe somebody has another idea of how to solve this performantly?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (switching partitions), so that leaves us with only DML to consider.
I don't think an update is actually that bad with a table so heavily partitioned. You can easily split the update into 8k mini-updates (each on a single tiny subpartition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain about 15k rows to be updated on average, so each individual update would be relatively tiny.
While it still represents a very big amount of work, it should be easy to set up to run in parallel, hopefully during hours when database activity is very light. Also, the individual updates are easy to restart if one of them fails (rows locked?), whereas a 120M-row update would take a very long time to roll back in case of error.
If I were to update almost 90% of the rows in a table, I would check the feasibility/duration of just inserting into another table of the same structure: less redo, no row chaining/migration, and the cache is bypassed via direct-path insert. Drop indexes and triggers first, and exclude the two columns so they stay NULL in the target table. Then rename the tables to "swap" them, rebuild indexes and triggers, and drop the old table (sketched below).
From my experience in data warehousing, a plain direct-path insert is better than an update/delete; more steps are needed, but it is done in less time overall. I agree that a partition swap is easier said than done when you have to process most of the table, and it just makes things more complex for the ETL developer (the logic/algorithm becomes bound to what's in the physical layer); we haven't encountered a need to do partition swaps so far.
I would also isolate this table in its own pair of tablespaces and alternate storage between them: insert into the 2nd and drop the table from the 1st, vice versa in the next run, resizing the empty tablespace to reclaim space.
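A rough Oracle sketch of the insert-and-swap idea, with all names assumed and the column list shortened (the real table has 19 columns):

INSERT /*+ APPEND */ INTO blabla_table_new (column_1, column_2, column_17)  -- column_18/19 omitted so they load as NULL
SELECT column_1, column_2, column_17
FROM   blabla_table;
COMMIT;

ALTER TABLE blabla_table     RENAME TO blabla_table_old;
ALTER TABLE blabla_table_new RENAME TO blabla_table;
-- rebuild indexes and triggers on blabla_table, then drop blabla_table_old.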