Most efficient way to delete records from a huge table - sql

I have a table, tblcalldatastore, which receives around 4,000,000 records daily. I want to create a daily job to delete any record older than 24 hours. What is the most efficient and least time-consuming way? The query below expresses my requirement.
delete from [tblcalldatastore]
where istestcase=0
and datediff(hour,receiveddate,GETDATE())>24

The better approach is to avoid delete entirely by using partitions on your table. Instead of deleting records, drop partitions.
For example, you can create a partition for each hour. Then you can drop the entire partition for the 25th hour in the past. Or you can basically have two partitions by day and drop the older one after 24 hours.
This approach has a big performance advantage, because partition drops are not logged at the record level, saving lots of time. They also do not invoke triggers or other checks, saving more effort.
The documentation on partitioning is here.

You might not want to go down the Partitions route.
It looks like you will typically be deleting approx half the data in your table every day.
Deletes are very expensive...
A much faster way to do this is to
Select INTO a New Table (the data you want to keep)
rename (or Drop) your old Table
Then Rename your new table to the old table name.
This should work out quicker - Unless you have heaps of Indexes & FKs...
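The copy-and-swap steps above can be sketched as follows. This is only an illustration that runs on SQLite through Python's sqlite3 module so it can be tried anywhere; on SQL Server the corresponding pieces would be SELECT ... INTO and sp_rename. The helper name keep_and_swap, the sample rows and the fixed cutoff string are assumptions made for the demo.

```python
import sqlite3

def keep_and_swap(conn, table, keep_where):
    """Replace `table` with a copy that holds only rows matching `keep_where`.

    Both arguments are trusted strings (no user input). The copy carries data
    only: indexes, constraints and triggers must be recreated afterwards.
    """
    new_table = f"{table}_new"
    conn.execute("BEGIN")
    try:
        conn.execute(f"CREATE TABLE {new_table} AS SELECT * FROM {table} WHERE {keep_where}")
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"ALTER TABLE {new_table} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT above controls the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE tblcalldatastore "
    "(id INTEGER PRIMARY KEY, istestcase INTEGER, receiveddate TEXT)"
)
conn.executemany(
    "INSERT INTO tblcalldatastore (istestcase, receiveddate) VALUES (?, ?)",
    [
        (0, "2024-01-01 00:00:00"),  # old and not a test case: should go
        (0, "2024-01-02 12:00:00"),  # recent: should stay
        (1, "2024-01-01 00:00:00"),  # old test case: should stay
    ],
)

cutoff = "2024-01-02 00:00:00"  # stands in for "24 hours ago"
# Keep everything the original DELETE would NOT remove
# (assumes neither column contains NULL).
keep_and_swap(
    conn,
    "tblcalldatastore",
    f"NOT (istestcase = 0 AND receiveddate < '{cutoff}')",
)
remaining = conn.execute(
    "SELECT istestcase, receiveddate FROM tblcalldatastore "
    "ORDER BY receiveddate, istestcase"
).fetchall()
```

Whether this beats a plain DELETE depends on how many indexes and foreign keys have to be rebuilt on the new table, as noted above.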

Related

Automated way of deleting millions of rows from Postgres tables

Postgres Version: PostgreSQL 10.9 (Ubuntu 10.9-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
Before I ask my question, I would like to explain why I'm looking into this. We have a history table which has more than 5 million rows and is growing every hour.
As the table grows, the select queries are becoming slower, even though we have a proper index. So ideally the first choice for us is to delete the old records which are unused.
Approach #1
We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false
This took a very long time.
Approach #2
Create a script which would delete the rows with the cursor-based approach.
This also takes a very long time.
Approach #3
Create a new unlogged table.
Create an index on the new table.
Copy the contents from the old table to the new table.
Then set the new table to logged.
Rename the master table as a backup.
The issue with this approach is that it requires some downtime.
On live production instances, this would result in missing data or failures.
Approach #4
On further investigation, the performant way to delete unused rows is to create a partitioned table (https://www.postgresql.org/docs/10/ddl-partitioning.html), so that we could drop an entire partition immediately.
Questions with the above approach are
How can I create a partition on the existing table?
Will that require downtime?
How can we configure Postgres to create partitions automatically? We can't really create partitions manually every day, right?
Any other approaches are also welcome. The thing is, I really want this to be automated rather than manual, because I would extend this to multiple tables.
Please let me know your thoughts, which would be very helpful.
I would go for approach 4, table partitioning.
Create partitions
New data goes directly to the correct partition
Move old data (manually / scripted) to the correct partition
Set a cron job to create partitions for the next X days, if they don't exist already
No downtime needed
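As a rough sketch of what such a cron job could run, the script below only builds the DDL text for the next X days; how it gets executed (psql, a driver, etc.) is left open. It assumes a parent table named history that is already range-partitioned on a date or timestamp column, and the child naming scheme history_YYYY_MM_DD is made up for the example.

```python
from datetime import date, timedelta

def daily_partition_ddl(parent, start, days):
    """Return one CREATE TABLE statement per day, beginning at `start`.

    IF NOT EXISTS makes the job safe to re-run: partitions that were
    already created on an earlier run are simply skipped.
    """
    statements = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        next_day = day + timedelta(days=1)
        child = f"{parent}_{day:%Y_%m_%d}"  # hypothetical naming scheme
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {child} PARTITION OF {parent} "
            # Range bounds: lower bound inclusive, upper bound exclusive.
            f"FOR VALUES FROM ('{day.isoformat()}') TO ('{next_day.isoformat()}');"
        )
    return statements

# Example: print the statements for the coming week.
for statement in daily_partition_ddl("history", date.today(), 7):
    print(statement)
```

Dropping an old day is then a single DROP TABLE on the matching child table instead of a row-by-row delete.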
We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false. This took a very long time.
Surely you mean <, not > there? So what if it took a long time? How fast do you need it to be? Did it cause problems? How much data were you trying to delete at once?
5 million rows is pretty small. I probably wouldn't use partitioning in the first place on something that size.
There is no easy and transparent way to migrate to having partitioned data if you don't already have it. The easiest way is to have a bit of downtime as you populate the new table.
If you do want to partition, your partitioning scheme would have to include is_active, not just created_date. And partitioning by day seems way too fine; you could do it by month and pre-create a few years' worth up front.
Answering specifically to the below:
How can we configure Postgres to create partitions automatically? We can't really create partitions manually every day, right?
As you are using Postgres 10, you could use https://github.com/pgpartman/pg_partman for auto management of partitions.

Can we delete some rows from a partition instead of looping over all records of the big table?

I'm new to the SQL and database world, and I'm facing this situation:
I have a table partitioned by day: every day a partition is created that collects all rows added on that day.
But now we are trying to reduce the amount of data, since the DB is getting bigger, so we decided to delete some rows based on some conditions.
What we are trying to do is delete some rows of unused data from the last 2 days only.
So my question is:
Can we delete some rows from a partition? If so, does it delete data from the actual table and free some space?
Example:
delete from MyTable where condition1 and time >= (sysdate -2) ;
-- is it the same as (from a performance perspective)
delete from Mytable partition (MyTble_Partition) where condition1;
Is defragmentation or an index rebuild needed after deleting some rows in this case?
Please correct me if I'm saying stupid things.
I will be grateful for any guidance. Thanks in advance.
Main rule: you almost never have a reason to access a partition explicitly.
Your predicates (join/where conditions) must give the database all the information it needs to target only the required partitions.
If you want to delete some data from the last 2 days, then YES, it's OK to pass a time >= predicate. Only the needed partitions will be scanned by Oracle.
You don't need to rebuild indexes. The DBMS does this work.
Your next question - clearing the data to have more space. This is a bit tricky.
You must think of every partition of your table as an "independent table" in the DB.
In many respects, it is.
When you do a DELETE you don't get any free space on your hard drive. You just get some "free space" in your table. That space can be reused by further INSERTs.
But (attention!) this only helps if you will actually add records to that "old day" partition in the future.
If not, then you gain nothing from the DELETE. At all.
Also read this article to understand how to free real disk space after deleting table rows
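The "free space stays inside the table" effect is not specific to Oracle. As a small runnable analogy (SQLite through Python's sqlite3, not Oracle, so the mechanics and commands differ), the sketch below shows that a DELETE leaves the database file at the same number of pages and only moves pages to a free list, while an explicit VACUUM is what shrinks the file. The table name log and the row counts are made up for the demo.

```python
import os
import sqlite3
import tempfile

def page_stats(conn):
    """Return (total pages in the file, pages sitting unused on the free list)."""
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return pages, free

with tempfile.TemporaryDirectory() as tmp:
    # Autocommit mode: VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"), isolation_level=None)
    conn.execute("PRAGMA auto_vacuum = NONE")  # make the default explicit
    conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, payload TEXT)")

    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO log (payload) VALUES (?)",
        [("x" * 200,) for _ in range(5000)],
    )
    conn.execute("COMMIT")
    before = page_stats(conn)

    # Delete most rows: the pages are freed for reuse, the file does not shrink.
    conn.execute("DELETE FROM log WHERE id <= 4000")
    after_delete = page_stats(conn)

    # Rebuilding the file is what actually returns space to the file system.
    conn.execute("VACUUM")
    after_vacuum = page_stats(conn)
    conn.close()

print("before:", before, "after delete:", after_delete, "after vacuum:", after_vacuum)
```

In Oracle the tools for reclaiming space are different, but the idea is the same: a DELETE alone only makes room for future inserts into the same segment.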

Create a Historical Auditing Table

Currently we have an AuditLog table that holds over 11M records. Regardless of the indexes and statistics, any query referencing this table takes a long time. Most reports don't check for Audit records past a year, but we would still like to keep these records. What's the best way to handle this?
I was thinking of keeping the AuditLog table to hold all records less than or equal to a year old. Then move any records greater than a year old to an AuditLogHistory table. Maybe just running a batch file every night to move these records over and then update the indexes and statistics of the AuditLog table. Is this an okay way to complete this task? Or what other way should I be storing older records?
The records brought back from the AuditLog table hit a linked server and check in 6 different db's to see if a certain member exists in them based on a condition. I don't have access to make any changes to the linked server db's, so I can only optimize what I have, which is the AuditLog. Hitting the linked server db's uses up over 90% of the query's cost. So I'm just trying to limit what I can.
First, I find it hard to believe that you cannot optimize a query on a table with 11 million records. You should investigate the indexes that you have relative to the queries that are frequently run.
In any case, the answer to your question is "partitioning". You would partition by the date column and be sure to include this condition in all queries. That will reduce the amount of data and probably speed the processing.
The documentation is a good place to start for learning about partitioning.

Relation with DB size and performance

Is there any relation between DB size and performance in my case:
There is a table in my Oracle DB that is used for logging. It now has roughly 120 million rows and grows at a rate of 1000 rows per minute. Each row has 6-7 columns with basic string data.
It is for our client. We never read any data from there, but we might need it in case of any issues. However, it's fine if we clean up every month or so.
The actual question is: will it affect the performance of other transactional tables in the same DB? Assume the disk space is unlimited.
If 1000 rows/minute are being inserted into this table then about 43 million rows would be added per month (1000 × 60 × 24 × 30). If this table has indexes I'd say that the biggest issue will be that eventually index maintenance will become a burden on the system, so in that case I'd expect performance to be affected.
This table seems like a good candidate for partitioning. If it's partitioned on the date/time that each row is added, with each partition containing one month's worth of data, maintenance would be much simpler. The partitioning scheme can be set up so that partitions are created automatically as needed (assuming you're on Oracle 11 or higher), and then when you need to drop a month's worth of data you can just drop the partition containing that data, which is a quick operation which doesn't burden the system with a large number of DELETE operations.
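As a sketch of the monthly clean-up, the script below only builds the ALTER TABLE text for the month that has just aged out of the retention window; running it against Oracle is left to whatever scheduler or client you use. The table name app_log and the helper names are assumptions for the example. It relies on the PARTITION FOR (...) clause, which addresses a partition by a value it contains, so the system-generated names of interval partitions are not needed.

```python
from datetime import date

def month_start(today, months_back):
    """First day of the month lying `months_back` months before `today`."""
    index = today.year * 12 + (today.month - 1) - months_back
    return date(index // 12, index % 12 + 1, 1)

def drop_partition_statement(table, today, keep_months):
    """Build the statement that drops the newest month outside the window.

    With keep_months=3 and today in March, the months March, February and
    January are kept and December's partition is dropped.
    """
    expired = month_start(today, keep_months)
    return f"ALTER TABLE {table} DROP PARTITION FOR (DATE '{expired.isoformat()}')"

# Example: statement a job would run on 15 March 2024 with 3 months retention.
print(drop_partition_statement("app_log", date(2024, 3, 15), 3))
```

The statement fails if no partition exists for that month, so a real job would need to handle that case.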
Best of luck.

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 Columns and about 400 million rows. Once a week new data is inserted in this table and before the insert I need to set the values of 2 columns to null for specific partitions.
Obvious approach would be something like the following where COLUMN_1 is the partition criteria:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition in which I need to set those two columns to null, and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn't work because it's a composite-partitioned table.
I could use the same approach with subpartitions, but then I would have to use CTAS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code review.
Maybe somebody has another idea for how to solve this efficiently?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (switch partitions), so this leaves us with only DML to consider.
I don't think that it's actually that bad an update with a table so heavily partitioned. You can easily split the update in 8k mini updates (each a single tiny partition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain 15k rows to be updated on average so the update would be relatively tiny.
While it still represents a very large amount of work, it should be easy to set up to run in parallel, hopefully during hours when database activity is very light. Also, the individual updates are easy to restart if one of them fails (rows locked?), whereas a single 120M-row update would take a very long time to roll back in case of error.
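A minimal sketch of that idea: generate one UPDATE per subpartition and run them one by one, collecting the failures so that only those need to be retried. The subpartition names and the execute callback are placeholders; in practice execute would wrap your Oracle client's cursor and commit after each statement, and the list could be split across several workers to run in parallel.

```python
def subpartition_updates(table, subpartitions, columns):
    """Build one UPDATE per subpartition that sets `columns` to NULL."""
    set_clause = ", ".join(f"{column} = NULL" for column in columns)
    return [
        f"UPDATE {table} SUBPARTITION ({name}) SET {set_clause}"
        for name in subpartitions
    ]

def run_all(statements, execute):
    """Run each statement through `execute`; return the ones that failed.

    A failure (for example, locked rows) only affects that one small
    update, so the returned list can simply be passed to run_all again.
    """
    failed = []
    for statement in statements:
        try:
            execute(statement)
        except Exception:
            failed.append(statement)
    return failed

# Example with made-up subpartition names and a stand-in for the DB call.
statements = subpartition_updates(
    "BLABLA_TABLE", ["SP_0001", "SP_0002"], ["COLUMN_18", "COLUMN_19"]
)
failed = run_all(statements, execute=print)
```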
If I were to update almost 90% of the rows in a table, I would check the feasibility and duration of just inserting into another table of the same structure. A direct insert means less redo, no row chaining/migration, bypassing the cache and so on. Drop indexes and triggers first, and exclude the columns in question so they are left null in the target table. Then rename the tables to "swap" them, rebuild the indexes and triggers, and drop the old table.
From my experience in data warehousing, a plain direct insert is better than update/delete. More steps are needed, but it's done in less time overall. I agree that a partition swap is easier said than done when you have to process most of the table; it just makes things more complex for the ETL developer (logic/algorithm bound to what's in the physical layer). We haven't encountered a need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between these two tablespaces (insert into the 2nd, drop the table from the 1st, and vice versa in the next run; resize the empty tablespace to reclaim space).