Will deleting some data from a partition impact the local indexes?

I have a partitioned table "alarms" set up as follows:
partitioned by range(version); version: 1, 2, 3 ...
each partition has a local index on version
each partition has a mix of columns as local indexes
version is a local index
there is no global index
Due to some business constraints,
I need to delete some data from each version (but not all of a partition's data).
No updates happen to old versions, only selects.
On a daily basis, I am inserting new version data.
So for this I will delete as follows:
delete /*+ full(alarms) parallel(alarms,4)*/ from alarms where version <= (number) and alarm_type = 'type1';
This will not delete the whole partition, but perhaps each month a partition will become empty.
So I have a procedure that loops over all versions, and all empty partitions are dropped by name.
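A minimal sketch of what such a cleanup procedure might look like (purely illustrative; all names are assumed, and this is not the actual code):

declare
  l_cnt number;
begin
  for p in (select partition_name
              from user_tab_partitions
             where table_name = 'ALARMS') loop
    -- cheap emptiness check: stop after the first row, if any
    execute immediate
      'select count(*) from alarms partition (' || p.partition_name || ') where rownum = 1'
      into l_cnt;
    if l_cnt = 0 then
      -- note: Oracle will not let you drop the only remaining partition
      execute immediate 'alter table alarms drop partition ' || p.partition_name;
    end if;
  end loop;
end;
/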
My questions are: until a partition is empty and dropped, can this impact performance?
And do I need to rebuild the indexes after each delete?

This may impact performance?
I'm not sure just how you mean this. If you mean "will deleting data from a table impact other concurrent users of that table", the answer is yes, although it's impossible to state what the degree of impact will be. If you mean "will deleting data from a table have a long-term impact on access to that table", my answer is that there should be very little long-term effect.
Do I need to rebuild index each delete?
Deleting data from a table is a normal activity in a database, and the indexes will be maintained properly.
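If you want to reassure yourself after a large delete, the data dictionary shows whether each local index partition is still usable; a minimal sketch, with an assumed index name:

select index_name, partition_name, status
  from user_ind_partitions
 where index_name = 'ALARMS_VERSION_IX'; -- hypothetical index name
-- STATUS should read USABLE; only UNUSABLE partitions would call for a rebuild.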
Best of luck.


Automated way of deleting millions of rows from Postgres tables

Postgres Version: PostgreSQL 10.9 (Ubuntu 10.9-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
Before I ask my question, I would like to explain why I'm looking into this. We have a history table which has more than 5 million rows and is growing every hour.
As the table grows, the select queries are becoming slower, even though we have a proper index. So ideally the first choice for us is to delete the old records which are unused.
Approach #1
We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false
This took a very long time.
Approach #2
Create a script which deletes the rows with a cursor-based approach.
This also takes a very long time.
Approach #3
Create a new unlogged table.
Create an index on the new table.
Copy the contents from the old table to the new table.
Then set the table to logged.
Rename the master table as a backup and give the new table its name.
The issue with this approach is that it requires some downtime.
On live production instances, this would result in missing data / failures.
Approach #4
On further investigation, the performant way to delete unused rows is to create a partitioned table (https://www.postgresql.org/docs/10/ddl-partitioning.html), from which we could drop an entire partition immediately.
The questions with the above approach are:
How can I create partitions on the existing table?
Will that require downtime?
How can we configure Postgres to create partitions automatically? We can't really create partitions manually every day, right?
Any other approaches are also welcome; the thing is, I really want this to be automated rather than manual, because I would extend this to multiple tables.
Please let me know your thoughts, which would be very helpful.
I would go for approach 4, table partitioning:
Create the partitions.
New data goes directly to the correct partition.
Move old data (manually / scripted) to the correct partition.
Set a cron job to create partitions for the next X days, if they don't exist already.
No downtime needed.
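A minimal sketch of what PostgreSQL 10 declarative partitioning could look like here; the table and column names are assumed from the question:

CREATE TABLE history_new (
    id           bigint,
    created_date date NOT NULL,
    is_active    boolean
) PARTITION BY RANGE (created_date);

CREATE TABLE history_2019_07 PARTITION OF history_new
    FOR VALUES FROM ('2019-07-01') TO ('2019-08-01');

-- Expiring old data then becomes a metadata operation instead of a row-by-row delete:
DROP TABLE history_2019_07;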
We tried deleting the records from the table using a simple delete from table_name where created_date > certain_date and is_active = false. This took a very long time.
Surely you mean <, not > there? So what if it took a long time? How fast do you need it to be? Did it cause problems? How much data were you trying to delete at once?
5 million rows is pretty small. I probably wouldn't use partitioning in the first place on something that size.
There is no easy and transparent way to migrate to having partitioned data if you don't already have it. The easiest way is to have a bit of downtime as you populate the new table.
If you do want to partition, your partitioning scheme would have to include is_active, not just created_date. And partitioning by day seems way too fine-grained; you could do it by month and pre-create a few years' worth up front.
Answering specifically the question below:
How can we configure Postgres to create partitions automatically? We can't really create partitions manually every day, right?
As you are using Postgres 10, you could use https://github.com/pgpartman/pg_partman for auto management of partitions.
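As a hedged sketch of what that could look like (the parent table name is assumed, and you should check the pg_partman docs for the exact signature on your version):

-- Register the table for monthly, natively partitioned children:
SELECT partman.create_parent('public.history', 'created_date', 'native', 'monthly');
-- A scheduled call keeps future partitions pre-created:
SELECT partman.run_maintenance();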

Can we delete some rows from a partition instead of looping over all records of the big table?

I'm new to SQL and the database world, and I am facing this situation:
I have a table partitioned by day: every day a partition is created and collects all rows added that day.
But now we are trying to reduce the amount of data, since the size of the DB is getting bigger, so we decided to delete some rows based on some conditions.
What we are trying to do is delete some unused data, but only from the last 2 days.
So my questions are:
Can we delete some rows from a partition? If so, does it delete data from the actual table and free some space?
For example:
delete from MyTable where condition1 and time >= (sysdate -2) ;
-- is it the same as (from a performance perspective)
delete from Mytable partition (MyTble_Partition) where condition1;
Is fragmentation an issue, or is a rebuild of the indexes needed after deleting some rows in this case?
Please correct me if I'm saying stupid things.
I will be grateful for any guidance. Thanks in advance.
Main rule: you almost never have a reason to access a partition explicitly.
Your predicates (join/where conditions) must give the database all the information it needs to target only the required partitions.
If you want to delete some data from the last 2 days, then YES, it's fine to pass the time >= predicate. Only the needed partitions will be scanned by Oracle.
You don't need to rebuild indexes. That work will be done by the DBMS.
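If you want to verify that only the needed partitions are touched, the execution plan shows the pruning in its Pstart/Pstop columns; a sketch reusing the question's placeholder names:

explain plan for
  delete from MyTable where condition1 and time >= (sysdate - 2);
select * from table(dbms_xplan.display);
-- The PARTITION RANGE step's Pstart/Pstop should cover only the last 2 days.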
Your next question is about clearing data to get more space. This is a bit tricky.
You should think of every partition of your table as an "independent table" in the DB.
In many respects, it is.
When you do a DELETE you don't get any free space back on your hard drive. You just get some "free space" inside your table, which you can use for further INSERTs.
But (attention!) you must be sure that you will really want to add records to "that old day" partition in the future.
If not, then you get no benefit from the DELETE. At all.
Also read this article to understand how to free real disk space after deleting table rows.
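One common Oracle technique for actually returning space to the tablespace after large deletes is a segment shrink; a hedged sketch, reusing the question's placeholder names (requires an ASSM tablespace and row movement):

alter table MyTable enable row movement;
alter table MyTable shrink space;
-- or for a single partition:
alter table MyTable modify partition MyTble_Partition shrink space;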

Firebird truncate table / delete all rows

I am using Firebird 2.5.1 Embedded. I have done the usual thing to empty a table with nearly 200k rows:
delete from SZAFKI
Here's the output; as you can see, it takes 16 seconds, which is, well, unacceptable.
Preparing query: delete from SZAFKI
Prepare time: 0.010s
PLAN (SZAFKI NATURAL)
Executing...
Done.
3973416 fetches, 1030917 marks, 116515 reads, 116434 writes.
0 inserts, 0 updates, 182658 deletes, 27 index, 182658 seq.
Delta memory: -19688 bytes.
SZAFKI: 182658 deletes.
182658 rows affected directly.
Total execution time: 16.729s
Script execution finished.
Firebird has no TRUNCATE keyword. As the query uses PLAN NATURAL, I tried to PLAN the query by hand, like so:
delete from szafki PLAN (SZAFKI INDEX (SZAFKI_PK))
but Firebird says "SZAFKI_PK cannot be used in the specified plan" (it is a primary key)
The question is: how do I empty the table efficiently? Dropping and recreating it is not possible.
Answer based on my comment:
A trick you could try is to use DELETE FROM SZAFKI WHERE ID > 0 (assuming the ID is 1 or higher). This will force Firebird to look up the rows using the primary key index.
My initial assumption was that this would be worse than an unindexed delete. An unindexed delete does a sequential scan over all datapages of a table and deletes rows (that is: creates a new record version that is a deleted stub record). When you use the index, rows are looked up in index order, which results in a random walk through the datapages (assuming a high level of fragmentation in the data due to the record versions left by inserts, deletes and updates). I had expected this to be slower, but probably it lets Firebird read only the datapages holding record versions relevant to the transaction instead of all datapages of the table.
Unfortunately, there is no fast way to do a massive delete on an entire (big) table with current Firebird versions. You can expect even higher delays when the "deleted content" is garbage collected (run a SELECT * on the table after the delete is committed and you will see). You can try deactivating the indexes on that table before doing the delete and see if it helps.
If you are using the table as some kind of temporary storage, I suggest you use the GTT (global temporary table) feature.
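A minimal sketch of a GTT, with assumed column names; its rows are discarded automatically at commit (or at disconnect with ON COMMIT PRESERVE ROWS), so the mass DELETE disappears entirely:

CREATE GLOBAL TEMPORARY TABLE TMP_SZAFKI (
    ID   INTEGER,
    NAME VARCHAR(20)
) ON COMMIT DELETE ROWS;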
The fastest (and only) way to get rid of all data quickly in a Firebird table is to drop and create the table again, at least in the current official version 2.5.x. There is no TRUNCATE operator on the roadmap for Firebird 3.0 (the beta is out), so most probably there will be no TRUNCATE in 3.0 either.
Also, you can use the RECREATE operator, which has the same syntax as CREATE. If the table exists, RECREATE drops it and then creates it anew. If the table doesn't exist, RECREATE just creates it.
RECREATE TABLE Table1 (
    ID INTEGER,
    NAME VARCHAR(20),
    "DATE" DATE, -- DATE is a reserved word in Firebird, so quote it (or pick another name)
    T TIME
);

SQL Server Table Partitioning, what is happening behind the scenes?

I'm working with table partitioning on an extremely large fact table in a warehouse. I have executed the script a few different ways, with and without nonclustered indexes. With the nonclustered indexes in place it appears to dramatically expand the log file, while without them it expands the log file less but takes more time to run due to rebuilding the indexes afterwards.
What I am looking for is any links or information as to what is happening behind the scenes, specifically to the log file, when you split a table partition.
I think it isn't too hard to theorize what is going on (to a certain extent). Behind the scenes each partition is given a different HoBT (heap or B-tree), which in plain language means each partition is in effect sitting in its own hidden table.
So, theorizing, the splitting of a partition (assuming data is moving) would involve:
inserting the data into the new table
removing data from the old table
The nonclustered index behaviour can be figured out as well, but the theorizing changes depending on whether there is a clustered index or not. It also matters whether the index is partition-aligned or not.
Given a bit more information on the table (clustered index or heap) we could theorize this further.
"If the partition function is used by a partitioned table and SPLIT results in partitions where both will contain data, SQL Server will move the data to the new partition. This data movement will cause transaction log growth due to inserts and deletes."
This is from an article by Microsoft on Partitioned Table and Index Strategies
So it looks like it's doing a delete from the old partition and an insert into the new partition, which could explain the growth in the t-log.
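For concreteness, a hedged sketch of the kind of split being discussed; the partition function and scheme names here are invented, not taken from the question:

-- Tell the scheme which filegroup the next partition will use,
-- then split the function's range. If both resulting partitions
-- hold data, SQL Server physically moves rows, and that movement
-- is fully logged as deletes plus inserts:
ALTER PARTITION SCHEME ps_fact NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_fact() SPLIT RANGE ('2019-01-01');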

Does MySQL use existing indexes on creating new indexes?

I have a large table with millions of records.
Table `price`
------------
id
product
site
value
The table is brand new, and there are no indexes created.
I then issued a request for new index creation with the following query:
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
This took a long, long time; when I last checked it had been running for 5000+ seconds, because of the machine.
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
Next to run, query 1:
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
Next to run, query 2:
CREATE INDEX ix_price_value_id ON price (value, id);
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
No, it won't.
Theoretically, an index on (site, product, value, id) has everything required to build an index on any subset of these fields (including the indices on (product, value, id) and (value, id)).
However, building an index from a secondary index is not supported.
First, MySQL does not support fast full index scan (that is, scanning an index in physical order rather than logical order), which makes an index access path more expensive than a table read. This is not a problem for InnoDB, since the table itself is always clustered.
Second, the record orders in these indexes are completely different, so the records need to be sorted anyway.
However, the main problem with index creation speed in MySQL is that it builds the index entries in place (just inserting the records one by one into a B-Tree) instead of using a presorted source. As @Daniel mentioned, fast index creation solves this problem. It is available as a plugin for 5.1 and comes preinstalled in 5.5.
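With fast index creation available, the DDL itself doesn't change; the engine simply builds each secondary index from one sorted scan instead of row-by-row B-Tree inserts. For example, against the price table above:

ALTER TABLE price ADD INDEX ix_price_product_value_id (product, value, id);
ALTER TABLE price ADD INDEX ix_price_value_id (value, id);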
If you're using MySQL version 5.1, and the InnoDB storage engine, you may want to use the InnoDB Plugin 1.0, which supports a new feature called Fast Index Creation. This allows the storage engine to create indexes without copying the contents of the entire table.
Overview of the InnoDB Plugin:
Starting with version 5.1, MySQL AB has promoted the idea of a “pluggable” storage engine architecture, which permits multiple storage engines to be added to MySQL. Currently, however, most users have accessed only those storage engines that are distributed by MySQL AB, and are linked into the binary (executable) releases.
Since 2001, MySQL AB has distributed the InnoDB transactional storage engine with its releases (both source and binary). Beginning with MySQL version 5.1, it is possible for users to swap out one version of InnoDB and use another.
Source: Introduction to the InnoDB Plugin
Overview of Fast Index Creation:
In MySQL versions up to 5.0, adding or dropping an index on a table with existing data can be very slow if the table has many rows. The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
Beginning with version 5.1, MySQL allows a storage engine to create or drop indexes without copying the contents of the entire table. The standard built-in InnoDB in MySQL version 5.1, however, does not take advantage of this capability. With the InnoDB Plugin, however, users can in most cases add and drop indexes much more efficiently than with prior releases.
...
Changing the clustered index requires copying the data, even with the InnoDB Plugin. However, adding or dropping a secondary index with the InnoDB Plugin is much faster, since it does not involve copying the data.
Source: Overview of Fast Index Creation
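For reference, a hedged sketch of how the 5.1-era InnoDB Plugin was typically enabled via my.cnf; the exact shared-library name varies by platform, so treat this as an assumption and check the plugin's install notes:

[mysqld]
ignore_builtin_innodb
plugin-load=innodb=ha_innodb_plugin.so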