Does MySQL use existing indexes when creating new indexes?

I have a large table with millions of records.
Table `price`
------------
id
product
site
value
The table is brand new, and there are no indexes created.
I then issued a request for new index creation with the following query:
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
This took a very long time; when I last checked, it had been running for 5000+ seconds, partly because of the machine.
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
Next to run query 1:
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
Next to run query 2:
CREATE INDEX ix_price_value_id ON price (value, id);

I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
No, it won't.
Theoretically, an index on (site, product, value, id) has everything required to build an index on any subset of these fields (including the indices on (product, value, id) and (value, id)).
However, building an index from a secondary index is not supported.
First, MySQL does not support fast full index scan (that is scanning an index in physical order rather than logical), thus making an index access path more expensive than the table read. This is not a problem for InnoDB, since the table itself is always clustered.
Second, the record orders in these indexes are completely different so the records need to be sorted anyway.
However, the main problem with index creation speed in MySQL is that it builds the index order on the fly (just inserting the records one by one into a B-Tree) instead of using a presorted source. As @Daniel mentioned, fast index creation solves this problem. It is available as a plugin for 5.1 and comes preinstalled in 5.5.
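As a hedged sketch (assuming MySQL 5.5+, or 5.1 with the InnoDB Plugin, where fast index creation applies to ALTER TABLE ... ADD INDEX), the two follow-up indexes from the question could be built in a single statement so the base table is scanned only once:

```sql
-- Sketch: with fast index creation, each secondary index is built by
-- scanning the clustered index once and sorting the entries, rather than
-- inserting rows one by one into the B-Tree.
ALTER TABLE price
    ADD INDEX ix_price_product_value_id (product, value, id),
    ADD INDEX ix_price_value_id (value, id);
```

Combining both additions into one ALTER TABLE avoids reading the table twice, though each index is still sorted independently.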

If you're using MySQL version 5.1, and the InnoDB storage engine, you may want to use the InnoDB Plugin 1.0, which supports a new feature called Fast Index Creation. This allows the storage engine to create indexes without copying the contents of the entire table.
Overview of the InnoDB Plugin:
Starting with version 5.1, MySQL AB has promoted the idea of a “pluggable” storage engine architecture, which permits multiple storage engines to be added to MySQL. Currently, however, most users have accessed only those storage engines that are distributed by MySQL AB, and are linked into the binary (executable) releases.
Since 2001, MySQL AB has distributed the InnoDB transactional storage engine with its releases (both source and binary). Beginning with MySQL version 5.1, it is possible for users to swap out one version of InnoDB and use another.
Source: Introduction to the InnoDB Plugin
Overview of Fast Index Creation:
In MySQL versions up to 5.0, adding or dropping an index on a table with existing data can be very slow if the table has many rows. The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
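The table-copy mechanism described above can be sketched as the roughly equivalent manual statements (hypothetical table names; the real operation is internal and also carries over constraints this sketch omits):

```sql
-- Sketch of what CREATE INDEX effectively did in MySQL <= 5.0:
CREATE TABLE price_new LIKE price;                       -- new empty table
ALTER TABLE price_new ADD INDEX ix_price_value_id (value, id);
INSERT INTO price_new SELECT * FROM price;               -- row-by-row copy,
                                                         -- indexes updated as it goes
RENAME TABLE price TO price_old, price_new TO price;     -- atomic swap
DROP TABLE price_old;
```

Because the inserted key values arrive unsorted, each insert touches random index pages, which is the source of the slowness the quoted passage describes.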
Beginning with version 5.1, MySQL allows a storage engine to create or drop indexes without copying the contents of the entire table. The standard built-in InnoDB in MySQL version 5.1, however, does not take advantage of this capability. With the InnoDB Plugin, however, users can in most cases add and drop indexes much more efficiently than with prior releases.
...
Changing the clustered index requires copying the data, even with the InnoDB Plugin. However, adding or dropping a secondary index with the InnoDB Plugin is much faster, since it does not involve copying the data.
Source: Overview of Fast Index Creation

Related

PostgreSQL with TimescaleDB only uses a single core during index creation

We have a PostgreSQL hypertable with a few billion rows and we're trying to create a unique index on top of it like so:
CREATE UNIQUE INDEX device_data__device_id__value_type__timestamp__idx ON public.device_data(device_id, value_type, "timestamp" DESC);
We created the hypertable like this:
SELECT create_hypertable('device_data', 'timestamp');
Since we want to create the index as fast as possible, we'd like to parallelize the index creation, and followed this guide.
We tested various settings for work_mem, maintenance_work_mem, max_worker_processes, max_parallel_maintenance_workers, and max_parallel_workers. We also set the parallel_workers setting on our table: ALTER TABLE device_data SET (parallel_workers = 10);. But no matter what we do, the index creation always only uses a single core (we have 16 available), and therefore the creation takes very long.
Any idea what we might be missing here?
Our PostgreSQL version is 12.5 and the server runs Ubuntu 18.
Unfortunately, Timescale doesn't currently support parallel index creation. I would recommend filing a GitHub issue asking for it; it is a bit of a heavy lift and might not get prioritized very quickly. Another option that could be useful would be to take the transaction_per_chunk option (https://docs.timescale.com/latest/api#create_index) and allow the user to control how the indexes are created: a simple API that would create the index for all future chunks, but not on older chunks, and then allow you to call create_index(chunk_name, ht_index_name) on each of the chunks. You could then parallelize that operation in your own code. This ends up being a much simpler lift, because the transactionality of the parallel index creation is the hardest part.
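For reference, the transaction_per_chunk option mentioned above can already be used directly on CREATE INDEX (assuming a TimescaleDB version that supports it); it does not parallelize the build, but it commits chunk by chunk instead of in one long transaction:

```sql
-- Sketch: build the index one chunk at a time, each in its own transaction.
-- Still single-core, but progress is incremental and interruptible.
CREATE UNIQUE INDEX device_data__device_id__value_type__timestamp__idx
    ON public.device_data (device_id, value_type, "timestamp" DESC)
    WITH (timescaledb.transaction_per_chunk);
```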

Deleting some data from partition will impact local index?

I have a partitioned table "alarms" as follows:
partitioned by range(version); version: 1, 2, 3, ...
each partition have local index on version
each partition have a mix of columns as local indexes
version is a local index
no global index
Due to some business constraints,
I need to delete some data from each version (but not all of the partition's data).
no update will happen to old versions, only select
on a daily basis, I am inserting new version data
delete /*+ full(alarms) parallel(alarms,4)*/ from alarms where version <= (number) and alarm_type = 'type1';
This will not empty the whole partition, but after perhaps a month a given partition will become empty.
So I have a procedure that loops over all versions, and every empty partition is dropped by name.
My question is: while a partition is not yet empty,
will this impact performance?
Do I need to rebuild the indexes after each delete?
Will this impact performance?
I'm not sure just how you mean this. If you mean "will deleting data from a table impact other concurrent users of that table", the answer is yes, although it's impossible to state what the degree of impact will be. If you mean "will deleting data from a table have long-term impact on access to that table", my answer is that there should be very little long-term effect.
Do I need to rebuild the indexes after each delete?
Deleting data from a table is a normal activity in a database, and the indexes will be maintained properly.
Best of luck.
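The drop-empty-partitions procedure the question describes could be sketched like this in PL/SQL (hypothetical logic; local indexes are dropped automatically with each partition, and note that Oracle will not let you drop the last remaining partition of a range-partitioned table):

```sql
-- Sketch: drop each partition of ALARMS that no longer contains rows.
DECLARE
  n NUMBER;
BEGIN
  FOR p IN (SELECT partition_name
              FROM user_tab_partitions
             WHERE table_name = 'ALARMS') LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM alarms PARTITION (' || p.partition_name ||
      ') WHERE ROWNUM = 1'
      INTO n;                                   -- 0 or 1: cheap emptiness check
    IF n = 0 THEN
      EXECUTE IMMEDIATE
        'ALTER TABLE alarms DROP PARTITION ' || p.partition_name;
    END IF;
  END LOOP;
END;
/
```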

How to create indexes faster

I have a table of about 60 GB and I'm trying to create an index,
and it's very slow (almost a day, and still running!).
I can see that most of the time is spent on disk I/O (4 MB/s); it doesn't use much memory or CPU.
I tried running 'pragma cache_size = 10000' and 'pragma page_size = 4000'
(after I created the table), and it still doesn't help.
How can I make the CREATE INDEX run in reasonable time?
Creating an index on a database table is a one time operation and it can be expensive based on many factors ranging from how many fields and of what type are included in the index, the size of the data table that is to be indexed, the hardware of the machine the database is running on, and possibly even more.
To give a reasonable answer on speeding things up, we would need to know the schema of the table, the definition of the index you are creating, whether you are reasonably sure (if your index enforces uniqueness) that the data is actually unique, the hardware specs of your server, your disk speeds, how much space is available on the disks, whether you are using a RAID array and at what level, and how much RAM you have and what its utilization is, etc.
Now all that said, this might be faster but I have not tested it.
make a structurally duplicate table of the table you wish to index.
Add the index to the new empty table.
copy the data from the old table to the new table in chunks.
drop the old table.
My theory is that it will be less expensive to index the data as it is added than to dig through the data that is already there and add the index after the fact.
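The suggested copy-in-chunks approach might look like this in SQLite (hypothetical table/column names; note that CREATE TABLE ... AS SELECT copies column names and types but not primary keys or other constraints, so those would need to be recreated by hand):

```sql
-- Sketch: create an indexed empty copy, then move rows over in id ranges
-- so each transaction stays small.
CREATE TABLE big_new AS SELECT * FROM big WHERE 0;   -- structure only, no rows
CREATE INDEX ix_big_new_col ON big_new(col);

BEGIN;
INSERT INTO big_new SELECT * FROM big WHERE id BETWEEN 1 AND 1000000;
COMMIT;
-- ...repeat for each subsequent id range...

DROP TABLE big;
ALTER TABLE big_new RENAME TO big;
```

Whether this beats a plain CREATE INDEX is untested (as the answer itself notes); the pragmas from the question must also be spelled cache_size and page_size to have any effect.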
When you create the table, you should create the index at the same time. PS: make sure the index is properly designed, and you should not need to create the index at runtime.

Is the physical ordering of rows maintained when adding new rows to a table with a clustered index?

In IDS? ... SE? ... SQL Server, Oracle, MySQL and others automatically insert new rows into the appropriate location in the datafile to maintain the clustering.
The way Informix handles clustered indexes is by rebuilding the table (and index) so that the data in the table is in the correct physical sequence for the index at the time when the index is created. Thereafter, rows are inserted wherever seems most appropriate, which does not continue to preserve the clustered order. This has been the case since Informix-SQL 2.10 from 1986 (possibly 2.00; I don't have a manual for that version) through Informix Dynamic Server 11.70 in 2010.
The statement:
ALTER INDEX idxname TO NOT CLUSTERED;
is always very quick. The complementary statement:
ALTER INDEX idxname TO CLUSTERED;
is often a slow process, involving creating a full new version of the table and the index before dropping the old table and index.
The ISQL 1.10 manual does not have ALTER INDEX; the 2.10 manual does have ALTER INDEX.
I can't answer for IDS, but I can for some of the others you mentioned.
It depends on the platforms: does it use pages and does it separate data from index tree?
Generally, physical ordering of rows is not maintained: only logical ordering can be
Reason: you can't "make room" on a fixed size page (as Bohemian suggested)
So if you extend a row (e.g. add more data to a long varchar) or insert in between (ID=3 between rows with id IN (2,4)), then one of the following happens:
row is taken out to a new page with pointers
row overflows (SQL Server 2005+ for example)
page is split
This results in logical/index fragmentation and reduced data density (per page), which is why we have index maintenance to remove it.
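The "index maintenance" mentioned above is, for example in SQL Server, one of these two statements (hypothetical index and table names):

```sql
-- Sketch: two ways to remove fragmentation from a nonclustered index.
ALTER INDEX ix_example ON dbo.example REORGANIZE;  -- lightweight, always online,
                                                   -- compacts leaf pages in place
ALTER INDEX ix_example ON dbo.example REBUILD;     -- full rebuild; restores both
                                                   -- page order and data density
```

REORGANIZE is usually preferred for light fragmentation, REBUILD for heavy fragmentation.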

SQL Server Table Partitioning, what is happening behind the scenes?

I'm working with table partitioning on an extremely large fact table in a warehouse. I have executed the script a few different ways, with and without nonclustered indexes. With the indexes in place, it appears to dramatically expand the log file; without them, the log file grows less, but the run takes more time due to rebuilding the indexes afterwards.
What I am looking for is any links or information as to what is happening behind the scenes, specifically to the log file, when you split a table partition.
I think it isn't too hard to theorize what is going on (to a certain extent). Behind the scenes each partition is given a different HoBT, which in normal language means each partition is in effect sitting on its own hidden table.
So theorizing the splitting of a partition (assuming data is moving) would involve:
inserting the data into the new table
removing data from the old table
The NC index can be figured out, but depending on whether there is a clustered index or not, the theorizing will alter. It also matters whether the index is partition aligned or not.
Given a bit more information on the table (CL or Heap) we could theorize this further
If the partition function is used by a partitioned table and SPLIT results in partitions where both will contain data, SQL Server will move the data to the new partition. This data movement will cause transaction log growth due to inserts and deletes.
This is from an article by Microsoft on Partitioned Table and Index Strategies
So it looks like it is doing a delete from the old partition and an insert into the new partition, which could explain the growth in the transaction log.
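A split of a populated range is the following pair of statements (hypothetical function, scheme, and boundary names); the data movement described above happens inside the SPLIT, and it is fully logged:

```sql
-- Sketch: designate the filegroup for the new partition, then split an
-- existing range. If rows fall on the new side of the boundary, SQL Server
-- deletes them from the old partition and inserts them into the new one.
ALTER PARTITION SCHEME ps_fact NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_fact() SPLIT RANGE ('2020-01-01');
```

Splitting only empty ranges (i.e. adding boundaries ahead of the data) avoids the movement and keeps the operation metadata-only.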