Is the physical ordering of rows maintained when adding new rows to a table with a clustered index?

In IDS? In SE? SQL Server, Oracle, MySQL, and others automatically insert new rows at the appropriate location in the data file to maintain the clustering.

The way Informix handles clustered indexes is by rebuilding the table (and index) so that the data in the table is in the correct physical sequence for the index at the time the index is created. Thereafter, rows are inserted wherever seems most appropriate, which does not continue to preserve the clustered order. This has been the case from Informix-SQL 2.10 in 1986 (possibly 2.00; I no longer have a manual for that) through Informix Dynamic Server 11.70 in 2010.
The statement:
ALTER INDEX idxname TO NOT CLUSTERED;
is always very quick. The complementary statement:
ALTER INDEX idxname TO CLUSTERED;
is often a slow process, involving creating a full new version of the table and the index before dropping the old table and index.
The ISQL 1.10 manual does not have ALTER INDEX; the 2.10 manual does.
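Consequently, the clustering degrades as rows are added. To restore it, the usual Informix idiom (as I understand it) is to toggle the clustering off and back on, which forces the rebuild; idxname stands for your index:
ALTER INDEX idxname TO NOT CLUSTERED;
ALTER INDEX idxname TO CLUSTERED; -- rebuilds the table and index in index order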

I can't answer for IDS, but I can for some of the others you mentioned.
It depends on the platform: does it use pages, and does it separate data from the index tree?
Generally, the physical ordering of rows is not maintained: only the logical ordering can be.
Reason: you can't "make room" on a fixed-size page (as Bohemian suggested).
So if you extend a row (e.g. add more data to a long varchar) or insert in between (ID=3 between rows with ID IN (2,4)), then one of the following happens:
- the row is moved to a new page, leaving a pointer behind
- the row overflows (SQL Server 2005+, for example)
- the page is split
This results in logical/index fragmentation and reduced data density (per page), which is why we have index maintenance to remove it.
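To see this effect in practice, you can measure fragmentation and page density; a minimal SQL Server sketch (dbo.MyTable is a placeholder):
SELECT index_id,
       avg_fragmentation_in_percent,
       avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'SAMPLED');
High fragmentation and low page density are the usual triggers for ALTER INDEX ... REORGANIZE or REBUILD.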

Related

Am I supposed to drop and recreate indexes on tables every time I add data to it?

I'm currently using MSSQL Server; I've created a table with indexes on 4 columns. I plan on appending 1MM rows every month end. Is it customary to drop the indexes and recreate them every time you add data to the table?
Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the table index
UPDATE STATISTICS mytable; -- statistics for the whole table
I don't think it is customary, but it is not uncommon. Presumably the database is not used for other tasks during the data load; otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is sorted in [clustered index] order, that can help some.
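A sketch of the full load cycle for one nonclustered index (IX_MyIndex and dbo.MyTable are placeholders; repeat per index, and leave the clustered index alone):
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE;
-- ... bulk load the new rows here ...
ALTER INDEX IX_MyIndex ON dbo.MyTable REBUILD; -- REBUILD both re-enables and rebuilds the index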
A last note: do some testing. 1MM rows doesn't seem like that much; the time you save may be used up by recreating the indexes.

Clustering Factor and Unique Key

Clustering factor - a wonderfully simple explanation of how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in the order of
the index entries). The better (lower) the CF, the more efficient it
would be to use the index as less table blocks would need to be
accessed to retrieve the necessary data via the index.
My Index statistics:
So, here are my indexes (each over just one column) under analysis.
The index starting with PK_ is my primary key and the one starting with UI is a unique key. (Of course both hold unique values.)
Query1:
SELECT index_name,
       uniqueness,
       clustering_factor,
       num_rows,
       CEIL((clustering_factor / num_rows) * 100) AS cluster_pct
FROM all_indexes
WHERE table_name = 'MYTABLE';
Result:
INDEX_NAME           UNIQUENES CLUSTERING_FACTOR   NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST              UNIQUE             10009871   10453407          96 --> so high
UITEST01             UNIQUE               853733   10113211           9 --> very low
We can see that the PK has a very high CF while the other unique index does not.
The only logical explanation that strikes me is that the underlying data is actually stored in the order of the unique index's column.
1) Am I right in this understanding?
2) Is there any way to give the PK the lowest CF number?
3) Judging by query cost, single selects using either index are very fast. But still, the CF number is what baffles us.
The table is relatively huge (over 10M records), and it also receives real-time inserts/updates.
My database version is Oracle 11gR2, running on Exadata X2.
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if the query touches a small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index-Organized Table): a table stored as an index, and therefore ordered by the primary key (see the sketch after this list).
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in the order of your index. This option is OK if your data isn't changing and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append the rows to a new segment.
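A minimal sketch of the IOT option (the table and columns are invented for illustration, following the text-messages example above):
CREATE TABLE messages (
    phone_number VARCHAR2(20),
    msg_seq      NUMBER,
    body         VARCHAR2(400),
    CONSTRAINT messages_pk PRIMARY KEY (phone_number, msg_seq)
) ORGANIZATION INDEX;
Because the rows live in the primary key's B-tree itself, all messages for one phone number are stored together in key order.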
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
For some better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes.

Firebird truncate table / delete all rows

I am using Firebird 2.5.1 Embedded. I did the usual thing to empty a table with nearly 200k rows:
delete from SZAFKI
Here's the output; note that it takes almost 17 seconds, which is, well, unacceptable.
Preparing query: delete from SZAFKI
Prepare time: 0.010s
PLAN (SZAFKI NATURAL)
Executing...
Done.
3973416 fetches, 1030917 marks, 116515 reads, 116434 writes.
0 inserts, 0 updates, 182658 deletes, 27 index, 182658 seq.
Delta memory: -19688 bytes.
SZAFKI: 182658 deletes.
182658 rows affected directly.
Total execution time: 16.729s
Script execution finished.
Firebird has no TRUNCATE keyword. Since the query uses PLAN (SZAFKI NATURAL), I tried to specify the PLAN by hand, like so:
delete from szafki PLAN (SZAFKI INDEX (SZAFKI_PK))
but Firebird says "SZAFKI_PK cannot be used in the specified plan" (it is the primary key).
The question is: how do I empty the table efficiently? Dropping and recreating it is not possible.
Answer based on my comment
A trick you could try is to use DELETE FROM SZAFKI WHERE ID > 0 (assuming the ID is 1 or higher). This will force Firebird to look up the rows using the primary key index.
My initial assumption was that this would be worse than an unindexed delete. An unindexed delete does a sequential scan of all data pages of a table and deletes rows (that is, it creates a new record version that is a deleted stub record). When you use the index, rows are looked up in index order, which results in a random walk through the data pages (assuming a high level of fragmentation in the data due to the many record versions created by inserts, deletes and updates). I had expected this to be slower, but it probably means Firebird only has to read the data pages containing record versions relevant to the transaction instead of all data pages of the table.
Unfortunately, there is no fast way to do a massive delete on an entire (big) table with current Firebird versions. You can expect even higher delays when the "deleted content" is garbage collected (run SELECT * on the table after the delete is committed and you will see). You can try deactivating the indexes on that table before doing the delete and see if it helps.
If you are using the table as some kind of temporary storage, I suggest you use the GTT (global temporary table) feature.
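A minimal sketch of such a GTT (the columns are invented; adjust to your data):
CREATE GLOBAL TEMPORARY TABLE SZAFKI_TMP (
    ID   INTEGER NOT NULL,
    NAME VARCHAR(60)
) ON COMMIT DELETE ROWS;
With ON COMMIT DELETE ROWS the contents vanish automatically at commit, so no mass DELETE is ever needed (ON COMMIT PRESERVE ROWS keeps them for the lifetime of the connection instead).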
The fastest (and only) way to get rid of all data in a Firebird table quickly is to drop the table and create it again, at least for the current official version 2.5.x. There is no TRUNCATE operator in the roadmap for Firebird 3.0 (the beta is out), so most probably there will be no TRUNCATE in 3.0 either.
Also, you can use the RECREATE operator, which has the same syntax as CREATE. If the table exists, RECREATE drops it and then creates it anew; if the table doesn't exist, RECREATE simply creates it.
RECREATE TABLE Table1 (
    ID INTEGER,
    NAME VARCHAR(20),
    D DATE, -- DATE is a reserved word in Firebird, so the column needs another name (or double quotes)
    T TIME
);

Does MySQL use existing indexes on creating new indexes?

I have a large table with millions of records.
Table `price`
------------
id
product
site
value
The table is brand new, and there are no indexes created.
I then issued a request for new index creation with the following query:
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
This took a long, long time; when I last checked it had been running for 5000+ seconds, partly because of the machine.
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
Next to run query 1:
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
Next to run query 2:
CREATE INDEX ix_price_value_id ON price (value, id);
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
No, it won't.
Theoretically, an index on (site, product, value, id) has everything required to build an index on any subset of these fields (including the indices on (product, value, id) and (value, id)).
However, building an index from a secondary index is not supported.
First, MySQL does not support fast full index scan (that is, scanning an index in physical order rather than logical order), which makes an index access path more expensive than a table read. This is not a problem for InnoDB, since the table itself is always clustered.
Second, the record orders in these indexes are completely different so the records need to be sorted anyway.
However, the main problem with index creation speed in MySQL is that it generates the ordering on the fly (just inserting the records one by one into a B-tree) instead of using a presorted source. As @Daniel mentioned, fast index creation solves this problem. It is available as a plugin for 5.1 and comes preinstalled in 5.5.
If you're using MySQL version 5.1, and the InnoDB storage engine, you may want to use the InnoDB Plugin 1.0, which supports a new feature called Fast Index Creation. This allows the storage engine to create indexes without copying the contents of the entire table.
Overview of the InnoDB Plugin:
Starting with version 5.1, MySQL AB has promoted the idea of a “pluggable” storage engine architecture, which permits multiple storage engines to be added to MySQL. Currently, however, most users have accessed only those storage engines that are distributed by MySQL AB, and are linked into the binary (executable) releases.
Since 2001, MySQL AB has distributed the InnoDB transactional storage engine with its releases (both source and binary). Beginning with MySQL version 5.1, it is possible for users to swap out one version of InnoDB and use another.
Source: Introduction to the InnoDB Plugin
Overview of Fast Index Creation:
In MySQL versions up to 5.0, adding or dropping an index on a table with existing data can be very slow if the table has many rows. The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
Beginning with version 5.1, MySQL allows a storage engine to create or drop indexes without copying the contents of the entire table. The standard built-in InnoDB in MySQL version 5.1, however, does not take advantage of this capability. With the InnoDB Plugin, however, users can in most cases add and drop indexes much more efficiently than with prior releases.
...
Changing the clustered index requires copying the data, even with the InnoDB Plugin. However, adding or dropping a secondary index with the InnoDB Plugin is much faster, since it does not involve copying the data.
Source: Overview of Fast Index Creation
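A hedged sketch of the practical difference described above, using the asker's price table (assuming the plugin is active; the primary key statement is purely illustrative):
-- fast with the plugin: the secondary index is built without copying the table
ALTER TABLE price ADD INDEX ix_price_product_value_id (product, value, id);
-- still copies the whole table even with the plugin, because the clustered index changes
ALTER TABLE price ADD PRIMARY KEY (id);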

Slow bulk insert for table with many indexes

I am trying to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100,000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up?
(I suspect the many indexes to be the cause. If you also think so, how can I automatically drop the indexes before the operation and then recreate the same indexes afterwards?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100,000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and enable the indexes. Note that disabling them can have unwanted side effects (such as duplicate primary keys or unique key violations) which will only surface when the indexes are re-enabled.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first; conversely, when dropping, drop it last. They should all have a fill factor of 100%.
Code should be something like this:
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
...
-- BIG INSERT
-- RECREATE THE INDEXES
As noted by another answer disabling indexes will be a very good start.
4 hours per 100,000 rows
[...]
The inserts are wrapped in a transaction per 100,000 rows.
You should look at reducing that number: the server has to maintain a huge amount of state while in a transaction (so that it can be rolled back), and this (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and per network round-trip), or many?
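For instance, a single statement can carry many rows per round-trip (made-up names; this multi-row VALUES syntax needs SQL Server 2008+):
INSERT INTO dbo.Fact (id, payload)
VALUES (1, 'a'),
       (2, 'b'),
       (3, 'c'); -- one statement, one round-trip, three rows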
Disabling and then re-enabling indexes is frequently suggested in these cases. I have my doubts about this approach, though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place; otherwise rebuilding complete index trees should not be faster than some decent batch inserting (e.g. the client issuing one insert statement at a time, causing thousands of server round-trips, or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then issue one big insert-into-select statement to push all of that staging-table data into the actual target table (see the sketch after this list)
Apply SqlBulkCopy
Decrease transaction logging by choosing the bulk-logged recovery model
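A rough sketch combining the staging-heap and bulk-logged points (all names are hypothetical; the ORDER BY assumes the target's clustered index is on id):
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED; -- minimize logging during the load
CREATE TABLE dbo.Fact_Staging (id INT, payload VARCHAR(100)); -- heap: no clustered index
-- ... bulk load into dbo.Fact_Staging here, e.g. via SqlBulkCopy ...
INSERT INTO dbo.Fact (id, payload)
SELECT id, payload
FROM dbo.Fact_Staging
ORDER BY id; -- feed rows in clustered-index order to avoid node splits
ALTER DATABASE MyDb SET RECOVERY FULL; -- switch back, then take a fresh backup to restart the log chain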
You might find more detailed information in this article.