Indexing Speed in SQL

When creating indexes for an SQL table, if I had an index on 2 columns in the table and I changed the index to be on 4 columns, what would be a reasonable increase to expect in the time taken to save, say, 1 million rows?
I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc., but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQL Server 2005.
EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit
At the moment there is one Unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique. We are not doing any searching on these four columns together. I can however normalise this table and make it into three tables:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
Unique Index on ParentPriceID and DependantPriceID
and
ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)
Unique Index on CostDependencyID and DistributionID
and
ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit
Unique Index on ExtensionID and LocationID
I am trying to work out if normalising this table and thus reducing the number of columns in the indexes will mean speed improvements when adding a large number of rows (i.e. 1 million).
Thanks, Dane.

With all the new info available, I'd like to suggest the following:
1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few distinct values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we may only see a few new stores a day), it would be profitable to spin these off into separate lookup table(s) mapping GUID -> LocalId (an INT or other small type), and to use the LocalId in the main table.
==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.
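For illustration, a minimal sketch of such a lookup table for LocationID; the staging table name is an assumption, not from the question:

-- Lookup table: maps each 16-byte LocationID GUID to a 4-byte surrogate.
CREATE TABLE LocationLookup (
    LocalLocationID INT IDENTITY(1,1) PRIMARY KEY,
    LocationID UNIQUEIDENTIFIER NOT NULL UNIQUE
);

-- Before (or during) each load, register any GUIDs not seen yet.
-- 'StagingCostDependency' is a hypothetical staging table.
INSERT INTO LocationLookup (LocationID)
SELECT DISTINCT s.LocationID
FROM StagingCostDependency s
WHERE NOT EXISTS (SELECT 1 FROM LocationLookup l WHERE l.LocationID = s.LocationID);

The main table then stores LocalLocationID instead of the GUID, shrinking every row and every index entry that references the column.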
2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the one based on the 4-column unique composite key. This would avoid replicating that much data in a separate non-clustered index, and, as counterintuitive as it seems, it would save time for the initial load and for new inserts. The trick would be to use a relatively low fill factor so that node splitting and rebalancing etc. would be infrequent. BTW, if we make the main record narrower with the use of local IDs, we can more readily afford "wasting" space in the fill factor, and more new records will fit in this space before node rebalancing is required.
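A sketch of point 2, assuming the table keeps its question-given names and that the PK on CostDependancyID is declared NONCLUSTERED so the composite key can take the clustered slot:

-- Clustered unique index on the 4-column composite key; the low fill factor
-- leaves ~30% free space per page to absorb out-of-order GUID inserts.
CREATE UNIQUE CLUSTERED INDEX UX_CostDependency_Composite
ON CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID)
WITH (FILLFACTOR = 70);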
3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number expected daily/weekly/whenever updates are scheduled. These two parameters could confirm the validity of the approach suggested above, as well as provide hints as to the possibility of dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to continue serving searches while new data is inserted.
4) Other considerations, such as SQL partitioning and storage architecture, can also be put to work to improve load and/or retrieval performance.

It depends pretty much on whether the wider index forms a covering index for your queries (and, to a lesser extent, on the ratio of reads to writes on that table). I suggest you post your execution plan(s) for the query workload you are trying to improve.

I'm a bit confused over your goals. The (post-edit) question reads that you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound primary key against a "normalized" set of three tables of three or four columns each, and each of the three with a two-column compound key. Is this your issue?
My first question is, what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data, does it do so here?
Inserting 1M rows into a four-column-PK table will take more time than into a two-column-PK table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-PK tables will be slower than inserting into the four-column one. Testing is called for.
One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored in, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the table into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
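A minimal sketch of the drop-load-recreate pattern; the table and index names are illustrative (taken from the question), and the file path and bulk-load method are placeholders:

-- 1. Drop the index so the load doesn't maintain it row by row.
DROP INDEX UX_CostDependency_Composite ON CostDependency;

-- 2. Bulk-load the data, ideally pre-sorted in index order.
BULK INSERT CostDependency FROM 'C:\load\costdependency.dat';  -- or bcp/SSIS

-- 3. Rebuild the index in one pass over the loaded data.
CREATE UNIQUE INDEX UX_CostDependency_Composite
ON CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID);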
By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.

The query optimizer will look at the index and determine if it can use the leading column. If the first column is not in the query, then the index won't be used, period. If the index can be used, it will then check whether the second column can be used. If your query contains 'where A=? and C=?' and your index is on A,B,C,D, then only the 'A' column will be used in the query plan.
Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where a=? and b=? and c=?', then column 'D' will be returned from the index, saving you a bit of I/O by not having to go to the data page.
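A short sketch of that covering behavior; 'MyTable' and the index names are illustrative:

-- Index keyed on the search columns with D carried along in the key:
CREATE INDEX IX_MyTable_ABCD ON MyTable (A, B, C, D);

-- On SQL Server 2005 and later, D can instead be kept out of the key
-- but still stored at the leaf level, so the query remains covered:
CREATE INDEX IX_MyTable_ABC_incD ON MyTable (A, B, C) INCLUDE (D);

With either index, 'select D from MyTable where A=? and B=? and C=?' can be answered entirely from the index pages.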

Related

SQL Server - can GUID be a good choice as part of a clustered index?

I have a large domain set of tables in a database - over 100 tables. Every single one uses a uniqueidentifier as a PK.
I'm realizing now that my mistake is that these are also, by default, the clustered index.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
... other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on (UserId, Id)? That way the data is physically stored sorted by UserId.
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts happen "in the middle" of the table. And such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page, so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
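A minimal sketch combining these suggestions; the names are illustrative, the UserId type follows the answer's preference for ints, and note that NEWSEQUENTIALID() is only allowed as a column default:

-- Narrow, ever-increasing clustered PK; the GUID stays unique but non-clustered.
CREATE TABLE Orders (
    OrderId   INT IDENTITY(1,1) PRIMARY KEY,    -- clustered by default, inserts at the end
    OrderGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Orders_OrderGuid DEFAULT NEWSEQUENTIALID()
        CONSTRAINT UQ_Orders_OrderGuid UNIQUE,  -- still addressable by GUID externally
    UserId    INT NOT NULL,
    OrderDate DATETIME NOT NULL
);

-- Serves "top 10 orders for user X sorted by OrderDate":
CREATE INDEX IX_Orders_UserId_OrderDate ON Orders (UserId, OrderDate DESC);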
A clustered index is for when you want to retrieve rows for a range of values of a given column. As the data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent as a primary key, can be positively detrimental to performance if clustered on, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on a GUID.

What are the disadvantages to Indexes in database tables?

Is there any reason I shouldn't create an index for every one of my database tables, as a means of increasing performance? It would seem there must be some reason(s) else all tables would automatically have one by default.
I use MS SQL Server 2016.
One index on a table is not a big deal. You automatically have an index on columns (or combinations of columns) that are primary keys or declared as unique.
There is some overhead to an index. The index itself occupies space on disk and memory (when used). So, if space or memory are issues then too many indexes could be a problem. When data is inserted/updated/deleted, then the index needs to be maintained as well as the original data. This slows down updates and locks the tables (or parts of the tables), which can affect query processing.
A small number of indexes on each table is reasonable. These should be designed with the typical query load in mind. If you index every column in every table, then data modifications will slow down. If your data is static, then this is not an issue. However, eating up all the memory with indexes could be an issue.
Advantage of having an index
Read speed: Faster SELECT when that column is in WHERE clause
Disadvantages of having an index
Space: Additional disk/memory space needed
Write speed: Slower INSERT / UPDATE / DELETE
As a minimum I would normally recommend having at least one index per table; this will be created automatically on your table's primary key, for example an IDENTITY column. Foreign keys also normally benefit from an index, but these need to be created manually (see the sketch below). Other columns that are frequently included in WHERE clauses should be indexed, especially if they contain lots of unique values. The benefit of indexing a low-cardinality column such as gender, which has only two values, is debatable.
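A minimal sketch of that manual foreign-key index; the table and column names are illustrative:

-- The PRIMARY KEY constraints create indexes automatically.
CREATE TABLE Customer (
    CustomerID INT IDENTITY(1,1) PRIMARY KEY
);
CREATE TABLE Invoice (
    InvoiceID  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customer (CustomerID)  -- FK: no index by default
);

-- The foreign key column is indexed by hand for joins and lookups.
CREATE INDEX IX_Invoice_CustomerID ON Invoice (CustomerID);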
Most of the tables in my databases have between 1 and 4 indexes, depending on the data in the table and how this data is retrieved.

Clustering Factor and Unique Key

Clustering factor - an awesome, simple explanation of how it is calculated:
Basically, the CF is calculated by performing a full index scan and looking at the rowid of each index entry. If the table block being referenced differs from that of the previous index entry, the CF is incremented. If the table block being referenced is the same as the previous index entry, the CF is not incremented. So the CF gives an indication of how well ordered the data in the table is in relation to the index entries (which are always sorted and stored in the order of the index entries). The better (lower) the CF, the more efficient it would be to use the index, as fewer table blocks would need to be accessed to retrieve the necessary data via the index.
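A small worked illustration of that rule (my numbers, not from the quoted answer): suppose an index has six entries that, read in index order, point to table blocks B1, B1, B1, B2, B2, B3. The referenced block changes only twice after the first entry, so the CF comes out near 3, close to the number of blocks (good). If the same six entries instead pointed to B1, B2, B3, B1, B2, B3, the CF would be 6, equal to the number of rows (the worst case).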
My Index statistics:
So, here are my indexes (each over just one column) under analysis.
The index starting with PK_ is my primary key and UI is a unique key. (Of course, both hold unique values.)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME           UNIQUENES CLUSTERING_FACTOR   NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST              UNIQUE             10009871   10453407          96  --> very high
UITEST01             UNIQUE               853733   10113211           9  --> very low
We can see that the PK has a very high CF while the other unique index does not.
The only logical explanation that strikes me is that the underlying data is actually stored in the order of the unique index's column.
1) Am I right with this understanding?
2) Is there any way to give the PK the lowest CF number?
3) Looking at the query cost when using either index, single selects are very fast. But still, the CF number is what baffles us.
The table is relatively huge, over 10M records, and it also receives real-time inserts/updates.
My database version is Oracle 11gR2, on Exadata X2.
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has a small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index Organized Table) - it is a table stored as an index, so it is ordered by the primary key (a minimal sketch follows this list).
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in order of your index. This option is ok if your data isn't changing, and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append it to a new segment.
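The IOT option from point 2 above, as a minimal sketch with illustrative names:

-- Rows live in the B-tree itself, physically ordered by the primary key,
-- so PK-ordered access never touches a separate heap.
CREATE TABLE mytable_iot (
    id      NUMBER PRIMARY KEY,
    payload VARCHAR2(100)
)
ORGANIZATION INDEX;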
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
To read some better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes

PK Index Fragmentation on tables with nvarchar primary keys

Due to some database problems we've been having recently, I've been looking at the state of our indexes and trying to reduce the fragmentation.
The app has several tables of names such as BoysNames and GirlsNames which we use to set attributes on User objects we created. The obvious attribute here would be Gender. These tables can have anywhere from a few hundred to 10,000 rows.
These tables are structured like this:
Name - nvarchar(50) - PK & Clustered Index
DateCreated - datetime
When I tell SQL Server to reorganize or rebuild the indexes on all my tables, most tables' fragmentation goes down to 0%, but some of these Name tables are 50% fragmented straight away.
I only access these tables in 2 places:
The first is when I select every name from the table and store it in memory to use against new users coming into the system, so I can do something like this: if (boysNames.Contains(user.Name)) { user.Gender = "M"; }. This happens quite often.
The second is when I'm adding new names to the list: I check for the existence of a name, and if it doesn't exist, I add it. This happens rarely.
So what I need to know is:
Is this high level of fragmentation going to cause me problems? How can I reduce the index fragmentation to 0% when it's back at 50% straight after a reorganize/rebuild?
Should I have used an int as the primary key and put an index on Name, or was nvarchar the right choice for primary key?
If the index pages are resident in memory, which is likely with that few rows, fragmentation does not matter. You can benchmark that with a count(*) query. Starting with the 2nd execution, you should see in-memory speeds. If you now compare the results for a 100% and a 0% fragmented table, you should see no difference.
I don't think you have a problem. If you insist, you can set a fill factor lower than 100 so that there is room for new rows when you insert rows at random places. Start with 90 and lower it in increments of 5 until you are happy with the rate fragmentation develops.
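A sketch of that fill-factor suggestion; the table and index names are illustrative:

-- Leave 10% free space on each leaf page so new names inserted at random
-- positions have room to land without splitting pages immediately.
ALTER INDEX PK_BoysNames ON BoysNames
REBUILD WITH (FILLFACTOR = 90);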
Clustering on an IDENTITY field would remove the fragmentation for the clustered index, but you'd probably need an index on Name which again fragments. If you do not need any index at all, make this a heap table and be done with it. I recommend very much against this, though.

Many-to-Many Relationship Against a Single, Large Table

I have a geometric diagram that consists of 5,000 cells, each of which is an arbitrary polygon. My application will need to save many such diagrams.
I have determined that I need to use a database to make indexed queries against this map. Loading all the map data is far too inefficient for quick responses to simple queries.
I've added the cell data to the database. It has a fairly simple structure:
CREATE TABLE map_cell (
map_id INT NOT NULL ,
cell_index INT NOT NULL ,
...
PRIMARY KEY (map_id, cell_index)
)
5,000 rows per map is quite a few, but queries should remain efficient into the millions of rows because the main join indexes can be clustered. If it gets too unwieldy, it can be partitioned on map_id bounds. Despite the large number of rows per map, this table would be quite scalable.
The problem comes with storing the data that describes which cells neighbor each other. The cell-neighbor relationship is a many-to-many relationship against the same table. There is also a very large number of such relationships per map. A normalized table would probably look something like this:
CREATE TABLE map_cell_neighbors (
id INT NOT NULL AUTO_INCREMENT ,
map_id INT NOT NULL ,
cell_index INT NOT NULL ,
neighbor_index INT ,
...
INDEX IX_neighbors (map_id, cell_index)
)
This table requires a surrogate key that will never be used in a join, ever. Also, this table includes duplicate entries: if cell 0 is a neighbor with cell 1, then cell 1 is always a neighbor of cell 0. I can eliminate these entries, at the cost of some extra index space:
CREATE TABLE map_cell_neighbors (
id INT NOT NULL AUTO_INCREMENT ,
map_id INT NOT NULL ,
neighbor1 INT NOT NULL ,
neighbor2 INT NOT NULL ,
...
INDEX IX_neighbor1 (map_id, neighbor1),
INDEX IX_neighbor2 (map_id, neighbor2)
)
I'm not sure which one would be considered more "normalized", since option 1 includes duplicate entries (including duplicating any properties the relationship has), and option 2 is some pretty weird database design that just doesn't feel normalized. Neither option is very space efficient. For 10 maps, option 1 used 300,000 rows taking up 12M of file space. Option 2 was 150,000 rows taking up 8M of file space. On both tables, the indexes are taking up more space than the data, considering the data should be about 20 bytes per row, but it's actually taking 40-50 bytes on disk.
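One hedged refinement of option 2, assuming a database that actually enforces CHECK constraints (e.g. MySQL 8.0.16+): store each pair once in canonical order and query both sides of the pair:

CREATE TABLE map_cell_neighbors (
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    map_id    INT NOT NULL,
    neighbor1 INT NOT NULL,
    neighbor2 INT NOT NULL,
    CHECK (neighbor1 < neighbor2),          -- canonical order: (a,b) never also (b,a)
    INDEX IX_neighbor1 (map_id, neighbor1),
    INDEX IX_neighbor2 (map_id, neighbor2)
);

-- All neighbors of one cell require looking at both columns:
SELECT neighbor2 AS neighbor FROM map_cell_neighbors
WHERE map_id = ? AND neighbor1 = ?
UNION ALL
SELECT neighbor1 FROM map_cell_neighbors
WHERE map_id = ? AND neighbor2 = ?;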
The third option wouldn't be normalized at all, but would be incredibly space- and row-efficient. It involves putting a VARBINARY field in map_cell, and storing a binary-packed list of neighbors in the cell table itself. This would take 24-36 bytes per cell, rather than 40-50 bytes per relationship. It would also reduce the overall number of rows, and the queries against the cell table would be very fast due to the clustered primary key. However, performing a join against this data would be impossible. Any recursive queries would have to be done one step at a time. Also, this is just really ugly database design.
Unfortunately, I need my application to scale well and not hit SQL bottlenecks with just 50 maps. Unless I can think of something else, the latter option might be the only one that really works. Before I committed such a vile idea to code, I wanted to make sure I was looking clearly at all the options. There may be another design pattern I'm not thinking of, or maybe the problems I'm foreseeing aren't as bad as they appear. Either way, I wanted to get other people's input before pushing too far into this.
The most complex queries against this data will be path-finding and discovery of paths. These will be recursive queries that start at a specific cell and that travel through neighbors over several iterations and collect/compare properties of these cells. I'm pretty sure I can't do all this in SQL, there will likely be some application code throughout. I'd like to be able to perform queries like this of moderate size, and get results in an acceptable amount of time to feel "responsive" to user, about a second. The overall goal is to keep large table sizes from causing repeated queries or fixed-depth recursive queries from taking several seconds or more.
Not sure which database you are using, but you seem to be re-inventing what a spatially enabled database already supports.
If SQL Server, for example, is an option, you could store your polygons as geometry types, use the built-in spatial indexing, and the OGC compliant methods such as "STContains", "STCrosses", "STOverlaps", "STTouches".
SQL Server spatial indexes, after decomposing the polygons into various b-tree layers, also use tessellation to index which neighboring cells a given polygon touches, at a given layer of the tree index.
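A hedged sketch of what that could look like in SQL Server; the bounding box and the index/column names are assumptions for illustration:

-- Each cell is stored as a polygon; neighbors are found spatially,
-- so no separate neighbor table is needed.
CREATE TABLE map_cell (
    map_id     INT NOT NULL,
    cell_index INT NOT NULL,
    shape      geometry NOT NULL,
    PRIMARY KEY (map_id, cell_index)
);

CREATE SPATIAL INDEX SIX_map_cell_shape ON map_cell (shape)
WITH (BOUNDING_BOX = (0, 0, 1000, 1000));   -- extent of the diagram (assumed)

-- Neighbors of cell 0 on map 1 = polygons that touch it:
SELECT b.cell_index
FROM map_cell a
JOIN map_cell b
  ON b.map_id = a.map_id
 AND b.shape.STTouches(a.shape) = 1
WHERE a.map_id = 1 AND a.cell_index = 0;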
There are other mainstream databases which support spatial types as well, including MySQL.