Are indexes on columns with always a different value worth it?

Are indexes on columns with always a different value worth it? - sql

Does creating an index on a column that will always have a different value in each record (like a unique column) improves performances on SELECTs?
I understand that having an index on a column named ie. status which can have 3 values (such as PENDING, DONE, FAILED) and searching only FAILED in 1kk records will be faster.
But what happens if I have a unique id (not the primary key) in 1kk records, and I'm doing a SELECT on that column?

An index on a unique column is actually better than an index on a column with a few values.
To understand why, you need a basic understanding of how databases manage storage. This is a high-level view.
The primary purpose of an index is to reduce the number of pages that need to be read for a query. The rows themselves are stored on data pages. If you don't have an index, then all the data needs to be read.
The index is a data structure that makes it efficient to find a particular value. You can think of it as a sorted list, where a binary search is used to identify the right location. In actual fact, these are usually stored in a structure called b-trees (where the "b" stands for "balanced", not "binary") but that is an implementation detail. And there are types of indexes that don't use b-trees.
So, if the values are unique, then an index is extremely helpful. Instead of doing a full table scan, the "row id" can efficiently be looked up in the index and then only one data page needs to be read.
Note that unique constraints are implemented using indexes. If you have declared a column to be unique, there is no need for an additional index because it is already there.

Related

Searching for record(s) in a table that has over 200 Million Rows

Which type of index should be used on the table? It is initially inserted (one a month) into a empty table. I then place a non clustered composite index on two of the columns. Wondering if merging the two fields into one would increase performance when searching. Or does it not matter? Should I be working with an identity column that has a primary key clustered index?

You should index the field(s) most likely to be used in the where clause as people query the table. Don't worry about the primary key - it already has an index.

If you can define a unique primary key that can be used when querying the table, this will be used as the clustered index and will be the fastest for selects.
If your select query has to use the two fields you mentioned, keep them separate. Performance will not be impacted and the schema is not spoiled.

"A clustered index is particularly efficient on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent."
With this in mind you probably won't see much benefit from haveing a clustered index on your primary key (ID) unless it have business meaning for your aplication. If you have a Date value that you are commonly querying, then it may make more sense to add a clustered index to that
select * from table where created > '2013-01-01' and created < '2013-02-01'
I have seen datawarehouses use a concatenated key approach. Whether this works for you depends on your queries. Obviously querying a single field value will be faster than multiple fields, particularly when there is one less lookup in the B-tree index.
Alternatively, if you have 200 million rows in a table you could look at breaking the data out into multiple tables if it makes sense to do so.
You're saying that you're loading all this data every month so I have to assume that all the data is relevant. If there was data in your table that is considered "old" and not relevant to searches, then you could move data out into a archive table (using the same schema) so your queries only run against "current" data.
Otherwise, you can look at a sharding approach as used by NoSQL like MongoDB. If MongoDB is not an option, you could achieve the same shard key like logic in your application. I doubt that your database SQL drivers will support sharding natively.

Issue with the big tables ( no primary key available)

Tabe1 has around 10 Lack records (1 Million) and does not contain any primary key. Retrieving the data by using SELECT command ( With a specific WHERE condition) is taking large amount of time. Can we reduce the time of retrieval by adding a primary key to the table or do we need to follow any other ways to do the same. Kindly help me.

A primary key does not have a direct affect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary index does not affect performance, but the index used by the primary key does.
WHEN SHOULD PRIMARY KEY BE USED?

Primary key is needed for referring to a specific record.
To make your SELECTs run fast you should consider adding an index on an appropriate columns you're using in your WHERE.
E.g. to speed-up SELECT * FROM "Customers" WHERE "State" = 'CA' one should create an index on State column.

Primarykey will not help if you don't have Primarykey in where cause.
If you would like to make you quesry faster, you can create non-cluster index on columns in where cause. You may want include columns on top of your index(it depend on your select cause)
The SQL optimizer will seek on your indexs that will make your query faster.
(but you should think about when data adding in your table. Insert operation might takes time if you create index on many columns.)

It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers

One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time .........
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it TASK table) is going to be the biggest table of the whole application. Expecting millions records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can / should be?

I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.

in general an index will speed up a JOIN, a sort operation and a filter
SO if the columns are in the JOIN, the ORDER BY or the WHERE clause then an index will help in terms of performance...but there is always a but...with every index that you add UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes

Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you well have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered indexe to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)

What is an index in SQL?

Also, when is it appropriate to use one?

An index is used to speed up searching in the database. MySQL has some good documentation on the subject (which is relevant for other SQL servers as well):
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
An index can be used to efficiently find all rows matching some column in your query and then walk through only that subset of the table to find exact matches. If you don't have indexes on any column in the WHERE clause, the SQL server has to walk through the whole table and check every row to see if it matches, which may be a slow operation on big tables.
The index can also be a UNIQUE index, which means that you cannot have duplicate values in that column, or a PRIMARY KEY which in some storage engines defines where in the database file the value is stored.
In MySQL you can use EXPLAIN in front of your SELECT statement to see if your query will make use of any index. This is a good start for troubleshooting performance problems. Read more here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html

A clustered index is like the contents of a phone book. You can open the book at 'Hilditch, David' and find all the information for all of the 'Hilditch's right next to each other. Here the keys for the clustered index are (lastname, firstname).
This makes clustered indexes great for retrieving lots of data based on range based queries since all the data is located next to each other.
Since the clustered index is actually related to how the data is stored, there is only one of them possible per table (although you can cheat to simulate multiple clustered indexes).
A non-clustered index is different in that you can have many of them and they then point at the data in the clustered index. You could have e.g. a non-clustered index at the back of a phone book which is keyed on (town, address)
Imagine if you had to search through the phone book for all the people who live in 'London' - with only the clustered index you would have to search every single item in the phone book since the key on the clustered index is on (lastname, firstname) and as a result the people living in London are scattered randomly throughout the index.
If you have a non-clustered index on (town) then these queries can be performed much more quickly.

An index is used to speed up the performance of queries. It does this by reducing the number of database data pages that have to be visited/scanned.
In SQL Server, a clustered index determines the physical order of data in a table. There can be only one clustered index per table (the clustered index IS the table). All other indexes on a table are termed non-clustered.
SQL Server Index Basics
SQL Server Indexes: The Basics
SQL Server Indexes
Index Basics
Index (wiki)

Indexes are all about finding data quickly.
Indexes in a database are analogous to indexes that you find in a book. If a book has an index, and I ask you to find a chapter in that book, you can quickly find that with the help of the index. On the other hand, if the book does not have an index, you will have to spend more time looking for the chapter by looking at every page from the start to the end of the book.
In a similar fashion, indexes in a database can help queries find data quickly. If you are new to indexes, the following videos, can be very useful. In fact, I have learned a lot from them.
Index Basics
Clustered and Non-Clustered Indexes
Unique and Non-Unique Indexes
Advantages and disadvantages of indexes

Well in general index is a B-tree. There are two types of indexes: clustered and nonclustered.
Clustered index creates a physical order of rows (it can be only one and in most cases it is also a primary key - if you create primary key on table you create clustered index on this table also).
Nonclustered index is also a binary tree but it doesn't create a physical order of rows. So the leaf nodes of nonclustered index contain PK (if it exists) or row index.
Indexes are used to increase the speed of search. Because the complexity is of O(log N). Indexes is very large and interesting topic. I can say that creating indexes on large database is some kind of art sometimes.

INDEXES - to find data easily
UNIQUE INDEX - duplicate values are not allowed
Syntax for INDEX
CREATE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
Syntax for UNIQUE INDEX
CREATE UNIQUE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);

First we need to understand how normal (without indexing) query runs. It basically traverse each rows one by one and when it finds the data it returns. Refer the following image. (This image has been taken from this video.)
So suppose query is to find 50 , it will have to read 49 records as a linear search.
Refer the following image. (This image has been taken from this video)
When we apply indexing, the query will quickly find out the data without reading each one of them just by eliminating half of the data in each traversal like a binary search. The mysql indexes are stored as B-tree where all the data are in leaf node.

INDEX is a performance optimization technique that speeds up the data retrieval process. It is a persistent data structure that is associated with a Table (or View) in order to increase performance during retrieving the data from that table (or View).
Index based search is applied more particularly when your queries include WHERE filter. Otherwise, i.e, a query without WHERE-filter selects whole data and process. Searching whole table without INDEX is called Table-scan.
You will find exact information for Sql-Indexes in clear and reliable way:
follow these links:
For cocnept-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Overview-and-Optimizations.html
For implementation-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Creation-Deletetion-Optimizations.html

If you're using SQL Server, one of the best resources is its own Books Online that comes with the install! It's the 1st place I would refer to for ANY SQL Server related topics.
If it's practical "how should I do this?" kind of questions, then StackOverflow would be a better place to ask.
Also, I haven't been back for a while but sqlservercentral.com used to be one of the top SQL Server related sites out there.

An index is used for several different reasons. The main reason is to speed up querying so that you can get rows or sort rows faster. Another reason is to define a primary-key or unique index which will guarantee that no other columns have the same values.

So, How indexing actually works?
Well, first off, the database table does not reorder itself when we put index on a column to optimize the query performance.
An index is a data structure, (most commonly its B-tree {Its balanced tree, not binary tree}) that stores the value for a specific column in a table.
The major advantage of B-tree is that the data in it is sortable. Along with it, B-Tree data structure is time efficient and operations such as searching, insertion, deletion can be done in logarithmic time.
So the index would look like this -
Here for each column, it would be mapped with a database internal identifier (pointer) which points to the exact location of the row. And, now if we run the same query.
Visual Representation of the Query execution
So, indexing just cuts down the time complexity from o(n) to o(log n).
A detailed info - https://pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql

INDEX is not part of SQL. INDEX creates a Balanced Tree on physical level to accelerate CRUD.
SQL is a language which describe the Conceptual Level Schema and External Level Schema. SQL doesn't describe Physical Level Schema.
The statement which creates an INDEX is defined by DBMS, not by SQL standard.

An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
Indexes are automatically created when PRIMARY KEY and UNIQUE constraints are defined on table columns. For example, when you create a table with a UNIQUE constraint, Database Engine automatically creates a nonclustered index.
If you configure a PRIMARY KEY, Database Engine automatically creates a clustered index, unless a clustered index already exists. When you try to enforce a PRIMARY KEY constraint on an existing table and a clustered index already exists on that table, SQL Server enforces the primary key using a nonclustered index.
Please refer to this for more information about indexes (clustered and non clustered):
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15
Hope this helps!

Two single-column indexes vs one two-column index in MySQL?

I'm faced with the following and I'm not sure what's best practice.
Consider the following table (which will get large):
id PK | giver_id FK | recipient_id FK | date
I'm using InnoDB and from what I understand, it creates indices automatically for the two foreign key columns. However, I'll also be doing lots of queries where I need to match a particular combination of:
SELECT...WHERE giver_id = x AND recipient_id = t.
Each such combination will be unique in the table.
Is there any benefit from adding an two-column index over these columns, or would the two individual indexes in theory be sufficient / the same?

If you have two single column indexes, only one of them will be used in your example.
If you have an index with two columns, the query might be faster (you should measure). A two column index can also be used as a single column index, but only for the column listed first.
Sometimes it can be useful to have an index on (A,B) and another index on (B). This makes queries using either or both of the columns fast, but of course uses also more disk space.
When choosing the indexes, you also need to consider the effect on inserting, deleting and updating. More indexes = slower updates.

A covering index like:
ALTER TABLE your_table ADD INDEX (giver_id, recipient_id);
...would mean that the index could be used if a query referred to giver_id, or a combination of giver_id and recipient_id. Mind that index criteria is leftmost based - a query referring to only recipient_id would not be able to use the covering index in the statement I provided.
Please note that some older MySQL versions can only use one index per SELECT so a covering index would be the best means of optimizing your queries.

If one of the foreign key indexes is already very selective, then the database engine should use that one for the query you specified. Most database engines use some kind of heuristic to be able to choose the optimal index in that situation. If neither index is highly selective by itself, it probably does make sense to add the index built on both keys since you say you will use that type of query a lot.
Another thing to consider is if you can eliminate the PK field in this table and define the primary key index on the giver_id and recipient_id fields. You said that the combination is unique, so that would possibly work (given a lot of other conditions that only you can answer). Typically, though, I think the added complexity that adds is not worth the hassle.

Another thing to consider is that the performance characteristics of both approaches will be based on the size and cardinality of the dataset. You may find that the 2-column index only becomes noticing more performant at a certain dataset size threshold, or the exact opposite. Nothing can substitute for performance metrics for your exact scenario.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas