Clustering Factor and Unique Key - sql

Clustering factor - an awesome, simple explanation of how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in index-key
order). The better (lower) the CF, the more efficient it would be to
use the index, as fewer table blocks would need to be accessed to
retrieve the necessary data via the index.
My index statistics:
So, here are my indexes (each over just one column) under analysis.
The index starting with PK_ is my primary key and the one starting with UI is a unique key. (Of course both hold unique values.)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME UNIQUENES CLUSTERING_FACTOR NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST UNIQUE 10009871 10453407 96 --> So High
UITEST01 UNIQUE 853733 10113211 9 --> Very Low
We can see the PK has a very high CF while the other unique index does not.
The only logical explanation that strikes me is that the underlying table data is actually stored in the order of the column covered by the unique index.
1) Am I right with this understanding?
2) Is there any way to give the PK the lowest CF number?
3) Looking at the query cost when using either of these indexes, single-row selects are very fast. But the CF number is still what baffles us.
The table is relatively huge, over 10M records, and it also receives real-time inserts/updates.
My database version is Oracle 11gR2, on Exadata X2.

You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has a small number of columns); see the sketch after this list. This is usually advisable over changing all of your tables to IOTs.
Use an IOT (index-organized table) - it is a table stored as an index, so it is ordered by the primary key.
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in the order of your index; a sketch follows this list. This option is OK if your data isn't changing and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append the rows to a new segment.
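Here is a minimal sketch of the first, second and fourth options; the table and column names (mytable, id, status) are assumed purely for illustration:
-- Option 1: a supplemental index carrying an extra column, so a common
-- query can be answered from the index alone without visiting the table
CREATE INDEX ix_mytable_id_status ON mytable (id, status);

-- Option 2: an index-organized table, physically ordered by its primary key
CREATE TABLE mytable_iot (
  id      NUMBER PRIMARY KEY,
  status  VARCHAR2(20),
  payload VARCHAR2(200)
) ORGANIZATION INDEX;

-- Option 4: rebuild the heap in index order (only sensible for static data);
-- afterwards you would rename the new table and recreate indexes/constraints
CREATE TABLE mytable_sorted AS
  SELECT * FROM mytable ORDER BY id;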
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
For more advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes

Related

SQL Server - can GUID be a good choice as part of a clustered index?

I have a large set of domain tables in a database - over 100 tables. Every single one uses a uniqueidentifier as its PK.
I'm realizing now that my mistake is that these are also, by default, the clustered index.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
.
.
.
.
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on UserId,Id...that way the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts happen "in the middle" of the table. And such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have a unique GUID column on each row, if that is a requirement for the database (say, for interfacing with other applications).
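A minimal sketch of that pattern; the table, column, and constraint names below are made up for illustration and are not from the original post:
CREATE TABLE Orders (
    OrderId   INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,   -- narrow, ever-increasing clustering key
    OrderGuid UNIQUEIDENTIFIER NOT NULL
              CONSTRAINT DF_Orders_Guid DEFAULT NEWSEQUENTIALID()
              CONSTRAINT UQ_Orders_Guid UNIQUE,           -- rows stay addressable by GUID externally
    UserId    INT NOT NULL,
    OrderDate DATETIME NOT NULL
);

-- A non-clustered index to serve "top 10 orders for user X sorted by OrderDate"
CREATE NONCLUSTERED INDEX IX_Orders_User_Date
    ON Orders (UserId, OrderDate DESC);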
Clustered index is when you want to retrieve rows for a range of values for a given column. As data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent as a primary key, could be positively detrimental to performance when used as the clustering key, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.

SQL Server Indexing Doubts

Indexing is used to improve the performance of SQL queries, but I have always found it a little difficult to decide in which situations I should use an index and in which I should not. I want to clarify some of my doubts regarding non-clustered indexes.
What is a non-clustered index key? As books say, each index row of a non-clustered index contains the non-clustered key value. Does that mean it is the column on which the non-clustered index was created, i.e. if the index is created on empname varchar(50), the non-clustered key will be that empname?
Why is it preferable to create an index on a column with a small width? Is it because comparisons on a wider column take more time for the SQL Server engine, or because it deepens the hierarchy of intermediate nodes - since the page size is fixed, a page or node with wider columns will contain fewer index rows?
If a table has multiple non-clustered index columns, will the non-clustered key be the combination of all those columns, or does SQL Server generate some internal unique id along with a locator that points to the actual data row? If possible, please explain with a real example and diagrams.
Why is it said that a column with non-repeating values is a good candidate for an index? Even if it contains repeated values, shouldn't it still improve performance, since once the scan reaches a given key value its locator immediately points to the actual row?
If the column used in the index is not unique, how does it find the actual data row in the table?
Please suggest any book or tutorial which would be useful to clear my doubts.
First I think we need to cover what an actual index is. Usually in an RDBMS, indexes are implemented using a variant of B-trees (the B+ variant is the most common). To put it shortly - think of a binary search tree optimized for being stored on disk. The result of looking up a key in the B-tree is usually the primary key of the table. That means that if a lookup in the index completes and we need more data than what is present in the index, we can do a seek in the table using the primary key.
Please remember that when we think of performance for an RDBMS, we usually measure it in disk accesses (I'll ignore locking and other issues here) and not so much in CPU time.
Having the index be non-clustered means that the actual way the data in the table is stored has no relation to the index key - whereas a clustered index specifies that the data in the table will be sorted by (or clustered on) the index key - this is why there can only be one clustered index per table.
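As a sketch of that distinction in SQL Server syntax (the table name emp and the column empid are assumed for illustration; empname comes from question 1):
-- Only one clustered index per table: it defines the physical order of the rows
CREATE CLUSTERED INDEX ix_emp_id ON emp (empid);

-- Any number of non-clustered indexes: here the key is empname, as in question 1
CREATE NONCLUSTERED INDEX ix_emp_empname ON emp (empname);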
2) Back to our model of measuring performance - if the index key has a small width (fits into a small number of bytes), it means that per block of disk data we retrieve we can fit more keys, and as such we can perform lookups in the B-tree much faster when measuring disk I/O.
3) I tried explaining this further up - unfortunately I don't really have any graphs or drawings to indicate this - hopefully someone else can come along and share these.
4) If you're running a query like so:
SELECT something, something_else FROM sometable t1 WHERE akey = 'some value'
On a table with an index defined like so:
CREATE INDEX idx_sometable_akey ON sometable(akey)
If sometable has a lot of rows where akey is equal to 'some value', this means a lot of lookups not only in the index but also in the actual table to retrieve the values of something and something_else. Whereas if there's a good chance that this filter returns few rows, then it also means fewer disk accesses.
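One way to cut down those extra table lookups is a covering index that carries the selected columns along; a sketch, assuming SQL Server's INCLUDE syntax and the column names from the query above:
-- Covers the query entirely: the filter column is the key,
-- and the selected columns ride along in the index leaf pages
CREATE INDEX idx_sometable_akey_covering
    ON sometable (akey)
    INCLUDE (something, something_else);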
5) See earlier explanation
Hope this helps :)

Is ROWID internally indexed unique by an SQL DBMS?

It's my understanding that the quickest way to access a particular row is by its ROWID. In INFORMIX-SE 7.3, when I do a SELECT ROWID FROM table I notice that its values are of type SERIAL[INT]. In Oracle, they are SERIAL[HEX]. Has anyone ever used ROWID for any practical purpose? If I wanted to locate the most recent row added to a table, would SELECT MAX(ROWID) FROM table be quicker and more reliable than, say, SELECT MAX(pk_id) FROM table, where pk_id is a user-created SERIAL column? What other practical uses have you put ROWID to?
Your understanding is not necessarily correct. The ROWID property in SQL Server is primarily intended for replication as a way to guarantee that the table has a single-field unique index value. This way the replication system does not have to account for any specific primary key semantics that your design might employ, while still being able to identify every row by a single column. No table is required to have a ROWID column unless it is part of a merge replication publication, so it's not something that every table has, unlike Oracle. It also doesn't serve the same purpose (they're GUIDs - or uniqueidentifier in T-SQL parlance - on SQL Server and are random, not sequential integers like they are on Oracle).
The quickest way to retrieve a row from a table is by accessing the row via the clustered index. A table can only have one clustered index, as it's what determines the physical ordering of the rows on the disk. Furthermore, if the table has a primary key, the primary key is the clustered index. While it's possible to declare a table without a primary key and assign the clustered index to something else, I can't (off the top of my head) fathom a reason why you'd want to do this (or, for practical purposes, how you can justify having a table without a primary key).
In short, that means that the quickest way to retrieve a row is by using the primary key of the table. Unless the ROWID column is the primary key (which is certainly possible to do), it isn't the fastest way.
Well, I can only really tell you how it works in Oracle, having used it for 19+ years :-)
Put simply, ROWID is an internal identifier that acts like a physical address. It can be split into database file number, block number, and row number. So obtaining the ROWID lets the db engine look the data up with a single direct IO.
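A small sketch of that decomposition using Oracle's DBMS_ROWID package (the package is not mentioned in the original answer, and the table name is assumed):
SELECT rowid,
       DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_no,   -- database file number
       DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_no,  -- block number within the file
       DBMS_ROWID.ROWID_ROW_NUMBER(rowid)   AS row_no     -- row number within the block
FROM   mytable
WHERE  ROWNUM <= 5;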
In an index, the B*-tree will have ROWIDs on the leaf nodes pointing directly to the location of the data, e.g. in a primary key index.
Being a physical address, it is subject to change when data is relocated on disk, which can happen after restoring a backup, rebuilding a table, or an export/import of data.
The db engine can do some tricks, e.g. when moving a pluggable tablespace from one instance to another, to avoid rebuilding indexes; however this is strictly db engine internals.
So to keep out of trouble leave the ROWID for internal use for the db engine. Storing the ROWID for your own usage will eventually lead to inconsistency.
In Informix-SE, the ROWID is basically the record number within the C-ISAM file that is used to hold the table. SE only deals in fixed size records, of course (no VARCHAR data).
In Informix Dynamic Server, the ROWID is (a) more complex (page number plus slot number) and (b) not always present (fragmented tables do not expose the ROWID, unless the table was created WITH ROWIDS, in which case the ROWID is a physical column that is indexed after all) - be aware!
If no data is ever deleted and you are using SE, then selecting the row with the maximum ROWID will be the most recently added row. If a row is deleted, then that space will eventually be reused, and then the most recently added row ceases to be the one with the maximum ROWID. (IDS does not make that promise for a variety of complex reasons.)
The SE implementation of ROWID does not store the ROWID in the table, and does not create an index on it, but it does not need an index because it knows the formula for where to go to find the data (offset in data file = ROWID * RowSize), give or take a plus one on RowSize or a minus one ROWID or both.
As to practical use for ROWID, the style that was used before fragmentation was added to IDS was to select a list of ROWID values for the records of interest in the table, maintaining that list in memory:
SELECT ROWID
FROM InterestingTable
WHERE SomeColumn = xxx
AND AnotherColumn < yyy;
Then, the program could present these rows one at a time, fetching the current data via the stored ROWID. The ROWID for a record would not change while a program was running. This ensured that the current data - whether edited by the current user or someone else - was shown when the record was displayed.
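The per-row fetch would look something like this (a sketch; InterestingTable is the hypothetical table from the query above, and the ROWID value is the one saved earlier):
SELECT *
FROM   InterestingTable
WHERE  ROWID = ?;   -- bind the ROWID saved from the earlier SELECT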
There's a program you're familiar with, ISQL Perform, that behaves like this. And it does not work with fragmented tables (necessarily in IDS; SE does not support fragmented tables) unless they are created with a physical ROWID column with the WITH ROWIDS clause.
Perhaps the term "RDBMS" rather than "an SQL server"?
Attaching any purpose to a ROWID is a bad idea. Particularly if you're in the habit of dropping and recreating tables. If your table needs a SERIAL PK, then that's what it should have. No good can come of using ROWIDs within your application.

Is there a better/faster method locating a row with the maximum value in a column?

INFORMIX-SE 7.32:
I have a transaction table with about 5,000 nrows. The transaction.ticket_number[INT] is a column which gets updated with the next available sequential ticket number every time a specific row is updated. The column is unique indexed. I'm currently using the following SELECT statement to locate the max(transaction.ticket_num):
SELECT MAX(transaction.ticket_number) FROM transaction;
Since the row being updated is clustered according to transaction.fk_id[INT], where it is joined to customer.pk_id[SERIAL], the row is not physically located at the end of the transaction table; rather, it resides within the group of transaction rows belonging to each particular customer. I chose to cluster the transactions belonging to each customer because response time is faster when I scroll through each customer's transactions. Is there a faster way of locating max(transaction.ticket_number) than the above query? Would a 'unique index on transaction(ticket_number) descending' improve access, or is the index fully traversed from beginning to end regardless?
On a table of only 5000 rows on a modern machine, you are unlikely to be able to measure the difference in performance of the various techniques, especially in the single-user scenario which I believe you are facing. Even if the 5000 rows were all at the maximum permissible size (just under 32 KB), you would be dealing with 160 MB of data, which could easily fit into the machine's caches. In practice, I'm sure your rows are far smaller, and you'd never need all the data in the cache.
Unless you have a demonstrable performance problem, go with the index on the ticket number column and rely on the server (Informix SE) to do its job. If you have a demonstrable problem, show the query plans from SET EXPLAIN output. However, there are major limits on how much you can tweak SE performance - it is install-and-go technology with minimal demands on tuning.
I'm not sure whether Informix SE supports the 'FIRST n' (aka 'TOP n') notation that Informix Dynamic Server supports; I believe not.
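For reference, in Informix Dynamic Server that notation would look roughly like this (a sketch only; as noted, SE probably does not accept it):
SELECT FIRST 1 ticket_number
FROM   transaction
ORDER  BY ticket_number DESC;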
Due to NULLABLE columns and other factors, use of indexes, etc., you can often find the following to be faster, but normally only negligibly...
SELECT TOP 1 ticket_number FROM transaction ORDER BY ticket_number DESC
I'm also uncertain as to whether you actually have an Index on [ticket_number]? Or do you just have a UNIQUE constraint? A constraint won't help determine a MAX, but an INDEX will.
In the event that an INDEX exists with ticket_number as the first indexable column:
- An index seek/lookup would likely be used, not needing to scan the other values at all
In the event that an INDEX exists with ticket_number Not as the first indexable column:
- An index scan would likely occur, checking every single unique entry in the index
In the event that no usable INDEX exists:
- The whole table would be scanned

How does database indexing work? [closed]

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?
For information on queries to index a field, check out How do I index a database column.
Why is it needed?
When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.
Because the records can only be sorted on one field, searching on a field that isn't sorted requires a Linear Search, which takes (N+1)/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't contain unique entries) then the entire tablespace must be searched, at N block accesses.
Whereas with a sorted field, a Binary Search may be used, which has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indices require additional space on the disk. Since the indices for a table are stored together in a single file under the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.
How does it work?
Firstly, let’s outline a sample database table schema;
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note: char was used in place of varchar to allow for an accurate size on disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).
Example 1 - sorted vs unsorted fields
Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.
A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.
Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.
Example 2 - indexing
Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.
Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. Finding the address of the actual record requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses - a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.
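Creating such an index is a one-liner; a sketch in MySQL syntax, assuming the sample table is named person (the original text never names it):
CREATE INDEX idx_person_firstName ON person (firstName);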
When should it be used?
Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.
Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would simply be a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also, given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to that of a linear search, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.
Classic example "Index in Books"
Consider a "Book" of 1000 pages, divided by 10 Chapters, each section with 100 pages.
Simple, huh?
Now, imagine you want to find a particular chapter that contains the word "Alchemist". Without an index page, you have no option other than scanning through the entire book/chapters, i.e. 1000 pages.
This analogy is known as "Full Table Scan" in database world.
But with an index page, you know where to go! What's more, to look up any particular chapter that matters, you just need to check the index page each time. After finding the matching index entry you can efficiently jump to that chapter by skipping the rest.
But then, in addition to the actual 1000 pages, you will need another ~10 pages to hold the index, so 1010 pages in total.
Thus, the index is a separate section that stores values of indexed
column + pointer to the indexed row in a sorted order for efficient
look-ups.
Things are simple in school, aren't they? :P
An index is just a data structure that makes searching faster for a specific column in a database. This structure is usually a B-tree or a hash table, but it can be any other logical structure.
The first time I read this it was very helpful to me. Thank you.
Since then I gained some insight about the downside of creating indexes:
If you write into a table (UPDATE or INSERT) with one index, you actually have two write operations in the file system: one for the table data and another for the index data (and the re-sorting of it (and - if clustered - the re-sorting of the table data)). If the table and index are located on the same hard disk, this costs more time. Thus a table without an index (a heap) would allow for quicker write operations. (If you had two indexes you would end up with three write operations, and so on.)
However, defining two different locations on two different hard disks for index data and table data can decrease/eliminate the problem of the increased time cost. This requires defining additional filegroups with corresponding files on the desired hard disks, and defining the table/index locations as desired.
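A rough sketch of what that looks like in SQL Server; the database, filegroup, file, path, table, and index names are made up for illustration:
-- Create a filegroup on a second disk and add a data file to it
ALTER DATABASE MyDb ADD FILEGROUP FG_INDEXES;
ALTER DATABASE MyDb ADD FILE
    (NAME = 'MyDb_Indexes', FILENAME = 'E:\Data\MyDb_Indexes.ndf')
    TO FILEGROUP FG_INDEXES;

-- Table data stays on the default filegroup; this index is placed on FG_INDEXES
CREATE INDEX IX_Orders_UserId ON Orders (UserId) ON FG_INDEXES;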
Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZE helps, but you must write routines to have it done.
In certain scenarios a heap is more helpful than a table with indexes, e.g. if you have lots of competing writes but only one nightly read outside business hours for reporting.
Also, a differentiation between clustered and non-clustered indexes is rather important.
This helped me: What do Clustered and Non clustered index actually mean?
Assume an Employee table with columns such as Employee_Name, Employee_Age and Employee_Address. Now, let's say that we want to run a query to find all the details of any employees who are named 'Abc'.
SELECT * FROM Employee
WHERE Employee_Name = 'Abc'
What would happen without an index?
Database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is 'Abc'. And, because we want every row with the name 'Abc' inside it, we cannot just stop looking once we find one row with the name 'Abc', because there could be other rows with the name 'Abc'. So, every row up until the last row must be searched - which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name 'Abc'. This is what is called a full table scan.
How a database index can help performance
The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined. An index is a data structure (most commonly a B-tree) that stores the values for a specific column in a table.
How does B-trees index work?
The reason B-trees are the most popular data structure for indexes is due to the fact that they are time efficient - because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B-trees are more commonly used is because the data that is stored inside the B-tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS's, you can actually specify which data structure you want your database to use when you create the index itself.
How does a hash table index work?
The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index.
For instance, the query we discussed earlier could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like "Abc => 0x28939", where 0x28939 is a reference to the table row where Abc is stored in memory. Looking up a value like "Abc" in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of "Abc" in the Employee_Name column.
The disadvantages of a hash index
Hash tables are not sorted data structures, and there are many types of queries which hash indexes cannot even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. How could you do that with a hash table index? Well, it's not possible, because a hash table is only good for looking up key-value pairs - which means queries that check for equality.
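To make the contrast concrete, here is a small sketch (the Employee_Age column is mentioned later in this answer; the index name is made up): a hash index on Employee_Name can serve the equality query, but the range query needs an ordered structure such as a B-tree index.
-- Served well by a hash index (pure equality)
SELECT * FROM Employee WHERE Employee_Name = 'Abc';

-- Cannot use a hash index; an ordered (B-tree) index is needed to avoid a full scan
SELECT * FROM Employee WHERE Employee_Age < 40;
CREATE INDEX idx_employee_age ON Employee (Employee_Age);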
What exactly is inside a database index?
So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.
How does a database know when to use an index?
When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Abc’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.
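Most databases let you inspect that decision; a sketch using the EXPLAIN statement provided by MySQL and PostgreSQL, among others (the exact output format varies by product):
EXPLAIN SELECT * FROM Employee WHERE Employee_Name = 'Abc';
-- The resulting plan shows whether the optimizer chose the index on
-- Employee_Name or decided that a full table scan was cheaper.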
What is the cost of having a database index?
It takes up space - and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up-to-the-minute data as whatever is in the table column(s) that the index covers.
As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.
See also
What columns generally make good indexes?
How do database indexes work
Simple Description!
The index is nothing but a data structure that stores the values for a specific column in a table. An index is created on a column of a table.
Example: We have a database table called User with three columns – Name, Age and Address. Assume that the User table has thousands of rows.
Now, let’s say that we want to run a query to find all the details of any users who are named 'John'.
If we run the following query:
SELECT * FROM User
WHERE Name = 'John'
The database software would literally have to look at every single row in the User table to see if the Name for that row is ‘John’. This will take a long time.
This is where an index helps us: an index is used to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined.
How to create an index:
CREATE INDEX name_index
ON User (Name)
An index consists of column values (e.g. John) from one table, and those values are stored in a data structure.
So now the database will use the index to find users named John
because the index will presumably be sorted alphabetically by the
user's name. And, because it is sorted, searching for a name is a lot
faster because all names starting with a "J" will be right next to
each other in the index!
Just think of Database Index as Index of a book.
If you have a book about dogs and you want to find information about, let's say, German Shepherds, you could of course flip through all the pages of the book and find what you are looking for - but this of course is time-consuming and not very fast.
Another option is to go to the index section of the book, find the name of the entity you are looking for (in this instance, German Shepherds), and then use the page number to quickly find what you are looking for.
In a database, the page number is referred to as a pointer, which directs the database to the address on the disk where the entity is located. Using the same German Shepherd analogy, we could have something like ("German Shepherd", 0x77129) where 0x77129 is the address on the disk where the row data for German Shepherd is stored.
In short, an index is a data structure that stores the values for a specific column in a table so as to speed up query search.