Issue with big tables (no primary key available) - SQL

Table1 has around 10 lakh (1 million) records and does not contain a primary key. Retrieving the data using a SELECT command (with a specific WHERE condition) is taking a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to follow some other approach? Kindly help me.

A primary key does not have a direct effect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, the primary key itself does not affect performance, but the index used by the primary key does.
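For illustration, a sketch in T-SQL, assuming the question's Table1 has a unique, non-nullable column Id:

-- Adding a primary key creates a unique clustered index behind the scenes:
ALTER TABLE Table1 ADD CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Id);

-- Or, without a primary key, you can create your own unique index:
CREATE UNIQUE INDEX UX_Table1_Id ON Table1 (Id);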
WHEN SHOULD A PRIMARY KEY BE USED?

A primary key is needed for referring to a specific record.
To make your SELECTs run fast, you should consider adding an index on the appropriate columns you're using in your WHERE clause.
E.g., to speed up SELECT * FROM "Customers" WHERE "State" = 'CA', one should create an index on the State column.
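A minimal sketch of that (the index name is arbitrary):

CREATE INDEX IX_Customers_State ON "Customers" ("State");

-- This query can now seek on the index instead of scanning the whole table:
SELECT * FROM "Customers" WHERE "State" = 'CA';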

A primary key will not help if the primary key column isn't in your WHERE clause.
If you would like to make your query faster, you can create a non-clustered index on the columns in your WHERE clause. You may also want to INCLUDE columns on top of your index (it depends on your SELECT clause).
The SQL optimizer will seek on your indexes, and that will make your query faster.
(But you should think about what happens when data is added to your table. Insert operations might take longer if you create indexes on many columns.)
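For example, a sketch with assumed names (a hypothetical Orders table queried by Status and returning OrderDate and Total):

CREATE NONCLUSTERED INDEX IX_Orders_Status
ON Orders (Status)
INCLUDE (OrderDate, Total);

-- This query can now be answered entirely from the index:
SELECT OrderDate, Total FROM Orders WHERE Status = 'shipped';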

It depends on the SELECT statement; on the size of each row and the number of rows in the table; on whether you are retrieving all the data in each row or only a small subset (and if a subset, whether the needed columns are all present in a single index); and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
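As a sketch of that last case (the names are assumed, not from the question): a narrow index containing every selected column can be scanned instead of the much wider base table:

CREATE INDEX IX_Orders_Customer_Date ON Orders (CustomerId, OrderDate);

-- Selects most rows but only the two indexed columns, so the narrow index covers it:
SELECT CustomerId, OrderDate FROM Orders;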
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers

One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index and by default will reduce the retrieval time.
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?


Non-clustered index including columns

Let's say I have a table with many columns (20, for instance), and I often search by one of them. If I create a non-clustered index for that column, then I know I should also include the other columns from the select statement to cover the query.
But what if the query is SELECT *? Should I include all columns in the index? I know I am making a copy of the whole table by doing that; is it good or bad practice?
Indexing most of the table, or all of it, is not usually a good idea, especially if there are inserts / updates / deletes to the table. When all the wanted fields are not included in the index, a key lookup must be made using the clustered index to find the row(s) from the table. How good / bad that is depends on how many rows you're fetching and how many levels there are in the clustered index -- and that's why it's good to have a narrow clustering key, preferably an int.
If you have to do key lookups for a significant portion of the rows in the table, it's usually a lot faster just to scan the whole table. That is most likely the case in your scenario too: if only a few rows are affected, the key lookups aren't going to be that expensive, so indexing all fields wouldn't really help.
Of course if your table is huge, indexing all the columns might help, at least in theory. I haven't ever even considered doing that, but I would assume it would help when scanning the whole table would be a costly operation. This of course only applies if the table doesn't get many updates, because maintaining the index would cause problems too.

SQL Server indexing includes questions

I've been troubleshooting some bad SQL calls in my work's applications. I've been reading up on indexes, tweaking and benchmarking things. Here are some of the rules I've gathered (let me know if this sounds right):
For heavily used queries, boil the query down to only what is needed and rework the where statement to use the most common columns first. Then make a non-clustered index on the columns used in the where statement and INCLUDE any remaining select columns (excluding large columns of course, like nvarchar(max)).
If a query is going to return > 20% of the entire table's contents, it's best to do a table scan and not use an index.
Order in an index matters. You have to make sure to structure your where statement like the index is built.
Now one thing I'm having trouble finding info on is this: what if a query selects columns that are not part of any index, but its where statement uses a column that is? Is the index used, and does the leaf node point back to the table to look up the associated row?
ex: table
Id col1 col2 col3
CREATE INDEX my_index
ON my_table (col1)
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6
Is my_index used here? If so, how does it resolve Id, col2, col3? Does it point back to table rows and pick up the values?
To answer your question, yes, my_index is used. And yes, your index will point back to the table rows and pick the id, col2 and col3 values there. That is what an index does.
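If those key lookups ever became expensive, one option (a sketch using SQL Server's INCLUDE syntax and the example's column names) would be to make the index covering:

CREATE INDEX my_covering_index
ON my_table (col1)
INCLUDE (Id, col2, col3);  -- the query above can now be answered from the index alone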
Regarding your 'rules'
Rule 1 makes sense. Except for the fact that I usually do not 'include' other columns in my index. As explained above, the index will refer back to the table and quickly retrieve the row(s) that you need.
Rule 2, I don't really understand. You create the index and SQL Server will decide which indices to use or not use. You don't really have to worry about it.
Rule 3, the order of the conditions in your where statement does not really make a difference.
I hope this helps.
From dba.stackexchange.com:
There are a few concepts and terms that are important to understand
when dealing with indexes. Seeks, scans, and lookups are some of the
ways that indexes will be utilized through select statements.
Selectivity of key columns is integral to determining how effective an
index can be.
A seek happens when the SQL Server Query Optimizer determines that the
best way to find the data you have requested is by scanning a range
within an index. Seeks typically happen when a query is "covered" by
an index, which means the seek predicates are in the index key and the
displayed columns are either in the key or included. A scan happens
when the SQL Server Query Optimizer determines that the best way to
find the data is to scan the entire index and then filter the results.
A lookup typically occurs when an index does not include all requested
columns, either in the index key or in the included columns. The query
optimizer will then use either the clustered key (against a clustered
index) or the RID (against a heap) to "lookup" the other requested
columns.
Typically, seek operations are more efficient than scans, due to
physically querying a smaller data set. There are situations where
this is not the case, such as a very small initial data set, but that
goes beyond the scope of your question.
Now, you asked how to determine how effective an index is, and there
are a few things to keep in mind. A clustered index's key columns are
called a clustering key. This is how records are made unique in the
context of a clustered index. All nonclustered indexes will include
the clustered key by default, in order to perform lookups when
necessary. All indexes will be inserted to, updated to, or deleted
from for every respective DML statement. That having been said, it is
best to balance performance gains in select statements against
performance hits in insert, delete, and update statements.
In order to determine how effective an index is, you must determine
the selectivity of your index keys. Selectivity can be defined as a
percentage of distinct records to total records. If I have a [person]
table with 100 total records and the [first_name] column contains 90
distinct values, we can say that the [first_name] column is 90%
selective. The higher the selectivity, the more efficient the index
key. Keeping selectivity in mind, it is best to put your most
selective columns first in your index key. Using my previous [person]
example, what if we had a [last_name] column that was 95% selective?
We would want to create an index with [last_name], [first_name] as the
index key.
I know this was a bit of a long-winded answer, but there really are a
lot of things that go into determining how effective an index will be,
and a lot of things you must weigh any performance gains against.
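To make the selectivity idea from that quote concrete, here is a sketch against the [person] table it describes:

-- Selectivity = distinct values / total rows for each candidate key column:
SELECT COUNT(DISTINCT last_name)  * 1.0 / COUNT(*) AS last_name_selectivity,
       COUNT(DISTINCT first_name) * 1.0 / COUNT(*) AS first_name_selectivity
FROM person;

-- Put the more selective column first in the index key:
CREATE INDEX IX_person_last_first ON person (last_name, first_name);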

Searching for record(s) in a table that has over 200 Million Rows

Which type of index should be used on the table? The data is initially inserted (once a month) into an empty table. I then place a non-clustered composite index on two of the columns. I'm wondering if merging the two fields into one would increase performance when searching, or does it not matter? Should I be working with an identity column that has a primary key clustered index?
You should index the field(s) most likely to be used in the where clause as people query the table. Don't worry about the primary key - it already has an index.
If you can define a unique primary key that can be used when querying the table, this will be used as the clustered index and will be the fastest for selects.
If your select query has to use the two fields you mentioned, keep them separate. Performance will not be impacted and the schema is not spoiled.
"A clustered index is particularly efficient on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent."
With this in mind you probably won't see much benefit from having a clustered index on your primary key (ID) unless it has business meaning for your application. If you have a Date value that you are commonly querying, then it may make more sense to add a clustered index to that column:
select * from table where created > '2013-01-01' and created < '2013-02-01'
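A sketch of that approach (the table and column names are placeholders, as in the query above):

CREATE CLUSTERED INDEX IX_events_created ON events (created);

-- The range query then reads physically adjacent rows:
SELECT * FROM events WHERE created > '2013-01-01' AND created < '2013-02-01';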
I have seen datawarehouses use a concatenated key approach. Whether this works for you depends on your queries. Obviously querying a single field value will be faster than multiple fields, particularly when there is one less lookup in the B-tree index.
Alternatively, if you have 200 million rows in a table you could look at breaking the data out into multiple tables if it makes sense to do so.
You're saying that you're loading all this data every month, so I have to assume that all the data is relevant. If there was data in your table that is considered "old" and not relevant to searches, then you could move it out into an archive table (using the same schema) so your queries only run against "current" data.
Otherwise, you can look at a sharding approach as used by NoSQL stores like MongoDB. If MongoDB is not an option, you could achieve the same shard-key-like logic in your application. I doubt that your database's SQL drivers will support sharding natively.

sql table optimization: primary and secondary indexes

Do people usually make every column in a table a secondary index, to be on the safe side in case the customer decides to use any of those fields to search for a record?
Does the search first go through the secondary indexes and then to the primary key, thus narrowing down to the requested data?
What is the point of having a secondary index if you already have a column that is a primary key?
(The following response applies to Sql Server. Some parts may vary for other DBMSs.)
Last question first: "What is the point of having secondary keys if you already have a column that is a primary key?" I illustrate with the example of a table "People (Id int primary key, firstname varchar(40), middlename varchar(40), lastname varchar(40))". Now consider the query "select * from people where lastname = 'flynn'". If there is no index on the lastname column, the table will be scanned sequentially to find matches. Every row must be accessed. The primary key index does not help at all here. If you index the lastname column, the result can be found much more quickly.
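A sketch of the fix, using the People table from the paragraph above:

CREATE INDEX IX_People_LastName ON People (lastname);

-- The query can now seek on the index instead of scanning every row:
SELECT * FROM People WHERE lastname = 'flynn';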
You would normally index only those columns that would be useful to the queries your application issues. If your queries never have a join or where condition on a column named "MiddleName" then no benefit would come from indexing that column. You don't want to add unnecessary indexes because they increase the cost of data inserts and updates that involve that column.
We usually say that Sql Server uses only a single index per table instance in a query. So a query like "select * from people where firstname='Elroy' and lastname = 'Flynn' " would use at most one index, even if both firstname and lastname have indexes. Sql Server would choose one or the other index based on the statistics it has collected from the data values.
In full completeness, I have to get a little advanced here, and discuss clustered vs. non-clustered indexes. A table can have only one clustered index; the rest are non-clustered. The previous paragraph notwithstanding, when a non-clustered index is used to resolve a query, the index lookup produces an intermediate result, which is the full value of the key associated with whichever index is the clustered index (often, the primary key). That is, the leaves of every non-clustered index contain the clustered key value, not a row pointer. After finding this clustered key, the clustered index is then used to resolve the lookup to a specific database row. So, ultimately, ALL index lookups eventually use the clustered index.
Still, for practical purposes, it is usually adequate and simpler to say that only a single index is used per table instance. Note that if a table is aliased in a query so that it appears more than once, a different index could be used for the different references. e.g., "select * from people p1 join people p2 on p1.firstname = p2.lastname" could use a firstname index on the p1 instance and a lastname index on the p2 instance.
see http://msdn.microsoft.com/en-us/library/aa933131(v=SQL.80).aspx
Usually you only index columns that need to be. Adding additional indexes would normally be considered premature optimization.
Most optimizers will identify the fastest method to find the least number of records. This may be to use an index, but may be a full table scan. If there are multiple indexes that can be used, often only one is used, and the resulting records are compared against the remaining criteria. If multiple indexes are used, then the resulting row sets need to be matched, and records which weren't found in both indexes eliminated.
It is common to use surrogate keys for tables where the natural key is subject to change, or very (purposely vague) long. The natural key in this case would be indexed as a secondary unique key. In some cases there may be competing natural keys, in which case all the natural keys would have unique indexes.
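As a sketch (the table and column names are hypothetical): a surrogate key serves as the primary key, while the natural key gets a unique secondary index:

CREATE TABLE Products (
    Id   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    Sku  VARCHAR(40) NOT NULL,           -- natural key, subject to change
    Name VARCHAR(100) NOT NULL
);

CREATE UNIQUE INDEX UX_Products_Sku ON Products (Sku);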
One other item not mentioned yet, every additional index has to be maintained. So if you have indexes covering all your columns in several different combinations, not only will they take up lots of space, every update/insert/delete has the potential to change one or more of those indexes. This will result in those operations being slowed way down in many situations.
It's always a tradeoff. The more indexes you have the more work the server has to do to keep them up to date, but the more likely it is that you'll have at least one that will cover any query you throw at that table.
"On the safe side"? No.
An index trades space and insert-time for select-time. Unnecessary keys chew up disk-space and slow inserts in return for speeding up a query that never occurs.
As with all optimizations, do query optimizations last -- build the system then observe its behavior.
The primary/secondary distinction is a highly technical one. All indices exist to speed up queries and/or enforce certain integrity constraints.

Two single-column indexes vs one two-column index in MySQL?

I'm faced with the following and I'm not sure what's best practice.
Consider the following table (which will get large):
id PK | giver_id FK | recipient_id FK | date
I'm using InnoDB and from what I understand, it creates indices automatically for the two foreign key columns. However, I'll also be doing lots of queries where I need to match a particular combination of:
SELECT...WHERE giver_id = x AND recipient_id = t.
Each such combination will be unique in the table.
Is there any benefit from adding a two-column index over these columns, or would the two individual indexes in theory be sufficient / the same?
If you have two single-column indexes, only one of them will be used in your example.
If you have an index with two columns, the query might be faster (you should measure). A two-column index can also be used as a single-column index, but only for the column listed first.
Sometimes it can be useful to have an index on (A,B) and another index on (B). This makes queries using either or both of the columns fast, but of course also uses more disk space.
When choosing the indexes, you also need to consider the effect on inserting, deleting and updating. More indexes = slower updates.
A covering index like:
ALTER TABLE your_table ADD INDEX (giver_id, recipient_id);
...would mean that the index could be used if a query referred to giver_id, or to a combination of giver_id and recipient_id. Mind that index matching is leftmost-based - a query referring only to recipient_id would not be able to use the covering index in the statement I provided.
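To illustrate the leftmost rule with that index (the values are placeholders):

-- Can use the (giver_id, recipient_id) index:
SELECT * FROM your_table WHERE giver_id = 42;
SELECT * FROM your_table WHERE giver_id = 42 AND recipient_id = 7;

-- Cannot use it, because recipient_id is not the leftmost column:
SELECT * FROM your_table WHERE recipient_id = 7;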
Please note that some older MySQL versions can only use one index per SELECT so a covering index would be the best means of optimizing your queries.
If one of the foreign key indexes is already very selective, then the database engine should use that one for the query you specified. Most database engines use some kind of heuristic to be able to choose the optimal index in that situation. If neither index is highly selective by itself, it probably does make sense to add the index built on both keys since you say you will use that type of query a lot.
Another thing to consider is whether you can eliminate the PK field in this table and define the primary key index on the giver_id and recipient_id fields. You said that the combination is unique, so that would possibly work (given a lot of other conditions that only you can answer). Typically, though, I think the added complexity is not worth the hassle.
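A sketch of that alternative (the table name is made up, and given_on stands in for the question's date column):

CREATE TABLE gift (
    giver_id     INT  NOT NULL,
    recipient_id INT  NOT NULL,
    given_on     DATE NOT NULL,
    PRIMARY KEY (giver_id, recipient_id)  -- the unique combination becomes the clustered key
) ENGINE=InnoDB;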
Another thing to consider is that the performance characteristics of both approaches will depend on the size and cardinality of the dataset. You may find that the two-column index only becomes noticeably more performant at a certain dataset size threshold, or the exact opposite. Nothing can substitute for performance metrics for your exact scenario.