I have a SQL Server table with around 50,000 rows. The table is updated once a day by an upstream process.
The application fires the following query:
SELECT * FROM Table1 WHERE Field1 = 'somevalue'
The "Field1" column contains duplicate values. I am trying to improve performance of the above query. I cannot modify the code in the application side. So limiting column instead of "SELECT *" is not possible. I am planning to index the table. Should I define a NON-CLUSTERED index on "Field1" column in order to improve performance? Or some other kind of indexing would help? Is there any other ways to improve performance from DB side ?
Yes, a non-clustered index on Field1 should serve your purposes...
For example,
CREATE NONCLUSTERED INDEX Idx_Table1_Field1 ON Table1 (Field1)
The best thing you can do is run sp_BlitzIndex by Brent Ozar to get a better picture of your entire database index setup (including this table).
http://www.brentozar.com/blitzindex/
Your table should already have a clustered index (if it doesn't, apply one following these principles); first look at the execution plan to see what it suggests.
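For example, a quick before-and-after comparison in SSMS (a minimal sketch using the table and value from the question; compare the logical reads and elapsed time reported with and without the index):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM Table1 WHERE Field1 = 'somevalue';
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;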
Further, since the table is only updated once a day, and presumably during off hours, you can easily compress the table. Given that its data is mostly repetitive, you can save over 50% of the space and I/O on the query, at the cost of a small CPU overhead. Compression changes only how the data is stored, not the data itself. Note that this feature is only available in SQL Server Enterprise Edition.
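A minimal sketch of enabling PAGE compression, assuming your edition supports it (and rebuilding the suggested nonclustered index the same way, if you created it):
ALTER TABLE Table1 REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX Idx_Table1_Field1 ON Table1 REBUILD WITH (DATA_COMPRESSION = PAGE);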
Last but not least, check that your data types are properly set: are you pulling from a datetime column that could easily be a date, or from a bigint that could easily be an int?
Asking how to create an index isn't really a proper question for Stack Overflow, i.e.
CREATE NONCLUSTERED INDEX Idx_Table1_Field1 ON Table1 (Field1)
As this is already documented on MSDN, and an index can even be created in SSMS by right-clicking the Indexes node under a table in Object Explorer, the question you should be asking is how to properly address performance improvements in your environment through indexing. Finally, analyze whether your end query result really necessitates a SELECT *. This is a common oversight in data display: a 30-column table is selected into a dataset when the developer only plans to show 5 of the columns, and populating only those 5 columns would cut the I/O roughly sixfold.
Please also note the famous index maintenance script by Ola Hallengren.
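For instance, a typical invocation of its IndexOptimize procedure might look like the following (the parameter values shown are the script's documented defaults; treat the exact names and thresholds as an assumption and check the current documentation):
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationLevel1 = 5,
    @FragmentationLevel2 = 30;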
Related
I have a table with 300 million rows. One of the columns is of type date, and when I select rows between two dates it takes forever: about 3 minutes. The date field is indexed, and I'm using SQL Server 2012 on a very powerful machine with high specs.
Is there anything I can do to make it significantly faster?
This is the query:
Select flightID, FlightDirection, DestinationID, FlightDuration
from T_Flights WITH (NOLOCK)
where FlightDate between @fromDate and @toDate
The execution plan shows a scan, which is not good; it should be a seek. Try adding the columns in the SELECT statement to the index and run the query again.
If it still doesn't work another thing you could do is use the Database Engine Tuning Advisor to see if it gives you any suggestions. Select the query in SSMS, right click and select Analyze Query in Database Engine Tuning Advisor.
From your discussion, I understand that you do not have a proper index on the date column. You mention that the index is being scanned, but as you have not given enough detail about which index is being scanned, I would suggest creating an index with included columns to suit your query.
Your query can be satisfied by the nonclustered index below on its own. But an index brings the additional overhead of maintaining it, so add it only if your workload demands it.
-- Assuming FlightID is the primary key: primary key columns are included
-- by default and need not be added separately. If FlightID is not the
-- primary key, add it to the list of included columns.
CREATE NONCLUSTERED INDEX NCI_FlightDate ON dbo.T_Flights(FlightDate)
INCLUDE (FlightDirection, DestinationID, FlightDuration)
If you have 300 million rows, then an index on (FlightDate) might help -- but it depends on how many flights per day and how many days. You can include the other columns in the index, which should help a bit more.
However, for what you want to do, it sounds like a better solution is to partition the table. This would store each day's data in a separate partition, and only the partitions needed would be read for a given query.
The downside is that this requires re-creating the table. However, you might find that it is a big win performance-wise and so worth the effort.
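As a minimal sketch of what that could look like, assuming FlightDate is a date column and monthly boundaries (the function, scheme, and boundary dates here are all illustrative):
-- Define where the partition boundaries fall
CREATE PARTITION FUNCTION pfFlightDate (date)
AS RANGE RIGHT FOR VALUES ('2017-01-01', '2017-02-01', '2017-03-01');

-- Map every partition to a filegroup (all to PRIMARY for simplicity)
CREATE PARTITION SCHEME psFlightDate
AS PARTITION pfFlightDate ALL TO ([PRIMARY]);

-- Building the clustered index on the scheme moves the data into partitions
-- (drop or rebuild any existing clustered index first)
CREATE CLUSTERED INDEX CIX_T_Flights_FlightDate
ON dbo.T_Flights (FlightDate)
ON psFlightDate (FlightDate);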
You need to use the Database Engine Tuning Advisor to optimize the query execution.
Good day to everyone!
I have a migration process that runs a remote query, fetches the data, and stores it in a #temp table.
The question is: what would be better, creating the index right after creating the #temp table, or inserting the data into the #temp table first and then creating the index? And why? Or is it better to process the data in the remote query before inserting it into the #temp table?
For example:
Select * into #BiosData
from sometable a
where (a.Status between 3 and 5)
CREATE CLUSTERED INDEX IDX_MAINID ON #BiosData([MAINID])
Process the data retrieved above...
OR this?
select a.MAINID into #BiosData
from table a
inner join Transactions.sometable c
on a.ID= c.fld_ID
inner join Reference.sometable b
on cast(a.[ID]/1000000000000 as decimal (38,0)) = b.fld_ID
where a.version > b.fld_version
and (a.Status between 3 and 5)
Thank you for your tips and suggestions :) I'm a newbie in SQL, please be gentle with me :)
As a generic rule:
If you create a fresh table, are going to insert data into it, and it needs an index, then it is faster to insert the data first and create the index afterwards. Why? Because creating an index on existing data means calculating it once, whereas inserting data into an indexed table continuously reshuffles the index contents, which also have to be written. By creating the index afterwards you avoid the overhead of updating the index while inserting.
Exception 1: if you want the index combined with the data, so that a read of the index to find a particular value delivers the data in the same read operation. In Oracle this is called an index-organized table; the MS SQL equivalent is a clustered index.
Exception 2: if your index is used to enforce some constraint, then creating the index first is a good option, to make sure the constraint is maintained during the inserts.
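A sketch of Exception 2, borrowing the names from your example (the bigint type for MAINID is an assumption):
-- The unique index is created before the load so duplicate MAINIDs fail fast
CREATE TABLE #BiosData (MAINID bigint NOT NULL);
CREATE UNIQUE CLUSTERED INDEX IDX_MAINID ON #BiosData (MAINID);

INSERT INTO #BiosData (MAINID)
SELECT a.MAINID
FROM sometable AS a
WHERE a.Status BETWEEN 3 AND 5;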
In your case: I notice that the complex query has an additional WHERE clause, which may result in fewer inserts and hence faster processing. However, if the tables used in the complex query have additional indexes that speed up the query, make sure similar indexes are also created on the temp table.
Finally: indexes are typically used to reduce disk I/O, and temporary tables live in tempdb, where they are often served from memory. So adding indexes is not guaranteed to increase speed...
The MarketPlan table contains more than 60 million rows.
When I need the total number of planes from a particular date onwards, I execute this query, which takes more than 7 minutes. How can I reduce this time?
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > @date
I have implemented everything mentioned in your links; I have now even added WITH (NOLOCK), which reduced the response time to 5 minutes.
You will have to create an index on the table, or maybe partition the table by date.
You might also want to have a look at
SQL Server 2000/2005 Indexed View Performance Tuning and Optimization Tips
SQL Server Indexed Views
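For this particular query, an indexed view could pre-aggregate the counts per day. A hedged sketch (the view and index names are invented here; indexed views require SCHEMABINDING and COUNT_BIG):
CREATE VIEW dbo.vMarketPlanPerDay
WITH SCHEMABINDING
AS
SELECT LaunchDate, COUNT_BIG(*) AS PlaneCount
FROM dbo.MarketPlan
GROUP BY LaunchDate;
GO
-- The unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX CIX_vMarketPlanPerDay
ON dbo.vMarketPlanPerDay (LaunchDate);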
Does the table in question have an index on the LaunchDate column? Also, did you really mean to post LaunchDate > @date?
Assuming SQL Server based on @date, although the same can be applied to most databases.
If your primary query is to select a range of data (as in your sample), adding or altering the CLUSTERED INDEX will go a long way toward improving query times.
See: http://msdn.microsoft.com/en-us/library/ms190639.aspx
By default, SQL Server creates the primary key as the clustered index, which is great from a transactional point of view, but if your focus is retrieving the data, then altering that default makes a huge difference.
CREATE CLUSTERED INDEX name ON MarketPlan (LaunchDate DESC)
Note: Assuming LaunchDate is a static date value and is primarily inserted in increasing/sequential order to minimize index fragmentation.
There are some fine suggestions here. If all else fails, consider a little denormalization: create another table with the cumulative counts and update it with a trigger. If you have more queries of this nature, think about OLAP.
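A rough sketch of that idea (the counts table, trigger, and names are all hypothetical, and only inserts are handled for brevity):
CREATE TABLE dbo.MarketPlanDailyCounts (
    LaunchDate date NOT NULL PRIMARY KEY,
    PlaneCount int NOT NULL
);
GO
CREATE TRIGGER trgMarketPlanCounts ON dbo.MarketPlan
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Fold the newly inserted rows into the per-day counts
    MERGE dbo.MarketPlanDailyCounts AS t
    USING (SELECT CAST(LaunchDate AS date) AS LaunchDate, COUNT(*) AS cnt
           FROM inserted
           GROUP BY CAST(LaunchDate AS date)) AS s
        ON t.LaunchDate = s.LaunchDate
    WHEN MATCHED THEN
        UPDATE SET t.PlaneCount = t.PlaneCount + s.cnt
    WHEN NOT MATCHED THEN
        INSERT (LaunchDate, PlaneCount) VALUES (s.LaunchDate, s.cnt);
END;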
Your particular query does not require a clustered key on the date column. It would actually run better with a nonclustered index whose leading column is the date, because you don't need a key lookup in this query: the nonclustered index would be covering and more compact than the clustered index (it implicitly includes the clustered key columns).
If you have it indexed properly and it still does not perform, the cause is most likely fragmentation. In that case, defragment the index and try again.
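In code, that might look like the following (the index name is invented):
CREATE NONCLUSTERED INDEX IX_MarketPlan_LaunchDate
ON dbo.MarketPlan (LaunchDate);

-- If fragmentation creeps in later, rebuild rather than recreate
ALTER INDEX IX_MarketPlan_LaunchDate ON dbo.MarketPlan REBUILD;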
Create a new index like this:
CREATE INDEX xLaunchDate on MarketPlan (LaunchDate, primaryKeyColumn)
Check this nice article about how an index can improve the performance.
http://blog.sqlauthority.com/2009/10/08/sql-server-query-optimization-remove-bookmark-lookup-remove-rid-lookup-remove-key-lookup-part-2/
"WHERE LaunchDate > #date"
Is the value of parameter #date defined in the same batch (or transaction or context)?
If not, then this would lead to Clustered Index Scan (of all rows) instead of Clustered Index Seek (of just rows satisfying WHERE condition) if its value is coming from outside of current batch (as, for example, input parameter of stored procedure or udf function).
The query cannot be fully optimized by SQL Server optimizer (at compile time) leading to full table scan since the value of parameter is known only at run-time
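If that is the case, one common workaround (my addition, not part of the original advice) is to request a fresh plan per execution so the optimizer sees the actual runtime value:
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > @date
OPTION (RECOMPILE); -- recompiles with the actual parameter value each time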
Update: a comment on the answers proposing OLAP.
OLAP is just a concept; SSAS cubes are just one of the possible OLAP implementations.
It is a convenience, not an obligation, in adopting the OLAP concept.
You do not have to use SSAS to use OLAP.
See, for example, Simulated OLAP.
Update 2: a comment on the question raised in the comments to this answer:
MDX performance vs. T-SQL
MDX is an option/convenience/feature provided by SSAS (cubes/OLAP), not an obligation.
The simplest thing you can do is:
SELECT COUNT(LaunchDate)
FROM MarketPlan
WHERE LaunchDate > @date
This will guarantee you index-only retrieval for any LaunchDate index.
Also (this depends on your execution plan), I have seen instances (though not specific to SQL Server) in which > caused a table scan while BETWEEN used an index. If you know the upper bound, you might try WHERE LaunchDate BETWEEN @date AND <<literal date>>.
How wide is the table? If the table is wide (i.e. it has many (n)char, (n)varchar, or xml columns), there might be a significant amount of I/O causing the query to run slowly when using the clustered index.
To determine if IO is causing the long query time perform the following:
Create a non-clustered index only on the LaunchDate column.
Run the query below which counts LaunchDate and forces the use of the new index.
SELECT COUNT(LaunchDate)
FROM MarketPlan WITH (INDEX = TheNewIndexName)
WHERE LaunchDate > @date
I do not like to use index hints, and I suggest this hint only to prove whether the I/O is causing the long query times.
There are two ways to do this.
First, create a clustered index on the date column. Since the query is date-range specific, all the data will be stored in order, and this avoids having to scan through all the records in the table.
Second, you can try horizontal partitioning. This will affect your existing table design, but it is the most optimal approach; see this:
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
I found that a table has 50 thousand records, and it takes one minute to fetch the data from the SQL Server table just by issuing a SQL statement. There is a primary key, which means a clustered index is already in place. I just do not understand why it takes one minute. Besides an index, what ways are there to optimize a table to get the data faster? What do I need to do in this situation for a faster response? Also, please tell me how to always write optimized SQL, with all the steps for optimization in detail.
Thanks.
The fastest way to optimize indexes on a table is to use the SQL Server Database Engine Tuning Advisor. Take a look here: http://www.youtube.com/watch?v=gjT8wL92mqE
Select only the columns you need, rather than select *. If your table has some large columns e.g. OLE types or other binary data (maybe used for storing images etc) then you may be transferring vastly more data off disk and over the network than you need.
As others have said, an index is no help to you when you are selecting all rows (no where clause). Using an index would be slower in such cases because of the index read and table lookup for each row, vs full table scan.
If you are running select * from employee (as per question comment) then no amount of indexing will help you. It's an "Every column for every row" query: there is no magic for this.
Adding a WHERE usually won't help a select * query either.
What you can check is index and statistics maintenance. Do you do any? Here's a Google search
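Typical maintenance statements look something like this (the employee table name is taken from the question's comments; adjust to your schema):
-- Defragment all indexes on the table and refresh its statistics
ALTER INDEX ALL ON dbo.employee REORGANIZE;
UPDATE STATISTICS dbo.employee WITH FULLSCAN;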
Or change how you use the data...
Edit:
Why a WHERE clause usually won't help: if you add a WHERE that is not on the PK,
- you'll still need to scan the table unless you add an index on the searched column
- then you'll need a key/bookmark lookup unless you make the index covering
- with SELECT * you need to add all columns to the index to make it covering
- for many hits, the index will probably be ignored to avoid key/bookmark lookups
Unless there is a network issue or similar, the issue is reading all columns, not the lack of a WHERE clause.
If you did SELECT col13 FROM MyTable and had an index on col13, the index will probably be used.
A SELECT * FROM MyTable WHERE DateCol < '20090101' with an index on DateCol that matches 40% of the table will probably either ignore the index or incur expensive key/bookmark lookups.
Irrespective of the merits of returning the whole table to your application, that does sound like an unexpectedly long time to retrieve just 50,000 rows of employee data.
Does your query have an ORDER BY or is it literally just select * from employee?
What is the definition of the employee table? Does it contain any particularly wide columns? Are you storing binary data such as their CVs or employee photo in it?
How are you issuing the SQL and retrieving the results?
What isolation level are your select statements running at? (You can use SQL Profiler to check this.)
Are you encountering blocking? Does adding NOLOCK to the query speed things up dramatically?
I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only a very limited effect on the speed of the query, but a potentially larger effect on the transfer speed of the data. The less data you select, the less data needs to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, the entire row is fetched into cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE, and ORDER BY processing: more columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections early just to save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from whole-row fetches, you have to go to a columnar ("inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, if one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the optimizer.
To demonstrate what tvanfosson has already written, that there is a "transfer" cost, I ran the following two statements on an MS SQL 2000 DB from Query Analyzer:
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both queries returned 947 rows, but the first one took 5 ms and the second 973 ms.
Also, because the fields involved are the same, I would not expect indexing to be a factor here.