The MarketPlan table contains more than 60 million rows.
When I need the total number of plans from a particular date onward, I execute this query, which takes more than 7 minutes. How can I reduce this time?
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > #date
I have implemented everything mentioned in your links, and I have now also added WITH (NOLOCK), which reduced the response time to 5 minutes.
You will have to create an index on the table, or maybe partition the table by date.
You might also want to have a look at
SQL Server 2000/2005 Indexed View Performance Tuning and Optimization Tips
SQL Server Indexed Views
Does the table in question have an index on the LaunchDate column? Also, did you really mean to post LaunchDate > #date?
Assuming SQL Server based on #date, although the same can be applied to most databases.
If your primary query is to select out a range of data (based on your sample), adding or altering the CLUSTERED INDEX will go a long way toward improving query times.
See: http://msdn.microsoft.com/en-us/library/ms190639.aspx
By default, SQL-Server creates the Primary Key as the Clustered Index which is great from a transactional point of view, but if your focus is to retrieve the data, then altering that default makes a huge difference.
CREATE CLUSTERED INDEX name ON MarketPlan (LaunchDate DESC)
Note: Assuming LaunchDate is a static date value and is primarily inserted in increasing/sequential order to minimize index fragmentation.
There are some fine suggestions here. If all else fails, consider a little denormalization: create another table with the cumulative counts and keep it up to date with a trigger (a sketch follows). If you have more queries of this nature, think about OLAP.
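A hedged sketch of that idea, keeping per-day counts in a side table maintained by an insert trigger (all names are assumptions; deletes and updates would need similar handling):
CREATE TABLE MarketPlanDailyCounts (
    LaunchDay date NOT NULL PRIMARY KEY,
    PlanCount int  NOT NULL
);
GO
CREATE TRIGGER trg_MarketPlan_Counts ON MarketPlan
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Fold the newly inserted rows into the per-day totals.
    MERGE MarketPlanDailyCounts AS t
    USING (SELECT CAST(LaunchDate AS date) AS LaunchDay, COUNT(*) AS cnt
           FROM inserted
           GROUP BY CAST(LaunchDate AS date)) AS s
    ON t.LaunchDay = s.LaunchDay
    WHEN MATCHED THEN UPDATE SET t.PlanCount = t.PlanCount + s.cnt
    WHEN NOT MATCHED THEN INSERT (LaunchDay, PlanCount) VALUES (s.LaunchDay, s.cnt);
END;
GO
-- The 7-minute COUNT then becomes a sum over a few hundred rows:
SELECT SUM(PlanCount) FROM MarketPlanDailyCounts WHERE LaunchDay > @date;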
Your particular query does not require a clustered key on the date column. It would actually run better with a nonclustered index with a leading date column, because you don't need a key lookup in this query; the nonclustered index would be covering and more compact than the clustered one (it implicitly includes the clustered key columns).
If you have it indexed properly and it still does not perform, it is most likely fragmentation. In that case, defragment the index and try again.
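For example, a sketch of defragmenting (assuming an index like the one suggested just below):
ALTER INDEX xLaunchDate ON MarketPlan REORGANIZE; -- lightweight, always online
-- or, if fragmentation is heavy (roughly above 30%):
ALTER INDEX xLaunchDate ON MarketPlan REBUILD;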
Create a new index like this:
CREATE INDEX xLaunchDate on MarketPlan (LaunchDate, primaryKeyColumn)
Check this nice article about how an index can improve performance:
http://blog.sqlauthority.com/2009/10/08/sql-server-query-optimization-remove-bookmark-lookup-remove-rid-lookup-remove-key-lookup-part-2/
"WHERE LaunchDate > #date"
Is the value of the parameter #date defined in the same batch (or transaction or context)?
If not, this leads to a Clustered Index Scan (of all rows) instead of a Clustered Index Seek (of just the rows satisfying the WHERE condition), because the value comes from outside the current batch (for example, as an input parameter of a stored procedure or UDF).
The query cannot be fully optimized by the SQL Server optimizer at compile time, leading to a full table scan, since the value of the parameter is known only at run time.
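One common workaround (a sketch, assuming a suitable index on LaunchDate exists) is OPTION (RECOMPILE), which compiles the statement at run time, when the parameter value is known. Note the question writes the parameter as #date; in T-SQL it would be a variable such as @date:
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > @date
OPTION (RECOMPILE);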
Update: A comment on the answers proposing OLAP.
OLAP is just a concept; SSAS cubes are just one possible OLAP implementation.
SSAS is a convenience, not an obligation, for applying the OLAP concept.
You do not have to use SSAS to use OLAP.
See, for example, Simulated OLAP.
Update 2: A comment on the question raised in the comments to this answer, MDX performance vs. T-SQL:
MDX is an option/convenience/feature provided by SSAS (cubes/OLAP), not an obligation.
The simplest thing you can do is:
SELECT COUNT(LaunchDate)
FROM MarketPlan
WHERE LaunchDate > #date
This will guarantee you index-only retrieval for any LaunchDate index.
Also (this depends on your execution plan), I have seen instances (though not specific to SQL Server) in which > caused a table scan while BETWEEN used an index. If you know the top date, you might try WHERE LaunchDate BETWEEN #date AND <<Literal Date>>.
How wide is the table? If the table is wide (i.e., many columns of (n)char, (n)varchar or xml), there might be a significant amount of IO causing the query to run slowly as a result of using the clustered index.
To determine if IO is causing the long query time perform the following:
Create a non-clustered index only on the LaunchDate column.
Run the query below which counts LaunchDate and forces the use of the new index.
SELECT COUNT(LaunchDate)
FROM MarketPlan WITH (INDEX = TheNewIndexName)
WHERE LaunchDate > #date
I do not like to use index hints, and I suggest this hint only to prove whether IO is causing the long query times.
There are two ways to do this:
First, create a clustered index on the date column. Since the query is date-range specific, all the data will be stored in date order, which avoids having to scan through all records in the table.
Alternatively, you can use horizontal partitioning. This will affect your existing table design, but it is the most optimal approach; see this:
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
Related
I have a table with 300 million rows. One of the columns is of date type, and when I select rows between two dates it takes forever: about 3 minutes. The date field is indexed, and I'm using SQL Server 2012 on a very powerful machine with high specs.
Is there anything I can do to make it significantly faster?
This is the query:
Select flightID, FlightDirection, DestinationID, FlightDuration
from T_Flights (nolock)
where FlightDate between #fromDate And #toDate
A scan in the execution plan is not good; it should be a seek.
Try adding the columns in the SELECT statement to the index and run the query again.
If it still doesn't work another thing you could do is use the Database Engine Tuning Advisor to see if it gives you any suggestions. Select the query in SSMS, right click and select Analyze Query in Database Engine Tuning Advisor.
From your discussion, I understand that you do not have a proper index on the date column. You have mentioned that an index is being scanned, but as you have not given enough details about which index, I would suggest you create an index with included columns to suit your query.
Your query can be satisfied by the nonclustered index below on its own. But adding an index brings the additional overhead of maintaining it, so add the index only when your workload demands it.
-- Assuming FlightID is the primary key. The primary key is included by default,
-- so there is no need to add it separately. If FlightID is not the primary key,
-- add it to the list of included columns.
CREATE NONCLUSTERED INDEX NCI_FlightDate ON dbo.T_Flights(FlightDate)
INCLUDE (FlightDirection, DestinationID, FlightDuration)
If you have 300 million rows, then an index on (FlightDate) might help, depending on how many flights per day and across how many days. Including the other columns in the index should help a bit more.
However, for what you want to do, it sounds like a better solution is to partition the table. This stores each day's data in a separate "file", and only the partitions needed for a given query are read.
The downside is that this requires re-creating the table. However, you might find that this is a big win performance-wise, so it is worth the effort; a sketch follows.
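A hedged sketch of such partitioning (assumptions: FlightDate is of type date, T_Flights has no existing clustered index, and the boundary dates and filegroup are illustrative):
CREATE PARTITION FUNCTION pfFlightDate (date)
    AS RANGE RIGHT FOR VALUES ('2017-01-01', '2018-01-01', '2019-01-01');
CREATE PARTITION SCHEME psFlightDate
    AS PARTITION pfFlightDate ALL TO ([PRIMARY]);
-- Creating the clustered index on the scheme moves the table onto it:
CREATE CLUSTERED INDEX CIX_T_Flights_FlightDate
    ON dbo.T_Flights (FlightDate)
    ON psFlightDate (FlightDate);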
You need to use the Database Engine Tuning Advisor to optimize the query execution.
How can I improve my performance issue? I have a SQL query with IN, and I guess the IN is causing a costly performance issue. How should I index for my query?
My sql query:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE ([p].[IsDeleted] = 0)
AND (([p].[ReferencedxyzType] = #__refxyzType_0)
AND [p].[ReferencedxxxId] IN ('42342','ffsdfd','5345345345'))
My solution (but I need your help for better advice): which one is correct, a clustered or a nonclustered index?
USE [xxx]
GO
CREATE NONCLUSTERED INDEX IX_NonClusteredIndexDemo_xxxId
ON [Common].[xxxReference](xxxId)
INCLUDE ([ID],[ReferencedxxxId])
WITH (DROP_EXISTING=ON, ONLINE=ON, FILLFACTOR=90)
GO
Second:
CREATE INDEX xxxReference_ReferencedxxxId_index
ON [Common].[xxxReference] (ReferencedxxxId)
Which one is correct, or do you have a better solution?
The performance problem in this query is not the result of using the IN operator.
That operator performs very well with small lists (say, fewer than 1000 members).
The performance bottleneck here is the fact that SQL Server performs an index scan instead of an index seek (which is very costly), plus the key lookup, which accounts for 20% of the query cost.
To avoid both problems, you can add an index on IsDeleted, ReferencedxyzType and ReferencedxxxId, probably in this exact order; a sketch follows.
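A minimal sketch of that index (the index name is my assumption; since ReferencedxxxId is part of the key, the query's single output column is covered without an INCLUDE):
CREATE NONCLUSTERED INDEX IX_xxxReference_IsDeleted_Type_Id
ON [Common].[xxxReference] (IsDeleted, ReferencedxyzType, ReferencedxxxId);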
SQL performance tuning is a science that tends to look a little like art or magic; either way you look at it, it requires good knowledge of both the theory and practice of index settings and the relevant system requirements.
Therefore, my suggestion is this: do not attempt to solve it yourself with the help of strangers on the internet. Get an expert in for a consulting job for a couple of hours/days to analyze the system and help you fine-tune it.
Learn whatever you can during this process. Ask questions about everything that is not trivial. This will be money well spent.
Couple of things:
If you have a SELECT statement inside the IN, that should be avoided and replaced with an EXISTS clause; but in your example above that is not relevant, as you have literal values inside the IN. Using EXISTS and NOT EXISTS instead of IN and NOT IN helps SQL Server avoid scanning each value of the column for every value inside the IN / NOT IN; it can instead short-circuit the search once a match or non-match is found.
Avoid implicit conversions. They degrade performance for many reasons, including: (i) SQL Server cannot find proper statistics on an index and hence cannot leverage it, and will instead use the clustered index available on the table (which may not cover your query); (ii) the storage engine may not assign the proper amount of RAM during the memory-allocation phase of the query; (iii) cardinality estimation goes wrong, because SQL Server has no statistics on the converted value of the column, only on the column itself.
If you look at your execution plan posted above, you will see a yellow mark on the SELECT operator. If you hover over it, you will see one or more warning messages. If your warning is related to implicit conversion, try to use proper datatypes in the comparison.
E.g., what is the datatype of the column [ReferencedxxxId]? If it is not NVARCHAR but rather VARCHAR, then I would suggest:
Make the values inside the IN VARCHAR (currently you are making them NVARCHAR). This way you will still be able to take full advantage of the rowstore index created on the [ReferencedxxxId] column.
If you must have the values as NVARCHAR inside the IN clause, then you should:
CONVERT/CAST the column [ReferencedxxxId] in your IN clause. This gets rid of the implicit conversion, but you will no longer be able to take full advantage of the rowstore index on the [ReferencedxxxId] column. Instead, create a clustered/nonclustered columnstore index on the table covering the columns used in the query (see the sketch after this answer). That should significantly enhance the performance of your SELECT query.
If you decide to go the route of a rowstore index by correcting the values inside the IN, make sure you create a clustered/nonclustered index that covers the query, meaning the index covers the columns you search on ([ReferencedxxxId], [ReferencedxyzType], [IsDeleted]) and then includes the columns used in the SELECT statement under an INCLUDE clause (if it is a nonclustered index).
Also, when you create a composite rowstore index, try to order the columns from high cardinality to low cardinality, left to right, to make the best use of the index.
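A sketch of the columnstore alternative mentioned above (the index name is an assumption; an updatable nonclustered columnstore index requires SQL Server 2016 or later):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_xxxReference
ON [Common].[xxxReference] (ReferencedxxxId, ReferencedxyzType, IsDeleted);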
Assuming an OLTP-based system and not OLAP, my first pass would be a nonclustered index on ReferencedxyzType, ReferencedxxxId, IsDeleted; given that IsDeleted is likely to have the least selectivity, I would place it last.
In a higher-volume scenario I might even be tempted to move IsDeleted out of the index key and into an INCLUDE instead, since it provides so little selectivity to the index itself.
There is clearly already a clustered index in place on the table (we can see it in the query plan), but we don't have the details of what is in it.
The question around clustered vs non-clustered is more complex and requires a lot more knowledge of the system and usage.
I have a large table against which I regularly run queries like SELECT ... WHERE date_att > date '2001-01-01'. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run them through EXPLAIN ANALYZE it still chooses to sequentially scan the table, even for queries as simple as SELECT date_att FROM table WHERE date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index; but since the table is clustered by that attribute, shouldn't it be able to binary-search really quickly through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as it did without the clustering.
It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
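A minimal example of the operation described above (table and index names are placeholders):
CREATE INDEX big_date_att_idx ON big (date_att);
CLUSTER big USING big_date_att_idx;
ANALYZE big; -- refresh statistics (including physical correlation) for the planner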
Clustering potentially (often) improves query speed for range queries because the selected rows end up stored near each other in the heap table. However, there is nothing that guarantees this order stays intact! Consequently, the optimizer cannot assume that it is true.
E.g., if you insert a new row that fulfills your WHERE clause, it might be inserted at any place in the table, e.g. where the rows for 1990 are stored. Hence, this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different: it is not supported by PostgreSQL at all, but it is by many other databases (SQL Server, MySQL with InnoDB, and also Oracle, where it is called an 'Index Organized Table').
In that case, the table data itself is stored in an index structure; there is no separate heap structure! As it is an index, the order is also maintained for each insert/update/delete. Hence your assumption would hold true, and indeed I'd expect the above-mentioned databases to behave as you expect (given the date column is the clustering key!).
Hope that clarifies it.
Let's assume you have one massive table with three columns as shown below:
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL
Also assume you are limited to one physical disk and one filegroup (PRIMARY). You expect this table to hold sales for 10,000,000+ ids, across 100's of dates (easily 1B+ records).
As with many data warehousing scenarios, the data will typically grow sequentially by date (i.e., each time you perform a data load, you will be inserting new dates, and maybe updating some of the more recent dates of data). For analytic purposes, the data will often be queried and aggregated for a random set of ~10,000 ids which will be specified via a join with another table. Often, these queries don't specify date ranges, or specify very wide date ranges, which leads me to my question: What is the best way to index / partition this table?
I have thought about this for a while, but am stuck with conflicting solutions:
Option #1: As data will be loaded sequentially by date, define the clustered index (and primary key) as [date], [id]. Also create a "sliding window" partitioning function / scheme on date allowing rapid movement of new data in / out of the table. Potentially create a non-clustered index on id to help with querying.
Expected Outcome #1: This setup will be very fast for data loading purposes, but sub-optimal when it comes to analytic reads as, in a worst case scenario (no limiting by dates, unlucky with set of id's queried), 100% of the data pages may be read.
Option #2: As the data will be queried for only a small subset of ids at a time, define the clustered index (and primary key) as [id], [date]. Do not bother to create a partitioned table.
Expected Outcome #2: Expected huge performance hit when it comes to loading data as we can no longer quickly limit by date. Expected huge performance benefit when it comes to my analytic queries as it will minimize the number of data pages read.
Option #3: Clustered (and primary key) as follows: [id], [date]; "sliding window" partition function / scheme on date.
Expected Outcome #3: Not sure what to expect. Given that the first column in the clustered index is [id] and thus (it is my understanding) the data is arranged by ID, I would expect good performance from my analytic queries. However, the data is partitioned by date, which is contrary to the definition of the clustered index (but still aligned as date is part of the index). I haven't found much documentation that speaks to this scenario and what, if any, performance benefits I may get from this, which brings me to my final, bonus question:
If I am creating a table on one filegroup on one disk, with a clustered index on one column, is there any benefit (besides partition switching when loading the data) that comes from defining a partition on the same column?
This table is awesomely narrow. If the real table will be this narrow, you should be happy to have table scans instead of index seeks followed by lookups.
I would do this:
CREATE TABLE Narrow
(
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL,
PRIMARY KEY(id, date) --EDIT, just noticed your id is not unique.
)
CREATE INDEX CoveringNarrow ON Narrow(date, id, sales)
This handles point queries with seeks, and wide-range queries with limited scans against the date and id criteria. There is no per-record lookup from the index. Yes, I've doubled the write time (and space used), but that's fine, IMO.
If there's some need for a specific piece of data (and that need is demonstrated by profiling!), I'd create a clustered view targeting that section of the table.
CREATE VIEW Narrow200801 WITH SCHEMABINDING
AS
SELECT [id], [date], [sales] FROM dbo.Narrow
WHERE '2008-01-01' <= [date] AND [date] < '2008-02-01'
GO
--The command that makes this a clustered (indexed) view is a unique clustered index on the schema-bound view:
CREATE UNIQUE CLUSTERED INDEX IX_Narrow200801 ON Narrow200801 (id, [date])
Clustered (indexed) views can be used in queries by name, or the optimizer will choose to use them when the FROM and WHERE clauses are appropriate. For example, this query will use the clustered view. Note that the base table is referred to in the query.
SELECT SUM(sales) FROM Narrow WHERE '2008-01-01' <= [date] AND [date] < '2008-02-01'
As an index lets you make specific columns conveniently accessible, a clustered view lets you make specific rows conveniently accessible.
A clustered index will give you performance benefits for queries when localising the I/O. Date is a traditional partitioning strategy as many D/W queries look at movements by date.
A rule of thumb for a partitioned table suggests that partitions should be around 10m rows in size.
It would be somewhat unusual to see much performance gain from a clustered index on a diverse analytic workload. The query optimiser will use a technique called 'index intersection' to select rows without even hitting the fact table; see here for a post I did on another question that explains this in more depth, with some links.
A clustered index may or may not participate in the index intersection, so you may find that it gains you relatively little on a general query workload.
You may find circumstances in loading where clustered indexes give you some gain, particularly if you have derived calculations (such as Earned Premium) that are computed within the ETL process. In this case you may get some benefits. If you have a specific query that you know will be executed all the time it might make sense to use clustered indexes for this. Options #2 and #3 are only going to significantly benefit you if you expect this type of query to be the overwhelming majority of the work done by the application.
For a flexible system, a simple date-range partition with an index on the ID (and date, if the partitions hold a range) would probably get you performance as good as any. You might get some benefit from clustering the index in limited circumstances. You might also get some mileage from building a cube over the data and ensuring that the aggregations are set up correctly for this query.
If you are using the partitions in your SELECT statements, then you can gain some speed.
If you are not using it, only using "standard" selects, then you have no benefit.
On your original problem: I would recommend option #1, with the non-clustered index on id included.
I would do the following:
Non-Clustered Index on [Id]
Clustered Index on [Date]
Convert the [sales] datatype to numeric instead of float
Partition the table by date. Several horizontal partitions will be more performant than one large table with that many rows.
A clustered index on the date column isn't good if you'll have rows inserted faster than the datetime resolution of 3.33 ms.
If you do, you'll get two keys with the same value, and your index will have to get an additional internal uniquifier, which will increase its size.
I'd go with option #2.
It's my understanding that nulls are not indexable in DB2. So assume we have a huge table (Sales) with a date column (sold_on) which is normally a date but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns:
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type? Indexing the indicator column?
Where did you get the impression that DB2 doesn't index NULLs? I can't find anything in the documentation or in articles supporting that claim. And I just ran a query on a large table using an IS NULL restriction involving an indexed column containing a small fraction of NULLs; in this case, DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time on a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: your data may be composed in a way where DB2 thinks that using an index will not be quicker, or the database's statistics may not be up to date for the involved table(s).
I'm no DB2 expert, but if 10% of your values are null, I don't think an index on that column alone will ever help your query. 10% is too many to bother using an index for -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use your index.
Think about how many records are on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page will contain 0 records that are null is (90%)^20, or 12%. Those aren't good odds -- you're going to need 88% of your pages fetched anyway, so using the index isn't very helpful.
If, however, your SELECT clause only included a few columns (and not *) -- say just salesid -- you could probably get it to use an index on (sold_on, salesid), as the read of the data page wouldn't be needed: all the data would be in the index.
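A sketch of that idea (index name assumed; I've also put total in the key so the sample query's filter is answered from the index as well):
CREATE INDEX sales_sold_on_ix ON Sales (sold_on, total, salesid);
-- Index-only access: every column the query touches is in the index.
SELECT salesid FROM Sales WHERE sold_on IS NULL AND total = 9.99;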
The rule of thumb is that an index is useful for values found in up to 15% of the records. ... So an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and setting it to true whenever the sold_on date gets set (this could be done in a trigger); a sketch follows.
That's not the nicest solution, but it might be what you need.
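A hedged sketch of that workaround (DB2 LUW syntax; names are assumptions, and an UPDATE trigger would also be needed if sold_on can change after insert):
ALTER TABLE Sales ADD COLUMN IsSold SMALLINT NOT NULL DEFAULT 0;
CREATE TRIGGER sales_set_issold
NO CASCADE BEFORE INSERT ON Sales
REFERENCING NEW AS n
FOR EACH ROW MODE DB2SQL
SET n.IsSold = CASE WHEN n.sold_on IS NULL THEN 0 ELSE 1 END;
CREATE INDEX sales_issold_total ON Sales (IsSold, total);
-- The 'returned sales' query then filters on the always-indexable flag:
SELECT * FROM Sales WHERE IsSold = 0 AND total = 9.99;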
Troels is correct: even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows will most likely have a newer SOLD_ON date.