Let's assume you have one massive table with three columns as shown below:
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL
Also assume you are limited to one physical disk and one filegroup (PRIMARY). You expect this table to hold sales for 10,000,000+ ids, across hundreds of dates (easily 1B+ records).
As with many data warehousing scenarios, the data will typically grow sequentially by date (i.e., each time you perform a data load, you will be inserting new dates, and maybe updating some of the more recent dates of data). For analytic purposes, the data will often be queried and aggregated for a random set of ~10,000 ids which will be specified via a join with another table. Often, these queries don't specify date ranges, or specify very wide date ranges, which leads me to my question: What is the best way to index / partition this table?
I have thought about this for a while, but am stuck with conflicting solutions:
Option #1: As data will be loaded sequentially by date, define the clustered index (and primary key) as [date], [id]. Also create a "sliding window" partitioning function / scheme on date allowing rapid movement of new data in / out of the table. Potentially create a non-clustered index on id to help with querying.
Expected Outcome #1: This setup will be very fast for data loading purposes, but sub-optimal when it comes to analytic reads as, in a worst case scenario (no limiting by dates, unlucky with set of id's queried), 100% of the data pages may be read.
Option #2: As the data will be queried for only a small subset of ids at a time, define the clustered index (and primary key) as [id], [date]. Do not bother to create a partitioned table.
Expected Outcome #2: Expected huge performance hit when it comes to loading data as we can no longer quickly limit by date. Expected huge performance benefit when it comes to my analytic queries as it will minimize the number of data pages read.
Option #3: Clustered (and primary key) as follows: [id], [date]; "sliding window" partition function / scheme on date.
Expected Outcome #3: Not sure what to expect. Given that the first column in the clustered index is [id] and thus (it is my understanding) the data is arranged by ID, I would expect good performance from my analytic queries. However, the data is partitioned by date, which is contrary to the definition of the clustered index (but still aligned as date is part of the index). I haven't found much documentation that speaks to this scenario and what, if any, performance benefits I may get from this, which brings me to my final, bonus question:
If I am creating a table on one filegroup on one disk, with a clustered index on one column, is there any benefit (besides partition switching when loading the data) that comes from defining a partition on the same column?
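For concreteness, the sliding-window setup I have in mind for option #1 would look roughly like this (function/scheme names and boundary dates are placeholders; everything lands on PRIMARY since that's all I have):

```sql
-- Illustrative sketch only: names and boundary dates are made up.
CREATE PARTITION FUNCTION pfSalesDate (SMALLDATETIME)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2008-02-01', '2008-03-01');

CREATE PARTITION SCHEME psSalesDate
AS PARTITION pfSalesDate ALL TO ([PRIMARY]);

-- Option #1: clustered PK on (date, id), aligned with the partition scheme.
CREATE TABLE Sales
(
    [id]    INT NOT NULL,
    [date]  SMALLDATETIME NOT NULL,
    [sales] FLOAT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED ([date], [id])
) ON psSalesDate ([date]);
```

New months would then be added with ALTER PARTITION FUNCTION ... SPLIT RANGE and loaded/unloaded via ALTER TABLE ... SWITCH.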
This table is awesomely narrow. If the real table will be this narrow, you should be happy to have table scans instead of index->lookups.
I would do this:
CREATE TABLE Narrow
(
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL,
PRIMARY KEY(id, date) --EDIT, just noticed your id is not unique.
)
CREATE INDEX CoveringNarrow ON Narrow(date, id, sales)
This handles point queries with seeks and wide-range queries with limited scans against date criteria and id criteria. There is no per-record lookup from index. Yes, I've doubled the write time (and space used) but that's fine, imo.
If there's some need for a specific piece of data (and that need is demonstrated by profiling!!), I'd create a clustered view targeting that section of the table.
CREATE VIEW Narrow200801
WITH SCHEMABINDING  -- required for an indexed view; also no SELECT *, and two-part table names
AS
SELECT [id], [date], [sales] FROM dbo.Narrow
WHERE [date] >= CONVERT(SMALLDATETIME, '20080101', 112)  -- explicit, deterministic conversion
  AND [date] <  CONVERT(SMALLDATETIME, '20080201', 112)
-- This is the command that makes it a "clustered" (indexed) view:
CREATE UNIQUE CLUSTERED INDEX IX_Narrow200801 ON Narrow200801 ([id], [date])
Clustered views can be used in queries by name, or the optimizer will choose to use the clustered view when the FROM and WHERE clauses are appropriate. For example, this query will use the clustered view. Note that the base table is referred to in the query.
SELECT SUM(sales) FROM Narrow WHERE '2008-01-01' <= [date] AND [date] < '2008-02-01'
Just as an index makes specific columns conveniently accessible, a clustered view makes specific rows conveniently accessible.
A clustered index will give you performance benefits for queries by localising the I/O. Date is a traditional partitioning strategy as many D/W queries look at movements by date.
A rule of thumb for a partitioned table suggests that partitions should be around 10m rows in size.
It would be somewhat unusual to see much performance gain from a clustered index on a diverse analytic workload. The query optimiser will use a technique called 'Index Intersection' to select rows without even hitting the fact table. See Here for a post I did on another question that explains this in more depth with some links.
A clustered index may or may not participate in the index intersection, so you may find that it gains you relatively little on a general query workload.
You may find circumstances in loading where clustered indexes give you some gain, particularly if you have derived calculations (such as Earned Premium) that are computed within the ETL process. In this case you may get some benefits. If you have a specific query that you know will be executed all the time it might make sense to use clustered indexes for this. Options #2 and #3 are only going to significantly benefit you if you expect this type of query to be the overwhelming majority of the work done by the application.
For a flexible system, a simple date-range partition with an index on the ID (and on date, if the partitions hold a range of dates) would probably get you performance as good as any. You might get some benefit from clustering the index in limited circumstances. You might also get some mileage from building a cube over the data and ensuring that the aggregations are set up correctly for this query.
If you are using the partitions in the select statements, then you can gain some speed.
If you are not using it, only using "standard" selects, then you have no benefit.
On your original problem: I would recommend option #1, with the non-clustered index on id included.
I would do the following:
Non-Clustered Index on [Id]
Clustered Index on [Date]
Convert the [sales] datatype to numeric instead of float
Partition the table by date. Several horizontal partitions will be more performant than one large table with that many rows.
A clustered index on the date column isn't good if you'll have inserts arriving faster than the datetime resolution of 3.33 ms.
If you do, you'll get two keys with the same value, and your index will have to add an internal uniquifier, which will increase its size.
I'd go with option #2.
Related
I'm looking for a partitioning solution. And, obviously, I'm interested in the performance side of it. That is, I don't want to hurt performance too badly while gaining maintenance benefits.
My question relates to one of my tables, which looks similar to this:
Id bigint IDENTITY PRIMARY KEY CLUSTERED,
DatetimeFiled,
/*
a lot of other fields
*/
According to the data structure and usage, it's suitable to split the table into partitions by the DatetimeFiled (classic), because I have filters by date on this table.
But I do have filters by Id as well. Moreover, I have JOINs that use the Id field as a predicate, which now benefits from its uniqueness (https://www.brentozar.com/archive/2015/08/performance-benefits-of-unique-indexes/).
So, I decided to use (Id, DatetimeFiled) as a UNIQUE CLUSTERED INDEX.
But I doubt whether it will still benefit JOINs on the Id field.
And is it OK to use that field order? I've seen that the partitioning column is usually placed first.
Using a trailing clustered index column as the partition column is a common and useful approach. You can find rows by Id by seeking the Clustered Index in each partition, and you can find rows by DatetimeFiled through partition elimination.
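Sketched out (partition function/scheme names and boundary dates are made up; adjust to your ranges):

```sql
-- Sketch: a monthly date partition with the partition column trailing the clustered key.
CREATE PARTITION FUNCTION pfByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01');

CREATE PARTITION SCHEME psByMonth
AS PARTITION pfByMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.MyTable
(
    Id BIGINT IDENTITY NOT NULL,
    DatetimeFiled DATETIME NOT NULL,
    -- ... a lot of other fields ...
    CONSTRAINT UQ_MyTable UNIQUE CLUSTERED (Id, DatetimeFiled)
) ON psByMonth (DatetimeFiled);
```

Note that on a partitioned table the partition column must be part of any aligned unique index key, which (Id, DatetimeFiled) satisfies.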
I have a large table that I run queries like select date_att > date '2001-01-01' on regularly. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run those queries through explain analyze it still chooses to sequentially scan the table, even on queries as simple as SELECT date_att from table where date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as without the clustering.
It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
Clustering potentially (often) improves query speed for range queries because the selected rows are stored near one another in the heap table. However, nothing guarantees that this order is preserved! Consequently the optimizer cannot assume that it is true.
E.g. if you insert a new row that fulfills your WHERE clause, it might be inserted at any place in the table, e.g. where the rows for 1990 are stored. Hence, this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB and also Oracle where it is called 'Index Organized Table').
In that case, the table data itself is stored in an index structure — there is no separate heap structure! As it is an index, the order is also maintained for each insert/update/delete. Hence your assumption would hold true and indeed I'd expect the above mentioned database to behave as you would expect it (given the date column is the clustering key!).
Hope that clarifies it.
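For completeness, the one-time operation being described looks like this (table and index names assumed from the question):

```sql
-- One-time physical reordering of the heap; NOT maintained on later writes.
CREATE INDEX idx_big_table_date ON big_table (date_att);
CLUSTER big_table USING idx_big_table_date;

-- After many inserts, re-run it (a bare CLUSTER big_table; reuses the same index),
-- and refresh statistics so the planner sees the improved physical correlation:
ANALYZE big_table;
```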
I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it the TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT IDs from tables that are usually much smaller than the Task table (10-50 records max), for example: a Status table (with values like "open", "closed") or a Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time", and every user will be restricted by it. If there are 15 different "Companies" in Task, a logged-in user would only see one. So there's always a filter on this column. Might it be a good idea to add an index on it?
DATE COLUMNS
I think they don't index these ... right? Or can/should they be indexed?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing?
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
In general an index will speed up a JOIN, a sort operation and a filter.
So if the columns are in the JOIN, the ORDER BY or the WHERE clause, then an index will help in terms of performance... but there is always a but... with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down, because the indexes have to be maintained.
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)
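To sketch the Company and Status ideas above (all object and column names here are assumptions about your schema, and the filtered index requires SQL Server 2008+):

```sql
-- Assumed names; adjust to your actual schema.
CREATE NONCLUSTERED INDEX IX_Task_Company ON dbo.Task (Company);

-- Filtered index: only worthwhile if "active" rows are a small fraction of the table.
CREATE NONCLUSTERED INDEX IX_Task_ActiveStatus
    ON dbo.Task (StatusId)
    WHERE StatusId = 1;  -- hypothetical id for the "open"/"active" status
```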
MarketPlane table contains more than 60 million rows.
When I need the total number of planes from a particular date onward, I execute this query, which takes more than 7 min. How can I reduce this time?
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > #date
I have implemented everything mentioned in your links, and I have now also added WITH (NOLOCK), which reduced the response time to 5 min.
You will have to create an index on the table, or maybe partition the table by date.
You might also want to have a look at
SQL Server 2000/2005 Indexed View Performance Tuning and Optimization Tips
SQL Server Indexed Views
Does the table in question have an index on the LaunchDate column? Also, did you really mean to post LaunchDate>#date?
Assuming SQL-Server based on #date, although the same can be applied to most databases.
If your primary query is to select out a range of data (as in your sample), adding or altering the CLUSTERED INDEX will go a long way to improving query times.
See: http://msdn.microsoft.com/en-us/library/ms190639.aspx
By default, SQL-Server creates the Primary Key as the Clustered Index which is great from a transactional point of view, but if your focus is to retrieve the data, then altering that default makes a huge difference.
CREATE CLUSTERED INDEX name ON MarketPlan (LaunchDate DESC)
Note: Assuming LaunchDate is a static date value and is primarily inserted in increasing/sequential order to minimize index fragmentation.
There are some fine suggestions here. If all else fails, consider a little denormalization: create another table with the cumulative counts and update it with a trigger. If you have more queries of this nature, think about OLAP.
Your particular query does not require clustered key on the date column. It would actually run better with nonclustered index with the leading date column because you don't need to do key lookup in this query, so the nonclustered index would be covering and more compact than clustered (it implicitly includes clustered key columns).
If you have it indexed properly and it still does not perform it is most likely fragmentation. In this case defragment the index and try again.
Create a new index like this:
CREATE INDEX xLaunchDate on MarketPlan (LaunchDate, primaryKeyColumn)
Check this nice article about how an index can improve the performance.
http://blog.sqlauthority.com/2009/10/08/sql-server-query-optimization-remove-bookmark-lookup-remove-rid-lookup-remove-key-lookup-part-2/
"WHERE LaunchDate > #date"
Is the value of parameter #date defined in the same batch (or transaction or context)?
If not, then this can lead to a Clustered Index Scan (of all rows) instead of a Clustered Index Seek (of just the rows satisfying the WHERE condition), if its value comes from outside the current batch (for example, as an input parameter of a stored procedure or UDF).
The query cannot be fully optimized by the SQL Server optimizer at compile time, leading to a full table scan, since the value of the parameter is known only at run-time.
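One workaround sketch (assuming #date really is a T-SQL variable/parameter, written here as @date) is to ask for a plan compiled with the actual run-time value:

```sql
-- Recompile so the optimizer sees the actual value of @date at run-time
-- instead of a generic estimate. Trades compile cost for a better plan.
SELECT COUNT(primaryKeyColumn)
FROM MarketPlan
WHERE LaunchDate > @date
OPTION (RECOMPILE);
```

Whether this helps depends on your plan; it mainly matters when the estimate from an unknown parameter value is badly off.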
Update: Comment to answers proposing OLAP.
OLAP is just a concept; SSAS cubes are just one of the possible implementations of OLAP.
It is a convenience, not an obligation, when adopting the OLAP concept.
You do not have to use SSAS to use the OLAP concept.
See, for ex., Simulated OLAP
Update2: Comment to question in comments to answer:
MDX performance vs. T-SQL
MDX is an option/convenience/feature provided by SSAS (cubes/OLAP), not an obligation.
The simplest thing you can do is:
SELECT COUNT(LaunchDate)
FROM MarketPlan
WHERE LaunchDate > #date
This will guarantee you index-only retrieval for any LaunchDate index.
Also (this depends on your execution plan), I have seen instances (but not specific to SQL Server) in which > did a table scan and BETWEEN used an index. If you know the top date you might try WHERE LaunchDate BETWEEN #date AND <<Literal Date>>.
How wide is the table? If the table is wide (ie: many columns of (n)char, (n)varchar or xml) there might be a significant amount of IO causing the query to run slowly as a result of using the clustered index.
To determine if IO is causing the long query time perform the following:
Create a non-clustered index only on the LaunchDate column.
Run the query below which counts LaunchDate and forces the use of the new index.
SELECT COUNT(LaunchDate)
FROM MarketPlan WITH (INDEX = TheNewIndexName)
WHERE LaunchDate > #date
I do not like to use index hints, and I suggest this hint only to prove whether IO is causing the long query times.
There are two ways to do this:
First, create a clustered index on the date column. Since the query is date-range specific, all the data will be in physical order, which avoids having to scan through all records in the table.
Second, you can try horizontal partitioning. This will affect your existing table design, but it is the most optimal approach; see this:
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
I have a very specific query. I tried lots of ways, but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on (user_id, item_start, item_end).
This didn't work, so I dropped all indexes and created new indexes:
one on user_id, and one on (item_start, item_end).
This didn't work either.
(user_id, item_start and item_end are INT columns.)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on item_user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on item_user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will double the storage required for your table, but will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every row. If there are large numbers of matching rows for a particular user ID, and if your clustered index's first column is not item_user_id, then this will likely be a very inefficient operation, because your disk will be trying to fetch lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by item_user_id, item_start, item_end, then that should speed things up. Note that this is not a panacea: if you have other queries which depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing down those other queries.
A less impactful solution would be to create a covering index which contains only the columns you want (also ordered by item_user_id, item_start, item_end, and then add the other columns you need). Then change your query to only pull back the columns you need, instead of using SELECT *.
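Since the question's edit mentions MySQL 5.1/InnoDB, a sketch of such a covering index might be (index name is illustrative; in 5.1 there is no INCLUDE clause, so you cover the query by putting the needed columns in the key, and InnoDB secondary indexes implicitly carry the primary key):

```sql
-- Covering index for queries of the form:
--   SELECT needed_cols FROM items
--   WHERE user_id = 1 AND item_start < 20000 AND item_end > 30000;
CREATE INDEX idx_items_user_range
    ON items (user_id, item_start, item_end);
```

Only the user_id = 1 equality and one range bound can be used for the seek; the remaining range condition is filtered within the index, which is still far cheaper than row lookups.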
If you could post more info about the DBMS brand and version and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on (user_id, item_start, item_end) with the fields you need in the SELECT part as included columns. This all assumes you're using Microsoft SQL Server 2005+.