I would like to add clustered columnstore indexes to some tables in a SQL Server 2014 database. Before doing so, I need to gather a good estimate of required memory. How can I predict clustered columnstore memory usage?
Things I know:
The size of the tables on disk
How the tables will be queried
The growth rate of these tables on disk
You will find an answer here - source - under the title "Memory Usage".
I'd rather not copy & paste the relevant section as the Terms Of Use of sqlservercentral.com state "You are not permitted to copy or use any of the Redgate Materials for any purpose." Though presumably the Terms Of Use themselves are exempt from this condition. :-)
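As a rough, hedged starting point while you read that section: for the memory grant needed just to build a columnstore index, a formula along the lines of ((4.2 x number of columns) + 68) x DOP + (number of string columns x 34) MB has appeared in Microsoft's columnstore documentation for this era of the product. The sketch below computes it from the catalog views; the table name and DOP value are placeholders and the coefficients should be treated as approximate.

DECLARE @dop int = 8;   -- assumed max degree of parallelism for the index build

SELECT
    t.name AS table_name,
    COUNT(*) AS column_count,
    SUM(CASE WHEN ty.name IN ('char','varchar','nchar','nvarchar') THEN 1 ELSE 0 END) AS string_column_count,
    ((4.2 * COUNT(*)) + 68) * @dop
        + SUM(CASE WHEN ty.name IN ('char','varchar','nchar','nvarchar') THEN 1 ELSE 0 END) * 34 AS estimated_build_grant_mb
FROM sys.tables AS t
JOIN sys.columns AS c ON c.object_id = t.object_id
JOIN sys.types AS ty ON ty.user_type_id = c.user_type_id
WHERE t.name = 'TableA'   -- hypothetical table name
GROUP BY t.name;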
Related
Most SQL relational databases support the concept of a clustered index in a table. A clustered index, usually implemented as a B-tree, represents the actual records in a given table, physically ordered by that index on disk/storage. One advantage of this special clustered index is that after traversing the B-tree in search for a record or set of records, the actual data can be found immediately at the leaf nodes.
This stands in contrast to a non-clustered index. A non-clustered index exists outside the clustered index and orders its own entries by one or more columns. But its leaf nodes may not contain all the columns needed by the query. In that case, the database has to do an additional disk seek back into the base data to get this information.
In most database resources I have seen on Stack Overflow and elsewhere, this additional disk seek is viewed as a substantial performance penalty. My question is how would this analysis change assuming that all database files were stored on a solid state drive (SSD)?
From the Wikipedia page for SSDs, the random access time for SSDs is less than 0.1 ms, while random access times for mechanical hard disks are typically 10-100 times slower.
Do SSDs narrow the gap between clustered and non clustered indices, such that the former become less important for overall performance?
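To make the extra lookup concrete, here is a hypothetical SQL Server table and query where the non-clustered index alone cannot satisfy the request, so each match requires an additional read into the clustered index:

-- Hypothetical schema: the non-clustered index on LastName does not contain
-- Email, so every matching row needs a key lookup into the clustered index.
CREATE TABLE dbo.Customers
(
    CustomerID int IDENTITY PRIMARY KEY CLUSTERED,
    LastName   nvarchar(100) NOT NULL,
    Email      nvarchar(256) NULL
);

CREATE NONCLUSTERED INDEX IX_Customers_LastName
    ON dbo.Customers (LastName);

-- The seek on IX_Customers_LastName finds the qualifying keys, but Email must
-- still be fetched from the clustered index leaf pages (the extra "seek").
SELECT LastName, Email
FROM dbo.Customers
WHERE LastName = N'Smith';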
First of all, a clustered index does not guarantee that the rows are physically stored in index order. InnoDB for example can store the clustered index in a non-sequential way. That is, two database pages containing consecutive rows of the table might be stored physically close to each other, or far apart in the tablespace, and in either order. The B-tree data structure for the clustered index has pointers to the leaf pages, but they don't have to be stored in any order.
SSD is helpful for speeding up IO-based operations, particularly those involving disk seeks. It's way faster than a spinning magnetic disk. But RAM is still a couple of orders of magnitude faster than the best SSD.
The 2018 numbers:
Disk seek: 3,000,000ns
SSD random read: 16,000ns
Main memory reference: 100ns
RAM still trumps durable storage by a wide margin. If your dataset (or at least the active subset of your dataset) fits in RAM, you won't need to worry about the difference between magnetic disk storage and SSD storage.
Re your comment:
The clustered index helps because when a primary key lookup searches through the B-tree and finds a leaf node, right there are all the other fields of the row associated with that primary key value.
Compare with MyISAM, where a primary key index is separate from the rows of the table. A query searches the B-tree of the primary key index, and at the leaf node finds a pointer to the location in the data file where the corresponding row is stored. So it has to do a second seek into the data file.
This does not necessarily mean that the clustered index in InnoDB is stored consecutively. It might need to skip around a bit to read all the pages of the tablespace. This is why it's so helpful to have the pages in RAM in the buffer pool.
First, the additional disk seek is not really a "killer". This can be a big issue in high transaction environments where microseconds and milliseconds count. However, for longer running queries, it will make little difference.
This is especially true if the database intelligently does "look ahead" disk seeks. Databases are often not waiting for data because another thread is predicting what pages will be needed and working on bringing those back. This is usually done by just taking the "next" pages on a sequential scan.
SSDs are going to speed up pretty much all operations. They do change the optimization parameters. In particular, I think they are comparably fast on throughput (although I don't keep up with the technology specifically). Their big win is in latency -- the time from issuing the request for a disk block and the time when it is retrieved.
In my experience (which is a few years old), the performance using SSD was comparable to an in-memory database for most operations.
Whether this makes clustered indexes redundant is another matter. A key place where they are used is when you want to separate a small, related set of rows (say, the "undeleted" ones) from a much larger set. By putting them in the same data pages, the clustered index reduces the overall number of pages being read -- it doesn't just make the reads faster.
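A minimal sketch of that idea, with hypothetical table and column names: clustering on the soft-delete flag first keeps the small active set packed into a few contiguous pages.

-- Hypothetical example: most rows are soft-deleted and only a small fraction
-- is active. Clustering on (IsDeleted, OrderID) packs the active rows into
-- adjacent pages, so queries over them read far fewer pages.
CREATE TABLE dbo.Orders
(
    OrderID   int       NOT NULL,
    IsDeleted bit       NOT NULL DEFAULT 0,
    Placed    datetime2 NOT NULL,
    Amount    money     NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX CIX_Orders_IsDeleted_OrderID
    ON dbo.Orders (IsDeleted, OrderID);

-- Touches only the pages holding undeleted rows, rather than rows scattered
-- across the whole table.
SELECT OrderID, Amount
FROM dbo.Orders
WHERE IsDeleted = 0;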
Just some suggestions (too broad for a simple comment).
Bearing in mind that everything depends on how the keys are distributed in the non-clustered index and across its nodes (which is essentially random and can only be assessed in average terms), the fact remains that every access benefits from the performance of the SSD. The gain is not linear, but it is still substantial. On average it will not be a flat factor of 1 to 100, precisely because of that randomness, but every time a random access does occur it is roughly 100 times faster, so the more random the access pattern, the greater the benefit.
There is, however, a basic fact: every disk operation becomes much more efficient, so in general a non-clustered index ends up operating in a near-optimal context.
Taking this into account, the gap should shrink dramatically, and it does so thanks to the whole storage stack that the database sits on, from the logical files that make it up down to the physical sectors where the data is actually stored.
I just learned about the wonders of columnstore indexes and how you can "Use the columnstore index to achieve up to 10x query performance gains over traditional row-oriented storage, and up to 7x data compression over the uncompressed data size."
With such sizable performance gains, is there really any reason to NOT use them?
The main disadvantage is that you'll have a hard time reading only a part of the index if the query contains a selective predicate. There are ways to do it (partitioning, segment elimination), but those are neither particularly easy to implement reliably nor do they scale to complex requirements.
For scan-only workloads columnstore indexes are pretty much ideal.
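A minimal sketch of one of those workarounds, segment elimination, assuming hypothetical table and column names: load the rows in the order of the filter column before building the clustered columnstore, so that a selective predicate can skip whole segments.

-- Hypothetical sketch (dbo.Sales, OrderDate, Amount are made-up names).
-- First order the rows with a clustered B-tree on the filter column...
CREATE CLUSTERED INDEX CIX_Sales ON dbo.Sales (OrderDate);

-- ...then convert it to a clustered columnstore, preserving that load order.
CREATE CLUSTERED COLUMNSTORE INDEX CIX_Sales ON dbo.Sales WITH (DROP_EXISTING = ON);

-- With rowgroups roughly ordered by OrderDate, a selective date predicate can
-- eliminate most segments instead of scanning all of them.
SELECT SUM(Amount)
FROM dbo.Sales
WHERE OrderDate >= '20140101' AND OrderDate < '20140201';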
Columnstore indexes are especially beneficial for data warehousing (DW), meaning workloads where you only perform updates or deletes at certain times.
This is due to their special design, with delta stores, batch loading and other features. This video gives a nice, detailed yet basic overview of what exactly the difference is: Columnstore Index.
Traditional
If, however, your application has a high volume of I/O (input and output), a columnstore index is not ideal, since traditional row indexing finds and manipulates the specific target rows located through the index. An example of this would be an ATM application which frequently changes the values in the rows of a given person's accounts.
ColumnStore
Columnstore indexing indexes across the COLUMNS, which is not ideal in this case since a given row's values are spread across the column segments.
I highly recommend the video!
I also want to elaborate on non-clustered vs clustered columnstore indexes:
A non-clustered columnstore index (introduced in SQL Server 2012) stores the whole data set again, meaning twice the data (2x).
A clustered columnstore index (introduced in SQL Server 2014), by contrast, took up only about 5 MB for roughly 16 GB of data in the demo. This is due to run-length encoding (RLE), which collapses the duplicate values in each column, so the index takes up far less storage.
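If you want to verify those numbers on your own data, a hedged sketch (dbo.FactSales is a made-up name) that compares the total space used with the space used by each index, including any columnstore:

EXEC sp_spaceused 'dbo.FactSales';

-- Space used per index (rowstore and columnstore) for the hypothetical table.
SELECT
    i.name AS index_name,
    i.type_desc,
    SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS ps
    ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.FactSales')
GROUP BY i.name, i.type_desc;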
Hello. A very detailed explanation of columnstore indexes can be found here.
ColumnStore Index
A columnstore index is a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore.
This feature was introduced with SQL Server 2012 and is intended to significantly speed up the processing time of common data warehousing queries. The main objective of columnstore indexes is to suit typical data warehousing data sets and to improve query performance whenever data is pulled from huge datasets.
They are column-based indexes capable of transforming the data warehousing experience for users by enabling faster performance for common data warehousing queries such as filtering, aggregating, grouping and star-join queries. They store the data column-wise instead of row-wise, as traditional indexes do.
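As a hedged illustration with hypothetical star-schema names, the kind of query these indexes are aimed at reads only a handful of columns from a very wide fact table and aggregates over a huge number of rows:

-- Hypothetical star-join / aggregation query of the kind columnstore indexes
-- are designed for: only three columns of the wide fact table are read.
SELECT
    d.CalendarYear,
    p.Category,
    SUM(f.SalesAmount) AS total_sales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON d.DateKey    = f.DateKey
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY d.CalendarYear, p.Category;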
My company is moving to SQL Server 2008 R2. We have a table with tons of archive data. The majority of the queries that use this table filter on a DateTime value in the WHERE clause. For example:
Query 1
SELECT COUNT(*)
FROM TableA
WHERE
CreatedDate > '1/5/2010'
and CreatedDate < '6/20/2010'
I'm making the assumption that partitions are created on CreatedDate and each partition is spread out across multiple drives, we have 8 CPUs, and there are 500 million records in the database that are evenly spread out across the dates from 1/1/2008 to 2/24/2011 (38 partitions). This data could also be partitioned into quarters of a year or other time durations, but let's keep the assumption to months.
In this case I would believe that all 8 CPUs would be utilized, and only the 6 partitions covering the dates between 1/5/2010 and 6/20/2010 would be queried.
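For reference, here is a hedged sketch of the kind of monthly partitioning being assumed, with made-up object names and only a few boundary values shown (the real function would list one boundary per month from 1/1/2008 onwards):

CREATE PARTITION FUNCTION pfCreatedDateMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01',
                           '2010-04-01', '2010-05-01', '2010-06-01');

CREATE PARTITION SCHEME psCreatedDateMonthly
AS PARTITION pfCreatedDateMonthly ALL TO ([PRIMARY]);   -- or spread across filegroups/drives

-- Rebuilding the clustered index on the scheme places each month in its own partition.
CREATE CLUSTERED INDEX CIX_TableA_CreatedDate
    ON dbo.TableA (CreatedDate)
    ON psCreatedDateMonthly (CreatedDate);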
Now what if I ran the following query and my assumptions are the same as above.
Query 2
SELECT COUNT(*)
FROM TableA
WHERE State = 'Colorado'
Questions?
1. Will all partitions be queried? Yes
2. Will all 8 CPUs be used to execute the query? Yes
3. Will performance be better than querying a table that is not partitioned? Yes
4. Is there anything else I'm missing?
5. How would Partition Index help?
I answered the first 3 questions above based on my limited knowledge of SQL Server 2008 Partitioned Table & Parallelism. But if my answers are incorrect, can you provide feedback on why I'm incorrect?
Resource:
Video: Demo SQL Server 2008 Partitioned Table Parallelism (5 minutes long)
MSDN: Partitioned Tables and Indexes
MSDN: Designing Partitions to Manage Subsets of Data
MSDN: Query Processing Enhancements on Partitioned Tables and Indexes
MSDN: Word Doc: Partitioned Table and Index Strategies Using SQL Server 2008 white paper
BarDev
Partitioning is never an option for improving performance. The best you can hope for is on-par performance with the non-partitioned table. Usually you get a regression that increases with the number of partitions. For performance you need indexes, not partitions. Partitions are for data management operations: ETL, archival, etc. Some claim partition elimination as a possible performance gain, but anything partition elimination can give, placing the leading index key on the same column as the partitioning column will give with much better results.
Will all partitions be queried?
That query needs an index on State. Otherwise it is a table scan, and it will scan the entire table. A table scan over a partitioned table is always slower than a scan over a non-partitioned table of the same size. The index itself can be aligned on the same partition scheme, but the leading key must be State.
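A hedged sketch of the index being described, with placeholder names (psCreatedDateMonthly stands in for whatever partition scheme the table uses):

-- Leading key is State so the predicate can seek; creating the index on the
-- partition scheme keeps it aligned, which partition SWITCH requires.
CREATE NONCLUSTERED INDEX IX_TableA_State
    ON dbo.TableA (State)
    ON psCreatedDateMonthly (CreatedDate);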
Will all 8 CPUs be used to execute the query?
Parallelism has nothing to do with partitioning, despite the common misconception to the contrary. Both partitioned and non-partitioned range scans can use a parallel operator; it is the Query Optimizer's decision.
Will performance be better than querying a table that is not partitioned?
No
How would Partition Index help?
An index will help. If the index has to be aligned, then it must be partitioned. A non-partitioned index will be faster than a partitioned one, but the index alignment requirement for switch-in/switch-out operations cannot be circumvented.
If you're looking at partitioning, it should be because you need to do fast switch-in switch-out operations to delete old data past retention policy period or something similar. For performance, you need to look at indexes, not at partitioning.
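A hedged sketch of the switch-out operation meant here, with hypothetical names; the staging table must match the source table's structure exactly and live on the same filegroup as the partition being switched:

-- Staging table with (in reality) identical columns, constraints and indexes
-- to dbo.TableA; only a couple of columns are sketched here.
CREATE TABLE dbo.TableA_SwitchOut
(
    CreatedDate datetime    NOT NULL,
    State       varchar(50) NULL
    -- ... remaining columns ...
);

-- Metadata-only operation: the oldest partition's rows move instantly.
ALTER TABLE dbo.TableA
    SWITCH PARTITION 1 TO dbo.TableA_SwitchOut;

-- Archive the switched-out data, or simply drop it.
DROP TABLE dbo.TableA_SwitchOut;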
Partitioning can increase performance--I have seen it many times. The reason partitioning was developed was and is performance, especially for inserts. Here is an example from the real world:
I have multiple tables on a SAN with one big ole honking disk as far as we can tell. The SAN administrators insist that the SAN knows all so will not optimize the distribution of data. How can a partition possibly help? Fact: it did and does.
We partitioned multiple tables using the same scheme (FileID % 200) with 200 partitions, ALL on PRIMARY. What use would that be if the only reason to have a partitioning scheme is for "swapping"? None, but the purpose of partitioning is performance. You see, each of those partitions has its own paging scheme. I can write data to all of them at once and there is no possibility of a deadlock. The pages cannot be locked because each writing process has a unique ID that equates to a partition. 200 partitions increased performance 2000x (fact) and deadlocks dropped from 7500 per hour to 3-4 per day. This is for the simple reason that page lock escalation always occurs with large amounts of data in a high-volume OLTP system, and page locks are what cause deadlocks. Partitioning, even on the same volume and filegroup, places the partitioned data on different pages, and lock escalation has no effect since processes are not attempting to access the same pages.
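A minimal sketch of that kind of scheme, with made-up names and only 4 partitions instead of 200 to keep the boundary list short:

-- Hash-style partitioning on FileID % N spreads concurrent inserts across
-- partitions so writers do not contend for the same pages.
CREATE PARTITION FUNCTION pfFileIdMod (int)
AS RANGE LEFT FOR VALUES (0, 1, 2);          -- 4 partitions; the real scheme used 200

CREATE PARTITION SCHEME psFileIdMod
AS PARTITION pfFileIdMod ALL TO ([PRIMARY]);

CREATE TABLE dbo.FileData
(
    FileID       int            NOT NULL,
    PartitionKey AS (FileID % 4) PERSISTED NOT NULL,   -- FileID % 200 in the real scheme
    Payload      varbinary(max) NULL
) ON psFileIdMod (PartitionKey);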
The benefit is there, but not as great, for selecting data. Typically, though, the partitioning scheme would be developed with the purpose of the DB in mind. I am betting Remus developed his scheme with incremental loading (such as daily loads) rather than transactional processing in mind. Now if one were frequently selecting rows with locking (read committed) then deadlocks could result if processes attempted to access the same page simultaneously.
But Remus is right--in your example I see no benefit, in fact there may be some overhead cost in finding the rows across different partitions.
The very first question I have is whether your table has a clustered index on it. If not, you'll want one.
Also, you'll want a covering index for your queries. Covering Indexes
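A hedged sketch of what that means for the table in the question (assuming, hypothetically, a query that filters on CreatedDate and also returns State):

-- Hypothetical covering index: the date-range query can be answered entirely
-- from the index leaf pages, with State supplied via INCLUDE so no lookup
-- back into the clustered index is needed.
CREATE NONCLUSTERED INDEX IX_TableA_CreatedDate_State
    ON dbo.TableA (CreatedDate)
    INCLUDE (State);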
If you have a lot of historical data you might look into an archiving process to help speed up your oltp applications.
I was asked by an interviewer about SQL Server. The scenario was: we have a table with millions of records. The table has a primary key, and clustered and non-clustered indexes as well. Still, the data is being fetched slowly. What do we need to do in this case?
Please kindly give me the answer.
regards,
murli
Limited information, but any of these could be attempted.
Write more efficient queries.
Buy more hardware.
Index the columns that are more appropriate for your queries.
Place this table's file on a more efficient RAID controller type.
Partition the table.
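For the indexing suggestion, one hedged starting point is to ask SQL Server which indexes its recent query plans wished they had; treat the output as hints to evaluate, not commands:

SELECT TOP (20)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;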
Pretty much:
Assuming the indexing is already as efficient as it can be...
Better hardware.
That would be either more memory (I have seen servers with 128 GB of memory) or a WAY faster disk subsystem. For DB servers, disks are often bought not for space but for IO. I have seen a server with 190 attached disks ;)
No index maintenance (see the fragmentation check after this list)
Poor hardware or not configured correctly
uniqueidentifier as clustered index
bad datatypes (too wide, ints in varchar, (max) types etc)
poor design (EAV?)
useless index
statistics disabled
fill factor of 5%
...etc
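For the first item in that list, a hedged sketch of checking whether index maintenance has been neglected (the thresholds are arbitrary placeholders):

-- Heavily fragmented, non-trivial indexes in the current database usually
-- mean index maintenance has been skipped.
SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
  AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- A fragmented index can then be rebuilt, e.g.:
-- ALTER INDEX IX_SomeIndex ON dbo.SomeTable REBUILD;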
So, it seems to me like a query on a table with 10k records and a query on a table with 10 million records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes (an auto-increment, record-id type of indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database-related) depends on many factors, including your access patterns, the ratio of reads to writes and the amount of available RAM.
If you can fit your 'hot' data (i.e. the frequently accessed index pages) into memory, then accesses will generally be fast.
The strategy used to index very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key, then there will be no improvement in performance over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not a binary tree, which only has two children per node), so a 10k-row table might have 2 levels whereas a 10B-row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read, so it is probably not noticeable.
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options, which can allow you to index part of a table or the output of an immutable function on tuples in the table. Also, the size of the index, assuming it is usable, will affect speed far less than the actual scan of the table will. The biggest difference is between searching a tree and scanning a list. Of course you still have the disk I/O and memory overhead that go into index usage, so very large indexes don't perform as well as they theoretically could.
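As a hedged illustration of those options in PostgreSQL syntax (table and column names are made up): a partial index covering only part of the table, and an expression index on the output of an immutable function.

-- Partial index: only the rows you actually query are indexed, keeping the
-- index small even on a huge table.
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'open';

-- Expression index: indexes the result of an immutable function instead of
-- the raw column, so WHERE lower(email) = ... can use it.
CREATE INDEX idx_users_email_lower ON users (lower(email));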