Does Clustered Index in U-SQL table impact parallelism? - azure-data-lake

We are working with U-SQL tables and have questions related to the clustered index. In a U-SQL table, parallelism is managed by how data is partitioned and distributed. Does the clustered index impact parallelism in a U-SQL table as well? Secondly, how does it manage data skew in a distribution bucket?

The clustered index does not impact parallelism per se, but it does determine whether the data is read using an index seek or an index scan, depending on the query predicate. So it impacts the performance of accessing the data inside a vertex.
Data skew is not managed. If you have skew, you will have to either find a better distribution key, use a SKEWFACTOR hint, or use ROUND ROBIN distribution.
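For illustration, a minimal U-SQL sketch (table, column, and index names are hypothetical) showing that the clustered index and the distribution scheme are declared separately, with ROUND ROBIN as the skew-proof fallback:

```sql
// The distribution scheme determines the unit of parallelism (the buckets);
// the clustered index only orders the data physically inside each bucket.
CREATE TABLE dbo.Events
(
    EventId   long,
    UserId    string,
    EventTime DateTime,
    INDEX idx_events CLUSTERED (EventTime ASC)
    DISTRIBUTED BY HASH (UserId)   // skew-prone if a few UserIds dominate
 // DISTRIBUTED BY ROUND ROBIN     // alternative: even buckets, no skew
);
```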

Related

By how much do SSDs narrow the performance gap between clustered and non clustered indices?

Most SQL relational databases support the concept of a clustered index in a table. A clustered index, usually implemented as a B-tree, represents the actual records in a given table, physically ordered by that index on disk/storage. One advantage of this special clustered index is that after traversing the B-tree in search for a record or set of records, the actual data can be found immediately at the leaf nodes.
This stands in contrast to a non clustered index. A non clustered index exists outside the clustered index, and also orders the underlying data using one or more columns. But, the leaf nodes may not have data for all the columns needed in the query. In this case, the database has to do a disk seek to the original data to get this information.
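As a hedged T-SQL sketch (table and column names are hypothetical), this is the difference between an index that forces the extra lookup and a covering one that avoids it:

```sql
-- Leaf pages hold only (LastName, clustering key); fetching Email forces
-- a lookup back into the clustered index or heap.
CREATE NONCLUSTERED INDEX ix_users_lastname
    ON dbo.Users (LastName);

-- Covering variant: Email is carried in the leaf pages, so no second seek.
CREATE NONCLUSTERED INDEX ix_users_lastname_covering
    ON dbo.Users (LastName) INCLUDE (Email);

SELECT LastName, Email
FROM dbo.Users
WHERE LastName = N'Smith';
```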
In most database resources I have seen on Stack Overflow and elsewhere, this additional disk seek is viewed as a substantial performance penalty. My question is how would this analysis change assuming that all database files were stored on a solid state drive (SSD)?
From the Wikipedia page for SSDs, the random access time for SSDs is less than 0.1 ms, while random access times for mechanical hard disks are typically 10-100 times slower.
Do SSDs narrow the gap between clustered and non clustered indices, such that the former become less important for overall performance?
First of all, a clustered index does not guarantee that the rows are physically stored in index order. InnoDB for example can store the clustered index in a non-sequential way. That is, two database pages containing consecutive rows of the table might be stored physically close to each other, or far apart in the tablespace, and in either order. The B-tree data structure for the clustered index has pointers to the leaf pages, but they don't have to be stored in any order.
SSD is helpful for speeding up IO-based operations, particularly those involving disk seeks. It's way faster than a spinning magnetic disk. But RAM is still a couple of orders of magnitude faster than the best SSD.
The 2018 numbers:
Disk seek: 3,000,000 ns
SSD random read: 16,000 ns
Main memory reference: 100 ns
RAM still trumps durable storage by a wide margin. If your dataset (or at least the active subset of your dataset) fits in RAM, you won't need to worry about the difference between magnetic disk storage and SSD storage.
Re your comment:
The clustered index helps because when a primary key lookup searches through the B-tree and finds a leaf node, right there are all the other fields of the row associated with that primary key value.
Compare with MyISAM, where a primary key index is separate from the rows of the table. A query searches the B-tree of the primary key index, and at the leaf node finds a pointer to the location in the data file where the corresponding row is stored. So it has to do a second seek into the data file.
This does not necessarily mean that the clustered index in InnoDB is stored consecutively. It might need to skip around a bit to read all the pages of the tablespace. This is why it's so helpful to have the pages in RAM in the buffer pool.
First, the additional disk seek is not really a "killer". This can be a big issue in high transaction environments where microseconds and milliseconds count. However, for longer running queries, it will make little difference.
This is especially true if the database intelligently does "look ahead" disk seeks. Databases are often not waiting for data because another thread is predicting what pages will be needed and working on bringing those back. This is usually done by just taking the "next" pages on a sequential scan.
SSDs are going to speed up pretty much all operations. They do change the optimization parameters. In particular, I think they are comparably fast on throughput (although I don't keep up with the technology specifically). Their big win is in latency -- the time from issuing the request for a disk block and the time when it is retrieved.
In my experience (which is a few years old), the performance using SSD was comparable to an in-memory database for most operations.
Whether this makes cluster indexes redundant is another matter. A key place where they are used is when you want to separate a related small amount of rows (say "undeleted") from a larger amount. By putting them in the same data pages, the clustered index reduces the overall number of rows being read -- it doesn't just make the reads faster.
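A minimal T-SQL sketch of that idea, assuming a hypothetical Orders table with an IsDeleted flag: leading the clustering key with the flag packs the small live subset into few pages:

```sql
-- Live rows (IsDeleted = 0) cluster together on disk, so scanning them
-- reads far fewer pages than under a clustered index on OrderId alone.
CREATE CLUSTERED INDEX cix_orders
    ON dbo.Orders (IsDeleted, OrderId);

SELECT OrderId, Total
FROM dbo.Orders
WHERE IsDeleted = 0;
```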
Just some suggestions (too broad for a simple comment):
Everything depends on the distribution of the keys in the non clustered index and in the respective nodes, which is essentially random and can only be assessed in average terms. The fact remains that every access benefits from the performance of the SSD. The speed-up is not linear, but it is nonetheless substantial: on average it will not be a full factor of 1 to 100, precisely because of the randomness of the distribution, but the more random the access pattern, the closer you get to that 100x advantage.
There is, however, an underlying fact: every disk operation becomes much more efficient, so in general a non clustered index ends up operating in a near-optimal context.
Taking this into account, the gap should be radically reduced, thanks to the environment in which the whole file system underlying the database exists: from access to the logical files that compose it down to the physical sectors in which the data is actually stored.

How much memory will my columnstore indexes need?

I would like to add clustered columnstore indexes to some tables in a SQL Server 2014 database. Before doing so, I need to gather a good estimate of required memory. How can I predict clustered columnstore memory usage?
Things I know:
The size of the tables on disk
How the tables will be queried
The growth rate of these tables on disk
You will find an answer here - source - under the title "Memory Usage".
I'd rather not copy & paste the relevant section as the Terms Of Use of sqlservercentral.com state "You are not permitted to copy or use any of the Redgate Materials for any purpose." Though presumably the Terms Of Use themselves are exempt from this condition. :-)
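As a rough sketch only (the figures below are from memory of the Microsoft columnstore guidance, so verify them against the linked article), the build-time memory grant is often estimated from the column counts and the degree of parallelism:

```sql
-- grant_MB ~= ((4.2 * column_count) + 68) * DOP + string_column_count * 34
-- Plug in your own table's numbers; @dop is the degree of parallelism.
DECLARE @cols int = 20, @string_cols int = 4, @dop int = 8;
SELECT ((4.2 * @cols) + 68) * @dop + @string_cols * 34 AS estimated_grant_mb;
```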

Columnstore index proper usage

I just learned about the wonders of columnstore indexes and how you can "Use the columnstore index to achieve up to 10x query performance gains over traditional row-oriented storage, and up to 7x data compression over the uncompressed data size."
With such sizable performance gains, is there really any reason to NOT use them?
The main disadvantage is that you'll have a hard time reading only a part of the index if the query contains a selective predicate. There are ways to do it (partitioning, segment elimination) but those are neither particularly easy to reliably implement nor do they scale to complex requirements.
For scan-only workloads columnstore indexes are pretty much ideal.
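One hedged sketch of the segment-elimination approach mentioned above (hypothetical table, SQL Server 2014 syntax): load or rebuild the data pre-sorted on the predicate column so each rowgroup covers a narrow range:

```sql
-- Force a physical order first, then convert to columnstore in place.
CREATE CLUSTERED INDEX cix_sales ON dbo.Sales (SaleDate);
CREATE CLUSTERED COLUMNSTORE INDEX cix_sales ON dbo.Sales
    WITH (DROP_EXISTING = ON);

-- The per-rowgroup min/max metadata now lets this scan skip most segments.
SELECT SUM(Amount) FROM dbo.Sales WHERE SaleDate >= '2014-01-01';
```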
Columnstore indexes are especially beneficial for data warehousing (DW), meaning that you will only perform updates or deletes at certain times.
This is due to their special design with delta stores and related features. This video gives a detailed but accessible overview of what the exact difference is: Columnstore Index.
Traditional
If, however, the application has high I/O (frequent reads and writes), a columnstore index is not ideal, since traditional row indexing will find and manipulate the specific target rows located through the index. An example of this would be an ATM application which frequently changes the values in the rows of a given person's accounts.
ColumnStore
Columnstore indexing indexes across the COLUMNS, which is not ideal in this case, since a row's values will be spread throughout the column segments.
I highly recommend the video!
I want to also elaborate on the non-clustered vs clustered columnstore:
A non-clustered columnstore index (introduced in 2012) saves the whole data again, meaning twice the data (2x).
A clustered columnstore index (introduced in 2014), by contrast, takes up only about 5 MB for roughly 16 GB of data. This is due to run-length encoding (RLE), which collapses duplicate data in each column, making the index take up far less extra storage.
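For concreteness, a hedged sketch of the two variants (hypothetical fact table; note they are alternatives, tied to the 2012 and 2014 feature sets respectively):

```sql
-- SQL Server 2012: a nonclustered columnstore is an extra, second copy of
-- the selected columns, and it makes the table read-only while it exists.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_factsales
    ON dbo.FactSales (DateKey, ProductKey, Amount);

-- SQL Server 2014: a clustered columnstore replaces the base storage
-- itself, so there is no duplicate copy and the table remains updatable.
CREATE CLUSTERED COLUMNSTORE INDEX cci_factsales ON dbo.FactSales;
```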
A very detailed explanation of the columnstore index can be found here.
ColumnStore Index
A columnstore index is a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore.
This feature was introduced with SQL Server 2012 and is intended to significantly speed up the processing time of common data warehousing queries. The main objective of columnstore indexes is to suit typical data warehousing data sets and to improve query performance whenever data is pulled from huge datasets.
They are column-based indexes capable of transforming the data warehousing experience for users by enabling faster performance for common data warehousing queries such as filtering, aggregating, grouping, and star-join queries. They store the data column-wise instead of row-wise, as traditional indexes do.

SQL SERVER 2012 ColumnStore Index

When we create a columnstore index on a huge table, does it use separate physical storage on disk to store the columnstore index, or does it change the storage structure of the base table from row storage to column storage?
My question is: when we create a normal index on a table, it stores the indexed column data in a B-tree, using separate storage, without affecting the base table. Does a columnstore index work the same way?
Only nonclustered columnstore indexes are supported in SQL Server 2012, so the table itself will not be reorganized.
http://msdn.microsoft.com/en-us/library/gg492153.aspx
NONCLUSTERED
Creates a columnstore index that specifies the logical ordering of a table. Clustered columnstore indexes are not supported.
Indexes (with the exception of the clustered index, which is the table itself) are stored in separate locations. They can have their own fill factor (space left for further inserts without the tree becoming too unbalanced) and can even be stored on separate drives: CREATE INDEX ... ON PRIMARY, SECONDARY, etc. You have to create the SECONDARY and further files before creating the index and allocating it to the file; indexes are allocated to the logical file name. You can reduce costs and increase speed by having these on single rather than RAID drives, as in case of failure the index can be rebuilt without data loss. http://msdn.microsoft.com/en-us/library/ms188783.aspx and http://msdn.microsoft.com/en-us/library/gg492088.aspx
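A hedged T-SQL sketch of placing an index on its own filegroup (database name, file path, and index names are hypothetical):

```sql
ALTER DATABASE SalesDb ADD FILEGROUP INDEXFG;
ALTER DATABASE SalesDb ADD FILE
    (NAME = SalesIdx1, FILENAME = 'E:\idx\SalesIdx1.ndf')
    TO FILEGROUP INDEXFG;

-- The index pages now live on the new filegroup, separate from the table.
CREATE NONCLUSTERED INDEX ix_orders_customer
    ON dbo.Orders (CustomerId) ON INDEXFG;
```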

To what degree can effective indexing overcome performance issues with VERY large tables?

So, it seems to me that a query on a table with 10k records and a query on a table with 10 million records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes (an auto-increment, record-id type of indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow, because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, including your access patterns, the ratio of reads to writes, and the size of available RAM.
If you can fit your 'hot' data (i.e. frequently accessed index pages) into memory, then accesses will generally be fast.
The strategy used to index very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key, then there will be no improvement in performance over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
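To make the partition-elimination point concrete, here is a hedged T-SQL sketch (hypothetical names, monthly ranges); only queries filtering on the partition key can skip partitions:

```sql
CREATE PARTITION FUNCTION pf_month (date)
    AS RANGE RIGHT FOR VALUES ('2014-01-01', '2014-02-01', '2014-03-01');

CREATE PARTITION SCHEME ps_month
    AS PARTITION pf_month ALL TO ([PRIMARY]);

-- Aligned clustered index: a filter on OrderDate touches only the
-- relevant partitions; a filter on CustomerId alone scans them all.
CREATE CLUSTERED INDEX cix_orders
    ON dbo.Orders (OrderDate, OrderId) ON ps_month (OrderDate);
```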
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not binary tree which only has two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read so is probably not noticeable.
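The level counts above fall out of the fanout arithmetic: assuming roughly 500 keys per page (a made-up but plausible figure), depth is about log base 500 of the row count, which you can check with a quick query (the log ratio is base-independent, so this works in Postgres or SQL Server):

```sql
-- depth ~= ceil(log(rows) / log(fanout)), with a fanout assumed to be 500
SELECT CEILING(LOG(1e4)  / LOG(500.0)) AS levels_10k,   -- ~2
       CEILING(LOG(1e10) / LOG(500.0)) AS levels_10b;   -- ~4
```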
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume you mean "end of the table" in terms of inserting the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options which allow you to index part of a table, or the output of an immutable function applied to tuples in the table. Also, the size of the index, assuming it is usable, affects speed much less than the actual scan of the table does. The biggest difference is between searching a tree and scanning a list. Of course you still have the disk I/O and memory overhead that goes into index usage, so large indexes don't perform as well as they theoretically could.