In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous, to be honest (in my opinion). Although RedShift has neither indexes nor materialized views, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this: the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have SORTKEY, which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It has even recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to support multiple independent sort orders. Instead of ordering by a THEN b THEN c, it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern, this effectively becomes a set of pseudo-indexes.
- If the data is sorted by a then b then c
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns (see the sketch below)
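You can see these block statistics yourself. A hedged sketch, assuming a table named lineitem and using the system views STV_BLOCKLIST and STV_TBL_PERM (note that minvalue/maxvalue are stored as numeric representations, so this is most readable for integer columns):

-- Per-block min/max statistics for every column of a table
SELECT b.col, b.blocknum, b.num_values, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id
 AND b.slice = p.slice
WHERE TRIM(p.name) = 'lineitem'  -- hypothetical table name
ORDER BY b.col, b.blocknum;

Any block whose [minvalue, maxvalue] range excludes the predicate value (z = 1234 above) is simply never read.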
As of December 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific value or a small group of values. OLAP systems, by contrast, retrieve a large set of values and perform aggregation over it. Indexes are not the right fit for OLAP systems. Instead, Redshift uses a secondary structure called zone maps, together with sort keys.
Such indexes typically operate on B-trees. The "life without a btree" section in the blog below explains with examples how a B-tree-based index affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, and so on is what gives Redshift its speed.
Tuning these factors reduces the I/O operations Redshift performs, which eventually provides better performance. Implementing an efficient solution requires a great deal of knowledge of the areas above, as well as of the queries you would run on Amazon Redshift.
For example:
Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys.
Suppose your table structure is lineitem(orderid, linenumber, supplier, quantity, price, discount, tax, returnflag, shipdate).
If you select orderid as your sort key but your queries are based on shipdate, Redshift will not be operating efficiently.
If you have a compound sort key on (orderid, shipdate) and your queries filter only on shipdate, Redshift will still not be operating efficiently.
If you have an interleaved sort key on (orderid, shipdate), queries that filter on either column, or on both, can be handled efficiently.
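To make that concrete, here is a hedged DDL sketch for the lineitem table above (column types are assumptions):

CREATE TABLE lineitem (
  orderid    BIGINT,
  linenumber INT,
  supplier   VARCHAR(64),
  quantity   INT,
  price      DECIMAL(12,2),
  discount   DECIMAL(4,2),
  tax        DECIMAL(4,2),
  returnflag CHAR(1),
  shipdate   DATE
)
COMPOUND SORTKEY (orderid, shipdate);  -- efficient only when filters lead with orderid

-- For the interleaved variant, replace the last line with:
-- INTERLEAVED SORTKEY (orderid, shipdate);  -- gives both columns equal weight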
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running SELECT queries on existing tables. This duplicates data, but in the format your queries need (similar to a materialized view). The blog below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
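A minimal sketch of that approach, with hypothetical names, reusing the lineitem example from earlier:

-- Precompute an aggregate into a plain table (a poor man's materialized view)
CREATE TABLE daily_revenue AS
SELECT shipdate, SUM(price) AS revenue
FROM lineitem
GROUP BY shipdate;

-- "Refresh" after each batch load by rebuilding the table
DROP TABLE IF EXISTS daily_revenue;
CREATE TABLE daily_revenue AS
SELECT shipdate, SUM(price) AS revenue
FROM lineitem
GROUP BY shipdate;

Queries then read daily_revenue instead of scanning and aggregating lineitem every time.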
Redshift also fared well against other systems like Hive, Impala, Spark, BigQuery, etc. in one of our recent benchmarks.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes is "needle-in-the-haystack" queries: queries where only a relatively small number of rows match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only that column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And materialization incurs a lot of overhead for maintaining the data in a high-transaction environment. If you don't have a high-transaction environment, then you can incrementally update your own summary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
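A hedged usage example, reusing the lineitem table from earlier (names are illustrative):

CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT shipdate, SUM(price) AS revenue
FROM lineitem
GROUP BY shipdate;

-- Query it like any other relation; the precomputed results are returned
SELECT revenue FROM mv_daily_revenue WHERE shipdate = '2019-11-01';

-- Re-sync it with the base table after loads
REFRESH MATERIALIZED VIEW mv_daily_revenue;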
Related
I am quite new to Redshift but have quite some experience in the BI area. I need help from an expert Redshift developer. Here's my situation:
I have an external (S3) database added to Redshift. It will undergo very frequent changes, approx. every 15 minutes. I will run a lot of concurrent queries directly from Qlik Sense against this external DB.
As best practices say that Redshift + Spectrum works best when the smaller table resides in Redshift, I decided to move some calculated dimension tables locally and leave the other tables in S3. The challenge I have is whether materialized views or tables are better suited for this.
I already tested both, with DIST STYLE = ALL and a proper SORT KEY, and the tests show that MVs are faster. I just don't understand why that is. Considering the dimension tables are fairly small (<3mil rows), I have the following questions:
- Shall we use MVs and refresh them via a scheduled task, or use a table and perform some form of ETL (maybe via a stored procedure) to refresh it?
- If a table is used: I tried casting the varchar keys (heavily used in joins) to bigint to force the encoding to AZ64, but queries perform worse than without the casting (where encode=LZO). Is this because in the external DB the keys are stored as varchar?
- If an MV is used: I also tried the above casting in the query behind the MV, but the encoding says NONE (figured out by checking the table created behind the scenes). Moreover, even without casting, most of the key columns used in joins have no encoding. Might this be the reason why MVs are faster than the table? And should I not expect the opposite, that no encoding = worse performance?
Some more info: in S3, we store the data as Parquet files, with proper partitioning. In Redshift, the tables are sorted on the same columns as the S3 partitioning, plus some more columns. All queries join on these columns in the same order and also filter on them in the WHERE clause. So the queries are well structured.
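For reference, here is roughly how I checked the encodings (the mv_tbl__<name>__0 naming of the MV's backing table is just what I observed, so treat it as an assumption):

-- pg_table_def only lists tables in schemas on your search_path
SELECT "column", type, encoding, sortkey
FROM pg_table_def
WHERE tablename = 'my_dimension';  -- or the MV's backing table, e.g. 'mv_tbl__my_mv__0'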
Let me know if you need any other details.
Are column store indexes in SQL Server useful only when the query uses aggregate functions?
Certainly not. Even though they were designed for DWH environments, they can be used in an OLTP environment.
And even when used in a DWH, aggregation is not a requirement.
Columnstore indexes use a different storage format for data, storing compressed data on a per-column rather than a per-row basis. This storage format benefits query processing in data warehousing, reporting, and analytics environments where, although they typically read a very large number of rows, queries work with just a subset of the columns from a table.
So the first benefit is data compression.
Compression in a columnstore is table-wide, not page-wide (I mean the dictionary applied), as it is when you use PAGE data compression, so the compression ratio is the best. A table with a clustered columnstore index defined uses less space than the same table with no columnstore but with page compression enabled.
The second benefit is for queries that filter out nothing (or almost nothing, i.e. they need (almost) all the rows) but need only some columns to be returned.
When the table is stored "per-row", even if you want only 10 columns of 100 and you want all the rows, the whole table will be read, because the whole row has to be read to get your 10 requested columns out of it. When you use "per-column" storage, only the needed columns will be read.
Of course you can define an index with your 10 needed columns as included columns, but it will use additional space and add the overhead of maintaining that index. Now imagine that your queries need these 10, another 10, and another 20 of the 100, so you need to create more indexes for those queries.
With one columnstore index you will be able to satisfy all these queries.
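A hedged T-SQL sketch of the two options (table, index, and column names are hypothetical):

-- A clustered columnstore index replaces the table's row storage entirely
CREATE CLUSTERED COLUMNSTORE INDEX cci_facts ON dbo.facts;

-- A nonclustered columnstore index keeps the row store and adds
-- a columnar copy of the chosen columns for analytic queries
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_facts
ON dbo.facts (order_id, ship_date, price, quantity);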
Columnstore indexes store data in a columnar format, so they are quite helpful when you are using aggregate functions. One reason is that homogeneous column data compresses well, which makes scanning a column to aggregate it much faster.
But this is not the only use of columnstore indexes. They are really helpful when you are processing millions of rows (in multidimensional data models).
Check out the official documentation and this as well for better understanding.
You can't say they are always useful for aggregate functions, as it depends on which rows are included in the aggregation. If you are performing aggregation over all rows, they are useful. If you are selecting only a small share of the rows because of filtering, you can even get a worse result than with a traditional non-clustered index.
As it is written in MSDN, they can be used:
- to achieve up to 10x query performance gains in your data warehouse over traditional row-oriented storage
- to get 10x data compression over the uncompressed data size (if you are interested in compression you should check the COLUMNSTORE_ARCHIVE option)
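The archive option mentioned above is applied via an index rebuild, e.g. (hypothetical names):

ALTER INDEX cci_facts ON dbo.facts
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);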
Also, depending on your SQL Server version (SQL Server 2017 or later), you can check Adaptive Query Processing, as one of its conditions is having such an index.
You should look through the documentation to see what options you have depending on your SQL Server version, and you should test to be sure how this index is going to affect performance, because it is very possible to make things worse.
It's good that in every article Microsoft has mentioned the scenarios in which each type of columnstore index can be used to good effect.
We have a table design that consists of 10,000,000 records and 200,000 columns.
The columns are a mixture of:
Binary flags.
Integers.
The queries need to perform AND/OR operations on 1-100 columns at a time, and should complete in under 0.1 seconds, returning only a projection/subset of each matched row.
Around 10 new columns get added per day.
Around 1,000 new rows get added per day.
There are no joins.
Which DBMS is best suited for this?
Reason behind this approach:
The columns are materialized indexes built from user-defined queries: that's why new columns get added each day (as more users come up with their own queries). The other option would be to not use materialized views and have the users' queries perform joins. The problem there is that the queries could take any form, and in aggregate there would be a large number of very different execution plans across everyone's queries... Since the user defines the query, it's practically impossible to optimise a traditional SQL database using indexes, normalised tables, etc.
First, I'd suggest measuring the ad-hoc JOINs, and only doing further optimization if you find the performance lacking. I understand it could be difficult to measure every possible query, but you may be able to cover the most common/representative cases, and if they perform well enough, just stop there. There is a lot that can be done with good indexing!
Second, and only if the measurements above warrant it, create a new separate materialized view for each ad-hoc query.
Some databases will be able to maintain such views automatically for you [1], so if the "base" data changes, relevant results will be automatically added or removed from the materialized view (just as they would be from the "live" query result).
Other databases may allow periodic refresh [2].
Be warned though: maintaining materialized views is not free, and having thousands of them (especially if they are constantly kept up-to-date, as opposed to periodically refreshed) will definitely impact the insert/update/delete performance on the base data!
[1] E.g. SQL Server indexed views.
[2] E.g. Oracle materialized views, although it looks like 12c can also do something close to SQL Server's immediate refresh.
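For footnote 1, a minimal hedged sketch of a SQL Server indexed view (names and columns are hypothetical; indexed views require SCHEMABINDING, and a GROUP BY view must include COUNT_BIG(*)):

CREATE VIEW dbo.v_flag_counts
WITH SCHEMABINDING
AS
SELECT flag_a, flag_b, COUNT_BIG(*) AS row_count
FROM dbo.base_data
GROUP BY flag_a, flag_b;
GO

-- The unique clustered index is what materializes the view's result set
CREATE UNIQUE CLUSTERED INDEX ix_v_flag_counts
ON dbo.v_flag_counts (flag_a, flag_b);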
Leaving aside why you want to go with thousands of columns, you can look at databases which support unlimited columns; see the comparison referenced below.
References: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
You pay per size of data queried. So, would using views be a better alternative from the cost point of view?
Views are not materialized in BigQuery (currently), so the cost of querying a view is identical to running the more complex query on the underlying table.
You can, of course, create your own "materialized views" by running a query and saving it as a table. Then you can run subsequent queries against that table. This may be more cost effective if the saved table is smaller than the underlying table. That takes a bit more manual bookkeeping, however.
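A hedged sketch of that manual pattern in BigQuery standard SQL (dataset and table names are hypothetical):

-- Save the expensive query's result as a smaller table
CREATE OR REPLACE TABLE my_dataset.daily_summary AS
SELECT event_date, COUNT(*) AS events
FROM my_dataset.raw_events
GROUP BY event_date;

-- Later queries scan only the summary table, so they bill for far less data
SELECT events FROM my_dataset.daily_summary WHERE event_date = '2019-12-01';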
I have a query that performs joins on many tables, which results in poor performance.
To improve the performance, I created an indexed view, and I see a significant improvement in the performance of the query on the view with a date filter. However, my concern is about the storage of the index. From what I have read, the unique clustered index is stored on SQL Server. Does that mean it separately stores the entire data set produced by the joins within the view? If so, and if I've included all the columns from the tables that take part in the joins in the view, would the disk space consumed on the server be approximately double the disk space without the indexed view? And every time I enter data into the underlying tables, is the data duplicated for the indexed view?
That is correct. An indexed view is basically an additional table that contains a copy of all the data in a sorted way. That's what makes it so fast, but as everything in SQL Server land, it comes at a price - in this case the additional storage required and the additional time required to keep all the copies of the data in sync.
The same is true for a normal index on a table. It is also a copy of the index keys (plus some information of where to find the original row), that needs additional storage and additional time during updates to be maintained.
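If you want to put a number on that extra storage, sp_spaceused can report on an indexed view as well as on a table (the view name here is hypothetical):

EXEC sp_spaceused N'dbo.MyIndexedView';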
Figuring out whether adding an index on a table or view makes sense requires you to look at the entire system and see if the performance improvement for the one query is worth the performance degradation of other queries.
In your case you should also (first) check if additional indexes on the underlying tables might help your queries.
Yes, that is correct. An indexed view persists all data in the view separately from the source tables. Depending on the columns and joins, the data is duplicated, and can actually be many times larger than the source tables.
Pretty much, yeah. You've made a trade-off where you get better performance in return for some additional effort by the engine, plus the additional storage needed to persist it.