I have an unpartitioned hive table which is used to create a partitioned view. The table has just some metadata columns and the actual data is stored in an array which makes querying difficult. Hence the data is exploded into a view which then is used for all querying purposes. This view is partitioned on the date the data arrives. In this scenario, will the performance be affected as the original table is unpartitioned? Should the original table be partitioned too?
If underlying tables are not partitioned, view partitioning is not useful at all.
Of course, table should be partitioned if you want partition pruning to work, otherwise the full-scan will be performed.
On the other hand, if the table is partitioned and a view is not partitioned and query has predicates on table partitions, optimizer is clever enough to derive partition info from the view definition, push predicates down and partition pruning works. And this makes view partitioning rather useless feature and manually managed view partitions add unnecessary complication. Better use partitioned tables and normal not partitioned views.
Why you may need partitioned view if partition pruning works with non-partitioned view. One possible use-case is when restricted user can see only the view, not underlying tables and different tools can derive partition information from view metadata only and suggest filtering on partitions, knowing nothing about underlying tables. From the restricted user perspective, view is the same as table and they should see partitioning schema.
See HIVE-1079:
For the manual approach, a very simple thing we can start with is just
letting users add partitions to views explicitly as a way of
indicating that the underlying data is ready.
See also HIVE-1941 and this design document
BTW Materialized views are already implemented in Hive 3.0.0 and prtitioning of them makes much more sense because the data in them are stored accordingly to partition schema specified in DDL.
Related
What is the difference between Views and Materialized Views in Oracle?
Materialized views are disk based and are updated periodically based upon the query definition.
Views are virtual only and run the query definition each time they are accessed.
Views
They evaluate the data in the tables underlying the view definition at the time the view is queried. It is a logical view of your tables, with no data stored anywhere else.
The upside of a view is that it will always return the latest data to you. The downside of a view is that its performance depends on how good a select statement the view is based on. If the select statement used by the view joins many tables, or uses joins based on non-indexed columns, the view could perform poorly.
Materialized views
They are similar to regular views, in that they are a logical view of your data (based on a select statement), however, the underlying query result set has been saved to a table. The upside of this is that when you query a materialized view, you are querying a table, which may also be indexed.
In addition, because all the joins have been resolved at materialized view refresh time, you pay the price of the join once (or as often as you refresh your materialized view), rather than each time you select from the materialized view. In addition, with query rewrite enabled, Oracle can optimize a query that selects from the source of your materialized view in such a way that it instead reads from your materialized view. In situations where you create materialized views as forms of aggregate tables, or as copies of frequently executed queries, this can greatly speed up the response time of your end user application. The downside though is that the data you get back from the materialized view is only as up to date as the last time the materialized view has been refreshed.
Materialized views can be set to refresh manually, on a set schedule, or based on the database detecting a change in data from one of the underlying tables. Materialized views can be incrementally updated by combining them with materialized view logs, which act as change data capture sources on the underlying tables.
Materialized views are most often used in data warehousing / business intelligence applications where querying large fact tables with thousands of millions of rows would result in query response times that resulted in an unusable application.
Materialized views also help to guarantee a consistent moment in time, similar to snapshot isolation.
A view uses a query to pull data from the underlying tables.
A materialized view is a table on disk that contains the result set of a query.
Materialized views are primarily used to increase application performance when it isn't feasible or desirable to use a standard view with indexes applied to it. Materialized views can be updated on a regular basis either through triggers or by using the ON COMMIT REFRESH option. This does require a few extra permissions, but it's nothing complex. ON COMMIT REFRESH has been in place since at least Oracle 10.
Materialised view - a table on a disk that contains the result set of a query
Non-materiased view - a query that pulls data from the underlying table
Views are essentially logical table-like structures populated on the fly by a given query. The results of a view query are not stored anywhere on disk and the view is recreated every time the query is executed. Materialized views are actual structures stored within the database and written to disk. They are updated based on the parameters defined when they are created.
View: View is just a named query. It doesn't store anything. When there is a query on view, it runs the query of the view definition. Actual data comes from table.
Materialised views: Stores data physically and get updated periodically. While querying MV, it gives data from MV.
Adding to Mike McAllister's pretty-thorough answer...
Materialized views can only be set to refresh automatically through the database detecting changes when the view query is considered simple by the compiler. If it's considered too complex, it won't be able to set up what are essentially internal triggers to track changes in the source tables to only update the changed rows in the mview table.
When you create a materialized view, you'll find that Oracle creates both the mview and as a table with the same name, which can make things confusing.
Materialized views are the logical view of data-driven by the select query but the result of the query will get stored in the table or disk, also the definition of the query will also store in the database.
The performance of Materialized view it is better than normal View because the data of materialized view will be stored in table and table may be indexed so faster for joining also joining is done at the time of materialized views refresh time so no need to every time fire join statement as in case of view.
Other difference includes in case of View we always get latest data but in case of Materialized view we need to refresh the view for getting latest data.
In case of Materialized view we need an extra trigger or some automatic method so that we can keep MV refreshed, this is not required for views in the database.
In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This intern allows RedShift to identify which blocks to read from the other columns
as of dec 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific or a small group of values. On the contrary, OLAP systems retrieve a large set of values and performs aggregation on the large set of values. Indexes would not be a right fit for OLAP systems. Instead it uses a secondary structure called zone maps with sort keys.
The indexes operate on B trees. The 'life without a btree' section in the below blog explains with examples how an index based out of btree affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression codings, data distribution, compression, query compilations, optimization etc. provides the power to Redshift to be faster.
Implementing the above factors, reduces IO operations on Redshift and eventually providing better performance. To implement an efficient solution, it requires a great deal of knowledge on the above sections and as well as the on the queries that you would run on Amazon Redshift.
for eg.
Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys.
If your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate).
If you select orderid as your sort key but if your queries are based on shipdate, Redshift will be operating efficiently.
If you have a composite sortkey on (orderid, shipdate) and if your query only on ship date, Redshift will not be operating efficiently.
If you have an interleaved soft key on (orderid, shipdate) and if your query
Redshift does not support materialized views but it easily allows you to create (temporary/permant) tables by running select queries on existing tables. It eventually duplicates data but at the required format to be executed for queries (similar to materialized view) The below blog gives your some information on the above approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
Redshift does fare well with other systems like Hive, Impala, Spark, BQ etc. during one of our recent benchmark frameworks
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes are "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And, materialization incurs a lot of overhead for maintaining the data in a high transaction environment. If you don't have a high transaction environment, then you can increment temporary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
You pay per size of data queried. So, would be a better alternative to use views from the cost point of view?
Views are not materialized in bigquery (currently), so the cost of querying from a view is identical to writing the more complex query on the underlying table.
You can, of course, create your own "materialized views" by running a query and saving it as a table. Then you can run subsequent queries against that table. This may be more cost effective if the saved table is smaller than the underlying table. That takes a bit more manual bookkeeping, however.
what is the difference (system resource wise) between using views and temporary tables? I have a complex report that will be built from numerous tables. I am trying to determine if I should use a series of views or temp tables (SQL 2008). Is there a difference?
A view is just an alias for the query.
A temporary table materializes the resultset.
Views, being just aliases, don't take time to be filled, but they may be less performant when being selected from.
Temporary tables take some time (and effort) to be populated but can be more efficient.
Note that SQL Server optimizer can create a temporary table in runtime from a view (which you will see in the query statistics and plan as Worktable in tempdb), index it and populate it with the rows, and even add the missing rows on demand (these operations are called Eager Spool and Lazy Spool)
This makes the view rewinding much more efficient: instead of reevaluating the whole view each time its results are needed, the results are stored in a temporary table either all at once (as in Eager Spool), or incrementally (as in Lazy Spool).
SQL Server, though, can materialize views by creating clustered indexes on them, though not all views can be indexed (to be indexable, the view must meet certain conditions described here).
View is just a "fixed" SELECT statement with a name. If you have complex query, or in other case, if you want to show only partial data from the table you use view. Views exists in the database you have created it in.
Temporary table is the table like any others, but it is stored in tempdb database and is deleted after you finish stored procedure, etc. (local table). There can be seen only in the scope of the procedure you use it in. There are also global temporary tables, which can be seen outside the procedure.
chances are that an indexed view would probably work for you more than inserting data into temp table/s. also, instead of using a series of views you'd be better of consolidating into one view. it's much easier to manage and will most likely improve performance. also, the right indexes on your source tables will also improve performance.
What is the difference between Views and Materialized Views in Oracle?
Materialized views are disk based and are updated periodically based upon the query definition.
Views are virtual only and run the query definition each time they are accessed.
Views
They evaluate the data in the tables underlying the view definition at the time the view is queried. It is a logical view of your tables, with no data stored anywhere else.
The upside of a view is that it will always return the latest data to you. The downside of a view is that its performance depends on how good a select statement the view is based on. If the select statement used by the view joins many tables, or uses joins based on non-indexed columns, the view could perform poorly.
Materialized views
They are similar to regular views, in that they are a logical view of your data (based on a select statement), however, the underlying query result set has been saved to a table. The upside of this is that when you query a materialized view, you are querying a table, which may also be indexed.
In addition, because all the joins have been resolved at materialized view refresh time, you pay the price of the join once (or as often as you refresh your materialized view), rather than each time you select from the materialized view. In addition, with query rewrite enabled, Oracle can optimize a query that selects from the source of your materialized view in such a way that it instead reads from your materialized view. In situations where you create materialized views as forms of aggregate tables, or as copies of frequently executed queries, this can greatly speed up the response time of your end user application. The downside though is that the data you get back from the materialized view is only as up to date as the last time the materialized view has been refreshed.
Materialized views can be set to refresh manually, on a set schedule, or based on the database detecting a change in data from one of the underlying tables. Materialized views can be incrementally updated by combining them with materialized view logs, which act as change data capture sources on the underlying tables.
Materialized views are most often used in data warehousing / business intelligence applications where querying large fact tables with thousands of millions of rows would result in query response times that resulted in an unusable application.
Materialized views also help to guarantee a consistent moment in time, similar to snapshot isolation.
A view uses a query to pull data from the underlying tables.
A materialized view is a table on disk that contains the result set of a query.
Materialized views are primarily used to increase application performance when it isn't feasible or desirable to use a standard view with indexes applied to it. Materialized views can be updated on a regular basis either through triggers or by using the ON COMMIT REFRESH option. This does require a few extra permissions, but it's nothing complex. ON COMMIT REFRESH has been in place since at least Oracle 10.
Materialised view - a table on a disk that contains the result set of a query
Non-materiased view - a query that pulls data from the underlying table
Views are essentially logical table-like structures populated on the fly by a given query. The results of a view query are not stored anywhere on disk and the view is recreated every time the query is executed. Materialized views are actual structures stored within the database and written to disk. They are updated based on the parameters defined when they are created.
View: View is just a named query. It doesn't store anything. When there is a query on view, it runs the query of the view definition. Actual data comes from table.
Materialised views: Stores data physically and get updated periodically. While querying MV, it gives data from MV.
Adding to Mike McAllister's pretty-thorough answer...
Materialized views can only be set to refresh automatically through the database detecting changes when the view query is considered simple by the compiler. If it's considered too complex, it won't be able to set up what are essentially internal triggers to track changes in the source tables to only update the changed rows in the mview table.
When you create a materialized view, you'll find that Oracle creates both the mview and as a table with the same name, which can make things confusing.
Materialized views are the logical view of data-driven by the select query but the result of the query will get stored in the table or disk, also the definition of the query will also store in the database.
The performance of Materialized view it is better than normal View because the data of materialized view will be stored in table and table may be indexed so faster for joining also joining is done at the time of materialized views refresh time so no need to every time fire join statement as in case of view.
Other difference includes in case of View we always get latest data but in case of Materialized view we need to refresh the view for getting latest data.
In case of Materialized view we need an extra trigger or some automatic method so that we can keep MV refreshed, this is not required for views in the database.