How do we do bucketing in Hive for a star schema?

What is the best practice in hive for bucketing a star schema model?
Let's say I have a fact table with 3 dimensions:
f_test:
customer_key,
vendor_key,
country_key,
measures
d_customer
d_vendor
d_country
How would you bucket the above use case? Bucket the dimensions on their keys and the fact on a composite bucket (customer, vendor, country)?
Please advise on best practice.

Bucketing is used to improve query performance, so without knowing how your users are going to query your data it is impossible to recommend how to bucket it. For example, if most queries of the fact table are by customer attributes, then bucketing on customer_key makes sense.
Unless you have very high volumes of data in your dimensions it's probably not worth bucketing them; I assume d_country, for example, only has around 200 records.
Unfortunately, one of the main limitations of using Hive/Impala/etc. as an analytical platform is that you have very limited scope for improving performance via table design: you can only partition/bucket a table in one way, and therefore only support one query pattern. In your example, if your fact table is queried by customer and by vendor equally, there is no way to improve the performance of both types of query; you just have to rely on the "horsepower" of the platform to process them.
Compare this to a conventional DB, where you can just add a new index to support a query if needed.
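For illustration, here is a minimal HiveQL sketch of that advice, bucketing the fact table on customer_key under the assumption that customer-centric queries dominate; the bucket count of 64, the ORC format and the single measure column are illustrative choices, not recommendations:

CREATE TABLE f_test (
    customer_key BIGINT,
    vendor_key   BIGINT,
    country_key  BIGINT,
    measure_1    DOUBLE          -- placeholder for your measures
)
CLUSTERED BY (customer_key) INTO 64 BUCKETS
STORED AS ORC;

-- A tiny dimension such as d_country (ca. 200 rows) can simply be left unbucketed:
CREATE TABLE d_country (
    country_key  BIGINT,
    country_name STRING
)
STORED AS ORC;

If vendor-centric queries dominated instead, the same pattern would apply with vendor_key as the bucketing column.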

Why does Redshift not need materialized views or indexes?

In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have a SORTKEY, which is exceptionally similar to a clustered index: simply a list of fields by which the data is ordered (like a composite clustered index).
It has even recently introduced INTERLEAVED SORTKEYs. This is a direct attempt to support multiple independent sort orders. Instead of ordering by a THEN b THEN c, it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern, this effectively becomes a set of pseudo-indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
As of December 2019, Redshift has a preview of materialized views: Announcement
From the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific value or a small group of values. In contrast, OLAP systems retrieve a large set of values and perform aggregations over them. Indexes are not the right fit for OLAP systems; instead, Redshift uses a secondary structure called zone maps together with sort keys.
Indexes operate on B-trees. The 'life without a B-tree' section in the blog below explains with examples how a B-tree-based index affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, etc. is what gives Redshift its speed.
Getting these factors right reduces I/O on Redshift and thus provides better performance. Implementing an efficient solution requires a good deal of knowledge of these areas, as well as of the queries you will run on Amazon Redshift.
For example:
Redshift supports sort keys, compound sort keys and interleaved sort keys.
Suppose your table structure is lineitem(orderid, linenumber, supplier, quantity, price, discount, tax, returnflag, shipdate).
If you select orderid as your sort key but your queries filter on shipdate, Redshift will not be operating efficiently.
If you have a compound sort key on (orderid, shipdate) and you query only on shipdate, Redshift will also not be operating efficiently.
If you have an interleaved sort key on (orderid, shipdate), a query on either column alone can still prune blocks efficiently, as the sketches below illustrate.
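As a rough sketch of those three cases (column list trimmed to three columns; table names and data types are only for illustration):

CREATE TABLE lineitem_single (
    orderid  BIGINT,
    shipdate DATE,
    price    DECIMAL(12,2)
)
SORTKEY (orderid);                        -- queries filtering only on shipdate gain little

CREATE TABLE lineitem_compound (
    orderid  BIGINT,
    shipdate DATE,
    price    DECIMAL(12,2)
)
COMPOUND SORTKEY (orderid, shipdate);     -- helps mainly when orderid leads the filter

CREATE TABLE lineitem_interleaved (
    orderid  BIGINT,
    shipdate DATE,
    price    DECIMAL(12,2)
)
INTERLEAVED SORTKEY (orderid, shipdate);  -- gives both columns equal weight in block pruning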
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running SELECT queries against existing tables. This duplicates data, but in the shape your queries need (similar to a materialized view). The blog below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
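As a hypothetical sketch of that approach, a pre-aggregated table built with CREATE TABLE AS (the table name and measure are assumptions, not from the linked post):

CREATE TABLE sales_by_supplier AS
SELECT supplier,
       SUM(quantity * price) AS gross_sales
FROM lineitem
GROUP BY supplier;

-- Drop and recreate (or truncate and reload) this table after each batch load to keep it current.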
Redshift also fared well against other systems like Hive, Impala, Spark, BigQuery, etc. in one of our recent benchmarks.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes is "needle-in-the-haystack" queries: queries where only a relatively small number of rows matching a WHERE clause is needed. Columnar data stores handle these differently. The entire column is read into memory -- but only that column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization, and materialization incurs a lot of overhead for maintaining the data in a high-transaction environment. If you don't have a high-transaction environment, you can incrementally refresh summary/temporary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
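A hypothetical instance of that syntax, reusing the lineitem example from above (column names assumed):

CREATE MATERIALIZED VIEW mv_revenue_by_shipdate
BACKUP NO
AS
SELECT shipdate, SUM(price) AS revenue
FROM lineitem
GROUP BY shipdate;

REFRESH MATERIALIZED VIEW mv_revenue_by_shipdate;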

Star Schema Query - do you have to include all dimensions (joins) in the query?

Hi, I have a question regarding star schema queries in an MS SQL data warehouse.
I have a fact table and 8 dimensions, and I am confused: to get the metrics from the fact table, do we have to join all the dimensions to the fact, even when I am not selecting data from them? Is this required to get the right metrics?
My fact table is huge, so I am wondering about this for performance purposes and the right way to query.
Thanks!
No, you do not have to join all 8 dimensions. You only need to join the dimensions that contain data you need for analyzing the metrics in the fact table. Also, to increase performance, make sure to include only the columns from the dimension tables that are needed for the analysis; including all columns from the dimensions you join will decrease performance.
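For example, a hypothetical query sketch (the fact and dimension names are illustrative, not from your model) that joins only the two dimensions it actually needs:

SELECT d.calendar_year,
       c.customer_segment,
       SUM(f.sales_amount) AS sales_amount
FROM   fact_sales f
JOIN   dim_date d     ON d.date_key = f.date_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
GROUP  BY d.calendar_year, c.customer_segment;

The other six dimensions are simply left out of the join.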
It is not necessary to include all the dimensions. Indeed, while exploring fact tables it is very important to be able to select only some dimensions to join and drop the others. Performance issues must not be an excuse to give up on this capability.
You have a bunch of different techniques to solve performance issues, depending on the database you are using. Some common ways:
aggregate tables: this is one of the best ways to solve performance issues. If you have a huge fact table, you can create an aggregated version of it, using only the most frequently queried columns; that way, it should be much smaller. Users (or the ad hoc query application) then use the aggregate table instead of the original fact table whenever possible. The good news is that most databases know how to manage aggregate tables automatically (materialized views, for example): queries that initially target the original fact table are transparently redirected to the aggregate table whenever possible. A sketch follows this list.
indexing: bitmap indexes, for example, can be an efficient way to increase performance on a star schema fact table.
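As a rough sketch of the aggregate-table idea (the names and the daily/customer grain are assumptions; in SQL Server this could also be built as an indexed view so the engine maintains it and redirects queries automatically):

SELECT f.date_key,
       f.customer_key,
       SUM(f.sales_amount) AS sales_amount,
       COUNT_BIG(*)        AS row_count
INTO   agg_sales_day_customer
FROM   fact_sales f
GROUP  BY f.date_key, f.customer_key;

Queries that only need daily customer totals can then read agg_sales_day_customer instead of scanning the full fact table.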

OLTP Performance Issue with Netezza

We recently migrated our database from Oracle to Netezza. Ours is mainly an OLAP database, but part of it is OLTP, which populates a few tables (front-end application). These real-time tables get joined with a few history tables to generate reports. We are satisfied with the performance of the OLAP part, but the OLTP part has a major performance issue. What are the ways to improve OLTP in Netezza? Or is there any design approach to maintain OLTP separately?
I would not recommend using Netezza for OLTP; however, we get into situations where the data is so large that we don't have another choice. In those circumstances you can do some tuning to speed things up:
Make sure that your table has good distribution; if you can't find a good column to distribute on, then distribute on RANDOM.
Add ORGANIZE ON to the table for the key your OLTP operations are based on.
Make sure your key is a distributable datatype (Integer, Date). See the sketch after this list.
Alternatively you may consider a hybrid design
Do your OLTP operations for your web application in SQL Server or Postgres
ETL the changes back to Netezza every few minutes or hours.
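A minimal Netezza-style sketch of the tuning points above (the table and column names are hypothetical):

CREATE TABLE orders_rt (
    order_id   INTEGER NOT NULL,
    order_date DATE    NOT NULL,
    status     VARCHAR(20)
)
DISTRIBUTE ON (order_id);                       -- or DISTRIBUTE ON RANDOM if there is no good key

ALTER TABLE orders_rt ORGANIZE ON (order_id);   -- cluster on the key the OLTP lookups use
GROOM TABLE orders_rt;                          -- reorganizes existing rows to honour ORGANIZE ON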
There are two main considerations for performance in a parallel system: even distribution of data, and collocation of data that will be joined. The first point was already addressed: choose a good distribution key that gives you good, even distribution of data.
As to my second point, if data is going to be joined (say for an update), there must be little or no data movement; making sure this is true will reduce data traffic. The best way to ensure data is collocated is to use the same distribution key. The same key always hashes to the same node, and the same node means no traffic is needed to do the join. For example, say you have Cust (CNo), Order (ONo), Order Item (ONo, INo), and Product (PNo); the primary keys are in parentheses. If you use the above as distribution keys, the data will NOT be collocated. If, however, you use CNo in Cust, the FK CNo in Order, and a redundant FK CNo in Order Item, all as the distribution keys, they will be collocated. Product cannot be collocated, but it doesn't need to be; it is usually not that large and it would make no sense to put a CNo in Product.
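In Netezza-style DDL, that collocation example might look roughly like this (column lists trimmed to the keys, and "Orders" used instead of the reserved word ORDER):

CREATE TABLE Cust (
    CNo INTEGER NOT NULL
)
DISTRIBUTE ON (CNo);

CREATE TABLE Orders (
    ONo INTEGER NOT NULL,
    CNo INTEGER NOT NULL
)
DISTRIBUTE ON (CNo);            -- distributed on the customer key, not the primary key ONo

CREATE TABLE OrderItem (
    ONo INTEGER NOT NULL,
    INo INTEGER NOT NULL,
    CNo INTEGER NOT NULL        -- redundant FK carried down so the same key can be used
)
DISTRIBUTE ON (CNo);

Joins between these three tables on CNo then require no data movement across nodes.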
NoSQL systems generally do not allow joins because they cannot ensure collocation of the join data (there are other reasons as well); in NoSQL, data is distributed over a wide array of nodes.
The two great performance killers are sorts and Cartesian products. Make sure these do not occur in the joins you run on Netezza.

lookup table or something else?

I have 8 tables:
employees
employee_subjects
outlet
outlet_subjects
subjects
geography
outlet_geography
employee_geography
Now, I need to be able to search for outlets and employees within a range of different geographies and based on a range of subjects.
My question is: is there a good strategy for this, and is it a good idea to create a somewhat static lookup table into which I insert all the data I need for these ranges?
The table would potentially grow to 50+ million rows, but I would be able to say
SELECT ... FROM lookup WHERE subId = 1 OR subId = 2 OR geoId = 1 OR geoId = 2 ... etc.
So I get to keep the joins out.
Vague, yes, but I need guidance on this!
That question cannot be answered in general. In some contexts you have to keep redundant, denormalized data for performance reasons (in particular for data warehouses). However, you should not introduce redundancies or potential inconsistencies lightly.
I suggest first measuring the query performance and checking your execution plans. Make sure that you create all the indexes you need. If the query still turns out to be too slow, you might consider using a materialized view (called an indexed view in SQL Server; see, e.g., here). A materialized view is quite like the table you suggest, but it is kept in sync with your data automatically by the DBMS.
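A minimal sketch of such an indexed view in SQL Server, using hypothetical key column names based on your table list (adapt to your actual schema; the index assumes each outlet/subject/geography combination appears once):

CREATE VIEW dbo.v_outlet_subject_geo
WITH SCHEMABINDING
AS
SELECT os.outletId, os.subjectId, og.geoId
FROM   dbo.outlet_subjects  os
JOIN   dbo.outlet_geography og ON og.outletId = os.outletId;
GO
-- The unique clustered index is what materializes (and auto-maintains) the view:
CREATE UNIQUE CLUSTERED INDEX ix_v_outlet_subject_geo
ON dbo.v_outlet_subject_geo (outletId, subjectId, geoId);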
In a data warehouse context, for analytics queries (pulling numbers and statistics out of your system), that could make sense; but for an OLTP system regularly updated by users, a big lookup table is a very bad design: hard to maintain (lots of unneeded data, not all columns needed for all records, etc.) and prone to bad data.
Keeping the joins out just to simplify querying the system does not sound like a good idea either, as it can defeat the work of the SQL Server optimizer and is more likely to lead to table scans (which can be painful on a big table).
Here is an interesting article from Joe Celko on big lookup tables; it is not exactly the same problem as yours, but it sounds related and could give you some insights.
General advice would be: keep a normalized design (especially for an OLTP system).

SQL Schema Design: individual tables or a mass table for scalability?

What's the more efficient method of schema design in terms of scalability?
If a database has multiple users, and each user has objects (consisting of data in a longtext, a date, and a unique id), is it more efficient to
(1) create a mass table of objects that each have a user column, or
(2) create individual tables of objects for each user?
I've seen conflicting answers when researching this: database normalization arguments favor individual tables for each user, while several posts mention that performance is increased by using a mass table.
Edit: changed 'elements' to 'objects' for clarity.
In general, you want to create a single table for an entity and not break them into separate tables for users.
This makes the system more maintainable. Queries on the system are consistent across all applications. It also structures the data in the way that databases are optimized for accessing it.
There are a few specialized cases where you would split user data into separate tables or even separate databases. This might be a user requirement ("our data cannot mix with anyone else's"). It may be needed to support various backup and security policies. However, the general approach is to design the tables around entities and not around users.
Having a single table with a column to identify the user is proper relational design. Having to add tables when adding a user sounds very fishy, and may cause problems once you have a large number of users.
When a single table becomes too large, most database products support so-called partitioning, which lets you split a single logical table into multiple physical tables on disk based on some criteria (e.g., to stay with your example, you could have three physical tables with data for user ids 1 - 99999 in partition 1, 100000 - 199999 in partition 2 and 200000 - 299999 in partition 3).
Here's an overview for Oracle.
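For example, a hypothetical Oracle-style sketch of the single objects table with a user column, range-partitioned by user id as described above (names, types and partition boundaries are illustrative):

CREATE TABLE user_objects (
    object_id  NUMBER PRIMARY KEY,
    user_id    NUMBER NOT NULL,
    created_on DATE   NOT NULL,
    body       CLOB
)
PARTITION BY RANGE (user_id) (
    PARTITION p1 VALUES LESS THAN (100000),
    PARTITION p2 VALUES LESS THAN (200000),
    PARTITION p3 VALUES LESS THAN (MAXVALUE)
);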