Improve performance of processing billions of rows of data in Spark SQL - sql

In my corporate project, I need to cross join a dataset of over a billion rows with another of about a million rows using Spark SQL. Because a cross join was needed, I decided to divide the first dataset into several parts (each having about 250 million rows) and cross join each part with the million-row one, then combine the results with "union all".
Now I need to improve the performance of the join processes. I heard it can be done by partitioning the data and distributing the work across Spark workers. My questions are: how can partitioning be used effectively to improve performance, and what are the other ways to do this without using partitioning?
Edit: filtering already included.

Well, in all scenarios you will end up with tons of data. Be careful: try to avoid Cartesian joins on big datasets as much as possible, as they usually end with OOM exceptions.
Yes, partitioning can be the way that helps you, because you need to distribute your workload from one node to more nodes or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). You need to first evaluate what your partitioning key is right now; afterwards you can look for a better partitioning key/mechanism and repartition the data, thereby distributing the load. The join still has to be done either way, but it will be done over more parallel sources.
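A minimal sketch of what repartitioning before the join could look like in Spark SQL; big_part, small_table and the 400-partition count are hypothetical placeholders to tune against your own cluster (REPARTITION hint syntax is available from Spark 2.4 onwards):

    -- Spread one 250M-row part over more partitions before the cross join,
    -- so the work runs on many executors in parallel instead of a few fat tasks.
    CREATE OR REPLACE TEMPORARY VIEW big_part_distributed AS
    SELECT /*+ REPARTITION(400) */ * FROM big_part;

    SELECT b.*, s.*
    FROM big_part_distributed b
    CROSS JOIN small_table s;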

There should be some filters on your join query. You can use the filter attributes as the key to partition the data and then join based on those partitions.
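As an illustration of that idea, a hedged sketch in Spark SQL; big_table, small_table and the filter column event_date are made up for the example:

    -- Filter first so far less data reaches the join, and distribute both
    -- sides on the same filter attribute so rows for the same date land in
    -- the same partitions before they are joined.
    WITH big_f AS (
        SELECT * FROM big_table
        WHERE event_date >= '2023-01-01'
        DISTRIBUTE BY event_date
    ),
    small_f AS (
        SELECT * FROM small_table
        WHERE event_date >= '2023-01-01'
        DISTRIBUTE BY event_date
    )
    SELECT *
    FROM big_f b
    JOIN small_f s
      ON b.event_date = s.event_date;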

Related

Is it bad to do joins in Hive?

Hi, I recently started a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by date, and create tables for that data in PostgreSQL; the PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it would be a bad idea.
I wanted to use something like Talend or Mulesoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table to PostgreSQL as the final table for reporting.
Any suggestions, especially if this is not good practice with Hive? I'm new to Hive.
Thanks.
The major issue with joining in hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and the mappers will launch, as far as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large it may result in slow jobs.
If one of the tables is significantly smaller than the other, you may want to store it in the HDFS cache, making its data available on every node, which allows the join algorithm to retrieve all data locally.
So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.
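For the small-table case mentioned above, a hedged sketch of a Hive map-side join; big_table, dim_small and their columns are hypothetical, and the property value shown is just the usual default:

    -- Let Hive convert the join to a map join when one side is small enough
    -- to be broadcast to every node; the size threshold is configurable.
    SET hive.auto.convert.join = true;
    SET hive.mapjoin.smalltable.filesize = 25000000;   -- ~25 MB

    -- Or ask for it explicitly with the MAPJOIN hint:
    SELECT /*+ MAPJOIN(d) */ b.*, d.descr
    FROM big_table b
    JOIN dim_small d
      ON b.dim_id = d.dim_id;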
Hive is growing in maturity
It is possible that the arguments against using joins no longer apply to recent versions of Hive.
The clearest example I found is in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
Sidenote:
Personally, I find Pig code much easier to re-use and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes the jobs do get stuck or are slow due to network traffic.
It also depends on the number of nodes in the cluster.
Thanks

Star Schema Query - do you have to include all dimensions(joins) in the query?

Hi, I have a question regarding star schema queries in an MS SQL data warehouse.
I have a fact table and 8 dimensions. I am confused: to get the metrics from the fact table, do we have to join all the dimensions to it, even if I am not retrieving data from them? Is this required to get the right metrics?
My fact table is huge, so I am asking for performance reasons and to learn the right way to query.
Thanks!
No, you do not have to join all 8 dimensions. You only need to join the dimensions that contain data you need for analyzing the metrics in the fact table. Also, to increase performance, make sure to include only the columns from the dimension tables that are needed for the analysis; including every column from the dimensions you join will decrease performance.
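For example, a hedged sketch with made-up fact and dimension names, where a report only needs two of the eight dimensions:

    -- Join only the dimensions the report actually uses, and select only
    -- the dimension columns needed for the analysis.
    SELECT d.calendar_month,
           s.store_region,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date   d ON f.date_key  = d.date_key
    JOIN   dim_store  s ON f.store_key = s.store_key
    GROUP BY d.calendar_month, s.store_region;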
It is not necessary to include all the dimensions. Indeed, while exploring fact tables, it is very important to be able to select only some dimensions to join and drop the others. Performance issues must not be an excuse to give up this capability.
There are a number of techniques to solve performance issues, depending on the database you are using. Some common ways (a sketch follows this list):
aggregate tables: this is one of the best ways to solve performance issues. If you have a huge fact table, you can create an aggregated version of it using only the most frequently queried columns; this way it should be much smaller. Then users (or the ad-hoc query application) use the aggregate table instead of the original fact table whenever possible. The good news is that most databases know how to manage aggregate tables automatically (materialized views, for example): queries that initially target the original fact table are transparently redirected to the aggregate table whenever possible.
indexing: bitmap indexes, for example, can be an efficient way to increase performance on a star schema fact table.
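A minimal sketch of both ideas, using Oracle-style syntax and hypothetical names (the question is about MS SQL, where the rough equivalents would be indexed views and columnstore or nonclustered indexes):

    -- Aggregate table as a materialized view: the optimizer can transparently
    -- rewrite queries against fact_sales to read this much smaller summary.
    CREATE MATERIALIZED VIEW mv_sales_summary
      ENABLE QUERY REWRITE
    AS
    SELECT date_key, store_key, SUM(sales_amount) AS total_sales
    FROM   fact_sales
    GROUP BY date_key, store_key;

    -- Bitmap index on a low-cardinality key of the fact table.
    CREATE BITMAP INDEX ix_fact_sales_store ON fact_sales (store_key);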

Why is there an Each modifier?

From the docs:
Normal JOIN operations require that the right-side table contains less
than 8 MB of compressed data. The EACH modifier is a hint that informs
the query execution engine that the JOIN might reference two large
tables. The EACH modifier can't be used in CROSS JOIN clauses.
When possible, use JOIN without the EACH modifier for best
performance. Use JOIN EACH when table sizes are too large for JOIN.
Why isn't that automatic?
Is there a way to simplify this? Can I just always use JOIN EACH, or always use JOIN? (It seems I can't always use JOIN because of the 8 MB limitation quoted above.)
BigQuery parallelizes the processing of information within many servers that pass condensed information onto further servers in a tree topology. Everything ends up in a root node, and some of the BigQuery limitations come from this bottleneck: You can read "unlimited" amounts of data, but the output of a query has to fit into a single server.
(more details in the 2010 Dremel paper http://research.google.com/pubs/pub36632.html)
To overcome this limitation the EACH keyword was introduced: it forces a shuffle at the start, allowing the work to be parallelized without requiring a single output node, and allowing tables of unlimited size to be JOINed. This approach has some drawbacks, like losing the ability to ORDER BY the end result, as no single node will have visibility of the whole output.
Would it be possible for BigQuery to detect when to use EACH automatically? Ideally, yes, but for now the EACH keyword allows you to complete previously impossible operations, with the drawback of requiring your awareness of it.
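For reference, a hedged sketch of the syntax in legacy BigQuery SQL, with made-up dataset and table names:

    -- JOIN EACH forces a shuffle so that both sides of the join may be large.
    SELECT a.user_id, b.event_name
    FROM [mydataset.big_table_a] a
    JOIN EACH [mydataset.big_table_b] b
      ON a.user_id = b.user_id;

In the newer standard SQL dialect the EACH modifier is no longer needed; the engine decides how to handle large joins itself.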

What is table partitioning?

In which case we should use table partitioning?
An example may help.
We collected data on a daily basis from a set of 124 grocery stores. Each day's data was completely distinct from every other day's. We partitioned the data on the date. This allowed us to have faster searches, because Oracle can use partitioned indexes and quickly eliminate all of the non-relevant days.
This also allows for much easier backup operations, because you can work on just the new partitions.
Also, after 5 years of data we needed to get rid of an entire day's data at a time. You can "drop" or eliminate an entire partition instead of deleting rows, so getting rid of old data was a snap.
So... They are good for large sets of data and very good for improving performance in some cases.
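A minimal sketch of that setup in Oracle-style DDL; the table, columns and partition names are hypothetical:

    -- Range-partition the daily store data by date so queries prune to the
    -- relevant days and old data can be removed by dropping a partition.
    CREATE TABLE store_sales (
        sale_date  DATE,
        store_id   NUMBER,
        amount     NUMBER
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION p20150101 VALUES LESS THAN (TO_DATE('2015-01-02', 'YYYY-MM-DD')),
        PARTITION p20150102 VALUES LESS THAN (TO_DATE('2015-01-03', 'YYYY-MM-DD'))
    );

    -- Purging an entire day is a metadata operation, not a row-by-row DELETE.
    ALTER TABLE store_sales DROP PARTITION p20150101;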
Partitioning enables tables, indexes, and index-organized tables to be subdivided into smaller, manageable pieces; each small piece is called a "partition".
For more info: Partitioning in Oracle. What? Why? When? Who? Where? How?
Use it when you want to break a table down into smaller pieces (based on some logical breakdown) to improve performance. Users can then refer to the data by the single table name or address the individual partitions within it.
Table partitioning is a technique adopted by some database management systems to deal with large databases. Instead of a single storage location for the table, they split your table into several files for quicker queries.
If you have a table which will store large amounts of data (I mean REALLY large amounts, like millions of records), table partitioning will be a good option.
i) It is the subdivision of a single table into multiple segments, called partitions, each of which holds a subset of values.
ii) You should use it when you have read the documentation, run some test cases, fully understood the advantages and disadvantages of it, and found that it will be of benefit.
You are not going to get a complete answer on a forum. Go and read the documentation and come back when you have a manageable problem.

performance - single join select vs. multiple simple selects

What is better as far as performance goes?
There is only one way to know: Time it.
In general, I think a single join enables the database to do a lot of optimizations, as it can see all the tables it needs to scan, overhead is reduced, and it can build up the result set locally.
Recently, I had about 100 select statements which I changed into a single JOIN in my code. With a few indexes, I was able to go from about 1 minute of running time to about 0.6 seconds.
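As a hedged illustration of that kind of change (table and column names invented for the example):

    -- Before: one select per parent row, issued in a loop from the application.
    SELECT * FROM orders WHERE customer_id = 1;
    SELECT * FROM orders WHERE customer_id = 2;
    -- ... repeated for every customer id ...

    -- After: a single join lets the optimizer plan the whole access path once.
    SELECT c.customer_id, o.order_id, o.total
    FROM   customers c
    JOIN   orders    o ON o.customer_id = c.customer_id
    WHERE  c.is_active = 1;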
Do not try to write your own join loop as a bunch of selects. Your database server has many clever algorithms for doing joins. Further, your database server can use statistics and estimated cost of access to dynamically pick a join algorithm.
The database server's join algorithm is -- usually -- better than anything you might concoct. They know more about physical I/O, caching and what-not.
This allows you to focus on your problem domain.
A single join will usually outperform multiple single selects. However, there are too many different cases that fit your question. It isn't wise to lump them together under a single simple rule.
More important, a single join will usually be easier for the next programmer to understand and to revise, provided that you and the next programmer "speak the same language" when you use SQL. I'm talking about the language of sets of tuples.
And equally important is that database physical design and query design need to focus first on the questions that will result in a ten-for-one speed improvement, not a 10% speed improvement. If you were doing thousands of simple selects versus a single join, you might get a ten-for-one advantage. If you are doing three or four simple selects, you won't see a big improvement one way or the other.
One thing to consider, besides what has been said, is that the selects will probably return more data over the network than the join will. If the network connection is already a bottleneck, this could make it much worse, especially if it is done frequently. That said, your best bet in any performance situation is to test, test, test.
It all depends on how the database will optimize the joins, and the use of indexes.
I had a slow and complex query with lots of joins. Then I subdivided it into 2 or 3 less complex queries. The performance gain was astonishing.
But in the end, "it depends"; you have to know where the bottleneck is.
As has been said before, there is no right answer without context.
The answer to this depends on (off the top of my head):
the amount of joining
the type of joining
indexing
the amount of re-use you could have for any of the separate pieces to be joined
the amount of data to be processed
the server setup
etc.
If you are using SQL Server (I am not sure whether this is available with other RDBMSs), I would suggest that you include an execution plan with your query results. This will give you the ability to see exactly how your queries are being executed and what is causing any bottlenecks.
Until you know what SQL Server is actually doing, I wouldn't hazard a guess about which query is better.
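A small sketch of how you might compare the two variants in SQL Server; the join shown is just a placeholder for whichever queries you are timing:

    -- In SSMS you can also turn on "Include Actual Execution Plan" (Ctrl+M)
    -- before running the query to see the plan graphically.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run the single-join version, then the multiple-select version,
    -- and compare the logical reads and elapsed times reported for each.
    SELECT c.customer_id, o.order_id, o.total
    FROM   customers c
    JOIN   orders    o ON o.customer_id = c.customer_id;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;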
If your database holds a lot of data and there are multiple joins, then please use indexing for better performance.
If there are left/right outer joins in this case, then use multiple selects.
It all depends on your db size, your query, and the indexes (which include primary and foreign keys as well)... One cannot reach a yes/no conclusion on your question.