I am using Hive version 0.7.1-cdh3u2
I have two big tables (let's say) A and B, both partitioned by day. I am running the following query
select col1,col2
from A join B on (A.day=B.day and A.key=B.key)
where A.day='2014-02-25'
When I look at the XML file of the MapReduce task, I find that mapred.input.dir includes A/2014-02-25 but every day's HDFS directory for B, rather than only the one for the specific day ('2014-02-25'). This takes a lot of time and spawns more reduce tasks.
I also tried to use
select col1,col2
from A join B on (A.day=B.day and A.key=B.key and A.day='2014-02-25'
and B.day='2014-02-25')
This query performed much faster, and mapred.input.dir contained only the required HDFS directories.
I have the following questions.
Shouldn't the Hive optimizer be smart enough to run both queries in exactly the same manner?
What is an optimized way to write a Hive query that joins such tables partitioned on multiple keys?
What is the difference, in terms of performance, between placing partition conditions in the join's ON clause versus the WHERE clause?
You need to state the condition, i.e. the partition filter, explicitly for each table, either in the JOIN clause or in the WHERE clause. Hive will then process only the required partitions, which in turn improves performance.
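For example, a minimal sketch of the WHERE-clause variant, with the partition filter repeated for both tables so that each side of the join is pruned:
select col1, col2
from A join B on (A.day = B.day and A.key = B.key)
-- filter the partition column of BOTH tables, so only the
-- 2014-02-25 directories of A and B land in mapred.input.dir
where A.day = '2014-02-25'
and B.day = '2014-02-25'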
You can refer to this link:
Apache Hive LanguageManual
I want to join two large tables with many columns using Presto SQL syntax in AWS Athena. My code is pretty simple:
select
*
from TableA as A
left join TableB as B
on A.key_id = B.key_id
;
After joining, the primary key column (key_id) appears twice in the result. Both tables have more than 100 columns, and the join takes very long. How can I fix it so that the key_id column does not appear twice in the final result?
P.S. Unlike Google BigQuery, AWS Athena does not support an EXCEPT modifier (as in SELECT * EXCEPT(col)).
This would be a nice feature, but it is not part of standard SQL. In standard SQL, the EXCEPT keyword is a set-based operation that filters rows, not columns.
In Athena, as with standard SQL, you will have to specify the columns you want to include. The argument for this is that it is lower maintenance; in fact, best practice is to always state explicitly the columns you want, never leaving it to "whatever columns exist". This helps ensure your queries don't change behaviour if/when your table structure changes.
Some SQL dialects have features like this; I understand Oracle has something similar. But to my knowledge Athena (/ PrestoSQL / Trino) does not.
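A minimal sketch of the explicit-column approach (the non-key column names are hypothetical placeholders for your own):
select
    A.key_id,   -- keep a single copy of the join key
    A.col_a1,   -- hypothetical columns from TableA
    B.col_b1    -- hypothetical columns from TableB
from TableA as A
left join TableB as B
on A.key_id = B.key_id
;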
In my corporate project, I need to cross join a dataset of over a billion rows with another of about a million rows using Spark SQL. Since a cross join was used, I decided to divide the first dataset into several parts (each having about 250 million rows) and cross join each part with the million-row one, then combine the results with UNION ALL.
Now I need to improve the performance of the join. I have heard it can be done by partitioning the data and distributing the work across Spark workers. My questions are: how can partitioning be used effectively to improve performance, and what other ways are there to do this without partitioning?
Edit: filtering already included.
Well, in all scenarios you will end up with tons of data. Be careful, and avoid Cartesian joins on big data sets as much as possible, as they usually end in OOM exceptions.
Yes, partitioning can be the way that helps you, because you need to distribute your workload from one node to more nodes, or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). You first need to evaluate what your partitioning key is right now; afterwards you may find a better partitioning key/mechanism and repartition the data, thereby distributing the load. Either way the join must still be done, but it will be done with more parallel sources.
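One possibility, sketched below under the assumption of Spark 2.2+ (where the BROADCAST hint is available) and hypothetical table names: broadcasting the ~million-row side means the billion-row side never has to be shuffled.
-- Ship the small table to every executor; Spark then runs a
-- broadcast nested-loop join and the large table stays in place.
SELECT /*+ BROADCAST(small) */ big.id, small.attr
FROM big_table AS big
CROSS JOIN small_table AS small;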
There should be some filters on your join query. You can use the filter attributes as a key to partition the data, and then join based on the partitioned data.
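A sketch of that idea, assuming Spark 3.0+ (where the REPARTITION hint accepts column arguments) and hypothetical table/column names:
-- Repartition the large input by the join key in a subquery, so the
-- subsequent join is spread over 200 parallel partitions.
WITH big_repart AS (
    SELECT /*+ REPARTITION(200, key) */ *
    FROM big_table
)
SELECT b.key, s.attr
FROM big_repart AS b
JOIN small_table AS s
    ON b.key = s.key;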
Let's say I have a table with multiple partitions and I need to query something from the entire table. Is there a difference, from a performance point of view, between running a single SQL query on the entire table and running one SQL query per partition?
Later edit: I'm using Postgres.
In Microsoft SQL Server, when you create a partition function for partitioning a table, this function partitions the data and routes queries to the appropriate data file.
For example, if your partition function is created on a datetime field and partitions the data yearly, a query runs against only the single data file that contains the data matching your WHERE clause.
Therefore you don't need to split up your query; the SQL Server engine does that automatically.
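A minimal sketch of such a yearly setup in SQL Server (all names are hypothetical):
-- Partition function: rows are routed by the year of sale_date.
CREATE PARTITION FUNCTION pf_yearly (datetime)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01');

-- Scheme maps each partition to a filegroup (all to PRIMARY here).
CREATE PARTITION SCHEME ps_yearly
    AS PARTITION pf_yearly ALL TO ([PRIMARY]);

-- A WHERE clause on sale_date now touches only one partition.
CREATE TABLE sales (
    sale_date datetime NOT NULL,
    amount    money
) ON ps_yearly (sale_date);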
It depends on what your intention is.
If you already have a partitioned table and are deciding what the best strategy to retrieve all rows is, then running a query against the partitioned table is almost certainly the faster solution.
Retrieval of all partitions will most likely be parallelized (depending on your configuration of parallel query). If you query each partition manually, you would need to implement that yourself, e.g. by creating multiple connections, each one running a query against one partition.
However, if your intention is to decide whether it makes sense to partition a table at all, then the answer isn't so straightforward. If you have to query all rows of the table very often, then this is usually (slightly) slower than querying a single non-partitioned table. If that is the exception and you almost always run queries that target a single partition, then partitioning does make sense.
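For reference, a minimal sketch of declarative partitioning in PostgreSQL 10+ (table and column names are hypothetical); EXPLAIN shows whether the planner prunes to a single partition:
-- Parent table partitioned by range on the date column.
CREATE TABLE measurements (
    recorded_on date NOT NULL,
    value       numeric
) PARTITION BY RANGE (recorded_on);

CREATE TABLE measurements_2024 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- The plan should list only measurements_2024 for this predicate.
EXPLAIN SELECT * FROM measurements WHERE recorded_on = '2024-02-25';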
I have written a very complicated query in Amazon Redshift which comprises 3-4 temporary tables along with sub-queries. Since the query is slow to execute, I tried to replace it with another query, which uses derived tables instead of temporary tables.
I just want to ask: is there any way to compare the EXPLAIN output of both queries, so that we can conclude which query performs better (in both space and time)?
Also, how helpful is replacing temporary tables with derived tables in Redshift?
When Redshift generates its own temporary tables (visible in the plan), you may be able to tune the query by creating them as temporary tables yourself, specifying compression and adding distribution and sort keys that help with joins done on the table.
Very slow queries typically use a nested loop join style. The fastest join type is a merge join. If possible, rewrite the query or modify the tables to use merge join or at least hash join. Details here: https://docs.aws.amazon.com/redshift/latest/dg/query-performance-improvement-opportunities.html
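A sketch of that approach with hypothetical table and column names, materializing an intermediate result with distribution and sort keys on the join column:
-- DISTKEY/SORTKEY on the join column colocate matching rows and,
-- when both sides are sorted on it, allow a merge join downstream.
CREATE TEMP TABLE stage_orders
DISTKEY (customer_id)
SORTKEY (customer_id)
AS
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;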
Resources to better understand Redshift query planning and execution:
Query Planning And Execution Workflow:
https://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
Reviewing Query Plan Steps:
https://docs.aws.amazon.com/redshift/latest/dg/reviewing-query-plan-steps.html
Mapping the Query Plan to the Query Summary:
https://docs.aws.amazon.com/redshift/latest/dg/query-plan-summary-map.html
Diagnostic Queries for Query Tuning:
https://docs.aws.amazon.com/redshift/latest/dg/diagnostic-queries-for-query-tuning.html
Hi, I recently started a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by dates, and create tables for those data in PostgreSQL; PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it would be a bad idea.
I wanted to use something like Talend or MuleSoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table to PostgreSQL as the final table for reporting.
Any suggestions, especially if this is not good practice with Hive? I'm new to Hive.
Thanks.
The major issue with joining in Hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and several mappers will launch, as much as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large, it may result in slow jobs.
If one of the tables is significantly smaller than the other you may want to store it in HDFS cache, making its data available in every node, which allows the join algorithm to retrieve all data locally.
So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.
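A sketch of that small-table idea in HiveQL (table names are hypothetical); the MAPJOIN hint asks Hive to load the small table into memory on every mapper, so the join needs no shuffle:
-- small_table is loaded into each mapper's memory; big_table is then
-- joined map-side, locally, without a reduce phase.
SELECT /*+ MAPJOIN(s) */ b.id, s.label
FROM big_table b
JOIN small_table s
ON b.id = s.id;

-- Recent Hive versions can convert such joins automatically:
SET hive.auto.convert.join=true;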
Hive is growing in maturity
It is possible that arguments against using joins no longer apply to recent versions of Hive.
The most clear example I found in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
Sidenote:
Personally I find Pig code much easier to re-use and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes a job does get stuck or runs slowly due to network traffic.
It also depends on the number of nodes the cluster has.
Thanks