I ran into a Hive query that calculates a count distinct without grouping, and it runs very slowly. So I was wondering: how is this functionality implemented in Hive? Is there a UDAFCountDistinct for this?
Hive 1.2.0+ provides an automatic rewrite optimization for count(distinct). Check this setting:
hive.optimize.distinct.rewrite=true;
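Conceptually, the rewrite turns the single-reducer distinct count into a two-stage aggregation. A minimal sketch of the equivalent manual rewrite (the `events` table and `user_id` column are hypothetical):

```sql
-- Slow form: all rows flow to a single reducer that must
-- deduplicate every value before counting.
SELECT COUNT(DISTINCT user_id) FROM events;

-- Equivalent rewrite: the GROUP BY deduplicates in parallel
-- across reducers, and the outer COUNT(*) is cheap.
SELECT COUNT(*)
FROM (
  SELECT user_id
  FROM events
  GROUP BY user_id
) t;
```

With the setting enabled, recent Hive versions can apply this kind of rewrite automatically, so the manual form is mainly useful on older versions or to confirm what the optimizer is doing.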
Related
I am using Spark 2.4 with Java.
In my Spark job, I am doing some aggregations such as avg, percentiles, etc.
They are written with a GROUP BY clause, but it is very slow.
Hence I tried to write the same aggregations with window functions (PARTITION BY and ORDER BY clauses), but that turned out even slower.
Window functions are supposed to be faster, right? So how do I tune this for better performance?
Any good resources or notes would be highly appreciated.
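For context, the two formulations behave quite differently. A hedged sketch in Spark SQL (the `employees` table and `dept`/`salary` columns are made up for illustration):

```sql
-- GROUP BY collapses rows: one output row per key, and Spark can
-- do partial (map-side) aggregation before the shuffle.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;

-- A window function keeps EVERY input row and attaches the
-- aggregate to each one, which requires sorting rows within each
-- partition and disallows map-side partial aggregation.
SELECT dept, salary,
       AVG(salary) OVER (PARTITION BY dept) AS avg_salary
FROM employees;
```

So when you only need one row per group, GROUP BY is normally the cheaper plan; window functions are for when you need the per-row detail alongside the aggregate, not a general speed-up.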
In Spark one can write SQL queries as well as use the Spark API functions. reduceByKey should always be preferred over groupByKey, as it avoids shuffling more data than necessary.
I would like to know: when you use SQL queries by registering the DataFrame, how can we use reduceByKey? In SQL there is only GROUP BY, no reduce-by. Does Spark internally optimize it to use a reduceByKey-style aggregation rather than a plain groupByKey?
I got it. I ran EXPLAIN to inspect the physical plan: it first executes a partial_sum function and only afterwards the final sum, which implies that Spark first sums within each executor and then shuffles the partial results.
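The two-phase plan can be illustrated with a toy sketch in plain Python (not Spark; the partition contents are invented). Phase 1 mirrors `partial_sum` running inside each executor, phase 2 the final `sum` after the shuffle:

```python
# Toy illustration of partial aggregation: sum per key within each
# "partition" first, then merge the small partial results.
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 2), ("a", 3)],   # partition 0
    [("a", 4), ("b", 5)],             # partition 1
]

# Phase 1: partial_sum within each partition (no shuffle yet).
partials = []
for part in partitions:
    local = defaultdict(int)
    for key, value in part:
        local[key] += value
    partials.append(dict(local))

# Phase 2: "shuffle" only the compact partial results and merge them.
final = defaultdict(int)
for local in partials:
    for key, value in local.items():
        final[key] += value

print(dict(final))  # {'a': 8, 'b': 7}
```

The point is that only the per-partition partial sums cross the network, not every raw row, which is exactly why GROUP BY in Spark SQL does not suffer from the naive groupByKey shuffle cost.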
I am working in MySQL (Hive, to be particular).
There is no such limit; however, you can refer to the documentation for the kinds of subqueries that are supported.
I am starting a project where I need to perform some non-equality (inequality) joins.
Now, I've read that neither Pig nor Hive supports inequality joins.
I have also read that Pig can support this by using CROSS and FILTER.
Could I do the same in Hive using a WHERE clause? Are there any cases where it is not possible?
Finally, supposing I can do this in both Pig and Hive, which would perform better?
I remember Hive can only use one reducer to do a CROSS. Pig uses a smarter approach to implement CROSS and runs it in parallel, so it usually has better performance than Hive.
BTW, I have not updated my knowledge of Hive and Pig for a year, so I'm not sure whether Hive has improved CROSS since then.
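The CROSS-plus-filter approach the question asks about can be sketched in HiveQL as follows (the tables and columns are hypothetical):

```sql
-- Hive's JOIN ... ON traditionally accepts only equality conditions,
-- but an inequality join can be emulated with a cross join plus a
-- WHERE filter. Note this materializes the full Cartesian product
-- first, so it can be very expensive on large tables.
SELECT a.id, b.id
FROM table_a a
CROSS JOIN table_b b
WHERE a.start_ts < b.event_ts;
```

Because the cross join is the bottleneck, reducing the input first (e.g. filtering or pre-aggregating each side) matters more than which engine runs it.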
I am using Hive version 0.7.1-cdh3u2
I have two big tables (let's say) A and B, both partitioned by day. I am running the following query
select col1,col2
from A join B on (A.day=B.day and A.key=B.key)
where A.day='2014-02-25'
When I look at the XML file of the map-reduce task, I find that mapred.input.dir includes A/2014-02-25 but all HDFS directories for every day of B, rather than only the specific day ('2014-02-25'). This takes a lot of time and spawns a larger number of reduce tasks.
I also tried to use
select col1,col2
from A join B on (A.day=B.day and A.key=B.key and A.day='2014-02-25'
and B.day='2014-02-25')
This query ran much faster, with only the required HDFS directories in mapred.input.dir.
I have the following questions.
Shouldn't the Hive optimizer be smart enough to run both queries in exactly the same manner?
What is an optimized way to write Hive queries that join such tables partitioned on multiple keys?
What is the difference, in terms of performance, between putting the partition conditions in the join's ON clause versus the WHERE clause?
You need to state the condition, i.e. the partition predicate, explicitly in the JOIN clause or in the WHERE clause for each table. That way Hive processes only the required partitions, which in turn improves performance.
You can refer to this link:
Apache Hive LanguageManual
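To confirm that pruning actually applied to both sides of the join, you can inspect the plan with EXPLAIN; a sketch reusing the tables from the question (the exact plan output varies by Hive version):

```sql
-- With partition predicates stated for BOTH tables, the plan should
-- list only the 2014-02-25 partition of A and of B as inputs.
EXPLAIN
SELECT col1, col2
FROM A JOIN B
  ON (A.day = B.day AND A.key = B.key)
WHERE A.day = '2014-02-25'
  AND B.day = '2014-02-25';
```

On old versions such as 0.7.1, the optimizer did not reliably infer B.day = '2014-02-25' from A.day = B.day (predicate transitivity), which is why the explicit predicate on B makes such a large difference.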