What is the difference between bucketing and partitioning?
Where and when should each be used?
Related
I ran into a Hive query that calculates a count distinct without grouping, and it runs very slowly. How is this functionality implemented in Hive? Is there a UDAFCountDistinct for this?
Hive 1.2.0+ provides auto-rewrite optimization for count(distinct). Check this setting:
hive.optimize.distinct.rewrite=true;
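To see why such a rewrite can help, here is a minimal Python sketch of the idea, not Hive's actual implementation. It assumes the rewrite behaves roughly like turning count(distinct col) into a count over a deduplicated subquery, so that deduplication is spread across several reducers instead of one. The function names and the `num_reducers` parameter are made up for illustration.

```python
from collections import defaultdict

def count_distinct_single_stage(values):
    # Models the slow plan: one reducer receives every row
    # and has to hold the entire distinct set by itself.
    return len(set(values))

def count_distinct_two_stage(values, num_reducers=4):
    # Stage 1: route each value to a reducer by its key, so all duplicates
    # of a value land on the same reducer and are removed there in parallel.
    # (Integer keys and modulo routing are assumptions of this sketch.)
    per_reducer = defaultdict(set)
    for v in values:
        per_reducer[v % num_reducers].add(v)
    # Stage 2: a cheap final step only sums the per-reducer distinct counts.
    return sum(len(s) for s in per_reducer.values())

values = [1, 2, 2, 3, 3, 3, 10, 10]
single = count_distinct_single_stage(values)
two_stage = count_distinct_two_stage(values)
```

Both functions return the same count. The difference is that the second one never needs a single worker to see all the rows.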
What are the different ways to distribute data in Hive? E.g. PARTITIONED BY, bucketing, DISTRIBUTE BY, etc.
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
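The contrast between partitioning and bucketing can be modeled in a few lines of Python. This is a conceptual sketch only: it assumes partitioning creates one group per distinct column value, while bucketing assigns each row to one of a fixed number of buckets using a hash of the column modulo the bucket count. The helper names and sample rows are hypothetical, and the integer value itself stands in for the hash.

```python
from collections import defaultdict

def partition_rows(rows, column):
    # Partitioning: one group per distinct value, named like a Hive
    # partition directory (e.g. "city=Paris"). The number of groups
    # grows with the number of distinct values.
    parts = defaultdict(list)
    for row in rows:
        parts[f"{column}={row[column]}"].append(row)
    return dict(parts)

def bucket_rows(rows, column, num_buckets):
    # Bucketing: a fixed number of buckets, chosen up front.
    # Assumption: the column holds integers and the value is its own hash.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[column] % num_buckets].append(row)
    return buckets

rows = [
    {"user_id": 1, "city": "Paris"},
    {"user_id": 2, "city": "Tokyo"},
    {"user_id": 3, "city": "Paris"},
    {"user_id": 4, "city": "Tokyo"},
    {"user_id": 5, "city": "Lima"},
]
by_city = partition_rows(rows, "city")     # 3 partitions, one per city
by_user = bucket_rows(rows, "user_id", 2)  # always exactly 2 buckets
```

This suggests a rule of thumb: partition on low-cardinality columns that queries filter on (date, city, department), and bucket on high-cardinality columns such as IDs, where one partition per value would produce far too many small groups.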
I'm trying to partition my tables in BigQuery (BQ). I've read the documentation and it always points to timePartition. I understand that this may be the default partition, but is it possible to define one or more of the table's own columns as the partition?
Any inputs would help. Thanks!
Not as of today. The only available partition type is "DAY".
I have an Oracle table that takes 3 hours to respond to a select query. I was thinking about importing it into Hadoop for processing.
Would that be a good idea? If I use Hive to run the same query, would there be any performance gain?
If yes, how should I import my table into Hadoop? Since the table has a composite primary key, Sqoop is not an option. One more thing: should I use HBase? Which approach would be better?
Performance in Hadoop depends on the size of the data. If the data is really huge, you can see a performance improvement. If the data is small, it's better to tune your query.
I am starting a project where I need to do some non-equality joins.
I've read that neither Pig nor Hive supports inequality joins.
I have also read that Pig can work around this by using CROSS and FILTER.
Could I do the same in Hive using a WHERE clause? Are there any cases where it is not possible?
Finally, supposing I can do it in both Pig and Hive, which would perform better?
As I recall, Hive can only use one reducer to do "CROSS". Pig uses a smart approach to implement "CROSS" and runs it in parallel, so it usually performs better than Hive.
BTW, I have not updated my knowledge of Hive and Pig for a year, so I'm not sure whether Hive has improved "CROSS" in that time.
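For reference, the CROSS-then-FILTER workaround mentioned in the question can be sketched in Python as below. This only illustrates the logic and says nothing about how Pig or Hive execute it; the `theta_join` helper and the sample data are hypothetical.

```python
from itertools import product

def theta_join(left, right, predicate):
    # Step 1 (CROSS): form every (left, right) pair.
    # Step 2 (FILTER / WHERE): keep only pairs satisfying the inequality.
    # Cost is len(left) * len(right) comparisons, which is why the
    # engine's ability to parallelize the cross product matters.
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]

events = [{"id": "e1", "ts": 5}, {"id": "e2", "ts": 15}]
windows = [
    {"name": "w1", "start": 0, "end": 10},
    {"name": "w2", "start": 10, "end": 20},
]

# Range (non-equality) join condition: start <= ts < end.
matches = theta_join(
    events, windows, lambda e, w: w["start"] <= e["ts"] < w["end"]
)
pairs = [(e["id"], w["name"]) for e, w in matches]
```

The same shape applies to any join condition that can be written as a predicate over a pair of rows. The catch is the quadratic number of pairs produced before filtering.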