Ordering observations in Hive

In Hive, we have ORDER BY, DISTRIBUTE BY, SORT BY, and CLUSTER BY to arrange records.
Order by: the output of all the mappers is passed to a single reducer, and that reducer does the ordering. When we print the final dataset, we see that it is sorted by the given columns.
Distribute by: after the mappers run, the output is hashed on the given columns and distributed across the reducers.
Sort by: it sorts rows within each reducer. So to achieve ordering, we have to call DISTRIBUTE BY and then SORT BY.
Cluster by: it wraps DISTRIBUTE BY and SORT BY together.
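For concreteness, the queries I am comparing look roughly like this (the table and column names are just placeholders):
-- single reducer, totally ordered output
SELECT * FROM txn ORDER BY user_id, txn_date;
-- rows hashed to reducers by user_id, then sorted within each reducer
SELECT * FROM txn DISTRIBUTE BY user_id SORT BY user_id, txn_date;
-- shorthand for DISTRIBUTE BY + SORT BY on the same column
SELECT * FROM txn CLUSTER BY user_id;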
When I print the query result using any of the above methods (ORDER BY / DISTRIBUTE BY & SORT BY / CLUSTER BY), I see that the whole dataset is sorted by the given columns.
So my understanding is that when the results are printed to the console, even in the case of DISTRIBUTE BY & SORT BY or CLUSTER BY, the output files of all the reducers are read, passed to a single reducer, and sorted; then the output is printed.
Only when we save the query output does ORDER BY produce a single output file, whereas the other two save the result in multiple files (depending on the number of reducers that ran).
Is my understanding correct?
Thanks!

Related

Specify minimum number of generated files from Hive insert

I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but on some days it creates just a single big file. I reran the query a couple of times, in case the number of files was influenced by the availability of nodes in the cluster, but it seems consistent.
So my questions are
(a) what determines how many files are generated and
(b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?
The number of files generated during INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) plus the configured bytes per reducer.
If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in every partition. This puts high pressure on the reducers and may cause an OOM exception.
To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
If the data volume is too big and you want more reducers to increase parallelism and create more files per partition, add a random number to the DISTRIBUTE BY, for example FLOOR(RAND()*100.0)%10 - this additionally distributes the data into 10 random buckets, so each partition will contain 10 files.
Finally, your INSERT statement will look like this:
INSERT OVERWRITE table PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; --10 files per partition
Also this configuration setting affects the number of files generated:
set hive.exec.reducers.bytes.per.reducer=67108864;
If you have a lot of data, Hive will start more reducers so that each reducer process handles no more than the specified bytes per reducer. The more reducers, the more files will be generated. Decreasing this setting increases the number of reducers running, and they will create a minimum of one file per reducer. If the partition column is not in the DISTRIBUTE BY, then each reducer may create files in every partition.
To make a long story short, use
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition
If you want 20 files per partition, use FLOOR(RAND()*100.0)%20 - this will guarantee a minimum of 20 files per partition if you have enough data, but it will not limit the maximum size of each file.
The bytes-per-reducer setting does not guarantee a fixed minimum number of files. The number of files depends on total data size / bytes.per.reducer. This setting does, however, guarantee the maximum size of each file.
It is much better to use some evenly distributed key (or combination of keys) with low cardinality instead of a random number, because if containers are restarted, rand() may produce different values for the same rows, which can cause data duplication or loss (data already present in some reducer's output may be distributed one more time to another reducer). You can calculate a similar function on the keys you have instead of rand() to get a more or less evenly distributed key with low cardinality.
You can use both methods combined: bytes per reducer limit + distribute by to control both the minimum number of files and maximum file size.
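A minimal sketch combining both controls - a deterministic bucket derived from an existing column instead of rand(), plus the bytes-per-reducer limit (your_table, src and user_id are placeholder names):
set hive.exec.reducers.bytes.per.reducer=67108864; -- bounds data per reducer at ~64 MB
INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, pmod(hash(user_id), 10); -- deterministic 10 buckets per partition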
Also read this answer about using distribute by to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344

Can we predict the order of the results of a Hive SELECT * query?

Is it possible that the order of the results of a SELECT * query (no ORDER BY) is always the same provided that the same DBMS is used as Metastore?
So, as long as MySQL is used as the Metastore, the order of the results for a SELECT * query will always be the same. If Postgres is used, the order will always be the same on the same data, but different from when MySQL is used. I am talking about the same data in both cases.
Maybe it all boils down to the question of what the default order of results is, and why it differs between the MySQL and Postgres Metastores.
There is no such thing as a default order of rows; without ORDER BY the order is not guaranteed. This fact is not connected to the Metastore database used.
In general, data is read in parallel by many processes (mappers): after the splits are calculated, each process starts reading some piece of a file, or a few files, depending on the splits. The parallel processes can handle different volumes of data and run on different nodes; the load is not the same each time, so they start returning rows and finish at different times, depending on many factors such as node load, network load, volume of data per process, and so on. Removing all these factors would increase the order prediction accuracy - say, a single-threaded sequential read of a file will return rows in the same order as they appear in the file - but this is not how the database works.
Also according to Codd's relational theory, the order of columns and rows is immaterial to the database.

Hadoop map reduce job modelling

I am fairly new to Hadoop and I need help modelling a MapReduce job.
I have two groups of files: GroupA and GroupB. The structure of both groups of files is the same: key,value on each line. Group A and B have the same set of keys; however, the values in the two groups represent different properties. The files are sufficiently large, hence the Hadoop option.
The task is to combine the properties from group A and group B for each individual key into a third property for that key, and then sum up the third property across all the keys.
At first glance, the pipeline seems to be: Map -> collect the key-value pairs from both groups of files. Combine/partition/sort/shuffle -> group the entries with the same key into the same partition so they fall to the same reducer (handled by Hadoop internally). Reduce -> combine the values for the same key into the third property and write it in batches to the output files.
I am not sure how to model the final step of adding up the third property across the keys. One way I can think of is to have another MapReduce job after this one which takes these files and combines them through one reducer into the result. Is this the right way of modelling it? Is there any other way I can model this? Is it possible to have consecutive reducers, along the lines of map -> reduce -> reduce?
The model in Hadoop would be two MapReduce jobs triggered one after the other. If we use Spark over Hadoop, there is an action called count which can be invoked after the map/reduce steps to get the final output.
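As a rough illustration only, expressed in HiveQL rather than hand-written MapReduce code: if the two groups were loaded into tables group_a and group_b (hypothetical names, each with columns k and val, with a.val * b.val standing in for whatever the combine function is), the join derives the third property per key and the SUM adds it up across keys; Hive typically compiles this kind of query into a chain of MapReduce jobs, a join job followed by an aggregation job:
SELECT SUM(a.val * b.val) AS total_third_property  -- second job: global aggregation
FROM group_a a
JOIN group_b b ON a.k = b.k;                       -- first job: join on the shared key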

Difference between df.repartition and DataFrameWriter partitionBy?

What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?
I hope both are used to "partition data based on dataframe column"? Or is there any difference?
Watch out: I believe the accepted answer is not quite right! I'm glad you asked this question, because the behavior of these similarly-named functions differs in important and unexpected ways that are not well documented in the official Spark documentation.
The first part of the accepted answer is correct: calling df.repartition(COL, numPartitions=k) will create a dataframe with k partitions using a hash-based partitioner. COL here defines the partitioning key--it can be a single column or a list of columns. The hash-based partitioner takes each input row's partition key, hashes it into a space of k partitions via something like partition = hash(partitionKey) % k. This guarantees that all rows with the same partition key end up in the same partition. However, rows from multiple partition keys can also end up in the same partition (when a hash collision between the partition keys occurs) and some partitions might be empty.
In summary, the unintuitive aspects of df.repartition(COL, numPartitions=k) are that
partitions will not strictly segregate partition keys
some of your k partitions may be empty, whereas others may contain rows from multiple partition keys
The behavior of df.write.partitionBy is quite different, in a way that many users won't expect. Let's say that you want your output files to be date-partitioned, and your data spans 7 days. Let's also assume that df has 10 partitions to begin with. When you run df.write.partitionBy('day'), how many output files should you expect? The answer is 'it depends'. If each of your starting partitions in df contains data from every day, then the answer is 70. If each of your starting partitions in df contains data from exactly one day, then the answer is 10.
How can we explain this behavior? When you run df.write, each of the original partitions in df is written independently. That is, each of your original 10 partitions is sub-partitioned separately on the 'day' column, and a separate file is written for each sub-partition.
I find this behavior rather annoying and wish there were a way to do a global repartitioning when writing dataframes.
If you run repartition(COL) you change the partitioning during calculations - you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write you will get one directory with many files.
If you run .write.partitionBy(COL) then as a result you will get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some space in storage (the partitioning column is removed from the data files).
UPDATE: See conradlee's answer. He explains in detail not only how the directory structure will look after applying the different methods, but also what the resulting number of files will be in both scenarios.
repartition() is used to partition data in memory and partitionBy is used to partition data on disk. They're often used in conjunction.
Both repartition() and partitionBy can be used to "partition data based on dataframe column", but repartition() partitions the data in memory and partitionBy partitions the data on disk.
repartition()
Let's play around with some code to better understand partitioning. Suppose you have the following CSV data.
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
df.repartition(col("country")) will repartition the data by country in memory.
Let's write out the data so we can inspect the contents of each memory partition.
val outputPath = new java.io.File("./tmp/partitioned_by_country/").getCanonicalPath
df.repartition(col("country"))
.write
.csv(outputPath)
Here's how the data is written out on disk:
partitioned_by_country/
part-00002-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
part-00044-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
Each file contains data for a single country - the part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv file contains this China data for example:
Bruce,Lee,China
Jack,Ma,China
partitionBy()
Let's write out data to disk with partitionBy and see how the filesystem output differs.
Here's the code to write out the data to disk partitions.
val outputPath = new java.io.File("./tmp/partitionedBy_disk/").getCanonicalPath
df
.write
.partitionBy("country")
.csv(outputPath)
Here's what the data looks like on disk:
partitionedBy_disk/
country=Argentina/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
country=China/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
country=Russia/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
Why partition data on disk?
Partitioning data on disk can make certain queries run much faster.

Datastore for aggregations

What is a preferred datastore for fast aggregating of data?
I have data that I pull from other systems regularly, and the data store should support queries like:
What is the number of transactions done by a user in a time range.
What is the total sum of successful transactions done by a user in a time range.
Queries should support SQL constructs like group by, count, sum, etc. over a large set of data.
Right now, I'm using a custom data model in Redis: data is fetched into memory and then aggregates are run over it. The problem with this model is that it is closely tied to my pivots (columns), and any additional pivot, if added, will cause my data to explode, leading to huge memory consumption on my Redis boxes.
I've explored Elasticsearch, but Elasticsearch queries with aggregations take longer than 200 ms for the kind of data that I have.
Are there any other alternatives? I'm also looking at Aerospike now. Can someone shed some light on how Aerospike aggregations would work in this scenario?
Aerospike supports aggregations on top of secondary index queries. It seems most of your queries are pivoted on the user. You can build a secondary index on userid and query for all the data corresponding to a user. You can then apply the aggregation logic and filter based on the desired time range. You need to do this because Aerospike does not yet support multiple where clauses, i.e. querying for a user and a time range at the same time.
Your queries 1 & 2 can be done by writing an aggregation UDF on top of a secondary index query on userid, as above.
I am not very clear about your third requirement. Aerospike does not provide group by, sum, count, etc. as native queries, but you can always write an aggregation UDF to achieve them. http://www.aerospike.com/docs/guide/aggregation.html