More than 1 reducer per bucket - Hive

I have a Hive table that is bucketed into 1024 buckets. The maximum reducer limit is set to 1024, and the usual rule of thumb is one reducer per bucket. Now I want to increase the number of reducers for faster performance, and I want to know whether I can set more than one reducer per bucket. If I can, how does it affect performance?

You should use
set hive.enforce.bucketing = true;
The number of reducers must equal the number of buckets. This is
described in the language manual.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
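As a rough illustration of how those pieces fit together, here is a minimal sketch using PyHive (the client choice, host, and table names are my own placeholders, not part of the question):
from pyhive import hive

# Placeholder connection details; any Hive client (beeline, Hue, etc.) works the same way.
conn = hive.Connection(host="emr-master", port=10000, username="hadoop")
cursor = conn.cursor()

# With bucketing enforced, Hive sets the number of reducers to the number of
# buckets of the target table (1024 here), so you do not tune it per bucket.
cursor.execute("SET hive.enforce.bucketing=true")
cursor.execute("INSERT OVERWRITE TABLE bucketed_1024 SELECT * FROM staging")
Given the one-reducer-per-bucket rule above, getting more reducers generally means re-creating the table with more buckets rather than raising the reducer count by hand.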

Related

How to avoid the maximum storage limit of Cosmos DB for Scalar DL's asset records?

I plan to store asset records in Scalar DL on Azure Cosmos DB.
Cosmos DB limits the maximum storage size across all items per (logical) partition to 20 GB.
Scalar DL uses the asset as the partition key on Azure Cosmos DB, so all records of the same asset are stored in the same partition. This means there is effectively a size limit per asset on Azure Cosmos DB.
Can we avoid this limitation?
Does Scalar DL have a feature to chain to another asset?
See also: https://learn.microsoft.com/en-us/azure/cosmos-db/concepts-limits
The short answer is No.
First, most database systems have such limitations.
For example, a DynamoDB partition also has a 10 GB limit.
A Cassandra partition can hold up to 2 billion cells, but keeping it under 100 MB is recommended for performance reasons (see https://www.instaclustr.com/cassandra-data-partitioning/).
So it is always good practice to model your assets so that they do not grow too large.
If you know what you are doing and there is no other option than having a large asset (partition), split your asset into multiple assets (e.g., for an asset with the ID Asset-A, create Asset-A-1, Asset-A-2, ..., Asset-A-M).
In that case, the application also has to manage how the assets are split.
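As a rough sketch of that splitting approach (the helper name, the record key, and M=10 sub-assets are all made up for illustration and are not part of Scalar DL), the application could derive the sub-asset ID deterministically, for example from a hash of a record-level key:
import hashlib

# Hypothetical helper: spread one logical asset ("Asset-A") over M physical
# assets (Asset-A-1 ... Asset-A-M) so no single partition grows past the limit.
def sub_asset_id(asset_id: str, record_key: str, num_sub_assets: int = 10) -> str:
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    index = int(digest, 16) % num_sub_assets + 1
    return f"{asset_id}-{index}"

print(sub_asset_id("Asset-A", "payment-2023-01-15-0001"))  # e.g. "Asset-A-7"
Because the mapping is deterministic, the application can locate the right sub-asset again later; if you split by a running counter instead, you would also have to store the current counter somewhere.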

Specify minimum number of generated files from Hive insert

I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files was influenced by the availability of nodes in the cluster, but it seems to be consistent.
So my questions are
(a) what determines how many files are generated and
(b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?
The number of files generated by INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) and on the configured bytes per reducer.
If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case every reducer creates files in every partition. This puts high pressure on the reducers and may cause an OOM exception.
To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
If the data volume is too big and you want more reducers for more parallelism and more files per partition, add a random number to the DISTRIBUTE BY, for example FLOOR(RAND()*100.0)%10: this additionally distributes the data into 10 random buckets, so each partition will contain 10 files.
Finally, your INSERT statement will look like this:
INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; --10 files per partition
This configuration setting also affects the number of files generated:
set hive.exec.reducers.bytes.per.reducer=67108864;
If you have a lot of data, Hive will start more reducers so that each reducer process handles no more than the specified number of bytes. The more reducers, the more files are generated. Decreasing this setting increases the number of reducers, and each reducer creates at least one file. If the partition column is not in the DISTRIBUTE BY, then each reducer may again create files in every partition.
To make a long story short, use
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition
If you want 20 files per partition, use FLOOR(RAND()*100.0)%20; this guarantees a minimum of 20 files per partition if you have enough data, but it does not guarantee the maximum size of each file.
The bytes-per-reducer setting does not guarantee a fixed minimum number of files; the number of files depends on total data size / bytes.per.reducer. What this setting does guarantee is the maximum size of each file.
However, it is much better to use an evenly distributed key (or a combination of keys) with low cardinality instead of a random value, because if containers are restarted, rand() may produce different values for the same rows, which can cause data duplication or loss (data already present in some reducer's output gets distributed one more time to another reducer). You can compute a similar low-cardinality, roughly evenly distributed value from keys you already have instead of rand(); see the sketch at the end of this answer.
You can use both methods combined: bytes per reducer limit + distribute by to control both the minimum number of files and maximum file size.
Also read this answer about using distribute by to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344
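Putting the two knobs together, here is a minimal sketch. PyHive as the client and the host, table, and column names are my own assumptions; ABS(HASH(user_id)) % 10 plays the role of the low-cardinality key suggested above instead of rand():
from pyhive import hive

conn = hive.Connection(host="emr-master", port=10000, username="hadoop")
cursor = conn.cursor()

# Cap the data volume per reducer (and therefore the approximate file size).
cursor.execute("SET hive.exec.reducers.bytes.per.reducer=67108864")
# Needed when part_col is a dynamic partition column in the INSERT below.
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hash a real key instead of rand() so re-executed containers send the same
# rows to the same reducers: at most 10 distribution groups per partition.
cursor.execute("""
    INSERT OVERWRITE TABLE your_table PARTITION(part_col)
    SELECT *
    FROM src
    DISTRIBUTE BY part_col, ABS(HASH(user_id)) % 10
""")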

Is there a total size limit for an S3 bucket?

I am wondering if there is a size limit for an S3 bucket, or whether I can store an unlimited number of objects.
I need this information to decide whether I need to write a cleanup tool.
You can store an unlimited number of objects in your S3 bucket. However, there are limits on the individual objects stored:
An object can be 0 bytes to 5TB.
The largest object that can be uploaded in a single PUT is 5 gigabytes
For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
If you have a very large number of objects, consider adding a randomized prefix to your object names, for example as sketched below.
When your workload is a mix of request types, introduce some randomness to key names by adding a hash string as a prefix to the key name. By introducing randomness to your key names the I/O load will be distributed across multiple index partitions. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key and add 3 or 4 characters from the hash as a prefix to the key name.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
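A minimal sketch of that key-naming idea (the key names and the 4-character prefix length are illustrative choices, not an AWS requirement):
import hashlib

# Prepend a few characters of an MD5 hash of the key so that keys are spread
# across S3's index partitions instead of sharing one hot prefix.
def randomized_key(original_key: str, prefix_len: int = 4) -> str:
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

print(randomized_key("logs/2016/03/01/server-0001.gz"))
# e.g. "4ab1/logs/2016/03/01/server-0001.gz"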
Yes, you can store an unlimited number of objects.
By the way, if performance is your concern, see
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html

How does Redis store data?

If Redis stores data as key-value pairs in memory, what is the initial size of the hash table Redis creates to store them? Does it create a table whose size matches the maxmemory parameter in the config file?
No, the size of the main dictionary's hash table is dynamic.
The initial size is 4 entries; it then grows to accommodate the data, following powers of 2. Growth is dynamic, and rehashing is performed incrementally in the background, so an expensive rehash cannot block a simple SET command.
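As a back-of-the-envelope illustration of that growth pattern (this is not the actual C implementation, just the powers-of-2 rule described above, assuming the table doubles once the number of entries reaches its current size):
def table_size_for(num_keys: int) -> int:
    size = 4  # initial hash table size
    while num_keys >= size:
        size *= 2  # grow to the next power of 2
    return size

for n in (3, 4, 100, 1_000_000):
    print(n, "keys ->", table_size_for(n), "slots")
# 3 keys -> 4 slots, 4 keys -> 8 slots, 100 keys -> 128 slots, 1000000 keys -> 1048576 slots
Note that nothing is sized up front from maxmemory; the table only grows as keys are actually added.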

S3 limit to objects in a bucket

Does anyone know if there is a limit to the number of objects I can put in an S3 bucket? Can I put a million, 10 million, etc., all in a single bucket?
According to Amazon:
Write, read, and delete objects containing from 0 bytes to 5 terabytes of data each. The number of objects you can store is unlimited.
Source: http://aws.amazon.com/s3/details/ as of Sep 3, 2015.
It looks like the limit has changed: a single object can now be up to 5 TB.
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
http://aws.amazon.com/s3/faqs/#How_much_data_can_I_store
There is no limit on objects per bucket.
There is a limit of 100 buckets per account (you need to ask Amazon if you need more).
There is no performance drop even if you store millions of objects in a single bucket.
From docs,
There is no limit to the number of objects that can be stored in a
bucket and no difference in performance whether you use many buckets
or just a few. You can store all of your objects in a single bucket,
or you can organize them across several buckets.
as of Aug 2016
While you can store an unlimited number of files/objects in a single bucket, when you list a "directory" in a bucket, the API returns only the first 1000 files/objects per call by default. To access all the files in a large "directory" like this, you need to make multiple paginated calls, for example:
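(A minimal boto3 sketch; the bucket name and prefix are placeholders.)
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Each page holds at most 1000 keys; the paginator follows the continuation
# tokens for you until the whole "directory" has been listed.
for page in paginator.paginate(Bucket="my-example-bucket", Prefix="images/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])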
There is no limit to the number of objects you can store in an S3 bucket; AWS claims unlimited storage. However, there are some limitations:
By default, customers can provision up to 100 buckets per AWS account. However, you can increase your Amazon S3 bucket limit by visiting AWS Service Limits.
An object can be 0 bytes to 5TB.
The largest object that can be uploaded in a single PUT is 5 gigabytes
For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
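For that last point, a minimal boto3 sketch (the file, bucket, and key names are placeholders; upload_file switches to multipart automatically once the file exceeds the configured threshold):
import boto3
from boto3.s3.transfer import TransferConfig

# Use multipart uploads for anything over 100 MB, in 64 MB parts.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024)

s3 = boto3.client("s3")
s3.upload_file("backup.tar.gz", "my-example-bucket", "backups/backup.tar.gz",
               Config=config)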
That being said, if you really have a lot of objects to store in an S3 bucket, consider randomizing your object name prefixes to improve performance.
When your workload is a mix of request types, introduce some randomness to key names by adding a hash string as a prefix to the key name. By introducing randomness to your key names the I/O load will be distributed across multiple index partitions. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key and add 3 or 4 characters from the hash as a prefix to the key name.
More details - https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
-- As of June 2018
"You can store as many objects as you want within a bucket, and write,
read, and delete objects in your bucket. Objects can be up to 5
terabytes in size."
from http://aws.amazon.com/s3/details/ (as of Mar 4th 2015)
@Acyra: performance of object delivery from a single bucket depends greatly on the names of the objects in it.
If the file names are spread out with random characters, their physical locations are spread further across the AWS hardware; but if you name everything 'common-x.jpg', 'common-y.jpg', those objects will be stored together.
This may slow delivery of the files if you request them simultaneously, though not by enough to worry you. The greater risk is data loss or an outage: since these objects are stored together, they will be lost or become unavailable together.