Is it possible to change partition size manually in Hive?

Whenever we create a partition in Hive, how much space gets allocated? Is it one block per partition, i.e., 128 MB? And is it also possible to change the size of a partition?

In Hive, a partition is a folder (plus, of course, Hive metadata about the partition and its location); no space is allocated up front. The size of a partition is simply the total size of the files sitting in that partition's folder.
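As a minimal sketch (with a hypothetical sales table), this is what that means in practice: adding a partition only creates a folder plus metastore metadata, and its size is whatever files end up inside that folder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-size-demo")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical partitioned table; names and paths are for illustration only.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (amount DOUBLE)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
""")

// This creates an empty folder (plus metastore metadata); nothing is pre-allocated.
spark.sql("ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt = '2020-10-10')")

// The partition "size" is simply the total size of the files under its folder, e.g.
//   hdfs dfs -du -s -h /user/hive/warehouse/sales/dt=2020-10-10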
Read also this: https://stackoverflow.com/a/47720850/2700344
and this https://stackoverflow.com/a/50219830/2700344

Related

Why is Spark reading more data than I expect it to read when using a read schema?

In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (I only need that one). Unfortunately, the Spark UI reports the size of files read as 1123.8 GiB, while the filesystem read data size total equals 417.0 GiB. I was expecting that if I take one of the 30 columns, the filesystem read data size total would be around 1/30 of the initial size, not almost half.
Could you explain why that is happening?
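For reference, a rough sketch of the setup described above (path, column name, and type are hypothetical); Parquet stores each column with its own encoding and compression, so the bytes read for one column do not necessarily scale as 1/30 of the total:

import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical path and column name, just to illustrate the single-column read.
val singleColumn = StructType(Seq(StructField("user_id", LongType)))

val df = spark.read
  .schema(singleColumn)                     // request only 1 of the ~30 columns
  .parquet("s3://my_bucket/huge_table/")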

How to efficiently filter a dataframe from an S3 bucket

I want to pull a specified number of days from an S3 bucket that is partitioned by year/month/day/hour. The bucket has new files added every day and will grow to be rather large. I want to do spark.read.parquet(<path>).filter(<condition>); however, when I ran this it took significantly longer (1.5 hr) than specifying the paths explicitly (0.5 hr). I don't understand why it takes longer. Should I be adding a .partitionBy() when reading from the bucket, or is it because of the volume of data in the bucket that has to be filtered?
The problem you are facing is partition discovery. If you point to the path where your Parquet files are with spark.read.parquet("s3://my_bucket/my_folder"), Spark will trigger a job called
Listing leaf files and directories for <number> paths
This is the partition discovery step. Why does it happen? When you call Spark with just the path, it has no other place to find out where the partitions are and how many of them exist.
In my case, if I run a count like this:
spark.read.parquet("s3://my_bucket/my_folder/").filter('date === "2020-10-10").count()
It will trigger the listing, which takes 19 seconds for around 1700 folders. Adding the 7 seconds for the count itself, that's a total of 26 seconds.
To avoid this overhead you should use a metastore. AWS provides a great solution with AWS Glue, which can be used just like the Hive metastore in a Hadoop environment.
With Glue you can store the table metadata and all of its partitions. Instead of giving the Parquet path, you point to the table like this:
spark.table("my_db.my_table").filter('date === "2020-10-10").count()
For the same data, with the same filter, the file listing no longer happens and the whole count took only 9 seconds.
In your case, where you partition by year, month, day, and hour, we are talking about 8760 folders per year.
I would recommend you take a look at this link and this link, which show how you can use Glue as your Hive metastore. That will help a lot to improve the speed of partition queries.
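If the table is not registered in a catalog yet, a rough sketch of how that registration could look (database, table, columns, and layout are hypothetical, assuming Hive-style year=/month=/day=/hour= folder names):

import spark.implicits._

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (value STRING)
  PARTITIONED BY (year INT, month INT, day INT, hour INT)
  STORED AS PARQUET
  LOCATION 's3://my_bucket/my_folder/'
""")

// Register the existing folders as partitions in the metastore (run once, or on a schedule).
spark.sql("MSCK REPAIR TABLE my_db.my_table")

// Partition pruning now uses the catalog instead of listing S3 prefixes.
spark.table("my_db.my_table")
  .filter($"year" === 2020 && $"month" === 10 && $"day" === 10)
  .count()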

Spark parquet file partitioning

I have 10000 Parquet files (each 13 KB) spread across 30 folders, so roughly 130 MB in total.
The property spark.sql.files.maxPartitionBytes is set to 128 MB (the default).
But when I read the data using Spark, the total number of partitions is 235.
Can anyone tell me how this is calculated?
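For what it's worth, here is a sketch of the sizing logic Spark applies to file-based reads (the maxSplitBytes calculation); the resulting partition count also depends on spark.default.parallelism and spark.sql.files.openCostInBytes, so the numbers below are illustrative rather than an exact reproduction of 235.

// Defaults / assumptions; adjust to your session configuration.
val maxPartitionBytes  = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes
val openCostInBytes    = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes (default 4 MB)
val defaultParallelism = 8                    // spark.default.parallelism (assumption)

val fileCount = 10000L
val fileSize  = 13L * 1024                    // 13 KB per file

// Every file is "charged" its size plus the open cost.
val totalBytes    = fileCount * (fileSize + openCostInBytes)
val bytesPerCore  = totalBytes / defaultParallelism
val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

// Files are then bin-packed into partitions up to maxSplitBytes each,
// so a rough estimate of the partition count is:
val estimatedPartitions = math.ceil(totalBytes.toDouble / maxSplitBytes).toInt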

How can we decide the total number of buckets for a Hive table?

I am a bit new to Hadoop. As far as I know, buckets are a fixed number of partitions within a Hive table, and Hive uses the same number of reducers as the total number of buckets defined when creating the table. So can anyone tell me how to calculate the total number of buckets in a Hive table? Is there any formula for calculating the total number of buckets?
Let's take a scenario where the table size is 2300 MB and the HDFS block size is 128 MB.
Now divide: 2300 / 128 = 17.96.
Remember that the number of buckets should be a power of 2.
So we need to find n such that 2^n > 17.96, which gives n = 5.
So I am going to use 2^5 = 32 buckets.
Hope it will help some of you.
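As a sketch of applying that estimate (database, table, and column names are hypothetical; this uses Spark's bucketing DDL, while plain Hive DDL uses the same CLUSTERED BY ... INTO 32 BUCKETS clause with STORED AS ORC or PARQUET):

spark.sql("""
  CREATE TABLE IF NOT EXISTS my_db.user_events (
    user_id BIGINT,
    event   STRING
  )
  USING PARQUET
  CLUSTERED BY (user_id) INTO 32 BUCKETS
""")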
From the documentation:
In general, the bucket number is determined by the expression
hash_function(bucketing_column) mod num_buckets. (There's a
'0x7FFFFFFF in there too, but that's not that important). The
hash_function depends on the type of the bucketing column. For an int,
it's easy, hash_int(i) == i. For example, if user_id were an int, and
there were 10 buckets, we would expect all user_id's that end in 0 to
be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.
For other datatypes, it's a little tricky. In particular, the hash of
a BIGINT is not the same as the BIGINT. And the hash of a string or a
complex datatype will be some number that's derived from the value,
but not anything humanly-recognizable. For example, if user_id were a
STRING, then the user_id's in bucket 1 would probably not end in 0. In
general, distributing rows based on the hash will give you a even
distribution in the buckets.
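A tiny sketch of that rule for an integer bucketing column, assuming hash_int(i) == i as described (the ids and bucket count are made up):

def bucketFor(id: Int, numBuckets: Int): Int =
  (id & 0x7FFFFFFF) % numBuckets

bucketFor(120, 10)  // 0 -- ids ending in the same digit land in the same bucket
bucketFor(231, 10)  // 1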
If you want to know how many buckets you should choose in your CLUSTERED BY clause, I believe it is good to choose a number that results in buckets that are at or just below your HDFS block size.
This should help you avoid ending up with lots of files that are mostly empty.
Also choose a number that is a power of two.
You can check your HDFS block size with:
hdfs getconf -confKey dfs.blocksize
The optimal bucket number is (B × hash table size of the table) / total memory of the node, with B = 1.01.
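As a rough worked example of that formula, with made-up numbers:

val b               = 1.01
val hashTableSizeGb = 50.0   // assumption: in-memory hash table size of the table
val nodeMemoryGb    = 16.0   // assumption: total memory available on one node
val optimalBuckets  = math.ceil(b * hashTableSizeGb / nodeMemoryGb).toInt  // ≈ 4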

Per row size limits in BigQuery data?

Is there a limit to the amount of data that can be put in a single row in BigQuery? Is there a limit on the size of a single column entry (cell)? (in bytes)
Is there a limitation when importing from Cloud Storage?
The largest single row allowed is 1 MB for CSV and 2 MB for JSON. There are no limits on individual field sizes, but obviously each field must fit within the row size limit as well.
These limits are described here.