Calculate and control number of mappers used by Hive query - hive

I have a Hive table t1 which has 104 files. Of the 104 files, one is 61 MB and the remaining 103 are less than 1 MB each. When I execute the query
select count(*) from t1
29 mappers are executed along with 1 reducer. I'm trying to figure out why there are 29 mappers, and how I can reduce that number. I have tried the following settings:
mapreduce.input.fileinputformat.split.maxsize=256MB
mapreduce.input.fileinputformat.split.maxsize=1KB
Thanks

Try setting the number of mappers using the option below:
set mapred.map.tasks = 20;
Also check the total size of your data:
hdfs dfs -du -s -h /apps/hive/warehouse//
Each block is processed independently, and each mapper can process multiple blocks, depending on the number of mappers you set.
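In practice the split size matters more than mapred.map.tasks, which is only a hint to the framework. Here is a minimal sketch for shrinking the mapper count by packing the small files into larger splits (the byte values are illustrative, not tuned for your cluster; note these properties take plain byte counts, so strings like 256MB or 1kb will not do what you expect):
-- combine many small files into each split
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- upper bound per split: 256 MB, expressed in bytes
set mapreduce.input.fileinputformat.split.maxsize=268435456;
-- lower bound per split: 128 MB, so tiny files get grouped together
set mapreduce.input.fileinputformat.split.minsize=134217728;
select count(*) from t1;
With settings like these, your data (at most ~160 MB in total) fits in a single 256 MB split, so you would expect a handful of mappers instead of 29.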

Related

Empty Bucket in HIVE

There is 65 GB of data and I have created 40 buckets in Hive, but after loading the data I found that 5 buckets remain empty. What could be the possible reasons for these 5 empty buckets?
Without knowing how your data is inserted, my guess is that you are using a Hive version below 2.x and hive.enforce.bucketing is not set to true, or that you didn't explicitly add a CLUSTER BY clause when the data was inserted.
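A minimal sketch of a load that respects bucketing on Hive 1.x (table and column names are hypothetical; on Hive 2.x bucketing is always enforced and the first set is unnecessary):
-- Hive 1.x only: make inserts honor the table's bucket definition
set hive.enforce.bucketing = true;
-- bucketed_t is assumed to be CLUSTERED BY (id) INTO 40 BUCKETS
insert overwrite table bucketed_t
select id, payload from staging_t;
-- without enforcement you must route the rows yourself:
-- insert overwrite table bucketed_t
-- select id, payload from staging_t
-- cluster by id;
If neither mechanism was in place during the load, rows simply land in however many files the writers produced, and some buckets can stay empty.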

Hive Distinct Query takes time when we have more files

Table Structure -
hive> desc table1;
OK
col1 string
col2 string
col3 string
col4 bigint
col5 string
Time taken: 0.454 seconds, Fetched: 5 row(s);
Number of underlying files -
[user@localhost ~]$ hadoop fs -ls /user/hive/warehouse/database.db/table | wc -l
58822
[user@localhost ~]$
Distinct Query - select distinct concat(col1,'~',col2,'~',col3) from vn_req_tab;
Total records - ~2M. The above query runs for 8 hours.
What is causing the issue, and how do I debug this query?
You have a very large number of small files, and this is the main problem.
When you execute the query, one mapper runs for each file, so a lot of mappers are launched, each working on a tiny piece of data (one file each). They consume unnecessary cluster resources and wait for each other to finish.
Please note that Hadoop is designed for big files with large amounts of data.
If you executed the same query against fewer, bigger files, it would perform much better.
Try setting the properties below:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=100000000;
set mapred.max.split.size=256000000;
Tweak the values of min.split.size and max.split.size to reach an optimal number of mappers.

How do you run a saved query from the BigQuery CLI and export the result to CSV?

I have a saved query in BigQuery but it's too big to export as CSV. I don't have permission to export to a new table, so is there a way to run the query from the bq CLI and export from there?
From the CLI you can't directly access your saved queries, as it's a UI-only feature as of now, but as explained here there is a feature request for that.
If you just want to run it once to get the results you can copy the query from the UI and just paste it when using bq.
Using the docs example query you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be the table below (despite the file name this is bq's pretty-printed output, not real CSV; pass --format=csv for actual comma-separated output):
+---------------+-------+
|     word      | count |
+---------------+-------+
| dispraisingly |     1 |
| praising      |     8 |
| Praising      |     4 |
| raising       |     5 |
| dispraising   |     2 |
| raisins       |     1 |
+---------------+-------+
Just replace the QUERY variable with your saved query.
Also, take into account whether you are using Standard or Legacy SQL, and set the --use_legacy_sql flag accordingly.
Reference docs here.
Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_names.usa_1910_2013 from the public dataset bigquery-public-data by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come in to play:
Regardless of what the documentation might seem to say when it comes to query download limits specifically, those limits seem to only apply to the Web UI, meaning bq is exempt from them;
At first, I was using the Cloud Shell to run this bq command, but the number of rows was so big that streaming the result set into it killed the Cloud Shell instance! I had to use a Compute Engine instance with at least the resources of an n1-standard-4 (4 vCPUs, 16 GiB RAM), and even with all of that RAM the query took me 10 minutes to finish (note that the query itself runs server-side; it's just a problem of buffering the results);
I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
You don't have to use Standard SQL, but you have to specify max_rows, because otherwise it'll only return 100 rows (100 is the current default value of this argument);
You'll still be facing the usual quotas & limits associated with BigQuery, so you might want to run this as a batch job or not; it's up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to break the query into multiple bq query commands (as sketched below) in order to not hit this size limit. If you want a public table that's big enough to hit this limitation during queries, try the samples.wikipedia one from the bigquery-public-data dataset.
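One hedged way to do that split, assuming your table has a stable key column (name below comes from the example table; adapt it to your schema), is to shard the rows deterministically and run each shard as its own bq query --format=csv call, concatenating the outputs afterwards:
-- Standard SQL (run with --use_legacy_sql=false); one invocation per shard.
-- With 4 shards, repeat with = 0, 1, 2 and 3 so every row lands in exactly one CSV.
SELECT *
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE MOD(ABS(FARM_FINGERPRINT(name)), 4) = 0;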
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.

Increasing Spark Read and Parquet Conversion Performance for Gzipped Text File

Use case:
A> Have Text Gzipped files in AWS s3 location
B> Hive Table created on top of the file, to access the data from the file as Table
C> Using Spark Dataframe to read the table and converting into Parquet Data with Snappy Compression
D> Number of fields in the table is 25, including 2 partition columns. The data type is String for all fields except two, which are Decimal.
Used following Spark Option: --executor-memory 37G --executor-cores 5 --num-executors 20
Cluster Size - 10 Data Nodes of type r3.8xLarge
I found that the number of vCores used in AWS EMR is always equal to the number of files, probably because gzip files are not splittable. The gzipped files come from a different system, and each file is around 8 GB.
The total time taken is more than 2 hours for the Parquet conversion of 6 files with a total size of 29.8 GB.
Is there a way to improve the performance via Spark, using version 2.0.2?
Code Snippet:
val srcDF = spark.sql(stgQuery)
srcDF.write.partitionBy("data_date","batch_number").options(Map("compression"->"snappy","spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"->"2","spark.speculation"->"false")).mode(SaveMode.Overwrite).parquet(finalPath)
It doesn't matter how many nodes you ask for, or how many cores there are: if you have 6 gzip files, only six threads can be assigned to read them. Try to do one of the following:
save in a splittable format (e.g. snappy-compressed Parquet or ORC; snappy inside a container format is splittable, while a gzipped text file is not)
get the source to save their data as many smaller files
do some incremental conversion into a new format as you go along (e.g. a single spark-streaming core polling for new gzip files, then saving them elsewhere as snappy files); maybe try AWS Lambda as the trigger for this, to save dedicating a single VM to the task. A shuffle-based mitigation is sketched below.
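If the inputs can't be changed, a partial mitigation is to force a shuffle right after the read: the read and decompression stay at one task per gzip file, but the Parquet encoding and write then fan out across the whole cluster. A minimal Spark SQL sketch, assuming a staging table stg_table over the gzipped files (table, column and partition names are hypothetical, and data_date/batch_number are assumed to be the last two columns of stg_table so the dynamic-partition insert lines up):
-- allow dynamic partition values to come from the select
set hive.exec.dynamic.partition.mode=nonstrict;
-- number of tasks the post-read work fans out to
set spark.sql.shuffle.partitions=200;
-- distribute by forces a shuffle, spreading the expensive Parquet write
insert overwrite table parquet_table partition (data_date, batch_number)
select * from stg_table
distribute by pmod(hash(col1), 200);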

Spark SQL: Number of partitions being generated seems weird

I have a very simple Hive table with the below structure.
CREATE EXTERNAL TABLE table1(
col1 STRING,
col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://path/';
The directory this table is being pointed to has just ONE file of size 51 KB.
From the pyspark shell (with all default values):
df = sparksession.sql("SELECT * from table1")
df.rdd.getNumPartitions()
The number of partitions being returned seems weird: sometimes it returns 64 and sometimes 81.
My expectation was to see 1 or 2 partitions at most. Any thoughts on why I see that many partitions?
Thanks.
The number of partitions you see on read is not determined by the file alone. For a Hive text table, Spark computes input splits using, among other things, spark.default.parallelism (which scales with the number of executor cores registered at that moment), and since a text file is splittable it can happily carve a single 51 KB file into dozens of tiny splits. That is most likely why the count bounces between 64 and 81. If you need an exact number of partitions, apply repartition(n) (full shuffle, gives exactly n) or coalesce(n) (no shuffle, can only reduce the count) to the DataFrame after reading.
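A hedged illustration in Spark SQL (the COALESCE/REPARTITION hints require Spark 2.4+; on older versions call coalesce(1) or repartition(n) on the DataFrame itself):
-- collapse the tiny scan into a single partition, no shuffle
SELECT /*+ COALESCE(1) */ col1, col2 FROM table1;
-- or pick an exact count via a shuffle
SELECT /*+ REPARTITION(2) */ col1, col2 FROM table1;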
Hope this explanation solves your query.