how to reduce the number of containers in the query - hive

I have a query using to much containers and to much memory. (97% of the memory used).
Is there a way to set the number of containers used in the query and limit the max memory?
The query is running on Tez.
Thanks in advance

Controlling the number of Mappers:
The number of mappers depends on various factors such as how the data is distributed among nodes, input format, execution engine and configuration params. See also How initial task parallelism works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
Increase these figures to reduce the number of mappers running.
Also Mappers are running on data nodes where the data is located, that is why manually controlling the number of mappers is not an easy task, not always possible to combine input.
Controlling the number of Reducers:
The number of reducers determined according to
mapreduce.job.reduces
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
Simply set hive.exec.reducers.max=<number> to limit the number of reducers running.
If you want to increase reducers parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
Memory settings
set tez.am.resource.memory.mb=8192;
set tez.am.java.opts=-Xmx6144m;
set tez.reduce.memory.mb=6144;
set hive.tez.container.size=9216;
set hive.tez.java.opts=-Xmx6144m;
The default settings mean that the actual Tez task will use the mapper's memory setting:
hive.tez.container.size = mapreduce.map.memory.mb
hive.tez.java.opts = mapreduce.map.java.opts
Read this for more details: Demystify Apache Tez Memory Tuning - Step by Step
I would suggest to optimize query first. Use map-joins if possible, use vectorising execution, add distribute by partitin key if you are writing partitioned table to reduce memory consumption on reducers and write good sql of course.

Related

Does dataframe.repartition(x) makes execution faster

I have a Spark script that reads data from amazon S3 and then writes in another bucket usion parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affects the execution time? Should I increase it so the script runs faster?
Second question: Would it be better if I cache my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case. Since you read data from non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
In such case repartitioning would make sense, if you performed any actual processing on this data.
However if you just write to distributed file system repartitioning will simply double the cost - you have to send data to other nodes first (that involves serialization, deserialization, network transfer, write to disk) and then still write to distributed file system.
There are of course edge cases when this makes sense. If network connecting your cluster is much faster than network connection your cluster to S3 nodes, effective latency might be a bit lower.
As of caching ‒ there is no value in caching here at all. Caching Dataset is expensive, and makes sense only if persisted data is reused.
Answer 1 :- Repartition of 25 or more or less it depends on how much data you have and no. of executors you provided. If your Spark code run in the cluster with more than one executor and it is not repartitioned then repartitioning will speedy to writing parallel your data.
Answer 2 :- There is no need to cache df before the last line because you are using only single action in your code. If you will perform multiple actions on your DF and don't want it will recalculate as the number of actions then you will Cache it.
The thing here is that Spark can parallelize writing to a certain point since one file can't be written by multiple executors at the same time.
Repartition helps you in this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions you will increase the number of written files hence speeding up the execution. This comes with a price because of the reading time will increase with the number of files to be read.
The limit is the number of executors you are running your job with, e.g. if you are running with 25 executors then setting repartition to 26 will not help you because to write the 26th partition one of the previous 25 would have to be finished.
For the other question, I don't think .cache() will help you because Spark is lazy, maybe this article can help you further.

BigQuery GUI - CPU Resource Limit

Is there a way to set the CPU Resource Limit on the BigQuery with Python and GUI?
I'm getting an error of:
Query exceeded resource limits. 2147706.163729571 CPU seconds were used, and this query must use less than 46300.0 CPU seconds.
Looking at the BigQuery's Python reference page: http://google-cloud-python.readthedocs.io/en/latest/bigquery/reference.html
It looks like there's:
1. maximum_billing_tier
2. maximum_bytes_billed
That can be set, but there is no CPU second options.
You cannot set anymore maximum_billing_tier - it is obsolete and as soon as you are lower than tier 100 you are billed as if it were 1. if you exceed 100 - query just failes.
As of CPU - check concept of slots
Maximum concurrent slots per project for on-demand pricing — 2,000
The default number of slots for on-demand queries is shared among all queries in a single project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery Using Stackdriver. If you need more than 2,000 slots, contact your sales representative to discuss whether flat-rate pricing meets your needs.
See more at https://cloud.google.com/bigquery/quotas#query_jobs

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, takes a long time to load into Spark and also to query from. I think Spark partitions the data into too many nodes. I want to be able to set that manually. I know when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. Also, I have seen repartition(), but I am not sure if it can solve my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as "partition key" , where a column is selected and indexed to speed up query (that is not what i want). Definition two: (this is where my concern is) suppose you have a data set, Spark decides it is going to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may further slow down the process. How can i set that value
By default it partitions into 200 sets. You can change it by using set command in sql context sqlContext.sql("set spark.sql.shuffle.partitions=10");. However you need to set it with caution based up on your data characteristics.
You can call repartition() on dataframe for setting partitions. You can even set spark.sql.shuffle.partitions this property after creating hive context or by passing to spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
Number of "input" partitions are fixed by the File System configuration.
1 file of 1Go, with a block size of 128M will give you 10 tasks. I am not sure you can change it.
repartition can be very bad, if you have lot of input partitions this will make lot of shuffle (data traffic) between partitions.
There is no magic method, you have to try, and use the webUI to see how many tasks are generated.

datastax : Spark job fails : Removing BlockManager with no recent heart beats

Im using datastax-4.6. I have created a cassandra table and stored 2crore records. Im trying to read the data using scala. The code works fine for few records but when i try to retrieve all 2crore records it displays me follwing error.
**WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 172.20.98.17, 34224, 0) with no recent heart beats: 140948ms exceeds 45000ms
15/05/15 19:34:06 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(C15759,34224) not found**
Any help?
This problem is often tied to GC pressure
Tuning your Timeouts
Increase the spark.storage.blockManagerHeartBeatMs so that Spark waits for the GC pause to end.
SPARK-734 recommends setting -Dspark.worker.timeout=30000 -Dspark.akka.timeout=30000 -Dspark.storage.blockManagerHeartBeatMs=30000 -Dspark.akka.retry.wait=30000 -Dspark.akka.frameSize=10000
Tuning your jobs for your JVM
spark.cassandra.input.split.size - will allow you to change the level of parallelization of your cassandra reads. Bigger split sizes mean that more data will have to reside in memory at the same time.
spark.storage.memoryFraction and spark.shuffle.memoryFraction - amount of the heap that will be occupied by RDDs (as opposed to shuffle memory and spark overhead). If you aren't doing any shuffles, you could increase this value. The databricks guys say to make this similar in size to the size of your oldgen.
spark.executor.memory - Obviously this depends on your hardware. Per DataBricks you can do up to 55gb. Make sure to leave enough RAM for C* and for your OS and OS page cache. Remember that long GC pauses happen on larger heaps.
Out of curiosity, are you frequently going to be extracting your entire C* table with Spark? What's the use case?

Hadoop counters - tuning and optimization

I just wrote my first hadoop job. It processes many files and generates multipleoutput files for each input file. I am running it on a two node cluster and it takes about 10 minutes for my largest input set. Looking at the counters below, what are the optimizations I can do to make it run faster? Are there any specific indicators which one should look for in these counters-
Version: 2.0.0-mr1-cdh4.1.2
Map task Capacity:20
Reduce task Capacity:20
Avg task per node:20
We can see here that most of data reduction happens in the map phase (number of map output bytes is much less then HDFS read bytes, The same about map input records - it is much lower then map input record). We also see that a lot of CPU time spent. We also see low number of shuffling bytes
So this job is:
a) A lot of data reduction is done on Map phase.
b) The job is CPU bound.
So I think code of mapper and reducer should be optimized. I/O probably is not important for this job.