Usage of cores in Spark SQL execution

I am new to Spark SQL queries and trying to understand how they work under the hood.
I have come across the term "core" in the Spark vocabulary but am still struggling to get a hold of it.
I know that 1 core = 1 task.
My questions -
Can anyone please explain what exactly a core means?
Does the Spark UI show the number of cores currently allocated to my job? If yes, where can I see it?
If I find in the Spark UI that the number of running tasks is low, is there a way to increase the number of cores allocated to my job, so that Spark can submit more tasks and make my job run faster?
Please advise.

Yes, you are right in a way.
In Spark, tasks are distributed across executors, and on each executor the number of tasks running in parallel equals the number of cores on that executor. So a core is essentially what executes your task, and a task is the most granular unit of work that needs to be carried out:
JOB=>STAGE=>TASK
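As a rough illustration of that hierarchy (a minimal sketch assuming a spark-shell session where the spark variable already exists; the dataset and numbers are arbitrary), a single action triggers one job, each shuffle boundary splits it into stages, and each stage runs one task per partition:

import org.apache.spark.sql.functions.col

// 8 partitions => 8 tasks per stage
val df = spark.range(0, 1000000, 1, 8)
// groupBy introduces a shuffle, so the job splits into two stages
val grouped = df.groupBy((col("id") % 10).as("bucket")).count()
// count() is the action that triggers the job; watch the Jobs/Stages tabs while it runs
grouped.count()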
Yes, the Spark UI shows you the number of tasks currently running on each of your executors. You can check them under the Executors tab, which gives a very detailed view of your task allocation against the number of cores available, along with a lot of other details.
Yes, you can increase the number of cores. You can do that by passing an argument to the spark-submit command:
--executor-cores n
Here n is the number of cores you want per executor. A commonly cited rule of thumb is about 5 cores per executor.
More cores do not necessarily mean your job will run faster.
Your tasks need to be distributed evenly across all the available cores for the job to speed up.
If you provide more cores than required, they will remain idle most of the time.
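Putting those flags together, here is a hedged spark-submit sketch (the jar name and the resource numbers are placeholders to adapt to your own cluster):

spark-submit \
  --master yarn \
  --num-executors 6 \
  --executor-cores 5 \
  --executor-memory 10g \
  my-spark-job.jar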

Related

Hazelcast is using a high number of JVM threads

I am using Hazelcast in the JVM in my application, which runs as 2 replicas in Kubernetes. Hazelcast in both containers has formed a cluster and sync is working perfectly fine.
But my application has started using 20% more threads since it began using Hazelcast. On analyzing a thread dump, I found that Hazelcast accounts for that extra 20%.
Is it okay for Hazelcast to use this many threads? If it can be reduced, how can I go about it?
Hazelcast will self-size the number of threads it uses, based on the number of processors available to it.
(In Java, see Runtime.availableProcessors().)
How many does your container have allocated?
You can override the threading if you are sure it's inappropriate. Look for system properties like hazelcast.*.thread.count from here. There are many options, and it's not a casual task to simply turn the numbers up or down; if you tune them down, you risk performance becoming very poor.
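As a hedged sketch of such an override (the property names below follow the hazelcast.*.thread.count pattern and should be verified against the Hazelcast docs for your version; my-app.jar is a placeholder), thread pool sizes can be passed as JVM system properties:

java -Dhazelcast.operation.thread.count=2 \
     -Dhazelcast.io.thread.count=2 \
     -jar my-app.jar

Before tuning, it is worth printing Runtime.getRuntime().availableProcessors() inside the container: if Kubernetes CPU limits are not set, the JVM may see all the cores of the node and Hazelcast will size its pools accordingly.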

Stopping when the solution is good enough?

I successfully implemented a solver that fits my needs. However, I need to run the solver on 1500+ different "problems" at 0:00 precisely, every day. Because my web app is in Ruby, I built a Quarkus "micro-service" that takes the data, calculates a solution and returns it to my main app.
In my application.properties, I set:
quarkus.optaplanner.solver.termination.spent-limit=5s
which means each request takes ~5s to solve. But sending 1500 requests at once will saturate the CPU on my machine.
Is there a way to tell OptaPlanner to stop when the solution is good enough? (for example, if the score is stable...) That way I could maybe reduce the time from 5s to 1-2s depending on the problem.
What are your recommendations for my specific scenario?
The SolverManager will automatically queue solver jobs if too many come in, based on its parallelSolverCount configuration:
quarkus.optaplanner.solver-manager.parallel-solver-count=3
In this case, it will run 3 solvers in parallel. So if 7 datasets come in, it will solve 3 of them and the other 4 later, as the earlier solvers terminate. However, if you use moveThreadCount=2, each solver uses at least 2 CPU cores, so you're using at least 6 CPU cores.
By default, parallelSolverCount is currently set to half your CPU cores (it currently ignores moveThreadCount). In containers, it's important to use JDK 11+: the CPU count seen inside the container is often different from that of the bare-metal machine.
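As a rough worked example under those settings (assuming all 1500 requests really do arrive at once): with parallel-solver-count=3 and a 5s spent limit, the queue drains in about 1500 / 3 = 500 waves of roughly 5s each, i.e. on the order of 500 × 5s ≈ 42 minutes end to end; cutting the limit to 2s would bring that down to roughly 17 minutes, at the cost of solution quality.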
You can indeed tell the OptaPlanner Solvers to stop when the solution is good enough, for example when a certain score is attained or the score hasn't improved for some amount of time, or combinations thereof. See these OptaPlanner docs. Quarkus exposes some of these already (the rest currently still need a solverConfig.xml file); some Quarkus examples:
quarkus.optaplanner.solver.termination.spent-limit=5s
quarkus.optaplanner.solver.termination.unimproved-spent-limit=2s
quarkus.optaplanner.solver.termination.best-score-limit=0hard/-1000soft
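For terminations that are not exposed as Quarkus properties, a hedged solverConfig.xml sketch (element names taken from the OptaPlanner termination docs; verify them against your version) combining a hard time cap with an "unimproved" stop might look like this; multiple terminations are OR-combined by default, so the solver stops as soon as either limit is reached:

<solver>
  <termination>
    <!-- hard cap: never spend more than 5 seconds per problem -->
    <secondsSpentLimit>5</secondsSpentLimit>
    <!-- stop earlier if the best score has not improved for 2 seconds -->
    <unimprovedSecondsSpentLimit>2</unimprovedSecondsSpentLimit>
  </termination>
</solver>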

Dataproc set number of vcores per executor container

I'm building a Spark application which will run on Dataproc. I plan to use ephemeral clusters and spin a new one up for each execution of the application. So I basically want my job to eat up as much of the cluster resources as possible, and I have a very good idea of the requirements.
I've been playing around with turning off dynamic allocation and setting the executor instances and cores myself. Currently I'm using 6 instances with 30 cores each.
Perhaps it's more of a YARN question, but I'm finding the relationship between container vCores and my Spark executor cores a bit confusing. In the YARN application manager UI I see that 7 containers are spawned (1 driver and 6 executors), and each of these uses 1 vCore. Within Spark, however, I see that the executors themselves are using the 30 cores I specified.
So I'm curious whether the executors are really trying to do 30 tasks in parallel on what is essentially a 1-core box, or whether the vCore count displayed in the AM GUI is erroneous.
If it's the former, I'm wondering what the best way is to set this application up so that I end up with one executor per worker node and all the CPUs are used.
The vCore count displayed in the YARN GUI is erroneous; this is a not-well-documented but known issue with the capacity-scheduler, which is Dataproc's default. Notably, with the default settings on Dataproc, YARN only does resource bin-packing based on memory, not CPUs; the benefit is that this is more versatile for oversubscribing CPUs to varying degrees as desired per workload, especially if something is IO-bound, but the downside is that YARN won't be responsible for carving out CPU usage in a fixed manner.
See https://stackoverflow.com/a/43302303/3777211 for some discussion of changing to fair-scheduler to see the vcores allocation accurately represented in YARN. However, in your case there's probably no benefit to doing so; making YARN do bin-packing across both dimensions is more of a "shared multitenant cluster" issue, and only complicates the scheduling problem.
In your case, the best way to set your application up is just to ignore what YARN says about vcores; if you want just one executor per worker node, then set the executor memory size to the maximum that will fit in YARN per node, and make cores per executor equal to the total number of cores per node.
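As a hedged sketch of that setup (the numbers below are placeholders for a hypothetical worker with 32 vCPUs and roughly 120 GB of YARN memory per node; leave headroom for YARN overhead and the driver as appropriate for your machine type):

spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 6 \
  --executor-cores 30 \
  --executor-memory 100g \
  my-dataproc-job.jar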

Is there a way to force Bazel to run tests serially

By default, Bazel runs tests in a parallel fashion to speed things up. However, I have a resource (GPU) that can't handle parallel jobs due to the GPU memory limit. Is there a way to force Bazel to run tests in a serial, i.e., non-parallel way?
Thanks.
--jobs 1 will limit the number of parallel jobs Bazel runs to 1.
You can also modify the test targets and add tags = ["exclusive"] to prevent a specific test from running in parallel with anything else (see http://bazel.io/docs/test-encyclopedia.html).
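For example, a hedged BUILD sketch (the target name and source file are placeholders, and the rule kind should match your actual tests):

cc_test(
    name = "gpu_smoke_test",        # placeholder target name
    srcs = ["gpu_smoke_test.cc"],   # placeholder source file
    tags = ["exclusive"],           # run this test while nothing else runs
)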
Use --local_test_jobs=1 to only run a single test job at a time locally.
The max number of local test jobs to run concurrently. Takes an integer, or a keyword ("auto", "HOST_CPUS", "HOST_RAM"), optionally followed by an operation ([-|*]<float>), e.g. "auto", "HOST_CPUS*.5". 0 means local resources will limit the number of local test jobs to run concurrently instead. Setting this greater than the value for --jobs is ineffectual.
tags = ["exclusive"] has other complications to consider with respect to caching.
--jobs will serialize the entire build process, not just testing, so it's less than ideal.
There are 2 resources Bazel will respect limitations upon: RAM and CPU. You may hijack one (probably RAM) to represent GPU(s) as they're available to a run and required by a test. (I've stopped short of doing this for a limited hardware resource because it feels too inelegant, but I can't think of a reason it shouldn't work.)
Future releases of Bazel should support extra resources like GPUs, and releases that contain that change should support extra resource tags like "resources:GPU:1" when --local_extra_resources=gpu=1 is set. This should enable GPU tests to be bound by a limited quantity of GPUs, and to run non-exclusively and without limiting the total number of --jobs or test jobs.

Pig script minimum execution time

I'm currently learning Pig and I'm executing my scripts inside the Hortonworks Sandbox. What has been bugging me from the very beginning is that the minimum execution time for a Pig script seems to be at least 30-40 seconds. Is that because I'm using the Hortonworks Sandbox, or is it normal for Pig scripts? Is there a way to reduce the execution time, because this is really slowing down my learning progress? If this execution time is normal, can you explain what is going on and why?
PS
I've allocated 2 GB of RAM to the Hortonworks virtual machine. And just to mention, I'm currently executing only simple scripts on small data sets.
If you execute Pig in local mode (pig -x local) it'll run a lot faster, but it won't do map-reduce and won't access HDFS - it's good for learning, though!
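For example (the script name is a placeholder), you can either open an interactive local-mode Grunt shell or run a script directly in local mode:

pig -x local
pig -x local myscript.pig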
Yes, 30-40 seconds is absolutely normal for Pig, because it has a big overhead for compiling the job, launching JVMs, etc.
As stated in the other answer, you can try running in local mode. It usually takes me about 15 seconds for a simple job with input containing just a few lines of data. My Cloudera VM is allocated 4 GB of RAM, btw.