resource management on spark jobs on Yarn and spark shell jobs - hadoop-yarn

Our company has a 9-node cluster on Cloudera.
We have 41 long-running Spark Streaming jobs [YARN + cluster mode] and some regular spark-shell jobs scheduled to run at 1 pm daily.
All jobs are currently submitted as user A [with root permission].
The issue I encountered is that while all 41 Spark Streaming jobs are running, my scheduled jobs cannot obtain resources to run.
I have tried the YARN fair scheduler, but the scheduled jobs still do not run.
We expect the Spark Streaming jobs to always be running, but to reduce the resources they occupy whenever other scheduled jobs start.
Please feel free to share your suggestions or possible solutions.

Your spark streaming jobs are consuming too many resources for your scheduled jobs to get started. This is either because they're always scaled to a point that there aren't enough resources left for scheduled jobs or they aren't scaling back.
For the case where the streaming jobs aren't scaling back, you could check whether you have dynamic resource allocation enabled for your streaming jobs. One way of checking is via the spark shell using spark.sparkContext.getConf.get("spark.streaming.dynamicAllocation.enabled"). Note that this call throws NoSuchElementException when the property was never set, in which case the default (disabled) applies. If dynamic allocation is enabled, then you could look at reducing the minimum resources for those jobs.
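To make "reducing the minimum resources" concrete, here is a minimal sketch that builds the --conf arguments you might pass to spark-submit for one streaming job. It assumes DStream-based jobs and the spark.streaming.dynamicAllocation.* property names as I understand them from Spark's streaming executor allocation manager; the function name and the example numbers are hypothetical, so check the properties against your Spark version before relying on them.

```python
def streaming_allocation_conf(min_executors, max_executors):
    """Build spark-submit --conf arguments that bound a DStream job's executors.

    Assumption: property names follow spark.streaming.dynamicAllocation.*;
    verify them against the Spark version on your cluster.
    """
    if min_executors < 1 or max_executors < min_executors:
        raise ValueError("need 1 <= min_executors <= max_executors")

    settings = {
        "spark.streaming.dynamicAllocation.enabled": "true",
        "spark.streaming.dynamicAllocation.minExecutors": str(min_executors),
        "spark.streaming.dynamicAllocation.maxExecutors": str(max_executors),
        # Core dynamic allocation and the streaming variant cannot both be on.
        "spark.dynamicAllocation.enabled": "false",
    }

    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical bounds: let each streaming job shrink to 1 executor.
    print(" ".join(streaming_allocation_conf(1, 4)))
```

With 41 streaming jobs, the sum of the minimums is what stays permanently reserved, so a lower minimum per job is what frees room for the 1 pm jobs.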

Related

Flink on YARN: reading from Hive with the Table API, many Hive files cause Flink to use all resources (CPU, memory)

When I use Flink to run a job that reads from Hive, and the Hive table contains about 1000 files, Flink shows a parallelism of 1000. The job's resource requests used up all the resources of my cluster, so other jobs failed to acquire slots and then failed. Each of the 1000 files is small, so the job probably does not need all of those resources. How can I tune the Flink parameters so that the job uses fewer resources?
Yarn perspective
I don't recommend using YARN's memory management. YARN kills containers instantly when they exceed their limits. Usually you need to disable the memory checks to get around this kind of problem:
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"
Flink perspective
You can't limit the resource usage of an individual slot. Instead, you have to tune your task managers to your needs, either by reducing the number of slots or by running multiple task managers on each node. You can set a task manager's resource usage limit with taskmanager.memory.process.size.
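As a rough worked example of why the parallelism matters here, the sketch below estimates how many task managers and how much memory a job asks for. It assumes default slot sharing (the job needs as many slots as its highest parallelism) and that every task manager uses the same taskmanager.memory.process.size; the function name and the numbers are hypothetical. If the parallelism of 1000 comes from the Hive source inferring it from the file count, the Hive connector options table.exec.hive.infer-source-parallelism and table.exec.hive.infer-source-parallelism.max are probably the first thing to look at, but check them against the docs for your Flink version.

```python
import math


def task_manager_footprint(parallelism, slots_per_tm, process_size_mb):
    """Estimate the task managers and total memory a job requests.

    Assumes default slot sharing, so required slots == parallelism, and one
    uniform taskmanager.memory.process.size (given here in MB).
    """
    if parallelism < 1 or slots_per_tm < 1 or process_size_mb < 1:
        raise ValueError("all arguments must be positive")

    task_managers = math.ceil(parallelism / slots_per_tm)
    return {
        "task_managers": task_managers,
        "total_memory_mb": task_managers * process_size_mb,
    }


if __name__ == "__main__":
    # Hypothetical sizing: 4 slots and 4096 MB per task manager.
    print(task_manager_footprint(1000, 4, 4096))  # parallelism from the question
    print(task_manager_footprint(40, 4, 4096))    # a capped parallelism
```

With these made-up sizes, capping the parallelism at 40 drops the request from 250 task managers to 10, which is the kind of reduction the question is after.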
Alternatively you can use flink on kubernetes. You can create Flink clusters for each job which will give you more flexibility. It will create task managers for each job and destroy them when jobs are completed.
There is also Stateful Functions, which lets you deploy job pipeline operators into separate containers. This allows you to manage each function's resources separately from the task managers, which reduces pressure on the task managers.
Flink also supports Reactive Mode, in which a job rescales to use whatever task managers are available. Combined with an external autoscaler driven by metrics such as CPU, this can also reduce pressure on workers by scaling operators up and down automatically.
You need to explore these features and find the best solution for your needs.

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely from any machine I use Pyhive or PyODBC to connect to Hive and do things. It's not just one query. It can be a long session with intermittent queries. (The query itself has issues; will ask separately.)
Even during one single, active query, the operation does not show as a "Job" (I guess it's Yarn) on the dashboard. In contrast, when I "submit" tasks via Pyspark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, w/o a Job, the cluster may not reliably detect a Python client is "connected" to it, hence the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to accompany my Python session, and cancel/delete the job at times of my choosing? For my case, it is a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let Yarn detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now. You need to submit jobs via the Dataproc Jobs API to make them visible on the jobs UI page and to have them taken into account by the cluster TTL feature.
If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"
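Since the session in the question is driven from Python, the same keep-alive job could be submitted from the client script. This is only a sketch: it assumes the gcloud CLI is installed and authenticated on the client machine, and the function name, cluster name and region below are hypothetical placeholders.

```python
import shlex


def keepalive_job_command(cluster_name, hours, region=None):
    """Build the gcloud argv for a dummy Pig job that sleeps for `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")

    seconds = int(hours * 60 * 60)
    command = ["gcloud", "dataproc", "jobs", "submit", "pig",
               f"--cluster={cluster_name}"]
    if region:
        # Assumption: pass a region if none is set in your gcloud config.
        command.append(f"--region={region}")
    command.append(f"--execute=sh sleep {seconds}")
    return command


if __name__ == "__main__":
    # Hypothetical cluster name and region; this only prints the command.
    print(shlex.join(keepalive_job_command("my-cluster", 5, "us-central1")))
```

The returned list can be handed to subprocess.run(command, check=True) when the Python session starts, and the sleeping job can be cancelled from the jobs page or the CLI once the session is finished.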

Is there a way to reuse a single running databricks cluster in multiple mapping data flows

Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use the same running cluster in all of the data flows, instead of letting each data flow instance spin up its own cluster, which takes around 6 minutes per cluster?
Yes. Set the TTL in the Azure Integration Runtime under "Data Flow Properties" to a duration that covers the gap between data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engines (Databricks in this case), so you have to kick off a dummy data flow to warm it up.
That cluster start-up will take 5-6 mins. But now, if you use that same Azure IR in your subsequent activities, as long as they are scheduled to execute within that TTL window, ADF can grab existing VM resources to spin up the Spark clusters and marshal your data flow definition to the Spark job execution.
End-to-end that process should now take just 2 mins.
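One way to pick the TTL value is to take the largest idle gap between consecutive data flow runs on the same Azure IR. The sketch below does that arithmetic; it assumes the TTL is expressed in minutes and counted from the end of the previous data flow, and the function name and schedule are made up for illustration.

```python
import math
from datetime import datetime, timedelta


def minimum_ttl_minutes(runs):
    """Return the smallest TTL (whole minutes) that bridges every idle gap.

    `runs` is a list of (start, end) datetimes for data flow activities that
    share one Azure IR. Assumes the TTL clock starts when a data flow ends.
    """
    ordered = sorted(runs)
    if len(ordered) < 2:
        return 0

    largest_gap = timedelta(0)
    latest_end = ordered[0][1]
    for start, end in ordered[1:]:
        # Overlapping runs give a negative gap, which max() ignores.
        largest_gap = max(largest_gap, start - latest_end)
        latest_end = max(latest_end, end)
    return math.ceil(largest_gap.total_seconds() / 60)


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    schedule = [
        (day.replace(hour=1, minute=0), day.replace(hour=1, minute=10)),
        (day.replace(hour=1, minute=25), day.replace(hour=1, minute=30)),
        (day.replace(hour=1, minute=32), day.replace(hour=1, minute=40)),
    ]
    print(minimum_ttl_minutes(schedule))
```

For this hypothetical schedule the longest idle stretch is between 1:10 and 1:25, so a TTL of at least 15 minutes would keep the later runs inside the window.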

Monitoring long lasting tasks in Airflow

I've seen people using Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing they miss in Airflow is monitoring of long-lasting jobs like scraping: getting the number of pages and items scraped so far, and the number of URLs that have failed so far or were retried without success.
What are my options for monitoring the current status of long-lasting jobs? Is there something already available, or do I need to resort to external solutions like Prometheus and Grafana and instrument the Scrapy spiders myself?
We've had better luck keeping our airflow jobs short and sweet.
With long-running tasks, you risk running into queue back-ups. And we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.
In a case kind of like yours, when there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The airflow dags pull messages off of a Kafka topic and then report success/failure via a Kafka wrapper.
We end up with several overlapping airflow tasks running in "micro-batches" reading a tune-able number of messages off Kafka, with the goal of keeping each airflow task under a certain run time.
By keeping the work small in each airflow task, we can easily scale the number of messages up or down to tune the overall task run time with the overall parallelism of the airflow workers.
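A minimal sketch of that micro-batch idea, as the body of a single Airflow task, might look like the following. It deliberately leaves Kafka out: messages is any iterator and handle is any callable, both hypothetical stand-ins for the consumer and the per-message work, and the limits are made-up defaults.

```python
import time


def run_micro_batch(messages, handle, max_messages=100, max_seconds=60.0,
                    clock=time.monotonic):
    """Process messages until a count limit or a time budget is reached.

    Returns (succeeded, failed) so the caller can report both outcomes.
    The limits are checked before pulling, so no message is consumed and
    then dropped when the batch stops.
    """
    deadline = clock() + max_seconds
    iterator = iter(messages)
    succeeded, failed = [], []

    while len(succeeded) + len(failed) < max_messages and clock() < deadline:
        try:
            message = next(iterator)
        except StopIteration:
            break  # Source is drained for now.
        try:
            handle(message)
        except Exception:
            failed.append(message)  # Report it; retry policy lives elsewhere.
        else:
            succeeded.append(message)

    return succeeded, failed


if __name__ == "__main__":
    def scrape(url):
        # Hypothetical stand-in for the real scraping call.
        if "bad" in url:
            raise RuntimeError(url)

    ok, bad = run_micro_batch(iter(["a", "bad-b", "c"]), scrape, max_messages=2)
    print(len(ok), len(bad))
```

Tuning max_messages and max_seconds is what keeps each task under the target run time, and the two returned lists give the per-batch success and failure counts that the question wants to monitor.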
Seems like you could explore something similar here?
Hope that helps!

Oozie start time and submission time delay

I'm working on a workflow that has both Hive and Java actions. Very often we notice a delay of a few minutes between the Java action's start time and its job submission time. We don't see that with Hive jobs, meaning Hive jobs seem to be submitted almost immediately after they are started. The Java jobs do not do much, so they finish successfully within seconds of being submitted, but the time between start and submission seems to be very high (4-5 minutes).
We are using the fair scheduler and there are enough mapper/reducer slots available. Even if it were a resource problem, the Hive jobs should also show a delay between start and submission, but they don't! The Java jobs are very simple: they don't process any files, are basically used to call a web service, and spawn only a single mapper and no reducers, whereas the Hive jobs create hundreds of mapper/reducer tasks and still show no delay between start and submission.
We are not able to figure out why Oozie is not submitting the Java job immediately. Any ideas?