Can properties such as yarn.log-aggregation-enable and yarn.log-aggregation.retain-seconds be applied on a per-job basis? I would not like this enabled cluster-wide, but only for a few jobs.
Currently, there is no way to aggregate logs only for specific YARN applications. https://issues.apache.org/jira/browse/YARN-85 attempted to provide this feature, but the issue is still unresolved.
I've installed Onepanel on my EKS cluster and I want to run the CVAT tool there. I want to keep track of users' login/logout activity and timings. Is that even possible?
Onepanel isn't supported anymore, as far as I know, and it ships an outdated version of CVAT. CVAT itself has analytics functionality: https://opencv.github.io/cvat/v2.2.0/docs/manual/advanced/analytics/. It can show working time and intervals of activity.
When I use Flink to execute a job that reads from Hive, and the Hive table consists of about 1000 files, Flink sets the parallelism to 1000. The job then requests all the resources of my cluster, so other jobs' slot requests fail and those jobs fail too. Each of the 1000 files is small, so the job probably does not need to occupy all the resources. How can I tune Flink's parameters so the job uses fewer resources?
Yarn perspective
I don't recommend relying on YARN's memory management. YARN kills containers instantly when they exceed their limits, so you usually need to disable the memory checks to work around this kind of problem:
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"
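If you manage yarn-site.xml directly rather than through JSON configuration classifications like the ones above (the EMR style is an assumption about your setup), the equivalent sketch would be the following; restart the NodeManagers after changing it:

```xml
<!-- yarn-site.xml: disable virtual and physical memory checks -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
```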
Flink perspective
You can't limit resource usage per slot. You have to tune your task managers to your needs, either by reducing the number of slots per task manager or by running multiple task managers on each node. You can cap a task manager's resource usage with taskmanager.memory.process.size.
Alternatively, you can run Flink on Kubernetes and create a separate Flink cluster for each job, which gives you more flexibility: task managers are created for each job and destroyed when the job completes.
There are also Stateful Functions, which let you deploy job pipeline operators into separate containers. This allows you to manage each function's resources separately from the task managers and reduces the pressure on them.
Flink also supports Reactive Mode, which can reduce pressure on workers by scaling operators up and down automatically based on metrics such as CPU usage.
You will need to explore these features and find the best solution for your needs.
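As a concrete starting point, a flink-conf.yaml along these lines caps each task manager and lowers the default parallelism so one job cannot take every slot (the values are illustrative, not recommendations):

```yaml
# Cap total memory per task manager process
taskmanager.memory.process.size: 4096m
# Fewer slots per task manager = fewer tasks competing in one JVM
taskmanager.numberOfTaskSlots: 4
# Default parallelism used when the job does not set its own
parallelism.default: 8
```

For the Hive source specifically, newer Flink versions infer source parallelism from the number of files (with a default cap of 1000, which matches the parallelism you are seeing); depending on your version, options such as table.exec.hive.infer-source-parallelism and table.exec.hive.infer-source-parallelism.max on the table configuration let you disable or cap that inference.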
I am building a system that will consist of several instances, each running our Optaplanner implementation. These instances will monitor a common queue for incoming jobs. I don't want an instance that is already busy to take the job, so I want to check the number of ongoing jobs in the solver manager.
In the debugger, it looks like the solverManager has some state that could help me check that (problemIdToSolverJobMap.size() < parallelSolverCount would work, for instance), but these fields are private and not accessible to me.
How do I in the most robust way check the status of the solver manager as a whole, not for a specific job?
That would be useful indeed. This is an API gap, clearly. Please create a jira.
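Until such an API exists, one workaround is to count in-flight jobs yourself around each submission. This is a minimal stdlib-only sketch (the class name and the wiring to SolverManager are hypothetical, not OptaPlanner API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: tracks how many jobs this instance has submitted
// but not yet finished, so a queue poller can skip taking new work
// while all parallel solvers are busy.
class SolverLoadTracker {
    private final int parallelSolverCount;
    private final AtomicInteger ongoingJobs = new AtomicInteger(0);

    SolverLoadTracker(int parallelSolverCount) {
        this.parallelSolverCount = parallelSolverCount;
    }

    /** Reserves a slot and returns true if capacity is available. */
    boolean tryAcquire() {
        while (true) {
            int current = ongoingJobs.get();
            if (current >= parallelSolverCount) {
                return false;
            }
            if (ongoingJobs.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases a slot once a job finishes or fails. */
    void release() {
        ongoingJobs.decrementAndGet();
    }

    int ongoingJobs() {
        return ongoingJobs.get();
    }
}
```

Call tryAcquire() before submitting a job, and call release() in both the final-best-solution consumer and the exception handler so the count never leaks when a solve fails.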
I have an application running on a Linux EC2 instance and I would like to set up the CloudWatch Agent.
I would like to know to what extent the CloudWatch Agent uses CPU/memory/disk in order to collect its information.
It's a minor concern, but I would still like to know whether the Agent will affect the instance's performance (is the impact minimal?).
Thanks in advance!
Golan
Anything that runs on a computer would impact performance. It is always a trade-off between running some code and the benefit that the code provides.
The Agent only collects data at regular intervals, so it should not have a large impact on the system.
I suggest you install CloudWatch Agent and measure the impact yourself.
Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
Google's cron service is configured in a static file, and their apparent answer is to use it to poke a roll-your-own service backed by Pub/Sub and a data store. I'm looking for Quartz-like functionality, consumable by App Engine, which can be managed and invoked via API, as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like) or rolling a custom solution. It should support 50 million simultaneous jobs per day with retry/recoverability and per-tenant dynamic scheduling.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing AppEngine based project:
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context even simpler solutions are possible. For a GAE context check out, for example, How to schedule repeated jobs or tasks from user parameters in Google App Engine?.
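The roll-your-own pattern usually boils down to: a static cron entry hits an endpoint every minute, the handler selects jobs from the data store whose next run time has arrived, and pushes each one to a task queue or Pub/Sub topic. A minimal sketch of the due-job selection step (class and field names are hypothetical; the data-store and queue wiring is omitted):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: the cron-triggered handler loads candidate jobs,
// filters the ones that are due, and would then enqueue each for a worker.
class DueJobPoller {

    static final class Job {
        final String id;
        final Instant nextRun;

        Job(String id, Instant nextRun) {
            this.id = id;
            this.nextRun = nextRun;
        }
    }

    /** Returns the jobs whose next run time is at or before 'now'. */
    static List<Job> selectDue(Collection<Job> jobs, Instant now) {
        List<Job> due = new ArrayList<>();
        for (Job job : jobs) {
            if (!job.nextRun.isAfter(now)) {
                due.add(job);
            }
        }
        return due;
    }
}
```

Per-tenant dynamic scheduling then reduces to CRUD on the stored jobs (exposed via your own API), and retry/recoverability comes from the task queue's delivery guarantees rather than from the cron entry itself.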