Long running chef-client executions - automation

I'm using an open-source Chef server managing about 150 nodes.
The Analytics/Reporting module is not enabled on the Chef server due to resource constraints.
chef-client runs on all the nodes every 30 minutes.
How can I find out how long each chef-client run takes to complete?
I'm trying to find the nodes that are slowest to complete their chef-client runs.

Chef Server doesn't store this information. You'll need to manage it yourself, possibly using a handler as linked above in the comments. A simple option would be to write a report handler that stores the duration of the last run as a node attribute, but the sky is the limit. If you want something to help debug long runs once you find them, check out my poise-profiler cookbook.
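For illustration, a minimal sketch of such a handler (Ruby; the class name LastRunDuration and the last_chef_run attribute are hypothetical, and it assumes the handler is registered via client.rb or the chef_handler cookbook):

class LastRunDuration < Chef::Handler
  def report
    # elapsed_time and node are provided by Chef::RunStatus
    node.normal['last_chef_run']['duration_seconds'] = run_status.elapsed_time
    node.normal['last_chef_run']['success'] = run_status.success?
    # Report handlers run after the node has already been saved,
    # so persist the new attributes back to the Chef server explicitly.
    node.save
  end
end

You could then pull that attribute across all nodes with knife search or the node API to spot the slowest runs.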

Related

Handling cache warm-up with twisted and systemd

I have a simple Twisted application which I run using a systemd service that executes a script, which in turn executes a .tac file.
The application is structured as a JSON-RPC endpoint (fastjsonrpc), built into a t.w.r.Resource, which is in a t.w.s.Site, served by a t.a.i.TCPServer, and the whole thing packed into a t.a.Application. This works fine.
Where I do run into trouble is when I try to warm up caches at startup. This warm-up process is pretty slow (~300 seconds) and makes systemd time out and kill the process. Increasing the timeout is not really a viable option, since I wouldn't want this to block system boot.
Analogous code is used in a separate stack running on Flask from within Apache and WSGI. That server starts itself off and lets systemd continue while it takes its time building the caches. This behaviour is fine for me.
I've tried calling the warmup function using the following within the setup function of the t.w.r.Resource:
reactor.callLater(1, ep.warmup, None)
I've not yet tried using this from within systemd, and have been testing it from twistd directly on the command line. The server does work as expected, however it no longer responds to SIGINT (^C). Removing the callLater is all that's needed to let the server respond to SIGINT.
If the warmup function is called directly (not by callLater, i.e., the arrangement which makes systemd give up while waiting for warm up to complete), the resulting server also continues to respond to SIGINT.
Is there a better / good way to handle this sort of long-running warmup code?
Why would twistd / the reactor not respond to SIGINT? Am I missing something here?
Twisted is single-threaded. It sounds like your cache warm-up code is blocking the reactor for those 300 seconds. One easy way to fix this would be to use deferToThread to let it run without blocking the reactor.
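A minimal sketch of that approach (it reuses ep.warmup from your snippet and assumes the warmup code is safe to run off the main reactor thread, since deferToThread moves it into the reactor's thread pool):

from twisted.internet import reactor
from twisted.internet.threads import deferToThread
from twisted.python import log

def start_warmup():
    # Run the blocking warmup in the reactor's thread pool so the reactor
    # (and its SIGINT handling) stays responsive while the caches build.
    d = deferToThread(ep.warmup, None)
    d.addCallback(lambda _: log.msg("cache warmup finished"))
    d.addErrback(log.err)

# Kick it off once the reactor is running instead of blocking setup.
reactor.callWhenRunning(start_warmup)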

How to find whether a chef-client run was successful or not

I want to perform Chef operations using the API from my own program (Java). I would like to know whether my chef-client run was successful or not.
What is the best way to find this? Does Chef maintain any attribute or store the status of the most recent chef-client run?
You can get the timestamp of the last successful run from the node data (the key is ohai_time), but that's about it for vanilla Chef. More likely what you want is information about specific runs, which you could get from the Reporting system (part of the Premium add-ons) or by making a custom report/error handler to ship the data to your own system.
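For example, ohai_time can be read with knife (node1 here is just a placeholder node name; the value is a Unix timestamp):

knife node show node1 -a ohai_time
knife search node "name:node1" -a ohai_time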

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters are in standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to finish this:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark master URL to spark://local1:7077 and spark://remote1:7077, and then run it from IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I ran into a problem:
When I use the local cluster, everything goes well. Running the code from IntelliJ IDEA or using spark-submit both submit the job to the cluster and finish it.
But when I use the remote cluster, I get a warning in the log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It says sufficient resources, not sufficient memory!
And this message keeps printing, with no further progress. Both spark-submit and running the code from IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to the remote cluster?
If so, what configuration does it need?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my scenario is different. When I run my code in IntelliJ IDEA and set the Spark master to the local virtual machine cluster, it works. But with the remote cluster I just get the Initial job has not accepted any resources;... warning instead.
I want to know whether a security policy or firewall could cause this?
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here is about a portion of your post, specifically the warning:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IntelliJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
do not properly match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the parameters you requested are actually available on the cluster.
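When launching from IntelliJ rather than through spark-submit, the equivalent parameters can also be set directly on the SparkConf. A rough sketch (the resource values, the remote1 master from your question, the laptop IP, and the input path are placeholders that must match what your cluster actually offers):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word-count")
      .setMaster("spark://remote1:7077")
      // Request no more than the remote workers can provide; oversized
      // requests are a common cause of the "has not accepted any resources" warning.
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "2")
      // The executors must be able to connect back to the driver on your laptop.
      .set("spark.driver.host", "your-laptop-ip")

    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}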

knife status to list node where chef is failing

I was looking for a way to list nodes where chef-client is failing (only failing, NOT stopped).
I checked the knife status documentation, but didn't find anything useful.
Does anyone know a better way to do this?
Thanks
The knife lastrun plugin provides a chef handler that records the time and status of the most recent chef run. It will also store the stack trace of any failed chef run. This is a useful plugin when running large numbers of chef clients.

In YARN, what is the difference between a managed and an unmanaged ApplicationMaster?

I'm experimenting with the Distributed Shell example in YARN 2.2 and am hoping that someone can clarify what the difference between a managed and an unmanaged ApplicationMaster (AM) is.
For example, the following lines appear in the client code:
// unmanaged AM
appContext.setUnmanagedAM(true);
but I am unable to find documentation explaining the difference this line makes to the execution behaviour.
Many thanks.
setUnmanagedAM(true) is used for debugging purposes, i.e. it runs the ApplicationMaster in local mode and does not submit it to the cluster, so it is easier to step into the code and debug.
You can see it in use in the hadoop-yarn-applications-unmanaged-am-launcher.jar that ships with YARN.
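In client code the flag sits on the ApplicationSubmissionContext. A trimmed sketch of the client-side flow (not the full UnmanagedAMLauncher; after submission the client itself is responsible for starting the AM process, e.g. locally under a debugger):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class UnmanagedAmExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("unmanaged-am-demo");

        // Unmanaged: the RM will not allocate a container for the AM;
        // the client launches the AM process itself once the app is accepted.
        appContext.setUnmanagedAM(true);

        yarnClient.submitApplication(appContext);
        yarnClient.stop();
    }
}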
Check the respective JIRA tickets: JIRA-420 and JIRA-419 (client side)
Currently, the RM itself manages the AM by allocating a container for it, negotiating the launch on the NodeManager, and managing the AM lifecycle. Thereafter, the AM negotiates resources with the RM and launches tasks to do the real work.
It would be a useful improvement to enhance this model by allowing the AM to be launched independently by the client without requiring the RM. These AMs would be launched on a gateway machine that can talk to the cluster. This would open up new use cases, such as the following:
1) Easy debugging of the AM, especially during initial development. Having the AM launched on an arbitrary cluster node makes it hard to look at logs or attach a debugger to the AM. If it can be launched locally, these tasks become easier.
2) Running AMs that need special privileges that may not be available on machines managed by the NodeManager.
Blog post with more implementation details on unmanaged AM: click-me
Example of how Impala manages its resources with the help of unmanaged applications: Llama