Get MapReduce job progress after the job has finished - Hive

I'm developing an application that collects MapReduce job progress info for analysis. The first approach is to parse the log files, but that's ugly. Is there any mechanism, like a hook or plugin, that can do this?

You can probably use the YARN application API to get most of the information. See the YARN Application API documentation.
Here is an excerpt from the page:
... All query parameters for this api will filter on all applications. However the queue query parameter will only implicitly filter on unfinished applications that are currently in the given queue.
There are other YARN APIs, too, that you can use to achieve your goal. It is certainly better than scanning log files.
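As a minimal sketch (assuming the ResourceManager web service is reachable at rm-host:8088; states and applicationTypes are documented query parameters of the /ws/v1/cluster/apps endpoint), finished MapReduce applications can be listed like this from Java:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FinishedAppsQuery {
        public static void main(String[] args) throws Exception {
            // Ask the ResourceManager for all finished MapReduce applications.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://rm-host:8088/ws/v1/cluster/apps"
                            + "?states=FINISHED&applicationTypes=MAPREDUCE"))
                    .header("Accept", "application/json")
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON lists one entry per application, with fields such as
            // progress, startedTime, finishedTime, and finalStatus.
            System.out.println(response.body());
        }
    }

Since the question is about jobs that have already finished, note that the MapReduce JobHistory Server also exposes a REST API (under /ws/v1/history/mapreduce/jobs) with per-job map/reduce details.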

Related

YARN Architecture of Hadoop 2.0

From the link below on the Apache Hadoop site, I learned that
ApplicationMaster has the responsibility of negotiating appropriate
resource containers from the Scheduler (ResourceManager)
and also that
ApplicationsManager negotiating the first container for executing the
ApplicationMaster
Link : http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
So here is my confusion.
If the ApplicationMaster has the responsibility of requesting containers from the ResourceManager, then who creates the first container, and what is the process for creating the first container that executes the ApplicationMaster?
Does someone issue the request to create that first container?
What are the responsibilities of the first container? Does it only execute the ApplicationMaster, or does it also behave like the other resource containers?
Please let me know if anyone has an idea about this.
First of all, you are confusing the terms ApplicationManager and ApplicationMaster. They are not the same; have a look at my answer to understand the difference between the Application Manager and the Application Master in YARN.
Answers to your questions are given below:
The YarnClient has the responsibility of submitting the application to the ResourceManager: it sends an ApplicationSubmissionContext object to the ResourceManager, which represents all of the information the ResourceManager needs to launch the ApplicationMaster for an application.
Yes, YarnClient does that!
The first container is the ApplicationMaster; its job is to request resources (containers) from the ResourceManager and to make application-level decisions. If a sufficient number of containers (defined by the logic in your ApplicationMaster) are provided by the ResourceManager, the ApplicationMaster can go ahead and launch the application code in the containers. Furthermore, the ApplicationMaster keeps track of failed containers and either relaunches them or terminates the application (killing all other containers), again based on the logic of your ApplicationMaster.
To understand the internals of Hadoop YARN, I would suggest reading the YARN paper or, if you have more time, a book on Hadoop YARN.
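To make the submission path concrete, here is a minimal sketch using the YarnClient API (the class name com.example.MyApplicationMaster, the application name, and the resource sizes are placeholders; a real submission would also set up local resources and the environment):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitAppSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
            context.setApplicationName("my-app"); // placeholder name

            // The AM container spec: the command that launches YOUR ApplicationMaster.
            ContainerLaunchContext amSpec = Records.newRecord(ContainerLaunchContext.class);
            amSpec.setCommands(java.util.Collections.singletonList(
                    "$JAVA_HOME/bin/java com.example.MyApplicationMaster"));
            context.setAMContainerSpec(amSpec);
            context.setResource(Resource.newInstance(1024, 1)); // memory MB, vcores

            // On submission, the ApplicationsManager (inside the RM) negotiates
            // the FIRST container and launches the ApplicationMaster in it.
            yarnClient.submitApplication(context);
        }
    }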

How to get statistics data from TeamCity?

I'd like to get some statistics data from TeamCity, and I don't know whether it's better to do this using the TeamCity REST API or by querying the TeamCity database.
Can someone advise me which one is better?
FYI: I tried to use the TeamCity RESTful web services, but if I request a project containing a lot of build configurations, I get a timeout exception:
iagnostic.web.DiagnosticFilter - Request processing took too long: 192577 ms, request: GET '/app/rest/builds/?locator= ...
TeamCity version: 9.1.1(build 37059)
I finally used the REST API. I fetch the failed tests of the last build for each build configuration.
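For anyone following the same route, here is a rough sketch of that approach with the TeamCity REST API (the server URL, credentials, build configuration id, and build id are all placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class FailedTestsFetcher {
        static final HttpClient CLIENT = HttpClient.newHttpClient();
        static final String AUTH = "Basic " + Base64.getEncoder()
                .encodeToString("user:password".getBytes()); // placeholder credentials

        static String get(String url) throws Exception {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("Authorization", AUTH)
                    .header("Accept", "application/json")
                    .build();
            return CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }

        public static void main(String[] args) throws Exception {
            String base = "http://teamcity.example.com";
            // Latest build of one build configuration; count:1 keeps the
            // response small, avoiding the huge responses that caused the
            // timeout mentioned above.
            String latest = get(base + "/app/rest/builds"
                    + "?locator=buildType:(id:MyBuildConfigId),count:1");
            System.out.println(latest); // parse the build id out of this JSON
            // Failed tests of a given build id (12345 is a placeholder).
            String failed = get(base + "/app/rest/testOccurrences"
                    + "?locator=build:(id:12345),status:FAILURE");
            System.out.println(failed);
        }
    }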

How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster).
I want to fetch Spark job statistics using REST APIs in JSON format.
I am able to fetch basic statistics using the REST URL:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json. But this gives only very basic statistics.
However, I want to fetch per-executor or per-RDD statistics.
How can I do that using REST calls, and where can I find the exact REST URLs for these statistics?
The $SPARK_HOME/conf/metrics.properties file does shed some light regarding URLs, i.e.:
5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent seperately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.
but those return HTML pages, not JSON; only "/metrics/json" returns stats in JSON format.
On top of that, knowing the application_id programmatically is a challenge in itself when running in yarn-cluster mode.
I checked the REST API section of the Spark Monitoring page, but it didn't work when the Spark job runs in yarn-cluster mode. Any pointers/answers are welcome.
You should be able to access the Spark REST API using:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/
From here you can select the app-id from the list and then use the following endpoint to get information about executors, for example:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors
I verified this with my Spark Streaming application running in yarn-cluster mode.
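For example, a minimal sketch of the same call from Java (Java 11+ HttpClient; the host, port, and application id are the placeholders from the URLs above; newer Spark versions also expose the application id from inside the job as sc.applicationId):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ExecutorStats {
        public static void main(String[] args) throws Exception {
            String appId = "application_1446697245218_0091"; // placeholder
            String url = "http://yarn-cluster:8088/proxy/" + appId
                    + "/api/v1/applications/" + appId + "/executors";
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder().uri(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // Each array element describes one executor: id, rddBlocks,
            // memoryUsed, totalTasks, totalDuration, and similar fields.
            System.out.println(resp.body());
        }
    }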
I'll explain how I arrived at the JSON response using a web browser. (This is for a Spark 1.5.2 streaming application in yarn-cluster mode).
First, use the Hadoop URL to view the RUNNING applications: http://{yarn-cluster}:8088/cluster/apps/RUNNING.
Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.
Next, click on the TrackingUrl link. This goes through a proxy, and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now, append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.
You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
I was able to reconstruct the metrics in the columns of the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.
The script I used is available here. I wrote a short post describing it and tying its functionality back to the Spark codebase. This does not require any web scraping.
It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.
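As a rough illustration of the /jobs/ approach (this is a sketch, not the author's linked script; host, port, and application id are placeholders from the walkthrough above):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BatchTimings {
        public static void main(String[] args) throws Exception {
            String url = "http://yarn-proxy:20888/proxy/application_1450927949656_0021"
                    + "/api/v1/applications/application_1450927949656_0021/jobs";
            String body = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder().uri(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
            // Crude extraction of the timing fields, e.g.
            //   "submissionTime" : "2015-12-28T10:15:30.000GMT"
            // A real tool should use a JSON library instead of a regex.
            Matcher m = Pattern.compile(
                    "\"(submissionTime|completionTime)\"\\s*:\\s*\"([^\"]+)\"")
                    .matcher(body);
            while (m.find()) {
                System.out.println(m.group(1) + " = " + m.group(2));
            }
            // Processing delay per batch ~ completionTime minus submissionTime
            // of its jobs; scheduling delay ~ submissionTime minus the batch's
            // nominal start time.
        }
    }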
You'll need to scrape the HTML page to get the relevant metrics; there isn't a Spark REST endpoint that captures this info.

JMeter View Results Tree doesn't show anything in Response Data

I sent an HTTP request. There is an error in the View Results Tree, but nothing appears in the Response Data tab.
How can I see the error logs?
Don't use the JMeter GUI to run performance tests; run the test in command-line non-GUI mode instead.
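A typical non-GUI invocation looks like this (test_plan.jmx and results.jtl are placeholder file names; -n selects non-GUI mode, -t the test plan, -l the results file):

    jmeter -n -t test_plan.jmx -l results.jtl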
Don't use the View Results Tree listener for anything but test development or single-threaded debugging; it's too memory-intensive, as it basically stores all request/response details in memory. It might be the cause of the errors you're getting. As per the documentation:
View Results Tree MUST NOT BE USED during load test as it consumes a lot of resources (memory and CPU). Use it only for either functional testing or during Test Plan debugging and Validation.
If you need to store response data on error in the .jtl results file, add the following line to the user.properties file (it usually lives under the /bin folder of your JMeter installation):
jmeter.save.saveservice.response_data.on_error=true
Alternatively, you can use a sniffer tool like Wireshark to capture full request/response details for later inspection.

Find out the return code of the privileged helper run through SMJobSubmit

Is there a way to know the return code or process ID of the process that gets executed when the privileged helper tool is installed as a launch daemon and launched via SMJobSubmit()?
I have an application which, to execute some tasks in a privileged manner, uses the SMJobSubmit API as mentioned here.
Now, in order to know whether the tasks succeeded or not, I will have to do one of the following.
The best option is to get the return code of the executable that ran.
Another option would be to create a pipe between my application and launchd.
If the above two are not possible, I will have to resort to some hack like writing a file to a /tmp location and reading it from my app.
I guess SMJobSubmit internally submits the executable, with a launch daemon dictionary, to launchd, which is then responsible for its execution. So is there a way I could query launchd to find out the return code of the executable run with the label "mylabel"?
There is no way to do this directly.
SMJobSubmit is a simple wrapper around a complicated task. It also returns synchronously despite launching the task asynchronously. So, while it can give you an error if it fails to submit the job, if it successfully submits a job that then fails to run, there is no way to find that out.
So, you will have to explicitly write some code to communicate from your helper to your app, to report that it's up and running.
If you've already built some communication mechanism (signals, files, Unix or TCP sockets, JSON-RPC over HTTP, whatever), just use that.
If you're designing something from scratch, XPC may be the best answer. You can't use XPC to launch your helper (since it's privileged), but you can manually create a connection by registering a Mach service and calling xpc_connection_create_mach_service.