How to submit code to a remote Spark cluster from IntelliJ IDEA - intellij-idea

I have two clusters, one in a local virtual machine and the other in a remote cloud. Both clusters run in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
What I want to do:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark Master URL to spark://local1:7077 or spark://remote1:7077, then run the code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit the job.
But I ran into a problem:
When I use the local cluster, everything goes well. Both running the code in IntelliJ IDEA and using spark-submit submit the job to the cluster, and the job finishes.
But when I use the remote cluster, I get this warning in the log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Note that the message says sufficient resources, not sufficient memory!
This log line keeps printing and nothing else happens. The result is the same with spark-submit and with running the code in IntelliJ IDEA.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If so, does it need any configuration?
What are the possible causes of my problem?
How can I fix it?
Thanks a lot!
Update
There is a similar question here, but I think my situation is different. When I run my code in IntelliJ IDEA with the Spark Master set to the local virtual machine cluster, it works. With the remote cluster, I get the Initial job has not accepted any resources;... warning instead.
I want to know whether a security policy or a firewall can cause this.

Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there are a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores your job requested and what was available on the cluster. Possibly, when submitting from IJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters your task requires on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
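The comparison described above can be sketched in a few lines of plain Java. This is only an illustration of the check, not Spark code: the ResourceFit class, the Worker type, and the worker names and sizes are all hypothetical, and the free cores and memory would have to be read off the cluster UI by hand. It assumes an executor can start on a worker only if that worker has at least the requested cores and memory free.

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceFit {
    /** Free capacity of one worker, as read from the cluster UI (hypothetical type). */
    static final class Worker {
        final String name;
        final int freeCores;
        final long freeMemoryMb;

        Worker(String name, int freeCores, long freeMemoryMb) {
            this.name = name;
            this.freeCores = freeCores;
            this.freeMemoryMb = freeMemoryMb;
        }
    }

    /** Returns the names of the workers that could host one executor of the requested size. */
    static List<String> workersThatFit(List<Worker> workers, int executorCores, long executorMemoryMb) {
        List<String> fits = new ArrayList<>();
        for (Worker w : workers) {
            if (w.freeCores >= executorCores && w.freeMemoryMb >= executorMemoryMb) {
                fits.add(w.name);
            }
        }
        return fits;
    }

    public static void main(String[] args) {
        // Made-up worker sizes, for illustration only.
        List<Worker> workers = new ArrayList<>();
        workers.add(new Worker("worker-a", 4, 6 * 1024));
        workers.add(new Worker("worker-b", 8, 16 * 1024));

        // Request matching the settings above: spark.executor.cores 8, spark.executor.memory 8g.
        System.out.println(workersThatFit(workers, 8, 8 * 1024)); // prints [worker-b]
    }
}
```

If the returned list is empty, no worker can satisfy the request, which is the situation the warning describes; lowering the requested cores or memory is then the first thing to try.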

Related

How to set a specific port for single-user Jupyterhub server REST API calls?

I have set up Spark SQL on Jupyterhub using the Apache Toree SQL kernel. I wrote a Python function that updates Spark configuration options in the kernel.json file, so that my team can change the configuration based on their queries and cluster configuration. But after running the Python function I have to shut down the running notebook and re-open it, or restart the kernel. This forces the Toree kernel to re-read the JSON file and pick up the new configuration.
I thought of implementing this kernel shutdown and restart programmatically. I found the Jupyterhub REST API documentation and am able to implement it by invoking the relevant APIs. The problem is that the single-user server API port is set randomly by the Spawner object of Jupyterhub, and it changes every time I spin up a cluster. I want the port to be fixed before launching the Jupyterhub service.
Here is a solution I tried based on Jupyterhub docs:
sudo echo "c.Spawner.port = 35289
c.Spawner.ip = '127.0.0.1'" >> /etc/jupyterhub/jupyterhub_config.py
But this did not work as the port was again set by the Spawner randomly. I think there is a way to fix this. Any help on this would be greatly appreciated. Thanks

Saving VisualVM information as data

Using VisualVM, I'd like to obtain this as data, without image processing algorithms being applied... How can I do that? I think this won't come out of a snapshot.
I am not sure how VisualVM and jVisualVM vary, the naming is sure confusing, but I'm running the Oracle supplied one (Version 1.7.0_80 (Build 150109))
Thanks!
You can use the Tracer plugin with various probes. Tracer can export data in CSV, HTML or XML.
All this information is available through JMX. That's how VisualVM gets the information, and you can use the same technology to get it too. First install the VisualVM-MBeans plugin from the Tools menu. This will add another tab titled MBeans where you can see all the available data for your application. You will find the graphed data under java.lang.Memory and java.lang.OperatingSystem.
If you're trying to check information for your own process, it's as simple as calling ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage() and ManagementFactory.getMemoryMXBean().getHeapMemoryUsage(). There are more, but these should get you started.
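As a minimal sketch of that in-process approach, the class below samples the two MXBeans mentioned above and prints the values as CSV rows. The class name JvmSample, the column names, the three-sample loop and the one-second interval are arbitrary choices made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmSample {
    /** Builds one CSV row: timestamp (ms), used heap (bytes), committed heap (bytes), load average. */
    static String sampleCsvRow() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getSystemLoadAverage() returns a negative value on platforms where it is unavailable.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return System.currentTimeMillis() + "," + heap.getUsed() + "," + heap.getCommitted() + "," + load;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timestamp_ms,heap_used_bytes,heap_committed_bytes,load_avg");
        // Take three samples, one second apart.
        for (int i = 0; i < 3; i++) {
            System.out.println(sampleCsvRow());
            Thread.sleep(1000);
        }
    }
}
```

Redirecting the output to a file gives data that can be opened in a spreadsheet or plotted, without going through a VisualVM snapshot.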
To get precise CPU usage see: Using OperatingSystemMXBean to get CPU usage
If you want to get information on another process, you'd need some more code. There is a complete answer on Accessing a remote MBean server but basically:
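import java.lang.management.*;
import javax.management.MBeanServerConnection;
import javax.management.remote.*;

// Replace <addr> and <port> with the host and JMX port of the target JVM.
// Not tested, might not work.
JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://<addr>:<port>/jmxrmi");
JMXConnector jmxConnector = JMXConnectorFactory.connect(url);
MBeanServerConnection connection = jmxConnector.getMBeanServerConnection();
// Proxy for the remote JVM's OperatingSystemMXBean.
OperatingSystemMXBean bean = ManagementFactory.getPlatformMXBean(connection, OperatingSystemMXBean.class);
double load = bean.getSystemLoadAverage();
jmxConnector.close();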
You will also have to start your Java process with exposed JMX as explained on How to activate JMX on my JVM for access with jconsole? but basically:
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
There is also a way to enumerate Java processes running on the local machine and even connect to processes that don't have JMX enabled (though you get less data). If that's what you're looking for, VisualVM Source Code will be a good place to start.
To answer your other question about naming:
VisualVM is an open-source project hosted at visualvm.java.net, and Java VisualVM is a stable version of VisualVM with Oracle branding and other small changes. Java VisualVM is distributed with the JDK. There is a table where you can find which VisualVM release is the basis for Java VisualVM in each JDK update.

Where are Ambari Macros set

In Knox config file in Ambari we have defined:
<url>http://{{namenode_host}}:{{namenode_http_port}}/webhdfs</url>
The problem is we have 2 namenodes, one active and one passive for high availability. Our active namenode01 failed so namenode02 became active.
This caused problems for a lot of scripts, as they were hardcoded to point to namenode01. So we used a command to fail over from namenode02 back to namenode01 from a terminal, not from Ambari.
Now, the macro {{namenode_host}} is defined as namenode02 and not namenode01.
So, where is {{namenode_host}} defined?
Or, do we need to failover namenode01 to namenode02, then failover again to namenode01 using Ambari to update the macro?
If we need to failover the namenode using Ambari, I'm assuming we need to select the "Restart" option? There isn't a direct failover command.
See issue here:
https://issues.apache.org/jira/browse/AMBARI-12763
This was committed to Ambari to support HA mode for Knox. However if you're still looking for the location take a look at the file that's edited in the patch. That file is the place where the macros are set. You'll have to find it on your local machine though.
Should be something like params_linux.py

RDO unable to boot VM with disk size specified

I have packstack-allinone setup on my RHEL7.1 trial for Juno release.
I am facing a problem while launching a VM (for example, cirros) with a disk size specified in the flavor. If the disk size is 0 GB the VMs launch, but not for flavors with larger disk sizes.
I also observe that when I do this, the openstack-nova-compute service goes down. I saw this when I checked with nova-manage service list, where nova-compute is shown as XXX, so I have to restart the service every time I try this scenario. The compute logs don't show any error; they just get stuck at "Creating image".
Is there any filesystem setting that I have missed configuring? I am new to this, so please help.
PS: I run all commands with "root" user.
The problem was with ESXi. ESXi needs to be version 5.5 to support RHEL 7.x. Since mine was 5.1, it only supported RHEL 6.x.
After upgrading ESXi from 5.1 to 5.5, it worked fine.

In YARN what is the difference between a managed and an unmanaged Application Manager

I'm experimenting with the Distributed Shell example in YARN 2.2 and am hoping that someone can clarify what the difference between a managed and an unmanaged application manager is.
For example the following lines appear in the client code
// unmanaged AM
appContext.setUnmanagedAM(true);
but I am unable to find documentation explaining the difference this line makes to the execution behaviour.
Many thanks.
setUnmanagedAM(true) is used for debugging purposes, i.e. it runs an application manager in local mode and does not submit it to a cluster, so it is easier to step into the code and debug.
You can see it in use in the hadoop-yarn-applications-unmanaged-am-launcher.jar that ships with YARN.
Check the respective JIRA tickets: JIRA-420 and JIRA-419 (client side)
Currently, the RM itself manages the AM by allocating a container for it and negotiating the launch on the NodeManager and manages the AM lifecycle. Thereafter, the AM negotiates resources with the RM and launches tasks to do the real work.
It would be a useful improvement to enhance this model by allowing the AM to be launched independently by the client without requiring the RM. These AM's would be launched on a gateway machine that can talk to the cluster. This would open up new use cases such as the following
1) Easy debugging of AM, especially during initial development. Having the AM launched on an arbitrary cluster node makes it hard to look at logs or attach a debugger to the AM. If it can be launched locally then these tasks would be easier.
2) Running AM's that need special privileges that may not be available on machines managed by the NodeManager
Blog post with more implementation details on unmanaged AM: click-me
Example of how Impala manages its resources with the help of unmanaged applications: Llama