Use remote driver with Databricks Connect

When connecting to a Databricks cluster from a local IDE, I understand that only Spark-related commands are executed remotely (on the cluster). What about single-node operations such as scikit-learn or to_pandas? If these functions only use the local machine, the available resource pool would be tiny. How can I also utilize the remote driver for executing single-node operations?

This isn't possible by design of Databricks Connect: with it, the local machine is always the Spark driver, and the worker nodes of the Databricks cluster are used as Spark executors. So all local operations, like .collect, will bring data to your machine and run locally.
You may want to look at the dbx tool from Databricks Labs: it recently gained a dbx sync command that automatically syncs code changes to a Databricks repo, so you can write code in the IDE and run it in a Databricks notebook, which will use the driver of the Databricks cluster. (It won't let you interactively debug code, but at least the code gets executed in the cloud, not on your machine.)
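To make the boundary concrete, here is a minimal sketch of which parts of a Databricks Connect script run on the cluster and which run on your machine (the table and column names are hypothetical, and a configured Databricks Connect / pyspark setup is assumed):
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression  # scikit-learn always runs locally

spark = SparkSession.builder.getOrCreate()  # the local process acts as the driver

# Spark transformations/actions below are executed on the remote executors:
df = spark.table("my_database.my_table").select("x", "y")

# .toPandas() pulls the result over the network into local RAM; everything
# after this line uses only the local machine's CPU and memory:
pdf = df.toPandas()
model = LinearRegression().fit(pdf[["x"]], pdf["y"])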

Related

How can anyone access and run Selenium test scripts without having to install/run them locally?

I am looking for ways to set up a central 'hub' for Selenium at work, allowing anyone within the company to access it. (For example, Tester A writes test scripts, and Person B can run them without having to manually copy the test scripts to their local workstation.)
So far, I've only thought of installing Selenium in a VM, which would then execute tests as per normal. But if I run Selenium Grid, it would mean running VMs within a VM(?). My only concern with VMs is that they'd run slowly.
If anyone can think of a better solution or recommendation please do give me some advice. Thank you in advance.
One idea: you can create an infrastructure combining Jenkins, Selenium, and Amazon.
The following is my solution from another post.
You can do it with a grid.
First of all, you need to create a Selenium hub on an EC2 Ubuntu 14.04 AMI without a UI and link it to your Jenkins master as a slave (or directly as a master, whichever you want; command line only). Download the Selenium Server standalone JAR (be careful which version you download; if you download the Selenium 3 beta, things could change). Here you can configure the hub. You can also add the Selenium hub as a service and configure it to run automatically at server start. It's important that you open the default Selenium port (or the one you configured) so the nodes can connect to it. You can do that in the Amazon EC2 console once you have created your instance: just add a security group with an inbound TCP rule for the port you want, restricted to the IPs you want.
Then you can create a Windows Server 2012 instance (for example; that's what I did) and repeat the process: download the same Selenium version and the ChromeDriver (there is no need to download a separate Firefox driver for Selenium versions before Selenium 3). Put the Selenium command that links to the hub as a node in a text file and rename it to *.bat so it can be executed. If you want to run the .bat at startup, you can create a service with the Task Scheduler or use NSSM (https://nssm.cc/). Don't forget to add the security-group rules for this machine too!
Next, create the Jenkins server. You can use the Selenium hub machine as the Jenkins master or as a slave.
The last step is configuring a job to run on the Jenkins/Selenium machine. This job needs to be linked to your code repository (Git, Mercurial...). Using the parameterized build plugin for Jenkins, you can tell that job to pull the revision you want (so every developer can pull a revision with the new changes and new tests) and run the Selenium tests in that build, with the current branch/revision, against a single Selenium grid. You can use Ant or Maven to run the Selenium tests in Jenkins.
Maybe it's complicated to understand because there are so many concepts here, but it's robust and it works fine!
If you have doubts, tell me!
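To illustrate how a tester would consume such a hub, here is a minimal sketch using the Selenium 2/3-era Python bindings (the hub address is a placeholder; substitute the EC2 host and the port you opened in the security group):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# The test talks to the shared hub, which routes the session to a free node:
driver = webdriver.Remote(
    command_executor="http://<hub-ip>:4444/wd/hub",
    desired_capabilities=DesiredCapabilities.CHROME,
)
try:
    driver.get("https://example.com")
    print(driver.title)  # the browser itself runs on the node, not locally
finally:
    driver.quit()  # always quit, so the node slot is freed for the next run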
If Internet Explorer is not one of the browsers on which you must run your automation tests, I would recommend that you consider docker selenium.
Selenium provides pre-configured Docker images for both the Selenium Hub and Node (refer here for more information). To make use of docker selenium, all you need to do is find a machine (preferably a UNIX machine), install Docker on it by following the instructions detailed here, and then start the hub and node by starting those containers. With Docker you can literally transform a VM (or a physical machine) into a VM farm and yet not have to worry about slowness etc., because I believe Docker is optimized for this and runs each of your 'VMs' as a process.
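As a rough sketch of how thin the client side becomes (the URL assumes a hub container publishing the default port 4444 on localhost; the console path is a Grid 2/3 convention), a test runner can check that the grid is reachable before dispatching tests:
import requests

# The Grid hub serves a console page; a 200 response means it is reachable.
resp = requests.get("http://localhost:4444/grid/console", timeout=5)
if resp.status_code == 200:
    print("Selenium Grid hub is up")
else:
    print("Hub not ready yet:", resp.status_code)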
Resorting to the Amazon cloud for running your Selenium nodes is all fine, but if you have corporate policies that prevent incoming traffic from the internet into your intranet region, then I am not sure how far the Amazon cloud would be useful.
Also remember that Jenkins is not absolutely required; it is more of a good-to-have part of the setup, because it lets anyone run their tests from a web UI. This will however require that all your tests are checked in and made available in a central version control system in your organization.
PS: The reason I called out Internet Explorer as an exception is that IE runs only on Windows and there are no Docker images (yet) for Windows. All the Docker images are UNIX-based.

Spark long deploying time on EC2 with custom Windows AMI

I am trying to run a Spark cluster with some Windows instances on Amazon EC2 infrastructure, but I am facing issues with extremely long deployment times.
My project needs to run in a Windows environment, so I am using an alternative AMI, indicated with the -a flag provided by Spark's spark-ec2 script. When I run the script, the process gets stuck waiting for the instances to be up and running, with the following message:
Waiting for all instances in cluster to enter 'ssh-ready' state.............
When I use the default AMI, instead, the cluster launches normally after just a few minutes of waiting.
I have searched for similar problems reported by other users, and so far I have only been able to find this statement about long deployment times with custom AMIs (see Josh Rosen's answer).
I am using version 1.2.0 of Spark. The call that launches the cluster looks something like the following:
./spark-ec2 -k MyKeyPair \
  -i MyKeyPair.pem \
  -s 10 \
  -a ami-905fe9e7 \
  --instance-type=t1.micro \
  --region=eu-west-1 \
  --spark-version=1.2.0 \
  launch MyCluster
The AMI indicated above refers to:
Microsoft Windows Server 2012 R2 Base - ami-905fe9e7
Desc: Microsoft Windows 2012 R2 Standard edition with 64-bit architecture. [English]
Any help or clarification about this issue would be greatly appreciated.
I think I have figured out the problem. It seems Spark does not support the creation of clusters in a Windows environment with its default scripts. I think it is still possible to create a cluster with some manual tweaking, but that goes beyond my limited knowledge. Here is the official post that explains it.
Instead, as a temporary solution, I am considering using a Microsoft Azure cluster; they have just released an experimental tool that makes it possible to use a variant of Apache Hadoop (Spark) on their HDInsight clusters. Here is the article that explains it better.

Monitoring Storm JVM metrics

I have got a Storm cluster running and I want to monitor its performance. I followed this blog and was able to measure the number of tuples received by a bolt using Codahale metrics and display it in Graphite.
My goal is to deploy a Storm cluster on a lightweight computer such as a BeagleBone, and for that I need to be able to monitor JVM parameters such as the CPU, thread, and memory usage of each worker process.
I really like Codahale metrics and would like to continue using it in my application. Can anyone direct me as to how I can measure JVM parameters separately for each worker using Codahale metrics?
I would really appreciate it if someone posted an example of how to get JVM metrics using Codahale metrics.
Thanks,
Palak
I found an excellent tutorial here. Works like a charm.
Using VisualVM and JMX, we can get CPU usage, GC activity, class-loading information, heap size and used-heap statistics, information and statistics for all threads, CPU and memory profiling, performance monitoring, and memory-leak detection for worker nodes. You can also take heap dumps, thread dumps, and profiler snapshots.
STEPS for setup
STEP 1: Starting VisualVM
Java VisualVM is bundled with JDK version 6 update 7 or greater. Navigate to your JDK software's bin directory and double-click the Java VisualVM executable.
Alternatively, navigate to your JDK software's bin directory and type the following command at the command (shell) prompt: jvisualvm.
STEP 2: Adding the MBeans plugin
For JMX monitoring you need to add the MBeans plugin explicitly.
1. Choose Tools > Plugins from the main menu.
2. In the Downloaded tab, click Add Plugins.
3. Select the MBeans plugin.
After successfully adding the MBeans plugin you will see an MBeans tab in VisualVM, where you can monitor JMX.
STEP 3: Local Monitoring
By default, VisualVM will monitor all the applications running on the local JVM. No changes are needed if you're using Java 1.6 or above.
STEP 4: Remote Monitoring
To retrieve and display information on applications running on the remote host, the jstatd utility needs to be running on the remote host.
Steps to run jstatd
The jstatd tool is an RMI server application that monitors the creation and termination of instrumented HotSpot Java virtual machines (JVMs) and provides an interface that allows remote monitoring tools to attach to them.
1. Create a file named "jstatd.all.policy" and copy in the content below:
grant codebase "file:${java.home}/../lib/tools.jar" { permission java.security.AllPermission; };
2. Copy the "jstatd.all.policy" file into the Java bin directory (e.g. Java\jdk1.7.0_10\bin).
3. Navigate to your JDK software's bin directory and type the following command at the command prompt: jstatd -J-Djava.security.policy=jstatd.all.policy
4. Admin privileges are required to run jstatd; only then can all the other users connect to the remote host.
It's a one-time activity. (Run it as a background process in your CIT and SIT environments.)
To add a remote host in VisualVM, right-click the Remote node in the Applications window, choose Add Remote Host, and type the host name or IP address in the Add Remote Host dialog box.
When Java VisualVM is connected to a remote host, a node for the remote host appears under the Remote node in the Applications window.
You can expand the remote host node to view the applications running on the remote host.
Run jvisualvm.exe from the JDK's bin directory and you can monitor the Storm workers. VisualVM can also point to a remote Storm topology. Download and add the MBeans plugin to VisualVM.

Connecting to remote server with hive

So I have two machines, and I am trying to connect to the Hive server on one machine from the other. I simply enter:
$hive -h<IP> -p<PORT>
However, it says I need to install hadoop. I only want to connect remotely. So why would I need hadoop? Is there any way to bypass this?
The hive program depends on the hadoop program, because it works by reading from HDFS, launching map-reduce jobs, etc. (In Hive, unlike a typical database server, the command-line interface actually does all the query processing, translating it to the underlying implementation; so you don't usually really run a "Hive server" in the way you seem to be expecting.) This doesn't mean that you need to actually install a Hadoop cluster on this machine, but you will need to install the basic software to connect to your Hadoop cluster.
One way to bypass this is run the Hive JDBC/Thrift server on the box that has the Hadoop infrastructure — that is, to run the hive program with command-line options to run it as a Hive-server on the desired port and so on — and then connect to it using your favorite JDBC-supporting SQL client. This more closely approximates the sort of database-server model of typical DBMSes (though it still differs, in that it still leaves open the possibility of other hive connections that aren't through this server). (Note: this used to be a bit tricky to set up. I'm not sure if it's easier now than it used to be.)
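For example, once the Thrift server is up on the Hadoop box, a thin client library is all the remote machine needs. Here is a minimal sketch using the third-party PyHive package (the host name and query are placeholders; HiveServer2 with its default port 10000 is assumed):
from pyhive import hive  # pip install pyhive; a thin Thrift client, no Hadoop needed

# Connect to the remote Hive server; query processing happens on that box:
conn = hive.Connection(host="hadoop-box.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)
conn.close()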
And this is probably obvious, but for completeness: another way to bypass this restriction is to use ssh, and actually run hive on the box that has the Hadoop infrastructure. :-)
Newer Hive CLI actually allows connecting to a remote Thrift server. See the beginning of https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli The remote machine should be running a Hive server for this to work.
You don't need your local box to be a part of a Hadoop cluster. However, you may need Hadoop programs/jars for Hive to work. If you install Hive from a standard repository, it should include a Hadoop distribution.

Remote profiling with JProfiler

Hi, I am very new to JProfiler and Linux.
I am trying to monitor my Apache Tomcat server, installed on a Linux machine, using JProfiler remote profiling from a Windows machine. Kindly help me with the procedure in detail.
I tried all the help I could get from Google but am still stuck. Any help is appreciated. Thanks in advance.
In any case, you should extract the JProfiler tar.gz file for Linux on the remote machine. No further configuration is required on the remote side. On the local side you need a full installation of JProfiler.
There are two ways to get remote profiling to work:
A. Attach to the running Tomcat process
Execute the command-line utility bin/jpenable from the JProfiler distribution on the remote machine and select the Tomcat process. The JVM will then be ready for profiling. If the profiled JVM is not listed, execute jpenable as the same user that runs the Tomcat JVM. If that does not help, use alternative B.
On the local machine, create a session of type "Attach to profiled JVM (local or remote)", specify the host name of the remote machine and the profiling port that was set with jpenable.
When you start the session, the JProfiler GUI will connect to the remote machine and you will see profiling data.
B. Use the integration wizard
Execute the command-line utility bin/jpintegrate from the JProfiler distribution on the remote machine, select your application server, and follow the subsequent steps.
Then proceed as in alternative A. This option is actually preferable to alternative A, and unless you have to profile an already running JVM, you should take this route.