Spark long deploying time on EC2 with custom Windows AMI - ssh

I am trying to run a Spark cluster with some Windows instances on an Amazon EC2 infrastructure, but I am facing some issues with extremely long deployment times.
My project needs to run in a Windows environment, and therefore I am using an alternative AMI, specified with the -a flag of Spark's spark-ec2 script. When I run the script, the process gets stuck waiting for the instances to be up and running, with the following message:
Waiting for all instances in cluster to enter 'ssh-ready' state.............
When I use the default AMI, instead, the cluster launches normally after very few minutes of waiting.
I have searched for similar problems reported by other users, and so far I have only been able to find this statement about long deployment times with custom AMIs (see Josh Rosen's answer).
I am using version 1.2.0 of Spark. The call that launches the cluster looks something like the following:
./spark-ec2 -k MyKeyPair \
  -i MyKeyPair.pem \
  -s 10 \
  -a ami-905fe9e7 \
  --instance-type=t1.micro \
  --region=eu-west-1 \
  --spark-version=1.2.0 \
  launch MyCluster
The AMI indicated above refers to:
Microsoft Windows Server 2012 R2 Base - ami-905fe9e7
Desc: Microsoft Windows 2012 R2 Standard edition with 64-bit architecture. [English]
Any help or clarification about this issue would be greatly appreciated.

I think I have figured out the problem. It seems Spark's default scripts do not support creating clusters in a Windows environment. I think it is still possible to create a cluster with some manual tweaking, but that is beyond my limited knowledge. Here is the official post that explains it.
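One way to see why the launch hangs: the 'ssh-ready' message suggests the script needs SSH access to every instance, and a stock Windows AMI normally exposes RDP (port 3389) rather than an SSH server on port 22. Below is a minimal Python sketch, assuming that is the cause. It only checks whether a TCP port accepts connections; `port_is_open` is a made-up helper name, and the host in the comment is a placeholder.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example (placeholder host, replace with one of your instances):
#   port_is_open("<instance-public-dns>", 22)    # SSH, what spark-ec2 waits for
#   port_is_open("<instance-public-dns>", 3389)  # RDP, the usual Windows port
```

If port 22 stays closed while 3389 is open, the instances are running but can never become 'ssh-ready' from the script's point of view.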
Instead, as a temporary solution, I am considering using a Microsoft Azure cluster. Azure has just released an experimental tool that makes it possible to use a variant of Apache Hadoop (Spark) on their HDInsight clusters. Here is the article that explains it better.

Related

Run Redis server as service on Windows 10?

I was able to run the redis server through the Windows Subsystem for Linux following this guide: https://medium.com/#RedisLabs/windows-subsystem-for-linux-wsl-10e3ca4d434e.
But I do not fully understand how the subsystem works. I thought it would run the server on Windows and that I would see it in Windows Services, which is not the case. Can someone tell me how to run Redis as a service?
EDIT
Does someone know if there is a standard way to download and install Redis for Windows other than using WSL? I have seen some guides, but they are outdated.

What is the most robust way to install and run Redis on Windows Server 2012? (Updated for 2018)

I know this question has been asked before, but it was asked back in 2014. The proposed solution was running Microsoft's port of Redis. However, that port hasn't been touched since 2016.
OK... that answer is Good and Official but this one is the future.
Windows Subsystem for Linux supports fork (the lack of which is the reason they say running Redis on Windows is not recommended), and I was able to run the RQ tutorial on my Windows 10 laptop.
https://learn.microsoft.com/en-us/windows/wsl/install-win10
As far as I can tell, "Not Recommended" is the official answer:
https://redislabs.com/ebook/appendix-a/a-3-installing-on-windows/
Before we get into how to install Redis on Windows, I’d like to point out that running Redis on Windows isn’t recommended, for a variety of reasons. In this section we’ll cover these points:
Reasons why you shouldn’t be running Redis on Windows.
How to download, install, and run a precompiled Windows binary.
How to download and install Python for Windows.
How to install the Redis client library.
But even that ebook page (next page) points to the now archived MSFT Redis project.
I would go with running Redis in Docker:
https://hub.docker.com/r/_/redis/
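Whichever route is used (WSL, the archived port, or Docker), a quick way to confirm that the server is reachable from Windows is to send it a PING. This is a minimal sketch, assuming Redis listens on the default port 6379 without authentication. For Docker it also assumes the container was started with that port published, for example with `docker run -d -p 6379:6379 redis`. `redis_ping` is an illustrative helper, not part of any library.

```python
import socket


def redis_ping(host="127.0.0.1", port=6379, timeout=3.0):
    """Send an inline PING and return True if the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Redis accepts plain-text "inline" commands terminated by CRLF.
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Nothing listening, connection refused, or the read timed out.
        return False
    return reply.startswith(b"+PONG")
```

A `False` result with a running server can also mean that authentication is enabled, since the server then answers with an error instead of `+PONG`.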

Is there a fast, painless way to set up a Linux distro on VirtualBox?

I like the Docker Hub with dockerfiles idea very much.
Is there a similar way to get a small working linux VirtualBox instance in a few commands, that could also be controlled from a command line?
Vagrant is a great tool that does just what you want and much more! It's a Ruby application written for fast and simple setup of minimal development environments.
By default it creates VirtualBox images, but it supports VMware and many others too. The whole setup of a box is managed by a single Vagrantfile! VM options, network settings, and provisioning are all configured there.
Setting up a VirtualBox box is as easy as executing just two shell commands. Check out the Getting Started Guide for an example using Ubuntu.
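The two commands in question are `vagrant init <box>` and `vagrant up`. Here is a small Python sketch of driving them from a script, assuming Vagrant and VirtualBox are already installed. The helper names are hypothetical, and the box name in the usage comment is only an example, not taken from the guide.

```python
import subprocess


def vagrant_commands(box):
    """Return the two commands that create a Vagrantfile and boot the VM."""
    return [
        ["vagrant", "init", box],  # writes a Vagrantfile in the working directory
        ["vagrant", "up"],         # fetches the box if needed and starts the VM
    ]


def bring_up(box, workdir="."):
    """Run both commands in workdir; raises CalledProcessError if one fails."""
    for cmd in vagrant_commands(box):
        subprocess.run(cmd, cwd=workdir, check=True)


# Example (box name is an assumption, pick any box you trust):
#   bring_up("hashicorp/bionic64", workdir="my-vm")
```

Once the machine is up, `vagrant ssh` opens a shell in it and `vagrant destroy` removes it again.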
You can use a vast range of prepared images from the Hashicorp Atlas or build your own.
Also, Vagrant doesn't limit you to one virtual machine per development setup; it lets you model cluster setups on a single machine using multiple VMs. I myself use Docker for that part, though.
Edit: fixed a typo :<

Connecting to remote server with hive

So I have two machines, and I am trying to connect to the hive server with another machine. I simply enter
$hive -h<IP> -p<PORT>
However, it says I need to install hadoop. I only want to connect remotely. So why would I need hadoop? Is there any way to bypass this?
The hive program depends on the hadoop program, because it works by reading from HDFS, launching map-reduce jobs, etc. (In Hive, unlike a typical database server, the command-line interface actually does all the query processing, translating it to the underlying implementation; so you don't usually really run a "Hive server" in the way you seem to be expecting.) This doesn't mean that you need to actually install a Hadoop cluster on this machine, but you will need to install the basic software to connect to your Hadoop cluster.
One way to bypass this is to run the Hive JDBC/Thrift server on the box that has the Hadoop infrastructure (that is, run the hive program with command-line options that start it as a Hive server on the desired port) and then connect to it using your favorite JDBC-supporting SQL client. This more closely approximates the sort of database-server model of typical DBMSes (though it still differs, in that it still leaves open the possibility of other hive connections that aren't through this server). (Note: this used to be a bit tricky to set up. I'm not sure if it's easier now than it used to be.)
And this is probably obvious, but for completeness: another way to bypass this restriction is to use ssh, and actually run hive on the box that has the Hadoop infrastructure. :-)
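For the JDBC-server route above, one common client is `beeline`, which ships with Hive and takes a JDBC URL. The sketch below only builds the command line. It assumes the server side runs HiveServer2 on its default port 10000 and that `beeline` is on the client's PATH. `beeline_command` and the host name in the comment are hypothetical.

```python
def beeline_command(host, port=10000, database="default", user=None):
    """Build a beeline invocation for a HiveServer2 JDBC URL."""
    cmd = ["beeline", "-u", f"jdbc:hive2://{host}:{port}/{database}"]
    if user is not None:
        cmd += ["-n", user]  # -n passes the user name to connect as
    return cmd


# Example (placeholder host):
#   import subprocess
#   subprocess.run(beeline_command("hive.example.com", user="alice"), check=True)
```

The client machine still needs the Hive client software for this, but it does not have to be part of the Hadoop cluster.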
Newer versions of the Hive CLI actually allow connecting to a remote Thrift server. See the beginning of https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli. The remote machine should be running a Hive server for this to work.
You don't need your local box to be a part of a Hadoop cluster. However, you may need Hadoop programs/jars for Hive to work. If you install Hive from a standard repository, it should include a Hadoop distribution.

Does a cloud service like Azure or EC2 exist which can run arbitrary workloads? (e.g. Client SKUs of Windows)

Azure and EC2 are optimized for running servers. Lots and lots of servers. Both platforms attempt to manage tons of things for you -- in Azure's case, it wants to manage even the target operating system.
However, I'd like to use such a service for a different reason: Testing.
I've got a ton of operating systems I need to support. My tests don't actually take that long, but running them on every platform is time consuming. I was going to just use a cloud service for this, thinking that these machines would be running for much less than an hour, and it wouldn't cost all that much.
The problem is that the major cloud services won't run client versions of Windows -- Windows Server only.
Is there a cloud service which would let me run every client and server version, and every service pack level, of Windows released starting with Windows 2000 SP4 to the present day?
Try CloudSigma. You can definitely upload your own ISOs and run any x86 or 64-bit OS you like on it. They have their in-house images to get started with, but you can bring your own OS versions.
They are based in Switzerland but also have servers in the US; I would expect performance to be quite good.
https://www.cloudsigma.com/
There is also a free trial on at the moment
https://cs.cloudsigma.com/accounts/signup/
The list of Open Virtualization Alliance members may have some candidates for you.
A search on the page for "operating system" suggests the following possibilities (in addition to the already-mentioned CloudSigma):
ElasticHosts
stepping stone GmbH (I'm less sure about this one)
Sublime IP
No, commercial cloud services like Azure and Amazon EC2 are themselves virtual, so you don't get a great deal of control over the operating system.
An option may be to consider renting a full physical server (colocated, or managed) and then use a battery of virtual machines to run the tests. Something like VMWare's snapshot feature sounds perfect: spin up a clean virtual machine, deploy the test code, then throw away changes to the disk once the tests have been completed.
Or, indeed, as @Stuart suggests - run the tests locally.
This definitely isn't something Azure offers - I think all of Azure's images are based on something close to Windows Server 2008 R2.
For EC2 you could set up images for Server 2003 through to 2008R2 - but nothing else. There are also some services out there to assist with this - e.g. VaasNet http://www.vaasnet.com/catalog
For testing the other Windows operating systems, I simply don't think there's a cloud service available to let you do this. I don't even think there are any cloud services where you can run "Virtual PC" type applications on top of the hosted operating system - as I think most of the virtualization APIs are disabled in the cloud environments (virtualization within virtualization not supported!)
Sorry to say this, but your best bet may be local test hardware running VirtualPC images.
It appears that the Xen Cloud Platform might do what you're after. This page ends with:
Guest Operating Systems: the XCP binary distribution is delivered with a wide range of Linux and Windows guests. Check out the release notes for a complete list.
And their PDF document Xen Cloud Platform Virtual Machine Installation Guide (Release 0.1, Published October 2009) says that Windows 2000 Server has "No known issues."
(I don't have any affiliation with Xen)
In conjunction with the above, there is also a list of Xen VirtualPrivateServerProviders, several of which say they include Windows.
Buy time on an EC2 instance and use it to host VirtualBox VMs, one set up for each operating system you want to test. Use an RDP client, VNC, or some other means to control the guest OS. This forum post seems to point to that being possible. But yes, it is not a cloud service itself, and you would have to do some initial setup and configuration work yourself.