GPUs and google container engine - gpu

Kubernetes supports GPUs as an experimental feature. Does it work in google container engine? Do I need to have some special configuration to enable it? I want to be able to run machine learning workloads, but want to use Python 3 which isn't available in CloudML.

GPUs on Google Container Engine are now available in Alpha. Sign up form.
Beware that alpha cluster limitations apply: they cannot be upgraded, and they will be auto-deleted in 30 days.
Disclaimer: I work at GCP.

I am afraid this is not supported out of the box. When creating a regular instance in Google Compute Engine (GCE) you are able to select GPU specs for your machine. On the other side, when creating a cluster, these options are not available. I imagine that this will be available sooner or later, but not at the moment.
As an alternative, you can create several GCE instances and build a cluster using tools like kubeadm or following guides like Kubernetes the hard way: https://github.com/kelseyhightower/kubernetes-the-hard-way

I've not tested it, but as long as GPU vm are just machine types I would say that doing these two steps should make it feasible:
UPDATE: Main site for Custom Machine Types: https://cloud.google.com/custom-machine-types/
1- Create a GPU Custom Machine Type: https://cloud.google.com/compute/docs/gpus/
You can add GPUs to any non-shared-core predefined machine type or custom machine type that you are able to create in a zone
2- When creating nodes, chose your custom machine type in your cluster or node pool: https://cloud.google.com/container-engine/docs/clusters/operations
--machine-type: The Google Compute Engine machine type (e.g. n1-standard-1) to use for instances in this container cluster. If unspecified, the default machine type is n1-standard-1

Related

What is the difference between local runtime and hosted runtime in Google Colab?

I I just started using Google Colab for a project of mine. I see an button of "CONNECT" on the web page that presents before me two options:
Connect to Hosted Runtime
Connect to Local Runtime
Can anyone explain what the two mean and how it may affect my project? I did not find any useful documentation related to it.
Hosted Runtime runs on a new machine instance in Google Cloud. You don't need to set-up any hardware. But you may need to install a few libraries every time you use it.
Local Runtime runs on your machine at home. You need to install Python, Jupyter, and set-up some forwarding. It is useful if you have a lot of data to process locally, or if you have your own powerful GPU to use.
In most cases, I use Hosted Runtime.

To virtualize or not to virtualize a bare metal server for a kubernetes deployment

I'd like to deploy kubernetes on a large physical server (24 cores) and I'm uncertain as to a number of things.
What are the pros and cons of creating virtual machines for the k8s cluster other than running on bare-metal.
I have the following considerations:
Creating vms will allow for work load isolation. New vms for experiments can be created and assigned to devs.
On the other hand, with k8s running on bare metal a new NAMESPACE can be created for each developer for experimentation and they can run their code in it. After all their code should be running in docker containers.
Security:
Having vms would limit the amount of access given to future maintainers, limiting the amount of damage that could be done. While on the other hand the primary task for any future maintainers would be adding/deleting nodes and they would require bare metal access to do that.
Authentication:
At the moment devs would only touch the server when their code runs through the CI pipeline and their running deployments are deployed. But what about viewing logs? Could we setup tiered kubectl authentication to allow devs to only access whatever namespaces have been assigned to them (I believe this should be possible with the k8s namespace authorization plugin).
A number of vms already exist on the server. Would this be an issue?
128 cores and doubts.... That is a lot of cores for a single server.
For kubernetes however this is not relevant:
Kubernetes can use different sized servers and utilize them to the maximum. However if you combine the master server processes and the node/worker processes on a single server, you might create unwanted resource issues. You can manage those with namespaces, as you already mention.
What we do is use continuous integration with namespaces in a single dev/qa kubernetes environment in which changes have their own namespace (So we run many many namespaces) and run full environment deployments in those namespaces. A bunch of shell scripts are used to manage this. This works both with a large server as what you have, as well as it does with smaller (or virtual) boxes. The benefit of virtualization for you could mainly be in splitting the large box in smaller ones so that you can also use it for other purposes then just kubernetes (yes, kubernetes runs except MS Windows, no desktops, no kernel modules for VPN purposes, etc).
I would separate dev and prod in the form of different vms. I once had a webapp inside docker which used too many threads so the docker daemon on the host crashed. It was limited to one host luckily. You can protect this by setting limits, but it's a risk: one mistake in dev could bring down prod as well.
I think the answer is "it depends!" which is not really an answer. Personally, I would split up the machine using VM's and deploy that way. You've got better flexibility as to how much of the server's resources you carve out and you can easily create new environments, then destroy easily.
Even if these vms are really big, I think it's still easier to manage also given that you have existing vm's on the machine.
That said, there's not a technical reason that you can't run a single node server, but you may run into problems with downtime with upgrades (if that's an issue), as well as if that server needs patched or rebooted, then your entire cluster is down.
I would look at your environment needs for HA and uptime, as well as how you are going to deploy VM's (if you go that route), and decide what works the best for you.

Which Google Cloud Platform service is the easiest for running Tensorflow?

While working on Udacity Deep Learning assignments, I encountered memory problem. I need to switch to a cloud platform. I worked with AWS EC2 before but now I would like to try Google Cloud Platform (GCP). I will need at least 8GB memory. I know how to use docker locally but never tried it on the cloud.
Is there any ready-made solution for running Tensorflow on GCP?
If not, which service (Compute Engine or Container Engine) would make it easier to get started?
Any other tip is also appreciated!
Summing up the answers:
AI Platform Notebooks - One click Jupyter Lab environment
Deep Learning VMs images - Raw VMs with ML libraries pre-installed
Deep Learning Container Images - Containerized versions of the DLVM images
Cloud ML
Manual installation on Compute Engine. See instructions below.
Instructions to manually run TensorFlow on Compute Engine:
Create a project
Open the Cloud Shell (a button at the top)
List machine types: gcloud compute machine-types list. You can change the machine type I used in the next command.
Create an instance:
gcloud compute instances create tf \
--image container-vm \
--zone europe-west1-c \
--machine-type n1-standard-2
Run sudo docker run -d -p 8888:8888 --name tf b.gcr.io/tensorflow-udacity/assignments:0.5.0 (change the image name to the desired one)
Find your instance in the dashboard and edit default network.
Add a firewall rule to allow your IP as well as protocol and port tcp:8888.
Find the External IP of the instance from the dashboard. Open IP:8888 on your browser. Done!
When you are finished, delete the created cluster to avoid charges.
This is how I did it and it worked. I am sure there is an easier way to do it.
More Resources
You might be interested to learn more about:
Google Cloud Shell
Container-Optimized Google Compute Engine Images
Google Cloud SDK for a more responsive shell and more.
Good to know
"The contents of your Cloud Shell home directory persist across projects between all Cloud Shell sessions, even after the virtual machine terminates and is restarted"
To list all available image versions: gcloud compute images list --project google-containers
Thanks to #user728291, #MattW, #CJCullen, and #zain-rizvi
Google Cloud Machine Learning is open to the world in Beta form today. It provides TensorFlow as a Service so you don't have to manage machines and other raw resources. As part of the Beta release, Datalab has been updated to provide commands and utilities for machine learning. Check it out at: http://cloud.google.com/ml.
Google has a Cloud ML platform in a limited Alpha.
Here is a blog post and a tutorial about running TensorFlow on Kubernetes/Google Container Engine.
If those aren't what you want, the TensorFlow tutorials should all be able to run on either AWS EC2 or Google Compute Engine.
You now can also use pre-configured DeepLearning images. They have everything that is required for the TensorFlow.
This is an old question but there's are new, even easier options now:
If you want to run TensorFlow with Jupyter Lab
GCP AI Platform Notebooks, which gives you on-click access to a Jupyter Lab Notebook with Tensorflow pre-installed (you can also use Pytorch, R, or a few other libraries instead if you prefer).
If you just want to use a raw VM
If you don't care about Jupyer Lab and just want a raw VM with Tensorflow pre-installed, you can instead create a VM using GCP's Deep Learning VM Image. These DLVM images give you a VM with Tensorflow pre-installed and are all setup to use GPUs if you want. (The AI Platform Notebooks use these DLVM images under the hood)
If you'd like to run it on both your laptop and the cloud
Finally, if you want to be able to run tensorflow both on your personal laptop and in the cloud and are comfortable using Docker, you can use GCP's Deep Learning Container Images. It contains the exact same setup as the DLVM images, but packaged as a container instead, so you can launch these anywhere you like.
Extra benefit: If you're running this container image on your laptop, it's 100% free :D
Im not sure there if there is a need for you to stay on the Google Cloud platform. If you are able to use other products you might save a lot of time, and some money.
If you are using TensorFLow I would recommend a platform called TensorPort. It is exclusively for TesnorFlow and is the easy platform I am aware of. Code and data are loaded with git and they provide a python module for automatic toggling of paths between remote and your local machine. They also provide some boiler plate code for setting up distributed computing if you need it. Hope this helps.

Is there a fast painless way to setup a linux distro on VirtualBox?

I like the Docker Hub with dockerfiles idea very much.
Is there a similar way to get a small working linux VirtualBox instance in a few commands, that could also be controlled from a command line?
Vagrant is a great tool that does just what you want and much more! It's a ruby application written for fast and simple setup of minimal development environments.
By default it creates VirtualBox images, but it supports VMWare and many others too. The whole setup of a box is managed by a single Vagrantfile! Your vm options, network settings and provisioning is done there.
Setting up a virtualbox box is as easy as executing just two shell commands. Checkout the Getting Started Guide for an example using Ubuntu.
You can use a vast range of prepared images from the Hashicorp Atlas or build your owns.
Also, vagrant doesn't limit you to one virtual machine per development setup, it enables you to model cluster setups on a single machine using multiple vms. I myself use docker for that part though.
Edit: fixed a typo :<

Spark long deploying time on EC2 with custom Windows AMI

I am trying to run a Spark cluster with some Windows instances on an Amazon EC2 infrastructure, but I am facing some issues with extremely high deploying times.
My project needs to be run on a Windows environment, and therefore I am using an alternative AMI by indicating it with the -a flag provided by Spark's spark-ec2 script. When I run the script, the process keeps stuck waiting for the instances to be up and running, with the following message:
Waiting for all instances in cluster to enter 'ssh-ready' state.............
When I use the default AMI, instead, the cluster launches normally after very few minutes of waiting.
I have searched for similar problems with other users, and so far I have only been able to find this statement about long deploying time with custom AMI-s (see Josh Rosen's answer).
I am using the version 1.2.0 of Spark. The call that launches the cluster looks something like the following:
./spark-ec2 -k MyKeyPair
-i MyKeyPair.pem
-s 10
-a ami-905fe9e7
--instance-type=t1.micro
--region=eu-west-1
--spark-version=1.2.0
launch MyCluster
The AMI indicated above refers to:
Microsoft Windows Server 2012 R2 Base - ami-905fe9e7
Desc: Microsoft Windows 2012 R2 Standard edition with 64-bit architecture. [English]
Any help or acclaration abouth this issue would be greatly appreciated.
I think I have figured out the problem. It seems Spark does not support the creation of clusters on a Windows environment with its default scripts. I think it is still possible to create a cluster with some manual tweaking, but it goes out of my limited knowledge. Here is the official post that explains it.
Instead, as a temporal solution, I am considering the usage of a Microsoft Azure cluster, which has just released an experimental tool that makes able to use a variant of Apache Hadoop (Spark) on their HDinsight clusters. Here is the article that explains it better.