GCP: AI Platform ML serving with autoscaling to zero - TensorFlow

I wanted to try ML model serving on GCP's AI Platform, but I want the nodes to run only when there is a call to the prediction endpoint.
I see in the documentation here:
If you select "Auto scaling", the optional Minimum number of nodes field displays. You can enter the minimum number of nodes to keep running at all times, when the service has scaled down. This field defaults to 0.
But when I try to create my model version, it shows an error telling me that this field should be > 1.
Here is what I tried:
Name: testv1
Container: pre-built container
Python version: 3.7
Framework: TensorFlow
Framework version: 2.4.0
ML runtime version: 2.4
Scaling: auto scaling
Minimum number of nodes: 0
Machine type: n1-standard-4
GPU: TESLA_K80 x 1

I tried to reproduce your case and ran into the same thing: I was not able to set the minimum number of nodes to 0.
This appears to be outdated documentation. There is an ongoing Feature Request which explains that a minimum of 0 machines was possible with the legacy machine types, and asks for this option to be made available for the current machine types as well.
In the meantime, I went ahead and opened a ticket to update the documentation.
As a workaround, you can deploy your models right when you need them and then un-deploy them when you are done. Be mindful that un-deployment may take up to 45 minutes, so it is advisable to wait about an hour before re-deploying the same model to avoid any issues.
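If you want to script that deploy/un-deploy cycle, a minimal sketch using the google-api-python-client library could look like the following (the project, model, version name, and deployment URI below are placeholders, not values from your setup):

# Sketch: deploy a version right before serving predictions, delete it afterwards.
# Both calls return long-running operations, so creation/deletion takes a while.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
parent = "projects/my-project/models/testmodel"  # placeholder project/model

# Deploy (create) the version when you need to serve predictions.
ml.projects().models().versions().create(
    parent=parent,
    body={
        "name": "testv1",
        "deploymentUri": "gs://my-bucket/model/",  # placeholder model directory
        "runtimeVersion": "2.4",
        "framework": "TENSORFLOW",
        "pythonVersion": "3.7",
        "machineType": "n1-standard-4",
        "acceleratorConfig": {"count": "1", "type": "NVIDIA_TESLA_K80"},
    },
).execute()

# ... run your online predictions ...

# Un-deploy (delete) the version once you are done, so no nodes keep running.
ml.projects().models().versions().delete(
    name=parent + "/versions/testv1"
).execute()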

Related

Suggestion for Non Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework, or service to perform the task below faster?
Input: a CSV file with over a million rows, each consisting of an identifier and several image columns.
Objective: check whether any of the image columns in a row meets the minimum resolution, and create a new boolean column for every row with the result:
True - if any image in the row meets the minimum resolution
False - if no image in the row meets the minimum resolution
Current implementation: a Python script using pandas and multiprocessing, running on a large VM (60-core CPU), which takes about 4-5 hours. Since this is a periodic task, we schedule and manage it with Cloud Workflow and a Celery backend.
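For context, the per-row check is conceptually along these lines (a simplified sketch; the column names, resolution threshold, and image-loading details are illustrative, not our exact code):

# Simplified sketch of the current pandas + multiprocessing check.
from io import BytesIO
from multiprocessing import Pool

import pandas as pd
import requests
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 800, 600          # illustrative threshold
IMAGE_COLUMNS = ["image_1", "image_2"]    # illustrative column names

def meets_min_resolution(row):
    """Return True if any image in the row meets the minimum resolution."""
    for col in IMAGE_COLUMNS:
        url = row[col]
        if pd.isna(url):
            continue
        try:
            img = Image.open(BytesIO(requests.get(url, timeout=10).content))
            if img.width >= MIN_WIDTH and img.height >= MIN_HEIGHT:
                return True
        except Exception:
            continue
    return False

if __name__ == "__main__":
    df = pd.read_csv("input.csv")
    with Pool() as pool:
        df["meets_min_resolution"] = pool.map(
            meets_min_resolution, (row for _, row in df.iterrows())
        )
    df.to_csv("output.csv", index=False)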
Note: we are looking to cut down on costs, since the server is only needed for about 4-6 hours a day; keeping a 60-core machine running 24/7 would waste a lot of resources.
Options Explored:
We have ruled out Cloud Run due to its memory, CPU, and timeout limitations.
Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
Spark on Dataproc seems to be geared towards analytical workloads, although a serverless option would be much preferred.
Which direction should I be looking into?

Dask Yarn failed to allocate the requested number of workers

We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1 and dask-yarn version 0.8 with skein 0.8.
Currently we are having problems allocating the maximum number of workers.
We are not able to run a job with more than 18 workers (we can see the actual number of workers in the Dask dashboard).
The definition of the cluster is as follows:
from dask_yarn import YarnCluster

cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
Even when increasing the number of workers to 50, nothing changes, although when we change worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
Update
Following jcrist's answer, I realized that I didn't fully understand the terminology connecting the YARN web UI application dashboard and the YarnCluster parameters.
From my understanding:
A YARN container corresponds to a Dask worker.
Whenever a YARN cluster is created, there are 2 additional containers running (one for the scheduler and one for the logger, each with 1 vCore).
The limit arising from n_workers * worker_vcores vs. n_workers * worker_memory is something I still need to fully grok.
There is another issue: while optimizing, I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a limit of 100GB, but the YARN web UI only showed 2 containers running (my cluster has 384 vCores and 1.9TB, so there is still plenty of room to expand). This is probably worth opening as a separate question.
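For reference, the adaptive setup was roughly along these lines (a simplified sketch; the scaling bounds shown here are illustrative):

from dask_yarn import YarnCluster

cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      worker_vcores=10,
                      worker_memory='100GB')
# Let the cluster scale between explicit bounds instead of a fixed n_workers;
# the minimum/maximum values are illustrative.
cluster.adapt(minimum=2, maximum=10)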
There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on the REQUESTED state to see just the pending containers:
>>> cluster.application_client.get_containers(states=['REQUESTED'])
This should give you some insight into what has been requested but not yet allocated.
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.

How to delete an instance if CPU is low?

I am running managed instance groups whose overall CPU usage is always below 30%, but if I check instances individually I find some running above 70% and others as low as 15%.
Keep in mind that Managed Instance Groups don't look at individual instances when deciding whether a machine should be removed from the pool. GCP's MIGs keep a running average of the last 10 minutes of activity across all instances in the group and use that metric to make scaling decisions. You can find more details here.
Identifying instances with lower CPU usage than the group doesn't seem like the right goal here; instead, I would suggest focusing on why some machines are at 15% usage while others are at 70%. How is work distributed to your instances? Are you using the correct load-balancing strategy for your workload?
Maybe your application has specific endpoints that cause large amounts of CPU usage while the majority are basic CRUD operations; in that case, one machine generating a report and showing higher usage is fine. If all instances render HTML pages from templates and return the results, one machine performing much less work than the others is a distribution issue. Maybe you're using an RPS-based balancing mode when you want a CPU-utilization-based one.
In your use case, the best option is to create an alert notification that will alert you when an instance goes over the desired CPU usage. Once you receive the notification, you can then manually delete the VM instance. As it is part of the Managed Instance Group, the VM instance will be automatically recreated.
I have attached an article on how to create an alert notification here.
There is no metric within Stackdriver that will call the GCE API to delete a VM instance.
There is currently no such automation in place. It shouldn't be too difficult to implement yourself, though. You can write a small script that runs on all your machines (started from cron or similar) and monitors CPU usage. If it decides that usage is too low, the instance can delete itself from the MIG (you can use e.g. gcloud compute instance-groups managed delete-instances --instances ).
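A minimal sketch of such a script (the CPU threshold, MIG name, and zone are placeholders; it assumes the psutil and requests libraries are installed on each instance):

# Self-monitoring script run from cron on each instance. If average CPU usage
# stays below a threshold, the instance removes itself from its MIG via gcloud.
import subprocess

import psutil
import requests

CPU_THRESHOLD = 20.0       # percent, illustrative
MIG_NAME = "my-mig"        # placeholder
ZONE = "us-central1-a"     # placeholder

def instance_name():
    # The metadata server returns this instance's own name.
    return requests.get(
        "http://metadata.google.internal/computeMetadata/v1/instance/name",
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    ).text

def main():
    # Sample CPU usage averaged over 60 seconds.
    usage = psutil.cpu_percent(interval=60)
    if usage < CPU_THRESHOLD:
        subprocess.run(
            [
                "gcloud", "compute", "instance-groups", "managed",
                "delete-instances", MIG_NAME,
                f"--instances={instance_name()}",
                f"--zone={ZONE}",
            ],
            check=True,
        )

if __name__ == "__main__":
    main()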

Retransmit a feed to multiple users

I have the following problem to solve and I wanted to know what options I have. Basically, I receive a feed over a socket and I need to retransmit it to multiple computers.
First of all, I want to build a beta version as quickly and simply as possible, for a maximum of 250 connections. Then, in the future, I will build a full version with an architecture that supports scaling the number of connections and perhaps somewhat larger packet sizes.
Some more detailed data:
The packets are approximately 2 KB each.
At peak moments we will send about 50 packets per second (one packet every 20 ms).
In the beta version it is not a problem if a packet does not reach one of the consumers, but in the full version it is.
I would like to push the messages rather than have the consumers pull them, but for the beta version this is not mandatory.
In the beta version I do not need authentication, but in the full version I do.
I was researching and I found that I could use:
A message queue, e.g. RabbitMQ
A streaming API, like Twitter's
Are there other alternatives? What technologies and tools would you recommend for the beta version and for the full version?
Thank you very much
AMQP is a good choice, as you've mentioned.
You can also check out DDS: http://opendds.org/about/dds_overview.html
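For the beta version, a minimal fan-out sketch with RabbitMQ and the pika client could look like this (the exchange name, hosts, and handler are placeholders):

# Publisher and consumer sketch using a RabbitMQ fanout exchange.
import pika

# --- publisher side: push each ~2 KB packet read from the feed socket ---
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="feed", exchange_type="fanout")

def publish(packet: bytes) -> None:
    channel.basic_publish(exchange="feed", routing_key="", body=packet)

# --- consumer side: each of the (up to 250) consumers runs something like this ---
def handle(body: bytes) -> None:
    pass  # process the 2 KB packet

def consume():
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker-host"))
    ch = conn.channel()
    ch.exchange_declare(exchange="feed", exchange_type="fanout")
    # Each consumer gets its own exclusive queue bound to the fanout exchange.
    queue = ch.queue_declare(queue="", exclusive=True).method.queue
    ch.queue_bind(exchange="feed", queue=queue)
    ch.basic_consume(
        queue=queue,
        on_message_callback=lambda c, method, props, body: handle(body),
        auto_ack=True,  # acceptable for the beta, where lost packets are tolerable
    )
    ch.start_consuming()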

Google Cloud ML, extend previous run of hyperparameter tuning

I am running hyperparameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be:
I launch a hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from previous runs I already paid for.
Or another application:
I launch a hypertune campaign
I want to extend the number of trials afterwards, without starting from scratch
And then, for instance, I want to remove one degree of freedom (e.g. training_rate), focusing on the other parameters
Basically, what I need is: "how can I have a checkpoint for hypertune?"
Thanks!
Yes, this is an interesting workflow. It's not exactly possible with the current set of APIs, so it's something we'll need to consider in future planning.
However, I wonder if there are some workarounds that could approximate your intended workflow right now:
Start with a higher number of trials, given that you can cancel a job but not extend one.
Finish a training job early based on some external input - e.g. once you've arrived at a fixed training_rate, you could record it in a file in GCS and mark subsequent trials with a different training rate as infeasible, so those trials end fast.
To go further, e.g. to launch another job (to add runs, or to change the scale tier), you could potentially try using the same output directory and this time look up previous results for a given set of hyperparameters together with their objective metric (you'll need to record them somewhere you can look them up - e.g. create GCS files to track the trial runs), so that a particular trial completes early and training moves on to the next trial. Essentially, rolling your own "checkpoint for hypertune".
As I mentioned, all of these are workarounds, and exploratory thoughts on what might be possible from your end with current capabilities.
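For example, the "roll your own checkpoint" idea above could be sketched like this (the bucket, prefix, and file format are assumptions of mine, not an AI Platform feature; it uses the google-cloud-storage client):

# Before training a trial, look up whether this hyperparameter combination was
# already evaluated in a previous job (results recorded as small files in GCS).
import hashlib
import json

from google.cloud import storage

BUCKET = "my-bucket"            # placeholder
PREFIX = "hypertune-results"    # placeholder

def trial_key(params: dict) -> str:
    # Stable key for a hyperparameter combination.
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

def previous_result(params: dict):
    """Return the recorded objective metric for these params, or None."""
    blob = storage.Client().bucket(BUCKET).blob(f"{PREFIX}/{trial_key(params)}.json")
    if blob.exists():
        return json.loads(blob.download_as_text())["objective"]
    return None

def record_result(params: dict, objective: float) -> None:
    blob = storage.Client().bucket(BUCKET).blob(f"{PREFIX}/{trial_key(params)}.json")
    blob.upload_from_string(json.dumps({"params": params, "objective": objective}))

# In each trial: if previous_result(params) is not None, report that metric
# immediately and exit instead of training, so the trial completes early.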