SageMaker: visualizing training statistics - tensorflow

If I send a TensorFlow training job to a SageMaker instance, what is the typical way to view training progress? Can I access TensorBoard for this launched EC2 instance? Is there some other alternative? What I'm looking for specifically are things like graphs of current training epoch and mAP.

You can now specify metrics (a metric name plus a regex) that you want to track, either through the AWS Management Console or the Amazon SageMaker Python SDK APIs. Once model training starts, Amazon SageMaker automatically monitors and streams the specified metrics in real time to the Amazon CloudWatch console, where you can visualize them as time-series curves.
Ref:
https://docs.aws.amazon.com/sagemaker/latest/dg/API_MetricDefinition.html
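A minimal sketch with the SageMaker Python SDK; the entry point, role ARN, instance type, framework version, and regex patterns below are assumptions you would adapt to whatever your training script actually prints (e.g. lines like "mAP: 0.73"):

    from sagemaker.tensorflow import TensorFlow

    role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

    estimator = TensorFlow(
        entry_point="train.py",            # hypothetical training script
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",     # placeholder instance type
        framework_version="2.11",          # pick the version matching your code
        py_version="py39",
        metric_definitions=[
            # Regexes are assumptions; they must match your own log output
            {"Name": "epoch", "Regex": "Epoch ([0-9]+)"},
            {"Name": "mAP", "Regex": "mAP: ([0-9\\.]+)"},
        ],
    )

    estimator.fit("s3://my-bucket/train/")  # placeholder S3 input

The matched values then show up as time-series metrics for the training job in CloudWatch.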

Related

MLflow pipelines on databricks

I need to run four PySpark scripts back to back on a Databricks cluster as steps of an MLflow pipeline. Can someone please point me to any online content on this? This is an optimization-algorithm model, so there is no .pkl file to load for making predictions; it can be considered as four PySpark scripts to be triggered back to back as a pipeline on MLflow in batch mode.
I know this works in Azure ML Studio, where we build a pipeline and define the steps, with each step tied to a .py script. I want to replicate a similar scenario here.
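One possible shape for this, sketched with MLflow's multi-step project pattern; the entry-point names and the project layout are assumptions, and each entry point would wrap one of the PySpark scripts in an MLproject file:

    import mlflow

    # Hypothetical entry points defined in the MLproject file, one per script
    steps = ["prepare_data", "train", "optimize", "report"]

    with mlflow.start_run(run_name="pyspark_pipeline"):
        for step in steps:
            # mlflow.run blocks until the step finishes (synchronous by default),
            # so the four scripts execute back to back as child runs
            mlflow.run(".", entry_point=step, nested=True)

This is only a sketch of the chaining logic; on Databricks you would still need to decide how each entry point gets a Spark session (e.g. running the driver as a Databricks job on the cluster).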

Difference between Sagemaker notebook instance and Training job submission

I am getting an error from a SageMaker training job with the message "OverflowError: signed integer is greater than maximum". This is an image identification problem, with code written in Keras and TensorFlow. The input is a large .npy file stored in an S3 bucket.
The code works fine when run in the SageMaker notebook cells but errors out when submitted as a training job via a boto3 request.
I am using the same role in both places; what could be the cause of this error? I am using an ml.g4dn.16xlarge instance in both cases.
A couple of things I would check:
Framework versions used in your notebook instance vs. the training job.
Instance storage volume for the training job; since you are using g4dn it comes with an attached SSD, which should ideally be large enough.
This also looks like it could be a bug: requests and urllib3 should only ask for the maximum number of bytes they are capable of handling at once.
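For the first check, one approach is to record the versions the working notebook code runs on and then pin a matching container (and extra storage) when creating the training job; the job name, image URI, bucket paths, and volume size below are placeholders:

    # In the notebook instance: record the versions the working code runs on
    import sys
    import tensorflow as tf
    print(sys.version_info, tf.__version__)

    # When submitting the training job with boto3, pin a matching TF container
    import boto3

    role_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"   # same role as the notebook
    training_image_uri = "<tensorflow-dlc-image-matching-notebook-version>"

    sm = boto3.client("sagemaker")
    sm.create_training_job(
        TrainingJobName="image-id-train",                         # placeholder name
        RoleArn=role_arn,
        AlgorithmSpecification={
            "TrainingImage": training_image_uri,
            "TrainingInputMode": "File",
        },
        ResourceConfig={
            "InstanceType": "ml.g4dn.16xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 200,                                # placeholder; size for the large .npy file
        },
        InputDataConfig=[{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/data/",                  # placeholder bucket/prefix
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
        StoppingCondition={"MaxRuntimeInSeconds": 86400},
    )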

How to load data for TPU inference in Colab, without using GCP?

For training models on the Colab TPUs, the data needs to be in GCS buckets. However, for small amounts of data, I am wondering if it's possible to run inference directly on data from the local Colab environment.
Unfortunately, it isn't possible to load local data into the TPU with Colab currently. You will need to continue using the GCS bucket for any data loading into the TPU.
You can read files with Python:

    with open(image_path, "rb") as local_file:
        img = local_file.read()
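If the dataset is small, one workaround is to copy it from the local Colab filesystem into a GCS bucket first so the TPU can read it; a rough sketch, where the bucket name and local path are placeholders:

    from google.colab import auth

    # Authenticate the Colab session against your GCP account
    auth.authenticate_user()

    # Copy local files into a bucket the TPU can read from
    !gsutil -m cp -r ./local_data gs://my-tpu-bucket/data/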

Store TensorFlow/TensorBoard data in Elastic or Prometheus

Is there a way to feed live training (summary) metrics from TensorFlow/TensorBoard into Elastic or Prometheus, so I can visualize these outside of TensorBoard? I'd like to combine my visualizations with other metrics that are not available to TensorBoard.
You could read the tfevents files as described in https://stackoverflow.com/a/40029298/179444 and forward them to a Prometheus Pushgateway, expose them for scraping, or index them in Elasticsearch.
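A rough sketch of the Pushgateway route, assuming the tensorboard and prometheus_client packages and a Pushgateway reachable at localhost:9091; the log directory and the "loss" tag are placeholders:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    # Load the tfevents file(s) written by the training run (path is a placeholder)
    accumulator = EventAccumulator("/tmp/logdir")
    accumulator.Reload()

    registry = CollectorRegistry()
    gauge = Gauge("training_loss", "Loss scalar from TensorBoard", ["step"], registry=registry)

    # "loss" is a placeholder tag; use accumulator.Tags() to list what is available
    for event in accumulator.Scalars("loss"):
        gauge.labels(step=str(event.step)).set(event.value)

    # Push the values to a Prometheus Pushgateway so Prometheus can scrape them
    push_to_gateway("localhost:9091", job="tf_training", registry=registry)

For live metrics you would run something like this periodically (or tail the event file), rather than once at the end of training.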

Sequential Scripts conditioned by S3 file existence

I have three Python scripts. These are supposed to be executed sequentially, but in different environments.
script1: Generate training and test dataset using an AWS EMR cluster and save it on S3.
script2: Train a Machine Learning model using the training data and save the trained model on S3. (Executed on an AWS GPU instance)
script3: Run evaluation based on the test data and trained model and save the result on S3. (Executed on an AWS GPU instance)
I would like to run all these scripts automatically, without executing them one by one. My questions are:
Are there good practices for handling S3 file-existence conditions? (fault tolerance, etc.)
How can I trigger launching GPU instances and EMR clusters?
Are there good ways or tools to handle this kind of process?
Take a look at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You can configure a notification for when an event occurs on a bucket, for example when an object is created.
You can attach this notification directly to an AWS Lambda function which, if set up with the right role, can create an EMR cluster and any other resources accessible through the AWS SDK.
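A rough sketch of such a Lambda handler, assuming its role is allowed to call EMR/EC2 and that the S3 key prefix tells you which stage just finished; the cluster settings, prefixes, and names are placeholders:

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        # Invoked by an s3:ObjectCreated:* notification
        record = event["Records"][0]
        key = record["s3"]["object"]["key"]

        if key.startswith("datasets/"):
            # Dataset written by script1 -> launch the next stage
            emr.run_job_flow(
                Name="training-data-pipeline",          # placeholder name
                ReleaseLabel="emr-6.10.0",               # placeholder release
                Instances={
                    "InstanceGroups": [{
                        "InstanceRole": "MASTER",
                        "InstanceType": "m5.xlarge",
                        "InstanceCount": 1,
                    }],
                    "KeepJobFlowAliveWhenNoSteps": False,
                },
                JobFlowRole="EMR_EC2_DefaultRole",
                ServiceRole="EMR_DefaultRole",
            )
        # Similarly, when the trained model or dataset appears, you could start
        # the GPU instance for script2/script3 with boto3's ec2 client
        # (e.g. ec2.start_instances) instead of launching an EMR cluster.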