How to load my train.tfrecord files in saturn cloud for running via Dask? - object-detection

I am working on object detection and have two record files: Train.tfrecord (1.6 GB) and Test.tfrecord (65 MB). How do I load the training file in Saturn Cloud? I want to speed up training by using Dask there.

As @SultanOrazbayev mentioned, since Saturn runs on AWS, the best way is to put your data in an S3 bucket. Then you can access it using whichever library you prefer. Normally we recommend using s3fs.
Note: I work at Saturn Cloud.
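As a rough illustration of reading the file once it sits in S3, here is a minimal sketch. It assumes the standard TFRecord layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) and skips CRC checks. The helper names and the bucket path in the trailing comment are hypothetical, and the s3fs usage is shown only as a comment because it needs real credentials.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_tfrecords(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw payload of each record in a TFRecord byte stream.

    CRC fields are read but not verified in this sketch.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte CRC of the length
        if not header:
            return  # clean end of file
        if len(header) < 12:
            raise ValueError("truncated TFRecord header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # CRC of the payload, not checked here
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated TFRecord payload")
        yield payload


def _fake_record(payload: bytes) -> bytes:
    """Build one record with zeroed CRCs, for the local demo only."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Local demo on an in-memory stream.
demo = io.BytesIO(_fake_record(b"first") + _fake_record(b"second"))
records = list(iter_tfrecords(demo))

# With the data in S3, the stream would come from s3fs instead
# (hypothetical bucket name):
#   import s3fs
#   with s3fs.S3FileSystem().open("my-bucket/Train.tfrecord", "rb") as f:
#       for raw in iter_tfrecords(f):
#           ...  # parse each serialized example
```

Any file-like object works as the stream, so the same loop can run inside a Dask task that opens its own S3 handle.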

Related

How do I access files generated during Cloud Run function execution?

I'm running a very simple program that takes screenshots of a page using Selenium in Cloud Run. I know that Cloud Run is stateless and that I cannot access a screenshot after the program finishes executing, but I want to know where and how I can access these files and read them right after the screenshot is taken, so that I can also store a reference to them in my Cloud Storage bucket.
You have several options:
- Store the screenshots locally, then upload them to Cloud Storage (with a script, the client libraries, and so on). A useful refinement is to bundle them into a tar archive (optionally gzipped) so that you upload a single file, which is faster.
- Use the Cloud Run second-generation execution environment and mount a bucket into your Cloud Run instance with GCSFuse. A file written to the mounted directory is then written directly to Cloud Storage. Despite the good tutorial, this solution requires solid container skills.
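The first option (bundle locally, then upload one file) could be sketched roughly as follows. The function names, file names, and the .png extension are assumptions for illustration; `bucket` is expected to be a `google.cloud.storage` bucket object, which is created only in the trailing comment because it needs real credentials.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def bundle_screenshots(src_dir: str, archive_path: str) -> List[str]:
    """Pack every .png in src_dir into one gzip-compressed tar archive."""
    names = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for png in sorted(Path(src_dir).glob("*.png")):
            tar.add(str(png), arcname=png.name)  # store without the directory prefix
            names.append(png.name)
    return names


def upload_archive(bucket, archive_path: str, dest_name: str) -> None:
    """Upload the archive as a single object to the given bucket."""
    bucket.blob(dest_name).upload_from_filename(archive_path)


# Local demo with placeholder files standing in for real screenshots.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.png"):
        Path(tmp, name).write_bytes(b"fake image bytes")
    archive = str(Path(tmp, "screenshots.tar.gz"))
    bundled = bundle_screenshots(tmp, archive)
    with tarfile.open(archive, "r:gz") as tar:
        archived = sorted(tar.getnames())

# In Cloud Run (hypothetical bucket and object names):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-screenshots-bucket")
#   upload_archive(bucket, archive, "runs/screenshots.tar.gz")
```

The object name passed to the upload is also the reference you can store for later lookups.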

How to access images directly from Google Cloud Storage (GCS) when using Keras?

I have developed a model in Keras that works perfectly when reading data stored locally. However, I now want to take advantage of Google Cloud Platform's GPUs for training the model. I have set up the GPU on GCP and am working in a Jupyter notebook. I have moved my images to Google Cloud Storage.
My question is:
How can I access these images (specifically the directories: training, validation, test) directly from Cloud Storage using Keras' flow_from_directory method of the ImageDataGenerator class?
Here's my directory structure in Google Cloud Storage (GCS):
mybucketname/
    class_1/
        img001.jpg
        img002.jpg
        ...
    class_2/
        img001.jpg
        img002.jpg
        ...
    class_3/
        img001.jpg
        img002.jpg
        ...
While I haven't yet figured out a way to read the image data directly from GCS, in the meantime I can copy the files from Cloud Storage to the VM:
import os
os.system('gsutil cp -r gs://mybucketname/ .')

Comparison of uploading files to GCS using Google Drive vs gsutil

I have been comparing two ways of uploading files to cloud storage: one in-browser (or emulating a browser) via Google Drive, and the other from the command line via gsutil to a Google Cloud Storage bucket.
Does Google Drive use gsutil in the backend, or is the uploader a totally customized and proprietary piece of software? Is there a way to achieve upload speeds to a Google Cloud Storage bucket similar to the upload speeds I'm able to achieve via Drive? If not, what would you suggest to get upload speeds to a GCS bucket equivalent to those of Google Drive?
I'm not sure whether GDrive uses gsutil in the background.
There are several optimizations that you can use to improve gsutil speeds.
First of all, you can use perfdiag to run a small set of diagnostic tests that will give you an overview of the speeds you can achieve.
gsutil perfdiag -o test.json gs://<your bucket name>
Secondly, you will need to understand your workload (small or large files) and determine whether you need a regional or multi-regional bucket (yes, there is a performance difference). tl;dr:
"Regional buckets are great for data processing since their physical distance is fairly tight, and the overhead of write consistency is low."
"Multiregional Storage, on the other hand, guarantees 2 replicates which are geo diverse (100 miles apart) which can get better remote latency and availability."
There is some information from Cloud Atlas specifically on this topic, which you can check out here:
https://medium.com/google-cloud/google-cloud-storage-what-bucket-class-for-the-best-performance-5c847ac8f9f2
https://medium.com/google-cloud/google-cloud-storage-large-object-upload-speeds-7339751eaa24?source=user_profile---------12----------------
https://medium.com/@duhroach/optimizing-google-cloud-storage-small-file-upload-performance-ad26530201dc
https://medium.com/@duhroach/google-cloud-storage-performance-4cfcec8bad72
https://cloud.google.com/storage/docs/best-practices

Save and Load models from S3

Is there any way to allow an H2O cluster to save/load directly to S3?
model.save('s3n://my-domain/gbm-from-the-future')
model.load('s3n://my-domain/gbm-from-the-future')
Historically, I have achieved this by:
- Saving to a file-system off of the Cluster
- Syncing with S3
- Downloading from S3
- Loading from the file-system
Obviously, there has to be a better way from the cluster itself.
According to the Python docs for h2o.save_model() this is already supported (you did not mention which of the APIs you are using, so I am using Python as an example). Have you tried putting an S3 address in the file location argument of the standard model save and load functions? If you find that this is not working, please file a bug report on the H2O JIRA.
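A minimal sketch of what that suggestion might look like with the Python API, assuming a running H2O cluster that already has S3 credentials configured. The wrapper name is hypothetical, and whether the S3 location is accepted depends on your H2O version and setup, so treat this as something to try rather than a confirmed recipe.

```python
def save_and_reload(model, s3_dir: str):
    """Save an H2O model under s3_dir, then load it back from the saved path."""
    import h2o  # imported lazily; needs a running H2O cluster to do real work

    # save_model returns the full path of the saved model
    saved_path = h2o.save_model(model=model, path=s3_dir, force=True)
    return h2o.load_model(saved_path)


# Usage, with the location from the question (the URI scheme that works
# may differ in your environment):
#   restored = save_and_reload(gbm, "s3n://my-domain/gbm-from-the-future")
```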

Migrate s3 data to google cloud storage

I have a python web application deployed on Google App Engine.
I need to grab a log file stored on Amazon S3 and load it into Google Cloud Storage. Once it is in Google Cloud Storage I may need to perform some transformations and eventually import the data into BigQuery for analysis.
I tried using gsutil as a sort of proof of concept, since boto is under the hood of gsutil and I'd like to use boto in my project. This did not work.
I'd like to know if anyone has managed to transfer files directly between the two clouds. If possible, I'd like to see a simple example. In the end, this task has to be accomplished through code executing on GAE.
Per this thread, you can stream data from S3 to Google Cloud Storage using gsutil but every byte still has to take two hops: S3 to your local computer and then your computer to GCS. Since you're using App Engine, however, you should be able to pull from S3 and deposit into GCS. It's the same progression as above except App Engine is the intermediary, i.e. every byte travels from S3 to your app and then to GCS. You could use boto for the pull side and the Google Cloud Storage API for the push side.
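The two-hop flow described above could be sketched roughly like this. It is an illustration under assumptions, not a tested App Engine deployment: it uses boto3 rather than the older boto library, stages the object in a temporary file, and every bucket and key name in the trailing comment is hypothetical.

```python
import os
import tempfile


def copy_s3_object_to_gcs(s3_client, gcs_bucket, s3_bucket: str, key: str, dest_name: str) -> None:
    """Pull one object from S3 to a temp file, then push it to a GCS bucket."""
    fd, tmp_path = tempfile.mkstemp()
    os.close(fd)  # only the path is needed; the clients open the file themselves
    try:
        s3_client.download_file(s3_bucket, key, tmp_path)          # hop 1: S3 to app
        gcs_bucket.blob(dest_name).upload_from_filename(tmp_path)  # hop 2: app to GCS
    finally:
        os.remove(tmp_path)  # always clean up the staging file


# With real clients (hypothetical names; both need credentials):
#   import boto3
#   from google.cloud import storage
#   copy_s3_object_to_gcs(
#       boto3.client("s3"),
#       storage.Client().bucket("my-gcs-bucket"),
#       "my-s3-bucket", "logs/app.log", "logs/app.log",
#   )
```

Passing the clients in as arguments keeps credential handling outside the function and makes it easy to swap in stubs when testing.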
Google allows you to import entire buckets from S3 to the storage service:
https://cloud.google.com/storage/transfer/getting-started
You can set file filters on the source bucket to only import the file you want, or a "directory" (i.e. anything with a certain prefix).
I'm not aware of any cloud provider that provides an API for transferring data to a competing cloud provider. Cloud providers have no incentive to help you move your data to the competition. You will almost certainly have to read the data to an intermediate machine then write it to Google.
GCP supports transfers not only from S3 but also from any storage that has an S3-compatible API.
https://cloud.google.com/storage-transfer/docs/create-transfers
https://cloud.google.com/storage-transfer/docs/s3-compatible