I'm trying to train a model on Kaggle and dump TensorBoard logs into a GCS bucket. I'm hesitant to allow anonymous read/write on my project and would like to have TensorFlow use a custom service account with limited quotas for all GCP / gfile.GFile operations. Is there any way to provide TensorFlow with a service account JSON to use?
Is my best bet just security by obscurity?
I am not experienced with Kaggle and I am not sure exactly which limits you want to apply to the service account, but you can follow these steps to set up service account access to Google Cloud Storage while using TensorFlow:
Follow this guide to implement the GCS custom FileSystem in TensorFlow.
Check the Python client library to instantiate the client.
The service account permissions required for storage are listed here.
To grant roles to a service account, follow this guide.
Check the snippet in Federico's post here, based on this documentation, to implement the service account in your Python code.
Snippet:
from google.oauth2 import service_account

# Path to the service account's JSON key file
SERVICE_ACCOUNT_FILE = 'service.json'

# Load credentials for that service account
credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE)
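Building on that snippet, here is a minimal self-contained sketch that passes those credentials to the Cloud Storage client from the Python client library step above; it assumes the google-cloud-storage package is installed, and the bucket and file names are placeholders:
from google.cloud import storage
from google.oauth2 import service_account

SERVICE_ACCOUNT_FILE = 'service.json'
credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE)

# Build the client with the limited service account's credentials
client = storage.Client(credentials=credentials,
                        project=credentials.project_id)

# Example: upload a local TensorBoard event file to a placeholder bucket
bucket = client.bucket('my-tensorboard-logs')
bucket.blob('run1/events.out.tfevents').upload_from_filename(
    'logs/events.out.tfevents')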
If you have service account credentials in a JSON file, you can point the GOOGLE_APPLICATION_CREDENTIALS environment variable at it so that TensorFlow can read/write to GCS via gs:// URLs.
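For example, in a Kaggle notebook you could set the variable before importing TensorFlow; this is a minimal sketch in which the key path and bucket name are placeholders:
import os

# Placeholder path to the service account key; set this before
# TensorFlow initializes its GCS filesystem
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/kaggle/working/service.json'

import tensorflow as tf

# Write a small test file to a placeholder bucket via gs://
with tf.io.gfile.GFile('gs://my_bucket/smoke_test.txt', 'w') as f:
    f.write('hello from Kaggle')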
You can test it by running the following in bash (it downloads a smoke-test script from TensorFlow's repository and runs it against your bucket URL with your credentials):
wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/gcs_test/python/gcs_smoke.py
GOOGLE_APPLICATION_CREDENTIALS=my_credentials.json python gcs_smoke.py --gcs_bucket_url=gs://my_bucket/test_tf
This should create some dummy records in GCS and read from them. After this, you'd want to clean up the remaining temporary outputs to avoid further charges:
gsutil rm -r gs://my_bucket/test_tf
I'm running a very simple program that takes screenshots of a page using Selenium in Cloud Run. I know that Cloud Run is stateless and I cannot access the screenshot after the program finishes executing, but I wanted to know where/how I can access these files right after the screenshot is taken and read them, so I can store a reference to them in my Cloud Storage bucket too.
You have several solutions:
Store the screenshots locally, then upload them to Cloud Storage (you can write a script for that, use the client libraries, ...); see the sketch after this list. An improvement is to create a tar archive (optionally gzipped as well) so you upload only one file, which is faster.
Use the Cloud Run second-generation execution environment and mount a bucket into your Cloud Run instance with GCSFuse. That way, a file written to the mounted directory is written directly to Cloud Storage. Despite the good tutorial, this solution requires solid container skills.
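For the first option, here is a minimal sketch of the upload step with the google-cloud-storage client; the bucket and object names are placeholders, and on Cloud Run the client picks up the service's credentials automatically:
from google.cloud import storage

def upload_screenshot(local_path, bucket_name, destination_name):
    # The client uses the Cloud Run service account by default
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(destination_name)
    blob.upload_from_filename(local_path)
    return f'gs://{bucket_name}/{destination_name}'

# For example, right after driver.save_screenshot('/tmp/page.png'):
# uri = upload_screenshot('/tmp/page.png', 'my-screenshots', 'page.png')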
Currently, I download CSV files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile
Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, how would I make my AWS profile credentials persist?
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use an Azure Data Science Virtual Machine to: 1) install the AWS CLI, 2) create my AWS profile to store the access credentials, and 3) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool that helps with syncing the data; more or less all of them just copy, so I think you will have to implement the sync yourself. A couple of quick thoughts:
#3) You can run this from a batch service and initiate it from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks as well.
#4) ADF does not have any sync logic for deleted files. You can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
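If you do end up implementing the sync yourself, as suggested above, here is a rough one-way sketch using boto3 and azure-storage-blob; the bucket, container, and connection-string values are placeholders, and since it mirrors deletions from the source you should test it carefully:
import boto3
from azure.storage.blob import ContainerClient

# Placeholder names and credentials; adjust for your environment
S3_BUCKET = 'cloud_source'
s3 = boto3.Session(profile_name='aws_profile').client('s3')
container = ContainerClient.from_connection_string(
    '<azure-storage-connection-string>', container_name='landing')

def list_s3_keys():
    keys = set()
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET):
        for obj in page.get('Contents', []):
            keys.add(obj['Key'])
    return keys

def sync():
    s3_keys = list_s3_keys()
    azure_names = {blob.name for blob in container.list_blobs()}

    # Copy objects that are missing on the Azure side
    for key in s3_keys - azure_names:
        body = s3.get_object(Bucket=S3_BUCKET, Key=key)['Body']
        container.upload_blob(name=key, data=body, overwrite=True)

    # Remove blobs whose source object no longer exists (one-way sync)
    for name in azure_names - s3_keys:
        container.delete_blob(name)

if __name__ == '__main__':
    sync()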
According to Microsoft's documentation it is possible to upload a python wheel file so that you can use custom libraries in Synapse Analytics.
Here is that documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries
I have created a simple library with just a hello world function that I was able to install with pip on my own computer. So I know my wheel file works.
I uploaded my wheel file to the location Microsoft's documentation says to upload the file.
I also found a YouTube video of a person doing exactly what I am trying to do.
Here is the video: https://www.youtube.com/watch?v=t4-2i1sPD4U
Microsoft's documentation mentions this, "Custom packages can be added or modified between sessions. However, you will need to wait for the pool and session to restart to see the updated package."
As far as I can tell there is no way to restart a pool, and I also do not know how to tell if the pool is down or has restarted.
When I try to use the library in a notebook I get a module not found error.
Scaling up or down will force the cluster to restart.
Making changes to the Spark pool's scale settings does restart the Spark pool, as HimanshuSinha-msft suggested. That was not my problem, though.
The actual problem was that I needed the Storage Blob Data Contributor role on the Data Lake Storage account where the files were stored. I assumed that because I already had Owner permissions, and because I could create a folder and upload files there, I had all the permissions I needed. Once I was granted the Storage Blob Data Contributor role, everything worked.
As anyone who has ever had the misfortune of having to interact with the panoply of Google CLI binaries programmatically will have realised, authenticating with the likes of gcloud, gsutil, bq, etc. is far from intuitive or trivial, especially when you need to work across different projects.
I am running various cron jobs that interact with Google Cloud Storage and BigQuery for different projects. Since the cron jobs may overlap, renaming config files is clearly not an option, and nor would any sane person take that approach.
There must surely be some sort of method of passing a path to a service account's key pair file to these CLI binaries, but bq help yields nothing.
The Google documentation, while verbose, is largely useless, taking one on a tour of how OAuth2 works, etc, instead of explaining what must surely be a very common requirement, vis-a-vis, how to actually authenticate a service account without running commands that modify central config files.
Can any enlightened being tell me whether the engineers at Google decided to add a feature as simple as passing the path to a service account's key pair file to the likes of gsutil and bq? Or perhaps I could simply export some variable so they know which key pair file to use for authentication?
I realise these simplistic approaches may be an insult to the intelligence, but we aren't concerning ourselves with harnessing nuclear fusion, so we needn't even consider what Amazon got so right with their approach to authentication in comparison...
Configuration in the Cloud SDK is global for the user, but you can specify which aspects of that config to use on a per-command basis. To accomplish what you are trying to do, you can:
gcloud auth activate-service-account foo@developer.gserviceaccount.com --key-file ...
gcloud auth activate-service-account bar@developer.gserviceaccount.com --key-file ...
At this point, both sets of credentials are in your global credentials store.
Now you can run:
gcloud --account foo@developer.gserviceaccount.com some-command
gcloud --account bar@developer.gserviceaccount.com some-command
in parallel, and each will use the given account without interfering.
A larger extension of this is 'configurations' which do the same thing, but for your entire set of config (including settings like account and project).
# Create first configuration
gcloud config configurations create myconfig
gcloud config configurations activate myconfig
gcloud config set account foo@developer.gserviceaccount.com
gcloud config set project foo
# Create second configuration
gcloud config configurations create anotherconfig
gcloud config configurations activate anotherconfig
gcloud config set account bar@developer.gserviceaccount.com
gcloud config set project bar
And you can say which configuration to use on a per command basis.
gcloud --configuration myconfig some-command
gcloud --configuration anotherconfig some-command
You can read more about configurations by running: gcloud topic configurations
All properties have corresponding environment variables that allow you to set that particular property for a single command invocation or for a terminal session. They take the form:
CLOUDSDK_<SECTION>_<PROPERTY>
for example: CLOUDSDK_CORE_ACCOUNT
You can see all the available config settings by running: gcloud help config
The equivalent of the --configuration flag is: CLOUDSDK_ACTIVE_CONFIG_NAME
If you really want complete isolation, you can also change the Cloud SDK's config directory by setting CLOUDSDK_CONFIG to a directory of your choosing. Note that if you do this, the config is completely separate including the credential store, all configurations, logs, etc.
I have a python web application deployed on Google App Engine.
I need to grab a log file stored on Amazon S3 and load it into Google Cloud Storage. Once it is in Google Cloud Storage I may need to perform some transformations and eventually import the data into BigQuery for analysis.
I tried using gsutil as some sort of proof of concept, since boto is under the hood of gsutil and I'd like to use boto in my project. This did not work.
I'd like to know if anyone has managed to transfer files directly between the two clouds. If possible, I'd like to see a simple example. In the end, this task has to be accomplished through code executing on GAE.
Per this thread, you can stream data from S3 to Google Cloud Storage using gsutil but every byte still has to take two hops: S3 to your local computer and then your computer to GCS. Since you're using App Engine, however, you should be able to pull from S3 and deposit into GCS. It's the same progression as above except App Engine is the intermediary, i.e. every byte travels from S3 to your app and then to GCS. You could use boto for the pull side and the Google Cloud Storage API for the push side.
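Here is a rough sketch of that pull-and-push approach, using the newer boto3 and google-cloud-storage clients rather than the original boto; the bucket and object names are placeholders:
import boto3
from google.cloud import storage

s3 = boto3.client('s3')
gcs = storage.Client()

def copy_log(s3_bucket, s3_key, gcs_bucket, gcs_name):
    # Pull the object from S3 as a stream
    body = s3.get_object(Bucket=s3_bucket, Key=s3_key)['Body']
    # Push it into GCS; every byte passes through your app instance
    gcs.bucket(gcs_bucket).blob(gcs_name).upload_from_file(body)

# Placeholder names:
# copy_log('my-s3-logs', 'access.log', 'my-gcs-bucket', 'imported/access.log')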
Google allows you to import entire buckets from S3 to the storage service:
https://cloud.google.com/storage/transfer/getting-started
You can set file filters on the source bucket to import only the files you want, or a "directory" (i.e. anything with a certain prefix).
I'm not aware of any cloud provider that provides an API for transferring data to a competing cloud provider. Cloud providers have no incentive to help you move your data to the competition. You will almost certainly have to read the data to an intermediate machine then write it to Google.
GCP supports not only transfers from S3; it also supports any storage that has an S3-compatible API.
https://cloud.google.com/storage-transfer/docs/create-transfers
https://cloud.google.com/storage-transfer/docs/s3-compatible