I have a python web application deployed on Google App Engine.
I need to grab a log file stored on Amazon S3 and load it into Google Cloud Storage. Once it is in Google Cloud Storage I may need to perform some transformations and eventually import the data into BigQuery for analysis.
I tried using gsutil as a proof of concept, since gsutil uses boto under the hood and I'd like to use boto in my project. This did not work.
I'd like to know if anyone has managed to transfer files directly between the two clouds. If possible, I'd like to see a simple example. In the end, this task has to be accomplished through code executing on GAE.
Per this thread, you can stream data from S3 to Google Cloud Storage using gsutil but every byte still has to take two hops: S3 to your local computer and then your computer to GCS. Since you're using App Engine, however, you should be able to pull from S3 and deposit into GCS. It's the same progression as above except App Engine is the intermediary, i.e. every byte travels from S3 to your app and then to GCS. You could use boto for the pull side and the Google Cloud Storage API for the push side.
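A minimal sketch of that relay, assuming the newer boto3 library for the pull side and the google-cloud-storage client library for the push side (neither is named in the thread; bucket and object names are placeholders). The chunked copy loop keeps memory use bounded, and the cloud libraries are imported lazily so the helper can be tried without them:

```python
def relay(src, dst, chunk_size=1024 * 1024):
    """Copy bytes from a readable object to a writable one in fixed-size chunks."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def s3_to_gcs(s3_bucket, s3_key, gcs_bucket, gcs_name):
    """Stream one S3 object into GCS; every byte passes through this process."""
    # Assumption: boto3 and google-cloud-storage are installed and credentials
    # for both clouds are configured in the environment.
    import boto3
    from google.cloud import storage

    body = boto3.client("s3").get_object(Bucket=s3_bucket, Key=s3_key)["Body"]
    blob = storage.Client().bucket(gcs_bucket).blob(gcs_name)
    with blob.open("wb") as out:
        return relay(body, out)
```

Calling `s3_to_gcs("my-s3-bucket", "logs/app.log", "my-gcs-bucket", "logs/app.log")` would copy a single object; the return value is the number of bytes relayed.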
Google allows you to import entire buckets from S3 to the storage service:
https://cloud.google.com/storage/transfer/getting-started
You can set file filters on the source bucket to only import the file you want, or a "directory" (i.e. anything with a certain prefix).
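As a rough illustration, a one-off transfer job with a prefix filter could be described like this. This is a sketch assuming the Storage Transfer Service REST API (`storagetransfer` v1) called through google-api-python-client; the project ID, bucket names, prefix and AWS keys are all placeholders:

```python
import datetime


def build_transfer_job(project_id, s3_bucket, gcs_bucket, prefix,
                       aws_key_id, aws_secret, run_date):
    """Build the request body for a transfer job that runs once on run_date."""
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "description": f"Copy s3://{s3_bucket}/{prefix} to gs://{gcs_bucket}",
        "status": "ENABLED",
        "projectId": project_id,
        # Same start and end date, so the job runs a single time.
        "schedule": {"scheduleStartDate": day, "scheduleEndDate": day},
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_key_id,
                    "secretAccessKey": aws_secret,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
            # Only objects whose names start with this prefix are copied.
            "objectConditions": {"includePrefixes": [prefix]},
        },
    }


def create_transfer_job(job_body):
    """Submit the job. Assumes google-api-python-client and default credentials."""
    import googleapiclient.discovery

    client = googleapiclient.discovery.build("storagetransfer", "v1")
    return client.transferJobs().create(body=job_body).execute()
```

With this approach the bytes move between the two services without passing through your own application.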
I'm not aware of any cloud provider that provides an API for transferring data to a competing cloud provider. Cloud providers have no incentive to help you move your data to the competition. You will almost certainly have to read the data to an intermediate machine then write it to Google.
GCP supports transfers not only from S3 but also from any storage service that exposes an S3-compatible API.
https://cloud.google.com/storage-transfer/docs/create-transfers
https://cloud.google.com/storage-transfer/docs/s3-compatible
Related
Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, I would like my AWS profile credentials to persist.
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command, so developers like @raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
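A small sketch of how that could be scripted from Python, assuming AzCopy v10 is on the PATH, the destination is already authorized (for example via `azcopy login` or a SAS token appended to the destination URL), and the AWS keys are passed through the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables that AzCopy reads. The bucket, account and container names are placeholders:

```python
import os
import subprocess


def build_azcopy_command(s3_bucket, storage_account, container):
    """Build an `azcopy copy` command that copies a whole S3 bucket."""
    source = f"https://s3.amazonaws.com/{s3_bucket}"
    dest = f"https://{storage_account}.blob.core.windows.net/{container}"
    return ["azcopy", "copy", source, dest, "--recursive=true"]


def run_azcopy(command, aws_key_id, aws_secret):
    """Run AzCopy with the AWS credentials injected into its environment."""
    env = dict(os.environ,
               AWS_ACCESS_KEY_ID=aws_key_id,
               AWS_SECRET_ACCESS_KEY=aws_secret)
    return subprocess.run(command, env=env, check=True)
```

Note that this is a copy, not a sync: files deleted from the S3 source are not removed from the destination.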
I am not aware of any tool that does a true sync of the data; more or less all of them do a copy, so I think you will have to implement the sync logic yourself. A couple of quick thoughts:
#3) You can run this from a batch service, and you can initiate that from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks as well.
#4) ADF does not have any sync logic for files that have been deleted. We can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
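Whichever tool produces the file listings, the one-way sync decision itself is small. A sketch, assuming each side has been listed into a dictionary that maps a file name to some change marker such as an ETag or a last-modified timestamp (the function name and the sample values are made up for illustration):

```python
def plan_sync(source, destination):
    """Return (to_copy, to_delete) for a one-way sync from source to destination.

    Both arguments map file name -> change marker (e.g. ETag or timestamp).
    """
    # Copy files that are new or whose marker differs from the destination's.
    to_copy = sorted(name for name, marker in source.items()
                     if destination.get(name) != marker)
    # Delete files that exist in the destination but no longer in the source.
    to_delete = sorted(name for name in destination if name not in source)
    return to_copy, to_delete


source = {"a.csv": "v1", "b.csv": "v2"}
destination = {"b.csv": "v1", "c.csv": "v1"}
print(plan_sync(source, destination))
# (['a.csv', 'b.csv'], ['c.csv'])
```

The copy list can then be handed to the copy tool, and the delete list applied to the destination.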
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
I need to upload a large number of files to one cloud storage provider and then copy those files to another cloud storage provider using software that I will write. I have looked at several cloud storage providers and I don't see an easy way to do what I need to do unless I first download the files and then upload them to the second storage provider. I want to copy directly using cloud storage provider API's. Any suggestions or links to storage providers that have API's that will allow copying from one provider to another would be most welcome.
There are several options you could choose. The first is a cloud transfer service such as MultCloud. I've been using it to transfer from AWS S3 or Egnyte to Google Drive.
MultCloud https://www.multcloud.com, which is free for up to 30GB of data traffic per month.
Mountain Duck https://mountainduck.io/: if connectors are available for your providers, you can mount each cloud service as a drive and move files between them easily.
I hope this could help.
If you want to write code for it use Google's gsutil :
The gsutil cp command allows you to copy data between your local file
system and the cloud, copy data within the cloud, and copy data
between cloud storage providers.
You will find detailed info in this link :
https://cloud.google.com/storage/docs/gsutil/commands/cp
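For example, a provider-to-provider copy could be driven from Python roughly like this. The sketch assumes gsutil is installed and that AWS credentials are present in its boto configuration file; the bucket names are placeholders. The data still flows through the machine running gsutil:

```python
import subprocess


def build_gsutil_cp(source_url, dest_url, recursive=False):
    """Build a `gsutil cp` command, e.g. from an s3:// URL to a gs:// URL."""
    command = ["gsutil", "cp"]
    if recursive:
        command.append("-r")  # copy a whole "directory" (prefix)
    command += [source_url, dest_url]
    return command


def run_gsutil_cp(source_url, dest_url, recursive=False):
    """Run the copy and raise CalledProcessError if gsutil fails."""
    return subprocess.run(build_gsutil_cp(source_url, dest_url, recursive),
                          check=True)
```

`run_gsutil_cp("s3://my-s3-bucket/logs", "gs://my-gcs-bucket/", recursive=True)` would copy everything under the `logs` prefix.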
If you want ready-made software, use MultCloud. https://www.multcloud.com/
It can download directly from the web, and it can also transfer files from one cloud storage service, such as Dropbox, to another, such as Google Drive.
CloudHQ, which is also available as a Chrome extension, is one of the best solutions for syncing your data between clouds. You can check it out.
I've been having difficulty understanding when to use the s3cmd program rather than the Java API. A vendor has documentation on accessing S3 with s3cmd. It is unclear to me, as the bucket names appear to be dynamic and no region is specified. Additionally, I'm reaching out over an endpoint. I've tried writing some Java code to interact with S3 the same way that s3cmd does, but I haven't been able to connect. Overall, it appears to be quite a bit different.
To me s3cmd seems to be a utility to manipulate these files or quickly get at them. Integrating this utility into a Java program seems meaningless.
Anyone have any resources or can help me understand this better?
S3cmd (s3cmd) is a free command line tool and client for uploading, retrieving and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol, such as Google Cloud Storage or DreamHost DreamObjects. It is best suited for power users who are familiar with command line programs. It is also ideal for batch scripts and automated backup to S3, triggered from cron, etc.
S3cmd is written in Python. It's an open source project available under GNU Public License v2 (GPLv2) and is free for both commercial and private use. You will only have to pay Amazon for using their storage.
Lots of features and options have been added to S3cmd, since its very first release in 2008.... we recently counted more than 60 command line options, including multipart uploads, encryption, incremental backup, s3 sync, ACL and Metadata management, S3 bucket size, bucket policies, and more!
I have been comparing how to upload files to a cloud storage, one is in-browser (or emulating a browser) and the other is command-line via gsutil to a Google Cloud Storage bucket.
Does Google Drive use gsutil in the backend, or is the uploader a totally customized and proprietary piece of software? Is there a way to achieve upload speeds to a Google Cloud Storage bucket similar to the upload speeds I'm able to achieve via Drive? If not, what would you suggest to get upload speeds equivalent to Google Drive's when uploading files to a GCS bucket?
I'm not sure whether Google Drive uses gsutil in the background.
There are several optimizations that you can use to improve gsutil speeds.
First of all, you can use perfdiag to run a small set of diagnostic tests that will give you an overview of the speeds you can expect to achieve.
gsutil perfdiag -o test.json gs://<your bucket name>
Secondly, you will need to understand your workload (small or large files) and decide whether you need a regional or multi-regional bucket (yes, there is a performance difference). tl;dr:
"Regional buckets are great for data processing since their physical distance is fairly tight, and the overhead of write consistency is low."
"Multiregional Storage, on the other hand, guarantees 2 replicates which are geo diverse (100 miles apart) which can get better remote latency and availability."
There is some information in Cloud Atlas specifically on this topic; you can check it out here:
https://medium.com/google-cloud/google-cloud-storage-what-bucket-class-for-the-best-performance-5c847ac8f9f2
https://medium.com/google-cloud/google-cloud-storage-large-object-upload-speeds-7339751eaa24?source=user_profile---------12----------------
https://medium.com/@duhroach/optimizing-google-cloud-storage-small-file-upload-performance-ad26530201dc
https://medium.com/@duhroach/google-cloud-storage-performance-4cfcec8bad72
https://cloud.google.com/storage/docs/best-practices
In my use case, all the data generated by Google-related apps and ads is going to be stored in Google Cloud Storage, but my processing engine runs on Spark in the AWS cloud.
Can someone please help me with how to move this GCS data to S3 for processing?
Thank you in advance.
If you have the Google Cloud Storage connector library on your Spark classpath, your EMR code can simply use gs:// references to access the GCS data remotely. With the right credentials it's accessible from anywhere, including EMR.
You will run up bills though, and have to wait for slower reads and writes.
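A hedged PySpark sketch of that setup follows. It assumes the GCS connector jar is already on the classpath and that authentication uses a service-account JSON key file. The property names below are the ones used by older connector releases and may differ in the version you run, and the paths are placeholders:

```python
def gcs_connector_conf(keyfile_path):
    """Spark settings that register the gs:// scheme (assumed property names)."""
    return {
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
            keyfile_path,
    }


def copy_gcs_to_s3(src, dst, keyfile_path):
    """Read text data from a gs:// path and write it to an s3:// path."""
    # Imported lazily so the configuration helper works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("gcs-to-s3")
    for key, value in gcs_connector_conf(keyfile_path).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    spark.read.text(src).write.mode("overwrite").text(dst)
```

If the data is processed only once, reading it directly from gs:// in the job may be enough; copying it to S3 first, as in `copy_gcs_to_s3`, pays the cross-cloud transfer once instead of on every read.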