Upload files from multiple subfolders with gsutil - gsutil

I need to "move" all the content of a folder, including its subfolders to a bucket in Google Cloud Storage.
The closest way is to use gsutil -rsync, but it clones all the data without moving the files.
I need to move all the data and keep data only in GCP and not in local storage. My local storage is being used only as a pass-thought server (Cause I only have a few GB to store data on local storage)
How can I achieve this?
Is there any way with gsutil?
Thanks!

To move the data to a bucket and reclaim the space on your local disk, you need to use mv command for example:
gsutil mv -r mylocalfolder gs://mybucketname
mv command copies the files to a bucket and delete them after the upload.

Related

Streaming compression to S3 bucket with a custom directory structure

I have got an application that requires to create a compressed file from different objects that are saved on S3. The issue I am facing is I would like to compress objects on the fly without downloading files into a container and do the compression. The reason for that is the size of files can be quite big and I can easily run out of disk space and of course, there will be an extra round trip time of downloading files on disk, compressing them and upload the compressed file to s3 again.
It is worth mentioning that I would like to locate files in the output compressed file in different directories, so when a user decompress the file can see it is stored in different folders.
Since S3 does not have the concept of physical folder structure, I am not sure if this is possible and if there is a better way than download/uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading objects on a local disk and create a zip file and upload to S3. I would like to simply zip the files on S3 on the fly and most importantly being able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I have researched it for about 2 months and try with multiple ways but finally I have to use ECS (EC2) for my use case because of the zip file can be huge like 100GB ....
Currently AWS doesn't support a native way to perform compress. I have talked to them and they are considering it a feature but there is no time line given yet.
If your files is about 3 GB in term of size, you can think of Lambda to achieve your requirement.
If your files is more than 4 GB, I believe it is safe to do it with ECS or EC2 and attach more volume if it requires more space/memory for compression.
Thanks,
Yes, there are at least two ways: either using AWS-Lambda or AWS-EC2
EC2
Since aws-cli has support of cp command, you can pipe S3 file to any archiver using unix-pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
AWS-Lambda
Since maintaining and using EC2 instance just for compressing may be too expensive, you can use Lambda for one-time compressions.
But keep in mind that Lambda has a lifetime limit of 15 minutes. So, if your files really huge try this sequence:
To make sure that file will be compressed, try partial file compression using Lambda
Compressed files could me merged on S3 into one file using Upload Part - Copy

How to import data to Amazon S3 from URL

I have an S3 bucket and the URL of a large file. I would like to store the content located at the URL in the S3 bucket.
I could download the file to my local machine and then upload it to S3 with Cloudberry or Jungledisk or whatever. However, if the file is large, this may take a long time because the file must be transferred twice, and my network connection is much slower than Amazon's.
If I have a lot of data to store in S3, I can start an EC2 instance, retrieve the files to the instance with curl or wget, and then push the data from the EC2 instance to S3. This works, but it's a lot of steps if I just want to archive one file.
Any suggestions?
You can stream the file directly from the source to S3.
If you are using node, you can use streaming-s3.

cp command with Key for each file

I have a folder structure like
Test
Test2
Test2-1.jpg
Test2-2.png
Using cp command I am able to copy the local structure to the S3 bucket. But I have a server configured to access files like this in a bucket
Test2/Test2-1.jpg, because I have copied it using cp command from a local directory, I can't set the key to Test2/Test2-1.jpg.
Before I was copying each and every file manually through Boto API by setting Key manually. That worked but its very long process.
Is there any way I can achieve this using cp command?
EDIT:
The actual issue causing the problem was content-encoding gzip. I was passing this encoding for not gz file. Due to that, the file is not stored properly and accessible.
If you are in a directory with Test2-1.jpg in it you can copy it to yourbucket/Test2/Test2-1.jpg by running
aws s3 cp ./Test2-1.jpg s3://yourbucket/Test2/Test2-1.jpg
You can copy an entire directory by using the sync command
aws s3 sync . s3://yourbucket/Test2/

gsutil + rsync remove synced files? (no --remove-source-files flag?)

Is there a way to make gsutil rsync remove synced files?
As far as I know, normally it is done by passing --remove-source-files, but it does not seem to be an option with gsutil rsync (documentation).
Context:
I have a script that produces a large amount of CSV files (100GB+) I want those files to be transferred to Cloud Storage (and once transferred to be removed from my HDD).
Ended up using gcsfuse.
Per documentation:
Local storage: Objects that are new or modified will be stored in
their entirety in a local temporary file until they are closed or
synced.
One work-around for small buckets is delete all bucket contents and re-sync periodically.

Exporting data from Google Cloud Storage to Amazon S3

I would like to transfer data from a table in BigQuery, into another one in Redshift.
My planned data flow is as follows:
BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift
I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From Google Cloud documentation:
Cloud Storage Transfer Service
This page describes Cloud Storage Transfer Service, which you can use
to quickly import online data into Google Cloud Storage.
I understand that this service can be used to import data into Google Cloud Storage and not to export from it.
Is there a way I can export data from Google Cloud Storage to Amazon S3?
You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS to your S3 bucket.
Go to any instance or cloud shell in GCP
First of all configure your AWS credentials in your GCP
aws configure
if this is not recognising the install AWS CLI follow this guide https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
follow this URL for AWS configure
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Attaching my screenshot
Then using gsutil
gsutil -m rsync -rd gs://storagename s3://bucketname
16GB data transferred in some minutes
Using Rclone (https://rclone.org/).
Rclone is a command line program to sync files and directories to and from
Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
Using the gsutil tool we can do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects. Moving, copying, and renaming objects.
we can copy data from a Google Cloud Storage bucket to an amazon s3 bucket using gsutil rsync and gsutil cp operations. whereas
gsutil rsync collects all metadata from the bucket and syncs the data to s3
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket
gsutil cp copies the files one by one and as the transfer rate is good it copies 1 GB in 1 minute approximately.
gsutil cp gs://<gcs-bucket> s3://<s3-bucket-name>
if you have a large number of files with high data volume then use this bash script and run it in the background with multiple threads using the screen command in amazon or GCP instance with AWS credentials configured and GCP auth verified.
Before running the script list all the files and redirect to a file and read the file as input in the script to copy the file
gsutil ls gs://<gcs-bucket> > file_list_part.out
Bash script:
#!/bin/bash
echo "start processing"
input="file_list_part.out"
while IFS= read -r line
do
command="gsutil cp ${line} s3://<bucket-name>"
echo "command :: $command :: $now"
eval $command
retVal=$?
if [ $retVal -ne 0 ]; then
echo "Error copying file"
exit 1
fi
echo "Copy completed successfully"
done < "$input"
echo "completed processing"
execute the Bash script and write the output to a log file to check the progress of completed and failed files.
bash file_copy.sh > /root/logs/file_copy.log 2>&1
I needed to transfer 2TB of data from Google Cloud Storage bucket to Amazon S3 bucket.
For the task, I created the Google Compute Engine of V8CPU (30 GB).
Allow Login using SSH on the Compute Engine.
Once logedin create and empty .boto configuration file to add AWS credential information. Added AWS credentials by taking the reference from the mentioned link.
Then run the command:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
The data transfer rate is ~1GB/s.
Hope this help.
(Do not forget to terminate the compute instance once the job is done)
For large amounts of large files (100MB+) you might get issues with broken pipes and other annoyances, probably due to multipart upload requirement (as Pathead mentioned).
For that case you're left with simple downloading all files to your machine and uploading them back. Depending on your connection and data amount, it might be more effective to create VM instance to utilize high-speed connection and ability to run it in the background on different machine than yours.
Create VM machine (make sure the service account has access to your buckets), connect via SSH and install AWS CLI (apt install awscli) and configure the access to S3 (aws configure).
Run these two lines, or make it a bash script, if you have many buckets to copy.
gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"
(It's better to use rsync in general, but cp was faster for me)
Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so will have poor performance for large files.
Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:
skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/
(disclaimer: I am a contributor)