I would like a pipeline setup that I can run manually. The idea here is that it deletes a single file held within an AWS S3 account. I know technically there are many ways to do this, but what is best practice?
Thank you!
You can use the AWS CLI task and add it to your pipeline to delete a single file stored in an AWS S3 bucket.
You can follow the steps below:
1. Create a service connection before adding an AWS CLI task to the pipeline.
Create AWS service connection
2. Add the AWS CLI task to the pipeline and configure the required parameters. For the meaning of the AWS CLI parameters, you can refer to this document:
Command structure in the AWS CLI
The command structure is like:
aws <command> <subcommand> [options and parameters]
In this example, you can use the command below to delete a single s3 file:
aws s3 rm s3://BUCKET_NAME/uploads/file_name.jpg
Here, s3://BUCKET_NAME/uploads/file_name.jpg is the path of the file stored in S3.
AWS CLI in pipeline
3. Run the pipeline and the single S3 file will be deleted.
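If you want to sanity-check the operation outside the pipeline first, the task boils down to the same CLI calls; a minimal sketch using the placeholder path from above (run it wherever your AWS credentials are configured):
# optional: list the folder to confirm the object is there
aws s3 ls s3://BUCKET_NAME/uploads/
# delete the single object
aws s3 rm s3://BUCKET_NAME/uploads/file_name.jpg
# optional: list again to confirm the object is gone
aws s3 ls s3://BUCKET_NAME/uploads/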
I know that I can use aws cloudformation create-stack or aws cloudformation update-stack with --template-url switch to point to an existing template placed in S3 Bucket.
I would like to use aws cloudformation deploy, the same command for both creating and updating a CloudFormation stack for which I placed the template already in my S3 Bucket. Is it possible with any combination of the options?
The following syntax works, but it first uploads the template to the S3 Bucket:
aws cloudformation deploy \
--stack-name my-stack \
--template-file my-stack-template.yaml \
--s3-bucket my-bucket \
--s3-prefix templates \
--profile my-profile \
--region us-east-1
It first uploads the template my-stack-template.yaml as something like 1a381d4c65d9a3233450e92588a708b38.template in my-bucket/templates, which I do not want. I would like to be able to deploy the stack through this method using the template already placed in the S3 Bucket, without needing it on my local computer.
Sadly there is no such way. The only case in which the template is not re-uploaded is when there are no changes to deploy. You have to use create-stack (or update-stack) with --template-url if you want to use a pre-existing template in S3.
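For reference, a minimal sketch of the create-stack / update-stack route mentioned in the question, reusing the stack name, bucket, profile and region from the example above (the template key under templates/ is an assumption):
# create the stack from the template that already sits in S3
aws cloudformation create-stack \
  --stack-name my-stack \
  --template-url https://my-bucket.s3.amazonaws.com/templates/my-stack-template.yaml \
  --profile my-profile \
  --region us-east-1

# for later changes, point update-stack at the same (or an updated) S3 object
aws cloudformation update-stack \
  --stack-name my-stack \
  --template-url https://my-bucket.s3.amazonaws.com/templates/my-stack-template.yaml \
  --profile my-profile \
  --region us-east-1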
When I use gsutil to connect to my bucket on Google Cloud Storage, I usually use the following command:
gcloud auth activate-service-account --key-file="pathKeyFile"
What should I do if two scripts that are running on the same machine at the same time need two different Service Accounts?
I would like to use a command such as:
gsutil ls mybucket --key-file="mykeyspath"
I ask because if my script is running and another script changes the currently active Service Account, my script would no longer have permission to access the bucket.
You can do this with a BOTO file. You can create one as explained in the documentation.
Then you can specify which file to use when you run your gsutil command (here is an example on Linux):
# if you have several gsutil commands to run
export BOTO_CONFIG=/path/to/.botoMyBucket
gsutil ls myBucket
# For only one command, you can define an env var inline like this
BOTO_CONFIG=/path/to/.botoMyBucket2 gsutil ls myBucket2
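As a sketch of what the two BOTO files could contain, assuming JSON service-account keys (all paths and bucket names below are placeholders): each file points gsutil at a different key via gs_service_key_file, so two scripts can run at the same time with different Service Accounts.
# one boto config per service account
cat > /path/to/.botoMyBucket <<'EOF'
[Credentials]
gs_service_key_file = /path/to/serviceAccountA.json
EOF

cat > /path/to/.botoMyBucket2 <<'EOF'
[Credentials]
gs_service_key_file = /path/to/serviceAccountB.json
EOF

# each script selects its own credentials without touching the gcloud active account
BOTO_CONFIG=/path/to/.botoMyBucket  gsutil ls gs://myBucket
BOTO_CONFIG=/path/to/.botoMyBucket2 gsutil ls gs://myBucket2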
I am thinking of building a work flow as follows:
I have an application that writes almost 1000 CSV files to a folder MY_DIRECTORY in the S3 bucket MY_BUCKET. Now I would like to parse those files from the S3 bucket and load them into a MySQL database using Apache Airflow.
From reading several posts here: Airflow S3KeySensor - How to make it continue running and Airflow s3 connection using UI, I think it would be best to trigger my Airflow DAG using an AWS Lambda function that is called as soon as a file lands in the S3 folder.
Being new to Airflow and Lambda, I don't see how to set up the Lambda to trigger the Airflow DAG. If anyone could give some pointers, it would be really helpful. Thanks.
Create the DAG that you want to trigger, then take advantage of the experimental REST APIs offered by Airflow.
You can read about them here: https://airflow.apache.org/docs/stable/api.html
In particular you want to use the following endpoint:
POST /api/experimental/dags/<DAG_ID>/dag_runs
You can pass the name of the DAG in the URL to trigger it correctly. Moreover, you can explicitly pass the name of the file the DAG will have to process:
curl -X POST \
http://localhost:8080/api/experimental/dags/<DAG_ID>/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"FILE_TO_PROCESS\":\"value\"}"}'
Then use a Hook within the DAG to read the file that you specified.
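From the Lambda you would issue the equivalent HTTP POST in whatever runtime you pick; here is the same call from the shell with a hypothetical DAG id and S3 key filled in. Inside the DAG the value should then be available via dag_run.conf and can be handed to an S3 hook.
# hypothetical values
DAG_ID="s3_to_mysql"
S3_KEY="MY_DIRECTORY/data_001.csv"

curl -X POST \
  "http://localhost:8080/api/experimental/dags/${DAG_ID}/dag_runs" \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d "{\"conf\":\"{\\\"FILE_TO_PROCESS\\\":\\\"${S3_KEY}\\\"}\"}"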
I would like to transfer data from a table in BigQuery, into another one in Redshift.
My planned data flow is as follows:
BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift
I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From Google Cloud documentation:
Cloud Storage Transfer Service
This page describes Cloud Storage Transfer Service, which you can use
to quickly import online data into Google Cloud Storage.
I understand that this service can be used to import data into Google Cloud Storage and not to export from it.
Is there a way I can export data from Google Cloud Storage to Amazon S3?
You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS to your S3 bucket.
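Note that gsutil needs AWS credentials to write to the S3 bucket; they belong in the [Credentials] section of your boto config. If your ~/.boto does not have that section yet, one way to add it (the key values below are placeholders):
# add your AWS keys to the boto config so gsutil can talk to S3
cat >> ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY
EOF

gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket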
Go to any instance or Cloud Shell in GCP.
First of all, configure your AWS credentials on your GCP instance:
aws configure
If the command is not recognized, install the AWS CLI by following this guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
Follow this URL for aws configure:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Then, using gsutil:
gsutil -m rsync -rd gs://storagename s3://bucketname
16 GB of data transferred in a few minutes.
Another option is Rclone (https://rclone.org/); a minimal example follows the list below.
Rclone is a command-line program to sync files and directories to and from:
Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
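A minimal sketch with Rclone, assuming you have already created two remotes with rclone config, named here (arbitrarily) gcs and s3:
# sync the GCS bucket into the S3 bucket (remote and bucket names are placeholders)
rclone sync gcs:your-gcs-bucket s3:your-s3-bucket --progress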
Using the gsutil tool we can do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects.
Moving, copying, and renaming objects.
We can copy data from a Google Cloud Storage bucket to an Amazon S3 bucket using the gsutil rsync and gsutil cp operations, where:
gsutil rsync collects all metadata from the bucket and syncs the data to S3:
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket
gsutil cp copies the files one by one; since the transfer rate is good, it copies roughly 1 GB per minute:
gsutil cp gs://<gcs-bucket> s3://<s3-bucket-name>
If you have a large number of files with a high data volume, use the bash script below and run it in the background across multiple sessions using the screen command, on an Amazon or GCP instance with AWS credentials configured and GCP auth verified.
Before running the script, list all the files, redirect the listing to a file, and have the script read that file as its input:
gsutil ls gs://<gcs-bucket> > file_list_part.out
Bash script:
#!/bin/bash
echo "start processing"
# listing of objects to copy; defaults to file_list_part.out, or pass a
# different listing as the first argument to run several copies in parallel
input="${1:-file_list_part.out}"
while IFS= read -r line
do
  # copy one object from GCS to the S3 bucket (replace <bucket-name>)
  command="gsutil cp ${line} s3://<bucket-name>"
  echo "command :: ${command} :: $(date)"
  eval "${command}"
  retVal=$?
  if [ ${retVal} -ne 0 ]; then
    echo "Error copying file"
    exit 1
  fi
  echo "Copy completed successfully"
done < "${input}"
echo "completed processing"
Execute the bash script and write the output to a log file to track completed and failed files:
bash file_copy.sh > /root/logs/file_copy.log 2>&1
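To run it in the background across multiple sessions with screen, as suggested above, one option is to split the listing into chunks and start one detached session per chunk. This assumes the script accepts the listing file as its first argument, as in the version shown above; the chunk and session names are hypothetical.
# split the listing into 4 chunks without breaking lines (GNU split)
split -n l/4 file_list_part.out part_
# one detached screen session per chunk, each with its own log file
for f in part_*; do
  screen -dmS "copy_${f}" bash -c "bash file_copy.sh ${f} > /root/logs/${f}.log 2>&1"
done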
I needed to transfer 2TB of data from Google Cloud Storage bucket to Amazon S3 bucket.
For the task, I created an 8-vCPU (30 GB RAM) Google Compute Engine instance.
Allow login via SSH on the Compute Engine instance.
Once logged in, create an empty .boto configuration file to hold the AWS credential information, and add the AWS credentials by following the referenced link.
Then run the command:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
The data transfer rate is ~1GB/s.
Hope this helps.
(Do not forget to terminate the compute instance once the job is done)
For large numbers of large files (100 MB+) you might get issues with broken pipes and other annoyances, probably due to the multipart upload requirement (as Pathead mentioned).
In that case you're left with simply downloading all the files to your machine and uploading them back. Depending on your connection and data volume, it might be more efficient to create a VM instance to take advantage of a high-speed connection and the ability to run the transfer in the background on a machine other than yours.
Create the VM (make sure the service account has access to your buckets), connect via SSH, install the AWS CLI (apt install awscli), and configure access to S3 (aws configure).
Run these two lines, or turn them into a bash script if you have many buckets to copy (a sketch of such a script follows below).
gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"
(It's better to use rsync in general, but cp was faster for me)
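A sketch of such a script, assuming the bucket names are passed as arguments and that the GCS and S3 buckets share the same names (adjust as needed):
#!/bin/bash
# copy each bucket given on the command line from GCS to S3 via the local disk
set -euo pipefail

for bucket in "$@"; do
  gsutil -m cp -r "gs://${bucket}" ./                    # download from GCS
  aws s3 cp --recursive "./${bucket}" "s3://${bucket}"   # upload to S3
  rm -rf "./${bucket}"                                   # free local disk space
done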
Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so they will have poor performance for large files.
Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:
skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/
(disclaimer: I am a contributor)