Download/Copy tar.gz File from S3 to EC2 - amazon-s3

When I download a tar.gz file from AWS S3 and then try to untar it, I get the following error:
tar -xzvf filename_backup_jan212021_01.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
When I check what type of file it is, I get this:
file filename_backup_jan212021_01.tar.gz
filename_backup_jan212021_01.tar.gz: ASCII text
This is the command I am using to copy the file from S3 to my EC2:
aws s3 cp s3://bucket_name/filename_backup_jan212021_01.tar.gz .
Please help me find a way to extract the tar.gz file after downloading it from AWS S3.

tar -xzvf filename_backup_jan212021_01.tar.gz
gzip: stdin: not in gzip format
file filename_backup_jan212021_01.tar.gz
filename_backup_jan212021_01.tar.gz: ASCII text
cat filename_backup_jan212021_01.tar.gz
/home/ec2-user/file_delete_01.txt
/home/ec2-user/file_jan2021.txt
/home/ec2-user/filename_backup_jan1.tar.gz
/home/ec2-user/filename_backup_jan1.txt
/home/ec2-user/filename_backup_jan2.tar.gz
/home/ec2-user/filename_backup_jan3.tar.gz
All of these indicate that the file uploaded to S3 is itself not a gzip'd tar file, but rather a plain text file uploaded with a .tar.gz filename. While filenames and extensions are used to indicate content type to humans, computers think otherwise :)
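For comparison, running file on a genuine gzip'd tar archive (the filename below is only an example) would report something like:
file real_backup.tar.gz
real_backup.tar.gz: gzip compressed data, ...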
You can create the archive and upload it to S3 with
tar cvzf <archive name> </path/to/files/to/be/tarred> && aws s3 cp <archive name> s3://<bucket path>
(note that the local archive name comes first and the S3 destination second), and then use the commands you mention in the question to download it. Of course, replace the placeholders with the proper names and paths.
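For example, using the bucket and archive names from the question (the source directory /path/to/data is only an illustration):
# on the machine that holds the data: create the archive, then upload it
tar cvzf filename_backup_jan212021_01.tar.gz /path/to/data
aws s3 cp filename_backup_jan212021_01.tar.gz s3://bucket_name/
# on the EC2 instance: download and extract
aws s3 cp s3://bucket_name/filename_backup_jan212021_01.tar.gz .
tar -xzvf filename_backup_jan212021_01.tar.gz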

Related

Google Colab: How do I unpack a file that's uploaded on Google Drive?

I have a file stored in the Bioinformatics project folder of my Google Drive. I want to load and unpack this file, gbm_tcga_pub2013.tar.gz.
from google.colab import drive
drive.mount('/content/gdrive')
path='/content/gdrive/MyDrive/Bioinformatics project'
!tar -xvzf path/gbm_tcga_pub2013.tar.gz
Traceback:
tar: path/gbm_tcga_pub2013.tar.gz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
tar is a bash command, while path is a Python variable. There might be a way of mixing them, but this is not it. As written, tar looks for a file literally called path/gbm_tcga_pub2013.tar.gz, which obviously does not exist. The following should work (the quotes are needed because the path contains a space):
!tar -xvzf "/content/gdrive/MyDrive/Bioinformatics project/gbm_tcga_pub2013.tar.gz"
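If you do want to reuse the Python variable, IPython's curly-brace interpolation inside a shell command is one option (a sketch, untested in this exact setup); again, the quotes matter because the path contains a space:
!tar -xvzf "{path}/gbm_tcga_pub2013.tar.gz"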

How to download S3-Bucket, compress on the fly and reupload to another s3 bucket without downloading locally?

I want to download the contents of an S3 bucket (hosted on Wasabi, which claims to be fully S3 compatible) to my VPS, tar, gzip and gpg it, and reupload the archive to another S3 bucket on Wasabi!
My VPS only has 30 GB of storage, and the whole bucket is about 1000 GB in size, so I need to download, archive, encrypt and reupload all of it on the fly, without storing the data locally.
The secret seems to be in using a | pipe. But I am stuck even at the first step of downloading the bucket into a local archive (I want to go step by step):
s3cmd sync s3://mybucket | tar cvz archive.tar.gz -
In my mind at the end I expect some code like this:
s3cmd sync s3://mybucket | tar cvz | gpg --passphrase secretpassword | s3cmd put s3://theotherbucket/archive.tar.gz.gpg
but it's not working so far!
What am I missing?
The aws s3 sync command copies multiple files to the destination. It does not copy to stdout.
You could use aws s3 cp s3://mybucket - (including the dash at the end) to copy the contents of the file to stdout.
From cp — AWS CLI Command Reference:
The following cp command downloads an S3 object locally as a stream to standard output. Downloading as a stream is not currently compatible with the --recursive parameter:
aws s3 cp s3://mybucket/stream.txt -
This will only work for a single file.
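Putting that together for a single object, a pipeline along these lines should work (a sketch; the object keys are placeholders, and recent GnuPG versions may also need --pinentry-mode loopback to accept a batch passphrase):
aws s3 cp s3://mybucket/bigfile - | gzip | gpg --batch --symmetric --passphrase secretpassword -o - | aws s3 cp - s3://theotherbucket/bigfile.gz.gpg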
You may try https://github.com/kahing/goofys. I guess in your case the flow could be along these lines:
$ goofys source-s3-bucket-name /mnt/src
$ goofys destination-s3-bucket-name /mnt/dst
$ tar -cvzf - /mnt/src | gpg -e -o /mnt/dst/archive.tgz.gpg

What are the correct flags to rsync to locally mounted S3 bucket?

I have an S3 bucket mounted locally at /mnt/s3 using s3fs.
I can manually cp -r /my-dir/. /mnt/s3, and the file testfile.txt in /mnt/s3 will be overwritten as expected, without error.
However, when using rsync to do this, I get errors about unlinking and copying if the file already exists in the bucket. (If a file of the same name does not exist in the bucket, it's copied properly, without any errors.)
$ rsync -vr --temp-dir=/tmp/rsync /my-dir/. /mnt/s3
sending incremental file list
testfile.txt
rsync: unlink "/mnt/s3/testfile.txt": Operation not permitted (1)
rsync: copy "/tmp/rsync/testfile.txt.Kkyy5n" -> "testfile.txt": Operation not permitted (1)
sent 274 bytes received 428 bytes 1,404.00 bytes/sec
total size is 95 speedup is 0.14
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
I'm using --temp-dir because otherwise rsync was copying temporary files into /mnt/s3 and trying to rename them to their permanent names. However, rsync failed to rename them, and also failed to delete the temporary files, resulting in improperly copied files and lots of clutter in the S3 bucket.
You may want to try the rsync --inplace flag (instead of the --temp-dir workaround), as per the write-up here:
https://baldnerd.com/preventing-rsync-from-doubling-or-even-tripling-your-s3-fees/
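Applied to the command from the question, that would look something like this (untested against s3fs):
rsync -vr --inplace /my-dir/. /mnt/s3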

Compare (not sync) the contents of a local folder and an AWS S3 bucket

I need to compare the contents of a local folder with an AWS S3 bucket so that, where there are differences, a script is executed on the local files.
The idea is that local files (pictures) get encrypted and uploaded to S3. Once the upload has occurred, I delete the encrypted copies of the pictures to save space. The next day new files get added to the local folder. I need to check, between the local folder and the S3 bucket, which pictures have already been encrypted and uploaded, so that I only encrypt the newly added pictures rather than all of them all over again. I have a script that does exactly this between two local folders, but I'm struggling to adapt it so that the comparison is performed between a local folder and an S3 bucket.
Thank you to anyone who can help.
Here is the actual script I am currently using for my picture sorting, encryption and back up to S3:
#!/bin/bash
perl /volume1/Synology/scripts/Exiftool/exiftool '-createdate
perl /volume1/Synology/scripts/Exiftool/exiftool '-model=camera model missing' -r -if '(not $model)' -overwrite_original -r /volume1/photo/"input"/ --ext .DS_Store -i "@eaDir"
perl /volume1/Synology/scripts/Exiftool/exiftool '-Directory
cd /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/ && (cd /volume1/Synology/Pictures/Pictures/post_2016/; find . -type d ! -name .) | xargs -i mkdir -p "{}"
while IFS= read -r file; do /usr/bin/gpg --encrypt -r xxx@yyy.com /volume1/Synology/Pictures/Pictures/post_2016/**///$(basename "$file" .gpg); done < <(comm -23 <(find /volume1/Synology/Pictures/Pictures/post_2016 -type f -printf '%f.gpg\n'|sort) <(find /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016 -type f -printf '%f\n'|sort))
rsync -zarv --exclude=@eaDir --include="*/" --include="*.gpg" --exclude="*" /volume1/Synology/Pictures/Pictures/post_2016/ /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/
find /volume1/Synology/Pictures/Pictures/post_2016/ -name "*.gpg" -type f -delete
/usr/bin/aws s3 sync /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/ s3://xyz/Pictures/post_2016/ --exclude "*" --include "*.gpg" --sse
It would be inefficient to continually compare the local and remote folders, especially as the quantity of objects increases.
A better flow would be:
Unencrypted files are added to a local folder
Each file is:
Copied to another folder in an encrypted state
Once that action is confirmed, the original file is then deleted
Files in the encrypted local folder are copied to S3
Once that action is confirmed, the source file is then deleted
The AWS Command-Line Interface (CLI) has an aws s3 sync command that makes it easy to copy new/modified files to an Amazon S3 bucket, but this could be slow if you have thousands of files.
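A minimal sketch of that flow, assuming gpg and the AWS CLI are installed (all paths, the recipient address and the bucket name are placeholders):
#!/bin/bash
SRC=/volume1/photo/incoming        # unencrypted files arrive here
ENC=/volume1/photo/encrypted       # staging folder for encrypted copies
# Encrypt each new file into the staging folder, then remove the original
for f in "$SRC"/*; do
    [ -f "$f" ] || continue
    gpg --encrypt -r you@example.com -o "$ENC/$(basename "$f").gpg" "$f" && rm -- "$f"
done
# Upload the encrypted copies, then delete them locally once the sync succeeds
aws s3 sync "$ENC" s3://your-bucket/Pictures/ --sse && \
  find "$ENC" -type f -name "*.gpg" -delete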

403 (Forbidden) when doing `s3cmd get` but `s3cmd ls` works

I've set up s3cmd and can successfully put an encrypted file on S3 using it by doing this:
$ s3cmd put --encrypt --config=/home/phil/.s3cfg s3://my-bucket-name/my-dir/my-filename
The file is put there OK.
But when I try to get the file, this happens:
$ s3cmd get --config=/home/phil/.s3cfg --verbose s3://my-bucket-name/my-dir/my-filename
INFO: Applying --exclude/--include
INFO: Summary: 1 remote files to download
s3://my-bucket-name/my-dir/my-filename -> ./my-filename [1 of 1]
ERROR: S3 error: 403 (Forbidden):
The file my-filename is created, but with 0 bytes.
I can do s3cmd ls and get a directory listing, so I can access the bucket and directory.
Why can I not get the file? I must have missed something...