Resuming interrupted s3 download with awscli - amazon-s3

I was downloading a file using awscli:
$ aws s3 cp s3://mybucket/myfile myfile
But the download was interrupted (computer went to sleep). How can I continue the download? S3 supports the Range header, but awscli s3 cp doesn't let me specify it.
The file is not publicly accessible so I can't use curl to specify the header manually.

There is a "hidden" command in the awscli tool which allows lower level access to S3: s3api.† It is less user friendly (no s3:// URLs and no progress bar) but it does support the range specifier on get-object:
--range (string) Downloads the specified range bytes of an object. For
more information about the HTTP range header, go to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
Here's how to continue the download:
$ size=$(stat -f%z myfile) # macOS/BSD stat; with GNU stat (Linux) use: stat -c%s myfile
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>>myfile
You can use pv for a rudimentary progress bar:
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>&1 >&2 | pv >> myfile
(The reason for this unnamed pipe rigmarole is that s3api writes a debug message to stdout at the end of the operation, polluting your file. This solution rebinds stdout to stderr and frees up the pipe for regular file contents through an alias. The version without pv could technically write to stderr (/dev/fd/2 and 2>), but if an error occurs s3api writes to stderr, which would then get appended to your file. Thus, it is safer to use a dedicated pipe there, as well.)
† In git speak, s3 is porcelain, and s3api is plumbing.
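The size lookup above depends on which stat you have. As a minimal sketch (the helper names partial_size and resume_range are hypothetical, not part of awscli), wc -c can compute the offset portably and build the --range value:

```shell
# Print the byte count of a partial download (0 if the file does not exist yet).
partial_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d '[:space:]'
    else
        printf '0'
    fi
}

# Build the value for --range so the download resumes after the last byte on disk.
resume_range() {
    printf 'bytes=%s-' "$(partial_size "$1")"
}

# Usage, with the bucket and key from the answer above:
# aws s3api get-object --bucket mybucket --key myfile \
#     --range "$(resume_range myfile)" /dev/fd/3 3>>myfile
```

If the local file is already complete, the requested range starts past the end of the object, so the request should be expected to fail rather than download anything.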

Use s3cmd; it has a --continue option built in. Example:
# Start a download
> s3cmd get s3://yourbucket/yourfile ./
download: 's3://yourbucket/yourfile' -> './yourfile' [1 of 1]
123456789 of 987654321 12.5% in 235s 0.5 MB/s
[ctrl-c] interrupt
# Pick up where you left off
> s3cmd --continue get s3://yourbucket/yourfile ./
Note that s3cmd is not multithreaded whereas awscli is, i.e. awscli is faster. A currently maintained fork of s3cmd called s4cmd appears to provide multi-threaded capabilities while maintaining the usability features of s3cmd:
https://github.com/bloomreach/s4cmd

Related

AWS CLI for S3 Select

I have the following code, which is used to run a SQL query on a keyfile, located in a S3 bucket. This runs perfectly. My question is, I do not wish to have the output written over to an output file. Could I see the output on the screen (my preference #1)? If not, what about an ability to append to the output file, rather than over-write it (my preference #2). I am using the AWS-CLI binaries to run this query. If there is another way, I am happy to try (as long as it is within bash)
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' "OutputFile"
Of course, you can use AWS CLI to do this since stdout is just a special file in linux.
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
Note the /dev/stdout at the end.
The AWS CLI does not offer such options.
However, you are welcome to instead call it via an AWS SDK of your choice.
For example, in the boto3 Python SDK, there is a select_object_content() function that returns the data as a stream. You can then read, manipulate, print or save it however you wish.
I think it opens /dev/stdout twice, causing chaos.
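For the two preferences in the question (see the output on screen, or append instead of overwrite), one possible approach is to let the CLI write to /dev/stdout and pipe that into tee -a. This is only a sketch: show_and_append is a hypothetical helper and results.csv is a made-up file name. A pipe is used instead of a >> redirection on the assumption that the CLI opens the output path itself and could truncate a regular file.

```shell
# Copy stdin to the terminal and append it to the given file (never truncates).
show_and_append() {
    tee -a "$1"
}

# Hypothetical usage with the query from the question:
# aws s3api select-object-content \
#     --bucket "project2" \
#     --key keyfile1 \
#     --expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
#     --expression-type 'SQL' \
#     --input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
#     --output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout \
#     | show_and_append results.csv
```

Anything else the CLI prints to stdout would end up in the same stream, so the appended file is worth checking after the first run.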

Scaleway GLACIER class object storage with restic

Scaleway recently launched GLACIER class storage "C14 Cold Storage Class"
They have a great plan of 75GB free and I'd like to take advantage of this using the restic backup tool.
To get this working I have successfully followed the S3 instructions for repository creation and uploading, with one caveat. I can not successfully pass the storage-class header as GLACIER.
Using awscliv2, I can successfully pass a header that looks very much like this from my local machine: aws s3 cp object s3://bucket/ --storage-class GLACIER
But with restic, having dug through some GitHub issues, I can see an option to pass a -o flag. The resolution of the linked issues is not that clear to me, so I have tried the following restic commands, without ever seeing the "GLACIER" storage class label next to the objects in the Scaleway bucket console:
restic -r s3:s3.fr-par.scw.cloud/restic-testing -o GLACIER --verbose backup ~/test.txt
restic -r s3:s3.fr-par.scw.cloud/restic-testing -o storage-class=GLACIER --verbose backup ~/test.txt
Can someone suggest another option?
I'm starting to use C14's GLACIER storage class with restic, and so far it seems to be working very well.
I suggest creating the repository in the usual way with restic -r s3:s3.fr-par.scw.cloud/test-bucket init, which will create the config file and keys in the STANDARD storage class.
For backups, I'm using the command:
$ restic backup -r s3:s3.fr-par.scw.cloud/test-bucket -o s3.storage-class=GLACIER --host host /path
similar to what you did, except that the option is s3.storage-class and not storage-class.
This way, files in the data and snapshots directories are in the GLACIER storage class, and you can add backups with no problem.
I can also mount the repository while the data is in the GLACIER class (I suppose all the info is taken from the cache), so I can run restic mount /mnt/c14 and browse the files, although I cannot copy them or see their content.
If I need to restore files, I restore the whole bucket to the STANDARD class with s3cmd restore --recursive s3://test-bucket/ (see s3cmd), then I check whether all files are back in the STANDARD class with:
$ aws s3 ls s3://test-bucket --recursive | tr -s ' ' | cut -d' ' -f 4 | xargs -n 1 -I {} sh -c "aws s3api head-object --bucket test-bucket --key '{}' | jq -r .StorageClass" | grep --quiet GLACIER
which succeeds (exit status 0) if at least one file is still in the GLACIER class, so you have to wait until this command fails (non-zero exit status).
Obviously a restore will take more time, but I'm using C14 GLACIER as a second or third backup, alongside another restic repository in Backblaze B2, which is warm storage.
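The "wait until no file is in GLACIER" step can be wrapped in a polling loop. This is a rough sketch, not the answer's own method: still_in_glacier and wait_for_restore are hypothetical helpers, and the usage line assumes that list-objects-v2 on this provider reports the StorageClass of each object, which would avoid one head-object call per key.

```shell
# Succeed (exit 0) if any storage class read from stdin is exactly GLACIER.
# Accepts tab- or newline-separated values.
still_in_glacier() {
    tr '\t' '\n' | grep -qx 'GLACIER'
}

# Re-run the given listing command every INTERVAL seconds until
# none of the storage classes it prints is GLACIER.
wait_for_restore() {
    interval=$1
    shift
    while "$@" | still_in_glacier; do
        sleep "$interval"
    done
}

# Hypothetical usage, polling every 5 minutes:
# wait_for_restore 300 aws s3api list-objects-v2 --bucket test-bucket \
#     --query 'Contents[].StorageClass' --output text
```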
In addition to vstefanoxx's answer, here is my workflow.
I set up the restic repository just like vstefanoxx.
Now, if you want to prune the repository... you cannot, as the files are in GLACIER and restic needs read-write access to the bucket to prune.
What is interesting about Scaleway is that file transfers between the GLACIER and STANDARD classes are free. So let's move the data back to the STANDARD class:
s3cmd restore --recursive s3://test-bucket
And wait until the process finishes, using the command given by vstefanoxx. Once your data is in the STANDARD class it costs you five times more, so we have to be efficient :-)
So we now prune the repository:
restic prune -r s3:s3.fr-par.scw.cloud/test-bucket
And once it has finished, move everything (in fact data, index and snapshots, but not keys) back to GLACIER:
s3cmd cp s3://test-bucket/data/ s3://test-bucket/data/ --recursive --storage-class=GLACIER
s3cmd cp s3://test-bucket/index/ s3://test-bucket/index/ --recursive --storage-class=GLACIER
s3cmd cp s3://test-bucket/snapshots/ s3://test-bucket/snapshots/ --recursive --storage-class=GLACIER
So we are now at a point where we have pruned the repository while paying as little as possible!
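The three copy commands differ only in the prefix, so they can be generated in a loop. A small sketch, where archive_commands is a hypothetical helper that only prints the commands (a dry run), so they can be reviewed before being executed:

```shell
# Print one "move back to GLACIER" command per restic directory that should be archived.
# The keys directory is deliberately left out, as in the workflow above.
archive_commands() {
    bucket=$1
    for prefix in data index snapshots; do
        printf 's3cmd cp s3://%s/%s/ s3://%s/%s/ --recursive --storage-class=GLACIER\n' \
            "$bucket" "$prefix" "$bucket" "$prefix"
    done
}

# Review the output first, then run it, for example:
# archive_commands test-bucket
# archive_commands test-bucket | sh
```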
The chosen answer doesn't seem to work when doing incremental backups. I went with a different solution.
I set up a normal bucket, initialized with the usual restic init. Then I set up the following lifecycle rule:
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Rule>
<ID>data-to-glacier</ID>
<Filter>
<Prefix>data/</Prefix>
</Filter>
<Status>Enabled</Status>
<Transition>
<Days>0</Days>
<StorageClass>GLACIER</StorageClass>
</Transition>
</Rule>
</LifecycleConfiguration>
Days is set to 0, which means that the rule will be applied to all files. Rules are not applied continuously, though; they're applied once a day at midnight UTC.
This rule will only apply to the files in data/, which are the big files.
This rule description is supposed to be used with s3cmd but you can also do it from the dashboard if you prefer a GUI.
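As a sketch of how the rule could be applied with s3cmd: the helper lifecycle_xml and the file name lifecycle.xml are hypothetical, and test-bucket is just the bucket name used in the other answers. The setlifecycle command takes a file containing the policy and the bucket URL.

```shell
# Print a lifecycle policy that moves everything under PREFIX to GLACIER.
lifecycle_xml() {
    prefix=$1
    cat <<EOF
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>data-to-glacier</ID>
    <Filter>
      <Prefix>${prefix}</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>0</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
EOF
}

# Hypothetical usage:
# lifecycle_xml data/ > lifecycle.xml
# s3cmd setlifecycle lifecycle.xml s3://test-bucket
# s3cmd getlifecycle s3://test-bucket   # check what the bucket now reports
```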

How to download S3-Bucket, compress on the fly and reupload to another s3 bucket without downloading locally?

I want to download the contents of an S3 bucket (hosted on Wasabi, which claims to be fully S3 compatible) to my VPS, tar, gzip and gpg it, and re-upload this archive to another S3 bucket on Wasabi!
My VPS only has 30GB of storage and the whole bucket is about 1000GB in size, so I need to download, archive, encrypt and re-upload all of it on the fly without storing the data locally.
The secret seems to be in using the | pipe operator. But I am stuck even at the beginning, downloading a bucket into a local archive (I want to go step by step):
s3cmd sync s3://mybucket | tar cvz archive.tar.gz -
In my mind at the end I expect some code like this:
s3cmd sync s3://mybucket | tar cvz | gpg --passphrase secretpassword | s3cmd put s3://theotherbucket/archive.tar.gz.gpg
but it's not working so far!
What am I missing?
The aws s3 sync command copies multiple files to the destination. It does not copy to stdout.
You could use aws s3 cp s3://mybucket/stream.txt - (including the dash at the end) to copy the contents of a single object to stdout.
From cp — AWS CLI Command Reference:
The following cp command downloads an S3 object locally as a stream to standard output. Downloading as a stream is not currently compatible with the --recursive parameter:
aws s3 cp s3://mybucket/stream.txt -
This will only work for a single file.
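For a single object, the streaming form of cp can be combined with pipes so that nothing is written to the local disk. A minimal sketch, assuming gzip is installed: pack_stream is a hypothetical helper and the destination key is made up. It does not solve the whole-bucket case, since streaming is not compatible with --recursive.

```shell
# Compress stdin to stdout; no temporary file is created.
pack_stream() {
    gzip -c
}

# Hypothetical usage: download, compress and re-upload one object on the fly.
# An encryption step (for example gpg) could be inserted after pack_stream.
# aws s3 cp s3://mybucket/stream.txt - \
#     | pack_stream \
#     | aws s3 cp - s3://theotherbucket/stream.txt.gz
```

For very large streams the upload side may need the --expected-size option of aws s3 cp, because the CLI cannot know the total size in advance.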
You may try https://github.com/kahing/goofys. I guess in your case the steps could be the following:
$ goofys source-s3-bucket-name /mnt/src
$ goofys destination-s3-bucket-name /mnt/dst
$ tar -cvzf - /mnt/src | gpg -e -o /mnt/dst/archive.tgz.gpg

How to invalidate specific cached Amazon S3 file with s3cmd command?

I use s3cmd to copy local content to remote bucket with:
s3cmd --acl-public --cf-invalidate -M \
--add-header="Cache-Control: max-age=604800" \
--cf-invalidate \
--no-encrypt \
sync $LOCAL_FOLDER s3://$REMOTE_BUCKET
--cf-invalidate makes sure that an old cached file with the same name as the file being copied is invalidated. However, some cached files that were copied before won't be copied again and thus won't be invalidated. How can I invalidate a specific cached file?

Transporting data from Amazon S3 to a local server

I am trying to import data from S3 using the script described below (which I sort of inherited). It's a bit long... The problem is that I keep receiving the following output:
The config profile (importer) could not be found
I am not a bash person, so be gentle, please. It seems some credentials are missing or something else is wrong with the configuration of "importer" on the local machine.
In the S3 configuration (the console) there is a user with the same name which, according to its permissions, can access the bucket and download data.
I have tried changing the access keys in the Amazon console for the user and creating a file named "credentials" in home/.aws (there was no .aws folder in the home dir by default, so I created it), including the new keys in the file, and tried upgrading the AWS CLI with pip. Nothing helped.
Then I modified the "credentials" file, placing [importer] as the profile name, so it looked like:
[importer]
aws_access_key = xxxxxxxxxxxxxxxxx
aws_secret+key = xxxxxxxxxxxxxxxxxxx
It appears that I have got past the "misconfiguration":
A client error (InvalidAccessKeyId) occurred when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
Completed 1 part(s) with ... file(s) remaining
And here's the part where I am stuck... I placed the keys I obtained from Amazon into that config file. Double checked... Any suggestions? I can't produce any more keys (AWS quota per user). Below is part of the script:
#!/bin/sh
echo "\n$0 started at: `date`"
incomming='/Database/incomming'
IFS='
';
mkdir -p ${incomming}
echo "syncing files from arrivals bucket to ${incomming} incomming folder"
echo aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
count=0
echo ""
echo "Searching for zip files in ${incomming} folder"
for f in `find ${incomming} -name '*.zip'`;
do
echo "\n${count}: ${f} --------------------"
count=$((count+1))
name=`basename "$f" | cut -d'.' -f1`
dir=`dirname "$f"`
if [ -d "${dir}/${name}" ]; then
echo "\tWarning: directory "${dir}/${name}" already exist for file: ${f} ... - skipping - not imported"
continue
fi