Prevent rclone from re-copying files to AWS S3 Deep Archive


I'm using rclone to copy some files to an S3 bucket (Deep Archive). The command I'm using is:
rclone copy --ignore-existing --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail
This makes rclone copy files that I know for certain already exist in the bucket. I tried removing the --ignore-existing flag (which IMHO is badly named, as it does exactly the opposite of what you'd initially expect), but I still get the same behaviour.
I also tried adding --size-only, but that doesn't fix the "bug" either.
How can I make rclone copy only new files?

You could use rclone sync (see https://rclone.org/commands/rclone_sync/). According to its documentation, it:
Doesn’t transfer unchanged files, testing by size and modification time or MD5SUM. Destination is updated to match source, including deleting files if necessary.

It turned out to be a bug in rclone. https://github.com/rclone/rclone/issues/3834

Related

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?

The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
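The three conditions quoted above can be sketched as a small decision function. This is only an illustration of the documented rule, not the AWS CLI's actual code; the ObjectInfo class and the needs_copy name are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ObjectInfo:
    """Hypothetical record holding the two attributes the rule compares."""
    size: int
    last_modified: datetime


def needs_copy(source: ObjectInfo, destination: Optional[ObjectInfo]) -> bool:
    """Return True if any of the three quoted conditions holds."""
    if destination is None:
        # The object does not exist at the destination.
        return True
    if source.size != destination.size:
        # The sizes differ.
        return True
    # The source was modified more recently than the destination.
    return source.last_modified > destination.last_modified


old = datetime(2020, 1, 1, tzinfo=timezone.utc)
new = datetime(2020, 6, 1, tzinfo=timezone.utc)

print(needs_copy(ObjectInfo(10, new), None))                # missing at destination
print(needs_copy(ObjectInfo(10, new), ObjectInfo(12, new)))  # sizes differ
print(needs_copy(ObjectInfo(10, new), ObjectInfo(10, old)))  # source is newer
print(needs_copy(ObjectInfo(10, old), ObjectInfo(10, new)))  # destination is newer
```

Under this rule, a file whose size is unchanged and whose destination copy is newer than the source is left alone, which is the "only rebuild when out of date" behaviour the question asks for.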

I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects to present S3 as a local filesystem. I haven't tried any of those, but it seems like the way to go.

Sync with S3 with s3cmd, but not re-download files that only changed name

I'm syncing a bunch of files between my computer and Amazon S3. Say a couple of the files change name, but their content is still the same. Do I have to have the local file removed by s3cmd and then the "new" file re-downloaded, just because it has a new name? Or is there any other way of checking for changes? I would like s3cmd to, in that case, simply change the name of the local file in accordance with the new name on the server.

Both upstream s3cmd (the master branch at github.com/s3tools/s3cmd) and the latest published version, 1.5.0-rc1, can figure this out, provided the files were originally uploaded to S3 with a recent version using the --preserve option, which stores the md5sum of each file. Using the md5sums, s3cmd knows that you already have a duplicate (even if renamed) file locally and won't re-download it; instead it does a local copy (or hardlink) from the existing local file to the name used in S3.
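The idea of matching renamed files by checksum can be sketched roughly as follows. This is not s3cmd's implementation; plan_sync and md5_of are hypothetical helpers, and the remote listing is assumed to be a plain mapping from remote name to MD5 hex digest. As a simplification, a name that already exists locally is skipped without comparing its content.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_sync(local_dir: Path, remote_md5s: dict) -> dict:
    """For each remote name missing locally, find a local file with the same MD5.

    Returns {remote_name: matching_local_path_or_None}.
    """
    local_by_md5 = {}
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            local_by_md5.setdefault(md5_of(path), path)
    plan = {}
    for name, checksum in remote_md5s.items():
        if not (local_dir / name).exists():
            plan[name] = local_by_md5.get(checksum)
    return plan


with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp)
    (local_dir / "old_name.txt").write_bytes(b"same content")
    # Assumed remote listing: name -> MD5 stored alongside the object.
    remote = {
        "new_name.txt": hashlib.md5(b"same content").hexdigest(),
        "other.txt": hashlib.md5(b"different").hexdigest(),
    }
    for name, match in plan_sync(local_dir, remote).items():
        if match is not None:
            shutil.copy2(match, local_dir / name)  # local copy, no download
            print(f"{name}: copied locally from {match.name}")
        else:
            print(f"{name}: would need to be downloaded")
```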

Is it possible to sync a single file to s3?

I'd like to sync a single file from my filesystem to s3.
Is this possible, or can only directories be synced?

Use include/exclude options for the sync-directory command:
e.g. To sync just /var/local/path/filename.xyz to S3 use:
s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'

cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp

Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude options. I was including the = symbol and getting weird behavior with the sync command.
s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
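The order of the --exclude and --include options matters, because the AWS CLI documentation states that filters appearing later in the command take precedence over earlier ones. The sketch below imitates that rule with Python's fnmatch on relative paths. It is a simplification (the real CLI evaluates patterns relative to the source directory), and is_included, the sample paths, and the sample pattern are hypothetical.

```python
from fnmatch import fnmatchcase


def is_included(path, filters):
    """Apply (kind, pattern) filters in order; the last matching filter wins."""
    included = True  # with no filters, everything is included
    for kind, pattern in filters:
        if fnmatchcase(path, pattern):
            included = (kind == "include")
    return included


paths = ["filename.xyz", "notes.txt", "sub/other.xyz"]

# Exclude everything first, then re-include the one file we want.
exclude_then_include = [("exclude", "*"), ("include", "filename.xyz")]
print([p for p in paths if is_included(p, exclude_then_include)])

# Reversed order: the trailing exclude overrides the earlier include.
include_then_exclude = [("include", "filename.xyz"), ("exclude", "*")]
print([p for p in paths if is_included(p, include_then_exclude)])
```

With the exclude first, only filename.xyz survives; with the order reversed, nothing does, which is why the exclude comes before the include in the commands above.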

You can mount an S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize files or directories.

Empty files on S3 prevent from downloading using s3cmd and s3sync

I am trying to set up a backup/restore using S3. The upload sync worked well using s3sync. However, next to each folder there is an empty file with a matching name. I read somewhere that this is created to define the folder structure, but I am not sure about that, as it doesn't happen if I create a folder using a different method (S3Fox, etc.).
These empty files prevent me from restoring the directories/files. When I do s3cmd sync, I get an error message "can not make directory: File exists" as it first creates that empty file and that fails when trying to create the directory. Any ideas how I can solve this problem?

Which S3 manager creates '_$folder$' files for pseudo-folders?

I know that S3 has no folders, and I'm working with an inherited application that has some buckets filled with folder_name_$folder$ items. I know that a lot of different tools use these files, or other folder sigils, depending on the tool, to help represent 'folders' to various visual interfaces. I'm wondering which one uses this particular convention.
I'd like to remove them so that my various rake tasks that run down lists of files can go faster, but I'm afraid I'll end up breaking some tool that someone else in the company uses. Can anyone say which tools create these keys, and what functionality, if any, removing them would break? S3fox? The main AWS console?

The _$folder$ entries are created by S3Fox; the AWS Console doesn't create them. You can safely delete them if you like.
Thanks
Andy

They are also created by Hadoop S3 Native FileSystem (NativeS3FileSystem).

The $folder$ keys are created by Hadoop and Spark applications during the file movement process.
You can remove them using the aws s3 rm command:
aws s3 rm s3://path --recursive --exclude "*" --include "*\$folder\$"
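Before deleting anything, it may be worth checking which keys the "*$folder$" pattern would actually select. The sketch below does that on a plain list of key names; split_marker_keys and the sample keys are hypothetical, and fetching the real key listing from the bucket is left out.

```python
MARKER_SUFFIX = "$folder$"  # mirrors the "*$folder$" include pattern above


def split_marker_keys(keys):
    """Separate pseudo-folder marker keys from ordinary object keys."""
    markers = [key for key in keys if key.endswith(MARKER_SUFFIX)]
    others = [key for key in keys if not key.endswith(MARKER_SUFFIX)]
    return markers, others


# Hypothetical key listing, for illustration only.
sample_keys = [
    "logs_$folder$",
    "logs/2020/app.log",
    "logs/2020_$folder$",
    "report.csv",
]

markers, others = split_marker_keys(sample_keys)
print("would delete:", markers)
print("would keep:", others)
```

The aws s3 rm command also accepts a --dryrun option, which prints the operations it would perform without deleting anything.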

They seem to be created by 3Hub as well.

I encountered this issue when creating an Athena partition for which there was no such key in S3. These files can be safely removed.
Please check also here https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
