I know that S3 has no folders, and I'm working with an inherited application that has some buckets filled with folder_name_$folder$ items. I know that a lot of different tools use these files, or other folder sigils, depending on the tool, to help represent 'folders' to various visual interfaces. I'm wondering which one uses this particular convention.
I'd like to remove them so that my various rake tasks that run down lists of files can go faster, but I'm afraid I'll end up breaking some tool that someone else in the company uses. Can anyone say which tools create these keys, and what functionality, if any, removing them would break? S3fox? The main AWS console?
The _$folder$ keys are created by S3Fox; the AWS Console doesn't create them. You can safely delete them if you like.
Thanks
Andy
They are also created by Hadoop S3 Native FileSystem (NativeS3FileSystem).
The $folder$ keys are created by Hadoop and Spark applications during the file movement process.
You can remove them using the aws s3 command:
aws s3 rm s3://path --recursive --exclude "*" --include "*\$folder\$"
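If you would rather review the keys before deleting anything, the selection step can be sketched in Python. This is a minimal sketch, assuming every marker ends in the literal suffix `_$folder$` as in the question; `folder_marker_keys` is a hypothetical helper name and the key list is made-up example data. In practice the keys would come from a bucket listing.

```python
from typing import Iterable, List

# Assumed suffix, taken from the key names described in the question.
MARKER_SUFFIX = "_$folder$"


def folder_marker_keys(keys: Iterable[str]) -> List[str]:
    """Return only the keys that look like folder placeholder objects."""
    return [key for key in keys if key.endswith(MARKER_SUFFIX)]


if __name__ == "__main__":
    # Made-up example data standing in for a real bucket listing.
    example_keys = [
        "reports_$folder$",
        "reports/2020.csv",
        "reports/archive_$folder$",
    ]
    for key in folder_marker_keys(example_keys):
        print(key)
```

Printing the candidates first gives other teams a chance to object before the keys are removed.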
They seem to be created by 3Hub as well.
I encountered this issue when I created an Athena partition for which no such key existed in S3. These files can be safely removed.
Please check also here https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
I'm using rclone in order to copy some files to an S3 bucket (deep archive). The command I'm using is:
rclone copy --ignore-existing --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail
This makes rclone copy files that I know for sure already exist in the bucket. I tried removing the --ignore-existing flag (which IMHO is badly named, as it does exactly the opposite of what you'd initially expect), but I still get the same behaviour.
I also tried adding --size-only, but the "bug" doesn't get fixed.
How can I make rclone copy only new files?
You could use rclone sync, check out https://rclone.org/commands/rclone_sync/
Doesn’t transfer unchanged files, testing by size and modification time or MD5SUM. Destination is updated to match source, including deleting files if necessary.
It turned out to be a bug in rclone. https://github.com/rclone/rclone/issues/3834
I have a snakemake workflow that runs on a local HPC. I also want to use this workflow on AWS, using boto to push files to and from S3 as needed. This circumvents IO issues when used in conjunction with --no-shared-fs. I was thinking the best way to do this was just using --default-remote-provider S3 and --default-remote-prefix to specify the bucket name. However, when this option is invoked, it prepends the S3 remote and bucket prefix to all the inputs. This includes executable programs such as bwa, samtools, bedtools, etc. When boto downloads these programs, they lose their executable permissions.
Is there a good way to specify "local" inputs vs S3 inputs without having to modify the rules? This way I can keep only 1 version of the workflow if I need to change/improve it.
Thanks for any help!
You can enclose the local files with local(...), analogous to the way temp files are marked. This way, the default remote provider will not be applied to them.
Note that tools should usually not be input or output files. Consider using the snakemake conda integration and bioconda for software deployment and specification:
http://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management
The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
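The three conditions quoted above can be written out as a small predicate. This only illustrates the documented rule and is not the CLI's actual implementation; `ObjectInfo` and `needs_copy` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ObjectInfo:
    """Size in bytes and last-modified time of a file or S3 object."""
    size: int
    last_modified: datetime


def needs_copy(source: ObjectInfo, dest: Optional[ObjectInfo]) -> bool:
    """Apply the three documented conditions in turn."""
    if dest is None:
        # The object does not exist at the destination.
        return True
    if source.size != dest.size:
        # The sizes differ.
        return True
    # The source was modified more recently than the destination.
    return source.last_modified > dest.last_modified


if __name__ == "__main__":
    older = ObjectInfo(100, datetime(2020, 1, 1, tzinfo=timezone.utc))
    newer = ObjectInfo(100, datetime(2020, 6, 1, tzinfo=timezone.utc))
    print(needs_copy(newer, older))  # source is newer than destination
    print(needs_copy(older, newer))  # destination is already newer
```

A make rule that simply calls the sync command gets this check without make having to read the S3 timestamps itself.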
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects to present S3 as a local filesystem. I haven't tried any of those, but it seems like the way to go.
I'd like to sync a single file from my filesystem to s3.
Is this possible, or can only directories be synced?
Use include/exclude options for the sync-directory command:
e.g. To sync just /var/local/path/filename.xyz to S3 use:
s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude options. I was including the = symbol and getting weird behavior with the sync command.
s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
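The reason an exclude-everything filter followed by a single include works is that the filters are applied in order, with later ones overriding earlier ones. The sketch below is a simplified model of that ordering, assuming the AWS CLI behaviour where later filters take precedence; `is_included` is a hypothetical helper, and the real tool resolves patterns against the source directory rather than against bare relative paths as done here.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Each filter is ("exclude" | "include", glob pattern), in command-line order.
Filter = Tuple[str, str]


def is_included(rel_path: str, filters: List[Filter]) -> bool:
    """Walk the filters in order; the last matching filter decides."""
    included = True  # with no filters, everything is transferred
    for action, pattern in filters:
        if fnmatchcase(rel_path, pattern):
            included = (action == "include")
    return included


if __name__ == "__main__":
    filters = [("exclude", "*"), ("include", "*/filename.xyz")]
    for path in ["sub/filename.xyz", "sub/other.txt"]:
        print(path, is_included(path, filters))
```

Swapping the two filters would exclude everything, because the exclude would then be the last match for every path.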
You can mount S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize file(-s) or directories.
I have a lot of subdirectories containing a lot of images (millions) on S3. Having the files in these subdirectories has turned out to be a lot of trouble, and since all file names are actually unique, there is no reason why they should reside in subdirectories. So I need to find a fast and scalable way to move all files from the subdirectories into one common directory or alternatively delete the sub directories without deleting the files.
Is there a way to do this?
I'm on Ruby, but open to almost anything.
I have added a comment to your other question, explaining why S3 does not have folders, but file name prefixes instead (See Amazon AWS IOS SDK: How to list ALL file names in a FOLDER).
With that in mind, you will probably need to use a combination of two S3 API calls in order to achieve what you want: copy a file to a new one (removing the prefix from the file name) and deleting the original. Maybe there is a Ruby S3 SDK or framework out there exposing a rename feature, but under the hood it will likely be a copy/delete.
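The copy-then-delete approach can be sketched as follows. The question is about Ruby, but the shape is the same in any SDK; this version is in Python and assumes `client` is a boto3 S3 client, using its `copy_object` and `delete_object` calls. The helper names (`flattened_key`, `plan_moves`, `move_objects`) are hypothetical, and the sketch relies on the question's statement that all file names are unique, so it does no collision checking.

```python
from typing import Any, Iterable, List, Tuple


def flattened_key(key: str) -> str:
    """Drop everything up to the last '/' so only the file name remains."""
    return key.rsplit("/", 1)[-1]


def plan_moves(keys: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (old_key, new_key) pairs for keys that actually need moving."""
    moves = []
    for key in keys:
        new_key = flattened_key(key)
        # Skip keys that end in '/' (no file name) or have no prefix at all.
        if new_key and new_key != key:
            moves.append((key, new_key))
    return moves


def move_objects(client: Any, bucket: str, keys: Iterable[str]) -> None:
    """Copy each object to its prefix-free key, then delete the original."""
    for old_key, new_key in plan_moves(keys):
        client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": old_key},
        )
        client.delete_object(Bucket=bucket, Key=old_key)


if __name__ == "__main__":
    # Dry run on made-up keys: show the plan without touching S3.
    for old_key, new_key in plan_moves(["a/b/img1.jpg", "img2.jpg", "a/"]):
        print(old_key, "->", new_key)
```

With millions of objects the loop is the slow part, so running `plan_moves` first and splitting the resulting pairs across several workers is one way to scale it.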
Related question: Amazon S3 boto: How do you rename a file in a bucket?