S3: Move all files from subdirectories into a common directory

I have a lot of subdirectories on S3 containing a lot of images (millions). Having the files in these subdirectories has turned out to be a lot of trouble, and since all file names are unique, there is no reason for them to reside in subdirectories. So I need a fast and scalable way to move all files from the subdirectories into one common directory, or alternatively to delete the subdirectories without deleting the files.
Is there a way to do this?
I'm on Ruby, but open to almost anything.

I have added a comment to your other question explaining why S3 does not have folders, only file name prefixes (see Amazon AWS IOS SDK: How to list ALL file names in a FOLDER).
With that in mind, you will probably need a combination of two S3 API calls to achieve what you want: copy a file to a new key (removing the prefix from the file name) and delete the original. Maybe there is a Ruby S3 SDK or framework out there exposing a rename feature, but under the hood it will likely be a copy/delete.
Related question: Amazon S3 boto: How do you rename a file in a bucket?
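For reference, here is what that copy-then-delete looks like against the S3 API. A minimal sketch in Python with boto3 (the question is Ruby, but the Ruby SDK follows the same pattern; the bucket name is hypothetical):

    # Sketch: S3 has no rename, so "moving" a key means copying it to a
    # new key and deleting the original. This walks every object and
    # moves it to the top level of the bucket.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-images"  # hypothetical bucket name

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            if "/" not in src_key:              # already at the top level
                continue
            dst_key = src_key.rsplit("/", 1)[-1]  # the file name alone
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": src_key},
                           Key=dst_key)
            s3.delete_object(Bucket=bucket, Key=src_key)

For millions of objects you would want to issue the copy/delete pairs concurrently, but the API calls are the same.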

Is there a method available to store a file in mongo under a specific directory

I need to maintain a file system in Mongo where directories can be created and files placed in the directory I have created. Is there any built-in functionality for this in Python, or how can we do it using GridFS? Basically, along with uploading the file, I need to specify the directory where it should be placed.
GridFS permits specifying the name of the file. You can put anything you like in there, including a slash-separated path, so that it looks like a filesystem path.
However, GridFS does not provide any facilities for hierarchical traversal of stored files (i.e. grouping files into directories). You'd have to implement that in your own application/library if you need this functionality.
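A minimal sketch of this convention with PyMongo's gridfs module (database and path names are made up):

    # Sketch: use a slash-separated filename to fake a directory path in GridFS.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient().my_database  # hypothetical database name
    fs = gridfs.GridFS(db)

    # "Upload into a directory" by embedding the path in the filename.
    fs.put(b"report body", filename="reports/2013/summary.txt")

    # Retrieve it by the same name.
    grid_out = fs.get_last_version(filename="reports/2013/summary.txt")

    # Listing a "directory" is just a prefix query on fs.files - the
    # hierarchy exists only in your naming convention, not in GridFS.
    for doc in db.fs.files.find({"filename": {"$regex": "^reports/"}}):
        print(doc["filename"])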

How to use a common file across many Terraform builds?

I have a directory structure as follows to build the Terraform resources for my project:
s3/
    main.tf
    variables.tf
    tag_variables.tf
ec2/
    main.tf
    variables.tf
    tag_variables.tf
vpc/
    main.tf
    variables.tf
    tag_variables.tf
When I want to build or change something in s3, I run Terraform in the s3 directory.
When I want to build the ec2 resources, I cd into that folder and do a Terraform build there.
I run them one at a time.
At the moment I have a list of tags defined as variables, inside each directory.
This is the same file, copied many times to each directory.
Is there a way to avoid copying the same tags file into all of the folders? I'm looking for a solution where I have only one copy of the tags file.
Terraform does offer a solution of sorts using the "locals" block, but this still needs the file to be repeated in each directory.
What I tried:
I tried putting the variables in a module, but variables are internal to a module; modules are not designed to share code with the calling file.
I tried making the variables outputs from the module, but Terraform didn't like that either.
Does anyone have a way to achieve one central tags file that gets used everywhere? I'm thinking of something like including a source code chunk from elsewhere. Any other solution would also be great.
Thanks for that advice, ydaetskcoR. I used symlinks and this works perfectly.
I placed the .tf file with the tags list in a common directory. Each Terraform project now has a symbolic link to that file. (Plus I've linked some other common files this way, like provider.tf.)
For the benefit of others, in Linux a symbolic link is a small file that is a link to another file, but it can be used exactly like the original file.
This allows many different and separate Terraform projects to refer to the common file.
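Creating the links is a one-time step (ln -s target link on Linux). If you ever want to script it, here is a small sketch in Python, with directory and file names assumed from the question:

    # Sketch: symlink shared .tf files from a common/ directory into each
    # Terraform project directory. All names below are assumptions.
    import os

    common = os.path.abspath("common")               # holds the shared files
    projects = ["s3", "ec2", "vpc"]                  # per-resource directories
    shared = ["tag_variables.tf", "provider.tf"]     # files kept once in common/

    for project in projects:
        for name in shared:
            link = os.path.join(project, name)       # where Terraform expects it
            if not os.path.lexists(link):
                os.symlink(os.path.join(common, name), link)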
Note: If you are looking to modularise common Terraform code, have a look at Terraform modules; these are designed to modularise your code, but they can't be used for the simple use case above.

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that, and there are several projects that present S3 as a local filesystem (s3fs, for example). I haven't tried any of them, but it seems like the way to go.
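As a lighter-weight alternative to a FUSE mount, you could have make call a small script that does the timestamp check itself by asking S3 for the object's last-modified time. A sketch with Python's boto3, using the bucket and file name from the question:

    # Sketch: upload a file only if the S3 copy is missing or older than
    # the local copy - the same check make does with file dates.
    import os
    from datetime import datetime, timezone

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    bucket = "mybucket.mydomain.com"
    path = "my_image.jpg"

    local_mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    try:
        remote_mtime = s3.head_object(Bucket=bucket, Key=path)["LastModified"]
        needs_upload = local_mtime > remote_mtime
    except ClientError:          # object not found: first upload
        needs_upload = True

    if needs_upload:
        s3.upload_file(path, bucket, path)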

S3DistCp copies some files from the manifest and doesn't copy the rest

We are using S3DistCp to copy files from S3 to HDFS using a manifest file - i.e., we pass the --copyFromManifest argument in the S3DistCp command. At the S3DistCp step, however, only some of the files listed in the manifest are copied. I am not sure where we should start looking for problems - i.e., why are some files copied and others not?
Thanks
Maybe the problem is that you have files with the same name in different directories. In that case you will need to change the way you construct the baseName and srcDir fields. Please describe how you build your manifest file.
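One quick way to check for such collisions is to scan the manifest for duplicate baseName values. A sketch in Python, assuming the manifest is a gzipped file with one JSON object per line containing a baseName field (the exact layout of your manifest may differ):

    # Sketch: report duplicate baseName entries in an S3DistCp manifest.
    # Assumes one JSON object per line with a "baseName" field; adjust
    # the parsing to match how your manifest is actually built.
    import gzip
    import json
    from collections import Counter

    counts = Counter()
    with gzip.open("manifest.gz", "rt") as fh:
        for line in fh:
            entry = json.loads(line)
            counts[entry["baseName"]] += 1

    for base_name, n in counts.items():
        if n > 1:
            print(f"{base_name} appears {n} times")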

Which S3 manager creates '_$folder$' files for pseudo-folders?

I know that S3 has no folders, and I'm working with an inherited application that has some buckets filled with folder_name_$folder$ items. I know that various tools use these files, or other folder sigils depending on the tool, to represent 'folders' in visual interfaces. I'm wondering which one uses this particular convention.
I'd like to remove them so that my various rake tasks that run down lists of files can go faster, but I'm afraid I'll end up breaking some tool that someone else in the company uses. Can anyone say which tools create these keys, and what functionality, if any, removing them would break? S3fox? The main AWS console?
The _$folder$ keys are created by S3Fox; the AWS Console doesn't create them. You can safely delete them if you like.
Thanks
Andy
They are also created by Hadoop S3 Native FileSystem (NativeS3FileSystem).
The _$folder$ keys are created by Hadoop and Spark applications during file move operations.
You can remove them using the AWS CLI:
aws s3 rm s3://path --recursive --exclude "*" --include "*\$folder\$"
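If there are many of these keys, batch deletes through the API are faster than removing them one at a time. A minimal sketch with Python's boto3 (bucket name hypothetical):

    # Sketch: batch-delete "*_$folder$" placeholder keys with boto3.
    import boto3

    s3 = boto3.client("s3")
    bucket = "mybucket"  # hypothetical bucket name

    paginator = s3.get_paginator("list_objects_v2")
    batch = []
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("_$folder$"):
                batch.append({"Key": obj["Key"]})
            if len(batch) == 1000:  # delete_objects takes up to 1000 keys
                s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
                batch = []
    if batch:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})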
They seem to be created by 3Hub as well.
I encountered this issue when creating an Athena partition when there was no such key in S3. These files can be safely removed.
See also: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/