Kafka Connect S3 bucket folder - amazon-s3

Can I create my own directory in S3 using the Confluent S3SinkConnector?
I know it creates a folder structure, but unfortunately we need a new directory structure.

Additionally, if you want to completely remove the first S3 'folder' ('topics' by default), you can set the topics.dir configuration to the backspace character: \b.
This way, {bucket}/\b/{partitioner_defined_path} becomes {bucket}/{partitioner_defined_path}.
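As a rough sketch of that trick (not from the original answer; the connector name, topic, and bucket below are placeholders, and required settings such as format.class, storage.class, and flush.size are omitted), a config submitted to the Connect REST API could rely on JSON's backspace escape for the topics.dir value:

{
  "name": "my-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "topics.dir": "\b"
  }
}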

You can change topics.dir, which is then followed by the path produced by the partitioner.class.
If you need "a new directory structure" (quoted because S3 has no directories), then you would need to look at implementing your own Partitioner class:
https://docs.confluent.io/current/connect/kafka-connect-s3/index.html#s3-object-names
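For illustration only (this is a hedged sketch, not the asker's setup; the topic, bucket, prefix, and field name are placeholders, and other required connector settings are omitted), combining a custom topics.dir with one of the stock partitioners might look like:

{
  "name": "my-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "topics.dir": "data/raw",
    "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
    "partition.field.name": "customer_id"
  }
}

With a configuration along these lines, object keys end up looking roughly like data/raw/my-topic/customer_id=42/..., i.e. topics.dir followed by whatever path the partitioner produces.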

Related

Apache Camel eats S3 "folders" created programmatically, but not ones created in the AWS S3 Console

We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder each file is in), submit the data in the file (and the metadata) to another system, and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are being consumed, so the directory structure disappears.
I know S3 technically does not have folders, just zero-byte objects whose keys end in /.
The twist here is that any "folders" created in the S3 Console are NOT consumed; they stay there as we want them to. Any folders created via the AWS CLI or boto3 are immediately consumed.
The problem is that we do need the folders to be created with automation, there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders and how the CLI does it. Support confirmed that the command I used in the CLI is correct.
I think my issue is similar to "Apache Camel deleting AWS S3 bucket's folder", but that has no answer...
How can I get Camel to not "eat" any folders?
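For reference, the zero-byte "folder" markers described above are typically created like this with boto3 (a minimal sketch; the bucket and prefix are made up, and per AWS Support the Console and CLI produce the same kind of object):

import boto3

s3 = boto3.client("s3")

# An S3 "folder" is just a zero-byte object whose key ends in "/".
s3.put_object(Bucket="incoming-bucket", Key="customer-a/invoices/")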

Upload files whose names contain a trailing slash to an AWS S3 bucket

I'm trying to upload some files to my bucket on S3 through boto3 on Python.
These files' names are website addresses (for example, www.google.com/gmail).
I want the file name to be the website address, but in fact it creates a folder named "www.google.com" containing an uploaded file named "gmail".
I tried to solve it with a double slash and a backslash before the slash, but it didn't work.
Is there any way to ignore the slash and upload a file whose name is a website address?
Thanks.
You are misunderstanding S3 - it does not actually have a "folder" structure. Every object in a bucket has a unique key, and the object is accessed via that key.
Some S3 utilities (including, to be fair, the AWS console) fake up a "folder" structure, but that is only presentation; it isn't how S3 actually works.
Or in other words, don't worry about it. Just create the object with / in its key and everything will work as you expect.
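A minimal boto3 sketch of that advice (the bucket name and contents are placeholders):

import boto3

s3 = boto3.client("s3")

# The slash is just another character in the key; S3 stores one flat object.
s3.put_object(
    Bucket="my-bucket",
    Key="www.google.com/gmail",
    Body=b"file contents",
)

# Fetch it back with exactly the same key.
obj = s3.get_object(Bucket="my-bucket", Key="www.google.com/gmail")
print(obj["Body"].read())

The Console will display this object under a "www.google.com" folder, but that is purely a display convention.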
S3 has a flat structure with no folders. The "folders" you are seeing are a feature in the AWS Console to make it easier to navigate through your objects. The console will group objects in a "folder" based on the prefix before the slash (if there is one).
There's nothing that prevents you from using slashes in S3 object keys. When you use the API via boto, you can refer to the full key (slashes included) and you will get the object.
See: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html

Upload multiple files to AWS S3 bucket without overwriting existing objects

I am very new to AWS technology.
I want to add some files to an existing S3 bucket without overwriting existing objects. I am using Spring Boot technology for my project.
Can anyone please suggest how we can add/upload multiple files without overwriting existing objects?
AWS S3 supports object versioning on the bucket: if you upload the same key again, S3 keeps every upload as a separate version rather than overwriting it.
Versioning can be enabled using the AWS Console or the CLI. You may want to refer to this link for more info.
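For example, enabling versioning from the CLI looks like this (the bucket name is a placeholder):

aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled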
You probably already found an answer to this, but if you're using the CDK or the CLI you can specify a destinationKeyPrefix. If you want multiple folders in an S3 bucket, which was my case, the folder name will be your destinationKeyPrefix.

Mounted S3 on EC2 (Directories are not accessible from AWS UI)

Some quick questions:
Does S3 support soft links?
With S3 mounted on EC2, I can't see the directories created on the Linux EC2 instance from the AWS UI; however, the created files are visible.
Thanks
Amazon S3 is an object store, not a filesystem. It has a specific set of APIs for uploading, listing, downloading, etc but it does not behave like a normal filesystem.
There are some utilities that can mount S3 as a filesystem (eg ExpanDrive, Cloudberry Drive, s3fs), but in the background these utilities are actually translating requests into API calls. This can cause some issues -- for example, you can modify a 100MB file on a local disk by just writing one byte to disk. If you wish to modify one byte on S3, you must upload the whole object again. This can cause synchronization problems between your computer and S3, so such methods are not recommended for production situations. (However, they're a great way of uploading/downloading initial data.)
A good in-between option is to use the AWS Command-Line Interface (CLI), which has commands such as aws s3 cp and aws s3 sync, which are reliable ways to upload/download/sync files with Amazon S3.
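For example (the local paths and bucket are placeholders):

aws s3 cp ./report.csv s3://my-bucket/reports/report.csv
aws s3 sync ./local-dir s3://my-bucket/backup/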
To answer your questions...
Amazon S3 does not support "soft links" (symbolic links). Amazon S3 is an object store, not a file system, so it only contains objects. Objects can also have metadata that is often used for cache control, redirection, classification, etc.
Amazon S3 does not support directories (sort of). Amazon S3 objects are kept within buckets, and the buckets are 'flat' -- they do not contain directories/sub-folders. However, it does maintain the illusion of directories. For example, if the file bar.jpg is stored in the foo directory, then the Key (filename) of the object is foo/bar.jpg. This makes the object 'appear' to be in the foo directory, but that's not how it is stored. The AWS Management Console maintains this illusion by allowing users to create and open Folders, but the actual data is stored 'flat'.
This leads to some interesting behaviours:
You do not need to create a directory to store an object in the directory.
Directories don't exist. Just store a file called images/cat.jpg and the images directory magically appears (even though it doesn't exist).
You cannot rename objects. The Key (filename) is a unique identifier for the object. To 'rename' an object, you must copy it to a new Key and delete the original.
You cannot rename a directory. They don't exist. Instead, rename all the objects within the directory (which really means you have to copy the objects, then delete their old versions).
You might create a directory but not see it. Amazon S3 keeps track of CommonPrefixes to assist in listing objects by path, but it doesn't create traditional directories. So, don't get worried if you create a (pretend) directory and then don't see it. Just store your object with a full-path name and the directory will 'appear'.
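To make the CommonPrefixes point concrete, here is a small boto3 sketch (the bucket name is a placeholder) that lists the "top-level folders" S3 derives from its flat keys:

import boto3

s3 = boto3.client("s3")

# Ask S3 to group keys on "/"; the grouped prefixes come back as CommonPrefixes.
resp = s3.list_objects_v2(Bucket="my-bucket", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. "images/" -- looks like a folder, but is derived from flat keys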
The above-mentioned utilities take all this into account when allowing an Amazon S3 bucket to be mounted. They translate 'normal' filesystem commands into Amazon S3 API calls, but they can't do everything (eg they might emulate renaming a file but they typically won't let you rename a directory).

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format - such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html) - or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL classes supported natively by Java.
Another good option is to use s3-dist-cp as a bootstrap step to copy the S3 file to HDFS before your Map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
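The copy itself is essentially a command along these lines (the bucket and paths are placeholders; check the linked guide for the exact arguments on your EMR release):

s3-dist-cp --src s3://my-bucket/reference-data/ --dest hdfs:///reference-data/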
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public. See the sketch below.
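Roughly, the script from step 1 boils down to something like this (the bucket and destination directory are placeholders, matching the command above):

#!/bin/bash
# Copy the static reference file from S3 onto the node before the job runs,
# so mappers can read it from the local filesystem.
SOURCE_S3_BUCKET=my-bucket
DESTINATION_DIR_ON_HOST=/mnt/reference

mkdir -p "$DESTINATION_DIR_ON_HOST"
hadoop fs -copyToLocal "s3n://$SOURCE_S3_BUCKET/path/file.txt" "$DESTINATION_DIR_ON_HOST"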