Camel Kafka S3 Source Connector with multiple connectors for the same bucket

I am trying to define Camel S3 Source connectors. I have searched quite a bit without success to find answers to the questions below.
How can I set up my connector so that a file in the S3 bucket is not deleted after it is processed, but instead moved to another folder that I specify?
Is it possible to define separate connectors with different value converters for the same bucket, perhaps split by folders under the bucket? The connectors would use different Kafka topics based on the file type. How can I specify the bucket together with a folder when defining the connector properties?
Thank you
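A minimal sketch of what such a configuration might look like, assuming the camel-kafka-connector aws2-s3 source and its endpoint options (prefix, deleteAfterRead, moveAfterRead, destinationBucket, destinationBucketPrefix); the property names, bucket, folders, and topic below are assumptions to verify against the connector version in use:
name=s3-source-csv
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
topics=csv-files-topic
camel.source.path.bucketNameOrArn=my-bucket
# read only objects under this folder of the bucket
camel.source.endpoint.prefix=input/csv/
# keep the object rather than deleting it, and move it to a processed/ folder after it is consumed
camel.source.endpoint.deleteAfterRead=false
camel.source.endpoint.moveAfterRead=true
camel.source.endpoint.destinationBucket=my-bucket
camel.source.endpoint.destinationBucketPrefix=processed/
# the converter is set per connector, so a second connector can use a different prefix, topic, and converter
value.converter=org.apache.kafka.connect.storage.StringConverter
A second connector definition with a different camel.source.endpoint.prefix, topics value, and value.converter would then cover the other file type in the same bucket.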

Related

Kafka Connect SpooldirCsv connector

I am trying to use Confluent's SpoolDirCSVSource connector to read files from a directory and send them to MSK. It works fine locally, but now I need to source the files from an S3 bucket. Is there no way I can use this connector to do this? Or is there some other connector which does this? The input.path parameter works only with local directories, I think. Any pointer in the right direction to the correct connector, or to modifying the SpoolDirCSV connector, would be appreciated.
I know there is an old, similar question, but I am curious to know whether this functionality is still absent (I could be wrong):
How to use Kafka Connect to source .csv files from S3 bucket?
This is the exact error when the connector is deployed to the cloud:
There is an issue with the connector
Code: InvalidInput.InvalidConnectorConfiguration
Message: The connector configuration is invalid. Message: Connector configuration is invalid and contains the following 2 error(s):
Invalid value File 's3:/mytestbucketak/input' is not an absolute path. for configuration input.path
Invalid value File 's3:/mytestbucketak/error' is not an absolute path. for configuration error.path
That connector can only read from the local filesystem, not S3.
Confluent has a specific S3 Source Connector, or, as linked, the FilePulse connector also exists.
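A hedged sketch of pointing that Confluent S3 Source Connector at a bucket; the property names below mirror the S3 sink connector's naming and, along with the values, are assumptions to verify against the connector documentation (note it reads data back in formats it understands, such as JSON or Avro, rather than arbitrary CSV, which is why FilePulse may be the better fit here):
name=s3-source
connector.class=io.confluent.connect.s3.source.S3SourceConnector
tasks.max=1
# assumed property names, mirroring the S3 sink connector
s3.bucket.name=mytestbucketak
s3.region=us-east-1
format.class=io.confluent.connect.s3.format.json.JsonFormat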

Upload multiple files to AWS S3 bucket without overwriting existing objects

I am very new to AWS technology.
I want to add some files to an existing S3 bucket without overwriting existing objects. I am using Spring Boot technology for my project.
Can anyone please suggest how we can add/upload multiple files without overwriting existing objects?
AWS S3 supports object versioning on the bucket: when the same file (key) is uploaded again, S3 keeps every upload as a separate version within the bucket rather than overwriting it.
Versioning can be enabled using the AWS Console or the CLI. You may want to refer to this link for more info.
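A small sketch, assuming the AWS SDK for Java v2, of enabling versioning on an existing bucket and then uploading a file; the bucket name, key, and local file are placeholders:
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketVersioningStatus;
import software.amazon.awssdk.services.s3.model.PutBucketVersioningRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.VersioningConfiguration;
import java.nio.file.Paths;

public class VersionedUpload {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            // enable versioning once on the existing bucket
            s3.putBucketVersioning(PutBucketVersioningRequest.builder()
                    .bucket("my-existing-bucket")
                    .versioningConfiguration(VersioningConfiguration.builder()
                            .status(BucketVersioningStatus.ENABLED)
                            .build())
                    .build());
            // repeated uploads of the same key now create new versions instead of overwriting
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-existing-bucket")
                            .key("reports/report.csv")
                            .build(),
                    RequestBody.fromFile(Paths.get("report.csv")));
        }
    }
}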
You probably already found an answer to this, but if you're using the CDK or the CLI you can specify a destinationKeyPrefix. If you want multiple folders in an S3 bucket, which was my case, the folder name will be your destinationKeyPrefix.
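And a sketch of the CDK route (Java, CDK v2); the construct IDs, bucket name, local path, and prefix are placeholders. BucketDeployment copies the local files into the bucket under destinationKeyPrefix, so objects under other prefixes are left untouched:
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.s3.Bucket;
import software.amazon.awscdk.services.s3.IBucket;
import software.amazon.awscdk.services.s3.deployment.BucketDeployment;
import software.amazon.awscdk.services.s3.deployment.Source;
import software.constructs.Construct;
import java.util.List;

public class UploadStack extends Stack {
    public UploadStack(final Construct scope, final String id) {
        super(scope, id);
        // reference the existing bucket instead of creating a new one
        IBucket existing = Bucket.fromBucketName(this, "ExistingBucket", "my-existing-bucket");
        BucketDeployment.Builder.create(this, "UploadFiles")
                .sources(List.of(Source.asset("./files-to-upload")))
                .destinationBucket(existing)
                .destinationKeyPrefix("incoming/")
                // keep objects already present under the prefix (prune defaults to true)
                .prune(false)
                .build();
    }
}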

Mule - Copy the directory from HDFS

I need to copy a directory (/tmp/xxx_files/xxx/Output), including its subfolders and files, from HDFS (Hadoop Distributed File System). I'm using the HDFS connector, but it seems it does not support this.
I always get an error like:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /tmp/xxx_files/xxx/Output/
I don't see any option in the HDFS connector for copying the files/directories inside the specified path. It always expects file names to be copied.
Is it possible to copy a directory, including its subfolders and files, from HDFS using the MuleSoft HDFS connector?
As the technical documentation of the HDFS connector on the official MuleSoft website states, the code is hosted at the connector's GitHub site:
The Anypoint Connector for the Hadoop Distributed File System (HDFS)
is used as a bi-directional gateway between applications. Its source
is stored at the HDFS Connector GitHub site.
What it does not state is that more detailed technical documentation is also available on the GitHub site.
There you can also find various examples of how to use the connector for basic file-system operations.
The links seem to be broken in the official MuleSoft documentation.
You can find the repository here:
https://github.com/mulesoft/mule-hadoop-connector
The operations are implemented in the HdfsOperations Java class (see also the FileSystemApiService class).
As you can see, the functionality you expect is not implemented; it is not supported out of the box.
You can't copy a directory, including its subfolders and files, from HDFS using the HDFS connector without further effort.
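To illustrate what that further effort might look like outside the connector, here is a hedged sketch that uses the plain Hadoop client API instead (FileUtil.copy walks a directory tree recursively); the namenode URI and target path are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class CopyHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FileSystem local = FileSystem.getLocal(conf);
        // recursively copy the whole directory, sub-folders and files included
        FileUtil.copy(hdfs, new Path("/tmp/xxx_files/xxx/Output"),
                local, new Path("/tmp/output-copy"),
                false /* do not delete the source */, conf);
    }
}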

Flink Streaming AWS S3 read multiple files in parallel

I am new to Flink. My understanding is that the following API call
StreamExecutionEnvironment.getExecutionEnvironment().readFile(format, path)
will read the files in parallel for the given S3 bucket path.
We are storing log files in S3. The requirement is to serve multiple client requests that read from different folders with timestamps.
For my use case, to serve multiple client requests, I am evaluating Flink. So I want Flink to read in parallel from different AWS S3 file paths.
Is it possible to achieve this in a single Flink job? Any suggestions?
Documentation about the S3 file system support can be found here.
You can read from different directories and use the union() operator to combine all the records from different directories into one stream.
It is also possible to read nested files by using something like (untested):
TextInputFormat format = new TextInputFormat(path);
Configuration config = new Configuration();
config.setBoolean("recursive.file.enumeration", true);
format.configure(config);
env.readFile(format, path);
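A minimal sketch of the union() approach mentioned above; the bucket layout and job name are made up for illustration:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiPathS3Read {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // hypothetical folders for two different clients / timestamps
        DataStream<String> clientA = env.readTextFile("s3://my-log-bucket/client-a/2023-01-01/");
        DataStream<String> clientB = env.readTextFile("s3://my-log-bucket/client-b/2023-01-01/");
        // combine the records from both directories into a single stream
        DataStream<String> combined = clientA.union(clientB);
        combined.print();
        env.execute("read-multiple-s3-paths");
    }
}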

In NiFi is it possible to read selectively through FetchS3Object processor?

In Apache NiFi, using FetchS3Object to read from an S3 bucket, I see it can read all the objects in the bucket as they are added. Is it possible:
To configure the processor to read only objects added from now on, not the ones already present?
To make it read a particular folder in the bucket?
NiFi seems great; it's just missing examples in its documentation for at least the popular processors.
A combination of ListS3 and FetchS3Object processors will do this:
ListS3 - to enumerate your S3 bucket and generate flowfiles referencing each object. You can configure the Prefix property to specify a particular folder in the bucket to enumerate only a subset. ListS3 keeps track of what it has read using NiFi's state feature, so it will generate new flowfiles as new objects are added to the bucket.
FetchS3Object - to read S3 objects into flowfile content. You can use the output of ListS3 by configuring the FetchS3Object's Bucket property to ${s3.bucket} and Object Key property to ${filename}.
Another approach would be to configure your S3 bucket to send SNS notifications and subscribe an SQS queue to them. NiFi would read from the SQS queue to receive the notifications, filter for objects of interest, and process them.
See Monitoring An S3 Bucket in Apache NiFi for more on this approach.
Use the GetSQS and FetchS3Object processors, and configure the GetSQS processor to listen for notifications about newly added files. It's an event-driven approach: whenever a new file arrives, the SQS queue sends a notification to NiFi.
Use the link below for full clarification:
AWS-NIFI integration