In Apache NiFi, using FetchS3Object to read from an S3 bucket, I see it can read all the objects in the bucket as they are added. Is it possible:
To configure the processor to read only objects added from now onwards, not the ones already present?
How can I make it read a particular folder in the bucket?
NiFi seems great; it's just missing examples in its documentation, at least for the popular processors.
A combination of ListS3 and FetchS3Object processors will do this:
ListS3 - to enumerate your S3 bucket and generate flowfiles referencing each object. You can configure the Prefix property to specify a particular folder in the bucket to enumerate only a subset. ListS3 keeps track of what it has read using NiFi's state feature, so it will generate new flowfiles as new objects are added to the bucket.
FetchS3Object - to read S3 objects into flowfile content. You can use the output of ListS3 by configuring the FetchS3Object's Bucket property to ${s3.bucket} and Object Key property to ${filename}.
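As a rough sketch (the bucket name, prefix and region below are illustrative, not taken from the question), the pair might be configured like this:
ListS3
  Bucket: my-bucket
  Region: eu-west-1
  Prefix: incoming/            (enumerate only this "folder")
FetchS3Object
  Bucket: ${s3.bucket}
  Object Key: ${filename}
  Region: eu-west-1
Route ListS3's success relationship into FetchS3Object; each flowfile emitted by ListS3 carries the s3.bucket and filename attributes that FetchS3Object then uses.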
Another approach would be to configure your S3 bucket to send SNS notifications and subscribe an SQS queue to that topic. NiFi would read from the SQS queue to receive the notifications, filter for objects of interest, and process them.
See Monitoring An S3 Bucket in Apache NiFi for more on this approach.
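For completeness, here is a hedged sketch of the notification wiring with the AWS CLI (the bucket name, topic ARN and file name are placeholders):
aws s3api put-bucket-notification-configuration \
    --bucket my-bucket \
    --notification-configuration file://notification.json
where notification.json contains something like:
{
  "TopicConfigurations": [
    {
      "Id": "new-object-events",
      "TopicArn": "arn:aws:sns:eu-west-1:123456789012:new-s3-objects",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
You would then subscribe your SQS queue to that SNS topic and point NiFi's GetSQS processor at the queue.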
Use the GetSQS and FetchS3Object processors, and configure GetSQS to listen for notifications about newly added files. It's an event-driven approach: whenever a new file arrives, the SQS queue sends a notification to NiFi.
See the link below for a fuller walkthrough:
AWS-NiFi integration
Related
I am currently fetching an S3 file using the FetchS3Object processor. But it is a time-driven process, and sometimes the S3 files are dumped later than expected, so the processor is unable to fetch the file.
Is there a way to make the processor event-driven, or a way to make the processor run in a loop until it fetches the file?
You have at least these options:
define an AWS Lambda function that sends an SQS message every time a new object is added to the S3 bucket, and then consume this event in NiFi with the GetSQS processor (a sketch of such a Lambda follows below this list)
use the ListS3 processor, which will detect every new object added to the S3 bucket
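A minimal sketch of the first option, assuming the AWS SDK for Java v1 and an older (2.x) aws-lambda-java-events are on the classpath; the class name and queue URL are made up for illustration:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

// Triggered by s3:ObjectCreated:* events on the bucket; forwards each new object's
// location to an SQS queue that NiFi's GetSQS processor polls.
public class S3ToSqsForwarder implements RequestHandler<S3Event, Void> {

    // Placeholder queue URL -- replace with your own.
    private static final String QUEUE_URL =
            "https://sqs.eu-west-1.amazonaws.com/123456789012/new-s3-objects";

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    @Override
    public Void handleRequest(S3Event event, Context context) {
        for (S3EventNotificationRecord record : event.getRecords()) {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            // The message body carries the object location; downstream, FetchS3Object reads it.
            sqs.sendMessage(QUEUE_URL, bucket + "/" + key);
        }
        return null;
    }
}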
I am trying to define Camel S3 Source connectors. I have searched quite a bit without success for answers to the questions below.
How can I set up my connector so that a file in the S3 bucket doesn't get deleted, but is instead moved to another folder that I specify, after it is processed?
Is it possible to define separate connectors with different value converters for the same bucket, perhaps split by folders under the bucket? The connectors will use different Kafka topics based on the file type. How can I specify the bucket plus folder when defining the connector properties?
Thank you
Is it possible to read events as they land in an S3 source bucket via Apache Flink, process them, and sink the results back to some other S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples mentioned in the Apache Flink docs?
How does checkpointing happen in such a case? Does Flink keep track of what it has read from the S3 source bucket automatically, or does that need custom code? Does Flink also guarantee exactly-once processing in the S3 source case?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).
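A minimal, untested sketch of that pattern with the DataStream API; the bucket path and poll interval below are assumptions:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

String path = "s3://my-source-bucket/incoming/";   // placeholder bucket/prefix
TextInputFormat format = new TextInputFormat(new Path(path));

// Re-scan the path every 60 seconds and ingest files that have appeared or changed.
DataStream<String> lines =
        env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60_000L);

// ...transform `lines` and write the results to another bucket, e.g. with a file sink.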
I am new to Flink. My understanding is that the following API call
StreamExecutionEnvironment.getExecutionEnvironment().readFile(format, path)
will read the files in parallel for a given S3 bucket path.
We are storing log files in S3. The requirement is to serve multiple client requests to read from different folders with timestamps.
For my use case, to serve multiple client requests, I am evaluating Flink. So I want Flink to perform AWS S3 reads in parallel for different S3 file paths.
Is it possible to achieve this in a single Flink job? Any suggestions?
Documentation about the S3 file system support can be found here.
You can read from different directories and use the union() operator to combine all the records from different directories into one stream.
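A small, untested sketch of that (the S3 paths are placeholders, and env is a StreamExecutionEnvironment as in the snippet that follows):

DataStream<String> clientA = env.readTextFile("s3://logs-bucket/clientA/2020-07-01/");
DataStream<String> clientB = env.readTextFile("s3://logs-bucket/clientB/2020-07-01/");
DataStream<String> allLogs = clientA.union(clientB);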
It is also possible to read nested files by using something like (untested):
// path is the S3 directory to read, as a String (e.g. "s3://logs-bucket/clientA/")
TextInputFormat format = new TextInputFormat(new Path(path));
Configuration config = new Configuration();
config.setBoolean("recursive.file.enumeration", true);  // also descend into sub-directories
format.configure(config);
env.readFile(format, path);
We have an existing setup of AEM 6.1 which uses TarMK for data storage. To migrate all assets to S3, I followed all the steps here: https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html#Data%20Store%20Configurations (Amazon S3 Data Store). Apparently the data synced to S3, but when I check the disk usage report I still see that assets are using disk space, even for existing and newly added assets. What's the purpose of using S3 for assets if they still use disk space? Or am I doing something wrong? How can I verify that my setup is really using S3? Here is my S3DataStore.config
accessKey="xxxxxxxxxx"
secretKey="xxxxxxxxxx"
s3Bucket="dev-aem-assets-local"
s3Region="eu-west-1"
connectionTimeout="120000"
socketTimeout="120000"
maxConnections="40"
writeThreads="30"
maxErrorRetry="10"
continueOnAsyncUploadFailure=B"true"
cacheSize="0"
minRecordLength="10"
Another question: do I need to do the same setup on the publisher? Or is it OK to do it only on the author and use the publisher as is, replicating the binary data to it?
There are a few parts to your question, so I'll break the answer into logical blocks. Shout if I miss anything.
Your setup for migration is correct and S3 will use disk space. This is for the write-through cache.
AEM uses a write-through cache for writing to S3, and all the settings for this cache are in your S3 config file. Any writes to the data store are first written to this cache; asynchronous background threads then upload them to the S3 bucket. This mechanism keeps AEM responsive, as it is not blocked by slow S3 writes, and reads of recently written blobs are fast because they don't need slow round trips to S3. In short, S3 I/O is too slow for AEM, so this cache boosts performance. You cannot disable it, as it is required for asynchronous writes to S3. You can reduce its size, but it's recommended to be at least 50% of your S3 bucket size.
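For example, to give the local cache roughly 30 GB you would set something like the line below in S3DataStore.config (the value is in bytes; the exact number here is only an illustration, not a recommendation for your bucket):
cacheSize="32212254720"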
You can verify your S3 setup by looking at your logs for messages related to AWS (grep for aws).
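On a default quickstart install that could be as simple as the following (the log path is an assumption about your layout):
grep -i aws crx-quickstart/logs/error.log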
As for publisher, yes you need to migrate from your old publisher to the new publisher. Assuming that you are not using binary-less replication, you will need a different S3 bucket for your publisher. In general, you migrate from author to author and publisher to publisher for a standard implementation.
You can also verify your S3 data usage by looking at the S3 bucket and the traffic on it. If versioning is enabled on your S3 bucket, all the blobs will show version stamping.
Async upload of blobs can be monitored from the logs, and IP traffic monitoring will show activity related to your S3 bucket. The most useful check is to watch the network traffic between your AEM server and the S3 endpoint.