NiFi ListS3 Processor includes parent file path as a flow file

My file in S3:
s3://my_bucket/my_path/data/category/myfile.txt
Using the ListS3 processor with this bucket and "my_path/data/category/" as the Prefix, I get TWO flow files:
"s3://my_bucket/my_path/data/category/myfile.txt"
and
"s3://my_bucket/my_path/data/category/"
The second one is not an actual file but only the parent path.
How can I change my processor configuration to only get the entry for "myfile.txt"?
Also, FetchS3Object picks this entry up and sends it to the next processor, ExecuteScript, which modifies the contents of the file.
That ExecuteScript processor obviously fails on the directory entry, but it does not log the error; the flow file just sits stuck in the queue.
How do I make it route this flow file to the failure relationship instead of leaving it stuck in the queue?

Found the solution! There is a 'Delimiter' property on the ListS3 processor where I needed to set '/' as the delimiter. Amazon S3 uses the delimiter to exclude the parent directory from the listing.
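For reference, a hedged sketch of the resulting ListS3 configuration (the bucket and prefix are the ones from the question; property names as they appear on the processor):

Bucket:    my_bucket
Prefix:    my_path/data/category/
Delimiter: /

With the delimiter set, only the entry for myfile.txt is emitted as a flow file, as described above.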

Related

How to read a S3 file from a REST Endpoint using Apache Camel 2.X

I need to read a file from an S3 bucket based on the file name passed from a REST endpoint. The file name is passed as a URI parameter. The Camel route I am using is given below.
rest().get("/file")
.to("direct:readFile");
from("direct:readFile")
.setProperty("FileName",simple("${header.fileName}"))
.log(LoggingLevel.INFO, "Reading file from S3")
.setHeader(S3Constants.KEY, simple("${header.fileName}"))
.to("aws-s3://"+awsBucket+"?amazonS3Client=#s3Client&fileName=${header.fileName}")
...
Instead of reading the file, this route currently overwrites the existing file in S3 with an empty file. How do I specify that this is a read operation and not a write operation?
I know that the latest version has an operation=getObject parameter for this, but I am on 2.18.1 (the rest of the application is also on this version), and that version does not support this option.
Is there any other way to achieve this?
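Not from the original post, but one possible workaround on Camel 2.18.1 is to skip the aws-s3 producer entirely and call the registered AmazonS3 client from a processor, putting the object content into the message body. A minimal sketch, assuming the same #s3Client bean is available in the registry; the bucket name is a placeholder:

// Needs: com.amazonaws.services.s3.AmazonS3, com.amazonaws.services.s3.model.S3Object
from("direct:readFile")
    .log(LoggingLevel.INFO, "Reading file from S3")
    .process(exchange -> {
        // Look up the same AmazonS3 client that is registered as #s3Client (assumption).
        AmazonS3 s3Client = exchange.getContext().getRegistry()
                .lookupByNameAndType("s3Client", AmazonS3.class);
        String key = exchange.getIn().getHeader("fileName", String.class);
        // getObject is a plain AWS SDK v1 call, so no aws-s3 producer operation is needed.
        S3Object object = s3Client.getObject("my-bucket", key); // bucket name is a placeholder
        // The S3ObjectInputStream should be fully read or closed downstream.
        exchange.getIn().setBody(object.getObjectContent());
    });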

Apache Nifi - What happens when you run getFile processor without any downstream processor

I am a beginner with Apache NiFi and I want to move a file in my local filesystem from one location to another. When I used the GetFile processor to pick files up from the corresponding input directory and started it, the file disappeared. I haven't connected it to a PutFile processor. What exactly is happening here? Where does the file go if it disappears from the local directory I had placed it in, and how can I get it back?
GetFile has a Keep Source File property. If it is set to true, the file is not deleted after it has been copied from the Input Directory into the content repository; the default is false, which is why your files were deleted. You must also have auto-terminated the success relationship, because GetFile won't run without any downstream connection, so your files have been discarded. Not sure whether this will work, but try the Data Provenance option and replay the content.
Have a look at these: GetFile Official Doc and Replaying a FlowFile

Event driven Fetch S3 file in Nifi

I am currently fetching an S3 file using the FetchS3Object processor. But it is a time-driven process, and sometimes the S3 files are dumped later than expected, so the processor is unable to fetch them.
Is there a way to make the processor event-driven or is there a way to make the processor run in a loop until it fetches the file?
You have at least these options:
define an AWS Lambda function that sends an SQS message every time a new object is added to the S3 bucket, then consume these events in NiFi with the GetSQS processor (a sketch of such a function follows below)
use the ListS3 processor, which will detect every new object added to the S3 bucket
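Not part of the original answer, but a rough sketch of what the Lambda side of the first option might look like, using the AWS SDK v1 and the Lambda Java events library. The class name, queue URL, and message format are placeholders, not a definitive implementation:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

public class S3ToSqsNotifier implements RequestHandler<S3Event, Void> {

    // Placeholder URL of the queue that NiFi's GetSQS processor polls.
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/new-s3-objects";

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    @Override
    public Void handleRequest(S3Event event, Context context) {
        for (S3EventNotificationRecord record : event.getRecords()) {
            String bucket = record.getS3().getBucket().getName();
            // Note: keys in S3 events are URL-encoded; decode them if they can contain special characters.
            String key = record.getS3().getObject().getKey();
            // The message body carries bucket/key so a downstream FetchS3Object
            // (fed via GetSQS plus attribute extraction) knows what to fetch.
            sqs.sendMessage(QUEUE_URL, bucket + "/" + key);
        }
        return null;
    }
}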

AWS CloudWatch Agent not uploading old files

During the initial migration to AWS CloudWatch logging I also want legacy log files to be synced. However, it seems that only the current active file (i.e. the one still being updated) is synced; old files are ignored even when they match the file name format.
So is there any easy way to upload the legacy files?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Short answer: you should be able to upload all files by merging them, or by creating a new [logstream] section for each file.
Log files in /var/log are usually archived periodically, for instance by logrotate. If the current active file is named abcd.log, then after a few days files will be created automatically with names like abcd.log.1, abcd.log.2...
Depending on your exact system and configuration, they can also be compressed automatically (abcd.log.1.gz, abcd.log.2.gz, ...).
The CloudWatch Logs documentation defines the file configuration parameter as such:
file
Specifies log files that you want to push to CloudWatch Logs. File can point to a specific file or multiple files (using wildcards such as /var/log/system.log*). Only the latest file is pushed to CloudWatch Logs based on file modification time.
Note: using a glob path with a star (*) will therefore not be sufficient to upload historical files.
Assuming that you have already configured a glob path, you could use the touch command sequentially on each of the historical files to trigger their upload. Problems:
you would need to guess when the CloudWatch agent has noticed each file before proceeding to the next
you would need to temporarily pause the current active file
zipped files are not supported, but you can decompress them manually
Alternatively you could decompress and then aggregate all historical files into a single merged file. In the context of the first example, you could run cat abcd.log.* > abcd.log.merged. This newly created file would be detected by the CloudWatch agent (it matches the glob pattern), which would consider it the active file. Problem: the previous active file could be updated simultaneously and take the lead before CloudWatch notices your merged file. If this is a concern, you could simply create a new [logstream] config section dedicated to the historical file.
Alternatively, just decompress the historical files then create a new [logstream] config section for each.
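Not from the original answer, but a hedged sketch of what such a dedicated [logstream] section might look like in the agent configuration (log group, stream, and path names are placeholders; see the AgentReference linked above for the exact keys):

[legacy-abcd-log]
file = /var/log/abcd.log.merged
log_group_name = my-application
log_stream_name = abcd-legacy
datetime_format = %b %d %H:%M:%S
initial_position = start_of_file

initial_position = start_of_file tells the agent to read the file from the beginning rather than only tailing new lines.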
Please correct any bad assumptions that I made about your system.

Nifi ListS3 processor not returning full path for files stored in S3

I am using the ListS3 processor to get files from S3 and piping the output into the RouteOnAttribute processor. From there I am using "Route to Property name" as the Routing Strategy and assigning properties based on which files I am listing.
I am able to see all the files I want, but I can't do anything with them because another processor down the line needs the full path of those files. I am using a Python script that takes the file path as a command line argument.
How do I extract the full absolute path of the files from S3?
You can list, download, and save S3 files locally using a sequence of NiFi processors like the following:
ListS3 - to get references to the S3 objects, which you can filter. The output of ListS3 contains only references to the objects, not the content itself, carried in attributes:
s3.bucket - name of the bucket, like my-bucket
filename - key of the object, like path/to/file.txt
FetchS3Object - to download object content from S3 using the bucket and key from ListS3 above.
PutFile - to store the file locally. Specify the Directory property where you want the files to be placed, e.g. /path/to/directory. The filename attribute coming from S3 contains the relative path of the S3 key, so it is appended to the Directory by default.
You can then assemble local paths for your Python script using NiFi expression language:
/path/to/directory/${filename}
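Not part of the answer above, but if the Python script mentioned in the question has to be invoked from NiFi itself, one hedged option is to call it with ExecuteStreamCommand after PutFile, building the argument from the same attributes (the script path and directory are placeholders):

Command Path:       python
Command Arguments:  /path/to/script.py;/path/to/directory/${filename}
Argument Delimiter: ;

Command Arguments supports NiFi expression language, so ${filename} resolves to each flow file's S3 key.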