Apache NiFi - What happens when you run the GetFile processor without any downstream processor?

I am a beginner with Apache NiFi and I want to move a file in my local filesystem from one location to another. When I used the GetFile processor to pick up files from the corresponding input directory and started it, the file disappeared. I haven't connected it to a PutFile processor. What exactly is happening here? Where does the file go if it disappears from the local directory I had placed it in, and how can I get it back?

GetFile has a Keep Source File property. If you set it to true, the file is not deleted after it has been copied from the Input Directory into the content repository; the default is false, which is why your files were deleted. You must also have set the success relationship to auto-terminate, because otherwise GetFile won't run without any downstream connection. Your files have been discarded. I'm not sure whether this will work, but try the Data Provenance option and replay the content.
Have a look at this - GetFile Official Doc and Replaying a FlowFile

Related

NiFi ListS3 Processor includes parent file path as a flow file

My files are in S3:
s3://my_bucket/my_path/data/category/myfile.txt
I am using the ListS3 processor with this bucket, passing "my_path/data/category/" as the prefix.
I will get TWO flow files:
"s3://my_bucket/my_path/data/category/myfile.txt"
and
"s3://my_bucket/my_path/data/category/"
The second one does not correspond to an actual file; it is just the parent path.
How can I change my processor configuration to only get the entry for "myfile.txt"?
Also, FetchS3 seems to be picking this entry up and sending it to the next processor, ExecuteScript, which modifies the contents of the file.
This ExecuteScript processor is obviously failing, but it is not logging anything; instead, the flow file is just stuck in the queue.
How do I make it send this to the failure path instead of being stuck in the queue?
Found the solution! There is a 'Delimiter' property on the ListS3 processor where I needed to set '/' as the delimiter. Amazon S3 uses the delimiter to exclude the parent directory entry from the listing.
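The same folder-placeholder entry shows up when listing the bucket directly with the AWS SDK. As an illustration (not the ListS3 internals), here is a minimal Python sketch, assuming boto3 is installed and credentials are configured; the bucket and prefix are the hypothetical names from the example above. It simply skips the zero-byte "folder" keys that end in '/':

# Minimal sketch, assuming boto3 and configured AWS credentials.
# Bucket and prefix are hypothetical, matching the example above.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my_bucket", Prefix="my_path/data/category/")

for obj in resp.get("Contents", []):
    # Keys ending in "/" are zero-byte "folder" placeholders created by
    # some tools; skip them so only real objects remain.
    if obj["Key"].endswith("/"):
        continue
    print(obj["Key"], obj["Size"])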

AWS CloudWatch Agent not uploading old files

During the initial migration to AWS CloudWatch logging I also want legacy log files to be synced. However, it seems that only the currently active file (i.e. the one still being updated) is synced; the old files are ignored even when they match the file name format.
So is there any easy way to upload the legacy files?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Short answer: you should be able to upload all the files by merging them, or by creating a new [logstream] section for each file.
Log files in /var/log are usually archived periodically, for instance by logrotate. If the current active file is named abcd.log, then after a few days files will be created automatically with names like abcd.log.1, abcd.log.2...
Depending on your exact system and configuration, they can also be compressed automatically (abcd.log.1.gz, abcd.log.2.gz, ...).
The CloudWatch Logs documentation defines the file configuration parameter as such:
file
Specifies log files that you want to push to CloudWatch Logs. File can point to a specific file or multiple files (using wildcards such as /var/log/system.log*). Only the latest file is pushed to CloudWatch Logs based on file modification time.
Note: using a glob path with a star (*) will therefore not be sufficient to upload historical files.
Assuming that you have already configured a glob path, you could use the touch command sequentially on each of the historical files to trigger their upload. Problems:
you would need to guess when the CloudWatch agent has noticed each file before proceeding to the next
you would need to temporarily pause the current active file
zipped files are not supported, but you can decompress them manually
Alternatively, you could decompress and then aggregate all historical files into a single merged file. In the context of the first example, you could run cat abcd.log.* > abcd.log.merged. This newly created file would be detected by the CloudWatch agent (it matches the glob pattern), which would consider it the active file. Problem: the previous active file could be updated simultaneously and take the lead before CloudWatch notices your merged file. If this is a concern, you could simply create a new [logstream] config section dedicated to the historical file.
Alternatively, just decompress the historical files then create a new [logstream] config section for each.
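As a rough illustration of what such a dedicated section might look like in the agent configuration file described on the AgentReference page linked above (the section, log group and stream names here are hypothetical, and the datetime format depends on your logs):

[historical-abcd-log]
file = /var/log/abcd.log.merged
log_group_name = my-app-logs
log_stream_name = abcd-historical
datetime_format = %b %d %H:%M:%S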
Please correct any bad assumptions that I made about your system.

SFTP file locking semantics [duplicate]

How can I make sure that a file uploaded through SFTP (on a Linux-based system) stays locked during the transfer, so that an automated system will not read it?
Is there an option on the client side? Or server side?
SFTP protocol supports locking since version 5. See the specification.
You didn't specify what SFTP server you are using, so I'm assuming the most widespread one, OpenSSH. OpenSSH supports SFTP version 3 only, so it does not support locking.
Anyway, even if your server supported file locking, most SFTP clients/libraries won't support SFTP version 5, and even if they do, they won't support the locking feature. Note that the lock is explicit; the client has to request it.
There are some common workarounds for the problem:
As suggested by @user1717259, you can have the client upload a "done" file once an upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after an upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some SFTP servers have this functionality built in. For example, ProFTPD with its HiddenStores directive (courtesy of @fakedad).
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell when you have downloaded an incomplete file.
A typical way of solving this problem is to upload your real file, and then to upload an empty 'done.txt' file.
The automated system should wait for the appearance of the 'done' file before trying to read the real file.
A simple file locking mechanism for SFTP is to first upload the file to a directory (folder) where the reading process isn't looking. You can "make" an alternate folder using the sftp> mkdir command. Upload the file to the alternate directory instead of the ultimate destination directory. Once the sftp> put command completes, do a move like this:
sftp> move alternate_path/filename destination_path/filename. Since the SFTP "move" just switches the file pointers, it is atomic, so it is an effective lock.
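To illustrate the upload-then-rename idea from the answers above on the client side, here is a minimal sketch using Python's paramiko library (an assumption; the host, credentials and paths are hypothetical):

# Minimal sketch of the "upload under a temporary name, then rename" workaround.
# Assumes paramiko is installed; host, credentials and paths are hypothetical.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.example.com", username="user", password="secret")

sftp = ssh.open_sftp()
try:
    # Upload under a name the automated reader is told to ignore...
    sftp.put("report.csv", "upload/report.csv.filepart")
    # ...then rename it to its final name once the transfer has finished.
    sftp.rename("upload/report.csv.filepart", "upload/report.csv")
finally:
    sftp.close()
    ssh.close()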

NFS server receives multiple inotify events on new file

I have 2 machines in our datacenter: a public server and an internal storage server.
The public server exposes part of the internal server's storage through FTP. When files are uploaded to the FTP server, they in fact end up on the internal storage. But when watching the inotify events on the internal server's storage, I notice the file gets written in chunks, probably due to buffering on the client side. The software on the internal server watches the inotify events to determine whether new files have arrived, but due to the way NFS writes the files, there is no good way of telling when a file is complete. Is there a way to tell the NFS client to write files in only one operation, or is there a workaround for this behaviour?
EDIT:
The events i get on the internal server, when uploading a file of around 900 MB are:
./ CREATE big_buck_bunny_1080p_surround.avi
# after the CREATE I get around 250K MODIFY and CLOSE_WRITE,CLOSE events:
./ MODIFY big_buck_bunny_1080p_surround.avi
./ CLOSE_WRITE,CLOSE big_buck_bunny_1080p_surround.avi
# when the upload finishes I get a CLOSE_NOWRITE,CLOSE
./ CLOSE_NOWRITE,CLOSE big_buck_bunny_1080p_surround.avi
Of course, I could listen for the CLOSE_NOWRITE event, but the inotify documentation says:
close_nowrite
A watched file or a file within a watched directory was closed, after being opened in read-only mode.
Which is not exactly the same as "the file is complete". The only workaround I see is to use .part or .filepart files, move them to the original filename once the upload has finished, and ignore the .part files in my storage watcher. The disadvantage is that I'll have to explain to customers how to upload with .part, and not many FTP clients support this by default.
Basically, if you want to check when the write operation has completed, monitor the IN_CLOSE_WRITE event.
IN_CLOSE_WRITE is fired when a file that was open for writing gets closed. Even if the file gets transferred in chunks, the FTP server will close the file only after the whole file has been transferred.
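A minimal sketch of watching for that event from Python, assuming the third-party inotify_simple package (the watched path is hypothetical):

# Minimal sketch, assuming the inotify_simple package is installed.
# The watched directory is hypothetical.
from inotify_simple import INotify, flags

inotify = INotify()
inotify.add_watch("/srv/storage/incoming", flags.CLOSE_WRITE)

while True:
    # read() blocks until at least one event arrives
    for event in inotify.read():
        if event.mask & flags.CLOSE_WRITE:
            print("finished writing:", event.name)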

How to detect that a file is being uploaded over FTP

My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to process the file, but I don't know how to detect that a file has not finished uploading.
Is there any way to detect whether a file has not yet been released by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that the file is incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file, call stat() again. If the result is different, skip the file, since it was modified during the last few seconds.
If the stat() result is unchanged, download the file.
This whole FTP-fetching approach is old and obsolete technology. I hope the customer will use a modern HTTP API next time :-)
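A minimal sketch of the stat()-polling steps above, assuming the ftputil package (host, credentials, directory and wait time are hypothetical):

# Minimal sketch of the stat()-polling workaround, assuming ftputil is installed.
# Host, credentials, directory and wait time are hypothetical.
import time
import ftputil

with ftputil.FTPHost("ftp.example.com", "user", "secret") as host:
    names = host.listdir("incoming")
    before = {name: host.stat("incoming/" + name) for name in names}

    time.sleep(30)  # wait N seconds

    for name in names:
        after = host.stat("incoming/" + name)
        if (after.st_size, after.st_mtime) != (before[name].st_size, before[name].st_mtime):
            continue  # still being modified, skip it for now
        host.download("incoming/" + name, name)  # unchanged, assume the upload is complete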
If you are reading files with particular extensions, use WinSCP for the file transfer. It will create a temporary file with the extension .filepart and rename it to the actual file extension once it has fully transferred the file.
I hope it will help someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send the file, then send a second short "marker" file just to tell the recipient that the transfer of the first is complete. You can use a file naming convention and just check for the existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have this problem with the second file because you just wait until the file size equals the checksum size.)
And of course this only works if you can get the sender to send a second file.
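For the sender's side, here is a minimal sketch of the "marker file with checksum" idea, using Python's standard ftplib (host, credentials and file names are hypothetical):

# Minimal sketch of the "marker file with checksum" idea, using the standard
# library's ftplib. Host, credentials and file names are hypothetical.
import hashlib
from ftplib import FTP

ftp = FTP("ftp.example.com")
ftp.login("user", "secret")

# Upload the real file first and compute its checksum.
with open("data.csv", "rb") as fh:
    ftp.storbinary("STOR data.csv", fh)
    fh.seek(0)
    checksum = hashlib.md5(fh.read()).hexdigest()

# Then upload a small marker file whose content is that checksum, so the
# recipient can both detect completion and verify the first file.
with open("data.csv.done", "w") as fh:
    fh.write(checksum + "\n")
with open("data.csv.done", "rb") as fh:
    ftp.storbinary("STOR data.csv.done", fh)

ftp.quit()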