Read content of server files in pentaho - pentaho

My task is to generate a report based on the content of server files (ASCII) in Pentaho.
I could come up with a job that establishes the connection, gets the required files onto disk, and generates the report. But I want to change the flow so that the files, or their content, are read into memory rather than written to disk.
I established the connection to the server with the job entry 'Get a file with SFTP', and the file content is injected into a transformation that starts with a 'Modified JavaScript Value' step.
Could someone please help me with this?

If the files are on an FTP server, or any file system that VFS can access (see VFS Supported Systems), you can access them through the Get File Names transformation step, by passing the Filename attribute directly into the input step you use (Text file input, XML, or any other).
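For example (the host, credentials, and path below are made up for illustration), a VFS-style SFTP URL can go straight into the File/Directory field of the Get File Names step, or into the Filename field of a Text file input step, so VFS streams the content without staging a local copy first:

    sftp://user:password@example.com/inbox/report_input.txt

The exact URL syntax depends on the VFS provider you use, so check the VFS documentation for your scheme.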

Related

Apache NiFi - What happens when you run the GetFile processor without any downstream processor?

I am a beginner with Apache NiFi and I want to move a file in my local filesystem from one location to another. When I used the GetFile processor to move files from the corresponding input directory and started it, the file disappeared. I haven't connected it to a PutFile processor. What exactly is happening here? Where does the file go if it disappears from the local directory I had placed it in? Also, how can I get it back?
GetFile has a Keep Source File property. If it is set to true, the file is not deleted after it has been copied from the Input Directory into the Content Repository; the default is false, which is why your files are deleted. You must also have auto-terminated the success relationship, otherwise GetFile won't run without any downstream connection. So your files have been discarded. Not sure whether this will work, but try the Data Provenance option and replay the content.
Have a look at these - GetFile Official Doc and Replaying a FlowFile

AWS CloudWatch Agent not uploading old files

During the initial migration to AWS CloudWatch logging I also want legacy log files to be synced. However, it seems that only the currently active file (i.e. the one still being updated) will be synced. The old files are ignored even when they match the file name format.
So is there any easy way to upload legacy files?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Short answer: you should be able to upload all files by merging them. Or create a new [logstream] section for each file.
Log files in /var/log are usually archived periodically, for instance by logrotate. If the current active file is named abcd.log, then after a few days files will be created automatically with names like abcd.log.1, abcd.log.2...
Depending on your exact system and configuration, they can also be compressed automatically (abcd.log.1.gz, abcd.log.2.gz, ...).
The CloudWatch Logs documentation defines the file configuration parameter as such:
file
Specifies log files that you want to push to CloudWatch Logs. File can point to a specific file or multiple files (using wildcards such as /var/log/system.log*). Only the latest file is pushed to CloudWatch Logs based on file modification time.
Note: using a glob path with a star (*) will therefore not be sufficient to upload historical files.
Assuming that you have already configured a glob path, you could use the touch command sequentially on each of the historical files to trigger their upload. Problems:
you would need to guess when the CloudWatch agent has noticed each file before proceeding to the next
you would need to temporarily pause the current active file
zipped files are not supported, but you can decompress them manually
Alternatively, you could decompress and then aggregate all historical files into a single merged file. In the context of the first example, you could run cat abcd.log.* > abcd.log.merged. This newly created file would be detected by the CloudWatch agent (it matches the glob pattern), which would consider it the active file. Problem: the previous active file could be updated simultaneously and take the lead before CloudWatch notices your merged file. If this is a concern, you could simply create a new [logstream] config section dedicated to the historical file.
Alternatively, just decompress the historical files then create a new [logstream] config section for each.
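In case it helps, here is a minimal Python sketch of that last option, assuming the classic awslogs agent with an INI-style config file; the config path, the log group name, and the abcd.log naming from the example above are placeholders to adapt:

    # Decompress each rotated log and append a [logstream] section for it to the
    # agent config. Paths and the log group name are assumptions, not from the question.
    import glob
    import gzip
    import shutil

    CONFIG_PATH = "/etc/awslogs/awslogs.conf"   # adjust to your agent's config location
    LOG_GROUP = "my-app-legacy"                 # placeholder log group name

    sections = []
    for gz_path in sorted(glob.glob("/var/log/abcd.log.*.gz")):
        plain_path = gz_path[:-3]               # strip the ".gz" suffix
        with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)        # decompress next to the original
        stream_name = plain_path.rsplit("/", 1)[-1]
        template = (
            "\n[{name}]\n"
            "file = {path}\n"
            "log_group_name = {group}\n"
            "log_stream_name = {name}\n"
            "datetime_format = %b %d %H:%M:%S\n"
        )
        sections.append(template.format(name=stream_name, path=plain_path, group=LOG_GROUP))

    with open(CONFIG_PATH, "a") as cfg:
        cfg.writelines(sections)

After appending the sections you would restart the agent so it picks up the new configuration.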
Please correct any bad assumptions that I made about your system.

RavenDb File Storage issue based on 3.0.3 version

We have a critical file storage issue with RavenDB.
The context: we have a web application server and a separate file service server based on RavenDB. The user tries to upload a file named "xxx_Living #Mapping.pdf", for example. The web application then sends the request to the file server, wrapping the file content in the request body and passing the encoded file name as a parameter to the PUT action. But every character from the '#' onward is stripped by the time the request reaches the file server, for an unknown network reason, resulting in "xxx_Living". The file server is able to create the file by calling filesStore.AsyncFilesCommands.UploadAsync(fileName, Request.Body) without an exception.
The issue is that we are able to view the uploaded file from Raven Studio by sending the request /RavenDbServer/fs/ClientFile/search?query=__directoryName%3A%2Fclientattachments+AND+__level%3A2&start=108&pageSize=27. (see the snapshot attached)
But we are not able to see the metadata when selecting the target file in the list, which sends the request /RavenDbServer/studio/index.html#filesystems/edit?&id=clientattachments%2F0aecef9c-6dd0-4a9e-9df3-039228576471_Living%20&filesystem=clientFile. (see the snapshot attached)
We also checked the network traffic: the request RavenDbServer/fs/clientFile/files/clientattachments%2F0aecef9c-6dd0-4a9e-9df3-039228576471_Living%20 returns a 404 code.
My question is:
1) Are these kinds of files stored successfully in the RavenDB file server, given that we can see the id, file size, and last-modified date in the list page, but cannot open the file from RavenDB Studio or download it by calling the API filesStore.AsyncFilesCommands.DownloadAsync(FileName)?
2) Can we store a file in the Raven file server when the file name contains special characters such as '#', or any others?

How to detect that a file is being uploaded over FTP

My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to process the file, but I don't know how to detect that a file has not finished uploading.
Is there any way to detect whether a file has not yet been released by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear (a small sketch of this, combined with the ".filepart" convention below, follows this list of workarounds).
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that the file is incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
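To make the first and third workarounds concrete, here is a rough Python sketch of the receiving side; the watch directory, the ".done" marker suffix, and the polling interval are assumptions for illustration, not part of the question:

    # Poll an upload folder, skip files that are still being uploaded (".filepart"),
    # and only process a file once the client has uploaded a matching ".done" marker.
    import os
    import time

    WATCH_DIR = "/srv/ftp/upload"   # hypothetical upload folder

    def ready_files():
        for name in os.listdir(WATCH_DIR):
            if name.endswith(".filepart") or name.endswith(".done"):
                continue                      # partial upload, or a marker itself
            if os.path.exists(os.path.join(WATCH_DIR, name + ".done")):
                yield os.path.join(WATCH_DIR, name)

    while True:
        for path in ready_files():
            print("processing", path)         # replace with the real treatment
            os.remove(path + ".done")         # consume the marker so it runs once
        time.sleep(10)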
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file: call stat() again. If the result differs, skip the file, since it was modified during those N seconds.
If the stat() result is unchanged, download the file.
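Roughly, that procedure looks like the sketch below; the host name, credentials, and the 60-second wait are placeholders:

    # Take a (size, mtime) snapshot, wait, take another one, and only download the
    # files whose attributes did not change in between.
    import time
    import ftputil

    HOST, USER, PASSWORD = "ftp.example.com", "user", "password"
    WAIT_SECONDS = 60

    def snapshot():
        # (size, mtime) for every regular file in the remote working directory.
        with ftputil.FTPHost(HOST, USER, PASSWORD) as host:
            return {
                name: (host.stat(name).st_size, host.stat(name).st_mtime)
                for name in host.listdir(".")
                if host.path.isfile(name)
            }

    first = snapshot()
    time.sleep(WAIT_SECONDS)
    second = snapshot()

    with ftputil.FTPHost(HOST, USER, PASSWORD) as host:
        for name, attrs in second.items():
            if first.get(name) != attrs:
                continue                  # new or still changing: skip this round
            host.download(name, name)     # unchanged for N seconds: fetch it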
This whole ftp-fetching is old and obsolete technology. I hope that the customer will use a modern http API the next time :-)
If you are reading files of particular extensions, then use WinSCP for the file transfer. It creates a temporary file with the extension .filepart and renames it to the actual file name once the transfer is complete.
I hope it will help someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.
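As a rough illustration of the checksum-marker idea on the sending side (the file name, server, and credentials below are invented, and ftplib stands in for whatever client the sender actually uses):

    # Compute a checksum of the payload, upload the big file first, then upload a
    # small ".md5" marker file last; the recipient waits for the marker and can use
    # its content to verify the payload.
    import ftplib
    import hashlib

    DATA_FILE = "report.csv"

    with open(DATA_FILE, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(DATA_FILE + ".md5", "w") as marker:
        marker.write(digest)

    with ftplib.FTP("ftp.example.com", "user", "password") as ftp:
        with open(DATA_FILE, "rb") as f:
            ftp.storbinary("STOR " + DATA_FILE, f)            # payload first
        with open(DATA_FILE + ".md5", "rb") as f:
            ftp.storbinary("STOR " + DATA_FILE + ".md5", f)   # marker last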

Uploading via SFTP over slow connection to temporary location then moving to real location

I have an issue where occasionally I need to work at Starbucks.
When I upload a PHP file the connection is slow, so if a user tries to access the PHP file while I am uploading it, they will of course be issued a fatal error.
This is very inconvenient for my busy websites. Is there a way that, when a file is uploaded, it can go to a temporary location and then be moved by the server to the real location once finished?
You can make WinSCP upload the file to a temporary file name and rename it automatically once the transfer completes.
In Preferences go to the Transfer > Endurance tab and select All Files in the Enable ... Transfer to temporary file name box.
For details refer to:
https://winscp.net/eng/docs/ui_pref_resume
Why don't you just upload the file to a temporary folder on the server and execute commands on the server to remove the old file and move the new file? It should move the file fast enough on the server to eliminate any hiccups the users would see unless their timing was just right.
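If you prefer scripting it yourself rather than relying on the WinSCP preference, a rough sketch with paramiko follows; the host, credentials, and paths are placeholders, and posix_rename requires the server to support the OpenSSH rename extension:

    # Upload to a temporary remote name over SFTP, then atomically rename it over
    # the live file, so visitors never see a half-written script.
    import paramiko

    HOST, USER, PASSWORD = "example.com", "deploy", "secret"
    LOCAL, REMOTE = "index.php", "/var/www/html/index.php"

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER, password=PASSWORD)

    sftp = client.open_sftp()
    sftp.put(LOCAL, REMOTE + ".uploading")              # slow transfer hits the temp name
    sftp.posix_rename(REMOTE + ".uploading", REMOTE)    # atomic swap into place
    sftp.close()
    client.close()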