I have 5,000 files in a folder, and new files keep getting loaded into the same folder on a daily basis. I need to pick up only the latest file each day from among all the files.
Is it possible to achieve this scenario in Mule out of the box?
I tried keeping the File component inside a Poll component (to make use of watermarks), but it isn't working.
Is there any way we can achieve this? If not, please suggest the best approach (any relevant links welcome).
Mule Studio: 5.3, Runtime: 3.7.2.
Thanks in advance
Short answer: there is no really quick out-of-the-box solution, but there are other ways. I'm not saying this is the right or only way of solving it, but I've implemented a similar scenario like this before:
A normal File inbound endpoint with a database table as a file log. Each time a file is picked up, a component checks whether its name already appears in the table. Using a choice router or a filter, I only continue if it isn't in there already, and after processing I add the filename to the table.
This is a quite "heavy" solution, though. A simpler approach would be to use an idempotent filter with an object store, for example a Redis server: https://github.com/mulesoft/redis-connector/blob/master/src/test/resources/redis-objectstore-tests-config.xml
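For reference, here's a minimal sketch of that idempotent-filter approach in Mule 3; the folder path, polling frequency, and store directory are placeholders, and the simple-text-file-store could be swapped for the Redis object store from the link above:

<flow name="latestFileFlow">
    <!-- polls the folder; only files not seen before pass the filter -->
    <file:inbound-endpoint path="/data/incoming" pollingFrequency="60000"/>
    <idempotent-message-filter idExpression="#[message.inboundProperties.originalFilename]">
        <!-- persists processed filenames across restarts -->
        <simple-text-file-store name="processed-files" directory="/opt/mule/objectstore"/>
    </idempotent-message-filter>
    <!-- downstream processing goes here -->
</flow>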
It is actually very simple if your incoming filename contains a timestamp: you can configure the file inbound endpoint with a filename regex filter, e.g. file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv". I hope this helps.
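Sketched in context, that would look something like the following; the path and polling frequency are placeholders, and whether the expression in the pattern is evaluated depends on your runtime, so test it and fall back to a plain date-based regex if needed:

<file:inbound-endpoint path="/data/incoming" pollingFrequency="60000">
    <!-- only pick up the file carrying today's timestamp in its name -->
    <file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv" caseSensitive="true"/>
</file:inbound-endpoint>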
Maybe you can use a Quartz scheduler (specify the time in a cron expression), followed by a Groovy script in which you start the file connector. Keep the file connector in another flow.
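A rough sketch of that arrangement in Mule 3; the flow names, cron expression, and path are illustrative, and it assumes the file flow starts out stopped so that only the scheduler ever activates it:

<flow name="dailyTrigger">
    <quartz:inbound-endpoint jobName="dailyFilePickup" cronExpression="0 0 6 * * ?">
        <quartz:event-generator-job/>
    </quartz:inbound-endpoint>
    <scripting:component>
        <scripting:script engine="groovy"><![CDATA[
            // start the flow that hosts the file inbound endpoint
            muleContext.registry.lookupFlowConstruct('filePickupFlow').start()
            return payload
        ]]></scripting:script>
    </scripting:component>
</flow>

<flow name="filePickupFlow" initialState="stopped">
    <file:inbound-endpoint path="/data/incoming"/>
    <!-- file handling goes here -->
</flow>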
I have a requirement to use a Flowgear workflow to process files (via a DropPoint, targeting a Windows file share [SMB]), but targeting only the files that have been modified after a certain time of day.
How can one tell the "Last Modified" date/time of a file using a Flowgear node?
I have been searching the Flowgear help center, and have been experimenting with file-related nodes - File, File Enumerator, File Watcher and File Manage, but I haven't seen any property that exposes this piece of metadata.
Here's an example of how you can do it.
https://flowgear.me/#s/cCp8kGQ
In this sample, you use the script to get a list of files; after that, you can use the normal File nodes to do the rest. It could be modified to return the file directly via the Script node, but that would require additional hand-coding and is needlessly complex.
I need to design a pipeline using NiFi, but I have some questions, as I am deciding between two approaches and am unsure which processors to use, so maybe you can help me.
The scenario is the following: I need to ingest some .csv files into HDFS. They do not contain the date I want to use to partition the Hive tables I will use later, so I thought of two options:
At some point during the .csv processing, run some kind of code snippet launched from NiFi that modifies the .csv file, adding a column with the date.
Create a temporary (internal?) table in Hive, alter the table to add the column, and finally insert the data into the table partitioned by date.
I am unsure which option is better (memory-wise, simplicity, resource management), whether it's even possible, or whether there is a better way to do it. I am also unsure which NiFi processors to use.
So any help is appreciated guys, thanks.
You should be able to do #1 easily in NiFi without writing any code :)
The steps would be something like this:
Source processor to get your CSV from somewhere, probably GetFile
UpdateAttribute to add an attribute for the current date
UpdateRecord with a CSVReader and CSVWriter, which adds a new date field with the value from step 2 (see the sketch below)
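As a rough sketch, the processor settings would look something like this; the attribute name csv.date and the record field name date are arbitrary choices, and UpdateRecord's Replacement Value Strategy must be Literal Value so the Expression Language below is evaluated against the attribute:

UpdateAttribute (dynamic property):
    csv.date = ${now():format('yyyy-MM-dd')}

UpdateRecord:
    Record Reader              = CSVReader
    Record Writer              = CSVWriter  (its schema includes the new "date" field)
    Replacement Value Strategy = Literal Value
    /date                      = ${csv.date}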
I've created an example of how to do this and posted the template here:
https://gist.githubusercontent.com/bbende/113f8fa44250c09a5282d04ee600cd09/raw/c6fe8b1b9f31bb106f9c816e4fd5ea90ebe19f80/CsvAddDate.xml
Save that XML file and use the palette on the left of the NiFi canvas to upload it as a template. Then instantiate the template by dragging the template icon from the top toolbar onto the canvas.
I'm pretty new to Jedox, and for internal use I'm trying to export some specific warning log entries (e.g. "(Mapping missing)") into an Excel/WSS file.
And I don't know how to do that...
Can you help me, please?
Regards,
Usik
The easiest way to get this information is to use the Integrator in Jedox.
There you can use a File extract and then filter for the information you are searching for.
After that, it's possible to load the filtered information into a file.
The minimum steps you'll need are Connection -> Extract -> Transform -> Load.
Please take a look at the sample projects that are delivered with the Jedox software. In the example "sampleBiker", there are also file connections, extracts, etc.
You can find more samples in:
<Install_path>\tomcat\webapps\etlserver\data\samples
I recommend checking the Jedox Knowledge Base.
The other (and maybe more flexible) way would be to use, for example, a PHP macro inside a Jedox Web report and read the log file you're trying to display.
If you've got a more specific idea what you'd like to do, please let me know and I'll try to give you an example how to do so.
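For the macro route, here is a minimal PHP sketch; the log file path is an assumption for your installation, and pushing the matching lines into the report cells via Jedox's macro API is left out:

// read the ETL log and keep only the "(Mapping missing)" warnings
$lines = file('/path/to/jedox/log/etlserver.log');  // adjust to your install
$warnings = array_filter($lines, function ($line) {
    return strpos($line, '(Mapping missing)') !== false;
});
// $warnings now holds the matching log lines, ready to be written to the report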
Using NiFi v0.6.1, is there a way to import backups/archives?
By backups I mean the files that are generated when you call
POST /controller/archive using the REST API, or via "Controller Settings" (toolbar button) and then "Back-up flow" (link).
I tried unzipping the backup and importing it as a template, but that didn't work. After comparing it to an exported template file, I can see the formats are quite different. Perhaps there is a way to transform it into a template?
At the moment my workaround is to select no components on the top-level flow and then select "create template", which adds a template with all my components; then I just export that. My issue with this is that it's a bit trickier to automate via the REST API. I used Fiddler to determine what the UI is doing: it first generates a snippet that includes all the components (labels, processors, connections, etc.), then it calls create template (POST /nifi-api/controller/templates) using the snippet ID. So the template call is easy enough, but generating the definition for the snippet is going to take some work.
Note: Once the following feature request is implemented I'm assuming I would just use that instead:
https://cwiki.apache.org/confluence/display/NIFI/Configuration+Management+of+Flows
The entire flow for a NiFi instance is stored in a file called flow.xml.gz in the conf directory (flow.xml.tar in a cluster). The back-up functionality is essentially taking a snapshot of that file at the given point in time and saving it to the conf/archive directory. At a later point in time you could stop NiFi and replace conf/flow.xml.gz with one of those back-ups to restore the flow to that state.
Templates are a different format from the flow.xml.gz. Templates are more public-facing and shareable, and can be used to represent portions of a flow, or the entire flow if no components are selected. Some people have used templates as a model to deploy their flows, essentially organizing their flow into process groups and making a template for each group. This project provides some automation to work with templates: https://github.com/aperepel/nifi-api-deploy
You just need to stop NiFi, replace the NiFi flow configuration file (for example, this could be flow.xml.gz in the conf directory), and start NiFi back up.
If you have trouble finding it, check your nifi.properties file for the nifi.flow.configuration.file= property to find out what you've set it to.
If you are using clustered mode, you only need to do this on the NCM.
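A minimal sketch of the restore on a default layout; the archived filename is illustrative, so check conf/archive for the actual names:

# stop NiFi, swap in the archived flow, and start it back up
./bin/nifi.sh stop
cp conf/archive/20160101T000000_flow.xml.gz conf/flow.xml.gz
./bin/nifi.sh start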
Could you please advise? I have two files, each with 80 to 90k products, and the two files are interlinked with each other (one file holds information about the other). I need to generate one single file by looking up across the two files. The files will probably arrive at the same time, with different names.
Both files are CSV, and I need to generate a new CSV.
Is the only way to keep one of these files in memory and look up records by iterating?
I planned to use Batch together with DataMapper. Is there any way we can keep the first file in a DataMapper user-defined table or something like that, and then have the new file do a lookup against it? (I'm not provided with an external DB.)
If one of the files had only 5,000 or 10k lines, I could keep it in memory and have the 80k file look up against it, but I'm not comfortable keeping an 80-90k-line file in memory.
I have referenced this link: Mule ESB - design a multi file processing flow when files are dependent on each other.
Could you please suggest the best solution?
Also, any idea how long it would take to process the files? Thanks in advance.
Mule Studio: 5.3.1, Runtime: 3.7.2.
I would think of the problem as two distinct events from Mule's perspective and plan to keep state from the first one in a "database" of some kind. This doesn't have to be an Oracle cluster or anything; you can run H2 in-process or Redis on the same server as Mule, for example.
I think you're on the right track with the Batch idea. When the first file is received, I'd create a record for each row in a batch job. Then, when the second file is received, I'd run a second batch job that looks up the relevant information from the database and generates the CSV file you need. It could also remove the matched records from the database in a subsequent batch step.
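A rough sketch of the first batch job in Mule 3 XML; it assumes the CSV has already been parsed into a list of maps, and the job name, the h2DbConfig reference, and the table columns are all placeholders:

<batch:job name="storeFirstFile">
    <batch:input>
        <file:inbound-endpoint path="/data/in">
            <file:filename-wildcard-filter pattern="fileA_*.csv"/>
        </file:inbound-endpoint>
        <!-- parse the CSV into a list of maps here (e.g. with DataWeave) -->
    </batch:input>
    <batch:process-records>
        <batch:step name="insertRecord">
            <!-- one insert per product row; the second job joins against this table -->
            <db:insert config-ref="h2DbConfig">
                <db:parameterized-query><![CDATA[
                    INSERT INTO product (id, info) VALUES (#[payload.id], #[payload.info])
                ]]></db:parameterized-query>
            </db:insert>
        </batch:step>
    </batch:process-records>
</batch:job>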
For the transformations, I'd recommend trying DataWeave instead of DataMapper. It's a better way to write transformation logic, and MuleSoft has deprecated DataMapper; it will be removed as of Mule 4.0.
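And a small DataWeave 1.0 sketch of the enrichment step in the second job; it assumes flowVars.firstFileRows holds the first file's rows (queried back from the database as a list of maps), and the field names are placeholders. It is written for clarity, not speed; in practice you'd push the matching into the database query itself:

%dw 1.0
%output application/csv
---
// for each row of the second file, pull the matching info from the first file's rows
payload map ((row) ->
    row ++ { info: (flowVars.firstFileRows filter ($.id == row.id))[0].info }
)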