Best approach for this data pipeline? - hive

I need to design a pipeline using NiFi, but I have some questions: I am torn between two approaches and unsure which processors to use, so maybe you can help me.
The scenario is the following: I need to ingest some .csv files into HDFS. They do not contain the date I want to use to partition the Hive tables I will build later, so I thought of two options:
1. At some point during the .csv processing, have NiFi launch some kind of script that modifies the .csv file, adding a column with the date.
2. Create a temporary (internal?) table in Hive, alter it to add the date column, and finally insert the data into the table that is partitioned by date.
I am unsure which option is better (memory-wise, simplicity, resource management), whether it's even possible, or whether there is a better way to do it altogether. I am also unsure which NiFi processors to use.
So any help is appreciated guys, thanks.

You should be able to do #1 easily in NiFi without writing any code :)
The steps would be something like this:
1. A source processor to get your CSV from somewhere, probably GetFile
2. UpdateAttribute to add an attribute for the current date
3. UpdateRecord with a CSVReader and CSVRecordSetWriter, adding a new date field with the value from step 2
I've created an example of how to do this and posted the template here:
https://gist.githubusercontent.com/bbende/113f8fa44250c09a5282d04ee600cd09/raw/c6fe8b1b9f31bb106f9c816e4fd5ea90ebe19f80/CsvAddDate.xml
Save that XML file and use the palette on the left of the NiFi canvas to upload it as a template. Then instantiate it by dragging the template icon from the top toolbar onto the canvas.
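That said, if you do end up preferring the script route from option 1 (for example via ExecuteStreamCommand, which pipes flowfile content through a command's stdin/stdout), the transformation itself is tiny. A minimal sketch in Python, assuming a header row and a made-up column name "ingest_date":

    import csv
    import sys
    from datetime import date

    # Read the CSV from stdin, append a date column, and write CSV to stdout.
    # This shape suits ExecuteStreamCommand, which pipes the flowfile content
    # through stdin/stdout. "ingest_date" is a made-up column name.
    today = date.today().isoformat()

    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)

    header = next(reader)
    writer.writerow(header + ["ingest_date"])

    for row in reader:
        writer.writerow(row + [today])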

Age Analysis Dynamically Sliced with Before Date

I am trying to create an Age Analysis for Creditors using a dynamic date slicer.
I followed each individual step specified in David Churchward's blog, but I'm not able to replicate what he suggested there.
Here is the result of what I tried:
I'm expecting to see these values each in their own ageing bucket based on what is outstanding.
Please download my PBIX file to see for yourself, then advise what I did wrong.
The Excel source for PBIX is also in the folder.
Thank you.
The blog you're referring to is quite old, and DAX has changed a lot since then.
Additionally, Power BI now has a built-in feature called binning, which can do something similar to what you're looking for.
I was able to generate the output below using that feature, which automatically groups the data based on the bin size.
There is also a related feature called "Grouping", where you can manually choose the groups and their ranges. If you're up for it, you can use this too. Below is the output for that:
I uploaded the file with these changes in the same folder.
Another resource that might be helpful for you is Radacad's article on dynamic banding.
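For what it's worth, the banding logic behind binning is simple enough to prototype outside Power BI. A small sketch in Python with pandas, with made-up column names and bucket boundaries, just to show what the feature is doing under the hood:

    import pandas as pd

    # Hypothetical creditors data: amount outstanding and days overdue.
    df = pd.DataFrame({
        "creditor": ["A", "B", "C", "D"],
        "outstanding": [1200.0, 450.0, 80.0, 3100.0],
        "days_overdue": [12, 45, 75, 120],
    })

    # Fixed ageing buckets, analogous to Power BI's binning/grouping.
    bins = [0, 30, 60, 90, float("inf")]
    labels = ["0-30", "31-60", "61-90", "90+"]
    df["ageing_bucket"] = pd.cut(df["days_overdue"], bins=bins, labels=labels)

    # Total outstanding per bucket.
    print(df.groupby("ageing_bucket", observed=True)["outstanding"].sum())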

Access when exporting it removes spaces at the end of a string

Long story short, I am dealing with Excel files that need to be modified a little. As the files come in on a weekly basis, I decided to write a simple program in Access to make the process fully automatic.
The first step was to import the Excel file into an Access database. I managed to achieve that with a custom function that simply uses the "DoCmd.TransferSpreadsheet acImport" approach.
The second step was to create two queries and update the table I had just imported. That was also pretty straightforward.
However, the third step is where I am struggling. Now that the table is updated, I want to export it back to .xlsx format. But no matter whether I do it manually via the "External Data" tab or use the "DoCmd.TransferSpreadsheet acExport" approach, I notice that a few columns whose strings end with a space are trimmed automatically. For example, the original is "string ", but after exporting it becomes "string".
I would be really grateful if someone could tell me how to indicate to Access that the space after the string is intentional and not a mistake, preferably with a VBA solution rather than doing it manually. Thank you in advance for the help!
PS: I know that the .csv format would be way better, but sadly I need it to be in .xlsx format.
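If stepping outside VBA for the final export were acceptable, one workaround is to bypass Access's exporter entirely and write the .xlsx yourself, so nothing gets trimmed. A rough, untested sketch in Python, assuming the Access ODBC driver plus pyodbc and openpyxl are available; the paths and table name are placeholders:

    import pyodbc
    from openpyxl import Workbook

    # Placeholder connection string and table name for illustration.
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"DBQ=C:\path\to\database.accdb"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM MyTable")  # hypothetical table name

    wb = Workbook()
    ws = wb.active
    ws.append([col[0] for col in cursor.description])  # header row

    # Values are written as-is, trailing spaces included.
    for row in cursor.fetchall():
        ws.append(list(row))

    wb.save(r"C:\path\to\export.xlsx")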

How to separate the latest file from Multiple files in Mule

I have 5000 files in a folder, and new files keep landing in the same folder on a daily basis. Each day I need to pick up only the latest file among all of them.
Is it possible to achieve this scenario in Mule out of the box?
I tried keeping a File component inside a Poll component (to make use of watermarks), but it is not working.
Is there any way we can achieve this? If not, please suggest the best approach (any links are welcome).
Mule Studio 5.3, runtime 3.7.2.
Thanks in advance.
Short answer: there isn't really any extremely quick out-of-the-box solution, but there are other ways. I'm not saying this is the right or only way of solving it, but I've implemented a similar scenario this way before:
A normal File inbound endpoint with a database table as a file log. Each time a new file is processed, a component checks whether its name already appears in the table. Using a choice router or filter, I only continue if it isn't in there already, and after processing I add the filename to the table.
This is quite a "heavy" solution, though. A simpler approach would be to use an idempotent filter with an object store, for example a Redis server: https://github.com/mulesoft/redis-connector/blob/master/src/test/resources/redis-objectstore-tests-config.xml
It is actually very simple if your incoming filename contains a timestamp: you can configure the file inbound connector by setting file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv". I hope this helps.
Maybe you can use a Quartz scheduler (specify the time in a cron expression), followed by a Groovy script in which you start the file connector. Keep the file connector in another flow.
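Independent of Mule, the "latest file" selection itself is small enough to hand off to a script (for example from the Groovy component suggested above). A sketch of that selection in Python, with a placeholder directory:

    from pathlib import Path

    folder = Path("/data/incoming")  # placeholder path

    # Pick the file with the most recent modification time.
    files = [p for p in folder.iterdir() if p.is_file()]
    latest = max(files, key=lambda p: p.stat().st_mtime)
    print(latest)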

Enrich CSV with metadata from database

I've been looking around for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns are the metadata belonging to that item.
Basically I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or REST API call.
I have a number of options in my head, but I'm looking for other ideas. My options are as follows:
1. Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements by finding the necessary metadata with SELECT statements, and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem.
2. A NodeJS-based solution where I read the CSV in, call a web service to get the metadata, and write the data back into the CSV file. The CSV can be quite large, however, with potentially tens of thousands of rows, so this could be heavy on memory or, in the case of line-by-line processing, not very performant.
If you have a better solution in mind, please post. Many thanks.
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion of using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It takes a continuous-ingestion approach, so you could even monitor a directory of CSV files and load them as they arrive. While there's no specific functionality yet for doing lookups against a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.
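On the memory concern in the second option: with true line-by-line streaming you only ever hold one row, so file size stops being an issue; the real cost is one lookup per row, which batching or caching can soften. A minimal sketch of that streaming enrichment, in Python rather than NodeJS for illustration; the file names, the "id" column, and the lookup stub are all made up:

    import csv

    # Stand-in for the real metadata source (ODBC query or REST call);
    # in practice you would batch or cache these lookups.
    def lookup_metadata(item_id):
        return {"category": "unknown", "owner": "n/a"}

    with open("items.csv", newline="") as src, \
         open("items_enriched.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames + ["category", "owner"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()

        # One row in memory at a time, regardless of file size.
        for row in reader:
            row.update(lookup_metadata(row["id"]))
            writer.writerow(row)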

How do I update text descriptions and URL links in an xml file using a SQL database

Can someone point me in the right direction? I just can't get a clear answer for this, I am sure, very simple task. I have created a database with images, text strings, and links, and I want to insert certain cells into an XML file I have that is used in an image flipper of sorts. For example, when I automatically update the database, I want the XML file to show the new images, text, and links.
Right now my XML looks like this:
<photo image="images/01.jpg" url="http://www.straightapp.com/1.html" target="_blank"> <![CDATA[Download the new<br>Check my first image out]]></photo>
Any help would be greatly appreciated.
I would probably write a cron job or daemon to check for updates to the database table (compare against local storage, or set up some kind of trigger from the database if possible), then read in the XML file, append to it, and rewrite it.
I'm sure there are more elegant solutions, but that would be pretty easy.
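As a concrete sketch of that idea, a script run from cron could simply regenerate the photo entries from the database on each pass. Python here for illustration, with SQLite and the column names standing in for whatever database is actually in use:

    import sqlite3

    # Stand-in database and schema for illustration.
    conn = sqlite3.connect("flipper.db")
    rows = conn.execute("SELECT image, url, caption FROM photos")

    # Rebuild the XML entries from scratch on each run; CDATA keeps any
    # markup in the caption (e.g. <br>) intact, matching the existing format.
    entries = []
    for image, url, caption in rows:
        entries.append(
            f'<photo image="{image}" url="{url}" target="_blank">'
            f" <![CDATA[{caption}]]></photo>"
        )

    with open("flipper.xml", "w", encoding="utf-8") as f:
        f.write("\n".join(entries) + "\n")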