Synchronized shared definition in Pentaho

Is there a way in Pentaho to create a synchronized shared definition?
Let's say we have a source file s1 that is used in two transformations, t1 and t2. Now suppose I make a change in t1 and add one more column to s1 there; I want that change to be reflected in t2 too. Is there a way in Pentaho to achieve this?
When we share a database connection in Pentaho, all changes get reflected wherever the connection is used. Can we do a similar thing with files too (i.e., create a shared definition of a file, store it in the repository, and then use it in other transformations)?
Thanks for your time.

Normally, I would recommend using a subtransformation for shared logic. But in this case, the number of fields in the resulting stream is going to change, so the subtransformation won't buy you much; you would still need to go into the parent transformations to change the stream metadata.
Another approach is to use the Metadata Injection step, which lets you have dynamic stream structures. This is probably overkill if you only have one source file used in two transformations, but if you have many source files shared by many transformations, it is a good approach. There are multiple resources on the web on how to use this step; one such example can be found here.

Related

Mosaic Decisions Azure BLOB writer node creating multiple files

I'm using the Mosaic Decisions data flow feature to read a file from Azure Blob, do a few transformations, and write that data back to Azure. It worked fine, except that at the output file path I specified it created a folder, and inside it I can see many files with strange "part-000" etc. in their names. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the DataFrame being read is split into multiple partitions, and these partitions are written to the output location in parallel. That is why it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here stands for partition).
The workaround is to check "combine-output-files-into-one" in the writer node. This will combine all of the part files into one big file. But use this with caution, and only if you really need a single file, as it comes with a performance tradeoff.
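Conceptually, that checkbox presumably amounts to coalescing the DataFrame down to a single partition before the write, along the lines of this PySpark sketch (the paths, session setup, and storage connector here are illustrative assumptions, not Mosaic internals):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("combine-output-sketch").getOrCreate()

    # Hypothetical blob path; in Mosaic this is handled by the reader node.
    df = spark.read.csv("wasbs://data@myaccount.blob.core.windows.net/input/", header=True)

    # Default behaviour: each partition is written in parallel as its own
    # part-00000, part-00001, ... file inside the output folder.
    df.write.mode("overwrite").csv("wasbs://data@myaccount.blob.core.windows.net/output/")

    # Coalescing to one partition yields a single part file (the output is
    # still a folder, but with only one part-00000 inside). The whole dataset
    # now flows through a single task, which is the performance tradeoff.
    df.coalesce(1).write.mode("overwrite").csv("wasbs://data@myaccount.blob.core.windows.net/output-single/")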

Best approach for this data pipeline?

I need to design a pipeline using NiFi, but I have some questions, as I am deciding between two approaches and I am unsure which processors to use, so maybe you can help me.
The scenario is the following: I need to ingest some .csv files into HDFS. They do not contain the date I want to use to partition the Hive tables I will use later, so I thought of two options:
At some point during the .csv processing, have NiFi launch some kind of code snippet that modifies the .csv file, adding a column with the date.
Create a temporary (internal?) table in Hive, alter that table to add the column, and finally insert into the table that is partitioned by date.
I am unsure which option is better (memory-wise, simplicity, resource management), whether it is even possible, or whether there is a better way to do it. I am also unsure which NiFi processors to use.
So any help is appreciated guys, thanks.
You should be able to do #1 easily in NiFi without writing any code :)
The steps would be something like this:
1. Source processor to get your CSV from somewhere, probably GetFile
2. UpdateAttribute to add an attribute for the current date
3. UpdateRecord with a CSVReader and CSVWriter that adds a new date field with the value from #2
I've created an example of how to do this and posted the template here:
https://gist.githubusercontent.com/bbende/113f8fa44250c09a5282d04ee600cd09/raw/c6fe8b1b9f31bb106f9c816e4fd5ea90ebe19f80/CsvAddDate.xml
Save that XML file and use the palette on the left of the NiFi canvas to upload it as a template. Then instantiate the template by dragging the template icon from the top toolbar onto the canvas.
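For reference, the key processor properties end up looking roughly like the sketch below. The attribute name, date format, and schema handling are illustrative assumptions (in particular, the CSVWriter's schema has to include the new date field, otherwise it will be dropped); check the linked template for the exact configuration.

    UpdateAttribute (one dynamic property):
        csv.date = ${now():format('yyyy-MM-dd')}

    UpdateRecord:
        Record Reader              = CSVReader (original columns)
        Record Writer              = CSVRecordSetWriter (original columns + date)
        Replacement Value Strategy = Literal Value
        /date                      = ${csv.date}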

LinqPad - Share data between queries

I have two connections to two distinct servers.
I'd like to access data from databases on both servers. It seems impossible (or at least complicated).
I was thinking it may be easier to run some queries on one server, store the results in memory in some variable, and then access that variable from another query against the other server.
I tried static variables in MyExtensions, as well as AppDomain.CurrentDomain.SetData("myVariable", results) and AppDomain.CurrentDomain.GetData("myVariable"), but neither works.
I had this same problem, but the solution ended up being much simpler than I had expected.
Since I had a project with both of the data contexts (on different servers) I wished to query, I added a reference (F4 in a query, then Additional References -> browse to your project's bin folder) to my project's .dll. I then added a connectionStrings section to the query's app.config containing the names of the connection strings my project's contexts were looking for, with the correct connection info.
This gave me not only access to my data contexts, but also much of the business logic (for instance, repos/DTOs/viewModels/other transformations) from my project. From there, I could grab whatever I needed from DB/server A, put it into my preferred data type (usually a List), and then have it interact with data from DB/server B as needed.
Hope this helps!
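As a rough sketch, that connectionStrings section of the query's app.config might look something like this (the names, servers, and databases are hypothetical; the names just have to match whatever your project's data contexts look up):

    <configuration>
      <connectionStrings>
        <!-- Hypothetical entries; use the names your contexts expect -->
        <add name="ServerAContext"
             connectionString="Data Source=ServerA;Initial Catalog=DatabaseA;Integrated Security=True"
             providerName="System.Data.SqlClient" />
        <add name="ServerBContext"
             connectionString="Data Source=ServerB;Initial Catalog=DatabaseB;Integrated Security=True"
             providerName="System.Data.SqlClient" />
      </connectionStrings>
    </configuration>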

Joining ADLS files created with Append and ConcurrentAppend

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been output by a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, after the join operation no append operations can be executed on the file, so appends will not work on concatenated files. We are currently working on a feature to remove this limitation.

Need suggestions on which option will be efficient to store data on iPad

This is the first time I am working on a big project for a client, so I was not sure how to solve this problem. I have come up with two different ideas, but I need a professional opinion about which one is better :)
Situation:
There is an application which runs on different clients' iPads. Application data is stored in a giant XML file. This XML file is shared among all clients by a server, so the server has a centralised copy and each client has their own copy. Once a client makes changes to their XML copy, they update the server copy, and the other clients then update their copies from the updated server copy.
Only one client can make changes at a time. To enforce this, I have logic by which a client must get ownership from the server before it starts editing the XML, and the server will only allow one client to edit at a time.
Now, on the client side, I have to come up with logic by which I will update my client copy and upload it to the server. There are two options:
Option 1:
In option 1, I directly manipulate the XML file using the GDataXML parser and upload that copy to the server. For persistence, I save the client copy on the iPad in the Documents directory.
Option 2:
In option 2, I read the XML file and create a Core Data representation for local storage. Whenever I update data inside Core Data, I change the XML file too and then upload that file to the server. Double the work, but I guess better persistence.
Now, which one is more robust and advisable? Personally, I was planning to go with option 2 because it seems more robust, as I am persisting application data in Core Data. Option 1 seems like less work, but I don't know how good the persistence will be.
Sorry for the lengthy question.
Thanks for any input given.
There are a number of factors which would influence selecting the second option over the first.
How big is the XML file? If you need to work with very large documents, you may need to incrementally parse the XML (SAX) into Core Data. This will allow you to access the document's contents without loading it all into memory at once.
Do you need to run complex queries on the data? If so, you may be better off using Core Data fetch predicates rather than XPath or XSLT.
Are you already using Core Data? Depending on how the XML data is structured, it might be simpler overall to import the data into your existing persistent store.
Otherwise, you can probably make do with parsing the entire document and either traversing the resulting tree or querying it with XPath.
If you need to create an object graph based on what you get from the server and show it to the user (which you most probably need to do), you should stick with the second option, since it allows easy and robust data persistence.
If you do not need to present the user with any data from the XML file, you can, of course, just store it in the Documents directory.
So, if this is a client application and it has at least some visual representation of the data from the XML file, you should use Core Data.
If you want regular updates of the data, then use Core Data.