How to load multiple Excel sheets into different tables using Pentaho metadata injection

I have one Excel file with 5 different sheets, and I want to load all 5 sheets into different tables using Pentaho metadata injection.
Note: I have already implemented the straightforward approach of repeating the flow 5 times.
What I have tried:
1) I created another Excel sheet containing the metadata of all 5 sheets.
2) I am able to pass the sheet name as a runtime variable and substitute it into the Sheets property.
3) I am stuck on how to read the corresponding metadata file and apply it to the template.
Any solution is appreciated.

Have you followed Pentaho's example on how to use Metadata Injection? It also uses Excel sheets to store the metadata, and the name of the sheet is passed at runtime and applied by means of a join.
In my experience, I've stored the metadata in a MySQL table and selected the corresponding rows using a variable set in a previous transformation, depending on the format of the input file.
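Purely as an illustration (this is not Pentaho and not metadata injection), the per-sheet loop the question describes can be sketched in a few lines of Python with pandas and SQLAlchemy; the sheet names, table names, file name and connection string below are made-up placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical mapping of sheet name -> target table.
SHEET_TO_TABLE = {
    "Customers": "stg_customers",
    "Orders": "stg_orders",
    "Products": "stg_products",
}

engine = create_engine("sqlite:///staging.db")  # placeholder; point this at your real database

def load_workbook_sheets(path):
    # Load every configured sheet of one workbook into its own table.
    for sheet, table in SHEET_TO_TABLE.items():
        df = pd.read_excel(path, sheet_name=sheet)
        df.to_sql(table, engine, if_exists="append", index=False)

load_workbook_sheets("input.xlsx")  # hypothetical file name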

Related

Excel to CSV Plugin for Kettle

I am trying to develop a reusable component in Pentaho which will take an Excel file and convert it to a CSV with an encoding option.
In short, I need to develop a transformation that has an Excel input and a CSV output.
I don't know the columns in advance. The columns have to be dynamically injected into the Excel input.
That's a perfect candidate for Pentaho Metadata Injection.
You should have a template transformation which contains the basic workflow (read from the Excel file, write to the text file), but without specifying the input and/or output formats. Then, you should store your metadata (the list of columns and their properties) somewhere. In Pentaho's example an Excel spreadsheet is used, but you're not limited to that. I've used a couple of database tables to store the metadata, for example: one for the input format and another one for the output format.
You also need a transformation with the Metadata Injection step to "inject" the metadata into the template transformation. What it basically does is create a new transformation at runtime, using the template and the fields you set to be populated, and then run it.
Pentaho's example is pretty clear if you follow it step by step, and from that you can then create a more elaborated solution.
You'll need at least two steps in a transformation:
Input step: Microsoft Excel input
Output step: Text file output
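For comparison only, outside Pentaho, the metadata-driven idea itself is small enough to sketch in Python: a generic conversion routine whose column list and encoding are supplied at runtime from a metadata record instead of being hard-coded. The file names, column names and encoding below are assumptions.
import pandas as pd

def excel_to_csv(excel_path, csv_path, metadata):
    # The metadata dict plays the role of the injected fields.
    df = pd.read_excel(excel_path, usecols=metadata["columns"])
    df.to_csv(csv_path, index=False, encoding=metadata["encoding"])

# The metadata could equally come from a spreadsheet or a database table.
metadata = {"columns": ["id", "name", "amount"], "encoding": "utf-8"}
excel_to_csv("input.xlsx", "output.csv", metadata)
In Pentaho the separation is the same: the template transformation is the generic routine, and the Metadata Injection step supplies the field definitions at runtime.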
Here is another solution. In your Excel input step, in the Fields section, declare the maximum number of fields that can appear in any Excel file. Then route the input rows to the appropriate text file based on the number of fields that are actually present. You can use the Switch/Case step for this.

Reading metadata CSV from a datalake, too big for a lookup activity

I need to create a pipeline to read CSVs from a folder and load them, from row 8 onward, into an Azure SQL table; the first 5 rows will go into a different table ([tblMetadata]).
So far I have done it using a Lookup activity, which works fine, but one of the files is bigger than 6 MB and it fails.
I checked all the options in Lookup and read everything about the Copy activity (which I am using to load the main data, skipping 7 rows). The pipeline is created using the GUI.
The output from the Lookup is used as parameters for a stored procedure that inserts into tblMetadata.
Can someone advise me on how to deal with this? At the moment I am in training, and no one can help me on site.
You could probably do this with a single Data Flow activity that has a couple of transformations.
You would use a Source transformation that reads from a folder using folder paths and wildcards, then add a conditional split transformation to send different rows to different sinks.
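The Data Flow itself is configured in the ADF GUI rather than written as code, but the row split being asked for is easy to see in a small pandas sketch (the file name, main table name and connection below are placeholders; tblMetadata comes from the question).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///staging.db")  # placeholder; the real target is Azure SQL

def load_csv(path):
    # First 5 rows hold the metadata; they have no header row of their own.
    meta = pd.read_csv(path, header=None, nrows=5)
    meta.to_sql("tblMetadata", engine, if_exists="append", index=False)

    # Skip the first 7 rows so the main load starts at row 8 (assumed to be the header row).
    data = pd.read_csv(path, skiprows=7)
    data.to_sql("tblMainData", engine, if_exists="append", index=False)

load_csv("input.csv")  # hypothetical file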
I worked around it in a different way: I modified the CSVs being imported so that the whole metadata is in the first row (this was part of a different project of mine), and then used the 'First row only' option in the Lookup.

Update multiple Excel sheets of one document within one Pentaho Kettle transformation

I am studying a standard sample from the Pentaho DI package: GetXMLData - Read parent children rows. It reads parent rows and children rows separately from the same XML input. I need to do the same and update two different sheets of the same MS Excel document.
My understanding is that the normal way to achieve this is to put the first sequence in one transformation file with an XML Output or Writer step, the second in a second transformation file, and then create a job that chains from Start through the 1st and 2nd transformations.
My problems are:
When I try to chain the above sequences, I lose the content of the first updated Excel sheet in the final document;
At the end I need just one file with either a job or a transformation, without dependencies (in the scenario proposed above I would have 1 KJB job + 2 KTR transformation files).
Questions are:
Is it possible to join the 2 sequences from the above sample with some wait node before starting to update the 2nd Excel sheet?
If the above doesn't work: is it possible to embed the transformations in the job instead of referencing them from external files?
And an extra question: which is better to use, Excel Output or Excel Writer?
=================
UPDATE:
Based on @AlainD's proposal I have tried to put a Block step in between. Here is the result:
It looks like the Block step can be an option, but somehow it doesn't work as expected with the Excel Output / Writer steps (or I am doing something wrong). What I have observed is that Pentaho tries to execute the steps after the Block before the Excel file is properly closed by the previous step. That leads to one of the following: I either get an Excel file with one empty sheet, or the generated result file is malformed.
My input XML file (from the Pentaho distribution) and my test playground transformation are: HERE
NOTE: While experimenting, do not forget to remove the generated MS Excel files between runs.
Screenshot:
Any suggestions how to fix my transformation?
The pattern goes as follows:
read data: 1 row per child, with the parent data in one or more columns
group the data: 1 row per parent; forget the children, keep the parent data. Transform and save as needed.
back in the original data, look up each row (child) and fetch the parent from the grouped data flow.
the result is one row per child plus the needed columns of the transformed parent. Transform and save as needed.
It is a pattern; you may want to change the flow and/or sort to speed it up. But it will not lock, nor fill up the memory: the group-by and lookup are pretty reliable.
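Those are Pentaho steps (Group By and Stream Lookup), so there is no code to show; purely to illustrate the same pattern, here is a rough pandas equivalent with made-up column names.
import pandas as pd

# 1 row per child, with the parent data repeated in the parent_* columns.
rows = pd.DataFrame({
    "parent_id":   [1, 1, 2],
    "parent_name": ["A", "A", "B"],
    "child_name":  ["a1", "a2", "b1"],
})

# Group: 1 row per parent, children forgotten; this feeds the parent sheet.
parents = rows.groupby(["parent_id", "parent_name"], as_index=False).size()

# Lookup: back in the original rows, fetch the (transformed) parent data.
children = rows.merge(parents, on=["parent_id", "parent_name"], how="left")

print(parents)   # one row per parent
print(children)  # one row per child, plus the columns of the transformed parent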
Question 1: Yes, the step you are looking for is named Block until this (other) step finishes, or Blocking Step (until all rows are processed).
Question 2: Yes, you can pass the rows from one transformation to another via the job. But it would be wiser to first produce the parent sheet and, when finished, read it again in the second transformation. You can also pass the rows to a sub-transformation, or use other architecture strategies...
Question 3: (Short answer) The Excel Writer appends data (a new sheet or new rows) to an existing Excel file, while the Excel Output creates and feeds a one-sheet Excel file.

Data driven test using different Excel sheets with different numbers of parameters

Hi, I am using a data provider for different Excel sheets. This is done by providing the Excel sheet name and table name through a variable, but the problem is that my different Excel sheets have different parameters, i.e. the number of columns differs, and I am providing the number of columns (parameters) in the test class. So when the Excel sheet changes, my test scripts fail. Is there any way to solve this? I need your help; it would solve a big problem for me.
Your question is not very clear, but the following:
MyWorkSheet.UsedRange.Columns.Count
should return the number of columns used in the Excel worksheet "MyWorkSheet". Does this help?
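That snippet uses the Excel object model (VBA/COM). If it helps, the same idea (asking the worksheet for its used column count at runtime instead of hard-coding it in the test class) looks roughly like this with openpyxl in Python; the file and sheet names are placeholders.
from openpyxl import load_workbook

wb = load_workbook("testdata.xlsx")
ws = wb["MyWorkSheet"]

column_count = ws.max_column              # number of columns actually used in the sheet
headers = [cell.value for cell in ws[1]]  # header row, read dynamically

# Build the parameter list from column_count/headers instead of a fixed count.
print(column_count, headers)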

Copying/Mapping data between excel spread sheets, Vb.net

I need to copy data, using VB.NET if possible, from one Excel workbook to another and place the data into the correct columns in the existing Excel spreadsheet. The column titles of the spreadsheets match up. I have several templates I need to place data into, and the order of the columns is different in each template, so I need a way of searching for a column header in the template and then copying the data into that column.
Would the best way of achieving this be using ADO?
For example, move the data from Workbook1, with columns "Test1" and "Test2" and this data:
Test1 Test2
1 2
12 23
123 234
Into Workbook2, which will have the same column names but possibly in a different order:
Test0 Test1 Test1.1 Test2
I need to do this automatically as I have a lot of data to copy and 30-40 workbook templates to copy the data into; the templates' columns are in different orders and cannot be moved around.
There are different ways to interface with Excel using .NET. If you are just looking to do it with one version of Excel, then VSTO might be your easiest solution, otherwise use something else. I like to use EXCEL-DNA.
You can also use ADO to get the data out, but to put it into another workbook I would think you would need one of the approaches listed above (since you would need to reference the Excel object). If you are using Excel 2007 and above you can also directly access the XML files and manipulate them that way (minus xlb, of course).
You can also create a library from your execution file and copy it locally. See here.
As for the headers, just use a Dictionary(Of String, Integer) or a List(Of String) to figure out the column indexes in the file(s).
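The question asks for VB.NET, so take the following only as a sketch of that header-mapping idea, written in Python with openpyxl (the dict of header text to column index maps directly onto a Dictionary(Of String, Integer)); the file names and the assumption that headers sit in row 1 are mine.
from openpyxl import load_workbook

src_wb = load_workbook("Workbook1.xlsx")   # source data
dst_wb = load_workbook("Template.xlsx")    # template with reordered columns
src, dst = src_wb.active, dst_wb.active

def header_index(sheet):
    # Map header text -> column index, read from row 1.
    return {cell.value: cell.column for cell in sheet[1] if cell.value}

src_cols = header_index(src)
dst_cols = header_index(dst)

# Copy each shared column into whatever position the template uses for it.
for header in src_cols.keys() & dst_cols.keys():
    for row in range(2, src.max_row + 1):
        value = src.cell(row=row, column=src_cols[header]).value
        dst.cell(row=row, column=dst_cols[header], value=value)

dst_wb.save("Template_filled.xlsx")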