Dynamically execute a transformation against a column at runtime

I have a Pentaho Kettle job that can load data from x number of tables, and put it into target tables with a different schema.
Assume I have table 1, like so:
I want to load this table into a destination table that looks like this:
The columns have been renamed, the order has been changed, and the data has been transformed. The renaming and reordering are easily managed by using the Select Values step, which can be used within an ETL Metadata Injection step, making it dependent on some configuration values loaded at runtime.
But if I need to perform some transformation logic on some of the columns, based on where they go in the target table, this seems to be less straightforward.
In my example, I want the column "CountryName" to be capitalised, and the column "Rating" to be floored (as in changing the real number to the previous integer value).
While I could do this by just manually adding a transformation to accomplish each, I want my solution to be dynamic, so it could just as easily run the "CountryName" column through a checksum component, or perform a ceiling on "Rating" instead.
I can easily wrap these transformations in another transformation so that they can be parameterised and executed when needed:
But, where I'm having trouble is, when I process a row of data, I need a way to be able to say:
Column "CountryName" should be passed through the Capitalisation transform
Column "Rating" should be passed through the Floor transform
Column(s) "AnythingElse" should be passed through the SomeOther transform
Is there a way to dynamically split out the columns in a row, and execute a different transform on each one, based on some configuration metadata that can be supplied?
Logically, it would be something like this, although I suspect there may be a way to handle it as a loop or some form of dynamic transformation, rather than mapping out a path per column:
Kettle is so flexible that it seems like there must be a way to do this; I'm just struggling to know which components to use and how to do it. Any experts out there have some suggestions?
I'm dealing with some biggish data sets here (hundreds of millions of rows), so I'm reluctant to use Row Normaliser/Denormaliser or write to file/DB if possible.

Have you considered the Modified Java Script Value step? Start with the Data Grid step, then a Select Values step, then the Modified Java Script Value step. In that step you can transform the value of each column into whatever form you want and output that to a file.
That of course requires some JavaScript knowledge, but given your example it seems that the required knowledge is pretty basic.
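As a minimal sketch, the script inside the Modified Java Script Value step could look something like this for the two example transforms (the output field names CountryName_out and Rating_out are assumptions; they would be declared in the step's Fields grid):

// Capitalise CountryName (assumes the incoming field is a string)
var CountryName_out = CountryName.toUpperCase();
// Floor Rating down to the previous integer (assumes a numeric field)
var Rating_out = Math.floor(Rating);

If your Kettle version supports metadata injection for this step, the script text itself could also be supplied at runtime, which keeps the per-column logic configurable rather than hard-coded.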


How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains multiple store's data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was this:
// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires using two scripts:
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.
Then, through the use of scripting or a tool written using our SDKs, download the previous output file and programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
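A rough sketch of that pattern, assuming the rowset is called @dt and picking an arbitrary location for the intermediate file:

// Script 1: write out the distinct (CustomerNumber, StoreNumber) pairs
@pairs =
    SELECT DISTINCT CustomerNumber, StoreNumber
    FROM @dt;

OUTPUT @pairs
TO "/Data/pairs.csv"
USING Outputters.Csv();

// Script 2 (generated programmatically from pairs.csv): one explicit OUTPUT per pair
@c1s1 =
    SELECT *
    FROM @dt
    WHERE CustomerNumber == 1 AND StoreNumber == 1;

OUTPUT @c1s1
TO "/Data/1/1/PosData.csv"
USING Outputters.Csv();

// ... repeated for every pair found by Script 1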

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row, which act as placeholders for the Supervisor piece that's missing when the unique identifier is (2, 1).
Towards this end, I started putting together this pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
    sorted = ORDER data BY key, value;
    GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times - once per key/value pair).
Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if, for most rows, most of the columns are not filled out, a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.
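A rough sketch of that idea (myudfs.CombineMaps is a hypothetical custom UDF that merges a bag of single-entry maps into one map):

data      = LOAD 'key_value' USING PigStorage(',') AS (i1:int, i2:int, key:chararray, value:chararray);
with_maps = FOREACH data GENERATE i1, i2, TOMAP(key, value) AS kv;
grouped   = GROUP with_maps BY (i1, i2);
combined  = FOREACH grouped GENERATE group.i1, group.i2, myudfs.CombineMaps(with_maps.kv) AS kv;
-- a missing key then comes back as null naturally, e.g. kv#'Supervisor' for (2,1)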
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect it is no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

Move FMOIX/FMCOX structures into Internal Table

I am a newbie to ABAP (3 days experience) and I am currently on a task to write reports using ABAP code. It is like moving some data from a specific SAP database to a Business Intelligence staging area.
So the core difficulty is that some data on the SAP server is in the format of dictionary structures (FMOIX, FMCOX, etc.) and I need to move this data into internal tables at program runtime. I was told that Open SQL would not work in this case.
If you still do not get what I mean, I can suggest several ways, actually given by my supervisor. The first is to use the GET event, say:
GET FMOIX.
  IF FMOIX-zhdlt > from_dat AND FMOIX-zhdlt < to_dat.
    APPEND FMOIX TO itab.
  ENDIF.
The thing is that I am still not very clear about this GET event. Is it just an event handler thing, or can it loop through data records?
What I googled for more than two days gives me something like:
LOOP AT FMOIX.
  MOVE FMOIX TO itab.
ENDLOOP.
So what are the ways to move a transactional structure like FMOIX into an internal table, say one named ITAB?
Your answer would be greatly appreciated. Though I have time, I am totally new.
Thanks a lot.
If your supervisor is suggesting that you use the GET event, it means that your program is (or should be) using a logical database - in this case probably FMF or FMF_BCS.
Doing GET FMOIX reads a set of fields defined in the logical database (as a node). Underneath your GET statement, you can use FMOIX as a structure, e.g. WRITE FMOIX-field1. The program will loop through all the rows returned according to your selection criteria (implicitly - it's not explicitly defined in the code like a LOOP...ENDLOOP is). You should be able to use MOVE-CORRESPONDING to move the contents of each row into a proper structure, and then APPEND that structure to your itab.
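A minimal sketch of that approach, assuming the report is attached to a logical database that provides the FMOIX node and that from_dat and to_dat are selection-screen parameters:

NODES: fmoix.

DATA: itab TYPE STANDARD TABLE OF fmoix,
      wa   TYPE fmoix.

GET fmoix.
  IF fmoix-zhdlt > from_dat AND fmoix-zhdlt < to_dat.
    MOVE-CORRESPONDING fmoix TO wa.
    APPEND wa TO itab.
  ENDIF.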
Quick link on GET in ABAPDocu
Note: this answer is a bit of a guess, since I've only used a logical database once, and the documentation is a little thin on the ground compared to the volumes out there about standard SELECTs and internal tables.
You can create your internal table with the type of that structure, such as:
data: itab like table of fmoix with header line.
And you can fill up this internal table wherever you are using your SELECT code.
Such as:
select * from ____
into corresponding fields of itab
where zhdlt gt from_dat
and zhdlt lt to_dat.
I'm not sure this is what you are looking for, but I can tell you that an itab created with the type of that structure can be filled up with all the corresponding data coming from your SELECT. You can't loop over FMOIX because it's not a table, it's a structure. So is there any specific reason to hold your data in structures?
Hope it was helpful.
Talha

What's the reasoning behind result columns being excluded from auto-select statements in PetaPoco

If I have a POCO class with the ResultColumn attribute set and then do a Single<Entity>() call, my result column isn't mapped. I've set my column to be a result column because its value should always be generated by the SQL column's default constraint. I don't want this column to be injected or updated from the business layer. What I'm trying to say is that my column's type is a simple SQL data type and not a related entity type (as I've seen ResultColumn being used mostly on those).
Looking at the code, I can see this in PetaPoco:
// Build column list for automatic select
QueryColumns = (from c in Columns
                where !c.Value.ResultColumn
                select c.Key
               ).ToArray();
Why are result columns excluded from the automatic select statement? As I understand it, their nature is to be read-only, so they should be used in selects only. I can see this scenario when a column is actually a related entity type (complex). OK, but then we should have a separate attribute like ComputedColumnAttribute that would always be returned in selects but never used in inserts or updates...
Why did PetaPoco team decide to omit result columns from selects then?
How am I supposed to read result columns then?
I can't answer why the creator did not add them to auto-selects, though I would assume it's because your particular use-case is not the main one that they were considering. If you look at the examples and explanation for that feature on their site, it's more geared towards extra columns you bring back in a join or calculation (like maybe a description from a lookup table for a code value). In these situations, you could not have them automatically added to the select because they are not part of the underlying table.
So if you want to use that attribute, and get a value for the property, you'll have to use your own manual select statement rather than relying on the auto-select.
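In other words, something along these lines (a sketch only; the table, column, and property names are made up):

// Explicitly list the columns, including the result column, so PetaPoco maps it
var entity = db.Single<Entity>(
    "SELECT entity_id, name, created_on FROM entity WHERE entity_id = @0", id);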
Of course, the beauty of using PetaPoco is that you can easily modify it to suit your needs, by either creating a new attribute, like you suggest above, or modifying the code you showed to not exclude those fields from the select (assuming you are not using ResultColumn in other join-type situations).
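For example, a hypothetical tweak along the lines you describe (ComputedColumn is not a real PetaPoco flag; you would add it yourself, backed by a new attribute on the POCO property):

// Build column list for automatic select, but keep computed read-only columns
// (ComputedColumn is an assumed flag added to the column metadata)
QueryColumns = (from c in Columns
                where !c.Value.ResultColumn || c.Value.ComputedColumn
                select c.Key
               ).ToArray();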

Question on how to use SQL Server Integration Services

I have a table called book whose attributes are booked_id, yearmon, and day_01...day_31. I need to unpivot the table and transform day_01...day_31 into rows, which I have succeeded in doing. The problem is that my yearmon has a format like 200805, and I need to append a day to it based on day_01, day_02, etc., so that I can create a new column with the date information; for example, for day_01 it would look like 20080501. Instead of writing a huge query, does anyone know how to use SSIS to transform it?
You should be able to use the Unpivot component and the Derived Column component to do what you need. Look into those and post back if they don't seem to do what you need.
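As a sketch, if the Unpivot component is configured to emit the original column name into a pivot key column (call it DayName, holding values like "day_01"), a Derived Column expression along these lines would build the date string (the column names and the DT_WSTR cast are assumptions about your data types):

(DT_WSTR, 6)yearmon + RIGHT(DayName, 2)

That produces "20080501" for yearmon 200805 and DayName "day_01". If an actual date data type is needed rather than a string, the expression would have to insert separators (yyyy-MM-dd) before casting to DT_DBDATE.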