Transpose thousands of rows to columns in Pentaho

I have the dataset:
What I need is to have all accounts for each concat group in one field, separated by a comma. I was able to achieve it with the denormaliser and then some regex. That works fine when you have a few accounts, but now I have a case with more than 10K accounts. How can I achieve it?

Neither the Row denormaliser step nor the field concat step (which would achieve the second objective of this task) allows dynamic field names, as far as I can tell. So one unorthodox solution for dealing with a large number of possible values in the denormalisation and concatenation would be to simply specify them all. For example, a field in the denormaliser step is defined as:
<field>
<field_name/>
<key_value/>
<target_name>field_1</target_name>
<target_type>None</target_type>
<target_format/>
<target_length>-1</target_length>
<target_precision>-1</target_precision>
<target_decimal_symbol/>
<target_grouping_symbol/>
<target_currency_symbol/>
<target_null_string/>
<target_aggregation_type>-</target_aggregation_type>
</field>
So you could write a script that prints the template for all fields and insert the output in place of the <fields> tag in the transformation's XML.
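A rough sketch of such a generator in Python (the "account" field name, the key values 1..10000 and the output file name are placeholders; substitute your own and paste the output into the .ktr file):

# Generate one <field> block per expected key value for the Row denormaliser step.
# All names below are hypothetical examples, not taken from the original transformation.
TEMPLATE = """<field>
<field_name>account</field_name>
<key_value>{key}</key_value>
<target_name>field_{key}</target_name>
<target_type>None</target_type>
<target_format/>
<target_length>-1</target_length>
<target_precision>-1</target_precision>
<target_decimal_symbol/>
<target_grouping_symbol/>
<target_currency_symbol/>
<target_null_string/>
<target_aggregation_type>-</target_aggregation_type>
</field>"""

with open("fields.xml", "w", encoding="utf-8") as out:
    for key in range(1, 10001):          # one block per possible key value
        out.write(TEMPLATE.format(key=key) + "\n")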
Note: this is not fit for production. It is a solution if you need to do a task once or maybe twice to import some data. I wouldn't want to have to deal with an ETL process where this was deployed. A proper solution probably involves a custom step or an external script. I will gladly be proven wrong on this.

Related

SQL: correct uppercase and lowercase in mixed text

I have a database with lots of text that is all in capital letters and should now be converted into normal lower/upper case writing. This applies to technical terms, for example:
CIRCUIT BREAKER(D)480Y/277VAC,1.5A,2POL
This should be written as circuit breaker(D)480Y/277VAC,1.5A,2POL
I think the only approach is to have a list of words with the correct spelling and then do something like a search & replace.
Can anybody give me a clue how to approach this in MS SQL?
You could do one of two things:
Write a simple script (in whichever language you are comfortable with, e.g. PHP, Perl, Python) that reads the columns from the DB, does the case conversion and updates the values back into the DB.
The advantage of this is that you have greater flexibility and control over what you want to modify and how you wish to do it.
For this solution to work, you may need to maintain a dict/hash in the script that maps the upper-case keywords to their correct spellings, as in the sketch below.
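A minimal sketch of that script in Python, assuming a hypothetical parts table with id and description columns and a pyodbc connection (the DSN, table and column names are made up; adjust them to your schema):

import pyodbc

# Mapping of upper-case keywords to their correct spelling; extend as needed.
CASE_MAP = {
    "CIRCUIT": "circuit",
    "BREAKER": "breaker",
}

conn = pyodbc.connect("DSN=mydb")          # hypothetical connection string
cur = conn.cursor()

rows = cur.execute("SELECT id, description FROM parts").fetchall()
for row_id, text in rows:
    fixed = text
    for upper_word, lower_word in CASE_MAP.items():
        fixed = fixed.replace(upper_word, lower_word)
    if fixed != text:                      # only write back rows that changed
        cur.execute("UPDATE parts SET description = ? WHERE id = ?", fixed, row_id)

conn.commit()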
The second possible solution, if you do not wish to write a separate script, is to create a SQL function instead that reads the rows/columns that need to be updated, performs the case conversion and writes the values back into the DB.
This might be slightly less efficient, depending on how you implement the function, but it removes the dependency on a separate script.
For this solution to work, you may need to maintain another table that maps the upper-case keywords to their lower-case counterparts.
Whichever you are more comfortable with.
Create a mapping table as a dictionary. It could be a temporary table within a session. Load the table with values and use this temporary table for your update. If you want this to be a permanent solution that also handles new rows coming in, make it a permanent table.
CREATE TABLE #dictionaryMap(sourceValue NVARCHAR(4000), targetValue NVARCHAR(4000));
CREATE TABLE #tableForUpdate(Term NVARCHAR(4000));
INSERT INTO #dictionaryMap
VALUES ('%BREAKER%','breaker'),('%CIRCUIT%','circuit');
INSERT INTO #tableForUpdate
VALUES ('CIRCUIT BREAKER(D)480Y/277VAC,1.5A,2POL');
Then perform the UPDATE on #tableForUpdate with a WHILE loop in T-SQL, using #dictionaryMap.
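One possible shape of that loop (a sketch only; it works through a throwaway copy of the dictionary so the original mappings are kept):

SELECT sourceValue, targetValue INTO #work FROM #dictionaryMap;

DECLARE @source NVARCHAR(4000), @target NVARCHAR(4000);

WHILE EXISTS (SELECT 1 FROM #work)
BEGIN
    SELECT TOP (1) @source = sourceValue, @target = targetValue FROM #work;

    -- sourceValue carries %-wildcards for the LIKE match; strip them
    -- to get the literal keyword for REPLACE.
    UPDATE #tableForUpdate
    SET Term = REPLACE(Term, REPLACE(@source, '%', ''), @target)
    WHERE Term LIKE @source;

    DELETE FROM #work WHERE sourceValue = @source;
END

DROP TABLE #work;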

Clean unstructured place names to a structured format

I have around 300k rows of unstructured data as in the screen below. I'm trying to use Google Refine / OpenRefine to clean this up, but I'm unable to find a proper way to do it; I'm new to this tool. Any help would be greatly appreciated. Also, the tool is quite slow at processing 300k records: whenever I try something out, it takes a long time to process and give an output.
Or please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad to receive an acceptable answer. We can just provide you with a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, you need to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, some clear patterns stand out. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 digits and one "-" between them), as sketched below.
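For that last pattern, a hedged guess at the GREL expression (the digit counts are assumptions; tune them to the shapes you actually see in your data):
value.trim().replace(/\b\d{2,}-\d{3,}$/, '')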
Feel free to check out the Open Refine documentation in case of doubt.

.Net Parsing Fixed Width Data... From a Concatenated, Single, Fixed-Width Column

I was bored and looking at old code that runs like molasses on a cold day. I found a group of tables in our accounting system - each with 500,000 records of ~20 datapoints - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .NET ETL project grabs all records, does a bunch of substrings on each record to set an object's corresponding attributes, then sends the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there were a way to use the TextFieldParser to parse a static string instead of a file/stream, that might be a valid idea. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than Split, which I would assume is comparable to the string manipulation our project is currently doing with Substring.
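For what it's worth, TextFieldParser has a constructor that takes a TextReader, so wrapping each concatenated column value in a StringReader should work without going through a file. A rough VB.NET sketch (the sample record and field widths are made up):

Imports Microsoft.VisualBasic.FileIO
Imports System.IO

Module FixedWidthSketch
    Sub Main()
        ' Hypothetical concatenated, fixed-width record from the table.
        Dim rawRecord As String = "ACME CORP 2017-01-01  1234.56"

        Using parser As New TextFieldParser(New StringReader(rawRecord))
            parser.TextFieldType = FieldType.FixedWidth
            parser.SetFieldWidths(10, 12, -1)   ' widths are placeholders; -1 = rest of line

            Dim fields() As String = parser.ReadFields()
            ' fields(0), fields(1), ... would map onto the object's attributes.
        End Using
    End Sub
End Module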
Or perhaps the whole, old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL? Or can it import directly from the table?
Thank you in advance!

Dynamically execute a transformation against a column at runtime

I have a Pentaho Kettle job that can load data from x number of tables, and put it into target tables with a different schema.
Assume I have table 1, like so:
I want to load this table into a destination table that looks like this:
The columns have been renamed, the order has been changed, and the data has been transformed. The rename and reorder are easily managed by using the Select Values step, which can be used within an ETL Metadata Injection step, making it dependent on some configuration values loaded at runtime.
But if I need to perform some transformation logic on some of the columns, based on where they go in the target table, this seems to be less straightforward.
In my example, I want the column "CountryName" to be capitalised, and the column "Rating" to be floored (as in changing the real number to the previous integer value).
While I could do this by just manually adding a transformation to accomplish each, I want my solution to be dynamic, so it could just as easily run the "CountryName" column through a checksum component, or perform a ceiling on "Rating" instead.
I can easily wrap these transformations in another transformation so that they can be parameterised and executed when needed:
But, where I'm having trouble is, when I process a row of data, I need a way to be able to say:
Column "CountryName" should be passed through the Capitalisation transform
Column "Rating" should be passed through the Floor transform
Column(s) "AnythingElse" should be passed through the SomeOther transform
Is there a way to dynamically split out the columns in a row, and execute a different transform on each one, based on some configuration metadata that can be supplied?
Logically, it would be something like this, although I suspect there may be a way to handle it as a loop or some form of dynamic transformation, rather than mapping out a path per column:
Kettle is so flexible that it seems like there must be a way to do this, I'm just struggling to know which components to use and how to do it. Any experts out there have some suggestions?
I'm dealing with some biggish data sets here (hundreds of millions of rows) so reluctant to use Row Normaliser/Denormaliser or writing to file/DB if possible.
Have you considered the Modified Java Script Value step? Start with the Data Grid step, then a Select Values step, then the Modified Java Script Value step. In that step you can transform the value of each column into whatever form you want and output that to a file.
That of course requires some JavaScript knowledge, but given your example the required knowledge seems pretty basic.
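For example, the script inside the Modified Java Script Value step could be as small as this (assuming the column names from the question; the new variables are then added as output fields in the step's Fields grid):

// Upper-case CountryName and floor Rating; input fields are available
// as variables named after the incoming columns.
var CountryNameUpper = CountryName.toUpperCase();
var RatingFloored = Math.floor(Rating);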

Script to compare two tables in database, from user input

I am very new to VBA and SQL and am trying to learn. I have an MS Access project that requires a VBA script that prompts the user to input two table names and numerous field names, and creates a SQL query utilizing those names.
The specific SQL query I'm trying to use is below.
SELECT
A.user_index, A.input1, B.input1, A.input2, B.input2, A.input3, B.input3, B.input4,
A.input4, A.input5, B.input5
FROM
table1 AS A
LEFT JOIN
table2 AS B ON A.user_index = B.user_index
WHERE
(((A.input1) <> [B].[input1])) OR
(((A.input2) <> [B].[input2])) OR
(((A.input4) <> [B].[input4]));
The overall purpose of this is to have a script that can list fields for comparison and is applicable to any database. I know this probably has a relatively easy solution. However, I have no idea where to start.
My first instinct is to say "What have you tried so far?", but as you said, you don't know where to start.
It sounds like you need to first prompt the user for several field and table names, then build a query based on those values. I recommend first outlining exactly what you want your script to do. Maybe something like:
1. Declare variables to hold the values.
2. Prompt the user for each of the values and store them in the variables.
2a. After the user enters a value, make sure it is valid. If not, do something accordingly.
3. Declare a variable to hold your SQL query.
4. Construct the query.
5. Run the query.
This is obviously just an example. Break down each step into "baby steps" as much as possible.
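Purely for illustration, a rough VBA/DAO sketch of those steps might look like this (the prompts, the user_index join key from your query, and the single compared field are just placeholders):

Sub CompareTables()
    ' 1. Variables to hold the user-supplied names.
    Dim table1 As String, table2 As String, fieldName As String
    Dim sql As String
    Dim rs As DAO.Recordset

    ' 2. Prompt the user (add validation here, step 2a).
    table1 = InputBox("First table name")
    table2 = InputBox("Second table name")
    fieldName = InputBox("Field to compare")

    ' 3.-4. Build the query from the supplied names.
    sql = "SELECT A.user_index, A." & fieldName & ", B." & fieldName & _
          " FROM " & table1 & " AS A LEFT JOIN " & table2 & _
          " AS B ON A.user_index = B.user_index" & _
          " WHERE A." & fieldName & " <> B." & fieldName & ";"

    ' 5. Run the query and do something with the differences.
    Set rs = CurrentDb.OpenRecordset(sql)
    Do While Not rs.EOF
        Debug.Print rs!user_index, rs(1), rs(2)
        rs.MoveNext
    Loop
    rs.Close
End Sub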
It's a good idea to ask yourself how unique these baby steps are to your particular situation (hint: they almost certainly are not unique). If they aren't, then they have probably been solved tens of thousands of times already, so you have a very good chance of googling your questions.
If you still can't find an answer to how to do a particular step, feel free to ask here. Just remember to include your code even if it is broken :)