Converting Row Columns to Rows with Pentaho Kettle/PDI

Brand new to Pentaho (and a newbie SO poster so look out!)
I'd like to use Kettle/PDI to transform data coming in from an RDBMS from this (for example):
Question1  Question2  Question3  Question4
1/1/13     123.00     Test 1     Test 1.1
1/2/13     124.00     Test 2     Test 1.2
1/3/13     125.00     Test 3     Test 1.3
1/4/13     126.00     Test 4     Test 1.4
1/5/13     127.00     Test 5     Test 1.5
to this:
QuestionName  AnswerDate  AnswerNumber  AnswerString
Question1     1/1/13
Question1     1/2/13
Question1     1/3/13
Question1     1/4/13
Question1     1/5/13
Question2                 123.00
Question2                 124.00
Question2                 125.00
Question2                 126.00
Question2                 127.00
Question3                               Test 1
Question3                               Test 2
Question3                               Test 3
Question3                               Test 4
Question3                               Test 5
Question4                               Test 1.1
Question4                               Test 1.2
Question4                               Test 1.3
Question4                               Test 1.4
Question4                               Test 1.5
As hopefully reflected above, there should be an "Answer<FieldDataType>" column for each available datatype in the original table. Is this possible with PDI? If so, can someone provide me with some pointers? I've tried using the Row Normaliser step to pivot the table and assign the new fields, but am probably not doing things quite right (or there is a bug [PDI 4.4]).

I accomplished this by using a scripting step to write an output row containing the column and value for each column in the input row. From there, I went to a Regex Evaluation step and used multiple capture groups to map the value types to additional columns in the stream. I messed around with the Row Normaliser for a while, but couldn't get it to do exactly what I wanted. The performance loss of using a scripting step was negligible.

Use a JavaScript step (Modified Java Script Value):
// Suppress the incoming row itself; only the rows created below are written out.
trans_Status = SKIP_TRANSFORMATION;

// One output row per question, each with four fields:
// QuestionName, AnswerDate, AnswerNumber, AnswerString
var row1 = createRowCopy(4);
var row2 = createRowCopy(4);
var row3 = createRowCopy(4);
var row4 = createRowCopy(4);

// Field 0 holds the question name...
row1[0] = 'Question1';
row2[0] = 'Question2';
row3[0] = 'Question3';
row4[0] = 'Question4';

// ...and the value goes into the column that matches its data type.
row1[1] = Question1; // date   -> AnswerDate
row2[2] = Question2; // number -> AnswerNumber
row3[3] = Question3; // string -> AnswerString
row4[3] = Question4; // string -> AnswerString

putRow(row1);
putRow(row2);
putRow(row3);
putRow(row4);
Don't forget to add the four output fields on the step's Fields tab.

The Row Normaliser is very sensitive to the order in which you specify the field mappings.
I had a sparse matrix input and discovered the following rules:
The Type values must be grouped together, like with like.
The new field column must be in the same order within each Type grouping.
The Type groups must be arranged most populous first, least populous last.
Thus, in the example given, specifying
Fieldname   Type    New field
Question1   date    AnswerDate
Question2   number  AnswerNumber
Question3   string  AnswerString
Question4   string  AnswerString
will work better than
Fieldname   Type    New field
Question1   date    AnswerDate
Question3   string  AnswerString
Question2   number  AnswerNumber
Question4   string  AnswerString

Related

Compare last row to previous row by group and populate new column

I need to compare the last row of a group to the row above it, see if changes occur in a few columns, and populate a new column with 1 if a change occurs. The data presentation below will explain better.
Also need to account for having a group with only 1 row.
what we have:
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom BBall Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
what we want:
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBALL Toto Yes 0 0 0 0
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
I only care about changes from row to row within a group. Thank you in advance for any help.
The rows are already ordered, so we only need the last two rows of each group. If it is easier to compare sequential rows within a group, that is just as good for my purposes.
I did know this would involve arrays, and I struggle with those because I never use them in my typical SAS modeling. I wanted to keep things short and sweet.
Use a DATA step and the LAG function. Ensure your data is sorted by group first, and that the rows within each group are sorted in the correct order. Using arrays will make your code much smaller.
The logic below compares each row with the previous row. A flag of 1 is set only if:
It's not the first row of the group, and
The current value differs from the previous value.
The syntax var = (logical test); is a shortcut that assigns 1 when the test is true and 0 otherwise.
data want;
    set have;
    by group;
    array var[*] name sport dogname eligibility;
    array lagvar[*] $ 40 lag_name lag_sport lag_dogname lag_eligibility;
    array changeflag[*] N_change S_change D_change E_change;
    /* One LAG call per variable, executed unconditionally, so each call
       returns that variable's value from the previous observation. */
    lag_name        = lag(name);
    lag_sport       = lag(sport);
    lag_dogname     = lag(dogname);
    lag_eligibility = lag(eligibility);
    do i = 1 to dim(var);
        /* flag a change only when this is not the first row of the group */
        changeflag[i] = (var[i] NE lagvar[i] AND NOT first.group);
    end;
    drop lag: i;
run;
It is not uncommon for procedural programmers to run into this kind of dilemma in SQL, which is predominantly a set language where rows have no position. If you write a procedure that reads the selected data (sorted in the desired order), it can keep variables holding the previous row's values and use them to populate the additional columns in the output, similar to the LAG function above.
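For illustration only, here is a rough sketch of that procedural idea in Python rather than a SQL procedure (the sample rows below are made up; the column names mirror the question's data): walk the rows already sorted by group, remember the previous row, and emit a 0/1 change flag per tracked column.
# made-up rows standing in for the sorted query result
rows = [
    {'Group': 1, 'Name': 'Tom',   'Sport': 'golf',     'DogName': 'Toto',  'Eligibility': 'Yes'},
    {'Group': 1, 'Name': 'Tom',   'Sport': 'golf',     'DogName': 'spot',  'Eligibility': 'Yes'},
    {'Group': 2, 'Name': 'Nancy', 'Sport': 'vllyball', 'DogName': 'Jimmy', 'Eligibility': 'yes'},
    {'Group': 2, 'Name': 'Nancy', 'Sport': 'vllyball', 'DogName': 'rover', 'Eligibility': 'no'},
]

tracked = ['Name', 'Sport', 'DogName', 'Eligibility']
prev = None
for row in rows:
    same_group = prev is not None and prev['Group'] == row['Group']
    for col in tracked:
        # flag is 1 only within a group and only when the value changed
        row[col[0] + '_change'] = int(same_group and row[col] != prev[col])
    prev = row
    print(row)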
Or you can put it into a spreadsheet, which is better suited to detecting the changes with formula-filled columns such as =IF(A2<>A1,1,0). Just make sure nobody re-sorts the spreadsheet data into a new order!

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by ID as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same ID are always in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and likewise for every other ID.
I want to have 5 folds, so 80% of the IDs will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can put the result within GridSearchCV() for the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation in future first.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. doing multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
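As a rough sketch of how the pieces fit together (the estimator, parameter grid, and the target column y below are made up for illustration, since the question's data has no target column):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# toy stand-in for the question's dataframe, plus an assumed target column 'y'
df = pd.DataFrame({
    'Val1': [2, 1, 5, 5, 0, 3, 2, 7, 6, 5, 4, 1],
    'Val2': [2, 2, 7, 1, 9, 1, 7, 2, 5, 1, 9, 8],
    'Val3': [8, 3, 8, 4, 0, 3, 5, 8, 5, 8, 0, 2],
    'Id':   ['b', 'a', 'z', 'a', 'c', 'b', 'z', 'c', 'd', 'a', 'z', 'z'],
    'y':    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# GroupKFold puts every Id in exactly one test fold; GroupShuffleSplit can be
# swapped in the same way if overlapping test sets are acceptable.
cv = GroupKFold(n_splits=5)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50]},
                      cv=cv)
# the group labels are passed to fit() and forwarded to the splitter
search.fit(df[['Val1', 'Val2', 'Val3']], df['y'], groups=df['Id'])
print(search.best_params_)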

Pandas Dataframe: Divide Column entries by number of occurrences

my Problem:
I have this DF:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually divided among the three shareholders. So I want to give each of them their share of 3% and attach it to their names. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?
You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO'].transform('count')).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Please note that I have assumed the values of the 'Share' column to be floats, as you can see above.

Validating linked data

I have a set of source data that looks something like this:
Project Series Paper
Unit 1 1806 1
Unit 1 1806 2
Unit 1 1806 3
Unit 2 1903 1
Unit 2 1903 2
Unit 2 2003 1
Unit 2 2003 2
Unit 2 2103 1
Unit 2 2103 2
Unit 3 1806 1
Unit 3 1906 1
This data normally lives in a database and is huge, on the order of half a million rows.
We also have users who will input a combination of Project, Series and Paper and then click Submit.
Before the data is submitted, I would like it to be validated against the source data, telling the user whether the combination they have entered is valid or not.
Something like this:
Project Series Paper Valid?
Unit 1 1806 1 No
Unit 2 1906 2 Yes
The easiest solution I can think of is to concatenate the data and do a lookup on each. However, this would create unnecessarily heavy load on the database, since a new column would have to be created across half a million rows of data...
I was wondering if there is a loop function in VBA that would check the combination from the source data and let the user know if it is valid or not?
I really appreciate your input.
Ideally, this should be done in your SQL update by checking for primary key conflicts, but to do this in Excel you could use the COUNTIFS function and check whether you have any matches in your dataset.
So, suppose you have your DB table in a range, say A1:C500000, and your input checker values in cells F2:H2; you could then use the following formula for your Valid? column in cell I2:
I2: =IF(COUNTIFS(A1:A500000,F2,B1:B500000,G2,C1:C500000,H2)=0,"Yes", "No")
That should do the trick.

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general any number much less than the total number of user_ids; in this case, say, one less than the total number of user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new_column'] = df['user_id'].apply(lambda x: df['user_id'].sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation.
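If the sampled ids should come only from the same tenant, one possible sketch is to apply a row-wise function that samples from that row's tenant group (the helper name sample_tenant_users and the max_ids cap are made up for illustration, not part of the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})

def sample_tenant_users(row, max_ids=10):
    # all user_ids belonging to this row's tenant
    users = df.loc[df['tenant_id'] == row['tenant_id'], 'user_id']
    # sample between 1 and one less than the tenant's user count, capped at max_ids
    upper = min(max_ids, max(len(users) - 1, 1))
    n = np.random.randint(1, upper + 1)
    return users.sample(n=n).tolist()

df['new_column'] = df.apply(sample_tenant_users, axis=1)
print(df)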