Interpolation of missing values in Pentaho - pentaho

I am trying to fill down the missing values using Pentaho pdi.
Input:
Desired output:
Found so far only Filling data gaps in a stream in Pentaho Data Integration, is it possible? but it fills in with the last known value.
Potentially, I thought I could work with the above solution, I also added the next amount to the analytical query, along with the next date. Then, I added the flag in the clone step and filter the original results from the input into Dummy and generated results (from the calculator) to a calculator (at the moment). Then, potentially, I can dump that separate stream to a temp table in a database and run the sql query which will do the rolling subtraction. I am also investigating the javascript step.
I disregarded the Python or R Executor step because at the end I will be running the job on the aws vm and I already foresee the pain I will go through with the installation.
What would be your suggestions? Is there a simple way to do interpolation?
Updated for the question

The method provided in your link does work from my testing, (I am using LAG instead of LEAD for your tasks though). Here I am not looking to replicate that method, just another option for you by using JavaScript to build the logic which you might also extend to other applications:
In the testing below (tested on PDI-8.0), the transformation has 5 steps, see below
Data Grid step to create testing data with three fields: date, account number and amount
Sort rows to sort the rows based on account number and date. this is required for Analytic Query step, if your source data are already sorted, then skip this step
Analytic Query step, see below, create two more fields: prev_date and prev_amount
Modified Java Script Value step, add the following code, nothing else is needed to configure in this step:
var days_diff = dateDiff(prev_date, date, "d")
if (days_diff > 0) {
/* retrieve index for two fields: 'date', 'amount'
* and modify their values accordingly
*/
var idx_date = getInputRowMeta().indexOfValue("date")
var idx_amount = getInputRowMeta().indexOfValue("amount")
/* amount to increment by each row */
var delta_amount = (amount - prev_amount)/days_diff
for (var i = 1; i < days_diff; i++) {
newRow = createRowCopy(getOutputRowMeta().size());
newRow[idx_date] = dateAdd(prev_date, "d", i);
newRow[idx_amount] = prev_amount + delta_amount * i;
putRow(newRow);
}
}
Select values step to remove unwanted fields, i.e.: prev_date, prev_amount
Run the transformation, you will have the following shown under the Preview data tab of Modified Java Script Value step:
UPDATE:
Per your comments, you can do the following, assume you have a new field account_type:
in Analytic Query step, add a new field prev_account_type, similar to two other prev_ fields, just from different Subject: account_type
in Modified Java Script Value step, you need to retrieve the Row index for account_type and modify the logic to compute delta_amount, so when prev_account_type is not the same as the current account_type, the delta_amount is ZERO, see below code:
var days_diff = dateDiff(prev_date, date, "d")
if (days_diff > 0) {
/* retrieve index for three fields: 'date', 'amount', 'account_type' */
var idx_date = getInputRowMeta().indexOfValue("date")
var idx_amount = getInputRowMeta().indexOfValue("amount")
var idx_act_type = getInputRowMeta().indexOfValue("account_type")
/* amount to increment by each row */
var delta_amount = prev_account_type.equals(account_type) ? (amount - prev_amount)/days_diff : 0;
/* copy the current Row into newRow and modify fields accordingly */
for (var i = 1; i < days_diff; i++) {
newRow = createRowCopy(getOutputRowMeta().size());
newRow[idx_date] = dateAdd(prev_date, "d", i);
newRow[idx_amount] = prev_amount + delta_amount * i;
newRow[idx_act_type] = prev_account_type;
putRow(newRow);
}
}
Note: invoking Javascript interpreter does have some performance hit, so if that matters to you, stick to the method in the link you provided.

Related

GORM query - how to use calculated field in the query and filter according to its value

Currently, I have a GORM query that calculates a counter for each table entry using a second table and returns the first table with a new field "locks_total" which doesn't exist in the original.
Now, what I want to achieve is the same table returned (with the new "locks_total" field) but filtered with "locks_total" = 0.
I can't seem to make it happen because it doesn't recognize this field in the table, what can I do to make it happen? is it possible to run the query and then execute the filter on the new result table?
This is how we currently do it-
txn := dao.postgresManager.DB().Model(&models.SecretMetadataResponse{})
txn.Table(dao.firstTableName + " as s").
Select("s.*, (SELECT COUNT(*) FROM " + dao.secondTableName + " as l where s.id = l.secret_id) locks_total ")
var secretsTotal int64
var metadataResponseEntries []models.SecretMetadataResponse
txn = txn.Count(&secretsTotal). // Saving the count before trimming according to the pagination parameters
Limit(params.Limit).
Offset(params.Offset).
Order(constants.FieldName).
Order(constants.FieldId).
Find(&metadataResponseEntries)
When the SecretMetadataResponse contains the SecretMetadata which is the same fields as in the table and the LocksTotal is the new calculated field that we want to have.
type SecretMetadataResponse struct {
SecretMetadata
LocksTotal int `json:"locks_total"`
}
Thanks in advance :)

Struggling with simple boolean WHERE clause

Tired brain - perhaps you can help.
My table has two bit fields:
1) TestedByPCL and
2) TestedBySPC.
Both may = 1.
The user interface has two corresponding check boxes. In the code I convert the checks to int.
int TestedBySPC = SearchSPC ? 1 : 0;
int TestedByPCL = SearchPCL ? 1 : 0;
My WHERE clause looks something like this:
WHERE TestedByPCL = {TestedByPCL.ToString()} AND TestedBySPC = {TestedBySPC.ToString()}
The problem is when only one checkbox is selected I want to return rows having the corresponding field set to 1 or both fields set to 1.
Now when both fields are set to 1 my WHERE clause requires both check boxes to be checked instead of only one.
So, if one checkbox is ticked return records with with that field = 1 , regardless of whether the other field = 1.
Second attempt (I think I've got it now):
WHERE ((TestedByPCL = {chkTestedByPCL.IsChecked} AND TestedBySPC = {chkTestedBySPC.IsChecked})
OR
(TestedByPCL = 1 AND TestedBySPC = 1 AND 1 IN ({chkTestedByPCL.IsChecked}, {chkTestedBySPC.IsChecked})))
Misunderstood the question.
Change the AND to an OR:
WHERE TestedByPCL = {chkTestedByPCL.IsChecked} OR TestedBySPC = {chkTestedBySPC.IsChecked}
Also:
SQL Server does not have a Boolean data type, it's closest option is a bit data type.
The usage of curly brackets suggests using string concatenations to build your where clause. This might not be a big deal when you're handling checkboxes but it's a security risk when handling free text input as it's an open door for SQL injection attacks. Better use parameters whenever you can.

Berkeley DB equivalent of SELECT COUNT(*) All, SELECT COUNT(*) WHERE LIKE "%...%"

I'm looking for Berkeley DB equivalent of
SELECT COUNT All, SELECT COUNT WHERE LIKE "%...%"
I have got 100 records with keys: 1, 2, 3, ... 100.
I have got the following code:
//Key = 1
i=1;
strcpy_s(buf, to_string(i).size()+1, to_string(i).c_str());
key.data = buf;
key.size = to_string(i).size()+1;
key.flags = 0;
data.data = rbuf;
data.size = sizeof(rbuf)+1;
data.flags = 0;
//Cursor
if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0) {
dbp->err(dbp, ret, "DB->cursor");
goto err1;
}
//Get
dbcp->get(dbcp, &key, &data_read, DB_SET_RANGE);
db_recno_t cnt;
dbcp->count(dbcp, &cnt, 0);
cout <<"count: "<<cnt<<endl;
Count cnt is always 1 but I expect it calculates all the partial key matches for Key=1: 1, 10, 11, 21, ... 91.
What is wrong in my code/understanding of DB_SET_RANGE ?
Is it possible to get SELECT COUNT WHERE LIKE "%...%" in BDB ?
Also is it possible to get SELECT COUNT All records from the file ?
Thanks
You're expecting Berkeley DB to be way more high-level than it actually is. It doesn't contain anything like what you're asking for. If you want the equivalent of WHERE field LIKE '%1%' you have to make a cursor, read through all the values in the DB, and do the string comparison yourself to pick out the ones that match. That's what an SQL engine actually does to implement your query, and if you're using libdb instead of an SQL engine, it's up to you. If you want it done faster, you can use a secondary index (much like you can create additional indexes for a table in SQL), but you have to provide some code that links the secondary index to the main DB.
DB_SET_RANGE is useful to optimize a very specific case: you're looking for items whose key starts with a specific substring. You can DB_SET_RANGE to find the first matching key, then DB_NEXT your way through the matches, and stop when you get a key that doesn't match. This works only on DB_BTREE databases because it depends on the keys being returned in lexical order.
The count method tells you how many exact duplicate keys there are for the item at the current cursor position.
You can use method DB->stat().
For example, number of unique keys in the BT_TREE.
bool row_amount(DB *db, size_t &amount) {
amount = 0;
if (db==NULL) return false;
DB_BTREE_STAT *sp;
int ret = db->stat(db, NULL, &sp, 0);
if(ret!=0) return false;
amount = (size_t)sp->bt_nkeys;
return true;
}

Error in programmatically copying data to SpFieldChoice field SharePoint 2010

I am new to sharepoint, I have a custom field type derived from SpFieldChoice , my field allows users to select multiple values, I have a requirement of replacing some old custom columns with the new column and copy the data in old column to the new column. the old column also allows the users to select multiple values by ticking checkboxes, I have the following code to copy the data to new field.
foreach (SPListItem item in list.Items)
{
if (item[oldField.Title] == null)
{
item[newFld.Title] = string.Empty;
item.Update();
}
else
{
string[] itemvalues = item[oldField.Title].ToString().Split(new string[] {";#"}, StringSplitOptions.None);
StringBuilder multiLookupValues = new StringBuilder();
multiLookupValues.Append(";#");
for (int cnt = 0; cnt < (itemvalues.Length) / 2; cnt++)
{
multiLookupValues.Append (itemvalues[(cnt * 2) + 1].ToString() + ";#");
}
item[newFld.Title] = multiLookupValues.ToString();
item.SystemUpdate(false) ;
}
}
This code works fine until the length of resulting stringbuilder is less than 255 charachters , but when this length is greater then 255 I get the following Exception.
Invalid choice Value. A choice field contains invalid data.Please check the value and try again.
Is there any other way of copying data to SpFiledChoice, How can I resolve this problem? please help me.
Do the update multiple times so that the string doesn't exceed - i.e. value +=. However, if the problem is that the value can't be longer that 255, you have to consider how you are doing the choices. If it is exceeding the length and updating the value multiple times doesn't work (and a Site Column will have the same limitations), you can do the next best thing:
1) Create a new list that will hold the choices
2) Change the destination field to be a lookup
3) Update accordingly for each item (picking up the ID from the lookup field)
There's no limit to this.
David Sterling
david_sterling#sterling-consulting.com
www.sterling-consulting.com

Saving state of closure in Groovy

I would like to use a Groovy closure to process data coming from a SQL table. For each new row, the computation would depend on what has been computed previously. However, new rows may become available on further runs of the application, so I would like to be able to reload the closure, initialised with the intermediate state it had when the closure was last executed in the previous run of the application.
For example, a closure intending to compute the moving average over 3 rows would be implemented like this:
def prev2Val = null
def prevVal = null
def prevId = null
Closure c = { row ->
println([ prev2Val, prevVal, prevId])
def latestVal = row['val']
if (prev2Val != null) {
def movMean = (prev2Val + prevVal + latestVal) / 3
sql.execute("INSERT INTO output(id, val) VALUES (?, ?)", [prevId, movMean])
}
sql.execute("UPDATE test_data SET processed=TRUE WHERE id=?", [row['id']])
prev2Val = prevVal
prevVal = latestVal
prevId = row['id']
}
test_data has 3 columns: id (auto-incremented primary key), value and processed. A moving mean is calculated based on the two previous values and inserted into the output table, against the id of the previous row. Processed rows are flagged with processed=TRUE.
If all the data was available from the start, this could be called like this:
sql.eachRow("SELECT id, val FROM test_data WHERE processed=FALSE ORDER BY id", c)
The problem comes when new rows become available after the application has already been run. This can be simulated by processing a small batch each time (e.g. using LIMIT 5 in the previous statement).
I would like to be able to dump the full state of the closure at the end of the execution of eachRow (saving the intermediate data somewhere in the database for example) and re-initialise it again when I re-run the whole application (by loading those intermediate variable from the database).
In this particular example, I can do this manually by storing the values of prev2Val, prevVal and prevId, but I'm looking for a generic solution where knowing exactly which variables are used wouldn't be necessary.
Perhaps something like c.getState() which would return [ prev2Val: 1, prevVal: 2, prevId: 6] (for example), and where I could use c.setState([ prev2Val: 1, prevVal: 2, prevId: 6]) next time the application is executed (if there is a state stored).
I would also need to exclude sql from the list. It seems this can be done using c.#sql=null.
I realise this is unlikely to work in the general case, but I'm looking for something sufficiently generic for most cases. I've tried to dehydrate, serialize and rehydrate the closure, as described in this Groovy issue, but I'm not sure how to save and store all the # fields in a single operation.
Is this possible? Is there a better way to remember state between executions, assuming the list of variables used by the closure isn't necessarily known in advance?
Not sure this will work in the long run, and you might be better returning a list containing the values to pass to the closure to get the next set of data, but you can interrogate the binding of the closure.
Given:
def closure = { row ->
a = 1
b = 2
c = 4
}
If you execute it:
closure( 1 )
You can then compose a function like:
def extractVarsFromClosure( Closure cl ) {
cl.binding.variables.findAll {
!it.key.startsWith( '_' ) && it.key != 'args'
}
}
Which when executed:
println extractVarsFromClosure( closure )
prints:
['a':1, 'b':2, 'c':4]
However, any 'free' variables defined in the local binding (without a def) will be in the closures binding too, so:
fish = 42
println extractVarsFromClosure( closure )
will print:
['a':1, 'b':2, 'c':4, 'fish':42]
But
def fish = 42
println extractVarsFromClosure( closure )
will not print the value fish