Pentaho Kettle replace cell value if value exists in other two columns - pentaho

Am using Pentaho BI server with Data Integration to get data set and merge data from two input tables, Now i need to replace value of cell in each row, if two other columns in the same row matches my criteria, how can i accomplish this with Kettle?
I need to match many values with the cell value in each row, values to be matched with the cell are inside an excel sheet.
I have tried using Replace in string component but it does not work :( can you help me in this regard?

You can do compound tests with the Filter Rows step. For a result that passes you can follow up with a Set Field Value step. It would look something like this:
The way to get the "AND" condition in the Filter Rows step is to click on the little icon on the far upper right of the condition box. Note also that you must supply a value for 'ReplaceVal' earlier in the transform (I just hard coded it in the Data Grid).
EDIT: Based on the wording of your question, your criteria is a simple null check. "IS NULL" is one of the conditions available in the Filter Rows step.

Related

How to use vba to filer a column using value from a specific cell

I want to use a macro to filter columns in a table. I want to filter for values that are higher than a value I want to put in cell, to be able to easily change the filter. Does someone have a trick for doing this with vba?
Many thanks, Bram
Record a macro whilst filtering a table on a column value. You would right click on the table column header of interest whilst recording the code and select Number_Filters > Greater Than and enter your desired number. That would give you the outline code. You can then amend the code to pick up the desired value from a specified cell. If applying filter to multiple columns record macro whilst doing this process over several columns.
Thank you for you answer. I tried this already, but I could not get the macro to pick from a specific cell. If I stored the value of the specific cell under as 'value' and put that in the outlined code, it would just do Greater Than value.. DO you have shortcut for this?
Thanks!

Find first non-blank cell in column that meets criteria in another column

I've compiled multiple spreadsheets containing sporadic employee information, and I'm now trying to consolidate all of the information to remove duplicates and blanks. The formula below is my starting point, but if the first cell that meets that criteria is blank, it returns a blank. I want it to find the next cell that meets that criteria but has a value.
=INDEX(Working!C:C,MATCH($A3,Working!$B:$B,0))
Below is what the Working tab looks like, which contains the master list of data including blanks and duplicates. Working!C:C is the list of last names; $A3 is the Employee ID I'm hoping to retrieve data for, and Working!$B:$B is the list of Employee IDs. I'll be doing this for many columns, so to illustrate this, in the table example below I've shown that Column D is the phone number. Any help you can provide is appreciated!
Column B-------C-------D
---------287-----Doe----blank
---------287-----blank---333-333-3333
---------287-----Doe----blank
Use the following array formula:
=INDEX(Working!C$1:C$100,MATCH(1,($A3 = Working!$B$1:$B$100)*(Working!C$1:C$100<>""),0))
Being an array formula it needs to be confirmed with Ctrl-Shift-Enter instead of Enter when exiting edit mode. If done correctly then Excel will put {} around the formula.
Please note that with an array formula the references need to be the smallest range possible that covers the dataset.

OpenRefine - Fill between cells but not at the end of the list

I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
I have tried in Excel to test a few cells below and if it's now empty, do the fill. The problem is that due to the inconsistency of the gaps, it's a tedious task to change the function for all the cases.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.
I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine in Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
For each column you are going to work with do a cell transform to put a dummy value in empty cells - e.g. if(isBlank(value),"null",value)
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project - e.g.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using the row.index as the marker for the current row.
I used the following formula to add a new column to the project:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
Change back to 'Row' mode
Remove the 'null' placeholder from the original column
Create a facet on the 'fill filter' column
In my case I filter to 'a'
Use the 'fill down' option
Remove the filter
And remove the 'record' column
Rather a long winded way of doing it to say the least, but so far I've not been able to find anything better while not going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or smaller number of steps.
If you want to access the array of cell values using Jython as suggested by iMitwe you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(step 5)
I am doing this on the top of my head, but I think your best chance my be using the fill down option in record mode:
first move your column to the first column and switch to record mode.
then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column
The [-1] will take the last value and fill the blank. For the case with the red dot, since there is no value it should remains empty. Let us know how it goes.
Unless there's something I am missing or not seeing...
I would have just sorted reverse (date ascending) on the Date column, then individually use Fill Down on each column, except for that last column where you could then use a Date facet on your column Date to specify the exact Date range you wanted to work with, then fill down on that last column, then remove the Date range facet.

Need a simple search function to display most common value in a column. (with ambiguous choices)

I have a very large array of data with many columns that display different outputs for the values presented. I would like to add a row above the data that will display the most common occurring value or word below.
Generally I would like to have each top of the column (right under the column label in row 1) have the most common value below. I will then use this value for various data analysis functions!
Is this possible, and if so, how? Preferably this will not require VBA, but simply a short code in the cell.
One caveat: The exact values may vary, so there is no set list where I can say "it will be one of these."
Any ideas appreciated!
Try a series of =COUNTIF(A:A,"VALUE TO SEARCH") functions if you want to stay away from VBA.
Otherwise, the best method would be to iterate through each column via VBA. With this method, you can even count the "varying" values and return the count and/or the value itself.
http://www.excel-easy.com/examples/most-frequently-occurring-word.html
This is a single formula you would write at the top of each column. Does not require VBA. You can replace the set range to an entire column, such as (A:A) instead of (A1:A7).
If you mean an array as in a data type, it could work differently but it depends what you're trying to do.
With data from A3 through A16, in A2 enter:
=INDEX($A$3:$A$16,MODE(MATCH($A$3:$A$16,$A$3:$A$16,0)))
This will work for text as well as numbers. Adjust this to match the column size.

SQL query in excel returning one value uses two rows or columns?

I'm running a query from excel to sql server that returns a single value. It seems as if this uses two cells in excel. Is this correct? Is there a way to prevent this? Is there a way to predict whether excel will use an extra row or an extra column (I've seen both happening)?
If, in excel, I use "import external data -> new database query", and then do a count() on an sql server table, excel usually puts the result in the cell underneath the cell which I had selected when starting to do the new query (not adding an extra row, but putting the value there). Sometimes, it will instead insert an extra column before the column of the cell I had selected and put the result in the new column in the same row as the cell I had selected.
Is there a way to have excel return the value in the same cell as the one selected when starting the query? If not, is there a way to predict which of the two scenarios above will happen?
Thanks,
Ernst
I finally found the answer:
uncheck preserve column sort/filter/layout