I am new to Pentaho, but I need to remove the newest row from each set of duplicates. I sorted on the column that I want to check for duplicates and on the date of the row (ascending), and then removed the duplicates using the Unique rows step. This works like a charm, but it does not seem to take the dates into account.
How can I remove the rows that are newer and have a duplicate number?
The way I did it worked, but the date field that I wanted to sort on was still a String (because I had to extract it as a string from the Excel file). I used the str2date function to convert it into a real date and that solved the problem, thanks!
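The reason the string sort misbehaved can be illustrated outside Pentaho. The following is a minimal Python sketch, not the Pentaho transformation itself: the field names `number` and `date` and the `%d-%m-%Y` format are assumptions made for the example. It parses each date string before comparing, then keeps the oldest row per number.

```python
from datetime import datetime

def keep_oldest(rows, key_field="number", date_field="date", date_format="%d-%m-%Y"):
    """Return one row per key, keeping the row with the earliest parsed date."""
    oldest = {}
    for row in rows:
        # Parse the string into a real date; comparing the raw strings
        # would order "02-10-2012" before "15-09-2012".
        parsed = datetime.strptime(row[date_field], date_format)
        key = row[key_field]
        if key not in oldest or parsed < oldest[key][0]:
            oldest[key] = (parsed, row)
    return [row for _, row in oldest.values()]

# Hypothetical sample data: "A1" appears twice.
rows = [
    {"number": "A1", "date": "02-10-2012"},
    {"number": "A1", "date": "15-09-2012"},
    {"number": "B2", "date": "01-01-2012"},
]
result = keep_oldest(rows)
```

With the sample rows, the "A1" row dated 15-09-2012 survives, even though its date string sorts after "02-10-2012" when compared as text.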
I have received help with splitting a column that contains a number and a letter.
The SQL script works perfectly: it runs to completion with no errors.
But the columns themselves don't get filled.
I have tried creating the columns in advance, as text or as integer, but they still don't get filled. The SQL query itself turns out OK, but in reality the columns stay empty. What is wrong?
Your question is not completely clear, but it sounds like what you are trying to do is take a value from one column of a table, split it and use the result to update two other columns in the same table.
If that is the case, you would want to be using the SQL UPDATE command instead of SELECT.
UPDATE d1_plz_whatever
SET nr = SUBSTRING(hn FROM '^[0-9]+'),
    zusatz = SUBSTRING(hn FROM '[a-zA-Z]+$');
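To see what the two patterns in the UPDATE extract, here is a small Python sketch using the same regular expressions. The function name `split_house_number` and the sample values are assumptions for illustration only; it is not part of the SQL solution.

```python
import re

def split_house_number(hn):
    """Split a value like '12a' into its leading digits and trailing letters.

    Returns (number, suffix); either part is None when the pattern does not match.
    """
    number = re.search(r"^[0-9]+", hn)      # same pattern as for the nr column
    suffix = re.search(r"[a-zA-Z]+$", hn)   # same pattern as for the zusatz column
    return (
        number.group(0) if number else None,
        suffix.group(0) if suffix else None,
    )

with_letter = split_house_number("12a")
without_letter = split_house_number("7")
```

A value without a trailing letter yields None for the suffix in this sketch.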
I have to admit that I am fairly new to Excel and this community.
However, I did try to write a macro that deletes rows in a specific range if a value below matches one above (e.g. in A1:A100 if it matches a value listed in A101:A200), because the "delete duplicates" tool doesn't seem to work.
Maybe you guys can give me a good answer or macro code that can perform this kind of action.
greetings, valerius21
OK, your 1st comment substantively changes the question. It's not that Remove Duplicates doesn't work - you're not actually trying to remove duplicate values - you're trying to remove items from one list where the Client ID is in the other list.
In a new column of the list you want to delete FROM, use the VLookup function to find matching values from the other list. Then, use the Filter feature to find all of the matching value rows (the ones whose value you were able to successfully look up) and delete those.
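The lookup-then-filter approach amounts to removing every entry of one list whose ID also appears in the other. A minimal Python sketch of that logic follows; the function name and the sample client IDs are hypothetical.

```python
def remove_matching(delete_from, other_list):
    """Keep only the IDs in delete_from that do not appear in other_list."""
    blocked = set(other_list)  # plays the role of the VLOOKUP table
    return [client_id for client_id in delete_from if client_id not in blocked]

# Hypothetical client IDs.
main_list = ["C001", "C002", "C003", "C004"]
exclusions = ["C002", "C004", "C999"]
remaining = remove_matching(main_list, exclusions)
```

IDs that exist only in the second list (here "C999") have no effect, just as a failed lookup would leave a row untouched.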
I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
I have tried in Excel to test a few cells below and, if one of them is not empty, do the fill. The problem is that due to the inconsistency of the gaps, it's a tedious task to change the function for all the cases.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.
I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine into Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
For each column you are going to work with do a cell transform to put a dummy value in empty cells - e.g. if(isBlank(value),"null",value)
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using the row.index as the marker for the current row.
I used the following formula to add a new column to the project:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
Change back to 'Row' mode
Remove the 'null' placeholder from the original column
Create a facet on the 'fill filter' column
In my case I filter to 'a'
Use the 'fill down' option
Remove the filter
And remove the 'record' column
Rather a long-winded way of doing it, to say the least, but so far I've not been able to find anything better without going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or a smaller number of steps.
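For anyone willing to step outside OpenRefine, the same idea (fill a blank only while a real value still exists further down the column) can be sketched in a few lines of Python. This is an illustration of the logic rather than a translation of the GREL above; blanks are represented as None and the sample prices are made up.

```python
def fill_interior_gaps(values):
    """Fill blanks with the last known value, leaving trailing blanks empty."""
    # Index of the last non-blank cell; blanks after it are "end of list".
    last_known = max((i for i, v in enumerate(values) if v is not None), default=-1)
    filled = []
    previous = None
    for i, value in enumerate(values):
        if value is None and i < last_known:
            value = previous  # interior gap: carry the last known value down
        if value is not None:
            previous = value
        filled.append(value)
    return filled

# Hypothetical price column with two interior gaps and a trailing gap.
prices = [10, None, None, 12, None, 13, None, None]
filled_prices = fill_interior_gaps(prices)
```

The two trailing blanks stay empty because no later value exists, which corresponds to the red dots in the question.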
If you want to access the array of cell values using Jython as suggested by iMitwe you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(step 5)
I am doing this off the top of my head, but I think your best chance may be using the fill down option in record mode:
First move your column to the first column and switch to record mode.
Then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column.
The [-1] will take the last value and fill the blank. For the case with the red dot, since there is no value it should remain empty. Let us know how it goes.
Unless there's something I am missing or not seeing...
I would have just sorted reverse (date ascending) on the Date column, then individually use Fill Down on each column, except for that last column where you could then use a Date facet on your column Date to specify the exact Date range you wanted to work with, then fill down on that last column, then remove the Date range facet.
I have an Excel sheet with various data entries that are in date order going down the page, with the dates in column A. I need a formula that will take a text string from an adjacent cell, then look back up a neighbouring column for the most recent match then return the date from column A.
Currently I have this formula in cell H100: =LOOKUP(G100,E100:E$5,A100:A$5).
I want it to look for the text in G100 in column E, going backwards to find the most recent example and then return the corresponding date from column A but despite the LOOKUP command being in reverse it always returns the first example in date order, not the most recent.
I would really appreciate some help from an expert, which I am not!
I am not certain I understand the question, but try
=OFFSET($A$1, 1+MATCH(G100, E$5:E100, 0), 0, 1, 1)
this should catch the first (higher in the sheet) instance of the lookup match.
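For comparison, the behaviour the question asks for (the most recent match rather than the first) means scanning upwards from the current row. A minimal Python sketch of that scan is below; the lists stand in for columns A and E, the current row is included in the scan as in the question's range, and all sample values are hypothetical.

```python
def most_recent_match(dates, labels, target, current_index):
    """Scan upwards from current_index and return the date of the nearest match."""
    for i in range(current_index, -1, -1):
        if labels[i] == target:
            return dates[i]
    return None  # no match above the current row

# Hypothetical stand-ins for column A (dates) and column E (text).
dates = ["2013-01-05", "2013-01-09", "2013-02-01", "2013-02-14"]
labels = ["pump", "valve", "pump", "motor"]
latest_pump = most_recent_match(dates, labels, "pump", 3)
```

Here "pump" occurs twice above the current row, and the scan returns the later of the two dates.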
I would bring it into an Access database where that kind of data manipulation is easy.
But then, I know how to use Access to do those kinds of things and I think it's much harder to do in Excel.
Basically, I have a column of data, and I want the entries that start with "EMUA-I" to be placed at the front, in ascending order of date. Then I want the "non EMUA-I" entries to be placed at the back, also in ascending order of date.
Please take a look at this reference file:
http://www.speedyshare.com/files/23397356/1.xls
I need a VBA script to perform the job, as this document will need future updates.
Thanks
Without knowing the exact structure of your spreadsheet (No viruses for me, thank you) it seems like you could accomplish this by creating a new column with a formula to extract the prefix, then sort by the new column and ascending date.
Assuming your spreadsheet looks something like this:
Part           Date
-------------  -----
non EMUA-I321  1-Jun
EMUA-I123      2-Jun
EMUA-I546      1-Mar
non EMUA-I789  1-May
Add a Type column with the formula =IF(LEFT([YourPartNoCellHere],6) = "EMUA-I", 0, 1).
Then sort by this column and then by your date column.
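The helper-column idea is a two-level sort: first on a 0/1 flag for the prefix, then on the date. A short Python sketch of the same ordering follows, using the sample rows above; the year 2011 is an assumption added only so the dates can be compared.

```python
from datetime import date

def sort_parts(rows, prefix="EMUA-I"):
    """Sort (part, date) rows: prefixed parts first, each group by ascending date."""
    return sorted(
        rows,
        # The first key element mirrors the Type column: 0 for EMUA-I parts, 1 otherwise.
        key=lambda row: (0 if row[0].startswith(prefix) else 1, row[1]),
    )

# Sample rows from the table above; the year is hypothetical.
rows = [
    ("non EMUA-I321", date(2011, 6, 1)),
    ("EMUA-I123", date(2011, 6, 2)),
    ("EMUA-I546", date(2011, 3, 1)),
    ("non EMUA-I789", date(2011, 5, 1)),
]
ordered = sort_parts(rows)
```

The result lists EMUA-I546 and EMUA-I123 first, followed by non EMUA-I789 and non EMUA-I321.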