OpenRefine - Fill between cells but not at the end of the list

I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
In Excel I tried testing a few cells below each gap and doing the fill only if one of them is not empty. The problem is that, because the gaps are inconsistent, it's tedious to change the function for all the cases.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.

I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine into Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
1. For each column you are going to work with, do a cell transform to put a dummy value in empty cells, e.g. if(isBlank(value),"null",value)
2. Create a new column at the start of your project and put a single value in the very first cell of that column
3. Switch to Record mode
At this point you should have a single 'Record' in your project.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using row.index as the marker for the current row.
4. Add a new column to the project (called 'fill filter' below). I used the following formula, which is intended to mark a row "a" when the values from that row to the end of the column sum to more than zero (so a real value still follows) and "b" otherwise:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
5. Change back to 'Row' mode
6. Remove the 'null' placeholder from the original column
7. Create a facet on the 'fill filter' column
8. Filter to the rows that should be filled (in my case 'a')
9. Use the 'fill down' option
10. Remove the filter
11. Remove the 'record' column added in step 2
Rather a long-winded way of doing it, to say the least, but so far I've not been able to find anything better without going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or a smaller number of steps.
If you want to access the array of cell values using Jython, as suggested by iMitwe, you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(in step 4, where the new column is added)
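Since a different tool may be easier here, the rule itself (fill gaps with the last known value, but leave the blanks after the final known value alone) can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the function name fill_interior_gaps and the sample prices are made up for illustration, and blanks are represented as None.

```python
from typing import List, Optional


def fill_interior_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Forward-fill blanks, but leave blanks after the last known value untouched."""
    # Index of the last non-blank cell; nothing after it is filled.
    last_known = max((i for i, v in enumerate(values) if v is not None), default=-1)
    filled = list(values)
    previous = None
    for i in range(last_known + 1):
        if filled[i] is None:
            filled[i] = previous  # stays None if the column starts with blanks
        else:
            previous = filled[i]
    return filled


# Hypothetical sample column: two interior gaps and two trailing blanks.
prices = [10.0, None, None, 11.5, None, 12.0, None, None]
print(fill_interior_gaps(prices))
# prints [10.0, 10.0, 10.0, 11.5, 11.5, 12.0, None, None]
```

Finding the last known value first plays the same role as the 'fill filter' column: it separates the rows that still have a value after them from the trailing rows.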

I am doing this off the top of my head, but I think your best chance may be to use the fill down option in record mode:
First move your column to the first position and switch to record mode.
Then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column.
The [-1] will take the last value and fill the blank. For the case with the red dots, since there is no value, it should remain empty. Let us know how it goes.

Unless there's something I am missing or not seeing...
I would have just sorted in reverse (date ascending) on the Date column and then used Fill Down on each column individually. The exception is that last column: there you could use a Date facet on the Date column to specify the exact date range you want to work with, fill down on that last column, and then remove the Date facet.

Related

OpenRefine columnwise scripting

I spent some time Googling, but couldn't find anything useful.
How to select all the values of a single column in OpenRefine in a script?
It seems that all the operations are row-wise
In particular, I want to find highest and lowest values in a column
By default, OpenRefine's functionality for computation is limited. The Stats Extension computes basic statistics per column (min, max, average, median ...).
Facets will give you a list of all the values in a column - so the simplest way of getting the lowest/highest values in the column is to make a facet on the column and see the resulting highest/lowest in the facet to get the answer.
However I'm not sure if this meets your criteria for selecting the values 'in a script'. By this I assume you mean you want to be able to access the lowest/highest values in a GREL expression?
You can do this, but you have to force OpenRefine to treat all the rows in the project as part of a single record. The easiest way to do this is usually to add a column at the start of the project which is empty except for the first cell, which contains a value.
Once you've done this you can access all the values in a column by using syntax like:
row.record.cells["Column name"].value
See also my answer to "OpenRefine - Fill between cells but not at the end of the list", which uses the same technique.
Further explanation:
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
Using syntax like row.record.cells["Column 1"].value now gives you an array of all the values in "Column 1". You can then use GREL expressions to manipulate this, including sorting or comparing values.
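To make the idea concrete outside OpenRefine, here is a small Python sketch of what "collect the non-blank values of a column, then take the lowest and highest" amounts to. It is not OpenRefine code: the function name column_extremes, the list-of-dicts row layout, and the sample data are assumptions made for this illustration.

```python
def column_extremes(rows, column):
    """Return (lowest, highest) of the non-blank values in one column, or None."""
    # Like the record-mode array, this skips blank cells entirely.
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        return None
    return min(values), max(values)


# Hypothetical project with one blank cell in "Column 1".
rows = [
    {"Column 1": 10.0},
    {"Column 1": None},
    {"Column 1": 3.5},
    {"Column 1": 12.0},
]
print(column_extremes(rows, "Column 1"))  # prints (3.5, 12.0)
```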
A Text Facet has a nice undocumented option that gives you aggregated results for a column that you can just copy and paste.
Click on the "X choices" in the upper left corner of the Text Facet box.
This will bring up a separate dialog that contains the values along with the count of each value in that column.
(If you're looking to just get ALL the values of a single column, then use Export -> Custom Tabular Exporter, tick the checkboxes under Select and Order Columns to Export, click on the Download tab to choose your export format, and then click the Download button.)

Need a simple search function to display most common value in a column. (with ambiguous choices)

I have a very large array of data with many columns that display different outputs for the values presented. I would like to add a row above the data that will display the most common occurring value or word below.
Generally I would like to have each top of the column (right under the column label in row 1) have the most common value below. I will then use this value for various data analysis functions!
Is this possible, and if so, how? Preferably this will not require VBA, but simply a short code in the cell.
One caveat: The exact values may vary, so there is no set list where I can say "it will be one of these."
Any ideas appreciated!
Try a series of =COUNTIF(A:A,"VALUE TO SEARCH") functions if you want to stay away from VBA.
Otherwise, the best method would be to iterate through each column via VBA. With this method, you can even count the "varying" values and return the count and/or the value itself.
http://www.excel-easy.com/examples/most-frequently-occurring-word.html
This is a single formula you would write at the top of each column. Does not require VBA. You can replace the set range to an entire column, such as (A:A) instead of (A1:A7).
If you mean an array as in a data type, it could work differently but it depends what you're trying to do.
With data from A3 through A16, in A2 enter:
=INDEX($A$3:$A$16,MODE(MATCH($A$3:$A$16,$A$3:$A$16,0)))
This will work for text as well as numbers. Adjust this to match the column size.
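The formula works in three stages: MATCH turns every cell into the position of its first occurrence, MODE picks the most frequent position, and INDEX converts that position back into a value. A rough Python sketch of the same stages follows; the function name most_common_value and the sample data are invented for illustration, and the way ties are broken here (first value encountered wins) is an assumption of the sketch rather than a statement about Excel.

```python
from collections import Counter


def most_common_value(cells):
    """Mimic INDEX(range, MODE(MATCH(range, range, 0))) on a list of cells."""
    # MATCH(range, range, 0): 1-based position of each cell's first occurrence.
    first_positions = [cells.index(cell) + 1 for cell in cells]
    # MODE(...): the position that occurs most often.
    mode_position = Counter(first_positions).most_common(1)[0][0]
    # INDEX(range, position): turn the position back into the value.
    return cells[mode_position - 1]


# Hypothetical column mixing text and numbers.
column = ["AMP", "x", "AMP", 3, 3, "AMP"]
print(most_common_value(column))  # prints AMP
```

Working on positions rather than on the values themselves is what lets the spreadsheet formula handle text, because MODE only accepts numbers.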

Adding a row (not a column) of a type (checkbox,dropdown) in datagridview

I have a datagridview that is only two or three rows long. It has 7 text columns, one for each day of the week (Monday - Sunday). I'm creating a scheduler, so basically on the left side I have added text to the row headers to assign to it. I.e. Enabled (let's say for Tuesday), start time and end time. This allows the user to schedule as need be.
Here's a picture of it right now:
What I want to do is possibly change the enabled row, or the start/end time to a particular type. So the enabled will be a checkbox and the start/end times will be drop down menus instead of these text boxes.
My question is, what's the "best" way to add a row of a certain type? Obviously columns are done easily, but is there a common method for a row type other than looping through and adding individual cells of that type to the datagridview?
The type of each cell can only be pre-determined by the column, not the row. As a result, you're going to have to add each cell individually. You can actually put a cell of any type anywhere you want. You simply create a cell of the desired type and assign it to the Item property of the grid, e.g.
myDataGridView(columnIndex, rowIndex) = newCell
You will simply have to use a For loop to do that for each valid column index with a single row index. Note that you'll have to create a new cell for each column, not reuse the same one.

Sorting by cell value or color in multiple cells

I am trying to sort data based upon either color (RGB(186,200,8)) or value ("AMP") within a cell. That part is easy, but the problem comes when I want to look for the same value/color in multiple columns (it can occur up to four times) and put the rows with all four equal to the value at the top, then the rows with three matches, and so on down to no match.
I'm not sure how to go about it. I think a For loop and/or If/Else would work, but I can't come up with one. Any suggestions?
My suggestion would be to calculate a "hit" value for each row; based on that you can do the sort easily. For instance, if a row has two matches, the hit value for that row is 2, and so on. After every row has been evaluated, just sort by the hit value and then clean up the helper data.
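The hit-count idea can be sketched as follows. This is a Python illustration of the logic only, not Excel or VBA: the name sort_by_hits and the sample rows are made up, and it matches on cell value only, since cell color has no equivalent in plain lists.

```python
def sort_by_hits(rows, target="AMP"):
    """Order rows by how many cells equal target, most matches first."""

    def hits(row):
        # The "hit" value: number of cells in the row that match the target.
        return sum(1 for cell in row if cell == target)

    # sorted() is stable, so rows with the same hit count keep their order.
    return sorted(rows, key=hits, reverse=True)


# Hypothetical data: four columns that may each contain "AMP".
rows = [
    ["AMP", "x", "y", "z"],
    ["AMP", "AMP", "AMP", "AMP"],
    ["a", "b", "c", "d"],
    ["AMP", "AMP", "AMP", "q"],
]
for row in sort_by_hits(rows):
    print(row)
```

In a worksheet the same thing would be a helper column holding the count, a sort on that column, and then deleting the helper column.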

Excel - How do I find all relevant rows by typing unique invoice# listed Col A

I have a Worksheet with 10 columns and data range from A1:J55. Col A has the invoice # and rest of the columns have other demographic data. Goal is to type the invoice number on a cell and display all the rows matching the invoice number from col A.
Besides the AutoFilter function, the only thing that comes to mind is VBA. Please advise on the best way to get the data. Thanks in advance for your help.
Alright, I'm pretty proud of this one. Again avoiding VBA, this one uses the volatile formula OFFSET to keep moving its VLOOKUP search down the table until it's found all matches. Just make sure you paste enough rows of the formula that if there are many matches, there's room for all of them to appear. If you put a border around your match area then it would be clear if you ever ran out of room and needed to copy down the formula some more.
Again, in the main section, it's just a single formula (using index):
=IFERROR(INDEX($A$1:$J$200,$M3,MATCH(N$2,$A$1:$J$1,0)),"")
This gets to be so simple because the hard work of the lookup is done by an initial column which looks up the next row that matches the invoice number. It has the formula:
=IFERROR(MATCH($L$2,OFFSET($A$1:$A$200,M2,0),0)+M2," ")
Here is the working example that goes with those formulas:
Let me know if you need any further description of how it works, but it mostly uses the same rules as above so that it's robust in copying and moving around.
I've uploaded the Excel file so you can play with it, but everything you need to reproduce this feature should be in this solution.
Google Docs - Click link and hit Ctrl+S to download and open in Excel.
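The helper column is the part that does the real work, so here is a rough Python sketch of its logic: each search starts just below the previous match, and the relative position it finds is added to the previous row number. The function name matching_rows and the sample column are invented for illustration, and starting the chain from 0 is an assumption about the seed value in the helper column.

```python
def matching_rows(column_a, invoice):
    """Return the 1-based row numbers in column_a that hold invoice."""
    rows = []
    previous = 0  # assumed seed for the first helper cell
    while True:
        try:
            # MATCH(invoice, OFFSET(range, previous, 0), 0): 1-based position
            # within the part of the column below the previous match.
            relative = column_a[previous:].index(invoice) + 1
        except ValueError:
            break  # no further match; the formula's IFERROR branch
        previous += relative  # "+M2" turns it back into a sheet row number
        rows.append(previous)
    return rows


# Hypothetical column A: a header followed by invoice numbers.
column_a = ["Invoice", 101, 102, 101, 103, 101]
print(matching_rows(column_a, 101))  # prints [2, 4, 6]
```

Each row number found this way is what the INDEX formula in the main section then uses to pull the other columns.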
A popular solution to this problem is a simple VLOOKUP. Look up the invoice the user types in on the table A1:J55, and then return an adjacent column's data.
Here's an example of it working:
The formula in the highlighted cell is:
=VLOOKUP($L3,$A:$J,MATCH(N$2,$1:$1,0),FALSE)
What's nice about this formula is you only need to type it once and then you can copy it across and it'll automatically pick out the correct column of the table (that's the match part). The rest is very simple:
The first part says lookup value $L3 (the invoice number typed in),
The second part says look it up in range $A:$J (which is where your table is located). I've shown how you can select the entire columns $A:$J so that you can add and remove data without worrying about adjusting the range in your lookups. (Excel takes care of optimizing the formula so that unused cells aren't checked)
The third part picks the column from which the resulting data will be drawn once a matching row is found.
The FALSE part is an indication that the invoice number must match exactly (no approximate matching allowed)
The $ signs ensure that fixed ranges like the location of your source table ($A:$J) and your lookup value ($L3) don't get automatically changed as you copy the formula across for multiple columns.
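For readers who think more easily in code, the parts of the formula map onto a short Python sketch like the one below. It only illustrates the logic: the name lookup, the header list, and the sample table are assumptions, and returning None stands in for the error value a failed lookup would show in a worksheet.

```python
def lookup(table, header, invoice, column_name):
    """Return the column_name cell of the first row whose first cell is invoice."""
    # MATCH(N$2,$1:$1,0): find which column holds the requested field.
    column_index = header.index(column_name)
    for row in table:
        # FALSE in VLOOKUP: only an exact match on the first column counts.
        if row[0] == invoice:
            return row[column_index]
    return None  # no matching invoice


# Hypothetical table with the invoice number in the first column.
header = ["Invoice", "Name", "City"]
table = [
    [101, "Ann", "Oslo"],
    [102, "Bob", "Rome"],
]
print(lookup(table, header, 102, "City"))  # prints Rome
```

Like VLOOKUP, this stops at the first matching row, which is why it fits a unique invoice number but not an invoice that appears on several rows.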
The formula is pretty easy to adapt if you want to move around your table and the area where you do your lookup. Here's an example:
Bonus
If you want to add a little spiff, you can add a dropdown to the Invoice # field so that the user gets auto-completion and the option to browse existing values like so: