OpenRefine columnwise scripting - openrefine

I spent some time Googling, but couldn't find anything useful.
How can I select all the values of a single column in OpenRefine in a script?
It seems that all the operations are row-wise.
In particular, I want to find the highest and lowest values in a column.

By default, OpenRefine's functionality for computation is limited. The Stats Extension computes basic statistics per column (min, max, average, median ...).

Facets will give you a list of all the values in a column, so the simplest way of getting the lowest/highest values in the column is to create a facet on the column and read the lowest and highest values off the facet.
However I'm not sure if this meets your criteria for selecting the values 'in a script'. By this I assume you mean you want to be able to access the lowest/highest values in a GREL expression?
You can do this, but you have to force OpenRefine to treat all the rows in the project as part of a single record. The easiest way to do this is usually to add a column at the start of the project which is empty except for the first cell, which contains a value.
Once you've done this you can access all the values in a column by using syntax like:
row.record.cells["Column name"].value
See also my answer to "OpenRefine - Fill between cells but not at the end of the list", which uses the same technique.
Further explanation:
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
Syntax like row.record.cells["Column 1"].value now gives you an array of all the values in "Column 1". You can then use GREL expressions to manipulate this array, including sorting or comparing values.
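As a rough illustration of what can be done once all the values of a column are available as one array, here is a minimal sketch in plain Python, run outside OpenRefine. The function name column_min_max and the sample cells are hypothetical; the list simply stands in for the array returned by the record-mode expression above.

```python
def column_min_max(values):
    """Return (lowest, highest) of the numeric cells in a column, or None."""
    numbers = []
    for v in values:
        # Skip blank cells.
        if v is None or str(v).strip() == "":
            continue
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            # Skip cells that do not hold a number.
            continue
    if not numbers:
        return None
    return min(numbers), max(numbers)


# Hypothetical column contents, standing in for the array of cell values.
cells = ["12", "7.5", "", None, "30", "n/a"]
print(column_min_max(cells))
```

Blank and non-numeric cells are skipped rather than treated as zero, so they cannot distort the lowest value.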

A Text Facet has a nice undocumented option that gives you aggregated results for a column, which you can just copy and paste.
Click on the "X choices" in the upper left corner of the Text Facet box.
This will bring up a separate dialog that contains the values along with the count of each value in that column.
(If you're looking to just get ALL the values of a single column, use Export -> Custom Tabular Exporter, select and order the columns to export by clicking the checkboxes, then click the Download tab to choose your export format and click the Download button.)

Related

Filtering and display unique column pairs in Excel

Follow on from Excel Count unique value multiple columns
I am trying to filter and setup a table containing all the unique combinations of message types.
So with three message types as an example below, I want to create a table with all the possible flows from this.
So every time MessageA exists, it is either followed by a MessageA, MessageB, MessageC or is the last of the sequence.
And every time we see MessageC it is only followed by MessageA.
The data is on the left and the desired result is on the right.
I want this to be able to scale to multiple columns/rows.
You could do it by comparing two offset ranges, A1:D5 and B1:E5
=SUMPRODUCT(($A$1:$D$5=$G2)*($B$1:$E$5=K$1))
As you can see, I have cheated slightly by setting K1 blank so it compares correctly with column E, but this could be made part of a longer formula if it was necessary to have END as the column header for K.
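The same offset comparison can be sketched outside Excel. The following is a minimal Python illustration, not part of the original answer: the function name count_transitions, the "END" label and the sample rows are all assumptions made for the example. Each row is padded with one blank cell, which plays the role of the empty column E in the formula above.

```python
from collections import Counter


def count_transitions(rows, end_label="END"):
    """Count how often each message is directly followed by another one."""
    counts = Counter()
    for row in rows:
        # One extra blank cell, like the empty column E in the formula.
        padded = list(row) + [""]
        for current, following in zip(padded, padded[1:]):
            if current == "":
                continue  # nothing to count from a blank cell
            target = following if following != "" else end_label
            counts[(current, target)] += 1
    return counts


# Hypothetical sample data: one message sequence per row.
rows = [
    ["MessageA", "MessageB", "MessageC", "MessageA"],
    ["MessageA", "MessageA", "MessageB", ""],
    ["MessageB", "MessageA", "", ""],
]
counts = count_transitions(rows)
for (source, target), n in sorted(counts.items()):
    print(source, "->", target, n)
```

Using a Counter keyed by (from, to) pairs means the set of message types does not have to be known in advance, which helps when the table needs to scale to more columns and rows.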

Make Spotfire ignore empty values in the categories of charts and show a visualization without "spaces" between the bars

I have a group of trellis graphs on some data. Each graph has a numeric variable on the Y axis and a series of cell dishes on the X axis. Not all the numeric values are present for all the series of cells, so the visualization results in a graph with empty spaces:
This is OK most of the time, but for this series of graphs only I would like to avoid the "empty spaces" that you can see between the bars. I would like to show only the cell dishes for which I have data.
So far I have tried the following, without success. First, I created a calculated column to use as an ordering index (https://docs.tibco.com/pub/sfire-bauthor/7.9.0/doc/html/en-US/GUID-8CAA18D0-CF28-4707-9945-041BDFD99E99.html) (Sorting Filter values asc/desc on Tibco Spotfire) and then applied "Limit data by expression" with "[MyColumn] is not null" (https://community.tibco.com/questions/can-i-automatically-make-spotfire-ignore-empty-values-categories-charts) (How to show the top 10 column values in Spotfire). Second, I tried to create a custom expression (https://docs.tibco.com/pub/spotfire/6.5.1/doc/html/ncfe/ncfe_details_on_custom_expression.htm), which seemed like a good solution because, as I understand it, it would affect only these graphs and not the complete set of visualizations, but I could not get it to change anything. Last, I tried "Show/Hide Items" on the numeric column with a Boolean expression along the lines of "[Axis.Value] is not NULL" and "Apply individually for each trellis panel" enabled, which sounded promising, but it did not work either.
Any help would be appreciated. For now I am extracting the individual graphs one by one and plotting them elsewhere, but this is not very useful as a "large scale" solution. I am sure there is a way to insert a proper expression that avoids the null values in the cross of both variables, the numeric one and the cell dishes.
This is because you are trellising data, not the axis. You won't be able to filter out values on the x axis; it's simply not how trellis works.
Using multiple visualizations is the solution, but I assume you've got n sets of categories that you want to separate out without creating a ton of charts on the page, and perhaps you can't guarantee the number of categories or their names, so you want to build a flexible solution.
Please check out an answer I just wrote over here which illustrates how to use a document property and a property control to limit a visualization. Your property control can be set up to automatically and dynamically display the unique values in your "category" column (the one you are trellising by). Maybe this can be a solution for you?

OpenRefine - Fill between cells but not at the end of the list

I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
I have tried in Excel to test a few cells below and, if they are not empty, do the fill. The problem is that due to the inconsistency of the gaps, it's a tedious task to change the function for all the cases.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.
I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine in Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
For each column you are going to work with do a cell transform to put a dummy value in empty cells - e.g. if(isBlank(value),"null",value)
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using the row.index as the marker for the current row.
I used the following formula to add a new column to the project:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
Change back to 'Row' mode
Remove the 'null' placeholder from the original column
Create a facet on the 'fill filter' column
In my case I filter to 'a'
Use the 'fill down' option
Remove the filter
And remove the 'record' column
Rather a long-winded way of doing it, to say the least, but so far I've not been able to find anything better while not going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or a smaller number of steps.
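For comparison, the intended result (fill a blank with the last known value only when a later value still exists) can be sketched in a few lines of plain Python outside OpenRefine. This is not the GREL recipe above, just an illustration of the same goal; the function name fill_internal_gaps and the sample prices are hypothetical.

```python
def fill_internal_gaps(values):
    """Forward-fill blanks, but leave blanks after the last real value alone."""
    # Find the position of the last non-blank cell: this is the "end of list".
    last_filled = -1
    for i, v in enumerate(values):
        if v not in (None, ""):
            last_filled = i

    result = []
    previous = None
    for i, v in enumerate(values):
        blank = v in (None, "")
        if blank and i < last_filled and previous is not None:
            result.append(previous)  # gap inside the list: fill it
        else:
            result.append(v)         # real value, leading blank or trailing blank
            if not blank:
                previous = v
    return result


# Hypothetical price column with gaps in the middle and at the end.
prices = [10.0, "", "", 12.5, "", 13.0, "", ""]
print(fill_internal_gaps(prices))
```

The first pass finds where the data ends, so the second pass can tell a gap in the middle apart from the empty tail without testing a fixed number of cells below each blank.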
If you want to access the array of cell values using Jython as suggested by iMitwe you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(step 5)
I am doing this off the top of my head, but I think your best chance may be using the fill down option in record mode:
First move your column to the first column and switch to record mode.
Then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column.
The [-1] will take the last value and fill the blank. For the case with the red dot, since there is no value it should remain empty. Let us know how it goes.
Unless there's something I am missing or not seeing...
I would have just sorted reverse (date ascending) on the Date column, then individually use Fill Down on each column, except for that last column where you could then use a Date facet on your column Date to specify the exact Date range you wanted to work with, then fill down on that last column, then remove the Date range facet.

Need a simple search function to display most common value in a column. (with ambiguous choices)

I have a very large array of data with many columns that display different outputs for the values presented. I would like to add a row above the data that will display the most commonly occurring value or word in the column below.
Generally I would like to have each top of the column (right under the column label in row 1) have the most common value below. I will then use this value for various data analysis functions!
Is this possible, and if so, how? Preferably this will not require VBA, but simply a short code in the cell.
One caveat: The exact values may vary, so there is no set list where I can say "it will be one of these."
Any ideas appreciated!
Try a series of =COUNTIF(A:A,"VALUE TO SEARCH") functions if you want to stay away from VBA.
Otherwise, the best method would be to iterate through each column via VBA. With this method, you can even count the "varying" values and return the count and/or the value itself.
http://www.excel-easy.com/examples/most-frequently-occurring-word.html
This is a single formula you would write at the top of each column. It does not require VBA. You can replace the set range with an entire column, such as (A:A) instead of (A1:A7).
If you mean an array as in a data type, it could work differently but it depends what you're trying to do.
With data from A3 through A16, in A2 enter:
=INDEX($A$3:$A$16,MODE(MATCH($A$3:$A$16,$A$3:$A$16,0)))
This will work for text as well as numbers. Adjust this to match the column size.
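For readers who want to check the result outside Excel, the same idea (pick the value that occurs most often, with no fixed list of candidates) can be sketched in Python. The function name most_common_value and the sample column are hypothetical, and the comparison here is case-sensitive and skips blank cells, which is an assumption of this sketch rather than something taken from the formula.

```python
def most_common_value(column):
    """Return the value that occurs most often in the column, or None."""
    counts = {}
    for v in column:
        if v is None or v == "":
            continue  # ignore blank cells
        counts[v] = counts.get(v, 0) + 1
    if not counts:
        return None
    # On a tie, the value that appeared first in the column wins,
    # because dicts keep insertion order (Python 3.7+).
    return max(counts, key=counts.get)


# Hypothetical column contents below the header row.
column = ["red", "blue", "red", "green", "blue", "red", ""]
print(most_common_value(column))
```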

Use columns.add(...) in Word with non-uniform column widths?

The problem I'm having is that table.Columns.Add(ref Object BeforeColumn) requires a reference to another column in the table. However, when I try to access the last column in the table to pass as a reference using table.Columns.Add(table.Columns[table.Columns.Count])
I get the error:
"Cannot access individual columns in this collection because the table has mixed cell widths."
As my current workaround, I catch the error and call table.Columns.DistributeWidth() to make sure the columns are uniform, then run the rest of the code. However, I lose the formatting of my cell widths this way, which is unfortunate.
Is there any way I can work around this without losing the cell widths?
(I realize one way is to store every cell's width before running this process, and then re-applying the widths afterward, but this seems like a very costly solution to something that should be simpler)
I've found one way to do it. Here's how I approached it.
*Caution: I'm assuming that the table is uniform, i.e. the number of columns is the same across all the rows. (Note: the API has a Table.Uniform property, but its description is not complete. The API says "True if all the rows in a table have the same number of columns." However, it also checks whether the columns have uniform width.)
Instead of using table.Columns.Add(table.Columns[table.Columns.Count]) to add a column before the last one, I select a cell in the table and use the insert command:
//assuming table is the name of the table you want to add columns to
table.Cell(1, table.Columns.Count).Select();
word.Selection selection = table.Application.ActiveWindow.Selection;
selection.InsertColumns();
This might actually be a better way to add columns, as the api gives you way more options on how to insert (i.e. use InsertColumnsRight to insert to the right of the column). The Columns.Add(...) function by default inserts to the left of the select