OpenRefine: create ranges for numeric values

For a current project, I have a column in OpenRefine that represents a participant's age.
I want to group the ages into ranges so I can work with the ranges as nominal values.
E.g. age < 18, age between 18 and 30, etc.
How do I do that in OpenRefine?
Thank you very much.

It depends on what you are trying to do. If you just want a quick interactive analysis, you can either use numeric facets set to the appropriate ranges, or create custom text facets with expressions like value < 18 or and(value >= 18, value < 30), each of which evaluates to the label true or false.
If you want permanent labels for the ranges, set up the facets as above, select one range at a time, then add a column (or change an existing column) to hold a tag like "under18" or whatever you want to call it.
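As a rough sketch of the labelling logic itself, here is the same binning written in plain Python. The cut-offs and the label names ("under18", "18-29", "30plus") are only assumptions for illustration; they are not from the question. OpenRefine also accepts Python (Jython) expressions, so something along these lines could be adapted to a column transform, but treat this as a sketch rather than a tested OpenRefine recipe.

```python
def age_range(age):
    """Map a numeric age to a nominal range label (labels and cut-offs are illustrative)."""
    if age is None:
        return None  # leave blank cells unlabelled
    if age < 18:
        return "under18"
    if age < 30:
        return "18-29"
    return "30plus"


# Example column of ages, including one blank cell.
ages = [12, 18, 29, 30, 45, None]
labels = [age_range(a) for a in ages]
print(labels)
```

Using half-open ranges (18 <= age < 30) avoids the question of which bin a boundary value such as 30 belongs to.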

Related

Is there a way to use variables in vba to identify MS-Access report fields?

I am not a programmer, but have been tasked with doing this anyway! We are working on a research project that involves testing properties of different samples. I am trying to create a form that will generate a custom report based on what the user chooses. So, I have multiple text boxes and check boxes to allow the user to define the query parameters (e.g. composition of the sample must contain at least 5% component A) and choose what data they are interested in seeing in said report (e.g. show pH, color, but not melting point). I have successfully created code to generate the query, then generate a report based on that query, but the report defaults to column widths that are generally too big (for example, the pH column width is 3 inches, it only needs to be about 1). I would like to be able to fix this, but have not been able to figure out how. At the same time, some of these fields contain numbers that are averages of multiple test results, so I would like to limit the number of digits shown, and display them as % where appropriate. I started with just fixing the column width issue:
I have tried to make a collection of the fields that are included, then loop through the collection and set column widths, but cannot figure out how to identify a field with a variable:
If I know the field name I can do this:
Reports("ReportName")!FieldID.Width = 200
But if I have a collection of names, FieldNames, or a string VariableName, none of these work, giving me an error that FieldNames or VariableName is not a valid field in the report:
Reports("ReportName")!FieldNames(1).Width = 200
Reports("ReportName")![FieldNames(1)].Width = 200
Reports("ReportName")![VariableName].Width = 200
Is there a way to reference a field name with a variable?
Alternatively, I thought there might be a way to loop through all fields and set widths - this would involve looking up a column width for each field, which I thought to do by adding a key to a collection of column widths. But I cannot find a way to do that, something like:
For each Field in Reports("Report")
Field.Width = ColumnWidthCollection(Field)
Next
This hangs up on the Field.Width line, with "invalid procedure call or argument", which brings me back to how to reference a field name with a variable.
Any help would be greatly appreciated!
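The bang operator (!) only accepts a literal control name. To use a string variable, index the report with parentheses instead: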
Reports("ReportName")(VariableName).Width = 200

Redaction in Tableau

I am currently in the process of building some Tableau workbooks where we will need to redact visualizations or text tables if the results fall below a certain threshold (e.g. only ten data points are returned after filters are applied). Does anyone know how to create calculated fields or know of other methods to redact in Tableau?
You can create a threshold filter that compares the number of filtered responses to a threshold value set in a parameter.
First, create a parameter with integer data type and set it to the desired threshold. In this example, I called it Count Threshold.
Then create a calculated field for the filter with an equation like the following:
{FIXED: COUNTD([Respondent ID]) >= [Count Threshold]}
(I did this for survey results where we needed to hide results if the filtered number of respondents was fewer than 10.)
For the threshold filter to be applied after your other filters, choose "Add to Context" for your other filters.
I found a partial solution on the Tableau community forum/knowledgebase about redaction that might work for other implementations.
The basic idea is to create two different calculated fields, one that displays an integer value and one that displays a string value. That way, when both are concatenated in the display you get the desired output without breaking any of the calculated field rules.
So create a calculated field that has a formula like:
IF sum([Datafield_to_Redact]) < 10 THEN "*" ELSE str(sum([Datafield_to_Redact])) END
And another calculated field with a formula like:
IF sum([Datafield_to_Redact]) < 10 THEN null ELSE sum([Datafield_to_Redact]) END
In the post the attached workbook and screenshot show how the two values are concatenated in the Text mark.
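To make the two-field idea concrete outside Tableau, here is a small Python sketch of the same logic. The function name redact and the default threshold of 10 are assumptions for illustration. It returns the string field and the numeric field as a pair, with None standing in for Tableau's null.

```python
def redact(total, threshold=10):
    """Return (text_label, numeric_value) for one aggregated total."""
    if total < threshold:
        # Below the threshold: show a marker and hide the number.
        return "*", None
    # At or above the threshold: show the number in both fields.
    return str(total), total


for total in (3, 10, 42):
    print(redact(total))
```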

OpenRefine - Fill between cells but not at the end of the list

I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
I have tried in Excel to test a few cells below and, if they are not empty, do the fill. The problem is that because the gaps are inconsistent, changing the function to cover all the cases is tedious.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.
I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine into Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
For each column you are going to work with do a cell transform to put a dummy value in empty cells - e.g. if(isBlank(value),"null",value)
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using the row.index as the marker for the current row.
I used the following formula to add a new column to the project:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
Change back to 'Row' mode
Remove the 'null' placeholder from the original column
Create a facet on the 'fill filter' column
In my case I filter to 'a'
Use the 'fill down' option
Remove the filter
And remove the 'record' column
A rather long-winded way of doing it, to say the least, but so far I've not been able to find anything better without going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or a smaller number of steps.
If you want to access the array of cell values using Jython as suggested by iMitwe you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(step 5)
I am doing this off the top of my head, but I think your best chance may be using the fill down option in record mode:
First move your column to the first column and switch to record mode.
Then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column
The [-1] will take the last value and fill the blank. For the case with the red dot, since there is no value it should remain empty. Let us know how it goes.
Unless there's something I am missing or not seeing...
I would have just sorted reverse (date ascending) on the Date column, then individually use Fill Down on each column, except for that last column where you could then use a Date facet on your column Date to specify the exact Date range you wanted to work with, then fill down on that last column, then remove the Date range facet.
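Since one of the answers suggests a different tool might be easier, here is a minimal Python sketch of the rule described in the question: fill a gap with the last known value, but leave blanks after the final real value alone. The function name and the use of None for an empty cell are assumptions for illustration.

```python
def fill_interior_gaps(values):
    """Forward-fill None gaps, but leave trailing Nones (after the last real value) untouched."""
    # Index of the last non-blank entry; everything after it is the "end of the list".
    last_known = max((i for i, v in enumerate(values) if v is not None), default=-1)
    filled = []
    previous = None
    for i, v in enumerate(values):
        if v is None and i < last_known:
            filled.append(previous)  # interior gap: carry the last known value forward
        else:
            filled.append(v)  # real value, or a blank at the end of the list
        if v is not None:
            previous = v
    return filled


prices = [10.0, None, None, 10.5, None, 11.0, None, None]
print(fill_interior_gaps(prices))
```

Finding the last non-blank position first is what makes "end of the list" detectable, so the length of each gap does not matter.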

Need a simple search function to display most common value in a column. (with ambiguous choices)

I have a very large array of data with many columns that display different outputs for the values presented. I would like to add a row above the data that displays the most commonly occurring value or word in the column below.
In other words, I would like the top of each column (right under the column label in row 1) to show the most common value below it. I will then use this value for various data analysis functions!
Is this possible, and if so, how? Preferably this will not require VBA, but simply a short code in the cell.
One caveat: The exact values may vary, so there is no set list where I can say "it will be one of these."
Any ideas appreciated!
Try a series of =COUNTIF(A:A,"VALUE TO SEARCH") functions if you want to stay away from VBA.
Otherwise, the best method would be to iterate through each column via VBA. With this method, you can even count the "varying" values and return the count and/or the value itself.
http://www.excel-easy.com/examples/most-frequently-occurring-word.html
This is a single formula you would write at the top of each column. Does not require VBA. You can replace the set range to an entire column, such as (A:A) instead of (A1:A7).
If you mean an array as in a data type, it could work differently but it depends what you're trying to do.
With data from A3 through A16, in A2 enter:
=INDEX($A$3:$A$16,MODE(MATCH($A$3:$A$16,$A$3:$A$16,0)))
This will work for text as well as numbers. Adjust this to match the column size.
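For comparison, the same "most common value in a column" idea can be sketched in Python. This is not what the formula does internally, just the equivalent outcome computed with a frequency count; the function name and the choice to skip empty cells are assumptions.

```python
from collections import Counter


def most_common_value(column):
    """Return the most frequent non-empty cell in a column, or None if there is none."""
    cells = [c for c in column if c not in (None, "")]
    if not cells:
        return None
    # most_common(1) returns a list with one (value, count) pair.
    return Counter(cells).most_common(1)[0][0]


print(most_common_value(["red", "blue", "red", 3, "", "blue", "red"]))
```

Like the formula, this works for text as well as numbers and needs no predefined list of possible values.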

Reporting Services - Two filters on the same chart Category Group?

I have sales data that I'd like to plot on my chart. However, at a specific point in time, we had a change taking place I'd like to ensure is clearly visible in the chart, preferably by dividing the sales data (which is stored in a single SQL Server column) into two different chunks, which would allow me to then treat them as different data series.
I used to solve this in Excel by storing the post-event data in a different column (by simply dragging them to a different column), and thus I was able to treat them as a different series (the blue and green lines in the chart below; the red and orange lines are the pre-event and post-event averages):
I'd like to reproduce this effect in SSRS, but am not sure how to tackle it. I've tried using an approach where I added two category groups, both pointing to the date-time column, and applying filters to them (one <= the cutoff date, the other >=).
I then added my sales data twice, with the idea I could somehow connect them to the individual category groups, but that does not seem possible.
Has anyone tried anything like this before, or would have a different approach to achieve what I'm trying to get?
Thanks!
I managed to get this to work, and figured I'd share how to do it.
My dataset contains a field called DATEKEY, which stores the date in the format YYYYMMDD. It's possible to use this in an expression and evaluate the date for a specific row. In case the expression evaluates to true, we display the value. If not, we display a blank string.
In case we want to show the values prior to the date, the expression would be:
=IIF(Fields!DATEKEY.Value <= 20130601, Avg(Fields!My_NUMBER.Value), "")
The second series can then be made by reversing the symbol:
=IIF(Fields!DATEKEY.Value >= 20130601, Avg(Fields!My_NUMBER.Value), "")
The graph then looks like this:
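For anyone who wants to check the split logic outside SSRS, here is a rough Python sketch of the two expressions. The (datekey, value) row layout and the function name are assumptions for illustration, and None is used where the expressions return a blank string. Because both comparisons include the cutoff, the row dated exactly on the cutoff appears in both series, which is what keeps the two lines joined.

```python
CUTOFF = 20130601  # DATEKEY in YYYYMMDD form, as in the expressions above


def split_series(rows, cutoff=CUTOFF):
    """Split (datekey, value) rows into a pre-event and a post-event series."""
    # Each series keeps its value only on its own side of the cutoff.
    before = [value if datekey <= cutoff else None for datekey, value in rows]
    after = [value if datekey >= cutoff else None for datekey, value in rows]
    return before, after


rows = [(20130530, 100), (20130531, 110), (20130601, 90), (20130602, 95)]
before, after = split_series(rows)
print(before)
print(after)
```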