Can OpenRefine easily do One Hot Encoding? - one-hot-encoding

I have a dataset like a multiple choice quiz result. One of the fields is semicolon-delimited. I would like to break these into true/false columns.
Input
Student  Answers
Alice    B;C
Bob      A;B;D
Carol    A;D
Desired Output
Student  A      B      C      D
Alice    False  True   True   False
Bob      True   True   False  True
Carol    True   False  False  True
I've already tried "Split multi-valued cells" and "Split into several columns", but these don't give me what I would like.
I'm aware that I could write a custom GREL/Python/Jython expression along the lines of "if value in string: return true" for each value, but I was hoping there would be a more elegant solution.
Can anyone suggest a starting point?

GREL in OpenRefine has a somewhat limited number of data structures, but you can still build simple algorithms with it.
For your encoding you need two data structures:
a list (technically an array) of all available categories.
a list of the categories in the current cell.
With these you can check, for each category, whether it is present in the current cell.
Assuming that the number of available categories is manageable,
I will use a hard-coded list ["A", "B", "C", "D"].
We get the list of categories in the current cell via value.split(/\s*;\s*/).
Note that I am using an array instead of string matching,
and splitting with a regular expression that tolerates whitespace around the separator.
This is mainly defensive programming, and hopefully the algorithm is still understandable.
So let's wrap this all together into a GREL expression and create a new column (or transform the current one):
with(
  value.split(/\s*;\s*/),
  cell_categories,
  forEach(
    ["A", "B", "C", "D"],
    category,
    if(cell_categories.inArray(category), 1, 0)))
.join(";")
You can then split the new column into several columns using ; as the separator.
You have to assign the new column names manually (sorry).
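For readers who find the logic easier to follow outside GREL, here is a minimal Python sketch of the same idea. It is not part of the OpenRefine workflow; the function name one_hot and the sample rows (copied from the question) are only for illustration.

```python
import re

# Hard-coded category list, mirroring the GREL expression above
CATEGORIES = ["A", "B", "C", "D"]


def one_hot(cell, categories=CATEGORIES):
    """Return a 0/1 list marking which categories occur in a ';'-delimited cell."""
    # Split on ';' and tolerate surrounding whitespace, like value.split(/\s*;\s*/)
    present = set(re.split(r"\s*;\s*", cell.strip()))
    return [1 if category in present else 0 for category in categories]


# Sample data taken from the question
rows = {"Alice": "B;C", "Bob": "A;B;D", "Carol": "A;D"}
for student, answers in rows.items():
    print(student, one_hot(answers))
```

Each output list lines up position by position with the category list, which is why the joined string can later be split into one column per category.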
Update: here is a more elaborate version to automatically extract the categories.
The idea is to create a single record for the whole dataset to be able to access all the entries in the column "Answers" and then extract all available categories from it.
Create a new column "Record" with content "Record".
Move the column "Record" to the beginning.
Blank down the column "Record".
Add a new column "Categories" based on the column "Answers" with the following GREL expression:
if(row.index>0, "",
  row.record.cells["Answers"].value
    .join(";")
    .split(/\s*;\s*/)
    .uniques()
    .sort()
    .join(";"))
Fill down the column "Categories".
Add a new column "Encoding" based on the column "Answers" with the following GREL expression:
with(
  value.split(/\s*;\s*/),
  cell_categories,
  forEach(
    cells["Categories"].value.split(";"),
    category,
    if(cell_categories.inArray(category), 1, 0)))
.join(";")
Split the column "Encoding" on the character ;.
Delete the columns "Record" and "Categories".
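The steps above boil down to two passes: collect the sorted set of unique categories over the whole column, then encode each cell against it. A rough Python sketch of that logic follows. The helper names are hypothetical and it runs outside OpenRefine; the input values are the ones from the question.

```python
import re


def split_cell(cell):
    """Split a ';'-delimited cell, ignoring whitespace and empty parts."""
    return [part for part in re.split(r"\s*;\s*", cell.strip()) if part]


def extract_categories(cells):
    """First pass: all unique categories in the column, sorted."""
    return sorted({part for cell in cells for part in split_cell(cell)})


def encode(cells):
    """Second pass: one 0/1 row per cell, ordered like the category list."""
    categories = extract_categories(cells)
    encoded = []
    for cell in cells:
        present = set(split_cell(cell))
        encoded.append([1 if c in present else 0 for c in categories])
    return categories, encoded


answers = ["B;C", "A;B;D", "A;D"]  # the "Answers" column from the question
categories, encoded = encode(answers)
print(categories)
for row in encoded:
    print(row)
```

The extracted category list also gives you the column names that had to be assigned by hand in the first version.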

Related

R code for matching multiple strings in two columns and returning into a third separated by a comma

I have two dataframes. The first df includes columns b and c, each containing multiple strings separated by commas. The second has three columns: one that includes all strings in column b, one that includes all strings in column c, and a third with the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit: This is the first half of the puzzle, but it does not search for and then concatenate the strings together in a cell; it only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with couple of strings in another column in R?

Is there an equivalent of an f-string in Google Sheets?

I am making a portfolio tracker in Google Sheets and wanted to know if there is a way to link the "TICKER" column with the code in the "PRICE" column that is used to pull JSON data from Coin Gecko. I was wondering if there is something like an f-string in Python, where you can insert a variable into the string itself, so that every time the Ticker column is updated the coin id is updated within the API request string. Essentially, string interpolation.
For example:
TICKER PRICE
BTC =importJSON("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids={BTC}","0.current_price")
You could use CONCATENATE for this:
https://support.google.com/docs/answer/3094123?hl=en
CONCATENATE function
Appends strings to one another.
Sample Usage
CONCATENATE("Welcome", " ", "to", " ", "Sheets!")
CONCATENATE(A1,A2,A3)
CONCATENATE(A2:B7)
Syntax
CONCATENATE(string1, [string2, ...])
string1 - The initial string.
string2 ... - [ OPTIONAL ] - Additional strings to append in sequence.
Notes
When a range with both width and height greater than 1 is specified, cell values are appended across rows rather than down columns. That is, CONCATENATE(A2:B7) is equivalent to CONCATENATE(A2,B2,A3,B3, ... , A7,B7).
See Also
SPLIT: Divides text around a specified character or string, and puts each fragment into a separate cell in the row.
JOIN: Concatenates the elements of one or more one-dimensional arrays using a specified delimiter.
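To connect this back to the f-string in the question, here is a small Python sketch showing that plain concatenation and an f-string produce the same request URL. The URL prefix is copied from the question; the function names are hypothetical, and the value is inserted as-is without checking what the API expects for ids.

```python
# URL prefix copied from the question; everything after "ids=" comes from the cell
BASE_URL = "https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids="


def build_url_concat(coin_id):
    # Plain concatenation, the same operation CONCATENATE performs on its arguments
    return BASE_URL + coin_id


def build_url_fstring(coin_id):
    # The f-string form the question asks about
    return f"{BASE_URL}{coin_id}"


ticker = "BTC"  # stands in for the value of the TICKER cell
print(build_url_concat(ticker))
print(build_url_concat(ticker) == build_url_fstring(ticker))
```

In the sheet, the analogous step would be to pass CONCATENATE(url prefix, ticker cell) as the first argument of importJSON, with the cell reference depending on your layout.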

recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example, I have a data frame where columns a, b, c are the names of survey items and rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change the values of all observations for selected columns. For instance, value 1 in columns "a" and "c" should be replaced with the string "low", and values 2 and 3 in these columns should be replaced with "high".
I do this often with many columns, so I am looking for a function which can do it in a very simple way, like this:
recode2(data=d, columns=a,d, "1=low, 2,3=high").
The function recode from the package car is almost OK, but if I have 10 columns to recode I have to rewrite it 10 times, and that is not as efficient as I want.

Add values from a column when two other columns match

I have an ecology data table with about 12,000 rows. There are three columns: site, species, and value. I need to add up the values for each set of matching site and species - for example, all "red maple" values at "site A". I have the data sorted by site and species, so I can do it by hand, but it's slow going. The number of site/species matches varies, so I can't just add up the values in sets of three or anything.
Similar types of questions have talked about pivot tables, but none have needed to match two columns and add a third column, and I haven't been able to figure out how to extrapolate to my situation.
I'm reasonably comfortable coding and would like to do something that looks like this pseudocode, but I'm not clear on the syntax in VBA:
For each row
    if a(x) = a(x+1) and b(x) = b(x+1) then
        sum = sum + c(x)
    else
        d(x) = sum
        sum = 0
next
Any ideas?
In a PivotTable, put site in Row Labels and species in Column Labels (or vice versa) and Sum of value in Σ Values.
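If you would rather script it, the aggregation a PivotTable performs here is a sum grouped by the (site, species) pair. Below is a minimal Python sketch of that grouping, not VBA. The function name and the sample rows are made up for illustration and are not from the asker's data.

```python
from collections import defaultdict


def sum_by_site_and_species(rows):
    """Sum the value column for every distinct (site, species) pair."""
    totals = defaultdict(float)
    for site, species, value in rows:
        # The pair is the grouping key, so the rows do not need to be sorted
        totals[(site, species)] += value
    return dict(totals)


# Made-up sample rows: (site, species, value)
rows = [
    ("site A", "red maple", 2.0),
    ("site A", "red maple", 3.5),
    ("site A", "white oak", 1.0),
    ("site B", "red maple", 4.0),
]

totals = sum_by_site_and_species(rows)
for (site, species), total in sorted(totals.items()):
    print(site, species, total)
```

Keying on the pair avoids the compare-with-next-row bookkeeping in the pseudocode, and it works however many rows each site/species combination has.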

How to access columns by their names and not by their positions?

I have just tried my first sqlite select statement and got a result (an iterator over tuples). In other words, every row is represented by a tuple, and I can access the values in the cells of a row like this: r[7] or r[3] (get the value from column 7 or column 3). But I would like to access columns not by their positions but by their names. Let us say I would like to know the value in the column user_name. What is the way to do it?
I found the answer to my question here:
cursor.execute("PRAGMA table_info(tablename)")
print(cursor.fetchall())
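The PRAGMA query returns the column metadata, from which the names can be read. Another option in Python's standard sqlite3 module is to set the connection's row_factory to sqlite3.Row, which lets you index each row by column name. A minimal sketch, assuming an in-memory database and a hypothetical users table invented for the example:

```python
import sqlite3

# In-memory database with a made-up table, only for demonstration
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now support access by column name
conn.execute("CREATE TABLE users (id INTEGER, user_name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

row = conn.execute("SELECT id, user_name FROM users").fetchone()
print(row["user_name"])  # access by name
print(row[1])            # positional access still works
print(row.keys())        # column names of this result row

# The PRAGMA approach from the answer: the second field of each entry is the column name
column_names = [info[1] for info in conn.execute("PRAGMA table_info(users)")]
print(column_names)
```

With sqlite3.Row you do not need a separate lookup of column positions, while the PRAGMA route is useful when you want the schema itself.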