How to write a function that evaluates two data frame columns of choice and returns the output? - dataframe

Given the following data frame:
site <- c("site_1", "site_2", "site_3", "site_4", "site_5", "site_6")
protein1 <- c("M", "Q", "W", "F", "M", "M")
protein2 <- c("M", "W", "V", "M", "M", "M")
protein3 <- c("M", "D", "W", "F", "M", "M")
df <- data.frame(site, protein1, protein2, protein3)
I would like to extract the first column of the data frame, and two additional columns (proteins) that are being compared. However, the latter two columns will vary, depending on the comparison, and only the rows (site_number) where the two proteins differ should be returned. I have achieved this using subset(), but I was hoping to avoid copying and pasting the same line many times and replacing the name of the columns in each line. Here's what I've been able to do, but what I feel is more script than necessary:
comparison1 <- subset(df, protein1 != protein2, select = c(site, protein1, protein2))
comparison2 <- subset(df, protein1 != protein3, select = c(site, protein1, protein3))
#in each case, this produces the desired result of showing the "site" and "protein" values, in rows where the "protein" values differ.
In a large dataset, one would have many columns (18) with different names. Additionally, two different pairwise comparisons would be performed for each column. So I thought it would be wise to write a function that takes the column names of interest as input. Not having so much experience, I tried the following function before learning that you should not use subset() inside functions:
#to establish the function:
compare <- function(first, second){
result <- subset(df, df$first != df$second, select = c(site, first, second))
return(result)
}
#then to do my comparisons:
compare(protein1, protein2)
compare(protein1, protein3)
This returned the following error:
Error in `[.data.frame`(x, r, vars, drop = drop) :
undefined columns selected
Downstream, I would like to put the results into a list of data frames.
I'm quite sure I'm overlooking something simple. Perhaps the answer lies in using square brackets ([[). At least it seems that R is not converting the "first" and "second" variables to character strings that can match names of columns, as the error shows columns are undefined. If anyone knows whether writing a function for this is the right thing to do, or if I should do something else, then I would be very grateful for the feedback!
Thanks, and take care,
A
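A minimal sketch of the square-bracket idea raised in the question, not a definitive answer: it assumes the column names are passed as character strings and that the data frame is passed in as an argument (the `data` parameter and the `pairs` list are illustrative names, not part of the original code). `df$first` looks for a column literally called "first", whereas `data[[first]]` looks up the column whose name is stored in `first`.

```r
# Example data from the question
site <- c("site_1", "site_2", "site_3", "site_4", "site_5", "site_6")
protein1 <- c("M", "Q", "W", "F", "M", "M")
protein2 <- c("M", "W", "V", "M", "M", "M")
protein3 <- c("M", "D", "W", "F", "M", "M")
df <- data.frame(site, protein1, protein2, protein3)

# Column names arrive as character strings and are looked up with [[ ]]
compare <- function(data, first, second) {
  differs <- data[[first]] != data[[second]]
  data[differs, c("site", first, second), drop = FALSE]
}

comparison1 <- compare(df, "protein1", "protein2")
comparison2 <- compare(df, "protein1", "protein3")

# Collect several comparisons in a named list of data frames
pairs <- list(c("protein1", "protein2"), c("protein1", "protein3"))
results <- lapply(pairs, function(p) compare(df, p[1], p[2]))
names(results) <- sapply(pairs, paste, collapse = "_vs_")
```

With 18 columns, the `pairs` list could be built programmatically (for example with `combn()` over the column names) instead of being typed out by hand.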

Related

R code for matching multiple strings in two columns and returning into a third separated by a comma

I have two data frames. The first includes columns b and c, each holding multiple strings separated by commas. The second has three columns: one that includes all strings in column b, one that includes all strings in column c, and a third that holds the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit- This is the first half of the puzzle BUT it does not search for and then concat the strings together in a cell, only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with couple of strings in another column in R?

Can OpenRefine easily do One Hot Encoding?

I have a dataset like a multiple choice quiz result. One of the fields is semi-colon delimited. I would like to break these in to true/false columns.
Input
Student  Answers
Alice    B;C
Bob      A;B;D
Carol    A;D

Desired Output
Student  A      B      C      D
Alice    False  True   True   False
Bob      True   True   False  True
Carol    True   False  False  True
I've already tried "Split multi-valued cells" and "Split in to several columns", but these don't give me what I would like.
I'm aware that I could do a custom grel/python/jython along the lines of "if value in string: return true" for each value, but I was hoping there would be a more elegant solution.
Can anyone suggest a starting point?
GREL in OpenRefine has a somewhat limited set of data structures, but you can still build simple algorithms with it.
For your encoding you need two data structures:
a list (technically an array) of all available categories.
a list of the categories in the current cell.
With these you can check, for each category, whether it is present in the current cell.
Assuming the number of available categories is small enough to list by hand,
I will use a hard-coded list ["A", "B", "C", "D"].
We get the list of categories in the current cell via value.split(/\s*;\s*/).
Note that I am using an array rather than string matching,
and splitting with a regular expression that tolerates whitespace around the separator.
This is mainly defensive programming, and hopefully the algorithm is still understandable.
So let's wrap this all together into a GREL expression and create a new column (or transform the current one):
with(
  value.split(/\s*;\s*/),
  cell_categories,
  forEach(
    ["A", "B", "C", "D"],
    category,
    if(cell_categories.inArray(category), 1, 0)))
.join(";")
You can then split the new column into several columns using ; as separator.
You have to assign the new column names manually (sorry ;).
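For readers who want to check the logic outside OpenRefine, the same membership test can be sketched in R. This is not GREL and not part of the OpenRefine workflow; the names `quiz`, `categories`, `encoding`, and `encoded` are illustrative, and the data is the example from the question.

```r
# Example rows from the question
quiz <- data.frame(
  Student = c("Alice", "Bob", "Carol"),
  Answers = c("B;C", "A;B;D", "A;D"),
  stringsAsFactors = FALSE
)

# Hard-coded category list, as in the GREL expression
categories <- c("A", "B", "C", "D")

# Split each cell on ";", tolerating whitespace around the separator
cell_categories <- strsplit(quiz$Answers, "[[:space:]]*;[[:space:]]*")

# One row per student, one 0/1 column per category
encoding <- t(sapply(cell_categories, function(cell) as.integer(categories %in% cell)))
colnames(encoding) <- categories
encoded <- cbind(quiz["Student"], encoding)
```

Here the column names come from `categories`, so nothing has to be renamed by hand.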
Update: here is a more elaborate version to automatically extract the categories.
The idea is to create a single record for the whole dataset to be able to access all the entries in the column "Answers" and then extract all available categories from it.
Create a new column "Record" with content "Record".
Move the column "Record" to the beginning.
Blank down the column "Record".
Add a new column "Categories" based on the column "Answers" with the following GREL expression:
if(row.index>0, "",
  row.record.cells["Answers"].value
    .join(";")
    .split(/\s*;\s*/)
    .uniques()
    .sort()
    .join(";"))
Fill down the column "Categories".
Add a new column "Encoding" based on the column "Answers" with the following GREL expression:
with(
  value.split(/\s*;\s*/),
  cell_categories,
  forEach(
    cells["Categories"].value.split(";"),
    category,
    if(cell_categories.inArray(category), 1, 0)))
.join(";")
Split the column "Encoding" on the character ;.
Delete the columns "Record" and "Categories".

Create new data frame of percentages from values of an old data frame?

So I want to create a new data frame adding the values of the Sometimes and Often column and dividing it by the values of the total column and multiplying it by 100 to get percentages (unless there is a function that automatically does this in R). How would I go about doing that?
You have added an "sql" tag to your question. Should you prefer SQL over R for reasons of experience and/or knowledge you might be interested in the fabulous sqldf package which allows you to use SQL syntax within R. You will have to download it first via install.packages("sqldf") and then you can use it as in
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total =c(6, 9, 8))
library(sqldf)
sqldf("SELECT 100*(sometimes+often)/total FROM expl")
The far more common approach is to add a percent column to the same data.frame instead of introducing a new one. That way, all data are kept together and you do not lose the link to, e.g., the week column.
One way to go about that would be the following one-liner:
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total =c(6, 9, 8))
print(expl)
expl$percent = 100 * (expl$sometimes + expl$often)/expl$total
print(expl)
First, it looks as though Total, Sometimes, and Often are character because you have commas in them, so you would need to get rid of the commas and convert them to numeric. You can do that as follows (assuming your dataframe is called mydata):
for(i in c("Total","Sometimes","Often")) mydata[[i]] = as.numeric(gsub(",", "", mydata[[i]]))
Then you can use the answer by Bernard:
mydata$percent = 100 * (mydata$Sometimes + mydata$Often)/mydata$Total
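A self-contained sketch of the two steps above. The values in `mydata` are made up for illustration, since the original data is not shown; only the column names Total, Sometimes, and Often come from the question.

```r
# Hypothetical data with thousands separators stored as text
mydata <- data.frame(
  Total     = c("1,200", "950", "2,000"),
  Sometimes = c("300", "190", "1,000"),
  Often     = c("300", "95", "500"),
  stringsAsFactors = FALSE
)

# Strip the commas and convert each column to numeric
for (i in c("Total", "Sometimes", "Often")) {
  mydata[[i]] <- as.numeric(gsub(",", "", mydata[[i]]))
}

# Percentage of "Sometimes" plus "Often" relative to "Total"
mydata$percent <- 100 * (mydata$Sometimes + mydata$Often) / mydata$Total
```

Without the conversion step, the arithmetic would fail because the columns are character (or factor) rather than numeric.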
Another option using the tidyverse:
library(tidyverse)
newdataframe <- olddataframe %>%
mutate(percent = (Sometimes+Often)/Total*100) %>%
select(percent)
But as said before, better leave the percentage column with the other data. In that case, remove the %>% select(percent).

Query with lambda function in pymongo?

I use the following query to fish for all men in a database:
f = pd.DataFrame(x for x in collection.find({"gender": "M"},{"_id":0}))
How could I find only the men whose "name" starts with an "A"? Obviously I could filter the resulting huge DataFrame, but how can I avoid creating this DataFrame in the first place?
Many thanks
You can use a MongoDB regular expression query, something like:
from bson.regex import Regex
f = pd.DataFrame(x for x in collection.find({"gender": "M", "name": Regex(r"^A.*")},{"_id":0}))

R: Matching two tables on multiple columns and creating a matched/not matched flag

I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
create table
x as
select
a.*,
b.*,
case when a.first_column=b.column_first and
a.second_column=b.column_second
then 1 else 0 end as matched_flag
from table1 as a
left join
table2 as b
on a.first_column=b.column_first and a.second_column=b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and give them partly overlapping row names and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i','j')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical vector of the same length as its first argument (in this case, rownames(mat.a)) that indicates whether the ith element of the first argument was found anywhere among the elements of the second argument. Because of this, you have to pay attention to the order of the arguments when you use the result for indexing.
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, the order of the arguments does not matter as long as the row names are unique, because the number of row names of mat.a found in mat.b then equals the number of row names of mat.b found in mat.a. In that sense, this usage of %in% is commutative.
I hope this helps!
You will want to use data frame objects. These are like datasets in SAS. You can use cbind to put two data frames together side by side. Then you can select rows based on conditions and set the flag based on this. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(list(a.first_column=c(1,2,3),a.second_column=c(4,5,6)))
table2 <- data.frame(list(b.first_column=c(1,3,6),b.second_column=c(4,5,9)))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when the first columns match)
x$matched_flag[x$a.first_column==x$b.first_column] <- 1
x$matched_flag[!x$a.first_column==x$b.first_column] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.
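The cbind approach pairs rows by position, whereas the SAS query joins on key values. A minimal sketch of a key-based left join with base R's merge() follows. It assumes the tables use the column names from the SAS query; the data values and the `extra` column are hypothetical. Unlike `select a.*, b.*`, merge() keeps only one copy of the key columns.

```r
# Hypothetical tables using the column names from the SAS query
table1 <- data.frame(first_column = c(1, 2, 3), second_column = c(4, 5, 6))
table2 <- data.frame(column_first = c(1, 3, 6), column_second = c(4, 5, 9),
                     extra = c("x", "y", "z"), stringsAsFactors = FALSE)

# Mark every table2 row before joining; unmatched table1 rows get NA here
table2$matched_flag <- 1

# Left join on both key columns (all.x = TRUE keeps every row of table1)
x <- merge(table1, table2,
           by.x = c("first_column", "second_column"),
           by.y = c("column_first", "column_second"),
           all.x = TRUE)

# Rows of table1 without a partner in table2 get flag 0
x$matched_flag[is.na(x$matched_flag)] <- 0
```

Only the table1 row with keys (1, 4) finds a partner in table2 here, so it is the only row flagged 1.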