Trying to make a new dataframe with fewer observations but more columns - data-manipulation

I am new to R and I have this data frame that I am using to make another one. I am not sure how to ask this question so I figured I would give an example:
MRN <- c(1,1,1,2,3,5,5)
Comorbid <- c("Celiac", "T1D", "Hashimoto", "Celiac", "EoE", "Celiac", "Graves")
df2 <- data.frame(MRN, Comorbid)
MRN <- c(1,2,3,4,5)
Celiac <- c(1,1,0,0,1)
T1D <- c(1,0,0,0,0)
Hashimoto <- c(1,0,0,0,0)
EoE <- c(0,0,1,0,0)
Graves <- c(0,0,0,0,1)
df3 <- data.frame(MRN,Celiac,T1D,Hashimoto,EoE,Graves)
I have included pictures of both data frames below:
[two images of the data frames, not reproduced here]
I am trying to make df3 from df2. Any idea how to do this?
Thanks!
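One possible way to build df3 from df2, sketched in base R. It assumes that every MRN from 1 to 5 should get a row (MRN 4 has no rows in df2, so it has to be supplied explicitly) and that the columns should follow the order in which the conditions first appear. The names all_mrn, conditions, counts and wide are illustrative only.

```r
MRN <- c(1, 1, 1, 2, 3, 5, 5)
Comorbid <- c("Celiac", "T1D", "Hashimoto", "Celiac", "EoE", "Celiac", "Graves")
df2 <- data.frame(MRN, Comorbid)

# Assumption: MRNs 1..5 should all appear, including MRN 4 with no conditions
all_mrn <- 1:5
# Column order follows first appearance in df2
conditions <- unique(as.character(df2$Comorbid))

# Cross-tabulate MRN against condition; fixing the factor levels keeps empty rows
counts <- table(factor(df2$MRN, levels = all_mrn),
                factor(df2$Comorbid, levels = conditions))

# Convert the table to a wide data frame and turn counts into 0/1 indicators
wide <- as.data.frame.matrix(counts)
wide[] <- lapply(wide, function(col) as.numeric(col > 0))

df3 <- data.frame(MRN = all_mrn, wide, row.names = NULL)
df3
```

If the tidyverse is available, tidyr::pivot_wider() is another common route to the same shape.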

Related

How to add a vector to a table in backend using dbplyr (R)

I created a table from a data source using tbl(). I need to add a column including 1:nrow() to my dataset and tried different methods but I didn't succeed. My code is as below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add the ID column to my dataset; only the step column is added.
Using as.data.frame() works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dplyr)   # provides tbl(), mutate() and %>%
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row
tbl(con, "mtcars") %>%
  mutate(ID = row_number())
dbDisconnect(con)
I found the answer: use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric:
# some_column is a placeholder for the column to order by
df1 <- df1 %>% mutate(ID = as.numeric(row_number(some_column)), step = 1)

How to specify vartype for sqlSave() for multiple columns without manually typing in R?

Some reproducible code. Note that your database and server name may be different.
library(RODBC)
iris <- iris
connection <- odbcDriverConnect(
"Driver={SQL Server};
Server=localhost\\SQLEXPRESS;
Database=testdb;
Trusted_connection=true;"
)
# create table in sql and move dataframe values in
columnTypes <- list(
Sepal.Length="decimal(28,0)",
Sepal.Width="decimal(28,0)",
Petal.Length ="decimal(28,0)",
Petal.Width ="decimal(28,0)",
Species = "varchar(255)"
)
sqlSave(connection,iris,varTypes = columnTypes)
This is how I export a data frame to SQL Server Management Studio as a table, and it works. But say I have one hundred new columns in iris: do I have to set each column name to decimal(28,0) in my columnTypes variable?
# but what if I have
iris$random1 <- rnorm(150)
iris$random2 <- rnorm(150)
# ... random3 through random999 created the same way ...
iris$random1000 <- rnorm(150)
By default the columns go in as floats, at least in my actual data frame (iris is just the example), which is why I need to change them manually in columnTypes. I want everything after the 5 original columns in iris to be decimal(28,0) without listing them manually in columnTypes.
I did not look into sqlSave() or possible alternatives, so there might be a more appropriate solution. In any case, you can generate the list of wanted definitions in base R by repetition and by combining lists:
# dummy data
df <- data.frame(Sepal.Length = 1:5, Sepal.Width = 1:5, Petal.Length = 1:5, Petal.Width = 1:5, Species = 1:5, col6 = 1:5, col7 = 1:5)
# all column names after the 5th -> with fewer than 6 columns, 6:ncol(df) counts downward and returns wrong names!
vec <- names(df)[6:ncol(df)]
# generate a list with the same definition for all columns after the 5th
list_after_5 <- as.list(rep("decimal(28,0)", length(vec)))
# name the list items according to the columns after the 5th
names(list_after_5) <- vec
# first 5 column definitions manually, plus the remaining columns with the same definition from the list generated above
columnTypes <- c(list(Sepal.Length = "decimal(28,0)",
                      Sepal.Width = "decimal(28,0)",
                      Petal.Length = "decimal(28,0)",
                      Petal.Width = "decimal(28,0)",
                      Species = "varchar(255)"),
                 list_after_5)
columnTypes
$Sepal.Length
[1] "decimal(28,0)"
$Sepal.Width
[1] "decimal(28,0)"
$Petal.Length
[1] "decimal(28,0)"
$Petal.Width
[1] "decimal(28,0)"
$Species
[1] "varchar(255)"
$col6
[1] "decimal(28,0)"
$col7
[1] "decimal(28,0)"
Since your example code seems to work for you, this should work too (judging by the output). I did not test it against a database, as I have no test setup available at the moment.
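A more compact variant of the same idea, as a rough sketch: build the whole named list in one step from the column classes instead of hard-coding the first five entries. This assumes that every numeric column should become decimal(28,0) and everything else varchar(255), which is a simplification of the question's "everything after the first 5 columns" rule. The Species column here is made character so that the rule has something to distinguish, and is_num is an illustrative name. Like the answer above, this was not run against a database.

```r
# dummy data; Species is character here, unlike the all-integer dummy above
df <- data.frame(Sepal.Length = 1:5, Sepal.Width = 1:5,
                 Petal.Length = 1:5, Petal.Width = 1:5,
                 Species = letters[1:5],
                 col6 = 1:5, col7 = 1:5)

# TRUE for numeric columns, FALSE otherwise
is_num <- vapply(df, is.numeric, logical(1))

# one SQL type per column, named after the column
columnTypes <- as.list(setNames(ifelse(is_num, "decimal(28,0)", "varchar(255)"),
                                names(df)))
str(columnTypes)
```

The resulting list could then be passed as varTypes = columnTypes in the sqlSave() call, in the same way as in the question.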

Selecting with sqldf values greater than median of all values

I am trying to find all the stock records where the yield is greater than the median for all stock using sqldf but I am getting this message.
I have tried using the actual number 2.39 and it works but I have not been successful substituting a variable to make it dynamic. Maybe a sub-select would be better?
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- sqldf("SELECT a.*
FROM df3 a
WHERE (a.Yield > mYd)
;")
Error in rsqlite_send_query(conn#ptr, statement) : no such column: mYd
The error stems from sqldf's inability to find a column called mYd in df3. mYd is an R variable, not a column, and sqldf needs every column referenced in your query to exist in the data frame. Try adding the mYd variable to your df3 data frame as a proper column:
df3$mYd <- median(df3$Yield, na.rm=TRUE)
df4 <- sqldf("SELECT * FROM df3 WHERE Yield > mYd;")
Note that you don't really need to alias df3 here since it is the only table in the query, and you aren't generating any computed columns.
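Another option, sketched under assumptions, is to leave df3 unchanged and paste the median into the query text before handing it to sqldf. The data frame below is a made-up stand-in for the asker's df3 (the Ticker column and all values are hypothetical), and the sqldf call is guarded so the snippet still runs where the package is not installed.

```r
# Hypothetical stand-in for the asker's df3
df3 <- data.frame(Ticker = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                  Yield = c(1.0, 2.0, 3.0, NA, 4.5))

mYd <- median(df3$Yield, na.rm = TRUE)

# Build the SQL text with the numeric value substituted in
query <- sprintf("SELECT * FROM df3 WHERE Yield > %.15g", mYd)
print(query)

# Run it only if sqldf is available
if (requireNamespace("sqldf", quietly = TRUE)) {
  df4 <- sqldf::sqldf(query)
  print(df4)
}
```

This keeps the query dynamic without adding a constant column to a potentially large data frame.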

Match fields within one data frame with column names in another data frame

I have two data frames. In the last column ("Bill") in the first data frame, I want to apply a function (fixed price + Quantity*price/qty). In order to apply the function, R should match the values in the first column of df1 to the column names of df2.
I have solved the problem by creating a function and several ifelse statements, but I would want to use a statement that automatically matches the values in df1 with the column names in df2. The data set that I have contains more than 2 million rows and I would need to apply the same rationale into building other similar functions. It would be nice to use something that does not require a loop or takes too long to process.
### Set up your data frames like so ###
Code <- c("a1", "a2", "c3", "a1")
Name <- c("Dan", "David", "Anna", "Lisa")
Quantity <- c(30, 12, 10, 10)
df1 <- as.data.frame(cbind("Code" = Code, "Name" = Name, "Quantity" = Quantity), stringsAsFactors = F)
df1$Quantity <- as.numeric(df1$Quantity)
fixed_price <- c(12, 5, 23)
price_per_qty <- c(1, 4, 7)
df2 <- as.data.frame(rbind("fixed_price" = fixed_price, "price_per_qty" = price_per_qty))
colnames(df2) <- c("a1", "a2", "c3")
### Combine dataframe 1 and 2 into a single dataframe ###
# Code below pulls individual columns from df2 based on the
# index provided by the "Code" column in df1, transposes them
# so they'll line up with df1, then column binds them to df1
df3 <- cbind(df1, t(df2[,df1$Code]))
# the bill is calculated simply enough
bill <- df3[4] + df3[3] * df3[5]
colnames(bill) <- "bill"
# Finally, output the results as you wanted
cbind(df3, bill)
My answer is fairly similar to graggsd's, but here is what worked for me. I reshaped df2 so that each code is a row, joined the two data frames on the key column "Code" into combined_data, and then passed the relevant columns through a function that implements the formula you described.
df2 <- t(data.frame(c(12,1),c(5,4),c(23,7)))
rownames(df2) <- c("a1","a2","c3")
test <- rownames(df2)
df2 <- cbind.data.frame(df2,test)
colnames(df2) <- c("fixed price","price/qty","Code")
df1 <- data.frame(c("a1","a2","c3","a1"), c("Dan","David","Anna","Lisa"),c(30,12,10,10))
colnames(df1) <- c("Code","Name","Quantity")
combined_data <- dplyr::inner_join(df1,df2, by = "Code")
f1 <- function(x,y,z){
x + y * z
}
bill <- f1(combined_data[,4],combined_data[,3],combined_data[,5])
finalDataSet <- cbind.data.frame(combined_data,bill)
The final data set:
  Code  Name Quantity fixed price price/qty bill
1   a1   Dan       30          12         1   42
2   a2 David       12           5         4   53
3   c3  Anna       10          23         7   93
4   a1  Lisa       10          12         1   22
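Since the question mentions more than 2 million rows, a vectorized lookup that avoids both the transpose and the join may be worth trying. The sketch below rebuilds the question's data and indexes a price matrix directly by the Code values; the name prices is illustrative, and no timing was measured.

```r
df1 <- data.frame(Code = c("a1", "a2", "c3", "a1"),
                  Name = c("Dan", "David", "Anna", "Lisa"),
                  Quantity = c(30, 12, 10, 10),
                  stringsAsFactors = FALSE)

df2 <- data.frame(a1 = c(12, 1), a2 = c(5, 4), c3 = c(23, 7),
                  row.names = c("fixed_price", "price_per_qty"))

# A numeric matrix can be indexed by (repeated) column names in one step
prices <- as.matrix(df2)

# fixed price + Quantity * price per quantity, matched on Code
df1$Bill <- unname(prices["fixed_price", df1$Code] +
                   df1$Quantity * prices["price_per_qty", df1$Code])
df1
```

A Code value that is not a column name of df2 would raise a subscript error here, so unknown codes would need to be handled beforehand.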

I want to calculate a moving difference for my dataset below.

How do I add another column with a moving difference of Column2?
For example, I want to add a column with the following values: (0, -372706.6, -284087.1, -119883.7, etc.)
Here's a way to go about it.
## For a small dataset
x <- data.frame(matrix(nrow=7,ncol=2,c(0,12,1,10,2,9.5,3,8,4,7,5,5,6,2),byrow = T))
names(x) <- c("Time","Count")
x[1,"Diff"] <- NA
x[2:nrow(x),"Diff"] <- rev(diff(rev(x$Count)))
There is a way to do it with the plyr package as well.
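A shorter base R sketch of the same calculation, on the same small dataset. It assumes the difference wanted is current value minus previous value, with 0 in the first row as in the question's example; the original data is not shown, so the sign convention is a guess, and it is the opposite of what the rev(diff(rev(...))) version above produces.

```r
x <- data.frame(Time = 0:6, Count = c(12, 10, 9.5, 8, 7, 5, 2))

# diff() returns x[i] - x[i - 1]; pad the first row with 0
x$Diff <- c(0, diff(x$Count))
x
```

Use c(NA, diff(x$Count)) instead if the first row should be marked as missing rather than zero.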