How to use map() from the purrr package in an efficient and easy way? - purrr

I'm trying to use map from the purrr package in a more efficient way than I'm doing right now. I have 3 different datasets, let's say iris_1, iris_2, and iris_3.
I want to run the same linear regression for all 3 datasets. My final goal is to get all the coefficients from each of these 3 regressions using map.
My code looks like this:
library(purrr)
library(dplyr)
library(tidyr)
# Load data
iris <- iris
#-------------------------------------------------------------------------------------------------------------#
#Basic modifications
#-------------------------------------------------------------------------------------------------------------#
iris_1 <- iris %>% dplyr::filter(Species == "versicolor")
iris_2 <- iris %>% dplyr::filter(Species == "virginica")
iris_3 <- iris %>% dplyr::filter(Species == "setosa")
Databases <- list(iris_1, iris_2, iris_3)
####Step A
Linear_Models <- map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = .x))
M_1 <- Linear_Models[[1]]
M_2 <- Linear_Models[[2]]
M_3 <- Linear_Models[[3]]
####Step B
Linear_Models_Coeff <- list(M_1, M_2, M_3)
Coeff <- map(Linear_Models_Coeff, ~ coef(summary(.x)))
C_M_1 <- Coeff[[1]]
C_M_2 <- Coeff[[2]]
C_M_3 <- Coeff[[3]]
I tried to do these previous steps in a more efficient way (that is, combining steps A and B) by doing the following. However, when I try to get the coefficients, I don't get the desired results that I get in the previous steps (i.e. C_M_1 <- Coeff[[1]]).
Linear_Models <- map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = .x), ~ coef(summary(.x)))
C_M_1 <- Linear_Models[[1]]
Many thanks in advance! I know there are multiple ways of doing this with packages other than purrr, but I would really appreciate help that uses the purrr package.

Note that map() applies a single function to each element; the second lambda in your call is passed to that function as an extra argument and silently ignored. Instead, you could do this in one go (piping all the functions inside map), e.g.
purrr::map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length,
                           data = .x) %>% summary() %>% coef()) %>%
  set_names(c("M1", "M2", "M3"))
Result:
$M1
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept)  2.1164314  0.4942556 4.282059 9.063960e-05
Sepal.Width  0.2476422  0.1868389 1.325431 1.914351e-01
Petal.Length 0.7355868  0.1247678 5.895648 3.870715e-07

$M2
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  0.6247824 0.52486745  1.190362 2.398819e-01
Sepal.Width  0.2599540 0.15333757  1.695305 9.663372e-02
Petal.Length 0.9348189 0.08960197 10.433017 8.009442e-14

$M3
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept)  2.3037382 0.38529423 5.979166 2.894273e-07
Sepal.Width  0.6674162 0.09035581 7.386533 2.125173e-09
Petal.Length 0.2834193 0.19722377 1.437044 1.573296e-01
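If you would rather end up with a single tidy data frame of coefficients than a named list of matrices, here is a minimal sketch, assuming broom is acceptable alongside purrr (the model column is a name introduced here for illustration):

library(purrr)
library(dplyr)
library(broom)

Coeff_df <- Databases %>%
  map(~ lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = .x)) %>%
  map(tidy) %>%                                    # term, estimate, std.error, statistic, p.value
  imap_dfr(~ mutate(.x, model = paste0("M", .y)))  # tag each row with its model index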

Related

extract individual panels of facet plot to re-arrange

I would like to rearrange a facet plot with 3 panels to have them fit better in a poster presentation. Currently, I have A over B over C (one column), and it is important to keep B over C.
What I would like is to have a square (2x2) presentation, with A over nothing, and B over C.
Can I either extract the individual panels of the plot, or create a facet with no axes or other graphic (like plot_grid with a NULL panel)?
A second option would be the ggh4x package, which via facet_manual() adds some flexibility in placing the panels: each letter in the design string places one panel, and # leaves a cell empty.
library(ggplot2)
library(ggh4x)

design <- "
AB
#C"

ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  facet_manual(~cyl, design = design)
One approach could be creating separate plots using nest() and map() from {tidyverse} and then using the {patchwork} package to align them as we want.
(Since the OP didn't provide any data or code, I am using the built-in mtcars dataset to show how to do this.) Suppose this is the case where we have a faceted plot with 3 panels in a 3 x 1 format.
library(tidyverse)
# 3 x 1 faceted plot
mtcars %>%
  ggplot(aes(mpg, disp)) +
  geom_point() +
  facet_wrap(~cyl, nrow = 3)
Now, to match the question, let's suppose the panel for cyl 4 is plot A, the panel for cyl 6 is plot B, and the panel for cyl 8 is plot C.
To do this, we first create a nested dataset with respect to the facet variable using group_by(facet_var) %>% nest(), and then map ggplot over the nested data to get a plot (gg object) for each nested dataset.
library(tidyverse)
library(patchwork)
# Say, plotA is cyl 4
# plotB is cyl 6
# plotC is cyl 8
# 2 x 2 facet plot
plot_data <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    plots = map2(
      .x = data,
      .y = cyl,
      .f = ~ ggplot(data = .x, mapping = aes(mpg, disp)) +
        geom_point() +
        ggtitle(paste0("cyl is ", .y))
    )
  )
plot_data
#> # A tibble: 3 × 3
#> # Groups: cyl [3]
#> cyl data plots
#> <dbl> <list> <list>
#> 1 6 <tibble [7 × 10]> <gg>
#> 2 4 <tibble [11 × 10]> <gg>
#> 3 8 <tibble [14 × 10]> <gg>
Then simply align the plots using {patchwork} syntax as desired. I have used plot_spacer() to create the blank space.
plots <- plot_data$plots
plots[[2]] + plots[[1]] + plot_spacer() + plots[[3]] +
  plot_annotation(
    title = "A 2 X 2 faceted plot"
  )
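Note that nest() keeps the groups in first-appearance order (6, 4, 8 here), which is why the indexing above starts with plots[[2]]. As a minimal, order-independent sketch, you could instead name the list of plots by their cyl value with set_names() and index by value:

plots <- set_names(plot_data$plots, plot_data$cyl)

plots[["4"]] + plots[["6"]] + plot_spacer() + plots[["8"]] +
  plot_annotation(title = "A 2 X 2 faceted plot")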

dplyr vs dbplyr filtering with white space

This is partly related to my previous question. If I use dplyr to filter a dataframe whose ids have trailing white space against ids with no trailing white space, dplyr treats the white space as characters and no match occurs, resulting in an empty dataframe:
library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 hjhjh 1
ids <- df %>%
  select(a) %>%
  pull()
ids
#[1] "hjhjh"
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1
# 2 "popopo" 2
#now filter
df_new <- df_with_space %>%
  filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>
If I try to do the same thing and filter using dbplyr from a SQL database, it ignores the white space in the filtering but still includes it in the final output, example code:
library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
                     Database = "test",
                     dsn = "SQL_server")
db_df <- tbl(test_db, "testing")
db_df <- db_df %>%
  filter(a %in% ids) %>%
  collect()
#quotation marks:
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1 #matches but includes the white space
I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim the white space first, which is very slow on a large database:
db_df <- db_df %>%
  mutate(a = str_trim(a, "both")) %>%
  filter(a %in% ids) %>%
  collect()
Thanks.
EDIT
With show_query
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
  filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))
If you're connecting to SQL Server, then I can reproduce this. I'll label it as a "bug", personally, and will never rely on it ...
No need to use dbplyr here; the issue is in the underlying DBMS. dbplyr is just the messenger; don't blame the messenger :-)
Setup
consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
  DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}
Tests
lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 hjhjh 1
# $sqlserver
# a d
# 1 hjhjh 1
lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 popopo 2
# $sqlserver
# a d
# 1 popopo 2
SQL Server and MariaDB "fail" in both test cases; neither SQLite nor Postgres falls for it.
I don't see this in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.
Workaround
Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)
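As a rough illustration of that post-query filtering, reusing db_df and ids from the question: let the DBMS do the (possibly whitespace-insensitive) match first, then re-filter locally in R, where trailing white space is significant.

db_df %>%
  filter(a %in% ids) %>%  # runs in the DBMS; may over-match on SQL Server / MariaDB
  collect() %>%
  filter(a %in% ids)      # runs in R; exact match, trailing spaces count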

How to specify vartype for sqlSave() for multiple columns without manually typing in R?

Some reproducible code. Note that your database and server name may be different.
library(RODBC)
iris <- iris
connection <- odbcDriverConnect(
"Driver={SQL Server};
Server=localhost\\SQLEXPRESS;
Database=testdb;
Trusted_connection=true;"
)
# create table in sql and move dataframe values in
columnTypes <- list(
  Sepal.Length = "decimal(28,0)",
  Sepal.Width  = "decimal(28,0)",
  Petal.Length = "decimal(28,0)",
  Petal.Width  = "decimal(28,0)",
  Species      = "varchar(255)"
)
sqlSave(connection, iris, varTypes = columnTypes)
This is how I export a dataframe to SQL Server Management Studio as a table, and it works. But say I have one hundred new columns in iris: do I have to set each column name to decimal(28,0) in my columnTypes variable?
# but what if i have
iris$random1 <- rnorm(150)
iris$random2 <- rnorm(150)
# ... iris$random3 through iris$random999 ...
iris$random1000 <- rnorm(150)
By default the columns go in as floats, at least in my actual dataframe (iris is just the example), so that's why I need to manually change them in columnTypes. I want everything after the 5 original iris columns to be decimal(28,0) format without manually including them in columnTypes.
I did not read into the sqlSave() statement and possible alternatives, meaning there might be a more appropriate solution. Anyhow, you can generate the list of wanted definitions in base R by repetition and combining lists:
# dummy data
df <- data.frame(Sepal.Length = 1:5, Sepal.Width = 1:5, Petal.Length = 1:5,
                 Petal.Width = 1:5, Species = 1:5, col6 = 1:5, col7 = 1:5)
# all column names after the 5th -> if there are fewer than 6 columns you will get an error here!
vec <- names(df)[6:ncol(df)]
# generate a list with the same definition for all columns after the 5th
list_after_5 <- as.list(rep("decimal(28,0)", length(vec)))
# name the list items according to the columns after the 5th
names(list_after_5) <- vec
# first 5 column definitions manually, plus the remaining columns with the
# same definition from the list generated above
columnTypes <- c(list(Sepal.Length = "decimal(28,0)",
                      Sepal.Width  = "decimal(28,0)",
                      Petal.Length = "decimal(28,0)",
                      Petal.Width  = "decimal(28,0)",
                      Species      = "varchar(255)"),
                 list_after_5)
columnTypes
$Sepal.Length
[1] "decimal(28,0)"
$Sepal.Width
[1] "decimal(28,0)"
$Petal.Length
[1] "decimal(28,0)"
$Petal.Width
[1] "decimal(28,0)"
$Species
[1] "varchar(255)"
$col6
[1] "decimal(28,0)"
$col7
[1] "decimal(28,0)"
Since your example code seems to be working for you, this should too (judging by the output). I did not test it with a DB, as I have no test setup available at the moment.
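For what it's worth, the same list can be built in one step with setNames(); a minimal sketch against the dummy df from above:

columnTypes <- c(
  list(Sepal.Length = "decimal(28,0)",
       Sepal.Width  = "decimal(28,0)",
       Petal.Length = "decimal(28,0)",
       Petal.Width  = "decimal(28,0)",
       Species      = "varchar(255)"),
  # every column after the 5th gets the same type
  setNames(as.list(rep("decimal(28,0)", ncol(df) - 5)), names(df)[-(1:5)])
)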

Multinomial (nnet) does not work using parsnip and broom

I'm trying to run a multinomial regression (nnet) using tidymodels, but I get the following error:
Error: object of type 'closure' is not subsettable
data(iris)

ml <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification") %>%
  translate()

ml_fit <- ml %>%
  fit(Species ~ Sepal.Width, data = iris)
broom::tidy(ml_fit, exponentiate = F)
But when I run the following, it works perfectly:
formula <- Species ~ Sepal.Width
model <- nnet::multinom(formula, data = iris)
broom::tidy(model, exponentiate = F)
Any idea whether I'm writing the tidymodels code properly, or is it something else?
In tidymodels, we handle things in a way that the original data and formula are not contained in the resulting call (in the usual way). Some parts of multinom() want those (plus the actual data in the same place) to do the computations.
We just changed how we handle the formula; that now comes through as it would have if you called multinom() directly. We can't really do the same with data, but we did add a new function called repair_call() that you can use to make things the way that you want them.
# devtools::install_dev("parsnip")
library(parsnip)
library(broom)
multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_fit <- multi_spec %>%
  fit(Species ~ Sepal.Width, data = iris)
tidy(multi_fit)
#> Error in as.data.frame.default(data, optional = TRUE): cannot coerce class '"function"' to a data.frame
multi_fit_new <- repair_call(multi_fit, iris)
tidy(multi_fit_new)
#> # A tibble: 4 x 6
#> y.level term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 1.55e+8 3.06 6.15 7.54e-10
#> 2 versicolor Sepal.Width 2.20e-3 0.991 -6.17 6.70e-10
#> 3 virginica (Intercept) 4.41e+5 2.69 4.83 1.33e- 6
#> 4 virginica Sepal.Width 1.69e-2 0.844 -4.84 1.33e- 6
Created on 2020-05-22 by the reprex package (v0.3.0)
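Judging by the output above (note the intercept estimate of 1.55e+8), this version of broom exponentiates multinom coefficients by default. If you want them on the log-odds scale, as in your original call, pass exponentiate = FALSE:

tidy(multi_fit_new, exponentiate = FALSE)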

unique values in categorical variables in RStudio

How can I find how many unique values each categorical variable takes in a data frame, and then represent it with a graph? All of this in RStudio.
We'll use the tidyverse here.
library(tidyverse)
You can apply the unique() function to a dataframe to remove any repeat rows.
df <- iris %>% unique()
The group_by(), summarise() and n() functions let you count the number of instances of a variable in a dataframe.
df2 <- df %>% group_by(Species) %>% summarise(n = n())
## alternatively use count() which does the same thing
df2 <- df %>% count(Species)
Finally we can use the ggplot package to create a graph.
ggplot() + geom_col(data = df2, aes(x = Species, y = n))
If you're not interested in having a separate dataframe with the data in it and want to jump straight to the graph, you can skip the group_by() and summarise() step and just use geom_bar().
ggplot() + geom_bar(data = df, aes(Species))
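If you literally want the number of distinct values each categorical column takes (rather than the row counts within one column), here is a minimal sketch, assuming your categorical columns are factors:

df %>%
  summarise(across(where(is.factor), n_distinct)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_unique") %>%
  ggplot(aes(variable, n_unique)) +
  geom_col()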