Multinomial regression (nnet) does not work using parsnip and broom

I'm trying to fit a multinomial regression (nnet) using tidymodels, but I get the following error:
Error: object of type 'closure' is not subsettable
library(parsnip)  # provides multinom_reg(), set_engine(), fit()

data(iris)
ml <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification") %>%
  translate()
ml_fit <- ml %>%
  fit(Species ~ Sepal.Width, data = iris)
broom::tidy(ml_fit, exponentiate = F)
But when I run the equivalent code with nnet directly, it works perfectly:
formula <- Species ~ Sepal.Width
model <- nnet::multinom(formula, data = iris)
broom::tidy(model, exponentiate = F)
Any idea whether I'm writing the tidymodels code properly, or is it something else?

In tidymodels, the original data and formula are not contained in the resulting model call in the usual way. Some methods for multinom() objects want those (plus the actual data in the same place) to do their computations.
We just changed how we handle the formula; that now comes through as it would have if you had called multinom() directly. We can't really do the same with the data, but we did add a new function called repair_call() that you can use to make the call look the way you want.
# devtools::install_dev("parsnip")
library(parsnip)
library(broom)

multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_fit <- multi_spec %>%
  fit(Species ~ Sepal.Width, data = iris)

tidy(multi_fit)
#> Error in as.data.frame.default(data, optional = TRUE): cannot coerce class '"function"' to a data.frame

multi_fit_new <- repair_call(multi_fit, iris)
tidy(multi_fit_new)
#> # A tibble: 4 x 6
#>   y.level    term        estimate std.error statistic  p.value
#>   <chr>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 versicolor (Intercept)  1.55e+8     3.06       6.15 7.54e-10
#> 2 versicolor Sepal.Width  2.20e-3     0.991     -6.17 6.70e-10
#> 3 virginica  (Intercept)  4.41e+5     2.69       4.83 1.33e- 6
#> 4 virginica  Sepal.Width  1.69e-2     0.844     -4.84 1.33e- 6
Created on 2020-05-22 by the reprex package (v0.3.0)
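Note that the estimates shown above are on the exponentiated (odds-ratio) scale. As in the original question, you can pass exponentiate = FALSE to get the raw log-odds coefficients; a minimal sketch (the default for this argument has varied across broom versions):
tidy(multi_fit_new, exponentiate = FALSE)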

Related

How to use map from the (purrr) package in an efficient and easy way?

I'm trying to use map from the purrr package in a more efficient way than I'm doing right now. I have 3 different datasets, let's say iris_1, iris_2, and iris_3.
I want to run the same linear regression on all 3 datasets. My final goal is to get all the coefficients from each of these 3 regressions using map.
My code looks like this:
library(purrr)
library(dplyr)
library(tidyr)

# Load data
iris <- iris

#-----------------------------------------------------------------#
# Basic modifications
#-----------------------------------------------------------------#
iris_1 <- iris %>% dplyr::filter(Species == "versicolor")
iris_2 <- iris %>% dplyr::filter(Species == "virginica")
iris_3 <- iris %>% dplyr::filter(Species == "setosa")

Databases <- list(iris_1, iris_2, iris_3)

#### Step A: fit the same model to each dataset
Linear_Models <- map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = .x))
M_1 <- Linear_Models[[1]]
M_2 <- Linear_Models[[2]]
M_3 <- Linear_Models[[3]]

#### Step B: extract the coefficient tables
Linear_Models_Coeff <- list(M_1, M_2, M_3)
Coeff <- map(Linear_Models_Coeff, ~ coef(summary(.x)))
C_M_1 <- Coeff[[1]]
C_M_2 <- Coeff[[2]]
C_M_3 <- Coeff[[3]]
I tried to combine steps A and B into a single, more efficient call, as follows. However, when I try to get the coefficients, I don't get the results I got with the separate steps (i.e. C_M_1 <- Coeff[[1]]).
Linear_Models <- map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length , data = .x),~ coef(summary(.x)))
C_M_1 <- Linear_Models[[1]]
Many thanks in advance!! I know that there are multiple ways of doing this with packages other than purrr, but I would really appreciate an answer that uses the purrr package.
You could do this in one go by piping all the functions inside map(). (Note that map() accepts only one function; in your attempt, the second formula is passed to the first one as an extra argument and silently ignored, so you get back the bare models.) E.g.:
purrr::map(Databases, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length,
                           data = .x) %>% summary() %>% coef()) %>%
  set_names(c("M1", "M2", "M3"))
Result:
$M1
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept)  2.1164314  0.4942556 4.282059 9.063960e-05
Sepal.Width  0.2476422  0.1868389 1.325431 1.914351e-01
Petal.Length 0.7355868  0.1247678 5.895648 3.870715e-07

$M2
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  0.6247824 0.52486745  1.190362 2.398819e-01
Sepal.Width  0.2599540 0.15333757  1.695305 9.663372e-02
Petal.Length 0.9348189 0.08960197 10.433017 8.009442e-14

$M3
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept)  2.3037382 0.38529423 5.979166 2.894273e-07
Sepal.Width  0.6674162 0.09035581 7.386533 2.125173e-09
Petal.Length 0.2834193 0.19722377 1.437044 1.573296e-01
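Alternatively, if you prefer to keep the two steps visually separate, you can chain two map() calls instead of piping inside the anonymous function; a minimal equivalent sketch:
Databases %>%
  map(~ lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = .x)) %>%
  map(~ coef(summary(.x))) %>%
  set_names(c("M1", "M2", "M3"))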

extract individual panels of facet plot to re-arrange

I would like to rearrange a facet plot with 3 panels to have them fit better in a poster presentation. Currently, I have A over B over C (one column), and it is important to keep B over C.
What I would like is to have a square (2x2) presentation, with A over nothing, and B over C.
Can I either extract the individual panels of the plot, or create a facet with no axes or other graphics (like cowplot::plot_grid() with a NULL panel)?
A second option would be the ggh4x package, whose facet_manual() adds some flexibility for placing the panels (in the design string, # marks an empty cell):
library(ggplot2)
library(ggh4x)

design <- "
AB
#C"

ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  facet_manual(~cyl, design = design)
One approach could be creating separate plots using nest() and map() from the {tidyverse} and then using the {patchwork} package to align them as we want.
(Since the OP didn't provide any data or code, I am using the built-in mtcars dataset to show how to do this.) Suppose this is the case, where we have a facetted plot with 3 panels in a 3 x 1 format:
library(tidyverse)

# 3 x 1 faceted plot
mtcars %>%
  ggplot(aes(mpg, disp)) +
  geom_point() +
  facet_wrap(~cyl, nrow = 3)
Now, to match the question, let's suppose the panel for cyl 4 is plot A, the panel for cyl 6 is plot B, and the panel for cyl 8 is plot C.
To do this, we first create a nested dataset with respect to the facet variable using group_by(facet_var) %>% nest(), and then map ggplot() over the nested data to get a plot (gg object) for each nested dataset.
library(tidyverse)
library(patchwork)

# Say, plotA is cyl 4
#      plotB is cyl 6
#      plotC is cyl 8

# 2 x 2 facet plot
plot_data <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    plots = map2(
      .x = data,
      .y = cyl,
      .f = ~ ggplot(data = .x, mapping = aes(mpg, disp)) +
        geom_point() +
        ggtitle(paste0("cyl is ", .y))
    )
  )

plot_data
#> # A tibble: 3 × 3
#> # Groups:   cyl [3]
#>     cyl data               plots
#>   <dbl> <list>             <list>
#> 1     6 <tibble [7 × 10]>  <gg>
#> 2     4 <tibble [11 × 10]> <gg>
#> 3     8 <tibble [14 × 10]> <gg>
Then simply align the plots using {patchwork} syntax as we wanted. I have used plot_spacer() to create blank space.
plots <- plot_data$plots

plots[[2]] + plots[[1]] + plot_spacer() + plots[[3]] +
  plot_annotation(
    title = "A 2 X 2 faceted plot"
  )
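If you prefer an explicit layout over positional ordering, patchwork also accepts a design string via wrap_plots(); a sketch reusing the plots list from above (here # again marks an empty cell, and named plots are matched to the design letters):
design <- "
AB
#C"
wrap_plots(A = plots[[2]], B = plots[[1]], C = plots[[3]], design = design)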

Altering rownames in tibble

I have 2 different tibbles and have to find out how many of the rows from the first tibble are also present in the second tibble. Both tibbles have a first column named GeneID, but the problem is that in one tibble the genes are named 1, 2, 3, 4, etc., and in the second tibble they are named Gene1, Gene2, Gene3, Gene4... Is there any way to either add 'Gene' before the number in the first tibble or remove 'Gene' in the second?
It is always good to include a sample of your data so that responders can answer correctly. For example, if the field order is identical between the 2 datasets, e.g. df1 and df2, you can make the names the same with a simple:
names(df1) <- names(df2)
Is this what you'd like to do?
library(tidyverse)

df1 <- tribble(
  ~gene,
  1,
  2,
  5,
  6
)

df2 <- tribble(
  ~gene,
  "Gene1",
  "Gene2",
  "Gene3",
  "Gene4",
  "Gene5"
)

# df1 rows also in df2
df1 |>
  mutate(gene = str_c("Gene", gene)) |>
  inner_join(df2, by = "gene")
#> # A tibble: 3 × 1
#>   gene
#>   <chr>
#> 1 Gene1
#> 2 Gene2
#> 3 Gene5
Created on 2022-06-16 by the reprex package (v2.0.1)
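The opposite direction also works: strip the "Gene" prefix from df2 instead; a minimal sketch (note the type conversion, since gene is numeric in df1):
df2 |>
  mutate(gene = str_remove(gene, "^Gene")) |>
  inner_join(df1 |> mutate(gene = as.character(gene)), by = "gene")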

dplyr vs dbplyr filtering with white space

This is partly related to my previous question. If I use dplyr to filter a dataframe whose unique ids have trailing white space against ids with no trailing white space, dplyr treats the white space as characters, so no match occurs and the result is an empty dataframe:
library(tidyverse)

df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
#   a         d
#   <chr> <dbl>
# 1 hjhjh     1

ids <- df %>%
  select(a) %>%
  pull()
ids
# [1] "hjhjh"

df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
# note the quotation marks:
# # A tibble: 2 x 2
#   a          d
#   <chr>  <dbl>
# 1 "hjhjh "   1
# 2 "popopo"   2

# now filter
df_new <- df_with_space %>%
  filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# # A tibble: 0 x 2
# # ... with 2 variables: a <chr>, d <dbl>
If I try to do the same thing and filter using dbplyr from a SQL database, it ignores the white space in the filtering but still includes it in the final output, example code:
library(dbplyr)
library(DBI)
library(odbc)

test_db <- dbConnect(odbc::odbc(),
                     Database = "test",
                     dsn = "SQL_server")
db_df <- tbl(test_db, "testing")

db_df <- db_df %>%
  filter(a %in% ids) %>%
  collect()
# quotation marks:
# # A tibble: 1 x 2
#   a         d
#   <chr> <dbl>
# 1 "hjhjh "  1    # matches, but includes the white space
I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim the whitespace first, which is very slow on a large database:
db_df <- db_df %>%
  mutate(a = str_trim(a, "both")) %>%
  filter(a %in% ids) %>%
  collect()
thanks
EDIT
With show_query():
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx <- tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
  filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))
If you're connecting to SQL Server, then I can reproduce this. I'll label it a "bug", personally, and will never rely on it ...
No need to use dbplyr here; the issue is in the underlying DBMS. dbplyr is just the messenger, so don't blame the messenger :-)
Setup
consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
  DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}
Tests
lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
#
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
#
# $maria
#        a d
# 1 hjhjh  1
#
# $sqlserver
#        a d
# 1 hjhjh  1

lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
#
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
#
# $maria
#        a d
# 1 popopo 2
#
# $sqlserver
#        a d
# 1 popopo 2
SQL Server and MariaDB "fail" in both test cases; neither SQLite nor Postgres falls for it.
I don't see this in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.
Workaround
Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)
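For example, accepting the loose server-side match and then re-filtering exactly after collecting could look like this; a sketch reusing db_df and ids from the question, assuming the padded superset is small enough to collect:
db_df %>%
  filter(a %in% ids) %>%  # loose match on SQL Server (trailing spaces ignored)
  collect() %>%
  filter(a %in% ids)      # exact match in R drops the padded rows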

Getting longitude and latitude from RDS gadm file

I would like to ask:
is there a way to get longitude and latitude variables (in a data.frame) from the RDS geographic files downloaded, for example, from https://gadm.org/download_country_v3.html?
I know we can easily plot from this dataset simply using:
df2 <- readRDS("C:/Users/petr7/Downloads/gadm36_DEU_1_sp.rds")

library(leaflet)
library(ggplot2)

# Using leaflet
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = df2, weight = 0.5, fill = F)

# Using ggplot
ggplot() +
  geom_polygon(data = df2, aes(x = long, y = lat, group = group), color = "black", fill = F)
However, even when typing df2$, no longitude or latitude fields show up.
I would do something like this:
# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

my_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_ITA_0_sf.rds"
data <- readRDS(url(my_url))
italy_coordinate <- st_coordinates(data)
head(italy_coordinate)
#>             X        Y L1 L2 L3
#> [1,] 12.61486 35.49292  1  1  1
#> [2,] 12.61430 35.49292  1  1  1
#> [3,] 12.61430 35.49347  1  1  1
#> [4,] 12.61375 35.49347  1  1  1
#> [5,] 12.61375 35.49403  1  1  1
#> [6,] 12.61347 35.49409  1  1  1
Created on 2019-12-27 by the reprex package (v0.3.0)
Now you just need to change the URL according to your problem. Read the help page of the st_coordinates() function (i.e. ?sf::st_coordinates) for an explanation of the meaning of the L1, L2, and L3 columns.
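Since st_coordinates() returns a matrix, one small extra step gets you the long/lat data.frame the question asked for; a minimal sketch, renaming the X/Y columns for clarity:
italy_df <- as.data.frame(italy_coordinate)
names(italy_df)[1:2] <- c("long", "lat")
head(italy_df)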