broom::tidy.lm() -- how to set number of digits? - tidyverse

I'm trying to use broom::tidy() to extract the ANOVA summary table for linear models so that I can display it more cleanly, particularly for multivariate linear models.
I can't find a way to control the number of decimal digits that appear in the result for the sums of squares and statistics.
Here is an example, with a simple lm(): Note that I can get more digits in the output with print(aov1, digits=) [Sorry: reprex isn't working on my machine.]
> data("mtcars")
>
> mtcars$cyl <- factor(mtcars$cyl) # make factors
> mtcars$am <- factor(mtcars$am)
>
> aov1 <- anova(lm(mpg ~ cyl + am, data=mtcars))
> aov1
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
cyl 2 824.78 412.39 43.6566 2.477e-09 ***
am 1 36.77 36.77 3.8922 0.05846 .
Residuals 28 264.50 9.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> print(aov1, digits=10) # print with more digits
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
cyl 2 824.7845901 412.3922950 43.65661 2.4769e-09 ***
am 1 36.7669195 36.7669195 3.89221 0.058457 .
Residuals 28 264.4956779 9.4462742
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But tidy() doesn't appear to have any way to control the number of digits that I can find...
> tidy(aov1)
# A tibble: 3 x 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 cyl 2 825. 412. 43.7 2.48e-9
2 am 1 36.8 36.8 3.89 5.85e-2
3 Residuals 28 264. 9.45 NA NA
>
> tidy(aov1, digits=7)
# A tibble: 3 x 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 cyl 2 825. 412. 43.7 2.48e-9
2 am 1 36.8 36.8 3.89 5.85e-2
3 Residuals 28 264. 9.45 NA NA
>
> print(tidy(aov1), digits=7)
# A tibble: 3 x 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 cyl 2 825. 412. 43.7 2.48e-9
2 am 1 36.8 36.8 3.89 5.85e-2
3 Residuals 28 264. 9.45 NA NA
My goal is actually more general: to extract the univariate tests for each response in a multivariate linear model.
> data(NeuroCog, package="heplots")
> NC.mlm <- lm(cbind( Speed, Attention, Memory, Verbal, Visual, ProbSolv) ~ Dx,
+ data=NeuroCog)
> car::Anova(NC.mlm)
Type II MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df Pr(>F)
Dx 2 0.2992 6.8902 12 470 1.562e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I can do this by reshaping the data to long format and group_by(response) followed by some tidy processing, but I want more digits in the values for the SS and F statistics.
> #' Reshape from wide to long
> NC_long <- NeuroCog |>
+ select(-SocialCog, -Age, -Sex) |>
+ tidyr::gather(key = response, value = "value", Speed:ProbSolv)
>
> NC_long |>
+ mutate(response = factor(response, levels=unique(response))) |> # keep variable order
+ group_by(response) |>
+ do(tidy(anova(lm(value ~ Dx, .)))) |>
+ filter(term != "Residuals") |>
+ select(-term) |>
+ rename(F = statistic, df1 = df,
+ SS = sumsq, MS =meansq) |>
+ mutate(df2 = 239) |> # kludge: extract dfe from object?
+ relocate(df2, .after = df1) |>
+ mutate(signif = noquote(gtools::stars.pval(p.value))) |>
+ mutate(p.value = noquote(scales::pvalue(p.value)))
# A tibble: 6 x 8
# Groups: response [6]
response df1 df2 SS MS F p.value signif
<fct> <int> <dbl> <dbl> <dbl> <dbl> <noquote> <noquote>
1 Speed 2 239 8360. 4180. 37.1 <0.001 ***
2 Attention 2 239 5579. 2790. 17.4 <0.001 ***
3 Memory 2 239 3764. 1882. 13.9 <0.001 ***
4 Verbal 2 239 4672. 2336. 27.3 <0.001 ***
5 Visual 2 239 3692. 1846. 16.6 <0.001 ***
6 ProbSolv 2 239 4165. 2083. 25.1 <0.001 ***
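A note on the digits question: as far as I know, tidy() returns ordinary double columns at full precision, and it is the tibble print method (from the pillar package) that shows about three significant digits by default. The sketch below assumes broom and a reasonably recent tibble/pillar are installed; `mtcars2` and `res` are names made up for this illustration.

```r
library(broom)

# Same model as in the question, built on a copy of mtcars
mtcars2 <- transform(mtcars, cyl = factor(cyl), am = factor(am))
aov1 <- anova(lm(mpg ~ cyl + am, data = mtcars2))
res <- tidy(aov1)

# The stored values are full-precision doubles; only the display is rounded
format(res$sumsq, digits = 10)

# Option 1: raise the number of significant digits pillar uses when printing tibbles
options(pillar.sigfig = 7)
print(res)

# Option 2: print as a plain data frame, where digits= is honoured
print(as.data.frame(res), digits = 7)
```

Option 1 changes the display for every tibble in the session, so it may be worth resetting it afterwards. For the df2 kludge, df.residual() on the fitted lm object returns the error degrees of freedom.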

Related

How to create multiple dataframes in R from different sql files

I have 49 .db files.
I want to open them in R and then store its content in a dataframe for further use.
I am able to do it for one file but I want to modify the code to be able to do it for all the 49 .db file in one go.
This is the code that I am trying to do for one file:
sqlite <- dbDriver("SQLite")
dbname <- "en_Whole_Blood.db"
db = dbConnect(sqlite,dbname)
wholeblood_df <- dbGetQuery(db,"SELECT * FROM weights")
View(wholeblood_df)
I tried to use the list.files function to do it for all the .db files, but it's not working. It only creates a dataframe for the last object.
This is the code for it:
library("RSQLite")
sqlite <- dbDriver("SQLite")
sqlite <- dbDriver("SQLite")
dbname <- data_files
dbname
for (i in length(dbname){
db=dbConnect(sqlite,dbname[i])
df <- dbGetQuery(db,"SELECT * FROM weights")
}
##This only gives me last .db file as a dataframe.
Does anyone know how I can edit this code to get 49 dataframes, one for each .db file?
Thank you.
Try replacing the for loop with lapply:
list_of_df <- lapply(dbname, function(x) {
db <- dbConnect(sqlite, x)
df <- dbGetQuery(db, "SELECT * FROM weights")
})
I'm not experienced in handling SQL and/or connections, but I think it might work.
Edit
Second alternative maintaining the for loop:
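df <- list()
for (i in seq_along(dbname)) {
db <- dbConnect(sqlite, dbname[i])
# store each result as its own list element; c(df, <data frame>) would splice the columns into the list
df[[i]] <- dbGetQuery(db, "SELECT * FROM weights")
dbDisconnect(db)
}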
Hope it helps
Another suggestion:
files <- list.files(pattern = "\\.db$")
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from weights")
})
oneframe <- do.call(rbind, list_of_frames)
Reproducible example
Create data (you don't need this):
for (i in 1:3) {
db <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars%i.db", i))
DBI::dbWriteTable(db, "weights", mtcars[i * 5 + 1:3,], append = FALSE, create = TRUE)
DBI::dbDisconnect(db)
}
Working solution:
files <- list.files(pattern = "\\.db$")
files
# [1] "mtcars1.db" "mtcars2.db" "mtcars3.db"
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from weights")
})
list_of_frames
# [[1]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# [[2]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
# [[3]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 2 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 3 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
oneframe <- do.call(rbind, list_of_frames)
oneframe
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 5 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 6 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 9 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Tidyverse alternative:
library(dplyr) # just for %>%, could use magrittr as well
library(purrr) # map_dfr
oneframe <- files %>%
map_dfr(~ {
db <- DBI::dbConnect(RSQLite::SQLite(), .)
on.exit(DBI::dbDisconnect(db))
DBI::dbGetQuery(db, "select * from weights")
})
### same result
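If the 49 data frames need to stay separate rather than be bound into one, a named list is a convenient variant of the lapply approach above. This is only a sketch: `read_weights`, the `db_demo` directory and the `tissue*.db` files are hypothetical stand-ins created in a temporary directory, and it assumes every file has a table called weights.

```r
library(DBI)

# Hypothetical helper: read one table from one SQLite file,
# closing the connection even if the query fails
read_weights <- function(path, table = "weights") {
  con <- dbConnect(RSQLite::SQLite(), path)
  on.exit(dbDisconnect(con), add = TRUE)
  dbReadTable(con, table)
}

# Demo data in a temporary directory (stand-in for the real .db files)
demo_dir <- file.path(tempdir(), "db_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  con <- dbConnect(RSQLite::SQLite(), file.path(demo_dir, sprintf("tissue%i.db", i)))
  dbWriteTable(con, "weights", head(mtcars, 3), overwrite = TRUE)
  dbDisconnect(con)
}

# Read every file and name each list element after its file
files <- list.files(demo_dir, pattern = "\\.db$", full.names = TRUE)
frames <- lapply(files, read_weights)
names(frames) <- tools::file_path_sans_ext(basename(files))

names(frames)
frames[["tissue1"]]
```

Each data frame can then be pulled out by name, for example frames[["en_Whole_Blood"]] for the file from the question.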

Statistical Tests in R

I want to run t-tests with Bonferroni-adjusted p-values on a stacked data set.
This is my code:
stat.2 <- stack.2 %>%
group_by(modules) %>%
t_test(values ~ phenotype) %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance("p.adj")
The error which I'm facing is the following:
Error in mutate():
! Problem while computing data = map(.data$data, .f, ...).
Caused by error in t.test.default():
! not enough 'y' observations
Run rlang::last_error() to see where the error occurred.
Here's the data which I'm working on:
First I created reproducible data:
df <- data.frame(phenotype = c("Mesenchymal", "Classical", "Classical", "Mesenchymal", "Proneural", "Mesenchymal", "Proneural", "Messenchymal", "Messenchymal", "Classical", "Mesenchymal"),
values = runif(11, 0, 1),
modules = rep("MEmaroon", 11))
You can use this code:
library(dplyr)
library(rstatix)
df %>%
group_by(modules) %>%
t_test(values ~ phenotype) %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance("p.adj")
Output:
# A tibble: 6 × 11
modules .y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
<chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 MEmaroon values Classical Mesench… 3 4 0.668 4.25 0.538 1 ns
2 MEmaroon values Classical Messenc… 3 2 0.361 1.48 0.763 1 ns
3 MEmaroon values Classical Proneur… 3 2 -0.0161 2.90 0.988 1 ns
4 MEmaroon values Mesenchymal Messenc… 4 2 -0.0136 1.32 0.991 1 ns
5 MEmaroon values Mesenchymal Proneur… 4 2 -0.749 2.84 0.511 1 ns
6 MEmaroon values Messenchymal Proneur… 2 2 -0.380 1.33 0.756 1 ns
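On the original error: t.test() stops with "not enough 'y' observations" when one of the two groups being compared has fewer than two values, so a likely cause is a module in which some phenotype occurs only once. The sketch below is a guess at a diagnosis, not a confirmed fix; the small `df` and the names `counts` and `df_ok` are made up for illustration.

```r
library(dplyr)

set.seed(1)
# Toy data: "Proneural" occurs only once, which a two-sample t-test cannot handle
df <- data.frame(
  modules   = "MEmaroon",
  phenotype = c("Classical", "Classical", "Mesenchymal", "Mesenchymal", "Proneural"),
  values    = runif(5)
)

# How many observations does each module/phenotype combination have?
counts <- df %>% count(modules, phenotype)
counts

# Keep only phenotypes with at least two observations within each module
df_ok <- df %>%
  group_by(modules, phenotype) %>%
  filter(n() >= 2) %>%
  ungroup()
df_ok
```

The filtered data can then go through the same group_by(modules) %>% t_test(values ~ phenotype) pipeline as above.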

How to convert large .csv file with "too many columns" into SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let's call first column ID1 and then each additional column is a value for each of these ID1s in different conditions. Let's call these ID2s.
This is the first time I have worked with such large files. I wanted to process the .csv file in R and summarize the values (mean, standard deviation and coefficient of variation) for each ID1.
My idea was to read the file directly (with data.table's fread), convert it into "long" data (with tidyr) so I have three columns: ID1, ID2 and value, and then group by ID1 and summarize. However, I do not seem to have enough memory to read the file (I assume R uses more memory than the file's size to store it).
I think it would be more efficient to first convert the file into a SQL database and then process it from there. I have tried to convert it using sqlite3 but it gives me an error message stating that the maximum number of columns to read are 4096.
I have no experience with SQL, so I was wondering what would be the best way of converting the .csv file into a database. I guess reading each column and storing them as a table or something like that would work.
I have searched for similar questions but most of them just say that having so many columns is a bad db design. I cannot generate the .csv file with a proper structure.
Any suggestions for an efficient way of processing the .csv file?
Best,
Edit: I was able to read the initial file in R, but I still find some problems:
1- I cannot write into a sqlite db because of the "too many columns" limit.
2- I cannot pivot it inside R because I get the error:
Error: cannot allocate vector of size 7.8 Gb
Even though my memory limit is high enough. I have 8.5 Gb of free memory and:
> memory.limit()
[1] 16222
I have used @danlooo's code, but the data is not in the format I would like it to be. I probably did not explain its structure clearly enough.
Here is an example of how I would like the data to look like (ID1 = Sample, ID2 = name, value = value)
> test = input[1:5,1:5]
>
> test
Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322 35.19884
2: AT1G01020 12.166693 18.054251 12.075236 37.14983
3: AT1G01030 9.396695 9.704697 8.211935 4.36051
4: AT1G01040 25.278412 24.429031 22.484845 17.51553
5: AT1G01050 64.082870 66.022141 62.268711 58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
Sample name value
<chr> <chr> <dbl>
1 AT1G01010 DRX007662 12.1
2 AT1G01010 DRX007663 16.3
3 AT1G01010 DRX007664 14.5
4 AT1G01010 DRX014481 35.2
5 AT1G01020 DRX007662 12.2
6 AT1G01020 DRX007663 18.1
7 AT1G01020 DRX007664 12.1
8 AT1G01020 DRX014481 37.1
9 AT1G01030 DRX007662 9.40
10 AT1G01030 DRX007663 9.70
11 AT1G01030 DRX007664 8.21
12 AT1G01030 DRX014481 4.36
13 AT1G01040 DRX007662 25.3
14 AT1G01040 DRX007663 24.4
15 AT1G01040 DRX007664 22.5
16 AT1G01040 DRX014481 17.5
17 AT1G01050 DRX007662 64.1
18 AT1G01050 DRX007663 66.0
19 AT1G01050 DRX007664 62.3
20 AT1G01050 DRX014481 58.1
> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
Sample `mean(value)`
<chr> <dbl>
1 AT1G01010 19.5
2 AT1G01020 19.9
3 AT1G01030 7.92
4 AT1G01040 22.4
5 AT1G01050 62.6
How should I change the code to make it look that way?
Thanks a lot!
Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot fit in RAM. This code reads the text file in chunks, pivots each chunk into long format and appends it to the SQL database. You can then access the database with dplyr verbs for summarizing. It uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID2) to have two name columns.
library(tidyverse)
library(DBI)
library(vroom)
conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))
file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5
# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
# everything must be character to allow pivoting numeric and text columns
vroom(file, skip = chunk_offset, n_max = chunk_size,
col_names = FALSE, col_types = cols(.default = col_character())
) %>%
pivot_longer(everything()) %>%
dbAppendTable(conn, "data", value = .)
}
data <- conn %>% tbl("data")
data
#> # Source: table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#> name value
#> <chr> <chr>
#> 1 X1 Mazda RX4
#> 2 X2 21
#> 3 X3 6
#> 4 X4 160
#> 5 X5 110
#> 6 X6 3.9
#> 7 X7 2.62
#> 8 X8 16.46
#> 9 X9 0
#> 10 X10 1
#> # … with more rows
data %>%
# summarise only the 3rd column
filter(name == "X3") %>%
group_by(value) %>%
count() %>%
arrange(-n) %>%
collect()
#> # A tibble: 3 × 2
#> value n
#> <chr> <int>
#> 1 8 14
#> 2 4 11
#> 3 6 7
Created on 2022-04-15 by the reprex package (v2.0.1)
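Since the summaries asked for (mean, standard deviation, coefficient of variation) are computed per ID1, that is per row of the wide file, another option is to skip the pivot and the database and summarise each chunk of rows directly. The sketch below assumes that all columns after the first are numeric and that the number of rows is known; the temporary file, `demo`, and the `DRX*` column names are stand-ins invented for the example.

```r
library(data.table)

# Small stand-in for the wide file: 25 rows, 1 ID column, 6 value columns
tmp <- tempfile(fileext = ".csv")
set.seed(42)
demo <- data.frame(
  Sample = sprintf("AT1G%05d", 1:25),
  matrix(runif(25 * 6, 1, 50), nrow = 25,
         dimnames = list(NULL, paste0("DRX", 1:6)))
)
fwrite(demo, tmp)

header <- names(fread(tmp, nrows = 0))  # column names only
chunk_size <- 10
n_rows <- 25                            # assumed known in advance
out <- list()

for (offset in seq(0, n_rows - 1, by = chunk_size)) {
  # skip the header line plus the rows already processed
  chunk <- fread(tmp, skip = offset + 1, nrows = chunk_size,
                 header = FALSE, col.names = header)
  m  <- as.matrix(chunk[, -1, with = FALSE])
  mu <- rowMeans(m)
  s  <- apply(m, 1, sd)
  out[[length(out) + 1]] <- data.table(Sample = chunk[[1]],
                                       mean = mu, sd = s, cv = s / mu)
}

res <- rbindlist(out)
head(res)
```

Only one chunk of the wide table is in memory at a time, and the result has one row per ID1.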

R summarise_at dynamically by condition : mean for some columns, sum for others

I would like the result shown below, but with the conditions inside summarise_at() itself.
Edit #1: I've added the word "dynamically" to the title. I use vars(c()) in summarise_at() only to keep the examples short and clear; in practice I need contains(), starts_with() and matches(,, perl=TRUE), because I have 50 columns, many needing sum() and some needing mean().
The goal is to generate dynamic SQL with tbl()..%>% group_by() ... %>% summarise_at()...%>% collect().
Edit #2: I added the generated SQL to my second example.
library(tidyverse)
(mtcars
%>% group_by(carb)
%>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
# I don't want this line below, I would like a conditional in summarise_at() because I have 50 columns in my real case
%>% select(carb,cyl_mean,disp_mean,mpg_sum)
)
#> # A tibble: 6 x 4
#> carb cyl_mean disp_mean mpg_sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4.57 134. 177.
#> 2 2 5.6 208. 224
#> 3 3 8 276. 48.9
#> 4 4 7.2 309. 158.
#> 5 6 6 145 19.7
#> 6 8 8 301 15
Created on 2020-02-19 by the reprex package (v0.3.0)
This works, but I want only sum for mpg, and only mean for cyl and disp:
library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)
db <- dbConnect(SQLite(),":memory:")
dbCreateTable(db, "mtcars_table", mtcars)
(tbl( db, build_sql( con=db,"select * from mtcars_table" ))
%>% group_by(carb)
%>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
%>% select(carb,cyl_mean,disp_mean,mpg_sum)
%>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...] to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> # mpg_sum <lgl>
I tried all possibilities like that but it doesn't work or it produces error.
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )
Not good, too many columns
library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#> carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 177. 32 940. 25.3 4.57 134.
#> 2 2 224 56 2082. 22.4 5.6 208.
#> 3 3 48.9 24 827. 16.3 8 276.
#> 4 4 158. 72 3088. 15.8 7.2 309.
#> 5 6 19.7 6 145 19.7 6 145
#> 6 8 15 8 301 15 8 301
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> # disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> # disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> # disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> # disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> # disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> # disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> # disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> # disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> # disp_mean..11 <dbl>
Some other attempts and remarks: I would like a conditional sum(.) or mean(.) depending on the name of the column in summarise().
It would be good if it accepted more than just primitive functions.
In the end, it's for tbl()..%>% group_by() ... %>% summarise_at()...%>% collect() to generate conditional SQL with AVG() and SUM().
A T-SQL function such as ~(convert(varchar()) works in mutate_at(), and similarly ~AVG() works in summarise_at(), but I end up at the same point: a summarise_at() that is conditional on the column names doesn't work.
:)
An option is to group_by the 'carb', and then create the sum of 'mpg' as another grouping variable and then use summarise_at with the rest of the variables needed
library(dplyr)
mtcars %>%
group_by(carb) %>%
group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
summarise_at(vars(cyl, disp), list(mean = mean))
# A tibble: 6 x 4
# Groups: carb [6]
# carb mpg_sum cyl_mean disp_mean
# <dbl> <dbl> <dbl> <dbl>
#1 1 177. 4.57 134.
#2 2 224 5.6 208.
#3 3 48.9 8 276.
#4 4 158. 7.2 309.
#5 6 19.7 6 145
#6 8 15 8 301
Or, using the devel version of dplyr, this can be done in a single summarise by wrapping the block of columns in across, keeping the single column separate, and applying a different function to each
mtcars %>%
group_by(carb) %>%
summarise(across(one_of(c("cyl", "disp")), list(mean = mean)),
mpg_sum = sum(mpg))
# A tibble: 6 x 4
# carb cyl_mean disp_mean mpg_sum
# <dbl> <dbl> <dbl> <dbl>
#1 1 4.57 134. 177.
#2 2 5.6 208. 224
#3 3 8 276. 48.9
#4 4 7.2 309. 158.
#5 6 6 145 19.7
#6 8 8 301 15
NOTE: summarise_at/summarise_if/mutate_at/mutate_if/... etc. will be superseded by the across verb with the default functions (summarise/mutate/filter/...) in the upcoming releases
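For the "dynamic" selection the question asks about, each across() call can take its own selection helper and its own function. This is a sketch assuming a released dplyr (1.0.0 or later) on a local data frame; the patterns and the `.names` templates are only illustrative, and I have not checked it against a database backend.

```r
library(dplyr)

res <- mtcars %>%
  group_by(carb) %>%
  summarise(
    # columns chosen by prefix get sum()
    across(starts_with("mp"), sum, .names = "{.col}_sum"),
    # columns chosen by regex get mean()
    across(matches("^(cyl|disp)$"), mean, .names = "{.col}_mean")
  )
res
```

With 50 columns, the two selections would be whatever contains(), starts_with() or matches() calls separate the "sum" columns from the "mean" columns.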
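Workaround while waiting for across(): rename the columns with a _sum or _mean suffix, summarise everything with sum(), then use a regex to turn SUM() into AVG() in the generated SQL for the _mean columns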
library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)
db <- dbConnect(SQLite())
mtcars_table <- mtcars %>% rename(mpg_sum=mpg,cyl_mean=cyl,disp_mean=disp )
RSQLite::dbWriteTable(db, "mtcars_table", mtcars_table)
req<-as.character((tbl( db, build_sql( con=db,"select * from mtcars_table" ))
%>% group_by(carb)
%>% summarise_at(vars(c(ends_with("mean"), ends_with("sum")) ), ~sum(.))
) %>% sql_render())
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
req<-gsub("(SUM)(\\(.{1,30}mean.{1,10}\\))", "AVG\\2", req, perl=TRUE)
print(req)
#> [1] "SELECT `carb`, AVG(`cyl_mean`) AS `cyl_mean`, AVG(`disp_mean`) AS `disp_mean`,
# SUM(`mpg_sum`) AS `mpg_sum`\nFROM (select * from mtcars_table)\n
# GROUP BY `carb`"
dbGetQuery(db, req)
#> carb cyl_mean disp_mean mpg_sum
#> 1 1 4.571429 134.2714 177.4
#> 2 2 5.600000 208.1600 224.0
#> 3 3 8.000000 275.8000 48.9
#> 4 4 7.200000 308.8200 157.9
#> 5 6 6.000000 145.0000 19.7
#> 6 8 8.000000 301.0000 15.0
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DBI_1.1.0 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.4 purrr_0.3.3
[6] readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[11] dbplyr_1.4.2 RSQLite_2.2.0
loaded via a namespace (and not attached):
[1] xfun_0.10 tidyselect_1.0.0 haven_2.2.0 lattice_0.20-38 colorspace_1.4-1
[6] vctrs_0.2.2 generics_0.0.2 htmltools_0.4.0 blob_1.2.1 rlang_0.4.4
[11] pillar_1.4.3 glue_1.3.1 withr_2.1.2 bit64_0.9-7 modelr_0.1.5
[16] readxl_1.3.1 lifecycle_0.1.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
[21] rvest_0.3.5 memoise_1.1.0 evaluate_0.14 knitr_1.25 callr_3.3.2
[26] ps_1.3.0 fansi_0.4.1 broom_0.5.2 Rcpp_1.0.3 clipr_0.7.0
[31] scales_1.1.0 backports_1.1.5 jsonlite_1.6.1 fs_1.3.1 bit_1.1-15.1
[36] hms_0.5.3 digest_0.6.23 stringi_1.4.5 processx_3.4.1 grid_3.6.1
[41] cli_2.0.1 tools_3.6.1 magrittr_1.5 lazyeval_0.2.2 whisker_0.4
[46] crayon_1.3.4 pkgconfig_2.0.3 xml2_1.2.2 reprex_0.3.0 lubridate_1.7.4
[51] assertthat_0.2.1 rmarkdown_1.16 httr_1.4.1 rstudioapi_0.10 R6_2.4.1
[56] nlme_3.1-141 compiler_3.6.1

dbplyr, dplyr, and functions with no SQL equivalents [eg `slice()`]

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
I can create this mock SQL database above. And it's very cool that I can perform standard dplyr functions on this "database":
mtcars2 %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
arrange(desc(mpg))
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#> # Ordered by: desc(mpg)
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
It appears I'm unable to use dplyr functions that have no direct SQL equivalents, (eg dplyr::slice()). In the case of slice() I can use the alternative combination of filter() and row_number() to get the same results as just using slice(). But what happens when there's not such an easy workaround?
mtcars2 %>% slice(1:5)
#>Error in UseMethod("slice_") :
#> no applicable method for 'slice_' applied to an object of class
#> "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"
When dplyr functions have no direct SQL equivalents can I force their use with dbplyr, or is the only option to get creative with dplyr verbs that do have SQL equivalents, or just write the SQL directly (which is not my preferred solution)?
I understand the question as: how can I make slice() work for SQL databases? This is different from "forcing their use" but still might work in your case.
The example below shows how to implement a "poor man's" variant of slice() that works on the database. We still need to do the legwork and implement it with verbs that work on the database, but then we can use it similarly to data frames.
Read more about S3 classes in http://adv-r.had.co.nz/OO-essentials.html#s3.
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
# mtcars2 has a class attribute
class(mtcars2)
#> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
#> [4] "tbl_lazy" "tbl"
# slice() is an S3 method
slice
#> function(.data, ..., .preserve = FALSE) {
#> UseMethod("slice")
#> }
#> <bytecode: 0x560a03460548>
#> <environment: namespace:dplyr>
# we can implement a "poor man's" variant of slice()
# for the particular class. (It doesn't work quite the same
# in all cases.)
#' @export
slice.tbl_sql <- function(.data, ...) {
rows <- c(...)
.data %>%
mutate(...row_id = row_number()) %>%
filter(...row_id %in% !!rows) %>%
select(-...row_id)
}
mtcars2 %>%
slice(1:5)
#> # Source: lazy query [?? x 11]
#> # Database: sqlite 3.29.0 [:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Created on 2019-12-07 by the reprex package (v0.3.0)
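Two simpler routes may also be worth knowing. head() has a SQL translation (LIMIT), and when a verb has no translation at all, collect() pulls the rows into R, where every dplyr verb works. A small sketch, assuming dbplyr and RSQLite are installed; `first5` and `rows_2_4` are names made up here, and row order in a SQL table is not guaranteed unless you arrange() first.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars2 <- copy_to(con, mtcars)

# head() is translated to LIMIT, so this runs in the database
first5 <- mtcars2 %>% head(5) %>% collect()

# No translation available: bring the data into R first, then use any dplyr verb
rows_2_4 <- mtcars2 %>% collect() %>% slice(2:4)

DBI::dbDisconnect(con)
```

The second route only makes sense once the data has been filtered or summarised down to something that fits in memory.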