Pass SQL functions in dplyr's filter() on a database table

I'm using dplyr's automatic SQL backend to query a subtable from a database table, e.g.
my_tbl <- tbl(my_db, "my_table")
where my_table in the database looks like
batch_name  value
batch_A_1   1
batch_A_2   2
batch_A_2   3
batch_B_1   8
batch_B_2   9
...
I just want the data from batch_A_#, regardless of the number.
If I were writing this in SQL, I could use
select * from my_table where batch_name like 'batch_A_%'
If I were writing this in R, I could use a few ways to get this: grepl(), %in%, or str_detect()
# option 1
subtable <- my_tbl %>% select(batch_name, value) %>%
  filter(grepl('batch_A_', batch_name, fixed = TRUE))

# option 2
subtable <- my_tbl %>% select(batch_name, value) %>%
  filter(str_detect(batch_name, 'batch_A_'))
Both of these give the following Postgres error: HINT: No function matches the given name and argument types. You might need to add explicit type casts.
So, how do I pass SQL string functions or matching functions into filter() so that the SQL generated by dplyr can use a more flexible range of functions?
(FYI, %in% does work, but it requires listing out all possible values. That would be okay combined with paste() to build the list, but it does not cover the more general regex case.)

A "dplyr-only" solution would be this
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
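To sanity-check before collecting, show_query() can be used to confirm that %like% comes out as a LIKE clause in the generated SQL (a quick sketch; the exact identifier quoting varies by dbplyr version):
tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  show_query()
# the generated SQL should contain something like: WHERE (batch_name LIKE 'batch_A_%')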
Full reprex:
suppressPackageStartupMessages({
  library(dplyr)
  library(dbplyr)
  library(RPostgreSQL)
})

my_con <-
  dbConnect(
    PostgreSQL(),
    user = "my_user",
    password = "my_password",
    host = "my_host",
    dbname = "my_db"
  )

my_table <- tribble(
  ~batch_name, ~value,
  "batch_A_1", 1,
  "batch_A_2", 2,
  "batch_A_2", 3,
  "batch_B_1", 8,
  "batch_B_2", 9
)

copy_to(my_con, my_table)

tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  collect()
#> # A tibble: 3 x 2
#>   batch_name value
#> * <chr>      <dbl>
#> 1 batch_A_1      1
#> 2 batch_A_2      2
#> 3 batch_A_2      3
dbDisconnect(my_con)
#> [1] TRUE
This works because any functions that dplyr doesn't know how to translate will be passed along as-is; see ?dbplyr::translate_sql.
Hat-tip to @PaulRougieux for his recent comment here.
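A related escape hatch (my addition, not part of the original answer) is to embed a literal SQL fragment with dbplyr's sql(), which should likewise be passed through untranslated:
tbl(my_con, "my_table") %>%
  filter(sql("batch_name LIKE 'batch_A_%'")) %>%  # raw SQL, left alone by the translator
  collect()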

Using dplyr
Get the table my_table from the database as a data frame and use it for further data analysis.
library("dplyr")
my_db <- src_postgres(dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- tbl(my_db, "my_table")
df %>% filter(batch_name == "batch_A_1")
Using DBI and RPostgreSQL
Get the table by sending an SQL query:
library("DBI")
library("RPostgreSQL")
m <- dbDriver("PostgreSQL")
con <- dbConnect(drv = m,
dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- dbGetQuery(con, "SELECT * FROM my_table WHERE batch_name %LIKE% 'batch_A_%'")
library("dplyr")
df %>% filter(batch_name == "batch_A_1")
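If the pattern comes from elsewhere in your R code, a parameterized query avoids pasting strings together. A sketch, assuming the newer RPostgres driver (which supports $1 placeholders); with RPostgreSQL you would have to interpolate the string yourself:
library(DBI)
library(RPostgres)  # note: RPostgres, not RPostgreSQL (an assumption for this sketch)

con <- dbConnect(Postgres(),
                 dbname = "database-name",
                 host = "localhost",
                 port = 5432,
                 user = "username",
                 password = "password")

# $1 is bound to the pattern, so nothing is pasted into the SQL string
df <- dbGetQuery(con,
                 "SELECT * FROM my_table WHERE batch_name LIKE $1",
                 params = list("batch_A_%"))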

Related

dplyr to data.table to speed up execution time

I am currently dealing with a moderately large dataframe called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when applying summarise() combined with group_by_at(), I am trying to write an equivalent statement using data.table to speed up the summarise part. However, the situation is a bit special: the original dataframe is grouped with group_by_at() and then summarised over the same set of columns (e.g. X %>% select(-id) %>% group_by_at(vars(-x, -y, -z, -t)) %>% summarise(x = sum(x), y = sum(y), z = sum(z), t = sum(t)) %>% ungroup()).
With that in mind, below is my current attempt, which keeps failing with this error: keyby or by has length (1,1,1,1). Could someone please help me fix this error?
The dplyr code:
d.mkt <- d.mkt %>%
  left_join(codes, by = c('rte_cd', 'cd')) %>%
  mutate(is_valid = replace_na(is_valid, FALSE),
         rte_cd = ifelse(is_valid, rte_cd, 'RC'),
         rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
  select(-is_valid) %>%
  group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
  summarise(c_rv = sum(as.numeric(c_rv)), g_rv = sum(as.numeric(g_rv)),
            h_rv = sum(as.numeric(h_rv)), rn = sum(as.numeric(rn))) %>%
  ungroup()
My attempt at translating the above:
d.mkt <- as.data.table(d.mkt)
d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
               `:=` (is.valid = replace_na(is_valid, FALSE),
                     rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
                     rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
d.mkt <- d.mkt[, -"is.valid", with = FALSE]
d.mkt <- d.mkt[, .(c_rv = sum(c_rv), g_rv = sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)),
               by = .('prop', 'date')]  # <-- error here already; also, how do we ungroup a `data.table`?
Close. Some suggestions/answers.
If you're shifting to data.table for speed, I suggest using fcoalesce and fifelse in lieu of replace_na and ifelse; minor.
The canonical way to remove is_valid is d.mkt[, is_valid := NULL].
Grouping can be done with a setdiff(). In data.table there is no need to "ungroup"; each [-call uses its own grouping. (For that reason, if you have multiple chained [-operations that use the same grouping, it can be useful to store that grouping as a variable, perhaps index it, and/or combine the whole [-chain into a single call. That invites a lot of benchmarking discussion outside the scope of what we have here.)
Since all of your summary stats are the same, we can lapply(.SD, ..) this for a little readability improvement.
This might work:
library(data.table)
library(magrittr)  # for the %>% pipe used below

setDT(codes)  # or use `as.data.table(codes)` below instead
setDT(d.mkt)  # ditto

tmp <- codes[d.mkt, on = .(rte_cd, cd) ] %>%
  .[, c("is_valid", "rte_cd", "rte_dsc") :=
      .(fcoalesce(is_valid, FALSE),
        fifelse(fcoalesce(is_valid, FALSE), rte_cd, "RC"),
        fifelse(fcoalesce(is_valid, FALSE), rte_dsc, "SKIPPED")) ]
tmp[, is_valid := NULL ]

cols <- c("c_rv", "g_rv", "h_rv", "rn")
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
    by = setdiff(names(tmp), cols), .SDcols = cols ]

R: Renaming Multiple Variables

I am working with the R programming language. I am trying to rename all the variables in a data frame by adding "_a" to the end of each variable. I figured out how to do this with the "dplyr" library (for data frames in the global environment):
library(dplyr)
library(dbplyr)

var1 = rnorm(100, 10, 10)
var2 = rnorm(100, 10, 10)
var3 = rnorm(100, 10, 10)
my_data = data.frame(var1, var2, var3)

df = my_data %>% rename_all(paste0, "_a")
Problem: My actual data frame is on a database which I access using "RODBC SQL" commands, for example:
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
sample_query = sqlQuery(con, "select distinct * from df")
What I tried so far: Using the "dbplyr" library, I "extract" the SQL code performed above:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df <- copy_to(con, my_data)
final = df %>%
  rename_all(paste0, "_a") %>%
  show_query()

# results
<SQL>
SELECT `var1` AS `var1_a`, `var2` AS `var2_a`, `var3` AS `var3_a`
FROM `my_data`
My Question: The actual table I am using ("my_data") contains many columns and is on a database. I would like to rename all the columns (e.g. add "_a" at the end of each column), but I cannot do this until I figure out how to pass the SQL statement through sqlQuery(). This means that I am unable to get the entire list of columns at once and rename them all at once.
I could manually do this, e.g.
sample_query = sqlQuery(con, "SELECT `var1` AS `var1_a`, `var2` AS `var2_a`, `var3` AS `var3_a` etc etc etc `varfinal` AS `varfinal_a`
FROM `my_data` ")
But I am looking for a way to do this automatically.
Is it possible to write something like this ?
#not sure if this is correct
sample_query = sqlQuery(con, "rename_all(paste0, "_a")")
Thanks!
There may be better alternatives, but you could build a query string as follows:
query <- paste(
  "SELECT",
  paste(
    names(my_data),
    "AS",
    paste0(names(my_data), "_a"),
    collapse = ", "
  ),
  "FROM my_data"
)
Then query will give you:
[1] "SELECT var1 AS var1_a, var2 AS var2_a, var3 AS var3_a FROM my_data"
Which can be used in your sqlQuery statement:
sample_query = sqlQuery(con, query)
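If the table only lives in the database, so names(my_data) isn't available locally, the column names can be fetched first. A sketch, assuming a DBI connection like the RSQLite one above (with the RODBC connection, sqlColumns() provides similar information):
library(DBI)

# pull just the column names from the database table
cols <- dbListFields(con, "my_data")

query <- paste(
  "SELECT",
  paste(cols, "AS", paste0(cols, "_a"), collapse = ", "),
  "FROM my_data"
)
The resulting string can then be passed to sqlQuery() exactly as in the answer above.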

dplyr vs dbplyr filtering with white space

This is partly related to my previous question. If I use dplyr to filter a dataframe whose ids have trailing white space against ids with no trailing white space, dplyr treats the white space as part of the string, so no match occurs and the result is an empty dataframe:
library(tidyverse)

df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
#   a         d
#   <chr> <dbl>
# 1 hjhjh     1

ids <- df %>%
  select(a) %>%
  pull()
ids
# [1] "hjhjh"
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
# quotation marks:
# # A tibble: 2 x 2
#   a          d
#   <chr>  <dbl>
# 1 "hjhjh "   1
# 2 "popopo"   2

# now filter
df_new <- df_with_space %>%
  filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# # A tibble: 0 x 2
# # ... with 2 variables: a <chr>, d <dbl>
If I try to do the same thing and filter using dbplyr from a SQL database, the white space is ignored in the filtering but still included in the final output. Example code:
library(dbplyr)
library(DBI)
library(odbc)

test_db <- dbConnect(odbc::odbc(),
                     Database = "test",
                     dsn = "SQL_server")
db_df <- tbl(test_db, "testing")

db_df <- db_df %>%
  filter(a %in% ids) %>%
  collect()
# quotation marks:
# # A tibble: 1 x 2
#   a          d
#   <chr>  <dbl>
# 1 "hjhjh "   1   # matches, but includes the white space
I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim the whitespace first, which is very slow on a large database:
db_df <- db_df %>%
  mutate(a = str_trim(a, "both")) %>%
  filter(a %in% ids) %>%
  collect()
Thanks.
EDIT
With show_query():
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx <- tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
  filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))
If you're connecting to SQL Server, then I can reproduce this. I'll label it as a "bug", personally, and will never rely on it ...
No need to use dbplyr here, the issue is in the underlying DBMS; dbplyr is just the messenger, don't blame the messenger :-)
Setup
consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
  DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}
Tests
lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 hjhjh 1
# $sqlserver
# a d
# 1 hjhjh 1
lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 popopo 2
# $sqlserver
# a d
# 1 popopo 2
SQL Server and MariaDB "fail" both test cases; neither SQLite nor Postgres falls for it.
I don't see this behaviour in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.
Workaround
Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)
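For what it's worth, a minimal sketch of that post-query filtering (my addition, not part of the original answer): let the DBMS do the loose match, then re-check locally after collect().
db_df %>%
  filter(a %in% ids) %>%       # loose match done by the DBMS (trailing spaces may be ignored)
  collect() %>%
  filter(trimws(a) %in% ids)   # explicit match on the trimmed value, done in R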

Dropping rows containing NA with dbplyr

Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)

drv <- dbDriver('Postgres')
con <- dbConnect(drv, dbname = 'mydb', port = 5432, user = 'postgres')

table1 <- tbl(con, 'table1')
table2 <- tbl(con, 'table2')
table3 <- tbl(con, 'table3')

table1 %>% mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
I want to drop rows which include NA and then collect my final table, but I couldn't find anything helpful that works with dbplyr queries.
I tried piping drop_na() from tidyr and some other base functions (complete.cases(), etc.). Can you suggest anything to achieve this? Piping an SQL clause (like WHERE foo IS NOT NULL) into the dbplyr query would also be welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated sql gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(", colnames, ")")

output = df %>%
  filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
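Depending on your dplyr/dbplyr versions, if_all() may also translate directly, which avoids building the expressions as strings; a sketch (version support here is an assumption, so check show_query() on your own setup):
output = df %>%
  filter(if_all(c(c1, c2), ~ !is.na(.x)))  # one IS NOT NULL test per listed column
show_query(output)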
Well, actually I still haven't found a satisfying solution. What I wanted to do is drop rows containing NA from within R, without typing an SQL query; I think dbplyr doesn't support this yet.
So I wrote a little bit of simple code to make my wish come true:
main_query <- table1 %>% mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))

colnames <- main_query %>% colnames()

query1 <- main_query %>% sql_render() %>% paste('WHERE')
query2 <- ''
for (i in colnames) {
  if (i == tail(colnames, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}

desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it doesn't seem magical but maybe someone can make use of it.
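The same thing can also stay inside dplyr by combining the parse_exprs() idea from the answer above with the joined query; a sketch, untested against a live database:
library(rlang)

colnames <- main_query %>% colnames()
conditions <- paste0("!is.na(", colnames, ")")

desiredTable <- main_query %>%
  filter(!!!parse_exprs(conditions)) %>%  # becomes one IS NOT NULL clause per column
  collect()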

Query if value in list using R and PostgreSQL

I have a dataframe like this
df1
ID value
1 c(YD11,DD22,EW23)
2 YD34
3 c(YD44,EW23)
4
And I want to query another database to tell me how many rows contained these values. This will eventually be done in a loop through all rows, but for now I just want to know how to do it for one row.
Let's say the database looks like this:
sql_database
value  data
YD11   2222
WW20   4040
EW23   2114
YD44   3300
XH29   2040
So if I just looked at row 1, I would get:
dbGetQuery(con,
           sprintf("SELECT * FROM sql_database WHERE value IN %i",
                   df1$value[1])) %>%
  nrow()
OUTPUT:
2
And the other rows would be:
Row 2: 0
Row 3: 2
Row 4: 0
I don't need the loop written for me, but because my code doesn't work I would like to know how to query all rows of a table whose value appears in an R list.
You do not need a for loop for this.
library(tidyverse)
library(DBI)
library(dbplyr)

df1 <- tibble(
  id = 1:4,
  value = list(c("YD11", "DD22", "EW23"), "YD34", c("YD44", "EW23"), NA)
)

# create an in-memory database table
df2 <- tibble(
  value = c("YD11", "WW20", "EW23", "YD44", "XH29"),
  data = c(2222, 4040, 2114, 3300, 2040)
)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# add an auxiliary schema
tmp <- tempfile()
DBI::dbExecute(con, paste0("ATTACH '", tmp, "' AS some_schema"))
copy_to(con, df2, in_schema("some_schema", "some_sql_table"), temporary = FALSE)

# count rows
df1 %>%
  unnest(cols = c(value)) %>%
  left_join(tbl(con, dbplyr::in_schema("some_schema", "some_sql_table")) %>% collect(),
            by = "value") %>%
  mutate(data = if_else(is.na(data), 0, 1)) %>%
  group_by(id) %>%
  summarise(n = sum(data))
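For the example data above, the result should line up with the counts listed in the question (2, 0, 2, 0); roughly:
# # A tibble: 4 x 2
#      id     n
#   <int> <dbl>
# 1     1     2
# 2     2     0
# 3     3     2
# 4     4     0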