dplyr vs dbplyr filtering with white space - sql

This is partly related to my previous question. If I use dplyr to filter a dataframe whose ids have trailing white space, using ids that have no trailing white space, dplyr treats the white space as a character, so no match occurs and the result is an empty dataframe:
library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 hjhjh 1
ids <- df %>%
select(a) %>%
pull()
ids
#[1] "hjhjh"
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1
# 2 "popopo" 2
#now filter
df_new <- df_with_space %>%
filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>
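Just to confirm the plain-R behaviour (a quick check added for illustration), the trailing space really is treated as an ordinary character:
"hjhjh " == "hjhjh"
# [1] FALSE
trimws("hjhjh ") == "hjhjh"
# [1] TRUE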
If I try to do the same thing and filter using dbplyr from a SQL database, it ignores the white space in the filtering but still includes it in the final output, example code:
library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
Database = "test",
dsn = "SQL_server")
db_df <- tbl(test_db, "testing")
db_df <- db_df %>%
filter(a %in% ids) %>%
collect()
#quotation marks:
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1 #matches but includes the white space
I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim the whitespace first, which is very slow on a large database:
db_df <- db_df %>%
mutate(a = str_trim(a, "both")) %>%
filter(a %in% ids) %>%
collect()
thanks
EDIT
With show_query
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))

If you're connecting to SQL Server, then I can reproduce this. I'll label it as a "bug", personally, and will never rely on it ...
No need to use dbplyr here, the issue is in the underlying DBMS; dbplyr is just the messenger, don't blame the messenger :-)
Setup
consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}
Tests
lapply(cons, function(thiscon) {
DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 hjhjh 1
# $sqlserver
# a d
# 1 hjhjh 1
lapply(cons, function(thiscon) {
DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 popopo 2
# $sqlserver
# a d
# 1 popopo 2
SQL Server and MariaDB "fail" in both test cases; neither SQLite nor Postgres falls for it.
I don't see this in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.
Workaround
Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)
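If you do accept the loose DBMS match, one rough sketch (my own suggestion, reusing the db_df and ids objects from the question) is to let the database do the cheap, too-permissive filter and then re-apply whichever semantics you actually want on the small collected result:
db_df %>%
  filter(a %in% ids) %>%   # loose match done by the DBMS (trailing spaces ignored)
  collect() %>%
  filter(a %in% ids)       # exact, whitespace-sensitive match re-checked locally
# or, if you want the trimmed semantics instead, use filter(trimws(a) %in% ids) locally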

Related

How to specify vartype for sqlSave() for multiple columns without manually typing in R?

Some reproducible code. Note that your database and server name may be different.
library(RODBC)
iris <- iris
connection <- odbcDriverConnect(
"Driver={SQL Server};
Server=localhost\\SQLEXPRESS;
Database=testdb;
Trusted_connection=true;"
)
# create table in sql and move dataframe values in
columnTypes <- list(
Sepal.Length="decimal(28,0)",
Sepal.Width="decimal(28,0)",
Petal.Length ="decimal(28,0)",
Petal.Width ="decimal(28,0)",
Species = "varchar(255)"
)
sqlSave(connection,iris,varTypes = columnTypes)
This is how I export a dataframe to SQL Server Management Studio as a table, and it works. But say I have one hundred new columns in iris. Do I have to set each column name to decimal(28,0) in my columnTypes variable?
# but what if i have
iris$random1 <- rnorm(150)
iris$random2 <- rnorm(150)
iris$random3-999 ..... <- rnorm(150) ....
iris$random1000 <- rnorm(150)
By default the columns go in as floats, at least in my actual dataframe (iris is just the example), so that's why I need to manually change them in columnTypes. I want everything after the 5 original columns in iris to be decimal(28,0) format without manually including them in columnTypes.
I did not read into the sqlSave() statement and possible alternatives, meaning there might be a more appropriate solution. Anyhow, you can generate the list of wanted definitions in base R by repetition and combining lists:
# dummy data
df <- data.frame(Sepal.Length = 1:5,Sepal.Width = 1:5,Petal.Length = 1:5,Petal.Width = 1:5,Species = 1:5,col6 = 1:5,col7 = 1:5)
# all column names after the 5th -> if there are fewer than 6 columns you will get an error here!
vec <- names(df)[6:ncol(df)]
# generate list with same definition for all columns after the 5th
list_after_5 <- as.list(rep("decimal(28,0)", length(vec)))
# name the list items according to the columns after the 5th
names(list_after_5) <- vec
# first 5 column definitions manually, plus remaining columns with the same definition from the list generated above
columnTypes <- c(list(Sepal.Length="decimal(28,0)",
Sepal.Width="decimal(28,0)",
Petal.Length ="decimal(28,0)",
Petal.Width ="decimal(28,0)",
Species = "varchar(255)"),
list_after_5)
columnTypes
$Sepal.Length
[1] "decimal(28,0)"
$Sepal.Width
[1] "decimal(28,0)"
$Petal.Length
[1] "decimal(28,0)"
$Petal.Width
[1] "decimal(28,0)"
$Species
[1] "varchar(255)"
$col6
[1] "decimal(28,0)"
$col7
[1] "decimal(28,0)"
Since your example code seems to be working for you, this should work as well (judging by the output) - I did not test it with a DB, as I have no test setup available atm.
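For what it's worth, the repeated part can also be built in a single step with setNames(), same result, just more compact (also untested against a live DB):
vec <- names(df)[6:ncol(df)]
columnTypes <- c(
  list(Sepal.Length = "decimal(28,0)",
       Sepal.Width  = "decimal(28,0)",
       Petal.Length = "decimal(28,0)",
       Petal.Width  = "decimal(28,0)",
       Species      = "varchar(255)"),
  # name the repeated definitions in one go
  setNames(as.list(rep("decimal(28,0)", length(vec))), vec)
)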

Query if value in list using R and PostgreSQL

I have a dataframe like this
df1
ID value
1 c(YD11,DD22,EW23)
2 YD34
3 c(YD44,EW23)
4
And I want to query another database to tell me how many rows had these values in them. This will eventually be done in a loop through all rows but for now I just want to know how to do it for one row.
Let's say the database looks like this:
sql_database
value data
YD11 2222
WW20 4040
EW23 2114
YD44 3300
XH29 2040
So if I just looked at row 1, I would get:
dbGetQuery(con,
sprintf("SELECT * FROM sql_database WHERE value IN %i",
df1$value[1])) %>%
nrow()
OUTPUT:
2
And the other rows would be :
Row 2: 0
Row 3: 2
Row 4: 0
I don't need the loop written for me, but since my code doesn't work I would like to know how to query all rows of a table whose value appears in an R list.
You do not need a for loop for this.
library(tidyverse)
library(DBI)
library(dbplyr)
df1 <- tibble(
id = 1:4,
value = list(c("YD11","DD22","EW23"), "YD34", c("YD44","EW23"), NA)
)
# creating in memory database table
df2 <- tibble(
value = c("YD11", "WW20", "EW23", "YD44", "XH29"),
data = c(2222, 4040, 2114, 3300, 2040)
)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Add auxiliary schema
tmp <- tempfile()
DBI::dbExecute(con, paste0("ATTACH '", tmp, "' AS some_schema"))
copy_to(con, df2, in_schema("some_schema", "some_sql_table"), temporary = FALSE)
# counting rows
df1 %>%
unnest(cols = c(value)) %>%
left_join(tbl(con, dbplyr::in_schema("some_schema", "some_sql_table")) %>% collect(), by = "value") %>%
mutate(data = if_else(is.na(data), 0, 1)) %>%
group_by(id) %>%
summarise(n = sum(data))
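With the example data above this should reproduce the counts the question asks for (2, 0, 2, 0); roughly:
# # A tibble: 4 x 2
#      id     n
#   <int> <dbl>
# 1     1     2
# 2     2     0
# 3     3     2
# 4     4     0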

Multinomial (nnet) does not work using parsnip and broom

I'm trying to run a multinomial model (nnet) using tidymodels, but it gives me the following error:
Error: object of type 'closure' is not subsettable
data(iris)
ml<-multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification") %>%
translate()
ml_fit <- ml %>%
fit(Species ~ Sepal.Width, data=iris)
broom::tidy(ml_fit, exponentiate = F)
But when I run the following directly, it works perfectly:
formula <- Species ~ Sepal.Width
model <- nnet::multinom(formula, data = iris)
broom::tidy(model, exponentiate = F)
Any idea whether I'm writing the tidymodels code properly, or is it something else?
In tidymodels, we handle things in such a way that the original data and formula are not contained in the resulting call (in the usual way). Some parts of multinom() want those (plus the actual data in the same place) to do the computations.
We just changed how we handle the formula; that now comes through as it would have if you called multinom() directly. We can't really do the same with data, but we did add a new function called repair_call() that you can use to make things the way that you want them.
# devtools::install_dev("parsnip")
library(parsnip)
library(broom)
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_fit <- multi_spec %>%
fit(Species ~ Sepal.Width, data = iris)
tidy(multi_fit)
#> Error in as.data.frame.default(data, optional = TRUE): cannot coerce class '"function"' to a data.frame
multi_fit_new <- repair_call(multi_fit, iris)
tidy(multi_fit_new)
#> # A tibble: 4 x 6
#> y.level term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 1.55e+8 3.06 6.15 7.54e-10
#> 2 versicolor Sepal.Width 2.20e-3 0.991 -6.17 6.70e-10
#> 3 virginica (Intercept) 4.41e+5 2.69 4.83 1.33e- 6
#> 4 virginica Sepal.Width 1.69e-2 0.844 -4.84 1.33e- 6
Created on 2020-05-22 by the reprex package (v0.3.0)

How do I overcome a Divide by Zero Error in R

Hi, Stack community! I came across this error while working today and I am not sure exactly how to work around it. I read that I should use na.rm in my mutate, and I tried it, but it didn't work. I could be completely wrong.
library("DBI")
library("dbplyr")
library("odbc")
library("dplyr")
library("stringr")
library("tidyverse")
library("lubridate")
select(CustomerID, PostalCodeID, OrderID, ItemID, WrittenSales, WrittenUnits, TransCodeID, SalesType, ProductID, ProductName, GroupID, SubGroupID, CategoryID, TransDate, LocationID, LocationName) %>%
filter(SalesType == "W",
LocationID %in% Louisville) %>%
group_by(CustomerID, PostalCodeID, WrittenSales, TransCodeID, SalesType, ProductID, ProductName, GroupID, SubGroupID, CategoryID, TransDate, LocationID, LocationName) %>%
summarise(WrittenUnits_purchased = sum(WrittenUnits)) %>%
ungroup() %>%
group_by(CustomerID) %>%
mutate(prop_of_total = WrittenUnits_purchased/sum(WrittenUnits_purchased)) %>%
ungroup()
While this is a SQL problem, it can be mitigated in your code.
Setup:
# library(odbc) or similar, for the DB driver
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2 (x int, y int)")
# [1] 0
DBI::dbExecute(con, "insert into r2 (x,y) values (1,1),(2,0)")
# [1] 2
DBI::dbGetQuery(con, "select * from r2")
# x y
# 1 1 1
# 2 2 0
Demonstration of the problem in base R, and the SQL-based fix:
DBI::dbGetQuery(con, "select x/y as xy from r2")
# Error in result_fetch(res@ptr, n) :
# nanodbc/nanodbc.cpp:2593: 22012: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Divide by zero error encountered.
# Warning in dbClearResult(rs) : Result already cleared
DBI::dbGetQuery(con, "select (case when y = 0 then null else x/y end) as xy from r2")
# xy
# 1 1
# 2 NA
Since you're using dbplyr, here's that side of things:
library(dplyr)
library(dbplyr)
tbl(con, "r2") %>%
collect()
# # A tibble: 2 x 2
# x y
# <int> <int>
# 1 1 1
# 2 2 0
tbl(con, "r2") %>%
mutate(xy = x/y) %>%
collect()
# Error in result_fetch(res@ptr, n) :
# nanodbc/nanodbc.cpp:2593: 22012: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Divide by zero error encountered.
# Warning in dbClearResult(res) : Result already cleared
tbl(con, "r2") %>%
mutate(xy = if_else(y == 0, NA, x/y)) %>%
collect()
# # A tibble: 2 x 3
# x y xy
# <int> <int> <int>
# 1 1 1 1
# 2 2 0 NA
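If you want to confirm what dbplyr actually sends to the server, show_query() on the same pipeline prints the generated SQL; the if_else() should come through as a CASE WHEN (or IIF, depending on the backend and dbplyr version), mirroring the hand-written SQL fix above:
tbl(con, "r2") %>%
  mutate(xy = if_else(y == 0, NA, x/y)) %>%
  show_query()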

Pass SQL functions in dplyr filter function on database

I'm using dplyr's automatic SQL backend to query a subtable from a database table, e.g.
my_tbl <- tbl(my_db, "my_table")
where my_table in the database looks like
batch_name value
batch_A_1 1
batch_A_2 2
batch_A_2 3
batch_B_1 8
batch_B_2 9
...
I just want the data from batch_A_#, regardless of the number.
If I were writing this in SQL, I could use
select * from my_table where batch_name like 'batch_A_%'
If I were writing this in R, I could use a few ways to get this: grepl(), %in%, or str_detect()
# option 1
subtable <- my_tbl %>% select(batch_name, value) %>%
filter(grepl('batch_A_', batch_name, fixed = T))
# option 2
subtable <- my_tbl %>% select(batch_name, value) %>%
filter(str_detect(batch_name, 'batch_A_'))
All of these give the following Postgres error: HINT: No function matches the given name and argument types. You might need to add explicit type casts
So how do I pass SQL string or matching functions through, so that the SQL query dplyr generates can use a more flexible range of functions in filter()?
(FYI the %in% function does work, but requires listing out all possible values. This would be okay combined with paste to make a list, but does not work in a more general regex case)
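For completeness, the %in% route mentioned above would look something like this (only workable when the suffixes can be enumerated, e.g. with paste0; it still can't express a general pattern):
# enumerate the suffixes you expect (illustrative only)
batch_ids <- paste0("batch_A_", 1:2)
subtable <- my_tbl %>%
  select(batch_name, value) %>%
  filter(batch_name %in% batch_ids)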
A "dplyr-only" solution would be this
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
Full reprex:
suppressPackageStartupMessages({
library(dplyr)
library(dbplyr)
library(RPostgreSQL)
})
my_con <-
dbConnect(
PostgreSQL(),
user = "my_user",
password = "my_password",
host = "my_host",
dbname = "my_db"
)
my_table <- tribble(
~batch_name, ~value,
"batch_A_1", 1,
"batch_A_2", 2,
"batch_A_2", 3,
"batch_B_1", 8,
"batch_B_2", 9
)
copy_to(my_con, my_table)
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
#> # A tibble: 3 x 2
#> batch_name value
#> * <chr> <dbl>
#> 1 batch_A_1 1
#> 2 batch_A_2 2
#> 3 batch_A_2 3
dbDisconnect(my_con)
#> [1] TRUE
This works because any functions that dplyr doesn't know how to translate will be passed along as-is; see ?dbplyr::translate_sql.
Hat-tip to @PaulRougieux for his recent comment here.
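To see the pass-through in action, show_query() on the same pipeline should show the untranslated infix appearing verbatim in the WHERE clause (approximate output; exact quoting depends on the backend):
tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  show_query()
# <SQL>
# SELECT *
# FROM "my_table"
# WHERE ("batch_name" like 'batch_A_%')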
Using dplyr
Get the table from the database as a dataframe and use it for further data analysis.
library("dplyr")
my_db <- src_postgres(dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- tbl(my_db, "my_table")
df %>% filter(batch_name == "batch_A_1")
Using DBI and RPostgreSQL
Get the table by sending an SQL query:
library("DBI")
library("RPostgreSQL")
m <- dbDriver("PostgreSQL")
con <- dbConnect(drv = m,
dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- dbGetQuery(con, "SELECT * FROM my_table WHERE batch_name LIKE 'batch_A_%'")
library("dplyr")
df %>% filter(batch_name == "batch_A_1")