How do I overcome a Dividing by Zero Error in R - sql

Hi, Stack community! I came across this error while working today, and I am not sure exactly how to work around it. I read that I should use na.rm in my mutate(), and I tried it, but it didn't work. I could be completely wrong.
library("DBI")
library("dbplyr")
library("odbc")
library("dplyr")
library("stringr")
library("tidyverse")
library("lubridate")
select(CustomerID, PostalCodeID, OrderID, ItemID, WrittenSales, WrittenUnits,
       TransCodeID, SalesType, ProductID, ProductName, GroupID, SubGroupID,
       CategoryID, TransDate, LocationID, LocationName) %>%
  filter(SalesType == "W",
         LocationID %in% Louisville) %>%
  group_by(CustomerID, PostalCodeID, WrittenSales, TransCodeID, SalesType,
           ProductID, ProductName, GroupID, SubGroupID, CategoryID, TransDate,
           LocationID, LocationName) %>%
  summarise(WrittenUnits_purchased = sum(WrittenUnits)) %>%
  ungroup() %>%
  group_by(CustomerID) %>%
  mutate(prop_of_total = WrittenUnits_purchased / sum(WrittenUnits_purchased)) %>%
  ungroup()

While this is a SQL problem, it can be mitigated in your code.
Setup:
# library(odbc) or similar, for the DB driver
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2 (x int, y int)")
# [1] 0
DBI::dbExecute(con, "insert into r2 (x,y) values (1,1),(2,0)")
# [1] 2
DBI::dbGetQuery(con, "select * from r2")
# x y
# 1 1 1
# 2 2 0
Demonstration of the problem in base R, and the SQL-based fix:
DBI::dbGetQuery(con, "select x/y as xy from r2")
# Error in result_fetch(res@ptr, n) :
# nanodbc/nanodbc.cpp:2593: 22012: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Divide by zero error encountered.
# Warning in dbClearResult(rs) : Result already cleared
DBI::dbGetQuery(con, "select (case when y = 0 then null else x/y end) as xy from r2")
# xy
# 1 1
# 2 NA
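The same fix is a bit shorter with NULLIF, which returns NULL when its two arguments are equal, so the division yields NULL instead of erroring; a sketch against the same table:
DBI::dbGetQuery(con, "select x / nullif(y, 0) as xy from r2")
# xy
# 1 1
# 2 NA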
Since you're using dbplyr, here's that side of things:
library(dplyr)
library(dbplyr)
tbl(con, "r2") %>%
collect()
# # A tibble: 2 x 2
# x y
# <int> <int>
# 1 1 1
# 2 2 0
tbl(con, "r2") %>%
mutate(xy = x/y) %>%
collect()
# Error in result_fetch(res@ptr, n) :
# nanodbc/nanodbc.cpp:2593: 22012: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Divide by zero error encountered.
# Warning in dbClearResult(res) : Result already cleared
tbl(con, "r2") %>%
mutate(xy = if_else(y == 0, NA, x/y)) %>%
collect()
# # A tibble: 2 x 3
# x y xy
# <int> <int> <int>
# 1 1 1 1
# 2 2 0 NA
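Back in your pipeline, only the final mutate() needs to change. A sketch of that step using na_if(), which dbplyr translates to SQL's NULLIF, so a customer whose units sum to zero gets NA instead of triggering the error:
group_by(CustomerID) %>%
  # na_if() turns a zero denominator into NULL before the division happens;
  # na.rm = TRUE just silences dbplyr's note (SQL always drops NULLs in SUM)
  mutate(prop_of_total = WrittenUnits_purchased /
           na_if(sum(WrittenUnits_purchased, na.rm = TRUE), 0)) %>%
  ungroup()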

Related

dplyr vs dbplyr filtering with white space

This is partly related to my previous question. If I filter a dataframe using dplyr, matching ids that have trailing white space against ids with no trailing white space, dplyr treats the white space as a character, no match occurs, and the result is an empty dataframe:
library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 hjhjh 1
ids <- df %>%
select(a) %>%
pull()
ids
#[1] "hjhjh"
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1
# 2 "popopo" 2
#now filter
df_new <- df_with_space %>%
filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>
If I try to do the same thing and filter using dbplyr on a SQL database, it ignores the white space when filtering but still includes it in the final output; example code:
library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
Database = "test",
dsn = "SQL_server")
db_df <- tbl(test_db, "testing")
db_df <- db_df %>%
filter(a %in% ids) %>%
collect()
#quotation marks:
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1 #matches but includes the white space
I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim the whitespace first, which is very slow on a large database:
db_df <- db_df %>%
mutate(a = str_trim(a, "both")) %>%
filter(a %in% ids) %>%
collect()
thanks
EDIT
With show_query
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))
If you're connecting to SQL Server, then I can reproduce this. I'll label it as a "bug", personally, and will never rely on it ...
There's no need to use dbplyr to see this: the issue is in the underlying DBMS, and dbplyr is just the messenger, so don't blame the messenger :-)
Setup
consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}
Tests
lapply(cons, function(thiscon) {
DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 hjhjh 1
# $sqlserver
# a d
# 1 hjhjh 1
lapply(cons, function(thiscon) {
DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
# a d
# 1 popopo 2
# $sqlserver
# a d
# 1 popopo 2
SQL Server and MariaDB "fail" in both test cases; neither SQLite nor Postgres falls for it.
I don't see this in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.
Workaround
Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)
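If you do accept it, a minimal sketch of that post-query filtering, reusing the connection and table names from the question: let the DBMS do its loose match, then re-apply the exact match locally.
tbl(test_db, "testing") %>%
  filter(a %in% ids) %>%   # loose DBMS match; trailing spaces ignored on SQL Server
  collect() %>%
  filter(a %in% ids)       # exact match re-applied in R after collect()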

Query if value in list using R and PostgreSQL

I have a dataframe like this
df1
ID value
1 c(YD11,DD22,EW23)
2 YD34
3 c(YD44,EW23)
4
And I want to query another database to tell me how many rows had these values in them. This will eventually be done in a loop through all rows but for now I just want to know how to do it for one row.
Let's say the database looks like this:
sql_database
value data
YD11 2222
WW20 4040
EW23 2114
YD44 3300
XH29 2040
So if I just looked at row 1, I would get:
dbGetQuery(con,
           sprintf("SELECT * FROM sql_database WHERE value IN %i",
                   df1$value[1])) %>%
  nrow()
OUTPUT:
2
And the other rows would be :
Row 2: 0
Row 3: 2
Row 4: 0
I don't need the loop written for me, but because my code doesn't work I would like to know how to query all rows of a table whose value appears in an R list.
You do not need a for loop for this.
library(tidyverse)
library(DBI)
library(dbplyr)
df1 <- tibble(
id = 1:4,
value = list(c("YD11","DD22","EW23"), "YD34", c("YD44","EW23"), NA)
)
# creating in memory database table
df2 <- tibble(
value = c("YD11", "WW20", "EW23", "YD44", "XH29"),
data = c(2222, 4040, 2114, 3300, 2040)
)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Add auxiliary schema
tmp <- tempfile()
DBI::dbExecute(con, paste0("ATTACH '", tmp, "' AS some_schema"))
copy_to(con, df2, in_schema("some_schema", "some_sql_table"), temporary = FALSE)
# counting rows
df1 %>%
unnest(cols = c(value)) %>%
left_join(tbl(con, dbplyr::in_schema("some_schema", "some_sql_table")) %>% collect(), by = "value") %>%
mutate(data = if_else(is.na(data), 0, 1)) %>%
group_by(id) %>%
summarise(n = sum(data))
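If you do want the per-row query sketched in the question, the IN list has to be built as a quoted, comma-separated string before it goes into the SQL. A hedged sketch for row 1, assuming the remote table really is named sql_database; dbQuoteString() quotes the values safely:
vals <- df1$value[[1]]  # c("YD11", "DD22", "EW23")
in_list <- paste(DBI::dbQuoteString(con, vals), collapse = ", ")
query <- sprintf("SELECT * FROM sql_database WHERE value IN (%s)", in_list)
nrow(DBI::dbGetQuery(con, query))
# [1] 2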

Multinomial (nnet) does not work using parsnip and broom

I'm trying to run a multinomial regression (nnet) using tidymodels, but it gives me the following error:
Error: object of type 'closure' is not subsettable
data(iris)
ml<-multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification") %>%
translate()
ml_fit <- ml %>%
fit(Species ~ Sepal.Width, data=iris)
broom::tidy(ml_fit, exponentiate = F)
But when I run the following, it works perfectly:
formula <- Species ~ Sepal.Width
model <- nnet::multinom(formula, data = iris)
broom::tidy(model, exponentiate = F)
Any idea whether I'm writing the tidy model properly, or is it something else?
In tidymodels, we handle things in a way that the original data and formula are not contained in the resulting call (in the usual way). Some parts of multinom() want those (plus the actual data in the same place) to do the computations.
We just changed how we handle the formula; that now comes through as it would have if you called multinom() directly. We can't really do the same with data, but we did add a new function called repair_call() that you can use to make things the way that you want them.
# devtools::install_dev("parsnip")
library(parsnip)
library(broom)
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_fit <- multi_spec %>%
fit(Species ~ Sepal.Width, data = iris)
tidy(multi_fit)
#> Error in as.data.frame.default(data, optional = TRUE): cannot coerce class '"function"' to a data.frame
multi_fit_new <- repair_call(multi_fit, iris)
tidy(multi_fit_new)
#> # A tibble: 4 x 6
#> y.level term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 1.55e+8 3.06 6.15 7.54e-10
#> 2 versicolor Sepal.Width 2.20e-3 0.991 -6.17 6.70e-10
#> 3 virginica (Intercept) 4.41e+5 2.69 4.83 1.33e- 6
#> 4 virginica Sepal.Width 1.69e-2 0.844 -4.84 1.33e- 6
Created on 2020-05-22 by the reprex package (v0.3.0)
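If you still want the un-exponentiated coefficients from your original call, the extra argument should pass straight through to broom's multinom method (a sketch, not checked against every broom version):
tidy(multi_fit_new, exponentiate = FALSE)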

Getting longitude and latitude from RDS gadm file

I would like to ask: is there a way to get longitude and latitude variables (in a data.frame) from the RDS geographic files downloaded, for example, from https://gadm.org/download_country_v3.html?
I know we can easily plot from this dataset simply using:
df2 <- readRDS("C:/Users/petr7/Downloads/gadm36_DEU_1_sp.rds")
library(leaflet)
library(ggplot2)
# Using leaflet
leaflet() %>% addProviderTiles("CartoDB.Positron") %>% addPolygons(data = df2, weight = 0.5, fill = F)
# Using ggplot
ggplot() +
geom_polygon(data = df2, aes(x=long, y = lat, group = group), color = "black", fill = F)
However, even when typing df2$, there are no long or lat options.
I would do something like this:
# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
my_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_ITA_0_sf.rds"
data <- readRDS(url(my_url))
italy_coordinate <- st_coordinates(data)
head(italy_coordinate)
#> X Y L1 L2 L3
#> [1,] 12.61486 35.49292 1 1 1
#> [2,] 12.61430 35.49292 1 1 1
#> [3,] 12.61430 35.49347 1 1 1
#> [4,] 12.61375 35.49347 1 1 1
#> [5,] 12.61375 35.49403 1 1 1
#> [6,] 12.61347 35.49409 1 1 1
Created on 2019-12-27 by the reprex package (v0.3.0)
Now you just need to change the url according to your problem. Read the help page of the st_coordinates function (i.e. ?sf::st_coordinates) for an explanation of the meaning of the L1, L2 and L3 columns.
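Since the question asks for longitude and latitude in a data.frame, a short sketch converting the coordinate matrix and renaming the first two columns to the names used in the ggplot call:
italy_df <- as.data.frame(italy_coordinate)
names(italy_df)[1:2] <- c("long", "lat")  # X is longitude, Y is latitude
head(italy_df)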

Pass SQL functions in dplyr filter function on database

I'm using dplyr's automatic SQL backend to query a subtable from a database table, e.g.
my_tbl <- tbl(my_db, "my_table")
where my_table in the database looks like
batch_name value
batch_A_1 1
batch_A_2 2
batch_A_2 3
batch_B_1 8
batch_B_2 9
...
I just want the data from batch_A_#, regardless of the number.
If I were writing this in SQL, I could use
select * from my_table where batch_name like 'batch_A_%'
If I were writing this in R, I could use a few ways to get this: grepl(), %in%, or str_detect()
# option 1
subtable <- my_tbl %>% select(batch_name, value) %>%
filter(grepl('batch_A_', batch_name, fixed = T))
# option 2
subtable <- my_tbl %>% select(batch_name, value) %>%
filter(str_detect(batch_name, 'batch_A_'))
All of these give the following Postgres error: HINT: No function matches the given name and argument types. You might need to add explicit type casts
So, how do I pass SQL string or pattern-matching functions into filter() so that the SQL query dplyr generates can use a more flexible range of functions?
(FYI the %in% function does work, but requires listing out all possible values. This would be okay combined with paste to make a list, but does not work in a more general regex case)
A "dplyr-only" solution would be this
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
Full reprex:
suppressPackageStartupMessages({
library(dplyr)
library(dbplyr)
library(RPostgreSQL)
})
my_con <-
dbConnect(
PostgreSQL(),
user = "my_user",
password = "my_password",
host = "my_host",
dbname = "my_db"
)
my_table <- tribble(
~batch_name, ~value,
"batch_A_1", 1,
"batch_A_2", 2,
"batch_A_2", 3,
"batch_B_1", 8,
"batch_B_2", 9
)
copy_to(my_con, my_table)
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
#> # A tibble: 3 x 2
#> batch_name value
#> * <chr> <dbl>
#> 1 batch_A_1 1
#> 2 batch_A_2 2
#> 3 batch_A_2 3
dbDisconnect(my_con)
#> [1] TRUE
This works because any functions that dplyr doesn't know how to translate will be passed along as is; see ?dbplyr::translate_sql.
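You can see the pass-through by inspecting the generated SQL; a sketch (quoting and casing vary by dbplyr version and backend):
tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  show_query()
# the WHERE clause contains the LIKE condition essentially verbatim,
# because dbplyr leaves the unknown %like% infix for the backend to interpret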
Hat-tip to @PaulRougieux for his recent comment here
Using dplyr
Get the table containing batch_name from the database as a dataframe and use it for further data analysis.
library("dplyr")
my_db <- src_postgres(dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- tbl(my_db, "my_table")
df %>% filter(batch_name == "batch_A_1")
Using DBI and RPostgreSQL
Get the table by sending an SQL query
library("DBI")
library("RPostgreSQL")
m <- dbDriver("PostgreSQL")
con <- dbConnect(drv = m,
dbname = "database-name",
host = "localhost",
port = 5432,
user = "username",
password = "password")
df <- dbGetQuery(con, "SELECT * FROM my_table WHERE batch_name LIKE 'batch_A_%'")
library("dplyr")
df %>% filter(batch_name == "batch_A_1")
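Once the data is local (after dbGetQuery() or collect()), the pattern-matching helpers from the question work as usual; a sketch on the df just pulled:
library("stringr")
df %>% filter(grepl("batch_A_", batch_name, fixed = TRUE))
df %>% filter(str_detect(batch_name, "batch_A_"))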