I would like to filter a SQL database based on whether a regular expression appears within a column. I would like to specify the regex as a variable; however, it is read as a literal string, and I am having trouble getting the regex in as a variable. Thank you for your help!
Resources I've consulted:
https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/
https://github.com/tidyverse/dbplyr/issues/168
Note: I had trouble making a reprex using the mtcars dataset, following https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/. I get the error: "Error: str_detect() is not available in this SQL variant". I cannot share a reprex using my actual SQL database. As such, below is a pseudo-reprex.
library(dplyr)
library(stringr)
# Variable with regex (either lower- or uppercase "m")
my_string <- "(?i)m"
# WITHOUT SQL DATABASE ----------------------------------------------------
# This runs
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_length(rowname) > 5)
# This runs with STRING
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, "(?i)m"))
# This runs with VARIABLE
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, my_string))
# WITH SQL DATABASE -------------------------------------------------------
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, tibble::rownames_to_column(mtcars), "mtcars")
# This runs
tbl(con, "mtcars") %>%
  filter(str_length(rowname) > 5)
# This *should* run with STRING -- pretend it does ;)
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, "M"))
# This does NOT run with VARIABLE
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, my_string))
This might depend a lot on the flavour of SQL you are using. The GitHub issue linked above mentions a translation for str_detect and also provides an alternative.
Testing for SQL Server:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%  # str_detect(string, pattern)
  show_query()
# Error: str_detect() is not available in this SQL variant
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect cannot be translated for SQL Server, but you can use %like% as a workaround.
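Note, though, that %like% maps to SQL's LIKE operator, which matches with % and _ wildcards rather than regex syntax, so the regex '(?i)m' above would only match that literal text. A minimal sketch with a wildcard pattern instead (the exact rendering may differ by dbplyr version):
# LIKE uses wildcards, not regex: "%m%" means "contains m"
my_like <- "%m%"
df_sim %>%
  filter(gear %like% my_like) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '%m%')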
Testing for MySQL:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mysql()) # changed to mysql
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%  # str_detect(string, pattern)
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` REGEXP '(?i)m')
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect can be translated correctly for MySQL.
In every case my_string is translated into the query.
A couple of other things to check:
Update your version of dbplyr; older versions do not have all the translations that newer versions do.
Try a column other than rowname; data frames in R can have rownames, but tables in SQL only have columns.
With the help of a colleague, I have a solution to force evaluation of a variable with a regex in stringr::str_detect:
tbl(con, "mtcars") %>%
filter(str_detect(rowname, {{my_string}}))
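For what it's worth, rlang's !! injection should force evaluation the same way. A minimal sketch against a simulated MySQL connection (simulated so it runs without a live database; the exact SQL may vary by dbplyr version):
library(dplyr)
library(dbplyr)
library(stringr)
my_string <- "(?i)m"
df_sim <- tbl_lazy(tibble::rownames_to_column(mtcars), con = simulate_mysql())
# !! inlines the value of my_string before translation
df_sim %>%
  filter(str_detect(rowname, !!my_string)) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`rowname` REGEXP '(?i)m')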
Related
I created a table from a data source using tbl(). I need to add a column containing 1:nrow() to my dataset. I tried different methods, but I didn't succeed. My code is as below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add the column ID to my dataset; it only adds the column step.
Using as.data.frame() it works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row
tbl(con, "mtcars") %>%
mutate(ID = row_number())
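One caveat: rows in a SQL table have no inherent order, so row_number() without an explicit ordering is arbitrary. If the numbering should follow a column, dbplyr's window_order() pins it down; a minimal sketch (mpg is just an example ordering column):
tbl(con, "mtcars") %>%
  window_order(mpg) %>%
  mutate(ID = row_number()) %>%
  show_query()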
dbDisconnect(con)
I found the answer. It is to use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric:
df1 <- df1 %>% mutate(ID = as.numeric(row_number(a_column)), step = 1)  # a_column: the column to order by
I need help translating a SQL statement regarding this dataset https://www.kaggle.com/datasets/hugomathien/soccer into R code using dplyr.
The SQL statement is:
SELECT Match.date, Team.team_long_name, Team.team_short_name, Match.home_team_goal
FROM Team JOIN Match
ON Match.home_team_api_id = Team.team_api_id
WHERE Match.match_api_id = 492476;
The R code that I have tried is:
con <- DBI::dbConnect(RSQLite::SQLite(), "data/database.sqlite")
library(tidyverse)
library(DBI)
match <- tbl(con, "Match")
team <- tbl(con, "Team")
table_4.2 <- match %>%
  filter(match_api_id == 492476) %>%
  select(date, home_team_goal, home_team_api_id) %>%
  left_join(team)
and I get this error:
Error in dplyr::common_by():
! by required, because the data sources have no common variables.
Run rlang::last_error() to see where the error occurred.
Use the code:
library(tidyverse)
team %>%
  left_join(match, by = c("team_api_id" = "home_team_api_id")) %>%
  filter(match_api_id == 492476) %>%
  select(date, team_long_name, team_short_name, home_team_goal)
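A side note: a bare JOIN in SQL is an inner join, so inner_join() is the closer equivalent of the original statement; you can inspect the translation with show_query():
team %>%
  inner_join(match, by = c("team_api_id" = "home_team_api_id")) %>%
  filter(match_api_id == 492476) %>%
  select(date, team_long_name, team_short_name, home_team_goal) %>%
  show_query()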
I need to apply a procedure in SQL that is easy for me in R but has been really tortuous in SQL.
I need to sort the data from highest to lowest by two variables, group by another variable, and select the first item in each group.
Below is the R code that I am trying to translate to SQL. Unfortunately, the dbplyr package throws an error when trying to convert one language to the other: Error: first() is only available in a windowed (mutate()) context
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2
mtcars2 %>%
  arrange(-mpg, -disp) %>%
  group_by(cyl) %>%
  summarise(hp = first(hp)) %>%
  show_query()
It seems to me that the DISTINCT ON function could help me.
Thanks for your help.
Maybe the following?
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
  arrange(-mpg, -disp) %>%
  group_by(cyl) %>%
  mutate(hp = first(hp)) %>%
  select(cyl, hp) %>%
  distinct() %>%
  show_query()
#> <SQL>
#> SELECT DISTINCT `cyl`, FIRST_VALUE(`hp`) OVER (PARTITION BY `cyl` ORDER BY -`mpg`, -`disp`) AS `hp`
#> FROM `mtcars`
#> ORDER BY -`mpg`, -`disp`
See: https://github.com/tidyverse/dbplyr/issues/129
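An alternative sketch: dbplyr can also express "first row per group" by filtering on row_number() over an explicit window_order(), which backends with window-function support turn into a ROW_NUMBER() subquery:
mtcars2 %>%
  group_by(cyl) %>%
  window_order(desc(mpg), desc(disp)) %>%
  filter(row_number() == 1L) %>%
  select(cyl, hp) %>%
  show_query()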
Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv, dbname = 'mydb', port = 5432, user = 'postgres')
table1 <- tbl(con, 'table1')
table2 <- tbl(con, 'table2')
table3 <- tbl(con, 'table3')
table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
I want to drop the rows that include NA and then collect my final table, but I couldn't find anything helpful that works with dbplyr queries.
I tried piping drop_na() from tidyr and some base functions (complete.cases(), etc.). Can you suggest anything to achieve this? Piping an SQL query (like WHERE foo IS NOT NULL) into the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated sql gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(",colnames,")")
output = df %>%
filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
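As an aside, recent dbplyr versions (2.2+, I believe) also translate if_all(), which avoids building expressions from strings; a minimal sketch:
library(dplyr)
library(dbplyr)
df <- tbl_lazy(data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7)), con = simulate_mssql())
df %>%
  filter(if_all(everything(), ~ !is.na(.x))) %>%
  show_query()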
Well, actually I still haven't found a fully satisfying solution. What I wanted was to drop the rows containing NA from within R, without typing an SQL query; I don't think dbplyr supports this yet.
So I wrote a small and simple piece of code to make it work:
main_query <- table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
colnames <- main_query %>% colnames()
# Render the query so far, then append a hand-built WHERE clause
query1 <- main_query %>% sql_render() %>% paste('WHERE')
query2 <- ''
for (i in colnames) {
  if (i == tail(colnames, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}
desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it doesn't seem magical but maybe someone can make use of it.
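For completeness, the parse_exprs() trick from the answer above can be applied straight to main_query, which keeps everything in dbplyr and avoids pasting SQL by hand; a rough sketch:
library(rlang)
library(dplyr)
conditions <- paste0("!is.na(", colnames(main_query), ")")
desiredTable <- main_query %>%
  filter(!!!parse_exprs(conditions)) %>%
  collect()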
I have the following tbl, which I need to convert to a data frame, and I get the error below, which I can't figure out. My main idea is to get a value from a particular column in the table. Viewing the table works fine. Thanks.
library(RODBC)
library(odbc)
library(dplyr)
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "MSIGS75\\SQLEXPRESS",
                 Database = "Players")
dbListTables(con)
table <- tbl(con, "playersData")
View(tbl(con, "playersData"))
tableDF <- as.data.frame(table)
Error
Error in as.data.frame.default(table) : cannot coerce class ‘"function"’ to a data.frame
We can use collect():
library(dbplyr)
library(dplyr)
yourcolumn <- "some column name"
yourindex <- 5 # row 5
table %>%
  collect() %>%
  as.data.frame() %>%
  select(all_of(yourcolumn)) %>%  # all_of() since the column name is held in a string
  slice(yourindex)
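One follow-up thought: collect() pulls the entire table into R first. For a large table it may be better to push the subsetting into SQL before collecting, with the caveat that SQL rows have no guaranteed order, so "row 5" is only well-defined with an explicit ordering. A rough sketch (some_id_column is a hypothetical column that defines the order):
library(dbplyr)
table %>%
  window_order(some_id_column) %>%      # hypothetical ordering column
  filter(row_number() == yourindex) %>% # becomes a ROW_NUMBER() window
  select(all_of(yourcolumn)) %>%
  collect()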