Translate SQL statement into R code using dplyr

I need help translating an SQL statement against this dataset, https://www.kaggle.com/datasets/hugomathien/soccer, into R code using dplyr.
The SQL statement is :
SELECT Match.date, Team.team_long_name, Team.team_short_name, Match.home_team_goal
FROM Team JOIN Match
ON Match.home_team_api_id = Team.team_api_id
WHERE Match.match_api_id = 492476;
The R code that I have tried is:
con <- DBI::dbConnect(RSQLite::SQLite(), "data/database.sqlite")
library(tidyverse)
library(DBI)
match <- tbl(con, "Match")
team <- tbl(con, "Team")
table_4.2 <- match %>%
  filter(match_api_id == 492476) %>%
  select(date, home_team_goal, home_team_api_id) %>%
  left_join(team)
and I get this error:
Error in dplyr::common_by():
! by required, because the data sources have no common variables.
Run rlang::last_error() to see where the error occurred.

Use this code, joining the two tables on their key columns just as the SQL ON clause does:
library(tidyverse)
team %>%
  left_join(match, by = c(team_api_id = "home_team_api_id")) %>%
  filter(match_api_id == 492476) %>%
  select(date, team_long_name, team_short_name, home_team_goal)
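If you want to check the translation against the original SQL before collecting, dbplyr can print the query it generates; a quick sanity check, assuming the con connection from above is still open:
team %>%
  left_join(match, by = c(team_api_id = "home_team_api_id")) %>%
  filter(match_api_id == 492476) %>%
  select(date, team_long_name, team_short_name, home_team_goal) %>%
  show_query()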

Related

How to add a vector to a table in backend using dbplyr (R)

I created a table from a data source using tbl(). I need to add a column containing 1:nrow() to my dataset; I tried different methods but didn't succeed. My code is below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
This doesn't add the ID column to my dataset; it only adds the step column.
Using as.data.frame() first, it works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dplyr)
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row number
tbl(con, "mtcars") %>%
  mutate(ID = row_number())
dbDisconnect(con)
I found the answer. It is to use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric (some_column below is a placeholder for the column that defines the ordering):
df1 <- df1 %>% mutate(ID = as.numeric(row_number(some_column)), step = 1)
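As a sanity check, the ordered variant should translate to a ROW_NUMBER() window function; a minimal sketch using dbplyr's simulated Postgres backend (some_column is again a placeholder):
library(dplyr)
library(dbplyr)
df_sim <- tbl_lazy(data.frame(some_column = c(2, 1, 3)), con = simulate_postgres())
df_sim %>%
  mutate(ID = as.numeric(row_number(some_column)), step = 1) %>%
  show_query()
# The generated query should contain ROW_NUMBER() OVER (ORDER BY `some_column`)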

Use variable with regex in stringr::str_detect in dbplyr SQL query

I would like to filter a SQL database based on whether a regular expression appears within any column. I would like to specify the regex as a variable; however, it is read as a literal string. I am having trouble passing the regex in as a variable. Thank you for your help!
Resources I've consulted:
https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/
https://github.com/tidyverse/dbplyr/issues/168
Note: I had trouble making a reprex using the mtcars dataset, following https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/. I get the error: "Error: str_detect() is not available in this SQL variant". I cannot share a reprex using my actual SQL database. As such, below is a pseudo-reprex.
library(dplyr)
library(stringr)
# Variable with regex (either lower- or uppercase "m")
my_string <- "(?i)m"
# WITHOUT SQL DATABASE ----------------------------------------------------
# This runs
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_length(rowname) > 5)
# This runs with STRING
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, "(?i)m"))
# This runs with VARIABLE
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, my_string))
# WITH SQL DATABASE -------------------------------------------------------
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, tibble::rownames_to_column(mtcars), "mtcars")
# This runs
tbl(con, "mtcars") %>%
  filter(str_length(rowname) > 5)
# This *should* run with STRING -- pretend it does ;)
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, "M"))
# This does NOT run with VARIABLE
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, my_string))
This might depend a lot on the flavour of SQL you are using. The GitHub issue linked above mentions a translation for str_detect and also provides an alternative.
Testing for SQL Server:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim <- tbl_lazy(mtcars, con = simulate_mssql())
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%
  show_query()
# Error: str_detect() is not available in this SQL variant
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect cannot be translated for SQL Server, but you can use %like% as a workaround.
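Note that %like% uses SQL LIKE pattern syntax rather than regular expressions, so the regex "(?i)m" is treated as a literal string; for a substring match you would rewrite the pattern with % wildcards by hand, e.g.:
df_sim %>%
  filter(gear %like% "%m%") %>%
  show_query()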
Testing for MySQL:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim <- tbl_lazy(mtcars, con = simulate_mysql())  # changed to MySQL
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` REGEXP '(?i)m')
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect can be translated correctly for MySQL.
In every case my_string is translated into the query.
A couple of other things to check:
- Update your version of dbplyr; older versions do not have all the translations that newer versions do.
- Try a column other than the rownames; data frames in R can have rownames, but tables in SQL only have columns.
With the help of a colleague, I have a solution to force evaluation of a variable with regex in stringr::str_detect:
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, {{my_string}}))
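rlang's unquote operator is an equivalent way to force the variable to be evaluated before translation:
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, !!my_string))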

Dropping rows containing NA with dbplyr

Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv, dbname = 'mydb', port = 5432, user = 'postgres')
table1 <- tbl(con, 'table1')
table2 <- tbl(con, 'table2')
table3 <- tbl(con, 'table3')
table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
I want to drop the rows that contain NA and then collect my final table, but I couldn't find anything helpful that works with dbplyr queries.
I tried piping in drop_na() from tidyr and some base functions (complete.cases(), etc.). Can you suggest anything that would achieve my aim? Piping an SQL clause (like WHERE FOO IS NOT NULL) into the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df <- data.frame(my_num = c(1, 2, 3))
df <- tbl_lazy(df, con = simulate_mssql())
output <- df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated SQL gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df <- data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7))
df <- tbl_lazy(df, con = simulate_mssql())
colnames <- c("c1", "c2")
conditions <- paste0("!is.na(", colnames, ")")
output <- df %>%
  filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
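With a recent dbplyr (the if_all() translation needs roughly dbplyr 2.1.0 or later), the same multi-column check can likely be written without building strings:
output <- df %>%
  filter(if_all(c(c1, c2), ~ !is.na(.x)))
show_query(output)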
Well, actually I still haven't found a satisfying solution. What I wanted was to drop the rows containing NA from within R, without typing an SQL query; I think dbplyr doesn't support this yet.
So I wrote a short and simple piece of code to make my wish come true:
main_query <- table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
colnames <- main_query %>% colnames()
query1 <- main_query %>% sql_render() %>% paste('WHERE')
query2 <- ''
for (i in colnames) {
  if (i == tail(colnames, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}
desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it doesn't seem magical, but maybe someone can make use of it.
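The loop can also be collapsed into a single paste() call that builds the same WHERE clause:
query2 <- paste(colnames, 'IS NOT NULL', collapse = ' AND ')
desiredTable <- dbGetQuery(con, paste(query1, query2))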

Converting a table into a data frame - RStudio

I have the following table, which I need to convert to a data frame, and I get the error below, which I can't figure out. My main goal is to get a value from a particular column in the table. Viewing the table works fine. Thanks.
library(RODBC)
library(odbc)
library(dplyr)
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "MSIGS75\\SQLEXPRESS",
                 Database = "Players")
dbListTables(con)
table <- tbl(con, "playersData")
View(tbl(con, "playersData"))
tableDF <- as.data.frame(table)
Error
Error in as.data.frame.default(table) : cannot coerce class ‘"function"’ to a data.frame
We can use collect(). (The error itself says a function was coerced, which suggests as.data.frame() picked up base R's table() function, i.e. the table <- tbl(...) assignment had not actually run in that session; naming the object something other than table avoids that masking confusion.)
library(dbplyr)
library(dplyr)
yourcolumn <- "some column name"
yourindex <- 5  # row 5
table %>%
  collect() %>%
  as.data.frame() %>%
  select(all_of(yourcolumn)) %>%
  slice(yourindex)
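Since the goal is a single value rather than a one-cell data frame, base indexing on the collected data frame gets you there as well:
collected <- table %>% collect()
collected[[yourcolumn]][yourindex]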

percentile_cont in bigrquery

I would like to get the 20th percentile of a column in BigQuery using dplyr syntax in bigrquery, but I keep getting the following error. Here is a reproducible example:
library(bigrquery)
library(dplyr)
library(DBI)
billing <- YOUR_BILLING_INFO
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
natality <- tbl(con, "natality")
natality %>%
  filter(year %in% c(1969, 1970)) %>%
  group_by(year) %>%
  summarise(percentile_20 = percentile_cont(weight_pounds, 0.2))
I get the following error:
Error: Analytic function PERCENTILE_CONT cannot be called without an OVER clause at [1:16] [invalidQuery]
However, it is not clear how to include an OVER clause here. How can I get the 20th percentile with dplyr syntax?
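One possible workaround (a sketch, not verified against a live BigQuery connection): since PERCENTILE_CONT is an analytic function, write it with an explicit OVER clause via dbplyr's sql() escape hatch inside mutate(), then deduplicate per year:
natality %>%
  filter(year %in% c(1969, 1970)) %>%
  mutate(percentile_20 = sql(
    "PERCENTILE_CONT(weight_pounds, 0.2) OVER (PARTITION BY year)"
  )) %>%
  distinct(year, percentile_20) %>%
  collect()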