How to add a vector to a table in the backend using dbplyr (R) - SQL

I created a table from a data source using tbl(). I need to add a column containing 1:nrow() to my dataset. I tried different methods but didn't succeed. My code is below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add the ID column to my dataset; only the step column is added.
It works if I first convert with as.data.frame(), but that is very slow.
Do you have any ideas? Thanks in advance.

For this case, you can use row_number().
library(dbplyr)
library(DBI)
library(dplyr)  # tbl(), mutate() and %>% come from dplyr
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add the row number as an ID column
tbl(con, "mtcars") %>%
  mutate(ID = row_number())
dbDisconnect(con)

I found the answer. Use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric:
df1 <- df1 %>% mutate(ID = as.numeric(row_number(some_column)), step = 1)
Here some_column is a placeholder for whichever column should define the row order.
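A minimal sketch of that fix against the SQLite demo above (ordering by mpg is just an assumption; use whichever column defines your row order):
library(DBI)
library(dplyr)
library(dbplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# row_number(mpg) translates to ROW_NUMBER() OVER (ORDER BY `mpg`),
# and as.numeric() adds a CAST so the result collects as numeric
tbl(con, "mtcars") %>%
  mutate(ID = as.numeric(row_number(mpg)), step = 1) %>%
  show_query()
dbDisconnect(con)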

Related

Use variable with regex in stringr::str_detect in dbplyr SQL query

I would like to filter a SQL database based on whether a regular expression appears within any column. I would like to specify the regex as a variable; however, it is read as a literal string. I am having trouble getting the regex in as a variable. Thank you for your help!
Resources I've consulted:
https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/
https://github.com/tidyverse/dbplyr/issues/168
Note: I had trouble making a reprex using the mtcars dataset, following https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/. I get the error: "Error: str_detect() is not available in this SQL variant". I cannot share a reprex using my actual SQL database. As such, below is a pseudo-reprex.
library(dplyr)
library(stringr)
# Variable with regex (either lower- or uppercase "m")
my_string <- "(?i)m"
# WITHOUT SQL DATABASE ----------------------------------------------------
# This runs
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_length(rowname) > 5)
# This runs with STRING
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, "(?i)m"))
# This runs with VARIABLE
mtcars %>%
  tibble::rownames_to_column() %>%
  filter(str_detect(rowname, my_string))
# WITH SQL DATABASE -------------------------------------------------------
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, tibble::rownames_to_column(mtcars), "mtcars")
# This runs
tbl(con, "mtcars") %>%
  filter(str_length(rowname) > 5)
# This *should* run with STRING -- pretend it does ;)
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, "M"))
# This does NOT run with VARIABLE
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, my_string))
This might depend a lot on the flavour of SQL you are using. The GitHub issue linked above mentions a translation for str_detect and also provides an alternative.
Testing for SQL Server:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%  # str_detect(string, pattern)
  show_query()
# Error: str_detect() is not available in this SQL variant
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect cannot be translated for SQL Server, but you can use %like% as a workaround.
Testing for MySQL:
library(dbplyr)
library(dplyr)
library(stringr)
data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mysql()) # changed to mysql
my_string <- "(?i)m"
df_sim %>%
  filter(str_detect(gear, my_string)) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` REGEXP '(?i)m')
df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')
So it appears str_detect is translated correctly for MySQL.
In every case, my_string is inlined into the query as a literal string.
A couple of other things to check:
Update your version of dbplyr; older versions lack translations that newer versions have.
Try a column other than rownames; data frames in R can have rownames, but SQL tables only have columns.
With the help of a colleague, I have a solution that forces evaluation of a variable containing a regex in stringr::str_detect:
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, {{my_string}}))
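Another common way to force a local variable to be inlined is rlang's injection operator !!. A minimal sketch using simulate_mysql(), chosen because that backend translates str_detect as shown above:
library(dplyr)
library(dbplyr)
library(stringr)
df_sim <- tbl_lazy(tibble::rownames_to_column(mtcars), con = simulate_mysql())
my_string <- "(?i)m"
# !! evaluates my_string in R first, so the regex lands in the SQL as a literal
df_sim %>%
  filter(str_detect(rowname, !!my_string)) %>%
  show_query()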

Sort by one variable, group by another, and select first row in SQL Query in R

I need to apply a procedure in SQL that is easy for me in R but has been really tortuous in SQL.
I need to sort the data from highest to lowest by two variables, group by another variable, and select the first item in each group.
Below is the code I am trying to translate from R to SQL. Unfortunately, the dbplyr package throws an error when converting one language to the other: Error: first() is only available in a windowed (mutate()) context
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2
mtcars2 %>%
  arrange(-mpg, -disp) %>%
  group_by(cyl) %>%
  summarise(hp = first(hp)) %>%
  show_query()
It seems to me that the DISTINCT ON function could help me.
Thanks for your help.
Maybe the following?
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
  arrange(-mpg, -disp) %>%
  group_by(cyl) %>%
  mutate(hp = first(hp)) %>%
  select(cyl, hp) %>%
  distinct() %>%
  show_query()
#> <SQL>
#> SELECT DISTINCT `cyl`, FIRST_VALUE(`hp`) OVER (PARTITION BY `cyl` ORDER BY -`mpg`, -`disp`) AS `hp`
#> FROM `mtcars`
#> ORDER BY -`mpg`, -`disp`
See: https://github.com/tidyverse/dbplyr/issues/129
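An alternative sketch, not from the answer above: number the rows within each group and keep the first, which dbplyr turns into a ROW_NUMBER() subquery instead of DISTINCT. It uses dbplyr's window_order() to set the window's ORDER BY:
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  # rank rows within each cyl group, highest mpg (then disp) first
  window_order(desc(mpg), desc(disp)) %>%
  mutate(rn = row_number()) %>%
  filter(rn == 1) %>%
  select(cyl, hp) %>%
  show_query()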

Dropping rows containing NA with dbplyr

Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv,dbname='mydb',port=5432,user='postgres')
table1 <- tbl(con,'table1')
table2 <- tbl(con,'table2')
table3 <- tbl(con,'table3')
table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
I want to drop rows that include NA and then collect my final table, but I couldn't find anything helpful that works with dbplyr queries.
I tried piping drop_na() from tidyr and some base functions (complete.cases(), etc.). Can you suggest anything that would achieve this? Appending an SQL clause (like WHERE foo IS NOT NULL) to the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated SQL gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(",colnames,")")
output = df %>%
  filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
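With a recent stack (an assumption: if_all() needs dplyr >= 1.0.4, and dbplyr 2.x to translate it), the same multi-column filter can be written without building strings:
library(dplyr)
library(dbplyr)
df = tbl_lazy(data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7)), con = simulate_mssql())
# if_all() applies !is.na() to every listed column and ANDs the results
output = df %>%
  filter(if_all(c(c1, c2), ~ !is.na(.x)))
show_query(output)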
Well, actually I still don't have a satisfying solution. What I wanted to do exactly was to drop rows containing NA from within R, without typing an SQL query by hand; I think dbplyr doesn't support this yet.
So I wrote a small and simple piece of code to make my wish come true:
main_query <- table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
cols <- main_query %>% colnames()  # renamed to avoid shadowing base::colnames
query1 <- main_query %>% sql_render() %>% paste('WHERE')
query2 <- ''
for (i in cols) {
  if (i == tail(cols, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}
desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it doesn't seem magical but maybe someone can make use of it.
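For what it's worth, the parse_exprs() trick from the answer above can be applied straight to the lazy query, so everything stays inside dbplyr (a sketch; main_query and con as defined above):
library(rlang)
library(dplyr)
conditions <- paste0("!is.na(", colnames(main_query), ")")
# each condition becomes a `col` IS NOT NULL clause in the generated SQL
desiredTable <- main_query %>%
  filter(!!!parse_exprs(conditions)) %>%
  collect()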

Unique values in categorical variables - RStudio

How can I find how many unique values each categorical variable takes in a data frame, and then represent that with a graph? All of this in RStudio.
We'll use the tidyverse here.
library(tidyverse)
You can apply the unique() function to a dataframe to remove any repeat rows.
df <- iris %>% unique()
The group_by(), summarise() and n() functions let you count the number of instances of a variable in a dataframe.
df2 <- df %>% group_by(Species) %>% summarise(n = n())
## alternatively use count() which does the same thing
df2 <- df %>% count(Species)
Finally, we can use the ggplot2 package to create a graph.
ggplot() + geom_col(data = df2, aes(x = Species, y = n))
If you're not interested in having a separate dataframe with the data in it and want to jump straight to the graph, you can ignore the step with group_by() and summarise() and just use geom_bar().
ggplot() + geom_bar(data = df, aes(Species))
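If the goal is the literal question, the number of distinct values each categorical column takes, here is a sketch (iris has only one factor column, Species, so the chart shows a single bar):
library(tidyverse)
distinct_counts <- iris %>%
  summarise(across(where(is.factor), n_distinct)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_unique")
ggplot() + geom_col(data = distinct_counts, aes(x = variable, y = n_unique))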

Converting a table into a data frame - RStudio

I have the following table, which I need to convert to a data frame, and I get the error below, which I can't figure out. My main idea is to get a value from a particular column of the table. Viewing the table works fine. Thanks
library(RODBC)
library(odbc)
library(dplyr)
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "MSIGS75\\SQLEXPRESS",
                 Database = "Players")
dbListTables(con)
table <- tbl(con, "playersData")
View(tbl(con, "playersData"))
tableDF <- as.data.frame(table)
Error
Error in as.data.frame.default(table) : cannot coerce class ‘"function"’ to a data.frame
We can use collect() to bring the remote table into R:
library(dbplyr)
library(dplyr)
yourcolumn <- "some column name"
yourindex <- 5  # row 5
table %>%
  collect() %>%
  as.data.frame() %>%
  select(all_of(yourcolumn)) %>%  # all_of() selects by the stored column name
  slice(yourindex)
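A lighter variant of the same idea, sketched under the assumption that you only need that one value: push the column selection and row limit into the database, so only a few rows ever cross the connection. Note that SQL tables have no guaranteed row order, so "row 5" is only meaningful after an explicit arrange():
table %>%
  select(all_of(yourcolumn)) %>%
  head(yourindex) %>%  # becomes LIMIT / TOP in the generated SQL
  collect() %>%
  slice(yourindex) %>%
  pull()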