Can you name dbplyr's simulated lazy tables?

dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can't seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can't seem to modify them afterward. In fact, I don't see `df` anywhere in the query object or its attributes (it doesn't have any), so now I have no idea how dbplyr processes table names at all.
MWE:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)
query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I'd like to have R scripts that produce SQL syntax that's ready to run in that system, even though R can't connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it'd be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it's really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)
Created on 2019-12-12 by the reprex package (v0.3.0)

I am not aware of any way to rename the simulated tables. According to the documentation, the point of the simulate_* functions is to test SQL translation without actually connecting to a database.
When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:
# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM `df`
# actual
df = tbl(database_connection_object, 'db_table_name')
df %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"
Not only is df replaced by the actual table name in the second query, the * from the simulated query is also expanded into the column names.
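If connecting to the real server is not possible, the "empty dummy database" idea mentioned in the question can be approximated with a local in-memory SQLite database holding zero-row copies of the tables. This is only a sketch (the table name is a placeholder, and the generated SQL will be in SQLite's dialect, e.g. LIMIT rather than TOP), but the real table name does appear in the query:
# structure-only dummy database
dummy_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(dummy_con, "db_table_name", mtcars[0, ])  # column structure, no rows
tbl(dummy_con, "db_table_name") %>% head(5) %>% show_query()
# FROM `db_table_name` instead of `df`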
If generating SQL scripts via simulation is important to you, one option is to convert the query to text, replace the placeholder table name, and (if needed) convert back. For example:
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% sql_render() %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)
# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))
remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]
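If you do this often, the render-and-substitute step can be wrapped in a small helper. The function name and the fixed-string replacement below are just illustrative:
# Hypothetical helper: render a simulated lazy tbl to SQL text and swap the
# `df` placeholder for the real table name used on the target server.
render_with_name <- function(lazy_tbl, real_name) {
  query_text <- as.character(sql_render(lazy_tbl))
  gsub("`df`", real_name, query_text, fixed = TRUE)
}
render_with_name(df_sim %>% head(5), "[db].[schema].[table]")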

Related

I've performed a JOIN using bigrquery and the dbGetQuery function. Now I'd like to query the temporary table I've created but can't connect

I'm afraid that if a bunch of folks start running my actual code I'll be billed for the queries, so my example code is for a fake database.
I've successfully established my connection to BigQuery:
con <- dbConnect(
bigrquery::bigquery(),
project = 'myproject',
dataset = 'dataset',
billing = 'myproject'
)
Then performed a LEFT JOIN using the coalesce function:
dbGetQuery(con,
"SELECT
`myproject.dataset.table_1`.Pokemon,
coalesce(`myproject.dataset.table_1`.Type_1,`myproject.dataset.table_2`.Type_1) AS Type_1,
coalesce(`myproject.dataset.table_1`.Type_2,`myproject.dataset.table_2`.Type_2) AS Type_2,
`myproject.dataset.table_1`.Total,
`myproject.dataset.table_1`.HP,
`myproject.dataset.table_1`.Attack,
`myproject.dataset.table_1`.Special_Attack,
`myproject.dataset.table_1`.Defense,
`myproject.dataset.table_1`.Special_Defense,
`myproject.dataset.table_1`.Speed
FROM `myproject.dataset.table_1`
LEFT JOIN `myproject.dataset.table_2`
ON `myproject.dataset.table_1`.Pokemon = `myproject.dataset.table_2`.Pokemon
ORDER BY `myproject.dataset.table_1`.ID;")
The JOIN produced the table I intended and now I'd like to query that table but like...where is it? How do I connect? Can I save it locally so that I can start working my analysis in R? Even if I go to BigQuery, select the Project History tab, select the query I just ran in RStudio, and copy the Job ID for the temporary table, I still get the following error:
Error: Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Run `rlang::last_error()` to see where the error occurred.
And if I follow up:
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
1. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. DBI:::.local(conn, statement, ...)
5. bigrquery::dbSendQuery(conn, statement, ...)
6. bigrquery:::BigQueryResult(conn, statement, ...)
7. bigrquery::bq_job_wait(job, quiet = conn@quiet)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
x
1. +-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. \-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. \-DBI:::.local(conn, statement, ...)
4. +-DBI::dbSendQuery(conn, statement, ...)
5. \-bigrquery::dbSendQuery(conn, statement, ...)
6. \-bigrquery:::BigQueryResult(conn, statement, ...)
7. \-bigrquery::bq_job_wait(job, quiet = conn@quiet)
Can someone please explain? Is it just that I can't query a temporary table with the bigrquery package?
From looking at the documentation here and here, the problem might just be that you did not assign the results anywhere.
local_df = dbGetQuery(...
should take the results from your database query and copy them into local R memory. Take care, as there is no check on the size of the results, so it is easy to run out of memory when doing this.
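One way to guard against that is to ask the database for the row count before pulling everything; a sketch (the COUNT query, the my_join_query name, and the threshold are illustrative only):
# check the result size first, then fetch
n_rows <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM `myproject.dataset.table_1`")$n
if (n_rows < 1e6) {
  local_df <- DBI::dbGetQuery(con, my_join_query)  # my_join_query: the JOIN string from the question
}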
You have tagged the question with dbplyr, but it looks like you are just using the DBI package. If you want to be writing R and have it translated to SQL, then you can do this using dbplyr. It would look something like this:
con <- dbConnect(...) # your connection details here
remote_tbl1 = tbl(con, from = "table_1")
remote_tbl2 = tbl(con, from = "table_2")
new_remote_tbl = remote_tbl1 %>%
left_join(remote_tbl2, by = "Pokemon", suffix = c("",".y")) %>%
mutate(Type_1 = coalesce(Type_1, Type_1.y),
Type_2 = coalesce(Type_2, Type_2.y)) %>%
select(ID, Pokemon, Type_1, Type_2, ...) %>% # list your return columns
arrange(ID)
When you use this approach, new_remote_tbl can be thought of as a new table in the database that you can query and manipulate further. (It is not actually a table: no data has been written to disk, but you can query and interact with it as if it were, and the database will produce it for you on demand.)
There are some limitations of working with a remote table (the biggest is you are limited to commands that dbplyr can translate into SQL). When you want to copy the current remote table into local R memory, use collect:
local_df = remote_df %>%
collect()
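Before collecting, it can be worth checking the translated SQL and the approximate result size; both of these run on the remote table without pulling the data (a small usage sketch):
new_remote_tbl %>% show_query()            # inspect the generated SQL
new_remote_tbl %>% count() %>% collect()   # cheap row-count check before a full collect()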

How to pipe SQL into R's dplyr?

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct() but it's not supported in SQL syntax. Anyways, this does indeed work:
dbGetQuery(database_name,
"SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
left_join(sql_table_1, sql_table_2, by = "col5") %>%
sql("SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?
In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you want to combine custom SQL code with the auto-generated SQL code from dbplyr. For this it is important to distinguish between:
DBI::db* commands - these execute the provided SQL on the database and return the result.
dbplyr translation - where you work with a remote connection to a table
You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
Reference examples that cover many of the different use cases
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:
union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
write_to_database function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute
Automatic piping
dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which SQL translations are defined. So long as translations are defined, you can chain together many pipes (I have used 10 or more at once); about the only disadvantage is that the translated SQL query becomes difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)
tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1") %>%
distinct()
When you then call show_query(df) R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
The actual output is not as nicely formatted. Note that the initial command (the left join) appears as a nested query, with the distinct in the outer query. Hence df is an R link to a remote database table defined by the above SQL query.
Creating custom SQL functions
You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df){
db_connection <- df$src$con
sql_query <- build_sql(con = db_connection,
"SELECT DISTINCT * FROM (\n",
sql_render(df),
") AS nested_tbl"
)
return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
custom_distinct()
When you then call show_query(df) R should return the following SQL code (I say 'should' because I cannot get this working with simulated SQL connections), but not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df is an R link to a remote database table defined by the above sql query.
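Because df is again just a remote table reference, you can keep piping standard dplyr verbs onto it and the custom SQL becomes a nested subquery; a small usage sketch (this again requires a real connection rather than a simulated one):
df %>%
  filter(col1 > 1) %>%
  show_query()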
Converting dbplyr to DBI
You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
as.character(sql_render(df)),
") AS nested_table")
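# Note: db_connection must be a real DBI connection here; the simulated
# connections above can render SQL but cannot execute it.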
local_table = dbGetQuery(db_connection, custom_distinct2)
This will return a local R data frame using an SQL command equivalent to the previous examples.
If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice -- perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
lazy_query
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed <-
lazy_query %>%
compute()
lazy_query_computed
#> # Source: table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002
Created on 2020-01-01 by the reprex package (v0.3.0)
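If you would rather not depend on the autogenerated table name, compute() also lets you name the table explicitly; a minimal sketch reusing the same pipeline (the table name is arbitrary):
lazy_query %>%
  compute(name = "my_result_tbl", temporary = TRUE)
# the computed table can then be referenced directly in handwritten SQL, e.g.
# DBI::dbGetQuery(lazy_query$src$con, "SELECT c + 1 AS d FROM my_result_tbl")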
If your SQL dialect supports CTEs, you could also extract the query string and use this as part of a custom SQL, perhaps similarly to Simon's suggestion.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
sql <-
lazy_query %>%
sql_render()
cte_sql <-
paste0(
"WITH my_result AS (", sql, ") ",
"SELECT c + 1 AS d FROM my_result"
)
cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"
DBI::dbGetQuery(
lazy_query$src$con,
cte_sql
)
#> d
#> 1 21
Created on 2020-01-01 by the reprex package (v0.3.0)

Is R dbplyr "smart" enough to determine the SQL database type and apply appropriate syntax?

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
As I'm starting to learn SQL, my first big lesson is that the syntax differs depending on the database type.
For example, this type of query will often work:
mtcars3 <- DBI::dbGetQuery(con, "SELECT * FROM mtcars LIMIT 5")
But on some SQL databases I try SELECT * FROM xyz LIMIT 5 and I get a syntax error. I then try something along the lines of:
DBI::dbGetQuery(con, "SELECT TOP 5 * FROM xyz")
and I'm able to get the result I want.
This makes me very curious as to what will happen when I start using dbplyr exclusively and forego using SQL queries at all (to the extent possible). Is dbplyr going to be "smart" enough to recognize the different SQL databases I'm working in? And more importantly will dbplyr apply the correct syntax, dependent on database type?
Yes, dbplyr is 'smart' enough to use the connection type of the table when translating dplyr commands to database (SQL) syntax. Consider the following example:
library(dplyr)
library(dbplyr)
data(mtcars)
# mimic postgre database
df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
df_postgre %>% head(5) %>% show_query()
# resulting SQL translation
<SQL>
SELECT *
FROM `df`
LIMIT 5
# mimic MS server database
df_server = tbl_lazy(mtcars, con = simulate_mssql())
df_server %>% head(5) %>% show_query()
# resulting SQL translation
<SQL>
SELECT TOP(5) *
FROM `df`
You can experiment with the different simulate_* functions in dbplyr to check translations for your particular database.
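For example, a SQLite simulation (a sketch; the exact set of simulate_*() functions available depends on your dbplyr version):
# mimic SQLite database
df_sqlite = tbl_lazy(mtcars, con = simulate_sqlite())
df_sqlite %>% head(5) %>% show_query()
# resulting SQL translation uses LIMIT 5, like the Postgres example above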

How to give dplyr a SQL query and have it return a remote tbl object?

Say I have a remote tbl open using dbplyr, and I want to run a SQL query on it (maybe because there's no dbplyr translation for what I want to do). How do I do that so it returns a remote tbl object?
The DBI::dbGetQuery() function allows you to send a query to the database, but it returns a data frame in memory, not a remote tbl object.
For example, say you already have a connection con open to a db, you can create a table like this:
library(tidyverse)
x_df <- expand.grid(A = c('a','b','c'), B = c('d','e','f', 'g','h')) %>%
mutate(C = round(rnorm(15), 2))
DBI::dbWriteTable(conn = con,
name = "x_tbl",
value = x_df,
overwrite = TRUE)
x_tbl = tbl(con, 'x_tbl')
sql_query <- build_sql('SELECT a, b, c, avg(c) OVER (PARTITION BY a) AS mean_c FROM x_tbl', con = con)
y_df <- DBI::dbGetQuery(con, sql_query) # This returns a data frame on memory
y_tbl <- x_tbl %>%
group_by(a) %>%
mutate(mean_c = mean(c))
show_query(y_tbl) # This is a remote tbl object
In this case, I could just use y_tbl. But there are cases in which the function has not been translated in dbplyr (for example, quantile doesn't work), and I need to use SQL code. But I don't want to collect the result; I want it to create a remote tbl object. Is there a way I can give a SQL query (like with dbGetQuery()) but have it return a remote tbl?
Thank you
Well, playing with how it works, I think I found a way. You can give a SQL snippet inside the mutate() function:
y_tbl <- x_tbl %>%
group_by(a) %>%
mutate(mean_c = sql("avg(c) OVER (PARTITION BY a)"))
show_query(y_tbl) # This is a remote tbl object
This lets you give a SQL definition of a variable without having to compute the table.
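Alternatively, you can hand a complete query string to tbl() wrapped in sql(); this also returns a remote tbl rather than a local data frame. A sketch reusing the query from the question:
y_tbl2 <- tbl(con, sql("SELECT a, b, c, avg(c) OVER (PARTITION BY a) AS mean_c FROM x_tbl"))
show_query(y_tbl2)   # still a remote tbl object; nothing has been collected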
As I understand it, there is a collection of standard translations that dbplyr makes from dplyr to SQL. Anything that falls outside these translations is left as is.
For example, DATEFROMPARTS is an SQL function but not an R function. I commonly use the following mutate:
y_tbl <- x_tbl %>%
mutate(new_date = DATEFROMPARTS(year_col, month_col, day_col))
Because there is no defined translation from an R function DATEFROMPARTS to an SQL function (the R function does not exist in dplyr), it is passed through to the SQL as is.
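You can confirm this by looking at the translation; the unknown function is passed through verbatim (output shown approximately):
show_query(y_tbl)
# ... DATEFROMPARTS(`year_col`, `month_col`, `day_col`) AS `new_date` ...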

R: How to replace single columns of a table in database

Does anyone know a workaround for replacing a single column in a table on a MS SQL Server database?
The point is that, due to memory restrictions, I want to avoid pulling the entire table into my R session. Instead of using dbWriteTable(), can one (over)write just a single column?
What would also help would be an appropriate query that could be sent to the database via dbSendQuery() and that takes into account the new column to be transferred to the database.
To make things clearer, I add some code describing what I am searching for:
# pull "someColumn" and "identifier" from "databaseTable"
dplyr::tbl(odbcConnection, dbplyr::in_schema("schemaName", "databaseTable")) %>%
select(identifier, someColumn) %>%
dplyr::collect() -> data2beUpdated
# update "someColumn" accoring to a "lookupTable" in R-session also having the "identifier"
data2beUpdated$newColumn <- lookupTable[match(data2beUpdated$identifier, lookupTable$identifier), "newValues"]
apply(data2beUpdated,
1,
FUN = function(x){ ifelse(
is.na(x["newValues"]),
x["someColumn"],
x["newValues"]
)#ifelse
}#fun
) -> data2beUpdated$someColumn
# now write back the data in the way I DON'T want to do it since it costs runtime to pull in the entire data table
dplyr::tbl(odbcConnection, dbplyr::in_schema("schemaName", "databaseTable")) %>%
dplyr::collect() -> df
df$someColumn <- data2beUpdated$someColumn
DBI::dbWriteTable(odbcConnection, "databaseTable", df, overwrite = TRUE)
Any suggestions?!