I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct(), but it's not supported in SQL syntax. Anyway, this does work:
dbGetQuery(database_name,
"SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
left_join(sql_table_1, sql_table_2, by = "col5") %>%
sql("SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?
In my first example I use the actual table name that I'm pulling from, which is obvious enough. But in this case I am piping, and I'm used to using the magrittr pronoun . (or sometimes the .data pronoun from rlang) when working in memory in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you want to combine custom SQL code with auto-generated SQL code from dbplyr. For this it is important to distinguish between:
DBI::db* commands, which execute the provided SQL on the database and return the result.
dbplyr translation, where you work with a remote connection to a table.
You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
Reference examples that cover many of the different use cases
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository. This includes:
union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
write_to_datebase function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute
Automatic piping
dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs that have SQL translations defined. As long as SQL translations are defined you can chain together many pipes (I have used 10 or more at once), with the (almost) only disadvantage being that the translated SQL query becomes difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)
tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1") %>%
distinct()
When you then call show_query(df) R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
But not as nicely formatted. Note that the initial command (the left join) appears as a nested query, with the DISTINCT in the outer query. Hence df is an R link to a remote database table defined by the SQL query above.
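For instance, adding another verb to the pipe simply wraps this query in a further layer of nesting (a sketch continuing the example above; the exact formatting of the generated SQL may differ):
df %>%
  filter(col1 > 1) %>%
  show_query()
# the SQL above becomes a sub-query, with the filter appearing as a WHERE
# clause in an outer SELECT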
Creating custom SQL functions
You can pipe dbplyr tables into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df){
  # the DBI connection underlying the remote table
  db_connection <- df$src$con
  # wrap the SQL that defines df in a DISTINCT query
  sql_query <- build_sql(con = db_connection,
                         "SELECT DISTINCT * FROM (\n",
                         sql_render(df),
                         ") AS nested_tbl"
                         )
  return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
custom_distinct()
When you then call show_query(df) R should return the following SQL code (I say 'should' because I cannot get this working with simulated SQL connections), but not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df is an R link to a remote database table defined by the above sql query.
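Because the result of custom_distinct() is still a remote table, you can keep piping standard dplyr verbs after the custom step (a sketch, assuming a working non-simulated connection; col2.x comes from the join above):
df %>%
  filter(col2.x == "a") %>%
  show_query()
# the custom DISTINCT query becomes a nested sub-query inside the
# auto-generated SQL for the filter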
Converting dbplyr to DBI
You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")

custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
                          as.character(sql_render(df)),
                          ") AS nested_table")

db_connection = df$src$con  # the underlying DBI connection (needs a real, non-simulated connection)
local_table = dbGetQuery(db_connection, custom_distinct2)
This will return a local R data frame, produced by the equivalent SQL command as in the previous examples.
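If you would rather keep the result on the database instead of pulling it into R memory, you can wrap the same string back into a remote table (a sketch, again assuming a real connection):
remote_distinct = tbl(db_connection, sql(custom_distinct2))
# remote_distinct behaves like any other dbplyr remote table: you can pipe
# further dplyr verbs into it or collect() it later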
If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice -- perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
lazy_query
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed <-
lazy_query %>%
compute()
lazy_query_computed
#> # Source: table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002
Created on 2020-01-01 by the reprex package (v0.3.0)
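If you prefer not to rely on the autogenerated name, compute() also accepts an explicit name (a sketch; "my_result" is an arbitrary name):
lazy_query_named <-
  lazy_query %>%
  compute(name = "my_result")
# the result set is now materialised on the database as a table called
# my_result, which you can reference directly in hand-written SQL passed to
# DBI::dbGetQuery()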
If your SQL dialect supports CTEs, you could also extract the query string and use this as part of a custom SQL, perhaps similarly to Simon's suggestion.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
sql <-
lazy_query %>%
sql_render()
cte_sql <-
paste0(
"WITH my_result AS (", sql, ") ",
"SELECT c + 1 AS d FROM my_result"
)
cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"
DBI::dbGetQuery(
lazy_query$src$con,
cte_sql
)
#> d
#> 1 21
Created on 2020-01-01 by the reprex package (v0.3.0)
Related
In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:
library(fuzzyjoin)
library(dplyr)
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
filter(date_1 >= date_2, date_1 <= date_3) )
view(sample_query)
However, I do not think this is possible, because the function being used for the "fuzzy join" (stringdist_inner_join) is not supported.
I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R
My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?
Right now I can only execute "sample_query" locally - rewriting this function (stringdist_inner_join) so it runs on the database would let me perform the "sample_query" much faster.
Does anyone know if this is possible?
Note:
My data looks like this:
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
Based on your other post about this issue, the question of how to structure the SQL query has already been solved:
SAS: Fuzzy Joins
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3)
and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)
To get this to run in an R script, I would recommend using dbplyr and creating the query with tbl(), so you can continue doing basic manipulation of it as if it were a data.frame. dbplyr will translate basic commands into SQL, combine everything into a single query, and eventually pull the data from the query when you call the collect() function.
Edit: Just a note, the tbl command will start building a SQL statement and get column names, but it won't run anything to pull data until you call collect(). At that point, R will send the query to the server, the server will run the query, and the data will be sent back.
Just keep this in mind, because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it, so you won't know there's an error until you try to collect. For example, a function from the stringr package, str_detect, isn't implemented in dbplyr, so dbplyr would send that command to the database, which would throw an error because it doesn't know what that is, but only after running collect(). Check out the dbplyr page linked above for details.
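As a quick illustration of this pass-through behaviour (a sketch using dbplyr's simulated connections; SOME_DB_FUNCTION is a made-up name with no translation):
library(dplyr)
library(dbplyr)

tbl_lazy(mtcars, con = simulate_postgres()) %>%
  mutate(flag = SOME_DB_FUNCTION(mpg)) %>%
  show_query()
# dbplyr leaves the SOME_DB_FUNCTION(...) call in the generated SQL untouched;
# a real database would only raise an error once the query is actually executed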
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

new_con <- dbConnect(
  odbc(),
  Driver = "ODBC Driver 17 for SQL Server",  # as an example
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)

sample_query <- dbplyr::tbl(
  new_con,
  dbplyr::sql(
    "select a.*, b.*
     from table_a a
     inner join table_b b
     on (a.date_1 between b.date_2 and b.date_3)"
  )
)

sample_data <- sample_query %>%
  filter(silly_example == TRUE) %>%
  collect()
I agree with @Roger-123's approach. But here is a variation that might assist:
Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:
remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")

# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)

output = remote_1 %>%
  # cross join via the dummy column
  inner_join(remote_2, by = "ones", suffix = c("_1", "_2")) %>%
  # calculate Levenshtein distance between the two id1 columns
  mutate(distance = le_dst(id1_1, id1_2)) %>%
  # filter to close matches
  filter(distance <= 2)
Notes:
dbplyr does not allow for complex conditions in its joins. Hence we do the most general join possible and then filter.
If you also want joins by date, then you can put them into the inner_join if the conditions are simple, or create another filter condition if they are complex.
le_dst is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.
Netezza accepts two distance functions for text: le_dst and dle_dst. You can use whichever you please here.
Output is a query; it will act like a table, but it is generated/calculated on the fly. It has not been written to disk or loaded into R memory. Depending on your application, you will want to store/save this (see the sketch below).
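For example, to keep or retrieve the matches you could do something like this (a sketch; "fuzzy_matches" is an arbitrary table name):
# pull the matches into local R memory
matches_local = collect(output)

# or materialise them as a permanent table on the database without downloading
matches_remote = compute(output, name = "fuzzy_matches", temporary = FALSE)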
I would like to iterate this SQL query over the 17 rows in my df. My df and code are below. I think I may need single quotes around dat$ptt_id, because I get a syntax error at the "IN" function. Any ideas how to correctly write this?
df looks like:
ptt_id
1 181787
2 181788
3 184073
4 184098
5 197601
6 197602
7 197603
8 197604
9 197605
10 197606
11 197607
12 197608
13 197609
14 200853
15 200854
16 200851
17 200852
#Load data----
dat <- read.csv("ptts.csv")
dat2<-list(dat)
#Send to database----
for(i in 1:nrow(dat)){
q <- paste("SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", dat$ptt_id[i],";")
#Get query----
tags <- dbGetQuery(conn, q)
}
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "181787"
LINE 3: WHERE sat_appld_id IN 181787 ;
^
Thanks for any assistance.
Two options:
Parameter binding.
qmarks <- paste0("(", paste(rep("?", nrow(dat)), collapse = ","), ")")
qmarks
# [1] "(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
qry <- paste(
"SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", qmarks)
tags <- dbGetQuery(conn, qry, params = as.list(dat$ptt_id))
Temporary table. This might be more useful when you have a large number of ids to use. (This also works without a temp table if you get the ids from the database anyway, and can use that query in this sub-query.)
dbWriteTable(conn, "mytemp", dat)
qry <- "SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN (select ptt_id from mytemp)"
tags <- dbGetQuery(conn, qry)
dbExecute(conn, "drop table mytemp")
(Name your temp table carefully. DBMSes usually have a nomenclature to ensure the table is automatically cleaned/dropped when you disconnect, often something like "#mytemp". Check with your DBMS or DBA.)
The IN operator requires a list. You can think of it as multiple OR conditions.
E.g. instead of WHERE sat_appld_id IN 181787 it should be WHERE sat_appld_id IN (181787)
And to that point, instead of a loop you could create a list from your dat$ptt_id column for just one SQL query, such as WHERE sat_appld_id IN (181787, 181788, 184073, ...), and do any additional wrangling within your R code instead of making multiple database queries.
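For example (a sketch; conn and dat are as in the question):
id_list <- paste(dat$ptt_id, collapse = ", ")
q <- paste0(
  "SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
   FROM main.capev JOIN uid.turtles USING (orgnl_pit)
   WHERE sat_appld_id IN (", id_list, ")")
tags <- dbGetQuery(conn, q)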
dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can't seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can't seem to modify them afterward. In fact, I don't see `df` anywhere in the query object or its attributes (it doesn't have any), so now I have no idea how dbplyr processes table names at all.
MWE:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)
query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I'd like to have R scripts that produce SQL syntax that's ready to run in that system, even though R can't connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it'd be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it's really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)
Created on 2019-12-12 by the reprex package (v0.3.0)
I am not aware of any way to rename the simulated tables. According to the documentation, the important point of the simulate_* functions is to test database translation without actually connecting to a database.
When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:
# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM `df`
# actual
df = tbl(database_connection_object, 'db_table_name')
df %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"
Not only does df get replaced by the table name, but the * in the simulated query is replaced by column names in the second query.
One option you might consider if it is important to generate SQL scripts via simulation is converting to text, replacing, and converting back. For example:
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% sql_render() %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)
# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))
remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
As I'm starting to learn SQL my first big lesson is that the syntax will be different dependent on database type.
For example, this type of query will often work:
mtcars3 <- DBI::dbGetQuery(con, "SELECT * FROM mtcars LIMIT 5")
But on some SQL databases I try SELECT * FROM xyz LIMIT 5 and I get a syntax error. I then try something along the lines of:
DBI::dbGetQuery(con, "SELECT TOP 5 * FROM xyz")
and I'm able to get the result I want.
This makes me very curious as to what will happen when I start using dbplyr exclusively and forego using SQL queries at all (to the extent possible). Is dbplyr going to be "smart" enough to recognize the different SQL databases I'm working in? And more importantly will dbplyr apply the correct syntax, dependent on database type?
Yes, dbplyr is 'smart' enough to use the connection type of the table when translating dplyr commands to database (SQL) syntax. Consider the following example:
library(dplyr)
library(dbplyr)
data(mtcars)
# mimic postgre database
df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
df_postgre %>% head(5) %>% show_query()
# resulting SQL translation
<SQL>
SELECT *
FROM `df`
LIMIT 5
# mimic MS server database
df_server = tbl_lazy(mtcars, con = simulate_mssql())
df_server %>% head(5) %>% show_query()
# resulting SQL translation
<SQL>
SELECT TOP(5) *
FROM `df`
You can experiment with the different simulate_* functions in dbplyr to check translations for your particular database.
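For instance (a sketch; simulate_oracle() is one of the provided simulators, and the exact SQL depends on your dbplyr version):
# mimic an Oracle database
df_oracle = tbl_lazy(mtcars, con = simulate_oracle())
df_oracle %>% head(5) %>% show_query()
# Oracle has neither LIMIT nor TOP, so the translation uses Oracle's own
# row-limiting syntax instead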
Say I have a remote tbl open using dbplyr, and I want to use a SQL query on it (maybe because there's no dbplyr translation for what I want to do). How do I run that query so it returns a remote tbl object?
The DBI::dbGetQuery() function allows you to send a query to the db, but it returns a data frame in memory, not a remote tbl object.
For example, say you already have a connection con open to a db, you can create a table like this:
library(tidyverse)
x_df <- expand.grid(A = c('a','b','c'), B = c('d','e','f', 'g','h')) %>%
mutate(C = round(rnorm(15), 2))
DBI::dbWriteTable(conn = con,
name = "x_tbl",
value = x_df,
overwrite = TRUE)
x_tbl = tbl(con, 'x_tbl')
sql_query <- build_sql('SELECT a, b, c, avg(c) OVER (PARTITION BY a) AS mean_c FROM x_tbl', con = con)
y_df <- DBI::dbGetQuery(con, sql_query) # This returns a data frame on memory
y_tbl <- x_tbl %>%
group_by(a) %>%
mutate(mean_c = mean(c))
show_query(y_tbl) # This is a remote tbl object
In this case, I could just use y_tbl. But there are cases in which the function has not been translated on dbplyr (for example, quantile doesn't work), and I need to use SQL code. But I don't want to collect the result, I want it to create a remote tbl object. Is there a way I can give a SQL query (like with dbGetQuery()) but have it return a remote tbl?
Thank you
Well, playing with how it works, I think I found a way. You can give a sql query inside the mutate function:
y_tbl <- x_tbl %>%
group_by(a) %>%
mutate(mean_c = sql("avg(c) OVER (PARTITION BY a)"))
show_query(y_tbl) # This is a remote tbl object
This will let you give a SQL definition of a variable without having to compute the table too.
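The same trick helps for functions with no dbplyr translation, such as quantiles (a sketch; it assumes your database supports PERCENTILE_CONT as a window function, as SQL Server and Oracle do):
y_tbl <- x_tbl %>%
  mutate(median_c = sql("PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) OVER (PARTITION BY a)"))
show_query(y_tbl)  # still a remote tbl object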
As I understand it, there is a collection of standard translations that dbplyr makes from dplyr to SQL. Anything that falls outside these translations is left as is.
For example, DATEFROMPARTS is an SQL function but not an R function. I commonly use the following mutate:
y_tbl <- x_tbl %>%
mutate(new_date = DATEFROMPARTS(year_col, month_col, day_col))
And because there is no defined translation from an R function DATEFROMPARTS to an SQL function (because the R function does not exist in dplyr) it is left as is.
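You can check what will be sent without a live connection by using a simulated one (a sketch; the column names are made up):
library(dplyr)
library(dbplyr)

tbl_lazy(data.frame(year_col = 2020, month_col = 1, day_col = 15),
         con = simulate_mssql()) %>%
  mutate(new_date = DATEFROMPARTS(year_col, month_col, day_col)) %>%
  show_query()
# the DATEFROMPARTS(...) call appears in the generated SQL unchanged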