R query sql from data.table values list - sql

given the following data:
library(data.table)
Name <- c('Plato','Hegel','Nietzsche')
Amount <- c(10,20,30)
ID <- c('x01','x02','y02')
dt <- data.table(Name, Amount, ID)
philos <- dt[["Name"]]
philos
how can I create philos in a way that is possible to run:
query <- glue_sql("select * from philosophers where names in ({philos*})", .con = con)
I would need to get something like that is understood in sql like:
names in ('Plato','Hegel','Nietzsche')
Thanks!

You can use philos as you currently have constructed it; just pass it as a named argument to glue_sql():
glue::glue_sql(
"select * FROM philosophers where names in ({philos*})",
philos = philos,
.con=con
)
<SQL> select * FROM philosophers where names in ('Plato', 'Hegel', 'Nietzsche')

Related

For Loop to iterate SQL query in R

I would like to iterate this SQL query over the 17 rows in my df. My df and code are below. I think I may need single quotes around dat$ptt_id, because I get a syntax error at the "IN" function. Any ideas how to correctly write this?
df looks like:
ptt_id
1 181787
2 181788
3 184073
4 184098
5 197601
6 197602
7 197603
8 197604
9 197605
10 197606
11 197607
12 197608
13 197609
14 200853
15 200854
16 200851
17 200852
#Load data----
dat <- read.csv("ptts.csv")
dat2<-list(dat)
#Send to database----
for(i in 1:nrow(dat)){
q <- paste("SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", dat$ptt_id[i],";")
#Get query----
tags <- dbGetQuery(conn, q)
}
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "181787"
LINE 3: WHERE sat_appld_id IN 181787 ;
^
Thanks for any assistance.
Two options:
Parameter binding.
qmarks <- paste0("(", paste(rep("?", nrow(df)), collapse = ","), ")")
qmarks
# [1] "(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
qry <- paste(
"SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", qmarks)
tags <- dbGetQuery(conn, qry, params = df[,1])
Temporary table. This might be more useful when you have a large number of ids to use. (This also works without a temp table if you get the ids from the database anyway, and can use that query in this sub-query.)
dbWriteTable(conn, df, "mytemp")
qry <- "SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN (select id from mytemp)"
tags <- dbGetQuery(conn, qry)
dbExecute(conn, "drop table mytemp")
(Name your temp table carefully. DBMSes usually have a nomenclature to ensure the table is automatically cleaned/dropped when you disconnect, often something like "#mytemp". Check with your DBMS or DBA.)
The IN operator requires a list. You can think of it as multiple OR conditions.
E.g. instead of WHERE sat_appld_id IN 181787 it should be WHERE sat_appld_id IN (181787)
And to that point, instead of a loop you could create a list from your dat$ptt_id column for just one sql query such as WHERE sat_appld_id IN (181787, 181788, 184073, ...) and do any additional wraggling within your R code instead of making multiple database queries.

How to pipe SQL into R's dplyr?

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct() but it's not supported in SQL syntax. Anyways, this does indeed work:
dbGetQuery(database_name,
"SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
left_join(sql_table_1, sql_table_2, by = "col5") %>%
sql("SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?
In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you are wanting to combine custom SQL code with auto-generated SQL code from dbplyr. For this it is important to distinguish between:
DBI::db* commands - that execute the provided SQL on the database and return the result.
dbplyr translation - where you work with a remote connection to a table
You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
Reference examples that cover many of the different use cases
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:
union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
write_to_datebase function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute
Automatic piping
dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which there are SQL translations defined. So long as sql translations are defined you can chain together many pipes (I used 10 or more at once) with the (almost) only disadvantage being that the sql translated query gets difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)
tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1") %>%
distinct()
When you then call show_query(df) R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
But not as nicely formatted. Note that the initial command (left join) appears as a nested query, with a distinct in the outer query. Hence df is an R link to a remote database table defined by the above sql query.
Creating custom SQL functions
You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df){
db_connection <- df$src$con
sql_query <- build_sql(con = db_connection,
"SELECT DISTINCT * FROM (\n",
sql_render(df),
") AS nested_tbl"
)
return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
custom_distinct()
When you then call show_query(df) R should return the following SQL code (I say 'should' because I can not get this working with simulated sql connections), but not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df is an R link to a remote database table defined by the above sql query.
Converting dbplyr to DBI
You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
as.character(sql_render(df)),
") AS nested_table")
local_table = dbGetQuery(db_connection, custom_distinct2)
Which will return a local R dataframe with the equivalent sql command as per the previous examples.
If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice -- perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
lazy_query
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed <-
lazy_query %>%
compute()
lazy_query_computed
#> # Source: table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002
Created on 2020-01-01 by the reprex package (v0.3.0)
If your SQL dialect supports CTEs, you could also extract the query string and use this as part of a custom SQL, perhaps similarly to Simon's suggestion.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
sql <-
lazy_query %>%
sql_render()
cte_sql <-
paste0(
"WITH my_result AS (", sql, ") ",
"SELECT c + 1 AS d FROM my_result"
)
cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"
DBI::dbGetQuery(
lazy_query$src$con,
cte_sql
)
#> d
#> 1 21
Created on 2020-01-01 by the reprex package (v0.3.0)

How to iteratively run a query on each column in a SQL table with R?

I have a table with multiple columns (colA, colB, colC) and I want to run a query against each of them and store the result so I can use them for comparison purposes later, for example this query to find the ratio of NULL and not NULL values in a column:
SELECT COUNT(*) - COUNT(column), COUNT(column) FROM table;
I have too many columns to do this manually, so I'm looking for a way for it to cycle through each column and store the result. Using a WHILE loop in t-sql doesn't seem to be suitable to this problem, and trying to use for loop with R doesn't work at all:
tableDataColumnName <- names(tableDataDataframe)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, "SELECT COUNT (*) - COUNT(i), COUNT(i) FROM dbo.table;")
}
Is there a way to execute a query multiple times, once for each column in a table, without doing so manually?
You're trying to use a variable within a string (the i). To do this you should either use paste or paste0 from base or something like the glue package
## Base
tableDataColumnName <- names(tableDataDataframe)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, paste0("SELECT COUNT (*) - COUNT(", i, "), COUNT(", i, ") FROM dbo.table;"))
}
## Glue
library(glue)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, glue("SELECT COUNT (*) - COUNT({i}), COUNT({i}) FROM dbo.table;"))
However, you're also overwriting the result on each iteration of the loop. My solution for the whole problem would be something like the following:
library(glue)
tableDataColumnName <- names(tableDataDataframe)
nullColumnNumber <- numeric(length(tableDataColumnName))
for (i in seq_along(tableDataColumnName)){
nullColumnNumber[i] <- dbGetQuery(con, glue("SELECT COUNT (*) - COUNT({tableDataColumnName[i]}), COUNT({tableDataColumnName[i]}) FROM dbo.table;"))
}

Regular expression in dbGetQuery does not work

I want to get all rows in my database where a condition with regular expressions is met. The variable should start with "J12", "J13", "J14" or "J15".
This was my attempt:
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz' AND [xyz_DIAG] LIKE '^J1[2-5]' ")
Then a data.frame with 0 rows is returned.
When I send the query
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz'")
I get a quite large data.frame and then I call
Data %>% setDT %>% .[str_detect(xyz_DIAG, "^J1[2-5]")] and I get the expected result because in fact there are many rows that fulfill that regexp. Have I done something wrong?
For the time being, REGEXP operator has not been added to RSQLITE, see this pull request.
You thus need to "unwrap" the regex and use ORed LIKE:
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz' AND ([xyz_DIAG] LIKE 'J12%' OR [xyz_DIAG] LIKE 'J13%' OR [xyz_DIAG] LIKE 'J14%' OR [xyz_DIAG] LIKE 'J15%') ")

Putting output from sql query into another query using R environment

I am wondering what approach should have been selected to perform action from title. I am using ODBC connection and what I get from first sql query are like 40-50 rows in one column. What I want is to put this output as a values in to search for.
How should i treat this? Like a array or separated variables? I still do not know R well so just need to know where to search for.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and characters are organized like this after this operations:
'c(1234567890, , 1234567890, 1234567890)'
and almost all looks fine and fit into my query besides additional c character at the beginning and missing apostrophes.I try to use gsub function but did not work in way I want.
You may likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
Is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If yuo have space as first or last element of the string you can use this code:
`list<-c(" first element is a space","try","this","last element is a space ")` #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think You are expecting some thing like shown below code,
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"