For Loop to iterate SQL query in R - sql

I would like to iterate this SQL query over the 17 rows in my df. My df and code are below. I think I may need single quotes around dat$ptt_id, because I get a syntax error at the "IN" function. Any ideas how to correctly write this?
df looks like:
ptt_id
1 181787
2 181788
3 184073
4 184098
5 197601
6 197602
7 197603
8 197604
9 197605
10 197606
11 197607
12 197608
13 197609
14 200853
15 200854
16 200851
17 200852
#Load data----
dat <- read.csv("ptts.csv")
dat2<-list(dat)
#Send to database----
for(i in 1:nrow(dat)){
q <- paste("SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", dat$ptt_id[i],";")
#Get query----
tags <- dbGetQuery(conn, q)
}
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "181787"
LINE 3: WHERE sat_appld_id IN 181787 ;
^
Thanks for any assistance.

Two options:
Parameter binding.
qmarks <- paste0("(", paste(rep("?", nrow(df)), collapse = ","), ")")
qmarks
# [1] "(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
qry <- paste(
"SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", qmarks)
tags <- dbGetQuery(conn, qry, params = df[,1])
Temporary table. This might be more useful when you have a large number of ids to use. (This also works without a temp table if you get the ids from the database anyway, and can use that query in this sub-query.)
dbWriteTable(conn, df, "mytemp")
qry <- "SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN (select id from mytemp)"
tags <- dbGetQuery(conn, qry)
dbExecute(conn, "drop table mytemp")
(Name your temp table carefully. DBMSes usually have a nomenclature to ensure the table is automatically cleaned/dropped when you disconnect, often something like "#mytemp". Check with your DBMS or DBA.)

The IN operator requires a list. You can think of it as multiple OR conditions.
E.g. instead of WHERE sat_appld_id IN 181787 it should be WHERE sat_appld_id IN (181787)
And to that point, instead of a loop you could create a list from your dat$ptt_id column for just one sql query such as WHERE sat_appld_id IN (181787, 181788, 184073, ...) and do any additional wraggling within your R code instead of making multiple database queries.

Related

How to pipe SQL into R's dplyr?

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct() but it's not supported in SQL syntax. Anyways, this does indeed work:
dbGetQuery(database_name,
"SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
left_join(sql_table_1, sql_table_2, by = "col5") %>%
sql("SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?
In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you are wanting to combine custom SQL code with auto-generated SQL code from dbplyr. For this it is important to distinguish between:
DBI::db* commands - that execute the provided SQL on the database and return the result.
dbplyr translation - where you work with a remote connection to a table
You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
Reference examples that cover many of the different use cases
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:
union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
write_to_datebase function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute
Automatic piping
dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which there are SQL translations defined. So long as sql translations are defined you can chain together many pipes (I used 10 or more at once) with the (almost) only disadvantage being that the sql translated query gets difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)
tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1") %>%
distinct()
When you then call show_query(df) R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
But not as nicely formatted. Note that the initial command (left join) appears as a nested query, with a distinct in the outer query. Hence df is an R link to a remote database table defined by the above sql query.
Creating custom SQL functions
You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df){
db_connection <- df$src$con
sql_query <- build_sql(con = db_connection,
"SELECT DISTINCT * FROM (\n",
sql_render(df),
") AS nested_tbl"
)
return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
custom_distinct()
When you then call show_query(df) R should return the following SQL code (I say 'should' because I can not get this working with simulated sql connections), but not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df is an R link to a remote database table defined by the above sql query.
Converting dbplyr to DBI
You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
as.character(sql_render(df)),
") AS nested_table")
local_table = dbGetQuery(db_connection, custom_distinct2)
Which will return a local R dataframe with the equivalent sql command as per the previous examples.
If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice -- perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
lazy_query
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed <-
lazy_query %>%
compute()
lazy_query_computed
#> # Source: table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002
Created on 2020-01-01 by the reprex package (v0.3.0)
If your SQL dialect supports CTEs, you could also extract the query string and use this as part of a custom SQL, perhaps similarly to Simon's suggestion.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
memdb_frame(a = 1:3) %>%
mutate(b = a + 1) %>%
summarize(c = sum(a * b, na.rm = TRUE))
sql <-
lazy_query %>%
sql_render()
cte_sql <-
paste0(
"WITH my_result AS (", sql, ") ",
"SELECT c + 1 AS d FROM my_result"
)
cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"
DBI::dbGetQuery(
lazy_query$src$con,
cte_sql
)
#> d
#> 1 21
Created on 2020-01-01 by the reprex package (v0.3.0)

How to iteratively run a query on each column in a SQL table with R?

I have a table with multiple columns (colA, colB, colC) and I want to run a query against each of them and store the result so I can use them for comparison purposes later, for example this query to find the ratio of NULL and not NULL values in a column:
SELECT COUNT(*) - COUNT(column), COUNT(column) FROM table;
I have too many columns to do this manually, so I'm looking for a way for it to cycle through each column and store the result. Using a WHILE loop in t-sql doesn't seem to be suitable to this problem, and trying to use for loop with R doesn't work at all:
tableDataColumnName <- names(tableDataDataframe)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, "SELECT COUNT (*) - COUNT(i), COUNT(i) FROM dbo.table;")
}
Is there a way to execute a query multiple times, once for each column in a table, without doing so manually?
You're trying to use a variable within a string (the i). To do this you should either use paste or paste0 from base or something like the glue package
## Base
tableDataColumnName <- names(tableDataDataframe)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, paste0("SELECT COUNT (*) - COUNT(", i, "), COUNT(", i, ") FROM dbo.table;"))
}
## Glue
library(glue)
for (i in tableDataColumnName){
nullColumnNumber <- dbGetQuery(con, glue("SELECT COUNT (*) - COUNT({i}), COUNT({i}) FROM dbo.table;"))
However, you're also overwriting the result on each iteration of the loop. My solution for the whole problem would be something like the following:
library(glue)
tableDataColumnName <- names(tableDataDataframe)
nullColumnNumber <- numeric(length(tableDataColumnName))
for (i in seq_along(tableDataColumnName)){
nullColumnNumber[i] <- dbGetQuery(con, glue("SELECT COUNT (*) - COUNT({tableDataColumnName[i]}), COUNT({tableDataColumnName[i]}) FROM dbo.table;"))
}

dbReadTable won't pull data but dbListFields will see correct fields

I am trying to pull data from a SQL database that I have access to. I can connect to the database, see the tables and get the fields associated with a given table, but cannot read a table into an R variable.
I'm working in R Studio, in case this makes a difference.
I have tried using online code snippets (new to R) and these work, except for the dbReadTable() examples. I have used both "Payments" and name="Payments" as the second argument, and both with and without "" quotes.
library(DBI)
con<-(dbConnect(odbc::odbc(), .connection_string="Driver={SQL Server},
Server=example_1234
Database=exampleDB
TrustedConnection=TRUE")
testing123 <- dbListFields(con,"Payments")
testing456 <- dbReadTable(con,"Payments")
I expect a connection to the database which is now named con. This works.
I expect testing123 to contain all the fields in "Payments". This also works.
I expect testing456 to be a data.frame copy of Payments. This produces:
Error: 'SELECT * FROM "Payments"
nanodbc/nanobdc.cpp:1587 42s02 [Microsoft][ODBC SQL SERVER DRIVER][SQL SERVER]Invalid pbject name 'Payments'.
It's slightly different without "Payments" as the argument - simply saying "Object "Payments" not found".
Any help much appreciated.
I suspect that it's because your table is in a different catalog or schema.
Rationale: DBI::dbListFields is doing select * from ... limit 0 (which is not correct syntax for sql server), but odbc::dbListFields is really calling a C++ function connection_sql_columns that is SQL Server specific. It might be permitting you to be a touch sloppy in that it will find the table even if you do not specify the catalog and/or schema. This is why your dbListFields is working. However, DBI::dbReadTable is really doing select * from ... under the hood (and odbc:: is not overriding it), so it is not allowing you to omit the schema (and/or catalog).
First, find the specific table information for your case:
DBI::dbGetQuery(con, "select top 1 table_catalog, table_schema, table_name, column_name from information_schema.columns where table_name='events'")
# table_catalog table_schema table_name column_name
# 1 my_catalog dbo Payments Id
(I'm projecting what you'll find.)
From here, try one of the following until it works:
x <- DBI::dbReadTable(con, DBI::SQL("[Payments]")) # equivalent to the original
x <- DBI::dbReadTable(con, DBI::SQL("[dbo].[Payments]"))
x <- DBI::dbReadTable(con, DBI::SQL("[my_catalog].[dbo].[Payments]"))
My guess is that DBI::dbGetQuery(con, "select top 1 * from Payments") will not work, so for "regular queries" you'll need to use the same hierarchy of catalog.schema.table, such as one of
DBI::dbGetQuery(con, "select top 1 * from dbo.Payments")
DBI::dbGetQuery(con, "select top 1 * from [dbo].[Payments]")
DBI::dbGetQuery(con, "select top 1 * from [my_catalog].[dbo].[Payments]")
(The use of the [ and ] quoted-identifier brackets are often a personal preference, strictly required in only some corner cases.)
Try changing your con argument just slightly:
con <- DBI::dbConnect(odbc::odbc(),
Driver = "SQL Server",
Server = "example_1234",
Database = "exampleDB",
TrustedConnection = TRUE)
# read table to df
testing456 <- dbReadTable(con,"Payments")
# you can also use SQL queries directly, such as:
testing789 <- dbGetQuery(con, statement = "SELECT * FROM Payments WHERE ...")

How to pass data.frame into SQL "IN" condition using R?

I am reading list of values from CSV file in R, and trying to pass the values into IN condition of SQL(dbGetQuery). Can some one help me out with this?
library(rJava)
library(RJDBC)
library(dbplyr)
library(tibble)
library(DBI)
library(RODBC)
library(data.table)
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver",classPath="C://Users/********/Oracle_JDBC/ojdbc6.jar")
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:Rahul#//Host/DB", "User_name", "Password")
## Setting working directory for the data
setwd("C:\\Users\\**********\\Desktop")
## reading csv file into data frame
pii<-read.csv("sample.csv")
pii
PII_ID
S0094-5765(17)31236-5
S0094-5765(17)31420-0
S0094-5765(17)31508-4
S0094-5765(17)31522-9
S0094-5765(17)30772-5
S0094-5765(17)30773-7
PII_ID1<-dbplyr::build_sql(pii$PII_ID)
PII_ID1
<SQL> ('S0094-5765(17)31236-5', 'S0094-5765(17)31420-0', 'S0094-5765(17)31508-4', 'S0094-5765(17)31522-9', 'S0094-5765(17)30772-5', 'S0094-5765(17)30773-7')
Data<-dbGetQuery(jdbcConnection, "SELECT ARTICLE_ID FROM JRBI_OWNER.JRBI_ARTICLE_DIM WHERE PII_ID in ?",(PII_ID1))
Expected:
ARTICLE_ID
12345
23456
12356
14567
13456
Actual result:
[1] ARTICLE_ID
<0 rows> (or 0-length row.names)
The SQL code you pass in the second argument to dbGetQuery is just a text string, hence you can construct this using paste or equivalents.
You are after something like the following:
in_clause <- paste0("('", paste0(pii$PII_ID, collapse = "', '"), "')")
sql_text <- paste0("SELECT ARTICLE_ID
FROM JRBI_OWNER.JRBI_ARTICLE_DIM
WHERE PII_ID IN ", in_clause)
data <- dbGetQuery(jdbcConnection, sql_text)
However, the exact form of the first paste0 depends on the format of PII_ID (I have assumed it is text) and how this format is represented in sql (I have assumed single quotes).
Be sure to check sql_text is valid SQL before passing it to dbGetQuery.
IMPORTANT: This approach is only suitable when pii contains a small number of values (I recommend fewer than 10). If pii contains a large number of values your query will be very large and will run very slowly. If you have many values in pii then a better approach would be a join or semi-join as per #nicola's comment.

Putting output from sql query into another query using R environment

I am wondering what approach should have been selected to perform action from title. I am using ODBC connection and what I get from first sql query are like 40-50 rows in one column. What I want is to put this output as a values in to search for.
How should i treat this? Like a array or separated variables? I still do not know R well so just need to know where to search for.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and characters are organized like this after this operations:
'c(1234567890, , 1234567890, 1234567890)'
and almost all looks fine and fit into my query besides additional c character at the beginning and missing apostrophes.I try to use gsub function but did not work in way I want.
You may likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
Is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If yuo have space as first or last element of the string you can use this code:
`list<-c(" first element is a space","try","this","last element is a space ")` #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think You are expecting some thing like shown below code,
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"