How to pass data.frame into SQL "IN" condition using R?

I am reading a list of values from a CSV file in R and trying to pass the values into the IN condition of a SQL query (dbGetQuery). Can someone help me out with this?
library(rJava)
library(RJDBC)
library(dbplyr)
library(tibble)
library(DBI)
library(RODBC)
library(data.table)
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver",classPath="C://Users/********/Oracle_JDBC/ojdbc6.jar")
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:Rahul#//Host/DB", "User_name", "Password")
## Setting working directory for the data
setwd("C:\\Users\\**********\\Desktop")
## reading csv file into data frame
pii<-read.csv("sample.csv")
pii
PII_ID
S0094-5765(17)31236-5
S0094-5765(17)31420-0
S0094-5765(17)31508-4
S0094-5765(17)31522-9
S0094-5765(17)30772-5
S0094-5765(17)30773-7
PII_ID1<-dbplyr::build_sql(pii$PII_ID)
PII_ID1
<SQL> ('S0094-5765(17)31236-5', 'S0094-5765(17)31420-0', 'S0094-5765(17)31508-4', 'S0094-5765(17)31522-9', 'S0094-5765(17)30772-5', 'S0094-5765(17)30773-7')
Data<-dbGetQuery(jdbcConnection, "SELECT ARTICLE_ID FROM JRBI_OWNER.JRBI_ARTICLE_DIM WHERE PII_ID in ?",(PII_ID1))
Expected:
ARTICLE_ID
12345
23456
12356
14567
13456
Actual result:
[1] ARTICLE_ID
<0 rows> (or 0-length row.names)

The SQL you pass as the second argument to dbGetQuery is just a text string, so you can construct it using paste() or an equivalent.
You are after something like the following:
in_clause <- paste0("('", paste0(pii$PII_ID, collapse = "', '"), "')")
sql_text <- paste0("SELECT ARTICLE_ID
FROM JRBI_OWNER.JRBI_ARTICLE_DIM
WHERE PII_ID IN ", in_clause)
data <- dbGetQuery(jdbcConnection, sql_text)
However, the exact form of the first paste0 depends on the format of PII_ID (I have assumed it is text) and how this format is represented in SQL (I have assumed single quotes).
Be sure to check sql_text is valid SQL before passing it to dbGetQuery.
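If the values might themselves contain quotes, DBI's dbQuoteString() can do the escaping for you. A minimal sketch, assuming the JDBC driver supports DBI's default quoting generics:
quoted <- DBI::dbQuoteString(jdbcConnection, as.character(pii$PII_ID))  # escapes any embedded quotes
in_clause <- paste0("(", paste(quoted, collapse = ", "), ")")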
IMPORTANT: This approach is only suitable when pii contains a small number of values (I recommend fewer than 10). If pii contains a large number of values your query will be very large and will run very slowly. If you have many values in pii then a better approach would be a join or semi-join as per #nicola's comment.
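For illustration, a minimal sketch of that temporary-table variant (the table name is illustrative, and temporary = TRUE may not be supported by every driver):
# copy the ids into a scratch table and let the database do the matching
dbWriteTable(jdbcConnection, "PII_TEMP", pii, temporary = TRUE)
data <- dbGetQuery(jdbcConnection,
  "SELECT d.ARTICLE_ID
   FROM JRBI_OWNER.JRBI_ARTICLE_DIM d
   JOIN PII_TEMP t ON t.PII_ID = d.PII_ID")
dbRemoveTable(jdbcConnection, "PII_TEMP")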

Related

For Loop to iterate SQL query in R

I would like to iterate this SQL query over the 17 rows in my df. My df and code are below. I think I may need single quotes around dat$ptt_id, because I get a syntax error at the "IN" function. Any ideas how to correctly write this?
df looks like:
ptt_id
1 181787
2 181788
3 184073
4 184098
5 197601
6 197602
7 197603
8 197604
9 197605
10 197606
11 197607
12 197608
13 197609
14 200853
15 200854
16 200851
17 200852
#Load data----
dat <- read.csv("ptts.csv")
dat2<-list(dat)
#Send to database----
for (i in 1:nrow(dat)) {
  q <- paste("SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
    FROM main.capev JOIN uid.turtles USING (orgnl_pit)
    WHERE sat_appld_id IN", dat$ptt_id[i], ";")
  #Get query----
  tags <- dbGetQuery(conn, q)
}
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "181787"
LINE 3: WHERE sat_appld_id IN 181787 ;
^
Thanks for any assistance.
Two options:
1. Parameter binding:
qmarks <- paste0("(", paste(rep("?", nrow(dat)), collapse = ","), ")")
qmarks
# [1] "(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
qry <- paste(
"SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN", qmarks)
tags <- dbGetQuery(conn, qry, params = dat[,1])
2. Temporary table. This might be more useful when you have a large number of ids to use. (This also works without a temp table if you get the ids from the database anyway, and can use that query in this sub-query.)
dbWriteTable(conn, "mytemp", dat)
qry <- "SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
FROM main.capev JOIN uid.turtles USING (orgnl_pit)
WHERE sat_appld_id IN (SELECT ptt_id FROM mytemp)"
tags <- dbGetQuery(conn, qry)
dbExecute(conn, "drop table mytemp")
(Name your temp table carefully. DBMSes usually have a nomenclature to ensure the table is automatically cleaned/dropped when you disconnect, often something like "#mytemp". Check with your DBMS or DBA.)
The IN operator requires a list. You can think of it as multiple OR conditions.
E.g. instead of WHERE sat_appld_id IN 181787 it should be WHERE sat_appld_id IN (181787)
And to that point, instead of a loop you could create a list from your dat$ptt_id column for just one SQL query, such as WHERE sat_appld_id IN (181787, 181788, 184073, ...), and do any additional wrangling in your R code instead of making multiple database queries.
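For example, a minimal sketch of that single query (assuming the ids are numeric, so no quoting is needed):
ids <- paste(dat$ptt_id, collapse = ", ")  # "181787, 181788, 184073, ..."
q <- paste0("SELECT orgnl_pit, t_name, cap_date, species, sex, mass, cap_lat, cap_lon, sat_appld_id
  FROM main.capev JOIN uid.turtles USING (orgnl_pit)
  WHERE sat_appld_id IN (", ids, ");")
tags <- dbGetQuery(conn, q)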

Can you name dbplyr's simulated lazy tables?

dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can't seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can't seem to modify them afterward. In fact, I don't see `df` anywhere in the query object or its attributes (it doesn't have any), so now I have no idea how dbplyr processes table names at all.
MWE:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)
query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I'd like to have R scripts that produce SQL syntax that's ready to run in that system, even though R can't connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it'd be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it's really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)
Created on 2019-12-12 by the reprex package (v0.3.0)
I am not aware of any way to rename the simulated tables. According to the documentation, the important point of the simulate_* functions is to test database translation without actually connecting to a database.
When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:
# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM `df`
# actual
df = tbl(database_connection_object, 'db_table_name')
df %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"
Not only does df get replaced by the table name, but the * in the simulated query is replaced by column names in the second query.
One option you might consider if it is important to generate SQL scripts via simulation is converting to text, replacing, and converting back. For example:
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% dbplyr::sql_render() %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)
# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))
remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]

Putting output from sql query into another query using R environment

I am wondering what approach should be selected to perform the action from the title. I am using an ODBC connection, and what I get from the first SQL query is about 40-50 rows in one column. What I want is to use this output as the values to search for.
How should I treat this? Like an array or as separate variables? I still do not know R well, so I just need to know where to look.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and the characters are organized like this after these operations:
'c(1234567890, , 1234567890, 1234567890)'
Almost everything looks fine and fits into my query, apart from the additional c character at the beginning and the missing apostrophes. I tried to use the gsub function but it did not work the way I want.
You may likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If you have a space as the first or last character of a string, you can use this code:
`list<-c(" first element is a space","try","this","last element is a space ")` #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think you are expecting something like the code shown below:
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"

Limit number of characters imported from SQL in R

I am using the sqlQuery function in R to connect the DB with R. I am using the following lines:
for (i in 1:length(Counter)) {
  if (Counter[i] %in% str_sub(dir(), 1, 29) == FALSE) {
    DT <- data.table(sqlQuery(con, paste0(
      "select a.* from edp_data.sme_loan a
       where a.edcode IN (", print(paste0("'", EDCode, "'"), quote = FALSE),
      ") and a.poolcutoffdate in (",
      print(paste0("'", str_sub(PoolCutoffDate, 1, 4), "-", str_sub(PoolCutoffDate, 5, 6), "-",
                   str_sub(PoolCutoffDate, 7, 8), "'"), quote = FALSE),
      ")")))
  }
}
Thus I am importing subsets of the DB by EDCode and PoolCutoffDate. This works perfectly; however, there is one variable in edp_data.sme_loan for one particular EDCode which produces an undesired result.
If I take the unique of this as3 variable for a particular EDCode I get:
unique(DT$as3)
[1] 30003000000000019876240886000 30003000000000028672000424000
In reality there should be more unique IDs in this DB. The problem is that the as3 string in the database is much longer than the one which is imported.
nchar(unique(DT$as3))
[1] 29 29
How can I import more characters of this string? Ideally I do not want to spell out each variable instead of using select a.*, but only to make sure that the full as3 string is imported.
Any help is appreciated!
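One possible explanation (an assumption, since the column type is not shown): if as3 is a high-precision numeric column, the values may be converted to R doubles, which carry only about 15-16 significant digits, so long identifiers collapse into a few distinct values rather than being truncated as text. RODBC's sqlQuery() takes an as.is argument that suppresses this conversion; a minimal sketch, where query_text stands for the query built above:
DT <- data.table(sqlQuery(con, query_text, as.is = TRUE))  # as.is = TRUE keeps columns as character strings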

RODBC Multiple Inputs from Shiny

I have a Shiny app that has a checkbox group input. The user can select multiple options. I also have an ODBC connection linked to a database. The process would be that when a user selects items from the checkbox group, that user input becomes part of a string in the SQL query used to filter the data.
UI.R (partial to show example)
checkboxGroupInput('Type', 'Type', c(
"AX"="AX",
"AY"="AY",
"AZ"="AZ",
"BGB"="BGB",
"BT"="BT",
"BX"="BX",
"BXT"="BXT",
"C"="C",
"CNT"="CNT")),
The "Type" information is stored in the TYPE column of the COMPONENT table, so my SQL query using RODBC is
data <- odbcConnect("database", uid="username", pwd="password")
query <- (SELECT ID, NAME, TYPE FROM COMPONENT WHERE TYPE LIKE Input$Type)
df <- odbcQuery(data, query)
The query line would not work, but I have no idea how to take multiple inputs and place them properly in the query. Also, there is an added level of complexity that I am not sure how to handle. The data in the database is alphanumeric, so instead of AX, it might be listed as AX14 or AX 71. Also, because there are some one-letter types, using a wildcard seems a little difficult.
To answer your initial question regarding "multiple inputs in the query", I use concatenation to achieve this.
Using paste0(), I write something as follows:
type = "AX14"
myQuery <- paste0("Select variable1, variable2 from my_table where type like ",type)
myQuery
[1] "Select variable1, variable2 from my_table where type like AX14"
You can add little things like single quotes or wildcard operators as follows:
myQuery <- paste0("Select variable1, variable2 from my_table where type like '%",type,"%'")
myQuery
[1] "Select variable1, variable2 from my_table where type like '%AX14%'"
Then proceed with actually running the query:
df <- odbcQuery(data, myQuery)
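As for combining several selections from the checkbox group: input$Type is a character vector, so one option (a sketch; note that a trailing wildcard means a one-letter type like C will also match CNT) is to build one LIKE clause per selection and join them with OR:
conds <- paste0("TYPE LIKE '", input$Type, "%'")  # e.g. "TYPE LIKE 'AX%'" matches AX14 and AX 71
myQuery <- paste0("SELECT ID, NAME, TYPE FROM COMPONENT WHERE ",
                  paste(conds, collapse = " OR "))
df <- sqlQuery(data, myQuery)  # RODBC's sqlQuery returns a data frame directly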