Importing an SQL query with comments into R from a file

A few posters have asked similar questions on here and these have taken me 80% of the way toward reading text files with sql queries in them into R to use as input to RODBC:
Import multiline SQL query to single string
RODBC Temporary Table Issue when connecting to MS SQL Server
However, my sql files have quite a few comments in them (as --comment on this and that). My question is, how would one go about either stripping comment lines from query on import, or making sure that the resulting string keeps line breaks, thus not appending actual queries to comments?
For example, query6.sql:
--query 6
select a6.column1,
a6.column2,
count(a6.column3) as counts
--count the number of occurences in table 1
from data.table a6
group by a6.column1
becomes:
sqlStr <- gsub("\t","", paste(readLines(file('SQL/query6.sql', 'r')), collapse = ' '))
sqlStr
"--query 6select a6.column1, a6.column2, count(a6.column3) as counts --count the number of occurences in table 1from data.table a6 group by a6.column1"
when read into R.

Are you sure you can't just use it as is? This works despite taking up multiple lines and having a comment:
> library(sqldf)
> sql <- "select * -- my select statement
+ from BOD
+ "
> sqldf(sql)
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
This works too:
> sql2 <- c("select * -- my select statement", "from BOD")
> sql2.paste <- paste(sql2, collapse = "\n")
> sqldf(sql2.paste)
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
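The same idea can be applied when the query lives in a file. This is a minimal sketch, assuming a made-up temporary file with `--` comments: collapsing the lines with "\n" rather than a space means every comment still ends at its own line break.

```r
# Write a small query with comments to a temporary file (illustrative only)
path <- tempfile(fileext = ".sql")
writeLines(c("--query 6",
             "select a6.column1, -- first column",
             "count(a6.column3) as counts",
             "from data.table a6",
             "group by a6.column1"), path)

# Collapse with a newline so each "--" comment stops at the end of its line
sqlStr <- paste(readLines(path), collapse = "\n")
cat(sqlStr, "\n")

unlink(path)
```

Whether the comments are then accepted as-is depends on the database driver, so this is worth testing against your own connection.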

I had trouble with the other answer, so I modified Roman's and made a little function. This has worked for all my test cases, including multiple comments, single-line and partial-line comments.
read.sql <- function(filename, silent = TRUE) {
q <- readLines(filename, warn = !silent)
q <- q[!grepl(pattern = "^\\s*--", x = q)] # remove full-line comments
q <- sub(pattern = "--.*", replacement="", x = q) # remove midline comments
q <- paste(q, collapse = " ")
return(q)
}
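A quick way to try the function, sketched with a made-up temporary file. The second call illustrates one limitation that follows from the regular expression: a `--` inside a quoted SQL string literal is stripped as well, so this approach assumes your queries do not contain such literals.

```r
read.sql <- function(filename, silent = TRUE) {
  q <- readLines(filename, warn = !silent)
  q <- q[!grepl(pattern = "^\\s*--", x = q)]           # remove full-line comments
  q <- sub(pattern = "--.*", replacement = "", x = q)  # remove midline comments
  paste(q, collapse = " ")
}

tmp <- tempfile(fileext = ".sql")
writeLines(c("--query 6",
             "select a6.column1, -- first column",
             "count(a6.column3) as counts",
             "from data.table a6",
             "group by a6.column1"), tmp)
res <- read.sql(tmp)
print(res)

# Limitation: "--" inside a string literal is treated as a comment too
writeLines("select 'a--b' as x", tmp)
broken <- read.sql(tmp)
print(broken)  # everything from "--" onward is removed, leaving an open quote

unlink(tmp)
```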

Summary
Function clean_query:
Removes all mixed comments
Creates single string output
Takes a SQL path or text string
Is simple
Function
library(tidyverse) # provides stringr, readr and the %>% pipe used below
# pass in either a text query or path to a sql file
clean_query <- function( text_or_path = '//example/path/to/some_query.sql' ){
# if sql path, read, otherwise assume text input
if( str_detect(text_or_path, "(?i)\\.sql$") ){
text_or_path <- text_or_path %>% read_lines() %>% str_c(sep = " ", collapse = "\n")
}
# echo original query to the console
# (unnecessary, but helpful for status if passing sequential queries to a db)
cat("\nThe query you're processing is: \n", text_or_path, "\n\n")
# return
text_or_path %>%
# remove all demarked /* */ sql comments
gsub(pattern = '/\\*.*?\\*/', replacement = ' ') %>%
# remove all demarked -- comments
gsub(pattern = '--[^\r\n]*', replacement = ' ') %>%
# remove everything after the query-end semicolon
gsub(pattern = ';.*', replacement = ' ') %>%
#remove any line break, tab, etc.
gsub(pattern = '[\r\n\t\f\v]', replacement = ' ') %>%
# remove extra whitespace
gsub(pattern = ' +', replacement = ' ')
}
You could attach regexps together if you want incomprehensibly long expressions, but I recommend readable code.
Output for "query6.sql"
[1] " select a6.column1, a6.column2, count(a6.column3) as counts from data.table a6 group by a6.column1 "
Additional Text Input Example
query <- "
/* this query has
intentionally messy
comments
*/
Select
COL_A -- with a comment here
,COL_B
,COL_C
FROM
-- and some helpful comment here
Database.Datatable
;
-- or wherever
/* and some more comments here */
"
Call function:
clean_query(query)
Output:
[1] " Select COL_A ,COL_B ,COL_C FROM Database.Datatable "
If you want to test reading from a .sql file:
temp_path <- path.expand("~/query.sql")
cat(query, file = temp_path)
clean_query(temp_path)
file.remove(temp_path)

Something like this?
> cat("--query 6
+ select a6.column1,
+ a6.column2,
+ count(a6.column3) as counts
+ --count the number of occurences in table 1
+ from data.table a6
+ group by a6.column1", file = "query6.sql")
>
> my.q <- readLines("query6.sql")
Warning message:
In readLines("query6.sql") : incomplete final line found on 'query6.sql'
> my.q
[1] "--query 6" "select a6.column1, "
[3] "a6.column2," "count(a6.column3) as counts"
[5] "--count the number of occurences in table 1 " "from data.table a6"
[7] "group by a6.column1"
> find.com <- grepl("--", my.q)
>
> my.q <- my.q[!find.com]
> paste(my.q, collapse = " ")
[1] "select a6.column1, a6.column2, count(a6.column3) as counts from data.table a6 group by a6.column1"
>
> unlink("query6.sql")
> rm(list = ls())

I had to solve a similar problem lately in another language, and I still find it easier to implement in R:
readSQLFile <- function(fname, retainNewLines=FALSE) {
lines <- readLines(fname)
#remove -- type comments
lines <- vapply(lines, function(x) {
#handle /* -- */ type comments
if (grepl("/\\*(.*)--", x))
return(x)
strsplit(x,"--")[[1]][1]
}, character(1))
#remove /* */ type comments
sqlstr <- paste(lines, collapse=ifelse(retainNewLines, "&&&&&&&&&&" , " "))
sqlstr <- gsub("/\\*(.|\n)*?\\*/","",sqlstr)
if (retainNewLines) {
sqlstr <- strsplit(sqlstr, "&&&&&&&&&&")[[1]]
sqlstr <- sqlstr[sqlstr!=""]
}
sqlstr
} #readSQLFile
#example
fname <- tempfile("sql",fileext=".sql")
cat("--query 6
select a6.column1, --trailing comments
a6.column2, ---test triple -
count(a6.column3) as counts, --/* funny comment */
a6.column3 - a6.column4 ---test single -
/*count the number of occurences in table 1;
test another comment style
*/
from data.table a6 /* --1st weirdo comment */
/* --2nd weirdo comment */
group by a6.column1\n", file=fname)
#remove new lines
readSQLFile(fname)
#retain new lines
readSQLFile(fname, TRUE)
unlink(fname)

It's possible to use readChar() instead of readLines(). I had an ongoing issue with mixed commenting (-- or /* */) and this has always worked well for me.
sql <- readChar(path.to.file, file.size(path.to.file))
query <- sqlQuery(con, sql, stringsAsFactors = TRUE)
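To see what readChar() returns without needing a database connection, here is a small sketch using a made-up temporary file. The whole file comes back as one string with its line breaks intact, which is why comments do not swallow the following lines.

```r
# Illustrative file with both comment styles
path.to.file <- tempfile(fileext = ".sql")
writeLines(c("/* mixed",
             "   comments */",
             "select 1 as one -- trailing comment"), path.to.file)

# Read the entire file as a single string; line breaks are preserved
sql <- readChar(path.to.file, file.size(path.to.file))
cat(sql)

unlink(path.to.file)
```

Note that file.size() counts bytes while readChar() counts characters, so for plain ASCII files the two agree; files with multibyte characters are simply read to the end.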

Related

Remove Duplicate Columns via sqlQuery()

I am working with the R programming language. Suppose I have the following data frame:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
head(testframe)
age height height2 gender gender2
1 18 76.1 76.1 M M
2 19 77.0 77.0 F F
3 20 78.1 78.1 M M
4 21 78.2 78.2 M M
5 22 78.8 78.8 F F
6 23 79.7 79.7 F F
If I want to remove columns with different names but identical values, I can use the following line of code:
no_dup = testframe[!duplicated(as.list(testframe))]
head(no_dup)
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
My Question: Suppose the data frame is not located in the global environment - is it possible to pass the above line of code through a sqlQuery() command? For example:
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
#not sure if this is correct?
sample_query = sqlQuery(con, "testframe[!duplicated(as.list(testframe))]")
Can someone please show me how to do this?
Thanks!
This does all the substantive processing on the SQL side and only does name manipulation on the R side. The database is not downloaded to R.
The first pipeline inputs the names (we have hard coded the names in Names but you can retrieve them from the database if necessary) and returns an SQL statement, sql1, that when run against your database will produce a one line data frame from the data base that has a column for each pair of variables in testframe and whose value is the number of unequal values.
We then run sql1 using sqldf for reproducibility but you can replace that with an appropriate call to sqlQuery.
The second pipeline then uses numDF to generate one or more SQL statements in character vector sql2 to drop the duplicated columns, i.e. those for which there are zero unequal values. You can then run those SQL statements against your database.
We used sqldf with SQLite for reproducibility but you can replace the calls to sqldf with an appropriately modified call to sqlQuery, e.g. sqlQuery(con, sql1) where con is the connection you have previously defined.
It is likely that whatever database system you are using accepts the same SQL but if not there may be small changes needed in the code to generate SQL accepted by whatever it is you are using.
library(magrittr)
library(sqldf)
Names <- c("age", "height", "height2", "gender", "gender2")
sql1 <- Names %>%
{ toString(sprintf("sum(%s)", combn(., 2, paste, collapse = "!="))) } %>%
paste("select", ., "from testframe")
numDF <- sqldf(sql1) # replace with call to your database
sql2 <- numDF %>%
Filter(Negate(c), .) %>%
names %>%
sub(".*!=(.*.).", "alter table testframe drop \\1", .)
# Just run the sql2 part against your db, not select * ... part.
# The select * ... downloads table for demo purposes only.
sqldf(c(sql2, "select * from testframe")) # replace
## age height gender
## 1 18 76.1 M
## 2 19 77.0 F
## 3 20 78.1 M
## 4 21 78.2 M
## ...snip...
Note that sql1 and sql2 are the following. sql1 is a single sql select statement and sql2 is a vector of sql alter statements, one statement per column to drop. If your data base allows ALTER to drop multiple columns at once you might be able to simplify that but SQLite only allows one at a time.
sql1
## [1] "select sum(age!=height), sum(age!=height2), sum(age!=gender), sum(age!=gender2), sum(height!=height2), sum(height!=gender), sum(height!=gender2), sum(height2!=gender), sum(height2!=gender2), sum(gender!=gender2) from testframe"
sql2
## [1] "alter table testframe drop height2" "alter table testframe drop gender2"
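For readers who prefer to see the first step without the pipe, here is a base R sketch of how the pairwise comparison statement can be assembled. It uses a shortened, three-column version of Names purely for illustration.

```r
# Shortened column list, for illustration only
cols <- c("age", "height", "height2")

# One "a!=b" expression per pair of columns
pairs <- combn(cols, 2, paste, collapse = "!=")

# Wrap each pair in sum() and join into a single select statement
sql1 <- paste("select", toString(sprintf("sum(%s)", pairs)), "from testframe")
print(sql1)
```

Each sum() counts the rows in which the two columns differ, so a result of zero marks a pair of identical columns.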

Bucketing in R or SQL

I am completely stumped on a problem and would like some guidance. I am picking random sets of 8 numbers from the set of 1 to 8 (for example, 5,6,8,1,3,4,2,7) and trying to bucket those numbers as subsets of sequential numbers according to the order they appear.
For the example above, the first bucket would start with a 5 then the 6 would be added. Upon hitting the 8 a new bucket would be started. Whenever we get to a number that belongs in an existing bucket (e.g., when we reach 2, it can be added to 1's bucket), we add it there. In this example, after all 8 numbers we'd arrive at:
5,6,7
8
1,2
3,4
For a total of 4 buckets.
I am not actually concerned with the contents of the buckets, I just want to count how many buckets there are for a given random set of 8 digits. I plan on looping through a set of 1000 of these 8 digit sequences.
My solution, not ripped off from nongkrong but quite similar. You get the count of buckets:
x <- as.integer(c(5,6,8,1,3,4,2,7))
sum(is.na(sapply(1:length(x), function(i) which((x[i]-1L)==x[1:i])[1L])))
# [1] 4
I believe it is possible to vectorize it, then it would scale perfectly.
If you are just interested in the number of buckets,
## Your data
dat <- c( 5,6,8,1,3,4,2,7)
## Get the number of buckets
count <- 0
for (i in seq_along(dat))
if (!((dat[i] - 1) %in% dat[1:i])) count <- count+1
count
# 4
and, more succinctly in a function
countBuckets <- function(lst) sum(sapply(1:length(lst), function(i)
(!((lst[i]-1) %in% lst[1:i]))))
And, here is a recursive implementation to get the contents of buckets.
f <- function(lst, acc=NULL) {
if (length(lst) == 0) return(acc)
if (missing(acc)) return( Recall(lst[-1], list(lst[1])) )
diffs <- sapply(acc, function(x) lst[1] - x[length(x)] == 1)
if (any(diffs)) {
acc[[which(diffs)]] <- c(acc[[which(diffs)]], lst[1])
} else { acc <- c(acc, lst[1]) }
return ( Recall(lst[-1], acc) )
}
f(dat)
# [[1]]
# [1] 5 6 7
#
# [[2]]
# [1] 8
#
# [[3]]
# [1] 1 2
#
# [[4]]
# [1] 3 4
Inspired by @jangorecki but quicker:
x <- sample(8L)
1 + sum(sapply(2L:8L, function(i) !any(x[i] - x[1:(i - 1L)] == 1)))
Here's a vectorized answer:
ind.mat <- matrix(rep(1:8, each=8), ncol=8)
ind.mat[upper.tri(ind.mat)] <- NA
8 - sum(rowSums(matrix(rep(x, 8), ncol=8) - x[ind.mat] == 1, na.rm=TRUE))
Note that we only need to declare ind.mat once, so scales up well to replication.
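Another vectorized possibility, offered as a sketch rather than a benchmarked solution: assuming x is a permutation of 1..n, order(x) gives the position at which each value appears, and a new bucket starts whenever a value appears earlier than its predecessor. The function names here are made up; the loop version is included only to cross-check the result.

```r
# Reference version: a value starts a bucket if (value - 1) has not appeared yet
count_buckets_loop <- function(x) {
  sum(vapply(seq_along(x),
             function(i) !((x[i] - 1L) %in% x[seq_len(i - 1L)]),
             logical(1)))
}

# Vectorized version, assuming x is a permutation of 1..n
count_buckets_order <- function(x) {
  pos <- order(x)            # pos[k] is the position at which value k appears
  1L + sum(diff(pos) < 0L)   # value k + 1 seen before value k => new bucket
}

x <- c(5L, 6L, 8L, 1L, 3L, 4L, 2L, 7L)
count_buckets_loop(x)
count_buckets_order(x)

# Bucket counts for 1000 random sequences
set.seed(42)
perms <- replicate(1000, sample(8L), simplify = FALSE)
counts <- vapply(perms, count_buckets_order, integer(1))
table(counts)
```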
I'm not too familiar with R, but you can definitely do something like:
setOf8 = your array of 8 numbers
buckets=0
for( i = [2,8] )
{
if( (setOf8[i] < setOf8[i-1]) )
{
buckets = buckets + 1
}
}
EDIT:
You could do something like:
func countBuckets( buckets, set )
{
set = your array
current = 1
for( i = [2,size(set)] )
{
if( set[current] + 1 == set[i] )
{
set.remove( current )
current = set[i-1]
}
}
if( size(set) == 0 )
{
return buckets
}
return countBuckets( buckets + 1, set )
}
I'm not sure how it will fare on Oracle, but since you have added the SQL Server tag, here is a T-SQL solution:
declare @set char(8) = '56813427';
with cte as (
select s.Id, cast(substring(@set, s.Id, 1) as int) as [Item]
from dbo.Sequencer s
where s.Id between 1 and 8
union all
select 9 as [Id], 0 as [Item]
)
select count(*) as [TotalBuckets]
from cte s
inner join cte n on (s.Item = n.Item - 1) and s.Id > n.Id;
The idea behind it is to count the cases where the next number appears before the current one, and so begins a new bucket rather than continuing the current one. The only problem is at the boundaries, so I added a trailing zero. Without it, the least item in the set (1 in your case) is not counted as a separate bucket.
P.S. dbo.Sequencer is a table with incrementing integers. I usually keep one in the database to project ordered sequences.

Providing lookup list from R vector as SQL table for RODBC lookup

I have a list of IDs in an R vector.
IDlist <- c(23, 232, 434, 35445)
I would like to write an RODBC sqlQuery with a clause stating something like
WHERE idname IN IDlist
Do I have to read the whole table and then merge it to the idList vector within R? Or how can I provide these values to the RODBC statement, so recover only the records I'm interested in?
Note: As the list is quite long, pasting individual values into the SQL statement, as in the answer below, won't do it.
You could always construct the statement using paste
IDlist <- c(23, 232, 434, 35445)
paste("WHERE idname IN (", paste(IDlist, collapse = ", "), ")")
#[1] "WHERE idname IN ( 23, 232, 434, 35445 )"
Clearly you would need to add more to this to construct your exact statement.
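As a sketch of what the complete statement might look like, assuming a hypothetical table called mytable: converting the IDs to integer first is a precaution, since large whole numbers stored as doubles may otherwise be pasted in scientific notation.

```r
IDlist <- c(23, 232, 434, 35445)

# Convert to integer so the values are pasted as plain digits
ids <- as.integer(IDlist)

# "mytable" is a placeholder table name
query <- sprintf("SELECT * FROM mytable WHERE idname IN (%s)",
                 paste(ids, collapse = ", "))
print(query)
```

The resulting string could then be passed to sqlQuery() on your own connection.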
I put together a solution to a similar problem by combining the tips here and here and running in batches. Approximate code follows (retyped from an isolated machine):
#assuming you have a list of IDs you want to match in vIDs and an RODBC connection in mycon
#queries that don't change
q_create_tmp <- "create table #tmptbl (ID int)"
q_get_records <- "select * from mastertbl as X join #tmptbl as Y on (X.ID = Y.ID)"
q_del_tmp <- "drop table #tmptbl"
#initialize counters and storage
start_row <- 1
batch_size <- 1000
allresults <- data.frame()
while(start_row <= length(vIDs)) {
end_row <- min(length(vIDs), start_row+batch_size-1)
q_fill_tmp <- sprintf("insert into #tmptbl (ID) values %s", paste(sprintf("(%d)", vIDs[start_row:end_row]), collapse=","))
q_all <- list(q_create_tmp, q_fill_tmp, q_get_records, q_del_tmp)
sqlOutput <- lapply(q_all, function(x) sqlQuery(mycon, x))
allresults <- rbind(allresults, sqlOutput[[3]])
start_row <- end_row + 1
}
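The batching step on its own can also be sketched with split(), which avoids tracking start and end rows by hand. The ID vector below is made up, and only the insert statements are built here; nothing is sent to a database.

```r
vIDs <- 1:2500          # made-up IDs for illustration
batch_size <- 1000

# Assign each ID to a batch number, then split into a list of batches
batches <- split(vIDs, ceiling(seq_along(vIDs) / batch_size))
lengths(batches)

# One insert statement per batch
inserts <- vapply(batches, function(ids) {
  sprintf("insert into #tmptbl (ID) values %s",
          paste(sprintf("(%d)", ids), collapse = ","))
}, character(1))

substr(inserts[[1]], 1, 50)
```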

Pass R variable to RODBC's sqlQuery? [duplicate]

Is there any way to pass a variable defined within R to the sqlQuery function within the RODBC package?
Specifically, I need to pass such a variable to either a scalar/table-valued function, a stored procedure, and/or perhaps the WHERE clause of a SELECT statement.
For example, let:
x <- 1 ## user-defined
Then,
example <- sqlQuery(myDB,"SELECT * FROM dbo.my_table_fn (x)")
Or...
example2 <- sqlQuery(myDB,"SELECT * FROM dbo.some_random_table AS foo WHERE foo.ID = x")
Or...
example3 <- sqlQuery(myDB,"EXEC dbo.my_stored_proc (x)")
Obviously, none of these work, but I'm thinking that there's something that enables this sort of functionality.
Build the string you intend to pass. So instead of
example <- sqlQuery(myDB,"SELECT * FROM dbo.my_table_fn (x)")
do
example <- sqlQuery(myDB, paste("SELECT * FROM dbo.my_table_fn (",
x, ")", sep=""))
which will fill in the value of x.
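When the value is a character string rather than a number, it also needs quoting. The following is a minimal sketch, assuming a hypothetical helper sql_quote and a hypothetical column foo.name; it doubles embedded single quotes, which is the standard SQL escape, but it is not a substitute for proper parameterized queries where the driver supports them.

```r
# Hypothetical helper: wrap a value in single quotes, doubling any inside it
sql_quote <- function(x) paste0("'", gsub("'", "''", x, fixed = TRUE), "'")

name <- "O'Brien"
query <- paste0("SELECT * FROM dbo.some_random_table AS foo WHERE foo.name = ",
                sql_quote(name))
print(query)
```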
If you use sprintf, you can very easily build the query string using variable substitution. For extra ease-of-use, if you pre-parse that query string (I'm using stringr) you can write it over multiple lines in your code.
e.g.
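library(stringr)
q1 <- sprintf("
SELECT basketid, count(%s)
FROM %s
GROUP BY basketid
"
,"item_barcode"  # column name, passed as a string
,"dbo.sales"     # table name, passed as a string
)
# collapse newlines and runs of whitespace into single spaces
q1 <- str_trim(str_replace_all(q1,"\\s+"," "))
df <- sqlQuery(shopping_database, q1)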
Side-note and hat-tip to another R chap
Recently I found I wanted to make the variable substitution even simpler by using something like Python's string.format() function, which lets you reuse and reorder variables within the string
e.g.
$: w = "He{0}{0}{1} W{1}r{0}d".format("l","o")
$: print(w)
"Hello World"
However, this function doesn't appear to exist in R, so I asked around on Twitter, and a very helpful chap @kevin_ushey replied with his own custom function to be used in R. Check it out!
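For what it's worth, base R's sprintf() can get close to this by itself: its format strings accept positional references of the form %1$s, %2$s, which allow arguments to be reused and reordered. A small sketch:

```r
# %1$s refers to the first argument, %2$s to the second; both can be reused
w <- sprintf("He%1$s%1$s%2$s W%2$sr%1$sd", "l", "o")
print(w)
```

The same positional syntax works inside a query template, for instance when one table name appears several times.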
With more variables do this:
aaa <- "
SELECT ColOne, ColTwo
FROM TheTable
WHERE HpId = AAAA and
VariableId = BBBB and
convert (date,date ) < 'CCCC'
"
# substitute the placeholders
aaa <- gsub("AAAA", toString(111), aaa)
aaa <- gsub("BBBB", toString(2222), aaa)
aaa <- gsub("CCCC", "2016-01-01", aaa) # quote the date: unquoted 2016-01-01 is arithmetic and evaluates to 2014
Try this, using the fn$ prefix from the gsubfn package, which substitutes $x inside the string:
library(gsubfn)
x <- 1
example2 <- fn$sqlQuery(myDB,"SELECT * FROM dbo.some_random_table AS foo WHERE foo.ID = '$x'")