How to read the contents of an .sql file into an R script to run a query?

I have tried the readLines and the read.csv functions but they don't work.
Here is the contents of the my_script.sql file:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE HireDate >= '1-july-1993'
and it is saved on my Desktop.
Now I want to run this query from my R script. Here is what I have:
conn = connectDb()
fileName <- "C:\\Users\\me\\Desktop\\my_script.sql"
query <- readChar(fileName, file.info(fileName)$size)
query <- gsub("\r", " ", query)
query <- gsub("\n", " ", query)
query <- gsub("", " ", query)
recordSet <- dbSendQuery(conn, query)
rate <- fetch(recordSet, n = -1)
print(rate)
disconnectDb(conn)
And I am not getting anything back in this case. What can I try?

I've had trouble reading SQL files myself, and have found that the syntax often breaks if the SQL contains any single-line comments. Because the statement ends up in R as a single-line string, any double dash in the SQL comments out all the code that follows it.
This is a function that I typically use whenever I am reading in a .sql file to be used in R.
getSQL <- function(filepath){
  con = file(filepath, "r")
  sql.string <- ""
  while (TRUE){
    # Read one line at a time until the file is exhausted
    line <- readLines(con, n = 1)
    if ( length(line) == 0 ){
      break
    }
    # Replace tabs with spaces
    line <- gsub("\\t", " ", line)
    # Turn a "--" line comment into a /* ... */ block comment
    if(grepl("--",line) == TRUE){
      line <- paste(sub("--","/*",line),"*/")
    }
    sql.string <- paste(sql.string, line)
  }
  close(con)
  return(sql.string)
}
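As a minimal sketch of the same idea without the line-by-line loop, the conversion can be applied to all lines at once. The helper name collapse_sql() and the sample lines are hypothetical and only for illustration; like getSQL(), the sketch assumes that "--" never appears inside a string literal.

```r
# Hypothetical helper (not from any package): collapse SQL lines into one
# string, wrapping each "--" line comment in /* ... */ so that it cannot
# swallow the code that follows it.
collapse_sql <- function(lines) {
  lines <- gsub("\t", " ", lines, fixed = TRUE)
  has_comment <- grepl("--", lines, fixed = TRUE)
  lines[has_comment] <- paste(
    sub("--", "/*", lines[has_comment], fixed = TRUE), "*/"
  )
  paste(lines, collapse = " ")
}

# Made-up sample lines standing in for the contents of a .sql file
sample_lines <- c("SELECT EmployeeID, FirstName -- key columns",
                  "FROM Employees")
sql_text <- collapse_sql(sample_lines)
print(sql_text)
```

With a real file, the lines would come from readLines("my_script.sql").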

I've found for queries with multiple lines, the read_file() function from the readr package works well. The only thing you have to be mindful of is to avoid single quotes (double quotes are fine). You can even add comments this way.
Example query, saved as query.sql
SELECT
COUNT(1) as "my_count"
-- comment goes here
FROM -- tabs work too
my_table
I can then store the results in a data frame with
df <- dbGetQuery(con, statement = read_file('query.sql'))
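Comments survive here because read_file() keeps the line breaks, so each -- comment still ends at the end of its own line. If readr is not available, a base R sketch with the same effect could look like this (read_sql_file() is a hypothetical helper name, and the temporary file stands in for query.sql):

```r
# Hypothetical helper: read a whole file into one string, keeping newlines
read_sql_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Temporary file standing in for query.sql
tmp <- tempfile(fileext = ".sql")
writeLines(c("SELECT",
             "COUNT(1) as \"my_count\"",
             "-- comment goes here",
             "FROM my_table"), tmp)

query_text <- read_sql_file(tmp)
cat(query_text, "\n")
```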

You can use the read_file() function from the readr package.
fileName = read_file("C:/Users/me/Desktop/my_script.sql")
You will get a string variable fileName with the desired text.
Note: Use / instead of \ in the path.

The answer by Matt Jewett is quite useful, but I wanted to add that I sometimes encounter the following warning when trying to read .sql files generated by sql server using that answer:
Warning message: In readLines(con, n = 1) : line 1 appears to contain
an embedded nul
The first line returned by readLines is often "ÿþ" in these cases (i.e. the UTF-16 byte order mark) and subsequent lines are not read properly. I solved this by opening the sql file in Microsoft SQL Server Management Studio and selecting
File -> Save As ...
then on the small downarrow next to the save button selecting
Save with Encoding ...
and choosing
Unicode (UTF-8 without signature) - Codepage 65001
from the Encoding dropdown menu.
If you do not have Microsoft SQL Server Management Studio and are using a Windows machine, you could also try opening the file with the default text editor and then selecting
File -> Save As ...
Encoding: UTF-8
to save with a .txt file extension.
Interestingly changing the file within Microsoft SQL Server Management Studio removes the BOM (byte order mark) altogether, whereas changing the file within the text editor converts the BOM to the UTF-8 BOM but nevertheless causes the query to be properly read using the referenced answer.
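An alternative that avoids re-saving the file is to declare the encoding when opening it in R. The following is only a sketch, assuming the file is UTF-16 little-endian (which the "ÿþ" byte order mark suggests); the temporary file is made up for the demonstration.

```r
# Build a small UTF-16LE file with a byte order mark, standing in for a
# script saved by SQL Server Management Studio.
tmp <- tempfile(fileext = ".sql")
bom <- as.raw(c(0xff, 0xfe))
body <- iconv("SELECT 1;\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(bom, body), tmp)

# Declare the encoding on the connection so that readLines() receives
# re-encoded text instead of raw two-byte characters.
con <- file(tmp, open = "r", encoding = "UTF-16LE")
sql_lines <- readLines(con)
close(con)
print(sql_lines)
```

The same connection could be passed to the line-by-line reader in the referenced answer instead of file(filepath, "r").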

The combination of readr and textclean works well without having to create any new functions. read_file() reads the file into a single string and replace_white() replaces escaped whitespace characters (such as tabs and newlines) in your .sql file with spaces. Note: this causes problems if your SQL contains -- comments, because the line breaks that end them are removed.
library(readr)
library(textclean)
SQL <- replace_white(read_file("file_path"))

Related

Execute multiple statements separated by semicolons in RODBC

I have a fairly complex SQL query that I am trying to run through RODBC that involves defining variables. A simplified version looks like this:
DECLARE @VARX CHAR = 'X';
SELECT * FROM TABLE WHERE TYPE = @VARX;
Running this code works just fine. This fails:
library(RODBC)
q <- "DECLARE @VARX CHAR = 'X';\nSELECT * FROM TABLE WHERE TYPE = @VARX;"
sqlQuery(ch, q)
# returns character(0)
I have found through experimentation that the first statement before the semicolon is executed, but the rest is not. There is no error--it just seems that everything after the semicolon is ignored. Is there a way to execute the full query?
I'm using SQL server by the way.
NOTE: I asked this question before and it was marked as a duplicate of this question, but they are asking completely different things. In this question I would like to execute a script that contains multiple statements, and in the other the author is only trying to execute a single statement.
You can try this:
library(RODBC)
library(stringr)
filename = "filename.sql" ### file where the sql code is stored
queries <- readLines(filename) ### read the sql file into R
queries1 = str_replace_all(queries,'--.*$'," ") ### remove any commented lines
queries2 = paste(queries1, collapse = '\n') ### collapse with new lines
queries3 = unlist(str_split(queries2,"(?<=;)")) ### separate individual queries
Set up the ODBC connection at this point and run the for loop below. You can also modify the queries to add or change variables within them before running the loop.
for (i in 1:length(queries3)) {
print(i)
sqlQuery(conn, queries3[i])
}
After the for loop is done, you can pull any volatile or regular tables generated in your session into R using sqlQuery(). I haven't tested this extensively and there might be cases where it fails, but it worked for what I was doing.
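One detail worth guarding against: splitting on semicolons can leave an empty or whitespace-only last element, which would then be sent to the database as a query. A base R sketch of the splitting step that drops such pieces is shown below. split_statements() is a hypothetical helper and the script text is made up; it also assumes no semicolons occur inside string literals.

```r
# Hypothetical helper: split a SQL script into individual statements
split_statements <- function(script) {
  parts <- strsplit(script, ";", fixed = TRUE)[[1]]
  parts <- trimws(parts)
  parts[nzchar(parts)]  # drop empty pieces, e.g. after the final ";"
}

# Made-up script text with blank lines and a trailing newline
script <- "CREATE TABLE t (x INT);\nINSERT INTO t VALUES (1);\n\nSELECT * FROM t;\n"
statements <- split_statements(script)
print(statements)
```

Each element of statements could then be passed to sqlQuery() in the loop, using seq_along(statements) so that an empty result does not trigger an iteration.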

SQL code in Rnw document with knitr

I used the following SQL code in an .Rmd document. However, I want to use the same SQL code in an .Rnw document.
```{r label = setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, max.print = NA)
```
```{r, echo=FALSE, results='hide'}
library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname = "survey.db")
dbListTables(db)
```
```{sql, label = Q1, connection=db, tab.cap = "Table Caption"}
SELECT *
FROM Person;
```
I would prefer to keep the code formatting and the printing of the output.
Porting the RMarkdown to RNW requires some tweaking:
Of course, chunk delimiters need to be adjusted: The RNW equivalent of ```{r, echo=FALSE} is <<echo=FALSE>>= and RNW chunks end with @. (See the minimal RNW example.)
Importantly, while chunks in RMarkdown documents always specify an engine, the engine in RNW is implicitly R unless the option engine is set. So ```{r} becomes simply <<>>=, but the equivalent of ```{sql} is <<engine="sql">>=.
RMarkdown includes some very useful magic when embedding SQL chunks, see knitr Language Engines: SQL on rmarkdown.rstudio.com. By default, results are rendered as a nice table and only the first 10 results are printed. In RNW, we need to take care of this on our own.
For embedding SQL in RMarkdown, note that the SQL connection must be passed to the SQL chunk via the connection option. The option output.var can be used to specify the name of the object to which the result of the query will be assigned.
A simple solution (see previous revision) would just assign the SQL result to an object, say res, using output.var and add another R chunk that prints res nicely, e.g. using xtable. However, there is a more elegant approach using hooks:
The example uses the SQLite sample database from sqlitetutorial.net. Unzip it to your working directory before running the code.
\documentclass{article}
\begin{document}
\thispagestyle{empty}
<<include=FALSE>>=
library(knitr)
library(DBI)
knit_hooks$set(formatSQL = function(before, options, envir) {
if (!before && opts_current$get("engine") == "sql") {
sqlData <- get(x = opts_current$get("output.var"))
max.print <- min(nrow(sqlData), opts_current$get("max.print"))
myxtable <- do.call(xtable::xtable, c(list(x = sqlData[1:max.print, ]), opts_current$get("xtable.args")))
capture.output(myoutput <-do.call(xtable::print.xtable, c(list(x = myxtable, file = "test.txt"), opts_current$get("print.xtable.args"))))
return(asis_output(paste(
"\\end{kframe}",
myoutput,
"\\begin{kframe}")))
}
})
opts_chunk$set(formatSQL = TRUE)
opts_chunk$set(output.var = "formatSQL_result")
opts_chunk$set(max.print = getOption("max.print"))
@
<<echo=FALSE, results="hide">>=
db <- dbConnect(RSQLite::SQLite(), dbname = "chinook.db")
@
<<engine = "sql", connection=db, max.print = 8, xtable.args=list(caption = "My favorite artists?", label="tab:artist"), print.xtable.args=list(comment=FALSE, caption.placement="top")>>=
SELECT * FROM artists;
@
\end{document}
A new chunk hook formatSQL is added. (Chunk hooks run whenever the corresponding chunk option is not NULL.) After a chunk with engine="sql", it reads the SQL results into sqlData. Then, it uses xtable to print the first max.print rows of the result.
By default, the chunk hook formatSQL is activated (i.e. it is globally set to TRUE) and SQL results are stored in formatSQL_result. The chunk option max.print controls the number of rows to be printed (set it to Inf to print all rows, always).
The table produced by xtable is highly customizable. The chunk option xtable.args is passed to xtable and print.xtable.args is passed to print.xtable. In the example these options are used to set a caption, a label and to suppress xtable's default comment.
Note that syntax highlighting for non-R code in RNW requires installing highlight and adding its directory to the PATH (Windows).

How to run same syntax on multiple spss files

I have 24 spss files in .sav format in a single folder. All these files have the same structure. I want to run the same syntax on all these files. Is it possible to write a code in spss for this?
You can use the user-submitted SPSSINC PROCESS FILES command to do this, or write your own macro. So first let's create some very simple fake data to work with.
*FILE HANDLE save /NAME = "Your Handle Here!".
*Creating some fake data.
DATA LIST FREE / X Y.
BEGIN DATA
1 2
3 4
END DATA.
DATASET NAME Test.
SAVE OUTFILE = "save\X1.sav".
SAVE OUTFILE = "save\X2.sav".
SAVE OUTFILE = "save\X3.sav".
EXECUTE.
*Creating a syntax file to call.
DO IF $casenum = 1.
PRINT OUTFILE = "save\TestProcess_SHOWN.sps" /"FREQ X Y.".
END IF.
EXECUTE.
Now we can use the SPSSINC PROCESS FILES command to specify the sav files in the folder and apply the TestProcess_SHOWN.sps syntax to each of those files.
*Now example calling the syntax.
SPSSINC PROCESS FILES INPUTDATA="save\X*.sav"
SYNTAX="save\TestProcess_SHOWN.sps"
OUTPUTDATADIR="save" CONTINUEONERROR=YES
VIEWERFILE= "save\Results.spv" CLOSEDATA=NO
MACRONAME="!JOB"
/MACRODEFS ITEMS.
Another (less advanced) way is to use the INSERT command. To do so, repeatedly GET each sav file, run the syntax with INSERT, and save the file. Probably something like this:
get 'file1.sav'.
insert file='syntax.sps'.
save outf='file1_v2.sav'.
dataset close all.
get 'file2.sav'.
insert file='syntax.sps'.
save outf='file2_v2.sav'.
etc etc.
Good luck!
If the Syntax you need to run is completely independent of the files then you can either use: INSERT FILE = 'Syntax.sps' or put the code in a macro e.g.
Define !Syntax ()
* Put Syntax here
!EndDefine.
You can then run either of these 'manually';
get file = 'file1.sav'.
insert file='syntax.sps'.
save outfile ='file1_v2.sav'.
Or
get file = 'file1.sav'.
!Syntax.
save outfile ='file1_v2.sav'.
Or if the files follow a reasonably strict naming structure you can embed either of the above in a simple bit of python;
Begin Program.
import spss
# Assumes the files are named file1.sav ... file24.sav
for i in range(1, 24 + 1):
    syntax = "get file = 'file" + str(i) + ".sav'.\n"
    syntax += "insert file='syntax.sps'.\n"
    syntax += "save outfile ='file" + str(i) + "_v2.sav'.\n"
    print(syntax)
    spss.Submit(syntax)
End Program.

Trouble running SQL queries via RODBC

I have a file called q_cleanup.sql that I am reading into R via readLines(). This file has lots of little queries we wrote to clean up some really ugly data. Once I read them into R and process the text, I run each query in the file.
All of the queries work when run directly through Oracle's SQL Developer and Tora.
Some of the queries fail when run via RODBC.
For example. The file contains the following two queries (cut and pasted out of the file)
update T_HH_TMP
set program_type = 'not able to contact'
where
program_type like '%n0t%'
or program_type like '%not able to%'
;
update T_HH_TMP
set program_type = 'hh substance use'
where program_type like '%hh substance abuse%'
;
The first query runs. The second query errors. Below is the relevant section out of my cleanup.R file. The command odbcStart() is a function I built to simplify opening and closing rodbc connections. It is not the problem.
odbcStart()
qry <- readLines("sql/q_cleanup.sql")
qry <- paste(qry[-grep("--", qry)] , collapse=" ")
qry <- unlist(strsplit(qry, ";"))
for(i in seq_along(qry)) {
print("------------------------------------------------------------")
print(qry[i])
print(sqlQuery(con, qry[i]))
}
odbcClose(com)
I am stripping off anything / everything that I can think of that might cause a problem and my string is wrapped in double quotes and my query contains ONLY single quotes. Yet, the output looks like this:
[1] "------------------------------------------------------------"
[1] " update T_HH_TMP set program_type = 'not able to contact' where program_type like '%n0t%' or program_type like '%not able to%' "
character(0)
[1] "------------------------------------------------------------"
[1] " update T_HH_TMP set program_type = 'hh substance use' where program_type like '\\%hh substance abuse\\%' "
[1] "[RODBC] ERROR: Could not SQLExecDirect ' update T_HH_TMP set program_type = 'hh substance use' where program_type like '\\%hh substance abuse\\%' '"
I do not feel that the % is the problem because the first query runs just fine.
Any help? I really would prefer to script the running of all these queries in R.
I thought I would share what I know. I have a solution, even though I consider it sub-optimal because it complicates my workflow unnecessarily.
I do not know if the problem is caused by Oracle server, SQL Plus or if it has something to do with R / Emacs on Windows. I am not an Oracle expert and the office I work for is moving to Vertica by the end of the summer, so I am not going to invest much more effort in fixing this.
I am using sqlplus.exe to run SQL syntax that creates either a view or stored procedure and I am then running the view / SP via R. Thus, the command I have to pass to Oracle via R is SIMPLE and it can handle it.
To script sqlplus from R, I am using the following function that I will someday improve. It has no error handling and it basically assumes you are being nice, but it does work.
#' queryFile() runs a longish series of queries in a .sql file.
#' It is very important to understand that the path to sqlplus is hardcoded
#' because Windows has a shitty path system. It may not run on another system
#' without being edited.
#'
#' @param file - The relative path to the .sql file.
#' @return output - Vector containing the results from sqlplus
#'
queryFile <- function(file){
cmd <- "c:/Oracle/app/product/11.2.0/client_1/sqlplus.exe %user/%password@%db @%file"
cmd <- gsub("%user", getOption("DataMart")$uid, cmd )
cmd <- gsub("%password", getOption("DataMart")$pwd, cmd )
cmd <- gsub("%db", getOption("DataMart")$db, cmd )
cmd <- gsub("%file", file, cmd )
print(cmd)
output <- system(cmd, intern=TRUE)
return(output)
}
Apparently Markdown does not like my Roxygen style comments. Sorry.
The point of this function is that you pass it the file with the SQL syntax. It uses SQL Plus to run the syntax. To store / access user name, password, etc. I use a file called ~/passwords.R. It has a series of options() commands that look like this:
## Fake example.
options( DataMart = list(
uid = "user_name"
,pwd = "user_password"
,db = "TNS Database"
,con_type = "ODBC"
,srvr_type = "Oracle"
)
)
The last two (con_type and srvr_type) are just things that I like to have documented. They are not really needed. I have ~ 10 of these in my file and I use this to remind me which db server I am writing against. I have to write against SQL Server, Vertica, MySQL and Oracle (different projects / employers) and this helps me.
The function I provided uses options() to access that necessary information and then runs SQLPlus.exe. I could have added SQLPlus to my Windows path, but I was trying to make this function semi-independent and it seems like our IT people are consistent about where SQL Plus lives (of course there are different versions running around, but at least I don't have to explain the idea of a path to someone who is not really a programmer).
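For illustration, the command string that queryFile() assembles can also be built in a single sprintf() call, following the usual sqlplus form user/password@db @file. This is only a sketch: build_sqlplus_cmd() is a hypothetical helper, the credentials are fake, and the command is printed rather than executed.

```r
# Hypothetical helper: assemble the sqlplus command line in one step
build_sqlplus_cmd <- function(file, creds, sqlplus = "sqlplus") {
  sprintf("%s %s/%s@%s @%s", sqlplus, creds$uid, creds$pwd, creds$db, file)
}

# Fake credentials, shaped like the options() entry above
creds <- list(uid = "user_name", pwd = "user_password", db = "TNS_Database")
cmd <- build_sqlplus_cmd("sql/q_cleanup.sql", creds)
print(cmd)
```

Using sprintf() avoids the chain of gsub() calls, in which a placeholder such as %db would also be replaced if it happened to occur inside an already substituted value.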

iconv: SQL Server 2005 to R

I need to import files that were created with SQL Server 2005 into R. I need R to read the current format or else I need a method for my data provider so that my colleague can save in a format that R can read, with csv being the first choice.
A colleague is sending me quite a few large files that have been saved with MS SQL Server 2005 on a server. I am using R 2.15.1 on Windows 7.
Using R I am trying to read in the files using standard techniques. Although each file has a csv extension, when I go to Excel or WordPad and do SAVE AS I see that it is Unicode Text. Notepad indicates that the encoding is Unicode. Right now I have to do a few things from within Excel (such as Text to Columns. Each row is entirely in Column A) and eventually save as a true csv file before I can read it into R and then use it.
Is there a way to solve this from within R? I am also open to easy SQL Server 2005 solutions.
I tried the following from within R.
testDF = read.table("Info06.csv", header = TRUE, sep = ",")
testDF2 = iconv(x = testDF, from = "Unicode", to = "")
Error in iconv(x = testDF, from = "Unicode", to = "") :
unsupported conversion from 'Unicode' to '' in codepage 1252
# The next line did not produce an error message
testDF3 = iconv(x = testDF, from = "UTF-8" , to = "")
testDF3[1:6, 1:3]
Error in testDF3[1:6, 1:3] : incorrect number of dimensions
# The next line did not produce an error message
testDF4 = iconv(x = testDF, from = "macroman" , to = "")
testDF4[1:6, 1:3]
Error in testDF4[1:6, 1:3] : incorrect number of dimensions
Encoding(testDF3)
[1] "unknown"
Encoding(testDF4)
[1] "unknown"
This is the first few lines from WordPad
Date,StockID,Price,MktCap,ADV,SectorID,Days,A1,std1,std2
2006-01-03 00:00:00.000,#Stock1 ,2.53,467108197.38,567381.144444444,4,133.14486997089,-0.0162107939626307,0.0346283580367959,0.0126471695454834
2006-01-03 00:00:00.000,#Stock2 ,1.3275,829803070.531114,6134778.93292,5,124.632223896458,0.071513138376339,0.0410694546850102,0.0172091268025929
It depends on your locale settings, but the following works for me:
read.table("Info06.csv", header = TRUE, sep = ",", fileEncoding = "UCS-2LE")
If that does not work for you, I recommend using Notepad++ to detect the encoding. Open the file with it; under the "Encoding" menu, the current encoding is marked with a dot.
Also check the question about detecting encoding.
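iconv() operates on character vectors rather than data frames, so it is usually simpler to declare the encoding while the file is being read. Here is a self-contained sketch of the fileEncoding approach. The temporary file and its three columns are made up to mimic a "Unicode" export, and the example assumes the file really is little-endian two-byte text.

```r
# Write a small two-byte little-endian file with a byte order mark,
# standing in for the export from SQL Server.
tmp <- tempfile(fileext = ".csv")
text <- "Date,StockID,Price\n2006-01-03,Stock1,2.53\n2006-01-03,Stock2,1.3275\n"
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
writeBin(bytes, tmp)

# Declare the encoding when reading instead of converting afterwards
testDF <- read.table(tmp, header = TRUE, sep = ",", fileEncoding = "UCS-2LE")
print(dim(testDF))
print(testDF$Price)
```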