I am doing some experiments with SQL in R using the sqldf package.
I am trying out some commands to check their output; in particular, I am trying to create tables.
Here is the code:
sqldf("CREATE TABLE tbl1 AS
SELECT cut
FROM diamonds")
The code is very simple, but I get this error:
sqldf("CREATE TABLE tbl1 AS
+ SELECT cut
+ FROM diamonds")
data frame with 0 columns and 0 rows
Warning message:
In result_fetch(res@ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries
Why is it saying that the created table has 0 columns and 0 rows?
Can someone help?
That is a warning, not an error. The warning is caused by a backward incompatibility in recent versions of RSQLite. You can ignore it since the statement works anyway.
The sqldf statement shown in the question:
(1) creates an empty database
(2) uploads the diamonds data frame to a table of the same name in that database
(3) runs the create statement, which creates a second table tbl1 in the database
(4) returns nothing (actually a 0 column, 0 row data frame) since a create statement has no value
(5) destroys the database
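Because the database only lives for the duration of one sqldf call, a table created with a create statement can only be read back inside that same call. A minimal sketch of this, passing a vector of statements to sqldf and using the built-in mtcars data frame in place of diamonds (an assumption, so that the example runs without ggplot2); depending on the RSQLite version, the create statement may still emit the warning discussed above:

```r
library(sqldf)

# Both statements run against the same temporary database:
# the first creates tbl1, the second reads it back before the
# database is destroyed at the end of the call.
res <- sqldf(c("create table tbl1 as select cyl from mtcars",
               "select * from tbl1 limit 3"))
print(res)
```

sqldf returns the value of the last statement, so res holds the three rows selected from tbl1.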
When using sqldf you don't need create statements. It automatically creates a table in the backend database for any data frame referenced in your sql statement so the following sqldf statement
sqldf("select * from diamonds")
will:
(1) create an empty database
(2) upload diamonds to it
(3) run the select statement
(4) return the result of the select statement as a data frame
(5) destroy the database
You can use the verbose=TRUE argument to see the individual calls to the lower level RSQLite (or other backend database if you specify a different backend):
sqldf("select * from diamonds limit 3", verbose = TRUE)
giving:
sqldf: library(RSQLite)
sqldf: m <- dbDriver("SQLite")
sqldf: connection <- dbConnect(m, dbname = ":memory:")
sqldf: initExtension(connection)
sqldf: dbWriteTable(connection, 'diamonds', diamonds, row.names = FALSE)
sqldf: dbGetQuery(connection, 'select * from diamonds limit 3')
sqldf: dbDisconnect(connection)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
I suggest you thoroughly review help("sqldf") as well as the information on the sqldf GitHub home page.
I am working with the R programming language. Suppose I have the following data frame:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
head(testframe)
age height height2 gender gender2
1 18 76.1 76.1 M M
2 19 77.0 77.0 F F
3 20 78.1 78.1 M M
4 21 78.2 78.2 M M
5 22 78.8 78.8 F F
6 23 79.7 79.7 F F
If I want to remove columns with different names but identical values, I can use the following line of code:
no_dup = testframe[!duplicated(as.list(testframe))]
head(no_dup)
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
My Question: Suppose the data frame is not located in the global environment - is it possible to pass the above line of code through a sqlQuery() command? For example:
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
#not sure if this is correct?
sample_query = sqlQuery(con, "testframe[!duplicated(as.list(testframe))]")
Can someone please show me how to do this?
Thanks!
This does all the substantive processing on the SQL side and only does name manipulation on the R side. The database is not downloaded to R.
The first pipeline takes the column names (hard-coded in Names here, but you can retrieve them from the database if necessary) and returns an SQL statement, sql1. When run against your database, sql1 produces a one-row data frame, numDF, that has a column for each pair of variables in testframe, whose value is the number of rows in which the two variables differ.
The second pipeline then uses numDF to generate one or more SQL statements in the character vector sql2 to drop the duplicated columns, i.e. those for which there are zero unequal values. You can then run those SQL statements against your database.
We used sqldf with SQLite for reproducibility, but you can replace the calls to sqldf with an appropriately modified call to sqlQuery, e.g. sqlQuery(con, sql1), where con is the connection you have previously defined.
It is likely that whatever database system you are using accepts the same SQL, but if not, small changes may be needed in the code so that it generates SQL your system accepts.
library(magrittr)
library(sqldf)
Names <- c("age", "height", "height2", "gender", "gender2")
sql1 <- Names %>%
{ toString(sprintf("sum(%s)", combn(., 2, paste, collapse = "!="))) } %>%
paste("select", ., "from testframe")
numDF <- sqldf(sql1) # replace with call to your database
sql2 <- numDF %>%
Filter(Negate(c), .) %>%
names %>%
sub(".*!=(.*.).", "alter table testframe drop \\1", .)
# Just run the sql2 part against your db, not select * ... part.
# The select * ... downloads table for demo purposes only.
sqldf(c(sql2, "select * from testframe")) # replace
## age height gender
## 1 18 76.1 M
## 2 19 77.0 F
## 3 20 78.1 M
## 4 21 78.2 M
## ...snip...
Note that sql1 and sql2 are as follows. sql1 is a single SQL select statement and sql2 is a vector of SQL alter statements, one statement per column to drop. If your database allows ALTER to drop multiple columns at once you might be able to simplify that, but SQLite only allows one at a time.
sql1
## [1] "select sum(age!=height), sum(age!=height2), sum(age!=gender), sum(age!=gender2), sum(height!=height2), sum(height!=gender), sum(height!=gender2), sum(height2!=gender), sum(height2!=gender2), sum(gender!=gender2) from testframe"
sql2
## [1] "alter table testframe drop height2" "alter table testframe drop gender2"
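For readers who prefer not to use magrittr, the construction of sql1 can also be sketched in base R. This is only an equivalent way of building the same string, not a different method; the intermediate name pairs is introduced here for illustration:

```r
Names <- c("age", "height", "height2", "gender", "gender2")

# One "a!=b" expression per unordered pair of column names
pairs <- combn(Names, 2, paste, collapse = "!=")

# Wrap each comparison in sum(...) and assemble the select statement
sql1 <- paste("select", toString(sprintf("sum(%s)", pairs)), "from testframe")
print(sql1)
```

With five column names this gives choose(5, 2) = 10 comparisons, so the statement grows quadratically with the number of columns.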
I would like to change some records in my table. I think the easiest way is to use sqldf and UPDATE. But when I use it I get a warning (the table b isn't empty):
c<-sqldf("UPDATE b
SET l_all = ''
where id='12293' ")
# In result_fetch(res@ptr, n = n) :
# SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
Can you help me change the chosen records in the easiest way?
The query worked but there are several possible problems:
(1) The message is a spurious warning, not an error, caused by backwardly incompatible changes to RSQLite. You can ignore the warning or use the sqldf2 workaround here: https://github.com/ggrothendieck/sqldf/issues/40
(2) The SQL update command does not return anything so one would not expect the command shown in the question to return anything. To return the updated value ask for it.
1) Using the built in BOD data frame, defining sqldf2 from (1) and taking into account (2) we have:
sqldf2(c("update BOD set demand = 0 where Time = 1", "select * from BOD"))
giving:
Time demand
1 1 0.0
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
2) Another approach is to use select, which gives the same result:
sqldf("select Time, iif(Time == 1, 0, demand) demand from BOD")
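The warning text in the question itself points at dbExecute(). As a rough sketch of what that looks like when going through DBI and RSQLite directly rather than through sqldf (the in-memory database and the variable names con, n_changed and out are assumptions made for illustration):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")   # temporary in-memory database
dbWriteTable(con, "BOD", BOD)            # upload the built-in BOD data frame

# dbExecute() is meant for statements that return no rows;
# its value is the number of rows affected.
n_changed <- dbExecute(con, "update BOD set demand = 0 where Time = 1")

# dbGetQuery() is meant for queries that do return rows.
out <- dbGetQuery(con, "select * from BOD")
dbDisconnect(con)
print(out)
```

Note that this changes the copy of BOD inside the database only; the BOD data frame in the R session is left as it was.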
Is there a way in R using the sqldf package to select all columns except one?
Your call to sqldf based on some query should return a data frame, where each DF column corresponds to one of the columns appearing in the select clause of your SQL query. Consider the following example:
sql <- "SELECT * FROM yourTable WHERE <some conditions>"
df <- sqldf(sql)
drop <- c("some_column")
df <- df[, !(names(df) %in% drop)]
Note in the above I am doing a SELECT * to fetch all columns in the table (what I assume is your use case). I then subset off a column some_column from the resulting data frame.
Note that doing this directly in SQL is generally not possible, since standard SQL has no syntax for "all columns except one". Once you do SELECT *, the cat is out of the bag and you end up with all columns.
1) SQLite Using the default SQLite backend, suppose we want to return the first 3 rows of all columns in mtcars except for the cyl column. First create a comma separated string, sel, of all such column names and then use fn$sqldf to allow string interpolation referring to it in the SQL statement as $sel. Add verbose=TRUE argument to sqldf if you want to see the SQL statement that was generated.
library(sqldf)
sel <- toString(setdiff(names(mtcars), "cyl"))
fn$sqldf("select $sel from mtcars limit 3")
giving:
mpg disp hp drat wt qsec vs am gear carb
1 21.0 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 108 93 3.85 2.320 18.61 1 1 4 1
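If you would rather not rely on the fn$ prefix, the same statement can be assembled with paste before it is handed to sqldf. A minimal sketch of that variation (the name sql is introduced here for illustration):

```r
library(sqldf)

# All column names of mtcars except cyl, as a comma separated string
sel <- toString(setdiff(names(mtcars), "cyl"))

# Build the statement in R, then pass the finished string to sqldf
sql <- paste("select", sel, "from mtcars limit 3")
res <- sqldf(sql)
print(res)
```

Printing sql before running it is a simple way to check the generated statement.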
2) H2 The H2 backend supports alter table ... drop column ... so we can write the following. Since alter does not return anything we add a select which returns the altered table.
library(RH2)
library(sqldf)
sqldf(c("alter table mtcars drop column cyl",
"select * from mtcars limit 3"))
I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, and in such case now the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns, and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before but it was mostly to store tabular data where each column had a name, in this case each column is just one point of the data matrix, I guess I could just name it col1 ... to col1000? or there are better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I have to type in col1 to ... col1000 manually, that doesn't sound very smart. This is where I am mostly stuck. Some code snippet would help me.
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code to dump this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But here id1, id2, id30 are hardcoded in the code and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using Linux server, SQlite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
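dbWriteTable(con, "m", as.data.frame(m), row.names = TRUE) # write; row.names = TRUE keeps the row names in a row_names column
dbExecute(con, "create unique index mi on m(row_names)") # dbExecute is for statements that return no rows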
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
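The question also asked how to pick the IDs dynamically rather than hard-coding them. One way to sketch this is to generate one ? placeholder per wanted ID and pass the IDs through the params argument of dbGetQuery; the vector name ids and the choice of a parameterized query are assumptions made for illustration, not part of the original answer:

```r
library(DBI)
library(RSQLite)

m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL))  # test matrix
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "m", as.data.frame(m), row.names = TRUE)  # row names -> row_names column

ids <- c("C", "A")  # IDs wanted this time; could come from anywhere

# One "?" per ID, e.g. "?, ?" for two IDs
placeholders <- toString(rep("?", length(ids)))
sql <- sprintf("select * from m where row_names in (%s) order by row_names",
               placeholders)

m2.df <- dbGetQuery(con, sql, params = as.list(ids))
dbDisconnect(con)

m2 <- as.matrix(m2.df[-1])        # drop the row_names column
rownames(m2) <- m2.df$row_names   # and restore it as row names
print(m2)
```

Binding the IDs as parameters avoids pasting quoted values into the SQL string, which matters if the IDs can contain quotes.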
REVISED based on comments.
I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, and only a 32 bit system available). The file has no header, so the code in the thread above didn't work and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
If the columns are fixed width, you can use substr to specify the start position and the length (the third argument of SQLite's substr is a length, not an end position) of each column you want to read in:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
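As a rough sketch of building such a select statement programmatically, assuming the start positions and widths are known in advance (the vectors starts, widths and vars below are hypothetical values chosen for illustration, and myfile is the file connection name used above):

```r
starts <- c(1L, 3L)          # 1-based start position of each wanted column
widths <- c(1L, 3L)          # number of characters in each wanted column
vars   <- c("var1", "var2")  # names to give the extracted columns

# One "substr(V1, start, width) name" term per wanted column
terms <- sprintf("substr(V1, %d, %d) %s", starts, widths, vars)
sql <- paste("select", toString(terms), "from myfile")
print(sql)
```

The resulting string can then be passed to sqldf in place of the hand-written statement.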