Copy_to in R results in dates being converted to numeric columns

I am attempting to create a small, training database for a package that I am writing. I am using the following code to create the database:
library(tidyverse)
library(DBI)
dat <- data.frame(name = rep("Clyde", 100),
                  DOB = sample(x = seq(as.POSIXct('1970/01/01'), as.POSIXct('1995/01/01'), by = "day"),
                               size = 100, replace = TRUE))
# Example using schemas with SQLite
train_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
## create tables in primary db
copy_to(dest = train_con, df = dat, name = "client_list", temporary = FALSE)
The above portion works fine. However, when I attempt to pull data from the database, I see that all dates have been converted to numeric.
train_con %>% tbl("client_list")
Can anybody tell me how to fix this? Thanks!

SQLite does not have a datetime type. In the absence of such a type, POSIXct objects are sent to the database as seconds since the UNIX epoch, and SQLite does not know that they are intended to represent date-times.
Either convert such columns yourself after you read them back into R, or else use a different database; nearly every database other than SQLite supports date-time types.
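If you stay with SQLite, a minimal sketch of the convert-it-yourself option, using the connection and column names from the question (DOB comes back as numeric seconds since the epoch):
library(dplyr)

client_list <- train_con %>%
  tbl("client_list") %>%
  collect() %>%                                                      # pull the rows back into R
  mutate(DOB = as.POSIXct(DOB, origin = "1970-01-01", tz = "UTC"))   # numeric seconds -> POSIXct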

Related

Discrepancy between "query_exec" and "bq_table_download" using bigrquery

So far I used bigrquery's query_exec to download timeseries data from BigQuery.
sql <- "SELECT Date, val1, val2
FROM `mydata`
WHERE DATE(_PARTITIONTIME) BETWEEN '2020-05-01' AND '2020-06-01'"
project <- "myproj"
df <- query_exec(sql, project = project, max_pages = Inf, use_legacy_sql = FALSE) %>% as_tibble()
Since the last update, a warning appears indicating that query_exec is deprecated and that bq_table_download in conjunction with bq_project_query should be used instead.
tb <- bq_project_query(project, sql)
df <- bq_table_download(tb, page_size = 100000)
Adjusting my code resulted in a dataframe of the same size (more than 4 million rows) as the request with query_exec. However, from row ~80,000 onwards only dates of the form 1970-01-01 appear, and the remaining columns are either empty or contain zeros. The old way with query_exec still works and results in a correctly formatted dataframe.
Any ideas what could be the problem here?
This is most likely related to the page_size parameter, which you set to 100000. If it is increased to larger values, the results are no longer parsed properly and NAs or wrongly parsed values appear. I assume your dates come out as 1970-01-01 because of that.
Try setting the page_size to something closer to the default of 10000 and it should work. I have not yet found the perfect value, but 20000 works fine for me.
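Applied to the code in the question, that would be (a sketch; the workable page_size may differ for your table):
tb <- bq_project_query(project, sql)
df <- bq_table_download(tb, page_size = 20000)  # smaller pages parse reliably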

Writing table from SQL query directly to other database in R

So in a database Y I have a table X with more than 400 million observations. Then I have a KEY.csv file with IDs that I want to use to filter the data (a small data set, ca. 50k unique IDs). If I had unlimited memory, I would do something like this:
require(RODBC)
require(dplyr)
db <- odbcConnect('Y',uid = "123",pwd = '123')
df <- sqlQuery(db,'SELECT * from X')
close(db)
keys <- read.csv('KEY.csv')
df_final <- df %>% filter(ID %in% keys$ID)
My issue is that I don't have the rights to upload the KEY.csv file to database Y and do the filtering there. Would it somehow be possible to do the filtering in the query while referencing the file loaded into R memory, and then write the filtered X table directly to a database I do have access to? I suspect that even after filtering, R might not be able to keep the result in memory.
I could also try to do this in Python, but I don't have much experience with that language.
I don't know how many keys you have, but maybe you can try the build_sql() function to use the keys inside the query.
I don't use RODBC; I think you should use odbc and DBI (https://db.rstudio.com).
library(dbplyr) # dbplyr, not dplyr
library(DBI)
library(odbc)

# Get keys first
keys <- read.csv('KEY.csv')

# the connection function changes with odbc
db <- dbConnect(odbc(), dsn = "Y", uid = "123", pwd = "123")

# build the query with the keys interpolated (dbplyr)
sql_query <- build_sql("SELECT * FROM X WHERE X.key IN ", keys$ID, con = db)

df <- dbGetQuery(db, sql_query)
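To then get the filtered result into a database you can write to, one option is DBI::dbWriteTable(); this is a sketch assuming a second, hypothetical DSN "Z" that you have write access to:
db_out <- dbConnect(odbc(), dsn = "Z", uid = "123", pwd = "123")  # hypothetical target database
dbWriteTable(db_out, "X_filtered", df)                            # write the filtered rows
dbDisconnect(db_out)
dbDisconnect(db)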

Putting dbSendQuery into a function in R

I'm using RJDBC in RStudio to pull a set of data from an Oracle database into R.
After loading the RJDBC package I have the following lines:
drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA INCLUDING REQUEST FOR VARIABLE x"))
masterdata = fetch(rs, n = -1) # extract all rows
Run as part of the usual script, these lines always execute without fail; they can take a few minutes depending on variable x, e.g. resulting in 100K or 1M rows being pulled. masterdata ends up holding everything in a dataframe.
I'm now trying to place all of the above into a function with one required argument, variable x, which is a text argument (a city name); this input is also part of the LONG SQL QUERY.
The function I wrote called Data_Grab is as follows:
Data_Grab = function(x) {
  drv = JDBC("oracle.jdbc.OracleDriver", classPath = "C:/R/ojdbc7.jar", identifier.quote = " ")
  conn = dbConnect(drv, "jdbc:oracle:thin:#private_server_info", "804301", "password")
  rs = dbSendQuery(conn, statement = paste("LONG SQL QUERY TO SELECT REQUIRED DATA,
                                            INCLUDING REQUEST FOR VARIABLE x"))
  masterdata = fetch(rs, n = -1) # extract all rows
  return(masterdata)
}
My function appears to execute in seconds (no error is produced), but I get just the 21 column headings of the dataframe and the line
<0 rows> (or 0-length row.names)
Not sure what is wrong here; I obviously expect the function to still take minutes to execute, since the data being pulled is large, but no actual data frame is being returned.
Help is appreciated!
If you want to parameterize your query to a JDBC database, try the gsubfn package. The code might look like this:
library(gsubfn)
library(RJDBC)

Data_Grab = function(x) {
  rd1 = x
  # 'conn' is an existing RJDBC connection, created as in the question
  df <- fn$dbGetQuery(conn, "SELECT BLAH1, BLAH2
                             FROM TABLENAME
                             WHERE BLAH1 = '$rd1'")
  return(df)
}
Basically, you need to put a $ before the name of the variable that stores the parameter you wish to pass.
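An alternative without the extra dependency is to build the string with base paste0(). This sketch reuses the placeholder table/column names from above and the conn object from the question, and it illustrates the underlying issue: in the original function the value of x was never actually substituted into the pasted query string.
Data_Grab2 = function(x) {
  query <- paste0("SELECT BLAH1, BLAH2 FROM TABLENAME WHERE BLAH1 = '", x, "'")
  rs <- dbSendQuery(conn, statement = query)  # conn as created in the question
  masterdata <- fetch(rs, n = -1)
  return(masterdata)
}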

R forecast output to SQL Server

I am doing database analysis using SQL Server and forecasting using R. I need to get the results from R back into the SQL Server database. One approach is to output the forecast data to a text file using write.table and import using BULK INSERT. Is there a better way?
You can use dbBulkCopy from the rsqlserver package. It is a DBI extension that interfaces with bcp, the popular Microsoft SQL Server command-line utility, to quickly bulk-copy large files into a table.
nrow <- 100; ncol <- 5                  # example dimensions
cnames <- paste0("col", seq_len(ncol))  # example column names
dat <- matrix(round(rnorm(nrow * ncol), 2), nrow = nrow, ncol = ncol)
colnames(dat) <- cnames
id.file <- "temp_file.csv"
write.csv(dat, file = id.file, row.names = FALSE)
dbBulkCopy(conn, "NEW_BP_TABLE", value = id.file) # conn: an existing rsqlserver connection
Thanks for your comments and answers! I went with a solution based on the comment by nrussell. Below is my code. The specific command is the last line; I am providing the preceding lines to provide a little bit of context for anyone trying to use this answer.
library(RODBC)
library(forecast)

data <- sqlQuery(myconn, query) # returns time series with year, month (both numeric), and value
data_ts <- ts(data$value,
              start = c(data$year[1], data$month[1]),                 # start is first year and month
              end = c(data$year[nrow(data)], data$month[nrow(data)]), # end is last year and month
              frequency = 12)
data_fit <- auto.arima(data_ts)
fct <- forecast(data_fit, 12)                  # forecast 12 months ahead from the fitted model
sqlQuery(myconn, 'truncate table dgtForecast') # pre-existing table
sqlSave(myconn, data.frame(fct), tablename = 'dgtForecast', rownames = 'MonthYear', append = TRUE)

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, and in such case now the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now its of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name; in this case each column is just one point of the data matrix. I guess I could just name them col1 to col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type in col1 to col1000 manually, which doesn't sound very smart. This is where I am mostly stuck; some code snippet would help me.
Then I need to dump the text files into the SQLite database. Again, I'm unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for dumping this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100, and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)

m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix

con <- dbConnect(SQLite()) # could add dbname= arg. Here we use an in-memory db, so it's not needed.
dbWriteTable(con, "m", as.data.frame(m), row.names = TRUE) # write; row names go into a row_names column
dbExecute(con, "create unique index mi on m(row_names)")

# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
                          order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based, and the order in which the rows are stored is not guaranteed. We have used order by row_names to get a specific order back out. If that is not good enough, then add a column giving the row index: 1, 2, 3, ...
REVISED based on comments.
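The hard-coded ('A', 'C') list can also be built dynamically from whatever IDs are needed on a given run; a sketch continuing from the con connection above, with DBI::dbQuoteString doing the quoting:
want <- c("A", "C")  # row IDs wanted this time
in_list <- paste(dbQuoteString(con, want), collapse = ", ")
sql <- paste0("select * from m where row_names in (", in_list, ") order by row_names")
m2.df <- dbGetQuery(con, sql)
The same paste()-style construction covers the column case from the question: collapse cols.i.want with commas and paste it into the SELECT list.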