Efficient Ways to Append SQL Results in R - sql

I am trying to perform random sampling WITH REPLACEMENT through SQL, in R, on a file located on a server.
Pretend that this is the file I am trying to sample WITH REPLACEMENT (I want 30 observations from "setosa" and 30 observations from "virginica"):
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I could not figure out a way to randomly sample WITH REPLACEMENT, so I tried to write a loop that takes one small sample at a time (so that replicates can occur) and then appends all these samples together. Here, I am trying to create 10 random sample datasets, each consisting of 60 random samples (30 from each class) stacked 10 times, i.e. 600 rows per dataset:
all_tables <- list()
all_tables_1 <- list()

for (j in 1:10) {
  for (i in 1:10) {
    table_i <- rbind(
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'virginica') ORDER BY RANDOM() LIMIT 30;")
    )
    all_tables[[i]] <- table_i
  }
  all_tables_j <- do.call(rbind.data.frame, all_tables)
  all_tables_1[[j]] <- all_tables_j
}

lapply(seq_along(all_tables_1), function(i, x) {
  assign(paste0("sample_", i), x[[i]], envir = .GlobalEnv)
}, x = all_tables_1)

dbDisconnect(con)
I have a feeling that what I have written is (mostly) correct, but it's probably not very efficient. For example, if I want 1000 datasets and require each dataset to have 10000 random samples, this will require querying the file on the server many times and thus become slow.
Could someone please suggest some ideas/advice on how to make this random sampling code more efficient?
Thank you!
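One way to cut down on the number of queries (a minimal sketch of one possible approach, using the iris example above, and not necessarily the only way): pull each class from the server once, then do the with-replacement resampling locally with sample(..., replace = TRUE), so the database is only hit twice no matter how many bootstrap datasets are built.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

# one query per class, then resample locally
setosa    <- dbGetQuery(con, "SELECT * FROM iris WHERE Species = 'setosa';")
virginica <- dbGetQuery(con, "SELECT * FROM iris WHERE Species = 'virginica';")

n_datasets  <- 10   # how many bootstrap datasets to build
n_per_class <- 30   # rows drawn with replacement from each class

boot_samples <- lapply(seq_len(n_datasets), function(j) {
  rbind(
    setosa[sample(nrow(setosa), n_per_class, replace = TRUE), ],
    virginica[sample(nrow(virginica), n_per_class, replace = TRUE), ]
  )
})

dbDisconnect(con)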

Related

Optimization: splitting a column into a thousand columns in R or SQLite

I need to analyse data from a very large dataset. For that, I need to separate a character variable into more than a thousand columns.
The structure of this variable is :
number$number$number$ and so on for a thousand numbers
My data is stored in a .db file from SQLite. I then imported it into R using the package "RSQLite".
I tried splitting this column into multiple columns using separate() (from tidyr):
library(dplyr)
library(tidyr)
# d is a data.table with my data
d2 <- d %>% separate(column_to_separate, paste0("S", 1:number_of_final_columns))
It works, but it is also taking forever. Does someone have a solution to split this column faster (either in R or using SQLite)?
Thanks.
You may use the tidyfast package (see here), which leverages data.table. In this test, it is approximately three times faster:
test <- data.frame(
  long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
)

system.time({
  test |>
    tidyr::separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#>    user  system elapsed
#>   0.352   0.012   0.365

system.time({
  test |>
    tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#>    user  system elapsed
#>   0.117   0.000   0.118
Created on 2023-02-03 with reprex v2.0.2
You can try writing the data out as is and then loading it back with fread, which is generally rather fast.
library(data.table)
library(dplyr)
library(tidyr)

# Prepare example
x <- matrix(rnorm(1000 * 10000), ncol = 1000)
dta <- data.frame(value = apply(x, 1, function(x) paste0(x, collapse = "$")))

# Run benchmark
microbenchmark::microbenchmark(
  {
    dta_2 <- dta %>%
      separate(col = value, sep = "\\$", into = paste0("col_", 1:1000))
  },
  {
    tmp_file <- tempfile()
    fwrite(dta, tmp_file)
    dta_3 <- fread(tmp_file, sep = "$", header = FALSE)
  },
  times = 3
)
Edit: I tested the speed and it seems faster than dt_separate from tidyfast, but it depends on the size of your dataset.

R SQL: Is the Default Option Sampling WITH Replacement?

I want to sample a file WITH REPLACEMENT on a server using SQL with R:
Pretend that this file is the file I am trying to sample:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I want to sample with replacement 30 rows where species = setosa and 30 rows where species = virginica. I used the following code to do this:
rbind(
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'virginica') ORDER BY RANDOM() LIMIT 30;")
)
However at this point, I am not sure if the random sampling being performed is WITH REPLACEMENT or WITHOUT REPLACEMENT.
Can someone please help me determine whether the random sampling being performed is WITH REPLACEMENT or WITHOUT REPLACEMENT - and if it is being done WITHOUT REPLACEMENT, how can I change this so that it's done WITH REPLACEMENT?
Thank you!
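For what it's worth: ORDER BY RANDOM() LIMIT 30 shuffles the existing rows and keeps the first 30, so each source row can appear at most once - that is sampling WITHOUT replacement. A minimal sketch of one way to sample WITH replacement, reusing the con and iris table from above, is to pull a class once and resample its row indices in R:
setosa <- DBI::dbGetQuery(con, "SELECT * FROM iris WHERE Species = 'setosa';")

# sampling row indices with replace = TRUE means the same row can be drawn more than once
setosa_boot <- setosa[sample(nrow(setosa), 30, replace = TRUE), ]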

Splitting a Dataset into Arbitrary Sections

I have this data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
I would like to split this data set into 10 different datasets (each containing 100 rows) and then upload them on to a server.
I know how to do this by hand:
sample_1 = sample_data[1:100,]
sample_2 = sample_data[101:200,]
sample_3 = sample_data[201:300,]
# etc.
library(DBI)
#establish connection (my_connection)
dbWriteTable(my_connection, SQL("sample_1"), sample_1)
dbWriteTable(my_connection, SQL("sample_2"), sample_2)
dbWriteTable(my_connection, SQL("sample_3"), sample_3)
# etc
Is there a way to do this "quicker"?
I thought of a general way to do this - but I am not sure how to correctly write the code for this:
i = seq(1:1000, by = 100)
j = 1 - 99
{
sample_i = sample_data[ i:j,]
dbWriteTable(my_connection, SQL("sample_i"), sample_i)
}
Can someone please help me with this?
Thank you!
Here's an example using the SQLite database engine. We'll start with your sample data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
Now we'll break your large data frame into a list of 10 data frames using split(). The result will be stored in a list:
list_of_dfs <- split(
  sample_data, (seq(nrow(sample_data)) - 1) %/% 100
)
We'll create a vector with the names of the tables in the database. Here, I'm just making a simple vector with the names sample_1, sample_2, etc.
table_names <- paste0("sample_", 1:10)
Now we're ready to write to the database. We'll make a connection and then iterate over the list of data frames and the vector of table names simultaneously, calling dbWriteTable() each time:
library(DBI)
library(purrr)   # for map2()

connection <- dbConnect(RSQLite::SQLite(), dbname = "test.db")

map2(
  table_names,
  list_of_dfs,
  function(x, y) dbWriteTable(connection, x, y)
)
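As an optional check (not part of the original answer), DBI can confirm that the tables were created; purrr::walk2() could also replace map2() if the return value is not needed.
dbListTables(connection)   # should include sample_1 ... sample_10
dbDisconnect(connection)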

Writing table from SQL query directly to other database in R

So in a database Y I have a table X with more than 400 million observations. Then I have a KEY.csv file with IDs, that I want to use for filtering the data (small data set, ca. 50k unique IDs). If I had unlimited memory, I would do something like this:
require(RODBC)
require(dplyr)

db <- odbcConnect('Y', uid = "123", pwd = '123')
df <- sqlQuery(db, 'SELECT * from X')
close(db)

keys <- read.csv('KEY.csv')
df_final <- df %>% filter(ID %in% keys$ID)
My issue is that I don't have the rights to upload the KEY.csv file to database Y and do the filtering there. Would it somehow be possible to do the filtering in the query while referencing the file loaded into R memory, and then write the filtered X table directly to a database I do have access to? I think that even after filtering, R might not be able to keep it in memory.
I could also try to do this in Python, however don't have much experience in that language.
I don't know how many keys you have, but maybe you can try the build_sql() function to use the keys inside the query.
I don't use RODBC; I think you should use odbc and DBI (https://db.rstudio.com).
library(dbplyr)  # dbplyr, not dplyr: build_sql() comes from dbplyr
library(DBI)
library(odbc)

# Get the keys first
keys <- read.csv('KEY.csv')

# the connection function changes with odbc/DBI
db <- dbConnect(odbc(), dsn = "Y", uid = "123", pwd = "123")

# write your query (dbplyr); the vector of keys is escaped into a comma-separated IN (...) list
sql_query <- build_sql("SELECT * FROM X WHERE X.key IN ", keys$ID, con = db)

df <- dbGetQuery(db, sql_query)  # the query function changes with odbc/DBI
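To get the filtered result into a database the poster does have write access to, one possibility (a sketch only; the target connection below is a placeholder, not something from the original post) is to open a second connection and append the rows there with dbWriteTable():
# hypothetical target database that the user can write to
db_out <- dbConnect(RSQLite::SQLite(), "filtered_X.sqlite")

# append = TRUE allows this to be called repeatedly if the keys are processed in chunks
dbWriteTable(db_out, "X_filtered", df, append = TRUE)

dbDisconnect(db_out)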

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, in which case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For files 1 to 48 (each file has dimensions 1,000 rows and 300,000 columns):
(2) Read the file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (it is now of dimensions 300,000 rows and 1,000 columns). Each row is now a unique ID in the SQLite table.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table, probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix. I guess I could just name them col1 ... col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type col1 ... col1000 manually, which doesn't sound very smart. This is where I am mostly stuck; a code snippet would help me.
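A minimal sketch of how such a column list could be built programmatically rather than typed out, assuming db is the SQLite connection used in the question's snippets (dbExecute() is the DBI function for statements that return no result set):
# build "col1 FLOAT, col2 FLOAT, ..., col1000 FLOAT" without typing it by hand
col_defs   <- paste(sprintf("col%d FLOAT", 1:1000), collapse = ", ")
create_sql <- sprintf("CREATE TABLE mymatrixdata (myid TEXT, %s)", col_defs)

dbExecute(db, create_sql)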
Then I need to dump the text files into the SQLite database. Again, I am unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for dumping this into the database...
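For that step, a minimal sketch continuing from the objects defined just above: dbWriteTable() accepts a data frame and creates the table in one call, so a manual CREATE TABLE with a thousand column names may not even be necessary.
# dbWriteTable() creates the table and infers the column types from the data frame
dbWriteTable(db, "mymatrixdata", as.data.frame(mymatrix), overwrite = TRUE)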
Finally,
How do I quickly retrieve the values for some of the rows (by ID) using R, without having to read the entire matrices each time?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there the id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100, and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here - no need to code all 48 files; just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)

m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL))  # test matrix

con <- dbConnect(SQLite())  # could add a dbname= argument; an in-memory database is used here, so it is not needed

dbWriteTable(con, "m", as.data.frame(m))  # write
dbGetQuery(con, "create unique index mi on m(row_names)")

# retrieve a submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
                          order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based, and the order in which the rows are stored is not guaranteed. We have used order by row_names to get the rows out in a specific order. If that is not good enough, then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.
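Since the set of IDs changes from run to run, the WHERE clause can also be built dynamically instead of hardcoding 'A' and 'C'. A small sketch continuing from the m table above, using dbGetQuery()'s params argument (standard DBI) with one placeholder per requested ID:
ids <- c("A", "C", "E")   # whichever row IDs are wanted this time

# one '?' placeholder per requested ID
placeholders <- paste(rep("?", length(ids)), collapse = ", ")
sql <- sprintf("select * from m where row_names in (%s) order by row_names", placeholders)

m3.df <- dbGetQuery(con, sql, params = as.list(ids))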