R SQL: Is the Default Option Sampling WITH Replacement?

I want to sample a file WITH REPLACEMENT on a server using SQL with R:
Pretend that this file is the file I am trying to sample:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I want to sample WITH REPLACEMENT 30 rows where Species = 'setosa' and 30 rows where Species = 'virginica'. I used the following code to do this:
rbind(
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'virginica') ORDER BY RANDOM() LIMIT 30;")
)
However, at this point I am not sure whether the random sampling being performed is WITH REPLACEMENT or WITHOUT REPLACEMENT.
Can someone please help me determine which it is, and if it is being done WITHOUT REPLACEMENT, how can I change this so that it's done WITH REPLACEMENT?
Thank you!
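A quick way to probe this, as a sketch against the in-memory iris table built above: ORDER BY RANDOM() only shuffles the rows that already exist, so requesting more rows than a group contains should cap out at the group size if there is no replacement. The second half of the sketch shows one workaround for sampling WITH replacement, namely pulling the subset into R once and resampling it there (the setosa and setosa_boot names are just illustrative):
# Sketch: setosa has 50 rows, so LIMIT 100 should still return only 50 rows
# if the sampling is WITHOUT replacement.
nrow(DBI::dbGetQuery(
  con,
  "SELECT * FROM iris WHERE (`Species` = 'setosa') ORDER BY RANDOM() LIMIT 100;"
))
# One workaround for sampling WITH replacement: pull the subset once,
# then resample its rows in R.
setosa <- DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'setosa');")
setosa_boot <- setosa[sample(nrow(setosa), 30, replace = TRUE), ]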

Related

Optimization: splitting a column into a thousand columns in R or SQLite

I need to analyse data from a very large dataset. For that, I need to separate a character variable into more than a thousand columns.
The structure of this variable is :
number$number$number$ and so on for a thousand numbers
My data is stored in a .db file from SQLite. I then imported it in R using the package "RSQLite".
I tried splitting this column into multiple columns using tidyr::separate():
# d is a data.table with my data
d2 <- d %>%
  separate(column_to_separate, into = paste0("S", 1:number_of_final_columns))
It works, but it is also taking forever. Does someone have a solution for splitting this column faster (either in R or using SQLite)?
Thanks.
You may use the tidyfast package (see here), which leverages data.table. In this test it is approximately three times faster:
test <- data.frame(
  long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
)

system.time({
  test |>
    tidyr::separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#>    user  system elapsed
#>   0.352   0.012   0.365

system.time({
  test |>
    tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#>    user  system elapsed
#>   0.117   0.000   0.118
Created on 2023-02-03 with reprex v2.0.2
You can also try writing the data to disk as is and then loading it back with fread, which is in general rather fast.
library(data.table)
library(dplyr)
library(tidyr)
# Prepare example
x <- matrix(rnorm(1000 * 10000), ncol = 1000)
dta <- data.frame(value = apply(x, 1, function(x) paste0(x, collapse = "$")))

# Run benchmark
microbenchmark::microbenchmark(
  {
    dta_2 <- dta %>%
      separate(col = value, sep = "\\$", into = paste0("col_", 1:1000))
  },
  {
    tmp_file <- tempfile()
    fwrite(dta, tmp_file)
    dta_3 <- fread(tmp_file, sep = "$", header = FALSE)
  },
  times = 3
)
Edit: I tested the speed and it seems faster than dt_separate from tidyfast, but it depends on the size of your dataset.

Efficient Ways to Append SQL Results in R

I am trying to perform "random sampling WITH REPLACEMENT" through SQL, using a file located on a server - in R.
Pretend that this file is the file I am trying to sample WITH REPLACEMENT (I want 30 observations from "setosa" and 30 observations from "virginica"):
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I could not figure out a way to randomly sample WITH REPLACEMENT, so I tried to write a loop that takes one sample at a time (to make sure replicates can occur) and then appends all these samples together. Here, I am trying to create 10 random sample datasets, each built from 10 batches of 60 random samples (30 from each class), i.e. 600 rows per dataset:
all_tables <- list()
all_tables_1 <- list()
for (j in 1:10) {
  for (i in 1:10) {
    table_i <- rbind(
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'virginica') ORDER BY RANDOM() LIMIT 30;")
    )
    all_tables[[i]] <- table_i
  }
  all_tables_j <- do.call(rbind.data.frame, all_tables)
  all_tables_1[[j]] <- all_tables_j
}
lapply(seq_along(all_tables_1), function(i, x) {
  assign(paste0("sample_", i), x[[i]], envir = .GlobalEnv)
}, x = all_tables_1)
dbDisconnect(con)
I have a feeling that what I have written is (mostly) correct, but it's probably not very efficient. For example, if I want 1000 datasets and require each dataset to have 10000 random samples, this will require querying the file on the server many times and thus become slow/inefficient.
Could someone please suggest some ideas/advice on how to make this random sampling code more efficient?
Thank you!
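One sketch for cutting the number of server round trips, assuming each class subset fits in memory (as it does for iris) and that the connection con is still open: query each class once, then do all of the resampling in R with sample(..., replace = TRUE), which also makes the draws genuinely WITH replacement. The draw_dataset() helper and the 300-rows-per-class default are illustrative; they mirror the ten batches of 30 rows per class built by the loop above, up to row order.
# Sketch: one query per class, all resampling done in memory.
setosa    <- DBI::dbGetQuery(con, "SELECT * FROM iris WHERE Species = 'setosa';")
virginica <- DBI::dbGetQuery(con, "SELECT * FROM iris WHERE Species = 'virginica';")
draw_dataset <- function(n_per_class = 300) {
  rbind(
    setosa[sample(nrow(setosa), n_per_class, replace = TRUE), ],
    virginica[sample(nrow(virginica), n_per_class, replace = TRUE), ]
  )
}
# Ten datasets of 600 rows each, analogous to sample_1 ... sample_10 above.
sample_list <- lapply(1:10, function(j) draw_dataset())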

Copy_to in R results in dates being converted to numeric columns

I am attempting to create a small, training database for a package that I am writing. I am using the following code to create the database:
library(tidyverse)
library(DBI)
dat <- data.frame(
  name = rep("Clyde", 100),
  DOB = sample(
    x = seq(as.POSIXct("1970/01/01"), as.POSIXct("1995/01/01"), by = "day"),
    size = 100, replace = TRUE
  )
)
# Example using schemas with SQLite
train_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
## create tables in primary db
copy_to(dest = train_con, df = dat, name = "client_list", temporary = FALSE)
The above portion works fine. However, when I attempt to pull data from the database, I see that all dates have been converted to numeric.
train_con %>% tbl("client_list")
Can anybody tell me how to fix this? Thanks!
SQLite does not have a datetime type. In the absence of such a type, POSIXct objects are sent to the database as seconds since the Unix epoch, and SQLite does not know that they are intended to represent date-times.
Either convert such columns yourself after you read them back into R, or else use a different database; nearly all databases other than SQLite support date-time types.
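A minimal sketch of the "convert after reading back" option, assuming the DOB column from the example above (the stored values are seconds since the Unix epoch):
library(dplyr)
# Collect the table into R, then turn the stored seconds back into POSIXct.
clients <- train_con %>%
  tbl("client_list") %>%
  collect() %>%
  mutate(DOB = as.POSIXct(DOB, origin = "1970-01-01", tz = "UTC"))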

Writing table from SQL query directly to other database in R

So in a database Y I have a table X with more than 400 million observations. I also have a KEY.csv file with IDs that I want to use for filtering the data (a small data set, ca. 50k unique IDs). If I had unlimited memory, I would do something like this:
require(RODBC)
require(dplyr)
db <- odbcConnect('Y',uid = "123",pwd = '123')
df <- sqlQuery(db,'SELECT * from X')
close(db)
keys <- read.csv('KEY.csv')
df_final <- df %>% filter(ID %in% keys$ID)
My issue is that I don't have the rights to upload the KEY.csv file to database Y and do the filtering there. Would it somehow be possible to do the filtering in the query while referencing the file loaded into R memory, and then write the filtered X table directly to a database I do have access to? I think that even after filtering, R might not be able to keep the result in memory.
I could also try to do this in Python, however don't have much experience in that language.
I don't know how many keys you have, but maybe you can try the build_sql() function to use the keys inside the query.
I don't use RODBC; I think you should use odbc and DBI (https://db.rstudio.com).
library(dbplyr) # dbplyr, not dplyr
library(DBI)
library(odbc)
# Get keys first
keys <- read.csv("KEY.csv")
# the connection call changes with odbc
db <- dbConnect(odbc::odbc(), dsn = "Y", uid = "123", pwd = "123")
# build the query (dbplyr); the vector of IDs is escaped into an IN (...) list
sql_query <- build_sql("SELECT * FROM X WHERE X.key IN ", keys$ID, con = db)
df <- dbGetQuery(db, sql_query) # the query function also changes with odbc
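For the second part of the question, writing the filtered table straight into a database you do have write access to, a sketch along the same lines; the TARGET_DSN credentials and the X_filtered table name are placeholders:
out_db <- dbConnect(odbc::odbc(), dsn = "TARGET_DSN", uid = "123", pwd = "123")
dbWriteTable(out_db, "X_filtered", df, overwrite = TRUE)  # df from the query above
dbDisconnect(out_db)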

R forecast output to SQL Server

I am doing database analysis using SQL Server and forecasting using R. I need to get the results from R back into the SQL Server database. One approach is to output the forecast data to a text file using write.table and import using BULK INSERT. Is there a better way?
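For concreteness, the text-file route I have in mind looks roughly like this (myconn is an open RODBC connection, forecast_df holds the forecast results, and the path and table name are placeholders; note that BULK INSERT reads the path on the SQL Server machine, not the R client):
library(RODBC)
# Dump the forecast to a flat file, then bulk-load it server-side.
write.table(forecast_df, "C:/temp/forecast.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)
sqlQuery(myconn,
         "BULK INSERT dbo.ForecastResults
          FROM 'C:\\temp\\forecast.csv'
          WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');")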
You can use dbBulkCopy from the rsqlserver package. It is a DBI extension that wraps bcp, Microsoft SQL Server's popular command-line bulk-copy utility, to quickly bulk-load large files into a table.
# nrow, ncol, cnames and conn are placeholders: your own dimensions,
# column names and an open connection to the target database.
dat <- matrix(round(rnorm(nrow * ncol), 2), nrow = nrow, ncol = ncol)
colnames(dat) <- cnames
id.file <- "temp_file.csv"
write.csv(dat, file = id.file, row.names = FALSE)
dbBulkCopy(conn, "NEW_BP_TABLE", value = id.file)
Thanks for your comments and answers! I went with a solution based on the comment by nrussell. Below is my code. The specific command is the last line; the preceding lines just give a little context for anyone trying to use this answer.
library(RODBC)
library(forecast)
data <- sqlQuery(myconn, query)  # returns a time series with year, month (both numeric), and value
data_ts <- ts(data$value,
              start = c(data$year[1], data$month[1]),                  # start is first year and month
              end = c(data$year[nrow(data)], data$month[nrow(data)]),  # end is last year and month
              frequency = 12)
data_fit <- auto.arima(data_ts)
fct <- forecast(data_fit, 12)  # forecast from the fitted ARIMA model
sqlQuery(myconn, 'truncate table dgtForecast')  # pre-existing table
sqlSave(myconn, data.frame(fct), tablename = 'dgtForecast', rownames = 'MonthYear', append = TRUE)