I have this data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
I would like to split this data set into 10 different datasets (each containing 100 rows) and then upload them to a server.
I know how to do this by hand:
sample_1 = sample_data[1:100,]
sample_2 = sample_data[101:200,]
sample_3 = sample_data[201:300,]
# etc.
library(DBI)
#establish connection (my_connection)
dbWriteTable(my_connection, SQL("sample_1"), sample_1)
dbWriteTable(my_connection, SQL("sample_2"), sample_2)
dbWriteTable(my_connection, SQL("sample_3"), sample_3)
# etc
Is there a way to do this "quicker"?
I thought of a general way to do this - but I am not sure how to correctly write the code for this:
i = seq(1:1000, by = 100)
j = 1 - 99
{
sample_i = sample_data[ i:j,]
dbWriteTable(my_connection, SQL("sample_i"), sample_i)
}
Can someone please help me with this?
Thank you!
Here's an example using the SQLite database engine. We'll start with your sample data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
Now we'll break your large data frame into 10 pieces of 100 rows each using split(). The result is stored in a list of data frames:
list_of_dfs <- split(
  sample_data,
  (seq(nrow(sample_data)) - 1) %/% 100
)
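As a quick sanity check (optional, not part of the original answer), you can confirm that each element of the list contains 100 rows:
# every element should report 100
sapply(list_of_dfs, nrow)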
We'll create a vector with the names of the tables in the database. Here, I'm just making a simple vector with the names sample_1, sample_2, etc.:
table_names <- paste0("sample_", 1:10)
Now we're ready to write to the database. We'll make a connection and then iterate over the list of data frames and the vector of table names simultaneously, calling dbWriteTable() each time:
library(DBI)
library(purrr)

connection <- dbConnect(RSQLite::SQLite(), dbname = "test.db")

map2(
  table_names,
  list_of_dfs,
  function(x, y) dbWriteTable(connection, x, y)
)
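As an aside (my own suggestion, not part of the original answer): if you only care about the side effect of writing the tables and don't need the list that map2() returns, purrr::walk2() performs the same paired iteration silently, and you can close the connection afterwards:
# same iteration, but discards the return values
walk2(
  table_names,
  list_of_dfs,
  function(x, y) dbWriteTable(connection, x, y)
)
dbDisconnect(connection)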
I need to analyse data from a very large dataset. For that, I need to separate a character variable into more than a thousand columns.
The structure of this variable is :
number$number$number$ and so on for a thousand numbers
My data is stored in a .db file from SQLite. I then imported it into R using the RSQLite package.
I tried splitting this column into multiple columns using tidyr::separate() with a dplyr pipe:
# d is a data.table with my data
d2 <- d %>% separate(column_to_separate, paste0("S", 1:number_of_final_columns))
It works, but it is also taking forever. Does someone have a solution to split this column faster (either in R or using SQLite)?
Thanks.
You may use the tidyfast package (see here), which builds on data.table. In this test, it is approximately three times faster:
test <- data.frame(
long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
)
system.time({
  test |>
    tidyr::separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#> user system elapsed
#> 0.352 0.012 0.365
system.time({
  test |>
    tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep = "\\$")
})
#> user system elapsed
#> 0.117 0.000 0.118
Created on 2023-02-03 with reprex v2.0.2
You can try writing the file out as is and then loading it back with fread(), which is generally rather fast.
library(data.table)
library(dplyr)
library(tidyr)
# Prepare example
x <- matrix(rnorm(1000*10000), ncol = 1000)
dta <- data.frame(value = apply(x, 1, function(x) paste0(x, collapse = "$")))
# Run benchmark
microbenchmark::microbenchmark(
  {
    dta_2 <- dta %>%
      separate(col = value, sep = "\\$", into = paste0("col_", 1:1000))
  },
  {
    tmp_file <- tempfile()
    fwrite(dta, tmp_file)
    dta_3 <- fread(tmp_file, sep = "$", header = FALSE)
  },
  times = 3
)
Edit: I tested the speed and it seems faster than dt_separate from tidyfast, but it depends on the size of your dataset.
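One small follow-up (an assumption about the desired result, not covered by the benchmark above): fread() with header = FALSE assigns default column names (V1, V2, ...), so if you want the same col_1, col_2, ... scheme as the separate() version, you can rename the result with data.table::setnames():
# rename the fread() result to match the separate() output
setnames(dta_3, paste0("col_", seq_len(ncol(dta_3))))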
I am trying to perform random sampling WITH REPLACEMENT in R, through SQL, on a table located on a server.
Pretend that this is the table I am trying to sample from WITH REPLACEMENT (I want 30 observations from "setosa" and 30 observations from "virginica"):
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I could not figure out a way to randomly sample WITH REPLACEMENT, so I tried to write a loop that takes one sample at a time (to make sure replicates can occur) and then appends all these samples together. Here, I am trying to create 10 random sample datasets, each containing 600 rows (60 rows per pass, 30 from each class, repeated 10 times):
all_tables <- list()
all_tables_1 <- list()
for (j in 1:10) {
  for (i in 1:10) {
    table_i = rbind(
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
      DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (Species = 'virginica') ORDER BY RANDOM() LIMIT 30;")
    )
    all_tables[[i]] <- table_i
  }
  all_tables_j <- do.call(rbind.data.frame, all_tables)
  all_tables_1[[j]] <- all_tables_j
}
lapply(seq_along(all_tables_1), function(i, x) {
  assign(paste0("sample_", i), x[[i]], envir = .GlobalEnv)
}, x = all_tables_1)
dbDisconnect(con)
I have a feeling what I have written is (mostly) correct, but it's probably not very efficient. For example, if I want 1000 datasets and require each dataset to have 10000 random samples, this will require querying the file on the server many times and thus become slow/inefficient.
Could someone please suggest some ideas/advice on how to make this random sampling code more efficient?
Thank you!
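One possible direction (a sketch, not a benchmarked answer): query the two classes from the server once, then do all of the resampling locally, so the database is only hit a single time. Using dplyr's slice_sample() with replace = TRUE, each of the 10 datasets can be drawn with replacement directly (this assumes the `con` connection from the question above):
library(DBI)
library(dplyr)

# pull the two classes of interest from the server once
iris_local <- dbGetQuery(
  con,
  "SELECT * FROM iris WHERE Species IN ('setosa', 'virginica');"
)

# build 10 datasets, each with 300 rows per class drawn WITH replacement
# (300 per class = the 30-per-class sample repeated 10 times in the loop above)
all_samples <- lapply(1:10, function(j) {
  iris_local %>%
    group_by(Species) %>%
    slice_sample(n = 300, replace = TRUE) %>%
    ungroup()
})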
I'm trying to write code that will loop through a list of integers, which relate to a number of sensors, to provide summary statistics (at this stage just cor()).
# GOOD TO HERE
corr_table <- data.frame(ID = integer(),
                         HxT = double())
for (j in gt_thrsh_key) { # this is currently set to 2:5 for testing - it's a list of sensors I want to summarise
  # extract humidity and time vectors
  x <- sqldf(sprintf("SELECT humidity FROM data_agg_2 WHERE ID = %s", j))
  y <- sqldf(sprintf("SELECT time_elapsed FROM data_agg_2 WHERE ID = %s", j))
  # format into row
  new_row <- data.frame(ID = c(j), HxT = c(cor(x, y))) # insert new variables into row
  # append to dataframe
  corr_table <- rbind(corr_table, new_row)
  print(sprintf("Sensor %s has been summarised.", j)) # check 1
  print(cor(x, y)) # check 2
}
print(corr_table)
assign("data_agg_2", data_agg_2, envir = .GlobalEnv)
I get output:
[1] "Sensor 2 has been summarised." "Sensor 3 has been summarised." "Sensor 4 has been summarised." "Sensor 5 has been summarised."
humidity -0.08950285
ID HxT
1 2 -0.08950285 #INCORRECT
2 3 -0.08950285 #INCORRECT
3 4 -0.08950285 #INCORRECT
4 5 -0.08950285 #correct
This is only the correct measurement for the final iteration of the loop (ID = 5), so somehow I must be overwriting previous entries. Does anyone know why this is happening? Or can you recommend a better way to perform this loop?
Thanks!!
EDIT: check 2, which prints cor(x, y) on each pass through the loop, confirms that only the final run of the loop calculates a value. Has anyone seen this before?
Here is a base R solution that uses lapply() to generate the correlations and write them to a list(). The list is converted to a data frame with do.call(rbind,...).
# simulate some data
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:10,20)
humidity <- rnorm(200,mean = 30,sd = 15)
elapsed_time <- rpois(200,2.5)
data <- data.frame(ID,humidity, elapsed_time)
uniqueIDs <- unique(data$ID)
correlationList <- lapply(uniqueIDs, function(x) {
  y <- subset(data, ID == x)
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, HxT = HxT)
})
correlations <- do.call(rbind,correlationList)
...and the output:
> correlations
ID HxT
1 1 -0.1805885
2 2 -0.3166290
3 3 0.1749233
4 4 -0.2517737
5 5 0.1428092
6 6 0.3112812
7 7 -0.3180825
8 8 0.3774637
9 9 -0.3790178
10 10 -0.3070866
>
sqldf() version
We can restructure the code from the original post so it extracts all the data it needs through a single SQL query, and performs all subsequent processing in R.
First, we simulate 60,000 rows of data.
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:30,2000)
humidity <- rnorm(60000,mean = 30,sd = 15)
elapsed_time <- rpois(60000,2.5)
data <- data.frame(ID,humidity, elapsed_time)
Next, we extract data for the first 5 sensors from the data with sqldf(), as well as the vector of uniqueIDs.
library(sqldf)
# select ID <= 5
sqlStmt <- "select ID, humidity,elapsed_time from data where ID <= 5"
dataSubset <- sqldf(sqlStmt)
sqlStmt <- "select distinct ID from data where ID <= 5"
uniqueIDs <- sqldf(sqlStmt)[[1]]
At this point, the dataSubset data frame has 10,000 observations. We use lapply() with the vector of uniqueIDs to generate correlations by ID, count the complete.cases() included in each correlation, and write the results to a list of data frames.
correlationList <- lapply(uniqueIDs, function(x) {
  y <- subset(dataSubset, ID == x)
  count <- sum(complete.cases(y)) # number of obs included in cor()
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, count = count, HxT = HxT)
})
Finally, a do.call(rbind,...) and a print, and we have our list of correlations including counts of rows used to calculate the correlation.
correlations <- do.call(rbind,correlationList)
correlations
...and the output:
> correlations
ID count HxT
1 1 2000 0.015640244
2 2 2000 0.017143573
3 3 2000 -0.011283180
4 4 2000 0.052482666
5 5 2000 0.002083603
>
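As a footnote to the approach above (my own variation, not part of the original answer), the same per-ID correlations and counts can also be computed with a grouped dplyr summary once dataSubset has been pulled:
library(dplyr)

# grouped equivalent of the lapply()/do.call() pattern
correlations_dplyr <- dataSubset %>%
  group_by(ID) %>%
  summarise(count = sum(complete.cases(humidity, elapsed_time)),
            HxT = cor(humidity, elapsed_time))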
I'm using RJDBC in RStudio to pull a set of data from an Oracle database into R.
After loading the RJDBC package I have the following lines:
drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA INCLUDING REQUEST FOR VARIABLE x"))
masterdata = fetch(rs, n = -1) # extract all rows
Run as part of the usual script, these lines always execute without fail; they can sometimes take a few minutes depending on variable x, e.g. they may result in 100K rows or 1M rows being pulled. masterdata ends up holding everything in a data frame.
I'm now trying to place all of the above into a function with one required argument, variable x, which is a TEXT argument (a city name); this input, however, is also part of the LONG SQL QUERY.
The function I wrote called Data_Grab is as follows:
Data_Grab = function(x) {
  drv = JDBC("oracle.jdbc.OracleDriver", classPath = "C:/R/ojdbc7.jar", identifier.quote = " ")
  conn = dbConnect(drv, "jdbc:oracle:thin:#private_server_info", "804301", "password")
  rs = dbSendQuery(conn, statement = paste("LONG SQL QUERY TO SELECT REQUIRED DATA,
                                            INCLUDING REQUEST FOR VARIABLE x"))
  masterdata = fetch(rs, n = -1) # extract all rows
  return(masterdata)
}
My function appears to execute in seconds (no error is produced); however, I get just the 21 column headings for the data frame and the line
<0 rows> (or 0-length row.names)
I'm not sure what is wrong here; I would still expect the function to take minutes to execute, since the data being pulled is large, but no actual data frame is returned.
Help is appreciated!
If you want to parameterize your query to a JDBC database, try also using the gsubfn package. The code might look like this:
library(gsubfn)
library(RJDBC)
Data_Grab = function(x) {
  rd1 = x
  # assumes `conn` has already been created as in the question
  df <- fn$dbGetQuery(conn, "SELECT BLAH1, BLAH2
                             FROM TABLENAME
                             WHERE BLAH1 = '$rd1'")
  return(df)
}
Basically, you need to put a $ before the variable name that stores the parameter you wish to pass.
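A hypothetical call might then look like this (the city name is just a placeholder, and `conn` is assumed to exist as in the question):
city_data <- Data_Grab("Chicago")
head(city_data)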
I want to create a new custom TA indicator for a stock symbol in R, but I have no idea how to convert my SQL-style conditional strategy into a self-defined R function and add it to the chartSeries.
The questions are listed in the following code as comments.
library("quantmod")
library("FinancialInstrument")
library("PerformanceAnalytics")
library("TTR")
stock <- getSymbols("002457.SZ",auto.assign=FALSE,from="2012-11-26",to="2014-01-30")
head(stock)
chartSeries(stock, theme = "white", subset = "2013-07-01/2014-01-30",TA = "addSMA(n=5,col=\"gray\");addSMA(n=10,col=\"yellow\");
addSMA(n=20,col=\"pink\");addSMA(n=30,col=\"green\");addSMA(n=60,col=\"blue\");addVo()")
Question: How can I rewrite the code below to make it available as a function in R?
#Signal Design
#Today's volume is the lowest during the last 20 trading days
lowvolume <- VOL<=LLV(VOL,20);
#several moving average lines stick together
X1:=ABS(MA(C,10)/MA(C,20)-1)<0.01;
X2:=ABS(MA(C,5)/MA(C,10)-1)<0.01;
X3:=ABS(MA(C,5)/MA(C,20)-1)<0.01;
#If the following condition is satisfied, then the signal appears
MA(C,5)>REF(MA(C,5),1) AND X1 AND X2 AND X3 AND lowvolume;
#Convert the above SQL code into the following R custom function
VOLINE <- function(x) {
}
#Create a new TA function for the chartseries and then add it up.
addVoline <- newTA(FUN = VOLINE,
                   preFUN = Cl,
                   col = c(rep(3, 6),
                           rep("#333333", 6)),
                   legend = "VOLINE")
I don't think you need SQL in this case. Try this:
require(quantmod)
# fetch the data
s <- get(getSymbols('yhoo'))
# add the indicators
s$ma5 <- SMA(Cl(s) ,5)
s$ma10 <- SMA(Cl(s) ,10)
s$ma20 <- SMA(Cl(s) ,20)
s$llv <- rollapply(Vo(s), 20, min)
# generate the signal
s$signal <- (abs(s$ma10 / s$ma20 - 1) < 0.01 &
             abs(s$ma5 / s$ma10 - 1) < 0.01 &
             abs(s$ma5 / s$ma20 - 1) < 0.01 &
             Vo(s) == s$llv)
# draw
chart_Series(s)
add_TA(s$signal == 1, on = 1, col='red')
I'm not sure what REF means, but I'm sure you can add that yourself.
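If REF(x, 1) simply means the value of x one bar earlier (that is my assumption), the remaining condition could be expressed with xts-style lag(), for example:
# MA(C,5) > REF(MA(C,5),1): the 5-day MA is higher than it was one bar ago
s$ma5_rising <- s$ma5 > lag(s$ma5, 1)
s$signal <- s$signal & s$ma5_rising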
This is the output (I can't seem to upload the photo, but you would see a chart with horizontal lines where the signal equals 1).
Use the function as a wrapper for sqldf() in the sqldf package. The argument to sqldf() will be a select statement on the data frame that has the data.
A good tutorial for this can be found at Burns Statistics.