I have a data set assigned to a variable named 'temps', which has the columns 'date', 'temperature', and 'country'.
I want to do something like this, which I can do in SQL
SELECT * FROM temps WHERE country != 'mycountry'
How can I do a similar selection in R?
We can use similar syntax in base R
temps[temps$country != "mycountry",]
Benchmarks
set.seed(24)
temps1 <- data.frame(country = sample(LETTERS, 1e7, replace=TRUE),
val = rnorm(1e7))
system.time(temps1[!temps1$country %in% "A",])
# user system elapsed
# 0.92 0.11 1.04
system.time(temps1[temps1$country != "A",])
# user system elapsed
# 0.70 0.17 0.88
If we are using package solutions
library(sqldf)
system.time(sqldf("SELECT * FROM temps1 WHERE country != 'A'"))
# user system elapsed
# 12.78 0.37 13.15
library(data.table)
# keyed anti-join: keep rows whose key ('country') is not "A"
system.time(setDT(temps1, key = 'country')[!("A")])
# user system elapsed
# 0.62 0.19 0.37
This should do it.
temps2 <- temps[!temps$country %in% "mycountry",]
Here are sqldf and base R approaches with the source and sample output based on the input shown in the Note below.
1) sqldf
library(sqldf)
sqldf("SELECT * FROM temps WHERE country != 'mycountry'")
## country value
## 1 other 2
2) base R
subset(temps, country != "mycountry")
## country value
## 2 other 2
Note: The test data used above are shown here. Next time, please provide such reproducible sample data in the question.
# test data
temps <- data.frame(country = c("mycountry", "other"), value = 1:2)
Related
I need to analyse data from a very large dataset. For that, I need to separate a character variable into more than a thousand columns.
The structure of this variable is :
number$number$number$ and so on for a thousand numbers
My data is stored in a .db file from SQLite. I then imported it into R using the "RSQLite" package.
I tried splitting this column into multiple columns using separate() from tidyr in a dplyr pipeline:
#d is a data.table with my data
d2 <- d %>% separate(column_to_separate, paste0("S", 1:number_of_final_columns))
It works, but it is also taking forever. Does anyone have a solution to split this column faster (either in R or using SQLite)?
Thanks.
You may use the tidyfast package (see here), which leverages data.table. In this test, it is approximately three times faster:
test <- data.frame(
long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
)
system.time({
test |>
tidyr::separate(long.var, into = paste0("N", 1:1001), sep="\\$")
})
#> user system elapsed
#> 0.352 0.012 0.365
system.time({
test |>
tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep="\\$")
})
#> user system elapsed
#> 0.117 0.000 0.118
Created on 2023-02-03 with reprex v2.0.2
You can try writing the data to a file as is and then loading it with fread, which is in general rather fast.
library(data.table)
library(dplyr)
library(tidyr)
# Prepare example
x <- matrix(rnorm(1000*10000), ncol = 1000)
dta <- data.frame(value = apply(x, 1, function(x) paste0(x, collapse = "$")))
# Run benchmark
microbenchmark::microbenchmark({
dta_2 <- dta %>%
separate(col = value, sep = "\\$", into = paste0("col_", 1:1000))
},
{
tmp_file <- tempfile()
fwrite(dta, tmp_file)
dta_3 <- fread(tmp_file, sep = "$", header = FALSE)
}, times = 3
)
Edit: I tested the speed and it seems faster than dt_separate from tidyfast, but it depends on the size of your dataset.
I'm trying to write code that will loop through a list of integers, which relate to a number of sensors, to provide summary statistics (at this stage just cor()).
# GOOD TO HERE
corr_table <- data.frame(ID = integer(),
                         HxT = double())
for(j in gt_thrsh_key){ # this is currently set to 2:5 for testing - it's a list of sensors I want to summarise
# extract humidity and time vectors
x <- sqldf(sprintf("SELECT humidity FROM data_agg_2 WHERE ID = %s",j))
y <- sqldf(sprintf("SELECT time_elapsed FROM data_agg_2 WHERE ID = %s",j))
# format into row
new_row <- data.frame(ID = c(j), HxT = c(cor(x,y))) #insert new variables into row
# append to dataframe
corr_table <- rbind(corr_table, new_row)
print(sprintf("Sensor %s has been summarised.",j)) # check 1
print(cor(x,y)) # check 2
}
print(corr_table)
assign("data_agg_2", data_agg_2, envir = .GlobalEnv)
I get output:
[1] "Sensor 2 has been summarised." "Sensor 3 has been summarised." "Sensor 4 has been summarised." "Sensor 5 has been summarised."
humidity -0.08950285
ID HxT
1 2 -0.08950285 #INCORRECT
2 3 -0.08950285 #INCORRECT
3 4 -0.08950285 #INCORRECT
4 5 -0.08950285 #correct
This is the correct measurement only for the final iteration of the loop (id = 5), so somehow I must be overwriting previous entries. Does anyone know why this is happening? Or can you recommend a better way to perform this loop?
Thanks!!
EDIT: check 2, which prints the cor() of x and y throughout the loop, confirms that only the final run of the loop calculates a value. Has anyone seen this before?
Here is a base R solution that uses lapply() to generate the correlations and write them to a list(). The list is converted to a data frame with do.call(rbind,...).
# simulate some data
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:10,20)
humidity <- rnorm(200,mean = 30,sd = 15)
elapsed_time <- rpois(200,2.5)
data <- data.frame(ID,humidity, elapsed_time)
uniqueIDs <- unique(data$ID)
correlationList <- lapply(uniqueIDs, function(x){
  y <- subset(data, ID == x)
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, HxT = HxT)
})
correlations <- do.call(rbind,correlationList)
...and the output:
> correlations
ID HxT
1 1 -0.1805885
2 2 -0.3166290
3 3 0.1749233
4 4 -0.2517737
5 5 0.1428092
6 6 0.3112812
7 7 -0.3180825
8 8 0.3774637
9 9 -0.3790178
10 10 -0.3070866
>
sqldf() version
We can restructure the code from the original post so it extracts all the data it needs through a single SQL query, and performs all subsequent processing in R.
First, we simulate 60,000 rows of data.
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:30,2000)
humidity <- rnorm(60000,mean = 30,sd = 15)
elapsed_time <- rpois(60000,2.5)
data <- data.frame(ID,humidity, elapsed_time)
Next, we extract data for the first 5 sensors from the data with sqldf(), as well as the vector of uniqueIDs.
library(sqldf)
# select ID <= 5
sqlStmt <- "select ID, humidity,elapsed_time from data where ID <= 5"
dataSubset <- sqldf(sqlStmt)
sqlStmt <- "select distinct ID from data where ID <= 5"
uniqueIDs <- sqldf(sqlStmt)[[1]]
At this point, the dataSubset data frame has 10,000 observations. We use lapply() with the vector of uniqueIDs to generate correlations by ID, count the complete.cases() included in each correlation, and write the results to a list of data frames.
correlationList <- lapply(uniqueIDs, function(x){
  y <- subset(dataSubset, ID == x)
  count <- sum(complete.cases(y)) # number of obs included in cor()
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, count = count, HxT = HxT)
})
Finally, a do.call(rbind,...) and a print, and we have our list of correlations including counts of rows used to calculate the correlation.
correlations <- do.call(rbind,correlationList)
correlations
...and the output:
> correlations
ID count HxT
1 1 2000 0.015640244
2 2 2000 0.017143573
3 3 2000 -0.011283180
4 4 2000 0.052482666
5 5 2000 0.002083603
>
I have a database consisting of three tables like this:
I want to make a machine learning model in R using that database, and the data I need is like this:
I can use one-hot encoding to convert the categorical variable from t_pengolahan (such as "Pengupasan", "Fermentasi", etc.) into attributes. But how do I set a flag (yes or no) on the data values based on the "result (using SQL query)" data above?
We can combine two answers to previous related questions, each of which provides half of the solution; those answers are found here and here:
library(dplyr) ## dplyr and tidyr loaded for wrangling
library(tidyr)
options(dplyr.width = Inf) ## we want to show all columns of result
yes_fun <- function(x) { ## helps with pivot_wider() below
  if (length(x) > 0) {
    return("yes")
  }
}
sql_result %>%
separate_rows(pengolahan) %>% ## add rows for unique words in pengolahan
pivot_wider(names_from = pengolahan, ## spread to yes/no indicators
values_from = pengolahan,
values_fill = list(pengolahan = "no"),
values_fn = list(pengolahan = yes_fun))
Data
id_pangan <- 1:3
kategori <- c("Daging", "Buah", "Susu")
pengolahan <- c("Penggilingan, Perebusan", "Pengupasan",
"Fermentasi, Sterilisasi")
batas <- c(100, 50, 200)
sql_result <- data.frame(id_pangan, kategori, pengolahan, batas)
# A tibble: 3 x 8
id_pangan kategori batas Penggilingan Perebusan Pengupasan
<int> <fct> <dbl> <chr> <chr> <chr>
1 1 Daging 100 yes yes no
2 2 Buah 50 no no yes
3 3 Susu 200 no no no
Fermentasi Sterilisasi
<chr> <chr>
1 no no
2 no no
3 yes yes
This seems unclear to me. What do you mean by "how to set flag (yes or no) to the data value based on 'result (using SQL query)' data"? Do you want to convert one of the columns to a boolean value? If so, you need to specify the decision rule.
This may look like this:
SELECT (... other columns),
CASE case_expression
WHEN when_expression_1 THEN 'yes'
WHEN when_expression_2 THEN 'no'
ELSE ''
END
To help others help you:
- which SQL variant do you use? (e.g. would a sqlite solution work for you?)
- provide an sql script of your table creation, plus a script to "use one hot encoding to convert categorical variable from t_pengolahan (such as "Pengupasan, Fermentasi, etc") into attributes"
I am doing some experiments with SQL in R using the sqldf package.
I am trying to test some commands to check the output, in particular I am trying to create tables.
Here the code:
sqldf("CREATE TABLE tbl1 AS
SELECT cut
FROM diamonds")
Very simple code; however, I get this error:
sqldf("CREATE TABLE tbl1 AS
+ SELECT cut
+ FROM diamonds")
data frame with 0 columns and 0 rows
Warning message:
In result_fetch(res@ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries
Why is it saying that the table created has 0 columns and 0 rows?
Can someone help?
That is a warning, not an error. The warning is caused by a backward incompatibility in recent versions of RSQLite. You can ignore it since it works anyway.
The sqldf statement that is shown in the question
- creates an empty database
- uploads the diamonds data frame to a table of the same name in that database
- runs the create statement, which creates a second table tbl1 in the database
- returns nothing (actually a 0 column, 0 row data frame), since a create statement has no value
- destroys the database
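For illustration, one way to watch these steps happen is to rerun the statement from the question with verbose = TRUE (assuming diamonds is the data frame from the ggplot2 package):
library(ggplot2)   # provides the diamonds data frame
library(sqldf)
# sqldf echoes the lower-level RSQLite calls it makes, so the
# upload/create/destroy steps described above become visible.
sqldf("CREATE TABLE tbl1 AS SELECT cut FROM diamonds", verbose = TRUE)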
When using sqldf you don't need create statements. It automatically creates a table in the backend database for any data frame referenced in your SQL statement, so the following sqldf statement
sqldf("select * from diamonds")
will:
- create an empty database
- upload diamonds to it
- run the select statement
- return the result of the select statement as a data frame
- destroy the database
You can use the verbose=TRUE argument to see the individual calls to the lower level RSQLite (or other backend database if you specify a different backend):
sqldf("select * from diamonds limit 3", verbose = TRUE)
giving:
sqldf: library(RSQLite)
sqldf: m <- dbDriver("SQLite")
sqldf: connection <- dbConnect(m, dbname = ":memory:")
sqldf: initExtension(connection)
sqldf: dbWriteTable(connection, 'diamonds', diamonds, row.names = FALSE)
sqldf: dbGetQuery(connection, 'select * from diamonds limit 3')
sqldf: dbDisconnect(connection)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
I suggest you thoroughly review help("sqldf") as well as the info on the sqldf GitHub home page.
I want to create a new custom TA indicator for a stock symbol in R. But I have no idea how to convert my SQL-style conditional strategy into a self-defined R function and add it to chartSeries in R.
The questions are listed in the following code as explanation.
library("quantmod")
library("FinancialInstrument")
library("PerformanceAnalytics")
library("TTR")
stock <- getSymbols("002457.SZ",auto.assign=FALSE,from="2012-11-26",to="2014-01-30")
head(stock)
chartSeries(stock, theme = "white", subset = "2013-07-01/2014-01-30",TA = "addSMA(n=5,col=\"gray\");addSMA(n=10,col=\"yellow\");
addSMA(n=20,col=\"pink\");addSMA(n=30,col=\"green\");addSMA(n=60,col=\"blue\");addVo()")
Question: How can I rewrite the code below to make it available as a function in R?
#Signal Design
#Today's volume is the lowest during the last 20 trading days
lowvolume <- VOL<=LLV(VOL,20);
#several moving average lines stick together
X1:=ABS(MA(C,10)/MA(C,20)-1)<0.01;
X2:=ABS(MA(C,5)/MA(C,10)-1)<0.01;
X3:=ABS(MA(C,5)/MA(C,20)-1)<0.01;
#If the following condition is satisfied, then the signal appears
MA(C,5)>REF(MA(C,5),1) AND X1 AND X2 AND X3 AND lowvolume;
#Convert the above SQL code into the following R custom function
VOLINE <- function(x) {
}
#Create a new TA function for the chartseries and then add it up.
addVoline <- newTA(FUN = VOLINE,
                   preFUN = Cl,
                   col = c(rep(3, 6),
                           rep("#333333", 6)),
                   legend = "VOLINE")
I don't think you need SQL in this case.
Try this:
require(quantmod)
# fetch the data
s <- get(getSymbols('yhoo'))
# add the indicators
s$ma5 <- SMA(Cl(s) ,5)
s$ma10 <- SMA(Cl(s) ,10)
s$ma20 <- SMA(Cl(s) ,20)
s$llv <- rollapply(Vo(s), 20, min)
# generate the signal
s$signal <- (s$ma10 / s$ma20 - 1 < 0.01 & s$ma5 / s$ma10 - 1 < 0.01 & s$ma5 / s$ma20 - 1 < 0.01 & Vo(s) == s$llv)
# draw
chart_Series(s)
add_TA(s$signal == 1, on = 1, col='red')
I'm not sure what REF means, but I'm sure you can do that by yourself.
This is the output (I can't seem to upload the photo, but you would see a chart with horizontal lines where the signal equals 1).
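If REF(MA(C,5),1) is meant to be the previous bar's 5-day moving average (an assumption on my part, not confirmed in the question), a minimal sketch for adding that condition, building on the s object from the code above, would be:
# Assumption: REF(MA(C,5),1) = yesterday's 5-day moving average.
# lag() on an xts object shifts the series so each row holds the prior value.
s$ma5_prev <- lag(s$ma5, 1)
s$signal2  <- s$signal & (s$ma5 > s$ma5_prev)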
Write the function as a wrapper around sqldf() from the sqldf package. The argument to sqldf() will be a select statement on the data frame that holds the data.
A good tutorial for this can be found at Burns Statistics.
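A minimal sketch of that idea (the wrapper name, data frame, and condition here are hypothetical, not taken from the question):
library(sqldf)
# Hypothetical wrapper: runs a select statement against whatever data frame
# is passed in as 'df'; 'condition' is an arbitrary SQL WHERE clause.
select_where <- function(df, condition) {
  sqldf(sprintf("SELECT * FROM df WHERE %s", condition))
}
# Example call, assuming 'stock_df' is a data frame with a 'volume' column:
# select_where(stock_df, "volume > 1000")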