Making R functions run faster [closed]

Making R functions run faster [closed] - sql

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 3 days ago.
Improve this question
Background: I have a very large dataset which logs what settings Users have saved.
I wrote some code which logs if a user has changed any settings from day one to day 2. The following are two functions I wrote which are part to make this happen:
packages used are 'duckdb' and 'dplyr'
p_loop <- function(p){
progress <- paste(p,"out of",nrow(alluser_list), sep =" ")
write.table(progress, "progress_estimation.csv")
user1_day1 <- dbGetQuery(con, paste("SELECT * FROM demo_day1 WHERE userId = '",alluser_list[p,],"'",sep = ""))
user1_day2 <- dbGetQuery(con, paste("SELECT * FROM demo_day2 WHERE userId = '",alluser_list[p,],"'",sep = ""))
#from first day, make list with all locations
user1_locations <- user1_day1 %>% distinct(name)
o_loop <- function(o){
user1_day1_location_o <- user1_day1[user1_day1[,6]==user1_locations[o,],c(7,8,9,10)]
user1_day2_location_o <- user1_day2[user1_day2[,6]==user1_locations[o,],c(7,8,9,10)]
arrange(user1_day1_location_o, warnType)
arrange(user1_day2_location_o, warnType)
user1_day2_location_o
if(isTRUE(all_equal(user1_day1_location_o,user1_day2_location_o))){
numticker <- 0
} else{
numticker <- 1
}
###
###
#check if numticker is greater than 0. If yes at least one change in settings occured for this user and this location.
#record result in table Results. Records UserId, location and if setting change occured(0/1)
if(numticker>0){
results <- data.frame(alluser_list[p,],user1_locations[o,],1)
colnames(results) <- c("UserId","Location","Result")
dbAppendTable(con,"Results",results)
} else {
results <- data.frame(alluser_list[p,],user1_locations[o,],0)
colnames(results) <- c("UserId","Location","Result")
dbAppendTable(con,"Results",results)
}
}
lapply(1:nrow(user1_locations),o_loop)
}
lapply(1:nrow(alluser_list),p_loop)
My code works but it is way too slow. On the cluster, that I work with, one iteration of p_loop takes about 14 seconds. There are a total of 3.5 million iterations. So I need a way to speed this up by magnitudes. I already reworked the code to use functions and lapply instead of for loops (hence the _loop names, because I learned that for loops are very inefficient. This saved about 4 seconds per iteration so not nearly enough to make this usable.
I'm asking this question because I'm unsure if this is even possible.
My knowledge of R is super basic.

Related

Set SQL WHERE value using an element from a list [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
Improve this question
I have a variable with a list on it and I need to use its value for my find option. I get an error when I set my id_user to id_u.
Here is the list
id_u = user_key[0]
This is my SELECT and WHERE
find = ("SELECT * FROM hashtags WHERE id_user=id_u")

You have to concatenate SELECT string with variable value.
Try like this:
id_u = user_key[0]
find = ("SELECT * FROM hashtags WHERE id_user=" + id_u)

Server SQL how to clean column using STUFF? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
608A
608 A
17113 R
16524 DC1
ASM-1780
234604A - Low L2 Cu
19658B-->
234605 - High L2 Cu
17015 Rev A 405734UD0A
43224A (W
23809 REVB
Is there an SQL server query that cleans the column above and removes the excess content on the right such that the data is converted to below:
608
608
17113
16524
ASM-1780
234604
19658
234605
17015
43224
23809
I have tried using STUFF, but it doesn't clean well.

You seem to want everything up to and including the first digit followed by a non-digit.
Well, this returns what you are asking for:
select str, left(str, patindex('%[0-9][^0-9]%', str + ' '))
Here is a db<>fiddle.

Selenium-getfirstselectedoption method too slow [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I am using getfirstselectedoption.gettext() method to get the default selected value in drop down which helps to reduce execution time as i need to select the value in that drop down every time. But it's taking approx 15 to 20 secs to get that default selected value. Drop down contains nearly 180 string values. I don't understand why it's taking that much time. Any help would be appreciated.

Looking through the Selenium API, and associated source tells you: .getFirstSelectedOption() is:
public WebElement getFirstSelectedOption() {
for (WebElement option : getOptions()) {
if (option.isSelected()) {
return option;
}
}
throw new NoSuchElementException("No options are selected");
}
and getOptions() is:
/**
* #return All options belonging to this select tag
*/
public List<WebElement> getOptions() {
return element.findElements(By.tagName("option"));
}
So your expectation that at the first hit the loop will stop is not correct; it has to fetch all your options first.

Handling paginated SQL query results

For my dissertation data collection, one of the sources is an externally-managed system, which is based on Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework, where I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset. But, when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").
So, I have decided to use SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the output that I see is only one pagination progress character (*) and then no more output. I'd appreciate, if you could help me identify the probable cause of the unexpected behavior. I cannot provide reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet would be enough to reveal the issue (or, at least, a direction toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1
# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
# setup SQL pagination
if (rq$where == '') rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
# some code
# add current data frame to the list
dfList <- c(dfList, data)
if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)

It probably falls into the category when presumably MySQL hinders LIMIT OFFSET:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, fetching large data sets over HTTP repeatedly is not very reliable.

Since this is for your dissertation, here is a hand:
## Folder were to save the results to disk.
## Ideally, use a new, empty folder. Easier then to load from disk
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)
## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10
## The limit size per request
RQ_SIZE <- 1000
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())
TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1
## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)
for (g in seq(TotalNumberOfGroups)) {
ret <-
lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
## function(i) is the same code you have
## inside your for loop, but cleaned up.
# setup SQL pagination
if (rq$where == '')
rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*g*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
# retrieve result
data <- srdaGetData()
# some code
if (DEBUG) message("*", appendLF = FALSE)
### DONT ASSIGN TO dfList, JUST RETURN `data`
# xxxxxx DONT DO: xxxxx dfList <- c(dfList, data)
### INSTEAD:
## return
data
})
## save each iteration
file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
saveRDS(do.call(rbind, ret), file=file.out)
## OPTIONAL (this will be slower, but will keep your rams and goats in line)
# rm(ret)
# gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")
## Create an empty list
myData <- vector("list", length=length(files))
## Option 1, using data.frame
for (i in seq(myData))
myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)
## Option 2, using data.table
for (i in seq(myData))
myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)

I'm answering my own question, as, finally, I have figured out what has been the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason of that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate previous iteration's WHERE clause with the current one. I have fixed the code by processing contents of the WHERE clause and saving it before the loop and then using the saved value in each iteration of the loop safely, as it became independent from the value of the original WHERE clause.
This investigation also helped me to fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. Big thank you to those nice people who helped with this question.

xts objects and split

I have many "problems" but i will try to split them up as good as i can.
First i will present my code:
# Laster pakker
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(Perfor)
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date,fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting SQLen
query <- strwrap(query,width=nchar(query),simplify=TRUE)
# quering
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is=c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
`enter code here`resultt$v_d <- as.Date(as.POSIXct(t$v_d))
#split
y <- split(qt,qt$fk_dim_portfolio)
#making names
new_names <- c("one","two")
for (i in 1:length(y){assign(new_names[i],y[[i]])})
So far so good:
The table that my SQL is running on has approx 178 diff. port_ids, some of which are useless and others that are highly useful. However i want this code to pull all fk_dim_ports (pulling: '6906', '1812964 was just for example purposes). After pulling the data i want to seperate it into n (now 178 sets) and make them xts objects which i have run into some trouble using:
qt <- xts(t[,-1],order.by=t[,1])
But works perfectly well if i don`t split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for every table of my previous created data frames.
If you have any tips on how to split, make timeseries objects and loop the generation of the charts i would highly appreciate this.
I am aware that this post probably don`t comply to your rules/customs etc, but thanks for helping.
Lars

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Making R functions run faster [closed] - sql

Related

Set SQL WHERE value using an element from a list [closed]

Server SQL how to clean column using STUFF? [closed]

Selenium-getfirstselectedoption method too slow [closed]

Handling paginated SQL query results

xts objects and split

Categories

Resources