QUANTSTRAT - apply.paramset issue - optimization

I am trying to optimize MACD parameters for a trading strategy, but unfortunately I am stuck on the paramset.label value. This is the code:
################################# MACD PARAMETERS OPTIMIZATION
.fastMA <- (20:40)
.slowMA <- (30:70)
.nsamples = 10
strat.st <- 'volStrat'
# Paramset
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'macd.out',
                 variable = list(n = .fastMA),
                 label = 'nFast'
)
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'macd.out',
                 variable = list(n = .slowMA),
                 label = 'nSlow'
)
add.distribution.constraint(strat.st,
                            paramset.label = 'EMA',
                            distribution.label.1 = 'nFast',
                            distribution.label.2 = 'nSlow',
                            operator = '<',
                            label = 'nFast<nSlow'
)
results <- apply.paramset(strat.st,
                          paramset.label = 'EMA',
                          portfolio = portfolio2.st,
                          account = account.st,
                          nsamples = .nsamples,
                          verbose = TRUE)
stats <- results$tradeStats
print(stats)
When I run it, this error occurs for every sample:
evaluation # 1:
$param.combo
nFast nSlow
379 23 51
[1] "Processing param.combo 379"
nFast nSlow
379 23 51
result of evaluating expression:
<simpleError in strategy[[components.type]][[index]]: subscript out of bounds>
got results for task 1
numValues: 1, numResults: 1, stopped: FALSE
returning status FALSE
And then, for the last one, this is the error:
evaluation # 10:
$param.combo
nFast nSlow
585 40 60
[1] "Processing param.combo 585"
nFast nSlow
585 40 60
result of evaluating expression:
<simpleError in strategy[[components.type]][[index]]: subscript out of bounds>
got results for task 10
numValues: 10, numResults: 10, stopped: FALSE
first call to combine function
evaluating call object to combine results:
fun(result.1, result.2, result.3, result.4, result.5, result.6,
result.7, result.8, result.9, result.10)
error calling combine function:
<simpleError in fun(result.1, result.2, result.3, result.4, result.5, result.6, result.7, result.8, result.9, result.10): attempt to select less than one element>
numValues: 10, numResults: 10, stopped: TRUE
I really don't understand how I can fix it.
Can anyone tell me how to solve this?
Thank you so much.

You didn't give the code before the OPTIMIZATION part, so this is only a guess at the direction.
I understand you want to test 20:40 and 30:70, but in your OPTIMIZATION code you add two distributions that both point to component.label = 'macd.out'.
I did a similar test. Although both of my distributions used MA-type indicators, they generally should not point to the same indicator data (component.label = 'macd.out'). My code worked when one distribution pointed to component.label = 'fast' and the other to component.label = 'slow'; since they point to different data, the two parameters can be varied and compared independently.
You can try to debug in this direction; a rough sketch is below.
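For illustration only, here is a minimal sketch of that direction. It assumes two plain EMA indicators built with add.indicator; the indicator names, labels and the Cl(mktdata) input are my assumptions, not your actual strategy. The key point is that each distribution's component.label must match the label of an indicator that actually exists in the strategy, otherwise apply.paramset cannot find the component.
# Two separate EMA indicators, each with its own label (illustrative defaults)
add.indicator(strat.st, name = "EMA",
              arguments = list(x = quote(Cl(mktdata)), n = 20),
              label = "fast")
add.indicator(strat.st, name = "EMA",
              arguments = list(x = quote(Cl(mktdata)), n = 30),
              label = "slow")
# One distribution per indicator, so each distribution varies its own n
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'fast',
                 variable = list(n = .fastMA),
                 label = 'nFast')
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'slow',
                 variable = list(n = .slowMA),
                 label = 'nSlow')
# The existing nFast < nSlow constraint can stay exactly as you wrote it.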

Sudoku Solution Using Multiprocessing

I tried a sudoku solution using backtracking, but it was taking a long time, around 12 seconds, to give output. I tried to implement a multiprocessing technique, but it takes a lot more time than the backtracking; I have never run it to completion, it's too slow. Please suggest what I am missing. Even better if someone can also tell me how to run this on my GPU (using CUDA).
import concurrent.futures
import copy

# Starting board: 0 means an empty cell
A = [[0]*9 for _ in range(9)]
A[0][6] = 2
A[1][1] = 8
A[1][5] = 7
A[1][7] = 9
A[2][0] = 6
A[2][2] = 2
A[2][6] = 5
A[3][1] = 7
A[3][4] = 6
A[4][3] = 9
A[4][5] = 1
A[5][4] = 2
A[5][7] = 4
A[6][2] = 5
A[6][6] = 6
A[6][8] = 3
A[7][1] = 9
A[7][3] = 4
A[7][7] = 7
A[8][2] = 6

Boards = [A]
# L holds the coordinates of every empty cell
L = []
for i in range(9):
    for j in range(9):
        if A[i][j] == 0:
            L.append([i, j])

def RC_Check(A, Value, N):
    # True if Value does not already appear in the row or column of cell L[N]
    global L
    i, j = L[N]
    for x in range(9):
        if A[x][j] == Value:
            return False
        if A[i][x] == Value:
            return False
    return True

def Square_Check(A, Value, N):
    # True if Value does not already appear in the 3x3 box of cell L[N]
    global L
    i, j = L[N]
    X, Y = int(i/3)*3, int(j/3)*3
    for x in range(X, X+3):
        for y in range(Y, Y+3):
            if A[x][y] == Value:
                return False
    return True

def New_Boards(Board, N):
    # For the N-th empty cell, return one copy of Board per legal value
    global L
    i, j = L[N]
    Boards = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        RC_Process = executor.map(RC_Check, [Board]*10, list(range(1, 10)), [N]*10)
        Square_Process = executor.map(Square_Check, [Board]*10, list(range(1, 10)), [N]*10)
        for Value, (RC_Process, Square_Process) in enumerate(zip(RC_Process, Square_Process)):
            if RC_Process and Square_Process:
                Board[i][j] = Value + 1
                Boards.append(copy.deepcopy(Board))
    return Boards

def Solve_Boards(Boards, N):
    # Expand every partial board by filling the N-th empty cell, in parallel
    Results = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        Process = executor.map(New_Boards, Boards, [N]*len(Boards))
        for new_boards in Process:
            if len(new_boards):
                Results.extend(new_boards)
    return Results

if __name__ == "__main__":
    N = 0
    while N < len(L):
        Boards = Solve_Boards(Boards, N)
        N += 1
        print(len(Boards), N)
    print(Boards)
Multiprocessing is NOT a silver bullet. Backtracking is far more efficient than an exhaustive search run in parallel in most cases. I tried running this code on my PC, which has 32 cores / 64 threads, and it still takes a long time.
You also seem to want to use GPGPU to solve this problem, but it doesn't suit it: the state of the board depends on the previous state, so the calculation can't be split up efficiently.

Is non-identical not enough to be considered 'distinct' for kmeans centroids?

I have an issue with kmeans clustering when providing the centroids. I saw the same problem already asked (K-means: Initial centers are not distinct), but the solution in that post is not working in my case.
I selected the centroids using ClusterR::KMeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but I still got the "initial centers are not distinct" error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
n_iter = 50, seed_mode = "random_subset",
verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestion to resolve this? Thanks!
I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, therefore I've assumed you actually mean 'clusters',
centroids = ClusterR::KMeans_arma(data = dat,
                                  clusters = 561,
                                  n_iter = 50,
                                  seed_mode = "random_subset",
                                  verbose = TRUE,
                                  CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' object is a matrix of class "k-means clustering". If your intention is to obtain the clusters, then you can use:
clust = ClusterR::predict_KMeans(data = dat,
                                 CENTROIDS = centroids,
                                 threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass the 'centroids' to the base R 'kmeans' function, you have to set the 'class' of the 'centroids' object to NULL. That is because the base R 'kmeans' function internally uses the base R 'duplicated()' function (you can see this by typing print(kmeans) in the R console), which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and performs the check column-wise rather than row-wise. Therefore, the following should work in your case:
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
I've made a few adjustments to "ClusterR::predict_KMeans()"; in particular, I've added the "threads" parameter and a check for duplicates. Therefore, if you want to obtain the clusters using multiple cores, you have to install the package from GitHub using:
remotes::install_github('mlampros/ClusterR',
                        upgrade = 'always',
                        dependencies = TRUE,
                        repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package, which will be "1.2.2".
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')

kmeans_arma = function(data) {
    km_cl = ClusterR::KMeans_arma(data,
                                  clusters = 2,
                                  n_iter = 10,
                                  seed_mode = "random_subset",
                                  seed = 1)
    pred_cl = ClusterR::predict_KMeans(data = data,
                                       CENTROIDS = km_cl,
                                       threads = 1)
    return(pred_cl)
}

km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

for (algo in km_algos) {
    cat('base-kmeans-algo:', algo, '\n')
    km_base = kmeans(dietary_survey_IBS,
                     centers = 2,
                     iter.max = 10,
                     nstart = 1,        # can be set to 5 or 10 etc.
                     algorithm = algo)
    km_cl = as.vector(km_base$cluster)
    print(table(km_arma, km_cl))
    cat('--------------------------\n')
}

microbenchmark::microbenchmark(kmeans(dietary_survey_IBS,
                                      centers = 2,
                                      iter.max = 10,
                                      nstart = 1,   # can be set to 5 or 10 etc.
                                      algorithm = algo),
                               kmeans_arma(data = dietary_survey_IBS),
                               times = 100)
I don't see any significant difference in the output clusters between the 'base R kmeans' and the 'kmeans_arma' function for any of the available 'base R kmeans' algorithms (you can test this on your own data sets as well). I am not sure which algorithm the 'armadillo' library uses internally, and moreover the 'base R kmeans' includes the 'nstart' parameter (you can consult the documentation for more info). Regarding performance, you won't see any substantial difference for small to medium data sets; but because the armadillo library uses OpenMP internally, if your computer has more than one core, then for big data sets I think the 'ClusterR::KMeans_arma' function will return the 'centroids' faster.

LoadError using approximate bayesian criteria

I am getting an error that is confusing me.
using DifferentialEquations
using RecursiveArrayTools # for VectorOfArray
using DiffEqBayes
f2 = @ode_def_nohes LotkaVolterraTest begin
    dx = x*(1 - x - A*y)
    dy = rho*y*(1 - B*x - y)
end A B rho
u0 = [1.0;1.0]
tspan = (0.0,10.0)
p = [0.2,0.5,0.3]
prob = ODEProblem(f2,u0,tspan,p)
sol = solve(prob,Tsit5())
t = collect(linspace(0,10,200))
randomized = VectorOfArray([(sol(t[i]) + .01randn(2)) for i in 1:length(t)])
data = convert(Array,randomized)
priors = [Uniform(0.0, 2.0), Uniform(0.0, 2.0), Uniform(0.0, 2.0)]
bayesian_result_abc = abc_inference(prob, Tsit5(), t, data, priors;
                                    num_samples = 500)
Returns the error
ERROR: LoadError: DimensionMismatch("first array has length 400 which does not match the length of the second, 398.")
while loading..., in expression starting on line 20.
I have not been able to locate any array of size 400 or 398.
Thanks for your help.
Take a look at https://github.com/JuliaDiffEq/DiffEqBayes.jl/issues/52; that was due to an error in how t was passed. This has been fixed on master, so you can use that or wait a little while: we will have a new release soon with the 1.0 upgrades, which will include this fix as well.
Thanks!

Data handling on multiple Heart rate files

I have been collecting the heart rates of 12 calves, each of which received an anesthetic through four different routes of administration. I now have 48 txt files in this format:
Time HRbpm
0:00:01.7 97
0:00:02.3 121
0:00:02.8 15
... ...
HR was recorded for around 2 hours. The Time column depended on the monitor, resulting in inconsistent time intervals between measurements.
The txt files are named as follows: 6133_IM_27.00.txt
with 6133 being the ID, IM the route, and 27.00 the time (mm.ss) at which the treatment was injected.
My first goal is to have all the HR data so I can do an outlier analysis.
Then, I would like to include all this data in a single data frame that would look like this:
data.frame(ID=c(6133,6133,6133,6133,"...",6134,6134,"..."),
Route = c("IM","IM","IM","IM","...","SC","SC","..."),
time=c(0, 10, 20, 30,"...",0,10,"..."),
HR=c(160, 150, 145, 130,"...",162,158,"..."))
The time column would go from 0 to 120 in 10-minute increments.
Each HR value in this data frame would represent the mean of the HR values over the minute preceding a given time (e.g., for time = 30, HR would be the mean between 29 and 30 minutes for a given ID/Route combination).
I'm fairly new to R, so I've been having trouble knowing from what angle to start on this problem. Any help would be welcome.
Thanks,
Thomas
For anyone who stumbles on this post, here's what I've done; it seems to be working.
library(plyr)
library(reshape)
library(ggplot2)
setwd("/directory")
filelist = list.files(pattern = ".*.txt")
datalist = lapply(filelist, read.delim)
for (i in 1:length(datalist)) {
    datalist[[i]][3] = filelist[i]
}
df = do.call("rbind", datalist)
attach(df)
out_lowHR = quantile(HRbpm,0.25)-1.5*IQR(HRbpm)
out_highHR = quantile(HRbpm,0.75)+1.5*IQR(HRbpm) #outliers thresholds: 60 and 200
dfc = subset(df,HRbpm>=60 & HRbpm<=200)
(length(df$HRbpm)-length(dfc$HRbpm))/length(df$HRbpm)*100 #8.6% of values excluded
df = dfc
df$ID = substr(df$V3,4,7)
df$ROA = substr(df$V3,9,11)
df$ti = substr(df$V3,13,17)
df$Time = as.POSIXct(as.character(df$Time), format="%H:%M:%S")
df$ti = as.POSIXct(as.character(df$ti), format="%M.%S")
df$t = as.numeric(df$Time-df$ti)
m=60
meanHR = ddply(df, c("ROA","ID"), summarise,
               mean0 = mean(HRbpm[t > -60*m & t <= 0]),
               mean10 = mean(HRbpm[t > 9*m & t <= 10*m]),
               mean20 = mean(HRbpm[t > 19*m & t <= 20*m]),
               mean30 = mean(HRbpm[t > 29*m & t <= 30*m]),
               mean45 = mean(HRbpm[t > 44*m & t <= 45*m]),
               mean60 = mean(HRbpm[t > 59*m & t <= 60*m]),
               mean90 = mean(HRbpm[t > 89*m & t <= 90*m]),
               mean120 = mean(HRbpm[t > 119*m & t <= 120*m]))
meanHR = melt(meanHR)
meanHR$time = as.numeric(gsub("mean", "", meanHR$variable))
ggplot(meanHR, aes(x = time, y = value, col = ROA)) +
    geom_smooth() +
    theme_classic()

bnlearn error in structural.em

I got an error when trying to use structural.em in the "bnlearn" package.
This is the code:
cut.learn <- structural.em(cut.df, maximize = "hc",
                           maximize.args = "restart",
                           fit = "mle", fit.args = list(),
                           impute = "parents", impute.args = list(), return.all = FALSE,
                           max.iter = 5, debug = FALSE)
Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
warn.if.no.missing = TRUE, : at least one variable has no observed
values.
Has anyone had the same problem? If so, please tell me how to fix it.
Thank you.
I got structural.em working. I am currently working on a Python interface to bnlearn that I call pybnl. I also ran into the problem you describe above.
Here is a Jupyter notebook that shows how to use structural.em from Python on the marks data set.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an initial fit with an initial imputed data frame by hand, and then provide the arguments to structural.em like so (ldmarks is the latent-discrete marks data frame, where the LAT column contains only missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute = "bayes-lw", start = fitted,
                  maximize.args = list(whitelist = data.frame(from = "LAT", to = names(dmarks))),
                  return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.
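As a rough follow-up sketch (my assumption, based on the structural.em documentation rather than anything in the question): with return.all = TRUE the result should come back as a list whose pieces can be inspected separately, for example:
# Inspect the list returned by structural.em(..., return.all = TRUE)
str(r, max.level = 1)   # should show the dag, imputed and fitted elements
r$dag                   # the learned network structure
head(r$imputed)         # the data with the missing LAT values filled in
r$fitted$LAT            # the estimated distribution of the latent node
graphviz.plot(r$dag)    # optional: plot the structure (requires Rgraphviz)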