R EVMIX convert pdf to uniform marginals - uniform

I'm trying to convert a distribution into a pseudo-uniform distribution. Using the spd R package, it is easy and it works as expected.
library(spd)
x <- c(rnorm(100,-1,0.7),rnorm(100,3,1))
fit<-spdfit(x,upper=0.9,lower=0.1,tailfit="GPD", kernelfit="epanech")
uniformX = pspd(x,fit)
I want to generalize extreme value modeling to include threshold uncertainity. So I used the evmix package.
library(evmix)
x <- c(rnorm(100,-1,0.7),rnorm(100,3,1))
fit = fgkg(x, phiul = FALSE, phiur = FALSE, std.err = FALSE)
pgkg(x,fit$lambda, fit$ul, fit$sigmaul, fit$xil, fit$phiul, fit$ur,
fit$sigmaur, fit$xir, fit$phiur)
Im messing up somewhere.

Please check out the help for pgkg function:
help(pgkg)
which gives the syntax:
pgkg(q, kerncentres, lambda = NULL, ul = as.vector(quantile(kerncentres,
0.1)), sigmaul = sqrt(6 * var(kerncentres))/pi, xil = 0, phiul = TRUE,
ur = as.vector(quantile(kerncentres, 0.9)), sigmaur = sqrt(6 *
var(kerncentres))/pi, xir = 0, phiur = TRUE, bw = NULL,
kernel = "gaussian", lower.tail = TRUE)
You have missed the kernel centres (the data), which is always needed for kernel density estimators. Here is the corrected code:
library(evmix)
x <- c(rnorm(100,-1,0.7),rnorm(100,3,1))
fit = fgkg(x, phiul = FALSE, phiur = FALSE, std.err = FALSE)
prob = pgkg(x, x, fit$lambda, fit$ul, fit$sigmaul, fit$xil, fit$phiul,
fit$ur, fit$sigmaur, fit$xir, fit$phiur)
hist(prob) % now uniform as expected

Related

How to optimize for a variable that goes into the argument of a function in pyomo?

I am trying to code a first order plus dead time (FOPDT) model and use it
for PID tuning. The inspiration for the work is the scipy code from: https://apmonitor.com/pdc/index.php/Main/FirstOrderOptimization
When I use model.Thetam() in the ODE constraint, it does not optimize Thetam,
keeps it at the initial value. When I use only model.Theta then the code throws an error -
ValueError: object arrays are not supported if I remove it from uf argument i.e.model.Km * (uf(tt - model.Thetam)-model.U0))
and if I remove it from the if statement (if tt > model.Thetam), then the error is - ERROR:pyomo.core:Rule failed when generating expression for Constraint ode with index 0.0: PyomoException: Cannot convert non-constant Pyomo expression (Thetam < 0.0) to bool. This error is usually caused by using a Var, unit, or mutable Param in a Boolean context such as an "if" statement, or when checking container membership or equality.
Code:
`url = 'http://apmonitor.com/pdc/uploads/Main/data_fopdt.txt'
data = pd.read_csv(url)
data = data.iloc[1:]
t = data['time'].values - data['time'].values[0]
u = data['u'].values
yp = data['y'].values
u0 = u[0]
yp0 = yp[0]
yf = interp1d(t, yp)
# specify number of steps
ns = len(t)
delta_t = t[1]-t[0]
# create linear interpolation of the u data versus time
uf = interp1d(t,u,fill_value="extrapolate")
model = ConcreteModel()
model.T = ContinuousSet(initialize = t)
model.Y = Var(model.T)
model.dYdT = DerivativeVar(model.Y, wrt = (model.T))
model.Y[0].fix(yp0)
model.Yp0 = Param(initialize = yp0)
model.U0 = Param(initialize = u0)
model.Km = Var(initialize = 2, bounds = (0.1, 10))
model.Taum = Var(initialize = 3, bounds = (0.1, 10))
model.Thetam = Var(initialize = 0, bounds = (0, 10))
model.ode = Constraint(model.T,
rule = lambda model, tt: model.dYdT[tt] == (-(model.Y[tt]-model.Yp0) + model.Km * (uf(tt - model.Thetam())-model.U0))/model.Taum if tt > model.Thetam()
else model.dYdT[tt] == -(model.Y[tt]-model.Yp0)/model.Taum)
def obj_rule(m):
return sum((m.Y[i] - yf(i))**2 for i in m.T)
model.obj = Objective(rule = obj_rule)
discretizer = TransformationFactory('dae.finite_difference')
discretizer.apply_to(model, nfe = 500, wrt = model.T, scheme = 'BACKWARD')
opt=SolverFactory('ipopt', executable='/content/ipopt')
opt.solve(model)#, tee = True)
model.pprint()
model2 = ConcreteModel()
model2.T = ContinuousSet(initialize = t)
model2.Y = Var(model2.T)
model2.dYdT = DerivativeVar(model2.Y, wrt = (model2.T))
model2.Y[0].fix(yp0)
model2.Yp0 = Param(initialize = yp0)
model2.U0 = Param(initialize = u0)
model2.Km = Param(initialize = 3.0145871)#3.2648)
model2.Taum = Param(initialize = 1.85862177) # 5.2328)
model2.Thetam = Param(initialize = 0)#2.936839032) #0.1)
model2.ode = Constraint(model2.T,
rule = lambda model, tt: model.dYdT[tt] == (-(model.Y[tt]-model.Yp0) + model.Km * (uf(tt - model.Thetam())-model.U0))/model.Taum)
discretizer2 = TransformationFactory('dae.finite_difference')
discretizer2.apply_to(model2, nfe = 500, wrt = model2.T, scheme = 'BACKWARD')
opt2=SolverFactory('ipopt', executable='/content/ipopt')
opt2.solve(model2)#, tee = True)
# model.pprint()
t = [i for i in model.T]
ypred = [model.Y[i]() for i in model.T]
ytrue = [yf(i) for i in model.T]
yoptim = [model2.Y[i]() for i in model2.T]
plt.plot(t, ypred, 'r-')
plt.plot(t, ytrue)
plt.plot(t, yoptim)
plt.legend(['pred', 'true', 'optim'])
`

Non-linear regression line and its Computation failed in `stat_smooth()`: argument "p" is missing, with no default error

I have been trying to fit a non-linear regression line into my standard curve. However, I am getting the following error:
The main problem is that with the linear regression line I could use a simple command like:
stat_cor(label.y = c(825),
label.x = c(0.88),
aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")))+
stat_regline_equation(label.x=0.88, label.y=750)+
And the equation for the linear regression line with an a, and b values appear. In this case after using the following:
stat_smooth(method= "nlm",
formula = y~a*x/(b+x),
method.args = list( start = c(a = 3.8, b = 1457.2)),
se=FALSE)+
I am getting the above error.
You may ask where I got the a, and b values? I got them from:
nls(y~a*x/(b+x))
That has shown:
I do not know where I am making mistakes.
This is the entire code for my graph
library(tidyverse)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(ggpubr)
ggplot(data = STD, aes(x = Absorbance, y = STD)) +
labs(title = "Quantifying PGD2 in cell culture lysates and its enzymatic reactions ",
caption = "PGD2 ELISA")+
geom_point(colour = "#69b3a2")+
stat_smooth(method= "nlm",
formula = y~a*x/(b+x),
method.args = list( start = c(a = 3.8, b = 1457.2)),
se=FALSE)+
xlab(expression(paste("%B/"~B[0])))+
ylab(expression(paste("Prostaglandin"~ D[2], ~~ " MOX Concentration (pg/ml) ")))+
theme(plot.background = element_rect(fill = "transparent"),
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))+
theme(legend.spacing.y = unit(0.01, "cm"))+
theme(legend.position = c(0.77, .91),
legend.background = element_rect(colour = NA, fill = NA))+
theme(plot.title = element_text(size = 12, face = "bold.italic"),
plot.caption = element_text(hjust = 0))
That gives the following outcome
And this is DataUsed
So, I think I have found a solution to my problem. I installed the install.packages(drc) in which the four parametric function is included. I set up my data model <- drm(STD ~ Absorbance, fct = LL.4(), data = STD), then plot(model) , and I got
I know it requires some alternations to make it look more professional, but it is just a cosmetic thing that I should be fine to do. Thank you #stefan for your time.

Is non-identical not enough to be considered 'distinct' for kmeans centroids?

I have an issue with kmeans clustering providing centroids. I saw the same problem already asked (
K-means: Initial centers are not distinct), but the solution in that post is not working in my case.
I selected the centroids using ClusterR::Kmeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but still got the initial centers are not distinct error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
n_iter = 50, seed_mode = "random_subset",
verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestion to resolve this? Thanks!
I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, therefore I've assumed you actually mean 'clusters',
centroids = ClusterR::KMeans_arma(data = dat,
clusters = 561,
n_iter = 50,
seed_mode = "random_subset",
verbose = TRUE,
CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' is a matrix of class "k-means clustering". If your intention is to come to the clusters then you can use,
clust = ClusterR::predict_KMeans(data = dat,
CENTROIDS = centroids,
threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass the 'centroids' to the base R 'kmeans' function you have to set the 'class' of the 'centroids' object to NULL and that because the base R 'kmeans' function uses internally the base R 'duplicated()' function (you can view this by using print(kmeans) in the R console) which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and performs the checking column-wise rather than row-wise. Therefore, the following should work for your case,
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
I've made a few adjustments to the "ClusterR::predict_KMeans()" and particularly I've added the "threads" parameter and a check for duplicates, therefore if you want to come to the clusters using multiple cores you have to install the package from Github using,
remotes::install_github('mlampros/ClusterR',
upgrade = 'always',
dependencies = TRUE,
repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package which will be "1.2.2"
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')
kmeans_arma = function(data) {
km_cl = ClusterR::KMeans_arma(data,
clusters = 2,
n_iter = 10,
seed_mode = "random_subset",
seed = 1)
pred_cl = ClusterR::predict_KMeans(data = data,
CENTROIDS = km_cl,
threads = 1)
return(pred_cl)
}
km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
for (algo in km_algos) {
cat('base-kmeans-algo:', algo, '\n')
km_base = kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo)
km_cl = as.vector(km_base$cluster)
print(table(km_arma, km_cl))
cat('--------------------------\n')
}
microbenchmark::microbenchmark(kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo), kmeans_arma(data = dietary_survey_IBS), times = 100)
I don't see any significant difference in the output clusters between the 'base R kmeans' and the 'kmeans_arma' function for all available 'base R kmeans' algorithms (you can test it also for your own data sets). I am not sure which algorithm the 'armadillo' library uses internally and moreover the 'base R kmeans' includes the 'nstart' parameter (you can consult the documentation for more info). Regarding performance you won't see any substantial differences for small to medium data sets but due to the fact that the armadillo library uses OpenMP internally in case that your computer has more than 1 cores then for big data sets I think the 'ClusterR::KMeans_arma' function will return the 'centroids' faster.

bnlearn error in structural.em

I got an error when try to use structural.em in "bnlearn" package
This is the code:
cut.learn<- structural.em(cut.df, maximize = "hc",
+ maximize.args = "restart",
+ fit="mle", fit.args = list(),
+ impute = "parents", impute.args = list(), return.all = FALSE,
+ max.iter = 5, debug = FALSE)
Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
warn.if.no.missing = TRUE, : at least one variable has no observed
values.
Did anyone have the same problems, please tell me how to fix it.
Thank you.
I got structural.em working. I am currently working on a python interface to bnlearn that I call pybnl. I also ran into the problem you desecribe above.
Here is a jupyter notebook that shows how to use StructuralEM from python marks.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an inital fit with an inital imputed dataframe by hand and then provide the arguments to structural.em like so (ldmarks is the latent-discrete-marks dataframe where the LAT column only contains missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks2), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute="bayes-lw", start=fitted, maximize.args=list(whitelist = data.frame(from = "LAT", to = names(dmarks))), return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.

Error when generating pdf using script in R

I'm using R to loop through the columns of a data frame and make a graph of the resulting analysis. I don't get any errors when the script runs, but it generates a pdf that cannot be opened.
If I run the content of the script, it works fine. I wondered if there is a problem with how quickly it is looping through, so I tried to force it to pause. This did not seem to make a difference. I'm interested in any suggestions that people have, and I'm also quite new to R so suggestions as to how I can improve the approach are welcome too. Thanks.
for (i in 2:22) {
# Organise data
pop_den_z = subset(pop_den, pop_den[i] != "0") # Remove zeros
y = pop_den_z[,i] # Get y col
x = pop_den_z[,1] # get x col
y = log(y) # Log transform
# Regression
lm.0 = lm(formula = y ~ x) # make linear model
inter = summary(lm.0)$coefficients[1,1] # Get intercept
slop = summary(lm.0)$coefficients[2,1] # Get slope
# Write to File
a = c(i, inter, slop)
write(a, file = "C:/pop_den_coef.txt", ncolumns = 3, append = TRUE, sep = ",")
## Setup pdf
string = paste("C:/LEED/results/Images/R_graphs/Pop_den", paste(i-2), "City.pdf")
pdf(string, height = 6, width = 9)
p <- qplot(
x, y,
xlab = "Radius [km]",
ylab = "Population Density [log(people/km)]",
xlim = x_range,
main = "Analysis of Cities"
)
# geom_abline(intercept,slope)
p + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1)
Sys.sleep(5)
### close the PDF file
dev.off()
}
The line should be
print(p + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1))
In pdf devices, ggplot (and lattice) only writes to file when explicitly printed.