How to get better Kriging result graphs in OpenTURNS?

I performed spherical Kriging, but I can't seem to get good output graphs.
The coordinates (x and y) are around 51 latitude and around 6.5 longitude, and my observations range from -70 to +10.
Here is my code:
import openturns as ot
import pandas as pd
# your input / output data can be easily formatted as samples for openturns
df = pd.read_csv("kreuzkerpenutm.csv")
inputdata = ot.Sample(df[['x','y']].values)
outputdata = ot.Sample(df[['z']].values)
dimension = 2 # dimension of your input (x,y)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={'bbox_to_anchor':(1,1), 'loc':"upper left"})
view.getFigure().tight_layout()
Here is my output:
kriging metamodel graph
I don't know why my graph won't show my inputs as well as my Kriging results.
Thanks for ideas and help.

If the input data is not scaled in [-1, 1]^d, the Kriging metamodel may have trouble identifying the scale parameters using maximum likelihood optimization. To help with this, we may:
provide a better starting point for the scale parameters of the covariance model (this is trick "A" below),
set the bounds of the optimization algorithm so that the interval in which the parameters are searched corresponds to the data at hand (this is trick "B" below).
This is what the following script does, using simulated data instead of a CSV file. In the script, I create the data using a g function which is scaled so that it produces results in the [-10, 70] range, as in your problem. Please look carefully at the setScale() method, which sets the initial value of the covariance model: this is the starting point of the optimization algorithm. Then look at the setOptimizationBounds() method, which sets the bounds of the optimization algorithm.
import openturns as ot
dimension = 2 # dimension of your input (x,y)
distribution = ot.ComposedDistribution([ot.Uniform(-10.0, 50.0)] * dimension)
inputdata = distribution.getSample(100)
g = ot.SymbolicFunction(["x", "y"], ["30 + 3.0 * sin(x / 10.0) * (y / 10.0) ^ 2"])
outputdata = g(inputdata)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
covarianceModel.setScale(inputdata.getMax()) # Trick A
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
# Trick B, v2
x_range = inputdata.getMax() - inputdata.getMin()
scale_max_factor = 2.0 # Must be > 1, tune this to match your problem
scale_min_factor = 0.1 # Must be < 1, tune this to match your problem
maximum_scale_bounds = scale_max_factor * x_range
minimum_scale_bounds = scale_min_factor * x_range
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)
algo.setOptimizationBounds(scaleOptimizationBounds)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
metamodel.setInputDescription(["x", "y"])
metamodel.setOutputDescription(["z"])
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={"bbox_to_anchor": (1, 1), "loc": "upper left"})
view.getFigure().tight_layout()
The previous script produces the following figure.
There are other ways to implement trick B. Here is one provided by J.Pelamatti:
# Trick B, v3: per-dimension bounds computed from the pairwise distances of the inputs
import numpy as np
import scipy.spatial.distance
X_train = inputdata  # the training input sample from the script above
scale_max_factor = 2.0  # Must be > 1, tune this to match your problem
scale_min_factor = 0.1  # Must be < 1, tune this to match your problem
maximum_scale_bounds = []
minimum_scale_bounds = []
for d in range(X_train.getDimension()):
    dist = scipy.spatial.distance.pdist(np.array(X_train[:, d]))
    maximum_scale_bounds.append(scale_max_factor * np.max(dist))
    minimum_scale_bounds.append(scale_min_factor * np.min(dist))
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)
This topic is discussed in this particular thread in OT's forum.

Sorry for the late answer.
Which version of openturns are you using?
Probably you have an embedded transformation of the (input) data, which makes the data range between approximately (-3, 3) (standard scaling). The Kriging result should contain the transformation in such a case.
In more recent OpenTURNS implementations, this feature has been removed.
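If you want to check which behavior applies to your installation, and rescale the inputs yourself when needed, a minimal sketch (assuming the inputdata sample from the question) could look like this:
import openturns as ot
print(ot.__version__)  # which OpenTURNS release is installed
# Manually rescale each input column to [-1, 1] before building the Kriging model
xmin, xmax = inputdata.getMin(), inputdata.getMax()
scaled_inputs = ot.Sample([
    [2.0 * (point[i] - xmin[i]) / (xmax[i] - xmin[i]) - 1.0
     for i in range(inputdata.getDimension())]
    for point in inputdata
])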
Hope this can help.
Cheers

Related

Finding n-tuple that minimizes expensive cost function

Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried the surrogate optimization in Matlab but I'm not sure it is appropriate. I've also heard about simulated annealing but found no implementation applied to this instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
The requirement (and benefit) of Simulated Annealing (SA) is that the objective surface is somewhat smooth, so that we can gradually move close to a solution.
For a completely random, spiky surface you might as well do a random search.
If the surface is at all smooth, or even only partly so, it makes sense to try SA.
The idea is that (sometimes) changing only 1 of the 3 values has little effect on our black-box function.
Here is a basic example of doing this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
LENGTH = 12
I define a black-box using the Rastrigin function.
def rastrigin_function_n(x):
    """
    N-dimensional Rastrigin
    https://en.wikipedia.org/wiki/Rastrigin_function
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum(x**2 - A*np.cos(2*np.pi * x))

def black_box(x):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify state X. Instead of taking/modifying values directly, we keep track of indices. This simplifies creating new proposals as an index is always an integer we can simply add/subtract 1 modulo LENGTH.
def random_start():
    """
    returns 3 random indices
    """
    return np.random.randint(0, LENGTH, size=3)

def random_small_step(x):
    """
    change only 1 index
    """
    d = np.array([1, 0, 0])
    if np.random.random() < .5:
        d = np.array([-1, 0, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH

def random_big_step(x):
    """
    change 2 indices
    """
    d = np.array([1, -1, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH

def obj(x):
    """
    We have a triplet of indices:
    1. Calculate the corresponding values in W = [w1,w2,w3]
    2. Push the values into our black-box function
    """
    indices = x
    values = W[np.array([0, 1, 2]), indices]
    return black_box(values)
And throw an SA scheme at it:
import frigidum

local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        repeats=10**3,
                        copy_state=frigidum.annealing.naked)
I am not sure what the minimum for this function should be, but it found an objective of 47.9095 with indices np.array([9, 2, 2]).
Edit:
To change the cooling schedule in frigidum, use alpha=.9. My experience is that all the work of experimenting with which cooling scheme works best doesn't outweigh simply letting it run a little longer. The multiplication you proposed (sometimes called geometric cooling) is the standard one, and it is also implemented in frigidum. So to implement Tn+1 = 0.9*Tn you need alpha=.9. Be aware that this cooling step is done after N repeats, so if repeats=100, it will first do 100 proposals before lowering the temperature by the factor alpha.
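For reference, here is the same frigidum.sa call as above with the geometric cooling factor written out explicitly (alpha and repeats as described in this edit; treat it as a sketch to adapt):
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=10**-6,
                        alpha=0.9,      # T_{n+1} = 0.9 * T_n after each block of repeats
                        repeats=10**2,  # 100 proposals per temperature level
                        copy_state=frigidum.annealing.naked)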
Simple variations on the current state often work best. Since it is best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try 2 or 3 variations. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us, so if the time/cost of bringing another competitor to the table is small, yes!
Try every possible combination of the three values and see which has the lowest cost.
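A minimal brute-force sketch of that idea, reusing the w1, w2, w3 arrays and the black_box function defined above (12^3 = 1728 evaluations, which is feasible as long as a single call is not too expensive):
import itertools
import numpy as np

best_triplet = min(itertools.product(w1, w2, w3),
                   key=lambda t: black_box(np.array(t)))
print(best_triplet, black_box(np.array(best_triplet)))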

How to perform raster calculation (e.g. aspect) on subset of raster based on point intersection in R

I'm working with some raster data in R using the raster package. I want to calculate and extract some geographic information (e.g., slope, aspect) from the raster, but only at specific points (I also have some data as a SpatialPointsDataFrame at which I want to calculate slope/aspect/etc.). I'm doing this for several high-resolution rasters, and it seems like a poor use of resources to calculate this for every raster cell when I only need maybe 5-10% of them.
I thought maybe the raster::stackApply function might work, but that seems to perform calculations on subsets of a rasterBrick rather than calculations on subsets of a single raster based on point locations (please correct me if I'm wrong). I also thought I could do a for loop, where I extract the surrounding cells nearest each point of interest, and iteratively calculate slope/aspect that way. That seems clunky, and I was hoping for a more elegant or built-in solution, but it should work.
These are my thoughts so far on the for loop, but I'm not sure how best to even do this.
# Attach packages
library(rgdal)
library(raster)
# Generate example raster data
r = raster()
set.seed(0)
values(r) = runif(ncell(r), min = 0, max = 1000)
# Generate example point data
df.sp = SpatialPoints(
coords = cbind(runif(25, min = -100, max = 100),
runif(25, min = -50, max = 50)),
proj4string = crs(r))
# Iterate on each row of SpatialPoints
for (i in 1:nrow(df.sp)) {
  # Find cell index of current SpatialPoint
  cell.idx = raster::extract(r, df.sp[i,], cellnumbers = TRUE)[1]
  # Find indices of cells surrounding point of interest
  neighbors.idx = raster::adjacent(r, cell.idx, directions = 16)
  # Get DEM values for cell and surrounding cells
  vals.local = r[c(cell.idx, neighbors.idx[,2])]
  # Somehow convert this back to an appropriate georeferenced matrix
  #r.local = ...
  # Perform geometric calculations on local raster
  #r.stack = terrain(r.local, opt = c('slope', 'aspect'))
  # Remaining data extraction, etc. (I can take it from here...)
}
In summary, I need a method to calculate slope and aspect from a DEM raster only at specific points as given by a SpatialPoints object. If you know of a pre-built or more elegant solution, great! If not, some help finishing the for loop (how to best extract a neighborhood of surrounding cells and run calculations on that) would be most appreciated as well.
Interesting question. Here is a possible approach.
library(raster)
r <- raster()
set.seed(0)
values(r) <- runif(ncell(r), min = 0, max = 1000)
coords <- cbind(runif(25, min = -100, max = 100),
runif(25, min = -50, max = 50))
x <- rasterize(coords, r)
f <- focal(x, w=matrix(1, nc=3, nr=3), na.rm=TRUE)
rr <- mask(r, f)
slope <- terrain(rr, "slope")
extract(slope, coords)
# [1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
# [9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
Probably not much efficiency gain, as it still processes all the NA values.
So maybe like this, more along your line of thinking:
cells <- cellFromXY(r, coords)
ngbs <- raster::adjacent(r, cells, pairs=TRUE)
slope <- rep(NA, length(cells))
for (i in 1:length(cells)) {
  ci <- ngbs[ngbs[,1] == cells[i], 2]
  e <- extentFromCells(r, ci)
  x <- crop(r, e)
  slope[i] <- terrain(x, "slope")[5]
}
slope
#[1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
#[9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
But I find that brute force
slope <- terrain(r, "slope")
extract(slope, coords)
is fastest: 9 times faster than my first alternative and 4 times faster than the second.

How to implement a method to generate Poincaré sections for a non-linear system of ODEs?

I have been trying to work out how to calculate Poincaré sections for a system of non-linear ODEs, using a paper on the exact system as a reference, and have been wrestling with numpy to try to make it run better. This is intended to run within a bounded domain.
Currently, I have the following code:
import numpy as np
from scipy.integrate import odeint
X = 0
Y = 1
Z = 2
def generate_poincare_map(function, initial, plane, iterations, delta):
    intersections = []
    p_i = odeint(function, initial.flatten(), [0, delta])[-1]
    for i in range(1, iterations):
        p_f = odeint(function, p_i, [i * delta, (i+1) * delta])[-1]
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        p_i = p_f
    return np.stack(intersections)
This is pretty wasteful due to the integration solely between successive time steps, and seems to produce incorrect results. The original reference includes sections along the lines of
whereas mine tend to result in something along the lines of
Do you have any advice on how to proceed to make this more correct, and perhaps a little faster?
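A leaner variant of the loop above (a sketch only, reusing the question's function, initial, plane and Z, with a hypothetical fixed time grid) integrates once over a long time span and then looks for sign changes of z - plane; the answer below refines this idea:
t = np.arange(0.0, 1000.0, 0.01)            # hypothetical time grid
traj = odeint(function, initial.flatten(), t)
z = traj[:, Z] - plane
upward = np.where((z[:-1] < 0) & (z[1:] >= 0))[0]  # indices just before upward crossings
sections = traj[upward, :2]                 # (x, y) at the step before each crossing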
To get a Poincaré map of the ABC flow
def ABC_ode(u, t):
    A, B, C = 0.75, 1, 1  # matlab parameters
    x, y, z = u
    return np.array([
        A*np.sin(z) + C*np.cos(y),
        B*np.sin(x) + A*np.cos(z),
        C*np.sin(y) + B*np.cos(x)
    ])

def mysolver(u0, tspan):
    return odeint(ABC_ode, u0, tspan, atol=1e-10, rtol=1e-11)
you first have to understand that the dynamical system is really about the points (cos(x), sin(x)) etc. on the unit circle, so values that differ by multiples of 2*pi represent the same point. The computation of the section has to reflect this, either by working on the Cartesian product of the 3 circles or by reducing the angles to a fundamental period. Let's stay with the second variant, and choose [-pi, pi] as the fundamental period so that the zero location is well in the center. Keep in mind that jumps larger than pi come from the angle reduction, not from a real crossing of that interval.
pi = np.pi  # used below for the angle reduction

def find_crosssections(x0, y0):
    u0 = [x0, y0, 0]
    px = []
    py = []
    u = mysolver(u0, np.arange(0, 4000, 0.5)); u0 = u[-1]
    u = np.mod(u+pi, 2*pi) - pi
    x, y, z = u.T
    for k in range(len(z)-1):
        if z[k] <= 0 and z[k+1] >= 0 and z[k+1]-z[k] < pi:
            # find a more exact intersection location by linear interpolation
            s = -z[k]/(z[k+1]-z[k])  # 0 = z[k] + s*(z[k+1]-z[k])
            rx, ry = (1-s)*x[k]+s*x[k+1], (1-s)*y[k]+s*y[k+1]
            px.append(rx)
            py.append(ry)
    return px, py
To get a full picture of the Poincare cross-section and avoid duplicate work, use a grid of squares and mark if one of the intersections already fell in it. Only start new iterations from the centers of free squares.
import matplotlib.pyplot as plt  # needed for the plot at the end of the loop

N = 20
grid = np.zeros([N, N], dtype=int)
for i in range(N):
    for j in range(N):
        if grid[i, j] > 0: continue
        x0, y0 = (2*i+1)*pi/N-pi, (2*j+1)*pi/N-pi
        px, py = find_crosssections(x0, y0)
        for rx, ry in zip(px, py):
            m, n = int((rx+pi)*N/(2*pi)), int((ry+pi)*N/(2*pi))
            grid[m, n] = 1
        plt.plot(px, py, '.', ms=2)
You can now play with the density of the grid and the length of the integration interval to get the plot a little more filled out, but all the characteristic features are already there. I'd recommend re-programming this in a compiled language, though, as the computation takes some time.

PyMC3 PK modelling: model can't recover the parameters used to create the data set

I am new to PK modelling and pymc3, but I have been playing around with pymc3 and trying to implement a simple PK model as part of my own learning. Specifically, a model that captures this relationship:
C(t) = (Dose / V) * exp(-(CL / V) * t)
where C(t) (Cpred) is the concentration at time t, Dose is the dose given, V is the volume of distribution, and CL is clearance.
I have generated some test data (30 subjects) with values of CL = 2 and V = 10, for 3 doses (100, 200, 300), with data at time points 0, 1, 2, 4, 8, 12. I also included some random error on CL (normal distribution, mean 0, omega = 0.6) and residual unexplained error on the observations, DV = Cpred + sigma, where sigma is normally distributed with SD = 0.33. In addition, I included a transformation of CL and V with respect to body weight (uniform distribution 50-90): CLi = CL * WT/70; Vi = V * WT/70.
# Create Data for modelling
np.random.seed(0)
# Subject ID's
data = pd.DataFrame(np.arange(1,31), columns=['subject'])
# Dose
data['dose'] = np.array([100,100,100,100,100,100,100,100,100,100,
                         200,200,200,200,200,200,200,200,200,200,
                         300,300,300,300,300,300,300,300,300,300])
# Random Body Weight
data['WT'] = np.random.randint(50,100, size =30)
# Fixed Clearance and Volume for the population
data['CLpop'] =2
data['Vpop']=10
# Error rate for individual clearance rate
OMEGA = 0.66
# Individual clearance rate as a function of weight and omega
data['CLi'] = data['CLpop']*(data['WT']/70)+ np.random.normal(0, OMEGA )
# Individual Volume as a function of weight
data['Vi'] = data['Vpop']*(data['WT']/70)
# Expand dataframe to account for time points
data = pd.concat([data]*6,ignore_index=True )
data = data.sort_values('subject')  # sort rows by subject
# Add in time points
data['time'] = np.tile(np.array([0,1,2,4,8,12]), 30)
# Create concentration values using equation
data['Cpred'] = data['dose']/data['Vi'] *np.exp(-1*data['CLi']/data['Vi']*data['time'])
# Error rate for DV
SIGMA = 0.33
# Create Dependenet Variable from Cpred + error
data['DV']= data['Cpred'] + np.random.normal(0, SIGMA )
# Create new df with only data for modelling...
df = data[['subject','dose','WT', 'time', 'DV']]
Create arrays ready for model...
# Prepare data from df to model specific arrays
time = np.array(df['time'])
dose = np.array(df['dose'])
DV = np.array(df['DV'])
WT = np.array(df['WT'])
n_patients = len(data['subject'].unique())
subject = data['subject'].values-1
I have built a simple model in pymc3 ....
pk_model = Model()
with pk_model:
    # Hyperparameter priors
    sigma = Lognormal('sigma', mu=0, tau=0.01)
    V = Lognormal('V', mu=2, tau=0.01)
    CL = Lognormal('CL', mu=1, tau=0.01)
    # Transformation wrt weight
    CLi = CL * WT / 70
    Vi = V * WT / 70
    # Expected value of outcome
    pred = dose / Vi * np.exp(-1 * (CLi / Vi) * time)
    # Likelihood (sampling distribution) of observations
    conc = Normal('conc', mu=pred, tau=sigma, observed=DV)
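The sampling step is not shown in the question; a typical call (a sketch, assuming the same bare pymc3 names are imported as in the model above) would be:
with pk_model:
    trace = sample(2000, tune=1000)
# Inspect the posterior and compare against the CL = 2, V = 10 used to simulate the data
summary(trace)
traceplot(trace)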
My expectation was that I should be able to recover from the data the constants and error rates that were originally used to generate it. I have not been able to do this, although I can get close. In this example...
data['CLi'].mean()
> 2.322473543135788
data['Vi'].mean()
> 10.147619047619049
And the trace shows....
So my questions are..
Is my code structured correctly and are there any glaring mistakes that I have overlooked that might account for this difference?
Can I structure the pymc3 model to better reflect the relationship from which I have generated the data?
What would be your suggestions to improve the model?
Thanks in advance!
I'm going to answer my own question!
I implemented a hierarchical model following the example found here...
GLM -hierarchical
and it works a treat. I also noticed errors in the way I was applying the errors in the dataframe; I should use
data['CLer'] = np.random.normal(scale=OMEGA, size=30)
to ensure each subject has a different value for the error.
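For completeness, a sketch of how the per-row errors could be applied in the data-generation step (names follow the question's dataframe; drawing the residual error per observation is my assumption):
# One clearance error per subject...
data['CLer'] = np.random.normal(scale=OMEGA, size=30)
data['CLi'] = data['CLpop'] * (data['WT'] / 70) + data['CLer']
# ...and, after expanding to all time points, one residual error per observation
data['DV'] = data['Cpred'] + np.random.normal(scale=SIGMA, size=len(data))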

Dummy Variables in Julia

In R there is nice functionality for running a regression with dummy variables for each level of a categorical variable. e.g. Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
Is there an equivalent way to do this in Julia?
x = randn(1000)
group = repmat(1:25 , 40)
groupMeans = randn(25)
y = 3*x + groupMeans[group]
data = DataFrame(x=x, y=y, g=group)
for i in levels(group)
    data[parse("I$i")] = data[:g] .== i
end
lm(y~x+I1+I2+I3+I4+I5+I6+I7+I8+I9+I10+
I11+I12+I13+I14+I15+I16+I17+I18+I19+I20+
I21+I22+I23+I24, data)
If you are using the DataFrames package, after you pool the data, the package will take care of the rest:
Pooling columns is important for working with the GLM package. When fitting regression models, PooledDataArray columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the PooledDataArray.
You can see the rest of the documentation on pooled data here.