Dummy Variables in Julia - dataframe

In R there is nice functionality for running a regression with dummy variables for each level of a categorical variable, e.g. automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level.
Is there an equivalent way to do this in Julia?
x = randn(1000)
group = repmat(1:25 , 40)
groupMeans = randn(25)
y = 3*x + groupMeans[group]
data = DataFrame(x=x, y=y, g=group)
for i in levels(group)
    data[parse("I$i")] = data[:g] .== i
end
lm(y~x+I1+I2+I3+I4+I5+I6+I7+I8+I9+I10+
I11+I12+I13+I14+I15+I16+I17+I18+I19+I20+
I21+I22+I23+I24, data)

If you are using the DataFrames package, after you pool the data the package will take care of the rest:
Pooling columns is important for working with the GLM package. When fitting regression models, PooledDataArray columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the PooledDataArray.
You can see the rest of the documentation on pooled data here.

how to get better Kriging result graphs in openturns?

I performed spherical Kriging, but I can't seem to get good output graphs.
The coordinates (x and y) are around 51 latitude and around 6.5 longitude, and my observations range from -70 to +10.
Here is my code:
import openturns as ot
import pandas as pd
# your input / output data can be easily formatted as samples for openturns
df = pd.read_csv("kreuzkerpenutm.csv")
inputdata = ot.Sample(df[['x','y']].values)
outputdata = ot.Sample(df[['z']].values)
dimension = 2 # dimension of your input (x,y)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={'bbox_to_anchor':(1,1), 'loc':"upper left"})
view.getFigure().tight_layout()
Here is my output:
kriging metamodel graph
I don't know why my graph won't show my inputs as well as my kriging results.
Thanks for ideas and help.
If the input data is not scaled in [-1,1]^d, the kriging metamodel may have trouble identifying the scale parameters using maximum likelihood optimization. In order to help with this, we may:
provide a better starting point for the scale parameters of the covariance model (this is trick "A" below),
set the bounds of the optimization algorithm so that the interval in which the parameters are searched corresponds to the data at hand (this is trick "B" below).
This is what the following script does, using simulated data instead of a csv data file. In the script, I create the data using a g function which is scaled so that it produces results in the [-10, 70] range, as in your problem. Please look carefully at the setScale() method, which sets the initial value of the covariance model: this is the starting point of the optimization algorithm. Then look at the setOptimizationBounds() method, which sets the bounds of the optimization algorithm.
import openturns as ot
dimension = 2 # dimension of your input (x,y)
distribution = ot.ComposedDistribution([ot.Uniform(-10.0, 50.0)] * dimension)
inputdata = distribution.getSample(100)
g = ot.SymbolicFunction(["x", "y"], ["30 + 3.0 * sin(x / 10.0) * (y / 10.0) ^ 2"])
outputdata = g(inputdata)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
covarianceModel.setScale(inputdata.getMax()) # Trick A
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
# Trick B, v2
x_range = inputdata.getMax() - inputdata.getMin()
scale_max_factor = 2.0 # Must be > 1, tune this to match your problem
scale_min_factor = 0.1 # Must be < 1, tune this to match your problem
maximum_scale_bounds = scale_max_factor * x_range
minimum_scale_bounds = scale_min_factor * x_range
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)
algo.setOptimizationBounds(scaleOptimizationBounds)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
metamodel.setInputDescription(["x", "y"])
metamodel.setOutputDescription(["z"])
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={"bbox_to_anchor": (1, 1), "loc": "upper left"})
view.getFigure().tight_layout()
The previous script produces the following figure.
There are other ways to implement trick B. Here is one provided by J. Pelamatti:
# Trick B, v3: per-dimension bounds computed from the pairwise distances of the training points
# (X_train is the input Sample; numpy and scipy imports shown for completeness)
import numpy as np
import scipy.spatial
scale_max_factor = 2.0  # Must be > 1, tune this to match your problem
scale_min_factor = 0.1  # Must be < 1, tune this to match your problem
minimum_scale_bounds = []
maximum_scale_bounds = []
for d in range(X_train.getDimension()):
    dist = scipy.spatial.distance.pdist(np.array(X_train[:, d]))
    maximum_scale_bounds.append(scale_max_factor * np.max(dist))
    minimum_scale_bounds.append(scale_min_factor * np.min(dist))
# These bounds can then be passed to ot.Interval and setOptimizationBounds as above.
This topic is discussed in this particular thread in OT's forum.
Sorry for the late answer.
Which version of openturns are you using?
Probably you have an embedded transformation of the (input) data, which makes the data range between approximately (-3, 3) (standard scaling). The kriging result should contain the transformation in such a case.
With more recent openturns implementations, this feature has been removed.
Hope this can help.
Cheers

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have plotted a histogram using some generated data with very many points, stored in the variable vals. I have limited the histogram so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel("$m_{\gamma\gamma} (GeV)$")
plt.ylabel("Number of entries")
plt.show()
This gives the following plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means or how I'd go about doing it (PDF means probability density function). I assume an integration will have to take place, so to get the area under the data (vals) I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial=0.0)
or discard the first value of background_data:
plt.plot(background_data[1:], data_area)
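To then get A as the question describes, one option is to match areas: the area under the histogram is roughly sum(bin_heights) * bin_width, and the integral of exp(-x/λ) over the fit window has a closed form. A minimal sketch, assuming bin_heights, bin_edges, lamb and plt from the code above (matching over the full 104-155 window is an arbitrary choice; the background region below 120 would follow the same pattern):
import numpy as np

bin_width = bin_edges[1] - bin_edges[0]
hist_area = bin_heights.sum() * bin_width             # area under the histogram
x1, x2 = bin_edges[0], bin_edges[-1]                  # fit window, here 104 to 155
exp_area = lamb * (np.exp(-x1 / lamb) - np.exp(-x2 / lamb))  # integral of exp(-x/lamb)
A = hist_area / exp_area                              # scale so A * exp_area == hist_area

xs = np.linspace(x1, x2, 200)
plt.plot(xs, A * np.exp(-xs / lamb), label="background fit")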

Finding n-tuple that minimizes expensive cost function

Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried the surrogate optimization in Matlab but I'm not sure it is appropriate. I've also heard about simulated annealing but found no implementation applied to this instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
The requirement/benefit of Simulated Annealing (SA) is that the objective surface is somewhat smooth, that is, states close to a solution are also close in objective value.
For a completely random, spiky surface you might as well do a random search.
If the surface is anything like smooth, even only in places, it makes sense to try SA.
The idea is that (sometimes) changing only 1 of the 3 values has little effect on our black-box function.
Here is a basic example of doing this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
LENGTH = 12
I define a black-box using the Rastrigin function.
def rastrigin_function_n(x):
    """
    N-dimensional Rastrigin
    https://en.wikipedia.org/wiki/Rastrigin_function
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum(x**2 - A*np.cos(2*np.pi*x))

def black_box(x):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify the state X. Instead of taking/modifying values directly, we keep track of indices. This simplifies creating new proposals, as an index is always an integer to which we can simply add/subtract 1 modulo LENGTH.
def random_start():
    """
    returns 3 random indices
    """
    return np.random.randint(0, LENGTH, size=3)

def random_small_step(x):
    """
    change only 1 index
    """
    d = np.array([1, 0, 0])
    if np.random.random() < .5:
        d = np.array([-1, 0, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH

def random_big_step(x):
    """
    change 2 indices
    """
    d = np.array([1, -1, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH

def obj(x):
    """
    We have a triplet of indices:
    1. Calculate the corresponding values in W = [w1,w2,w3]
    2. Push the values into our black-box function
    """
    indices = x
    values = W[np.array([0, 1, 2]), indices]
    return black_box(values)
And throw an SA scheme at it:
import frigidum
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        repeats=10**3,
                        copy_state=frigidum.annealing.naked)
I am not sure what the minimum of this function should be, but it found an objective value of 47.9095 with indices np.array([9, 2, 2]).
Edit:
For frigidum, to change the cooling schedule use alpha=.9. My experience is that all the work of experimenting with which cooling scheme works best doesn't outweigh simply letting it run a little longer. The multiplicative rule you proposed (sometimes called geometric) is the standard one and is also implemented in frigidum, so to implement Tn+1 = 0.9*Tn you need alpha=.9. Be aware that this cooling step is done after N repeats, so if repeats=100, it will first do 100 proposals before lowering the temperature by the factor alpha.
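As a sketch of how those cooling parameters plug into the call above (reusing random_start, the neighbour functions and obj; the specific numbers are placeholders to tune):
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        alpha=0.9,    # geometric cooling: T_{n+1} = 0.9 * T_n
                        repeats=100,  # proposals per temperature before cooling
                        copy_state=frigidum.annealing.naked)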
Simple variations on the current state often work best. Since it is best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try 2 or 3 variations. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us, so if the time/cost of bringing another competitor to the table is small, then yes!
Try every possible combination of the three values and see which has the lowest cost.
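With only 12^3 = 1728 combinations, this brute-force approach can be feasible whenever a single evaluation is affordable. A minimal sketch, where cost_function stands in for the expensive black box:
import itertools

w1 = w2 = w3 = range(1, 13)

best_triplet, best_cost = None, float("inf")
for triplet in itertools.product(w1, w2, w3):
    cost = cost_function(*triplet)  # the expensive black-box evaluation
    if cost < best_cost:
        best_triplet, best_cost = triplet, cost
print(best_triplet, best_cost)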

PyMC3 PK modelling. Model cant resolve to parameters used to create the data set

I am new to PK modelling and pymc3, but I have been playing around with pymc3 and trying to implement a simple PK model as part of my own learning. Specifically, a model that captures this relationship:
C(t) = (Dose / V) * exp(-(CL / V) * t)
where C(t) (Cpred) is the concentration at time t, Dose is the dose given, V is the volume of distribution, and CL is the clearance.
I have generated some test data for 30 subjects with CL = 2 and V = 10, for 3 doses (100, 200, 300), at time points 0, 1, 2, 4, 8, 12. I have also included some random error on CL (normal distribution, mean 0, omega = 0.6) and a residual unexplained error DV = Cpred + sigma, where sigma is normally distributed with SD = 0.33. In addition, I have included a transformation of CL and V with respect to weight (uniform distribution 50-90): CLi = CL * WT/70; Vi = V * WT/70.
# Create Data for modelling
np.random.seed(0)
# Subject ID's
data = pd.DataFrame(np.arange(1,31), columns=['subject'])
# Dose
data['dose'] = np.array([100,100,100,100,100,100,100,100,100,100,
                         200,200,200,200,200,200,200,200,200,200,
                         300,300,300,300,300,300,300,300,300,300])
# Random Body Weight
data['WT'] = np.random.randint(50,100, size =30)
# Fixed Clearance and Volume for the population
data['CLpop'] =2
data['Vpop']=10
# Error rate for individual clearance rate
OMEGA = 0.66
# Individual clearance rate as a function of weight and omega
data['CLi'] = data['CLpop']*(data['WT']/70)+ np.random.normal(0, OMEGA )
# Individual Volume as a function of weight
data['Vi'] = data['Vpop']*(data['WT']/70)
# Expand dataframe to account for time points
data = pd.concat([data]*6,ignore_index=True )
data = data.sort('subject')
# Add in time points
data['time'] = np.tile(np.array([0,1,2,4,8,12]), 30)
# Create concentration values using equation
data['Cpred'] = data['dose']/data['Vi'] *np.exp(-1*data['CLi']/data['Vi']*data['time'])
# Error rate for DV
SIGMA = 0.33
# Create Dependenet Variable from Cpred + error
data['DV']= data['Cpred'] + np.random.normal(0, SIGMA )
# Create new df with only data for modelling...
df = data[['subject','dose','WT', 'time', 'DV']]
Create arrays ready for model...
# Prepare data from df to model specific arrays
time = np.array(df['time'])
dose = np.array(df['dose'])
DV = np.array(df['DV'])
WT = np.array(df['WT'])
n_patients = len(data['subject'].unique())
subject = data['subject'].values-1
I have built a simple model in pymc3 ....
pk_model = Model()
with pk_model:
    # Hyperparameter Priors
    sigma = Lognormal('sigma', mu=0, tau=0.01)
    V = Lognormal('V', mu=2, tau=0.01)
    CL = Lognormal('CL', mu=1, tau=0.01)
    # Transformation wrt weight
    CLi = CL*(WT)/70
    Vi = V*(WT)/70
    # Expected value of outcome
    pred = dose/Vi*np.exp(-1*(CLi/Vi)*time)
    # Likelihood (sampling distribution) of observations
    conc = Normal('conc', mu=pred, tau=sigma, observed=DV)
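The trace referred to below was produced by sampling this model, along these lines (a sketch; the draw count is a placeholder, and sample and traceplot are the usual pymc3 calls, imported in the same style as Model and Normal above):
with pk_model:
    trace = sample(2000)  # placeholder number of draws
traceplot(trace)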
My expectation was that I should be able to recover from the data the constants and error rates that were originally used to generate it, but I have not been able to do so, although I can get close. In this example...
data['CLi'].mean()
> 2.322473543135788
data['Vi'].mean()
> 10.147619047619049
And the trace shows....
So my questions are..
Is my code structured correctly and are there any glaring mistakes that I have overlooked that might account for this difference?
Can I structure the pymc3 model to better reflect the relationship from which I have generated the data?
What would be your suggestions to improve the model?
Thanks in advance!
I'm going to answer my own question!
I implemented a hierarchical model following the example found here...
GLM - hierarchical
and it works a treat. I also noticed errors in the way I was applying the errors in the dataframe - I should use
data['CLer'] = np.random.normal(scale=OMEGA, size=30)
to ensure each subject has a different value for the error.
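A hedged sketch of how that fix slots into the data-generation code above (keeping the additive error model from the question; drawing one residual per observation with size=len(data) is my own addition in the same spirit):
# One random draw per subject for the clearance error (before expanding to time points)
data['CLer'] = np.random.normal(scale=OMEGA, size=30)
data['CLi'] = data['CLpop'] * (data['WT'] / 70) + data['CLer']
# ...and, after expanding to all time points, one residual draw per observation
data['DV'] = data['Cpred'] + np.random.normal(scale=SIGMA, size=len(data))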

Looping and if statements in SPSS

I'm new to SPSS and I'm a bit stuck on a problem. I have about 200 variables and I want to loop through pairs of them looking for variables with correlation coefficients above 0.7. I know that I can use CORRELATIONS to get a matrix of coefficients but it would be huge and difficult to look through. Basically, in pseudocode, what I want to do is:
for (i = W1_1 to W1_200) {
    for (j = i to W1_200) {
        if CORRELATIONS(i,j) > 0.7 {
            print i, j, CORRELATIONS(i,j)
        }
    }
}
I can't for the life of me work out how to do any of this in SPSS. Help!
SPSS has a helper function on the CORRELATIONS command to export the correlation matrix. From there you can manipulate the data to get the correlation pairs that meet your criteria. So first, let's make some fake data to illustrate.
*Making fake data.
set seed 5.
input program.
loop i = 1 to 100.
end case.
end loop.
end file.
end input program.
dataset name test.
compute #base = RV.NORMAL(0,1).
vector X(20).
loop #i = 1 to 20.
compute X(#i) = #base*(#i/20) + RV.NORMAL(0,1).
end loop.
exe.
Now, we can run the CORRELATIONS command and export the table to a new dataset (which I named here Corrs).
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=X1 to X20
/MATRIX=OUT('Corrs').
Unfortunately SPSS returns the full matrix (plus other info on the sample size). We can select only the rows we are interested in (ones with "CORR" in the ROWTYPE_ column) and then use a DO REPEAT to set the upper or lower half of the matrix to system missing values.
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
*Now only making lower half of matrix.
COMPUTE #iter = 0.
DO REPEAT X = X1 TO X20.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
I set them to system missing values because in the next part I will reshape the data using VARSTOCASES. This drops missing values by default, so we won't end up with redundant correlation pairs.
VARSTOCASES
/MAKE Corr FROM X1 TO X20
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
Now you have your correlation pairs list and can just select out the correlations that meet your criteria.
SELECT IF ABS(Corr) >= .5.
Making the correlation pairs can be wrapped into a MACRO function pretty easily to return the pair list. Below is that function, recreating the exact steps used here.
DEFINE !CorrPairs (!POSITIONAL !CMDEND)
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=!1
/MATRIX=OUT('Corrs').
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
COMPUTE #iter = 0.
DO REPEAT X = !1.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
VARSTOCASES
/MAKE Corr FROM !1
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
!ENDDEFINE.
The macro just takes a list of variables (in the active dataset) to grab the correlations for, and returns a second dataset named Corrs with the correlation pairs and the variable names defined in the X1 and X2 columns. Once the macro is defined, the steps above can be recreated simply as below.
!CorrPairs X1 to X20.
SELECT IF ABS(Corr) >= .5.
EXECUTE.
My suggestion is to use OMS to extract your correlation values from the output into a datafile. Use a macro to run only the correlations you need:
DATASET DECLARE Correlations.
OMS /SELECT TABLES /IF COMMANDS=['Correlations'] SUBTYPES=['Correlations']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_ OUTFILE='Correlations' VIEWER=YES.
define runCorrs ()
!do !i1=1 !to 200
!do !i2=!i1 !to 200
!if (!i2<>!i1) !then
corr !concat("W_",!i1) with !concat("W_",!i2).
!ifend
!doend !doend
!enddefine.
runCorrs.
OMSEND.
datas act Correlations.
select if var2="Pearson Correlation".
VARSTOCASES /make crlVal from W_2 to W_200/index=withvar(crlVal)
/drop TableNumber_ Command_ Subtype_ Label_ Var2.
Now you have a nice list of all the correlations to work with:
select if crlVal>0.7.
exe.