Multiple Coins from Single Mint Example in PyMC - bayesian

Trying to learn PyMC by transferring some of the models from the book "Doing Bayesian Data Analysis" (Kruschke). One basic example (from Ch. 9) is to assume a set of coins is distributed according to p~Bern(theta) where theta comes from a Beta distribution (the "mint") with fixed parameters. Here's how I have coded it up (in PyMC2):
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sbn
from pymc.Matplot import plot as mcplot
from pymc import Bernoulli, Beta, Gamma
flips = [[True, False, False, False],
[False, False, False, True],
[True, False, False, False],
[False, False, False, False]]
mint = Beta('mint', alpha=2, beta=2)
coin0 = Bernoulli('coin0', p=mint, value=flips[0], observed=True)
coin1 = Bernoulli('coin1', p=mint, value=flips[1], observed=True)
coin2 = Bernoulli('coin2', p=mint, value=flips[2], observed=True)
coin3 = Bernoulli('coin3', p=mint, value=flips[3], observed=True)
mcmc = pm.MCMC([mint, coin0, coin1, coin2, coin3])
mcmc.sample(iter=10000, burn=1000)
mcmc.summary()
mint:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
[[ 0.253]] [[ 0.096]] [[ 0.002]] [ 0.074 0.439]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
[[ 0.089]] [[ 0.183]] [[ 0.242]] [[ 0.318]] [[ 0.46]]
This seems to have worked, but what I'm wondering is how do I get the theta values for each coin? I assume there should be samples generated from the posterior distribution for each coin?

Not sure what you mean by theta, since there is no theta in your model. Are you referring to the coin-specific probabilities (which are here represented by mint)? You have specified a single probability for all the coins, rather than 4 probabilities. Try modifying your mint parameter to:
mint = Beta('mint', alpha=2, beta=2, size=4)
which will specify a vector-valued stochastic of size 4.

Related

matplotlib unstructered quadrilaterals instead of triangles

I've two netcdf files containing both unstructured grids. The first grid has 3 vertices per face and the second has 4 vertices per face.
For the grid containing 3 vertices per face I can use matplotlib.tri for visualization (like triplot_demo.py:
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import numpy as np
xy = np.asarray([
[-0.101, 0.872], [-0.080, 0.883], [-0.069, 0.888], [-0.054, 0.890],
[-0.045, 0.897], [-0.057, 0.895], [-0.073, 0.900], [-0.087, 0.898],
[-0.090, 0.904], [-0.069, 0.907], [-0.069, 0.921], [-0.080, 0.919],
[-0.073, 0.928], [-0.052, 0.930], [-0.048, 0.942], [-0.062, 0.949],
[-0.054, 0.958], [-0.069, 0.954], [-0.087, 0.952], [-0.087, 0.959],
[-0.080, 0.966], [-0.085, 0.973], [-0.087, 0.965], [-0.097, 0.965],
[-0.097, 0.975], [-0.092, 0.984], [-0.101, 0.980], [-0.108, 0.980],
[-0.104, 0.987], [-0.102, 0.993], [-0.115, 1.001], [-0.099, 0.996],
[-0.101, 1.007], [-0.090, 1.010], [-0.087, 1.021], [-0.069, 1.021],
[-0.052, 1.022], [-0.052, 1.017], [-0.069, 1.010], [-0.064, 1.005],
[-0.048, 1.005], [-0.031, 1.005], [-0.031, 0.996], [-0.040, 0.987],
[-0.045, 0.980], [-0.052, 0.975], [-0.040, 0.973], [-0.026, 0.968],
[-0.020, 0.954], [-0.006, 0.947], [ 0.003, 0.935], [ 0.006, 0.926],
[ 0.005, 0.921], [ 0.022, 0.923], [ 0.033, 0.912], [ 0.029, 0.905],
[ 0.017, 0.900], [ 0.012, 0.895], [ 0.027, 0.893], [ 0.019, 0.886],
[ 0.001, 0.883], [-0.012, 0.884], [-0.029, 0.883], [-0.038, 0.879],
[-0.057, 0.881], [-0.062, 0.876], [-0.078, 0.876], [-0.087, 0.872],
[-0.030, 0.907], [-0.007, 0.905], [-0.057, 0.916], [-0.025, 0.933],
[-0.077, 0.990], [-0.059, 0.993]])
x = np.degrees(xy[:, 0])
y = np.degrees(xy[:, 1])
triangles = np.asarray([
[65, 44, 20],
[65, 60, 44]])
triang = tri.Triangulation(x, y, triangles)
plt.figure()
plt.gca().set_aspect('equal')
plt.triplot(triang, 'go-', lw=1.0)
plt.title('triplot of user-specified triangulation')
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.show()
-- indices of the related point annotated afterwards
BUT how to visualize the unstructured grid containing 4 vertices per face (quadrilaterals)? Following the previous exapmle, my faces looks like:
quatrang = np.asarray([
[65, 60, 44, 20]])
Obviously trying tri.Triangulation doesn't work:
quatr = tri.Triangulation(x, y, quatrang)
ValueError: triangles must be a (?,3) array
I cannot find anything in the matplotlib libraries regarding 4 vertices per face. Any help is greatly appreciated..
EDIT: Changed the question based upon a minimal, complete and verifiable example
As commented already, since there is no Quatrangulation or simiar, there is no standard way to plot a a similar plot as triplot with four points per shape in matplotlib.
Of course you could triangulate your mesh again to obtain 2 triangles per quadrilateral. Or, you can plot a PolyCollection of the shapes, given their coordinates in space. The following shows the latter, defining a quatplot function which takes the coordinates and the indices of the vertices as input and draws a PolyCollection of those to the axes.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.collections
xy = np.asarray([
[-0.101, 0.872], [-0.080, 0.883], [-0.069, 0.888], [-0.054, 0.890],
[-0.090, 0.904], [-0.069, 0.907], [-0.069, 0.921], [-0.080, 0.919],
[-0.080, 0.966], [-0.085, 0.973], [-0.087, 0.965], [-0.097, 0.965],
[-0.104, 0.987], [-0.102, 0.993], [-0.115, 1.001], [-0.099, 0.996],
[-0.052, 1.022], [-0.052, 1.017], [-0.069, 1.010], [-0.064, 1.005],
[-0.045, 0.980], [-0.052, 0.975], [-0.040, 0.973], [-0.026, 0.968],
[ 0.017, 0.900], [ 0.012, 0.895], [ 0.027, 0.893], [ 0.019, 0.886],
[ 0.001, 0.883], [-0.012, 0.884], [-0.029, 0.883], [-0.038, 0.879],
[-0.030, 0.907], [-0.007, 0.905], [-0.057, 0.916], [-0.025, 0.933],
[-0.077, 0.990], [-0.059, 0.993]])
x = np.degrees(xy[:, 0])
y = np.degrees(xy[:, 1])
quatrang = np.asarray([
[19,13,10,22], [35,7,3,28]])
def quatplot(x,y, quatrangles, ax=None, **kwargs):
if not ax: ax=plt.gca()
xy = np.c_[x,y]
verts=xy[quatrangles]
pc = matplotlib.collections.PolyCollection(verts, **kwargs)
ax.add_collection(pc)
ax.autoscale()
plt.figure()
plt.gca().set_aspect('equal')
quatplot(x,y, quatrang, ax=None, color="crimson", facecolor="None")
plt.plot(x,y, marker="o", ls="", color="crimson")
plt.title('quatplot of user-specified quatrangulation')
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
for i, (xi,yi) in enumerate(np.degrees(xy)):
plt.text(xi,yi,i, size=8)
plt.show()

Implementing minimization in SciPy

I am trying to implement the 'Iterative hessian Sketch' algorithm from https://arxiv.org/abs/1411.0347 page 12. However, I am struggling with step two which needs to minimize the matrix-vector function.
Imports and basic data generating function
import numpy as np
import scipy as sp
from sklearn.datasets import make_regression
from scipy.optimize import minimize
import matplotlib.pyplot as plt
%matplotlib inline
from numpy.linalg import norm
def generate_data(nsamples, nfeatures, variance=1):
'''Generates a data matrix of size (nsamples, nfeatures)
which defines a linear relationship on the variables.'''
X, y = make_regression(n_samples=nsamples, n_features=nfeatures,\
n_informative=nfeatures,noise=variance)
X[:,0] = np.ones(shape=(nsamples)) # add bias terms
return X, y
To minimize the matrix-vector function, I have tried implementing a function which computes the quanity I would like to minimise:
def f2min(x, data, target, offset):
A = data
S = np.eye(A.shape[0])
#S = gaussian_sketch(nrows=A.shape[0]//2, ncols=A.shape[0] )
y = target
xt = np.ravel(offset)
norm_val = (1/2*S.shape[0])*norm(S#A#(x-xt))**2
#inner_prod = (y - A#xt).T#A#x
return norm_val - inner_prod
I would eventually like to replace S with some random matrices which can reduce the dimensionality of the problem, however, first I need to be confident that this optimisation method is working.
def grad_f2min(x, data, target, offset):
A = data
y = target
S = np.eye(A.shape[0])
xt = np.ravel(offset)
S_A = S#A
grad = (1/S.shape[0])*S_A.T#S_A#(x-xt) - A.T#(y-A#xt)
return grad
x0 = np.zeros((X.shape[0],1))
xt = np.zeros((2,1))
x_new = np.zeros((2,1))
for it in range(1):
result = minimize(f2min, x0=xt,args=(X,y,x_new),
method='CG', jac=False )
print(result)
x_new = result.x
I don't think that this loop is correct at all because at the very least there should be some local convergence before moving on to the next step. The output is:
fun: 0.0
jac: array([ 0.00745058, 0.00774882])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 416
nit: 0
njev: 101
status: 2
success: False
x: array([ 0., 0.])
Does anyone have an idea if:
(1) Why I'm not achieving convergence at each step
(2) I can implement step 2 in a better way?

Usefull way of reverting Homogeneous coordinates back to 2d?

Is there some numpy sugar for reverting Homogeneous coordinates back to 2d coordinates.
So this:
[[4,8,2],
6,3,2]]
becomes this:
[[2,4],
[3,1.5]]
One approach making use of broadcasted elementwise divisions -
from __future__ import division
a[:,:2]/a[:,[-1]]
We can use a[:,-1,None] or a[:,-1][:,None] or a[:,-1].reshape(-1,1) in place of a[:,[-1]]. With a[:,[-1]], we are keeping the number of dims intact, letting us perform the broadcasting divisions.
Another with np.true_divide again using broadcasting -
np.true_divide(a[:,:2], a[:,[-1]])
Sample run -
In [194]: a
Out[194]:
array([[4, 8, 2],
[6, 3, 2]])
In [195]: a[:,:2]/a[:,[-1]]
Out[195]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
In [196]: np.true_divide(a[:,:2], a[:,[-1]])
Out[196]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
If you have your input as a vector called x you could do
x[:-1]/x[-1]
Full example:
import numpy as np
x = np.array([6,3,2])
x[:-1]/x[-1] # array([ 3. , 1.5])
You can also apply it to multiple coordinates in an array:
xs = np.array([[4,8,2],[6,3,2]])
np.array([x[:-1]/x[-1] for x in xs]) # array([[ 2. , 4. ],
# [ 3. , 1.5]])
If you want to reuse this you can define a function homogen:
homogen = lambda x: x[:-1]/x[-1]
# previous stuff becomes something like
np.array([homogen(x) for x in xs])

Numpy creating logical array in the presence of NaNs

I have an array x, from which I would like to extract a logical mask. x contains nan values, and the mask operation raises a warning, which is what I am trying to avoid.
Here is my code:
import numpy as np
x = np.array([[0, 1], [2.0, np.nan]])
mask = np.isfinite(x) & (x > 0)
The resulting mask is correct (array([[False, True], [ True, False]], dtype=bool)), but a warning is raised:
__main__:1: RuntimeWarning: invalid value encountered in greater
How can I construct the mask in a way that avoids comparing against NaNs? I am not trying to suppress the warning (which I know how to do).
We could do it in two steps - Create the mask of finite ones and then use the same mask to index into itself and also to select the valid mask of remaining finite elements off x for testing and setting into the remaining elements in that mask. So, we would have an implementation like so -
In [35]: x
Out[35]:
array([[ 0., 1.],
[ 2., nan]])
In [36]: mask = np.isfinite(x)
In [37]: mask[mask] = x[mask]>0
In [38]: mask
Out[38]:
array([[False, True],
[ True, False]], dtype=bool)
Looks like masked arrays works with this case:
In [214]: x = np.array([[0, 1], [2.0, np.nan]])
In [215]: xm = np.ma.masked_invalid(x)
In [216]: xm
Out[216]:
masked_array(data =
[[0.0 1.0]
[2.0 --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [217]: xm>0
Out[217]:
masked_array(data =
[[False True]
[True --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [218]: _.data
Out[218]:
array([[False, True],
[ True, False]], dtype=bool)
But other than propagating the masking I don't know how it handles element by element operations like this. The usual fill and compressed steps don't seem relevant.

Partition training data by class in NumPy

I have a 50000 x 784 data matrix (50000 samples and 784 features) and the corresponding 50000 x 1 class vector (classes are integers 0-9). I'm looking for an efficient way to group the data matrix into 10 data matrices and class vectors that each have only the data for a particular class 0-9.
I can't seem to find an elegant way to do this, aside from just looping through the data matrix and constructing the 10 other matrices that way.
Does anyone know if there is a clean way to do this with something in scipy, numpy, or sklearn?
Probably the cleanest way of doing this in numpy, especially if you have many classes, is through sorting:
SAMPLES = 50000
FEATURES = 784
CLASSES = 10
data = np.random.rand(SAMPLES, FEATURES)
classes = np.random.randint(CLASSES, size=SAMPLES)
sorter = np.argsort(classes)
classes_sorted = classes[sorter]
splitter, = np.where(classes_sorted[:-1] != classes_sorted[1:])
data_splitted = np.split(data[sorter], splitter + 1)
data_splitted will be a list of arrays, one for each class found in classes. Running the above code with SAMPLES = 10, FEATURES = 2 and CLASSES = 3 I get:
>>> data
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.70539842, 0.76376921],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.01985781, 0.04272293],
[ 0.93026735, 0.40216376],
[ 0.39089845, 0.01891637],
[ 0.70937483, 0.16077439],
[ 0.45383099, 0.82074859]])
>>> classes
array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])
>>> data_splitted
[array([[ 0.93026735, 0.40216376],
[ 0.70937483, 0.16077439]]),
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.45383099, 0.82074859]]),
array([[ 0.70539842, 0.76376921],
[ 0.01985781, 0.04272293],
[ 0.39089845, 0.01891637]])]
If you want to make sure the sort is stable, i.e. that data points in the same class remain in the same relative order after sorting, you will need to specify sorter = np.argsort(classes, kind='mergesort').
If your data and labels matrices are in numpy format, you can do:
data_class_3 = data[labels == 3, :]
If they aren't, turn them into numpy format:
import numpy as np
data = np.array(data)
labels = np.array(labels)
data_class_3 = data[labels == 3, :]
You can loop and do this for all labels automatically if you like. Something like this:
import numpy as np
split_classes = np.array([data[labels == i, :] for i in range(10)])
After #Jaime numpy optimal answer, I suggest you pandas, specialized in data manipulations :
import pandas
df=pandas.DataFrame(data,index=classes).sort_index()
then df.loc[i] is your class i.
if you want a list, just do
metadata=[df.loc[i].values for i in range(10)]
so metadata[i] is the subset you want, or make a panel with pandas. All that is based on numpy arrays, so efficiency is preserved.