Simple Hidden Markov Model in PyMC3 throws Theano error - numpy

I'm new to PyMC3, Theano, and numpy. I was just trying to duplicate the first 'hidden' Markov model in the Stan manual--the one in which the states are actually observed. But I keep running into errors having to do with Theano, numpy, and perhaps what is going on behind PyMC3 distributions, which all seem a bit mysterious to me. My code for the model is below:
import pandas as pd
dat_hmm = pd.read_csv('hmmVals.csv')
emission = dat_hmm.emission.values
state = dat_hmm.state.values

from pymc3 import Model, Dirichlet, Categorical
import numpy as np

basic_model = Model()
with basic_model:
    # Model constants:
    # num unique hidden states, num unique emissions, num instances
    K = 3; V = 9; T = 10
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transition
    phi = np.empty(K, dtype=object)    # phi = emission
    # observed emission, state:
    w = np.empty(T, dtype=object); z = np.empty(T, dtype=object)
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(T):
        w[t] = Categorical('w' + str(t), theta[state[t]], shape=1, observed=emission[t])
    for t in range(2, T):
        z[t] = Categorical('z' + str(t), phi[state[t-1]], shape=1, observed=state[t])
The line "w[t]=Categorical('w'+str(t),theta[state[t]], shape=1, observed=emission[t])" generates the error, but not on t=0, which fills in w0, but on t=1 which generates an index out of bound error. There is no index out of bound in the code line itself because state[1], theta[state[t]], and emission[t] all exist. The error messages are:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/pymc3/distributions/distribution.py", line 25, in __new__
return model.Var(name, dist, data)
File "/usr/local/lib/python3.4/dist-packages/pymc3/model.py", line 306, in Var
var = ObservedRV(name=name, data=data, distribution=dist, model=self)
File "/usr/local/lib/python3.4/dist-packages/pymc3/model.py", line 581, in __init__
self.logp_elemwiset = distribution.logp(data)
File "/usr/local/lib/python3.4/dist-packages/pymc3/distributions/discrete.py", line 400, in logp
a = tt.log(p[value])
File "/usr/local/lib/python3.4/dist-packages/theano/tensor/var.py", line 532, in __getitem__
lambda entry: isinstance(entry, Variable)))
File "/usr/local/lib/python3.4/dist-packages/theano/gof/op.py", line 668, in __call__
required = thunk()
File "/usr/local/lib/python3.4/dist-packages/theano/gof/op.py", line 883, in rval
fill_storage()
File "/usr/local/lib/python3.4/dist-packages/theano/gof/cc.py", line 1707, in __call__
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python3.4/dist-packages/six.py", line 686, in reraise
raise value
IndexError: index out of bounds
I don't know about the wisdom of sticking numpy objects into PyMC3 distributions or using the result of that to try to parameterize another distribution, but I have seen somewhat similar code on the web, minus the last part. Is there perhaps no good way to code such a hidden Markov model in PyMC3 yet?

I have found a way to fix the above error. The following code works--no errors, and I'm able to get correct parameter estimates, at least with Metropolis.
I made two mistakes, and didn't realize they were so simple because I expected something complicated to be happening in Theano. One is that my data was set up for Stan and therefore indexed from 1 rather than 0; Python indexes everything from 0, so I changed the data file by subtracting 1 from every value. The other mistake was that I used theta, the transition matrix, to parameterize the individual emissions, and vice versa for the phi (emission) matrix; theta was too short for the emission values, hence the index error.
What I wish I understood now is why the NUTS sampler keeps telling me I have a non-positive definite scaling, even though I'm feeding it MAP estimates. Metropolis works, but is slow: about 11 minutes for these 300 observations and 1000 samples. The other mystery is why PyMC3 thinks it only took a couple of seconds to compute the samples.
import pandas as pd
dat_hmm = pd.read_csv('hmmVals.csv')
emission = dat_hmm.emission.values
state = dat_hmm.state.values

from pymc3 import Model, Dirichlet, Categorical
import numpy as np

basic_model = Model()
with basic_model:
    # Model constants:
    K = 3; V = 9; T = 300  # num unique hidden states, num unique emissions, num instances
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transition
    phi = np.empty(K, dtype=object)    # phi = emission
    w = np.empty(T, dtype=object); z = np.empty(T, dtype=object)  # observed emission, state
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(2, T):
        z[t] = Categorical('z' + str(t), theta[state[t-1]], shape=1, observed=state[t])
    for t in range(T):
        w[t] = Categorical('w' + str(t), phi[state[t]], shape=1, observed=emission[t])

I have also tried to implement HMMs in pymc3 and ran into similar problems. I just found a way to implement a two-level HMM in a vectorized fashion (to be perfectly honest, my model is not hidden, but the hidden part can be added easily--I vectorized the description of the state variable). I am not sure whether this is the most efficient way, but I tested this code against a simple for loop for defining the states: the code below runs in less than a minute with 1000 data points, whereas the for loop took several hours.
Here is the code:
import numpy as np
import theano.tensor as tt
import pymc3 as pm
class HMMStates(pm.Discrete):
    """
    Hidden Markov Model states.

    Parameters
    ----------
    P1 : tensor
        probability to remain in state 1
    P2 : tensor
        probability to move from state 2 to state 1
    """
    def __init__(self, PA=None, P1=None, P2=None, *args, **kwargs):
        super(HMMStates, self).__init__(*args, **kwargs)
        self.PA = PA
        self.P1 = P1
        self.P2 = P2
        self.mean = 0.

    def logp(self, x):
        PA = self.PA
        P1 = self.P1
        P2 = self.P2
        # now we need to create an array with probabilities
        # so that for x=A: PA=P1, PB=(1-P1)
        # and for x=B: PA=P2, PB=(1-P2)
        length = x.shape[0]
        P1T = tt.tile(P1, (length - 1, 1)).T
        P2T = tt.tile(P2, (length - 1, 1)).T
        P = tt.switch(x[:-1], P1T, P2T).T
        x_i = x[1:]
        ou_like = pm.Categorical.dist(P).logp(x_i)
        return pm.Categorical.dist(PA).logp(x[0]) + tt.sum(ou_like)
This class creates the states of the HMM. To call it you can do the following:
theta = np.ones(2)  # prior for probabilities
with pm.Model() as model:
    # 2-state model
    # P1 is the probability to stay in state 1
    # P2 is the probability to move from state 2 to state 1
    P1 = pm.Dirichlet('P1', a=theta)
    P2 = pm.Dirichlet('P2', a=theta)
    PA = pm.Deterministic('PA', P2 / (P2 + 1 - P1))
    states = HMMStates('states', PA, P1, P2, observed=data)
    start = pm.find_MAP()
    trace = pm.sample(5000, start=start)
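In case it is not obvious where the PA expression comes from: taking P1 as the probability of remaining in state 1 and P2 as the probability of moving from state 2 to state 1 (as in the docstring), PA is the stationary probability of state 1. Stationarity requires

PA = PA * P1 + (1 - PA) * P2
PA * (1 - P1 + P2) = P2
PA = P2 / (P2 + 1 - P1)

which is the expression used in the Deterministic above, and PA is then used as the distribution of the first state x[0] in HMMStates.logp.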
Just to show what the old code looks like:
with pm.Model() as model:
    # 2-state model
    # P1 is the probability to stay in state 1
    # P2 is the probability to move from state 2 to state 1
    P1 = pm.Dirichlet('P1', a=np.ones(2))
    P2 = pm.Dirichlet('P2', a=np.ones(2))
    PA = pm.Deterministic('PA', P2 / (P2 + 1 - P1))
    state = pm.Categorical('state0', PA, observed=data[0])
    for i in range(1, N_chain):
        state = pm.Categorical('state' + str(i), tt.switch(data[i-1], P1, P2), observed=data[i])
    start = pm.find_MAP()
    trace = pm.sample(5000, start=start)

Related

Using SMOTE with NaN values

Is there a way one can use SMOTE with NaNs?
Here is a dummy program that tries to use SMOTE in the presence of NaN values:
# Imports
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Imputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))
# SMOTEd class distribution
print('Dataset has %s missing values.' % np.isnan(X).sum())
_, y_resampled = SMOTE().fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
# Generate artificial missing values
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())
#_, y_resampled = make_pipeline(Imputer(), SMOTE()).fit_sample(X, y)
sm = SMOTE(ratio = 'auto',k_neighbors = 5, n_jobs = -1)
smote_enn = SMOTEENN(smote = sm)
x_train_res, y_train_res = smote_enn.fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
I get the following output/error:
Number of samples for both classes: 212 and 357.
Dataset has 0 missing values.
Number of samples for both classes: 357 and 357.
Dataset has 6051 missing values.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Your commented-out line already contains the answer. Notice that fit_resample is used instead of fit_sample. You should use make_pipeline as follows:
# Imports
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
X[X > 1.0] = np.nan
# Over-sampling
smote = SMOTE(ratio='auto',k_neighbors=5, n_jobs=-1)
smote_enn = make_pipeline(SimpleImputer(), SMOTEENN(smote=smote))
_, y_res = smote_enn.fit_resample(X, y)
# Class distribution
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_res).values()))
Check also your imbalanced-learn version.
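You can check the installed versions with, for example:

import imblearn
import sklearn
print(imblearn.__version__)
print(sklearn.__version__)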
Generally no: SMOTE prepares a data set for further model fitting.
Usual models (like random forests, etc.) do not work with NA in the label variable, because what would you actually be predicting? The same goes for NA in the predictor variables, where most algorithms either do not work or simply ignore cases with NA.
So the error is pretty much by design: you cannot, and should not, have missing values in the training data you feed the algorithm, and logically you do not want to "balance" cases with missing values--you only want to SMOTE cases with valid labels.
If you feel that missing labels still represent valid information that should be balanced (e.g. you actually want to oversample the NA class because you think it is underrepresented), then it should not be a missing value but rather a defined value called "Unknown" or similar, indicating a known class with the characteristic of "NA". But I do not really see any research questions where this makes sense.
Update 1:
Another way to go is to impute the missing values first, so that fitting your model actually has three steps (see the sketch after this list):
Imputing missing values (using MICE or similar)
SMOTE to balance training set
Fit algorithm/model
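A minimal sketch of those three steps, assuming imbalanced-learn's pipeline and a random forest as a stand-in classifier (both are just illustrative choices):

from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# Reuse the question's data with artificial missing values
X, y = load_breast_cancer(return_X_y=True)
X[X > 1.0] = np.nan

# 1. impute missing values, 2. SMOTE the training data, 3. fit the model
# (the sampler step only runs during fit, never during predict)
clf = make_pipeline(SimpleImputer(strategy='mean'),
                    SMOTE(),
                    RandomForestClassifier(n_estimators=100))
clf.fit(X, y)
print('Original class distribution:', Counter(y))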

Calling scheduler.multiprocessing.get in a separate process in dask

I am training a neural network on a large text corpus. Each text generates quite a big matrix because I'm using a convolutional model. As my data won't fit in my (still large) memory, I try to stream it and use keras.models fit_generator.
To feed keras, I have a pipeline composed of different preprocessing steps, which I arrange as a dask bag with lots of partitions. The dask bag reads a file on disk.
Even though dask does not handle iteration in a smart way (it just calls compute() and iterates over the result, which in my case blows up memory), I wanted to use something like this:
def compute_partition_iter(collection, **kwargs):
    """A utility to compute a collection's items, partition by partition."""
    # _globals is dask's global configuration dict (dask.context._globals
    # in the version I was using)
    get = kwargs.pop("get", None) or _globals['get']
    if get is None:
        get = collection.__dask_scheduler__
    postcompute_func, postcompute_args = collection.__dask_postcompute__()
    dsk = collection.__dask_graph__()
    for key in collection.__dask_keys__():
        # compute a single partition, then finalize it and yield its items
        partition = get(dsk, key, **kwargs)
        yield from postcompute_func([partition], *postcompute_args)
This computes partitions one by one and yields their items, triggering computation of the next partition as we cross a partition border.
This approach has a problem: it's only when we hit the last item of a partition that we trigger the computation of the next one, which leads to a lag before the next element is available. During this lag, keras is stalled and we lose precious time!
So I imagined running the above compute_partition_iter in a separate process thanks to multiprocessing.Pool, feeding partitions into a Queue with, say, 2 slots, so that I would always have one more partition ready in the generator.
But it seems that this is not supported by dask.bag. I didn't dive deeply enough into the code, but it seems like some async methods are used, or I don't know what.
Here is reproducible code for the problem.
First, code that works, using a simple range:
import multiprocessing
import time

def put_q(n, q):
    for i in range(n):
        print(i, "<-")
        q.put(i)
    q.put(None)

q = multiprocessing.Queue(2)
with multiprocessing.Pool(1, put_q, (4, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
This outputs
0 <-
1 <-
2 <-
zzz
3 <-
-> 0
zzz
-> 1
zzz
-> 2
zzz
-> 3
zzz
You can see that, as expected, elements were computed in advance, and it all works.
Now let's replace the range with a dask.bag:
import multiprocessing
import time
import dask.bag

def put_q(n, q):
    for i in dask.bag.from_sequence(range(n), npartitions=2):
        print(i, "<-")
        q.put(i)
    q.put(None)

q = multiprocessing.Queue(5)
with multiprocessing.Pool(1, put_q, (4, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
In a Jupyter notebook, it raises the following over and over:
Process ForkPoolWorker-71:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 103, in worker
initializer(*initargs)
File "<ipython-input-3-e1e9ef9354a0>", line 8, in put_q
for i in dask.bag.from_sequence(range(n), npartitions=2):
File "/usr/local/lib/python3.5/dist-packages/dask/bag/core.py", line 1190, in __iter__
return iter(self.compute())
File "/usr/local/lib/python3.5/dist-packages/dask/base.py", line 154, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/dask/base.py", line 407, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/dask/multiprocessing.py", line 152, in get
initializer=initialize_worker_process)
File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
context=self.get_context())
File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
self._repopulate_pool()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
w.start()
File "/usr/lib/python3.5/multiprocessing/process.py", line 103, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
while the main process is stalled, waiting for elements in the queue.
I also tried using an ipyparallel cluster, but in that case the main process simply stalls (no trace of the exception).
Does anyone know the right way to do this?
Is there a way I can run scheduler.get in parallel to my main code?
Finally, I should have taken a closer look at the exception!
Stack Overflow gave me the solution: Python Process Pool non-daemonic?
In fact, because the bag's multiprocessing scheduler uses a Pool itself, it can't be called inside a process spawned by a Pool. The solution in my case was simply to use threads. (Note that the bug and its solution depend on the scheduler you use.)
So I substituted multiprocessing.pool.ThreadPool for multiprocessing.Pool and it works like a charm, either in a normal notebook or when using ipyparallel.
So it goes like this:
import queue
from multiprocessing.pool import ThreadPool
import time
import dask.bag

def put_q(n, q):
    b = dask.bag.from_sequence(range(n), npartitions=3)
    for i in b:
        print(i, "<-")
        q.put(i)
    q.put(None)

q = queue.Queue(2)
with ThreadPool(1, put_q, (6, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
Which outputs:
zzz
0 <-
1 <-
2 <-
-> 0
zzz
3 <-
-> 1
zzz
4 <-
-> 5 <-
2
zzz
-> 3
zzz
-> 4
zzz
-> 5
zzz

Use "tf.contrib.factorization.KMeansClustering"

Referring to this link, I am trying to practice using tf.contrib.factorization.KMeansClustering for clustering. The simple code below works okay:
import numpy as np
import tensorflow as tf

# ---- Create data sample ----
k = 5
n = 100
variables = 5
points = np.random.uniform(0, 1000, [n, variables])

# ---- Clustering ----
input_fn = lambda: tf.train.limit_epochs(
    tf.convert_to_tensor(points, dtype=tf.float32), num_epochs=1)
kmeans = tf.contrib.factorization.KMeansClustering(num_clusters=6)
kmeans.train(input_fn=input_fn)
centers = kmeans.cluster_centers()

# ---- Print out ----
cluster_indices = list(kmeans.predict_cluster_index(input_fn))
for i, point in enumerate(points):
    cluster_index = cluster_indices[i]
    print('point:', point, 'is in cluster', cluster_index, 'centered at', centers[cluster_index])
My question is: why does this "input_fn" code do the trick?
If I change the code to the following, it runs into an infinite loop. Why?
input_fn = lambda: tf.convert_to_tensor(points, dtype=tf.float32)
From the documentation (here), it seems that train() expects an input_fn argument that simply returns a 'tf.data.Dataset' object or a Tensor. So why do I have to do all these tricky things with lambda and tf.train.limit_epochs()?
Can anyone who is familiar with the fundamentals of tensorflow estimators help to explain? Many thanks!
My question is: why does this "input_fn" code do the trick? If I change the code to this, it runs into an infinite loop. Why?
The documentation states that input_fn is called repeatedly until it returns a tf.errors.OutOfRangeError. Adorning your tensor with tf.train.limit_epochs ensures that the error is eventually raised, which signals to KMeans that it should stop training.
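For comparison, here is a rough sketch (untested, my own illustration) of the same idea with the tf.data API: the dataset is exhausted after one pass over the points, so the OutOfRangeError is raised and training stops, just as with tf.train.limit_epochs:

# Sketch: a Dataset-based input_fn that yields the points exactly once per call
input_fn = lambda: (tf.data.Dataset
                    .from_tensor_slices(points.astype(np.float32))
                    .batch(n))  # n = number of points, as in the question
kmeans.train(input_fn=input_fn)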

How to simulate from priors with pymc3

I'd like to simulate y from the prior (not from the posterior) with pymc3.
I first defined the model:
import pymc3 as pm

with pm.Model() as m:
    mu = pm.Normal('mu', mu=0, sd=10)
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    y = pm.Normal('y', mu=mu, sd=sigma)
    trace = pm.sample(1000, tune=1000)
Then I tried to get 10 simulated y from the model with:
y_pred = pm.sample_ppc(trace, 10, m, size=10)
But the result comes out empty. I searched through the documentation but I didn't find a relevant example. Is it possible to do this with pymc3?
The trace already contains samples from the prior when no observed data is associated with the model definition. However, this can sometimes fail. We are currently working on a sample_prior function that will make this process easier and more straightforward: https://github.com/pymc-devs/pymc3/pull/2876
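In the meantime, because y has no observed data in the model above, trace['y'][:10] already gives 10 (approximate) prior draws of y. You can also forward-simulate the prior by hand; a minimal sketch of that workaround, assuming the .random() method available on PyMC3 random variables:

import numpy as np

# Draw the parameters from their priors, then draw y given those parameters
mu_draws = mu.random(size=10)        # 10 draws from Normal(0, 10)
sigma_draws = sigma.random(size=10)  # 10 draws from Uniform(0, 10)
y_draws = np.random.normal(loc=mu_draws, scale=sigma_draws)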

Clustering of sparse matrix in python and scipy

I'm trying to cluster some data with python and scipy, but the following code does not work, for reasons I do not understand:
from scipy.sparse import *

matrix = dok_matrix((en, en), int)
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        for auth2 in authors:
            if auth1 == auth2: continue
            id1 = e2id[auth1]
            id2 = e2id[auth2]
            matrix[id1, id2] += 1

from scipy.cluster.vq import vq, kmeans2, whiten
result = kmeans2(matrix, 30)
print result
It says:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans2(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 683, in kmeans2
clusters = init(data, k)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 576, in _krandinit
return init_rankn(data)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 563, in init_rankn
mu = np.mean(data, 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2374, in mean
return mean(axis, dtype, out)
TypeError: mean() takes at most 2 arguments (4 given)
When I use kmeans instead of kmeans2, I get the following error:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 507, in kmeans
guess = take(obs, randint(0, No, k), 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 103, in take
return take(indices, axis, out, mode)
TypeError: take() takes at most 3 arguments (5 given)
I think I have these problems because I'm using sparse matrices, but my matrices are too big to fit in memory otherwise. Is there a way to use standard clustering algorithms from scipy with sparse matrices, or do I have to re-implement them myself?
I created a new version of my code to work in a vector space:
el = len(experts)
pl = len(pubs)
print el, pl

from scipy.sparse import *

P = dok_matrix((pl, el), int)
p_id = 0
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        if len(auth1) < 2: continue
        id1 = e2id[auth1]
        P[p_id, id1] = 1

from scipy.cluster.vq import kmeans, kmeans2, whiten
result = kmeans2(P, 30)
print result
But I'm still getting the error:
TypeError: mean() takes at most 2 arguments (4 given)
What am I doing wrong?
K-means cannot be run on distance matrices.
It needs a vector space to compute means in--that is why it is called k-means. If you want to use a distance matrix, you need to look into purely distance-based algorithms such as DBSCAN and OPTICS (both on Wikipedia).
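For example, a rough sketch (my own illustration) using scikit-learn's DBSCAN with a precomputed distance matrix. Note that it densifies the matrix, so it only helps if the number of authors is manageable, and the similarity-to-distance conversion and eps value are arbitrary placeholders:

import numpy as np
from sklearn.cluster import DBSCAN

counts = matrix.toarray().astype(float)   # co-authorship counts from the question
distances = 1.0 / (1.0 + counts)          # crude conversion: more co-authorships = smaller distance
np.fill_diagonal(distances, 0.0)          # distance of each author to itself is 0
labels = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit_predict(distances)
print(labels)                             # cluster index per author, -1 marks noise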
May I suggest "Affinity Propagation" from scikit-learn? In the work I've been doing with it, I find that it has generally been able to find the 'naturally' occurring clusters within my data set. The input to the algorithm is an affinity matrix, or similarity matrix, built from any arbitrary similarity measure.
I don't have a good handle on the kind of data you have on hand, so I can't speak to the exact suitability of this method for your data set, but it may be worth a try, perhaps?
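A rough sketch of what that could look like, using the question's co-authorship counts directly as similarities (again densified, and purely illustrative):

from sklearn.cluster import AffinityPropagation

S = matrix.toarray().astype(float)          # co-authorship counts as similarities
ap = AffinityPropagation(affinity='precomputed')
author_labels = ap.fit_predict(S)           # one cluster index per author
print(author_labels)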
Alternatively, if you're looking to cluster graphs, I'd take a look at NetworkX. That might be a useful tool for you. The reason I suggest this is that it looks like you're working with networks of authors, so with NetworkX you can put in an adjacency matrix and find out which authors are clustered together.
For further elaboration on this, you can see a question I asked earlier for inspiration here.
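If you go the NetworkX route, here is a minimal sketch (my own example, assuming NetworkX 2.x) that builds the author graph from the sparse co-authorship matrix and extracts communities with a standard modularity-based method:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.from_scipy_sparse_matrix(matrix)      # weighted co-authorship graph
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(i, sorted(community))              # node ids correspond to the e2id author ids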