General Mixture Model with PyMC3 - bayesian

I am new to PyMC3 and I have been attempting to create a mixture of independent Poisson's using the following code:
import pymc3 as pm
import numpy as np
from pymc3.distributions.continuous import Uniform
from pymc3.distributions.discrete import Poisson
from pymc3.distributions.mixture import Mixture
from pymc3 import Deterministic
# Bayesian non-parametric histogram comparison
def bayesian_nphist(bins_1, bins_2):
# n equals m anyway
m = len(bins_1)
n = len(bins_2)
with pm.Model():
beta = Uniform('beta', lower=0, upper=np.mean(np.append(bins_1, bins_2)) * 2, shape=m)
eta = Uniform('eta', lower=0, upper=np.mean(bins_1) * 2, shape=m)
gamma = Uniform('gamma', lower=0, upper=np.mean(bins_2) * 2, shape=n)
pi_1 = Uniform('pi_1', lower=0, upper=1)
pi_2 = Deterministic('pi_2', 1 - pi_1)
m1 = Poisson.dist(mu=beta, shape=[m, 1])
n1 = Poisson.dist(mu=beta, shape=[n, 1])
m2 = Poisson.dist(mu=eta, shape=[m, 1])
n2 = Poisson.dist(mu=gamma, shape=[n, 1])
Mixture('mixture_M', [pi_1, pi_2], [m1, m2], observed=bins_1)
Mixture('mixture_N', [pi_1, pi_2], [n1, n2], observed=bins_2)
start = pm.find_MAP()
step = pm.NUTS(state=start)
trace = pm.sample(5000, step=step, progressbar=True)
pm.traceplot(trace)
Basically, this is a non-parametric way of comparing histograms, using the fact that a multinomial distribution is a non-parametric model for a histogram, and that multinomial can be modelled as a bunch of independent Poisson variables.
Somehow I managed to get this error:
ValueError: Input dimension mis-match. (input[0].shape[1] = 2, input[1].shape[1] = 14)
on line:
Mixture('mixture_M', [pi_1, pi_2], [m1, m2], observed=bins_1)
when running with bins_1 and bins_2 as (14, 1) integer arrays, so n=m=14. I printed out the shapes of different variables and I believe they match. Have I misunderstood how Mixture in PYMC3 works? Thanks in advance!

Related

scipy weird unexpected behavior curve_fit large data set for sin wave

For some reason when I am trying to large amount of data to a sin wave it fails and fits it to a horizontal line. Can somebody explain?
Minimal working code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
# Seed the random number generator for reproducibility
import pandas
np.random.seed(0)
# Here it work as expected
# x_data = np.linspace(-5, 5, num=50)
# y_data = 2.9 * np.sin(1.05 * x_data + 2) + 250 + np.random.normal(size=50)
# With this data it breaks
x_data = np.linspace(0, 2500, num=2500)
y_data = -100 * np.sin(0.01 * x_data + 1) + 250 + np.random.normal(size=2500)
# And plot it
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)
def test_func(x, a, b, c, d):
return a * np.sin(b * x + c) + d
# Used to fit the correct function
# params, params_covariance = optimize.curve_fit(test_func, x_data, y_data)
# making some guesses
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data,
p0=[-80, 3, 0, 260])
print(params)
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data, test_func(x_data, *params),
label='Fitted function')
plt.legend(loc='best')
plt.show()
Does anybody know, how to fix this issue. Should I use a different fitting method not least square? Or should I reduce the number of data points?
Given your data, you can use the more robust lmfit instead of scipy.
In particular, you can use SineModel (see here for details).
SineModel in lmfit is not for "shifted" sine waves, but you can easily deal with the shift doing
y_data_offset = y_data.mean()
y_transformed = y_data - y_data_offset
plt.scatter(x_data, y_transformed)
plt.axhline(0, color='r')
Now you can fit to sine wave
from lmfit.models import SineModel
mod = SineModel()
pars = mod.guess(y_transformed, x=x_data)
out = mod.fit(y_transformed, pars, x=x_data)
you can inspect results with print(out.fit_report()) and plot results with
plt.plot(x_data, y_data, lw=7, color='C1')
plt.plot(x_data, out.best_fit+y_data_offset, color='k')
# we add the offset ^^^^^^^^^^^^^
or with the builtin plot method out.plot_fit(), see here for details.
Note that in SineModel all parameters "are constrained to be non-negative", so your defined negative amplitude (-100) will be positive (+100) in the parameters fit results. So the phase too won't be 1 but π+1 (PS: they call shift the phase)
print(out.best_values)
{'amplitude': 99.99631403054289,
'frequency': 0.010001193681616227,
'shift': 4.1400215410836605}

How to calculate the confidence intervals for prediction in Regression? and also how to plot it in python

Fig 7.1, An Introduction To Statistical Learning
I am currently studying a book named Introduction to Statistical Learning with applications in R, and also converting the solutions to python language.
I am not able to get how to get the confidence intervals and plot them as shown in the above image(dashed lines).
I have plotted the line. Here's my code for that -
(I am using polynomial regression with predictiors - 'age' and response - 'wage',degree is 4)
poly = PolynomialFeatures(4)
X = poly.fit_transform(data['age'].to_frame())
y = data['wage']
# X.shape
model = sm.OLS(y,X).fit()
print(model.summary())
# So, what we want here is not only the final line, but also the standart error related to the line
# TO find that we need to calcualte the predictions for some values of age
test_ages = np.linspace(data['age'].min(),data['age'].max(),100)
X_test = poly.transform(test_ages.reshape(-1,1))
pred = model.predict(X_test)
plt.figure(figsize = (12,8))
plt.scatter(data['age'],data['wage'],facecolors='none', edgecolors='darkgray')
plt.plot(test_ages,pred)
Here data is WAGE data which is available in R.
This is the resulting graph i get -
I have used bootstraping to calculate the confidence intervals, for this i have used a self customed module -
import numpy as np
import pandas as pd
from tqdm import tqdm
class Bootstrap_ci:
def boot(self,X_data,y_data,R,test_data,model):
predictions = []
for i in tqdm(range(R)):
predictions.append(self.alpha(X_data,y_data,self.get_indices(X_data,200),test_data,model))
return np.percentile(predictions,2.5,axis = 0),np.percentile(predictions,97.5,axis = 0)
def alpha(self,X_data,y_data,index,test_data,model):
X = X_data.loc[index]
y = y_data.loc[index]
lr = model
lr.fit(pd.DataFrame(X),y)
return lr.predict(pd.DataFrame(test_data))
def get_indices(self,data,num_samples):
return np.random.choice(data.index, num_samples, replace=True)
The above module can be used as -
poly = PolynomialFeatures(4)
X = poly.fit_transform(data['age'].to_frame())
y = data['wage']
X_test = np.linspace(min(data['age']),max(data['age']),100)
X_test_poly = poly.transform(X_test.reshape(-1,1))
from bootstrap import Bootstrap_ci
bootstrap = Bootstrap_ci()
li,ui = bootstrap.boot(pd.DataFrame(X),y,1000,X_test_poly,LinearRegression())
This will give us the lower confidence interval, and upper confidence interval.
To plot the graph -
plt.scatter(data['age'],data['wage'],facecolors='none', edgecolors='darkgray')
plt.plot(X_test,pred,label = 'Fitted Line')
plt.plot(X_test,ui,linestyle = 'dashed',color = 'r',label = 'Confidence Intervals')
plt.plot(X_test,li,linestyle = 'dashed',color = 'r')
The resultant graph is
Following code results in the 95% confidence interval
from scipy import stats
confidence = 0.95
squared_errors = (<<predicted values>> - <<true y_test values>>) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
loc=squared_errors.mean(),
scale=stats.sem(squared_errors)))

How to batch a transformed (scaled and quantized) Beta distribution in tensorflow probability

I'm trying to fit a beta distribution to the results of a survey with discrete scores (1, 2, 3, 4, 5).
For that to work I need a working log_prob of a Beta in TensorFlow probability. However, there is a problem with how batching is handled in Beta.
Here is a minimal example that gives me an error:
InvalidArgumentError: Shapes of a and x are inconsistent: [3] vs. [1000,1] [Op:Betainc]
The same code seems to work ok with Normal distribution...
What am I doing wrong here?
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
#Generate fake data
np.random.seed(2)
data = np.random.beta(2.,2.,1000)*5.0
data = np.ceil(data)
data = data[:,None]
# Create a batch of three Beta distributions.
alpha = np.array([1., 2., 3.]).astype(np.float32)
beta = np.array([1., 2., 3.]).astype(np.float32)
bt = tfd.Beta(alpha, beta)
#bt = tfd.Normal(loc=alpha, scale=beta)
#Scale beta to 0-5
scbt = tfd.TransformedDistribution(
distribution=bt,
bijector=tfp.bijectors.AffineScalar(
shift=0.,
scale=5.))
# quantize beta to (1,2,3,4,5)
qdist = tfd.QuantizedDistribution(distribution=scbt,low=1,high=5)
#calc log_prob for 3 distributions
print(np.sum(qdist.log_prob(data),axis=0))
print(qdist.log_prob(data).shape)
TensorFlow 2.0.0
tensorflow_probability 0.8.0
EDIT:
As suggested by Chris Suter. Here is broadcasting by hand solution:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
from matplotlib import pyplot as plt
#Generate fake data
numdata = 100
numbeta = 3
np.random.seed(2)
data = np.random.beta(2.,2.,numdata)
data *= 5.0
data = np.ceil(data)
data = data[:,None].astype(np.float32)
#alpha and beta [[1., 2., 3.]]
alpha = np.expand_dims(np.arange(1,4),0).astype(np.float32)
beta = np.expand_dims(np.arange(1,4),0).astype(np.float32)
#tile to compensate for betainc
alpha = tf.tile(alpha,[numdata,1])
beta = tf.tile(beta,[numdata,1])
data = tf.tile(data,[1,numbeta])
bt = tfd.Beta(concentration1=alpha, concentration0=beta)
scbt = tfd.TransformedDistribution(
distribution=bt,
bijector=tfp.bijectors.AffineScalar(
shift=0.,
scale=5.))
# quantize beta to (1,2,3,4,5)
qdist = tfd.QuantizedDistribution(distribution=scbt,low=1,high=5)
#calc log_prob for numbeta number of distributions
print(np.sum(qdist.log_prob(data),axis=0))
EDIT2: The above solution does not work when I try to apply it in MCMC sampling.
The new code looks like this:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from time import time
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
#Generate fake data
numdata = 100
np.random.seed(2)
data = np.random.beta(2.,2.,numdata)
data *= 5.0
data = np.ceil(data)
data = data[:,None].astype(np.float32)
#tf.function
def sample_chain():
#Parameters of MCMC
num_burnin_steps = 300
num_results = 200
num_chains = 50
step_size = 0.01
#data tensor
outcomes = tf.convert_to_tensor(data, dtype=tf.float32)
def modeldist(alpha,beta):
bt = tfd.Beta(concentration1=alpha, concentration0=beta)
scbt = tfd.TransformedDistribution(
distribution=bt,
bijector=tfp.bijectors.AffineScalar(
shift=0.,
scale=5.))
# quantize beta to (1,2,3,4,5)
qdist = tfd.QuantizedDistribution(distribution=scbt,low=1,high=5)
return qdist
def joint_log_prob(con1,con0):
#manual broadcast
tcon1 = tf.tile(con1[None,:],[numdata,1])
tcon0 = tf.tile(con0[None,:],[numdata,1])
toutcomes = tf.tile(outcomes,[1,num_chains])
#model distribution with manual broadcast
dist = modeldist(tcon1,tcon0)
#joint log prob
return tf.reduce_sum(dist.log_prob(toutcomes),axis=0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn=joint_log_prob,
num_leapfrog_steps=5,
step_size=step_size)
kernel = tfp.mcmc.SimpleStepSizeAdaptation(
inner_kernel=kernel, num_adaptation_steps=int(num_burnin_steps * 0.8))
init_state = [tf.identity(tf.random.uniform([num_chains])*10.0,name='init_alpha'),
tf.identity(tf.random.uniform([num_chains])*10.0,name='init_beta')]
samples, [step_size, is_accepted] = tfp.mcmc.sample_chain(
num_results=num_results,
num_burnin_steps=num_burnin_steps,
current_state=init_state,
kernel=kernel,
trace_fn=lambda _, pkr: [pkr.inner_results.accepted_results.step_size,
pkr.inner_results.is_accepted])
return samples
samples = sample_chain()
This ends up with an error message:
ValueError: Encountered None gradient. fn_arg_list: [tf.Tensor 'init_alpha:0' shape=(50,) dtype=float32, tf.Tensor 'init_beta:0' shape=(50,) dtype=float32] grads: [None, None]
Sadly tf.math.betainc doesn't support broadcasting at present, which causes the cdf computation, which QuantizedDistribution calls, to fail. If you must use Beta, the only workaround I can think of is to broadcast "manually" by tiling the data and Beta params.
Alternatively, you might be able to get away with using the Kumaraswamy distribution, which is similar to Beta but has some nicer analytical properties.

Implementing minimization in SciPy

I am trying to implement the 'Iterative hessian Sketch' algorithm from https://arxiv.org/abs/1411.0347 page 12. However, I am struggling with step two which needs to minimize the matrix-vector function.
Imports and basic data generating function
import numpy as np
import scipy as sp
from sklearn.datasets import make_regression
from scipy.optimize import minimize
import matplotlib.pyplot as plt
%matplotlib inline
from numpy.linalg import norm
def generate_data(nsamples, nfeatures, variance=1):
'''Generates a data matrix of size (nsamples, nfeatures)
which defines a linear relationship on the variables.'''
X, y = make_regression(n_samples=nsamples, n_features=nfeatures,\
n_informative=nfeatures,noise=variance)
X[:,0] = np.ones(shape=(nsamples)) # add bias terms
return X, y
To minimize the matrix-vector function, I have tried implementing a function which computes the quanity I would like to minimise:
def f2min(x, data, target, offset):
A = data
S = np.eye(A.shape[0])
#S = gaussian_sketch(nrows=A.shape[0]//2, ncols=A.shape[0] )
y = target
xt = np.ravel(offset)
norm_val = (1/2*S.shape[0])*norm(S#A#(x-xt))**2
#inner_prod = (y - A#xt).T#A#x
return norm_val - inner_prod
I would eventually like to replace S with some random matrices which can reduce the dimensionality of the problem, however, first I need to be confident that this optimisation method is working.
def grad_f2min(x, data, target, offset):
A = data
y = target
S = np.eye(A.shape[0])
xt = np.ravel(offset)
S_A = S#A
grad = (1/S.shape[0])*S_A.T#S_A#(x-xt) - A.T#(y-A#xt)
return grad
x0 = np.zeros((X.shape[0],1))
xt = np.zeros((2,1))
x_new = np.zeros((2,1))
for it in range(1):
result = minimize(f2min, x0=xt,args=(X,y,x_new),
method='CG', jac=False )
print(result)
x_new = result.x
I don't think that this loop is correct at all because at the very least there should be some local convergence before moving on to the next step. The output is:
fun: 0.0
jac: array([ 0.00745058, 0.00774882])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 416
nit: 0
njev: 101
status: 2
success: False
x: array([ 0., 0.])
Does anyone have an idea if:
(1) Why I'm not achieving convergence at each step
(2) I can implement step 2 in a better way?

One-hot encoding Tensorflow Strings

I have a list of strings as labels for training a neural network. Now I want to convert them via one_hot encoding so that I can use them for my tensorflow network.
My input list looks like this:
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']
The requested outcome should be something like
one_hot [0,1,0,2,0]
What is the easiest way to do this? Any help would be much appreciated.
Cheers,
Andi
the desired outcome looks like LabelEncoder in sklearn, not like OneHotEncoder - in tf you need CategoryEncoder - BUT it is A preprocessing layer which encodes integer features.:
inp = layers.Input(shape=[X.shape[0]])
x0 = layers.CategoryEncoding(
num_tokens=3, output_mode="multi_hot")(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer= 'adam',
loss='categorical_crossentropy',
metrics=[tf.keras.metrics.CategoricalCrossentropy()])
print(model.summary())
this part gets encoding of unique values... And you can make another branch in this model to input your initial vector & fit it according labels from this reference-branch (it is like join reference-table with fact-table in any database) -- here will be ensemble of referenced-data & your needed data & output...
pay attention to -- num_tokens=3, output_mode="multi_hot" -- are being given explicitly... AND numbers from class_names get apriory to model use, as is Feature Engineering - like this (in pd.DataFrame)
import numpy as np
import pandas as pd
d = {'transport_col':['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']}
dataset_df = pd.DataFrame(data=d)
classes = dataset_df['transport_col'].unique().tolist()
print(f"Label classes: {classes}")
df= dataset_df['transport_col'].map(classes.index).copy()
print(df)
from manual example REF: Encode the categorical label into an integer.
Details: This stage is necessary if your classification label is represented as a string. Note: Keras expected classification labels to be integers.
in another architecture, perhaps, you could use StringLookup
vocab= np.array(np.unique(labels))
inp = tf.keras.Input(shape= labels.shape[0], dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocab)(inp)
but labels are dependent vars usually, as opposed to features, and shouldn't be used at Input
Everything in keras.docs
possible FULL CODE:
import numpy as np
import pandas as pd
import keras
X = np.array([['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']])
vocab= np.unique(X)
print(vocab)
y= np.array([[0,1,0,2,0]])
inp = layers.Input(shape=[X.shape[0]], dtype='string')
x0= tf.keras.layers.StringLookup(vocabulary=vocab, name='finish')(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer= 'adam',
loss='categorical_crossentropy',
metrics=[tf.keras.metrics.categorical_crossentropy])
print(model.summary())
from tensorflow.keras import backend as K
for layerIndex, layer in enumerate(model.layers):
print(layerIndex)
func = K.function([model.get_layer(index=0).input], layer.output)
layerOutput = func([X]) # input_data is a numpy array
print(layerOutput)
if layerIndex==1: # the last layer here
scale = lambda x: x - 1
print(scale(layerOutput))
res:
[[0 1 0 2 0]]
another possible Solution for your case - layers.TextVectorization
import numpy as np
import keras
input_array = np.atleast_2d(np.array(['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']))
vocab= np.unique(input_array)
input_data = keras.Input(shape=(None,), dtype='string')
layer = layers.TextVectorization( max_tokens=None, standardize=None, split=None, output_mode="int", vocabulary=vocab)
int_data = layer(input_data)
model = keras.Model(inputs=input_data, outputs=int_data)
output_dataset = model.predict(input_array)
print(output_dataset) # starts from 2 ... probably [0, 1] somehow concerns binarization ?
scale = lambda x: x - 2
print(scale(output_dataset))
result:
array([[0, 1, 0, 2, 0]])