Using PyMC3 to fit a Lomax model - Bayesian

I have a pretty simple example that doesn't seem to work. My goal is to build a Lomax model, and since PyMC3 doesn't have a Lomax distribution I use the fact that an Exponential mixed with a Gamma is a Lomax (see here):
import pymc3 as pm
from scipy.stats import lomax

# Generate artificial data with a shape and scale parameterization
data = lomax.rvs(c=2.5, scale=3, size=1000)

# if t ~ Exponential(lamda) and lamda ~ Gamma(shape, rate), then t ~ Lomax(shape, rate)
with pm.Model() as hierarchical:
    shape = pm.Uniform('shape', 0, 10)
    rate = pm.Uniform('rate', 0, 10)
    lamda = pm.Gamma('lamda', alpha=shape, beta=rate)
    t = pm.Exponential('t', lam=lamda, observed=data)
    trace = pm.sample(1000, tune=1000)
The summary is:
>>> pm.summary(trace)
           mean        sd  mc_error   hpd_2.5  hpd_97.5   n_eff      Rhat
shape  4.259874  2.069418  0.060947  0.560821  8.281654  1121.0  1.001785
rate   6.532874  2.399463  0.068837  2.126299  9.998271  1045.0  1.000764
lamda  0.513459  0.015924  0.000472  0.483754  0.545652  1096.0  0.999662
I would expect the shape and rate estimates to be close to 2.5 and 3 respectively. I tried various non-informative priors for shape and rate, including pm.HalfFlat() and pm.Uniform(0, 100), but both resulted in worse fits. Any ideas?

Figured it out: to derive a Lomax from an exponential-gamma mixture, I need to specify a lamda for each example in the dataset (lamda = pm.Gamma('lamda', alpha=shape, beta=rate, shape=len(data))). This is because the model assumes each subject in the data has its own lamda_i, where lamda_i ~ Gamma(shape, rate) for every i.
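For completeness, a sketch of the corrected model (same data and priors as above, only the shape argument of the Gamma changes):
import pymc3 as pm
from scipy.stats import lomax

data = lomax.rvs(c=2.5, scale=3, size=1000)

with pm.Model() as hierarchical:
    shape = pm.Uniform('shape', 0, 10)
    rate = pm.Uniform('rate', 0, 10)
    # one lamda_i per observation: mixing the Exponential over this Gamma gives a Lomax
    lamda = pm.Gamma('lamda', alpha=shape, beta=rate, shape=len(data))
    t = pm.Exponential('t', lam=lamda, observed=data)
    trace = pm.sample(1000, tune=1000)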

Related

How do I pre-process the dataset if the feature ranges are too wide?

I have a dataset with 5 features, each column in a different range of numbers. I have tried using MinMaxScaler and StandardScaler, but the accuracy for this multi-class problem is too low.
If StandardScaler and MinMaxScaler don't have the desired effect, then another thing to check for is skewed data:
# Check the skew of all numerical features (all_data is your feature DataFrame)
from scipy.stats import skew
import pandas as pd

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Lower is better. If you get high skew values, you can apply a transform (log, Box-Cox, etc.) to make the data distribution more normal in shape.
Correcting for skew:
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam_f = 0.15
for feat in skewed_features:
    # all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam_f)
Other things to try:
Either remove outliers or try RobustScaler()
PowerTransformer()
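For example, a minimal sketch of both (assuming X is your numeric feature matrix):
from sklearn.preprocessing import RobustScaler, PowerTransformer

# RobustScaler centers on the median and scales by the IQR, so outliers have less influence
X_robust = RobustScaler().fit_transform(X)

# PowerTransformer (Yeo-Johnson by default) makes the feature distributions more Gaussian-like
X_power = PowerTransformer().fit_transform(X)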
Reference: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

TensorFlow Quantum: PQC not optimizing

I have followed the tutorial available at: https://www.tensorflow.org/quantum/tutorials/mnist. I have modified this tutorial to the simplest example I could think of: an input set in which x increases linearly from 0 to 1 and y = x < 0.3. I then use a PQC with a single Rx gate with a symbol, and a readout using a Z gate.
When retrieving the optimized symbol and adjusting it manually, I can easily find a value that provides 100% accuracy, but when I let the Adam optimizer run, it converges to either always predicting 1 or always predicting -1. Does anybody spot what I am doing wrong? (I apologize for not being able to break the code down into a smaller example.)
import tensorflow as tf
import tensorflow_quantum as tfq
import cirq
import sympy
import numpy as np

# used to embed classical data in quantum circuits
def convert_to_circuit_cont(image):
    """Encode truncated classical image into quantum datapoint."""
    values = np.ndarray.flatten(image)
    qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    for i, value in enumerate(values):
        if value:
            circuit.append(cirq.rx(value).on(qubits[i]))
    return circuit

# define classical dataset
length = 1000
np.random.seed(42)
# create a linearly increasing set for x from 0 to 1 in 1/length steps
x_train_sorted = np.asarray([[x/length] for x in range(0, length)], dtype=np.float32)
# p is used to shuffle x and y similarly
p = np.random.permutation(len(x_train_sorted))
x_train = x_train_sorted[p]
# y = x < 0.3 in {-1, 1} for Hinge loss
y_train_sorted = np.asarray([1 if (x/length) < 0.30 else -1 for x in range(0, length)])
y_train = y_train_sorted[p]
# test == train for this example
x_test = x_train_sorted[:]
y_test = y_train_sorted[:]

# convert classical data into quantum circuits
x_train_circ = [convert_to_circuit_cont(x) for x in x_train]
x_test_circ = [convert_to_circuit_cont(x) for x in x_test]
x_train_tfcirc = tfq.convert_to_tensor(x_train_circ)
x_test_tfcirc = tfq.convert_to_tensor(x_test_circ)

# define the PQC circuit, consisting of 1 qubit with 1 gate (Rx) and 1 parameter
def create_quantum_model():
    data_qubits = cirq.GridQubit.rect(1, 1)
    circuit = cirq.Circuit()
    a = sympy.Symbol("a")
    circuit.append(cirq.rx(a).on(data_qubits[0]))
    return circuit, cirq.Z(data_qubits[0])

model_circuit, model_readout = create_quantum_model()

# Build the Keras model.
model = tf.keras.Sequential([
    # The input is the data-circuit, encoded as a tf.string
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1, 1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# used for logging progress during optimization
def hinge_accuracy(y_true, y_pred):
    y_true = tf.squeeze(y_true) > 0.0
    y_pred = tf.squeeze(y_pred) > 0.0
    result = tf.cast(y_true == y_pred, tf.float32)
    return tf.reduce_mean(result)

# compile the model with Hinge loss and Adam, as done in the example. Have tried various learning_rates
model.compile(
    loss=tf.keras.losses.Hinge(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    metrics=[hinge_accuracy])

EPOCHS = 20
BATCH_SIZE = 32
NUM_EXAMPLES = 1000

# fit the model
qnn_history = model.fit(
    x_train_tfcirc, y_train,
    batch_size=32,
    epochs=EPOCHS,
    verbose=1,
    validation_data=(x_test_tfcirc, y_test),
    use_multiprocessing=False)

results = model.predict(x_test_tfcirc)
results_mapped = [-1 if x <= 0 else 1 for x in results[:, 0]]
print(np.sum(np.equal(results_mapped, y_test)))
After 20 epochs of optimization, I get the following:
1000/1000 [==============================] - 0s 410us/sample - loss: 0.5589 - hinge_accuracy: 0.6982 - val_loss: 0.5530 - val_hinge_accuracy: 0.7070
This results in 700 samples out of 1000 predicted correctly. Looking at the mapped results, this is because all samples are predicted as -1. Looking at the raw results, they decrease roughly linearly from -0.5484014 to -0.99996257.
When retrieving the weight with w = model.layers[0].get_weights(), subtracting 0.8, and setting it again with model.layers[0].set_weights(w), I get 920/1000 correct. Fine-tuning this process allows me to achieve 1000/1000.
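For reference, a rough sketch of that manual adjustment (the 0.8 offset is simply the shift found by hand above):
# get_weights() returns a list of numpy arrays; the PQC layer holds a single symbol here
w = model.layers[0].get_weights()
w[0] = w[0] - 0.8
model.layers[0].set_weights(w)
results = model.predict(x_test_tfcirc)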
Update 1:
I have also printed the update of the weight over the various epochs:
4.916246, 4.242602, 3.3765688, 2.6855211, 2.3405066, 2.206207, 2.1734586, 2.1656137, 2.1510274, 2.1634471, 2.1683235, 2.188944, 2.1510284, 2.1591303, 2.1632445, 2.1542525, 2.1677444, 2.1702878, 2.163104, 2.1635907
I set the weight to 1.36, a value which gives 908/1000 (as opposed to 700/1000). The optimizer moves away from it:
1.7992111, 2.0727847, 2.1370323, 2.15711, 2.1686404, 2.1603785, 2.183334, 2.1563332, 2.156857, 2.169908, 2.1658351, 2.170673, 2.1575692, 2.1505954, 2.1561477, 2.1754034, 2.1545155, 2.1635509, 2.1464484, 2.1707492
One thing I noticed is that the hinge accuracy was 0.75 with the weight at 1.36, which is higher than the 0.7 at 2.17. If that is the case, either I am in an unlucky part of the optimization landscape, where the minimum of the loss does not correspond to the maximum number of correctly classified examples, or the loss value is determined incorrectly. This is what I will investigate next.
The minimum of the hinge loss function for this example does not coincide with the maximum number of correctly classified examples. Please see the plot of both as a function of the parameter value. Given that the optimizer works towards the minimum of the loss, not the maximum number of correctly classified examples, the code (and framework/optimizer) does what it is supposed to do. Alternatively, one could use a different loss function to try to find a better fit, for example a binarized L1 loss. That function would have the same global optimum, but would likely have a very flat landscape.
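To see the same effect without the TFQ machinery, here is a rough numpy sketch of the two landscapes, assuming the single-qubit circuit reduces to an expectation value of cos(x + a) (Rx(x) data encoding followed by the Rx(a) PQC, read out with Z):
import numpy as np

x = np.linspace(0, 1, 1000)
y = np.where(x < 0.3, 1, -1)

def expectation(a):
    # assumed closed form: <Z> after Rx(x) then Rx(a) applied to |0> is cos(x + a)
    return np.cos(x + a)

def hinge_loss(a):
    return np.mean(np.maximum(0.0, 1.0 - y * expectation(a)))

def accuracy(a):
    return np.mean((expectation(a) > 0) == (y > 0))

grid = np.linspace(0.0, 2 * np.pi, 2000)
a_min_loss = grid[np.argmin([hinge_loss(a) for a in grid])]
a_max_acc = grid[np.argmax([accuracy(a) for a in grid])]
print(a_min_loss, a_max_acc)  # the two optima need not coincide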

Denoising autoencoder - training with added noise on custom interval

I'm trying to understand denoising autoencoders. I've followed this keras tutorial - https://blog.keras.io/building-autoencoders-in-keras.html
In the tutorial, the training data is created by adding artificial noise in the following way:
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)
which produces:
Which means that both the noise and the underlying data from the MNIST dataset have values between 0 and 1.
After applying the trained model, most of the noise is correctly removed:
I'm trying to train the model with only very little artificial noise, but on the interval -5 to 5, as follows:
def noise_matrix(arr, num, min, max):
    m = np.product(arr.shape)
    arr.ravel()[np.random.randint(0, m, size=num)] = np.random.uniform(min, max, num)
    return arr

x_train_noisy = noise_matrix(x_train, x_train.shape[0] * 2, -5, 5)
x_test_noisy = noise_matrix(x_test, x_test.shape[0] * 2, -5, 5)
which produces:
(The differences in contrast in the above picture are caused by implicit normalization in the matplotlib library.)
Now, when I train the autoencoder and apply the model, I'm getting the following result:
Most of the noise is not removed. What steps do I need to take in order to remove the noise from the interval (-5, 5)? I've tried normalizing all the data to the interval (0, 1) after adding the noise, but that is not the way to go (I was getting very bad results with this approach).
The decoded image still has obvious noise. Since you did not provide the code used to fit the autoencoder, I am guessing that you are fitting it on the noised data, ae.fit(x=x_noised, y=x_noised),
whereas you should be fitting on the original data:
ae.fit(x=x_noised, y=x_original)
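In the tutorial's notation this corresponds roughly to the following (autoencoder, x_train, x_train_noisy, etc. follow the tutorial's naming):
# noisy images as input, clean images as the reconstruction target
autoencoder.fit(x_train_noisy, x_train,
                epochs=100,
                batch_size=128,
                shuffle=True,
                validation_data=(x_test_noisy, x_test))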

Sampling from posterior using custom likelihood in pymc3

I'm trying to create a custom likelihood using pymc3. The distribution is the generalized extreme value (GEV) distribution, which has location (loc), scale (scale) and shape (c) parameters.
The main idea is to choose a beta distribution as a prior for the shape parameter and to fix the location and scale parameters in the GEV likelihood.
The GEV distribution is not among the PyMC3 standard distributions, so I have to create a custom likelihood. I googled it and found out that I should use the DensityDist method, but I can't figure out why my attempt is incorrect.
See the code below:
import pymc3 as pm
import numpy as np
from theano.tensor import exp

data = np.random.randn(20)

with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2
    gev = pm.DensityDist('gev', lambda value: exp(-1 + c * (((value - loc) / scale) ^ (1 / c))), testval=1)
    modelo = pm.gev(loc=loc, scale=scale, c=c, observed=data)
    step = pm.Metropolis()
    trace = pm.sample(1000, step)

pm.traceplot(trace)
I'm sorry in advance if this is a dumb question, but I couldn't figure it out.
I'm studying annual maximum flows and I'm trying to implement the methodology described in "Generalized maximum-likelihood generalized extreme-value quantile estimators for hydrologic data" by Martins and Stedinger.
If you mean the generalized extreme value distribution (https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution), then something like this should work (for c != 0):
import pymc3 as pm
import numpy as np
import theano.tensor as tt
from pymc3.distributions.dist_math import bound

data = np.random.randn(20)

with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2

    def gev_logp(value):
        # GEV log-density for c != 0
        scaled = (value - loc) / scale
        logp = -(tt.log(scale)
                 + ((c + 1) / c) * tt.log1p(c * scaled)
                 + (1 + c * scaled) ** (-1 / c))
        # the support boundary depends on the sign of c
        alpha = loc - scale / c
        bounds = tt.switch(c > 0, value > alpha, value < alpha)
        return bound(logp, bounds, c != 0)

    gev = pm.DensityDist('gev', gev_logp, observed=data)
    trace = pm.sample(2000, tune=1000, njobs=4)

pm.traceplot(trace)
Your logp function was invalid: exponentiation is ** in Python, and part of the expression wasn't valid for negative values.
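A quick plain-Python illustration of that first point:
2 ** 3    # 8: exponentiation
2 ^ 3     # 1: bitwise XOR on integers
# with floats, ^ raises a TypeError, so ((value - loc) / scale) ^ (1 / c) cannot be exponentiation; use ** instead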

My TensorFlow Gradient Descent diverges

import tensorflow as tf
import pandas as pd
import numpy as np

def normalize(data):
    return data - np.min(data) / np.max(data) - np.min(data)

df = pd.read_csv('sat.csv', skipinitialspace=True)

x_reading = df['reading_score']
x_math = df['math_score']
x_reading, x_math = np.array(x_reading[df.reading_score != 's']), np.array(x_math[df.math_score != 's'])
x_data = normalize(np.float32(np.array([x_reading, x_math])))

y_writing = df[['writing_score']]
y_data = normalize(np.float32(np.array(y_writing[df.writing_score != 's'])))

W = tf.Variable(tf.random_uniform([1, 2], -.5, .5))  # float32
b = tf.Variable(tf.ones([1]))
y = tf.matmul(W, x_data) + b

loss = tf.reduce_mean(tf.square(y - y_data.T))
optimizer = tf.train.GradientDescentOptimizer(0.005)
train = optimizer.minimize(loss)

init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for step in range(1000):
        sess.run(train)
        print step, sess.run(W), sess.run(b), sess.run(loss)
Here's my code. My sat.csv contains reading, writing and math SAT scores. As you can guess, the differences between the features are not that big.
This is a part of sat.csv.
DBN,SCHOOL NAME,Num of Test Takers,reading_score,math_score,writing_score
01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384
01M515,LOWER EAST SIDE PREPARATORY HIGH SCHOOL,112,332,557,316
01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND MATH HIGH SCHOOL",159,522,574,525
01M650,CASCADES HIGH SCHOOL,18,417,418,411
01M696,BARD HIGH SCHOOL EARLY COLLEGE,130,624,604,628
02M047,47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL,16,395,400,387
I've only used the math, writing and reading scores. My goal for the code above is to predict the writing score given the math and reading scores.
I've never seen TensorFlow's gradient descent diverge on such simple data. What could be wrong?
Here are a few options you could try:
Normalise your input and output data (see the sketch below)
Set smaller initial values for your weights
Use a lower learning rate
Divide your loss by the number of samples you have (not putting your data in a placeholder is already uncommon)
Let me know which (if any) of these options helped, and good luck!
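On the first point: because of operator precedence, the normalize function in the question computes data - (min / max) - min rather than min-max scaling. A correctly parenthesized version would be:
import numpy as np

def normalize(data):
    # scale all values into [0, 1]
    return (data - np.min(data)) / (np.max(data) - np.min(data))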