pymc MAP warning: Stochastic tau's value is neither numerical nor array with floating-point dtype. Recommend fitting method fmin (default)

I have looked at a similar question here:
pymc warning: value is neither numerical nor array with floating-point dtype
but it has no answers. Can someone please tell me whether I should ignore this warning, and if not, what to do about it?
The model has (among others) a stochastic variable tau, which is DiscreteUniform. The relevant code for the model follows:
import numpy as np
import pymc as pm

# count_data, n_count_data and alpha are defined earlier in the script
tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
print("Initial values: ", tau.value, lambda_1.value, lambda_2.value)

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # rate before the changepoint
    out[tau:] = lambda_2  # rate after the changepoint
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])
m = pm.MAP(model)  # **this line causes the warning**
print("Output after using MAP: ", tau.value, lambda_1.value, lambda_2.value)

The pymc documentation says "MAP can only handle variables whose dtype is float". Your tau comes from a discrete distribution, so it has an integer dtype. If you call the fit method to estimate maximum a posteriori values of your parameters, tau will be treated as a float. Whether that makes sense, and hence whether you can ignore this warning, depends on the problem at hand. If your likelihood is well behaved, you might end up with a float value of tau that is close to the integer value you actually want to estimate. But if your likelihood is 0 for all non-integer values, a gradient method like fmin will not work and your maximum a posteriori values will be meaningless. In that case you would have to find a different way to compute your maximum a posteriori values (again, depending on the problem at hand).
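For example, if you wanted to sidestep MAP entirely, a minimal sketch (assuming the model from the question) would estimate tau by MCMC instead, since Metropolis sampling handles discrete variables directly:
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)  # 40000 iterations, first 10000 discarded as burn-in

# use the posterior mode (most frequent integer) as a point estimate of tau
tau_samples = mcmc.trace("tau")[:]
tau_estimate = np.bincount(tau_samples).argmax()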

Related

Set the objective of an optimizer as the standard deviation of the input (nonlinear optimization using Pyomo)

I am trying to use Pyomo for a single-objective nonlinear optimization problem.
The objective is to minimize the variance (or standard deviation) of the input variables subject to certain constraints (something I was able to do in Excel).
Following is a code example of what I am trying to do:
import pyomo.environ as pyo
from statistics import stdev

model = pyo.ConcreteModel()

# declare decision variables
model.x1 = pyo.Var(domain=pyo.NonNegativeReals)
model.x2 = pyo.Var(domain=pyo.NonNegativeReals)
model.x3 = pyo.Var(domain=pyo.NonNegativeReals)
model.x4 = pyo.Var(domain=pyo.NonNegativeReals)

# declare objective
model.variance = pyo.Objective(
    expr=stdev([model.x1, model.x2, model.x3, model.x4]),
    sense=pyo.minimize)

# declare constraints
model.max_charging = pyo.Constraint(expr=model.x1 + model.x2 + model.x3 + model.x4 >= 500)
model.max_x1 = pyo.Constraint(expr=model.x1 <= 300)
model.max_x2 = pyo.Constraint(expr=model.x2 <= 200)
model.max_x3 = pyo.Constraint(expr=model.x3 <= 100)
model.max_x4 = pyo.Constraint(expr=model.x4 <= 200)

# solve
pyo.SolverFactory('glpk').solve(model).write()

# print
print("energy_price = ", model.variance())
print(f'Variables = [{model.x1()},{model.x2()},{model.x3()},{model.x4()}]')
The error I get is: TypeError: can't convert type 'ScalarVar' to numerator/denominator.
The problem seems to be caused by using the stdev function from statistics.
My assumption is that the model's variables x1-x4 have not yet been assigned values and that this is the main issue. However, I am not sure how to approach this?
First: stdev is nonlinear, so why even try to solve this with a linear solver?
Pyomo does not know about the statistics package. You'll have to program the standard deviation using elementary operations, use an external-function approach, or use an approximation, like minimizing the range (see the sketch below).
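For the approximation route, here is a minimal sketch of minimizing the range instead (the variable and constraint names are illustrative, not from the question): the range (max minus min) can be modeled with two auxiliary variables and linear constraints, so even a linear solver like glpk can handle it.
import pyomo.environ as pyo

model = pyo.ConcreteModel()
model.I = pyo.RangeSet(1, 4)
model.x = pyo.Var(model.I, domain=pyo.NonNegativeReals)
model.lo = pyo.Var()  # driven up to min(x) when minimizing hi - lo
model.hi = pyo.Var()  # driven down to max(x) when minimizing hi - lo

# hi must sit above every x[i]; lo must sit below every x[i]
model.hi_c = pyo.Constraint(model.I, rule=lambda m, i: m.hi >= m.x[i])
model.lo_c = pyo.Constraint(model.I, rule=lambda m, i: m.lo <= m.x[i])
model.total = pyo.Constraint(expr=sum(model.x[i] for i in model.I) >= 500)

# minimizing the range hi - lo is a linear objective
model.range_obj = pyo.Objective(expr=model.hi - model.lo, sense=pyo.minimize)
pyo.SolverFactory('glpk').solve(model)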
So I managed to solve this issue and I am including the solution below. But first, there are a couple of points I would like to highlight:
As @Erwin Kalvelagen mentioned, stdev is nonlinear, so a linear solver like glpk will always result in an error. For my problem, ipopt worked fine, but be careful, as it can perform poorly in some cases.
Also, as @Erwin Kalvelagen mentioned, Pyomo does not know about the statistics package. So when you try to use a function from that package (e.g., stdev, variance, etc.), it will try to evaluate the model variables before the solver has assigned them any value, and that will cause an error.
A Pyomo objective function needs an expression. The code sample below shows how to dynamically generate an expression for the variance without using an external package. The code is also agnostic to the number of model variables you have.
Using either the variance or the standard deviation serves the same purpose for my project. I opted for the variance to avoid calculating its square root (the standard deviation is the square root of the variance).
Variability Objective Function
import pyomo.environ as pyo

def variabilityRule(model):
    # calculate the mean of all model variables
    mean = 0
    for index in model.x:
        mean += model.x[index]
    mean = mean / len(model.x)
    # sum the squared difference between each model variable and the mean
    obj_exp = (model.x[1] - mean) * (model.x[1] - mean)
    for i in range(2, len(model.x) + 1):  # note that VarList indices start at 1, not 0
        obj_exp += (model.x[i] - mean) * (model.x[i] - mean)
    # divide by the number of items to get the (population) variance
    obj_exp = obj_exp / len(model.x)
    return obj_exp

model = pyo.ConcreteModel()
model.objective = pyo.Objective(
    rule=variabilityRule,
    sense=pyo.minimize)  # minimizing, since the goal is minimum variance
EDIT: Standard Deviation Calculation
You can use the same approach to calculate the standard deviation, as I found out. Just raise the final expression (`obj_exp`) to the power 0.5:
obj_exp = obj_exp ** 0.5
...
P.S. If you are interested, this is how I dynamically generated my model variables based on an input array:
model.x = pyo.VarList(domain=pyo.NonNegativeReals)
for i in range(len(input_array)):
    model.x.add()
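Putting the pieces together, a hedged end-to-end sketch (the input_array values and the total constraint are made up for illustration; variabilityRule is the function above):
import pyomo.environ as pyo

input_array = [300, 200, 100, 200]  # hypothetical data, one entry per variable

model = pyo.ConcreteModel()
model.x = pyo.VarList(domain=pyo.NonNegativeReals)
for _ in input_array:
    model.x.add()

model.total = pyo.Constraint(expr=sum(model.x[i] for i in model.x) >= 500)
model.objective = pyo.Objective(rule=variabilityRule, sense=pyo.minimize)

pyo.SolverFactory('ipopt').solve(model)  # nonlinear solver, as discussed above
print([pyo.value(model.x[i]) for i in model.x])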

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays, a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of the elements that fall in the upper and lower quartiles of C, and use them to create new arrays A_low(theta) = A[lower-quartile indices of C] and A_high(theta) = A[upper-quartile indices of C]. Clearly these two depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far suggest not: I have tried the Python libraries autograd, JAX and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking is either a proof that the derivative is not defined (or cannot be analytically computed) or, if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) w.r.t. theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax
import jax.numpy as jnp
from jax import lax

N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))

def f(A, B, theta, k=3):
    C = B @ theta
    _, i_upper = lax.top_k(C, k)
    _, i_lower = lax.top_k(-C, k)
    return A[i_lower], A[i_upper]

x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp

A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])

def f(A, theta):
    return A[jnp.argmax(theta)]

x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
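If you nevertheless need a usable gradient to minimize f(A_low, A_high) w.r.t. theta, one common workaround (a sketch, not part of the answer above) is to replace the hard top-k selection with a smooth relaxation, for example softmax weights over C; the weights concentrate on the largest entries of C as the temperature goes to zero, yet remain differentiable everywhere:
import jax
import jax.numpy as jnp

def soft_high_mean(A, B, theta, temperature=0.1):
    # smooth stand-in for "mean of A at the top indices of C"
    C = B @ theta
    weights = jax.nn.softmax(C / temperature)  # soft selection of large C entries
    return jnp.sum(weights * A)

grad_fn = jax.grad(soft_high_mean, argnums=2)  # generally nonzero now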

Getting nans for gradient

I am trying to create a search relevance model where I take the dot product between the query vector and the resulting document vectors. I add a positional bias term on top to account for the fact that position 1 is more likely to be clicked on. The final (unnormalised) log-likelihood calculation is as follows:
query = self.query_model(query_input_ids, query_attention_mask)
docs = self.doc_model(doc_input_ids, doc_attention_mask)
positional_bias = self.position_model()

if optimizer_idx is not None:
    if optimizer_idx == 0:
        docs = docs.detach()
        positional_bias = positional_bias.clone().detach()
    elif optimizer_idx == 1:
        query = query.detach()
        positional_bias = positional_bias.clone().detach()
    else:
        query = query.detach()
        docs = docs.detach()

similarity = (docs @ query.unsqueeze(-1)).squeeze()
click_log_lik = (similarity + positional_bias)\
    .reshape(doc_mask.shape)\
    .masked_fill_((1 - doc_mask).bool(), float("-inf"))
The query and doc models are simply DistilBERT models with a projection layer on top of the CLS token. The models can be seen here: https://pastebin.com/g21g9MG3
When inspecting the first gradient-descent step, the gradient has NaNs, but only for the query model, not the doc model. My hypothesis is that normalizing the return values of the doc and query models (return F.normalize(out, dim=-1)) is somehow interfering with the gradients.
Does anyone know 1. if my hypothesis is true and, more importantly, 2. how I can rectify the NaN gradients?
Additional Info:
None of the losses are inf or nan.
query is BS x 768
docs is BS x DOC_RESULTS x 768
positional_bias is DOC_RESULTS
DOC_RESULTS is 10 in my case.
The masked_fill in the last line is because occasionally I have fewer than 10 data points for a query.
Update 1
The following changes made no difference to the NaNs:
Changing masked_fill from -inf to 1e5.
Changing the projection from F.normalize(out, dim=-1) to out / 100.
Removing the positional bias altogether, again with no luck.
If it helps anyone who comes across this while using Transformers, this is what I did:
In the end, the bug was due to the fact that I was masking away NaNs. Since I had some documents with zero length, the output of the transformer was NaN. I was hoping that masked_fill would fix this problem, but it doesn't. The solution in my case was to put only non-zero-length sequences through the transformer, and then pad with zeros to fill out the batch.
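A minimal sketch of that fix (tensor names are illustrative, not the original code; assume a hidden size of 768 and that this runs inside the model's forward pass):
import torch

lengths = doc_attention_mask.sum(dim=-1)  # number of real tokens per sequence
nonzero = lengths > 0                     # which sequences are non-empty

# zero-filled output; empty documents keep an all-zero embedding
out = torch.zeros(doc_input_ids.shape[0], 768, device=doc_input_ids.device)
out[nonzero] = self.doc_model(doc_input_ids[nonzero],
                              doc_attention_mask[nonzero])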

struggling to minimize a nonlinear function

I want to minimize a nonlinear function of 3 arguments (x1, x2 and x3).
My sources of information are:
the explanation of the minimization function:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
and an example they provide:
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
I do not come from a mathematical field, so first off forgive me if I am using incorrect wording/expressions.
This is my code:
import numpy as np
from scipy.optimize import minimize

def rosen(x1, x2, x3):
    return np.sqrt((x1**2)*0.002 + (x2**2)*0.0035 + (x3**2)*0.0015
                   + 2*x1*x2*0.015 + 2*x1*x3*0.01 + 2*x2*x3*0.02)
I think that the first step is okay up to here.
Then it is required to state the:
x0 : ndarray
Initial guess. len(x0) is the dimensionality of the minimization problem.
Given that I am passing 3 arguments to the minimization function, I should state a 3-element array, like this?
x0=np.array([1,1,1])
res = minimize(rosen, x0)
print(res.x)
The undesired output is:
rosen() missing 2 required positional arguments: 'x2' and 'x3'
I do not really understand where I should state the positional arguments.
Apart from that, I would like to set some bounds on the output values for x1, x2, x3, which I tried as follows:
res = minimize(rosen, x0, bounds=([0,None]),options={"disp": False})
This outputs:
ValueError: length of x0 != length of bounds
How should I express the bounds then?
The desired output is simply an array for x1, x2, x3 that minimizes the function, where each value is at least 0 (as stated in the bounds) and the arguments sum up to 1.
Function-definition
Read the docs carefully, e.g. for your function definition:
fun : callable
The objective function to be minimized. Must be in the form f(x, *args). The optimizing argument, x, is a 1-D array of points, and args is a tuple of any additional fixed parameters needed to completely specify the function.
Your function should take a single 1d array, but you implemented the multi-argument, one-per-variable approach!
Changing:
def rosen(x1, x2, x3):
    return np.sqrt((x1**2)*0.002 + (x2**2)*0.0035 + (x3**2)*0.0015
                   + 2*x1*x2*0.015 + 2*x1*x3*0.01 + 2*x2*x3*0.02)
to:
def rosen(x):
    x1, x2, x3 = x  # unpack the vector for your kind of calculations
    return np.sqrt((x1**2)*0.002 + (x2**2)*0.0035 + (x3**2)*0.0015
                   + 2*x1*x2*0.015 + 2*x1*x3*0.01 + 2*x2*x3*0.02)
should work. This is a bit of a repair-something-to-keep-my-other-code approach, but it won't hurt much in this example. Usually you write your function definition around the 1d-array-input assumption from the start!
Bounds
Again from the docs:
bounds : sequence, optional
Bounds for variables (only for L-BFGS-B, TNC and SLSQP). (min, max) pairs for each element in x, defining the bounds on that parameter. Use None for one of min or max when there is no bound in that direction.
So you need n_vars pairs! That is easily achieved with a list comprehension, deducing the necessary info from x0.
res = minimize(rosen, x0, bounds=[[0,None] for i in range(len(x0))],options={"disp": False})
Make variables sum up to 1 / Constraints
Your comment implies you want the variables to sum up to 1. You then need an equality constraint. Of minimize's solvers, only SLSQP supports both equality and inequality constraints; COBYLA supports only inequality constraints; the rest support no constraints at all. A solver is picked automatically if none is given explicitly.
It looks somewhat like:
cons = ({'type': 'eq', 'fun': lambda x: sum(x) - 1})  # read the docs to understand!
# to think about: sum vs. np.sum (not much difference here)
res = minimize(rosen, x0, bounds=[[0, None] for i in range(len(x0))],
               options={"disp": False}, constraints=cons)
For the case of nonnegative x, this feasible set is usually called the probability simplex.
(untested code; conceptually correct!)
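For completeness, a self-contained sketch assembling the pieces above (requesting SLSQP explicitly, since it is the solver that supports equality constraints; the coefficients are the ones from the question):
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    x1, x2, x3 = x
    return np.sqrt((x1**2)*0.002 + (x2**2)*0.0035 + (x3**2)*0.0015
                   + 2*x1*x2*0.015 + 2*x1*x3*0.01 + 2*x2*x3*0.02)

x0 = np.array([1.0, 1.0, 1.0])
bounds = [(0, None)] * len(x0)                         # each variable >= 0
cons = {'type': 'eq', 'fun': lambda x: np.sum(x) - 1}  # variables sum to 1

res = minimize(rosen, x0, method='SLSQP', bounds=bounds, constraints=cons)
print(res.x, res.fun)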

pymc python change point detection for small probabilities. ZeroProbability Error

I am trying to use pymc to find a change point in a time series. The value I am looking at over time is the probability to "convert", which is very small: 0.009 on average, with a range of 0.001-0.016.
I give the two probabilities a uniform prior between zero and the maximum observation.
alpha = df.cnvrs.max()  # upper bound for the uniform priors
center_1_c = pm.Uniform("center_1_c", 0, alpha)
center_2_c = pm.Uniform("center_2_c", 0, alpha)
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)

@pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
    out = np.zeros(n_days)
    out[:day_c] = center_1_c  # rate before the changepoint
    out[day_c:] = center_2_c  # rate after the changepoint
    return out

observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to pymc, so I'm not sure if I'm missing something obvious. My guess is that I might not have appropriate distributions for modelling small probabilities.
Without more of your output it is impossible to tell where you've introduced this bug (and debugging code is off-topic here in any case). But there is a statistical issue: you've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values, and you've assumed it to be gamma distributed; this will produce an error, because the data has zero probability under a gamma. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.
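In pymc2, one hedged way to localize such a bug (a debugging sketch, not part of the original answer) is to ask each node for its log-probability and see which one raises ZeroProbability:
# assuming model = pm.Model([...]) built from the variables in the question
for var in model.stochastics | model.observed_stochastics:
    try:
        print(var.__name__, var.logp)
    except pm.ZeroProbability:
        print(var.__name__, "has zero probability at its current value")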