(OpenMDAO 2.4.0) 'compute_partials' of a Component seems to run even when 'declare_partials' is forced to FD for this component

I want to solve the Sellar MDA using a Newton nonlinear solver for the Group. I have defined the disciplines with derivatives (using 'compute_partials'), but I want to check the number of calls to the disciplines' 'compute' and 'compute_partials' when forcing, or not forcing, the disciplines to skip their analytic derivatives (using 'declare_partials' in the Problem definition). The problem is that 'compute_partials' still seems to be called even though I force it not to be used.
Here is an example (Sellar). For Discipline 2, I add a counter:
import numpy as np

from openmdao.test_suite.components.sellar import SellarDis1, SellarDis2


class SellarDis2withDerivatives(SellarDis2):
    """
    Component containing Discipline 2 -- derivatives version.
    """

    def _do_declares(self):
        # Analytic Derivs
        self.declare_partials(of='*', wrt='*')
        self.exec_count_d = 0

    def compute_partials(self, inputs, J):
        """
        Jacobian for Sellar discipline 2.
        """
        y1 = inputs['y1']
        if y1.real < 0.0:
            y1 *= -1

        J['y2', 'y1'] = .5 * y1 ** -.5
        J['y2', 'z'] = np.array([[1.0, 1.0]])

        self.exec_count_d += 1
I create an MDA similar to the one in the OpenMDAO docs, but using the SellarDis1withDerivatives and SellarDis2withDerivatives I have created, and changing the nonlinear_solver to NewtonSolver(), like this:
cycle.add_subsystem('d1', SellarDis1withDerivatives(),
                    promotes_inputs=['x', 'z', 'y2'], promotes_outputs=['y1'])
cycle.add_subsystem('d2', SellarDis2withDerivatives(),
                    promotes_inputs=['z', 'y1'], promotes_outputs=['y2'])

# Newton is a gradient-based solver, so the cycle also gets a linear solver
cycle.nonlinear_solver = NewtonSolver()
cycle.linear_solver = DirectSolver()
Then I run the following problem
prob2 = Problem()
prob2.model = SellarMDA()
prob2.setup()
prob2.model.cycle.d1.declare_partials('*', '*', method='fd')
prob2.model.cycle.d2.declare_partials('*', '*', method='fd')
prob2['x'] = 2.
prob2['z'] = [-1., -1.]
prob2.run_model()
count = prob2.model.cycle.d2.exec_count_d
print("Number of derivatives calls (%i)"% (count))
And, as a result, I obtain
=====
cycle
NL: Newton Converged in 3 iterations
Number of derivatives calls (3)
Therefore, it seems that 'compute_partials' is still being called somehow (even though the derivatives are computed with FD). Does someone have an explanation?

I believe this to be a bug (or perhaps an unintended consequence of how derivatives are specified).
This behavior is a by-product of mixed declarations of derivatives, where we allow the user to specify some derivatives on a component to be 'fd' and other derivatives to be analytic. So we are always capable of doing both fd and compute_partials on a component.
There are two changes we could make in OpenMDAO to remedy this:
1) Don't call compute_partials if no derivatives were explicitly declared as analytic.
2) Filter out any variables declared as 'fd' so that if a user tries to set them in compute_partials, a KeyError is raised (or maybe just a warning, and the derivative value is not overwritten).
In the meantime, the only workarounds would be to comment out the compute_partials method, or alternatively enclose the component in a group and finite difference the group.
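For the second workaround, here is a minimal sketch (hedged: it assumes the Sellar classes above and uses approx_totals, the Group-level FD mechanism available in recent OpenMDAO versions):

from openmdao.api import Group


class FDCycle(Group):
    """Hedged sketch: finite difference the whole group, so the
    compute_partials of the components inside is never needed."""

    def setup(self):
        self.add_subsystem('d1', SellarDis1withDerivatives(),
                           promotes_inputs=['x', 'z', 'y2'], promotes_outputs=['y1'])
        self.add_subsystem('d2', SellarDis2withDerivatives(),
                           promotes_inputs=['z', 'y1'], promotes_outputs=['y2'])
        # approximate all partials across this group with finite differences
        self.approx_totals(method='fd')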

Another workaround is to have an attribute (here called _call_compute_partials) in your class which tracks whether any analytic derivatives were declared. The conditional inside compute_partials() could then also be implemented outside the method, where the method is called.
from openmdao.core.explicitcomponent import ExplicitComponent
from openmdao.core.indepvarcomp import IndepVarComp
from openmdao.core.problem import Problem
from openmdao.drivers.scipy_optimizer import ScipyOptimizeDriver


class ExplicitComponent2(ExplicitComponent):

    def __init__(self, **kwargs):
        super(ExplicitComponent2, self).__init__(**kwargs)
        self._call_compute_partials = False

    def declare_partials(self, of, wrt, dependent=True, rows=None, cols=None, val=None,
                         method='exact', step=None, form=None, step_calc=None):
        if method == 'exact':
            self._call_compute_partials = True
        super(ExplicitComponent2, self).declare_partials(of, wrt, dependent, rows, cols, val,
                                                         method, step, form, step_calc)


class Cylinder(ExplicitComponent2):
    """Main class"""

    def setup(self):
        self.add_input('radius', val=1.0)
        self.add_input('height', val=1.0)

        self.add_output('Area', val=1.0)
        self.add_output('Volume', val=1.0)

        # self.declare_partials('*', '*', method='fd')
        # self.declare_partials('*', '*')
        self.declare_partials('Volume', 'height', method='fd')
        self.declare_partials('Volume', 'radius', method='fd')
        self.declare_partials('Area', 'height', method='fd')
        self.declare_partials('Area', 'radius')
        # self.declare_partials('Area', 'radius', method='fd')

    def compute(self, inputs, outputs):
        radius = inputs['radius']
        height = inputs['height']

        area = height * radius * 2 * 3.14 + 3.14 * radius ** 2 * 2
        volume = 3.14 * radius ** 2 * height

        outputs['Area'] = area
        outputs['Volume'] = volume

    def compute_partials(self, inputs, partials):
        if self._call_compute_partials:
            print('Calculate partials...')


if __name__ == "__main__":

    prob = Problem()

    indeps = prob.model.add_subsystem('indeps', IndepVarComp(), promotes=['*'])
    indeps.add_output('radius', 2.)
    indeps.add_output('height', 3.)

    main = prob.model.add_subsystem('cylinder', Cylinder(), promotes=['*'])

    # setup the optimization
    prob.driver = ScipyOptimizeDriver()

    prob.model.add_design_var('radius', lower=0.5, upper=5.)
    prob.model.add_design_var('height', lower=0.5, upper=5.)
    prob.model.add_objective('Area')
    prob.model.add_constraint('Volume', lower=10.)

    prob.setup()
    prob.run_driver()

    print(prob['Volume'])  # should be around 10

Related

Tensorflow: OOM when batch size too large

My script fails due to excessively high memory usage. When I reduce the batch size, it works.
@tf.function(autograph=not DEBUG)
def step(prev_state, input_b):
    input_b = tf.reshape(input_b, shape=[1, input_b.shape[0]])
    state = FastALIFStateTuple(v=prev_state[0], z=prev_state[1], b=prev_state[2], r=prev_state[3])

    new_b = self.decay_b * state.b + (tf.ones(shape=[self.units], dtype=tf.float32) - self.decay_b) * state.z
    thr = self.thr + new_b * self.beta

    z = state.z
    i_in = tf.matmul(input_b, W_in)
    i_rec = tf.matmul(z, W_rec)
    i_t = i_in + i_rec

    I_reset = z * thr * self.dt
    new_v = self._decay * state.v + (1 - self._decay) * i_t - I_reset

    # Spike generation
    is_refractory = tf.greater(state.r, .1)
    zeros_like_spikes = tf.zeros_like(z)
    new_z = tf.where(is_refractory, zeros_like_spikes, self.compute_z(new_v, thr))
    new_r = tf.clip_by_value(state.r + self.n_refractory * new_z - 1,
                             0., float(self.n_refractory))
    return [new_v, new_z, new_b, new_r]

@tf.function(autograph=not DEBUG)
def evolve_single(inputs):
    accumulated_state = tf.scan(step, inputs, initializer=state0)
    Z = tf.squeeze(accumulated_state[1])  # -> [T,units]
    if self.model_settings['avg_spikes']:
        Z = tf.reshape(tf.reduce_mean(Z, axis=0), shape=(1, -1))
    out = tf.matmul(Z, W_out) + b_out
    return out  # -> [BS,Num_labels]

# # - Using a simple loop
# out_store = []
# for i in range(fingerprint_3d.shape[0]):
#     out_store.append(tf.squeeze(evolve_single(fingerprint_3d[i,:,:])))
# return tf.reshape(out_store, shape=[fingerprint_3d.shape[0],self.d_out])

final_out = tf.squeeze(tf.map_fn(evolve_single, fingerprint_3d))  # -> [BS,T,self.units]
return final_out
This code snippet is inside a tf.function, but I omitted that wrapper since I don't think it's relevant.
As can be seen, I run the code on fingerprint_3d, a tensor with dimensions [BatchSize, Time, InputDimension], e.g. [50, 100, 20]. When I run this with BatchSize < 10 everything works fine, although tf.scan already uses a lot of memory for that.
When I now execute the code on a batch of size 50, I suddenly get an OOM, even though I am executing it in an iterative manner (here commented out).
How should I execute this code so that the batch size doesn't matter?
Is TensorFlow maybe parallelizing my for loop so that it executes over multiple batches at once?
Another, unrelated question: what function should I use instead of tf.scan if I only want to accumulate one state variable, as opposed to tf.scan, which accumulates all the state variables? Or is that possible with tf.scan?
As mentioned in the discussions here, tf.foldl, tf.foldr, and tf.scan all require keeping track of all values for all iterations, which is necessary for computations like gradients. I am not aware of any ways to mitigate this issue; still, I would also be interested if anyone has a better answer than mine.
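To the side question about accumulating only one state variable, a minimal sketch (hedged) of the forward-pass difference between tf.scan, which stacks every intermediate accumulator, and tf.foldl, which returns only the final one. During backprop the intermediates may still be kept alive for gradients, as noted above.

import tensorflow as tf

elems = tf.range(1, 6, dtype=tf.float32)  # [1, 2, 3, 4, 5]

# tf.scan keeps every intermediate accumulator value: [1, 3, 6, 10, 15]
all_states = tf.scan(lambda acc, x: acc + x, elems, initializer=0.0)

# tf.foldl keeps only the final accumulator: 15.0
final_state = tf.foldl(lambda acc, x: acc + x, elems, initializer=0.0)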
When I used

@tf.function
def get_loss_and_gradients():
    with tf.GradientTape(persistent=False) as tape:
        logits, spikes = rnn.call(fingerprint_input=graz_dict["train_input"], W_in=W_in, W_rec=W_rec, W_out=W_out, b_out=b_out)
        loss = loss_normal(tf.cast(graz_dict["train_groundtruth"], dtype=tf.int32), logits)
    gradients = tape.gradient(loss, [W_in, W_rec, W_out, b_out])
    return loss, logits, spikes, gradients

it works.
When I remove the @tf.function decorator, the memory blows up. So it really seems important that TensorFlow can create a graph for your computations.

Balanced Accuracy Score in Tensorflow

I am implementing a CNN for a highly unbalanced classification problem and I would like to implement custom metrics in TensorFlow to use with the select-best-model callback.
Specifically, I would like to implement the balanced accuracy score, which is the average of the recall of each class (see the sklearn implementation here). Does someone know how to do it?
I was facing the same issue so I implemented a custom class based off SparseCategoricalAccuracy:
class BalancedSparseCategoricalAccuracy(keras.metrics.SparseCategoricalAccuracy):
    def __init__(self, name='balanced_sparse_categorical_accuracy', dtype=None):
        super().__init__(name, dtype=dtype)

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_flat = y_true
        if y_true.shape.ndims == y_pred.shape.ndims:
            y_flat = tf.squeeze(y_flat, axis=[-1])
        y_true_int = tf.cast(y_flat, tf.int32)

        cls_counts = tf.math.bincount(y_true_int)
        cls_counts = tf.math.reciprocal_no_nan(tf.cast(cls_counts, self.dtype))
        weight = tf.gather(cls_counts, y_true_int)
        return super().update_state(y_true, y_pred, sample_weight=weight)
The idea is to set each class weight inversely proportional to its size.
This code produces some warnings from AutoGraph, but I believe those are AutoGraph bugs, and the metric seems to work fine.
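A minimal usage sketch (hedged; model, x_train, and y_train are placeholders for your own objects):

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=[BalancedSparseCategoricalAccuracy()])
model.fit(x_train, y_train, epochs=10)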
There are three ways I can think of to tackle the situation:
1) Random under-sampling - in this method you randomly remove samples from the majority classes.
2) Random over-sampling - in this method you increase the number of samples by replicating them.
3) Weighted cross entropy - you can also use weighted cross entropy so that the loss value is compensated for the minority classes. See here, and the sketch after this list.
I have personally tried method 2 and it does increase my accuracy by a significant amount, but it may vary from dataset to dataset.
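For option 3, a hedged sketch using Keras's class_weight argument, which reweights the loss per class (y_train is assumed to be an array of integer labels); weights inversely proportional to class frequency are a common choice:

import numpy as np

counts = np.bincount(y_train)  # samples per class
n_classes = len(counts)
class_weight = {i: len(y_train) / (n_classes * c) for i, c in enumerate(counts)}

model.fit(x_train, y_train, class_weight=class_weight)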
NOTE
It appears that the implementation/API of the Recall class, which I used as a template for my answer, has been modified in newer TF versions (as pointed out by @guilaumme-gaudin), so I recommend you look at the Recall implementation used in your current TF version and adapt it following the approach I describe in the original post. This way I don't have to update my answer every time the TF team modifies the implementation/API of its metrics.
Original post
I'm not an expert in TensorFlow but, using a bit of pattern matching between metrics implementations in the tf source code, I came up with this:
import numpy as np

from tensorflow.python.keras import backend as K
from tensorflow.python.keras.metrics import Metric
from tensorflow.python.keras.utils import metrics_utils
from tensorflow.python.ops import init_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.keras.utils.generic_utils import to_list


class BACC(Metric):

    def __init__(self, thresholds=None, top_k=None, class_id=None, name=None, dtype=None):
        super(BACC, self).__init__(name=name, dtype=dtype)
        self.init_thresholds = thresholds
        self.top_k = top_k
        self.class_id = class_id

        default_threshold = 0.5 if top_k is None else metrics_utils.NEG_INF
        self.thresholds = metrics_utils.parse_init_thresholds(
            thresholds, default_threshold=default_threshold)
        self.true_positives = self.add_weight(
            'true_positives',
            shape=(len(self.thresholds),),
            initializer=init_ops.zeros_initializer)
        self.true_negatives = self.add_weight(
            'true_negatives',
            shape=(len(self.thresholds),),
            initializer=init_ops.zeros_initializer)
        self.false_positives = self.add_weight(
            'false_positives',
            shape=(len(self.thresholds),),
            initializer=init_ops.zeros_initializer)
        self.false_negatives = self.add_weight(
            'false_negatives',
            shape=(len(self.thresholds),),
            initializer=init_ops.zeros_initializer)

    def update_state(self, y_true, y_pred, sample_weight=None):
        return metrics_utils.update_confusion_matrix_variables(
            {
                metrics_utils.ConfusionMatrix.TRUE_POSITIVES: self.true_positives,
                metrics_utils.ConfusionMatrix.TRUE_NEGATIVES: self.true_negatives,
                metrics_utils.ConfusionMatrix.FALSE_POSITIVES: self.false_positives,
                metrics_utils.ConfusionMatrix.FALSE_NEGATIVES: self.false_negatives,
            },
            y_true,
            y_pred,
            thresholds=self.thresholds,
            top_k=self.top_k,
            class_id=self.class_id,
            sample_weight=sample_weight)

    def result(self):
        """
        Returns the Balanced Accuracy (average between recall and specificity)
        """
        result = (math_ops.div_no_nan(self.true_positives, self.true_positives + self.false_negatives) +
                  math_ops.div_no_nan(self.true_negatives, self.true_negatives + self.false_positives)) / 2
        return result

    def reset_states(self):
        num_thresholds = len(to_list(self.thresholds))
        K.batch_set_value(
            [(v, np.zeros((num_thresholds,))) for v in self.variables])

    def get_config(self):
        config = {
            'thresholds': self.init_thresholds,
            'top_k': self.top_k,
            'class_id': self.class_id
        }
        base_config = super(BACC, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
I've simply taken the Recall class implementation from the source code as a template and extended it to make sure it has TP, TN, FP, and FN defined.
After that, I modified the result method so that it calculates balanced accuracy, and voila :)
I compared the results from this with sklearn's balanced accuracy score and the values matched so I think it's correct, but do double check just in case.
I have not tested this code yet, but looking at the source code of tensorflow==2.1.0, this might work for the binary classification case:
from tensorflow.keras.metrics import Recall
from tensorflow.python.ops import math_ops


class BalancedBinaryAccuracy(Recall):
    def result(self):
        result = (math_ops.div_no_nan(self.true_positives, self.true_positives + self.false_negatives) +
                  math_ops.div_no_nan(self.true_negatives, self.true_negatives + self.false_positives)) / 2
        return result[0] if len(self.thresholds) == 1 else result
As an alternative to writing a custom metric, you can write a custom callback using the metrics already implemented and available via the training logs. For example, you can define the training balanced accuracy callback like this:
class TrainBalancedAccuracyCallback(tf.keras.callbacks.Callback):

    def __init__(self, **kargs):
        super(TrainBalancedAccuracyCallback, self).__init__(**kargs)

    def on_epoch_end(self, epoch, logs={}):
        train_sensitivity = logs['tp'] / (logs['tp'] + logs['fn'])
        train_specificity = logs['tn'] / (logs['tn'] + logs['fp'])
        logs['train_sensitivity'] = train_sensitivity
        logs['train_specificity'] = train_specificity
        logs['train_balacc'] = (train_sensitivity + train_specificity) / 2
        print('train_balacc', logs['train_balacc'])
and the same for the validation:
class ValBalancedAccuracyCallback(tf.keras.callbacks.Callback):

    def __init__(self, **kargs):
        super(ValBalancedAccuracyCallback, self).__init__(**kargs)

    def on_epoch_end(self, epoch, logs={}):
        val_sensitivity = logs['val_tp'] / (logs['val_tp'] + logs['val_fn'])
        val_specificity = logs['val_tn'] / (logs['val_tn'] + logs['val_fp'])
        logs['val_sensitivity'] = val_sensitivity
        logs['val_specificity'] = val_specificity
        logs['val_balacc'] = (val_sensitivity + val_specificity) / 2
        print('val_balacc', logs['val_balacc'])
and then you can pass these via the callbacks argument of the model's fit method.
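A hedged usage sketch: the callbacks above assume the confusion-matrix metrics are logged under the names 'tp', 'tn', 'fp', and 'fn', so the built-in metrics must be registered with exactly those names:

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        tf.keras.metrics.TruePositives(name='tp'),
        tf.keras.metrics.TrueNegatives(name='tn'),
        tf.keras.metrics.FalsePositives(name='fp'),
        tf.keras.metrics.FalseNegatives(name='fn'),
    ],
)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          callbacks=[TrainBalancedAccuracyCallback(), ValBalancedAccuracyCallback()])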

Eager-Mode very slow (~22x slower than Graph-Mode)

I read that TensorFlow 2.0 will have some major changes, and that a big part of them will be eager execution [1], so I tried to play around a bit with TensorFlow's eager mode.
I took code from a GitHub repo and tried to run it in eager mode (however, without the proposed use of Keras Model/Layers).
It turned out to be quite slow, so I tried different modifications and compared them with the original (graph-mode) source of the model. The result is that graph mode is around 22x faster than eager mode. It is totally clear to me that graph mode is faster, but by this much?
Is this always the case, or do I need some special modifications/configurations of the variables to get performance comparable to graph mode?
The source code, for both variants, can be found at [2].
Thanks in advance!
Eager-Mode:
# With
# with tf.device("/gpu:0"):
# ...
#
# Runtime is 0.35395
# Runtime is 0.12711
# Runtime is 0.12438
# Runtime is 0.12428
# Runtime is 0.12572
# Runtime is 0.12593
# Runtime is 0.12505
# Runtime is 0.12527
# Runtime is 0.12418
# Runtime is 0.12340
Graph Mode:
# Runtime is 0.81241
# Runtime is 0.00573
# Runtime is 0.00573
# Runtime is 0.00570
# Runtime is 0.00555
# Runtime is 0.00564
# Runtime is 0.00545
# Runtime is 0.00540
# Runtime is 0.00591
# Runtime is 0.00574
[1] https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/JHDpgRyFVUs
[2] https://gist.github.com/lhlmgr/f6709e5aba4a5314b5221d58232b09bd
Using eager execution may mean undoing some habits developed with TensorFlow graphs, since code snippets that used to run once (e.g., a Python function that constructs the graph to compute the loss) will now run repeatedly (the same Python function will compute the loss on each iteration).
I took a cursory look at the code links provided and noticed some easy wins that would probably also be found with standard Python profiling tools. You may want to use those (cProfile, py-spy, etc.).
For example, the Keras network is currently implemented as:
class NFModel(tf.keras.Model):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def call(self, *args, **kwargs):
        num_layers = 6
        d, r = 2, 2
        bijectors = []

        for i in range(num_layers):
            with tf.variable_scope('bijector_%d' % i):
                V = tf.get_variable('V', [d, r], dtype=DTYPE)  # factor loading
                shift = tf.get_variable('shift', [d], dtype=DTYPE)  # affine shift
                L = tf.get_variable('L', [d * (d + 1) / 2], dtype=DTYPE)  # lower triangular
                bijectors.append(tfb.Affine(
                    scale_tril=tfd.fill_triangular(L),
                    scale_perturb_factor=V,
                    shift=shift,
                ))
                alpha = tf.get_variable('alpha', [], dtype=DTYPE)
                abs_alpha = tf.abs(alpha) + .01
                bijectors.append(LeakyReLU(alpha=abs_alpha))

        base_dist = tfd.MultivariateNormalDiag(loc=tf.zeros([2], DTYPE))
        mlp_bijector = tfb.Chain(list(reversed(bijectors[:-1])), name='2d_mlp_bijector')
        dist = tfd.TransformedDistribution(distribution=base_dist, bijector=mlp_bijector)
Instead, if you create the variables in __init__ once and avoid tf.get_variable calls on every call to the network, you should see a big improvement.
class NFModel(tf.keras.Model):
    def __init__(self, *args, **kwargs):
        super(NFModel, self).__init__(*args, **kwargs)
        num_layers = 6
        d, r = 2, 2
        self.num_layers = num_layers
        self.V = [tf.get_variable('V', [d, r], dtype=DTYPE) for _ in range(num_layers)]
        self.shift = [tf.get_variable('shift', [d], dtype=DTYPE) for _ in range(num_layers)]
        self.L = [tf.get_variable('L', [d * (d + 1) / 2], dtype=DTYPE) for _ in range(num_layers)]
        self.alpha = [tf.get_variable('alpha', [], dtype=DTYPE) for _ in range(num_layers)]

    def call(self, *args, **kwargs):
        bijectors = []

        for i in range(self.num_layers):
            V = self.V[i]
            shift = self.shift[i]
            L = self.L[i]
            bijectors.append(tfb.Affine(
                scale_tril=tfd.fill_triangular(L),
                scale_perturb_factor=V,
                shift=shift,
            ))
            alpha = self.alpha[i]
            abs_alpha = tf.abs(alpha) + .01
            bijectors.append(LeakyReLU(alpha=abs_alpha))

        base_dist = tfd.MultivariateNormalDiag(loc=tf.zeros([2], DTYPE))
        mlp_bijector = tfb.Chain(list(reversed(bijectors[:-1])), name='2d_mlp_bijector')
        dist = tfd.TransformedDistribution(distribution=base_dist, bijector=mlp_bijector)
        return {"dist": dist}
There are probably other such easy wins; a profiling tool will nudge you in the right direction.
Also, note that TF 2.0 is less about "eager execution" and more about how one interacts with graphs, as per the RFC.
Hope that helps.

Avoiding optimization pitfalls when modeling an ordinal predicted variable in PyMC3

I am trying to model an ordinal predicted variable using PyMC3 based on the approach in chapter 23 of Doing Bayesian Data Analysis. I would like to determine a good starting value using find_MAP, but am receiving an optimization error.
The model:
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt

# Some helper functions
def cdf(x, location=0, scale=1):
    epsilon = np.array(1e-32, dtype=theano.config.floatX)

    location = tt.cast(location, theano.config.floatX)
    scale = tt.cast(scale, theano.config.floatX)

    div = tt.sqrt(2 * scale ** 2 + epsilon)
    div = tt.cast(div, theano.config.floatX)

    erf_arg = (x - location) / div
    return .5 * (1 + tt.erf(erf_arg + epsilon))

def percent_to_thresh(idx, vect):
    return 5 * tt.sum(vect[:idx + 1]) + 1.5

def full_thresh(thresh):
    idxs = tt.arange(thresh.shape[0] - 1)
    thresh_mod, updates = theano.scan(fn=percent_to_thresh,
                                      sequences=[idxs],
                                      non_sequences=[thresh])
    return tt.concatenate([[-1 * np.inf, 1.5], thresh_mod, [6.5, np.inf]])

def compute_ps(thresh, location, scale):
    f_thresh = full_thresh(thresh)
    return cdf(f_thresh[1:], location, scale) - cdf(f_thresh[:-1], location, scale)

# Generate data
real_ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3, 0.2]
data = np.random.choice(7, size=1000, p=real_ps)

# Run model
with pm.Model() as model:
    mu = pm.Normal('mu', mu=4, sd=3)
    sigma = pm.Uniform('sigma', lower=0.1, upper=70)
    thresh = pm.Dirichlet('thresh', a=np.ones(5))

    cat_p = compute_ps(thresh, mu, sigma)

    results = pm.Categorical('results', p=cat_p, observed=data)

with model:
    start = pm.find_MAP()
    trace = pm.sample(2000, start=start)
When running this, I receive the following error:
Applied interval-transform to sigma and added transformed sigma_interval_ to model.
Applied stickbreaking-transform to thresh and added transformed thresh_stickbreaking_ to model.
Traceback (most recent call last):
  File "cm_net_log.v1-for_so.py", line 53, in <module>
    start = pm.find_MAP()
  File "/usr/local/lib/python3.5/site-packages/pymc3/tuning/starting.py", line 133, in find_MAP
    specific_errors)
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'thresh_stickbreaking_': array([-1.04298465, -0.48661088, -0.84326554, -0.44833646]), 'sigma_interval_': array(-2.220446049250313e-16), 'mu': array(7.68422528308479)} logp: array(-3506.530143064723) dlogp: array([ 1.61013190e-06, nan, -6.73994118e-06, -6.93873894e-06, 6.03358122e-06, 3.18954680e-06]) Check that 1) you don't have hierarchical parameters, these will lead to points with infinite density. 2) your distribution logp's are properly specified. Specific issues:
My questions:
How can I determine why dlogp is nan at certain points?
Is there a different way that I can express this model to avoid dlogp being nan?
Also worth noting:
This model runs fine if I skip find_MAP and use a Metropolis sampler. However, I'd like to have the flexibility of using other samplers as this model becomes more complex.
I have a suspicion that the issue is due to the relationship between the thresholds and the normal distribution, but I don't know how to disentangle them for the optimization.
Regarding question 2: I expressed the model for the ordinal predicted variable (single group) differently; I used the Theano @as_op decorator for a function that calculates probabilities for the outcomes. That also explains why I cannot use find_MAP() or gradient-based samplers: Theano cannot calculate a gradient for the custom function. (http://pymc-devs.github.io/pymc3/notebooks/getting_started.html#Arbitrary-deterministics)
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.compile.ops import as_op
from scipy.stats import norm

# Number of outcomes
nYlevels = df.Y.cat.categories.size

thresh = [k + .5 for k in range(1, nYlevels)]
thresh_obs = np.ma.asarray(thresh)
thresh_obs[1:-1] = np.ma.masked

@as_op(itypes=[tt.dvector, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty(nYlevels)
    n = norm(loc=mu, scale=sigma)
    out[0] = n.cdf(theta[0])
    out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
    out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
    out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
    out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
    out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
    out[6] = 1 - n.cdf(theta[5])
    return out

with pm.Model() as ordinal_model_single:

    theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5**2, len(thresh)),
                      shape=len(thresh), observed=thresh_obs, testval=thresh[1:-1])

    mu = pm.Normal('mu', mu=nYlevels/2.0, tau=1.0/(nYlevels**2))
    sigma = pm.Uniform('sigma', nYlevels/1000.0, nYlevels*10.0)

    pr = outcome_probabilities(theta, mu, sigma)

    y = pm.Categorical('y', pr, observed=df.Y.cat.codes.as_matrix())
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb
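Regarding question 1, a hedged debugging sketch: depending on your PyMC3 version, the model can report the log-probability of each variable at the test point, which helps locate the term that goes non-finite:

with model:
    # per-variable logp at the test point; a -inf or nan row points
    # at the offending distribution
    print(model.check_test_point())
    # total model logp at the same point
    print(model.logp(model.test_point))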

Using TensorArrays in the context of a while_loop to accumulate values

Below I have an implementation of a Tensorflow RNN Cell, designed to emulate Alex Graves' algorithm ACT in this paper: http://arxiv.org/abs/1603.08983.
At a single timestep in the sequence, called via rnn.rnn (with a static sequence_length parameter, so the rnn is unrolled dynamically; I am using a fixed batch size of 20), we recursively call ACTStep, producing outputs of size (1, 200), where the hidden dimension of the RNN cell is 200 and we have a batch size of 1.
Using the while loop in Tensorflow, we iterate until the accumulated halting probability is high enough. All of this works reasonably smoothly, but I am having problems accumulating states, probabilities and outputs within the while loop, which we need to do in order to create weighted combinations of these as the final cell output/state.
I have tried using a simple list, as below, but this fails when the graph is compiled because the outputs are not in the same frame (is it possible to use the "switch" function in control_flow_ops to forward the tensors to the point at which they are required, i.e. the add_n function just before we return the values?). I have also tried using the TensorArray structure, but I am finding it difficult to use as it seems to destroy shape information, and replacing that manually hasn't worked. I also haven't been able to find much documentation on TensorArrays, presumably because they are, I imagine, mainly for internal TF use.
Any advice on how it might be possible to correctly accumulate the variables produced by ACTStep would be much appreciated.
class ACTCell(RNNCell):
    """An RNN cell implementing Graves' Adaptive Computation Time algorithm"""

    def __init__(self, num_units, cell, epsilon, max_computation):
        self.one_minus_eps = tf.constant(1.0 - epsilon)
        self._num_units = num_units
        self.cell = cell
        self.N = tf.constant(max_computation)

    @property
    def input_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    @property
    def state_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        with vs.variable_scope(scope or type(self).__name__):
            # define within-cell constants/counters used to control the while loop
            prob = tf.get_variable("prob", [], tf.float32, tf.constant_initializer(0.0))
            counter = tf.get_variable("counter", [], tf.float32, tf.constant_initializer(0.0))
            tf.assign(prob, 0.0)
            tf.assign(counter, 0.0)

            # the predicate for stopping the while loop. Tensorflow demands that we have
            # all of the variables used in the while loop in the predicate.
            pred = lambda prob, counter, state, input, \
                          acc_state, acc_output, acc_probs: \
                tf.logical_and(tf.less(prob, self.one_minus_eps), tf.less(counter, self.N))

            acc_probs = []
            acc_outputs = []
            acc_states = []

            _, iterations, _, _, acc_states, acc_output, acc_probs = \
                control_flow_ops.while_loop(pred,
                                            self.ACTStep,
                                            [prob, counter, state, input, acc_states, acc_outputs, acc_probs])

        # TODO: fix last part of this, need to use the remainder.
        # TODO: find a way to accumulate the regulariser

        # here we take a weighted combination of the states and outputs
        # to use as the actual output and state which is passed to the next timestep.
        next_state = tf.add_n([tf.mul(x, y) for x, y in zip(acc_probs, acc_states)])
        output = tf.add_n([tf.mul(x, y) for x, y in zip(acc_probs, acc_outputs)])

        return output, next_state

    def ACTStep(self, prob, counter, state, input, acc_states, acc_outputs, acc_probs):
        output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)

        prob_w = tf.get_variable("prob_w", [self.cell.input_size, 1])
        prob_b = tf.get_variable("prob_b", [1])
        p = tf.nn.sigmoid(tf.matmul(prob_w, new_state) + prob_b)

        acc_states.append(new_state)
        acc_outputs.append(output)
        acc_probs.append(p)

        return [tf.add(prob, p), tf.add(counter, 1.0), new_state, input, acc_states, acc_outputs, acc_probs]
I'm going to preface this response by saying that it is NOT a complete solution, but rather some commentary on how to improve your cell.
To start off, in your ACTStep function you call rnn.rnn for one timestep (as defined by [input]). If you're doing a single timestep, it is probably more efficient to simply use the actual self.cell call function. You'll see this same mechanism used in the tensorflow rnncell wrappers.
You mentioned that you have tried using TensorArrays. Did you pack and unpack the TensorArrays appropriately? Here is a repo where, under model.py, you'll find the TensorArrays packed and unpacked properly. A minimal sketch of the pattern follows.
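For reference, a minimal sketch of the accumulate-in-a-while-loop pattern (hedged: it uses the newer tf.TensorArray API, where stack() replaced the pack() of older TF versions):

import tensorflow as tf

def accumulate(n):
    # dynamically sized TensorArray threaded through the loop variables
    ta = tf.TensorArray(tf.float32, size=0, dynamic_size=True)

    def cond(i, ta):
        return i < n

    def body(i, ta):
        value = tf.cast(i, tf.float32) * 2.0  # stand-in for new_state / p
        ta = ta.write(i, value)               # write returns a new TensorArray
        return i + 1, ta

    _, ta = tf.while_loop(cond, body, [0, ta])
    return ta.stack()  # -> tensor of shape [n]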
You also asked if there is a function in control_flow_ops that requires all the tensors to be accumulated. I think you are looking for tf.control_dependencies.
You can list all of your output tensor operations in control_dependencies, and that will require tensorflow to compute all tensors up to that point.
Also, it looks like your counter variable is trainable. Are you sure you want this to be the case? If you're adding one to your counter, that probably won't yield the correct result. On the other hand, you could have purposely kept it trainable to differentiate it at the end for the ponder cost function.
Also, I believe the Remainder function should be in your script:

remainder = 1.0 - tf.add_n(acc_probs[:-1])
# note the [:-1] slice: you do not want to grab the last probability
Here is my version of your code edited:
class ACTCell(RNNCell):
    """An RNN cell implementing Graves' Adaptive Computation Time algorithm

    Notes: https://www.evernote.com/shard/s189/sh/fd165646-b630-48b7-844c-86ad2f07fcda/c9ab960af967ef847097f21d94b0bff7
    """

    def __init__(self, num_units, cell, max_computation=5.0, epsilon=0.01):
        self.one_minus_eps = tf.constant(1.0 - epsilon)  # epsilon is 0.01 as found in the paper
        self._num_units = num_units
        self.cell = cell
        self.N = tf.constant(max_computation)

    @property
    def input_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    @property
    def state_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        with vs.variable_scope(scope or type(self).__name__):
            # define within-cell constants/counters used to control the while loop
            prob = tf.constant(0.0, shape=[batch_size])
            counter = tf.constant(0.0, shape=[batch_size])

            # the predicate for stopping the while loop. Tensorflow demands that we have
            # all of the variables used in the while loop in the predicate.
            pred = lambda prob, counter, state, input, acc_states, acc_output, acc_probs: \
                tf.logical_and(tf.less(prob, self.one_minus_eps), tf.less(counter, self.N))

            acc_probs, acc_outputs, acc_states = [], [], []

            _, iterations, _, _, acc_states, acc_output, acc_probs = \
                control_flow_ops.while_loop(
                    pred,
                    self.ACTStep,  # looks like he purposely makes the while loop here
                    [prob, counter, state, input, acc_states, acc_outputs, acc_probs])

        '''mean-field updates for states and outputs'''
        next_state = tf.add_n([x * y for x, y in zip(acc_probs, acc_states)])
        output = tf.add_n([x * y for x, y in zip(acc_probs, acc_outputs)])

        # you take the last off to avoid a negative ponder cost
        # the problem here is we need to take the sum of all the remainders
        remainder = 1.0 - tf.add_n(acc_probs[:-1])
        tf.add_to_collection("ACT_remainder", remainder)  # if this doesn't work you can do self.list based upon timesteps
        tf.add_to_collection("ACT_iterations", iterations)
        return output, next_state

    def ACTStep(self, prob, counter, state, input, acc_states, acc_outputs, acc_probs):
        '''run the rnn once'''
        output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)

        prob_w = tf.get_variable("prob_w", [self.cell.input_size, 1])
        prob_b = tf.get_variable("prob_b", [1])
        halting_probability = tf.nn.sigmoid(tf.matmul(prob_w, new_state) + prob_b)

        acc_states.append(new_state)
        acc_outputs.append(output)
        acc_probs.append(halting_probability)

        # (fixed: the original returned an undefined name `p` here)
        return [halting_probability + prob, counter + 1.0, new_state, input, acc_states, acc_outputs, acc_probs]

    def PonderCostFunction(self, time_penalty=0.01):
        '''
        note: ponder is completely different than probability, and ponder = rho
        the ponder cost function prohibits the rnn from cycling endlessly on each timestep when not much is needed
        '''
        n_iterations = tf.get_collection_ref("ACT_iterations")
        remainder = tf.get_collection_ref("ACT_remainder")
        return tf.reduce_sum(n_iterations + remainder)  # completely different from probability
This is a complicated paper to implement, and I have been working on it myself. I wouldn't mind collaborating with you to get it done in Tensorflow. If you're interested, please add me at LeavesBreathe on Skype and we can go from there.