With OpenMDAO, I am using FD derivatives and trying to solve a nonlinearly constrained optimization problem with the SLSQP method. Any time the optimizer arrives at a point that violates one of the constraints, it just crashes with the message:
Optimization FAILED. Positive directional derivative for linesearch
For instance, if I intentionally set the initial point to an infeasible design point, the optimizer performs 1 iteration and exits with the above error (the same happens when I start from a feasible point but the optimizer arrives at an infeasible point after a few iterations).
Based on the answer to the question "In OpenMDAO, is there a way to ensure that the constraints are respected before proceeding with a computation?", I'm assuming that raising the AnalysisError exception will not work in my case; is that correct? Is there any other way to prevent the optimizer from going into infeasible regions, or at least to backtrack on the linesearch and try a different direction/distance? Or should the SLSQP method only be used when analytic derivatives are available?
Reproducible test case:
import numpy as np
import openmdao.api as om
class d1(om.ExplicitComponent):

    def setup(self):
        # Global design variables
        self.add_input('r', val=[3, 3, 3])
        self.add_input('T', val=20)

        # Coupling output
        self.add_output('M', val=0)
        self.add_output('cost', val=0)

    def setup_partials(self):
        # Finite difference all partials.
        self.declare_partials('*', '*', method='fd')

    def compute(self, inputs, outputs):
        # define inputs
        r = inputs['r']
        T = inputs['T'][0]

        cost = 174.42 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2])
        M = 456.19 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2]) - 599718

        outputs['M'] = M
        outputs['cost'] = cost

class MDA(om.Group):

    class ObjCmp(om.ExplicitComponent):

        def setup(self):
            # Global Design Variable
            self.add_input('cost', val=0)

            # Output
            self.add_output('obj', val=0.0)

        def setup_partials(self):
            # Finite difference all partials.
            self.declare_partials('*', '*', method='fd')

        def compute(self, inputs, outputs):
            outputs['obj'] = inputs['cost']

    class ConCmp(om.ExplicitComponent):

        def setup(self):
            # Global Design Variable
            self.add_input('M', val=0)

            # Output
            self.add_output('con', val=0.0)

        def setup_partials(self):
            # Finite difference all partials.
            self.declare_partials('*', '*', method='fd')

        def compute(self, inputs, outputs):
            # assemble outputs
            outputs['con'] = inputs['M']

    def setup(self):
        self.add_subsystem('d1', d1(), promotes_inputs=['r', 'T'],
                           promotes_outputs=['M', 'cost'])
        self.add_subsystem('con_cmp', self.ConCmp(), promotes_inputs=['M'],
                           promotes_outputs=['con'])
        self.add_subsystem('obj_cmp', self.ObjCmp(), promotes_inputs=['cost'],
                           promotes_outputs=['obj'])
# Build the model
prob = om.Problem(model=MDA())
model = prob.model
model.add_design_var('r', lower=[3, 3, 3], upper=[10, 10, 10])
model.add_design_var('T', lower=20, upper=220)
model.add_objective('obj', scaler=1)
model.add_constraint('con', lower=0)
# Setup the optimization
prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP', tol=1e-3, disp=True)
prob.setup()
prob.set_solver_print(level=0)
prob.run_driver()
# Printout
print('minimum found at')
print(prob.get_val('T')[0])
print(prob.get_val('r'))
print('constraint')
print(prob.get_val('con')[0])
print('minimum objective')
print(prob.get_val('obj')[0])
Based on your provided test case, the problem here is that you have a really poorly scaled objective and constraint (you also made some very strange coding choices ... which I modified).
Running the OpenMDAO scaling report shows that your objective and constraint values are both around 1e6 in magnitude.
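If you have not generated one before, the scaling report can be produced from the driver after a model run. A minimal sketch, assuming a recent OpenMDAO 3.x version where Driver.scaling_report is available:

# run the model once, then write an HTML report of driver scaling
prob.run_model()
prob.driver.scaling_report(outfile='driver_scaling_report.html')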
This is quite large, and is the source of your problems. A (very rough) rule of thumb is that your objectives and constraints should be around order 1. That's not a hard-and-fast rule, but it is generally a good starting point. Sometimes other scaling will work better, for instance if you have very large or very small derivatives ... but there are parts of SQP methods that are sensitive to the scaling of the objective and constraint values directly. So trying to keep them roughly in the range of 1 is a good idea.
Adding ref=1e6 to both the objective and the constraint gave enough resolution for the numerical methods to converge the problem:
Current function value: [0.229372]
Iterations: 8
Function evaluations: 8
Gradient evaluations: 8
Optimization Complete
-----------------------------------
minimum found at
20.00006826587515
[3.61138704 3. 3.61138704]
constraint
197.20821903413162
minimum objective
229371.99547899762
Here is the code I modified (including removing the extra class definitions inside your group, which didn't seem to be doing anything):
import numpy as np
import openmdao.api as om
class d1(om.ExplicitComponent):

    def setup(self):
        # Global design variables
        self.add_input('r', val=[3, 3, 3])
        self.add_input('T', val=20)

        # Coupling output
        self.add_output('M', val=0)
        self.add_output('cost', val=0)

    def setup_partials(self):
        # Complex step all partials.
        self.declare_partials('*', '*', method='cs')

    def compute(self, inputs, outputs):
        # define inputs
        r = inputs['r']
        T = inputs['T'][0]

        cost = 174.42 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2])
        M = 456.19 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2]) - 599718

        outputs['M'] = M
        outputs['cost'] = cost

class MDA(om.Group):

    def setup(self):
        self.add_subsystem('d1', d1(), promotes_inputs=['r', 'T'],
                           promotes_outputs=['M', 'cost'])
        # self.add_subsystem('con_cmp', self.ConCmp(), promotes_inputs=['M'],
        #                    promotes_outputs=['con'])
        # self.add_subsystem('obj_cmp', self.ObjCmp(), promotes_inputs=['cost'],
        #                    promotes_outputs=['obj'])
# Build the model
prob = om.Problem(model=MDA())
model = prob.model
model.add_design_var('r', lower=[3, 3, 3], upper=[10, 10, 10])
model.add_design_var('T', lower=20, upper=220)
model.add_objective('cost', ref=1e6)
model.add_constraint('M', lower=0, ref=1e6)
# Setup the optimization
prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP', tol=1e-3, disp=True)
prob.setup(force_alloc_complex=True)  # needed because the partials use complex step ('cs')
prob.set_solver_print(level=0)
prob.set_val('r', 7.65)
prob.run_driver()
# Printout
print('minimum found at')
print(prob.get_val('T')[0])
print(prob.get_val('r'))
print('constraint')
print(prob.get_val('M')[0])
print('minimum objective')
print(prob.get_val('cost')[0])
Which SLSQP method are you using? There is one implementation in pyOptSparse and one in ScipyOptimizeDriver. The one in pyOptSparse is older and doesn't respect bounds constraints. The one in SciPy is newer and does. (Yes, it's very confusing that they have the same name and share some lineage... but they are not the same optimizer any more.)
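For reference, here is a minimal sketch of selecting each implementation (the pyOptSparse lines assume pyOptSparse is installed):

import openmdao.api as om

prob = om.Problem()

# SciPy's SLSQP (respects design-variable bounds)
prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP')

# ...or pyOptSparse's SLSQP (older; does not respect bounds)
# driver = om.pyOptSparseDriver()
# driver.options['optimizer'] = 'SLSQP'
# prob.driver = driver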
You shouldn't raise an AnalysisError when you go outside the bounds. If you need strict bounds enforcement, I suggest using IPOPT from within pyOptSparse (if you can get it to compile) or switching to ScipyOptimizeDriver and its SLSQP implementation.
Based on your question, it's not totally clear to me whether you're talking about bound constraints or inequality/equality constraints. If it's the latter, then there isn't any optimizer that will guarantee you remain in the feasible region all the time. Interior-point methods like IPOPT will stay inside the region much better, but not 100% of the time.
In general, with gradient-based optimization it's pretty critical that you make your problem smooth and continuous even outside the constrained region. If there are parts of the space that you absolutely can not go into, then you need to make those quantities design variables and use bound constraints. This sometimes requires reformulating your problem a little bit, possibly by adding a kind of compatibility constraint that says "design variable = computed value". Then you can make sure that the design variable is passed into anything that requires the value to be strictly within a bound, and (hopefully) a converged answer will also satisfy your compatibility constraint; a sketch of this pattern follows.
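Here is a minimal, hypothetical sketch of that pattern using om.ExecComp components; the expressions and names (a, x_dv, computed_value, resid) are made up for illustration:

import openmdao.api as om

prob = om.Problem()
model = prob.model

# upstream analysis computes a value from another design variable 'a'
model.add_subsystem('upstream', om.ExecComp('computed_value = a**2'),
                    promotes=['*'])
# downstream analysis that must never see x_dv outside [1, 10];
# it receives the design variable, not computed_value directly
model.add_subsystem('downstream', om.ExecComp('y = (x_dv - 4.0)**2'),
                    promotes=['*'])
# compatibility residual: zero when x_dv matches the computed value
model.add_subsystem('compat', om.ExecComp('resid = x_dv - computed_value'),
                    promotes=['*'])

model.add_design_var('a', lower=1.0, upper=3.0)
model.add_design_var('x_dv', lower=1.0, upper=10.0)  # strict bound enforced here
model.add_constraint('resid', equals=0.0)            # compatibility constraint
model.add_objective('y')

prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP')
prob.setup()
prob.run_driver()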
If you provide some kind of a test case or example, I can amend my answer with a more specific suggestion.
I am trying to optimize the inference time on GPT2. The current time to generate a sample after calling the script is 55 secs on Google Colab. I put in timestamps to try to isolate where the bottleneck is.
This is the code:
for _ in range(nsamples // batch_size):
    out = sess.run(output, feed_dict={
        context: [context_tokens for _ in range(batch_size)]
    })[:, len(context_tokens):]
    for i in range(batch_size):
        generated += 1
        text = enc.decode(out[i])
        print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
        print(text)
        print("=" * 80)
The line
out = sess.run(output, feed_dict={
    context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
is where the complexity lies. Does anyone have any way I can improve this piece of code? Thank you so much!
batch_size is set to 1 in GPT2 and there is no way to change that without crashing the process. So [context_tokens for _ in range(batch_size)] means [context_tokens for _ in range(1)], which means [context_tokens]. Simplifying it will not improve speed by much, but it is safe to do and makes the code a bit more sensible to read. The real complexity is that you have a 6-gigabyte behemoth in your RAM that you are accessing in that session.
As a practical matter, the fewer tokens you send over and the less processing those tokens take, the faster this part will execute, since each token needs to be sent through the GPT2 model. But, consequently, the less 'intelligent' the response will be.
By the way, // is integer division, so nsamples // batch_size = nsamples / 1 = nsamples. And from what I have seen, nsamples was 1 when I printed its value with print(nsamples). So that for loop is another loop over one item, which means it can be removed, as in the sketch below.
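Assuming nsamples and batch_size are both 1, as observed above, your snippet collapses to a single run with no loops:

# single sample, single batch: both loops disappear
out = sess.run(output, feed_dict={context: [context_tokens]})[:, len(context_tokens):]
print("=" * 40 + " SAMPLE 1 " + "=" * 40)
print(enc.decode(out[0]))
print("=" * 80)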
GPT2 is just an implementation in TensorFlow. Look up: how to make a graph in TensorFlow; how to call a session for that graph; how to make a saver save the variables in that session; and how to use the saver to restore the session. You will learn about checkpoints, meta files and other implementation details that will make the project's files make more sense. A save/restore sketch follows.
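As a minimal TF1-style sketch of that save/restore cycle (the variable and the checkpoint paths here are just for illustration):

import tensorflow as tf

# build a trivial graph and save it
v = tf.Variable(0, name='counter')
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, 'checkpoints/model')  # writes the .meta, .index and .data files

# later: rebuild the graph from the meta file and restore the variables
tf.reset_default_graph()
with tf.Session() as sess:
    restorer = tf.train.import_meta_graph('checkpoints/model.meta')
    restorer.restore(sess, tf.train.latest_checkpoint('checkpoints/'))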
The tensorflow module is found in Lib/site-packages/tensorflow_core (at least in the AI Dungeon 2 Henk717 fork). Most of the processing happens in the sub-directories python/ops and framework. You will see these pop up in tracebacks if your code breaks the hooks tf was expecting.
If this question regards the implementation in AI Dungeon, the best I have been able to implement is a recursive call to generator.generate that is exited by a try/except KeyboardInterrupt:, with a print(token, end='', flush=True) for each token as it is generated. This way you are able to view each token as the AI generates it, rather than waiting 55 seconds for a ping sound.
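A rough, hypothetical sketch of that streaming loop; generate_token stands in for whatever per-token generation call your fork exposes (it is not a real API):

def stream_tokens(generate_token, prompt):
    # print tokens as they are produced; stop with Ctrl+C
    try:
        while True:
            token = generate_token(prompt)    # assumed per-token call
            print(token, end='', flush=True)  # show each token immediately
            prompt += token
    except KeyboardInterrupt:
        print()  # finish the line when the user interrupts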
Also, to turn off the CUDA warnings when tensorflow is imported, set the log-level environment variable before the import:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

The value must be the string '3', not the number 3 (single or double quotes both work in Python).
Next, there are deprecation warnings that pop up from the implementation of GPT2 in TensorFlow versions above 1.5. To shut those off,

tfv = tf.compat.v1
tfv.logging.set_verbosity(tfv.logging.ERROR)

is all you need; you don't need to import warnings.
Even so, there is a long wait between the tf initialization, the initial sample generation, and the loading of the model into RAM. I added the following line inside model.shape_list(x):

print("_", end='', flush=True)

and, at least while the model is being built and localized to the machine, you can watch a "progress bar" of sorts.
The weights retrieved from the restored model don't change and the input is also constant, but the output of the 'Relu:0' operation gives different results each time.
Below is my code:
import cv2
import numpy as np
import tensorflow as tf

sess = tf.Session()
saver = tf.train.import_meta_graph('checkpoints/checkpoints_otherapproach_1/cameranetwork_RAID_CNN-3100.meta')
saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir='checkpoints/checkpoints_otherapproach_1/'))

images = tf.get_default_graph().get_tensor_by_name('images:0')
phase = tf.get_default_graph().get_tensor_by_name('phase:0')
Activ = tf.get_default_graph().get_tensor_by_name('network/siamese_model/convolution_1/conv_1/Relu:0')

# [batch, frames, height, width, channels]; the same image fills all 3 frames
image_array = np.zeros(shape=[1, 3, 128, 64, 3])
imagepath = 'RAiD_Dataset' + '/images_afterremoving_persons_notinallcameras/' + 'test' + '/camera_' + str(1)
fullfile_name = imagepath + "/" + 'camera_1_person_23_index_1.jpg'
image_array[0][0] = cv2.imread(fullfile_name)
image_array[0][1] = image_array[0][0]
image_array[0][2] = image_array[0][0]
image_array = image_array.astype(np.float32)

feed_dict_values = {images: image_array, phase: False}
temp2 = sess.run(Activ, feed_dict=feed_dict_values)
temp1 = sess.run(Activ, feed_dict=feed_dict_values)

print((temp1 == temp2).all())  # prints False
There are two possible reasons for this:
Some of the TensorFlow ops inherit non-deterministic behavior from CUDA. This results in small numerical errors (which might be amplified by non-linearities). See this answer on how to try running your model on a single CPU thread; a sketch follows this list. If the two arrays turn out to be identical in that configuration, then this is the cause.
I'm assuming that you know the graph you are loading, but the graph itself might produce inconsistent results 'by design' due to operations that deliberately introduce either randomness or non-constant data. For example, consider operations that use the random number generator, or operations that update variables (e.g., tf.assign) each time Activ is evaluated.
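As a sketch of the first check (TF 1.x, matching the code in the question): hide the GPU and force single-threaded CPU execution, then rerun both evaluations:

import tensorflow as tf

# single CPU thread, no GPU, to rule out CUDA-related non-determinism
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        device_count={'GPU': 0})
sess = tf.Session(config=config)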
I'm implementing an RNN cell around BasicLSTMCell where I want to be able to look back on past hidden states (across batch boundaries). I'm using dynamic_rnn() and the basic pattern I use is:
def __call__(self, inputs, old_state, scope=None):
    mem = old_state[2]
    # [do something with mem]
    cell_out, new_state = self.cell(inputs,
                                    (old_state[0],
                                     old_state[1]))
    h_state = new_state.h
    c_state = new_state.c

    # control dependency required because of self.buf_index
    with tf.get_default_graph().control_dependencies([cell_out]):
        new_mem = write_to_buf(self.out_buf,
                               cell_out,
                               self.buf_index)

    # update the buffer index
    with tf.get_default_graph().control_dependencies([new_mem]):
        inc_step = tf.assign(self.buf_index, (self.buf_index + 1) %
                             self.buf_size)

    with tf.get_default_graph().control_dependencies([inc_step]):
        h_state = tf.identity(h_state)

    t = [c_state, h_state, new_mem]
    return cell_out, tuple(t)
self.out_buf and self.buf_index are variables. write_to_buf() is a function that uses scatter_update() to write the new hidden states to the buffer and returns the result.
I rely on the assumption that reading the return value of a scatter update guarantees that the new variable value is used (similar to this), so that caching of variables does not mess things up.
From debug prints it seems to work but it would be nice to get some confirmation or suggestions on alternatives.
The (source code) documentation for tf.cond is unclear on whether the functions to be performed when the predicate is evaluated can have side effects or not. I've done some tests, but I'm getting conflicting results. For example, the code below does not work:
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
adder = count.assign_add(1)
subtractor = count.assign_sub(2)
my_op = control_flow_ops.cond(pred, lambda: adder, lambda: subtractor)
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
That is, no matter what value the predicate evaluates to, both functions get run, so the net result is a subtraction of 1. On the other hand, this code snippet does work, where the only difference is that I add new ops to the graph every time my_op is called:
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
my_op = control_flow_ops.cond(pred, lambda:count.assign_add(1), lambda:count.assign_sub(2))
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
Not sure why creating new ops every time works while the other case doesn't, but I'd obviously rather not be adding nodes as the graph will eventually become too big.
Your second version, where the assign_add() and assign_sub() ops are created inside the lambdas passed to cond(), is the correct way to do this. Fortunately, each of the two lambdas is evaluated only once, during the call to cond(), so your graph will not grow without bound.
Essentially, what cond() does is the following:
1. Create a Switch node, which forwards its input to only one of two outputs, depending on the value of pred. Call the outputs pred_true and pred_false. (They have the same value as pred, but that's unimportant, since this is never directly evaluated.)
2. Build the subgraph corresponding to the if_true lambda, where all of the nodes have a control dependency on pred_true.
3. Build the subgraph corresponding to the if_false lambda, where all of the nodes have a control dependency on pred_false.
4. Zip together the lists of return values from the two lambdas, and create a Merge node for each pair. A Merge node takes two inputs, of which only one is expected to be produced, and forwards it to its output.
5. Return the tensors that are the outputs of the Merge nodes.
This means you can run your second version, and be content that the graph remains a fixed size, regardless of how many steps you run.
The reason your first version doesn't work is that, when a Tensor is captured (like adder or subtractor in your example), an additional Switch node is added to enforce the logic that the value of the tensor is only forwarded to the branch that actually executes. This is an artifact of how TensorFlow combines feed-forward dataflow and control flow in its execution model. The result is that the captured tensors (in this case the results of the assign_add and assign_sub) will always be evaluated, even if they aren't used, and you'll see their side effects. This is something we need to document better, and as Michael says, we're going to make this more usable in future.
The second case works because you have added the ops within the cond(): this causes them to execute conditionally.
The first case is analogous to saying:
adder = (count += 1)
subtractor = (count -= 2)
if (cond) { adder } else { subtractor }
Since adder and subtractor are outside the conditional, they are always executed.
The second case is more like saying:
if (cond) { adder = (count += 1) } else { subtractor = (count -= 2) }
which in this case does what you expected.
We realize that the interaction between side effects and (somewhat) lazy evaluation is confusing, and we have a medium-term goal to make things more uniform. But the important thing to understand for now is that we do not do true lazy evaluation: the conditional acquires a dependency on every quantity defined outside the conditional that is used within either branch.