CPU time in docplex

Let us assume that I have created a mathematical model in Python and want to solve it with the docplex library using the code below.
start = time.perf_counter() # CPU time calculator of the CPLEX solver
# Obj 1
mdl.minimize(obj1)
solution = mdl.solve(log_output=True)
if (solution is not None) and (solution.is_feasible_solution()):
    lb[0] = obj1.solution_value
    if obj2.solution_value > ub[1]: ub[1] = obj2.solution_value
    if obj3.solution_value > ub[2]: ub[2] = obj3.solution_value
    sol[0, 0] = obj1.solution_value
    sol[0, 1] = obj2.solution_value
    sol[0, 2] = obj3.solution_value
    sol[0, 3] = round(time.perf_counter() - start, 3)
Given that I have set mdl.time_limit=480, why could the time recorded in sol[0, 3] be greater than 480 seconds?
Thanks!

Within docplex you can get the solve time:
from docplex.mp.model import Model

mdl = Model(name='buses')
nbbus40 = mdl.integer_var(name='nbBus40')
nbbus30 = mdl.integer_var(name='nbBus30')
mdl.add_constraint(nbbus40*40 + nbbus30*30 >= 300, 'kids')
mdl.minimize(nbbus40*500 + nbbus30*400)
mdl.solve(log_output=True)
mdl.export("c:\\temp\\buses.lp")

for v in mdl.iter_integer_vars():
    print(v, " = ", v.solution_value)

print(mdl.solve_details)
print("solve time =", mdl.solve_details.time)
gives
status = integer optimal solution
time = 0.109 s.
problem = MILP
gap = 0%
solve time = 0.10900000005494803

To answer your initial question: the time_limit parameter applies only to the solve call and is handled internally by CPLEX, so it counts solve time exclusively.
The Python time module, on the other hand, measures wall-clock time, including any Python code that is executed after the call to solve but before the second call to time.perf_counter().
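As a rough illustration on a toy model (not your exact code), you can print the two measurements side by side; the solver-reported time is the one the limit applies to, while the wall-clock interval also contains the surrounding Python work:
import time
from docplex.mp.model import Model

mdl = Model(name='example')
x = mdl.integer_var(name='x')
mdl.add_constraint(x >= 10)
mdl.minimize(x)
mdl.parameters.timelimit = 480   # the limit applies to the solve itself

start = time.perf_counter()
solution = mdl.solve(log_output=False)
wall_clock = time.perf_counter() - start

print("CPLEX solve time :", mdl.solve_details.time)   # bounded by the time limit
print("wall-clock time  :", wall_clock)               # can exceed it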

Related

Multi-GPU TFF simulation errors "Detected dataset reduce op in multi-GPU TFF simulation"

I ran my code for an emotion detection model using TensorFlow Federated simulation. My code works perfectly fine using CPUs only. However, I received this error when trying to run TFF with GPU.
ValueError: Detected dataset reduce op in multi-GPU TFF simulation: `use_experimental_simulation_loop=True` for `tff.learning`; or use `for ... in iter(dataset)` for your own dataset iteration. Reduce op will be functional after b/159180073.
What is this error about and how can I fix it? I tried to search many places but found no answer.
Here is the call stack if it helps. It is very long, so I pasted it into this link: https://pastebin.com/b1R93gf1
EDIT:
Here is the code containing iterative_process
def startTraining(output_file):
    iterative_process = tff.learning.build_federated_averaging_process(
        model_fn,
        client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.01),
        server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
        use_experimental_simulation_loop=True
    )
    flstate = iterative_process.initialize()
    evaluation = tff.learning.build_federated_evaluation(model_fn)
    output_file.write(
        'round,available_users,loss,sparse_categorical_accuracy,val_loss,val_sparse_categorical_accuracy,test_loss,test_sparse_categorical_accuracy\n')
    curr_round_result = [0, 0, 100, 0, 100, 0]
    min_val_loss = 100
    for round in range(1, ROUND_COUNT + 1):
        available_users = fetch_available_users_and_increase_time(ROUND_DURATION_AVERAGE + random.randint(-ROUND_DURATION_VARIATION, ROUND_DURATION_VARIATION + 1))
        if len(available_users) == 0:
            write_to_file(curr_round_result)
            continue
        train_data = make_federated_data(available_users, 'train')
        flstate, metrics = iterative_process.next(flstate, train_data)
        val_data = make_federated_data(available_users, 'val')
        val_metrics = evaluation(flstate.model, val_data)
        curr_round_result[0] = round
        curr_round_result[1] = len(available_users)
        curr_round_result[2] = metrics['train']['loss']
        curr_round_result[3] = metrics['train']['sparse_categorical_accuracy']
        curr_round_result[4] = val_metrics['loss']
        curr_round_result[5] = val_metrics['sparse_categorical_accuracy']
        write_to_file(curr_round_result)
Here is the code for make_federated_data
def make_federated_data(users, dataset_type):
    offset = 0
    if dataset_type == 'val':
        offset = train_size
    elif dataset_type == 'test':
        offset = train_size + val_size
    global LOADED_USER
    for id in users:
        if id + offset not in LOADED_USER:
            LOADED_USER[id + offset] = getDatasetFromFilePath(filepaths[id + offset])
    return [
        LOADED_USER[id + offset]
        for id in users
    ]
TFF does support multi-GPU and, as the error message says, one of two things is happening:
The code is using tff.learning but using the default use_experimental_simulation_loop argument value of False. With multiple GPUs, this must be set to True when using APIs including tff.learning.build_federated_averaging_process. For example, calling with:
training_process = tff.learning.build_federated_averaging_process(
    ..., use_experimental_simulation_loop=True)
The code contains a custom tf.data.Dataset.reduce(...) call somewhere. This must be replaced with Python code that iterates over the dataset. For example:
result = dataset.reduce(initial_state=0, reduce_func=lambda s, x: s + x)
becomes
s = 0
for x in iter(dataset):
    s += x
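As a self-contained toy illustration of that rewrite (my own example, assuming eager-mode TF 2.x):
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# reduce-based version (the pattern that triggers the error in multi-GPU TFF simulations)
total = dataset.reduce(tf.constant(0, dtype=tf.int64), lambda s, x: s + x)

# iteration-based version that TFF accepts
s = tf.constant(0, dtype=tf.int64)
for x in iter(dataset):
    s += x

print(int(total), int(s))  # both print 10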
I realized that TFF does not yet fully support multi-GPU setups. Therefore, we need to limit the number of visible GPUs to just 1, using:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
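For completeness, here is roughly how I placed it (a sketch; the final print is just a sanity check I added). As far as I understand, the variable has to be set before TensorFlow initializes the GPUs, so it goes at the very top of the script, before the imports:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # hide all but the first GPU

import tensorflow as tf
import tensorflow_federated as tff

print(tf.config.list_physical_devices("GPU"))  # should now list a single GPU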

Tensorflow: OOM when batch size too large

My script is failing due to too high memory usage. When I reduce the batch size it works.
@tf.function(autograph=not DEBUG)
def step(prev_state, input_b):
    input_b = tf.reshape(input_b, shape=[1, input_b.shape[0]])
    state = FastALIFStateTuple(v=prev_state[0], z=prev_state[1], b=prev_state[2], r=prev_state[3])
    new_b = self.decay_b * state.b + (tf.ones(shape=[self.units], dtype=tf.float32) - self.decay_b) * state.z
    thr = self.thr + new_b * self.beta
    z = state.z
    i_in = tf.matmul(input_b, W_in)
    i_rec = tf.matmul(z, W_rec)
    i_t = i_in + i_rec
    I_reset = z * thr * self.dt
    new_v = self._decay * state.v + (1 - self._decay) * i_t - I_reset
    # Spike generation
    is_refractory = tf.greater(state.r, .1)
    zeros_like_spikes = tf.zeros_like(z)
    new_z = tf.where(is_refractory, zeros_like_spikes, self.compute_z(new_v, thr))
    new_r = tf.clip_by_value(state.r + self.n_refractory * new_z - 1,
                             0., float(self.n_refractory))
    return [new_v, new_z, new_b, new_r]

@tf.function(autograph=not DEBUG)
def evolve_single(inputs):
    accumulated_state = tf.scan(step, inputs, initializer=state0)
    Z = tf.squeeze(accumulated_state[1])  # -> [T, units]
    if self.model_settings['avg_spikes']:
        Z = tf.reshape(tf.reduce_mean(Z, axis=0), shape=(1, -1))
    out = tf.matmul(Z, W_out) + b_out
    return out  # -> [BS, Num_labels]

# # - Using a simple loop
# out_store = []
# for i in range(fingerprint_3d.shape[0]):
#     out_store.append(tf.squeeze(evolve_single(fingerprint_3d[i, :, :])))
# return tf.reshape(out_store, shape=[fingerprint_3d.shape[0], self.d_out])

final_out = tf.squeeze(tf.map_fn(evolve_single, fingerprint_3d))  # -> [BS, T, self.units]
return final_out
This code snippet is inside a tf.function, but I omitted it since I don't think it's relevant.
As can be seen, I run the code on fingerprint_3d, a tensor that has the dimension [BatchSize,Time,InputDimension], e.g. [50,100,20]. When I run this with BatchSize < 10 everything works fine, although tf.scan already uses a lot of memory for that.
When I now execute the code on a batch of size 50, I suddenly get an OOM, even though I am executing it in an iterative manner (here commented out).
How should I execute this code so that the Batch Size doesn't matter?
Is TensorFlow maybe parallelizing my for loop so that it is executed over multiple batches at once?
Another unrelated question is the following: What function instead of tf.scan should I use if I only want to accumulate one state variable, compared to the case for tf.scan where it just accumulates all the state variables? Or is that possible with tf.scan?
As mentioned in the discussions here, tf.foldl, tf.foldr, and tf.scan all require keeping track of all values for all iterations, which is necessary for computations like gradients. I am not aware of any ways to mitigate this issue; still, I would also be interested if anyone has a better answer than mine.
When I used
@tf.function
def get_loss_and_gradients():
    with tf.GradientTape(persistent=False) as tape:
        logits, spikes = rnn.call(fingerprint_input=graz_dict["train_input"], W_in=W_in, W_rec=W_rec, W_out=W_out, b_out=b_out)
        loss = loss_normal(tf.cast(graz_dict["train_groundtruth"], dtype=tf.int32), logits)
    gradients = tape.gradient(loss, [W_in, W_rec, W_out, b_out])
    return loss, logits, spikes, gradients
it works.
When I remove the @tf.function decorator, the memory blows up. So it really seems important that TensorFlow can create a graph for your computations.
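As for the side question about accumulating only one state variable: tf.scan stacks everything the step function returns, so one possible workaround (a rough sketch I have not tested on your model, assuming step keeps the state shapes fixed) is a tf.while_loop that carries the full state but writes only the variable you need into a TensorArray:
import tensorflow as tf

def scan_keep_one(step_fn, inputs, init_state):
    # Carries the full state list [v, z, b, r] between steps but
    # accumulates only state[1] (the spikes z) in a TensorArray.
    T = tf.shape(inputs)[0]
    ta = tf.TensorArray(dtype=tf.float32, size=T)

    def body(t, state, ta):
        new_state = step_fn(state, inputs[t])
        ta = ta.write(t, new_state[1])  # store only z, discard the rest
        return t + 1, new_state, ta

    _, _, ta = tf.while_loop(lambda t, *_: t < T, body, [0, init_state, ta])
    return ta.stack()  # -> [T, units]
Note that for gradient computation TensorFlow still has to keep the intermediate values it recorded on the forward pass, so the memory relief from this is limited.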

A Simple Bayesian Network with a Coin-Flipping Problem

I am trying to implement a Bayesian network and solve a regression problem using PYMC3. In my model, I have a fair coin as the parent node. If the parent node is H, the child node selects the normal distribution N(5,0.2); if T, the child selects N(0,0.5). Here is an illustration of my network.
To simulate this network, I generated a sample dataset and tried doing Bayesian regression using the code below. Currently, the model does regression only for the child node as if the parent node does not exist. I would greatly appreciate it if anyone can let me know how to implement the conditional probability P(D|C). Ultimately, I am interested in finding the probability distribution for mu1 and mu2. Thank you!
import numpy as np
import pymc3 as pm
import theano
from numpy.random import normal
from scipy.stats import bernoulli

# Generate data for coin flip P(C) and store in c1
theta_real = 0.5  # unknown value in a real experiment
n_sample = 10
c1 = bernoulli.rvs(p=theta_real, size=n_sample)

# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
    if item == 0:
        d1.extend(normal(mu1, sigma1, 1))
    else:
        d1.extend(normal(mu2, sigma2, 1))

# I start building the PYMC3 model here
c1_tensor = theano.shared(np.array(c1))
d1_tensor = theano.shared(np.array(d1))
with pm.Model() as model:
    # define prior for c1. I am not sure how to do this.
    # c1_present = pm.Categorical('c1', observed=c1_tensor)
    # how do I incorporate P(D | C)?
    mu_prior = pm.Normal('mu', mu=2, sd=2, shape=1)
    sigma_prior = pm.HalfNormal('sigma', sd=2, shape=1)
    y_likelihood = pm.Normal('y', mu=mu_prior, sd=sigma_prior, observed=d1_tensor)
You could use the Dirichlet distribution as a prior for the coin toss and NormalMixture as the prior of the two Gaussians. In the following snippet I changed the fairness of the coin and increased the number of coin tosses, but you could adjust these in any way you want:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli

# Generate data for coin flip P(C) and store in c1
theta_real = 0.2  # unknown value in a real experiment
n_sample = 2000
c1 = bernoulli.rvs(p=theta_real, size=n_sample)

# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
    if item == 0:
        d1.extend(np.random.normal(mu1, sigma1, 1))
    else:
        d1.extend(np.random.normal(mu2, sigma2, 1))

with pm.Model() as model:
    w = pm.Dirichlet('p', a=np.ones(2))
    mu = pm.Normal('mu', 0, 20, shape=2)
    sigma = np.array([0.5, 0.2])
    pm.NormalMixture('like', w=w, mu=mu, sigma=sigma, observed=np.array(d1))
    trace = pm.sample()

pm.summary(trace)
This will give you the following:
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
mu__0 4.981222 0.023900 0.000491 4.935044 5.027420 2643.052184 0.999637
mu__1 -0.007660 0.004946 0.000095 -0.017388 0.001576 2481.146286 1.000312
p__0 0.213976 0.009393 0.000167 0.195602 0.231803 2245.905021 0.999302
p__1 0.786024 0.009393 0.000167 0.768197 0.804398 2245.905021 0.999302
The parameters are recovered nicely as you can also see from the traceplots:
The above implementation will give you the posterior of theta_real, mu1 and mu2 but I could not get convergence when I added sigma1 and sigma2 as parameters to be estimated by the data (even though the prior was quite narrow):
with pm.Model() as model:
    w = pm.Dirichlet('p', a=np.ones(2))
    mu = pm.Normal('mu', 0, 20, shape=2)
    sigma = pm.HalfNormal('sigma', sd=2, shape=2)
    pm.NormalMixture('like', w=w, mu=mu, sigma=sigma, observed=np.array(d1))
    trace = pm.sample()

print(pm.summary(trace))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, mu, p]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:10<00:00, 395.57draws/s]
The acceptance probability does not match the target. It is 0.883057127209148, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
mean sd mc_error ... hpd_97.5 n_eff Rhat
mu__0 1.244021 2.165433 0.216540 ... 5.005507 2.002049 212.596596
mu__1 3.743879 2.165122 0.216510 ... 5.012067 2.002040 235.750129
p__0 0.643069 0.248630 0.024846 ... 0.803369 2.004185 30.966189
p__1 0.356931 0.248630 0.024846 ... 0.798632 2.004185 30.966189
sigma__0 0.416207 0.125435 0.012517 ... 0.504110 2.009031 17.333177
sigma__1 0.271763 0.125539 0.012533 ... 0.497208 2.007779 19.217223
[6 rows x 7 columns]
Based on that you most likely will need to reparametrize if you also wanted to estimate the two standard deviations from this data.
This answer is to supplement @balleveryday's answer, which suggests the Gaussian mixture model; I had some trouble getting the symmetry breaking to work, though. Admittedly, the symmetry breaking in the official example is done in the context of Metropolis-Hastings sampling, whereas I think NUTS might be a little more sensitive to encountering impossible values (not sure). Here's what worked for me:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
import theano.tensor as tt

# everything should reproduce
np.random.seed(123)
n_sample = 2000

# Generate data for coin flip P(C) and store in c1
theta_real = 0.2  # unknown value in a real experiment
c1 = bernoulli.rvs(p=theta_real, size=n_sample)

# Generate data for normal distribution P(D|C) and store in d1
mu1, mu2 = 0, 5
sigma1, sigma2 = 0.5, 0.2
d1 = np.empty_like(c1, dtype=np.float64)
d1[c1 == 0] = np.random.normal(mu1, sigma1, np.sum(c1 == 0))
d1[c1 == 1] = np.random.normal(mu2, sigma2, np.sum(c1 == 1))

with pm.Model() as gmm_asym:
    # mixture vector
    w = pm.Dirichlet('p', a=np.ones(2))
    # Gaussian parameters (testval helps start off ordered)
    mu = pm.Normal('mu', 0, 20, shape=2, testval=[-10, 10])
    sigma = pm.HalfNormal('sigma', sd=2, shape=2)
    # break symmetry, forcing mu[0] < mu[1]
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(mu[1] - mu[0] < 0, -np.inf, 0))
    # observed
    pm.NormalMixture('like', w=w, mu=mu, sigma=sigma, observed=d1)
    # reproducible sampling
    tr_gmm_asym = pm.sample(tune=2000, target_accept=0.9, random_seed=20191121)
This produces samples with the statistics
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
mu__0 0.004549 0.011975 0.000226 -0.017398 0.029375 2425.487301 0.999916
mu__1 5.007663 0.008993 0.000166 4.989247 5.024692 2181.134002 0.999563
p__0 0.789983 0.009091 0.000188 0.773059 0.808062 2417.356539 0.999788
p__1 0.210017 0.009091 0.000188 0.191938 0.226941 2417.356539 0.999788
sigma__0 0.497322 0.009103 0.000186 0.480394 0.515867 2227.397854 0.999358
sigma__1 0.191310 0.006633 0.000141 0.178924 0.204859 2286.817037 0.999614
and the traces

TPU slower than GPU?

I just tried using TPU in Google Colab and I wanted to see how much faster TPU is than GPU. Surprisingly, I got the opposite result.
The following is the NN.
random_image = tf.random_normal((100, 100, 100, 3))
result = tf.layers.conv2d(random_image, 32, 7)
result = tf.reduce_sum(result)
Performance results:
CPU: 8s
GPU: 0.18s
TPU: 0.50s
I wonder why.... The complete code for TPU is as follows:
def calc():
    random_image = tf.random_normal((100, 100, 100, 3))
    result = tf.layers.conv2d(random_image, 32, 7)
    result = tf.reduce_sum(result)
    return result

tpu_ops = tf.contrib.tpu.batch_parallel(calc, [], num_shards=8)
session = tf.Session(tpu_address)
try:
    print('Initializing global variables...')
    session.run(tf.global_variables_initializer())
    print('Warming up...')
    session.run(tf.contrib.tpu.initialize_system())
    print('Profiling')
    start = time.time()
    session.run(tpu_ops)
    end = time.time()
    elapsed = end - start
    print(elapsed)
finally:
    session.run(tf.contrib.tpu.shutdown_system())
    session.close()
Benchmarking devices properly is hard, so please take everything you learn from these examples with a grain of salt. It's better in general to compare specific models you are interested in (e.g. running an ImageNet network) to understand performance differences. That said, I understand it's fun to do this, so...
Larger models will illustrate the TPU and GPU performance better. Your example is also including the compilation time in the cost of the TPU call: every call after the first for a given program and shape is cached, so you will want to run tpu_ops once before starting the timer unless you want to capture the compilation time.
Currently each call to a TPU function copies the weights to the TPU before it can start running, which affects small operations more significantly. Here's an example that runs a loop on the TPU before returning to the CPU, with the following outputs.
1 0.010800600051879883
10 0.09931182861328125
100 0.5581905841827393
500 2.7688047885894775
So you can actually run 100 iterations of this function in 0.55 s.
import os
import time
import tensorflow as tf

def calc(n):
    img = tf.random_normal((128, 100, 100, 3))
    def body(_):
        result = tf.layers.conv2d(img, 32, 7)
        result = tf.reduce_sum(result)
        return result
    return tf.contrib.tpu.repeat(n[0], body, [0.0])

session = tf.Session('grpc://' + os.environ['COLAB_TPU_ADDR'])
try:
    print('Initializing TPU...')
    session.run(tf.contrib.tpu.initialize_system())
    for i in [1, 10, 100, 500]:
        tpu_ops = tf.contrib.tpu.batch_parallel(calc, [[i] * 8], num_shards=8)
        print('Warming up...')
        session.run(tf.global_variables_initializer())
        session.run(tpu_ops)
        print('Profiling')
        start = time.time()
        session.run(tpu_ops)
        end = time.time()
        elapsed = end - start
        print(i, elapsed)
finally:
    session.run(tf.contrib.tpu.shutdown_system())
    session.close()

how to use conda accelerate / benchmarks?

I'm attempting to use Conda Accelerate to speed up some data preprocessing, but initial benchmarks indicate that either I'm not using it correctly or it has no effect on FFT and linear-algebra execution times in numpy and librosa. Re-reading the literature, does this mean I'm supposed to decorate and recode every ndarray operation, as in the batch-matmul example for NumbaPro? I'd assumed I could simply install it and numpy would get faster, but this doesn't appear to be the case.
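For example, is something along these lines the kind of rewrite that is expected (a rough sketch; the @vectorize signature and target choices are just my guess)?
import numpy as np
from numba import vectorize

# Wrap the hot element-wise math in a compiled ufunc instead of relying
# on plain numpy calls being accelerated automatically.
@vectorize(['float32(float32, float32)'], target='parallel')
def magnitude(re, im):
    return (re * re + im * im) ** 0.5

re = np.random.randn(1025, 400).astype(np.float32)
im = np.random.randn(1025, 400).astype(np.float32)
mag = magnitude(re, im)  # would replace (stft.real**2 + stft.imag**2)**0.5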
Benchmarks and code are below. I've installed accelerate via conda install accelerate and also imported it for good measure.
Thanks!
Result - negligible difference before and after conda install accelerate
Total time was 25.356
Total load time was 1.6743
Total math time was 22.1599
Total save time was 1.5139
Total stft math time was 12.9219
Total other numpy math time was 9.1886
Relevant code:
loads, maths, saves = [], [], []
stfts, nps = [], []

# now we have a dict of all source files grouped by voice
for i in range(30):
    v0_fn = v0_list[i]
    v1_fn = v1_list[i]
    tl0 = time.time()

    # Process v0 & v1 file
    v0_fn = signal_dir + v0_fn
    v0, fs_s = librosa.load(v0_fn, sr=None)
    v1_fn = signal_dir + v1_fn
    v1, fs_s = librosa.load(v1_fn, sr=None)
    tl1 = time.time()
    loads.append(tl1 - tl0)
    mix = v0 + v1

    # Capture the magnitude and phase of signal and signal + noise
    tm0 = time.time()
    v0_stft = librosa.stft(v0, int(frame_size * fs), int(step_size * fs)).transpose()
    tm1 = time.time()
    v0_mag = (v0_stft.real**2 + v0_stft.imag**2)**0.5
    v0_pha = np.arctan2(v0_stft.imag, v0_stft.real)
    v0_rtheta = np.stack((v0_mag, v0_pha), axis=0)
    tm2 = time.time()
    v1_stft = librosa.stft(v1, int(frame_size * fs), int(step_size * fs)).transpose()
    tm3 = time.time()
    v1_mag = (v1_stft.real**2 + v1_stft.imag**2)**0.5
    v1_pha = np.arctan2(v1_stft.imag, v1_stft.real)
    v1_rtheta = np.stack((v1_mag, v1_pha), axis=0)
    tm4 = time.time()
    mix_stft = librosa.stft(mix, int(frame_size * fs), int(step_size * fs)).transpose()
    tm5 = time.time()
    mix_mag = (mix_stft.real**2 + mix_stft.imag**2)**0.5
    mix_pha = np.arctan2(mix_stft.imag, mix_stft.real)
    mix_rtheta = np.stack((mix_mag, mix_pha), axis=0)
    tm6 = time.time()

    stfts += [tm1 - tm0, tm3 - tm2, tm5 - tm4]
    nps += [tm2 - tm1, tm4 - tm3, tm6 - tm5]

    data['sig_rtheta'] = v0_rtheta
    data['noi_rtheta'] = v1_rtheta
    data['mix_rtheta'] = mix_rtheta
    tl2 = time.time()
    maths.append(tl2 - tl1)

    with open(write_name, 'w') as f:
        cPickle.dump(all_info, f, protocol=-1)
    tl3 = time.time()
    saves.append(tl3 - tl2)

t1 = time.time()
print 'Total time was %.3f' % (t1 - t0)
print 'Total load time was %.4f' % np.sum(loads)
print 'Total math time was %.4f' % np.sum(maths)
print 'Total save time was %.4f' % np.sum(saves)
print 'Total stft math was %.4f' % np.sum(stfts)
print 'Total other numpy math time was %.4f' % np.sum(nps)