Why are window_length/hop_length multiplied by the sample rate in librosa.core.stft in this example? - voice-recognition

I'm new to voice recognition and I'm going through the details of this implementation of speaker verification. In data_preprocess.py the authors use the librosa library. Here is a simplified version of the code:
def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
    os.makedirs(res_dir, exist_ok=True)
    batch_frames = N * M * tdsv_frame
    batch_number = 0
    batch = []
    batch_len = 0
    for i, path in enumerate(tqdm(os.listdir(data_dir))):
        data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
        S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
        batch.append(S)
        batch_len += S.shape[1]
        if batch_len < batch_frames: continue
        batch = np.concatenate(batch, axis=1)[:, :batch_frames]
        np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
        batch_number += 1
        batch = []
        batch_len = 0
N = 2 # number of speakers of batch
M = 400 # number of utterances per speaker
tdsv_frame = 80 # feature size
sample_rate = 8000 # sampling rate
nfft = 512 # fft kernel size
window_len = 0.025 # window length (ms)
hop_len = 0.01 # hop size (ms)
data_dir = "./data/clean_testset_wav/"
res_dir = "./data/clean_testset_wav_prep/"
Based on a figure in the paper, they want to create a batch of features of size (N*M) * tdsv_frame.
I think I understand the concepts of window_length and hop_length, but what puzzles me is how the authors set these parameters. Why should we multiply these lengths by sample_rate, as is done here:
S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
Thank you.

librosa.core.stft takes win_length/hop_length as a number of samples. This is typical in digital signal processing, since the systems are fundamentally discrete, operating on samples taken at a fixed rate (the sample rate).
However, for humans it is easier to think of these durations in seconds or milliseconds, as in your example:
window_len = 0.025 # window length (ms)
hop_len = 0.01 # hop size (ms)
So to go from a time in seconds to a length in samples, one has to multiply by the sample rate.

The values of window_len and hop_len here are given in seconds (0.025 s = 25 ms and 0.01 s = 10 ms, despite the "(ms)" comments), whereas librosa expects them as a number of samples:
number of samples = time in seconds * sampling_rate
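As a quick sanity check with the numbers from the question:
sample_rate = 8000      # samples per second
window_len = 0.025      # 25 ms window, expressed in seconds
hop_len = 0.01          # 10 ms hop, expressed in seconds

win_length = int(window_len * sample_rate)   # 0.025 s * 8000 samples/s = 200 samples
hop_length = int(hop_len * sample_rate)      # 0.010 s * 8000 samples/s = 80 samples
print(win_length, hop_length)                # 200 80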

Related

Why does GradientTape behave differently when watching loop operations as opposed to array operations?

There is something about the workings of GradientTape that escapes my understanding.
Suppose we want to train an agent on the classic bandit problem using an actor-critic RL framework. There are two bandits, A and B, and the agent must learn to select A, which yields higher returns on average. The training consists of, say, 1000 epochs, in each of which the agent draws, say, 100 samples from each bandit. The reward is 1 every time the agent selects A, and 0 otherwise.
Let's see how the agent learns by observing rewards over 10 training simulations. Here is the code defining the agent and the environment (neither needs to be more complicated than below).
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from keras import Model
from keras.optimizers import Adam

n_sims = 10  # number of simulations
for n in range(n_sims):
    # define actors and optimizers for each simulation
    actor_input = Input(shape=(2,))
    actor_output = Dense(2, activation='softmax')(actor_input)
    globals()[f'actor_{n}'] = Model(inputs=actor_input, outputs=actor_output)
    globals()[f'actor_opt_{n}'] = Adam(learning_rate=.1)
    # define critics and optimizers for each simulation
    critic_input = Input(shape=(2,))
    critic_output = Dense(1, activation='softmax')(critic_input)
    globals()[f'critic_{n}'] = Model(inputs=critic_input, outputs=critic_output)
    globals()[f'critic_opt_{n}'] = Adam(learning_rate=.1)
    globals()[f'mean_rewards_{n}'] = []  # list to store rewards over training epochs for each simulation

A = np.random.normal(loc=10, scale=15, size=int(1e5))  # bandit A
B = np.random.normal(loc=0, scale=1, size=int(1e5))  # bandit B
n_training_epochs = 1000
n_samples = 100
Let's consider two alternative codes for the training loop using GradientTape, both based on a simple 'vanilla' loss function.
The first is the slow one: it literally uses a for loop over the samples drawn in each epoch. The cumulative actor and critic losses are computed iteratively, and their means are then used to update the respective network weights.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        cum_actor_loss, cum_critic_loss, cum_reward = 0, 0, 0
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            for A_sample, B_sample in zip(A_samples, B_samples):
                probs = globals()[f'actor_{n}'](tf.reshape([A_sample, B_sample], (1, -1)))[0]
                action = np.random.choice(['A', 'B'], p=np.squeeze(probs))
                reward = 1 if action == 'A' else 0
                cum_reward += reward
                action_prob = probs[['A', 'B'].index(action)]
                value = globals()[f'critic_{n}'](tf.reshape([A_sample, B_sample], (1, -1)))[0]
                advantage = reward - value
                cum_actor_loss += -tf.math.log(action_prob) * advantage
                cum_critic_loss += advantage**2
            mean_actor_loss = cum_actor_loss / n_samples
            mean_critic_loss = cum_critic_loss / n_samples
            globals()[f'mean_rewards_{n}'].append(cum_reward / n_samples)
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
If you plot the average training rewards over each epoch, you'll probably get something like this figure
In the second option, instead of using an explicit for loop over samples in each epoch, we perform operations on arrays. This alternative is much faster in terms of computation time.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            probs = globals()[f'actor_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples, -1)))
            actions = np.array([np.random.choice(['A', 'B'], p=np.squeeze(probs[i])) for i in range(len(probs))]).reshape(n_samples, -1)
            rewards = np.array([1.0 if action == 'A' else 0.0 for action in actions]).reshape(n_samples, -1)
            globals()[f'mean_rewards_{n}'].append(np.mean(rewards))
            values = globals()[f'critic_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples, -1)))
            advantages = rewards + tf.math.negative(values)
            actions_num = [['A', 'B'].index(action) for action in actions]
            action_probs = tf.reduce_sum(tf.one_hot(actions_num, len(['A', 'B'])) * probs, axis=1)
            mean_actor_loss = -tf.reduce_mean(advantages * tf.math.log(action_probs))
            mean_critic_loss = tf.reduce_mean(tf.pow(advantages, 2))
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
Let's plot the average reward over epochs, to obtain something like this
As you can see, the agent tends to learn earlier and more stably in the first case than in the second (where learning may not even happen), although the two training loops are in theory mathematically equivalent. Why is that? The reason probably has something to do with the fact that, in the first option, GradientTape watches the trainable variables several times per epoch before the gradients are applied, whereas in the second option it does so only once. Even so, I can't figure out why exactly this produces the observed results. Can you help me understand?

How can I speed up my Python code for sphere contact (collision) detection using spatial searching?

I am working on a spatial-search case for spheres in which I want to find connected (contacting) spheres. To this end, I search around each sphere for spheres whose centers lie within a distance of the maximum sphere diameter from the searched sphere's center. At first I tried the related SciPy methods, but the SciPy approach takes longer than an equivalent NumPy method. With SciPy, I first determined the number of K-nearest spheres and then found them with cKDTree.query, which led to more time consumption. However, it is slower than the NumPy method even when the first step is replaced by a constant value (which is not a good idea in this case). This is contrary to my expectations about SciPy's spatial-search speed. So, I tried to replace some NumPy lines with list loops to speed things up with numba prange. Numba runs the code a little faster, but I believe this code can be optimized for better performance, perhaps by vectorization, by using other NumPy modules, or by using Numba in another way. I have iterated over all spheres to prevent probable memory problems when the number of spheres is high.
import numpy as np
import numba as nb
from scipy.spatial import cKDTree, distance

# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy')     # shape: (n-spheres, ) must be loaded by np.load('a.npy') or np.loadtxt('radii_large.csv')
poss = np.load('b.npy')      # shape: (n-spheres, 3) must be loaded by np.load('b.npy') or np.loadtxt('pos_large.csv', delimiter=',')
"""
rnd = np.random.RandomState(70)
data_volume = 200000

radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()

x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------

# @nb.jit('float64[:,::1](float64[:,::1], float64[::1])', forceobj=True, parallel=True)
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = np.array([], dtype=np.float64)
    ends_ind = np.empty([1, 2], dtype=np.int64)

    """ using list looping """
    # particle_corsp_overlaps = []
    # ends_ind = []

    # for particle_idx in nb.prange(len(poss)):  # by list looping
    for particle_idx in range(len(poss)):
        unshared_idx = np.delete(np.arange(len(poss)), particle_idx)   # <--- relatively high time consumer
        poss_without = poss[unshared_idx]

        """ # SCIPY method ---------------------------------------------------------------------------------------------
        nears_i_ind = cKDTree(poss_without).query_ball_point(poss[particle_idx], r=dia_max)                           # <--- high time consumer
        if len(nears_i_ind) > 0:
            dist_i, dist_i_ind = cKDTree(poss_without[nears_i_ind]).query(poss[particle_idx], k=len(nears_i_ind))     # <--- high time consumer
            if not isinstance(dist_i, float):
                dist_i[dist_i_ind] = dist_i.copy()
        """ # NUMPY method --------------------------------------------------------------------------------------------
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dia_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dia_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dia_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dia_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dia_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dia_max

        nears_i_ind = np.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        if len(nears_i_ind) > 0:
            dist_i = distance.cdist(poss_without[nears_i_ind], poss[particle_idx][None, :]).squeeze()                 # <--- relatively high time consumer
        # """ # -------------------------------------------------------------------------------------------------------

        contact_check = dist_i - (radii[unshared_idx][nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps = np.concatenate((particle_corsp_overlaps, connected))

        """ using list looping """
        # if len(connected) > 0:
        #     for value_ in connected:
        #         particle_corsp_overlaps.append(value_)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = np.array(nears_i_ind)[contacts_ind]
        sphere_olps_ind = np.where((poss[:, None] == poss_without[contacts_sec_ind][None, :]).all(axis=2))[0]         # <--- high time consumer

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind = np.concatenate((ends_ind, ends_ind_mod_temp))
        else:
            ends_ind[0, 0], ends_ind[0, 1] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

        """ using list looping """
        # for contacted_idx in sphere_olps_ind:
        #     ends_ind.append([particle_idx, contacted_idx])

    # ends_ind_org = np.array(ends_ind)  # using lists
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.array(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org
In one of my tests on 23,000 spheres, the SciPy, NumPy, and Numba-aided methods finished the loop in about 400, 200, and 180 seconds, respectively, using Colab TPU; for 500,000 spheres it takes 3.5 hours. These execution times are not at all satisfying for my project, where the number of spheres may be up to 1,000,000 in a medium data volume. I will call this code many times from my main code and am looking for ways to make it run in milliseconds (as fast as it possibly can). Is that possible?
I would appreciate it if anyone could speed up the code as needed.
Notes:
This code must be executable with Python 3.7+, on CPU and GPU.
This code must be applicable for a data size of at least 300,000 spheres.
Any NumPy, SciPy, or other equivalent modules that replace my hand-written ones and make the code significantly faster will be upvoted.
I would appreciate any recommendations or explanations about:
Which method could be faster for this problem?
Why is SciPy not faster than the other methods in this case, and where could it be helpful for this kind of problem?
Choosing between iterative methods and matrix-form methods is confusing to me. Iterative methods use less memory and can be tuned with Numba, but I think they are not as useful as, or comparable to, matrix methods (which are limited by memory) such as NumPy for huge numbers of spheres. For this case, perhaps I could remove the iteration with NumPy, but I strongly suspect it cannot be handled, due to the huge matrix operations and memory limits.
Prepared sample test data:
Poss data: 23000, 500000
Radii data: 23000, 500000
Line by line speed test logs: for two test cases scipy method and numpy time consumption.
UPDATE: this answer has now been superseded by a new one below
(which takes the updates of the question into account), providing even faster code based on a different approach.
Step 1: better algorithm
First of all, building a k-d tree runs in O(n log n) time and doing a query runs in O(log n) time, where n is the number of points. So using a k-d tree seems like a good idea at first glance. However, your code builds a k-d tree for each point, resulting in O(n² log n) time. This is why the SciPy solution is slower than the others. The thing is that SciPy does not provide a way to update a k-d tree, and efficiently updating a k-d tree appears not to be possible. Fortunately, this is not a problem in your case: you can just build one k-d tree with all the points once and then discard the current point from the result of each query.
Moreover, the computation of sphere_olps_ind runs in O(n² m) time, where n is the total number of points and m is the average number of neighbours (i.e. closest points retrieved from the k-d tree query). Assuming there are no duplicate points, it turns out that sphere_olps_ind is simply equal to np.sort(contacts_sec_ind). The latter runs in O(m log m), which is drastically better.
Additionally, using np.concatenate in a loop to append values to a NumPy array is slow, because it creates a new, bigger array at each iteration. Using a list was a good idea, but appending NumPy arrays directly to a list and then calling np.concatenate once is much faster.
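As a tiny self-contained illustration of this append-then-concatenate point (a sketch, not part of the answer's code):
import numpy as np

chunks_src = [np.arange(3), np.arange(5), np.arange(2)]

# Slow: np.concatenate in the loop reallocates and copies the whole array each time.
out = np.array([], dtype=np.int64)
for chunk in chunks_src:
    out = np.concatenate((out, chunk))

# Fast: append the chunks to a list and concatenate only once at the end.
parts = []
for chunk in chunks_src:
    parts.append(chunk)
out = np.concatenate(parts)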
Here is the resulting code:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [np.empty([1, 2], dtype=np.int64)]

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        # Find the nearest point including the current one and
        # then remove the current point from the output.
        # The distances can be computed directly without a new query.
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max), dtype=np.int64)
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind.append(ends_ind_mod_temp)
        else:
            ends_ind[0][:] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org
Step 2: optimization
First of all, the query_ball_point call can be done for all the points at once, in parallel, by passing poss to the SciPy method and specifying workers=-1. However, note that this requires more memory.
Moreover, Numba can be used to significantly speed up the computation. The parts that can mainly be improved are the computation of the distances, the creation of many unnecessary temporary arrays, and the use of direct NumPy array indexing instead of list appends (since the bounded size of the output array is known after the query_ball_point call).
Here is a simple example of optimized code using Numba:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1], float64)')
def compute(poss, all_neighbours, all_neighbours_sizes, dia_max):
    particle_corsp_overlaps = []
    ends_ind_lst = [np.empty((1, 2), dtype=np.int64)]
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = all_neighbours[an_offset:an_offset+cur_len]
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        # Compute the distances
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        if particle_idx > 0:
            ends_ind_lst.append(ends_ind_mod_temp)
        else:
            tmp = ends_ind_lst[0]
            tmp[:] = ends_ind_mod_temp[0, :]

    return particle_corsp_overlaps, ends_ind_lst


def ends_gap(poss, dia_max):
    kdtree = cKDTree(poss)

    tmp = kdtree.query_ball_point(poss, r=dia_max, workers=-1)
    all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours_sizes = np.array([len(e) for e in tmp], dtype=np.int64)

    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes, dia_max)

    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org


ends_gap(poss, dia_max)
Performance analysis
Here are the performance results on my 6-core machine (with an i5-9600KF processor) on the small dataset:
Initial code with Scipy: 259 s
Initial default code with Numpy: 112 s
Optimized algorithm: 1.37 s
Final optimized code: 0.22 s
Unfortunately, the Scipy k-d tree is too big to fit in memory on my machine with the big dataset.
Thus the Numba implementation with an efficient algorithm is up to ~510 times faster than the initial NumPy implementation and ~1200 times faster than the initial SciPy implementation.
The Numba code can be further optimized, but please note that the Numba compute call takes less than 25% of the time on my machine. The np.unique call is the most expensive, and it is not easy to make it faster. A significant part of the time is spent in the SciPy-to-Numba data conversion, but this code is mandatory as long as SciPy is used. Thus, the code can be improved a bit (e.g. certainly 2x faster) with advanced Numba optimizations, but if you need much faster code, then you need to use a native language like C++ and a highly optimized parallel k-d tree implementation. I expect very optimized native code to be an order of magnitude faster, but not much more. I can hardly believe the big dataset could be computed in less than 10 ms on my machine, regardless of the implementation.
Notes
Note that gap differs between the provided functions (the other outputs are unchanged). However, the same thing happens between the initial SciPy method and the NumPy one. This appears to come from the ordering of variables like nears_i_ind and dist_i, which is undefined by SciPy and changes the gap result in a non-trivial way (not just the order of gap). I am not sure this is a problem with the initial implementation. Because of that, it is much harder to compare the correctness of the different implementations.
forceobj should not be used in production, as the documentation states it is only useful for testing purposes.
Based on previous answers, I designed an efficient algorithm with a much lower memory footprint that is much faster than the previous ones (especially on the large dataset). That being said, this algorithm is far more complex and pushes the limits of Python and Numba.
The key issue of the previous algorithms is that they use a dia_max threshold which is much bigger than actually required. Indeed, dia_max is set to the maximum possible radius so as to be sure not to miss any overlap. The thing is that the big dataset contains balls of very different sizes, and some of them are huge. This means the previous algorithms were searching a very large radius around many small balls, with the result that thousands of neighbours had to be checked per ball while only a few can truly overlap.
One solution to address this problem efficiently is to split the balls into different groups based on their size. The idea is to first sort the balls by radius, then split the sorted balls into two groups, then independently query neighbours between each possible pair of groups, and finally merge the data so as to apply the previous algorithm (with some additional optimizations). More specifically, the query is applied between small balls and big ones, small balls and other small ones, big balls and other big ones, and big balls and small ones.
Another key point to speed this up is to issue the different neighbour queries in parallel using joblib. This solution is far from perfect, since the BallTree object needs to be duplicated, which is inefficient, but this is mandatory because of the way parallelism is currently done in CPython (i.e. GIL, pickling, etc.). Using a package that supports parallel requests could bypass this inherent limitation of CPython, but the existing packages that do so either do not seem to provide an interface sufficiently useful to address this problem or are not optimized enough to actually be useful.
Finally, the Numba code can be strongly optimized by removing almost all of the very expensive (implicit) array allocations. Using an in-place sorting algorithm optimized for small arrays also improves the execution time significantly (mainly because Numba's default implementation performs several expensive allocations and is not optimized for small arrays). In addition, the final np.unique operation can be completely rewritten with a basic loop, since the main loop iterates over balls with increasing IDs (hence already sorted).
Here is the resulting code:
import numpy as np
import numba as nb
from sklearn.neighbors import BallTree
from joblib import Parallel, delayed


def flatten_neighbours(arr):
    sizes = np.fromiter(map(len, arr), count=len(arr), dtype=np.int64)
    values = np.concatenate(arr, dtype=np.int64)
    return sizes, values


@delayed
def find_neighbours(searched_pts, ref_pts, max_dist):
    balltree = BallTree(ref_pts, leaf_size=16, metric='euclidean')
    res = balltree.query_radius(searched_pts, r=max_dist)
    return flatten_neighbours(res)


def vstack_neighbours(top_infos, bottom_infos):
    top_sizes, top_values = top_infos
    bottom_sizes, bottom_values = bottom_infos
    return np.concatenate([top_sizes, bottom_sizes]), np.concatenate([top_values, bottom_values])


@nb.njit('(Tuple([int64[::1],int64[::1]]), Tuple([int64[::1],int64[::1]]), int64)')
def hstack_neighbours(left_infos, right_infos, offset):
    left_sizes, left_values = left_infos
    right_sizes, right_values = right_infos
    n = left_sizes.size
    out_sizes = np.empty(n, dtype=np.int64)
    out_values = np.empty(left_values.size + right_values.size, dtype=np.int64)
    left_cur, right_cur, out_cur = 0, 0, 0
    right_values += offset
    for i in range(n):
        left, right = left_sizes[i], right_sizes[i]
        full = left + right
        out_values[out_cur:out_cur+left] = left_values[left_cur:left_cur+left]
        out_values[out_cur+left:out_cur+full] = right_values[right_cur:right_cur+right]
        out_sizes[i] = full
        left_cur += left
        right_cur += right
        out_cur += full
    return out_sizes, out_values


@nb.njit('(int64[::1], int64[::1], int64[::1], int64[::1])')
def reorder_neighbours(in_sizes, in_values, index, reverse_index):
    n = reverse_index.size
    out_sizes = np.empty_like(in_sizes)
    out_values = np.empty_like(in_values)
    in_offsets = np.empty_like(in_sizes)
    s, cur = 0, 0

    for i in range(n):
        in_offsets[i] = s
        s += in_sizes[i]

    for i in range(n):
        in_ind = reverse_index[i]
        size = in_sizes[in_ind]
        in_offset = in_offsets[in_ind]
        out_sizes[i] = size
        for j in range(size):
            out_values[cur+j] = index[in_values[in_offset+j]]
        cur += size

    return out_sizes, out_values


@nb.njit
def small_inplace_sort(arr):
    if len(arr) < 80:
        # Basic insertion sort
        i = 1
        while i < len(arr):
            x = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > x:
                arr[j+1] = arr[j]
                j = j - 1
            arr[j+1] = x
            i += 1
    else:
        arr.sort()


@nb.jit('(float64[:, ::1], float64[::1], int64[::1], int64[::1])')
def compute(poss, radii, neighbours_sizes, neighbours_values):
    n, m = neighbours_sizes.size, np.max(neighbours_sizes)

    # Big buffers allocated with the maximum size.
    # Thanks to virtual memory, it does not take more memory than actually needed.
    particle_corsp_overlaps = np.empty(neighbours_values.size, dtype=np.float64)
    ends_ind_org = np.empty((neighbours_values.size, 2), dtype=np.float64)

    in_offset = 0
    out_offset = 0
    buff1 = np.empty(m, dtype=np.int64)
    buff2 = np.empty(m, dtype=np.float64)
    buff3 = np.empty(m, dtype=np.float64)

    for particle_idx in range(n):
        size = neighbours_sizes[particle_idx]
        cur = 0

        for i in range(size):
            value = neighbours_values[in_offset+i]
            if value != particle_idx:
                buff1[cur] = value
                cur += 1

        nears_i_ind = buff1[0:cur]
        small_inplace_sort(nears_i_ind)  # Note: bottleneck of this function
        in_offset += size

        if len(nears_i_ind) == 0:
            continue

        x1, y1, z1 = poss[particle_idx]

        cur = 0
        for i in range(len(nears_i_ind)):
            index = nears_i_ind[i]
            x2, y2, z2 = poss[index]
            dist = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
            contact_check = dist - (radii[index] + radii[particle_idx])
            if contact_check <= 0.0:
                buff2[cur] = contact_check
                buff3[cur] = index
                cur += 1

        particle_corsp_overlaps[out_offset:out_offset+cur] = buff2[0:cur]

        contacts_sec_ind = buff3[0:cur]
        small_inplace_sort(contacts_sec_ind)
        sphere_olps_ind = contacts_sec_ind

        for i in range(cur):
            ends_ind_org[out_offset+i, 0] = particle_idx
            ends_ind_org[out_offset+i, 1] = sphere_olps_ind[i]

        out_offset += cur

    # Truncate the views to their real size
    particle_corsp_overlaps = particle_corsp_overlaps[:out_offset]
    ends_ind_org = ends_ind_org[:out_offset]

    assert len(ends_ind_org) % 2 == 0
    size = len(ends_ind_org)//2
    ends_ind = np.empty((size, 2), dtype=np.int64)
    ends_ind_idx = np.empty(size, dtype=np.int64)
    gap = np.empty(size, dtype=np.float64)
    cur = 0

    # Find efficiently duplicates (replace np.unique+np.sort)
    for i in range(len(ends_ind_org)):
        left, right = ends_ind_org[i]
        if left < right:
            ends_ind[cur, 0] = left
            ends_ind[cur, 1] = right
            ends_ind_idx[cur] = i
            gap[cur] = particle_corsp_overlaps[i]
            cur += 1

    return gap, ends_ind, ends_ind_idx, ends_ind_org


def ends_gap(poss, radii):
    assert poss.size >= 1

    # Sort the balls
    index = np.argsort(radii)
    reverse_index = np.empty(index.size, np.int64)
    reverse_index[index] = np.arange(index.size, dtype=np.int64)
    sorted_poss = poss[index]
    sorted_radii = radii[index]

    # Split them in two groups: the small and the big ones
    split_ind = len(radii) * 3 // 4
    small_poss, big_poss = np.split(sorted_poss, [split_ind])
    small_radii, big_radii = np.split(sorted_radii, [split_ind])
    max_small_radii = sorted_radii[max(split_ind, 0)]
    max_big_radii = sorted_radii[-1]

    # Find the neighbours in parallel
    result = Parallel(n_jobs=4, backend='threading')([
        find_neighbours(small_poss, small_poss, small_radii+max_small_radii),
        find_neighbours(small_poss, big_poss,   small_radii+max_big_radii),
        find_neighbours(big_poss,   small_poss, big_radii+max_small_radii),
        find_neighbours(big_poss,   big_poss,   big_radii+max_big_radii)
    ])
    small_small_neighbours = result[0]
    small_big_neighbours = result[1]
    big_small_neighbours = result[2]
    big_big_neighbours = result[3]

    # Merge the (segmented) arrays in a big one
    neighbours_sizes, neighbours_values = vstack_neighbours(
        hstack_neighbours(small_small_neighbours, small_big_neighbours, split_ind),
        hstack_neighbours(big_small_neighbours, big_big_neighbours, split_ind)
    )

    # Reverse the indices.
    # Note that the results in `neighbours_values` associated to
    # `neighbours_sizes[i]` are subsets of `query_radius([poss[i]], r=dia_max)`
    # on a `BallTree(poss)`.
    res = reorder_neighbours(neighbours_sizes, neighbours_values, index, reverse_index)
    neighbours_sizes, neighbours_values = res

    # Finally compute the neighbours with a method similar to the
    # previous one, but using a much faster optimized code.
    return compute(poss, radii, neighbours_sizes, neighbours_values)


result = ends_gap(poss, radii)
Here are the results (still on the same i5-9600KF machine):
Small dataset:
- Reference optimized Numba code: 256 ms
- This highly-optimized Numba code: 82 ms
Big dataset:
- Reference optimized Numba code: 42.7 s (takes about 7~8 GiB of RAM)
- This highly-optimized Numba code: 4.2 s (takes about 1 GiB of RAM)
Thus the new algorithm is about 3.1 times faster on the small dataset (in addition to the previous optimizations), and about 10 times faster on the big dataset! This is three orders of magnitude faster than the initially posted algorithms.
Note that 80% of the time is spent in the BallTree query (which is already mostly parallel). The main Numba computing function takes only 12% of the time, and more than 75% of that time is spent sorting the input indices. As a result, the neighbourhood search is clearly the bottleneck. It can be improved a bit by splitting the current queries into multiple smaller ones, but this would make the code even more complex for a relatively small improvement (e.g. 1.5x faster). Note that more complex code is harder to maintain and modifications are bug-prone. Thus, I think moving to a native language to overcome the limitations of Python is the best solution to increase performance. That being said, writing faster native code to solve this problem is far from simple (unless you find a good k-d tree, octree, or ball tree library). Still, it is certainly better than optimizing this code further.
Analysis
A profiling analysis shows that at least 50% of the time in scikit-learn's BallTree is spent in unoptimized scalar loops that could use SIMD instructions like AVX-2 (and loop unrolling) to be about 4 times faster. Additionally, some multi-threading issues are also visible in the profile (the 4 threads at the top are the joblib workers; the light-green sections are idle time).
This shows that the implementation is sub-optimal. One possible way to improve the execution time is to optimize the hot loops of the scikit-learn BallTree implementation. Another strategy could be to use threads more efficiently (possibly by releasing the GIL in some parts of the scikit-learn module).
The BallTree class of scikit-learn is written in Cython (BallTree is based on KDTree, itself based on BinaryTree), so you can try to rebuild the package on your machine and simply tweak the compiler optimizations. Using the flags -O3 -march=native -ffast-math should enable the compiler to use faster SIMD instructions and more aggressive optimizations, resulting in a significant speed-up. Note that using -ffast-math is unsafe, as it assumes the scikit-learn code will never use NaN, Inf or -0 values (otherwise the result is completely undefined) and that floating-point operations are associative (resulting in different results). That being said, such an option is critical to improve the automatic vectorization of numerical code.
Regarding the GIL, one can see that it is released in the query_radius function, but it does not seem to be the case for the BallTree constructor. Maybe the simplest solution is to implement a parallel version of query/query_radius, as SciPy did.
By fixing the query radius at twice the max sphere radius, you're creating a lot of spurious "collisions" to filter out.
The Python below achieves a significant speedup relative to your answer by using a fourth dimension to improve the selectivity of the kd-tree queries. Each Euclidean ball of radius r is over-approximated by an L1 ball of radius r√d where d is the dimension (3 here). The test for L1 balls colliding in 3d becomes a test for points being within a fixed L1 distance in 4d.
If you switched to a lower level language, you could potentially avoid a separate filtering step by altering the kd-tree implementation to use a combination L2+L1 metric.
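For reference, here is a short sketch (my own reasoning, not text from the answer) of why the fixed 4-d L1 query radius of $2 r_{\max}$ cannot miss a true collision, writing $c_i$, $r_i$ for the centers and radii and $d = 3$:

$$\|c_i - c_j\|_1 \le \sqrt{d}\,\|c_i - c_j\|_2 \le \sqrt{d}\,(r_i + r_j) \quad \text{whenever the spheres overlap.}$$

After dividing the coordinates by $\sqrt{d}$ and appending the fourth coordinate ($r_i - r_{\max}$ for the query points, $r_{\max} - r_j$ for the tree points), the 4-d L1 distance between points $i$ and $j$ is $\|c_i - c_j\|_1/\sqrt{d} + (2 r_{\max} - r_i - r_j)$, which by the inequality above is at most $2 r_{\max}$ for every overlapping pair.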
import numpy as np
from scipy import spatial
from timeit import default_timer


def load_data():
    centers = np.loadtxt("pos_large.csv", delimiter=",")
    radii = np.loadtxt("radii_large.csv")
    assert radii.shape + (3,) == centers.shape
    return centers, radii


def count_contacts(centers, radii):
    scaled_centers = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree = spatial.cKDTree(np.c_[scaled_centers, max_radius - radii])
    count = 0
    for i, x in enumerate(np.c_[scaled_centers, radii - max_radius]):
        for j in tree.query_ball_point(x, r=2 * max_radius, p=1):
            d = centers[i] - centers[j]
            r = radii[i] + radii[j]
            if i < j and np.inner(d, d) <= r * r:
                count += 1
    return count


def main():
    centers, radii = load_data()
    start = default_timer()
    print(count_contacts(centers, radii))
    end = default_timer()
    print(end - start)


if __name__ == "__main__":
    main()
As an update to Richard's answer, and to overcome probable memory problems, I post this answer. During my test runs, memory usage grows and limits the execution to smaller data volumes (at most 200,000 on my machine and 100,000 on Colab). This problem leads to much longer runtimes than those reported by Richard. So, I opened a SciPy issue about these performance differences and compared some memory results there.
But I have not received any answer so far, and the origin of these significant performance differences is still not clear to me.
Fezzani referred to another SciPy issue about using chunks and prepared a comparison showing the influence of the chunk value on the runtimes. Strangely, although Fezzani's machine (Intel Core i7-10700K CPU @ 3.80 GHz × 16; 32 GiB of RAM) seems to be more powerful than Richard's machine (6-core i5-9600KF processor, 16 GiB of RAM, 2 channels DDR4 @ 3200 MHz reaching 36~40 GiB/s), his execution on the large data takes at least (around) 33 seconds with the chunk method (to avoid memory problems).
I could not figure out why, or which hardware allows machines to avoid these memory problems and reach satisfyingly fast execution as Richard's does (perhaps it is related to the KF type of Richard's CPU).
By looking into some related memory issues, I could guess that cKDTree methods face this inevitable problem when the data volume is huge, and that scikit-learn may be a better choice. In this regard, based on my understanding of JaminSore's answer and the referenced Martelli answer, I tried to evaluate BallTree and KDTree from scikit-learn. BallTree performs better than KDTree in my cases (about 1.5 to 2 times faster), so I used it. There were no memory problems for the large data, but it took 2 minutes (Richard's results and mine now differ just in time units ;)). It ran faster than SciPy as the data volume increased. In my tests, SciPy was faster on smaller data volumes (with low memory consumption), and as the data volume grew, SciPy's performance fell behind due to its implementation behaviour or related bugs (still unclear to me); for my prepared 100,000 data volume, scikit-learn performs 1.5 to 2 times faster.
I guess that using arrays is the big advantage of scikit-learn compared to the SciPy method's lists, which can be deduced from the aforementioned Martelli answer. It may be the reason for the different performances.
scikit-learn methods return an object-type ndarray containing arrays of different lengths that need to be sorted to get the same results as the main code. I applied the related sorting to each element in the loop of the compute function by modifying the nears_i_ind line to nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]). Using BallTree, tmp and all_neighbours consume about the same amount of memory. Note: if both have the same name, memory consumption is reduced (almost halved). So Richard's ends_gap function, modified to use BallTree, becomes:
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')

    # tmp = balltree.query_radius(poss, r=dia_max)
    # all_neighbours = np.concatenate(tmp, dtype=np.int64)

    all_neighbours = balltree.query_radius(poss, r=dia_max)
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)

    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)

    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org
It is not multi-threaded, which could improve the speed; I will try to multi-thread it.
On my machine (1st-gen Intel Core i5 760 @ 2.8 GHz, 16 GB dual-channel DDR3 Ripjaws CL9 RAM, x64 Windows system) for a 200,000 data volume:
There were some mistakes in my two proposed methods that resulted in different gap values, as mentioned in the Notes section by Richard. To produce the same results, return_sorted=True must be added for nears_i_ind in the optimized algorithm, and ends_ind and ends_ind_lst must be changed to lists, besides removing the if-else statements in both codes:
Optimized algorithm:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = []                  # <------- this line is modified

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64)  # <------- this line is modified
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        ends_ind.append(ends_ind_mod_temp)  # <------- this line is modified

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org
Numba final optimized code:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1])')
def compute(poss, all_neighbours, all_neighbours_sizes):
    particle_corsp_overlaps = []
    ends_ind_lst = []              # <------- this line is modified
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len])  # <------- this line is modified
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        ends_ind_lst.append(ends_ind_mod_temp)  # <------- this line is modified

    return particle_corsp_overlaps, ends_ind_lst


def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')                                      # <------- new code

    all_neighbours = balltree.query_radius(poss, r=dia_max)                            # <------- new code and modified
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)  # <------- this line is modified
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)                    # <------- this line is modified

    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)

    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]

    return gap, ends_ind, ends_ind_idx, ends_ind_org
On my machine, for a data volume of around 550,000:
Have you tried FLANN?
This code doesn't solve your problem completely. It simply finds the nearest 50 neighbors to each point in your 500000 point dataset:
from pyflann import FLANN
p = np.loadtxt("pos_large.csv", delimiter=",")
flann = FLANN()
flann.build_index(pts=p)
idx, dist = flann.nn_index(qpts=p, num_neighbors=50)
The last line takes less than a second on my laptop without any tuning or parallelization.
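If it helps, here is a hedged sketch (my addition, not part of this answer) of how the neighbour lists returned above could be filtered down to actual sphere contacts, assuming p and idx from the snippet above and the radii file from the question:
import numpy as np

radii = np.loadtxt("radii_large.csv")

# Re-check the candidate pairs explicitly with the sphere-overlap condition.
# Note: this builds an (n, 50, 3) intermediate, which is memory-hungry for n = 500,000.
d = np.linalg.norm(p[:, None, :] - p[idx], axis=2)   # center-to-center distances, shape (n, 50)
overlap = d <= radii[:, None] + radii[idx]           # overlap test against r_i + r_j
i_ind, k_ind = np.nonzero(overlap)
pairs = np.column_stack((i_ind, idx[i_ind, k_ind]))
pairs = pairs[pairs[:, 0] < pairs[:, 1]]             # drop self-matches and count each contact once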

Nonlinear programming APOPT solver for optimal EV charging infeasible with variable boundary <= 0 (GEKKO Python)

I have tried to do optimal EV charging scheduling using the GEKKO package. However, my code gets stuck on a variable boundary condition when the lower bound is set to be less than or equal to zero, i.e., x = m.Array(m.Var, n_var, value=0, lb=0, ub=1.0). The error message is 'Unsuccessful with error code 0'.
Below is my Python script. If you have any advice on this problem, please don't hesitate to let me know.
Thanks,
Chitchai
#------------------------------
import numpy as np
import pandas as pd
import math
import os
from gekko import GEKKO

print('...... Preparing data for optimal EV charging ........')

#---------- Read data from csv.file -----------)
Main_path = os.path.dirname(os.path.realpath(__file__))
Baseload_path = Main_path + '/baseload_data.csv'
TOU_path = Main_path + '/TOUprices.csv'
EV_path = Main_path + '/EVtravel.csv'

df_Baseload = pd.read_csv(Baseload_path, index_col='Time')
df_TOU = pd.read_csv(TOU_path, index_col='Time')
df_EVtravel = pd.read_csv(EV_path, index_col='EV_no')

#Create and change run directory
rd = r'.\RunDir'
if not os.path.isdir(os.path.abspath(rd)):
    os.mkdir(os.path.abspath(rd))

#--------------------EV charging optimization function----------------#
def OptEV(tou_rate, EV_data, P_baseload):
    """This function is to run EV charging optimization for each houshold"""

    #Initialize EV model and traveling data in 96 intervals(interval = 15min)
    s_time = 4*(EV_data[0]//1) + math.ceil(100*(EV_data[0]%1)/15)   #starting time
    d_time = 4*(EV_data[1]//1) + math.floor(100*(EV_data[1]%1)/15)  #departure time
    Pch_rating = EV_data[2]  #charing rated power(kW)
    kWh_bat = EV_data[3]     #Battery rated capacity(kWh)
    int_SoC = EV_data[4]     #Initial battery's SoC(p.u.)

    #Calculation charging period
    if d_time <= s_time:
        ch_period = 96 + d_time - s_time
    else:
        ch_period = d_time - s_time
    Np = int(ch_period)
    print('charging period = %d intervals' % (Np))

    #Construct revelant data list based on charging period
    ch_time = [0]*Np      #charging time step list
    price_rate = [0]*Np   #electricity price list
    kW_baseload = [0]*Np  #kW house baseload power list

    #Re-arrange charging time, electricity price rate and baseload
    for i in range(Np):
        t_step = int(s_time) + i
        if t_step <= 95:  #Before midnight
            ch_time[i] = t_step
            price_rate[i] = tou_rate[t_step]
            kW_baseload[i] = P_baseload[t_step]/1000  #active house baseload
        else:             #After midnight
            ch_time[i] = t_step - 96
            price_rate[i] = tou_rate[t_step-96]
            kW_baseload[i] = P_baseload[t_step-96]/1000

    #Initialize Model
    m = GEKKO(remote=False)       # or m = GEKKO() for solve locally
    m.path = os.path.abspath(rd)  # change run directory

    #define parameter
    ch_eff = m.Const(value=0.90)    #charging/discharging efficiency
    alpha = m.Const(value=0.00005)  #regularization constant battery profile
    net_load = [None]*Np            #net metering houshold load power array
    elec_price = [None]*Np          #purchased electricity price array
    SoC = [None]*(Np+1)             #SoC of batteries array

    #initialize variables
    n_var = Np  #number of dicision variables
    x = m.Array(m.Var, n_var, value=0, lb=0, ub=1.0)  #dicision charging variables

    #Calculation relevant evaluated parameters
    #x[0] = m.Intermediate(-1.025)
    SoC[0] = m.Intermediate(int_SoC)  #initial battery SoC
    for i in range(Np):
        #Netload metering evaluation
        net_load[i] = m.Intermediate(kW_baseload[i] + x[i]*Pch_rating)
        #electricity cost price evaluation(cent/kWh)
        Neg_pr = (1/4)*net_load[i]*price_rate[i]  # Reverse power cost
        Pos_pr = (1/4)*net_load[i]*price_rate[i]  # Purchased power cost
        elec_price[i] = m.Intermediate(m.if3(net_load[i], Neg_pr, Pos_pr))
        #current battery's SoC evaluation
        j = i + 1
        SoC_dch = (1/4)*(x[i]*Pch_rating/ch_eff)/kWh_bat  #Discharging(V2G)
        SoC_ch = (1/4)*ch_eff*x[i]*Pch_rating/kWh_bat     #Charging
        SoC[j] = m.Intermediate(m.if3(x[i], SoC[j-1]+SoC_dch, SoC[j-1]+SoC_ch))
    #m.solve(disp=False)

    #-------Constraint functions--------#
    #EV battery constraint
    m.Equation(SoC[-1] >= 0.80)  #required departure SoC (minimum=80%)
    for i in range(Np):
        j = i + 1
        m.Equation(SoC[j] >= 0.20)  #lower SoC limit = 20%
    for i in range(Np):
        j = i + 1
        m.Equation(SoC[j] <= 0.95)  #upper SoC limit = 95%
    #household Net-power constraint
    for i in range(Np):
        m.Equation(net_load[i] >= -10.0)  #Lower netload power limit
    for i in range(Np):
        m.Equation(net_load[i] <= 10.0)   #Upper netload power limit

    #Objective functions
    elec_cost = m.Intermediate(m.sum(elec_price))  #electricity cost
    #battery degradation cost
    bat_cost = m.Intermediate(m.sum([alpha*xi**2 for xi in x]))
    #bat_cost = 0 #Not consider battery degradation cost
    m.Minimize(elec_cost + bat_cost)  # objective

    #Set global options
    m.options.IMODE = 3  #steady state optimization

    #Solve simulation
    try:
        m.solve(disp=True)  # solve
        print('--------------Results---------------')
        print('Objective Function= ' + str(m.options.objfcnval))
        print('x= ', x)
        print('price_rate= ', price_rate)
        print('net_load= ', net_load)
        print('elec_price= ', elec_price)
        print('SoC= ', SoC)
        print('Charging time= ', ch_time)
    except:
        print('*******Not successful*******')
        print('--------------No convergence---------------')
        # from gekko.apm import get_file
        # print(m._server)
        # print(m._model_name)
        # f = get_file(m._server,m._model_name,'infeasibilities.txt')
        # f = f.decode().replace('\r','')
        # with open('infeasibilities.txt', 'w') as fl:
        #     fl.write(str(f))

    Pcharge = x

    return ch_time, Pcharge
    pass

#---------------------- Run scripts ---------------------#
TOU = df_TOU['Prices']  #electricity TOU prices rate (c/kWh)
Load1 = df_Baseload['Load1']
EV_data = [17.15, 8.15, 3.3, 24, 0.50]  #[start,final,kW_rate,kWh_bat,int_SoC]

OptEV(TOU, EV_data, Load1)
#--------------------- End of a script --------------------#
When the solver fails to find a solution and reports "Solution Not Found", there is a troubleshooting method to diagnose the problem. The first thing to do is to look at the solver output with m.solve(disp=True). The solver may have identified either an infeasible problem or it reached the maximum number of iterations without converging to a solution. In your case, it identified the problem as infeasible.
Infeasible Problem
If the solver failed because of infeasible equations, then it found that the combination of variables and equations is not solvable. You can try to relax the variable bounds, or identify which equation is infeasible with the infeasibilities.txt file in the run directory. Retrieve infeasibilities.txt from the local run directory, which you can view with m.open_folder() when m=GEKKO(remote=False).
Maximum Iteration Limit
If the solver reached the default iteration limit (m.options.MAX_ITER=250), then you can either try to increase this limit or try the strategies below.
Try a different solver by setting m.options.SOLVER=1 for APOPT, m.options.SOLVER=2 for BPOPT, m.options.SOLVER=3 for IPOPT, or m.options.SOLVER=0 to try all the available solvers.
Find a feasible solution first by solving a square problem where the number of variables is equal to the number of equations. Gekko has a couple of options to help with this, including m.options.COLDSTART=1 (sets STATUS=0 for all FVs and MVs) and m.options.COLDSTART=2 (sets STATUS=0 and performs a block diagonal triangular decomposition to find possible infeasible equations).
Once a feasible solution is found, try optimizing with this solution as the initial guess.
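A minimal sketch of how these strategies could be tried on the model m built in OptEV (an illustration of the options mentioned above, not a tested fix for this specific problem):
# Illustrative only: the option names below are standard GEKKO options,
# but the chosen values are guesses for this particular model.
m.options.MAX_ITER = 500     # raise the iteration limit
m.options.SOLVER = 3         # try IPOPT instead of the default APOPT
try:
    m.options.COLDSTART = 1  # first solve a square problem (STATUS=0 for FVs/MVs)
    m.solve(disp=True)
    m.options.COLDSTART = 0  # then optimize, starting from the feasible point found above
    m.solve(disp=True)
except Exception:
    m.open_folder()          # inspect infeasibilities.txt in the local run directory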

How to accelerate my Python code: a function containing nested functions for classifying points by polygons

I have written the following NumPy code in Python:
def inbox_(points, polygon):
    """ Finding points in a region """
    ll = np.amin(polygon, axis=0)  # lower limit
    ur = np.amax(polygon, axis=0)  # upper limit
    in_idx = np.all(np.logical_and(ll <= points, points < ur), axis=1)  # points in the range [boolean]
    return in_idx


def operation_(r, gap, ends_ind):
    """ calculation formula which is applied on the points specified by inbox_ function """
    r_active = np.take(r, ends_ind)  # taking values from "r" based on indices and shape (paired_values) of "ends_ind"
    r_sub = np.subtract.reduce(r_active, axis=1)  # subtracting each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    r_add = np.add.reduce(r_active, axis=1)  # adding each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    paired_cent_dis = np.sum((r_add, gap), axis=0)  # distance of the each two paired points
    return (np.power(gap, 2) * (np.power(paired_cent_dis, 2) + 5 * paired_cent_dis * r_add - 7 * np.power(r_sub, 2))) / (3 * paired_cent_dis)  # Formula


def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        elaps = np.empty([len(elem_vert), ], dtype=object)
        operate_ = operation_(r, gap, ends_ind)

        # elbav = np.empty([len(elem_vert), ], dtype=object)
        # con_num = 0

        for i, j in enumerate(elem_vert):  # loop for each section (cell or region) of a mesh
            in_bool = inbox_(contact_poss, j)  # getting boolean array for points within that section
            elaps[i] = np.sum(operate_[in_bool])  # performing some calculations on that points and get the sum of them for each section

            operate_ = operate_[np.invert(in_bool)]  # slicing the arrays by deleting the points on which the calculations were performed to speed-up the code in next loops
            contact_poss = contact_poss[np.invert(in_bool)]  # as above

            # con_num += sum(inbox_(contact_poss, j))
            # inba_bool = inbox_(pos, j)
            # elbav[i] = 4 * np.pi * np.sum(np.power(r[inba_bool], 3)) / 3
            # pos = pos[np.invert(inba_bool)]
            # r = r[np.invert(inba_bool)]

        return elaps


r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')

elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)

# a --------r-------> parameter corresponding to each coordinate (point); here radius   (23605,)      <class 'numpy.ndarray'> <class 'numpy.float64'>
# b -------pos------> coordinates of the points                                         (23605, 3)    <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# c -------gap------> if we consider points as spheres by that radii [r], it is maximum length for spheres' over-lap   (103832,)   <class 'numpy.ndarray'> <class 'numpy.float64'>
# d ----ends_ind----> indices for each over-laped spheres                               (103832, 2)   <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.int64'>
# e ---elem_vert----> vertices of the mesh's sections or cells                          (2000, 8, 3)  <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# f --contact_poss--> a coordinate between the paired spheres                           (103832, 3)   <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
This code will be called frequently from another code with big-data inputs, so speeding it up is essential. I have tried to use the jit decorator from the JAX and Numba libraries to accelerate the code, but I could not get it working properly to make the code faster. I have tested the code on Colab (for 3 data sets with loop counts of 20, 250, and 2000) for speed, and the results were:
11 (ms), 47 (ms), 6.62 (s) (per loop) <-- without the commented code lines in the code
137 (ms), 1.66 (s), 4 (m) (per loop) <-- with the commented code lines activated
What this code does is find some coordinates in a range and then perform some calculations on them.
I would greatly appreciate any answer that can speed this code up significantly (I believe it can be). I would also be grateful for any experienced recommendations about speeding up the code by changing (substituting) the NumPy methods used, or by writing the math operations differently.
Notes:
The proposed answers must be executable with Python version 2 (being applicable in both versions 2 and 3 would be excellent).
The commented code lines are unnecessary for the main aim and are written just for further evaluations. Any recommendation for handling these lines in the proposed answers is appreciated (but not required).
Data sets for test:
small data set: https://drive.google.com/file/d/1CswjyoqS8ogLmLQa_oNTOj221chDcbK8/view?usp=sharing
medium data set: https://drive.google.com/file/d/14RJ0Ackx88NzQWloops5FagzuNQYDSrh/view?usp=sharing
large data set: https://drive.google.com/file/d/1dJnXpb3HiAGcRC9PPTwui9joNcij4E_E/view?usp=sharing
First of all, the algorithm can be improved to be much more efficient. Indeed, a polygon can be directly assigned to each point: this is a classification of points by polygons. Once the classification is done, you can perform one or many reductions by key, where the key is the polygon ID.
This new algorithm consists of:
computing all the bounding boxes of the polygons;
classifying the points by polygons;
performing the reduction by key (where the key is the polygon ID).
This approach is much more efficient than iterating over all the points for each polygon and filtering the attribute arrays (e.g. operate_ and contact_poss). Indeed, filtering is an expensive operation since it requires the target array (which may not fit in the CPU caches) to be fully read and then written back. Not to mention that this operation requires a temporary array to be allocated/deleted if it is not performed in-place, and the operation cannot benefit from being implemented with SIMD instructions on most x86/x86-64 platforms (as it requires the new AVX-512 instruction set). It is also harder to parallelize since each filtering step is too fast for threads to be useful, yet the steps need to be done sequentially.
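To make the "reduction by key" idea concrete, here is a minimal NumPy-only sketch (the names operate_ and point_polygon_id are illustrative, and point_polygon_id is assumed to come from the classification step): once every point has a polygon ID, the per-polygon sums can be computed in one vectorized pass with np.bincount instead of repeated boolean filtering.
import numpy as np
# Illustrative inputs: operate_ holds one value per point,
# point_polygon_id holds the polygon index of each point (-1 means "in no polygon")
operate_ = np.array([1.0, 2.0, 3.0, 4.0])
point_polygon_id = np.array([0, 2, 0, -1])
n_polygons = 3
inside = point_polygon_id >= 0
# Reduction by key: sum the values of all points sharing the same polygon ID
elaps = np.bincount(point_polygon_id[inside], weights=operate_[inside], minlength=n_polygons)
print(elaps)  # [4. 0. 2.]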
Regarding the implementation of the algorithm, Numba can be used to speed up the overall computation a lot. The main benefit of using Numba is to drastically reduce the number of expensive temporary arrays created by NumPy in your current implementation. Note that you can specify the function types to Numba so that it compiles the functions when they are defined. Assertions can be used to make the code more robust and to help the compiler know the size of a given dimension so as to generate significantly faster code (the JIT compiler of Numba can unroll the loops). Ternary operators can also help the JIT compiler generate a faster branch-less program.
Note that the classification can easily be parallelized using multiple threads. However, one needs to be very careful about constant propagation, since some critical constants (like the shapes of the working arrays and the assertions) tend not to be propagated to the code executed by the threads, while this propagation is critical to optimize the hot loops (e.g. vectorization, unrolling). Note also that creating many threads can be expensive on machines with many cores (from 0.1 ms up to 10 ms). Thus, it is often better to use a parallel implementation only on big input data.
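As a sketch of what the parallel variant looks like (a simplified 1D illustration with made-up names, not the exact code timed below): parallel=True in the decorator plus nb.prange on the point loop is all that changes.
import numba as nb
import numpy as np

# Toy sketch of the parallel classification pattern (1D boxes for brevity)
@nb.njit('int32[::1](float64[::1], float64[::1], float64[::1])', parallel=True)
def classify_1d(points, lower, upper):
    n_points = points.size
    n_boxes = lower.size
    out = np.empty(n_points, dtype=np.int32)
    for i in nb.prange(n_points):      # parallel loop over the points
        x = points[i]
        out[i] = -1                    # -1 means "in no box"
        for j in range(n_boxes):       # sequential scan over the boxes
            if lower[j] <= x < upper[j]:
                out[i] = j
                break
    return out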
Here is the resulting implementation (working with both Python2 and Python3):
import numba as nb
import numpy as np

@nb.njit('float64[::1](float64[::1], float64[::1], int64[:,::1])')
def operation_(r, gap, ends_ind):
""" calculation formula which is applied on the points specified by findMatchingPolygons_ function """
nPoints = ends_ind.shape[0]
assert ends_ind.shape[1] == 2
assert gap.size == nPoints
formula = np.empty(nPoints, dtype=np.float64)
for i in range(nPoints):
ind0, ind1 = ends_ind[i]
r0, r1 = r[ind0], r[ind1]
r_sub = r0 - r1
r_add = r0 + r1
cur_gap = gap[i]
paired_cent_dis = r_add + cur_gap
formula[i] = (cur_gap**2 * (paired_cent_dis**2 + 5 * paired_cent_dis * r_add - 7 * r_sub**2)) / (3 * paired_cent_dis)
return formula
# Use `parallel=True` for a parallel implementation
@nb.njit('int32[::1](float64[:,::1], float64[:,:,::1])')
def findMatchingPolygons_(points, polygons):
""" Attribute to all point a region """
nPolygons = polygons.shape[0]
nPolygonPoints = polygons.shape[1]
nPoints = points.shape[0]
assert points.shape[1] == 3
assert polygons.shape[2] == 3
# Compute the bounding boxes of all polygons
ll = np.empty((nPolygons, 3), dtype=np.float64)
ur = np.empty((nPolygons, 3), dtype=np.float64)
for i in range(nPolygons):
ll_x, ll_y, ll_z = polygons[i, 0]
ur_x, ur_y, ur_z = polygons[i, 0]
for j in range(1, nPolygonPoints):
x, y, z = polygons[i, j]
ll_x = x if x<ll_x else ll_x
ll_y = y if y<ll_y else ll_y
ll_z = z if z<ll_z else ll_z
ur_x = x if x>ur_x else ur_x
ur_y = y if y>ur_y else ur_y
ur_z = z if z>ur_z else ur_z
ll[i] = ll_x, ll_y, ll_z
ur[i] = ur_x, ur_y, ur_z
# Find for each point its corresponding polygon
pointPolygonId = np.empty(nPoints, dtype=np.int32)
# Use `nb.prange(nPoints)` for a parallel implementation
for i in range(nPoints):
x, y, z = points[i, 0], points[i, 1], points[i, 2]
pointPolygonId[i] = -1
for j in range(polygons.shape[0]):
if ll[j, 0] <= x < ur[j, 0] and ll[j, 1] <= y < ur[j, 1] and ll[j, 2] <= z < ur[j, 2]:
pointPolygonId[i] = j
break
return pointPolygonId
@nb.njit('float64[::1](float64[:,:,::1], float64[:,::1], float64[::1])')
def computeSections_(elem_vert, contact_poss, operate_):
nPolygons = elem_vert.shape[0]
elaps = np.zeros(nPolygons, dtype=np.float64)
pointPolygonId = findMatchingPolygons_(contact_poss, elem_vert)
for i, polygonId in enumerate(pointPolygonId):
if polygonId >= 0:
elaps[polygonId] += operate_[i]
return elaps
def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
if len(gap) > 0:
operate_ = operation_(r, gap, ends_ind)
return computeSections_(elem_vert, contact_poss, operate_)
r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')
elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)
Here are the results on an old 2-core machine (i7-3520M):
Small dataset:
- Original version: 5.53 ms
- Proposed version (sequential): 0.22 ms (x25)
- Proposed version (parallel): 0.20 ms (x27)
Medium dataset:
- Original version: 53.40 ms
- Proposed version (sequential): 1.24 ms (x43)
- Proposed version (parallel): 0.62 ms (x86)
Big dataset:
- Original version: 5742 ms
- Proposed version (sequential): 144 ms (x40)
- Proposed version (parallel): 67 ms (x86)
Thus, the proposed implementation is up to 86 times faster than the original one.
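For reference, a minimal way to reproduce such per-call timings with the data sets above (assuming they are saved as a.npy ... f.npy next to the script, and that the elapses function above has been defined) is the standard timeit module; the warm-up call keeps any JIT compilation out of the measurement:
import timeit
import numpy as np

args = [np.load(name) for name in ('a.npy', 'b.npy', 'c.npy', 'd.npy', 'e.npy', 'f.npy')]
elapses(*args)  # warm-up call (compilation already happens at definition time here, but this is a safe habit)
print(timeit.timeit(lambda: elapses(*args), number=10) / 10)  # mean seconds per call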

PyMC3 PK modelling. Model can't recover the parameters used to create the data set

I am new to PK modelling and pymc3, but I have been playing around with pymc3 and trying to implement a simple PK model as part of my own learning. Specifically a model that captures this relationship:
C(t) = Dose/V * exp(-(CL/V) * t)
where C(t) (Cpred) is the concentration at time t, Dose is the dose given, V is the volume of distribution and CL is the clearance.
I have generated some test data (30 subjects) with values of CL = 2 and V = 10, for 3 doses (100, 200, 300), with data generated at time points 0, 1, 2, 4, 8 and 12. I also included some random error on CL (normal distribution, mean 0, omega = 0.6) and residual unexplained error on the observations, DV = Cpred + sigma, where sigma is normally distributed with SD = 0.33. In addition I included a transformation of CL and V with respect to weight (uniform distribution 50-90): CLi = CL * WT/70; Vi = V * WT/70.
import numpy as np
import pandas as pd
# Create Data for modelling
np.random.seed(0)
# Subject ID's
data = pd.DataFrame(np.arange(1,31), columns=['subject'])
# Dose
data['dose'] = np.array([100,100,100,100,100,100,100,100,100,100,
200,200,200,200,200,200,200,200,200,200,
300,300,300,300,300,300,300,300,300,300])
# Random Body Weight
data['WT'] = np.random.randint(50,100, size =30)
# Fixed Clearance and Volume for the population
data['CLpop'] =2
data['Vpop']=10
# Error rate for individual clearance rate
OMEGA = 0.66
# Individual clearance rate as a function of weight and omega
data['CLi'] = data['CLpop']*(data['WT']/70)+ np.random.normal(0, OMEGA )
# Individual Volume as a function of weight
data['Vi'] = data['Vpop']*(data['WT']/70)
# Expand dataframe to account for time points
data = pd.concat([data]*6,ignore_index=True )
data = data.sort_values('subject')
# Add in time points
data['time'] = np.tile(np.array([0,1,2,4,8,12]), 30)
# Create concentration values using equation
data['Cpred'] = data['dose']/data['Vi'] *np.exp(-1*data['CLi']/data['Vi']*data['time'])
# Error rate for DV
SIGMA = 0.33
# Create Dependent Variable from Cpred + error
data['DV']= data['Cpred'] + np.random.normal(0, SIGMA )
# Create new df with only data for modelling...
df = data[['subject','dose','WT', 'time', 'DV']]
Create arrays ready for model...
# Prepare data from df to model specific arrays
time = np.array(df['time'])
dose = np.array(df['dose'])
DV = np.array(df['DV'])
WT = np.array(df['WT'])
n_patients = len(data['subject'].unique())
subject = data['subject'].values-1
I have built a simple model in pymc3 ....
from pymc3 import Model, Normal, Lognormal

pk_model = Model()
with pk_model:
# Hyperparameter Priors
sigma = Lognormal('sigma', mu =0, tau=0.01)
V = Lognormal('V', mu =2, tau=0.01)
CL = Lognormal('CL', mu =1, tau=0.01)
# Transformation wrt to weight
CLi = CL*(WT)/70
Vi = V*(WT)/70
# Expected value of outcome
pred = dose/Vi*np.exp(-1*(CLi/Vi)*time)
# Likelihood (sampling distribution) of observations
conc = Normal('conc', mu =pred, tau=sigma, observed = DV)
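For completeness, a sketch of how the trace referred to below could be produced and inspected (the number of draws is an arbitrary choice, not taken from the question):
import pymc3 as pm

with pk_model:
    # Draw posterior samples for CL, V and sigma
    trace = pm.sample(2000)

pm.summary(trace)    # posterior means and intervals
pm.traceplot(trace)  # the kind of trace plot shown below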
My expectation was that I should be able to recover from the data the constants and error rates that were originally used to generate it. I have not been able to do this, although I can get close. In this example...
data['CLi'].mean()
> 2.322473543135788
data['Vi'].mean()
> 10.147619047619049
And the trace shows....
So my questions are..
Is my code structured correctly and are there any glaring mistakes that I have overlooked that might account for this difference?
Can I structure the pymc3 model to better reflect the relationship from which I have generated the data?
What would be your suggestions to improve the model?
Thanks in advance!
I'm going to answer my own question!
I implemented a hierarchical model following the example found here...
GLM -hierarchical
and it works a treat. I also noticed an error in the way I was applying the errors in the dataframe - I should use
data['CLer'] = np.random.normal(scale=OMEGA, size=30)
to ensure that each subject has a different value for the error.
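For anyone landing on this later, here is a minimal sketch of such a hierarchical formulation, reusing the time, dose, WT, DV, subject and n_patients arrays prepared in the question; the priors and sampler settings are illustrative assumptions, not the exact model from the linked example:
import numpy as np
import pymc3 as pm

with pm.Model() as hier_model:
    # Population-level parameters (priors chosen loosely, for illustration)
    CL_pop = pm.Lognormal('CL_pop', mu=1, tau=0.01)
    V_pop = pm.Lognormal('V_pop', mu=2, tau=0.01)
    omega = pm.HalfNormal('omega', sd=1)   # between-subject variability on CL
    sigma = pm.HalfNormal('sigma', sd=1)   # residual error

    # One clearance deviation per subject, mirroring CLi = CLpop*WT/70 + eta_i
    eta = pm.Normal('eta', mu=0, sd=omega, shape=n_patients)
    CLi = CL_pop * WT / 70 + eta[subject]
    Vi = V_pop * WT / 70

    # One-compartment prediction and likelihood
    pred = dose / Vi * pm.math.exp(-CLi / Vi * time)
    conc = pm.Normal('conc', mu=pred, sd=sigma, observed=DV)

    trace = pm.sample(2000)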