Sample without replacement - tensorflow

How to sample without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False) for some very large integer n (e.g. 100k-100M), and smaller k (e.g. 100-10k).
Also, I want it to be efficient and on the GPU, so other solutions like this with tf.py_func are not really an option for me. Anything which would use tf.range(n) or so is also not an option because n could be very large.

This is one way:
n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]
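(In current TF 2.x the same one-liner reads tf.random.shuffle(tf.range(n))[:sample_size]; tf.random_shuffle is the 1.x name. Either way it materializes the full O(n) range, so it only fits the small-n case.)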
EDIT:
I had posted the answer below but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor with size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:
n = ...
sample_size = ...
# Resample until all drawn indices are unique
idx = tf.random_uniform([sample_size], maxval=n, dtype=tf.int64)
idx = tf.while_loop(
    lambda idx: tf.size(tf.unique(idx)[0]) < tf.size(idx),
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])
EDIT 2:
About the average number of iterations in the previous method. If we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:
p = product((n - (i - 1)) / n for i in 1 .. k)
Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (proof here). Here is a function that calculates the average number of trials in Python for some k and n values:
def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it
And here are some results:
+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 |      3.3 |
|   100 |   10 |      1.6 |
|  1000 |   10 |      1.1 |
|  1000 |  100 |    167.8 |
| 10000 |   10 |      1.0 |
| 10000 |  100 |      1.6 |
| 10000 | 1000 |  2.9e+22 |
+-------+------+----------+
You can see it varies wildly depending on the parameters.
It is possible, though, to construct a vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it goes like this:
import random

def sample_wo_replacement(n, k):
    sample = [0] * k
    for i in range(k):
        # Each draw comes from a range that shrinks by one per previous pick
        sample[i] = random.randint(0, n - 1 - i)
    # Adjust each value upwards to skip over the earlier picks
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample
random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]
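As a quick empirical check (my addition, not from the original post) that the output is uniform over ordered samples:
from collections import Counter

random.seed(0)
counts = Counter(tuple(sample_wo_replacement(5, 3)) for _ in range(60000))
# 5 * 4 * 3 = 60 equally likely ordered samples, so each should occur ~1000 times
print(len(counts), min(counts.values()), max(counts.values()))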
This is a possible way to do it in TensorFlow (not sure if the best one):
import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample
And an example:
with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]
This is quite intensive on tf.while_loops, though, which are well known not to be particularly fast in TensorFlow, so I don't know how fast this method really is without some kind of benchmarking.
EDIT 4:
One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c and pick a random amount of numbers from each chunk, then shuffle everything. The amount of memory that you use is limited by c, and you don't need nested loops. If n is divisible by c, then you should get an approximately perfect random distribution; otherwise values in the last "short" chunk would receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long to account for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.
import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample
np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8 9 12 13 3]
A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, while I haven't been able to run the previous algorithm (the sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get it exactly right.
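As a sanity check (my addition), the function does return k distinct in-range values:
s = sample_chunked(100000, 1000, 300)
assert len(np.unique(s)) == 1000 and s.min() >= 0 and s.max() < 100000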

Use the Gumbel-max trick here: https://github.com/tensorflow/tensorflow/issues/9260
z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)
indices are what you want. This trick is so easy!
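To make that concrete, here is a minimal self-contained sketch (my addition, using TF 2.x names): adding i.i.d. Gumbel noise to the logits and taking the top k draws k indices without replacement with probabilities proportional to the softmax of the logits, so constant logits give a uniform sample.
import tensorflow as tf

def gumbel_topk(logits, k):
    # Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1)
    z = -tf.math.log(-tf.math.log(tf.random.uniform(tf.shape(logits), 0, 1)))
    _, indices = tf.math.top_k(logits + z, k)
    return indices

print(gumbel_topk(tf.zeros([1000]), 5))  # 5 distinct, uniformly drawn indices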

The following works fairly fast on the GPU, and I did not encounter memory issues when using n ~ 100M and k ~ 10k (on an NVIDIA GeForce GTX 1080 Ti):
def random_choice_without_replacement(n, k):
    """equivalent to 'numpy.random.choice(n, size=k, replace=False)'"""
    return tf.math.top_k(tf.random.uniform(shape=[n]), k, sorted=False).indices
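A usage sketch (my addition). Note this still materializes an O(n) tensor of uniforms, but a single float32 vector is only about 400 MB even at n = 100M:
idx = random_choice_without_replacement(100_000_000, 10_000)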

Related

Linear programming if/then modification to cost function?

I'm setting up a linear programming optimization model using CPLEX and am wondering if it's possible to accomplish a modification of the cost function dependent upon which binary decision variables are 'active' in an arbitrary solution. This is mostly a question about how to formulate the LP model (if it's even possible), but responses in the context of CPLEX are welcome or even preferred.
Say I have an LP problem in canonical form:
minimize c^T x
s.t. Ax <= b
With cost function:
c = [c_1, c_2,...,c_100]
All variables are binary. I have this basic setup modeled and running effectively in CPLEX.
Now say I have a subset of variables:
efficiency_set = [x_1, x_2,...,x_5]
With the condition:
if any x_n in efficiency_set == 1
then c_n for all other x_n in the set = 0.9 * c_n
Essentially there is a dependency where if any x_n in the efficiency set is 'active', it becomes 10% less expensive for other variables in the set to appear in the solution.
I thought that CPLEX indicator constraints were what I was looking for, but after reading through the documentation, I don't think I can enforce an on-the-fly change to the cost function with them (I could be wrong). So I feel like it needs to be done through the formulation of the LP, but I can't reason how to accomplish it. Any ideas? Thanks.
In CPLEX you have many APIs; let me answer you with the easiest one, OPL.
Your canonical form can be written
int n=3;
int m=4;
range N=1..n;
range M=1..m;
float A[N][M]=[[1,4,9,6],[8,5,0,8],[2,9,0,2]];
float B[M]=[3,1,3,0];
float C[N]=[1,1,1];
dvar boolean x[N];
minimize sum(i in N) C[i]*x[i];
subject to
{
forall(j in M) sum(i in N) A[i,j]*x[i]>=B[j];
}
and then you can write logical constraints:
int n=3;
int m=4;
range N=1..n;
range M=1..m;
float A[N][M]=[[1,4,9,6],[8,5,0,8],[2,9,0,2]];
float B[M]=[3,1,3,0];
float C[N]=[1,1,1];
{int} efficiencySet={1,2};
dvar boolean activeEfficiencySet;
dvar boolean x[N];
minimize sum(i in N) C[i]*x[i]*(1-0.1*activeEfficiencySet*(i not in efficiencySet));
subject to
{
forall(j in M) sum(i in N) A[i,j]*x[i]>=B[j];
activeEfficiencySet==(1<=sum(i in efficiencySet) x[i]);
}
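For solvers without such logical constraints, the equivalence activeEfficiencySet==(1<=sum(i in efficiencySet) x[i]) can also be written with plain linear constraints over the binaries (a standard linearization; my sketch, not part of the answer):

activeEfficiencySet <= sum(i in efficiencySet) x[i];
forall(i in efficiencySet) activeEfficiencySet >= x[i];

The first row forces activeEfficiencySet to 0 when no variable in the set is active; the second forces it to 1 as soon as any one is.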
Using Alex's data, I have written the program in docplex (the CPLEX Python API):
from docplex.mp.model import Model

n = 3
m = 4
A = {}
A[0, 0] = 1
A[0, 1] = 4
A[0, 2] = 9
A[0, 3] = 6
A[1, 0] = 8
A[1, 1] = 5
A[1, 2] = 0
A[1, 3] = 8
A[2, 0] = 2
A[2, 1] = 9
A[2, 2] = 0
A[2, 3] = 2
B = {}
B[0] = 3
B[1] = 1
B[2] = 3
B[3] = 0
C = {}
C[0] = 1
C[1] = 1
C[2] = 1
efficiencySet = [0, 1]
mdl = Model(name="")
activeEfficiencySet = mdl.binary_var()
x = mdl.binary_var_dict(range(n), name="x")
# constraint 1:
for j in range(m):
    mdl.add_constraint(mdl.sum(A[i, j] * x[i] for i in range(n)) >= B[j])
# constraint 2 (sum over the efficiency set, as in the OPL model):
mdl.add(activeEfficiencySet == (mdl.sum(x[i] for i in efficiencySet) >= 1))
# objective function:
lst = []
for i in range(n):
    if i not in efficiencySet:
        lst.append(C[i] * x[i] * (1 - 0.1 * activeEfficiencySet))
    else:
        lst.append(C[i] * x[i])
mdl.minimize(mdl.sum(lst))
mdl.solve()
for i in range(n):
    print(str(x[i]) + " : " + str(x[i].solution_value))
activeEfficiencySet.solution_value

How to index the unique value count in numpy? [duplicate]

Consider the following lists short_list and long_list
import numpy as np
import pandas as pd
from string import ascii_letters

short_list = list('aaabaaacaaadaaac')
np.random.seed([3, 1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 2))
).sum(1).tolist()
How do I calculate the cumulative count by unique value?
I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution
def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could use these to get at a solution. The best I got is in pir4 below, which scales in quadratic time. Also note that I don't care if counts start at 1 or zero, as we can simply add or subtract 1.
Below are some of my attempts (none of which answer my question)
%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, return_inverse=True)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()
setup
short_list = np.array(list('aaabaaacaaadaaac'))
functions
dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.
# dfill
#
# Example with short_list
#
# 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15
# [ a a a b a a a c a a a d a a a c]
#
# Example with short_list after sorting
#
# 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15
# [ a a a a a a a a a a a a b c c d]
argunsort returns the permutation necessary to undo a sort given the argsort array. The existence of this method became known to me via this post. With this, I can get the argsort array and sort my array with it, then undo the sort without the overhead of sorting again.
cumcount will take an array, sort it, and find the dfill array. np.arange less dfill gives the cumulative count. Then I un-sort.
# cumcount
#
# Example with short_list
#
# short_list:
# [ a a a b a a a c a a a d a a a c]
#
# short_list.argsort():
# [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [ a a a a a a a a a a a a b c c d]
#
# dfill(short_list[short_list.argsort()]):
# [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15]
#
# np.arange(short_list.size):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
#
# np.arange(short_list.size) -
# dfill(short_list[short_list.argsort()]):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0]
#
# unsorted:
# [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]
foo function recommended by @hpaulj using defaultdict
div function recommended by @Divakar (old, I'm sure he'd update it)
code
def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]
demonstration
cumcount(short_list)
array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
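A quick consistency check (my addition) that the three implementations agree:
assert cumcount(short_list).tolist() == foo(short_list).tolist() == div(short_list).tolist()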
time testing
code
functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, 'array length')
results = pd.DataFrame(index=lengths, columns=functions)
from string import ascii_letters

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        results.set_value(
            i, j,
            timeit(
                '{}(a)'.format(j),
                'from __main__ import a, {}'.format(j),
                number=1000
            )
        )

results.plot()
Here's a vectorized approach using a custom grouped-range creating function and np.unique for getting the counts -
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()

count = np.unique(A, return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
Sample run -
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
For getting the count, a few other alternatives could be proposed with focus on performance -
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)
Additionally, with A containing single-letter characters, we could get the count simply with -
np.bincount(np.array(A).view('uint8')-97)
Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
...: counter = defaultdict(int)
...: for i in l:
...: counter[i] += 1
...: return counter
...:
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12, 1, 2, 1], dtype=int32)
I constructed arr because bincount only works with ints.
In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop
I'm guessing it would be hard to improve on pir2, which is based on defaultdict.
Searching and counting like this are not a strong area for numpy.

Transportation cost optimisation using OMPR for a large data set

I am solving a transport optimization problem given a set of constraints.
The following are the three key data sets that I have
#demand file
demand - has demand (DEMAND) across 4821 sale-point/product combinations (DPP)
head(demand)
D PP DEMAND DPP
1 ADILABAD (V) - T:11001 OPC:PACK 131.00 ADILABAD (V) - T:11001:OPC:PACK
2 ADILABAD (V) - T:13003 OPC:PACK 235.00 ADILABAD (V) - T:13003:OPC:PACK
3 ADILABAD (V) - T:2006 PPC:PACK 30.00 ADILABAD (V) - T:2006:PPC:PACK
4 ADILABAD (V) - T:4001 OPC:PACK 30.00 ADILABAD (V) - T:4001:OPC:PACK
5 ADILABAD (V) - T:7006 OPC:NPACK 34.84 ADILABAD (V) - T:7006:OPC:NPACK
6 AHMEDABAD:1001 OPC:PACK 442.10 AHMEDABAD:1001:OPC:PACK
#Capacity file
cc - has capacity constraints (MinP, MaxP) across 1823 sources (SOURCE)
head(cc,4)
SOURCE MinP MaxP
1 CHILAMKUR:P:OPC:NPACK:0:R 900 10806
2 CHILAMKUR:P:OPC:NPACK:0:W 900 10806
3 CHILAMKUR:P:OPC:PACK:0:R 5628 67536
4 CHILAMKUR:P:OPC:PACK:0:W 5628 67536
#LandingCost file
LCMat - This is a matrix with the landing cost to deliver the product across the demand locations (DPP) from a given source (SOURCE). This is an 1823 x 4821 matrix. Since landing costs from a given source do not exist for all locations, I have replaced those with a huge cost (10^6) for such DPPs.
I am using the OMPR package in R to optimize shipping material to meet the demand.
This is potentially a very simple transport problem, but it is taking a lot of time. I am using a 16 GB RAM machine.
The following is the code. Could anyone guide me on what I should do better?
a = Sys.time()
grid = expand.grid(i = 1:nrow(LCMat), j = 1:ncol(LCMat))
grid_solve = grid[which(LCMat < 10^6),]
grid_notsolve = grid[which(LCMat >= 10^6),]
model <- MILPModel() %>%
  add_variable(x[grid$i, grid$j], lb = 0, type = "continuous") %>%
  add_constraint(x[grid_notsolve$i, grid_notsolve$j] == 0) %>%
  add_constraint(sum_over(x[i, j], i = 1:nrow(LCMat)) <= demand$DEMAND[j], j = 1:ncol(LCMat)) %>%
  add_constraint(sum_over(x[i, j], j = 1:ncol(LCMat)) <= cc$MaxP[i], i = 1:nrow(LCMat)) %>%
  add_constraint(sum_over(x[i, j], j = 1:ncol(LCMat)) >= cc$MinP[i], i = 1:nrow(LCMat)) %>%
  set_objective(sum_expr(LCMat[grid_solve$i, grid_solve$j] * x[grid_solve$i, grid_solve$j]), "min")
solution = model %>% solve_model(with_ROI(solver = "glpk", verbose = TRUE))
Sys.time() - a
Two options to potentially speed things up:
Make sure you use the latest CRAN versions of ompr and listcomp.
Try to use filter conditions to only create/use variables that are relevant to the model, instead of adding all nrow(LCMat)*ncol(LCMat) variables and then setting (potentially) a lot of them to 0. See the code below for an example. Depending on how sparse your problem is that could help as well.
The following code takes a sparse matrix (i.e. a matrix with many 0 elements or 10^6 elements in your case) and only generates x[i,j] variables that have an entry in sparse_matrix which is greater than 0. It hopefully illustrates how to use that feature and apply it to your case.
library(ompr)
sparse_matrix <- matrix(
  c(
    1, 0, 0, 1,
    0, 1, 0, 1,
    0, 0, 0, 1,
    1, 0, 0, 0
  ), byrow = TRUE, ncol = 4
)
is_connected <- function(i, j) {
  sparse_matrix[i, j] > 0
}
n <- nrow(sparse_matrix)
m <- ncol(sparse_matrix)
model <- MIPModel() |>
  add_variable(x[i, j], i = 1:n, j = 1:m, is_connected(i, j)) |>
  set_objective(sum_over(x[i, j], i = 1:n, j = 1:m, is_connected(i, j))) |>
  add_constraint(sum_over(x[i, j], i = 1:n, is_connected(i, j)) <= 1, j = 1:m)
variable_keys(model)
#> [1] "x[1,1]" "x[1,4]" "x[2,2]" "x[2,4]" "x[3,4]" "x[4,1]"
extract_constraints(model)
#> $matrix
#> 3 x 6 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 . . . . 1
#> [2,] . . 1 . . .
#> [3,] . 1 . 1 1 .
#>
#> $sense
#> [1] "<=" "<=" "<="
#>
#> $rhs
#> [1] 1 1 1
Created on 2022-03-12 by the reprex package (v2.0.1)
Both OMPR and GLPK are slow for large models.
Also, you are duplicating sum_over(x[i,j], j = 1:ncol(LCMat)) in two constraints. That leads to more nonzero elements than needed. I usually try to prevent that, even at the expense of more variables; one way is sketched below.
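One way to avoid that duplication (my sketch, not from the answer) is to introduce an auxiliary variable s[i] for the total shipped from each source, so the row sum appears in the constraint matrix only once:

s[i] = sum_over(x[i,j], j = 1:ncol(LCMat))
cc$MinP[i] <= s[i] <= cc$MaxP[i]

This costs nrow(LCMat) extra variables but roughly halves the nonzeros contributed by the capacity constraints.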

probability of sample of distribution

I am trying to generate a sample of 100 scenarios (X, Y) where both X and Y are normally distributed, X = N(50, 5^2) and Y = N(30, 2^2), and X and Y are correlated with Cov(X, Y) = 0.4.
I have been able to generate 100 scenarios with the Cholesky decomposition:
# We do a Cholesky decomposition to generate correlated scenarios
using Distributions, LinearAlgebra

nScenarios = 10
Σ = [25 0.4; 0.4 4]
μ = [50, 30]
L = cholesky(Σ)
v = [rand(Normal(0, 1), nScenarios), rand(Normal(0, 1), nScenarios)]
X = reshape(zeros(nScenarios), 1, nScenarios)
Y = reshape(zeros(nScenarios), 1, nScenarios)
for i = 1:nScenarios
    # use the lower Cholesky factor so that Cov(X, Y) comes out as Σ
    # (2 = number of correlated variables, nBreadTypes in the original post)
    X[1, i] = sum(L.L[1, j] * v[j][i] for j = 1:2) + μ[1]
    Y[1, i] = sum(L.L[2, j] * v[j][i] for j = 1:2) + μ[2]
end
However, I need the probability of each scenario, i.e. P(X = k and Y = p). My question would be: how can we get a sample of a certain distribution together with the probability of each scenario?
Following the BatWannaBe explanation, normally I would do it like this:
julia> using Distributions
julia> d = MvNormal([50.0, 30.0], [25.0 0.4; 0.4 4.0])
FullNormal(
dim: 2
μ: [50.0, 30.0]
Σ: [25.0 0.4; 0.4 4.0]
)
julia> point = rand(d)
2-element Vector{Float64}:
52.807189619051485
32.693811008760676
julia> pdf(d, point)
0.0056519503173830515
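One caveat (my addition): for a continuous distribution the probability of any exact point is zero, i.e. P(X = k, Y = p) = 0, and pdf(d, point) returns the density f(k, p), not a probability. Actual probabilities come from integrating the density over a region A:

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy

So if you need per-scenario probabilities, e.g. for stochastic programming, the usual choice is to weight each of the nScenarios sampled scenarios equally at 1/nScenarios.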

Fastest way to create a sparse matrix of the form A.T * diag(b) * A + C?

I'm trying to optimize a piece of code that solves a large sparse nonlinear system using an interior point method. During the update step, this involves computing the Hessian matrix H, the gradient g, then solving for d in H * d = -g to get the new search direction.
The Hessian matrix has a symmetric tridiagonal structure of the form:
A.T * diag(b) * A + C
I've run line_profiler on the particular function in question:
Line # Hits Time Per Hit % Time Line Contents
==================================================
386 def _direction(n, res, M, Hsig, scale_var, grad_lnprior, z, fac):
387
388 # gradient
389 44 1241715 28220.8 3.7 g = 2 * scale_var * res - grad_lnprior + z * np.dot(M.T, 1. / n)
390
391 # hessian
392 44 3103117 70525.4 9.3 N = sparse.diags(1. / n ** 2, 0, format=FMT, dtype=DTYPE)
393 44 18814307 427597.9 56.2 H = - Hsig - z * np.dot(M.T, np.dot(N, M)) # slow!
394
395 # update direction
396 44 10329556 234762.6 30.8 d, fac = my_solver(H, -g, fac)
397
398 44 111 2.5 0.0 return d, fac
Looking at the output it's clear that constructing H is by far the most costly step - it takes considerably longer than actually solving for the new direction.
Hsig and M are both CSC sparse matrices, n is a dense vector and z is a scalar. The solver I'm using requires H to be either a CSC or CSR sparse matrix.
Here's a function that produces some toy data with the same formats, dimensions and sparseness as my real matrices:
import numpy as np
from scipy import sparse

def make_toy_data(nt=200000, nc=10):
    d0 = np.random.randn(nc * (nt - 1))
    d1 = np.random.randn(nc * (nt - 1))
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)
    d0 = np.random.randn(nc * nt)
    Hsig = sparse.diags(d0, 0, shape=(nc * nt, nc * nt), format='csc',
                        dtype=np.float64)
    n = np.random.randn(nc * (nt - 1))
    z = np.random.randn()
    return Hsig, M, n, z
And here's my original approach for constructing H:
def original(Hsig, M, n, z):
    N = sparse.diags(1. / n ** 2, 0, format='csc')
    H = - Hsig - z * np.dot(M.T, np.dot(N, M))  # slow!
    return H
Timing:
%timeit original(Hsig, M, n, z)
# 1 loops, best of 3: 483 ms per loop
Is there a faster way to construct this matrix?
I get close to a 4x speed-up in computing the product M.T * D * M out of the three diagonal arrays. If d0 and d1 are the main and upper diagonal of M, and d is the main diagonal of D, then the following code creates M.T * D * M directly:
def make_tridi_bis(d0, d1, d, nc=10):
    d00 = d0*d0*d
    d11 = d1*d1*d
    d01 = d0*d1*d
    len_ = d0.size
    data = np.empty((3*len_ + nc,))
    indices = np.empty((3*len_ + nc,), dtype=np.intp)
    # Fill main diagonal
    data[:2*nc:2] = d00[:nc]
    indices[:2*nc:2] = np.arange(nc)
    data[2*nc+1:-2*nc:3] = d00[nc:] + d11[:-nc]
    indices[2*nc+1:-2*nc:3] = np.arange(nc, len_)
    data[-2*nc+1::2] = d11[-nc:]
    indices[-2*nc+1::2] = np.arange(len_, len_ + nc)
    # Fill top diagonal
    data[1:2*nc:2] = d01[:nc]
    indices[1:2*nc:2] = np.arange(nc, 2*nc)
    data[2*nc+2:-2*nc:3] = d01[nc:]
    indices[2*nc+2:-2*nc:3] = np.arange(2*nc, len_+nc)
    # Fill bottom diagonal
    data[2*nc:-2*nc:3] = d01[:-nc]
    indices[2*nc:-2*nc:3] = np.arange(len_ - nc)
    data[-2*nc::2] = d01[-nc:]
    indices[-2*nc::2] = np.arange(len_ - nc, len_)
    indptr = np.empty((len_ + nc + 1,), dtype=np.intp)
    indptr[0] = 0
    indptr[1:nc+1] = 2
    indptr[nc+1:len_+1] = 3
    indptr[-nc:] = 2
    np.cumsum(indptr, out=indptr)
    return sparse.csr_matrix((data, indices, indptr), shape=(len_+nc, len_+nc))
If your matrix M were in CSR format, you can extract d0 and d1 as d0 = M.data[::2] and d1 = M.data[1::2]. I modified your toy-data-making routine to return those arrays as well, and here's what I get:
In [90]: np.allclose((M.T * sparse.diags(d, 0) * M).A, make_tridi_bis(d0, d1, d).A)
Out[90]: True
In [92]: %timeit make_tridi_bis(d0, d1, d)
10 loops, best of 3: 124 ms per loop
In [93]: %timeit M.T * sparse.diags(d, 0) * M
1 loops, best of 3: 501 ms per loop
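To spell out how this plugs into the original problem (a sketch, my addition, using make_toy_data and make_tridi_bis from above):
Hsig, M, n, z = make_toy_data()
d = 1. / n ** 2                       # main diagonal of N
Mr = M.tocsr()                        # every row of M holds exactly two entries
d0, d1 = Mr.data[::2], Mr.data[1::2]  # main and upper diagonals of M
H = -Hsig - z * make_tridi_bis(d0, d1, d)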
The whole purpose of the above code is to take advantage of the structure of the non-zero entries. If you draw a diagram of the matrices you are multiplying together, it is relatively easy to convince yourself that the main (d_0) and top and bottom (d_1) diagonals of the resulting tridiagonal matrix are simply:
d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11
d_1 = d01
The rest of the code in that function is simply building the tridiagonal matrix directly, as calling sparse.diags with the above data is several times slower.
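For reference, that simpler but slower construction would look like this (a sketch, my addition, reusing the naming from above):
def make_tridi_diags(d0, d1, d, nc=10):
    # Same tridiagonal matrix via sparse.diags: main diagonal plus offsets +nc / -nc
    d00, d11, d01 = d0 * d0 * d, d1 * d1 * d, d0 * d1 * d
    len_ = d0.size
    d_0 = np.zeros((len_ + nc,))
    d_0[:len_] = d00
    d_0[-len_:] += d11
    return sparse.diags((d_0, d01, d01), (0, nc, -nc), format='csr')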
I tried running your test case and had problems with the np.dot(N, M). I didn't dig into it, but I think my numpy/sparse combo (both pretty new) had problems using np.dot on sparse arrays.
But H = -Hsig - z*M.T.dot(N.dot(M)) runs just fine. This uses the sparse dot.
I haven't run a profile, but here are Ipython timings for several parts. It takes longer to generate the data than to do that double dot.
In [37]: timeit Hsig,M,n,z=make_toy_data()
1 loops, best of 3: 2 s per loop
In [38]: timeit N = sparse.diags(1. / n ** 2, 0, format='csc')
1 loops, best of 3: 377 ms per loop
In [39]: timeit H = -Hsig - z*M.T.dot(N.dot(M))
1 loops, best of 3: 1.55 s per loop
H is a
<2000000x2000000 sparse matrix of type '<type 'numpy.float64'>'
with 5999980 stored elements in Compressed Sparse Column format>