Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried the surrogate optimization in Matlab but I'm not sure it is appropriate. I've also heard about simulated annealing but found no implementation applied to this instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
Simulated Annealing (SA) benefits from, and to some extent requires, an objective surface that is somewhat smooth: states close to a good solution should themselves tend to score well.
For a completely random, spiky surface you might as well do a random search.
If the surface is smooth, even only partly, it makes sense to try SA.
The idea is that changing only 1 of the 3 values will (at least sometimes) have little effect on our black-box function.
Here is a basic example of doing this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
LENGTH = 12
I define a black-box using the Rastrigin function.
def rastrigin_function_n(x):
    """
    N-dimensional Rastrigin
    https://en.wikipedia.org/wiki/Rastrigin_function
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum(x**2 - A*np.cos(2*np.pi * x))

def black_box(x):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify the state x. Instead of modifying values directly, we keep track of indices. This simplifies creating new proposals: an index is always an integer, so we can simply add or subtract 1 modulo LENGTH.
def random_start():
    """
    returns 3 random indices
    """
    return np.random.randint(0, LENGTH, size=3)

def random_small_step(x):
    """
    change only 1 index, by +1 or -1
    """
    d = np.array([1, 0, 0])
    if np.random.random() < .5:
        d = np.array([-1, 0, 0])
    np.random.shuffle(d)  # pick which of the 3 indices moves
    return (x + d) % LENGTH

def random_big_step(x):
    """
    change 2 indices, one by +1 and one by -1
    """
    d = np.array([1, -1, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH

def obj(x):
    """
    We have a triplet of indices,
    1. Calculate corresponding values in W = [w1,w2,w3]
    2. Push the values into our black-box function
    """
    indices = x
    values = W[np.array([0, 1, 2]), indices]
    return black_box(values)
And throw an SA scheme at it:
import frigidum
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        repeats=10**3,
                        copy_state=frigidum.annealing.naked)
I am not sure what the minimum of this function should be, but it found an objective value of 47.9095 at indices np.array([9, 2, 2]).
Edit:
To change the cooling schedule in frigidum, use alpha=.9. In my experience, the effort of experimenting with which cooling scheme works best doesn't outweigh simply letting it run a little longer. The multiplication you proposed (sometimes called geometric cooling) is the standard one, and is also what frigidum implements. So to implement T_{n+1} = 0.9 * T_n you need alpha=.9. Be aware that this cooling step is done after N repeats, so if repeats=100, it will first make 100 proposals before lowering the temperature by the factor alpha.
Simple variations on the current state often work best. Since it's best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try 2 or 3 variations. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us. So if the time/cost of bringing another competitor to the table is small, yes!
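To make the roles of alpha and repeats concrete, here is a minimal plain-Python sketch of an annealing loop with geometric cooling. It is not frigidum's implementation; the function names (cost, small_step, anneal) and the default parameters are assumptions for illustration, and the Rastrigin stand-in mirrors the black box used above.

```python
import math
import random
from functools import lru_cache

LENGTH = 12  # number of candidate values per variable, as in the question


def black_box(values):
    """Stand-in cost: Rastrigin on values rescaled from [1, 12] to [-5, 5]."""
    xs = [(v - 6.5) * (5 / 5.5) for v in values]
    return 10 * len(xs) + sum(x * x - 10 * math.cos(2 * math.pi * x) for x in xs)


@lru_cache(maxsize=None)
def cost(indices):
    # indices is a tuple of 0-based positions; the candidate values are 1..12.
    # Caching means each distinct triplet is evaluated at most once.
    return black_box([i + 1 for i in indices])


def small_step(indices, rng):
    """Move one randomly chosen index up or down by 1, wrapping around."""
    new = list(indices)
    pos = rng.randrange(len(new))
    new[pos] = (new[pos] + rng.choice((-1, 1))) % LENGTH
    return tuple(new)


def anneal(t_start=1e4, t_stop=1e-6, alpha=0.9, repeats=100, seed=0):
    rng = random.Random(seed)
    state = tuple(rng.randrange(LENGTH) for _ in range(3))
    state_cost = cost(state)
    best, best_cost = state, state_cost
    t = t_start
    while t > t_stop:
        # 'repeats' proposals are made at each temperature level
        for _ in range(repeats):
            cand = small_step(state, rng)
            delta = cost(cand) - state_cost
            # Metropolis rule: always accept improvements,
            # accept a worse state with probability exp(-delta / t)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state, state_cost = cand, state_cost + delta
                if state_cost < best_cost:
                    best, best_cost = state, state_cost
        t *= alpha  # geometric cooling: T_{n+1} = alpha * T_n
    return best, cost(best)


best, best_cost = anneal()
print(best, best_cost)
```

Because there are only 12^3 = 1728 distinct states, the cache keeps the number of expensive black-box calls bounded no matter how long the schedule runs.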
Try every possible combination of the three values and see which has the lowest cost.
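With 12 values per variable there are only 12^3 = 1728 triplets, so exhaustive search is a reasonable baseline unless a single evaluation is extremely expensive. A minimal sketch, using the Rastrigin stand-in from the answer above in place of the real black box (the name black_box is a placeholder for your own function):

```python
import itertools
import math

VALUES = range(1, 13)  # the candidate values 1..12 for each of w1, w2, w3


def black_box(values):
    """Stand-in cost: Rastrigin on values rescaled from [1, 12] to [-5, 5]."""
    xs = [(v - 6.5) * (5 / 5.5) for v in values]
    return 10 * len(xs) + sum(x * x - 10 * math.cos(2 * math.pi * x) for x in xs)


# Evaluate all 1728 triplets and keep the cheapest one
best_cost, best_triplet = min(
    (black_box(t), t) for t in itertools.product(VALUES, repeat=3)
)
print(best_triplet, best_cost)
```

For this stand-in function the minimum works out to roughly 47.91, attained when each value is 3 or 10. That agrees with the annealing result above, where indices [9, 2, 2] correspond to values 10, 3, 3.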
Given two series, s1 and s2, are these two snippets of code equivalent?
s1, _ = s1.align(s2, join='right')
and
for k in s2.index:
    if k not in s1.index:
        s1[k] = np.nan
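Only partly. Both snippets give s1 a NaN entry for every label of s2 that it was missing, but align with join='right' also drops any label of s1 that is not in s2 and orders the result like s2.index, whereas the loop keeps all existing labels and appends the new ones at the end. So the two agree (up to ordering) only when every label of s1 already appears in s2. In that case, if s2 is large, the for loop is much slower than align.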
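A small check of how the two snippets behave when s1 has a label that s2 lacks. The series contents here are made up for illustration:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "x"])    # "x" is not in s2
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "a"])  # "c" is not in s1

# Snippet 1: align on the right-hand index
aligned, _ = s1.align(s2, join="right")

# Snippet 2: add missing labels one by one (on a copy, to keep s1 intact)
looped = s1.copy()
for k in s2.index:
    if k not in looped.index:
        looped[k] = np.nan

print(list(aligned.index))  # labels and order of s2; "x" is gone
print(list(looped.index))   # original labels of s1, then "c" appended
```

Both results contain a NaN for "c", but only the loop version still has "x".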
Consider the following pseudo code:
a <- [0,0,0] (initializing a 3d vector to zeros)
b <- [0,0,0] (initializing a 3d vector to zeros)
c <- a . b (Dot product of two vectors)
In the above pseudo code, what is the flop count (i.e. number floating point operations)?
More generally, what I want to know is whether initialization of variables counts towards the total floating point operations or not, when looking at an algorithm's complexity.
In your case, both a and b are zero vectors, which makes them a poor example for explaining flop counts. Initialization is an assignment rather than an arithmetic operation, so it is normally not counted as a flop.
Take instead a vector a with entries a1, a2, a3 and a vector b with entries b1, b2, b3. The dot product of the two vectors is a^T b:
a^T b = a1*b1 + a2*b2 + a3*b3
Here we have 3 multiplications (a1*b1, a2*b2, a3*b3) and 2 additions, so 5 operations, or 5 flops, in total.
For n-dimensional vectors a and b, we would have n multiplications and n-1 additions, for a total of n + (n-1) = 2n-1 flops.
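The count can be checked with a small sketch that tallies each multiplication and addition as it performs the dot product. The function name dot_with_flop_count is made up for this illustration:

```python
def dot_with_flop_count(a, b):
    """Return (dot product, number of floating point operations used)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    total = a[0] * b[0]  # first multiplication, no addition needed yet
    flops = 1
    for x, y in zip(a[1:], b[1:]):
        total += x * y   # one multiplication and one addition
        flops += 2
    return total, flops


value, flops = dot_with_flop_count([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(value, flops)  # 32.0 5
```

Creating the two input lists contributes nothing to the tally; only the arithmetic inside the loop does.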
I hope the example I used above gives you the intuition.
I am looking for an elegant way to select a subset of a torch tensor which satisfies some constraints.
For example, say I have:
A = torch.rand(10,2)-1
and S is a 10x1 tensor,
sel = torch.ge(S,5) -- this is a ByteTensor
I would like to be able to do logical indexing, as follows:
A1 = A[sel]
But that doesn't work.
There is the index function, which accepts a LongTensor, but I could not find a simple way to convert the mask to a LongTensor, except the following:
sel = torch.nonzero(sel)
which returns a K x 2 tensor (K being the number of values of S >= 5). So then I have to convert it to a 1 dimensional array, which finally allows me to index A:
A:index(1,torch.squeeze(sel:select(2,1)))
This is very cumbersome; in e.g. Matlab all I'd have to do is
A(S>=5,:)
Can anyone suggest a better way?
One possible alternative is:
sel = S:ge(5):expandAs(A) -- now you can use this mask with the [] operator
A1 = A[sel]:unfold(1, 2, 2) -- unfold to get back a 2D tensor
Example:
> A = torch.rand(3,2)-1
-0.0047 -0.7976
-0.2653 -0.4582
-0.9713 -0.9660
[torch.DoubleTensor of size 3x2]
> S = torch.Tensor{{6}, {1}, {5}}
6
1
5
[torch.DoubleTensor of size 3x1]
> sel = S:ge(5):expandAs(A)
1 1
0 0
1 1
[torch.ByteTensor of size 3x2]
> A[sel]
-0.0047
-0.7976
-0.9713
-0.9660
[torch.DoubleTensor of size 4]
> A[sel]:unfold(1, 2, 2)
-0.0047 -0.7976
-0.9713 -0.9660
[torch.DoubleTensor of size 2x2]
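For readers more familiar with NumPy, the same expand-the-mask-then-reshape idea can be sketched as follows. This is an illustration in a different library, not Torch7 code; np.broadcast_to plays the role of expandAs and reshape the role of unfold, and the numbers are the ones from the example above.

```python
import numpy as np

A = np.array([[-0.0047, -0.7976],
              [-0.2653, -0.4582],
              [-0.9713, -0.9660]])
S = np.array([[6.0], [1.0], [5.0]])

# Expand the 3x1 row condition to the 3x2 shape of A
mask = np.broadcast_to(S >= 5, A.shape)

# Boolean indexing flattens the result, so restore the column count afterwards
rows = A[mask].reshape(-1, A.shape[1])

# NumPy also allows selecting whole rows directly with a 1-D boolean mask
direct = A[(S >= 5).ravel()]

print(rows)
```

Both rows and direct hold the first and third rows of A.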
There are two simpler alternatives:
Use maskedSelect:
result=A:maskedSelect(your_byte_tensor)
Use a simple element-wise multiplication, for example
result=torch.cmul(A,S:gt(0))
The second one is very useful if you need to keep the shape of the original matrix (i.e. A), for example to select neurons in a layer during backprop. However, since it puts zeros in the resulting matrix wherever the condition encoded by the ByteTensor doesn't hold, you can't use it to compute a product (or median, etc.). The first one returns only the elements that satisfy the condition, so this is what I'd use to compute products, medians, or anything else where I don't want zeros.
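The difference between the two approaches shows up as soon as a reduction such as a product is applied. A small NumPy illustration with made-up numbers (again not Torch7 code; boolean indexing stands in for maskedSelect and element-wise multiplication for cmul):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [4.0, 5.0],
              [6.0, 7.0]])
keep = np.array([[True], [False], [True]])  # keep rows 0 and 2
mask = np.broadcast_to(keep, A.shape)

selected = A[mask]  # only the kept elements, flattened: [2, 3, 6, 7]
zeroed = A * mask   # same shape as A, dropped elements replaced by 0

print(selected.prod())  # 252.0
print(zeroed.prod())    # 0.0, the zeros wipe out the product
```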
I had these questions in an exam today. State True or False and explain.
If k1(.,.) and k2(.,.) are two valid kernel functions, then if h = k1 - k2, is h(.,.) a valid kernel function?
A standard soft margin SVM is used to classify a data set. We have a fixed C parameter. Two different algorithms A1 and A2 are used to obtain the support vector set {S: α_i > 0}. Call them S1 and S2. Is S1 = S2 in all cases? Assume both algorithms use the same kernel function.
EDITED:
I guessed as:
A kernel function needs to be positive semi-definite (PSD), but the difference between two kernel functions need not be PSD. Hence FALSE.
The α_i can differ between the two algorithms, so the number of support vectors can differ as well. Hence FALSE again.
A) The constant 0 is a kernel, and the constant 1 is a kernel too. But 0 - 1 = -1 is not PSD.
Thus false IMHO.
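The counterexample can be checked numerically by looking at the Gram matrices on a handful of points. A minimal sketch; the number of points n = 4 is arbitrary, and since both kernels are constant the actual points do not matter:

```python
import numpy as np

n = 4
K1 = np.zeros((n, n))  # Gram matrix of the constant kernel k1 = 0
K2 = np.ones((n, n))   # Gram matrix of the constant kernel k2 = 1
H = K1 - K2            # Gram matrix of h = k1 - k2 = -1

# A valid kernel must give a PSD Gram matrix, i.e. no negative eigenvalues
eig_k2 = np.linalg.eigvalsh(K2)
eig_h = np.linalg.eigvalsh(H)

print(eig_k2.min())  # about 0, so K2 is PSD
print(eig_h.min())   # about -n, so H is not PSD
```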
B) Assume 2D data, where x=0 for Class 1, x=1 for Class 2, and y is uniformly random. Any vector from each class is as good a support vector as the others, yielding the same hyperplane. Visually:
x1 | y1
|
x2 | y2
Which SVM is better, the one using x1 and y1 as support vectors, or the one using x2 and y2?
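The point can be made concrete with a tiny linear-kernel example: two different sets of dual coefficients α that are both optimal, give the same hyperplane, and yet mark different points as support vectors. The four data points, the bias b = -1 and the assumption C >= 2 (so that α_i = 2 is allowed) are chosen for illustration only.

```python
import numpy as np

# Two points on x=0 (label -1) and two points on x=1 (label +1)
X = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])


def weights(alpha):
    """w = sum_i alpha_i * y_i * x_i for a linear kernel."""
    return (alpha * y) @ X


def dual_objective(alpha):
    w = weights(alpha)
    return alpha.sum() - 0.5 * (w @ w)


# Two candidate dual solutions, as two different solvers might return them
alpha_a = np.array([2.0, 0.0, 2.0, 0.0])  # support vectors: points 0 and 2
alpha_b = np.array([0.0, 2.0, 0.0, 2.0])  # support vectors: points 1 and 3

w_a, w_b = weights(alpha_a), weights(alpha_b)
b = -1.0
margins = y * (X @ w_a + b)  # every point sits exactly on the margin

primal = 0.5 * (w_a @ w_a)   # primal objective; all slack variables are zero
print(w_a, w_b)
print(dual_objective(alpha_a), dual_objective(alpha_b), primal)
print(np.flatnonzero(alpha_a > 0), np.flatnonzero(alpha_b > 0))
```

Both α vectors satisfy sum_i α_i y_i = 0 and reach a dual objective equal to the primal one, so both are optimal, but S1 = {0, 2} while S2 = {1, 3}.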