Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried surrogate optimization in Matlab, but I'm not sure it is appropriate. I've also heard about simulated annealing, but found no implementation applied to this kind of instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
The requirement/benefit of Simulated Annealing (SA) is that the objective surface is somewhat smooth, that is, being close to a solution in the search space means being close in objective value.
For a completely random, spiky surface you might as well do a random search.
If the surface is even somewhat smooth, or smooth some of the time, it makes sense to try SA.
The idea is that (sometimes) changing only 1 of the 3 values has little effect on our black-box function.
Here is a basic example of how to do this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
LENGTH = 12
I define a black-box using the Rastrigin function.
def rastrigin_function_n(x):
    """
    N-dimensional Rastrigin
    https://en.wikipedia.org/wiki/Rastrigin_function
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum(x**2 - A*np.cos(2*np.pi * x))
def black_box(x):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to Rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify the state X. Instead of taking/modifying values directly, we keep track of indices. This simplifies creating new proposals, as an index is always an integer to which we can simply add or subtract 1, modulo LENGTH.
def random_start():
    """
    returns 3 random indices
    """
    return np.random.randint(0, LENGTH, size=3)
def random_small_step(x):
    """
    change only 1 index
    """
    d = np.array([1, 0, 0])
    if np.random.random() < .5:
        d = np.array([-1, 0, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH
def random_big_step(x):
    """
    change 2 indices
    """
    d = np.array([1, -1, 0])
    np.random.shuffle(d)
    return (x + d) % LENGTH
def obj(x):
    """
    We have a triplet of indices:
    1. Calculate the corresponding values in W = [w1,w2,w3]
    2. Push the values into our black-box function
    """
    indices = x
    values = W[np.array([0, 1, 2]), indices]
    return black_box(values)
And throw an SA scheme at it:
import frigidum

local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        repeats=10**3,
                        copy_state=frigidum.annealing.naked)
I am not sure what the minimum for this function should be, but it found an objective value of 47.9095 with the indices np.array([9, 2, 2]).
Edit:
To change the cooling schedule in frigidum, use alpha=.9. My experience is that all the work of experimenting with which cooling scheme works best doesn't outweigh simply letting it run a little longer. The multiplicative schedule you proposed (sometimes called geometric) is the standard one, and it is also the one implemented in frigidum. So to implement T_{n+1} = 0.9 * T_n you need alpha=.9. Be aware that this cooling step is applied after N repeats, so if repeats=100, it will first do 100 proposals before lowering the temperature by the factor alpha.
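For example, a sketch of the earlier call with the geometric cooling made explicit (only alpha and repeats differ from the run above; the rest is unchanged):

local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        alpha=.9,        # T_{n+1} = 0.9 * T_n
                        repeats=10**2,   # cool once per 100 proposals
                        copy_state=frigidum.annealing.naked)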
Simple variations on the current state often work best. Since it is best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try variations of 2 or 3. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us. So if the time/cost of bringing another competitor to the table is small: yes!
Try every possible combination of the three values and see which has the lowest cost.
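With 12 × 12 × 12 = 1728 combinations, exhaustive search is often perfectly feasible unless a single evaluation is very expensive. A minimal sketch (cost_function here is a hypothetical stand-in for the expensive black box):

import itertools

def cost_function(a, b, c):
    # hypothetical stand-in for the expensive black box
    return (a - 7) ** 2 + (b - 3) ** 2 + abs(c - 5)

w1 = w2 = w3 = range(1, 13)

# evaluate all 1728 triplets and keep the cheapest one
best = min(itertools.product(w1, w2, w3), key=lambda t: cost_function(*t))
print(best)  # (7, 3, 5) for this stand-in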
Currently my code is
self.df['sma'] = self.df['Close'].rolling(window=30).mean()
self.df['cma'] = self.df.apply(lambda x: self.get_cma(x), axis=1)
def get_cma(self, candle):
    if np.isnan(candle['sma']):
        return np.nan
    secma = (candle['sma'] - self.previous_cma if self.previous_cma is not None else 0) ** 2
    ka = 1 - (candle['var'] / secma) if candle['var'] < secma else 0
    cma = ((ka * candle['sma']) + ((1 - ka) * self.previous_cma)) if self.previous_cma is not None else candle[self.src]
    self.previous_cma = cma
    return cma
Can the above be optimized to make it faster?
As you may already know, the secret to performance with Pandas is to do this in vectorized form. This means no apply. Here are the first few steps you need to take to speed up your code, by extracting parts of your get_cma() function to their vectorized equivalents.
if np.isnan(candle['sma']):
    return np.nan
This early exit is not needed in get_cma(); we can do this instead:
self.df['cma'] = np.nan
valid = self.df['sma'].notnull()
# this comment is a placeholder for step 2
self.df.loc[valid, 'cma'] = self.df[valid].apply(self.get_cma, axis=1)
This not only vectorizes the first two lines of get_cma(), it also means get_cma() is now called only on non-null rows rather than on every row. Depending on your data, that alone may provide a noticeable speedup.
If that's not enough, we need a bigger hammer. The fundamental problem is that each iteration of get_cma() depends on the previous, so it is not easy to vectorize. So let's use Numba to JIT compile the code. First we need to get rid of apply by using a good old for loop over the individual columns, which is equivalent (and will still be slow). Note this is a free (global) function, not a member function, and it takes NumPy arrays instead of Pandas types, because those are what Numba understands:
def get_cma(sma, var, src):
    cma = np.empty_like(sma)
    # take care of the initial value first, to avoid unnecessary branches later
    cma[0] = src[0]
    # now do all remaining rows; cma[ii-1] is previous_cma and is never None
    for ii in range(1, len(sma)):
        secma = (sma[ii] - cma[ii-1]) ** 2
        ka = 1 - (var[ii] / secma) if var[ii] < secma else 0
        cma[ii] = (ka * sma[ii]) + ((1 - ka) * cma[ii-1])
    return cma
Call it like this, passing the required columns as NumPy arrays:
valid_rows = self.df[valid]

self.df.loc[valid, 'cma'] = get_cma(
    valid_rows['sma'].to_numpy(),
    valid_rows['var'].to_numpy(),
    valid_rows[self.src].to_numpy())
Finally, after confirming the code works, decorate get_cma() to compile it with Numba automatically like this:
import numba

@numba.njit
def get_cma(sma, var, src):
    ...
That's it. Please let us know how much faster this runs on your real data. I expect it will be plenty fast enough.
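If you'd like to measure it, here is a rough timing sketch (sma_arr, var_arr and src_arr stand for the .to_numpy() arrays extracted above; note that the first call to a Numba-jitted function includes compile time, so time the second call):

import time

get_cma(sma_arr, var_arr, src_arr)   # first call triggers JIT compilation
t0 = time.perf_counter()
get_cma(sma_arr, var_arr, src_arr)   # second call measures pure run time
print(f"get_cma: {time.perf_counter() - t0:.4f}s")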
This code is working correctly as expected. But it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school']:
    for y in mysql_df['college_name']:
        if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess I cannot use a function in a join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite an earlier match, and it is possible that the earlier match was more relevant than the current one.
What you are looking for is fuzzy merging.
a = excel_df.to_numpy()
b = mysql_df.to_numpy()

for i in a:
    for j in b:
        if SequenceMatcher(None, i[college_index_a].lower(),
                           j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
Never use loc in a loop; it has a huge overhead. And by the way, get the numerical index of the respective columns, like this:
df.columns.get_loc("college name")
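For the loop above, with the column names from the question, that would look something like this (assuming dupmark4 already exists as a column in excel_df):

college_index_a = excel_df.columns.get_loc("name_of_college_school")
college_index_b = mysql_df.columns.get_loc("college_name")
dupmark_index = excel_df.columns.get_loc("dupmark4")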
You could avoid one of the loops using apply; instead of M×N .loc operations, it'll be M operations:
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
        None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
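Regarding the update: to keep the highest-scoring match instead of the last one, here is a sketch that tracks the best ratio so far in a helper column (dupmark_score is a hypothetical column name I'm introducing here):

excel_df['dupmark_score'] = 0.0  # hypothetical helper column: best ratio so far

for y in mysql_df['college_name']:
    # compute the actual ratio for every row, not just a boolean
    score = excel_df['name_of_college_school'].apply(
        lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio())
    # overwrite only where this candidate clears 0.8 AND beats the best so far
    better = (score > 0.8) & (score > excel_df['dupmark_score'])
    excel_df.loc[better, 'dupmark4'] = y
    excel_df.loc[better, 'dupmark_score'] = score[better]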
Cross-posting this from CS Theory since it is more of a software question.
I need code for calculating an exact MIN-DOM-SET (minimum dominating set). Currently the best suggestion has been to formulate it as an SMT problem and throw it at an SMT solver.
Curious whether there are any good MIN-DOM-SET-specific codes out there, or a good SMT-LIB formulation.
I coded one up in Z3's Python bindings using the new Optimize functionality.
from z3 import Optimize, Int, And, Sum, sat

def min_dom_set(graph):
    """Try to dominate the graph with the least number of vertices possible"""
    s = Optimize()
    nodes_colors = dict((node_name, Int('k%r' % node_name)) for node_name in graph.nodes())
    for node in graph.nodes():
        s.add(And(nodes_colors[node] >= 0, nodes_colors[node] <= 1))  # dominator or not
        dom_neighbor = Sum([nodes_colors[j] for j in graph.neighbors(node)])
        s.add(Sum(nodes_colors[node], dom_neighbor) >= 1)  # node must be dominated
    s.minimize(Sum([nodes_colors[y] for y in graph.nodes()]))
    if s.check() == sat:
        m = s.model()
        return dict((name, m[color].as_long()) for name, color in nodes_colors.items())
    raise Exception('Could not find a solution.')
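A minimal usage sketch, assuming the graph comes from networkx (my assumption; the function only needs .nodes() and .neighbors()):

import networkx as nx

g = nx.petersen_graph()
solution = min_dom_set(g)

# nodes assigned 1 form the dominating set
dominators = [node for node, used in solution.items() if used == 1]
print(len(dominators), sorted(dominators))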
I'm sorry the title is so confusingly worded, but it's hard to condense this problem down to a few words.
I'm trying to find the minimum value of a specific equation. At first I'm looping through the equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12. I want to design a loop where the "steps" through the loop get smaller and smaller.
For example, the first time it loops through, it uses a step of 1, and I will get about 30 values. What I need help with is: how do I use the three smallest values I receive from this first loop?
Here's an example of the values I might get from the first loop: (I should note this isn't supposed to be actual code at all. It's just a brief description of what's happening)
loop from x = 1 to 8 with step 1
results:
x = 1 -> y = 30
x = 2 -> y = 28
x = 3 -> y = 25
x = 4 -> y = 21
x = 5 -> y = 18
x = 6 -> y = 22
x = 7 -> y = 27
x = 8 -> y = 33
I want something that can detect the lowest three values and create a new loop from them. From these results, the values of x that give the three smallest results for y are x = 4, 5, and 6.
So my "guess" at this point would be x = 5. To get a better "guess" I'd like a loop that now does:
loop from x = 4 to x = 6 with step .5
I could keep this pattern going until I get an absurdly accurate guess for the minimum value of x.
Does anybody know of a way I can do this? I know the values I'm going to get can be modeled by an upward-opening parabola, so this format will definitely work. I was thinking the values could be put into a column; it wouldn't be hard to make something that returns the smallest value for y in that column along with the corresponding x-value.
If I'm being too vague, just let me know, and I can answer any questions you might have.
Nice question. Here's at least a start for what I think you should do:
Sub findMin()
    Dim lowest As Integer
    Dim middle As Integer
    Dim highest As Integer

    lowest = 999
    middle = 999
    highest = 999

    Dim i As Integer
    i = 1

    Do While i < 9
        If (retVal(i) < retVal(lowest)) Then
            highest = middle
            middle = lowest
            lowest = i
        Else
            If (retVal(i) < retVal(middle)) Then
                highest = middle
                middle = i
            Else
                If (retVal(i) < retVal(highest)) Then
                    highest = i
                End If
            End If
        End If
        i = i + 1
    Loop
End Sub

Function retVal(num As Integer) As Double
    ' y = .245x^3 - .67x^2 + 5x + 12, from the question
    retVal = 0.245 * num ^ 3 - 0.67 * num ^ 2 + 5 * num + 12
End Function
What I've done here is set three Integers as your three min values: lowest, middle, and highest. You loop through the values you're plugging into the formula (here, the retVal function), comparing the return value of retVal (hence the name) against retVal(lowest), retVal(middle), and retVal(highest), replacing them as necessary. I'm just beginning with VBA, so what I've done likely isn't very elegant, but it does at least identify the Integers that result in the lowest values of the function. You may have to play around with the initial values of lowest, middle, and highest a bit to make it work. I know this isn't EXACTLY what you're looking for, but it's something along the lines of what I think you should do.
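For the shrinking-step refinement itself (the part the VBA above doesn't do yet), here is a minimal Python sketch of the idea from the question. The function f below is a hypothetical stand-in parabola matching the question's example table, since, as the next answer points out, the cubic actually given has no minimum:

def f(x):
    # hypothetical stand-in with a minimum near x = 5, like the question's table
    return (x - 5) ** 2 + 18

def refine_min(lo, hi, step, tol=1e-6):
    best = (lo + hi) / 2
    while step > tol:
        # evaluate f on the current grid and keep the best x
        xs = []
        x = lo
        while x <= hi:
            xs.append(x)
            x += step
        best = min(xs, key=f)
        # zoom in around the best point and halve the step
        lo, hi = best - step, best + step
        step /= 2
    return best

print(refine_min(1, 8, 1))  # converges toward x = 5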
There is no trivial way to approach this unless the problem domain is narrowed.
The example polynomial given in fact has no minimum, which is readily determined by observing that y' = 0.735x^2 - 1.34x + 5 > 0 for all x (the discriminant 1.34^2 - 4(0.735)(5) ≈ -12.9 is negative and the leading coefficient is positive), hence y is always increasing with respect to x.
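A quick symbolic check of that claim (assuming sympy is available):

import sympy as sp

x = sp.symbols('x')
y = 0.245*x**3 - 0.67*x**2 + 5*x + 12
dy = sp.diff(y, x)      # 0.735*x**2 - 1.34*x + 5

print(sp.solve(dy, x))  # only complex roots: y' never crosses zero
print(dy.subs(x, 0))    # 5.0 > 0, so y' > 0 everywhere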
Given the wide interpretation of
[an] equation, which for our purposes here can be something like y = .245x^3 - .67x^2 + 5x + 12
many conditions need to be checked, even assuming the domain is limited to polynomials.
The order of the polynomial is significant: it determines which conditions must be checked, how many solutions are possible, and whether any solution is possible at all.
Without taking this complexity into account, an iterative approach could yield an incorrect solution due to underflow error or an unfortunate choice of iteration steps or bounds.
I'm not trying to be harsh here; I think your idea is neat. In practice it is just more complicated than you might expect.