How should I handle a music key (scale) as a feature in the KNN algorithm - pandas

I'm doing a data science project, and I was wondering how to handle a music key (scale) as a feature in the KNN algorithm.
I know KNN is based on distances, so giving each key a number from 1-24 doesn't make much sense (cyclically, key 24 is as close to key 1 as key 7 is to key 8, but a linear numbering wouldn't reflect that).
I have thought about making a column for "Major/Minor" and another for the note itself, but I'm still facing the same problem: I need to specify the note with a number, and because notes are cyclic I cannot number them linearly 1-12.
For people who have no idea how musical keys work: my question is equivalent to handling US states in KNN; you can't just number them linearly 1-50.

One way you could think about the distance between scales is to represent each scale as a 12-element binary vector that has a 1 wherever a note is in the scale and a 0 otherwise.
Then you can compute the Hamming distance between scales. The Hamming distance between a major scale and its relative minor, for example, should be zero, because the two contain exactly the same notes.
Here's a way you could set this up in Python:
from enum import IntEnum
import numpy as np
from scipy.spatial.distance import hamming

class Note(IntEnum):
    C = 0
    Db = 1
    D = 2
    Eb = 3
    E = 4
    F = 5
    Gb = 6
    G = 7
    Ab = 8
    A = 9
    Bb = 10
    B = 11

# Scale templates with the tonic at index 0
major = np.array((1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1))  # W W H W W W H (major)
minor = np.array((1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0))  # W H W W H W W (natural minor)

# Transpose the basic scale form to a key using NumPy's `roll` function
cMaj = np.roll(major, Note.C)  # rolling by zero changes nothing
aMin = np.roll(minor, Note.A)
gMaj = np.roll(major, Note.G)
fMaj = np.roll(major, Note.F)

print('Distance from cMaj to aMin', hamming(cMaj, aMin))
print('Distance from cMaj to gMaj', hamming(cMaj, gMaj))  # one step clockwise on the circle of fifths
print('Distance from cMaj to fMaj', hamming(cMaj, fMaj))  # one step counter-clockwise on the circle of fifths
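If you go this route, the 12-element vectors can serve directly as KNN features. Here is a minimal sketch using scikit-learn; the training set and labels are made up purely for illustration:
from sklearn.neighbors import KNeighborsClassifier

X = np.stack([cMaj, aMin, gMaj, fMaj])        # one 12-element scale vector per track
labels = [0, 0, 1, 1]                         # hypothetical class labels

# metric='hamming' makes the classifier use the same distance as above
knn = KNeighborsClassifier(n_neighbors=1, metric='hamming')
knn.fit(X, labels)
print(knn.predict([np.roll(major, Note.D)]))  # classify a D major track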

IIUC, you can convert your feature to something like a sine as follows. Here I have 10 values (0 through 10, where 0 and 10 represent the same point on the circle) and I am transforming them to keep their circular relation.
a = np.around(np.sin([np.deg2rad(x*18) for x in range(11)]), 3)
import matplotlib.pyplot as plt
plt.plot(a)
Output: (a plot showing one arch of a sine wave over the 11 values)
Through this feature engineering you can see that the circularity of your feature is encoded: the value at 0 equals the value at 10. Note that a sine alone also maps two distinct positions on the circle to the same number (e.g. 1 and 9 here), so the standard cyclic encoding uses a (sin, cos) pair per feature, as in the sketch below.
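As a hedged illustration of that (sin, cos) encoding applied to the original question's 12 notes (the integer `note` column 0-11 is an assumed layout, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'note': range(12)})  # hypothetical: 0=C, 1=Db, ..., 11=B

# Place each note on the unit circle so that note 11 and note 0 end up adjacent
angle = 2 * np.pi * df['note'] / 12
df['note_sin'] = np.sin(angle)
df['note_cos'] = np.cos(angle)

# Euclidean distance on (note_sin, note_cos) now respects the cycle:
# the B-to-C distance equals the C-to-Db distance.
These two columns can then replace a single linear note number as KNN features.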


cvxpy integer variable - exclude certain integer values from the solution

I have the following problem and I can't figure out if cvxpy can do what I need.
Context: I optimize portfolios. When buying bonds and optimizing the quantity of each bond to buy, it's only possible to buy each bond in multiples of 1,000 units.
However, the minimum piece required to be bought is most of the time 10,000.
This means we either don't buy a bond at all, or if we buy it, the quantity bought has to be 10,000, 11,000, 12,000 and so on.
Is there a way (it seems there isn't) to restrict certain values from the possible solutions an integer variable can have?
So let's assume we have an integer variable x that is non-negative.
We basically want to buy 1000x, but we know that x can only take the values x = {0, 10, 11, 12, ...}.
Is it possible to skip the values 1..9 without adding other variables?
For example:
import numpy as np
import pandas as pd
import cvxpy as cvx

np.random.seed(1)
p = pd.DataFrame({'bond_id': ['s1','s2','s3','s4','s5','s6','s7','s8','s9','s10'],
                  'er': np.random.rand(10),
                  'units': [10000,2000,3000,4000,27000,4000,0,0,0,0]})

final_units = cvx.Variable(10, integer=True)
constraints = list()
constraints.append(final_units >= 0)
constraints.append(cvx.sum(final_units) * 1000 <= 50000)
constraints.append(cvx.sum(final_units) * 1000 >= 50000)
constraints.append(final_units <= 15)

obj = cvx.Maximize(final_units @ np.array(list(p['er'])))
prob = cvx.Problem(obj, constraints)
solve_val = prob.solve()
print("\n* solve_val = {}".format(solve_val))
solution_value = prob.value
solution = str(prob.status).lower()
print("\n** SOLUTION 3: {} Value: {} ".format(solution, solution_value))
print("\n* final_units -> \n{}\n".format(final_units.value))

p['FINAL_SOL'] = final_units.value * 1000
print("\n* Final Portfolio: \n{}\n".format(p))
This solution is a very simplified version of the problem I face. The final vector final_units can suggest values like in this example, where we would buy 5,000 units of bond s9; however I can't do that, since the minimum I can buy is 10,000.
I know I could add an additional integer vector to express an OR condition, but in reality my real problem is much bigger than this; I already have thousands of integer variables. Hence, I wonder if there's a way to exclude the values 1 to 9 without adding additional variables to the problem.
Thank you
No, not with CVXPY. You can model it with an integer variable x[i] plus a binary variable y[i], and using the constraints (in math notation):
y[i] * 10 <= x[i] <= y[i] * 15
This results in x[i] ∈ {0, 10..15}.
Some solvers have a variable type for this: semi-integer variables. With those you don't need the extra binary variable and these two constraints, but AFAIK CVXPY does not support this variable type.
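As a hedged CVXPY sketch of that binary-variable formulation, grafted onto the example above (variable names are mine; the bounds 10 and 15 come from the question's example):
import numpy as np
import cvxpy as cvx

np.random.seed(1)
er = np.random.rand(10)                # expected returns, as in the example

x = cvx.Variable(10, integer=True)     # units to buy, in thousands
y = cvx.Variable(10, boolean=True)     # 1 if we buy bond i at all

constraints = [
    x >= 10 * y,                       # if we buy, buy at least 10 (i.e. 10,000 units)
    x <= 15 * y,                       # upper bound; also forces x = 0 when y = 0
    cvx.sum(x) * 1000 == 50000,
]
prob = cvx.Problem(cvx.Maximize(x @ er), constraints)
prob.solve()                           # needs a mixed-integer solver, e.g. GLPK_MI
print(x.value)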

How to add sequential (time series) constraint to optimization problem using python PuLP?

A simple optimization problem: Find the optimal control sequence for a refrigerator based on the cost of energy. The only constraint is to stay below a temperature threshold, and the objective function tries to minimize the cost of the energy used. The problem is simplified so the control is simply a binary array, i.e. [0, 1, 0, 1, 0], where 1 means using electricity to cool the fridge, and 0 means turning off the cooling mechanism (which means there is no cost for this period, but the temperature will increase). We can assume each period is a fixed length of time, and has a constant temperature change based on its on/off status.
Here are the example values:
Cost of energy (for our example 5 periods): [466, 426, 423, 442, 494]
Minimum cooling periods (just as a test): 3
Starting temperature: 0
Temperature threshold (must be less than or equal): 1
Temperature change per period of cooling: -1
Temperature change per period of warming (when control input is 0): 2
And here is the code in PuLP
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value
from itertools import accumulate
l = list(range(5))
costy = [466, 426, 423, 442, 494]
cost = dict(zip(l, costy))
min_cooling_periods = 3
prob = LpProblem("Fridge", LpMinimize)
si = LpVariable.dicts("time_step", l, lowBound=0, upBound=1, cat='Integer')
prob += lpSum([cost[i]*si[i] for i in l]) # cost function to minimize
prob += lpSum([si[i] for i in l]) >= min_cooling_periods # how many values must be positive
prob.solve()
The optimization seems to work before I try to account for the temperature threshold. With just the cost function, it returns an array of 0s, which does indeed minimize the cost (duh). With the first constraint (how many values must be positive) it picks the cheapest 3 cooling periods, and calculates the total cost correctly.
obj = value(prob.objective)
print(f'Solution is {LpStatus[prob.status]}\nThe total cost of this regime is: {obj}\n')
for v in prob.variables():
    print(f'{v.name} = {v.varValue}')
output:
Solution is Optimal
The total cost of this regime is: 1291.0
time_step_0 = 0.0
time_step_1 = 1.0
time_step_2 = 1.0
time_step_3 = 1.0
time_step_4 = 0.0
So, if our control sequence is [0, 1, 1, 1, 0], the temperature will look like this at the end of each cooling/warming period: [2, 1, 0, -1, 1]. The temperature goes up 2 whenever the control input is 0, and down 1 whenever the control input is 1. This example sequence is a valid answer, but will have to change if we add a max temperature threshold of 1, which would mean the first value must be a 1, or else the fridge will warm to a temperature of 2.
However I get incorrect results when trying to specify the sequential constraint of staying within the temperature thresholds with the condition:
up_temp_thresh = 1
down = -1
up = 2
# here is where I try to ensure that the control sequence would never cause the temperature to
# surpass the threshold. In practice I would like a lower and upper threshold but for now
# let us focus only on the upper threshold.
prob += lpSum([e <= up_temp_thresh for e in accumulate([down if si[i] == 1. else up for i in l])]) >= len(l)
In this case the answer comes out the same as before; I am clearly not formulating it correctly, as the sequence [0, 1, 1, 1, 0] would surpass the threshold.
I am trying to encode "the temperature at the end of each control sequence must be less than the threshold". I do this by turning the control sequence into an array of the temperature changes, so control sequence [0, 1, 1, 1, 0] gives us temperature changes [2, -1, -1, -1, 2]. Then using the accumulate function, it computes a cumulative sum, equal to the fridge temp after each step, which is [2, 1, 0, -1, 1]. I would like to just check if the max of this array is less than the threshold, but using lpSum I check that the sum of values in the array less than the threshold is equal to the length of the array, which should be the same thing.
However I'm clearly formulating this step incorrectly. As written this last constraint has no effect on the output, and small changes give other wrong answers. It seems the answer should be [1, 1, 1, 0, 0], which gives an acceptable temperature series of [-1, -2, -3, -1, 1]. How can I specify the sequential nature of the control input using PuLP, or another free python optimization library?
The easiest and least error-prone approach would be to create a new set of auxiliary variables in your problem which track the temperature of the fridge in each interval. These are not 'primary decision variables', because you cannot directly choose them; rather, their values are constrained by the on/off decision variables for the fridge.
You would then add constraints on these temperature state variables to represent the dynamics. So in untested code:
l_plus_1 = list(range(6))
fridge_temp = LpVariable.dicts("fridge_temp", l_plus_1, cat='Continuous')
fridge_temp[0] = init_temp  # initial temperature of fridge - a known value
for i in l:
    prob += fridge_temp[i+1] == fridge_temp[i] + 2 - 3*si[i]
You can then set the min/max temperature constraints on these new fridge_temp variables.
Note that in the above I've assumed that the fridge temperature variables are defined at one more interval than the on/off decisions for the fridge. The fridge temperature variables represent the temperature at the start of an interval, and having one extra one means we can ensure the final temperature of the fridge is acceptable.
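Putting the pieces together, here is a runnable sketch of the full model with the question's example numbers (only the upper threshold is enforced, as in the question):
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value

l = list(range(5))
costy = [466, 426, 423, 442, 494]
init_temp, up_temp_thresh = 0, 1

prob = LpProblem("Fridge", LpMinimize)
si = LpVariable.dicts("time_step", l, cat='Binary')
fridge_temp = LpVariable.dicts("fridge_temp", range(6), cat='Continuous')

prob += lpSum([costy[i] * si[i] for i in l])        # minimise energy cost
prob += fridge_temp[0] == init_temp                 # known starting temperature
for i in l:
    # dynamics: +2 when off, -1 when cooling (2 - 3*si[i] gives exactly that)
    prob += fridge_temp[i + 1] == fridge_temp[i] + 2 - 3 * si[i]
    prob += fridge_temp[i + 1] <= up_temp_thresh    # never exceed the threshold

prob.solve()
print(LpStatus[prob.status], value(prob.objective)) # expect: Optimal 1315.0
print([int(si[i].varValue) for i in l])             # expect: [1, 1, 1, 0, 0]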

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
     [0,0,0,0,0],
     [2,0,0,0,20],
     [4,0,8,20,40],
     ...]
Now, I want to find the top 1 (2, 3, etc.) rows from M that are most similar to N. Or in simple words, the most similar texts to my initial text.
The challenge is, just checking the distances between N and each row from M is not enough, since for example row M4 [4,0,8,20,40] is very different by distance from N, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 can be just 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the most 1,2,3 etc similar texts from M to the text in N)?
Generally speaking, the most standard technique for comparing bag-of-words vectors (i.e. your arrays) is the cosine similarity measure. This maps your bag of n (here 5) words to an n-dimensional space, and each array is a point (which is essentially also a vector) in that space. The most similar vectors/points are the ones with the smallest angle to your text N in that space (this automatically takes care of proportional ones, as they are close in angle). Therefore, here is code for it (assuming M and N are NumPy arrays of the shapes introduced in the question):
import numpy as np
with np.errstate(divide='ignore', invalid='ignore'):
    cos = np.dot(M, N) / (np.linalg.norm(M, axis=1) * np.linalg.norm(N))  # norms per row
cos_sim = M[np.nanargmax(cos)]  # nanargmax skips the all-zero row, whose cosine is undefined
which gives output [ 4 0 8 20 40] for your inputs.
You can normalise your row counts to remove the length effect, as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual can then be calculated as the sum of squared differences between N (also normalised) and each row of M. The minimum difference (ignoring NaN or inf values obtained when a row sum is 0) then identifies the most similar row.
Here is an example:
import numpy as np

N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
              [0,0,0,0,0],
              [2,0,0,0,20],
              [4,0,8,20,40]])

# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# replace any NaN values (from dividing by an all-zero row sum) with a value
# larger than an existing element, so they can never win the argmin
similarity[np.isnan(similarity)] = similarity[0] + 1
result = M[similarity.argmin()]
result
>>> array([ 4,  0,  8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.
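For completeness, a small sketch of that top-n retrieval, using the `similarity` scores computed above (n = 2 purely for illustration):
n = 2
top_n = M[np.argsort(similarity)[:n]]   # the n most similar rows, best first
print(top_n)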

Solving an underdetermined scipy.sparse matrix using svd

Problem
I have a set of equations with variables denoted with lowercase variables and constants with uppercase variables as such
A = a + b
B = c + d
C = a + b + c + d + e
I'm provided the information as to the structure of these equations in a pandas DataFrame with two columns: Constants and Variables
E.g.
df = pd.DataFrame([['A','a'],['A','b'],['B','c'],['B','d'],['C','a'],['C','b'],
                   ['C','c'],['C','d'],['C','e']], columns=['Constants','Variables'])
I then convert this to a sparse CSC matrix by using NetworkX
table = nx.bipartite.biadjacency_matrix(nx.from_pandas_dataframe(df, 'Constants', 'Variables'),
                                        df.Constants.unique(), df.Variables.unique(), format='csc')
When converted to a dense matrix, table looks like the following
matrix([[1, 1, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [1, 1, 1, 1, 1]], dtype=int64)
What I want from here is to find which variables are solvable (in this example, only e is solvable) and, for each solvable variable, which constants its value depends on (in this case, since e = C - B - A, it depends on A, B, and C).
Attempts at Solution
I first tried to use rref to solve for the solvable variables. I used the symbolic library sympy and the function sympy.Matrix.rref, which gave me exactly what I wanted, since any solvable variable would get its own row of almost all zeros with a single 1, which I could check for row by row.
However, this solution was not stable. Primarily, it was exceedingly slow, and didn't make use of the fact that my datasets are likely to be very sparse. Moreover, rref doesn't do too well with floating points. So I decided to move on to another approach motivated by Removing unsolvable equations from an underdetermined system, which suggested using svd
Conveniently, there is an svd function in the scipy.sparse library, namely scipy.sparse.linalg.svds. However, given my lack of linear algebra background, I don't understand the results output by running this function on my table, or how to use those results to get what I want.
Further Details in the Problem
The coefficient of every variable in my problem is 1. This is how the data can be expressed in the two column pandas DataFrame shown earlier
The vast majority of variables in my actual examples will not be solvable. The goal is to find the few that are solvable
I'm more than willing to try an alternate approach if it fits the constraints of this problem.
This is my first time posting a question, so I apologize if this doesn't exactly follow guidelines. Please leave constructive criticism but be gentle!
The system you are solving has the form

[ 1 1 0 0 0 ]   [a]   [A]
[ 0 0 1 1 0 ] . [b] = [B]
[ 1 1 1 1 1 ]   [c]   [C]
                [d]
                [e]

i.e., three equations for five variables a, b, c, d, e. As the answer linked in your question mentions, one can tackle such an underdetermined system with the pseudoinverse, which NumPy directly provides in terms of the pinv function.
Since M has linearly independent rows, the pseudoinverse in this case has the property that M . pinv(M) = I, where I denotes the identity matrix (3x3 in this case). Thus, formally, we can write the solution as:
v = pinv(M) . b
where v is the 5-component solution vector, and b denotes the right-hand-side 3-component vector [A, B, C]. However, this solution is not unique, since one can add a vector from the so-called kernel or null space of the matrix M (i.e., a vector w for which M.w = 0) and it will still be a solution:
M.(v + w) = M.v + M.w = b + 0 = b
Therefore, the only variables for which there is a unique solution are those for which the corresponding component of all possible vectors from the null space of M is zero. In other words, if you assemble the basis of the null space into a matrix (one basis vector per column), then the "solvable variables" will correspond to zero rows of this matrix (the corresponding component of any linear combination of the columns will be then also zero).
Let's apply this to your particular example:
import numpy as np
from numpy.linalg import pinv

M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1]
]
print(pinv(M))

[[ 5.00000000e-01 -2.01966890e-16  1.54302378e-16]
 [ 5.00000000e-01  1.48779676e-16 -2.10806254e-16]
 [-8.76351626e-17  5.00000000e-01  8.66819360e-17]
 [-2.60659800e-17  5.00000000e-01  3.43000417e-17]
 [-1.00000000e+00 -1.00000000e+00  1.00000000e+00]]
From this pseudoinverse, we see that the variable e (last row) is indeed expressible as - A - B + C. However, it also "predicts" that a=A/2 and b=A/2. To eliminate these non-unique solutions (equally valid would be also a=A and b=0 for example), let's calculate the null space borrowing the function from SciPy Cookbook:
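(The cookbook helper itself isn't reproduced in this answer; a minimal SVD-based version, following the usual rank/nullspace recipe, might look like this:)
def nullspace(A, atol=1e-13, rtol=0):
    """Return a basis of the null space of A, one vector per column."""
    A = np.atleast_2d(A)
    u, s, vh = np.linalg.svd(A)
    tol = max(atol, rtol * s[0])
    rank = int((s >= tol).sum())
    return vh[rank:].conj().T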
print(nullspace(M))

[[ 5.00000000e-01 -5.00000000e-01]
 [-5.00000000e-01  5.00000000e-01]
 [-5.00000000e-01 -5.00000000e-01]
 [ 5.00000000e-01  5.00000000e-01]
 [-1.77302319e-16  2.22044605e-16]]
This function already returns the basis of the null space assembled into a matrix (one vector per column), and we see that, within reasonable precision, the only zero row is indeed the last one, corresponding to the variable e.
EDIT:
For the set of equations
A = a + b, B = b + c, C = a + c
the corresponding matrix M is
[ 1 1 0 ]
[ 0 1 1 ]
[ 1 0 1 ]
Here we see that the matrix is in fact square and invertible (the determinant is 2). Thus the pseudoinverse coincides with the "normal" inverse:

[[ 0.5 -0.5  0.5]
 [ 0.5  0.5 -0.5]
 [-0.5  0.5  0.5]]
which corresponds to the solution a = (A - B + C)/2, .... Since M is invertible, its kernel/null space is empty, which is why the cookbook function returns only []. To see this, use the definition of the kernel: it consists of all non-zero vectors x such that M.x = 0. But since M^{-1} exists, such an x would be given by x = M^{-1} . 0 = 0, which is a contradiction. Formally, this means that the found solution is unique (or that all variables are "solvable").
To build on ewcz's answer, both the nullspace and pseudo-inverse can be calculated using numpy.linalg.svd. See the links below:
pseudo-inverse
nullspace
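As a hedged sketch of both computations from a single SVD call (the tolerance choice is a common convention, not canonical):
import numpy as np

M = np.array([[1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [1, 1, 1, 1, 1]], dtype=float)

u, s, vh = np.linalg.svd(M)
tol = max(M.shape) * np.finfo(float).eps * s[0]
rank = int((s > tol).sum())

# Pseudoinverse: invert only the significant singular values
s_inv = np.zeros((vh.shape[0], u.shape[0]))
s_inv[:rank, :rank] = np.diag(1.0 / s[:rank])
M_pinv = vh.T @ s_inv @ u.T      # matches np.linalg.pinv(M)

# Null space: right singular vectors belonging to (near-)zero singular values
null_basis = vh[rank:].T         # one basis vector per column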

Octave: summing indexed elements

The easiest way to describe this is via example:
data = [1, 5, 3, 6, 10];
indices = [1, 2, 2, 2, 4];
result = zeros(1, 5);
I want result(1) to be the sum of all the elements in data whose index is 1, result(2) to be the sum of all the elements in data whose index is 2, etc.
This works but is really slow when applied (changing 5 to 65535) to 64K element vectors:
result = result + arrayfun(@(x) sum(data(indices==x)), 1:5);
I think it's creating 64K vectors with 64K elements that's taking up the time. Is there a faster way to do this? Or do I need to figure out a completely different approach?
for i = [1:5]
    idx = indices(i);
    result(idx) = result(idx) + data(i);
endfor
But that's a very non-octave-y way to do it.
Seeing how MATLAB is very similar to Octave, I will provide an answer that was tested on MATLAB R2016b. Looking at the documentation of Octave 4.2.1 the syntax should be the same.
All you need to do is this:
result = accumarray(indices(:), data(:), [5 1]).'
Which gives:
result =
1 14 0 10 0
Reshaping to a column vector (arrayName(:)) is necessary because of the inputs accumarray expects. Specifying the size as [5 1] and then transposing the result was done to avoid a size-related MATLAB error.
accumarray is also described in depth in the MATLAB documentation.
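(For readers coming from the Python questions above: NumPy's bincount is a close analogue. A small sketch with the same example data, shifting to 0-based indices:)
import numpy as np

data = np.array([1, 5, 3, 6, 10])
indices = np.array([1, 2, 2, 2, 4])

# bincount sums `weights` grouped by index
result = np.bincount(indices - 1, weights=data, minlength=5)
print(result)   # [ 1. 14.  0. 10.  0.]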