I am trying to learn various implementations of the Hungarian Algorithm. Specifically, I want the maximising variant, i.e. the assignment with the highest total score.
I have found two solutions from various packages: (1) the munkres package, and (2) Linear Sum Assignment in Scipy
(1) http://software.clapper.org/munkres/
(2) https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
I was able to get something together with #1 but was finding issues in my implementation (https://github.com/bmc/munkres/issues/39). So, I am now trying to work with option #2.
Here is what I have so far:
import numpy as np
from scipy.optimize import linear_sum_assignment
matrix = np.array([
[10.01, 10.02, 8.03, 11.04],
[9.05, 8.06, 500.07, 1.08],
[9.09, 7.11, 4.11, 1000.12]
])
row_ind, col_ind = linear_sum_assignment(matrix, maximize=True)
print('\nSolution:', matrix[row_ind, col_ind].sum())
It returns the correct solution of 1510.21.
What I would appreciate help with:
I have been struggling to display the workings. Ideally, what I want to see is each matched row and column pair together with its score. In this example, it would be:
(0,1) (10.02)
(1,2) (500.07)
(2,3) (1000.12)
This was straightforward enough to do with the munkres package (#1 detailed above), but I am struggling to get my head around how to make this work with the scipy implementation.
Thanks for any help
I was able to get what I was after using something like this:
for i in range(len(row_ind)):
    print("row:", row_ind[i], " col:", col_ind[i], " value:", matrix[row_ind[i], col_ind[i]])
This prints each assignment's position in the matrix along with its value. (Indexing with matrix[row_ind[i], col_ind[i]] rather than matrix[i, col_ind[i]] keeps the row index tied to the assignment, which matters if row_ind is not simply 0..n-1.)
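Equivalently, a slightly more compact sketch (assuming the same row_ind, col_ind and matrix as above) that prints the pairs in the (row, col) (value) format from the question:
for r, c in zip(row_ind, col_ind):
    print(f"({r},{c}) ({matrix[r, c]})")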
Related
In Python, I need to create an NxM matrix in which the ij entry has value of i^2 + j^2.
I'm currently constructing it using two for loops, but the array is quite big, the computation time is long, and I need to perform it several times. Is there a more efficient way of constructing such a matrix, maybe using NumPy?
You can use broadcasting in numpy. You may refer to the official documentation. For example,
import numpy as np
N = 3; M = 4 #whatever values you'd like
a = (np.arange(N)**2).reshape((-1,1)) # reshape into a column vector
b = np.arange(M)**2
print(a+b) #broadcasting applied
Instead of np.arange(), you can use np.array([...some array...]) if you need custom values.
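For a quick sanity check that the broadcasting result matches the double-loop construction described in the question, something along these lines should work (N and M are just placeholder sizes):
import numpy as np

N, M = 3, 4
grid = (np.arange(N)**2).reshape(-1, 1) + np.arange(M)**2             # broadcasting
loop = np.array([[i**2 + j**2 for j in range(M)] for i in range(N)])  # explicit loops
print(np.array_equal(grid, loop))  # True
# np.add.outer(np.arange(N)**2, np.arange(M)**2) is another equivalent one-liner.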
I am trying to speed up a batched matrix multiplication problem with numba, but it keeps warning me that np.dot is faster on contiguous arrays.
Note: I'm using numba version 0.55.1, and numpy version 1.21.5
Here's the problem:
import numpy as np
import numba as nb
@nb.njit(parallel=True)  # needed for nb.prange; the performance warning below comes from compiled code
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        result[n, :] = np.dot(vec[n, :], mat[n, :, :])
    return result
D,N = 10,1000
mat = np.random.normal(0,1,(N,D,D))
vec = np.random.normal(0,1,(N,D))
result = numbaFastMatMult(mat,vec)
print(mat.data.contiguous)
print(vec.data.contiguous)
n = 0  # n is not otherwise defined at this scope; check one representative slice
print(mat[n, :, :].data.contiguous)
print(vec[n, :].data.contiguous)
Clearly all the relevant data is contiguous (run the above code snippet and see the results of the print() calls).
But, when I run this code, I get the following warning:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 1d, C), array(float64, 2d, A))
result[n,:] = np.dot(vec[n,:], mat[n,:,:])
Two extra comments:
This is just a toy problem for replication. I'm actually using something with many more data points, so I'm hoping this will speed things up.
I think the "right" way to solve this is with np.tensordot. However, I want to understand what's going on for future reference. For example, this discussion addresses a similar issue, but as far as I can tell, doesn't address why the warning shows up directly.
I've tried adding a decorator:
nb.float64[:,::1](nb.float64[:,:,::1],nb.float64[:,::1]),
I've tried reordering the arrays so the batch index is first (n in the above code)
I've tried printing whether the "mat" variable is contiguous from inside the function
I'll leave this up, but I figured it out:
Outside of a numba function:
mat[n,:,:].data.contiguous==True
but inside numba, mat[n,:,:] is no longer contiguous.
Changing my code above to np.dot(vec[n], mat[n]) removed the warning.
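For reference, a minimal sketch of the adjusted function (the @nb.njit(parallel=True) decorator is assumed, since nb.prange and the performance warning only make sense inside jitted code):
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def numba_matmult_fixed(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        # Indexing with mat[n] / vec[n] keeps the slices typed as contiguous
        # inside numba (per the observation above), unlike mat[n,:,:] / vec[n,:],
        # so the contiguity warning no longer fires.
        result[n, :] = np.dot(vec[n], mat[n])
    return result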
I'm making this the "correct" answer since it solved my problem. However, according to max9111's response, this behavior may be a bug!
I've created the "Precipitation Analysis" example Jupyter Notebook in the Bluemix Spark service.
Notebook Link: https://console.ng.bluemix.net/data/notebooks/3ffc43e2-d639-4895-91a7-8f1599369a86/view?access_token=effff68dbeb5f9fc0d2df20cb51bffa266748f2d177b730d5d096cb54b35e5f0
So in In[34] and In[35] (you have to scroll a lot) they use numpy polyfit to calculate the trend for given temperature data. However, I do not understand how to use it.
Can somebody explain it?
The question has been answered on developerWorks:
https://developer.ibm.com/answers/questions/282350/how-does-numpy-polyfit-work.html
I will try to explain each of the steps:
index = chile[chile>0.0].index => this statement selects all the years (the index of the chile pandas Series) for which the value is greater than 0.0.
fit = np.polyfit(index.astype('int'), chile[index].values,1)
This is the polyfit call, which finds the polynomial fitting coefficients (slope and intercept) of a degree-1 fit for the given x (years) and y (precipitation per year) values supplied through the vectors.
print "slope: " + str(fit[0])
The code below plots the data points together with the fitted straight line to show the trend.
plt.plot(index, chile[index],'.')
In particular, in the statement below the second argument is the straight-line equation y = mx + b, where m is the slope and b is the intercept that we found above using polyfit.
plt.plot(index, fit[0]*index.astype('int') + fit[1], '-', color='red')
plt.title("Precipitation Trend for Chile")
plt.xlabel("Year")
plt.ylabel("Precipitation (million cubic meters)")
plt.show()
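To see what np.polyfit itself does in isolation, here is a minimal sketch with made-up numbers (the years and precipitation values below are placeholders, not the notebook's actual data):
import numpy as np

years = np.array([2000, 2001, 2002, 2003, 2004])
precip = np.array([1200.0, 1150.0, 1300.0, 1250.0, 1400.0])  # hypothetical values

# Degree-1 fit: returns the coefficients [m, b] of the line y = m*x + b.
m, b = np.polyfit(years, precip, 1)
print("slope:", m, "intercept:", b)
trend = m * years + b  # this is what gets plotted as the trend line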
I hope that helps.
Thanks, Charles.
What is a quick way to simulate random returns? I'm aware of numpy.random. However, that doesn't guide me towards how to model asset returns.
I've tried:
import numpy as np
r = np.random.rand(100)
But this doesn't feel accurate. How are others doing this?
I'd suggest one of two approaches:
One: Assume returns are normally distributed with a mean of 0.1% and a standard deviation of about 1%. This looks like:
import numpy as np
np.random.seed(314)
r = np.random.randn(100) / 100 + 0.001
seed(314) sets the random number generator to a specific state, so that if we both use the same seed, we should see the same results.
randn pulls from the normal distribution.
I'd also recommend using pandas. It's a library that implements a DataFrame object similar to R's data.frame:
import pandas as pd
df = pd.DataFrame(r)
You can then plot the cumulative returns like this:
df.add(1).cumprod().plot()
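Putting the first approach together as one runnable script (a sketch; matplotlib is assumed to be installed for the .plot() call, and the 0.1% / 1% parameters are just the illustrative values from above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(314)                       # reproducible draws
r = np.random.randn(100) / 100 + 0.001    # roughly N(mean=0.1%, std=1%) returns
df = pd.DataFrame(r, columns=['return'])
df['return'].add(1).cumprod().plot()      # growth of 1 unit invested
plt.show()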
Two:
The second way is to assume returns are log-normally distributed. That means log(1 + r) is normal. In this scenario, we pull normally distributed random numbers, use them as the exponent of e, and subtract 1. It looks like this:
r = np.exp(np.random.randn(100) / 100 + 0.001) - 1
If you plot it, it looks like this:
pd.DataFrame(r).add(1).cumprod().plot()
I've been rewriting a Matlab/Octave program in numpy and ran across a difference in some of the resulting values.
This occurs with both the percentile/prctile and the standard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results differ more than I would expect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
So I implemented that in Octave and got the exact answer numpy gives. So it seems the standard-deviation functions differ.
But why/how? And which one is correct (if there is such a thing)?
And what about prctile/percentile?
Just in case it matters, I'm on Linux (aptosid):
GNU Octave, version 3.6.2
numpy.__version__: '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe), while numpy does a bit more than that. If you check http://en.wikipedia.org/wiki/Percentile you will see there are several ways to calculate percentiles.
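As an illustration of the two definitions, here is a sketch that does the interpolation by hand with np.interp; the Octave/Matlab rule used here is the "k-th sorted value sits at percentile 100*(k-0.5)/n" convention, which matches the 95.454545 quoted above:
import numpy as np

t = np.sort(np.linspace(0, 100, 100))
n, p = len(t), 95.0

# NumPy's default: the k-th sorted value (0-based) sits at percentile 100*k/(n-1).
print(np.interp(p / 100 * (n - 1), np.arange(n), t))      # 95.0

# Octave/Matlab prctile: the k-th sorted value (1-based) sits at 100*(k-0.5)/n.
print(np.interp(p / 100 * n + 0.5 - 1, np.arange(n), t))  # 95.4545...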
It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785
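Spelling that out by hand (a small sketch of the two denominators):
import numpy as np

t = np.linspace(0, 100, 100)
dev2 = (t - t.mean()) ** 2

print(np.sqrt(dev2.sum() / len(t)))        # ddof=0: divide by N   -> 29.1576... (numpy default)
print(np.sqrt(dev2.sum() / (len(t) - 1)))  # ddof=1: divide by N-1 -> 29.3045... (Octave's std)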