Fast Cross Entropy in Numpy - numpy

I have come up with two versions of Cross-Entropy one in the more vectorized dot product format and the other is the typical one you will see in any ML lecture. I am trying to speed my algorithm and thus will use any chance to speed it up.
cost = -(1.0/m) * np.sum(Y*np.log(A) + (1-Y)*np.log(1-A))
Vectorized Version
cost = -(1.0/m) * (np.dot(np.log(A), Y.T) + np.dot(np.log(1-A), (1-Y).T))
My question: Which one of the above implementations of cross-entropy loss is computed fastest given the architecture of Numpy library and other constraints.

It would help with benchmarking to know typical values of labels. If labels are too short, a pure python implementation could actually be faster than using NumPy
Coming back to your question, I would say the dot products are already scalars (single numbers) the np.sum (which sums the array of products in the other style - see) is not needed in "np dot style".
I hope it helps.

Related

Tensorflow: Add small number before division for numerical stability

In order to prevent divisions by zero in TensorFlow, I want to add a tiny number to my dividend. A quick search did not yield any results. In particular, I am interested in using the scientific notation, e.g.
a = b/(c+1e-05)
How can this be achieved?
Assuming a, b and c are tensors. The formula you have written will work as expected. 1e-5 will be broadcasted and added on the tensor c. Tensorflow automatically typecasts the 1e-5 to tf.constant(1e-5).
Tensorflow however has some limitations with non-scalar broadcasts. Take a look at my other answer.

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having troubles to find interesting data for this purpose on the web.Also, usually this data might be inconvenient to use because the format won't fit for me.
I need a txt file which each line represents a mathematical vector, each element seperated by space, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create by my self. How can I do this in a smart way with numpy?
In smart way, I mean that it won't be generated uniformly, because it's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
There is no one good answer for such question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well posed problem. Clustering as such has no well defineid evaluation, consequently each method is equally good/bad, as long as it has well defined internal objective. So k-means will always be good one to minimize inter-cluster euclidean distance and will struggle with sparse data, non-convex, imbalanced clusters. DBScan will always be the best in greedy density based sense and will strugle with diverse density clusters. GMM will be always great fitting on gaussian mixtures, and will strugle with clusters which are not gaussians (for example lines, squares etc.).
From the question one could deduce that you are at the very begining of work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at datasets generators, in particular accesible in scikit-learn (python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf

Numpy/Scipy pinv and pinv2 behave differently

I am working with bidimensional arrays on Numpy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, sometimes (30-40% of the times) problems arise.
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all 1, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I am not into linear algebra and numeric computation enough to understand if this is a bug or some precision related property of the two functions. I would like you to answer this question so that, if it's the case, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more decent values.
Your help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution, and will then raise a scipy.stats.LinAlgError. In that case you may try to use the second pinv implementation, which will greatly reduce the amount of errors you receive.
Starting from scipy 1.7.0 , pinv2 is deprecated and also replaced by a SVD solution.
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means, numpy.pinv, scipy.pinv and scipy.pinv2 now compute all equivalent solutions. They are also equally fast in their computation, with scipy being slightly faster.
import numpy as np
import scipy
arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)

How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
columns=['R','T', 'V', 'X', 'Z']
for c in columns:
results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print labels
#print features
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print np.mean((regr.predict(features)-labels)**2)
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool as the relationship is not-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit learn ? ( I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solutions is not as easy and can be very influenced by your data.
If your variables R and Z are bounded (for ex 0<R<1 -3<Z<2) then you should be able to get a good estimation of the output variable using neural network.
Using neural network you should be able to estimate your output even without preprocessing the data and using all the variables as input.
(Of course here you will have to solve a minimization problem).
Sklearn do not implement neural network so you should use pybrain or fann.
If you want to preprocess the data in order to make the minimization problem easier you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non linear features selection. I would try to estimate the important variables from you dataset using in this order :
1-lasso
2- sparse PCA
3- decision tree (you can actually use them for features selection ) but I would avoid this as much as possible
If this is a toy problem I would sugges you to move towards something of more standard.
You can find a lot of examples on google.

How do I approximate the Jacobian and Hessian of a function numerically?

I have a function in Python:
def f(x):
return x[0]**3 + x[1]**2 + 7
# Actually more than this.
# No analytical expression
It's a scalar valued function of a vector.
How can I approximate the Jacobian and Hessian of this function in numpy or scipy numerically?
(Updated in late 2017 because there's been a lot of updates in this space.)
Your best bet is probably automatic differentiation. There are now many packages for this, because it's the standard approach in deep learning:
Autograd works transparently with most numpy code. It's pure-Python, requires almost no code changes for typical functions, and is reasonably fast.
There are many deep-learning-oriented libraries that can do this.
Some of the most popular are TensorFlow, PyTorch, Theano, Chainer, and MXNet. Each will require you to rewrite your function in their kind-of-like-numpy-but-needlessly-different API, and in return will give you GPU support and a bunch of deep learning-oriented features that you may or may not care about.
FuncDesigner is an older package I haven't used whose website is currently down.
Another option is to approximate it with finite differences, basically just evaluating (f(x + eps) - f(x - eps)) / (2 * eps) (but obviously with more effort put into it than that). This will probably be slower and less accurate than the other approaches, especially in moderately high dimensions, but is fully general and requires no code changes. numdifftools seems to be the standard Python package for this.
You could also attempt to find fully symbolic derivatives with SymPy, but this will be a relatively manual process.
Restricted to just SciPy, the most convenient way I found was scipy.misc.derivative, within the appropriate loops, with lambdas to curry the function of interest.