I have a list of 2-dimensional arrays (all the same shape) and would like to get the mean and standard deviation of each element, in a result array of the same shape as the inputs. I have trouble understanding from the docs whether this is possible. All my attempts with the axis and keepdims parameters produce results of different shapes.
For example, I would like mean([x, x]) to equal x, and std([x, x]) to be zeros shaped like x.
Is this possible without reshaping the arrays? If not, how can it be done with reshaping?
Example:
>>> x = np.array([[1,2],[3,4]])
>>> y= np.array([[2,3],[4,5]])
>>> np.mean([x,y])
3.0
I want [[1.5,2.5],[3.5,4.5]] instead.
As Divakar points out, you can pass the list of arrays to np.mean and specify axis=0 to average over corresponding values from each array in the list:
In [13]: np.mean([x,y], axis=0)
Out[13]:
array([[ 1.5, 2.5],
[ 3.5, 4.5]])
This works for lists of arbitrary length. For just two arrays, (x+y)/2.0 is faster:
In [20]: %timeit (x+y)/2.0
100000 loops, best of 3: 1.96 µs per loop
In [21]: %timeit np.mean([x,y], axis=0)
10000 loops, best of 3: 21.6 µs per loop
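The question also asks for the deviation. As a minimal sketch, assuming the population standard deviation (NumPy's default, ddof=0) is what is wanted, np.std takes the same axis=0 argument, so both statistics keep the shape of the inputs. The variable names below are only for illustration.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 3], [4, 5]])

# Stack the list into a (2, 2, 2) array, then reduce over the first axis.
stacked = np.stack([x, y])
mean_xy = stacked.mean(axis=0)   # same shape as x
std_xy = stacked.std(axis=0)     # population std (ddof=0)

# Identical inputs: the mean is the input itself and the spread is zero.
mean_xx = np.mean([x, x], axis=0)
std_xx = np.std([x, x], axis=0)
```

Pass ddof=1 to np.std if the sample standard deviation is needed instead.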
I'm struggling to understand the concept of an "element" in Pandas. I have already gone through the Pandas documentation and googled around, and I'm guessing it's some sort of row? What do people mean when they say "apply a function element-wise"?
This question came up when I was reading this SO post: How to apply a function to two columns of Pandas dataframe
Pandas is designed for vectorized operations, i.e. taking an entire column and applying a function to it in one go. You can call this a column-wise operation.
But in some cases you may need to operate element by element (i.e. an element-wise operation). This type of operation is not very efficient.
Here is an example:
import pandas as pd
df = pd.DataFrame([a for a in range(100)], columns=['mynum'])
Column-wise operation:
%%timeit
df['add1'] = df.mynum +1
222 µs ± 3.31 µs per loop
Element-wise operation:
%%timeit
df['add1'] = df.apply(lambda a: a.mynum+1, axis = 1)
2.33 ms ± 85.4 µs per loop
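To make the distinction concrete, here is a small sketch showing that both styles produce the same values and differ only in how the work is dispatched. The column names add1_vec and add1_elem are made up for illustration, and Series.map is used here as one way of calling a Python function once per value.

```python
import pandas as pd

df = pd.DataFrame({'mynum': range(5)})

# Vectorized: one operation applied to the whole column at once.
df['add1_vec'] = df['mynum'] + 1

# Element-wise: a Python function is called once per value.
df['add1_elem'] = df['mynum'].map(lambda v: v + 1)

same = (df['add1_vec'] == df['add1_elem']).all()
```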
I believe "element" in Pandas is an inherited concept of the "element" from NumPy. Give the first few paragraphs of the docs on ufuncs a read.
Each universal function takes array inputs and produces array outputs by performing the core function element-wise on the inputs (where an element is generally a scalar, but can be a vector or higher-order sub-array for generalized ufuncs).
In mathematics, element-wise operations refer to operations on individual elements of a matrix.
Examples:
import numpy as np
>>> x, y = np.arange(1,5).reshape(2,2), 3*np.eye(2)
>>> x, y
(array([[1, 2],
[3, 4]]),
array([[3., 0.],
[0., 3.]]))
>>> x + y # element-wise addition
array([[4., 2.],
[3., 7.]])
>>> np.dot(x,y) # NOT element-wise multiplication (matrix multiplication)
# elements become dot products of the rows of x with columns of y
array([[ 3., 6.],
[ 9., 12.]])
>>> x * y # element-wise multiplication
array([[ 3., 0.],
[ 0., 12.]])
I realize your question was about Pandas, but element-wise in Pandas means the same thing it does in NumPy and in linear algebra (as far as I'm aware).
Element-wise means handling data element by element.
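One way to see what "element by element" means is to spell the operation out as an explicit loop. This is only a sketch reusing the x and y from the example above; the names looped, elementwise and matrix_prod are illustrative.

```python
import numpy as np

x = np.arange(1, 5).reshape(2, 2)
y = 3 * np.eye(2)

# Element-wise product spelled out: each output entry depends only on
# the entries of x and y at the same position.
looped = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        looped[i, j] = x[i, j] * y[i, j]

elementwise = x * y      # same result, done by NumPy in one call
matrix_prod = x @ y      # matrix product: rows of x dotted with columns of y
```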
I have a calculated matrix
from numpy import matrix
vec=matrix([[ 4.79263398e-01+0.j , -2.94883960e-14+0.34362808j,
5.91036823e-01+0.j , -2.06730654e-14+0.41959935j,
-3.20298698e-01+0.08635809j, -5.97136351e-02+0.22325523j],
[ 9.45394208e-14+0.34385164j, 4.78941900e-01+0.j ,
1.07732017e-13+0.41891016j, 5.91969770e-01+0.j ,
-6.06877417e-02-0.2250884j , 3.17803028e-01+0.08500215j],
[ 4.63795513e-01-0.00827114j, -1.15263719e-02+0.33287485j,
-2.78282097e-01-0.20137267j, -2.81970922e-01-0.1980647j ,
9.26109539e-02-0.38428445j, 5.12483437e-01+0.j ],
[ -1.15282610e-02+0.33275927j, 4.63961516e-01-0.00826978j,
-2.84077490e-01-0.19723838j, -2.79429184e-01-0.19984041j,
-4.42104809e-01+0.25708681j, -2.71973825e-01+0.28735795j],
[ 4.63795513e-01+0.00827114j, 1.15263719e-02+0.33287485j,
-2.78282097e-01+0.20137267j, 2.81970922e-01-0.1980647j ,
2.73235786e-01+0.28564581j, -4.44053596e-01-0.25584307j],
[ 1.15282610e-02+0.33275927j, 4.63961516e-01+0.00826978j,
2.84077490e-01-0.19723838j, -2.79429184e-01+0.19984041j,
5.11419878e-01+0.j , -9.22028113e-02-0.38476356j]])
I want to get 2nd row, 3rd column element
vec[1][2]
IndexError: index 1 is out of bounds for axis 0 with size 1
and slicing works well
vec[1,2]
(1.07732017e-13+0.41891015999999998j)
My first question: why does the first way not work in this case? It worked before when I used it.
My second question: the result of slicing is an array; how do I turn it into a complex value without the brackets? In my experience, this used to work:
vec[1,2][0]
but again it is not working here.
I tried doing everything with numpy arrays at the beginning; the methods that do not work on a numpy matrix do work on a numpy array. Why are there such differences?
The key difference is that a matrix is always 2d, always. (This is supposed to be familiar to MATLAB users.)
In [85]: mat = np.matrix('1,2;3,4')
In [86]: mat
Out[86]:
matrix([[1, 2],
[3, 4]])
In [87]: mat.shape
Out[87]: (2, 2)
In [88]: mat[1]
Out[88]: matrix([[3, 4]])
In [89]: _.shape
Out[89]: (1, 2)
Selecting a row of mat returns a matrix - a 1 row one. It should be clear that it cannot be indexed again with [1].
Indexing with the tuple returns a scalar:
In [90]: mat[1,1]
Out[90]: 4
In [91]: type(_)
Out[91]: numpy.int32
As a general rule, operations on a np.matrix return a matrix or a scalar, not a np.ndarray.
The other key point is that mat[1][1] is not one numpy operation. It is two, a mat[1] followed by another [1]. Imagine yourself to be a Python interpreter without any special knowledge of numpy. How would you evaluate that expression?
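The two-step evaluation can be checked side by side on an ndarray and a matrix. This is a small sketch with made-up variable names; note that recent NumPy versions may emit a PendingDeprecationWarning when np.matrix is used.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
mat = np.matrix(arr)

# ndarray: arr[1] is 1-D, so a second [1] picks a scalar.
arr_value = arr[1][1]

# matrix: mat[1] is still 2-D with shape (1, 2), so a second [1] asks
# for a second row that does not exist.
row_shape = mat[1].shape
try:
    mat[1][1]
    chained_ok = True
except IndexError:
    chained_ok = False

# A single tuple index works for both types.
mat_value = mat[1, 1]
```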
Now for the complex question:
In [92]: mat = np.matrix('1+3j, 2;-2, 2+1j')
In [93]: mat
Out[93]:
matrix([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [94]: mat[1,1]
Out[94]: (2+1j)
In [95]: type(_)
Out[95]: numpy.complex128
As expected, the tuple index has returned a scalar numpy element. The () is just part of numpy's way of displaying a complex number.
We can use item to extract the Python equivalent, but the display still uses ():
In [96]: __.item()
Out[96]: (2+1j)
In [97]: type(_)
Out[97]: complex
In [98]: 1+3j
Out[98]: (1+3j)
mat has an A property that gives the array equivalent. But notice the shapes.
In [99]: mat.A # a 2d array
Out[99]:
array([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [100]: mat.A1 # a 1d array
Out[100]: array([ 1.+3.j, 2.+0.j, -2.+0.j, 2.+1.j])
In [101]: mat[1].A
Out[101]: array([[-2.+0.j, 2.+1.j]])
In [102]: mat[1].A1
Out[102]: array([-2.+0.j, 2.+1.j])
Sometimes this behavior of matrix is handy. For example, np.sum on a matrix acts like np.sum on an array with keepdims=True:
In [108]: np.sum(mat,1)
Out[108]:
matrix([[ 3.+3.j],
[ 0.+1.j]])
In [110]: np.sum(mat.A,1, keepdims=True)
Out[110]:
array([[ 3.+3.j],
[ 0.+1.j]])
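If the matrix behavior is not wanted, one option is to convert once to a plain ndarray and index that instead. A minimal sketch, using np.asarray (which for a matrix gives the same values as the .A property); the variable names are illustrative.

```python
import numpy as np

mat = np.matrix('1+3j, 2;-2, 2+1j')
arr = np.asarray(mat)          # plain 2-D ndarray, same values as mat.A

row = arr[1]                   # 1-D, shape (2,)
value = arr[1][1]              # chained indexing now works
py_value = value.item()        # built-in Python complex

row_sums = arr.sum(axis=1)     # 1-D unless keepdims=True is passed
```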
I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
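As a quick check of that claim, the sketch below compares the matrix product with np.correlate called on every pair of rows. The array sizes and the names arr_one, arr_two and looped are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
arr_one = rng.random((3, 5))   # N x T
arr_two = rng.random((4, 5))   # M x T

out = np.dot(arr_one, arr_two.T)   # N x M

# For equal-length 1-D inputs, np.correlate's default mode ('valid')
# returns a single number: the sum of products.
looped = np.array([[np.correlate(a, b)[0] for b in arr_two]
                   for a in arr_one])
```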
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
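One way to sanity-check corr2_coeff is against np.corrcoef, which stacks the rows of both inputs and returns the full (N + M) x (N + M) correlation matrix. This is only a cross-check sketch on small random data; computing the full matrix wastes work on the N x N and M x M blocks, so it is not a replacement for the function.

```python
import numpy as np

def corr2_coeff(A, B):
    # Same computation as the function above, repeated so the snippet
    # runs on its own.
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(0)
A = rng.random((3, 10))   # N x T
B = rng.random((4, 10))   # M x T

# np.corrcoef(A, B) treats all N + M rows as variables; the top-right
# N x M block holds the A-versus-B coefficients.
full = np.corrcoef(A, B)
reference = full[:A.shape[0], A.shape[0]:]

result = corr2_coeff(A, B)
```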
Benchmarking
This section compares the runtime of the proposed approach against generate_correlation_map and the loopy pearsonr-based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.
    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.
    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                   mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)
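The ddof=n - 1 argument in generate_correlation_map can look odd at first. As far as I can tell it works because std divides the sum of squared deviations by (n - ddof), so with ddof = n - 1 the divisor is 1 and what remains is the square root of the sum of squares. A small sketch of that equivalence, with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 6))
n = x.shape[1]

# std with ddof divides the sum of squared deviations by (n - ddof).
# With ddof = n - 1 the divisor is 1, leaving sqrt(sum of squares).
via_std = x.std(1, ddof=n - 1)
explicit = np.sqrt(((x - x.mean(1, keepdims=True)) ** 2).sum(1))
```

This matches the unnormalized numerator, since np.dot(x, y.T) minus n times the product of the row means is the sum of cross-deviations rather than an averaged covariance.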
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
    """computes pearson correlation coefficient
    where x is a 1D and y a 2D array"""
    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
    rho = upper / lower
    return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])
I'm finding this happening to me a lot: I want to compute a matrix multiplication of the sort (X^TX)^{-1}XX^T, or something along these lines. I end up doing something like
X = np.array([[1,2],[3,4]])
a = np.dot(np.transpose(X), X)
b = np.dot(np.linalg.inv(a), X)
answer = np.dot(b, np.transpose(X))
Is there a better way to do this without resorting to the np.matrix type? Is there a way to do transpose without typing np.transpose?
Let's explore the options a bit
inv=np.linalg.inv
def array1(X):
    a = np.dot(X.T, X)
    b = np.dot(inv(a), X)
    return np.dot(b, X.T)
Basically your code, but using .T in place of np.transpose.
Testing with your X:
In [12]: array1(X)
Out[12]:
array([[-13.5, -32.5],
[ 10. , 24. ]])
What's the matrix equivalent?
In [17]: M=np.matrix(X)
In [18]: (M.T*M).I*M*M.T
Out[18]:
matrix([[-13.5, -32.5],
[ 10. , 24. ]])
The matrix version is more compact, but is it clearer? It's not faster.
In [22]: timeit array1(X)
10000 loops, best of 3: 48.7 µs per loop
In [23]: timeit (M.T*M).I*M*M.T
10000 loops, best of 3: 95.4 µs per loop
First stab at an einsum equivalent:
In [32]: np.einsum('ij,jk,lk',inv(np.einsum('ji,jk',X,X)),X,X)
Out[32]:
array([[-13.5, -32.5],
[ 10. , 24. ]])
In [33]: timeit np.einsum('ij,jk,lk',inv(np.einsum('ji,jk',X,X)),X,X)
10000 loops, best of 3: 55.1 µs per loop
Basically the same as the dot version.
The matrix version shows me that I can simplify the array version to:
inv(X.T.dot(X)).dot(X.dot(X.T))
(same timing)
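Two further options that stay with plain arrays, offered as a sketch rather than a benchmark: the @ operator (available since Python 3.5) reads much like the matrix version, and np.linalg.solve avoids forming the inverse explicitly. The names with_inv and with_solve are just for illustration.

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])

# The @ operator is matrix multiplication on plain arrays.
with_inv = np.linalg.inv(X.T @ X) @ X @ X.T

# Solving (X^T X) Z = X X^T gives the same Z without forming the inverse.
with_solve = np.linalg.solve(X.T @ X, X @ X.T)
```

Both reproduce the result shown for array1(X) above.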
I have a matrix A (nXm). My ultimate goal is to get Z of dimension (nXmXm). Currently I am doing it with the loop below, but can it be done without a for loop, using something like matrix.tensordot or matrix.multiply.outer?
for i in range(0,A.shape[0]):
    Z[i,:,:] = np.outer(A[i,:],A[i,:])
You could use numpy's Einstein summation, like this:
np.einsum('ij, ik -> ijk', A, A)
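The timings below call a using_einsum function whose definition is not shown. The sketch that follows is a plausible definition (an assumption on my part), checked against the explicit loop from the question on a tiny array.

```python
import numpy as np

def using_einsum(A):
    # 'ij,ik->ijk': no index is summed, so each row i yields the
    # outer product of A[i] with itself.
    return np.einsum('ij,ik->ijk', A, A)

A = np.arange(6.0).reshape(2, 3)
Z = using_einsum(A)

# Reference: one outer product per row, as in the question's loop.
expected = np.stack([np.outer(row, row) for row in A])
```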
Just for completeness, the timing comparison with the also excellent answer (+1) from unutbu:
In [39]: A = np.random.random((1000,50))
In [40]: %timeit using_einsum(A)
100 loops, best of 3: 11.6 ms per loop
In [41]: %timeit using_broadcasting(A)
100 loops, best of 3: 10.2 ms per loop
In [42]: %timeit orig(A)
10 loops, best of 3: 27.8 ms per loop
Which teaches me that
unutbu's machine is faster than mine
broadcasting would be slightly faster than np.einsum
for i in range(0,A.shape[0]):
    Z[i,:,:] = np.outer(A[i,:],A[i,:])
means
Z_ijk = A_ij * A_ik
which can be computed using NumPy broadcasting:
Z = A[:, :, np.newaxis] * A[:, np.newaxis, :]
A[:, :, np.newaxis] has shape (n, m, 1) and A[:, np.newaxis, :] has shape
(n, 1, m). Multiplying the two causes both arrays to be broadcasted up to
shape (n, m, m).
NumPy multiplication is always performed elementwise. The values along the
broadcasted axis are the same everywhere, so elementwise multiplication results
in Z_ijk = A_ij * A_ik.
import numpy as np
def orig(A):
    Z = np.empty(A.shape+(A.shape[-1],), dtype=A.dtype)
    for i in range(0,A.shape[0]):
        Z[i,:,:] = np.outer(A[i,:],A[i,:])
    return Z
def using_broadcasting(A):
    return A[:, :, np.newaxis] * A[:, np.newaxis, :]
Here is a sanity check showing this produces the correct result:
A = np.random.random((1000,50))
assert np.allclose(using_broadcasting(A), orig(A))
By choosing A.shape[0] to be large we get an example which shows off the
advantage of broadcasting over looping in Python:
In [107]: %timeit using_broadcasting(A)
10 loops, best of 3: 6.12 ms per loop
In [108]: %timeit orig(A)
100 loops, best of 3: 16.9 ms per loop
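To make the broadcasting step concrete, the shapes can be traced on a small array. This is just an illustrative sketch with n = 2 and m = 3; the names left, right and first are made up.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # n = 2, m = 3

left = A[:, :, np.newaxis]       # shape (2, 3, 1)
right = A[:, np.newaxis, :]      # shape (2, 1, 3)
Z = left * right                 # broadcast to shape (2, 3, 3)

# Each Z[i] is the outer product of row i with itself.
first = np.outer(A[0], A[0])
```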