What does "element wise" mean in Pandas? - pandas

I'm struggling to clearly understand the concept of an "element" in Pandas. I've already gone through the Pandas documentation and googled around, and I'm guessing it's some sort of row? What do people mean when they say "apply a function element-wise"?
This question came up when I was reading this SO post: How to apply a function to two columns of Pandas dataframe

Pandas is designed for vectorized operations, i.e. taking an entire column and applying a function to it. You can think of this as a column-wise operation.
But in some cases you may need to operate element by element (an element-wise operation). This type of operation is not very efficient.
Here is an example:
import pandas as pd
df = pd.DataFrame([a for a in range(100)], columns=['mynum'])
Column-wise operation:
%%timeit
df['add1'] = df.mynum + 1
222 µs ± 3.31 µs per loop
When operated element-wise:
%%timeit
df['add1'] = df.apply(lambda a: a.mynum+1, axis = 1)
2.33 ms ± 85.4 µs per loop
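If the function only needs one column, you can also apply it element-wise on the Series itself rather than row-wise over the whole DataFrame. A rough sketch (typically faster than the row-wise apply above, though still slower than the plain vectorized column operation):
df['add1'] = df.mynum.apply(lambda a: a + 1)  # element-wise over a single Series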

I believe "element" in Pandas is an inherited concept of the "element" from NumPy. Give the first few paragraphs of the docs on ufuncs a read.
Each universal function takes array inputs and produces array outputs by performing the core function element-wise on the inputs (where an element is generally a scalar, but can be a vector or higher-order sub-array for generalized ufuncs).
In mathematics, element-wise operations refer to operations on individual elements of a matrix.
Examples:
import numpy as np
>>> x, y = np.arange(1,5).reshape(2,2), 3*np.eye(2)
>>> x, y
(array([[1, 2],
        [3, 4]]),
 array([[3., 0.],
        [0., 3.]]))
>>> x + y # element-wise addition
array([[4., 2.],
[3., 7.]])
>>> np.dot(x,y) # NOT element-wise multiplication (matrix multiplication)
# elements become dot products of the rows of x with columns of y
array([[ 3., 6.],
[ 9., 12.]])
>>> x * y # element-wise multiplication
array([[ 3., 0.],
[ 0., 12.]])
I realize your question was about Pandas, but element-wise in Pandas means the same thing it does in NumPy and in linear algebra (as far as I'm aware).
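To connect this back to Pandas: arithmetic between two columns is element-wise in exactly this sense, and NumPy ufuncs work on Series as well. A minimal sketch with a made-up DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
>>> df['a'] + df['b']          # element-wise addition of two columns
0    11
1    22
2    33
dtype: int64
>>> np.add(df['a'], df['b'])   # same result via the NumPy ufunc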

Element-wise means handling data element by element.

Related

Indexing xarray data with variable length DataArray

I am trying to extract data from an xarray dataset using DataArray indexing. My goal is to obtain the data along different line segments overlapping the array. For that I have obtained the indices of each of the lines (these are of different sizes depending on the length).
For example, for line 1: x = [1,2,3], y = [7,8,9]; similarly for line 2: x = [1,4,5,6,8], y = [0,2,7,9,6]; and so on. Some of the lines are 100 x 2. I have tried the following:
df=xarray_dataset
indx=xr.DataArray([[1,2,3],[1,4,5,6,8],[2,3]])
indy=xr.DataArray([[7,9,8],[0,2,7,9,6],[4,5]])
dx_sel=df.isel(x=indx,y=indy)
However, as I understand it, the lengths of the index DataArrays need to be equal. Is there a way to handle this? Basically these indices represent the x and y coordinates of different segments within the data, and I want the mean of each segment. I have hundreds of such segments; if there were only a few I could loop over each segment's indices, but a loop per segment is not computationally efficient.
This is a similar issue with a numpy array as well. Is there a way to pass NaN or something similar in the index, so that the indexers have equal shapes but no data is extracted for those positions?
You can use the set_index -> unstack mechanism, which is based on pd.MultiIndex.
In [4]: df = xr.DataArray(np.arange(110).reshape(10, 11),
...: dims=['x', 'y'])
In [5]: indx=xr.DataArray([1,2,3, 1,4,5,6,8, 2,3],
...: dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
...:
...: indy=xr.DataArray([7,9,8, 0,2,7,9,6, 4,5], dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
In [8]: df.isel(x=indx, y=indy).set_index(index=['i', 'j']).unstack('index')
Out[8]:
<xarray.DataArray (i: 3, j: 5)>
array([[18., 31., 41., nan, nan],
       [11., 46., 62., 75., 94.],
       [26., 38., nan, nan, nan]])
Coordinates:
  * i        (i) int64 0 1 2
  * j        (j) int64 0 1 2 3 4
Here, indx and indy have non-dimensional coordinates, i and j, which are essentially the original position of each index in the 2-dimensional space.
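If the final goal is the mean of each segment, the NaN padding introduced by unstack is ignored by xarray's skipna default, so a per-segment reduction is just one more call. A sketch building on the result above (the values follow from the example data):
df.isel(x=indx, y=indy).set_index(index=['i', 'j']).unstack('index').mean('j')
# <xarray.DataArray (i: 3)>
# array([30. , 57.6, 32. ])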

Why the difference between numpy matrix and numpy array when selecting an element?

I have a calculated matrix
from numpy import matrix
vec=matrix([[ 4.79263398e-01+0.j , -2.94883960e-14+0.34362808j,
5.91036823e-01+0.j , -2.06730654e-14+0.41959935j,
-3.20298698e-01+0.08635809j, -5.97136351e-02+0.22325523j],
[ 9.45394208e-14+0.34385164j, 4.78941900e-01+0.j ,
1.07732017e-13+0.41891016j, 5.91969770e-01+0.j ,
-6.06877417e-02-0.2250884j , 3.17803028e-01+0.08500215j],
[ 4.63795513e-01-0.00827114j, -1.15263719e-02+0.33287485j,
-2.78282097e-01-0.20137267j, -2.81970922e-01-0.1980647j ,
9.26109539e-02-0.38428445j, 5.12483437e-01+0.j ],
[ -1.15282610e-02+0.33275927j, 4.63961516e-01-0.00826978j,
-2.84077490e-01-0.19723838j, -2.79429184e-01-0.19984041j,
-4.42104809e-01+0.25708681j, -2.71973825e-01+0.28735795j],
[ 4.63795513e-01+0.00827114j, 1.15263719e-02+0.33287485j,
-2.78282097e-01+0.20137267j, 2.81970922e-01-0.1980647j ,
2.73235786e-01+0.28564581j, -4.44053596e-01-0.25584307j],
[ 1.15282610e-02+0.33275927j, 4.63961516e-01+0.00826978j,
2.84077490e-01-0.19723838j, -2.79429184e-01+0.19984041j,
5.11419878e-01+0.j , -9.22028113e-02-0.38476356j]])
I want to get the 2nd row, 3rd column element:
vec[1][2]
IndexError: index 1 is out of bounds for axis 0 with size 1
and slicing works well
vec[1,2]
(1.07732017e-13+0.41891015999999998j)
My first question: why does the first way not work in this case? It worked before when I used it.
Second question: the result of slicing is an array; how do I make it a complex value without the brackets? My experience was using
vec[1,2][0]
but again it is not working here.
I tried to do everything on a numpy array at the beginning; the methods that do not work on a numpy matrix do work on a numpy array. Why are there such differences?
The key difference is that a matrix is always 2d, always. (This is supposed to be familiar to MATLAB users.)
In [85]: mat = np.matrix('1,2;3,4')
In [86]: mat
Out[86]:
matrix([[1, 2],
[3, 4]])
In [87]: mat.shape
Out[87]: (2, 2)
In [88]: mat[1]
Out[88]: matrix([[3, 4]])
In [89]: _.shape
Out[89]: (1, 2)
Selecting a row of mat returns a matrix, a 1-row one. It should be clear why it cannot be indexed again with [1]: its first axis only has size 1.
Indexing with the tuple returns a scalar:
In [90]: mat[1,1]
Out[90]: 4
In [91]: type(_)
Out[91]: numpy.int32
As a general rule, operations on a np.matrix return a matrix or a scalar, not a np.ndarray.
The other key point is that mat[1][1] is not one numpy operation. It is two, a mat[1] followed by another [1]. Imagine yourself to be a Python interpreter without any special knowledge of numpy. How would you evaluate that expression?
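By contrast, chained indexing does work on a plain ndarray, because selecting a row of a 2-d array gives a 1-d array. A quick sketch using the same data:
arr = np.asarray(mat)   # the same values as a base-class 2-d ndarray
arr[1]                  # array([3, 4]) -- 1-d this time
arr[1][1]               # 4, so the second [1] indexes into the row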
Now for the complex question:
In [92]: mat = np.matrix('1+3j, 2;-2, 2+1j')
In [93]: mat
Out[93]:
matrix([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [94]: mat[1,1]
Out[94]: (2+1j)
In [95]: type(_)
Out[95]: numpy.complex128
As expected, the tuple index has returned a scalar numpy element. The () is just part of numpy's way of displaying a complex number.
We can use item to extract the equivalent Python value, but the display still uses ():
In [96]: __.item()
Out[96]: (2+1j)
In [97]: type(_)
Out[97]: complex
In [98]: 1+3j
Out[98]: (1+3j)
mat has an A property that gives the array equivalent. But notice the shapes.
In [99]: mat.A # a 2d array
Out[99]:
array([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [100]: mat.A1 # a 1d array
Out[100]: array([ 1.+3.j, 2.+0.j, -2.+0.j, 2.+1.j])
In [101]: mat[1].A
Out[101]: array([[-2.+0.j, 2.+1.j]])
In [102]: mat[1].A1
Out[102]: array([-2.+0.j, 2.+1.j])
Sometimes this behavior of matrix is handy. For example, np.sum acts like the array version with keepdims=True:
In [108]: np.sum(mat,1)
Out[108]:
matrix([[ 3.+3.j],
[ 0.+1.j]])
In [110]: np.sum(mat.A,1, keepdims=True)
Out[110]:
array([[ 3.+3.j],
[ 0.+1.j]])

Numpy mean and std over every term of the arrays

I have a list of 2-dimensional arrays (all the same shape), and would like to get the mean and standard deviation over corresponding terms, in a result array of the same shape as the inputs. I have trouble telling from the docs whether this is possible. All my attempts with the axis and keepdims parameters produce results of different shapes.
For example, I would like mean([x, x]) to equal x, and std([x, x]) to be zeros shaped like x.
Is this possible without reshaping the arrays? If not, how do I do it with reshaping?
Example:
>>> x = np.array([[1,2],[3,4]])
>>> y = np.array([[2,3],[4,5]])
>>> np.mean([x,y])
3.0
I want [[1.5,2.5],[3.5,4.5]] instead.
As Divikar points out, you can pass the list of arrays to np.mean and specify axis=0 to average over corresponding values from each array in the list:
In [13]: np.mean([x,y], axis=0)
Out[13]:
array([[ 1.5, 2.5],
[ 3.5, 4.5]])
This works for lists of arbitrary length. For just two arrays, (x+y)/2.0 is faster:
In [20]: %timeit (x+y)/2.0
100000 loops, best of 3: 1.96 µs per loop
In [21]: %timeit np.mean([x,y], axis=0)
10000 loops, best of 3: 21.6 µs per loop
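The same axis=0 pattern gives the element-wise standard deviation; in this example every pair of corresponding values differs by 1, so (a quick sketch):
np.std([x, y], axis=0)
# array([[0.5, 0.5],
#        [0.5, 0.5]])
and std([x, x]) would indeed be zeros shaped like x.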

Confused about (x,y) order with RectBivariateSpline

I am getting confused about the argument order with RectBivariateSpline. I am reading a set of 2D data which has 343 values along the X axis and 373 values along the Y axis. The routine that reads the data returns it in the "correct" sense, such that when I plot it in matplotlib I get a map with the correct physical orientation. It also returns the X values in an array of 343 elements and the Y values in an array of 373 elements, which makes sense.
The scipy documentation for RectBivariateSpline gives the arguments as:
scipy.interpolate.RectBivariateSpline(x, y, z)
However, when I execute
spln = scipy.interpolate.RectBivariateSpline(xval, yval, zval)
I get this error:
TypeError: x dimension of z must have same number of elements as x
I can remove the error by executing
spln = scipy.interpolate.RectBivariateSpline(yval, xval, zval)
but now the x and y values are the wrong way round (in a physical sense at least). Does this mean that the x argument to RectBivariateSpline refers to the first data dimension of the dataset rather than the physical x dimension? I am used to working with data in Fortran-style ordering, which probably is not helping.
In answer to hpaulj's comment, the shapes of the various arrays are:
xval (343,)
yval (373,)
zval (373, 343)
I think the issue is that I am getting confused between 'xy' and 'ij' ordering. Matplotlib seems to be using 'xy' ordering, so I guess I just need to be careful to transpose the zval array when interpolating with scipy.
Show us the values of xval.shape, yval.shape and zval.shape
Early in the RectBivariateSpline code it does:
x, y = ravel(x), ravel(y)
....
if not x.size == z.shape[0]:
    raise TypeError('x dimension of z must have same number of '
                    'elements as x')
if not y.size == z.shape[1]:
    raise TypeError('y dimension of z must have same number of '
                    'elements as y')
So the number of rows of z (1st dimension) must match the number of elements in x.
When you display a 2d array, rows are the 1st dimension (going down the page) and columns the 2nd. But in a plot, we often expect the first axis, the x one, to go across the page.
np.meshgrid lets you specify:
indexing : {'xy', 'ij'}, optional
Cartesian ('xy', default) or matrix ('ij') indexing of output.
See Notes for more details.
The difference between 'xy' indexing and 'ij' indexing might be confusing you. This Spline class is using the 'ij' kind.
An alternative to switching x and y is to use the transpose of z, z.T (see the sketch after the example below). Keep in mind that the interpolation points follow the same ordering rules.
Simple example
In [30]: x=np.arange(10)
In [31]: y=np.arange(15)
In [32]: z=x[:,None]*y[None,:]
In [33]: S=interpolate.RectBivariateSpline(x,y,z)
In [34]: S([1,2,3],[4,5,6])
Out[34]:
array([[ 4., 5., 6.],
[ 8., 10., 12.],
[ 12., 15., 18.]])
Contrast the xy v ij in meshgrid:
In [37]: np.meshgrid(x,y)[0].shape
Out[37]: (15, 10)
In [38]: np.meshgrid(x,y,indexing='ij')[0].shape
Out[38]: (10, 15)
z could have been constructed from the ij grids.
X,Y=np.meshgrid(x,y,indexing='ij')
Z = X*Y
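And the alternative mentioned above, keeping the argument order but transposing the data: in the question's terms, with zval of shape (len(yval), len(xval)), something like this sketch should be equivalent to swapping xval and yval:
spln = interpolate.RectBivariateSpline(xval, yval, zval.T)
# now spln(x_points, y_points) follows the physical x/y ordering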

Nicer way to do nested dot products in numpy?

I'm finding this happening to me a lot: I want to compute a matrix product of the sort (X^T X)^{-1} X X^T, or something along these lines. I end up doing something like
X = np.array([[1,2],[3,4]])
a = np.dot(np.transpose(X), X)
b = np.dot(np.linalg.inv(a), X)
answer = np.dot(b, np.transpose(X))
Is there a better way to do this without resorting to the np.matrix type? Is there a way to do transpose without typing np.transpose?
Let's explore the options a bit
inv = np.linalg.inv

def array1(X):
    a = np.dot(X.T, X)
    b = np.dot(inv(a), X)
    return np.dot(b, X.T)
Basically your code, but using the dot method and the .T notation.
Testing with your X:
In [12]: array1(X)
Out[12]:
array([[-13.5, -32.5],
[ 10. , 24. ]])
What's the matrix equivalent?
In [17]: M=np.matrix(X)
In [18]: (M.T*M).I*M*M.T
Out[18]:
matrix([[-13.5, -32.5],
[ 10. , 24. ]])
The matrix version is more compact, but is it clearer? It's not faster.
In [22]: timeit array1(X)
10000 loops, best of 3: 48.7 µs per loop
In [23]: timeit (M.T*M).I*M*M.T
10000 loops, best of 3: 95.4 µs per loop
First stab at an einsum equivalent:
In [32]: np.einsum('ij,jk,lk',inv(np.einsum('ji,jk',X,X)),X,X)
Out[32]:
array([[-13.5, -32.5],
[ 10. , 24. ]])
In [33]: timeit np.einsum('ij,jk,lk',inv(np.einsum('ji,jk',X,X)),X,X)
10000 loops, best of 3: 55.1 µs per loop
Basically the same timing as the dot version.
The matrix version shows me that I can simplify the array version to:
inv(X.T.dot(X)).dot(X.dot(X.T))
(same timing)
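On Python 3.5+ with a recent NumPy, the @ matrix-multiplication operator makes the plain-array version about as compact as the matrix one, with no np.matrix involved:
inv(X.T @ X) @ X @ X.T
(inv here is still np.linalg.inv, as above.)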