Indexing xarray data with variable length DataArray - numpy

I am trying to extract data from xarray dataset using DataArray indexing. My goal is to obtain the data along different line segments overlapping the array. For that I have obtained indices of each of the lines (these are of different sizes based on the length).
For example for line 1 : x = [1,2,3], y=[7,8,9] and similarly for line 2 is x=[1,4,5,6,8], y=[0,2,7,9,6] and so on I have some of the lines which are 100x 2. For this I have tried like below :
df=xarray_dataset
indx=xr.DataArray([[1,2,3],[1,4,5,6,8],[2,3]])
indy=xr.DataArray([[7,9,8],[0,2,7,9,6],[4,5]])
dx_sel=df.isel(x=indx,y=indy)
However what I understand that the length of each of the data array index needs to be equal. Is there a way I can handle such issues. Basically these indices represent the x and y coordinates of different segments within the data frame and get the mean of each of the segment, I have 100s of such segments if there are only few I would be able to use a loop for each of the segment indexes however it's not computationally efficient to use a loop for each segment.
This is a similar issue with numpy array as well. Is there a way to pass NaN or something similar in the index so that we could make the equal shape but no data is extracted for that index.

You can use set_index -> unstack mechanism, which is based on pd.MultiIndex.
In [4]: df = xr.DataArray(np.arange(110).reshape(10, 11),
...: dims=['x', 'y'])
In [5]: indx=xr.DataArray([1,2,3, 1,4,5,6,8, 2,3],
...: dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
...:
...: indy=xr.DataArray([7,9,8, 0,2,7,9,6, 4,5], dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
In [8]: df.isel(x=indx, y=indy).set_index(index=['i', 'j']).unstack('index')
Out[8]:
<xarray.DataArray (i: 3, j: 5)>
array([[18., 31., 41., nan, nan],
[11., 46., 62., 75., 94.],
[26., 38., nan, nan, nan]])
Coordinates:
* i (i) int64 0 1 2
* j (j) int64 0 1 2 3 4
Here, indx and indy has non-dimensional coordinates, i and j, which are essentially the original position of the index in the 2-dimensional space.

Related

Get the last non-null value per row of a 2D Numpy array

I have a 2D numpy array that looks like
a = np.array(
[
[1,2,np.nan,np.nan],
[1,33,45,np.nan],
[11,22,3,78],
]
)
I need to extract the last non null value per row i.e.
[2, 45, 78]
Please guide on how to get it.
thanks
Break this into two sub-problems.
Remove nans from each row
Select last element from the array
[r[~np.isnan(r)][-1] for r in a]
produces
[2.0, 45.0, 78.0]
For a vectorial solution, you can try:
# get the index of last nan
idx = np.argmax(~np.isnan(a)[:, ::-1], axis=1)
# slice
a[np.arange(a.shape[0]), a.shape[1]-idx-1]
output: array([ 2., 45., 78.])

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Please any ideas? Many thanks in advance!
Here my new answer to your problem. I used "reduce" to flatten your nested array and then I used "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
#libraries
import operator
from functools import reduce
from itertools import chain
#flatten lists of lists using reduce. Then turn everything into a 1d list using
#itertools chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
geometry_list)))
#reshape the coordinates 1d list to a 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
#your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
#convert list to array
coordinate_array = numpy.array(coordinate_list)
#print shape of array
coordinate_array.shape
#reshape array into pairs of
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
row['x'] = row.coord[0]
row['y'] = row.coord[1]
return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
coord
0 [0, 1]
1 [2, 3]
2 [4, 5]
3 [6, 7]
4 [8, 9]
x y
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9

What does "element wise" mean in Pandas?

I'm struggling to clearly understand the concept of "element" in Pandas. Already went through the document of Pandas and googled around, and I'm guessing it's some sort of row? What do people mean when they say "apply function elment-wise"?
This question came up when I was reading this SO post : How to apply a function to two columns of Pandas dataframe
Pandas is designed for operating vector wise operations i.e. taking entire column and operate some function. This you can term as column wise operation.
But in some cases you may need to operate element by element (i.e. element wise operation). This type operation is not very efficient.
Here is an example:
import pandas as pd
df = pd.DataFrame([a for a in range(100)], columns=['mynum'])
column wise operation
%%timeit
df['add1'] = df.mynum +1
222 µs ± 3.31 µs per loop
When operated element wise
%%timeit
df['add1'] = df.apply(lambda a: a.mynum+1, axis = 1)
2.33 ms ± 85.4 µs per loop
I believe "element" in Pandas is an inherited concept of the "element" from NumPy. Give the first few paragraphs of the docs on ufuncs a read.
Each universal function takes array inputs and produces array outputs by performing the core function element-wise on the inputs (where an element is generally a scalar, but can be a vector or higher-order sub-array for generalized ufuncs).
In mathematics, element-wise operations refer to operations on individual elements of a matrix.
Examples:
import numpy as np
>>> x, y = np.arange(1,5).reshape(2,2), 3*np.eye(2)
>>> x, y
>>> x, y = np.arange(1,5).reshape(2,2), 3*np.eye(2)
>>> x, y
(array([[1, 2],
[3, 4]]),
array([[3., 0.],
[0., 3.]]))
>>> x + y # element-wise addition
array([[4., 2.],
[3., 7.]])
columns of y
>>> np.dot(x,y) # NOT element-wise multiplication (matrix multiplication)
# elements become dot products of the rows of x with columns of y
array([[ 3., 6.],
[ 9., 12.]])
>>> x * y # element-wise multiplication
array([[ 3., 0.],
[ 0., 12.]])
I realize your question was about Pandas, but element-wise in Pandas means the same thing it does in NumPy and in linear algebra (as far as I'm aware).
Element-wise means handling data element by element.

why the difference between numpy matrix and numpy array when selecting element

I have a calculated matrix
from numpy import matrix
vec=matrix([[ 4.79263398e-01+0.j , -2.94883960e-14+0.34362808j,
5.91036823e-01+0.j , -2.06730654e-14+0.41959935j,
-3.20298698e-01+0.08635809j, -5.97136351e-02+0.22325523j],
[ 9.45394208e-14+0.34385164j, 4.78941900e-01+0.j ,
1.07732017e-13+0.41891016j, 5.91969770e-01+0.j ,
-6.06877417e-02-0.2250884j , 3.17803028e-01+0.08500215j],
[ 4.63795513e-01-0.00827114j, -1.15263719e-02+0.33287485j,
-2.78282097e-01-0.20137267j, -2.81970922e-01-0.1980647j ,
9.26109539e-02-0.38428445j, 5.12483437e-01+0.j ],
[ -1.15282610e-02+0.33275927j, 4.63961516e-01-0.00826978j,
-2.84077490e-01-0.19723838j, -2.79429184e-01-0.19984041j,
-4.42104809e-01+0.25708681j, -2.71973825e-01+0.28735795j],
[ 4.63795513e-01+0.00827114j, 1.15263719e-02+0.33287485j,
-2.78282097e-01+0.20137267j, 2.81970922e-01-0.1980647j ,
2.73235786e-01+0.28564581j, -4.44053596e-01-0.25584307j],
[ 1.15282610e-02+0.33275927j, 4.63961516e-01+0.00826978j,
2.84077490e-01-0.19723838j, -2.79429184e-01+0.19984041j,
5.11419878e-01+0.j , -9.22028113e-02-0.38476356j]])
I want to get 2nd row, 3rd column element
vec[1][2]
IndexError: index 1 is out of bounds for axis 0 with size 1
and slicing works well
vec[1,2]
(1.07732017e-13+0.41891015999999998j)
My first question why first way does not work in this case? it worked before when I used it.
Second question is: the result of slicing is an array, how to make it an complex value without bracket? My experience was using
vec[1,2][0]
but again it is not working here.
I tried to do everything on numpy array at begining, those methods that do not work on numpy matrix work on numpy array. Why there are such differences?
The key difference is that a matrix is always 2d, always. (This is supposed to be familiar to MATLAB users.)
In [85]: mat = np.matrix('1,2;3,4')
In [86]: mat
Out[86]:
matrix([[1, 2],
[3, 4]])
In [87]: mat.shape
Out[87]: (2, 2)
In [88]: mat[1]
Out[88]: matrix([[3, 4]])
In [89]: _.shape
Out[89]: (1, 2)
Selecting a row of mat returns a matrix - a 1 row one. It should be clear that it cannot be indexed again with [1].
Indexing with the tuple returns a scalar:
In [90]: mat[1,1]
Out[90]: 4
In [91]: type(_)
Out[91]: numpy.int32
As a general rule operations on a np.matrix returns a matrix or a scalar, not a np.ndarray.
The other key point is that mat[1][1] is not one numpy operation. It is two, a mat[1] followed by another [1]. Imagine yourself to be a Python interpreter without any special knowledge of numpy. How would you evaluate that expression?
Now for the complex question:
In [92]: mat = np.matrix('1+3j, 2;-2, 2+1j')
In [93]: mat
Out[93]:
matrix([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [94]: mat[1,1]
Out[94]: (2+1j)
In [95]: type(_)
Out[95]: numpy.complex128
As expected the tuple index has returned a scalar numpy element. () is just part of numpys way of displaying a complex number.
We can use item to extra python equivalent, but the display still uses ()
In [96]: __.item()
Out[96]: (2+1j)
In [97]: type(_)
Out[97]: complex
In [98]: 1+3j
Out[98]: (1+3j)
mat has A property that gives the array equivalent. But notice the shapes.
In [99]: mat.A # a 2d array
Out[99]:
array([[ 1.+3.j, 2.+0.j],
[-2.+0.j, 2.+1.j]])
In [100]: mat.A1 # a 1d array
Out[100]: array([ 1.+3.j, 2.+0.j, -2.+0.j, 2.+1.j])
In [101]: mat[1].A
Out[101]: array([[-2.+0.j, 2.+1.j]])
In [102]: mat[1].A1
Out[102]: array([-2.+0.j, 2.+1.j])
Sometimes this behavior of matrix is handy. For example np.sum acts like the array keepdims=True:
In [108]: np.sum(mat,1)
Out[108]:
matrix([[ 3.+3.j],
[ 0.+1.j]])
In [110]: np.sum(mat.A,1, keepdims=True)
Out[110]:
array([[ 3.+3.j],
[ 0.+1.j]])

Slicing a numpy array and passing the slice to a function

I want to have a function that can operate on either a row or a column of a 2D ndarray. Assume the array has C order. The function changes values in the 2D data.
Inside the function I want to have identical index syntax whether it is called with a row or column. A row slice is [n,:] and column slice [:,n] so they have different shapes. Inside the function this requires different indexing expressions.
Is there a way to do this that does not require moving or allocating memory? I am under the impression that using reshape will force a copy to make the data to make it contiguous. Is there a way to use nditer in the function?
Do you mean like this:
In [74]: def foo(arr, n):
...: arr += n
...:
In [75]: arr = np.ones((2,3),int)
In [76]: foo(arr[0,:],1)
In [77]: arr
Out[77]:
array([[2, 2, 2],
[1, 1, 1]])
In [78]: foo(arr[:,1],[100,200])
In [79]: arr
Out[79]:
array([[ 2, 102, 2],
[ 1, 201, 1]])
In the first case I'm adding 1 to one row of the array, ie. a row slice. In the second case I'm add a array (list) to a column. In that case n has to have the right length.
Usually we don't worry about whether the values are C contiguous. Striding takes care of access either way.