Make pandas raise on divide by zero instead of inf - pandas

I would like to have pandas raise an exception when dividing by zero as in:
d = {'col1': [2., 0.], 'col2': [4., 0.]}
df = pd.DataFrame(data=d)
2/df
Instead of the current result:
0    1.000000
1         inf
Name: col1, dtype: float64
Any suggestions on how to achieve that?
I know that with numpy I can use np.seterr(divide='raise'), but pandas ignores that.
Many thanks

A closer look into the source code and the traceback shows that inside pandas there are a lot of context managers like this:
with np.errstate(all='ignore'):
or
with numeric.errstate(all='ignore'):
This is why np.seterr is ignored, and there is probably no easy way to get around it.
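One workaround (a sketch, not part of the original answer) is to run the arithmetic on the underlying numpy array yourself, so that your own errstate applies instead of the one pandas sets internally. Note that numpy then raises FloatingPointError rather than ZeroDivisionError:
import numpy as np
import pandas as pd

d = {'col1': [2., 0.], 'col2': [4., 0.]}
df = pd.DataFrame(data=d)

with np.errstate(divide='raise'):
    # raises FloatingPointError: divide by zero encountered in divide
    result = pd.DataFrame(np.divide(2.0, df.to_numpy()),
                          index=df.index, columns=df.columns)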

It's far from ideal, but one option is to store the elements of your DataFrame as Python objects rather than the more optimized numpy or pandas dtypes it typically uses:
In [37]: d = {'col1': [2., 0.], 'col2': [4., 0.]}
...: df = pd.DataFrame(data=d)
...: 2/df
Out[37]:
   col1  col2
0   1.0   0.5
1   inf   inf
In [38]: 2 / df.astype('O')
---------------------------------------------------------------------------
ZeroDivisionError: float division by zero
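Operating on object dtype falls back to Python-level loops and is much slower. If that matters, another possibility (a sketch, not part of the original answer) is to keep the fast numeric dtypes and check for zeros explicitly before dividing:
# explicit pre-check instead of changing dtypes (illustrative only)
if (df == 0).any().any():
    raise ZeroDivisionError("DataFrame contains zeros; 2/df would produce inf")
result = 2 / df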

Related

Get the last non-null value per row of a 2D Numpy array

I have a 2D numpy array that looks like
a = np.array(
[
[1,2,np.nan,np.nan],
[1,33,45,np.nan],
[11,22,3,78],
]
)
I need to extract the last non-null value per row, i.e.
[2, 45, 78]
Please guide me on how to get it.
Thanks
Break this into two sub-problems:
1. Remove the NaNs from each row
2. Select the last element from the remaining array
[r[~np.isnan(r)][-1] for r in a]
produces
[2.0, 45.0, 78.0]
For a vectorized solution, you can try:
# get the index of the last non-NaN value, counted from the right
idx = np.argmax(~np.isnan(a)[:, ::-1], axis=1)
# slice
a[np.arange(a.shape[0]), a.shape[1]-idx-1]
output: array([ 2., 45., 78.])
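A small edge-case note (my own check, not from the answers above): if a row is entirely NaN, the list comprehension raises IndexError, while the vectorized version returns NaN for that row:
b = np.array([[np.nan, np.nan]])
idx = np.argmax(~np.isnan(b)[:, ::-1], axis=1)     # array([0])
b[np.arange(b.shape[0]), b.shape[1] - idx - 1]     # array([nan])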

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a way to get value_counts() (or something similar) working without having to transform my DataFrames into strings or hashes or something like that?
I've tried counting the inner DataFrames, which breaks completely, and I've also tried with arrays, where the comparison does not seem to be made correctly either:
# importing pandas
import pandas as pd
import numpy as np
# Creating arrays
ar1 = np.array([11,22])
ar2 = np.array([11,22])
ar3 = np.array([33,44])
df = pd.DataFrame([
['0', ar1],
['1', ar2],
['2', ar3]
], columns =['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas
import pandas as pd
import numpy as np
# Creating DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
['0', df1],
['1', df2],
['2', df3]
], columns =['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve the count of complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.array nor pd.DataFrame is hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point: neither of your examples can be translated to their DataFrame.value_counts equivalent, because underneath it is doing df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any .value_counts method working on non-hashable columns.
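If you still need the counts, one workaround (a sketch that goes beyond the original answer, and it does involve converting to a hashable type) is to map each array to a tuple first:
print(df["ars"].map(tuple).value_counts())
# (11, 22)    2
# (33, 44)    1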

Indexing xarray data with variable length DataArray

I am trying to extract data from an xarray dataset using DataArray indexing. My goal is to obtain the data along different line segments overlapping the array. For that I have obtained the indices of each of the lines (these are of different sizes depending on the line's length).
For example, for line 1: x = [1,2,3], y = [7,8,9]; for line 2: x = [1,4,5,6,8], y = [0,2,7,9,6]; and so on. Some of the lines are 100 x 2. I have tried the following:
df=xarray_dataset
indx=xr.DataArray([[1,2,3],[1,4,5,6,8],[2,3]])
indy=xr.DataArray([[7,9,8],[0,2,7,9,6],[4,5]])
dx_sel=df.isel(x=indx,y=indy)
However, from what I understand, the lengths of the DataArray indices need to be equal. Is there a way to handle this? These indices represent the x and y coordinates of different segments within the array, and I want the mean of each segment. I have hundreds of such segments; if there were only a few I could loop over each segment's indices, but that is not computationally efficient.
This is a similar issue with a numpy array as well. Is there a way to pass NaN or something similar in the index, so that the shapes become equal but no data is extracted for that index?
You can use the set_index -> unstack mechanism, which is based on pd.MultiIndex.
In [4]: df = xr.DataArray(np.arange(110).reshape(10, 11),
...: dims=['x', 'y'])
In [5]: indx=xr.DataArray([1,2,3, 1,4,5,6,8, 2,3],
...: dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
...:
...: indy=xr.DataArray([7,9,8, 0,2,7,9,6, 4,5], dims=['index'],
...: coords={'i': ('index', [0,0,0, 1,1,1,1,1, 2,2]),
...: 'j': ('index', [0,1,2, 0,1,2,3,4, 0,1])})
In [8]: df.isel(x=indx, y=indy).set_index(index=['i', 'j']).unstack('index')
Out[8]:
<xarray.DataArray (i: 3, j: 5)>
array([[18., 31., 41., nan, nan],
[11., 46., 62., 75., 94.],
[26., 38., nan, nan, nan]])
Coordinates:
* i (i) int64 0 1 2
* j (j) int64 0 1 2 3 4
Here, indx and indy have non-dimensional coordinates, i and j, which essentially give the original position of each index in the 2-dimensional space.
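Since the stated goal was the mean of each segment, a short follow-up (assuming the setup above): take the mean along j after unstacking; the NaN padding is skipped by default:
segments = df.isel(x=indx, y=indy).set_index(index=['i', 'j']).unstack('index')
segments.mean('j')
# <xarray.DataArray (i: 3)>
# array([30. , 57.6, 32. ])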

What does "element wise" mean in Pandas?

I'm struggling to clearly understand the concept of an "element" in Pandas. I've already gone through the Pandas documentation and googled around, and I'm guessing it's some sort of row? What do people mean when they say "apply a function element-wise"?
This question came up when I was reading this SO post : How to apply a function to two columns of Pandas dataframe
Pandas is designed for vectorized operations, i.e. taking an entire column and applying some function to it. You can think of this as a column-wise operation.
But in some cases you may need to operate element by element (an element-wise operation). This type of operation is not very efficient.
Here is an example:
import pandas as pd
df = pd.DataFrame([a for a in range(100)], columns=['mynum'])
Column-wise operation:
%%timeit
df['add1'] = df.mynum +1
222 µs ± 3.31 µs per loop
The same operation done element-wise:
%%timeit
df['add1'] = df.apply(lambda a: a.mynum+1, axis = 1)
2.33 ms ± 85.4 µs per loop
I believe "element" in Pandas is an inherited concept of the "element" from NumPy. Give the first few paragraphs of the docs on ufuncs a read.
Each universal function takes array inputs and produces array outputs by performing the core function element-wise on the inputs (where an element is generally a scalar, but can be a vector or higher-order sub-array for generalized ufuncs).
In mathematics, element-wise operations refer to operations on individual elements of a matrix.
Examples:
import numpy as np
>>> x, y = np.arange(1,5).reshape(2,2), 3*np.eye(2)
>>> x, y
(array([[1, 2],
[3, 4]]),
array([[3., 0.],
[0., 3.]]))
>>> x + y # element-wise addition
array([[4., 2.],
[3., 7.]])
>>> np.dot(x,y) # NOT element-wise multiplication (matrix multiplication)
# elements become dot products of the rows of x with columns of y
array([[ 3., 6.],
[ 9., 12.]])
>>> x * y # element-wise multiplication
array([[ 3., 0.],
[ 0., 12.]])
I realize your question was about Pandas, but element-wise in Pandas means the same thing it does in NumPy and in linear algebra (as far as I'm aware).
Element-wise means handling data element by element.
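A minimal illustration (my own, not from the answers above) of what element-wise means in pandas:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df + 1    # element-wise: 1 is added to every single cell
df * df   # element-wise: each cell is multiplied by the matching cell, not a matrix product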

How to handle categorical features in neural network?

I currently have a dataset with the location of stores and the name of each item, to predict sales of a particular product.
I wanted to use binary encoding or pandas get_dummies(), but there are 5000 item names and it causes a memory error. Is there any alternative or better way to handle this? Thanks all!
print(train.shape)
print(train.dtypes)
print(train.head())
(125497040, 6)
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
dtype: object
id date store_nbr item_nbr unit_sales onpromotion
0 0 2013-01-01 25 103665 7.0 NaN
1 1 2013-01-01 25 105574 1.0 NaN
2 2 2013-01-01 25 105575 2.0 NaN
3 3 2013-01-01 25 108079 1.0 NaN
4 4 2013-01-01 25 108701 1.0 NaN
Instead of creating gazillions of dummy variables, you should use one-hot encoding: https://en.wikipedia.org/wiki/One-hot
Pandas doesn't have this functionality built-in, so the easiest way is to use scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
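The snippet above shows an older scikit-learn API. As a rough sketch with a recent version, applied to the question's columns (assuming the DataFrame is called train as in the question), the encoder returns a scipy sparse matrix by default, which is what keeps 5000 item categories from blowing up memory:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()                                    # sparse output by default
X = enc.fit_transform(train[['store_nbr', 'item_nbr']])  # scipy sparse matrix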
The way I see it, you could:
Not use all items, but only the most frequent ones.
Creating dummies this way produces fewer new columns and needs less memory. For this you will need to drop the items with few counts (define "few" with a threshold), and you will lose some information; a rough sketch of this follows below.
An alternative approach would be to use a Factorization Machine.
You could also use both suggestions above and average their predictions at the end for an even better score.
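A rough sketch of the first suggestion (the threshold of 500 and the -1 "other" bucket are arbitrary choices of mine, not from the answer):
import pandas as pd

# keep only the most frequent item_nbr values, lump the rest into a single bucket
top_items = train['item_nbr'].value_counts().nlargest(500).index
reduced = train['item_nbr'].where(train['item_nbr'].isin(top_items), other=-1)
dummies = pd.get_dummies(reduced, prefix='item', sparse=True)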