getting a default value from pandas dataframe when a key is not present - pandas

I have a DataFrame with a multi-index where each key is a tuple of two values. Currently, the order of the values in the key matters: df[(k1,k2)] is not the same as df[(k2,k1)]. Also, sometimes (k1,k2) exists in the DataFrame while (k2,k1) does not.
I'm trying to average the values of a certain column for those two entries. Currently, I'm doing this:
if (k1, k2) in df.index.values and not (k2, k1) in df.index.values:
    x = df[(k1, k2)]
if (k2, k1) in df.index.values and not (k1, k2) in df.index.values:
    x = df[(k2, k1)]
if (k2, k1) in df.index.values and (k1, k2) in df.index.values:
    x = (df[(k2, k1)] + df[(k1, k2)]) / 2
This is quite ugly... I'm looking for something like the get method with a default value that we have on a dictionary. Is there something like this in pandas?

ix index access and the mean function handle this for you. Fetch the two tuples from df.ix and apply mean to the result: non-existing keys are returned as NaN values, and mean ignores NaN values by default:
In [102]: df
Out[102]:
(26, 22) (10, 48) (48, 42) (48, 10) (42, 48)
a 311 NaN 724 879 42
In [103]: df.ix[:,[(10, 48), (48, 10)]].mean(axis=1)
Out[103]:
a 879
dtype: float64
In [104]: df.ix[:,[(42, 48), (48, 42)]].mean(axis=1)
Out[104]:
a 383
dtype: float64
In [105]: df.ix[:,[(26, 22), (22, 26)]].mean(axis=1)
Out[105]:
a 311
dtype: float64
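Note that .ix was later deprecated and removed from pandas; on current versions, reindex gives the same NaN-for-missing-keys behavior. A minimal sketch of the same idea, with made-up keys mirroring the example above:
import pandas as pd

# toy frame with tuple column keys, as in the example above
df = pd.DataFrame({(10, 48): [879.0], (48, 42): [724.0]}, index=['a'])

# reindex inserts an all-NaN column for any missing key,
# and mean(axis=1) skips NaN by default
x = df.reindex(columns=[(10, 48), (48, 10)]).mean(axis=1)
print(x)  # a    879.0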

Related

How to Exclude NaNs from Pandas Rolling, but not return NaN if there is one in the DataFrame

Currently I have the DataFrame seen below and I want to do a rolling average over the last 10 occurrences that have actual values, but to skip the NaNs
Example DataFrame
The issue is that if I run df['AST_Hit'].rolling(10).mean(skipna=True).shift(1) I get the DataFrame below, which is not what I am looking for
Example Output DataFrame
I've tried using window and min_periods but that does not give me what I want, as I don't want the average over anything greater than 10.
Ideally I would like the DataFrame to be able to discard a NaN, but still look to see if there are 10 values in that selection. From what I am describing I think I need some sort of max period where it is equal to 10 as well as the min period equal to 10, but I could not find anything on Pandas documentation for rolling on setting up a max period.
Maybe it would also be best if I just dropped any NaN rows. My DataFrame is much bigger than what is seen, so it isn't just those 3 rows that contain a NaN, but it may be the best course of action
Any help or tips would be greatly appreciated.
This might help you:
import pandas as pd
import numpy as np

# create a sample DataFrame with non-numeric columns
np.random.seed(123)
df = pd.DataFrame({
    'Date': pd.date_range(start='2022-01-01', periods=100),
    'AST_Hit': np.random.randint(0, 10, size=100),
    'Other_Column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * 10
})
df.iloc[3:6, 1] = np.nan
df.iloc[7, 0] = np.nan
df.iloc[10:15, 2] = np.nan
df.iloc[20:25, 1] = np.nan
df.iloc[30:40, 2] = np.nan

# compute rolling average over the last 10 non-null values
rolling_mask = df['AST_Hit'].notnull().rolling(window=10, min_periods=1).sum().eq(10)
result = df['AST_Hit'].rolling(window=10, min_periods=1).apply(lambda x: np.mean(x[rolling_mask]))
print(result)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
95 5.6
96 5.1
97 4.7
98 4.2
99 3.9
Name: AST_Hit, Length: 100, dtype: float64
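An alternative reading of the requirement, not taken from the answer above: drop the NaNs first so every window holds exactly 10 real values, then align the result back to the original index. A sketch:
import numpy as np
import pandas as pd

np.random.seed(123)
s = pd.Series(np.random.randint(0, 10, size=100).astype(float))
s.iloc[3:6] = np.nan  # simulate missing values

# roll over the last 10 *non-null* values, then align back to the original index
result = s.dropna().rolling(10).mean().reindex(s.index)
print(result.tail())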

Creating a Pandas DataFrame from a NumPy masked array?

I am trying to create a Pandas DataFrame from a NumPy masked array, which I understand is a supported operation. This is an example of the source array:
a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a',int),('b',float)],
             mask=[(True,False),(False,True)])
which outputs as:
masked_array(data=[(--, 2.2), (42, --)],
mask=[( True, False), (False, True)],
fill_value=(999999, 1.e+20),
dtype=[('a', '<i8'), ('b', '<f8')])
Attempting to create a DataFrame with pd.DataFrame(a) returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-a4c5236a3cd4> in <module>
----> 1 pd.DataFrame(a)
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
636 # a masked array
637 else:
--> 638 data = sanitize_masked_array(data)
639 mgr = ndarray_to_mgr(
640 data,
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_masked_array(data)
452 """
453 mask = ma.getmaskarray(data)
--> 454 if mask.any():
455 data, fill_value = maybe_upcast(data, copy=True)
456 data.soften_mask() # set hardmask False if it was True
/usr/local/anaconda/lib/python3.8/site-packages/numpy/core/_methods.py in _any(a, axis, dtype, out, keepdims, where)
54 # Parsing keyword arguments is currently fairly slow, so avoid it for now
55 if where is True:
---> 56 return umr_any(a, axis, dtype, out, keepdims)
57 return umr_any(a, axis, dtype, out, keepdims, where=where)
58
TypeError: cannot perform reduce with flexible type
Is this operation indeed supported? Currently using Pandas 1.3.3 and NumPy 1.20.3.
Update
Is this supported?
According to the Pandas documentation here:
Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
The code above was my way of asking "What will I get?" if I passed a NumPy masked array to pandas, and the documented behavior was the result I was hoping for. It was the simplest example I could come up with.
I do expect each Series/column in Pandas to be of a single type.
Update 2
Anyone interested in this should probably see this Pandas GitHub issue; it's noted there that Pandas has "deprecated support for MaskedRecords".
If the array has a simple dtype, the dataframe creation works (as documented):
In [320]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: mask=[(True,False),(False,True)])
In [321]: a
Out[321]:
masked_array(
data=[[--, 2.2],
[42.0, --]],
mask=[[ True, False],
[False, True]],
fill_value=1e+20)
In [322]: import pandas as pd
In [323]: pd.DataFrame(a)
Out[323]:
0 1
0 NaN 2.2
1 42.0 NaN
This a has shape (2, 2), and the result has 2 rows and 2 columns.
With the compound dtype, the shape is 1d:
In [326]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: dtype=[('a',int),('b',float)],
...: mask=[(True,False),(False,True)])
In [327]: a.shape
Out[327]: (2,)
The error is the result of a test on the mask. flexible type refers to your compound dtype:
In [330]: a.mask.any()
Traceback (most recent call last):
File "<ipython-input-330-8dc32ee3f59d>", line 1, in <module>
a.mask.any()
File "/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py", line 57, in _any
return umr_any(a, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
The documented pandas feature clearly does not apply to structured arrays. Without studying the pandas code I can't say exactly what it's trying to do at this point, but it's clear the code was not written with structured arrays in mind.
The non-masked part does work, with the desired column dtypes:
In [332]: pd.DataFrame(a.data)
Out[332]:
a b
0 1 2.2
1 42 5.5
Using the default fill:
In [344]: a.filled()
Out[344]:
array([(999999, 2.2e+00), ( 42, 1.0e+20)],
dtype=[('a', '<i8'), ('b', '<f8')])
In [345]: pd.DataFrame(a.filled())
Out[345]:
a b
0 999999 2.200000e+00
1 42 1.000000e+20
I'd have to look more at the ma docs/code to see if it's possible to apply a different fill to the two fields. Filling with NaN doesn't work for the int field; NumPy has no equivalent of pandas' nullable integer NA. I haven't worked enough with that pandas feature to know whether the resulting dtype stays int or is changed to object.
Anyway, you are pushing the bounds of both np.ma and pandas with this task.
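On that last point: pandas' nullable extension dtype Int64 (capital I) keeps integer values alongside missing entries rather than falling back to float or object. A quick illustration, not part of the original answer:
import pandas as pd

s = pd.Series([1, None], dtype="Int64")  # nullable integer extension dtype
print(s.dtype)  # Int64 -- stays integer; the missing value displays as <NA>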
edit
The default fill_value is a tuple, one for each field:
In [350]: a.fill_value
Out[350]: (999999, 1.e+20)
So we can fill the fields differently, and make a frame from that:
In [351]: a.filled((-1, np.nan))
Out[351]: array([(-1, 2.2), (42, nan)], dtype=[('a', '<i8'), ('b', '<f8')])
In [352]: pd.DataFrame(a.filled((-1, np.nan)))
Out[352]:
a b
0 -1 2.2
1 42 NaN
Looks like I can make a structured array with a pandas dtype, and its associated fill_value:
In [363]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: dtype=[('a',pd.Int64Dtype),('b',float)],
...: mask=[(True,False),(False,True)],
...: fill_value=(pd.NA,np.nan))
In [364]: a
Out[364]:
masked_array(data=[(--, 2.2), (42, --)],
mask=[( True, False), (False, True)],
fill_value=(<NA>, nan),
dtype=[('a', 'O'), ('b', '<f8')])
In [366]: pd.DataFrame(a.filled())
Out[366]:
a b
0 <NA> 2.2
1 42 NaN
The question is what would you expect to get? It would be ambiguous for pandas to convert your data.
If you want to get the original data:
>>> pd.DataFrame(a.data)
a b
0 1 2.2
1 42 5.5
If you want to consider masked values invalid:
>>> pd.DataFrame(a.filled(np.nan))
But for this to work, every field in the masked array must be float; NaN cannot be stored in an int field.
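If the fields have mixed types, one workaround is to build the frame field by field and let NumPy upcast where a mask needs NaN. A sketch, not from either answer:
import numpy as np
import numpy.ma as ma
import pandas as pd

a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a', int), ('b', float)],
             mask=[(True, False), (False, True)])

# apply each field's mask separately; np.where upcasts the int field
# to float so NaN can stand in for the masked entries
df = pd.DataFrame({name: np.where(ma.getmaskarray(a)[name], np.nan, a[name].data)
                   for name in a.dtype.names})
print(df)
#       a    b
# 0   NaN  2.2
# 1  42.0  NaN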

Sum of data entry with the given index in pandas dataframe

I am trying to get the sum of every possible combination of the given data in a pandas DataFrame. To do this I use itertools.combinations to get all possible combinations, then sum each of them in a loop.
Is there any way to do this without the loop?
Please check the following script I created, which shows what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})

listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)

for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a DataFrame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df.A by that flattened order
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the index is already in place, sum with level works
s = s.Value.sum(level=0)
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
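Note that Series.sum(level=...) was deprecated in pandas 1.3 and removed in 2.0; a groupby on the index level is the current equivalent. A sketch of the same pipeline on recent pandas:
import itertools as it
import pandas as pd

A = pd.Series([50, 20, 75], index=range(1, 4))
combos = [c for r in range(1, len(A) + 1) for c in it.combinations(A.index, r)]

# stack drops the NaN padding from the ragged tuples
s = pd.DataFrame(combos).stack().to_frame('index')
# cast back to int, since the NaN padding made the column float
s['Value'] = A.reindex(s['index'].astype(int)).values
print(s['Value'].groupby(level=0).sum())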

Selecting columns from numpy recarray

I have an object of type numpy.core.records.recarray. I want to use it effectively as a pandas DataFrame. More precisely, I want to use a subset of its columns to obtain a new recarray, the same way you would do pandas_dataframe[[selected_columns]].
What's the easiest way to achieve this?
Without using pandas you can select a subset of the fields of a structured array (recarray). For example:
In [338]: dt=np.dtype('i,f,i,f')
In [340]: A=np.ones((3,),dtype=dt)
In [341]: A[:]=(1,2,3,4)
In [342]: A
Out[342]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
Take a subset of the fields:
In [343]: B=A[['f1','f3']].copy()
In [344]: B
Out[344]:
array([(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
that can be modified independently of A:
In [346]: B['f3']=[.1,.2,.3]
In [347]: B
Out[347]:
array([(2.0, 0.10000000149011612), (2.0, 0.20000000298023224),
(2.0, 0.30000001192092896)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
In [348]: A
Out[348]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
The structured subset of fields is not highly developed. A[['f0','f1']] is enough for viewing, but it will warn or give an error if you try to modify that subset. That's why I used copy with B.
There's a set of functions in numpy.lib.recfunctions that facilitate adding and removing fields from recarrays. I'll have to look up the access pattern, but mostly they construct a new dtype and an empty array, and then copy fields by name.
import numpy.lib.recfunctions as rf
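For instance, drop_fields and append_fields both live in that module; a small sketch of each (the array layout is assumed, matching the example above):
import numpy as np
import numpy.lib.recfunctions as rf

A = np.ones((3,), dtype='i,f,i,f')      # four fields f0..f3, as above

# drop_fields returns a new array without the named fields
B = rf.drop_fields(A, ['f0', 'f2'])
print(B.dtype)   # [('f1', '<f4'), ('f3', '<f4')]

# append_fields adds a new field by name (usemask=False keeps a plain array)
D = rf.append_fields(A, 'f4', np.arange(3.0), usemask=False)
print(D.dtype.names)  # ('f0', 'f1', 'f2', 'f3', 'f4')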
update
With newer numpy versions, the multi-field index has changed
In [17]: B=A[['f1','f3']]
In [18]: B
Out[18]:
array([(2., 4.), (2., 4.), (2., 4.)],
dtype={'names':['f1','f3'], 'formats':['<f4','<f4'], 'offsets':[4,12], 'itemsize':16})
This B is a true view, referencing the same data buffer as A. The offsets let it skip the missing fields. Those fields can be removed with repack_fields, as shown above.
But when putting this into a dataframe, it doesn't look like we need to do that.
In [19]: df = pd.DataFrame(A)
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f0 3 non-null int32
1 f1 3 non-null float32
2 f2 3 non-null int32
3 f3 3 non-null float32
dtypes: float32(2), int32(2)
memory usage: 176.0 bytes
In [22]: df = pd.DataFrame(B)
In [24]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 3 non-null float32
1 f3 3 non-null float32
dtypes: float32(2)
memory usage: 152.0 bytes
The frame created from B is smaller.
Sometimes when making a dataframe from an array, the array itself is used as the frame's memory. Changing values in the source array will change the values in the frame. But with structured arrays, pandas makes a copy of the data, with a different memory layout.
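A quick check of that copy behavior (a sketch; A mirrors the structured array from In [340] above):
import numpy as np
import pandas as pd

A = np.ones((3,), dtype='i,f,i,f')
df = pd.DataFrame(A)
A['f1'] = 99.0
print(df['f1'].tolist())  # [1.0, 1.0, 1.0] -- the frame kept its own copy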
Columns of matching dtype are grouped into a common NumericBlock:
In [42]: pd.DataFrame(A)._data
Out[42]:
BlockManager
Items: Index(['f0', 'f1', 'f2', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 5, 2), 2 x 3, dtype: float32
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int32
In [43]: pd.DataFrame(B)._data
Out[43]:
BlockManager
Items: Index(['f1', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: float32
In addition to @hpaulj's answer, you'll want to repack the copy, otherwise the copied subset will have the same size memory footprint as the original.
import numpy as np
# note that you have to import this library explicitly
import numpy.lib.recfunctions
# B has a subset of "columns" but uses the same amount of memory as A
B = A[['f1','f3']].copy()
# C has a smaller memory footprint
C = np.lib.recfunctions.repack_fields(B)
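A quick way to see the difference (assuming A is the four-field example from the answer above; exact byte counts depend on the dtype):
import numpy as np
import numpy.lib.recfunctions as rf

A = np.ones((3,), dtype='i,f,i,f')
B = A[['f1', 'f3']].copy()       # keeps the original 16-byte itemsize
C = rf.repack_fields(B)          # packed down to 8 bytes per row
print(B.itemsize, C.itemsize)    # 16 8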

How to ensure get label for zero counts in python pandas pd.cut

I am analyzing a DataFrame and getting timing counts which I want to put into specific buckets (0-10 seconds, 10-30 seconds, etc).
Here is a simplified example:
import pandas as pd

filter_values = [0, 10, 20, 30]  # bucket edges for pd.cut
# sample times
df1 = pd.DataFrame([1, 3, 8, 20], columns=['filtercol'])
# use cut to get counts for each bucket
out = pd.cut(df1.filtercol, bins=filter_values)
counts = pd.value_counts(out)
print(counts)
The above prints:
(0, 10] 3
(10, 20] 1
dtype: int64
You will notice it does not show any value for (20, 30]. This is a problem because I want that bucket to appear in my output as zero. I can handle it using the following code:
bucket1 = bucket2 = bucket3 = 0
if '(0, 10]' in counts:
    bucket1 = counts['(0, 10]']
if '(10, 20]' in counts:
    bucket2 = counts['(10, 20]']
if '(20, 30]' in counts:
    bucket3 = counts['(20, 30]']
print(bucket1, bucket2, bucket3)
But I want a simpler, cleaner approach where I can use:
print(counts['(0, 10]'], counts['(10, 20]'], counts['(20, 30]'])
Ideally where the print is based on the values in filter_values so they are only in one place in the code. Yes I know I can change the print to use filter_values[0]...
Lastly, when using cut, is there a way to specify infinity so the last bucket covers all values greater than, say, 60?
Cheers,
Stephen
You can reindex by the categorical's levels:
In [11]: pd.value_counts(out).reindex(out.levels, fill_value=0)
Out[11]:
(0, 10] 3
(10, 20] 1
(20, 30] 0
dtype: int64
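On the infinity question: pd.cut accepts np.inf as a bin edge, so the last bucket can be open-ended. Also, on current pandas out.levels no longer exists (the categories live at out.cat.categories), and value_counts on a categorical already reports zero-count categories. A sketch combining both, reusing df1 from above:
import numpy as np
import pandas as pd

filter_values = [0, 10, 20, 30, np.inf]  # last bucket catches everything > 30
df1 = pd.DataFrame([1, 3, 8, 20], columns=['filtercol'])
out = pd.cut(df1.filtercol, bins=filter_values)

# value_counts on a categorical includes empty categories by default
print(out.value_counts(sort=False))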