Why does pandas change the index value in this example? - pandas

First we create a raw dataset with MultiIndex-
In [166]: import numpy as np; import pandas as pd
In [167]: data_raw = pd.DataFrame([
     ...:     {'frame': 1, 'face': np.NaN, 'lmark': np.NaN, 'x': np.NaN, 'y': np.NaN},
     ...:     {'frame': 197, 'face': 0, 'lmark': 1, 'x': 969, 'y': 737},
     ...:     {'frame': 197, 'face': 0, 'lmark': 2, 'x': 969, 'y': 740},
     ...:     {'frame': 197, 'face': 0, 'lmark': 3, 'x': 970, 'y': 744},
     ...:     {'frame': 197, 'face': 0, 'lmark': 4, 'x': 972, 'y': 748},
     ...:     {'frame': 197, 'face': 0, 'lmark': 5, 'x': 973, 'y': 752},
     ...:     {'frame': 300, 'face': 0, 'lmark': 1, 'x': 745, 'y': 367},
     ...:     {'frame': 300, 'face': 0, 'lmark': 2, 'x': 753, 'y': 411},
     ...:     {'frame': 300, 'face': 0, 'lmark': 3, 'x': 759, 'y': 455},
     ...:     {'frame': 301, 'face': 0, 'lmark': 1, 'x': 741, 'y': 364},
     ...:     {'frame': 301, 'face': 0, 'lmark': 2, 'x': 746, 'y': 408},
     ...:     {'frame': 301, 'face': 0, 'lmark': 3, 'x': 750, 'y': 452}]).set_index(['frame', 'face', 'lmark'])
Next we calculate the z-scores for each lmark -
In [168]: ((data_raw - data_raw.mean(level='lmark')).abs()) / data_raw.std(level='lmark')
Out[168]:
                             x         y
frame face lmark
1     NaN NaN             NaN       NaN
197   0.0 1.0        1.154565  1.154672
          2.0        1.154260  1.154665
          3.0        1.153946  1.154654
          4.0             NaN       NaN
          5.0             NaN       NaN
300   0.0 1.0        0.561956  0.570343
          2.0        0.549523  0.569472
          3.0        0.540829  0.568384
301   0.0 1.0        0.592609  0.584329
          2.0        0.604738  0.585193
          3.0        0.613117  0.586270
The index values don't change, as expected.
Now we filter out records where lmark > 3 -
In [170]: data_filtered = data_raw.loc[(slice(None), slice(None), [np.NaN, slice(3)]),:]
In [171]: data_filtered
Out[171]:
                      x      y
frame face lmark
1     NaN NaN       NaN    NaN
197   0.0 1.0     969.0  737.0
          2.0     969.0  740.0
          3.0     970.0  744.0
300   0.0 1.0     745.0  367.0
          2.0     753.0  411.0
          3.0     759.0  455.0
301   0.0 1.0     741.0  364.0
          2.0     746.0  408.0
          3.0     750.0  452.0
and recalculate the z-scores -
In [172]: ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
Out[172]:
                             x         y
frame face lmark
1     NaN 1.0             NaN       NaN
197   0.0 1.0        1.154565  1.154672
          2.0        1.154260  1.154665
          3.0        1.153946  1.154654
300   0.0 1.0        0.561956  0.570343
          2.0        0.549523  0.569472
          3.0        0.540829  0.568384
301   0.0 1.0        0.592609  0.584329
          2.0        0.604738  0.585193
          3.0        0.613117  0.586270
Why has the value of the first record's lmark index changed from NaN to 1.0?

This looks like a bug to me.
The solution is to use MultiIndex.remove_unused_levels:
data_filtered.index = data_filtered.index.remove_unused_levels()
a = ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
print (a)
                             x         y
frame face lmark
1     NaN NaN             NaN       NaN
197   0.0 1.0        1.154565  1.154672
          2.0        1.154260  1.154665
          3.0        1.153946  1.154654
300   0.0 1.0        0.561956  0.570343
          2.0        0.549523  0.569472
          3.0        0.540829  0.568384
301   0.0 1.0        0.592609  0.584329
          2.0        0.604738  0.585193
          3.0        0.613117  0.586270
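To see why the stale levels matter, a minimal check (my own sketch, assuming the data_filtered from above; the exact repr depends on the pandas version): after the .loc filtering, the MultiIndex still remembers the dropped lmark codes in its levels, and the level-wise mean/std aligns against those stale levels.
print (data_filtered.index.levels[2])
# still lists 1.0 through 5.0, even though lmark 4 and 5 were filtered out
print (data_filtered.index.remove_unused_levels().levels[2])
# only 1.0, 2.0, 3.0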

Related

Pandas rolling mean only for non-NaNs

I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine; the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that the rolling mean with window=3 outputs scalars there, not NaNs; a possible solution is to set NaN manually after rolling with Series.mask:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print (df)
     B  A    C    D
0  0.0  1  0.0  0.0
1  1.0  1  0.5  0.5
2  2.0  2  1.0  2.0
3  NaN  2  NaN  NaN
4  4.0  2  3.0  3.0
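As an aside (my note, not part of the original answer), the droplevel(0) is needed because groupby(...).rolling(...) prepends the group key to the result's index:
r = df.groupby('A')['B'].rolling(3, min_periods=1).mean()
print (r.index)
# a MultiIndex with two levels: 'A' (the group key) and the original row index,
# hence droplevel(0) before the result can be aligned back to df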
EDIT: For multiple grouping columns, remove the added index levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print (df)
     B  A1  A2    D
0  0.0   1   1  0.0
1  1.0   1   2  1.0
2  2.0   2   3  2.0
3  NaN   2   3  NaN
4  4.0   2   3  3.0
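For reference, Series.mask replaces the values where the condition is True with NaN, which is what restores the missing entries after rolling. A tiny sketch of my own:
s = pd.Series([0.0, 1.0, 2.0, np.nan, 4.0])
print (s.rolling(3, min_periods=1).mean().mask(s.isna()))
# positions where s is NaN stay NaN; the other rolling means are kept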

Product demand down calculation in pandas df without loop

I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, and they get handled with priority from the top down. I'd like an efficient solution without looping.
df_situation = pd.DataFrame(
    {
        "cust": [1, 2, 3, 3, 4],
        "prod": [1, 1, 1, 2, 2],
        "available": [1000, np.nan, np.nan, 2000, np.nan],
        "needed": [200, 300, 1000, 1000, 1000],
    }
)
My objective is to get some additional columns like this, but it looks like the difference calculations and the shift operation are in a chicken-and-egg situation.
Thanks in advance for any hint
leftover_prod is the forward-filled available minus the cumulative demand per prod group (groupby + cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
0     800.0
1     500.0
2    -500.0
3    1000.0
4       0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is either the demand (if there is enough leftover_prod) or the previous row's leftover_prod (groupby + shift), chosen with np.where:
s = (df_situation.groupby('prod')['leftover_prod']
       .shift()
       .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
0     200.0
1     300.0
2     500.0
3    1000.0
4    1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand minus the fulfilled_cust:
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
0      0.0
1      0.0
2    500.0
3      0.0
4      0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
       .shift()
       .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
   cust  prod  available  demand  leftover_prod  fulfilled_cust  missing_cust
0     1     1     1000.0     200          800.0           200.0           0.0
1     2     1        NaN     300          500.0           300.0           0.0
2     3     1        NaN    1000         -500.0           500.0         500.0
3     3     2     2000.0    1000         1000.0          1000.0           0.0
4     4     2        NaN    1000            0.0          1000.0           0.0
imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
    "cust": [1, 2, 3, 3, 4],
    "prod": [1, 1, 1, 2, 2],
    "available": [1000, np.nan, np.nan, 2000, np.nan],
    "demand": [200, 300, 1000, 1000, 1000],
})
(changed "needed" to "demand" as it appears in image.)
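To sanity-check the vectorized columns, a plain loop over the rows (my own sketch, not part of the original answer) reproduces the same three columns on this data:
leftover = {}
checked = []
for _, r in df_situation.iterrows():
    if not np.isnan(r['available']):
        leftover[r['prod']] = r['available']   # a new availability value resets the pool
    fulfilled = min(r['demand'], max(leftover.get(r['prod'], 0), 0))
    leftover[r['prod']] = leftover.get(r['prod'], 0) - r['demand']
    checked.append((leftover[r['prod']], fulfilled, r['demand'] - fulfilled))
print (pd.DataFrame(checked, columns=['leftover_prod', 'fulfilled_cust', 'missing_cust']))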

Apply kmeans on each group in a pandas DataFrame and save the clusters in a new column in the same DataFrame

I have a DataFrame containing some embeddings in column D. I would like to first group the data by column A and then apply kmeans on each group. Each group might contain NaN values, so in the apply function I take the number of clusters to be the number of non-NaN values in column D divided by 2 (n_clusters = int(not_na_mask.sum()/2)).
In the apply function I return df['cluster'].values.tolist(). I printed these values and they are correct for each group, but after running the whole script, df_test['clusters'] contains only NaN in every row.
Sample DataFrame:
df_test = pd.DataFrame({
    'A': ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa'],
    'B': [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
    'D': [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan, [1, 6, 3, 2, 1, 9],
          [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan, np.nan, [2, 5, 1, 7, 4, 0], [4, 2, 0, 4, 0, 0],
          [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan, [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})
df_test:
A B D
0 aa 1.0 [2, 0, 1, 5, 4, 0]
1 bb 2.0 NaN
2 aa NaN [4, 7, 0, 1, 0, 2]
3 bb 4.0 [1.0, 1, 1, 2, 0, 5]
4 aa 6.0 NaN
5 bb NaN [1, 6, 3, 2, 1, 9]
6 aa 7.0 [4, 2, 1, 0, 0, 0]
7 cc 8.0 [3, 5, 6, 8, 8, 0]
8 aa NaN NaN
9 aa 1.0 NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0]
11 bb 3.0 [4, 2, 0, 4, 0, 0]
12 bb 4.0 [1.0, 0, 1, 8, 0, 9]
13 cc 7.0 [1, 0, 7, 2, 1, 0]
14 bb 5.0 NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0]
16 cc 9.0 [4, 1, 6, 1, 1, 0]
17 aa NaN NaN
My approach for calculating kmeans:
def apply_kmeans_on_each_category(df):
    not_na_mask = df['D'].notna()
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum()/2)
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)
df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)
result:
df_test['clusters']:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: clusters, dtype: object
Made some slight changes. The meat of the change is to use transform instead of apply. Also, there is no need to pass the entire group DataFrame; you can pass column D directly, since that is the only column you are using -
def apply_kmeans_on_each_category(df):
    not_na_mask = df.notna()
    embedding = df.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum()/2)
    op = pd.Series([np.nan] * len(df), index=df.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op
df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)
Output
A B D clusters
0 aa 1.0 [2, 0, 1, 5, 4, 0] 0.0
1 bb 2.0 NaN NaN
2 aa NaN [4, 7, 0, 1, 0, 2] 1.0
3 bb 4.0 [1.0, 1, 1, 2, 0, 5] 0.0
4 aa 6.0 NaN NaN
5 bb NaN [1, 6, 3, 2, 1, 9] 0.0
6 aa 7.0 [4, 2, 1, 0, 0, 0] 1.0
7 cc 8.0 [3, 5, 6, 8, 8, 0] NaN
8 aa NaN NaN NaN
9 aa 1.0 NaN NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0] 1.0
11 bb 3.0 [4, 2, 0, 4, 0, 0] 1.0
12 bb 4.0 [1.0, 0, 1, 8, 0, 9] 0.0
13 cc 7.0 [1, 0, 7, 2, 1, 0] NaN
14 bb 5.0 NaN NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0] 0.0
16 cc 9.0 [4, 1, 6, 1, 1, 0] NaN
17 aa NaN NaN NaN
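A short note on why the original apply version produced only NaN (my explanation, not from the original answer): groupby('A').apply(...) with a function that returns a plain list gives back one element per group, indexed by the group label, so assigning it to df_test['clusters'] aligns against the 0..17 row index and matches nothing:
out = df_test.groupby('A').apply(lambda g: [0] * len(g))
print (out.index)
# Index(['aa', 'bb', 'cc'], ...) -- no overlap with df_test's 0..17 index,
# hence the all-NaN column after assignment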

Pandas - Unexpected results when indexing a DataFrame containing missing entries

First, we create a large dataset with MultiIndex whose first record contains missing values np.NaN
In [200]: data = []
     ...: val = 0
     ...: for ind_1 in range(3000):
     ...:     if ind_1 == 0:
     ...:         data.append({'ind_1': 0, 'ind_2': np.NaN, 'val': np.NaN})
     ...:     else:
     ...:         for ind_2 in range(3000):
     ...:             data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
     ...:             val += 1
     ...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
                    val
ind_1 ind_2
0     NaN           NaN
1     0.0           0.0
      1.0           1.0
      2.0           2.0
      3.0           3.0
...                 ...
2999  2995.0  8996995.0
      2996.0  8996996.0
      2997.0  8996997.0
      2998.0  8996998.0
      2999.0  8996999.0

[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3
First I create a MultiIndex i1 where ind_1 < 3
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
(1, 0.0),
(1, 1.0),
(1, 2.0),
(1, 3.0),
(1, 4.0),
(1, 5.0),
(1, 6.0),
(1, 7.0),
(1, 8.0),
...
(2, 2990.0),
(2, 2991.0),
(2, 2992.0),
(2, 2993.0),
(2, 2994.0),
(2, 2995.0),
(2, 2996.0),
(2, 2997.0),
(2, 2998.0),
(2, 2999.0)],
names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 < 3
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([( 0, nan),
( 1, 0.0),
( 1, 1.0),
( 1, 2.0),
( 2, 0.0),
( 2, 1.0),
( 2, 2.0),
( 3, 0.0),
( 3, 1.0),
( 3, 2.0),
...
(2996, 2.0),
(2997, 0.0),
(2997, 1.0),
(2997, 2.0),
(2998, 0.0),
(2998, 1.0),
(2998, 2.0),
(2999, 0.0),
(2999, 1.0),
(2999, 2.0)],
names=['ind_1', 'ind_2'], length=8998)
Logically, the solution should be the intersection of these two sets
In [206]: df.loc[i1 & i2]
Out[206]:
                val
ind_1 ind_2
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
Why is the first record (0, nan) filtered out?
Use boolean arrays i1, i2 instead of indexes. Label-based set operations such as the intersection treat NaN as not equal to NaN, so the (0, nan) entry is dropped; boolean masks select purely by position and keep it.
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
                val
ind_1 ind_2
0     NaN       NaN
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
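The same selection can also be written in one step (a sketch of the same idea, nothing different):
mask = (df.index.get_level_values('ind_1') < 3) & ~(df.index.get_level_values('ind_2') > 2)
df.loc[mask]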

pandas 0.20.3 DataFrame behavior changes for pyspark.ml.vectors object in a column

The following code works in pandas 0.20.0 but not in 0.20.3:
import pandas as pd
from pyspark.ml.linalg import Vectors
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [1, 2, 3, 4],
                   'C': [1, 2, 3, 4],
                   'D': [1, 2, 3, 4]},
                  index=[0, 1, 2, 3])
df.apply(lambda x: pd.Series(Vectors.dense([x["A"], x["B"]])), axis=1)
This produces from pandas 0.20.0:
            0
0  [1.0, 1.0]
1  [2.0, 2.0]
2  [3.0, 3.0]
3  [4.0, 4.0]
but it is different in pandas 0.20.3:
     0    1
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
How can I achieve the first behavior in 0.20.3?
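One hedged workaround sketch (my own, not an answer from the original post): build the column of vectors explicitly instead of relying on how apply unpacks the returned Series, so each cell keeps the whole DenseVector:
vecs = pd.Series([Vectors.dense([a, b]) for a, b in zip(df['A'], df['B'])],
                 index=df.index)
out = vecs.to_frame(0)
# one column named 0, each cell holding a DenseVector such as [1.0, 1.0]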