apply function in pandas hierarchical index - pandas

I have a pandas dataframe as below.
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'tiger': [87, 159, 351, 140, 72, 119],
                   'lion': [1843, 3721, 6905, 1667, 2865, 1599],
                   'bear': [1.9, 3.3, 6.3, 2.3, 1.2, 4.1],
                   'points': [425, 425, 441, 441, 1048, 1048]})
grouped = df.groupby(['points', 'team'])[['tiger', 'lion', 'bear']].median()
print(grouped)
                 tiger        lion     bear
points team
425    A      87.00000  1843.00000  1.90000
       B     159.00000  3721.00000  3.30000
441    A     351.00000  6905.00000  6.30000
       B     140.00000  1667.00000  2.30000
1048   A      72.00000  2865.00000  1.20000
       B     119.00000  1599.00000  4.10000
I would like to take the difference between teams A and B for each of the animals (tiger, lion, bear) within each points level. So, for example, the difference between team A (87) and team B (159) within points 425 for tiger. I'm not sure how to do this with a hierarchical index. The result would look something like below. Thanks.
   points  tiger  lion     bear
0     425     72   1878  1.40000
1     441   -211  -5238 -4.00000
2    1048     47  -1266  2.90000

You can swaplevel and slice:
grouped = (df.groupby(['points', 'team'])[['tiger', 'lion', 'bear']].median()
             .swaplevel()
          )
grouped.loc['A'] - grouped.loc['B']
Or use xs:
grouped = df.groupby(['points', 'team'])[['tiger', 'lion', 'bear']].median()
grouped.xs('A', level='team')-grouped.xs('B', level='team')
Output:
        tiger    lion  bear
points
425     -72.0 -1878.0  -1.4
441     211.0  5238.0   4.0
1048    -47.0  1266.0  -2.9
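
If you want the result to line up with the example output in the question (team B minus team A, with points back as an ordinary column), you can flip the operands and reset the index; a small sketch:
diff = grouped.xs('B', level='team') - grouped.xs('A', level='team')
print(diff.reset_index())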

Another option: within each points group take the row-wise difference and keep the last row (which is B minus A), then drop the extra index levels:
grouped.groupby(level=0).apply(lambda dd: dd.diff().tail(1)).droplevel([1, 2])
Output:
        tiger    lion  bear
points
425      72.0  1878.0   1.4
441    -211.0 -5238.0  -4.0
1048     47.0 -1266.0   2.9
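
For reference, the same B-minus-A result can also be obtained without apply by unstacking the team level; a sketch, assuming grouped is the median frame from above:
wide = grouped.unstack('team')  # columns become an (animal, team) MultiIndex
out = wide.xs('B', axis=1, level='team') - wide.xs('A', axis=1, level='team')
print(out)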

Related

Find out missing values based on mapping in Pandas

I have 2 datasets:
df_1.head(4)
region  postal_code
Adrar          1000
Broko          5633
Conan          4288
Cymus          7435
df_2.head(4)
Name    Charges  region  postal_code  Revenue
Lia     HG       Pintol         4522      345
Joss    PX       Inend          7455      142
Amph    CT       NaN            5633      148
Andrew  UY       Liven          9033      147
The second dataset has many missing values in the 'region' column, but we can fill those missing values using the first dataset by matching on postal_code. For example, in the third row of df_2 the 'region' value is missing, but by matching its postal_code against df_1 we can find that its region is 'Broko'. Can someone please suggest how to code this?
You can use boolean indexing and a map:
m = df2['region'].isna()
df2.loc[m, 'region'] = (df2.loc[m, 'postal_code']
                           .map(df1.set_index('postal_code')['region'])
                        )
Another less efficient approach could be:
df2['region'] = (df2['region']
                 .fillna(df2['postal_code']
                         .map(df1.set_index('postal_code')['region']))
                )
# or in place
df2['region'].update(df2['postal_code']
                     .map(df1.set_index('postal_code')['region']))
Output:
     Name Charges  region  postal_code  Revenue
0     Lia      HG  Pintol         4522      345
1    Joss      PX   Inend         7455      142
2    Amph      CT   Broko         5633      148
3  Andrew      UY   Liven         9033      147
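
For reference, a merge-based way to do the same fill is sketched below, assuming postal_code is unique in df1; the region_lookup suffix is just illustrative:
# left-join the lookup table, then fill the missing regions from it
merged = df2.merge(df1, on='postal_code', how='left', suffixes=('', '_lookup'))
df2['region'] = merged['region'].fillna(merged['region_lookup'])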
Example
data1 = {'region': {0: 'Adrar', 1: 'Broko', 2: 'Conan', 3: 'Cymus'},
         'postal_code': {0: 1000, 1: 5633, 2: 4288, 3: 7435}}
data2 = {'Name': {0: 'Lia', 1: 'Joss', 2: 'Amph', 3: 'Andrew'},
         'Charges': {0: 'HG', 1: 'PX', 2: 'CT', 3: 'UY'},
         'region': {0: 'Pintol', 1: 'Inend', 2: None, 3: 'Liven'},
         'postal_code': {0: 4522, 1: 7455, 2: 5633, 3: 9033},
         'Revenue': {0: 345, 1: 142, 2: 148, 3: 147}}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Code
Use map and fillna:
mapper = df1.set_index('postal_code')['region']
df2.assign(region=df2['region'].fillna(df2['postal_code'].map(mapper)))
result:
     Name Charges  region  postal_code  Revenue
0     Lia      HG  Pintol         4522      345
1    Joss      PX   Inend         7455      142
2    Amph      CT   Broko         5633      148
3  Andrew      UY   Liven         9033      147
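
Note that assign returns a new DataFrame rather than modifying df2 in place, so assign the result back if you want to keep it:
df2 = df2.assign(region=df2['region'].fillna(df2['postal_code'].map(mapper)))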
Try this: build a postal_code -> region mapping by reversing the column order of df1, then let combine_first keep the existing non-null region values and fill the gaps from the mapped postal codes:
mapper = dict(df1.values[:, ::-1])
df2['region'] = df2['region'].combine_first(df2['postal_code'].replace(mapper))

I am trying to unwrap?, explode?, a data frame with several columns into a new data frame with rows

I apologize for not knowing the correct terminology, but I am looking for a way in Pandas to transform a data frame with several similar columns into a data frame with more rows. Basically, for every group of columns that starts with Line.{x}, I want to create a new row that holds all of the Line.{x} columns, and likewise for every value of {x}, e.g. 0, 1, 2, 3.
Here is an example of a data frame I'd like to convert from:
  Column1 Column2 Column3 Column4  Line.0.a  Line.0.b  Line.0.c  Line.1.a  Line.1.b  Line.1.c  Line.2.a  Line.2.b  Line.2.c  Line.3.a  Line.3.b  Line.3.c
0     the   quick   brown     dog       100       200       300       400       500       600       700       800       900      1000      1100      1200
1     you     see    spot     run       101       201       301       401       501       601       NaN       NaN       NaN       NaN       NaN       NaN
2    four   score     and   seven       102       202       302       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
I would like to convert it to this:
  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog     100     200     300
1     the   quick   brown     dog     400     500     600
2     the   quick   brown     dog     700     800     900
3     the   quick   brown     dog    1000    1100    1200
4     you     see    spot     run     101     201     301
5     you     see    spot     run     401     501     601
6    four   score     and   seven     102     202     302
Thank you in advance!
One option is pivot_longer from pyjanitor: for this particular use case, you pass the .value placeholder to names_to to keep track of the parts of the column you want to retain as headers, and you pass a regular expression with matching groups to names_pattern:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='Col*',
                names_to=(".value", ".value"),
                names_pattern=r"(.+)\.\d+(.+)")
   Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0      the   quick   brown     dog   100.0   200.0   300.0
1      you     see    spot     run   101.0   201.0   301.0
2     four   score     and   seven   102.0   202.0   302.0
3      the   quick   brown     dog   400.0   500.0   600.0
4      you     see    spot     run   401.0   501.0   601.0
5     four   score     and   seven     NaN     NaN     NaN
6      the   quick   brown     dog   700.0   800.0   900.0
7      you     see    spot     run     NaN     NaN     NaN
8     four   score     and   seven     NaN     NaN     NaN
9      the   quick   brown     dog  1000.0  1100.0  1200.0
10     you     see    spot     run     NaN     NaN     NaN
11    four   score     and   seven     NaN     NaN     NaN
You can get rid of the nulls with dropna:
(df
 .pivot_longer(
     index='Col*',
     names_to=(".value", ".value"),
     names_pattern=r"(.+)\.\d+(.+)")
 .dropna()
)
  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog   100.0   200.0   300.0
1     you     see    spot     run   101.0   201.0   301.0
2    four   score     and   seven   102.0   202.0   302.0
3     the   quick   brown     dog   400.0   500.0   600.0
4     you     see    spot     run   401.0   501.0   601.0
6     the   quick   brown     dog   700.0   800.0   900.0
9     the   quick   brown     dog  1000.0  1100.0  1200.0
Another option, as pointed out by @Mozway, is to convert the columns into a MultiIndex and stack:
temp = df.set_index(['Column1', 'Column2', 'Column3','Column4'])
# this is where the MultiIndex is created
cols = temp.columns.str.split('.', expand=True)
temp.columns = cols
# now we stack
# nulls are dropped by default
temp = temp.stack(level=1).droplevel(-1)
temp.columns = temp.columns.map('.'.join)
temp.reset_index()
  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog   100.0   200.0   300.0
1     the   quick   brown     dog   400.0   500.0   600.0
2     the   quick   brown     dog   700.0   800.0   900.0
3     the   quick   brown     dog  1000.0  1100.0  1200.0
4     you     see    spot     run   101.0   201.0   301.0
5     you     see    spot     run   401.0   501.0   601.0
6    four   score     and   seven   102.0   202.0   302.0
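
For completeness, the same reshape can also be done with plain pandas by looping over the line numbers and concatenating; a sketch, assuming the four line groups shown in the example:
import pandas as pd

id_cols = ['Column1', 'Column2', 'Column3', 'Column4']
frames = []
for n in range(4):  # Line.0 .. Line.3, as in the example columns
    renames = {f'Line.{n}.{s}': f'Line.{s}' for s in ('a', 'b', 'c')}
    frames.append(df[id_cols + list(renames)].rename(columns=renames))

out = (pd.concat(frames, ignore_index=True)
         .dropna(subset=['Line.a', 'Line.b', 'Line.c'])
         .reset_index(drop=True))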
Here is an approach that works. It uses melt and then joins.
new_df contains what you need, though the order of the rows might differ. The function takes three parameters: your DataFrame, the keys that remain static, and a conversion dict that tells it what goes where.
import pandas as pd

def vars_to_cases(df: pd.DataFrame, keys: list, convertion_dict: dict):
    vals = list(convertion_dict.values())
    l = len(vals[0])
    if not all(len(item) == l for item in vals):
        raise Exception("Dictionary values don't have the same length")
    tempkeys = keys.copy()
    tempkeys.append("variable")
    df_data = pd.DataFrame()
    for short_name, my_list in convertion_dict.items():
        # map each original column name to its line number (0, 1, 2, ...)
        my_replace_dict = {}
        for count, item in enumerate(my_list):
            my_replace_dict[item] = count
        # melt one set of columns, turn the column names into line numbers,
        # and rename the melted value column to its short name
        mydf = pd.melt(df, id_vars=tempkeys[:-1], value_vars=my_list)
        mydf["variable"].replace(my_replace_dict, inplace=True)
        mydf.rename(columns={"value": short_name}, inplace=True)
        mydf = mydf.set_index(tempkeys)
        # join the melted sets together on keys + line number
        if df_data.empty:
            df_data = mydf.copy()
        else:
            df_data = df_data.join(mydf)
    return df_data
# here is the data
df = pd.DataFrame({'Column1': {0: 'the', 1: 'you', 2: 'four'},
                   'Column2': {0: 'quick', 1: 'see', 2: 'score'},
                   'Column3': {0: 'brown', 1: 'spot', 2: 'and'},
                   'Column4': {0: 'dog', 1: 'run', 2: 'seven'},
                   'Line.0.a': {0: 100, 1: 101, 2: 102},
                   'Line.0.b': {0: 200, 1: 201, 2: 202},
                   'Line.0.c': {0: 300, 1: 301, 2: 302},
                   'Line.1.a': {0: 400.0, 1: 401.0, 2: None},
                   'Line.1.b': {0: 500.0, 1: 501.0, 2: None},
                   'Line.1.c': {0: 600.0, 1: 601.0, 2: None},
                   'Line.2.a': {0: 700.0, 1: None, 2: None},
                   'Line.2.b': {0: 800.0, 1: None, 2: None},
                   'Line.2.c': {0: 900.0, 1: None, 2: None},
                   'Line.3.a': {0: 1000.0, 1: None, 2: None},
                   'Line.3.b': {0: 1100.0, 1: None, 2: None},
                   'Line.3.c': {0: 1200.0, 1: None, 2: None}})

convertion_dict = {"Line.a": ["Line.0.a", "Line.1.a", "Line.2.a", "Line.3.a"],
                   "Line.b": ["Line.0.b", "Line.1.b", "Line.2.b", "Line.3.b"],
                   "Line.c": ["Line.0.c", "Line.1.c", "Line.2.c", "Line.3.c"]}
keys = ["Column1", "Column2", "Column3", "Column4"]

new_df = vars_to_cases(df, keys, convertion_dict)
new_df = new_df.reset_index()
new_df = new_df.dropna()
new_df = new_df.drop(columns="variable")

Create a dataframe from a series with a TimeSeriesIndex multiplied by another series

Let's say I have a series, ser1, with a TimeSeriesIndex of length x. I also have another series, ser2, of length y. How do I multiply these so that I get a DataFrame of shape (x, y), where the index comes from ser1 and the columns are the index of ser2? I want every element of ser2 to be multiplied by the value of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this, using np.outer with the pandas DataFrame constructor:
import numpy as np

pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
              a    b    c    d    e
2021-01-01  100  200  300  400  500
2021-01-02  105  210  315  420  525
2021-01-03  110  220  330  440  550
2021-01-04  114  228  342  456  570
2021-01-05   89  178  267  356  445
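
The same outer product can also be written as a NumPy broadcast, which makes the row-by-column multiplication explicit; a sketch:
import numpy as np
import pandas as pd

result = pd.DataFrame(ser1.to_numpy()[:, None] * test_ser2.to_numpy(),
                      index=ser1.index, columns=test_ser2.index)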

Plot a Trellis Stacked Bar Chart in Altair by combining column values

I'd like to plot a Trellis Stacked Bar Chart graph like in the example Trellis Stacked Bar Chart.
I have this dataset:
pd.DataFrame({
    'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
    'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
    'read': [3, 0, 0, 114, 27, 82],
    'write': [70, 0, 0, 45, 655, 203],
    'read-write': [313, 322, 45, 89, 90, 12]
})
  storage project  read  write  read-write
0   dev01   omega     3     70         313
1   dev01   alpha     0      0         322
2   dev01    beta     0      0          45
3   dev02   omega   114     45          89
4   dev02    beta    27    655          90
5   dev03   alpha    82    203          12
What I can't figure out is how to specify the read, write, read-write columns as the colors / values for Altair.
Your data is wide-form, and must be converted to long-form to be used in Altair encodings. See Long-Form vs. Wide-Form Data in Altair's documentation for more information.
This can be addressed by modifying the input data in Pandas using pd.melt, but it is often more convenient to use Altair's Fold Transform to do this reshaping within the chart specification. For example:
import pandas as pd
import altair as alt
df = pd.DataFrame({
    'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
    'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
    'read': [3, 0, 0, 114, 27, 82],
    'write': [70, 0, 0, 45, 655, 203],
    'read-write': [313, 322, 45, 89, 90, 12]
})

alt.Chart(df).transform_fold(
    ['read', 'write', 'read-write'],
    as_=['mode', 'value']
).mark_bar().encode(
    x='value:Q',
    y='project:N',
    column='storage:N',
    color='mode:N'
).properties(
    width=200
)
You need to melt your desired columns into a new column:
# assuming your DataFrame is assigned to `df`
cols_to_melt = ['read', 'write', 'read-write']
cols_to_keep = df.columns.difference(cols_to_melt)
df = df.melt(cols_to_keep, cols_to_melt, 'mode')
So you get the following:
   project storage        mode  value
0    omega   dev01        read      3
1    alpha   dev01        read      0
2     beta   dev01        read      0
3    omega   dev02        read    114
4     beta   dev02        read     27
5    alpha   dev03        read     82
6    omega   dev01       write     70
7    alpha   dev01       write      0
8     beta   dev01       write      0
9    omega   dev02       write     45
10    beta   dev02       write    655
11   alpha   dev03       write    203
12   omega   dev01  read-write    313
13   alpha   dev01  read-write    322
14    beta   dev01  read-write     45
15   omega   dev02  read-write     89
16    beta   dev02  read-write     90
17   alpha   dev03  read-write     12
Then in the altair snippet, instead of color='site', use color='mode'.
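
Put together, the chart spec with the melted frame would look roughly like this, mirroring the encodings from the fold-transform example above:
import altair as alt

alt.Chart(df).mark_bar().encode(
    x='value:Q',
    y='project:N',
    column='storage:N',
    color='mode:N'
).properties(
    width=200
)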

Apply set_index over groupby object in order to apply asfreq per group

I'm looking to apply padding over each group of my data frame.
Notice that for a single group ('element_id') I have no problem padding.
First group (group1):
{'date': {88: datetime.date(2017, 10, 3), 43: datetime.date(2017, 9, 26), 159: datetime.date(2017, 11, 8)}, u'element_id': {88: 122, 43: 122, 159: 122}, u'VALUE': {88: '8.0', 43: '2.0', 159: '5.0'}}
So I'm applying padding over it (which works great):
print(group1.set_index('date').asfreq('D', method='pad').head())
I'm looking to apply this logic over several groups through groupby.
Another group (group2):
{'date': {88: datetime.date(2017, 10, 3), 43: datetime.date(2017, 9, 26), 159: datetime.date(2017, 11, 8)}, u'element_id': {88: 122, 43: 122, 159: 122}, u'VALUE': {88: '8.0', 43: '2.0', 159: '5.0'}}
group_data=pd.concat([group1,group2],axis=0)
group_data.groupby(['element_id']).set_index('date').resample('D').asfreq()
And I'm getting the following error:
AttributeError: Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method
First, there is a problem: your date column has dtype object, not datetime, so it is first necessary to convert it with to_datetime.
Then it is possible to use GroupBy.apply:
group_data['date'] = pd.to_datetime(group_data['date'])
df = (group_data.groupby(['element_id'])
                .apply(lambda x: x.set_index('date').resample('D').ffill()))
print (df.head())
                       VALUE  element_id
element_id date
122        2017-09-26    2.0         122
           2017-09-27    2.0         122
           2017-09-28    2.0         122
           2017-09-29    2.0         122
           2017-09-30    2.0         122
Or DataFrameGroupBy.resample:
df = group_data.set_index('date').groupby(['element_id']).resample('D').ffill()
print (df.head())
                       VALUE  element_id
element_id date
122        2017-09-26    2.0         122
           2017-09-27    2.0         122
           2017-09-28    2.0         122
           2017-09-29    2.0         122
           2017-09-30    2.0         122
EDIT:
If there is a problem with duplicate values, the solution is to add a new column for subgroups with unique dates. If you use concat, there is a keys parameter for that:
group1 = pd.DataFrame({'date': {88: datetime.date(2017, 10, 3),
                                43: datetime.date(2017, 9, 26),
                                159: datetime.date(2017, 11, 8)},
                       u'element_id': {88: 122, 43: 122, 159: 122},
                       u'VALUE': {88: '8.0', 43: '2.0', 159: '5.0'}})

d = {'level_0': 'g'}
group_data = pd.concat([group1, group1], keys=('a', 'b')).reset_index(level=0).rename(columns=d)
print (group_data)
     g VALUE        date  element_id
43   a   2.0  2017-09-26         122
88   a   8.0  2017-10-03         122
159  a   5.0  2017-11-08         122
43   b   2.0  2017-09-26         122
88   b   8.0  2017-10-03         122
159  b   5.0  2017-11-08         122
group_data['date'] = pd.to_datetime(group_data['date'])
df = (group_data.groupby(['g', 'element_id'])
                .apply(lambda x: x.set_index('date').resample('D').ffill()))
print (df.head())
                         g VALUE  element_id
g element_id date
a 122        2017-09-26  a   2.0         122
             2017-09-27  a   2.0         122
             2017-09-28  a   2.0         122
             2017-09-29  a   2.0         122
             2017-09-30  a   2.0         122
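
The resample-based variant shown earlier works with the extra subgroup key as well; a sketch:
df = (group_data.set_index('date')
                .groupby(['g', 'element_id'])
                .resample('D')
                .ffill())
print(df.head())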