Pandas: Flatten a complex multi-level column dataframe

I initially had a dataframe with the columns ID and Date, and I wanted to find the first and last Date entry for every ID.
Therefore I applied an aggregation function:
df.groupby('ID').agg({'Date':['first','last']})
This gives me a dataframe with columns in the following form:
print(df.columns)
>> MultiIndex(levels=[['Date', 'ID', 'difference'], ['first', 'last', '']],
labels=[[1, 0, 0, 2], [2, 0, 1, 2]])
I want to flatten this dataframe so that the columns are a single level.
I tried using df.reset_index(level=[0])
and also used df.unstack() but couldn't get the desired result.
Any leads on how to solve this problem?

I think you need to change the aggregation call to avoid a MultiIndex in the columns: select the column to aggregate and pass a list of aggregating functions:
rng = pd.date_range('2017-04-03', periods=10)
df = pd.DataFrame({'Date': rng, 'id': [23] * 5 + [35] * 5})
print (df)
Date id
0 2017-04-03 23
1 2017-04-04 23
2 2017-04-05 23
3 2017-04-06 23
4 2017-04-07 23
5 2017-04-08 35
6 2017-04-09 35
7 2017-04-10 35
8 2017-04-11 35
9 2017-04-12 35
df1 = df.groupby('id')['Date'].agg(['first','last']).reset_index()
print (df1)
id first last
0 23 2017-04-03 2017-04-07
1 35 2017-04-08 2017-04-12
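If you already have the MultiIndex columns and just want to flatten them, a common approach (my own sketch, continuing from the example above, not part of the original answer) is to join the two levels into single strings:
df2 = df.groupby('id').agg({'Date': ['first', 'last']})
# join the two column levels, e.g. ('Date', 'first') -> 'Date_first'
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns]
df2 = df2.reset_index()
print (df2)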

Related

Sorting date values in a dataframe doesn't work

I have the column 'Created At', whose dates are in the format '%d/%m/%Y' (day, month, year):
obj = {'Created At': ['01/01/2017', '01/02/2017', '02/01/2017', '02/02/2017',
                      '03/01/2017', '03/02/2017', '04/01/2017'],
       'Text': [1, 70, 14, 17, 84, 76, 32]}
df = pd.DataFrame(data=obj)
I tried this, but it doesn't work:
df.sort_values(by='Created At', inplace=True)
It seems that it sorts only the days and disregards the month. What do I do?
It does sort properly: your dates are strings here, and strings are sorted lexicographically. That means the second character is only compared when the first characters are equal, and so on.
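A quick illustration of the difference (my own sketch, not part of the original answer):
import pandas as pd
dates = ['02/01/2017', '01/02/2017']   # 2 Jan 2017 and 1 Feb 2017 in %d/%m/%Y
print(sorted(dates))                                      # string order puts '01/02/2017' (1 Feb) first
print(sorted(pd.to_datetime(dates, format='%d/%m/%Y')))   # chronological order: 2 Jan, then 1 Feb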
You therefore might want to convert the column first to datetime objects:
df['Created At'] = pd.to_datetime(df['Created At'], format='%d/%m/%Y')
then we can sort the dataframe, and obtain:
>>> df.sort_values(by='Created At', inplace=True)
>>> df
Created At Text
0 2017-01-01 1
2 2017-01-02 14
4 2017-01-03 84
6 2017-01-04 32
1 2017-02-01 70
3 2017-02-02 17
5 2017-02-03 76

Pandas join (merge?) dataframes, keep only unique indices

I have a data frame with a date index in which a few dates somehow went missing; I'll call this dataframe A. I have another data frame that includes the dates in question; I'll call it dataframe B.
I’d like to merge two dataframes:
Keep all indices of A and join it with B, but I don’t want any of the rows in B that share an index with A. That is, I want only the rows missing from A returned from B.
How is this most easily achieved?
Note:
This applies across a whole database of data I have; I'll be doing it roughly 400 times.
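A minimal setup illustrating the situation (my own sketch; the date range and the column name val are assumptions, not from the question):
import pandas as pd

# B covers the full daily range; A has the same kind of data but two dates are missing.
B = pd.DataFrame({'val': range(6)}, index=pd.date_range('2019-01-01', periods=6))
A = B.drop(B.index[[2, 4]]) * 100   # pretend A's values differ from B's
The goal is then to take only the two missing dates from B and combine them with A.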
If I'm reading the question correctly, what you want is
B[~B.index.isin(A.index)]
For example:
In [192]: A
Out[192]:
Empty DataFrame
Columns: []
Index: [1, 2, 4, 5]
In [193]: B
Out[193]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
In [194]: B[~B.index.isin(A.index)]
Out[194]:
Empty DataFrame
Columns: []
Index: [3]
To use the data from A when it's there, and otherwise take it from B, you could then do
pd.concat([A, B[~B.index.isin(A.index)]]).sort_index()
or, assuming that A contains no null elements that you want to keep, you could take a different approach and go for something like
pd.DataFrame(A, index=B.index).fillna(B)
I believe you need Index.difference:
B.loc[B.index.difference(A.index)]
EDIT:
A = pd.DataFrame({'A':range(10)}, index=pd.date_range('2019-02-01', periods=10))
B = pd.DataFrame({'A':range(10, 20)}, index=pd.date_range('2019-01-27', periods=10))
df = pd.concat([A, B.loc[B.index.difference(A.index)]]).sort_index()
print (df)
A
2019-01-27 10
2019-01-28 11
2019-01-29 12
2019-01-30 13
2019-01-31 14
2019-02-01 0
2019-02-02 1
2019-02-03 2
2019-02-04 3
2019-02-05 4
2019-02-06 5
2019-02-07 6
2019-02-08 7
2019-02-09 8
2019-02-10 9
df1= pd.concat([A, B])
df1 = df1[~df1.index.duplicated()].sort_index()
print (df1)
A
2019-01-27 10
2019-01-28 11
2019-01-29 12
2019-01-30 13
2019-01-31 14
2019-02-01 0
2019-02-02 1
2019-02-03 2
2019-02-04 3
2019-02-05 4
2019-02-06 5
2019-02-07 6
2019-02-08 7
2019-02-09 8
2019-02-10 9
Although there are already good answers, I want to share this one because it's so short:
pd.concat([A, B]).drop_duplicates(keep='first')

How to create a function to convert monthly data to daily, weekly in pandas dataframe?

I have the below monthly data in the dataframe and I need to convert the data to weekly, daily, biweekly.
date chair_price vol_chair
01-09-2018 23 30
01-10-2018 53 20
daily: the price stays the same and vol_chair is divided by the number of days in the month
weekly: the price stays the same and vol_chair is divided by the number of weeks in the month
expected output:
daily:
date chair_price vol_chair
01-09-2018 23 1
02-09-2018 23 1
03-09-2018 23 1
..
30-09-2018 23 1
01-10-2018 53 0.64
..
31-10-2018 53 0.64
weekly:
date chair_price vol_chair
02-09-2018 23 6
09-09-2018 23 6
16-09-2018 23 6
23-09-2018 23 6
30-09-2018 23 6
07-10-2018 53 5
14-10-2018 53 5
..
I am using the code below for the vol column. Is there a quick way to do it all together, i.e. keep the price the same while dividing vol by the number of days or weeks in the month?
df.resample('W').ffill().agg(lambda x: x/4)
df.resample('D').ffill().agg(lambda x: x/30)
and I need to use calendar.monthrange(2012, 1)[1] to find the number of days in the month.
import calendar

def func_count_number_of_weeks(df):
    return len(calendar.monthcalendar(df['DateRange'].year, df['DateRange'].month))

def func_convert_from_monthly(df, col, category, columns):
    if category == "Daily":
        df['number_of_days'] = df['DateRange'].dt.daysinmonth
        for column in columns:
            df[column] = df[column] / df['number_of_days']
        df.drop('number_of_days', axis=1, inplace=True)
    elif category == "Weekly":
        df['number_of_weeks'] = df.apply(func_count_number_of_weeks, axis=1)
        for column in columns:
            df[column] = df[column] / df['number_of_weeks']
        df.drop('number_of_weeks', axis=1, inplace=True)
    return df

def func_resample_from_monthly(df, col, category):
    df.set_index(col, inplace=True)
    df.index = pd.to_datetime(df.index, dayfirst=True)
    if category == "Monthly":
        df = df.resample('MS').ffill()
    elif category == "Weekly":
        df = df.resample('W').ffill()
    return df
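For reference, a quick check of the calendar helpers used above (my own sketch, not part of the question):
import calendar
import pandas as pd

print(calendar.monthrange(2018, 9)[1])           # 30 -> days in September 2018
print(pd.Timestamp('2018-09-01').daysinmonth)    # 30 -> the pandas equivalent
print(len(calendar.monthcalendar(2018, 9)))      # 5  -> calendar weeks spanning September 2018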
Use:
#convert to datetimeindex
df.index = pd.to_datetime(df.index, dayfirst=True)
#add new next month for correct resample
idx = df.index[-1] + pd.offsets.MonthBegin(1)
df = df.append(df.iloc[[-1]].rename({df.index[-1]: idx}))
#resample with forward filling values, remove last helper row
#df1 = df.resample('D').ffill().iloc[:-1]
df1 = df.resample('W').ffill().iloc[:-1]
#divide by size of months
df1['vol_chair'] /= df1.resample('MS')['vol_chair'].transform('size')
print (df1)
chair_price vol_chair
date
2018-09-02 23 6.0
2018-09-09 23 6.0
2018-09-16 23 6.0
2018-09-23 23 6.0
2018-09-30 23 6.0
2018-10-07 53 5.0
2018-10-14 53 5.0
2018-10-21 53 5.0
2018-10-28 53 5.0
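For the daily case the same pattern should work; here is a sketch of my own (not part of the original answer) that reuses the commented-out daily resample and divides by the number of days in each month:
df2 = df.resample('D').ffill().iloc[:-1]
df2['vol_chair'] /= df2.index.daysinmonth
print (df2.head())
This gives vol_chair of 1.0 for the September rows (30/30) and roughly 0.645 for the October rows (20/31), matching the expected daily output above.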

Apply rolling function to groupby over several columns

I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.
In essence, I'm trying to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with one id=33 although we expect several id's.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
     {'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
id A B
date
2017-02-05 33 20 10
2017-02-07 33 20 30
2017-02-09 33 0 10
2017-02-11 33 3 0
2017-02-13 33 7 0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt at groupby with pd.rolling_sum as follows didn't work, due to the repeated dates:
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that, per the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html, 'window' is an int representing the size of the sample window, not the number of days to sample.
We can also try resampling and using last, but the desired 3-day look-back doesn't seem to be applied:
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())
id item date
33 A 2017-02-05 20
2017-02-07 0
2017-02-09 0
2017-02-11 3
2017-02-13 4
B 2017-02-05 10
2017-02-07 10
Of course, setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work, but it is computationally intensive and doesn't exploit groupby's nice vectorization.
Thanks to #jezrael for good suggestions so far
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be an int or offset, but on attempting df.rolling(window='3D', ...) I get ValueError: window must be an integer.
It appears that the above documentation is not consistent with the latest code for rolling's window from ./core/window.py :
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
raise ValueError("window must be an integer")
It's easiest to handle resample and rolling with date frequencies when we have a single level datetime index.
However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs so I groupby and sum
I unstack one level date so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T
Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index
Finally, I do a rolling 3 day sum and resample that result every 2 days with resample
I clean this up with a bit of renaming indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, axis=1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)
# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])
# convert index to datetime index
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index)
# rolling sum over a 3-day window
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df

sorting within the keys of group by

I have a groupby table as follows, and I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the ['CPUCore', 'Offline_RetetionAge'] structure). How should I do this?
I think the problem is that the dtype of your second level is object (i.e. string), so sort_index sorts it alphanumerically:
df = pd.DataFrame({'CPUCore': [2, 2, 2, 3, 3],
                   'Offline_RetetionAge': ['100', '1', '12', '120', '15'],
                   'index': [11, 16, 5, 4, 3]}).set_index(['CPUCore', 'Offline_RetetionAge'])
print (df)
index
CPUCore Offline_RetetionAge
2 100 11
1 16
12 5
3 120 4
15 3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
100 11
12 5
3 120 4
15 3
# change MultiIndex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
                     df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
12 5
100 11
3 15 3
120 4
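Starting from the original df with the string level, the same cast can be written more compactly with Index.set_levels (my own sketch, not part of the answer):
df.index = df.index.set_levels(df.index.levels[1].astype(int), level='Offline_RetetionAge')
print (df.sort_index())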
EDIT by comment:
print (df.reset_index()
.sort_values(['CPUCore','index'])
.set_index(['CPUCore','Offline_RetetionAge']))
index
CPUCore Offline_RetetionAge
2 12 5
100 11
1 16
3 15 3
120 4
I think what you mean is this:
import pandas as pd
from pandas import Series, DataFrame
# create what I believe you tried to ask
df = DataFrame( \
[[11,'reproducible'], [16, 'example'], [5, 'a'], [4, 'create'], [9,'!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4+[3],[10,100,1000,11,512]], \
names=['CPUCore', 'Offline_RetentionAge'])
# sort by values and afterwards by index where sort_remaining=False preserves
# the order of index
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
The sort_values('index') call sorts the rows by the values in the 'index' column, and the subsequent sort_index with sort_remaining=False restores the grouping by the first MultiIndex level without changing the order of rows that share the same CPUCore.
I don't know what a "group by table" is supposed to be. If you have a pd.GroupBy object, you won't be able to use sort_values() like that.
You might have to rethink what you group by or use functools.partial and DataFrame.apply
Output:
index bla
CPUCore Offline_RetentionAge
2 11 4 create
1000 5 a
10 11 reproducible
100 16 example
3 512 9 !