How can a pandas DataFrame with a TimedeltaIndex be grouped by nearest whole day?

I've got a pandas DataFrame with an index of pd.Timedelta values, some of which are fractions of days. I'd like to use df.groupby to group the rows by whole days (ignoring the fractional parts) so that I can calculate the mean.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
data = [[1,2,3], [2,3,4], [3,4,5], [1,2,3], [2,3,4], [3,4,5]]
idx = [pd.Timedelta('1.2 days'), pd.Timedelta('1.2 days'), pd.Timedelta('3.8 days'), pd.Timedelta('3.8 days'), pd.Timedelta('4.2 days'), pd.Timedelta('4.2 days')]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df.index = idx
df
Out:
a b c
1 days 04:48:00 1 2 3
1 days 04:48:00 2 3 4
3 days 19:12:00 3 4 5
3 days 19:12:00 1 2 3
4 days 04:48:00 2 3 4
4 days 04:48:00 3 4 5
The line below produces the desired result; however, it creates extra rows for the days with no data, leaving rows full of NaNs that I subsequently remove with df.dropna(). Is there a better approach?
df.groupby(pd.Grouper(freq='D')).aggregate(np.mean).dropna()

Your approach is fine, or you can just group by df.index.days as below:
In [196]: df.groupby(df.index.days).mean()
Out[196]:
a b c
1 1.5 2.5 3.5
3 2.0 3.0 4.0
4 2.5 3.5 4.5
The difference between the two methods is where things get grouped at the margins. In yours, a row at 2 days 02:00:00 would get grouped with the 1-day rows, since pd.Grouper starts its bins at the first index value (1 days 04:48:00); in mine, it would get its own group, since .days truncates each timedelta to whole days and so effectively treats midnight as the start of a new group.
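To see that margin behaviour concretely, here is a small sketch (the 2 days 02:00:00 row is an added illustration, not part of the original data; the comments reflect the behaviour described above):
import pandas as pd

df2 = pd.DataFrame({'a': [1, 2]},
                   index=[pd.Timedelta('1.2 days'), pd.Timedelta('2 days 02:00:00')])

# pd.Grouper bins from the first index value (1 days 04:48:00), so both
# rows land in the same daily bin:
print(df2.groupby(pd.Grouper(freq='D')).mean())

# .days truncates to whole days, so the rows land in separate groups (1 and 2):
print(df2.groupby(df2.index.days).mean())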

Related

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done by looping month by month and checking whether the related transactions_ids contain the given transaction_id, but I'm wondering if there is a more efficient way.
Cheers
The best way in my opinion is to explode your data frame and avoid having Python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S.: be aware of the duplicated indexes that explode outputs. In most cases it is better to do explode(...).reset_index(drop=True) to avoid unwanted behavior.
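If you need many lookups, a natural follow-up (a sketch, assuming each id maps to exactly one month as stated in the question) is to invert the exploded frame once into an id-to-month Series:
# Build the lookup once; each subsequent lookup is a plain label access.
lookup = (df.explode('transactions_ids')
            .set_index('transactions_ids')['month'])
print(lookup.loc[4])  # 2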
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in with read_csv):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains(r'\b4\b'), 'month']
(The \b word boundaries keep 4 from also matching multi-digit ids such as 14 or 40.)
In case your transactions_ids are real lists, you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
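Putting that together for the original question (a sketch, assuming the cells hold real Python lists rather than strings), the boolean mask from map feeds straight into .loc:
id_to_find = 4
months = df.loc[df['transactions_ids'].map(lambda ids: id_to_find in ids), 'month']
print(months.iloc[0])  # 2, since each id belongs to exactly one month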

group values in pandas and sum after all dates

I have a pandas dataframe like this:
date id flow type
2020-04-26 1 3 A
2020-04-27 2 4 A
2020-04-28 1 2 A
2020-04-26 1 -3 B
2020-04-27 1 4 B
2020-04-28 2 3 B
2020-04-26 3 0 C
2020-04-27 2 5 C
I also have a dictionary of 'trailing date' keys like this:
{'T-1': Timestamp('2020-04-27'),
 'T-2': Timestamp('2020-04-26')}
I would like to sum the flows for each type, grouped by the keys in my dictionary, where the sum of flows includes everything from the trailing date onward but excludes the flows on the most recent date. In other words, I would like to have this:
type T-1 T-2
A 4 7
B 4 1
Why did I get 4 for T-1 at A? Because if today is the 28th, then T-1 is the 27th, hence the answer is 4. Likewise for T-2, it's 3 + 4 = 7, and so on.
I tried:
df2 = df.groupby(["type","date"])['flow'].sum().unstack("type")
I'm somewhat stuck on what to do after this. Thanks
Tough problem. There might be a more elegant way to do this, but here is what I came up with.
import pandas as pd

dates1 = pd.Series(range(3), index=pd.date_range('2020-04-26', freq='D', periods=3)).index
dates2 = dates1.copy()
dates3 = dates1.copy()[0:-1]
dates = dates1.append([dates2, dates3])
types = ['A']*3 + ['B']*3 + ['C']*2
df = pd.DataFrame({'date': dates, 'id': [1, 2, 1, 1, 1, 2, 3, 2],
                   'flow': [3, 4, 2, -3, 4, 3, 0, 5], 'type': types})
dates_dict = {'T-1': pd.Timestamp('2020-04-27'), 'T-2': pd.Timestamp('2020-04-26')}

grouped_df = df.groupby(["type", "date"])['flow'].sum()

new_dict = {}
for key in dates_dict:
    sums_list = []
    # loop through the unique values of the first index level: 'A', 'B', 'C'
    types = grouped_df.index.get_level_values(0).unique()
    new_dict.update({'type': types})
    for letter in types:
        # sum the flows on dates on or after the timestamp for this key,
        # but leave out the most recent date
        sums_list.append(grouped_df[letter][grouped_df[letter].index >= dates_dict[key]].iloc[:-1].sum())
    new_dict.update({key: sums_list})

final_df = pd.DataFrame(new_dict)
Output:
>>> final_df
type T-1 T-2
0 A 4 7
1 B 4 1
2 C 0 0
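The same per-type logic can also be compressed into a dictionary comprehension over the grouped series (a sketch of the identical idea, not a behavioural change):
# For each trailing date: keep each type's flows on/after that date,
# drop that type's most recent date, and sum what remains.
final_df2 = pd.DataFrame({
    key: grouped_df.groupby(level='type').apply(
        lambda s, start=start: s[s.index.get_level_values('date') >= start].iloc[:-1].sum())
    for key, start in dates_dict.items()
}).reset_index()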

Generate list of values summing to 1 - within groupby?

In the spirit of Generating a list of random numbers, summing to 1 from several years ago, is there a way to apply the NumPy array returned by np.random.dirichlet across a groupby on the dataframe?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating the unique values and filtering the dataframe for each. This is small but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform():
def dirichlet(x, size=1):
    # one Dirichlet draw of length len(x); the values sum to 1
    return np.random.dirichlet(np.ones(len(x)), size=size)[0]

df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
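As a quick sanity check, each group's proportions should sum to 1 up to floating-point rounding:
print(df.groupby('letter')['prop_of_grp'].sum())
# both a and b print as 1.0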

Why use to_frame before reset_index?

Using a data set like this one
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because reset_index on a Series accepts a name parameter and returns a DataFrame, while reset_index on a DataFrame does not accept a name parameter.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in this commit on the 27th of January 2012.
Series.to_frame was added in this commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(); both approaches achieve the same result, and it is common in pandas for several approaches to solve the same problem. The only advantage I can think of is that for larger data sets, it may be convenient to have a DataFrame view first before resetting the index: taking your dataframe as an example, to_frame() displays a neat DataFrame table rather than a count Series, which may help in understanding the data. The usage of to_frame() also makes the intent clearer to a new user who reads your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a Dataframe. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
And now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We will use the same count() series, but name it test2 to differentiate between the two approaches; in other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see both dataframes are equivalent, and in the second approach we just had to use reset_index(name='count') to both reset the index and rename the column name because Series.reset_index() does have a name parameter.
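For a stricter check that also compares dtypes in one call, DataFrame.equals can be used:
In [32]: testdf1.equals(testdf2)
Out[32]: True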
The second approach has less code but is less readable to new eyes, and I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".

Pandas - calculate rolling average of group excluding current row

For an example:
import pandas as pd

data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Date': [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
        'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill:
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
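If you want the result assigned back as a column, a transform-based variant of the same logic may be more convenient than apply, since it stays aligned to the original index (a sketch, same rolling-then-shift idea):
# transform returns a series aligned to df's original index,
# so it can be assigned directly as a new column.
df['avg'] = df.groupby('Platoon')['Casualties'].transform(
    lambda x: x.rolling(2, 1).mean().shift().bfill())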