Nested groupby in DataFrame and aggregate multiple columns - pandas

I am trying to do nested groupby as follows:
>>> df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'Quantity': {0: 60,1: 50, 2: 40, 3: 30, 4: 20, 5: 10}, 'UiD':{0:1,1:1,2:1,3:2,4:2,5:3}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'Sign':{0:1,1:1,2:0,3:-1,4:0,5:-1}, 'leg1':{0:2,1:2,2:4,3:5,4:7,5:8}})
>>> df1
Date Quantity Sign StartTime Stock UiD leg1
0 2016-10-11 60 1 08:00:00.241 ABC 1 2
1 2016-10-11 50 1 08:00:00.243 ABC 1 2
2 2016-10-11 40 0 12:34:23.563 ABC 1 4
3 2016-10-11 30 -1 08:14.05.908 ABC 2 5
4 2016-10-11 20 0 18:54:50.100 ABC 2 7
5 2016-10-12 10 -1 10:08:36.657 XYZ 3 8
>>> dfg1=df1.groupby(['Date','Stock'])
>>> dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Date Stock
2016-10-11 ABC 90
2016-10-12 XYZ 10
dtype: int64
>>>
>>> dfg1['leg1'].sum()
Date Stock
2016-10-11 ABC 20
2016-10-12 XYZ 8
Name: leg1, dtype: int64
So far so good. Now I am trying to concatenate the two results into a new DataFrame df2 as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
0 1
Date Stock
2016-10-11 ABC 20 90
2016-10-12 XYZ 8 10
>>>
I am wondering if there is a better way to re-write following line in order to avoid repetition of groupby(['Date','Stock'])
dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Also this fails if ['Date','Stock'] contains 'UiD' as one of the keys or if ['Date','Stock'] is replaced by just ['UiD'].

Please restate your question to be clearer. You want to groupby(['Date','Stock']), then:
take only the first record for each UiD and sum (aggregate) its
Quantity, but also
sum all leg1 values for that Date,Stock
combination (not just the first-for-each-UiD). Is that right?
Anyway you want to perform an aggregation (sum) on multiple columns, and yeah the way to avoid repetition of groupby(['Date','Stock']) is to keep one dataframe, not try to stitch together two dataframes from two individual aggregate operations. Something like the following (I'll fix it once you confirm this is what you want):
def filter_first_UiD(g):
#return g.groupby('UiD').first().agg(np.sum)
return g.groupby('UiD').first().agg({'Quantity':'sum', 'leg1':'sum'})
df1.groupby(['Date','Stock']).apply(filter_first_UiD)

The way I dealt with the last scenario of avoiding groupby to fail if ['Date','Stock'] contains 'UiD' as one of the keys or if ['Date','Stock'] is replaced by just ['UiD'] is as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1[].first() if 'UiD' in `['Date','Stock']` else dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
But more elegant solution is still an open question.

Related

How to make difference between two pandas dataframes?

I have two pandas dataframes:
import pandas as pd
df_1 = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
'Instrument': ['temp', 'temp_sensor', 'temp_sensor',
'sensor', 'sensor'],
'Value': [1000, 0, 1000, 0, 1000]})
print(df_1)
ID Instrument Value
1 temp 1000
2 temp_sensor 0
4 temp_sensor 1000
7 sensor 0
30 sensor 1000
df_2 = pd.DataFrame({'ID': [1, 30],
'Instrument': ['temp', 'sensor'],
'Value': [1000, 1000]})
print(df_2)
ID Instrument Value
1 temp 1000
30 sensor 1000
I need to exclude from df_1 the lines that also exist in df_2. So I made the code:
combined = df_1.append(df_2)
combined[~combined.index.duplicated(keep=False)]
The (wrong) output is:
ID Instrument Value
4 temp_sensor 1000
7 sensor 0
30 sensor 1000
I would like the output to be:
ID Instrument Value
2 temp_sensor 0
4 temp_sensor 1000
7 sensor 0
I relied on what was explained in: How to remove a pandas dataframe from another dataframe
Use DataFrame.merge by all columns names with left join and parameter indicator=True and filter rows with left_only values:
s = df_1.merge(df_2, on=list(df_1.columns), how='left', indicator=True)['_merge']
df = df_1.loc[s == 'left_only']
print(df)
ID Instrument Value
1 2 temp_sensor 0
2 4 temp_sensor 1000
3 7 sensor 0

Pandas transform rows with specific character

i am working on features transformation, and ran into this issue. Let me know what you think. Thanks!
I have a table like this
And I want to create an output column like this
Some info:
All the outputs will be based on numbers that end with a ':'
I have 100M+ rows in this table. Need to consider performance issue.
Let me know if you have some good ideas. Thanks!
Here is some copy and paste-able sample data:
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to identify the values, otherwise return np.nan. Then, use ffill() to fill down on nan values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),df['Number'].str.split(':').str[0],np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - Even easier and potentially better performance you can do some regex with str.extract() and then again ffill():
df['Output'] = df['Number'].str.extract('^(\d+):').ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
I think this is what you are looking for:
import pandas as pd
c = ['Number']
d = ['1:00',100,1001,1321,3254,'15:00',20,60,80,90,'4:00',26,45,90,89]
df = pd.DataFrame(data=d,columns=c)
temp= df['Number'].str.split(":", n = 1, expand = True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
Looks like your DataFrame has string values. I considered them as a mix of numbers and strings.
Here's the solution if df['Number'] is all strings.
df1 = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp= df1['Number'].str.split(":", n = 1, expand = True)
temp.loc[temp[1].astype(bool) != False, 'New_val'] = temp[0]
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

Compare two data frames for different values in a column

I have two dataframe, please tell me how I can compare them by operator name, if it matches, then add the values ​​of quantity and time to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)

Apply rolling function to groupby over several columns

I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.
In principle, I'm try to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with one id=33 although we expect several id's.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
{'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
id A B
date
2017-02-05 33 20 10
2017-02-07 33 20 30
2017-02-09 33 0 10
2017-02-11 33 3 0
2017-02-13 33 7 0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt of groupby with a pd.rolling_sum as follows didn't work, due to repeated dates
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that from the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html 'window' is an int representing the size sample period, not the number of days to sample.
We can also try resampling and using last, however the desired look-back of 3 days doesn't seem to be used
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())
id item date
33 A 2017-02-05 20
2017-02-07 0
2017-02-09 0
2017-02-11 3
2017-02-13 4
B 2017-02-05 10
2017-02-07 10
Of course,setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work but is computationally-intensive and doesn't exploit groupby's nice vectorization.
Thanks to #jezrael for good suggestions so far
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here:https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be in an int or offset but on attempting df.rolling(window='3D',...) I getraise ValueError("window must be an integer")
It appears that the above documentation is not consistent with the latest code for rolling's window from ./core/window.py :
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
raise ValueError("window must be an integer")
It's easiest to handle resample and rolling with date frequencies when we have a single level datetime index.
However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs so I groupby and sum
I unstack one level date so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T
Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index
Finally, I do a rolling 3 day sum and resample that result every 2 days with resample
I clean this up with a bit of renaming indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, 1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)
# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])
# convert index to datetime index
df = df.set_index('date')
df.index = DatetimeIndex(df.index)
# rolloing sum by 3D
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate for each row the multiplicand for each row above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to #bananafish for locating the inherent cumprod method.