When grouping data, how can I cumsum milliseconds in a pandas DataFrame? - pandas

When grouping data, how can I cumsum milliseconds in a DataFrame?
The input is below.
input:
time key isValue
2018-03-04 00:00:06.520 1 NaN
2018-03-04 00:00:07.230 1 NaN
2018-03-04 00:00:08.140 1 1
2018-03-04 00:00:08.720 1 1
2018-03-04 00:00:09.110 1 1
2018-03-04 00:00:09.650 1 NaN
2018-03-04 00:00:10.360 1 NaN
2018-03-04 00:00:11.150 1 NaN
2018-03-04 00:00:11.770 2 NaN
2018-03-04 00:00:12.320 2 NaN
2018-03-04 00:00:12.910 2 1
2018-03-04 00:00:13.250 2 1
2018-03-04 00:00:13.960 2 1
2018-03-04 00:00:14.550 2 NaN
2018-03-04 00:00:15.250 2 NaN
....
And the output I want is below.
output
key : time
1 : 1.030
2 : 1.050
3 : X.xxx
4 : X.xxx
....
I'm using this code:
df.groupby(["key"])["time"].cumsum()
but I don't think it is correct.

I think you need:
df['new'] = df["time"].dt.microsecond.groupby(df["key"]).cumsum() / 1000
print (df)
time key isValue new
0 2018-03-04 00:00:06.520 1 NaN 520.0
1 2018-03-04 00:00:07.230 1 NaN 750.0
2 2018-03-04 00:00:08.140 1 1.0 890.0
3 2018-03-04 00:00:08.720 1 1.0 1610.0
4 2018-03-04 00:00:09.110 1 1.0 1720.0
5 2018-03-04 00:00:09.650 1 NaN 2370.0
6 2018-03-04 00:00:10.360 1 NaN 2730.0
7 2018-03-04 00:00:11.150 1 NaN 2880.0
8 2018-03-04 00:00:11.770 2 NaN 770.0
9 2018-03-04 00:00:12.320 2 NaN 1090.0
10 2018-03-04 00:00:12.910 2 1.0 2000.0
11 2018-03-04 00:00:13.250 2 1.0 2250.0
12 2018-03-04 00:00:13.960 2 1.0 3210.0
13 2018-03-04 00:00:14.550 2 NaN 3760.0
14 2018-03-04 00:00:15.250 2 NaN 4010.0
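For context, here is a minimal self-contained sketch of the same idea, reconstructing only the first few sample rows (the column names match the input above):

import pandas as pd

# rebuild a few rows of the sample input with time parsed as datetime64
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-03-04 00:00:06.520", "2018-03-04 00:00:07.230",
        "2018-03-04 00:00:08.140", "2018-03-04 00:00:08.720",
    ]),
    "key": [1, 1, 1, 1],
})

# .dt.microsecond extracts only the sub-second component (in microseconds);
# cumsum it per key, then divide by 1000 to express it in milliseconds
df["new"] = df["time"].dt.microsecond.groupby(df["key"]).cumsum() / 1000
print(df)

This reproduces the first rows of the output above (520.0, 750.0, 890.0, 1610.0).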

Related

Moving Average Pandas Across Group

My data has the following structure:
import numpy as np
import pandas as pd

np.random.seed(25)
tdf = pd.DataFrame({'person_id': [1,1,1,1,
                                  2,2,
                                  3,3,3,3,3,
                                  4,4,4,
                                  5,5,5,5,5,5,5,
                                  6,
                                  7,7,
                                  8,8,8,8,8,8,8,
                                  9,9,
                                  10,10],
                    'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
                             '2021-01-02','2021-01-05','2021-01-07',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                             '2021-01-02',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05'],
                    'Quantity': np.floor(np.random.random(size=35)*100)})
And I want to calculate a moving average (2 periods) over Date. For the first MA, we take 2021-01-02 & 2021-01-05 across all observations and calculate the MA (50); similarly for the other dates. The output need not be in the structure I'm describing; I just need the date & MA columns in the final data.
Thanks!
IIUC, you can aggregate the identical dates first, getting the sum and count per date.
Then take the sum over a rolling window of 2 dates (it doesn't look like you want a defined time period but rather raw successive values, so I am assuming the dates are already sorted).
Finally, take the ratio of sum and count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
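As a sanity check, the 2021-01-05 value is just the plain mean of every row on the first two dates (a quick sketch, assuming tdf as defined above):

# mean of all rows dated 2021-01-02 or 2021-01-05 should match the first non-NaN MA value
first_two = tdf[tdf['Date'].isin(['2021-01-02', '2021-01-05'])]
print(first_two['Quantity'].mean())   # -> 50.210526...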
joining the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000
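An equivalent way to write the same computation is to aggregate sum and count in one pass; this is a style sketch, not a different result, and it assumes tdf as defined above:

# aggregate per date once, then take the rolling ratio and join back
agg = tdf.groupby('Date')['Quantity'].agg(['sum', 'count'])
ma2 = (agg['sum'].rolling(2).sum() / agg['count'].rolling(2).sum()).rename('Quantity_MA(2)')
out = tdf.merge(ma2, left_on='Date', right_index=True)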

Pandas DF fill in for Missing Months

I have a dataframe of values that are mostly (but not always) quarterly.
I need to fill in any missing months so it is complete.
Here I need to put it into a complete df from 2015-12 to 2021-03.
Thank you.
id date amt rate
0 15856 2015-12-31 85.09 0.0175
1 15857 2016-03-31 135.60 0.0175
2 15858 2016-06-30 135.91 0.0175
3 15859 2016-09-30 167.27 0.0175
4 15860 2016-12-31 173.32 0.0175
....
19 15875 2020-09-30 305.03 0.0175
20 15876 2020-12-31 354.09 0.0175
21 15877 2021-03-31 391.19 0.0175
You can use pd.date_range() with freq='M' to generate a list of month-end dates, then reindex on that datetime index.
df_ = df.set_index('date').reindex(pd.date_range('2015-12', '2021-03', freq='M')).reset_index().rename(columns={'index': 'date'})
print(df_)
date id amt rate
0 2015-12-31 15856.0 85.09 0.0175
1 2016-01-31 NaN NaN NaN
2 2016-02-29 NaN NaN NaN
3 2016-03-31 15857.0 135.60 0.0175
4 2016-04-30 NaN NaN NaN
.. ... ... ... ...
58 2020-10-31 NaN NaN NaN
59 2020-11-30 NaN NaN NaN
60 2020-12-31 15876.0 354.09 0.0175
61 2021-01-31 NaN NaN NaN
62 2021-02-28 NaN NaN NaN
To fill the NaN values, you can use df_.fillna(0).
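Putting it together as one sketch: note that to actually include 2021-03-31 the range bound has to be the month end itself, since '2021-03' alone parses to 2021-03-01 and the March month end is then dropped by freq='M' (which is why the output above stops at 2021-02-28). Restricting fillna to the value columns is an assumption about what should be zero-filled:

import pandas as pd

# month-end index covering the full span, including 2021-03-31
months = pd.date_range('2015-12-31', '2021-03-31', freq='M')

df_ = (df.set_index('date')
         .reindex(months)
         .reset_index()
         .rename(columns={'index': 'date'}))

# zero-fill only the value columns, leaving 'id' as NaN for the inserted months
df_[['amt', 'rate']] = df_[['amt', 'rate']].fillna(0)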

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1-minute blocks. The code appears to work fine, but when I look at the resulting dataframe the order of the dates has been changed incorrectly. Below is what it looks like pre-resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly.
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top: it thinks it's January.
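If the raw timestamps are still available as strings, one way to avoid the swap is to parse them with an explicit format (or dayfirst=True) before setting the index and resampling. A sketch, where 'ticks.csv' and the 'timestamp' column name are assumptions:

import pandas as pd

# parse the raw strings with the format the source actually uses, so pandas
# cannot silently swap day and month while guessing
raw = pd.read_csv('ticks.csv')
raw['timestamp'] = pd.to_datetime(raw['timestamp'], format='%d/%m/%Y %H:%M:%S')

df = raw.set_index('timestamp').sort_index()

one_minute_dataframe = pd.DataFrame()
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()

The "extra dates" are expected either way: resample emits every 1-minute bin between the minimum and maximum timestamps, and the mis-parsed dates stretch that span to roughly six months.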

pandas: group by ffill does not apply fill in correct order

I am facing an issue with groupby ffill. It does not seem to apply forward fill in the correct order.
Here is my starting data:
group date stage_2
0 A 2014-01-01 NaN
1 A 2014-01-03 NaN
2 A 2014-01-04 NaN
3 A 2014-01-05 1.0
4 B 2014-01-02 NaN
5 B 2014-01-06 NaN
6 B 2014-01-10 NaN
7 C 2014-01-03 1.0
8 C 2014-01-05 3.0
9 C 2014-01-08 NaN
10 C 2014-01-09 NaN
11 C 2014-01-10 NaN
12 C 2014-01-11 NaN
13 D 2014-01-01 NaN
14 D 2014-01-03 NaN
15 D 2014-01-04 NaN
16 E 2014-01-04 1.0
17 E 2014-01-06 3.0
18 E 2014-01-07 4.0
19 E 2014-01-08 NaN
20 E 2014-01-09 NaN
21 E 2014-01-10 NaN
22 F 2014-01-08 NaN
After applying the ffill method, this is what I get:
df['stage_2'] = df.groupby('group')['stage_2'].ffill()
I am expecting different values at indexes 9 through 12 and 21:
group date stage_2
0 A 2014-01-01 NaN
1 A 2014-01-03 NaN
2 A 2014-01-04 NaN
3 A 2014-01-05 1.0
4 B 2014-01-02 NaN
5 B 2014-01-06 NaN
6 B 2014-01-10 NaN
7 C 2014-01-03 1.0
8 C 2014-01-05 3.0
9 C 2014-01-08 1.0
10 C 2014-01-09 NaN
11 C 2014-01-10 NaN
12 C 2014-01-11 NaN
13 D 2014-01-01 NaN
14 D 2014-01-03 NaN
15 D 2014-01-04 NaN
16 E 2014-01-04 1.0
17 E 2014-01-06 3.0
18 E 2014-01-07 4.0
19 E 2014-01-08 4.0
20 E 2014-01-09 4.0
21 E 2014-01-10 NaN
22 F 2014-01-08 NaN
The only way I can reproduce this is by putting non-ASCII characters, e.g. Cyrillic С and Е, into the group column at indexes 9-12 and 21 respectively.
EDIT
OK, most likely you're using pandas v0.23.0, which had a bug (fixed in later versions, at least in v0.23.4) that makes .ffill() give the exact output you posted. So please upgrade your pandas.
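Either explanation is quick to check (a sketch; str.isascii() needs Python 3.7+):

import pandas as pd

print(pd.__version__)                      # confirm you are not on the buggy 0.23.0

# flag group labels containing non-ASCII characters, e.g. a Cyrillic 'С' or 'Е'
# that looks identical to the Latin 'C' or 'E'
mask = ~df['group'].map(str.isascii)
print(df.loc[mask, ['group', 'date', 'stage_2']])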

Pandas Group by before outer Join

I have two tables with the following formats:
Table1: key = Date, Index
Date Index Value1
0 2015-01-01 A -1.292040
1 2015-04-01 A 0.535893
2 2015-02-01 B -1.779029
3 2015-06-01 B 1.129317
Table2: Key = Date
Date Value2
0 2015-01-01 2.637761
1 2015-02-01 -0.496927
2 2015-03-01 0.226914
3 2015-04-01 -2.010917
4 2015-05-01 -1.095533
5 2015-06-01 0.651244
6 2015-07-01 0.036592
7 2015-08-01 0.509352
8 2015-09-01 -0.682297
9 2015-10-01 1.231889
10 2015-11-01 -1.557481
11 2015-12-01 0.332942
Table2 has more rows and I want to join Table1 into Table2 on Date so I can do stuff with the Values. However, I also want to bring in Index and and fill in for each index, all the Dates they don't have like this:
Result:
Date Index Value1 Value2
0 2015-01-01 A -1.292040 2.637761
1 2015-02-01 A NaN -0.496927
2 2015-03-01 A NaN 0.226914
3 2015-04-01 A 0.535893 -2.010917
4 2015-05-01 A NaN -1.095533
5 2015-06-01 A NaN 0.651244
6 2015-07-01 A NaN 0.036592
7 2015-08-01 A NaN 0.509352
8 2015-09-01 A NaN -0.682297
9 2015-10-01 A NaN 1.231889
10 2015-11-01 A NaN -1.557481
11 2015-12-01 A NaN 0.332942
.... and so on with Index B
I suppose I could manually filter out each Index value from Table1 into Table2, but that would be really tedious and troublesome if I didn't actually know all the indexes. I essentially want to do a "Table1 group by Index and right join to Table2 on Date" at the same time, but I'm stuck on how to express this.
Running the latest versions of Pandas and Jupyter.
EDIT: I have a program to fill in the NaNs, so they're not a problem right now.
It seems you want to merge 'Value1' of df1 with df2 on 'Date', while assigning the Index to every date. You can use pd.concat with a list comprehension:
import pandas as pd
pd.concat([df2.assign(Index=i).merge(gp, how='left') for i, gp in df1.groupby('Index')],
          ignore_index=True)
Output:
Date Value2 Index Value1
0 2015-01-01 2.637761 A -1.292040
1 2015-02-01 -0.496927 A NaN
2 2015-03-01 0.226914 A NaN
3 2015-04-01 -2.010917 A 0.535893
4 2015-05-01 -1.095533 A NaN
5 2015-06-01 0.651244 A NaN
6 2015-07-01 0.036592 A NaN
7 2015-08-01 0.509352 A NaN
8 2015-09-01 -0.682297 A NaN
9 2015-10-01 1.231889 A NaN
10 2015-11-01 -1.557481 A NaN
11 2015-12-01 0.332942 A NaN
12 2015-01-01 2.637761 B NaN
13 2015-02-01 -0.496927 B -1.779029
14 2015-03-01 0.226914 B NaN
15 2015-04-01 -2.010917 B NaN
16 2015-05-01 -1.095533 B NaN
17 2015-06-01 0.651244 B 1.129317
18 2015-07-01 0.036592 B NaN
19 2015-08-01 0.509352 B NaN
20 2015-09-01 -0.682297 B NaN
21 2015-10-01 1.231889 B NaN
22 2015-11-01 -1.557481 B NaN
23 2015-12-01 0.332942 B NaN
By not specifying the merge keys, it's automatically using the intersection of columns, which is ['Date', 'Index'] for each group.
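Writing the merge keys out explicitly gives the same result and makes that intent visible (a sketch, assuming df1 and df2 as above):

import pandas as pd

# same concat-of-merges, with the keys spelled out instead of inferred
out = pd.concat(
    [df2.assign(Index=i).merge(gp, how='left', on=['Date', 'Index'])
     for i, gp in df1.groupby('Index')],
    ignore_index=True,
)
print(out)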