I downloaded the Broad Dollar Index from FRED with the following format:
DATE RTWEXBGS
0 2006-01-01 100.0000
1 2006-02-01 100.2651
2 2006-03-01 100.5424
3 2006-04-01 100.0540
4 2006-05-01 97.8681
.. ... ...
194 2022-03-01 111.2659
195 2022-04-01 111.8324
196 2022-05-01 114.6075
197 2022-06-01 115.6957
198 2022-07-01 118.2674
I also have an Excel file of inflation rates in a different format:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN NaN NaN NaN NaN NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 0.05251 0.05390 0.06222 0.06809 0.07036 0.04698
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 0.01310 0.01371 0.01182 0.01175 0.01362 0.01234
3 2019 0.01551 0.01520 0.01863 0.01996 0.01790 0.01648 0.01811 0.01750 0.01711 0.01764 0.02051 0.02285 0.01812
4 2018 0.02071 0.02212 0.02360 0.02463 0.02801 0.02872 0.02950 0.02699 0.02277 0.02522 0.02177 0.01910 0.02443
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104 1918 0.19658 0.17500 0.16667 0.12698 0.13281 0.13077 0.17969 0.18462 0.18045 0.18519 0.20741 0.20438 0.17284
105 1917 0.12500 0.15385 0.14286 0.18868 0.19626 0.20370 0.18519 0.19266 0.19820 0.19469 0.17391 0.18103 0.17841
106 1916 0.02970 0.04000 0.06061 0.06000 0.05941 0.06931 0.06931 0.07921 0.09901 0.10784 0.11650 0.12621 0.07667
107 1915 0.01000 0.01010 0.00000 0.02041 0.02020 0.02020 0.01000 -0.00980 -0.00980 0.00990 0.00980 0.01980 0.00915
108 1914 0.02041 0.01020 0.01020 0.00000 0.02062 0.01020 0.01010 0.03030 0.02000 0.01000 0.00990 0.01000 0.01349
How do I change the inflation table into a format similar to the dollar index?
Something like this (it does not take the Annual column into account):
df
###
Year Jan Feb Mar Apr May Jun Jul Aug \
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 NaN
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 NaN
Sep Oct Nov Dec Annual
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_melt = pd.melt(df, id_vars=['Year'], value_vars=month, var_name='Month', value_name='Sales')
# build a datetime Date column from the Year and Month columns
df_melt['Date'] = pd.to_datetime(df_melt['Year'].astype(str) + '-' + df_melt['Month'].astype(str))
df_melt = df_melt[['Date', 'Sales']]
df_melt
###
Date Sales
0 2022-01-01 0.07480
1 2021-01-01 0.01400
2 2020-01-01 0.02487
3 2022-02-01 0.07871
4 2021-02-01 0.01676
5 2020-02-01 0.02335
6 2022-03-01 0.08542
7 2021-03-01 0.02620
8 2020-03-01 0.01539
9 2022-04-01 0.08259
10 2021-04-01 0.04160
11 2020-04-01 0.00329
12 2022-05-01 0.08582
13 2021-05-01 0.04993
14 2020-05-01 0.00118
15 2022-06-01 0.09060
16 2021-06-01 0.05391
17 2020-06-01 0.00646
18 2022-07-01 0.08525
19 2021-07-01 0.05365
20 2020-07-01 0.00986
21 2022-08-01 NaN
22 2021-08-01 NaN
23 2020-08-01 NaN
24 2022-09-01 NaN
25 2021-09-01 NaN
26 2020-09-01 NaN
27 2022-10-01 NaN
28 2021-10-01 NaN
29 2020-10-01 NaN
30 2022-11-01 NaN
31 2021-11-01 NaN
32 2020-11-01 NaN
33 2022-12-01 NaN
34 2021-12-01 NaN
35 2020-12-01 NaN
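The melted result above is ordered by month first and still contains the empty months. To get closer to the FRED layout, a minimal sketch is to drop the missing values and sort by date. The three-month sample frame and the column name Inflation are assumptions for illustration (the answer above used Sales), and the '%Y-%b' format assumes English month abbreviations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same wide layout as the Excel file (3 months only).
df = pd.DataFrame({
    "Year": [2022, 2021],
    "Jan": [0.07480, 0.01400],
    "Feb": [0.07871, 0.01676],
    "Mar": [np.nan, 0.02620],
})
months = ["Jan", "Feb", "Mar"]

# Wide -> long: one row per (Year, Month) pair.
long_df = df.melt(id_vars="Year", value_vars=months,
                  var_name="Month", value_name="Inflation")

# "2022-Jan" -> 2022-01-01; %b assumes English month abbreviations.
long_df["DATE"] = pd.to_datetime(
    long_df["Year"].astype(str) + "-" + long_df["Month"], format="%Y-%b"
)

# Drop months with no data yet, order chronologically, keep two columns.
long_df = (
    long_df.dropna(subset=["Inflation"])
    .sort_values("DATE")[["DATE", "Inflation"]]
    .reset_index(drop=True)
)
print(long_df)
```

With both tables keyed on a monthly DATE column, they could then be combined with an ordinary merge on DATE.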
My data has the following structure:
np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
I want to calculate a 2-period moving average over Date, so that the final output looks like the following. For the first MA, we take 2021-01-02 and 2021-01-05 across all observations and calculate the MA (50). Similarly for the other dates. The output need not be in the structure shown in the report; I just need the Date and MA columns in the final data.
Thanks!
IIUC, you can first aggregate the rows that share a date, getting the sum and the count per date.
Then take a rolling sum over 2 consecutive dates (it looks like you want raw successive dates rather than a fixed calendar period, so I am assuming the dates are already sorted).
Finally, divide the rolling sum by the rolling count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
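As a sanity check on the sum/count approach, here is a small sketch on made-up numbers (the five rows below are hypothetical, not the question's data). It compares the result with the mean of all rows pooled over the two dates, and with a plain rolling mean of per-date means, which weights each date equally and so can give a different value when dates have different row counts.

```python
import pandas as pd

# Hypothetical toy data: 3 rows on the first date, 2 rows on the second.
tdf = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 3],
    "Date": ["2021-01-02", "2021-01-05", "2021-01-02", "2021-01-05", "2021-01-02"],
    "Quantity": [10.0, 20.0, 30.0, 40.0, 80.0],
})

g = tdf.groupby("Date")["Quantity"]

# Rolling sum divided by rolling count: mean over all rows in the 2-date window.
ma = g.sum().rolling(2).sum() / g.count().rolling(2).sum()

# Direct check: pool every row from the two dates and average them.
pooled = tdf.loc[tdf["Date"].isin(["2021-01-02", "2021-01-05"]), "Quantity"].mean()

# Mean of per-date means: each date counts equally, regardless of row count.
naive = g.mean().rolling(2).mean()

print(ma.loc["2021-01-05"], pooled, naive.loc["2021-01-05"])
```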
Joining the result back to the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000
I'm trying to resample some tick data I have into 1 minute blocks. The code appears to work fine but when I look into the resulting dataframe it is changing the order of the dates incorrectly. Below is what it looks like pre resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly
I see what's happened. Somewhere in the data download the month and day were switched around. That's why it's putting July at the top: it thinks the 1st of July (01/07) is the 7th of January.
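One way to avoid this kind of swap is to parse the timestamps with an explicit day-first format instead of letting pandas guess. The sketch below assumes the raw file stores timestamps as 'dd/mm/YYYY HH:MM:SS' strings; that layout is an assumption, since the original raw format is not shown. It also drops the empty one-minute bins that resample creates for every minute between the first and last tick, which would account for the extra timestamps showing up in the result.

```python
import pandas as pd

# Hypothetical raw strings in day-first order (30 June and 1 July 2020).
raw = pd.Series(["30/06/2020 17:00:00", "01/07/2020 17:00:21"])

# An explicit format removes the ambiguity between day and month.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

# Toy prices on the parsed timestamps.
ticks = pd.Series([41.68, 41.94], index=pd.DatetimeIndex(parsed))

# resample emits a bin for every minute in the range; drop the empty ones.
one_minute = ticks.resample("1min").last().dropna()
print(one_minute)
```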
I have two tables with the following formats:
Table1: key = Date, Index
Date Index Value1
0 2015-01-01 A -1.292040
1 2015-04-01 A 0.535893
2 2015-02-01 B -1.779029
3 2015-06-01 B 1.129317
Table2: Key = Date
Date Value2
0 2015-01-01 2.637761
1 2015-02-01 -0.496927
2 2015-03-01 0.226914
3 2015-04-01 -2.010917
4 2015-05-01 -1.095533
5 2015-06-01 0.651244
6 2015-07-01 0.036592
7 2015-08-01 0.509352
8 2015-09-01 -0.682297
9 2015-10-01 1.231889
10 2015-11-01 -1.557481
11 2015-12-01 0.332942
Table2 has more rows, and I want to join Table1 into Table2 on Date so I can do stuff with the Values. However, I also want to bring in Index and, for each Index, fill in all the Dates it doesn't have, like this:
Result:
Date Index Value1 Value2
0 2015-01-01 A -1.292040 2.637761
1 2015-02-01 A NaN -0.496927
2 2015-03-01 A NaN 0.226914
3 2015-04-01 A 0.535893 -2.010917
4 2015-05-01 A NaN -1.095533
5 2015-06-01 A NaN 0.651244
6 2015-07-01 A NaN 0.036592
7 2015-08-01 A NaN 0.509352
8 2015-09-01 A NaN -0.682297
9 2015-10-01 A NaN 1.231889
10 2015-11-01 A NaN -1.557481
11 2015-12-01 A NaN 0.332942
.... and so on with Index B
I suppose I could manually filter out each Index value from Table1 into Table2, but that would be really tedious and troublesome if I didn't actually know all the indexes. I essentially want to do a "Table1 group by Index and right join to Table2 on Date" at the same time, but I'm stuck on how to express this.
Running the latest versions of Pandas and Jupyter.
EDIT: I have a program to fill in the NaNs, so they're not a problem right now.
It seems you want to merge 'Value1' of df1 with df2 on 'Date', while assigning the Index to every date. You can use pd.concat with a list comprehension:
import pandas as pd
pd.concat([df2.assign(Index=i).merge(gp, how='left') for i, gp in df1.groupby('Index')],
ignore_index=True)
Output:
Date Value2 Index Value1
0 2015-01-01 2.637761 A -1.292040
1 2015-02-01 -0.496927 A NaN
2 2015-03-01 0.226914 A NaN
3 2015-04-01 -2.010917 A 0.535893
4 2015-05-01 -1.095533 A NaN
5 2015-06-01 0.651244 A NaN
6 2015-07-01 0.036592 A NaN
7 2015-08-01 0.509352 A NaN
8 2015-09-01 -0.682297 A NaN
9 2015-10-01 1.231889 A NaN
10 2015-11-01 -1.557481 A NaN
11 2015-12-01 0.332942 A NaN
12 2015-01-01 2.637761 B NaN
13 2015-02-01 -0.496927 B -1.779029
14 2015-03-01 0.226914 B NaN
15 2015-04-01 -2.010917 B NaN
16 2015-05-01 -1.095533 B NaN
17 2015-06-01 0.651244 B 1.129317
18 2015-07-01 0.036592 B NaN
19 2015-08-01 0.509352 B NaN
20 2015-09-01 -0.682297 B NaN
21 2015-10-01 1.231889 B NaN
22 2015-11-01 -1.557481 B NaN
23 2015-12-01 0.332942 B NaN
Because the merge keys are not specified, merge automatically uses the columns common to both frames, which are ['Date', 'Index'] for each group.
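If you prefer not to rely on the implicit key selection, the keys can be spelled out with on=. The following is a self-contained sketch of the same approach on shortened, rounded sample data (the values are illustrative, not the full tables from the question):

```python
import pandas as pd

# Shortened versions of the two tables, for illustration only.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-04-01", "2015-02-01"]),
    "Index": ["A", "A", "B"],
    "Value1": [-1.29, 0.54, -1.78],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"]),
    "Value2": [2.64, -0.50, 0.23, -2.01],
})

# For each Index value: copy df2, tag it with that Index, then left-join
# the matching group of df1 on the explicit key columns.
out = pd.concat(
    [
        df2.assign(Index=i).merge(gp, how="left", on=["Date", "Index"])
        for i, gp in df1.groupby("Index")
    ],
    ignore_index=True,
)
print(out)
```

Naming the keys makes the join robust if another shared column is later added to either table.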
I have the DataFrame below. Its index is a MultiIndex of date and time.
When I run this code,
idx = pd.IndexSlice
print(df_per_wday_temp.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)]])
I get the error 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. This may be an error in the index slicing, but I don't know why it happens. Can anybody solve it?
a b
date time
2018-01-26 19:00:00 25.08 -7.85
19:15:00 24.86 -7.81
19:30:00 24.67 -8.24
19:45:00 NaN -9.32
20:00:00 NaN -8.29
20:15:00 NaN -8.58
20:30:00 NaN -9.48
20:45:00 NaN -8.73
21:00:00 NaN -8.60
21:15:00 NaN -8.70
21:30:00 NaN -8.53
21:45:00 NaN -8.90
22:00:00 NaN -8.55
22:15:00 NaN -8.48
22:30:00 NaN -9.90
22:45:00 NaN -9.70
23:00:00 NaN -8.98
23:15:00 NaN -9.17
23:30:00 NaN -9.07
23:45:00 NaN -9.45
00:00:00 NaN -9.64
00:15:00 NaN -10.08
00:30:00 NaN -8.87
00:45:00 NaN -9.91
01:00:00 NaN -9.91
01:15:00 NaN -9.93
01:30:00 NaN -9.55
01:45:00 NaN -9.51
02:00:00 NaN -9.75
02:15:00 NaN -9.44
... ... ...
03:45:00 NaN -9.28
04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
05:45:00 NaN -10.51
06:00:00 NaN -10.41
06:15:00 NaN -10.49
06:30:00 NaN -10.13
06:45:00 NaN -10.36
07:00:00 NaN -10.71
07:15:00 NaN -12.11
07:30:00 NaN -10.76
07:45:00 NaN -10.76
08:00:00 NaN -11.63
08:15:00 NaN -11.18
08:30:00 NaN -10.49
08:45:00 NaN -11.18
09:00:00 NaN -10.67
09:15:00 NaN -10.60
09:30:00 NaN -10.36
09:45:00 NaN -9.39
10:00:00 NaN -9.77
10:15:00 NaN -9.54
10:30:00 NaN -8.99
10:45:00 NaN -9.01
11:00:00 NaN -10.01
Thanks in advance.
If sorting the index is not possible, create a boolean mask and filter with boolean indexing:
from datetime import time
mask = df1.index.get_level_values(1).to_series().between(time(4, 0, 0), time(7, 0, 0)).values
df = df1[mask]
print (df)
a b
date time
2018-01-26 04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
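If reordering the rows is acceptable, the other option is to sort the index first, after which the original IndexSlice lookup should work. This is a minimal sketch on a hypothetical five-row frame; note that sorting puts the times in clock order, so rows that run past midnight under the same date label will move.

```python
from datetime import date, time

import pandas as pd

# Hypothetical frame with the times deliberately out of order.
index = pd.MultiIndex.from_tuples(
    [
        (date(2018, 1, 26), time(19, 0)),
        (date(2018, 1, 26), time(4, 0)),
        (date(2018, 1, 26), time(7, 0)),
        (date(2018, 1, 26), time(7, 15)),
        (date(2018, 1, 26), time(5, 30)),
    ],
    names=["date", "time"],
)
df1 = pd.DataFrame({"b": [-7.85, -9.96, -10.71, -12.11, -10.81]}, index=index)

idx = pd.IndexSlice

# Sort both index levels so the MultiIndex is fully lexsorted.
sorted_df = df1.sort_index()

# Slice the second level (time) from 04:00 to 07:00 inclusive, all dates.
subset = sorted_df.loc[idx[:, time(4, 0):time(7, 0)], :]
print(subset)
```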