how to reindex panel data with MultiIndex - pandas

I have panel data. How can I get a DataFrame without the MultiIndex? I tried this:
print k_data
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 480 (major_axis) x 100 (minor_axis)
Items axis: close to volume
Major_axis axis: 2018-08-13 09:35:00 to 2018-08-24 15:00:00
Minor_axis axis: 603105.XSHG to 600236.XSHG
print k_data.to_frame()
close high low money open \
major minor
2018-08-13 09:35:00 603105.XSHG 25.20 26.00 23.65 367025532.0 23.80
300745.XSHE 56.85 56.88 56.03 27557052.0 56.47
300746.XSHE 24.80 24.92 24.40 25316020.0 24.92
300747.XSHE 156.77 157.01 155.11 74177868.0 155.67
002932.XSHE 77.77 77.77 76.52 47234204.0 77.00
603045.XSHG 45.48 45.49 45.00 12387785.0 45.00
How do I move the index levels into ordinary columns, like this?
major minor close high low money open volume
2018/8/3 9:31 603105.XSHG 24.2 24.44 24.2 75700508 24.3 3111000
2018/8/3 9:31 300745.XSHE 62.06 62.31 61.46 25664428 61.46 415385
2018/8/3 9:31 300746.XSHE 28.6 28.74 28.54 4479504 28.74 156300
2018/8/3 9:31 300747.XSHE 181.2 181.39 180.85 11388640 181.39 62900

OK, I figured it out: use k_data.to_frame().reset_index() to reset the index, like this:
df = k_data.to_frame().reset_index()
print df
major minor close high low money \
0 2018-08-13 09:35:00 603105.XSHG 25.20 26.00 23.65 367025532.0
1 2018-08-13 09:35:00 300745.XSHE 56.85 56.88 56.03 27557052.0
2 2018-08-13 09:35:00 300746.XSHE 24.80 24.92 24.40 25316020.0
3 2018-08-13 09:35:00 300747.XSHE 156.77 157.01 155.11 74177868.0
4 2018-08-13 09:35:00 002932.XSHE 77.77 77.77 76.52 47234204.0
5 2018-08-13 09:35:00 603045.XSHG 45.48 45.49 45.00 12387785.0
6 2018-08-13 09:35:00 601330.XSHG 15.24 15.27 14.93 31972260.0
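Since Panel itself was removed in pandas 1.0, here is a minimal runnable sketch of the same fix starting from a hand-built (major, minor) MultiIndex frame — the stand-in for what k_data.to_frame() produced (only two columns and two rows are reproduced for brevity):

```python
import pandas as pd

# Stand-in for k_data.to_frame(): a frame indexed by (major, minor)
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2018-08-13 09:35:00"]), ["603105.XSHG", "300745.XSHE"]],
    names=["major", "minor"],
)
frame = pd.DataFrame({"close": [25.20, 56.85], "open": [23.80, 56.47]}, index=idx)

# reset_index() turns both index levels into ordinary columns
df = frame.reset_index()
```

After the reset, "major" and "minor" are regular columns and the row index is a plain 0, 1, ... RangeIndex.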

Related

pandas (multi) index is wrong, need to change it

I have a DataFrame multiData that looks like this:
print(multiData)
Date Open High Low Close Adj Close Volume
Ticker Date
AAPL 0 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
1 2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
2 2010-01-06 7.66 7.69 7.53 7.53 6.41 552160000
3 2010-01-07 7.56 7.57 7.47 7.52 6.40 477131200
4 2010-01-08 7.51 7.57 7.47 7.57 6.44 447610800
... ... ... ... ... ... ... ...
META 2668 2022-12-23 116.03 118.18 115.54 118.04 118.04 17796600
2669 2022-12-27 117.93 118.60 116.05 116.88 116.88 21392300
2670 2022-12-28 116.25 118.15 115.51 115.62 115.62 19612500
2671 2022-12-29 116.40 121.03 115.77 120.26 120.26 22366200
2672 2022-12-30 118.16 120.42 117.74 120.34 120.34 19492100
I need to get rid of the 0, 1, 2, ... index level and make the actual "Date" column part of the MultiIndex.
How do I do this?
Use df.droplevel to delete level 1 and chain df.set_index to add column Date to the index by setting the append parameter to True.
df = df.droplevel(1).set_index('Date', append=True)
df
Open High Low Close Adj Close Volume
Ticker Date
AAPL 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
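For reference, a self-contained miniature of the same transformation (the two-row frame below is a hand-built stand-in for multiData):

```python
import pandas as pd

# Miniature multiData: (Ticker, running-number) MultiIndex, Date as a column
idx = pd.MultiIndex.from_tuples([("AAPL", 0), ("AAPL", 1)], names=["Ticker", None])
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2010-01-04", "2010-01-05"]), "Close": [7.64, 7.66]},
    index=idx,
)

# Drop the meaningless integer level, then append Date to the index
out = df.droplevel(1).set_index("Date", append=True)
```

droplevel(1) removes the second index level by position; append=True keeps Ticker in place instead of replacing it.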

Calculate the rolling average every two weeks for the same day and hour in a DataFrame

I have a Dataframe like the following:
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime',inplace=True)
df =
                     week  day_of_week  hour  minutes  value
datetime
2023-01-02 00:00:00 1 0 0 0 0
2023-01-02 00:15:00 1 0 0 15 1
2023-01-02 00:30:00 1 0 0 30 2
2023-01-02 00:45:00 1 0 0 45 3
2023-01-02 01:00:00 1 0 1 0 4
... ... ... ... ... ...
2023-01-08 23:00:00 1 6 23 0 668
2023-01-08 23:15:00 1 6 23 15 669
2023-01-08 23:30:00 1 6 23 30 670
2023-01-08 23:45:00 1 6 23 45 671
2023-01-09 00:00:00 2 0 0 0 672
And I want to calculate the average of the column "value" for the same hour/minute/day, every two consecutive weeks.
What I would like to get is the following:
df=
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336
2023-01-23 00:00:00 1008
15 2023-01-02 00:15:00 NaN
2023-01-09 00:15:00 NaN
2023-01-16 00:15:00 337
2023-01-23 00:15:00 1009
So the first two weeks should have NaN values and week-3 should be the average of week-1 and week-2 and then week-4 the average of week-2 and week-3 and so on.
I tried the following code but it does not seem to do what I expect:
df = pd.DataFrame(df.groupby(['day_of_week','hour','minutes'])['value'].rolling(window='14D', min_periods=1).mean())
As what I am getting is:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 0
2023-01-09 00:00:00 336
2023-01-16 00:00:00 1008
2023-01-23 00:00:00 1680
15 2023-01-02 00:15:00 1
2023-01-09 00:15:00 337
2023-01-16 00:15:00 1009
2023-01-23 00:15:00 1681
I think you want to shift within each group. Then you need another groupby:
(df.groupby(['day_of_week','hour','minutes'])['value']
   .rolling(window='14D', min_periods=2).mean()  # `min_periods` is different
   .groupby(['day_of_week','hour','minutes']).shift()  # shift within each group
   .to_frame()
)
Output:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336.0
2023-01-23 00:00:00 1008.0
15 2023-01-02 00:15:00 NaN
... ...
6 23 30 2023-01-15 23:30:00 NaN
2023-01-22 23:30:00 1006.0
45 2023-01-08 23:45:00 NaN
2023-01-15 23:45:00 NaN
2023-01-22 23:45:00 1007.0

How to merge records with aggregate historical data?

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have timestamps. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
For this row I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be before the Date_Time of the individual record.
I am a bit confused about how to do this efficiently. I know I need to groupby the historical data.
Also the individual records are usually in groups of Date_Time, if that makes it any easier.
IIUC:
try:
out = df1.merge(df2, on='name', suffixes=('', '_y'))  # merge both df's on name
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()  # keep only strictly earlier history
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()  # aggregate values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
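A runnable miniature of the same pipeline, using plain boolean indexing in place of mask(...).dropna() — it keeps the same rows, namely the history strictly before each record (the tables below reproduce only the Hollyhill Island rows from the question):

```python
import pandas as pd

# Individual records (first table)
df1 = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2021-09-06 10:46:00"]),
    "name": ["Hollyhill Island"],
})

# Historical information (second table)
df2 = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2021-09-15 12:16:00", "2021-09-06 10:46:00", "2021-05-30 18:28:00",
        "2021-05-25 10:46:00", "2021-05-18 12:46:00", "2021-04-05 12:31:00",
        "2021-04-28 12:16:00",
    ]),
    "name": ["Hollyhill Island"] * 7,
    "cc": [6.00, 4.50, 3.50, 2.50, 2.38, 3.50, 3.75],
})

out = df1.merge(df2, on="name", suffixes=("", "_y"))   # all record/history pairs per name
out = out[out["Date_Time"] > out["Date_Time_y"]]       # keep only strictly earlier history
out = out.groupby(["Date_Time", "name"])["cc"].agg(["count", "mean"]).reset_index()
```

Note the strict inequality: the history row with the identical timestamp (2021-09-06 10:46:00) is excluded, which is why the count is 5 and not 6.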

how do i access only specific entries of a dataframe having date as index

This is the tail of my DataFrame (around 1000 entries):
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I need to select only the entries for certain dates, for example the 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144
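If you need more than one day of the month, Index.isin generalizes the same idea (the date range below is made up for illustration):

```python
import pandas as pd

# Daily index covering roughly two months
rng = pd.date_range("2017-04-03", periods=60)
df = pd.DataFrame({"a": range(60)}, index=rng)

# Keep only the 25th and 26th of each month
subset = df[df.index.day.isin([25, 26])]
```

This stays vectorized, exactly like the single-day comparison above.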

Data analysis over multiple dataframes? Panels or Multi-index?

When I extract data for multiple stocks using web.DataReader, I'm getting a panel as the output.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = '2016-09-07'
stocks_query = ['AAPL','OPK']
stocks = web.DataReader(stocks_query, data_source='yahoo',
                        start=startDate, end=endDate)
stocks = stocks.swapaxes('items', 'minor_axis')
Leading to an output of :
Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close
A single dataframe of the panel looks like this
stocks['OPK']
Open High Low Close Volume Adj Close log_return \
Date
2010-01-04 1.80 1.97 1.76 1.95 234500.0 1.95 NaN
2010-01-05 1.64 1.95 1.64 1.93 135800.0 1.93 -0.010309
2010-01-06 1.90 1.92 1.77 1.79 546600.0 1.79 -0.075304
2010-01-07 1.79 1.94 1.76 1.92 138700.0 1.92 0.070110
2010-01-08 1.92 1.94 1.86 1.89 62500.0 1.89 -0.015748
I plan to do a lot of data manipulation across all DataFrames: adding new columns, comparing two DataFrames, etc. I was recommended to look into multi-indexing, as Panels are being deprecated.
This is my first time working with Panels.
If I want to add a new column to both dataframes (AAPL, OPK), I had to do something like this:
for i in stocks:
    stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
If multi_indexing is indeed recommended for working with multiple dataframes, how exactly would I convert my dataframes into a form I can easily work with?
Would I have one main index, with the next-level being the stocks, and the columns would be contained within each stock?
I went through the docs, which gave many examples using tuples (which I didn't get) or examples using single DataFrames.
http://pandas.pydata.org/pandas-docs/stable/advanced.html
So how exactly do I convert my panel into a multi_index dataframe?
I'd like to extend @piRSquared's answer with some examples:
In [40]: stocks.to_frame()
Out[40]:
AAPL OPK
Date minor
2010-01-04 Open 2.134300e+02 1.80
High 2.145000e+02 1.97
Low 2.123800e+02 1.76
Close 2.140100e+02 1.95
Volume 1.234324e+08 234500.00
Adj Close 2.772704e+01 1.95
2010-01-05 Open 2.146000e+02 1.64
High 2.155900e+02 1.95
Low 2.132500e+02 1.64
Close 2.143800e+02 1.93
... ... ...
2016-09-06 Low 1.075100e+02 9.19
Close 1.077000e+02 9.36
Volume 2.688040e+07 3026900.00
Adj Close 1.066873e+02 9.36
2016-09-07 Open 1.078300e+02 9.39
High 1.087600e+02 9.60
Low 1.070700e+02 9.38
Close 1.083600e+02 9.59
Volume 4.236430e+07 2632400.00
Adj Close 1.073411e+02 9.59
[10092 rows x 2 columns]
But if you want to convert it to a MultiIndex DataFrame, it's better to leave the original pandas_datareader Panel as it is (without swapping the axes):
In [38]: p = web.DataReader(stocks_query, data_source='yahoo', start=startDate, end=endDate)
In [39]: p.to_frame()
Out[39]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
OPK 1.800000 1.970000 1.760000 1.950000 234500.0 1.950000
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
OPK 1.640000 1.950000 1.640000 1.930000 135800.0 1.930000
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
OPK 1.900000 1.920000 1.770000 1.790000 546600.0 1.790000
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
OPK 1.790000 1.940000 1.760000 1.920000 138700.0 1.920000
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
OPK 1.920000 1.940000 1.860000 1.890000 62500.0 1.890000
... ... ... ... ... ... ...
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
OPK 9.260000 9.260000 9.070000 9.100000 2793300.0 9.100000
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
OPK 9.310000 9.540000 9.190000 9.290000 3515300.0 9.290000
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
OPK 9.340000 9.390000 9.160000 9.330000 2061200.0 9.330000
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
OPK 9.320000 9.480000 9.190000 9.360000 3026900.0 9.360000
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
OPK 9.390000 9.600000 9.380000 9.590000 2632400.0 9.590000
[3364 rows x 6 columns]
How to work with a MultiIndex DF:
In [46]: df = p.to_frame()
In [47]: df.loc[pd.IndexSlice[:, ['AAPL']], :]
Out[47]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
2010-01-11 AAPL 212.799997 213.000002 208.450005 210.110003 115557400.0 27.221758
2010-01-12 AAPL 209.189995 209.769995 206.419998 207.720001 148614900.0 26.912110
2010-01-13 AAPL 207.870005 210.929995 204.099998 210.650002 151473000.0 27.291720
2010-01-14 AAPL 210.110003 210.459997 209.020004 209.430000 108223500.0 27.133657
2010-01-15 AAPL 210.929995 211.599997 205.869999 205.930000 148516900.0 26.680198
... ... ... ... ... ... ...
2016-08-24 AAPL 108.570000 108.750000 107.680000 108.029999 23675100.0 107.014213
2016-08-25 AAPL 107.389999 107.879997 106.680000 107.570000 25086200.0 106.558539
2016-08-26 AAPL 107.410004 107.949997 106.309998 106.940002 27766300.0 105.934466
2016-08-29 AAPL 106.620003 107.440002 106.290001 106.820000 24970300.0 105.815591
2016-08-30 AAPL 105.800003 106.500000 105.500000 106.000000 24863900.0 105.003302
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
[1682 rows x 6 columns]
You're going to love this one
stocks.to_frame()
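Since Panel was removed in pandas 1.0, here is a sketch of the modern equivalent, assuming you start from one DataFrame per ticker: stack them with pd.concat into a (Ticker, Date) MultiIndex frame, after which the per-panel loop from the question collapses into a single grouped operation. The price values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# One DataFrame per ticker (stand-ins for what DataReader would return today)
dates = pd.date_range("2010-01-04", periods=3, name="Date")
frames = {
    "AAPL": pd.DataFrame({"Close": [214.01, 214.38, 210.97]}, index=dates),
    "OPK": pd.DataFrame({"Close": [1.95, 1.93, 1.79]}, index=dates),
}

# Dict keys become the outer index level, named via `names`
df = pd.concat(frames, names=["Ticker"])

# The per-panel loop becomes one grouped shift: each ticker's first row is NaN
df["log_return"] = np.log(df["Close"] / df.groupby(level="Ticker")["Close"].shift(1))

# Selecting one ticker, as with stocks['OPK'] on a Panel
opk = df.xs("OPK", level="Ticker")
```

The groupby-level shift is what prevents the last price of one ticker leaking into the first return of the next, which a plain shift(1) on the stacked frame would do.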