pandas (multi) index is wrong, need to change it

I have a DataFrame multiData that looks like this:
print(multiData)
                  Date    Open    High     Low   Close  Adj Close     Volume
Ticker Date
AAPL   0    2010-01-04    7.62    7.66    7.59    7.64       6.51  493729600
       1    2010-01-05    7.66    7.70    7.62    7.66       6.52  601904800
       2    2010-01-06    7.66    7.69    7.53    7.53       6.41  552160000
       3    2010-01-07    7.56    7.57    7.47    7.52       6.40  477131200
       4    2010-01-08    7.51    7.57    7.47    7.57       6.44  447610800
...                ...     ...     ...     ...     ...        ...        ...
META   2668 2022-12-23  116.03  118.18  115.54  118.04     118.04   17796600
       2669 2022-12-27  117.93  118.60  116.05  116.88     116.88   21392300
       2670 2022-12-28  116.25  118.15  115.51  115.62     115.62   19612500
       2671 2022-12-29  116.40  121.03  115.77  120.26     120.26   22366200
       2672 2022-12-30  118.16  120.42  117.74  120.34     120.34   19492100
I need to get rid of the integer "Date" level (0, 1, 2, ...) and make the actual "Date" column part of the MultiIndex instead.
How do I do this?

Use df.droplevel to delete level 1, and chain df.set_index to add the Date column to the index by setting the append parameter to True.
df = df.droplevel(1).set_index('Date', append=True)
df
                   Open  High   Low  Close  Adj Close     Volume
Ticker Date
AAPL   2010-01-04  7.62  7.66  7.59   7.64       6.51  493729600
       2010-01-05  7.66  7.70  7.62   7.66       6.52  601904800
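For reference, here is the same fix as a self-contained sketch; the two-row frame below is made up purely for illustration:
import pandas as pd

# hypothetical miniature of the frame above: a (Ticker, integer) MultiIndex
# named ('Ticker', 'Date'), with the real dates in a regular 'Date' column
df = pd.DataFrame(
    {'Date': ['2010-01-04', '2010-01-05'], 'Close': [7.64, 7.66]},
    index=pd.MultiIndex.from_tuples([('AAPL', 0), ('AAPL', 1)],
                                    names=['Ticker', 'Date']),
)

# drop the integer level, then append the real Date column to the index
df = df.droplevel(1).set_index('Date', append=True)
print(df)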

Related

Resample 10D but until end of months

I would like to resample a DataFrame at a 10-day frequency, but with the last 10-day block always cut at the end of the month.
E.g.:
print(df)
 data
index
2010-01-01 145.08
2010-01-02 143.69
2010-01-03 101.06
2010-01-04 57.63
2010-01-05 65.46
...
2010-02-24 48.06
2010-02-25 87.41
2010-02-26 71.97
2010-02-27 73.1
2010-02-28 41.43
Applying something like df.resample('10DM').mean() (where '10DM' is a made-up frequency) should give:
data
index
2010-01-10 97.33
2010-01-20 58.58
2010-01-31 41.43
2010-02-10 35.17
2010-02-20 32.44
2010-02-28 55.44
Note that the 1st and 2nd blocks are a normal 10D resample, but the 3rd can span 8, 9, 10, or 11 days depending on the month and year.
Thanks in advance.
Sample data (easy to check):
import numpy as np
import pandas as pd

dti = pd.date_range('2010-01-01', '2010-02-28', freq='D')
df = pd.DataFrame({"value": np.arange(1, len(dti)+1)}, index=dti)
>>> df
value
2010-01-01 1
2010-01-02 2
2010-01-03 3
2010-01-04 4
2010-01-05 5
...
2010-02-24 55
2010-02-25 56
2010-02-26 57
2010-02-27 58
2010-02-28 59
You need to create groups by (days, month, year):
grp = df.groupby([pd.cut(df.index.day, [0, 10, 20, 31]),
                  pd.Grouper(freq='M'),
                  pd.Grouper(freq='Y')])
Now you can compute the mean for each group:
out = grp['value'].apply(lambda x: (x.index.max(), x.mean())).apply(pd.Series) \
                  .reset_index(drop=True).rename(columns={0: 'date', 1: 'value'}) \
                  .set_index('date').sort_index()
Output result:
>>> out
value
date
2010-01-10 5.5
2010-01-20 15.5
2010-01-31 26.0
2010-02-10 36.5
2010-02-20 46.5
2010-02-28 55.5
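For what it's worth, a slightly more compact variant of the same idea is sketched below; it assumes pandas >= 0.25 for named aggregation, and drops the yearly Grouper since the monthly Grouper already distinguishes years:
out = (df.groupby([pd.cut(df.index.day, [0, 10, 20, 31]),
                   pd.Grouper(freq='M')])['value']
         .agg(date=lambda x: x.index.max(), value='mean')
         .reset_index(drop=True)
         .set_index('date')
         .sort_index())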

Creating values from datetime objects in certain fixed divisions

I am trying to create a new column in which, e.g., the time 14:02 should be saved as 14.0, whereas 14:16 should be 14.5. This would equal half-hour units. Of course, 15-minute units should also be creatable, and so on. This is my approach for full hours, but I need a higher resolution:
df["Time"] = df.StartDateTime.apply(lambda x: x.hour)
So long as the units evenly divide an hour, you can round with that frequency and then divide by an hour.
import pandas as pd

df = pd.DataFrame({'Time': pd.timedelta_range('14:00:00', freq='4min', periods=10)})
for freq in ['30min', '15min', '20min', '10min']:
    df[freq] = df['Time'].dt.round(freq) / pd.Timedelta('1H')
Time 30min 15min 20min 10min
0 14:00:00 14.0 14.00 14.000000 14.000000
1 14:04:00 14.0 14.00 14.000000 14.000000
2 14:08:00 14.0 14.25 14.000000 14.166667
3 14:12:00 14.0 14.25 14.333333 14.166667
4 14:16:00 14.5 14.25 14.333333 14.333333
5 14:20:00 14.5 14.25 14.333333 14.333333
6 14:24:00 14.5 14.50 14.333333 14.333333
7 14:28:00 14.5 14.50 14.333333 14.500000
8 14:32:00 14.5 14.50 14.666667 14.500000
9 14:36:00 14.5 14.50 14.666667 14.666667
If you start from a datetime64[ns] column you can isolate the time by subtracting off the normalized date. For example:
df = pd.DataFrame({'Time': pd.date_range('2010-01-01 14:00:00', freq='4min', periods=5)})
df['Time_only'] = df['Time'] - df['Time'].dt.normalize()
# Time Time_only
#0 2010-01-01 14:00:00 14:00:00
#1 2010-01-01 14:04:00 14:04:00
#2 2010-01-01 14:08:00 14:08:00
#3 2010-01-01 14:12:00 14:12:00
#4 2010-01-01 14:16:00 14:16:00
print(df.dtypes)
#Time datetime64[ns]
#Time_only timedelta64[ns]
#dtype: object
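Putting the two pieces together for the original StartDateTime column, a minimal sketch (the column name StartDateTime and the half-hour unit are taken from the question; the sample timestamps are made up):
import pandas as pd

# hypothetical frame standing in for the questioner's data
df = pd.DataFrame({'StartDateTime': pd.to_datetime(
    ['2021-03-01 14:02', '2021-03-01 14:16', '2021-03-01 14:47'])})

# isolate the time of day, round to the nearest half hour,
# then express it as a fraction of an hour: 14:16 -> 14.5
tod = df['StartDateTime'] - df['StartDateTime'].dt.normalize()
df['Time'] = tod.dt.round('30min') / pd.Timedelta('1H')
print(df)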

Data analysis over multiple dataframes? Panels or Multi-index?

When I extract data for multiple stocks using web.DataReader, I'm getting a panel as the output.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = '2016-09-07'
stocks_query = ['AAPL','OPK']
stocks = web.DataReader(stocks_query, data_source='yahoo',
                        start=startDate, end=endDate)
stocks = stocks.swapaxes('items', 'minor_axis')
Leading to an output of:
Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close
A single dataframe of the panel looks like this
stocks['OPK']
Open High Low Close Volume Adj Close log_return \
Date
2010-01-04 1.80 1.97 1.76 1.95 234500.0 1.95 NaN
2010-01-05 1.64 1.95 1.64 1.93 135800.0 1.93 -0.010309
2010-01-06 1.90 1.92 1.77 1.79 546600.0 1.79 -0.075304
2010-01-07 1.79 1.94 1.76 1.92 138700.0 1.92 0.070110
2010-01-08 1.92 1.94 1.86 1.89 62500.0 1.89 -0.015748
I plan to do a lot of data manipulation across all dataframes: adding new columns, comparing two dataframes, etc. I was recommended to look into multi-indexing, as Panels are being deprecated.
This is my first time working with Panels.
If I want to add a new column to both dataframes (AAPL, OPK), I had to do something like this:
for i in stocks:
    stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
If multi-indexing is indeed recommended for working with multiple dataframes, how exactly would I convert my dataframes into a form I can easily work with?
Would I have one main index, with the next level being the stocks, and the columns contained within each stock?
I went through the docs, which gave many examples using tuples, which I didn't get, or examples using single dataframes.
http://pandas.pydata.org/pandas-docs/stable/advanced.html
So how exactly do I convert my panel into a multi_index dataframe?
I'd like to extend @piRSquared's answer with some examples:
In [40]: stocks.to_frame()
Out[40]:
AAPL OPK
Date minor
2010-01-04 Open 2.134300e+02 1.80
High 2.145000e+02 1.97
Low 2.123800e+02 1.76
Close 2.140100e+02 1.95
Volume 1.234324e+08 234500.00
Adj Close 2.772704e+01 1.95
2010-01-05 Open 2.146000e+02 1.64
High 2.155900e+02 1.95
Low 2.132500e+02 1.64
Close 2.143800e+02 1.93
... ... ...
2016-09-06 Low 1.075100e+02 9.19
Close 1.077000e+02 9.36
Volume 2.688040e+07 3026900.00
Adj Close 1.066873e+02 9.36
2016-09-07 Open 1.078300e+02 9.39
High 1.087600e+02 9.60
Low 1.070700e+02 9.38
Close 1.083600e+02 9.59
Volume 4.236430e+07 2632400.00
Adj Close 1.073411e+02 9.59
[10092 rows x 2 columns]
But if you want to convert it to a MultiIndex DF, it's better to leave the original pandas_datareader Panel as it is (i.e. skip the swapaxes call) and call to_frame() on it directly:
In [38]: p = web.DataReader(stocks_query, data_source='yahoo', start=startDate, end=endDate)
In [39]: p.to_frame()
Out[39]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
OPK 1.800000 1.970000 1.760000 1.950000 234500.0 1.950000
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
OPK 1.640000 1.950000 1.640000 1.930000 135800.0 1.930000
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
OPK 1.900000 1.920000 1.770000 1.790000 546600.0 1.790000
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
OPK 1.790000 1.940000 1.760000 1.920000 138700.0 1.920000
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
OPK 1.920000 1.940000 1.860000 1.890000 62500.0 1.890000
... ... ... ... ... ... ...
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
OPK 9.260000 9.260000 9.070000 9.100000 2793300.0 9.100000
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
OPK 9.310000 9.540000 9.190000 9.290000 3515300.0 9.290000
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
OPK 9.340000 9.390000 9.160000 9.330000 2061200.0 9.330000
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
OPK 9.320000 9.480000 9.190000 9.360000 3026900.0 9.360000
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
OPK 9.390000 9.600000 9.380000 9.590000 2632400.0 9.590000
[3364 rows x 6 columns]
How to work with a MultiIndex DF:
In [46]: df = p.to_frame()
In [47]: df.loc[pd.IndexSlice[:, ['AAPL']], :]
Out[47]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
2010-01-11 AAPL 212.799997 213.000002 208.450005 210.110003 115557400.0 27.221758
2010-01-12 AAPL 209.189995 209.769995 206.419998 207.720001 148614900.0 26.912110
2010-01-13 AAPL 207.870005 210.929995 204.099998 210.650002 151473000.0 27.291720
2010-01-14 AAPL 210.110003 210.459997 209.020004 209.430000 108223500.0 27.133657
2010-01-15 AAPL 210.929995 211.599997 205.869999 205.930000 148516900.0 26.680198
... ... ... ... ... ... ...
2016-08-24 AAPL 108.570000 108.750000 107.680000 108.029999 23675100.0 107.014213
2016-08-25 AAPL 107.389999 107.879997 106.680000 107.570000 25086200.0 106.558539
2016-08-26 AAPL 107.410004 107.949997 106.309998 106.940002 27766300.0 105.934466
2016-08-29 AAPL 106.620003 107.440002 106.290001 106.820000 24970300.0 105.815591
2016-08-30 AAPL 105.800003 106.500000 105.500000 106.000000 24863900.0 105.003302
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
[1682 rows x 6 columns]
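To address the "add a new column to both dataframes" part directly on the MultiIndex frame, a sketch along these lines should work (it assumes the Panel-era pandas used above, where p.to_frame() yields an index with levels named 'Date' and 'minor'):
import numpy as np

df = p.to_frame()
# compute log returns per ticker; grouping by the 'minor' (ticker) level
# keeps each series separate, so the shift doesn't leak across stocks
df['log_return'] = (df.groupby(level='minor')['Close']
                      .transform(lambda s: np.log(s / s.shift(1))))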
You're going to love this one
stocks.to_frame()

seasonal_decompose: operands could not be broadcast together with shapes on a series

I know there are many questions on this topic, but none of them helped me to solve this problem. I'm really stuck on this.
With a simple series:
0
2016-01-31 266
2016-02-29 235
2016-03-31 347
2016-04-30 514
2016-05-31 374
2016-06-30 250
2016-07-31 441
2016-08-31 422
2016-09-30 323
2016-10-31 168
2016-11-30 496
2016-12-31 303
import numpy as np
import statsmodels.api as sm

logdf = np.log(df[0])
decompose = sm.tsa.seasonal_decompose(logdf, freq=12, model='additive')
decomplot = decompose.plot()
I keep getting: ValueError: operands could not be broadcast together with shapes (12,) (14,)
I've tried pretty much everything: passing only logdf.values, passing a non-log series. It doesn't work.
Numpy, pandas and statsmodels versions:
print(statsmodels.__version__)  # 0.6.1
print(pd.__version__)           # 0.18.1
print(np.__version__)           # 1.11.3
As @yoonforh pointed out, in my case this was fixed by setting the freq parameter to less than the time series length. E.g., if your time series ts looks like this:
2014-01-01 0.0
2014-02-01 0.0
2014-03-01 1.0
2014-04-01 1.0
2014-05-01 0.0
2014-06-01 1.0
2014-07-01 1.0
2014-08-01 0.0
2014-09-01 0.0
2014-10-01 1.0
2014-11-01 0.0
2014-12-01 0.0
the shape is
(12,)
so this will give the error as per above:
seasonal_decompose(ts, freq=12, model='additive')
but if I try freq=11 or any other int less than 12, e.g.
seasonal_decompose(ts, freq=11, model='additive')
this works.
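A minimal sketch of that fix, assuming an older statsmodels where the argument is still called freq (newer releases renamed it to period):
import pandas as pd
import statsmodels.api as sm

# hypothetical 12-point monthly series like the one above
idx = pd.date_range('2014-01-01', periods=12, freq='MS')
ts = pd.Series([0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0.0], index=idx)

# freq=12 on 12 observations raises the broadcast ValueError;
# any period shorter than the series length goes through
decompose = sm.tsa.seasonal_decompose(ts, freq=11, model='additive')
decompose.plot()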
I noticed that with newer pandas and statsmodels versions it seems to work.
Given a series:
2016-01-03 8.326275
2016-01-10 8.898229
2016-01-17 8.754792
2016-01-24 8.658172
2016-01-31 8.731659
2016-02-07 9.047233
2016-02-14 8.799662
2016-02-21 8.783549
2016-02-28 8.782783
2016-03-06 9.081825
2016-03-13 8.737934
2016-03-20 8.658693
2016-03-27 8.666475
2016-04-03 9.029178
2016-04-10 8.781555
2016-04-17 8.720787
2016-04-24 8.633909
2016-05-01 8.937744
2016-05-08 8.804925
2016-05-15 8.766862
2016-05-22 8.651899
2016-05-29 8.653645
...
And pd/sm version:
statsmodels.__version__ 0.8.0
pandas.__version__ 0.20.1
This is the result:
import numpy as np
import statsmodels.api as sm

logdf = np.log(df_series)
decompose = sm.tsa.seasonal_decompose(logdf, model='additive', filt=None, freq=1, two_sided=True)
decompose.plot()
I hope this could solve your problem too.

How to create a new Pandas DataFrame with columns from another DataFrame (Python)

I am creating a DataFrame from a csv file, where my index (rows) is date and my column names are names of cities.
After I create the raw DataFrame, I am trying to create a DataFrame from selected columns. I have tried:
A = df['city1']  # city 1
B = df['city2']
C = pd.merge(A, B)
but it doesn't work. This is what A and B look like:
Date
2013-11-01 2.56
2013-12-01 1.77
2014-01-01 0.00
2014-02-01 0.38
2014-03-01 13.16
2014-04-01 10.29
2014-05-01 15.43
2014-06-01 11.48
2014-07-01 8.54
2014-08-01 11.11
2014-09-01 2.71
2014-10-01 4.16
2014-11-01 13.01
2014-12-01 9.59
Name: Seattle.Washington, dtype: float64
And this is what I am looking to create:
City1 City2
Date
2013-11-01 0.00 2.94
2013-12-01 8.26 3.41
2014-01-01 1.11 14.27
2014-02-01 32.86 84.26
2014-03-01 34.12 0.00
2014-04-01 68.39 0.00
2014-05-01 27.17 9.09
2014-06-01 10.47 32.00
2014-07-01 14.19 26.83
2014-08-01 14.91 6.36
2014-09-01 3.76 8.32
2014-10-01 5.83 2.19
2014-11-01 10.79 2.64
2014-12-01 21.24 8.08
Any suggestion?
Error Message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-222-ec50ff9f372f> in <module>()
14 S = df['City1']
15 A = df['City2']
16
---> 17 print merge(S,A)
18 #df2=pd.merge(A,A)
19 #print df2
C:\...\merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_on=right_on, left_index=left_index,
37 right_index=right_index, sort=sort, suffixes=suffixes,
---> 38 copy=copy)
39 return op.get_result()
40 if __debug__:
Answer (courtesy of @EdChum):
df[['City1', 'City2']]
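For completeness, a short sketch of two equivalent ways to build that two-column frame (column names follow the question). pd.merge failed above because, in the pandas of that era, merge expected DataFrame inputs and joined on columns rather than the index by default:
import pandas as pd

# direct column selection, as in the answer above
subset = df[['City1', 'City2']]

# or, given the two Series, align them on the shared date index
subset = pd.concat([df['City1'], df['City2']], axis=1)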