pandas data frame iterating over 2 index variables - pandas

I have a data frame with two index levels, "DATE" (monthly data) and "ID", and a column named Volume. For every unique ID, I want to fill a new column with the average value of Volume.
The basic idea is to figure out which months are above the yearly average for every ID.
list(df.index)
(Timestamp('1970-09-30 00:00:00'), 12167.0)
print(df.index.name)
None
I could not find a tutorial that addresses this.
Can someone please point me in the right direction?
SHRCD EXCHCD SICCD PRC VOL RET SHROUT \
DATE PERMNO
1970-08-31 10559.0 10.0 1.0 5311.0 35.000 1692.0 0.030657 12048.0
12626.0 10.0 1.0 5411.0 46.250 926.0 0.088235 6624.0
12749.0 11.0 1.0 5331.0 45.500 5632.0 0.126173 34685.0
13100.0 11.0 1.0 5311.0 22.000 1759.0 0.171242 15107.0
13653.0 10.0 1.0 5311.0 13.125 141.0 0.220930 1337.0
13936.0 11.0 1.0 2331.0 11.500 270.0 -0.053061 3942.0
14322.0 11.0 1.0 5311.0 64.750 6934.0 0.024409 154187.0
16969.0 10.0 1.0 5311.0 42.875 1069.0 0.186851 13828.0
17072.0 10.0 1.0 5311.0 14.750 777.0 0.026087 5415.0
17304.0 10.0 1.0 5311.0 24.875 1939.0 0.058511 8150.0

You can group by the year of the DATE level and by PERMNO, then use transform, which returns a Series the same size as the original DataFrame:
print (df)
VOL
DATE PERMNO
1970-08-31 10559.0 1
10559.0 2
12749.0 3
1971-08-31 13100.0 4
13100.0 5
df['avg'] = df.groupby([df.index.get_level_values(0).year, 'PERMNO'])['VOL'].transform('mean')
print (df)
VOL avg
DATE PERMNO
1970-08-31 10559.0 1 1.5
10559.0 2 1.5
12749.0 3 3.0
1971-08-31 13100.0 4 4.5
13100.0 5 4.5
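To get from the yearly average to the stated goal (which months are above it), a comparison against the new column should be enough. The following is a minimal sketch on made-up data, not the asker's file; the column name above_avg is an assumption introduced here for illustration.

```python
import pandas as pd

# Small made-up sample with the same index layout (DATE, PERMNO) as the question
idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("1970-08-31"), 10559.0),
        (pd.Timestamp("1970-09-30"), 10559.0),
        (pd.Timestamp("1970-08-31"), 12749.0),
        (pd.Timestamp("1971-08-31"), 13100.0),
        (pd.Timestamp("1971-09-30"), 13100.0),
    ],
    names=["DATE", "PERMNO"],
)
df = pd.DataFrame({"VOL": [1, 2, 3, 4, 5]}, index=idx)

# Yearly average per PERMNO, broadcast back to every monthly row
years = df.index.get_level_values("DATE").year
df["avg"] = df.groupby([years, "PERMNO"])["VOL"].transform("mean")

# Hypothetical flag column: True where the month's volume exceeds that average
df["above_avg"] = df["VOL"] > df["avg"]
print(df)
```

Filtering with df[df["above_avg"]] would then keep only the months above the yearly average.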

Related

Y axis in panel

I have a DataFrame dft like this:
Date Apple Amazon Facebook US Bond
0 2018-01-02 NaN NaN NaN NaN
1 2018-01-03 NaN NaN NaN NaN
2 2018-01-04 NaN NaN NaN NaN
3 2018-01-05 NaN NaN NaN NaN
4 2018-01-08 NaN NaN NaN NaN
... ... ... ... ... ...
665 2020-08-24 0.708554 0.528557 0.152367 0.185932
666 2020-08-25 0.639243 0.534403 0.106550 0.133563
667 2020-08-26 0.520858 0.562482 0.018176 0.133283
668 2020-08-27 0.549531 0.593006 -0.011161 0.261187
669 2020-08-28 0.552725 0.595580 -0.038886 0.278847
Change the Date type
dft["Date"] = pd.to_datetime(dft["Date"]).dt.date
idf = dft.interactive()
date_from = datetime.date(yearStart, 1, 1)
date_to = datetime.date(yearEnd, 8, 31)
date_slider = pn.widgets.DateSlider(name="date", start = date_from, end = date_to, steps=1, value=date_from)
date_slider
and I see a date slider. All good. More controls:
tickerNames = ['Apple', 'Amazon', 'Facebook', 'US Bond']
# Radio buttons for metric measures
yaxis = pn.widgets.RadioButtonGroup(
name='Y axis',
options=tickerNames,
button_type='success'
)
pipeline = (
idf[
(idf.Date <= date_slider)
]
.groupby(['Date'])[yaxis].mean()
.to_frame()
.reset_index()
.sort_values(by='Date')
.reset_index(drop=True)
)
if I now type
pipeline
I see a table with a date slider above it, where each symbol is its own "tab". If I click on a symbol and change the slider, I see more or less data. Again, all good. Here is where I get confused. I want to plot the values of the columns:
plot = pipeline.hvplot(x = 'Date', by='WHAT GOES IN HERE', y=yaxis,line_width=2, title="Prices")
NOTE: WHAT GOES IN HERE. I need the values in the `dft` dataframe above, but I can't hardwire the symbol, since it depends on what the user chooses in the table. I want an interactive chart, so that as I move the date_slider, more and more of the data for each symbol gets plotted.
If I do it the old fashioned way:
fig = plt.figure(figsize=(15, 7))
ax1 = fig.add_subplot(1, 1, 1)
dft.plot(ax=ax1)
ax1.set_xlabel('Date')
ax1.set_ylabel('21days rolling daily change')
ax1.set_title('21days rolling daily change of financial assets')
plt.show()
it works as expected.

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
This was a rough one, but you can do it like this if you want to avoid explicit loops. Note that it redefines your df twice, because the percentage step needs the total columns created in the first step. Sorry about that, but it is the best I could do. If you have any questions, just comment.
import numpy as np
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]}) for x, y in df.groupby(np.arange(df.shape[1])//2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100}) for x, y in df.groupby(np.arange(df.shape[1])//3, axis=1)], axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
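The answers above group along the column axis, which newer pandas versions deprecate. As a hedged alternative, the sketch below builds each room's block separately and concatenates the blocks, so it does not depend on groupby with axis=1. The names pieces and result are chosen here for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["Room 1", "Room 2", "Room 3"], ["on", "off"]])
df = pd.DataFrame([[1, 4, 3, 6, 5, 15], [3, 2, 1, 5, 1, 7]], columns=idx)

pieces = {}
for room in df.columns.get_level_values(0).unique():
    block = df[room].copy()                      # columns: 'on', 'off'
    block["total"] = block["on"] + block["off"]
    block["pct_on"] = block["on"] / block["total"]
    pieces[room] = block

# The dict keys become the outer column level again
result = pd.concat(pieces, axis=1)
print(result)
```

Because each block is completed before concatenation, the columns come out already grouped by room and no reindex step is needed.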

convert long dataframe to single line dataframe with enumerating the column names

I have a df that looks like this:
data = \
[{'len_overlap': 2, 'prox': 1.0, 'freq_sum_w': 0.03962264150943396},
{'len_overlap': 22, 'prox': np.nan, 'freq_sum_w': 0.0311111962264150943396}]
df = pd.DataFrame(data)
   len_overlap  prox  freq_sum_w
0            2     1   0.0396226
1           22   nan   0.0311112
I want to make it one row data frame, so far I have this:
pd.DataFrame([np.ravel(df.values)], columns=sum([[f'{x}_{n}' for x in df.columns] for n in range(df.shape[0])], []))
   len_overlap_0  prox_0  freq_sum_w_0  len_overlap_1  prox_1  freq_sum_w_1
0              2       1     0.0396226             22     nan     0.0311112
This is what I want (the ints convert to floats, don't know why, but that's not a problem) but I'm wondering if there is a nicer, more Pandas way for doing this.
Thanks
Try unstack(), to_frame() and the transpose attribute (T):
out=df.unstack().to_frame().T
Finally:
out.columns=out.columns.map(lambda x:'_'.join(map(str,x)))
output of out:
len_overlap_0 len_overlap_1 prox_0 prox_1 freq_sum_w_0 freq_sum_w_1
0 2.0 22.0 1.0 NaN 0.039623 0.031111
One line but more complex. The values have to be sorted together with their labels; sorting only the labels would attach the wrong names to the values:
>>> df.unstack() \
      .sort_index(level=1) \
      .to_frame() \
      .pipe(lambda d: d.set_axis(d.index.map(lambda t: f'{t[0]}_{t[1]}'))) \
      .transpose()
   freq_sum_w_0  len_overlap_0  prox_0  freq_sum_w_1  len_overlap_1  prox_1
0      0.039623            2.0     1.0      0.031111           22.0     NaN
IMHO, I think the "more Pandas way" is to use a MultiIndex:
>>> df.stack().to_frame().transpose()
0 1
len_overlap prox freq_sum_w len_overlap freq_sum_w
0 2.0 1.0 0.039623 22.0 0.031111
or better (like pd.melt):
>>> df.stack()
0 len_overlap 2.000000
prox 1.000000
freq_sum_w 0.039623
1 len_overlap 22.000000
freq_sum_w 0.031111
Try,
df_out = df.unstack()
df_out = df_out.sort_index(level=1)
df_out.index = [f'{i}_{j}' for i, j in df_out.index]
df_out.to_frame().T
Output:
freq_sum_w_0 len_overlap_0 prox_0 freq_sum_w_1 len_overlap_1 prox_1
0 0.039623 2.0 1.0 0.031111 22.0 NaN
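The unstack-based answers reorder the columns. If the row-by-row order from the question matters (len_overlap_0, prox_0, freq_sum_w_0, then row 1), a tidier version of the asker's own approach may be enough. This is a sketch with shortened sample values; the helper name flatten_rows is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        {"len_overlap": 2, "prox": 1.0, "freq_sum_w": 0.0396},
        {"len_overlap": 22, "prox": np.nan, "freq_sum_w": 0.0311},
    ]
)

def flatten_rows(frame):
    """Return a one-row frame whose columns are named '<column>_<row label>'."""
    # Row-major order: all columns of row 0, then all columns of row 1, ...
    names = [f"{col}_{row}" for row in frame.index for col in frame.columns]
    # ravel() walks the values in the same row-major order as the names
    return pd.DataFrame([frame.to_numpy().ravel()], columns=names)

out = flatten_rows(df)
print(out)
```

The ints still become floats here, because the values pass through a single NumPy array that also has to hold NaN.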

Some confusion in creating pivot table

I am trying to create a pivot table, but I am not getting the result I want, and I can't understand why.
I have a dataframe like this -
data_channel_is_lifestyle data_channel_is_bus shares
0 0.0 0.0 593
1 0.0 1.0 711
2 0.0 1.0 1500
3 0.0 0.0 1200
4 0.0 0.0 505
The result I am looking for has the names of the columns in the index and the sum of shares in the column. So I did this:
news_copy.pivot_table(index=['data_channel_is_lifestyle','data_channel_is_bus'], values='shares', aggfunc=sum)
but I am getting a result like this:
shares
data_channel_is_lifestyle data_channel_is_bus
0.0 0.0 107709305
1.0 19168370
1.0 0.0 7728777
I don't want these 0's and 1's; I just want the result to look like this:
shares
data_channel_is_lifestyle 107709305
data_channel_is_bus 19168370
How can i do this?
As you put it, it's just matrix multiplication:
df.filter(like='data').T @ df[['shares']]
Output (for sample data):
shares
data_channel_is_lifestyle 0.0
data_channel_is_bus 2211.0
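For readers who want to run this, here is a minimal self-contained sketch on the five sample rows from the question. The variable names news, flags, totals and totals_alt are chosen here for illustration, and the second form is an equivalent written without the matrix operator.

```python
import pandas as pd

news = pd.DataFrame(
    {
        "data_channel_is_lifestyle": [0.0, 0.0, 0.0, 0.0, 0.0],
        "data_channel_is_bus": [0.0, 1.0, 1.0, 0.0, 0.0],
        "shares": [593, 711, 1500, 1200, 505],
    }
)

# The 0/1 indicator columns
flags = news.filter(like="data_channel")

# Matrix product: each flag column dotted with shares = sum of shares where the flag is 1
totals = flags.T @ news[["shares"]]

# Same result: multiply every flag column by shares row-wise, then sum each column
totals_alt = flags.mul(news["shares"], axis=0).sum().to_frame("shares")
print(totals)
```

This works because the channel columns are 0/1 indicators, so multiplying by them keeps only the shares of rows that belong to the channel.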

Python Pandas rolling mean with window value in another column

I am using pandas.DataFrame.rolling to calculate rolling means for a stock index close price series. I can do this in Excel. How can I do the same thing in Pandas? Thanks!
Below is my Excel formula to calculate the moving average and the window length is in column ma window:
date close ma window ma
2018/3/21 4061.0502
2018/3/22 4020.349
2018/3/23 3904.9355 3 =AVERAGE(INDIRECT("B"&(ROW(B4)-C4+1)):B4)
2018/3/26 3879.893 2 =AVERAGE(INDIRECT("B"&(ROW(B5)-C5+1)):B5)
2018/3/27 3913.2689 4 =AVERAGE(INDIRECT("B"&(ROW(B6)-C6+1)):B6)
2018/3/28 3842.7155 7 =AVERAGE(INDIRECT("B"&(ROW(B7)-C7+1)):B7)
2018/3/29 3894.0498 1 =AVERAGE(INDIRECT("B"&(ROW(B8)-C8+1)):B8)
2018/3/30 3898.4977 6 =AVERAGE(INDIRECT("B"&(ROW(B9)-C9+1)):B9)
2018/4/2 3886.9189 2 =AVERAGE(INDIRECT("B"&(ROW(B10)-C10+1)):B10)
2018/4/3 3862.4796 8 =AVERAGE(INDIRECT("B"&(ROW(B11)-C11+1)):B11)
2018/4/4 3854.8625 1 =AVERAGE(INDIRECT("B"&(ROW(B12)-C12+1)):B12)
2018/4/9 3852.9292 9 =AVERAGE(INDIRECT("B"&(ROW(B13)-C13+1)):B13)
2018/4/10 3927.1729 3 =AVERAGE(INDIRECT("B"&(ROW(B14)-C14+1)):B14)
2018/4/11 3938.3434 1 =AVERAGE(INDIRECT("B"&(ROW(B15)-C15+1)):B15)
2018/4/12 3898.6354 3 =AVERAGE(INDIRECT("B"&(ROW(B16)-C16+1)):B16)
2018/4/13 3871.1443 8 =AVERAGE(INDIRECT("B"&(ROW(B17)-C17+1)):B17)
2018/4/16 3808.863 2 =AVERAGE(INDIRECT("B"&(ROW(B18)-C18+1)):B18)
2018/4/17 3748.6412 2 =AVERAGE(INDIRECT("B"&(ROW(B19)-C19+1)):B19)
2018/4/18 3766.282 4 =AVERAGE(INDIRECT("B"&(ROW(B20)-C20+1)):B20)
2018/4/19 3811.843 6 =AVERAGE(INDIRECT("B"&(ROW(B21)-C21+1)):B21)
2018/4/20 3760.8543 3 =AVERAGE(INDIRECT("B"&(ROW(B22)-C22+1)):B22)
Here is a snapshot of the Excel version.
I figured it out. But I don't think it is the best solution...
import pandas as pd
data = pd.read_excel('data.xlsx' ,index_col='date')
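def get_price_mean(x):
    # x is the expanding slice of 'close'; its last element is the current row
    win = data.loc[:,'ma window'].iloc[x.shape[0]-1].astype('int')
    win = max(win,0)
    # mean of the last `win` values of the slice
    return pd.Series(x).rolling(window = win).mean().iloc[-1]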
data.loc[:,'ma'] = data.loc[:,'close'].expanding().apply(get_price_mean)
print(data)
The result is:
close ma window ma
date
2018-03-21 4061.0502 NaN NaN
2018-03-22 4020.3490 NaN NaN
2018-03-23 3904.9355 3.0 3995.444900
2018-03-26 3879.8930 2.0 3892.414250
2018-03-27 3913.2689 4.0 3929.611600
2018-03-28 3842.7155 7.0 NaN
2018-03-29 3894.0498 1.0 3894.049800
2018-03-30 3898.4977 6.0 3888.893400
2018-04-02 3886.9189 2.0 3892.708300
2018-04-03 3862.4796 8.0 3885.344862
2018-04-04 3854.8625 1.0 3854.862500
2018-04-09 3852.9292 9.0 3876.179456
2018-04-10 3927.1729 3.0 3878.321533
2018-04-11 3938.3434 1.0 3938.343400
2018-04-12 3898.6354 3.0 3921.383900
2018-04-13 3871.1443 8.0 3886.560775
2018-04-16 3808.8630 2.0 3840.003650
2018-04-17 3748.6412 2.0 3778.752100
2018-04-18 3766.2820 4.0 3798.732625
2018-04-19 3811.8430 6.0 3817.568150
2018-04-20 3760.8543 3.0 3779.659767
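A plain loop over positions may be easier to read than combining expanding() with rolling(). The sketch below is one possible alternative, not the asker's method: it uses only the first six rows of the data above, and the helper name variable_window_mean is made up for illustration. It returns NaN when the window is missing or longer than the available history, which matches the result table above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "close": [4061.0502, 4020.349, 3904.9355, 3879.893, 3913.2689, 3842.7155],
        "ma window": [np.nan, np.nan, 3, 2, 4, 7],
    }
)

def variable_window_mean(values, windows):
    """Mean of the last w values ending at each row, with w taken from that row."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for i, w in enumerate(windows):
        if np.isnan(w):
            continue                      # no window given for this row
        w = int(w)
        if w < 1 or w > i + 1:
            continue                      # not enough history for this window
        out[i] = values[i - w + 1 : i + 1].mean()
    return out

data["ma"] = variable_window_mean(data["close"], data["ma window"])
print(data)
```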