Pandas: aggregating by different columns with MultiIndex columns

I would like to take a dataframe with MultiIndex columns (where the index is a DatetimeIndex) and then aggregate with different functions depending on the column.
For example, consider the following table, where the index contains dates, the first level of the columns is Price and Volume, and the second level of the columns is tickers (e.g. AAPL and AMZN).
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ticker': ['AAPL'] * 365,
                    'date': pd.date_range(start='20170101', end='20171231'),
                    'volume': [np.random.randint(50, 100) for i in range(365)],
                    'price': [np.random.randint(100, 200) for i in range(365)]})
df2 = pd.DataFrame({'ticker': ['AMZN'] * 365,
                    'date': pd.date_range(start='20170101', end='20171231'),
                    'volume': [np.random.randint(50, 100) for i in range(365)],
                    'price': [np.random.randint(100, 200) for i in range(365)]})
df = pd.concat([df1, df2])
grp = df.groupby(['date', 'ticker']).mean().unstack()
grp.head()
What I would like to do is to aggregate the data by month, but taking the mean of price and sum of volume.
I would have thought that something along the lines of grp.resample("MS").agg({"price": "mean", "volume": "sum"}) should work, but it does not because of the MultiIndex columns. What's the best way to accomplish this?

You can group the original long-format df directly, building the month key with strftime:
df.groupby([pd.to_datetime(df.date).dt.strftime('%Y-%m'), df.ticker]).\
    agg({"price": "mean", "volume": "sum"}).unstack()
Out[529]:
price volume
ticker AAPL AMZN AAPL AMZN
date
2017-01 155.548387 141.580645 2334 2418
2017-02 154.035714 156.821429 2112 2058
2017-03 154.709677 148.806452 2258 2188
2017-04 154.366667 149.366667 2271 2254
2017-05 154.774194 155.096774 2331 2264
2017-06 147.333333 145.133333 2220 2302
2017-07 149.709677 150.645161 2188 2412
2017-08 150.806452 154.645161 2265 2341
2017-09 157.033333 151.466667 2199 2232
2017-10 149.387097 145.580645 2303 2203
2017-11 154.100000 150.266667 2212 2275
2017-12 156.064516 149.290323 2265 2224
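Alternatively, if you want to keep the DatetimeIndex and resample the wide frame grp directly, a minimal sketch (assuming grp from the question): select each top-level column, apply its own reduction, and concatenate the results back under a two-level column index.
# price and volume each get their own monthly reduction; the dict keys
# become the top level of the resulting MultiIndex columns.
out = pd.concat(
    {
        'price': grp['price'].resample('MS').mean(),
        'volume': grp['volume'].resample('MS').sum(),
    },
    axis=1,
)
out.head()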

Related

How to arrange df in ascending order and reset the index numbering

My question is about stock data.
Open Price High Price Low Price Close Price WAP No.of Shares No. of Trades Total Turnover (Rs.) Deliverable Quantity % Deli. Qty to Traded Qty Spread High-Low Spread Close-Open Pert Rank Year
Date
2022-12-30 419.75 421.55 415.55 418.95 417.841704 1573 183 657265.0 954 60.65 6.00 -0.80 0.131558 2022
2022-12-29 412.15 418.40 411.85 415.90 413.236152 1029 117 425220.0 766 74.44 6.55 3.75 0.086360 2022
2022-12-28 411.90 422.05 411.30 415.35 417.917534 2401 217 1003420.0 949 39.53 10.75 3.45 0.128329 2022
2022-12-27 409.60 414.70 407.60 412.70 411.436312 1052 136 432831.0 687 65.30 7.10 3.10 0.066182 2022
2022-12-26 392.00 409.55 389.60 406.35 400.942300 2461 244 986719.0 1550 62.98 19.95 14.35 0.240920 2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-01-05 338.75 358.70 338.75 355.65 351.255581 31802 896 11170630.0 15781 49.62 19.95 16.90 0.949153
The date column is in descending order and has to be converted to ascending order. At the same time, the index also has to be converted to ascending order, i.e. 1, 2, 3, 4; it should not be in descending order.
I tried sort_values, but it returned a NoneType object, and I am expecting a dataframe. I also tried groupby. Is there any other way?
Sorting date values with sort_values works for me:
df = pd.DataFrame({'Dates': ['2022-12-30', '2022-12-29','2022-12-28'],'Prices':[100,101,99]})
df
Out[142]:
Dates Prices
0 2022-12-30 100
1 2022-12-29 101
2 2022-12-28 99
df.sort_values('Dates',ascending=True,inplace=True)
df
Out[144]:
Dates Prices
2 2022-12-28 99
1 2022-12-29 101
0 2022-12-30 100
You need to sort_values and reset_index:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            "Dates": pd.Series(pd.date_range("2022-12-24", "2022-12-29")),
            "Prices": pd.Series(np.random.randint(0, 100, size=(6,)))
        })
>>> df
Dates Prices
0 2022-12-24 31
1 2022-12-25 2
2 2022-12-26 27
3 2022-12-27 90
4 2022-12-28 87
5 2022-12-29 49
>>> df = df.sort_values(by="Dates", ascending=True).reset_index(drop=True)
Note that chaining inplace=True onto the copy returned by sort_values, as in df.sort_values(...).reset_index(drop=True, inplace=True), modifies only that temporary copy and returns None, leaving df unchanged; this is also the likely reason your own attempt returned a NoneType object.
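Since in the question's frame Date is the index rather than a column, a small sketch using sort_index instead (assuming the stock dataframe from the question is named df):
# Sort by the Date index, then turn it back into a column so the row
# numbering restarts at 0, 1, 2, ...
df = df.sort_index().reset_index()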

Add column for percentages

I have a df that looks like this:
Total Initial Follow Sched Supp Any
0 5525 3663 968 296 65 533
I transposed the df because I have to add a column with the percentages based on the 'Total' column.
Now my df looks like this:
0
Total 5525
Initial 3663
Follow 968
Sched 296
Supp 65
Any 533
So, how can I add this percentage column?
The expected output looks like this
0 Percentage
Total 5525 100
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
I'm working in JupyterLab with pandas and numpy.
Divide column 0 by the scalar from the Total row with Series.div, then multiply by 100 with Series.mul, and finally round with Series.round:
df['Percentage'] = df[0].div(df.loc['Total', 0]).mul(100).round(1)
print (df)
0 Percentage
Total 5525 100.0
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
Consider the below df:
In [1328]: df
Out[1328]:
b
a
Total 5525
Initial 3663
Follow 968
Sched 296
Supp 65
Any 533
In [1327]: df['Perc'] = round(df.b.div(df.loc['Total', 'b']) * 100, 1)
In [1330]: df
Out[1330]:
b Perc
a
Total 5525 100.0
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
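For completeness, a sketch that computes the percentages before transposing, starting from the original one-row frame (values copied from the question):
import pandas as pd

df = pd.DataFrame({'Total': [5525], 'Initial': [3663], 'Follow': [968],
                   'Sched': [296], 'Supp': [65], 'Any': [533]})

# Divide every column by Total row-wise, then stack the original row and
# the percentage row and transpose.
pct = df.div(df['Total'], axis=0).mul(100).round(1)
out = pd.concat([df, pct]).T
out.columns = [0, 'Percentage']
print(out)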

Difference between two date columns in Pandas

I am trying to get the difference between two date columns; the script and data used are below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results as in the DifferenceinDays column, which was calculated in Excel, but in Python I am getting the same value for all three rows. Please refer to the code below. Can anyone tell me how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, divide the timedelta by np.timedelta64(1, 'h') to get hours:
Additionally, it looks like Excel calculated the hours differently, unsure why; with the dates parsed month-first as above, all three rows really do span the identical interval of 414 days 11:21 (9947.35 hours), so pandas giving the same value for each row is correct.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
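Equivalently, a small sketch using the Timedelta accessors instead of dividing by a numpy unit (assuming the same df as above):
# total_seconds() converts each timedelta to seconds; divide by 3600 for
# hours. .dt.days gives just the whole-day component.
df['hrs'] = (df['End'] - df['Start']).dt.total_seconds() / 3600
df['days'] = (df['End'] - df['Start']).dt.days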

How to sort the date index of a pandas dataframe so that all the newer dates are on one side of the X-axis labels when plotted on a graph

I have a pandas dataframe with dates as the indexes (indices). When I plot the values, the indexes (dates, i.e. the X-axis labels) do not show up in a proper sequence on the X-axis of the plotted graph. For example, instead of all the 2018 dates coming first (e.g. 2018/02/15, 2018/03/10, 2018/10/12 ... 2019/01/07, 2019/01/10, 2019/03/16 ...), the dates show up on the X-axis in a mismatched order, for example 2019/01/07, 2019/01/10, 2018/02/15, 2018/03/10, 2019/03/16 ..., even though I have applied sorting to the indexes (i.e. the dates). How do I handle this issue? Thank you in advance.
I tried to sort the indexes but this did not work.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DTT_data = miniBid_data.groupby(['Mini_Bid_Date_2'])[['New_Cost_Per_Load', 'Volume']].aggregate([np.mean])
# sort the data
DTT_data.sort_index(inplace=True, ascending=True)
fig, ax = plt.subplots()
color1 = 'tab:red'
DTT_data.plot(kind='line', figsize=(12,8), legend=False, ax=ax, logy=True, marker='*')
ax.set_title('Trends of Selected Variables')
ax.set_ylabel('Log10 of Values', color=color1)
ax.legend(loc='upper left')
ax.set_xlabel('Event Dates')
ax.tick_params(axis='y', labelcolor=color1)
#ax.legend(loc='upper left')
ax1 = ax.twinx()
color2 = 'tab:blue'
DTT_data2 = miniBid_data.groupby(['Mini_Bid_Date_2'])['Carrier_Code'].count()
DTT_data2.plot(kind='bar', figsize=(12,8), legend=False, ax=ax1, color=color2)
DTT_data2.sort_index(inplace=True, ascending=False)
ax1.set_ylabel('Log10 of Values', color=color2)
ax1.set_yscale('log')
ax1.tick_params(axis='y', labelcolor=color2)
ax1.legend(loc='upper right')
fig.autofmt_xdate()
fig.tight_layout()
plt.show()
Sample Data:
a) DTT_data =
Mini_Bid_Date_2 New_Cost_Per_Load Volume
01/07/2019 1604.3570393487105 1.6431478968792401
02/25/2018 1816.1534797297306 2.831081081081081
10/22/2018 1865.5403827160494 2.074074074074074
10/29/2018 1945.3011032028478 1.9023576512455516
01/08/2019 1947.7562972972971 1.162162162162162
02/11/2019 2062.7133737931017 2.3424827586206916
11/05/2018 2095.531836956521 1.7753623188405796
12/08/2018 2155.48935907859 1.437825203252031
02/04/2019 2169.209245791246 2.2669696969696966
02/04/2018 2189.3693333333335 5.0
01/14/2019 2313.3854211711728 1.1587162162162181
01/20/2019 2380.9063928571427 1.0
01/21/2019 2631.0407864661634 1.3657894736842129
12/03/2018 2684.0808513089005 4.402827225130894
02/25/2019 2844.047048492792 1.89397116644823
11/12/2018 3011.510282722513 2.147905759162304
10/08/2018 3042.3035776536312 1.8130726256983247
11/19/2018 3063.736631460676 1.7407865168539327
02/18/2019 3148.531689480355 6.798162230671736
10/01/2018 3248.0486851851842 2.1951388888888905
01/19/2019 3291.1334154589376 1.4626086956521749
10/15/2018 11881.90527833753 1.779911838790932
01/28/2019 13786.149445804196 1.6329195804195813
03/04/2019 14313.741501103752 1.5459455481972018
12/10/2018 100686.89588865546 3.051260504201676
b) DTT_data2 =
Mini_Bid_Date_2 Carrier_Code
12/08/2018 1476
03/04/2019 1359
02/04/2019 1188
10/29/2018 1124
12/03/2018 955
10/08/2018 895
11/19/2018 890
10/15/2018 794
02/18/2019 789
02/25/2019 763
01/07/2019 737
02/11/2019 725
01/21/2019 665
10/01/2018 648
02/25/2018 592
01/28/2019 572
12/10/2018 476
01/14/2019 444
11/12/2018 382
10/22/2018 324
11/05/2018 276
01/19/2019 207
01/20/2019 56
01/08/2019 37
02/04/2018 30
My expectation is to have the dates (indexes) show up in sequential order as labels on the X-axis, for example 2018/02/15, 2018/03/10, 2018/10/12 ... 2019/01/07, 2019/01/10, 2019/03/16.
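A likely cause, sketched under the assumption that Mini_Bid_Date_2 holds MM/DD/YYYY strings: sorting such strings orders them by month first regardless of year, and DTT_data2 is sorted only after it has already been plotted. Parsing the index to real datetimes and sorting it before plotting avoids both problems:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for DTT_data: MM/DD/YYYY strings as the index.
df = pd.DataFrame({'New_Cost_Per_Load': [1604.36, 1816.15, 1865.54]},
                  index=['01/07/2019', '02/25/2018', '10/22/2018'])

# Parse to real datetimes, then sort BEFORE plotting; sorting the raw
# strings would put '01/07/2019' ahead of '02/25/2018'.
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
df = df.sort_index()

fig, ax = plt.subplots()
df['New_Cost_Per_Load'].plot(ax=ax, marker='*', logy=True)
fig.autofmt_xdate()
plt.show()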

groupby pandas dataframe, take difference between value of latest and earliest date

I have a Cumulative column and I want to groupby index and take the values corresponding to the latest date minus the values corresponding to the earliest date.
Very similar to this: group by pandas dataframe and select latest in each group
But take the difference between latest and earliest in each group.
I'm a python rookie, and here is my solution:
import pandas as pd
from io import StringIO
csv = StringIO("""index id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01""")
df = pd.read_table(csv, sep=r'\s+', index_col='index')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_sort = df.sort_values('date')
df_sort.drop(['product'], axis=1, inplace=True)
df_sort.groupby('id').tail(1).set_index('id') - df_sort.groupby('id').head(1).set_index('id')
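A more compact sketch of the same idea, assuming the df_sort built above: aggregate each group's date column with first and last after sorting, then subtract.
# Sort once, then take last - first within each id group.
out = df_sort.groupby('id')['date'].agg(['first', 'last'])
out['diff'] = out['last'] - out['first']
print(out)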