Not getting top 5 values for each month using Grouper and groupby in pandas

I'm trying to get the top 5 values of amount for each month, along with the text column. I've tried resampling and a groupby statement.
Dataset:
text      amount  date
123…       11.00  11-05-17
123abc…    10.00  11-08-17
Xyzzy…     22.00  12-07-17
Xyzzy…    221.00  11-08-17
Xyzzy…    212.00  10-08-17
Xyzzy…    242.00  18-08-17
Code:
df1 = df.groupby(['text', pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
This gives me groups of text, but they are not arranged by month and the largest values are not sorted in descending order.
df1 = df.groupby([pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
This code works fine but does not give the text column.

assuming that amount is a numeric column:
In [8]: df.groupby(['text', pd.Grouper(key='date', freq='M')]).apply(lambda x: x.nlargest(2, 'amount'))
Out[8]:
                           text  amount       date
text    date
123abc… 2017-11-30 1    123abc…    10.0 2017-11-08
123…    2017-11-30 0       123…    11.0 2017-11-05
Xyzzy…  2017-08-31 5     Xyzzy…   242.0 2017-08-18
        2017-10-31 4     Xyzzy…   212.0 2017-10-08
        2017-11-30 3     Xyzzy…   221.0 2017-11-08
        2017-12-31 2     Xyzzy…    22.0 2017-12-07
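If the repeated group keys in the index above are unwanted, one way to drop them is DataFrame.droplevel. A minimal sketch, using a made-up frame with the question's column names (not the asker's actual data):

import pandas as pd

# Made-up data with the question's column names.
df = pd.DataFrame({
    'text': ['a', 'a', 'b', 'b', 'b'],
    'amount': [11.0, 10.0, 22.0, 221.0, 212.0],
    'date': pd.to_datetime(['2017-05-11', '2017-08-11', '2017-07-12',
                            '2017-08-11', '2017-08-10']),
})

top = (df.groupby(['text', pd.Grouper(key='date', freq='M')])
         .apply(lambda x: x.nlargest(2, 'amount')))

# The group keys appear both as index levels and as columns, so dropping the
# first two index levels keeps only the original row labels.
print(top.droplevel(['text', 'date']))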

You can use head with sort_values:
df1 = df.sort_values('amount',ascending=False).groupby(['text', pd.Grouper(key='date', freq='M')]).head(2)
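For completeness, a small runnable sketch of the sort_values + groupby + head pattern, grouped by month only so the text column is preserved (the frame below is made up, not the asker's data):

import pandas as pd

df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'amount': [11.0, 10.0, 22.0, 221.0, 212.0, 242.0, 5.0],
    'date': pd.to_datetime(['2017-08-01', '2017-08-05', '2017-08-09',
                            '2017-08-11', '2017-08-20', '2017-08-25',
                            '2017-09-02']),
})

# Sort once by amount, then take the first 5 rows of every monthly group;
# all original columns, including text, survive.
top5 = (df.sort_values('amount', ascending=False)
          .groupby(pd.Grouper(key='date', freq='M'))
          .head(5)
          .sort_values(['date', 'amount'], ascending=[True, False]))
print(top5)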

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stock data.
I have a list: ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY, SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
After this I tried to concatenate each trades dataframe generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol]
# I can see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher-level index, and not get all the NaN values in my result_df?
So I found a way to get what I wanted.
First I added this code inside the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted.
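For readers who want a self-contained illustration of the keys=/names= pattern above, here is a minimal sketch; the symbols and trade values are made up, not real stock data:

import pandas as pd

# Hypothetical per-stock trade frames with different numbers of rows.
frames = {
    '3MINDIA': pd.DataFrame({'BUY': pd.to_datetime(['2004-01-13', '2004-02-05']),
                             'SELL': pd.to_datetime(['2004-01-27', '2004-02-16']),
                             'profits': [-44.2, 18.0]}),
    'ABB': pd.DataFrame({'BUY': pd.to_datetime(['2005-03-08']),
                         'SELL': pd.to_datetime(['2005-03-11']),
                         'profits': [25.0]}),
}

# Stack each frame under its symbol as the outer index level, then concatenate
# row-wise; the variable row counts no longer produce NaN padding.
result_df = pd.concat(
    [pd.concat([trades], keys=[symbol], names=['Stocks'])
     for symbol, trades in frames.items()],
    axis=0,
)

# Selecting one symbol returns just its trades.
print(result_df.loc['3MINDIA'])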

How to sum up a selected range of rows via a condition?

I hope that with this additional information someone could find time to help me with this new issue.
Sample data here --> file
'Date as index' (datetime.date)
As I said, I'm trying to select a range in the dataframe every time x is in the interval [-190, 0], create a new dataframe with a new column which is the sum of the selected rows, and keep the last "encountered" date as index.
EDIT: The "loop" starts at the first date/beginning of the df; when a value which is less than 0 or -190 is found, it is summed up, and the search and summing continue, and so on. BUT I still get values which are in the interval (-190, 0).
Example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace values in the Date column with NaN, back-fill the missing values, and then aggregate with sum. The next step is to filter out rows that match the range by boolean indexing, and finally use DataFrame.resample, casting the Series to a one-column DataFrame with Series.to_frame:
#range -190, 0
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
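A tiny synthetic sketch of what these lines do (the values below are made up, not the linked CSV): rows whose x lies outside [-190, 0] inherit the Date of the next in-range row, so a whole run gets summed under that date:

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 13:40', '2019-01-01 13:44',
                            '2019-01-01 13:48', '2019-01-02 11:20',
                            '2019-01-02 11:23']),
    'x': [-1500.0, -2000.0, -20.0, -900.0, -150.0],
})

# Keep Date only where x lies in [-190, 0]; every other row receives the Date
# of the next in-range row via back-filling.
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()

# Sum x per labelled Date, then drop any group whose total still falls inside
# the interval.
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
print(df3)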

How to slice the pandas dataframe which has date as its index

I have a pandas dataframe which reads like below:
SKU
1/1/2017 1
2/1/2017 2
3/1/2017 3
4/1/2017 4
5/1/2017 5
So it has a date string as index.
How can I perform a slicing operation on this dataframe?
I tried
df.loc['1/1/2017':'3/1/2017']
It threw an error, saying that I have to convert the string indexes into datetime.
Kindly help.
For me it works nicely with your sample data:
print (df.loc['1/1/2017':'3/1/2017'])
SKU
1/1/2017 1
2/1/2017 2
3/1/2017 3
But I suggest creating a DatetimeIndex:
df.index = pd.to_datetime(df.index, dayfirst=True)
print (df.loc['2017-01-01':'2017-01-03'])
SKU
2017-01-01 1
2017-01-02 2
2017-01-03 3
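As a self-contained sketch of the suggestion above (the frame is rebuilt from the question's sample, and dayfirst=True is assumed because the labels look day-first):

import pandas as pd

# Rebuild the sample frame with plain string labels as the index.
df = pd.DataFrame({'SKU': [1, 2, 3, 4, 5]},
                  index=['1/1/2017', '2/1/2017', '3/1/2017', '4/1/2017', '5/1/2017'])

# Convert the string labels into a real DatetimeIndex (day-first assumed);
# slices become date-aware and partial string indexing works.
df.index = pd.to_datetime(df.index, dayfirst=True)

print(df.loc['2017-01-01':'2017-01-03'])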

Backfilling a pandas dataframe missed the first month

I have a pandas df of irrigation demand data that has daily values from 1900 to 2099. I resampled the df to get the monthly average, and then resampled and backfilled the monthly averages at a daily frequency, so that the average daily value for each month was input as the daily value for every day of that month.
My problem is that the first month was not backfilled, and there is only a value for the last day of that month (1900-01-31).
Here is my code; any suggestions on what I am doing wrong?
I2 = pd.DataFrame(IrrigDemand, columns = ['Year', 'Month', 'Day', 'IrrigArea_1', 'IrrigArea_2','IrrigArea_3','IrrigArea_4','IrrigArea_5'],dtype=float)
# set dates as index
I2.set_index('Year')
# make a column of dates in datetime format
dates = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# add the column of dates to df
I2['dates'] = pd.Series(dates, index=I2.index)
# set dates as index of df
I2.set_index('dates')
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.reset_index().set_index('dates').resample('m').mean()
I2_daily_average = I2_monthly_average.resample('d').bfill()
There is a problem: the first day is not added by resample('m'), so it is necessary to add it manually:
# make a column of dates in datetime format and assign to index
I2.index = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.resample('m').mean()
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
I2_daily_average = I2_monthly_average.resample('d').bfill()
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
I2 = pd.DataFrame({'a': range(10)}, index=rng)
print (I2)
a
2017-04-03 0
2017-04-23 1
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
I2_monthly_average = I2.resample('m').mean()
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
2017-04-01 0.5 <- added first day
I2_daily_average = I2_monthly_average.resample('d').bfill()
print (I2_daily_average.head())
a
2017-04-01 0.5
2017-04-02 0.5
2017-04-03 0.5
2017-04-04 0.5
2017-04-05 0.5
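An alternative (not part of the answer above, just a hedged sketch) is to skip the manual first-day insertion and instead reindex the monthly means over a complete daily range with back-filling; this continues from the sample session above:

# Starting again from the monthly means (before the first-day row was added):
monthly = I2.resample('m').mean()

# Build a daily index that starts on the first day of the first month and
# reindex with back-filling, so the first month is covered as well.
daily_index = pd.date_range(start=monthly.index[0].replace(day=1),
                            end=monthly.index[-1], freq='D')
daily = monthly.reindex(daily_index, method='bfill')
print(daily.head())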

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.ix[:,range(0, df_sales.shape[1]-1)]
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or the sales_df.
As for quarters, here we have only two quarters (US fiscal calendar year), so the quarterly aggregated data frame would look like:
        2017Q1  2017Q2
NE          10   27
NW          31   37.5
SW         133  139.17
I take the average over all days in Q1, and the same for Q2. Thus, for example, for the northeast region 'NE', the Q1 value is the average of only one day, 2017-03-30, i.e. 10, and the Q2 value is the average across 2017-04-05 to 2017-04-20, i.e.
(20+30+12+20+30+50)/6 = 27
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table, since it is a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it on the df_sales dataframe.
UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df is the original DF (before "pivoting").
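Not part of the answer above, but one more hedged sketch for the df_sales-only route: since df_sales already carries the dates as columns, transposing puts them on the index, where resample can take quarterly means directly (this matches the sum/shape[1] division only when no dates are missing for a region, as in the Setup here):

# Continuing from the Setup above (df_sales: regions as rows, dates as columns).
quarterly = df_sales.T.resample('Q').mean()       # quarterly mean per region
quarterly.index = quarterly.index.to_period('Q')  # label the rows as 2017Q1 / 2017Q2
print(quarterly.T)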