Group and divide a date column monthwise in pandas

I have a dataframe df:
store date invoice_count
A 2018-04-03 2
A 2018-04-06 5
A 2018-06-15 5
B 2018-05-05 2
B 2018-04-09 5
C 2018-02-16 6
which contains the invoice_count (number of invoices generated) for each store on given dates.
I am trying to group them so that I get a month-wise invoice_count for every store.
Expected final dataframe in this format:
store jan_18 feb_18 mar_18 apr_18 may_18 june_18
A 0 0 0 7 0 5
B 0 0 0 5 2 0
C 0 6 0 0 0 0
Is there any way to group the dates month-wise?
Note: This is a dummy dataframe; the final monthly column names can be in any other appropriate format.

Use groupby with DataFrameGroupBy.resample and aggregate with sum, then reshape with unstack; if necessary, add missing month columns filled with 0 via reindex, and finally format the datetimes with DatetimeIndex.strftime:
df = (df.set_index('date')
        .groupby('store')
        .resample('M')['invoice_count']
        .sum()
        .unstack(fill_value=0))
df = df.reindex(columns=pd.date_range('2018-01-01', df.columns.max(), freq='M'),
                fill_value=0)
df.columns = df.columns.strftime('%b_%y')
print(df)
Jan_18 Feb_18 Mar_18 Apr_18 May_18 Jun_18
store
A 0 0 0 7 0 5
B 0 0 0 5 2 0
C 0 6 0 0 0 0
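As an alternative sketch (assuming the original df from the question, with 'date' parsed by pd.to_datetime), the months can also be derived with Series.dt.to_period and reshaped with pivot_table:
# Alternative sketch: label each row with its month period, then pivot.
# Assumes the original df from the question; 'month' is a helper column added here.
df['month'] = pd.to_datetime(df['date']).dt.to_period('M')
out = df.pivot_table(index='store', columns='month',
                     values='invoice_count', aggfunc='sum', fill_value=0)
out.columns = out.columns.strftime('%b_%y')
Note that, unlike the reindex step above, this only produces columns for months that actually appear in the data.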

Related

Unable to find Sum of particular rows in pandas

I have a dataset in this form:
crawled_flag tech_flag ug_flag
0 1 2 0
1 6 0 0
2 2 0 1
3 1 0 1
4 0 1 0
5 0 7 0
What I want here is for the second row to equal the sum of itself and all the rows below it. For example, in the crawled_flag column, the second row value should be 6+2+1+0+0 = 9.
Similarly, this should be my final dataset:
crawled_flag tech_flag ug_flag
0 1 2 0
1 9 8 2
Can someone please help me with how to achieve this?
Use concat to combine the first row with the sum of all remaining rows, then transpose the DataFrame:
df = pd.concat([df.iloc[0], df.iloc[1:].sum()], axis=1, ignore_index=True).T
print(df)
crawled_flag tech_flag ug_flag
0 1 2 0
1 9 8 2
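An equivalent sketch (assuming the same df) that stacks the first row on top of the column sums of the rest, without transposing:
# Alternative sketch: first row as-is, remaining rows collapsed to their column sums.
out = pd.concat([df.iloc[[0]], df.iloc[1:].sum().to_frame().T],
                ignore_index=True)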

How to broadcast a pandas dataframe with a single row dataframe?

I have a dataframe (dall) and a single-row dataframe (row) that has the same columns.
How do I get d_result without writing a loop? I understand I can convert the dataframes to numpy arrays and broadcast, but I would imagine pandas has a way to do it directly. I have tried DataFrame.mul, but it gives me NaN results.
dall = pd.DataFrame([[5,4,3], [3,5,5], [6,6,6]], columns=['a','b','c'])
row = pd.DataFrame([[-1, 100, 0]], columns=['a','b','c'])
d_result = pd.DataFrame([[-5,400,0], [-3,500,0], [-6,600,0]], columns=['a','b','c'])
dall
a b c
0 5 4 3
1 3 5 5
2 6 6 6
row
a b c
0 -1 100 0
d_result
a b c
0 -5 400 0
1 -3 500 0
2 -6 600 0
We can use mul:
dall = dall.mul(row.loc[0], axis=1)
dall
Out[5]:
a b c
0 -5 400 0
1 -3 500 0
2 -6 600 0
You can do this by multiplying the DataFrame by a Series. Something like this:
dall * row.iloc[0]
I think this is essentially the same as @WeNYoBen's answer.
You can also multiply a DataFrame by a DataFrame, as below. But be careful: NaN values will not propagate, because missing values are replaced with 1.0 before the multiplication, so rows of dall with no matching index in row are left unchanged rather than broadcast over.
dall.mul(row, axis='columns', fill_value=1.0)
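For reference, a minimal sketch of why the plain multiplication produces NaN: DataFrame-by-DataFrame arithmetic aligns on both index and columns, and rows 1 and 2 of dall have no counterpart in row:
# Index alignment: only row 0 matches, so rows 1 and 2 become NaN.
print(dall * row)
#      a      b    c
# 0 -5.0  400.0  0.0
# 1  NaN    NaN  NaN
# 2  NaN    NaN  NaN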

Determine the max count in a pandas Grouped By df and use this as a criteria to return records

Afternoon All,
I have a large amount of data over a one-month period. I would like to:
a. Find the book with the highest number of trades over that month.
b. Knowing this, provide a groupby summary of all the trades done on that book for the month, but display its trades bucketed into each hour of the 24-hour clock.
Here is a sample dataset:
df_Highest_Traded_Away_Book = pd.DataFrame({
    'trading_book': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'rfq_create_date_time': ['2018-09-03 01:06:09', '2018-09-08 01:23:29',
                             '2018-09-15 02:23:29', '2018-09-20 03:23:29',
                             '2018-09-20 00:23:29', '2018-09-25 01:23:29',
                             '2018-09-25 02:23:29', '2018-09-30 02:23:29']})
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time
0 A 2018-09-03 01:06:09
1 A 2018-09-08 01:23:29
2 A 2018-09-15 02:23:29
3 A 2018-09-20 03:23:29
4 B 2018-09-20 00:23:29
5 C 2018-09-25 01:23:29
6 C 2018-09-25 02:23:29
7 C 2018-09-30 02:23:29
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time Time_in_GMT
0 A 2018-09-03 01:06:09 1
1 A 2018-09-08 01:23:29 1
2 A 2018-09-15 02:23:29 2
3 A 2018-09-20 03:23:29 3
4 B 2018-09-20 00:23:29 0
5 C 2018-09-25 01:23:29 1
6 C 2018-09-25 02:23:29 2
7 C 2018-09-30 02:23:29 2
df_Highest_Traded_Away_Book = (df_Highest_Traded_Away_Book
    .groupby(['trading_book'])
    .size()
    .reset_index(name='Traded_Away_for_the_Hour')
    .sort_values(['Traded_Away_for_the_Hour'], ascending=False))
display(df_Highest_Traded_Away_Book)
trading_book Traded_Away_for_the_Hour
0 A 4
2 C 3
1 B 1
display(df_Highest_Traded_Away_Book['Traded_Away_for_the_Hour'].max())
4
i.e. book A has the highest number of trades in the month.
Now return a grouped result of all trades done on this book (for the month), displayed such that the trades are bucketed into the hour in which they were traded.
Time_in_GMT Trades_Book_A_Bucketted_into_the_Hour_They_Occured
0 0
1 2
2 1
3 1
4 0
. 0
. 0
. 0
24 0
Any help would be appreciated. I figure there is some way to express this criterion in one line of code.
Use Series.idxmax for top book:
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
df_Highest_Book = df_Highest_Traded_Away_Book.groupby(['trading_book']).size().idxmax()
#alternative solution
#df_Highest_Book = df_Highest_Traded_Away_Book['trading_book'].value_counts().idxmax()
print(df_Highest_Book)
A
Then compare with eq (==), aggregate with sum to count the True values, and add the missing hours with reindex:
import numpy as np

df_Highest_Traded_Away_Book = (df_Highest_Traded_Away_Book['trading_book']
    .eq(df_Highest_Book)
    .groupby(df_Highest_Traded_Away_Book['Time_in_GMT'])
    .sum()
    .astype(int)
    .reindex(np.arange(25), fill_value=0)
    .to_frame(df_Highest_Book))
print(df_Highest_Traded_Away_Book)
A
Time_in_GMT
0 0
1 2
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
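An equivalent sketch for the last step (assuming the frame with the Time_in_GMT column, before the final reassignment, and df_Highest_Book as above) that filters to the top book first and then counts trades per hour:
# Alternative sketch: keep only the top book's rows, then count trades per hour.
mask = df_Highest_Traded_Away_Book['trading_book'] == df_Highest_Book
out = (df_Highest_Traded_Away_Book.loc[mask, 'Time_in_GMT']
       .value_counts()
       .reindex(np.arange(25), fill_value=0)
       .to_frame(df_Highest_Book))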

Pandas groupby with MultiIndex columns and different levels

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
user1 user2 count
0 1 2
a x d a
0 2 6 0 1 0 0
1 4 6 0 0 0 3
2 21 76 2 0 1 0
3 5 18 0 0 0 0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
user2 count
0 1 2
a x d a
0 6 0 1 0 1
1 76 1 0 0 0
3 18 0 0 0 0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
GENERATOR CODE:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)
IIUC, you want:
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
user2 count
0 1 2
a x d a
0 6 0 1 0 1
1 18 0 0 0 0
2 76 1 0 1 0
To avoid dropping columns, select only 'count' and change the aggregation function to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
user2 count
0 1 2
a x d a
0 6 0.0 1.0 0.0 3.0
1 18 0.0 0.0 0.0 0.0
2 76 2.0 0.0 1.0 0.0
To also avoid masking rows where user2 equals 0 (where masks zeros in every column, including user2), apply the mask only to the 'count' columns:
final_df[['count']].where(final_df[['count']]>0)\
.groupby(final_df.user2).sum().reset_index()
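As a related sketch (assuming final_df from the generator code), the non-zero occurrence counts per user2 can also be computed with an explicit boolean mask:
# Sketch: count non-zero 'count' entries per user2 via a boolean mask.
counts = (final_df['count'].gt(0)
          .groupby(final_df[('user2', '', '')])
          .sum())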

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or a SQL CASE statement)
This works, but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table:
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
... df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
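A vectorized sketch of the same idea using get_dummies (assuming the df from the example above), which produces the '_amt'-suffixed columns in one shot:
# Sketch: one-hot encode 'product', scale by 'amount', and join back.
amounts = (pd.get_dummies(df['product'])
           .astype(int)
           .mul(df['amount'], axis=0)
           .add_suffix('_amt'))
df = df.join(amounts)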