Inserting Rows with Consecutive Dates for Different Projects with Different Start and End Dates in pandas

I have a df called data of project records that looks somewhat like this:
import pandas as pd

project = ['Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5']
date = ['2010-10-12','2010-10-15','2010-10-20','2010-10-22','2012-05-05','2012-05-07','2012-05-10','2018-01-01','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-04','2010-02-07']
date = pd.to_datetime(date)
hours = [0,1,0,2,4,0,2,1,0,2,4,2,4,3]
taskcount = [0,1,2,0,1,2,0,1,2,0,1,0,0,2]
data = pd.DataFrame({'Project':project, 'Date':date, 'Hours':hours,'TaskCount':taskcount})
The column Hours shows the number of hours worked on the particular project on the given date, while the column TaskCount gives a count of the number of tasks completed for that project on the given date.
I have a second df called project_duration containing info about the duration of each of the projects in the df data:
column_DateFirstRecord = data.groupby('Project').apply(lambda df: df.Date.min())
column_DateLastRecord = data.groupby('Project').apply(lambda df: df.Date.max())
project_duration = pd.concat([column_DateFirstRecord, column_DateLastRecord], axis=1)
project_duration.columns = ['DateFirstRecord', 'DateLastRecord']
project_duration = project_duration.assign(ProjectLength = (project_duration.DateLastRecord - project_duration.DateFirstRecord))
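(As an aside, not part of the original question: the same lookup table can be built in a single pass with named aggregation, which avoids the two separate apply calls.)
project_duration = (data.groupby('Project')['Date']
                        .agg(DateFirstRecord='min', DateLastRecord='max'))
project_duration['ProjectLength'] = (project_duration['DateLastRecord']
                                     - project_duration['DateFirstRecord'])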
For each project in the df data, I need to append rows to the df data with the missing dates from the date of the first record to the date of the last record for that particular project. For instance, for Project 1, I need to add rows to the df data for October 13-14, 16-19, and 21, 2010. Note that these new rows should have the value 0 in the columns Hours and TaskCount.
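To make the requirement concrete, here is a minimal sketch of the idea for a single project (Project 1), using pd.date_range and Index.difference; the full solutions below generalize this per group:
p1 = data.loc[data['Project'] == 'Project 1', 'Date']
missing = pd.date_range(p1.min(), p1.max(), freq='D').difference(p1)
print(missing)
# DatetimeIndex(['2010-10-13', '2010-10-14', '2010-10-16', '2010-10-17',
#                '2010-10-18', '2010-10-19', '2010-10-21'], dtype='datetime64[ns]', freq=None)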
The output that I'm looking for should look like the df data_output that I've created below:
date_output = ['2010-10-12', '2010-10-13','2010-10-14','2010-10-15','2010-10-16','2010-10-17','2010-10-18','2010-10-19','2010-10-20','2010-10-21','2010-10-22','2012-05-05','2012-05-06','2012-05-07','2012-05-08','2012-05-09','2012-05-10','2018-01-01', '2018-01-02','2018-01-03','2018-01-04','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-03','2010-02-04','2010-02-05','2010-02-06','2010-02-07']
date_output = pd.to_datetime(date_output)
project_output = ['Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5','Project 5','Project 5','Project 5']
hours_output = [0,0,0,1,0,0,0,0,0,0,2,4,0,0,0,0,2,1,0,0,0,0,2,4,2,0,4,0,0,3]
taskcount_output = [0,0,0,1,0,0,0,0,2,0,0,1,0,2,0,0,0,1,0,0,0,2,0,1,0,0,0,0,0,2]
data_output = pd.DataFrame({'Project':project_output, 'Date':date_output, 'Hours':hours_output,'TaskCount':taskcount_output})
I should also note that the real dfs that I'm working with are very large - they comprise about 278,000 rows - so I'm hoping to find an efficient solution. I tried the method detailed in this StackOverflow post: Pandas filling missing dates and values within group, but it didn't allow for the different start and end dates for each project.

One option:
(data.groupby('Project')
     .apply(lambda g: g.set_index('Date')
                       .reindex(pd.date_range(project_duration.loc[g.name, 'DateFirstRecord'],
                                              project_duration.loc[g.name, 'DateLastRecord'])
                                  .rename('Date'),
                                fill_value=0)
            )[['Hours', 'TaskCount']]
     .reset_index()
)
Output:
Project Date Hours TaskCount
0 Project 1 2010-10-12 0 0
1 Project 1 2010-10-13 0 0
2 Project 1 2010-10-14 0 0
3 Project 1 2010-10-15 1 1
4 Project 1 2010-10-16 0 0
5 Project 1 2010-10-17 0 0
6 Project 1 2010-10-18 0 0
7 Project 1 2010-10-19 0 0
8 Project 1 2010-10-20 0 2
9 Project 1 2010-10-21 0 0
10 Project 1 2010-10-22 2 0
11 Project 2 2012-05-05 4 1
12 Project 2 2012-05-06 0 0
13 Project 2 2012-05-07 0 2
14 Project 2 2012-05-08 0 0
15 Project 2 2012-05-09 0 0
16 Project 2 2012-05-10 2 0
17 Project 3 2018-01-01 1 1
18 Project 3 2018-01-02 0 0
19 Project 3 2018-01-03 0 0
20 Project 3 2018-01-04 0 0
21 Project 3 2018-01-05 0 2
22 Project 3 2018-01-06 2 0
23 Project 4 2019-10-02 4 1
24 Project 5 2010-02-02 2 0
25 Project 5 2010-02-03 0 0
26 Project 5 2010-02-04 4 0
27 Project 5 2010-02-05 0 0
28 Project 5 2010-02-06 0 0
29 Project 5 2010-02-07 3 2
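Essentially the same reindex can also be done without the separate project_duration lookup by taking each group's own min and max (a sketch of that variant, with no claim that it is faster; on recent pandas versions the groupby.apply call may emit a deprecation warning about operating on the grouping column, which include_groups=False silences):
out = (data.groupby('Project')
           .apply(lambda g: g.set_index('Date')
                             .reindex(pd.date_range(g['Date'].min(), g['Date'].max(), name='Date'),
                                      fill_value=0)
                  )[['Hours', 'TaskCount']]
           .reset_index())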

Try:
x = data.groupby('Project').apply(
    lambda x: (tmp := x.set_index('Date').asfreq('1D')).assign(
        Project=tmp['Project'].ffill(),
        Hours=tmp['Hours'].fillna(0).astype(int),
        TaskCount=tmp['TaskCount'].fillna(0).astype(int))
).droplevel(0).reset_index()
print(x)
Prints:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2
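One assumption worth stating (not mentioned in the answer): asfreq('1D') builds the daily range from the first and last index values, so each project's rows should be sorted by Date and free of duplicate dates. Sorting first is cheap insurance:
data = data.sort_values(['Project', 'Date'])
assert not data.duplicated(subset=['Project', 'Date']).any()  # reindexing/asfreq fails on duplicate dates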

Another option:
col1 = project_duration.apply(lambda ss: pd.date_range(ss.DateFirstRecord, ss.DateLastRecord), axis=1)
(col1.explode().to_frame("Date").reset_index().set_index(["Project", "Date"])
     .join(data.set_index(["Project", "Date"])).fillna(0).reset_index())
out:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2

Related

Reset 'Id' value of appended Dataframe

I have appended multiple dataframes to form a single dataframe. Each dataframe had multiple rows assigned a specific ID. After appending, the big dataframe has multiple rows with the same ID. I would like to assign new IDs.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5
You would have to use a slightly modified version of groupby:
df['new_id'] = df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False).ngroup()
Output is:
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
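For intuition, the inner expression labels each consecutive run of groupid values, and ngroup then renumbers those runs in order of appearance; a small sketch of the intermediate series:
runs = df['groupid'].ne(df['groupid'].shift()).cumsum()
print(runs.tolist())
# [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6]  -> six consecutive runs, renumbered 0-5 by ngroup()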
See previous answer for reference.

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df = pd.DataFrame({'list_A': [3, 3, 3, 3, 3,
                              2, 2, 2, 2, 2, 2, 2,
                              4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
    list_A  list_B
0        3       1
1        3       1
2        3       1
3        3       0
4        2       1
5        2       1
6        2       0
7        2       0
8        4       1
9        4       1
10       4       1
11       4       1
12       4       0
13       4       0
14       4       0
15       4       0
16       4       0
As you can see, if list_A has the number 3, then the first 3 values of list_B are 1, after which list_B changes to 0 until list_A changes value again.
Use GroupBy.cumcount: compare list_A against the within-group row number, so only the first list_A rows of each group get 1:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
EDIT: if the same value can appear in more than one consecutive run, build the groups from consecutive blocks instead:
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
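The difference only matters when the same value appears in more than one consecutive run, which the original df does not have; grouping by the run labels keeps the count local to each run. A small sketch on hypothetical data:
df2 = pd.DataFrame({'list_A': [2, 2, 2, 3, 2, 2]})
blocks2 = df2['list_A'].ne(df2['list_A'].shift()).cumsum()
df2['list_B'] = df2['list_A'].gt(df2.groupby(blocks2).cumcount()).astype(int)
print(df2['list_B'].tolist())  # [1, 1, 0, 1, 1, 1]; grouping by list_A alone would give [1, 1, 0, 1, 0, 0]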

Pivot Data and Count in SQL

I have a dataset in SQL that looks similar to this:
date dayssinceend
02/07/2020 1
03/07/2020 2
04/07/2020 3
05/07/2020 4
06/07/2020 5
01/07/2020 1
02/07/2020 2
03/07/2020 3
04/07/2020 4
01/07/2020 1
02/07/2020 2
03/07/2020 3
04/07/2020 4
I want to pivot it so that the date is on the Y axis and the days since end is on the X axis.
So it would look like this:
Date 1 2 3 4 5
01/07/2020 2 0 0 0 0
02/07/2020 1 2 0 0 0
03/07/2020 0 1 2 0 0
04/07/2020 0 0 1 2 0
05/07/2020 0 0 0 1 0
06/07/2020 0 0 0 0 1
The days since end values will change, so if there's a way of doing this without hard-coding them, that would be great.
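No SQL answer is shown here, and a dynamic pivot in SQL depends on the specific database (typically conditional aggregation or a dynamically built PIVOT). For comparison, the equivalent count-pivot in pandas, the language used elsewhere on this page, is a one-liner with pd.crosstab (a sketch, assuming the data has been loaded into a DataFrame named df):
import pandas as pd

df = pd.DataFrame({
    'date': ['02/07/2020', '03/07/2020', '04/07/2020', '05/07/2020', '06/07/2020',
             '01/07/2020', '02/07/2020', '03/07/2020', '04/07/2020',
             '01/07/2020', '02/07/2020', '03/07/2020', '04/07/2020'],
    'dayssinceend': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4],
})
# Rows are the dates, columns are the distinct dayssinceend values, cells are counts.
print(pd.crosstab(df['date'], df['dayssinceend']))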

MultiIndex: add zero values if missing in pandas dataframe

I have a pandas (v0.23.4) dataframe with a MultiIndex ('date', 'class').
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
For '2019-06-30', class 3 is missing because there is no data.
What I want is to add class 3 to the MultiIndex with a zero value in the Col_values column automatically.
Use DataFrame.unstack with fill_value=0 followed by DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7

Determine the max count in a pandas Grouped By df and use this as a criteria to return records

Afternoon All,
I have a large amount of data over a one month period. I would like to:
a. Find the book with the highest number of trades over that month.
b. Knowing this, provide a groupby summary of all the trades done on that book for the month, but display its trades bucketed into each hour of the 24-hour clock.
Here is a sample dataset:
df_Highest_Traded_Away_Book = [
('trading_book', ['A', 'A','A','A','B','C','C','C']),
('rfq_create_date_time', ['2018-09-03 01:06:09', '2018-09-08 01:23:29',
'2018-09-15 02:23:29','2018-09-20 03:23:29',
'2018-09-20 00:23:29','2018-09-25 01:23:29',
'2018-09-25 02:23:29','2018-09-30 02:23:29',])
]
df_Highest_Traded_Away_Book = pd.DataFrame(dict(df_Highest_Traded_Away_Book))  # DataFrame.from_items was removed in pandas 1.0
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time
0 A 2018-09-03 01:06:09
1 A 2018-09-08 01:23:29
2 A 2018-09-15 02:23:29
3 A 2018-09-20 03:23:29
4 B 2018-09-20 00:23:29
5 C 2018-09-25 01:23:29
6 C 2018-09-25 02:23:29
7 C 2018-09-30 02:23:29
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time Time_in_GMT
0 A 2018-09-03 01:06:09 1
1 A 2018-09-08 01:23:29 1
2 A 2018-09-15 02:23:29 2
3 A 2018-09-20 03:23:29 3
4 B 2018-09-20 00:23:29 0
5 C 2018-09-25 01:23:29 1
6 C 2018-09-25 02:23:29 2
7 C 2018-09-30 02:23:29 2
df_Highest_Traded_Away_Book = df_Highest_Traded_Away_Book.groupby(['trading_book']).size().reset_index(name='Traded_Away_for_the_Hour').sort_values(['Traded_Away_for_the_Hour'], ascending=False)
display(df_Highest_Traded_Away_Book)
trading_book Traded_Away_for_the_Hour
0 A 4
2 C 3
1 B 1
display(df_Highest_Traded_Away_Book['Traded_Away_for_the_Hour'].max())
4
i.e. Book A has the highest number of trades in the month.
Now return a grouped-by result of all trades done on this book (for the month), displayed so that trades are bucketed into the hour in which they were traded.
Time_in_GMT Trades_Book_A_Bucketted_into_the_Hour_They_Occured
0 0
1 2
2 1
3 1
4 0
. 0
. 0
. 0
24 0
Any help would be appreciated. I figure there is some way to return the criteria in one line of code.
Use Series.idxmax for top book:
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
df_Highest_Book = df_Highest_Traded_Away_Book.groupby(['trading_book']).size().idxmax()
#alternative solution
#df_Highest_Book = df_Highest_Traded_Away_Book['trading_book'].value_counts().idxmax()
print(df_Highest_Book)
A
Then compare with eq (==), aggregate with sum to count the True values, and add the missing hours with reindex:
import numpy as np

df_Highest_Traded_Away_Book = (df_Highest_Traded_Away_Book['trading_book']
                               .eq(df_Highest_Book)
                               .groupby(df_Highest_Traded_Away_Book['Time_in_GMT'])
                               .sum()
                               .astype(int)
                               .reindex(np.arange(25), fill_value=0)
                               .to_frame(df_Highest_Book))
print(df_Highest_Traded_Away_Book)
A
Time_in_GMT
0 0
1 2
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
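Because the answer reuses and overwrites df_Highest_Traded_Away_Book, an equivalent filter-then-count phrasing may read more clearly; a sketch, assuming the prepared frame with the Time_in_GMT column is kept under a hypothetical shorter name trades:
import numpy as np

top_book = trades['trading_book'].value_counts().idxmax()  # 'A' for the sample data
hourly = (trades.loc[trades['trading_book'].eq(top_book), 'Time_in_GMT']
                .value_counts()
                .reindex(np.arange(25), fill_value=0)  # hours 0..24, matching the question's desired shape
                .rename(top_book))
print(hourly)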