How to groupby, iterate over rows and count in pandas?
I have a dataframe with the columns city, date, source and count (the count for each source).
I need to group by city and then go through the rows of the date column with the following condition: whenever the difference between consecutive dates is less than 30 days, they belong to the same group of dates; for each such group, find start_date (min) and end_date (max) and add them as separate columns. Also, sum the count per source type and add those as separate columns as well.
city date source count
0 Alexandria 2022-10-13 black 117
1 Alexandria 2022-10-14 black 85
2 Alexandria 2022-10-15 black 63
3 Alexandria 2022-10-16 black 190
4 Alexandria 2022-10-17 black 389
5 Alexandria 2022-10-18 black 284
6 Alexandria 2022-10-19 black 179
7 Amsterdam 2018-08-05 red 1
8 Amsterdam 2018-08-28 red 111
9 Amsterdam 2019-08-17 red 1669
10 Amsterdam 2019-08-18 red 1584
11 Amsterdam 2019-08-19 red 940
12 Amsterdam 2019-08-21 red 1498
13 Amsterdam 2019-08-22 red 2281
14 Amsterdam 2019-08-23 red 2038
15 Amsterdam 2019-08-24 red 1516
16 Amsterdam 2019-08-25 red 1952
17 Amsterdam 2019-08-26 red 1434
18 Amsterdam 2019-08-27 red 881
19 Amsterdam 2019-08-29 red 1482
20 Amsterdam 2019-08-30 red 978
21 Amsterdam 2019-08-31 red 1423
22 Amsterdam 2019-09-01 red 1120
23 Amsterdam 2019-09-02 red 1117
24 Amsterdam 2019-09-06 red 1
To reproduce my dataframe:
import pandas as pd

# initialize list of lists
data1 = [['Alexandria', '2022-10-13', 'black', 117],
['Alexandria', '2022-10-14', 'black', 85],
['Alexandria', '2022-10-15', 'black', 63],
['Alexandria', '2022-10-16', 'black', 190],
['Alexandria', '2022-10-17', 'black', 389],
['Alexandria', '2022-10-18', 'black', 284],
['Alexandria', '2022-10-19', 'black', 179],
['Amsterdam', '2018-08-05', 'red', 1],
['Amsterdam', '2018-08-28', 'red', 111],
['Amsterdam', '2019-08-17', 'red', 1669],
['Amsterdam', '2019-08-18', 'red', 1584],
['Amsterdam', '2019-08-19', 'red', 940],
['Amsterdam', '2019-08-21', 'red', 1498],
['Amsterdam', '2019-08-22', 'red', 2281],
['Amsterdam', '2019-08-23', 'red', 2038],
['Amsterdam', '2019-08-24', 'red', 1516],
['Amsterdam', '2019-08-25', 'red', 1952],
['Amsterdam', '2019-08-26', 'red', 1434],
['Amsterdam', '2019-08-27', 'red', 881],
['Amsterdam', '2019-08-29', 'red', 1482],
['Amsterdam', '2019-08-30', 'red', 978],
['Amsterdam', '2019-08-31', 'red', 1423],
['Amsterdam', '2019-09-01', 'red', 1120],
['Amsterdam', '2019-09-02', 'red', 1117],
['Amsterdam', '2019-09-06', 'red', 1],
]
# Create the pandas DataFrame
df_b = pd.DataFrame(data1, columns=['city', 'date', 'source', 'count'])
The output I need (it looks like a pivot table):
For Alexandria I have only one round where the dates are less than 30 days apart: from 2022-10-13 to 2022-10-19 (rows 0-6). These two dates are the min and max.
For Amsterdam there are two rounds of dates that are less than 30 days apart: the first round is rows 7-8, the second round is rows 9-24.
Between rows 8 and 9 for Amsterdam there are more than 30 days, so a new round starts.
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307 0
1 Amsterdam 2018-08-05 2018-08-28 0 112
2 Amsterdam 2019-08-17 2019-09-06 0 21914
To reproduce the output:
# initialize list of lists
data = [['Alexandria', '2022-10-13', '2022-10-19', 1307, 0],
['Amsterdam', '2018-08-05', '2018-08-28', 0, 112],
['Amsterdam', '2019-08-17', '2019-09-06', 0, 21914]]
# Create the pandas DataFrame
df_a = pd.DataFrame(data, columns=['city', 'start_date', 'end_date', 'black', 'red'])
I know this does not fully answer the question, but it is hopefully helpful in getting you partly there, since you did not show what you have tried (if anything). The 30-days part presumably involves some custom way of grouping and aggregating the data, which I have not attempted here (a rough sketch of that part is added at the end of this answer). As such, this answer shows how to:
Create start_date and end_date columns for minimum and maximum dates grouped by city, respectively
Sum the count for each source value into new columns for each identified source value (black and red)
Merge the two resultant dataframes into a single one
This answer assumes the original data is in a dataframe named df.
Minimum and maximum dates
I know this is not exactly what is wanted because of the 30-days logic, but it shows a straightforward grouping by city with the minimum and maximum dates for each.
date_agg_df = df.groupby("city").agg({"date": ["min", "max"]}).reset_index()
date_agg_df.columns = ["city", "start_date", "end_date"]
Data in date_agg_df:
city start_date end_date
0 Alexandria 2022-10-13 2022-10-19
1 Amsterdam 2018-08-05 2019-09-06
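Note that date is still a plain string column here; because the dates are ISO-formatted (YYYY-MM-DD), min and max happen to sort correctly anyway, but converting to real datetimes first is safer and is required for any date arithmetic such as the 30-day gaps:
df["date"] = pd.to_datetime(df["date"])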
Sum the count of each source value
black_red_df = df.fillna(0).pivot_table(index="city", columns="source", values="count", aggfunc="sum").fillna(0).reset_index()
black_red_df.columns = ["city", "black", "red"]
Data in black_red_df:
city black red
0 Alexandria 1307.0 0.0
1 Amsterdam 0.0 22026.0
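The counts come out as floats because the pivot creates missing city/source combinations (NaN) that are then filled with 0. If you want plain integers as in the desired output, you can cast the two columns afterwards, for example:
black_red_df[["black", "red"]] = black_red_df[["black", "red"]].astype(int)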
Merge the dataframes
mrg_df = pd.merge(date_agg_df, black_red_df, on="city")
Data in mrg_df:
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307.0 0.0
1 Amsterdam 2018-08-05 2019-09-06 0.0 22026.0
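30-day rounds (sketch)
For the 30-day grouping itself, here is a rough sketch of one possible approach (a sketch only, not a tested solution; df is the original data as above, and the helper names df30, round and out are just for illustration): convert date to datetime, sort within each city, start a new round whenever the gap to the previous date is 30 days or more, and use the cumulative count of those breaks as a round id to group and pivot on.
import pandas as pd

df30 = df.copy()
df30["date"] = pd.to_datetime(df30["date"])
df30 = df30.sort_values(["city", "date"])

# a new round starts whenever the gap to the previous date within the same city is 30 days or more
gap = df30.groupby("city")["date"].diff()
df30["round"] = gap.ge(pd.Timedelta(days=30)).astype(int).groupby(df30["city"]).cumsum()

out = (df30.pivot_table(index=["city", "round"], columns="source",
                        values="count", aggfunc="sum", fill_value=0)
            .join(df30.groupby(["city", "round"])["date"]
                      .agg(start_date="min", end_date="max"))
            .reset_index()
            .drop(columns="round"))
# reorder to match the desired layout (assumes black and red are the only source values)
out = out[["city", "start_date", "end_date", "black", "red"]]
On the sample data this should produce one row for Alexandria and two for Amsterdam, matching df_a except that the dates are datetimes rather than strings.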
Related
Groupby each group and then do division of two divisions (current yr/latest yr for each column)
I'd like to create a new column by dividing each year's value by the latest year's value in Col_1 and Col_2 respectively, for each group, and then divide the two ratios. Methodology: calculate (EachYrCol_1/Yr2000Col_1)/(EachYrCol_2/Yr2000Col_2) for each group. See the example below:
Year Group Col_1 Col_2 New Column
1995     A   100    11 (100/600)/(11/66)
1996     A   200    22 (200/600)/(22/66)
1997     A   300    33 (300/600)/(33/66)
1998     A   400    44 .............
1999     A   500    55 .............
2000     A   600    66 .............
1995     B   700    77 (700/1200)/(77/399)
1996     B   800    88 (800/1200)/(88/399)
1997     B   900    99 (900/1200)/(99/399)
1998     B  1000   199 .............
1999     B  1100   299 .............
2000     B  1200   399 .............
Sample dataset:
import pandas as pd
df = pd.DataFrame({'Year': [1995, 1996, 1997, 1998, 1999, 2000, 1995, 1996, 1997, 1998, 1999, 2000],
                   'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200],
                   'Col_2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 199, 299, 399]})
Use GroupBy.transform with 'last' to build a helper DataFrame, so that each column can be divided:
df1 = df.groupby('Group').transform('last')
df['New'] = df['Col_1'].div(df1['Col_1']).div(df['Col_2'].div(df1['Col_2']))
print (df)
    Year Group  Col_1  Col_2       New
0   1995     A    100     11  1.000000
1   1996     A    200     22  1.000000
2   1997     A    300     33  1.000000
3   1998     A    400     44  1.000000
4   1999     A    500     55  1.000000
5   2000     A    600     66  1.000000
6   1995     B    700     77  3.022727
7   1996     B    800     88  3.022727
8   1997     B    900     99  3.022727
9   1998     B   1000    199  1.670854
10  1999     B   1100    299  1.223244
11  2000     B   1200    399  1.000000
Pandas - styling with multi-index (from pivot)
As a result of a pivot table I get the following table (entry table). I would like to apply a style with a criterion that depends on the product:
when eggs are sold, I would like a red background when sales is lower than 15
when chicken is sold, I would like a red background when sales is lower than 18
(expected result). I am able to do it by converting to records (to_records) and creating a single index, but then I lose the nice layout that allows comparing the evolution between months. Note that the number of months is variable (2 in the example, but there can be more). The entry file is read from Excel:
import pandas as pd
df = pd.read_excel('xxx.xlsx', sheet_name='Sheet')
result =
   product      Month  sales  revenue
0     eggs 2021-05-01     10      8.0
1  chicken 2021-05-01     15     12.0
2  chicken 2021-05-01     17     15.0
3     eggs 2021-05-01     20     15.0
4     eggs 2021-06-01     11      8.5
5  chicken 2021-06-01     14     12.0
6  chicken 2021-06-01     18     15.0
7     eggs 2021-06-01     22     17.0
Pivoting it and then swapping the column levels:
df2 = pd.pivot_table(df, columns=['Month'], index=['product'], values=['sales', 'revenue'], aggfunc=sum)
df3 = df2.swaplevel(0, 1, 1).sort_values(by='Month', ascending=True, axis=1)
Month   2021-05-01       2021-06-01
           revenue sales    revenue sales
product
chicken       27.0    32       27.0    32
eggs          23.0    30       25.5    33
The to_dict() of the pivoted frame:
{(Timestamp('2021-05-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 23.0},
 (Timestamp('2021-05-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 30},
 (Timestamp('2021-06-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 25.5},
 (Timestamp('2021-06-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 33}}
Thanks in advance for any help you can provide.
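There is no answer attached for this one, but a possible approach (a sketch only, assuming the pivoted frame is df3 as built above and that the thresholds are 15 for eggs and 18 for chicken; thresholds and highlight_low_sales are illustrative names) is to apply a row-wise Styler function and colour only the sales cells that fall below the product's threshold:
import numpy as np

# per-product thresholds taken from the question
thresholds = {'eggs': 15, 'chicken': 18}

def highlight_low_sales(row):
    # row.name is the product; the row's index is the (Month, measure) column MultiIndex
    limit = thresholds.get(row.name, np.inf)
    return ['background-color: red' if measure == 'sales' and value < limit else ''
            for (month, measure), value in row.items()]

styled = df3.style.apply(highlight_low_sales, axis=1)
styled  # display in a notebook cell to see the colours
This keeps the multi-index layout intact because the styling is applied to the pivoted frame directly rather than to a flattened copy.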
How to sum certain values using pandas datetime operations
The headline is not clear, so let me explain. I have a dataframe like this:
Order Quantity  Date Accepted  Date Delivered
            20     01-05-2010      01-02-2011
            10     01-11-2010      01-03-2011
           300     01-12-2010      01-04-2011
             5     01-03-2011      01-03-2012
            20     01-04-2012      01-11-2013
            10     01-07-2013      01-12-2014
I basically want to create another column that contains the total undelivered items for each row. Expected output:
Order Quantity  Date Accepted  Date Delivered  Pending Order
            20     01-05-2010      01-02-2011             20
            10     01-11-2010      01-03-2011             30
           300     01-12-2010      01-04-2011            330
             5     01-03-2011      01-03-2012            305
            20     01-04-2012      01-11-2013             20
            10     01-07-2013      01-12-2014             30
Here, I have taken part of your dataframe and tried to get the result:
df = pd.DataFrame({'order': [20, 10, 300, 200],
                   'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
                   'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
   order Date_aceepted Date_delever
0     20    01-05-2010   01-02-2011
1     10    01-11-2010   01-03-2011
2    300    01-12-2010   01-04-2011
3    200    01-12-2010   01-12-2010
Then I change Date_aceepted and Date_delever to datetime using pandas:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I make a new dataframe in which Date_aceepted and Date_delever are not the same; I assume you only need those rows in your final result:
dff = df[df['date1'] != df['date2']]
You can see that the last row, where the accepted and delivered dates are the same, is now removed in dff:
   order Date_aceepted Date_delever      date1      date2
0     20    01-05-2010   01-02-2011 2010-01-05 2011-01-02
1     10    01-11-2010   01-03-2011 2010-01-11 2011-01-03
2    300    01-12-2010   01-04-2011 2010-01-12 2011-01-04
Then I use a pandas cumulative sum for the pending orders:
dff['pending'] = dff['order'].cumsum()
which gives:
   order Date_aceepted Date_delever      date1      date2  pending
0     20    01-05-2010   01-02-2011 2010-01-05 2011-01-02       20
1     10    01-11-2010   01-03-2011 2010-01-11 2011-01-03       30
2    300    01-12-2010   01-04-2011 2010-01-12 2011-01-04      330
The final dataframe has two extra columns that can be dropped if you don't want them in your result.
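Note that the cumulative sum above does not reproduce the expected output exactly, because the pending total should fall again once earlier orders have been delivered. A rough sketch of one way to get the expected numbers (assuming the dates are day-first, e.g. 01-05-2010 is 1 May 2010, and using the column names from the question) is to sum, for each row, the quantities of all orders accepted so far whose delivery date is still in the future:
import pandas as pd

df = pd.DataFrame({
    'Order Quantity': [20, 10, 300, 5, 20, 10],
    'Date Accepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-03-2011', '01-04-2012', '01-07-2013'],
    'Date Delivered': ['01-02-2011', '01-03-2011', '01-04-2011', '01-03-2012', '01-11-2013', '01-12-2014'],
})
accepted = pd.to_datetime(df['Date Accepted'], dayfirst=True)
delivered = pd.to_datetime(df['Date Delivered'], dayfirst=True)

# for each row: total quantity accepted so far that was still undelivered on that row's acceptance date
pending = []
for i in range(len(df)):
    still_open = delivered.iloc[:i + 1] > accepted.iloc[i]
    pending.append(df['Order Quantity'].iloc[:i + 1][still_open].sum())
df['Pending Order'] = pending
# expected: 20, 30, 330, 305, 20, 30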
I am trying to find the original index value of the last occurrence of each group
I am trying to find the original indices for the last occurrence of each groupby group. If I have the dataframe given by:
data = {
    'Name': ['Jack', 'Jill', 'Jill', 'Jill', 'Ryan', 'Ryan', 'Lilian', 'Jack', 'Jack', 'Jack'],
    'Age': [15, 20, 25, 30, 23, 23, 45, 24, 65, 115]
}
df = pd.DataFrame(data)
df
I hope to see:
0    Jack   15
3    Jill   30
5    Ryan   23
6  Lilian   45
9    Jack  115
I tried using .last() after the groupby, but this gets rid of the index.
If you want to drop duplicates without treating records that reappear later as dupes (I think that was the expected output before editing), you can also do:
(df.assign(k=df['Name'].ne(df['Name'].shift()).cumsum())
   .drop_duplicates(['Name', 'k'], keep='last'))
Or better, as @PiR mentions:
df[df.Name.ne(df.Name.shift(-1))]
     Name  Age  k
0    Jack   15  1
3    Jill   30  2
5    Ryan   23  3
6  Lilian   45  4
9    Jack  115  5
You can also do:
df.groupby(df.Name.ne(df.Name.shift()).cumsum()).tail(1)
     Name  Age
0    Jack   15
3    Jill   30
5    Ryan   23
6  Lilian   45
9    Jack  115
Use duplicated:
print(df[~df.Name.ne(df.Name.shift()).cumsum().duplicated(keep='last')])
Output:
     Name  Age
0    Jack   15
3    Jill   30
5    Ryan   23
6  Lilian   45
9    Jack  115
Need assistance with below query
I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import webbrowser
website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
The error mentioned above then follows.
I believe you need read_html, which returns all parsed tables; select the DataFrame by position:
website = 'https://en.wikipedia.org/wiki/Winning_percentage'

# select first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
   Win %  Wins  Losses  Year                     Team                   Comment
0  0.798    67      17  1882  Chicago White Stockings    best pre-modern season
1  0.763   116      36  1906             Chicago Cubs   best 154-game NL season
2  0.721   111      43  1954        Cleveland Indians   best 154-game AL season
3  0.716   116      46  2001         Seattle Mariners   best 162-game AL season
4  0.667   108      54  1975          Cincinnati Reds   best 162-game NL season

# select second parsed table
df2 = pd.read_html(website)[1]
print (df2)
   Win %  Wins  Losses   Season                   Team  \
0  0.890    73       9  2015–16  Golden State Warriors
1  0.110     9      73  1972–73     Philadelphia 76ers
2  0.106     7      59  2011–12      Charlotte Bobcats

                       Comment
0          best 82 game season
1         worst 82-game season
2  worst season statistically