How to groupby, iterate over rows and count in pandas? - pandas

I have a dataframe with columns city, date, source and count of each source.
I need to group by city, then walk through the date column with the following condition: if the difference between consecutive dates is <30 days, find the start_date (min) and end_date (max) for that group of dates and add them as separate columns. Also, sum the count for each source type and add those as separate columns as well.
city date source count
0 Alexandria 2022-10-13 black 117
1 Alexandria 2022-10-14 black 85
2 Alexandria 2022-10-15 black 63
3 Alexandria 2022-10-16 black 190
4 Alexandria 2022-10-17 black 389
5 Alexandria 2022-10-18 black 284
6 Alexandria 2022-10-19 black 179
7 Amsterdam 2018-08-05 red 1
8 Amsterdam 2018-08-28 red 111
9 Amsterdam 2019-08-17 red 1669
10 Amsterdam 2019-08-18 red 1584
11 Amsterdam 2019-08-19 red 940
12 Amsterdam 2019-08-21 red 1498
13 Amsterdam 2019-08-22 red 2281
14 Amsterdam 2019-08-23 red 2038
15 Amsterdam 2019-08-24 red 1516
16 Amsterdam 2019-08-25 red 1952
17 Amsterdam 2019-08-26 red 1434
18 Amsterdam 2019-08-27 red 881
19 Amsterdam 2019-08-29 red 1482
20 Amsterdam 2019-08-30 red 978
21 Amsterdam 2019-08-31 red 1423
22 Amsterdam 2019-09-01 red 1120
23 Amsterdam 2019-09-02 red 1117
24 Amsterdam 2019-09-06 red 1
To reproduce my dataframe:
import pandas as pd

# initialize list of lists
data1 = [['Alexandria', '2022-10-13', 'black', 117],
['Alexandria', '2022-10-14', 'black', 85],
['Alexandria', '2022-10-15', 'black', 63],
['Alexandria', '2022-10-16', 'black', 190],
['Alexandria', '2022-10-17', 'black', 389],
['Alexandria', '2022-10-18', 'black', 284],
['Alexandria', '2022-10-19', 'black', 179],
['Amsterdam', '2018-08-05', 'red', 1],
['Amsterdam', '2018-08-28', 'red', 111],
['Amsterdam', '2019-08-17', 'red', 1669],
['Amsterdam', '2019-08-18', 'red', 1584],
['Amsterdam', '2019-08-19', 'red', 940],
['Amsterdam', '2019-08-21', 'red', 1498],
['Amsterdam', '2019-08-22', 'red', 2281],
['Amsterdam', '2019-08-23', 'red', 2038],
['Amsterdam', '2019-08-24', 'red',1516],
['Amsterdam', '2019-08-25', 'red', 1952],
['Amsterdam', '2019-08-26', 'red', 1434],
['Amsterdam', '2019-08-27', 'red', 881],
['Amsterdam', '2019-08-29', 'red', 1482],
['Amsterdam', '2019-08-30', 'red', 978],
['Amsterdam', '2019-08-31', 'red', 1423],
['Amsterdam', '2019-09-01', 'red', 1120],
['Amsterdam', '2019-09-02', 'red',1117],
['Amsterdam', '2019-09-06', 'red', 1],
]
# Create the pandas DataFrame
df_b = pd.DataFrame(data1, columns=['city', 'date', 'source','count'])
The output needed (it looks like a pivot table) is shown below.
So, for Alexandria I have only 1 round where the dates are less than 30 days apart: from 2022-10-13 to 2022-10-19 (row 6). These two dates are the min and max.
For Amsterdam we have 2 rounds of dates that are less than 30 days apart: the 1st round is rows 7-8, the 2nd round is rows 9-24.
Between rows 8 and 9 for Amsterdam there are more than 30 days, so a new round starts.
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307 0
1 Amsterdam 2018-08-05 2018-08-28 0 112
2 Amsterdam 2019-08-17 2019-09-06 0 21914
To reproduce the output:
# initialize list of lists
data = [['Alexandria', '2022-10-13', '2022-10-19', 1307, 0],
['Amsterdam', '2018-08-05', '2018-08-28', 0, 112],
['Amsterdam', '2019-08-17', '2019-09-06', 0,21914 ]]
# Create the pandas DataFrame
df_a = pd.DataFrame(data, columns=['city', 'start_date', 'end_date','black', 'red'])

I know this does not fully answer the question, but it is hopefully helpful in getting you partly there, since you did not show what you have tried (if anything). The 30 days part presumably would involve some custom way of grouping and aggregating the data, which I have not attempted here. As such, this answer shows how to:
Create start_date and end_date columns for minimum and maximum dates grouped by city, respectively
Sum the count for each source value into new columns for each identified source value (black and red)
Merge the two resultant dataframes into a single one
This answer assumes the original data is in a dataframe named df.
Minimum and maximum dates
I know this is not exactly what is wanted due to the 30 days logic, but it shows a straightforward grouping by city with the minimum and maximum dates for each.
date_agg_df = df.groupby("city").agg({"date": ["min", "max"]}).reset_index()
date_agg_df.columns = ["city", "start_date", "end_date"]
Data in date_agg_df:
city start_date end_date
0 Alexandria 2022-10-13 2022-10-19
1 Amsterdam 2018-08-05 2019-09-06
Sum the count of each source value
black_red_df = df.fillna(0).pivot_table(index="city", columns="source", values="count", aggfunc="sum").fillna(0).reset_index()
black_red_df.columns = ["city", "black", "red"]
Data in black_red_df:
city black red
0 Alexandria 1307.0 0.0
1 Amsterdam 0.0 22026.0
Merge the dataframes
mrg_df = pd.merge(date_agg_df, black_red_df, on="city")
Data in mrg_df:
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307.0 0.0
1 Amsterdam 2018-08-05 2019-09-06 0.0 22026.0
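For the 30-day logic itself, here is a rough, untested sketch of one possible approach (my addition, not part of the answer above): within each city, start a new round whenever the gap to the previous date exceeds 30 days, then aggregate per (city, round).
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["city", "date"])

# Flag rows whose gap to the previous row within the same city exceeds 30 days,
# then number the rounds with a cumulative sum of those flags.
new_round = df.groupby("city")["date"].diff() > pd.Timedelta(days=30)
df["round"] = new_round.groupby(df["city"]).cumsum()

# Start and end date of each round.
dates = (df.groupby(["city", "round"])
           .agg(start_date=("date", "min"), end_date=("date", "max"))
           .reset_index())

# Summed counts per source, one column per source value.
counts = (df.pivot_table(index=["city", "round"], columns="source",
                         values="count", aggfunc="sum", fill_value=0)
            .reset_index())

result = dates.merge(counts, on=["city", "round"]).drop(columns="round")
With the sample data this should yield one row for Alexandria and two for Amsterdam, matching df_a above.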

Related

Groupby each group and then do division of two divisions (current yr/latest yr for each column)

I'd like to create a new column by dividing each year's value by the latest year's value in Col_1 and Col_2, respectively, for each group, and then dividing the two ratios.
Methodology: Calculate (EachYrCol_1/Yr2000Col_1)/(EachYrCol_2/Yr2000Col_2) for each group
See example below:
Year  Group  Col_1  Col_2  New Column
1995  A        100     11  (100/600)/(11/66)
1996  A        200     22  (200/600)/(22/66)
1997  A        300     33  (300/600)/(33/66)
1998  A        400     44  .............
1999  A        500     55  .............
2000  A        600     66  .............
1995  B        700     77  (700/1200)/(77/399)
1996  B        800     88  (800/1200)/(88/399)
1997  B        900     99  (900/1200)/(99/399)
1998  B       1000    199  .............
1999  B       1100    299  .............
2000  B       1200    399  .............
Sample dataset:
import pandas as pd
df = pd.DataFrame({'Year':[1995, 1996, 1997, 1998, 1999, 2000,1995, 1996, 1997, 1998, 1999, 2000],
'Group':['A', 'A', 'A','A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'Col_1':[100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200],
'Col_2':[11, 22, 33, 44, 55, 66, 77, 88, 99, 199, 299, 399]})
Use GroupBy.transform with 'last' to build a helper DataFrame, so each column can be divided by its group's last value:
df1 = df.groupby('Group').transform('last')
df['New'] = df['Col_1'].div(df1['Col_1']).div(df['Col_2'].div(df1['Col_2']))
print (df)
Year Group Col_1 Col_2 New
0 1995 A 100 11 1.000000
1 1996 A 200 22 1.000000
2 1997 A 300 33 1.000000
3 1998 A 400 44 1.000000
4 1999 A 500 55 1.000000
5 2000 A 600 66 1.000000
6 1995 B 700 77 3.022727
7 1996 B 800 88 3.022727
8 1997 B 900 99 3.022727
9 1998 B 1000 199 1.670854
10 1999 B 1100 299 1.223244
11 2000 B 1200 399 1.000000
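An equivalent variation (my addition, same idea): divide both columns by their group-wise last value in one step, then take the ratio of the two results.
# Divide each column by its group's last (year-2000) value, then divide the two ratios.
ratios = df[['Col_1', 'Col_2']].div(df.groupby('Group')[['Col_1', 'Col_2']].transform('last'))
df['New'] = ratios['Col_1'] / ratios['Col_2']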

Pandas - styling with multi-index (from pivot)

As a result of a pivot table I get the following table.
Entry table
I would like to apply a style with a criterion depending on the product:
when eggs are sold, I would like a red background when sales are lower than 15
when chicken is sold, I would like a red background when sales are lower than 18
Expected result
I am able to do it by converting to records (df.to_records()) and creating a single index, but I lose the nice layout that allows comparing the evolution between months. Note that the number of months is variable (it is 2 in the example but can be more).
The entry file is read from Excel:
import pandas as pd
df = pd.read_excel('xxx.xlsx',sheet_name='Sheet')
Result:
product Month sales revenue
0 eggs 2021-05-01 10 8.0
1 chicken 2021-05-01 15 12.0
2 chicken 2021-05-01 17 15.0
3 eggs 2021-05-01 20 15.0
4 eggs 2021-06-01 11 8.5
5 chicken 2021-06-01 14 12.0
6 chicken 2021-06-01 18 15.0
7 eggs 2021-06-01 22 17.0
Pivoting it and then swapping the column levels:
df2 = pd.pivot_table(df, columns=['Month'], index=['product'], values=['sales', 'revenue'], aggfunc='sum')
df3 = df2.swaplevel(0,1,1).sort_values(by='Month',ascending=True, axis=1)
Month 2021-05-01 2021-06-01
revenue sales revenue sales
product
chicken 27.0 32 27.0 32
eggs 23.0 30 25.5 33
The df3.to_dict() output:
{(Timestamp('2021-05-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 23.0}, (Timestamp('2021-05-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 30}, (Timestamp('2021-06-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 25.5}, (Timestamp('2021-06-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 33}}
Thanks in advance for any help you can provide.
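Since no answer is shown here, a rough sketch of one possible approach (my assumption: df3 is the pivoted frame above with a (Month, metric) column MultiIndex, and the thresholds come from the criteria in the question):
thresholds = {'eggs': 15, 'chicken': 18}  # assumed per-product limits from the question

def highlight_low_sales(frame):
    # Return a DataFrame of the same shape holding CSS strings.
    css = pd.DataFrame('', index=frame.index, columns=frame.columns)
    # Select the 'sales' columns via the second level of the column MultiIndex,
    # since the number of months is variable.
    sales_cols = [col for col in frame.columns if col[1] == 'sales']
    for product, limit in thresholds.items():
        if product not in frame.index:
            continue
        for col in sales_cols:
            if frame.loc[product, col] < limit:
                css.loc[product, col] = 'background-color: red'
    return css

styled = df3.style.apply(highlight_low_sales, axis=None)
# Render in a notebook, or export with e.g. styled.to_excel('styled.xlsx'), to see the colours.
Because the pivot sums the values, the check here runs against the aggregated sales per month; adjust the thresholds if the comparison should happen before aggregation.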

How to sum certain values using pandas datetime operations

The headline is not clear. Let me explain.
I have a dataframe like this:
Order Quantity Date Accepted Date Delivered
20 01-05-2010 01-02-2011
10 01-11-2010 01-03-2011
300 01-12-2010 01-04-2011
5 01-03-2011 01-03-2012
20 01-04-2012 01-11-2013
10 01-07-2013 01-12-2014
I want to basically create another column that contains the total undelivered items for each row.
Expected output:
Order Quantity Date Accepted Date Delivered Pending Order
20 01-05-2010 01-02-2011 20
10 01-11-2010 01-03-2011 30
300 01-12-2010 01-04-2011 330
5 01-03-2011 01-03-2012 305
20 01-04-2012 01-11-2013 20
10 01-07-2013 01-12-2014 30
Here, I have taken a part of your dataframe and tried to get the result.
df = pd.DataFrame({'order': [20, 10, 300, 200],
'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I change Date_aceepted and Date_delever to datetime using pd.to_datetime:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I make a new dataframe in which the accepted and delivered dates are not the same. I assume you just need that in your final result.
dff = df[df['date1'] != df['date2']].copy()  # .copy() avoids a SettingWithCopyWarning when adding 'pending' below
You can see that the last row, in which the accepted and delivered dates are the same, has been removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I use pandas cumsum to compute the pending orders:
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final data frame has two extra columns that can be dropped if you don't want them in your result.
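If you need the full Pending Order column from the question, one rough sketch (my addition; it assumes dd-mm-yyyy dates and that "pending" at each row means the row's own quantity plus every earlier order whose delivery date is still after that row's accepted date):
import pandas as pd

df = pd.DataFrame({
    'Order Quantity': [20, 10, 300, 5, 20, 10],
    'Date Accepted': ['01-05-2010', '01-11-2010', '01-12-2010',
                      '01-03-2011', '01-04-2012', '01-07-2013'],
    'Date Delivered': ['01-02-2011', '01-03-2011', '01-04-2011',
                       '01-03-2012', '01-11-2013', '01-12-2014'],
})
accepted = pd.to_datetime(df['Date Accepted'], dayfirst=True)
delivered = pd.to_datetime(df['Date Delivered'], dayfirst=True)

pending = []
for i in range(len(df)):
    # Earlier rows whose delivery date falls after row i's accepted date.
    open_before = (df.index < i) & (delivered > accepted.iloc[i]).to_numpy()
    pending.append(df.loc[open_before, 'Order Quantity'].sum()
                   + df['Order Quantity'].iloc[i])
df['Pending Order'] = pending
This reproduces the expected 20, 30, 330, 305, 20, 30 for the sample rows in the question.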

I am trying to find the original index value of the last occurrence of each group

I am trying to find the original indices for the last occurrence of groupby groups.
If I have the dataframe given by:
data = {
'Name':['Jack', 'Jill', 'Jill', 'Jill', 'Ryan',
'Ryan','Lilian', 'Jack', 'Jack', 'Jack'],
'Age': [15, 20, 25, 30, 23, 23, 45, 24, 65, 115]
}
df = pd.DataFrame(data)
df
I hope to see:
0 Jack 15
3 Jill 30
5 Ryan 23
6 Lilian 45
9 Jack 115
I tried using groupby and .last() after the groupby, but this gets rid of the index.
If you want to drop duplicates without treating the records that appear later as dupes (I think this was the expected output before editing), you can also do:
(df.assign(k=df['Name'].ne(df['Name'].shift()).cumsum())
.drop_duplicates(['Name','k'],keep='last'))
Or better as #PiR mentions:
df[df.Name.ne(df.Name.shift(-1))]
Name Age k
0 Jack 15 1
3 Jill 30 2
5 Ryan 23 3
6 Lilian 45 4
9 Jack 115 5
You can also use:
df.groupby(df.Name.ne(df.Name.shift()).cumsum()).tail(1)
Name Age
0 Jack 15
3 Jill 30
5 Ryan 23
6 Lilian 45
9 Jack 115
Use duplicated:
print(df[~df.Name.ne(df.Name.shift()).cumsum().duplicated(keep='last')])
Output
Name Age
0 Jack 15
3 Jill 30
5 Ryan 23
6 Lilian 45
9 Jack 115

Need assistance with the query below

I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import pandas as pd
import webbrowser

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
And then the above-mentioned error appears.
I believe you need to use read_html, which returns all parsed tables; then select the DataFrame by position:
website = 'https://en.wikipedia.org/wiki/Winning_percentage'
#select first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
Win % Wins Losses Year Team Comment
0 0.798 67 17 1882 Chicago White Stockings best pre-modern season
1 0.763 116 36 1906 Chicago Cubs best 154-game NL season
2 0.721 111 43 1954 Cleveland Indians best 154-game AL season
3 0.716 116 46 2001 Seattle Mariners best 162-game AL season
4 0.667 108 54 1975 Cincinnati Reds best 162-game NL season
#select second parsed table
df2 = pd.read_html(website)[1]
print (df2)
Win % Wins Losses Season Team \
0 0.890 73 9 2015–16 Golden State Warriors
1 0.110 9 73 1972–73 Philadelphia 76ers
2 0.106 7 59 2011–12 Charlotte Bobcats
Comment
0 best 82 game season
1 worst 82-game season
2 worst season statistically