Divide two row values based on label and create a new column to populate the calculated value - pandas

I'm new to Python and looking for some help.
I would like to divide values in two different rows of the same column and then insert a new column with the calculated values:
      City  2017-18           Item
0   Boston      100        Primary
1   Boston      200      Secondary
2   Boston      300       Tertiary
3   Boston      400  Nat'l average
4  Chicago      500        Primary
5  Chicago      600      Secondary
6  Chicago      700       Tertiary
7  Chicago      800  Nat'l average
In the above DataFrame, I am trying to divide each city's Primary, Secondary and Tertiary values by the Nat'l average for that city. The result should go into a new column of the same DataFrame. After the calculation, the rows labelled 'Nat'l average' need to be deleted.
      City  2017-18       Item New_column
0   Boston      100    Primary    100/400
1   Boston      200  Secondary    200/400
2   Boston      300   Tertiary    300/400
3  Chicago      500    Primary    500/800
4  Chicago      600  Secondary    600/800
5  Chicago      700   Tertiary    700/800

If the average is always the last row of each group, divide the column by the Series created with GroupBy.transform and GroupBy.last:
df['new'] = df['2017-18'].div(df.groupby('City')['2017-18'].transform('last'))
If not, first filter the rows that hold the averages, then divide by the Series obtained by mapping City onto them with Series.map:
s = df[df['Item'] == "Nat'l average"].set_index('City')['2017-18']
df['new'] = df['2017-18'].div(df['City'].map(s))
Finally, filter out the average rows with boolean indexing:
df = df[df['Item'] != "Nat'l average"]
print (df)
City 2017-18 Item new
0 Boston 100 Primary 0.250
1 Boston 200 Secondary 0.500
2 Boston 300 Tertiary 0.750
4 Chicago 500 Primary 0.625
5 Chicago 600 Secondary 0.750
6 Chicago 700 Tertiary 0.875
print (df['City'].map(s))
0 400
1 400
2 400
3 400
4 800
5 800
6 800
7 800
Name: City, dtype: int64
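The steps above can be put together as one runnable sketch. The sample frame is rebuilt by hand from the question; the names avg and result, and the final reset_index, are choices made here for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "City": ["Boston"] * 4 + ["Chicago"] * 4,
    "2017-18": [100, 200, 300, 400, 500, 600, 700, 800],
    "Item": ["Primary", "Secondary", "Tertiary", "Nat'l average"] * 2,
})

# Look up each city's average row; this does not depend on row order.
avg = df.loc[df["Item"] == "Nat'l average"].set_index("City")["2017-18"]
df["new"] = df["2017-18"].div(df["City"].map(avg))

# Drop the average rows and renumber the index from 0 (assumed to be wanted).
result = df[df["Item"] != "Nat'l average"].reset_index(drop=True)
print(result)
```

The map-based lookup is used here rather than transform('last') because it keeps working if the average row is not the last row of its group.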


dataset = pd.read_csv('./file.csv')
This gives:
age sex smoker married region price
0 39 female yes no us 250000
1 28 male no no us 400000
2 23 male no yes europe 389000
3 17 male no no asia 230000
4 43 male no yes asia 243800
I want to replace all yes/no values of smoker with 0 or 1, but I don't want to change the yes/no values of married. I want to use pandas replace function.
I did the following, but this obviously changes all yes/no values (from smoker and married column):
dataset = dataset.replace(to_replace='yes', value='1')
dataset = dataset.replace(to_replace='no', value='0')
age sex smoker married region price
0 39 female 1 0 us 250000
1 28 male 0 0 us 400000
2 23 male 0 1 europe 389000
3 17 male 0 0 asia 230000
4 43 male 0 1 asia 243800
How can I ensure that only the yes/no values from the smoker column get changed, preferably using Pandas' replace function?
Did you try:
dataset['smoker']=dataset['smoker'].replace({'yes':1, 'no':0})
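If you would rather call replace on the whole DataFrame, it also accepts a nested dictionary that limits the replacement to named columns. The sketch below rebuilds part of the sample data by hand instead of reading file.csv; the name converted is only illustrative.

```python
import pandas as pd

# Part of the sample data from the question, rebuilt by hand.
dataset = pd.DataFrame({
    "age": [39, 28, 23, 17, 43],
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "married": ["no", "no", "yes", "no", "yes"],
})

# Outer key: column name. Inner dict: old value -> new value.
converted = dataset.replace({"smoker": {"yes": 1, "no": 0}})
print(converted)
```

Only the smoker column is touched; married keeps its yes/no strings.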

Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:
I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).
The time dimension should not be aggregated. Example of means added for different combinations:
... total
... total
Using groupby on multiple columns groups by every combination of values in those columns, so a simple df.groupby(["city", "age"]).mean() gives the mean for each city/age pair, i.e. the rows where gender is "total". The problem is that you want every subset, of every size, of the list ["gender", "city", "age"]. That is not straightforward, and a more Pythonic way can probably be found, but here is my proposal:
import itertools

import numpy as np
import pandas as pd

## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender": np.random.choice(genders, n),
                   "city": np.random.choice(cities, n),
                   "age": np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})
## collect the aggregated frames here and concatenate them at the end
## (DataFrame.append was removed in pandas 2.0)
frames = []
## list the combinations of every size, from 0 to len(cols) columns
for size in range(0, len(cols) + 1):
    for combo in itertools.combinations(cols, size):
        if size != 0:
            ## the combination is not empty, so we can group by this
            ## sublist and take the mean; reset_index turns the
            ## (multi)index back into ordinary columns
            agg_df = df.groupby(list(combo))["value"].mean().reset_index()
        else:
            ## empty combination: take the mean of the whole dataframe
            agg_df = pd.DataFrame({"value": [df["value"].mean()]})
        ## agg_df only has the grouped columns plus "value"; for instance
        ## for ["gender"], "city" and "age" are missing since we
        ## aggregated over them, so fill them with "total" as you asked
        for col in cols:
            if col not in combo:
                agg_df[col] = "total"
        frames.append(agg_df[cols + ["value"]])
new = pd.concat(frames)
For instance, I get:
gender city age value
0 total total total 9.750000
0 female total total 8.083333
1 male total total 12.250000
0 total chicago total 8.000000
1 total losangeles total 11.428571
2 total newyork total 11.666667
0 total total 10_20y 8.100000
1 total total 20_30y 11.400000
0 female chicago total 7.333333
1 female losangeles total 9.250000
2 female newyork total 8.000000
3 male chicago total 9.000000
4 male losangeles total 14.333333
5 male newyork total 19.000000
0 female total 10_20y 7.333333
1 female total 20_30y 8.833333
2 male total 10_20y 9.250000
3 male total 20_30y 15.250000
0 total chicago 10_20y 7.285714
1 total chicago 20_30y 9.666667
2 total losangeles 10_20y 10.000000
3 total losangeles 20_30y 12.500000
4 total newyork 20_30y 11.666667
0 female chicago 10_20y 7.250000
1 female chicago 20_30y 7.500000
2 female losangeles 10_20y 7.500000
3 female losangeles 20_30y 11.000000
4 female newyork 20_30y 8.000000
5 male chicago 10_20y 7.333333
6 male chicago 20_30y 14.000000
7 male losangeles 10_20y 15.000000
8 male losangeles 20_30y 14.000000
9 male newyork 20_30y 19.000000
This is a basic version that you can adapt to your needs.

Calculate average of non numeric columns in pandas

I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want to group it by city, create new columns based on the column "Quality", and calculate the averages as below:
city High Medium Low High_Avg Medium_AVG Low_avg
A 2 0 1 66.66 0 33.33
B 1 1 0 50 50 0
I tried with the below script and I know it is completely wrong.
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the sum across columns, and finally concatenate the dataframes into one:
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(1).array, axis=0).mul(100).round(2).add_suffix("_Avg")
#combine the dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67
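As a possible variation, pd.crosstab can compute the row shares itself through its normalize argument, which avoids the manual division. The sketch below rebuilds the question's data by hand; the names counts, shares and combined are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
data = pd.DataFrame({
    "Name": ["Tom", "nick", "krish", "Jack", "Kevin", "Phil"],
    "Quality": ["High", "Medium", "Low", "High", "High", "Medium"],
    "city": ["A", "B", "A", "A", "B", "B"],
})

# Raw counts per city and quality level.
counts = pd.crosstab(data["city"], data["Quality"])
# normalize="index" divides every row by its own total.
shares = pd.crosstab(data["city"], data["Quality"], normalize="index")
averages = shares.mul(100).round(2).add_suffix("_Avg")

combined = pd.concat([counts, averages], axis=1)
print(combined)
```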

how to groupby dataframe with two rows as header

I have a dataframe with two rows as header (name and unit). Is there a way to group the dataframe by unit? What I am trying to achieve is to group columns with similar units and run analysis on them.
df = pd.read_csv('filename', header=[0, 1])
Customer length height adress city
Name meter meter bldg name
A 10 20 1 Delhi
C 30 20 10 Delhi
B 20 40 19 Delhi
D 40 50 10 Delhi
I am trying to isolate the parts of the dataframe that share a unit (the second header row), for example:
length height
10 20
30 20
20 40
40 50
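One way this could be approached, assuming the file is read with header=[0, 1] so that the columns form a two-level MultiIndex (name on level 0, unit on level 1), is to select on the unit level with DataFrame.xs. The frame below is a hand-built stand-in for the CSV, and the name meters is hypothetical.

```python
import pandas as pd

# Hypothetical reconstruction of the two-row header as a column MultiIndex.
columns = pd.MultiIndex.from_tuples([
    ("Customer", "Name"), ("length", "meter"), ("height", "meter"),
    ("adress", "bldg"), ("city", "name"),
])
df = pd.DataFrame(
    [["A", 10, 20, 1, "Delhi"],
     ["C", 30, 20, 10, "Delhi"],
     ["B", 20, 40, 19, "Delhi"],
     ["D", 40, 50, 10, "Delhi"]],
    columns=columns,
)

# Keep the columns whose unit (header level 1) is "meter";
# xs drops the selected level, leaving plain column names.
meters = df.xs("meter", axis=1, level=1)
print(meters)
```

Looping over df.columns.get_level_values(1).unique() and calling xs for each unit would give one sub-frame per unit.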

Overall sum by groupby pandas

I have a dataframe as shown below, which gives the area usage of a whole city, say Bangalore.
Sector Plot Usage Status Area
A 1 Villa Constructed 40
A 2 Residential Constructed 50
A 3 Substation Not_Constructed 120
A 4 Villa Not_Constructed 60
A 5 Residential Not_Constructed 30
A 6 Substation Constructed 100
B 1 Villa Constructed 80
B 2 Residential Constructed 60
B 3 Substation Not_Constructed 40
B 4 Villa Not_Constructed 80
B 5 Residential Not_Constructed 100
B 6 Substation Constructed 40
Bangalore consists of two sectors, A and B.
From the above I would like to calculate total area of Bangalore and its distribution of usage.
Expected Output:
City Total_Area %_Villa %_Resid %_Substation %_Constructed %_Not_Constructed
Bangalore(A+B) 800 32.5 30 37.5 46.25 53.75
I think you need to set the Sector column to a scalar value before applying the solution (if there are only sectors A and B):
df['Sector'] = 'Bangalore(A+B)'
#aggregate sum per 2 columns Sector and Usage
df1 = df.groupby(['Sector', 'Usage'])['Area'].sum()
#percentage by division of total per Sector
#(groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0)
df1 = df1.div(df1.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#aggregate sum per 2 columns Sector and Status
df2 = df.groupby(['Sector', 'Status'])['Area'].sum()
df2 = df2.div(df2.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#total Area per Sector
s = df.groupby('Sector')['Area'].sum().rename('Total_area')
#join all together
dfA = pd.concat([s, df1, df2], axis=1).reset_index()
print (dfA)
Sector Total_area %_Residential %_Substation %_Villa \
0 Bangalore(A+B) 800 30.0 37.5 32.5
%_Constructed %_Not_Constructed
0 46.25 53.75
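To check the numbers against the expected output, here is a compact, self-contained sketch for the single-city case. It rebuilds the sample data by hand and divides by the overall total directly instead of relabelling Sector; the names total, usage_pct, status_pct and summary are illustrative, not from the answer above.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Sector": ["A"] * 6 + ["B"] * 6,
    "Plot": [1, 2, 3, 4, 5, 6] * 2,
    "Usage": ["Villa", "Residential", "Substation"] * 4,
    "Status": ["Constructed", "Constructed", "Not_Constructed",
               "Not_Constructed", "Not_Constructed", "Constructed"] * 2,
    "Area": [40, 50, 120, 60, 30, 100, 80, 60, 40, 80, 100, 40],
})

total = df["Area"].sum()
# Share of the total area per usage type and per status, in percent.
usage_pct = df.groupby("Usage")["Area"].sum().div(total).mul(100).add_prefix("%_")
status_pct = df.groupby("Status")["Area"].sum().div(total).mul(100).add_prefix("%_")

# One Series holding the total followed by all percentages.
summary = pd.concat([pd.Series({"Total_Area": total}), usage_pct, status_pct])
print(summary)
```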
Simple Pivot Table can help!
1. One Line Pandas Solution: 80% work done
pv =
2. Now formatting for %: 90% work done
ans =
3. Adding the Total Column: 99% work done
ans['Total'] = pv['Total']['Total']
4. Renaming Columns and Arranging in your expected order: and done!
ans = ans[['Total','%_Villa','%_Resid','%_Substation','%_Constructed','%_Not_Constructed']]