Divide two row values based on label and create a new column to populate the calculated value - pandas

New to Python and looking for some help.
I would like to divide values in two different rows (part of the same column) and then insert a new column with the calculated value
City 2017-18 Item
0 Boston 100 Primary
1 Boston 200 Secondary
2 Boston 300 Tertiary
3 Boston 400 Nat'l average
4 Chicago 500 Primary
5 Chicago 600 Secondary
6 Chicago 700 Tertiary
7 Chicago 800 Nat'l average
On the above Dataframe, I am trying to divide a City's Primary, Secondary and Tertiary values respectively by the Nat'l average for that City. The resultant answer to be populated in a new column part of the same Dataframe. After calculation, the row with the label 'Nat'l average' need to be deleted.
Appreciate your help...
City 2014-15 Item New_column
0 Boston 100 Primary 100/400
1 Boston 200 Secondary 200/400
2 Boston 300 Tertiary 300/400
3 Chicago 500 Primary 500/800
4 Chicago 600 Secondary 600/800
5 Chicago 700 Tertiary 700/800

If mean value is always last per groups divide column by Series created by GroupBy.transform and GroupBy.last:
df['new'] = df['2017-18'].div(df.groupby('City')['2017-18'].transform('last'))
If not first filter values with averages and divide by Series.maping Series:
s = df[df['Item'] == "Nat'l average"].set_index('City')['2017-18']
df['new'] = df['2017-18'].div(df['City'].map(s))
And last filter out rows by boolean indexing:
df = df[df['Item'] != "Nat'l average"]
print (df)
City 2017-18 Item new
0 Boston 100 Primary 0.250
1 Boston 200 Secondary 0.500
2 Boston 300 Tertiary 0.750
4 Chicago 500 Primary 0.625
5 Chicago 600 Secondary 0.750
6 Chicago 700 Tertiary 0.875
Detail:
print (df['City'].map(s))
0 400
1 400
2 400
3 400
4 800
5 800
6 800
7 800
Name: City, dtype: int64

Related

Pandas replace function specifying the column [duplicate]

This question already has answers here:
Replacing column values in a pandas DataFrame
(16 answers)
Closed 4 months ago.
dataset = pd.read_csv('./file.csv')
dataset.head()
This gives:
age sex smoker married region price
0 39 female yes no us 250000
1 28 male no no us 400000
2 23 male no yes europe 389000
3 17 male no no asia 230000
4 43 male no yes asia 243800
I want to replace all yes/no values of smoker with 0 or 1, but I don't want to change the yes/no values of married. I want to use pandas replace function.
I did the following, but this obviously changes all yes/no values (from smoker and married column):
dataset = dataset.replace(to_replace='yes', value='1')
dataset = dataset.replace(to_replace='no', value='0')
age sex smoker married region price
0 39 female 1 0 us 250000
1 28 male 0 0 us 400000
2 23 male 0 1 europe 389000
3 17 male 0 0 asia 230000
4 43 male 0 1 asia 243800
How can I ensure that only the yes/no values from the smoker column get changed, preferably using Pandas' replace function?
did you try:
dataset['smoker']=dataset['smoker'].replace({'yes':1, 'no':0})

Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:
gender
city
age
time
value
male
newyork
10_20y
2010
10.5
female
newyork
10_20y
2010
11
I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).
The time dimension should not be aggregated. Example of means added for different combinations:
gender
city
age
time
value
total
total
total
2010
(mean)
male
total
total
2010
(mean)
male
newyork
total
2010
(mean)
female
total
total
2010
(mean)
female
newyork
total
2010
(mean)
... total
... total
10_20y
2010
(mean)
Using groupby on multiple columns will groupby with all combinations of these columns. So a simple df.groupby(["city", "age"]).mean() will achieve the mean for the "total", "city", "age" combination. The problem here is you want all combinations of all size for the list ["gender", "city", "age"]. It's not straightforward and I think a more pythonic way can be found but here is my proposal :
## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender" : np.random.choice(genders, n),
"city" : np.random.choice(cities, n),
"age" : np.random.choice(ages, n),
"value": np.random.randint(1, 20, n)})
## the dataframe we will append during the process
new = pd.DataFrame(columns = cols)
## itertools contains the function combinations
import itertools
## list all size combinations possible
for n in range(0, len(cols)+1):
for i in itertools.combinations(cols, n):
## if n > 0, the combinations is not empty,
## so we can directly groupby this sublist and take the mean
if n != 0:
agg_df = df.groupby(list(i)).mean().reset_index()
## reset index since for multiple columns, the result will be multiindex
## if n=0, we just want to take the mean of the whole dataframe
else:
agg_df = pd.DataFrame(columns = cols)
agg_df.loc[0, "value"] = df.loc[:, "value"].mean()
## a bit ugly since this mean is an integer not a dataframe
## from here agg_df will have n+1 columns,
## for instance for ["gender"] we will have only "gender" and "value" columns
## "city" and "age" are missing since we aggregate on it
for j in cols:
if j not in i:
agg_df.loc[:, j] = "total" ## adding total as you asked for
new = new.append(agg_df)
For instance, I find :
new
gender city age value
0 total total total 9.750000
0 female total total 8.083333
1 male total total 12.250000
0 total chicago total 8.000000
1 total losangeles total 11.428571
2 total newyork total 11.666667
0 total total 10_20y 8.100000
1 total total 20_30y 11.400000
0 female chicago total 7.333333
1 female losangeles total 9.250000
2 female newyork total 8.000000
3 male chicago total 9.000000
4 male losangeles total 14.333333
5 male newyork total 19.000000
0 female total 10_20y 7.333333
1 female total 20_30y 8.833333
2 male total 10_20y 9.250000
3 male total 20_30y 15.250000
0 total chicago 10_20y 7.285714
1 total chicago 20_30y 9.666667
2 total losangeles 10_20y 10.000000
3 total losangeles 20_30y 12.500000
4 total newyork 20_30y 11.666667
0 female chicago 10_20y 7.250000
1 female chicago 20_30y 7.500000
2 female losangeles 10_20y 7.500000
3 female losangeles 20_30y 11.000000
4 female newyork 20_30y 8.000000
5 male chicago 10_20y 7.333333
6 male chicago 20_30y 14.000000
7 male losangeles 10_20y 15.000000
8 male losangeles 20_30y 14.000000
9 male newyork 20_30y 19.000000
It's a base work, I think you can work around

Calculate average of non numeric columns in pandas

I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want group it by city and a create a new columns based on the column "quality" and calculate avegare as below
city High Medium Low High_Avg Medium_AVG Low_avg
A 2 0 1 66.66 0 33.33
B 1 1 0 50 50 0
I tried with the below script and I know it is completely wrong.
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the sum across columns, and finally concatenate the datframes into one :
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(1).array, axis=0).mul(100).round(2).add_suffix("_Avg")
#combine the dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
city
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67

how to groupby dataframe with two rows as header

I have dataframe with two rows as header(name and unit). is there a way to groupby dataframe with unit. what Iam trying to acheive is group by similar units and run analysis on them.
df = pd.read_csv('filename',header=[0.1])
Customer length height adress city
Name meter meter bldg name
A 10 20 1 Delhi
C 30 20 10 Delhi
B 20 40 19 Delhi
D 40 50 10 Delhi
i am trying to isolate the dataframe with units(second header)
for example:
length height
10 20
30 20
20 40
40 50

Overall sum by groupby pandas

I have a dataframe as shown below, which is area usage of whole city say Bangalore.
Sector Plot Usage Status Area
A 1 Villa Constructed 40
A 2 Residential Constructed 50
A 3 Substation Not_Constructed 120
A 4 Villa Not_Constructed 60
A 5 Residential Not_Constructed 30
A 6 Substation Constructed 100
B 1 Villa Constructed 80
B 2 Residential Constructed 60
B 3 Substation Not_Constructed 40
B 4 Villa Not_Constructed 80
B 5 Residential Not_Constructed 100
B 6 Substation Constructed 40
Bangalore consist of two sectors A and B.
From the above I would like to calculate total area of Bangalore and its distribution of usage.
Expected Output:
City Total_Area %_Villa %_Resid %_Substation %_Constructed %_Not_Constructed
Bangalore(A+B) 800 32.5 30 37.5 46.25 53.75
I think you need set scalar value to column city before apply solution (if there are only sectors A and B):
df['Sector'] = 'Bangalore(A+B)'
#aggregate sum per 2 columns Sector and Usage
df1 = df.groupby(['Sector', 'Usage'])['Area'].sum()
#percentage by division of total per Sector
df1 = df1.div(df1.sum(level=0), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#aggregate sum per 2 columns Sector and Status
df2 = df.groupby(['Sector', 'Status'])['Area'].sum()
df2 = df2.div(df2.sum(level=0), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#total Area per Sector
s = df.groupby('Sector')['Area'].sum().rename('Total_area')
#join all together
dfA = pd.concat([s, df1, df2], axis=1).reset_index()
print (dfA)
Sector Total_area %_Residential %_Substation %_Villa \
0 Bangalore(A+B) 800 30.0 37.5 32.5
%_Constructed %_Not_Constructed
0 46.25 53.75
Simple Pivot Table can help!
1. One Line Pandas Solution: 80% work done
pv =
df.pivot_table(values='Area',aggfunc=np.sum,index=['Status'],columns=['Usage'],margins=True,margins_name='Total',fill_value=0).unstack()
2. Now formatting for %: 90% work done
ans =
pd.DataFrame([[pv['Villa']['Total']/pv['Total']['Total'].astype('float'),pv['Resid']['Total']/pv['Total']['Total'].astype('float'),pv['Substation']['Total']/pv['Total']['Total'].astype('float'),pv['Total']['Constructed']/pv['Total']['Total'].astype('float'),pv['Total']['Not_Constructed']/pv['Total']['Total'].astype('float')]]).round(2)*100
3. Adding the Total Column: 99% work done
ans['Total'] = pv['Total']['Total']
4. Renaming Columns and Arranging in your expected order: and done!
ans.columns=['%_Villa','%_Resid','%_Substation','%_Constructed','%_Not_Constructed','Total']
ans = ans[['Total',''%_Villa','%_Resid','%_Substation','%_Constructed','%_Not_Constructed']]