Proportions for multiple subcategories - pandas

I am trying to calculate proportions with multiple subcategories. As seen in the screenshot below, the series is grouped by ['budget_levels', 'revenue_levels'].
I would like to calculate the proportion for each.
For example,
count(budget_levels=='low' & revenue_levels=='low') / count(budget_levels=='low')
count(budget_levels=='low' & revenue_levels=='medium') / count(budget_levels=='low')
However, I am not getting the desired output.
Is there any way I could do this calculation for each with a simple one-line code such as .apply(lambda) function?

Use value_counts to get the number of occurrences of each combination. Then group by budget_levels (the first index level) and divide the counts in each group by their sum. sort_index makes it easier to compare the groups.
df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
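For illustration, here is a minimal sketch on made-up data (the column names follow the question; the values are invented):
import pandas as pd

# Hypothetical data shaped like the question's series
df = pd.DataFrame({
    'budget_levels':  ['low', 'low', 'low', 'high', 'high'],
    'revenue_levels': ['low', 'low', 'medium', 'low', 'high'],
})

# Count each (budget_levels, revenue_levels) combination, then divide each
# count by the total of its budget_levels group
props = (
    df.value_counts()
      .groupby(level=0)
      .transform(lambda x: x / x.sum())
      .sort_index()
)
print(props)
# budget_levels  revenue_levels
# high           high              0.500000
#                low               0.500000
# low            low               0.666667
#                medium            0.333333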

Related

Removing 1% top and bottom percentiles given a condition

I have the following dataset and I would like to remove the top and bottom 1% percentiles of the column "ROA" for each "PRIMARY_SIC_CODE", i.e., take all the different ROA values for each PRIMARY_SIC_CODE, drop the rows in those extreme percentiles, and keep the rest of the dataset.
Is there any easy way of doing it? Thanks!
If you want to exclude the top and bottom 1% considering the column ROA in its entirety:
top_1perc = df['ROA'].quantile(q=0.99)
bottom_1perc = df['ROA'].quantile(q=0.01)
new_df = df[(df['ROA'] > bottom_1perc) & (df['ROA'] < top_1perc)]
If instead you want to exclude them within each PRIMARY_SIC_CODE group:
df[df.groupby('PRIMARY_SIC_CODE')['ROA'].transform(
    lambda x: (x > x.quantile(q=0.01)) & (x < x.quantile(q=0.99))).eq(1)]
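As a rough, self-contained check of the group-wise version on synthetic data (column names as in the question, values invented):
import numpy as np
import pandas as pd

# Hypothetical data: two SIC codes with 1,000 ROA observations each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'PRIMARY_SIC_CODE': np.repeat([1000, 2000], 1000),
    'ROA': rng.normal(size=2000),
})

# Keep only rows whose ROA lies strictly between the 1st and 99th
# percentile of their own PRIMARY_SIC_CODE group
mask = df.groupby('PRIMARY_SIC_CODE')['ROA'].transform(
    lambda x: (x > x.quantile(q=0.01)) & (x < x.quantile(q=0.99))).eq(1)
trimmed = df[mask]
print(len(df), len(trimmed))  # roughly 2% of each group is dropped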
Try along the lines of...
df.groupby('PRIMARY_SIC_CODE')['ROA'].quantile(q=0.01)

How to get the count of dataframe rows at 75% or max

The command below shows some details about the dataframe.
df.describe()
It gives details about count, mean, std, min, 25%, ...
Is there any way to get the count of rows in a dataframe at 75% or 25%?
Thanks.
Use pandas.Series.quantile to determine the value for the given quantile of the selected column.
.quantile has the benefit of being able to specify any quantile value (e.g. 30%)
.describe() only reports 25%, 50%, and 75% by default, and it performs other aggregations you don't need.
Select the specific data using Boolean selection, with .ge and .le
.ge is >=
.le is <=
.eq is ==
Once you have all the values matching the criteria, use something like quartile_25.count() or len(quartile_25) to determine how many values meet the criteria.
# col should be some column name, given as a string
quartile_75 = df[df[col].ge(df[col].quantile(q=.75))]
quartile_25 = df[df[col].le(df[col].quantile(q=.25))]
max_ = df[df[col].eq(df[col].max())]
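Putting it together on a tiny, made-up column (here called 'score'; substitute your own column name):
import pandas as pd

# Hypothetical data; 'score' stands in for whichever column you care about
df = pd.DataFrame({'score': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
col = 'score'

quartile_75 = df[df[col].ge(df[col].quantile(q=.75))]
quartile_25 = df[df[col].le(df[col].quantile(q=.25))]
max_ = df[df[col].eq(df[col].max())]

print(len(quartile_75))  # 3 -> rows with score >= 7.75 (the 75% quantile)
print(len(quartile_25))  # 3 -> rows with score <= 3.25 (the 25% quantile)
print(len(max_))         # 1 -> rows equal to the maximum, 10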

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I, for each group, create a new dataframe which subtracts the group's mean of each column from the original values?
Here is an example of part of the dataset.
So, for example, for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets from the value 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda, but I cannot get the syntax right.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves allows you to do this as a vectorized operation, which will be quicker than .apply(). For what to subtract by, you'll combine .groupby() and .transform() to get the value you need to subtract from a column. Then, just subtract it.
# Skip the grouping column itself; de-mean every other column within its group
for column in df.columns.drop('Cluster'):
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')
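Alternatively, when every non-group column is numeric, the same result can be obtained without the explicit loop. A sketch, assuming (as above) that the grouping column is called 'Cluster'; 'OTHER_COL' is just a placeholder name:
import pandas as pd

# Hypothetical data: one grouping column and two numeric columns
df = pd.DataFrame({
    'Cluster': ['A', 'A', 'B', 'B'],
    'BANK_AND_DEP': [10.0, 20.0, 30.0, 50.0],
    'OTHER_COL': [1.0, 3.0, 5.0, 7.0],
})

value_cols = list(df.columns.drop('Cluster'))

# Per-group means, broadcast back to the shape of the original rows
group_means = df.groupby('Cluster')[value_cols].transform('mean')

# Subtract the group mean from every numeric column at once
demeaned = df[value_cols] - group_means
print(demeaned)
#    BANK_AND_DEP  OTHER_COL
# 0          -5.0       -1.0
# 1           5.0        1.0
# 2         -10.0       -1.0
# 3          10.0        1.0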

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
I am a beginner to coding. I am trying to figure out how to compare several columns to each other to find the largest value in each row, and then take the average of those values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df[data_science_experience]=df[["Regression","Classification","Clustering"]].values.max()
df['z']=df[['Regression','Classification','Clustering']].apply(np.max,axis=1)
df[data_science_experience]=df[["Regression","Classification","Clustering"]].apply(np.max,axis=1)
If you want to get the highest score in column 'hw1' you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and .max() returns the maximum. For the average, use .mean():
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())
overall_max = max(maximum_list)
overall_avg = sum(maximum_list) / len(maximum_list)
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we have a Series where each entry is the highest score of the three categories for a student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()

Quick Delta Between Two Rows/Columns in GoodData

Right now, I see there are quick ways to get things like Sum/Avg/Max/Etc. for two or more rows or columns when building a table in GoodData.
(screenshot: quick total options)
I am building a little table that shows last week and the week prior, and I'm trying to show the delta between them.
So if the first column is 100 and the second is 50, I want '-50'
If the first column is 25 and the second is 100, I want '75'
Is there an easy way to do this?
Assuming the first column contains the result of metric #1 and the second column contains the result of metric #2, you can simply create a metric #3 defined as (metric #1 - metric #2), or vice versa.