How can I plot the average of a column based off of the values in another column? - pandas

I have a list of campaign donations and I want to create a plot with the average 'contribution_receipt_amount' for each candidate.
indv.columns
# produces this output:
#Index(['candidate_name', 'committee_name', 'contribution_receipt_amount',
'contribution_receipt_date', 'contributor_first_name',
'contributor_middle_name', 'contributor_last_name',
'contributor_suffix', 'contributor_street_1', 'contributor_street_2',
'contributor_city', 'contributor_state', 'contributor_zip',
'contributor_employer', 'contributor_occupation',
'contributor_aggregate_ytd', 'report_year', 'report_type',
'contributor_name', 'recipient_committee_type',
'recipient_committee_org_type', 'election_type',
'fec_election_type_desc', 'fec_election_year', 'filing_form', 'sub_id',
'pdf_url', 'line_number_label'],
dtype='object')

First aggregate mean to Series and then use Series.plot:
indv.groupby('candidate_name')['contribution_receipt_amount'].mean().plot()
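A minimal sketch of the same idea on made-up data (the candidate names and amounts below are invented for illustration and are not taken from the real file). Series.plot() draws a line plot by default, so for categorical labels such as candidate names a bar chart is one reasonable choice. The plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Hypothetical stand-in for the real `indv` frame
indv = pd.DataFrame({
    "candidate_name": ["A", "B", "A", "B", "C"],
    "contribution_receipt_amount": [100.0, 50.0, 300.0, 150.0, 25.0],
})

# One mean per candidate, sorted so the largest average comes first
avg_by_candidate = (
    indv.groupby("candidate_name")["contribution_receipt_amount"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_candidate)

# With matplotlib available, the result can be drawn as a bar chart:
# avg_by_candidate.plot(kind="bar")
```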

Related

Proportions for multiple subcategories

I am trying to calculate proportions with multiple subcategories. As seen in the screenshot below, the series is grouped by ['budget_levels', 'revenue_levels'].
I would like to calculate the proportion for each.
For example,
budget_levels=='low' & revenue_levels=='low' / budget_levels=='low'
budget_levels=='low' & revenue_levels=='medium' / budget_levels=='low'
However, I am not getting the desired output.
Is there a way to do this calculation for every combination in a single line, for example with .apply() and a lambda?
Use value_counts to get the number of occurrences of each combination. Then group by the first index level, budget_levels, and divide the counts in each group by their sum. sort_index makes it easier to compare the groups.
df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
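To see what this computes, here is a small sketch on invented data; the rows are assumptions made up for illustration, and only the two column names come from the question. It also shows pd.crosstab with normalize="index" as one possible alternative that gives the same within-group proportions in table form.

```python
import pandas as pd

# Hypothetical data with only the two grouping columns
df = pd.DataFrame({
    "budget_levels": ["low", "low", "low", "high", "high"],
    "revenue_levels": ["low", "low", "medium", "high", "low"],
})

# Count each (budget, revenue) pair, then divide by the total per budget level
counts = df.value_counts()
props = counts.groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
print(props)

# Alternative: a table whose rows each sum to 1
table = pd.crosstab(df["budget_levels"], df["revenue_levels"], normalize="index")
print(table)
```

In this toy frame, two of the three 'low' budget rows have 'low' revenue, so that pair gets 2/3 and the 'low'/'medium' pair gets 1/3.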

Groupby Get Group For Loop

I have a dataframe that I need to subset by the column measure name.
I create the groups with Measure_Group = measures.groupby('Measure'). I can pull out one group with get_group(), like this: CDC = Measure_Group.get_group('CDC'), but I have over 20 measures to subset. Is there a for loop or lambda function that I can use with the groupby to create all 20 subsets in one pass instead of calling get_group() multiple times?
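One possible approach, sketched on invented data: iterating over a GroupBy object yields (name, sub-frame) pairs, so a dict comprehension can collect every subset in a single pass. The 'Value' column and the measure names other than 'CDC' are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real `measures` frame
measures = pd.DataFrame({
    "Measure": ["CDC", "BCS", "CDC", "COL"],
    "Value": [1, 2, 3, 4],
})

# Each iteration gives the group key and the rows belonging to it
subsets = {name: group for name, group in measures.groupby("Measure")}

CDC = subsets["CDC"]     # same rows as Measure_Group.get_group('CDC')
print(sorted(subsets))   # the measure names found in the data
```

Keeping the subsets in a dict avoids creating 20 separate variables; each one is looked up by its measure name.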

A bar chart based on the total numbers for each year in Pandas

I have two columns in which there are different numbers in different rows for each year.
First, I need to display the sorted values based on the total numbers for each year.
Second, I need to create a bar chart in which the y-axis is 'year' and each bar has a label showing the total number for that year.
I'm not sure if I explained the problem clearly, but I would appreciate some help.
Let us do
df.set_index('Year')['Goals scored'].groupby(level=0).sum().sort_index().plot(kind='bar')
Try:
sum_by_years = (df.groupby('Year')['Goals scored'].sum()
.sort_values(ascending=False)
)
sum_by_years.plot.barh()

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I for each group create a new dataframe which subtract the mean of the column for that group away from the original value?
Here is an example of part of the dataset.
So for example for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets away from the value of 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda - but cannot hit the syntax.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves keeps each step a vectorized operation, which will be quicker than .apply(). To get the value to subtract, combine .groupby() and .transform('mean'), which returns each row's group mean aligned to the original index. Then, just subtract it.
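# Only numeric columns can be demeaned; 'Cluster' is the grouping column
for column in df.select_dtypes(include='number').columns:
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')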
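As a hedged sketch of the same technique without an explicit loop, transform('mean') can be applied to all numeric columns at once. The data below is invented; 'OTHER' and 'Energy' are hypothetical names, while 'Cluster', 'BANK_AND_DEP' and 'Consumer Markets' follow the question.

```python
import pandas as pd

# Hypothetical miniature of the companies dataset
df = pd.DataFrame({
    "Cluster": ["Consumer Markets", "Consumer Markets", "Energy", "Energy"],
    "BANK_AND_DEP": [10.0, 30.0, 5.0, 15.0],
    "OTHER": [1.0, 3.0, 2.0, 6.0],
})

# Restrict to numeric columns so the string grouping column is skipped
numeric_cols = list(df.select_dtypes(include="number").columns)

# Per-row group means, aligned with the original index
group_means = df.groupby("Cluster")[numeric_cols].transform("mean")

# Subtract the group mean from every value, then put the group label back
demeaned = df[numeric_cols] - group_means
demeaned.insert(0, "Cluster", df["Cluster"])
print(demeaned)
```

Because each value has its own group's mean subtracted, every column of the result averages to zero within each group.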

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to check amongst columns and compare those columns to each other for the largest value. And then take the average of those found values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df[data_science_experience]=df[["Regression","Classification","Clustering"]].values.max()
df['z']=df[['Regression','Classification','Clustering']].apply(np.max,axis=1)
df[data_science_experience]=df[["Regression","Classification","Clustering"]].apply(np.max,axis=1)
If you want the highest score in column 'hw1', you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and max() returns the maximum. For the average, use mean():
df['hw1'].mean()
If you want the maximum of each of several columns, you can use:
maximum_list = []
for col in df.columns:  # assumes every column is numeric
    maximum_list.append(df[col].max())
highest = max(maximum_list)
avg = sum(maximum_list) / len(maximum_list)
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a Series where each value is the highest of the three scores for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()
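To check the logic, here is a small sketch on invented scores; the rows below are made up, and only the column names and the 'MSIS' label come from the question ('MBA' is a hypothetical second program).

```python
import pandas as pd

# Hypothetical survey rows; the real data comes from cleaned_survey.csv
df = pd.DataFrame({
    "Program": ["MSIS", "MSIS", "MBA", "MSIS"],
    "Regression": [3, 1, 5, 2],
    "Classification": [4, 2, 5, 2],
    "Clustering": [1, 5, 5, 0],
})

skills = ["Regression", "Classification", "Clustering"]

# Row-wise maximum of the three scores, MSIS students only
experience = df.loc[df["Program"] == "MSIS", skills].max(axis=1)
print(experience.tolist())  # [4, 5, 2]

# Average "data science experience" across those students
average_experience = experience.mean()
print(average_experience)
```

The MBA row is excluded by the filter, so the average is taken over the three MSIS maxima only: (4 + 5 + 2) / 3.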