A bar chart based on the total numbers for each year in pandas

I have two columns, with several rows of numbers for each year.
First, I need to display the values sorted by the total number for each year.
Second, I need to create a bar chart in which the y-axis is 'year' and each bar has a label showing the total number for that year.
I'm not sure if I explained the problem clearly, but I would appreciate some help.

Let us do:
df.set_index('Year')['Goals scored'].sum(level=0).sort_index().plot(kind='bar')
# note: sum(level=0) is deprecated in recent pandas; the groupby approach below is the current equivalent

Try:
sum_by_years = (df.groupby('Year')['Goals scored'].sum()
.sort_values(ascending=False)
)
sum_by_years.plot.barh()
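Since the question also asks for a label with the total on each bar, here is a minimal sketch building on the answer above. It assumes the columns are named 'Year' and 'Goals scored' as in the snippets, and uses matplotlib's Axes.bar_label (available since matplotlib 3.4):
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data; the column names are taken from the snippets above
df = pd.DataFrame({'Year': [2019, 2019, 2020, 2020, 2021],
                   'Goals scored': [10, 5, 7, 8, 12]})

sum_by_years = df.groupby('Year')['Goals scored'].sum().sort_values()
ax = sum_by_years.plot.barh()      # year on the y-axis
ax.bar_label(ax.containers[0])     # annotate each bar with its total
plt.show()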

Related

calculate value based on other column values with some step for rows of other columns

Total beginner here. If my question is irrelevant, apologies in advance and I'll remove it. So, my question: using pandas, I want to calculate an evolution ratio for a week's data compared with the mean of the previous rolling 4 weeks.
df['rolling_mean_fourweeks'] = df.rolling(4).mean().round(decimals=1)
From here I want to create a new column for the evolution ratio, based on the week's data compared with the rolling mean in the previous row.
What is the best way to go here? (I don't have big data.) I have tried unsuccessfully with .shift() but I'm very unfamiliar with it... I should get NaN for week 3 (the fourth week) and ~47% for the fifth week.
Any suggestion for retrieving the value at the row with step -1?
Thanks and have a good day!
Your idea about using shift can work perfectly. The shift(x) function simply shifts a series (a full column in your case) by x steps.
A simple way to check if the rolling_mean_fourweeks is a good predictor can be to shift Column1 and then check how it differs from rolling_mean_fourweeks:
df['column1_shifted'] = df['Column1'].shift(-1)
df['rolling_accuracy'] = ((df['column1_shifted']-df['rolling_mean_fourweeks'])
/df['rolling_mean_fourweeks'])
resulting in a new rolling_accuracy column alongside the original data.
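For the asker's original evolution ratio (the current week compared with the previous row's rolling four-week mean), here is a minimal sketch with made-up weekly values; the column name Column1 is taken from the answer above:
import pandas as pd

# hypothetical weekly values
df = pd.DataFrame({'Column1': [100, 110, 90, 105, 150, 120]})

# rolling mean over the last four weeks
df['rolling_mean_fourweeks'] = df['Column1'].rolling(4).mean().round(1)

# shift(1) pulls the previous row's rolling mean onto the current row,
# so the ratio compares this week with the prior four-week average
prev_mean = df['rolling_mean_fourweeks'].shift(1)
df['evolution_ratio'] = (df['Column1'] - prev_mean) / prev_mean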

Proportions for multiple subcategories

I am trying to calculate proportions with multiple subcategories. As seen in the screenshot below, the series is grouped by ['budget_levels', 'revenue_levels'].
I would like to calculate the proportion for each.
For example,
budget_levels=='low' & revenue_levels=='low' / budget_levels=='low'
budget_levels=='low' & revenue_levels=='medium' / budget_levels=='low'
However, I'm not getting the desired output.
Is there any way I could do this calculation for each with a simple one-line code such as .apply(lambda) function?
Use value_counts to get the number of occurrences of each combination. Then group by the column budget_levels and divide the observations in each group by their sum. sort_index makes it easier to compare the groups.
df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
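A toy example of the expression above, with made-up values for budget_levels and revenue_levels:
import pandas as pd

# hypothetical data; the column names are taken from the question
df = pd.DataFrame({'budget_levels':  ['low', 'low', 'low', 'high', 'high'],
                   'revenue_levels': ['low', 'medium', 'low', 'high', 'low']})

# share of each (budget, revenue) combination within its budget level
props = (df.value_counts()
           .groupby(level=0)
           .transform(lambda x: x / x.sum())
           .sort_index())
print(props)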

How can I plot the average of a column based off of the values in another column?

I have a list of campaign donations and I want to create a plot with the average 'contribution_receipt_amount' for each candidate.
indv.columns
#produces this output:
#Index(['candidate_name', 'committee_name', 'contribution_receipt_amount',
'contribution_receipt_date', 'contributor_first_name',
'contributor_middle_name', 'contributor_last_name',
'contributor_suffix', 'contributor_street_1', 'contributor_street_2',
'contributor_city', 'contributor_state', 'contributor_zip',
'contributor_employer', 'contributor_occupation',
'contributor_aggregate_ytd', 'report_year', 'report_type',
'contributor_name', 'recipient_committee_type',
'recipient_committee_org_type', 'election_type',
'fec_election_type_desc', 'fec_election_year', 'filing_form', 'sub_id',
'pdf_url', 'line_number_label'],
dtype='object')
First aggregate the mean to a Series and then use Series.plot:
indv.groupby('candidate_name')['contribution_receipt_amount'].mean().plot()
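Note that .plot() defaults to a line chart; for per-candidate averages a bar chart is usually clearer. A minimal variation on the answer above (sort_values is optional):
avg_by_candidate = indv.groupby('candidate_name')['contribution_receipt_amount'].mean()
avg_by_candidate.sort_values().plot(kind='barh')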

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to compare those columns against each other row by row for the largest value, and then take the average of those values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
Sample Data:
What I have tried so far:
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].values.max()
df['z'] = df[['Regression','Classification','Clustering']].apply(np.max, axis=1)
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].apply(np.max, axis=1)
If you want the highest score in column 'hw1' you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and .max() returns the maximum. For the average use .mean():
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = []
for col in df.columns:
    maximum_list.append(df[col].max())
highest = max(maximum_list)
avg = sum(maximum_list) / len(maximum_list)
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a Series where each entry is the highest score of the three categories for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()
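A toy example of the full expression, with made-up scores; the column names are taken from the question:
import pandas as pd

# hypothetical data standing in for cleaned_survey.csv
df = pd.DataFrame({'Program': ['MSIS', 'MSIS', 'MBA'],
                   'Regression': [3, 5, 2],
                   'Classification': [4, 1, 5],
                   'Clustering': [2, 2, 1]})

msis_max = df.loc[df['Program'] == 'MSIS',
                  ['Regression', 'Classification', 'Clustering']].max(axis=1)
print(msis_max.mean())  # (4 + 5) / 2 = 4.5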

Quick Delta Between Two Rows/Columns in GoodData

Right now, I see there are quick ways to get things like Sum/Avg/Max/Etc. for two or more rows or columns when building a table in GoodData.
(screenshot of the quick total options)
I am building a little table that shows last week and the week prior, and I'm trying to show the delta between them.
So if the first column is 100 and the second is 50, I want '-50'.
If the first column is 25 and the second is 100, I want '75'.
Is there an easy way to do this?
Assuming the first column contains the result of metric #1 and the second column contains the result of metric #2, you can simply create a metric #3 defined as (metric #1 - metric #2) or, to match the examples above, (metric #2 - metric #1).