how to plot graded letters like A* in matplotlib - pandas

i'm a complete beginner and i have a college stats project, im comparing exam scores for our year group and the one below. i collected my own data and since i do cs i decided to try visualize the data with pandas and matplotlib (my first time). i was able to read the csv file into a dataframe with columns = Level,Grade,Difficulty,Happy,MAG. Level is just ' year group ' e.g. AS or A2. and MAG is like a minimum expected grade, the rest are numeric values out of 5.
i want to do some type of plotting but i cant' seem to get it work.
i want to plot revision against difficulty? for AS group and try show a correlation. i also want to show a barchart ( if appropriate ) for Grade Vs MAG.
here is the csv https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
this is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
this is my first time ever posting on stack so im really sorry if i did something wrong.

For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')

Related

Iterating and ploting five columns per iteration pandas

I am trying to plot five columns per iteration, but current code is ploting everithing five times. How to explain to it to plot five columns per iteration without repeting them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
plt.show()
You are using list slicing in the wrong way. When you use df.columns[n:], you are getting all the column names from the one with index n to the last one. The same is valid for n+1, n+2, n+3 and n+4. This causes the repetition that you are referring to. In addition to that, the fact that the plot is shown five times is due to the behavior of the zip function: when used on iterables with different sizes, the iterable returned by zip has the size of the smaller one (in this case df.columns[n+4:]).
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
# Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
# Adapt your plot and take into account that the last group can be smaller than 5
fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3))
for idx in range(len(cols)):
#sns.scatterplot(df, x=cols[idx], y='variable', ax=ax[idx])
sns.scatterplot(df, x=cols[idx], y=df[cols[idx]], ax=ax[idx]) # In the example the values of the column are plotted
plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->10]...)
Recover the columns that are positioned in the previously recovered indexes. The last group of columns may be smaller than 5 (e.g., 18 columns, the last is composed of the ones with the following indexes: 15, 16, 17
Create the plot taking into account the previous corner case of less than 5 columns
With Seaborn's object interface, available from v0.12, we might do like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
{c: random.randint(100, size=sample_size) for c in columns},
index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
.melt(value_vars=columns, var_name="tag", ignore_index=False)
.reset_index()
)
Then finally render the figures, 5 figures per row.
(
so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
.facet(col="tag", wrap=5)
.add(so.Dot())
)

Time series analysis - putting values into bins

Data
How can I split the values in the category_lvl2 column into bins for each different value, and find the average amount for all the values in each bin?
For example finding the average amount spent on coffee
I have already performed feature scaling on the amounts
You can use groupby() method and provide the groups you get with pd.cut(). The example below bins the data into 10 categories by sepal_length column. Then those categories are used to groupby the iris df. You could also bin with a variable and get the mean of another one with groupby.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
bins = pd.cut(iris.sepal_length, 10)
iris.groupby(bins).sepal_length.mean()

Replace xticks with names

I am working on the Spotify dataset from Kaggle. I plotted a barplot showing the top artists with most songs in the dataframe.
But the X-axis is showing numbers and I want to show names of the Artists.
names = list(df1['artist'][0:19])
plt.figure(figsize=(8,4))
plt.xlabel("Artists")
sns.barplot(x=np.arange(1,20),
y=df1['song_title'][0:19]);
I tried both list and Series object type but both are giving error.
How to replace the numbers in xticks with names?
Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
Data from Spotify - All Time Top 2000s Mega Dataset
df = pd.read_csv('Spotify-2000.csv')
titles = pd.DataFrame(df.groupby(['Artist'])['Title'].count()).reset_index().sort_values(['Title'], ascending=False).reset_index(drop=True)
titles.rename(columns={'Title': 'Title Count'}, inplace=True)
# titles.head()
Artist Title Count
Queen 37
The Beatles 36
Coldplay 27
U2 26
The Rolling Stones 24
Plot
plt.figure(figsize=(8, 4))
chart = sns.barplot(x=titles.Artist[0:19], y=titles['Title Count'][0:19])
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.show()
OK, so I didnt know this, although now it seems stupid not to do so in hindsight!
Pass names(or string labels) in the argument for X-axis.
use plt.xticks(rotate=90) so the labels don't overlap

Filtering out columns with Pandas

enter image description hereI want to filter a pandas data-frame to only keep columns that contain a certain wildcard and then keep the two columns directly to right of this.
The Dataframe is tracking pupil grades, overall total and feedback. I only want to keep the data that corresponds to Homework and not other assessments. So in the example below I would want to keep First Name, Last Name, any homework column and the corresponding points and feedback column which are always exported to the right of this.
First Name,Last Name,Understanding Business Homework,Points,Feedback,Past Paper Homework,Points,Feedback, Groupings/Structures Questions,Points, Feedback
import pandas as pd
import numpy as np
all_data = all_data.filter(like=('Homework') and ('First Name') and
('Second Name') and ('Points'),axis=1)
print(all_data.head())
export_csv = all_data.to_csv (r'C:\Users\Sandy\Python\Automate_the_Boring_Stuff\new.csv', index = None, header=True)

Average of selected rows in csv file

In a csv file, how can i calculate the average of selected rows in a column:
Columns
I did this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Read the csv file:
df = pd.read_csv("D:\\xxxxx\\mmmmm.csv")
#Separate the columns and get the average:
# Skid:
S = df['Skid Number after milling'].mean()
But this just gave me the average for the entire column
Thank you for the help!
For selecting rows in a pandas dataframe or series you can use the .iloc attribute.
For example df['A'].iloc[3:5] selects the fourth and fifth row in column "A" of a DataFrame. Indexing starts at 0 and the number behind the colon is not included. This returns a pandas series.
You can do the same using numpy: df["A"].values[3:5]
This already returns a numpy array.
Possibilities to calculate the mean are therefore.
df['A'].iloc[3:5].mean()
or
df["A"].values[3:5].mean()
Also see the documentation about indexing in pandas.