Split a dataframe by unique values in a dataframe - pandas

I'm looking to analyze the ABV and Style of beer, and then take an average for graphing. I have all the beer styles and their ABV in a dataframe. I'm looking to create separate DataFrames for each style, and then take the average of that style's ABV.
I've tried groupby and got nothing.
What I want to accomplish:
-Split the dataframe into multiple dataframes by style, each including all ABVs for that style (there are some duplicate ABV values; 90 styles, 71 unique ABVs)
-Take the average ABV of each style
-Graph in a scatter plot.
Data Frame: [image not available]

I managed to find documentation on iterating through groups and passed this loop into it. This sorted it by style:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
beers_csv = pd.read_csv("Resources/cleaned_beer.csv")
dropped_beers_csv = beers_csv.drop(columns=["Unnamed: 0", "Brewery ID", "Brewery", "City", "IBU", "State", "OZ.", "Beer"])
beer_data = dropped_beers_csv
grouped = beer_data.groupby('Style')
for name, group in grouped:
    print(name)
    print(group)
grouped_beer = grouped.mean()
grouped_beer
It returned all the styles and their ABVs (for example, it returned 2 Abbey Single Ales and their ABV).
The last two lines apply the mean function, which takes the average of each group and returns a dataframe with 90 rows. Running a unique count on my original csv file also shows 90 unique styles, so I now have a 90-row dataframe containing each unique style and the average ABV for that style.
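As a minimal sketch of the same idea on a tiny made-up table (the rows below are hypothetical and only stand in for cleaned_beer.csv), the per-style mean and, if really needed, the per-style DataFrames could be built like this:

```python
import pandas as pd

# Hypothetical sample standing in for cleaned_beer.csv (values are made up)
beer_data = pd.DataFrame({
    "Style": ["American IPA", "American IPA", "Abbey Single Ale", "Abbey Single Ale", "Cider"],
    "ABV": [0.070, 0.060, 0.049, 0.051, 0.050],
})

# One row per style, holding the mean ABV of that style
mean_abv = beer_data.groupby("Style", as_index=False)["ABV"].mean()

# Optional: a dict of per-style DataFrames, if separate frames are really needed
frames_by_style = {style: frame for style, frame in beer_data.groupby("Style")}

print(mean_abv)
```

With as_index=False the style stays a regular column, so something like plt.scatter(mean_abv["Style"], mean_abv["ABV"]) should be enough for the scatter plot.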

Related

Iterating and plotting five columns per iteration pandas

I am trying to plot five columns per iteration, but my current code is plotting everything five times. How can I make it plot five columns per iteration without repeating them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
    fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
    sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
    sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
    sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
    sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
    sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
    plt.show()
You are using list slicing in the wrong way. df.columns[n:] returns all the column names from index n to the last one, and the same holds for n+1, n+2, n+3 and n+4. Zipping these slices therefore yields overlapping windows of five consecutive columns, each shifted by one position, which causes the repetition you are referring to. In addition, the number of figures shown is determined by the behavior of zip: when used on iterables of different lengths, it stops at the shortest one (in this case df.columns[n+4:]).
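To see the effect without any plotting, here is a small sketch that applies the same zip pattern to a plain list of hypothetical column names (c0 to c11 are placeholders, not the real columns):

```python
# Hypothetical column names standing in for df.columns
columns = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11"]
n = 4

# Same pattern as in the question: five slices, each shifted by one position
windows = list(zip(columns[n:], columns[n+1:], columns[n+2:], columns[n+3:], columns[n+4:]))

for window in windows:
    print(window)

# zip stops at the shortest slice, columns[n+4:], so there are len(columns) - (n + 4) windows
print(len(windows))
```

Each printed tuple overlaps the previous one in four of its five names, which is why the same columns keep reappearing across figures.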
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
    # Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
    cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
    # Adapt your plot and take into account that the last group can be smaller than 5
    # squeeze=False keeps ax two-dimensional even when the last group has a single column
    fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3), squeeze=False)
    for idx in range(len(cols)):
        #sns.scatterplot(df, x=cols[idx], y='variable', ax=ax[0, idx])
        sns.scatterplot(df, x=cols[idx], y=df[cols[idx]], ax=ax[0, idx]) # In the example the values of the column are plotted
    plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->9]...)
Recover the columns that are positioned at the previously recovered indexes. The last group of columns may be smaller than 5 (e.g., with 18 columns, the last group is composed of the ones with the following indexes: 15, 16, 17)
Create the plot taking into account the previous corner case of less than 5 columns
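The grouping step on its own can be sketched with plain slicing. The helper name chunk_columns and the 18 placeholder column names are assumptions made for illustration only:

```python
def chunk_columns(columns, size=5):
    """Split a list of column names into consecutive groups of at most `size`."""
    return [columns[start:start + size] for start in range(0, len(columns), size)]

# 18 hypothetical column names, so the last group is shorter than 5
columns = [f"col{i}" for i in range(18)]
groups = chunk_columns(columns)

for group in groups:
    print(len(group), group)
```

Slicing past the end of a sequence simply returns the remaining items, so no explicit min() is needed for the last, shorter group.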
With Seaborn's objects interface, available from v0.12, we could do it like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
    {c: random.randint(100, size=sample_size) for c in columns},
    index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
    .melt(value_vars=columns, var_name="tag", ignore_index=False)
    .reset_index()
)
Then finally render the figures, 5 figures per row.
(
    so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
    .facet(col="tag", wrap=5)
    .add(so.Dot())
)

how to plot graded letters like A* in matplotlib

I'm a complete beginner and I have a college stats project: I'm comparing exam scores for our year group and the one below. I collected my own data and, since I do CS, I decided to try visualizing the data with pandas and matplotlib (my first time). I was able to read the csv file into a dataframe with columns = Level,Grade,Difficulty,Revision,Happy,MAG. Level is just the year group, e.g. AS or A2, and MAG is like a minimum expected grade; the rest are numeric values out of 5.
I want to do some type of plotting but I can't seem to get it to work.
I want to plot revision against difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
here is the csv https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
this is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
This is my first time ever posting on Stack, so I'm really sorry if I did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
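For the Grade vs MAG bar chart, one possible approach is to count how often each letter occurs and line the counts up in grade order. This is only a sketch: the sample rows and the grade scale in grade_order are assumptions, since the real values are in the linked spreadsheet.

```python
import pandas as pd

# Hypothetical grades; the real values would come from the Grade and MAG columns
df = pd.DataFrame({
    "Grade": ["A*", "B", "A", "C", "B", "A"],
    "MAG":   ["A",  "B", "A", "B", "C", "A*"],
})

# Assumed grade scale, lowest to highest
grade_order = ["E", "D", "C", "B", "A", "A*"]

# Count each letter per column, then put the counts in scale order (missing letters become 0)
counts = pd.DataFrame({
    col: df[col].value_counts().reindex(grade_order, fill_value=0)
    for col in ["Grade", "MAG"]
})
print(counts)
```

Calling counts.plot.bar() on the result should then draw one pair of bars per letter, with A* handled like any other label.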

How to visualize single column from pandas dataframe

I'm new to data science & pandas. I'm just trying to visualize the distribution of data from a single series (a single column), but the histogram that I'm generating is only a single column (see below where it's sorted descending).
My data is over 11 million rows. The max value is 27,235 and the min values are 1. I'd like to see the "count" column grouped into different bins and a column/bar whose height is the total for each bin. But, I'm only seeing a single bar and am not sure what to do.
Data
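import pandas as pd
df = pd.DataFrame({'count':[27235,26000,25877]})
Solution
import matplotlib.pyplot as plt
df['count'].hist()
plt.show()
Alternatively
import seaborn as sns
sns.histplot(df['count'], kde=True) # distplot is deprecated since seaborn 0.11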
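If the values are heavily skewed (an assumption here, suggested by a range of 1 to 27,235), a default histogram can put almost everything into the first bar. A sketch with hand-picked bin edges, on made-up numbers, shows how the per-bin totals could be computed explicitly:

```python
import pandas as pd

# Hypothetical, skewed sample; the real series has about 11 million rows
counts = pd.Series([1, 1, 2, 3, 8, 15, 40, 120, 900, 27235], name="count")

# Assumed bin edges, chosen by hand to roughly follow powers of ten
edges = [0, 10, 100, 1000, 10000, 30000]
binned = pd.cut(counts, bins=edges)

# Number of rows falling into each bin, in bin order
per_bin = binned.value_counts().sort_index()
print(per_bin)
```

per_bin.plot.bar() would then give one bar per bin. Passing bins=... (and possibly log=True) to hist() is another option worth trying on the real data.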

Time series analysis - putting values into bins

Data
How can I split the values in the category_lvl2 column into bins, one for each distinct value, and find the average amount for all the values in each bin?
For example, finding the average amount spent on coffee.
I have already performed feature scaling on the amounts.
You can use the groupby() method and pass it the groups you get from pd.cut(). The example below bins the data into 10 categories by the sepal_length column; those categories are then used to group the iris df. You could also bin by one variable and take the mean of another with groupby.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
bins = pd.cut(iris.sepal_length, 10)
iris.groupby(bins).sepal_length.mean()
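Applied to the question's column names, a sketch might look like the following. The rows are made up, and it assumes category_lvl2 holds labels such as "coffee", in which case each distinct label already acts as its own bin and pd.cut is only needed for numeric columns:

```python
import pandas as pd

# Hypothetical transactions; column names follow the question, values are made up
df = pd.DataFrame({
    "category_lvl2": ["coffee", "groceries", "coffee", "transport", "groceries", "coffee"],
    "amount": [3.0, 42.0, 4.5, 12.0, 38.0, 3.0],
})

# category_lvl2 is already discrete, so each distinct value acts as its own bin
mean_by_category = df.groupby("category_lvl2")["amount"].mean()
print(mean_by_category)

# For a numeric column, pd.cut builds the bins first; here amount is cut into 2 equal-width bins
amount_bins = pd.cut(df["amount"], bins=2)
mean_by_bin = df.groupby(amount_bins, observed=True)["amount"].mean()
print(mean_by_bin)
```

mean_by_category["coffee"] then holds the average amount spent on coffee.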

Filtering out columns with Pandas

I want to filter a pandas DataFrame to keep only the columns whose names contain a certain wildcard, plus the two columns directly to the right of each of them.
The DataFrame is tracking pupil grades, overall total and feedback. I only want to keep the data that corresponds to Homework and not other assessments. So in the example below I would want to keep First Name, Last Name, any homework column and the corresponding points and feedback columns, which are always exported to the right of it.
First Name,Last Name,Understanding Business Homework,Points,Feedback,Past Paper Homework,Points,Feedback, Groupings/Structures Questions,Points, Feedback
import pandas as pd
import numpy as np
all_data = all_data.filter(like=('Homework') and ('First Name') and
('Second Name') and ('Points'),axis=1)
print(all_data.head())
export_csv = all_data.to_csv (r'C:\Users\Sandy\Python\Automate_the_Boring_Stuff\new.csv', index = None, header=True)
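One thing to note about the attempt above: filter(like=...) takes a single string, and in Python 'Homework' and 'First Name' and 'Second Name' and 'Points' evaluates to the last operand, 'Points', so only the Points columns are matched. A possible alternative, sketched here on a made-up two-row export with the same header layout, is to select columns by position. Note that read_csv renames the repeated headers to Points.1, Feedback.1 and so on.

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row export using the header layout from the question
csv_text = (
    "First Name,Last Name,Understanding Business Homework,Points,Feedback,"
    "Past Paper Homework,Points,Feedback,Groupings/Structures Questions,Points,Feedback\n"
    "Ann,Lee,done,8,good,done,7,ok,done,5,fine\n"
    "Bob,Ray,done,6,ok,late,4,redo,done,9,great\n"
)
all_data = pd.read_csv(StringIO(csv_text))

keep = [0, 1]  # positions of First Name and Last Name
for pos, name in enumerate(all_data.columns):
    if "Homework" in name:
        # the homework column plus the Points and Feedback columns to its right
        keep.extend([pos, pos + 1, pos + 2])

homework_only = all_data.iloc[:, keep]
print(list(homework_only.columns))
```

This relies on the stated layout, where Points and Feedback always sit immediately to the right of their assessment column. homework_only.to_csv(..., index=False) would then write the filtered table.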