Seaborn how to add number of samples per category in sns.catplot - pandas

I have a catplot drawing using:
s = sns.catplot(x="level", y="value", hue="cond", kind=graph_type, data=df)
However, the size of the groups is not equal:
"Minimal" has n=12 samples, and "Moderate" has n=18 samples.
How can I add this info to the graph?

Manually calculate the group sizes and add them to the x tick labels, something like this:
import matplotlib.pyplot as plt
import seaborn as sns
exercise = sns.load_dataset("exercise")
cnts = dict(exercise['time'].value_counts())
key = list(cnts.keys())
vals = list(cnts.values())
g = sns.catplot(x="time", y="pulse", hue="kind",order=key,
data=exercise, kind="box")
g.set_axis_labels("", "pulse")
g.set_xticklabels([(key[i]+'\n('+str(vals[i])+')') for i in range(len(key))])
plt.show()
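If the category order matters (as with "Minimal" before "Moderate" in the question), it may be safer to build the labels from an explicit order rather than from the order that value_counts() happens to return. A minimal sketch, assuming a toy dataframe with the group sizes from the question; the helper name labels_with_counts is made up for illustration:

```python
import pandas as pd

def labels_with_counts(data, column, order=None):
    """Return tick labels such as 'Minimal\\n(n=12)' for each level of `column`.

    Hypothetical helper, not part of seaborn or pandas.
    """
    counts = data[column].value_counts()
    if order is None:
        # fall back to the order returned by value_counts()
        order = list(counts.index)
    # levels missing from the data get a count of 0
    return [f"{level}\n(n={counts.get(level, 0)})" for level in order]

# toy data matching the sizes mentioned in the question (assumed)
df = pd.DataFrame({"level": ["Minimal"] * 12 + ["Moderate"] * 18,
                   "value": range(30)})
order = ["Minimal", "Moderate"]
labels = labels_with_counts(df, "level", order)
print(labels)
```

The same order list would then be passed to sns.catplot(..., order=order) and the labels to g.set_xticklabels(labels), so that labels and boxes stay aligned.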

Related

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

I am trying to find an equivalent of hist()'s figsize and layout parameters for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, the plots are laid out in a single horizontal row and, to my knowledge, there is no equivalent of the layout parameter to wrap them into a grid. Four plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
x_vars = X_train.select_dtypes(exclude=['object']).columns,
y_vars = ["SalePrice"])
This is what I would like it to look like (source):
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()
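To make the "long form" requirement more concrete, here is a small sketch of what melt does to a wide dataframe. The column names LotArea and YearBuilt and the values are made up for illustration:

```python
import pandas as pd

# a tiny wide-form frame: one column per variable (made-up values)
wide = pd.DataFrame({"SalePrice": [100, 200],
                     "LotArea": [10, 20],
                     "YearBuilt": [1990, 2005]})

# long form: one row per (SalePrice, variable) pair
long_df = wide.melt(id_vars="SalePrice", value_vars=["LotArea", "YearBuilt"])
print(long_df)
```

Each original column becomes a block of rows labelled in the 'variable' column, which is what col='variable' together with col_wrap=4 uses to lay out one subplot per original column.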

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical (not continuous data), I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is one bar plot per fictitious place ('A' or 'B'), showing the count of each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I am having trouble working out the bar sizes:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it raises an error: TypeError: 'int' object is not callable
Perhaps I am overcomplicating this. Could you guide me with this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
data = df,
x = 'rating',
hue = 'id',
)
plt.show()
plt.close()
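As for the error in the question: Series.size is an attribute holding an int, so df['rating'].size() tries to call that int, which raises the TypeError. A minimal sketch that stays close to the original loop, assuming pd.crosstab is acceptable for counting (this is one possible approach, not the only one):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

# rows: id, columns: rating, cells: number of occurrences (0 where missing)
counts = pd.crosstab(df['id'], df['rating'])

fig, axes = plt.subplots(1, len(counts.index), figsize=(8, 4), sharey=True)
for ax, (id_, row) in zip(axes, counts.iterrows()):
    # string labels keep the ratings as discrete categories on the x-axis
    ax.bar(row.index.astype(str), row.values)
    ax.set_title(f"id = {id_}")
    ax.set_xlabel("rating")
axes[0].set_ylabel("count")
fig.tight_layout()
```

Because crosstab fills missing combinations with 0, both subplots show the same five rating categories even though 'B' never received a rating of 4.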

How to plot frequency distribution graph using Matplotlib?

I trust you are doing well. I am using a data frame with two columns: screen and its frequency. I am trying to find out the relationship between the screens and how often each one appears. Now I want to know, for all screens, what the frequencies are, as a sort of summary graph. Imagine putting all of those frequencies into an array and wanting to study the distribution in that array. Below is the code I have tried so far:
data = pd.read_csv('frequency_list.csv')
new_columns = data.columns.values
new_columns[1] = 'frequency'
data.columns = new_columns
import matplotlib.pyplot as plt
%matplotlib inline
dataset = data.head(10)
dataset.plot(x = "screen", y = "frequency", kind = "bar")
plt.show()
col_one_list = unpickled_df['screen'].tolist()
col_one_arr = unpickled_df['screen'].head(10).to_numpy()
plt.hist(col_one_arr) #gives you a histogram of your array 'a'
plt.show() #finishes out the plot
Below is the screenshot of my data frame containing screen as one column and frequency as another. Can you help me to find out a way to plot a frequency distribution graph? Thanks in advance.
Will a bar plot work? Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
freq = [102,98,56,117]
screen = ['A','B','C','D']
df = pd.DataFrame(list(zip(screen, freq)), columns=['screen', 'freq'])
plt.bar(df.screen,df.freq)
plt.xlabel('x')
plt.ylabel('count')
plt.show()
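If the goal is the distribution of the frequency values themselves (how many screens fall into each frequency range), a histogram of the frequency column may be closer to what the question describes. A minimal sketch using the example values above; the choice of 3 bins is arbitrary:

```python
import numpy as np

freq = [102, 98, 56, 117]

# split the range of frequencies into 3 equal-width bins and count
# how many screens fall into each bin
counts, edges = np.histogram(freq, bins=3)
print(counts)
print(edges)
```

The equivalent plot would be plt.hist(df.freq, bins=3), which computes the same bins and draws them as bars.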

pandas dataframe bar plot put space between bars

I want my image to look like this:
But right now it looks like this:
How do I reduce the space between bars without setting the bar width to 1?
Here is my code:
plot=repeat.loc['mean'].plot(kind='bar',rot=0,alpha=1,cmap='Reds',
yerr=repeat.loc['std'],error_kw=dict(elinewidth=0.02,ecolor='grey'),
align='center',width=0.2,grid=None)
plt.ylabel('')
plt.grid(False)
plt.title(cell,ha='center')
plt.xticks([])
plt.yticks([])
plt.ylim(0,120)
plt.tight_layout()
Make the plot from scratch if the top-level functions from pandas or seaborn do not give you the desired result! :)
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# some fake data, rows sorted by descending mean
data = np.random.randn(10,10) + 1
data = data[np.argsort(np.average(data,axis=1))[::-1],:]
avg = np.average(data,axis=1)
std = np.std(data,axis=1)
# a practical helper from seaborn to quickly generate the colors
colors = sns.color_palette('Reds',n_colors = data.shape[0])
fig, ax = plt.subplots()
pos = range(10)
# width=1 makes neighbouring bars touch
ax.bar(pos,avg,width=1)
for col,patch in zip(colors,ax.patches):
    patch.set_facecolor(col)
    patch.set_edgecolor('k')
# draw the upper error bar for each bar
for i,p in enumerate(pos):
    ax.plot([p,p],[avg[i],avg[i]+std[i]],color='k',lw=2, zorder=-1)
plt.show()
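The example above closes the gaps by using width=1. If the bars should stay narrow, a different option (not what the answer above does) is to place the bars closer together: the gap between neighbouring bars is the distance between their positions minus the bar width. A minimal sketch with made-up means and standard deviations:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# made-up values standing in for repeat.loc['mean'] and repeat.loc['std']
means = np.array([100.0, 80.0, 65.0, 40.0])
stds = np.array([8.0, 6.0, 5.0, 4.0])

width = 0.2   # keep the bars narrow
gap = 0.05    # desired empty space between neighbouring bars
positions = np.arange(len(means)) * (width + gap)

fig, ax = plt.subplots()
bars = ax.bar(positions, means, width=width, yerr=stds,
              error_kw=dict(elinewidth=1, ecolor='grey'))
ax.set_xticks([])
ax.set_ylim(0, 120)
fig.tight_layout()
```

With the default positions 0, 1, 2, ... and width=0.2, the gap would be 0.8; here the positions are 0.25 apart, so only 0.05 remains between bars.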

Graphing the sum and average; pandas

total_income = df.groupby('title_year')['gross'].sum()
average_income = df.groupby('title_year')['gross'].mean()
print(plt.semilogy(total_income,average_income))
I want to plot the total and average income on the same graph as two lines, with the x-axis showing the years from 1916 to 2016 and the y-axis showing dollars. But my code isn't doing that. I need help changing my code to get this result.
Here's the output of my code.
This is my data file named data.csv:
year,gross
2015,45
2015,47
2015,49
2016,76
2016,78
2016,87
2017,103
2017,115
2017,133
1.) This is all the code needed to get a plot with a logarithmic y-axis:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
total_income = df.groupby('year')['gross'].sum()
average_income = df.groupby('year')['gross'].mean()
total_income.plot(label="Total Income")
average_income.plot(label="Average Income")
plt.xlabel("Year")
plt.ylabel("log$_{10}$(Gross)")
plt.yscale("log")
plt.legend()
plt.tight_layout()
plt.savefig("plot.png")
2.) This is how you use plt.semilogy():
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
total_income = pd.DataFrame(df.groupby('year')['gross'].sum())
average_income = pd.DataFrame(df.groupby('year')['gross'].mean())
plt.semilogy(total_income.index, total_income["gross"],
label="Total Income")
plt.semilogy(average_income.index, average_income["gross"],
label="Average Income")
plt.xlabel("Year")
plt.ylabel("log$_{10}$(Gross)")
plt.legend()
plt.tight_layout()
plt.savefig("plot.png")
Methods 1.) and 2.) produce the same plot.
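Both aggregates can also be computed in a single groupby call, which gives one dataframe with a column per line. A minimal sketch, with the contents of data.csv inlined so it runs without the file:

```python
import io
import pandas as pd

# same contents as data.csv above, inlined for a self-contained example
csv_text = """year,gross
2015,45
2015,47
2015,49
2016,76
2016,78
2016,87
2017,103
2017,115
2017,133
"""
df = pd.read_csv(io.StringIO(csv_text))

# one row per year, one column per aggregate
summary = df.groupby("year")["gross"].agg(["sum", "mean"])
print(summary)
```

Calling summary.plot(logy=True) should then draw both columns as lines on a logarithmic y-axis, with the year index on the x-axis.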