Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot? - matplotlib

I am trying to find an equivalent of hist()'s figsize and layout parameters for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, it is oriented horizontally and there is no equivalent layout parameter to make them vertical to my knowledge. 4 plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
             x_vars = X_train.select_dtypes(exclude=['object']).columns,
             y_vars = ["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()

What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
                            np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' is the id column, so drop it explicitly from the value columns
compare_columns = X_train.select_dtypes(exclude=['object']).columns.drop('SalePrice')
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
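To make the "long form" requirement more concrete, here is a minimal sketch of what melt does on a tiny frame. The column names area and rooms and all values are made up for illustration; only SalePrice comes from the question.

```python
import pandas as pd

# Tiny made-up "wide" frame: one column per feature plus the target
wide = pd.DataFrame({'area': [50, 80],
                     'rooms': [2, 3],
                     'SalePrice': [100, 200]})

# Long form: one row per (SalePrice, feature name, feature value)
long_df = wide.melt(id_vars='SalePrice', value_vars=['area', 'rooms'])
print(long_df)
```

Each original row is repeated once per melted column, which is what lets col='variable' split the data into one subplot per feature.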
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
                facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()
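If you would rather stay in plain matplotlib, a grid with 4 plots per row can also be built by hand. This is only a sketch of that alternative, not something pairplot does for you: the data is random, the feature names a to f are placeholders, and the Agg backend is selected just so it runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: six numeric features plus the target column
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(100, 6)), columns=list('abcdef'))
X_train['SalePrice'] = rng.integers(10000, 100000, 100)

x_cols = X_train.columns.drop('SalePrice')
ncols = 4                                   # plots per row
nrows = math.ceil(len(x_cols) / ncols)      # as many rows as needed

fig, axs = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                        sharey=True, squeeze=False)
flat_axes = axs.ravel()
for ax, col in zip(flat_axes, x_cols):
    ax.scatter(X_train[col], X_train['SalePrice'], s=10)
    ax.set_xlabel(col)
# Hide the axes that are left over in the last row
for ax in flat_axes[len(x_cols):]:
    ax.set_visible(False)
fig.tight_layout()
```

Here nrows and ncols play the role of hist()'s layout, and figsize is passed to plt.subplots directly.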

Related

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical (not continuous) data, I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is to create a bar plot per each fictitious place ('A or B'), and the count per each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I have trouble deciding on the size:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it raises an error: TypeError: 'int' object is not callable
Perhaps I am making this more complicated than it needs to be. Could you please guide me with this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # pandas takes the target axes through the `ax` keyword
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data = df,
    x = 'rating',
    hue = 'id',
)
plt.show()
plt.close()
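As for the TypeError in the question: Series.size is an attribute holding an int, not a method, so df['rating'].size() tries to call an int. A rough sketch of the loop the question was aiming for, counting the ratings first and then drawing one bar chart per id, could look like this (the figsize and the Agg backend are my own choices, not from the question):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'rating': [1, 2, 4, 5, 5, 5, 3, 1, 3, 3, 3, 5, 2]})

# .size is a plain int attribute, so calling it like a function fails
n_rows = df['rating'].size

# Count ratings per id; unstack gives one column per id and
# fill_value=0 keeps ratings that never occur for an id
counts = df.groupby(['id', 'rating']).size().unstack('id', fill_value=0)

fig, axs = plt.subplots(1, counts.shape[1], figsize=(8, 4), sharey=True)
for ax, id_ in zip(axs, counts.columns):
    ax.bar(counts.index.astype(str), counts[id_])
    ax.set_title(id_)
    ax.set_xlabel('rating')
```

Because of fill_value=0, both subplots show all five ratings, including rating 4 for place B, which has no entries.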

how to set the distance between bars and axis using matplot lib [duplicate]

I am currently learning how to import data and work with it in matplotlib, and I am having trouble even though I have the exact code from the book.
This is what the plot looks like, but my question is: how can I get rid of the white space at the start and the end of the x-axis?
Here is the code:
import csv
from matplotlib import pyplot as plt
from datetime import datetime
# Get dates and high temperatures from file.
filename = 'sitka_weather_07-2014.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    #for index, column_header in enumerate(header_row):
        #print(index, column_header)
    dates, highs = [], []
    for row in reader:
        current_date = datetime.strptime(row[0], "%Y-%m-%d")
        dates.append(current_date)
        high = int(row[1])
        highs.append(high)
# Plot data.
fig = plt.figure(dpi=128, figsize=(10,6))
plt.plot(dates, highs, c='red')
# Format plot.
plt.title("Daily high temperatures, July 2014", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()
There is an automatic margin set at the edges, which ensures that the data fits nicely within the axis spines. In this case such a margin is probably desired on the y axis. By default it is set to 0.05 in units of axis span.
To set the margin to 0 on the x axis, use
plt.margins(x=0)
or
ax.margins(x=0)
depending on the context. Also see the documentation.
In case you want to get rid of the margin in the whole script, you can use
plt.rcParams['axes.xmargin'] = 0
at the beginning of your script (same for y of course). If you want to get rid of the margin entirely and forever, you might want to change the corresponding line in the matplotlib rc file:
axes.xmargin : 0
axes.ymargin : 0
Example
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
tips.plot(ax=ax1, title='Default Margin')
tips.plot(ax=ax2, title='Margins: x=0')
ax2.margins(x=0)
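The tips example needs the dataset to be downloaded. A self-contained sketch with made-up data shows the same effect and lets you check the resulting limits directly (it assumes default rcParams; the Agg backend is only there so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up data: x runs from 0 to 9
x = list(range(10))
y = [v ** 2 for v in x]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, y)
ax1.set_title('Default Margin')
ax2.plot(x, y)
ax2.set_title('Margins: x=0')
ax2.margins(x=0)

# With the default margin the limits extend past the data on both
# sides; with x=0 they coincide with the first and last x value
default_xlim = ax1.get_xlim()
tight_xlim = ax2.get_xlim()
print(default_xlim, tight_xlim)
```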
Alternatively, use plt.xlim(..) or ax.set_xlim(..) to manually set the limits of the axes such that there is no white space left.
If you only want to remove the margin on one side but not the other, e.g. remove the margin from the right but not from the left, you can use set_xlim() on a matplotlib axes object.
import seaborn as sns
import matplotlib.pyplot as plt
import math
max_x_value = 100
x_values = list(range(1, max_x_value + 1))
y_values = [math.log(i) for i in x_values]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.lineplot(ax=ax1, x=x_values, y=y_values)
sns.lineplot(ax=ax2, x=x_values, y=y_values)
ax2.set_xlim(-5, max_x_value) # tune the -5 to your needs

Align multi-line ticks in Seaborn plot

I have the following heatmap:
I've broken up the category names by each capital letter and then capitalised them. This achieves a centering effect across the labels on my x-axis by default which I'd like to replicate across my y-axis.
yticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.index]
xticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.columns]
fig, ax = plt.subplots(figsize=(20,15))
sns.heatmap(corr, ax=ax, annot=True, fmt="d",
            cmap="Blues", annot_kws=annot_kws,
            mask=mask, vmin=0, vmax=5000,
            cbar_kws={"shrink": .8}, square=True,
            linewidths=5)
for p in ax.texts:
    myTrans = p.get_transform()
    offset = mpl.transforms.ScaledTranslation(-12, 5, mpl.transforms.IdentityTransform())
    p.set_transform(myTrans + offset)
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0, linespacing=0.4)
plt.xticks(plt.xticks()[0], labels=xticks, rotation=0, linespacing=0.4)
where corr represents a pre-defined pandas dataframe.
I couldn't seem to find an align parameter for setting the ticks and was wondering if and how this centering could be achieved in seaborn/matplotlib?
I've adapted the seaborn correlation plot example below.
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
rs = np.random.RandomState(33)
d = pd.DataFrame(data=rs.normal(size=(100, 7)),
                 columns=['Donald\nDuck','Mickey\nMouse','Han\nSolo',
                          'Luke\nSkywalker','Yoda','Santa\nClause','Ronald\nMcDonald'])
# Compute the correlation matrix
corr = d.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
for i in ax.get_yticklabels():
    i.set_ha('right')
    i.set_rotation(0)
for i in ax.get_xticklabels():
    i.set_ha('center')
Note the two for loops above. They fetch each tick label and then set its horizontal alignment. You can also change the vertical alignment with set_va().
The code above produces this:
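If the goal is also to centre the individual lines of a multi-line label relative to each other, matplotlib's Text objects have set_multialignment() in addition to set_ha() and set_va(). A small sketch without seaborn, where the 3x3 image and the label list are placeholders and the Agg backend is only there so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Placeholder labels, some of them spanning two lines
labels = ['DONALD\nDUCK', 'YODA', 'LUKE\nSKYWALKER']

fig, ax = plt.subplots()
ax.imshow(np.arange(9).reshape(3, 3), cmap='Blues')
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(labels)
ax.set_yticks([0, 1, 2])
ax.set_yticklabels(labels)

for lab in ax.get_yticklabels():
    lab.set_ha('right')               # anchor the text block at its right edge
    lab.set_va('center')              # centre the block vertically on the tick
    lab.set_multialignment('center')  # centre the lines within the block
    lab.set_rotation(0)
```

set_ha() controls where the whole text block sits relative to the tick, while set_multialignment() only affects how the lines inside the block line up with each other.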

pandas dataframe bar plot put space between bars

I want my image to look like this:
But right now my image looks like this:
How do I reduce the space between the bars without setting the bar width to 1?
Here is my code:
plot=repeat.loc['mean'].plot(kind='bar',rot=0,alpha=1,cmap='Reds',
                             yerr=repeat.loc['std'],error_kw=dict(elinewidth=0.02,ecolor='grey'),
                             align='center',width=0.2,grid=None)
plt.ylabel('')
plt.grid(False)
plt.title(cell,ha='center')
plt.xticks([])
plt.yticks([])
plt.ylim(0,120)
plt.tight_layout()
Make the plot from scratch if the top-level functions from pandas or seaborn do not give you the desired result! :)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(10, 10) + 1
data = data[np.argsort(np.average(data, axis=1))[::-1], :]
avg = np.average(data, axis=1)
std = np.std(data, axis=1)
# a practical helper from seaborn to quickly generate the colors
colors = sns.color_palette('Reds', n_colors=data.shape[0])
fig, ax = plt.subplots()
pos = range(10)
ax.bar(pos, avg, width=1)
for col, patch in zip(colors, ax.patches):
    patch.set_facecolor(col)
    patch.set_edgecolor('k')
# draw the upper half of each error bar behind the bars
for i, p in enumerate(pos):
    ax.plot([p, p], [avg[i], avg[i] + std[i]], color='k', lw=2, zorder=-1)
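If the bars should keep a width of 0.2, another option is to move them closer together: the gap between neighbouring bars is the distance between their positions minus the width. pandas places bar plots at integer positions 0, 1, 2, ..., so width=0.2 leaves a gap of 0.8. A rough sketch with plain matplotlib, where the data, the 0.25 step and the colour are made up and the Agg backend is only there so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 5 groups with 20 samples each
rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=(5, 20))
avg = data.mean(axis=1)
std = data.std(axis=1)

width = 0.2                        # bar width stays at 0.2
step = 0.25                        # distance between bar centres
pos = np.arange(len(avg)) * step   # 0, 0.25, 0.5, ...

fig, ax = plt.subplots()
bars = ax.bar(pos, avg, width=width, yerr=std, color='firebrick',
              error_kw=dict(elinewidth=1, ecolor='grey'))
ax.set_xticks([])

# Gap between neighbouring bars = step - width
gaps = [b2.get_x() - (b1.get_x() + b1.get_width())
        for b1, b2 in zip(bars.patches, bars.patches[1:])]
print(gaps)
```

Tuning step relative to width controls the spacing without touching the bar width itself.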

Matplotlib histogram with errorbars

I have created a histogram with matplotlib using the pyplot.hist() function. I would like to add Poisson errors, i.e. the square root of the bin height (sqrt(binheight)), to the bars. How can I do this?
The return tuple of .hist() includes return[2] -> a list of 1 Patch objects. I could only find out that it is possible to add errors to bars created via pyplot.bar().
Indeed, you need to use bar. You can compute the histogram yourself (here with np.histogram) and plot its output as a bar chart:
import numpy as np
import matplotlib.pyplot as plt
data = np.array(np.random.rand(1000))
y,binEdges = np.histogram(data,bins=10)
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
menStd = np.sqrt(y)
width = 0.05
plt.bar(bincenters, y, width=width, color='r', yerr=menStd)
plt.show()
Alternative Solution
You can also use a combination of pyplot.errorbar() and drawstyle keyword argument. The code below creates a plot of the histogram using a stepped line plot. There is a marker in the center of each bin and each bin has the requisite Poisson errorbar.
import numpy
from matplotlib import pyplot
x = numpy.random.rand(1000)
y, bin_edges = numpy.histogram(x, bins=10)
bin_centers = 0.5*(bin_edges[1:] + bin_edges[:-1])
pyplot.errorbar(
    bin_centers,
    y,
    yerr = y**0.5,
    marker = '.',
    drawstyle = 'steps-mid'
)
pyplot.show()
My personal opinion
When plotting the results of multiple histograms on the same figure, line plots are easier to distinguish. In addition, they look nicer when plotting with yscale='log'.
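To illustrate that last point, here is a sketch that overlays two histograms as stepped lines with Poisson error bars on a log-scaled y axis. The two sample sets and their names are made up, and the Agg backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Two made-up samples, both confined to the interval [0, 1]
rng = np.random.default_rng(42)
samples = {'uniform': rng.random(5000),
           'beta(2, 2)': rng.beta(2, 2, 5000)}

# Shared bin edges so the two histograms are directly comparable
bin_edges = np.linspace(0, 1, 11)
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])

fig, ax = plt.subplots()
counts = {}
for name, x in samples.items():
    y, _ = np.histogram(x, bins=bin_edges)
    counts[name] = y
    # Stepped line through the bin centres with sqrt(N) error bars
    ax.errorbar(bin_centers, y, yerr=np.sqrt(y), marker='.',
                drawstyle='steps-mid', label=name)
ax.set_yscale('log')
ax.legend()
```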