Add a category without data in it to a plot in seaborn - matplotlib

I am plotting some data as a catplot like this:
ax = sns.catplot(x='Kind', y='VAF', hue='Sample', jitter=True, data=df, legend=False)
The trouble is that some of the categories of 'Kind' contain no data, so the corresponding labels are not added to the plot. Is there a way to retain the label but just not plot any points for it?
Here is a reproducible example to help explain:
x=pd.DataFrame({'Data':[1,3,4,6,3,2],'Number':['One','One','One','One','Three','Three']})
plt.figure()
ax = sns.catplot(x='Number', y='Data', jitter=True, data=x)
In this plot you can see that on the x-axis, samples One and Three are displayed. But imagine that there is also a sample Two that just had no data points in it. How can I display One, Two, and Three on the x-axis?

Order parameter
Of course one would need to know which categories are expected. Given a list of expected categories, one can use the order parameter to supply the expected categories.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
                   'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
ax = sns.stripplot(x='Number', y='Data', jitter=True, data=df, order=exp_cats)
plt.show()
Alternatives
The above works with matplotlib 2.2.3, but not with 3.0. It works again with the current development version (hence 3.1). For the moment, there are the following alternatives:
A. Looping over categories
Given a list of expected categories, one can just loop over them and plot a scatter of each category.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
                   'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
for i, cat in enumerate(exp_cats):
    # Rows belonging to this category (empty for "Two")
    cdf = df[df["Number"] == cat]
    # Category position i plus a small random horizontal jitter
    x = np.zeros(len(cdf))+i+.2*(np.random.rand(len(cdf))-0.5)
    plt.scatter(x, cdf["Data"].values)
plt.xticks(range(len(exp_cats)), exp_cats)
plt.show()
B. Mapping categories to numbers
You can map the expected categories to numbers and plot numbers instead of categories.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
                   'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
# Map each expected category to its integer position on the x-axis
df["IntNumber"] = df["Number"].map(dict(zip(exp_cats, range(len(exp_cats)))))
plt.scatter(df["IntNumber"] + .2*(np.random.rand(len(df))-0.5), df["Data"].values,
            c = df["IntNumber"].values.astype(int))
plt.xticks(range(len(exp_cats)), exp_cats)
plt.show()
C. Appending missing categories to the dataframe
Finally, you can append rows of NaN values to the dataframe so that each expected category appears in it.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
                   'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
# pd.concat does the job of DataFrame.append, which pandas 2.0 removed
dfa = pd.concat([df, pd.DataFrame({'Data':[np.nan]*len(exp_cats), 'Number':exp_cats})],
                ignore_index=True)
ax = sns.stripplot(x='Number', y='Data', jitter=True, data=dfa, order=exp_cats)
plt.show()
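Another option, offered here as a sketch rather than as part of the original answer, is to store the expected categories in the column's dtype. It assumes that seaborn takes the category order from a pandas Categorical column, so an empty category still gets a tick; exp_cats is the same list of expected categories used above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Data': [1, 3, 4, 6, 3, 2],
                   'Number': ['One', 'One', 'One', 'One', 'Three', 'Three']})
exp_cats = ["One", "Two", "Three"]

# Declare every expected category, including the empty one ("Two")
df['Number'] = pd.Categorical(df['Number'], categories=exp_cats, ordered=True)

fig, ax = plt.subplots()
# No order= argument: the categories stored in the dtype define the x-axis
sns.stripplot(x='Number', y='Data', data=df, ax=ax)
```

Call plt.show() afterwards to display the figure.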

Related

How to make a scatterplot in pandas readable

I've been playing with the Titanic dataset and working through some visualisations in pandas using this tutorial: https://www.kdnuggets.com/2023/02/5-pandas-plotting-functions-might-know.html
I produced a scatter plot visual using this code.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
I was confused by the bootstrap plot result, so I moved on to the scatter plot.
pd.plotting.scatter_matrix(df, figsize=(10,10), )
plt.show()
I can sort of interpret it, but I'd like to put the variable names at the top and bottom of every column. Is that doable?
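You can use seaborn's scatterplot on a grid of subplots. Here dataset, 'bedrooms', 'bathrooms', and 'price' are placeholders; substitute your own DataFrame and column names:
fig, ax = plt.subplots(4, 3, figsize=(20, 15))
sns.scatterplot(x='bedrooms', y='price', data=dataset, ax=ax[0, 0])
sns.scatterplot(x='bathrooms', y='price', data=dataset, ax=ax[0, 1])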
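For the original request, labelling each column of the scatter matrix at the top as well as the bottom, one possible sketch is to reuse the array of axes that scatter_matrix returns and add a title to each axes in the first row. The small random DataFrame and its column names (age, fare, sibsp) are placeholders standing in for the numeric Titanic columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder numeric data; substitute the numeric columns of your own DataFrame
rng = np.random.default_rng(0)
num = pd.DataFrame(rng.normal(size=(50, 3)), columns=['age', 'fare', 'sibsp'])

# scatter_matrix returns an n x n array of Axes, one row and column per variable
axes = pd.plotting.scatter_matrix(num, figsize=(10, 10))

# The bottom row already carries x-axis labels; add matching titles to the top row
for top_ax, col in zip(axes[0, :], num.columns):
    top_ax.set_title(col)
```

Call plt.show() afterwards to display the figure.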

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical rather than continuous data, I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
I want one bar plot per fictitious place ('A' or 'B'), showing the count of each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I am having trouble working out the bar heights:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But this raises the error TypeError: 'int' object is not callable.
Perhaps I am overcomplicating this. Could you help me fix this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting: count the rows for each (id, rating) pair
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be an n x 2 grid
# to fit one chart per id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # Pass the target axes through the ax= keyword
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data = df,
    x = 'rating',
    hue = 'id',
)
plt.show()
plt.close()
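The TypeError in the question arises because size is an integer attribute of a Series, so df['rating'].size() tries to call an int. A pandas-only alternative, sketched here on the assumption that one panel per id is wanted, is to count with pd.crosstab and let DataFrame.plot create the subplots. The crosstab also keeps ratings that have a zero count for one of the places.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

# Rows are ratings, columns are ids, cells are counts (0 where a rating is absent)
counts = pd.crosstab(df['rating'], df['id'])

# One bar chart per id, side by side, sharing the y-axis for easy comparison
axes = counts.plot(kind='bar', subplots=True, layout=(1, 2),
                   figsize=(8, 4), sharey=True, legend=False)
```

Call plt.show() afterwards to display the figure.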

seaborn.swarmplot problem with symlog scale: zeros are not expanded

I have a data set of positive values and zeros that I would like to show on a log scale. To represent the zeros I use the 'symlog' option, but all zero values are mapped onto a single point in the swarmplot. How can I fix this?
import numpy as np
import seaborn as sns
import pandas as pd
import random
import matplotlib.pyplot as plt
n = 100
x = np.concatenate(([0]*n,np.linspace(0,1,n),[5]*n,np.linspace(10,100,n),np.linspace(100,1000,n)),axis=None)
data = pd.DataFrame({'value': x, 'category': random.choices([0,1,2,3], k=len(x))})
f, ax = plt.subplots(figsize=(10, 6))
# Recent matplotlib versions name this keyword linthresh (formerly linthreshy)
ax.set_yscale("symlog", linthresh=1.e-2)
ax.set_ylim(ymax=1000)
sns.swarmplot(x="category", y="value", data=data)
sns.despine(left=True)

Annotate labels in pandas scatter plot

I saw this method from an older post but can't get the plot I want.
To start
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import string
df = pd.DataFrame({'x':np.random.rand(10),'y':np.random.rand(10)},
                  index=list(string.ascii_lowercase[:10]))
Scatter plot:
ax = df.plot('x','y', kind='scatter', s=50)
Then define a function to iterate the rows to annotate
def annotate_df(row):
    ax.annotate(row.name, row.values,
                xytext=(10,-5),
                textcoords='offset points',
                size=18,
                color='darkslategrey')
Last apply to get annotation
ab= df.apply(annotate_df, axis=1)
Somehow I just get a Series ab instead of the scatter plot I want. What is wrong? Thank you!
Your code works, you just need plt.show() at the end.
Your full code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import string
df = pd.DataFrame({'x':np.random.rand(10),'y':np.random.rand(10)},
                  index=list(string.ascii_lowercase[:10]))
ax = df.plot('x','y', kind='scatter', s=50)
def annotate_df(row):
    ax.annotate(row.name, row.values,
                xytext=(10,-5),
                textcoords='offset points',
                size=18,
                color='darkslategrey')
ab= df.apply(annotate_df, axis=1)
plt.show()
It looks like this no longer works. However, the solution is easy: convert row.values from a numpy.ndarray to a list:
list(row.values)
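A variant that avoids df.apply altogether, shown here only as a sketch, loops over the rows and passes the coordinates to ax.annotate as a plain tuple, which also sidesteps the ndarray issue mentioned above.

```python
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': np.random.rand(10), 'y': np.random.rand(10)},
                  index=list(string.ascii_lowercase[:10]))

ax = df.plot.scatter(x='x', y='y', s=50)

# Label each point with its index value, offset slightly from the marker
for label, row in df.iterrows():
    ax.annotate(label, (row['x'], row['y']),
                xytext=(10, -5),
                textcoords='offset points',
                size=18,
                color='darkslategrey')
```

Call plt.show() afterwards to display the figure.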

My pandas-generated subplots are laid out incorrectly

I ran the following code to get two plots next to each other (it is a minimal working example that you can copy):
import pandas as pd
import numpy as np
from matplotlib.pylab import plt
comp1 = np.random.normal(0,1,size=200)
values = pd.Series(comp1)
plt.close("all")
f = plt.figure()
plt.show()
sp1 = f.add_subplot(2,2,1)
values.hist(bins=100, alpha=0.5, color="r", normed=True)
sp2 = f.add_subplot(2,2,2)
values.plot(kind="kde")
Unfortunately, I then get the following image:
This is also an interesting layout, but I wanted the figures to be next to each other. What did I do wrong? How can I correct it?
For clarity, I could also use this:
import pandas as pd
import numpy as np
from matplotlib.pylab import plt
comp1 = np.random.normal(0,1,size=200)
values = pd.Series(comp1)
plt.close("all")
fig, axes = plt.subplots(2,2)
plt.show()
axes[0,0].hist(values, bins=100, alpha=0.5, color="r", normed=True) # Until here, it works. You get a half-finished correct image of what I was going for (though it is 2x2 here)
axes[0,1].plot(values, kind="kde") # This does not work
Unfortunately, in this approach axes[0,1] refers to the subplot, which has a plot method but does not know kind="kde". Please note that in the first version plot is called on the pandas object, whereas in the second version plot is called on the subplot, which does not accept the kind="kde" parameter.
Use the ax= argument to choose which subplot to draw on:
import pandas as pd
import numpy as np
from matplotlib.pylab import plt
comp1 = np.random.normal(0,1,size=200)
values = pd.Series(comp1)
plt.close("all")
f = plt.figure()
sp1 = f.add_subplot(2,2,1)
# density=True replaces normed=True, which newer matplotlib versions no longer accept
values.hist(bins=100, alpha=0.5, color="r", density=True, ax=sp1)
sp2 = f.add_subplot(2,2,2)
values.plot(kind="kde", ax=sp2)
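If the goal is simply two panels next to each other, a one-row grid from plt.subplots is another way to get there. In this sketch a box plot stands in for the KDE panel so that it runs without SciPy (an assumption about the reader's environment); values.plot(kind="kde", ax=axes[1]) can take its place otherwise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series(np.random.normal(0, 1, size=200))

# One row, two columns: the panels sit side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Pass each Axes explicitly so pandas draws on the intended panel
values.hist(bins=100, alpha=0.5, color="r", density=True, ax=axes[0])
values.plot(kind="box", ax=axes[1])
```

Call plt.show() afterwards to display the figure.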