Matplotlib: scatter plot with half-filled markers

Question: Using a scatter plot in matplotlib, is there a simple way to get a half-filled marker?
I know half-filled markers can easily be done using a line plot, but I would like to use 'scatter' because I want to use marker size and color (i.e., alternate marker face color) to represent other data. (I believe this will be easier with a scatter plot since I want to automate making a large number of plots from a large data set.)
I can't seem to make half-filled markers properly using a scatter plot. That is to say, instead of a half-filled marker, the plot shows only half of a marker. I've been using matplotlib.markers.MarkerStyle, but that seems to get me only halfway there. I get the following output using the code below.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
plt.scatter(1, 1, marker=MarkerStyle('o', fillstyle='full'), edgecolors='k', s=500)
plt.scatter(2, 2, marker=MarkerStyle('o', fillstyle='left'), edgecolors='k', s=500)
plt.scatter(3, 3, marker=MarkerStyle('o', fillstyle='right'), edgecolors='k', s=500)
plt.scatter(4, 4, marker=MarkerStyle('o', fillstyle='top'), edgecolors='k', s=500)
plt.scatter(5, 5, marker=MarkerStyle('o', fillstyle='bottom'), edgecolors='k', s=500)
plt.show()

As mentioned in the comments, I don't see why you have to use plt.scatter but if you want to, you can fake a combined marker:
from matplotlib.markers import MarkerStyle
from matplotlib import pyplot as plt
#data generation
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"X": np.random.randint(1, 20, n),
                   "Y": np.random.randint(10, 30, n),
                   "S": np.random.randint(50, 500, n),
                   "C1": np.random.choice(["red", "blue", "green"], n),
                   "C2": np.random.choice(["yellow", "grey"], n)})
fig, ax = plt.subplots()
ax.scatter(df.X, df.Y, s=df.S, c=df.C1, edgecolor="black", marker=MarkerStyle("o", fillstyle="right"))
ax.scatter(df.X, df.Y, s=df.S, c=df.C2, edgecolor="black", marker=MarkerStyle("o", fillstyle="left"))
plt.show()
This works, of course, also for continuous data:
from matplotlib import pyplot as plt
from matplotlib.markers import MarkerStyle
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"X": np.random.randint(1, 20, n),
                   "Y": np.random.randint(10, 30, n),
                   "S": np.random.randint(100, 1000, n),
                   "C1": np.random.randint(1, 100, n),
                   "C2": np.random.random(n)})
fig, ax = plt.subplots(figsize=(10,8))
im1 = ax.scatter(df.X, df.Y, s=df.S, c=df.C1, edgecolor="black", marker=MarkerStyle("o", fillstyle="right"), cmap="autumn")
im2 = ax.scatter(df.X, df.Y, s=df.S, c=df.C2, edgecolor="black", marker=MarkerStyle("o", fillstyle="left"), cmap="winter")
cbar1 = plt.colorbar(im1, ax=ax)
cbar1.set_label("right half", rotation=90)
cbar2 = plt.colorbar(im2, ax=ax)
cbar2.set_label("left half", rotation=90)
plt.show()
Keep in mind, though, that plt.plot with marker definitions might be faster for large datasets: the plot function is faster than scatter when the markers don't vary in size or color.
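For the case where all markers share one size and one pair of colors, a minimal sketch of the plt.plot route could look like this. The data and the color choices are made up for illustration; fillstyle, markerfacecolor and markerfacecoloralt are the Line2D properties that control the two halves.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, only for illustration
rng = np.random.default_rng(123)
x = rng.integers(1, 20, 10)
y = rng.integers(10, 30, 10)

fig, ax = plt.subplots()
# linestyle="none" draws markers only. fillstyle="left" fills the left half
# with markerfacecolor and the other half with markerfacecoloralt.
(line,) = ax.plot(
    x, y,
    linestyle="none",
    marker="o",
    markersize=15,
    fillstyle="left",
    markerfacecolor="tab:red",
    markerfacecoloralt="yellow",
    markeredgecolor="black",
)
```

Call plt.show() or fig.savefig(...) afterwards to see the result. Since every marker gets the same size and colors, this does not replace the two-scatter trick when size or color should encode data.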

Related

How to plot stats.probplot in a grid?

I have a data frame with four columns. I would like to plot the normality test for each column in a 2x2 grid, but only one plot is drawn and the other panels stay empty.
import random
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2,2, figsize=(15, 6), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.001)
data = {'col1': [random.randrange(1, 50, 1) for i in range(1000)], 'col2': [random.randrange(1, 50, 1) for i in range(1000)],'col3':[random.randrange(1, 50, 1) for i in range(1000)]
,'col4':[random.randrange(1, 50, 1) for i in range(1000)]}
df = pd.DataFrame(data)
for ax, d in zip(axs.ravel(), df):
    ax=stats.probplot(df[d], plot=plt)
    #ax.set_title(str(d))
plt.show()
Is there a way to construct the subplots and the stats.probplot calls within a loop?
In your code, pass each Axes to probplot (plot=ax instead of plot=plt) and do not reassign ax, so the for loop becomes:
for ax, d in zip(axs.ravel(), df):
    stats.probplot(df[d], plot=ax)
    #ax.set_title(str(d))
plt.show()
I hope this will help you move on.
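For completeness, here is a self-contained sketch of the corrected loop with the per-panel titles enabled. The random data is a hypothetical stand-in for the four columns, and the figure size is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: four columns of 1000 random integers in [1, 50)
data = {f"col{i}": rng.integers(1, 50, 1000) for i in range(1, 5)}

fig, axs = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, values) in zip(axs.ravel(), data.items()):
    stats.probplot(values, plot=ax)  # draw on this Axes, not on pyplot
    ax.set_title(name)               # replaces the default "Probability Plot" title
fig.tight_layout()
```

Each panel then holds the sample points and the fitted line for one column. Call plt.show() to display the grid.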

Align bar and line plot on x axis without the use of rank and pointplot

Please note: I've looked at other, similar questions, and my problem is different and not a duplicate!
I would like to have two plots with the same x axis in matplotlib. I thought this could be achieved via constrained_layout, but apparently this is not the case. Here is some example code.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as grd
x = np.arange(0, 30, 0.001)
df_line = pd.DataFrame({"x": x, "y": np.sin(x)})
df_bar = pd.DataFrame({
    "x_bar": [1, 7, 10, 20, 30],
    "y_bar": [0.0, 0.3, 0.4, 0.1, 0.2]
})
fig = plt.subplots(constrained_layout=True)
gs = grd.GridSpec(2, 1, height_ratios=[3, 2], wspace=0.1)
ax1 = plt.subplot(gs[0])
sns.lineplot(data=df_line, x=df_line["x"], y=df_line["y"], ax=ax1)
ax1.set_xlabel("time", fontsize="22")
ax1.set_ylabel("y values", fontsize="22")
plt.yticks(fontsize=16)
plt.xticks(fontsize=16)
plt.setp(ax1.get_legend().get_texts(), fontsize="22")
ax2 = plt.subplot(gs[1])
sns.barplot(data=df_bar, x="x_bar", y="y_bar", ax=ax2)
ax2.set_xlabel("time", fontsize="22")
ax2.set_ylabel("y values", fontsize="22")
plt.yticks(fontsize=16)
plt.xticks(fontsize=16)
This leads to the following figure.
However, I would like the corresponding x values of both plots to be aligned. How can I achieve this? Note that I've tried the approach from a related question, but it doesn't fully apply to my situation. First, with the high number of x points (which I need in reality), a point plot makes the picture too big and slow to load. On top of that, I can't use the rank method because my categories for the bar plot are not evenly distributed: they are specific points on the x axis that should be aligned with the corresponding points on the line plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(0, 30, 0.001)
df_line = pd.DataFrame({"x": x, "y": np.sin(x)})
df_bar = pd.DataFrame({
    "x_bar": [1, 7, 10, 20, 30],
    "y_bar": [0.0, 0.3, 0.4, 0.1, 0.2]
})
fig, (ax1, ax2) = plt.subplots(2,1)
ax1.plot(df_line['x'], df_line['y'])
# axvline takes ymin/ymax as fractions of the axes height (0 to 1), not data units
for i in range(len(df_bar['x_bar'])):
    ax2.axvline(x=df_bar['x_bar'][i], ymin=0, ymax=df_bar['y_bar'][i])
---edit---
I incorporated @mozway's advice on the line width:
lw = (300/ax1.get_xlim()[1])
ax2.axvline(x=df_bar['x_bar'][i], ymin=0, ymax=df_bar['y_bar'][i], solid_capstyle='butt', lw=lw)
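A different route from the axvline approach, sketched here only as an alternative, is to draw the bars with matplotlib's own ax.bar, which places bars at numeric x positions, and to link the two panels with sharex=True. The bar width of 0.5 is an arbitrary choice in data units.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 30, 0.001)
x_bar = np.array([1, 7, 10, 20, 30])
y_bar = np.array([0.0, 0.3, 0.4, 0.1, 0.2])

# sharex=True ties the x-limits of both panels together
fig, (ax1, ax2) = plt.subplots(
    2, 1, sharex=True, constrained_layout=True,
    gridspec_kw={"height_ratios": [3, 2]},
)
ax1.plot(x, np.sin(x))
ax1.set_ylabel("y values")

# ax.bar uses numeric x positions (sns.barplot treats x as categorical);
# the width is given in data units and is an arbitrary choice here.
ax2.bar(x_bar, y_bar, width=0.5)
ax2.set_xlabel("time")
ax2.set_ylabel("y values")
```

Because the x axis is shared, zooming or changing the limits on one panel moves the other one too, and the bar heights are in data units rather than axes fractions.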

Seaborn: annotate missing values on the heatmap

I am plotting a heatmap in Python with the seaborn library. The dataframe contains some missing values (NaN). I would like the heatmap cells corresponding to these fields to be white (the default) and also annotated with the string NA. However, as far as I can tell, annotation does not work with missing values. Is there any workaround?
My code:
sns.heatmap(
    df,
    ax=ax[0, 0],
    cbar=False,
    annot=annot_df,
    fmt="",
    annot_kws={"size": annot_size, "va": "center_baseline"},
    cmap="coolwarm",
    linewidth=0.5,
    linecolor="black",
    vmin=-max_value,
    vmax=max_value,
    xticklabels=True,
    yticklabels=True,
)
An idea is to draw another heatmap, with a transparent color and with only values where the original dataframe is NaN. To control the axis labels, the "real" heatmap should be drawn last. Note that the color for the NaN cells is the background color of the plot.
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
data = np.where(np.random.rand(7, 10) < 0.2, np.nan, np.random.rand(7, 10) * 2 - 1)
df = pd.DataFrame(data)
annot_df = df.applymap(lambda f: f'{f:.1f}')
fig, ax = plt.subplots(squeeze=False)
sns.heatmap(
    np.where(df.isna(), 0, np.nan),
    ax=ax[0, 0],
    cbar=False,
    annot=np.full_like(df, "NA", dtype=object),
    fmt="",
    annot_kws={"size": 10, "va": "center_baseline", "color": "black"},
    cmap=ListedColormap(['none']),
    linewidth=0)
sns.heatmap(
    df,
    ax=ax[0, 0],
    cbar=False,
    annot=annot_df,
    fmt="",
    annot_kws={"size": 10, "va": "center_baseline"},
    cmap="coolwarm",
    linewidth=0.5,
    linecolor="black",
    vmin=-1,
    vmax=1,
    xticklabels=True,
    yticklabels=True)
plt.show()
PS: To explicitly color the 'NA' cells, e.g. cmap=ListedColormap(['yellow']) could be used.
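Another possibility, sketched here under the assumption that a single heatmap is preferred, is to draw the heatmap once and then write the "NA" labels with ax.text. A seaborn heatmap cell in row r and column c covers the square from (c, r) to (c+1, r+1), so its centre lies at (c+0.5, r+0.5). The data below is random and only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
values = rng.uniform(-1, 1, size=(7, 10))
values[rng.random((7, 10)) < 0.2] = np.nan  # blank out roughly 20% of the cells
df = pd.DataFrame(values)

fig, ax = plt.subplots()
sns.heatmap(df, ax=ax, cbar=False, annot=True, fmt=".1f", cmap="coolwarm",
            linewidths=0.5, linecolor="black", vmin=-1, vmax=1)

# Write "NA" at the centre of every cell whose value is missing
rows, cols = np.where(df.isna().to_numpy())
na_labels = [
    ax.text(c + 0.5, r + 0.5, "NA", ha="center", va="center", color="black")
    for r, c in zip(rows, cols)
]
```

The NaN cells keep the background color of the plot, as in the two-heatmap version.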

How to plot lines linking medians of multiple violin distributions in seaborn?

I am struggling to plot a dotted line between the median values (and the min and max) for each type of stacked violin distribution.
I tried superimposing a seaborn.lineplot on the violin plot, but it failed. I'm not sure that with this approach I can draw dotted lines and also link the min and max of distributions of the same type. I also tried using seaborn.lineplot alone, but there the challenge is to plot the min and max of the distribution at each x-axis value.
Here is an example dataset and the code for the violin plot in seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x=[0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8]
cate=['a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c']
y=[1.1,1.12,1.13,1.13,3.1,3.12,3.13,3.13,5.1,5.12,5.13,5.13,2.2,2.22,2.25,2.23,4.2,4.22,4.25,4.23,6.2,6.22,6.25,6.23,2.2,2.22,2.24,2.23,4.2,4.22,4.24,4.23,6.2,6.22,6.24,6.23,1.1,1.13,1.14,1.12,3.1,3.13,3.14,3.12,5.1,5.13,5.14,5.12]
my_pal =['red','green', 'purple']
df = pd.DataFrame({'x': x, 'Type': cate, 'y': y})
ax=sns.catplot(y='y', x='x',data=df, hue='Type', palette=my_pal, kind="violin",dodge =False)
sns.lineplot(y='y', x='x',data=df, hue='Type', palette=my_pal, ci=100,legend=False)
plt.show()
But it plots the lines only over a small part at the left of the plot. Is there a trick to superimpose a line plot on a violin plot?
For the line plot, 'x' is considered numerical. However, for the violin plot 'x' is considered categorical (positioned at 0, 1, 2, ...).
A solution is to convert 'x' to strings to have both plots consider it as categorical.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
my_pal = ['red', 'green', 'purple']
N = 40
df = pd.DataFrame({'x': np.random.randint(1, 6, N*3) * 0.2,
                   'y': np.random.uniform(0, 1, N*3) + np.tile([2, 4, 6], N),
                   'Type': np.tile(list('abc'), N)})
df['x'] = [f'{x:.1f}' for x in df['x']]
ax = sns.violinplot(y='y', x='x', data=df, hue='Type', palette=my_pal, dodge=False)
ax = sns.lineplot(y='y', x='x', data=df, hue='Type', palette=my_pal, ci=100, legend=False, ax=ax)
ax.margins(0.15) # slightly more padding for x and y axis
ax.legend(bbox_to_anchor=(1.01, 1), loc='upper left')
plt.tight_layout()
plt.show()
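If the lines should link exactly the median, min and max per type (rather than the mean with an interval band that lineplot draws), one option is to compute those statistics with a pandas groupby and draw them with ax.plot at the violin positions 0, 1, 2, ... The sketch below uses made-up data; the names x_levels, offsets and summary are only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
x_levels = ["0.2", "0.4", "0.6", "0.8"]  # x as strings, so it is categorical
types = ["a", "b", "c"]
offsets = {"a": 2, "b": 4, "c": 6}       # hypothetical vertical offset per type
rows = [(x, t, offsets[t] + rng.uniform(0, 1))
        for x in x_levels for t in types for _ in range(10)]
df = pd.DataFrame(rows, columns=["x", "Type", "y"])

palette = {"a": "red", "b": "green", "c": "purple"}
fig, ax = plt.subplots()
sns.violinplot(data=df, x="x", y="y", hue="Type", palette=palette,
               dodge=False, order=x_levels, ax=ax)

# Summary statistics per (Type, x); the violins sit at positions 0, 1, 2, ...
summary = df.groupby(["Type", "x"])["y"].agg(["min", "median", "max"])
positions = np.arange(len(x_levels))
for t in types:
    stats_t = summary.loc[t].reindex(x_levels)
    ax.plot(positions, stats_t["median"], ":", marker="o", color="black")
    ax.plot(positions, stats_t["min"], ":", color=palette[t])
    ax.plot(positions, stats_t["max"], ":", color=palette[t])
```

The medians are drawn in black so that they stay visible on top of the violin bodies.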

How to plot a kernel density estimate in a seaborn scatterplot

I would like to plot the same as shown in the picture (but only the red part). The curve is a kernel density estimate based only on the x-values (the y-values are irrelevant and are actually all 1, 2 or 3; they are plotted like this only to distinguish between red and blue). I have plotted the scatterplot, but how can I include the kernel density curve on it? (The black dotted lines in the curve are just the quartiles and the median.)
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MaxNLocator
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KernelDensity
%matplotlib inline
# Change plotting style to ggplot
plt.style.use('ggplot')
from matplotlib.font_manager import FontProperties
X_plot = np.linspace(0, 30, 1000)[:, np.newaxis]
X1 = df[df['Zustandsklasse']==1]['Verweildauer'].values.reshape(-1,1)
X2 = df[df['Zustandsklasse']==2]['Verweildauer'].values.reshape(-1,1)
X3 = df[df['Zustandsklasse']==3]['Verweildauer'].values.reshape(-1,1)
#print(X1)
ax=sns.scatterplot(x="Verweildauer", y="CS_bandwith", data=df, legend="full", alpha=1)
kde=KernelDensity(kernel='gaussian').fit(X1)
log_dens = kde.score_samples(X_plot)
ax.plot(X_plot[:,0], np.exp(log_dens), color ="blue", linestyle="-", label="Gaussian Kernel")
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.invert_yaxis()
plt.ylim(5.5, .5)
ax.set_ylabel("Zustandsklasse")
ax.set_xlabel("Verweildauer in Jahren")
handles, labels = ax.get_legend_handles_labels()
# create the legend again skipping this first entry
leg = ax.legend(handles[1:], labels[1:], loc="lower right", ncol=2, facecolor='silver', fontsize= 7)
ax.set_xticks(np.arange(0, 30, 5))
ax2 = ax.twinx()
#get the ticks at the same heights as the left axis
ax2.set_ylim(ax.get_ylim())
s=[(df["Zustandsklasse"] == t).sum() for t in range(1, 6)]
s.insert(0, 0)
print(s)
ax2.set_yticklabels(s)
ax2.set_ylim(ax.get_ylim())
ax2.set_ylabel("Anzahl Beobachtungen")
ax2.grid(False)
#plt.tight_layout()
plt.show()
Plotting target
What is plotted with the code above
It's much easier if you use subplots. Here is an example with seaborn's Titanic dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
titanic = sns.load_dataset('titanic')
fig, ax = plt.subplots(nrows=3, sharex=True)
ax[2].set_xlabel('Age')
for i in [1, 2, 3]:
    age_i = titanic[titanic['pclass'] == i]['age']
    ax[i-1].scatter(age_i, [0] * len(age_i))
    # fill=True replaces shade=True, which is deprecated since seaborn 0.11
    sns.kdeplot(age_i, ax=ax[i-1], fill=True, legend=False)
    ax[i-1].set_yticks([])
    ax[i-1].set_ylim(-0.01)
    ax[i-1].set_ylabel('Class ' + str(i))
plt.show()
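The question also mentions dotted lines for the quartiles and the median. A rough sketch of that part for a single group, using scipy.stats.gaussian_kde instead of seaborn so that the curve can be evaluated at the quartiles, might look like this. The gamma-distributed sample is a hypothetical stand-in for one group's x-values.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical stand-in for one group's x-values (e.g. durations in years)
x = rng.gamma(shape=4.0, scale=2.0, size=200)

kde = gaussian_kde(x)                       # default bandwidth (Scott's rule)
grid = np.linspace(0, x.max(), 500)
density = kde(grid)
quartiles = np.percentile(x, [25, 50, 75])  # Q1, median, Q3

fig, ax = plt.subplots()
ax.scatter(x, np.zeros_like(x), s=10)       # observations along the baseline
ax.fill_between(grid, density, alpha=0.3)   # density estimate above them
# Dotted vertical segments from the baseline up to the curve
ax.vlines(quartiles, 0, kde(quartiles), colors="black", linestyles=":")
ax.set_yticks([])
ax.set_xlabel("x")
```

Repeating this per group on its own subplot row, as in the Titanic example, gives one curve with quartile markers per class.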