I've created a scatter plot (actually two similar subplots) using matplotlib.pyplot which I'm using for stylometric text analysis. The code I'm using to make the plot is as follows:
import matplotlib.pyplot as plt
import numpy as np
clusters = 4
two_d_matrix = np.array([[0.00617068, -0.53451777], [-0.01837677, -0.47131886], ...])
my_labels = [0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
fig, (plot1, plot2) = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(20, 10))
plot1.axhline(0, color='#afafaf')
plot1.axvline(0, color='#afafaf')
for i in range(clusters):
    try:
        plot1.scatter(two_d_matrix[i:, 0], two_d_matrix[i:, 1], s=30, c=my_labels, cmap='viridis')
    except (KeyError, ValueError) as e:
        pass
plot1.legend(my_labels)
plot1.set_title("My First Plot")
plot2.axhline(0, color='#afafaf')
plot2.axvline(0, color='#afafaf')
for i in range(clusters):
    try:
        plot2.scatter(two_d_matrix[i:, 0], two_d_matrix[i:, 1], s=30, c=my_labels, cmap='viridis')
    except (KeyError, ValueError) as e:
        pass
plot2.legend(my_labels)
plot2.set_title("My Second Plot")
plt.show()
Because there are four distinct values in my_labels, four colours appear on the plot. These should correspond to the four clusters I expected to find.
The problem is that the legend only has three values, corresponding to the first three values in my_labels. It also appears that the legend isn't displaying a key for each colour, but for each of the axes and then for one of the colours. This means that the colours appearing in the plot are not matched to what appears in the legend, so the legend is inaccurate. I have no idea why this is happening.
Ideally, the legend should display one colour for each unique value in my_labels, so it should look like this:
How can I get the legend to accurately display all the values it should be showing, i.e. one for each colour which appears in the plot?
Before calling plot1.legend or plot2.legend, pass label=None to plot1.axhline and plot1.axvline (and likewise to plot2.axhline and plot2.axvline). This keeps the reference lines unlabelled, so they do not interfere with the legend entries for the scatter points.
To get a legend entry for each category of scatter points, call plot1.scatter (or plot2.scatter) once per category, passing label and selecting only the rows of two_d_matrix whose index matches the positions of that label in my_labels.
You can do it as follows:
import matplotlib.pyplot as plt
import numpy as np
# Generate some (pseudo) random data which is reproducible
generator = np.random.default_rng(seed=121)
matrix = generator.uniform(size=(40, 2))
matrix = np.sort(matrix)
clusters = 4
my_labels = np.array([0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
fig, ax = plt.subplots(1, 1)
# Select data points wisely
for i in range(clusters):
    pos = np.where(my_labels == i)
    ax.scatter(matrix[pos, 0], matrix[pos, 1], s=30, cmap='viridis', label=i)
ax.axhline(0, color='#afafaf', label=None)
ax.axvline(0, color='#afafaf', label=None)
ax.legend()
ax.set_title("Expected output")
plt.show()
This gives:
Comparison of current output and expected output
Observe how the selection of data points (done inside the for loops in the code below) affects the output:
Code:
import matplotlib.pyplot as plt
import numpy as np
# Generate some (pseudo) random data which is reproducible
generator = np.random.default_rng(seed=121)
matrix = generator.uniform(size=(40, 2))
matrix = np.sort(matrix)
clusters = 4
my_labels = np.array([0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
fig, ax = plt.subplots(1, 2)
# Question plot
for i in range(clusters):
    ax[0].scatter(matrix[i:, 0], matrix[i:, 1], s=30, cmap='viridis', label=i)
ax[0].axhline(0, color='#afafaf', label=None)
ax[0].axvline(0, color='#afafaf', label=None)
ax[0].legend()
ax[0].set_title("Current output (with label = None)")
# Answer plot
for i in range(clusters):
    pos = np.where(my_labels == i)  # <- choose index of data points based on label position in my_labels
    ax[1].scatter(matrix[pos, 0], matrix[pos, 1], s=30, cmap='viridis', label=i)
ax[1].axhline(0, color='#afafaf', label=None)
ax[1].axvline(0, color='#afafaf', label=None)
ax[1].legend()
ax[1].set_title("Expected output")
plt.show()
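As an alternative sketch, a single scatter call coloured by c=my_labels can build the legend itself through PathCollection.legend_elements (available in matplotlib 3.1 and later). The data below is synthetic and the "Cluster" legend title is only an illustrative choice, not something from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, reproducible data (assumption: stands in for two_d_matrix)
generator = np.random.default_rng(seed=121)
matrix = generator.uniform(size=(40, 2))
# 40 labels: one 0, one 1, five 2s and thirty-three 3s
my_labels = np.array([0, 1] + [2] * 5 + [3] * 33)

fig, ax = plt.subplots()
# One scatter call; the colour of each point comes from its label
points = ax.scatter(matrix[:, 0], matrix[:, 1], s=30, c=my_labels, cmap="viridis")

# legend_elements() returns one handle and one text label per unique value in c
handles, labels = points.legend_elements()
ax.legend(handles, labels, title="Cluster")
# plt.show()  # uncomment to display the figure interactively
```

Because the handles are derived from the same colour mapping as the points, the legend colours cannot drift out of sync with the plot.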
Is there a built-in way to add a line connecting scatter points with the same y-val?
Currently have this:
x1 = [6, 11, 7, 13, 6, 7.5]
x2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6]
y = [2, 10, 2, 14, 9, 10]
df = pd.DataFrame(data=zip(x1, x2, y), columns=["x1", "x2", "y"])
fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")
Want this:
You can pair the points (x1[i],y[i]) and (x2[i],y[i]) iteratively by using the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x1 = [6, 11, 7, 13, 6, 7.5]
x2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6]
y = [2, 10, 2, 14, 9, 10]
df = pd.DataFrame(data=zip(x1, x2, y), columns=["x1", "x2", "y"])
fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")
for i in range(len(y)):
    plt.plot([x1[i],x2[i]],[y[i],y[i]])
plt.show()
The output of this code gives:
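As a loop-free variant, the connecting segments can be drawn in one call with Axes.hlines, restricted to the rows where both x1 and x2 are present. This is only a sketch of the same pairing idea; the name paired and the black line colour are choices made here, not part of the original answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "x1": [6, 11, 7, 13, 6, 7.5],
    "x2": [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6],
    "y": [2, 10, 2, 14, 9, 10],
})

# Keep only the rows that have both endpoints
paired = df.dropna(subset=["x1", "x2"])

fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")

# One horizontal segment per paired row, from x1 to x2 at height y
segments = ax.hlines(paired["y"], paired["x1"], paired["x2"], colors="k")
# plt.show()  # uncomment to display the figure interactively
```

Like the loop version, this pairs x1[i] with x2[i] row by row; it does not connect different rows that merely happen to share a y value.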
I have two arrays:
l1 = [2,1,5,6,1,2,4,5,6,1]
l2 = [7,8,4,1,3,4,8]
I want to plot a seaborn violinplot with different color per list (a separate violin for l1 and l2).
Is there a way to do so without creating a dataframe and pd.melt from them?
You can pass the parameters to sns.violinplot() without providing a dataframe, as follows:
import seaborn as sns
l1 = [2, 1, 5, 6, 1, 2, 4]
l2 = [7, 8, 4, 1, 3, 4, 8]
flag = [0, 1, 1, 1, 0, 0, 1]
sns.violinplot(y=l1 + l2, x=["l1"]*len(l1) + ["l2"]*len(l2),
               hue=flag + flag, palette=['crimson', 'cornflowerblue'])
To only use the l1 vs l2 information:
import seaborn as sns
l1 = [2, 1, 5, 6, 1, 2, 4]
l2 = [7, 8, 4, 1, 3, 4, 8]
sns.violinplot(y=l1 + l2, x=["l1"] * len(l1) + ["l2"] * len(l2), palette=['tomato', 'cornflowerblue'])
I want to create a stacked barplot using Seaborn with this MultiIndex DataFrame:
header = pd.MultiIndex.from_product([['#'],
                                     ['TE', 'SS', 'M', 'MR']])
dat = ([[100, 20, 21, 35], [100, 12, 5, 15]])
df = pd.DataFrame(dat, index=['JC', 'TTo'], columns=header)
df = df.stack()
df = df.sort_values('#', ascending=False).sort_index(level=0, sort_remaining=False)
The code I'm using for the plot is:
fontP = FontProperties()
fontP.set_size('medium')
colors = {'TE': 'green', 'SS': 'blue', 'M': 'yellow', 'MR': 'red'}
kwargs = {'alpha':0.5}
plt.figure(figsize=(12, 9))
sns.barplot(x=df2.index.get_level_values(0).unique(),
            y=df2.loc[pd.IndexSlice[:, df2.index[0]], '#'],
            color=colors[df2.index[0][1]], **kwargs)
sns.barplot(x=df2.index.get_level_values(0).unique(),
            y=df2.loc[pd.IndexSlice[:, df2.index[1]], '#'],
            color=colors[df2.index[1][1]], **kwargs)
sns.barplot(x=df2.index.get_level_values(0).unique(),
            y=df2.loc[pd.IndexSlice[:, df2.index[2]], '#'],
            color=colors[df2.index[2][1]], **kwargs)
bottom_plot = sns.barplot(x=df2.index.get_level_values(0).unique(),
                          y=df2.loc[pd.IndexSlice[:, df2.index[3]], '#'],
                          color=colors[df2.index[3][1]], **kwargs)
bar1 = plt.Rectangle((0, 0), 1, 1, fc='green', edgecolor="None")
bar2 = plt.Rectangle((0, 0), 0, 0, fc='yellow', edgecolor="None")
bar3 = plt.Rectangle((0, 0), 2, 2, fc='red', edgecolor="None")
bar4 = plt.Rectangle((0, 0), 3, 3, fc='blue', edgecolor="None")
l = plt.legend([bar1, bar2, bar3, bar4],
               ["TE", "M", 'MR', 'SS'],
               bbox_to_anchor=(0.95, 1),
               loc='upper left',
               prop=fontP)
l.draw_frame(False)
sns.despine()
bottom_plot.set_ylabel("#")
axes = plt.gca()
axes.yaxis.grid()
And I get:
My problem is the order of the colors in the second bar ('TTo'). I want the colors to be selected automatically from the level 1 index value (['TE', 'SS', 'M', 'MR']) so that they are ordered correctly: the highest value, with its corresponding color, at the back, then the next highest value and its color in front of it, and so on, as the first bar ('JC') shows.
Maybe there is a simpler way to do this in Seaborn than the one I'm using...
I'm not sure how to create such a plot with seaborn. Here is a way to create it with a loop through the rows and adding one matplotlib bar at each step:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.set()
header = pd.MultiIndex.from_product([['#'],
                                     ['TE', 'SS', 'M', 'MR']])
dat = ([[100, 20, 21, 35], [100, 12, 5, 15]])
df = pd.DataFrame(dat, index=['JC', 'TTo'], columns=header)
df = df.stack()
df = df.sort_values('#', ascending=False).sort_index(level=0, sort_remaining=False)
colors = {'TE': 'green', 'SS': 'blue', 'M': 'yellow', 'MR': 'red'}
prev_index0 = None
for (index0, index1), quantity in df.itertuples():
    if index0 != prev_index0:
        bottom = 0  # a new bar starts: reset the stacking offset
    plt.bar(index0, quantity, fc=colors[index1], ec='none', bottom=bottom, label=index1)
    bottom += quantity
    prev_index0 = index0
legend_handles = [plt.Rectangle((0, 0), 0, 0, color=colors[c], label=c) for c in colors]
plt.legend(handles=legend_handles)
plt.show()
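The running bottom used in the loop above can also be computed per bar with a cumulative sum. The sketch below uses a plain nested dict instead of the MultiIndex DataFrame (an assumption made to keep it self-contained) and sorts each group from largest to smallest before stacking; the bottoms dict exists only to make the offsets inspectable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same numbers as the question, held in a plain dict (assumption for brevity)
data = {
    "JC": {"TE": 100, "SS": 20, "M": 21, "MR": 35},
    "TTo": {"TE": 100, "SS": 12, "M": 5, "MR": 15},
}
colors = {"TE": "green", "SS": "blue", "M": "yellow", "MR": "red"}

fig, ax = plt.subplots()
bottoms = {}  # per group: category -> height at which its segment starts
for group, values in data.items():
    # Largest value first, so it ends up at the bottom of the stack
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    heights = np.array([v for _, v in ordered])
    # Each segment starts where the previous ones end
    starts = np.concatenate(([0], np.cumsum(heights)[:-1]))
    bottoms[group] = dict(zip((k for k, _ in ordered), starts.tolist()))
    for (name, height), start in zip(ordered, starts):
        ax.bar(group, height, bottom=start, fc=colors[name], ec="none")

# Proxy rectangles give one legend entry per category
handles = [plt.Rectangle((0, 0), 0, 0, color=colors[c], label=c) for c in colors]
ax.legend(handles=handles)
# plt.show()  # uncomment to display the figure interactively
```

For 'JC' this stacks TE, MR, M, SS starting at 0, 100, 135 and 156; for 'TTo' the order becomes TE, MR, SS, M starting at 0, 100, 115 and 127, with each color following its category rather than its position.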
To plot the bars back to front without stacking, the code can be simplified:
colors = {'TE': 'forestgreen', 'SS': 'cornflowerblue', 'M': 'gold', 'MR': 'crimson'}
for (index0, index1), quantity in df.itertuples():
    plt.bar(index0, quantity, fc=colors[index1], ec='none', label=index1)
legend_handles = [plt.Rectangle((0, 0), 0, 0, color=colors[c], label=c, ec='black') for c in colors]
plt.legend(handles=legend_handles, bbox_to_anchor=(1.02, 1.02), loc='upper left')
plt.tight_layout()
Given a set of 2d data points with coordinates x and y (left picture), is there an easy way to construct a triangular mesh on top of it (right picture)? i.e. return a list of tuples that indicates which vertices are connected. The solution is not unique, but any reasonable mesh would suffice.
You can use scipy.spatial.Delaunay. Here is an example:
import numpy as np
points = np.array([[-1,1],[-1.3, .6],[0,0],[.2,.8],[1,.85],[-.1,-.4],[.4,-.15],[.6,-.6],[.9,-.2]])
from scipy.spatial import Delaunay
tri = Delaunay(points)
import matplotlib.pyplot as plt
plt.triplot(points[:,0], points[:,1], tri.simplices)
plt.plot(points[:,0], points[:,1], 'o')
plt.show()
Here is the result on an input similar to yours:
The triangles are stored in the simplices attribute of the Delaunay object; each row holds three indices into the coordinates stored in the points attribute:
>>> tri.points
array([[-1. , 1. ],
[-1.3 , 0.6 ],
[ 0. , 0. ],
[ 0.2 , 0.8 ],
[ 1. , 0.85],
[-0.1 , -0.4 ],
[ 0.4 , -0.15],
[ 0.6 , -0.6 ],
[ 0.9 , -0.2 ]])
>>> tri.simplices
array([[5, 2, 1],
[0, 3, 4],
[2, 0, 1],
[3, 0, 2],
[8, 6, 7],
[6, 5, 7],
[5, 6, 2],
[6, 3, 2],
[3, 6, 4],
[6, 8, 4]], dtype=int32)
If you are looking for which vertices are connected, there is an attribute containing that info also:
>>> tri.vertex_neighbor_vertices
(array([ 0, 4, 7, 12, 16, 20, 24, 30, 33, 36], dtype=int32), array([3, 4, 2, 1, 5, 2, 0, 5, 1, 0, 3, 6, 0, 4, 2, 6, 0, 3, 6, 8, 2, 1,
6, 7, 8, 7, 5, 2, 3, 4, 8, 6, 5, 6, 7, 4], dtype=int32))
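The question asks for a list of tuples of connected vertices, so here is a minimal sketch that unpacks vertex_neighbor_vertices into such a list. The attribute is a pair (indptr, indices): the neighbours of vertex k are indices[indptr[k]:indptr[k+1]]. The variable name edges is chosen here for illustration.

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[-1, 1], [-1.3, .6], [0, 0], [.2, .8], [1, .85],
                   [-.1, -.4], [.4, -.15], [.6, -.6], [.9, -.2]])
tri = Delaunay(points)

# Neighbours of vertex k are indices[indptr[k]:indptr[k + 1]]
indptr, indices = tri.vertex_neighbor_vertices

edges = set()
for vertex in range(len(points)):
    for neighbour in indices[indptr[vertex]:indptr[vertex + 1]]:
        # Store each undirected edge once, smaller index first
        edges.add((min(vertex, int(neighbour)), max(vertex, int(neighbour))))
edges = sorted(edges)
print(edges)
```

Each edge shows up twice in the raw arrays (once from each endpoint), which is why the indices array above has 36 entries for 18 distinct edges.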
You can try scipy.spatial.Delaunay. From that link:
import numpy as np
import matplotlib.pyplot as plt
points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
from scipy.spatial import Delaunay
tri = Delaunay(points)
plt.triplot(points[:,0], points[:,1], tri.simplices)
plt.plot(points[:,0], points[:,1], 'o')
plt.show()
Output:
I think Delaunay gives something closer to a convex hull. In OP's picture, A is not connected to C; it is connected to B, which is connected to C, and that gives a different shape.
One solution could be to run Delaunay first and then remove triangles that have an angle above a certain threshold, e.g. 90 or 100 degrees (the code below uses 110). Preliminary code could look like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Delaunay
points = [[101, 357], [198, 327], [316, 334], [ 58, 299], [162, 258], [217, 240], [310, 236], [153, 207], [257, 163]]
points = np.array(points)
tri = Delaunay(points, furthest_site=False)
newsimp = []
for t in tri.simplices:
    A, B, C = points[t[0]], points[t[1]], points[t[2]]
    # Angle at vertex A
    e1 = B - A; e2 = C - A
    num = np.dot(e1, e2)
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    d1 = np.rad2deg(np.arccos(num / denom))
    # Angle at vertex B
    e1 = C - B; e2 = A - B
    num = np.dot(e1, e2)
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    d2 = np.rad2deg(np.arccos(num / denom))
    # The three angles of a triangle sum to 180 degrees
    d3 = 180 - d1 - d2
    degs = np.array([d1, d2, d3])
    # Skip triangles with any angle above 110 degrees
    if np.any(degs > 110):
        continue
    newsimp.append(t)
plt.triplot(points[:, 0], points[:, 1], newsimp)
plt.show()
which gives the shape seen above. For more complicated shapes removing large sides could be necessary too,
res = []
for t in tri.simplices:
    ...
    n1 = np.linalg.norm(e1); n2 = np.linalg.norm(e2)
    ...
    res.append([n1, n2, d1, d2, d3])
res = np.array(res)
m = res[:, [0, 1]].mean() * res[:, [0, 1]].std()
# Keep triangles with no angle above 110 degrees and both measured sides shorter than m
mask = ~np.any(res[:, [2, 3, 4]] > 110, axis=1) & (res[:, 0] < m) & (res[:, 1] < m)
plt.triplot(points[:, 0], points[:, 1], tri.simplices[mask])
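For reference, a self-contained sketch of the same filtering idea, vectorised with the law of cosines. The helper name filter_triangles and its max_edge parameter are hypothetical; in particular, a fixed max_edge length is an assumption that replaces the mean-times-std threshold used above, so treat it as one possible variant rather than the method described in this answer.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Delaunay


def filter_triangles(points, max_angle=110.0, max_edge=None):
    """Hypothetical helper: Delaunay triangles with no angle above max_angle
    (degrees) and, if max_edge is given, no side longer than max_edge."""
    tri = Delaunay(points)
    corners = points[tri.simplices]  # shape (n_triangles, 3, 2)
    # Side lengths; side a is opposite corner 0, b opposite 1, c opposite 2
    a = np.linalg.norm(corners[:, 1] - corners[:, 2], axis=1)
    b = np.linalg.norm(corners[:, 0] - corners[:, 2], axis=1)
    c = np.linalg.norm(corners[:, 0] - corners[:, 1], axis=1)
    sides = np.stack([a, b, c], axis=1)
    # Law of cosines: the angle opposite each side
    cosines = np.stack([
        (b**2 + c**2 - a**2) / (2 * b * c),
        (a**2 + c**2 - b**2) / (2 * a * c),
        (a**2 + b**2 - c**2) / (2 * a * b),
    ], axis=1)
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    keep = angles.max(axis=1) <= max_angle
    if max_edge is not None:
        keep &= sides.max(axis=1) <= max_edge
    return tri.simplices[keep]


points = np.array([[101, 357], [198, 327], [316, 334], [58, 299], [162, 258],
                   [217, 240], [310, 236], [153, 207], [257, 163]], dtype=float)
kept = filter_triangles(points, max_angle=110.0)

fig, ax = plt.subplots()
if len(kept):  # triplot needs at least one triangle
    ax.triplot(points[:, 0], points[:, 1], kept)
ax.plot(points[:, 0], points[:, 1], "o")
# plt.show()  # uncomment to display the figure interactively
```

Passing max_edge as well, e.g. filter_triangles(points, max_angle=110.0, max_edge=150.0), additionally drops long, thin triangles; the value 150.0 is only an example and would need tuning to the data.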