Matplotlib: add line to connect related scatter points

Is there a built-in way to add a line connecting scatter points with the same y-val?
Currently have this:
x1 = [6, 11, 7, 13, 6, 7.5]
x2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6]
y = [2, 10, 2, 14, 9, 10]
df = pd.DataFrame(data=zip(x1, x2, y), columns=["x1", "x2", "y"])
fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")
Want this:

You can pair the points (x1[i],y[i]) and (x2[i],y[i]) iteratively by using the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x1 = [6, 11, 7, 13, 6, 7.5]
x2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6]
y = [2, 10, 2, 14, 9, 10]
df = pd.DataFrame(data=zip(x1, x2, y), columns=["x1", "x2", "y"])
fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")
# Draw one horizontal segment per row; rows where x2 is NaN produce no visible line
for i in range(len(y)):
    plt.plot([x1[i], x2[i]], [y[i], y[i]])
plt.show()
The output of this code gives:
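As a possible built-in alternative to the loop, Axes.hlines draws one horizontal segment per (y, xmin, xmax) triple in a single call. The sketch below assumes the goal is to connect x1 and x2 only on rows where both values are present. The `pairs` and `segments` names and the Agg backend line are illustrative additions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x1 = [6, 11, 7, 13, 6, 7.5]
x2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 8.6]
y = [2, 10, 2, 14, 9, 10]
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

fig, ax = plt.subplots()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(df["x1"], df["y"], c="k")
ax.scatter(df["x2"], df["y"], edgecolor="k", facecolors="none")

# Keep only the rows that have both endpoints (assumption: unpaired points get no line)
pairs = df.dropna(subset=["x1", "x2"])

# One call draws every connecting segment; zorder=0 keeps the lines behind the markers
segments = ax.hlines(pairs["y"], pairs["x1"], pairs["x2"], colors="k", zorder=0)
```

Because hlines returns a single LineCollection, the connecting lines can be restyled or removed together, which may be more convenient than tracking one Line2D per loop iteration.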


How can I create a legend for my scatter plot which matches the colours used in the plot?

I've created a scatter plot (actually two similar subplots) using matplotlib.pyplot which I'm using for stylometric text analysis. The code I'm using to make the plot is as follows:
import matplotlib.pyplot as plt
import numpy as np
clusters = 4
two_d_matrix = np.array([[0.00617068, -0.53451777], [-0.01837677, -0.47131886], ...])
my_labels = [0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
fig, (plot1, plot2) = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(20, 10))
plot1.axhline(0, color='#afafaf')
plot1.axvline(0, color='#afafaf')
for i in range(clusters):
    try:
        plot1.scatter(two_d_matrix[i:, 0], two_d_matrix[i:, 1], s=30, c=my_labels, cmap='viridis')
    except (KeyError, ValueError) as e:
        pass
plot1.legend(my_labels)
plot1.set_title("My First Plot")
plot2.axhline(0, color='#afafaf')
plot2.axvline(0, color='#afafaf')
for i in range(clusters):
    try:
        plot2.scatter(two_d_matrix[i:, 0], two_d_matrix[i:, 1], s=30, c=my_labels, cmap='viridis')
    except (KeyError, ValueError) as e:
        pass
plot2.legend(my_labels)
plot2.set_title("My Second Plot")
plt.show()
Because there are four distinct values in my_labels, four colours appear on the plot; these should correspond to the four clusters I expected to find.
The problem is that the legend only has three values, corresponding to the first three values in my_labels. It also appears that the legend isn't displaying a key for each colour, but for each of the axes and then for one of the colours. This means that the colours appearing in the plot are not matched to what appears in the legend, so the legend is inaccurate. I have no idea why this is happening.
Ideally, the legend should display one colour for each unique value in my_labels, so it should look like this:
How can I get the legend to accurately display all the values it should be showing, i.e. one for each colour which appears in the plot?
Before calling plot1.legend or plot2.legend, pass label=None to plot1.axhline and plot1.axvline (and likewise to plot2.axhline and plot2.axvline). This keeps the reference lines out of the legend, so they do not interfere with the legend entries for the scatter points.
To get a legend entry for every category of scatter points, call plot1.scatter (or plot2.scatter) once per label, passing label=... and selecting only the rows of two_d_matrix whose positions match the positions of that label in my_labels. Then call legend() with no arguments so that it picks up those labels.
You can do it as follows:
import matplotlib.pyplot as plt
import numpy as np
# Generate some (pseudo) random data which is reproducible
generator = np.random.default_rng(seed=121)
matrix = generator.uniform(size=(40, 2))
matrix = np.sort(matrix)
clusters = 4
my_labels = np.array([0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
fig, ax = plt.subplots(1, 1)
# Select data points wisely
for i in range(clusters):
    pos = np.where(my_labels == i)  # positions of the points that belong to cluster i
    ax.scatter(matrix[pos, 0], matrix[pos, 1], s=30, cmap='viridis', label=i)
ax.axhline(0, color='#afafaf', label=None)
ax.axvline(0, color='#afafaf', label=None)
ax.legend()
ax.set_title("Expected output")
plt.show()
This gives:
Comparison of current output and expected output
Observe how the selection of data points (done inside the for loops in the code below) affects the output:
Code:
import matplotlib.pyplot as plt
import numpy as np
# Generate some (pseudo) random data which is reproducible
generator = np.random.default_rng(seed=121)
matrix = generator.uniform(size=(40, 2))
matrix = np.sort(matrix)
clusters = 4
my_labels = np.array([0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
fig, ax = plt.subplots(1, 2)
# Question plot
for i in range(clusters):
    ax[0].scatter(matrix[i:, 0], matrix[i:, 1], s=30, cmap='viridis', label=i)
ax[0].axhline(0, color='#afafaf', label=None)
ax[0].axvline(0, color='#afafaf', label=None)
ax[0].legend()
ax[0].set_title("Current output (with label = None)")
# Answer plot
for i in range(clusters):
    pos = np.where(my_labels == i)  # <- choose index of data points based on label position in my_labels
    ax[1].scatter(matrix[pos, 0], matrix[pos, 1], s=30, cmap='viridis', label=i)
ax[1].axhline(0, color='#afafaf', label=None)
ax[1].axvline(0, color='#afafaf', label=None)
ax[1].legend()
ax[1].set_title("Expected output")
plt.show()
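Another option, sketched here under the assumption that Matplotlib 3.1 or later is installed, is to draw all points in a single scatter call coloured by my_labels and let PathCollection.legend_elements build one legend handle per distinct value. The random `points` array stands in for two_d_matrix, and my_labels is rebuilt with the same 40-entry layout assumed in the answer above; both are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data (assumption): 40 reproducible random points and one cluster label per point
rng = np.random.default_rng(seed=121)
points = rng.uniform(size=(40, 2))
my_labels = np.array([0, 1] + [2] * 5 + [3] * 33)

fig, ax = plt.subplots()
ax.axhline(0, color="#afafaf")
ax.axvline(0, color="#afafaf")

# A single scatter call: the colour of each point comes from its label
sc = ax.scatter(points[:, 0], points[:, 1], s=30, c=my_labels, cmap="viridis")

# legend_elements() returns one proxy handle and one text label per distinct colour value
handles, labels = sc.legend_elements()
ax.legend(handles, labels, title="cluster")
```

With this approach the legend colours come from the same colormap as the points, so they cannot drift out of sync with the plot. The per-label loop remains the better fit when each cluster needs its own marker or size.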

Plot rows of df on Plotly

The df is like this:
              X                   Y  Label
0  [16, 37, 38]  [7968, 4650, 3615]    0.7
1  [29, 37, 12]  [4321, 4650, 1223]    0.8
2  [12, 2, 445]  [1264, 3456, 2112]    0.9
This should plot three lines on the same plot with labels as continuous variables. What is the fastest & simplest way to plot it using plotly?
I am taking "This should plot three lines on the same plot" as the requirement (which is inconsistent with "I want subplots from each row of the df").
This is a simple case of creating one trace per row, using explode (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) to prepare x and y:
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame(
    {
        "X": [[16, 37, 38], [29, 37, 12], [12, 2, 445]],
        "Y": [[7968, 4650, 3615], [4321, 4650, 1223], [1264, 3456, 2112]],
        "Label": [0.7, 0.8, 0.9],
    }
)
go.Figure(
    [
        go.Scatter(
            x=r["X"].explode(), y=r["Y"].explode(), name=str(r["Label"].values[0])
        )
        for _, r in df.groupby(df.index)
    ]
)
The same plot, with a continuous color defined by the label:
import pandas as pd
import plotly.graph_objects as go
from plotly.colors import sample_colorscale
import plotly.express as px
df = pd.DataFrame(
    {
        "X": [[16, 37, 38], [29, 37, 12], [12, 2, 445]],
        "Y": [[7968, 4650, 3615], [4321, 4650, 1223], [1264, 3456, 2112]],
        "Label": [0.1, 0.5, 0.9],
    }
)
fig = px.scatter(x=[0], y=[0], color=[.5], color_continuous_scale="YlGnBu")
fig = fig.add_traces(
    [
        go.Scatter(
            x=r["X"].explode(),
            y=r["Y"].explode(),
            name=str(r["Label"].values[0]),
            line_color=sample_colorscale("YlGnBu", r["Label"].values[0])[0],
            showlegend=False
        )
        for _, r in df.groupby(df.index)
    ]
)
fig
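To make the explode step used above more concrete, the following pandas-only sketch flattens both list columns at once into a long-form frame with one row per point. It assumes pandas 1.3 or later (which accepts a list of columns in DataFrame.explode); the `long_df` name and the `row` column are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "X": [[16, 37, 38], [29, 37, 12], [12, 2, 445]],
        "Y": [[7968, 4650, 3615], [4321, 4650, 1223], [1264, 3456, 2112]],
        "Label": [0.7, 0.8, 0.9],
    }
)

# Explode X and Y together so the i-th x stays paired with the i-th y;
# the original row number is kept as a "row" column and Label is repeated per point
long_df = df.explode(["X", "Y"]).rename_axis("row").reset_index()

print(long_df)
```

Each original row becomes three rows here, so a long-form plotting call such as plotly.express.line, grouped by `row` or `Label`, could consume this frame directly instead of building the traces by hand.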

How to return a list into a dataframe based on matching index of other column

I have two data frames: one with a single column of NumPy array lists, and another with two columns. I am trying to use the elements in the first data frame (df) as index positions into df2 to look up the columns o1 and o2. I would appreciate any input. Please note that the string 'A1' appears twice in column 'o1' of df2; as you can see in my desired output data frame, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
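# dtype=object is required for ragged (unequal-length) rows in recent NumPy versions
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)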
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({ 'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1','A1','C1','F1']],
'o2': [[15, 18, 19], [19, 20, 8], [17,18,19,8]] })
# Please note: in the output, index 0 of df has elements 0 and 2, which both map to 'A1'; the output shows only one 'A1', with the duplicate removed.
I believe you can explode df, use the result to extract the matching rows from df2, and finally join back to df:
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
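Note that list(set(x)) does not guarantee any particular order, which is why the output above is ordered differently from the desired output. A possible variant, sketched here as an assumption rather than as part of the original answer, deduplicates with dict.fromkeys so that the first-seen order is kept; the `picked`, `dedup`, and `out` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]]})
df2 = pd.DataFrame(
    {
        "o1": ["A1", "B1", "A1", "C1", "D1", "E1", "F1"],
        "o2": [15, 17, 18, 19, 20, 7, 8],
    }
)

# One row per list element; the index still points back to the originating row of df
s = df["A"].explode().astype(int)

# Look up the df2 rows by position, then relabel them with the originating row of df
picked = df2.iloc[s.to_numpy()].set_axis(s.index)

# dict.fromkeys drops duplicates while keeping first-seen order (unlike set)
dedup = picked.groupby(level=0).agg(lambda col: list(dict.fromkeys(col)))

out = df.join(dedup)
print(out)
```

With the sample data this reproduces the ordering of the desired output, for example ['A1', 'C1'] and [15, 18, 19] for the first row.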

Use pandas cut function in Dask

How can I use pd.cut() in Dask?
Because of the large dataset, I am not able to put the whole dataset into memory before finishing the pd.cut().
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Groupby name and add column sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
#Groupby bins and change sum and count based on grouped rows
df = df.groupby(pd.cut(df['name'],
                       bins=[0,4,8,100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
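You're seeing this error because map_partitions passes each partition to pd.cut() as the first positional argument (x). The df['name'] you supply then lands in the second positional slot, which is bins, so the explicit bins=... keyword becomes a duplicate (see the docs).
You can wrap pd.cut() in a custom function and call that instead, like so: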
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    # Each partition is a pandas DataFrame; bin only its "name" column
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
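The argument collision described above can be reproduced with plain pandas, without Dask. In this minimal sketch, a small hand-built DataFrame stands in for one partition (an assumption for illustration); the first call mimics what map_partitions does with the original arguments, and the second uses the wrapper.

```python
import pandas as pd

bins = [0, 4, 8, 100]
labels = ["namebin1", "namebin2", "namebin3"]

# Stand-in for a single partition after the first groupby (illustrative data)
partition = pd.DataFrame({"name": [1, 5, 10], "sum": [5, 9, 8], "count": [3, 2, 1]})

# Mimic map_partitions(pd.cut, df['name'], bins=...): the partition becomes x,
# the Series fills the positional "bins" slot, and bins=... is then a duplicate
try:
    pd.cut(partition, partition["name"], bins=bins, labels=labels)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)


def custom_cut(part, bins, labels):
    # The wrapper receives the partition and passes only its "name" column to pd.cut
    return pd.cut(x=part["name"], bins=bins, labels=labels)


binned = custom_cut(partition, bins=bins, labels=labels)
print(binned.tolist())
```

The wrapper works because it controls which piece of the partition reaches pd.cut as x, leaving bins and labels to be passed once, by keyword.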

matplotlib advanced stacked bar

Matplotlib can plot bars in different layouts.
They can be regular (side by side), like http://matplotlib.org/examples/api/barchart_demo.html
Let's define this as [M, F]
They can be stacked, like http://matplotlib.org/examples/pylab_examples/bar_stacked.html
Let's define this as [M + F]
Now, how do I plot [M, F + other]?
If I understand you correctly, you want a stacked plot with more than two elements stacked? If so, that is pretty straightforward, following the example you posted:
#!/usr/bin/env python
# a stacked bar plot with errorbars
import numpy as np
import matplotlib.pyplot as plt
N = 5
menMeans = [20, 35, 30, 35, 27]
womenMeans = [25, 32, 34, 20, 25]
otherMeans = [5, 2, 4, 8, 5]
menStd = [2, 3, 4, 1, 2]
womenStd = [3, 5, 2, 3, 3]
otherStd = [1, 1, 1, 1, 1]
ind = np.arange(N) # the x locations for the groups
width = 0.35 # the width of the bars: can also be len(x) sequence
# Each series gets its own error bars (the linked example swaps menStd and womenStd)
p1 = plt.bar(ind, menMeans, width, color='r', yerr=menStd)
p2 = plt.bar(ind, womenMeans, width, color='y',
             bottom=menMeans, yerr=womenStd)
p3 = plt.bar(ind, otherMeans, width, color='b',
             bottom=[menMeans[j] + womenMeans[j] for j in range(len(menMeans))],
             yerr=otherStd)
plt.ylabel('Scores')
plt.title('Scores by group and gender')
# Bars are center-aligned by default since Matplotlib 2.0, so the ticks sit at ind
plt.xticks(ind, ('G1', 'G2', 'G3', 'G4', 'G5'))
plt.yticks(np.arange(0,81,10))
plt.legend( (p1[0], p2[0], p3[0]), ('Men', 'Women', 'Other') )
plt.show()
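If the [M, F + other] notation instead means a men bar placed next to a stacked women-plus-other bar, the two layouts can be combined by shifting the x positions and using bottom only for the stacked series. The following is a minimal sketch under that reading of the question; the variable names and the omission of error bars are choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

men_means = np.array([20, 35, 30, 35, 27])
women_means = np.array([25, 32, 34, 20, 25])
other_means = np.array([5, 2, 4, 8, 5])

ind = np.arange(len(men_means))  # the x locations for the groups
width = 0.35                     # the width of each bar

fig, ax = plt.subplots()

# M: a plain bar shifted to the left half of each group
p_men = ax.bar(ind - width / 2, men_means, width, color="r", label="Men")

# F + other: two bars sharing the right half, with "other" stacked on top of "women"
p_women = ax.bar(ind + width / 2, women_means, width, color="y", label="Women")
p_other = ax.bar(ind + width / 2, other_means, width, color="b",
                 bottom=women_means, label="Other")

ax.set_ylabel("Scores")
ax.set_title("Scores by group and gender")
ax.set_xticks(ind)
ax.set_xticklabels(["G1", "G2", "G3", "G4", "G5"])
ax.legend()
```

The same pattern extends to more stacked layers by accumulating the bottom values, for example bottom=women_means + other_means for a further series.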