Customizing legend with scatterplot - pandas

I am struggling with customizing the legend of my scatterplot. Here is a snapshot:
And here is a code sample:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
                     columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot("DUMMY_CT", "FOO_CT", data=my_df, size="CI_CT")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
Also, I work in a Jupyter-lab notebook with Python 3, if it helps.
The red thingy issue
First things first, I wish to hide the name of the CI_CT variable (outlined in red on the picture). After exploring the documentation all afternoon, I found the get_legend_handles_labels method (see here), which produces the following:
>>> g.get_legend_handles_labels()
([<matplotlib.collections.PathCollection at 0xfaaba4a8>,
  <matplotlib.collections.PathCollection at 0xfaa3ff28>,
  <matplotlib.collections.PathCollection at 0xfaa3f6a0>,
  <matplotlib.collections.PathCollection at 0xfaa3fe48>],
 ['CI_CT', '0', '1', '2'])
There I can spot my dear CI_CT string. However, I'm unable to change this name or hide it completely. I found a dirty way, which basically consists of not using the dataframe passed as the data parameter efficiently. Here is the scatterplot call:
g = sns.scatterplot("DUMMY_CT", "FOO_CT", data=my_df, size=my_df["CI_CT"].values)
Result here :
It works, but is there a cleaner way to achieve this?
The green thingy issue
Displaying a 0 level in this legend is incorrect, since there is no zero value in the CI_CT column of my_df. It is therefore misleading for readers, who might assume the smaller dots represent a value of 0 or 1. I wish to set up a defined scale, the way one can for the x and y axes. However, I cannot achieve it. Any idea?
TL;DR: A broader question that could solve everything
Those adventures make me wonder if there is a way to handle the data you can pass to the scatterplots with hue and size parameters in a clean, x-and-y-axis way. Is it actually possible?
Please pardon my English, and let me know if the question is too broad or incorrectly labelled.

The "green thing issue", namely that there is one more legend entry than there are sizes, is solved by specifying legend="full".
g = sns.scatterplot(..., legend="full")
The "red thing issue" is more tricky. The problem here is that seaborn misuses a normal legend label as a headline for the legend. An option is indeed to supply the values directly instead of the name of the column, to prevent seaborn from using that column name.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
                     columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot("DUMMY_CT", "FOO_CT", data=my_df, size=my_df["CI_CT"].values, legend="full")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
plt.show()
If you really must use the column name itself, a hacky solution is to crawl into the legend and remove the label you don't want.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
                     columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot("DUMMY_CT", "FOO_CT", data=my_df, size="CI_CT", legend="full")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
#Hack to remove the first legend entry (which is the undesired title)
vpacker = g.get_legend()._legend_handle_box.get_children()[0]
vpacker._children = vpacker.get_children()[1:]
plt.show()

I finally managed to get the result I wanted, but in an ugly way. It might be useful to someone, but I would not advise doing this.
The solution to fix the scale in the legend consists of shifting all the CI_CT column values into the negatives (to keep the order and the consistency of marker sizes). Then, the values displayed in the legend are corrected according to the previous data changes (inspiration from here).
However, I did not find any better way to make the "CI_CT" text disappear in the legend without leaving an atrociously huge blank space.
Here is the sample of code and the result.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]], columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
# Subtract the maximal value of CI_CT from each value
max_val = my_df["CI_CT"].agg("max")
my_df["CI_CT"] = my_df.apply(lambda x: x["CI_CT"] - max_val, axis=1)
# scatterplot declaration
g = sns.scatterplot("DUMMY_CT", "FOO_CT", data=my_df, size=my_df["CI_CT"].values)
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
# Correcting legend values
l = g.legend_
for t in l.texts:
    t.set_text(int(t.get_text()) + max_val)
# Restoring the DF
my_df["CI_CT"] = my_df.apply(lambda x: x["CI_CT"] + max_val, axis=1)
I'm still looking for a better way to achieve this.
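One idea still worth trying, sketched here on the assumption that the unwanted "CI_CT" entry is always the first handle/label pair returned by get_legend_handles_labels() (as in the output shown in the question): rebuild the legend without that entry.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
                     columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size="CI_CT", legend="full")
# Rebuild the legend from seaborn's handles/labels, dropping the first
# entry (the "CI_CT" pseudo-title), and set a proper title instead.
handles, labels = g.get_legend_handles_labels()
g.legend(handles[1:], labels[1:], title="Baz count")
plt.show()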

Related

How can I use matplotlib.pyplot to customize geopandas plots?

What is the difference between geopandas plots and matplotlib plots? Why are not all keywords available?
In geopandas there is markersize, but not markeredgecolor...
In the example below I plot a pandas df with some styling, then transform the pandas df to a geopandas df. Simple plotting works, but additional styling doesn't.
This is just an example. In my geopandas plots I would like to customize markers, legends, etc. How can I access the relevant matplotlib objects?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
df = pd.DataFrame(Y, X)
plt.plot(X,Y,linewidth = 3., color = 'k', markersize = 9, markeredgewidth = 1.5, markerfacecolor = '.75', markeredgecolor = 'k', marker = 'o', markevery = 32)
# alternatively:
# df.plot(linewidth = 3., color = 'k', markersize = 9, markeredgewidth = 1.5, markerfacecolor = '.75', markeredgecolor = 'k', marker = 'o', markevery = 32)
plt.show()
# create GeoDataFrame from df
df.reset_index(inplace=True)
df.rename(columns={'index': 'Y', 0: 'X'}, inplace=True)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Y'], df['X']))
gdf.plot(linewidth = 3., color = 'k', markersize = 9) # working
gdf.plot(linewidth = 3., color = 'k', markersize = 9, markeredgecolor = 'k') # not working
plt.show()
You're probably confused by the fact that both libraries named their method .plot(). In matplotlib that specifically translates to an mpl.lines.Line2D object, which also contains the markers and their styling.
Geopandas assumes you want to plot geographic data and uses a Path for this (mpl.collections.PathCollection). That has, for example, the face and edge colors, but no markers. The facecolor comes into play whenever your path closes and forms a polygon (your example doesn't, making it "just" a line).
Geopandas seems to use a bit of a trick for points/markers: it appears to draw a "path" using the "CURVE4" code (cubic Bézier).
You can explore what's happening if you capture the axes that geopandas returns:
ax = gdf.plot(...
Using ax.get_children() you'll get all artists that have been added to the axes. Since this is a simple plot, it's easy to see that the PathCollection is the actual data; the other artists are drawing the axis/spines etc.
[<matplotlib.collections.PathCollection at 0x1c05d5879d0>,
<matplotlib.spines.Spine at 0x1c05d43c5b0>,
<matplotlib.spines.Spine at 0x1c05d43c4f0>,
<matplotlib.spines.Spine at 0x1c05d43c9d0>,
<matplotlib.spines.Spine at 0x1c05d43f1c0>,
<matplotlib.axis.XAxis at 0x1c05d036590>,
<matplotlib.axis.YAxis at 0x1c05d43ea10>,
Text(0.5, 1.0, ''),
Text(0.0, 1.0, ''),
Text(1.0, 1.0, ''),
<matplotlib.patches.Rectangle at 0x1c05d351b10>]
If you reduce the number of points a lot, e.g. use 5 instead of 1024, retrieving the Paths drawn shows the coordinates and also the codes used:
pcoll = ax.get_children()[0] # the first artist is the PathCollection
path = pcoll.get_paths()[0] # it only contains 1 Path
print(path.codes) # show the codes used.
# array([ 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
# 4, 4, 4, 4, 4, 4, 4, 4, 79], dtype=uint8)
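Those numeric codes map to the constants defined on matplotlib.path.Path; a quick sanity check (not part of the original exploration):
from matplotlib.path import Path
# 1 = MOVETO, 4 = CURVE4 (cubic Bézier), 79 = CLOSEPOLY
print(int(Path.MOVETO), int(Path.CURVE4), int(Path.CLOSEPOLY))  # 1 4 79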
Some more info about how these paths work can be found at:
https://matplotlib.org/stable/tutorials/advanced/path_tutorial.html
So long story short, you do have all the same keywords as when using Matplotlib, but they're the keywords for Paths (collections) and not the Line2D object that you might expect.
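In practice that means using the collection-level names. For the point GeoDataFrame from the question, something along these lines should work (a hedged sketch; exact keyword handling may vary a bit between geopandas versions):
import matplotlib.pyplot as plt
# Collection-style keywords instead of Line2D's marker* keywords:
# markeredgecolor -> edgecolor, markerfacecolor -> color/facecolor,
# markeredgewidth -> linewidth; markersize is scatter's "s" (an area).
ax = gdf.plot(color='.75', edgecolor='k', linewidth=1.5, markersize=81)
plt.show()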
You can always flip the order around: start with a Matplotlib figure/axes created by you, and pass that axes to Geopandas when you want to plot something. That might make it easier or more intuitive when you (also) want to plot other things in the same axes. It does perhaps require a bit more discipline to make sure the (spatial) coordinates etc. match.
I personally almost always do that, because it allows me to do most of the plotting with the same Matplotlib APIs, which admittedly have perhaps a slightly steeper learning curve. But overall I find it easier than dealing with the slightly different interpretation of every package that uses Matplotlib under the hood (e.g. geopandas, seaborn, xarray). But that really depends on where you're coming from.
Thank you for your detailed answer. Based on this I came up with this simplified code from my real project.
I have a shapefile shp and some point data df which I want to plot. shp is plotted with geopandas, df with matplotlib.pyplot. No need to transfer the point data into a geodataframe gdf as I did initially.
# read marker data (places with coordinates)
df = pd.read_csv("../obese_pct_by_place.csv")
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['sweref99_lng'], df['sweref99_lat']))
# read shapefile
shp = gpd.read_file("../../SWEREF_Shapefiles/KommunSweref99TM/Kommun_Sweref99TM_region.shp")
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_aspect('equal')
shp.plot(ax=ax)
# plot obesity markers
# geopandas, no edgecolor here
# gdf.plot(ax=ax, marker='o', c='r', markersize=gdf['obese'] * 25)
# matplotlib.pyplot with edgecolor
plt.scatter(df['sweref99_lng'], df['sweref99_lat'], c='r', edgecolor='k', s=df['obese'] * 25)
plt.show()

How to start Seaborn Logarithmic Barplot at y=1

I have a problem figuring out how to have Seaborn show the right values in a logarithmic barplot. A value of mine should, in the ideal case, be 1. My data series (5, 2, 1, 0.5, 0.2) has a set of values that deviate from unity, and I want to visualize these in a logarithmic barplot. However, when plotting this in the standard log-barplot it shows the following:
But the values under one are shown to increase from -infinity to their value, whilst the real values ought to look like this:
Strangely enough, I was unable to find a Seaborn, Pandas or Matplotlib attribute to "snap" the bars to a different horizontal axis, or an "align" or ymin/ymax option that does this. I have a feeling I am unable to find it because I can't find the right terms to shove down my favorite search engine. Some semi-solutions I found just did not match what I was looking for, or lacked either the baseline at 1 or a log y-axis. A try that uses some janky Matplotlib lines:
If someone knows the right terms or a solution, thank you in advance.
Here are the Jupyter cells I used:
{1}
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
data = {'X': ['A','B','C','D','E'], 'Y': [5,2,1,0.5,0.2]}
df = pd.DataFrame(data)
{2}
%matplotlib widget
g = sns.catplot(data=df, kind="bar", y = "Y", x = "X", log = True)
{3}
%matplotlib widget
plt.vlines(x=data['X'], ymin=1, ymax=data['Y'])
You could let the bars start at 1 instead of at 0. You'll need to use sns.barplot directly.
The example code subtracts 1 from all y-values and sets the bar bottom at 1.
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import seaborn as sns
import pandas as pd
import numpy as np
data = {'X': ['A', 'B', 'C', 'D', 'E'], 'Y': [5, 2, 1, 0.5, 0.2]}
df = pd.DataFrame(data)
ax = sns.barplot(y=df["Y"] - 1, x=df["X"], bottom=1, log=True, palette='flare_r')
ax.axhline(y=1, c='k')
# change the y-ticks, as the default shows too few in this case
ax.set_yticks(np.append(np.arange(.2, .8, .1), np.arange(1, 7, 1)), minor=False)
ax.set_yticks(np.arange(.3, 6, .1), minor=True)
ax.yaxis.set_major_formatter(lambda x, pos: f'{x:.0f}' if x >= 1 else f'{x:.1f}')
ax.yaxis.set_minor_formatter(NullFormatter())
ax.bar_label(ax.containers[0], labels=df["Y"])
sns.despine()
plt.show()
PS: With these specific values, the plot might go without logscale:
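For reference, the linear-scale variant is just the plain barplot; a minimal sketch, assuming the same df as above:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'X': ['A', 'B', 'C', 'D', 'E'], 'Y': [5, 2, 1, 0.5, 0.2]})
ax = sns.barplot(data=df, x="X", y="Y")
ax.bar_label(ax.containers[0])  # annotate the bars with their values
sns.despine()
plt.show()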

squared-off line plot matplotlib

How do I generate a line graph in Matplotlib where lines connecting the data points are only vertical and horizontal, not diagonal, giving a "blocky" look?
Note that this is sometimes called zero order extrapolation.
MWE
import matplotlib.pyplot as plt
x = [1, 3, 5, 7]
y = [2, 0, 4, 1]
plt.plot(x, y)
This gives:
and I want:
I think you are looking for plt.step; the matplotlib gallery has some examples.
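A minimal sketch with the data from the question (whether you want where='pre', 'post' or 'mid' depends on exactly which blocky look you're after):
import matplotlib.pyplot as plt
x = [1, 3, 5, 7]
y = [2, 0, 4, 1]
# Draw only horizontal/vertical segments; each y value is held until the next x.
plt.step(x, y, where='post')
plt.show()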

matplotlib collection linewidth mapping?

I'm creating some GIS-style plots in matplotlib of road networks and the like, so I'm using LineCollection to store and represent all of the roads and color them accordingly. This is working fine; I color the roads based on a criterion with the following map:
from matplotlib.colors import ListedColormap,BoundaryNorm
from matplotlib.collections import LineCollection
cmap = ListedColormap(['grey','blue','green','yellow','orange','red','black'])
norm = BoundaryNorm([0,0.5,0.75,0.9,0.95,1.0,1.5,100],cmap.N)
roads = LineCollection(road_segments, array=ratios, cmap=cmap, norm=norm)
axes.add_collection(roads)
This works fine. However, I would really like to have the linewidths defined in a similar manner to the color map, ranging from 0.5 to 5 for each color.
Does anyone know of a clever way of doing this?
The linewidths keyword.
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
axes = plt.axes()
roads = LineCollection(
    [[[0, 0], [1, 1]],
     [[0, 1], [1, 0]]],
    colors=['black', 'red'],
    linewidths=[3, 8],
)
axes.add_collection(roads)
plt.show()
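If the widths should follow the same bins as the colors, as the question asks, one way is to bin the ratios yourself and look up one width per bin. A sketch, assuming the road_segments, ratios and axes objects from the question:
import numpy as np
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib.collections import LineCollection
cmap = ListedColormap(['grey', 'blue', 'green', 'yellow', 'orange', 'red', 'black'])
bounds = [0, 0.5, 0.75, 0.9, 0.95, 1.0, 1.5, 100]
norm = BoundaryNorm(bounds, cmap.N)
# One width per color bin, from 0.5 to 5.
width_per_bin = np.linspace(0.5, 5, cmap.N)
# np.digitize returns 1-based bin indices for each ratio.
bin_idx = np.clip(np.digitize(ratios, bounds) - 1, 0, cmap.N - 1)
roads = LineCollection(road_segments, array=ratios, cmap=cmap, norm=norm,
                       linewidths=width_per_bin[bin_idx])
axes.add_collection(roads)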
HTH

How do I assign multiple labels at once in matplotlib?

I have the following dataset:
x = [0, 1, 2, 3, 4]
y = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [9, 8, 7, 6, 5]]
Now I plot it with:
import matplotlib.pyplot as plt
plt.plot(x, y)
However, I want to label the 3 y-datasets with this command, which raises an error when .legend() is called:
lineObjects = plt.plot(x, y, label=['foo', 'bar', 'baz'])
plt.legend()
File "./plot_nmos.py", line 33, in <module>
plt.legend()
...
AttributeError: 'list' object has no attribute 'startswith'
When I inspect the lineObjects:
>>> lineObjects[0].get_label()
['foo', 'bar', 'baz']
>>> lineObjects[1].get_label()
['foo', 'bar', 'baz']
>>> lineObjects[2].get_label()
['foo', 'bar', 'baz']
Question
Is there an elegant way to assign multiple labels by just using the .plot() method?
You can iterate over your line objects list, so labels are individually assigned. An example with the built-in python iter function:
lineObjects = plt.plot(x, y)
plt.legend(iter(lineObjects), ('foo', 'bar', 'baz'))
Edit: after updating to matplotlib 1.1.1, it looks like plt.plot(x, y), with y as a list of lists (as provided by the author of the question), doesn't work anymore. One-step plotting without iterating over the y arrays is still possible, though, after passing y as a numpy.array (assuming numpy (http://numpy.scipy.org/) has been previously imported).
In this case, use plt.plot(x, y) (if the data in the 2D y array are arranged as columns [axis 1]) or plt.plot(x, y.transpose()) (if the data in the 2D y array are arranged as rows [axis 0]).
Edit 2: as pointed out by @pelson (see the comment below), the iter function is unnecessary and a simple plt.legend(lineObjects, ('foo', 'bar', 'baz')) works perfectly.
It is not possible to plot those two arrays against each other directly (at least with version 1.1.1), therefore you must loop over your y arrays. My advice would be to loop over the labels at the same time:
import matplotlib.pyplot as plt
x = [0, 1, 2, 3, 4]
y = [ [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [9, 8, 7, 6, 5] ]
labels = ['foo', 'bar', 'baz']
for y_arr, label in zip(y, labels):
    plt.plot(x, y_arr, label=label)
plt.legend()
plt.show()
Edit: @gcalmettes pointed out that, as numpy arrays, it is possible to plot all the lines at the same time (by transposing them). See @gcalmettes' answer & comments for details.
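A minimal sketch of that transpose approach, assuming the same x, y and labels as in the question:
import numpy as np
import matplotlib.pyplot as plt
x = [0, 1, 2, 3, 4]
y = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [9, 8, 7, 6, 5]]
# np.array(y) has one row per line; transposing gives one column per line,
# which is what plt.plot expects for 2D y data.
line_objects = plt.plot(x, np.array(y).T)
plt.legend(line_objects, ('foo', 'bar', 'baz'))
plt.show()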
I came across the same problem and found a solution that is quite easy! Hopefully it's not too late for you. No iterator, just assign your result to a structure...
from numpy import *
from matplotlib.pyplot import *
from numpy.random import *
a = rand(4,4)
a
>>> array([[ 0.33562406,  0.96967617,  0.69730654,  0.46542408],
           [ 0.85707323,  0.37398595,  0.82455736,  0.72127002],
           [ 0.19530943,  0.4376796 ,  0.62653007,  0.77490795],
           [ 0.97362944,  0.42720348,  0.45379479,  0.75714877]])
[b,c,d,e] = plot(a)
legend([b,c,d,e], ["b","c","d","e"], loc=1)
show()
Looks like this:
The best current solution is:
lineObjects = plt.plot(x, y) # y describes 3 lines
plt.legend(['foo', 'bar', 'baz'])
You can give the labels while plotting the curves:
import pylab as plt
x = [0, 1, 2, 3, 4]
y = [ [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [9, 8, 7, 6, 5] ]
labels=['foo', 'bar', 'baz']
colors=['r','g','b']
# loop over data, labels and colors
for i in range(len(y)):
    plt.plot(x, y[i], 'o-', color=colors[i], label=labels[i])
plt.legend()
plt.show()
Assigning multiple legend labels at once when plotting a numpy matrix, one per column
I would like to answer this question based on plotting a matrix that has two columns.
Say you have a 2-column matrix Ret; then one may use this code to assign multiple labels at once:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
pd.DataFrame(Ret).plot()
plt.xlabel('time')
plt.ylabel('Return')
plt.legend(['Bond Ret','Equity Ret'], loc=0)
plt.show()
I hope this helps
This problem comes up for me often when I have a single set of x values and multiple y values in the columns of an array. I really don't want to plot the data in a loop, and multiple calls to ax.legend/plt.legend are not really an option, since I want to plot other stuff, usually in an equally annoying format.
Unfortunately, plt.setp is not helpful here. In newer versions of matplotlib, it just converts your entire list/tuple into a string, and assigns the whole thing as a label to all the lines.
I've therefore made a utility function to wrap calls to ax.plot/plt.plot in:
def set_labels(artists, labels):
    for artist, label in zip(artists, labels):
        artist.set_label(label)
You can call it with something like:
x = np.arange(5)
y = np.random.randint(10, size=(5, 3))
fig, ax = plt.subplots()
set_labels(ax.plot(x, y), 'ABC')
This way you get to specify all your normal artist parameters to plot, without having to see the loop in your code. An alternative is to put the whole call to plot into a utility that just unpacks the labels, but that would require a lot of duplication to figure out how to parse multiple datasets, possibly with different numbers of columns, and spread out across multiple arguments, keyword or otherwise.
I used the following to show labels for a dataframe without using the dataframe plot:
lines_ = plt.plot(df)
plt.legend(lines_, df.columns)  # df.columns holds the labels
If you're using a DataFrame, you can also iterate over the columns of the data you want to plot:
# Plot figure
fig, ax = plt.subplots(figsize=(5, 5))
# Data
data = data  # your DataFrame
# Plot one line per column, labelled with the column name
for i in data.columns:
    _ = ax.plot(data[i], label=i)
_ = ax.legend()
plt.show()