Pandas: plot a dataframe with a rectangle on its right side colored according to an array's values

I have a dataframe with 100 rows and 4 columns, and an array of shape (100, 1) filled with values between 0 and 1. I would like to plot the dataframe with a rectangle on its right side whose color at each row depends on the array's value at that row (see the rough drawing I made; the array is written out to help show what I want). The colors should form a gradient, where 0 = dark blue and 1 = bright red.
I know how to create a colormap, but this is slightly different.
Which function would you advise me to use?
Here is some code I use for the plotting:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
rectangle_values = np.random.rand(100)
plt.figure(figsize=(15,15))
ax = sns.heatmap(df, cbar = None)

My solution would be to use plt.subplots to create two plots, with the width_ratios argument set to something like 19:1. On the left-hand side you plot the dataframe as usual; on the right-hand side you plot the vector. Note that I am using vmin and vmax to pin the color scale to the required range (0, 1) for the vector. For the requested colors I'm using Matplotlib's RdBu (red-blue) colormap, reversed so that low values are blue and high values are red. You can check the colors against the values: on this run the generated random values were [0.74, 0.96, 0.87, 0.50, 0.26].
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
rectangle_values = pd.DataFrame(np.random.rand(5), columns=['foo'])
# two axes side by side; the right one is a narrow strip (19:1)
fig, (ax_left, ax_right) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [19, 1]})
sns.heatmap(df, cbar=False, ax=ax_left)
# 'RdBu_r' is the reversed RdBu colormap: 0 -> blue, 1 -> red
sns.heatmap(rectangle_values, cbar=False, cmap='RdBu_r', vmin=0, vmax=1, ax=ax_right)
plt.show()
And the output is:
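RdBu passes through white in the middle, which may not be what "dark blue to bright red" is meant to look like. As a rough alternative sketch (not part of the answer above), a direct two-colour gradient can be built with LinearSegmentedColormap.from_list. The colormap name "blue_red", the figure size and the use of plain imshow instead of seaborn are assumptions made only to keep the example self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical two-colour gradient: 0 -> dark blue, 1 -> bright red
blue_red = LinearSegmentedColormap.from_list("blue_red", ["darkblue", "red"])

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(100, 4))   # stand-in for the dataframe values
rectangle_values = rng.random(100)           # one value in [0, 1) per row

fig, (ax_data, ax_strip) = plt.subplots(
    1, 2, figsize=(6, 8), gridspec_kw={"width_ratios": [19, 1]}
)
ax_data.imshow(data, aspect="auto", interpolation="nearest")

# The strip is a 100x1 image; vmin/vmax pin the colour scale to [0, 1]
strip = ax_strip.imshow(
    rectangle_values.reshape(-1, 1),
    cmap=blue_red, vmin=0, vmax=1,
    aspect="auto", interpolation="nearest",
)
ax_strip.set_xticks([])
ax_strip.set_yticks([])
```

Calling plt.show() afterwards displays the figure. The same blue_red object can be passed as cmap= to sns.heatmap if you prefer to stay with seaborn.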


How can I use matplotlib.pyplot to customize geopandas plots?

What is the difference between geopandas plots and matplotlib plots? Why are not all keywords available?
In geopandas there is markersize, but not markeredgecolor...
In the example below I plot a pandas df with some styling, then convert it to a geopandas GeoDataFrame. Simple plotting works, but the additional styling does not.
This is just an example. In my geopandas plots I would like to customize markers, legends, etc. How can I access the relevant matplotlib objects?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
df = pd.DataFrame(Y, X)
plt.plot(X,Y,linewidth = 3., color = 'k', markersize = 9, markeredgewidth = 1.5, markerfacecolor = '.75', markeredgecolor = 'k', marker = 'o', markevery = 32)
# alternatively:
# df.plot(linewidth = 3., color = 'k', markersize = 9, markeredgewidth = 1.5, markerfacecolor = '.75', markeredgecolor = 'k', marker = 'o', markevery = 32)
plt.show()
# create GeoDataFrame from df
df.reset_index(inplace=True)
df.rename(columns={'index': 'Y', 0: 'X'}, inplace=True)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Y'], df['X']))
gdf.plot(linewidth = 3., color = 'k', markersize = 9) # working
gdf.plot(linewidth = 3., color = 'k', markersize = 9, markeredgecolor = 'k') # not working
plt.show()
You're probably confused by the fact that both libraries named the method .plot(). In matplotlib that specifically translates to an mpl.lines.Line2D object, which also contains the markers and their styling.
Geopandas assumes you want to plot geographic data and uses a Path for this (mpl.collections.PathCollection). That has, for example, face and edge colors, but no markers. The facecolor comes into play whenever your path closes and forms a polygon (your example doesn't, making it "just" a line).
Geopandas seems to use a bit of a trick for points/markers: it appears to draw a "path" using the "CURVE4" code (cubic Bézier).
You can explore what's happening if you capture the axes that geopandas returns:
ax = gdf.plot(...
Using ax.get_children() you'll get all artists that have been added to the axes. Since this is a simple plot, it's easy to see that the PathCollection is the actual data; the other artists draw the axis, spines, etc.
[<matplotlib.collections.PathCollection at 0x1c05d5879d0>,
<matplotlib.spines.Spine at 0x1c05d43c5b0>,
<matplotlib.spines.Spine at 0x1c05d43c4f0>,
<matplotlib.spines.Spine at 0x1c05d43c9d0>,
<matplotlib.spines.Spine at 0x1c05d43f1c0>,
<matplotlib.axis.XAxis at 0x1c05d036590>,
<matplotlib.axis.YAxis at 0x1c05d43ea10>,
Text(0.5, 1.0, ''),
Text(0.0, 1.0, ''),
Text(1.0, 1.0, ''),
<matplotlib.patches.Rectangle at 0x1c05d351b10>]
If you reduce the number of points a lot, e.g. use 5 instead of 1024, retrieving the drawn Paths shows the coordinates and also the codes used:
pcoll = ax.get_children()[0] # the first artist is the PathCollection
path = pcoll.get_paths()[0] # it only contains 1 Path
print(path.codes) # show the codes used.
# array([ 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
# 4, 4, 4, 4, 4, 4, 4, 4, 79], dtype=uint8)
Some more info about how these paths work can be found at:
https://matplotlib.org/stable/tutorials/advanced/path_tutorial.html
So, long story short, you do have all the same keywords as when using Matplotlib, but they're the keywords for Paths and not for the Line2D object that you might expect.
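To make the Line2D versus PathCollection distinction concrete without needing geopandas, here is a small Matplotlib-only sketch. It assumes that a geopandas point plot behaves like ax.scatter in this respect (both produce a PathCollection), and the data and sizes are arbitrary illustration values.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PathCollection
from matplotlib.lines import Line2D
from matplotlib.path import Path

x = np.linspace(-6, 6, 32)
y = np.sinc(x)

fig, (ax_line, ax_pts) = plt.subplots(1, 2, figsize=(8, 3))

# Line2D: marker styling uses the marker* keywords
(line,) = ax_line.plot(
    x, y, color="k", marker="o", markersize=9,
    markerfacecolor=".75", markeredgecolor="k",
)

# PathCollection: the comparable styling uses facecolors/edgecolors/linewidths
points = ax_pts.scatter(
    x, y, s=81, facecolors=".75", edgecolors="k", linewidths=1.5,
)

# The marker itself is stored as a Path; its codes can be inspected directly
marker_codes = points.get_paths()[0].codes
starts_with_moveto = marker_codes[0] == Path.MOVETO
uses_bezier_curves = bool(np.any(marker_codes == Path.CURVE4))
```

Passing markeredgecolor to scatter would fail in the same way as in the question, while edgecolors works, which mirrors the geopandas behaviour described above.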
You can always flip the order around: start with a Matplotlib figure/axes that you create yourself, and pass that axes to Geopandas when you want to plot something. That might make it easier or more intuitive when you (also) want to plot other things in the same axes. It does perhaps require a bit more discipline to make sure the (spatial) coordinates etc. match.
I personally almost always do that, because it allows me to do most of the plotting using the same Matplotlib APIs, which admittedly have a slightly steeper learning curve. But overall I find it easier than having to deal with every package's slightly different interpretation of Matplotlib under the hood (e.g. geopandas, seaborn, xarray). But that really depends on where you're coming from.
Thank you for your detailed answer. Based on this I came up with this simplified code from my real project.
I have a shapefile shp and some point data df which I want to plot. shp is plotted with geopandas, df with matplotlib.pyplot. There is no need to convert the point data into a GeoDataFrame gdf as I did initially.
# read marker data (places with coordinates)
df = pd.read_csv("../obese_pct_by_place.csv")
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['sweref99_lng'], df['sweref99_lat']))
# read shapefile
shp = gpd.read_file("../../SWEREF_Shapefiles/KommunSweref99TM/Kommun_Sweref99TM_region.shp")
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_aspect('equal')
shp.plot(ax=ax)
# plot obesity markers
# geopandas, no edgecolor here
# gdf.plot(ax=ax, marker='o', c='r', markersize=gdf['obese'] * 25)
# matplotlib.pyplot with edgecolor
plt.scatter(df['sweref99_lng'], df['sweref99_lat'], c='r', edgecolor='k', s=df['obese'] * 25)
plt.show()

same colorbar/colormap for all subplots [duplicate]

I want to make 4 imshow subplots that all share the same colormap. Matplotlib automatically adjusts the colormap scale to the entries of each matrix. For example, if one of my matrices has all entries equal to 10 and another has all entries equal to 5, and I use the Greys colormap, then one subplot should be completely black and the other completely grey. But both of them end up completely black. How can I make all the subplots share the same scale on the colormap?
To get this right you need to have all the images with the same intensity scale, otherwise the colorbar() colours are meaningless. To do that, use the vmin and vmax arguments of imshow(), and make sure they are the same for all your images.
E.g., if the range of values you want to show goes from 0 to 10, you can use the following:
import matplotlib.pyplot as plt
import numpy as np
my_image1 = np.linspace(0, 10, 10000).reshape(100,100)
my_image2 = np.sqrt(my_image1.T) + 3
plt.subplot(1, 2, 1)
plt.imshow(my_image1, vmin=0, vmax=10, cmap='jet', aspect='auto')
plt.subplot(1, 2, 2)
plt.imshow(my_image2, vmin=0, vmax=10, cmap='jet', aspect='auto')
plt.colorbar()
When the ranges of the data sets (data1 and data2) are unknown and you want to use the same colour bar for both/all plots, find the overall minimum and maximum and use them as vmin and vmax in the calls to imshow:
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2)
# generate randomly populated arrays
data1 = np.random.rand(10,10)*10
data2 = np.random.rand(10,10)*10 -7.5
# find minimum of minima & maximum of maxima
minmin = np.min([np.min(data1), np.min(data2)])
maxmax = np.max([np.max(data1), np.max(data2)])
im1 = axes[0].imshow(data1, vmin=minmin, vmax=maxmax,
                     extent=(-5,5,-5,5), aspect='auto', cmap='viridis')
im2 = axes[1].imshow(data2, vmin=minmin, vmax=maxmax,
                     extent=(-5,5,-5,5), aspect='auto', cmap='viridis')
# add space for colour bar
fig.subplots_adjust(right=0.85)
cbar_ax = fig.add_axes([0.88, 0.15, 0.04, 0.7])
fig.colorbar(im2, cax=cbar_ax)
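The same idea extends to any number of subplots. The following is only a sketch under assumed inputs: the four random arrays, their scales and the 2x2 layout are made up for illustration, and the global limits are computed with plain min()/max() over the list.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Four arrays with deliberately different ranges (illustrative values only)
arrays = [
    rng.random((10, 10)) * scale - shift
    for scale, shift in [(10, 0), (10, 7.5), (4, 2), (1, 0)]
]

# Global limits across every array
vmin = min(a.min() for a in arrays)
vmax = max(a.max() for a in arrays)

fig, axes = plt.subplots(2, 2, figsize=(6, 5), constrained_layout=True)
images = [
    ax.imshow(a, vmin=vmin, vmax=vmax, cmap="viridis")
    for ax, a in zip(axes.flat, arrays)
]

# Any of the images can drive the colour bar, since all share the same limits
fig.colorbar(images[0], ax=axes, shrink=0.8)
```

Passing ax=axes lets Matplotlib take the space for the colour bar from all four subplots instead of positioning an extra axes by hand.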
It may be that you don't know the ranges of your data beforehand, but you do know that they are compatible. In that case, you may prefer to let matplotlib choose the range for the first plot and reuse it for the remaining plots. The key is to read the limits from the first image with properties()['clim']:
import numpy as np
import matplotlib.pyplot as plt
my_image1 = np.linspace(0, 10, 10000).reshape(100,100)
my_image2 = np.sqrt(my_image1.T) + 3
fig, axes = plt.subplots(nrows=1, ncols=2)
im = axes[0].imshow(my_image1)
clim=im.properties()['clim']
axes[1].imshow(my_image2, clim=clim)
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.5)
plt.show()
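Another way to express "same scale everywhere" is to share a single Normalize object between the images rather than repeating vmin and vmax. This sketch reuses the example from the question (one matrix of all 10s, one of all 5s, Greys colormap); the 0 to 10 range is an assumption chosen to cover both.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

all_tens = np.full((4, 4), 10.0)
all_fives = np.full((4, 4), 5.0)

# One norm object, assumed range 0..10, shared by every image
shared_norm = Normalize(vmin=0, vmax=10)

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
im_tens = axes[0].imshow(all_tens, cmap="Greys", norm=shared_norm)
im_fives = axes[1].imshow(all_fives, cmap="Greys", norm=shared_norm)
fig.colorbar(im_tens, ax=axes, shrink=0.8)

# The colours each constant matrix is drawn with
colour_tens = im_tens.cmap(shared_norm(10.0))    # top of the scale: black
colour_fives = im_fives.cmap(shared_norm(5.0))   # middle of the scale: grey
```

Because the limits are fixed on the norm, Matplotlib does not rescale each image to its own data, so the two panels no longer both come out black.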

Seaborn jointplot link x-axis to Matplotlib subplots

Is there a way to add additional subplots created with vanilla Matplotlib to (below) a Seaborn jointplot, sharing the x-axis? Ideally I'd like to control the ratio between the jointplot and the additional plots (similar to gridspec_kw={'height_ratios':[3, 1, 1]}).
I tried to fake it by tuning figsize in the Matplotlib subplots, but obviously it doesn't work well when the KDE curves in the marginal plot change. While I could manually resize the output PNG to shrink/grow one of the figures, I'd like to have everything aligned automatically.
I know this is tricky with the way the joint grid is set up, but maybe it is reasonably simple for someone fluent in the underpinnings of Seaborn.
Here is a minimal working example, but there are two separate figures:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Figure 1
diamonds = sns.load_dataset('diamonds')
g = sns.jointplot(
    data=diamonds,
    x="carat",
    y="price",
    hue="cut",
    xlim=(1, 2),
)
g.ax_marg_x.remove()
Figure 2
fig, (ax1, ax2) = plt.subplots(2,1,sharex=True)
ax1.scatter(x=diamonds["carat"], y=diamonds["depth"], color="gray", edgecolor="black")
ax1.set_xlim([1, 2])
ax1.set_ylabel("depth")
ax2.scatter(x=diamonds["carat"], y=diamonds["table"], color="gray", edgecolor="black")
ax2.set_xlabel("carat")
ax2.set_ylabel("table")
Desired output:
I think this is a case where setting up the figure using matplotlib functions is going to be better than working backwards from a seaborn figure layout that doesn't really match the use-case.
If you have a non-full subplot grid, you'll have to decide whether you want to (A) set up all the subplots and then remove the ones you don't want or (B) explicitly add each of the subplots you do want. Let's go with option A here.
df = sns.load_dataset("diamonds")  # assumed: df is the question's diamonds dataset
figsize = (6, 8)
gridspec_kw = dict(
    nrows=3, ncols=2,
    width_ratios=[5, 1],
    height_ratios=[4, 1, 1],
)
subplot_kw = dict(sharex="col", sharey="row")
fig = plt.figure(figsize=figsize, constrained_layout=True)
axs = fig.add_gridspec(**gridspec_kw).subplots(**subplot_kw)
sns.kdeplot(data=df, y="price", hue="cut", legend=False, ax=axs[0, 1])
sns.scatterplot(data=df, x="carat", y="price", hue="cut", ax=axs[0, 0])
sns.scatterplot(data=df, x="carat", y="depth", color=".2", ax=axs[1, 0])
sns.scatterplot(data=df, x="carat", y="table", color=".2", ax=axs[2, 0])
axs[0, 0].set(xlim=(1, 2))
# drop the unused cells in the right-hand column
axs[1, 1].remove()
axs[2, 1].remove()
BTW, this is almost a bit easier with plt.subplot_mosaic, but it does not yet support axis sharing.
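For comparison, option (B), explicitly adding only the subplots you want, might look roughly like this. It is a Matplotlib-only sketch: the random arrays are synthetic stand-ins for the diamonds columns, a histogram stands in for the marginal KDE, and the axes names are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins for the carat/price/depth/table columns
carat = rng.uniform(1, 2, 200)
price = carat * 5000 + rng.normal(0, 500, 200)
depth = rng.normal(62, 1.5, 200)
table = rng.normal(57, 2, 200)

fig = plt.figure(figsize=(6, 8), constrained_layout=True)
gs = fig.add_gridspec(nrows=3, ncols=2, width_ratios=[5, 1], height_ratios=[4, 1, 1])

# Only the four cells that are needed get an axes; sharing is set per axes
ax_joint = fig.add_subplot(gs[0, 0])
ax_marg_y = fig.add_subplot(gs[0, 1], sharey=ax_joint)
ax_depth = fig.add_subplot(gs[1, 0], sharex=ax_joint)
ax_table = fig.add_subplot(gs[2, 0], sharex=ax_joint)

ax_joint.scatter(carat, price, s=10)
ax_marg_y.hist(price, bins=30, orientation="horizontal")
ax_depth.scatter(carat, depth, s=10, color=".2")
ax_table.scatter(carat, table, s=10, color=".2")

ax_joint.set_xlim(1, 2)  # propagates to the axes sharing x
ax_joint.tick_params(labelbottom=False)
ax_depth.tick_params(labelbottom=False)
ax_marg_y.tick_params(labelleft=False)
ax_depth.set_ylabel("depth")
ax_table.set(xlabel="carat", ylabel="table")
```

The trade-off is more lines for the layout, but nothing has to be removed afterwards and each sharing relationship is spelled out explicitly.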
You could take the figure created by jointplot(), move its padding (with subplots_adjust()) and add 2 extra axes.
The example code will need some tweaking for each particular situation.
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
g = sns.jointplot(data=diamonds, x="carat", y="price", hue="cut",
xlim=(1, 2), height=12)
g.ax_marg_x.remove()
g.fig.subplots_adjust(left=0.08, right=0.97, top=1.05, bottom=0.45)
axins1 = inset_axes(g.ax_joint, width="100%", height="30%",
bbox_to_anchor=(0, -0.4, 1, 1),
bbox_transform=g.ax_joint.transAxes, loc=3, borderpad=0)
axins2 = inset_axes(g.ax_joint, width="100%", height="30%",
bbox_to_anchor=(0, -0.75, 1, 1),
bbox_transform=g.ax_joint.transAxes, loc=3, borderpad=0)
# share the x-axis of the new axes with the joint axes
# (recent Matplotlib releases no longer allow modifying the grouper
# returned by get_shared_x_axes(), so Axes.sharex is used instead)
axins1.sharex(g.ax_joint)
axins2.sharex(g.ax_joint)
axins1.scatter(x=diamonds["carat"], y=diamonds["depth"], color="grey", edgecolor="black")
axins1.set_ylabel("depth")
axins2.scatter(x=diamonds["carat"], y=diamonds["table"], color="grey", edgecolor="black")
axins2.set_xlabel("carat")
axins2.set_ylabel("table")
g.ax_joint.set_xlim(1, 2)
plt.setp(axins1.get_xticklabels(), visible=False)
plt.show()
PS: The question "How to share x axes of two subplots after they have been created" contains some info about sharing axes (although here you get the same effect simply by setting the xlims of each subplot).
The code to position the new axes has been adapted from this tutorial example.

Align multi-line ticks in Seaborn plot

I have the following heatmap:
I've broken up the category names by each capital letter and then capitalised them. This achieves a centering effect across the labels on my x-axis by default which I'd like to replicate across my y-axis.
yticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.index]
xticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.columns]
fig, ax = plt.subplots(figsize=(20,15))
sns.heatmap(corr, ax=ax, annot=True, fmt="d",
cmap="Blues", annot_kws=annot_kws,
mask=mask, vmin=0, vmax=5000,
cbar_kws={"shrink": .8}, square=True,
linewidths=5)
for p in ax.texts:
    myTrans = p.get_transform()
    offset = mpl.transforms.ScaledTranslation(-12, 5, mpl.transforms.IdentityTransform())
    p.set_transform(myTrans + offset)
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0, linespacing=0.4)
plt.xticks(plt.xticks()[0], labels=xticks, rotation=0, linespacing=0.4)
where corr represents a pre-defined pandas dataframe.
I couldn't seem to find an align parameter for setting the ticks and was wondering if and how this centering could be achieved in seaborn/matplotlib?
I've adapted the seaborn correlation plot example below.
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
rs = np.random.RandomState(33)
d = pd.DataFrame(data=rs.normal(size=(100, 7)),
columns=['Donald\nDuck','Mickey\nMouse','Han\nSolo',
'Luke\nSkywalker','Yoda','Santa\nClause','Ronald\nMcDonald'])
# Compute the correlation matrix
corr = d.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
for i in ax.get_yticklabels():
    i.set_ha('right')
    i.set_rotation(0)
for i in ax.get_xticklabels():
    i.set_ha('center')
Note the two for loops above. They get each tick label and set its horizontal alignment (you can also change the vertical alignment with set_va()).
The code above produces this:
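If the goal is for the individual lines of a multi-line label to be centered relative to each other, the Text property multialignment may also be worth trying. Treat the following as a sketch, not a tested fix for the heatmap in the question: it uses plain imshow with made-up labels and passes the text properties through set_xticklabels and set_yticklabels.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up multi-line labels, in the style of the question
names = ["DONALD\nDUCK", "YODA", "LUKE\nSKYWALKER"]
values = np.arange(9).reshape(3, 3)

fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(values, cmap="Blues")
ax.set_xticks(range(len(names)))
ax.set_yticks(range(len(names)))

# ha/va position the whole text block relative to the tick;
# multialignment aligns the lines inside the block with each other
ax.set_xticklabels(names, ha="center", multialignment="center")
ax.set_yticklabels(names, ha="right", va="center", multialignment="center")
```

With ha="right" the label block still sits flush against the axis, while multialignment="center" centers the shorter line within the block.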

how to increase space between bar and increase bar width in matplotlib

I am web-scraping a table directly from Wikipedia and plotting it. I want to increase the bar width, add space between the bars, and make all bars visible. How can I do that? My code is below:
#########scraping#########
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
html= requests.get("https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Nigeria")
bsObj= BeautifulSoup(html.content, 'html.parser')
states= []
cases=[]
for items in bsObj.find("table",{"class":"wikitable sortable"}).find_all('tr')[1:37]:
    data = items.find_all(['th',{"align":"left"},'td'])
    states.append(data[0].a.text)
    cases.append(data[1].b.text)
########Dataframe#########
table= ["STATES","CASES"]
tab= pd.DataFrame(list(zip(states,cases)),columns=table)
tab["CASES"]=tab["CASES"].replace('\n','', regex=True)
tab["CASES"]=tab["CASES"].replace(',','', regex=True)
tab['CASES'] = pd.to_numeric(tab['CASES'], errors='coerce')
tab["CASES"]=tab["CASES"].fillna(0)
tab["CASES"] = tab["CASES"].values.astype(int)
#######matplotlib########
x=tab["STATES"]
y=tab["CASES"]
plt.cla()
plt.locator_params(axis='y', nbins=len(y)/4)
plt.bar(x,y, color="blue")
plt.xticks(fontsize= 8,rotation='vertical')
plt.yticks(fontsize= 8)
plt.show()
Use pandas.read_html and barh
.read_html will read all table tags from a website and return a list of dataframes.
barh will make horizontal instead of vertical bars, which is useful if there are a lot of bars.
Make the plot taller if needed: the figure size here is (16.0, 10.0), so increase the 10.
I'd recommend using a log scale for x, because Lagos has so many more cases than Kogi.
This doesn't put more space between the bars, but the plot is more legible with its increased dimensions and horizontal bars.
.iloc[:36, :5] removes some unneeded columns and rows from the dataframe.
import pandas as pd
import matplotlib.pyplot as plt
# url
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Nigeria'
# create dataframe list
dataframe_list = pd.read_html(url) # this is a list of all the tables at the url as dataframes
# get the dataframe from the list
df = dataframe_list[2].iloc[:36, :5] # you want the dataframe at index 2
# replace '-' with 0
df.replace('–', 0, inplace=True)
# set to int
for col in df.columns[1:]:
    df[col] = df[col].astype('int')
# plot a horizontal bar
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.style.use('ggplot')
p = plt.barh(width='Cases', y='State', data=df, color='purple')
plt.xscale('log')
plt.xlabel('Number of Cases')
plt.show()
Plot all the data in df
df.set_index('State', inplace=True)
# DataFrame.plot creates its own figure, so pass figsize to it directly
df.plot.barh(figsize=(14, 14))
plt.xscale('log')
plt.show()
4 subplots
State as index
plt.figure(figsize=(14, 14))
for i, col in enumerate(df.columns, 1):
    plt.subplot(2, 2, i)
    df[col].plot.barh(label=col, color='green')
    plt.xscale('log')
    plt.legend()
plt.tight_layout()
plt.show()
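On the original request for wider bars and more space between them: in Matplotlib the bar thickness is the height argument of barh (width for bar), measured in units of the spacing between categories, which is 1. A value below 1 leaves the rest as a gap. A minimal sketch with placeholder data follows; the labels and numbers are invented for illustration and are not the Wikipedia figures.

```python
import matplotlib.pyplot as plt

# Placeholder labels and made-up counts, for illustration only
states = ["State A", "State B", "State C", "State D"]
cases = [12000, 900, 450, 5]

fig, ax = plt.subplots(figsize=(8, 4))

# height=0.6 means each bar fills 60% of its slot, leaving a 40% gap
bars = ax.barh(states, cases, height=0.6, color="purple")
ax.set_xscale("log")
ax.set_xlabel("Number of Cases")
```

Raising height towards 1.0 thickens the bars and shrinks the gaps, while making the figure taller through figsize gives every bar more room overall.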