Fill in between two lines on seaborn Python - matplotlib

I've tried filling in the area between two lines with Python -- using facetgrid on seaborn. My code is the following:
sns.set_style('white')
sns.set(font_scale=1.5, rc={"lines.linewidth": 1.5})
g = sns.FacetGrid(final, col = "Variable", col_wrap = 3, legend_out=True)
g.fig.set_figwidth(20)
g.fig.set_figheight(10)
g.map_dataframe(sns.lineplot, x = "Time", y="Shock", color='black')
g.map_dataframe(sns.lineplot, x = "Time", y="Upper confidence level", color='grey')
g.map_dataframe(sns.lineplot, x = "Time", y="Lower confidence level", color='grey')
line = (g.map_dataframe(sns.lineplot, x = "Time", y="Lower", color='grey')).get_lines()
g.map(plt.fill_between(line[0].get_xdata(), line[1].get_ydata(), line[2].get_ydata(), color='grey', alpha=.5))
g.map(plt.axhline, y=0, ls='--', c='red')
g.set_titles(col_template="{col_name}", row_template=["",""])
g.set(ylim=(-0.1, 0.1))
I get the following error: AttributeError: 'FacetGrid' object has no attribute 'get_lines'
Is there a way to apply the plt.fill_between command on FacetGrid?

For functions such as fill_between, there is no way to directly execute them via g.map_dataframe. Instead, you can create a custom function which then can be called via g.map_dataframe. Such a custom function needs to accept at least 2 parameters:
data: the dataframe restricted to the subplot
color: this is used when working with hue, but can be ignored
As a custom function is needed anyway, it might make sense to do all the drawing inside that function. Especially when a lot of functions need to be called, such might be easier to read and to customize further.
Here is an example of how such code could look like. Note that the preferred way to set the figsize is via the height= and aspect= keywords, indicating the height and the ratio between width and height of the individual subplots.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
def draw_subplot(data, color):
sns.lineplot(data=data, x="Time", y="Shock", color='black')
sns.lineplot(data=data, x="Time", y="Upper confidence level", color='grey')
sns.lineplot(data=data, x="Time", y="Lower confidence level", color='grey')
line = plt.gca().lines
plt.fill_between(line[0].get_xdata(), line[1].get_ydata(), line[2].get_ydata(), color='grey', alpha=.5)
plt.axhline(y=0, ls='--', c='red')
plt.margins(x=0)
sns.set_style('white')
sns.set(font_scale=1.5, rc={"lines.linewidth": 1.5})
final = pd.DataFrame({"Time": np.tile(np.arange(1, 81), 3),
"Shock": np.random.randn(80 * 3).cumsum() * 2,
"Variable": np.repeat([*'ABC'], 80)})
final["Upper confidence level"] = final["Shock"] + 2 + np.abs(np.random.randn(len(final)).cumsum())
final["Lower confidence level"] = final["Shock"] - 2 - np.abs(np.random.randn(len(final)).cumsum())
g = sns.FacetGrid(final, col="Variable", col_wrap=3, legend_out=True, height=6, aspect=1.5)
g.map_dataframe(draw_subplot)
g.set_titles(col_template="{col_name}", row_template=["", ""])
plt.show()

Related

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

Am trying to find hist()'s figsize and layout parameter for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, it is oriented horizontally and there is no equivalent layout parameter to make them vertical to my knowledge. 4 plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
x_vars = X_train.select_dtypes(exclude=['object']).columns,
y_vars = ["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()

Calculating and plotting parametric equations in sympy

So i'm struggling with these parametric equations in Sympy.
𝑓(𝜃) = cos(𝜃) − sin(𝑎𝜃) and 𝑔(𝜃) = sin(𝜃) + cos(𝑎𝜃)
with 𝑎 ∈ ℝ∖{0}.
import matplotlib.pyplot as plt
import sympy as sp
from IPython.display import display
sp.init_printing()
%matplotlib inline
This is what I have to define them:
f = sp.Function('f')
g = sp.Function('g')
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
I don't know how to define a with the domain ℝ∖{0} and it gives me trouble when I want to solve the equation
𝑓(𝜃)+𝑔(𝜃)=0
The solution should be:
𝜃=[3𝜋/4,3𝜋/4𝑎,𝜋/2(𝑎−1),𝜋/(𝑎+1)]
Next I want to plot the parametric equations when a=2, a=4, a=6 and a=8. I want to have a different color for every value of a. The most efficient way will probably be with a for-loop.
I also need to use lambdify to have a list of values but I'm fairly new to this so it's a bit vague.
This is what I already have:
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = ['blue', 'green', 'orange', 'cyan']
a = [2, 4, 6, 8]
for index in range(0, 4):
# I guess I need to use lambdify here but I don't see how
plt.show()
Thank you in advance!
You're asking two very different questions. One question about solving a symbolic expression, and one about plotting curves.
First, about the symbolic expression. a can be defined as a = sp.symbols('a', real=True, nonzero=True) and theta as th = sp.symbols('theta', real=True). There is no need to define f and g as sympy symbols, as they get assigned a sympy expression. To solve the equation, just use sp.solve(f+g, th). Sympy gives [pi, pi/a, pi/(2*(a - 1)), pi/(a + 1)] as the result.
Sympy also has a plotting function, which could be called as sp.plot(*[(f+g).subs({a:a_val}) for a_val in [2, 4, 6, 8]]). But there is very limited support for options such as color.
To have more control, matplotlib can do the plotting based on numpy functions. sp.lambdify converts the expression: sp.lambdify((th, a), f+g, 'numpy').
Then, matplotlib can do the plotting. There are many options to tune the result.
Here is some example code:
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp
th = sp.symbols('theta', real=True)
a = sp.symbols('a', real=True, nonzero=True)
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
thetas = sp.solve(f+g, th)
print("Solutions for theta:", thetas)
fg_np = sp.lambdify((th, a), f+g, 'numpy')
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = plt.cm.Set2.colors
for a_val, color in zip([2,4,6,8], colors):
plt.plot(theta_range, fg_np(theta_range, a_val), color=color, label=f'a={a_val}')
plt.axhline(0, color='black')
plt.xlabel("theta")
plt.ylabel(f+g)
plt.legend()
plt.grid()
plt.autoscale(enable=True, axis='x', tight=True)
plt.show()

Python keeps overwriting hist on previous plot but doesn't save it with the desired plot

I am saving two separate figures, that each should contain 2 plots together.
The problem is that the first figure is ok, but the second one, does not gets overwritten on the new plot but on the previous one, but in the saved figure, I only find one of the plots :
This is the first figure , and I get the first figure correctly :
import scipy.stats as s
import numpy as np
import os
import pandas as pd
import openpyxl as pyx
import matplotlib
matplotlib.rcParams["backend"] = "TkAgg"
#matplotlib.rcParams['backend'] = "Qt4Agg"
#matplotlib.rcParams['backend'] = "nbAgg"
import matplotlib.pyplot as plt
import math
data = [336256, 620316, 958846, 1007830, 1080401]
pdf = array([ 0.00449982, 0.0045293 , 0.00455894, 0.02397463,
0.02395788, 0.02394114])
fig, ax = plt.subplots();
fig = plt.figure(figsize=(40,30))
x = np.linspace(np.min(data), np.max(data), 100);
plt.plot(x, s.exponweib.pdf(x, *s.exponweib.fit(data, 1, 1, loc=0, scale=2)))
plt.hist(data, bins = np.linspace(data[0], data[-1], 100), normed=True, alpha= 1)
text1= ' Weibull'
plt.savefig(text1+ '.png' )
datar =np.asarray(data)
mu, sigma = datar.mean() , datar.std() # mean and standard deviation
normal_std = np.sqrt(np.log(1 + (sigma/mu)**2))
normal_mean = np.log(mu) - normal_std**2 / 2
hs = np.random.lognormal(normal_mean, normal_std, 1000)
print(hs.max()) # some finite number
print(hs.mean()) # about 136519
print(hs.std()) # about 50405
count, bins, ignored = plt.hist(hs, 100, normed=True)
x = np.linspace(min(bins), max(bins), 10000)
pdfT = [];
for el in range (len(x)):
pdfTmp = (math.exp(-(np.log(x[el]) - normal_mean)**2 / (2 * normal_std**2)))
pdfT += [pdfTmp]
pdf = np.asarray(pdfT)
This is the second set :
fig, ax = plt.subplots();
fig = plt.figure(figsize=(40,40))
plt.plot(x, pdf, linewidth=2, color='r')
plt.hist(data, bins = np.linspace(data[0], data[-1], 100), normed=True, alpha= 1)
text= ' Lognormal '
plt.savefig(text+ '.png' )
The first plot saves the histogram together with curve. instead the second one only saves the curve
update 1 : looking at This Question , I found out that clearing the plot history will help the figures don't mixed up , but still my second set of plots, I mean the lognormal do not save together, I only get the curve and not the histogram.
This is happening, because you have set normed = True, which means that area under the histogram is normalized to 1. And since your bins are very wide, this means that the actual height of the histogram bars are very small (in this case so small that they are not visible)
If you use
n, bins, _ = plt.hist(data, bins = np.linspace(data[0], data[-1], 100), normed=True, alpha= 1)
n will contain the y-value of your bins and you can confirm this yourself.
Also have a look at the documentation for plt.hist.
So if you set normed to False, the histogram will be visible.
Edit: number of bins
import numpy as np
import matplotlib.pyplot as plt
rand_data = np.random.uniform(0, 1.0, 100)
fig = plt.figure()
ax_1 = fig.add_subplot(211)
ax_1.hist(rand_data, bins=10)
ax_2 = fig.add_subplot(212)
ax_2.hist(rand_data, bins=100)
plt.show()
will give you two plots similar (since its random) to:
which shows how the number of bins changes the histogram.
A histogram visualises the distribution of your data along one dimension, so not sure what you mean by number of inputs and bins.

Percentile Distribution Graph

Does anyone have an idea how to change X axis scale and ticks to display a percentile distribution like the graph below? This image is from MATLAB, but I want to use Python (via Matplotlib or Seaborn) to generate.
From the pointer by #paulh, I'm a lot closer now. This code
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
import probscale
import seaborn as sns
clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)
fig, ax = plt.subplots(figsize=(8, 4))
x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)
ax.set_xlim(40, 99.5)
ax.set_xscale('prob')
ax.plot(x, y)
sns.despine(fig=fig)
Generates the following plot (notice the re-distributed X-Axis)
Which I find much more useful than a the standard scale:
I contacted the author of the original graph and they gave me some pointers. It is actually a log scale graph, with x axis reversed and values of [100-val], with manual labeling of the x axis ticks. The code below recreates the original image with the same sample data as the other graphs here.
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)
x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)
# Number of intervals to display.
# Later calculations add 2 to this number to pad it to align with the reversed axis
num_intervals = 3
x_values = 1.0 - 1.0/10**np.arange(0,num_intervals+2)
# Start with hard-coded lengths for 0,90,99
# Rest of array generated to display correct number of decimal places as precision increases
lengths = [1,2,2] + [int(v)+1 for v in list(np.arange(3,num_intervals+2))]
# Build the label string by trimming on the calculated lengths and appending %
labels = [str(100*v)[0:l] + "%" for v,l in zip(x_values, lengths)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.set_xscale('log')
plt.gca().invert_xaxis()
# Labels have to be reversed because axis is reversed
ax.xaxis.set_ticklabels( labels[::-1] )
ax.plot([100.0 - v for v in x], y)
ax.grid(True, linewidth=0.5, zorder=5)
ax.grid(True, which='minor', linewidth=0.5, linestyle=':')
sns.despine(fig=fig)
plt.savefig("test.png", dpi=300, format='png')
This is the resulting graph:
These type of graphs are popular in the low-latency community for plotting latency distributions. When dealing with latencies most of the interesting information tends to be in the higher percentiles, so a logarithmic view tends to work better. I've first seen these graphs used in https://github.com/giltene/jHiccup and https://github.com/HdrHistogram/.
The cited graph was generated by the following code
n = ceil(log10(length(values)));
p = 1 - 1./10.^(0:0.01:n);
percentiles = prctile(values, p * 100);
semilogx(1./(1-p), percentiles);
The x-axis was labelled with the code below
labels = cell(n+1, 1);
for i = 1:n+1
labels{i} = getPercentileLabel(i-1);
end
set(gca, 'XTick', 10.^(0:n));
set(gca, 'XTickLabel', labels);
% {'0%' '90%' '99%' '99.9%' '99.99%' '99.999%' '99.999%' '99.9999%'}
function label = getPercentileLabel(i)
switch(i)
case 0
label = '0%';
case 1
label = '90%';
case 2
label = '99%';
otherwise
label = '99.';
for k = 1:i-2
label = [label '9'];
end
label = [label '%'];
end
end
The following Python code uses Pandas to read a csv file that contains a list of recorded latency values (in milliseconds), then it records those latency values (as microseconds) in an HdrHistogram, and saves the HdrHistogram to an hgrm file, that will then be used by Seaborn to plot the latency distribution graph.
import pandas as pd
from hdrh.histogram import HdrHistogram
from hdrh.dump import dump
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sys
import argparse
# Parse the command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('csv_file')
parser.add_argument('hgrm_file')
parser.add_argument('png_file')
args = parser.parse_args()
csv_file = args.csv_file
hgrm_file = args.hgrm_file
png_file = args.png_file
# Read the csv file into a Pandas data frame and generate an hgrm file.
csv_df = pd.read_csv(csv_file, index_col=False)
USECS_PER_SEC=1000000
MIN_LATENCY_USECS = 1
MAX_LATENCY_USECS = 24 * 60 * 60 * USECS_PER_SEC # 24 hours
# MAX_LATENCY_USECS = int(csv_df['response-time'].max()) * USECS_PER_SEC # 1 hour
LATENCY_SIGNIFICANT_DIGITS = 5
histogram = HdrHistogram(MIN_LATENCY_USECS, MAX_LATENCY_USECS, LATENCY_SIGNIFICANT_DIGITS)
for latency_sec in csv_df['response-time'].tolist():
histogram.record_value(latency_sec*USECS_PER_SEC)
# histogram.record_corrected_value(latency_sec*USECS_PER_SEC, 10)
TICKS_PER_HALF_DISTANCE=5
histogram.output_percentile_distribution(open(hgrm_file, 'wb'), USECS_PER_SEC, TICKS_PER_HALF_DISTANCE)
# Read the generated hgrm file into a Pandas data frame.
hgrm_df = pd.read_csv(hgrm_file, comment='#', skip_blank_lines=True, sep=r"\s+", engine='python', header=0, names=['Latency', 'Percentile'], usecols=[0, 3])
# Plot the latency distribution using Seaborn and save it as a png file.
sns.set_theme()
sns.set_style("dark")
sns.set_context("paper")
sns.set_color_codes("pastel")
fig, ax = plt.subplots(1,1,figsize=(20,15))
fig.suptitle('Latency Results')
sns.lineplot(x='Percentile', y='Latency', data=hgrm_df, ax=ax)
ax.set_title('Latency Distribution')
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
ax.set_xscale('log')
ax.set_xticks([1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999'])
fig.tight_layout()
fig.savefig(png_file)

Scatter plot with scalar data

I want to create a scatter plot with matplotlib where the data points have scalar data attached to them and are assigned a color depending on how large their attached value is relative to the other points in the set. I.e., I want something akin to a heatmap. However, I'm looking for a "discrete" heatmap, i.e. nothing should be ploted where there were no points in the original data set and, in particular, no interpolation (in space) should be performed.
Can this be done?
you can use scatter, and set the attached value to c parameter:
import numpy as np
import pylab as pl
x = np.random.uniform(-1, 1, 1000)
y = np.random.uniform(-1, 1, 1000)
z = np.sqrt(x*x+y*y)
pl.scatter(x, y, c=z)
pl.colorbar()
pl.show()
Solving this in Altair.
import numpy as np
import pylab as pl
x = np.random.uniform(-1, 1, 1000)
y = np.random.uniform(-1, 1, 1000)
z = np.sqrt(x*x+y*y)
df = pd.DataFrame({'x':x,'y':y, 'z':z})
from altair import *
Chart(df).mark_circle().encode(x='x',y='y', color='z')