change color of bar for data selection in seaborn histogram (or plt) - pandas

Let's say I have a dataframe like:
X2 = np.random.normal(10, 3, 200)
X3 = np.random.normal(34, 2, 200)
a = pd.DataFrame({"X3": X3, "X2":X2})
and I am doing the following plotting routine:
f, axes = plt.subplots(2, 2, gridspec_kw={"height_ratios":(.10, .30)}, figsize = (13, 4))
for i, c in enumerate(a.columns):
    sns.boxplot(a[c], ax=axes[0,i])
    sns.distplot(a[c], ax = axes[1,i])
    axes[1, i].set(yticklabels=[])
    axes[1, i].set(xlabel='')
    axes[1, i].set(ylabel='')
plt.tight_layout()
plt.show()
Which yields:
Now I want to be able to perform a data selection on the dataframe a. Let's say something like:
b = a[(a['X2'] <4)]
and highlight the selection from b in the posted histograms.
for example if the first row of b is [32:0] for X3 and [0:5] for X2, the desired output would be:
is it possible to do this with the above for loop and with sns? Many thanks!
EDIT: I am also happy with a matplotlib solution, if easier.
EDIT2:
If it helps, it would be similar to do the following:
b = a[(a['X3'] >38)]
f, axes = plt.subplots(2, 2, gridspec_kw={"height_ratios":(.10, .30)}, figsize = (13, 4))
for i, c in enumerate(a.columns):
    sns.boxplot(a[c], ax=axes[0,i])
    sns.distplot(a[c], ax = axes[1,i])
    sns.distplot(b[c], ax = axes[1,i])
    axes[1, i].set(yticklabels=[])
    axes[1, i].set(xlabel='')
    axes[1, i].set(ylabel='')
plt.tight_layout()
plt.show()
which yields the following:
However, I would like to just colour those bars in the first plot in a different colour!
I also thought about setting ylim to the height of the blue plot so that the orange one won't distort the shape of the blue distribution, but that still wouldn't be feasible: in reality I have about 10 histograms to show, and setting ylim would be pretty much the same as sharey=True, which I'm trying to avoid so that I can show the true shape of the distributions.

I think I found the solution for this using the inspiration from the previous answer and this video:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(2021)
X2 = np.random.normal(10, 3, 200)
X3 = np.random.normal(34, 2, 200)
a = pd.DataFrame({"X3": X3, "X2":X2})
b = a[(a['X3'] < 30)]
hist_idx=[]
for i, c in enumerate(a.columns):
    # bin edges of the 20-bin histogram of column c
    bin_ = np.histogram(a[c], bins=20)[1]
    # indices of the edges that lie in (min(b[c]), max(b[c])]
    hist = np.where(np.logical_and(bin_<=max(b[c]), bin_>min(b[c])))
    hist_idx.append(hist)
f, axes = plt.subplots(2, 2, gridspec_kw={"height_ratios":(.10, .30)}, figsize = (13, 4))
for i, c in enumerate(a.columns):
    sns.boxplot(x=a[c], ax=axes[0,i])
    axes[1, i].hist(a[c], bins = 20)
    axes[1, i].set(yticklabels=[])
    axes[1, i].set(xlabel='')
    axes[1, i].set(ylabel='')
for it, index in enumerate(hist_idx):
    # edge k is the right edge of bar k-1, so recolour that bar
    for k in index[0]:
        axes[1, it].patches[k-1].set_fc("red")
plt.tight_layout()
plt.show()
which yields the following for b = a[(a['X3'] < 30)] :
or for b = a[(a['X3'] > 36)]:
Thought I'd leave it here - although niche, might help someone in the future!
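A minimal sketch of the same idea for a single column, assuming it is acceptable to highlight every bar whose interval overlaps the range spanned by the selection. It uses the patches that ax.hist returns directly, so no separate np.histogram call is needed. The names x3, selected and overlaps are illustrative, and the Agg backend is only chosen so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2021)
x3 = rng.normal(34, 2, 200)
selected = x3[x3 > 36]          # stands in for b['X3']

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(x3, bins=20)

if selected.size:               # nothing to highlight for an empty selection
    lo, hi = selected.min(), selected.max()
    # bar k covers [edges[k], edges[k+1]]; flag it if that interval overlaps [lo, hi]
    overlaps = (edges[1:] > lo) & (edges[:-1] < hi)
    for patch, flag in zip(patches, overlaps):
        if flag:
            patch.set_facecolor("red")
```

Because the test is on the bar intervals rather than on single edges, the bars that contain the smallest and the largest selected value are both included.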

I created the following code with the understanding that the intent of your question is to add a different color to the histogram based on the data extracted under certain conditions.
Use np.histogram() to get an array of frequencies and an array of bins. Get the index of the value closest to the value of the first row of data extracted for a certain condition. Change the color of the histogram with that retrieved index. The same method can be used to deal with the other graph.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(2021)
X2 = np.random.normal(10, 3, 200)
X3 = np.random.normal(34, 2, 200)
a = pd.DataFrame({"X3": X3, "X2":X2})
f, axes = plt.subplots(2, 2, gridspec_kw={"height_ratios":(.10, .30)}, figsize = (13, 4))
for i, c in enumerate(a.columns):
    sns.boxplot(a[c], ax=axes[0,i])
    sns.distplot(a[c], ax = axes[1,i])
    axes[1, i].set(yticklabels=[])
    axes[1, i].set(xlabel='')
    axes[1, i].set(ylabel='')
b = a[(a['X2'] <4)]
hist3, bins3 = np.histogram(X3)
# index of the bin edge closest to the first selected X3 value
idx = np.abs(bins3 - b['X3'].head(1).values[0]).argmin()
for k in range(idx):
    axes[1,0].get_children()[k].set_color("red")
plt.tight_layout()
plt.show()

Matplotlib weird behavior with 2D arrays plot

As per the matplotlib documentation, x and/or y may be 2D arrays, and in this case the columns are treated as different datasets. When I follow the example in the matplotlib page it works fine:
>>> x = [1, 2, 3]
>>> y = np.array([[1, 2], [3, 4], [5, 6]])
>>> plot(x, y)
However, when I try with larger, float64 arrays, it plots a weird figure. This is what I got
from scipy.stats import chi2
x = np.linspace(0,5,1000)
chi2_2, chi2_5 = chi2.pdf(x,2), chi2.pdf(x,5)
y = np.array((chi2_2,chi2_5)).reshape(1000,2)
fig, ax = plt.subplots()
ax.plot(x,y)
and produces this plot:
if I plot them separately, it comes out fine:
fig, ax = plt.subplots()
ax.plot(x,chi2_2,'b')
ax.plot(x,chi2_5,'r')
I can't figure out what the difference is between the example and my case, other than using 2D arrays with float64 instead of int64.
Any help is appreciated.
It looks like reshape isn't doing what you expect it to do. I think the function that you are looking for is transpose rather than reshape.
from scipy.stats import chi2
x = np.linspace(0,5,1000)
chi2_2, chi2_5 = chi2.pdf(x,2), chi2.pdf(x,5)
y = np.array((chi2_2,chi2_5)).T
y2 = np.array((chi2_2,chi2_5)).reshape(1000,2)
print(np.array_equal(y,y2))
fig, ax = plt.subplots()
ax.plot(x,y)
plt.show()
Using transpose gives the plot that you want, and np.array_equal(y,y2) returning False confirms that the two arrays are not the same.
Below is the output:
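To see why the two calls differ, here is a small sketch with two three-element arrays (the names first and second are illustrative). reshape keeps the row-major order of the elements and only changes the shape, so values from the two datasets end up mixed within each column, whereas transpose swaps the axes so that each dataset becomes one column.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([10, 20, 30])

stacked = np.array((first, second))   # shape (2, 3): one dataset per row
reshaped = stacked.reshape(3, 2)      # same element order, new shape
transposed = stacked.T                # axes swapped: one dataset per column

print(reshaped)
# [[ 1  2]
#  [ 3 10]
#  [20 30]]
print(transposed)
# [[ 1 10]
#  [ 2 20]
#  [ 3 30]]
```

np.column_stack((first, second)) builds the same array as the transpose in one step.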

Calculating and plotting parametric equations in sympy

So i'm struggling with these parametric equations in Sympy.
𝑓(πœƒ) = cos(πœƒ) βˆ’ sin(π‘Žπœƒ) and 𝑔(πœƒ) = sin(πœƒ) + cos(π‘Žπœƒ)
with π‘Ž ∈ β„βˆ–{0}.
import matplotlib.pyplot as plt
import sympy as sp
from IPython.display import display
sp.init_printing()
%matplotlib inline
This is what I have to define them:
f = sp.Function('f')
g = sp.Function('g')
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
I don't know how to define a with the domain ℝ∖{0} and it gives me trouble when I want to solve the equation
f(θ)+g(θ)=0
The solution should be:
θ=[3π/4,3π/4a,π/2(a−1),π/(a+1)]
Next I want to plot the parametric equations when a=2, a=4, a=6 and a=8. I want to have a different color for every value of a. The most efficient way will probably be with a for-loop.
I also need to use lambdify to have a list of values but I'm fairly new to this so it's a bit vague.
This is what I already have:
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = ['blue', 'green', 'orange', 'cyan']
a = [2, 4, 6, 8]
for index in range(0, 4):
# I guess I need to use lambdify here but I don't see how
plt.show()
Thank you in advance!
You're asking two very different questions. One question about solving a symbolic expression, and one about plotting curves.
First, about the symbolic expression. a can be defined as a = sp.symbols('a', real=True, nonzero=True) and theta as th = sp.symbols('theta', real=True). There is no need to define f and g as sympy symbols, as they get assigned a sympy expression. To solve the equation, just use sp.solve(f+g, th). Sympy gives [pi, pi/a, pi/(2*(a - 1)), pi/(a + 1)] as the result.
Sympy also has a plotting function, which could be called as sp.plot(*[(f+g).subs({a:a_val}) for a_val in [2, 4, 6, 8]]). But there is very limited support for options such as color.
To have more control, matplotlib can do the plotting based on numpy functions. sp.lambdify converts the expression: sp.lambdify((th, a), f+g, 'numpy').
Then, matplotlib can do the plotting. There are many options to tune the result.
Here is some example code:
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp
th = sp.symbols('theta', real=True)
a = sp.symbols('a', real=True, nonzero=True)
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
thetas = sp.solve(f+g, th)
print("Solutions for theta:", thetas)
fg_np = sp.lambdify((th, a), f+g, 'numpy')
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = plt.cm.Set2.colors
for a_val, color in zip([2,4,6,8], colors):
    plt.plot(theta_range, fg_np(theta_range, a_val), color=color, label=f'a={a_val}')
plt.axhline(0, color='black')
plt.xlabel("theta")
plt.ylabel(f+g)
plt.legend()
plt.grid()
plt.autoscale(enable=True, axis='x', tight=True)
plt.show()
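The code above plots f+g against theta. If the goal is the parametric curve itself, i.e. the points (f(θ), g(θ)), a minimal sketch is to lambdify f and g separately and plot one against the other. The names f_np and g_np are illustrative, and the Agg backend is only chosen so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

th = sp.symbols('theta', real=True)
a = sp.symbols('a', real=True, nonzero=True)
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)

# one numpy function per coordinate, each taking (theta, a)
f_np = sp.lambdify((th, a), f, 'numpy')
g_np = sp.lambdify((th, a), g, 'numpy')

theta_range = np.linspace(0, 2*np.pi, 750)
colors = ['blue', 'green', 'orange', 'cyan']

fig, ax = plt.subplots(figsize=(8, 8))
for a_val, color in zip([2, 4, 6, 8], colors):
    # x-coordinate is f(theta), y-coordinate is g(theta)
    ax.plot(f_np(theta_range, a_val), g_np(theta_range, a_val),
            color=color, label=f'a={a_val}')
ax.set_aspect('equal')
ax.legend()
```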

Threshold Otsu: AttributeError: 'AxesSubplot' object has no attribute 'ravel'

I loaded NIfTI files (these were also converted from .pack CT scans). My goal is to use the Otsu threshold algorithm to mask the image from the background and compare the two images. When I try to plot, I get the error
AttributeError: 'AxesSubplot' object has no attribute 'ravel'
Below is the code and attached is a screenshot.
import SimpleITK as sitk
import matplotlib.pyplot as plt
import numpy as np
from skimage.filters import threshold_otsu
#THRESHOLD OTSU
img = sitk.GetArrayFromImage(sitk.ReadImage("\\\\x.x.x.x/users/ddff/python/nifts/prr_ipsi.nii"))
print(img.shape)
thresh = threshold_otsu(img.flatten())
#thresh = thresh.reshape(img.shape)
binary = img <= thresh
#I can plot this image slice fine
plt.imshow(img[20,:,:])
fig, axes = plt.subplots(ncols=1)
ax = axes.ravel()
ax[0] = plt.subplot(1, 3, 1)
ax[1] = plt.subplot(1, 3, 2)
ax[2] = plt.subplot(1, 3, 3, sharex=ax[0], sharey=ax[0])
ax[0].imshow(img[20,:,:], cmap=plt.cm.gray)
ax[0].set_title('Original Breast Delineation')
ax[0].axis('off')
ax[1].hist(thresh, bins=256)
ax[1].set_title('Histogram ')
ax[1].axvline(thresh, color='r')
ax[2].imshow(binary[20,:,:], cmap=plt.cm.gray)
ax[2].set_title('Thresholded')
ax[2].axis('off')
plt.show()
With plt.subplots(ncols=1) there is only one subplot, so axes is a single Axes object rather than an array, and there is nothing to ravel or flatten. ravel only works when there is more than one subplot. With a single row or a single column you can also do the following without ravel:
fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True)
ax[0].imshow(img[20,:,:], cmap=plt.cm.gray)
ax[0].set_title('Original Breast Delineation')
ax[0].axis('off')
ax[1].hist(img.ravel(), bins=256)  # histogram of the voxel values, not of the scalar threshold
ax[1].set_title('Histogram ')
ax[1].axvline(thresh, color='r')
ax[2].imshow(binary[20,:,:], cmap=plt.cm.gray)
ax[2].set_title('Thresholded')
ax[2].axis('off')
In case you want a 2d matrix of subplot instances, you can use Thomas Kühn's suggestion.
fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True, squeeze=False)
and then you can access the subplots as
ax[0][0].imshow()
ax[0][1].imshow()
......
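A small sketch of what plt.subplots returns in each case, which is where the error comes from. The variable names are illustrative, and np.atleast_1d is shown as one possible way to get code that works for any number of subplots; the Agg backend is only chosen so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

fig1, single = plt.subplots(ncols=1)               # one Axes object, no .ravel()
fig2, row = plt.subplots(ncols=3)                  # 1D array of three Axes
fig3, grid = plt.subplots(ncols=3, squeeze=False)  # always a 2D array

print(hasattr(single, "ravel"), row.shape, grid.shape)

# wrapping the result makes .ravel() safe even for a single subplot
flat = np.atleast_1d(single).ravel()
```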

matplotlib heatmap, customize y axis

Right now my code looks like this:
#generate 262*20 elements
values = np.random.random(262*20).tolist()
# convert the list to a 2D NumPy array
values = np.array(values).reshape((262, 20))
h, w = values.shape
#h=262, w=20
fig = plt.figure(num=None, dpi=80,figsize=(9, 7), facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
#fig, ax = plt.subplots()
plt.imshow(values)
plt.colorbar()
plt.xticks(np.arange(w), list('PNIYLKCVFWABCDEFGHIJ'))
ax.set_aspect(w/h)
plt.show()
The plot looks like this:
As you can see, the range of y axis is 0-261.
But I want my y axis to go from 26 to 290, missing 57, 239, and 253. So still 262 in total. I tried to generate a list like this:
mylist =[26, 27, ......missing 57, 239, 253, ....290]
plt.yticks(np.arange(h), mylist)
The Y axis just looks like everything squished together.
So I tried:
pylab.ylim([26, 290])
And It looks like this:
So it just feels like the data in first row always corresponds to [0], not to [26]
I suggest you use pcolormesh. If you want gaps, use a numpy.ma masked array for the area with gaps.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
values = np.random.rand(290,20)
values[:26, :] = np.nan  # np.NaN was removed in NumPy 2.0
values[ [57, 239, 253], :] = np.nan
values = np.ma.masked_invalid(values)
h, w = values.shape
fig, ax = plt.subplots(figsize=(9,7))
# Make one larger so these values represent the edge of the data pixels.
y = np.arange(0, 290.5)
x = np.arange(0, 20.5)
pcm = ax.pcolormesh(x, y, values, rasterized=True) # you don't need rasterized=True
fig.colorbar(pcm)
plt.xticks(np.arange(w), list('PNIYLKCVFWABCDEFGHIJ'))
plt.show()
Result
EDIT: If you want to just work w/ a 262x20 array:
values = np.random.rand(262,20)
h, w = values.shape
fig, ax = plt.subplots(figsize=(9,7))
# Make one larger so these values represent the edge of the data pixels.
y = np.arange(0, 290.5)
y = np.delete(y, [57, 239, 253])
y = np.delete(y, range(26))
x = np.arange(0, 20.5)
pcm = ax.pcolormesh(x, y, values, rasterized=True) # you don't need rasterized=True
fig.colorbar(pcm)
plt.xticks(np.arange(w), list('PNIYLKCVFWABCDEFGHIJ'))
plt.show()
Note that this doesn't put a blank line at 57, 239 and 253. If you want that, you need to do:
values = np.random.rand(262,20)
Z = np.full((291, 20), np.nan)  # rows 0..290; rows without data stay NaN
# the 262 row labels that have data; sorted() because a set has no guaranteed order
inds = sorted(set(range(26, 291)) - {57, 239, 253})
for nn, ind in enumerate(inds):
    Z[ind, :] = values[nn,:]
h, w = values.shape
fig, ax = plt.subplots(figsize=(9,7))
# Make one larger so these values represent the edge of the data pixels.
y = np.arange(0, 291.5)
x = np.arange(0, 20.5)
pcm = ax.pcolormesh(x, y, Z, rasterized=True) # you don't need rasterized=True
fig.colorbar(pcm)
plt.xticks(np.arange(w), list('PNIYLKCVFWABCDEFGHIJ'))
plt.show()
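A vectorised sketch of the same padding step, assuming the row labels run from 26 to 290 inclusive with 57, 239 and 253 missing (which gives exactly 262 rows). The names labels and Z are illustrative, and the Agg backend is only chosen so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

values = np.random.rand(262, 20)

# sorted row labels 26..290 without the three missing ones
labels = np.setdiff1d(np.arange(26, 291), [57, 239, 253])

# NaN-padded array indexed by label; fancy indexing replaces the loop
Z = np.full((291, 20), np.nan)
Z[labels] = values

fig, ax = plt.subplots(figsize=(9, 7))
# 21 x-edges and 292 y-edges enclose the 20 x 291 cells
pcm = ax.pcolormesh(np.arange(21), np.arange(292), np.ma.masked_invalid(Z))
ax.set_ylim(26, 291)   # hide the empty rows below label 26
fig.colorbar(pcm)
```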

group boxplot histogramming

I would like to group my data and plot a boxplot for each group. There are many questions and answers about that; my problem is that I want to group by a continuous variable, so I want to bin (histogram) my data.
Here is what I have done. My data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
x = np.random.chisquare(5, size=100000)
y = np.random.normal(size=100000) / (0.05 * x + 0.1) + 2 * x
f, ax = plt.subplots()
ax.plot(x, y, '.', alpha=0.05)
plt.show()
I want to study the behaviour of y (location, width, ...) as a function of x. I am not interested in the distribution of x, so I will normalize it.
f, ax = plt.subplots()
xbins = np.linspace(0, 25, 50)
ybins = np.linspace(-20, 50, 50)
H, xedges, yedges = np.histogram2d(y, x, bins=(ybins, xbins))
norm = np.sum(H, axis = 0)
H /= norm
ax.pcolor(xbins, ybins, np.nan_to_num(H), vmax=.4)
plt.show()
I can plot the histogram, but I want a boxplot:
binning = np.concatenate(([0], np.sort(np.random.random(20) * 25), [25]))
idx = np.digitize(x, binning)
# np.digitize returns 1..len(binning)-1 for values inside the bins, one group per midpoint
data_to_plot = [y[idx == i] for i in range(1, len(binning))]
f, ax = plt.subplots()
midpoints = 0.5 * (binning[1:] + binning[:-1])
widths = 0.9 * (binning[1:] - binning[:-1])
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
majorLocator = MultipleLocator(2)
ax.boxplot(data_to_plot, positions = midpoints, widths=widths)
ax.set_xlim(0, 25)
ax.xaxis.set_major_locator(majorLocator)
ax.set_xlabel('x')
ax.set_ylabel('median(y)')
plt.show()
Is there an automatic way to do that, like ax.magic(x, y, binning)? Is there a better way to do that? (Have a look at https://root.cern.ch/root/html/TProfile.html for example, which plots the mean and the error of the mean as error bars.)
In addition, I want to minimize the memory footprint (my real data are much larger than 100000 points). I am worried about data_to_plot: is it a copy?
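On the copy question: boolean-mask indexing such as y[idx == i] returns a new array in NumPy, so data_to_plot does hold copies of the data. For a TProfile-like summary (mean and error of the mean per bin) a minimal sketch with np.bincount avoids building one array per bin, although it still creates a few full-length temporaries. The function name profile is hypothetical, not a matplotlib or NumPy API, and the Agg backend is only chosen so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

def profile(x, y, edges):
    """Per-bin count, mean of y and error of the mean (hypothetical helper)."""
    edges = np.asarray(edges)
    n = len(edges) - 1
    idx = np.digitize(x, edges) - 1            # 0..n-1 for values inside the bins
    inside = (idx >= 0) & (idx < n)            # drop under- and overflow
    idx, yv = idx[inside], np.asarray(y)[inside]
    count = np.bincount(idx, minlength=n)
    s1 = np.bincount(idx, weights=yv, minlength=n)
    s2 = np.bincount(idx, weights=yv**2, minlength=n)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = s1 / count                      # NaN for empty bins
        var = s2 / count - mean**2             # population variance per bin
        sem = np.sqrt(np.maximum(var, 0) / count)
    return count, mean, sem

x = np.random.chisquare(5, size=100000)
y = np.random.normal(size=100000) / (0.05 * x + 0.1) + 2 * x
binning = np.linspace(0, 25, 26)
count, mean, sem = profile(x, y, binning)

midpoints = 0.5 * (binning[1:] + binning[:-1])
f, ax = plt.subplots()
ax.errorbar(midpoints, mean, yerr=sem, fmt='o')
ax.set_xlabel('x')
ax.set_ylabel('mean(y)')
```

This only gives the mean and its error; quartiles for a real boxplot would still need the values of each bin, for example one bin at a time.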