Dotted line style from non-evenly distributed data - matplotlib

I'm new to Python and Matplotlib.
This is my first post on Stack Overflow - I've been unable to find the answer elsewhere and would be grateful for your help.
I'm using Windows XP, with Enthought Canopy v1.1.1 (32 bit).
I want to plot a dotted-style linear regression line through a scatter plot of data, where both x and y arrays contain random floating point data.
The dots in the resulting dotted line are not distributed evenly along the regression line, and are "smeared together" in the middle of the red line, making it look messy (see upper plot resulting from attached minimal example code).
This does not seem to occur if the items in the array of x values are evenly distributed (lower plot).
I'm therefore guessing that this is an issue with how Matplotlib renders dotted lines, or with how Canopy interfaces Python with Matplotlib.
Please could you suggest a workaround that makes the dots of the dotted line appear evenly distributed, even if both the x and y data are unevenly distributed, whilst still using Canopy and Matplotlib?
(As a general point, I'm always keen to improve my coding skills - if any code in my example can be written more neatly or concisely, I'd be grateful for your expertise).
Many thanks in anticipation
Dave
(UK)
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
#generate data
x1=10 * np.random.random_sample((40))
x2=np.linspace(0,10,40)
y=5 * np.random.random_sample((40))
slope, intercept, r_value, p_value, std_err = stats.linregress(x1,y)
line = (slope*x1)+intercept
plt.figure(1)
plt.subplot(211)
plt.scatter(x1,y,color='blue', marker='o')
plt.plot(x1,line,'r:',label="Regression Line")
plt.legend(loc='upper right')
slope, intercept, r_value, p_value, std_err = stats.linregress(x2,y)
line = (slope*x2)+intercept
plt.subplot(212)
plt.scatter(x2,y,color='blue', marker='o')
plt.plot(x2,line,'r:',label="Regression Line")
plt.legend(loc='upper right')
plt.show()

Welcome to SO.
You have already identified the problem yourself, but seem a bit surprised that a random x-array results in a 'cluttered' line. Because x1 is unsorted, the dotted line is drawn back and forth over the same locations, so it seems like normal behavior to me that it gets smeared wherever several dotted segments lie on top of each other.
If you don't want that, you can sort your array and use that to calculate the regression line and plot it. Since it's a linear regression, just using the min and max values would also work.
x1_sorted = np.sort(x1)
line = (slope * x1_sorted) + intercept
or
x1_extremes = np.array([x1.min(),x1.max()])
line = (slope * x1_extremes) + intercept
The latter should be faster if x1 becomes very large, since only two points are plotted.
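As a rough, self-contained sketch of the min/max variant: it uses np.polyfit in place of stats.linregress (an assumption, made only to avoid the SciPy dependency; for a straight-line fit the slope and intercept should agree), and a seeded generator so the run is repeatable. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Unevenly distributed sample data, as in the question
rng = np.random.default_rng(0)
x1 = 10 * rng.random(40)
y = 5 * rng.random(40)

# Degree-1 polyfit returns (slope, intercept); stand-in for stats.linregress
slope, intercept = np.polyfit(x1, y, 1)

# A straight line only needs its two end points
x_ends = np.array([x1.min(), x1.max()])
y_ends = slope * x_ends + intercept

fig, ax = plt.subplots()
ax.scatter(x1, y, color="blue", marker="o")
(reg_line,) = ax.plot(x_ends, y_ends, "r:", label="Regression Line")
ax.legend(loc="upper right")
# call plt.show() or fig.savefig(...) here to look at the result
```

Because the line is drawn once between two points, the dot pattern is laid out along a single segment and cannot overlap itself.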
With regard to your last comment: in your example you use what's called the 'state-machine' environment for plotting. It means that the commands you issue are applied to the active figure and the active axes (subplot).
You can also consider the OO approach, where you get figure and axes objects. This means you can access any figure or axes at any time, not just the active one. It's useful when passing an axes to a function, for example.
In your example both would work equally well, so it is more a matter of taste.
A small example:
# create a figure with 2 subplots (2 rows, 1 column)
fig, axs = plt.subplots(2,1)
# plot in the first subplots
axs[0].scatter(x1,y,color='blue', marker='o')
axs[0].plot(x1,line,'r:',label="Regression Line")
# plot in the second
axs[1].plot()
etc...
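To illustrate the point about passing an axes to a function, here is a minimal sketch. The helper name plot_with_fit is hypothetical, and np.polyfit is again used as a stand-in for stats.linregress; treat it as one possible layout rather than the recommended one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np


def plot_with_fit(ax, x, y):
    """Hypothetical helper: scatter x/y on the given axes and add a dotted fit line."""
    slope, intercept = np.polyfit(x, y, 1)
    x_ends = np.array([x.min(), x.max()])
    ax.scatter(x, y, color="blue", marker="o")
    ax.plot(x_ends, slope * x_ends + intercept, "r:", label="Regression Line")
    ax.legend(loc="upper right")


rng = np.random.default_rng(1)
x1 = 10 * rng.random(40)        # unevenly distributed x values
x2 = np.linspace(0, 10, 40)     # evenly distributed x values
y = 5 * rng.random(40)

# One figure, two rows; each axes is handed to the helper explicitly
fig, axs = plt.subplots(2, 1)
plot_with_fit(axs[0], x1, y)
plot_with_fit(axs[1], x2, y)
```

The function never touches the "active" axes, so it can be reused for any subplot in any figure.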

Related

How to gain control over annotate arrows

I'm trying to insert arrows (brackets) in plots using the annotate package, but I cannot figure out what the input parameters mean. I read the documentation and I'm still unsure of how to control the arrows. Here's an example starting point:
import matplotlib.pyplot as pl
import numpy as np
fig = pl.figure(figsize=(3.25, 2.5))
ax0 = fig.add_subplot(111)
x, y = np.arange(10), np.arange(10) * -1
for offset in range(5):
    ax0.plot(x + offset, y, lw=1)
# add annotation arrow
bbox = dict(facecolor="w",
            alpha=0.95,
            ls="None",
            boxstyle="round",
            pad=0.1)
ax0.annotate(text="Example",
             xy=(7.5, -5),
             xytext=(0, -9),
             arrowprops=dict(arrowstyle="-[",
                             linewidth=1,
                             connectionstyle="arc,armA=90,angleA=0,angleB=-40,armB=85,rad=0"),
             verticalalignment="bottom",
             horizontalalignment="left",
             fontsize=8,
             bbox=bbox)
fig.show()
I want the bracket to span the width of all of the drawn lines (as if to say "these" lines are what the annotation refers to), but I cannot figure out how to change the bracket width.
Another issue is interpreting armA and armB (the arrow lines currently look ugly). I understand these refer to the length of line segments, but I cannot figure out what the units are (pixels?), much less how to automate generating their lengths.
Can you please provide guidance on how to adjust the width of the bracket and what the connectionstyle parameters mean? If this is documented somewhere I would appreciate the reference (even if it comes with a RTFM-type comment).
I think the parameter you want is mutation_scale.
I changed your annotate command to this and I think it looks reasonable now, but it took some manual adjustment. If you had a consistent pattern in multiple figures you could probably calculate the angles and lengths that you want and use them as inputs, but for your example this seems to work reasonably well.
ax0.annotate(text="Example",
             xy=(8.5, -6.5),
             xytext=(0, -9),
             arrowprops=dict(arrowstyle="-[",
                             linewidth=1,
                             mutation_scale=22,
                             connectionstyle="arc,armA=70,angleA=0,angleB=-45,armB=50,rad=0"),
             verticalalignment="bottom",
             horizontalalignment="left",
             fontsize=8,
             bbox=bbox)
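Besides mutation_scale, the "-[" arrow style accepts widthB and lengthB attributes in the style string, which is another way to change the bracket size. The sketch below is an illustration under assumptions, not a tuned result: the widthB and lengthB values are arbitrary, and my understanding is that both are multiplied by mutation_scale and that armA/armB are interpreted in display (pixel) space, which is worth verifying against the matplotlib documentation.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

fig, ax0 = plt.subplots(figsize=(3.25, 2.5))
x, y = np.arange(10), -np.arange(10)
for offset in range(5):
    ax0.plot(x + offset, y, lw=1)

ann = ax0.annotate(
    "Example",
    xy=(8.5, -6.5),
    xytext=(0, -9),
    arrowprops=dict(
        # widthB / lengthB set the bracket's width and depth (arbitrary values)
        arrowstyle="-[,widthB=1.5,lengthB=0.3",
        mutation_scale=22,
        linewidth=1,
        connectionstyle="arc,armA=70,angleA=0,angleB=-45,armB=50,rad=0",
    ),
    verticalalignment="bottom",
    horizontalalignment="left",
    fontsize=8,
)
# call plt.show() or fig.savefig(...) here and adjust widthB by eye
```

Keeping mutation_scale fixed and varying only widthB changes the bracket width without also rescaling its depth.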

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right); however, for some the x axis collapses slightly (see image left) or strongly (see image middle).
https://i.imgur.com/dxLR4B4.png
Looking into how seaborn plots a box/violin plot (https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py, line 961):
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any way to prevent this collapsing by code only?
I do not want to modify my dataframe (replace the NaNs by zero).
Below you find my code:
boxp_df=pd.read_csv(pf_in,sep="\t",skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a differently sized plot per cancer type
(depending on whether any category is completely NaN).
I am expecting each plot to have the same width.
Update
trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps ?
|Cat1|Cat2|Cat3|Cat4|Cat5
|3.93| |0.52| |6.01
|3.34| |0.89| |2.89
|3.39| |1.96| |4.63
|1.59| |3.66| |3.75
|2.73| |0.39| |2.87
|0.08| |1.25| |-0.27
Update
Apparently, the problem is not the data but the length of the title:
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question.
@Diziet should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])
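If order= does not fit your case, one alternative (not taken from the answer above, and using plain matplotlib rather than seaborn) is to draw each non-empty category at a fixed position and pin the x limits, so that all-NaN categories keep their slot. This is only a sketch built on the toy table from the question; the dictionary layout is an assumption for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

nan = np.nan
# Toy data from the question; Cat2 and Cat4 are entirely NaN
data = {
    "Cat1": [3.93, 3.34, 3.39, 1.59, 2.73, 0.08],
    "Cat2": [nan] * 6,
    "Cat3": [0.52, 0.89, 1.96, 3.66, 0.39, 1.25],
    "Cat4": [nan] * 6,
    "Cat5": [6.01, 2.89, 4.63, 3.75, 2.87, -0.27],
}
names = list(data)

fig, ax = plt.subplots(figsize=(10, 10))
for pos, name in enumerate(names):
    values = np.asarray(data[name], dtype=float)
    values = values[~np.isnan(values)]   # drop NaNs without touching the source data
    if values.size:                      # skip empty categories but keep their position
        ax.violinplot(values, positions=[pos])

# Fix ticks and limits so every figure has the same x layout
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=-45)
ax.set_xlim(-0.5, len(names) - 0.5)
```

Since the x limits depend only on the number of categories, every cancer type gets the same horizontal layout regardless of how many columns are empty.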

Difference between matplotlib.contourf and matlab.contourf() - odd sharp edges in matplotlib

I am a recent migrant from Matlab to Python and have recently started working with NumPy and Matplotlib. I recoded one of my Matlab scripts, which uses Matlab's contourf function, in Python using matplotlib's corresponding contourf function. I managed to replicate the output in Python, except that the contourf plots are not exactly the same, for a reason unknown to me.
When I run contourf in matplotlib, I get an otherwise nice figure, but it has sharp edges on the contour levels at the top and bottom, which should not be there (see Figure 1 below, matplotlib output). When I export the arrays I used in Python to Matlab (i.e. exactly the same data set that generated the matplotlib contourf plot) and use Matlab's contourf function, I get a slightly different output, without those sharp contour-level edges (see Figure 2 below, Matlab output). I used the same number of levels in both figures. In Figure 3 I have made a scatter plot of the same data, which shows that the data contain no such sharp edges as appear in the contourf plot (I added contour lines just for reference).
An example data set can be downloaded through the Dropbox link given below. It contains three txt files: X, Y and Z. Each of them is a 500x500 array that can be used directly with contourf(), i.e. plt.contourf(X,Y,Z,...). The code I used was:
plt.contourf(X,Y,Z,10, cmap=plt.cm.jet)
plt.contour(X,Y,Z,10,colors='black', linewidths=0.5)
plt.axis('equal')
plt.axis('off')
Does anyone have an idea why this happens? I would appreciate any insight on this!
Cheers,
Jussi
Below are the details of my setup:
Python 3.7.0
IPython 6.5.0
matplotlib 2.2.3
Figure 1: Matplotlib output
Figure 2: Matlab output
Figure 3: Matplotlib scatter
Link to data set
The confusing thing about the Matlab plot is that its colorbar shows many more levels than there actually are in the plot. Hence you don't see the actual intervals that are contoured.
You would achieve the same result in matplotlib by choosing 12 instead of 11 levels.
import numpy as np
import matplotlib.pyplot as plt
X, Y, Z = [np.loadtxt("data/roundcontourdata/{}.txt".format(i)) for i in list("XYZ")]
levels = np.linspace(Z.min(), Z.max(), 12)
cntr = plt.contourf(X,Y,Z,levels, cmap=plt.cm.jet)
plt.contour(X,Y,Z,levels,colors='black', linewidths=0.5)
plt.colorbar(cntr)
plt.axis('equal')
plt.axis('off')
plt.show()
So in conclusion, both plots are correct and show the same data. Only the automatically chosen levels differ. This can be circumvented by choosing custom levels depending on the desired visual appearance.
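To see the difference between automatically chosen and explicit levels without the original data files, here is a small sketch on a synthetic field. The Gaussian bump is an assumption standing in for the X, Y, Z arrays; the point is only to compare the level values each call ends up using.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the X, Y, Z files (assumption: any smooth field will do)
x = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))

fig, (ax_auto, ax_fixed) = plt.subplots(1, 2)

# Integer argument: matplotlib chooses the level values itself
auto = ax_auto.contourf(X, Y, Z, 10, cmap=plt.cm.jet)

# Explicit boundaries: 12 boundaries delimit 11 filled bands
levels = np.linspace(Z.min(), Z.max(), 12)
fixed = ax_fixed.contourf(X, Y, Z, levels, cmap=plt.cm.jet)

# Inspect which boundaries were actually used in each case
print("automatic levels:", auto.levels)
print("explicit levels: ", fixed.levels)
```

Printing the levels attribute of each contour set is a quick way to check whether two plots of the same data are really contouring the same intervals.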

How do I use colourmaps with variable alpha in a Seaborn kdeplot without seeing the contour lines?

Python version: 3.6.4 (Anaconda on Windows)
Seaborn: 0.8.1
Matplotlib: 2.1.2
I'm trying to create a 2D Kernel Density plot using Seaborn but I want each step in the colourmap to have a different alpha value. I had a look at this question to create a matplotlib colourmap with alpha values: Add alpha to an existing matplotlib colormap.
I have a problem in that the lines between contours are visible. The result I get is here:
I thought that I had found the answer when I found this question: Hide contour linestroke on pyplot.contourf to get only fills. I tried the method outlined in the answer (using set_edgecolor("face")) but it did not work in this case. That question also seemed to be related to vector graphics formats, and I am just writing out a PNG.
Here is my script:
import numpy as np
import seaborn as sns
import matplotlib.colors as cols
import matplotlib.pyplot as plt
def alpha_cmap(cmap):
    my_cmap = cmap(np.arange(cmap.N))
    # Set a square root alpha.
    x = np.linspace(0, 1, cmap.N)
    my_cmap[:,-1] = x ** (0.5)
    my_cmap = cols.ListedColormap(my_cmap)
    return my_cmap
xs = np.random.uniform(size=100)
ys = np.random.uniform(size=100)
kplot = sns.kdeplot(data=xs, data2=ys,
                    cmap=alpha_cmap(plt.cm.viridis),
                    shade=True,
                    shade_lowest=False,
                    n_levels=30)
plt.savefig("example_plot.png")
Guided by some comments on this question I have tried some other methods that have been successful when this problem has come up. Based on this question (Matplotlib Contourf Plots Unwanted Outlines when Alpha < 1) I have tried altering the plot call to:
sns.kdeplot(data=xs, data2=ys,
            cmap=alpha_cmap(plt.cm.viridis),
            shade=True,
            shade_lowest=False,
            n_levels=30,
            antialiased=True)
With antialiased=True the lines between contours are replaced by a narrow white line:
I have also tried an approach similar to this question - Pyplot pcolormesh confused when alpha not 1. This approach is based on looping over the PathCollections in kplot.collections and tuning the parameters of the edges so that they become invisible. I have tried adding this code and tweaking the linewidth -
for thing in kplot.collections:
    thing.set_edgecolor("face")
    thing.set_linewidth(0.01)
fig.canvas.draw()
This results in a mix of white and dark lines.
I believe that I will not be able to tune the line width to make the lines disappear because of the variable width of the contour bands.
Using both methods (antialiasing + linewidth) makes this version, which looks cool but isn't quite what I want:
I also found this question - Changing Transparency of/Remove Contour Lines in Matplotlib
This one suggests overplotting a second plot with a different number of contour levels on the same axis, like:
kplot = sns.kdeplot(data=xs, data2=ys,
                    ax=ax,
                    cmap=alpha_cmap(plt.cm.viridis),
                    shade=True,
                    shade_lowest=False,
                    n_levels=30,
                    antialiased=True)
kplot = sns.kdeplot(data=xs, data2=ys,
                    ax=ax,
                    cmap=alpha_cmap(plt.cm.viridis),
                    shade=True,
                    shade_lowest=False,
                    n_levels=35,
                    antialiased=True)
This results in:
This is better, and almost works. The problem here is I need variable (and non-linear) alpha throughout the colourmap. The variable banding and lines seem to be a result of the combinations of alpha when contours are plotted over each other. I also still see some clear/white lines in the result.

Add a new axis to the right/left/top-right of an axis

How do you add an axis to the outside of another axis, keeping it within the figure as a whole? legend and colorbar both have this capability, but implemented in rather complicated (and for me, hard to reproduce) ways.
You can use the subplot command to achieve this. It can be as simple as py.subplot(2,2,1), where the first two numbers describe the geometry of the plots (2x2) and the third is the current plot number. In general it is better to be explicit, as in the following example:
import pylab as py
# Make some data
x = py.linspace(0,10,1000)
cos_x = py.cos(x)
sin_x = py.sin(x)
# Initiate a figure, there are other options in addition to figsize
fig = py.figure(figsize=(6,6))
# Plot the first set of data on ax1
ax1 = fig.add_subplot(2,1,1)
ax1.plot(x,sin_x)
# Plot the second set of data on ax2
ax2 = fig.add_subplot(2,1,2)
ax2.plot(x,cos_x)
# This final line can be used to adjust the subplots; if uncommented, it will remove all white space
#fig.subplots_adjust(left=0.13, right=0.9, top=0.9, bottom=0.12,hspace=0.0,wspace=0.0)
Notice that this means things like py.xlabel may not work as expected, since you have two axes. Instead you need to specify ax1.set_xlabel(".."), which also makes the code easier to read.
More examples can be found here.
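The per-axes labelling mentioned above can be sketched as follows. It uses matplotlib.pyplot and numpy directly instead of pylab (a substitution on my part), and the label texts are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)

fig = plt.figure(figsize=(6, 6))

# Top axes: label it through its own object rather than plt.ylabel
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(x, np.sin(x))
ax1.set_ylabel("sin(x)")

# Bottom axes: the x label is set here only, so it appears once
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(x, np.cos(x))
ax2.set_xlabel("x")
ax2.set_ylabel("cos(x)")
```

Each label call names the axes it applies to, so the result does not depend on which subplot happens to be active.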