x = [0.0000000,0.0082707,0.0132000, 0.0255597, 0.0503554, 0.0751941, 0.1000570,
0.1498328, 0.1996558, 0.2495240, 0.2994312, 0.3993490, 0.4993711, 0.5994664,
0.6996058, 0.7997553, 0.8998927, 0.9499514, 1.0000000, 0.0000000, 0.006114,
0.0062188, 0.0087532, 0.0138088, 0.0264052, 0.0515127, 0.0765762, 0.1016176,
0.1516652, 0.2016828, 0.2516733, 0.3016387, 0.4015163, 0.5013438, 0.6011363,
0.7008976, 0.8006328, 0.9003380, 0.9501740, 1.0000000]
y = [0.0000000, 0.0233088, 0.0298517, 0.0425630, 0.0603942, 0.0739301, 0.0850687,
0.1023515, 0.1149395, 0.1230325, 0.1272298, 0.1253360, 0.1130538, 0.0934796,
0.0695104, 0.0445423, 0.0207728, 0.0098870, 0.0000000, 0.0000000, -.0208973,
-.0210669, -.0247377, -.0307807, -.0416431, -.0548774, -.0637165, -.0703581,
-.0801452, -.0869356, -.0910290, -.0926252, -.0905235, -.0834273, -.0728351,
-.0591463, -.0428603, -.0235778, -.0122883, 0.0000000]
plt.plot(x,y)
This is data I scraped from a website. When I plot it, a line runs from (1,0) back to (0,0), which I don't want. I've tried removing points by hand to see which ones cause it, but I can't get it to work. The points (0,0) and (1,0) each appear twice in the data. How can I remove the line between (0,0) and (1,0), and can this be automated with an if statement?
If you look carefully at your data, there is a discontinuity: the x values increase, drop back to 0, and then increase again. That jump is what draws the horizontal line. You could loop through x and y, detect where x decreases instead of increasing, and split the two lists at that index. np.diff does the same with less code: in this specific case there is only one drop, so only one diff value is negative, and argmin finds it. Here's my code. It does not generalize to more than one discontinuity, but it fixes your case.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [0.0000000,0.0082707,0.0132000, 0.0255597, 0.0503554, 0.0751941, 0.1000570,
0.1498328, 0.1996558, 0.2495240, 0.2994312, 0.3993490, 0.4993711, 0.5994664,
0.6996058, 0.7997553, 0.8998927, 0.9499514, 1.0000000, 0.0000000, 0.006114,
0.0062188, 0.0087532, 0.0138088, 0.0264052, 0.0515127, 0.0765762, 0.1016176,
0.1516652, 0.2016828, 0.2516733, 0.3016387, 0.4015163, 0.5013438, 0.6011363,
0.7008976, 0.8006328, 0.9003380, 0.9501740, 1.0000000]
y = [0.0000000, 0.0233088, 0.0298517, 0.0425630, 0.0603942, 0.0739301, 0.0850687,
0.1023515, 0.1149395, 0.1230325, 0.1272298, 0.1253360, 0.1130538, 0.0934796,
0.0695104, 0.0445423, 0.0207728, 0.0098870, 0.0000000, 0.0000000, -.0208973,
-.0210669, -.0247377, -.0307807, -.0416431, -.0548774, -.0637165, -.0703581,
-.0801452, -.0869356, -.0910290, -.0926252, -.0905235, -.0834273, -.0728351,
-.0591463, -.0428603, -.0235778, -.0122883, 0.0000000]
discont = np.diff(x).argmin()
x1 = x[:discont+1]
y1 = y[:discont+1]
x2 = x[discont+1:]
y2 = y[discont+1:]
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.show()
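If the data can contain more than one drop, the same np.diff idea can be extended by splitting at every negative difference. The following is only a sketch under that assumption; the helper name split_at_drops and the short sample lists are made up for illustration and are not part of the original data.

```python
import numpy as np

def split_at_drops(x, y):
    """Split x and y into segments wherever x decreases (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # index of the first element of each new segment
    breaks = np.where(np.diff(x) < 0)[0] + 1
    return list(zip(np.split(x, breaks), np.split(y, breaks)))

# made-up sample with two drops in x
xs = [0.0, 0.5, 1.0, 0.0, 0.4, 1.0, 0.2, 0.9]
ys = [0.0, 0.1, 0.0, 0.0, -0.1, 0.0, 0.3, 0.3]
segments = split_at_drops(xs, ys)
```

Each segment can then be drawn separately, for example with a loop such as `for sx, sy in segments: plt.plot(sx, sy)`, so that no line joins the end of one segment to the start of the next.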
I'm trying to do some comparative analysis for a publication. I came across seaborn and pandas and really like how easily I can create the analysis I want. However, I find the manuals a bit thin on the things I'm trying to understand about the example plots and on how to modify the plots to my needs. I'm hoping for some advice here on how to get the plots I want. Perhaps pandas/seaborn is not what I need.
So, I would like to create subplots, (3,1) or (2,3), of the following figure:
Questions:
I would like the attached plot to have a title on the colorbar. I'm not sure exactly what is shown, i.e., is it a relative frequency, an occurrence count, a percentage, etc.? How can I put an explanatory title on the colorbar (oriented vertically)?
The text is a nice addition. pearsonr is the correlation, but I'm not sure what p is. My guess is that it shows the lag? If so, how can I remove the p from the text?
I would like to make the same kind of figure for different variables and put it all in a subplot.
Here's the code I pieced together from the seaborn manual/examples and from other users here on SO (thanks guys).
import netCDF4 as nc
import pandas as pd
import xarray as xr
import numpy as np
import seaborn as sns
import pdb
import matplotlib.pyplot as plt
from scipy import stats, integrate
import matplotlib as mpl
import matplotlib.ticker as tkr
import matplotlib.gridspec as gridspec
sns.set(style="white")
sns.set(color_codes=True)
octp = [622.0, 640.0, 616.0, 731.0, 668.0, 631.0, 641.0, 589.0, 801.0,
828.0, 598.0, 742.0,665.0, 611.0, 773.0, 608.0, 734.0, 725.0, 716.0,
699.0, 686.0, 671.0, 700.0, 656.0,686.0, 675.0, 678.0, 653.0, 659.0,
682.0, 674.0, 684.0, 679.0, 704.0, 624.0, 727.0,739.0, 662.0, 801.0,
633.0, 896.0, 729.0, 659.0, 741.0, 510.0, 836.0, 720.0, 685.0,430.0,
833.0, 710.0, 799.0, 534.0, 532.0, 605.0, 519.0, 850.0, 357.0, 858.0,
497.0,404.0, 456.0, 448.0, 836.0, 462.0, 381.0, 499.0, 673.0, 642.0,
641.0, 458.0, 809.0,562.0, 742.0, 732.0, 710.0, 658.0, 533.0, 811.0,
853.0, 856.0, 785.0, 659.0, 697.0,654.0, 673.0, 707.0, 711.0, 423.0,
751.0, 761.0, 638.0, 576.0, 538.0, 596.0, 718.0,843.0, 640.0, 647.0,
692.0, 599.0, 607.0, 537.0, 679.0, 712.0, 612.0, 641.0, 665.0,658.0,
722.0, 656.0, 656.0, 742.0, 505.0, 688.0, 805.0]
cctp = [482.0, 462.0, 425.0, 506.0, 500.0, 464.0, 486.0, 473.0, 577.0,
735.0, 390.0, 590.0,464.0, 417.0, 722.0, 410.0, 679.0, 680.0, 711.0,
658.0, 687.0, 621.0, 643.0, 690.0,630.0, 661.0, 608.0, 658.0, 624.0,
646.0, 651.0, 634.0, 612.0, 636.0, 607.0, 539.0,706.0, 614.0, 706.0,
401.0, 720.0, 746.0, 511.0, 700.0, 453.0, 677.0, 637.0, 605.0,454.0,
733.0, 535.0, 725.0, 668.0, 513.0, 470.0, 589.0, 765.0, 596.0, 749.0,
462.0,469.0, 514.0, 511.0, 789.0, 647.0, 324.0, 555.0, 670.0, 656.0,
786.0, 374.0, 757.0,645.0, 744.0, 708.0, 497.0, 654.0, 288.0, 705.0,
703.0, 446.0, 675.0, 440.0, 652.0,589.0, 542.0, 661.0, 631.0, 343.0,
585.0, 632.0, 591.0, 602.0, 365.0, 535.0, 663.0,561.0, 448.0, 582.0,
591.0, 535.0, 475.0, 422.0, 599.0, 594.0, 569.0, 576.0, 622.0,483.0,
539.0, 515.0, 621.0, 443.0, 435.0, 502.0, 443.0]
cctp = pd.Series(cctp, name='CTP [hPa]')
octp = pd.Series(octp, name='CTP [hPa]')
formatter = tkr.ScalarFormatter(useMathText=True)
formatter.set_scientific(True)
formatter.set_powerlimits((-2, 2))
g = sns.jointplot(cctp,octp, kind="kde",size=8,space=0.2,cbar=True,
n_levels=50,cbar_kws={"format": formatter})
# add a line x=y
x0, x1 = g.ax_joint.get_xlim()
y0, y1 = g.ax_joint.get_ylim()
lims = [max(x0, y0), min(x1, y1)]
g.ax_joint.plot(lims, lims, ':k')
# save before show(): once the window is closed the figure may be empty
plt.savefig('test_fig.png')
plt.show()
I know I'm asking a lot here. So I put the questions in order of priority.
1: To set the colorbar label, you can add the label key to the cbar_kws dict:
cbar_kws={"format": formatter, "label": 'My colorbar'}
2: To change the stats label, you need to first slightly modify the stats.pearsonr function to only return the first value, instead of the (pearsonr, p) tuple:
pr = lambda a, b: stats.pearsonr(a, b)[0]
Then pass that function through jointplot's stat_func kwarg:
stat_func=pr
and finally, you need to change the annotation to get the label right:
annot_kws={'stat':'pearsonr'}
Putting that all together:
pr = lambda a, b: stats.pearsonr(a, b)[0]
g = sns.jointplot(cctp,octp, kind="kde",size=8,space=0.2,cbar=True,
n_levels=50,cbar_kws={"format": formatter, "label": 'My colorbar'},
stat_func=pr, annot_kws={'stat':'pearsonr'})
3: I don't think it's possible to put everything in a subplot with jointplot. Happy to be proven wrong there though.
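On point 2, it may help to see what stats.pearsonr actually returns: the correlation coefficient together with a p-value for the null hypothesis of no correlation, so p is not a lag. The sketch below uses made-up random data (the arrays a and b and the seed are assumptions for illustration) and shows that the wrapper keeps only the coefficient.

```python
import numpy as np
from scipy import stats

# made-up correlated sample, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)

# pearsonr returns the correlation coefficient and its p-value
r, p = stats.pearsonr(a, b)

# wrapper that keeps only the coefficient, as in the answer above
pr = lambda u, v: stats.pearsonr(u, v)[0]
r_only = pr(a, b)
```

Here r_only equals r, and p is a probability between 0 and 1 that is dropped by the wrapper.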
I'm trying to come up with a function that plots n points inside the unit circle, but I need them to be sufficiently spread out.
ie. something that looks like this:
Is it possible to write a function with two parameters, n (number of points) and min_d (minimum distance apart) such that the points are:
a) equidistant
b) no pairwise distance falls below a given min_d
The problem with sampling from a uniform distribution is that it could happen that two points are almost on top of each other, which I do not want to happen. I need this kind of input for a network diagram representing node clusters.
EDIT: I have found an answer to a) here: Generator of evenly spaced points in a circle in python, but b) still eludes me.
At the time this answer was provided, the question asked for random numbers. This answer thus gives a solution drawing random numbers. It ignores any edits made to the question afterwards.
One may simply draw random points and check for each one whether the minimum-distance condition is fulfilled. If not, the point is discarded. This is repeated until the list holds enough points or some break condition is met.
import numpy as np
import matplotlib.pyplot as plt
class Points():
    def __init__(self, n=10, r=1, center=(0,0), mindist=0.2, maxtrials=1000):
        self.success = False
        self.n = n
        self.r = r
        self.center = np.array(center)
        self.d = mindist
        # placeholder positions far outside the circle; filtered out in fill()
        self.points = np.ones((self.n,2))*10*r+self.center
        self.c = 0
        self.trials = 0
        self.maxtrials = maxtrials
        self.tx = "rad: {}, center: {}, min. dist: {} ".format(self.r, center, self.d)
        self.fill()

    def dist(self, p, x):
        if len(p.shape) > 1:
            return np.sqrt(np.sum((p-x)**2, axis=1))
        else:
            return np.sqrt(np.sum((p-x)**2))

    def newpoint(self):
        # candidate drawn uniformly from the bounding square of the circle
        x = (np.random.rand(2)-0.5)*2
        x = x*self.r+self.center
        if self.dist(self.center, x) < self.r:
            self.trials += 1
            # accept only if far enough from every point accepted so far
            if np.all(self.dist(self.points, x) > self.d):
                self.points[self.c,:] = x
                self.c += 1

    def fill(self):
        while self.trials < self.maxtrials and self.c < self.n:
            self.newpoint()
        # drop the unused placeholder rows
        self.points = self.points[self.dist(self.points,self.center) < self.r,:]
        if len(self.points) == self.n:
            self.success = True
        self.tx +="\n{} of {} found ({} trials)".format(len(self.points),self.n,self.trials)

    def __repr__(self):
        return self.tx
center =(0,0)
radius = 1
p = Points(n=40,r=radius, center=center)
fig, ax = plt.subplots()
x,y = p.points[:,0], p.points[:,1]
plt.scatter(x,y)
ax.add_patch(plt.Circle(center, radius, fill=False))
ax.set_title(p)
ax.relim()
ax.autoscale_view()
ax.set_aspect("equal")
plt.show()
If the number of points should be fixed, you may run the search for decreasing distances until the desired number of points is found.
In the following case, we look for 60 points and start with a minimum distance of 0.6, which we decrease stepwise by 0.05 until a solution is found. Note that this will not necessarily be the optimum solution, as each step gets only maxtrials attempts. Increasing maxtrials will of course bring us closer to the optimum but requires more runtime.
center =(0,0)
radius = 1
mindist = 0.6
step = 0.05
success = False
while not success:
    mindist -= step
    p = Points(n=60,r=radius, center=center, mindist=mindist)
    print(p)
    if p.success:
        break
fig, ax = plt.subplots()
x,y = p.points[:,0], p.points[:,1]
plt.scatter(x,y)
ax.add_patch(plt.Circle(center, radius, fill=False))
ax.set_title(p)
ax.relim()
ax.autoscale_view()
ax.set_aspect("equal")
plt.show()
Here the solution is found for a minimum distance of 0.15.
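The same rejection idea can also be written as a plain function, together with a check that the result really respects the minimum distance. This is only a sketch, not a replacement for the class above: the names sample_disc and min_pairwise_distance, the seed argument and the default max_trials are assumptions made for the example, and the circle is centred on the origin.

```python
import numpy as np

def sample_disc(n, min_d, r=1.0, max_trials=10000, seed=None):
    """Rejection-sample up to n points in a disc of radius r around the origin."""
    rng = np.random.default_rng(seed)
    pts = []
    for _ in range(max_trials):
        if len(pts) == n:
            break
        cand = rng.uniform(-r, r, size=2)
        if np.hypot(*cand) >= r:
            continue  # outside the circle
        # keep the candidate only if it is far enough from all accepted points
        if all(np.hypot(*(cand - q)) >= min_d for q in pts):
            pts.append(cand)
    return np.array(pts).reshape(-1, 2)

def min_pairwise_distance(pts):
    """Smallest distance between any two distinct points."""
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(pts), k=1)
    return d[upper].min()

pts = sample_disc(n=20, min_d=0.2, seed=0)
closest = min_pairwise_distance(pts)
```

The function may return fewer than n points if max_trials runs out, so the caller should compare len(pts) with n, much like the success flag in the class.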
I am trying to plot two columns of raw data (I have used melt to combine them into one data frame) and then add separate error bars for each. However, I want to make the raw data for each column one pair of colors and the error bars another set of colors, but I can't seem to get it to work. The plot I am getting is at the link below. I want to have different color pairs for the raw data and for the error bars. A simple reproducible example is coded below, for illustrative purposes.
dat2.m<-data.frame(obs=c(2,4,6,8,12,16,2,4,6),variable=c("raw","raw","raw","ip","raw","ip","raw","ip","ip"),value=runif(9,0,10))
c <- ggplot(dat2.m, aes(x=obs, y=value, color=variable,fill=variable,size = 0.02)) +geom_jitter(size=1.25) + scale_colour_manual(values = c("blue","Red"))
c<- c+stat_summary(fun.data="median_hilow",fun.args=(conf.int=0.95),aes(color=variable), position="dodge",geom="errorbar", size=0.5,lty=1)
print(c)
[1]: http://i.stack.imgur.com/A5KHk.jpg
For the record: I think that this is a really, really bad idea. Unless you have a use case where this is crucial, I think you should re-examine your plan.
However, you can get around it by adding a new set of variables, padded with a space at the end. You will want/need to play around with the legends, but this should work (though it is definitely ugly):
dat2.m<- data.frame(obs=c(2,4,6,8,12,16,2,4,6),variable=c("raw","raw","raw","ip","raw","ip","raw","ip","ip"),value=runif(9,0,10))
c <- ggplot(dat2.m, aes(x=obs, y=value, color=variable,fill=variable,size = 0.02)) +geom_jitter(size=1.25) + scale_colour_manual(values = c("blue","Red","green","purple"))
c<- c+stat_summary(fun.data="median_hilow",fun.args=(conf.int=0.95),aes(color=paste(variable," ")), position="dodge",geom="errorbar", size=0.5,lty=1)
print(c)
One way around this would be to use repeated calls to geom_jitter and stat_summary. Use the data argument of those functions to feed subsets of your dataset into each call, and set the color attribute outside of aes(). It's repetitive and somewhat defeats the compactness of ggplot, but it'd do.
c <- ggplot(dat2.m, aes(x = obs, y = value, size = 0.02)) +
  geom_jitter(data = subset(dat2.m, variable == 'raw'), color = 'blue', size = 1.25) +
  geom_jitter(data = subset(dat2.m, variable == 'ip'), color = 'red', size = 1.25) +
  stat_summary(data = subset(dat2.m, variable == 'raw'), fun.data = "median_hilow",
               fun.args = list(conf.int = 0.95), color = 'pink',
               position = "dodge", geom = "errorbar", size = 0.5, lty = 1) +
  stat_summary(data = subset(dat2.m, variable == 'ip'), fun.data = "median_hilow",
               fun.args = list(conf.int = 0.95), color = 'green',
               position = "dodge", geom = "errorbar", size = 0.5, lty = 1)
print(c)
I'm trying to regrid a numpy array onto a new grid. In this specific case, I'm trying to regrid a power spectrum onto a logarithmic grid so that the data are evenly spaced logarithmically for plotting purposes.
Doing this with straight interpolation using np.interp results in some of the original data being ignored entirely. Using digitize gets the result I want, but I have to use some ugly loops to get it to work:
import numpy as np
import matplotlib.pyplot as plt
xfreq = np.fft.fftfreq(100)[1:50] # only positive, nonzero freqs
psw = np.arange(xfreq.size) # dummy array for MWE
# new logarithmic grid
logfreq = np.logspace(np.log10(np.min(xfreq)), np.log10(np.max(xfreq)), 100)
inds = np.digitize(xfreq,logfreq)
# interpolation: ignores data *but* populates all points
logpsw = np.interp(logfreq, xfreq, psw)
# so average down where available...
logpsw[np.unique(inds)] = [psw[inds==i].mean() for i in np.unique(inds)]
# the new plot
plt.loglog(logfreq, logpsw, linewidth=0.5, color='k')
Is there a nicer way to accomplish this in numpy? I'd be satisfied with just a replacement of the inline loop step.
You can use bincount() twice to calculate the average value of each bin:
logpsw2 = np.interp(logfreq, xfreq, psw)
counts = np.bincount(inds)
mask = counts != 0
logpsw2[mask] = np.bincount(inds, psw)[mask] / counts[mask]
or use unique(inds, return_inverse=True) and bincount() twice:
logpsw4 = np.interp(logfreq, xfreq, psw)
uinds, inv_index = np.unique(inds, return_inverse=True)
logpsw4[uinds] = np.bincount(inv_index, psw) / np.bincount(inv_index)
Or if you use Pandas:
import pandas as pd
logpsw4 = np.interp(logfreq, xfreq, psw)
s = pd.Series(psw).groupby(inds).mean()
logpsw4[s.index] = s.values
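To see why calling bincount() twice gives per-bin averages, here is a tiny worked example. The index and value arrays are made up for illustration; minlength is used so that both results have one entry per bin, including empty bins.

```python
import numpy as np

# made-up bin indices and values
inds = np.array([0, 0, 2, 2, 2, 5])
vals = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 10.0])
nbins = 6

counts = np.bincount(inds, minlength=nbins)              # entries per bin
sums = np.bincount(inds, weights=vals, minlength=nbins)  # sum of values per bin

# average only where a bin is populated; empty bins stay NaN
means = np.full(nbins, np.nan)
mask = counts > 0
means[mask] = sums[mask] / counts[mask]
```

Bins 0, 2 and 5 end up with the averages 2.0, 4.0 and 10.0, while the empty bins stay NaN. In the answers above, those empty bins keep the interpolated values instead.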