How does numpy polyfit work? - numpy

I've created the "Precipitation Analysis" example Jupyter Notebook in the Bluemix Spark service.
Notebook Link: https://console.ng.bluemix.net/data/notebooks/3ffc43e2-d639-4895-91a7-8f1599369a86/view?access_token=effff68dbeb5f9fc0d2df20cb51bffa266748f2d177b730d5d096cb54b35e5f0
So in In[34] and In[35] (you have to scroll a lot) they use numpy polyfit to calculate the trend for given temperature data. However, I do not understand how to use it.
Can somebody explain it?

The question has been answered on Developerworks:-
https://developer.ibm.com/answers/questions/282350/how-does-numpy-polyfit-work.html
I will try to explain each statement in turn.
index = chile[chile>0.0].index => this statement returns the index labels (the years) of all entries in the chile pandas Series whose value is greater than 0.0.
fit = np.polyfit(index.astype('int'), chile[index].values,1)
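This is the polyfit call, which finds the polynomial fitting coefficients (here the slope and intercept, because the degree is 1) for the x values (years) and y values (precipitation in each year) selected by index. The coefficients are returned highest power first, so fit[0] is the slope and fit[1] is the intercept.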
print "slope: " + str(fit[0])
The code below plots the data points together with the fitted straight line to show the trend.
plt.plot(index, chile[index],'.')
In particular, the second argument in the statement below is the straight-line equation "y = mx + b", where m is the slope and b is the intercept found above using polyfit.
plt.plot(index, fit[0]*index.astype('int') + fit[1], '-', color='red')
plt.title("Precipitation Trend for Chile")
plt.xlabel("Year")
plt.ylabel("Precipitation (million cubic meters)")
plt.show()
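To see the same idea on data you can check by hand, here is a minimal sketch. The years and precipitation values are made up for illustration (they are not the notebook's data) and lie exactly on a line, so the recovered slope and intercept are easy to verify.

```python
import numpy as np

# Hypothetical data, not from the notebook: values lie exactly on y = 3*x - 5900.
years = np.arange(2000, 2010)
precip = 3.0 * years - 5900.0

# Degree 1 asks for a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(years, precip, 1)

# Evaluate the fitted line, either by hand or with np.polyval.
trend_manual = slope * years + intercept
trend_polyval = np.polyval([slope, intercept], years)

print("slope:", slope)
print("intercept:", intercept)
```

With real, noisy data the fitted line will not pass through every point; polyfit returns the least-squares best fit.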
I hope that helps.
Thanks, Charles.

Related

Difference between matplotlib.contourf and matlab.contourf() - odd sharp edges in matplotlib

I am a recent migrant from Matlab to Python and have recently started working with NumPy and Matplotlib. I recoded one of my Matlab scripts, which uses Matlab's contourf function, in Python using matplotlib's corresponding contourf function. I managed to replicate the output, except that the contourf plots are not exactly the same, for a reason unknown to me.
When I run contourf in matplotlib, I get an otherwise nice figure, but it has sharp edges on the contour levels at the top and bottom, which should not be there (see Figure 1 below, matplotlib output). When I export the arrays I used in Python to Matlab (i.e. exactly the same data set that generated the matplotlib contourf plot) and use Matlab's contourf function, I get a slightly different output, without those sharp contour-level edges (see Figure 2 below, Matlab output). I used the same number of levels in both figures. In Figure 3 I have made a scatter plot of the same data, which shows that the data contain no such sharp edges as the contourf plot shows (I added contour lines just for reference).
An example data set can be downloaded through the Dropbox link given below. It contains three txt files: X, Y and Z. Each is a 500x500 array that can be used directly with contourf(), i.e. plt.contourf(X,Y,Z,...). The code I used was
plt.contourf(X,Y,Z,10, cmap=plt.cm.jet)
plt.contour(X,Y,Z,10,colors='black', linewidths=0.5)
plt.axis('equal')
plt.axis('off')
Does anyone have an idea why this happens? I would appreciate any insight on this!
Cheers,
Jussi
Below are the details of my setup:
Python 3.7.0
IPython 6.5.0
matplotlib 2.2.3
[Figure 1: Matplotlib output]
[Figure 2: Matlab output]
[Figure 3: Matplotlib scatter plot]
[Link to data set]
The confusing thing about the Matlab plot is that its colorbar shows many more levels than there actually are in the plot. Hence you don't see the actual intervals that are contoured.
You would achieve the same result in matplotlib by choosing 12 instead of 11 levels.
import numpy as np
import matplotlib.pyplot as plt
X, Y, Z = [np.loadtxt("data/roundcontourdata/{}.txt".format(i)) for i in list("XYZ")]
levels = np.linspace(Z.min(), Z.max(), 12)
cntr = plt.contourf(X,Y,Z,levels, cmap=plt.cm.jet)
plt.contour(X,Y,Z,levels,colors='black', linewidths=0.5)
plt.colorbar(cntr)
plt.axis('equal')
plt.axis('off')
plt.show()
So in conclusion, both plots are correct and show the same data. Just the levels being automatically chosen are different. This can be circumvented by choosing custom levels depending on the desired visual appearance.
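The effect of automatic versus explicit levels can be reproduced without the original data set. The following is a minimal sketch on a synthetic Gaussian bump (an assumption for illustration, not the data from the question). It prints the levels matplotlib picks for an integer request next to the levels passed explicitly. The Agg backend is selected only so the sketch runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Synthetic stand-in for the X, Y, Z arrays of the question.
x = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))

fig, (ax_auto, ax_fixed) = plt.subplots(1, 2)

# An integer lets matplotlib choose "nice" level boundaries itself.
auto = ax_auto.contourf(X, Y, Z, 10)

# An explicit array fixes the boundaries exactly where you want them.
levels = np.linspace(Z.min(), Z.max(), 12)
fixed = ax_fixed.contourf(X, Y, Z, levels)

print("automatic levels:", auto.levels)
print("explicit levels: ", fixed.levels)
```

Comparing the two printed arrays shows which intervals are actually being filled, which is usually the quickest way to explain a visual difference between two contour plots of the same data.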

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want the cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (and you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you can with pandas directly, by passing the positions= parameter to boxplot (see doc). The catch is generating the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
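To make the positions list less opaque, here is a small sketch that builds it for the DataFrame from the question and prints it next to the (gene_id, cohort) group keys in the order groupby yields them. It only illustrates how the two line up. The plotting call itself is left out.

```python
import pandas as pd

df = pd.DataFrame({
    "gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

n_genes = len(df.gene_id.unique())
n_cohorts = len(df.cohort.unique())

# One position per (gene_id, cohort) box; all cohorts of a gene share a position.
pos = [i for i in range(n_genes) for _ in range(n_cohorts)]

# Group keys in the (sorted) order that groupby iterates over them.
keys = [key for key, _ in df.groupby(["gene_id", "cohort"])]

for key, p in zip(keys, pos):
    print(key, "->", p)
```

Each gene_id gets a single x position that its three cohort boxes share, which is why the boxes end up superimposed.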
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point rather than the summary representation of a boxplot, and with the parameter split=False the points are colored by cohort but stacked on top of each other for each gene_id (in seaborn 0.8 and later this parameter is named dodge).
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', split=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.

matplotlib/pyplot: print only ticks once in scatter plot?

I am looking for a way to clean-up the ticks in my pyplot scatter plot.
To create a scatter plot from a Pandas dataset column with strings as elements, I followed the example in [2], which gave me a nice scatter plot:
The input is 10k data points where the X axis has only ~200 unique 'names', which were mapped to scalars for plotting. Obviously, plotting all 10k ticks on the x axis makes it rather crowded. So I am looking for a way to print each unique tick only once, not once for each data point.
My code looks like:
fig2 = plt.figure()
WNsUniques, WNs = numpy.unique(taskDataFrame['modificationhost'], return_inverse=True)
scatterWNs = fig2.add_subplot(111)
scatterWNs.scatter(WNs, taskDataFrame['cpuconsumptiontime'])
scatterWNs.set(xticks=range(len(WNsUniques)), xticklabels=WNsUniques)
plt.xticks(rotation='vertical')
plt.savefig("%s_WNs-CPUTime_scatter.%s" % (dfName,"pdf"))
Actually, I was hoping that setting the plot's x ticks to the unique names would be sufficient, but apparently it is not. It is probably something easy, but how do I reduce the ticks for my subplot to the unique ones (shouldn't they already be unique, as returned by numpy.unique)?
Maybe someone has an idea for me?
Cheers and thanks,
Thomas
You can use the set_xticks method to accomplish this. Note that 200 axis ticks with labels are still quite a lot to force on a small plot like this, and this is what you might already be seeing with the above code. Without complete code to play with, I can't say for sure.
Additionally, what is the size of WNsUniques? That can easily be used to check if your call to unique is doing what you think.
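As a way to run that check, here is a minimal sketch with a handful of made-up host names and CPU times (hypothetical stand-ins for the modificationhost and cpuconsumptiontime columns). It confirms that numpy.unique returns one entry per distinct name and that the axis ends up with one tick per unique name.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the two DataFrame columns.
hosts = np.array(["wn03", "wn01", "wn02", "wn01", "wn03", "wn03"])
cpu_time = np.array([12.0, 7.5, 9.1, 8.0, 11.2, 13.4])

# uniques: sorted distinct names; codes: integer position of each row's name.
uniques, codes = np.unique(hosts, return_inverse=True)
print(len(hosts), "rows,", len(uniques), "unique names")

fig, ax = plt.subplots()
ax.scatter(codes, cpu_time)

# One tick per unique name, not one per data point.
ax.set_xticks(range(len(uniques)))
ax.set_xticklabels(uniques, rotation="vertical")
```

If len(uniques) is much larger than expected, the names probably differ in whitespace or case, which would explain an overcrowded axis.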

Dotted line style from non-evenly distributed data

I'm new to Python and Matplotlib.
This is my first posting to Stackoverflow - I've been unable to find the answer elsewhere and would be grateful for your help.
I'm using Windows XP, with Enthought Canopy v1.1.1 (32 bit).
I want to plot a dotted-style linear regression line through a scatter plot of data, where both x and y arrays contain random floating point data.
The dots in the resulting dotted line are not distributed evenly along the regression line, and are "smeared together" in the middle of the red line, making it look messy (see upper plot resulting from attached minimal example code).
This does not seem to occur if the items in the array of x values are evenly distributed (lower plot).
I'm therefore guessing that this is an issue with how Matplotlib renders dotted lines, or with how Canopy interfaces Python with Matplotlib.
Please could you tell me a workaround which will make the dots on the dotted line type appear evenly distributed; even if both x and y data are non-evenly distributed; whilst still using Canopy and Matplotlib?
(As a general point, I'm always keen to improve my coding skills - if any code in my example can be written more neatly or concisely, I'd be grateful for your expertise).
Many thanks in anticipation
Dave
(UK)
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
#generate data
x1=10 * np.random.random_sample((40))
x2=np.linspace(0,10,40)
y=5 * np.random.random_sample((40))
slope, intercept, r_value, p_value, std_err = stats.linregress(x1,y)
line = (slope*x1)+intercept
plt.figure(1)
plt.subplot(211)
plt.scatter(x1,y,color='blue', marker='o')
plt.plot(x1,line,'r:',label="Regression Line")
plt.legend(loc='upper right')
slope, intercept, r_value, p_value, std_err = stats.linregress(x2,y)
line = (slope*x2)+intercept
plt.subplot(212)
plt.scatter(x2,y,color='blue', marker='o')
plt.plot(x2,line,'r:',label="Regression Line")
plt.legend(loc='upper right')
plt.show()
Welcome to SO.
You have already identified the problem yourself, but seem a bit surprised that a random x-array makes the line look cluttered. Because x1 is unsorted, the line is drawn back and forth over the same locations, so it is normal behavior that the dots get smeared wherever several dotted segments lie on top of each other.
If you don't want that, you can sort your array and use it to calculate and plot the regression line. Since it's a linear regression, just using the min and max values would also work.
x1_sorted = np.sort(x1)
line = (slope * x1_sorted) + intercept
or
x1_extremes = np.array([x1.min(),x1.max()])
line = (slope * x1_extremes) + intercept
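Put together, the two-endpoint variant might look like the sketch below. It uses np.polyfit rather than scipy.stats.linregress for the fit, purely to keep the example NumPy-only (for a straight line both give the same slope and intercept), and the seeded random generator is an assumption made for reproducibility.

```python
import numpy as np

# Seeded random data in the same ranges as the question (the seed is an assumption).
rng = np.random.default_rng(0)
x1 = 10 * rng.random(40)
y = 5 * rng.random(40)

# Least-squares straight line; np.polyfit stands in for stats.linregress here.
slope, intercept = np.polyfit(x1, y, 1)

# Option 1: sort x so the line is drawn once from left to right.
x1_sorted = np.sort(x1)
line_sorted = slope * x1_sorted + intercept

# Option 2: a straight line only needs its two end points.
x1_extremes = np.array([x1.min(), x1.max()])
line_extremes = slope * x1_extremes + intercept

# Either pair can then be passed to plt.plot(..., 'r:') for a clean dotted line.
print(x1_extremes, line_extremes)
```

Both options describe the same line; the end points of the sorted version coincide with the two-point version.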
The last option should be faster if x1 becomes very large.
With regard to your last comment: in your example you use what's called the 'state-machine' environment for plotting. It means that commands are applied to the active figure and the active axes (subplots).
You can also consider the OO approach, where you get figure and axes objects. This means you can access any figure or axes at any time, not just the active one. It's useful when passing an axes to a function, for example.
In your example both would work equally well, and it is more a matter of taste.
A small example:
# create a figure with 2 subplots (2 rows, 1 column)
fig, axs = plt.subplots(2,1)
# plot in the first subplots
axs[0].scatter(x1,y,color='blue', marker='o')
axs[0].plot(x1,line,'r:',label="Regression Line")
# plot in the second
axs[1].plot()
etc...

matplotlib: working with range in x-axis

I'm trying to do a basic line graph here, but I can't seem to figure out how to adjust my x axis.
And here is the error I get when I try adjusting my range.
from pylab import *
plot ( range(0,11),[9,4,5,2,3,5,7,12,2,3],'.-',label='sample1' )
plot ( range(0,11),[12,5,33,2,4,5,3,3,22,10],'o-',label='sample2' )
xlabel('x axis')
ylabel('y axis')
title('my sample graphs')
legend(('sample1','sample2'))
savefig("sampleg.png",dpi=(640/8))
show()
File "C:\Python26\lib\site-packages\matplotlib\axes.py", line 228, in _xy_from_xy
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
I want my range to be a list of strings: ["12/1/2007","12/1/2008", "12/1/2009","12/1/2010"]
Any suggestions?
Honestly, I found the code online and was trying to rewrite it to properly understand it. I think I'm going to start from scratch so that I know what I'm doing but I need help on where to start.
I posted another question which explains what I want to do here:
Using PyLab to create a 2D graph from two separate lists
range(0,11) should be range(0,10).
In addition to Steve's observation: if your points are always y-values at consecutive integer x's starting from 0, you can omit x entirely and matplotlib will generate that range implicitly.
plot([9,4,5,2,3,5,7,12,2,3],'.-',label='sample1')
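The length mismatch and the date-string labels asked about in the question can both be sketched as follows. The four y-values paired with the dates are hypothetical, since the question does not say which values belong to which date. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

y = [9, 4, 5, 2, 3, 5, 7, 12, 2, 3]
x_bad = range(0, 11)     # 0..10 inclusive: 11 values
x_good = range(len(y))   # always matches the length of y

print(len(x_bad), len(y))  # prints "11 10": the mismatch behind the ValueError

# Date strings as x labels: plot against integer positions, then relabel the ticks.
dates = ["12/1/2007", "12/1/2008", "12/1/2009", "12/1/2010"]
values = [9, 4, 5, 2]    # hypothetical values, one per date
positions = list(range(len(dates)))

fig, ax = plt.subplots()
ax.plot(positions, values, ".-", label="sample1")
ax.set_xticks(positions)
ax.set_xticklabels(dates)
ax.legend()
```

Deriving x from len(y) instead of hard-coding the range avoids this class of error whenever the data list changes length.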