I'm learning some data science related topics and oh boy, this is a jungle of different libraries for everything π
Because of things, I went with Lets-plot, which has a nice Kotlin API that I'm using combined with Kotlin kernel for Jupyter notebooks
Overall, things are going pretty good. Most tutorials & docs I see online use different libraries for plotting (e.g. Seaborn, Matplotlib, Plotly) so most of the time I have to do some reading of the Lets-Plot-Kotlin reference and try/error until I find the equivalent code for my graphs
Currently, I'm trying to graph the distribution of differences between two values. Overall, this looks pretty good. I can just do something like
(letsPlot(df)
+ geomHistogram { x = "some-column" }
).show()
which gives a nice graph
It would be interesting to see the density estimator as well, geomDensity to the rescue!
(letsPlot(df)
+ geomDensity(color = "red") { x = "some-column" }
).show()
Nice! Now let's watch them both together
(letsPlot(df)
+ geomDensity(color = "red") { x = "some-column" }
+ geomHistogram() { x = "some-column" }
).show()
As you can see, there's a small red line in the bottom (the geomDensity!). Problem here (I would say) is that both layers are using the same Y scale. Histogram is working with 0-20 values and density with 0-0.02 so when plotted together it's just a line at the bottom
Is there any way to add several layers in the same plot that use their own scale? I've read some blogposts that claim that you should not go for it (seems to be pretty much accepted by the community.
My target is to achieve something similar to what you can do with Seaborn by doing
plt.figure(figsize=(10,4),dpi=200)
sns.histplot(data=df,x='some_column',kde=True,bins=25)
(yes I know I took the lets plot screenshot without the bins configured. Not relevant, I'd say Β―_(γ)_/Β― )
Maybe I'm just approaching the problem with a mindset I should not? As mentioned, I'm still learning so every alternative will be highly welcomed π
Just, please, don't go with the "Switch to Python". I'm exploring and I'd prefer to go one topic at a time
In order for histogram and density layers to share the same y-scale you need to map variable "..density.." to aesthetic "y" in the histogram layer (by default histogram maps "..count.." to "y").
You will find an example of it in cell [4] in this notebook: https://nbviewer.org/github/JetBrains/lets-plot-kotlin/blob/master/docs/examples/jupyter-notebooks/distributions.ipynb
BWT, many of the pages in Lets-Plot Kotlin API Reference are equipped with links on demo-notebooks, in "Examples" section: geomHistogram().
And of course you can find a lot of info online on the R ggplot2 package which is largely applicable to Lets-Plot as well. For example: Histogram with kernel density estimation.
Finally :) , calling show() is not necessary - Jupyter Kotlin kernel will render plot automatically if plot expression is the last one in the cell which is often the case.
Related
I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
columns=['R','T', 'V', 'X', 'Z']
for c in columns:
results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print labels
#print features
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print np.mean((regr.predict(features)-labels)**2)
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool as the relationship is not-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit learn ? ( I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solutions is not as easy and can be very influenced by your data.
If your variables R and Z are bounded (for ex 0<R<1 -3<Z<2) then you should be able to get a good estimation of the output variable using neural network.
Using neural network you should be able to estimate your output even without preprocessing the data and using all the variables as input.
(Of course here you will have to solve a minimization problem).
Sklearn do not implement neural network so you should use pybrain or fann.
If you want to preprocess the data in order to make the minimization problem easier you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non linear features selection. I would try to estimate the important variables from you dataset using in this order :
1-lasso
2- sparse PCA
3- decision tree (you can actually use them for features selection ) but I would avoid this as much as possible
If this is a toy problem I would sugges you to move towards something of more standard.
You can find a lot of examples on google.
Is there a way to place patterns into selected areas on an imshow graph? To be precise, I need to make it so that, in addition to the numerical-data-carrying colored squares, I also have different patterns in other squares indicate different failure modes for the experiment (and also generate a key explaining the meaning of these different patterns). An example of a pattern that would be useful would be various types of crosshatches. I need to be able to do this without disrupting the main color-numerical data relationship on the graph.
Due to the back-end I am working within for the GUI containing the graph, I cannot utilize patches (they fail to pickle and make it from the back-end to the front-end via the multiprocessing package). I was wondering if anyone knew of another way to do this.
grid = np.ma.array(grid, mask=np.isnan(grid))
ax.imshow(grid, interpolation='nearest', aspect='equal', vmax = private.vmax, vmin = private.vmin)
# Up to here works fine and draws the graph showing only the data with white spaces for any point that failed
if show_fail and faildat != []:
faildat = faildat[np.lexsort((faildat[:,yind],faildat[:,xind]))]
fails = []
for i in range(len(faildat)): #gives coordinates with failures as (x,y)
fails.append((faildat[i,1],faildat[i,0]))
for F in fails:
ax.FUNCTION NEEDED HERE
ax.minorticks_off()
ax.set_xticks(range(len(placex)))
ax.set_yticks(range(len(placey)))
ax.set_xticklabels(placex)
ax.set_yticklabels(placey, rotation = 0)
ax.colorbar()
ax.show()
!I have values in the form of (x,y,z). By creating a list_plot3d plot i can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So for the interpolation and the final "contour" plot to be better, or should i say smoother(?), do i have to create a rectangular grid (like the squares on a chess board) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people but i am trying this for the first time and i am struggling a bit. I have been looking at the scipy packages like scipy.interplate.interp2d but the graphs produced at the end are really bad. Maybe a brief tutorial on 2d interpolation in sagemath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is mostly the kind of graphs it produces along with this message:
Warning: No more knots can be added because the number of B-spline
coefficients
already exceeds the number of data points m. Probably causes:
either
s or m too small. (fp>s)
kx,ky=3,3 nx,ny=17,20 m=200 fp=4696.972223 s=0.000000
To get this graph i just run this command:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)),kind='cubic')
plot_interpolation = contour_plot(lambda x,y:
f_interpolation(x,y)[0], (22.419,22.439),(37.06,37.08) ,cmap='jet', contours=numpy.arange(0,1400,100), colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Where matrix(c) can be a huge matrix like 10000 X 3 or even a lot more like 1000000 x 3. The problem of bad graphs persists even with fewer data like the picture i attached now where matrix(C) was only 200 x 3. That's why i begin to think that it could be that apart from a possible glitch with the program my approach to the use of this command might be totally wrong, hence the reason for me to ask for advice about using a grid and not just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
# Generate the z-data, which first requires converting
# x/y data into grids
xg, yg = np.meshgrid(xsort,ysort)
z = xg**2 - yg**2
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily 0 and 100 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the kernel that will be used for interpolation
# Note that the default version uses three coefficients for
# interpolation (i.e. parabolic, a*x**2 + b*x +c). Higher order
# interpolation can be used by setting kx and ky to larger
# integers, i.e. interpolate.RectBivariateSpline(xsort,ysort,z,kx=5,ky=5)
kernel = interpolate.RectBivariateSpline(xsort,ysort,z)
# Now calculate the linear, interpolated data
zinterp = kernel(xinterp, yinterp)
I want to generate a grid of plots, of several arrays, with positive and negative values, with log scale, sharing the same colorbar.
I've achieved the sharing part of the colorbar (using ImageGrid and common max and min values), and I know that I could get a logarithmic scale using LogNorm() on the imshow call in the case of only positive values. But given the presence of negative values, I would need a colorbar on symmetric logarithmic scale.
I have found what would be the solution on https://stackoverflow.com/a/7741317/1101750 , but running the sample code Yann provides gives me very different results, cleary wrong:
Reviewing the code, I'm not able to grasp what's going on.
In addition to that, I've discovered that on Matplotlib 1.2, scale.SymmetricalLogScale.SymmetricalLogTransform asks for a new argument not explained on the documentation (linscale, which looking at the code of other transforms I assume that leaving it as 1 is a safe value).
Is the easiest solution subclassing LogNorm?
I've used a pretty simple recipe in the past to do exactly this, without the need to do any subclassing. matplotlib.colors.SymLogNorm provides most of the functionality you need, except that I've found it necessary to generate the tick marks by hand. Note that this solution uses matplotlib 1.3.0, and I may be using features that weren't available with 1.2.
def imshow_symlog(my_matrix, vmin, vmax, logthresh=5):
img=imshow( my_matrix ,
vmin=float(vmin), vmax=float(vmax),
norm=matplotlib.colors.SymLogNorm(10**-logthresh) )
maxlog=int(np.ceil( np.log10(vmax) ))
minlog=int(np.ceil( np.log10(-vmin) ))
#generate logarithmic ticks
tick_locations=([-(10**x) for x in xrange(minlog,-logthresh-1,-1)]
+[0.0]
+[(10**x) for x in xrange(-logthresh,maxlog+1)] )
cb=colorbar(ticks=tick_locations)
return img,cb
Since 1.3 matplotlib has a SymLogNorm. http://matplotlib.org/api/colors_api.html#matplotlib.colors.SymLogNorm
In GNU Octave you can make a picture where different colors represent different values in a matrix. You can also add a colorbar, which shows what color corresponds to what value.
Is it possible to somehow add units to the values shown in the colorbar? Instead of saying β0.36β it would say β0.36 V/nmβ? I know this is possible in Matlab, but I canβt figure out how to do it in Octave. Any good workarounds?
I assume someone here will mention that I should use matplotlib instead (that usually happens). How would you accomplish the same thing with that?
The matplotlib answer (using pylab) is
imshow(random((20,20)))
colorbar(format='%.2f V/nm')
In Octave it seems that the following works (but I'm no Octave expert so maybe there's a better way):
c=colorbar();
labels = {};
for v=get(c,'ytick'), labels{end+1} = sprintf('%.2f V/nm',v); end
set(c,'yticklabel',labels);