I am using Pandas for basic evaluations and use it to output LaTeX tables.
I output various error metrics, and for most of them the results are fine (the smallest error is shown in green).
style = df.style.highlight_min(color='darkgreen', axis=0).highlight_max(color='darkred', axis=0)
latex_table = style.to_latex(multicol_align="c", siunitx=True, hrules=True, [..])
Now I also output the so-called Q-Error, basically max(prediction/actual, actual/prediction). This error is ideally 1.0 when the prediction is completely accurate. With the standard Pandas styling, I cannot mark a best error of 1.05 when smaller numbers such as 0.6 are present, even though those are actually larger error values.
Is there a way to customize highlights?
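One possible direction (a sketch, not from the post): Styler.apply accepts an arbitrary function per column, so the "best" cell can be defined as the one closest to the ideal value of 1.0 instead of the minimum. The helper name and toy data below are hypothetical:
import numpy as np
import pandas as pd

def highlight_best_qerror(s, color='darkgreen'):
    # Mark the value closest to the ideal Q-Error of 1.0
    dist = (s - 1.0).abs()
    return np.where(dist == dist.min(), 'background-color: %s' % color, '')

# Hypothetical toy column of Q-Errors for three methods
df = pd.DataFrame({'q_error': [1.05, 0.60, 1.80]}, index=['method A', 'method B', 'method C'])
style = df.style.apply(highlight_best_qerror, axis=0)
The same Styler could then be passed through to_latex() exactly as above.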
I have been encountering OOM errors while scoring a huge dataset. The dataset shape is (15 million, 230). Since the working environment is Databricks, I decided to update the scoring code to Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues trying to convert part of my code from pandas to Koalas. Any help on how to work around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe, but I'm getting a PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Code / problem area:
df[new_sixmon_cols] = df[sixmon_cols].div([min(6,i) for i in df['mob']],axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12,i) for i in df['mob']],axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18,i) for i in df['mob']],axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24,i) for i in df['mob']],axis=0)
print('The shape of df after adding adjusted columns for all non-indicator columns is:')
print(df.shape)
I believe the problem area is the div([min(6, i) for i in df['mob']]) part, but I'm not certain how to convert this particular piece of code efficiently, or, more generally, how to handle scoring a big dataset by leveraging Databricks or the cloud environment.
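One possible direction (a sketch, not from the post): the list comprehension iterates df['mob'] in Python, which Koalas does not allow; expressing min(6, i) as Series.clip keeps everything as column operations. Whether your Koalas version accepts a Series divisor with axis=0 is something to verify:
# Sketch: cap 'mob' at 6 element-wise instead of iterating the Series;
# the same pattern applies to the 12/18/24-month blocks.
df[new_sixmon_cols] = df[sixmon_cols].div(df['mob'].clip(upper=6), axis=0)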
Some pointers about the data/model:
The data has, of course, already been feature-reduced and selected.
I built the model with 2.5 million records and am now working on scoring files.
I'm trying to plot the data of my DataFrame in a grouped chart, and I want the columns to preserve the order I gave them. The data looks as follows (not all of it is shown, but the rest is organized the same way):
[dataframe screenshot]
When I plot it I get the following graph:
[chart screenshot]
So the months were sorted even though I specified not to sort in the chart. I used the following code:
chart2 = alt.Chart(melted).mark_bar().encode(
    column=alt.Column('variable', sort=None),
    x=alt.X('room', sort=None),
    y=alt.Y('value'),
    color='room',
    tooltip=['room', 'value']
)
Does anyone know how I could fix that?
You've already used sort=None, which is the correct way to make scales in a non-faceted chart reflect the input order.
The missing piece is that faceted charts share scales by default (see "Scale and Guide Resolution" in the Altair docs), so each facet is being forced to share a sort order.
If you make the x scale resolution independent, then each facet should retain the input order:
chart2 = alt.Chart(melted).mark_bar().encode(
    column=alt.Column('variable', sort=None),
    x=alt.X('room', sort=None),
    y=alt.Y('value'),
    color='room',
    tooltip=['room', 'value']
).resolve_scale(x='independent')
I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage
5.55 4.99 8.99 IIA
4.87 5.77 8.88 IIA
5.98 7.88 8.34 IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap

CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
which produced the error message:
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what tuples are, I have some reading to do, but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to plot from easily. I'm new to Pandas, so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be much appreciated.
The way you iterate over the groupby object is the problem. When you use groupby(), your data frame is sliced along the values in the groupby column(s), and each slice is paired with its group name, forming a tuple (name, dataforgroup). The correct recipe for iterating over a groupby object is:
for name, group in data.groupby('Cancer Stage'):
    print name
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality of pandas here, and go through the Python reference to understand what tuples are.
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols = data.columns[0:2]
for col in cols:
    print data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean))
This does everything you need in one statement per column and produces a (nicely plottable) series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A':range(24), 'B':list('aabb') * 6, 'C':range(15,39)})
for col in ['A', 'C']:
    print df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
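To get from those confidence-interval series to the bar chart the asker ultimately wants, here is one possible continuation (a sketch, not part of the original answer; it reuses the toy df above and np.mean instead of scipy.mean, which newer SciPy versions no longer provide):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scikits.bootstrap as bootstrap

df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})

means = df.groupby('B')['A'].mean()
cis = df.groupby('B')['A'].apply(lambda x: bootstrap.ci(data=x.values, statfunction=np.mean))

# Turn the (low, high) intervals into asymmetric error-bar lengths
lower_err = means - cis.apply(lambda ci: ci[0])
upper_err = cis.apply(lambda ci: ci[1]) - means

plt.bar(range(len(means)), means, yerr=[lower_err, upper_err], tick_label=means.index)
plt.ylabel('mean of A with bootstrap CI')
plt.show()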
I have values in the form (x, y, z). By creating a list_plot3d plot I can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So for the interpolation and the final "contour" plot to be better, or should I say smoother(?), do I have to create a rectangular grid (like the squares on a chess board) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people, but I am trying this for the first time and I am struggling a bit. I have been looking at scipy packages like scipy.interpolate.interp2d, but the graphs produced in the end are really bad. Maybe a brief tutorial on 2D interpolation in SageMath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is the kind of graph it mostly produces, along with this message:
Warning: No more knots can be added because the number of B-spline
coefficients already exceeds the number of data points m. Probably causes:
either s or m too small. (fp>s)
kx,ky=3,3 nx,ny=17,20 m=200 fp=4696.972223 s=0.000000
To get this graph I just run these commands:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)), kind='cubic')
plot_interpolation = contour_plot(
    lambda x, y: f_interpolation(x, y)[0],
    (22.419, 22.439), (37.06, 37.08),
    cmap='jet', contours=numpy.arange(0, 1400, 100), colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Here matrix(C) can be a huge matrix, such as 10000 x 3, or even much larger, such as 1000000 x 3. The problem of bad graphs persists even with less data, as in the picture I attached, where matrix(C) was only 200 x 3. That's why I am beginning to think that, apart from a possible glitch in the program, my approach to using this command might be totally wrong, hence my asking for advice about using a grid instead of just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
# Generate the z-data, which first requires converting
# x/y data into grids. RectBivariateSpline expects z[i, j] to be the
# value at (x[i], y[j]), so use 'ij' indexing for the meshgrid.
xg, yg = np.meshgrid(xsort, ysort, indexing='ij')
z = xg**2 - yg**2
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily 0 and 100 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the spline that will be used for interpolation
# Note that the default spline degree is cubic in both directions
# (kx=3, ky=3). Higher-order interpolation can be used by setting kx
# and ky to larger integers, e.g.
# interpolate.RectBivariateSpline(xsort, ysort, z, kx=5, ky=5)
kernel = interpolate.RectBivariateSpline(xsort, ysort, z)
# Now evaluate the spline on the evenly spaced grid
zinterp = kernel(xinterp, yinterp)
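A quick way to sanity-check the result (an assumed addition, not part of the original answer) is to plot the interpolated surface; note the transpose, because zinterp is indexed as (x, y) while contourf expects (y, x):
import matplotlib.pyplot as plt

# Visual check of the interpolated, evenly spaced surface
plt.contourf(xinterp, yinterp, zinterp.T, 20, cmap='jet')
plt.colorbar()
plt.show()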
I want to generate a grid of plots, of several arrays, with positive and negative values, with log scale, sharing the same colorbar.
I've achieved the sharing part of the colorbar (using ImageGrid and common max and min values), and I know that I could get a logarithmic scale using LogNorm() on the imshow call in the case of only positive values. But given the presence of negative values, I would need a colorbar on symmetric logarithmic scale.
I have found what would be the solution at https://stackoverflow.com/a/7741317/1101750, but running the sample code Yann provides gives me very different results, clearly wrong:
Reviewing the code, I'm not able to grasp what's going on.
In addition to that, I've discovered that in Matplotlib 1.2, scale.SymmetricalLogScale.SymmetricalLogTransform takes a new argument not explained in the documentation (linscale; looking at the code of other transforms, I assume leaving it at 1 is a safe value).
Is the easiest solution subclassing LogNorm?
I've used a pretty simple recipe in the past to do exactly this, without the need to do any subclassing. matplotlib.colors.SymLogNorm provides most of the functionality you need, except that I've found it necessary to generate the tick marks by hand. Note that this solution uses matplotlib 1.3.0, and I may be using features that weren't available with 1.2.
import numpy as np
import matplotlib.colors
from matplotlib.pyplot import imshow, colorbar

def imshow_symlog(my_matrix, vmin, vmax, logthresh=5):
    img = imshow(my_matrix,
                 vmin=float(vmin), vmax=float(vmax),
                 norm=matplotlib.colors.SymLogNorm(10**-logthresh))
    maxlog = int(np.ceil(np.log10(vmax)))
    minlog = int(np.ceil(np.log10(-vmin)))
    # generate logarithmic ticks
    tick_locations = ([-(10**x) for x in xrange(minlog, -logthresh-1, -1)]
                      + [0.0]
                      + [(10**x) for x in xrange(-logthresh, maxlog+1)])
    cb = colorbar(ticks=tick_locations)
    return img, cb
Since version 1.3, matplotlib has included SymLogNorm: http://matplotlib.org/api/colors_api.html#matplotlib.colors.SymLogNorm
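For reference, a minimal sketch of using SymLogNorm directly with a recent matplotlib (the example data is made up, and the base keyword only exists in newer releases):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import SymLogNorm

# Example data spanning both signs and several decades
data = np.linspace(-1000, 1000, 400).reshape(20, 20)

# Values within +/- linthresh are mapped linearly, the rest logarithmically
norm = SymLogNorm(linthresh=1.0, vmin=-1000, vmax=1000, base=10)

plt.imshow(data, norm=norm, cmap='RdBu_r')
plt.colorbar()
plt.show()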