Pandas: compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage
      5.55       4.99       8.99           IIA
      4.87       5.77       8.88           IIA
      5.98       7.88       8.34           IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
which produced this error message:
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what "tuples" are, I have some reading to do, but I'm worried that my current approach of nested for loops will leave me with a data structure I won't be able to plot from easily. I'm new to Pandas, so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.

The way you iterate over the groupby object is wrong! When you use groupby(), your data frame is split along the values in your groupby column(s), and each piece is handed to you together with its group value as a so-called "tuple":
(name, dataforgroup). The correct recipe for iterating over groupby objects is
for name, group in data.groupby('Cancer Stage'):
    print name
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality of pandas here, and go through the Python reference in order to understand what tuples are!
Grouping data frames and applying a function can essentially be done in one statement, using the apply functionality of pandas:
cols = data.columns[0:3]
for col in cols:
    print data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean))
This does everything you need in one line per column and produces a (nicely plottable) Series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
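Since the ultimate goal is a bar graph with the bootstrap intervals as error bars, here is a minimal sketch of how the output above could be assembled and plotted with matplotlib. It assumes data is the DataFrame from the question and that scikits.bootstrap is importable as above; the column list, the use of numpy.mean in place of scipy.mean, and all plotting choices are illustrative, not part of the original answer.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scikits.bootstrap as bootstrap

norm_cols = ['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']   # columns from the question
grouped = data.groupby('Cancer Stage')
means = grouped[norm_cols].mean()

fig, ax = plt.subplots()
x = np.arange(len(means.index))
width = 0.25

for i, col in enumerate(norm_cols):
    # One (low, high) bootstrap interval per cancer stage for this column.
    cis = grouped[col].apply(lambda s: bootstrap.ci(data=s.values, statfunction=np.mean))
    low = cis.apply(lambda c: c[0])
    high = cis.apply(lambda c: c[1])
    # matplotlib expects error bar lengths relative to the bar top, not absolute bounds.
    yerr = [means[col] - low, high - means[col]]
    ax.bar(x + i * width, means[col], width, yerr=yerr, capsize=4, label=col)

ax.set_xticks(x + width)
ax.set_xticklabels(means.index)
ax.set_ylabel('Mean')
ax.legend()
plt.show()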

Related

Normalization of data using MinMaxScaler over a list of dataframes

I've been trying to normalize my data using the MinMaxScaler method. However, I have a list of data frames (20 columns), and each of them includes a different number of data readings. For example, the first one has 23 cells. Then, if we get into each cell, we will see 1500x24 data (not every cell has the same number of rows; it varies). My plan is to find the min and max across the entire data by scanning those 20 categories (the col variable). If I use scikit-learn's MinMaxScaler, it only does the normalization considering that specific sub-category, not over the whole data (or I could not achieve it). Probably, I am also having some issues in importing the data, because I could not convert it into exactly the array form I wanted. The plan is to convert them into arrays, normalize the data, and then transform them into grayscale RGB images.
Thus, the problem is to find the max and min across all 20 dataframes in the list and then normalize every sub-category based on that max and min value. Could anyone give some ideas?
col = ['UN_DTE', 'UN_DTF', 'UN_SM05', 'UN_SM10', 'UN_SM15',
       'D1_DTE', 'D1_DTF', 'D1_SM05', 'D1_SM10', 'D1_SM15',
       'D2_DTE', 'D2_DTF', 'D2_SM05', 'D2_SM10', 'D2_SM15',
       'D3_DTE', 'D3_DTF', 'D3_SM05', 'D3_SM10', 'D3_SM15']
df = []
for i in range(len(col)):
    df.append(pd.DataFrame(mat[col[i]]))

# Normalize the data
scaler = MinMaxScaler()
scaler.fit(df[0][0][0])
scaled_features = scaler.transform(df[0][0][0])
pd.DataFrame(scaled_features).describe()
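No answer is quoted here, but a minimal sketch of the plan described in the question (find one global min/max over every frame, then scale each frame with it) might look like the following. It assumes each entry in df is a purely numeric DataFrame; the names global_min, global_max, and normalized are illustrative only.
import numpy as np
import pandas as pd

# df is assumed to be the list of numeric DataFrames built above.
global_min = min(frame.values.min() for frame in df)
global_max = max(frame.values.max() for frame in df)

# Scale every frame with the same global range instead of fitting
# MinMaxScaler separately per sub-category.
normalized = [(frame - global_min) / (global_max - global_min) for frame in df]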

Stratified sampling into 3 sets considering unbalance

I have looked into Stratified sample in pandas, stratified sampling on ranges, and others, but they don't address my issue specifically, as I'm looking to split the data into 3 sets randomly.
I have an unbalanced dataframe of 10k rows; 10% is the positive class and 90% the negative class. I'm trying to figure out a way to split this dataframe into 3 datasets of 60%, 20%, and 20% of the dataframe while respecting the imbalance. However, this split has to be random and without replacement, which means that if I put the 3 datasets back together, they have to equal the original dataframe.
Usually I would use train_test_split(), but it only works if you are looking to split into two datasets, not three.
Any suggestions?
Reproducible example:
df = pd.DataFrame({"target" : np.random.choice([0,0,0,0,0,0,0,0,0,1], size=10000)}, index=range(0,10000,1))
How about using train_test_split() twice?
1st time, using train_size=0.6, obtaining a 60% training set and a 40% (test + validation) set.
2nd time, using train_size=0.5 on that remainder, obtaining a 20% validation set and a 20% test set (0.5 × 40% = 20%).
Is this workaround valid for you?
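A minimal sketch of that two-step split, assuming the reproducible df above and using stratify to preserve the 10/90 class balance in every piece (the variable names train, valid, test and the random_state value are illustrative):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"target": np.random.choice([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], size=10000)}, index=range(0, 10000, 1))

# First split: 60% train, 40% held out, stratified on the target.
train, rest = train_test_split(df, train_size=0.6, stratify=df["target"], random_state=0)

# Second split: cut the 40% remainder in half -> 20% validation, 20% test.
valid, test = train_test_split(rest, train_size=0.5, stratify=rest["target"], random_state=0)

# Sanity check: the three pieces together cover the original frame exactly once.
assert len(train) + len(valid) + len(test) == len(df)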

Issues with Decomposing Trend, Seasonal, and Residual Time Series Elements

I am quite a newbie to time series analysis and this might be a stupid question.
I am trying to generate the trend, seasonal, and residual time series elements; however, my timestamp index actually consists of strings (let's say 'window1', 'window2', 'window3'). Now, when I try to apply seasonal_decompose(data, model='multiplicative'), it returns an error, 'Index' object has no attribute 'inferred_freq', which is pretty understandable.
However, how do I get around this issue while keeping strings as the time series index?
Basically, here you need to specify the freq parameter.
Suppose you have the following dataset:
s = pd.Series([102, 200, 322, 420], index=['window1', 'window2', 'window3', 'window4'])
s
>>> window1    102
    window2    200
    window3    322
    window4    420
    dtype: int64
Now specify the freq parameter; in this case I used freq=1:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

plt.style.use('default')
plt.figure(figsize=(16, 8))
sm.tsa.seasonal_decompose(s.values, freq=1).plot()
result = sm.tsa.stattools.adfuller(s, maxlag=1)
plt.show()
I am not allowed to post an image, but I hope this code will solve your problem. Also, here maxlag by default gave an error for my dataset, therefore I used maxlag=1. If you are not sure about its value, use the default value for maxlag.
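As a side note that is not part of the answer above: in newer statsmodels releases the freq argument of seasonal_decompose has been renamed to period, so the equivalent call would look roughly like this (treat the exact version behaviour as an assumption):
import statsmodels.api as sm

# In recent statsmodels versions `freq` has been renamed to `period`.
sm.tsa.seasonal_decompose(s.values, model='additive', period=1).plot()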

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
tx = random_projection.GaussianRandomProjection(n_components = 25)
data25 = tx.fit_transform(data)
I get: Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around for this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in that case? I'm not an expert on the theory behind locality sensitive hashing/random projections, so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzmann Machine and using it to fill in missing values:
import numpy as np
import sklearn.neural_network
import sklearn.preprocessing

rbm = sklearn.neural_network.BernoulliRBM().fit(data_with_no_nans)            # train on the complete rows only
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform(all_data)   # crude mean imputation as a starting point
rbm_imputation = rbm.gibbs(mean_imputed_data)                                 # resample plausible values from the RBM
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]                                 # overwrite only the entries that were missing
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).
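As a minimal sketch of the nearest-neighbours idea, scikit-learn now ships a KNNImputer class (not mentioned in the original answer, so treat this as an assumption about newer versions) that averages each missing entry over the k nearest neighbours measured on the observed features:
import numpy as np
from sklearn.impute import KNNImputer

# `data` is assumed to be the numeric array/DataFrame with NaNs from the question.
imputer = KNNImputer(n_neighbors=5)           # average over the 5 nearest neighbours
data_imputed = imputer.fit_transform(data)

# The imputed matrix can then be fed to the random projection.
from sklearn import random_projection
tx = random_projection.GaussianRandomProjection(n_components=25)
data25 = tx.fit_transform(data_imputed)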

Creating grid and interpolating (x,y,z) for contour plot sagemath

I have values in the form of (x, y, z). By creating a list_plot3d plot I can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So, for the interpolation and the final "contour" plot to be better, or should I say smoother(?), do I have to create a rectangular grid (like the squares on a chessboard) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people, but I am trying this for the first time and I am struggling a bit. I have been looking at the scipy packages like scipy.interpolate.interp2d, but the graphs produced at the end are really bad. Maybe a brief tutorial on 2d interpolation in sagemath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is mostly the kind of graph it produces, along with this message:
Warning: No more knots can be added because the number of B-spline coefficients
already exceeds the number of data points m. Probably causes: either s or m too small. (fp>s)
    kx,ky=3,3 nx,ny=17,20 m=200 fp=4696.972223 s=0.000000
To get this graph I just run this command:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)), kind='cubic')
plot_interpolation = contour_plot(lambda x, y: f_interpolation(x, y)[0],
                                  (22.419, 22.439), (37.06, 37.08),
                                  cmap='jet', contours=numpy.arange(0, 1400, 100),
                                  colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Where matrix(C) can be a huge matrix like 10000 x 3 or even a lot more, like 1000000 x 3. The problem of bad graphs persists even with fewer data points, as in the picture I attached now, where matrix(C) was only 200 x 3. That's why I begin to think that, apart from a possible glitch in the program, my approach to using this command might be totally wrong; hence my asking for advice about using a grid rather than just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
# Generate the z-data, which first requires converting
# x/y data into grids
xg, yg = np.meshgrid(xsort,ysort)
z = xg**2 - yg**2
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily 0 and 100 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the spline that will be used for interpolation
# Note that by default cubic splines are used in each direction
# (kx=ky=3). Higher-order interpolation can be used by setting
# kx and ky to larger integers, e.g.
# interpolate.RectBivariateSpline(xsort, ysort, z, kx=5, ky=5)
kernel = interpolate.RectBivariateSpline(xsort,ysort,z)
# Now evaluate the interpolated data on the evenly spaced grid
zinterp = kernel(xinterp, yinterp)
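Since the data in the question are scattered (x, y, z) points rather than values on a rectangular grid, one of the other capital-letter spline classes mentioned above, SmoothBivariateSpline, may be a closer fit. Here is a minimal sketch under that assumption; the random sample data stands in for the question's points, and the default smoothing usually needs tuning via the s argument:
import numpy as np
from scipy import interpolate

# Scattered sample points (stand-ins for the (x, y, z) data in the question).
npoints = 200
x = np.random.rand(npoints) * 100
y = np.random.rand(npoints) * 100
z = x**2 - y**2

# Fit a smoothing spline directly to the scattered points.
spline = interpolate.SmoothBivariateSpline(x, y, z, kx=3, ky=3)

# Evaluate on an evenly spaced grid for a contour plot.
xi = np.linspace(x.min(), x.max(), 100)
yi = np.linspace(y.min(), y.max(), 100)
zi = spline(xi, yi)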