Pandas interpolation method definitions

In the pandas documentation, a number of methods can be passed as the method argument to pandas.DataFrame.interpolate, including:
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes
However, the scipy documentation indicates the following options:
kind str or int, optional
Specifies the kind of interpolation as a string or as an integer specifying the order of the spline interpolator to use. The string has to be one of ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’. ‘zero’, ‘slinear’, ‘quadratic’ and ‘cubic’ refer to a spline interpolation of zeroth, first, second or third order; ‘previous’ and ‘next’ simply return the previous or next value of the point; ‘nearest-up’ and ‘nearest’ differ when interpolating half-integers (e.g. 0.5, 1.5) in that ‘nearest-up’ rounds up and ‘nearest’ rounds down. Default is ‘linear’.
The documentation seems wrong since scipy.interpolate.interp1d does not accept barycentric or polynomial as valid methods. I suppose that barycentric refers to scipy.interpolate.barycentric_interpolate, but what does polynomial refer to? I thought it might be equivalent to the piecewise_polynomial option, but the two give different results.
Also, method=cubicspline and method=spline, order=3 give different results. What's the difference here?

The pandas interpolate method is an amalgamation of interpolation methods coming from different places in the numpy and scipy libraries.
Currently all of the code is located in pandas/core/missing.py.
At a high level, it splits the interpolation methods into those that are handled by np.interp and those that are handled by various routines in the scipy library.
# interpolation methods that dispatch to np.interp
NP_METHODS = ["linear", "time", "index", "values"]

# interpolation methods that dispatch to _interpolate_scipy_wrapper
SP_METHODS = ["nearest", "zero", "slinear", "quadratic", "cubic",
              "barycentric", "krogh", "spline", "polynomial",
              "from_derivatives", "piecewise_polynomial", "pchip",
              "akima", "cubicspline"]
Then, because the scipy methods are spread across different scipy routines, there are a number of other wrappers within missing.py that indicate which scipy routine to use. Most of the methods are passed off to scipy.interpolate.interp1d; however, a few others are mapped through a dict or dedicated wrapper functions to their specific scipy counterparts.
from scipy import interpolate

alt_methods = {
    "barycentric": interpolate.barycentric_interpolate,
    "krogh": interpolate.krogh_interpolate,
    "from_derivatives": _from_derivatives,
    "piecewise_polynomial": _from_derivatives,
}
where the docstring of _from_derivatives within missing.py indicates:
def _from_derivatives(xi, yi, x, order=None, der=0, extrapolate=False):
    """
    Convenience function for interpolate.BPoly.from_derivatives.
    ...
    """
So, TL;DR: depending on the method you specify, you end up directly using one of the following:
numpy.interp
scipy.interpolate.interp1d
scipy.interpolate.barycentric_interpolate
scipy.interpolate.krogh_interpolate
scipy.interpolate.BPoly.from_derivatives
scipy.interpolate.Akima1DInterpolator
scipy.interpolate.UnivariateSpline
scipy.interpolate.CubicSpline
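That mapping also addresses the two follow-up questions, as far as I can tell from missing.py: for method='polynomial', pandas substitutes the integer order for the method name and hands it to interp1d as kind, where an integer means a spline of that order, which is why it is not the same as piecewise_polynomial. And method='spline' goes through UnivariateSpline, which by default is a smoothing spline, while method='cubicspline' goes through CubicSpline, an exact interpolant, so the two can legitimately disagree. A minimal sketch (the data is made up, just to show the call pattern):
import numpy as np
import pandas as pd

# A small series with a gap; the index values act as the x-coordinates.
s = pd.Series([0.0, 2.0, np.nan, 1.0, 3.0], index=[0, 1, 2, 4, 5])

# 'spline' dispatches to UnivariateSpline (a smoothing spline, so it need
# not pass exactly through the known points); 'cubicspline' dispatches to
# CubicSpline (exact interpolation). The filled values generally differ.
print(s.interpolate(method='spline', order=3))
print(s.interpolate(method='cubicspline'))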

Related

Any reasons for inconsistent numpy arguments of numpy.zeros and numpy.random.randn

I'm implementing a computation using numpy.zeros and numpy.random.randn:
W1 = np.random.randn(n_h, n_x) * .01
b1 = np.zeros((n_h, 1))
I'm not sure why random.randn() can accept two integers while zeros() needs a tuple. Is there a good reason for that?
Cheers, JChen.
Most likely it's just a matter of history. numpy resulted from the merger of several prior packages and has a long development history. Some quirks get cleaned up; others are left as-is.
randn(d0, d1, ..., dn)
zeros(shape, dtype=float, order='C')
randn has this note:
This is a convenience function. If you want an interface that takes a
tuple as the first argument, use numpy.random.standard_normal instead.
standard_normal(size=None)
With * it is easy to pass a tuple to randn:
np.random.randn(*(1,2,3))
np.zeros takes a couple of keyword arguments; randn does not. You can define a Python function with a (*args, **kwargs) signature, but accepting a tuple, especially one with a usage as common as shape, fits better alongside keyword arguments. That's a matter of opinion, though.
np.random.rand and np.random.random_sample are another such pair. Most likely rand and randn are the older versions, and standard_normal and random_sample are newer ones designed to conform to the more common tuple style.
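A quick sketch of the equivalences above (the shape is arbitrary):
import numpy as np

shape = (2, 3)

# randn/rand take separate dimension arguments; the newer APIs take a
# shape tuple, like np.zeros does.
a = np.random.randn(*shape)           # unpack the tuple for randn
b = np.random.standard_normal(shape)  # tuple-style equivalent of randn
c = np.random.random_sample(shape)    # tuple-style equivalent of rand
d = np.zeros(shape)
print(a.shape, b.shape, c.shape, d.shape)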

Binary classification target specifically on false positives

I got a little confused when using models from sklearn: how do I set the specific optimization function? For example, when RandomForestClassifier is used, how do I let the model 'know' that I want to maximize 'recall', 'F1 score', or 'AUC' instead of 'accuracy'?
Any suggestions? Thank you.
What you are looking for is parameter tuning. Basically, you first select an estimator, then you define a hyper-parameter space (i.e. all the parameters and their respective values that you want to tune), a cross-validation scheme, and a scoring function. Depending on how you want to search the parameter space, you can choose one of the following:
Exhaustive Grid Search
In this approach, sklearn creates a grid of all possible combinations of hyper-parameter values defined by the user, using the GridSearchCV method. For instance:
my_clf = DecisionTreeClassifier(random_state=0, class_weight='balanced')
param_grid = dict(
    classifier__min_samples_split=[5, 7, 9, 11],
    classifier__max_leaf_nodes=[50, 60, 70, 80],
    classifier__max_depth=[1, 3, 5, 7, 9],
)
In this case, the grid specified is a cross-product of values of classifier__min_samples_split, classifier__max_leaf_nodes and classifier__max_depth. The documentation states that:
The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
An example of using GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create a classifier
clf = LogisticRegression(random_state=0)

# Cross-validate the dataset (features, labels and n_splits come from
# your own data and setup)
cv = StratifiedKFold(n_splits=n_splits).split(features, labels)

# Declare the hyper-parameter grid (no prefix needed here, since the
# estimator is not wrapped in a Pipeline)
param_grid = dict(
    tol=[1.0, 0.1, 0.01, 0.001],
    C=np.power([10.0] * 5, list(range(-3, 2))).tolist(),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

# Perform a grid search using the classifier, parameter grid, scoring
# function and the cross-validated dataset
grid_search = GridSearchCV(clf, param_grid=param_grid, verbose=10,
                           scoring=make_scorer(f1_score), cv=list(cv))
grid_search.fit(features.values, labels.values)

# To get the best score using the specified scoring function
print(grid_search.best_score_)

# Similarly, to get the best estimator
best_clf = grid_search.best_estimator_
print(best_clf)
You can read more in its documentation here to learn about the various methods for retrieving the best parameters, the best estimator, and so on.
Randomized Search
Instead of exhaustively checking the hyper-parameter space, sklearn implements RandomizedSearchCV to do a randomized search over the parameters. The documentation states that:
RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
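A minimal sketch of that pattern (the estimator and distributions here are chosen purely for illustration):
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Each parameter is sampled from a distribution instead of enumerated.
param_dist = {
    'max_depth': randint(1, 10),
    'min_samples_split': randint(2, 12),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=20, scoring='recall', random_state=0)
# search.fit(X, y)  # X and y assumed; fit exactly like GridSearchCV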
You can read more about it from here.
You can read more about other approaches here.
Alternative link for reference:
How to Tune Algorithm Parameters with Scikit-Learn
What is hyperparameter optimization in machine learning in formal terms?
Grid Search for hyperparameter and feature selection
Edit: In your case, if you want to maximize the recall for the model, you simply specify recall_score from sklearn.metrics as the scoring function.
If you wish to optimize for 'False Positives' as stated in your question, you can refer to this answer to extract the false positives from the confusion matrix. Then use the make_scorer function and pass the result to the GridSearchCV object for tuning.
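For instance, here is a minimal sketch of such a custom scorer; the function name and the sign convention (negating the count so that higher is better) are my own choices, not anything from sklearn:
from sklearn.metrics import confusion_matrix, make_scorer

# Hypothetical scorer that rewards fewer false positives (binary case).
def neg_false_positives(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -fp  # GridSearchCV maximizes the score, so negate the count

fp_scorer = make_scorer(neg_false_positives)
# then: GridSearchCV(clf, param_grid=param_grid, scoring=fp_scorer, cv=5)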
I would suggest you grab a cup of coffee and read (and understand) the following
http://scikit-learn.org/stable/modules/model_evaluation.html
You need to use something along the lines of the following (a fuller, runnable sketch appears after the list of scorers below):
cross_val_score(model, X, y, scoring='f1')
possible choices are (check the docs)
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'average_precision', 'completeness_score', 'explained_variance',
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted',
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score',
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_median_absolute_error',
'normalized_mutual_info_score', 'precision', 'precision_macro',
'precision_micro', 'precision_samples', 'precision_weighted', 'r2',
'recall', 'recall_macro', 'recall_micro', 'recall_samples',
'recall_weighted', 'roc_auc', 'v_measure_score']
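Putting that together into a self-contained sketch (the data is synthetic, just to make it runnable):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Score a RandomForest with 'f1' instead of the default accuracy.
X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0)
print(cross_val_score(model, X, y, scoring='f1', cv=5))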
Have fun
Umberto

Fortran equivalent of Numpy functions

I'm trying to translate something from Python to Fortran because of speed limitations. (So I can then use f2py on it.)
The problem is that the code contains many NumPy functions that don't exist in Fortran 90. So my question is: is there a Fortran library that implements at least some of the NumPy functionality in Fortran?
The functions that I have to use in the code are generally simple, so I could translate them by hand. However, I'm trying not to re-invent the wheel here, especially because I don't have that much experience in Fortran and I might not know some important caveats.
Anyway, here's a list of some of the functions that I need.
np.mean (with the axis parameter)
np.std (with the axis parameter)
np.roll (again with the axis parameter)
np.mgrid
np.max (again with axis parameter)
Anything is helpful at this point. I'm not counting on finding substitutes for all of them, but it would be very good if some of them, at least, already existed.
I find gfortran's list of intrinsic procedures useful as a first reference: https://gcc.gnu.org/onlinedocs/gfortran/Intrinsic-Procedures.html#Intrinsic-Procedures
np.mean (with the axis parameter)
See sum. It has a dim argument that plays the role of axis. In combination with size it can compute the mean:
result = sum(data, dim=axis)/size(data, dim=axis)
Here, result has one less dimension than data.
np.std (with the axis parameter)
There is no intrinsic, but it can be composed from sum, size and sqrt via the usual variance formula, along the same lines as the mean above.
np.roll (again with the axis parameter)
See cshift, which performs a circular shift and likewise takes a dim argument.
np.mgrid
There is no direct equivalent; a coordinate grid can be built with spread or implied do loops.
np.max (again with axis parameter)
See maxval, it has a dim argument.
I am not aware of a Fortran equivalent to NumPy. Fortran's standard array capabilities are strong enough that a single "base" library has never emerged. There are several initiatives, though:
https://github.com/astrofrog/fortranlib "Collection of personal scientific routines in Fortran"
http://fortranwiki.org/ "The Fortran Wiki is an open venue for discussing all aspects of the Fortran programming language and scientific computing."
http://flibs.sourceforge.net/ "FLIBS - A collection of Fortran modules"
http://www.fortran90.org/ General resource for modern Fortran. Contains a "Python Fortran Rosetta Stone"

sklearn: get feature names after L1-based feature selection

This question and answer demonstrate that when feature selection is performed using one of scikit-learn's dedicated feature selection routines, then the names of the selected features can be retrieved as follows:
np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()]
For example, in the above code, featureSelector might be an instance of sklearn.feature_selection.SelectKBest or sklearn.feature_selection.SelectPercentile, since these classes implement the get_support method which returns a boolean mask or integer indices of the selected features.
When one performs feature selection via linear models penalized with the L1 norm, it's unclear how to accomplish this. sklearn.svm.LinearSVC has no get_support method and the documentation doesn't make clear how to retrieve the feature indices after using its transform method to eliminate features from a collection of samples. Am I missing something here?
For sparse estimators you can generally find the support by checking where the non-zero entries are in the coefficient vector (provided a coefficient vector exists, which is the case for e.g. linear models):
support = np.flatnonzero(estimator.coef_)
For your LinearSVC with L1 penalty it would accordingly be:
from sklearn.svm import LinearSVC
svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)
selected_feature_names = np.asarray(vectorizer.get_feature_names())[np.flatnonzero(svc.coef_)]
I've been using sklearn 0.15.2, and according to the LinearSVC documentation, coef_ is an array of shape [n_features] if n_classes == 2, else [n_classes, n_features].
So first, np.flatnonzero doesn't work for the multi-class case; you'll get an index-out-of-range error. Second, it should be np.where(svc.coef_ != 0)[1] instead of np.where(svc.coef_ != 0)[0], since index 0 selects classes, not features. I ended up using np.asarray(vectorizer.get_feature_names())[list(set(np.where(svc.coef_ != 0)[1]))]
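Putting the comment above into a sketch that covers the multi-class case (X, y and vectorizer are assumed to exist from the earlier context):
import numpy as np
from sklearn.svm import LinearSVC

svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)  # X, y assumed from the earlier context

# coef_ has shape (n_classes, n_features) in the multi-class case, so take
# the union of the non-zero columns across all classes.
coef = np.atleast_2d(svc.coef_)
support = np.unique(np.where(coef != 0)[1])
selected_feature_names = np.asarray(vectorizer.get_feature_names())[support]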

Motivation behind numpy's datetime64 type?

I noticed recently that numpy includes a datetime64 data type beginning in numpy 1.7:
http://www.compsci.wm.edu/SciClone/documentation/software/math/NumPy/html1.7/reference/arrays.datetime.html
I am wondering what is the motivation behind including this as a separate type within the numpy package rather than using the builtin datetime.datetime provided by Python?
Some of the reasons I am interested in understanding this better include:
I want to know when it is appropriate to use datetime.datetime vs when to use numpy.datetime64
Since numpy includes no date type analogous to datetime.date, should I use numpy.datetime64 for dates when I need to interact with numpy.datetime64 objects? Or should I intermingle datetime.date and numpy.datetime64 in my code?
The reason is the same as why there are np.int and np.float types: these numpy types are stored by value in an array, rather than as boxed references the way generic Python objects are. The latter takes far more memory and allocation overhead, and is much less cache-friendly to traverse.
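A rough illustration of the storage difference (a small sketch; the dates are arbitrary):
import numpy as np

# datetime64 values are stored inline, 8 bytes each; converting to object
# boxes each element as a separate Python date object.
a = np.arange('2023-01', '2023-12', dtype='datetime64[D]')
b = a.astype(object)
print(a.dtype, a.itemsize)  # datetime64[D] 8
print(b.dtype)              # object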
I avoid intermingling datetime64 and Python's built-in datetime objects. The reason is that code written to work with a datetime.datetime will not work with a numpy.datetime64 scalar; for example, none of the methods or properties of a datetime.datetime are available on a numpy.datetime64 object.
To avoid the intermingling, what I tend to do is: when I am dealing with scalars, I use Python's datetime.datetime or datetime.date; when I am dealing with numpy arrays, I use datetime64. This means that when I am extracting or iterating over single values from a numpy datetime64 array, I convert them into datetime objects first, before letting them propagate into other parts of the codebase.
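A minimal sketch of that convert-on-the-way-out pattern (the timestamps are made up):
import datetime
import numpy as np

arr = np.array(['2023-01-15T12:30', '2023-01-16T08:00'], dtype='datetime64[m]')

# Cast each scalar to second resolution, then to a Python datetime, before
# handing it to code that expects datetime.datetime.
for value in arr:
    as_dt = value.astype('datetime64[s]').astype(datetime.datetime)
    print(type(as_dt), as_dt)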
Also, you can read about the different units of datetime64, which allow you to use a datetime64 like a datetime.date or a datetime.datetime, here:
http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html#arrays-dtypes-dateunits