SkLearn - Using RegressorChain with ColumnTransformer in Pipelines? - pandas

I'm having problems using sklearn's RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), and unfortunately there doesn't seem to be a lot of documentation/examples about this.
The documentation states indirectly (through the set_params method) that it can be used with Pipelines. My pipeline has:
ct = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), numerical_columns),
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough'
)
cv = TimeSeriesSplit(n_splits=groups.nunique())  # groups by date
pipeline = make_pipeline(ct, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
and then I do:
chain_regressor = RegressorChain(base_estimator=target_transform_output , order=[1,0,2])
chain_regressor.fit(X, y)
In the above, both X and y are pandas Dataframes, and y has 3 target columns.
When I run the code, I get a Python stack trace caused by the fit() call, starting in __init__.py in _get_column_indices(X, key) when doing all_columns = X.columns. The error is:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
and further down at the end of the stack trace:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I assume this is because the ColumnTransformer returns ndarrays, a well-known problem. Does this mean that the RegressorChain can't be used with the ColumnTransformer?
After this, I removed the column transformer step from the pipeline and tried again, and without the ColumnTransformer everything works fine (even the TransformedTargetRegressor).
Any help, ideas or workaround appreciated.

You have the issue the wrong way around: it's not that ColumnTransformer outputs an array and RegressorChain expected a dataframe; rather, the RegressorChain converts your input to an array before calling your pipeline, and so your ColumnTransformer doesn't get a dataframe as input and cannot use your column-name specifications.
You could just specify the columns by index or callable in the ColumnTransformer. But I think in this case you have two unfortunate side effects:
(1) For each target, you are re-encoding day_of_week and re-scaling each independent variable (not wrong, just a little wasteful), and
(2) you never scale the targets, even when they are used as independent variables for "later" targets' regressions (not wrong for a tree-based model like your LightGBM [in fact, for LGBM, why bother scaling at all?], but other models might suffer from not scaling those).
(1) can be fixed by preprocessing as a pipeline step before RegressorChain. (2) can be fixed by changing the scaler's column specification to a callable, below using the helper make_column_selector. Fixing (2) this way does end up re-calculating the scalings at each step of the chain (hurting (1) again), but I think (2) is the bigger deal in the end (if you wanted to use something other than a tree model at some point).
So I would suggest instead:
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough',
)
scale_nums = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder='passthrough',
)
modeling_pipe = make_pipeline(scale_nums, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(
    regressor=modeling_pipe,
    transformer=PowerTransformer(),
)
final_pipeline = make_pipeline(encoder, target_transform_output)
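The "specify the columns by index" option mentioned in this answer can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the column layout (two numeric columns followed by an integer-coded day_of_week) is hypothetical, and LinearRegression stands in for the LightGBM model so that the snippet depends on scikit-learn alone.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n_samples = 60

# Hypothetical layout: columns 0-1 are numeric features, column 2 is day_of_week (0-6).
X = np.column_stack([
    rng.normal(size=n_samples),
    rng.normal(size=n_samples),
    np.arange(n_samples) % 7,
])
y = rng.normal(size=(n_samples, 3))  # three targets, as in the question

# Positional indices remain valid when the pipeline receives a plain ndarray.
ct = ColumnTransformer(
    transformers=[
        ("scaler", MinMaxScaler(), [0, 1]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),
    ],
    remainder="passthrough",  # earlier targets appended by the chain pass through unscaled
)
pipeline = make_pipeline(ct, LinearRegression())

# The estimator is passed positionally because the keyword name differs across versions.
chain = RegressorChain(pipeline, order=[1, 0, 2])
chain.fit(X, y)
predictions = chain.predict(X)
print(predictions.shape)
```

Note that this sketch still has both side effects listed above: the encoding and scaling are refit for every target, and the chained targets are passed through unscaled.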

Related

How to setup a batched matrix multiplication in Numba with np.dot() using contiguous arrays

I am trying to speed up a batched matrix multiplication problem with numba, but it keeps telling me that np.dot() is faster on contiguous arrays.
Note: I'm using numba version 0.55.1, and numpy version 1.21.5
Here's the problem:
import numpy as np
import numba as nb

# jit-compile the function; parallel=True lets nb.prange split the loop across threads
@nb.njit(parallel=True)
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        result[n,:] = np.dot(vec[n,:], mat[n,:,:])
    return result

D,N = 10,1000
mat = np.random.normal(0,1,(N,D,D))
vec = np.random.normal(0,1,(N,D))
result = numbaFastMatMult(mat,vec)
n = 0  # any valid batch index, used only for the slice checks below
print(mat.data.contiguous)
print(vec.data.contiguous)
print(mat[n,:,:].data.contiguous)
print(vec[n,:].data.contiguous)
Clearly, all the relevant data is contiguous (run the above code snippet and see the results of the print() calls).
But, when I run this code, I get the following warning:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 1d, C), array(float64, 2d, A))
result[n,:] = np.dot(vec[n,:], mat[n,:,:])
Two extra comments:
This is just a toy problem for replication. I'm actually using something with many more data points, so hoping this will speed up.
I think the "right" way to solve this is with np.tensordot. However, I want to understand what's going on for future reference. For example, this discussion addresses a similar issue, but as far as I can tell, doesn't address why the warning shows up directly.
I've tried adding a decorator:
nb.float64[:,::1](nb.float64[:,:,::1],nb.float64[:,::1]),
I've tried reordering the arrays so the batch index is first (n in the above code)
I've tried printing whether the "mat" variable is contiguous from inside the function
I'll leave this up, but I figured it out:
Outside of a numba function:
mat[n,:,:].data.contiguous==True
but inside numba, mat[n,:,:] is no longer contiguous.
Changing my code above to np.dot(vec[n], mat[n]) removed the warning.
I'm making this the "correct" answer since it solved my problem. However, according to max9111's response, this behavior may be a bug!
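As a cross-check that does not involve numba, the same batched product can also be written in plain NumPy. The sketch below is not taken from the thread. It compares the explicit loop (using the vec[n], mat[n] indexing from the fix above) with np.einsum and with matmul broadcasting. np.tensordot on its own sums over the chosen axes and does not keep a shared batch axis, so it is not used here.

```python
import numpy as np

D, N = 10, 1000
rng = np.random.default_rng(0)
mat = rng.normal(0, 1, (N, D, D))
vec = rng.normal(0, 1, (N, D))

# Reference: the explicit loop from the question, indexing with vec[n] and mat[n].
looped = np.zeros_like(vec)
for n in range(N):
    looped[n] = np.dot(vec[n], mat[n])

# Same contraction for every n at once: result[n, j] = sum_i vec[n, i] * mat[n, i, j]
batched_einsum = np.einsum("ni,nij->nj", vec, mat)

# Equivalent form: treat each vec[n] as a 1 x D matrix and broadcast over the batch axis.
batched_matmul = (vec[:, None, :] @ mat)[:, 0, :]

print(np.allclose(looped, batched_einsum), np.allclose(looped, batched_matmul))
```

Whether either vectorized form is faster than the compiled loop for a given problem size would need to be measured.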

Predefined feature selection from pandas dataframe as part of a sklearn gridsearch crossvalidation

TLDR: How to iterate through predefined feature subsets as part of a scikit-learn gridsearchcv pipeline?
For a regression task, I have set up a (nested) CV to choose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms, instead I simply want to try different predefined subsets of the available columns and get them tested against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using FunctionTransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
    return model_X.loc[:, feature_sets[feature_set]]
CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
And here is how I put all together in a pipeline and param grid (this is a reduced example for only the relevant steps):
PreProcessor = ColumnTransformer([
    ('selector', CustomFeatureSelector, model_X.columns),
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])
pipe = Pipeline(steps=[
    ('preprocessor', PreProcessor),
    ('regressor', DummyRegressor())  # just a dummy here, as it can't be empty (actual regressor see regressor_params)
])
preprocessor_params = [
    {
        'preprocessor__selector__kw_args': [{'feature_set': i} for i in range(len(feature_candidates_list))],
        'preprocessor__scaler__with_mean': [True, False],
        'preprocessor__scaler__with_std': [True, False],
    },
]
regressor_params = [
    {
        'regressor': [TweedieRegressor(max_iter=1000)],
        'regressor__power': [0, 1],
        'regressor__alpha': [0, 1],
        'regressor__link': ['log'],
        'regressor__fit_intercept': [True, False],
    },
]
params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
    'R2': 'r2',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'
with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )
    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )
I am not sure if this pipeline actually selects the features as intended.
model_X has m columns
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n but I expected n (also tested it for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems like it takes all original columns of model_X and concatenates them with the chosen set of columns instead of replacing them. When I switch off the scaler, the output of the above is indeed equal to n. However, I still cannot see which features were chosen for a respective estimator, because calling .feature_names_in_ on them raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in gridsearchcv is not smart and I should go a different route? Any hints welcome!
Update:
I switched to sklearn nightly (v1.2.dev0) where I can use set_config(transform_output='pandas') to avoid getting my dataframe converted to a numpy array by transformers. This helps to get feature names when calling the .feature_names_in_ on one of the estimators but it only works when I have just the scaler activated.
When I also activate my custom selector the fitting fails for all folds. But when I turn off the set_config(...) again, it works just like in the stable versions v1.1.2 and v1.1.3 without ability to get feature names.
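The m + n count reported in the question is consistent with how ColumnTransformer combines its transformers: each one is applied to the columns it is given, and the outputs are stacked side by side. A minimal sketch of that behaviour, assuming a small NumPy array in place of model_X and a hypothetical select_columns helper in place of SelectFeatures:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = np.arange(20, dtype=float).reshape(5, 4)  # m = 4 columns
all_columns = [0, 1, 2, 3]

def select_columns(X, columns):
    """Hypothetical selector: keep only the given column positions."""
    return X[:, columns]

selector = FunctionTransformer(select_columns, kw_args={"columns": [0, 2]})  # n = 2

# Both transformers receive all columns; their outputs are concatenated.
side_by_side = ColumnTransformer([
    ("selector", selector, all_columns),
    ("scaler", StandardScaler(), all_columns),
])
concatenated_shape = side_by_side.fit_transform(X).shape

# Chaining the two steps instead scales only the selected columns.
sequential = make_pipeline(selector, StandardScaler())
sequential_shape = sequential.fit_transform(X).shape

print(concatenated_shape, sequential_shape)
```

With a sequential layout like this, the kw_args entry in the parameter grid would be addressed through the pipeline step name rather than through a ColumnTransformer. Whether that fits the rest of the setup in the question is not verified here.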

Trouble with KNN on OpenCV, new_samples.type() == CV_32F when training

I am trying to set up a simple KNN implementation with a three-class dataset, but whenever I try to execute the train function I keep getting the (-215:Assertion failed) new_samples.type() == CV_32F in function 'cv::ml::Impl::train' error.
I have tried reshaping the responses array into many different things since most of the errors came from that part of the code, that goes from 1 x n matrix to a single list. I am following this tutorial. I can get it done with two classes by defining my own data just like I do with three classes but I can't manage to train with three classes.
import numpy as np
import cv2 as cv
classA=([(10,1,1),(9,2,2),(11,1,2),(8,3,2),(7,2,3),(8,5,4),(9,3,4),(6,6,5),(8,6,6),(9,7,7)])
classB=([(5,1,20),(5,2,19),(5,1,21),(4,2,18),(4,1,19),(6,3,20),(6,2,19),(4,4,18),(4,5,21),(6,4,19)])
classC=([(5,14,10),(6,13,9),(4,12,11),(6,11,9),(6,7,12),(7,6,13),(7,7,10), (7,8,11),(8,8,12),(7,6,11)])
points = classA + classB + classC
responses = ([0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2])
# Using numpy's array?
points_np = np.asarray(points)
responses_np = np.asarray(responses).reshape((30,1))
#print(points_np)
#print(responses_np)
knn = cv.ml.KNearest_create()
knn.train(points_np, cv.ml.ROW_SAMPLE, responses_np)
I know both sample and response data should follow a similar structure so the function can associate each point to a class but I think my issue is on the type of structure I am using for the responses variable. How should I shape or set the responses variable in order to be readable for the train function?
As indicated in the assertion, the data type for samples must be CV_32F, which stands for 32 bit float.
points_np = np.asarray(points).astype(np.float32)
responses_np = np.asarray(responses).reshape((30,1)).astype(np.float32)
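To see why the assertion fires, it can help to inspect the dtypes before calling train. The sketch below uses only NumPy and a shortened, hypothetical version of the data (one point per class). It shows that integer tuples produce an integer array by default and that requesting np.float32 up front gives the type the assertion asks for.

```python
import numpy as np

# Shortened, hypothetical data: one sample per class.
points = [(10, 1, 1), (5, 1, 20), (5, 14, 10)]
responses = [0, 1, 2]

# Python ints become an integer array, which does not correspond to CV_32F.
as_is = np.asarray(points)
print(as_is.dtype)

# Requesting float32 directly is equivalent to converting afterwards with astype.
samples = np.asarray(points, dtype=np.float32)
labels = np.asarray(responses, dtype=np.float32).reshape(-1, 1)
print(samples.dtype, samples.shape, labels.dtype, labels.shape)
```

Arrays prepared this way would then be passed to knn.train(samples, cv.ml.ROW_SAMPLE, labels) as in the question.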

Object dtype dtype('O') has no native HDF5 equivalent

A couple of similar questions have been asked here on Stack Overflow, but none of them seems to have been answered properly, nor do they describe this exact case.
I have a problem with saving an array or list into HDF5.
I have several files, each containing a list of shape (n, 35), where n may differ between files. Each of them can be saved to HDF5 individually with the code below.
hdf = hf.create_dataset(fname, data=d)
However, if I want to merge them into a single 3D dataset, the following error occurs:
Object dtype dtype('O') has no native HDF5 equivalent
I have no idea why it turns to dtype object, since what I have done is only this
all_data = list()
for fname in file_list:
    d = np.load(fname)
    all_data.append(d)
hdf = hf.create_dataset('all_data', data=all_data)
How can I save such data?
I tried a couple of tests, and it seems that all_data gets dtype 'object' when I convert it with
all_data = np.array(all_data)
which appears to cause the same problem when saving to HDF5.
Again, how can I save such data in HDF5?
I was running into a similar issue with h5py, and changing the type of the NumPy array using array.astype worked for me (I believe this changes the type from dtype('O') to the data type you specify). Please see the code snippet below:
import numpy as np
print(X.dtype)
--> dtype('O')
print(X.astype(np.float64).dtype)
--> dtype('float64')
When I ran h5.create_dataset with this data type conversion, I was able to successfully create a h5 dataset. Hope this helps!
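ONE ADDITIONAL UPDATE: I believe the NumPy object type 'O' is created when the elements cannot be represented by a single native dtype (e.g. numbers mixed with None or other Python objects, or sub-arrays of differing shapes). Mixing purely numeric types such as np.int8 and np.float32 just upcasts to a common numeric dtype.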
dtype('O') stands for object. In my case I had a list of lists where the lengths were different and got the same error. If you convert it to a numpy array, numpy warns "Creating an ndarray from ragged nested sequences". h5 files can't handle this type of data; for more info see this post.
This error comes when I use:
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X)
    pfile['peakX'] = np.array(Y)
However when I used dtype when saving the arrays... the problem went away... I guess h5py is not able to create datasets from undefined data types.
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X, dtype=np.float32)
    pfile['peakX'] = np.array(Y, dtype=np.float32)
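For the case in the original question, where each file holds an (n, 35) array with a different n, a dtype conversion alone would not be enough, because the arrays cannot be stacked into a regular 3D block. One possible workaround is to pad every array to a common number of rows. This is only a sketch, assuming NaN padding is acceptable for the data. The arrays below are hypothetical stand-ins for what np.load returns.

```python
import numpy as np

# Hypothetical stand-ins for the per-file arrays: shape (n, 35), with n varying.
all_data = [np.ones((2, 35)), np.ones((4, 35)), np.ones((3, 35))]

max_rows = max(d.shape[0] for d in all_data)

# Pre-fill a regular 3D float array with NaN, then copy each (n, 35) block into it.
padded = np.full((len(all_data), max_rows, 35), np.nan, dtype=np.float64)
for i, d in enumerate(all_data):
    padded[i, : d.shape[0], :] = d

# Keep the true row counts so the padding can be stripped again after loading.
lengths = np.array([d.shape[0] for d in all_data])

print(padded.shape, padded.dtype)
```

Because padded has a plain float64 dtype, passing it as data= to create_dataset as in the question should avoid the object-dtype error. Storing lengths alongside it, or writing one dataset per file, are alternative design choices.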

Tensorflow classifier.predict write to csv

I've not used Tensorflow for a while and when I updated it seemed to have broken my old code as many of the old functions are deprecated. I fixed them with the new code and it all seems to be running except for when I write out the results:
y_predicted = classifier.predict(X_test)
There is an as_iterable option as well, which I don't think I need.
I used to write out the results of the predictions using:
pandas.DataFrame(y_predicted).to_csv(/dir/)
but now I am getting an error that not all elements can be converted into String type. Is there a class in y_predicted I am supposed to be calling instead of the whole thing?
Anyways, I found a solution using np.array instead of a pandas dataframe:
result = np.asarray(y_predicted)
formatInt = result.astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
np.savetxt("dir",formatInt,delimiter=",")
You can also try,
df = pandas.DataFrame({'Prediction':list(y_predicted)})
df.to_csv('filename.csv')
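The NumPy route above can be checked without TensorFlow. The sketch below is illustrative only: y_predicted is a hypothetical list of numeric class labels standing in for the output of classifier.predict(X_test), and an in-memory buffer replaces the real output path.

```python
import io
import numpy as np

# Hypothetical predictions; in the question these come from classifier.predict(X_test).
y_predicted = [np.float32(1.0), np.float32(0.0), np.float32(2.0)]

# list() also consumes a generator, in case predict returns an iterable.
result = np.asarray(list(y_predicted))
labels = result.astype(int)

# An in-memory text buffer stands in for a real file path here.
buffer = io.StringIO()
np.savetxt(buffer, labels, delimiter=",", fmt="%d")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing fmt="%d" keeps the labels as plain integers in the output. Without it, np.savetxt writes values in scientific notation by default.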