Predefined feature selection from pandas dataframe as part of a sklearn gridsearch crossvalidation - pandas

TLDR: How to iterate through predefined feature subsets as part of a scikit-learn gridsearchcv pipeline?
For a regression task, I have set up a (nested) CV to choose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms, instead I simply want to try different predefined subsets of the available columns and get them tested against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using FunctionTransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
    return model_X.loc[:, feature_sets[feature_set]]
CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
And here is how I put all together in a pipeline and param grid (this is a reduced example for only the relevant steps):
PreProcessor = ColumnTransformer([
    ('selector', CustomFeatureSelector, model_X.columns),
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])
pipe = Pipeline(steps=[
    ('preprocessor', PreProcessor),
    ('regressor', DummyRegressor())  # just a dummy here, as it can't be empty (actual regressor see regressor_params)
])
preprocessor_params = [
    {
        'preprocessor__selector__kw_args': [{'feature_set': i} for i in range(len(feature_candidates_list))],
        'preprocessor__scaler__with_mean': [True, False],
        'preprocessor__scaler__with_std': [True, False],
    },
]
regressor_params = [
    {
        'regressor': [TweedieRegressor(max_iter=1000)],
        'regressor__power': [0, 1],
        'regressor__alpha': [0, 1],
        'regressor__link': ['log'],
        'regressor__fit_intercept': [True, False],
    },
]
params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
    'R2': 'r2',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'
with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )
    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )
I am not sure if this pipeline actually selects the features as intended.
model_X has m columns
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n but I expected n (also tested it for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems to take all original columns of model_X and concatenate them with the chosen set of columns instead of replacing them. When I switch off the scaler, the output of the above is indeed equal to n; however, I still cannot see which features were chosen for a given estimator, because calling .feature_names_in_ on it raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in gridsearchcv is not smart and I should go a different route? Any hints welcome!
Update:
I switched to sklearn nightly (v1.2.dev0) where I can use set_config(transform_output='pandas') to avoid getting my dataframe converted to a numpy array by transformers. This helps to get feature names when calling the .feature_names_in_ on one of the estimators but it only works when I have just the scaler activated.
When I also activate my custom selector the fitting fails for all folds. But when I turn off the set_config(...) again, it works just like in the stable versions v1.1.2 and v1.1.3 without ability to get feature names.
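The m + n count described above is consistent with how ColumnTransformer works: each transformer receives the original input, and the outputs are concatenated side by side, so the selector's n columns end up next to the scaler's m columns. A minimal sketch of one alternative follows, assuming the selection and the scaling are chained as consecutive Pipeline steps instead. The class FeatureSetSelector, the toy data, and the Ridge regressor are hypothetical stand-ins, not part of the original setup.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSetSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the columns of one predefined subset."""

    def __init__(self, feature_sets, feature_set=0):
        self.feature_sets = feature_sets  # list of tuples of column names
        self.feature_set = feature_set    # index into feature_sets, tuned by the grid

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        return X.loc[:, list(self.feature_sets[self.feature_set])]


# Toy data (assumption): the target depends on columns "a" and "c" only.
rng = np.random.default_rng(0)
model_X = pd.DataFrame(rng.normal(size=(60, 4)), columns=["a", "b", "c", "d"])
model_y = 2.0 * model_X["a"] - model_X["c"] + rng.normal(scale=0.1, size=60)

feature_candidates_list = [("a", "b"), ("a", "c"), ("b", "d")]

# Selection happens first; the scaler then only ever sees the chosen columns.
pipe = Pipeline([
    ("selector", FeatureSetSelector(feature_candidates_list)),
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])
param_grid = {
    "selector__feature_set": list(range(len(feature_candidates_list))),
    "regressor__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2").fit(model_X, model_y)

# The chosen subset can be read from best_params_ rather than feature_names_in_.
best_index = search.best_params_["selector__feature_set"]
chosen_columns = feature_candidates_list[best_index]
n_features = search.best_estimator_.named_steps["regressor"].n_features_in_
print(chosen_columns, n_features)
```

With this layout the regressor should see exactly n features, and the selected column names remain recoverable from the search results even when intermediate steps return plain arrays.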

Related

SkLearn - Using RegressorChain with ColumnTransformer in Pipelines?

I'm having problems using sklearn's RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), and unfortunately there doesn't seem to be a lot of documentation/examples about this.
The documentation states indirectly (through the set_params method) that it can be used with Pipelines. My pipeline has:
ct = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), numerical_columns),
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough'
)
cv = TimeSeriesSplit(n_splits = groups.nunique()) #groups by date
pipeline = make_pipeline(ct, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
and then I do:
chain_regressor = RegressorChain(base_estimator=target_transform_output , order=[1,0,2])
chain_regressor.fit(X, y)
In the above, both X and y are pandas Dataframes, and y has 3 target columns.
When I run the code, I get a Python stack trace caused by the fit() call, starting in __init__.py in _get_column_indices(X, key) when doing all_columns = X.columns. The error is:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
and further down at the end of the stack trace:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I assume this is because the ColumnTransformer returns ndarrays, a well-known problem. Does this mean that the RegressorChain can't be used with the ColumnTransformer?
After this, I removed the column transformer step from the pipeline and tried again, and without the ColumnTransformer everything works fine (even the TransformedTargetRegressor).
Any help, ideas or workaround appreciated.
You have the issue the wrong way around: it's not that ColumnTransformer outputs an array and RegressorChain expected a dataframe; rather, the RegressorChain converts your input to an array before calling your pipeline, and so your ColumnTransformer doesn't get a dataframe as input and cannot use your column-name specifications.
You could just specify the columns by index or callable in the ColumnTransformer. But I think in this case, you have two unfortunate side-effects:
For each target, you are re-encoding day_of_week and re-scaling each independent variable (not wrong, just a little wasteful), and
you never scale the targets, even when they are used as independent variables for "later" targets' regressions (not wrong for a tree-based model like your lightGBM [in fact, for LGBM, why bother scaling at all?], but other models might suffer from not scaling those).
(1) can be fixed by preprocessing as a pipeline step before RegressorChain. (2) can be fixed by changing the scaler's column specification to a callable, below using the helper make_column_selector. The fix for (2) does end up re-calculating the scalings at each step (hurting (1) again), but I think (2) is the bigger deal in the end (if you wanted to use something other than a tree model at some point).
So I would suggest instead:
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough',
)
scale_nums = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder='passthrough',
)
modeling_pipe = make_pipeline(scale_nums, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(
    regressor=modeling_pipe,
    transformer=PowerTransformer(),
)
final_pipeline = make_pipeline(encoder, target_transform_output)
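The point about column specifications can be checked in isolation. The sketch below, using a made-up two-column frame (the names x1 and x2 are assumptions), shows that a ColumnTransformer addressed by column name works on a DataFrame but fails once it is handed a plain array, whereas a positional specification keeps working.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy frame (assumption): two numeric columns.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 40.0]})

by_name = ColumnTransformer([("scaler", MinMaxScaler(), ["x1"])], remainder="passthrough")
by_index = ColumnTransformer([("scaler", MinMaxScaler(), [0])], remainder="passthrough")

by_name.fit(df)  # fine: the input still carries column names

try:
    by_name.fit(df.to_numpy())  # names are gone once the input is an array
    name_spec_failed = False
except (ValueError, AttributeError):
    name_spec_failed = True

# A positional specification does not depend on column names.
scaled = by_index.fit_transform(df.to_numpy())
print(name_spec_failed, scaled.shape)
```

This mirrors what happens inside a meta-estimator that converts its input to an array before calling the wrapped pipeline.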

Does plotting with Koalas using TopN have any statistical meaning?

I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns out that they use either sampling or TopN, i.e. selecting a given number of records.
I understand the meaning of sampling and internally it uses spark.DataFrame.sample to do it. For TopN, however, they simply take the first max_rows number of records from Koalas' DataFrame using data = data.head(max_rows + 1).to_pandas().
This seems strange and I wonder whether it's correctly reflecting the statistical properties of the dataset doing the data selection in this way.
Koalas DataFrame's plot accessor:
class KoalasPlotAccessor(PandasObject):
    pandas_plot_data_map = {
        "pie": TopNPlotBase().get_top_n,
        "bar": TopNPlotBase().get_top_n,
        "barh": TopNPlotBase().get_top_n,
        "scatter": SampledPlotBase().get_sampled,
        "area": SampledPlotBase().get_sampled,
        "line": SampledPlotBase().get_sampled,
    }
    _backends = {}  # type: ignore
    ...
class TopNPlotBase:
    def get_top_n(self, data):
        from databricks.koalas import DataFrame, Series
        max_rows = get_option("plotting.max_rows")
        # Simply use the first 1k elements and make it into a pandas dataframe
        # For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
        if isinstance(data, (Series, DataFrame)):
            data = data.head(max_rows + 1).to_pandas()
        ...
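The comment in the quoted source hints at the intended usage: bar and pie plots are expected to be called on the result of value_counts(). A small pandas sketch of that case follows (plain pandas is used here as a stand-in for Koalas, and the data and the max_rows value are made up). Because value_counts() sorts by descending frequency by default, taking the first rows of its result yields the most frequent categories rather than an arbitrary slice.

```python
import pandas as pd

# Toy categorical column (assumption).
s = pd.Series(["a", "b", "a", "c", "a", "b", "d"])
max_rows = 2  # stand-in for the "plotting.max_rows" option

counts = s.value_counts()      # sorted by descending frequency by default
top_n = counts.head(max_rows)  # so head() keeps the most frequent categories
print(top_n)
```

On raw, unaggregated rows the same head() call would only return the first rows in storage order, which is the case the question is worried about.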

FeatureUnion: keep existing features plus add new engineered features (aka transformed columns)

Say I have a dataset with 2 numerical columns and I would like to add a third column which is the product of the two (or some other function of the two existing columns). I can compute the new feature using a ColumnTransformer:
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
)
(X is a pandas DataFrame, therefore the indexing via column names. Note also the reshape I had to use. Maybe someone has a better idea there.)
As written above I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures is doing), i.e. use all 3 columns to fit a linear model (or generally use them in an sklearn pipeline). How do I do this?
For example,
df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
)
tfs.fit_transform(df)
gives
array([[1.        ],
       [1.73205081]])
but I would like to get an array that includes the original columns to pass this on in the pipeline.
The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
(I would like to make a pipeline rather than change the DataFrame so that I do not forget to make the augmentation when calling model.predict().)
Reading the documentation more carefully I found that it is possible to pass "special-cased strings" to "indicate to drop the columns or to pass them through untransformed, respectively."
So one possibility to achieve my goal is
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
    ("passthrough", df.columns)
)
yielding
array([[1.        , 3.        , 2.        ],
       [1.73205081, 4.        , 1.        ]])
In the end there is thus no need for FeatureUnion but it can be done with ColumnTransformer or make_column_transformer alone, resp.
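A self-contained version of the passthrough approach might look like the sketch below. Two details are assumptions rather than part of the original post: the named function sqrt_of_difference replaces the lambda, and it converts its input with np.asarray so that it behaves the same whether the transformer receives a DataFrame or an array. Indexing with a list (X[:, [0]]) keeps the result two-dimensional, which avoids the explicit reshape.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer


def sqrt_of_difference(X):
    """Hypothetical helper: sqrt(col0 - col1), returned as a single 2-D column."""
    X = np.asarray(X, dtype=float)  # works for DataFrame and ndarray input alike
    return np.sqrt(X[:, [0]] - X[:, [1]])  # list indexing keeps shape (n_samples, 1)


df = pd.DataFrame({"colX": [3, 4], "colY": [2, 1]})

tfs = make_column_transformer(
    (FunctionTransformer(sqrt_of_difference), ["colX", "colY"]),
    ("passthrough", ["colX", "colY"]),  # keep the original columns as well
)
out = tfs.fit_transform(df)
print(out)
```

The engineered column comes first and the two original columns follow, matching the order in which the transformers are listed.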

Efficient axis-wise cartesian product of multiple 2D matrices with Numpy or TensorFlow

So first off, I think what I'm trying to achieve is some sort of Cartesian product but elementwise, across the columns only.
What I'm trying to do is, if you have multiple 2D arrays of size [ (N,D1), (N,D2), (N,D3)...(N,Dn) ]
The result is thus to be a combinatorial product across axis=1 such that the final result will then be of shape (N, D) where D=D1*D2*D3*...Dn
e.g.
A = np.array([[1,2],
[3,4]])
B = np.array([[10,20,30],
[5,6,7]])
cartesian_product( [A,B], axis=1 )
>> np.array([[ 1*10, 1*20, 1*30, 2*10, 2*20, 2*30 ]
[ 3*5, 3*6, 3*7, 4*5, 4*6, 4*7 ]])
and extendable to cartesian_product([A,B,C,D...], axis=1)
e.g.
A = np.array([[1,2],
[3,4]])
B = np.array([[10,20],
[5,6]])
C = np.array([[50, 0],
[60, 8]])
cartesian_product( [A,B,C], axis=1 )
>> np.array([[ 1*10*50, 1*10*0, 1*20*50, 1*20*0, 2*10*50, 2*10*0, 2*20*50, 2*20*0]
[ 3*5*60, 3*5*8, 3*6*60, 3*6*8, 4*5*60, 4*5*8, 4*6*60, 4*6*8]])
I have a working solution that essentially creates an empty (N,D) matrix and then broadcasts a column-wise vector product for each column within nested for loops over each matrix in the provided list. Clearly this is horrible once the arrays get larger!
Is there an existing solution within numpy or tensorflow for this? Potentially one that is efficiently parallelizable (a tensorflow solution would be wonderful, but a numpy one is ok, and as long as the vector logic is clear it shouldn't be hard to make a tf equivalent).
I'm not sure if I need to use einsum, tensordot, meshgrid or some combination thereof to achieve this. I have a solution but only for single-dimension vectors from https://stackoverflow.com/a/11146645/2123721 even though that solution claims to work for arrays of arbitrary dimensions (which appears to mean vectors). With that one I can do a .prod(axis=1), but again this is only valid for vectors.
thanks!
Here's one approach that does this iteratively, accumulating the result with broadcasting after extending dimensions for each pair from the list of arrays for elementwise multiplication -
L = [A,B,C] # list of arrays
n = L[0].shape[0]
out = (L[1][:,None]*L[0][:,:,None]).reshape(n,-1)
for i in L[2:]:
    out = (i[:,None]*out[:,:,None]).reshape(n,-1)
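The accumulation above can be wrapped in a small function and checked against the examples from the question. This is a sketch only; the function name columnwise_product is made up, and functools.reduce is used in place of the explicit loop.

```python
from functools import reduce

import numpy as np


def columnwise_product(arrays):
    """Row-wise Cartesian product of the columns of several (N, D_i) arrays."""
    n = arrays[0].shape[0]

    def combine(acc, arr):
        # (N, D_acc, 1) * (N, 1, D_arr) -> (N, D_acc, D_arr), then flatten per row
        return (acc[:, :, None] * arr[:, None, :]).reshape(n, -1)

    return reduce(combine, arrays)


A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20, 30], [5, 6, 7]])
C2 = np.array([[10, 20], [5, 6]])
C3 = np.array([[50, 0], [60, 8]])

two = columnwise_product([A, B])        # shape (2, 6)
three = columnwise_product([A, C2, C3]) # shape (2, 8)
print(two)
print(three)
```

Each step only broadcasts one extra axis, so the largest intermediate array is the final (N, D) result itself.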

How to filter tensor from queue based on some predicate in tensorflow?

How can I filter data stored in a queue using a predicate function? For example, let's say we have a queue that stores tensors of features and labels and we just need those that meet the predicate. I tried the following implementation without success:
feature, label = queue.dequeue()
if (predicate(feature, label)):
    enqueue_op = another_queue.enqueue(feature, label)
The most straightforward way to do this is to dequeue a batch, run them through the predicate test, use tf.where to produce a dense vector of the ones that match the predicate, and use tf.gather to collect the results, and enqueue that batch. If you want that to happen automatically, you can start a queue runner on the second queue - the easiest way to do that is to use tf.train.batch:
Example:
import numpy as np
import tensorflow as tf
a = tf.constant(np.array([5, 1, 9, 4, 7, 0], dtype=np.int32))
q = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue = q.enqueue_many([a])
dequeue = q.dequeue_many(6)
predmatch = tf.less(dequeue, [5])
selected_items = tf.reshape(tf.where(predmatch), [-1])
found = tf.gather(dequeue, selected_items)
secondqueue = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue2 = secondqueue.enqueue_many([found])
dequeue2 = secondqueue.dequeue_many(3) # XXX, hardcoded
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue)  # Fill the first queue
    sess.run(enqueue2)  # Filter, push into queue 2
    print(sess.run(dequeue2))  # Pop items off of queue2
The predicate produces a boolean vector; the tf.where produces a dense vector of the indexes of the true values, and the tf.gather collects items from your original tensor based upon those indexes.
A lot of things are hardcoded in this example that you'd need to make not-hardcoded in reality, of course, but hopefully it shows the structure of what you're trying to do (create a filtering pipeline). In practice, you'd want QueueRunners on there to keep things churning automatically. Using tf.train.batch is very useful to handle that automatically -- see Threading and Queues for more detail.
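The mask, index, gather sequence described above can be traced without a session. The sketch below uses NumPy as a stand-in for the TensorFlow ops (the comparison for tf.less, np.flatnonzero for tf.where plus the reshape, np.take for tf.gather) on the same six values; it illustrates the filtering logic only and is not a replacement for the queue-based pipeline.

```python
import numpy as np

# Same values as the batch dequeued in the example above.
batch = np.array([5, 1, 9, 4, 7, 0], dtype=np.int32)

predmatch = batch < 5                        # boolean vector, like tf.less
selected_items = np.flatnonzero(predmatch)   # indexes of the True entries, like tf.where
found = np.take(batch, selected_items)       # matching items, like tf.gather

print(selected_items, found)
```

Three of the six values pass the predicate, which is why the example dequeues exactly three items from the second queue.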