FeatureUnion: keep existing features plus add new engineered features (aka transformed columns) - pandas

Say I have a dataset with 2 numerical columns and I would like to add a third column which is the product of the two (or some other function of the two existing columns). I can compute the new feature using a ColumnTransformer:
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:, 0] - X[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
(X is a pandas DataFrame, therefore the indexing via column names. Note also the reshape I had to use. Maybe someone has a better idea there.)
As written above I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures is doing), i.e. use all 3 columns to fit a linear model (or generally use them in an sklearn pipeline). How do I do this?
For example,
df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:, 0] - X[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
tfs.fit_transform(df)
gives
array([[1.        ],
       [1.73205081]])
but I would like to get an array that includes the original columns to pass this on in the pipeline.
The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
(I would like to make a pipeline rather than change the DataFrame so that I do not forget to make the augmentation when calling model.predict().)

Reading the documentation more carefully I found that it is possible to pass "special-cased strings" to "indicate to drop the columns or to pass them through untransformed, respectively."
So one possibility to achieve my goal is
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:, 0] - X[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
    ("passthrough", df.columns),
)
yielding
array([[1.        , 3.        , 2.        ],
       [1.73205081, 4.        , 1.        ]])
In the end there is thus no need for FeatureUnion; it can be done with ColumnTransformer (or make_column_transformer) alone.
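To use the augmented feature matrix in a pipeline (so the augmentation also happens automatically at predict() time), a minimal sketch could look like this; the LinearRegression and the target values are made up purely for illustration:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
y = [1.0, 2.0]  # hypothetical target, just for illustration

tfs = make_column_transformer(
    # validate=True converts the selected columns to a NumPy array,
    # so the positional indexing X[:, 0] works regardless of input type
    (FunctionTransformer(lambda X: np.sqrt(X[:, 0] - X[:, 1]).reshape(-1, 1),
                         validate=True),
     ["colX", "colY"]),
    ("passthrough", ["colX", "colY"]),
)

model = make_pipeline(tfs, LinearRegression())
model.fit(df, y)    # the engineered column is built inside the pipeline
model.predict(df)   # and rebuilt automatically at predict time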

Related

Predefined feature selection from pandas dataframe as part of a sklearn gridsearch crossvalidation

TLDR: How to iterate through predefined feature subsets as part of a scikit-learn gridsearchcv pipeline?
For a regression task, I have set up a (nested) CV to chose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms, instead I simply want to try different predefined subsets of the available columns and get them tested against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using FunctionTransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
    return model_X.loc[:, feature_sets[feature_set]]
CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
And here is how I put all together in a pipeline and param grid (this is a reduced example for only the relevant steps):
PreProcessor = ColumnTransformer([
    ('selector', CustomFeatureSelector, model_X.columns),
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])
pipe = Pipeline(steps=[
    ('preprocessor', PreProcessor),
    ('regressor', DummyRegressor())  # just a dummy here, as it can't be empty (actual regressor see regressor_params)
])
preprocessor_params = [
    {
        'preprocessor__selector__kw_args': [{'feature_set': i} for i in range(len(feature_candidates_list))],
        'preprocessor__scaler__with_mean': [True, False],
        'preprocessor__scaler__with_std': [True, False],
    },
]
regressor_params = [
    {
        'regressor': [TweedieRegressor(max_iter=1000)],
        'regressor__power': [0, 1],
        'regressor__alpha': [0, 1],
        'regressor__link': ['log'],
        'regressor__fit_intercept': [True, False],
    },
]
params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
    'R2': 'r2',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'

with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )
    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )
I am not sure if this pipeline actually selects the features as intended.
model_X has m columns
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n but I expected n (also tested it for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems like it is taking all original columns of model_X and concatenating them with the chosen set of columns instead of replacing them. When I switch off the scaler, the output of the above is indeed equal to n; however, I still cannot see which features were chosen for a respective estimator, because calling .feature_names_in_ on them raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in GridSearchCV is not smart and I should go a different route? Any hints welcome!
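As a side note on the m + n observation: a ColumnTransformer concatenates the outputs of all its transformers side by side rather than chaining them, which a tiny sketch (with made-up column names and a hypothetical one-column selector) can reproduce:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})  # m = 3 columns

ct = ColumnTransformer([
    ('selector', FunctionTransformer(lambda d: d[['a']]), X.columns),  # keeps n = 1 column
    ('scaler', StandardScaler(), X.columns),                           # keeps all m columns
])

print(ct.fit_transform(X).shape)  # (2, 4), i.e. n + m features, not n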
Update:
I switched to the sklearn nightly build (v1.2.dev0), where I can use set_config(transform_output='pandas') to avoid getting my DataFrame converted to a numpy array by transformers. This helps to get feature names when calling .feature_names_in_ on one of the estimators, but it only works when just the scaler is activated.
When I also activate my custom selector, the fitting fails for all folds. But when I turn off the set_config(...) again, it works just like in the stable versions v1.1.2 and v1.1.3, without the ability to get feature names.

Trying to drop values by column in Dask (I convert these values to NaN, but they could be anything) not working

I'm trying to drop NAs by column in Dask, given a certain threshold, but I'm receiving the error below even though this should be working. Please advise.
Reproducible example:
import pandas as pd
import numpy as np
import dask.dataframe as dd

data = [['tom', 10], ['nick', 15], ['juli', 5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
df = df.replace(5, np.nan)
ddf = dd.from_pandas(df, npartitions=2)
ddf.dropna(axis='columns')
Passing axis is not supported for Dask DataFrames as of now. You can also print the docstring of the function via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. versionchanged:: 1.0.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.
thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
    If True, do operation inplace and return None.

Returns
-------
DataFrame or None
    DataFrame with NA entries dropped from it or None if ``inplace=True``.
Worth noting that the Dask documentation is copied from pandas in many instances like this. But wherever it does so, it specifically states:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation.
The reason this isn't supported in dask is because it requires computing the entire dataframe in order for dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask deliberately does not implement some parts of the pandas API that look like normal pandas operations but in reality can't be scheduled without triggering a compute on the current frame. You're running into this issue by design, because while .dropna(axis=0) would work just fine as a scheduled operation, .dropna(axis=1) would have a very different implication.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.
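Applied to the reproducible example above, a minimal sketch of this workaround could look as follows; the explicit .compute() on the column mask just makes the implicit full scan visible:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 5]], columns=['Name', 'Age'])
ddf = dd.from_pandas(df.replace(5, np.nan), npartitions=2)

# Columns that contain no missing values; evaluating the mask scans the whole frame
keep = ddf.columns[~ddf.isna().any(axis=0).compute()]

result = ddf[list(keep)]   # 'Age' is dropped because it contains a NaN
print(result.compute())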

How to remove objects from a list using a list of indexes on python?

I have a DataFrame from which I wanted to randomly select 20% of the data to use as test data. However, I need to remove said data from my original set to use as training data.
I have a list of the indexes the random sample is made up of (indexes of the original DataFrame). When I use a for loop and .pop(), the indexes shift, so the elements removed after the first iteration are not the ones that are in my test DataFrame. I need help removing this data from the first DataFrame, but no function will take a list of indexes as an argument. What can I do about this? Is there a way to subtract one DataFrame from another?
Regarding your question,
Is there a way to subtract one DataFrame from another?
You can simply drop the indexes belonging to Test from the primary DataFrame to get your Train. Try this -
train = df.drop(test.index, axis=0)
#Where df is the main dataset from which test data has been sampled.
#train, test, df are all pd.DataFrames
However, if you are preparing data for a machine learning problem, I would recommend some better methods, as discussed in the next part of my answer.
1. Using Sklearn API (Recommended)
You could try using the sklearn.model_selection.train_test_split api to save you a lot of time in doing such train test splits.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(np.random.random((100, 10)))
train, test = train_test_split(df, test_size=0.2)
train.shape, test.shape
((80, 10), (20, 10))
2. Using pandas methods
Another way is to sample 20% data from df and then filter the rest for train.
test = df.sample(frac=0.2)
train = df.loc[~df.index.isin(test.index)]
train.shape, test.shape
((80, 10), (20, 10))
3. Starting with a list of indexes
Let's say you already have a list of indexes (test_idx), as you mention in your question. In that case, you can still work with pandas methods to do this without any for loops or pop().
test_idx = np.random.choice(range(100), 20, replace=False) #approx 20% random indexes
test = df.loc[df.index.isin(test_idx)]
train = df.loc[~df.index.isin(test_idx)]
train.shape, test.shape
((80, 10), (20, 10))
There are a couple of solutions to this problem. You could...
Iterate in reverse
Create another array to store the values
Use list comprehension
An example of the third method is as follows.
Let's say that you want to remove all 2's from an array:
data = [1, 2, 3, 2, 2, 1]
new_data = [n for n in data if n != 2]
# new_data = [1, 3, 1]
In my past experience this is always the method I use when cleaning/reconstructing arrays.
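For completeness, a minimal sketch of the first option (iterating over the indexes in reverse): deleting from the end first means the earlier positions are not shifted by each pop(). The index list here is just for illustration.
data = [1, 2, 3, 2, 2, 1]
idx_to_remove = [1, 3, 4]  # positions of the 2's
for i in sorted(idx_to_remove, reverse=True):
    data.pop(i)
# data is now [1, 3, 1]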

Subsetting DataFrames Using a List in Julia

I was wondering if you can subset a DataFrame like the one below based on the values of one of the columns (such as ids). You can use the equality operator as in df2; however, if you want to subset based on a list like ids, I cannot find an operator for it, as the .in operator does not seem to work with DataFrames. Is there another operator I could use?
df = DataFrame(ids = [1, 1000, 10000, 100000,1,2,3,4], B = [1,2,3,4,123,6,2,7], D = ["N", "M", "I", "J","hi","CE", "M", "S"])
df2= df[df[:pmid] .== 1000, :]
ids = [2,3, 10000]
df3= df[df[:pmid] .in ids,:]
As of right now df3 gives me a bounds error.
Also I am running this on Julia 0.6.4
I guess there's a typo in your first line: ids = should be pmid =, since you're filtering using that name later.
As for df3, the correct syntax should be (I tried on 1.0.2):
df3= df[in.(df[:pmid], [ids]),:]
note the added [] around ids: wrapping it in a vector keeps broadcasting from iterating over the elements of ids, so each value of :pmid is tested against the whole list.
I'd like to point you to DataFramesMeta.jl package, which provides much clearer syntax:
using DataFramesMeta
@where df (in.(:pmid, [ids]))
There was also quite an interesting discussion on discourse.julialang.org
regarding syntax for filtering by list, including performance tips.

Efficient axis-wise cartesian product of multiple 2D matrices with Numpy or TensorFlow

So first off, I think what I'm trying to achieve is some sort of Cartesian product but elementwise, across the columns only.
What I'm trying to do is: given multiple 2D arrays of shapes [(N, D1), (N, D2), (N, D3), ..., (N, Dn)],
compute a combinatorial product across axis=1, such that the final result is of shape (N, D) where D = D1*D2*D3*...*Dn.
e.g.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20, 30],
              [5, 6, 7]])
cartesian_product( [A,B], axis=1 )
>> np.array([[ 1*10, 1*20, 1*30, 2*10, 2*20, 2*30 ],
             [ 3*5,  3*6,  3*7,  4*5,  4*6,  4*7  ]])
and extendable to cartesian_product([A,B,C,D...], axis=1)
e.g.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [5, 6]])
C = np.array([[50, 0],
              [60, 8]])
cartesian_product( [A,B,C], axis=1 )
>> np.array([[ 1*10*50, 1*10*0, 1*20*50, 1*20*0, 2*10*50, 2*10*0, 2*20*50, 2*20*0 ],
             [ 3*5*60,  3*5*8,  3*6*60,  3*6*8,  4*5*60,  4*5*8,  4*6*60,  4*6*8  ]])
I have a working solution that essentially creates an empty (N, D) matrix and then broadcasts a column-wise vector product for each column, within nested for loops over each matrix in the provided list. Clearly this is horrible once the arrays get larger!
Is there an existing solution within numpy or tensorflow for this? Ideally one that is efficiently parallelizable (a TensorFlow solution would be wonderful, but numpy is ok, and as long as the vector logic is clear it shouldn't be hard to make a tf equivalent).
I'm not sure if I need to use einsum, tensordot, meshgrid or some combination thereof to achieve this. I have a solution, but only for single-dimension vectors, from https://stackoverflow.com/a/11146645/2123721, even though that solution claims to work for arrays of arbitrary dimensions (which appears to mean vectors). With that one I can do a .prod(axis=1), but again this is only valid for vectors.
thanks!
Here's one approach that works iteratively in an accumulating manner, extending dimensions for each pair from the list of arrays and using broadcasting for the elementwise multiplications -
L = [A, B, C]  # list of arrays
n = L[0].shape[0]
out = (L[1][:, None] * L[0][:, :, None]).reshape(n, -1)
for i in L[2:]:
    out = (i[:, None] * out[:, :, None]).reshape(n, -1)
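Wrapped into a small helper and checked against the first example from the question (the function name cartesian_product is just an illustrative choice, not an existing API), a minimal sketch might look like:
import numpy as np

def cartesian_product(arrays):
    # Column-wise combinatorial product of 2D arrays that share the same row count.
    n = arrays[0].shape[0]
    out = arrays[0]
    for a in arrays[1:]:
        # (N, D_out, 1) * (N, 1, D_a) -> (N, D_out, D_a), then flatten the last two axes
        out = (out[:, :, None] * a[:, None, :]).reshape(n, -1)
    return out

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20, 30], [5, 6, 7]])
print(cartesian_product([A, B]))
# [[10 20 30 20 40 60]
#  [15 18 21 20 24 28]]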