sklearn OneHotEncoder returns wrong size shape data - pandas

I am building a pipeline with sklearn to handle my dataset. When I try to use OneHotEncoder (to transform non-numeric attributes into numeric ones) as one of the pipeline's steps, it returns an array of the wrong shape.
The original dataset has shape (8693, 14), and the final dataset returned by the pipeline must have the same size. If I leave OneHotEncoder out of the pipeline, the output array has the expected shape, but as soon as I add it, the shape is wrong.
Can you help, please? I have already tried OneHotEncoder's parameters, the 'toarray' method, and the 'resize' method, and they do not solve the problem.

OneHotEncoder creates one column per category, so the number of columns grows by design. To map a categorical/string column to a single numeric column, you can use OrdinalEncoder instead.
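A minimal sketch of the OrdinalEncoder route (the frame and column names here are hypothetical, not from the question):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: two string columns and one numeric column
df = pd.DataFrame({
    'deck': ['A', 'B', 'A', 'C'],
    'side': ['P', 'S', 'S', 'P'],
    'age': [39, 24, 58, 16],
})
ct = ColumnTransformer(
    transformers=[
        ('ordinal', OrdinalEncoder(), ['deck', 'side']),
    ],
    remainder='passthrough',
)
out = ct.fit_transform(df)
print(out.shape)  # (4, 3): one output column per input column
With OneHotEncoder in place of OrdinalEncoder, the same call would instead yield one column per category (here 3 + 2 + 1 = 6 columns), which is why the output shape changes.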

Related

SkLearn - Using RegressorChain with ColumnTransformer in Pipelines?

I'm having problems using sklearn's RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), and unfortunately there doesn't seem to be a lot of documentation/examples about this.
The documentation states indirectly (through the set_params method) that it can be used with Pipelines. My pipeline has:
ct = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), numerical_columns),
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough'
)
cv = TimeSeriesSplit(n_splits=groups.nunique())  # groups by date
pipeline = make_pipeline(ct, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
and then I do:
chain_regressor = RegressorChain(base_estimator=target_transform_output, order=[1, 0, 2])
chain_regressor.fit(X, y)
In the above, both X and y are pandas Dataframes, and y has 3 target columns.
When I run the code, I get a Python stack trace caused by the fit() call, starting in __init__.py in _get_column_indices(X, key) when it does all_columns = X.columns. The error is:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
and further down at the end of the stack trace:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I assume this is because the ColumnTransformer returns ndarrays, a well-known problem. Does this mean that the RegressorChain can't be used with the ColumnTransformer?
After this, I removed the column transformer step from the pipeline and tried again, and without the ColumnTransformer everything works fine (even the TransformedTargetRegressor).
Any help, ideas or workaround appreciated.
You have the issue the wrong way around: it's not that ColumnTransformer outputs an array and RegressorChain expected a dataframe. Rather, RegressorChain converts your input to an array before calling your pipeline, so your ColumnTransformer never receives a dataframe and cannot use your column-name specifications.
You could just specify the columns by index or callable in the ColumnTransformer. But I think in this case you have two unfortunate side effects:
1. For each target, you re-encode day_of_week and re-scale every independent variable (not wrong, just a little wasteful), and
2. you never scale the targets, even when they are used as independent variables for "later" targets' regressions (not wrong for a tree-based model like your LightGBM [in fact, for LGBM, why bother scaling at all?], but other models might suffer from not scaling them).
(1) can be fixed by moving the preprocessing into a pipeline step before RegressorChain. (2) can be fixed by changing the scaler's column specification to a callable, below using the helper make_column_selector. The fix for (2) does end up re-computing the scalings at each step of the chain (hurting (1) again), but I think (2) is the bigger deal in the end (if you ever want to use something other than a tree model).
So I would suggest instead:
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough',
)
scale_nums = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder='passthrough',
)
modeling_pipe = make_pipeline(scale_nums, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(
    regressor=modeling_pipe,
    transformer=PowerTransformer(),
)
chain_regressor = RegressorChain(base_estimator=target_transform_output, order=[1, 0, 2])
final_pipeline = make_pipeline(encoder, chain_regressor)
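With this layout the encoder still sees the original DataFrame, so the name-based column spec for day_of_week works, and you fit the outer pipeline just as in the question:
final_pipeline.fit(X, y)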

How to use LabelBinarizer for 2 label problem?

If there are only 2 labels in the data, LabelBinarizer.fit_transform() returns an array with only a single column.
But for my TensorFlow model training use case, I need 2 columns in the label array.
How can this be done with LabelBinarizer, or is there another API for it? Or do I need to modify the array manually by iterating over it?
Since you are using TensorFlow, you can use tf.keras.utils.to_categorical.
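A minimal sketch, assuming the labels are already integer-encoded as 0/1:
import numpy as np
import tensorflow as tf

y = np.array([0, 1, 1, 0])
y_two_cols = tf.keras.utils.to_categorical(y, num_classes=2)
print(y_two_cols.shape)  # (4, 2): one column per class, even with only 2 classes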

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes, I found that fit_transform returns a scipy.sparse.csr.csr_matrix, and that is not callable.
Is there another way to do this?
Thanks!
There are a number of problems with your code. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (single brackets) is a Series of strings; CountVectorizer.fit_transform expects an iterable of documents, not a DataFrame, so don't wrap the column in double brackets.
fit_transform returns a sparse csr matrix.
.toarray() makes it a dense array, which the pd.DataFrame constructor accepts (it does not accept a csr matrix directly).
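A self-contained sketch of the whole flow (the sample headlines are made up; get_feature_names_out assumes scikit-learn >= 1.0):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

allData = pd.DataFrame({'headline_text': [
    'rain hits the city',
    'city council votes on rain plan',
]})
vectorizerCount = CountVectorizer(stop_words='english')
X = vectorizerCount.fit_transform(allData['headline_text'])  # sparse csr matrix
allDataVectorized = pd.DataFrame(
    X.toarray(),  # densify; fine for small data
    columns=vectorizerCount.get_feature_names_out(),  # one column per token
)
For large corpora, pd.DataFrame.sparse.from_spmatrix(X) avoids densifying.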

Stacking list of lists vertically using np.vstack is throwing an error

I am following this piece of code http://queirozf.com/entries/scikit-learn-pipeline-examples in order to develop a multilabel OneVsRest classifier for text. I would like to compute the hamming_score and therefore need to binarize my test labels as well. I have:
X_train, X_test, labels_train, labels_test = train_test_split(meetings, labels, test_size=0.4)
Here, labels_train and labels_test are lists of lists, for example:
[['dog', 'cat'], ['cat'], ['people'], ['nice', 'people']]
Now I need to binarize all my labels, so I am doing this:
all_labels = np.vstack([labels_train, labels_test])
mlb = MultiLabelBinarizer().fit(all_labels)
As directed in the link. But that throws
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I used np.column_stack as directed here
numpy array concatenate: "ValueError: all the input arrays must have same number of dimensions"
but that throws the same error.
How can the dimensions be the same if I am splitting into train and test? I am bound to get different shapes, right? Please help, thank you.
MultiLabelBinarizer works on a list of lists directly, so you don't need to stack them using numpy. Pass the combined list without stacking:
all_labels = labels_train + labels_test
mlb = MultiLabelBinarizer().fit(all_labels)
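After fitting on the union of labels, each split can then be binarized separately; a minimal sketch using the lists from the question:
from sklearn.preprocessing import MultiLabelBinarizer

labels_train = [['dog', 'cat'], ['cat'], ['people']]
labels_test = [['nice', 'people']]
mlb = MultiLabelBinarizer().fit(labels_train + labels_test)
print(mlb.classes_)                    # ['cat' 'dog' 'nice' 'people']
y_train = mlb.transform(labels_train)  # shape (3, 4)
y_test = mlb.transform(labels_test)    # shape (1, 4)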

Save a numpy sparse matrix into file

I want to save the result of TfidfVectorizer in sklearn.feature_extraction.text into a text file for future use. As I found, it is a sparse matrix of type <class 'scipy.sparse.csr.csr_matrix'>. However, when I try to save it using the following code
np.savetxt('Feature_TfIdf.txt', X_Tfidf, fmt='%2.6f')
I get an error like this
IndexError: tuple index out of range
Use joblib.dump for this (sklearn.externals.joblib was just a vendored copy of the same library and has since been removed from scikit-learn). NumPy's save functions don't understand SciPy sparse matrices.
Simple example:
joblib.dump(tfidf, 'TfIdf.pkl')
I managed to solve the problem by converting the sparse matrix to a dense matrix and then saving the result. However, this approach is not workable for large arrays, so it is better to save the matrix in .pkl format.
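A minimal round-trip sketch of both options (joblib as in the answer, plus scipy.sparse.save_npz, available in SciPy >= 0.19, as a sparse-native alternative):
import joblib
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

X_Tfidf = TfidfVectorizer().fit_transform(['some text', 'more text'])

# Option 1: pickle the csr matrix with joblib
joblib.dump(X_Tfidf, 'TfIdf.pkl')
X_back = joblib.load('TfIdf.pkl')

# Option 2: SciPy's sparse-native .npz format
sparse.save_npz('TfIdf.npz', X_Tfidf)
X_back = sparse.load_npz('TfIdf.npz')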