Pandas masked dataframe assign 2D array - pandas

I'm trying to solve a riddle involving a dataframe and a masked array.
Context
I'm writing some helper methods for machine learning work. My goal is to start from a simple dataframe and add new information to it, with functions that are as simple as possible to use.
Problem
Below is a function that takes what I call a model and uses it to transform some data. I then want to set the transformed data on a new or existing column. I'm not using apply because that would cost me the ability to extract several rows at once in parallel (that's why I extract first and then apply the 'features').
Here is the riddle: I want to work with a mask, because without it I cannot modify the original dataframe. If I remove the mask, everything is fine as long as the dataframe and features have the same number of rows. As soon as I add the mask, though, the assignment seems to be processed row by row and the dimensions mismatch.
EDIT: I forgot to mention that 'features' is a 2D numpy array.
Error
ValueError: Must have equal len keys and value when setting with an ndarray
Code
def transform(dataframe, tags, out, model, mask=None):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    # Mask creation (see pandas view / copy mechanism)
    if mask is None:
        mask = [True] * len(dataframe.index)
    features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())
    dataframe.loc[mask, out] = features.tolist()
    return dataframe
Example
Data can be something like this:
Row ; Data ; Label
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ;0
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ;1
A model could be a PCA from sklearn.
The method could be something like this:
transform(dataframe, {'datum': 'Data'}, 'PCA', PCA(), mask=dataframe['Label'] == 1)
Then output would be:
Row ; Data ; Label ; PCA
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ;0 ; [74.585... ]
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ;1 ; [92.578... ]
Current solution
mask = inputs[condition]
inputs['PCA'] = np.nan
inputs[mask] = Classification.transform(inputs[mask], {'datum': 'Data'}, wavelet, 'PCA')

def transform(dataframe, tags, model, out):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    features = model.transform(dataframe[tags['datum']].to_numpy())
    dataframe[out] = features.tolist()
    return dataframe
Thanks for any answers!
Best regards

OK, so I found a solution that fits my needs.
My second idea turned out to be a bad one, in my view, as there seems to be no convenient way to pass a view as an argument and edit it.
So I changed my first proposition. Here is the final version:
def transform(dataframe, tags, model, out, mask=None):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    # Mask creation (see pandas view / copy mechanism)
    if mask is None:
        mask = [True] * len(dataframe.index)
    # dataframe[out] = np.nan
    features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())
    dataframe.loc[mask, out] = pd.Series(features.tolist())
    return dataframe
Instead of doing dataframe.loc[mask, out] = features.tolist(), I switched to dataframe.loc[mask, out] = pd.Series(features.tolist()). By going through a Series, it fits without any problems.
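For reference, here is a minimal, self-contained sketch of the same pattern (the data and column name are made up for illustration). One subtlety: giving the Series the masked rows' index keeps pandas' label alignment from dropping values when the mask does not start at row 0.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   'Label': [0, 1, 1]})
mask = df['Label'] == 1
# Stand-in for model.transform(): a 2D array with one row per masked row
features = np.arange(4.0).reshape(2, 2)
# df.loc[mask, 'Out'] = features.tolist() raises
# "ValueError: Must have equal len keys and value when setting with an ndarray";
# a Series of lists, indexed on the masked rows, assigns cleanly
df.loc[mask, 'Out'] = pd.Series(features.tolist(), index=df.index[mask])
print(df)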
Thanks to Ken Syme for getting involved in my problem.

Related

Predefined feature selection from pandas dataframe as part of a sklearn gridsearch crossvalidation

TLDR: How to iterate through predefined feature subsets as part of a scikit-learn GridSearchCV pipeline?
For a regression task, I have set up a (nested) CV to choose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms; instead I simply want to try different predefined subsets of the available columns and test them against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using FunctionTransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
    return model_X.loc[:, feature_sets[feature_set]]

CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
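For clarity, kw_args is what makes this tunable from the param grid: FunctionTransformer forwards its kw_args dict as keyword arguments to the wrapped function on every transform. A quick sketch of the equivalence, using the names defined above:
from sklearn.preprocessing import FunctionTransformer

selector = FunctionTransformer(SelectFeatures, kw_args={'feature_set': 0})
# fit_transform ends up calling SelectFeatures(model_X, feature_set=0),
# i.e. it returns the columns named in feature_candidates_list[0]
subset = selector.fit_transform(model_X)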
And here is how I put it all together in a pipeline and param grid (this is a reduced example with only the relevant steps):
PreProcessor = ColumnTransformer([
    ('selector', CustomFeatureSelector, model_X.columns),
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])

pipe = Pipeline(steps=[
    ('preprocessor', PreProcessor),
    ('regressor', DummyRegressor())  # just a dummy here, as it can't be empty (for the actual regressor, see regressor_params)
])

preprocessor_params = [
    {
        'preprocessor__selector__kw_args': [{'feature_set': i} for i in range(len(feature_candidates_list))],
        'preprocessor__scaler__with_mean': [True, False],
        'preprocessor__scaler__with_std': [True, False],
    },
]

regressor_params = [
    {
        'regressor': [TweedieRegressor(max_iter=1000)],
        'regressor__power': [0, 1],
        'regressor__alpha': [0, 1],
        'regressor__link': ['log'],
        'regressor__fit_intercept': [True, False],
    },
]

params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
    'R2': 'r2',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'

with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )
    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )
I am not sure if this pipeline actually selects the features as intended.
model_X has m columns
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n but I expected n (also tested it for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems to take all the original columns of model_X and concatenate them with the chosen set of columns instead of replacing them, which would be consistent with ColumnTransformer applying its transformers in parallel to the input and concatenating their outputs. When I switch off the scaler, the output of the above is indeed equal to n; however, I still cannot see which features were chosen for a respective estimator, because calling .feature_names_in_ on them raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in GridSearchCV is not smart and I should go a different route? Any hints are welcome!
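One untested sketch that would follow from this diagnosis: since ColumnTransformer branches run side by side on the original input, chaining the selector and scaler sequentially in a Pipeline would scale only the already-selected columns, and the param grid keys above would keep working because the step names are unchanged. The subset chosen per fold could then be read from best_params_ instead of feature_names_in_:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sequential preprocessing: pick the predefined subset first, then scale it
PreProcessor = Pipeline(steps=[
    ('selector', CustomFeatureSelector),
    ('scaler', StandardScaler()),
])

# After fitting, the feature set chosen for e.g. the first outer fold:
chosen = outerCV['estimator'][0].best_params_['preprocessor__selector__kw_args']
print(feature_candidates_list[chosen['feature_set']])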
Update:
I switched to the sklearn nightly build (v1.2.dev0), where I can use set_config(transform_output='pandas') to keep transformers from converting my dataframe to a numpy array. This helps to get feature names when calling .feature_names_in_ on one of the estimators, but it only works when just the scaler is activated.
When I also activate my custom selector, the fitting fails for all folds. But when I turn off the set_config(...) again, it works just like in the stable versions v1.1.2 and v1.1.3, without the ability to get feature names.

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e., the comparison data frame that compare_models() displays.
Does anyone know how to do that?
The solution is:
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() itself returns the best model (an estimator object), not the grid; the comparison grid that gets displayed is retrieved as a pandas dataframe with pull(). Hence you only need to save that dataframe, which can for example be achieved with df.to_csv(path). If you want to save it in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.
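Putting it together, a minimal end-to-end sketch (the CSV filename is arbitrary):
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, pull

diabetes = get_data('diabetes')
clf1 = setup(data=diabetes, target='Class variable')
best = compare_models()  # returns the best fitted model
df = pull()              # grabs the displayed comparison grid as a dataframe
df.to_csv('model_comparison.csv')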

TensorFlow Federated - Loading and preprocessing data on a remote client

Part of the simulation program that I am working on allows clients to load local data from their device without the server being able to access that data.
Following the idea from this post, I have the following code configured to assign each client a path to load its data from. Although the data is in svmlight format, loading it line by line still allows it to be preprocessed afterwards.
client_paths = {
    'client_0': '<path_here>',
    'client_1': '<path_here>',
}

def create_tf_dataset_for_client_fn(id):
    path = client_paths.get(id)
    data = tf.data.TextLineDataset(path)
    return data

path_source = tff.simulation.datasets.ClientData.from_clients_and_fn(client_paths.keys(), create_tf_dataset_for_client_fn)
The code above allows a path to be loaded at runtime on the remote client's side via the following line of code.
data = path_source.create_tf_dataset_for_client('client_0')
Here, the data variable can be iterated through, and its contents can be displayed on the remote device by calling tf.print(). But I need to preprocess this data into an appropriate format before continuing. I am presently attempting to convert it from a string tensor in svmlight format into a SparseTensor of the appropriate format.
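For illustration, iterating over the client dataset looks like this (each element is one raw svmlight line as a string tensor):
for line in path_source.create_tf_dataset_for_client('client_0'):
    tf.print(line)  # e.g. '1 3:0.5 7:1.2 ...' as a tf.string scalar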
The issue is that, although the defined preprocessing method works in a standalone scenario (i.e. when defined as a function and tested on a manually defined tensor of the same format), it fails when the code is executed during the client update @tf.function in the tff algorithm. Below is the error raised when executing the notebook cell that contains a @tff.tf_computation function, which calls a @tf.function that does the preprocessing and retrieves the data.
ValueError: Shape must be rank 1 but is rank 0 for '{{node Reshape_2}} = Reshape[T=DT_INT64, Tshape=DT_INT32](StringToNumber_1, Reshape_2/shape)' with input shapes: [?,?], [].
Since the issue occurs when executing the client's @tff.tf_computation update function, which calls the @tf.function with the preprocessing code, I am wondering how I can get the function to perform the preprocessing on the data without errors. I assume that if I can just get the functions to run properly when defined, they will also work when called remotely.
Any ideas on how to address this issue? Thank you for your help!
For reference, the preprocessing function uses tf computations to manipulate the data. Although not optimal yet, below is the code presently being used, inspired by this link on string_split examples. I have also tried extracting the code and putting it directly into the client's @tf.function after loading the TextLineDataset, but this fails as well.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, -1)
    feat_ids = tf.expand_dims(feat_ids, 1)
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, -1)
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
Update (Fix)
Following the advice from Jakub's comment, the issue was fixed by enclosing the shape/axis arguments of the reshape and expand_dims calls in [] where needed. Now the code runs within tff without issue.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, [-1])
    feat_ids = tf.expand_dims(feat_ids, [1])
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, [-1])
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
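With the fixed function in place, a natural way to wire it in (a sketch, reusing the client dataset from above) is to map it over the TextLineDataset:
raw = path_source.create_tf_dataset_for_client('client_0')
preprocessed = raw.map(decode_libsvm)  # yields {'x': SparseTensor, 'y': label} per line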

newbie: holoviews curves from pandas follow-up: issues with stream

The rows of the pandas dataframe correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements, and filter estimates) in a stream.
def show_tracker(index, data=run_tracker()):
    i = int(index)
    sleep(0.1)
    p = \
        hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) * \
        hv.Curve  (data[0:i], kdims=['x.true'], vdims=['y.true']) * \
        hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) * \
        hv.Curve  (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
    return p

%%opts Scatter [width=600,height=280]
ndx = TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])

for i in range(N):
    ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data. Consequently, trajectory updates occur at the very edge of the plot boundaries. Is there a setting to allow some slop, or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: Bokeh backend; I can zoom and pan, but "Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker requires the function to be re-executed to generate a new dataframe. Is there an easy way to address that?
Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week; track this issue for updates. However, we also just changed how the ranges are updated on a DynamicMap: if you want the ranges to update, make sure to set %%opts Scatter {+framewise} or norm=dict(framewise=True) on one of the displayed objects, as you're already doing for the style options.
Issue 2
This is an unfortunate shortcoming of the reset tool in bokeh; you can track this issue for updates.
Issue 3
That depends on what exactly you're doing: has the data already been generated, or are you updating it on the fly? If you only have to generate the data once, you can simply create it outside the function, which means it will be in scope:
data = run_tracker()

def show_tracker(index):
    i = int(index)
    sleep(0.1)
    ...
    return p
If you actually want to generate new data dynamically, the easiest thing to do is to write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
class KalmanTracker(hv.streams.Stream):

    index = param.Integer(default=1)

    def __init__(self, **params):
        # Initializes empty data and parameters
        self.data = None
        super(KalmanTracker, self).__init__(**params)

    def update_data(self, index):
        # Update self.data here
        ...

    def get_view(self, index):
        # Update if index exceeds data length and
        # create a holoviews view of the data
        if self.data is None or len(self.data) < index:
            self.update_data(index)
        data = self.data[:index]
        ...
        return hv_obj

    def show(self):
        # Create DynamicMap to display and
        # pass in self as the Stream
        return hv.DynamicMap(self.get_view, kdims=[],
                             streams=[self])

tracker = KalmanTracker()
tracker.show()

# Should update data and plot
tracker.update(index=10)
Once you've done that, you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()

TensorFlow VocabularyProcessor

I am following the wildml blog on text classification using tensorflow. I am not able to understand the purpose of max_document_length in this code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) all be the same length. For example, if you're working with sentences for sentiment analysis, they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length).
Each row of raw_documents is mapped to a vector of length max_document_length.
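To make the trim/pad behaviour concrete, here is the output one would expect for the three sentences in the earlier example (a sketch; the exact ids depend on the order in which the vocabulary first sees each token, with 0 reserved for padding):
# max_document_length == 5 ('This is a a dog' has five tokens), so each
# sentence maps to a length-5 id vector, zero-padded on the right:
#   'This is a cat'    -> [1 2 3 4 0]
#   'This must be boy' -> [1 5 6 7 0]
#   'This is a a dog'  -> [1 2 3 3 8]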