Adding statsmodels 'predict' results to a Pandas dataframe - pandas

It is common to want to append the results of predictions to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which predictions are based.
For example, if the test dataset, test, contains any null entries, then
mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit()
preds = mod_fit.predict(test)
will produce an array that is shorter than the length of test, and cannot be usefully appended with
test['preds'] = preds
And since the result of predict is not indexed, there is no way to recover the rows to which the results should be attached.
What is the idiom for associating predict results to the rows from which they were generated? Is there, perhaps, a way to get predict to return a dataframe that preserves the indices of its argument?

Predict shouldn't drop any rows. Can you post a minimal working example where this happens? Preserving the pandas index is on my radar and should be fixed in master soon.
https://github.com/statsmodels/statsmodels/issues/1501
Edit: Nevermind. This is a known issue. https://github.com/statsmodels/statsmodels/issues/1352
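Until that fix lands, one workaround is to predict on the non-null subset explicitly and let pandas align the results by index. A minimal sketch with stand-in data (the `mod_fit.predict` call is replaced by a hand-written array so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Stand-in test set with one null row that predict() would drop.
test = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [0.5, 1.5, 2.5]})

valid = test.dropna()                 # the rows predict() actually uses
preds = np.array([0.2, 0.9])          # stand-in for mod_fit.predict(valid)

# Wrap the raw array in a Series carrying valid's index; rows that were
# dropped get NaN instead of misaligned values.
test['preds'] = pd.Series(preds, index=valid.index)
```

Rows with missing inputs end up with NaN in the `preds` column rather than silently shifted predictions.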

Related

Tensorflow: How to shuffle a dataset so that it doesn't reshuffle after splitting

I am so confused as to why it's been so hard for me to find the answer to this. I want to be able to shuffle a dataset one time. After shuffling, I then split the dataset into train/val/test splits. I can't find a way to do this without the train/val/test data being all reshuffled together anytime I iterate over the split datasets.
I guess that's because the train/val/test datasets all point to locations in a dataset which is reshuffled each time.
Here's an example of my code that is trying to do this.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=len(x))
train, val, test = split_tf_dataset(dataset, len(x), test_pct=0.1, val_pct=0.1)
train = train.batch(batch_size=50, drop_remainder=True)
val = val.batch(batch_size=50, drop_remainder=True)
test = test.batch(batch_size=50, drop_remainder=True)
'split_tf_dataset' is just performing take and skip operations, no randomness added there.
My workaround so far has been to shuffle the data before I create the Dataset, but does Dataset have this functionality that I'm missing? The option 'reshuffle_each_iteration' doesn't seem to do anything in this case.
I would expect setting reshuffle_each_iteration to False to fix this problem; however, it seems to have no effect. I've also tried calling Dataset.sample_from_datasets, but with a single dataset it just returns your input unchanged, doing nothing.
This is the numpy code that does what I'm expecting tensorflow should be able to do:
x = x[np.random.permutation(len(x))]
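A minimal sketch of the shuffle-once approach (index handling in pure NumPy; the split sizes mirror the 10%/10% test/val split above, and the tf.data call is shown as a comment):

```python
import numpy as np

x = np.arange(100)
y = x * 2

# Shuffle the indices once, up front; every split derived from them is
# fixed from then on, no matter how often the datasets are iterated.
rng = np.random.default_rng(0)
idx = rng.permutation(len(x))

n_test = n_val = len(x) // 10
test_idx = idx[:n_test]
val_idx = idx[n_test:n_test + n_val]
train_idx = idx[n_test + n_val:]

# Each split then becomes its own dataset, e.g.:
# train = tf.data.Dataset.from_tensor_slices((x[train_idx], y[train_idx]))
```

Because the permutation happens before any Dataset exists, nothing downstream can reshuffle rows across the split boundaries.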

Should I join features and targets dataframes for use with scikit-learn?

I am trying to create a regression model to predict deliverables (dataframe 2) using design parameters (dataframe 1). Both dataframes have an ID number that I used as an index.
Is it possible to use two dataframes to create a dataset for sklearn? Or do I need to join them? If I need to join them then what would be the best way?
# import data
df1 = pd.read_excel('data.xlsx', sheet_name='Data1', index_col='Unnamed: 0')
df2 = pd.read_excel('data.xlsx', sheet_name='Data2', index_col='Unnamed: 0')
I have only used sklearn on a single dataframe that had all of the feature and target columns in it, so I am not sure how to handle the case where one dataframe has the features and the other has the targets.
All estimators in scikit-learn have a signature like estimator.fit(X, y), X being training features and y training targets.
Then, prediction will be achieved by calling some kind of estimator.predict(X_test), with X_test being the test features.
Even train_test_split takes as parameters two arrays X and y.
This means that, as long as you maintain the right order in rows, nothing requires you to merge features and targets.
Completely agree with Guillaume's answer.
Just be aware, as he said, of the row order. That's the key to your problem. If they are in the same order, you don't need to merge the dataframes and you can fit the model directly.
But if they are not in the same order, you have to combine both dataframes (similar to a join in SQL) in order to relate the features and targets of each ID. You can do it like this (more information here):
df_final = pd.concat([df1, df2], axis=1)
As you used the ID as index, this should work properly. Be aware that NaN values may appear if an ID shows up in one dataframe but not in the other; you will have to handle them.
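A small sketch of the index alignment with hypothetical data: `pd.concat([df1, df2], axis=1)` aligns on the index the same way, while `join(how='inner')` additionally drops IDs missing from either frame, avoiding the NaN case mentioned above.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel sheets, indexed by ID and
# deliberately in different row orders.
df1 = pd.DataFrame({'param_a': [1.0, 2.0, 3.0]}, index=[101, 102, 103])
df2 = pd.DataFrame({'deliverable': [30.0, 10.0, 20.0]}, index=[103, 101, 102])

df = df1.join(df2, how='inner')   # rows aligned by the shared ID index
X, y = df[['param_a']], df['deliverable']
# estimator.fit(X, y) now sees correctly paired rows
```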

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
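If the per-frame speeds are available as an array, the speed-difference part itself is straightforward; a sketch with stand-in values (`np.diff` with `prepend` gives the first frame an acceleration of 0, as required):

```python
import numpy as np

# Stand-in per-frame speed magnitudes, as parse_data might extract them.
speeds = np.array([10.0, 12.0, 11.0, 15.0])

# Acceleration = speed change between subsequent frames; prepending the
# first value makes the first acceleration 0.
accel = np.diff(speeds, prepend=speeds[0])
```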
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it contains, for example, two file paths (one instance of input data and one instance of output data), the mapping function should take 2 arguments and return 2 arguments.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    merged_crop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(merged_crop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. So the map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest you use a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. Hope this helps!

xgboost rank pairwise what is the prediction input and output

I'm trying to use XGBoost to predict the rank for a set of features for a given query. I managed to train a model, but I'm confused about the input data when I ask for a prediction.
I'm trying to understand if I'm doing something wrong or this is not the right approach.
What I'm doing:
I parse the training data (see here a sample) and feed it into a DMatrix such that the first column represents the quality-of-the-match and the following columns are the scores on different properties, and I also send the docIds as labels
I configure the group sizes
The training seems to work fine: I get no errors, and I use the rank:pairwise objective
For prediction, I use a fake entry with fake scores (1 row, 2 columns, see here) and I get back a single float value.
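The setup described above can be sketched roughly like this (NumPy stand-ins only; the actual xgboost DMatrix/train/predict calls are shown as comments, and all names and values are hypothetical):

```python
import numpy as np

# 5 feature rows belonging to 2 queries; labels are relevance grades.
X_train = np.random.rand(5, 3)
relevance = np.array([2, 1, 0, 1, 0])
group = [3, 2]                    # first 3 rows: query 1; last 2: query 2
# dtrain = xgboost.DMatrix(X_train, label=relevance)
# dtrain.set_group(group)
# bst = xgboost.train({'objective': 'rank:pairwise'}, dtrain)

# predict() returns one raw score per row; sorting the rows of one query
# by score yields that query's ranking.
scores = np.random.rand(5)        # stand-in for bst.predict(dtest)
ranking_q1 = np.argsort(-scores[:group[0]])
```

Note the group sizes must sum to the number of rows, which is how rows are associated with queries.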
I'm trying to understand:
1. Do I need to feed in a label for the prediction ?
My understanding is that labels are similar to "doc ids" so at prediction time I don't see why I need them
2. Do I need to set the group size when doing predictions ? And if so, what does it represent ?
My understanding is that groups are for training data to assist ranking "per query". How does that correlate with predictions? Do I set a group size anyway?
3. How do I correlate the "group" from the training with the prediction?
How do I figure out the pair (score, group) from the result of the prediction, given I only get back a single float value - what group is that prediction for?

How to use Pandas get_dummies on predict data?

After using Pandas get_dummies on 3 categorical columns to get a one-hot-encoded dataframe, I've trained (with some success) a Perceptron model.
Now I would like to predict the result for a new observation that is not one-hot-encoded.
Is there any way to record the get_dummies column mapping to re-use it?
There is no automatic procedure to do this at the moment, to my knowledge. In a future release of sklearn, CategoricalEncoder will be very handy for this job. You can already get your hands on it if you clone the sklearn GitHub master branch and build it yourself. At the moment, 2 options come to my mind:
use the LabelEncoder + OneHotEncoder combination, see this answer, for example;
simply retrieve (and store, if needed) the list of columns of the training OHE output. Then run pd.get_dummies on the test set/example, loop through the resulting test OHE columns, drop those that do not appear in the training OHE, and add those that are missing from the test OHE, filled with zeros.
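The second option boils down to a single `reindex` call, shown here with made-up columns: reindexing against the stored training columns drops unseen test columns and fills missing training columns with zeros in one step.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
train_ohe = pd.get_dummies(train)
train_cols = train_ohe.columns            # store this after training

# New, non-encoded observation at prediction time.
new_obs = pd.DataFrame({'color': ['blue']})
new_ohe = pd.get_dummies(new_obs).reindex(columns=train_cols, fill_value=0)
```

`new_ohe` now has exactly the training columns, in the training order, so it can be fed straight to the fitted Perceptron.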