How to create an input with multiple data types in mxnet? - mxnet

Suppose I have a data file that has entries that look like this
0.00,2015-10-21,1,Y,798.78,323793701,6684,0.00,Q,H2512,PE0,1,0000
I would like to use this as an input to an mxnet model (a basic feed-forward multi-layer perceptron). A single input record has the following data types, in the order shown above:
float, date, int, categorical, float, int, int, float, categorical, categorical, categorical, int, float
Each record is a meaningful representation of a specific entity. How do I represent this sort of data to mxnet? Also, to complicate things a bit, suppose I want to one-hot encode the categorical columns? And what if each record has these fields, in the order shown, but repeated multiple times in some cases, such that each record may have a different length?
The docs are great for the basic case where the input data is all of the same type and can be loaded into a single input without any transformation, but how do I handle this case?
Update: adding some additional details. To keep this as simple as possible, let's say I just want to feed this into a simple network, something like:
my $data = mx->symbol->Variable("data");
my $fc = mx->symbol->FullyConnected($data, num_hidden => 1);
my $softmax=mx->symbol->SoftmaxOutput(data => $fc, name => "softmax");
my $module = mx->mod->new(symbol => $softmax);
In the simple case of the data being all one type and not requiring much in the way of pre-processing, I could then just do something along the lines of
$module->fit(
$train_iter,
eval_data => $eval_iter,
optimizer => "adam",
optimizer_params=>{learning_rate=>0.001},
eval_metric => "mse",
num_epoch => 25
);
where $train_iter is a simple NDArray iterator over the training data. (Well, with the Perl API it's not exactly an NDArray, but it has complete parity with that interface, so it is conceptually identical.)

NDArrayIter also supports multiple inputs. You can use it as follows:
import numpy as np
import mxnet as mx
data = {'data1':np.zeros(shape=(10,2,2)), 'data2':np.zeros(shape=(20,2,2))}
label = {'label1':np.zeros(shape=(10,1)), 'label2':np.zeros(shape=(20,1))}
dataiter = mx.io.NDArrayIter(data, label, 3, True, last_batch_handle='discard')
Before that, you will have to convert your non-numeric data into numerical data. This could be a one-hot vector or some other encoding, depending on the meaning of that variable.
As for the question regarding samples having different lengths, the easiest way is to bring them all to a common length by padding the shorter ones with 0s.
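A minimal sketch of both steps in Python (the answer's language), assuming a hypothetical vocabulary for one categorical column, placeholder labels, and a hypothetical train.csv file; the column positions follow the sample record above:
import numpy as np
import mxnet as mx
# Hypothetical vocabulary for the 4th column ('Y' in the sample record);
# in practice you would collect one such mapping per categorical column.
vocab_col4 = {'Y': 0, 'N': 1}
def one_hot(index, size):
    # Return a one-hot float vector with a 1 at `index`.
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec
def encode_record(fields):
    # Turn one parsed record (a list of strings) into a flat float vector.
    # The date column would need its own numeric encoding (e.g. year/month/day).
    parts = [
        np.array([float(fields[0])], dtype=np.float32),  # float column
        np.array([float(fields[2])], dtype=np.float32),  # int column
        one_hot(vocab_col4[fields[3]], len(vocab_col4)),  # categorical -> one-hot
        np.array([float(fields[4])], dtype=np.float32),  # float column
        # ...remaining columns encoded the same way...
    ]
    return np.concatenate(parts)
def pad_to_length(vectors, max_len):
    # Zero-pad encoded records so every sample has the same length.
    out = np.zeros((len(vectors), max_len), dtype=np.float32)
    for i, v in enumerate(vectors):
        out[i, :len(v)] = v
    return out
encoded = [encode_record(line.strip().split(',')) for line in open('train.csv')]
data = pad_to_length(encoded, max(len(v) for v in encoded))
labels = np.zeros((data.shape[0], 1), dtype=np.float32)  # placeholder labels
train_iter = mx.io.NDArrayIter(data, labels, batch_size=32, shuffle=True)
The resulting iterator plays the same role as $train_iter in the fit() call above.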

Related

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have any suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model. We need to convert it to numerical form before it can be used as training data for any model. Since we know that the max team size is 3 in most cases, we can split the three names into three columns (keeping a column blank if the team has fewer than 3 members). Now we can use either label encoding or one-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder, and then use its transform function on each column individually (since names may be shared across these 3 columns). With label encoding, we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would prefer not to use it for an initial simple model.
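A minimal sketch of that combined fit, with hypothetical column names (player1, player2, player3) standing in for the real ones:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Hypothetical dataframe; the real column names depend on how the sheet is built.
df = pd.DataFrame({
    'player1': ['alice', 'bob', 'carol'],
    'player2': ['dave', 'alice', 'bob'],
    'player3': ['erin', None, 'dave'],
    'position': [1, 2, 3],
})
# Fit one encoder on the union of all three name columns so that a player
# gets the same code no matter which column they appear in.
all_names = pd.concat([df['player1'], df['player2'], df['player3']]).fillna('').unique()
encoder = LabelEncoder().fit(all_names)
for col in ['player1', 'player2', 'player3']:
    df[col] = encoder.transform(df[col].fillna(''))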

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset
def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it has, for example, two file paths (one instance of input data and one instance of output data), the mapping function should take two arguments and return two values.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn, the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. The map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest you use a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. Hope this helps!
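For completeness, a minimal sketch of the dataset / dataset.skip(1) pairing idea mentioned in the question, assuming dataset and parse_data come from the snippets above and that parse_data is changed to return a single float32 array (e.g. np.array(av_speed, dtype=np.float32)):
import tensorflow as tf
# Map each serialized frame to its speed vector.
speeds = dataset.map(lambda x: tf.py_function(parse_data, [x], tf.float32))
# Pair every frame with the following frame, then derive acceleration as the
# difference between the two speeds (the very first frame is dropped here).
pairs = tf.data.Dataset.zip((speeds, speeds.skip(1)))
features = pairs.map(lambda prev, curr: (curr, curr - prev))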

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic: I asked this question while getting deeper into the data analysis world, and learned the following:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with nominal classes like gender, colors or names, you MUST convert them with other methods, since they do not imply any numerical order; the best known are one-hot encoding and dummy variables.
I encourage you to read more about them and hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that can be handled by the random forest model, I use these lines on all of them:
Train:
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both, the Train and Test data.
And since both datasets have the same categorical variables/columns, I think it will be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categoricals in each dataset?
Thanks for your insights and feedback.
Usually, I do the categorical conversions before the train-test split so that I get a neatly transformed dataset. Once I've done that, I do the train-test split, train the model, and test it on the test set.
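If the two frames do have to be encoded separately, one way to keep the codes consistent is a shared CategoricalDtype built from both datasets; a minimal sketch (the column name is from the question, the values are made up):
import pandas as pd
# Hypothetical train/test frames sharing the same categorical column.
train_df = pd.DataFrame({'Col1': ['red', 'blue', 'green']})
test_df = pd.DataFrame({'Col1': ['green', 'red', 'red']})
# Build one categorical dtype from the union of values seen in both frames so
# that .cat.codes assigns the same integer to the same category everywhere.
col1_dtype = pd.api.types.CategoricalDtype(
    sorted(set(train_df['Col1']) | set(test_df['Col1'])))
train_df['Col1_CAT'] = train_df['Col1'].astype(col1_dtype).cat.codes
test_df['Col1_CAT'] = test_df['Col1'].astype(col1_dtype).cat.codes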

Tensorflow: training on JSON data to generate similar output

Assume one has JSON data containing instructions for generating the following 10x5 cell patterns, and that each cell can contain one of the following characters: _ 0 x y z
Also assume that each character can be displayed in various colors.
pattern 1:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
pattern 2:
xx0z00yy_z
zzx_0000_x
_yxy0y__yx
_xz0z__0_y
y__x0_0_y_
pattern 3:
yx0x_xz0_z
xz_x0_xxxz
_yy0x_0z00
zyy0__0zyx
z_xy0_0xz0
These were randomly generated, and are all black, but assume they were devised according to some set of rules, and in color.
The JSON for the first pattern would look something like:
{
  width: 10,
  height: 5,
  cells: [
    {
      value: '_',
      color: 'red'
    },
    {
      value: 'y',
      color: 'blue'
    }, ...
  ]
}
If one wanted to train on this data in order to generate new yet similar patterns (again, assuming these were not randomly generated), what is the recommended approach for:
reading the data in (I'd imagine putting the JSON into an Example protobuf, serializing the buffer to string with tf.parse_example, and then writing that to TFRecord files)
training on that data
generating new patterns based on the trained model
supplying seed data for the generated patterns, e.g. first cell is the character "x" with the color blue.
I want to achieve something similar to what I've seen in style transfer with art/photos, and with music/MIDI data (see: Google Magenta). In those cases, the model is trained on a distinctive set of artwork or a melodic style, and a seed in the form of a photograph or primer melody is supplied in order to generate content similar to the data used in training.
Thanks!
I dislike preprocessing the dataset into new forms; it makes things difficult to change later on and slows future development. It's like technical debt, in my opinion.
My approach would be to keep your JSON as-is and write some simple python code (specifically a generator, which means you use yield instead of return statements) to read the JSON file and spit out samples in sequence.
Then use the tensorflow Dataset input pipeline with Dataset.from_generator(...) to take data from your input function.
https://www.tensorflow.org/programmers_guide/datasets
The Dataset pipeline provides everything you need to manage the various transformations you'll want to apply: you can buffer, shuffle, batch, prefetch, and map functions onto your data trivially, in a nice modular, testable framework that feeds naturally into your tensorflow model.
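A minimal sketch of that pipeline, assuming a hypothetical patterns.json file containing a list of objects shaped like the JSON above and a made-up integer encoding of the value/color pairs:
import json
import numpy as np
import tensorflow as tf
CHARS = ['_', '0', 'x', 'y', 'z']  # the cell characters from the question
COLORS = ['red', 'blue']           # hypothetical color vocabulary
def pattern_generator(path='patterns.json'):
    # Yield one (num_cells, 2) int array per pattern: (character id, color id).
    with open(path) as f:
        patterns = json.load(f)
    for p in patterns:
        cells = [(CHARS.index(c['value']), COLORS.index(c['color']))
                 for c in p['cells']]
        yield np.array(cells, dtype=np.int32)
dataset = tf.data.Dataset.from_generator(
    pattern_generator,
    output_types=tf.int32,
    output_shapes=tf.TensorShape([None, 2]))
# Shuffle, batch and prefetch; all patterns here are 10x5, so plain batching works.
dataset = dataset.shuffle(100).batch(8).prefetch(tf.data.experimental.AUTOTUNE)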

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex, we normally use LabelEncoder() in scikit-learn. But if I'm going to use Tensorflow instead of scikit-learn, what is the equivalent function or methodology for such a task? I know that we can do one-hot encoding easily with tensorflow, but then it will create labels as 10, 01 instead of 1, 0.
There is a module in TensorFlow called tf.feature_column that contains four methods to create categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed (memory) list of words
categorical_column_with_vocabulary_file(...): Same as _list but reads the vocabulary from file
The module also provides many more ways of getting your input data to the model. For an overview, see this blogpost written by the developers of the package.
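As a rough sketch, the vocabulary-list variant could stand in for LabelEncoder on a 'sex' feature (the feature name and vocabulary here are assumptions), wrapped in an indicator_column for a one-hot representation:
import tensorflow as tf
# Categorical column for a hypothetical 'sex' feature with a fixed vocabulary.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    'sex', vocabulary_list=['male', 'female'])
# indicator_column gives a one-hot representation; embedding_column would give
# a dense learned representation instead.
sex_one_hot = tf.feature_column.indicator_column(sex)
# A DenseFeatures layer turns raw input dicts into the encoded tensors.
layer = tf.keras.layers.DenseFeatures([sex_one_hot])
print(layer({'sex': tf.constant([['male'], ['female'], ['female']])}))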