I have a large quantity of missing values that appear at random in my data. Unfortunately, I cannot simply drop observations with missing data as I am grouping observations by a feature and cannot drop NaNs without affecting the entire group.
I was hoping to simply mask features that were missing. So a single group might have 8 items in it, and each item may have 0 to N features, depending on how many got masked due to being missing.
I have been experimenting a lot with RaggedTensors, but have encountered a lot of issues ranging from not being able to flatten the RaggedTensor, not being able to concatenate it with regular tensors of uniform shape, and Dense layers requiring the last dimension of their input to be known, aka the number of features.
Does anybody know if there is a way to do this?
So I have been looking at XGBoost as a place to start with this, however I am not sure the best way to accomplish what I want.
My data is set up something like this
Every value, whether input or output, is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method, however I am only really seeing it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted with only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking at mainly these two demos (they are the same idea just one is scikit and the other xgboost)
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html
I want to know how to handle the skewed data which contains a particular column that has multiple categorical values. Some of these values have more value_counts() than others.
As you can see in this data, the values greater than 7 have far lower value counts than the others. How should I handle this kind of skewed data? (This is not the target variable. I want to know about a skewed independent variable.)
I tried changing these lower-count values to a single value (-1). That way the count of -1 became comparable to the counts of the other values. But training a classification model on this data will affect the accuracy.
Oversampling techniques for minority classes/categories may not work well in many scenarios. You could read more about them here.
One thing you could do is to assign different weights to samples from different classes in your model's loss function, inversely proportional to their frequencies. This would ensure that even classes with few datapoints will equally affect the model's loss, as compared to classes with large number of datapoints.
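A minimal sketch of the inverse-frequency weighting described above, using only NumPy. The function name `inverse_frequency_weights` and the toy labels are made up for illustration, and the `n_samples / (n_classes * count)` scaling is one common heuristic, not the only choice.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return one weight per sample, inversely proportional to its class frequency."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    # "Balanced" heuristic: with this scaling, every class's weights
    # add up to the same total, regardless of how many samples it has.
    class_weight = labels.size / (classes.size * counts)
    lookup = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([lookup[label] for label in labels.tolist()])

# Hypothetical labels: class 0 is frequent, class 2 is rare.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(y)
```

Many scikit-learn estimators accept an array like `weights` through the `sample_weight` argument of `fit`; whether your particular model does depends on the library you are using.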
You could share more details about the dataset or the specific model that you are using, to get more specific suggestions/solutions.
I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
Based on the teammates entered, the model should predict the team's position. Does anyone have suggestions on how I can build a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As for how to create this sheet, you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model. It needs to be converted to numerical form before any model can train on it. Since the maximum team size is 3 in most cases, we can split the three names into three columns (leave a column blank if the team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on the combined list of all three columns and then call transform on each column individually (since the same name may appear in any of the 3 columns). With label encoding, tree-based models are easy to use. One-hot encoding might lead to the curse of dimensionality, as there will be many names, so I would prefer not to use it for an initial simple model.
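The encoding step described above could look roughly like this. The column names, player names and positions are invented for illustration; only `LabelEncoder` and its `fit`/`transform` methods come from scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: names and positions are made up for illustration.
df = pd.DataFrame({
    "tournament": [1, 1, 2, 2],
    "member_1": ["ann", "cy", "bob", "ann"],
    "member_2": ["bob", "dee", "cy", "dee"],
    "member_3": ["", "eve", "", "eve"],  # blank = fewer than 3 members
    "position": [1, 2, 1, 2],
})
member_cols = ["member_1", "member_2", "member_3"]

# Fit on the union of all three columns so that a given name
# receives the same integer code no matter which column it is in.
encoder = LabelEncoder()
encoder.fit(pd.concat([df[c] for c in member_cols]))

# Transform each column separately with the shared encoder.
for c in member_cols:
    df[c + "_id"] = encoder.transform(df[c])
```

Here the blank string is encoded as its own category, so "no third member" becomes a value a tree-based model can split on.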
I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
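To make the difference concrete, here is a small NumPy sketch (not the TensorFlow feature column API itself). The category count of 1000 and the `dimension` of 8 are arbitrary, and the table is filled with random numbers where a real model would learn the values during training.

```python
import numpy as np

num_categories = 1000  # e.g. distinct occupations (made-up figure)
dimension = 8          # length of each embedding vector

rng = np.random.default_rng(0)
# One row per category; in a real model these values are trained.
embedding_table = rng.normal(size=(num_categories, dimension))

category_ids = np.array([3, 17, 3])

# Indicator (one-hot) representation: one cell per category,
# all zeros except a single 1.
indicator = np.eye(num_categories)[category_ids]

# Embedding representation: look up the row for each category id.
embedded = embedding_table[category_ids]
```

Each example goes from 1000 cells in `indicator` to 8 cells in `embedded`, and the lookup is equivalent to multiplying the one-hot vector by the table.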
I am attempting to reproduce a Convolutional Neural Network from a research paper using TensorFlow.
There are many places in the diagram where the results of convolutions are concatenated. Currently I am using tf.concat (https://www.tensorflow.org/api_docs/python/tf/concat) along the last axis (representing channels) to concatenate these feature maps. I originally believed that I would want to concatenate along all axes, but this does not seem to be an option in TensorFlow. Now I am facing the problem that the paper indicates that tensors (feature maps) of different sizes should be concatenated. tf.concat does not support concatenating tensors of different sizes, so I am wondering whether this was the correct command to use in the first place. In summary, what is the correct way to concatenate feature maps (sometimes of different sizes) in TensorFlow?
Thank you.
It's impossible and meaningless to concatenate feature maps with different sizes.
If you want to concatenate 2 tensors, every dimension except the concatenation one must be equal.
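The shape rule can be checked with a short sketch. NumPy is used here as a stand-in so the example has no TensorFlow dependency; `tf.concat` enforces the same constraint. The shapes are made up and assume a (batch, height, width, channels) layout.

```python
import numpy as np

# Two feature maps in (batch, height, width, channels) layout.
a = np.zeros((1, 28, 28, 64))
b = np.ones((1, 28, 28, 32))

# Batch, height and width match; only the channel axis differs,
# so concatenating along the last axis works.
merged = np.concatenate([a, b], axis=-1)

# A feature map with a different spatial size is rejected.
c = np.ones((1, 14, 14, 32))
try:
    np.concatenate([a, c], axis=-1)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```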
From the image you posted, in fact, you can see that every feature map that gets concatenated has the same spatial extent (but a different depth) as the others.
If you can't concatenate that way, there is probably something wrong in your code. The likely cause is that the convolutions use padding = valid, which shrinks the output, instead of padding = same.
The problem that you encounter for inception network may be resolved by using padding in convolutional layers to keep the size same. For inception blocks, instead of using "VALID" padding, change it to "SAME" one. So, without requiring any resizing, you can concatenate the outputs.
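As a rough illustration of why "SAME" padding helps, the output-size arithmetic for the two padding modes can be written out as below. The helper `conv_output_size` is hypothetical, and the 28-pixel input with 1x1, 3x3 and 5x5 kernels is an assumed inception-style example.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="SAME"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "SAME":
        # Output depends only on the input size and the stride.
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: the kernel must fit entirely inside the input.
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Parallel branches with different kernel sizes on a 28-pixel input.
same_sizes = [conv_output_size(28, k, padding="SAME") for k in (1, 3, 5)]
valid_sizes = [conv_output_size(28, k, padding="VALID") for k in (1, 3, 5)]
```

With "SAME" every branch keeps the input's spatial size, so the outputs line up for concatenation; with "VALID" each kernel size produces a different size.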
Alternatively, you can append padding to the feature maps that are going to be concatenated. You can do that by using tf.pad().
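A sketch of the padding approach, again with NumPy standing in for TensorFlow: `np.pad` and `tf.pad` both take one (before, after) pair per axis. The shapes are invented, and splitting the padding evenly between the two sides is an assumption, not something the paper prescribes.

```python
import numpy as np

large = np.ones((1, 28, 28, 64))
small = np.ones((1, 26, 26, 32))

# How many rows and columns the smaller map is missing.
pad_h = large.shape[1] - small.shape[1]
pad_w = large.shape[2] - small.shape[2]

# One (before, after) pair per axis; batch and channels are untouched.
pad_width = [
    (0, 0),
    (pad_h // 2, pad_h - pad_h // 2),
    (pad_w // 2, pad_w - pad_w // 2),
    (0, 0),
]
padded = np.pad(small, pad_width)  # pads with zeros by default

# Spatial sizes now match, so channel-wise concatenation works.
merged = np.concatenate([large, padded], axis=-1)
```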
If you would rather not do that, you can use the tf.image.resize_images function to resize them to the same size. However, this is a dirty and computationally expensive approach.
Tensors can only be concatenated along one axis. If you need to concatenate feature maps of different sizes, you must somehow manipulate the sizes of the original tensors.