A huge number of discrete features - dataframe

I'm developing a regression model, but I ran into a problem when preparing the data. 17 of the 20 features are categorical, and each of them has a lot of categories. After one-hot encoding, my data table becomes a 10000x6000 table. How should I prepare this type of data?
I used PCA to try to reduce the dimensionality, but 2500 components are needed to capture even 70% of the variance. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?

Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so can often represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
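As a rough illustration of the idea, here is a minimal sketch that uses scikit-learn's MLPRegressor as a stand-in for a proper autoencoder (a dedicated deep-learning library would be the more usual choice). The data, the bottleneck width of 16, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 500 rows, 5 categorical columns, 30 levels each.
X_cat = rng.integers(0, 30, size=(500, 5))
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()

# A single tanh hidden layer of width 16 acts as the bottleneck.
# The network is trained to reproduce its own input.
autoencoder = MLPRegressor(hidden_layer_sizes=(16,), activation="tanh",
                           max_iter=300, random_state=0)
autoencoder.fit(X_onehot, X_onehot)

# Encoder half: first-layer weights followed by the tanh activation.
codes = np.tanh(X_onehot @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
print(X_onehot.shape, "->", codes.shape)
```

The `codes` array is what you would feed to the downstream regression model. Whether 16 dimensions are enough is something to check against validation error, not something this sketch can tell you.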

It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can be trained with numerical codes for the categorical features, meaning your data vector need only be of dimension 20!
Decision-tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: If you use a random forest, it can learn even with numerical values for categories, and you can avoid creating incredibly large data vectors.
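A minimal sketch of that setup, assuming scikit-learn and pandas. The column names and the random data are hypothetical placeholders for the confidential dataset; each categorical column is mapped to integer codes, so the model sees 20 inputs instead of thousands.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-in: 17 categorical columns and 3 numeric ones.
cat_cols = [f"cat_{i}" for i in range(17)]
num_cols = [f"num_{i}" for i in range(3)]
df = pd.DataFrame({c: rng.integers(0, 300, n).astype(str) for c in cat_cols})
for c in num_cols:
    df[c] = rng.normal(size=n)
y = rng.normal(size=n)

# Map each category to an integer code; categories unseen at fit time get -1.
encode = ColumnTransformer(
    [("cats",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      cat_cols)],
    remainder="passthrough",  # numeric columns pass through unchanged
)
model = make_pipeline(encode, RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(df, y)

n_inputs = model[:-1].transform(df).shape[1]
print("features seen by the forest:", n_inputs)
```

Keep in mind that the integer codes impose an arbitrary order on the categories. Trees can usually work around that with several splits, but it is worth comparing against other encodings on a validation set.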

Related

Standardization or scaling of categorical variables

I am fairly new to data science. I am working on a use case of predicting sales demand using linear regression, with product number and store number as predictors. There can be many stores and products, identified by numeric values. Do I need to standardize or scale these predictors if their values are numeric, unbounded, and on different scales? I believe that if I use an interaction term I will have to standardize it?
Since these are categorical features, you should encode them properly before using a linear model. If you can encode them so that the encoded values relate linearly to the target, then you can standardize them; otherwise standardizing wouldn't make sense. If you use tree-based models, then you don't have to encode them this way, since trees are able to discover nonlinear relationships.
Edit-note: You can try mean-encoding methods, such as the CV loop or the expanding mean.
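For reference, here is one way the CV-loop variant of mean encoding might look, assuming pandas and scikit-learn. The function name `oof_mean_encode` and the toy store/sales data are invented for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(categories, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        # Per-category target means computed without the rows being encoded.
        means = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = categories.iloc[enc_idx].map(means).to_numpy()
    # Categories absent from the fitting folds fall back to the global mean.
    return encoded.fillna(target.mean())

stores = ["a", "a", "a", "b", "b", "c"]
sales = [10, 12, 14, 100, 110, 50]
encoded = oof_mean_encode(stores, sales, n_splits=3)
print(encoded.tolist())
```

Store "c" appears only once, so it can never be seen in the fitting folds and always receives the global mean. The resulting numeric column is something you can reasonably standardize before fitting a linear model.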

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let's say I have a model trained on min-max scaled data. I want to test my model, so I also scaled the test dataset with the same scaler that was fitted in the training stage. However, a value in my new test data turned out to be lower than the old minimum, so the scaler returned a negative value.
As far as I know, the minimum and maximum aren't very stable values, especially in a volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with #Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques that penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization helps to ensure your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikit-learn's MinMaxScaler is one, as is StandardScaler (subtract the mean and divide by the standard deviation). In the case where your target variable, cryptocurrency price, can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art -- there's not necessarily a 'right' answer here.
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
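To make the situation from the question concrete, here is a small sketch with made-up prices, assuming scikit-learn. The scaler is fitted on the training values only, and test values outside the training range simply land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices: the test set contains a new low (5) and a new high (40).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler().fit(train)    # min and max are learned from train only
scaled_test = scaler.transform(test)  # (x - 10) / (30 - 10)
print(scaled_test.ravel())            # roughly -0.25, 0.75 and 1.5

# A log transform is one alternative when values span orders of magnitude.
log_test = np.log10(test)
```

The negative value is not an error. It just says the new observation is a quarter of the training range below the training minimum.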
Ideally you should scale first and only then split into test and train. But it's not preferable to use a min-max scaler with data whose min and max can vary significantly in a real-time scenario.

xgboost vs H2o Gradient Boosting

I have a dataset with a variable that has a large share of missing values (more than 40% missing). I generated a model in XGBoost and in H2O gradient boosting, and got a decent model in both cases. However, XGBoost shows this variable as one of the key contributors to the model, while according to H2O gradient boosting the variable is not important. Does XGBoost handle variables with missing values differently? The configuration of both models is exactly the same.
Both missing value handling and variable importances are slightly different between the two methods. Both are treating missing values as information (i.e., they learn from them, and don't just impute with a simple constant). The variable importances are computed from the gains of their respective loss functions during tree construction. H2O uses squared error, and XGBoost uses a more complicated one based on gradient and hessian.
One thing you could check is the variance of the variable importances between different runs with different seeds, to see how stable each method is in terms of variable importances.
PS. If you have categoricals, then you're better off leaving the column as a factor for H2O, no need to do your own encoding. This can lead to a different effective count of columns vs XGBoost's dataset, so for column sampling, things will be different.
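The seed-stability check suggested above could look roughly like this. Note the swap: this sketch uses scikit-learn's GradientBoostingRegressor on synthetic data as a stand-in, not XGBoost or H2O, and it does not include missing values. The same loop should carry over to either library by replacing the model and reading its own importance output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 8 features, 3 of them informative.
X, y = make_regression(n_samples=400, n_features=8, n_informative=3,
                       random_state=0)

importances = []
for seed in range(5):
    # Row and column subsampling make the fit depend on the seed.
    model = GradientBoostingRegressor(n_estimators=50, subsample=0.7,
                                      max_features=0.5, random_state=seed)
    model.fit(X, y)
    importances.append(model.feature_importances_)
importances = np.array(importances)

mean_imp = importances.mean(axis=0)
std_imp = importances.std(axis=0)
for j in range(X.shape[1]):
    print(f"feature {j}: {mean_imp[j]:.3f} +/- {std_imp[j]:.3f}")
```

If the spread across seeds is of the same order as the disagreement between the two libraries, the difference in rankings is probably not worth reading much into.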

How tensorflow deals with large Variables which can not be stored in one box

I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first-layer weight matrix will be (1,000,000,000, 512). This weight matrix is too large to be stored on one box.
Is there currently any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Update:
Thanks Olivier and Keveman. let me add more detail about my problem.
The examples are very sparse and all features are binary valued: 0 or 1. The parameter weight looks like tf.Variable(tf.truncated_normal([1 000 000 000, 512],stddev=0.1))
The solutions Keveman gave seem reasonable, and I will update with results after trying them.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each example in the data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and locate the shards in different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.
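To see why the embedding view works for sparse binary inputs, here is a small NumPy sketch (not TensorFlow code) with tiny stand-in sizes. The modulo-based shard layout and the helper name `lookup_sum` are choices made for this illustration only. Multiplying a 0/1 input vector by the weight matrix is the same as summing the rows of the active features, and those rows can be fetched from separate shards.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim, n_shards = 1000, 8, 4  # tiny stand-ins for 1e9 and 512
W = rng.normal(scale=0.1, size=(n_features, dim))

# Row i is stored in shard i % n_shards at local position i // n_shards.
# In a real setup each shard would live on a different machine.
shards = [W[s::n_shards] for s in range(n_shards)]

def lookup_sum(active_ids):
    """Sum the weight rows of the active (value 1) features of one example."""
    out = np.zeros(dim)
    for i in active_ids:
        out += shards[i % n_shards][i // n_shards]
    return out

active = [3, 17, 256, 999]  # indices of the features that are 1
x = np.zeros(n_features)
x[active] = 1.0

dense = x @ W                # needs the full matrix in one place
sparse = lookup_sum(active)  # touches only four rows
print(np.allclose(dense, sparse))
```

Only the rows for the active features are ever read, so the cost per example depends on how many features are set, not on the billion-row total.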

Identical Test set

I have some comments and I want to classify them as Positive or Negative.
So far I have an annotated dataset.
The thing is that the first 100 rows are classified as Positive and the remaining 100 as Negative.
I am using SQL Server Analysis-2008 R2. The Class attribute has 2 values, POS for positive and NEG for negative.
I also use the Naive Bayes algorithm with maximum input/output attributes=0 (I want to use all the attributes) for the classification, and the test set max case is set to 30%. The current score from the Lift Chart is 0.60.
Do I have to mix them up, for example 2 POS followed by 1 NEG, in order to get better classification accuracy?
The ordering of the learning instances should not affect classification performance. The probabilities computed by Naive Bayes will be the same for any ordering of instances in the data set.
However, the selection of different test and training sets can affect classification performance. For example, some instances might be inherently more difficult to classify than others.
Are you getting similarly poor training and test performance? If your training performance is good and/or much better than your test performance, your model may be overfitted. Otherwise, if your training performance is also poor, I would suggest (a) trying a better/stronger/more expressive classifier, e.g., SVM, decision trees, etc.; and/or (b) making sure your features are representative/expressive enough of the data.
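The point about instance ordering can be checked with a quick experiment. This sketch uses scikit-learn's MultinomialNB on random word counts as a stand-in, since the SQL Server Analysis Services model is not reproduced here. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical word-count features: 100 POS rows followed by 100 NEG rows.
X = rng.integers(0, 5, size=(200, 20))
y = np.array(["POS"] * 100 + ["NEG"] * 100)

# Fit once on the sorted data and once on a random shuffle of the same rows.
perm = rng.permutation(200)
sorted_model = MultinomialNB().fit(X, y)
shuffled_model = MultinomialNB().fit(X[perm], y[perm])

X_new = rng.integers(0, 5, size=(10, 20))
same = np.allclose(sorted_model.predict_proba(X_new),
                   shuffled_model.predict_proba(X_new))
print("identical probabilities:", same)
```

What the ordering can affect is how the rows get divided into training and test sets. If a tool were to hold out a contiguous block of a sorted dataset instead of a random sample, the test set could end up containing a single class, so it is worth checking how the 30% is selected.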