Identical Test set - testing

I have some comments that I want to classify as Positive or Negative.
So far I have an annotated dataset.
The thing is that the first 100 rows are labeled Positive and the remaining 100 are labeled Negative.
I am using SQL Server Analysis Services 2008 R2. The Class attribute has two values: POS for positive and NEG for negative.
I use the Naive Bayes algorithm with maximum input/output attributes = 0 (I want to use all the attributes) for the classification, and the maximum test set size is set to 30%. The current score from the Lift Chart is 0.60.
Do I have to mix them up, for example 2 POS followed by 1 NEG, in order to get better classification accuracy?

The ordering of the learning instances should not affect classification performance. The probabilities computed by Naive Bayes will be the same for any ordering of instances in the data set.
However, the selection of different test and training sets can affect classification performance. For example, some instances might be inherently more difficult to classify than others.
Are you getting similarly poor training and test performance? If your training performance is good and/or much better than your test performance, your model may be overfitted. Otherwise, if your training performance is also poor, I would suggest (a) trying a more expressive classifier, e.g., an SVM or decision trees; and/or (b) making sure your features are representative and expressive enough of the data.
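To illustrate why row order does not matter, here is a minimal sketch of the counting step behind Naive Bayes. The toy comments and the helper `fit_naive_bayes` are hypothetical and are not the SQL Server implementation; the point is only that the counts the model is built from are the same for any ordering of the rows.

```python
import random
from collections import Counter, defaultdict

def fit_naive_bayes(rows):
    """Collect class counts and per-class word counts from (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in rows:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts

# Hypothetical toy data: all POS rows first, then all NEG rows.
rows = [(["good", "nice"], "POS")] * 3 + [(["bad", "slow"], "NEG")] * 3

# Same rows in a shuffled order (fixed seed so the run is repeatable).
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# The counts, and therefore the estimated probabilities, are identical.
same_model = fit_naive_bayes(rows) == fit_naive_bayes(shuffled)
print(same_model)
```

What can differ between runs is which rows end up in the training set and which in the test set, which is the effect described above.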

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data. 17 out of 20 features are categorical, and each of them has a lot of categories. With one-hot encoding, my data table is transformed into a 10000x6000 table. How should I prepare this type of data?
I used PCA to try to reduce the dimension, but even 70% of the variance requires 2500 components. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so can often represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also a PCA decomposition) will give you great insight into which classes tend to occur together (and this requires one-hot encoded categories), but training a model on that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. Random forests can definitely be used for regression, though they need to be trained specifically for that. scikit-learn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can easily be trained using numerical values for categorical features, meaning your data vector can be of dimension only 20!
Tree-based models (such as random forests) have been shown to outperform deep learning on tabular data in many cases, and this may be one of them.
TL;DR: If you use a random forest, it can learn even with numerical values for categories, and you can avoid creating incredibly large vectors for your data.
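As a rough illustration of the size difference, the sketch below maps each categorical column to integer codes and compares the resulting width with the width a one-hot encoding would need. The helper `ordinal_encode` and the three-column toy rows are hypothetical. Note that integer codes impose an arbitrary order on the categories; the assumption here is that a tree-based model can still split on them usefully.

```python
def ordinal_encode(rows):
    """Map each column's categories to integer codes; one column stays one column."""
    n_cols = len(rows[0])
    mappings = [{} for _ in range(n_cols)]
    encoded = []
    for row in rows:
        codes = []
        for j, value in enumerate(row):
            # A category gets the next unused integer the first time it is seen.
            codes.append(mappings[j].setdefault(value, len(mappings[j])))
        encoded.append(codes)
    return encoded, mappings

# Hypothetical toy rows with three categorical features.
rows = [
    ("red", "small", "fr"),
    ("blue", "large", "de"),
    ("red", "medium", "fr"),
]

encoded, mappings = ordinal_encode(rows)
ordinal_width = len(encoded[0])                 # one column per feature
one_hot_width = sum(len(m) for m in mappings)   # one column per distinct category

print(encoded)
print(ordinal_width, one_hot_width)
```

The encoded rows could then be passed to a regressor such as the `RandomForestRegressor` linked above.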

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let's say I have a model trained on min-max scaled data. I want to test my model, so I also scaled the test dataset with the same scaler that was fitted in the training stage. However, a value in my new test data turned out to be lower than the training minimum, so the scaler returned a negative value.
As far as I know, the minimum and maximum aren't very stable values, especially in a volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with @Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques that penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization helps to ensure your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikit-learn's MinMaxScaler is one, as is StandardScaler (subtract the mean and divide by the standard deviation). In the case where your target variable, cryptocurrency price, can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art: there's not necessarily a 'right' answer here.
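A minimal sketch of the situation in the question, using a hand-written one-feature scaler as a stand-in for a library scaler. The class `MinMaxScaler1D` and the price values are hypothetical. The scaler is fitted on the training data only, so test values outside the training range simply map outside [0, 1]; a log transform is shown alongside as one alternative.

```python
import math

class MinMaxScaler1D:
    """Minimal min-max scaler for a single feature (illustrative stand-in)."""

    def fit(self, values):
        # Remember the range seen during training only.
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        span = self.max_ - self.min_
        return [(v - self.min_) / span for v in values]

# Hypothetical prices: the test data contains a new low and a new high.
train_prices = [100.0, 150.0, 200.0]
test_prices = [90.0, 180.0, 220.0]

scaler = MinMaxScaler1D().fit(train_prices)
scaled_test = scaler.transform(test_prices)

# A log transform does not depend on an observed minimum or maximum.
log_test = [math.log10(p) for p in test_prices]

print(scaled_test)
print(log_test)
```

Here the first scaled test value is negative and the last is above 1, which is expected behavior rather than a bug in the scaler.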
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
Ideally you should scale first and only then split into test and train. But it's not preferable to use a min-max scaler with data whose min and max values can vary dynamically, with significant variance, in a real-time scenario.

Algorithm - finding the order of HMM from observations

I am given data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transition probabilities depend only on the previous state. A second-order HMM is one whose transition probabilities depend only on the two previous states, and so on. As the order increases, the theory (i.e., the equations) gets "thicker", and few mainstream libraries implement such models.
A search in your favorite search engine with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and with the assumptions that you use single distributions assigned to each state (i.e., you do not use HMMs with mixtures of distributions) then, indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length Criterion which are based on model's likelihood computations. Usually, the use of these criteria necessitates training multiple models in order to be able to compute some meaningful likelihood results to compare.
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then going with an X-state HMM is a good start. The three criteria mentioned above (BIC, AIC, and MML) are interesting because they include a penalty term that grows with the number of model parameters and can therefore help prevent overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
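As a sketch of how such a comparison might look, the snippet below computes AIC and BIC from a model's log-likelihood and picks the number of states with the lowest BIC. The log-likelihood values are made up for illustration, and the parameter count assumes a first-order HMM with discrete emissions; both are assumptions, not part of the question.

```python
import math

def hmm_param_count(n_states, n_symbols):
    """Free parameters of a first-order, discrete-emission HMM (assumed model family)."""
    initial = n_states - 1                   # initial distribution sums to 1
    transitions = n_states * (n_states - 1)  # each transition row sums to 1
    emissions = n_states * (n_symbols - 1)   # each emission row sums to 1
    return initial + transitions + emissions

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical log-likelihoods of models trained with 2, 3, and 4 states.
candidates = {2: -1250.0, 3: -1180.0, 4: -1172.0}
n_symbols, n_obs = 5, 1000  # assumed alphabet size and number of observations

scores = {
    k: bic(ll, hmm_param_count(k, n_symbols), n_obs)
    for k, ll in candidates.items()
}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With these made-up numbers, the four-state model has the highest likelihood, but its extra parameters are penalized enough that the three-state model gets the lowest BIC.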

Splitting Training Data to train optimal number of n models

Let's assume we have a huge database providing us with the training data D and a dedicated smaller test set T for a machine learning problem.
The data covers many aspects of a real world problem and thus is very diverse in its structure.
When we now train an unspecified machine learning algorithm (neural network, SVM, random forest, ...) on D and finally test the resulting model against T, we obtain a certain performance measure P (confusion matrix, MSE, ...).
The question: if I could achieve better performance by dividing the problem into smaller sub-problems, e.g. by clustering D into several distinct training sets D1, D2, D3, ..., how could I find the optimal clusters (number of clusters, centroids, ...)?
In a brute-force fashion, I am thinking about using a kNN clustering with a random number of clusters C, which leads to the training data D1, D2, ..., Dc.
I would then train C different models and finally test them against the test sets T1, T2, ..., Tc, where the same kNN clustering has been used to split T into the C test sets T1, ..., Tc.
The combination which gives me the best overall performance mean(P1,P2,...,Pc) would be the one I would like to choose.
I was just wondering whether you know a more sophisticated way than brute-forcing this?
Many thanks in advance
Clustering is hard.
Much harder than classification, because you don't have labels to tell you whether you are doing okay or not well at all. It can't do magic; it requires you to carefully choose parameters and evaluate the result.
You cannot just dump your data into k-means and expect anything useful to come out. You'd first need to really really carefully clean and preprocess your data, and then you might simply figure out that it actually is only one single large clump...
Furthermore, if clustering worked well and you trained classifiers on each cluster independently, then every classifier would miss crucial data. The result would likely perform really badly!
If you want to only train on parts of the data, use a random forest.
But it sounds like you are more interested in a hierarchical classification approach. That may work, if you have good hierarchy information. You'd first train a classifier on the category, then another within the category only to get the final class.

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing combinations of different features for prediction models, and with cross-validation they seem to predict worse than the OneR algorithm alone. Even C4.5 does not consistently predict better: sometimes it does and sometimes it does not, depending on which features are still available for classification.
But at one point I split my dataset into a test set and a training set (20/80), and when tested on the test set, the C4.5 algorithm had a far higher accuracy than OneR. I thought that, given the small size of the dataset, it was probably just a coincidence that it predicted so well (the target classes were still split in roughly the same proportions as in the full dataset). Therefore, I assumed it is more useful to use cross-validation on small datasets like these.
However, testing on another test set also gave high accuracy with C4.5. So my question is: what is the best way to evaluate models when the dataset is actually pretty small?
I saw some posts where it is discussed, but I am still not sure what is the right way to do it.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
1. Split your 3800 instances into 10 sets of 380 instances each.
2. For each set s (s = 1 .. 10), use the instances from s for testing and the other 9 sets for training a model (3420 training instances).
3. The result will be the average of the results obtained with the 10 models.
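The split described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not Weka's implementation; the helper `k_fold_indices` is hypothetical, and it uses contiguous folds for simplicity, whereas in practice the instances are usually shuffled (and ideally stratified by class) first.

```python
def k_fold_indices(n_instances, n_folds=10):
    """Yield (train_indices, test_indices) for each fold, using contiguous folds."""
    fold_size = n_instances // n_folds
    indices = list(range(n_instances))
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder when the split is not even.
        stop = start + fold_size if fold < n_folds - 1 else n_instances
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

folds = list(k_fold_indices(3800, 10))
sizes = [(len(train), len(test)) for train, test in folds]
print(sizes[0], len(folds))
```

Each instance is used for testing exactly once and for training nine times, which is why the averaged result is less sensitive to one lucky or unlucky split than a single 20/80 holdout.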
Try to avoid evaluating your model with the "use training set" option, because that could result in a model that works very well for your existing data but has big problems with new instances (overfitting).