Box-Cox transformation with tree-based models (XGBoost to be specific) - xgboost

I have a question regarding the Box-Cox transformation (or log transformation). I am working on a dataset that has a lot of skewed features. When I apply the Box-Cox transformation I get quite a nice distribution, but the correlation with the target decreases. If I were working with linear models I would just look at the correlation to decide whether to transform the feature or not. But as I mentioned, I am working with tree-based models, so should I transform the feature to get a more dispersed distribution, or leave it as it is to avoid the decrease in correlation?
I have added a screenshot of the distribution and its relationship with the target variable, for both the transformed and untransformed feature (the left two plots show the original feature and the target).
PS: Judging from the plots, it seems to me that if I transform the feature it will be easier for the tree to find a split on this particular feature.
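A minimal sketch of the comparison described above, using purely synthetic data (the feature and target here are hypothetical): it applies scipy.stats.boxcox and reports skewness plus Pearson and Spearman correlation before and after the transform.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: a right-skewed positive feature and a loosely related target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feat": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})
df["target"] = np.log1p(df["feat"]) + rng.normal(scale=0.3, size=1000)

# Box-Cox requires strictly positive values; lambda is estimated by maximum likelihood.
df["feat_boxcox"], lmbda = stats.boxcox(df["feat"])

for col in ["feat", "feat_boxcox"]:
    print(
        col,
        "skew=%.2f" % stats.skew(df[col]),
        "pearson=%.2f" % df[col].corr(df["target"]),
        "spearman=%.2f" % df[col].corr(df["target"], method="spearman"),
    )
# Box-Cox is monotonic, so the Spearman (rank) correlation stays the same even when
# the Pearson correlation drops, which is why single-feature tree splits are largely
# unaffected by the transform.
```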
Thanks a lot,

Related

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are encoded using a one-hot encoder.
I want to use PCA (Principal Component Analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to just some columns of the dataframe and keep the other columns as they are, and then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, and then use the other, non-merged features in addition to the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation analysis to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is already there.
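If you go down that route, a minimal sketch with scikit-learn's ColumnTransformer shows how PCA can be applied to only a subset of the columns while the rest are passed through untouched (all column names here are placeholders):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical column split: reduce the numeric block, keep the one-hot block as-is.
numeric_cols = ["num_1", "num_2", "num_3", "num_4"]
onehot_cols = ["cat_a", "cat_b", "cat_c"]

preprocess = ColumnTransformer(
    transformers=[
        ("pca", Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))]),
         numeric_cols),
        ("keep", "passthrough", onehot_cols),
    ]
)

# Cluster on the mixed (reduced + untouched) feature matrix.
model = Pipeline([("prep", preprocess), ("kmeans", KMeans(n_clusters=3, n_init=10))])
# labels = model.fit_predict(df)  # df would be your dataframe with the columns above
```

Scaling before PCA and keeping the one-hot columns out of the PCA block are common choices, but whether the reduced and raw features mix well is exactly the kind of thing worth checking with the correlation analysis mentioned above.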
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.

How are histograms constructed in sklearn's HistGradientBoostingClassifier to decide on the best split point

Both LightGBM and sklearn's HistGradientBoostingClassifier use histograms to decide on the best splits for continuous features.
Is it possible to explain intuitively (or with some example) the process of histogram creation and how it helps to find the split point at a node faster?
I have looked for answers extensively on the Internet but could not find a simple or intuitive explanation of how the histograms are constructed.
I am not sure, but it could be related to how the (unique) regression trees are constructed in XGBoost. For a continuous feature, you construct a histogram, decide on a split (e.g. weight < 70 kg), construct a regression tree, and compute the similarity score as well as the gain. However, when the range of values of the continuous feature is quite large, it is computationally expensive to try all possible split values. In that case, XGBoost makes the split using quantiles, which involves dividing all the observations into equally sized sets.
I guess sklearn's HistGradientBoostingClassifier might use the same optimization for coming up with the best split.
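A minimal sketch of the general idea, assuming a single continuous feature and a binary target (this is not the exact algorithm either library uses): the feature is bucketed into a fixed number of quantile bins, per-bin statistics are accumulated in one pass, and only the bin edges are scored as candidate splits.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=70, scale=15, size=10_000)        # e.g. weight in kg
y = (x + rng.normal(scale=10, size=x.size) > 75).astype(int)

# Build a histogram: quantile-based bin edges, so each bin holds roughly the same
# number of observations (this mirrors the "equally sized sets" idea above).
n_bins = 16
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, x)                  # which bin each sample falls into

# Per-bin sums are enough to score every candidate split without revisiting the data.
count = np.bincount(bin_idx, minlength=n_bins)
pos = np.bincount(bin_idx, weights=y, minlength=n_bins)

best = None
for b in range(1, n_bins):                           # split between bin b-1 and bin b
    n_l, n_r = count[:b].sum(), count[b:].sum()
    p_l, p_r = pos[:b].sum() / n_l, pos[b:].sum() / n_r
    # Weighted Gini impurity of the two children (lower is better).
    gini = (n_l * 2 * p_l * (1 - p_l) + n_r * 2 * p_r * (1 - p_r)) / (n_l + n_r)
    if best is None or gini < best[0]:
        best = (gini, edges[b - 1])

print("best split at x <", round(best[1], 2), "with weighted Gini", round(best[0], 4))
# Only n_bins - 1 candidate splits were scored instead of ~10,000 unique values.
```

The real implementations accumulate gradient and Hessian sums per bin rather than class counts, but the saving is the same: the number of candidate splits drops from the number of unique feature values to the number of bins.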

C5.0 gives back only a single leaf

I'm doing a data analysis task in SPSS Modeler and I have finally arrived at the point of the stream where I'm trying to fit some models to the data.
However, when I tried to run the C5.0 modeling node on my data, the node generated a modeling nugget containing only a single leaf, so there are no decision rules in the model. I partitioned the data beforehand into training and test subsets (70-30). I did not use misclassification costs and used the properly predefined attribute roles. On the node's Model page I checked the Use partitioned data, Build model for each split, Group symbolics, and Use global pruning options; I also tried expert mode, but it fails in simple mode too. I have tried different options but it gives the same output without a single split.
How can I make the model give back a more complex decision tree? I suppose this is not the expected outcome.
Any suggestions are welcome.
Please check the distribution of your target variable and share it.
If the balance differs greatly from 50%-50%, you may need to balance your inputs first.
Misclassification costs are another technique to get a non-trivial output, but again they should be based on your empirical distributions.
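As a quick, hedged sketch of that check outside Modeler (assuming the stream's data has been exported to a CSV file and that `target` is a placeholder column name):

```python
import pandas as pd

# Hypothetical export of the Modeler stream's data; file and column names are placeholders.
df = pd.read_csv("exported_stream_data.csv")

counts = df["target"].value_counts()
print(counts)
print((counts / counts.sum()).round(3))  # class proportions

# If one class dominates (e.g. 95% / 5%), C5.0 can reach its lowest error with a
# single leaf that always predicts the majority class, which is consistent with
# the single-leaf nugget described above.
```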

How Date and String columns are treated in GraphLab

I have a large dataset in which some columns are dates and others are categorical data such as Status, Department Name, and Country Name.
So how is this data treated in GraphLab when I call the graphlab.linear_regression.create method? Do I have to preprocess this data and convert it into numbers, or can I provide it to GraphLab directly?
GraphLab is mostly used for computing on tabular and graph-based datasets, and it has high scalability and performance. In graphlab.linear_regression.create, GraphLab has a built-in ability to understand the type of data and pick the most suitable solver for optimizing results. For example, when both the target and the features are numeric, GraphLab most of the time uses Newton's method. Similarly, depending on the dataset, it recognizes what is needed and chooses the method accordingly.
Now, about preprocessing: GraphLab only takes an SFrame for learning, and that SFrame needs to be parsed correctly beforehand. While creating an SFrame, unprocessed or malformed data is flagged and throws an error. So, in order to do any learning, you need clean data. If the SFrame accepts the data, along with the target and features you want to learn from, you are good to go, but preprocessing and cleaning the data is always recommended. It is also good practice to do feature engineering before any learning algorithm, and redefining data types before learning is always recommended for accuracy.
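A minimal sketch of that workflow, assuming GraphLab Create is installed and the data sits in a hypothetical records.csv with a date column hire_date, string columns department and country, and a numeric target salary (all of these names are placeholders). String columns can be passed as-is and are encoded as categorical features internally, while the date is parsed and split into numeric parts first:

```python
import graphlab as gl

# Load the raw table into an SFrame (column types are inferred from the CSV).
sf = gl.SFrame.read_csv("records.csv")

# Dates usually need explicit handling: parse the string into a datetime,
# then expand it into numeric parts the regression can use.
sf["hire_date"] = sf["hire_date"].str_to_datetime("%Y-%m-%d")
sf = sf.split_datetime("hire_date", limit=["year", "month", "day"])

# String columns like 'department' or 'country' can be passed directly;
# the model treats string-typed features as categorical.
model = gl.linear_regression.create(
    sf,
    target="salary",
    features=["department", "country", "hire_date.year", "hire_date.month"],
)
model.summary()
```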
Regarding your point on how data is treated in GraphLab, I would say it depends. Some datasets are tabular and are treated accordingly, and some have a graph structure. GraphLab performs very well when it comes to regression trees and boosted classifiers, which follow the decision-tree concept and are quite time- and resource-consuming in libraries other than GraphLab.
For me, GraphLab performed very well when I was building a recommendation engine on a dataset of nodes and edges, and a boosted tree classifier with 18 iterations also worked flawlessly in quite scalable time; I must say that even for tree-structured data, GraphLab performs very well. I hope this answer helps.

How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem and I want to run some experiments. I want to perform clustering on citation records. My dataset consists of 2000 XML records, and I need testing data. The dataset I'm using is not popular, so I have to create the testing data manually, but I don't know how to do that. I need instructions on how to make testing data manually. Note: I want to compare the performance of a set of techniques for solving the author name disambiguation problem, so I must perform testing.
Even though it is not really clear what kind of testing you want to perform, the general answer to the issue at hand - trying to artificially create more data from the data you already have - is the bootstrap. In general, it is a technique where you sample with replacement from your dataset as many times as you want: it repeatedly picks a random element from your data until you get a sample of the size you want. The resulting sample can be larger than your original dataset but should be statistically similar to it. Bootstrap sampling is available in sklearn.
P.S. Keep in mind that this solution is not optimal - the best solution to this problem is to actually get more real data somehow.
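A minimal sketch of bootstrap sampling with scikit-learn's resample utility, assuming the 2000 records have already been parsed from the XML into a Python list (the record structure here is hypothetical):

```python
from sklearn.utils import resample

# Hypothetical parsed citation records (in practice, rows parsed from the XML files).
records = [{"id": i, "author": "author_%d" % (i % 50)} for i in range(2000)]

# Sampling with replacement: the bootstrap sample can be as large as you like,
# and individual records may appear more than once.
bootstrap_sample = resample(records, replace=True, n_samples=3000, random_state=0)

print(len(bootstrap_sample))                     # 3000
print(len({r["id"] for r in bootstrap_sample}))  # < 2000: duplicates are expected
```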
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have features for each author / publication. Now you give the classifier two of those feature vectors. It classifies them as "same author" or "different authors".
Training / testing data
With a binary classification problem, the testing suddenly becomes simple: just use one of the measures that appear so often in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have 2000 XML records. I guess you can derive features from those records automatically, and the authors have an identifier? Then you can simply generate negative examples by pairing different authors and positive examples by checking whether the identifier is the same.
Otherwise you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications listed under the same author name that should belong to different people, they distinguish authors not only by name but also give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.
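A minimal sketch of the pairwise setup described above, with entirely hypothetical data: each record is reduced to a feature vector plus an author identifier, positive pairs share an identifier, negative pairs do not, and the resulting binary classifier is evaluated with the usual precision/recall report.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical records: (feature_vector, author_id) as if derived from the XML.
rng = np.random.default_rng(1)
records = [(rng.normal(loc=author, scale=0.5, size=8), author)
           for author in range(20) for _ in range(10)]

# Build pairs: label 1 if both records share an author id, else 0.
X, y = [], []
for (f1, a1), (f2, a2) in itertools.combinations(records, 2):
    X.append(np.abs(f1 - f2))          # simple pair feature: element-wise distance
    y.append(int(a1 == a2))
X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```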