Silhouette function for agglomerative hierarchical clustering validation - hierarchical-clustering

How can I perform silhouette validation on Agglomerative Hierarchical Clustering? I want to check the quality of my clusters.
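For reference, a minimal scikit-learn sketch of silhouette validation on agglomerative clustering (toy data; the number of clusters is purely illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data standing in for the real feature matrix
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

# Agglomerative (hierarchical) clustering; the number of clusters is illustrative
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Mean silhouette coefficient over all samples (-1 = poor, +1 = well separated)
print("mean silhouette:", silhouette_score(X, labels))

# Per-sample silhouette values, useful for inspecting individual clusters
print("first 5 per-sample values:", silhouette_samples(X, labels)[:5])
```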
Thanks...
Zhayn

Related

Types of algorithms used to predict item inventory stock

I want to predict stock by analyzing stock movement.
[image: stock movement]
[image: stock levels]
What I need:
the steps to start analyzing the data with machine learning and deep learning;
the types of ML and DL algorithms to use.
Thanks a lot.
Mohamed:
Your question requires a long and broad answer. I'll try to provide some steps and references for you to explore further.
First of all, you need the right data: the prediction will only be as good as your data. So you need economic data, market trends, and company-specific events.
Then you need to follow the standard steps:
Collect and Preprocess Data: You'll need to gather stock data such as daily closing prices, trading volumes, and other relevant financial information. Preprocessing the data involves cleaning and transforming the data so that it can be used for modeling.
Feature Engineering: This involves creating new features from the existing data that can be used as inputs for the ML/DL models. For example, you can calculate technical indicators such as moving averages, relative strength index (RSI), and Bollinger Bands (see the sketch after these steps).
Split the Data: You'll need to split the data into training, validation, and testing sets. The training set is used to train the ML/DL models, the validation set is used to fine-tune the models, and the testing set is used to evaluate the performance of the models.
Select an Algorithm: There are various ML algorithms that can be used for stock price prediction. Some popular algorithms include:
ML Algorithms: Linear Regression, Random Forest, XGBoost, Support Vector Machines (SVM), etc.
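To illustrate the feature-engineering and data-split steps, here is a minimal sketch; it assumes a pandas DataFrame with a closing-price column, and the toy prices, window sizes, and split ratios are purely illustrative:

```python
import pandas as pd

# Hypothetical daily closing prices; in practice load your own stock history
df = pd.DataFrame({"close": [100, 102, 101, 105, 107, 106, 110, 108, 112, 115]})

# Feature engineering: a couple of simple technical indicators
df["sma_3"] = df["close"].rolling(window=3).mean()           # 3-day simple moving average
delta = df["close"].diff()
gain = delta.clip(lower=0).rolling(window=3).mean()
loss = (-delta.clip(upper=0)).rolling(window=3).mean()
df["rsi_3"] = 100 - 100 / (1 + gain / loss)                  # RSI over a 3-day window

# Target: the next day's closing price
df["target"] = df["close"].shift(-1)
df = df.dropna()

# Chronological train / validation / test split (no shuffling for time series)
n = len(df)
train = df.iloc[: int(0.6 * n)]
valid = df.iloc[int(0.6 * n): int(0.8 * n)]
test = df.iloc[int(0.8 * n):]
print(len(train), len(valid), len(test))
```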
Hope that helps.
Reference book: Machine Learning Engineering with MLflow, Chapter 2 (https://learning.oreilly.com/api/v1/continue/9781800560796/)

What is the difference between optimization algorithms and ensembling methods?

I was going through ensembling methods and was wondering: what is the difference between optimization techniques such as gradient descent and ensembling techniques such as bagging and boosting?
Optimization such as gradient descent is a single-model approach: it fits the parameters of one model. An ensemble, per Wikipedia, is multiple models; the constituents of the ensemble are weighted for the overall prediction. Boosting (per Wikipedia, https://en.wikipedia.org/wiki/Ensemble_learning) is essentially retraining with a focus on the examples a model got wrong.
To me this is like monocular vs. binocular image recognition, the two images being an ensemble. Further scrutiny that pays extra attention to classification errors is boosting, that is to say, retraining on some of the errors; perhaps the error cases were represented too infrequently to allow good classifications (thinking black swan here). In vehicles, this could be like combining infrared, thermal, radar, and lidar sensor results for an overall classification. The link above has really good explanations of each of your areas of concern.
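To make the distinction concrete, here is a minimal scikit-learn sketch (toy data): the first model is a single estimator fitted by stochastic gradient descent, the second is an ensemble of many such estimators whose predictions are aggregated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Optimization: ONE model whose weights are fitted by stochastic gradient descent
single = SGDClassifier(random_state=0).fit(X, y)

# Ensembling: MANY such models trained on bootstrap samples; predictions are aggregated
ensemble = BaggingClassifier(SGDClassifier(random_state=0), n_estimators=25,
                             random_state=0).fit(X, y)

print("single model accuracy:  ", single.score(X, y))
print("bagged ensemble accuracy:", ensemble.score(X, y))
```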

Strange algorithm selection when using Azure AutoML with XGBoostClassifier on categorical data

I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
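Roughly, that manual setup looks like this (a minimal sketch; the toy values and column names are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from xgboost import XGBClassifier

# Hypothetical all-categorical data (the real columns can't be shared)
df = pd.DataFrame({
    "feature_a": ["red", "blue", "green", "blue", "red", "green"],
    "feature_b": ["small", "large", "small", "medium", "large", "medium"],
    "label":     ["x", "y", "x", "z", "y", "z"],
})

# Features -> binary (one-hot) columns
X = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["feature_a", "feature_b"]])

# Label -> integer classes
y = LabelEncoder().fit_transform(df["label"])

# Multiclass softmax objective, as described above
model = XGBClassifier(objective="multi:softmax")
model.fit(X, y)
print(model.predict(X))
```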
I tried that with my dataset and ended up with an accuracy of around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that the AutoML uses SparseNormalizer, XGBoostClassifier, with reg:logistic as objective.
So if I interpret this right, AzureML just normalizes the data (somehow from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for fine-grained control, you get ease of use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
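For instance, a rough sketch (it assumes fitted_model is the Pipeline-like object returned by the AutoML run):

```python
# Inspect the pre-processing steps and the final estimator, as with an sklearn Pipeline
for name, step in fitted_model.named_steps.items():
    print(name, "->", type(step).__name__)   # e.g. the SparseNormalizer and the XGBoostClassifier

# Per-column summary of how featurization was applied (encoding, imputation, scaling, ...)
print(fitted_model.get_featurization_summary())
```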
I especially empathize with your concern w.r.t. how LightGBM (Microsoft's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns natively: instead of one-hot encoding, it partitions the categories into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns via one-hot encoding, scaling, and/or normalization, so this categorical-aware approach is never utilized in AutoML.
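For comparison, outside of AutoML you could hand LightGBM the raw categorical columns and let it use that native handling; a minimal sketch (the toy columns are made up):

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical categorical-only features; the pandas 'category' dtype lets LightGBM
# use its native categorical splits instead of one-hot encoding
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "green", "blue", "red", "green"]),
    "size":  pd.Categorical(["s", "l", "s", "m", "l", "m"]),
})
y = [0, 1, 0, 1, 1, 0]

# min_child_samples lowered only so the toy example actually splits
model = lgb.LGBMClassifier(n_estimators=10, min_child_samples=1)
model.fit(X, y)          # categorical columns are picked up from the dtype
print(model.predict(X))
```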
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.

Stratified Kfold

If I am correct, stratified k-fold is used so that the ratio of the dependent variable in the splits is similar to that of the original data.
What I want to understand is why it is necessary or important to retain that ratio.
Is it necessary for fraud detection problems, where the data is highly imbalanced?
If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
Cross-validation article in Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a good representative of the whole. For example in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts more of one label, a fold containing more of that label would give artificially large results.
A methodology to take care of that is to ensure the true-label distribution is very similar in each fold. Then the aggregation of results is more indicative of system performance.
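A minimal scikit-learn sketch of the difference on an imbalanced, fraud-like label:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels: 5% positives, as in fraud detection (positives listed first)
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((len(y), 1))          # features don't matter for the split itself

for name, cv in [("KFold          ", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    rates = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, "positive rate per fold:", np.round(rates, 3))

# Plain KFold (no shuffling) puts all positives into one fold (0.25, 0, 0, 0, 0),
# while StratifiedKFold keeps the rate at 0.05 in every fold.
```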

How to run Tensorflow clustering algorithm model

I need to run the k-means algorithm from TensorFlow in Go, i.e. cluster a graph into subgraphs according to a node-similarity matrix.
I came across this article, which shows an example of how to run a Keras-trained model in Go. In that example the algorithm is of a supervised learning type. However, with clustering algorithms, as I understand it, there is no model to save and export to a Go implementation.
The reason I am interested in TensorFlow is that I think its code is optimized and will run much faster than a k-means implementation in Go, even in the scenario I described above.
I need an opinion on whether:
It is indeed impossible to use a TensorFlow k-means algorithm in Go, and it is much better to just use a k-means implementation written in Go for this case; or
It is possible to do this, in which case some sort of example or ideas on how to do it would be very much appreciated.