causality relationship: Granger vs DBN - bayesian-networks

I'm looking to find/prove the causal relationship between two time series (for example, two EEG electrode signals).
I want to know how to choose between the Granger causality test and a dynamic Bayesian network (DBN).

Related

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data. 17 of the 20 features are categorical, and each of them has a lot of categories. With one-hot encoding, my data table is transformed into a 10000x6000 table. How should I prepare this type of data?
I tried PCA to reduce the dimension, but capturing even 70% of the variance takes 2500 components. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so may be able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
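To make the linear-projection point concrete, here is a minimal NumPy sketch that counts how many principal components are needed to reach a target share of the variance. The function name `components_for_variance` and the synthetic data are illustrative assumptions, not part of the question's dataset.

```python
import numpy as np


def components_for_variance(X, target=0.70):
    """Return the smallest number of principal components whose
    cumulative explained variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)                  # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in descending order
    ratio = s**2 / np.sum(s**2)              # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, target) + 1)


# Synthetic example: three columns that are all multiples of one hidden factor,
# so a single linear component captures essentially all of the variance.
rng = np.random.default_rng(0)
factor = rng.normal(size=(100, 1))
X_demo = factor @ np.array([[1.0, 2.0, 3.0]])
n_components = components_for_variance(X_demo, target=0.70)
```

If this count stays large for a modest target, as in the question, the variance is spread over many linear directions, which is the situation where a non-linear method may be worth trying.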
It really depends on exactly what you are trying to do. Computing a covariance matrix (and a PCA decomposition) will give you great insight into which classes tend to occur together (and this requires one-hot encoded categories), but training a model on that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. Random forests can definitely be used for regression, though they need to be trained specifically for that. scikit-learn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can easily be trained using integer codes for the categorical features, meaning your data vector can have a dimension of only 20!
Decision tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: If you use a random forest, it can learn even from numerical codes for the categories, and you can avoid creating incredibly large data vectors.
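A minimal sketch of that idea, assuming scikit-learn is installed. The data is synthetic (the real dataset is confidential) and the column contents are made up for illustration. Note that integer codes impose an arbitrary order on the categories; trees can often work around this by splitting several times, but that is an assumption to validate on your own data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 3 categorical columns and 1 numeric column.
n_rows = 200
cat_cols = np.column_stack([
    rng.choice(["a", "b", "c", "d"], size=n_rows),
    rng.choice(["x", "y", "z"], size=n_rows),
    rng.choice(["p", "q"], size=n_rows),
])
num_col = rng.normal(size=(n_rows, 1))
y = num_col[:, 0] + (cat_cols[:, 0] == "a") * 2.0 + rng.normal(scale=0.1, size=n_rows)

# Map each category to one integer code per column (no one-hot expansion).
# Categories unseen during fit are encoded as -1 at prediction time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_cat = encoder.fit_transform(cat_cols)

# The feature matrix keeps one column per original feature.
X = np.hstack([X_cat, num_col])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
preds = model.predict(X[:5])
```

Here the feature matrix has 4 columns, one per original feature; with the question's data it would have 20 instead of 6000.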

What impurity index (Gini, entropy?) is used in TensorFlow Random Forests with CART trees?

I was looking for this information in the tensorflow_decision_forests docs (https://github.com/tensorflow/decision-forests) (https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/wrappers/CartModel) and yggdrasil_decision_forests docs (https://github.com/google/yggdrasil-decision-forests).
I've also taken a look at the code of these two libraries, but I didn't find that information.
I'm also curious if I can specify an impurity index to use.
I'm looking for some analogy to sklearn decision tree, where you can specify the impurity index with criterion parameter.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
For TensorFlow Random Forest I found only the parameter uplift_split_score:
uplift_split_score: For uplift models only. Splitter score i.e. score
optimized by the splitters. The scores are introduced in "Decision trees
for uplift modeling with single and multiple treatments", Rzepakowski et
al. Notation: p probability / average value of the positive outcome,
q probability / average value in the control group.
- KULLBACK_LEIBLER or KL: - p log (p/q)
- EUCLIDEAN_DISTANCE or ED: (p-q)^2
- CHI_SQUARED or CS: (p-q)^2/q
Default: "KULLBACK_LEIBLER".
I'm not sure if it's a good lead.
No, you shouldn't use uplift_split_score because, as its documentation says, it is "for uplift models only".
Uplift modeling is used to estimate treatment effects and for other tasks in causal inference.
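For reference, the two impurity indices named in the question are easy to state in code. This is a plain-Python sketch of the textbook definitions (the function names are illustrative), not the implementation used by TensorFlow Decision Forests or scikit-learn.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Empirical probability of each class present in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]


def gini_impurity(labels):
    """Gini index: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy_impurity(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


balanced = ["yes", "no", "yes", "no"]   # maximally mixed two-class node
pure = ["yes", "yes", "yes", "yes"]     # node containing a single class

gini_balanced = gini_impurity(balanced)
entropy_balanced = entropy_impurity(balanced)
gini_pure = gini_impurity(pure)
```

In scikit-learn these correspond to criterion="gini" and criterion="entropy" on DecisionTreeClassifier.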

How to test a machine learning model?

I want to develop a framework (for QA testing purposes) that validates a machine learning model. I have had a lot of discussions with my peers and read articles found through Google.
Most of the discussions and articles say that a machine learning model will evolve with the test data that we provide. Correct me if I'm wrong.
Is it possible to develop a framework that validates that a machine learning model gives accurate results?
A few ways to test the model, from the articles I read: split and multi-split techniques, and metamorphic testing.
Please also suggest any other approaches.
QA testing of ML-based software requires additional, and rather unconventional, tests because oftentimes their outputs for a given set of inputs are not defined, deterministic, or known a priori and they produce approximations rather than exact results.
QA may be designed to test against:
naive but predictable benchmark methods: the average method in forecasting, the class-frequency-based classifier in classification, etc.
sanity checks (the outputs being feasible/rational): e.g., is the predicted age positive?
preset objective acceptance levels: e.g., is its AUCROC > 0.5?
extreme/boundary cases: e.g., thunderstorm conditions for a weather forecast model.
bias-variance tradeoff: what is its performance on in-sample and out-of-sample data? K-Fold cross-validation is useful here.
the model itself: is the coefficient of variation of its performance measure (e.g., AUCROC) from n runs on the same data for same/random train and test partitioning within a reasonable bound?
Some of these tests need performance measures. Here is a comprehensive library of them.
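A few of the checks listed above can be sketched in plain Python. The function names, the sample numbers, and the 0.05 bound on the coefficient of variation are hypothetical choices for illustration; a real suite would pick thresholds per project.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def beats_mean_baseline(y_train, y_test, y_pred):
    """Benchmark check: the model must have a lower MAE than
    always predicting the mean of the training targets."""
    baseline_pred = [mean(y_train)] * len(y_test)
    return mean_absolute_error(y_test, y_pred) < mean_absolute_error(y_test, baseline_pred)


def coefficient_of_variation(scores):
    """Spread of a performance measure over repeated runs, relative to its mean."""
    return stdev(scores) / mean(scores)


# Made-up numbers standing in for real targets, predictions, and per-run AUC scores.
y_train = [30.0, 40.0, 50.0]
y_test = [35.0, 45.0]
y_pred = [36.0, 43.0]
auc_runs = [0.80, 0.82, 0.81]

passes_benchmark = beats_mean_baseline(y_train, y_test, y_pred)
passes_acceptance = min(auc_runs) > 0.5                      # preset acceptance level
stable_enough = coefficient_of_variation(auc_runs) < 0.05    # hypothetical bound
```

Each flag maps to one item in the list, so a failing flag points directly at the kind of problem found.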
I think what actually needs to be tested here is the data flow: raw input, manipulation, test output, and predictions. For example, if you have a simple linear model, you want to test the predictions produced by that model rather than its coefficients. The high-level steps might be summarized as below:
Raw Input: Does the raw input make sense? Before you start manipulating, you need to be sure the raw data values are within the expected limits. For example, if you normally see 5-10% NA rate in some data, having 95% NA rate in a new batch might be an indicator that something is wrong.
Train/Predict Ready Input: Whether you are training a new model or feeding new data into an already trained model for prediction, you probably want to be sure that the manipulated data makes sense, too. Some ML algorithms are sensitive to data anomalies. You don't want to predict a credit score in the thousands just because you have some anomalies in the input.
Model Success: By this time, you should have some idea about your model's success, so you can measure the model's performance on new test data. You can also check that the train and test scores are not significantly different (a large gap suggests overfitting). If you're retraining, you can compare with the previous training scores. Or you can set aside a test set and compare its score.
Predictions: Finally, you need to be sure your final output makes sense before delivering it to production/clients. For example, if you're forecasting revenue for a very small shop, the daily revenue predictions can't be millions of dollars or negative amounts.
Full disclosure, I wrote a small Python package for this. You can check it here or install it as below:
pip install mlqa
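Independent of any package, the data-flow checks described above can be sketched in a few lines. This is not the mlqa API; the function names and thresholds below are hypothetical and only illustrate the idea.

```python
def na_rate(values):
    """Share of missing entries (None) in a column."""
    return sum(v is None for v in values) / len(values)


def na_rate_ok(values, expected_max=0.10):
    """Raw-input check: flag a batch whose NA rate exceeds the usual ceiling."""
    return na_rate(values) <= expected_max


def predictions_in_range(preds, low, high):
    """Prediction check: every output must lie within plausible bounds."""
    return all(low <= p <= high for p in preds)


def overfit_gap_ok(train_score, test_score, max_gap=0.05):
    """Model check: train and test scores should not differ by much."""
    return (train_score - test_score) <= max_gap


# Made-up batches: one with the usual NA rate, one that is almost entirely missing.
usual_batch = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
bad_batch = [None] * 19 + [1.0]

usual_ok = na_rate_ok(usual_batch)
bad_ok = na_rate_ok(bad_batch)

# Daily revenue forecasts for a small shop: negative or million-dollar values are implausible.
plausible = predictions_in_range([120.0, 340.5], low=0.0, high=10_000.0)
implausible = predictions_in_range([-50.0, 2_000_000.0], low=0.0, high=10_000.0)

gap_ok = overfit_gap_ok(train_score=0.85, test_score=0.83)
gap_bad = overfit_gap_ok(train_score=0.99, test_score=0.70)
```

Checks like these can run as ordinary unit tests before each training or prediction job.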

How to identify relevant features in WEKA?

I would like to perform feature analysis in WEKA. I have a data set of 8 features and 65 instances.
I would like to use the feature selection and optimization functionality that is available for machine learning methods like SVM.
For example, in WEKA I would like to know how I can display which of the features contribute most to the classification result.
I think that WEKA provides a nice graphical user interface and allows a very detailed analysis of the influence of single features, but I don't know how to use it. Any help?
You have two options:
You can perform attribute selection using filters. For instance, you can use the AttributeSelection tab (or filter) with the search method Ranker and the attribute evaluation metric InfoGainAttributeEval. This way you get a list of the features ranked from most to least predictive according to their Information Gain scores. I have done this many times with good results. Sometimes it even helps to increase the accuracy of SVMs, which are known not to need (too much) feature selection. You can try other search methods in order to find subgroups of coupled predictors, and other metrics.
You can just look at the coefficients in the SVM output. For instance, in linear SVMs, the classifier is a linear function like a1.f1 + a2.f2 + ... + an.fn + fn+1 > 0, where the ai are the attribute values for an instance and the fi are the "weights" obtained by the SVM training algorithm. Consequently, weights with values close to 0 represent attributes that do not count much and are thus bad predictors; extreme weights (either positive or negative) represent good predictors, provided the attributes are on comparable scales.
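To show what the information-gain ranking in the first option computes, here is a small plain-Python sketch for nominal attributes. It is not WEKA code; the function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for one nominal feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: information_gain(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: "signal" determines the class exactly, "noise" says nothing about it.
labels = ["yes", "yes", "no", "no"]
features = {
    "noise": ["a", "b", "a", "b"],
    "signal": ["s", "s", "r", "r"],
}
ranking = rank_features(features, labels)
```

The most informative attribute comes first in `ranking`, which mirrors the ordered list that a ranker-style search produces.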
Additionally, you can check the visualization options available for a particular classifier (e.g. J48 is a decision tree, and the attribute tested at the root is the best predictor). You can check the AttributeSelection tab visualization options as well.

Bayesian Networks with multiple layers

So I'm trying to solve a problem with a Bayesian network. I know the conditional probabilities of some event, say that it will rain. Suppose that I measure (boolean) values from each of four sensors (A1 - A4). I know the probability of rain, and I know the probability of rain given the measurements on each of the sensors.
Now I add in a new twist. A4 is no longer available, but B1 and B2 are (they are also boolean sensors). I know the conditional probabilities of both B1 and B2 given the measurement of A4. How do I incorporate those probabilities into my Bayesian network to replace the lost data from A4?
Your problem is a very good fit for Multi-Entity Bayesian Networks (MEBN). This is an extension of standard BNs using First Order Logic (FOL). It basically allows nodes to be added and/or removed based on the specific situation at hand. You define a template for creating a BN on the fly, based on the knowledge currently available.
There are several papers on it available on the Web. A classic reference to this work is "Multi-Entity Bayesian Networks Without Multi-Tears".
We have implemented MEBN inside UnBBayes. You can get a copy of it by following the instructions at http://sourceforge.net/p/unbbayes/discussion/156015/thread/cb2e0887/. An example can be seen in the paper "Probabilistic Ontology and Knowledge Fusion for Procurement Fraud Detection in Brazil", available at http://link.springer.com/chapter/10.1007/978-3-642-35975-0_2.
If you are interested in it, I can give you more pointers later on.
Cheers,
Rommel
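As a simpler alternative to MEBN (a different technique from the one recommended in the answer above), one could keep A4 in the network as an unobserved node and attach B1 and B2 as its children. The sketch below computes the posterior of A4 from B1 and B2 with Bayes' rule, assuming B1 and B2 are conditionally independent given A4. The function name and all probabilities are made-up placeholders.

```python
def posterior_a4(prior_a4, p_b1_given_a4, p_b2_given_a4, b1, b2):
    """P(A4=True | B1=b1, B2=b2), assuming B1 and B2 are independent given A4.

    p_b1_given_a4 and p_b2_given_a4 map the value of A4 (True/False)
    to the probability that the corresponding sensor reads True.
    """
    def likelihood(a4):
        p1 = p_b1_given_a4[a4] if b1 else 1.0 - p_b1_given_a4[a4]
        p2 = p_b2_given_a4[a4] if b2 else 1.0 - p_b2_given_a4[a4]
        return p1 * p2

    joint_true = prior_a4 * likelihood(True)
    joint_false = (1.0 - prior_a4) * likelihood(False)
    return joint_true / (joint_true + joint_false)


# Hypothetical numbers, for illustration only.
p_a4 = posterior_a4(
    prior_a4=0.3,
    p_b1_given_a4={True: 0.9, False: 0.2},
    p_b2_given_a4={True: 0.8, False: 0.1},
    b1=True,
    b2=True,
)
```

With these placeholder values the posterior is 0.216 / 0.230, roughly 0.94. Any standard BN inference engine performs this marginalization over A4 automatically once B1 and B2 are added as children of A4.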