What is the difference between a Decision Tree and a Bayesian Network?

If I understand it correctly, both use Bayes' theorem to build an acyclic graph and compute probabilities from functions applied at every node.
What is the difference?

One simple and fundamental difference is
Acyclic Graph != Tree
For example, a->b<-c is not a tree (it has two roots), but it is an acyclic graph.
I am not well versed in decision trees, but I am well versed in Bayesian Networks.
Here are some things you can do with Bayesian networks that I am not sure you can do with a decision tree. Researching how to do these things with a decision tree may reveal interesting differences (a sketch of all three follows the list).
Compute the joint probability table between the variables
Determine if two variables are conditionally independent
Given some evidence, determine the distribution of the non-evidence variables given the evidence
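For concreteness, here is a minimal sketch of those three operations using the pgmpy library (my choice; the question does not mention it), built on the a->b<-c graph from above. All probability numbers are invented purely for illustration.

```python
# Sketch only: pgmpy is an assumed library choice, and the network
# a -> b <- c plus every probability below are invented for illustration.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("a", "b"), ("c", "b")])
cpd_a = TabularCPD("a", 2, [[0.7], [0.3]])
cpd_c = TabularCPD("c", 2, [[0.6], [0.4]])
cpd_b = TabularCPD("b", 2,
                   [[0.9, 0.5, 0.4, 0.1],   # P(b=0 | a, c)
                    [0.1, 0.5, 0.6, 0.9]],  # P(b=1 | a, c)
                   evidence=["a", "c"], evidence_card=[2, 2])
model.add_cpds(cpd_a, cpd_c, cpd_b)
assert model.check_model()

infer = VariableElimination(model)

# 1. Joint probability table over the variables
print(infer.query(variables=["a", "b", "c"]))

# 2. Conditional independencies implied by the graph structure
print(model.get_independencies())

# 3. Distribution of a non-evidence variable given some evidence
print(infer.query(variables=["a"], evidence={"b": 1}))
```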

What is the difference between optimization algorithms and Ensembling methods?

I was going through ensembling methods and was wondering: what is the difference between optimization techniques like gradient descent and ensembling techniques like bagging, boosting, etc.?
Optimization like gradient descent is a single-model approach. An ensemble, per Wikipedia, is multiple models whose constituents are weighted for the overall prediction. Boosting (per Wikipedia, https://en.wikipedia.org/wiki/Ensemble_learning) is retraining with a focus on the examples a model missed (its errors).
To me this is like monocular image recognition vs. binocular image recognition, with the two images forming an ensemble. Further scrutiny, paying extra attention to classification errors, is boosting; that is, retraining on some of the errors. Perhaps the error-condition data were represented too infrequently to support good classifications (thinking black swan here). In vehicles, this could be like combining infrared, thermal, radar and lidar sensor results for an overall classification. The link above has really good explanations of each of your areas of concern.
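To make the contrast concrete, here is a minimal sketch using scikit-learn (an assumed choice of library and toy data, not something from the question): a single model whose parameters are fitted by stochastic gradient descent, next to bagging and boosting ensembles that combine many models.

```python
# Sketch only: scikit-learn and the toy dataset are assumptions, not part
# of the question.  One model fit by gradient descent vs. two ensembles.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Optimization: a single model whose weights are found by stochastic gradient descent.
single = SGDClassifier(random_state=0).fit(X_tr, y_tr)

# Ensembling: many models whose predictions are combined.
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
boosted = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)  # each new tree focuses on prior errors

for name, m in [("single (SGD)", single), ("bagging", bagged), ("boosting", boosted)]:
    print(name, m.score(X_te, y_te))
```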

Function inverse tensorflow

Is there a way to find the inverse of a neural-network representation of a function in TensorFlow v1? I require this to find the optimal function in an optimization problem that I am solving.
To be precise, the optimal function is found by minimizing the error, computed as the L2 norm of the difference between the approximated optimal function C* (coded as a neural network object) and the inverse of a value function V* (coded as another neural network object).
My problem is that I do not know how to write the inverse of V* in TensorFlow, as I cannot find something like tf.inverse().
Any help is much appreciated. Thanks.
Unless I am misunderstanding the situation, I believe that it is impossible to do this in a generalized way. Many functions do not have a perfect inverse. For a simple example, imagine a square(x) function that computes x². You might think that the inverse is sqrt(y), but in reality the "correct" result could be either sqrt(y) or -sqrt(y), with no way of telling which is correct.
Similarly, with most neural networks I imagine it would be impossible to find the "true" mathematical inverse. There are architectures that attempt to train a neural net and its inverse simultaneously (autoencoders and BiGAN/ALI come to mind), and for some nets it might be possible to train an inverse empirically, but these can have extremely varying levels of accuracy that depend heavily on many factors.
Depending on how much control you have over V*, you might be able to design it in such a way that it is mathematically invertible (and then you would have to manually code the inverse), or you might be able to make it a simpler model that is not based on a neural net. However, if V* is an arbitrary preexisting net, then you're probably out of luck.
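If an empirical inverse is acceptable, one sketch of the idea in TF v1 style is below: sample inputs, evaluate V* on them, and fit a second network that maps V*(x) back to x. The stand-in V*, the scalar domain, and all hyperparameters are assumptions for illustration; with a non-invertible V* this will at best recover one branch of the inverse.

```python
# Sketch only: the stand-in V*, scalar domain, and hyperparameters are
# assumptions.  Idea: fit a second net on (V*(x), x) pairs as an
# approximate, empirical inverse of V*.
import numpy as np
import tensorflow.compat.v1 as tf  # TF v1-style API
tf.disable_v2_behavior()

def v_star_numpy(x):
    # Stand-in for the pre-trained value function; in practice you would
    # evaluate the real V* network here.
    return x ** 3 + x

x_train = np.random.uniform(-2.0, 2.0, size=(1024, 1)).astype(np.float32)
y_train = v_star_numpy(x_train)                # pairs (x, V*(x))

y_ph = tf.placeholder(tf.float32, [None, 1])   # input to the inverse net: V*(x)
x_ph = tf.placeholder(tf.float32, [None, 1])   # target: the original x

hidden = tf.layers.dense(y_ph, 64, activation=tf.nn.relu)
x_hat = tf.layers.dense(hidden, 1)             # approximate inverse output

loss = tf.reduce_mean(tf.square(x_hat - x_ph)) # L2 fit of the inverse
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2000):
        sess.run(train_op, {y_ph: y_train, x_ph: x_train})
    print(sess.run(loss, {y_ph: y_train, x_ph: x_train}))
```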
Further reading:
SO: local inverse of a neural network
AI.SE: Can we get the inverse of the function that a neural network represents?

Why does normalization not need parameters, but batch normalization does?

Normalization just normalizes the input layer, while batch normalization is applied at each layer.
We do not learn any parameters for plain normalization, so why do we need to learn parameters for batch normalization?
This has been answered in detail in https://stats.stackexchange.com/a/310761
Deep Learning Book, Section 8.7.1:
Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. To maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH+β rather than simply the normalized H. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless — why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β?
The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
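To see what γ and β do mechanically, here is a minimal NumPy sketch of the batch-norm forward pass (shapes and values are invented): the batch is normalized to zero mean and unit variance, and the learned γ and β then rescale and shift it.

```python
import numpy as np

def batch_norm_forward(H, gamma, beta, eps=1e-5):
    # Normalize each unit over the batch dimension, then rescale and shift
    # with the learned parameters gamma and beta (the γH+β above).
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * H_hat + beta

H = np.random.randn(32, 4)       # an invented batch of 32 activations, 4 units
gamma = np.ones(4)               # learned scale, initialised to 1
beta = np.zeros(4)               # learned shift, initialised to 0
out = batch_norm_forward(H, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))      # ~0 mean, ~1 std at initialisation
```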

Standard parameter representation in neural networks

Many times I have seen, in neural-network forward propagation, example vectors multiplied from the left (vector-matrix), and sometimes from the right (matrix-vector). Notation, some TensorFlow tutorials, and the datasets I have found seem to prefer the former over the latter, contrary to the way linear algebra tends to be taught (the matrix-vector way).
Moreover, the two conventions imply transposed parameter layouts: either the problem variables are enumerated along dimension 0, or the neurons are.
This confuses me and makes me wonder whether there is really a standard here or it has just been coincidence. If there is a standard, I would like to know whether it follows from some deeper reason. I would really appreciate an answer to this question.
(By the way, I know that you will normally use matrices of examples instead of vectors [or more complex things in conv nets, etc.] because of minibatches, but the point still holds.)
Not sure if this answer is what you are looking for, but in the context of TensorFlow the standard is to use a dense layer (https://www.tensorflow.org/api_docs/python/tf/layers/dense), which is a higher-level abstraction that wraps up the affine-transformation logic you are referring to.
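For illustration, here is a small NumPy sketch (shapes invented) of the row convention a dense layer follows internally, with examples enumerated along dimension 0, next to the column convention from linear-algebra texts; the two are just transposes of each other.

```python
import numpy as np

# Invented shapes: examples along dimension 0 (the dense-layer convention),
# so the layer computes y = x @ W + b.
batch, n_in, n_units = 32, 10, 5
x = np.random.randn(batch, n_in)      # each row is one example
W = np.random.randn(n_in, n_units)    # kernel: input features indexed in dim 0
b = np.zeros(n_units)

y = x @ W + b                         # vector-matrix (row) convention
print(y.shape)                        # (32, 5)

# The textbook matrix-vector convention is just the transpose of the above.
y_col = W.T @ x.T + b[:, None]
print(np.allclose(y_col.T, y))        # True: same numbers, transposed layout
```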

Bayesian Approach in Ensemble modeling

Is ensemble modeling a Bayesian approach? I am thinking of it like this: our final model (the posterior) is based on other primary models (the prior). Can you give your opinions?
The question is probably better suited on CrossValidated, but I'll give you a hint.
The way you describe it, the Bayesian approach does not fit directly, because Bayes' theorem states that the posterior equals the prior times the likelihood (normalized), whereas the final ensemble model is a weighted sum of individual models. It is not clear what you would consider the likelihood to make an ensemble Bayesian.
If you are looking for a probabilistic interpretation, here is a better one: the ensemble model represents a joint distribution over a model-selector variable (what is the probability that a particular model is good for a given input) and the per-model predictive distribution (the accuracy of a particular model). The better you pick both of these distributions (proper models and their weights), the better the ensemble.
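As a toy illustration of that interpretation (all numbers invented), the ensemble prediction is a mixture: weight each model's predictive distribution by the model-selector probability and sum.

```python
import numpy as np

# Invented numbers: p(y|x) = sum_k p(model k | x) * p(y | x, model k).
model_probs = np.array([0.5, 0.3, 0.2])      # model-selector distribution (the weights)
per_model_pred = np.array([                  # each model's p(y=1 | x) for 4 inputs
    [0.9, 0.2, 0.6, 0.7],
    [0.8, 0.1, 0.5, 0.9],
    [0.7, 0.3, 0.4, 0.6],
])

ensemble_pred = model_probs @ per_model_pred # weighted sum over the models
print(ensemble_pred)
```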