Evolutionary algorithm: What is the purpose of hidden/intermediate nodes?

I saw a video online showing a "neural network" with three inputs and three outputs. Although the inputs are not changing, I believe there is enough similarity between this network and those used by other evolutionary algorithms to make the question valid.
My question is: since all three input nodes shown in the video can already "exert influence" on the output nodes through controlled weights, why are the four intermediate nodes necessary? Why not connect the input nodes directly to the outputs?

An artificial neural network consisting only of inputs and outputs is a (single-layer) perceptron. The realization that these networks cannot solve many problems set back the use of artificial neural networks for over a decade!
For simplicity, imagine only one output neuron (many outputs can be treated as many similar problems in parallel). Furthermore, let's consider for the moment only one input. Each neuron applies an activation function, which determines its activity (output) as a function of the input it receives. For the activation functions used in practice*, more input means higher output (or the same over some ranges, but let's ignore that). Chaining two such neurons still yields "the more input, the higher the final output".
With one output neuron you interpret the result as "if the output is over a threshold, then A, otherwise B" (where "A" and "B" can mean different things). Because both of our neurons produce more signal the more input they receive, the network can only answer simple linear questions of the form "if the input signal is over a threshold, then A, otherwise B".
Using two inputs is very similar: we combine the outputs of two input neurons. Now we are in the situation "if the inputs to input neurons 1 and 2 are, together, high enough that the final output is over a threshold, then A, otherwise B". Graphically, this means we can decide between A and B by drawing a line (curvature allowed) on the input 1 / input 2 plane.
But there are problems that cannot be solved this way! Consider the XOR problem, where the goal is to produce:

input 1 | input 2 | answer
0 | 0 | B
0 | 1 | A
1 | 0 | A
1 | 1 | B
As you can see, it is impossible to draw a single line that puts all the A's on one side and all the B's on the other, and such lines represent all the possible one-layer perceptrons! We say that the XOR problem is not linearly separable (and this is why XOR is a traditional test for neural networks).
Introducing at least one hidden layer allows us to solve this problem. In practice, it works like combining the results of two one-layer perceptrons.
Adding more neurons to the hidden layer means being able to solve more and more complex problems; in fact, a single hidden layer with enough neurons can approximate any function f(A,B).
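To make this concrete, here is a minimal sketch in Python (not from the video, just the classic textbook construction) of a hidden layer solving XOR with hand-picked weights: one hidden unit computes OR, another computes AND, and the output fires when OR is true but AND is not.

import numpy as np

def step(x):
    # Heaviside step activation: 1 if the input is positive, else 0
    return (np.asarray(x) > 0).astype(int)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: AND(x1, x2)
    return step(h1 - h2 - 0.5)  # output: OR but not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0

No single-layer perceptron can produce this table, but one hidden layer with two units does it comfortably.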
You may know that other networks use many more layers (see deep learning), but in that case the motivation is not a theoretical limitation; rather, it is the search for networks that perform better.
*Using weird hand-crafted activation functions will not make things better. You may be able to solve one specific problem, but still not all of them, and you would need to know how to design such an activation function.

Related

Multiple-input multiple-output CNN with custom loss function

I have a set of m x n 2D input arrays, namely A, B, and C, and I have to predict two m x n 2D output arrays, namely d and e, for which I do have the expected values. You can think of the inputs/outputs as grey-scale images if you like.
Because the spatial information is relevant (these are actually 2D physical domains), I want to use a convolutional neural network to predict d and e. My design (not tested yet) is as follows.
Because I have multiple inputs, I guess I should use multiple columns (or branches) to extract different features from each of the inputs (they look fairly different). Each of these columns follows the encoding-decoding architecture used in segmentation (see SegNet): a Conv2D block is convolution + batch normalisation + ReLU, and a Deconv2D block is deconvolution + batch normalisation + ReLU.
Then I can merge the outputs of the columns by concatenating, averaging, or taking the maximum, for example. To recover the original m x n shape for each of the outputs, I have seen that I could use a convolution with a 1 x 1 kernel.
I want to predict the two outputs from that single merged layer. Is that okay from the network-structure point of view? Finally, my loss function depends on the outputs themselves compared to the targets, plus another relation I want to impose.
I would like some expert opinion on this, since this is my first CNN design and I am not sure whether it makes sense as it is now and/or whether there are better approaches (or network architectures) for this problem.
I posted this originally on Data Science but did not get much feedback. I am now posting it here since there is a bigger community on these topics, and I would also be very grateful for implementation tips besides architectural ones. Thanks.
I think your design makes sense in general:
since A, B, and C are fairly different, you give each input its own transform sub-network and then fuse them together; that fusion is your intermediate representation.
from the intermediate representation, you apply additional CNN layers to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should depend on whether the values in A, B, and C have different meanings. For example, if A is a grey-scale image, B is a depth map, and C is also a grey-scale image captured by a different camera, then A and B are better processed in your suggested way, but A and C can be concatenated into one single input before being fed to the network.
D and E are two outputs of the network and will be trained in a multi-task manner. Of course they should share some latent features, and one should split at those features to apply a downstream, non-shared-weight branch for each output. However, where to split is usually the tricky part.
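For what it's worth, here is a minimal sketch of this branch/merge/split layout in tf.keras; the spatial size, filter counts, and the choice of concatenation for the merge are placeholder assumptions, not the poster's actual design.

import tensorflow as tf
from tensorflow.keras import layers, Model

m, n = 64, 64  # hypothetical spatial size

def encoder_branch(inp):
    # One column per input: Conv2D + batch norm + ReLU, as in the SegNet-style blocks
    x = layers.Conv2D(16, 3, padding="same")(inp)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = [layers.Input(shape=(m, n, 1), name=name) for name in ("A", "B", "C")]
branches = [encoder_branch(inp) for inp in inputs]

# Fuse the branches; concatenation keeps all features (averaging or max also work)
merged = layers.Concatenate()(branches)
shared = layers.Conv2D(32, 3, padding="same", activation="relu")(merged)

# Two non-shared heads, one per output; a 1 x 1 convolution maps back to m x n x 1
d = layers.Conv2D(1, 1, name="d")(shared)
e = layers.Conv2D(1, 1, name="e")(shared)

model = Model(inputs=inputs, outputs=[d, e])
model.compile(optimizer="adam", loss="mse")  # substitute your custom loss here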
This is really a broad question, asking for answers that rely mostly on opinions. Here are my two cents, though, which you might find interesting as they do not go along with the previous answers here and on Data Science.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always because the network is some sort of Siamese network and the columns share the same weights, or at least because the columns all need to produce a similar code. That is not your case here, so I simply would not bother.
Second, you are blessed with a problem that has a dense output and no need to learn a code. This should direct you straight to U-nets, which outperform any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation, but they shine at any dense-output problem.
In short, just stack your inputs together and use a U-net.
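A minimal one-level sketch of that suggestion in tf.keras (real U-nets use four or five levels; the shapes and filter counts are placeholders):

import tensorflow as tf
from tensorflow.keras import layers, Model

m, n = 64, 64
inp = layers.Input(shape=(m, n, 3))           # A, B, C stacked as 3 channels

c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPooling2D()(c1)                # downsample

c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)

u1 = layers.UpSampling2D()(c2)                # upsample back to m x n
u1 = layers.Concatenate()([u1, c1])           # skip connection: the U-net trick
c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

out = layers.Conv2D(2, 1)(c3)                 # two channels: d and e
model = Model(inp, out)

The concatenation of c1 into the upsampling path is the skip connection that lets the network keep full-resolution detail instead of squeezing everything through a bottleneck.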

Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in TensorFlow. In the setup I am interested in, the modeled agent plays a resource game: it has to choose 1 out of n resources, relying only on the label that a classifier gives to each resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but the payoffs depend on the second. There is a function taking the first to the second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier gives labels to n resources.
The agent then gets the payoff of the resource whose label is highest in some predetermined ranking (say, A > B > C > D), chosen randomly in case of a draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources, i.e. (Payoff_max - Payoff) / Payoff_max.
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in TensorFlow? If I am tackling the problem in the wrong way, feel free to say so, too.
I don't have much knowledge of the ML aspects of this, but from a programming point of view I can see two ways of doing it. One is to copy your model n times, with all the copies sharing the same variables. The outputs of all these copies would go into some function that determines the highest label. As long as that function is differentiable, the variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I have heard about some fancy tricks one can do using partial_run.
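Here is a minimal sketch of the batched form of this idea (TF1-style API; the two-layer classifier, the shapes, and the use of softmax as a differentiable stand-in for "pick the highest label" are all assumptions beyond the question):

import tensorflow as tf

n = 5  # hypothetical number of resources per run

# Feed all n resources together as one batch: the n "copies" then share
# weights trivially, because there is only one set of variables.
first = tf.placeholder(tf.float32, shape=(n, 1))   # observable first reals
payoff = tf.placeholder(tf.float32, shape=(n,))    # payoffs (second reals)

hidden = tf.layers.dense(first, 16, activation=tf.nn.relu)
scores = tf.squeeze(tf.layers.dense(hidden, 1), axis=-1)  # one score each

# argmax is not differentiable, so use softmax as a soft "pick the best"
pick = tf.nn.softmax(scores)
expected_payoff = tf.reduce_sum(pick * payoff)
loss = (tf.reduce_max(payoff) - expected_payoff) / tf.reduce_max(payoff)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)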
Another way is to use tf.while_loop. It is pretty clever: it stores the activations from each run of the loop and can do backprop through them. The only tricky part should be accumulating the inference results before feeding them to your loss. Take a look at TensorArray for this. This question may be helpful: Using TensorArrays in the context of a while_loop to accumulate values
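A sketch of that route (again TF1-style; the classify helper and shapes are hypothetical): the TensorArray accumulates one score per iteration, and reuse=tf.AUTO_REUSE makes every iteration share the same weights.

import tensorflow as tf

n = 5
resources = tf.placeholder(tf.float32, shape=(n, 1))

def classify(x):
    # The same variables are reused on every call thanks to AUTO_REUSE
    with tf.variable_scope("clf", reuse=tf.AUTO_REUSE):
        return tf.squeeze(tf.layers.dense(x, 1))

def body(i, ta):
    score = classify(resources[i:i + 1])   # run inference on resource i
    return i + 1, ta.write(i, score)       # append the score

_, scores_ta = tf.while_loop(
    cond=lambda i, ta: i < n,
    body=body,
    loop_vars=(tf.constant(0), tf.TensorArray(tf.float32, size=n)))

scores = scores_ta.stack()  # shape (n,), ready to feed into the loss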

Does TensorFlow's RNN fully implement the Elman network?

Q: Is TensorFlow's RNN implemented to output the Elman network's hidden state?
cells = tf.contrib.rnn.BasicRNNCell(4)
outputs, state = tf.nn.dynamic_rnn(cell=cells, etc...)
I'm quite new to TF's RNNs and curious about the meaning of outputs and state.
I'm following Stanford's TensorFlow tutorial, but there seems to be no detailed explanation, so I'm asking here.
After testing, I think state is the hidden state after the whole sequence calculation, and outputs is the array of hidden states after each time step.
So I want to make it clear: if outputs and state are just hidden-state vectors, then to fully implement the Elman network I have to add the V matrix shown in the picture and do the matrix multiplication again. Am I correct?
I believe you are asking what the state and the outputs of the RNN represent.
From what I understand, state is the hidden state after the whole sequence calculation, so your understanding is in the right direction.
outputs may vary depending on how you implement your network model, but in general it is the array of the cell's output at every time step; for a basic RNN cell that output is simply the hidden state at each step, before any extra output layer is applied.
From Colah's blog ( http://colah.github.io/posts/2015-08-Understanding-LSTMs/ ):
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
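To answer the original question directly: yes, for BasicRNNCell both outputs and state are hidden-state vectors, so you apply the V multiplication yourself. A minimal sketch (hypothetical sizes; tf.layers.dense plays the role of V):

import tensorflow as tf

batch, steps, in_dim, hidden, out_dim = 2, 7, 3, 4, 5
inputs = tf.placeholder(tf.float32, (batch, steps, in_dim))

cell = tf.contrib.rnn.BasicRNNCell(hidden)
outputs, state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs, dtype=tf.float32)
# outputs: (batch, steps, hidden) = h_1 .. h_T;  state: (batch, hidden) = h_T

# The missing "V" of the Elman network: a dense layer over every time step
logits = tf.layers.dense(outputs, out_dim)   # V h_t + c, applied per step
y = tf.nn.softmax(logits)                    # y_t = softmax(V h_t + c)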
Hope this helps.
Thank you

Neural network diverging on the same item in all tests

I am having trouble finding the cause of diverging values in all tests of my multilayer neural network for recognizing handwritten patterns.
Here is a photo of the output:
Each column represents a specific letter. The result should be that the first letter dominates in the first row, the second letter in the second row, and so on.
In every run of a few tests, one letter dominates in all values. What could be the cause of this?
The answer can depend to some degree on the kind of neural network model you use (perceptron, backprop, recurrent neural network, LSTM), but what is easy to notice is the data that you feed into your NN. The three inputs that you mentioned are very close to each other. The first column's values differ only minutely; they are almost identical: 0.31659 and 0.31660. The second column poses the same challenge for the NN: 0.3993. And the third column is also nearly constant: 0.2657. It is not easy for a NN to build some kind of manifold that separates those values. You should consider somehow increasing the contrast between those three columns, because they look very similar to each other. The NN treats such differences as insignificant, and you will need many iterations before it can build a hyperplane that correctly classifies your letters.
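One simple way to increase that contrast is to standardize each input column to zero mean and unit variance before feeding it to the network. A sketch with the values quoted above (the two-row matrix is only an illustration):

import numpy as np

X = np.array([[0.31659, 0.3993, 0.2657],
              [0.31660, 0.3993, 0.2657]])  # the near-identical inputs above

# Per-column standardization; the epsilon guards against constant columns
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

After this, the tiny 0.00001 gap in the first column becomes a difference of about two standard deviations, which the network can actually see.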

How do I train an HMM with Baum-Welch and multiple observation sequences?

I am having some problems understanding how exactly the Baum-Welch algorithm works. I read that it adjusts the parameters of the HMM (the transition and emission probabilities) in order to maximize the probability that my observation sequence is generated by the given model.
However, what happens if I have multiple observation sequences? I want to train my HMM on a huge set of observations (and I think this is what is usually done).
ghmm, for example, can take either a single observation sequence or a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of the GMMs (weights, means, and covariances) are re-estimated in the Baum-Welch algorithm using ratios of occupation counts; with γ_t(j,k) denoting the probability of being in state j at time t with the k-th mixture component accounting for the observation o_t, they are:

c_jk = Σ_t γ_t(j,k) / Σ_t Σ_k γ_t(j,k)
μ_jk = Σ_t γ_t(j,k) · o_t / Σ_t γ_t(j,k)
Σ_jk = Σ_t γ_t(j,k) · (o_t − μ_jk)(o_t − μ_jk)ᵀ / Σ_t γ_t(j,k)
These are for the single-observation-sequence case. In the multiple-sequence case, the numerators and denominators are simply summed over all observation sequences and then divided to get the parameters (this can be done since they represent occupation counts; see p. 273 of the paper).
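A sketch of that pooling for the transition matrix (the GMM weights, means, and covariances pool the same way); here gammas[k] is assumed to be the (T_k, N) state-occupation array and xis[k] the (T_k-1, N, N) transition-occupation array already computed by forward-backward on sequence k:

import numpy as np

def reestimate_transitions(gammas, xis):
    N = gammas[0].shape[1]
    num = np.zeros((N, N))   # expected i -> j transition counts, all sequences
    den = np.zeros((N, 1))   # expected visits to state i (last step excluded)
    for gamma, xi in zip(gammas, xis):
        num += xi.sum(axis=0)
        den += gamma[:-1].sum(axis=0, keepdims=True).T
    return num / den         # a_ij pooled over all observation sequences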
So it is not required to know all observation sequences during a single invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting the training data across multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums the numerators and denominators, and divides them to get the result. See p. 129 of the HTK book v3.4.