Optimization Algorithm vs Regression Models - optimization

Currently, I'm dealing with forecasting problems. I have a reference that used a linear function to represent the relationship between the input and output data.
y = p0 + p1*x1 + p2*x2
Both x1 and x2 are known inputs; y is the output; p0, p1, and p2 are the coefficients. Then, he used all the training data and the Least Square Estimation (LSE) method to find the optimal coefficients (p0, p1, p2) to build the model.
My question is: if he already used the LSE algorithm, can I try to improve his method by using an optimization algorithm (PSO or GA, for example) to find better coefficient values?

You answered this yourself:
"Then, he used all the training data and the Least Square Estimation (LSE) method to find the optimal coefficients (p0, p1, p2) to build the model."
Because a linear model is quite easy to optimize (the least-squares objective is convex), the LSE method obtains a global optimum (ignoring subtle rounding errors and early-stopping/tolerance errors). Without changing the model, there is nothing to gain from other coefficients, regardless of whether you use meta-heuristics like GA.
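To illustrate the point, here is a minimal sketch on made-up data (the data, the noise level and the random-search stand-in for a meta-heuristic are all assumptions, not taken from the reference). It fits the coefficients with np.linalg.lstsq and then checks that none of many randomly perturbed coefficient vectors reaches a lower sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: y = 1 + 2*x1 - 3*x2 + noise (illustrative only)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.5 * rng.normal(size=n)

# Design matrix with a column of ones for the intercept p0
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimate of (p0, p1, p2)
p_lse, *_ = np.linalg.lstsq(X, y, rcond=None)


def sse(p):
    """Sum of squared errors of the linear model with coefficients p."""
    r = y - X @ p
    return float(r @ r)


best_sse = sse(p_lse)

# Crude stand-in for a meta-heuristic: many random candidates around p_lse
candidates = p_lse + rng.normal(scale=0.1, size=(10_000, 3))
candidate_sse = np.array([sse(p) for p in candidates])

print("LSE coefficients:", p_lse)
print("LSE SSE:", best_sse)
print("best random candidate SSE:", candidate_sse.min())
```

On the training objective, no candidate should beat the least-squares solution; a PSO or GA run searches the same convex surface and can at best arrive at the same point.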
So you may modify the model, or add additional data (feature-engineering: e.g. product of two variables; kernel-methods).
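As a hedged sketch of the feature-engineering idea, the following adds the product x1*x2 as an extra column and refits with least squares. The data is synthetic and deliberately contains an interaction effect, so it only shows the mechanics; whether a product term helps on the real data is an open question.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with an interaction between x1 and x2 (assumption for the demo)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + 3.0 * x1 * x2 + 0.3 * rng.normal(size=n)


def fit_mse(X, y):
    """Least-squares fit; returns the coefficients and the training MSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(np.mean(resid ** 2))


ones = np.ones(n)
X_plain = np.column_stack([ones, x1, x2])           # original model
X_inter = np.column_stack([ones, x1, x2, x1 * x2])  # model with product feature

coef_plain, mse_plain = fit_mse(X_plain, y)
coef_inter, mse_inter = fit_mse(X_inter, y)

print("MSE without product term:", mse_plain)
print("MSE with product term:   ", mse_inter)
```

The comparison here uses training error only; with a small dataset, the two models should be compared on held-out data or by cross-validation.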
One thing to try: support vector machines. These are also convex and can be trained efficiently (as long as there is not too much data). They are also designed to work well with kernels. An additional advantage (compared with more complex, e.g. non-convex, models): they generalize quite well, which seems to be important here because you don't have much data (it sounds like a very small dataset).
See also @ayhan's comment!

Related

Different optimization behavior using np.random.normal instead of tf.random_normal

I'm looking into the code from https://github.com/AshishBora/csgm and am seeing some strange behavior when using np.random.normal instead of tf.random_normal to initialize a tf.Variable. More concretely:
Instead of
z = tf.Variable(tf.random_normal((batch_size, hparams.n_z)), name='z')
I have
# in mnist_vae/src/model_def.py, line 74
z = tf.Variable(np.random.normal(size=(batch_size, hparams.n_z)).astype('float32'), name='z')
z is the variable that is optimized via the Adam optimizer with respect to an objective.
For a bit of background: there is a pre-trained neural network G whose input z is drawn from a standard normal distribution using tf.random_normal. For a given z*, one wants to solve ẑ = argmin_z ||AG(z)-AG(z*)|| and check the reconstruction error ||G(ẑ)-G(z*)||. The resulting minimal value c(z*)=||G(ẑ)-G(z*)|| is, for several different z*, quite stable around a value c1. Now, I wasn't sure whether the optimization (Adam optimizer) might use the information that z comes from a standard normal distribution. So I replaced tf.random_normal with np.random.normal in the hope that the optimizer can't use that information then. (See the code above.)
Unfortunately, the results are indeed different using np.random.normal: c(z*)=||G(ẑ)-G(z*)|| is for several different z* stable around a different value c2 (not c1). How can one explain this? Is it really that the optimizer uses the information of the normal distribution (e.g. as loglikelihood prior) in the optimization? My feeling says no, since it's only the initialization.
The code is given in https://github.com/AshishBora/csgm

Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in tensorflow. In the setup I am interested in, the modeled agent is playing a resource game: it has to choose 1 out of n resources, relying only on the label that a classifier gives to each resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but payoffs depend on the second. There is a function taking the first to the second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier gives labels to n resources.
The agent then gets the payoff of the resource corresponding to the highest label in some predetermined ranking (say, A > B > C > D), choosing randomly in case of a draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources. I.e., (Payoff_max - Payoff) / Payoff_max
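As a sketch of the loss described in the steps above, here is a plain-Python version for a single run. The function name round_loss and the example values are hypothetical, and the sketch assumes Payoff_max > 0. It only computes the loss value; the hard choice of one resource is not differentiable, so it is not something a gradient-based optimizer could train through as written.

```python
import random

RANKING = ["A", "B", "C", "D"]  # predetermined ranking, best label first


def round_loss(labels, payoffs, rng=random):
    """Loss for one run: (Payoff_max - Payoff) / Payoff_max.

    labels  -- label the classifier assigned to each of the n resources
    payoffs -- payoff of each resource (assumed positive)
    rng     -- object with a choice() method, used to break ties randomly
    """
    ranks = [RANKING.index(label) for label in labels]
    best_rank = min(ranks)
    # All resources that share the highest-ranked label
    tied = [i for i, rank in enumerate(ranks) if rank == best_rank]
    chosen = rng.choice(tied)
    payoff_max = max(payoffs)
    return (payoff_max - payoffs[chosen]) / payoff_max


# Resource 1 carries the best label ("A") but only the payoff 3.0 out of a
# possible 5.0, so the loss is (5.0 - 3.0) / 5.0.
example_loss = round_loss(["B", "A", "C"], [4.0, 3.0, 5.0])
print(example_loss)
```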
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in tensorflow? If I am tackling the problem in the wrong way feel free to say so, too.
I don't have much knowledge of the ML aspects of this, but from a programming point of view, I can see two ways of doing it. One is to copy your model n times. All the copies can share the same variables. The outputs of all of these copies would go into some function that determines the highest label. As long as this function is differentiable, variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I have heard about some fancy tricks one can do using partial_run.
Another way is to use tf.while_loop. It is pretty clever - it stores activations from each run of the loop and can do backprop through them. The only tricky part should be to accumulate the inference results before feeding them to your loss. Take a look at TensorArray for this. This question can be helpful: Using TensorArrays in the context of a while_loop to accumulate values

Splitting Training Data to train optimal number of n models

Let's assume we have a huge database providing us with the training data D and a dedicated, smaller test set T for a machine learning problem.
The data covers many aspects of a real-world problem and is thus very diverse in its structure.
When we now train an unspecified machine learning algorithm (Neural Network, SVM, Random Forest, ...) with D and finally test the created model against T, we obtain a certain performance measure P (confusion matrix, mse, ...).
The question: if I could achieve better performance by dividing the problem into smaller sub-problems, e.g. by clustering D into several distinct training sets D1, D2, D3, ..., how could I find the optimal clusters (number of clusters, centroids, ...)?
In a brute-force fashion, I am thinking about using a kNN Clustering with a random number of clusters C, which leads to the training data D1, D2, ..., Dc.
I would now train C different models and finally test them against the test sets T1, T2, ..., Tc, where the same kNN Clustering has been used to split T into the C test sets T1, ..., Tc.
The combination which gives me the best overall performance mean(P1,P2,...,Pc) would be the one I would like to choose.
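The brute-force procedure described above could be sketched roughly as follows. This is an illustration under several assumptions: k-means (a small hand-written Lloyd's iteration) is swapped in for the clustering step, a linear least-squares model stands in for the unspecified learner, MSE is the performance measure, and the data is synthetic.

```python
import numpy as np


def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns the centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mse(coef, X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - A @ coef) ** 2))


def score_for_k(X_tr, y_tr, X_te, y_te, k, rng):
    """Cluster D into k parts, train one model per part, return mean(P1..Pk)."""
    centroids = kmeans(X_tr, k, rng)
    tr_labels = assign(X_tr, centroids)
    te_labels = assign(X_te, centroids)  # split T with the same clustering
    scores = []
    for j in range(k):
        tr_mask, te_mask = tr_labels == j, te_labels == j
        if tr_mask.sum() == 0 or te_mask.sum() == 0:
            continue  # nothing to train on or to evaluate
        coef = fit_linear(X_tr[tr_mask], y_tr[tr_mask])
        scores.append(mse(coef, X_te[te_mask], y_te[te_mask]))
    return float(np.mean(scores))


# Synthetic stand-ins for D and T
rng = np.random.default_rng(0)
X_tr = rng.uniform(-3, 3, size=(600, 2))
X_te = rng.uniform(-3, 3, size=(200, 2))
y_tr = np.abs(X_tr[:, 0]) + 0.1 * rng.normal(size=600)
y_te = np.abs(X_te[:, 0]) + 0.1 * rng.normal(size=200)

results = {k: score_for_k(X_tr, y_tr, X_te, y_te, k, rng) for k in range(1, 6)}
best_k = min(results, key=results.get)
print(results, "best k:", best_k)
```

Choosing the number of clusters by its score on T means T is no longer an unbiased test set; a separate validation split would be the safer choice for this selection step.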
I was just wondering whether you know a more sophisticated way than brute-forcing this?
Many thanks in advance
Clustering is hard.
Much harder than classification, because you don't have labels to tell you whether you are doing okay or not well at all. It can't do magic; it requires you to carefully choose parameters and evaluate the result.
You cannot just dump your data into k-means and expect anything useful to come out. You'd first need to very carefully clean and preprocess your data, and then you might simply figure out that it actually is only one single large clump...
Furthermore, if clustering worked well and you trained classifiers on each cluster independently, then every classifier would miss crucial data. The result will likely perform really badly!
If you want to only train on parts of the data, use a random forest.
But it sounds like you are more interested in a hierarchical classification approach. That may work, if you have good hierarchy information. You'd first train a classifier on the category, then another within the category only to get the final class.

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledge there might be a different approach to answering my question, so I will explain the broader goal and give context in the hope that someone knows how to do what I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition, and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning, but those "frameworks" are very opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and having accepted the "signal" representation of my data, I have spent a considerable amount of time (about as much as is reasonable for a field that is not my own) climbing the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" number of times. The updates in each of the n periods are inherently recursive, as the next state depends on the prior one. It can be characterized as a difference equation as linked above. Additionally, this vector is theoretically indexed on a grid of points with uneven step size. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations on y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to in-place variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore, I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates in place in the array? Therein lies the question. Many of the update operations I have successfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above, the one involving computing curvature at each element using its two neighbouring elements, and then immediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'', as I would with a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out that something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between the mathematical notion of "convolution," which should be associative, and general spatial filtering kernels, which would require two sequential filtering operations or a cleverly merged kernel.
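For reference, the per-triplet weights mentioned above can be computed directly with array slicing. This is a minimal sketch of the standard three-point finite-difference approximation of y'' on a non-uniform grid, using the example y and t from above; it covers interior points only, and the function names are hypothetical. It approximates the second derivative, which is what the [1,-2,1] kernel gives on a uniform grid, not the geometric curvature.

```python
import numpy as np


def second_derivative_weights(t):
    """Three-point weights for y'' at the interior points of a non-uniform grid.

    Returns an array of shape (len(t) - 2, 3) holding the weights applied to
    (y[i-1], y[i], y[i+1]) for i = 1 .. len(t) - 2.
    """
    t = np.asarray(t, dtype=float)
    h1 = t[1:-1] - t[:-2]  # spacing to the left neighbour
    h2 = t[2:] - t[1:-1]   # spacing to the right neighbour
    w_left = 2.0 / (h1 * (h1 + h2))
    w_mid = -2.0 / (h1 * h2)
    w_right = 2.0 / (h2 * (h1 + h2))
    return np.column_stack([w_left, w_mid, w_right])


def second_derivative(y, t):
    """Approximate y'' at the interior grid points (vectorized, no loop)."""
    y = np.asarray(y, dtype=float)
    w = second_derivative_weights(t)
    return w[:, 0] * y[:-2] + w[:, 1] * y[1:-1] + w[:, 2] * y[2:]


y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
d2y = second_derivative(y, t)
print(d2y)
```

With equal spacing h the weights reduce to [1, -2, 1] / h**2, and the formula is exact for quadratics, which makes t**2 a convenient sanity check.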
In any case, this does not seem to actually address my concern, as it is not about 2D recursive filtering but rather about having a backwards-looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward-looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyway). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at the output signal y[n-1] to compute y[n] from the curvature at x[n], as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't use it to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
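The "works itself out like a cumsum" observation can be checked directly. In this small sketch the array x is random placeholder data, with rows as periods and columns as grid points. It shows that the one-step recursion y[n] = x[n] + y[n-1] is lfilter with b = [1] and a = [1, -1], and that lfilter takes an axis argument, so it can run down the rows of a 2D array while treating each column independently.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 7))  # 5 periods (rows) x 7 grid points (columns)

# y[n] = x[n] + y[n-1]  <=>  b = [1], a = [1, -1]; recursion runs along axis 0
running = lfilter([1.0], [1.0, -1.0], x, axis=0)

# Same recursion with a decay factor: y[n] = x[n] + 0.9 * y[n-1]
decayed = lfilter([1.0], [1.0, -0.9], x, axis=0)

# Reference loop for the decayed version
expected = np.zeros_like(x)
prev = np.zeros(x.shape[1])
for n in range(x.shape[0]):
    prev = x[n] + 0.9 * prev
    expected[n] = prev

print(np.allclose(running, np.cumsum(x, axis=0)), np.allclose(decayed, expected))
```

This handles the recursion along the period axis only; it does not let a row's update depend on the neighbouring columns of the previous output row, which is the part the question is about.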
It seems to me I should be able to do both simultaneously with one filter, namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiple kernels?).
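One way to look at combining the two, sketched here under explicit assumptions: if the update is linear in y, for example y += c * y'' with a hypothetical constant c and the endpoints left unchanged, then the spatial stencil can be written as a tridiagonal matrix D and one update becomes y_next = (I + c*D) @ y. The actual some_function is not given in the post, so this only applies if it has this linear form.

```python
import numpy as np


def second_derivative_matrix(t):
    """Tridiagonal matrix D with (D @ y)[i] ~ y''(t[i]) at interior points."""
    t = np.asarray(t, dtype=float)
    n = len(t)
    h1 = t[1:-1] - t[:-2]
    h2 = t[2:] - t[1:-1]
    i = np.arange(1, n - 1)
    D = np.zeros((n, n))
    D[i, i - 1] = 2.0 / (h1 * (h1 + h2))
    D[i, i] = -2.0 / (h1 * h2)
    D[i, i + 1] = 2.0 / (h2 * (h1 + h2))
    return D  # first and last rows stay zero: endpoints are never adjusted


y0 = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]

c = 0.01        # hypothetical adjustment strength (assumption)
n_updates = 50  # hypothetical number of periods (assumption)

# One period of "compute curvature, then adjust y" as a single linear operator
A = np.eye(len(t)) + c * second_derivative_matrix(t)

# Keep every updated y(t): one row per period
history = np.empty((n_updates + 1, len(t)))
history[0] = y0
for k in range(n_updates):
    history[k + 1] = A @ history[k]

# Without per-period noise, row k is just A**k applied to y0
closed_form = np.linalg.matrix_power(A, n_updates) @ y0
print(np.allclose(history[-1], closed_form))
```

If random variates are added each period, the loop over periods remains, but each step is a single matrix product, and many Monte Carlo paths can be advanced at once by storing the state as an array of shape (n_paths, len(t)) and computing state @ A.T.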
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform, which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, this data is often inconvenient to use because the format doesn't fit my needs.
I need a txt file in which each line represents a mathematical vector, with the elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By "smart way" I mean that it won't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5 GB to 10 GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
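Putting those two formulas together, here is a rough sketch that draws a random mu and sigma per cluster and coordinate, samples points as sigma * noise + mu, and writes them in chunks as space-separated lines. The function names, ranges, cluster count and chunk size are all assumptions for illustration, and it uses numpy's Generator API (np.random.default_rng) rather than the legacy np.random.randn and np.random.rand calls.

```python
import io

import numpy as np


def make_cluster_params(n_clusters, dim, rng, mu_range=100.0, sigma_range=5.0):
    """Random centre (mu) and spread (sigma) per cluster and coordinate."""
    mu = mu_range * rng.random((n_clusters, dim))
    sigma = sigma_range * rng.random((n_clusters, dim))
    return mu, sigma


def sample_points(n_points, mu, sigma, rng):
    """Draw points from a mixture of axis-aligned normal distributions."""
    which = rng.integers(0, len(mu), size=n_points)  # cluster of each point
    noise = rng.standard_normal((n_points, mu.shape[1]))
    return sigma[which] * noise + mu[which]


def write_dataset(fh, n_points, mu, sigma, rng, chunk=100_000):
    """Write n_points vectors to an open text file, one vector per line."""
    written = 0
    while written < n_points:
        m = min(chunk, n_points - written)
        points = sample_points(m, mu, sigma, rng)
        np.savetxt(fh, points, fmt="%.4f", delimiter=" ")
        written += m


rng = np.random.default_rng(42)
mu, sigma = make_cluster_params(n_clusters=4, dim=3, rng=rng)

buf = io.StringIO()  # stands in for open("data.txt", "w")
write_dataset(buf, n_points=1000, mu=mu, sigma=sigma, rng=rng, chunk=300)
lines = buf.getvalue().splitlines()
print(len(lines), lines[0])
```

Writing in chunks keeps memory use bounded, so the same loop can be run with a real file handle and a much larger n_points to reach several gigabytes.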
There is no one good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well-posed problem. Clustering as such has no well-defined evaluation; consequently each method is equally good/bad, as long as it has a well-defined internal objective. So k-means will always be a good one to minimize inter-cluster Euclidean distance and will struggle with sparse data and non-convex or imbalanced clusters. DBSCAN will always be the best in a greedy density-based sense and will struggle with clusters of diverse density. GMM will always be great at fitting Gaussian mixtures, and will struggle with clusters which are not Gaussian (for example lines, squares, etc.).
From the question one could deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those available in scikit-learn (python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf