When Is It Apt to Convert Features to Log-values Before Fitting - pandas

I am planning to implement anomaly detection using a one-class SVM. My data has 25 features, and some of the columns show variations such as trend and seasonality.
I have tried converting the trend and seasonal features to log values, and I found that their distributions changed from skewed to roughly normal.
I am not certain, though, whether this conversion is the way to go, and I am not sure what consequences it may have during fitting.
I would greatly appreciate it if anybody could explain when it is best to convert features to log values, and/or suggest other techniques that could mitigate the effect of time-series variations.
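Roughly, a minimal version of what I tried looks like this (placeholder data and column names, with log1p rather than log in case some values are close to zero):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "trend": np.exp(rng.normal(0.0, 1.0, 1000)),      # right-skewed stand-in data
    "seasonal": np.exp(rng.normal(0.0, 0.5, 1000)),
})

df_log = df.apply(np.log1p)                            # log-transform the skewed columns
print(pd.DataFrame({"skew before": df.skew(), "skew after": df_log.skew()}))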

Related

How do I handle variability of output in AnyLogic?

I have been working on a simulation model for battery swapping in AnyLogic. So far I have developed the simulation model, an optimization experiment, and a parameter variation experiment.
There are no errors in the model, but the output values are unsatisfactory. Small changes, such as changing the step size of the decision variables, result in a drastic change in the best value obtained after every experiment. Although the objective does not change much, I am concerned about the other variables that change with each run. Even with multiple optimization runs it is difficult to come to a conclusion.
For reference, I am posting the output of a parameter variation experiment here. I ran the experiment with an optimized value, but I was getting feasible results (percentile > 95%) far off the expected input values. Although the overall trend is correct (decreasing percentile with increasing charging time), the variability is difficult to understand.
Can anyone help?
This is a common problem when building a model and looking only at high-level overall outputs. You could have a model bug, but it is just as likely (if not more likely) that there is some dynamic in your system that was not apparent in simple Excel spreadsheets or mental models. The DES may be telling you something truly interesting about the system behavior, but without additional outputs there is no way to understand what that is.
A few suggestions:
Run this as a simple single scenario, where you manually update inputs. When you run this with the low range of input values and then the high range of input values, what do you see on the animation or additional outputs that is different than you expected or could explain the overall output trend? Try running several intermediate points.
Add additional output metrics. If you look at queue sizes, resource utilizations, turn-around-times, etc; do you see anything at that level that is different than expected?
Add a "replication" log. When you run a set of inputs for multiple scenarios, does any single replication stand out as an outlier? If so, re-run the scenario with that set of inputs and that random seed.
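As a hedged illustration of that last point: if you export one row per replication (scenario, seed, objective) from AnyLogic to a CSV file, a few lines of Python are enough to flag outlier replications. The file name and column names below are made up, so adapt them to your own output.

import pandas as pd

runs = pd.read_csv("replication_log.csv")           # columns: scenario, seed, objective

def flag_outliers(group, z_cut=3.0):
    z = (group["objective"] - group["objective"].mean()) / group["objective"].std(ddof=0)
    return group[z.abs() > z_cut]

outliers = runs.groupby("scenario", group_keys=False).apply(flag_outliers)
print(outliers)                                     # re-run these scenario/seed pairs individually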
There is no substitute for understanding the underlying system behavior, and without understanding those dynamics, looking at overall correlation with optimization or parameter variation experiments will often lead companies to make the wrong policy decisions.

Heuristics / Solver for high dimensional planning problem

To optimize a production system by planning ~1000 timesteps ahead, I am trying to solve an optimization problem with around 20,000 dimensions, containing binary and continuous variables and several complex constraints.
I know the information provided is sparse, but can someone give a hint as to which approach would be suitable for such large problems? Would you recommend a metaheuristic or a commercial solver?
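To give an idea of the structure, here is a toy sketch of the kind of model I mean, written for an off-the-shelf MIP solver (PuLP with its bundled CBC solver); the data and the tiny horizon are invented, and the real problem is of course far larger:

import pulp

T = 10                                              # tiny horizon instead of ~1000 timesteps
demand = [5, 7, 3, 8, 6, 4, 9, 5, 7, 6]
capacity, setup_cost, unit_cost, hold_cost = 12, 50.0, 2.0, 1.0

m = pulp.LpProblem("production_plan", pulp.LpMinimize)
produce = [pulp.LpVariable(f"produce_{t}", lowBound=0) for t in range(T)]       # continuous
running = [pulp.LpVariable(f"running_{t}", cat="Binary") for t in range(T)]     # binary
stock = [pulp.LpVariable(f"stock_{t}", lowBound=0) for t in range(T)]

m += pulp.lpSum(setup_cost * running[t] + unit_cost * produce[t] + hold_cost * stock[t]
                for t in range(T))
for t in range(T):
    m += produce[t] <= capacity * running[t]                                    # produce only if the line runs
    m += (stock[t - 1] if t > 0 else 0) + produce[t] - demand[t] == stock[t]    # inventory balance

m.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[m.status], pulp.value(m.objective))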

Is it possible to approximate missing position data by imputation?

I would like to increase the density of my AIS or GPS data in order to carry out more precise analyses afterwards. During my research I came across different approaches such as interpolation, filtering, and imputation. There is no doubt that the first two can be used to approximate the points between two collected data points.
In the case of imputation (e.g. MICE), however, I have not yet found an approach in the literature for determining position data.
That is why I wanted to ask whether anyone knows of a paper dealing with this subject, and whether it makes sense at all to approximate additional position data by imputation.
The problem you are describing is trajectory reconstruction for AIS/GPS data. There are a number of papers on general trajectory reconstruction (see this for example), but AIS data are quite specific.
The irregularity of AIS data is a well-known problem with, as far as I know, no standard way of dealing with it.
However, there is a handful of publications that try to deal with this issue. The reconstruction problem is connected to the trajectory prediction problem, since the two share some of the same methods (the latter is more popular in the scientific community, I think).
Traditionally, AIS trajectory reconstruction is done using some physical models, which take into account the curvature of the earth and other factors, such as data noise (see examples here, here, and here).
More recent approaches use LSTM neural networks.
I don't know much about GPS data, but I think the methods are very similar to the ones mentioned above (especially taking into account the fact that you probably want to deal with maritime data).
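For reference, the plain interpolation baseline mentioned in the question can be done in a few lines, assuming an AIS track stored as a pandas DataFrame with a timestamp index and lat/lon columns (the column names are assumptions); this ignores the curvature of the earth, which is exactly what the physical models above try to account for:

import pandas as pd

track = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:07", "2021-01-01 00:20"]),
    "lat": [54.10, 54.13, 54.21],
    "lon": [13.40, 13.46, 13.58],
}).set_index("timestamp")

# Resample onto a regular 1-minute grid and interpolate linearly in time.
dense = track.resample("1min").mean().interpolate(method="time")
print(dense)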

Correcting SLAM drift error using GPS measurements

I'm trying to figure out how to correct drift errors introduced by a SLAM method using GPS measurements. I have two point sets in Euclidean 3D space taken at fixed moments in time:
The red dataset comes from GPS and contains no drift errors, while the blue dataset comes from the SLAM algorithm and drifts over time.
The idea is that SLAM is accurate over short distances but eventually drifts, while GPS is accurate over long distances and inaccurate over short ones. So I would like to figure out how to fuse the SLAM data with the GPS data in a way that takes the best accuracy from both measurements. At the very least, how should I approach this problem?
Since your GPS looks like it is very locally biased, I'm assuming it is low-cost and doesn't use any correction techniques, e.g. that it is not differential. As you are probably aware, GPS errors are not Gaussian. The authors of this paper show that a good way to model GPS noise is as v + eps, where v is a locally constant "bias" vector (it is usually constant for a few meters, and then changes more or less smoothly or abruptly) and eps is Gaussian noise.
Given this information, one option would be to use Kalman-based fusion, e.g. you add the GPS noise and bias to the state vector, and define your transition equations appropriately and proceed as you would with an ordinary EKF. Note that if we ignore the prediction step of the Kalman, this is roughly equivalent to minimizing an error function of the form
measurement_constraints + some_weight * GPS_constraints
and that gives you a more straightforward second option. For example, if your SLAM is visual, you can just use the sum of squared reprojection errors (i.e. the bundle adjustment error) as the measurement constraints, and define your GPS constraints as ||x - x_{gps}||, where the x are 2D or 3D GPS positions (you might want to ignore the altitude with low-cost GPS).
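As a rough sketch of this second option on toy 2D data (synthetic odometry-style constraints stand in for real reprojection errors, the GPS noise is plain Gaussian for simplicity, and the weight is arbitrary, so take all the names as illustrative only):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n = 50
true_path = np.cumsum(rng.normal(0.0, 1.0, size=(n, 2)), axis=0)
slam_odometry = np.diff(true_path, axis=0) + 0.03 + rng.normal(0.0, 0.01, size=(n - 1, 2))  # accurate locally, drifts globally
gps = true_path + rng.normal(0.0, 2.0, size=(n, 2))                                         # noisy but unbiased

gps_weight = 0.05                                    # "some_weight": how much GPS is trusted vs. SLAM

def residuals(flat_x):
    x = flat_x.reshape(n, 2)
    meas = (np.diff(x, axis=0) - slam_odometry).ravel()     # measurement_constraints
    gps_res = np.sqrt(gps_weight) * (x - gps).ravel()       # GPS_constraints
    return np.concatenate([meas, gps_res])

x0 = np.vstack([gps[0], gps[0] + np.cumsum(slam_odometry, axis=0)])   # the drifting SLAM path
fused = least_squares(residuals, x0.ravel()).x.reshape(n, 2)
print("mean error before fusion:", np.linalg.norm(x0 - true_path, axis=1).mean())
print("mean error after fusion: ", np.linalg.norm(fused - true_path, axis=1).mean())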
If your SLAM is visual and feature-point based (you didn't really say what type of SLAM you were using so I assume the most widespread type), then fusion with any of the methods above can lead to "inlier loss". You make a sudden, violent correction, and augment the reprojection errors. This means that you lose inliers in SLAM's tracking. So you have to re-triangulate points, and so on. Plus, note that even though the paper I linked to above presents a model of the GPS errors, it is not a very accurate model, and assuming that the distribution of GPS errors is unimodal (necessary for the EKF) seems a bit adventurous to me.
So, I think a good option is to use barrier-term optimization. Basically, the idea is this: since you don't really know how to model GPS errors, assume that you have more confidence in SLAM locally, and minimize a function S(x) that captures the quality of your SLAM reconstruction. Let x_opt denote the minimizer of S. Then fuse with GPS data as long as it does not deteriorate S(x_opt) more than a given threshold. Mathematically, you'd want to minimize
some_coef / (thresh - S(x)) + ||x - x_{gps}||
and you'd initialize the minimization with x_opt. A good choice for S is the bundle adjustment error, since by not degrading it, you prevent inlier loss. There are other choices of S in the literature, but they are usually meant to reduce computational time and add little in terms of accuracy.
This, unlike the EKF, does not have a nice probabilistic interpretation, but produces very nice results in practice (I have used it for fusion with other things than GPS too, and it works well). You can for example see this excellent paper that explains how to implement this thoroughly, how to set the threshold, etc.
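For completeness, a rough, self-contained sketch of the barrier idea on the same kind of toy 2D data as above (S here is an odometry misfit standing in for the bundle adjustment error, and the coefficient and threshold are arbitrary; none of this comes from the linked paper):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 20
true_path = np.cumsum(rng.normal(0.0, 1.0, size=(n, 2)), axis=0)
slam_odometry = np.diff(true_path, axis=0) + rng.normal(0.0, 0.01, size=(n - 1, 2))
gps = true_path + rng.normal(0.0, 2.0, size=(n, 2))

def S(flat_x):                                    # SLAM-only cost (stand-in for reprojection error)
    x = flat_x.reshape(n, 2)
    return np.sum((np.diff(x, axis=0) - slam_odometry) ** 2)

x_opt = np.vstack([gps[0], gps[0] + np.cumsum(slam_odometry, axis=0)]).ravel()  # minimizer of S
thresh = S(x_opt) + 1.0                           # allow S to degrade by at most this much
some_coef = 1e-2

def objective(flat_x):
    s = S(flat_x)
    if s >= thresh:                               # outside the trusted region: reject the step
        return 1e12
    gps_term = np.sum(np.linalg.norm(flat_x.reshape(n, 2) - gps, axis=1))
    return some_coef / (thresh - s) + gps_term

result = minimize(objective, x_opt, method="Nelder-Mead",
                  options={"maxiter": 50000, "maxfev": 50000})
print("S before:", S(x_opt), " S after:", S(result.x))    # stays below thresh by construction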
Hope this helps. Please don't hesitate to tell me if you find inaccuracies/errors in my answer.

Encoding invariance for deep neural network

I have a set of data consisting of 2D matrices (like grey-scale pictures), and I use a CNN as the classifier.
I would like to know if there is any study or experience on how accuracy is affected if we change the encoding from the traditional encoding.
I suppose there is an effect; the question is rather which transformations of the encoding leave the accuracy invariant, and which ones deteriorate it.
To clarify, this mainly concerns the quantization of the raw data into input data.
EDIT:
Quantizing the raw data into input data is already a pre-processing step, adding or removing some features (even if minor). The impact of this quantization process on accuracy in real DNN computation does not seem very clear.
Maybe some research is available.
I'm not aware of any research specifically dealing with quantization of input data, but you may want to check out some related work on quantization of CNN parameters: http://arxiv.org/pdf/1512.06473v2.pdf. Depending on what your end goal is, the "Q-CNN" approach may be useful for you.
My own experience with using various quantizations of the input data for CNNs has been that there's a heavy dependency between the degree of quantization and the model itself. For example, I've played around with using various interpolation methods to reduce image sizes and reducing the color palette size, and in the end, I discovered that each variant required a different tuning of hyper-parameters to achieve optimal results. Generally, I found that minor quantization of data had a negligible impact, but there was a knee in the curve where throwing away additional information dramatically impacted the achievable accuracy. Unfortunately, I'm not aware of any way to determine what degree of quantization will be optimal without experimentation, and even deciding what's optimal involves a trade-off between efficiency and accuracy which doesn't necessarily have a one-size-fits-all answer.
On a theoretical note, keep in mind that CNNs need to be able to find useful, spatially-local features, so it's probably reasonable to assume that any encoding that disrupts the basic "structure" of the input would have a significantly detrimental effect on the accuracy achievable.
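To make the kind of input quantization discussed above concrete, here is a minimal sketch that reduces the number of grey levels of an 8-bit image; the level counts are arbitrary, and as noted above, the right amount can in my experience only be found by experiment:

import numpy as np

def quantize_grey(img, levels):
    """Map an 8-bit greyscale image onto a given number of evenly spaced grey values."""
    step = 256.0 / levels
    return (np.floor(img / step) * step + step / 2).astype(np.uint8)

img = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)    # stand-in for a real picture
for levels in (256, 64, 16, 4):
    q = quantize_grey(img, levels)
    print(levels, "levels -> mean absolute change:", np.abs(q.astype(int) - img.astype(int)).mean())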
In usual practice -- a discrete classification task in a classic implementation -- it will have no effect. However, the critical point is in the initial computations for back-propagation. The classic definition depends only on strict equality of the predicted and "ground truth" classes: a simple right/wrong evaluation. Changing the class coding has no effect on whether or not a prediction is equal to the training class.
However, this function can be altered. If you change the code to have something other than a right/wrong scoring, something that depends on the encoding choice, then encoding changes can most definitely have an effect. For instance, if you're rating movies on a 1-5 scale, you likely want 1 vs 5 to contribute a higher loss than 4 vs 5.
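As a sketch of such an encoding-dependent score (purely illustrative, not a recommendation of a particular loss), the expected absolute rating error penalises a 1-vs-5 confusion more than a 4-vs-5 one:

import numpy as np

def ordinal_weighted_loss(probs, true_class):
    """Expected absolute rating error: far-off predictions cost more than near misses."""
    classes = np.arange(1, probs.size + 1)
    return float(np.sum(probs * np.abs(classes - true_class)))

probs = np.array([0.05, 0.05, 0.10, 0.20, 0.60])     # predicted distribution over ratings 1-5
print(ordinal_weighted_loss(probs, true_class=5))    # small loss: mostly right
print(ordinal_weighted_loss(probs, true_class=1))    # large loss: 1 vs 5 confusion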
Does this reasonably deal with your concerns?
I see now. My answer above is useful ... but not for what you're asking. I had my eye on the classification encoding; you're wondering about the input.
Please note that asking for off-site resources is a classic off-topic question category. I am unaware of any such research -- for what little that is worth.
Obviously, there should be some effect, as you're altering the input data. The effect would be dependent on the particular quantization transformation, as well as the individual application.
I do have some limited-scope observations from general big-data analytics.
In our typical environment, where the data were scattered with some inherent organization within their natural space (F dimensions, where F is the number of features), we often use two simple quantization steps: (1) Scale all feature values to a convenient integer range, such as 0-100; (2) Identify natural micro-clusters, and represent all clustered values (typically no more than 1% of the input) by the cluster's centroid.
This speeds up analytic processing somewhat. Given the fine-grained clustering, it has little effect on the classification output. In fact, it sometimes improves the accuracy minutely, as the clustering provides wider gaps among the data points.
Take with a grain of salt, as this is not the main thrust of our efforts.
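If it helps, a hedged sketch of those two steps on made-up numeric data (the clustering radius, integer range, and use of DBSCAN to find the micro-clusters are all illustrative choices, not our actual parameters):

import numpy as np
from sklearn.cluster import DBSCAN

def quantize_features(X, int_range=100, eps=1.0):
    # Step 1: scale every feature to the integer range 0..int_range.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    scaled = np.rint((X - mins) / np.where(maxs > mins, maxs - mins, 1.0) * int_range)

    # Step 2: find micro-clusters and replace their members with the cluster centroid.
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(scaled)
    out = scaled.copy()
    for label in set(labels) - {-1}:                  # -1 marks unclustered points
        members = labels == label
        out[members] = np.rint(out[members].mean(axis=0))
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
X[:10] = X[0] + rng.normal(scale=0.001, size=(10, 8))    # plant one tiny micro-cluster (~2% of rows)
Xq = quantize_features(X)
print("distinct rows before/after:", len(np.unique(X, axis=0)), len(np.unique(Xq, axis=0)))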