Pandas looking for oscillations in dataset - pandas

I am fairly new to pandas and I am looking at some flight data. I have some flight paths that seem to oscillate as seen below:
And others that look fairly normal in their altitude
I'd like to be able to identify any flights in my dataset that have oscillations (is that the right word?) like the first graph where it appears the pilot is periodically gaining and dropping altitude. Then I can look at other factors that might have been causing this.
I thought maybe some mixture of standard deviation and mean would help me identify this, but I must admit I am not very mathematically inclined (yet! working on it). There doesn't seem to be a correlation at least when looking just at these 2 samples.
Thanks for your time

Related

Information about CGAL and alternatives

I'm working on a problem that will eventually run in an embedded microcontroller (ESP8266). I need to perform some fairly simple operations on linear equations. I don't need much, but do need to be able work with points and linear equations to:
Define an equations for lines either from two known points, or one
point and a gradient
Calculate a new x,y point on an equation line that is a specific distance from another point on that equation line
Drop a perpendicular onto an equation line from a point
Perform variations of cosine-rule calculations on points and triangle sides defined as equations
I've roughed up some code for this a while ago based on high school "y = mx + c" concepts, but it's flawed (it fails with infinities when lines are vertical), and currently in Scala. Since I suspect I'm reinventing a wheel that's not my primary goal, I'd like to use someone else's work for this!
I've come across CGAL, and it seems very likely it's capable of all this and more, but I have two questions about it (given that it seems to take ages to get enough understanding of this kind of huge library to actually be able to answer simple questions!)
It seems to assert some kind of mathematical perfection in it's calculations, but that's not important to me, and my system will be severely memory constrained. Does it use/offer memory efficient approximations?
Is it possible (and hopefully easy) to separate out just a limited subset of features, or am I going to find the entire library (or even a very large subset) heading into my memory limited machine?
And, I suppose the inevitable follow up: are there more suitable libraries I'm unaware of?
TIA!
The problems that you are mentioning sound fairly simple indeed, so I'm wondering if you really need any library at all. Maybe if you post your original code we could help you fix it--your problem sounds like you need to redo a calculation avoiding a division by zero.
As for your point (2) about separating a limited number of features from CGAL, giving the size and the coding style of that project, from my experience that will be significantly more complicated (if at all possible) than fixing your own code.
In case you want to try a simpler library than CGAL, maybe you could try Boost.Geometry
Regards,

Is it possible to approximate missing position data by imputation?

I would like to increase the density of my AIS or GPS data in order to carry out more precise analyses afterwards. During my research I came across different approaches like interpolation, filtering or imputation. With the first two approaches, there is no doubt that these can be used to approximate the points between two collected data points.
In the case of imputation (e.g. MICE), however, I have not yet found an approach in the literature for determining position data.
That's why I wanted to ask if anyone knew a paper dealing with this subject and whether it makes sense at all to determine further position data approximately by imputation.
The problem you are describing there is trajectory reconstruction for AIS/GPS data. There's a number of papers for general trajectory reconstruction (see this for example), but AIS data are quite specific.
The irregularity of AIS data is a well know problem with no standard approach to deal with, as far as I know.
However, there is a handful of publications which try to deal with this issue. The problem of reconstruction is connected to the trajectory prediction problem, since both of these two shares some of the methods (the latter is more popular in the scientific community, I think).
Traditionally, AIS trajectory reconstruction is done using some physical models, which take into account the curvature of the earth and other factors, such as data noise (see examples here, here, and here).
More recent approach tries to use LSTM neural networks.
I don't know much about GPS data, but I think the methods are very similar to the ones mentioned above (especially taking into account the fact that you probably want to deal with maritime data).

How can I distribute a number of values Normally in Excel VBA

Sorry I know the question isnt as specific as it could be. I am currently working on a replenishment forecasting system for a clothing company (dont ask why it's in VBA). The module I am currently working on is distribution forecasts down to a size level. The idea is that the planners can forecast the number to sell, then can specify a ratio between the sizes.
In order to make the interface a bit nicer I was going to give them 4 options; Assess trend, manual entry, Poisson and Normal. The last two is where I am having an issue. Given a mean and SD I'd like to drop in a ratio (preferably as %s) between the different sizes. The number of the sizes can vary from 1 to ~30 so its going to need to be a calculation.
If anyone could point me towards a method I'd be etenaly greatfull - likewise if you have suggestions for a better method.
Cheers
For the sake of anyone searching this, whilst only a temporary solution I used probability mass functions to get ratios this allowed the user to modify the mean and SD and thus skew the curve as they wished. I could then use the ratios for my calculations. Poisson also worked with this method but turned out to be a slightly stupid idea in terms of choice.

I am looking for a radio advertising scheduling algorithm / example / experience

Tried doing a bit of research on the following with no luck. Thought I'd ask here in case someone has come across it before.
I help a volunteer-run radio station with their technology needs. One of the main things that have come up is they would like to schedule their advertising programmatically.
There are a lot of neat and complex rule engines out there for advertising, but all we need is something pretty simple (along with any experience that's worth thinking about).
I would like to write something in SQL if possible to deal with these entities. Ideally if someone has written something like this for other advertising mediums (web, etc.,) it would be really helpful.
Entities:
Ads (consisting of a category, # of plays per day, start date, end date or permanent play)
Ad Category (Restaurant, Health, Food store, etc.)
To over-simplify the problem, this will be a elegant sql statement. Getting there... :)
I would like to be able to generate a playlist per day using the above two entities where:
No two ads in the same category are played within x number of ads of each other.
(nice to have) high promotion ads can be pushed
At this time, there are no "ad slots" to fill. There is no "time of day" considerations.
We queue up the ads for the day and go through them between songs/shows, etc. We know how many per hour we have to fill, etc.
Any thoughts/ideas/links/examples? I'm going to keep on looking and hopefully come across something instead of learning it the long way.
Very interesting question, SMO. Right now it looks like a constraint programming problem because you aren't looking for an optimal solution, just one that satisfies all the constraints you have specified. In response to those who wanted to close the question, I'd say they need to check out constraint programming a bit. It's far closer to stackoverflow that any operations research sites.
Look into constraint programming and scheduling - I'll bet you'll find an analogous problem toot sweet !
Keep us posted on your progress, please.
Ignoring the T-SQL request for the moment since that's unlikely to be the best language to write this in ...
One of my favorites approaches to tough 'layout' problems like this is Simulated Annealing. It's a good approach because you don't need to think HOW to solve the actual problem: all you define is a measure of how good the current layout is (a score if you will) and then you allow random changes that either increase or decrease that score. Over many iterations you gradually reduce the probability of moving to a worse score. This 'simulated annealing' approach reduces the probability of getting stuck in a local minimum.
So in your case the scoring function for a given layout might be based on the distance to the next advert in the same category and the distance to another advert of the same series. If you later have time of day considerations you can easily add them to the score function.
Initially you allocate the adverts sequentially, evenly or randomly within their time window (doesn't really matter which). Now you pick two slots and consider what happens to the score when you switch the contents of those two slots. If either advert moves out of its allowed range you can reject the change immediately. If both are still in range, does it move you to a better overall score? Initially you take changes randomly even if they make it worse but over time you reduce the probability of that happening so that by the end you are moving monotonically towards a better score.
Easy to implement, easy to add new 'rules' that affect score, can easily adjust run-time to accept a 'good enough' answer, ...
Another approach would be to use a genetic algorithm, see this similar question: Best Fit Scheduling Algorithm this is likely harder to program but will probably converge more quickly on a good answer.

What statistics concepts are useful for profiling?

I've been meaning to do a little bit of brushing up on my knowledge of statistics. One area where it seems like statistics would be helpful is in profiling code. I say this because it seems like profiling almost always involves me trying to pull some information from a large amount of data.
Are there any subjects in statistics that I could brush up on to get a better understanding of profiler output? Bonus points if you can point me to a book or other resource that will help me understand these subjects better.
I'm not sure books on statistics are that useful when it comes to profiling. Running a profiler should give you a list of functions and the percentage of time spent in each. You then look at the one that took the most percentage wise and see if you can optimise it in any way. Repeat until your code is fast enough. Not much scope for standard deviation or chi squared there, I feel.
All I know about profiling is what I just read in Wikipedia :-) but I do know a fair bit about statistics. The profiling article mentioned sampling and statistical analysis of sampled data. Clearly statistical analysis will be able to use those samples to develop some statistical statements on performance. Let's say you have some measure of performance, m, and you sample that measure 1000 times. Let's also say you know something about the underlying processes that created that value of m. For instance, if m is the SUM of a bunch of random variates, the distribution of m is probably normal. If m is the PRODUCT of a bunch of random variates, the distribution is probably lognormal. And so on...
If you don't know the underlying distribution and you want to make some statement about comparing performance, you may need what are called non-parametric statistics.
Overall, I'd suggest any standard text on statistical inference (DeGroot), a text that covers different probability distributions and where they're applicable (Hastings & Peacock), and a book on non-parametric statistics (Conover). Hope this helps.
Statistics is fun and interesting, but for performance tuning, you don't need it. Here's an explanation why, but a simple analogy might give the idea.
A performance problem is like an object (which may actually be multiple connected objects) buried under an acre of snow, and you are trying to find it by probing randomly with a stick. If your stick hits it a couple of times, you've found it - it's exact size is not so important. (If you really want a better estimate of how big it is, take more probes, but that won't change its size.) The number of times you have to probe the snow before you find it depends on how much of the area of the snow it is under.
Once you find it, you can pull it out. Now there is less snow, but there might be more objects under the snow that remains. So with more probing, you can find and remove those as well. In this way, you can keep going until you can't find anything more that you can remove.
In software, the snow is time, and probing is taking random-time samples of the call stack. In this way, it is possible to find and remove multiple problems, resulting in large speedup factors.
And statistics has nothing to do with it.
Zed Shaw, as usual, has some thoughts on the subject of statistics and programming, but he puts them much more eloquently than I could.
I think that the most important statistical concept to understand in this context is Amdahl's law. Although commonly referred to in contexts of parallelization, Amdahl's law has a more general interpretation. Here's an excerpt from the Wikipedia page:
More technically, the law is concerned
with the speedup achievable from an
improvement to a computation that
affects a proportion P of that
computation where the improvement has
a speedup of S. (For example, if an
improvement can speed up 30% of the
computation, P will be 0.3; if the
improvement makes the portion affected
twice as fast, S will be 2.) Amdahl's
law states that the overall speedup of
applying the improvement will be
I think one concept related to both statistics and profiling (your original question) that is very useful and used by some (you see the technique advised from time to time) is while doing "micro profiling": a lot of programmers will rally and yell "you can't micro profile, micro profiling simply doesn't work, too many things can influence your computation".
Yet simply run n times your profiling, and keep only x% of your observations, the ones around the median, because the median is a "robust statistic" (contrarily to the mean) that is not influenced by outliers (outliers being precisely the value you want to not take into account when doing such profiling).
This is definitely a very useful statistician technique for programmers who want to micro-profile their code.
If you apply the MVC programming method with PHP this would be what you need to profile:
Application:
Controller Setup time
Model Setup time
View Setup time
Database
Query - Time
Cookies
Name - Value
Sessions
Name - Value