Detect sharp drops in time series data (Pandas)

I have a time series DataFrame which is indexed by "Cycles" and I'd like to know if there's a way to detect sharp prolonged drops.
The data is smoothed using a rolling mean over 10k cycles, and the index is not "symmetric": the cycle count doesn't increase by the same amount from one row to the next.
Index(Cycle) RawData DecimatedData
4 9.032 9.363
6 9.721 9.359
10 9.831 9.363
11 8.974 9.352
15 9.143 9.354
Even after decimating there are many "small" drops along the way which I'd like to ignore.
I've tried using df['DecimatedData'].diff() with different windows but it hasn't yielded good results.
I was thinking of keeping a counter that increments while .diff() keeps returning negative values, and flagging a drop once it passes a certain limit. But I think there might be a more elegant way of doing this.
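For what it's worth, here is a minimal sketch of that counter idea using pandas; the DataFrame below mirrors the sample data, and the run length of 3 and the -0.01 depth threshold are made-up values that would need tuning against the real series:

import pandas as pd

# small stand-in for the real data; replace with the actual DataFrame
df = pd.DataFrame(
    {"DecimatedData": [9.363, 9.359, 9.363, 9.352, 9.354]},
    index=[4, 6, 10, 11, 15],
)

diffs = df["DecimatedData"].diff()
falling = diffs < 0                                  # True while the series is dropping
run_id = (~falling).cumsum()                         # each run of consecutive drops gets one id
run_len = falling.groupby(run_id).cumsum()           # length of the current run of drops
run_depth = diffs.where(falling, 0).groupby(run_id).cumsum()  # total fall within the run

# flag a "sharp prolonged drop": long enough and deep enough (both thresholds are guesses)
df["drop_flag"] = (run_len >= 3) & (run_depth <= -0.01)

An alternative along the same lines is a rolling sum of the diffs, which flags windows whose net change is strongly negative.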

Related

adding to multiindexed pandas dataframe by specific indices is incredibly slow

I have a rather large data frame (~400M rows). For each row, I calculate which elements in another data frame (~200M rows) I need to increment by 1. This second data frame is indexed by three parameters. So the line I use to do this looks like:
total_counts[i].loc[zipcode,j,g_durs[j][0]:g_durs[j][-1]+1] += 1
The above line works, but it appears to be a horrendous bottleneck and takes almost 1 second to run. Unfortunately, I cannot wait 11 years for this to finish.
Any insights on how I might speed up this operation? Do I need to split up the 200M row data frame into an array of smaller objects? I don't understand exactly what is making this so slow.
Thanks in advance.
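To make the pattern concrete, here is a minimal, self-contained sketch of the kind of label-sliced increment described above; the level names, sizes, and values are hypothetical stand-ins, not the real data:

import pandas as pd

# hypothetical stand-in for one of the count frames: a Series of counts indexed
# by three levels, analogous to (zipcode, j, duration) in the question
idx = pd.MultiIndex.from_product(
    [["10001", "90210"], [0, 1], range(10)],
    names=["zipcode", "j", "duration"],
)
counts = pd.Series(0, index=idx).sort_index()   # label slicing needs a sorted index

# the pattern from the question: bump every row in a label-based slice by 1;
# doing this once per row of a 400M-row frame is where the time goes
zipcode, j, lo, hi = "90210", 1, 3, 7
counts.loc[zipcode, j, lo:hi] += 1              # .loc slices include both endpoints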

Compensating for laggy positive feedback

I'm trying to make a program run as accurately as possible while staying at a fixed frame rate. How do you do this?
Formally, I have some parameter b in [0,1] that I can set to determine how accurate my computations are (where 0 is least accurate, 0.5 is fairly accurate, and 1 is very accurate). The higher this is, the lower frame rate I will get.
However, there is a "lag", where after changing this parameter, the frame rate won't change until d milliseconds afterwards, where d can vary and is unknown.
Is there a way to change this parameter that prevents "wiggling"? The problem is that if I am experiencing a low frame rate and I increase the parameter, then measure again, it will only be slightly higher, so I need to increase it more; then the framerate will be too slow, so I need to decrease the parameter, and I get this oscillating behavior. Is there a way to prevent this? I need to be as reactive as possible in doing this, because changing too slowly will leave the framerate incorrect for too long.
Looks like you need an adaptive feedback dampener. Trying an electrical circuit analogy :)
I'd first try to get more info about what the circuit's input signal and responsiveness look like. So I'd first make the algorithm update b not with the desired values but with the previous values plus or minus (as needed, towards the desired value) a small fixed increment, say .01 (ignore the sloppy response time for now). While doing so I'd collect and plot/analyze the "desired" b values, looking for:
the general shape of the changes: smooth or rather "steppy" or "spiky"? (spiky would require a stronger dampening to prevent oscillations, steppy would require a weaker dampening to prevent lagging)
the maximum/typical/minimum changes in values from sample to sample
the distribution of the changes in values from sample to sample (I'd plan the algorithm to react best to changes in a typical range, say the 20-80% band, and accept some lagging for changes above that range or some oscillation for changes below it)
The end goal is to be able to obtain parameters for operating alternatively in 2 modes:
a high-speed tracking mode (also the system's initial mode)
a normal tracking mode
In high-speed tracking mode the b value updates can be either:
not dampened - the update value is the full desired value - only if the shape of the changes is not spiky and only in the 1st b update after entering the high-speed tracking mode. This would help reduce lagging.
dampened - the update delta is just a fraction (the dampening factor) of the desired delta, accounting for the fact that the effect of the previous b update might not yet be fully reflected in the current frame rate due to d. Dampening helps prevent oscillations at the expense of potentially increasing lag (always conflicting requirements). The dampening factor would be higher for a smooth shape and smaller for a spiky shape.
Switching from high-speed tracking mode to normal tracking mode can be done when the delta between b's previous value and its desired value falls below a certain mode change threshold value (possibly required to hold for a minimum number of consecutive samples). The mode change threshold value would be initially estimated from the info collected above and/or determined empirically.
In normal tracking mode the delta between b's previous value and its desired value remains below the mode change threshold value and is either ignored (no b update) or an update is made with either the desired value or some average of it - tiny course corrections that keep the frame rate practically constant: no dampening, no lagging, no oscillations.
When in normal tracking mode the delta between b's previous value and its desired value goes above the mode change threshold value the system switches again to the high-speed tracking mode.
I would also try to get a general idea of what the d response time looks like. To do that I'd change the algorithm to update b with the desired values not at every iteration, but every n iterations (maybe even re-try for several n values). This should indicate how many sample periods a b value change generally takes to become fully effective, and that should be reflected in the dampening factor: the longer it takes for a change to take effect, the stronger the dampening should be to prevent oscillations.
Of course, this is just the general idea, a lot of experimental trial/adjustment iterations may be required to reach a satisfactory solution.
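If it helps, here is one possible reading of the above as code - a minimal sketch only: the threshold, dampening factor, and hold count are invented placeholders for values that those experiments would have to supply.

MODE_THRESHOLD = 0.05   # made-up mode-change threshold on |desired - current|
DAMPENING = 0.3         # made-up dampening factor (fraction of the desired delta applied)
HOLD_SAMPLES = 3        # samples the delta must stay small before leaving high-speed mode

class BTracker:
    def __init__(self, b=0.5):
        self.b = b
        self.mode = "high_speed"   # the system's initial mode
        self.small_count = 0

    def update(self, desired_b):
        delta = desired_b - self.b
        if self.mode == "high_speed":
            # dampened update: apply only part of the desired change, since the
            # effect of the previous update may not yet show in the frame rate
            self.b += DAMPENING * delta
            if abs(delta) < MODE_THRESHOLD:
                self.small_count += 1
                if self.small_count >= HOLD_SAMPLES:
                    self.mode, self.small_count = "normal", 0
            else:
                self.small_count = 0
        else:
            # normal tracking: tiny course corrections, no dampening
            if abs(delta) >= MODE_THRESHOLD:
                self.mode = "high_speed"   # large delta: back to high-speed tracking
                self.b += DAMPENING * delta
            else:
                self.b = desired_b
        self.b = min(1.0, max(0.0, self.b))   # b must stay in [0, 1]
        return self.b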

Using multiple threads for faster execution

Approximate program behavior:
I have a map image with data associated with the map indicated by RGB index. The data has been populated into an MS Access database. I imported the information in the database into my program as an array and sorted them to go in the order I want the program to run.
I want the program to find the nearest pixel that has a different color from the incumbent pixel being compared. (Colors are stored as string attributes of object Pixel)
First question: Should I use integers to represent my colors instead of string? Would this make the comparison function run significantly faster?
In order to find the nearest pixel of a different color, the program begins with all 8 adjacent pixels around the incumbent. If a nonMatch is not found, it continues on to the next "degree", and in this fashion it spirals out from the incumbent pixel until it hits a nonMatch (a rough sketch of this search is included after the question). When found, the color of the nonMatch is saved as an attribute of the incumbent. After I find the nonMatch for each of the pixels, the data is re-inserted into the database.
The program accomplishes what I want in the manner I've written it, but it is very very slow. After 24 hours, I am only about 3% through with execution.
Question Two: Does my program behavior sound about right? Is this the algorithm you would use if you had to accomplish this task?
Question Three: Would it be appropriate for me to use threads in order to finish execution of the program faster? How exactly does that work? (I am brand new to threads, but know a little of the syntax)
Question Four: Would it be more "intelligent" for my program to find the nonMatch for each pixel and insert it into the database immediately after finding it? (I'm guessing this would be good for multi-threading, because while one record is accessing the database (to insert), another record is accessing the array of pixels, a shared global variable in the program.)
Question Five: If threading is a good idea, I'm guessing I would split the records up into more manageable chunks (i.e. quarters), and have each thread run the same functions for their specified number of records? Am I close at all?
Please let me know if I can clarify or provide code samples, I just figured that this is more of a conceptual topic so do not want to overburden the post.
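For concreteness, here is a minimal sketch (in Python, purely to illustrate the idea) of the spiral-out search described above, assuming the map is a 2-D list of integer color codes; the function name and bounds handling are assumptions, not the asker's actual code:

def nearest_non_match(colors, x, y):
    """Spiral outwards from (x, y) until a differently colored pixel is found."""
    height, width = len(colors), len(colors[0])
    incumbent = colors[y][x]
    for degree in range(1, max(width, height)):
        # walk only the square ring of cells exactly `degree` steps away
        for dy in range(-degree, degree + 1):
            for dx in range(-degree, degree + 1):
                if max(abs(dx), abs(dy)) != degree:
                    continue            # interior cells were covered at lower degrees
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height and colors[ny][nx] != incumbent:
                    return nx, ny, colors[ny][nx]
    return None                         # the whole map is a single color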
1.) Yes, integers compare much faster than strings. Additionally they use much less memory.
2.) I would adapt the algorithm in this way:
E.g. #1: Let's say, for pixel (87,23), you found the nearest nonMatch to be (88,24) at degree=1 - you can immediately invert the relation and record that the nearest nonMatch to (88,24) is (87,23). On degree=1 you finished 2 pixels with 1 search.
E.g. #2: Let's say, for pixel (17,18), you found the nearest nonMatch to be (17,20) at degree=2. You can immediately record that all pixels that border on both (16,19), (17,19) and (18,19) have the nearest nonMatch (17,20) at degree=1, and that one of them is the nearest nonMatch to (17,20). On degree=2 (or higher), you finished 5 pixels with 1 search.
3.) Using threads is a double-edged sword: you can do searches in parallel, but you need locking if you write to your array. So this depends on how many CPU cores you can throw at the problem. If this is 3 or more, threads will surely speed up the search.
4.) The results from 2.) make it necessary to mark a pixel as "done" in your array, as you might have finished up to 5 pixels with 1 search. I recommend you put those into a queue and use a dedicated thread to write the queue back into the database: MS Access can't handle concurrent updates, so a single database writer thread looks like a good idea (a minimal sketch of this setup follows the list).
5.) I recommend you NOT chunk up the array: you will run into problems with pixels on the edges of a chunk having their nearest nonMatch in a different chunk. Instead, if you use e.g. 4 threads, let them run: (1) from the NW corner east, then south; (2) from the SE corner west, then north; (3) from the NE corner south, then west; (4) from the SW corner north, then east.
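Regarding point 4.), here is a minimal sketch of the queue plus single-writer-thread idea, in Python for brevity (the actual program may well be in another language, and write_row stands in for whatever database call is already being used):

import queue
import threading

results = queue.Queue()     # worker threads put (pixel, nearest_non_match) tuples here

def db_writer(q, write_row):
    # the only thread that ever touches the database: drain the queue and write each result
    while True:
        item = q.get()
        if item is None:    # sentinel meaning "no more results are coming"
            break
        write_row(item)     # write_row is a placeholder for the existing insert/update code

writer = threading.Thread(target=db_writer, args=(results, print))  # print as a stand-in
writer.start()
# ... worker threads call results.put((pixel, non_match)) as they finish searches ...
results.put(None)           # tell the writer to stop once all workers are done
writer.join()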
Yes. Using an integer would make it much faster.
You can reuse the work you have done for the previous pixel. E.g. if (a,b) is the nearest non-equal pixel of (x,y), it is likely that points around (x,y) might also have (a,b) as their nearest non-equal pixel.
You can use different threads to work on different pixels instead of dividing searching for one pixel
IMHO, steps 1&2 should make your program much faster and you might not need multi-threading.
Yes, I'd convert colour strings to Integers for speed, or even Color structures if you intend to display them on the screen.
Don't work directly with the database if you can avoid it. Copy the necessary data out of the database into an array before you start, and copy your results back when you're finished.

Spike removal algorithm

I have an array of values ranging from 30 to 300. I want to somehow make a weighted average where, if I have 5 values and one is a lot bigger than the rest (a spike), it won't influence the average as much as it would if I simply took an arithmetic average, e.g. (n1+n2+n3+n4+n5)/5.
Does anyone have an idea how to make a simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
Alternatively, you could try something like eliminating items r% out of range of your total average. Something like this (in javascript):
function Average(arr)
{
    // plain arithmetic mean, used below
    var sum = 0;
    for (var i = 0; i < arr.length; i++) sum += arr[i];
    return sum / arr.length;
}

function RangedAverage(arr, r)
{
    var x = Average(arr);
    // eliminate items more than r (a fraction, e.g. 0.2) out of range of the average;
    // iterate backwards so splice() doesn't skip the element after a removal
    for (var i = arr.length - 1; i >= 0; i--)
        if (arr[i] < x * (1 - r) || arr[i] > x * (1 + r))
            arr.splice(i, 1);
    x = Average(arr); // compute new average of what's left
    return x;
}
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed, the mean is susceptible to skewing by spikes. Perhaps the median or mode may be a better statistic, as they tend to be less skewed?
This should be a comment, but JS seems to be broken for me at the moment: it's not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (median filter).
In response to that, I'd suggest you first look at whether the median or mode is more appropriate as a statistic. If not, then apply a median filter (very good at removing spikes), then average.
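In case it helps, a minimal sketch of a median filter followed by a plain average, in Python; the window size of 3 is arbitrary:

from statistics import mean, median

def median_filter(values, window=3):
    # replace each value with the median of its neighborhood, which kills lone spikes
    half = window // 2
    return [median(values[max(0, i - half):i + half + 1]) for i in range(len(values))]

data = [30, 31, 300, 29, 33]        # one spike at 300
print(mean(median_filter(data)))    # roughly 31; the spike barely moves the result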
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
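A minimal sketch of that trimming approach; the 10% figure is just an example:

def trimmed_mean(values, trim_fraction=0.10):
    # drop the lowest and highest trim_fraction of the data before averaging
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

data = [30, 32, 29, 33, 31, 28, 34, 30, 31, 300]   # one spike at 300
print(trimmed_mean(data))                          # 31.25; the spike falls in the trimmed tail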
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".

Efficient division by 128, in tsql

As part of implementing a half-life behaviour, I need to perform x = x - x / 128 on around a hundred thousand rows every few days. Is tsql smart enough to do the division by 128 efficiently (as a bit-shift), or is it just as efficient to divide by 130?
If tsql isn't smart enough, is there anything clever that I can do to make it more efficient?
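As a side note on the arithmetic itself: each application of the update multiplies x by 127/128, so (treating x as real-valued) the half-life of this scheme works out to roughly 88 updates:

x_n = x_0 (127/128)^n,   n_{1/2} = ln 2 / ln(128/127) ≈ 88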
A hundred thousand rows isn't enough for the difference in performance between a divide operation and a shift operation to even be measurable, especially if you only have to do it every few days. Better to spend your time worrying about other issues.
You could use a computed column with the PERSISTED flag to ensure that the values were physically stored and not recomputed every time they were displayed. That could (possibly, depending on your particular circumstances) be more efficient.
More likely you will have problems with integer math. I don't know what your x values are, but if they are also integers, you want to divide by 128.0 if you don't want the answer truncated to an integer.