I have a list of arrays of varying size, each representing an interval by maintaining a min and max. When the data is dense, it can be assumed that an interval's min is close to the previous interval's max, but they never overlap.
I would like to determine which interval a point belongs to with as few comparisons as possible. What I think is a naive approach is to compare against the min first and go left if the point is less. Otherwise, compare against the max and go right if the point is greater; if neither applies, we have found the interval. I'm wondering whether I can instead compare to the median of each interval and go either left or right without comparing against the min and max at first. When the range of the search becomes 0, my theory is that I can then compare against the min of the successor to determine which interval is correct. So it's like an eager branch strategy that avoids the double comparison per interval, but I'm not sure if there is a valid eager solution that works.
What would be an optimal algorithm here? I've been reading a lot about segment trees (which I think is the closest to my problem) but haven't managed to find a solid reference for a smart binary search approach.
Intervals do not overlap, and the list of intervals is sorted such that an interval's min is greater than its predecessor's max.
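One way to get a single comparison per step, as a minimal sketch: binary search on the mins alone to find the last interval whose min is less than or equal to the point, then make one final comparison against that interval's max. Note that this pivots on each interval's min rather than its median. The name findInterval and the [min, max] pair representation are assumptions for illustration, and the intervals are treated as closed.

```javascript
// Intervals are [min, max] pairs, sorted and non-overlapping (assumed representation).
// Returns the index of the interval containing `point`, or -1 if it falls in a gap.
function findInterval(intervals, point) {
  let lo = 0;
  let hi = intervals.length; // search window is [lo, hi)
  // Find the first interval whose min is greater than the point,
  // using one comparison per step.
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (intervals[mid][0] <= point) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  const candidate = lo - 1; // last interval with min <= point
  if (candidate < 0) return -1; // point is below every interval
  // Single final comparison against the candidate's max.
  return point <= intervals[candidate][1] ? candidate : -1;
}

const intervals = [[0, 4], [5, 9], [12, 20]];
console.log(findInterval(intervals, 7));  // 1
console.log(findInterval(intervals, 10)); // -1 (gap between 9 and 12)
```

This costs roughly log2(count) comparisons plus one, compared with up to two per step for the min-then-max approach.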
I am applying the OptaPlanner VRP example with time windows, and I get feasible solutions whenever I define time windows within a 24-hour range (00:00 to 23:59). However, I also need to:
Manage long trips, where I know that the travel time from the depot to the first visit, or between visits, will be more than 24 hours. Currently I do not get feasible solutions for these, because the time windows are expressed in 24-hour format. When the scoring rule "arrivalAfterDueTime" is applied, the "arrivalTime" is always later than the "dueTime", because the "dueTime" lies in the range 00:00 to 23:59 and the "arrivalTime" falls on the next day.
My idea is to take each Customer's time window and add further windows to it, one for each planned day.
For example, if I am planning a 3-day trip, each Customer would have 3 time windows. If Customer 1 is available from [08:00-10:00], it would also be available from [32:00-34:00] and [56:00-58:00], which are the same time window on the following days.
I also store the times as long values, converted to milliseconds.
I don't know if this is the right way. I am mainly looking for ideas on how to approach this constraint; if you have faced a similar problem, any suggestion would be much appreciated.
Sorry for the wording, I am a Spanish speaker. Thank you.
Without having checked the example, handling multiple days shouldn't be complicated. It all depends on how you model your time variable.
For example, you could:
model the time stamps as a long value denoting seconds since epoch. This is how most of the examples are modeled, if I remember correctly. Note that this is not very human-readable, but it is the fastest to compute with
use a time data type, e.g. LocalTime. This is a human-readable time format, but it only works within the 24-hour range and will be slower than using a primitive data type
use a date-time data type, e.g. LocalDateTime. This is also human-readable and works for any time range, but it will also be slower than using a primitive data type.
I would strongly encourage you not to simply map the current day or current hour to a zero value and start counting from there. In your example you denote the times as [32:00-34:00], which makes it appear as if you are using midnight of the current day as hour 0 and counting from there. While you can do this, it will hurt the debugging and maintainability of your code. That is just my general advice; you don't have to follow it.
What I would advise is to have your own domain models and map them to OptaPlanner models in which every time stamp is a long value denoting seconds since epoch.
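To illustrate the epoch-based model, here is a minimal sketch (in JavaScript only for brevity; the same arithmetic applies to a Java long). The helper names toEpochSeconds and dailyWindows, the readyTime field, and the dates are hypothetical, and UTC is assumed throughout to keep time zones out of the picture.

```javascript
// Convert a UTC calendar date and time of day to seconds since the Unix epoch.
function toEpochSeconds(year, month, day, hour, minute) {
  // Date.UTC takes a zero-based month and returns milliseconds.
  return Date.UTC(year, month - 1, day, hour, minute) / 1000;
}

// Build one { readyTime, dueTime } window per planned day for a customer
// who is available at the same time of day every day.
function dailyWindows(year, month, firstDay, days, fromHour, toHour) {
  const windows = [];
  for (let d = 0; d < days; d++) {
    windows.push({
      readyTime: toEpochSeconds(year, month, firstDay + d, fromHour, 0),
      dueTime: toEpochSeconds(year, month, firstDay + d, toHour, 0),
    });
  }
  return windows;
}

// Hypothetical 3-day plan starting 2024-03-01, customer available 08:00-10:00.
const windows = dailyWindows(2024, 3, 1, 3, 8, 10);

// An arrival on day 2 at 09:00 is compared directly against absolute windows.
const arrival = toEpochSeconds(2024, 3, 2, 9, 0);
const feasible = windows.some(w => arrival >= w.readyTime && arrival <= w.dueTime);
console.log(feasible); // true
```

Because every value is an absolute instant, an arrival on a later day never has to be compared against a time of day that wraps at 24:00.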
I'm using timestamps as the score. I want to prevent duplicates by appending a unique object id to the score. Currently, this id is a 6-digit number (the highest id right now is 221849), but it is expected to grow past a million. So, the score will be something like
1407971846221849 (timestamp:1407971846 id:221849) and will eventually reach 14079718461000001 (timestamp:1407971846 id:1000001).
My concern is not being able to store scores because they've reached the max allowed.
I've read the docs, but I'm a bit confused. I know, basic math. But bear with me, I want to get this right.
Redis sorted sets use a double 64-bit floating point number to represent the score. In all the architectures we support, this is represented as an IEEE 754 floating point number, that is able to represent precisely integer numbers between -(2^53) and +(2^53) included. In more practical terms, all the integers between -9007199254740992 and 9007199254740992 are perfectly representable. Larger integers, or fractions, are internally represented in exponential form, so it is possible that you get only an approximation of the decimal number, or of the very big integer, that you set as score.
There's another thing bothering me right now. Would the increase in ids break the chronological sort sequence?
I would appreciate any insights, suggestions, different perspectives, or being told flat out if what I'm trying to do is nonsense.
Thanks for any help.
No, it won't break the "chronological" order, but you may lose the precision of the last digits, so two members may end up having the same score (i.e. non-unique).
There is no problem with duplicate scores. It is just maintaining a sorted set in memory. Members are unique but the scores may be the same. If you want chronological processing I would just rely on the timestamp without adding an id to it.
Appending an id would break the chronological sort if your ids are mixed. As a simple example, timestamps 1, 2, 3 with ids 100, 10, 1 give the scores 1100, 210 and 31, which do not sort in timestamp order. If your ids are always assigned in increasing order, then you should just use the id as the score.
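Both concerns can be checked with a little arithmetic. The sketch below uses BigInt so the concatenated values stay exact, and tests them against the 2^53 bound quoted from the documentation. The helper names concatScore and inExactIntegerRange are made up for illustration.

```javascript
// Bound from the quoted docs: every integer up to 2^53 in magnitude is an exact double.
const MAX_EXACT = 2n ** 53n; // 9007199254740992n

// Build the score by concatenating the timestamp and the id, as in the question.
function concatScore(timestamp, id) {
  return BigInt(String(timestamp) + String(id));
}

// True if the integer lies in the range where all integers are exactly representable.
function inExactIntegerRange(score) {
  return score >= -MAX_EXACT && score <= MAX_EXACT;
}

const sixDigit = concatScore(1407971846, 221849);    // 1407971846221849n
const sevenDigit = concatScore(1407971846, 1000001); // 14079718461000001n

console.log(inExactIntegerRange(sixDigit));   // true
console.log(inExactIntegerRange(sevenDigit)); // false

// A later timestamp with a 6-digit id sorts before an earlier one with a 7-digit id.
const later = concatScore(1407971900, 221850);
console.log(later < sevenDigit); // true, so chronological order is broken
```

Padding the id to a fixed width would keep the ordering consistent, but a 10-digit timestamp followed by a 7-digit id is a 17-digit number, which is already beyond 2^53.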
Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with the lowest time complexity)? So far I think I have a solution that runs in O(n log n):
First, note that the subset of a given size with the lowest standard deviation must consist of numbers that are consecutive in sorted order. So step 1 is to sort the numbers (this is O(n log n)).
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, Wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the lowest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the last n numbers, updating the recorded minimum whenever a lower value is found. Each computation takes only constant time if you keep track of s1 and s2 as running totals: for example, to get the standard deviation of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate the standard deviation. This means that the entire standard-deviation portion of the algorithm takes linear time.
So the total time complexity is O(n log n).
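The steps above can be sketched as follows. To avoid overloading n, the subset size is called k here; the name densestWindow and the returned fields are assumptions for illustration. The sketch compares k*s2 - s1^2 (k^2 times the variance) and only takes one square root at the end.

```javascript
// Returns { start, center, stdDev } for the k consecutive sorted values with the
// lowest population standard deviation, or null if k is out of range.
// `start` is an index into the sorted copy of the input.
function densestWindow(values, k) {
  if (k < 1 || k > values.length) return null;
  const x = [...values].sort((a, b) => a - b); // O(n log n)

  // Running totals for the first window [0, k-1].
  let s1 = 0; // sum of values
  let s2 = 0; // sum of squares
  for (let i = 0; i < k; i++) {
    s1 += x[i];
    s2 += x[i] * x[i];
  }

  let best = k * s2 - s1 * s1;
  let bestStart = 0;
  let bestSum = s1;

  // Slide the window: drop x[i-1], add x[i+k-1], in constant time per step.
  for (let i = 1; i + k <= x.length; i++) {
    s1 += x[i + k - 1] - x[i - 1];
    s2 += x[i + k - 1] ** 2 - x[i - 1] ** 2;
    const scaled = k * s2 - s1 * s1;
    if (scaled < best) {
      best = scaled;
      bestStart = i;
      bestSum = s1;
    }
  }
  // Clamp at 0 to guard against tiny negative values caused by rounding.
  return {
    start: bestStart,
    center: bestSum / k,
    stdDev: Math.sqrt(Math.max(best, 0)) / k,
  };
}

console.log(densestWindow([1, 50, 10, 11, 12, 90], 3));
// window [10, 11, 12]: start 1, center 11, stdDev about 0.816
```

With floating-point input, running sums of squares can lose precision when the values are large relative to their spread, so subtracting a common offset first may be worth considering.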
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having recently worked on a similar problem, I think both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
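A minimal sketch of the range-based variant, using 0-based indexing and k for the window size. The name tightestWindow is an assumption for illustration.

```javascript
// Returns { start, range } for the k consecutive sorted values with the
// smallest range (max - min), or null if k is out of range.
// `start` is an index into the sorted copy of the input.
function tightestWindow(values, k) {
  if (k < 1 || k > values.length) return null;
  const x = [...values].sort((a, b) => a - b);
  let best = Infinity;
  let bestStart = 0;
  for (let i = 0; i + k <= x.length; i++) {
    // In a sorted array the window's range is just last minus first.
    const range = x[i + k - 1] - x[i];
    if (range < best) {
      best = range;
      bestStart = i;
    }
  }
  return { start: bestStart, range: best };
}

console.log(tightestWindow([1, 50, 10, 11, 12, 90], 3)); // { start: 1, range: 2 }
```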
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.
I'm looking for recommendations on a best practice here.
I have a requirement where, on a given day, I must have an arbitrary number of intervals (think buckets of time which are composed of transactions), with at most N intervals per day. These intervals are spans of time but can be of arbitrary length, i.e. some last seconds, others minutes.
How the intervals are formed is based on my source data. On any given day, we always start with interval 1, and the total number of intervals we will have by EOD is unknown. Each interval is defined by a fixed number of transactions. For every interval I am also going to need to know the end time.
What is the best approach here? Should I be bucketing my fact table and connecting to a standard hour/minute/second dimension or should I be using my transactional data to be making a dimension that accommodates it?
I appreciate your feedback.
If the buckets are based on time, you probably have to do it on one of your dimensions. There is a property on the attributes called bucket that can do that for you.
I have an array of values ranging from 30 to 300. I want to somehow compute a weighted average where, if I have 5 values and one is a lot bigger than the rest (a spike), it won't influence the average as much as it would with a simple arithmetic average, e.g. (n1+n2+n3+n4+n5)/5.
Does anyone have an idea how to write a simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
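Alternatively, you could try eliminating items that are more than r% away from the overall average. Something like this (in JavaScript):
function Average(arr) {
  return arr.reduce((sum, v) => sum + v, 0) / arr.length;
}

function RangedAverage(arr, r) {
  var x = Average(arr);
  // now eliminate items more than r (a fraction, e.g. 0.5 for 50%) away from x
  var kept = arr.filter(function (v) {
    return v >= x * (1 - r) && v <= x * (1 + r);
  });
  // compute new average, falling back to the plain one if nothing was kept
  return kept.length > 0 ? Average(kept) : x;
}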
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed, the mean is susceptible to skewing by spikes. Perhaps the median or mode would be a better statistic, as they tend to be less skewed?
This should be a comment, but JS seems to be broken for me at the moment: it's not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (a median filter).
In response to that, I'd suggest you first look at whether the median or mode is more appropriate as a statistic. If not, then apply a median filter (very good at removing spikes) and then average.
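A minimal sketch of both options, assuming a non-empty numeric array. The names median and medianFilter and the halfWidth parameter are chosen for illustration; the window simply shrinks at the edges of the array.

```javascript
// Median of a non-empty array (middle value, or mean of the two middle values).
function median(arr) {
  const s = [...arr].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sliding-window median filter: each value is replaced by the median of
// itself and up to `halfWidth` neighbours on each side.
function medianFilter(arr, halfWidth) {
  return arr.map((_, i) =>
    median(arr.slice(Math.max(0, i - halfWidth), i + halfWidth + 1)));
}

const data = [40, 42, 300, 41, 43];
console.log(median(data));          // 42
console.log(medianFilter(data, 1)); // [41, 42, 42, 43, 42]

// Averaging the filtered values gives a mean that the spike no longer dominates.
const filtered = medianFilter(data, 1);
console.log(filtered.reduce((sum, v) => sum + v, 0) / filtered.length); // 42
```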
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
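Removing the highest and lowest N% before averaging is usually called a trimmed mean. A minimal sketch, where the name trimmedMean and the convention that fraction is given as a value such as 0.2 for 20% are assumptions:

```javascript
// Mean after dropping the lowest and highest `fraction` of the sorted values.
function trimmedMean(arr, fraction) {
  const s = [...arr].sort((a, b) => a - b);
  const drop = Math.floor(s.length * fraction); // items removed from each end
  const kept = s.slice(drop, s.length - drop);
  if (kept.length === 0) return NaN; // fraction too large for this array
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}

console.log(trimmedMean([40, 42, 300, 41, 43], 0.2)); // 42
```

With five values and a fraction of 0.2, one value is dropped from each end, so the spike of 300 no longer contributes.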
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".