any elegant binary search tree algorithm to work this out? - binary-search-tree

I was confronted with this problem, and would like to have a quick algorithm to work it out.
Given n Points in the 2D plane(none of them has an x value or y value equal to another), find out the number of all pairs of points which form a line with positive slope.(say (0,0) and (1,1), with a positive slope of 45 degrees )
Since the n is big( say 60000), so I need an elegant algorithm to keep it within 1 second.
I know it is easy to do it with O(n^2), but it is simply to slow, which takes about 30 seconds. Is it possible to have an binary search tree to do it with nlogn complexity?
I appreciate anyone who would like to enlighten me on this.

It seems like you should be able to do it mathematically..
Every set of 2 points there are (using permuations, I believe) will either be positive or negative and if they are random points, this means there will be average 50% positive and 50% negative.. So it would be the amount of pairs / 2.

Related

should the "midpoint" location of binary search alway be at 1/2?

In the native binary search, we choose 1/2 as the midpoint to cut off half of the "workload" in linear search and the possible answers. However, if the time complexity of the check_mid(mid) function is not fixed, will 1/2 still be a fair point for the search?
For example, In the problem of finding the first bad version. let's say the time complexity of the check_mid(mid) is O(mid), the length of the array is N. When we set the midpoint at 1/2, the time complexity of linear searching the left part would be 1/8 * N^2, and the right part would be 3/8 * N^2. So, in the aspect of "workload", the division is not fair, will a factor which bigger than 1/2 be a better midpoint in this situation(1/sqrt(2) or 2/3)?
In short, my confusion is that we get rid of half of the possible cases or the cases hold half of the "workload"? Let's say the "workload"-T means linearly checking all the possible cases. If we cut off half of T in each recursion, the worst time complexity would be log2(T). But if we cut off half of the possible cases, the worst time complexity would not be log2(T) when the check_mid(mid) function is not fixed.
Is there a more efficient search factor than midpoint for binary search?
this question is similar but its answer didn't take the time complexity of check_mid(mid) into consideration.
If you know ahead something about distribution, maybe you could find a better pivot, otherwise 1/2 think is the best for something randomly [1,3,8,11,23..].Never know in which half will be and maybe in particular cases other pivot will be faster but overall the time is not the best.[for all searches].In most of the cases binary-search is applied on unknown sequence. For a known distribution:exponential-grow [1 3 9 27 81 ...] it's obvious that very-very lower values will be near the start(or in 1/3) so 1/3 could be fine for lower values and 2/3 for higher values. But even here, after a few iterations is hardly to made any assumption in which half it's "probably" to be (so maybe changing again the pivot to 1/2 will give a better time). The solution here is based on "good chance to guess the right half [the one with less items]" for a few iterations based on known distribution.

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of the these algorithms, I can calculate the running time of my implementation, which is easy giving the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way you would experimentally gauge the running time in the single variable case is, insert a counter that increments when your data structure does a fundamental (putatively O(1)) operation, then take data for many different input sizes, and plot it on a log-log plot. That is, log T vs. log N. If the running time is of the form n^k you should see a straight line of slope k, or something approaching this. If the running time is like T(n) = n^{k log n} or something, then you should see a parabola. And if T is exponential in n you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However this will only really work if there's really only one leading term that dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 * m^3 + m^4.
When m = O(1), then T(n) = O(n^4).
When n = O(1), then T(n) = O(m^4).
When n = m, then T(n) = O(n^6).
In each of these regimes, "slices" along the plane of possible n,m values, a different one of the terms is the dominant term.
So there's no way to determine the function just from taking some points with fixed m, and some points with fixed n. If you did that, you wouldn't get the right answer for n = m -- you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables / complicated data structures, is with a pencil and piece of paper, and do traditional algorithmic analysis. Or possibly, a hybrid approach. Try to break the question of efficiency into different parts -- if you can split the question up into a sum or product of a few different functions, maybe some of them you can determine in the abstract, and some you can estimate experimentally.
Luckily two input parameters is still easy to visualize in a 3D scatter plot (3rd dimension is the measured running time), and you can check if it looks like a plane (in log-log-log scale) or if it is curved. Naturally random variations in measurements plays a role here as well.
In Matlab I typically calculate a least-squares solution to two-variable function like this (just concatenates different powers and combinations of x and y horizontally, .* is an element-wise product):
x = log(parameter_x);
y = log(parameter_y);
% Find a least-squares fit
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
Then this can be used to estimate running times for larger problem instances, ideally those would be confirmed experimentally to know that the fitted model works.
This approach works also for higher dimensions but gets tedious to generate, maybe there is a more general way to achieve that and this is just a work-around for my lack of knowledge.
I was going to write my own explanation but it wouldn't be any better than this.

Is there a prdefined name for the following solution search/optimization algorithm?

Consider a problem whose solution maximizes an objective function.
Problem : From 500 elements, 15 needs to be selected (candidate solution), Value of Objective function depends on the pairwise relationships between the elements in a candidate solution and some more.
The steps for solving such a problem is described here:
1. Generate a set of candidate solutions in guided random manner(population) //not purely random the direction is given to generate the population
2. Evaluating the objective function for current population
3. If the current_best_solution exceeds the global_best_solution, then replace the global_best with current_best
4. Repeat steps 1,2,3 for N (arbitrary number) times
where size of population and N are smaller (approx 50)
After N iterations it returns a candidate solution stored in global_best_solution
Is this the description of a well-known algorithm?
If it is, what is the name of that algorithm or if not under which category these type of algorithms fit?
What you have sounds like you are just fishing. Note that you might as well get rid of steps 3 and 4 since running the loop 100 times would be the same as doing it once with an initial population 100 times as large.
If you think of the objective function as a random variable which is a function of random decision variables then what you are doing would e.g. give you something in the 99.9th percentile with very high probability -- but there is no limit to how far the optimum might be from the 99.9th percentile.
To illustrate the difficulty, consider the following sort of Travelling Salesman Problem. Imagine two clusters of points A and B, each of which has 100 points. Within the clusters, each point is arbitrarily close to every other point (e.g. 0.0000001). But -- between the clusters the distance is say 1,000,000. The optimal tour would clearly have length 2,000,000 (+ a negligible amount). A random tour is just a random permutation of those 200 decision points. Getting an optimal or near optimal tour would be akin to shuffling a deck of 200 cards with 100 read and 100 black and having all of the red cards in the deck in a block (counting blocks that "wrap around") -- vanishingly unlikely (It can be calculated as 99 * 100! * 100! / 200! = 1.09 x 10^-57). Even if you generate quadrillions of tours it is overwhelmingly likely that each of those tours would be off by millions. This is a min problem, but it is also easy to come up with max problems where it is vanishingly unlikely that you will get a near-optimal solution by purely random settings of the decision variables.
This is an extreme example, but it is enough to show that purely random fishing for a solution isn't very reliable. It would make more sense to use evolutionary algorithms or other heuristics such as simulated annealing or tabu search.
why do you work with a population if the members of that population do not interact ?
what you have there is random search.
if you add mutation it looks like an Evolution Strategy: https://en.wikipedia.org/wiki/Evolution_strategy

does every algorithm have Big Omega?

does every algorithm have Big Omega?
Is it possible for algorithms to have both Big O and Big Omega (but not equal to each other- not Big Theta) ?
For instance Quicksort's Big O - O(n log n) But does it have Big Omega? If it does, how do i calculate it?
First, it is of paramount importance that one not confuse the bound with the case. A bound - like Big-Oh, Big-Omega, Big-Theta, etc. - says something about a rate of growth. A case says something about the kinds of input you're currently considering being processed by your algorithm.
Let's consider a very simple example to illustrate the distinction above. Consider the canonical "linear search" algorithm:
LinearSearch(list[1...n], target)
1. for i := 1 to n do
2. if list[i] = target then return i
3. return -1
There are three broad kinds of cases one might consider: best, worst, and average cases for inputs of size n. In the best case, what you're looking for is the first element in the list (really, within any fixed number of the start of the list). In such cases, it will take no more than some constant amount of time to find the element and return from the function. Therefore, the Big-Oh and Big-Omega happen to be the same for the best case: O(1) and Omega(1). When both O and Omega apply, we also say Theta, so this is Theta(1) as well.
In the worst case, the element is not in the list, and the algorithm must go through all n entries. Since f(n) = n happens to be a function that is bound from above and from below by the same class of functions (linear ones), this is Theta(n).
Average case analysis is usually a bit trickier. We need to define a probability space for viable inputs of length n. One might say that all valid inputs (where integers can be represented using 32 bits in unsigned mode, for instance) are equally probable. From that, one could work out the average performance of the algorithm as follows:
Find the probability that target is not represented in the list. Multiply by n.
Given that target is in the list at least once, find the probability that it appears at position k for each 1 <= k <= n. Multiply each P(k) by k.
Add up all of the above to get a function in terms of n.
Notice that in step 1 above, if the probability is non-zero, we will definitely get at least a linear function (exercise: we can never get more than a linear function). However, if the probability in step 1 is indeed zero, then the assignment of probabilities in step 2 makes all the difference in determining the complexity: you can have best-case behavior for some assignments, worst-case for others, and possibly end up with behavior that isn't the same as best (constant) or worst (linear).
Sometimes, we might speak loosely of a "general" or "universal" case, which considers all kinds of input (not just the best or the worst), but that doesn't give any particular weighting to inputs and doesn't take averages. In other words, you consider the performance of the algorithm in terms of an upper-bound on the worst-case, and a lower-bound on the best-case. This seems to be what you're doing.
Phew. Now, back to your question.
Are there functions which have different O and Omega bounds? Definitely. Consider the following function:
f(n) = 1 if n is odd, n if n is even.
The best case is "n is odd", in which case f is Theta(1); the worst case is "n is even", in which case f is Theta(n); and if we assume for the average case that we're talking about 32-bit unsigned integers, then f is Theta(n) in the average case, as well. However, if we talk about the "universal" case, then f is O(n) and Omega(1), and not Theta of anything. An algorithm whose runtime behaves according to f might be the following:
Strange(list[1...n], target)
1. if n is odd then return target
2. else return LinearSearch(list, target)
Now, a more interesting question might be whether there are algorithms for which some case (besides the "universal" case) cannot be assigned some valid Theta bound. This is interesting, but not overly so. The reason is that you, during your analysis, are allowed to choose the cases that constitutes best- and worst-case behavior. If your first choice for the case turns out not to have a Theta bound, you can simply exclude the inputs that are "abnormal" for your purposes. The case and the bound aren't completely independent, in that sense: you can often choose a case such that it has "good" bounds.
But can you always do it?
I don't know, but that's an interesting question.
Does every algorithm have a Big Omega?
Yes. Big Omega is a lower bound. Any algorithm can be said to take at least constant time, so any algorithm is Ω(1).
Does every algorithm have a Big O?
No. Big O is a upper bound. Algorithms that don't (reliably) terminate don't have a Big O.
An algorithm has an upper bound if we can say that, in the absolute worst case, the algorithm will not take longer than this. I'm pretty sure O(∞) is not valid notation.
When will the Big O and Big Omega of an algorithm be equal?
There is actually a special notation for when they can be equal: Big Theta (Θ).
They will be equal if the algorithm scales perfectly with the size of the input (meaning there aren't input sizes where the algorithm is suddenly a lot more efficient).
This is assuming we take Big O to be the smallest possible upper bound and Big Omega to be the largest possible lower bound. This is not actually required from the definition, but they're commonly informally treated as such. If you drop this assumption, you can find a Big O and Big Omega that aren't equal for any algorithm.
Brute force prime number checking (where we just loop through all smaller numbers and try to divide them into the target number) is perhaps a good example of when the smallest upper bound and largest lower bound are not equal.
Assume you have some number n. Let's also for the time being ignore the fact that bigger numbers take longer to divide (a similar argument holds when we take this into account, although the actual complexities would be different). And I'm also calculating the complexity based on the number itself instead of the size of the number (which can be the number of bits, and could change the analysis here quite a bit).
If n is divisible by 2 (or some other small prime), we can very quickly check whether it's prime with 1 division (or a constant number of divisions). So the largest lower bound would be Ω(1).
Now if n is prime, we'll need to try to divide n by each of the numbers up to sqrt(n) (I'll leave the reason we don't need to go higher than this as an exercise). This would take O(sqrt(n)), which would also then be our smallest upper bound.
So the algorithm would be Ω(1) and O(sqrt(n)).
Exact complexity also may be hard to calculate for some particularly complex algorithms. In such cases it may be much easier and acceptable to simply calculate some reasonably close lower and upper bounds and leave it at that. I don't however have an example on hand for this.
How does this relate to best case and worst case?
Do not confuse upper and lower bounds for best and worst case. This is a common mistake, and a bit confusing, but they're not the same. This is a whole other topic, but as a brief explanation:
The best and worst (and average) cases can be calculated for every single input size. The upper and lower bounds can then be used for each of those 3 cases (separately). You can think of each of those cases as a line on a graph with input size on the x-axis and time on the y-axis and then, for each of those lines, the upper and lower bounds are lines which need to be strictly above or below that line as the input size tends to infinity (this isn't 100% accurate, but it's a good basic idea).
Quick-sort has a worst-case of Θ(n2) (when we pick the worst possible pivot at every step) and a best-case of Θ(n log n) (when we pick good pivots). Note the use of Big Theta, meaning each of those are both lower and upper bounds.
Let's compare quick-sort with the above prime checking algorithm:
Say you have a given number n, and n is 53. Since it's prime, it will (always) take around sqrt(53) steps to determine whether it's prime. So the best and worst cases are all the same.
Say you want to sort some array of size n, and n is 53. Now those 53 elements can be arranged such that quick-sort ends up picking really bad pivots and run in around 532 steps (the worst case) or really good pivots and run in around 53 log 53 steps (the best case). So the best and worst cases are different.
Now take n as 54 for each of the above:
For prime checking, it will only take around 1 step to determine that 54 is prime. The best and worst cases are the same again, but they're different from what they were for 53.
For quick-sort, you'll again have a worst case of around 542 steps and a best case of around 54 log 54 steps.
So for quick-sort the worst case always takes around n2 steps and the best case always takes around n log n steps. So the lower and upper (or "tight") bound of the worst case is Θ(n2) and the tight bound of the best case is Θ(n log n).
For our prime checking, sometimes the worst case takes around sqrt(n) steps and sometimes it takes around 1 step. So the lower bound for the worse case would be Ω(1) and upper bound would be O(sqrt(n)). It would be the same for the best case.
Note that above I simply said "the algorithm would be Ω(1) and O(sqrt(n))". This is slightly ambiguous, as it's not clear whether the algorithm always takes the same amount of time for some input size, or the statement is referring to one of the best, average or worst case.
How do I calculate this?
It's hard to give general advice for this since proofs of bounds are greatly dependent on the algorithm. You'd need to analyse the algorithm similar to what I did above to figure out the worst and best cases.
Big O and Big Omega it can be calculated for every algorithm as you can see in Big-oh vs big-theta

Simplification / optimization of GPS track

I've got a GPS track produced by gpxlogger(1) (supplied as a client for gpsd). GPS receiver updates its coordinates every 1 second, gpxlogger's logic is very simple, it writes down location (lat, lon, ele) and a timestamp (time) received from GPS every n seconds (n = 3 in my case).
After writing down a several hours worth of track, gpxlogger saves several megabyte long GPX file that includes several thousands of points. Afterwards, I try to plot this track on a map and use it with OpenLayers. It works, but several thousands of points make using the map a sloppy and slow experience.
I understand that having several thousands of points of suboptimal. There are myriads of points that can be deleted without losing almost anything: when there are several points making up roughly the straight line and we're moving with the same constant speed between them, we can just leave the first and the last point and throw away anything else.
I thought of using gpsbabel for such track simplification / optimization job, but, alas, it's simplification filter works only with routes, i.e. analyzing only geometrical shape of path, without timestamps (i.e. not checking that the speed was roughly constant).
Is there some ready-made utility / library / algorithm available to optimize tracks? Or may be I'm missing some clever option with gpsbabel?
Yes, as mentioned before, the Douglas-Peucker algorithm is a straightforward way to simplify 2D connected paths. But as you have pointed out, you will need to extend it to the 3D case to properly simplify a GPS track with an inherent time dimension associated with every point. I have done so for a web application of my own using a PHP implementation of Douglas-Peucker.
It's easy to extend the algorithm to the 3D case with a little understanding of how the algorithm works. Say you have input path consisting of 26 points labeled A to Z. The simplest version of this path has two points, A and Z, so we start there. Imagine a line segment between A and Z. Now scan through all remaining points B through Y to find the point furthest away from the line segment AZ. Say that point furthest away is J. Then, you scan the points between B and I to find the furthest point from line segment AJ and scan points K through Y to find the point furthest from segment JZ, and so on, until the remaining points all lie within some desired distance threshold.
This will require some simple vector operations. Logically, it's the same process in 3D as in 2D. If you find a Douglas-Peucker algorithm implemented in your language, it might have some 2D vector math implemented, and you'll need to extend those to use 3 dimensions.
You can find a 3D C++ implementation here: 3D Douglas-Peucker in C++
Your x and y coordinates will probably be in degrees of latitude/longitude, and the z (time) coordinate might be in seconds since the unix epoch. You can resolve this discrepancy by deciding on an appropriate spatial-temporal relationship; let's say you want to view one day of activity over a map area of 1 square mile. Imagining this relationship as a cube of 1 mile by 1 mile by 1 day, you must prescale the time variable. Conversion from degrees to surface distance is non-trivial, but for this case we simplify and say one degree is 60 miles; then one mile is .0167 degrees. One day is 86400 seconds; then to make the units equivalent, our prescale factor for your timestamp is .0167/86400, or about 1/5,000,000.
If, say, you want to view the GPS activity within the same 1 square mile map area over 2 days instead, time resolution becomes half as important, so scale it down twice further, to 1/10,000,000. Have fun.
Have a look at Ramer-Douglas-Peucker algorithm for smoothening complex polygons, also Douglas-Peucker line simplification algorithm can help you reduce your points.
OpenSource GeoKarambola java library (no Android dependencies but can be used in Android) that includes a GpxPathManipulator class that does both route & track simplification/reduction (3D/elevation aware).
If the points have timestamp information that will not be discarded.
https://sourceforge.net/projects/geokarambola/
This is the algorith in action, interactively
https://lh3.googleusercontent.com/-hvHFyZfcY58/Vsye7nVrmiI/AAAAAAAAHdg/2-NFVfofbd4ShZcvtyCDpi2vXoYkZVFlQ/w360-h640-no/movie360x640_05_82_05.gif
This algorithm is based on reducing the number of points by eliminating those that have the greatest XTD (cross track distance) error until a tolerated error is satisfied or the maximum number of points is reached (both parameters of the function), wichever comes first.
An alternative algorithm, for on-the-run stream like track simplification (I call it "streamplification") is:
you keep a small buffer of the points the GPS sensor gives you, each time a GPS point is added to the buffer (elevation included) you calculate the max XTD (cross track distance) of all the points in the buffer to the line segment that unites the first point with the (newly added) last point of the buffer. If the point with the greatest XTD violates your max tolerated XTD error (25m has given me great results) then you cut the buffer at that point, register it as a selected point to be appended to the streamplified track, trim the trailing part of the buffer up to that cut point, and keep going. At the end of the track the last point of the buffer is also added/flushed to the solution.
This algorithm is lightweight enough that it runs on an AndroidWear smartwatch and gives optimal output regardless of if you move slow or fast, or stand idle at the same place for a long time. The ONLY thing that maters is the SHAPE of your track. You can go for many minutes/kilometers and, as long as you are moving in a straight line (a corridor within +/- tolerated XTD error deviations) the streamplify algorithm will only output 2 points: those of the exit form last curve and entry on next curve.
I ran in to a similar issue. The rate at which the gps unit takes points is much larger that needed. Many of the points are not geographically far away from each other. The approach that I took is to calculate the distance between the points using the haversine formula. If the distance was not larger than my threshold (0.1 miles in my case) I threw away the point. This quickly gets the number of points down to a manageable size.
I don't know what language you are looking for. Here is a C# project that I was working on. At the bottom you will find the haversine code.
http://blog.bobcravens.com/2010/09/gps-using-the-netduino/
Hope this gets you going.
Bob
This is probably NP-hard. Suppose you have points A, B, C, D, E.
Let's try a simple deterministic algorithm. Suppose you calculate the distance from point B to line A-C and it's smaller than your threshold (1 meter). So you delete B. Then you try the same for C to line A-D, but it's bigger and D for C-E, which is also bigger.
But it turns out that the optimal solution is A, B, E, because point C and D are close to the line B-E, yet on opposite sides.
If you delete 1 point, you cannot be sure that it should be a point that you should keep, unless you try every single possible solution (which can be n^n in size, so on n=80 that's more than the minimum number of atoms in the known universe).
Next step: try a brute force or branch and bound algorithm. Doesn't scale, doesn't work for real-world size. You can safely skip this step :)
Next step: First do a determinstic algorithm and improve upon that with a metaheuristic algorithm (tabu search, simulated annealing, genetic algorithms). In java there are a couple of open source implementations, such as Drools Planner.
All in all, you 'll probably have a workable solution (although not optimal) with the first simple deterministic algorithm, because you only have 1 constraint.
A far cousin of this problem is probably the Traveling Salesman Problem variant in which the salesman cannot visit all cities but has to select a few.
You want to throw away uninteresting points. So you need a function that computes how interesting a point is, then you can compute how interesting all the points are and throw away the N least interesting points, where you choose N to slim the data set sufficiently. It sounds like your definition of interesting corresponds to high acceleration (deviation from straight-line motion), which is easy to compute.
Try this, it's free and opensource online Service:
https://opengeo.tech/maps/gpx-simplify-optimizer/
I guess you need to keep points where you change direction. If you split your track into the set of intervals of constant direction, you can leave only boundary points of these intervals.
And, as Raedwald pointed out, you'll want to leave points where your acceleration is not zero.
Not sure how well this will work, but how about taking your list of points, working out the distance between them and therefore the total distance of the route and then deciding on a resolution distance and then just linear interpolating the position based on each step of x meters. ie for each fix you have a "distance from start" measure and you just interpolate where n*x is for your entire route. (you could decide how many points you want and divide the total distance by this to get your resolution distance). On top of this you could add a windowing function taking maybe the current point +/- z points and applying a weighting like exp(-k* dist^2/accuracy^2) to get the weighted average of a set of points where dist is the distance from the raw interpolated point and accuracy is the supposed accuracy of the gps position.
One really simple method is to repeatedly remove the point that creates the largest angle (in the range of 0° to 180° where 180° means it's on a straight line between its neighbors) between its neighbors until you have few enough points. That will start off removing all points that are perfectly in line with their neighbors and will go from there.
You can do that in Ο(n log(n)) by making a list of each index and its angle, sorting that list in descending order of angle, keeping how many you need from the front of the list, sorting that shorter list in descending order of index, and removing the indexes from the list of points.
def simplify_points(points, how_many_points_to_remove)
angle_map = Array.new
(2..points.length - 1).each { |next_index|
removal_list.add([next_index - 1, angle_between(points[next_index - 2], points[next_index - 1], points[next_index])])
}
removal_list = removal_list.sort_by { |index, angle| angle }.reverse
removal_list = removal_list.first(how_many_points_to_remove)
removal_list = removal_list.sort_by { |index, angle| index }.reverse
removal_list.each { |index| points.delete_at(index) }
return points
end