Verifying nearest neighbors algorithm is working correctly - testing

I am implementing a complex algorithm, from a paper I found online, for determining the n nearest neighbors in a high-dimensional embedding space. After finishing the implementation, I wanted to verify that the code does indeed return the desired n nearest neighbors. To do so, I compare the results against a brute-force search that scans every element in the embedding space and finds the n nearest neighbors.
The issue arises when there are multiple elements with the same distance from the query input.
For example, if I am checking for the 3 nearest neighbors and I have four points, one of which is closest and the other three all equidistant from the search key, one element will necessarily be left out. I'd like to test that the two implementations agree up to such ties; I am not interested in exactly which elements are left out. As a result, I can't just do an element-wise equality check between the complex algorithm's output and the brute-force solution's.
For business reasons, it is actually helpful if the element left out is random, because I want the end user to see a variety of results, as long as all results are equally relevant. I specifically do not want a stable-ordering on the results.
Is there an off-the-shelf solution to this problem? I am implementing this in Python, but the solution can be language-agnostic.
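One check that would be acceptable to me is to compare distances rather than identities: the sorted distances of the returned neighbors should match the k smallest brute-force distances, even if the tied elements themselves differ. A rough sketch of what I mean (NumPy, names and tolerances are placeholders):

    import numpy as np

    def knn_results_equivalent(query, candidates, fast_indices, k, rtol=1e-9, atol=0.0):
        """Compare a fast k-NN result against brute force, ignoring which of
        several tied (equidistant) elements was returned."""
        # Brute-force distances from the query to every candidate.
        all_dists = np.linalg.norm(candidates - query, axis=1)
        brute_dists = np.sort(all_dists)[:k]
        # Distances of the elements the fast algorithm actually returned.
        fast_dists = np.sort(all_dists[fast_indices])
        # The sorted distance lists must agree (up to floating-point tolerance),
        # even though the identities of tied elements may differ.
        return len(fast_indices) == k and np.allclose(fast_dists, brute_dists,
                                                      rtol=rtol, atol=atol)

This deliberately says nothing about which tied element was kept, so a randomised tie-break would still pass.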

Related

Simulating a series of rotating vertices with various masses and coupling functions

I keep running into this problem: I want to run a complex simulation of interconnected nodes, but aside from some time spent looking into Rigs of Rods, I don't have any experience in this area.
In this case I'm trying to simulate a series of rotating devices. If I were trying to do CFD or were using more vertices, I assume I would need to formulate something for an ODE solver... but in this case I have 7 vertices with 6 edges, all in-line, so I think brute force is an option. Various functions define how force is transmitted along this line of vertices, and at any point in the chain energy/force can be applied arbitrarily based on the result of one or more functions for a given edge.
I'm guessing that this can't be done in a single iteration without an equation that accounts for everything.
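For concreteness, here is a rough sketch of the kind of brute-force time-stepping I have in mind (the masses, stiffnesses, and drive function below are all made-up placeholders):

    import numpy as np

    N = 7                                   # rotating vertices in a line
    inertia = np.array([1.0, 0.5, 0.5, 2.0, 0.5, 0.5, 1.0])   # placeholder inertias
    theta = np.zeros(N)                     # angles
    omega = np.zeros(N)                     # angular velocities
    dt = 1e-3

    def edge_torque(i, d_theta, d_omega):
        """Placeholder coupling for edge i -> i+1: torsional spring plus damper."""
        k, c = 50.0, 0.1                    # stiffness and damping, made up
        return k * d_theta + c * d_omega

    def external_torque(t):
        """Placeholder drive applied to the first rotor."""
        return 5.0 * np.sin(2.0 * np.pi * t)

    t = 0.0
    for step in range(100_000):
        torque = np.zeros(N)
        torque[0] += external_torque(t)
        for i in range(N - 1):              # 6 edges, all in-line
            tau = edge_torque(i, theta[i + 1] - theta[i], omega[i + 1] - omega[i])
            torque[i] += tau                # torque on the upstream rotor
            torque[i + 1] -= tau            # equal and opposite on the downstream one
        omega += dt * torque / inertia      # semi-implicit Euler step
        theta += dt * omega
        t += dt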
I suppose I'll take any input. I don't know what I don't know, and I wouldn't be shocked to learn that there are some great write-ups out there if I only knew what to search for.

Optimize pattern of rotating holes for all combinations

Sort of a programming question, sort of a general logic question. Imagine a circular base with a pattern of circles:
And another circle, mounted above and able to rotate, with holes that expose the colored circles below:
There must be an optimal pattern of either the colored circles or the openings (or both) that will allow for all N possible combinations of colors... but I have no idea how to attack the problem! At this point, combinations of 2 are probably the easiest and would be fine as a starting point (red/blue, red/green, red/white, etc.).
I would imagine there will need to be gaps in the colors, unlike the example above. Any suggestions welcome!
Edit: clarified the question (hopefully!) thanks to feedback from Robert Harvey
For two holes, you could look for a perfect matching in a bipartite graph, with each permutation of two colors represented by two nodes, one in each partition. Nodes would be connected if they share exactly one element, e.g. the (blue, red) node from the first partition connected to the (red, green) node of the second; circles placed at the same spacing would allow for both of these patterns. A perfect matching in that graph would correspond to chains or cycles of permutations in which neighbouring permutations always share a single color, a bit like dominoes. If you had a set of cycles of the same length, you could interleave them to form the pattern on the lower disk. I'm not sure how easy it will be to obtain these same-length cycles, though, and I also don't know how to generalize this to more than two elements in each permutation.
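If you want to experiment with this idea, here is a rough sketch using networkx (the color set is arbitrary; with four colors the graph is regular, so a perfect matching exists):

    import itertools
    import networkx as nx

    colors = ["red", "blue", "green", "white"]
    pairs = list(itertools.permutations(colors, 2))      # ordered two-color patterns

    # Bipartite graph: each pair appears once in each partition, "L" and "R".
    G = nx.Graph()
    G.add_nodes_from((("L", p) for p in pairs), bipartite=0)
    G.add_nodes_from((("R", p) for p in pairs), bipartite=1)

    # Connect two pairs if they share exactly one color,
    # e.g. (blue, red) -- (red, green).
    for a in pairs:
        for b in pairs:
            if a != b and len(set(a) & set(b)) == 1:
                G.add_edge(("L", a), ("R", b))

    matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=[("L", p) for p in pairs])

    # Each pair is matched to a "successor" sharing one color; following the
    # successors traces out the domino-like chains/cycles described above.
    for p in pairs:
        print(p, "->", matching[("L", p)][1])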

Generating a sudoku of a desired difficulty?

So, I've done a fair bit of reading into generation of a Sudoku puzzle. From what I can tell, the standard way to get a Sudoku puzzle of a desired difficulty is to generate a puzzle, grade it afterwards, and repeat until you have one of an acceptable rating. This can be refined by generating via backtracking using some of the more complex solving patterns (XY-wing, swordfish, etc.), but that's not quite what I'm wanting to do here.
What I want to do, but have been unable to find any real resource on, is generate a puzzle from a "difficulty value" (0-1.0 value, 0 being the easiest, and 1.0 being the hardest).
For example, I want to create a moderately difficult puzzle, so the value .675 is selected. Using that value, I want to be able to generate a moderately difficult puzzle.
Anyone know of something like this? Or perhaps something with a similar methodology?
Adding another answer, for generating a sudoku of the desired difficulty on the fly.
This means that, unlike other approaches, the algorithm runs only once and returns a sudoku configuration matching the desired difficulty (either with probability = 1 or with high probability within a range).
Most existing solutions for generating (and rating) sudoku difficulty are based on human solving techniques and approaches, which can be rated fairly easily.
One then (after having generated a sudoku configuration) re-solves it with the human-like solver and, depending on the techniques the solver used (e.g. pairs, X-wing, swordfish, etc.), assigns a difficulty rating.
Problems with this approach (and requirements for the use case I had):
To generate a sudoku of a given difficulty with the previous method, one needs to solve each sudoku twice (once with the basic algorithm and once with the human-like solver).
One has to (pre-)generate many sudokus, which can only be rated for difficulty after being solved by the human-like solver, so one cannot generate a sudoku of the desired difficulty on the fly in a single pass.
The human-like solver can be complicated and in most cases (if not all) is tightly coupled to 9x9 sudoku grids, so it does not generalise easily to other sudokus (e.g. 4x4, 6x6, 16x16).
The difficulty rating of the human-like techniques is quite subjective. For example, why is an X-wing considered more difficult than hidden singles? (I have personally solved many difficult published sudoku puzzles manually and never used such techniques.)
Another approach was used instead, which has the following benefits:
It generalises well to arbitrary sudokus (9x9, 4x4, 6x6, 16x16, etc.).
The sudoku configuration with the desired difficulty is generated once, on the fly.
The difficulty rating is objective.
How does it work?
Start from the simple fact that the more difficult the puzzle, the more time it takes to solve.
Solving time, in turn, is closely correlated with both the number of clues (givens) and the average number of alternatives to be investigated per empty cell.
Extending my previous answer: for any sudoku puzzle, the minimum number of clues is an objective property of the puzzle (for example, for 9x9 grids the minimum number of clues for a valid sudoku is 17).
One can start from there and compute a minimum number of clues per difficulty level (a linear correlation).
Furthermore, at each step of the generation process, one can make sure that the average number of alternatives (to be investigated) per empty cell stays within bounds that are a function of the desired difficulty.
Depending on whether the algorithm backtracks or not (for the use case discussed here it does no backtracking), the desired difficulty can be reached either exactly (with probability = 1) or with high probability within a range, respectively.
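As a rough, illustrative sketch of the clue-count side only (not the library's actual code; the "easiest" clue count of 40 below is an assumption, and the real generator also has to enforce the alternatives bound while placing clues):

    def clues_for_difficulty(difficulty, min_clues=17, easy_clues=40):
        """Map a difficulty in [0, 1] to a clue count by linear interpolation:
        1.0 -> the minimum number of clues, 0.0 -> a generous number of clues."""
        assert 0.0 <= difficulty <= 1.0
        return round(easy_clues - difficulty * (easy_clues - min_clues))

    def average_alternatives(grid):
        """Average number of naive candidates per empty cell of a 9x9 grid
        (0 = empty); used to keep the branching within the difficulty bounds."""
        total, empties = 0, 0
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    used = set(grid[r]) | {grid[i][c] for i in range(9)}
                    br, bc = 3 * (r // 3), 3 * (c // 3)
                    used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
                    total += 9 - len(used - {0})
                    empties += 1
        return total / empties if empties else 0.0

    print(clues_for_difficulty(0.675))   # about 24 clues for a 0.675 difficulty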
Tests of sudokus generated with this algorithm, rated with the previous approach (a human-like solver), show good correlation between the desired and the estimated difficulty, plus a greater ability to generalise to arbitrary sudoku configurations.
(I used this online sudoku solver (and also this one) to correlate the difficulty ratings of the test sudokus.)
The code is freely available on GitHub as sudoku.js (along with a sample demo application); it is a scaled-down version of CrossWord.js, a professional crossword builder in JavaScript by the same author.
The sudoku difficulty is related in an interesting way to the (minimum) amount of information needed to specify a unique solution for a given grid.
This sounds like information theory, and yes, it has applications here too.
Sudoku puzzles should have a unique solution. Furthermore, sudoku puzzles have certain symmetries: by row, by column and by sub-square.
These symmetries determine the minimum number of clues (and, more or less, their positions) needed for the solution to be unique (i.e. checked with a sudoku compiler or an algorithm like backtracking search).
That minimum number of clues corresponds to the most difficult/hard puzzle level. All other difficulty levels, from less hard down to easy, are then generated by allowing more clues than the minimum needed.
It should be noted that sudoku difficulty levels are not standard; as explained above, one can have as many or as few difficulty levels as one wants. What is standard is the minimum number (and position) of clues, which gives the hardest level and is related to the sudoku symmetries. From there one can generate as many difficulty levels as one wants simply by allowing extra/redundant clues to be visible as well.
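To make that last point concrete, a rough sketch of the "add redundant clues" step (this is not code from the library; solution, minimal_givens and extra_clues are assumed to come from the earlier generation stages):

    import random

    def puzzle_for_level(solution, minimal_givens, extra_clues):
        """solution: solved 9x9 grid; minimal_givens: set of (row, col) positions
        whose clues already force a unique solution; extra_clues: how many more
        cells to reveal for an easier level (0 = hardest)."""
        hidden = [(r, c) for r in range(9) for c in range(9)
                  if (r, c) not in minimal_givens]
        revealed = set(minimal_givens) | set(random.sample(hidden, extra_clues))
        return [[solution[r][c] if (r, c) in revealed else 0 for c in range(9)]
                for r in range(9)]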
It's not as elegant as what you ask, but you can simulate this behavior with caching:
Decide how many "buckets" you want for puzzles. For example, let's say you choose 20. Thus, your buckets will contain puzzles of different difficulty ranges: 0-.05, .05-.1, .1-.15, ..., .9-.95, .95-1
Generate a puzzle
Grade the puzzle
Put it in the appropriate bucket (or throw it away when the bucket is full)
Repeat till your buckets are "filled". The size of the buckets and where they are stored will be based on the needs of your application.
Then when a user requests a certain difficulty puzzle, give them a cached one from the bucket they choose. You might also want to consider swapping numbers and changing orientation of puzzles with known difficulties to generate similar puzzles with the same level of difficulty. Then repeat the above as needed when you need to refill your buckets with new puzzles.
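A rough sketch of the bucket cache itself (generate_puzzle and grade_puzzle stand in for whatever generator and grader you already have):

    import random

    NUM_BUCKETS = 20         # 0-.05, .05-.1, ..., .95-1
    BUCKET_CAPACITY = 50     # tune to your application's needs
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def bucket_index(difficulty):
        return min(int(difficulty * NUM_BUCKETS), NUM_BUCKETS - 1)

    def fill_buckets(generate_puzzle, grade_puzzle):
        """generate_puzzle() -> puzzle; grade_puzzle(puzzle) -> difficulty in [0, 1]."""
        while any(len(b) < BUCKET_CAPACITY for b in buckets):
            puzzle = generate_puzzle()
            b = buckets[bucket_index(grade_puzzle(puzzle))]
            if len(b) < BUCKET_CAPACITY:      # full bucket: throw the puzzle away
                b.append(puzzle)

    def get_puzzle(requested_difficulty):
        b = buckets[bucket_index(requested_difficulty)]
        return random.choice(b) if b else None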
Well, you can't know how difficult a puzzle is before you know how to solve it. And Sudoku solving (and therefore also difficulty rating) is NP-complete, which means it is (most likely) impossible to find an algorithm that is (asymptotically) faster than the proposed randomly-guess-and-check.
However, if you do find one, you will have solved the P versus NP problem and should clear a cupboard for the Fields Medal... :)

Minimizing pen lifts in a pen plotter or similar device

I'm looking for references to algorithms for plotting on a mechanical pen plotter.
Specifically, I have a list of straight vectors, each representing a line to be plotted. First I want to remove duplicate vectors, so each line is only plotted once. That's easy enough.
Second, many of the vectors intersect, sometimes at endpoints but not always. They can be plotted in any order, but I want to find an order that reduces the number of times the pen must be lifted, preferably to a minimum, though I understand that may take a long time to compute, if it's computable at all. Vectors that intersect can be broken into smaller vectors if that helps. But generally, if the pen is moving in a straight line, it's best to keep it moving that way as long as possible; so two parallel vectors joined end to end could be combined into a single vector, and so on.
This sounds like some variety of graph theory problem, but I don't know much about that. Can anyone point me to references or algorithms I need to study? Or maybe example code?
Thanks,
Neil
The problem is related to the Chinese postman problem; in the form you describe, where only some of the pen movements are required segments, it is the rural postman problem, which is NP-hard. The most well-known problem of this kind is the Travelling Salesman Problem. NP-complete problems can all be translated into each other, and there are no known algorithms that solve any of them in time polynomial in the number of nodes in the input.
For your case I would suggest a simple heuristic. Don't overdo it; just pick something quite simple, like going in a straight line as long as possible and then lifting the pen to the closest available starting point and continuing from there.
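For example, a greedy ordering along those lines might look like this (a sketch; segments are ((x1, y1), (x2, y2)) tuples and the pen starts at the origin):

    from math import dist   # Python 3.8+

    def order_segments(segments, start=(0.0, 0.0)):
        """Greedy plotting order: from the current pen position, pick the nearest
        endpoint of any unplotted segment and draw that segment from that end."""
        remaining = list(segments)
        ordered, pen = [], start
        while remaining:
            best_i, best_seg, best_d = None, None, float("inf")
            for i, (p, q) in enumerate(remaining):
                for a, b in ((p, q), (q, p)):     # try both drawing directions
                    d = dist(pen, a)
                    if d < best_d:
                        best_i, best_seg, best_d = i, (a, b), d
            ordered.append(best_seg)              # draw from best_seg[0] to best_seg[1]
            pen = best_seg[1]
            remaining.pop(best_i)
        return ordered

The number of pen lifts is then just the number of iterations in which the chosen endpoint is not already under the pen.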

How to test numerical analysis routines?

Are there any good online resources for how to create, maintain and think about writing test routines for numerical analysis code?
One of the limitations I can see for something like testing matrix multiplication is that the obvious tests (like using the identity as one of the matrices) may not fully exercise the functionality of the code.
Also, there is the fact that you are usually dealing with large data structures as well. Does anyone have some good ideas about ways to approach this, or have pointers to good places to look?
It sounds as if you need to think about testing in at least two different ways:
Some numerical methods allow for some meta-thinking. Invertible operations, for example, let you set up test cases that check whether a result lands back within acceptable error bounds of the original: matrix M-inverse times (matrix M times a random vector V) should give V again, to within some acceptable measure of error.
Obviously, this example exercises matrix inverse, matrix multiplication and matrix-vector multiplication. I like chains like these because you can generate quite a lot of random test cases and get statistical coverage that would be a slog to have to write by hand. They don't exercise single operations in isolation, though.
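A minimal sketch of such a round-trip test with NumPy (substitute your own inverse and multiply routines for np.linalg.inv and @; the tolerances depend on how well conditioned your matrices are):

    import numpy as np

    rng = np.random.default_rng(42)

    for _ in range(1000):
        n = int(rng.integers(2, 20))
        M = rng.standard_normal((n, n)) + n * np.eye(n)   # keep it well conditioned
        v = rng.standard_normal(n)

        # inverse(M) @ (M @ v) should give v back, to within rounding error.
        v_back = np.linalg.inv(M) @ (M @ v)
        assert np.allclose(v_back, v, rtol=1e-8, atol=1e-10)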
Some numerical methods have a closed-form expression of their error. If you can set up a situation with a known solution, you can then compare the difference between the solution and the calculated result, looking for a difference that exceeds these known bounds.
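For instance, the composite trapezoidal rule has the known bound |error| <= (b - a)^3 / (12 n^2) * max|f''|, which a test can check directly against a function with a known integral (a small sketch):

    import math

    def trapezoid(f, a, b, n):
        h = (b - a) / n
        return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

    a, b, n = 0.0, math.pi, 100
    approx = trapezoid(math.sin, a, b, n)
    exact = 2.0                                   # integral of sin over [0, pi]
    bound = (b - a) ** 3 / (12 * n ** 2)          # max|f''| = 1 for sin
    assert abs(approx - exact) <= bound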
Fundamentally, this question illustrates the problem that testing complex methods well requires quite a lot of domain knowledge. Specific references would require a little more specific information about what you're testing. I'd definitely recommend that you at least have Steve Yegge's recommended book list on hand.
If you're going to be doing matrix calculations, use LAPACK. This is very well-tested code. Very smart people have been working on it for decades. They've thought deeply about issues that the uninitiated would never think about.
In general, I'd recommend two kinds of testing: systematic and random. By systematic I mean exploring edge cases etc. It helps if you can read the source code. Often algorithms have branch points: calculate this way for numbers in this range, this other way for numbers in another range, etc. Test values close to the branch points on either side because that's where approximation error is often greatest.
Random input values are important too. If you rationally pick all the test cases, you may systematically avoid something that you don't realize is a problem. Sometimes you can make good use of random input values even if you don't have the exact values to test against. For example, if you have code to calculate a function and its inverse, you can generate 1000 random values and see whether applying the function and its inverse put you back close to where you started.
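A sketch of both ideas together; my_sinc and its cutoff are hypothetical, standing in for whatever branchy routine you are testing, and exp/log serve as the function/inverse pair:

    import math
    import random

    def my_sinc(x, cutoff=1e-4):
        """Example routine with a branch point: Taylor series for tiny x,
        the direct formula otherwise (the cutoff is a made-up threshold)."""
        if abs(x) < cutoff:
            return 1.0 - x * x / 6.0
        return math.sin(x) / x

    # Systematic: probe both sides of the branch point, where error is often largest.
    for x in (1e-4 - 1e-9, 1e-4 + 1e-9, -1e-4 + 1e-9, -1e-4 - 1e-9):
        assert abs(my_sinc(x) - math.sin(x) / x) < 1e-12

    # Random round trips: exp and log should undo each other.
    for _ in range(1000):
        x = random.uniform(-10.0, 10.0)
        assert math.isclose(math.log(math.exp(x)), x, rel_tol=1e-12, abs_tol=1e-12)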
Check out a book by David Gries called The Science of Programming. It's about proving the correctness of programs. If you want to be sure that your programs are correct (to the point of proving their correctness), this book is a good place to start.
Probably not exactly what you're looking for, but it's the computer science answer to a software engineering question.