Best performance approach to find all combinations of numbers from a given set (>80 elements) to reach a given final sum - sql

Before I am directed to go and keep searching instead of asking this general question, please understand my question in detail.
We have an algorithm that does this in PL/SQL; however, it does not perform well when the given set contains a large number of elements. For example, it works well when the set has around 22 elements, but beyond that the performance dies.
We are working with Oracle Database 12c. This combination search is part of one of our applications; the numbers are pulled from Oracle tables into associative arrays before the combinations are searched. Example: required final sum = 30,
set of elements to choose from: {1, 2, 4, 6, 7, 2, 8, 10, 5}, and so forth.
My question, in gist:
Is PL/SQL realistically suited to writing such an algorithm? Should we be looking at another programming language, technology, server capacity, or tool to handle larger sets of more than 80 elements?

Oracle is not the right tool for solving this problem, because databases in general are not suited to it. This is the subset-sum problem, which is NP-complete, so there are no known truly efficient solutions.
The approach in a database is to generate all possible combinations up to a certain size, and then filter down to the ones that match your sum. This is inherently an exponential algorithm. There may be some heuristic algorithms that come close to solving the problem, but this is an inherently hard problem.

Unless you can find some special condition to shrink the problem, you will never solve it. Don't worry about the language implementation until you know the problem is even computationally feasible.
As others have mentioned, this problem grows exponentially. Solving it for 22 elements is not even close to solving it for 80.
A dynamic programming algorithm can quickly determine whether at least one solution to a subset-sum instance exists. But finding all solutions can require examining up to 2^80 subsets.
2^80 = 1,208,925,819,614,629,174,706,176. That's 1.2e24.
That's a big number. Let's make a wildly optimistic assumption that a processor can test one billion sets a second. Buy a million of them and you can find your answer in about 38 years. Maybe a quantum computer can solve it more quickly some day.
It might help to explain exactly what you're trying to do. Unless there is some special condition, some way to eliminate most of the processing and avoid a brute-force solution, I don't see any hope for solving this problem. Perhaps this is a question for the Theoretical Computer Science site.
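To make the dynamic programming point concrete, here is a rough sketch in Python (illustrative only, not PL/SQL; the numbers come from the example in the question). The feasibility check is fast, but enumerating every matching subset still blows up as the set grows:

    # Sketch: subset-sum feasibility via dynamic programming, plus brute-force
    # enumeration of all matching subsets.
    from itertools import combinations

    def subset_sum_exists(numbers, target):
        """O(len(numbers) * target) check: can some subset reach the target?"""
        reachable = {0}
        for n in numbers:
            reachable |= {s + n for s in reachable if s + n <= target}
        return target in reachable

    def all_subsets_with_sum(numbers, target):
        """Enumerates every matching combination -- exponential in len(numbers)."""
        for r in range(1, len(numbers) + 1):
            for combo in combinations(numbers, r):
                if sum(combo) == target:
                    yield combo

    nums = [1, 2, 4, 6, 7, 2, 8, 10, 5]
    print(subset_sum_exists(nums, 30))           # True, found quickly
    print(list(all_subsets_with_sum(nums, 30)))  # fine for 9 elements, hopeless for 80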

Improving our algorithm with SQLite vs storing everything in memory

The problem...I’m trying to figure out a way to make our algorithm faster.
Our algorithm...is written in C and runs on an embedded Linux system with little memory and a lackluster CPU. The entire algorithm makes heavy use of 2D arrays and stores them all in memory. At a high level, the algorithm's input data, which is a single array of 250 doubles (0.01234, 0.02532, ..., 0.1286), is compared to a larger 2D array, which is 20k+ rows x 250 doubles. The input data is compared against the 20k+ rows using a for loop. For each iteration, the algorithm performs computations and stores those results in memory.
I’m not an embedded software developer, I am a cloud developer that uses databases (Postgres, mainly). Our embedded software doesn’t make use of any databases and, since that is what I know, I thought I’d look into SQLite.
My approach...applying what I know about databases, I'd go about it this way: I would have a single table with 6 columns: id, array, computation_1, computation_2, computation_3, and computation_4. I'd store all 20k+ rows in this table with the computation_* columns initially defaulted to null. Then I'd have the algorithm loop through each entry and update the values for each computation_* column accordingly. For graphical purposes, the table would look something like this:

    id | array                          | computation_1 | computation_2 | computation_3 | computation_4
    1  | 0.01234, 0.02532, ..., 0.1286  | null          | null          | null          | null
    2  | ...                            | null          | null          | null          | null
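To make that concrete, here is a rough sketch of the idea using Python's built-in sqlite3 module (illustrative only; our real code is C, and the compute_1 function below is just a placeholder I made up):

    # Sketch of the proposed schema and update loop. Not the real algorithm.
    import json, sqlite3

    conn = sqlite3.connect("results.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS results (
        id INTEGER PRIMARY KEY,
        array TEXT,                    -- the 250 doubles, serialized (e.g. JSON)
        computation_1 REAL, computation_2 REAL,
        computation_3 REAL, computation_4 REAL)""")

    input_data = [0.01234, 0.02532, 0.1286]    # stand-in for the 250 input doubles

    def compute_1(row, inp):                   # placeholder computation
        return sum(a * b for a, b in zip(row, inp))

    rows = conn.execute("SELECT id, array FROM results").fetchall()
    for row_id, blob in rows:
        row = json.loads(blob)
        conn.execute("UPDATE results SET computation_1 = ? WHERE id = ?",
                     (compute_1(row, input_data), row_id))
    conn.commit()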
Storing arrays in a database doesn't seem like a good fit so I don't immediately understand if there is a benefit to doing this. But, it seems like it would replace the extensive use of malloc()/calloc() we have baked into the algorithm.
My question is...can SQLite help speed up our algorithm if I use it in the way I've described? Since I don’t know how much benefit this would provide, if any, I thought I’d ask the experts here on SO before going down this path. If it will (or won't) provide an improvement, I'd like to know why from a technical standpoint so that I can learn.
Thanks in advance.
As you have described it so far, SQLite won't help you.
A relational database stores data into tables with various indexes and so on. When it receives SQL, it compiles it into a bytecode program, and then it runs that bytecode program in an interpreter against those tables. You can learn more about SQLite's bytecode from https://www.sqlite.org/opcode.html.
This has a lot of overhead compared to native data structures in a low-level language. In my experience the difference is up to several orders of magnitude.
Why, then, would anyone use a database? Because replicating what it does yourself means writing a lot of potentially buggy code, doubly so if you've got multiple users at the same time. Furthermore, the database query optimizer can find efficient plans for computing complex joins that are orders of magnitude better than what most programmers produce on their own.
So a database is not a recipe for doing arbitrary calculations more efficiently. But if you can describe what you are doing in SQL (particularly if it involves joins), the database may be able to find a much more efficient calculation than the one you're currently performing.
Even then, squeezing performance out of a low-end embedded system is exactly the situation where it may be worth figuring out what a database would do, and then writing code to do that directly.

Need Help Studying Running Times

At the moment, I'm studying for a final exam for a Computer Science course. One of the questions that will be asked is most likely a question on how to combine running times, so I'll give an example.
I was wondering: if I created a program that preprocessed its inputs using Insertion Sort and then searched for a value "X" using Binary Search, how would I combine the running times to find the best-, worst-, and average-case time complexities of the overall program?
For example...
Insertion Sort
Worst Case O(n^2)
Best Case O(n)
Average Case O(n^2)
Binary Search
Worst Case O(log n)
Best Case O(1)
Average Case O(log n)
Would the worst case be O(n^2 + log n), or would it be O(n^2), or neither?
Would the best case be O(n)?
Would the average case be O(n log n), O(n + log n), O(log n), O(n^2 + log n), or none of these?
I tend to over-think solutions, so if I can get any guidance on combining running times, it would be much appreciated.
Thank you very much.
You usually don't "combine" (as in add) the running times to determine the overall efficiency class; rather, you take the one that takes the longest for each of the worst, average, and best cases. (Adding them gives the same class anyway, since O(n^2 + log n) = O(n^2).)
So if you're going to perform insertion sort and then do a binary search to find an element X in an array, the worst case is O(n^2), the average case is O(n^2), and the best case is O(n): all of these come from insertion sort, since it dominates the binary search.
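If it helps to see it in code, here is a toy sketch of the combined program in Python; the comments carry the complexity argument:

    import bisect

    def insertion_sort(a):
        # Worst/average case O(n^2), best case O(n) on already-sorted input.
        for i in range(1, len(a)):
            key = a[i]
            j = i - 1
            while j >= 0 and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return a

    def contains(a, x):
        # Preprocess, then search: total cost O(n^2 + log n) = O(n^2) worst case.
        insertion_sort(a)
        i = bisect.bisect_left(a, x)       # binary search, O(log n)
        return i < len(a) and a[i] == x

    print(contains([5, 2, 9, 1, 7], 9))    # True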
Based on my limited study (we haven't covered amortization yet, so that may be where Jim has the rest covered), you basically go by whichever part of the overall algorithm is slowest.
This seems to be a good book on the subject of algorithms (I don't have much to compare it to):
http://www.amazon.com/Introduction-Algorithms-Third-Thomas-Cormen/dp/0262033844/ref=sr_1_1?ie=UTF8&qid=1303528736&sr=8-1
Also, MIT has a full course on algorithms on their site; here is the link for that too!
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/
I've actually found it helpful; it might not answer your specific question, but I think seeing some of the topics explained a few times will help you feel more confident.

Optimization algorithm question

This may be a simple question for those in the know, but I cannot figure it out by myself.
Suppose there are a large number of objects that I need to select some from. Each object has two known variables: cost and benefit. I have a budget, say $1000. How could I find out which objects I should buy to maximize the total benefit within the given budget? I want a numeric optimization solution. Thanks!
Your problem is called the "knapsack problem". You can read more on the wikipedia page. Translating the nomenclature from your original question into that of the wikipedia article, your problem's "cost" is the knapsack problem's "weight". Your problem's "benefit" is the knapsack problem's "value".
Finding an exact solution is an NP-complete problem, so be prepared for slow results if you have a lot of objects to choose from!
You might also look into Linear Programming. From MathWorld:
Simplistically, linear programming is the optimization of an outcome based on some set of constraints using a linear mathematical model.
Yes, as stated before, this is the knapsack problem, and I would solve it with dynamic programming.
The key to this problem is storing data so that you do not need to recompute things more than once (if enough memory is available). There are two general ways to go about dynamic programming: top-down and bottom-up. This one is a bottom-up problem.
In general: find base-case values, i.e., the optimal object to select for a small budget. Then build on this: if we allow ourselves to spend a little more money, what is the best combination of objects for that small increment? Possibilities include taking more of what you previously had, taking one new object and replacing an old one, taking another small object that still keeps you under budget, etc.
Like I said, the main idea is to not recompute values. If you follow this pattern, you will work your way up to a large budget and find that, in order to buy X dollars' worth of goods, the best solution combines what you already computed for smaller cases.
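To make the bottom-up idea concrete, here is a minimal 0/1 knapsack sketch in Python (the costs and benefits are made-up example values, and integer costs are assumed so the budget can index a table):

    def knapsack(costs, benefits, budget):
        """Bottom-up 0/1 knapsack: best[i][b] = max benefit using the first i
        objects with budget b."""
        n = len(costs)
        best = [[0] * (budget + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for b in range(budget + 1):
                best[i][b] = best[i - 1][b]                      # skip object i
                if costs[i - 1] <= b:                            # or take it
                    best[i][b] = max(best[i][b],
                                     best[i - 1][b - costs[i - 1]] + benefits[i - 1])
        # Recover which objects were chosen
        chosen, b = [], budget
        for i in range(n, 0, -1):
            if best[i][b] != best[i - 1][b]:
                chosen.append(i - 1)
                b -= costs[i - 1]
        return best[n][budget], chosen

    # Made-up example: costs in dollars, benefits in arbitrary units, $1000 budget.
    print(knapsack([300, 450, 250, 700], [60, 100, 45, 160], 1000))  # (220, [3, 0])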

How to evaluate a search engine?

I am a student carrying out a study to enhance a search engine's existing algorithm.
I want to know how I can evaluate the search engine - which I have improved - to quantify how much the algorithm was improved.
How should I go about comparing the old and new algorithm?
Thanks
This is normally done by creating a test suite of questions and then evaluating how well the search response answers those questions. In some cases the responses should be unambiguous (if you type slashdot into a search engine you expect to get slashdot.org as your top hit), so you can think of these as a class of hard queries with 'correct' answers.
Most other queries are inherently subjective. To minimise bias you should ask multiple users to try your search engine and rate the results for comparison with the original. Here is an example of a computer science paper that does something similar:
http://www.cs.uic.edu/~liub/searchEval/SearchEngineEvaluation.htm
Regarding specific comparison of the algorithms: it may be obvious, but what you measure depends on what you're interested in knowing. For example, you can compare computational efficiency, memory usage, crawling overhead, or time to return results. If you are trying to produce very specific behaviour, such as running specialist searches (e.g. a literature search) with certain parameters, then you need to test that explicitly.
Heuristics for relevance are also a useful check. For example, when someone uses search terms that are probably 'programming-related', do you tend to get more results from stackoverflow.com? Would your search results be better if you did? If you are providing a set of trust weightings for specific sites or domains (e.g. rating .edu or .ac.uk domains as more trustworthy for technical results), then you need to test the effectiveness of these weightings.
First, let me start out by saying: kudos to you for attempting to apply traditional research methods to search engine results. Many SEOs have done this before you and generally keep the results to themselves, since sharing "amazing findings" usually means you can no longer exploit them or keep the upper hand. That said, I will share some pointers and things to look for as best I can.
Identify which part of the algorithm you are trying to improve.
Different searches execute different algorithms.
Broad Searches
For instance, in a broad-term search, engines tend to return a variety of results. Common parts of these results include:
News Feeds
Products
Images
Blog Posts
Local Results (based on a geo-IP lookup).
Which of these result types are thrown into the mix can vary based on the word.
Example: Cats returns images of cats, and news, Shoes returns local shopping for shoes. (this is based on my IP in Chicago on October 6th)
The goal in returning results for a broad term is to provide a little bit of everything for everyone so that everyone is happy.
Regional Modifiers
Generally, any time a regional term is attached to a search, it will modify the results greatly. If you search for "Chicago web design", because the word Chicago is attached, the results will start with the top 10 regional results (these are the one-liners to the right of the map); after that, 10 listings will display in the general "result fashion".
The results in the "top ten local" tend to be drastically different from those in the organic listings below. This is because the local results (from Google Maps) rely on entirely different data for ranking.
Example: Having a phone number on your website with a Chicago area code will help in local results... but NOT in the general results. The same goes for an address, a yellow book listing, and so forth.
Results Speed
Currently (as of 10/06/09), Google is beta testing "Caffeine". The main highlight of this engine build is that it returns results in almost half the time. Although you may not consider Google slow now... speeding up an algorithm is important when millions of searches happen every hour.
Reducing Spam Listings
We have all experienced a search that was riddled with spam. The new release of Google Caffeine http://www2.sandbox.google.com/ is a good example. Over the last 10+ years, one of the largest battles online has been between search engine optimizers and search engines. Gaming Google (and other engines) is highly profitable, and it is what Google spends much of its time combating.
A good example is, again, the new release of Google Caffeine. So far, my research (and that of a few others in the SEO field) finds this to be the first build in over 5 years to put more weight on on-site elements (such as keywords, internal site linking, etc.) than prior builds did. Before this, each "release" seemed to favor inbound links more and more... this is the first to take a step back towards "content".
Ways to test an algorithm
Compare two builds of the same engine. This is currently possible by comparing Caffeine (see the link above, or search Google for "google caffeine") with the current Google.
Compare local results in different regions. Try finding search terms, like "web design", that return local results without a local keyword modifier. Then use a proxy (found via Google) to search from various locations. You will want to make sure you know the proxy's location (find a site on Google that will tell you your IP address's geo-IP zip code or city). Then you can see how different regions return different results.
Warning... DON'T pick the term "locksmith", and be wary of any terms whose results have LOTS of spammy listings. Google Local is fairly easy to spam, especially in competitive markets.
Do as mentioned in a prior answer: compare how many "click backs" users require to find a result. You should know that, currently, no major engine uses "bounce rate" as an indicator of a site's accuracy. This is PROBABLY because it would be EASY to make it look like your result has a bounce rate in the 4-8% range without actually having one that low... in other words, it would be easy to game.
Track how many search variations users need, on average, for a given term in order to find the result they want. This is a good indicator of how well an engine is guessing the query type (as mentioned WAY up in this answer).
Disclaimer: These views are based on my industry experience as of October 6th, 2009. One thing about SEO and engines is that they change EVERY DAY. Google could release Caffeine tomorrow, and that would change a lot... that said, this is the fun of SEO research!
Cheers
In order to evaluate something, you have to define what you expect from it. This will help to define how to measure it.
Then, you'll be able to measure the improvement.
Concerning a search engine, I guess that you might be able to measure its ability to find things and its accuracy in returning what is relevant.
It's an interesting challenge.
I don't think you will find a final mathematical solution if that is your goal. In order to rate a given algorithm, you require standards and goals that must be accomplished.
What is your baseline to compare against?
What do you classify as "improved"?
What do you consider a "successful search"?
How large is your test group?
What are your tests?
For example, if your goal is to improve the process of page ranking, then decide whether you are judging the efficiency of the algorithm or its accuracy. Judging efficiency means that you time your code on a consistent, large data set and record the results. You would then work with your algorithm to improve the time.
If your goal is to improve accuracy then you need to define what is "inaccurate". If you search for "Cup" you can only say that the first site provided is the "best" if you yourself can accurately define what is the best answer for "Cup".
My suggestion for you would be to narrow the scope of your experiment. Define one or two qualities of a search engine that you feel need refinement and work towards improving them.
In the comments you've said, "I have heard about a way to measure the quality of the search engines by counting how many times a user needs to click a back button before finding the link he wants, but I can't use this technique because you need users to test your search engine and that is a headache itself". Well, if you put your engine on the web for free for a few days and advertise a little, you will probably get at least a couple dozen tries. Provide these users with the old or new version at random, and measure those clicks.
Other possibility: assume Google is by definition perfect, and compare your answers to Google's for certain queries. (Maybe the sum of the distances of your top ten links to their counterparts at Google; for example, if your second link is Google's twelfth link, that's a distance of 10.) That's a huge assumption, but far easier to implement.
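A rough sketch of that distance measure in Python (the penalty for links that Google doesn't return at all is my own assumption; tune it as you like):

    def rank_distance(my_top10, google_results, missing_penalty=50):
        """Sum over my top-10 links of |my rank - Google's rank|.
        Lower means 'closer to Google'. Links Google doesn't return at all get a
        flat penalty (an assumption for this sketch)."""
        total = 0
        for my_rank, url in enumerate(my_top10, start=1):
            if url in google_results:
                total += abs(my_rank - (google_results.index(url) + 1))
            else:
                total += missing_penalty
        return total

    # e.g. my 2nd link being Google's 12th link contributes |2 - 12| = 10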
Information scientists commonly use precision and recall as two competing measures of quality for an information retrieval system (like a search engine).
So you could measure your search engine's performance relative to Google's by, for example, counting the number of relevant results in the top 10 (call that precision) and the number of important pages for that query that you think should have been in the top 10 but weren't (call that recall).
You'll still need to compare the results from each search engine by hand on some set of queries, but at least you'll have one metric to evaluate them on. And the balance of these two is important too: otherwise you can trivially get perfect precision by not returning any results or perfect recall by returning every page on the web as a result.
The Wikipedia article on precision and recall is quite good (and defines the F-measure which takes into account both).
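A minimal sketch of those two measures over a top-10 result list (you still need a hand-labelled set of relevant pages per query):

    def precision_recall(top10, relevant):
        """top10: ranked result URLs; relevant: hand-labelled set of pages that
        should have appeared. Returns (precision, recall) for the top 10."""
        hits = sum(1 for url in top10 if url in relevant)
        precision = hits / len(top10) if top10 else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def f_measure(precision, recall):
        """Harmonic mean of precision and recall (the F1 score)."""
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0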
I have had to test a search engine professionally. This is what I did.
The search included fuzzy logic. The user would type "Kari Trigger" into a web page, and the search engine would retrieve entries like "Gary Trager", "Trager, C", "Corey Trager", etc., each with a score from 0 to 100 so that I could rank them from most likely to least likely.
First, I re-architected the code so that it could be executed apart from the web page, in a batch mode that took a big file of search queries as input. For each line in the input file, the batch mode would write out the top search result and its score. I harvested thousands of actual search queries from our production system and ran them through the batch setup in order to establish a baseline.
From then on, each time I modified the search logic, I would run the batch again and then diff the new results against the baseline. I also wrote tools to make it easier to see the interesting parts of the diff. For example, I didn't really care if the old logic returned "Corey Trager" as an 82 and the new logic returned it as an 83, so my tools would filter those out.
I could not have accomplished as much by hand-crafting test cases. I just wouldn't have had the imagination and insight to have created good test data. The real world data was so much richer.
So, to recap:
1) Create a mechanism that lets you diff the results of running new logic versus the results of prior logic.
2) Test with lots of realistic data.
3) Create tools that help you work with the diff, filtering out the noise, enhancing the signal.
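A rough sketch of steps 1 and 3 in Python, assuming each batch run writes lines of the form query<TAB>top_result<TAB>score (that file format is just an assumption for the sketch):

    def load_results(path):
        """Parse one batch run into: query -> (top_result, score)."""
        results = {}
        with open(path) as f:
            for line in f:
                query, top_result, score = line.rstrip("\n").split("\t")
                results[query] = (top_result, float(score))
        return results

    def interesting_diffs(baseline_path, new_path, score_tolerance=5.0):
        """Yield queries whose top result changed, or whose score moved by more
        than score_tolerance; small score wobbles are filtered out as noise."""
        baseline = load_results(baseline_path)
        new = load_results(new_path)
        for query, (old_hit, old_score) in baseline.items():
            new_hit, new_score = new.get(query, ("<missing>", 0.0))
            if old_hit != new_hit or abs(old_score - new_score) > score_tolerance:
                yield query, (old_hit, old_score), (new_hit, new_score)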
You have to clearly identify positive and negative qualities such as how fast one gets the answer they are seeking or how many "wrong" answers they get on the way there. Is it an improvement if the right answer is #5 but the results are returned 20 times faster? Things like that will be different for each application. The correct answer may be more important in a corporate knowledge base search but a fast answer may be needed for a phone support application.
Without parameters no test can be claimed to be a victory.
Embrace the fact that the quality of search results is ultimately subjective. You should have multiple scoring algorithms for your comparison: the old one, the new one, and a few control groups (e.g. scoring by URI length or page size or some similarly intentionally broken concept). Now pick a bunch of queries that exercise your algorithms, say a hundred or so. Let's say you end up with 4 algorithms in total. Make a 4x5 table displaying the first 5 results of a query across each algorithm. (You could do the top ten, but the first five are way more important.) Be sure to randomize which algorithm appears in each column. Then plop a human in front of this thing and have them pick which of the 4 result sets they like best. Repeat across your entire query set. Repeat for as many more humans as you can stand. This should give you a fair comparison based on total wins for each algorithm.
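If you want to automate the bookkeeping, here is a small sketch that assumes each algorithm is a function from a query to a ranked result list; the judging itself is still a human looking at the table:

    import random
    from collections import Counter

    def run_judging_session(algorithms, queries, judge):
        """algorithms: dict of name -> fn(query) returning ranked results.
        judge: fn(query, list_of_result_lists) -> index of the preferred column.
        Column order is shuffled per query so the judge can't learn which is which."""
        wins = Counter()
        for query in queries:
            names = list(algorithms)
            random.shuffle(names)                      # hide which column is which
            columns = [algorithms[name](query)[:5] for name in names]
            preferred = judge(query, columns)          # human picks a column index
            wins[names[preferred]] += 1
        return wins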
http://www.bingandgoogle.com/
Create an app like this one that compares and extracts the data. Then run a test with 50 different things you need to look for and compare them with the results you want.

How to test numerical analysis routines?

Are there any good online resources for how to create, maintain and think about writing test routines for numerical analysis code?
One of the limitations I can see for something like testing matrix multiplication is that the obvious tests (like having one matrix being the identity) may not fully test the functionality of the code.
Also, there is the fact that you are usually dealing with large data structures as well. Does anyone have some good ideas about ways to approach this, or have pointers to good places to look?
It sounds as if you need to think about testing in at least two different ways:
Some numerical methods allow for some meta-thinking. For example, invertible operations let you set up test cases that check whether the result is within acceptable error bounds of the original: matrix M-inverse times (matrix M times a random vector V) should give back V, to within some acceptable measure of error.
Obviously, this example exercises matrix inverse, matrix multiplication and matrix-vector multiplication. I like chains like these because you can generate quite a lot of random test cases and get statistical coverage that would be a slog to have to write by hand. They don't exercise single operations in isolation, though.
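For example, a small sketch of that round-trip check with NumPy (the tolerances here are a judgment call and really ought to be tied to the conditioning of M):

    import numpy as np

    rng = np.random.default_rng(0)

    def check_inverse_roundtrip(n=50, trials=200, rtol=1e-7, atol=1e-10):
        """inv(M) @ (M @ v) should give v back, within floating-point error."""
        for _ in range(trials):
            M = rng.standard_normal((n, n))
            v = rng.standard_normal(n)
            v_back = np.linalg.inv(M) @ (M @ v)
            if not np.allclose(v_back, v, rtol=rtol, atol=atol):
                return False
        return True

    print(check_inverse_roundtrip())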
Some numerical methods have a closed-form expression of their error. If you can set up a situation with a known solution, you can then compare the difference between the solution and the calculated result, looking for a difference that exceeds these known bounds.
Fundamentally, this question illustrates the problem that testing complex methods well requires quite a lot of domain knowledge. Specific references would require a little more specific information about what you're testing. I'd definitely recommend that you at least have Steve Yegge's recommended book list on hand.
If you're going to be doing matrix calculations, use LAPACK. This is very well-tested code. Very smart people have been working on it for decades. They've thought deeply about issues that the uninitiated would never think about.
In general, I'd recommend two kinds of testing: systematic and random. By systematic I mean exploring edge cases etc. It helps if you can read the source code. Often algorithms have branch points: calculate this way for numbers in this range, this other way for numbers in another range, etc. Test values close to the branch points on either side because that's where approximation error is often greatest.
Random input values are important too. If you rationally pick all the test cases, you may systematically avoid something that you don't realize is a problem. Sometimes you can make good use of random input values even if you don't have the exact values to test against. For example, if you have code to calculate a function and its inverse, you can generate 1000 random values and see whether applying the function and its inverse put you back close to where you started.
Check out a book by David Gries called The Science of Programming. It's about proving the correctness of programs. If you want to be sure that your programs are correct (to the point of proving their correctness), this book is a good place to start.
Probably not exactly what you're looking for, but it's the computer science answer to a software engineering question.