Which searching technique is suitable for performing a small number of searches over a large collection of unsorted elements? - binary-search

Let's say I have 1000 elements in an array and I want to search for 10 elements in that array; which searching mechanism is most appropriate?
Also, if I instead need to search for 900 elements in the same array, which search method is the better one?
Linear or binary search?
Thanks in advance.

If the elements aren't sorted, then you can't do a binary search. But binary search is so much faster than linear search (you'd need to look at an average of about 10 elements rather than 500, since log2(1000) ≈ 10) that you'd be best off sorting your list (using an algorithm such as quicksort) and then doing a binary search.
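To make the trade-off concrete, here is a small Python sketch (illustrative, not from the original answer) of the sort-once-then-binary-search approach for the 1000-element / 10-lookup scenario from the question:

```python
import bisect
import random

data = random.sample(range(10_000), 1000)   # 1000 unsorted elements
queries = random.sample(data, 10)           # 10 values to look up

# Linear search: O(n) per lookup, no preprocessing needed.
def linear_search(arr, target):
    for i, x in enumerate(arr):
        if x == target:
            return i
    return -1

# Sort once (O(n log n)), then each lookup is O(log n).
sorted_data = sorted(data)

def binary_search(arr, target):
    i = bisect.bisect_left(arr, target)   # arr must be sorted
    if i < len(arr) and arr[i] == target:
        return i
    return -1

for q in queries:
    assert linear_search(data, q) != -1
    assert binary_search(sorted_data, q) != -1
```

For 900 lookups the argument is even stronger: the one-time O(n log n) sorting cost is amortized over many O(log n) searches, instead of paying O(n) per lookup.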

Related

Is binary search for an ordered list O(logN) in Elixir?

For an ordered list, binary search has O(log N) time complexity. However, in Elixir a list is a linked list, so to get the middle element you have to iterate N/2 times, which makes the overall search O(N log N).
So my question is:
Is the above time complexity correct?
If it is, binary search wouldn't make sense in Elixir, right? You have to iterate the list to get what you want, so the best you can do is O(N).
Yes, there is little reason to binary search over a linked list because of the reason you stated. You need a random access data structure (usually an array) for binary search to be useful.
An interesting corner case arises when comparing elements is very costly, for example because they are just handles to remotely stored items. In that case binary search through a linked list might still outperform linear search: although it needs more pointer traversals (O(N log N) if each probe restarts from the head, or O(N) if each probe continues within the current sublist), it requires only O(log N) comparisons, while linear search requires O(N) comparisons.
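A sketch of that corner case in Python (the linked-list type and helper names are invented for illustration): binary search over a sorted singly linked list, counting comparisons. The traversal cost is linear, but the comparison count stays logarithmic.

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def from_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def advance(node, steps):
    for _ in range(steps):
        node = node.next
    return node

def binary_search_ll(head, length, target):
    """Binary search over a sorted linked list of `length` nodes.

    Each probe continues from the current sublist head, so pointer
    traversals total O(N) (N/2 + N/4 + ...), while only O(log N)
    (potentially expensive) comparisons are made.
    """
    comparisons = 0
    start, n = head, length
    while n > 0:
        mid = advance(start, n // 2)
        comparisons += 1
        if mid.value == target:
            return True, comparisons
        elif mid.value < target:
            start, n = mid.next, n - n // 2 - 1
        else:
            n = n // 2
    return False, comparisons

head = from_list(list(range(0, 200, 2)))   # sorted: 0, 2, ..., 198
found, cmps = binary_search_ll(head, 100, 138)
```

For 100 nodes, `cmps` stays within about log2(100) ≈ 7 comparisons, versus up to 100 for a linear scan.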

When is sequential search better than binary search?

I know that:
A linear search looks down a list, one item at a time, without jumping. In complexity terms this is an O(n) search - the time taken to search the list grows at the same rate as the list does.
A binary search starts at the middle of a sorted list and checks whether that element is greater than or less than the value you're looking for, which determines whether the value is in the first or second half of the list. Then you jump to halfway through the sublist, compare again, and so on.
Is there a case where sequential/linear search becomes more efficient than binary search?
Yes, e.g. when the item you are looking for happens to be one of the first to be looked at in a sequential search.
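A minimal Python illustration of that case, counting comparisons for a target that happens to sit at the front of the list:

```python
def linear_search_count(arr, target):
    """Return the number of comparisons a linear scan makes."""
    comparisons = 0
    for x in arr:
        comparisons += 1
        if x == target:
            return comparisons
    return comparisons  # not found: compared against every element

def binary_search_count(arr, target):
    """Return the number of comparisons a binary search makes."""
    comparisons, lo, hi = 0, 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if arr[mid] == target:
            return comparisons
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

arr = list(range(1000))
# Target at the very front: the linear scan finds it in 1 comparison,
# while binary search still needs several halving steps to reach index 0.
```

On average over random targets, of course, binary search wins by a wide margin; this is specifically the lucky-front-element case.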

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (1 XML file per image). I have indexed those XMLs in Solr. Now, while retrieving them, I want to apply a threshold on the score values, so that docs with a lower score will not appear in the result (because they are not of much importance). For example, I want to retrieve all documents having a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask is that score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query against a specific instance of the index. In other words, it isn't useful to filter on, because there is no way of knowing what a good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's bad to cut off at some fixed value, because you'll never know which threshold is best: for a good query it could be score=2, for a bad one score=0.5, etc.
These two links should explain why you DON'T want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries instead, so that they return results with higher precision (http://en.wikipedia.org/wiki/Precision_and_recall).
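For completeness, the approach in the linked answer relies on Solr's frange query parser, which can filter on the value of a function such as query(). A sketch (the qq parameter name and field are illustrative, and the caveats above about score thresholds still apply):

```
q={!frange l=2.0}query($qq)
qq=text:wikipedia
```

Here l=2.0 sets the lower bound, so only documents whose score for the qq query is at least 2.0 are returned.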

Lucene Paging with search

Hello, I am currently using Lucene 4.6.1.
In my design I need to be able to search and page through possibly many results, so I have some general questions about optimization.
First, in search(Query q, int n), what is the purpose of the parameter n? Is n different from totalHits()? How should this number be chosen, and based on what criteria?
Second, it seems that there are two general algorithms for paging: I can either use searchAfter or process the ScoreDoc[] given a page size.
Which way do most people currently recommend, and what design considerations are involved?
searchAfter can be used for efficient "deep paging".
A tutorial on using it with Solr
http://heliosearch.org/solr/paging-and-deep-paging/
The int passed to search is the maximum number of hits the search will retrieve. totalHits, from the TopDocs is the total number of hits for the query. It may be more or less than the value passed in.
Not clear to me what you mean by processing the ScoreDoc array. searchAfter is specifically intended to be used for pagination. Use it.
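The difference between the two paging styles can be sketched in plain Python (this models the idea only, not Lucene's actual API): offset-style paging has to materialize everything up to the requested page, while cursor-style paging, which is what searchAfter implements, only keeps hits ranking after the last hit of the previous page.

```python
# Hits as (score, doc_id) tuples; higher score ranks first.
hits = sorted(((s, i) for i, s in enumerate([0.3, 0.9, 0.5, 0.7, 0.1, 0.8])),
              reverse=True)

def page_by_offset(hits, page, size):
    # Offset paging: the engine must collect page*size + size top hits
    # and throw the first page*size away -- cost grows with page depth.
    return hits[page * size:(page + 1) * size]

def page_after(hits, cursor, size):
    # Cursor ("search after") paging: keep only hits ranking strictly
    # after the previous page's last hit, so each page costs roughly
    # the same no matter how deep you are.
    if cursor is None:
        return hits[:size]
    return [h for h in hits if h < cursor][:size]

page1 = page_after(hits, None, 2)
page2 = page_after(hits, page1[-1], 2)   # pass the last hit as cursor
```

In Lucene the cursor is the last ScoreDoc of the previous TopDocs, passed to searchAfter; the principle is the same.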

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000 character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.
Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matches.
The Simmetrics library is a comprehensive collection of string distance approaches.
Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing(LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.
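To illustrate the prune-then-verify idea with q-grams, here is a Python sketch (the function names and threshold derivation are illustrative; a real system would use an inverted q-gram index over the database rather than scanning every candidate):

```python
from collections import Counter

def qgram_counts(s, q=3):
    """Multiset of the q-grams (length-q substrings) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def levenshtein(a, b):
    """Standard O(|a|*|b|) dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def approximate_matches(query, candidates, max_dist, q=3):
    """Count filter + verify.

    Each edit destroys at most q of the query's len(query) - q + 1
    q-gram occurrences, so any string within max_dist edits must share
    at least len(query) - q + 1 - max_dist * q q-grams (counted with
    multiplicity) with the query.  Candidates below that bound are
    skipped without computing the expensive edit distance.
    """
    qq = qgram_counts(query, q)
    threshold = len(query) - q + 1 - max_dist * q
    results = []
    for cand in candidates:
        shared = sum((qq & qgram_counts(cand, q)).values())
        if shared < threshold:
            continue  # cheap filter: cannot be within max_dist edits
        d = levenshtein(query, cand)
        if d <= max_dist:
            results.append((cand, d))
    return results
```

Note that for the sizes in the question (1000-character queries, hundreds of edits) the filter bound becomes loose, which is exactly why heavy pruning of the search space first, as suggested above, matters more than the verification step.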