Complexity of a Lucene search - lucene

If I write an algorithm that performs a search using Lucene, how can I state its computational complexity? I know that Lucene uses tf*idf scoring, but I don't know how it is implemented. I've found that tf*idf has the following complexity:
where D is the set of documents and T the set of all terms.
However, I need someone who could check whether this is correct and explain to me why.
Thank you

Lucene basically uses a Vector Space Model (VSM) with a tf-idf scheme. So, in the standard setting we have:
A collection of documents each represented as a vector
A text query also represented as a vector
We determine the K documents of the collection with the highest vector space scores on the query q. Typically, we seek these K top documents ordered by score in decreasing order; for instance many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
The basic algorithm for computing vector space scores is:
float Scores[N] = 0
Initialize Length[N]
for each query term t
    do calculate w(t,q) and fetch postings list for t (stored in the index)
       for each pair (d, tf(t,d)) in postings list
           do Scores[d] += wf(t,d) × w(t,q)   (dot product)
Read the array Length[d]
for each d
    do Scores[d] = Scores[d] / Length[d]
return Top K components of Scores[]
The array Length holds the lengths (normalization factors) for each of the N
documents, whereas the array Scores holds the scores for each of the documents.
tf is the term frequency of a term in a document.
w(t,q) is the weight of term t in the submitted query. Note that the query is treated as a bag of words, so its vector of weights can be handled as if it were another document.
wf(t,d) is the logarithmically weighted term frequency of term t in document d.
As described here: Complexity of vector dot-product, vector dot-product is O(n). Here the dimension is the number of terms in our vocabulary: |T|, where T is the set of terms.
So, the time complexity of this algorithm is:
O(|Q|· |D| · |T|) = O(|D| · |T|)
We consider |Q| fixed, where Q is the set of words in the query (its average size is low: a query has between 2 and 3 terms on average) and D is the set of all documents.
However, for a search, these sets are bounded and indexes don't tend to grow very often. As a result, searches using VSM are really fast (when T and D are large, the search is really slow and one has to find an alternative approach).

D is the set of all documents
Before (honestly, alongside) VSM, boolean retrieval is invoked. Thus, we can say d is the set of matching docs only (almost; OK, in the best case).
Since Scores is a priority queue (at least in the document-at-a-time scheme) built on a heap, putting each d into it takes log(K).
Therefore we can estimate the cost as O(d·log(K)). Here I am omitting T since the query is expected to be short (otherwise, you are in trouble).
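To make the two answers above concrete, here is a minimal Python sketch of term-at-a-time scoring followed by a size-K heap. It is not Lucene's implementation: the index layout (`postings`, `doc_lengths`), the use of idf as the query weight, and the function name `top_k` are all assumptions made for illustration.

```python
import heapq
import math
from collections import defaultdict

def top_k(query_terms, postings, doc_lengths, k):
    """Score documents term-at-a-time, then keep the k best.

    postings:    dict term -> list of (doc_id, tf) pairs (hypothetical layout)
    doc_lengths: dict doc_id -> normalization factor (the Length array)
    """
    n_docs = len(doc_lengths)
    scores = defaultdict(float)
    for t in query_terms:
        plist = postings.get(t, [])
        if not plist:
            continue
        w_tq = math.log(n_docs / len(plist))   # idf used as the query weight (assumption)
        for doc_id, tf in plist:               # only documents containing t are touched
            wf_td = 1.0 + math.log(tf)         # logarithmic tf weighting
            scores[doc_id] += wf_td * w_tq
    # Normalize and select: a heap of size k costs O(d log k) for d matching docs.
    normalized = ((s / doc_lengths[d], d) for d, s in scores.items())
    return [(d, s) for s, d in heapq.nlargest(k, normalized)]
```

Note that the inner loop visits postings entries, not every (document, term) pair, which is why the work depends on the number of matching documents rather than on |D| · |T|.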


Time Complexity of 1-pass lookup given input size N**2

Given a list of lists, i.e.
What is the time complexity of using nested For loops to see if each numeral from 1-9 is used once and only once? Furthermore, what would be the time complexity if the input is now a singular combined list, i.e. [1,2,3,4,5,6,7,8,9]?
What really matters is the size of the input, not the format. Whether you have a list of 9 elements or 9 lists with 1 element each, you still have 9 elements to check in the worst case.
The answer to the question, as stated, would be O(1), because you have a constant size input.
If what you mean is something like "given N elements, what is the time complexity of checking whether all numbers between 1 and N are present", then it would take linear time, i.e., O(N).
Indeed, one option is to use a hash table (e.g., a Python set): check whether each element is already in the set and add it if not. Note that this specific option gives an expected (but not guaranteed, due to potential collisions) linear-time algorithm.
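A minimal sketch of the set-based check, assuming the input is a list of lists of integers; the function name `uses_each_once` is made up for this example. The same code handles nine rows of one element or one row of nine, since only the total element count matters.

```python
def uses_each_once(rows, n):
    """Return True if the flattened input holds every integer 1..n exactly once."""
    seen = set()
    for row in rows:
        for value in row:
            # Reject out-of-range values and repeats as soon as they appear.
            if not (1 <= value <= n) or value in seen:
                return False
            seen.add(value)
    # No repeats and all values in range, so n distinct values means all are present.
    return len(seen) == n
```

Each element is visited once and each set operation is expected O(1), so the whole check is expected O(N) in the number of elements.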

BooleanQuery setDisableCoord

I have no idea what setDisableCoord is or what value I should set for it. I understand coord in a simple query (e.g. a TFIDF query), but I don't understand what it means in a Boolean query consisting of several queries.
To give some context, assume the following two scenarios. What value should I set in setDisableCoord for each of them?
In the first scenario I have a query with BooleanClause.Occur.FILTER (the query is used only for filtering) and another one for scoring (BooleanClause.Occur.MUST). In this scenario the first query only checks if the "year" field of the document is in a specified range and the second query uses some algorithm for ranking.
In the second scenario, I have two queries with BooleanClause.Occur.SHOULD whose scores must be combined to obtain the final retrieval score of documents.
Summary: For Lucene 6.x with the default BM25 similarity, set disableCoord to true; otherwise leave it at false.
Coord is a scoring feature of BooleanQuery that counteracts some of TF/IDF's shortcomings for over-saturated terms. It's only relevant for multiple should clauses. In your first scenario, all sub-queries must match, so there is no coord factor involved and the disableCoord parameter has no effect.
In the second scenario, when having multiple should clauses, a BooleanQuery sums up all the sub-scores to determine, which of the documents is a better match. The idea is that a doc that matches more sub-queries is a better match and thus, gets a better score.
Now, imagine a query x OR y and a document that has 1000 occurrences of x but none of y. With TF/IDF, due to the high termFreq(x), the sub-score of x is very high and so is the resulting score of x OR y, which can push this document ahead of others that match both x and y. That is not what BooleanQuery was meant to do. This is where coord comes into play.
The coord factor is calculated per document as (number of should clauses matched) / (total number of should clauses in the query). This gives a number in [0..1] that represents how many of the sub-queries match a document. The summed score of all sub-queries is then multiplied by this coord factor. A document matching all should clauses keeps its original score, the sum of all sub-scores, while a document matching only x out of x OR y has its score halved, counteracting the high score that the over-saturated x gave. If you disable coord, this factor is not calculated and the final score is only the sum of the sub-scores.
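The arithmetic can be sketched in a few lines of Python. This is an illustration of the formula described above, not Lucene code; the function `boolean_score` and its `use_coord` flag are hypothetical names.

```python
def boolean_score(sub_scores, use_coord=True):
    """Combine SHOULD-clause scores for one document.

    sub_scores holds one entry per SHOULD clause: the clause's score,
    or None if the clause did not match the document.
    """
    matched = [s for s in sub_scores if s is not None]
    total = sum(matched)
    if not use_coord or not sub_scores:
        return total
    coord = len(matched) / len(sub_scores)   # matched clauses / total clauses
    return total * coord
```

For the query x OR y, a document matching only x with a sub-score of 8.0 ends up at 4.0 with coord and stays at 8.0 without it, while a document matching both clauses is unaffected.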
Coord was designed with TF/IDF in mind, and other similarity formulas might not suffer from over-saturated terms. BM25, which became the default similarity in Lucene 6.0, has much better control over such over-saturated terms through its k1 parameter. Instead of a score that keeps growing with increasing termFreq, BM25 approaches a limit and stops growing. It gives almost no additional boost to a document with termFreq=1000 over one with termFreq=5, but it does for termFreq=1 over termFreq=0. Britta Weber has given a talk at Buzzwords about this, where she explains the saturation curve.
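A small sketch of the saturation effect, using the textbook BM25 term-frequency component with length normalization switched off (b = 0). The value k1 = 1.2 is the commonly used default; treating it in isolation like this is a simplification, not a reproduction of Lucene's scoring.

```python
def bm25_tf(freq, k1=1.2):
    """Term-frequency part of BM25 with b = 0; it approaches k1 + 1 as freq grows."""
    return freq * (k1 + 1) / (freq + k1)

# Going from 0 to 1 occurrence matters far more than going from 5 to 1000.
gain_first_hit = bm25_tf(1) - bm25_tf(0)
gain_5_to_1000 = bm25_tf(1000) - bm25_tf(5)
```

With these settings the contribution never exceeds k1 + 1 = 2.2, so a single over-saturated term cannot dominate the summed score the way it can with an unbounded tf weight.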
That means, for BM25, the coord factor is not necessary anymore and might actually lead to counter-intuitive results. It is already removed from Lucene master and will be gone in 7.0.
If you're using Lucene 6.x with the default similarity BM25, it's a good idea to always disable coord, as BM25 does not suffer from the problem coord worked around. If you're using TF/IDF (regardless of 6.x or not), disabling coord will only give you more predictable results as long as your term frequencies are evenly distributed (which they practically never are), and setting disableCoord to false (the default) will give results that are intuitively better.

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in n log n:
First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is n log n)
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, Wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the lowest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the last n numbers. Each computation takes only constant time if you keep s1 and s2 as running totals: for example, to get the std dev of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate the standard deviation. This means that the entire standard-deviation-calculating portion of the algorithm takes linear time.
So total time complexity n log n.
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having recently worked on a similar problem, I think both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
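The sort-then-slide approach, with the variance comparison suggested above, can be sketched as follows. The function name `densest_window` is made up here, and the running-sum formula can lose precision through cancellation on large floating-point values, so treat it as a sketch rather than a numerically robust implementation.

```python
import math

def densest_window(values, n):
    """Return (std_dev, start) for the n consecutive sorted values with the
    smallest population standard deviation; start indexes the sorted list."""
    xs = sorted(values)                        # O(N log N)
    if n <= 0 or n > len(xs):
        raise ValueError("n must be between 1 and len(values)")
    s1 = sum(xs[:n])                           # running sum
    s2 = sum(x * x for x in xs[:n])            # running sum of squares
    best, best_start = n * s2 - s1 * s1, 0     # n^2 * variance, no sqrt needed
    for i in range(1, len(xs) - n + 1):
        out, new = xs[i - 1], xs[i + n - 1]
        s1 += new - out
        s2 += new * new - out * out
        scaled_var = n * s2 - s1 * s1
        if scaled_var < best:
            best, best_start = scaled_var, i
    return math.sqrt(max(best, 0.0)) / n, best_start
```

Swapping the comparison for `xs[i + n - 1] - xs[i]` would give the range-based variant, which needs neither the squares nor the final square root.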
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

Hamming Distance / Similarity searches in a database

I have a process, similar to TinEye, that generates perceptual hashes; these are 32-bit ints.
I intend to store these in a SQL database (maybe a NoSQL DB) in the future.
However, I'm stumped as to how I would be able to retrieve records based on the similarity of hashes.
Any ideas?
A common approach (at least common to me) is to divide your hash bit string in several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So in simple terms: say you have a bunch of 32-bit hashes in a DB and you want to find every hash that is within a few bits of Hamming distance of your "query" hash (with the four slices below, the pre-filter is complete for distances up to 3 bits):
1) Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice1 to islice4.
2) Slice your query hash the same way into qslice1 to qslice4.
3) Query this table for rows where qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that is within 3 bits (4 - 1) of the query hash, because 3 differing bits can touch at most 3 of the 4 slices. It may contain more results, and this is why there is a step 4.
4) For each returned hash, compute the exact Hamming distance pairwise with your query hash (reconstructing the index-side hash from the four slices).
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.
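Here is an in-memory Python sketch of the slice pre-filter followed by the exact check. The class `SliceIndex` and its dictionaries stand in for the four-column table and are assumptions for illustration; a real deployment would use indexed database columns instead.

```python
from collections import defaultdict

def slices(h):
    """Split a 32-bit hash into four 8-bit slices, most significant first."""
    return [(h >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

class SliceIndex:
    def __init__(self):
        # One lookup table per slice position: slice value -> hashes with that slice.
        self.tables = [defaultdict(set) for _ in range(4)]

    def add(self, h):
        for table, s in zip(self.tables, slices(h)):
            table[s].add(h)

    def query(self, q, max_dist=3):
        # Pre-filter: any hash sharing at least one exact slice with the query.
        candidates = set()
        for table, s in zip(self.tables, slices(q)):
            candidates |= table.get(s, set())
        # Exact step: keep only candidates within the requested distance.
        return sorted(h for h in candidates if hamming(h, q) <= max_dist)
```

With four slices the pre-filter cannot miss a hash at distance 3 or less; for larger distances it may miss matches whose differing bits fall in all four slices.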
This approach was first described, AFAIK, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
Given bit vectors consisting of d bits each, we choose $N = O(n^{1/(1+\epsilon)})$ random permutations of the bits. For each random permutation σ, we maintain a sorted order $O_\sigma$ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on $O_\sigma$ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders $O_\sigma$ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.
NB: C-similarity is the complement of the Hamming distance: the Hamming distance is the number of positions at which the corresponding bits differ, while C-similarity is the number of positions at which the corresponding bits agree.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
Definition: Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint.)
Intuition: Consider a sorted table of $2^d$ f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to “almost a counter” in the sense that (a) quite a few $2^d$ bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are “almost random”.
Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d′ least-significant bit-positions).
The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d′ bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.
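The single-table probe from this quote can be sketched with Python's bisect module. This is only the one-table case under simplifying assumptions: fingerprints are plain integers in a sorted list, and the name `probe` and the parameter `prefix_bits` (the number of leading bits matched exactly) are made up for the example.

```python
import bisect

def probe(sorted_fps, query, f, prefix_bits, k):
    """Return fingerprints that share the top prefix_bits bits with query
    and differ from it in at most k bits. sorted_fps must be sorted ascending."""
    shift = f - prefix_bits
    prefix = query >> shift
    # Binary search for the contiguous run of fingerprints with this prefix.
    lo = bisect.bisect_left(sorted_fps, prefix << shift)
    hi = bisect.bisect_left(sorted_fps, (prefix + 1) << shift)
    # Exact check on the (expected small) run of candidates.
    return [fp for fp in sorted_fps[lo:hi] if bin(fp ^ query).count("1") <= k]
```

A fingerprint whose differing bit lies inside the prefix is missed by this table, which is the reason the paper builds a few additional tables over permuted bit orders.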
PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.
To find the Hamming distance, you can use bitwise operations on the integers: XOR the two values (or build the XOR from & and ~ where it is not available) and count the set bits in the result.
SQL isn't made for this sort of processing. The comparisons on large data sets get very messy, and will not have the speed of a query that utilizes the strength of the system. That said, I've done similar things.
A pairwise comparison like this gives you individual differences, which would need to be computed over the full data set and then ordered, which is messy at best. If you want it to run faster, you will need to use strategies like indexing by "region," or finding natural groupings in your data. There are umbrella clustering strategies and similar approaches, and there is a lot of literature. It will, however, be messy in most traditional database systems.
David's discussion is correct, but if you don't have a lot of data, check out Hamming distance on binary strings in SQL

What is Big O notation? Do you use it? [duplicate]

What is Big O notation? Do you use it?
I missed this university class I guess :D
Does anyone use it, and can you give some real-life examples of where you used it?
See also:
Big-O for Eight Year Olds?
Big O, how do you calculate/approximate it?
Did you apply computational complexity theory in real life?
There is one important thing most people forget when talking about Big-O, so I feel the need to mention it:
You cannot use Big-O to compare the speed of two algorithms. Big-O only says how much slower an algorithm will get (approximately) if you double the number of items processed, or how much faster it will get if you cut the number in half.
However, if you have two entirely different algorithms and one (A) is O(n^2) and the other one (B) is O(log n), that does not mean A is slower than B. Actually, with 100 items, A might be ten times faster than B. It only says that going from 100 to 200 items, A's running time grows quadratically (roughly fourfold) while B's grows logarithmically (only slightly). So, if you benchmark both and you know how much time A takes to process 100 items, and how much time B needs for the same 100 items, and A is faster than B, you can estimate at what number of items B will overtake A in speed (as the speed of B decreases much more slowly than that of A, B will overtake A sooner or later; this is for sure).
Big O notation denotes the limiting factor of an algorithm. It's a simplified expression of how the run time of an algorithm scales in relation to the input.
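The crossover estimate described here can be sketched in Python. It assumes both algorithms follow their Big-O curves exactly from the measured point onward, which real code rarely does, so the result is only a ballpark figure; the function name `crossover` is made up.

```python
import math

def crossover(time_a_at_100, time_b_at_100):
    """Smallest n > 100 at which the O(n^2) algorithm A becomes slower than
    the O(log n) algorithm B, extrapolating from timings taken at n = 100."""
    n = 100
    while True:
        n += 1
        time_a = time_a_at_100 * (n / 100) ** 2                  # quadratic growth
        time_b = time_b_at_100 * math.log(n) / math.log(100)     # logarithmic growth
        if time_a > time_b:
            return n

# A is ten times faster than B at 100 items (timings in arbitrary units).
n_star = crossover(1.0, 10.0)
```

Under these assumptions the tenfold head start is used up after only a few hundred items.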
For example (in Java):
/**
 * Takes an array of strings and concatenates them.
 * This is a silly way of doing things but it gets the
 * point across hopefully.
 * @param strings the array of strings to concatenate
 * @return a string that is the result of concatenating all the strings
 *         in the array
 */
public static String badConcat(String[] strings) {
    String totalString = "";
    for (String s : strings) {
        for (int i = 0; i < s.length(); i++) {
            // Each += allocates a new String and copies the old contents.
            totalString += s.charAt(i);
        }
    }
    return totalString;
}
Now think about what this is actually doing. It is going through every character of input and adding them together. This seems straightforward. The problem is that String is immutable. So every time you add a letter onto the string you have to create a new String. To do this you have to copy the values from the old string into the new string and add the new character.
This means you will copy the first letter n-1 times, where n is the number of characters in the input, the second letter n-2 times, and so on, so in total there will be (n-1)(n/2) copies.
This is (n^2-n)/2 and for Big O notation we use only the highest magnitude factor (usually) and drop any constants that are multiplied by it and we end up with O(n^2).
Using something like a StringBuilder brings this down to O(n) amortized, because its internal buffer grows geometrically rather than being recopied on every append. If you calculate the number of characters at the beginning and set the capacity of the StringBuilder up front, you avoid even those occasional reallocations; it is still O(n), with a smaller constant.
So if we had 1000 characters of input, the first example would perform on the order of a million operations (about half a million character copies), while the StringBuilder versions would perform on the order of a thousand to do the same thing. This is a rough estimate, but Big O notation is about orders of magnitude, not exact runtime.
It's not something I use per se on a regular basis. It is, however, constantly in the back of my mind when trying to figure out the best algorithm for doing something.
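To check the copy count, here is a small Python model of the immutable-concatenation loop. It counts character copies rather than timing anything, and the function name is made up for the example.

```python
def copies_with_immutable_concat(n):
    """Count characters copied when every append rebuilds the whole string."""
    copies = 0
    length = 0
    for _ in range(n):
        copies += length   # the old contents are copied into the new string
        length += 1        # then the new character is added
    return copies

quadratic = copies_with_immutable_concat(1000)   # equals 1000 * 999 / 2
```

The count matches the closed form (n^2 - n)/2, which is where the O(n^2) comes from once constants and the lower-order term are dropped.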
A very similar question has already been asked at Big-O for Eight Year Olds?. Hopefully the answers there will answer your question although the question asker there did have a bit of mathematical knowledge about it all which you may not have so clarify if you need a fuller explanation.
Every programmer should be aware of what Big O notation is, how it applies for actions with common data structures and algorithms (and thus pick the correct DS and algorithm for the problem they are solving), and how to calculate it for their own algorithms.
1) It's an order-of-magnitude measure of the efficiency of an algorithm when working on a data structure.
2) Actions like 'add' / 'sort' / 'remove' can take different amounts of time with different data structures (and algorithms); for example, 'add' and 'find' are O(1) on average for a hashmap, but O(log n) for a balanced binary tree. Sort is O(n log n) on average for QuickSort, but O(n^2) for BubbleSort, when dealing with a plain array.
3) Calculations can be done by looking at the loop depth of your algorithm generally. No loops, O(1), loops iterating over all the set (even if they break out at some point) O(n). If the loop halves the search space on each iteration? O(log n). Take the highest O() for a sequence of loops, and multiply the O() when you nest loops.
Yeah, it's more complex than that. If you're really interested get a textbook.
'Big-O' notation is used to compare the growth rates of two functions of a variable (say n) as n gets very large. If function f grows much more quickly than function g we say that g = O(f) to imply that for large enough n, f will always be larger than g up to a scaling factor.
It turns out that this is a very useful idea in computer science and particularly in the analysis of algorithms, because we are often precisely concerned with the growth rates of functions which represent, for example, the time taken by two different algorithms. Very coarsely, we can determine that an algorithm with run-time t1(n) is more efficient than an algorithm with run-time t2(n) if t1 = O(t2) for large enough n which is typically the 'size' of the problem - like the length of the array or number of nodes in the graph or whatever.
This stipulation, that n gets large enough, allows us to pull a lot of useful tricks. Perhaps the most often used one is that you can simplify functions down to their fastest growing terms. For example n^2 + n = O(n^2) because as n gets large enough, the n^2 term gets so much larger than n that the n term is practically insignificant. So we can drop it from consideration.
However, it does mean that big-O notation is less useful for small n, because the slower growing terms that we've forgotten about are still significant enough to affect the run-time.
What we now have is a tool for comparing the costs of two different algorithms, and a shorthand for saying that one is quicker or slower than the other. Big-O notation can be abused which is a shame as it is imprecise enough already! There are equivalent terms for saying that a function grows less quickly than another, and that two functions grow at the same rate.
Oh, and do I use it? Yes, all the time - when I'm figuring out how efficient my code is, it gives a great 'back-of-the-envelope' approximation of the cost.
The "Intuition" behind Big-O
Imagine a "competition" between two functions over x, as x approaches infinity: f(x) and g(x).
Now, if from some point on (some x) one function always has a higher value than the other, then let's call this function "faster" than the other.
So, for example, if for every x > 100 you see that f(x) > g(x), then f(x) is "faster" than g(x).
In this case we would say g(x) = O(f(x)). f(x) poses a sort of "speed limit" for g(x), since eventually it passes it and leaves it behind for good.
This isn't exactly the definition of big-O notation, which also states that f(x) only has to be larger than C*g(x) for some constant C (which is just another way of saying that you can't help g(x) win the competition by multiplying it by a constant factor - f(x) will always win in the end). The formal definition also uses absolute values. But I hope I managed to make it intuitive.
It may also be worth considering that the complexity of many algorithms is based on more than one variable, particularly in multi-dimensional problems. For example, I recently had to write an algorithm for the following. Given a set of n points, and m polygons, extract all the points that lie in any of the polygons. The complexity is based around two known variables, n and m, and the unknown of how many points are in each polygon. The big O notation here is quite a bit more involved than O(f(n)) or even O(f(n) + g(m)).
Big O is good when you are dealing with large numbers of homogenous items, but don't expect this to always be the case.
It is also worth noting that the actual number of iterations over the data often depends on the data. Quicksort is usually quick, but give it presorted data (with a naive pivot choice) and it slows down. My points and polygons algorithm ended up quite fast, close to O(n + m log(m)), based on prior knowledge of how the data was likely to be organised and the relative sizes of n and m. It would fall down badly on randomly organised data of different relative sizes.
A final thing to consider is that there is often a direct trade-off between the speed of an algorithm and the amount of space it uses. Pigeonhole sorting is a pretty good example of this. Going back to my points and polygons, let's say that all my polygons were simple and quick to draw, and I could draw them filled on screen, say in blue, in a fixed amount of time each. So if I draw my m polygons on a black screen it would take O(m) time. To check if any of my n points was in a polygon, I simply check whether the pixel at that point is blue or black. So the check is O(n), and the total analysis is O(m + n). The downside, of course, is that I need near-infinite storage if I'm dealing with real-world coordinates to millimeter accuracy... ho hum.
It may also be worth considering amortized time, rather than just worst case. This means, for example, that if you run the algorithm n times, it will be O(1) on average, but it might be worse sometimes.
A good example is a dynamic table, which is basically an array that expands as you add elements to it. A naïve implementation would increase the array's size by 1 for each element added, meaning that all the elements need to be copied every time a new one is added. This would result in an O(n^2) algorithm if you were concatenating a series of arrays using this method. An alternative is to double the capacity of the array every time you need more storage. Even though appending is sometimes an O(n) operation, you will only need to copy O(n) elements for every n elements added, so the operation is O(1) on average. This is how things like StringBuilder or std::vector are implemented.
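The difference between the two growth strategies is easy to count. The sketch below simulates reallocations rather than using a real array; `copies_when_growing` and the `grow` callback are hypothetical names for this illustration.

```python
def copies_when_growing(n, grow):
    """Count element copies caused by reallocations while appending n items.

    grow(capacity) returns the new capacity to use once the array is full.
    """
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            capacity = grow(capacity)
            copies += size          # existing elements move to the new buffer
        size += 1
    return copies

by_one = copies_when_growing(1024, lambda c: c + 1)     # grow by a single slot
doubling = copies_when_growing(1024, lambda c: c * 2)   # double when full
```

For 1024 appends, growing by one slot copies 523,776 elements in total, while doubling copies 1,023, fewer than one copy per append on average.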
What is Big O notation?
Big O notation is a method of expressing the relationship between the number of steps an algorithm will require and the size of the input data. This is referred to as the algorithmic complexity. For example, sorting a list of size N using Bubble Sort takes O(N^2) steps.
Do I use Big O notation?
I do use Big O notation on occasion to convey algorithmic complexity to fellow programmers. I use the underlying theory (e.g. Big O analysis techniques) all of the time when I think about what algorithms to use.
Concrete Examples?
I have used the theory of complexity analysis to create algorithms for efficient stack data structures which require no memory reallocation, and which support average time of O(N) for indexing. I have used Big O notation to explain the algorithm to other people. I have also used complexity analysis to understand when linear time sorting O(N) is possible.
From Wikipedia.....
Big O notation is useful when analyzing algorithms for efficiency. For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n² − 2n + 2.
As n grows large, the n² term will come to dominate, so that all other terms can be neglected — for instance when n = 500, the term 4n² is 1000 times as large as the 2n term. Ignoring the latter would have negligible effect on the expression's value for most purposes.
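The figures in this quote are easy to verify numerically; the short Python check below simply evaluates the quoted T(n) at n = 500.

```python
def T(n):
    """Step count from the quoted example: T(n) = 4n^2 - 2n + 2."""
    return 4 * n**2 - 2 * n + 2

n = 500
ratio = (4 * n**2) / (2 * n)                       # how much larger 4n^2 is than 2n
relative_error = abs(T(n) - 4 * n**2) / T(n)       # cost of keeping only 4n^2
```

At n = 500 the ratio is exactly 1000, and dropping the lower-order terms changes the value by less than 0.1 percent.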
Obviously I have never used it..
You should be able to evaluate an algorithm's complexity. This combined with a knowledge of how many elements it will take can help you to determine if it is ill suited for its task.
It says how many iterations an algorithm has in the worst case.
To search for an item in a list, you can traverse the list until you find the item. In the worst case, the item is in the last place.
Let's say there are n items in the list. In the worst case you take n iterations. In Big O notation this is O(n).
It states, in effect, how efficient an algorithm is.