Lucene Difference between term and query? - lucene

What is the exact difference between the Term based index and the Query based Index also searching in LUCENE 6.5?

I don't know where you heard about "term-based" and "query-based" indexes.
Terms are the analyzed chunks of the text in the index. Most commonly, these are words, but it depends on your analyzer.
Queries are a set of search criteria that specifies what to look for among the indexed terms.

Related

OrientDB: FullText indexes vs Lucene FullText indexes

OrientDB has two types of full-text indexes: one is their own implementation and the second one is Lucene implementation. However it is absolutely unclear what I should use.
I understand that Lucene provides more features. However what if these features are not required. Should I go with standard full-text indexes or with Lucene? Then obviously performance is the main question.
The indices "FULL TEXT" with engine LUCENE
provides good full-text indexes, but cannot be used to index other types. It is durable, transactional and supports range queries.
More information about lucene see link.
The indices "FULL TEXT" with engine SB-TREE
the index is created with the algorithm that is based on the B-Tree index algorithm. It has been adapted with several optimizations, which relate to data insertion and range queries. As is the case with all other tree-based indexes, SB-Tree index algorithm experiences log(N) complexity, but the base to this logarithm is about 500. This indexing algorithm provides a good mix of features, similar to the features available from other index types. It is good for general use and is durable, transactional and supports range queries.
a simple example that compares the speed:
DB one: 100000 top of Class Person with property name with value "the name is 1...n " and Lucene index on this property
DB one: 100000 top of Class Person with property name with value "the name is 1...n " and sbtree index on this property
On one db: select from Person where name LUCENE "49000" return one record --> Query executed in 0.039 sec.
Db on two: select from Persona where name = "49000" return one record --> Query executed in 1.364 sec

Lucene: Query at least

I'm trying to find if there's a way to search in lucene to say find all documents where there is at least one word that does not match a particualar word.
E.g. I want to find all documents where there is at least one word besides "test". i.e. "test" may or may not be present but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham
Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound to two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary, and match only a few terms so that the second step doesn't take too long. For example, in order to prohibit too complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require to examine every single term of the term dictionary and dismiss documents which are in the "test" document set only!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.
If you need a query with a single negative condition, then use a BooleanQuery with the MatchAllDocsQuery and a TermQuery with occurs=MUST_NOT. There is no way to additionaly enforce the existential constraint ("must contain at least one term that is not excluded"). You'll have to check that separately, once you retrieve Lucene's results. Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.

Performance with LIKE vs CONTAINS using full-text indexing

I have a table with a large(ish) amount of rows 500k, MSSQL Server 2008. I have a column which holds a nvarchar product ID which is usually 15 characters in length, alphabetical and numerical e.g. FF93F348HJKCF5HW9 . I would like to be able to search for this product ID with the best performance. I have done some research into using Full-Text indexing on this column and I dont really think that using full-text indexing using CONTAINS offers any benefit over using LIKE '%%'. This looks to be down to the fact Full-text indexing is more beneficial when searching for whole words, rather than a series of characters.
Can somebody confirm/deny this for me?
Full-Text indexing is about searching for language words in unstructured text data. Your data doesn't contain words, just a sequence of characters.
I haven't tested this, but I would expect that LIKE would actually be faster, as long as your data is indexed. CONTAINS is meant for searching for words & word-like structures.
If your requirement is for "auto-complete", then LIKE will perform pretty well since the optimizer will use an INDEX SEEK when you search for something such as LIKE 'F5521%'.
This MSDN article explains the basics of the CONTAINS keyword.

Lucene - comparing data in multiple indexes

Is it possible to compare data from multiple Lucene indexes? I would like to get documents that have the same value in similar fields (like first name, last name) across two indexes. Does Lucence support queries that can do this?
Well, partly. You can build identical document schemas across indexes, and at least get the set of hits correctly. However, as the Lucene Similarity documentation shows, the idf (inverse document frequency) factor in the Lucene scoring depends both on the index size and the number of documents having the search term in the index. Both these factors are index-dependent. Therefore the same match from different indexes may get different scores depending on these factors.

When should you use full-text indexing?

We have a whole bunch of queries that "search" for clients, customers, etc. You can search by first name, email, etc. We're using LIKE statements in the following manner:
SELECT *
FROM customer
WHERE fname LIKE '%someName%'
Does full-text indexing help in the scenario? We're using SQL Server 2005.
It will depend upon your DBMS. I believe that most systems will not take advantage of the full-text index unless you use the full-text functions. (e.g. MATCH/AGAINST in mySQL or FREETEXT/CONTAINS in MS SQL)
Here is two good articles on when, why, and how to use full-text indexing in SQL Server:
How To Use SQL Server Full-Text Searching
Solving Complex SQL Problems with Full-Text Indexing
FTS can help in this scenario, the question is whether it is worth it or not.
To begin with, let's look at why LIKE may not be the most effective search. When you use LIKE, especially when you are searching with a % at the beginning of your comparison, SQL Server needs to perform both a table scan of every single row and a byte by byte check of the column you are checking.
FTS has some better algorithms for matching data as does some better statistics on variations of names. Therefore FTS can provide better performance for matching Smith, Smythe, Smithers, etc when you look for Smith.
It is, however, a bit more complex to use FTS, as you'll need to master CONTAINS vs FREETEXT and the arcane format of the search. However, if you want to do a search where either FName or LName match, you can do that with one statement instead of an OR.
To determine if FTS is going to be effective, determine how much data you have. I use FTS on a database of several hundred million rows and that's a real benefit over searching with LIKE, but I don't use it on every table.
If your table size is more reasonable, less than a few million, you can get similar speed by creating an index for each column that you're going to be searching on and SQL Server should perform an index scan rather than a table scan.
According to my test scenario:
SQL Server 2008
10.000.000 rows each with a string like "wordA wordB
wordC..." (varies between 1 and 30 words)
selecting count(*) with CONTAINS(column, "wordB")
result size several hundred thousands
catalog size approx 1.8GB
Full-text index was in range of 2s whereas like '% wordB %' was in range of 1-2 minutes.
But this counts only if you don't use any additional selection criteria! E.g. if I used some "like 'prefix%'" on a primary key column additionally, the performance was worse since the operation of going into the full-text index costs more than doing a string search in some fields (as long those are not too much).
So I would recommend full-text index only in cases where you have to do a "free string search" or use some of the special features of it...
To answer the question specifically for MSSQL, full-text indexing will NOT help in your scenario.
In order to improve that query you could do one of the following:
Configure a full-text catalog on the column and use the CONTAINS() function.
If you were primarily searching with a prefix (i.e. matching from the start of the name), you could change the predicate to the following and create an index over the column.
where fname like 'prefix%'
(1) is probably overkill for this, unless the performance of the query is a big problem.