Searching and sorting by language - indexing

I am testing Lucene.NET for our searching requirements, and I've got couple of questions.
We have documents in XML format. Every document contains multi-language text. The number of languages and the languages itself vary from document to document. See example below:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>, a <word lang="en">table</word> and a <word lang="en">desk</word>.</document>
The keywords of a document are tagged with a special element and language attribute.
When I am creating lucene index I extract the text content from the XML and pairs of language and keyword (I am not sure if I have to), like this:
This is a sample document, which is describing a tisch, a table and a desk.
de - tisch
en - table
en - desk
I don't know exactly how to create an index that I will be able to search for example:
- all the documents that contains word tisch in German (and not the document which contains word tisch in other languages).
And also I want to specifiy sorting at runtime:
I want to sort by user specified language order (depending on a user interface). For example, if we have two documents:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>.</document>
<document>This is a another sample document, which is describing a <word lang="en">table</word>.</document>
and a user on an English interface searches by "tisch OR table" I want to get the second result first.
Any information or advice is appreciated.
Many thanks!

You have a design decision to make, where the options are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
If you use the multi-index approach, it will be easier to restrict search to a specific language or set of languages - just search the indexes for these languages, not using the other languages. Also, sorting by language becomes easier. Therefore, if you do not have
an "AND" search that requires keywords from different languages appear in the same document, I would suggest the M-index approach.
Based on your example, I assume that the part of the documents not specially tagged is in English. If this is so, you can add the document text to the English index as a separate field; The other indexes need only store a document id, which will make them lighter.

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what Term is?What is the difference between term and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstance I may use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene. And I am trying to create a document. Per my basic knowledge, I am doing
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic ways a document can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a document is indexed but not stored, you can search for it, but it won't be returned with search results.
One reasonably common pattern is to use lucene for search, but only have an ID field being stored which can be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or an web resource.
You might also opt not to store a field when that field is just a search tool, but you wouldn't display it to the user, such as a soundex/metaphone, or an alternate analysis of a content field.
Use Field.Store.YES when you need a document back from Lucene document. Use NO when you just need a search from document. Here is a link explained with a scenario.
https://handyopinion.com/java-lucene-saving-fields-or-not/

lucene multiple documents created for large and unique number of resources?

I am a beginner in lucene search.If I have a collection resources like:
id,name,{list of products},{list of keywords}.If I want to search based on name or products or keyword.I have some doubts related to lucene and its usage:
1)For document creation, I create a document that has the structure of id,name,products(multiple values),keywords(multiple values).If I have a thousand unique resources, will it create 1000 unique documents?
2)Also, If I make name and products field as searchable fields(as StringField), then after searching, will the result also contains(ScoreDocs contains) exactly the same set of documents that has the text I searched?
Q> <..> will it create 1000 unique documents?
A> Lucene doesn't have the concept of "uniqueness" - it is only in your head. Alternatively, think of this as if all documents are unique for Lucene. If you search by these fields, relevant documents will be returned.
Q> <..> will the result also contains(ScoreDocs contains) exactly the same set of documents that has the text I searched?
A> Strange/unclear question. If you search for all documents, you will get all documents. If your search query will only match some documents, some documents will be returned. The internals are more complex - it all depends on how you analyze the text. Maybe you can more give concrete example with use cases?

index multi-language field with luncene

I have multi-language document records to index with lucene. That is, each document record is in one language, but there are different language records. I intend to keep them in one index so that I could search with multi-language queries. Currently the document records are in one data input file like this:
<DOCID>1<\DOCID>
<LANGUAGE>CHINESE<\LANGUAGE>
<TEXT>中文内容<\TEXT>
<DOCID>2<\DOCID>
<LANGUAGE>ENGLISH<\LANGUAGE>
<TEXT>Some English text<\TEXT>
My question is: Is there a way to use different analyzers for the same field with one index writer? Or should I split the document records into two input document in different languages to apply different index writer but append to the same index?
Thank you in advance for your advice!
You can provide the Analyzer you intend to use for a document when you call IndexWriter.addDocument.
However, you would probably benefit more from splitting different language texts into different fields, This would prevent having hits on the wrong language, and allow you to just create an AnalyzerWrapper to assign the appropriate analyzer after having detected the correct language.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".