I know what not_analyzed means: in short, the field will not be tokenized by the specified Analyzer.
However, what does NO_NORMS mean? I have read the documentation, but please explain it to me in plain English. What are index-time field and document boosting, and field length normalization?
It disables the following features:
index-time field and document boosting: this means that the index will ignore any boosts you applied to fields (AbstractField.setBoost) or documents (Document.setBoost). A matching token will always be worth the same.
field length normalization: this means that the index will ignore whether a matching token was in a short field (which should be more relevant) vs. a long field (less relevant). Again, a matching token will always be worth the same, no matter the length of the field.
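To make the two features concrete, here is a minimal sketch of what a norm contributes to a score. It is a toy model, not Lucene code: it assumes the classic formula where the length norm is 1/sqrt(number of terms in the field) multiplied by the boosts, and it ignores the lossy single-byte encoding Lucene applies to the stored norm. The function name field_norm and its parameters are made up for illustration.

```python
import math

def field_norm(num_terms, field_boost=1.0, doc_boost=1.0, omit_norms=False):
    """Toy per-field norm: boosts multiplied by 1/sqrt(number of terms)."""
    if omit_norms:
        # NO_NORMS: boosts and field length are ignored, every field
        # contributes the same neutral factor.
        return 1.0
    return doc_boost * field_boost * (1.0 / math.sqrt(num_terms))

# With norms: a match in a 4-term field outweighs one in a 100-term field.
short_field = field_norm(num_terms=4)
long_field = field_norm(num_terms=100)

# Without norms: both are worth the same, and the boost has no effect.
short_no_norms = field_norm(num_terms=4, omit_norms=True)
long_no_norms = field_norm(num_terms=100, field_boost=3.0, omit_norms=True)
```

So with norms enabled short_field is larger than long_field, while with NO_NORMS both collapse to the same value.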
Elasticsearch has an _all field, but I am not able to find anything equivalent in CrateDB. So do we need to maintain our own analyzed field for that purpose, or does Crate provide something built in?
The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved.
The _all field allows you to search for values in documents without knowing which field contains the value. This makes it a useful option when getting started with a new dataset.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
We don't have anything similar to that, so you'd need to either search the relevant columns explicitly in the query or maintain a dedicated column.
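If you go the dedicated-column route, the client can build the catch-all value before inserting the row. A rough sketch, assuming a hypothetical column named search_all and a row represented as a plain dict; the column would still need a fulltext index defined on the database side, which is not shown here.

```python
def build_catch_all(row, exclude=("search_all",)):
    """Join all non-null column values into one space-delimited string."""
    parts = [str(value) for key, value in row.items()
             if key not in exclude and value is not None]
    return " ".join(parts)

# Hypothetical row; the column names are only for illustration.
row = {"title": "Lucene in Action", "author": "Hatcher", "year": 2004, "notes": None}
row["search_all"] = build_catch_all(row)
```

This mirrors what _all does (concatenate everything with a space delimiter), with the difference that you decide which columns go in.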
Let's say that I'm indexing a string value "useridA;useridB,userdidC,useridA,useridA"
The field is set to ANALYZED and uses a custom CharTokenizer which looks for a comma as the boundary char.
What is the expected behavior in the index, given that the token "useridA" occurs multiple times within the same field?
Will it just re-index the same value and take up the same space as if there had been just one occurrence?
At the most basic level, Lucene is an "inverted term index": it stores term->docID. So if a term occurs many times in a document, the term-to-document entry is only recorded once.
Obviously this is a huge simplification. Term frequencies and positions are also stored in the postings unless you disable them for the field (you need positions to use phrase and slop queries). Term vectors, which are controlled by the TermVector value used when adding the field, are a separate per-document structure.
Depending on your use case, I'd suggest you de-dupe the list either when indexing, or just use a HashSet<string> for that property of whatever your class is.
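A toy model of the idea, in case it helps. This is not how Lucene lays out its files, just a dict standing in for the postings; the function name index_field is made up. It splits on commas only, as the question describes.

```python
from collections import defaultdict

def index_field(postings, doc_id, text, delimiter=","):
    """Add one field value to a toy postings map: term -> {doc_id: [positions]}."""
    for position, token in enumerate(text.split(delimiter)):
        postings[token].setdefault(doc_id, []).append(position)

postings = defaultdict(dict)
index_field(postings, 1, "useridA;useridB,userdidC,useridA,useridA")

# "useridA" has a single entry for document 1, with two positions.
# Because only the comma is a boundary, "useridA;useridB" is one token.
```

The repeated token costs one dictionary entry plus one position per occurrence, which is why de-duping before indexing only saves the positions and the term frequency.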
I am using Lucene 3.6.1.
Do you know if there is a way to change a Term's value (Term.text()) before Lucene actually performs the search on the Document holding this Term?
I need this preprocessing because the value is encrypted when written to the index. I also need to do fuzzy and/or approximate search on this Term.
Best regards.
You want to change a value stored in the index BEFORE you've found it? No, that doesn't make sense.
If you are storing data encrypted in the index, you'll need to search it using encrypted data. If you need to be able to take advantage of proper text searching, you will simply need to index it in an unencrypted form. Unless you are using some form of encryption that is friendly to text searching, I guess. I suppose if it were a simple cipher or something, you could encrypt both the indexed value and the query and search just fine. Apart from that, though, I don't think employing fuzzy searches on encrypted data is going to be feasible.
My Recommendation:
You could index, but not store, an unencrypted form of the field, allowing you to take advantage of searching as you need.
A second field could then be created to store the encrypted value, which serves as the retrievable version of the field. Whether you index that field or not depends on whether you may, in some cases, wish to search using encrypted data, but I would guess not.
Something like:
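// doc is a Document instance; encryptedValue is assumed to hold the already-encrypted form of value
doc.add(new Field("fieldname", value, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("fieldnameencrypted", encryptedValue, Field.Store.YES, Field.Index.NO));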
Only fieldname can be searched, but only fieldnameencrypted can be retrieved from a found document (in its encrypted form).
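The same split between "searchable" and "retrievable" can be shown with a small stand-alone model. Everything here is hypothetical: ToyIndex is not a Lucene class, and fake_encrypt is only a placeholder (base64 is an encoding, not encryption) so that the example runs without a crypto library. Note that in this arrangement the postings hold the plain-text terms.

```python
import base64

def fake_encrypt(text):
    # Placeholder only: base64 is an encoding, NOT encryption.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

class ToyIndex:
    def __init__(self):
        self.postings = {}  # term -> set of doc ids (indexed, not stored)
        self.stored = {}    # doc id -> encrypted value (stored, not indexed)

    def add(self, doc_id, value):
        # Index the unencrypted terms so normal text search works.
        for term in value.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        # Store only the encrypted form for retrieval.
        self.stored[doc_id] = fake_encrypt(value)

    def search(self, term):
        hits = sorted(self.postings.get(term.lower(), ()))
        return [self.stored[doc_id] for doc_id in hits]

index = ToyIndex()
index.add(1, "Alice Smith")
results = index.search("alice")  # encrypted payloads of matching documents
```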
I am trying to index in Lucene a field that could contain RDF literals in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows attaching attributes to a term. Is anyone using this mechanism to store language information (or other attributes such as datatypes)? How does its performance compare to the two other approaches? Any pointer to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all English text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
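To illustrate the single-text-field-plus-language-field layout, here is a toy filter over in-memory documents. It only mimics the effect of a query like +text:foo +language:english; the sample documents and the search function are invented for the example and have nothing to do with Lucene's query parser.

```python
# Hypothetical documents: one text field, one language field each.
docs = [
    {"id": 1, "text": "foo bar", "language": "english"},
    {"id": 2, "text": "foo baz", "language": "spanish"},
    {"id": 3, "text": "qux", "language": "english"},
]

def search(docs, term, language=None):
    """Return ids of docs containing term, optionally restricted to one language."""
    return [d["id"] for d in docs
            if term in d["text"].split()
            and (language is None or d["language"] == language)]

english_hits = search(docs, "foo", language="english")  # both clauses required
all_hits = search(docs, "foo")                          # any language
```

Each hit can also report its language by reading the second field, which covers the "tell the user which language matched" case without payloads.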
Basically, Lucene is a ranking engine: it just looks at strings and compares them to other strings. They can be in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for a supported language, such as Spanish, and you should get results. Note that Snowball has no Chinese stemmer, so Chinese text needs a different analyzer, such as CJKAnalyzer.
I have below values in my database.
been Lorem Ipsum and scrambled ever
Here is my query:
MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE) AS score
FROM wiki_businessads
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
It's not fetching the first result. Why? It should fetch the first result because it contains the word "text". I have disabled stopword filtering.
This one is also not working:
MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE) AS score
FROM wiki_businessads
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
The full-text search only matches words and word prefixes. Because your data in the database does not contain word boundaries (spaces), the individual words are not indexed, so they are not found.
Some possible choices you could make are:
Fix your data so that it contains spaces between words.
Use LIKE '%text%' instead of a full text search.
Use an external full-text search engine.
I will expand on each of these in turn.
Fix your data so that it contains spaces between words.
Your data seems to have been corrupted somehow. It looks like words or sentences but with all the spaces removed. Do you know how that happened? Was it intentional? Perhaps there is a bug elsewhere in the system. Try to fix that. Find out where the data came from and see if it can be reimported correctly.
If the original source doesn't contain spaces, perhaps you could use some natural language toolkit to guess where the spaces should be and insert them. There most likely already exist libraries that can do this, although I don't happen to know any. A Google search might find something.
Use LIKE '%text%' instead of a full text search.
A workaround is to use LIKE '%text%' instead but note that this will be much slower as it will not be able to use the index. However it will give the correct result.
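The difference between the two kinds of matching can be sketched as follows. This is a simplified model, not MySQL's implementation: fulltext_match splits on non-word characters and compares whole tokens, like_match is a plain substring test (assuming a case-insensitive collation), and the sample strings are made up.

```python
import re

def fulltext_match(haystack, word):
    """Toy whole-word match: split into word tokens, compare case-insensitively."""
    tokens = re.findall(r"\w+", haystack.lower())
    return word.lower() in tokens

def like_match(haystack, needle):
    """Toy equivalent of LIKE '%needle%' (case-insensitive substring test)."""
    return needle.lower() in haystack.lower()

glued = "simplydummytextoftheprinting"  # no spaces, so it is one long token
spaced = "simply dummy text of the printing"

glued_fulltext = fulltext_match(glued, "text")    # False: "text" is not a token
glued_like = like_match(glued, "text")            # True: substring is present
spaced_fulltext = fulltext_match(spaced, "text")  # True once spaces are restored
```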
Use an external full-text search engine.
You could also look at Lucene or Sphinx. For example I know that Sphinx supports finding text using *text*. Here is an extract from the documentation which explains how to enable infix searching, which is what you need.
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows to implement wildcard searching by 'start*', '*end', and '*middle*' wildcards (refer to enable_star option for details on wildcard syntax). When minimum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.
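The "test" example from the documentation can be reproduced with a few lines. This is only a sketch of which substrings would be generated under the description above, not Sphinx's indexer; the function name infixes is made up.

```python
def infixes(keyword, min_infix_len):
    """Return the keyword plus all its substrings of length >= min_infix_len."""
    if min_infix_len <= 0:
        return [keyword]  # 0 means infix indexing is off: only the keyword itself
    found = []
    for length in range(min_infix_len, len(keyword) + 1):
        for start in range(len(keyword) - length + 1):
            piece = keyword[start:start + length]
            if piece not in found:
                found.append(piece)
    if keyword not in found:
        found.append(keyword)  # the word itself is always indexed
    return found

entries = infixes("test", 2)
# → ['te', 'es', 'st', 'tes', 'est', 'test']
```

One 4-letter keyword turns into six index entries at min_infix_len=2, and the count grows roughly quadratically with word length, which is the index growth the documentation warns about.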