Update a Term's value before searching - lucene

I am using Lucene 3.6.1.
Do you know if there is a way to change a Term's value (Term.text()) before Lucene actually perform the search on the Document holding this Term ?
I need this preprocessing because the value is encrypted when written in the index. I although need to do fuzzy search and/or approximate search when searching on this Term.
Best regards.

You want to change a value stored in the index BEFORE you've found it? No, that doesn't make sense.
If you are storing data encrypted in the index, you'll need to search it using encrypted data. If you need to be able take advantage proper text searching, you will simply need to index it in an unencrypted form. Unless you are using some form of encryption that is friendly to text searching, I guess. I suppose if it were a simple cipher or something, you could encrypt both the indexed value and the query and search just fine. Apart from that, though, I don't think employing fuzzy searches on encrypted data is going to be feasible.
My Recommendation:
You could index, but not store, an unencrypted form of the field, allowing you to take advantage of searching as you need.
A field could then be created storing encrypted field to house the retrievable version of the field. Whether you index that field or not depends on whether you may, in some cases, which to search using encrypted data, but I would guess not.
Something like:
Document.add(new Field('fieldname', value, Field.Store.NO, Field.Index.ANALYZED);
Document.add(new Field('fieldnameencrypted', value, Field.Store.YES, Field.Index.NO);
Only fieldname can be searched, but only fieldnameencrypted can be retrieved from a found document (in it's encrypted form).

Related

STRING type or SSTRING element for a text field in table? Pros and cons

I need to create a Z table to store reasons for modifications of a certain custom object.
In the UI, the user will pick a reason ID and then optionally fill a text box. The table will have more or less the fields below:
key objectID
key changeReasonID
changedOn
changedBy
comments
My doubt is with the comments field. I read the documentation about the limitations of STRING and SSTRING, but it's not clear to me if a STRING type field used in a transparent table has a limited length or not.
Even if the length is not limited (at least by the DB), I'm not sure if it's a good idea to use this approach or would you recommend CHAR/SSTRING types with a fix length instead?
**My system is running MSSQL database.
Strings have unlimited length, both in ABAP structures/tables, and in the database.
Most databases will store only a pointer in this column that points to the real CLOB value which is stored in a different memory segment. As a result, they restrict the usage of these columns, and may not allow you to use them as a key or index.
If I remember correctly, ABAP supports a maximum of 16 string fields per structure, which naturally limits its use cases. Also consider that ABAP structures have a maximum size.
For your case, if the comment will remain the only long field, and if you are actually fine with storing unlimited input (--> security constraints?), string sounds like a reasonable option.
If you are unsure what the future will bring, or to be on the safe side regarding security, you might want to opt for sstring or simply a long char instead.

Search in all columns in crate

As elastic search has _all field I am not able to find anything regarding that in cratedb. SO do we need to maintain our own analyzed field for that purpose or does crate provide something in built?
The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved.
The _all field allows you to search for values in documents without knowing which field contains the value. This makes it a useful option when getting started with a new dataset
refer : https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
We don't have something similar to that, so you'd need to add it to the query or maintain a dedicated column.

Luke Where are my field values?

I've used Luke like four times per year for the past three years. I only break it out when I need it. One concept I've never understood is why only certain fields' values are displayed. I can query these "empty" fields for expected values and get the expected results, but Luke never displays these. I assume I'm missing something fundamental and obvious, but it's not so obvious to me.
Example Search tab:
Example Documents tab:
When a program creates a Lucene Document, it might tell Lucene whether to store the value of the field or not. See, for example, the stored argument to the StringField constructor. If the value is not stored then it can be searched on, but the original bytes of the value are not saved in the index, since they are not required nor used by the search.
A typical pattern with, say, http://www.elasticsearch.org/ is to store the original JSON in a single field and not to store the actually indexed fields. That way the application working with the retrieved data might use it's native data format and does not have to be aware of the Lucene and it's flat key-value Document.

Indexing multilingual words in lucene

I am trying to index in Lucene a field that could have RDF literal in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows to attach attributes to term. Is anyone use this mechanism to store language (or other attributes such as datatypes) information ? How is performance compared to the two other approaches ? Any pointer on source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
so basically lucene is a ranking algorithm, it just looks at strings and compares them to other string. they can be encoded in different character encodings but their similarity is the same non the less. Just make sure you load the SnowBallAnalyzer with the supported langugage stemmer and you should get results. Like say Spanish or Chinese

What does Field.Index.NOT_ANALYZED_NO_NORMS mean

I know what does not_analyzed mean. In short the field will not be tokenized by specified Analyzer.
However, what does a NO_NORMS means? I see the documentation, but please explain me in plain English. what is index-time field and document boosting and field length normalization ?
It disables the following features:
index-time field and document boosting: this means that the index will ignore any boosts you did to fields (AbstractField.setBoost) or documents (Document.setBoost). A matching token will always be worth the same.
field length normalization: this means that the index will ignore whether a matching token was in a short field (which should be more relevant) vs. a long field (less relevant). Again, a matching token will always be worth the same, no matter the length of the field.