Lucene 4.9: Create temporary Directory with Documents - apache

I have an FSDirectory, let's call it NORMAL, which already contains many indexed Document instances. Now, I want to create a temporary index, i.e., a RAMDirectory and IndexReader / IndexSearcher, that contains a subset of the previously indexed Documents (let's call this directory TEMP).
I am wondering what's the best way to do that. While indexing data into NORMAL I use an Analyzer that performs stemming on the tokens (EnglishAnalyzer); also not all of the fields are actually stored, i.e., some of them are only indexed but their value is not stored within the Directory NORMAL. That's fine so far.
However, if I now take a subset of such documents, which I later read with an IndexReader, and re-add them to the TEMP Directory, is it appropriate to use EnglishAnalyzer again, or does that cause re-stemming of already stemmed tokens?
And, if a field is not stored at all, I suppose it cannot be added to TEMP, right?

1: It is appropriate to re-analyze. The stored representation of the field is not stemmed, tokenized, or anything else. It's just the raw data.
2: Generally, that's right. If a field is not stored, you can't get it out. Technically, you might be able to reconstruct a lossy version of the field, if the right parameters are set when indexing, and if you are tenacious. Wouldn't recommend it when you could just store the field, of course.
This reads a bit like an XY problem, though. Are you sure there isn't an easier way to do whatever it is you are trying to do? Perhaps by filtering?
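To illustrate point 1, here is a minimal sketch in Python. It uses a toy suffix stripper, not the algorithm EnglishAnalyzer actually applies, and the names `toy_stem` and `toy_analyze` are made up for this example. It only shows why analyzing the stored raw text a second time is not the same as stemming already stemmed tokens.

```python
# Toy model only: a crude suffix stripper standing in for a real stemming analyzer.
def toy_stem(token):
    """Strip at most one common suffix. NOT what EnglishAnalyzer does internally."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def toy_analyze(text):
    """Lowercase, split on whitespace, then stem each token."""
    return [toy_stem(tok) for tok in text.lower().split()]


# A stored field hands back the raw text, exactly as it was supplied.
stored_value = "Glasses breaking"

# What the first index holds as terms after analysis.
indexed_terms = toy_analyze(stored_value)

# Re-adding the stored value to a second index analyzes the raw text again,
# so the second index ends up with the same terms as the first.
reanalyzed_terms = toy_analyze(stored_value)

# Stemming the already stemmed terms is a different operation and can drift.
restemmed_terms = [toy_stem(t) for t in indexed_terms]

print(indexed_terms, reanalyzed_terms, restemmed_terms)
```

In this toy model the re-analyzed terms match the original ones, while the re-stemmed terms do not. That drift would only matter if you fed terms, rather than the stored value, back through the analyzer.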

Related

Can I mark an Elasticsearch index as incomplete? Can I retrieve the list of "complete" indices?

I want to populate an index but make it searchable only after I'm done. Is there a standard way of doing that with Elasticsearch? I think I can set "index.blocks.read": true, but I'd like to be able to ask Elasticsearch for a list of the searchable indices, and I don't know how to do that with that setting. Also, closing/opening an index feels a bit cumbersome.
A solution I found is to add a document to each index defining that index's status. Though querying for the list of indices is a bit annoying. Specifically since querying and paginating a long list of 2,000 index status documents is problematic. Scroll-scan is a solution because it gives me all the results in one go (because every shard has at most 1 index status document). Though that feels like I'm using the wrong tool for the job (i.e. a scroll-scan op that always does exactly one scroll).
I don't want one document that references all the indices because then I'd have to garbage collect it manually alongside garbage collecting indices. But maybe that's the best tradeoff...
Is there a standard practice that I'm not aware of?
How about using aliases? Instead of querying an index directly, your application could query an alias (e.g. live) instead. As long as your index is not ready (i.e. still being populated), you don't assign the live alias to it and hence the index won't be searchable.
Basically, the process goes like this:
1. Create the index with its settings and mappings.
2. Populate it.
3. When done, assign the live alias to it and send your queries against the alias.
4. Later, when you need to index new data, create another index.
5. Populate that new index.
6. When done, switch the aliases, i.e. remove the live alias from the previous searchable index and assign it to the new searchable index.
Here is a simple example that demonstrates this.
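As a rough sketch of the alias switch in the last step, the Python below builds the JSON body that would be sent to `POST /_aliases`. The index names `products_v1` and `products_v2` and the helper `alias_swap_actions` are hypothetical. Sending the request with an HTTP client of your choice is left out.

```python
import json


def alias_swap_actions(alias, new_index, old_index=None):
    """Build the body for POST /_aliases that points `alias` at `new_index`.

    If `old_index` is given, the alias is removed from it in the same request,
    so both changes are submitted together as one set of actions.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# First publication: no previous index carries the alias yet.
first_publish = alias_swap_actions("live", "products_v1")

# Later switch: move the alias from the old index to the freshly populated one.
switch = alias_swap_actions("live", "products_v2", old_index="products_v1")

print(json.dumps(switch, indent=2))
```

With this approach the list of "complete" indices is simply the set of indices that currently carry the alias, which, as far as I know, can be requested with `GET /_alias/live`.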

Get original sql query in postgres extension in C

I am creating an extension to Postgres in C (C++). It is a new data type that behaves like text but is encrypted by an HSM device. But I have a problem using more than one key to protect the data. My idea is to get the original SQL query and process it to choose which key I should use. But I don't know how to do that, or whether it is even possible.
My goal is to change some existing text fields in the database to encrypted ones. That's why I can't provide the key number to my type directly. The type must be seen by the external app as text.
Normally there is a userID field, and a single query always uses that ID to get or set encrypted data. Based on that field I want to choose the key. The HSM can hold billions of keys, which means every user can have their own key. It's not a problem if I need to parse the string myself; I am more than capable of doing that. Performance is not an issue either: the HSM is so slow that I can encode or decode only a couple of fields per second.
In most parts of the planner and executor the current (sub)query is available in a passed PlannerInfo struct, usually:
PlannerInfo *root
This has a parse member containing the Query object.
Earlier in the system, in the rewriter, it's passed as Query *root directly.
In both cases, if there's evaluation of a nested subquery going on, you get the subquery. There's no easy way to access the parent Query node.
The query tree isn't always available deeper in execution paths, such as in expression evaluation. You're not supposed to be referring to it there; expressions are self contained, and don't need to refer to the rest of the query.
So you're going to have a problem doing what you want. Frankly, that's because it's a pretty bad design, from the sound of it. What you should consider instead is:
Using a function to encode/decode the type to/from cleartext, allowing you to pass parameters; or possibly
Using the typmod of the type to store the desired information (but be aware that the typmod is not preserved across casts, subqueries, etc).
There's also the debug_query_string global, but really don't use that. It's unparsed query text so it won't help you anyway. If you (ab)use this in your code, I will cry. I'm only telling you it exists so I can tell you not to use it.
By far and away your best option is going to be to use a function-based interface for this.

RavenDB Index Prefix

Since I don't yet have the ability to promote a temp index in Raven Studio (using build 573), I created two indexes manually. It seems to have worked well, but I have a question about the prefixes on each index: Temp, Auto, Raven. Is there anything special about those keywords? When I create my own index, should I use a prefix like that? For now, when I created my index, I used the index name from the temp index and replaced the word Temp with Manual.
Is that an acceptable approach? Should I be using a certain prefix?
Bob,
The names are just names, they are there for humans, not for RavenDB.
Indexes starting with Raven/ are reserved, and may be overwritten by the system at some point.
Indexes starting with Auto/ or Temp/ may be generated by the system, and may overwrite an existing index.
I generally use the collection/entity name as a prefix, which helps me see right away which entity the index is primarily based on. If I had an index for getting the latest list of movies, I would name it Movie/GetLatestIndex.

How can I retrieve non-stored Lucene field values?

When searching, only stored fields are returned from a search. For debugging reasons, I need to see the unstored fields, too. Is there a way via the API?
Thanks!
P.S.: I know Luke, unfortunately I can't use it in my case.
If the unstored fields were stored… they'd be called stored fields, right?
For unstored fields, all you can see are the tokenized keywords as they were indexed, and that requires un-inverting the inverted index. Using the IndexReader API, you can enumerate all of the unique terms in a particular field. Then, for each term, you can enumerate the documents that contain the term. This tells you roughly the value of the specified field for a given document.
Depending on the analysis performed on the field during indexing, this may allow you to reconstruct the original field exactly, or merely give you a rough idea of what it may have contained.
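The un-inverting idea can be sketched with a toy inverted index in plain Python. This is not the Lucene API: the `postings` structure and the `uninvert` helper are invented for illustration, and the sketch assumes term positions were recorded. Without positions you would only recover the set of terms per document, not their order.

```python
from collections import defaultdict

# Toy inverted index for one unstored field: term -> {doc_id: [positions]}.
postings = {
    "quick": {0: [0], 1: [1]},
    "fox": {0: [1]},
    "lazi": {1: [0]},  # a stemmed form; the original spelling is not recoverable
    "dog": {1: [2]},
}


def uninvert(postings):
    """Walk every term, then every document containing it, and rebuild
    each document's token sequence from the recorded positions."""
    per_doc = defaultdict(dict)
    for term, docs in postings.items():
        for doc_id, positions in docs.items():
            for pos in positions:
                per_doc[doc_id][pos] = term
    return {
        doc_id: [slots[p] for p in sorted(slots)]
        for doc_id, slots in per_doc.items()
    }


reconstructed = uninvert(postings)
print(reconstructed)
```

The rebuilt token lists contain whatever the analyzer emitted at index time, so lowercasing, stemming, and stop-word removal all show up as loss in the reconstruction.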

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc) that should be searched together (ie, a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In that index you can define, per field, whether you want to tokenize it or store term vectors for it.
It takes time to open an IndexSearcher for each directory that contains an index.
If you want to search each type separately, it would be better to index each type into its own index.
A single index is more structured than multiple indexes.
On the other hand, multiple indexes let you balance the load.
Not necessarily answering your direct questions, but... ;)
I'd go with one index, add a Keyword (indexed, stored) field for the type, it'll let you filter if needed, as well as tell the difference between the results you receive back.
(and maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy score; I don't know whether excessively repeated terms in one corpus will throw off the relevancy of documents in the others.)
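To make the relevancy concern concrete, here is a small worked example in Python. It assumes the inverse document frequency formula of Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The corpus sizes and document frequencies are invented numbers, and other similarity implementations would give different values.

```python
import math


def classic_idf(num_docs, doc_freq):
    """idf under the classic TF-IDF formula assumed for this sketch."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical term statistics: the term is very common in forum messages
# and rare in text documents.
forum = {"num_docs": 1000, "doc_freq": 800}
texts = {"num_docs": 1000, "doc_freq": 9}

# Separate indexes: each corpus weighs the term by its own statistics.
idf_forum_only = classic_idf(forum["num_docs"], forum["doc_freq"])
idf_texts_only = classic_idf(texts["num_docs"], texts["doc_freq"])

# Single index: the statistics are pooled, so both corpora share one weight.
idf_combined = classic_idf(
    forum["num_docs"] + texts["num_docs"],
    forum["doc_freq"] + texts["doc_freq"],
)

print(idf_forum_only, idf_texts_only, idf_combined)
```

In the combined index the term gets a single weight that sits between the two per-corpus values, so a term that would be highly discriminating among the text documents counts for less once the forum messages are pooled in.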
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, you design your index architecture much as you would your databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if it were technically feasible).
As #llama pointed out, creating a single uber-index affects relevance scores, security/access issues, among other things and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors, and similar. It also lets you handle reopening IndexReaders and committing IndexWriters differently for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager