Storing a Lucene index in a database using data objects in Java

Is this possible? I cannot access the database directly--only through data objects.
Would I be able to search the index if the items are returned in something like ArrayList?
If this is not possible, is there some way I can use Lucene (or some other tool) to do fuzzy matching against an object using java?
For example, I have a Person object that has a FirstName and LastName. I want to do a fuzzy match on the name.
So, say I have an array of x amount of Person objects, would there be an efficient way of looping through each Person object and comparing the names?

Take those data objects and build a separate Lucene index over them, storing the fields you need. Using your Person example, every Lucene document would be [Id, FirstName, LastName]. A search on this index would return the Id required to query your database for the complete data object.
The actual indexing is easy: you just need to retrieve a list of data objects, iterate over them, generate Lucene documents, and store them using an IndexWriter. You could work against either a filesystem directory for persistent storage or in-memory storage.

Those are the possible solutions I came up with. However, I cannot store my index in an FSDirectory (the project specs do not allow this), and with a RAMDirectory there are going to be thousands of Person objects to search through, so I don't know whether in-memory storage is ideal for this situation.
Is there any other sort of fuzzy match algorithm I can use that will be efficient for large sets of data?
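For a few thousand objects, a plain in-memory scan with an edit-distance check may already be fast enough. Below is a minimal sketch, assuming a hypothetical Person class with firstName and lastName fields and a hand-written Levenshtein distance. This is not Lucene's fuzzy query, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the data object described in the question.
class Person {
    final String firstName;
    final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}

class FuzzyNameMatcher {

    // Two-row dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every person whose "first last" name is within maxEdits of the query.
    static List<Person> search(List<Person> people, String query, int maxEdits) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<Person> hits = new ArrayList<>();
        for (Person p : people) {
            String name = (p.firstName + " " + p.lastName).toLowerCase(Locale.ROOT);
            // Cheap pre-filter: the length difference is a lower bound on the distance.
            if (Math.abs(name.length() - needle.length()) > maxEdits) {
                continue;
            }
            if (levenshtein(name, needle) <= maxEdits) {
                hits.add(p);
            }
        }
        return hits;
    }
}
```

A call such as FuzzyNameMatcher.search(people, "john smith", 2) scans the list once. The cost grows linearly with the number of Person objects, so for much larger data sets a real index would probably be the better choice.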

Related

Can Azure Cognitive Search Index Use A Lookup Table?

I have created an Azure Cognitive Search Service index. All the fields that I want to be able to search, retrieve, filter, sort, and facet are included within the single table that the index is built from. Some of the data fields in that table are coded, but I have a separate table that serves as a dictionary defining those codes in plain English. I would like to be able to search on the defined/literal values in my search results from the index without having to add the contents of the large dictionary table to the search index.
Is it possible to configure the index to use a referential table in this way? How?
Or is my only option to denormalize the entire contents of the dictionary into the index table?
Thanks!
Azure Cognitive Search Services will only return data that it has stored in its search indexes. If you want to "join" that data with some external table, that's something you may need to do client-side.
If you want to search against the terms in your external table, that data needs to be included in the search index.

Search large number of ID search in MongoDB

Thanks for looking at my query. I have 20k+ unique identification IDs provided by a client, and I want to look up all of these IDs in MongoDB using a single query. I tried using $in, but it does not seem feasible to put all 20k IDs in $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast. However, I don't think it is a good idea to run a query with 20k IDs at once, as it may consume quite a lot of resources such as memory. You can split the IDs into multiple groups of a reasonable size and query each group separately, and you can still run those queries in parallel at the application level.
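Splitting the IDs into groups could look roughly like this. It is only a sketch: the IdBatcher class and the batch size are made-up examples, and the MongoDB call itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

class IdBatcher {

    // Splits ids into consecutive sublists of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch can then be passed to its own $in filter, for example via Filters.in("id", batch) in the MongoDB Java driver, where the field name "id" is an assumption about your schema. A good batch size depends on your documents and server, so it is worth measuring.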
Consider importing your 20k+ IDs into a collection (say, using mongoimport). Then perform a $lookup from your root collection to that search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with the original operation that required $in.
Here is a Mongo playground for your reference.

Can I mark an elastic search index as incomplete? Can I retrieve the list of "complete" indices?

I want to populate an index but make it searchable only after I'm done. Is there a standard way of doing that with elastic search? I think I can set "index.blocks.read": true but I'd like a way to be able to ask elastic for a list of the searchable indices and I don't know how to do that with that setting. Also closing/opening an index feels a bit cumbersome.
A solution I found is to add a document to each index defining that index's status. Querying for the list of indices is a bit annoying, though, specifically because querying and paginating a long list of 2,000 index status documents is problematic. Scroll-scan is a solution because it gives me all the results in one go (because every shard has at most 1 index status document). That feels like I'm using the wrong tool for the job, though (i.e. a scroll-scan op that always does exactly one scroll).
I don't want one document that references all the indices because then I'd have to garbage collect it manually alongside garbage collecting indices. But maybe that's the best tradeoff...
Is there a standard practice that I'm not aware of?
How about using aliases? Instead of querying an index directly, your application could query an alias (e.g. live) instead. As long as your index is not ready (i.e. still being populated), you don't assign the live alias to it and hence the index won't be searchable.
Basically, the process goes like this:
1. Create the index with its settings and mappings.
2. Populate it.
3. When done, assign the live alias to it and send your queries against it.
4. Later, when you need to index new data, create another index.
5. Populate that new index.
6. When done, switch the aliases, i.e. remove the live alias from the previous searchable index and assign the live alias to the new searchable index.
Here is a simple example that demonstrates this.
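Separately from the linked example, the alias switch in the last step can be done in a single request to the _aliases endpoint, which applies the remove and add actions together. Below is a rough sketch that only builds the request body. The AliasSwap class and the index names are placeholders, and sending the body with an HTTP client is left out.

```java
class AliasSwap {

    // Builds the JSON body for POST /_aliases that moves an alias from
    // oldIndex to newIndex. No JSON escaping is done here; this assumes
    // plain alias and index names without quotes or backslashes.
    static String swapBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
            + "]}";
    }

    public static void main(String[] args) {
        // Hypothetical index names, used for illustration only.
        System.out.println(swapBody("live", "products_v1", "products_v2"));
    }
}
```

Because both actions travel in one request, queries against the live alias should not hit a moment where the alias points to no index.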

Lucene 4.9: Create temporary Directory with Documents

I have a FSDirectory, let's call it NORMAL, which already contains many indexed Document instances. Now, I want to create a temporary Index, i.e., RAMDirectory and IndexReader / IndexSearcher, that contains a subset of the previously indexed Documents (let's call this directory TEMP).
I am wondering what's the best way to do that. While indexing data into NORMAL I use an Analyzer that performs stemming on the tokens (EnglishAnalyzer); also not all of the fields are actually stored, i.e., some of them are only indexed but their value is not stored within the Directory NORMAL. That's fine so far.
However, if I now take a subset of such documents, which I later read with an IndexReader, and re-add them to the TEMP Directory, is it appropriate, for example, to also use EnglishAnalyzer, or does that cause re-stemming of already stemmed tokens?
And if a field is not stored at all, I suppose it cannot be added to TEMP, right?
1: It is appropriate to re-analyze. The stored representation of the field is not stemmed, tokenized, or anything else. It's just the raw data.
2: Generally, that's right. If a field is not stored, you can't get it out. Technically, you might be able to reconstruct a lossy version of the field, if the right parameters are set when indexing, and if you are tenacious. Wouldn't recommend it when you could just store the field, of course.
This reads a bit like an XY problem, though. Are you sure there isn't an easier way to do whatever it is you are trying to do? Perhaps by filtering?

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
..and a many-to-many relationship between them (about 100K components per assembly), thus in total 10G relationships.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields. Not positive if Lucene supports them natively or not. It's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
To find components in assembly 1, 2 or 3.
To be brief, you have to process the data and index it before you can search. There is therefore no way to just "plug in" Lucene to some data or database; instead, you have to plug the data itself into Lucene (process, parse, analyze, index, and query it).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene, so I don't think indexing would be a problem for you.
You can add multiple field instances to the index with different values ("components") in a document having an "assembly" field as well.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try out the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, you can search for the "@ManyToMany" keyword to find the relevant section and get some idea.