RavenDB UpdateByIndex using a map/reduce index on a denormalized collection

I have a document that contains a denormalized collection, and I have created a map/reduce index that returns the individual items from the collection. Is it possible to use this index to update the denormalized data? I have attempted to, but the data isn't updated. No errors occur; my patch just executes silently. I am able to update the denormalized collection using a map-only index on the whole document, but I want to use the map/reduce index so I can query for the specific items to update from the denormalized collection.

Scripted index results are the solution you need here:
http://ayende.com/blog/162340/ravens-scripted-index-results

Related

Can Azure Cognitive Search Index Use A Lookup Table?

I have created an Azure Cognitive Search service index. All the fields that I want to be able to search, retrieve, filter, sort, and facet on are included within the single table that the index is built from. Some of the data fields in that table are coded, but I have a separate table that serves as a dictionary and defines those codes more literally, in plain English. I would like to be able to search on the defined/literal values in my search results from the index without having to add the contents of the large dictionary table to the search index.
Is it possible to configure the index to use a referential table in this way? How?
Or is my only option to denormalize the entire contents of the dictionary into the index table?
Thanks!
Azure Cognitive Search will only return data that it has stored in its search indexes. If you want to "join" that data with some external table, that's something you may need to do client-side.
If you want to search against the terms in your external table, that data needs to be included in the search index.
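For illustration, here is a minimal sketch of such a client-side join in Java (all names here are hypothetical): coded values returned by the search index are translated through an in-memory map loaded once from the dictionary table.

import java.util.List;
import java.util.Map;

public class CodeTranslator {
    // Loaded once from the dictionary table, e.g. "CLR-01" -> "Red".
    private final Map<String, String> codeToLabel;

    public CodeTranslator(Map<String, String> codeToLabel) {
        this.codeToLabel = codeToLabel;
    }

    // Adds a plain-English label next to each coded field in a page of
    // search results before they are shown to the user.
    public void decorate(List<Map<String, Object>> hits, String codedField) {
        for (Map<String, Object> hit : hits) {
            Object code = hit.get(codedField);
            if (code != null) {
                hit.put(codedField + "Label",
                        codeToLabel.getOrDefault(code.toString(), code.toString()));
            }
        }
    }
}

Note that this only decorates the results; it does not make the dictionary terms searchable. For that, the literal values have to be denormalized into the index.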

How to build an index that is best for this SQL?

I'm querying database by a simple SQL like:
SELECT DISTINCT label FROM Document WHERE folderId=123
I need the best index for it.
There are two ways to implement this.
(1) create two indexes
CREATE INDEX .. ON Document (label)
CREATE INDEX .. ON Document (folderId)
(2) create a composite index:
CREATE INDEX .. ON Document (folderId, label)
Given how database indexes are implemented, which method is more reasonable? Thank you.
Your second index -- the composite index -- is the best index for this query:
SELECT DISTINCT label
FROM Document
WHERE folderId = 123;
First, the index covers the query so the data pages do not need to be accessed (in most databases).
The index works because the engine can seek directly to the records with the given folderId; the label values can then be read straight from the index. Most databases will also use the index to compute the distinct labels.
The answer also depends on the DBMS being used; I'm not sure that all of them can combine multiple single-column indexes for one query.
I should think the combined index (folderId, label) is the best solution in your case, as the engine can build your result set solely from index data, so it never even needs to access the actual table rows.
It can even be a strategy to add extra columns to an index so that a frequently run query can be answered by accessing the index alone (a so-called covering index).
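As a minimal illustration (the JDBC URL, credentials, and index name below are placeholders): with the composite index in place, the engine can seek to folderId = 123 and read the distinct labels from the index alone.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class CoveringIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {

            // Composite index: the filter column first, the selected column second.
            try (Statement st = con.createStatement()) {
                st.execute("CREATE INDEX idx_document_folder_label"
                        + " ON Document (folderId, label)");
            }

            // The query is now covered: a seek on folderId, then the labels
            // are read in index order, which also helps DISTINCT.
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT DISTINCT label FROM Document WHERE folderId = ?")) {
                ps.setInt(1, 123);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("label"));
                    }
                }
            }
        }
    }
}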

Can I mark an Elasticsearch index as incomplete? Can I retrieve the list of "complete" indices?

I want to populate an index but make it searchable only after I'm done. Is there a standard way of doing that with Elasticsearch? I think I can set "index.blocks.read": true, but I'd also like a way to ask Elasticsearch for the list of searchable indices, and I don't know how to do that with that setting. Closing/opening an index also feels a bit cumbersome.
A solution I found is to add a document to each index describing that index's status, though querying for the list of indices is then a bit annoying: paginating through a long list of 2,000 index-status documents is problematic. Scroll-scan works because it gives me all the results in one go (every shard has at most one index-status document), but that feels like the wrong tool for the job (a scroll-scan operation that always does exactly one scroll).
I don't want one document that references all the indices, because then I'd have to garbage-collect it manually alongside garbage-collecting the indices. But maybe that's the best tradeoff...
Is there a standard practice that I'm not aware of?
How about using aliases? Instead of querying an index directly, your application could query an alias (e.g. live) instead. As long as your index is not ready (i.e. still being populated), you don't assign the live alias to it and hence the index won't be searchable.
Basically, the process goes like this:
Create the index with its settings and mappings
Populate it
When done, assign the live alias to it and send your queries against it
Later when you need to index new data, you create another index
You populate that new index
When done, you switch the aliases, i.e. remove the live alias from the previous searchable index and assign the live alias to the new searchable index
Here is a simple example that demonstrates this.
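A minimal sketch of the alias flip, assuming Elasticsearch is reachable on localhost:9200 and using the Java 11 java.net.http client; the index names products_v1/products_v2 and the alias name live are hypothetical (products_v1 is assumed to currently hold the alias):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AliasSwitch {
    static final HttpClient HTTP = HttpClient.newHttpClient();

    static String call(String method, String path, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200" + path))
                .header("Content-Type", "application/json")
                .method(method, json == null
                        ? HttpRequest.BodyPublishers.noBody()
                        : HttpRequest.BodyPublishers.ofString(json))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // 1. Create the new index and populate it (bulk indexing omitted).
        call("PUT", "/products_v2", null);

        // 2. When done, move the alias in one atomic request.
        call("POST", "/_aliases",
                "{\"actions\":["
                + "{\"remove\":{\"index\":\"products_v1\",\"alias\":\"live\"}},"
                + "{\"add\":{\"index\":\"products_v2\",\"alias\":\"live\"}}]}");

        // 3. The application always searches the alias, never a raw index.
        System.out.println(call("GET", "/live/_search", null));

        // Listing the searchable indices = asking what the alias points to.
        System.out.println(call("GET", "/_alias/live", null));
    }
}

The GET /_alias/live call also answers the original question about listing the "complete" indices: whatever currently holds the alias is, by definition, searchable.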

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
and a many-to-many relationship between them (about 100K components per assembly), thus 10G relationships in total.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields; I'm not positive whether Lucene supports them natively or not, it's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
to find components in assemblies 1, 2, or 3.
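A minimal sketch of that scheme, assuming Lucene 5+ (IndexWriter creation omitted): Lucene supports multi-valued fields simply by adding the same field name to a document more than once.

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class ComponentIndexing {
    // One Lucene document per component; "assemblyIds" is added once per
    // owning assembly, which makes it a multi-valued field.
    static void indexComponent(IndexWriter writer, String componentId,
                               List<String> assemblyIds) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("componentId", componentId, Store.YES));
        for (String assemblyId : assemblyIds) {
            doc.add(new StringField("assemblyIds", assemblyId, Store.NO));
        }
        writer.addDocument(doc);
    }

    // Programmatic equivalent of the query string assemblyIds:(1 OR 2 OR 3).
    static Query inAssemblies(String... ids) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String id : ids) {
            b.add(new TermQuery(new Term("assemblyIds", id)), Occur.SHOULD);
        }
        return b.build();
    }
}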
To be brief: you have to process the data and index it before you can search. There is no way to just "plug" Lucene into some data or database; instead you have to feed the data itself into Lucene (process, parse, analyze, index, and then query it).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the datasource to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think indexing would be a problem for you.
You can add multiple instances of a field with different values (the components) to a document that also has an "assembly" field.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, you might search for the "@ManyToMany" keyword to find the exact section and get some idea.
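For a rough idea of what that looks like, here is a sketch assuming Hibernate Search 5.x annotations (Assembly stands in for the entity on the other side of your relationship):

import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToMany;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;

// Each component becomes a Lucene document; the ids of the assemblies
// it belongs to are embedded into that document, so a query such as
// assemblies.id:123 finds all components of assembly 123.
@Entity
@Indexed
public class Component {
    @Id
    private Long id;

    @Field
    private String name;

    @ManyToMany
    @IndexedEmbedded(includePaths = "id")
    private Set<Assembly> assemblies;
}

On the Assembly side you would typically add @ContainedIn to the inverse collection so that changes to an assembly reindex its components.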

Storing a Lucene index in a database using data objects in Java

Is this possible? I cannot access the database directly, only through data objects.
Would I be able to search the index if the items are returned in something like an ArrayList?
If this is not possible, is there some way I can use Lucene (or some other tool) to do fuzzy matching against an object using java?
For example, I have a Person object that has a FirstName and LastName. I want to do a fuzzy match on the name.
So, say I have an array of some number of Person objects: would there be an efficient way of looping through each Person object and comparing the names?
Take those data objects and build a separate Lucene index over them, storing the fields you need. Using your Person example, every Lucene document would be [Id, FirstName, LastName]. A search on this index would return the Id required to query your database for the complete data object.
The actual indexing is easy: you just retrieve a list of data objects, iterate over them, generate Lucene documents, and store them using an IndexWriter. You can work against either a filesystem directory for persistent storage or in-memory storage.
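A minimal sketch of this, assuming Lucene 7/8 on the classpath; Person and PersonDao are hypothetical stand-ins for your data objects:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;

public class PersonSearch {
    public static void main(String[] args) throws IOException {
        List<Person> people = PersonDao.findAll(); // your data layer

        RAMDirectory dir = new RAMDirectory(); // in-memory storage
        try (IndexWriter writer = new IndexWriter(
                dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (Person p : people) {
                Document doc = new Document();
                // The id is the key back to the original data object.
                doc.add(new StringField("id", String.valueOf(p.getId()), Store.YES));
                doc.add(new TextField("firstName", p.getFirstName(), Store.NO));
                doc.add(new TextField("lastName", p.getLastName(), Store.NO));
                writer.addDocument(doc);
            }
        }

        // Fuzzy match on the last name: also matches "smyth", "smithe", ...
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        ScoreDoc[] hits = searcher
                .search(new FuzzyQuery(new Term("lastName", "smith")), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("id"));
        }
    }
}

For what it's worth, thousands of Person documents is a tiny index by Lucene standards, so in-memory storage should be comfortable here.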
Those are the possible solutions I came up with. However, I cannot store my index in an FSDirectory (the project specs do not allow it), and as for RAMDirectory, there are going to be thousands of Person objects to search through, so I don't know if in-memory storage is ideal for this situation.
Is there any other fuzzy-match algorithm I can use that will be efficient for large sets of data?