Index strategy for tagged documents where tags can change often - lucene

In addition to text content, my documents have tags that can be searched too. The problem is that the tags change quite often, and every time a tag is added or removed I have to call UpdateDocument, which is quite slow when done for hundreds of documents.
Are there any well-performing strategies for storing tags that change often and need to be searched with Lucene? I have been thinking about keeping the tags in separate documents to keep them smaller, but I can't figure out how to quickly search for tags AND content.

Store [tag, UID] pairs in a relational database. Whenever a tag is added or removed, update this table accordingly.
When performing a Lucene search that involves both tag data (stored in the database) and content (indexed in Lucene), you will need to merge the results together. One way to do this is to:
Make a database query to pull up all the UIDs for the tag in question
Translate the UIDs to Lucene doc IDs and set a bit in a BitSet for every matching Lucene doc ID
Create a Filter that wraps the BitSet, and pass that filter in to your search.
We implemented this approach in our system, and it works well. You might need to put a cache in front of the database for performance reasons, though. The particulars of step (3) will vary depending on which version of Lucene you're using.
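To illustrate, here is a minimal sketch against the Lucene 3.x API. It assumes each document carries an indexed "uid" field holding the database UID; the field name and the uidsFromDatabase input are placeholders, not anything mandated by Lucene:

import java.io.IOException;
import java.util.BitSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.DocIdBitSet;

// Restricts a content search to the documents whose UIDs the database
// returned for the requested tag (steps 2 and 3 of the list above).
public class TagFilter extends Filter {
    private final Set<String> uids;

    public TagFilter(Set<String> uids) {
        this.uids = uids;
    }

    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (String uid : uids) {
            // Translate each database UID to Lucene doc IDs via the "uid" term.
            TermDocs termDocs = reader.termDocs(new Term("uid", uid));
            try {
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            } finally {
                termDocs.close();
            }
        }
        return new DocIdBitSet(bits);
    }
}

You would then pass the filter into the search, e.g. searcher.search(contentQuery, new TagFilter(uidsFromDatabase), 10). Wrapping it in a CachingWrapperFilter can help when the same tag is queried repeatedly.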

Related

What's the storage solution used by search engines to store indexes to enable efficient querying and scalability?

There are lots of articles on how search engines perform indexing, but I couldn't find any information on how they store these indexed records in a way that enables fast querying with scalability. Could someone explain the index storage mechanisms used in search engines, or point me to an article?
Solr is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead. This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book, as opposed to searching every word of every page of the book.
This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
The inverted index is a major concept in the domains of Information Retrieval and Natural Language Processing. Take a document and note down all the unique words appearing in it, as well as the frequency of each word: you have just built your own inverted index. Solr creates a similar inverted index of the documents posted to its core using a defined schema. The schema is a blueprint that helps Solr create the inverted index of the documents by defining a set of fields in the schema.xml file.
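As a toy illustration of the word-to-documents structure (not how Solr actually stores its index on disk):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the set of document IDs containing it.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Solr searches an index instead of the text");
        index.add(2, "the index at the back of a book");
        System.out.println(index.lookup("index")); // prints [1, 2]
    }
}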

Creating a DB for comment sections with multiple page tags

I'm having a tough time choosing the correct database (SQL, NoSQL) for this use case even though it's so common.
This is the main record -
Page:
Contains a number of fields (which will probably change and be updated in the future).
Connected to a list of tags (which can contain up to 50 tags).
Connected to a comment section.
Page records will be queried by using the tags.
In general, the reads are more important (so writes could be more expensive) and the availability should be high.
The reason not to choose a MongoDB-style DB is the comment section: there are no joins, so the comment section must be embedded in the page document, and the document size could grow too much.
Also, in CAP terms MongoDB favors consistency over availability, and availability is important to me.
The reason not to choose SQL is that the schema of the Page could be updated; there is no fixed schema.
Also, because of the tag system, another relational table would have to be created, and as I understand it, that's bad for performance.
What's the best approach here?
Take a look at Postgres and you can have the best of both.
Postgres supports jsonb, which allows indexing of jsonb data types, so your search by tags can be executed pretty efficiently; alternatively, keep the tags in an array column.
If you're concerned about embedding the comments, link off to another table and benefit from joins, which are first-class citizens.
Given your use case, you could have a Pages table with the main, well-known columns and a few foreign keys (to Authors etc.), tags as an array or jsonb column, some page attributes in jsonb, and your comments in a separate Comments table with foreign keys to Users and Pages.
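For instance, with tags in a Postgres array column, a lookup might be sketched like this via JDBC (the pages table, its columns, and the connection details are illustrative; a GIN index on tags keeps the containment query fast):

import java.sql.*;

// Minimal sketch, assuming a hypothetical "pages" table with a text[]
// "tags" column. A GIN index makes the containment operator (@>) fast:
//   CREATE INDEX pages_tags_idx ON pages USING GIN (tags);
public class PagesByTag {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret");
             PreparedStatement ps = c.prepareStatement(
                "SELECT id, title FROM pages WHERE tags @> ?")) {
            // Find pages whose tags array contains "business".
            ps.setArray(1, c.createArrayOf("text", new String[] {"business"}));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("title"));
                }
            }
        }
    }
}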
Both MongoDB and Postgres are great choices.
PS: I have built far more on MongoDB than on Postgres, but I was really impressed by Postgres after a recent evaluation for a new project.

Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags, e.g. "difficult", "easy", "red", "blue", "business", etc.
The available tags are stored in a table, call it "tags", that contains every available tag.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
---+-------+--------+-------------
1 | title | TG | "[1,4,7,12]"
Though I'm sure this is a bad idea for a number of reasons, is there ever a good reason to do the above?
I think you should read about database normalization and decide for yourself. In short, though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row (1)? You have to parse the field first, check whether the tag is already present, and then update the row to be tags.append(newTag).
Worse still, deleting a tag: search the tags, check it's present, re-create the field.
What if a tag is to change name - some moderation process, perhaps?
Worse again, what about different people specifying a tag name differently? It'd be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: the client has to parse the field in order to use the tags. What about the separator character? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you say, IMO, is that you're capturing the data as a one-off and it's informational only - that is, it makes sense to a user but not to the system per se. In other words, it's probably best avoided (again, IMO).
It seems to me that you want a separate table that stores tags and holds a foreign key relating the tag records back to their parent record in the articles table (this is referred to as "normalizing" the database structure).
Cramming the tags into one field as you have suggested may seem to make sense now, but it will prove difficult to maintain and difficult/time-consuming to pull the values out efficiently as your application grows in size or the amount of data grows much larger.
I would say there are very few reasons to do what you have suggested, given how straightforward it is to create another table and set up a relationship linking keys between the two tables to maintain referential integrity.
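For concreteness, a minimal sketch of the normalized layout described above, created via plain JDBC; all table and column names are illustrative:

import java.sql.*;

// Creates a join table linking articles to tags, which replaces
// parsing a stringified array of tag IDs.
public class NormalizedTags {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:postgresql://localhost/appdb");
             Statement s = c.createStatement()) {
            s.executeUpdate(
                "CREATE TABLE article_tags (" +
                "  article_id INT NOT NULL REFERENCES articles(id)," +
                "  tag_id     INT NOT NULL REFERENCES tags(id)," +
                "  PRIMARY KEY (article_id, tag_id))");
            // Adding or removing a tag is now a one-row INSERT or DELETE,
            // and querying by tag is a plain join:
            //   SELECT a.* FROM articles a
            //     JOIN article_tags atg ON atg.article_id = a.id
            //     JOIN tags t ON t.id = atg.tag_id
            //    WHERE t.name = 'business';
        }
    }
}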
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: the reason I agree is that I like to use the Azure Search API to index these types of data, so the lookup based on tags is not done via SQL. (Using the Azure Search service is not necessary, but in my experience you will get much better performance and scalability by using a search index that lives outside of the database.)
If your primary query language is SQL (relational queries), then you are better off creating a child table that has a row for each tag; otherwise you will wear a performance hit when your query has to perform logic on each value to split it for analysis.
Tagging is a concept we use to get around relational data or hierarchical mapping, so to get the best performance, do not try to use relational concepts to query the tags. Tagging is often best implemented in NoSQL data stores because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string and use an external indexing service to provide search and insights into your data. This is a good trade-off: CRUD data access stays fast, while the external index is optimised for searching. Sure, you can optimise the DB and the search queries to make it work in SQL, but it can take real effort to get right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance, you will find that external indexing is an awesome investment of your time now that saves you time and resources later.
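A minimal sketch of the delimited-string side of this approach (the table, column, and delimiter are illustrative; the external search service is deliberately left abstract):

import java.sql.*;
import java.util.List;

// Tags live in one pipe-delimited string column, so a tag change is a
// single cheap column write; searching by tag is delegated to an
// external index rather than to SQL string functions.
public class DelimitedTags {
    static void saveTags(Connection c, int articleId, List<String> tags) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(
                "UPDATE articles SET tags = ? WHERE id = ?")) {
            ps.setString(1, String.join("|", tags));
            ps.setInt(2, articleId);
            ps.executeUpdate();
        }
        // The row would then be (re-)pushed to the external search index
        // (e.g. Azure Search), which serves the tag queries at scale.
    }
}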

Why RavenDB reads all documents in indexing process and not only collections used by index?

I have a quite large database with ~2.6 million documents, where two collections hold 1.2 million each and the rest are small collections (<1000 documents). When I create a new index for a small collection, it takes a long time for indexing to complete (so temp indexes are useless). It seems that the RavenDB indexing process reads every document in the DB and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when using Smuggler to export data, if I want to export only one small collection it still reads all documents, and exporting can take quite a long time. At the same time, a custom app that uses the RavenDB Linq API and indexes can export the data in seconds.
Why does RavenDB behave like this? And is there perhaps a configuration setting that might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

Lucene indexing for structured documents where each text line has meta-data

I have a document structure where each text line in the document has some meta-data associated with it. The search result must show the line and the meta-data for the line.
Currently I am storing each such line as a Lucene document, with the meta-data in one of the non-indexed stored fields. That is, I create and add a Lucene Document structure for each line. My concern is that I may end up with too many documents in the index.
Is there a more elegant approach?
Thanks
Personally I'd index the documents as normal, and figure out the metadata / line number later.
There is no question about whether Lucene can cope with that many documents; however, it might degrade the search results somewhat. For example, you can perform searches that look for multiple terms in close proximity to each other, but this obviously won't work when the terms are split over multiple documents (lines).
How many is "too many"? Lucene has been known to handle hundreds of millions of records in a single index, so I doubt that you should have a problem. That being said, there's no substitute for testing and benchmarking yourself to see if this approach is good for your needs.
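For reference, the per-line approach from the question might be sketched like this against the Lucene 3.x Field API (the field names are illustrative):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One Lucene Document per text line: the line text is analyzed and
// searchable, the per-line meta-data is stored but not indexed, and a
// parent ID ties the lines of one source document together.
public class LineIndexer {
    static void addLine(IndexWriter writer, String parentId,
                        String lineText, String lineMeta) throws IOException {
        Document doc = new Document();
        doc.add(new Field("docId", parentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("line", lineText, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("meta", lineMeta, Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }
}

Each hit then returns both the matching line and its meta-data without a second lookup.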