Why does RavenDB read all documents during indexing instead of only the collections used by the index?

I have a quite large database with ~2.6 million documents. Two collections hold about 1.2 million documents each, and the rest are small collections (<1000 documents). When I create a new index for a small collection, the indexing takes a long time to complete (so temp indexes are useless). It seems that the RavenDB indexing process reads every document in the database and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when I use Smuggler to export data and only want to export one small collection, it reads all documents, so the export can take quite a long time. At the same time, a custom app that uses the RavenDB LINQ API and indexes can export the same data in seconds.
Why does RavenDB behave like this? And is there some configuration setting that might change this behavior?

RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
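To make that concrete, here is a minimal sketch (RavenDB 2.x client API) of a static index scoped to one small collection; the Invoice class and property names are placeholders. Even though the Map only mentions Invoices, in 2.x every document in the database is still pushed through the indexing pipeline and filtered by its Raven-Entity-Name metadata.

    using System.Linq;
    using Raven.Client.Indexes;

    // Hypothetical document type standing in for one of the small collections.
    public class Invoice
    {
        public string Id { get; set; }
        public string Number { get; set; }
    }

    // A static index over the "Invoices" collection. The Map expression only touches
    // Invoices, but RavenDB 2.x still reads every document and discards the ones
    // whose Raven-Entity-Name metadata doesn't match.
    public class Invoices_ByNumber : AbstractIndexCreationTask<Invoice>
    {
        public Invoices_ByNumber()
        {
            Map = invoices => from invoice in invoices
                              select new { invoice.Number };
        }
    }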
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database, and put everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.

It seems the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization
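For completeness, the Raven/DocumentsByEntityName index Ayende mentions can already be queried directly from the 2.x client. A rough sketch (the collection name "Invoices" and the server URL are placeholders):

    using System.Linq;
    using Raven.Client;
    using Raven.Client.Document;

    // Sketch: pull only the documents tagged with one collection name out of the
    // built-in Raven/DocumentsByEntityName index (RavenDB 2.x client API).
    using (IDocumentStore store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
    using (IDocumentSession session = store.OpenSession())
    {
        var invoices = session.Advanced
            .LuceneQuery<object>("Raven/DocumentsByEntityName")
            .WhereEquals("Tag", "Invoices")
            .ToList();
    }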


How do you implement search over static content within cshtml files

I am using ASP.NET Core and Razor, and as it is a help system I would like to implement some kind of search facility that brings back a list of hyperlinked results based on the search terms.
I would essentially like the search to iterate over the content contained within the page's markup tags and then link this to the appropriate page/view.
What is the best way to do this?
I'm not even sure how you get a handle on the actual content of your own cshtml pages and then go from there.
This question is far too broad. However, I can provide you with some pointers.
First, you need to determine what you actually want to surface and where that data lives. Your question says "static web pages", but then you mention .cshtml. Traditionally, when it comes to creating your own search, you're going to have access to some particular dataset (tables in a database, for example). It's much simpler to search across that more structured data than the end result of it being dumped in various and sundry places across a web page.
Search engines like Google only index in this way because they typically don't have access to the raw data (although some amount of "access" can be granted via things like JSON-LD and other forms of Schema.org markup). In other words, they actually read from the web page out of necessity, because that's what they have to work with. It's certainly not the approach you would take if you have access to the data directly.
If for some reason you need to actually spider and index your own site's HTML content, then you'll essentially have to do what the big boys do: create a bot, run it on a schedule, crawl your site link by link, download each document, and then parse and process it. The end result would be a set of structured data that you can actually query against, which is why all this is pretty much just wasted effort if you already have that data.
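If you do go that route, the bot itself doesn't need to be fancy. A minimal sketch (the root URL, page limit, and the crude regex-based link and tag handling are all placeholder choices; a real crawler would use an HTML parser and respect robots.txt):

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    // Minimal crawler sketch: fetch pages starting from a root URL, follow same-site
    // links, and strip tags so the remaining text can be stored as searchable data.
    class SiteCrawler
    {
        static readonly HttpClient Http = new HttpClient();

        public static async Task<Dictionary<string, string>> CrawlAsync(Uri root, int maxPages = 50)
        {
            var pages = new Dictionary<string, string>();        // url -> plain text
            var queue = new Queue<Uri>(new[] { root });

            while (queue.Count > 0 && pages.Count < maxPages)
            {
                var url = queue.Dequeue();
                if (pages.ContainsKey(url.AbsoluteUri)) continue;

                string html = await Http.GetStringAsync(url);
                pages[url.AbsoluteUri] = Regex.Replace(html, "<[^>]+>", " "); // crude tag strip

                // Queue same-host links found in href attributes.
                foreach (Match m in Regex.Matches(html, "href=\"([^\"#]+)\""))
                    if (Uri.TryCreate(url, m.Groups[1].Value, out var next) && next.Host == root.Host)
                        queue.Enqueue(next);
            }
            return pages;
        }
    }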
Once you have the data, however you got there, you simply query it. In the most basic of forms, you could store it in a table in a database and literally issue SQL queries against it. Your search keywords/parameters are essentially the WHERE of the SELECT statement, so you'd have to figure out a way to map the keywords/parameters you're receiving to an acceptable WHERE clause that achieves that.
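As a minimal sketch of that mapping, assuming a hypothetical Pages(Url, Title, Content) table and a single keyword, the WHERE clause can just be a parameterized LIKE:

    using System;
    using System.Data.SqlClient;

    // Sketch: map a single search keyword to a WHERE ... LIKE clause.
    // The connection string, Pages table, and column names are assumptions.
    class SearchSketch
    {
        public static void Search(string connectionString, string keyword)
        {
            const string sql = "SELECT Url, Title FROM Pages WHERE Content LIKE @term";
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@term", "%" + keyword + "%");
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine($"{reader["Title"]} -> {reader["Url"]}");
                }
            }
        }
    }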
More traditionally, you'd use an actual search engine: essentially a document database that is designed and optimized for search, and generally provides a more search-appropriate API to query against. There are lots of options in this space, from roll-your-own to hosted SaaS solutions and everything in between. Of course, the cost goes up the more it works out of the box, and down the more of the work you do yourself.
One popular open-source and largely free option is Elasticsearch. It uses Lucene indexes, which it stitches together in a clustered environment to provide failover and scale. Deployment is a beast, to say the least, though it's gotten considerably better with things like containerization and orchestration. You can stand up an Elasticsearch cluster in something like Kubernetes with relative ease, though you will probably still need to do a bit of config. Elasticsearch also has hosted options, but you know, cost.
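As a rough idea of what querying it looks like from ASP.NET Core, here is a sketch using the NEST client (the index name, Page document class, and localhost URL are assumptions):

    using System;
    using Nest;

    // Hypothetical document type representing one indexed help page.
    public class Page
    {
        public string Url { get; set; }
        public string Title { get; set; }
        public string Content { get; set; }
    }

    class ElasticSketch
    {
        public static void Search(string terms)
        {
            var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                .DefaultIndex("help-pages");
            var client = new ElasticClient(settings);

            // Full-text match against the Content field; hits come back ranked by relevance.
            var response = client.Search<Page>(s => s
                .Query(q => q
                    .Match(m => m
                        .Field(p => p.Content)
                        .Query(terms))));

            foreach (var hit in response.Documents)
                Console.WriteLine($"{hit.Title} -> {hit.Url}");
        }
    }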

Can a RavenDB collection be forced to stay in memory?

Can I force a RavenDB collection to stay in memory so that queries against it are fast? I read about aggressive caching, but the documentation only talks about request caching. If I have sharding enabled, can I force all the shards to cache the collection in memory?
Any help is appreciated,
Thanks
RavenDB doesn't really have "Collections" in the sense you are thinking. The only thing that collections are used for is to filter documents by their Raven-Entity-Name metadata. This serves a few purposes:
The Raven Studio UI can group things to make them easier to find.
Indexes can use a shortcut form of docs.EntityName instead of having a where clause against the metadata in every index.
But that's pretty much it. They aren't isolated on disk. For example, when Raven indexes documents, every index considers all documents. Docs get discarded quickly if they don't pass the collection filter, but they are still put through the pipeline.
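To illustrate the shortcut, here are two equivalent index map strings (RavenDB 2.x IndexDefinition syntax; the "Products" collection and Name field are placeholders), the first using the docs.Products sugar and the second spelling out the metadata filter it expands to:

    using Raven.Abstractions.Indexing;

    // Sketch: the docs.Products shortcut vs. the explicit Raven-Entity-Name filter.
    var shortcutForm = new IndexDefinition
    {
        Map = @"from product in docs.Products
                select new { product.Name }"
    };

    var explicitForm = new IndexDefinition
    {
        Map = @"from doc in docs
                where doc[""@metadata""][""Raven-Entity-Name""] == ""Products""
                select new { doc.Name }"
    };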
You can read more about collections in the RavenDB documentation.
Also - as long as you are still in a learning phase, you may want to post this style of question on the RavenDB Google Group instead. You will get a much better response. You won't get much of a rating on StackOverflow when you are asking non-code "can X do Y?" questions. Come back here when you have written some code. See the ravendb tag for other questions that have been answered, and you'll get a feel for what StackOverflow is for. Thanks.
You don't need to do that.
RavenDB will automatically detect usage patterns and keep frequently requested documents in memory.
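If you do want more control on the client side, the aggressive caching mentioned in the question is scoped like this (a sketch against the RavenDB 2.x client API; the URL, duration, and document id are placeholders). Note that it caches request results on the client rather than pinning a collection in server memory:

    using System;
    using Raven.Client;
    using Raven.Client.Document;

    // Sketch: inside the AggressivelyCacheFor scope, repeated requests are answered
    // from the client's local cache instead of hitting the server.
    using (IDocumentStore store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
    using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
    using (IDocumentSession session = store.OpenSession())
    {
        var product = session.Load<object>("products/1");
    }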

Is duplicating data in SQL and a document store (like MongoDB) a legitimate idea, or should it be avoided?

I have a question. I am considering using a document store for some types of objects (e.g. product data). The criterion for using the document store is that an object has a detail page, so a fast read of the entire object is necessary (for example, a product with all its attributes, images, comments, etc.). The criterion for using SQL is displaying lists (e.g. the N newest, most popular, etc.).
Some objects meet both criteria; products are an example. So is it normal practice to store the info that will be used to render lists on index pages in a SQL database, and the other data in a document store?
If denormalization is suitable for getting the performance you need, go ahead with it. But you have to ensure that you have a way to deal with updates of denormalized data. Your options in MongoDB are:
multiple queries to avoid denormalization
embedded docs
database references
Make your choice.
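As a sketch of the "embedded docs" option in C# (class and field names are illustrative), the detail page reads a single denormalized document that already carries everything the page needs:

    using System.Collections.Generic;
    using MongoDB.Bson;

    // One denormalized document per product detail page: one read renders the page.
    public class ProductPage
    {
        public ObjectId Id { get; set; }
        public string Name { get; set; }
        public decimal Price { get; set; }

        // Denormalized copies embedded directly in the document; these must be
        // refreshed whenever the source data changes.
        public List<string> ImageUrls { get; set; }
        public List<CommentSnapshot> Comments { get; set; }
    }

    public class CommentSnapshot
    {
        public string Author { get; set; }
        public string Text { get; set; }
    }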
The main idea is that MongoDB was created for denormalization and embedding. On one of my past projects I did SQL denormalization to get better performance, but I don't like SQL denormalization because it creates a lot of duplicated data (if you have a one-to-many relation, for example). The second step was rewriting the data access layer to MongoDB. And in MongoDB, for some difficult pages where I needed to load multiple documents, I created a denormalized document (with embedded collections and plain data from different documents) to fit the page content. Now all my problem pages work fast, like Facebook ;).
But there are possible problems here, because you have to keep the denormalized document up to date every time the source data changes. Also, all my denormalized data updates run asynchronously, so some data can be stale at a given moment, but that's normal practice. Even Stack Overflow uses denormalization: sometimes when I open a question I see an answer, but when I go back to the questions list and refresh the page, the question sometimes still shows no answers.
If I need denormalization, I choose MongoDB.

What exactly is 'indexing' in Core Data?

As an answer to a question I asked yesterday (New Core Data entity identical to existing one: separate entity or other solution?), someone recommended I index an attribute.
After much searching on Google for what an 'index' is in SQLite/Core Data, I'm afraid I'm no closer to knowing exactly what it is or how it speeds up fetching based on an attribute. Keep in mind I know nothing about SQLite/databases in general, other than a vague idea based on reading way, way, way too much about Core Data over the past few months.
Simplistically, indexing is a kind of presorting. If you have a numerical attribute index, the store maintains a linked list in numerical order. If you have a text attribute, it maintains a linked list in alphabetical order. Depending on the algorithm, it can maintain other kinds of information about the attributes as well. It stores the data in the index attached to the persistent store file.
It makes fetches based on the indexed attribute go faster with the tradeoff of larger file size and slightly slower inserts.
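Language aside, the idea can be sketched with an ordinary sorted map (C# here purely for illustration; Core Data maintains the equivalent structure inside the persistent store for you): the index maps each attribute value to the objects that have it, so a fetch by that attribute becomes a lookup instead of a scan.

    using System;
    using System.Collections.Generic;

    // Conceptual sketch only: an "index" on a name attribute is essentially a
    // presorted map from attribute value to the matching record ids.
    class IndexSketch
    {
        static void Main()
        {
            var records = new Dictionary<int, string>
            {
                [1] = "Charlie", [2] = "Alice", [3] = "Bob", [4] = "Alice"
            };

            // Build the index once (this is the "slightly slower insert" cost).
            var nameIndex = new SortedDictionary<string, List<int>>();
            foreach (var (id, name) in records)
            {
                if (!nameIndex.TryGetValue(name, out var ids))
                    nameIndex[name] = ids = new List<int>();
                ids.Add(id);
            }

            // A fetch by name is now a direct lookup instead of scanning every record.
            Console.WriteLine(string.Join(", ", nameIndex["Alice"]));  // 2, 4
        }
    }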
All these answers are good, but overly technical.
An index is pretty much identical to the index you'd find in the back of a book. Thus, if you wanted to find which page a certain word occurred on, you'd go through the index alphabetically and quickly find all the pages where that word occurred.
If you didn't have an index, the user would have to resort to going through EVERY single page word by word, which could take quite a while. Thus, the index is created pretty much in this way ONLY once, and not every time the user wants to search.
Wikipedia has a great explanation of a database index:
"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space."

Which would be better? Storing/accessing data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built-in index support and query analyzers will give you good performance for free, while a text file might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface that hides the difference between a database and a flat file. Later on, you can benchmark which one provides the best performance, but I think the database will give you the best bang per hour of development.
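A minimal sketch of that interface and a flat-file implementation (the file naming scheme and the "?" wildcard handling are assumptions based on the question; a database-backed implementation could later slot in behind the same interface):

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    // Callers ask for words matching a pattern like "r??n" and never learn whether
    // a flat file or a database answered.
    public interface IWordSource
    {
        IEnumerable<string> FindMatches(string pattern);
    }

    // Flat-file implementation: one file per word length (words4.txt, words5.txt, ...),
    // one word per line.
    public class TextFileWordSource : IWordSource
    {
        private readonly string _directory;

        public TextFileWordSource(string directory)
        {
            _directory = directory;
        }

        public IEnumerable<string> FindMatches(string pattern)
        {
            var path = Path.Combine(_directory, $"words{pattern.Length}.txt");
            var regex = new Regex("^" + pattern.Replace("?", ".") + "$", RegexOptions.IgnoreCase);

            foreach (var word in File.ReadLines(path))
                if (regex.IsMatch(word))
                    yield return word;
        }
    }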
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you append new records to the end of the file) is always faster than storing it in a database. So, if you are building a high-load application, you can save the data to a text file and copy it to a database later. However, in most applications you should use a database instead of a text file, because the database approach has many benefits.