Cosmos DB - Do I have to wait for indexing?

If I insert a document and, on the next line of code, search for it by one of its fields (other than Id), will I find it? Or do I have to wait for some indexing to happen?

Microsoft provides clear documentation on the different indexing strategies available and how to use them. The information below is a summary.
Cosmos DB has multiple indexing strategies. By default, it is set to consistent, which means that documents are indexed as they are placed into the collection, so new documents should be immediately available for querying. You are free to switch this to lazy indexing mode, which indexes when it's more convenient for the database.
It's good to know that with consistent indexing turned on, you will observe a higher RU cost per insert/upsert because the cost of indexing is included. So whether consistent or lazy makes sense for you depends on the nature of the app you're building.
You can check which indexing mode you're using in the portal, and you can actually tune indexing by including or excluding specific JSON paths in your documents. This is a really powerful and cool feature in Cosmos. By default, the mode is consistent, and a path of /* indicates that all JSON properties are covered by the index.
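For reference, this is roughly what the default indexing policy looks like in the portal; switching "indexingMode" to "lazy" is how you opt into lazy indexing (a sketch, not the exact JSON for every API version):

    {
      "indexingMode": "consistent",
      "automatic": true,
      "includedPaths": [
        { "path": "/*" }
      ],
      "excludedPaths": [
        { "path": "/\"_etag\"/?" }
      ]
    }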

Related

Managing the neo4j index's life cycle (CRUD)

I have limited (and disjointed) experience with databases, and nearly none with indexes. Based on web search, reading books, and working with ORMs my understanding can be summed up as follows:
An index in databases is similar to a book index in that it lists "stuff" that's in the book and tells you where to find it. This helps with lookup efficiency (this is most probably not the only benefit).
In (at least some) RDBMSs, primary key fields get automatically indexed, so you never have to directly manipulate them.
I'm tinkering with neo4j, and it seems you have to be deliberate about indexes, so now I need to understand them, but I cannot find clear answers to:
How are indexes managed in neo4j?
I know there's automatic indexing, how does it work?
If you choose to manually manage your own indexes, what can you control about them? Perhaps the index name, etc.?
Would appreciate answers or pointers to answers, thanks.
Neo4j uses Apache Lucene under the covers if you want index-engine-like capabilities for your data. You can index nodes and/or relationships; the index helps you look up a particular instance or set of nodes or relationships.
Manual Indexing:
You can create as many node/relationship indexes as you want, and you can specify a name for each index. The config can also be controlled, i.e. whether you want exact matching (the default) or Lucene's full-text indexing support. Once you have the index, you simply add nodes/relationships to it with the key/value you want indexed. You do, however, need to take care of "updating" data in the index yourself if you make changes to the node properties.
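As a minimal sketch (Neo4j 1.9-era embedded Java API; the index name "persons" and the property values are just examples):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;
    import org.neo4j.graphdb.index.Index;

    public class ManualIndexExample {
        public static void main(String[] args) {
            GraphDatabaseService graphDb =
                    new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
            Transaction tx = graphDb.beginTx();
            try {
                // Create (or look up) a named node index; config defaults to exact matching.
                Index<Node> persons = graphDb.index().forNodes("persons");

                Node alice = graphDb.createNode();
                alice.setProperty("name", "Alice");

                // You add the key/value pair to the index yourself...
                persons.add(alice, "name", "Alice");

                // ...and you must keep it in sync yourself when properties change:
                // persons.remove(alice, "name", "Alice"); persons.add(alice, "name", "Alicia");

                // Look the node up again via the index.
                Node found = persons.get("name", "Alice").getSingle();
                System.out.println(found.getProperty("name"));

                tx.success();
            } finally {
                tx.finish();
            }
            graphDb.shutdown();
        }
    }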
Auto-Indexing:
Here you get one index for nodes and one index for relationships if you turn them on in the neo4j.properties file. You may specify which properties are to be indexed, and from the point of turning them on, the index is automatically managed for you, i.e. any nodes created after this point are added to the index and updated/removed automatically.
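The relevant neo4j.properties settings look roughly like this (the property names to index are examples):

    # Enable auto-indexing for nodes and choose which properties get indexed
    node_auto_indexing=true
    node_keys_indexable=name,email

    # The same for relationships
    relationship_auto_indexing=true
    relationship_keys_indexable=since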
More reading:
http://docs.neo4j.org/chunked/stable/indexing.html
The above applies to versions < 2.0
2.0 adds more around the concept of indexing itself; you might want to go through:
http://www.neo4j.org/develop/labels
http://blog.neo4j.org/2013/04/nodes-are-people-too.html
Hope that helps.

Why does RavenDB read all documents in the indexing process, and not only the collections used by the index?

I have quite a large database with ~2.6 million documents, of which two collections hold 1.2 million each and the rest are small collections (<1000 documents). When I create a new index for a small collection, it takes a lot of time for indexing to complete (so temp indexes are useless). It seems that RavenDB's indexing process reads each document in the DB and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when using Smuggler to export data and I want to export only one small collection, it reads all documents, and exporting might take quite a lot of time. At the same time, a custom app which uses the RavenDB Linq API and indexes can export the data in seconds.
Why does RavenDB behave like this? And is there maybe some configuration setting which might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
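For illustration, this is roughly how a document and its metadata appear in a Smuggler export; the type and collection names here are made up:

    {
      "Name": "Alice",
      "@metadata": {
        "Raven-Entity-Name": "Users",
        "Raven-Clr-Type": "MyApp.Models.User, MyApp"
      }
    }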
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.

We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

Why are document stores like Lucene / Solr not included in NoSQL conversations?

All of us have come across the recent hype around no-SQL solutions. MongoDB, CouchDB, BigTable, Cassandra, and others have been listed as no-SQL options. Here's an example:
http://architects.dzone.com/articles/what-nosql-store-should-i-use
However, three years ago a co-worker and I were using Lucene.NET as what seems to fit the description of no-SQL. We did not use it just for user-inputted search queries; we used it to make the data of a few reindexed RDBMS tables extremely performant. We implemented our own .NET sort-of-equivalent-to-Solr service to manage these indexes and make them callable. When I left the company, the team switched to Solr itself. (For those not in the know, Solr is a web service that wraps Lucene with REST-callable queries and index dumps.)
What I don't understand is, why is Solr not counted in the typical lists of no-SQL solution options? Am I missing something here? I assume that there are technical reasons why Solr is not comparable to the likes of CouchDB, etc., and in fact I understand that CouchDB uses Lucene as its data store (yes?), but what disqualifies Solr?
I'm not asking as some kind of Solr fanboy or anything, I just don't understand why Solr and the like don't fit the definition of no-SQL, and if Solr technically does fit the definition then what about it likely makes people pooh-pooh it? I'm asking because I'm having difficulty determining whether I should continue using Lucene-based solutions (like Solr) for solutions that I build or if I should really do more research with these other options.
I once listened to an interview with author Ursula K. LeGuin about fiction writing. The interviewer asked her about authors who work in different genres of writing. What makes one author a romance writer, another a mystery writer, and another a science fiction writer? LeGuin responded by explaining:
Genre is about marketing, not about content.
It was an eye-opening statement.
I think the same applies to technology solutions. The NoSQL movement is attracting attention because it's full of marketing energy right now. NoSQL data stores like Hadoop, CouchDB, and MongoDB have commercial ventures backing them, pushing their solutions as new and innovative and exciting so they can grow their business. The term "NoSQL" is a marketing brand that helps them explain their value.
You're right that Lucene/Solr is technically very similar to a NoSQL document store: it's a denormalized bag of documents (their term) with fields that aren't necessarily consistent across the collection of documents. It's indexed in a sophisticated way to allow you to search across all fields or by specific fields.
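As a small illustration of that schemalessness, here is a Lucene sketch (Lucene 4.x-era API; all field names are made up) where two documents in the same index carry completely different fields:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SchemalessExample {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(new RAMDirectory(), config);

            // One document with a title and a body...
            Document article = new Document();
            article.add(new StringField("id", "article-1", Store.YES));
            article.add(new TextField("title", "NoSQL and search engines", Store.YES));
            article.add(new TextField("body", "Lucene stores denormalized documents...", Store.NO));
            writer.addDocument(article);

            // ...and another with entirely different fields; no shared schema is required.
            Document product = new Document();
            product.add(new StringField("id", "product-42", Store.YES));
            product.add(new TextField("name", "Widget", Store.YES));
            product.add(new StringField("sku", "WID-42", Store.YES));
            writer.addDocument(product);

            writer.close();
        }
    }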
But that's not the genre Lucene uses to explain its value. They don't have the same mission to grow a market and a business, since they're managed by the Apache Foundation. They're happy to focus on the use case of fulltext search, even though the technology could be used in other ways. They're following a tenet of software success: do one thing, and do it well.
After doing more Google-searching, I think this document sums it up pretty well:
https://web.archive.org/web/20100504055638/http://www.lucidimagination.com/blog/2010/04/30/nosql-lucene-and-solr/
Case in point: Lucene/Solr is NoSQL and could be considered one of NoSQL's more mature "forefathers". It just does not get the NoSQL hype it deserves because it didn't invent the term "no-SQL" and its users don't use the term, so the hype machine overlooked it.
I think the most relevant characteristic of Solr/Lucene that drops it from the NoSQL list is that, until recently, making Lucene work as a real-time system was a pain. The usual workflow for any performant application was to index incremental updates in batches, updating the index every 5 minutes, for example.
I think that stimpy77 is partly right about NoSQL being a branding thing. But also, NoSQL means a data-storage platform that is simpler/easier than SQL-based solutions. And while Solr/Lucene shares some aspects (it stores data), I think it misses the mark to assume Solr/Lucene could be used as primary data storage for anything that has relationships. Sure, lots of documents can be thrown into it, and powerful search can pull them back. But as soon as you want relationships, others such as CouchDB, which have a query syntax of some kind, do much better. Search is a band-aid solution in that case.

Think about the use case "find all documents tagged with the word 'car'". If I have some structure in my data, then it's easy for me to get the documents for the tag car and pull everything back, versus relying on a search query that includes fq=tag:'car'. Search is more and more powerful the fewer relationships you have; the more relationships you have, the better a datastore like CouchDB and its brethren is. That's why you still see CouchDB and friends paired with Solr, and vice versa! Let each one do what it does best.
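For what it's worth, the tag lookup above is just a filter query against Solr's select handler (host, core, and field name are illustrative):

    http://localhost:8983/solr/select?q=*:*&fq=tag:car&wt=json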
Of course, that isn't to say you can't leverage storing your source data in Solr; that can be a powerful tool to use!
In my opinion, the main operational differences between a NoSQL store and Solr are the following.
Solr requires an intermediate data store (a database or XML files), whereas a NoSQL store is itself the primary data store.
You cannot do constant writes to Solr (Solr 4.0 seems to bring that support); you can only index at most every 2 minutes and 200 records at a time (which is very slow for high-throughput writes and forces you onto intermediate storage).
You are required to change/define the schema when you alter what is stored in a document; NoSQL stores have no such definitions.
Solr indexes have performance implications as the index size grows, whereas NoSQL stores are optimized for it (or claim to be :) ).
Solr has the underlying Lucene search algorithms bundled, but in NoSQL you need to build them yourself. This applies to the magnificent faceted search and blazing-fast document search Solr provides.
Last, these few points are about the actual differences, not about the marketing positioning discussed elsewhere here, by which Solr falls outside NoSQL.
Lucene/Solr - I'm going to say Solr, since Solr uses Lucene internally and has additional features. So Solr is basically an upgrade to Lucene in a new costume.
Solr is mainly used to create facets and to index plain text for a search engine.
Solr can use most databases to store its data. Keeping your data only in Solr can be inconsistent, since it uses the disk directly.
NoSQL databases are easy to learn compared to Solr. Solr has more or less a lot of configuration and concepts (for example, fields).
Performance is something we have to consider between the two. Solr provides high performance compared to other NoSQL databases.
Note: combining Solr with a database provides the best performance.
Summary: Solr is also a NoSQL datastore, a predecessor of many NoSQL databases, which didn't get the hype of the others. But it's still in the field due to its performance and power.

What exactly is 'indexing' in Core Data?

As an answer to a question I asked yesterday (New Core Data entity identical to existing one: separate entity or other solution?), someone recommended I index an attribute.
After much searching on Google for what an 'index' is in SQLite/Core Data, I'm afraid I'm no closer to knowing exactly what it is or how it speeds up fetching based on an attribute. Keep in mind I know nothing about SQLite/databases in general, other than a vague idea based on reading way, way, way too much about Core Data the past few months.
Simplistically, indexing is a kind of presorting. If you have a numerical attribute index, the store maintains a linked list in numerical order. If you have a text attribute, it maintains a linked list in alphabetical order. Depending on the algorithm, it can maintain other kinds of information about the attributes as well. It stores the data in the index attached to the persistent store file.
It makes fetches based on the indexed attribute go faster with the tradeoff of larger file size and slightly slower inserts.
All these answers are good, but overly technical.
An index is pretty much identical to the index you'd find in the back of a book. Thus, if you wanted to find which page a certain word occurs on, you'd go through the index alphabetically and quickly find all the pages where that word occurs.
If you didn't have an index, the user would have to resort to going through EVERY single page, word by word, which could take quite a while. The index is created pretty much in this way ONLY once, not every time the user wants to search.
Wikipedia has a great explanation of a database index:
"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space."

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
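A rough sketch of that approach, tagging every record with its source table (Lucene 4.x-era API; the JDBC URL, table, and column names are placeholders):

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), config);
                 Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles")) {
                while (rs.next()) {
                    Document doc = new Document();
                    // Record which table this row came from, so one index can cover many tables.
                    doc.add(new StringField("table", "articles", Store.YES));
                    doc.add(new StringField("id", rs.getString("id"), Store.YES));
                    doc.add(new TextField("title", rs.getString("title"), Store.YES));
                    doc.add(new TextField("body", rs.getString("body"), Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }

At query time you can then restrict a search to a single table with a clause like table:articles.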
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler that allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
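A minimal data-config.xml looks roughly like this (driver, URL, credentials, and table/field names are examples):

    <dataConfig>
      <dataSource driver="org.postgresql.Driver"
                  url="jdbc:postgresql://localhost/app"
                  user="solr" password="secret"/>
      <document>
        <entity name="article" query="SELECT id, title, body FROM articles">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

A full import is then triggered with a request like http://localhost:8983/solr/dataimport?command=full-import.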
As introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.