Managing the neo4j index's life cycle (CRUD) - indexing

I have limited (and disjointed) experience with databases, and nearly none with indexes. Based on web search, reading books, and working with ORMs my understanding can be summed up as follows:
An index in a database is similar to a book index in that it lists "stuff" that's in the book and tells you where to find it. This helps with lookup efficiency (this is most probably not the only benefit).
In (at least some) RDBMSs, primary key fields get indexed automatically, so you never have to manipulate the index directly.
I'm tinkering with neo4j, and it seems you have to be deliberate about indexes, so now I need to understand them, but I cannot find clear answers to:
How are indexes managed in neo4j?
I know there's automatic indexing, how does it work?
If you choose to manually manage your own indexes, what can you control about them? Perhaps the index name, etc.?
I would appreciate answers or pointers to answers. Thanks.

Neo4j uses Apache Lucene under the covers if you want index-engine-like capabilities for your data. You can index nodes and/or relationships; the index helps you look up a particular instance or set of nodes or relationships.
Manual Indexing:
You can create as many node/relationship indexes as you want, and you can specify a name for each index. The configuration can also be controlled, i.e. whether you want exact matching (the default) or Lucene's full-text indexing support. Once you have the index, you simply add nodes/relationships to it along with the key/value you want indexed. You do, however, need to take care of updating the data in the index yourself if you change the node properties.
Auto-Indexing:
Here you get one index for nodes and one index for relationships if you turn them on in the neo4j.properties file. You may specify which properties are to be indexed, and from the point of turning them on, the index is managed for you automatically, i.e. any nodes created after this point are added to the index and updated/removed automatically.
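To make the difference between the two modes concrete, here is a minimal toy model in Python. It is not the Neo4j API: `ManualIndex`, `Graph`, and `set_property` are hypothetical names invented for this sketch, and the only thing it tries to show is who is responsible for keeping the index in sync after a property changes.

```python
class ManualIndex:
    """Toy exact-match index: maps (key, value) to a set of node ids."""

    def __init__(self, name):
        self.name = name  # manual indexes are named by the caller
        self._entries = {}

    def add(self, node_id, key, value):
        self._entries.setdefault((key, value), set()).add(node_id)

    def remove(self, node_id, key, value):
        self._entries.get((key, value), set()).discard(node_id)

    def get(self, key, value):
        return set(self._entries.get((key, value), set()))


class Graph:
    """Toy node store that auto-indexes a configured set of property keys."""

    def __init__(self, auto_index_keys=()):
        self.nodes = {}
        self.auto_index_keys = set(auto_index_keys)
        self.auto_index = ManualIndex("auto")  # one shared index for nodes

    def set_property(self, node_id, key, value):
        props = self.nodes.setdefault(node_id, {})
        if key in self.auto_index_keys:
            # The store itself drops the old entry and adds the new one.
            if key in props:
                self.auto_index.remove(node_id, key, props[key])
            self.auto_index.add(node_id, key, value)
        props[key] = value


graph = Graph(auto_index_keys={"name"})
people = ManualIndex("people")

graph.set_property(1, "name", "Alice")
people.add(1, "name", "Alice")  # manual index: we add the entry ourselves

graph.set_property(1, "name", "Alicia")  # rename the node

# The manual index still holds the old value until we fix it by hand.
stale = people.get("name", "Alice")
people.remove(1, "name", "Alice")
people.add(1, "name", "Alicia")
```

In the toy, the auto index follows the rename without any extra calls, while the manual index keeps returning the node under its old name until the caller removes and re-adds the entry.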
More reading:
http://docs.neo4j.org/chunked/stable/indexing.html
The above applies to versions before 2.0.
Version 2.0 adds more around the concept of indexing itself; you might want to go through:
http://www.neo4j.org/develop/labels
http://blog.neo4j.org/2013/04/nodes-are-people-too.html
Hope that helps.

Related

CosmosDB heterogenous document collection - composite indexing

I'm using a single collection for all my documents and then instantiating them into POCO's using a property of "type". Things have been going great so far.
Now I need to add multiple sorting abilities.
That doesn't work, and it says I need a composite index. Fine, I understand.
But how would I create an Indexing policy when it wants paths that won't exist in some document types or may exist in more than one document type?
Do I really have to create a collection for each document type for this to work?
TIA
It will simply ignore those items, that is, documents that do not contain the indexed paths. Also note that, for composite indexes, you have to specify the paths to include along with their sort order, whereas for the regular index it's generally preferable to include all paths (i.e. "/*") and then specify the paths to exclude. This way you don't need to keep updating your indexing policy when you add new entity types to your collection.
Also note that the maximum number of paths in a single composite index is 8. Currently, queries will only use one path at a time, but this will change very soon to use multiple paths at the same time, which will significantly improve the performance of queries that use them.
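As a rough illustration of what such a policy could look like for a mixed collection, the sketch below builds the indexing policy as a plain Python dictionary and prints it as JSON. The property names (`/lastName`, `/firstName`, `/orderDate`, `/total`, `/rawPayload`) are assumptions made up for the example, as is the `composite_index` helper; only the overall shape of the policy follows the Cosmos DB indexing policy format.

```python
import json

MAX_PATHS_PER_COMPOSITE_INDEX = 8  # limit quoted in the answer above


def composite_index(*paths_with_order):
    """Build one composite index from (path, order) pairs (hypothetical helper)."""
    if len(paths_with_order) > MAX_PATHS_PER_COMPOSITE_INDEX:
        raise ValueError("too many paths in one composite index")
    return [{"path": path, "order": order} for path, order in paths_with_order]


indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    # Regular index: include everything, then carve out what you don't query.
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/rawPayload/*"}],
    "compositeIndexes": [
        # Paths that only exist on one (assumed) "customer" document type.
        composite_index(("/lastName", "ascending"), ("/firstName", "ascending")),
        # Paths that only exist on one (assumed) "order" document type.
        composite_index(("/orderDate", "descending"), ("/total", "descending")),
    ],
}

print(json.dumps(indexing_policy, indent=2))
```

Each composite index lists paths from a different document type; per the answer, documents that lack those paths are simply left out of that index, so a single collection can carry both.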

Cosmos DB - Do I have to wait for indexing?

If I insert a document and, on the next line of code, search for it by one of its fields (other than Id), will I find it? Or do I have to wait for some indexing to happen?
Microsoft provides clear documentation on the different indexing strategies available and how to use them. What follows is a summary of it.
Cosmos DB has multiple indexing strategies. By default, it's set to consistent, which means that documents are indexed as they are placed into the collection. New documents should be immediately available for querying. You are free to switch this to lazy indexing mode, which indexes when it's more convenient for the database.
It's good to know that with consistent indexing turned on, you will observe a higher RU cost per insert/upsert because the cost of indexing is included. So whether consistent or lazy makes sense for you depends on the nature of the app you're building.
You can check the type of indexing you're using in the portal and actually tune indexing by including or excluding specific JSON paths in your documents. This is a really powerful and cool feature in Cosmos. You can see that by default, the settings are consistent indexing and a path of /* indicates that all JSON properties are covered by the index.

Some guidance request on 'custom defined' resultsets

I would like some guidance/thoughts on how to build functionality that lets users customize their datasets. I have added an image showing this functionality, although it is called "queues" there.
A view is a segmentation of a resultset where the conditions are defined by either the system (default views) or the user.
I can create predefined indexes/projections for the default views that are under my control but I am stuck on the approach when a user should be able to create custom views.
I can create one big index with all properties, and only query those fields on the index that are in the conditions defined by the user. But in that scenario the index is just one big blob of information. It is probably the easiest way but it feels ugly.
I can dynamically create a new index, based on the entered conditions. Never explored the options of runtime defined indexes before though.
I can dynamically create a query with conditions, however I will have to deal with stale results because I let RavenDB define the index; I would like to avoid index creation by RavenDB if possible.
Some guidance would be highly appreciated; how and with what parts of RavenDB can I efficiently accomplish this? I am not in search of a complete solution, since this is a personal project experimenting with RavenDB.
This question might be too broad/generic but here's my two cents.
Yes, I agree that one massive index would not be optimal. In many cases you can get creative by breaking down an index into smaller indexes.
I don't suggest that you create run-time indexes based on how a user is using the application. That's what dynamic indexes are for. RavenDB will create an index and manage its importance. So, if you have dynamic indexes that don't get used anymore, RavenDB will abandon them. If you're worried about staleness, you can wait for non-stale results.
I'm not clear on your use case, but maybe you could design your app in such a way that you save all the views (custom or default) as Raven documents. For example, given the picture you attached, "unassigned issues" and "due this week" would be two separate documents. This could allow you to keep a small number of static indexes.
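A minimal sketch of the "views as documents" idea, in plain Python rather than the RavenDB client: each view is a small document holding a list of conditions, and one generic function applies it. The document layout, field names (`assignee`, `due_in_days`), and the `matches`/`apply_view` helpers are all assumptions for illustration; in a real application the conditions would be translated into a query rather than evaluated in memory.

```python
import operator

# Supported comparison operators for a view condition (assumed set).
OPERATORS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
}

# Each view, default or user-defined, is stored as its own document.
view_documents = [
    {
        "id": "views/unassigned-issues",
        "name": "Unassigned issues",
        "conditions": [{"field": "assignee", "op": "eq", "value": None}],
    },
    {
        "id": "views/due-this-week",
        "name": "Due this week",
        "conditions": [{"field": "due_in_days", "op": "le", "value": 7}],
    },
]


def matches(issue, view):
    """Return True if the issue satisfies every condition of the view."""
    for cond in view["conditions"]:
        if cond["field"] not in issue:
            return False
        if not OPERATORS[cond["op"]](issue[cond["field"]], cond["value"]):
            return False
    return True


def apply_view(issues, view):
    """Filter issues through a stored view document."""
    return [issue for issue in issues if matches(issue, view)]


issues = [
    {"id": "issues/1", "assignee": None, "due_in_days": 3},
    {"id": "issues/2", "assignee": "ann", "due_in_days": 10},
    {"id": "issues/3", "assignee": "bob", "due_in_days": 5},
]

for view in view_documents:
    print(view["name"], [issue["id"] for issue in apply_view(issues, view)])
```

Because a custom view is only data, adding one does not require a new index definition; the same few fields are queried with different conditions.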

Why RavenDB reads all documents in indexing process and not only collections used by index?

I have quite a large database with ~2.6 million documents, where two collections hold 1.2 million documents each and the rest are small collections (<1000 documents). When I create a new index for a small collection, indexing takes a long time to complete (so temp indexes are useless). It seems that the RavenDB indexing process reads each document in the DB and checks whether it should be added to the index. I think it would perform better if it indexed only the collections used by the index.
Also, when I use Smuggler to export only one small collection, it reads all documents, and the export can take quite a lot of time. At the same time, a custom app which uses the RavenDB Linq API and indexes can export the data in seconds.
Why does RavenDB behave like this? And is there maybe some configuration setting which might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database and put everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems that the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing
everything, but that isn’t the case in RavenDB 3.0. What we have done
is to effectively optimize the process so that in this case, we will
preload all of the documents taking part in the relevant collection,
and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which
has already indexed everything in the database anyway. This is a nice
little feature, because it allows us to really take advantage of the
work we already did long ago. Using one index to pre-populate another
is a neat trick, and one that I am very happy about.
And here is full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization
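The idea in the quote, using an existing by-entity-name index to pre-populate a new index, can be sketched with a toy in-memory model. This is not RavenDB code: the function names and the document layout are assumptions, with only the `Raven-Entity-Name` metadata key taken from the answer above. The sketch counts how many documents each approach has to examine.

```python
from collections import defaultdict


def build_entity_name_index(documents):
    """Group document ids by their Raven-Entity-Name metadata (done once)."""
    by_entity = defaultdict(list)
    for doc_id, doc in documents.items():
        by_entity[doc["@metadata"]["Raven-Entity-Name"]].append(doc_id)
    return by_entity


def index_by_full_scan(documents, collection, map_fn):
    """2.x-style behaviour as described: look at every document in the database."""
    entries, examined = [], 0
    for doc_id, doc in documents.items():
        examined += 1
        if doc["@metadata"]["Raven-Entity-Name"] == collection:
            entries.append(map_fn(doc_id, doc))
    return entries, examined


def index_by_preload(documents, by_entity, collection, map_fn):
    """3.0-style behaviour as described: load only the relevant collection."""
    entries, examined = [], 0
    for doc_id in by_entity.get(collection, []):
        examined += 1
        entries.append(map_fn(doc_id, documents[doc_id]))
    return entries, examined


def map_setting(doc_id, doc):
    """Hypothetical map function for the small collection."""
    return (doc_id, doc["value"])


# One large collection and one small one, mirroring the question's situation.
documents = {}
for n in range(1000):
    documents[f"orders/{n}"] = {"@metadata": {"Raven-Entity-Name": "Orders"}, "total": n}
for name in ("theme", "locale", "timezone"):
    documents[f"settings/{name}"] = {"@metadata": {"Raven-Entity-Name": "Settings"}, "value": name}

by_entity = build_entity_name_index(documents)
scan_entries, scan_examined = index_by_full_scan(documents, "Settings", map_setting)
fast_entries, fast_examined = index_by_preload(documents, by_entity, "Settings", map_setting)
print(scan_examined, fast_examined)
```

Both functions produce the same index entries; the difference is only in how many documents are touched to get there, which is what made new indexes on small collections slow in the situation described in the question.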

What kind of indexes does Firebird use and why?

According to the Firebird FAQ, indexes are directional, which means they don't use the classical B-Trees implementation. What do they use?
What are the advantages? Do other databases use it too?
The indexes used by Firebird are B-trees, and they are bi-directional, but in practice this bi-directionality is not used because the reverse direction is considered unreliable. This has to do with the order of updates and how Firebird writes pages. As a result, a read in the reverse direction could skip index pages if that read happens at the same time an index page split occurs.
See also Firebird for the Database Expert: Episode 3 - On disk consistency:
If, on the other hand, you need a double linked chain of pages - index
pages come to mind, there is no separable relationship. Each page
depends on the other and neither can be written first. In fact,
Firebird index pages are double-linked, but the reverse link (high to
low in a descending index) is handled as unreliable. It's used in
recombining index pages from which values have been removed, but not
for backward data scans.
The link you provided does not contain enough information to draw a conclusion about the index structure used by Firebird.
AFAIK, Firebird indexes are B-tree variants. I do not have a direct documentation link right now to support this, but you can see some references:
A tracker entry reporting wrong index entry order at non-leaf B-tree pages (Firebird tracker)
A description of the B-tree page structure for some ODS version (IBExpert documentation)
There are many other examples on the internet; just google it.