Extending Neo4j with new index structures

Suppose I want to implement a new index structure (e.g., BITMAT) that will improve the efficiency of some queries (Path queries for the BITMAT case). How do I extend Neo4j so that every query with a specified query pattern uses my new index instead of Neo4j's native index?

You can implement a new IndexProvider that hooks into the normal Neo4j indexing system; it is then automatically exposed to Cypher. The SpatialIndexProvider is an example of this: it projects a subgraph query into an index lookup, and you can run Cypher queries against it:
https://github.com/neo4j/spatial/blob/master/src/main/java/org/neo4j/gis/spatial/indexprovider/LayerNodeIndex.java
Test with Cypher:
https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/IndexProviderTest.java#L141

Related

Understanding indices in Neo4j - why is resampling necessary?

At https://neo4j.com/docs/operations-manual/3.3/performance/statistics-execution-plans/, we see the following table:
Controls the percentage of the index that has to have been updated
before a new sampling run is triggered.
What does "sampling" mean? Why do updates invalidate the index? I know indexes from relational databases, and there is no separate maintenance step in that case (adding/deleting a row corresponds to adding/deleting an entry in some B-tree).
Could someone explain why resampling of indexes in Neo4j is necessary?
The index is always valid.
Periodic sampling generates statistics used by the Cypher execution planner, so that it can generate plans that are better suited to the current state of the database.
To quote from the operations manual (a little earlier than the table in your question):
When a Cypher query is issued, it gets compiled to an execution plan
that can run and answer the query. The Cypher query engine uses
available information about the database, such as schema information
about which indexes and constraints exist in the database. Neo4j also
uses statistical information about the database to optimize the
execution plan.
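If you don't want to wait for the update threshold to be crossed, Neo4j 3.x also exposes built-in procedures to trigger resampling manually (the :Page(Url) index below is just an example; substitute your own label and property):

```
// Resample the statistics for a single index
CALL db.resampleIndex(":Page(Url)");

// Resample every index whose statistics are considered outdated
CALL db.resampleOutdatedIndexes();
```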

Index performance with a startswith structure

Is there an expression in Neo4j, like startsWith, that runs fast on an indexed property?
I currently run a query like
match (p:Page) where p.Url =~ 'http://www.([\\w.?=#/&]*)' return p
The p.Url property is indexed; however, the query above is very slow. Shouldn't a starts-with search on an indexed property be quite fast?
Currently, regex (or 'startsWith') filters are not supported by schema indexes. Your Cypher statement will scan through all nodes having the Page label and filter them based on their properties.
More sophisticated query capabilities for schema indexes are to be expected in one of the next releases of Neo4j.
If you need that functionality now you basically have 2 options:
use legacy indexes as documented in the reference manual.
if your queries always filter for starting with protocol://prefix, you can work around this by putting the prefix protocol://prefix into an additional property urlPrefix and declaring an index on it (create index on :Page(urlPrefix)). Your query is then match (p:Page) where p.urlPrefix='http://www.' return p and will run via the existing index.
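End to end, the second workaround could look like this (Page, Url, and urlPrefix as in the question; the substring length of 11 is just the length of the http://www. prefix and would need adjusting for other prefixes):

```
// One-time: index the prefix property
CREATE INDEX ON :Page(urlPrefix);

// Populate urlPrefix from the existing Url property
MATCH (p:Page)
SET p.urlPrefix = substring(p.Url, 0, 11);

// Query via the index instead of a regex scan
MATCH (p:Page)
WHERE p.urlPrefix = 'http://www.'
RETURN p;
```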

Spring Data Neo4j search and index

I'm using:
neo4j 2.0.1
spring data neo4j 3.0.1.RELEASE
I have a node Person with the property name, and I would like to search with Lucene syntax on that property. I'm using the findByNameLike method in my repository, and that works perfectly for queries like value*, *value, *, etc.
But I need a query like {A* TO D*}. I found a deprecated method, findAllByQuery("name", query), and with that method I can achieve what I need.
I was wondering what the new, non-deprecated method is that understands such query syntax.
I also noticed that if I create nodes from Cypher, the nodes are not available in my search.
With SDN I think that generated nodes are automatically added to the index as well, but I don't know how to check that or what the index name is. I must generate the nodes from Cypher to have some base data in my system. Should I add some special property in my Cypher query?
First of all, make sure you understand the differences between the legacy (deprecated) Lucene indexes and the newer Schema indexes.
I was wondering what the new, non-deprecated method is that understands such query syntax.
You'll have to use one of the methods of the SchemaIndexRepository, which is extended by the GraphRepository interface for convenience. Bear in mind that e.g. wildcard searches are not yet implemented on schema indexes. If you want to use wildcard searches, you have two options: continue to use the Lucene indexes (your best option for now), or use a regex query in a custom repository method. E.g.
MATCH (p:Person) WHERE p.name =~ ".*test.*" RETURN p
I also noticed that if I create nodes from Cypher, the nodes are not
available in my search. With SDN I think that generated nodes are
automatically added to the index as well, but I don't know how to check
that or what the index name is. I must generate the nodes from Cypher
to have some base data in my system. Should I add some special
property in my Cypher query?
If you're using Lucene indexes, new entries will not be added to the index; AFAIK, you can only do that programmatically. Schema indexes can be created as follows:
CREATE INDEX ON :Person(name)
New entries with the name property will be automatically added to the index. Again, wildcard searches will not yet use these indexes.
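To verify that a schema index is actually used by a query, you can force it with an index hint; if the index does not exist, the query fails rather than falling back to a scan (Person and name as in the question, 'Alice' is just an example value):

```
MATCH (p:Person)
USING INDEX p:Person(name)
WHERE p.name = 'Alice'
RETURN p;
```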

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
..and a many-to-many relationship between them (about 100K components per assembly), thus in total 10G relationships.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields; I'm not positive whether Lucene supports them natively or not, it's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar, and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
to find components in assemblies 1, 2, or 3.
To be brief: you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database; instead you have to process (parse, analyze, index, and query) the data yourself with Lucene.
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think it'd be a problem for you to index.
You can add multiple field instances with different values ("components") to a document that has an "assembly" field as well.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index:
Hibernate Search: in that tutorial, you can search for the "@ManyToMany" keyword to find the exact section that gives some idea.

Change dynamically elasticsearch synonyms

Is it possible to store the synonyms for elasticsearch in the index? Or is it possible to get the synonym list from a database like couchdb?
I'd like to add synonyms dynamically to elasticsearch via the REST-API.
There are two approaches when working with synonyms :
expanding them at indexing time,
expanding them at query time.
Expanding synonyms at query time is not recommended since it raises issues with :
scoring, since synonyms have different document frequencies,
multi-token synonyms, since the query parser splits on whitespaces.
More details on this at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory (on Solr wiki, but relevant for ElasticSearch too).
So the recommended approach is to expand synonyms at indexing time. In your case, if the synonym list is managed dynamically, it means you should re-index every document containing a term whose synonym list has been updated, so that scoring remains consistent between documents analyzed pre- and post-update. I'm not saying it is not possible, but it requires some work and will probably raise performance issues for synonyms that have a high frequency in your index.
There are a few new solutions now beyond those proposed in other answers a few years ago. The two main approaches are implemented as plugins:
The file-watcher-synonym filter is a plugin that can periodically reload synonyms every given number of seconds, as defined by the user.
The refresh-token-plugin allows a real-time update of the index. However, this plugin apparently has some problems, which stem from the fact that Elasticsearch is unable to distinguish analyzers used only at search time from those used at index time.
A good discussion of this subject can be found in the Elasticsearch GitHub ticket system: https://github.com/brusic/refresh-token-filters
Updating the synonym list in Elasticsearch isn't too painful. It can be done by opening and closing the index. You could have it driven from anywhere, but you need some of your own infrastructure. It would work like this:
You want an alias pointing at your current index.
Sync down a new synonyms file to your servers.
Create a new index with a custom analyzer that uses the new synonyms file.
Reindex the content from the current index into the new index.
Repoint the alias from the current index to the new index.
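The alias repointing in the last step can be done atomically with the _aliases API, so searches never see a missing or half-built index (the index and alias names here are just examples):

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "content_v1", "alias": "content" } },
    { "add":    { "index": "content_v2", "alias": "content" } }
  ]
}
```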
In 2021, just expand synonyms at query time using a specific search analyzer and use the Reload analyzer API:
POST /my-index/_reload_search_analyzers
The synonym graph token filter must have updateable set to true:
"my-synonyms": {
"type": "synonym_graph",
"synonyms_path": "my-synonyms.txt",
"updateable": true
}
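For the reload to work, an updateable synonym filter may only be referenced from a search_analyzer, not from the index-time analyzer. Putting the pieces together, the index settings could look like this (the index name, analyzer name, and title field are just examples):

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my-synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "my-synonyms.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "my-search-analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my-synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my-search-analyzer"
      }
    }
  }
}
```

After editing my-synonyms.txt on each node, POST /my-index/_reload_search_analyzers picks up the changes without a reindex.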
Besides, you should probably expand synonyms at query time anyway. Why?
Chances are that you have too much data to reindex every night or so.
Elasticsearch does not allow the Synonym Graph Filter for an index analyzer, only the deprecated Synonym Filter which does not handle multi word synonyms correctly.