Spring Data Neo4j search and index - Lucene

I'm using:
neo4j 2.0.1
spring data neo4j 3.0.1.RELEASE
I have a node Person with the property name, and I would like to search on that property with Lucene syntax. I'm using the findByNameLike method in my repository, and that works perfectly for queries like value*, *value, or *.
But I need a query like {A* TO D*}. I found a deprecated method, findAllByQuery("name", query), and with that method I can achieve what I need.
I was wondering what the new, non-deprecated method is that understands such query syntax.
I also noticed that if I create nodes from Cypher, the nodes are not available in my search.
With SDN I think the generated nodes are automatically added to the index as well, but I don't know how to check that or what the index name is. I must generate the nodes from Cypher to have some base data in my system. Should I add some special property in my Cypher query?
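For reference, a minimal sketch of the kind of repository being described, assuming an entity class Person with a name property (all names here are taken from the question or are illustrative):

import org.springframework.data.neo4j.repository.GraphRepository;

public interface PersonRepository extends GraphRepository<Person> {

    // Derived finder: handles simple wildcards such as "value*" or "*value"
    Iterable<Person> findByNameLike(String name);

    // The Lucene-syntax query mentioned above comes from the inherited (deprecated)
    // legacy-index methods, e.g. personRepository.findAllByQuery("name", "{A* TO D*}")
}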

First of all, make sure you understand the differences between the legacy (deprecated) Lucene indexes and the newer Schema indexes.
I was wondering what the new, non-deprecated method is that understands such query syntax.
You'll have to use one of the methods of the SchemaIndexRepository, which is extended by the GraphRepository interface for convenience. Bear in mind that, for example, wildcard searches are not yet implemented on schema indexes. If you want to use wildcard searches, you have two options: continue to use the legacy Lucene indexes (your best option for now), or use a regex query in a custom repository method. E.g.
MATCH (p:Person) WHERE p.name =~ ".*test.*" RETURN p
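In SDN 3.x such a regex query can sit on a custom repository method via the @Query annotation. A minimal sketch, assuming the Person entity and a repository like the one in the question:

import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.repository.GraphRepository;

public interface PersonRepository extends GraphRepository<Person> {

    // The regex pattern is passed in as parameter {0}, e.g. ".*test.*"
    @Query("MATCH (p:Person) WHERE p.name =~ {0} RETURN p")
    Iterable<Person> findByNameMatching(String regex);
}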
I also noticed that if I create nodes from Cypher, the nodes are not available in my search. With SDN I think the generated nodes are automatically added to the index as well, but I don't know how to check that or what the index name is. I must generate the nodes from Cypher to have some base data in my system. Should I add some special property in my Cypher query?
If you're using legacy Lucene indexes, new entries created via Cypher will not be added to the index. AFAIK, you can only do that programmatically. Schema indexes can be created as follows:
CREATE INDEX ON :Person(name)
New entries with the name property will be automatically added to the index. Again, wildcard searches will not yet use these indexes.
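For completeness, adding a node to a legacy Lucene index programmatically looks roughly like this with the embedded Neo4j 2.x Java API (the index name "Person" is an assumption here, not something SDN mandates):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

public class LegacyIndexExample {

    public static void addPerson(GraphDatabaseService graphDb, String name) {
        try (Transaction tx = graphDb.beginTx()) {
            // Legacy (manual) Lucene index; it only contains what you explicitly add to it
            Index<Node> persons = graphDb.index().forNodes("Person");
            Node node = graphDb.createNode(DynamicLabel.label("Person"));
            node.setProperty("name", name);
            persons.add(node, "name", name);
            tx.success();
        }
    }
}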

Related

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
...and a many-to-many relationship between them (about 100K components per assembly), thus 10G relationships in total.
What's the best way to index the components so that I can query the index for a given assembly? Given the number of relationships, I don't want to import them into the Lucene index; instead I'm looking for a way to "join" with my external table on the fly.
Solr supports multi-valued fields. I'm not positive whether Lucene supports them natively or not; it's been a while for me. If only one of the entities is searchable, which you mentioned is the components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1, 2, 3)
to find components in assemblies 1, 2, or 3.
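A rough sketch of what that looks like with the plain Lucene API, one document per component with a repeated assemblyIds field (all field names here are illustrative, not from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ComponentIndexer {

    // One Lucene document per component; adding the same field name once per assembly
    // is how Lucene represents a multi-valued field.
    public static void indexComponent(IndexWriter writer, String componentId,
                                      Iterable<String> assemblyIds) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("componentId", componentId, Field.Store.YES));
        for (String assemblyId : assemblyIds) {
            doc.add(new StringField("assemblyIds", assemblyId, Field.Store.NO));
        }
        writer.addDocument(doc);
    }

    // Programmatic equivalent of the query assemblyIds:(1 2 3):
    // match components belonging to any of the given assemblies.
    public static BooleanQuery componentsInAssemblies(String... assemblyIds) {
        BooleanQuery query = new BooleanQuery();
        for (String assemblyId : assemblyIds) {
            query.add(new TermQuery(new Term("assemblyIds", assemblyId)), BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}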
To be brief, you have to process the data and index it before you can search. Therefore, there is no way to just "plug" Lucene into some data or a database; instead, you have to feed the data itself into Lucene (process, parse, analyze, index, and query it).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think it would be a problem for you to index.
You can add multiple instances of a field with different values (the "components") to a document that also has an "assembly" field.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try out the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, search for the "@ManyToMany" keyword to find the exact section and get some idea.
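If you go the Hibernate Search route, the mapping could look roughly like this: @IndexedEmbedded flattens the assembly ids into each component's Lucene document. The entity and field names are assumptions based on the question, so treat this as a sketch rather than a drop-in mapping.

import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToMany;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;

// --- Component.java ---
@Entity
@Indexed
public class Component {

    @Id
    private Long id;

    @Field
    private String name;

    // The assemblies' indexed fields are embedded into this component's document,
    // so components can be queried by assembly (e.g. on the field "assemblies.id").
    @ManyToMany
    @IndexedEmbedded
    private Set<Assembly> assemblies;
}

// --- Assembly.java ---
@Entity
class Assembly {

    @Id
    @Field // indexed so that embedding documents expose the assembly id
    private Long id;

    @Field
    private String name;
}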

Extending Neo4j with new index structures

Suppose I want to implement a new index structure (e.g., BITMAT) that will improve the efficiency of some queries (Path queries for the BITMAT case). How do I extend Neo4j so that every query with a specified query pattern uses my new index instead of Neo4j's native index?
You can implement a new IndexProvider that hooks into the normal Neo4j indexing system; it is then automatically exposed to Cypher. You can see an example of this in the SpatialIndexProvider, which projects a subgraph query into an index lookup so you can run Cypher queries against it:
https://github.com/neo4j/spatial/blob/master/src/main/java/org/neo4j/gis/spatial/indexprovider/LayerNodeIndex.java
Test with Cypher:
https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/IndexProviderTest.java#L141

Grails & GORM: How to specify CREATE INDEX equivalent on a domain-class?

May I share my frustration: GORM (and Grails) seems to have very limited documentation regarding database indices. I haven't found any help anywhere on how to create an index for a domain class when I want the index to be anything more than what is documented here: http://grails.org/doc/latest/guide/GORM.html.
Here, in SQL, is what I would like to achieve the Grails way:
CREATE INDEX very_fast_index
ON slow_table(date DESC NULLS LAST)
WHERE is_latest = true;
It seems I can tell GORM to create an index on the date column, but there appear to be zero options for adding the other criteria.
As I hate it when simple things are made extremely complicated, I've created these indices manually in the PostgreSQL CLI rather than from Grails, which would have been more portable. I don't want to write any HQL either, as I don't like that idea.
Any ideas? I've got none other than the manual way.
HQL is a data manipulation language, not a data definition language, so it is not useful for your needs. If you want to use database-vendor-specific features, you have to bypass Hibernate and use the lower-level JDBC connection to run SQL queries. In Grails you can use the dataSource bean to run the query the Groovy way. Of course, this ties you to a specific database (in your case PostgreSQL).
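A minimal sketch of that approach using plain JDBC through the injected dataSource (in Grails you would more idiomatically wrap it in groovy.sql.Sql, but the underlying calls are the same); the DDL is taken verbatim from the question:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class PartialIndexCreator {

    private final DataSource dataSource; // in Grails, the auto-configured dataSource bean

    public PartialIndexCreator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Runs the vendor-specific DDL directly, bypassing Hibernate/GORM entirely
    public void createVeryFastIndex() throws SQLException {
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement()) {
            statement.execute("CREATE INDEX very_fast_index "
                    + "ON slow_table(date DESC NULLS LAST) "
                    + "WHERE is_latest = true");
        }
    }
}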

RavenDB Index Prefix

Since I don't yet have the ability to promote a temp index in Raven Studio (using build 573), I created two indexes manually. It seems to have worked well, but I have a question about the prefixes on each index: Temp, Auto, Raven. Is there anything special about those keywords? When I create my own index, should I use a prefix like that? For now, when I created my index, I used the index name from the temp index and replaced the word Temp with Manual.
Is that an acceptable approach? Should I be using a certain prefix?
Bob,
The names are just names, they are there for humans, not for RavenDB.
Indexes starting with Raven/ are reserved, and may be overwritten by the system at some point.
Indexes starting with Auto/ or Temp/ may be generated by the system, and may overwrite an existing index.
I generally use the collection/entity name as a prefix, just because it helps me see right away which entity the index is primarily based on. If I had an index for getting the latest list of movies, I would name it Movie/GetLatestIndex.

Change Elasticsearch synonyms dynamically

Is it possible to store the synonyms for Elasticsearch in the index? Or is it possible to get the synonym list from a database like CouchDB?
I'd like to add synonyms to Elasticsearch dynamically via the REST API.
There are two approaches when working with synonyms:
expanding them at indexing time,
expanding them at query time.
Expanding synonyms at query time is not recommended since it raises issues with:
scoring, since synonyms have different document frequencies,
multi-token synonyms, since the query parser splits on whitespaces.
More details on this at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory (on the Solr wiki, but relevant for Elasticsearch too).
So the recommended approach is to expand synonyms at indexing time. In your case, if the synonym list is managed dynamically, it means that you should re-index every document which contains a term whose synonym list has been updated so that scoring remains consistent between documents analyzed pre and post update. I'm not saying that it is not possible but it requires some work and will probably raise performance issues with synonyms which have a high frequency in your index.
There are a few newer solutions now in addition to those proposed in other answers a few years ago. The two main approaches are implemented as plugins:
The file-watcher-synonym filter is a plugin that can periodically reload synonyms every given number of seconds, as defined by the user.
The refresh-token plugin allows a real-time update of the index. However, this plugin apparently has some problems, which stem from the fact that Elasticsearch is unable to distinguish analyzers used only at search time from those used at index time.
A good discussion of this subject can be found in the Elasticsearch GitHub ticket system: https://github.com/brusic/refresh-token-filters
It isn't too painful in Elasticsearch to update the synonym list; it can be done by closing and reopening the index with updated analyzer settings. You could have it driven from anywhere, but you need some of your own infrastructure. It'd work like this (a sketch of the alias repointing step follows the list):
You want an alias pointing at your current index
Sync the new synonyms file down to your servers
Create a new index with a custom analyzer that uses the new file
Rebuild the content from the current index into the new index
Repoint the alias from the current index to the new one
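A minimal sketch of that final alias repoint using the Java high-level REST client; the index names products-v1/products-v2 and the alias products are illustrative, and this client postdates the original answer, so treat it as just one way to do it:

import java.io.IOException;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest.AliasActions;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class AliasRepointer {

    // Atomically moves the alias from the old index to the freshly rebuilt one,
    // so searches never see a half-built index.
    public static void repointAlias(RestHighLevelClient client) throws IOException {
        IndicesAliasesRequest request = new IndicesAliasesRequest();
        request.addAliasAction(AliasActions.remove().index("products-v1").alias("products"));
        request.addAliasAction(AliasActions.add().index("products-v2").alias("products"));
        client.indices().updateAliases(request, RequestOptions.DEFAULT);
    }
}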
In 2021, just expand synonyms at query time using a specific search analyzer and use the reload search analyzers API:
POST /my-index/_reload_search_analyzers
The synonym graph token filter must have updateable set to true:
"my-synonyms": {
"type": "synonym_graph",
"synonyms_path": "my-synonyms.txt",
"updateable": true
}
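Driving that reload from application code is a one-liner with the low-level Java REST client; a sketch, assuming the index name my-index from the answer and a node on localhost:

import java.io.IOException;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SynonymReload {

    public static void main(String[] args) throws IOException {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Re-reads my-synonyms.txt for every updateable search-time synonym filter on the index
            Response response = client.performRequest(
                    new Request("POST", "/my-index/_reload_search_analyzers"));
            System.out.println(response.getStatusLine());
        }
    }
}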
Besides, you should probably expand synonyms at query time anyway. Why?
Chances are that you have too much data to reindex every night or so.
Elasticsearch does not allow the synonym graph filter in an index analyzer, only the deprecated synonym filter, which does not handle multi-word synonyms correctly.