How does Neo4j indexing (using Lucene) work under the hood?

A few questions relating to Lucene indexes in Neo4j and how they're used during queries and traversal. Basically, given the way relationships are stored on disk (a linked list), it seems to me that any graph traversal would require sequentially visiting all relationships for a node - I'm not sure how an index could be used in this case. More specifically:
1) When node properties are indexed, how would that be used for a query such as "all my female friends of friends" (gender is indexed)? The only way I see an index being used is by first finding all friends of friends, and then submitting a query to Lucene to get all the females. Is it faster than just doing the comparison in memory, though?
2) When relationship properties are indexed: since the relationships are stored in a linked list, it's impossible to get a subset of relationships for a node without sequentially walking the list. I suppose we could always index relationships using node IDs, but that seems silly - we end up storing adjacency lists in both Lucene and Neo4j.

Indexes are not used for traversals.
They are only used to find your starting points in the graph.
Depending on the relationship types and directions, you only traverse a subset of relationships from a node.
For your query 1, you don't need an index on gender, as it would match about 50% of the people in your graph. But you would use an index for the initial user lookup (finding "me"):
create index on :User(name);
MATCH (m:User {name:"Me"})-[:FRIEND]->()-[:FRIEND]->(other:User)
WHERE other.gender = "female"
RETURN other;
2) Yes, you are right.
You can do that, but it is only necessary if you have a lot of relationships (millions) and want to access a tiny slice of them.
So if that's your use case, a relationship index might help.
Relationships are actually indexed with both node IDs and a relationship property.
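If you do go down that road, a minimal sketch of such a lookup with the old manual (legacy) relationship index in embedded Java might look like the following (this assumes the pre-4.0 embedded API; the index name "friendships" and the "since" property are made up here, and relationships would have had to be added to the index with friendships.add(rel, "since", ...) beforehand):

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.RelationshipIndex;

public class RelIndexSketch {
    // Fetch only `me`'s relationships with since=2010 from the index,
    // instead of walking me's whole relationship chain.
    static void friendsSince(GraphDatabaseService db, Node me) {
        try (Transaction tx = db.beginTx()) {
            RelationshipIndex friendships = db.index().forRelationships("friendships");
            // get(key, value, startNodeOrNull, endNodeOrNull): both the node id and the
            // relationship property participate in the lookup.
            try (IndexHits<Relationship> hits = friendships.get("since", 2010, me, null)) {
                for (Relationship r : hits) {
                    System.out.println(r.getOtherNode(me));
                }
            }
            tx.success();
        }
    }
}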

Related

B Tree Index vs Inverted Index?

Here is my understanding of both:
B-Tree index: generally used for database columns. It keeps the column content as the key and the row_id as the value, and it keeps the keys sorted so that a key and its row location can be found quickly.
Inverted index: generally used in full-text search. Here, too, a word in a document acts as the key, stored in sorted fashion along with the document location/id as the value.
So what's the difference between a B-Tree index and an inverted index? To me they look the same.
Short answer:
yes, they have the same purpose - finding things fast
difference: what are they useful for / particularly good at
and btw the naming is just awfully confusing too
Long answer:
The naming
My knowledge comes from practice in the SQL world, so for me the data storage used to be equal to "database" and the structure that allows finding things quickly - an "index".
The trick is - search engines already call their storage an "index", so what do you call that index-of-"index"? "Inverted index", of course! Why inverted? Because, as I can already see in your question, it just inverts the primary storage. Storage is like primary key --> values; that helper structure inverts it to values --> primary key and helps to quickly find documents by values.
Purpose
Your question has a mix of ideas. "Inverted index" actually means something more like "a data structure that helps find documents that are already in storage", whereas a B-Tree is just one implementation of such a structure.
An index could theoretically be implemented with any data structure you want: hashes, graphs, trees, arrays, bitmaps... it just depends on your use case.
The differences
B-Tree is good for data that changes, so it's used e.g. in databases and filesystems. Downside: multiple indices cannot be used together in one query (I guess because this structure is dynamic and thus the references back to documents are not sorted), and its data tends to become scattered, so IO can become an issue.
An "inverted index" uses bitmaps/arrays and everything is sorted (the list of values and the list of references to documents). These are good for static data sets, and because of the sorted nature, multiple indices can be used together. Downside: updating is not performant (a new document means inserting values somewhere in a sorted list), so tricks are used like keeping batches of data together as they come in and merging them into bigger batches in a background process.

How can Datomic users cope without composite indexes?

In Datomic, how do you efficiently perform queries such as "find all people living in Washington older than 50" (city and age may vary)? In relational databases and most NoSQL databases you use composite indexes for this purpose; Datomic, as far as I'm aware, does not support anything like this.
I've built several, say, medium-sized web apps, and not a single one would have performed quickly enough without composite indexes. How are Datomic users dealing with this? Or are they just working with datasets small enough not to suffer from this? Am I missing something?
This problem and its solution are not quite the same in Datomic, due to the way data (datoms) is structured. There are two performance characteristics/strategies that may add some shading to this:
(1) When you fetch data in Datomic, you fetch an entire leaf segment from the index tree (not an individual item) - with segments being composed of potentially many thousands of datoms. This is then cached automatically so that you don't have to reach out over the network to get more datoms.
If you're querying a single person - i.e., a single entity - for their age and where they live, it's very likely that the query's navigation of the EAVT or AEVT indexes has already cached everything you need. You've effectively cached the datom, how to navigate to it, and related datoms (by locality in the index).
(2) Partitions can provide a manual means to specify locality of reference. Partitions impact the entity ID's value (it's encoded in the high bits) and ensure that related entities are sorted near each other. So for an alternative implementation of the above problem, if you needed information from the city and person entities both, you could include them in the same partition.
I've written a library to handle this: https://github.com/arohner/datomic-compound-index
Update 2019-06-28: Since 0.9.5927 (Datomic On-Prem) / 480-8770 (Datomic Cloud), Datomic supports Tuples as a new Attribute Type, which allows you to have compound indexes.
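For what it's worth, a minimal sketch of such a composite tuple with the Java Peer API might look roughly like this (the attribute names :person/city, :person/age and :person/city+age are invented for illustration; check the exact calls against the Datomic docs):

import datomic.Connection;
import datomic.Peer;
import datomic.Util;
import java.util.Collection;
import java.util.List;

public class CompositeTupleSketch {
    public static void main(String[] args) throws Exception {
        String uri = "datomic:mem://people";
        Peer.createDatabase(uri);
        Connection conn = Peer.connect(uri);

        // A composite tuple attribute: Datomic maintains :person/city+age automatically
        // from :person/city and :person/age, yielding an AVET index sorted by [city age].
        List schema = (List) Util.read(
            "[{:db/ident :person/city :db/valueType :db.type/string :db/cardinality :db.cardinality/one}"
          + " {:db/ident :person/age  :db/valueType :db.type/long   :db/cardinality :db.cardinality/one}"
          + " {:db/ident :person/city+age"
          + "  :db/valueType :db.type/tuple"
          + "  :db/tupleAttrs [:person/city :person/age]"
          + "  :db/cardinality :db.cardinality/one}]");
        conn.transact(schema).get();

        // Equality lookup on the composite value; "age > 50 within a city" would instead
        // be a range scan (index-range / seek-datoms) over the same sorted tuple index.
        Collection results = Peer.q(
            "[:find ?e :in $ ?t :where [?e :person/city+age ?t]]",
            conn.db(), Util.list("Washington", 50L));
        System.out.println(results);
    }
}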

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
..and a many-to-many relationship between them (about 100K components per assembly), thus in total 10G relationships.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields out of the box, and plain Lucene supports the same thing: you just add several field instances with the same name to a document. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 2 3)
To find components in assembly 1, 2 or 3.
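With the core Lucene API, a minimal sketch of that approach could look like the following (field names such as "assemblyIds", the in-memory directory and the concrete ids are all just for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ComponentIndexSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // or FSDirectory.open(path) on disk
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document component = new Document();
            component.add(new StoredField("componentId", 42));
            // Multi-valued field: add one StringField instance per assembly the component belongs to.
            component.add(new StringField("assemblyIds", "1", Field.Store.NO));
            component.add(new StringField("assemblyIds", "7", Field.Store.NO));
            writer.addDocument(component);
        }

        // "components in assembly 1, 2 or 3" - the programmatic form of assemblyIds:(1 2 3)
        BooleanQuery.Builder q = new BooleanQuery.Builder();
        for (String id : new String[]{"1", "2", "3"}) {
            q.add(new TermQuery(new Term("assemblyIds", id)), BooleanClause.Occur.SHOULD);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(q.build(), 10);
            System.out.println(hits.totalHits);
        }
    }
}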
To be brief: you have to process the data and index it before you can search. Therefore, there is no way to just "plug in" Lucene to some data or database; instead you have to feed the data itself into Lucene (process, parse, analyze, index) and then query it.
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think it would be a problem for you to index.
You can add multiple field instances with different values ("components") to a document that also has an "assembly" field.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in its tutorial, you can search for the "@ManyToMany" keyword to find the exact section that gives you an idea of how this works.
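A rough sketch of what that looks like with Hibernate Search 5-style annotations (entity and field names are invented here; the point is that @IndexedEmbedded denormalizes the association into the component's Lucene document, so you can query on assemblies.id):

import java.util.Set;
import javax.persistence.*;
import org.hibernate.search.annotations.*;

@Entity
@Indexed                                     // Component gets its own Lucene document
class Component {
    @Id @GeneratedValue
    Long id;

    @Field                                   // plain full-text field
    String name;

    @ManyToMany
    @IndexedEmbedded(includePaths = "id")    // pulls the assembly ids into the component document
    Set<Assembly> assemblies;
}

@Entity
class Assembly {
    @Id @GeneratedValue
    Long id;

    @ManyToMany(mappedBy = "assemblies")
    @ContainedIn                             // re-index Components when an Assembly changes
    Set<Component> components;
}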

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc) that should be searched together (ie, a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In that index you can define the additional field types you need and choose which fields to tokenize or store term vectors for. It also takes time to open, for each IndexSearcher, a directory that contains an index.
If you want to search the types separately, it's better to index each type in its own index.
A single index is more structured than multiple indexes. On the other hand, multiple indexes let you balance the load across them.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it'll let you filter if needed, as well as tell the difference between the results you receive back.
(And maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy score; I don't know if excessively repeated terms in one corpus will throw off the relevancy of documents in others?)
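A minimal sketch of the "one index plus a type field" idea with the current Lucene API (field names here are made up; StringField plays the role of the old Keyword field):

import org.apache.lucene.document.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class TypeFieldSketch {
    // One document per item, whatever its kind, plus a "type" discriminator field.
    static Document forumMessage(String body) {
        Document doc = new Document();
        doc.add(new StringField("type", "forum", Field.Store.YES)); // indexed + stored, not analyzed
        doc.add(new TextField("content", body, Field.Store.NO));
        return doc;
    }

    // Search everything as-is, or restrict to one type with a FILTER clause,
    // which narrows the results without influencing scoring.
    static Query onlyForumPosts(Query userQuery) {
        return new BooleanQuery.Builder()
            .add(userQuery, BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("type", "forum")), BooleanClause.Occur.FILTER)
            .build();
    }
}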
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture is similar to how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As @llama pointed out, creating a single uber-index affects relevance scores, raises security/access issues, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index, so that all the index options can be set accordingly - like analyzers for the fields, what is stored for the fields, term vectors and similar - and also so that you can use different refresh/commit behaviour when IndexReaders/Writers are reopened/committed for the different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager

SQL Query for an Organization Chart?

I feel that this is likely a common problem, but from my Google searching I can't find a solution quite specific to my problem.
I have a list of Organizations (a table) in my database, and I need to be able to run queries based on their hierarchy. For example, if you query the highest Organization, I want to return the IDs of all the Organizations listed under it. Further, if I query an organization somewhere mid-hierarchy, I want only the Organization IDs listed under that Organization.
What is the best way to a) set up the database schema and b) query it? I want to only have to send the topmost Organization ID and then get the IDs under that Organization.
I think that makes sense, but I can clarify if necessary.
As promised in my comment, I dug up an article on how to store hierarchies in a database that allows constant-time retrieval of arbitrary subtrees. I think it will suit your needs much better than the answer currently marked as accepted, both in ease of use and speed of access. I could swear I saw this same concept on wikipedia originally, but I can't find it now. It's apparently called a "modified preorder tree traversal". The gist of it is you number each node in the tree twice, while doing a depth-first traversal, once on the way down, and once on the way back up (i.e. when you're unrolling the stack, in a recursive implementation). This means that the children of a given node have all their numbers in between the two numbers of that node. Throw an index on those columns and you've got really fast lookups. I'm sure that's a terrible explanation, so read the article, which goes into more depth and includes pictures.
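A minimal sketch of that numbering pass (plain in-memory Java; the org ids and the child map are made up, and in practice you would persist the left/right numbers as two indexed columns):

import java.util.*;

public class MpttNumbering {
    // Child lists keyed by parent organization id.
    static Map<Integer, List<Integer>> children = new HashMap<>();
    static Map<Integer, Integer> left = new HashMap<>(), right = new HashMap<>();
    static int counter = 1;

    // Depth-first walk: assign "left" on the way down and "right" on the way back up.
    static void number(int org) {
        left.put(org, counter++);
        for (int child : children.getOrDefault(org, List.of())) {
            number(child);
        }
        right.put(org, counter++);
    }

    public static void main(String[] args) {
        // 1 is the top organization; 2 and 3 report to it; 4 reports to 2.
        children.put(1, List.of(2, 3));
        children.put(2, List.of(4));
        number(1);
        // Every descendant D of org X satisfies left(X) < left(D) and right(D) < right(X),
        // so "all orgs under X" becomes a single indexed range (BETWEEN) query.
        System.out.println(left + " " + right);
    }
}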
One simple way is to store the organization's parentage in a text field, like:
SALES-EUROPE-NORTH
To search for every sales organization, you can query on SALES-%. For each European sales org, query on SALES-EUROPE-%.
If you rename an organization, take care to update its child organizations as well.
This keeps it simple, without recursion, at the cost of some flexibility.
The easy way is to have a ParentID column, which is a foreign key to the ID column in the same table, NULL for root nodes. But this method has some drawbacks.
Nested sets are an efficient way to store trees in a relational database.
You could give Organization an id primary key and a parent foreign key referencing that id. Then, for the query, use recursive queries, a.k.a. Common Table Expressions, if your database backend supports them.
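For example, run from Java over a hypothetical organizations(id, parent_id) table, the recursive CTE could be used like this (PostgreSQL-style SQL; adjust the syntax for your backend):

import java.sql.*;
import java.util.*;

public class OrgSubtree {
    // Returns the ids of every organization under (and including) the given root.
    static List<Long> subtreeIds(Connection conn, long rootId) throws SQLException {
        String sql =
            "WITH RECURSIVE subtree AS ( " +
            "  SELECT id FROM organizations WHERE id = ? " +
            "  UNION ALL " +
            "  SELECT o.id FROM organizations o JOIN subtree s ON o.parent_id = s.id " +
            ") " +
            "SELECT id FROM subtree";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, rootId);
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> ids = new ArrayList<>();
                while (rs.next()) ids.add(rs.getLong(1));
                return ids;
            }
        }
    }
}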