Lucene index a large many-to-many relationship - lucene

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
...and a many-to-many relationship between them (about 100K components per assembly), for a total of about 10 billion (10G) relationships.
What's the best way to index the components so that I can query the index for a given assembly? Given the number of relationships, I don't want to import them into the Lucene index; instead I am looking for a way to "join" with my external table on the fly.

Solr supports multi-valued fields. Not positive if Lucene supports them natively or not. It's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1, 2, 3)
To find components in assembly 1, 2 or 3.
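A minimal sketch of the multi-valued field idea, in plain Python rather than the Lucene or Solr API. The sample `components` data and the `build_assembly_index` and `search_assemblies` names are hypothetical; the sketch only shows that each component carries a list of assembly ids and that the index maps every id back to the components that list it.

```python
from collections import defaultdict

# Hypothetical sample data: component id -> assembly ids it belongs to.
components = {
    "c1": [1, 2],
    "c2": [2],
    "c3": [3, 4],
    "c4": [5],
}

def build_assembly_index(components):
    """Map each assembly id to the set of components that list it."""
    index = defaultdict(set)
    for component_id, assembly_ids in components.items():
        for assembly_id in assembly_ids:
            index[assembly_id].add(component_id)
    return index

def search_assemblies(index, assembly_ids):
    """Return components that belong to any of the given assemblies (OR)."""
    result = set()
    for assembly_id in assembly_ids:
        result |= index.get(assembly_id, set())
    return result

assembly_index = build_assembly_index(components)
print(sorted(search_assemblies(assembly_index, [1, 2, 3])))
```

Note that this layout stores one posting per relationship, which is exactly the volume the question hoped to avoid importing.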

To be brief, you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database; instead you have to feed the data itself into Lucene (process, parse, analyze, index, and query).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source and add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think it would be a problem for you to index.
You can add multiple field instances to the index with different values ("components") in a document having an "assembly" field as well.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, search for the "@ManyToMany" keyword to find the relevant section and get some idea.
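For the on-the-fly "join" the question asks about, one possible shape is to run the text search first and then keep only the hits that the external table says belong to the assembly. The sketch below is plain Python under loud assumptions: `RELATIONSHIPS` stands in for the external join table, `RANKED_HITS` stands in for the ranked component ids a search returned, and both function names are hypothetical.

```python
# Hypothetical stand-in for the external relationship table.
RELATIONSHIPS = {
    "asm-1": {"c1", "c3"},
    "asm-2": {"c2", "c3", "c4"},
}

# Hypothetical stand-in for the ranked component ids a text search returned.
RANKED_HITS = ["c4", "c1", "c3", "c9"]

def fetch_component_ids(assembly_id):
    """Stand-in for a database query against the join table."""
    return RELATIONSHIPS.get(assembly_id, set())

def filter_hits_by_assembly(ranked_hits, assembly_id):
    """Keep only hits that belong to the assembly, preserving rank order."""
    members = fetch_component_ids(assembly_id)
    return [hit for hit in ranked_hits if hit in members]

print(filter_hits_by_assembly(RANKED_HITS, "asm-2"))
```

With about 100K components per assembly, the member set for one assembly is small enough to fetch per query, but whether this is fast enough depends on the database and on how many hits the search returns.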

Related

Indexing several DBs in lucene (and performance on low cardinality fields)

I have to index several relational DBs in a Lucene index (for value-based searching).
Some searches will be for values from a specific DB and others will be for values from all DBs.
I can think of two ways to implement this:
Create one big index and add a field called database_id. Use this field when querying over some specific DB.
Create an index for every DB. When querying one DB, I will direct the query to just one index, when querying all DBs I'll use MultiReader that runs query on all indices.
Option 2 seems more comfortable to me because of easier maintenance and faster querying when querying just one DB. I also came across several posts saying that low-cardinality fields are not good for Lucene performance (can someone shed light on this? Is it true?).
I'd like to hear the community's thoughts: what other pros and cons are there?
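The two options can be compared with a small plain-Python sketch. It does not use Lucene; `ROWS`, `search_big`, and `search_split` are hypothetical names, and the fan-out loop in `search_split` only imitates what a MultiReader would do over several indexes.

```python
from collections import defaultdict

# Hypothetical rows: (database_id, document id, value).
ROWS = [
    ("db_a", 1, "alice"),
    ("db_a", 2, "bob"),
    ("db_b", 1, "alice"),
    ("db_c", 7, "carol"),
]

# Option 1: one big index; every posting carries its database_id.
big_index = defaultdict(list)
for db, doc_id, value in ROWS:
    big_index[value].append((db, doc_id))

def search_big(value, database_id=None):
    """Search the single index, optionally filtering on database_id."""
    hits = big_index.get(value, [])
    if database_id is None:
        return list(hits)
    return [hit for hit in hits if hit[0] == database_id]

# Option 2: one index per database; querying all DBs fans out over each.
per_db_index = defaultdict(lambda: defaultdict(list))
for db, doc_id, value in ROWS:
    per_db_index[db][value].append((db, doc_id))

def search_split(value, database_id=None):
    """Search one per-database index, or every index when no id is given."""
    targets = [database_id] if database_id is not None else sorted(per_db_index)
    hits = []
    for db in targets:
        if db in per_db_index:
            hits.extend(per_db_index[db].get(value, []))
    return hits
```

Both functions return the same hits; the difference is where the work happens. Option 1 filters postings after the lookup, while option 2 touches only the indexes that were asked for.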

How does Neo4j indexing (using lucene) work under the hood?

A few questions relating to Lucene indexes in Neo4j and how they're used during queries and traversal. Basically, given the way relationships are stored on disk (a linked list), it seems to me that any graph traversal would require sequentially visiting all relationships for a node, so I'm not sure how an index could be used in this case. More specifically:
1) When node properties are indexed, how would that be used for a query such as "all my female friends of friends" (gender is indexed)? The only way I see an index being used is by first finding all friends of friends, and then submitting a query to Lucene to get all the females. Is that faster than just doing the comparison in memory, though?
2) When relationship properties are indexed: since the relationships are stored in a linked list, it's impossible to get a subset of relationships for a node without sequentially walking the list. I suppose we could always index relationships using node ids, but that seems silly, as we end up storing adjacency lists in both Lucene and Neo4j.
Indexes are not used for traversals.
They are only used to find your starting points in the graph.
Depending on the relationship-types and directions you only traverse a subset of relationships from a node.
For your query 1, you don't need an index on gender, as it would return about 50% of the people in your graph. But you would use an index for the initial user lookup (me):
create index on :User(name);
MATCH (m:User {name:"Me"})-[:FRIEND]->(other:User)
WHERE other.gender = "female"
RETURN other;
2) Yes, you are right.
You can do that, but it is only necessary if you have a lot of relationships (millions) and want to access a tiny slice of them.
So if that's your use case, a relationship index might help.
Relationships are actually indexed with both node ids and a relationship property.
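The split between "index for the starting point" and "traversal without an index" can be illustrated in plain Python. This is not how Neo4j stores data; `NODES`, `FRIENDS`, `NAME_INDEX`, and the function name are hypothetical, and the graph is a toy example.

```python
# Hypothetical graph: node id -> properties, node id -> outgoing FRIEND ids.
NODES = {
    1: {"name": "Me", "gender": "male"},
    2: {"name": "Ann", "gender": "female"},
    3: {"name": "Bob", "gender": "male"},
    4: {"name": "Cid", "gender": "female"},
    5: {"name": "Dee", "gender": "female"},
}
FRIENDS = {1: [2, 3], 2: [4], 3: [4, 5], 4: [], 5: []}

# The "index": a lookup from name to node id, used only for the start node.
NAME_INDEX = {props["name"]: node_id for node_id, props in NODES.items()}

def female_friends_of_friends(name):
    """Find the start node via the index, then walk relationships directly."""
    start = NAME_INDEX[name]              # index lookup
    found = set()
    for friend in FRIENDS[start]:         # traversal, no index involved
        for fof in FRIENDS[friend]:
            # The gender check is an in-memory property comparison.
            if fof != start and NODES[fof]["gender"] == "female":
                found.add(fof)
    return sorted(NODES[node_id]["name"] for node_id in found)

print(female_friends_of_friends("Me"))
```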

How does a full text search server like Sphinx work?

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
When you run an SQL query through Sphinx, it fetches the data from the database and constructs the inverted index, which in Sphinx is like a hashtable where the key is a 32-bit integer calculated using crc32(word) and the value is the list of document IDs containing that word.
This makes it super fast.
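A minimal sketch of the hashtable described above, in plain Python. It takes the answer's description at face value (a CRC32 of the word as the key, a list of document ids as the value) and ignores hash collisions; `DOCS`, `word_key`, `crc_index`, and `lookup` are hypothetical names, not Sphinx internals.

```python
import zlib
from collections import defaultdict

def word_key(word):
    """32-bit integer key for a word, computed with CRC32."""
    return zlib.crc32(word.lower().encode("utf-8"))

# Hypothetical documents: document id -> text.
DOCS = {1: "red apple", 2: "green apple", 3: "red car"}

# Build the table: key = crc32(word), value = ids of documents with the word.
crc_index = defaultdict(list)
for doc_id, text in DOCS.items():
    for word in set(text.split()):
        crc_index[word_key(word)].append(doc_id)

def lookup(word):
    """Return the sorted document ids whose text contains the word."""
    return sorted(crc_index.get(word_key(word), []))

print(lookup("apple"))
```

A lookup is a single hash computation plus a dictionary access, regardless of how many documents were indexed, which is where the speed comes from.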
Now you can argue that even a database can create a similar structure to make searches super fast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data that needs to be searched.
Another possible reason is that databases generally suffer from internal fragmentation and need to perform a lot of semi-random I/O on huge requests.
What that means is that, given the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to retrieve is widely spread, the result will take long, and that seems to be what happens in databases.
EDIT: Also please see the source code in the cpp files such as searchd.cpp for the real internal implementation; I think you are just seeing the PHP wrappers.
The queries you are looking at are the ones Sphinx uses to extract a copy of the data from the database and put it in its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask the searchd daemon for results matching a specific query; it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it extracts only part of the data, probably for sharding, that is, splitting the index into parts for performance reasons. It also uses range queries, so that big datasets can be read piecemeal.
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a sphinx index, with two fields - name and description that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
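The arithmetic in the queries quoted in the question can be checked with a short plain-Python sketch. The multiplier 5 and offset 1 come from the quoted SQL; what they are meant to achieve is not stated in the thread, so the sketch only shows that the mapping is reversible and how `$start`/`$end` style ranges could be stepped through. The function names and the step size are hypothetical.

```python
MULTIPLIER = 5   # from the quoted query: items.id * 5 + 1
OFFSET = 1

def to_document_id(row_id):
    """Mirror of `items.id * 5 + 1` in the indexing query."""
    return row_id * MULTIPLIER + OFFSET

def to_row_id(document_id):
    """Mirror of `($id - 1) / 5` in the query that loads the row back."""
    return (document_id - OFFSET) // MULTIPLIER

def id_ranges(min_id, max_id, step):
    """Yield (start, end) pairs covering min_id..max_id inclusive."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1

# Each pair would fill in $start and $end for one pass of the ranged query.
for start, end in id_ranges(1, 10, 4):
    print(start, end)
```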
Full-text search usually uses some implementation of an inverted index. In simple words, it breaks the content of an indexed field into tokens (words) and saves a reference to that row, indexed by each token. For example, a field with The yellow dog for row #1 and The brown fox for row #2 will populate an index like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
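The listing above can be reproduced with a few lines of plain Python. This is a sketch of the general technique only, with whitespace tokenization and no lowercasing or stop-word removal; `ROWS` and `build_postings` are hypothetical names.

```python
from collections import defaultdict

# The two example rows from the text above.
ROWS = {1: "The yellow dog", 2: "The brown fox"}

def build_postings(rows):
    """Map each token to the list of row ids whose field contains it."""
    postings = defaultdict(list)
    for row_id in sorted(rows):
        for token in rows[row_id].split():
            if row_id not in postings[token]:
                postings[token].append(row_id)
    return dict(postings)

postings = build_postings(ROWS)

# Print one line per (token, row) pair, ordered by token ignoring case.
for token in sorted(postings, key=str.lower):
    for row_id in postings[token]:
        print(f"{token} -> row#{row_id}")
```

Searching for a word is then a lookup in `postings` instead of a scan over every row, which is what `LIKE '%keyword%'` has to do.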
A short answer to the question is that databases such as MySQL are designed for storing and indexing records and supporting relational operations (selection, projection, join, etc.). Even though they can be used for keyword search queries, they do not give the best performance or features. Search engines such as Sphinx are designed specifically for keyword search queries and can therefore support them much better.

Storing a Lucene index in a database using data objects in Java

Is this possible? I cannot access the database directly--only through data objects.
Would I be able to search the index if the items are returned in something like ArrayList?
If this is not possible, is there some way I can use Lucene (or some other tool) to do fuzzy matching against an object using java?
For example, I have a Person object that has a FirstName and LastName. I want to do a fuzzy match on the name.
So, say I have an array of x amount of Person objects, would there be an efficient way of looping through each Person object and comparing the names?
Take those data objects and build a separate Lucene index over them, storing the fields you need. Using your Person example, every Lucene document would be [Id, FirstName, LastName]. A search on this index would return the Id required to query your database for the complete data object.
The actual indexing is easy: you just need to retrieve a list of data objects, iterate over them, generate Lucene documents, and store them using an IndexWriter. You could work against either a filesystem directory for persistent storage or in-memory storage.
Those are the possible solutions I came up with. However, I cannot store my index in an FSDirectory (project specs do not allow this), and for RAMDirectory there are going to be thousands of Person objects we'll need to search through, so I don't know if in-memory storage is ideal for this situation.
Is there any other sort of fuzzy match algorithm I can use that will be efficient for large sets of data?
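One option that needs no index at all is to compare names directly with an edit distance. The sketch below is plain Python rather than Java or Lucene, uses the classic Levenshtein distance, and scans the list linearly, which may be acceptable for thousands of objects but is an assumption to verify. The `Person` fields, `fuzzy_find`, and the `max_distance` threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Person:
    first_name: str
    last_name: str

def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_find(people, first_name, last_name, max_distance=2):
    """Return people whose full name is within max_distance edits, best first."""
    query = f"{first_name} {last_name}".lower()
    matches = []
    for person in people:
        name = f"{person.first_name} {person.last_name}".lower()
        distance = edit_distance(query, name)
        if distance <= max_distance:
            matches.append((distance, person))
    matches.sort(key=lambda pair: pair[0])
    return [person for _, person in matches]

people = [Person("John", "Smith"), Person("Jon", "Smyth"), Person("Mary", "Jones")]
print(fuzzy_find(people, "John", "Smith"))
```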

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc) that should be searched together (ie, a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In the index you can define additional fields for whatever you want to tokenize or store term vectors for.
Opening an IndexSearcher on each directory that holds an index takes time.
If you want to search each type separately, it would be better to index each type in its own index.
A single index is more structured than multiple indexes.
On the other hand, multiple indexes let you balance the load.
Not necessarily answering your direct questions, but... ;)
I'd go with one index, add a Keyword (indexed, stored) field for the type, it'll let you filter if needed, as well as tell the difference between the results you receive back.
(And maybe in the vein of your questions: using separate indexes will allow each corpus to have its own relevancy score. I don't know whether excessively repeated terms in one corpus would throw off the relevancy of documents in the others.)
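One way to look at that open question is through inverse document frequency, a common ingredient of relevance scoring. The sketch below uses a simplified formula, log(N / df), which is not Lucene's exact scoring, and the document counts are made up purely for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Simplified inverse document frequency; not Lucene's exact formula."""
    return math.log(total_docs / docs_with_term)

# Hypothetical counts: "invoice" is common in one corpus, rare in the other.
accounting = {"docs": 1000, "with_invoice": 800}
forum = {"docs": 1000, "with_invoice": 10}

# Separate indexes: each corpus weighs the term by its own statistics.
idf_accounting = idf(accounting["docs"], accounting["with_invoice"])
idf_forum = idf(forum["docs"], forum["with_invoice"])

# Single index: the statistics of both corpora are pooled.
idf_combined = idf(accounting["docs"] + forum["docs"],
                   accounting["with_invoice"] + forum["with_invoice"])

print(idf_accounting, idf_forum, idf_combined)
```

Under these assumed counts the pooled weight falls between the two per-corpus weights, so a term that is distinctive in the forum corpus counts for less once the accounting documents share the index.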
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture should resemble how you would design your databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As @llama pointed out, creating a single uber-index affects relevance scores and security/access control, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors, and similar. It also allows different policies for when IndexReaders/Writers are reopened/committed for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make that easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager