Can we adopt a Lucene index to retrieve data from RDF tables? - zend-lucene

Can a Lucene index be merged with B+ trees to improve indexing, in order to retrieve data from the RDF table? Is there any procedure available?

You can look into SIREn to link the RDF data into Lucene.

Related

Query Document Schema in MarkLogic

I would like to query the schema definition of an index in MarkLogic.
How can I query that?
What would the query be?
I am talking about a schema in the sense of an Elasticsearch schema, with field types, analyzers, etc.
Think of my question as asking how to see the column types and column names in Oracle. How do I do the same in MarkLogic? Any examples?
MarkLogic has a universal index, so there is no requirement to define a schema up front to search on specific elements or properties.
To do datatyped queries on element or properties, you can use TDE in MarkLogic 9 to define how to project datatyped values from documents in a collection into the indexes as a view over the documents. To find out the list of columns with data types for a view, you can either query the system columns view or retrieve the TDE template from the schemas database.
In MarkLogic 8 and before, you would define range indexes on elements, properties, fields, or paths. On the e-node, the Admin API can get the list of range indexes for any database. On the middle tier, the Management REST API can express the equivalent REST request.
Hoping that clarifies,

Index SQL table with solr using facets

I am a Solr newbie, and I am trying to use it to set up a faceted search over a denormalized database view (a table with a lot of fields).
At the moment I have created the index in Solr and I can query it via the Solr URL. I will use Solr facets to generate the search menu: a set of given fields with all possible values, and the number of occurrences of each value.
Now the question is: should I use Solr only to build the facets and plain old SQL to query the database, or is it better to use Solr to query the data as well?
I use facets for search refinement; if you want to suggest to users what to look for, look at the Terms component:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
It is always better to query Solr for your search results, because if you query the database, the number of results can differ from the counts you show for each facet: results in the database may not yet be reflected in Solr.
Another reason is performance: querying fields spread across multiple tables is expensive compared to querying the denormalized documents indexed in a search engine.
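As a minimal sketch of querying the facets through SolrJ (the core name `products` and the facet field `brand` are made-up placeholders, not from the question), the refinement menu described above can be built like this:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);          // turn faceting on
            query.addFacetField("brand");  // field to build the menu from
            query.setFacetMinCount(1);     // skip zero-count values

            QueryResponse response = client.query(query);

            // Each facet value comes back with its document count,
            // ready to render as a search-refinement menu entry.
            FacetField brands = response.getFacetField("brand");
            for (FacetField.Count c : brands.getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
        }
    }
}
```

Because both the result list and the facet counts come from the same Solr query, they cannot drift apart the way a mixed Solr-plus-SQL setup can.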

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
..and a many-to-many relationship between them (about 100K components per assembly), thus in total 10G relationships.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields. Not positive if Lucene supports them natively or not. It's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
to find components in assemblies 1, 2, or 3.
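As a minimal sketch of that approach (field names `componentId`/`assemblyIds`, Lucene 8+ classes, and the in-memory `ByteBuffersDirectory` are illustrative assumptions), a multi-valued field is modeled in Lucene by adding the same field to a document once per value:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ComponentIndexer {
    // Index one component document; the "assemblyIds" field is added once
    // per related assembly, which is how Lucene represents multi-valued fields.
    static void indexComponent(IndexWriter writer, String componentId,
                               List<String> assemblyIds) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("componentId", componentId, Field.Store.YES));
        for (String assemblyId : assemblyIds) {
            doc.add(new StringField("assemblyIds", assemblyId, Field.Store.NO));
        }
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws IOException {
        Directory dir = new ByteBuffersDirectory(); // in-memory for the sketch
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            indexComponent(writer, "C42", List.of("1", "2", "3"));
        }
    }
}
```

A query such as `assemblyIds:(1 OR 2 OR 3)` then matches this document. Note that with ~100K assemblies per component on the other side of the relationship, document size is the thing to watch; this sketch only shows the mechanics.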
To be brief: you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database; instead, you have to feed the data itself through Lucene (process, parse, analyze, index, and query).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source and add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents with Lucene; I don't think indexing would be a problem for you.
You can add multiple field instances with different values ("components") to a document that also has an "assembly" field.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, search for the "@ManyToMany" keyword to find the exact section and get some idea.

Structured and Unstructured indexing - Lucene and Hbase

I have a set of 200M documents I need to index. Every document has free text plus an additional set of sparse metadata (100+ columns).
It seems that the right tool for free text indexing is Lucene while the right tool for structured sparse metadata is HBase.
I would need to query the data and join free-text search results with the structured data results (e.g. get all books that have the phrase "good morning" in their text and were first published in 1980).
What tools/mechanisms should I look at to join structured and unstructured queries?
Results may include millions of records (before and after the join)
Thanks
Saar
A couple of things come to mind, in addition to Lucene on HBase:
1) Solr/Lucene can store multiple fields, and each field can have different types. So your date range example is plausible wholly within Solr.
2) If you are talking about truly huge data sets that require a cluster, also look at ElasticSearch: http://www.elasticsearch.org/
3) Lily attempts to answer your exact question http://www.lilyproject.org/lily/index.html
Looks like HBase would like some Lucene action as well: https://issues.apache.org/jira/browse/HBASE-3529.

Storing a Lucene index in a database using data objects in Java

Is this possible? I cannot access the database directly--only through data objects.
Would I be able to search the index if the items are returned in something like ArrayList?
If this is not possible, is there some way I can use Lucene (or some other tool) to do fuzzy matching against an object using java?
For example, I have a Person object that has a FirstName and LastName. I want to do a fuzzy match on the name.
So, say I have an array of some number of Person objects; would there be an efficient way of looping through each Person object and comparing the names?
Take those data objects and build a separate Lucene index over them, storing the fields you need. Using your Person example, every Lucene document would be [Id, FirstName, LastName]. A search on this index would return the Id required to query your database for the complete data object.
The actual indexing is easy: you retrieve a list of data objects, iterate over them, generate Lucene documents, and store them using an IndexWriter. You can work against either a filesystem directory for persistent storage or an in-memory store.
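A minimal sketch of that loop against Lucene 8+ (the `Person` holder, field names, and the in-memory `ByteBuffersDirectory` are illustrative assumptions, not part of the question's codebase). It also shows a FuzzyQuery for the fuzzy name match the question asks about:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Minimal stand-in for the application's data object.
class Person {
    final String id, firstName, lastName;
    Person(String id, String firstName, String lastName) {
        this.id = id; this.firstName = firstName; this.lastName = lastName;
    }
}

public class PersonIndexer {
    static void index(IndexWriter writer, List<Person> people) throws IOException {
        for (Person p : people) {
            Document doc = new Document();
            // Store only the id; the full object is fetched from the data layer.
            doc.add(new StringField("id", p.id, Field.Store.YES));
            doc.add(new TextField("firstName", p.firstName, Field.Store.NO));
            doc.add(new TextField("lastName", p.lastName, Field.Store.NO));
            writer.addDocument(doc);
        }
    }

    static void fuzzySearch(Directory dir, String lastName) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // FuzzyQuery matches terms within a small edit distance (default 2).
            // The term must be lowercase: StandardAnalyzer lowercases TextFields.
            FuzzyQuery query = new FuzzyQuery(new Term("lastName", lastName));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("id"));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            index(writer, List.of(new Person("1", "John", "Smith")));
        }
        fuzzySearch(dir, "smyth"); // within edit distance 1 of "smith"
    }
}
```

Returning only ids from the search and hydrating through the data-object layer keeps the "no direct database access" constraint intact.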
Those are the possible solutions I came up with. However, I cannot store my index in an FSDirectory (the project specs do not allow it), and for RAMDirectory there are going to be thousands of Person objects to search through, so I don't know if in-memory storage is ideal for this situation.
Is there any other sort of fuzzy match algorithm I can use that will be efficient for large sets of data?
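If you end up comparing names in plain Java without an index, the usual starting point is Levenshtein (edit) distance: names within a small distance of each other are treated as fuzzy matches. A minimal sketch of the standard dynamic-programming version:

```java
public class FuzzyMatch {
    // Classic two-row dynamic-programming Levenshtein edit distance:
    // the minimum number of insertions, deletions, and substitutions
    // needed to turn string a into string b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Jon", "John"));    // 1
        System.out.println(levenshtein("Smith", "Smyth")); // 1
    }
}
```

Be aware that scanning thousands of Person objects and computing this per pair is O(n * m * k) work per query; that is exactly the situation where a Lucene FuzzyQuery over an index (even an in-memory one) tends to pay off.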