building fulltext search index for jena and lucene - lucene

I would like to perform a full text search on a subset of dbpedia (which i have in a tdb store) with lucene and jena.
String TDBDirectory = "path" ;
Dataset dataset = TDBFactory.createDataset(TDBDirectory) ;
But not over all resources, only over titles. I think by making indices only over the needed triples I can perform a faster search. E.g.
<http://de.dbpedia.org/resource/Gurke> <http://www.w3.org/2000/01/rdf-schema#label> "Gurke"#de .
Here I would like to search for "Gurke", but not in any other triples than the ones with the #label property.
So my question is how do I build indices and search only triples with the #label property?
I have already looked at http://jena.sourceforge.net/ARQ/lucene-arq.html but it's not detailed enough or too difficult for me.

http://jena.sourceforge.net/ is the old home for Jena -- the project is now http://jena.apache.org/ (how did you managed to find that old page?)
The project recently introduced a replacement for LARQ.
http://jena.apache.org/documentation/query/text-query.html
and this is now part of the main codebase. It will released with the 2.10.2 release - for the moment you must use the development build from https://repository.apache.org/content/repositories/snapshots/org/apache/jena/. You either need to be using Fuseki or add it as a dependency for your project.
This new text search subsystem works much better with TDB and Fuseki.

Related

FROM keyword not retrieving remote triples

The following query works fine with Jena ARQ but in GraphDB it does not retrieve anything:
SELECT *
FROM <http://www.bobdc.com/miscfiles/BeatlesMusicians.ttl>
WHERE { ?s ?p ?o .}
Is there something I need to configure on GraphDB to get this to work?
After reviewing https://www.w3.org/TR/sparql11-query/ it looks to me like the spec is saying that treating the FROM URI as a URL and retrieving triples from that location is something that the query engine can do but is not required to do.
Section 13.2 says that FROM specifies an IRI, which sounds like it's talking about a named graph and not necessarily a remote dataset to retrieve (that is, treating the URI as a URL), which is what I was looking for.
Section 13.2.3 does have "FROM http://example.org/dft.ttl" in one of its examples, which looks to me like it's specifying a Turtle file on some remote server to read into the default graph, like I was trying to do.
Section 21 says "SPARQL queries using FROM, FROM NAMED, or GRAPH may cause the specified URI to be dereferenced," but as we see from the word "may", does not require it. (The rest of that paragraph has a little more about this.)
I have found that the Jena tools arq and fuseki do this, but GraphDB and Blazegraph do not.

Aem fulltextsearch

I want to search for a exact combination of words in all nodes in the aem using query builder.
Trying to debug the query http://localhost:4502/libs/cq/search/content/querydebug.html it returns me results that doesn't match my query.
For example if want to search for 'foo bar' in all nodes and I need to receive all nodes that contain 'Foo Bar', 'foo Bar', 'Foo bar', 'FOO BAR' but not only 'foo' and only 'bar' and not 'foo-bar'. Query in service is done by using QueryBuilder.
QueryBuilder is useful when you try to perform a query similar to SQL where you search against a property and its value. The full text search capabilities of the query debug interface is very limited as you have experienced.
However, remember that AEM uses an underlying Lucene and/or Solr index and it does provide a way to perform a native solr / lucene query.
Firstly create a embedded solr index (embedded is sufficient for a local development AEM instance) as mentioned under "Configuring AEM with an embedded SOLR server" in https://docs.adobe.com/docs/en/aem/6-0/deploy/upgrade/queries-and-indexing.html. This will trigger solr indexing of your JCR content.
Once indexing is complete (as seen from logs), you can perform native queries using the crx/de query interface.
Example query: select [jcr:path] from [nt:base] where native('solr', '<filter>?<solr_query_goes_here>'. Quite obviously you need to be familiar with solr queries. Thanks to the following slide share (slide 50 talks about native queries within AEM) http://www.slideshare.net/justinedelson/demystifying-oak-search
AEM support for native solr queries is a bit patchy. You might need to edit the SOLR schema xml file manually (created under the crx-quickstart folder) to add additional filters, custom fields etc. We had successfully tuned solr within AEM to perform a spacial search using the above method.
If you need all sorts of combinations for "foo bar" then you have to query:
fulltext=foo bar
You will only get the first 10 results. To get all, you'll need to:
p.limit=-1
You may want to specify the path:
path=/content/website/
Visit Adobe Query Builder API for more info.
Behind the scenes, AEM creates an xpath query and then executes it. Then, for any part of the query that doesn't map to xpath, it runs through the results and filters them.
You should also think about if there is a property to match as opposed to any text. That will give you much better results since you want accuracy. Right now you are casting an overly wide net, and I think you should consider restricting if for nothing other than performance reasons. Just a suggestion.
You say the results don't match your query, can you give us some idea of what comes back? And can you please put your actual query here. That will make it much easier to help.
this is a minimal example that provides a full-text search:
Query query = queryBuilder.createQuery(...);
// limit path
Predicate path = new Predicate(PathPredicateEvaluator.PATH);
path.set(PathPredicateEvaluator.PATH, "/content/where/ever);
query.getPredicates().add(path);
// Fulltext
Predicate fulltextSearch = new Predicate(FulltextPredicateEvaluator.FULLTEXT);
fulltextSearch.set(FulltextPredicateEvaluator.FULLTEXT, "foo bar");
fulltextSearch.set(FulltextPredicateEvaluator.REL_PATH, "jcr:content");
query.getPredicates().add(fulltextSearch);
// can I haz excerpt?
query.setExcerpt(true);
// Paging?
query.setStart(...);
query.setHitsPerPage(-1);
Note: it's not required to configure a solr index or whatever, you should be fine out of the box.
But if you limit the search to specific fields, you should create an index entry in oak:index. You can find a great cheat-sheet here.
I'm not sure if this helps.
but to get all the combinations of nodes that have the text i'm looking for I use jcr:like in xpath.
for example if I want to search all the nodes which has any property with Foo bar in its value or key, then my query looks like:
/jcr:root/content/yourpath//*[jcr:like(\*/, '%FOO bar%')]
You will not get that flexibility in QueryBuilder but you can still get what you want by using JCR-SQL2.
The following query will return all entries with "Foo Bar", "foo bar", "foo Bar", "Foo bar", but not "foo", "bar", "foo-bar" when your value is "foo bar".
SELECT * FROM [nt:unstructured] WHERE ISDESCENDANTNODE('/jcr:root/content/yourpath') AND LOWER([prop]) LIKE "%foo bar%" ORDER BY [cq:lastModified] desc
Just ensure that while checking for the values in repository you send the value in lowercase for case-insensitive search.
For case-sensitive search you can use:
SELECT * FROM [nt:unstructured] WHERE ISDESCENDANTNODE('/jcr:root/content/yourpath') AND [prop] LIKE "%foo bar%" ORDER BY [cq:lastModified] desc

Querying for shared nodes in JCR (ModeShape)

I have a JCR content repository implemented in ModeShape (4.0.0.Final). The structure of the repository is quite simple and looks like this:
/ (root)
Content/
Item 1
Item 2
Item 3
...
Tags/
Foo/
Bar/
.../
The content is initially created and stored under /Content as [nt:unstructured] nodes with [mix:shareable] mixin. When a content item is tagged, the tag node is first created under /Tags if it's not already there, and the content node is shared/cloned to the tag node using Workspace.clone(...) as described in the JCR 2.0 spec, section 14.1, Creation of Shared Nodes.
(I don't find this particularly elegant and I did just read this answer, about creating a tag based search system in JCR, so I realize this might not be the best/fastest/most scaleable solution. But I "inherited" this solution from developers before me, so I hope I don't have to rewrite it all...)
Anyway, the sharing itself seems to work (I can verify that the nodes are there using the ModeShape Content Explorer web app or programatically by session.getRootNode().getNode("Tags/Foo").getNodes()). But I am not able to find any shared nodes using a query!
My initial try (using JCR_SQL2 syntax) was:
SELECT * FROM [nt:unstructured] AS content
WHERE PATH(content) LIKE '/Tags/Foo/%' // ISDECENDANTNODE(content, '/Tags/Foo') gives same result
ORDER BY NAME(content)
The result set was to my surprise empty.
I also tried searching in [mix:shareable] like this:
SELECT * FROM [mix:shareable] AS content
WHERE PATH(content) LIKE '/Tags/Foo/%' // ISDECENDANTNODE(content, '/Tags/Foo') gives same result
ORDER BY NAME(content)
This also returned an empty result set.
I can see from the query:
SELECT * FROM [nt:unstructured] AS content
WHERE PATH(content) LIKE '/Content/%' // ISDECENDANTNODE(content, '/Content') works just as well
ORDER BY NAME(content)
...that the query otherwise works, and returns the expected result (all content). It just doesn't work when searching for the shared nodes.
How do I correctly search for shared nodes in JCR using ModeShape?
Update: I upgraded to 4.1.0.Final to see if that helped, but it had no effect on the described behaviour.
Cross-posted from the ModeShape forum:
Shared nodes are really just a single node that appears in multiple places within a workspace, so it's not exactly clear what it semantically means to get multiple query results for that one shareable node. Per Section 14.16 of the JSR-283 (JCR 2.0) specification implementations are free to include shareable nodes in query results at just one or at multiple/all of those locations.
ModeShape 2.x and 3.x always returned in query results only a single location of the shared nodes, as this was the behavior of the reference implementation and this was the feedback we got from users. When we were working on Modeshape 4.0, we tried to make it possible to return multiple results, but we ran into problems with the TCK and uncertainty about what this new expected behavior would be. Therefore, we backed off our goals and implemented query to return only one of the shared locations, as we did with 2.x and 3.x.
I may be wrong, but I'm not exactly sure if any JCR implementation returns multiple rows for a single shared node, but I may be wrong.

Map blank nodes from stardog to pubby

So I have this .rdf that I have loaded onto Stardog and then I am using Pubby running over Jetty, to browse the triple store.
In my rdf file, I have several blank nodes which is given a blank node identifier by stardog. So this is a snippet of the rdf file.
<kbp:ORG rdf:about="http://somehostname/resource/res1">
<kbp:canonical_mention>
<rdf:Description>
<kbp:mentionStart rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1234</kbp:mentionStart>
<kbp:mentionEnd rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1239</kbp:mentionEnd>
</rdf:Description>
</kbp:canonical_mention>
So basically I have some resource "res1" which has links to blank node which has a mention start and mention end offset value.
The snippet of the config.ttl file for Pubby is shown below.
conf:dataset [
# SPARQL endpoint URL of the dataset
conf:sparqlEndpoint <http://localhost:5822/xxx/query>;
#conf:sparqlEndpoint <http://localhost:5822/foaf/query>;
# Default graph name to query (not necessary for most endpoints)
conf:sparqlDefaultGraph <http://dbpedia.org>;
# Common URI prefix of all resource URIs in the SPARQL dataset
conf:datasetBase <http://somehostname/>;
...
...
So the key thing is the datasetBase which maps URIs to URL.
When I try to map this, there is an "Anonymous node" link but upon clicking, nothing is displayed. My guess is, this is because the blank node has some identifier like _:bnode1234 which is not mapped by Pubby.
I wanted to know if anyone out there knows how to map these blank nodes.
(Note: If I load this rdf as a static rdf file directly onto Pubby, it works fine. But when I use stardog as a triple store, this mapping doesn't quite work)
It probably works in Pubby because they are keeping the BNode id's available; generally, the SPARQL spec does not guarantee or require that BNode identifiers are persistent. That is, you can issue the same query multiple times, which brings back the same result set (including bnodes) and the identifiers can be different each time. Similarly, a bnode identifier in a query is treated like a variable, it does not mean you are querying for that specific bnode.
Thus, Pubby is probably being helpful and making that work which is why using it directly works as opposed to a third party database.
Stardog does support the Jena/ARQ trick of putting a bnode identifier in angle brackets, that is, <_:bnode1234> which is taken to mean, the bnode with the identifier "bnode1234". If you can get Pubby to use that syntax in queries for bnodes, it will probably work.
But generally, I think this is something you will have to take up with the Pubby developers.

SPARQL - what does it take to find an ontology?

I'm pretty new to SPARQL, OWL and Jena, so please excuse if I'm asking utterly stupid questions. I'm having a problem that is driving me nuts since a couple of days. I'm using the following String as a query for a Jena QueryFactory.create(queryString),
queryString = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>"+
"PREFIX ho: <http://www.flatlandfarm.de/fhtw/ontologies/2010/5/22/helloOwl.owl#>" +
"SELECT ?name ?person ?test ?group "+
"WHERE { ?person foaf:name ?name ; "+
" a ho:GoodPerson ; "+
" ho:isMemberOf ?group ; "+
"}";
Until this morning it worked as long as I only asked for properties from the foaf namespace. As soon as I asked for properties from my own namespace I always got empty results. While I was about to post this question here and did some final tests to be able to post it as precise as possible, it suddenly worked. So as I did not know what exactly to ask for anymore, I deleted my question before posting it. A couple of hours later I used Protege's Pellet plugin to create and export an inferred Model. I called it helloOwlInferred.owl and uploaded it to the directory on my server where helloWl.owl resided yet. I adjusted my method to load the inferred ontology and changed the above query so that the prefix ho: was assigned to the inferred ontology as well. At once, nothing worked any more. To be exact it was not nothing that worked any more but it was the same symptoms I had till this morning with my original query. My prefix did not work any more. I did a simple test: I renamed all the helloWorldInferred.owl files (the one on my server for the prefix and my local copy which I loaded) to helloWorld.owl. Strange enough that fixed everything.
Renaming it back to helloWorldInferred.owl broke everything again. And so on. What is going on there? Do I just need to wait a couple of weeks until my ontology gets "registered as a valid prefix"?
Maybe your OWL file contains the rdf:ID="something" construct (or some other form of relative URL, such as rdf:about="#something")?
rdf:ID and relative URLs are expanded into full absolute URLs, such as http://whatever/file.owl#something, by using the base URL of the OWL file. If the base URL is not explicitly specified in the file (using something like xml:base="http://whatever/file.owl"), then the location of the file on the web server (or in your file system if you load a local file) will be used as the base URI.
So if you move the file around or have copies in several locations, then the URIs in your file will change, and hence you'd have to change your SPARQL query accordingly.
Including an explicit xml:base, or avoiding relative URIs and rdf:ID, should fix the problem.
The whole idea of prefixes and QNames is just to compress URIs to save space and improve readability, the most common issue with them is spelling errors either in the definitions themselves or in the QNames.
Most likely the prefix definition you are using in your query is causing URIs to be generated which don't match the actual URIs of properties in your ontology.
That being said your issue may be due to something with Jena so it may well be worth asking your question on the Jena Mailing List
It looks like this was caused by a bug (or a feature?) in Protege. When I exported the inferred ontology with a new name, Protege changed the definitions of xmlns(blank) and xml:base to the name of the new file, but it did not change the definition of the actual namespace.
xmlns="http://xyz.com/helloOwl.owl" => xmlns="http://xyz.com/helloOwlInferred.owl"
xml:base="http://xyz.com/helloOwl.owl" => xml:base="http://xyz.com/helloOwlInferred.owl"
xmlns:helloOwl="http://xyz.com/helloOwl.owl" => xml:base="http://xyz.com/helloOwl.owl"
<!ENTITY helloOwl "http://wxyz.com/helloOwl.owl#" > => <!ENTITY helloOwl "http://wxyz.com/helloOwl.owl#" >
Since I fixed that it seems to work.
My fault not having examined the the actual source with the necessary attention.
You have to define a precise URI prefix for ho:, then tell it to Protegé (there is a panel for namespaces and define the same URI as the ontology prefix), so that, when you define GoodPerson in Protegé, it assumes you mean http://www.flatlandfarm.de/fhtw/ontologies/2010/5/22/helloOwl.owl#GoodPerson, which is the same as ho:GoodPerson only if you have used the same URI prefix for the two.
If you don't do so, Protegé (or some other component, like a web server) will do these silly things like composing the ontology's URI and its default URI prefix (the one that goes in front of GoodPerson when you don't specify any prefix) using the file name (or even worse, a URI like file:///home/user/...).
Remember, the ontology's URI is technically different than the URI's prefix that you use for the entities associated to the ontology itself (classes, properties etc), and ho: is just a shortcut having a local meaning, which depends on what you define in documents like files or SPARQL queries.
The ontology URI can also be different than the URL from where the ontology file can be fetched, although it is good to make them the same. Usually you need to play with URL rewriting in Apache to make that happen, but sometimes that ontology file isn't physically published, since the ontology is loaded into a SPARQL endpoint and its URI is resolved to an RDF document through the help of the endpoint itself, by rewriting the ontology URI into a SPARQL request that issues a DESCRIBE statement. The same trick can be used to resolve any other URI (i.e., your ontology-instantiating data), as long as associated data are accessible from your SPARQL endpoint (ie, are in your triple store).