Jena: How to infer data / performance issues - sparql

I'd like to use Jena's inference capabilities, but I'm having some performance problems when using InfModel.
Here's a simplified overview of my ontology:
Properties:
hasX (Ranges(intersection): X, inverse properties: isXOf)
|-- hasSpecialX (Ranges(intersection): X, inverse properties: isSpecialXOf)
isXOf (Domains(intersection): X, inverse properties: hasX)
|--isSpecialXOf (Domains(intersection): X, inverse properties: hasSpecialX)
Furthermore there's a class 'Object':
Object hasSpecialX some X
Explicitly stored is the following data:
SomeObject a Object
SomeX a X
SomeObject hasSpecialX SomeX
Using the following query I'd like to retrieve the instances that are related to :SomeX via :hasX. Since hasSpecialX is a subproperty of hasX, only 'SomeObject' should be returned once inference is applied.
SELECT ?x WHERE { ?x :hasX :SomeX . }
However, querying against ds.getDefaultModel() doesn't work, because the inferred triples aren't stored explicitly. When I use the InfModel instead, the query never finishes; I've waited up to 25 minutes before aborting. (The triple store is about 180 MB.)
This is my code:
OntModel ont = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, null);
ont.read("file:..." , "RDF/XML");

Reasoner reasoner = ReasonerRegistry.getOWLMicroReasoner();
reasoner = reasoner.bindSchema(ont);

Dataset dataset = TDBFactory.createDataset(...);
Model model = dataset.getDefaultModel();
InfModel infModel = ModelFactory.createInfModel(reasoner, model);

QueryExecution qe = null;
ResultSet rs;

try {
    // note: the query needs a PREFIX declaration matching the ontology's
    // namespace, otherwise it won't parse
    String qry = "PREFIX : <...> SELECT ?x WHERE { ?x :hasX :SomeX . }";
    qe = QueryExecutionFactory.create(qry, infModel);
    rs = qe.execSelect();
    while (rs.hasNext()) {
        QuerySolution sol = rs.nextSolution();
        System.out.println(sol.get("x"));
    }
} finally {
    if (qe != null) {
        qe.close();
    }
    infModel.close();
    model.close();
    dataset.close();
}
Is there anything wrong with the code above, or what else could be the reason it doesn't work?
Besides that, I'd like to know whether I can increase performance by doing 'Export inferred axioms as ontology' (as provided by Protégé)?
EDIT:
In the meantime I've tried to use Pellet, but I still can't get an inferred model, as described in my other question: OutOfMemoryError using Pellet as Reasoner. So what else can I do?

Regarding performance, it is better to do the inference before asserting the data, and then to run the SPARQL queries with the Jena inference mechanism off. You are already using TDB, which is the right Jena component for big datasets.
If you do not get the expected performance using the inferred data directly, then I recommend moving to a more scalable triple store (4store or Virtuoso).
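To make that concrete, here is a minimal sketch of materializing the entailments up front (the file path and TDB directory are placeholders, not from the original post):

// compute the inference closure once and persist it, so later queries can
// run against the plain TDB model with no reasoner attached
OntModel schema = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, null);
schema.read("file:ontology.rdf", "RDF/XML");                // placeholder path

Reasoner reasoner = ReasonerRegistry.getOWLMicroReasoner().bindSchema(schema);

Dataset dataset = TDBFactory.createDataset("/path/to/tdb"); // placeholder directory
Model base = dataset.getDefaultModel();
InfModel inf = ModelFactory.createInfModel(reasoner, base);

// copy the entailed statements into a scratch model first, so we are not
// writing to the base model while iterating an InfModel that is backed by it
// (for a 180 MB store this copy is memory-hungry, but it only runs once)
Model materialized = ModelFactory.createDefaultModel();
materialized.add(inf);

base.add(materialized);
TDB.sync(dataset);

Protégé's 'Export inferred axioms as ontology' achieves the same thing offline: either way the entailments end up asserted explicitly, so the SELECT query above can run against ds.getDefaultModel() with inference switched off.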

Related

Converting gremlin query from gremlin console to Bytecode

I am trying to convert a Gremlin query received from the Gremlin Console to Bytecode in order to extract the StepInstructions. I am using the code below, but it looks hacky and ugly to me. Is there a better way of converting a Gremlin query from the Gremlin Console to Bytecode?
// the raw Gremlin string received from the console
String query = (String) requestMessage.getArgs().get(Tokens.ARGS_GREMLIN);
// compile the string with the Groovy script engine
final GremlinGroovyScriptEngine engine = new GremlinGroovyScriptEngine();
CompiledScript compiledScript = engine.compile(query);
// evaluate it against an empty graph just to obtain a traversal
final Graph graph = EmptyGraph.instance();
final GraphTraversalSource g = graph.traversal();
final Bindings bindings = engine.createBindings();
bindings.put("g", g);
DefaultGraphTraversal graphTraversal = (DefaultGraphTraversal) compiledScript.eval(bindings);
Bytecode bytecode = graphTraversal.getBytecode();
If you need to take a Gremlin string and convert it to Bytecode, I don't think there is a much better way to do it. You must pass the string through a GremlinGroovyScriptEngine to evaluate it into an actual Traversal object that you can manipulate. The only improvement I can think of would be to call eval() more directly:
// construct all of this once and re-use it for your application
final GremlinGroovyScriptEngine engine = new GremlinGroovyScriptEngine();
final Graph graph = EmptyGraph.instance();
final GraphTraversalSource g = graph.traversal();
final Bindings bindings = engine.createBindings();
bindings.put("g", g);
//////////////
String query = (String) requestMessage.getArgs().get(Tokens.ARGS_GREMLIN);
DefaultGraphTraversal graphTraversal = (DefaultGraphTraversal) engine.eval(query, bindings);
Bytecode bytecode = graphTraversal.getBytecode();
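If the end goal is extracting the StepInstructions, they can then be read straight off the bytecode; a small illustrative addition, assuming the standard TinkerPop Bytecode API:

// needs java.util.Arrays
for (final Bytecode.Instruction instruction : bytecode.getStepInstructions()) {
    // each instruction is an operator name plus its arguments, e.g. "V", "has"
    System.out.println(instruction.getOperator() + " "
            + Arrays.toString(instruction.getArguments()));
}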

How to use secured queries in Apache Jena 3.10.0?

I am trying to build a secured query in Apache Jena 3.10.0.
I want to take a query, modify it according to an existing SecurityEvaluator, and execute it later.
However, I don't understand how this should work for certain forms of queries.
I tried doing it using a simplified version of SecuredQueryEngine:
public class GenSecuredEngine extends QueryEngineMain {

    private static Logger LOG = LoggerFactory
            .getLogger(SecuredQueryEngine.class);
    private SecurityEvaluator securityEvaluator;
    private Node graphIRI;

    public GenSecuredEngine(final Query query, final DatasetGraph dataset,
            final SecurityEvaluator evaluator,
            final Binding input, final Context context) {
        super(query, dataset, input, context);
        this.securityEvaluator = evaluator;
        graphIRI = NodeFactory.createURI("urn:x-arq:DefaultGraph");
    }

    @Override
    protected Op modifyOp(final Op op) {
        final OpRewriter rewriter = new OpRewriter(securityEvaluator, graphIRI);
        LOG.debug("Before: {}", op);
        op.visit(rewriter);
        Op result = rewriter.getResult();
        result = result == null ? op : result;
        LOG.debug("After: {}", result);
        result = super.modifyOp(result);
        LOG.debug("After Optimize: {}", result);
        return result;
    }
}
Then the following code compiles the given query into an Op, checks the permissions, and builds the new Op object:
Op oporiginal = new AlgebraGenerator().compile(query);
Op result = securedEngine.modifyOp(oporiginal);
System.out.println(OpAsQuery.asQuery(result));
If I pass a query like this
select *
where {
graph <forbiddenGraphUri> {?a ?b ?c}
}
then everything works fine: the forbidden graph URI is checked against the SecurityEvaluator, which throws org.apache.jena.shared.ReadDeniedException: Model permissions violation.
But what if the user wants to execute something like this:
select * where { graph ?g {?a ?b ?c}}
In this case we can only check the URI of ?g, which is not really informative. I understand that the AlgebraGenerator does not fill in the missing graph bindings, so this is probably the wrong approach.
So, how can it be done? For example, if the user wants to run the query against many named graphs, how do we filter out the ones that are not allowed? Is that possible at all with the existing tools?

Select FROM clause in Jena returning no results

We are having trouble reliably issuing SPARQL queries across multiple graphs using the SPARQL FROM clause within a Jena dataset.
Here is an example of the issue:
final String subject = "http://example.com/ont/breakfast#espresso";
final String graph1 = "http://example.com/ont/breakfast/graph#espresso_definition";
final String graph2 = "http://example.com/ont/breakfast/graph#espresso_decoration";

// Add some triples to graphs within the dataset
Dataset dataset = DatasetFactory.create();

Model modelG1 = dataset.getNamedModel(graph1);
Resource espressoTypeG1 = modelG1.createResource(subject)
        .addProperty(RDF.type, OWL.Class);
Resource espressoLabelG1 = modelG1.createResource(subject)
        .addProperty(RDFS.label, "Espresso");

Model modelG2 = dataset.getNamedModel(graph2);
Resource espressoLabelG2 = modelG2.createResource(subject)
        .addProperty(RDFS.label, "Black Gold");

// The query to execute - returns no results
String queryString = "select * FROM <" + graph1 + "> FROM <" + graph2 + "> " +
        "{ <" + subject + "> ?p ?o }";
// This, however, works:
// String queryString = "select * { graph ?g { <" + subject + "> ?p ?o } }";

// Run the query
Query query = QueryFactory.create(queryString);
try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
    ResultSet results = qe.execSelect();
    while (results.hasNext()) {
        QuerySolution result = results.next();
        System.out.println(result);
    }
}
A combination of a VALUES clause and the GRAPH keyword has helped us through most of the scenarios where we need to process multiple graphs in the same query. There are some queries where this gets quite unwieldy or downright inefficient.
What can we do to correctly issue a query across a union of models within a single dataset?
Note that the queries are not known at compile time, so we cannot rely on manually creating unions of models in Java code. Furthermore the data is generally added using a combination of loading from files, sparql update and calls to dataset.asDatasetGraph().add(...).
Handling of FROM and FROM NAMED depends on whether the Dataset implementation supports it; the default in-memory implementations do not.
To enforce it you can use the DynamicDatasets and DatasetDescription helper classes to resolve the dataset that the query specifies, e.g.:
Dataset resolvedDataset =
        DynamicDatasets.dynamicDataset(DatasetDescription.create(query), dataset, false);
try (QueryExecution qe = QueryExecutionFactory.create(query, resolvedDataset)) {
    // Normal result processing logic goes here...
}
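With the resolved dataset in place, the graphs named in the FROM clauses form the default graph seen by the query, so the original select * FROM <graph1> FROM <graph2> query returns the triples from both.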

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an expression tree instead of a Func, so that it's treated as IQueryable instead of IEnumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);

using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 results by default. However, you can pair it with Skip(n) to page through all of them:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // declaration was missing in the original snippet

do
{
    nextGroupOfPoints = session.Query<T>()
        .Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    // account for documents the server skipped on this page
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);

return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have only a few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default of 128) for human consumption, then something is wrong, or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Index:
// map
from post in docs.Posts
select new { post.Author, Count = 1 }

// reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
// the original snippet was garbled here; a query against that map/reduce
// index would look roughly like this ("some-author" is a placeholder)
var stats = session.Query<AuthorPostStats>("Posts/ByUser/Count")
    .Where(x => x.Author == "some-author")
    .ToList();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
// plain query against the index:
// var query = session.Query<User, MyUserIndex>();
// or with a Where clause on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);

using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs) throws IOException {
    List<Float> scores = new ArrayList<Float>();        // List is an interface; use ArrayList
    IndexSearcher searcher = new IndexSearcher(reader); // IndexReader has no getSearcher()
    Collector collector = TopScoreDocCollector.create(newDocs.size(), true);

    Weight weight = query.createWeight(searcher);
    Scorer scorer = weight.scorer(reader, true, true);
    collector.setScorer(scorer);

    float score = 0.0f;
    for (Integer d : newDocs) {
        scorer.advance(d);
        collector.collect(d);
        score = scorer.score();
        System.out.println("doc: " + d + "; score=" + score);
        scores.add(new Float(score));
    }
    return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search - Lucene will handle the filtering.
In general, if you are looking at DocIds, you're probably using a lower-level API than you need to, and it's going to give you trouble.
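A minimal sketch of that suggestion, assuming Lucene 3.x and an optimized (single-segment) index so that the global doc ids in the list line up with the reader handed to the filter (on a multi-segment index the ids would need to be mapped per segment):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;
import org.apache.lucene.util.OpenBitSet;

List<Float> scoreSubset(IndexReader reader, Query query, List<Integer> docIds)
        throws IOException {
    if (docIds.isEmpty()) {
        return new ArrayList<Float>();
    }
    IndexSearcher searcher = new IndexSearcher(reader);

    // a filter that only lets the requested doc ids through
    final OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    for (int d : docIds) {
        bits.set(d);
    }
    Filter filter = new Filter() {
        @Override
        public DocIdSet getDocIdSet(IndexReader r) {
            return bits; // assumes global ids == segment ids (single segment)
        }
    };

    // let Lucene score as usual; only the filtered docs are candidates
    TopDocs top = searcher.search(query, filter, docIds.size());
    List<Float> scores = new ArrayList<Float>();
    for (ScoreDoc sd : top.scoreDocs) {
        scores.add(sd.score);
    }
    return scores;
}

Note that the hits come back ranked by score rather than in the order of the input list; if the original order matters, map them back via ScoreDoc.doc.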