Jena Text query performance slows down dramatically with large dataset

Jena Text query performance slows down dramatically with large dataset - lucene

I am working on querying from a RDF dataset of 2.37 GB with approx 17 million triples in it and lucence index of the dataset is also maintained. I tried text queries of jena-text module which search's on the basis of stored lucene indexes. But its performance is quite slow, it is taking 4 or more seconds for a search query which is very slow.
However when I use luncene index viewer 'luke'. Indexes seems to have no problem and when I search for a particular term in from indexes it takes few miliseconds to search it.
So the problem is that I am unable to recognize that why is it taking so much time when it comes to 'jena-texr'.
Following is the sparql query:
SELECT ?subj ?status ?version ?label
WHERE {
?subj rdf:type ts:Valueset;
text:query 'cancer';
ts:entityStatus ?status;
OPTIONAL { ?subj ts:versionID ?version . } .
OPTIONAL { ?subj rdfs:label ?label . } .
}
LIMIT <limit>
OFFSET <offset>
Here is the jena code:
store.getDataset().begin(ReadWrite.READ) ;
Query query = QueryFactory.create(queryStr);
QueryExecution qexec = QueryExecutionFactory.create(query , store.getDataset()) ;
ResultSet results = qexec.execSelect();
while(results.hasNext()){
QuerySolution qs = results.next();
And Here is the code for creating indexed dataset.
Dataset baseDS = TDBFactory.createDataset(storePath.trim());
//define index mapping
EntityDefinition entityDef = new EntityDefinition("uri", "property", RDFS.label.asNode());
entityDef.set("property", TS.conceptCode.asNode());
entityDef.set("property", SKOS_XL.literalForm.asNode());
entityDef.set("property", SKOS.note.asNode());
entityDef.set("property", SKOS.definition.asNode());
//create in file lucene
File indexDir = new File(textIndexPath);
Directory luceneDir = null;
try {
luceneDir = FSDirectory.open(indexDir);
} catch (IOException e) {
e.printStackTrace();
}
// Join together into a dataset
Dataset indexedDS = TextDatasetFactory.createLucene(baseDS, luceneDir, entityDef) ;
Kindly can anyone identify if there is any problem with the code and the way indexed dataset is configured. Thanks

It seems this is a known issue, I am having problems with it too :(
https://issues.apache.org/jira/browse/JENA-999

Related

How to filter distinct regex matches with SPARQL?

I am querying a (kind of) bibliographic database and would like to find all the distinct matches of a certain regex (matching the signature of typescripts (TS) and manuscripts (MS)); i.e. I would like to return all documents that are currently in the database.
I came up with
SELECT ?document
WHERE
{
{
?documentURI a witt:MS;
rdfs:label ?document.
}
UNION
{
?documentURI a witt:TS;
rdfs:label ?document.
}
FILTER (regex(?document, "(Ms|Ts)\\-((1|2|3)\\d{2}\\w?\\d?)"))
}
(endpoint); this returns all the signatures but I would like to filter the result for the distinct regex matches, i.e. the distinct signatures up to and excluding the comma.
How can this be achieved?

Ok, I think I found a solution with strbefore:
SELECT DISTINCT ?document
WHERE
{
{
?documentURI a witt:MS;
rdfs:label ?documentFull.
}
UNION
{
?documentURI a witt:TS;
rdfs:label ?documentFull.
}
BIND (strbefore(?documentFull, ",") AS ?document)
}
Try it.
Would appreciate opinions on the query though, is this effective/good style?

Does Jena TDB load all data into memory every time?

I am a newbie of Jena. I try to deal with the Yoga dataset using TDB. The dataset is about 200M and everytime I run the same query, it will have to take about 5 minutes to load the data then give out the results. I am wondering do I misunderstand any part of TDB? The following are my codes.
String directory = "tdb";
Dataset dataset = TDBFactory.createDataset(directory);
dataset.begin(ReadWrite.WRITE);
Model tdb = dataset.getDefaultModel();
//String source = "yagoMetaFacts.ttl";
//FileManager.get().readModel(tdb, source);
String queryString = "SELECT DISTINCT ?p WHERE { ?s ?p ?o. }";
Query query = QueryFactory.create(queryString);
try(QueryExecution qexec = QueryExecutionFactory.create(query, tdb)){
ResultSet results = qexec.execSelect();
ResultSetFormatter.out(System.out, results, query) ;
}
dataset.commit();
dataset.end();

There are two ways to load data into tdb, either by API or CMD. Much thanks to #ASKW and #AndyS
1 Load data via API
These codes need to be executed only once especially the readModel line which will takes long time.
String directory = "tdb";
Dataset dataset = TDBFactory.createDataset(directory);
dataset.begin(ReadWrite.WRITE);
Model tdb = dataset.getDefaultModel();
String source = "yagoMetaFacts.ttl";
FileManager.get().readModel(tdb, source);
dataset.commit(); //Important!! This is to commit the data to tdb.
dataset.end();
After the data is loaded into tdb, we can use following codes to query. And it is not necessary to load data again.
String directory = "path\\to\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
String queryString = "SELECT DISTINCT ?p WHERE { ?s ?p ?o. }";
Query query = QueryFactory.create(queryString);
try(QueryExecution qexec = QueryExecutionFactory.create(query, tdb)){
ResultSet results = qexec.execSelect();
ResultSetFormatter.out(System.out, results, query) ;
}
2 Load data via CMD
To load data
>tdbloader --loc=path\to\tdb path\to\dataset.ttl
To query
>tdbquery --loc=path\to\tdb --query=q1.rq
q1.rq is the file which stores the query
Should get results like this
-------------------------------------------------------
| p |
=======================================================
| <http://yago-knowledge.org/resource/hasGloss> |
| <http://yago-knowledge.org/resource/occursSince> |
| <http://yago-knowledge.org/resource/occursUntil> |
| <http://yago-knowledge.org/resource/byTransport> |
| <http://yago-knowledge.org/resource/hasPredecessor> |
| <http://yago-knowledge.org/resource/hasSuccessor> |
| <http://www.w3.org/2000/01/rdf-schema#comment> |
-------------------------------------------------------

Simple SPARQL query does not return any results

I am just getting up and running with Blazegraph in embedded mode. I load a few sample triples and am able to retrieve them with a "select all" query:
SELECT * WHERE { ?s ?p ?o }
This query returns all my sample triples:
[s=<<<http://github.com/jschmidt10#person_Thomas>, <http://github.com/jschmidt10#hasAge>, "30"^^<http://www.w3.org/2001/XMLSchema#int>>>;p=blaze:history:added;o="2017-01-15T16:11:15.909Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>]
[s=<<<http://github.com/jschmidt10#person_Tommy>, <http://github.com/jschmidt10#hasLastName>, "Test">>;p=blaze:history:added;o="2017-01-15T16:11:15.909Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>]
[s=<<<http://github.com/jschmidt10#person_Tommy>, <http://www.w3.org/2002/07/owl#sameAs>, <http://github.com/jschmidt10#person_Thomas>>>;p=blaze:history:added;o="2017-01-15T16:11:15.909Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>]
[s=<http://github.com/jschmidt10#person_Thomas>;p=<http://github.com/jschmidt10#hasAge>;o="30"^^<http://www.w3.org/2001/XMLSchema#int>]
[s=<http://github.com/jschmidt10#person_Tommy>;p=<http://github.com/jschmidt10#hasLastName>;o="Test"]
[s=<http://github.com/jschmidt10#person_Tommy>;p=<http://www.w3.org/2002/07/owl#sameAs>;o=<http://github.com/jschmidt10#person_Thomas>]
Next I try a simple query for a particular subject:
SELECT * WHERE { <http://github.com/jschmidt10#person_Thomas> ?p ?o }
This query yields no results. It seems that none of my queries for a URI are working. I am able to get results when I query for a literal (e.g. ?s ?p "Test").
The API I am using to create my query is BigdataSailRepositoryConnection.prepareQuery().
Code snippet (Scala) that executes and generates the query:
val props = BasicRepositoryProvider.getProperties("./graph.jnl")
val sail = new BigdataSail(props)
val repo = new BigdataSailRepository(sail)
repo.initialize()
val query = "SELECT ?p ?o WHERE { <http://github.com/jschmidt10#person_Thomas> ?p ?o }"
val cxn = repo.getConnection
cxn.begin()
var res = cxn.
prepareTupleQuery(QueryLanguage.SPARQL, query).
evaluate()
while (res.hasNext) println(res.next)
cxn.close()
repo.shutDown()

Have you checked the way you filled the database? You might have characters that are getting encoded strangely, or it looks like you might have excess brackets in your objects.
From the print statement, your URI's are printing extra angled brackets. You are likely using:
val subject = valueFactory.createURI("<http://some.url/some/entity>")
when you should be doing this (without angled brackets):
val subject = valueFactory.createURI("http://some.url/some/entity")

Searching semantically tagged documents in MarkLogic

Can any one please point me to some simple examples of semantic tagging and querying semantically tagged documents in MarkLogic?
I am fairly new in this area,so some beginner level examples will do.

When you say "semantically tagged" do you mean regular XML documents that happen to have some triples in them? The discussion and examples at http://docs.marklogic.com/guide/semantics/embedded are pretty good for that.
Start by enabling the triple index in your database. Then insert a test doc. This is just XML, but the sem:triple element represents a semantic fact.
xdmp:document-insert(
'test.xml',
<test>
<source>AP Newswire</source>
<sem:triple date="1972-02-21" confidence="100">
<sem:subject>http://example.org/news/Nixon</sem:subject>
<sem:predicate>http://example.org/wentTo</sem:predicate>
<sem:object>China</sem:object>
</sem:triple>
</test>)
Then query it. The example query is pretty complicated. To understand what's going on I'd insert variations on that sample document, using different URIs instead of just test.xml, and see how the various query terms match up. Try using just the SPARQL component, without the extra cts query. Try cts:search with no SPARQL, just the cts:query.
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
at "/MarkLogic/semantics.xqy";
sem:sparql('
SELECT ?country
WHERE {
<http://example.org/news/Nixon> <http://example.org/wentTo> ?country
}
',
(),
(),
cts:and-query((
cts:path-range-query( "//sem:triple/#confidence", ">", 80) ,
cts:path-range-query( "//sem:triple/#date", "<", xs:date("1974-01-01")),
cts:or-query((
cts:element-value-query( xs:QName("source"), "AP Newswire"),
cts:element-value-query( xs:QName("source"), "BBC"))))))

In case you are talking about enriching your content using semantic technology, that is not directly provided by MarkLogic.
You can enrich your content externally, for instance by calling a public service like the one provided by OpenCalais, and then insert the enrichments to the content before insert.
You can also build lists of lookup values, and then using cts:highlight to mark such terms within your content. That could be as simple as:
let $labels := ("MarkLogic", "StackOverflow")
return
cts:highlight($doc, cts:word-query($labels), <b>{$cts:text}</b>)
Or with a more dynamic replacement using spraql:
let $labels := map:new()
let $_ :=
for $result in sem:sparql('
PREFIX demo: <http://www.marklogic.com/ontologies/demo#>
SELECT DISTINCT ?label
WHERE {
?s a demo:person.
{
?s demo:fullName ?label
} UNION {
?s demo:initialsName ?label
} UNION {
?s demo:email ?label
}
}
')
return
map:put($labels, map:get($result, 'label'), 'person')
return
cts:highlight($doc, cts:word-query(map:keys($labels)),
let $result := sem:sparql(concat('
PREFIX demo: <http://www.marklogic.com/ontologies/demo#>
SELECT DISTINCT ?s ?p
{
?s a demo:', map:get($labels, $cts:text), ' .
?s ?p "', $cts:text, '" .
}
'))
return
if (map:contains($labels, $cts:text))
then
element { xs:QName(fn:concat("demo:", map:get($labels, $cts:text))) } {
attribute subject { map:get($result, 's') },
attribute predicate { map:get($result, 'p') },
$cts:text
}
else ()
)
HTH!

ARQ Making a Query from Scratch

I have a problem in building the query from scratch syntactically or in algebra, based on
https://jena.apache.org/documentation/query/manipulating_sparql_using_arq.html
For example I have the below query
SELECT (count(?instance) AS ?count)
WHERE
{ ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://data.linkedmdb.org/resource/movie/film> }
(project (?count)
(extend ((?count ?.0))
(group () ((?.0 (count ?instance)))
(bgp (triple ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/resource/movie/film>)))))
Can any one direct me with a sample code of how to build the above query from scratch?
I have tried to build it syntactically but failing to know about how to alias the aggregation above.
If anybody can at least guide me in including aggregation with its aliasing name in projection it will be very great.

I don't typically construct queries through code, since I can just parse a query string, or use a parameterized SPARQL query, but here's a reconstruction of your query using the API. Most of the methods I used here I found by exploring the autocomplete options in Eclipse, and by looking at the Javadoc.
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.expr.ExprAggregator;
import com.hp.hpl.jena.sparql.expr.ExprVar;
import com.hp.hpl.jena.sparql.expr.aggregate.AggCountVar;
import com.hp.hpl.jena.sparql.syntax.ElementTriplesBlock;
import com.hp.hpl.jena.vocabulary.RDF;
public class QueryBuilding {
public static void main(String[] args) {
// Create the query and make it a SELECT query.
final Query query = QueryFactory.create();
query.setQuerySelectType();
// Set the projection expression.
final ExprVar instance = new ExprVar( "instance" );
query.getProject().add( Var.alloc( "count" ), new ExprAggregator( instance.asVar(), new AggCountVar( instance )));
// Construct the triples pattern and add it.
final ElementTriplesBlock triples = new ElementTriplesBlock();
final Node film = Node.createURI( "http://data.linkedmdb.org/resource/movie/film" );
triples.addTriple( new Triple( instance.getAsNode(), RDF.type.asNode(), film ));
query.setQueryPattern( triples );
// Show the query
System.out.println( query );
}
}
The output (i.e., the printed query) follows. It's the same as your query, modulo some whitespace location and newlines.
SELECT (count(?instance) AS ?count)
WHERE
{ ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/resource/movie/film> .}

Although the solution proposed by Joshua was very helpful to me and produces the correct String output, I have found that it contains a problem; the line :
query.getProject().add( Var.alloc( "count" ), new ExprAggregator( instance.asVar(), new AggCountVar( instance )));
Should be replaced by :
query.getProject().add( Var.alloc( "count" ), query.allocAggregate( new AggCountVar( instance ) ));
Otherwise if you execute the query against a Model you will get an exception "NotAVariableException: Node_variable (not a Var) found"

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Jena Text query performance slows down dramatically with large dataset - lucene

It seems this is a known issue, I am having problems with it too :( https://issues.apache.org/jira/browse/JENA-999

Related

How to filter distinct regex matches with SPARQL?

Does Jena TDB load all data into memory every time?

Simple SPARQL query does not return any results

Searching semantically tagged documents in MarkLogic

ARQ Making a Query from Scratch

Categories

Resources