Migrating code from Jena 2 to 3 - semantic-web

Which class has replaced com.hp.hpl.jena.db.IDBConnection in Jena 3.x.x?
I have tried to use org.apache.jena.db, but it does not seem to exist.

There is no direct replacement for the old com.hp.hpl.jena.db RDB layer; it has been removed. The provided database storage systems are:
TDB - a custom (non-SQL) triple storage layer.
SDB - an SQL-backed triple storage layer.
TDB is preferred; it scales better and is faster. SDB is for when you really must have an SQL-based solution.
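For concreteness, here is a minimal, illustrative sketch of opening a TDB-backed dataset with the Jena 3 API (this assumes the jena-tdb artifact is on the classpath; the directory path and resource IRIs are hypothetical):

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb.TDBFactory;

public class TdbExample {
    public static void main(String[] args) {
        // Opens (or creates) a TDB store in the given directory.
        Dataset dataset = TDBFactory.createDataset("/path/to/tdb-store");
        dataset.begin(ReadWrite.WRITE);
        try {
            Model model = dataset.getDefaultModel();
            model.createResource("http://example.org/book1")
                 .addProperty(model.createProperty("http://example.org/title"), "Example");
            dataset.commit();
        } finally {
            dataset.end();
        }
    }
}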

Implementations of (fully) dynamic connectivity data structures

The dynamic connectivity problem for graphs consists in maintaining a graph data structure that allows for adding and deleting edges of the graph.
Moreover, the data structure should support connectivity queries.
Typically, such a query is of the form "Are the nodes u and v connected in the graph?"
There are variants of the dynamic connectivity problem that also support different connectivity queries like 2-edge-connectivity or biconnectivity.
My question is: Are there existing efficient implementations of dynamic connectivity data structures?
By efficient I mean data structures with low amortized operation costs.
In particular, I am NOT interested in trivial implementations with a complexity of O(n) per operation!
Below I describe in more detail what I am looking for and what I already know.
If only edge insertions are allowed, the dynamic connectivity problem can be solved by the well-known disjoint-set (a.k.a. union-find) data structure.
For this data structure there are implementations available in many different programming languages.
Unfortunately, this does not seem to be the case for the dynamic connectivity problem that also allows edge deletions.
The situation is even worse for data structures that also allow other connectivity queries like 2-edge- or biconnectivity.
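For the insertions-only case, a minimal Java sketch of the disjoint-set structure mentioned above (path compression plus union by rank, giving near-constant amortized cost per operation) looks like this; it supports adding edges and connectivity queries, but not deletions:

final class UnionFind {
    private final int[] parent;
    private final int[] rank;

    UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]); // path compression
        return parent[x];
    }

    void union(int a, int b) { // models "add edge (a, b)"
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) { int tmp = ra; ra = rb; rb = tmp; }
        parent[rb] = ra; // union by rank
        if (rank[ra] == rank[rb]) rank[ra]++;
    }

    boolean connected(int a, int b) { // "are a and b in the same component?"
        return find(a) == find(b);
    }
}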
To the best of my knowledge, the algorithms presented in Holm et al. (2001) are still state of the art for many dynamic connectivity problems.
This publication was accompanied by an experimental study; however, as far as I can tell, the code was never made publicly available. Also, that study only discusses implementations for the regular connectivity problem, not for 2-edge-connectivity or biconnectivity.
The algorithms by Holm et al. (and also by other authors) are highly non-trivial.
Even though the algorithms are described in much detail, it requires a lot of expertise to implement them in practice.
Because of this, I am looking for existing implementations of different dynamic connectivity data structures.
The table below summarizes the (currently underwhelming) implementations of different combinations of supported manipulations and queries.
Graph Manipulations                 | Connectivity  | 2-edge-connectivity | Biconnectivity
incremental (adding edges)          | disjoint-set  | -                   | -
decremental (deleting edges)        | Rafael Glikis | -                   | -
fully (adding and deleting edges)   | -             | -                   | -
I have searched for implementations in different places: I have looked on GitHub, I have looked through the external links in the relevant Wikipedia articles, and I have skimmed through a lot of literature, all without success.
I expect we will need a framework for trying things out so that we can discuss this in concrete terms.
I have implemented a small Windows application that accepts user queries to read, build, edit, and query the connectivity of a graph, showing the time taken to execute each.
Sample run:
Supported queries
add v1 v2 : add link to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.539246 0.539246 query
type query> delete 23 20
4720 vertices 27443 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.004432 0.004432 query
type query> add 23 20
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.0046639 0.0046639 query
The complete application is at https://github.com/JamesBremner/graphConnectivity
To demonstrate how this application can be used, I built it with the graph engine at https://github.com/JamesBremner/PathFinderFeb2023 and ran it on a couple of the test datasets from https://dyngraphlab.github.io/
dataset             | edge count | delete | add
3elt.graph.seq.txt  | 27,443     | 5ms    | 5ms
144.graph.seq.txt   | 2,148,787  | 13ms   | 13ms
To get the average time to perform multiple queries, use the random command, like this:
Supported queries
add v1 v2 : add link to graph
add random n : add n random links to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
type query> add random 10
4720 vertices 27454 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
10 1.62e-06 1.62e-05 randomAdd

Virtuoso SPARQL not retrieving values

I am uploading an RDF file to a Virtuoso repository via the graphical interface (ODS-Briefcase). The file is uploaded successfully. However, every time I make a SPARQL query, an empty result is returned.
I have tried with many other files, and I did not have this problem.
This file is bigger than the previous ones (14 MB), so I guess this could be the cause, but I am not sure about it.
Any help in this matter will be appreciated :)
UPDATE
I have tried uploading a smaller file (2KB) and the SPARQL returns results as expected.
However, when I uploaded the (14 MB) file again, it seems it was not uploaded correctly.
When I try to read it from the ODS-Briefcase of Virtuoso, this happens:
To solve problems like this you have to fundamentally understand the task you are performing and how it is interpreted by Virtuoso.
Task at hand:
Loading an RDF document into Virtuoso's WebDAV repository (for which ODS-Briefcase provides a front-end), in a manner that results in the content of said RDF document being loaded into the Quad Store (where RDF data is indexed and made available to SPARQL Queries etc.).
How you achieve your goal:
Use the ODS-Briefcase UI to create a DET folder (a folder that provides an automatic conduit between WebDAV storage and the Virtuoso Quad Store) of type: Linked Data Import. One of the attributes (characteristics) of this kind of folder is a Named Graph IRI and a Named Graph IRI Base.
With your Linked Data Import DET folder in place, you simply upload RDF documents to the newly created folder.
To verify the existence of RDF language statements imported from an RDF document placed in this folder, simply execute one of the following:
SELECT COUNT(*)
FROM <target-named-graph-iri>
WHERE {?s ?p ?o}
OR
SELECT DISTINCT *
FROM <target-named-graph-iri>
WHERE {?s ?p ?o}
You can also leverage Virtuoso's in-built RDF data import middleware (a/k/a Sponger) within a SPARQL query using the pattern:
DEFINE get:soft "replace"
SELECT DISTINCT *
FROM <rdf-document-uri>
WHERE {?s ?p ?o}
I hope this brings clarity to the options available for importing RDF document content into the Virtuoso Quad Store (the engine that manages data represented as RDF property/predicate graphs).
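If you prefer to script this check, here is a minimal sketch using Apache Jena ARQ against the Virtuoso SPARQL endpoint; the endpoint URL and the target graph IRI below are assumptions, so substitute your own values:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class QuadStoreCheck {
    public static void main(String[] args) {
        // Hypothetical endpoint and graph IRI; replace with your own.
        String endpoint = "http://localhost:8890/sparql";
        String query =
            "SELECT (COUNT(*) AS ?triples) " +
            "FROM <http://example.org/target-named-graph> " +
            "WHERE { ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}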
It sounds like you have loaded your files into the Virtuoso WebDAV (file) repository, but you may have not loaded the RDF therein into the Virtuoso (RDF) Quad Store.
See this guide to the bulk loader, and this page of RDF loading methods.
(ObDisclaimer: I work for OpenLink Software, producer of Virtuoso.)

How to traverse & query faster TitanGraph with Cassandra as a backend?

I have millions of nodes stored in Titan 1.0.0 with Cassandra 2.2.4. I want to retrieve the graph from Cassandra and query or traverse it quickly.
If I build indexes in code,
mgmt.buildIndex("nameSearchIndex", Vertex.class).addKey(namep, Mapping.TEXT.asParameter()).buildMixedIndex("search");
mgmt.buildIndex("addressSearchIndex", Vertex.class).addKey(addressp, Mapping.TEXT.asParameter()).buildMixedIndex("search");
querying still seems to be slow.
When I use
g.traversal().V().count()
it still gives the warning "please use indexes", even though I have already built indexes in the code. Is there any specific configuration to forcefully activate indexes? How do I query the graph using indexes?
g.traversal().V().has("Name","Jason")
Does this query use indexes? If not, how do I make use of indexes to query faster?
Can Spark be used for fast traversal? How do I use SparkGraphComputer for this? I am not able to find the configuration for CassandraInputFormat with Spark.
Thanks.
There are lots of questions bundled up in this question.
The indexes you're making are mixed indexes, which are implemented by an external indexing system, such as Solr or Elasticsearch. These would help if you are looking for a vertex with a certain name, such as in your .has("Name", "Jason") example.
To find out if an index is being used, I suggest looking into the profile() step in Gremlin. You can read about it here.
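For example, here is a one-line sketch, assuming a traversal source g as in the console snippets below; on Titan 1.0 (TinkerPop 3.0) you may need to append .cap(TraversalMetrics.METRICS_KEY) to read the metrics:

// Prints per-step metrics: an index-backed has() appears as a narrow lookup
// rather than a full graph scan.
System.out.println(g.V().has("name", "jason").profile().next());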
Spark is meant to be used for traversals that need to potentially load a graph that is bigger than one machine can hold. What use case is .V().count() important for?
This answer was cross-posted on the Titan mailing list.
Indexing is useful for doing fast traversals, but ultimately "fast queries" depends on many factors, including your graph model/volume/shape and the types of questions you are trying to answer.
Read Chapter 8 "Indexing for better Performance" in the Titan docs, and digest the differences between the different types: Composite, Mixed, and Vertex-centric.
Based on the example query you posted, and as Daniel noted, it looks to me like an exact match type of query, so I would start with a Composite index. You can cut and paste this to try it out in the Titan Console.
graph = TitanFactory.open('inmemory')
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name).buildCompositeIndex()
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear
If after reading the Composite vs Mixed Index section you decide that a Mixed index (backed by Elasticsearch, Solr, or Lucene) is what you really need, read Chapter 20 "Index Parameters and Full-Text Search", and digest the differences between the mappings TEXT, STRING, and TEXTSTRING.
Here's an example that uses a STRING mixed index
graph = TitanFactory.build().set('storage.backend','inmemory').set('index.search.backend','elasticsearch').open()
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name, Mapping.STRING.asParameter()).buildMixedIndex("search")
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear

SPARQL 1.1 entailment regimes and query with FROM clause

I'm currently documenting/testing SPARQL 1.1 entailment regimes, and the recommendation repeatedly states that
The scoping graph is graph-equivalent to the active graph
but it does not specify what the active graph refers to: is it the dataset used in the query? A union of all graphs in the store?
As a test to determine this, I loaded the following graph, named <http://www.example.org/>, into a Sesame memory store with RDF Schema and direct type inferencing (v2.7.14):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://www.example.org/> .
ex:book1 rdf:type ex:Publication .
ex:book2 rdf:type ex:Article .
ex:Article rdfs:subClassOf ex:Publication .
ex:publishes rdfs:range ex:Publication .
ex:MITPress ex:publishes ex:book3 .
I've been trying the following query (which uses the default graph, and thus the inference engine):
SELECT ?s WHERE { ?s a ex:Publication . }
As expected, it returns me all three instances
<http://www.example.org/book1>
<http://www.example.org/book2>
<http://www.example.org/book3>
while the query:
SELECT ?s FROM ex: WHERE { ?s a ex:Publication . }
returns only
<http://www.example.org/book1>
Under the said circumstances, shouldn't both results be the same?
What should happen (according to the recommendation) if data and schema are split between two graphs in the store (like <urn:rdfs-schema> and <urn:data>, or even scattered across more graphs) and the query uses both graphs (or a subset of the schema-related graphs) in the FROM clause instead of the default graph?
That is, should the inferencing be global throughout the store, or does it depend on the query dataset?
Or is the recommendation loose enough to make this an implementation-dependent issue?
Thanks for your insights,
Max.
EDIT: this question has been redirected to SPARQL 1.1 entailment regimes and query with FROM clause (follow-up).
Your second query returns only book1 because, in Sesame's RDFS inferencer, entailed statements are inserted into the default graph, not into the named graph(s) from which the premises for the entailment come. So the entailed results are simply not present in the graph you are querying.
The reason for this design choice is at least partly historic, as the Sesame RDFS inference engine predates the W3C notion of entailment regimes. The rationale at the time was that in the case of inferencing over several named graphs (where e.g. one premise comes from graph A and another from B), insertion into the default graph (rather than into either A, or B, or both) was simplest and caused the least amount of confusion.
Sesame currently does not explicitly support the W3C entailment regimes specification. However, if you feel that a simple improvement would be possible to make it more compatible, by all means log a feature request.
(disclosure: Sesame developer)
What exactly is in the default graph isn't specified by the SPARQL 1.1 standard. In particular, see 13.1 Examples of RDF Datasets which mentions that:
The definition of RDF Dataset does not restrict the relationships of named and default graphs. Information can be repeated in different graphs; relationships between graphs can be exposed. Two useful arrangements are:
to have information in the default graph that includes provenance information about the named graphs
to include the information in the named graphs in the default graph as well.
However, you can use a FROM clause to specify which graph should be the default graph, or multiple FROM clauses to specify which graphs should be merged to form the default graph.
That's all concerning the default graph. The active graph is another term you'll see in the SPARQL 1.1 spec:
The graph that is used for matching a basic graph pattern is the active graph. In the previous sections, all queries have been shown executed against a single graph, the default graph of an RDF dataset as the active graph. The GRAPH keyword is used to make the active graph one of all of the named graphs in the dataset for part of the query.
So, you can use FROM (possibly multiple times) to control the default graph, and thereby the initial active graph, and then GRAPH { … } within a query to change the active graph.

Specification Pattern vs Specific Hibernate Query

My question is when to use the specification pattern and when to use a specific SQL query.
I understand that the specification pattern requires loading the whole collection and post-filtering it with concrete specifications, but I don't see its advantage over a specific SQL query.
CarColorSpecification cc = new CarColorSpecification(RED);
CarAgeSpecification ca = new CarAgeSpecification(OLDER, 5);
ISpecification<Car> finalSpec = cc.and(ca);

List<Car> res = new ArrayList<>();
List<Car> carColl = service.getCars();
for (Car c : carColl) {
    if (finalSpec.isSatisfiedBy(c)) {
        res.add(c);
    }
}
And the same in SQL / Hibernate
FROM Car c WHERE c.color = RED AND c.age > 5
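For comparison, here is a hedged sketch of executing that filter through a Hibernate Session (assuming a recent Session API, a mapped Car entity, and a Color enum; the names are illustrative):

// Pushes the filtering into the database instead of iterating in memory.
List<Car> redOldCars = session
        .createQuery("from Car c where c.color = :color and c.age > :age", Car.class)
        .setParameter("color", Color.RED)
        .setParameter("age", 5)
        .getResultList();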
I think it depends on the volume of data to process.
The SQL version will run quickly provided the table is appropriately indexed for the columns in question and its size, and it will transfer a smaller volume of data between the DB server and the app server if they're different. It may, however, impose a higher load on the SQL box in terms of CPU and disk I/O usage, and in many environments, the DB server is the most expensive component to scale.
So yes, it depends a great deal on the size of the data.
The Repository is used to abstract the persistence implementation away from your domain classes. In short, SQL/HQL should only exist within your repositories.
If you're dealing with high data volumes, create a new method on your Repository and call that method from your Specification.
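Here is a sketch of that arrangement, reusing the ISpecification and CarColorSpecification names from the question; the CarRepository type and findByColor method are hypothetical names added for illustration:

import java.util.List;

interface ISpecification<T> {
    boolean isSatisfiedBy(T candidate);
}

interface CarRepository {
    List<Car> findByColor(Color color); // implemented with HQL/SQL inside the repository
}

class CarColorSpecification implements ISpecification<Car> {
    private final Color color;

    CarColorSpecification(Color color) {
        this.color = color;
    }

    @Override
    public boolean isSatisfiedBy(Car car) {
        // In-memory check, fine for small collections and unit tests.
        return car.getColor() == color;
    }

    // For large data volumes, delegate the filtering to the repository/database
    // instead of iterating over the whole collection in memory.
    public List<Car> satisfyingElementsFrom(CarRepository repository) {
        return repository.findByColor(color);
    }
}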
I think a good compromise would be to combine the specification pattern with Hibernate so that specifications generate HQL queries. (Maybe LINQ for Java ^^ ?)