SPARQL Update minimal diff for RDF/XML? - sparql

My RDF/OWL ontology is versioned as an RDF/XML file in a git repository that I normally edit in a text editor, but I am planning a refactoring that would take too long manually and that is not possible with regular expressions alone.
Specifically, I want to split a generic property in two more specific ones based on the class of the object.
For example
:Alice :responsibleFor :ACME.
:Bob :responsibleFor :Cooking.
should become
:Alice :responsibleForCompany :ACME.
:Bob :responsibleForTask :Cooking.
I am interested in an answer for the general case as well, not just for this specific property refactoring.
My idea is to load the files into a Virtuoso Triple Store, use SPARQL Update queries to refactor the property and then export it back as RDF/XML file. The problem is that this won't keep the order and formatting, which will confuse git and make usage of the old history, such as undoing an old commit, impossible.
Is there a way to work directly with the file structure in order to produce a diff as minimal as possible?

I wouldn't worry much about the git history for undoing commits if you're going to use SPARQL update to make the changes; those update queries become your diffs. Some queries would be easy to invert to undo a change, but, if you have a base version of the ontology, applying all but the N most recent updates would effectively undo N commits.
This is a strategy we've been using for years and it works nicely.

Michael's answer is an excellent solution, but if you do wish to stick to using git history, I'd recommend that you switch to a different syntax format. RDF/XML, being XML (i.e. nested elements over multiple lines), is notoriously troublesome for line-by-line diffs, especially since the tool writing the XML can decide to completely rearrange blocks (there's no prescribed order for RDF/XML elements at the syntax level, and it's very hard to enforce anything like this).
Switch to a line-based syntax format, like N-Triples or N-Quads, and enforce a canonical ordering when exporting back from Virtuoso (should be possible by using a SPARQL query with an ORDER BY clause as the export mechanism).

Related

Incremental indexing for semantic search

I wonder if there are some standards or best practices, in performing an incremental indexation of a triple store for semantic search purpose.
Indeed to support semantic search one usually use solr or elasticsearch where resource are indexed according to some specific SPARQL query. While one can re-index its entire resources set once a day for instance, it is not that desirable. Hence comes the need to perform it incrementally. However that requires somehow to track changes, with the ultimate goat to be able to keep on indexing or deleting whatever has changed only.
For instance to only index what has change, the SPARQL query should include some timestamp filter somehow.
If anyone has some suggestions, or experience on performing it, that he would like to share this would be well apreciated
So far I am being somewhat inspired by EEA ElasticSearch RDF River Plugin. I'm also looking at the ontology Changeset Ontology.
The easiest way to accomplish this would be to get something involved in the transaction lifecycle. Then you're able to see the changes to the database which will give you the graph that needs to be indexed.
But don't dismiss doing a full re-index on a periodic schedule, such as nightly. Unless your requirement is that full-text searches must always be against the most recent data and your data changes quickly, a full re-index on a regular basis will work just fine.

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types. The existing database is a MySQL database, and is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also be relying pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building, once it's finished, converting back to hard coded queries) that would entail an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data, I know that join's are not a good idea for non-relational db's, but it would be nice to have the facility just in case, which can be re-written into two queries later for performance times)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me, if not, are there any resources on similar projects undertaken that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing for mistakes to occur
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
Gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
Like you already know, this is a hard problem. It will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that has a copy of all the data and then running queries against that? Replicating all writes to something like Redis or Elastic Search and then write your queries there?
Either way, good luck!

Tree data structure persistence in Ruby

I have a project where I need to build and store large trees of data in Ruby. I am considering different approaches for serialization, deserialization and querying of trees, and I am wondering what would be the best way to go. My major constraints are read time, query efficiency and and cross-version/cross-platform compatibility. The most frequent operation is to retrieve sets of nodes based on a combination of id/value and/or feature(s).Trees can be up to 15-20 levels deep. Moving subtrees is an uncommon procedure, but should be possible without too much black magic. Rails integration is not a primary concern. The options I thought about, along with some issues I'm concerned about, are the following:
Marshal the trees, and when needed load them into memory and query them in Ruby (inefficiency as tree grows, cross-version compatibility?)
Same as above, but use YAML (more cross-version compatible, but less efficient?)
Same as above, but use a custom XML parser (need to recreate objects from scratch each time the tree is loaded?)
Serialize the trees to XML, store them in an XML database (e.g. Sedna) and use XPath to query the trees (no experience with this approach, not sure about efficiency?)
Use adjacency lists to query trees stored in an schema-less database (inefficiency when counting descendants?)
Use materialized paths (potential of overfilling the max string length for deep trees?)
Use nested sets (complex SQL queries?)
Use the array of ancestors approach? Seems interesting in terms of querying efficiency according to the MongoDB page, but I haven't been able to find any serious discussion of this algorithm.
Based on your experience, which approach would better fit with the constraints I have described? If I go for an XML database, are there ones that would be more suited for this project? Are there other approaches I have overlooked that would be more efficient? Thanks for your time.
Trees work really well with graph databases, such as neo4j: http://neo4j.org/learn/
Neo4j is a graph database, storing data in the nodes and relationships of a graph. The most generic of data structures, a graph elegantly represents any kind of data, preserving the natural structure of the domain.
Ruby has a good interface for the trees:
https://github.com/andreasronge/neo4j
Pacer is a JRuby library that enables very expressive graph traversals. Pacer allows you to create, modify and traverse graphs using very fast and memory efficient stream processing. That also means that almost all processing is done in pure Java, so when it comes the usual Ruby expressiveness vs. speed problem, you can have your cake and eat it too, it's very fast!
https://github.com/pangloss/pacer
Neography is like the neo4j.rb gem and suggested by Ron in the comments (thanks Ron!)
https://github.com/maxdemarzi/neography
Since you are considering a SQL approach, here are some things to think about.
First, how big are the trees? For many applications, 10,000 leafs would seem big. Yet this is small for a database. On any decent database system (like a laptop), you should be able to store hunreds of thousands or millions of leafs in memory.
What a database buys you over other approaches is:
-- Not having to worry about memory/disk performance. When the data spills over to disk, you don't take a big hit on performance. By comparison, consider what happens when a hash table overflows memory.
-- Being able to add indexes to optimize performance.
-- Being able to alter your access path for the tree "just" by modifying SQL
One of the problems with standard SQL is that you can represent a tree node as a simple pair: , , . Then, with a simple join, you can move between parents and leafs. However, the joins accumulate as you move up the tree.
Sigh. Different databases have different solutions for this. SQL Server has recursive CTEs, which let you traverse the tree. Oracle has another approach for tree structures.
This starts to get complicated.
Perhaps a better approach is to assign a "leaf" id based on the hierarchy in the tree. So, if this is a binary tree, then "10011" would be the node at right branch, left branch, left branch, right branch, right branch. There you would store information . . . such as whether it has children and whatever else. Getting the parent is easy, because you can just truncate the last digit.
You can see how this would generalize to non-binary trees. Having any number of children could pose a little challenge.
I believe this may be related to the "array of ancestors" approach.
As I think about it, I think this would work pretty well. I would then suggest that you define separate stored procedures for each action that you want:
usp_tree_FetchNode (NodeId)
usp_tree_GetParent (NodeId)
usp_tree_NodeDelete (NodeId)
usp_tree_FetchSubTree (NodeId)
etc. etc. etc.
Although SQL does not really support object-oriented programming, you can still organize your code with clean naming conventions and good function wrappers.
I actually think this might work and provide a pretty good method for developing the code. One nice side effect is that you can analyze the tree outside the application, which might suggest future enhancements.
Have you looked at ancestry gem? I've used it for simple trees, but by the description it looks to fit on your requirements.

Dynamic RDF builder that takes an ontology and SQL result and builds tree

I've written a program that queries a large and messy sql database and then takes the resulting data and creates an RDF based on an ontology that was written by someone else and outputs a file of triples (using jena).
This works. But the problem is I have to do a lot of tweaking to the code should the ontology be changed in some way (it's still under a lot of scrutiny), and I have to tweak the code further whenever the query changes (the data I'm querying is old and not clean, and it's unclear if I'm hitting the right tables at times).
Is there a tool or a trick that might make my life easier?
Any suggestion would help.
Virtuoso has the notion of RDF views that reside on top of a RDBMS (http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSSQL2RDF). This applies to external databases interfaceable via ODBC/JDBC. D2R Server also does something similar (http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/)

Named Graphs and Federated SPARQL Endpoints

I recently came across the working draft for SPARQL 1.1 Federation Extensions and wondered whether this was already possible using Named Graphs (not to detract from the usefulness of the aforementioned draft).
My understanding of Named Graphs is a little hazy, save that the only thing I have gleamed from reading the specs comprises rules around the merger, non merger in relation to other graphs at query time. Since this doesn't fully satisfy my understanding, my question is as follows:
Given the following query:
SELECT ?something
FROM NAMED <http://www.vw.co.uk/models/used>
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
...
}
Is it reasonable to assume that a query processor/endpoint could or should in the context of the named graphs do the following:
Check is the named graph exists locally
If it doesn't then perform the following operation (in the case of the above query, I will use the second named graph)
GET /sparql/?query=EncodedQuery HTTP/1.1
Host: www.autotrader.co.uk
User-agent: my-sparql-client/0.1
Where the EncodedQuery only includes the second named graph in the FROM NAMED clause and the WHERE clause is amended accordingly with respect to GRAPH clauses (e.g if a GRAPH <http://www.vw.co.uk/models/used> {...} is being used).
Only if it can't perform the above, then do any of the following:
GET /cars/used HTTP/1.1
Host: www.autotrader.co.uk
or
LOAD <http://www.autotrader.co.uk/cars/used>
Return appropriate search results.
Obviously there might be some additional considerations around OFFSET's and LIMIT's
I also remember reading somewhere a long time ago in galaxy far far away, that the default graph of any SPARQL endpoint should be a named graph according to the following convention:
For: http://www.vw.co.uk/sparql/ there should be a named graph of: http://www.vw.co.uk that represents the default graph and so by the above logic, it should already be possible to federate SPARQL endpoints using named graphs.
The reason I ask is that I want to start promoting federation across the domains in the above example, without having to wait around for the standard, making sure that I won't do something that is out of kilter or incompatible with something else in the future.
Named graph and URLs used in federated queries (using SERVICE or FROM) are two different things. The latter point to SPARQL endpoints, the named graphs are within a triple store and have the main function of separating different data sets. This, in turn, can be useful to both improve performance and represent knowledge, such as representing what is the source of a set of statements.
For instance, you might have two data sources both stating that ?movie has-rating ?x and you might want to know which source is stating which rating, in this case you can use two named graphs associated to the two sources (e.g., http://www.example.com/rotten-tomatoes and http://www.example.com/imdb). If you're storing both data sets in the same triple store, probably you will want to use NGs, and remote endpoints are a different thing. Furthermore, the URL of a named graph can be used with vocabularies like VoID to describe a dataset as a whole (eg, the data set name, where and when the triples are imported from, who is the maintainer, user licence). This is another reason to partition your triple store into NGs.
That said, your mechanism to bind NGs to endpoint URLs might be implemented as an option, but I don't think it's a good idea to have it as mandatory, since managing remote endpoint URLs and NGs separately can be more useful.
Moreover, the real challenge in federated queries is to offer endpoint-transparent queries, making the query engine smart enough to analyse the query and understand how to split it and perform partial queries on the right endpoints (and join the results later, in an efficient way). There is a lot of research being done on that, one of the most significant results (as far as I know) is FedX, which has been used to implement several query distribution optimisations (example).
Last thing to add, I vaguely remember the convention that you mention about $url, $url/sparql. There are a couple of approaches around (e.g., LOD cloud). That said, in most nowadays triple stores (e.g., Virtuoso), queries that don't specify a named graph (don't use GRAPH) work in a way different than falling into a default graph case, they actually query the union of all named graphs in the store, which is usually much more useful (when you don't know where something is stated, or you want to integrate cross-graph data).