How can I distinguish SQL triples from explicit triples?

I am using Template Driven Extraction to generate an SQL view and RDF triples from the same set of documents. The SQL view is used for quick inspection of the raw data, while the triples are used downstream to feed information to a knowledge graph.
I now need to extract the RDF triples into an external file, and I'm struggling with separating out those triples that back the SQL view. The documentation suggests that I should use fixed subjects or predicates in my SPARQL query, which is something I can't do because I don't know either of the two beforehand. I tried filtering out the SQL triples in XQuery, but I could not devise a way to detect whether a certain value returned by sem:sparql or a triple returned by cts:triples was one of SQL's or mine.
Any help on how to get a dump of all non-SQL triples out of MarkLogic would be appreciated.
Thanks,
Hans

Subjects from SQL views are not real sem:iri's (they are sql:rowID's), so you can use the following to exclude them:
FILTER( ISIRI(?subject) )
HTH!
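For example, a minimal sketch of a query that dumps only your own (non-view) triples could look like this; you could run it via sem:sparql or MarkLogic's SPARQL REST endpoint and serialize the result as Turtle or N-Triples:
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
  ?subject ?predicate ?object .
  FILTER( ISIRI(?subject) )
}
The FILTER drops every triple whose subject is a sql:rowID rather than a real IRI, which is exactly the distinction described above.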

You could try to use the function tde:node-data-extract.
It basically lets you see what your TDE templates extract from a given document.
While it may involve some work doing this for all documents and converting the output back into RDF, it should be possible.


What does SELECT FROM DEFAULT actually do?

SPARQL has a notion of a "default graph" that is queried when no graph context is specified, and which (depending on the triple store) may be the union of proper graphs available in a repository, or it may be a separate, "null graph"; so far so good.
But SPARQL also has a keyword DEFAULT that can be specified instead of a graph name, as in
SELECT *
FROM DEFAULT
WHERE { ... }
What does this command do? I can only interpret it as an explicit way to request the same thing that would happen when there's no FROM clause at all. But is this correct? I could find no documentation about it. And what about using it in update queries, or with CLEAR, COPY, etc.? Can anyone point to documentation of the meaning and intended use of this keyword, or at least shed some light on why it exists?
When you have one or more FROM or FROM NAMED statements in a query then the dataset for the query is composed of only those graphs. Per SPARQL 1.1 Query Specification Section 13.2:
The FROM and FROM NAMED keywords allow a query to specify an RDF dataset by reference; they indicate that the dataset should include graphs that are obtained from representations of the resources identified by the given IRIs (i.e. the absolute form of the given IRI references). The dataset resulting from a number of FROM and FROM NAMED clauses is:
a default graph consisting of the RDF merge of the graphs referred to in the FROM clauses, and
a set of (IRI, graph) pairs, one from each FROM NAMED clause.
If there is no FROM clause, but there is one or more FROM NAMED clauses, then the dataset includes an empty graph for the default graph.
So basically the presence of those clauses creates a query dataset that potentially hides some/all graphs in the underlying dataset. Your query operates over this query dataset.
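As an illustration (the graph IRIs are made up), the dataset of the following query has a default graph that is the merge of g1 and g2, and exactly one named graph, g3; any other graphs in the underlying dataset are invisible to it:
SELECT *
FROM <http://example.org/g1>
FROM <http://example.org/g2>
FROM NAMED <http://example.org/g3>
WHERE {
  { ?s ?p ?o }                    # matches the merge of g1 and g2
  UNION
  { GRAPH ?g { ?s ?p ?o } }       # matches only the named graph g3
}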
As noted in Andy's answer, FROM DEFAULT is a proposed future extension to the SPARQL language that would allow explicitly referring to the dataset's default graph (whatever that may be). Currently there's no standardised way to do this, so only queries that omit any FROM clauses can access the default graph, unless your service provides some non-standard way to refer to it, e.g. a custom URI for referencing the default graph.
For your specific example query:
SELECT *
FROM DEFAULT
WHERE { ... }
This would have the effect of forming a query dataset whose default graph is the service's default graph, with no named graphs visible, i.e. any GRAPH ?g { } clauses would not match in this query.
FROM DEFAULT is a feature that has been proposed for future work (sparql-1.2/issues/43).
The grammar covers both SPARQL Query and SPARQL Update because they share a considerable amount of the grammar. They have different entry points (QueryUnit and UpdateUnit).
The DEFAULT keyword appears in GraphOrDefault and GraphRefAll, both of which are only used in SPARQL Update.
ADD, MOVE and COPY use GraphOrDefault; CLEAR and DROP use GraphRefAll.
FROM is followed by either an iri, or NAMED iri.
Omitting FROM means the implicit default graph.
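For example, in SPARQL Update the keyword is used like this (the target graph IRI is just a placeholder):
# copy the default graph into a named graph, then empty the default graph
COPY DEFAULT TO <http://example.org/archive> ;
CLEAR DEFAULT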

Why do some SPARQL queries lack the FROM keyword?

I am using this client
http://yasgui.laurensrietveld.nl
and I hope to query bioportal http://bioportal.bioontology.org
Most of my prior queries had a PREFIX and no FROM part. Can I move any FROM URL into PREFIX?
Using YASGUI client, what is the difference between FROM and the Endpoint field?
Can I rewrite any query with a from statement into a query that does not have it?
I am not able to list for example details of Human Phenotype Ontology concept id: HP:0000023 because I am not sure what to put into FROM or if to use it at all.
There are a number of terms and mechanisms here. Let's go over them one by one.
First of all, a PREFIX clause is simply a declaration of a syntax shortcut, for use within your query. So this line:
PREFIX ex: <http://example.org/>
says that the string ex: is a shortcut for the string http://example.org/. If you have this prefix declared at the start of your query, you can use ex:someUrl (instead of <http://example.org/someUrl>) in other places in your query. It's simply there to make queries easier to read and write, but apart from that it has no influence on the meaning of your query.
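For example, assuming a property IRI <http://example.org/someUrl>, these two queries mean exactly the same thing to the endpoint:
PREFIX ex: <http://example.org/>
SELECT ?s WHERE { ?s ex:someUrl ?o }

SELECT ?s WHERE { ?s <http://example.org/someUrl> ?o }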
A SPARQL endpoint is another term for a web service that can answer SPARQL queries.
The FROM clause of a SPARQL query determines the dataset (or more precisely, the default graph, which is part of the dataset) over which the query is executed. Any SPARQL endpoint may contain several graphs, each identified by a URI (so-called named graphs). A collection of such graphs together is a dataset. If you don't specify a FROM clause (and perhaps also one or more FROM NAMED clauses), the dataset queried is simply whatever default dataset the endpoint chooses.
So, what does this mean for your specific questions?
Most of my prior queries had a PREFIX and no FROM part. Can I move any FROM URL into PREFIX?
As you can see from the above explanation, that would make no sense. They are different mechanisms, for different purposes, that just both happen to use URIs.
Using YASGUI client, what is the difference between FROM and the Endpoint field?
The endpoint field defines which service YASGUI needs to send the query to. The FROM clause tells the endpoint what dataset you want to query.
Can I rewrite any query with a from statement into a query that does not have it?
Not generally, no. The absence of a FROM clause means that the endpoint executes the query over its default dataset. Depending on how that endpoint is configured, this may mean that you either get a lot more results (namely not just from the one dataset you want, but from a lot of others) or none at all (in case the dataset you wanted to query is not part of the endpoint's default dataset).
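As a concrete sketch for your HP:0000023 example: assuming the endpoint's default dataset contains the Human Phenotype Ontology under its standard OBO PURL IRIs, you can simply omit FROM and ask for everything known about the concept:
SELECT ?property ?value
WHERE { <http://purl.obolibrary.org/obo/HP_0000023> ?property ?value }
If the endpoint only exposes the ontology as a named graph, add a FROM clause with whatever graph IRI the endpoint documents for it.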

How to query documents in MarkLogic and process results

I've been working off of the tutorial pages but seem to have a fundamental disconnect in my thinking transitioning off of RDBMS systems. I'm using MarkLogic and handling this database interaction through the Java API, focusing on search access via the POJO methods outlined in the tutorial documentation.
My reference up to this point has come from here principally: http://developer.marklogic.com/learn/java/processing-search-results
My scenario is this:
I have a series of documents. We'll call them 'books' for simplicity. I'm writing these books into my DB like this:
jsonDocMgr.write("/" + book.getID() + "/",
    new StringHandle(
        "{\"name\": \"" + book.getID() + "\"," +
        "\"chaps\": " + book.getNumChaps() + "," +
        "\"pages\": " + book.getNumPages() +
        "}"));
What I want is to execute the following type of operation:
Query all documents with the name "book*" (as the ID is represented by book0, book1, book2, etc.)
where chaps > 3. For these documents only, I want to halve the number of pages.
In an RDBMS, I'd use something like jdbcTemplate and get a result set for me to iterate through. For each iteration I'd know I was working with a single record (aka a book), parse the field values from the result set, make a note of the ID, then update the DB accordingly.
With MarkLogic, I'm awash in a sea of different handlers and managers...none of which seems to follow the pattern of the ResultSet with a cursor abstraction. Ultimately I want to do a two-step operation: check the chapter count, then update the pages field for that specific URI.
What's the most common approach to this? It seems like the most basic of operations...
Try the high-level Java API and see if it works for you. Create a multi-statement transaction with a query by example, then use document operations.
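A rough sketch of that approach using the MarkLogic Java Client API might look like the following; the host, port, credentials, connection style (recent client versions) and the use of the $gt operator (which needs a range index on chaps) are all assumptions you would adapt to your setup:
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.RawQueryByExampleDefinition;

public class HalveBookPages {
    public static void main(String[] args) {
        // Connection details are placeholders -- adjust to your environment.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("user", "password"));
        QueryManager queryMgr = client.newQueryManager();
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        // Query By Example: books with more than 3 chapters
        // ($gt assumes a range index on "chaps").
        String qbe = "{ \"$query\": { \"chaps\": { \"$gt\": 3 } } }";
        RawQueryByExampleDefinition query = queryMgr.newRawQueryByExampleDefinition(
                new StringHandle(qbe).withFormat(Format.JSON));

        // Iterate over the matches, roughly like a ResultSet cursor.
        // (Only the first page of results; use queryMgr.setPageLength(...) and a
        // start offset to page through larger result sets.)
        SearchHandle results = queryMgr.search(query, new SearchHandle());
        for (MatchDocumentSummary match : results.getMatchResults()) {
            String uri = match.getUri();
            JacksonHandle content = docMgr.read(uri, new JacksonHandle());
            ObjectNode book = (ObjectNode) content.get();
            book.put("pages", book.get("pages").asInt() / 2);   // halve the page count
            docMgr.write(uri, content);                          // write the updated doc back
        }
        client.release();
    }
}
If you need the read-check-write of each book to be atomic, you can open a multi-statement transaction with client.openTransaction() and pass it to the read and write calls.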
At a lower level, the closest match to a ResultSet is the ResultSequence class. The examples at http://docs.marklogic.com/javadoc/xcc/overview-summary.html are pretty good. For updates the interaction model between Java and MarkLogic is a bit different from JDBC and SQL. There is no SELECT... FOR UPDATE syntax.
The most efficient low-level technique is to select and update in one XQuery transaction, something like a stored procedure. However this requires good knowledge of XQuery. The other low-level approach is to use an XCC multi-statement transaction, which requires a little less knowledge of XQuery.
A minor issue in your code ... you definitely do NOT want to end your JSON document URIs with "/" as you do in your sample code. You should end them with ".json" or some other extension, or no extension at all, but definitely not "/", as that is treated specially in the server.

Different SPARQL query engines give differing results for DESCRIBE Query

I tried one SPARQL query in two different engines:
Protege 4.3 - SPARQL query tab
Jena 2.11.0
While the query is the same the results returned by these two tools are different.
I tried a DESCRIBE query like the following:
DESCRIBE ?x
WHERE { ?x :someproperty "somevalue"}
Results from Protege give me triples that have ?x as subject or object, while the ones from Jena have ?x as subject only.
My questions are:
Is the syntax of SPARQL uniform?
If I want DESCRIBE to work as in protege, what should I do in Jena?
To answer your first question: yes, the SPARQL syntax is uniform, since you've used the same query in both tools. However, what I think you are actually asking is whether the results from the two tools should be different or not, i.e. are the semantics of SPARQL uniform?
In the case of DESCRIBE, yes, the results are explicitly allowed to differ by the SPARQL specification, i.e. no, the semantics of SPARQL are not uniform, but this only applies to DESCRIBE.
See Section 16.4 DESCRIBE (Informative) of the SPARQL Specification which states the following:
The query pattern is used to create a result set. The DESCRIBE form takes each of the resources identified in a solution, together with any resources directly named by IRI, and assembles a single RDF graph by taking a "description" which can come from any information available including the target RDF Dataset. The description is determined by the query service.
The important part of this is the final sentence: the description is determined by the query service. This means that both Protege's and Jena's answers are correct, since they are allowed to choose how they form the description.
Changing Jena DESCRIBE handling
To answer the second part of your question you can change how Jena processes DESCRIBE queries by implementing a custom DescribeHandler and an associated DescribeHandlerFactory. You then need to register your factory like so:
DescribeHandlerRegistry.get().set(new YourDescribeHandlerFactory());
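A minimal sketch of such a handler for Jena 2.x is shown below; it adds both the triples where the resource is the subject and those where it is the object. The class names are made up, and depending on your setup you may need to fetch the data model from the Context rather than from the resource itself:
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.sparql.core.describe.DescribeHandler;
import com.hp.hpl.jena.sparql.core.describe.DescribeHandlerFactory;
import com.hp.hpl.jena.sparql.util.Context;

// Describes a resource by collecting triples where it appears as subject OR object.
public class SubjectObjectDescribeHandler implements DescribeHandler {
    private Model result;

    public void start(Model accumulateResultModel, Context qContext) {
        this.result = accumulateResultModel;
    }

    public void describe(Resource r) {
        Model data = r.getModel();   // the model the resource was resolved against
        result.add(data.listStatements(r, null, (RDFNode) null));   // ?x as subject
        result.add(data.listStatements(null, null, r));             // ?x as object
    }

    public void finish() {
        // nothing to clean up
    }
}

// In a separate file:
public class SubjectObjectDescribeHandlerFactory implements DescribeHandlerFactory {
    public DescribeHandler create() {
        return new SubjectObjectDescribeHandler();
    }
}
You would then register SubjectObjectDescribeHandlerFactory as shown above before executing the DESCRIBE query.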

RESTful API Design OR Predicates

I'm designing a RESTful API and I'm trying to work out how I could represent a predicate with an OR operator when querying for a resource.
For example if I had a resource Foo with a property Name, how would you search for all Foo resources with a name matching "Name1" OR "Name2"?
This is straightforward when it's an AND operator, as I could do the following:
http://www.website.com/Foo?Name=Name1&Age=19
The other approach I've seen is to post the search in the body.
You will need to pick your own approach, but I can name a few that seem pretty logical (although not without disadvantages):
Option 1: Using the | operator:
http://www.website.com/Foo?Name=Name1|Name2
Option 2: Using a modified query param that allows selection by any one of a comma-separated list of values (a server-side sketch of this option follows after this answer):
http://www.website.com/Foo?Name_in=Name1,Name2
Option 3: Using PHP-like notation to provide a list instead of a single string:
http://www.website.com/Foo?Name[]=Name1&Name[]=Name2
All of the above-mentioned options have one huge advantage: they do not interfere with other query params.
But as I mentioned, pick your own approach and be consistent about it across your API.
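For option 2, a server-side sketch (using JAX-RS purely as an example; the resource and parameter names are made up) could split the parameter like this:
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;

@Path("/Foo")
public class FooResource {

    // Handles e.g. GET /Foo?Name_in=Name1,Name2
    @GET
    @Produces("application/json")
    public List<String> search(@QueryParam("Name_in") String nameIn) {
        // Each comma-separated value becomes one OR branch of the filter.
        List<String> names = (nameIn == null || nameIn.isEmpty())
                ? Collections.<String>emptyList()
                : Arrays.asList(nameIn.split(","));
        // ...here you would build and run a query matching Foo resources whose Name is in `names`...
        return names;   // placeholder response for the sketch
    }
}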
Well, one quick way of fixing that is to add an additional parameter that identifies the relationship between your parameters, i.e. whether they're combined with an AND or an OR, for example:
http://www.website.com/Foo?Name=Name1&Age=19&or=true
Or, for much more complex queries, just keep a single parameter and include your whole query in it by making up your own little query language; on the server side you would parse the whole string and extract the individual conditions.