Named Graphs and Federated SPARQL Endpoints

I recently came across the working draft for SPARQL 1.1 Federation Extensions and wondered whether this was already possible using Named Graphs (not to detract from the usefulness of the aforementioned draft).
My understanding of Named Graphs is a little hazy, save that the only thing I have gleaned from reading the specs comprises the rules around merger or non-merger in relation to other graphs at query time. Since this doesn't fully satisfy my understanding, my question is as follows:
Given the following query:
SELECT ?something
FROM NAMED <http://www.vw.co.uk/models/used>
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
...
}
Is it reasonable to assume that a query processor/endpoint could, or should, do the following in the context of the named graphs:
Check whether the named graph exists locally
If it doesn't, then perform the following operation (in the case of the above query, I will use the second named graph):
GET /sparql/?query=EncodedQuery HTTP/1.1
Host: www.autotrader.co.uk
User-agent: my-sparql-client/0.1
Where the EncodedQuery only includes the second named graph in the FROM NAMED clause, and the WHERE clause is amended accordingly with respect to GRAPH clauses (e.g. if a GRAPH <http://www.vw.co.uk/models/used> {...} pattern is being used); a sketch of such a rewritten query follows after this list.
Only if it can't perform the above should it do any of the following:
GET /cars/used HTTP/1.1
Host: www.autotrader.co.uk
or
LOAD <http://www.autotrader.co.uk/cars/used>
Return appropriate search results.
Obviously there might be some additional considerations around OFFSETs and LIMITs.
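For illustration only, the rewritten EncodedQuery sent to www.autotrader.co.uk might look roughly like this (the triple pattern inside the GRAPH block is just a placeholder, since I've elided my actual WHERE clause above):
SELECT ?something
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
  GRAPH <http://www.autotrader.co.uk/cars/used> {
    # placeholder pattern standing in for the elided WHERE clause
    ?car a ?something .
  }
}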
I also remember reading somewhere, a long time ago in a galaxy far, far away, that the default graph of any SPARQL endpoint should be a named graph according to the following convention:
For http://www.vw.co.uk/sparql/ there should be a named graph <http://www.vw.co.uk> that represents the default graph, and so by the above logic it should already be possible to federate SPARQL endpoints using named graphs.
The reason I ask is that I want to start promoting federation across the domains in the above example without having to wait around for the standard, while making sure that I won't do something that is out of kilter or incompatible with something else in the future.

Named graphs and the URLs used in federated queries (using SERVICE or FROM) are two different things. The latter point to SPARQL endpoints; named graphs live within a triple store and have the main function of separating different data sets. This, in turn, can be useful both to improve performance and to represent knowledge, such as recording the source of a set of statements.
For instance, you might have two data sources both stating that ?movie has-rating ?x, and you might want to know which source is stating which rating; in this case you can use two named graphs associated with the two sources (e.g., http://www.example.com/rotten-tomatoes and http://www.example.com/imdb). If you're storing both data sets in the same triple store, you will probably want to use NGs, and remote endpoints are a different thing. Furthermore, the URL of a named graph can be used with vocabularies like VoID to describe a dataset as a whole (e.g., the data set name, where and when the triples were imported from, who the maintainer is, the usage licence). This is another reason to partition your triple store into NGs.
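As a rough sketch of that idea (the has-rating property and its namespace are made up here purely for illustration), a query that reports which graph, i.e. which source, states which rating could look like:
PREFIX ex: <http://www.example.com/vocab/>

SELECT ?source ?movie ?rating
WHERE {
  GRAPH ?source {    # ?source binds to the named graph, e.g. .../rotten-tomatoes or .../imdb
    ?movie ex:has-rating ?rating .
  }
}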
That said, your mechanism to bind NGs to endpoint URLs might be implemented as an option, but I don't think it's a good idea to have it as mandatory, since managing remote endpoint URLs and NGs separately can be more useful.
Moreover, the real challenge in federated queries is to offer endpoint-transparent queries, making the query engine smart enough to analyse the query, understand how to split it and perform partial queries on the right endpoints (and join the results later, in an efficient way). There is a lot of research being done on that; one of the most significant results (as far as I know) is FedX, which has been used to implement several query distribution optimisations (example).
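For comparison, this is what addressing a remote endpoint explicitly looks like with the SPARQL 1.1 SERVICE keyword; a minimal sketch only, where the endpoint URL and the vocabulary are assumptions made up for the example:
PREFIX ex: <http://example.org/vocab/>

SELECT ?car ?price
WHERE {
  ?car a ex:UsedCar .                        # matched against the local store
  SERVICE <http://www.autotrader.co.uk/sparql> {
    ?car ex:price ?price .                   # matched against the remote endpoint
  }
}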
Last thing to add: I vaguely remember the convention that you mention, about $url and $url/sparql. There are a couple of approaches around (e.g., the LOD cloud). That said, in most of today's triple stores (e.g., Virtuoso), queries that don't specify a named graph (don't use GRAPH) don't simply fall back to a default-graph case: they actually query the union of all named graphs in the store, which is usually much more useful (when you don't know where something is stated, or you want to integrate cross-graph data).

Related

Best practice around GraphQL nesting depth

Is there an optimum maximum depth to nesting?
We are often presented with the option to try to represent complex hierarchical data models with the nesting they demonstrate in real life. In my work this is genetics, modelling protein / transcript / homology relationships where it is possible to have very deep nesting, up to maybe 7/8 levels. We use dataloader to make nested batching more efficient, and resolver-level caching with directives. Is it good practice to model a schema on a real-life data model, or should you focus on making your resolvers reasonable to query and keep nesting to a maximum ideal depth of, say, 4 levels?
When designing a schema, is it better to create a different parent resolver for a type, or to use arguments to direct a conditional response?
If I have two sets of, for example, 'cars' (let's say cars produced by Volvo and cars produced by Tesla), and the underlying data, while having similarities, is originally pulled from different APIs with different characteristics: is it best practice to have a tesla_cars and a volvo_cars resolver, or one cars resolver which uses, for example, a manufacturer argument to act differently on the data it returns and homogenise the response, especially where there may then be a sub-resolver that expects certain fields which may not be similar in the original data?
Or is it better to say that these two things are both cars, but the shape of the data we have for them is significantly different, so it's better to create separate resolvers with totally or notably different fields?
Should my resolvers and GraphQL APIs try to model the data they describe, or should I allow duplication in order to create efficient, application-focused queries and responses?
We often find ourselves wondering: should we have a separate API for applications x and y that may use the underlying data, and possibly even multiple sources (different databases or even API calls) inside resolvers, very differently, or should we try to make a resolver work with any application, even if that means using type-like arguments to allow custom filtering and conditional behaviour?
Is there an optimum maximum depth to nesting?
In general I'd say: don't restrict your schema. Your resolvers / data fetchers will only get called when the client requests the corresponding fields.
Look at it from this point of view: if your client needs the data from 8 levels of the hierarchy to work, then he will ask for it no matter what. With a restricted schema the client will have to execute multiple requests; with an unrestricted schema he can get all he needs in a single request. The amount of processing on your server side and the amount of data will still be the same, just split across multiple network requests.
The unrestricted schema has several benefits:
The client can decide if he wants all the data at once or use multiple requests
The server may be able to optimize the data fetching process (i.e. don't fetch duplicate data) when he knows everything the client wants to receive
The restricted schema on the other hand has only downsides.
When designing a schema is it better to create a different parent resolver for a type or use arguments to direct a conditional response
That's a matter of taste and what you want to achieve. But if you expect your application to grow and incorporate more car manufacturers, your API may become messy if there are lots of abc_cars and xyz_cars queries.
Another thing to keep in mind: Even if the shape of data is different, all cars have something in common: They are some kind of type Car. And all of them have for example a construction year. If you now want to be able to query "all cars sorted by construction year" you will need a single query endpoint.
You can have a single cars query endpoint in your API and then use interfaces to query different kinds of cars. Just like GraphQL Relay's node endpoint works: a single endpoint that can query all types that implement the Node interface.
On the other hand, if you've got a very specialized application, where your type is not extensible (like for example white and black chess pieces), then I think it's totally valid to have a white_pieces and black_pieces endpoint in your API.
Another thing to keep in mind: With a single endpoint some queries become extremely hard (or even impossible), like "sort white_pieces by value ascending, and black_pieces by value descending". This is much easier if there are separate endpoints for each color.
But even this is solvable if you have a single endpoint for all pieces, and simply call it twice.
Should my resolvers and GraphQL APIs try to model the data they describe or should I allow duplication in order to create efficient application-focused queries and responses?
That's a question of use case and scalability. If you have exactly two types of clients that use the API in different ways, just build two separate APIs. But if you expect your application to grow and get more different clients, then of course it will become an unmaintainable mess to have 20 APIs.
In this case have a look at schema directives. You can for example decorate your types and fields to make them behave differently for each client or even show/hide parts of your API depending on the client.
Summary:
Build your API with your clients in mind.
Keep things object oriented, make use of interfaces for similar types.
Don't provide endpoints your clients don't need; you can still extend your schema later if necessary.
Think of your data as a huge graph ;) that's what GraphQL is all about.

Basic questions about ontology manipulation

My questions are probably very basic, but they are fundamental for me since they put all puzzles together.
1) As I understand it, ontologies (*.owl) might be either "empty" (without data, i.e. without individuals) or they may involve both the relations between classes and linked data. Is that correct?
2) I downloaded the famous gene_ontology.owl, which seems to contain both data and meta-structure. How can I start creating SPARQL queries? The queries always specify endpoints, class names, etc., e.g. PREFIX dcore: <http://purl.org/dc/elements/1.1/>. Where do I get all these names for a particular ontology? Should I try to figure them all out using e.g. Protégé, or is there an "automatic" way to create queries?
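For instance, would the first step be to load the file locally (e.g. in Protégé or a triple store) and run a small exploratory query like the one below, just to discover which class and property IRIs the ontology actually uses, and derive the PREFIX declarations from those? (The query is only my guess at a starting point.)
SELECT DISTINCT ?class ?property
WHERE {
  ?s a ?class ;
     ?property ?o .
}
LIMIT 100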

How does the semantic web actually work?

I know that in the web you have lots and lots of pages linked to one another, and you can go from page to page and so on.
How does the semantic web work? I understand that it uses the concept of Linked Data, where data is identified and linked by URIs or IRIs and not the web pages themselves. But I don't understand how the data is linked across the web when all of the data is stored in local triplestores and linked internally within those triplestores. Are browsers capable of going from triplestore to triplestore behind the scenes and getting back all kinds of data? Or how is the data actually linked? Is there a mechanism to go from data to data all across the web and use the meaning of data in real-life situations, or a tool that does something like this?
Also anybody can create ontologies and define and describe anything in all kinds of different ways. Won't this lead to a big mess of data?
So, main question:
How do the semantic web and linked data actually work?
It's a tricky and multifaceted question.
First I'll answer some of your questions.
But I don't understand how the data is linked across the web when all of the data is stored in local triplestores and are linked internally in the triplestores
First of all, it is important to realize that triplestores are not a necessity. You could have SQL servers with a D2RQ/R2RML mapping on top to translate queries dynamically. Or plain RDF files. Or simple JSON documents in MongoDB, etc., which you extend by adding a JSON-LD @context.
What is important is that you serve data in one of the RDF formats, such as Turtle or JSON-LD.
Are browsers capable to go from triplestore to triplestore behind the scenes and get back all kinds of data?
See, they don't have to, because, as you mention, URIs are used so that a browser (not necessarily a web browser) can download the data. And of course these URIs are URLs and are dereferenceable. Otherwise they are just identifiers.
Or how is the data actually linked?
It is linked simply by reusing identifiers for objects and properties. That's why URIs (IRIs) are used: so that the identifiers are globally unique and can be created privately within a domain. Of course there is a risk of being mischievous by creating URIs in someone else's domain. That's a separate topic though.
Is there a mechanism to go from data to data all across the web and use the meaning of data in real life situations, or a tool that does something like this?
One simple mechanism is to simply crawl RDF data and download it to a local store. The simple occurrence of matching identifiers will combine the data into a larger dataset with less mapping effort required. That is of course the theory, because you risk the data being corrupt, incorrect or duplicated, so you need some curation. Technology exists to help you do that, and it's nothing you wouldn't experience in traditional data warehousing. Search engines harvest semantic markup from HTML pages (RDFa/Microdata) in a similar manner.
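For example, with SPARQL 1.1 Update you can pull remote RDF documents into a named graph of your local store (the URLs below are placeholders):
# load two remote documents into one local named graph
LOAD <http://example.org/people.rdf> INTO GRAPH <http://example.org/graphs/crawled> ;
LOAD <http://example.org/places.rdf> INTO GRAPH <http://example.org/graphs/crawled>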
Another option is to use federated queries. SPARQL has the ability to automatically download RDF data and perform queries over it in memory.
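As a sketch of that (the document URL is a placeholder): depending on the engine (Jena ARQ does this when the query is not executed against an existing local dataset), the FROM URL is dereferenced at query time, so a query like the following can be run directly over a remote RDF file:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name
FROM <http://example.org/people.ttl>
WHERE {
  ?person foaf:name ?name .
}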
Last but not least, there are federated queries using Triple Pattern Fragments.
Now about Semantic Web
As I wrote, the question is not that simple. You mostly ask about Linked Data. There is more to the Semantic Web than that:
ontologies/taxonomies
inferencing
rules
semantic/faceted search
I hope I answered your question to some extent.

semantic web + linked data integration

I'm new to the semantic web.
I'm trying to do a sample application where I can query data from different data sources in one query.
I have created a small RDF file which contains references to DBpedia resources for defining localities. My question is: how can I get the data contained in my file together with other information which is in the description of the remote resource (for example: the name of the person from the local file, and the total population of a city, dbpedia-owl:populationTotal, from the remote RDF file)?
I don't really understand the SPARQL query language; I tried to use the Jena ARQ API with the SERVICE keyword, but it doesn't solve the problem.
Any help please?
I guess you are looking for something like the Semantic Web Client Library, which tries to leverage the GGG. However, the standard exploration algorithm of this framework follows rdfs:seeAlso links. Nevertheless, the general approach seems to be what you are looking for, i.e., you would create a local graph that consists of your seed graph, traverse the relations up to a certain level (e.g., three steps), resolve the URIs and load that content into your local triple store. Utilising advanced technologies like SPARQL federation might be something for later ;)
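If you do want to experiment with SPARQL federation right away, a query combining your local file with DBpedia could look roughly like the sketch below; the foaf:name pattern and the ex:locatedIn property are assumptions standing in for whatever your local RDF file actually uses to link a person to a DBpedia locality:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX ex: <http://example.org/vocab/>

SELECT ?name ?population
WHERE {
  # local file: a person and the DBpedia locality they are linked to
  ?person foaf:name ?name ;
          ex:locatedIn ?city .
  # remote endpoint: fetch the population of that locality
  SERVICE <http://dbpedia.org/sparql> {
    ?city dbpedia-owl:populationTotal ?population .
  }
}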
I have retrieved data from two different sources using a SPARQL query with named graphs.
I used Jena ARQ to execute the SPARQL query.

REST API Best practices: Where to put parameters? [closed]

A REST API can have parameters in at least two ways:
As part of the URL-path (i.e. /api/resource/parametervalue )
As a query argument (i.e. /api/resource?parameter=value )
What is the best practice here? Are there any general guidelines when to use 1 and when to use 2?
Real world example: Twitter uses query parameters for specifying intervals. (http://api.twitter.com/1/statuses/home_timeline.json?since_id=12345&max_id=54321)
Would it be considered better design to put these parameters in the URL path?
If there are documented best practices, I have not found them yet. However, here are a few guidelines I use when determining where to put parameters in a URL:
Optional parameters tend to be easier to put in the query string.
If you want to return a 404 error when the parameter value does not correspond to an existing resource then I would tend towards a path segment parameter. e.g. /customer/232 where 232 is not a valid customer id.
If however you want to return an empty list when the parameter is not found, then I suggest using query string parameters. e.g. /contacts?name=dave
If a parameter affects an entire subtree of your URI space then use a path segment. e.g. a language parameter /en/document/foo.txt versus /document/foo.txt?language=en
I prefer unique identifiers to be in a path segment rather than a query parameter.
The official rules for URIs are found in this RFC spec here. There is also another very useful RFC spec here that defines rules for parameterizing URIs.
Late answer but I'll add some additional insight to what has been shared, namely that there are several types of "parameters" to a request, and you should take this into account.
Locators - E.g. resource identifiers such as IDs or action/view
Filters - E.g. parameters that provide a search, sorting, or otherwise narrow down the set of results.
State - E.g. session identification, api keys, whatevs.
Content - E.g. data to be stored.
Now let's look at the different places where these parameters could go.
Request headers & cookies
URL query string ("GET" vars)
URL paths
Body query string/multipart ("POST" vars)
Generally you want State to be set in headers or cookies, depending on what type of state information it is. I think we can all agree on this. Use custom http headers (X-My-Header) if you need to.
Similarly, Content only has one place to belong, which is in the request body, either as query strings or as http multipart and/or JSON content. This is consistent with what you receive from the server when it sends you content. So you shouldn't be rude and do it differently.
Locators such as "id=5" or "action=refresh" or "page=2" would make sense to have as a URL path, such as mysite.com/article/5/page=2 where partly you know what each part is supposed to mean (the basics such as article and 5 obviously mean get me the data of type article with id 5) and additional parameters are specified as part of the URI. They can be in the form of page=2, or page/2 if you know that after a certain point in the URI the "folders" are paired key-values.
Filters always go in the query string, because while they are a part of finding the right data, they are only there to return a subset or modification of what the Locators return alone. The search in mysite.com/article/?query=Obama (subset) is a filter, and so is /article/5?order=backwards (modification). Think about what it does, not just what it's called!
If "view" determines output format, then it is a filter (mysite.com/article/5?view=pdf) because it returns a modification of the found resource rather than homing in on which resource we want. If it instead decides which specific part of the article we get to see (mysite.com/article/5/view=summary) then it is a locator.
Remember, narrowing down a set of resources is filtering. Locating something specific within a resource is locating... duh. Subset filtering may return any number of results (even 0). Locating will always find that specific instance of something (if it exists). Modification filtering will return the same data as the locator, except modified (if such a modification is allowed).
Hope this helped give people some eureka moments if they've been lost about where to put stuff!
It depends on the design. There are no rules for URIs in REST over HTTP (the main thing is that they are unique). Often it comes down to a matter of taste and intuition...
I take the following approach:
url path element: the resource and its path element form a directory-like traversal to a subresource (e.g. /items/{id}, /users/items). When unsure, ask your colleagues: if they think of the traversal as "another directory", then a path element is most likely the right choice.
url parameter: when there is no real traversal (search resources with multiple query parameters are a very nice example of that)
IMO the parameters are better as query arguments. The URL is used to identify the resource, while the added query parameters specify which part of the resource you want, any state the resource should have, etc.
As per the REST Implementation,
1) Path variables are used for direct action on the resources, like a contact or a song,
e.g.
GET /api/resource/{songid} or
GET /api/resource/{contactid} will return the respective data.
2) Query params/arguments are used for indirect resources, like the metadata of a song,
e.g.,
GET /api/resource/{songid}?metadata=genres will return the genre data for that particular song.
"Pack" and POST your data against the "context" that universe-resource-locator provides, which means #1 for the sake of the locator.
Mind the limitations with #2. I prefer POSTs to #1.
note: limitations are discussed for
POST in Is there a max size for POST parameter content?
GET in Is there a limit to the length of a GET request? and Max size of URL parameters in _GET
p.s. these limits are based on the client capabilities (browser) and server (configuration).
According to the URI standard the path is for hierarchical parameters and the query is for non-hierarchical parameters. Of course, what counts as hierarchical can be very subjective.
In situations where multiple URIs are assigned to the same resource, I like to put the parameters necessary for identification into the path and the parameters necessary to build the representation into the query. (For me this way it is easier to route.)
For example:
/users/123 and /users/123?fields="name, age"
/users and /users?name="John"&age=30
For map reduce I like to use the following approaches:
/users?name="John"&age=30
/users/name:John/age:30
So it is really up to you (and your server side router) how you construct your URIs.
note: Just to mention, these parameters are query parameters. So what you are really doing is defining a simple query language. For complex queries (which contain operators like and, or, greater than, etc.) I suggest using an already existing query language. The capabilities of URI templates are very limited...
As a programmer often on the client-end, I prefer the query argument. Also, for me, it separates the URL path from the parameters, adds to clarity, and offers more extensibility. It also allows me to have separate logic between the URL/URI building and the parameter builder.
I do like what manuel aldana said about the other option if there's some sort of tree involved. I can see user-specific parts being treed off like that.
There are no hard and fast rules, but the rule of thumb from a purely conceptual standpoint that I like to use can briefly be summed up like this: a URI path (by definition) represents a resource, and query parameters are essentially modifiers on that resource. So far that likely doesn't help... With a REST API you have the major methods of acting upon a single resource using GET, PUT, and DELETE. Therefore whether something should be represented in the path or as a parameter can be reduced to whether those methods make sense for the representation in question. Would you reasonably PUT something at that path, and would it be semantically sound to do so? You could of course PUT something just about anywhere and bend the back-end to handle it, but you should be PUTing what amounts to a representation of the actual resource and not some needlessly contextualized version of it. For collections the same can be done with POST. If you wanted to add to a particular collection, what would be a URL that makes sense to POST to?
This still leaves some gray areas as some paths could point to what amount to children of parent resources which is somewhat discretionary and dependent on their use. The one hard line that this draws is that any type of transitive representation should be done using a query parameter, since it would not have an underlying resource.
In response to the real world example given in the original question (Twitter's API), the parameters represent a transitive query that filters on the state of the resources (rather than a hierarchy). In that particular example it would be entirely unreasonable to add to the collection represented by those constraints, and further that query would not be able to be represented as a path that would make any sense in the terms of an object graph.
The adoption of this type of resource oriented perspective can easily map directly to the object graph of your domain model and drive the logic of your API to the point where everything works very cleanly and in a fairly self-documenting way once it snaps into clarity. The concept can also be made clearer by stepping away from systems that use traditional URL routing mapped on to a normally ill-fitting data model (i.e. an RDBMS). Apache Sling would certainly be a good place to start. The concept of object traversal dispatch in a system like Zope also provides a clearer analog.
Here is my opinion.
Query params are used as meta data to a request. They act as filter or modifier to an existing resource call.
Example:
/calendar/2014-08-08/events
should give calendar events for that day.
If you want events for a specific category
/calendar/2014-08-08/events?category=appointments
or if you need events longer than 30 mins
/calendar/2014-08-08/events?duration=30
A litmus test would be to check whether the request can still be served without any query params.
I generally tend towards #2, As a query argument (i.e. /api/resource?parameter=value ).
A third option is to actually post the parameter=value in the body.
This is because it works better for multi parameter resources and is more extendable for future use.
No matter which one you pick, make sure you only pick one, don't mix and match. That leads towards a confusing API.
One "dimension" of this topic has been left out, yet it's very important: there are times when the "best practices" have to come to terms with the platform we are implementing or augmenting with REST capabilities.
Practical example:
Many web applications nowadays implement the MVC (Model, View, Controller) architecture. They assume a certain standard path is provided, even more so when those web applications come with an "Enable SEO URLs" option.
Just to mention a fairly famous web application: an OpenCart e-commerce shop.
When the admin enables the "SEO URLs" it expects said URLs to come in a quite standard MVC format like:
http://www.domain.tld/special-offers/list-all?limit=25
Where
special-offers is the MVC controller that shall process the URL (showing the special-offers page)
list-all is the controller's action or function name to call. (*)
limit=25 is an option, stating that 25 items will be shown per page.
(*) list-all is a fictitious function name I used for clarity. In reality, OpenCart and most MVC frameworks have a default, implied (and usually omitted in the URL) index function that gets called when the user wants a default action to be performed. So the real-world URL would be:
http://www.domain.tld/special-offers?limit=25
With a now fairly standard application or framework structure similar to the above, you'll often get a web server that is optimized for it and that rewrites URLs for it (the true "non-SEOed URL" would be: http://www.domain.tld/index.php?route=special-offers/list-all&limit=25).
Therefore you, as a developer, are faced with dealing with the existing infrastructure and adapting your "best practices", unless you are the system admin, know exactly how to tweak an Apache / NGinx rewrite configuration (the latter can be nasty!) and so on.
So, your REST API would often be much better following the referring web application's standards, both for consistency with it and ease / speed (and thus budget saving).
To get back to the practical example above, a consistent REST API would be something with URLs like:
http://www.domain.tld/api/special-offers-list?from=15&limit=25
or (non SEO URLs)
http://www.domain.tld/index.php?route=api/special-offers-list&from=15&limit=25
with a mix of "paths formed" arguments and "query formed" arguments.
I see a lot of REST APIs that don't handle parameters well. One example that comes up often is when the URI includes personally identifiable information.
http://software.danielwatrous.com/design-principles-for-rest-apis/
I think a corollary question is when a parameter shouldn't be a parameter at all, but should instead be moved to the HEADER or BODY of the request.
It's a very interesting question.
You can use both of them, there's not any strict rule about this subject, but using URI path variables has some advantages:
Cache:
Most of the web cache services on the internet don't cache GET requests when they contain query parameters.
They do that because there are a lot of RPC systems using GET requests to change data on the server (fail!! GET must be a safe method)
But if you use path variables, all of these services can cache your GET requests.
Hierarchy:
The path variables can represent hierarchy:
/City/Street/Place
It gives the user more information about the structure of the data.
But if your data doesn't have any hierarchical relation, you can still use path variables, using a comma or semicolon:
/City/longitude,latitude
As a rule, use a comma when the ordering of the parameters matters, and a semicolon when the ordering doesn't matter:
/IconGenerator/red;blue;green
Apart from those reasons, there are some cases when it's very common to use query string variables:
When you need the browser to automatically put HTML form variables into the URI
When you are dealing with an algorithm. For example, the Google search engine uses query strings:
http://www.google.com/search?q=rest
To sum up, there's not any strong reason to use one of these methods, but whenever you can, use URI variables.