Can I set up multiple triplestores in Virtuoso in the same way I create multiple databases in, for example, a conventional MySQL DBMS? Each database would be independent, with (possibly) its own SPARQL endpoint.
Yes, you can, at least as far as I understood your question.
You can add additional datasets to the Virtuoso triple store under a new graph, which you then use in the FROM clause of your queries to indicate the named graph your results should come from:
create graph <http://myNewAndShinyGraph.org/some/path>;
Now you can add/upload your dataset into the triple store under the new context you created (as usual, via SPARQL INSERT, TTLP, or ld_dir...).
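For example, a minimal insert into that graph and a query scoped to it might look like this (the book triple is just an illustration):

# Insert a sample triple (hypothetical data) into the new named graph
INSERT DATA {
  GRAPH <http://myNewAndShinyGraph.org/some/path> {
    <http://example.org/book1> <http://purl.org/dc/terms/title> "A new book" .
  }
}

# Read it back, scoping the query to that graph via FROM
SELECT ?s ?p ?o
FROM <http://myNewAndShinyGraph.org/some/path>
WHERE { ?s ?p ?o }
LIMIT 10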
You can also expose this graph with a different SPARQL endpoint.
Follow the steps described by Hugh Williams here: Defining endpoints in Virtuoso
Also of interest: How to create a SPARQL endpoint using Virtuoso?
Your question is extremely broad and difficult to answer both concisely and usefully. The short answer is "Yes," but that seems less than useful.
Virtuoso (produced by my employer, OpenLink Software) is a "conventional" SQL-style DBMS, akin to MySQL, PostgreSQL, Oracle, SQL Server, etc., though being a hybrid engine, it is also a NoSQL, graph/RDF, XML, object, and various-other-style DBMS. In the graph/RDF realm, it is actually a Quad Store, which allows for use as either a simple triplestore or a collection of Named Graphs, where each might be considered a separate triplestore...
One Virtuoso DB file may contain multiple SQL-style CATALOGS, as well as multiple Named Graphs and other divisions of RDF/graph data, for which you can set up distinct SPARQL endpoints; or you can set up distinct DB files (and Virtuoso instances), each with one database/data set. There may be other options appropriate to your needs...
Virtuoso-specific questions are often better raised in Virtuoso-specific areas, such as the public Virtuoso Users mailing list, the public OpenLink Support Forums, a confidential OpenLink Support Case, etc.
So, I need to run a SPARQL query over a semantic database, but some of the triples are not going to be in the database; they are going to be provided by webservices (and not as a SPARQL endpoint). I want to be able to run a SELECT query that takes those additional triples into consideration, but without having to insert them into the database. Is there a way to do that?
This is not part of the SPARQL spec, so "no" is the general answer.
That said, Virtuoso (possibly among others) lets you include an external RDF source (a/k/a webservice) as part of the FROM (among other methods), to be dereferenced during SPARQL query processing.
Such a webservice need not be a SPARQL endpoint, but best performance will result if it provides RDF (though the serialization may vary).
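As a sketch (this is a Virtuoso-specific pragma, not standard SPARQL, and the service URL is hypothetical), the get:soft pragma asks Virtuoso to dereference the FROM IRI at query time:

DEFINE get:soft "soft"
# "soft" fetches the IRI only if it is not already loaded in the store
SELECT ?s ?p ?o
FROM <http://example.org/webservice/data.ttl>
WHERE { ?s ?p ?o }
LIMIT 10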
The Virtuoso Sponger can also be invoked on the fly to derive RDF from many document formats (with an obvious performance hit). To pursue this, please raise it on the OpenLink Community Forum.
We are moving toward the Semantic Web; today we have the LOD cloud.
Every data set has its own SPARQL endpoint, and I can query that dataset's triples.
How can I query the whole Semantic Web, or the whole LOD cloud?
No, there is no such single SPARQL endpoint, because the Semantic Web is decentralized by design. However, SPARQL 1.1 supports federated queries over different SPARQL endpoints using the SERVICE keyword (a minimal example follows the reference below). See https://www.w3.org/TR/sparql11-federated-query/ for reference. More specifically, the literature discusses how to determine which data sources might be relevant for query answering at Internet scale:
Hartig O., Bizer C., Freytag J.C. (2009) Executing SPARQL queries over the Web of Linked Data. In: Bernstein A. et al. (eds.) The Semantic Web – ISWC 2009. ISWC 2009. Lecture Notes in Computer Science, vol. 5823, pp. 293–309. Heidelberg: Springer. doi: 10.1007/978-3-642-04930-9_19
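To illustrate the SERVICE keyword mentioned above, a minimal federated query might look like this (DBpedia is just one example endpoint):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Ask a remote endpoint for a handful of labelled cities
SELECT ?city ?label
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?city a dbo:City ;
          rdfs:label ?label .
    FILTER (lang(?label) = "en")
  }
}
LIMIT 10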
There exists a W3C-owned and (un-?)maintained wiki page with ~60 SPARQL endpoints. Many "last accessed/checked" entries are from 2010. That page links to http://sparqles.ai.wu.ac.at/availability, which lists more endpoints and is kept much more up-to-date.
Read the second section, titled "SPARQL Endpoints", of the blog post Querying DBpedia with GraphQL for a skeptical view of the state of SPARQL today. I cannot say it any better myself.
Also note that SPARQL permits every endpoint to offer any number of named GRAPH constructs that can be queried at that endpoint, so that is one more feature to consider.
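For instance, a query scoped to a single named graph at some endpoint (the graph IRI here is hypothetical):

SELECT ?s ?p ?o
WHERE {
  GRAPH <http://example.org/dataset1> { ?s ?p ?o }
}
LIMIT 10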
There is no central point regarding the notion of a Semantic Web of Linked Data. Instead, like any Super Information Highway, you have major concentration points (hubs or junctions) that enable you to discover routes to a variety of destinations.
Major Semantic Web of Linked Data hubs that we oversee at OpenLink Software include:
DBpedia
DBpedia-Live
URIBurner
LOD Cloud Cache
Remember, the fundamental principle behind Linked Open Data is that hyperlinks (HTTP URIs) function as words in sentences constructed using RDF Language. Thus, you can use the SPARQL Query Language to produce query solutions (tables or graphs) that expose desired routes (e.g., using Property Paths).
Finally, you can also use Federated SPARQL Query (SPARQL-FED) to navigate a Semantic Web of Linked Data.
Examples:
# rdf: is typically predefined on Virtuoso endpoints.
# The {1,3} path quantifier is a Virtuoso extension; it was dropped
# from the final SPARQL 1.1 property-path syntax.
SELECT DISTINCT *
WHERE {
  ?s a <http://dbpedia.org/ontology/AcademicJournal> ;
     rdf:type{1,3} ?o
}
LIMIT 50
Query Solution Document Link.
We are also working on a publicly available Google Spreadsheet that provides additional information related to the kinds of datasets accessible via the LOD Cloud that we maintain.
To my knowledge, LOD-a-lot is currently the ongoing effort that gets closest to the vision of querying the whole web of data. And this is obviously done by different means than SPARQL endpoints.
It's still a prototype, which means bugs, but one of the aims of wimuQ is to provide a way to query all 539 public SPARQL endpoints plus all datasets from LODLaundromat and LODStats, that is, more than 600,000 datasets and more than 5 terabytes. As far as I know, it is the most extensive collection of datasets accessible from one single place.
For more information, the paper is available here:
I'm trying to build a query to fetch instances of, or of any subclass of, abstract elements such as "human" (Q5) by name; however, the query fails with a timeout, probably because it has too many nodes to traverse in the graph.
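For reference, the failing pattern is roughly this (my reconstruction; the name is a placeholder):

PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Instances of human (Q5), or of any subclass of it, matched by name.
# The unbounded subclass closure (wdt:P279*) is what makes this expensive.
SELECT ?item
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q5 ;
        rdfs:label "Ada Lovelace"@en .
}
LIMIT 10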
Are there any better methods to query this? The best I could come up with is using the Wikidata API's search-entities endpoint with the element name, then filtering to the desired results in the SPARQL query, so as to narrow the query's domain instead of searching the whole graph.
I'm a little worried about using this method in a production environment, since the Wikidata SPARQL service is in beta. Are there any best practices for migrating knowledge-graph use cases from Freebase? Is there any update regarding the migration of data from Freebase to Wikidata?
Finally, are there any other mature alternatives to the deprecated Freebase service?
What endpoint are you querying against? Querying a shared public endpoint with no SLA (beta or not) for a production service is a very risky proposition.
Wikidata offers full database dumps that you can tailor/subset and load into whatever infrastructure you like. That would give you complete control over performance, quality, and any other metrics which are important to you.
As far as migrating from Freebase goes, there is no migration path. The track that train was on has come to an end (at least for external, non-Google users). It's not just deprecated; it was shut down completely a while ago. A tiny fraction of the data was imported into Wikidata (and the two already shared a good deal in common due to their common ancestor, Wikipedia), but none of the programmatic features, such as MQL's JSON query-by-example, Freebase Search, Freebase Suggest, or Google-scale performance and availability, is available (yet?) for Wikidata.
If the data is important to you, you should self-host using whatever infrastructure meets your needs.
I found out that I can query RDF in Jena using the Model API after loading the model from a file, and it gives me the same output as a SPARQL query. So I want to know: is it a good approach to do this without SPARQL? (Though I have only tested it with a small RDF file.) I also want to know: if I use Virtuoso, can I manipulate data using the Model API without SPARQL?
Thanks in advance.
I'm not quite sure if I understand your question. If I can paraphrase, I think you're asking:
Is it OK to query and manipulate RDF data using the Jena Model API instead of using
SPARQL? Does it make a difference if the back-end store is Virtuoso?
Assuming that's the right re-phrasing of the question, then the first part is definitively yes: you can manipulate RDF data through the Model and OntModel APIs. In fact, I would say that's what the majority of Jena users do, particularly for small queries or updates. I find personally that going direct to the API is more succinct up to a certain point of complexity; after that, my code is clearer and more concise if I express the query in SPARQL. Obviously circumstances will have an effect: if you're working with a mixture of local stores and remote SPARQL endpoints (for which sending a query string is your only option) then you may find the consistency of always using SPARQL makes your code clearer.
Regarding Virtuoso, I don't have any direct experience to offer. As far as I know, the Virtuoso Jena Provider fully implements the features of the Model API using a Virtuoso store as the storage layer. Whether the direct API or using SPARQL queries gives you a performance advantage is something you should measure by benchmark with your data and your typical query patterns.
I recently came across the working draft for SPARQL 1.1 Federation Extensions and wondered whether this was already possible using Named Graphs (not to detract from the usefulness of the aforementioned draft).
My understanding of Named Graphs is a little hazy, save that the only thing I have gleaned from reading the specs comprises rules around merger and non-merger in relation to other graphs at query time. Since this doesn't fully satisfy my understanding, my question is as follows:
Given the following query:
SELECT ?something
FROM NAMED <http://www.vw.co.uk/models/used>
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
...
}
Is it reasonable to assume that a query processor/endpoint could or should in the context of the named graphs do the following:
Check if the named graph exists locally
If it doesn't, then perform the following operation (in the case of the above query, I will use the second named graph):
GET /sparql/?query=EncodedQuery HTTP/1.1
Host: www.autotrader.co.uk
User-agent: my-sparql-client/0.1
Where the EncodedQuery includes only the second named graph in the FROM NAMED clause, and the WHERE clause is amended accordingly with respect to GRAPH clauses (e.g. if a GRAPH <http://www.vw.co.uk/models/used> {...} is being used).
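That is, the decomposed query sent to www.autotrader.co.uk would look something like this (keeping the elided pattern from above):

SELECT ?something
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
  GRAPH <http://www.autotrader.co.uk/cars/used> { ... }
}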
Only if it can't perform the above, then do any of the following:
GET /cars/used HTTP/1.1
Host: www.autotrader.co.uk
or
LOAD <http://www.autotrader.co.uk/cars/used>
Return appropriate search results.
Obviously there might be some additional considerations around OFFSETs and LIMITs.
I also remember reading somewhere, a long time ago in a galaxy far, far away, that the default graph of any SPARQL endpoint should be a named graph according to the following convention:
For http://www.vw.co.uk/sparql/ there should be a named graph http://www.vw.co.uk that represents the default graph, and so, by the above logic, it should already be possible to federate SPARQL endpoints using named graphs.
The reason I ask is that I want to start promoting federation across the domains in the above example, without having to wait around for the standard, making sure that I won't do something that is out of kilter or incompatible with something else in the future.
Named graphs and the URLs used in federated queries (using SERVICE or FROM) are two different things. The latter point to SPARQL endpoints; named graphs live within a triple store, and their main function is to separate different data sets. This, in turn, can be useful both to improve performance and to represent knowledge, such as recording the source of a set of statements.
For instance, you might have two data sources, both stating that ?movie has-rating ?x, and you might want to know which source states which rating; in this case you can use two named graphs associated with the two sources (e.g., http://www.example.com/rotten-tomatoes and http://www.example.com/imdb). If you're storing both data sets in the same triple store, you will probably want to use NGs; remote endpoints are a different thing. Furthermore, the URL of a named graph can be used with vocabularies like VoID to describe a dataset as a whole (e.g., the data set name, where and when the triples were imported from, who the maintainer is, the usage licence). This is another reason to partition your triple store into NGs.
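A sketch of that pattern (the has-rating property and the movie variable are made up; the graph IRIs are the ones above):

# For each rating, also bind the named graph it was stated in
SELECT ?source ?movie ?rating
WHERE {
  GRAPH ?source { ?movie <http://www.example.com/has-rating> ?rating }
  FILTER (?source IN (<http://www.example.com/rotten-tomatoes>,
                      <http://www.example.com/imdb>))
}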
That said, your mechanism to bind NGs to endpoint URLs might be implemented as an option, but I don't think it's a good idea to have it as mandatory, since managing remote endpoint URLs and NGs separately can be more useful.
Moreover, the real challenge in federated queries is to offer endpoint-transparent queries, making the query engine smart enough to analyse the query and understand how to split it and perform partial queries on the right endpoints (and join the results later, in an efficient way). There is a lot of research being done on that, one of the most significant results (as far as I know) is FedX, which has been used to implement several query distribution optimisations (example).
One last thing to add: I vaguely remember the convention you mention about $url, $url/sparql. There are a couple of approaches around (e.g., the LOD cloud). That said, in most of today's triple stores (e.g., Virtuoso), queries that don't specify a named graph (don't use GRAPH) do not simply fall back to a default graph; they actually query the union of all named graphs in the store, which is usually much more useful (when you don't know where something is stated, or you want to integrate cross-graph data).