How do triple stores use linked data? - semantic-web

Let's say I have the following scenario:
I have several ontology files hosted on the web on different domains, such as _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl#, etc.
I also have a triple store into which I want to insert instances based on those ontology files, like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2, defined within the same ontology.
And after the insert if I run a SPARQL query like this:
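PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}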
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Let's say you want to run a query on some complex data that is described by 20 ontology files; then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Have I misunderstood triple stores and linked data and how they work together?

as it should be.
I'm not certain that "should" is the right term here. The semantics of a SPARQL query is to query the data stored in a particular graph at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
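The point that a query only sees the triples actually loaded into the store can be illustrated with a minimal sketch. It uses a toy in-memory set of tuples rather than a real triple store, and the shortened names (foo2:, onto2:) are stand-ins for the IRIs in the question.

```python
# Toy in-memory "triple store": a set of (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(store, cls):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in store:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def instances_of(store, target):
    """Mimic: ?a rdf:type ?type . ?type rdfs:subClassOf* target ."""
    return sorted(
        s for s, p, o in store
        if p == RDF_TYPE and target in superclasses(store, o)
    )


# Only the instance data has been inserted so far.
store = {
    ("foo2:instance2", RDF_TYPE, "onto2:class2"),
    ("foo2:instance2x", RDF_TYPE, "onto2:class2x"),
}
before = instances_of(store, "onto2:class2")

# Importing the ontology adds the subclass axiom to the same store.
store.add(("onto2:class2x", SUBCLASS_OF, "onto2:class2"))
after = instances_of(store, "onto2:class2")
```

Before the axiom is loaded only foo2:instance2 matches; afterwards both instances do, without the query itself changing.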
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2, defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Let's say I want to run a query on some complex data that is described
by 20 ontology files; in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the SERVICE keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.
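As a rough sketch of how several SERVICE blocks might be combined, the helper below just assembles the query text. The function name, the second endpoint (http://example.org/sparql), and the :otherProperty predicate are hypothetical, and the prefixes would still need to be declared before the query is sent anywhere.

```python
def federated_query(select_vars, local_pattern, services):
    """Build a SELECT query with one SERVICE block per (endpoint, pattern) pair."""
    lines = ["SELECT " + " ".join(select_vars) + " WHERE {", "  " + local_pattern]
    for endpoint, pattern in services:
        lines.append("  SERVICE <" + endpoint + "> {")
        lines.append("    " + pattern)
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


query = federated_query(
    ["?person", "?localValue", "?publicName", "?other"],
    "?person :hasLocalValueOfInterest ?localValue .",
    [
        ("http://dbpedia.org/sparql", "?person foaf:name ?publicName ."),
        # Hypothetical second endpoint and predicate, for illustration only.
        ("http://example.org/sparql", "?person :otherProperty ?other ."),
    ],
)
print(query)
```

Each SERVICE block is evaluated by the named remote endpoint, and the local engine joins the results on the shared ?person variable.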

Related

Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint imposes time limits on queries, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I’m writing to ask if you know of any way to get more than 100,000 results. I would like to be able to achieve the maximum number of possible results.
Once the results are obtained, I must have 3 files (or as few files as possible) in the N-Triples format: wikidata_intance_types.nt, wikidata_redirecions.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly — by limiting the rate of your queries; limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through to get a full result set for a query that overflows the instance's maximum result set size; and the like — you'll be far more successful.
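The "step through with ORDER BY, LIMIT, and OFFSET, and pause between requests" advice could look roughly like the sketch below. It assumes a caller-supplied run_query function (hypothetical here) that sends a SPARQL string to the endpoint and returns a list of rows; the page size and pause are illustrative defaults, not limits documented for any particular endpoint.

```python
import time


def fetch_all(run_query, base_query, order_by, page_size=50000, pause=1.0):
    """Collect all rows of base_query page by page.

    run_query is a hypothetical callable: it takes a SPARQL string and
    returns a list of rows. A short page signals the end of the results.
    """
    rows, offset = [], 0
    while True:
        paged = (
            f"{base_query}\n"
            f"ORDER BY {order_by}\n"
            f"LIMIT {page_size}\n"
            f"OFFSET {offset}"
        )
        page = run_query(paged)  # wait for this page before asking for the next
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size
        time.sleep(pause)  # be friendly to the shared endpoint
```

Because each request is issued only after the previous one has returned, the endpoint never sees more than one of these queries at a time.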
You can get and host your own copy of Wikidata, as explained at
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.

How to make a selection of a giant ontology, built from several aligned reference ontologies?

My organization has an information requirement spanning several information domains. In order to capture this, we are building a large organization ontology in which we align several domain-specific reference ontologies/vocabularies (think of Dublin Core, GeoSPARQL, industry-specific information models, etc.) and, where necessary, we add concepts in an 'extension' ontology (which is then also aligned with the reference ontologies).
The totality of this aligned ontology (>3000 classes and >10000 ObjectProperties) contains both unused concepts and semantic doubles, and is impossible for a newcomer to navigate. Furthermore, the organization wishes to standardize the use of specific concepts, so doubles are extremely undesirable. We are therefore looking for a way to construct the SuperAwesomeOntology that contains all concepts (and their OWL-related predicates like subClassOf, domain/range, etc.) that have been labeled (maybe by something like dcterms:isRequiredBy "SuperAwesomeOntology"). The result should be a correct OWL ontology that can be stored in a single file.
One constraint: it has to be done programmatically (the copy/move/delete axioms interface of Protégé won't do), because if one of the reference ontologies gets an update, we want to be able to render the SuperAwesomeOntology again from its most up-to-date reference ontologies and find out if there are any conflicts.
How would we go about this? Could SPARQL do this, and if so, how? Alternative suggestions to the isRequiredBy labeling are also welcome.
If I understand you correctly, you want to programmatically remove unused concepts from a large ontology or a collection of ontologies/graphs and you also want to remove concepts/classes that you identified as duplicates via interlinking.
Identified duplicates are easy to remove:
Define what a duplicate is for you. For example, nodes at either end of an owl:sameAs or skos:closeMatch link that are outside of your core graph (so you don't remove the "original").
Construct the new graph using a SPARQL query:
construct {?s ?p ?o.}
{
?s ?p ?o.
filter not exists {graph ?g {?s owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
filter not exists {graph ?g {?o owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
}
I tested this query for syntax and performance but not for correctness.
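To check the intended filtering on a small example, the same FILTER NOT EXISTS logic can be mirrored in a few lines of Python. This is only a sketch over an in-memory set of quads; the ext: names and the http://ext.graph graph IRI are made up for illustration.

```python
SAME_AS = "owl:sameAs"
CORE = "http://my.core.graph"


def drop_duplicates(quads, core_graph=CORE):
    """Keep triples whose subject and object are not the subject of an
    owl:sameAs statement in any graph other than the core graph."""
    flagged = {s for s, p, o, g in quads if p == SAME_AS and g != core_graph}
    return {
        (s, p, o) for s, p, o, g in quads
        if s not in flagged and o not in flagged
    }


quads = {
    (":Person", "rdf:type", "owl:Class", CORE),
    ("ext:Human", "rdf:type", "owl:Class", "http://ext.graph"),
    # The alignment link lives outside the core graph, so ext:Human is flagged.
    ("ext:Human", SAME_AS, ":Person", "http://ext.graph"),
}
kept = drop_duplicates(quads)
```

Here only the :Person class declaration survives. Note that any triple pointing at a flagged resource is dropped as well, which is what the second filter in the query does.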
Unused concepts are more difficult to remove:
First, again, you need to define what "unused" means for you. This criterion will, however, certainly involve reachability, or "connectedness", in the combined graph, where you want to select only the graph component that contains your core ontology. The problem is that, if you treat the triples as undirected edges, you will probably get a connected graph (that is, only a single component and no nodes to remove) because the type hierarchy often connects everything. You could take the direction of edges into account, that is, include only resources Y for which there is a directed path from some resource X in your core ontology to Y.
This would ensure that you can go up the subclass hierarchy of the target ontology until, e.g., owl:Thing, but not down again. The problem is that you don't know what other types of edges are in the target ontology and in which direction they go, but you could use only rdfs:subClassOf edges for now.
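The directed-reachability idea, restricted to rdfs:subClassOf edges, amounts to a breadth-first search that only walks from subclass to superclass. A minimal sketch, with made-up class names and the edges given as (subclass, superclass) pairs:

```python
from collections import deque


def reachable_classes(subclass_edges, core_classes):
    """Return every class reachable from core_classes by following
    rdfs:subClassOf edges upwards (subclass -> superclass) only."""
    parents = {}
    for sub, sup in subclass_edges:
        parents.setdefault(sub, set()).add(sup)
    seen = set(core_classes)
    queue = deque(core_classes)
    while queue:
        cls = queue.popleft()
        for sup in parents.get(cls, ()):
            if sup not in seen:
                seen.add(sup)
                queue.append(sup)
    return seen


edges = [
    ("core:Pump", "ref:Device"),
    ("ref:Device", "owl:Thing"),
    ("ref:Sensor", "ref:Device"),  # sibling branch, never reached from core:Pump
    ("ref:Valve", "owl:Thing"),
]
used = reachable_classes(edges, ["core:Pump"])
unused = {c for edge in edges for c in edge} - used
```

In this toy hierarchy ref:Sensor and ref:Valve end up in unused, because the search goes up to owl:Thing but never back down.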
Once you have sufficiently defined your "unused concept", or want to try some experimental definition, you can use a graph library or graph analysis application and import your data there.
Here is an example of how to import a SPARQL endpoint into the Cytoscape.js JavaScript graph visualization library; it can be used in Node.js as well. You will need to adapt the code heavily, however.
Or you do it again in SPARQL using SPARQL 1.1 property paths.
The problem is that those can have a large performance impact (or even a complexity that is way too large to ever complete) especially when applied to a large number of resources and an unrestricted path length. So it is possible that a query such as that times out but feel free to try and adapt it:
construct {?s ?p ?o.}
{
{?s ?p ?o.}
graph <http://my.core.graph> {?x rdfs:subClassOf ?X.}
{?x (<>|!<>)* ?s.}
}
The ?x rdfs:subClassOf ?X statement just identifies which resources of your core ontology you want to use as source points; I couldn't get a valid query without it. When I apply a graph statement to the path expression, I get a syntax error from Virtuoso.

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students’ motivation, self-efficacy. The goal is to map these keywords to the corresponding category (or categories) of psychology. In this case I know a priori that the answer is Educational psychology; however, I want to get the same answer using DBpedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
?concept rdfs:label "Branches of psychology"@en .
?concept ^dct:subject ?subject .
?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add an OPTIONAL clause that would compare my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then do all further text processing using, e.g., Java or Python?
Ideas for other approaches that might help reach the goal are also highly appreciated.
You could retrieve the data as text using SPARQL, but deciding whether a text matches a query should be done with text analysis or text mining techniques.
This is a whole science, but fortunately libraries implementing the related algorithms exist for many languages (including Java and Python). Here is a list of software on Wikipedia. NLTK, a Python library, is well known for the job.
In your case, I can think of several approaches, but I'm far from an expert, so my ideas may be wrong:
Create a corpus of abstracts for each desired category (educational psychology, …), and, for a given abstract A, compare A to each abstract of each category C. The result of the comparison gives, for each category, a score/likelihood that A belongs to C (cf. fuzzy sets).
The comparison could be implemented with a vector space model, which works on vocabulary similarities.
Named-entity recognition could help to detect names of authors, techniques, or tools related to a particular category.
The main idea is the following: once you have defined the particular traits of each category, using its vocabulary, authors, references, or whatever, you can compute, for any abstract, a membership score for every category.
So the real question to ask is: which scoring function should I use?
The answer depends heavily on the data and on the results you want. When you say an abstract is about educational psychology, you have to know why, and then implement that reasoning as a scoring function.
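One possible scoring function, in the spirit of the vector space model mentioned above, is cosine similarity over raw term counts. The sketch below uses only the standard library; the two abstracts are invented stand-ins for what dbo:abstract would return, and a real setup would likely want stop-word removal, stemming, and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase alphabetic tokens and their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_categories(keywords, abstracts):
    """Score every category abstract against the keywords, best first."""
    query = term_vector(" ".join(keywords))
    scores = {cat: cosine(query, term_vector(text)) for cat, text in abstracts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented abstracts, for illustration only.
abstracts = {
    "Educational psychology": "the study of how students learn, including "
    "motivation, self-efficacy and ability in school settings",
    "Forensic psychology": "the application of psychology to criminal "
    "investigation and the law",
}
keywords = ["emotion perception ability", "students' motivation", "self-efficacy"]
ranking = rank_categories(keywords, abstracts)
```

With these toy abstracts the educational psychology entry ranks first because it shares several terms with the keywords, while the other abstract shares none.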
As a side note, neural networks trained on the corpus could perhaps bypass the explicit scoring through automatic learning. I don't know enough about that field to say more.

Yago ontology for entity disambiguation

I am using the property rdf:type with the value dbpedia-owl:Organisation to select (obviously) organizations in my SPARQL query:
SELECT ?s
WHERE {
?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to consider the YAGO ontology for increasing my performance on getting real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not considered as a dbpedia-owl:Organisation but is tagged as yago:Organization108008335 .
Note the "random" (at least to me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I find two more, for three classes in total: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378, http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
?s a ?t
FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization"? Should I expect "new" organization classes as new versions of the YAGO ontology are made available? Is there any other class I should consider when selecting organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least to me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
That number corresponds to the synset ID of the word in WordNet. For example, if you look up wordnet_organization_101136519 in WordNet (the URI in DBpedia is not resolvable at the moment; maybe something changed in the latest releases), you will see that it has the synsetID "101136519". I don't think you can know it a priori without looking into WordNet.
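If the goal is just to separate the word from the synset ID in such class URIs, a small regular expression is enough. The sketch below assumes the URI shape seen in the question (a name followed by trailing digits under http://dbpedia.org/class/yago/); the function name is made up.

```python
import re

# Name part (letters/underscores) followed by the numeric synset ID.
YAGO_CLASS = re.compile(r"http://dbpedia\.org/class/yago/([A-Za-z_]+)(\d+)")


def split_yago_class(uri):
    """Return (name, synset_id) for a YAGO class URI, or None if the URI
    does not end in a numeric ID."""
    match = YAGO_CLASS.fullmatch(uri)
    if match is None:
        return None
    return match.group(1), match.group(2)


parts = split_yago_class("http://dbpedia.org/class/yago/Organization108008335")
no_id = split_yago_class("http://dbpedia.org/class/yago/Alcohols")
```

The extracted ID can then be looked up in WordNet to see which sense of the word a given class refers to.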
Are they different? Why aren't they all "yago:Organization"?
They are different because they have different definitions in WordNet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated YAGO, they associated every word in WordNet with a URI. If more words about organizations are added, then I guess that you'll have more definitions. However, it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in WordNet and then add OPTIONAL clauses to your query (or issue one query per class, retrieving the different organizations you are interested in). These are the classes with the "organization" label in WordNet.
I hope it helps.

DBPedia queries missing certain chemical compounds

I am running this query to get a list of all compounds from the DBpedia public SPARQL endpoint.
SELECT * WHERE {
?y rdf:type dbpedia-owl:Drug.
?y rdfs:label ?Name .
OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
OPTIONAL {?y dcterms:subject ?y1}
FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000
I am downloading in batches of 50000 (2 files) using the OFFSET parameter.
Why is Isopropyl_alcohol not covered by this, even though a page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for?
There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates
happening almost each second. Hence, data stored in the official
DBpedia endpoint can quickly become outdated, and Wikipedia articles
need to be re-extracted. DBpedia Live enables such a continuous
synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
The second issue is the one that AndyS pointed out. Even if the query did select Isopropyl_alcohol in DBpedia or DBpedia Live, unless you provide an ordering constraint, the LIMIT/OFFSET combination isn't guaranteed to return it, since without an ordering constraint the server could legitimately return the same set of 50000 results to you every time.
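A toy simulation makes the ordering problem visible. The RotatingServer class below is purely illustrative (it is not how any real endpoint behaves): it returns rows in a different order on each request unless asked to sort, which stands in for ORDER BY.

```python
class RotatingServer:
    """Toy endpoint whose row order changes on every request unless the
    caller asks for a sort (a stand-in for ORDER BY)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.requests = 0

    def query(self, limit, offset, ordered=False):
        shift = (self.requests * limit) % len(self.rows)
        self.requests += 1
        if ordered:
            rows = sorted(self.rows)
        else:
            rows = self.rows[shift:] + self.rows[:shift]
        return rows[offset:offset + limit]


def fetch_pages(server, total, page_size, ordered):
    """Step through the result set with LIMIT/OFFSET-style slices."""
    seen = []
    for offset in range(0, total, page_size):
        seen.extend(server.query(page_size, offset, ordered))
    return seen


rows = ["r%02d" % i for i in range(10)]
unordered = fetch_pages(RotatingServer(rows), 10, 5, ordered=False)
ordered = fetch_pages(RotatingServer(rows), 10, 5, ordered=True)
```

In this simulation the unordered run receives the same five rows twice and never sees the other five, while the ordered run covers all ten.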
Maybe it is not finding it in the LIMIT/OFFSET combination you are using. The server is not obliged to answer queries in the same order every time unless you use ORDER BY, so the slices you have may not in fact cover all results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.