How do I consistently query dbpedia for programming languages by name? - sparql

There seems to be no consistent way to query for programming languages based on name. Examples:
http://dbpedia.org/page/D_(programming_language)
rdfs:label "D (programming language)"#en
dbpprop:name "D programming language"
owl:sameAs freebase:"D (programming language)"
foaf:name "D programming language"
vs.
http://dbpedia.org/page/C++
rdfs:label "C++"#en
dbpprop:name "C++"
owl:samwAs freebase:"C++"
foaf:name "C++"
Since there's no standard convention for whether "programming language", "(programming language)", "programming_language", "(programming_language", or "" is part of a name for a programming language in dbpedia, I have no idea how to consistently search by name.
I'd like to create some sort of SPARQL query that returns http://dbpedia.org/page/D_(programming_language) for "D" and http://dbpedia.org/page/C++ for "C++", but I don't know how do to this.
Unless at least one of the various triples for programming languages uses a consistent naming convention, I'll have to hack it by querying first against name + " (programming_language)", and falling back to name + "(programming language", name + " programming language" when no results are found. But I'd like a much more robust method.

You could of course just match using a basic substring match or a regex, e.g. like this to find a match for "C++":
SELECT DISTINCT ?pl ?label
WHERE {
?pl a dbpedia-owl:ProgrammingLanguage ;
rdfs:label ?label .
FILTER(langMatches(lang(?label), "en"))
FILTER(regex(str(?label), "C\\+\\+"))
}
Of course, the above will be problematic for a programming language name like "D", since you will get back several matches ("D", "Dylan", "MAD", etc.). In those cases, you might want to do some clever postprocessing of the result, for example tokenizing the returned label and seeing if your input string occurs as a standalone word.
Regex matching in SPARQL is notoriously expensive (in terms of evaluation time), but since you combine it with a type constraint to a particular category, the DBPedia endpoint should be able to handle this kind of query just fine.

I'd use
SELECT distinct ?pl ?label
WHERE {
?pl a dbpedia-owl:ProgrammingLanguage ;
rdfs:label ?label.
?label bif:contains "'C++'" .
filter (str (?label) like '%C++%')
filter (lang(?label)="en")
}
?label bif:contains "'C++'" would filter to some degree, more specifically to C++, C, Objective-C etc., because ++ would be treated as a noise and excluded from actual search pattern.
After that you have C and need two pluses, so filter (str (?label) like '%C++%') would check for them faster than regex.
Add filter (lang(?label)="en") or filter (langmatches(lang(?label),"en")) or no lang check at all by taste.

Related

Functions in SPARQL to Manipulate IRIs?

I want to write some reusable SPARQL queries to do things like take an IRI, get the name part (typically after the # sign), modify it (e.g., replace underscores with blank spaces) and put it in the rdfs:label property. This would be useful for Protege which doesn't fill in the rdfs:label if you use user defined IRIs. Or take an IRI with a user defined name, do the same and then replace the user defined IRI with a UUID. I looked in the SPARQL spec for functions to manipulate IRIs and either they don't exist or I'm missing something obvious. I'm posting this to make sure it isn't the latter. I know it is easy to do the equivalent with things like SUBSTR but I'm surprised that there aren't predefined operators to do things like getting the name part of an IRI and getting the base and want to double check before I roll my own.
In case anyone else wants to do this, I figured it out. There are some answers on this site but they are all for SQL or other languages than SPARQL. The following is for classes and it should be obvious how to adapt it for other entities. Note: this works in the Snap SPARQL Plugin for Protege (that's why I use CONSTRUCT rather than INSERT), however, there is a bug in their implementation of SUBSTR so that it uses 0 based indexing rather than 1 based as the spec says. So if you use this in Snap SPARQL change the 1 to a 2.
CONSTRUCT {?c rdfs:label ?lblname.}
WHERE {?c rdfs:subClassOf owl:Thing.
BIND(STRAFTER(STR(?c), '#') as ?name)
BIND(REPLACE(?name,"([A-Z])", " $1" ) as ?namewbs)
BIND (IF (STRSTARTS(?namewbs," "),SUBSTR(?namewbs,1),?namewbs) AS ?lblname)
FILTER(?c != owl:Thing || ?c != owl:Nothing)}

Select all literals from a specific language in a SPARQL query

In a SPARQL query, it is possible to choose a language for each literal (and also strip the language tag):
CONSTRUCT {?x dc:title ?stripped_title.}
WHERE{
?x rdfs:label ?title .
FILTER (langMatches(lang(?title),"en"))
BIND (STR(?title) AS ?stripped_title)
}
But, if the query has many different literal properties that can be very verbose. Is there a way to set a default language for all literals in a query? Or, at least, make this query less verbose?

Avoiding count explosion in querying with SPARQL ontologies containing owl:sameAs triples

I am trying to build a (local) ontology that describes a finite number of objects, and that links these objects to external resources via the owl:sameAs predicate. However, when I simply query for the number of objects of that kind, I obtain twice as much as the object described. It is clear that also the external resources are counted independently, as what is taken into account is the number of URIs, and not the number of distinct objects.
I have solved this issue in the following way: I assume that the local ontology can be seen as a "reference hub" for knowing basic stuff about these objects, so I select all the objects of a certain kind, and then filter out only those that contain the base URI of the local ontology, i.e.:
# How many objects are there?
PREFIX ch: <http://www.example.com/ontologies/domain#>
SELECT (COUNT(DISTINCT ?elem) AS ?count) WHERE {
?elem a ch:Element.
FILTER (REGEX (STR(?elem) ,"http://www.example.com/ontologies/domain") ).
}
However, I have two concerns with this way of doing:
1) it looks a bit of a hack (even if somehow principled), whilst I would like something that makes more logical sense
2) I have the impression that this query is not very efficient.
I have searched quite a bit here, and on google, but didn't come out with any better solution... any suggestions here?
Thank you very much for any help!
GROUP BY a representative element
If there's some property that should have distinct values for each individual, then you can use it to impose the "equivalent class" structure that you need. E.g., something like this:
prefix ch: <http://www.example.com/ontologies/domain#>
select (count(?label) as ?count) where {
?elem a ch:element ;
rdfs:label ?label .
}
group by ?label
Synthesize a representative element
if there's not a value that will be shared by all elements in an equivalence class, you can still get a representative element from the set by asking for the minimal element in each equivalence class. We can use the IRIs of the elements to order the elements, and use that to select a unique individual. This does presume that each ?elem and all the things that it is the same as have well defined behavior under the str function (and IRIs do).
prefix ch: <http://www.example.com/ontologies/domain#>
select (count(distinct ?elem) as ?count) where {
?elem a ch:element .
filter not exists {
?elem (owl:sameAs|^owl:sameAs)* ?elem_
filter( str(?elem_) < str(?elem) )
}
}

Selecting strings with LIKE operator in SPARQL

In ANSI SQL, you can write something like:
SELECT * FROM DBTable WHERE Description LIKE 'MEET'
or also:
SELECT * FROM DBTable WHERE Description LIKE '%MEET%'
What I would like help with is writing the SPARQL equivalent of the above please.
Use a regex filter. You can find a short tutorial here
Here's what it looks like:
PREFIX ns: <http://example.com/namespace>
SELECT ?x
WHERE
{ ?x ns:SomePredicate ?y .
FILTER regex(?y, "YOUR_REGEX", "i") }
YOUR_REGEX must be an expression of the XQuery regular expression language
i is an optional flag. It means that the match is case insensitive.
If you have a fixed string to match you can use that directly in your graph pattern e.g.
PREFIX ns: <http://example.com/namespace>
SELECT ?x
WHERE
{ ?x ns:SomePredicate "YourString" }
Note this does not always work because pattern matching is based on RDF term equality which means "YourString" is not considered the same term as say "YourString"#en so if that may be an issue use the REGEX approach Tom suggests
Also some SPARQL engines provide a full text search extension which allows you to integrate Lucene style queries into your SPARQL queries which may better fit your use case and almost certainly be more efficient for the generic search which would otherwise require a REGEX

How to recursively expand blank nodes in SPARQL construct query?

There is probably an easy to answer to this, but I can't even figure out how to formulate the Google query to find it.
I'm writing SPARQL construct queries against a dataset that includes blank nodes. So if I do a query like
CONSTRUCT {?x ?y ?z .}
WHERE {?x ?y ?z .}
Then one of my results might be:
nm:John nm:owns _:Node
Which is a problem if all of the
_:Node nm:has nm:Hats
triples don't also get into the query result somehow (because some parsers I'm using like rdflib for Python really don't like dangling bnodes).
Is there a way to write my original CONSTRUCT query to recursively add all triples attached to any bnode results such that no bnodes are left dangling in my new graph?
Recursion isn't possible. The closest I can think of is SPARQL 1.1 property paths (note: that version is out of date) but bnode tests aren't available (afaik).
You could just remove the statements with trailing bnodes:
CONSTRUCT {?x ?y ?z .} WHERE
{
?x ?y ?z .
FILTER (!isBlank(?z))
}
or try your luck fetching the next bit:
CONSTRUCT {?x ?y ?z . ?z ?w ?v } WHERE
{
?x ?y ?z .
OPTIONAL {
?z ?w ?v
FILTER (isBlank(?z) && !isBlank(?v))
}
}
(that last query is pretty punishing, btw)
You may be better off with DESCRIBE, which will often skip bnodes.
As user205512 suggests, performing that grab recursively is not possible, and as they point out, using optional(s) to go arbitrary levels down into your data getting the nodes is not feasible on anything but non-trivial sized databases.
Bnodes themselves are locally scoped, to the result set, or to the file. There's no guarantee that a BNode is you get from parsing or from a result set is the same id that is used in the database (though some database do guarantee this for query results). Furthermore, a query like "select ?s where { ?s ?p _:bnodeid1 }" is the same as "select ? where { ?s ?p ?o }" -- note that bnode is treated as a variable in that case, not as "the thing w/ the id 'bnodeid1'" This quirk of the design makes it difficult to query for bnodes, so if you are in control of the data, I'd suggest not using them. It's not hard to generate names for stuff that would otherwise be bnodes, and named resources v. bnodes will not increase overhead during querying.
That does not help you recurse down and grab data, but for that, I don't recommend doing such general queries; they don't scale well and usually return more than you want or need. I'd suggest you do more directed queries. Your original construct query will pull down the contents of the entire database, that's generally not what you want.
Lastly, while describe can be useful, there's not a standard implementation; the SPARQL spec doesn't define any particular behavior, so what it returns is left to the database vendor, and it can be different. That can make your code less portable if you plan on trying different databases with your application. If you want a specific behavior out of describe, you're best off implementing it yourself. Doing something like the concise bounded description for a resource is an easy piece of code, though you can run into some headaches around Bnodes.
With regard to working with the ruby RDF.rb library, which allows SPARQL queries with significant convenience methods on RDF::Graph objects, the following should expand blank nodes.
rdf_type = RDF::SCHEMA.Person # for example
rdf.query([nil, RDF.type, rdf_type]).each_subject do |subject|
g = RDF::Graph.new
rdf.query([subject, nil, nil]) do |s,p,o|
g << [s,p,o]
g << rdf_expand_blank_nodes(o) if o.node?
end
end
def rdf_expand_blank_nodes(object)
g = RDF::Graph.new
if object.node?
rdf.query([object, nil, nil]) do |s,p,o|
g << [s,p,o]
g << rdf_expand_blank_nodes(o) if o.node?
end
end
g
end