Count properties in a large data set via SPARQL - sparql

I'd like to have a list of most used properties in a SPARQL endpoint. The most straightforward query would be:
select ?p ( count ( distinct * ) as ?ct )
{
?s ?p ?o.
}
group by ?p
order by desc ( ?ct )
limit 1000
The problem is that there are too many triples (1.6 billions) and the server times out. So, after googling, I've also tried this, to get at least a sample statistics (yes, it's Virtuoso-specific and it's fine in my case):
select ?p ( count ( distinct * ) as ?ct )
{
?s ?p ?o.
FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (0.0001, ?s, ?p, ?o) )
}
group by ?p
order by desc ( ?ct )
limit 1000
But it times out anyway, I guess because it still has to group, count and then order. So, how can I do it? I have access to the Virtuoso relational DB (i.e., iSQL), but I cannot find docs about SQL syntax and how to select random triples from the table db.dba.rdf_quad.
EDIT: I've fixed the queries, initially they were wrong, thanks for the comments. The versions above still don't work.

OK, I've found a way, at least a partial one: Virtuoso has a command line administration tool, isql. This accepts SPARQL queries as well, in the form: SPARQL <query>;. And they're executed without timeout or result size restrictions.
This is still not good if you can only access an endpoint via HTTP, I don't quite know if that way it is possible at all.

Related

Can this SPARQL search query be made more efficient?

I have a compound 'search' query made in SPARQL that
(1) Searches for unique subject URIs that are of a certain rdf:type:
Example:
SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
{
GRAPH ?g
{
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.example.org/widget>.
}
} OFFSET 10000 LIMIT 100
This query is quite simple and just returns all subjects of type 'widget'.
(2) For the returned page of satisfying subject URIs, search for all subject URIs that have a reference to those subject URIs (i.e. referencing entities), specifying the reference predicate URIs that indicate a reference.
Let's say the previous query (1) returned 2 subject URIs http://www.example.org/widget100 and http://www.example.org/widget101 and the referencing predicate I wanted to query for was http://www.example.org/widget:
Example:
SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
WHERE {
UNION
{
?s <http://www.example.org/widget> <http://www.example.org/widget100>
}
UNION
{
?s <http://www.example.org/widget> <http://www.example.org/widget101>
}
}
If the previous page returned 100 subject URIs, there would be 100 'UNION' statements here for each subject.
This query works - it selects the subject URIs of the given type, and returns the additional subject URIs that reference those subjects with the given reference predicate.
The problem is in practice, when I have 100,000s of triples across my query graphs, even on a fast machine on an in-memory graph this query is taking typically 1 minute+ to execute. This is unacceptably slow for users for this fairly typical search scenario.
Under profiling, both queries take roughly 50% of the query time.
I have enough experience with SPARQL to construct such a query above, but I am certainly not an expert. I am wondering if this could be made more efficient. For example, could it be combined into a single query that might at least reduce query times by 50%+? Is my use of UNIONs across potentially many subjects replacable by a more efficient method?
Thank you
SPARQL Guy
UPDATE: I have managed to reduce the query down to a single query of the following form:
SELECT *
FROM NAMED <http://www.example.org/widgets>
FROM NAMED <http://www.example.org/widgetstats>
FROM NAMED <http://www.example.org/widgetmetadata>
FROM NAMED <http://www.example.org/widgetfactory>
WHERE
{ { SELECT ?s ?p ?o
WHERE
{ GRAPH ?g
{ ?s ?p ?o }
{ SELECT ?s
WHERE
{ GRAPH ?i
{ ?s a <http://www.example.org/widget> }
}
OFFSET 0
LIMIT 100
}
}
}
UNION
{ SELECT ?s ?p ?o
WHERE
{ GRAPH ?g
{ ?s ?p ?o }
{ SELECT DISTINCT ?s
WHERE
{ GRAPH ?h
{ OPTIONAL
{ ?s <http://www.example.org/widgetstats/widget> ?x }
OPTIONAL
{ ?s <http://www.example.org/widgetmetadata/widget> ?x }
OPTIONAL
{ ?s <http://www.example.org/widgetfactory/widget> ?x }
}
{ SELECT ?x
WHERE
{ GRAPH ?i
{ ?x a <http://www.example.org/widget> }
}
OFFSET 0
LIMIT 100
}
}
}
}
}
}
This improves query speed by approx. 50%. The query can, though, I think be made faster. This form of query - fetching first all triples associated with the primary entities of the given type followed by all the triples associated with the referencing entities - requires two identical innermost subqueries, fetching the unique subjects of the given type.
Is there any way of reducing this query down - perhaps performing with a single query instead of a UNION of two subqueries? I am assuming this will probably improve performance further.
UPDATE 2: I couldn't improve on the query above (first update) and so I will make this as the answer for now.
If you still want the paging of the first query then probably the best approach would be to combine the queries using a SPARQL subquery.
Note that with subqueries you work from the inside out, so the subquery selects the widgets and the outer query expands to find the references. If you are using FROM NAMED then you need to match on the graph (assuming your results are in a named graph and you aren't working with a union default graph). The OFFSET and LIMIT on the inner query means that the example below returns references to the 3rd widget (in whatever default sort order the engine is applying).
I'm not sure if this will speed up the overall query time, but worth experimenting with and saves you a bunch of string concantenation!
PREFIX ex: <http://www.example.org/>
SELECT ?s FROM NAMED ex:g1 FROM NAMED ex:g2 WHERE {
GRAPH ?h {
?s ex:widget ?x
}
{
SELECT ?x WHERE {
GRAPH ?g {
?x a ex:widget
}
} OFFSET 2 LIMIT 1
}
}

Why DISTINCT keyword lead to different entity for these two queries?

Query 1
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?x
WHERE {
FILTER (!isLiteral(?x) OR lang(?x) = '' OR langMatches(lang(?x), 'en'))
?x ns:type.object.type ns:religion.religious_leadership_title .
?x ns:religion.religious_leadership_title.leaders ?c0 .
?c0 ns:religion.religious_organization_leadership.start_date ?sk0 .
}
ORDER BY ?sk0
LIMIT 1
Query 2
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?x
WHERE {
FILTER (!isLiteral(?x) OR lang(?x) = '' OR langMatches(lang(?x), 'en'))
?x ns:type.object.type ns:religion.religious_leadership_title .
?x ns:religion.religious_leadership_title.leaders ?c0 .
?c0 ns:religion.religious_organization_leadership.start_date ?sk0 .
}
ORDER BY ?sk0
LIMIT 1
So the only difference between Q1 and Q2 is that there is a DISTINCT keyword when SELECT ?x in Q1. However, Q1 gives answer m.01h_90 while Q2 gives answer m.05rd8.
Ideally, I feel this should not lead to different results, as the purpose of DISTINCT is only to get rid of duplicates in the results set if I understand it correctly, so if the original results do not have duplicates at all, there should not be any difference by adding the DISTINCT keyword.
You have a tie on the value you're ordering. Specifying distinct is causing a different execution plan which orders the rows differently, though still ordering by the one column as requested, with another row as the first one to output. Add the output column to the order by clause and you should see consustent results between the two queries.

Fast publication date lookup with Wikidata Query Service

Is there a way to lookup publication dates quickly in Wikidata Query Service's SPARQL to find publications of a certain date, e.g., today?
I was hoping that something like this query would be quick:
SELECT * WHERE {
?work wdt:P577 ?datetime .
BIND("2018-09-28T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> as ?now_datetime)
FILTER (?datetime = ?now_datetime)
}
LIMIT 10
However, it times out when using it on the SPARQL endpoint at https://query.wikidata.org
A range query seems neither to be quick. The query below returns after almost 30 seconds:
SELECT * WHERE {
?work wdt:P577 ?datetime .
FILTER (?datetime > "2018-09-28T00:00:00Z"^^xsd:dateTime)
}
LIMIT 1
The trick is to avoid full scan and use indexes:
VALUES:
SELECT * WHERE {
VALUES (?datetime) {("2018-09-28T00:00:00Z"^^xsd:dateTime)}
?work wdt:P577 ?datetime .
} LIMIT 10
Try it!
hint:rangeSafe:
SELECT * WHERE {
VALUES (?datetime) {("2018-09-28T00:00:00Z"^^xsd:dateTime)}
?work wdt:P577 ?date_time .
hint:Prior hint:rangeSafe true .
FILTER (?date_time > ?datetime)
} LIMIT 10
Try it!
[The rangeSafe hint] declare[s] that the data touched by the query for a specific triple pattern is strongly typed, thus allowing a range filter to be pushed down onto an index.

SPARQL Most Frequent Predicates

I want to find the top k frequent predicates in a graph. That is the predicates that occur in more triples. How can this be done using SPARQL?
You can do this using the group and aggregation features of the query language. You want to group by the predicate and count all the triples that use the predicate e.g.
SELECT ?p (COUNT(*) AS ?usages)
WHERE
{
GRAPH <http://your-graph.com> { ?s ?p ?o }
}
GROUP BY ?p
ORDER BY DESC(?usages)
LIMIT 5
You can then sort by descending number of usages and limit the results to obtain just the top K
See the spec for more examples - https://www.w3.org/TR/sparql11-query/#aggregates

How to force virtuoso sparql endpoint return full answer?

I want to query DBpedia and use Virtuoso. In some queries which their results are too much, it returns only part of the results. For example, in the query below, the predicate http://dbpedia.org/ontology/birthplace is missing. Is there any way to get all results either from Virtuoso or any other endpoint ?
SELECT DISTINCT ( ?p AS ?outEdge )
( ?q AS ?inEdge )
( ?px AS ?dest )
( ?qx AS ?source )
WHERE {
{ <http://dbpedia.org/resource/England> ?p ?px . }
UNION
{ ?qx ?q <http://dbpedia.org/resource/England> . }
}
I want to query DBPeida and use virtuoso. In some queries which their results are too much it returns only part of the results for example in the below query the predicate http://dbpedia.org/ontology/birthplace is missing. Is there anyway to get all results either from virtuoso or any other endpoint ?
While I don't detect anything malicious or mean-spirited in your question, you're essentially asking how circumvent DBpedia's defenses against intentional and unintentional denial of service attacks. Internal limits help to ensure that too many resources aren't consumed by any particular query. The right way to get all the results from a SPARQL query, if they aren't all returned at once, is to use limit, offset, and order by, and to use multiple queries. E.g.,
#-- get first 10 results
select ... where ...
order by ?name
limit 10 offset 0
#-- get next 10 results
select ... where ...
order by ?name
limit 10 offset 10
#-- get more resuls
select ... where ...
order by ?name
limit 10 offset 20