Pagination / breaking up a large query for Wikidata SPARQL

I am running the following query against Wikidata's SPARQL endpoint:
SELECT ?item ?itemLabel ?articleSimple ?article WHERE {
  ?articleSimple schema:about ?item ;
                 schema:isPartOf <https://simple.wikipedia.org/> .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 1000
(thanks @UninformedUser for the help!). It runs fine (in under a minute) up to LIMIT 200000, but times out soon after that.
I was hoping to find a way to get all the results back, either through pagination or by splitting the query in some way (for instance, by Wikipedia pages that start with A, B, etc.).
Unfortunately, any time I add an ORDER BY clause, the query goes over the time limit. Any ideas on how to approach the problem?
Otherwise, one option would be to download the full Wikidata dump and scan through it, but that seems inefficient.
Thank you very much!
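One approach that is sometimes suggested instead of OFFSET-based paging is to partition the result set on a stable key, for example the first character of the Simple English article URL, and run one small query per partition. The sketch below only builds the per-letter query strings; sending them to query.wikidata.org, handling titles that start with digits or other characters, and retry logic are left out, and the partitioning scheme itself is an assumption, not a tested solution.

```python
import string

# One partition of the original query per leading letter of the Simple
# English article URL, filtered with STRSTARTS instead of using OFFSET.
QUERY_TEMPLATE = """\
SELECT ?item ?articleSimple ?article WHERE {{
  ?articleSimple schema:about ?item ;
                 schema:isPartOf <https://simple.wikipedia.org/> .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  FILTER(STRSTARTS(STR(?articleSimple), "https://simple.wikipedia.org/wiki/{prefix}"))
}}"""

def partitioned_queries():
    """Yield one query per leading letter A-Z; other leading characters
    (digits, punctuation) would need extra buckets."""
    for letter in string.ascii_uppercase:
        yield QUERY_TEMPLATE.format(prefix=letter)

queries = list(partitioned_queries())
```

Each partition stays well under the timeout threshold only if titles are distributed reasonably evenly; a very large bucket (e.g. "S") could be split again on the second character.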

Related

Finding common super classes in SPARQL

We need a SPARQL query that finds the common superclasses of two Wikidata entities. I tried it like this:
SELECT ?commonBase #?commonBaseLabel
WHERE {
  wd:Q39798 wdt:P31/wdt:P279* ?commonBase .
  wd:Q26868 wdt:P31/wdt:P279* ?commonBase .
  #SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
Here's a direct link to the query.
This always times out, even with the SERVICE line commented out.
However, when I run the same query with just one of the two triple patterns, it runs very fast, yielding 130 or 12 results respectively.
Why is it so much slower with both patterns? In theory, the database could evaluate the two patterns separately and then do an INTERSECT, i.e. deliver those results that occur in both queries.
If this is a flaw in the design of the SPARQL engine, then how can I adapt the query to make it run faster?
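Since each single-pattern query returns quickly (130 and 12 results), one practical workaround is to do exactly the INTERSECT described above on the client: run the two `wd:Qxxx wdt:P31/wdt:P279* ?commonBase` queries separately and intersect their bindings locally. A minimal sketch of the intersection step, with illustrative placeholder values instead of live endpoint results:

```python
def common_superclasses(bases_a, bases_b):
    """Intersect the ?commonBase bindings of the two fast single-pattern
    queries, returning a sorted list of the shared superclasses."""
    return sorted(set(bases_a) & set(bases_b))

# Illustrative values only; in practice each list would come from running
# one single-pattern query against the endpoint on its own.
a = ["wd:Q35120", "wd:Q488383", "wd:Q7184903"]
b = ["wd:Q35120", "wd:Q223557"]
common = common_superclasses(a, b)
```

With only 130 and 12 rows to transfer, the client-side intersection is negligible compared with the cost the engine incurs joining the two unbounded property-path expansions.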

Jena seems to cache BIND(RAND()) Queries

I am running the following query thousands of times in a row in a Java program using Apache Jena (in order to generate random walks).
SELECT ?p ?o
WHERE {
  $ENTITY$ ?p ?o .
  FILTER(!isLiteral(?o)) .
  BIND(RAND() AS ?sortKey)
}
ORDER BY ?sortKey
LIMIT 1
However, I always get the same property and object back (even though this seems extremely unlikely). I suspect Jena is caching the result of the query (despite the RAND() component).
What is the best and most efficient way to solve this problem?
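If the goal is just one random outgoing edge per step, a workaround is to drop RAND()/ORDER BY from the SPARQL entirely: fetch all non-literal (?p, ?o) pairs for the entity once, and pick one client-side. The question uses Java/Jena; this Python sketch only illustrates the sampling step, with hardcoded example pairs standing in for real query results:

```python
import random

def random_step(pairs, rng=random):
    """Pick one (property, object) pair uniformly at random for the
    next step of the random walk; None if the entity has no edges."""
    if not pairs:
        return None
    return rng.choice(pairs)

# Placeholder bindings; in practice these come from the
# `SELECT ?p ?o WHERE { $ENTITY$ ?p ?o . FILTER(!isLiteral(?o)) }` query.
pairs = [("wdt:P31", "wd:Q5"), ("wdt:P19", "wd:Q64"), ("wdt:P27", "wd:Q183")]
step = random_step(pairs, random.Random(0))
```

This sidesteps any caching of the ORDER BY RAND() plan, and for repeated walks from the same entity it also avoids re-sorting the whole result set on every call.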

Limit 2500 on Linked Movie Database

Right now I'm developing an app for querying LinkedMDB (LMDB). LMDB caps the number of results at 2500 in order to prevent page crashes. I want to query the WHOLE of LMDB, and I was wondering if there's some way to get all the data in LMDB, not just the limited 2500 results.
The query that I run:
PREFIX mdb: <http://data.linkedmdb.org/resource/movie/film>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?movieName ?resource WHERE {
  ?resource dc:title ?movieName .
  ?resource mdb:id ?uri .
  FILTER regex(?movieName, 'The Mexican', 'i')
}
I get no results.
If I change the last line from FILTER regex(?movieName, 'The Mexican', 'i') to ?resource dc:title 'The Mexican', I get the desired results. This means the data about this movie is actually in LMDB, but because of the LIMIT I can't access all entries unless I type in the EXACT name of the movie I'm searching for.
The app that I'm building uses search fields, and I can't expect users to know the exact name of the movie; that's why I'm using FILTER regex(). What would be the way to fix this problem? Please help me.
Should I download a local copy and run the queries against it or is there another way?
I wanted to contact the LMDB staff about this, but I couldn't find any contact information.
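One standard way to work with a capped endpoint is to page through the full result set with LIMIT/OFFSET, stopping when a page comes back short, and then run the regex search over the local copy. A sketch that only builds the page queries (the actual HTTP calls, and whether the LMDB endpoint honours OFFSET beyond its cap, are left as assumptions to verify):

```python
PAGE_SIZE = 2500  # the endpoint's stated result cap

# ORDER BY makes the paging deterministic; without it, pages may
# overlap or skip rows between requests.
QUERY_TEMPLATE = """\
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?movieName ?resource WHERE {{
  ?resource dc:title ?movieName .
}}
ORDER BY ?resource
LIMIT {limit} OFFSET {offset}"""

def paged_queries(pages):
    """Yield one query string per page of PAGE_SIZE results."""
    for page in range(pages):
        yield QUERY_TEMPLATE.format(limit=PAGE_SIZE, offset=page * PAGE_SIZE)

qs = list(paged_queries(3))
```

In the app, you would loop: execute a page, append the rows, and stop as soon as a page returns fewer than PAGE_SIZE results; the case-insensitive title matching then happens client-side, which also avoids the slow server-side regex FILTER.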

Fetching entity labels using SPARQL in Wikidata

I am using the Wikidata Query Service to fetch data: https://query.wikidata.org/
I have already managed to retrieve an entity's label using two methods:
Using the wikibase:label service. For example:
SELECT ?spouse ?spouseLabel WHERE {
  wd:Q1744 wdt:P26 ?spouse.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
Using the rdfs:label property:
SELECT ?spouse ?spouseLabel WHERE {
  wd:Q1744 wdt:P26 ?spouse.
  ?spouse rdfs:label ?spouseLabel .
  FILTER(lang(?spouseLabel) = "en").
}
However, it seems that for complex queries the second method performs faster, contrary to what the MediaWiki user manual states:
The service is very helpful when you want to retrieve labels, as it
reduces the complexity of SPARQL queries that you would otherwise need
to achieve the same effect.
(https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Label_service)
What does the wikibase label service add that I can't achieve using just rdfs:label?
It seems odd, since they both seemingly achieve the same purpose, yet the rdfs:label method seems faster (which is logical, because the query does not need to join data from external sources).
Thanks!
As I understand from the documentation, the wikibase label service simplifies the query by removing the need to explicitly search for labels. In that regard it reduces complexity of the query you need to write, in terms of syntax.
I would assume that the query is then expanded to another representation, maybe with the rdfs namespace like in your second option, before actually being resolved.
As for the second option being faster: have you done systematic benchmarking? In a couple of tries of mine, the first option was faster. Performance of the public endpoint is in any case subject to fluctuation based on demand, caching, etc., so it may be tricky to draw conclusions about performance for similar queries.
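To compare the two label approaches less anecdotally, one can time repeated runs of each query and compare medians rather than single tries. A generic timing sketch; the actual call to query.wikidata.org is replaced by a no-op stub here so the example is self-contained:

```python
import statistics
import time

def median_runtime(fn, runs=5):
    """Call fn several times and return the median wall-clock
    duration in seconds (median is robust to one slow outlier)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# In practice fn would POST one of the two label queries to the
# endpoint; a no-op stands in so the sketch runs on its own.
t = median_runtime(lambda: None, runs=3)
```

Even with medians, results against the shared public endpoint will still drift with load and caching, so interleaving the two variants and repeating the comparison at different times of day gives a fairer picture.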

How to import and query 4-column N-Quad quadruples into Blazegraph?

I'm relatively new to RDF/SPARQL and I'm trying to import some N-Quads (4-column/quadruple) data into Blazegraph and query it (see schemaOrgEvent.sample.txt).
When I import the quads data (via the web console), I'm left with only 3-column, triple-based data: Blazegraph allows SPARQL queries SELECTing {?s ?p ?o}, but quad selections {?s ?p ?o ?c} do not work. What am I doing wrong?
Can Blazegraph store 4-column quad data natively, or am I misunderstanding the nature of RDF triple/quad storage?
My Blazegraph knowledge base (the one I import into) is configured for quads with:
com.bigdata.rdf.store.AbstractTripleStore.quads=true
I've imported the quads file into Blazegraph using the web-based UPDATE console at http://127.0.0.1:9999/bigdata/#update. I haven't tried a direct bulk upload yet.
Also, as well as becoming triples (normalised?), the imported quad data appears to have its first column converted to another, Blazegraph-provided(?) identifier. The original (4-column) quad data format is:
_:node03f536f724c9d62eb9acac3ef91faa9 <http://schema.org/PostalAddress/addressRegion> "Kentucky"@en <http://concerts.eventful.com/Lauren-Alaina> .
And after import (3 columns):
t1702 schema:PostalAddress/addressRegion Kentucky
The query used was:
SELECT * WHERE
{
  ?s ?p ?o
  #?s ?p ?o ?c - Won't work :-(
  FILTER(STR(?o) = "Kentucky")
}
The value 't1702' is a 'foreign key' of sorts and can be used to link to other triples (i.e. it is repeated within the imported data).
SPARQL queries RDF triples. The fourth column of a quad store is the named graph: RDF data can be divided into separate graphs of triples, analogous to tables in a database. You may be able to query the quad store with the following:
SELECT *
WHERE {
  GRAPH ?g {
    ?s ?p ?o
  }
}
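The same GRAPH query can of course be sent programmatically rather than through the web console. A sketch that builds the POST request with the standard library; the endpoint path /bigdata/sparql is an assumption based on Blazegraph's usual defaults and should be checked against your installation:

```python
from urllib import parse, request

# Assumed default Blazegraph SPARQL endpoint; adjust host/namespace as needed.
ENDPOINT = "http://127.0.0.1:9999/bigdata/sparql"

QUERY = """\
SELECT * WHERE {
  GRAPH ?g { ?s ?p ?o }
}
LIMIT 10"""

def build_query_request(endpoint, query):
    """Build a form-encoded SPARQL query POST, asking for JSON results."""
    body = parse.urlencode({"query": query}).encode("utf-8")
    return request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )

req = build_query_request(ENDPOINT, QUERY)
# urllib.request.urlopen(req) would execute it against a running Blazegraph.
```

If the store really was loaded in quads mode, the ?g column in the JSON response shows which named graph each triple landed in; triples loaded without an explicit graph end up in the default graph.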