I'm developing an app that queries LinkedMDB (LMDB). The LMDB endpoint caps each result set at 2500 rows to keep the page from crashing. I want to query the whole of LMDB. Is there a way to get all of the data rather than only those 2500 results?
The query that I run:
PREFIX mdb: <http://data.linkedmdb.org/resource/movie/film>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?movieName ?resource WHERE {
?resource dc:title ?movieName .
?resource mdb:id ?uri .
FILTER regex(?movieName,'The Mexican', 'i')}
I get no results.
If I change the last line:
FILTER regex(?movieName,'The Mexican', 'i')} to ?resource dc:title 'The Mexican'
I get the desired results. This means that the data about this movie is actually in the LMDB but because of the LIMIT I can't access all entries unless I type in the EXACT name of the movie I'm searching for.
The app that I'm building uses search fields, and I can't expect users to know the exact name of a movie, which is why I'm using FILTER regex().
What would be the way to fix this problem? Please help me.
Should I download a local copy and run the queries against it or is there another way?
I wanted to contact the LMDB staff about this, but I couldn't find any contact information.
I am running the following query with Wikidata's SPARQL endpoint:
SELECT ?item ?itemLabel ?articleSimple ?article WHERE {
?articleSimple schema:about ?item ;
schema:isPartOf <https://simple.wikipedia.org/> .
?article schema:about ?item ;
schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 1000
(Thanks @UninformedUser for the help!) It runs fine (under 1 minute) for up to LIMIT 200000, but times out soon after that.
I was hoping to find a way to get all results back either through pagination or splitting the query in some way (for instance Wikipedia pages that start with A, B, etc.).
Unfortunately, any time I add an ORDER BY statement, it makes the query go over the time limit. Any ideas on how to approach the problem?
Otherwise, one option would be to download the full Wikidata dump and scan through it, but it seems inefficient.
Thank you very much!
I'm new to Apache Jena Fuseki and SPARQL. I have a problem when I try to query data on Fuseki. The data I used is the DBpedia dataset named 'Topical Concepts' (can be found here). I uploaded the data through the control panel in a browser (on the default port 3030) and used the query below:
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?subject
}
LIMIT 25
I got an empty table and the message "no data available in this table". However, when I installed Fuseki and did the same thing on my Mac (the problem above happened on my desktop running Ubuntu 16), I successfully got 25 entries of the data. I don't think the operating system is the problem, but I really have no idea why this happened. Has anybody encountered the same problem?
In your SPARQL query, you have the following pattern:
?subject ?predicate ?subject
Notice that you repeat ?subject. This query effectively asks: "give me all RDF triples for which the subject is the same value as the object". It's likely that the reason you're not getting a result is simply that no such triples exist in your database.
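To make the effect of the repeated variable concrete, here is a minimal sketch in plain Python. It is not how Fuseki evaluates queries; the three triples and the two helper functions are made up purely for illustration.

```python
# Toy triples standing in for the uploaded dataset (made-up values).
TRIPLES = [
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:France", "rdf:type", "ex:Country"),
]


def match_any(triples):
    """Mimics `?subject ?predicate ?object`: every triple matches."""
    return list(triples)


def match_subject_equals_object(triples):
    """Mimics `?subject ?predicate ?subject`: the repeated variable must
    bind to the same value in the subject and the object position."""
    return [(s, p, o) for (s, p, o) in triples if s == o]


print(len(match_any(TRIPLES)))                    # 3
print(len(match_subject_equals_object(TRIPLES)))  # 0
```

With ordinary data like this, the second pattern only matches if some resource is related to itself, which is why an empty table is the expected outcome.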
As for why this didn't happen on your Mac, without more info about your setup we can only speculate. It's possible that you configured your database slightly differently there (e.g. by enabling a reasoner, which would produce additional RDF triples that do match the query), or you may simply have run a slightly different query there.
I am making two assumptions for answering your question:
You have two different instances of Jena installed, one on your laptop and one on your desktop.
You are sure you have uploaded the data, possibly into a named graph, leaving the default graph empty.
Fuseki (I haven't tried this on TDB) has a feature that is often set by default to query only the default graph. If you set tdb:unionDefaultGraph true ; in the configuration, it will query all the graphs. Before changing the settings, please check that this is actually the cause. You can check by executing this query:
SELECT distinct ?g
WHERE {
graph ?g{
?s ?p ?o
}
}
If you get a result, that means you need to change the settings for it to work, or be mindful of this fact and always run your queries with a GRAPH clause (as in the above query).
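The situation described above can be sketched with a toy quad store in plain Python. This is only an illustration under stated assumptions: the graph name and triples are made up, `query_default` and `named_graphs` are hypothetical helpers, and the union behaviour is modelled as "the default graph becomes the union of the named graphs", which is my understanding of tdb:unionDefaultGraph.

```python
# Toy quad store: (graph, subject, predicate, object).
# graph=None stands for the default graph. All names are made up.
QUADS = [
    ("http://example.org/topical", "ex:a", "ex:p", "ex:b"),
    ("http://example.org/topical", "ex:b", "ex:p", "ex:c"),
]


def query_default(quads, union_default_graph=False):
    """Mimics `SELECT * WHERE { ?s ?p ?o }` with no GRAPH clause."""
    if union_default_graph:
        # Assumption: the query sees the union of all named graphs.
        return [(s, p, o) for (g, s, p, o) in quads if g is not None]
    return [(s, p, o) for (g, s, p, o) in quads if g is None]


def named_graphs(quads):
    """Mimics `SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }`."""
    return sorted({g for (g, s, p, o) in quads if g is not None})


print(query_default(QUADS))   # []
print(named_graphs(QUADS))    # ['http://example.org/topical']
print(len(query_default(QUADS, union_default_graph=True)))  # 2
```

An empty result from the plain query together with a non-empty list of named graphs is the pattern that points to this cause.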
For more explanation please refer to https://jena.apache.org/documentation/serving_data/
Okay this seems like a really basic question, but for some reason I am not able to figure this out. I have the DBpedia 2014 owl file from here. Now when I load this in Protégé and look at the Ontology metrics tab, I see that the class count is 814, Object property count is 1310, Data property count is 1725. Is this the right number? Out of curiosity I tried to check the numbers on the Virtuoso endpoint and for the query
select ?p (count(?p) as ?totalCount) where {?s ?p ?o } group by ?p order by DESC(?totalCount)
i.e. trying to find the properties and the total number of times they appear in the graph, I find that the total is 10,000. Now I am not sure if this is the right way to check the properties and the number of times they appear in the graph.
For classes when I issue this query :
SELECT ?class
WHERE {
?class rdf:type rdfs:Class.
}
I don't get any results at all. Now using the default query in Virtuoso i.e.
Select count(distinct ?Concept) where {[] a ?Concept}
I get the value 369857. So I am a bit confused. Is this number so large because the graph has concepts from YAGO, UMBEL, schema.org and purl, or am I looking at something the wrong way? Are the Concepts completely different from the classes (interpreted differently, in a way I haven't thought of)?
Honestly, I got waylaid by these numbers because I need them to calculate the selectivity as defined in this paper.
There they say that for a triple pattern, the selectivity of a subject is 1/R, where R is the number of resources. So do the resources mean the class count, the Concept count, or the count of ?s in the ?s ?p ?o . triple pattern?
The DBpedia ontology contains just the axioms for classes and properties with the namespace http://dbpedia.org/ontology.
The DBpedia SPARQL endpoint contains much more data:
First, it contains triples with properties that have the namespace http://dbpedia.org/property . Those properties are untyped (i.e. of type rdf:Property), which in fact means that the value can be either a resource or a literal. In OWL we have typed properties, i.e. object and data properties.
Other information loaded into the SPARQL endpoint includes, among others, links to external datasets like YAGO or the upper-level ontology UMBEL. You can find more details here [1], [2].
By the way, you can see that easily from your first query. There are many more properties, with different namespaces.
Regarding your first query: it's the correct query if you want the number of triples for each property. It returns only 10000 rows because that's the default result set limit of the Virtuoso triple store in which DBpedia is loaded. For more results you have to use pagination. The total number of properties used in triples can be found with
SELECT (COUNT(DISTINCT ?p) AS ?cnt)
WHERE
{ ?s ?p ?o}
Your second query for all classes of type rdfs:Class returns nothing because no class in DBpedia is of that type. For OWL ontologies it's more usual to query for classes of type owl:Class. The third query in fact returns all resources that ever occur in object position in rdf:type triples, which is slightly different because it works on instance data. That means it returns all classes that are really used in the data.
As for your last question: I haven't read the paper, but a common metric in research papers is the number of distinct subjects that use a given property.
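The difference between "declared as a class" and "used as a type" can be sketched with a few toy triples in plain Python. The data and both helper functions are made up for illustration (for instance, `yago:Universities` is a hypothetical name), and this is not how Virtuoso evaluates the queries.

```python
# Toy data: two schema triples and two instance triples (made-up values).
TRIPLES = [
    ("dbo:University", "rdf:type", "owl:Class"),
    ("dbo:Drug", "rdf:type", "owl:Class"),
    ("dbr:Xiamen_University", "rdf:type", "dbo:University"),
    ("dbr:Xiamen_University", "rdf:type", "yago:Universities"),
]


def declared_classes(triples, class_type="owl:Class"):
    """Mimics `?class rdf:type owl:Class`: subjects typed as a class."""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == class_type}


def used_concepts(triples):
    """Mimics `[] a ?Concept`: anything in object position of rdf:type."""
    return {o for (s, p, o) in triples if p == "rdf:type"}


print(sorted(declared_classes(TRIPLES)))
print(sorted(declared_classes(TRIPLES, class_type="rdfs:Class")))
print(sorted(used_concepts(TRIPLES)))
```

In this toy data the second call returns nothing, and the third also picks up types from other vocabularies that were never declared as classes in the ontology, which is the same effect that inflates the Concept count on the endpoint.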
I downloaded some data from http://downloads.dbpedia.org/2015-04/core/,
including: instance-types_en.nt, mappingbased-properties_en.nt, and some others.
I have successfully loaded them into OpenLink Virtuoso, but when I run a sample SPARQL query, e.g. a query to see all the triples about the subject Xiamen_University, a problem appears.
select ?s ?p ?o
where
{
?s rdfs:label "Xiamen University"@en .
?s ?p ?o .
}
On the DBpedia SPARQL endpoint, there are heaps of triples about Xiamen_University, while in my database there are only 4 or 5 of them.
In particular, there is no triple in my database indicating that Xiamen_University is of type University, nor any instance-type triples at all. I found similar cases for some other subjects as well.
I think the instance-types_en.nt file does not include all the instance-type triples from Wikipedia, and the same goes for mappingbased-properties. Is that right? If so, where could I find the right source file?
There's a whole list of the available datasets on the downloads page. I don't see much documentation about exactly what's in each one of them, but the names are fairly descriptive, and the question mark links next to each one show a preview of the kind of information it contains. Hovering over each title provides a short description.
It looks like to get most of the interesting properties, you'd probably want the mappingbased datasets, along with the labels dataset (since the query that you wrote identifies objects by label).
I am running this query to get a list of all compounds from the public DBpedia SPARQL endpoint.
SELECT * WHERE {
?y rdf:type dbpedia-owl:Drug.
?y rdfs:label ?Name .
OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
OPTIONAL {?y dcterms:subject ?y1}
FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000
I am downloading in batches of 50000 (2 files) using the OFFSET parameter.
Somehow Isopropyl_alcohol is not covered by this, even though its page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for?
There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates
happening almost each second. Hence, data stored in the official
DBpedia endpoint can quickly become outdated, and Wikipedia articles
need to be re-extracted. DBpedia Live enables such a continuous
synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
The second issue is the one that AndyS pointed out. Even if the query would select Isopropyl_alcohol in DBpedia or DBpedia Live, unless you provide an ordering constraint, the limit/offset combination won't be guaranteed to return it, since without an ordering constraint, the server could legitimately return the same set of 50000 results to you every time.
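As a rough sketch of the ordering point, the following Python snippet builds the paged queries with an explicit ORDER BY so that each LIMIT/OFFSET slice is cut from the same ordering. The helper `paged_queries` is hypothetical, the query body is a simplified version of the one in the question (the OPTIONAL clauses are left out), and it assumes the endpoint predefines the prefixes, as the original query does.

```python
def paged_queries(select_body, order_vars, page_size, pages):
    """Yield SPARQL strings that walk a result set page by page.

    ORDER BY fixes the row order so that successive OFFSET slices
    are cut from the same sequence of rows.
    """
    for page in range(pages):
        offset = page * page_size
        yield (
            f"{select_body}\n"
            f"ORDER BY {order_vars}\n"
            f"LIMIT {page_size}\n"
            f"OFFSET {offset}"
        )


# Simplified body based on the query in the question.
BODY = """SELECT ?y ?Name WHERE {
  ?y rdf:type dbpedia-owl:Drug .
  ?y rdfs:label ?Name .
  FILTER (langMatches(lang(?Name), "en"))
}"""

for query in paged_queries(BODY, "?y ?Name", 50000, 2):
    print(query, end="\n\n")
```

Sorting adds work on the server, so on a public endpoint with a time limit an ordered query may be slower than the unordered one.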
Maybe it is not finding it in the LIMIT/OFFSET combination you are using. The server is not obliged to answer queries in the same order every time unless you use ORDER BY, so maybe the slices you have do not in fact cover all results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.