I'm using Apache Jena to fetch a large amount of data from DBpedia and write it to a CSV file. However, I only get about 10,000 triples rather than the full result set, and I need every triple that matches the query. I can't tell whether this is an endpoint timeout or something else. The code I've written is as follows:
public class FetchCountriesData {
public void getCountriesInformation() throws FileNotFoundException {
ParameterizedSparqlString qs = new ParameterizedSparqlString("PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> \n "
+ "SELECT * { ?Subject rdf:type <http://dbpedia.org/ontology/Country> . ?Subject ?Predicate ?Object } ORDER BY ?Subject ");
QueryExecution exec = QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", qs.asQuery());
//exec.setTimeout(10000000);
exec.setTimeout(10, TimeUnit.MINUTES);
ResultSet results = exec.execSelect();
ResultSetFormatter.outputAsCSV(new FileOutputStream(new File("C:/fakepath/CountryData.csv")), results);
ResultSetFormatter.out(results);
}
}
You are almost certainly hitting one of DBpedia's limits. For further information, see http://wiki.dbpedia.org/OnlineAccess and http://lists.w3.org/Archives/Public/public-lod/2011Aug/0028.html
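If the endpoint caps the number of rows per response, one common way around it is to fetch the results in pages with LIMIT and OFFSET. The sketch below only builds the paged query strings; the class name PagedQuerySketch, the helper page, and the page size of 10,000 (taken from the row count reported in the question, not from DBpedia documentation) are assumptions made for illustration.

```java
final class PagedQuerySketch {

    // Appends LIMIT/OFFSET to a query that already ends with its ORDER BY clause.
    // pageIndex is zero-based; both arguments are hypothetical tuning knobs.
    static String page(String baseQuery, int pageSize, int pageIndex) {
        if (pageSize <= 0 || pageIndex < 0) {
            throw new IllegalArgumentException("pageSize must be > 0 and pageIndex >= 0");
        }
        long offset = (long) pageSize * pageIndex; // long avoids int overflow on deep pages
        return baseQuery.trim() + " LIMIT " + pageSize + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        String base = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
                + "SELECT * { ?Subject rdf:type <http://dbpedia.org/ontology/Country> . "
                + "?Subject ?Predicate ?Object } ORDER BY ?Subject";
        // Print the first three pages; a real loop would stop when a page
        // comes back with fewer than pageSize rows.
        for (int i = 0; i < 3; i++) {
            System.out.println(page(base, 10000, i));
        }
    }
}
```

Each generated string could then be passed to QueryExecutionFactory.sparqlService(...) as in the question, appending every page to the same CSV file. The ORDER BY matters here: without a stable ordering, consecutive pages are not guaranteed to be disjoint.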
I would like to ask how to run a federated SPARQL query against a subgraph of a SPARQL endpoint (not the entire remote endpoint).
My data is in Virtuoso v7 and the SPARQL endpoint is "http://localhost:8890/sparql". I'd like to run a remote query against one graph of this endpoint, "http://localhost:8890/TC", and I tried:
SELECT *
WHERE
{ SERVICE <http://localhost:8890/sparql>
{ SELECT ?subject ?predicate ?object
FROM <http://localhost:8890/TC>
WHERE
{ ?subject ?predicate ?object }
}
} LIMIT 50
I got an error saying that "FROM" is not used correctly, so I have two questions:
1) Can I run a remote query on a subgraph of a SPARQL endpoint?
2) Can I have a separate SPARQL endpoint for each graph in Virtuoso v7?
Thanks a lot for your help.
You can use GRAPH instead of FROM.
In your example:
SELECT *
WHERE
{
SERVICE <http://localhost:8890/sparql>
{
SELECT ?subject ?predicate ?object
WHERE
{ graph <http://localhost:8890/TC> { ?subject ?predicate ?object } }
}
} LIMIT 50
I tested this syntax with the following query on the UniProt SPARQL endpoint (Virtuoso), federating with DBpedia (also Virtuoso):
SELECT *
WHERE
{ SERVICE <http://dbpedia.org/sparql>
{select distinct ?activity where { graph <http://dbpedia.org> {?activity a <http://www.ontologydesignpatterns.org/ont/d0.owl#Activity>} } LIMIT 10
}
} LIMIT 50
I have my data organised in multiple graphs. The graph in which a triple is saved matters. The data structure is complicated but it can be simplified like this:
My store contains cakes, where there's a hierarchy of different cake types, all subclasses of <cake>
<http://example.com/a1> a <http://example.com/applecake>
<http://example.com/a2> a <http://example.com/rainbowcake>
...
Depending on how a user creates them in the UI, they end up in different graphs. For instance, if the user "bakes" a cake, it goes into the <http://example.com/homemade> graph; if they "buy" one, it goes into the <http://example.com/shopbought> graph.
When I retrieve my cakes from the store, I want to know for each cake whether it's homemade or shopbought. There is no property for this; I want to retrieve the information purely based on the graph the triple is stored in.
I have tried various ways of achieving this, but none of them work in Jena TDB. The problem is that all cakes come back as "shopbought". All of the queries, however, work in Fuseki (on the exact same dataset), and I was wondering whether this is a TDB bug or whether there's another way. Here are the simplified queries (without variations):
Version 1:
SELECT DISTINCT *
FROM <http://example.com/homemade>
FROM <http://example.com/shopbought>
FROM NAMED <http://example.com/homemade>
FROM NAMED <http://example.com/shopbought>
WHERE {
?cake rdf:type ?caketype .
?caketype rdfs:subClassOf* <cake>
{
GRAPH <http://example.com/homemade> { ?cake rdf:type ?typeHomemade }
} UNION {
GRAPH <http://example.com/shopbought> { ?cake rdf:type ?typeShopbought }
}
BIND(str(if(bound(?typeHomemade), true, false)) AS ?homemade)
}
Version 2:
SELECT DISTINCT *
FROM <http://example.com/homemade>
FROM <http://example.com/shopbought>
FROM NAMED <http://example.com/homemade>
FROM NAMED <http://example.com/shopbought>
WHERE {
?cake rdf:type ?caketype .
?caketype rdfs:subClassOf* <cake>
GRAPH ?g {
?cake rdf:type ?caketype .
}
BIND(STR(IF(?g=<http://example.com/homemade>, true, false)) AS ?homemade)
}
Any ideas why this works in Fuseki but not in TDB?
Edit:
I'm beginning to think it has something to do with the GRAPH keyword. Here are some much simpler queries (which work in Fuseki and tdbquery) and the results I get using the Jena API:
SELECT * WHERE { GRAPH <http://example.com/homemade> { ?s ?p ?o }}
0 results
SELECT * WHERE { GRAPH ?g { ?s ?p ?o }}
0 results
SELECT * FROM <http://example.com/homemade> WHERE { ?s ?p ?o }
x results
SELECT * FROM <http://example.com/homemade> WHERE { GRAPH <http://example.com/homemade> { ?s ?p ?o }}
0 results
SELECT * FROM NAMED <http://example.com/homemade> WHERE { GRAPH <http://example.com/homemade> { ?s ?p ?o }}
0 results
OK, so my solution actually has to do with the way I executed the query. My initial idea was to pre-filter the dataset so that a query is only executed on the relevant graphs (the dataset contains many graphs, and they can be quite large, which would make querying "everything" slow). This can be done either by naming the graphs in the SPARQL query or by selecting them directly in Jena (although the latter would not work for other triple stores). Combining both approaches "to be on the safe side", however, does not work.
This query runs on the entire dataset and works as expected:
Query query = QueryFactory.create("SELECT * WHERE { GRAPH ?g { ?s ?p ?o } }", Syntax.syntaxARQ);
QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
ResultSet result = qexec.execSelect();
The same query can also be executed on a specific graph only. In that case, whichever graph is chosen, it does not return any results:
//run only on one graph
Model target = dataset.getNamedModel("http://example.com/homemade");
//OR run on the union of all graphs
Model target = dataset.getNamedModel("urn:x-arq:UnionGraph");
//OR run on a union of specific graphs
Model target = ModelFactory.createUnion(dataset.getNamedModel("http://example.com/shopbought"), dataset.getNamedModel("http://example.com/homemade"), ...);
[...]
QueryExecution qexec = QueryExecutionFactory.create(query, target);
[...]
My workaround is now to always query the entire dataset (which supports the SPARQL GRAPH keyword fine) and to specify in each query the graphs it should run on, so that the query does not have to touch the entire dataset.
I'm not sure whether this is expected behaviour for the Jena API.
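As a rough illustration of that workaround, the graphs can be named inside the query text itself, for example with a SPARQL 1.1 VALUES block that constrains ?g. The class GraphScopedQuerySketch and the helper restrictToGraphs below are hypothetical names, and the sketch does not validate the IRIs it is given.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class GraphScopedQuerySketch {

    // Wraps a basic graph pattern in GRAPH ?g and limits ?g to the given graph IRIs.
    static String restrictToGraphs(String pattern, List<String> graphIris) {
        if (graphIris.isEmpty()) {
            throw new IllegalArgumentException("at least one graph IRI is required");
        }
        String values = graphIris.stream()
                .map(iri -> "<" + iri + ">")
                .collect(Collectors.joining(" "));
        return "SELECT * WHERE { VALUES ?g { " + values + " } GRAPH ?g { " + pattern + " } }";
    }

    public static void main(String[] args) {
        System.out.println(restrictToGraphs("?s ?p ?o",
                Arrays.asList("http://example.com/homemade", "http://example.com/shopbought")));
    }
}
```

The resulting string would be executed against the whole dataset, as in the first snippet of this answer (QueryExecutionFactory.create(query, dataset)), so ?g still tells you which graph each triple came from.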
I am using the Jena Java framework to query the DBpedia endpoint with SPARQL, to get the types of all points of interest in German cities. I have no problem with places that have English DBpedia entries. However, when the place name has to be queried as a German DBpedia resource (http://de.dbpedia.org/resource/Schloß_Nymphenburg), the query returns no results. This problem is also mentioned here (http://mail-archives.apache.org/mod_mbox/jena-users/201110.mbox/%3C4E877C8A.4050705#apache.org%3E). Even after reading that thread, I am unable to solve the problem, and I don't know how to work with QueryEngineHTTP. I am adding two code snippets: the first one works (a query for Allianz Arena, which has an English entry in DBpedia) and the second one doesn't (for Schloß Nymphenburg, which has a German entry).
This might be a very trivial issue, but I am unable to solve it. Any pointers to a solution would be very helpful.
Thanks a lot!
Code 1 - working :
String service = "http://dbpedia.org/sparql";
final ParameterizedSparqlString query = new ParameterizedSparqlString(
"PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>" +
"PREFIX dbo: <http://dbpedia.org/ontology/>" +
"PREFIX dcterms: <http://purl.org/dc/terms/>" +
"SELECT * WHERE {" +
"?s geo:lat ?lat ." +
"?s geo:long ?long ." +
"?s dcterms:subject ?sub}");
query.setIri("?s", "http://dbpedia.org/resource/Allianz_Arena");
QueryExecution qe = QueryExecutionFactory.sparqlService(service, query.toString());
ResultSet results = qe.execSelect();
ResultSetFormatter.out(System.out, results);
Code 2 - not working :
String service = "http://dbpedia.org/sparql";
final ParameterizedSparqlString query = new ParameterizedSparqlString(
"PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>" +
"PREFIX dbo: <http://dbpedia.org/ontology/>" +
"PREFIX dcterms: <http://purl.org/dc/terms/>" +
"SELECT * WHERE {" +
"?s geo:lat ?lat ." +
"?s geo:long ?long ." +
"?s dcterms:subject ?sub}");
query.setIri("?s", "http://de.dbpedia.org/resource/Schloß_Nymphenburg");
QueryExecution qe = QueryExecutionFactory.sparqlService(service, query.toString());
ResultSet results = qe.execSelect();
ResultSetFormatter.out(System.out, results);
I don't think this is an issue with Jena at all. Trying:
SELECT * WHERE {
<http://de.dbpedia.org/resource/Schloß_Nymphenburg> ?p ?o }
at http://dbpedia.org/sparql I get no results: try it yourself.
SELECT * WHERE {
<http://de.dbpedia.org/resource/Schloss_Nymphenburg> ?p ?o }
by contrast returns something, even if it's just a bunch of cross links.
I am trying to extract labels from DBpedia for some people. I am partially successful so far, but I got stuck on the following problem. The following code works.
public class DbPediaQueryExtractor {
public static void main(String [] args) {
String entity = "Aharon_Barak";
String queryString ="PREFIX dbres: <http://dbpedia.org/resource/> SELECT * WHERE {dbres:"+ entity+ "<http://www.w3.org/2000/01/rdf-schema#label> ?o FILTER (langMatches(lang(?o),\"en\"))}";
//String queryString="select * where { ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>; <http://www.w3.org/2000/01/rdf-schema#label> ?o FILTER (langMatches(lang(?o),\"en\")) } LIMIT 5000000";
QueryExecution qexec = getResult(queryString);
try {
ResultSet results = qexec.execSelect();
for ( ; results.hasNext(); )
{
QuerySolution soln = results.nextSolution();
System.out.print(soln.get("?o") + "\n");
}
}
finally {
qexec.close();
}
}
public static QueryExecution getResult(String queryString){
Query query = QueryFactory.create(queryString);
//VirtuosoQueryExecution vqe = VirtuosoQueryExecutionFactory.create (sparql, graph);
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query);
return qexec;
}
}
However, when the entity contains brackets, it does not work. For example,
String entity = "William_H._Miller_(writer)";
leads to this exception:
Exception in thread "main" com.hp.hpl.jena.query.QueryParseException: Encountered " "(" "( "" at line 1, column 86.
What is the problem?
It took some copying and pasting to see what exactly was going on. I'd suggest that you put newlines in your query for easier readability. The query you're using is:
PREFIX dbres: <http://dbpedia.org/resource/>
SELECT * WHERE
{
dbres:??? <http://www.w3.org/2000/01/rdf-schema#label> ?o
FILTER (langMatches(lang(?o),"en"))
}
where ??? is being replaced by the contents of the string entity. You're doing absolutely no input validation here to ensure that the value of entity will be legal to paste in. Based on your question, it sounds like entity contains William_H._Miller_(writer), so you're getting the query:
PREFIX dbres: <http://dbpedia.org/resource/>
SELECT * WHERE
{
dbres:William_H._Miller_(writer) <http://www.w3.org/2000/01/rdf-schema#label> ?o
FILTER (langMatches(lang(?o),"en"))
}
You can paste that into the public DBpedia endpoint, and you'll get a similar parse error message:
Virtuoso 37000 Error SP030: SPARQL compiler, line 6: syntax error at 'writer' before ')'
SPARQL query:
define sql:big-data-const 0
#output-format:text/html
define sql:signal-void-variables 1 define input:default-graph-uri <http://dbpedia.org> PREFIX dbres: <http://dbpedia.org/resource/>
SELECT * WHERE
{
dbres:William_H._Miller_(writer) <http://www.w3.org/2000/01/rdf-schema#label> ?o
FILTER (langMatches(lang(?o),"en"))
}
Better than hitting DBpedia's endpoint with bad queries, you can also use the SPARQL query validator, which reports for that query:
Syntax error: Lexical error at line 4, column 34. Encountered: ")" (41), after : "writer"
In Jena, you can use the ParameterizedSparqlString to avoid these sorts of issues. Here's your example, reworked to use a parameterized string:
import com.hp.hpl.jena.query.ParameterizedSparqlString;
public class PSSExample {
public static void main( String[] args ) {
// Create a parameterized SPARQL string for the particular query, and add the
// dbres prefix to it, for later use.
final ParameterizedSparqlString queryString = new ParameterizedSparqlString(
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
"SELECT * WHERE\n" +
"{\n" +
" ?entity rdfs:label ?o\n" +
" FILTER (langMatches(lang(?o),\"en\"))\n" +
"}\n"
) {{
setNsPrefix( "dbres", "http://dbpedia.org/resource/" );
}};
// Entity is the same.
final String entity = "William_H._Miller_(writer)";
// Now retrieve the URI for dbres, concatenate it with entity, and use
// it as the value of ?entity in the query.
queryString.setIri( "?entity", queryString.getNsPrefixURI( "dbres" )+entity );
// Show the query.
System.out.println( queryString.toString() );
}
}
The output is:
PREFIX dbres: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
<http://dbpedia.org/resource/William_H._Miller_(writer)> rdfs:label ?o
FILTER (langMatches(lang(?o),"en"))
}
You can run this query at the public endpoint and get the expected results. Notice that if you use an entity that doesn't need special escaping, e.g.,
final String entity = "George_Washington";
then the query output will use the prefixed form:
PREFIX dbres: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
dbres:George_Washington rdfs:label ?o
FILTER (langMatches(lang(?o),"en"))
}
This is very convenient, because you don't have to do any checking about whether your suffix, i.e., entity, has any characters that need to be escaped; Jena takes care of that for you.
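To make the behaviour described above concrete, here is a deliberately simplified, standalone sketch of the same decision: use the prefixed form when the local name is trivially safe, otherwise fall back to the full IRI in angle brackets. It is not Jena's implementation; the class IriTermSketch, the helper term, and the conservative "letters, digits and underscores only" rule are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class IriTermSketch {

    // Conservative subset of legal SPARQL local names (assumption: anything else
    // is written as a full IRI, even if the grammar would allow it with escaping).
    private static final Pattern SIMPLE_LOCAL = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Characters that the SPARQL grammar does not allow inside <...>.
    private static final Pattern IRI_FORBIDDEN = Pattern.compile("[\\x00-\\x20<>\"{}|^`\\\\]");

    static String term(String prefix, String namespace, String localName) {
        if (SIMPLE_LOCAL.matcher(localName).matches()) {
            return prefix + ":" + localName;
        }
        String iri = namespace + localName;
        if (IRI_FORBIDDEN.matcher(iri).find()) {
            throw new IllegalArgumentException("not usable as an IRI: " + iri);
        }
        return "<" + iri + ">";
    }

    public static void main(String[] args) {
        String ns = "http://dbpedia.org/resource/";
        System.out.println(term("dbres", ns, "George_Washington"));
        System.out.println(term("dbres", ns, "William_H._Miller_(writer)"));
    }
}
```

With ParameterizedSparqlString available there is no reason to hand-roll this; the sketch is only meant to show why pasting an arbitrary string after dbres: is unsafe while the bracketed IRI form is not.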
For some reason I can't issue DESCRIBE queries using Redland (librdf.org). Is it possible to rewrite a DESCRIBE as a CONSTRUCT query for a given URI?
DESCRIBE <urn:my-uri>
I was thinking about rewriting it as something like this, but I don't think this is valid SPARQL:
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
{ ?subject ?predicate ?object }
AND {
{ <urn:my-uri> ?predicate ?object }
OR { ?subject <urn:my-uri> ?object }
OR { ?subject ?predicate <urn:my-uri> }
}
}
You are right, that is not valid SPARQL. The closest thing to your OR is UNION. There is also no need for an AND operator: triple patterns in the same group are joined by default, not unioned.
For what you are trying to do, it is better to use a FILTER, as in this example:
CONSTRUCT { ?subject ?predicate ?object }
WHERE { ?subject ?predicate ?object .
FILTER ( ?subject = <urn:your_uri> || ?object = <urn:your_uri>)
}
In some systems, for large knowledge bases, this query can be very expensive. Also, if your database contains blank nodes (bNodes), this query won't retrieve the description of those nodes; it will return only their internal identifiers. In most cases, emulating DESCRIBE manually can't be done with a single query, and you'll have to implement some recursive logic to get all the information that describes a URI.
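The recursive logic mentioned above can be sketched over an in-memory list of triples. This is only an illustration of the idea (collect the triples whose subject is the URI, then follow any blank-node objects), not how a particular store implements DESCRIBE. The class DescribeSketch, the String[] triple layout and the "_:" blank-node convention are assumptions made for the example; against a real endpoint each step would be a separate query, and blank-node labels are generally not stable across queries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DescribeSketch {

    // Each triple is {subject, predicate, object}; blank nodes start with "_:".
    static List<String[]> describe(List<String[]> store, String start) {
        List<String[]> description = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(start);

        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (!visited.add(node)) {
                continue; // already expanded; guards against blank-node cycles
            }
            for (String[] triple : store) {
                if (triple[0].equals(node)) {
                    description.add(triple);
                    if (triple[2].startsWith("_:")) {
                        pending.push(triple[2]); // expand blank-node objects too
                    }
                }
            }
        }
        return description;
    }
}
```

For a store containing <urn:my-uri> <urn:p> _:b1 and _:b1 <urn:q> "v", describing urn:my-uri returns both triples, whereas the single FILTER query would return only the first one.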
After trying something like the FILTER ( A || B ) method, I got the impression that it is pretty slow.
I think you can do basically the same thing using VALUES and UNION.
I tried it on DBpedia (~2.46 billion triples) with a movie, and it seemed to perform well.
CONSTRUCT {
?subject ?predicate ?object
}
WHERE {
{ ?subject ?predicate ?object .
VALUES ?subject { dbpedia:The_Matrix }
}
UNION
{ ?subject ?predicate ?object .
VALUES ?object { dbpedia:The_Matrix }
}
}
Edit: Just for the sake of additional info, I think you could technically also write the following:
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
?subject ?predicate ?object .
OPTIONAL { dbpedia:The_Matrix ?predicate ?object . }
OPTIONAL { ?subject ?predicate dbpedia:The_Matrix . }
}
but some popular RDF databases can't yet evaluate OPTIONAL efficiently, and the query may time out or fail.