Wikipedia is geotagging a lot of its articles. (Look in the top right corner of the page.)
Is there any API for querying all geotagged pages within a specified radius of a geographical position?
Update
Okay, so based on lost-theory's answer I tried this (on DBpedia query explorer):
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?subject ?label ?lat ?long WHERE {
?subject geo:lat ?lat.
?subject geo:long ?long.
?subject rdfs:label ?label.
FILTER(xsd:float(?lat) - 57.03185 <= 0.05 && 57.03185 - xsd:float(?lat) <= 0.05
&& xsd:float(?long) - 9.94513 <= 0.05 && 9.94513 - xsd:float(?long) <= 0.05
&& lang(?label) = "en"
).
} LIMIT 20
This is very close to what I want, except it returns results within a (local) square around the point and not a circle. Also I would like if the results where sorted based on the distance from the point. (If possible.)
Update 2
I am trying to determine the euclidean distance as an approximation of the true distance, But I am having trouble on squaring a number in SPARQL. (Question opened here.) When I get something useful I will update the question, but in the meantime I will appreciate any suggestions on alternative approaches.
Update 3
A final update. I gave up on using SPARQL through DBpedia. I have written a simple parser which fetches the Wikipedia article text nightly database dump and parses all articles for geocodes. It works rather nicely and it allows me to store information about geotagged articles however I wish.
This is probably the solution I will continue using, and if I get around to create a nice interface to it I might consider allowing public API access and/or publishing the source to the parser.
The OpenLink Virtuoso server used by the dbpedia endpoint has several query features. I found the information on http://docs.openlinksw.com/virtuoso/rdfsparqlgeospat.html useful for a similar problem.
I ended up with a query such as this:
SELECT ?page ?lat ?long (bif:st_distance(?geo, bif:st_point(15.560278, 58.394167)))
WHERE{
?m foaf:page ?page.
?m geo:geometry ?geo.
?m geo:lat ?lat.
?m geo:long ?long.
FILTER (bif:st_intersects (?geo, bif:st_point(15.560278, 58.394167), 30))
}
ORDER BY ASC 4 LIMIT 15
This example retrieves the geotagged locations within 30 km from the origin position.
You should be able to query for latitude/longitude using SPARQL and dbpedia. An example (from here):
SELECT distinct ?s ?la ?lo ?name ?country WHERE {
?s dbpedia2:latitude ?la .
?s dbpedia2:longitude ?lo .
?s dbpedia2:officialName ?name .
?s dbpedia2:country ?country .
filter (
regex(?country, 'England|Scotland|Wales|Ireland')
and regex(?name, '^[Aa]')
)
}
You can run your own queries here.
There are a couple of tools listed on Tools and applications based on coordinates from Wikipedia. I'm not sure if it's what you're looking for, but the Geosearch.py tool looks pretty cool.
Not an API, but you can also download this nice set of all geo-tagged wikipedia articles and query it directly in a local database:
http://www.google.com/fusiontables/DataSource?dsrcid=423292
The free GeoNames.org FindNearbyWikipedia service can fetch geotagged articles for a give postal code or coordinates (latitude, longitude)
It provides 30,000 credits daily limit per application (identified by the parameter 'username'), the hourly limit is 2000 credits. A credit is a web service request hit for most services. An exception is thrown when the limit is exceeded.
I'm not familiar enough with SPARQL, but if it can use power in its filter then its easy to compute the distance of a given article from a given point using Pythagoras theorem (a^2 + b^2 = c^2) and that would give you all the articles in a radius.
Another option would be to get a Wikipedia data dump and process it yourself - this is what I did when I needed to do some linguistic analysis on Wikipedia article.
Related
Looking for SPARQL query to do the following:
For example, I have the word Apple. Apple may refer to the organization Apple_Inc or the Species of Plants class as per the ontology. Owl: Thing has a subclass called Species, so I want to return those most relevant/maximum-hit URIs where the keyword Apple does not belong to the Species subclass. So when you return all the URIs, http://dbpedia.org/page/Apple should not be one of them, neither must ANY relevant link that comes under Species subclass.
By maximum-hit/most relevant I mean the top returned results that match the query! Like when you access the PrefixSearch (i.e. Autocomplete) API, it has the parameter called MaxHits.
For example http://lookup.dbpedia.org/api/search/PrefixSearch?QueryClass=&MaxHits=2&QueryString=berl is a link where you want to return the top 2 URIs that match the QueryString=berl.
Like I'm actually really struggling to even explain the work I've done so far because I'm not able to understand the structure and how to formulate a proper query..
with respect to negation in SPARQL, I found a relevant portion of the documentation in the link here.. But I do not know how and where to proceed from there, and cannot understand why keywords like ?person are used.. I can understand the person is used to selected well.. PEOPLE names, but I would like to know how and where to find these keywords like ?person, ?name to represent a specific entity..
SELECT ?uri ?label
WHERE {
?uri rdfs:label ?label .
filter(?label="car"#en)
}
I would really appreciate if someone could link me the part of the documentation I can clearly read and understand that ?uri is used to select a URI in the form www.dbpedia.org'/page/SomeEntity and what these ?person, ?name, ?label represent.
I'm actually so lost.. I will go up and start eating one elephant at a time. For now, I'll be very grateful if I get an answer to this.
If there is anyway you know where I can avoid learning and using SPARQL, that would work too! I know Python well enough, so leveraging an API to pull this information is also fine by me. This question was posted by me.
Answer posted by #Stanislav-Kravin --
SELECT DISTINCT ?s
WHERE
{ ?s a owl:Thing .
?s rdfs:label ?label .
FILTER ( LANGMATCHES ( LANG ( ?label ), 'en' ) )
?label bif:contains '"apple"' .
FILTER NOT EXISTS { ?s rdf:type/rdfs:subClassOf* dbo:Species }
}
I'm doing a Sparql Query right now, to find Places around a specific radius of a given Point from dbpedia Endpoint (Snorql).
My first solution (already doing at some other Endpoints) was this:
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT *
WHERE {
?resource rdfs:label ?label .
?resource geo:lat ?lat .
?resource geo:long ?long .
?resource geo:geometry ?coordinates .
FILTER(bif:st_within(?coordinates, bif:st_geomFromText("POINT(10.2788 47.4093)"), 1)) .
FILTER (lang(?label)= "de") .
}
I noticed that it doesn't give me any results. Then I tried the same thing with the given rounded values in geo:lat and geo:long:
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT *
WHERE {
?resource rdfs:label ?label .
?resource geo:lat ?lat .
?resource geo:long ?long .
?resource geo:geometry ?coordinates .
FILTER(bif:st_within(bif:st_point(?long, ?lat), bif:st_geomFromText("POINT(10.2788 47.4093)"), 1)) .
FILTER (lang(?label)= "de") .
}
Now I'm getting 2 results. When I'm increasing the radius of the first solution to 21, there are plenty results, but decreasing it to 20, there are no results. Is there a mistake I made in the first Query?
Thank you very much,
SaW
As I answered on Confluence...
Interesting!
Your original query (with addition of FROM <http://dbpedia.org> clause) does return the expected results when run against the LOD Cloud Cache, which is now on a somewhat older Virtuoso engine. This looks like a regression in newer versions.
To check things on DBpedia, I started with your query, and added a couple of BINDs using the st_distance() function to the WHERE clause --
BIND
( bif:st_distance
( ?coordinates,
bif:st_geomFromText("POINT(10.2788 47.4093)")
) AS ?coord_distance
) .
BIND
( bif:st_distance
( bif:st_point(?long, ?lat),
bif:st_geomFromText("POINT(10.2788 47.4093)")
) AS ?latlong_distance
)
}
I also added a final ORDER BY ?coord_distance to the query.
My results on DBpedia.org/sparql clearly show two entities within your desired radius of 1, and the calculated distances are the same whether based on ?coordinates or st_point(?long, ?lat) but they are not delivered unless bif:st_within specifies a radius of 21 or greater -- and those results include a number of other entities that are within the larger radius.
I've raised this to Virtuoso Development, and it's being tracked internally as bug#18399.
... and as followed up there ...
st_within() uses st_distance(), so given that the srid is 4326 (as is typical for DBpedia geodata), "the haversine function is used to compute a great circle distance in kilometers on Earth." You can divide your distance in meters by 1000 (or multiply it by 0.001) to get the distance in kilometers for use in the st_within() call.
The computing time is dependent on the instance host, other load on the instance, etc. The public DBpedia instance's response time may well be longer than you can tolerate. You can set up your own mirror in a local server or in the cloud (AMIs based on DBpedia 2016-10 Snapshot [current DBpedia.org/sparql], or DBpedia-Live [current live.DBpedia.org/sparql]), which you can put on any AWS instance type -- so you can give it as much processor and/or RAM as you like.
Note that the LOD Cloud Cache instance may be upgraded to a newer Virtuoso engine at any time, so you should not rely on this delivering the desired results via st_within(). A slightly adjusted DBpedia query will deliver what I think you want using only the st_distance() function (here calculated from ?coordinates, but you could also use the more complex construction based on ?long and ?lat), and not the malfunctioning st_within().
I want to query DBpedia with DBpedia Live endpoint.
I have this query :
SELECT *
WHERE {
?x a dbo:Person .
?x rdfs:label "Usain Bolt"#en .
}
This query gives the correct answer with most names I tried (for example “Teddy Riner"#en) but it fails with Usain Bolt and Rachid Badouri.
I don’t get why as their DBpedia pages (Teddy Riner, Usain Bolt) are constructed the same way: they both have a rdfs:label, which is written exactly like I did.
It seems to me that there is an incoherence between the endpoint and DBpedia. But I don’t think that it's because the endpoint is not to date.
Even more surprising, this query gives the correct answer:
SELECT *
WHERE {
?x rdfs:label "Usain Bolt"#en .
}
However, Usain Bolt is a dbo:Person! Same thing for Rachid Badouri.
Could someone explain me why the first query does not give answer?
Any help would be appreciated! Thanks
According to DBpedia-Live, at the time of writing, the entity with rdfs:label "Usain Bolt"#en has many types, but is not a dbo:Person. Similar for the entity with rdfs:label "Rachid Badouri"#en.
In contrast, the entity with rdfs:label "Teddy Riner"#en is a dbo:Person.
Note: DBpedia-Live content is a moving target, varying with Wikipedia content changes, adjustments in the templates, and other variables. The statements I made above may no longer be true when you read this.
Say I need to fetch content from wikipedia about all mountains. My target is to show initial paragraph, and an image from respective article (eg. Monte Rosa and Vincent Pyramid.
I came to know about dbpedia, and with some research got to find that it provides live queries into wiki database directly.
I have 2 questions:
1 - I am finding it difficult how could I formulate my queries. I can't play around iSPARQL. I tried following query but it throws error saying invalid xml.
SELECT DISTINCT ?Mountain FROM <http://dbpedia.org> WHERE {
[] rdf:type ?Mountain
}
2 - My requirement is to show only mountains that have at least 1 image (I need to show this image too). Now the ones I listed above have images, but how could I be sure? Also, looking at both examples I see many fields differ in wiki articles - so for future extension it maybe quite difficult to fetch them.
I just want to reject those which do not have sufficient data or description.
How can I filter out mountains based on pictures present?
UPDATE:
My corrected query, which solves my first problem:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name ?description
WHERE {
?name rdf:type <http://dbpedia.org/ontology/Mountain>;
dbpedia-owl:abstract ?description .
}
You can also query dbpedia using its SPARQL endpoint (less fancy than iSPARQL). To find out more about what queries to write, take a look at the DBpedia's datasets page. The examples there show how one can select pages based on Wikipedia categories. To select resources in the Wikipedia Mountains category, you can use the following query:
select ?mountain where {
?mountain a dbpedia-owl:Mountain .
}
SPARQL Results
Once you have some of these links in hand, you can look at them in a web browser and see the data associated with them. For instance the page for Mount Everest shows lots of properties. For restricting results to those pages that have an image, you might be interested in the dbpedia-owl:thumbnail property, or perhaps better yet foaf:depiction. For the introductory paragraph, you probably want something like the dbpedia-owl:abstract. Using those, we can enhance the query from before. The following query finds things in the category Stratovolcanoes with an abstract and an depiction. Since StackOverflow is an English language site, I've restricted the abstracts to those in English.
select * where {
?mountain a dbpedia-owl:Mountain ;
dbpedia-owl:abstract ?abstract ;
foaf:depiction ?depiction .
FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 10
SPARQL Results
I'm working with the following SPARQL query, which is an example on the web-based end of my institution's SPARQL endpoint;
SELECT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
The problem is that as well as getting data from 'Buildings and Places', the Dataset I'm interested in, and would expect the example to use, it also gets data from the 'Facilities and Equipment' dataset, which isn't relevant. You should see this if you follow the link.
I suspect the example may pre-date the addition of the Facilities and Equipment dataset, but even with the research I've done into SPARQL, I can't see a clear way to define which datasets to include.
Can anyone recommend a starting point to limit it to just show 'Buildings', or, more specifically, results from the 'Buildings and Places' dataset.
Thanks
First things first, you really need to use SELECT DISTINCT, as otherwise you'll get repeated results.
To answer your question, you can use GRAPH { ... } to filter certain parts of a SPARQL query to only match data from a specific dataset. This only works if the SPARQL endpoint is divided up into GRAPHs (this one is). The solution you asked for isn't the best choice, as it assumes that things within sites in the 'places' dataset will always be resticted to buildings... That's risky -- as it might end up containing trees and signposts at some time in the future.
Step one is to just find out what graphs are in play:
SELECT DISTINCT ?g1 ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH ?g1 { ?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Try it here: http://is.gd/WdRAGX
From this you can see that http://id.southampton.ac.uk/dataset/places/latest and http://id.southampton.ac.uk/dataset/places/facilities are the two relevant ones.
To only look for things 'within' a site according to the "places" graph, use:
SELECT DISTINCT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH <http://id.southampton.ac.uk/dataset/places/latest> {
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Alternate solutions:
Using rdf:type
Above I've answered your question, but it's not the answer to your problem. This solution is more semantic as it actually says 'only give me buildings within the campus' which is what you really mean.
Instead of filtering by graph, which is not very 'semantic' you could also restrict ?building to be of class 'building' which research facilities are not. They are still sometimes listed as 'within' a site. Usually when the uni has only published what campus they are on but not which building.
?building a rooms:Building
Using FILTER
In extreme cases you may not have data in different GRAPHS and there may not be an elegant relationship to use to filter your results. In this case you can use a FILTER and turn the building URI into a string and use a regular expression to match acceptable ones:
FILTER regex(str(?building), "^http://id.southampton.ac.uk/building/")
This is bar far the worst option and don't use it if you have to.
Belt and Braces
You can use any of these restictions together and a combination of restricting the GRAPH plus ensuring that all ?buildings really are buildings would be my recommended solution.