Limit Graph DB responses per category - sparql

I'm sure there's already an SO question asking the same thing, but I have been unable to find it. Perhaps I just don't know the right words to express this question properly. Some context: I'm deciding whether AWS Neptune is a good fit for an upcoming project, so I apparently have access to both a SPARQL engine and a TinkerPop/Gremlin engine, if that helps answer the question.
Let's say I have items in a graph database. Those items have a "property" called category. Is there a way to get a maximum of 20 items from each distinct category?
I've scoured various SPARQL resources (e.g. the docs) and haven't found anything describing this sort of functionality, and I've never had to do this in my previous SPARQL encounters. I'm not too familiar with TinkerPop and Gremlin, but my initial readings haven't touched on this either.

It's fairly straightforward with Gremlin, using the air-routes graph, which has a region property for each airport. The following query will return at most five airports each for California and Texas (the graph contains more than five for each state).
gremlin> g.V().has('airport','region',within('US-CA','US-TX')).
           group().
             by('region').
             by(identity().limit(5).fold())
==>[US-TX:[v[3],v[8],v[11],v[33],v[38]],US-CA:[v[13],v[23],v[24],v[26],v[27]]]
EDITED: Added an additional example that does not look for specific regions.
gremlin> g.V().hasLabel('airport').
           limit(50).
           group().
             by('region').
             by(identity().limit(5).fold())
==>[US-FL:[v[9],v[15],v[16],v[19],v[25]],US-NV:[v[30]],US-HI:[v[37]],US-TX:[v[3],v[8],v[11],v[33],v[38]],US-WA:[v[22]],US-NY:[v[12],v[14],v[32],v[35]],US-NC:[v[21]],US-LA:[v[34]],GB-ENG:[v[49],v[50]],US-PA:[v[45]],US-DC:[v[7]],US-NM:[v[44]],US-AZ:[v[20],v[43]],US-TN:[v[4]],CA-BC:[v[48]],CA-ON:[v[47]],PR-U-A:[v[40]],US-MN:[v[17]],US-IL:[v[18]],US-AK:[v[2]],US-VA:[v[10]],US-CO:[v[31]],US-MD:[v[6]],US-IA:[v[36]],US-MA:[v[5]],US-CA:[v[13],v[23],v[24],v[26],v[27]],US-UT:[v[29]],US-OH:[v[41]],US-GA:[v[1]],US-MI:[v[46]]]
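On the SPARQL side there is no built-in "limit N per group", but a common workaround is a ranking subquery. Below is a minimal sketch, assuming each item carries exactly one value of a hypothetical ex:category property (the ex: prefix and property name are placeholders for whatever your data uses):

PREFIX ex: <http://example.org/>

SELECT ?category ?item
WHERE {
  ?item ex:category ?category .
  {
    # Rank each item within its category by counting the peers that sort
    # at or before it (string order of the IRIs is arbitrary but stable).
    SELECT ?item (COUNT(?peer) AS ?rank)
    WHERE {
      ?item ex:category ?cat .
      ?peer ex:category ?cat .
      FILTER (STR(?peer) <= STR(?item))
    }
    GROUP BY ?item
  }
  # Keep at most 20 items per category.
  FILTER (?rank <= 20)
}
ORDER BY ?category ?item

The self-join makes this quadratic in the size of each category, so on large data the Gremlin approach above is likely the more practical option.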

Related

Web scraping a Wikipedia data table, but from DBpedia, and examples/very basic, elementary tutorial resources to build queries

I wanted to ask about the Semantic Web part, in particular using DBpedia. In general, what can DBpedia do and not do? I roughly understand the subject-predicate-object triple model used by something like DBpedia. Practically and concretely speaking, I want to scrape the technical data (mass, thrust, etc.) found in the Wikipedia page of the Long March rocket family.
Right now (as far as I know), the way I find what DBpedia has is to find what I'm interested in on Wikipedia, copy the last part of the URL, and paste that into a DBpedia resource URL (is there any method more sophisticated than that?), resulting in this page.
Looking at that page, I only see links to related articles, external links, and the abstract.
Other than my smaller questions above, my main question is this: so does DBpedia not have the data table that I want?
Next, could someone give me some tips or pointers for building a SPARQL query string for DBpedia? It seems to me that one wouldn't know how to build one, as there's no "directory" of what could or couldn't be asked. Thanks.
DBpedia is an active project, and DBpedia extractors are continuing to evolve. Contributions that might help you would include adding infoboxes to Wikipedia pages, and data extractors to DBpedia. Check the DBpedia website for info, or write to dbpedia-discussion to get started.
As for finding DBpedia content, there are several interfaces you can work with --
Faceted Browse and Search
direct SPARQL query interface (see the example query after this list)
iSPARQL, a drag-and-drop SPARQL query builder
SNORQL, another SPARQL query interface
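As a starting point for the SPARQL interface mentioned above, a minimal exploratory query simply lists everything DBpedia holds for a resource. The resource IRI below is derived from the Wikipedia article title in the usual way and is only an example; the query can be pasted into http://dbpedia.org/sparql as-is:

# List every property and value DBpedia has for the Long March rocket family.
SELECT ?property ?value
WHERE {
  <http://dbpedia.org/resource/Long_March_(rocket_family)> ?property ?value .
}

If the infobox data had been extracted, it would show up here as additional properties; browsing a resource this way is about the closest thing there is to a "directory" of what can be asked.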
so does dbpedia not have the data table that I want?
No, it doesn't. Usually, DBpedia gets its data from infoboxes. Your article doesn't have one, so DBpedia can't get much information out of it.

Why is some information from the Wikipedia infobox missing on DBpedia?

For example, the star Alpha Librae has the property distance-from-earth in its infobox, but it isn't a property of the Alpha Librae DBpedia resource. On the other hand, the star Betelgeuse does have this piece of information on DBpedia. Many other stars have this distance information in the infobox, but there isn't any matching property in the DBpedia resource.
Is there a way to extract this missing information from DBpedia using SPARQL, or is scraping the wiki page the only way?
The DBpedia pages present all the data DBpedia has -- no SPARQL or other query can get data that isn't there.
DBpedia is updated periodically. It may not reflect the latest changes on Wikipedia.
Also, extractors are a living project, and may not grab every property in which you're interested.
Looking at Betelgeuse on Wikipedia, I see one distance in the infobox. Looking at Alpha_Librae, I see two distances. Which should DBpedia have? Perhaps you have the niche knowledge which can ensure that the extractors do the right thing...
As @JoshuaTaylor suggests, you will probably get more satisfactory answers from the DBpedia discussion list and/or the DBpedia development list.
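One way to check what was actually extracted is to ask DBpedia for any distance-related properties of the resource. A minimal sketch against the public endpoint (the CONTAINS filter is just a crude way to spot candidate property names):

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?property ?value
WHERE {
  dbr:Betelgeuse ?property ?value .
  # Crude filter: keep properties whose IRI mentions "dist".
  FILTER (CONTAINS(LCASE(STR(?property)), "dist"))
}

If the same query for dbr:Alpha_Librae returns nothing, the extractor simply didn't capture that infobox field, and no SPARQL query will conjure it up.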
Look at en.wikipedia.org/wiki/Volkswagen_Golf_Mk3:
In the infobox you have:
height = 1991-95 & Cabrio: {{convert|1422|mm|in|1|abbr=on}}1996-99: {{convert|1428|mm|in|1|abbr=on}}
In DBpedia you get height=1991-95
instead of
height=1422
height=1428
This happens because there is no standard way to define properties conditionally. For this reason, DBpedia properties are sometimes wrong or missing.

Filter Wikipedia geosearch per region

I saw that the Wikipedia API (called MediaWiki GeoData) can search wiki pages around fixed coordinates. An example call is
https://it.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=10000&gscoord=37.786971|-122.399677
I also saw that GeoData, in its extra parameters, has the concept of a region, accepting an ISO 3166-2 region code.
How can I search for elements, filtering by this region code? For example, if I am searching around coordinates near the border between two regions, can I filter for only the elements in one region?
Short answer: you can't.
Longer answer: we currently lack two features which I just filed in the issue tracker for you, i.e.
populating the region parameter in GeoData: most pages do not specify the region for GeoData but only in their free text (which is useless); the only structured data we have is in Wikidata;
adding an option to filter the results by region.
For now, you'll have to do everything yourself client-side: figure out the coordinates of each region and filter by those, or search the region in Wikidata statements and then fetch the corresponding articles in the desired language (a sketch of the Wikidata route is given below). As you are a developer, you could also help import country data into Wikidata ;-).
(Expanded from MaxSem's answer, hence wiki.)
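To illustrate the Wikidata route, here is a rough sketch for the Wikidata Query Service (https://query.wikidata.org). It assumes the region is identified by its ISO 3166-2 code (P300), uses P131 for administrative containment and P625 for coordinates, and fetches the corresponding it.wikipedia articles via their sitelinks; the "IT-62" code (Lazio) is just an example:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?coord ?article
WHERE {
  ?region wdt:P300 "IT-62" .            # region identified by its ISO 3166-2 code
  ?item wdt:P131+ ?region ;             # located (transitively) in that region
        wdt:P625 ?coord .               # has a coordinate location
  ?article schema:about ?item ;                           # corresponding article...
           schema:isPartOf <https://it.wikipedia.org/> .  # ...on it.wikipedia
  SERVICE wikibase:label { bd:serviceParam wikibase:language "it,en". }
}
LIMIT 100

This only finds items whose containment is actually recorded in Wikidata (which is exactly the gap described above), and the transitive P131+ path can be slow for large regions.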

Named Graphs and Federated SPARQL Endpoints

I recently came across the working draft for SPARQL 1.1 Federation Extensions and wondered whether this was already possible using Named Graphs (not to detract from the usefulness of the aforementioned draft).
My understanding of Named Graphs is a little hazy, save that the only thing I have gleaned from reading the specs comprises rules around merger and non-merger in relation to other graphs at query time. Since this doesn't fully satisfy my understanding, my question is as follows:
Given the following query:
SELECT ?something
FROM NAMED <http://www.vw.co.uk/models/used>
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
...
}
Is it reasonable to assume that a query processor/endpoint could or should, in the context of the named graphs, do the following:
Check if the named graph exists locally
If it doesn't, then perform the following operation (for the above query, I will use the second named graph):
GET /sparql/?query=EncodedQuery HTTP/1.1
Host: www.autotrader.co.uk
User-agent: my-sparql-client/0.1
Where the EncodedQuery only includes the second named graph in the FROM NAMED clause and the WHERE clause is amended accordingly with respect to GRAPH clauses (e.g. if a GRAPH <http://www.vw.co.uk/models/used> {...} is being used); a sketch of such a rewritten sub-query follows these steps.
Only if it can't perform the above should it do any of the following:
GET /cars/used HTTP/1.1
Host: www.autotrader.co.uk
or
LOAD <http://www.autotrader.co.uk/cars/used>
Return appropriate search results.
Obviously there might be some additional considerations around OFFSETs and LIMITs.
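For illustration, the rewritten sub-query forwarded to the second endpoint might look something like the following (the GRAPH block and the ?car ?p pattern are hypothetical stand-ins for whatever the original WHERE clause contained):

SELECT ?something
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
  GRAPH <http://www.autotrader.co.uk/cars/used> {
    ?car ?p ?something .
  }
}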
I also remember reading somewhere, a long time ago in a galaxy far far away, that the default graph of any SPARQL endpoint should be a named graph according to the following convention:
For http://www.vw.co.uk/sparql/ there should be a named graph http://www.vw.co.uk that represents the default graph, and so, by the above logic, it should already be possible to federate SPARQL endpoints using named graphs.
The reason I ask is that I want to start promoting federation across the domains in the above example, without having to wait around for the standard, making sure that I won't do something that is out of kilter or incompatible with something else in the future.
Named graphs and the URLs used in federated queries (using SERVICE or FROM) are two different things. The latter point to SPARQL endpoints; named graphs live within a triple store and have the main function of separating different data sets. This, in turn, can be useful both to improve performance and to represent knowledge, such as recording the source of a set of statements.
For instance, you might have two data sources both stating that ?movie has-rating ?x, and you might want to know which source states which rating; in this case you can use two named graphs associated with the two sources (e.g., http://www.example.com/rotten-tomatoes and http://www.example.com/imdb). If you're storing both data sets in the same triple store, you will probably want to use named graphs (NGs), while remote endpoints are a different thing. Furthermore, the URL of a named graph can be used with vocabularies like VoID to describe a dataset as a whole (e.g., the data set name, where and when the triples were imported from, who the maintainer is, the user licence). This is another reason to partition your triple store into NGs.
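To make the contrast concrete, here is a sketch using the hypothetical IRIs above (ex:hasRating and the remote endpoint URL are made up for illustration). GRAPH scopes a pattern to a named graph inside the local store, while SERVICE (SPARQL 1.1 Federated Query) sends a pattern to a remote endpoint:

PREFIX ex: <http://www.example.com/>

SELECT ?movie ?rtRating ?remoteRating
WHERE {
  GRAPH <http://www.example.com/rotten-tomatoes> {    # named graph in the local store
    ?movie ex:hasRating ?rtRating .
  }
  SERVICE <http://www.example.org/sparql> {           # remote SPARQL endpoint
    ?movie ex:hasRating ?remoteRating .
  }
}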
That said, your mechanism to bind NGs to endpoint URLs might be implemented as an option, but I don't think it's a good idea to have it as mandatory, since managing remote endpoint URLs and NGs separately can be more useful.
Moreover, the real challenge in federated queries is to offer endpoint-transparent queries, making the query engine smart enough to analyse the query, understand how to split it, and perform partial queries on the right endpoints (joining the results later, in an efficient way). There is a lot of research being done on that; one of the most significant results (as far as I know) is FedX, which has been used to implement several query distribution optimisations (example).
One last thing to add: I vaguely remember the convention that you mention about $url and $url/sparql. There are a couple of approaches around (e.g., the LOD cloud). That said, in most current triple stores (e.g., Virtuoso), queries that don't specify a named graph (that don't use GRAPH) do not simply fall back to a default graph; they actually query the union of all named graphs in the store, which is usually much more useful (when you don't know where something is stated, or you want to integrate cross-graph data).

How to analyse Wikipedia's article database with R?

This is a "big" question that I don't know how to start, so I hope some of you can give me a direction. And if this is not a "good" question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let's say the English one) and do statistics. For example, I am interested in how many active editors (a term which should be defined) Wikipedia had at each point in time (let's say over the last 2 years).
I don't know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
What tools do I need for this (besides basic R)? MySQL on my computer? An RODBC database connection?
How do you start planning for such a project?
You'll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
# 2010-03-17 04:33:50 done Log events to all pages.
* This contains the log of actions performed on pages.
* pages-logging.xml.gz 1.0 GB
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You'll then import the XML into MySQL. Generating a histogram of users per day, week, year, etc. won't require R; you'll be able to do that with a single MySQL query, something like:
select DAYOFYEAR(wiki_edit_timestamp), count(*)
from page_logs
group by DAYOFYEAR(wiki_edit_timestamp)
order by DAYOFYEAR(wiki_edit_timestamp);
etc.
(I'm not sure what their actual schema is, but it'll be something like that.)
You'll run into issues, no doubt, but you'll learn a lot too. Good luck!
You could
work with the Wikipedia database dumps, as already mentioned
work with the live MediaWiki API; see this minimal example at Rosetta Code, or my unfinished approach with an S3 class, or this package by Peter Konings
work with DBpedia, an effort to extract knowledge from Wikipedia into a knowledge base. They offer online SPARQL access that I don't know much about (see the example query below), and also datasets as N-Triples for download. See this Python script, which might be a starting point for an R script. This approach might be useful for accessing the content stored in Wikipedia (such as the infoboxes), but I am not sure whether information on contributors to Wikipedia is available.
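For a first taste of that SPARQL access, something like the following against the public endpoint (http://dbpedia.org/sparql) pulls the English abstract of a page; dbo:abstract is part of the DBpedia ontology, and the resource IRI is just an example:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract
WHERE {
  <http://dbpedia.org/resource/Wikipedia> dbo:abstract ?abstract .
  FILTER (LANG(?abstract) = "en")
}

Note that this exposes page content rather than editing activity, so it won't help with the active-editors question; the dumps or the API are the better route for that.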
Try WikiXRay (Python/R) and Zotero.