Web scraping a Wikipedia data table, but from DBpedia, plus very basic, elementary tutorial resources for building queries - sparql

I wanted to ask about the Semantic Web, in particular DBpedia. In general, what can DBpedia do and not do? I roughly understand the subject-predicate-object model underlying something like DBpedia. Practically and concretely speaking, I want to scrape the technical data (mass, thrust, etc.) found in the Wikipedia page on the Long March rocket family.
Right now, as far as I know, the way I find what DBpedia has is this: I find what I'm interested in on Wikipedia, copy the last part of the URL, and paste that into a DBpedia resource URL (is there any method more sophisticated than that?), resulting in this page.
Looking at that page, I only see links to related articles, external links, and the abstract.
Other than my smaller questions above, my main question is this: does DBpedia not have the data table that I want?
Next, could someone give me some tips or pointers for building a SPARQL query for DBpedia? It seems to me that one couldn't know how to build one, as there's no "directory" of what can or can't be asked. Thanks.

DBpedia is an active project, and DBpedia extractors are continuing to evolve. Contributions that might help you would include adding infoboxes to Wikipedia pages, and data extractors to DBpedia. Check the DBpedia website for info, or write to dbpedia-discussion to get started.
As for finding DBpedia content, there are several interfaces you can work with --
- Faceted Browse and Search
- the direct SPARQL query interface
- iSPARQL, a drag-and-drop SPARQL query builder
- SNORQL, another SPARQL query interface
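For getting oriented, one simple approach is to point a SELECT query at the public endpoint (http://dbpedia.org/sparql) and list everything DBpedia knows about a resource. A minimal sketch; the resource IRI below is an assumption derived from the Wikipedia article title:

```sparql
# List every property/value pair DBpedia holds for the resource.
# The resource IRI is assumed from the Wikipedia article title.
SELECT DISTINCT ?property ?value
WHERE {
  <http://dbpedia.org/resource/Long_March_(rocket_family)> ?property ?value .
}
LIMIT 100
```

If nothing resembling mass or thrust appears in the results, the extractors simply did not capture those values for this article.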

So does DBpedia not have the data table that I want?
No, it doesn't. Usually, DBpedia gets its data from infoboxes. Your article doesn't have one, so DBpedia can't get much information out of it.

Related

Extract related articles in different languages using Wikidata Toolkit

I'm trying to extract interlanguage related articles from a Wikidata dump. After searching the internet, I found a tool named Wikidata Toolkit that helps with working with this type of data, but there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; I mean the second one is the translated version of the first one.
I tried to use the toolkit, but I couldn't find any solution.
Please give some example of how to find these related articles.
You can use the Wikidata dump [1] to get a mapping of articles among Wikipedias in multiple languages.
For example, if you look at the Wikidata entry for Respiratory system [2], at the bottom you will see all the articles referring to the same topic in other languages.
That mapping is available in the Wikidata dump: just download the dump, extract the mapping, and then get the corresponding text from the Wikipedia dump.
You might encounter some other issues along the way, such as resolving Wikipedia redirects.
[1] https://dumps.wikimedia.org/wikidatawiki/entities/
[2] https://www.wikidata.org/wiki/Q7891
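If you don't need the full dump, note that the same sitelink mapping is also exposed through the Wikidata Query Service (https://query.wikidata.org/). A sketch, assuming Q1731 is the item ID for Dresden:

```sparql
# Sitelinks for one item, restricted to Wikipedia editions.
# Q1731 is assumed to be the item ID for Dresden.
SELECT ?article ?lang WHERE {
  ?article schema:about wd:Q1731 ;
           schema:inLanguage ?lang .
  FILTER(CONTAINS(STR(?article), "wikipedia.org"))
}
```

The query service predeclares the schema: and wd: prefixes. For bulk processing of many articles, the dump-based approach above is still the better fit.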

Set up a SPARQL endpoint for use with custom ontologies and RDF triples

I've been trying to figure out how to set up a SPARQL endpoint for a couple of days, but no matter how much I read, I can't understand it.
To explain my intention: I have an open data server running on CKAN, and my goal is to be able to run SPARQL queries over the data. I know I can't do it directly on the datasets themselves; I would have to define my own OWL ontology and convert the data I want to use from CSV format (which is the format it is currently in) to RDF triples (to be used as linked data).
The idea was to first test with the repository metadata, which can be generated automatically with the ckanext-dcat extension, but I really don't know where to start. I've searched for information on how to install a Virtuoso server for SPARQL, but what I've found leaves a lot to be desired, and I can find nothing that explains how to actually load my own OWL ontologies and RDF data into Virtuoso itself.
Could someone lend me a hand getting started? Thank you.
I'm a little confused. Maybe this is two or more questions?
1. How to convert tabular data, like CSV, into the RDF semantic format?
This can be done with an R2RML approach. Karma is a great GUI for that purpose. Like you say, a conversion like that can really be improved with an underlying OWL ontology. But it can be done without creating a custom ontology, too.
I have elaborated on this in the answer to another question.
2. Now that I have some RDF formatted data, how can I expose it with a SPARQL endpoint?
Virtuoso is a reasonable choice. There are multiple ways to deploy it and multiple ways to load the data, and therefore lots of tutorials on the subject. Here's a good one, from DBpedia.
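Once Virtuoso (or any SPARQL 1.1-compliant store) is running, small files can be loaded through the standard SPARQL Update LOAD operation, assuming your account has update rights on the endpoint. The graph IRI and file URL here are illustrative assumptions:

```sparql
# SPARQL 1.1 Update: fetch a Turtle file and load it into a named graph.
# Both IRIs are placeholders - substitute your own.
LOAD <http://example.org/data/ckan-metadata.ttl>
INTO GRAPH <http://example.org/graph/ckan-metadata>
```

For large datasets, Virtuoso's own bulk loader is faster, but the Update route is store-agnostic and good for first experiments.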
If you'd like a simpler path to starting an RDF triplestore with a SPARQL endpoint, Stardog and Blazegraph are available as JARs, and RDF4J can easily be deployed within a container like Tomcat.
All provide web-based graphical interfaces for loading data and running queries, in addition to SPARQL REST endpoints. At least Stardog also provides command-line tools for bulk loading.

How to find the composedBy relation in dbpedia?

Look at this query:
CONSTRUCT { ?symphony dct:composedBy <http://dbpedia.org/resource/Category:Symphonies_by_Ludwig_van_Beethoven> }
WHERE {
  ?symphony dct:subject <http://dbpedia.org/resource/Category:Symphonies_by_Ludwig_van_Beethoven>
}
You can run it over this endpoint:
http://dbpedia.org/sparql/
You will get results, so far so good:
I tried to get Beethoven's musical works by using dct:subject, but that's not quite right, because it lists just the symphonies. There should be a relation that lists all of Beethoven's works, including sonatas and string pieces. Do you know that property, please?
Also, I tried the subject property on some opera composers, and the results were films that use the opera's overture as their theme. So we can see that the subject property is not a good way to get the musical works; I am looking for help finding something like composedBy.
there should be a relation to list all the works for Beethoven, including sonatas and strings
I don't see why that's necessarily the case; DBpedia only contains what people put into it, and that information, even if it's present in Wikipedia, isn't necessarily stored in a way that DBpedia can extract it.
It looks like you've got a reasonably good handle on how to explore DBpedia data, though, and that same kind of process can be helpful here. If you're interested in what links to Beethoven, you can have a look at the corresponding resources, though results may vary.
For instance, if you look at the resource for Für Elise, you'll see there's no property directly linking it to the composer (and since those pages show properties in the reverse direction, too, there's no link from Beethoven to the piece, either). That's enough to show that DBpedia doesn't necessarily have the data you're looking for.
However, there is a property that might be useful, dct:subject dbc:Compositions_by_Ludwig_van_Beethoven. Based on that, you might be able to modify your query to use something like:
?symphony dct:subject dbc:Compositions_by_Ludwig_van_Beethoven
That's no guarantee, but this process of exploring the data looking for relevant bits is probably the best bet for finding this information.
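Putting that together, a full query might look like the sketch below. dbc: is the conventional prefix for DBpedia's Category namespace; note that category membership is curated by Wikipedia editors, so the result may be incomplete:

```sparql
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>

# All resources categorized as compositions by Beethoven -
# broader than the symphonies-only category, but not guaranteed complete.
SELECT DISTINCT ?work WHERE {
  ?work dct:subject dbc:Compositions_by_Ludwig_van_Beethoven .
}
```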

Why is some information from the Wikipedia infobox missing on DBpedia?

For example, the star Alpha Librae has a distance-from-earth property in its infobox, but it isn't a property of the Alpha Librae DBpedia resource. On the other hand, the star Betelgeuse does have this piece of information on DBpedia. And many other stars have this distance information in the infobox with no matching property in the DBpedia resource.
Is there a way to extract this missing information from DBpedia using SPARQL, or is web scraping of the wiki page the only option?
The DBpedia pages present all the data DBpedia has -- no SPARQL or other query can get data that isn't there.
DBpedia is updated periodically. It may not reflect the latest changes on Wikipedia.
Also, extractors are a living project, and may not grab every property in which you're interested.
Looking at Betelgeuse on Wikipedia, I see one distance in the infobox. Looking at Alpha_Librae, I see two distances. Which should DBpedia have? Perhaps you have the niche knowledge which can ensure that the extractors do the right thing...
As @JoshuaTaylor suggests, you will probably get more satisfactory answers from the DBpedia discussion list and/or the DBpedia development list.
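One way to check what DBpedia actually extracted, before concluding the data is absent, is to list the resource's properties and filter for likely names. A sketch; the filter string is just a guess at how a distance property might be named:

```sparql
# Show any property of Alpha Librae whose IRI mentions "dist".
# An empty result suggests the extractors captured no distance value.
SELECT ?property ?value WHERE {
  <http://dbpedia.org/resource/Alpha_Librae> ?property ?value .
  FILTER(CONTAINS(LCASE(STR(?property)), "dist"))
}
```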
Look at en.wikipedia.org/wiki/Volkswagen_Golf_Mk3:
In the infobox you have:
height = 1991-95 & Cabrio: {{convert|1422|mm|in|1|abbr=on}}1996-99: {{convert|1428|mm|in|1|abbr=on}}
In DBpedia you get height=1991-95
instead of
height=1422
height=1428
This happens because there is no standard way to define properties conditionally. For this reason, DBpedia properties are sometimes wrong or missing.

Querying DBpedia for partial URI external link and official website matches

I’m trying to retrieve Wikipedia pages based on the "official website" specified on them, but preferably without going and building a complete index of Wikipedia. If I query DBpedia using:
SELECT ?s WHERE {
?s foaf:homepage <http://www.nytimes.com>
}
I get the desired result, but there are several issues when trying to make this work in general:
foaf:homepage is mostly not set.
I couldn't find a queryable property that maps to "official website". In some cases, a query based on dbpedia-owl:wikiPageExternalLink works, but in others you get a list of pages that merely happen to link to the site.
URLs take various forms - www.example.com, www.example.com/, www.example.com/index.html, etc. - and I couldn't figure out an efficient way to query based on a regular expression or even on STRSTARTS; it seems to always involve producing a huge query result and then filtering.
You are hitting on the fact that a lot of data in DBpedia is somewhat incomplete or poorly formatted. This is more or less unavoidable, since its source material is the same way. For example, foaf:homepage is sometimes missing, but that is likely because the same info is missing in the source Wikipedia page. That said, sometimes the extraction tools the DBpedia folks use miss a trick - if you think something is going wrong in converting Wikipedia data to RDF, let them know directly and they can adjust the extractor.
Other than that, your question is a bit too broad to answer, really. foaf:homepage is the property used for the official website of a given topic. Where it's not set, you simply don't know what the official site is. dbpedia-owl:wikiPageExternalLink is a general link to any external resource referenced by the wiki article - so it's not just the official website.
As for the formatting - I have yet to see this; most links I encountered while browsing are fully formed URLs. If you want us to answer that, you'll have to edit your question to include some concrete examples.
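If you do need prefix matching despite the cost, STRSTARTS over the string form of the IRI at least avoids full regex evaluation. A sketch - on the public endpoint this still has to scan every foaf:homepage triple and may time out:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Match homepages by prefix, tolerating trailing "/" or "/index.html".
SELECT ?s ?homepage WHERE {
  ?s foaf:homepage ?homepage .
  FILTER(STRSTARTS(STR(?homepage), "http://www.nytimes.com"))
}
LIMIT 50
```

Normalizing the candidate URL on the client side (scheme, trailing slash, index pages) before querying is usually more robust than trying to encode every variant in a single filter.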