Querying DBpedia for partial URI external link and official website matches - sparql

I’m trying to retrieve Wikipedia pages based on the "official website" specified on them, but preferably without going and building a complete index of Wikipedia. If I query DBpedia using:
SELECT ?s WHERE {
?s foaf:homepage <http://www.nytimes.com>
}
I get the desired result, but there are several issues when trying to make this work in general:
foaf:homepage is mostly not set.
I couldn’t find a queryable property that maps to "official website". In some cases, a query based on dbpedia-owl:wikiPageExternalLink works, but in others you get a list of pages that merely happen to link to this site.
URLs take various forms (www.example.com, www.example.com/, www.example.com/index.html, etc.), and I couldn't figure out an efficient way to query based on a regular expression or even on STRSTARTS. It seems to always involve producing a huge query result and then filtering it.

You are hitting on the fact that a lot of data in DBpedia is somewhat incomplete or poorly formatted. This is more or less unavoidable, since its source material is the same way. For example, foaf:homepage is sometimes missing, but that is likely because the same information is missing from the source Wikipedia page. That being said, sometimes the crawling tools the DBpedia folks use miss a trick. If you think they are doing something wrong in converting Wikipedia data to RDF, let them know directly and they can adjust their crawler.
Other than that, your question is a bit too broad to answer, really. foaf:homepage is the property used for the official website for a given topic. Where it's not set you simply don't know what the official site is. dbpedia-owl:wikiPageExternalLink is a general link for any external resource that is referenced by the wiki article - so it's not just the official website.
As for the formatting - I have yet to see this, most links I encountered while browsing are fully formed URLs. If you want us to answer that you'll have to edit your question to include some concrete examples.
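One way to sidestep the regex/STRSTARTS filtering the question worries about is to enumerate the likely spellings of the site root on the client side and match them exactly. The following is only a sketch of that idea: the helper names `homepage_variants` and `build_homepage_query` are hypothetical, and the set of variants (http/https, with and without www, with and without a trailing slash) is an assumption about how the URLs tend to differ, not something DBpedia guarantees.

```python
from urllib.parse import urlsplit


def homepage_variants(url: str) -> list[str]:
    """Return common spellings of a site's root URL (hypothetical helper)."""
    # Accept input without a scheme, e.g. "www.example.com/index.html".
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("http", "https"):
        for h in (bare, "www." + bare):
            for suffix in ("", "/"):
                variants.append(f"{scheme}://{h}{suffix}")
    return variants


def build_homepage_query(url: str) -> str:
    """Build a SPARQL query that matches foaf:homepage against exact IRIs."""
    values = " ".join(f"<{v}>" for v in homepage_variants(url))
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?s WHERE {\n"
        f"  VALUES ?site {{ {values} }}\n"
        "  ?s foaf:homepage ?site .\n"
        "}"
    )


print(build_homepage_query("www.nytimes.com/index.html"))
```

Because every candidate is a constant IRI, the endpoint should not need to scan and filter a large intermediate result. Pages whose foaf:homepage is not set will still be missed.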

Related

How to find the composedBy relation in dbpedia?

Look at this query:
construct {?symphony dct:composedBy <http://dbpedia.org/resource/Category:Symphonies_by_Ludwig_van_Beethoven>}
{
?symphony dct:subject <http://dbpedia.org/resource/Category:Symphonies_by_Ludwig_van_Beethoven>
}
You can run it over this endpoint:
http://dbpedia.org/sparql/
You will get results, so far so good:
I tried to get Beethoven's musical works by using dct:subject, but that is not quite correct, because it lists just the symphonies; there should be a relation to list all the works for Beethoven, including sonatas and strings. Do you know that property?
Plus, I tried the subject property on some opera composers, and the results were films that use the opening of one of that composer's operas as the theme for the movie. So the subject property is not a good way to get musical works; I am looking for help finding something like "composed by".
there should be a relation to list all the works for Beethoven, including sonatas and strings
I don't see why that's necessarily the case; DBpedia only contains what people put into it, and that information, even if it's present in Wikipedia, isn't necessarily stored in a way that DBpedia can extract it.
It looks like you've got a reasonably good handle on how to explore DBpedia data, though, and that same kind of process can be helpful here. But if you're interested in what links to Beethoven, then you can have a look at the corresponding resources. This may have varied results.
For instance, if you look at the resource for Für Elise, you'll see there's no property directly linking it to the composer (and since those pages show properties in the reverse direction, too, there's no link from Beethoven to the piece, either). That's enough to show that DBpedia doesn't necessarily have the data you're looking for.
However, there is a property that might be useful, dct:subject dbc:Compositions_by_Ludwig_van_Beethoven. Based on that, you might be able to modify your query to use something like:
?symphony dct:subject dbc:Compositions_by_Ludwig_van_Beethoven
That's no guarantee, but this process of exploring the data looking for relevant bits is probably the best bet for finding this information.
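As a rough sketch, the category-based pattern can be wrapped in a full SELECT query and turned into a request URL for the public endpoint. The function names here are hypothetical, and the `format` parameter is an assumption about the endpoint's Virtuoso server rather than part of the SPARQL standard; nothing is sent over the network in this snippet.

```python
from urllib.parse import urlencode

# Public DBpedia endpoint mentioned in the question.
ENDPOINT = "http://dbpedia.org/sparql"


def compositions_query(category: str) -> str:
    """Build a SELECT query for resources filed under a DBpedia category."""
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "PREFIX dbc: <http://dbpedia.org/resource/Category:>\n"
        "SELECT ?work WHERE {\n"
        f"  ?work dct:subject dbc:{category} .\n"
        "}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET URL (the 'format' parameter is assumed)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


query = compositions_query("Compositions_by_Ludwig_van_Beethoven")
print(query)
print(request_url(query))
```

Swapping in another category name is then a one-argument change, which makes it easier to try the categories you come across while browsing the resource pages.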

Web scraping wikipedia data table, but from dbpedia, and examples/very basic, elementary tutorial resources to build queries

I wanted to ask about the Semantic Web, in particular about using DBpedia. In general, what can and can’t DBpedia do? I roughly understand the subject-verb-object model behind something like DBpedia. Concretely, I want to scrape the technical data (mass, thrust, etc.) found on the Wikipedia page of the Long March rocket family.
As far as I know right now, the way to find what DBpedia has is to find the article I’m interested in on Wikipedia, copy the last part of its URL, and paste that into DBpedia (is there any method more sophisticated than that?), resulting in this page.
Looking at that page, I only see links to related articles, links, and the abstract.
Other than my smaller questions above, my main question is this: so does DBpedia not have the data table that I want?
Next, could someone give me some tips or pointers for building a SPARQL query for DBpedia? It seems to me that one wouldn't know how to build one, as there's no "directory" of what can or can't be asked. Thanks.
DBpedia is an active project, and DBpedia extractors are continuing to evolve. Contributions that might help you would include adding infoboxes to Wikipedia pages, and data extractors to DBpedia. Check the DBpedia website for info, or write to dbpedia-discussion to get started.
As for finding DBpedia content, there are several interfaces you can work with --
Faceted Browse and Search
direct SPARQL query interface
iSPARQL, a drag-and-drop SPARQL query builder
SNORQL, another SPARQL query interface
so does dbpedia not have the data table that I want?
No, it doesn't. Usually, DBpedia gets its data from infoboxes. Your article doesn't have one, so DBpedia can't get much information out of it.
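The "copy the last part of the URL" step from the question, plus a query that lists which properties a resource actually has, can be sketched as below. This is a minimal illustration: both function names are hypothetical, and it assumes the DBpedia resource name is simply the Wikipedia article title, which holds for typical English articles but is not checked here.

```python
from urllib.parse import urlsplit


def wikipedia_to_dbpedia(wiki_url: str) -> str:
    """Map an English Wikipedia article URL to the assumed DBpedia resource URI."""
    path = urlsplit(wiki_url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {wiki_url}")
    return "http://dbpedia.org/resource/" + path[len(prefix):]


def list_properties_query(resource_uri: str) -> str:
    """Build a query listing every property set on a resource."""
    return (
        "SELECT DISTINCT ?property WHERE {\n"
        f"  <{resource_uri}> ?property ?value .\n"
        "}\n"
        "ORDER BY ?property"
    )


uri = wikipedia_to_dbpedia("https://en.wikipedia.org/wiki/Betelgeuse")
print(uri)
print(list_properties_query(uri))
```

Listing the properties first is a workable substitute for the missing "directory": if the list contains only links and the abstract, there is nothing more for a query to return.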

Why is some information from the Wikipedia infobox missing on DBpedia?

For example, the star Alpha Librae has a distance-from-Earth property in the infobox, but it isn't a property of the Alpha Librae DBpedia resource. On the other hand, the star Betelgeuse has this piece of information on DBpedia. Many other stars have this distance information in the infobox, but there isn't any matching property in the DBpedia resource.
Is there a way to extract this missing information from DBpedia using SPARQL, or is scraping the wiki page the only way?
The DBpedia pages present all the data DBpedia has -- no SPARQL nor other query can get data that isn't there.
DBpedia is updated periodically. It may not reflect the latest changes on Wikipedia.
Also, extractors are a living project, and may not grab every property in which you're interested.
Looking at Betelgeuse on Wikipedia, I see one distance in the infobox. Looking at Alpha_Librae, I see two distances. Which should DBpedia have? Perhaps you have the niche knowledge which can ensure that the extractors do the right thing...
As @JoshuaTaylor suggests, you will probably get more satisfactory answers from the DBpedia discussion list and/or the DBpedia development list.
Look at en.wikipedia.org/wiki/Volkswagen_Golf_Mk3:
In the infobox you have:
height = 1991-95 & Cabrio: {{convert|1422|mm|in|1|abbr=on}}1996-99: {{convert|1428|mm|in|1|abbr=on}}
In DBpedia you get height=1991-95
instead of
height=1422
height=1428
This happens because there is no standard way to define conditional property values in an infobox. For this reason, DBpedia properties are sometimes wrong or missing.
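If you do fall back to reading the wikitext yourself, a value like the one above can be picked apart with a regular expression over the {{convert}} templates. This is only a sketch for this particular pattern, not how the DBpedia extractors work; the `convert_values` helper is hypothetical and ignores the year ranges that qualify each number.

```python
import re

# Matches the number and unit at the start of a {{convert|...}} template.
CONVERT = re.compile(r"\{\{convert\|([\d.]+)\|([^|}]+)", re.IGNORECASE)


def convert_values(raw: str) -> list[tuple[float, str]]:
    """Return (value, unit) pairs for every {{convert}} template in raw wikitext."""
    return [(float(number), unit) for number, unit in CONVERT.findall(raw)]


raw_height = (
    "1991-95 & Cabrio: {{convert|1422|mm|in|1|abbr=on}}"
    "1996-99: {{convert|1428|mm|in|1|abbr=on}}"
)
print(convert_values(raw_height))
```

Deciding which of the extracted values applies to which model year still needs handling specific to each infobox, which is the same problem a generic extractor faces.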

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", though it doesn't seem to return any relevant result for this request).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general purpose search engine which will allow you to search for something like "window.window->window" because most search engines will do some processing on the document before storing it. For instance they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job since I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing but can give you a hint about what happens to your query.
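To make the preprocessing point concrete, here is a toy sketch of a punctuation-splitting tokenizer and a basic tf-idf score. It is an illustration under simple assumptions (lowercasing, splitting on anything that is not a letter, digit, or underscore, and a plain logarithmic idf), not a description of what any real search engine does; the function names are hypothetical.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """Term frequency in one document times log inverse document frequency."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0


corpus = [
    tokenize("window.window->window"),
    tokenize("open a window"),
    tokenize("mozilla plugin"),
]
# After tokenizing, the code fragment is indistinguishable from three plain words.
print(corpus[0] == tokenize("window window window"))
print(tf_idf("window", corpus[0], corpus))
```

Once the index only stores word tokens, the "." and "->" are gone before the query is ever compared against a document, which is why quoting the string does not help.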
There is no way to do that by itself in the main Google engine, as you discovered. However, if you are looking for information about Mozilla, then the best bet would be to structure your query more like this:
"window.window->window" +Mozilla
OR +XUL
+ another search string related to what you are trying to do
SymbolHound is a web search engine that does not remove punctuation from queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it can also search the Internet for special characters (primarily on programming-related sites such as StackOverflow).
Try it here: http://www.symbolhound.com
-Tom (co-founder)

Semantic analysis using Solr

I'm considering adding semantic analysis to my Solr installation, but I don't exactly know where to start.
Basically, I'd like Solr to be able to find "similar" words (taken from the body of the indexed documents).
For example, if I search for "music", I should be able to query the semantic engine and obtain "rock", "pop", etc. (of course if these words appeared near to music in some of the indexed documents).
I found this project, but I don't know if it is the correct place to start:
http://code.google.com/p/semanticvectors/
Semantic indexing is a good place to start. However, in my experience, these kinds of technologies don't work that well in practice. You often end up with very bizarre results. Also, because of Google, people have a certain expectation of how keyword search should behave, i.e. your search term should appear in the matching document.
You may use the Lucene Wordnet contrib package to look for synonyms.
Optimizing Findability in Lucene and Solr gives other ways to expand queries.
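For a feel of what "words that appeared near music" means before committing to a library, a plain co-occurrence count can be sketched in a few lines. This is not the Solr or Lucene API and not what the semanticvectors project does internally; the `neighbours` function and the window size are hypothetical choices, and there is no stop-word filtering.

```python
import re
from collections import Counter


def neighbours(term: str, documents: list[str], window: int = 3) -> Counter:
    """Count words appearing within `window` tokens of `term` in the documents."""
    counts: Counter = Counter()
    for doc in documents:
        tokens = re.findall(r"\w+", doc.lower())
        for i, token in enumerate(tokens):
            if token != term:
                continue
            lo, hi = max(0, i - window), i + window + 1
            # Count every token in the window except the matched position itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


docs = ["rock music and pop music", "classical music concerts"]
print(neighbours("music", docs, window=1).most_common())
```

Even on this tiny input, a function word like "and" scores the same as "rock" and "pop", which hints at the kind of bizarre results mentioned above when the counts are used without further filtering.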