DBpedia has many chapters in several languages, and it provides both mapping-based predicates and raw infobox properties. The project describes itself as a community effort to extract structured information from Wikipedia, yet many of the raw properties are not reasonable at all, especially in the localized datasets. Are these properties produced by humans or by machines?
As stated on the DBPedia main page:
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia[...]
The extraction is done by the DBpedia Information Extraction Framework, software that is available in a dedicated GitHub repository.
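The split between the two property families is easy to see at the public endpoint. A query along the lines below (run at https://dbpedia.org/sparql; dbr:Berlin is just an arbitrary example resource) lists the raw infobox statements, which live in the http://dbpedia.org/property/ namespace and are produced automatically from infobox keys, whereas the mapping-based statements in http://dbpedia.org/ontology/ pass through the community-maintained mappings.

    # Raw infobox (dbp:) statements for one resource; swap "property" for
    # "ontology" in the FILTER to see the curated, mapping-based (dbo:) ones.
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?property ?value
    WHERE {
      dbr:Berlin ?property ?value .
      FILTER(STRSTARTS(STR(?property), "http://dbpedia.org/property/"))
    }
    LIMIT 25

In short, the extraction itself is automatic; the human contribution is mainly in the mappings (and in the infoboxes themselves), which is why raw, unmapped properties in localized datasets can look unreasonable.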
I am trying to practice writing SPARQL queries. Does anybody know where I can find good material: some RDF files and some tasks against which I can try writing my own queries? I am comfortable with SQL, and I just need material to learn to write SPARQL.
All sample RDF and queries from the O'Reilly book "Learning SPARQL" are available on the book's home page at learningsparql.com. (Full disclosure: I wrote it.)
data.gov and DataHub have a lot of downloadable RDF data sets. If a public SPARQL endpoint is available, DataHub usually lists it. For example: the Rijksmuseum page offers RDF downloads and a link to the endpoint.
myExperiment has a tutorial with examples and a working endpoint.
If you download Jena, you get their example RDF files and SPARQL queries.
UniProt has a SPARQL endpoint with examples. The RDF is available for download, though some of the files are quite large.
There's a large number of downloadable ontologies in RDF format at the OBO Foundry.
Watch this: Probe the Semantic Web with SPARQL
SPARQL Cheat Sheet Slide Deck
As mentioned above: the website for Bob DuCharme's excellent Learning SPARQL Book
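If you want something to run right away while working through these resources, here is a minimal query against the public DBpedia endpoint (https://dbpedia.org/sparql); the class and property names come from the DBpedia ontology. The rough SQL analogy: the triple patterns in WHERE play the role of FROM plus join conditions, and the variables take the place of columns.

    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Ten museums with their English labels.
    SELECT ?museum ?name
    WHERE {
      ?museum a dbo:Museum ;
              rdfs:label ?name .
      FILTER(LANG(?name) = "en")
    }
    LIMIT 10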
I need your help with the following situation.
I have a local relational database that contains information about several places in a city. These places could be any kind of attraction: a museum, a cathedral, or even a square.
As an example I have information about "Square Victoria" (https://en.wikipedia.org/wiki/Victoria_Square,_Montreal)
A simple Google search gave me the Wikipedia URL above, but I want to be able to find it programmatically.
For each place in the database I also have its category (square, museum, church, ...). These categories are local only and do not match any standardized categorization.
My goal is to improve this database by associating each place with its DBpedia URI.
My question is: what is the best way to do that? I have some theoretical background in Semantic Web technologies, but I do not yet have the practical skills to work out how to do it.
More specific questions:
Is it possible to determine the DBpedia URI using SPARQL only?
If it is not possible with SPARQL alone, what other technologies would I need to accomplish this?
Thank you
First of all, if you have not done so yet, I would recommend having a look at Wikidata. This project is a semantic extension of Wikipedia, but unlike DBpedia, its data is not extracted from Wikipedia: it is created by contributors, and therefore tends to be (or will be, as the project is still growing) more reliable.
The service offers many ways to access the data (including a SPARQL endpoint), and its main advantage is that the underlying software is MediaWiki, the same software used for Wikipedia and other Wikimedia Foundation projects. The MediaWiki API offers an OpenSearch option that should let you search more efficiently than SPARQL queries would.
Putting everything together, I think it is worth having a look at Wikidata plus the Wikipedia API to get pivot data for aligning your local database.
This is not a direct answer, but I hope it helps.
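That said, on the narrow question of whether SPARQL alone can do it: an exact label match works whenever your local name coincides with the Wikipedia page title, as in the sketch below against https://dbpedia.org/sparql (the label literal and the dbo:Place restriction are assumptions for the "Square Victoria" example). Fuzzier matching generally needs something beyond plain SPARQL, such as the OpenSearch API mentioned above, the DBpedia Lookup service, or endpoint-specific full-text extensions.

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    SELECT ?place
    WHERE {
      ?place rdfs:label "Victoria Square, Montreal"@en .
      ?place a dbo:Place .   # narrows to places; drop this line if it filters out the match
    }

If the exact match returns nothing, the Wikidata/MediaWiki route described above is usually the more robust option.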
I'm trying to describe some resources about books. I have already covered:
Author: dcterms:creator;
Title: dcterms:title;
Location: dcterms:location
Those were the easy ones, but I have some things that are not in the dcterms list. Where can I find other schemas to describe them? Can you show me examples, or even how to create my own schema?
E.g. homepage, keywords, goal.
You need to search for appropriate vocabularies (or: ontologies/schemas). There are many vocabularies.
You could use http://prefix.cc/ to learn about some of them.
For books, have a look at (these are just some suggestions so that you see some examples of what is out there):
Dublin Core (as you already know)
FOAF (Friend of a Friend) (e.g., for authors and topics)
schema.org (e.g., Book and Person)
The Bibliographic Ontology
provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc) on the Semantic Web.
Ontology for Media Resources
a core set of metadata properties for media resources
SPAR (Semantic Publishing and Referencing Ontologies)
a suite of orthogonal and complementary ontology modules for creating comprehensive machine-readable RDF metadata for all aspects of semantic publishing and referencing
You could create your own vocabulary (but you should only do this if there is no appropriate vocabulary). It’s as simple as defining meanings for URIs under your control. If you intend to publish this vocabulary, so that other people can use it, too, have a look at RDFS (which is a vocabulary to describe vocabularies). See also:
RDF Primer: Defining RDF Vocabularies: RDF Schema
RDFa Primer: Custom Vocabularies
Best Practice Recipes for Publishing RDF Vocabularies
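To tie a few of these vocabularies together, here is a sketch of one book description, written as a SPARQL INSERT DATA update so it can be pasted into any SPARQL 1.1 store. The example.org URIs are placeholders; foaf:homepage and schema:keywords cover the "homepage" and "keywords" cases from your question, and something like dcterms:description or dcterms:abstract could stand in for "goal".

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
    PREFIX schema:  <http://schema.org/>

    INSERT DATA {
      # Placeholder URIs; mint your own under a domain you control.
      <http://example.org/book/learning-sparql>
          dcterms:title   "Learning SPARQL" ;
          dcterms:creator <http://example.org/person/bob-ducharme> ;
          schema:keywords "SPARQL, RDF, semantic web" ;
          foaf:homepage   <http://learningsparql.com/> .

      <http://example.org/person/bob-ducharme>
          a foaf:Person ;
          foaf:name "Bob DuCharme" .
    }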
I suggest that you have a look at schema.org. You will find all the information you need to annotate resources related to books among other things, and even documentation explaining how to extend your vocabulary if needed.
I'm currently investigating options for extracting person names, locations, tech terms, and categories from text (many articles from the web), which will then be fed into a Lucene/Elasticsearch index. The additional information is added as metadata and should increase the precision of search.
E.g. when someone queries 'wicket', they should be able to decide whether they mean the cricket term or the Apache project. I have tried to implement this on my own with minor success so far. I have now found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, and whether the precision of entity extraction is high enough.
DBpedia Spotlight; the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger (not open source, but free)
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above, in particular their precision/recall? Or know whether training data is required and available for them?
Are there articles or tutorials that can get me started with entity extraction (NER) for each tool?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project; both fall outside the typically recognized types of person, organization, and location.
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBpedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
DBpedia Spotlight
Extractiv (my company)
These tools often expose a language-independent API such as REST, and I do not know whether they provide Lucene integration directly, but I hope this answer is helpful for the problem you are trying to solve.
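Once an ambiguous mention such as 'wicket' has been linked to a DBpedia URI (which is what Spotlight and the other tools above do), the categories you want as index metadata can be pulled back with a query of this shape against https://dbpedia.org/sparql. The two resource URIs are assumptions based on the corresponding Wikipedia article titles.

    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Categories of the two plausible readings of "wicket", usable as
    # disambiguating metadata in the search index.
    SELECT ?entity ?category
    WHERE {
      VALUES ?entity { dbr:Wicket dbr:Apache_Wicket }
      ?entity dct:subject ?category .
    }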
You can use OpenNLP to extract names of people, places, and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
Rosoka is a commercial product that computes a "salience" value, which measures the importance of a term or entity to the document. Salience is based on linguistic usage rather than frequency. Using the salience values, you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features of the full Rosoka.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether by a human or by software, requires enough contextual information to tell the candidates apart. The context may be contained within the document, within the constraints of a corpus, or within the context of the user; the first is the most specific, and the last carries the greatest potential ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software project, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as a cricket object. "Wicket Wystri Warrick was a male Ewok scout" should be interpreted so that "Wicket" is the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a project name, and so on.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is that you can train your own classifier. Follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities are not efficiently extracted from the document. Most of the entities go undetected.
Just in case you find it useful.
I am building an ontology-processing tool and need lots of examples of various OWL ontologies as people are building and using them in the real world. I'm not talking about foundational ontologies such as Cyc; I'm talking about smaller, domain-specific ones.
There's no definitive collection afaik, but these links all have useful collections of OWL and RDFS ontologies:
schemaweb.info
vocab.org
owlseek
linking open data constellation
RDF schema registry (rather old now)
In addition, there are some general-purpose RDF/RDFS/OWL search engines you may find helpful:
sindice
swoogle
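Whichever collection you pull from, a quick SPARQL check after loading an ontology file into any standard triple store gives a feel for its size; this sketch simply counts the declared classes and properties.

    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # How many classes and properties does the loaded ontology declare?
    SELECT (COUNT(DISTINCT ?class) AS ?classes) (COUNT(DISTINCT ?property) AS ?properties)
    WHERE {
      { ?class a owl:Class }
      UNION { ?class a rdfs:Class }
      UNION { ?property a owl:ObjectProperty }
      UNION { ?property a owl:DatatypeProperty }
    }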
Ian
My go-to site for this probably didn't exist at the time of the question. For latecomers like me:
Linked Open Vocabularies
I wish I'd found it much sooner!
It's well-groomed, maintained, has all the most-popular ontologies, and has a good search engine. However, it doesn't include some specialized collections, most notably, (most of?) the stuff in OBO Foundry.
Thanks! A couple more I found:
OntoSelect - browsable ontology repository
Protege Ontology Library
CO-ODE Ontologies
Within the life-science domain, the publicly available ontologies are listed on the OBO Foundry site. These ontologies can be queried via the Ontology Lookup Service or the NCBO's BioPortal, which also contains additional resources.
One more concept search tool: falcons
There is also a good web engine for searching for ontologies, called Watson Semantic Web Search; you can try it here.