I need your help with the following situation.
I have a local relational database that contains information about several places in a city. These places can be any kind of attraction: a museum, a cathedral, or even a square.
As an example, I have information about "Square Victoria" (https://en.wikipedia.org/wiki/Victoria_Square,_Montreal)
A simple search on Google gave me the Wikipedia URL above, but I want to be able to do it programmatically.
For each place in the database I also have its category (square, museum, church, ...). These categories are local only and do not match any standardized categorization.
My goal is to improve this database by associating each place to its dbpedia URI.
My question is: what is the best way to do that? I have some theoretical background in Semantic Web technologies, but I don't yet have the practical skills to determine how to do it.
More specific questions:
Is it possible to determine the DBpedia URI using SPARQL only?
If it is not possible to do it with SPARQL only, what other technologies would I need to accomplish that?
Thank you
First of all, if you have not done it yet, I would recommend having a look at Wikidata. This project is a semantic extension of Wikipedia, but unlike DBpedia, the data is not extracted from Wikipedia; it is created by contributors, and therefore appears (or will appear, as the project is still growing) to be more relevant.
The service offers many ways to access the data (including a SPARQL endpoint), and its main advantage is that the underlying software is MediaWiki, the same software used for Wikipedia and other Wikimedia Foundation projects. The MediaWiki API offers an OpenSearch option that should let you search more efficiently than SPARQL queries.
Putting everything together, I think it is worth having a look at the Wikidata + Wikipedia APIs to get pivot data to align your local database against.
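To your first sub-question: a pure SPARQL lookup is technically possible as long as your local name matches a label in DBpedia. Here is a minimal Jena sketch (assuming Jena 3.x and the public DBpedia endpoint; the exact label string is only my guess at how the article is titled), which shows both why it works and why it is fragile:

    import org.apache.jena.query.*;

    public class DbpediaLookup {
        public static void main(String[] args) {
            // Exact label matching against the public DBpedia endpoint.
            String q = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                     + "SELECT ?place WHERE { ?place rdfs:label \"Victoria Square, Montreal\"@en }";
            try (QueryExecution qe = QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", q)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    // Prints the http://dbpedia.org/resource/... URIs whose English label matches exactly.
                    System.out.println(results.next().getResource("place").getURI());
                }
            }
        }
    }

If your local names don't match DBpedia labels exactly (accents, word order, "Square Victoria" vs. "Victoria Square"), the exact match returns nothing, which is exactly where the OpenSearch/Wikidata approach above earns its keep.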
Not a direct answer, but I hope it helps.
How architecturally sound and up to industry standards is nesting resource representations in REST APIs, especially when it comes to nested lists of resources (like the books of an author)?
I'm interested in finding links to authoritative sources that answer this question.
The authoritative source for REST is the dissertation of Roy Fielding, based on work he did during the standardization of HTTP/1.1 (RFC 2068, RFC 2616, etc) in the 1990s.
REST defines resource ("Any information that can be named can be a resource..."), and requires that all resources understand messages the same way (uniform interface) but does not actually constrain your resource model.
"RESTful", historically, is context sensitive; in practice it means something like "more like REST than our current designs". In the web services community, it meant "more like REST than WS-* and SOAP". In Rails, it meant more like REST than the resource models that were recommended prior to Rails 1.2. And so on.
If what you are interested in is describing the relationship between a resource that is a collection and a resource that is an item in that collection, then the standard you want is RFC 6573.
But again, it doesn't tell you how to design the resources, or how to design the identifiers for those resources -- it just tells you how to indicate a relationship between them.
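For a concrete flavour of that relationship mechanism, here is a minimal sketch (assuming a JAX-RS stack; the paths and the body are made up) in which the representation of a single book links to the collection it belongs to using the RFC 6573 "collection" relation; the collection resource would point back at its members with "item":

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;

    @Path("/books/{id}")
    public class BookResource {
        @GET
        public Response get() {
            String bookRepresentation = "{\"title\": \"...\"}";   // placeholder representation
            return Response.ok(bookRepresentation)
                    .link("/authors/42/books", "collection")      // RFC 6573: item -> its collection
                    .build();
        }
    }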
As far as I understand, a web resource is something abstract, identified by an IRI and accessible through the web. What dereferencing the IRI gives back is a representation of the current state of the identified resource; this is why it is called representational state transfer. I don't remember any standard that discusses nested resources. Maybe RDF is the closest to what you are looking for. In practice, if we follow RDF concepts, then to answer a GET request the REST API responds with a representation of an RDF subgraph starting at the resource identified by the given IRI, and it can be any level deep. Nestedness is not something I would consider here, because it is a graph, not a hierarchy; it is sort of expanding relationships between resources, or returning hyperlinks the API consumers can follow to do the exact same thing.
Not sure if this helps. I did not find any RFC beyond what VoiceOfUnreason's answer contains. I remember reading explicitly about web resources and identifying real-world things with hash URIs or non-dereferenceable IRIs in an RFC 5+ years ago, but I have no idea which one it was. Maybe it was the Lanthaler dissertation or the SemWeb document VoiceOfUnreason suggested. What is certain is that it was somehow connected to the Semantic Web and RDF.
REST’s identification of resources constraint requires that resources are identifiable so that they can be accessed and manipulated via generic interfaces. On the Web, resources are identified by IRIs [44]. Since a resource may represent concepts which cannot be serialized into a byte stream (e.g., persons or a feeling), resources are not manipulated directly. Instead, REST is built on the concept of manipulation of resources through representations; i.e., an additional layer of indirection in the form of resource representations is introduced.
https://www.markus-lanthaler.com/research/third-generation-web-apis-bridging-the-gap-between-rest-and-linked-data.pdf
On the Semantic Web, all information has to be expressed as statements about resources, like the members of the company Example.com are Alice and Bob or Bob's telephone number is "+1 555 262" or this Web page was created by Alice. Resources are identified by Uniform Resource Identifiers (URIs) [RFC3986]. This modelling approach is at the heart of Resource Description Framework (RDF) [RDFPrimer]. A nice introduction is given in the N3 primer [N3Primer].
Using RDF, the statements can be published on the Web site of the company. Others can read the data and publish their own information, linking to existing resources. This forms a distributed model of the world. It allows the user to pick any application to view and work with the same data, for example to see Alice's published address in your address book.
https://www.w3.org/TR/cooluris/#semweb
So what I want to say is that what you see in the HTTP response is not the resource itself, just a representation of it and of its relationships to other resources.
REST does not have a constraint which tells you how verbose that response must be. It just tells you that you must use hyperlinks to connect resources and that you must use standard MIME types and document your API. At least this is how I interpret the uniform interface constraint.
I think the question is very good, because this part of the architecture is open, and there have been many questions in the past years asking how to use URIs for querying nested resources. The answer is always that REST does not cover it, and the URI and URI template standards don't cover it either. There are standards like OData and Hydra which make suggestions, but it is just up to you. Your problem is connected to this, because it asks how verbose a response to such a query can be. That is not covered either, as far as I can tell, but what is certain is that the response can, and must, contain at least hyperlinks to other resources. RDF allows describing several resources in a single document, so if we extend the RDF approach to REST, which does not say this is forbidden, then I guess we can do it.
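As a small illustration of "several resources in a single document", here is a Jena sketch (the names and the namespace are made up, not a prescription for your API) that describes two resources and the edge between them in one response body:

    import org.apache.jena.rdf.model.*;

    public class SubgraphResponse {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            String ns = "http://api.example.com/";
            Resource author = m.createResource(ns + "authors/42")
                    .addProperty(m.createProperty(ns + "name"), "Alice");
            m.createResource(ns + "books/7")
                    .addProperty(m.createProperty(ns + "title"), "Some Title")
                    .addProperty(m.createProperty(ns + "writtenBy"), author);   // link between the two resources
            m.write(System.out, "TURTLE");   // one document describing two resources plus their relationship
        }
    }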
From a practical perspective, a collection is a sort of nested resource too, and if the API consumer had to send a dedicated request for every collection item just to learn basic things like product names, it would be wasting resources. Normally we answer this kind of request with a single HTTP response, or with multiple pages of 25-50-100 items each. It does not make much sense, from a usability or scalability perspective, to give the consumer a hyperlink for each item and force them to follow those links one by one. In fact we like to respond with the exact view model the consumer needs and design APIs this way. I think the same is true for nested properties as well. From an RDF perspective these responses represent a subgraph of a massive resource graph, managed by the REST service and by RDF vocabulary maintainers (OWL, Schema.org, etc.).
So, to give a one-sentence answer: the representation of "nested resources" is not covered by REST and, as far as I know, not covered by standards like HTTP and URI either; but it is currently best practice to use them, and the MIME types we frequently use for REST (e.g. HAL+JSON or RDF/JSON-LD) support nested representations too, so I would say yes.
Do we have any API that can identify content in a text file related to a particular topic?
For example I have a text file having 5000 lines of text in it.
I want to extract the text related to TOPIC ABC. Does Lucene or any other API do that? Any ideas?
I have used Lucene for identifying the documents that contain a particular WORD, but would like to know if there is any API that extracts the content from a file related to a particular topic.
This is quite a broad question, but from the information you have supplied it is clear you have a couple of options.
Option 1: Use an API
You could use the Thomson Reuters Open Calais platform, which is the best I have ever come across for developers. However, I can imagine it would get expensive over time. They provide a demo on their site which is worth checking out.
Option 2: Extend Lucene's VSM
When I say extend Lucene, I don't mean you need to do it yourself. There are open-source projects readily available to be taken advantage of. For example, Lucene-LDA, which allows queries over Latent Dirichlet allocation (LDA) topic models. This particular project hasn't been updated in about 3-4 years, so you may want to fork it or build your own.
Hi everyone. I recently integrated Google Translate into my project, where it translates product names, product descriptions, and product-related category names. But because there are plenty of products in my database (and the number is increasing quickly), the Google Translate API would cost considerable money.
I want to call Google Translate as little as possible. In these translations, many words are shared among many products, for example: 阿迪达斯 - Adidas, 苹果 - iPhone, 篮球 - Basketball, and so on. I want to apply some tricks here, but have no idea how.
Has anyone encountered this kind of problem?
Any help would be appreciated.
It sounds like what you need is actually the ability to reuse translation at the string or substring level (in other words, per database entry). You can't really do that with Google, that I know of. You've got a few options, as I see it:
You could switch over to Microsoft Translator and use their methods that allow you to place translations yourself, such as their Collaborative Translation feature that lets you override the MT with a preferred translation and even vote translations up/down. Quality here will be broadly comparable to Google (I often find it better), and you have methods at your disposal that allow this override. Also, unlike Google, the Microsoft API is free up to a certain volume. Take a look:
http://www.microsoft.com/en-us/translator/developers.aspx
Microsoft also has a unique feature called the Microsoft Translator Hub, which can use your terminology, for example, for translations. However, depending on how you implement any solution with Microsoft, you might still have the problem that you are making more calls out to Microsoft than you'd like, and, moreover, that "matching" only takes place at the level of a whole record or string, so it would not cover the case of shared linguistic elements being concatenated into one string.
There's a commercial offering called GeoFluent (full disclosure: I am the product manager for this product, so I'm clearly biased :)) that works with Microsoft Translator but provides pre- and post-translation processing that can deal with sub-segment matches and may reduce the volume you are putting through translation each time. It could make sense if, as you mention, you are rapidly adding to your database. Of course, this is a commercial offering too, so you'd have to balance the costs.
Let me know if this helps, and happy to answer any other questions you have.
Marcus
There is a PHP sample here: http://weblite.ca/svn/dataface/modules/tm/trunk/lib/googleTranslatePlugin.php
It allows you to send an array and get an array back: its getTranslations() method translates all of the user-provided strings into the target language using the Google Translate API and returns an array of source=>target strings.
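A cheap trick for the "translate as little as possible" goal, whichever API you end up with, is to keep your own cache of already-translated strings and only send the missing ones. A minimal sketch (the translator function is a placeholder for the real Google/Microsoft call; the cache would normally live in your database):

    import java.util.*;
    import java.util.function.Function;

    public class TranslationCache {
        static Map<String, String> translateWithCache(Map<String, String> cache,
                                                      Collection<String> productStrings,
                                                      Function<Collection<String>, Map<String, String>> translator) {
            List<String> missing = new ArrayList<>();
            for (String source : productStrings) {
                if (!cache.containsKey(source)) {
                    missing.add(source);                  // only unseen strings are sent to the paid API
                }
            }
            if (!missing.isEmpty()) {
                cache.putAll(translator.apply(missing));  // translator wraps the real translation call
            }
            return cache;                                 // repeated strings (阿迪达斯, 苹果, 篮球, ...) cost nothing next time
        }
    }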
I'm currently investigating options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket' he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of entity extraction is high enough.
Dbpedia Spotlight, the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger (not open source, but free)
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? Or whether training data is required and available?
Are there articles or tutorials where I can get started with entity extraction (NER) for each and every tool?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, org, location).
For disambiguation, you need a knowledge base against which entities are being disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often use a language-independent API like REST, and I do not know that they directly provide Lucene support, but I hope my answer has been beneficial for the problem you are trying to solve.
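As an example of what calling one of these over REST can look like, here is a minimal sketch against the hosted DBpedia Spotlight annotate endpoint (assuming Java 11+; the endpoint URL and the confidence parameter reflect the public demo service and may change):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SpotlightExample {
        public static void main(String[] args) throws Exception {
            String text = URLEncoder.encode("Apache Wicket is a component-based web framework.", StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.dbpedia-spotlight.org/en/annotate?text=" + text + "&confidence=0.4"))
                    .header("Accept", "application/json")
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON body lists surface forms with their DBpedia URIs, e.g. .../resource/Apache_Wicket.
            System.out.println(response.body());
        }
    }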
You can use OpenNLP to extract names of people, places, and organisations without training it yourself. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
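For reference, using one of those pre-trained models boils down to a few lines (a sketch assuming OpenNLP 1.5+ and the en-ner-person.bin model from the download page above sitting in the working directory):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.util.Span;

    public class NameFinderExample {
        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
                NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize("Pierre Vinken will join the board as a director.");
                for (Span span : finder.find(tokens)) {
                    // Each span covers the token range of one detected person name.
                    System.out.println(span.getType() + ": "
                            + String.join(" ", Arrays.copyOfRange(tokens, span.getStart(), span.getEnd())));
                }
            }
        }
    }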
Rosoka is a commercial product that provides a computation of "Salience" which measures the importance of the term or entity to the document. Salience is based on the linguistic usage and not the frequency. Using the salience values you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON which makes it very easy to use with Lucene.
It is written in java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features available to it that the full Rosoka does.
Yes both versions perform entity and term disambiguation based on the linguistic usage.
Disambiguation, whether by a human or by software, requires that there is enough contextual information to be able to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the users. The former is more specific, and the latter has the greater potential ambiguity. I.e. typing the keyword "wicket" into a Google search could refer to cricket, Apache software or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a place name, etc.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
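If you just want to try the pre-trained models before training your own, the API is small (a sketch assuming the classifier file from the Stanford NER download is at the path shown):

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class StanfordNerExample {
        public static void main(String[] args) throws Exception {
            CRFClassifier<CoreLabel> classifier =
                    CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");
            // Wraps recognized entities in inline XML tags such as <PERSON> and <ORGANIZATION>.
            System.out.println(classifier.classifyWithInlineXML(
                    "Jim bought 300 shares of Acme Corp. in 2006."));
        }
    }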
Unfortunately, in my case, the named entities are not efficiently extracted from the document. Most of the entities go undetected.
Just in case you find it useful.
Although I have a little bit of experience in developing dynamic websites using ASP technologies, I am new to semantic web programming, and I intend to implement a website based on semantic web technology. I would like to develop a search engine where a web user can query for keywords against a backend RDF triple store. I want to implement the website using Java and JSP. I have the following questions:
I am currently studying the Jena framework and SPARQL to start with, but I am not sure what other technologies I need to study in order to implement the website.
What is the difference between RDF and OWL? I have gone through a lot of web resources but I am still confused. As per my understanding, RDF and OWL both define relationships between concepts, but OWL is richer in terms of defining relations.
What is meant by the different OWL vocabularies like FOAF, SIOC etc.? Why do we need these vocabularies?
What exactly is the purpose of OpenLink Virtuoso (http://ods.openlinksw.com/dataspace/dav/wiki/Main/VirtJenaProvider)?
Any help would be highly appreciated.
Thanks!
I would definitely like to be kept up to date on your progress. I'm not experienced with Java or JSP. I wonder if this could be done in PHP? I know that some work has been done in Python on this kind of thing.
There are some extensions to Drupal that work with these semantic web technologies, and Semantic MediaWiki is good too.
Check out this and the related links at the bottom. The difference between microformats and vocabularies can be difficult to understand but I think there is a difference, say between a vocabulary like FOAF and a microformat like hCard, hCalendar or hResume. Oh, the link:
http://en.wikipedia.org/wiki/FOAF_(software)
Anyway these related terms are included.
Thanks,
Bruce
http://futurewavedesigns.com
Re: your first question - why do you want to use RDF to implement a keyword search? Keyword search isn't semantic, and there are many established frameworks and APIs for keyword search, such as Lucene.
Re: your second question, comparing RDF and OWL is comparing apples and oranges. RDF is basically for declaring data, but OWL is a layer on top of RDF that is for declaring ontologies (schemas). A more meaningful comparison would be between RDFS (RDF Schema) and OWL, which both address the ontology layer.
Example:
In RDF you might state that John Smith is a Person who hasAge "42" and is marriedTo Jill Smith.
In RDFS or OWL you would declare that Person is a class, hasAge is a property (with domain of Person and range of xsd:integer) and marriedTo is a property (with domain and range of Person).
In OWL you can also declare that marriedTo is a symmetric property (if A is marriedTo B, then B must be marriedTo A). RDF isn't this powerful, so you can't make this particular statement and can't draw inferences about symmetric properties, etc.
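To make the difference tangible, here is a small Jena sketch (the namespace and the individual names are made up) that declares the schema in OWL and lets a reasoner infer the symmetric statement that plain RDF could not give you:

    import org.apache.jena.ontology.*;
    import org.apache.jena.rdf.model.ModelFactory;

    public class SymmetricExample {
        public static void main(String[] args) {
            String ns = "http://example.org/family#";
            OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);

            OntClass person = m.createClass(ns + "Person");
            SymmetricProperty marriedTo = m.createSymmetricProperty(ns + "marriedTo");

            Individual john = m.createIndividual(ns + "JohnSmith", person);
            Individual jill = m.createIndividual(ns + "JillSmith", person);
            john.addProperty(marriedTo, jill);                      // only one direction is asserted

            // The OWL rule reasoner adds the inverse statement because marriedTo is symmetric.
            System.out.println(m.contains(jill, marriedTo, john));  // prints: true
        }
    }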