Knowing what RDFa vocabulary to use - semantic-web

How do we know which vocabulary/namespace to use to describe data with RDFa?
I have seen a lot of examples that use xmlns:dcterms="http://purl.org/dc/terms/" or xmlns:sioc="http://rdfs.org/sioc/ns#", and then there is this video that uses the FOAF vocabulary.
This is all pretty confusing and I am not sure what these vocabularies mean or what is best to use for the data I am describing. Is there some trick I am missing?

There are many vocabularies. And you could create your own, too, of course (but you probably shouldn’t before checking the possible alternatives).
You’d have to look for vocabularies for your specific needs, for example
by browsing and searching on http://lov.okfn.org/dataset/lov/ (they collect and index open vocabularies),
on W3C’s RDFa Core Initial Context (it lists vocabularies that have pre-defined prefixes for use with RDFa), or
by browsing through http://prefix.cc/ (it’s a lookup for commonly used namespaces, but browsing it can give you an overview).
After some time you get to know the big/broad ones: Schema.org, Dublin Core, FOAF, RSS, SKOS, SIOC, vCard, DOAP, Open Graph, Ontology for Media Resources, GoodRelations, DBpedia Ontology, ….
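To make this concrete, here is a minimal RDFa 1.1 sketch (the document and names are made up) that mixes two of those vocabularies, Dublin Core and FOAF, relying on their pre-defined prefixes from the initial context:

```html
<!-- dcterms: and foaf: are pre-defined in the RDFa 1.1 initial context -->
<div resource="http://example.com/report" typeof="foaf:Document">
  <h1 property="dcterms:title">Annual Report</h1>
  <p>Written by
    <span property="dcterms:creator" typeof="foaf:Person">
      <span property="foaf:name">Jane Doe</span>
    </span>
  </p>
</div>
```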

The simplest thing is to check if schema.org covers your needs. Schema.org is backed by Google and the other major search engines and generally pretty awesome.
If it doesn't suit your needs, then enter a few of the terms you need into a vocabulary search engine. My recommendation is LOV.
Another option is to just ask the community about the best vocabularies for the specific domain you need to represent. A good place is answers.semanticweb.com, which is like StackOverflow but with more RDF experts hanging out.

Things have changed quite a bit since that video was posted. First, as Richard said, you should check whether schema.org fits your needs. Personally, when I need to describe something that's not covered on schema.org, I check LOV as well. If, and only if, I can't find anything in LOV, I will then consider creating a new type or property. A quick way to do this is to use http://open.vocab.org/
Newer versions of RDFa have been published since that video was released: RDFa 1.1 and RDFa Lite. If you want to use schema.org only, I'd recommend checking http://www.w3.org/TR/rdfa-lite/
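For example, RDFa Lite with schema.org needs only a handful of attributes; a minimal sketch (the person and employer are made up):

```html
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span> works as a
  <span property="jobTitle">Professor</span> at
  <a property="worksFor" href="http://example.edu/">Example University</a>.
</div>
```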

Vocabularies are usually domain-specific. The xmlns approach is deprecated; in RDFa 1.1 you declare prefixes with the prefix attribute instead. The RDFa 1.1 profile at http://www.w3.org/profile/rdfa-1.1 lists the vocabularies available as part of the initial context. For vocabularies not covered by the initial context, you declare your own mappings, e.g. prefix="fb: http://ogp.me/ns/fb# vocab2: path2 ..." (see the sketch below).
Sometimes vocabularies overlap in the context of your data. Much like a math problem can be solved with an algebraic, geometric, or other technique, mixing vocabularies is fine. Equivalent terms can be found using http://sameas.org/, and if your consumer base favors one vocabulary over another, you can relate terms yourself with skos:closeMatch and skos:exactMatch, e.g. "gr:Brand skos:closeMatch owl:Thing", with any terms you please.
For cross-cutting concerns that span domain vocabularies, such as customizing presentation in search results, microdata following the schema.org guidelines should be beneficial. But since that has nothing to do with specialization in any particular domain, prefixes are unavailable in the microdata syntax. In my experience, RDFa vocabularies have been helpful in specific domain contexts where the content appeals to a participative audience, while microdata targets the general search-engine use case. For tasks that are too simple to merit a full-fledged vocabulary but still have semantic implications, try http://microformats.org/
Interchanging profile URIs for vocabularies across the three syntaxes is technically valid, but of little practical use, owing to the lack of manpower to implement alternative support for the vocabularies at Web scale. How and why the schema.org vocabulary merited a separate microdata syntax of its own is discussed by Google employee Ian Hickson (a.k.a. Hixie, the editor of the WHATWG HTML5 draft) at http://logbot.glob.com.au/?c=freenode%23whatwg&s=28+Nov+2012&e=28+Nov+2012#c747855 or http://krijnhoetmer.nl/irc-logs/whatwg/20121128#l-1122. In my opinion, RDFa Lite could have played that role within RDFa, much like a core subset within a larger language, with no need for a separate microdata syntax, but ours is an imperfect world!
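A minimal sketch of the prefix attribute (the fb: mapping is Facebook's real Open Graph namespace; the page content is made up):

```html
<head prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns#">
  <!-- og: is also pre-defined in the initial context; re-declaring it is harmless -->
  <meta property="og:title" content="Example Page">
  <meta property="fb:app_id" content="123456789">
</head>
```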

Microdata or JSON-LD? I'm confused

I haven't found a clear and updated answer, even after googling for a few hours, so here goes:
I am aware of the advantages and disadvantages of both Microdata and JSON-LD. I also know that Microdata was dropped from the W3C (and consequently from the browsers' API). What I'm not sure about is how this will affect any site where Microdata is used specifically for SEO purposes.
Does Google support JSON-LD for SERPs? What format does it recommend to use? I am looking for updated answers - not from 2011 or 2012 (if they are still applicable, though, feel free to post them).
What is more appropriate for a dynamic site with lots of content (think: 50,000 videos, images, etc.): JSON-LD, Microdata or RDFa? Why?
Consumers that support Microdata keep supporting Microdata, no matter if or where the syntax is specified.
It’s conceivable that new consumers might decide not to support it, but the syntax is still very popular and still part of WHATWG’s HTML Living Standard, so it’s probably not going to vanish.
About the consumer Google
Some years ago, JSON-LD was not supported for many of their features, and they recommended that authors use Microdata (and they supported RDFa, too). Today it’s different.
See Google’s Markup formats and placement:
JSON-LD is the recommended format. Google is in the process of adding JSON-LD support for all markup-powered features. The table below lists the exceptions to this. We recommend using JSON-LD where possible.
According to the mentioned table, Microdata and RDFa support all of Google’s data types, while JSON-LD supports everything except their Breadcrumbs feature.
I wouldn’t give much weight to their recommendation. They say that "Structured data markup is most easily represented in JSON-LD format", but I think it’s safe to say that this only applies to authors that generate the structured data programmatically (especially from tools that support JSON).
For authors that manually add the structured data markup, it’s typically easier to use Microdata or RDFa, and using these syntaxes minimizes the risk that an author updates the content without updating the structured data, too (see DRY principle).
JSON-LD vs. Microdata vs. RDFa
Unless you know (and care for) consumers that don’t support all three syntaxes, it doesn’t matter. Use what is easier for you and your tools.
If you have no preference, I would say JSON-LD or RDFa, because contrary to Microdata,
both are W3C Recommendations,
both can be used in non-HTML5 contexts,
both allow you to (easily) mix several vocabularies.
JSON-LD if you like your structured data not "intermingled" with your markup (= duplicating the content), RDFa if you like to use your existing markup (= not duplicating the content).
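To illustrate, here is the same (made-up) statement in both syntaxes:

```html
<!-- JSON-LD: structured data kept separate from the markup, content duplicated -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Doe"
}
</script>
<p>Jane Doe</p>

<!-- RDFa: existing markup reused, nothing duplicated -->
<p vocab="https://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
</p>
```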
I've opted to go for JSON-LD because it is easier to read and to generate, and spotting errors is easier in more complicated structures. It is the W3C and Google recommended standard.
One caveat (major if you need to support it) is that, as of May 16, 2017, Bing still doesn't support JSON-LD.
Google's Understand how structured data works now says:
Google recommends using JSON-LD for structured data whenever possible.
It seems reasonable to me to still mix in microdata to avoid duplication of long content, such as articleBody, but generally the industry is JSON-LD all the way.
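If you go that route, here is a minimal sketch (hypothetical article) of marking up the long content in place with microdata, so articleBody doesn't have to be duplicated inside a JSON-LD block:

```html
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">Example Headline</h1>
  <!-- the visible text doubles as the structured data -->
  <div itemprop="articleBody">
    The full article text goes here, marked up exactly once.
  </div>
</article>
```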
I discovered that JSON-LD does support breadcrumbs. I applied breadcrumbs using the latest version of Yoast on my WordPress site, and it passed muster with Google Search Console in the Rich Results Test of the live page, as well as in a crawl of the live page after submitting the sitemap.
It should be noted that Google has deprecated the use of data-vocabulary.org; it wants schema.org.
Microdata is easy to use with Angular 8+, but you can do the same thing with JSON-LD.
For a human, JSON-LD attributes are the easiest to read, but there is no big difference between the two. Just use whatever you already know how to do, to save time.

Microdata - itemid / global identifier conventions for organizations, business or brands markup with schema.org

My question is the following: when marking up an organization, business or brand with microdata and schema.org, should I use its official webpage URL as the global identifier? Is there any kind of better reference that I could use (like IMDB for movies or actors)?
I'd like to know if there's any standard, convention or common practice recommended.
It would be better to use some kind of controlled vocabulary (e.g. VIAF) that uniquely identifies the organization in question.
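A minimal sketch of what that could look like in microdata (the organization, URL, and VIAF ID are all made up; whether consumers honor itemid varies):

```html
<div itemscope itemtype="http://schema.org/Organization"
     itemid="http://www.example.com/#organization">
  <span itemprop="name">Example Corp</span>
  <link itemprop="url" href="http://www.example.com/">
  <!-- point at the controlled-vocabulary identifier -->
  <link itemprop="sameAs" href="http://viaf.org/viaf/123456789/">
</div>
```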
The choice of identifiers is part of the explanation of REST: http://www.infoq.com/articles/rest-introduction
Inspect closely the first principle (for convention); it is phrased in the broader terms of resources rather than specifically organizations, businesses, or brands, but REST is the thesis that started this trend. Microformats accordingly make use of rel="profile" link tags. The concept is expanded further at http://purl.org/ : with a persistent URL, if IMDB, for example, ever switched domains the way the W3C did, the impact on the application you're building right now would be minimized. Dublin Core's use of this with RDFa can be seen in the profile at http://www.w3.org/2011/rdfa-context/rdfa-1.1.html.
(For references:) Applications serving the general public, or open initiatives such as academic support, may be better served by these public profiles. When operating a site for commercial purposes, however, building application-specific "custom" profiles (with the various legal matters identified) that perform reliably with PURLs may be advantageous for building a credible reputation.
Finally, WHATWG considers prefixes too advanced for HTML5's audience, so support for the W3C XHTML xmlns/RDFa prefixes was dropped from microdata. This compels us to reuse long-form URLs for schema.org business/organization/brand resources in the microdata syntax. The "custom" profile then serves as mere goodwill for whoever picks up where your tasks are wrapped up; otherwise a greater variety of items might appear in the content than actually intended, owing to mix-ups.
The good news is that Google supports schema.org as a vocabulary in the RDFa syntax. So, considering RDFa an already "living" standard that originated in a W3C spec, the way to go is: as per the (non-)commercial nature of the application, define PURLs for your namespaces, publish profiles exhibiting your prefixes, and choose the syntax (with the official web-page URL or substitute IRIs) according to your target processors. Currently no vocabulary besides schema.org is widely processed as microdata, and schema.org in RDFa isn't supported by anybody but Google!
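For completeness, a sketch of the RDFa variant mentioned above (same made-up organization and identifiers; resource plays roughly the role that itemid plays in microdata):

```html
<div vocab="http://schema.org/" typeof="Organization"
     resource="http://www.example.com/#organization">
  <span property="name">Example Corp</span>
  <a property="sameAs" href="http://viaf.org/viaf/123456789/">VIAF record</a>
</div>
```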

Microformat's hRecipe vs. Schema's Recipe

I would like to know what are the main differences between Microformat's hRecipe and Schema.org's Recipe and how search engines treat each one.
Besides the differences in code and the fact that the former is open while the latter is proprietary, how do search engines treat each one, and which one is better to implement, both from a long-term perspective and an SEO perspective?
Schema.org with Google, Bing, Yahoo!, and Yandex
Since you asked this question, Microformats' hRecipe has been updated with microformats2 as h-recipe, but otherwise your question remains relevant and is worth answering more than six years later.
…how do search engines treat each one…?
Search engine giants, Google, Microsoft (Bing), and Yahoo!, along with Yandex (a popular search engine in Russia and elsewhere globally) collaborated to create Schema.org and the schemas therein.
This collaboration is the biggest differentiator between Schema.org and Microformats; it does and will likely continue to have an impact on how each treats schemas defined by other parties.
You can read about why they created it and how they treat other formats in the Schema.org FAQ.
Specifically, you may be interested in their answers to…
What is the purpose of schema.org?
Why are Google, Bing, Yandex and Yahoo! collaborating?
I have already added markup in some other format (i.e. microformats, RDFa, data-vocabulary.org, etc). Do I need to change anything on my site?
Why microdata? Why not RDFa or microformats?
Why don't you support other vocabularies such as FOAF, SKOS, etc?
…which one is better to implement, both from a long-term perspective and a SEO perspective?
The schema better to implement is the one with the most support; in this case, that appears to be Schema.org's Recipe. While all of the above search engines still support microformats, mentions of it have disappeared from some of Google's official documentation regarding structured data and rich snippets.
Interestingly, Google recommends a newer syntax for structured data called JSON-LD.
JSON-LD: The future of structured data?
From a long-term perspective, you may want to consider adopting the ever more popular JSON-LD markup syntax with the Schema.org Recipe schema, which even Bing is supporting now (here are examples demonstrating it) despite their documentation having no mention of it.
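For reference, a minimal sketch of the Recipe schema in JSON-LD (the recipe itself is made up):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Plain Pancakes",
  "recipeIngredient": ["Flour", "Milk", "Eggs"]
}
</script>
```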
Pinterest's interesting support
The popular content discovery platform Pinterest supports both schemas and even supports the new JSON-LD syntax (though it is not explicitly mentioned in their documentation).
Despite Schema.org's growing popularity and adoption, Pinterest offers seemingly greater support for the h-recipe microformat with their inclusion of e-instructions as a supported class, whereas Schema.org's corresponding recipeInstructions property is not a supported property.
It's unclear if this is intentional or even which schema they actually prefer, but it is worth keeping in mind if you intend to develop specifically for this platform.
hRecipe is based on class attributes, while schema.org's Recipe is based on multiple dedicated attributes; those are the main differences in the markup. hRecipe is backwards compatible, whereas Recipe is not, because it uses the HTML5 microdata attributes.
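A minimal side-by-side sketch of that difference (the recipe is made up):

```html
<!-- hRecipe (microformats): everything hangs off class values -->
<div class="hrecipe">
  <h1 class="fn">Pancakes</h1>
  <span class="ingredient">Flour</span>
</div>

<!-- schema.org Recipe (microdata): dedicated item* attributes -->
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Pancakes</h1>
  <span itemprop="recipeIngredient">Flour</span>
</div>
```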
The big three search engines say that they'll treat both the same; however, I don't buy that. Google has been pushing their web platform(s) long enough for me to think that they'll be adding extra juice to Recipe, even though I can't prove it. Even if they aren't throwing extra SEO at Recipe, you can be sure they'll work something into SERPs so that if you are using their proprietary markup, you get noticed... more. Take the link element's prefetch and prerender attributes as an example: Google created prerender, and if you use it on your site, voilà, it prerenders in SERPs for the user; prefetch does not.
I'm not sure how to differentiate between a long-term perspective and an SEO perspective; I look at them the same. I'm not saying that you can't, just trying to explain more. I have thought this over before from a client's perspective and asked myself these same questions in regards to microformats as a whole vs. schema.org. It's basically a judgement call: microformats are a tried and true format; there are millions more sites using microformatted data than there are using schema.org's. They aren't going anywhere, and (as noted earlier) they are backwards compatible.
That said, schema.org is backed by the big three, and being HTML5-based, shouldn't have portability problems in the future. As previously mentioned, I'm sure all three will be rewarding users (though I have no proof) in their respective search results. One caveat here, though, is how fast everything on the web is moving; just as quickly as schema.org popped up, it could conceivably be dropped. I doubt it (though I'm hoping) but it is a possibility.
I can't say which is better to implement, but microformats are certainly much easier to implement; they're class-based and so freaking easy.
It is better to use the schema.org formats, as they have been accepted as standard by all of the major search engines (Google, Yahoo, and Bing). Using an alternative microformat may mean that some of the search engines will not recognize that data as being special, and you lose any possible advantages it offers.

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating the options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I've found a lot of tools, but I'm not sure if they are suited for this task, which of them integrate well with Lucene, or if the precision of entity extraction is high enough.
DBpedia Spotlight; the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger (not open source, but free)
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? And whether training data is required and available?
Are there articles or tutorials where I can get started with entity extraction (NER) for each and every tool?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often use a language-independent API such as REST, and I do not know whether they directly provide Lucene support, but I hope my answer has been useful for the problem you are trying to solve.
You can use OpenNLP to extract names of people, places, and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
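A minimal sketch along the lines of that manual, assuming the pre-trained en-ner-person.bin model from the downloads page sits in the working directory:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PersonFinder {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained person-name model (no training required)
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // The name finder works on pre-tokenized sentences
            String[] tokens = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
            for (Span span : finder.find(tokens)) {
                System.out.println(span);  // e.g. [0..2) person
            }
            finder.clearAdaptiveData();  // reset document-level context between documents
        }
    }
}
```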
Rosoka is a commercial product that provides a computation of "Salience" which measures the importance of the term or entity to the document. Salience is based on the linguistic usage and not the frequency. Using the salience values you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features available to it that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether by a human or by software, requires enough contextual information to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the users; the former is more specific, and the latter has the greater potential ambiguity. For instance, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software, or the Star Wars Ewok character (i.e., an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of the project name, etc.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is you can train your own classifier. You should follow the link, which has the guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
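A minimal sketch of running one of their pre-trained classifiers (the model path assumes the classifiers directory of the standard download):

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordNerDemo {
    public static void main(String[] args) throws Exception {
        // Load a serialized, pre-trained 3-class model (PERSON, LOCATION, ORGANIZATION)
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String text = "Jim bought 300 shares of Acme Corp. in 2006.";
        // Tags recognized entities inline, e.g. <PERSON>Jim</PERSON>
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}
```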
Unfortunately, in my case, the named entities are not efficiently extracted from the document. Most of the entities go undetected.
Just in case you find it useful.

Developing a Semantic Web Application

Although I have a little bit of experience in developing dynamic websites using ASP technologies, I am new to semantic web programming, and I intend to implement a website based on semantic web technology. I would like to develop a search engine where a web user can query for keywords from the backend RDF triple store. I want to implement the website using Java and JSP. I have the following questions:
I am currently studying the Jena framework and SPARQL to start with, but I am not sure what other technologies I need to study in order to implement the website.
What is the difference between RDF and OWL? I have gone through a lot of web resources, but I am still confused. As per my understanding, RDF and OWL both define relationships between concepts, but OWL is richer in terms of defining relations.
What is meant by the different OWL vocabularies like FOAF, SIOC, etc.? Why do we need these vocabularies?
What exactly is the purpose of OpenLink Virtuoso (http://ods.openlinksw.com/dataspace/dav/wiki/Main/VirtJenaProvider)?
Any help would be highly appreciated.
Thanks!
I would definitely like to be kept up to date on your progress. I'm not experienced with Java or JSP; I wonder if this could be done in PHP? I know that some work has been done in Python on this kind of thing.
There are some extensions to Drupal that work with these semantic web technologies, and Semantic MediaWiki is good too.
Check out this link and the related links at the bottom. The difference between microformats and vocabularies can be difficult to understand, but I think there is a difference, say between a vocabulary like FOAF and a microformat like hCard, hCalendar, or hResume:
http://en.wikipedia.org/wiki/FOAF_(software)
Anyway, these related terms are included there.
Re: your first question - why do you want to use RDF to implement a keyword search? Keyword search isn't semantic, and there are many established frameworks and APIs for keyword search, such as Lucene.
Re: your second question, comparing RDF and OWL is comparing apples and oranges. RDF is basically for declaring data, but OWL is a layer on top of RDF that is for declaring ontologies (schemas). A more meaningful comparison would be between RDFS (RDF Schema) and OWL, which both address the ontology layer.
Example:
In RDF you might state that John Smith is a Person who hasAge "42" and is marriedTo Jill Smith.
In RDFS or OWL you would declare that Person is a class, hasAge is a property (with domain of Person and range of xsd:integer) and marriedTo is a property (with domain and range of Person).
In OWL you can also declare that marriedTo is a symmetric property (if A is marriedTo B, then B must be marriedTo A). RDFS isn't this powerful, so there you can't make this particular statement, and consequently can't make inferences about symmetric properties etc.
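Since the question mentions Jena, here is a minimal sketch of those three layers in code (the http://example.org/family# namespace and the names are made up):

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.OWL;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class RdfVsRdfsVsOwl {
    public static void main(String[] args) {
        String ns = "http://example.org/family#";  // made-up namespace
        Model model = ModelFactory.createDefaultModel();

        Resource personClass = model.createResource(ns + "Person");
        Property hasAge = model.createProperty(ns + "hasAge");
        Property marriedTo = model.createProperty(ns + "marriedTo");

        // RDF layer: plain data about John and Jill
        Resource jill = model.createResource(ns + "JillSmith")
                             .addProperty(RDF.type, personClass);
        model.createResource(ns + "JohnSmith")
             .addProperty(RDF.type, personClass)
             .addProperty(hasAge, model.createTypedLiteral(42))
             .addProperty(marriedTo, jill);

        // RDFS layer: schema statements about the terms themselves
        personClass.addProperty(RDF.type, RDFS.Class);
        marriedTo.addProperty(RDFS.domain, personClass)
                 .addProperty(RDFS.range, personClass);

        // OWL layer: marriedTo is symmetric -- not expressible in RDFS
        marriedTo.addProperty(RDF.type, OWL.SymmetricProperty);

        model.write(System.out, "TURTLE");
    }
}
```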