Best practices for adding semantics to a website

I am a bit confused about the semantics of websites. I understand that every URI should represent a resource. I assume that all information provided by RDFa inside a webpage describes the resource represented by the URI of that webpage. My question is: what are best practices for providing semantic data for subpages of a website?
In my case I want to create a website for a theater group called Magma using RDFa with the schema.org and Open Graph vocabularies. Let's say I have the welcome page (http://magma.com/), a contact page (http://magma.com/contact/) and pages for individual plays (http://magma.com/play/<playid>/).
Now I would think that both the welcome page and the contact page represent the same resource (Magma) while providing different information about that resource. The play pages, however, represent plays that only happen to be performed by Magma. Or is it better to say that the play pages also represent Magma but provide information about plays that will be performed by that group? The third option I stumbled upon is http://schema.org/WebPage. Subtypes like ContactPage seem especially relevant.
When it comes to implementation, where do I put the RDFa?
And finally: how will my choice change the way the website is treated by third parties (Google, Facebook, ...)?
I realize this question is a bit blurry. To make it more concrete, I will add an example that you might criticize:
<html vocab="http://schema.org/" typeof="TheaterGroup">
<head>
<meta charset="UTF-8"/>
<title>Magma - Romeo and Juliet</title>
<!-- magma semantics from a template file -->
<meta property="name" content="Magma"/>
<meta property="logo" content="/static/logo.png"/>
<link rel="home" property="url" content="http://magma.com/"/>
</head>
<body>
<h1>Romeo and Juliet</h1>
<!-- semantics of the play -->
<div typeof="CreativeWork" name="Romeo and Juliet">
...
</div>
<h2>Shows</h2>
<!-- semantics of magma events -->
<ul property="events">
<li typeof="Event"><time property="startDate">...</time></li>
...
</ul>
</body>
</html>

I understand that every URI should represent a resource. I assume that all information provided by RDFa inside a webpage describes the resource represented by the URI of that webpage.
Well, an HTTP URI could identify the page itself OR the thing the page is about. You can't tell whether a URI identifies the page or the thing by simply looking at it.
Example (in Turtle syntax):
<http://en.wikipedia.org/wiki/The_Lord_of_the_Rings> ex:author "John Doe" .
This could mean that the HTML page with the URI http://en.wikipedia.org/wiki/The_Lord_of_the_Rings is authored by "John Doe". Or it could mean that the thing described by that HTML page (→ the novel) is authored by "John Doe". Of course this is an important difference.
There are various ways to differentiate what a URI represents, and there is some dispute about it. The discussion around this is known as the httpRange-14 issue. See for example the Wikipedia article Web resource.
One way is using hash URIs (see also this answer). Example: http://magma.com/play/42 could identify the page about the play, http://magma.com/play/42#play could identify the play.
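The reason hash URIs work can be shown with a minimal Python sketch (standard library only). The helper name `document_uri` is hypothetical and not part of any RDF tooling; it only illustrates that the fragment is dropped before a request is made, so the URI of the thing always leads back to the page that describes it.

```python
from urllib.parse import urldefrag


def document_uri(thing_uri: str) -> str:
    """Return the URI of the document a client would actually request.

    The fragment (the part after '#') is never sent to the server, so a
    hash URI for a thing resolves to the page that describes it.
    """
    url, _fragment = urldefrag(thing_uri)
    return url


if __name__ == "__main__":
    # The play itself (hash URI) versus the page about the play.
    print(document_uri("http://magma.com/play/42#play"))
```

The two URIs stay distinct as identifiers in RDF, even though only one document is ever served.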
Another way is using HTTP status code 303. The code 200 gives the representation of the page about the thing, the code 303 See Other gives an additional URI identifying the thing. This method is used by DBpedia:
http://dbpedia.org/resource/The_Lord_of_the_Rings represents the novel
http://dbpedia.org/page/The_Lord_of_the_Rings represents the page about the novel
(or http://dbpedia.org/data/The_Lord_of_the_Rings for machines)
See Choosing between 303 and Hash.
Now, when using RDFa, you can make statements about both the page itself and the thing represented by the page. Just use the corresponding URI as subject (e.g., by using the resource attribute).
So let's say http://magma.com/#magma represents the theater group. Now you could use this URI on every page (/contact, /play/, …) to make statements about the group or to refer to the group.
<div resource="http://magma.com/#magma">
<span property="ex:name">Magma</span>
</div>
<div resource="http://magma.com/">
<span property="ex:name">Website of Magma</span>
</div>

I suggest that you first look at the schema.org documentation, which is straightforward. This vocabulary covers your concerns comprehensively and is supported by the major search engines.
Here is an example snippet to get you started; you can include it straight in an HTML page. When you talk about the performance of the play on a page, you could use:
<div itemscope itemtype="http://schema.org/TheaterEvent">
<h1 itemprop="name">Romeo and Juliet</h1>
<span itemprop="location">Council Bluffs, IA, US</span>
<meta itemprop="startDate" content="2011-05-23">May 23
Buy tickets
</div>
On your contact page you could include:
<div itemscope itemtype="http://schema.org/TheaterGroup">
<span itemprop="name">Magma</span>
Tel:<span itemprop="telephone">( 33 1) 42 68 53 00 </span>
</div>

Related

Eventbrite API list events 403

I'm trying to fetch events based on time from the Eventbrite API with the following command from the docs:
curl -X GET https://www.eventbriteapi.com/v3/events/search?date_modified.range_start=2018-01-01T00:00:01Z -H 'Authorization: Bearer MY_API_TOKEN'
However, it returns a 403 with the following HTML in the response body:
<html lang="en">
<head>
<title>Whoops!</title>
</head>
<body>
<div id="whoops_wrapper">
<img id="logo" src="https://cdn.evbstatic.com/s3-s3/static/images/django/logos/eb_home_stroke-trans.png" width="154" />
<h1>
This page is <br />currently unavailable.
</h1>
<p>
The Team is currently working to return you to the service as quickly as possible.<br />
If you need to reach us immediately, please <a href='https://www.eventbrite.com/support/contact-us'>contact us</a>.
</p>
<p>
We'll keep you updated on <a href='https://www.twitter.com/eventbrite'>Twitter</a>.<br />
<a href='https://www.eventbritestatus.com/'>Eventbrite Status Page</a>
</p>
<p><font size="-2">Response code: 403</font></p>
</div>
</body>
</html>
It used to work fine, and it has been like this for a couple of days now. I tried creating a new API token, with no success. Calling other endpoints of the API seems to work fine. I also didn't hit the rate limit. Any ideas what it could be or things to try?

According to the linked Google Group thread:
Thank you for your patience. We recently made some changes to our APIs in an effort to improve platform functionality and performance. Some of these adjustments necessitated the unforeseen and immediate deprecation of one of our public Event Search APIs events/search endpoint.
Since this change was made, the team has been working hard on potential solutions for providing access to our event feed for API users that are not part of our official distribution partner program. We wanted to reach out to acknowledge the frustration you’re feeling and let you know that we will update you with more details over the coming week as we determine the viability of a potential replacement solution.
Regards,
Eventbrite.
So it looks like they've discontinued search from the public API.
Update: The Search API has been discontinued as of December 12.

Just came across the official Google support group. Seems like an internal Eventbrite issue. :(

How do I crawl twitch.tv where the html body was empty upon initial http request, and contents were loaded by various scripts

I am trying to use Scrapy to crawl through stream pages on twitch. The problem is that the html request returns no useful urls. For example, with wget to twitch.tv main page, I get an empty body tag:
<body>
//some stuff
<div id='flyout'>
<div class='point'>
</div>
<div class='content'>
</div>
</div>
</body>
I understand the content was somehow loaded afterwards, but I couldn't figure out how it was done. Any ideas, suggestions? Thanks!!!

Open up a browser with the dev tools open as well. Click the network tab, then go to twitch.tv and dive into the requests to see which ones provide which parts of the content, and narrow it down to the content you want (given the example below, the request URL will most likely be a request to some form of https://api.twitch.tv/{path to endpoint}/{name of endpoint}?{endpointarg=value}). For example:
If you want to get all the data for the featured content on the homepage you may find that instead of starting your crawl on twitch.tv, you should instead go to https://api.twitch.tv/kraken/streams/featured?limit=6&geo=US&lang=en&on_site=1, which provides nice JSON formatted data like so:
{"_links":
{"self":"https://api.twitch.tv/kraken/streams/featured?geo=US&lang=en&limit=6&offset=0",
"next":"https://api.twitch.tv/kraken/streams/featured?geo=US&lang=en&limit=6&offset=6"},
"featured":[
{"text":"<p>SNES Super Stars is a 11-day speedrun marathon devoted to the Super Nintendo Entertainment System. From March 10th-20th, watch over 200 games being beaten amazingly fast and races between some of the top speedrunners in the world!</p>\n\n<br>\n\n\n<p>Click here to watch and chat!</p>\n\n<p></p>\n",
"title":"SNES Super Stars Marathon",
"sponsored":false,
"priority":5,
"scheduled":true,
...
And you could just follow links from there. You will also have to emulate the headers for that request: the example above won't work unless you specify a client-id in your request header, which you can probably pull from the headers of the original request. Every section or feature of the site probably has its own API endpoint that you may be able to access, and it is also a bit easier on Twitch's servers because they don't have to serve up all those pictures and video, kind of a win-win. Also, if you look at the query arguments at the end of the URL, you can probably control how many items you get back (limit=6).
That should get what you want although you will have to dig around for the endpoints. But, if for whatever reason you really need to dynamically process javascript and don't want to automate a browser with selenium while staying within the scrapy ecosystem, then there is also scrapinghub's splash project which integrates quite well with scrapy.
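As a rough illustration of the API-first approach described in this answer, here is a minimal Python sketch that only builds the request (it does not send it). The function name `build_featured_request`, the `Client-ID` header spelling and the `YOUR_CLIENT_ID` placeholder are assumptions; the endpoint and query arguments are copied from the example URL above and may no longer be served.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this answer; it may have changed since.
FEATURED_ENDPOINT = "https://api.twitch.tv/kraken/streams/featured"


def build_featured_request(client_id: str, limit: int = 6, offset: int = 0):
    """Return (url, headers) for one page of featured streams.

    `client_id` is a placeholder: copy the real value from the request
    headers shown in the browser's network tab.
    """
    query = urlencode({"limit": limit, "offset": offset, "geo": "US", "lang": "en"})
    url = f"{FEATURED_ENDPOINT}?{query}"
    headers = {"Client-ID": client_id}  # header name is an assumption
    return url, headers


if __name__ == "__main__":
    url, headers = build_featured_request("YOUR_CLIENT_ID", limit=25)
    print(url)
    print(headers)
```

In a Scrapy spider the pair could be passed on as `scrapy.Request(url, headers=headers)`, and the `next` link in the `_links` object of the JSON response could be followed for paging.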

SEO on external html page rendered by <object> tag

I have this HTML:
<body>
<object id="post" data="post/Requirement Process Narative.html" type="text/html"> </object>
</body>
I also want Google to index the keywords from the file Requirement Process Narative.html.
That is, if Requirement Process Narative.html contains "Domain Knowledge Acquiring" and someone searches for "Domain Knowledge Acquiring", Google should display the current page in its search results.
How can I do this?

Google fetches and indexes the content of object tags if it matches Schema.org standards.
So your content has the potential to be interpreted and included in the SERP if useful for the end user.
You can read more about Schema here:
http://blog.schema.org/2011/07/on-june-2-nd-we-announced-collaboration.html
Also consider using the Fetch as Googlebot feature in Google Webmaster Tools to see what Google actually sees of your page and to optimize your code.

Concepts for RDFa DRY references

I started digging into RDFa recently and am trying to spice up my website with semantic information. The site offers services, events, a blog and may offer products in the future. Happily, schema.org has coarse but adequate categories for it all. But now it comes down to practical questions.
All the examples have all information on a single page, which seems pretty academic to me. E.g. on my landing page is a list with upcoming events. Events have a location property. My events run at 2 different locations. I could paste the location information for each entry in and inflate my html. I'd rather link to pages, which describe the locations and hold full details. Not sure, whether this is what sameAs is for. But even then, how would it know which RDFa information on the target URL should be used as the appropriate vCard?
Similarly, my landing page has only partial company information visible. I could add a lot of <meta>, but again a reference to the contact page would be nice.
I just don't want to believe that this aspect escaped the RDF creators. Are there any best practices for reducing redundancy?

URIs! (or IRIs, e.g., in RDFa 1.1)
That’s one of the primary qualities of RDF, and it makes Linked Data possible, as coined by Tim Berners-Lee (emphasis mine):
The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.
Like the web of hypertext, the web of data is constructed with documents on the web. However, unlike the web of hypertext, where links are relationships anchors in hypertext documents written in HTML, for data they links between arbitrary things described by RDF
From my answer to a question about the Semantic Web:
Use RDF (in the form of a serialization format of your choice) and define URIs for your entities so that you and other people can make statements about them.
So give all your "entities" a URI and use it as subject or object in RDF triples. Note that you may not want to use the same URI that your web pages have, as it would make it hard to distinguish between data about the web page and data about the thing represented by the web page (see my answer describing this in more detail).
So let’s say your website has these two pages:
http://example.com/event/42 (about the event 42, i.e., the HTML page)
http://example.com/location/51 (about the location 51, i.e., the HTML page)
Using the hash URI method, you could mint these URIs:
http://example.com/event/42#it (the event 42, i.e., the real thing)
http://example.com/location/51#it (the location 51, i.e., the real thing)
Now when you want to use the Schema.org vocabulary to give information about your event, you may use resource to give its URI:
<!-- on http://example.com/event/42 -->
<article resource="#it" typeof="schema:Event">
<h1 property="schema:name">Event 42</h1>
</article>
And when you want to specify the event’s location (using Place), you could use the URI of the location:
<!-- on http://example.com/event/42 -->
<article about="#it" typeof="schema:Event">
<h1 property="schema:name">Event 42</h1>
<a property="schema:location" typeof="schema:Place" href="/location/51#it">Location 51</a>
</article>
And on the location page you might have something like:
<!-- on http://example.com/location/51 -->
<article about="#it" typeof="schema:Place">
<h1 property="schema:name">Location 51</h1>
<a property="schema:event" typeof="schema:Event" href="/event/42#it">Event 42</a>
</article>
Aggregating this data, you’ll have these triples (in Turtle):
@prefix schema: <http://schema.org/> .
<http://example.com/location/51#it> a schema:Place .
<http://example.com/location/51#it> schema:event <http://example.com/event/42#it> .
<http://example.com/location/51#it> schema:name "Location 51" .
<http://example.com/event/42#it> a schema:Event .
<http://example.com/event/42#it> schema:location <http://example.com/location/51#it> .
<http://example.com/event/42#it> schema:name "Event 42" .
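The absolute URIs in these triples follow from ordinary relative reference resolution: `#it` and `/location/51#it` are resolved against the URL of the page that contains the markup. A minimal Python sketch of that step, using only the standard library (the helper name `resolve` is hypothetical and is not an RDFa parser):

```python
from urllib.parse import urljoin


def resolve(page_url: str, reference: str) -> str:
    """Resolve a relative RDFa reference against the URL of the page it is on."""
    return urljoin(page_url, reference)


if __name__ == "__main__":
    event_page = "http://example.com/event/42"
    print(resolve(event_page, "#it"))              # subject: the event
    print(resolve(event_page, "/location/51#it"))  # object of schema:location
```

Because both pages resolve to the same URI for the location, the data from the event page and the location page merges without repeating the location details.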
EDIT: I’m not sure (and I hope it’s not the case), but maybe Schema.org expects a blank node with a url (or sameAs?) property instead, e.g.:
<article about="#it" typeof="schema:Event">
<h1 property="schema:name">Event 42</h1>
<div property="schema:location" typeof="schema:Place">
<a property="schema:url" href="/location/51#it">Location 51</a>
</div>
</article>

Each RDF resource has an identifier. The identifier is an IRI (and URLs are a subset of IRIs). So, just reference locations by their identifiers.
Usually, each page describes one implicit main resource and several explicit additional ones. Take a look at the RDFa 1.1 Primer. It has a lot of relevant info.

How Next and Previous link buttons affect the crawler

I am new to SEO. I have done some research and read several guides, but I am still confused.
A Google guide says:
Avoid creating complex webs of navigation links, e.g. linking every
page on your site to every other page.
I have an e-commerce website. We intend to create a page for each issue of a magazine. Issue pages will have Next and Previous link buttons which will move from one issue to another.
Is that a bad idea? Am I violating this rule, or is Google talking about another scenario?
Will that cause all 1000 issues to be indexed, given that the links are dynamic and I will use URL rewriting?
Thanks

This won't be a problem with Google. They clearly explain why it is a good thing to do and how to do it properly.

If you want to fully control how your link juice is passed on and which landing page Google shows for a small website, using this method is not recommended.
But for a website with more than 1k unique pages (where you can't fully control and influence the crawler's behavior), you can use this to ease the crawler's indexing work and to pick the landing page for users.

Pagination can be a fairly complicated aspect of SEO, especially for ecommerce sites.
Here are a few general tips:
If you have a "view all" page, you probably should rel="canonical" all your paginated pages to that page. This is acceptable because the content is identical.
If you don't have a "view all" page, but you want Google to treat the first page as the "canonical" or you want to drive all users to the first page, then use the rel=next/prev attributes to "group" together your like pages.
For ecommerce faceted navigation, you should probably use a combination of rel=next/prev and query parameter controls through Google Webmaster Tools.
At the June 2012 SMX Advanced conference, there were a few good presentations and live blogging posts that highlight a number of these aspects. Most notably, Googler Maile Ohye spoke during that conference ... she's sort of the Queen of Pagination ;)
http://www.slideshare.net/audette/seo-for-pagination-faceted-navigation-canonicalization-hits-and-misses
http://outspokenmedia.com/internet-marketing-conferences/pagination-canonicalization-for-the-pros-smx-advanced-2012/
http://www.bruceclay.com/blog/2012/06/pagination-canonicalization-for-the-pros-smx-advanced/
You might also want to watch this Google video with Maile talking about Pagination http://googlewebmastercentral.blogspot.com/2012/03/video-about-pagination-with-relnext-and.html
Last thing to note ... Bing doesn't support rel=next/prev at this time: http://searchengineland.com/no-bing-doesnt-support-pagination-attributes-to-consolidate-pages-in-a-series-118694

If I understand you correctly, yes, Google is talking about another scenario.
The Next and Previous links on the issue pages, used for navigation from one issue to another, are different from <link rel="next" ... > and <link rel="prev" ... >, which appear in the <head> ... </head> section of the HTML source.
Google will treat webpages with <link rel="next" ... > and/or <link rel="prev" ... > as a series of pages.
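For a site with many issue pages, the head links would normally be generated rather than written by hand. A minimal Python sketch, assuming a hypothetical URL template such as https://example.com/issues/{n} and using rel="prev", the keyword named in Google's pagination article linked in this thread:

```python
from html import escape


def pagination_links(issue: int, total: int, url_pattern: str) -> list:
    """Return the <link> elements for the <head> of one issue page.

    `url_pattern` is a hypothetical URL template containing `{n}`.
    The first page gets no rel="prev" and the last page no rel="next".
    """
    links = []
    if issue > 1:
        prev_url = escape(url_pattern.format(n=issue - 1), quote=True)
        links.append(f'<link rel="prev" href="{prev_url}">')
    if issue < total:
        next_url = escape(url_pattern.format(n=issue + 1), quote=True)
        links.append(f'<link rel="next" href="{next_url}">')
    return links


if __name__ == "__main__":
    for tag in pagination_links(500, 1000, "https://example.com/issues/{n}"):
        print(tag)
```

The visible Next and Previous buttons in the page body stay ordinary anchors; these elements only describe the series to crawlers.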
