How do I find data sets using a particular Schema.org entity? - semantic-web

I am trying to decide whether to use schema.org entities in my own open source app, for potential compatibility with existing open data sets. So I'm looking for usage of relevant schema.org entities "in the wild".
Right now I'm looking for dietary supplement data, i.e. http://schema.org/DietarySupplement or http://health-lifesci.schema.org/DietarySupplement.
I've searched for semantic web search engines and have only found Swoogle, but for that URI I get either no results or "service temporarily unavailable".
The DietarySupplement page on schema.org says that "between 10 and 100" domains are using this entity. Does that refer to DNS domains, abstract domains defined on Schema.org, abstractions defined elsewhere, or something else?

There are only a couple of other resources I can find on this subject.
Web Data Commons - RDFa, Microdata, and Microformat Data Sets (see the sketch below)
BuiltWith trends - Microdata Usage Statistics
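The Web Data Commons extracts are distributed as N-Quads files, so one rough way to look for usage "in the wild" is to filter an extract for subjects typed as schema:DietarySupplement. This is only a sketch using rdflib; the file name is hypothetical, and the real dumps are large enough that you would normally stream them rather than load them into memory as done here.

# Sketch: look for schema:DietarySupplement instances in a local N-Quads sample.
from rdflib import ConjunctiveGraph, URIRef
from rdflib.namespace import RDF

DIETARY_SUPPLEMENT = URIRef("http://schema.org/DietarySupplement")

g = ConjunctiveGraph()
g.parse("webdatacommons-sample.nq", format="nquads")  # hypothetical local sample file

# Print every subject that is explicitly typed as schema:DietarySupplement.
for subject in set(g.subjects(RDF.type, DIETARY_SUPPLEMENT)):
    print(subject)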

Related

Is it allowed to access the same resource from multiple endpoints in a REST API?

I have two endpoints:
{
  path: "/pages",
  description: "Retrieves information about all public pages. If the user is logged in, the call returns all pages that belong to the user and that are public.",
  arguments: "page, owner_id, creation_date, page_id, status, author",
},
{
  path: "/pages/pageId",
  description: "Retrieves information about a specific page.",
  arguments: "page, owner_id, creation_date, status, author",
}
I could retrieve a page using /pages?pageId=xxxxx and /pages/pageId. Would this break DX (developer experience) or any other REST convention (like consistency)?
Nothing wrong with that from a REST perspective - it is normal that we might have multiple resources (each with their own identifier) that have the same representations.
For example, the "authors' preferred version" of an academic paper is a mapping whose value changes over time, whereas a mapping to "the paper published in the proceedings of conference X" is static. These are two distinct resources, even if they both map to the same value at some point in time. The distinction is necessary so that both resources can be identified and referenced independently. A similar example from software engineering is the separate identification of a version-controlled source code file when referring to the "latest revision", "revision number 1.2.7", or "revision included with the Orange release." -- Fielding, 2000
Note that general-purpose components won't know that two different resource identifiers point to "the same" representations. For instance, if we have cached copies of both resources and we update one of them (e.g. via POST, PUT, or PATCH), a general-purpose component won't know that the other has also changed.
You can, of course, design your resources so that one spelling redirects to the other:
GET /pages?pageId=12345
307 Temporary Redirect
Location: /pages/12345
You can also use HTTP metadata if the representation of one resource is "really" from a different resource:
GET /pages?pageId=12345
200 OK
Content-Location: /pages/12345
...
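A minimal sketch of the two approaches above, assuming Flask; the route and parameter names mirror the question, and the handler bodies are placeholders:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/pages")
def pages():
    page_id = request.args.get("pageId")
    if page_id:
        # Option 1 would redirect the query-string spelling to the path spelling:
        #   return redirect(f"/pages/{page_id}", code=307)   # flask.redirect
        # Option 2: serve the representation here and label where it really lives.
        response = jsonify({"id": page_id})
        response.headers["Content-Location"] = f"/pages/{page_id}"
        return response
    return jsonify([])  # listing of all public pages omitted

@app.route("/pages/<page_id>")
def page(page_id):
    return jsonify({"id": page_id})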
A resource can have multiple identifiers (IRIs) and multiple representations. The representations can depend on authorization or anything else that comes with the request (e.g. different request headers), as long as the statelessness constraint is respected. REST relies on standards, or at least recommendations, per the uniform interface constraint, so I would not worry much about IRI conventions, though you can follow a nice IRI convention if you want to. Normally you get back hypertext in the case of REST, so instead of caring about the IRI structure, your clients follow hyperlinks and use the metadata of those hyperlinks to decide what to do, not the IRI structure, which is completely irrelevant if we talk about REST as Fielding described it.
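To illustrate that last point, a hypermedia-driven client picks links by their relation metadata rather than building IRIs itself. The response shape and relation names below are invented for the example:

def find_link(representation, rel):
    """Return the target of the first link with the given relation."""
    for link in representation.get("links", []):
        if link["rel"] == rel:
            return link["href"]
    return None

page = {
    "id": "12345",
    "links": [
        {"rel": "self", "href": "/pages/12345"},
        {"rel": "author", "href": "/users/42"},
    ],
}

author_url = find_link(page, "author")  # the client never constructs this IRI itself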

Google duplicate content issue for social network applications

I am making a social network application where users come and share posts, like on Facebook. But now I have a doubt: let's say a user shares content by copying it from another site, and the same goes for images. Does the Google crawler consider that duplicate content or not?
If yes, how can I tell the Google crawler "don't consider it spam; it's a social networking site and the content is shared by the user, not by me"? Is there any way or technique that would help here?
Google might consider it to be duplicate content, in which case the search algorithm will choose one version (the one it believes to be the original or more important) and drop the other.
This isn't a bad thing per se - unless you see that most of your site's content is becoming duplicated.
You can use canonical URL declarations to do what you are saying, but I wouldn't advise it.
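For reference, a canonical declaration can be made either with a <link rel="canonical" href="..."> element in the page's head or with an HTTP Link header. A hedged sketch of the header variant, assuming Flask and a hypothetical source_url field recording where the user copied the content from:

from flask import Flask, make_response

app = Flask(__name__)

@app.route("/posts/<post_id>")
def show_post(post_id):
    post = load_post(post_id)             # load_post is a hypothetical lookup
    response = make_response(post["body"])
    if post.get("source_url"):
        # Declare the original page as the canonical version of this content.
        response.headers["Link"] = f'<{post["source_url"]}>; rel="canonical"'
    return response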
If your website is a forum or an e-commerce site, it will not be punished for duplicate content, and I would consider a social platform a type of forum.
If your pages are too similar, the two or more similar pages will split the click-through rate and traffic between them, so their ranking in the SERPs may suffer.
I suggest not using "canonical", because this instruction tells the crawlers not to crawl/count the page. If you use it, you will see the number of indexed pages in the webmaster tools drop considerably.
Don't worry too much about the duplicate content issue. See this article: Google’s Matt Cutts: Duplicate Content Won’t Hurt You, Unless It Is Spammy

SEO: Allowing crawler to index all pages when only few are visible at a time

I'm working on improving the site for SEO purposes and hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what these are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large - about 100,000 items. Naturally, only a few items are listed on any given page. For example, the main site homepage links to about 5 or 6 items, some other page links to about a dozen different items, etc.
When real users visit the site, they can use the search form to find items by keyword or location, producing a list that matches their search criteria. However, when a Google crawler visits the site, it won't attempt to type text into the keyword search field and submit the form. So as far as the bot is concerned, after indexing the entire site it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
Any other things I can do? What are best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed, having an XML sitemap is an easy way to ensure the search engines are aware of your site's content. They're very easy to create and update, too, if your site is database driven. The XML file does not have to be static, so you can dynamically produce it whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out more about XML sitemaps at sitemaps.org.
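As a rough sketch of such a dynamically produced sitemap, assuming Flask and a hypothetical get_all_items() database helper (note the sitemap protocol's 50,000-URL-per-file limit, which matters for a 100,000-item directory):

from flask import Flask, Response

app = Flask(__name__)

@app.route("/sitemap.xml")
def sitemap():
    # Build one <url> entry per item, using the path-style URL format.
    urls = "".join(
        f"<url><loc>http://www.mysite.com/item.php/{item.id}/{item.slug}</loc></url>"
        for item in get_all_items()  # hypothetical database query
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"{urls}</urlset>"
    )
    return Response(xml, mimetype="application/xml")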
If you want to make your content available to search engines and benefit from semantic markup (i.e. HTML), you should also make sure all of your content can be reached through hyperlinks (in other words, not through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank which used to be a huge part of their ranking algorithm but plays only a small part now. But it still has value and links "pass" PageRank to each page they link to increasing the PageRank of that page. When you have as many pages as you do that's a lot of potential PageRank to pass around. If you built your site well you could probably get your home page to a PageRank of 6 just from internal linking alone.
Having an HTML sitemap that somehow links to all of your products is a great way to ensure that search engines, and users, can easily find all of your products. It is also recommended that you structure your site so that more important pages are closer to the root of your website (home page), branching out to sub-pages (categories) and then to specific items. This gives search engines an idea of which pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs you have duplicate content on your website. Google is on a crusade to increase the quality of their index and they consider duplicate content to be low quality. Their infamous Panda Algorithm is partially out to find and penalize sites with low quality content. Considering how many products you have it is only a matter of time before you are penalized for this. Fortunately the solution is easy. You just need to specify a canonical URL for your product pages. I recommend the second format as it is more search engine friendly.
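The site in the question appears to be PHP, so the following is only an illustrative sketch in Python/Flask of the idea: permanently redirect the query-string form to the path-style URL so that only one URL per item gets indexed (get_item_slug() is a hypothetical lookup); a <link rel="canonical"> pointing at the path-style URL achieves the same goal without a redirect.

from flask import Flask, redirect, request

app = Flask(__name__)

@app.route("/item.php")
def item_legacy():
    item_id = request.args.get("id")
    title = get_item_slug(item_id)  # hypothetical lookup of the item's title slug
    # 301 tells search engines that /item.php/<id>/<title> is the permanent home.
    return redirect(f"/item.php/{item_id}/{title}", code=301)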
Read my answer to an SEO question at the Pro Webmaster's site for even more information on SEO.
I would suggest, for starters, having an XML sitemap. Generate a list of all your pages and submit it to Google via Webmaster Tools. It wouldn't hurt to have a "friendly" sitemap either - linked to from the front page and listing all these pages, preferably by category, too.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page and think "wow, awesome!" and give you lots of authority -- this authority (some like to call it "link juice") is then passed down to pages that are linked from it. You ought to build a hierarchy of pages, with the more important ones closer to the top, and make it wide rather than deep.
Also, showing different stuff to the Google crawler than the "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry -- a little biased toward Google here - but the other engines are similar.

Semantic store and entity hub

I am working on a content platform that should provide semantic features such as querying with SPARQL and providing RDF documents for the contained content.
I would be very thankful for some clarification on the following questions:
Did I get that right, that an entity hub can connect several semantic stores to a single point of access? And if not, what is the difference between a semantic store and an entity hub?
What frameworks would you use to store content documents as well as their semantic annotation?
It is important for the solution to be able to later on retrieve the document (html page / docs such as pdf, doc,...) and their annotated version.
Thanks in advance,
Chris
The only Entityhub term that I know of belongs to the Apache Stanbol project. Here is a paragraph from the original documentation explaining what the Entityhub does:
The Entityhub provides two main services. The Entityhub provides the connection to external linked open data sites as well as using indexes of them locally. Its services allow to manage a network of sites to consume entity information and to manage entities locally.
Entityhub documentation:
http://incubator.apache.org/stanbol/docs/trunk/entityhub.html
The Enhancer component of Apache Stanbol extracts external entities related to the submitted content, using the linked open data sites managed by the Entityhub. These enhancements are produced as RDF data. It is also possible to store those content items in Apache Stanbol and run SPARQL queries on top of the RDF enhancements. The Contenthub component of Apache Stanbol also provides faceted search over the submitted content items.
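As a rough sketch, a content item can be submitted to a running Stanbol instance over HTTP with the requests library; the host, port, and endpoint path below are assumptions based on the default launcher, so check the documentation linked below:

import requests

response = requests.post(
    "http://localhost:8080/enhancer",          # assumed default launcher location
    data="Paris is the capital of France.",    # the content item to enhance
    headers={
        "Content-Type": "text/plain",
        "Accept": "text/turtle",               # ask for the enhancements as RDF/Turtle
    },
)
print(response.text)  # RDF describing the entities detected in the content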
Documentation of Apache Stanbol:
http://incubator.apache.org/stanbol/docs/trunk/
Access to running demos:
http://dev.iks-project.eu/
You can also direct further questions to stanbol-dev AT incubator.apache.org.
Alternative suggestion...
Drupal 7 has in-built RDFa support for annotation and is more of a general purpose CMS than Semantic MediaWiki
In more detail...
I'm not really sure what you mean by "entity hub" - where are you getting that definition from, or what do you mean by it?
Yes, one can easily write a system that connects to multiple semantic stores; given the context of your question I assume you are referring to RDF triple stores?
Any decent CMS should assign some form of unique/persistent ID to documents, so even if the system you go with does not support semantic annotation natively, you could build your own extension for this. The extension would simply store annotations against the document's ID in whatever storage layer you chose (I'd assume a triple store would be appropriate), and then you can build appropriate query and presentation layers for querying and viewing this data as required.
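A minimal sketch of that extension idea, using rdflib as an in-process triple store; the predicate names and annotation vocabulary are invented for illustration, and a real system would use a persistent store and an agreed ontology:

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/annotations/")
DOC = Namespace("http://example.org/documents/")

g = Graph()
# Store annotations against the document's persistent ID.
g.add((DOC["1234"], EX.hasTopic, Literal("dietary supplements")))
g.add((DOC["1234"], EX.mentions, URIRef("http://dbpedia.org/resource/Vitamin_D")))

# Later, the query layer can retrieve everything annotated against that ID.
results = g.query(
    """
    SELECT ?p ?o WHERE {
        <http://example.org/documents/1234> ?p ?o .
    }
    """
)
for predicate, obj in results:
    print(predicate, obj)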
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
Apache Stanbol
Do you want to implement a traditional CMS extended with some semantic capabilities, or do you want to build a Semantic CMS? It could look the same, but they are actually two completely opposite approaches.
It is important for the solution to be able to later on retrieve the document (html page / docs such as pdf, doc,...) and their annotated version.
You can integrate Apache Stanbol with a JCR/CMIS compliant CMS like Alfresco. To get custom annotations, I suggest creating your own custom enhancement engine (maven archetype) based on your domain and adding it to the enhancement engine chain.
https://stanbol.apache.org/docs/trunk/components/enhancer/
Once this is done, you can use the REST API endpoints provided by Stanbol to retrieve the results in RDF/Turtle format.

SharePoint 2010: what's the recommended way to store news?

To store news for a news site, what's a good recommendation?
So far, I'm opting for creating a News Site, mainly because I get some web parts for free (RSS, "week in pictures"), workflows are in place, and the authoring experience in SharePoint seems reasonable.
On the other hand, I see for example that, by just creating a Document Library, I can store Word documents based on "Newsletter" template and saved as web page and they look great, and the authoring experience in Word is better than that on SharePoint.
And what about just creating a blog site!
Anyway, what would people do? Am I missing a crucial factor here for one or the other? What's a good trade-off here?
Thanks!
From my experience, the best option would be to
Create a new News Site
Create a custom content type with properties like Region (Choice), Category (Choice), Show on homepage (Boolean), Summary (Note), etc.
Create a custom page layout attached to the above content type. Give it the look and feel you want your news articles to have.
Attach the page layout as the default content type for the Pages library of the News site.
The advantage of this approach is that you can use the CQWP web part on the home page to show the latest 5 articles. You can also show a one-liner or a picture if you make it a property of the custom content type.
By storing news in a Word document, you are not really using SharePoint as a publishing environment but only as a repository. The choice is yours.
D. All of the above
SharePoint gives you a lot of options because there is no one-size-fits-all solution that works for everyone. The flexibility is not meant to overwhelm you with choices, but rather to allow you to focus on your process, either as it exists now or as you want it to be, and then select the option that best fits that process.
My company's intranet is a team site, and news is placed into an Announcements list. We do not need anything flashy; the plain text just needs to be communicated to the employees. On the other hand, our public internet site is a publishing site, which gives our news pages a more finished touch in terms of styling and images. It also allows us to take advantage of scheduling, content roll-up, and friendly URLs, along with the security of locking down the view forms. Authoring and publishing such a page is more involved than the Announcements list, but each option perfectly fits what we want to accomplish in each environment.
Without knowing more about your needs or process, based only on your highlighting Word as the preferred authoring tool, I would recommend a Blog. It is not as fully featured as a publishing site, but there is some overlap. And posts can be authored in Word.
In the end, if you can list what you want to accomplish, how you want to accomplish it, and pick the closest option (News Site, Team Site, Publishing Site, Blog, Wiki, etc), then you will have made the correct choice.
I tend to use news publishing sites, for the reasons you mentioned and for the page editing features.
They also allow you to set scheduled go-live and un-publish dates, which is kind of critical for news items.