Semantic markup to index PDF files

What is the proper way to index PDF files? I would like to add semantic information to them to help search engines present the files more accurately and precisely (for example, a particular image or passage of text inside the PDF file). I am thinking about using ontologies that search engines already understand, such as Schema.org.

How about using schema.org to link to the PDF file from a web page like this:
<div itemscope itemtype="http://schema.org/Article">
<img itemprop="thumbnailUrl" src="http://www.example.com/how_to_build_a_web_app.jpg"/>
<a itemprop="url" href="http://www.example.com/how_to_build_a_web_app.pdf">
<span itemprop="name">How to Build a Web App</span></a>
by <span itemprop="author">John Smith</span>
<div itemprop="description">This short e-book explains what a web application
is and how to build one.</div>
</div>
This lets you associate a title, image and textual description with the article in the PDF.

Related

Can I build my html with dynamic Schema.org markup?

I have a page describing a mobile application's details and features, but the app itself is not implemented yet and is not available in any store, so I don't know its size, version, rating, etc.
The question is: can I render the page with dynamic Schema.org markup? That is, build it now as, for example, Article markup, and once the app is live build the page with MobileApplication instead. Can I do this, and will Google accept and understand it? If not, what can I do in my case?
#if(IsAppLive){
<div itemscope itemtype="http://schema.org/MobileApplication">
}
else{
<div itemscope itemtype="http://schema.org/Article">
}
Yes, but Google only indexes the version it sees when it crawls the page. It won't register the second version until it re-crawls, which can take up to two weeks.
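For illustration, here is the same switch as a minimal Python sketch, assuming the page is rendered server-side. The function name `opening_tag` and the `is_app_live` flag are hypothetical; only the two Schema.org types come from the question.

```python
def opening_tag(is_app_live: bool) -> str:
    """Return the opening <div> carrying the Schema.org type for the current state."""
    # Assumption: the caller knows whether the app has been published yet.
    item_type = "MobileApplication" if is_app_live else "Article"
    return f'<div itemscope itemtype="http://schema.org/{item_type}">'


print(opening_tag(False))  # markup to serve before the app is released
print(opening_tag(True))   # markup to serve once the app is live
```

Whichever branch is active when the crawler visits is the one that gets indexed, so the switch only takes effect after the next crawl.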

Tagged PDF using an HTML to PDF Converter such as Winnovative

There are many converters available on the market to create and manipulate PDF files from a simple HTML/CSS page. These tools are very convenient for quickly creating nice PDF files without the hassle of more complex reporting tools.
I am using Winnovative software to achieve this, but I was wondering how to create accessible files (tagged PDF) to improve processing by text-to-speech tools.
Are there any HTML tags that achieve this? Does anybody have experience with this kind of requirement?
The tool itself has to support the PDF/UA spec (tagged PDF). The list of possible PDF tags corresponds nicely to HTML tags. For example, there are <H1> through <H6> tags, table tags (<Table>, <TH>, <TR>, <TD>), list tags (<L>, <LI>), and so on.
There are minor differences; for example, the tag that starts a list is <L> instead of HTML's <ul> or <ol>. With a PDF document, the screen reader will say "list with 3 items" and then you navigate through each item. It doesn't seem to care whether the list is bulleted or numbered, which is why PDF/UA has only <L> where HTML has <ol> and <ul>.
Anyway, the point is that you don't need to use any special HTML tags to generate tagged PDF. The tool that generates the PDF just needs to support PDF/UA. I didn't see anything on Winnovative's website indicating that it supports it.
FYI, here are the tags available in PDF/UA
<Art>
<Annot>
<BibEntry>
<BlockQuote>
<Caption>
<Code>
<Div>
<Document>
<Figure>
<Form>
<Formula>
<H>
<H1>
<H2>
<H3>
<H4>
<H5>
<H6>
<Index>
<Lbl>
<Link>
<L>
<LI>
<LBody>
<Note>
<P>
<Part>
<Quote>
<Reference>
<Sect>
<Span>
<Table>
<TD>
<TH>
<TOC>
<TOCI>
<TR>
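The HTML-to-PDF tag correspondence described above can be sketched as a lookup table. This is only an illustration in Python: the names `HTML_TO_PDF_TAG` and `pdf_tag_for` are hypothetical, the table covers only the tags mentioned in this answer plus <p>, and a real converter builds the PDF structure tree itself rather than exposing a mapping like this.

```python
from typing import Optional

# Illustrative mapping only (assumption: one HTML element maps to one PDF tag).
HTML_TO_PDF_TAG = {
    **{f"h{n}": f"H{n}" for n in range(1, 7)},  # h1..h6 -> H1..H6
    "p": "P",
    "table": "Table",
    "tr": "TR",
    "th": "TH",
    "td": "TD",
    "ul": "L",   # PDF has a single list tag,
    "ol": "L",   # so both HTML list types map to <L>
    "li": "LI",
}


def pdf_tag_for(html_tag: str) -> Optional[str]:
    """Return the PDF structure tag for an HTML tag name, or None if unmapped."""
    return HTML_TO_PDF_TAG.get(html_tag.lower())


for tag in ("h2", "ul", "ol", "td"):
    print(tag, "->", pdf_tag_for(tag))
```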
Essential PDF supports generating tagged PDF when converting from HTML to PDF using the Internet Explorer MSHTML engine.
Note: I work for Syncfusion.
Good explanation in slugolicious' answer about tagging PDF. While researching accessible PDF output for a project, I found that PDFReactor (www.pdfreactor.com) can do this. Unfortunately there's no budget for a license in this project right now, so I haven't tested it in production, but I have tried the free personal version with satisfying results.

Adding schema.org to site

I want to add schema.org markup to my site. I've read some guides and I understand how it should be done. But should I add these tags, for example for images and URLs:
<figure itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject">
<a itemprop="contentUrl" href="someurl" rel="bookmark">
<img src="someurl"/>
</a>
</figure>
to all my images, all my URLs, and all my pages, or is there a way to do that globally for my site?
Yes, the markup is meant to be added to every element you want described. That way you show search engines the semantic relationship between the items on your webpage.
If you don't use any programming or frameworks, you need to add it by hand.
It is good practice to validate the markup while developing, to see how search engines will read it: http://www.google.com/webmasters/tools/richsnippets
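If the pages are generated by code, one way to avoid adding the markup by hand is a small helper that wraps every image in the ImageObject markup from the question. A minimal Python sketch; the function name `image_object` is hypothetical and not part of any framework.

```python
from html import escape


def image_object(url: str) -> str:
    """Wrap an image URL in the ImageObject microdata shown in the question."""
    safe = escape(url, quote=True)  # keep the URL safe inside attribute values
    return (
        '<figure itemprop="associatedMedia" itemscope '
        'itemtype="http://schema.org/ImageObject">\n'
        f'<a itemprop="contentUrl" href="{safe}" rel="bookmark">\n'
        f'<img src="{safe}"/>\n'
        '</a>\n'
        '</figure>'
    )


# Usage: call the helper from the page template for each image.
print(image_object("someurl"))
```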

SEO - Google index a specific part of a link

Google displays links to pages in its search results by taking all the text inside an <a> tag as the link. So this:
<a href="#">
<span>1</span> This is a great story
</a>
displays in Google search results as:
1 This is a great story
Is there any way to tell Google to index a specific part of the link text, e.g.
<a href="#">
<span class="dont-index-me">1</span>
<span class="index-me">This is a great story</span>
</a>
So I can have just: 'This is a great story'.
Or is the only option to change the markup:
<span>1</span> <a href="#">This is a great story</a>
No. Google will index an entire page's contents. There is no way to tell Google to ignore part of a page. There are black hat techniques, of course, but those just get you banned if you get caught and aren't worth the risk.
Just change the markup (the second solution), i.e. move the number out of the <a> tag.

Doesn’t Google support Schema.org’s AggregateRating at the moment?

A rich snippet example from Schema.org http://schema.org/AggregateRating:
<html>
<div itemscope itemtype="http://schema.org/Product">
<img itemprop="image" src="dell-30in-lcd.jpg" />
<span itemprop="name">Dell UltraSharp 30" LCD Monitor</span>
<div itemprop="aggregateRating"
itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">87</span>
out of <span itemprop="bestRating">100</span>
based on <span itemprop="ratingCount">24</span> user ratings
</div>
</div>
</html>
But http://www.google.com/webmasters/tools/richsnippets won't show a preview.
So, the following words from http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146645 are just lies?
New! schema.org lets you mark up a much wider range of item types on
your pages, using a vocabulary that Google, Microsoft, and Yahoo! can
all understand. Find out more. (Google still supports your existing
rich snippets markup, though.)
It is working absolutely fine.
Google is not obliged to show you a preview every time. When I pasted your example from schema.org into the tool, it showed this error:
The following errors were found during preview generation:
This page does not contain authorship or rich snippet markup.
I have used it in my website's news pieces and it shows up fine.
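To check locally that the itemprop attributes are present in the page, independently of Google's preview tool, here is a rough Python sketch using only the standard library. `ItempropCollector` and `collect_itemprops` are hypothetical names, and this is not a full microdata parser: it records only the first non-blank text after each opening tag, or the src/href attribute, or the itemtype of a nested item.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect itemprop names with a simple value for each one."""

    def __init__(self):
        super().__init__()
        self.props = {}       # itemprop name -> value
        self._pending = None  # itemprop still waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if "itemscope" in attrs:
            # Nested item: record its type instead of its text.
            self.props[name] = attrs.get("itemtype", "")
        elif "src" in attrs or "href" in attrs:
            # Images and links carry their value in an attribute.
            self.props[name] = attrs.get("src") or attrs.get("href")
        else:
            self._pending = name

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.props[self._pending] = text
            self._pending = None


def collect_itemprops(html: str) -> dict:
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.props


SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">Dell UltraSharp 30" LCD Monitor</span>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">87</span>
out of <span itemprop="bestRating">100</span>
based on <span itemprop="ratingCount">24</span> user ratings
</div>
</div>
"""

print(collect_itemprops(SAMPLE))
```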