Canonical links and paging - SEO

Google has been pushing its new canonical link feature, and I agree it is really useful. Instead of having a ton of entry points into an area, you can now have a single one.
I was wondering: does this feature play nicely with paging?
For example, I have a thread with 8 pages of content. If I specify http://community.mediabrowser.tv/permalinks/154/iso-always-detected-as-a-movie-when-checking-metadata as the canonical for those pages, will there be any undesired side effects? Will this be better overall? Will it mean that a hit on page 5 takes users to page 1?

A canonical URL should point to substantially the same content, and pages 2-8 have different content. And yes, if Google were to honor your canonical link on page 5, it would effectively send users to page 1 by returning page 1 in its search results.
You should use the canonical link on page 1 so that Google knows that http://community.mediabrowser.tv/topics/154 and http://community.mediabrowser.tv/topics/154?page=1&response_type=3 are the same as http://community.mediabrowser.tv/permalinks/154/iso-always-detected-as-a-movie-when-checking-metadata
You may also want to put canonical links on the other pages so Google knows that http://community.mediabrowser.tv/topics/154?page=5 is the same as http://community.mediabrowser.tv/topics/154?page=5&response_type=3
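For example (a minimal sketch; the exact URLs depend on your routing), page 1 and its parameterized variants would all declare the permalink as canonical, while page 5 would declare its own clean URL:
<!-- in the <head> of /topics/154 and /topics/154?page=1&response_type=3 -->
<link rel="canonical" href="http://community.mediabrowser.tv/permalinks/154/iso-always-detected-as-a-movie-when-checking-metadata">
<!-- in the <head> of /topics/154?page=5&response_type=3 -->
<link rel="canonical" href="http://community.mediabrowser.tv/topics/154?page=5">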

You should only add canonical links to pages with essentially identical content, for example, the same set of links presented in a different order: sorted by date or alphabetically.
In your case, each page has different content (albeit several pages of the same article or conversation thread), which means you don't need to canonicalize them.
Still, if you do, all that happens is that Google gives priority to the first page over the other pages when displaying them in search results.
Canonical links do not affect your visitors. They only suggest priority and possible duplicate content to bots.
There is more info from Google in its documentation on canonical URLs.

SEO - sitemap.xml providing explicit links which are not on the page as anchors

I have a site with a text input. The user types the name of a city, hits enter, and is taken to that city's page.
My sitemap.xml looks like this:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/rome.html</loc></url>
<url><loc>http://www.example.com/london.html</loc></url>
<url><loc>http://www.example.com/newyork.html</loc></url>
<url><loc>http://www.example.com/paris.html</loc></url>
<url><loc>http://www.example.com/berlin.html</loc></url>
<url><loc>http://www.example.com/toronto.html</loc></url>
<url><loc>http://www.example.com/milan.html</loc></url>
<url><loc>http://www.example.com/edinburgh.html</loc></url>
<url><loc>http://www.example.com/nice.html</loc></url>
<url><loc>http://www.example.com/boston.html</loc></url>
...
</urlset>
My question is:
Will I be penalized (from an SEO point of view) because my links only appear in the sitemap.xml instead of as a list of anchors in the HTML page?
Note: the anchor approach was ruled out because I have about 5,000 listed cities.
It won't be penalised. Google themselves say the primary purpose of a sitemap is "a way to tell Google about pages on your site we might not otherwise discover."
https://support.google.com/webmasters/answer/156184?hl=en
You are rare in that you are using the sitemap correctly to help Google find your pages.
Often SEOs just add one for the sake of it, rather than taking the time to identify and fix potential crawling problems with it.
The only negative aspect for SEO I can think of is that PageRank will not flow between your pages if there are no direct links.
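If you do want PageRank to flow, one low-effort option (a sketch only, with hypothetical paths and an arbitrary file name) is a plain HTML index page, linked from the footer or homepage, that links to every city page, split alphabetically if 5,000 links is too many for one page:
<!-- cities.html -->
<ul>
<li><a href="/rome.html">Rome</a></li>
<li><a href="/london.html">London</a></li>
<li><a href="/newyork.html">New York</a></li>
<!-- ... remaining cities ... -->
</ul>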
No, you will not be penalized. The sole purpose of sitemaps is to tell search engines where to find your content. That content may or may not be available through hyperlinks on your website.

Two URLs, same content: is this considered duplicate content by search engines?

I've developed a service that allows users to search for stores on www.mysite.com.
I also have partners that use my service. To the user, it looks like they are on my partner's website, when in fact they are on my site. I have only replaced my own header and footer with my partner's header and footer.
For the user, it looks like they are on mysite.partner.com when in reality they are on partner.mysite.com.
If you understood what I tried to explain, my question is:
Will Google and other search engines consider this duplicate content?
Update - canonical page
If I understand canonical pages correctly, www.mysite.com is my canonical page.
So when my partner uses mysite.partner.com?store=wallmart&id=123, which "redirects" (via a CNAME) to partner.mysite.com?store=wallmart&id=123, my server recognizes my sub-domain.
So what I need to do is dynamically add the following to my <HEAD> section:
<link rel="canonical" href="mysite.com?store=wallmart&id=123">
Is this correct?
It's duplicate content but there is no penalty as such.
The problem is that for a given search, Google will pick one version of a page and filter the others out of the results. If your partner is targeting the same region, then you are in direct competition.
The canonical tag is a way to tell Google which is the official version. If you use it, then only the canonical page will show up in search results. So if you canonicalise back to your domain, your partners will be excluded from search results; only your domain's pages will ever show up. Not good for your partners.
There is no way to win here. The only way your partners will do well is if they have their own content, or if they target a different region and you don't use the canonical tag.
So that your partners have a chance, I would not add the canonical. Then it's down to the Google gods to decide which of the duplicate pages gets shown.
Definitely. You'll want to use canonical tagging to stop this happening.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
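If you do go the canonical route, note that the href should be an absolute URL (including the scheme) and that a literal & in an HTML attribute should be escaped as &amp;, so the tag from the update would look more like this (a sketch; adjust to your real URL structure):
<link rel="canonical" href="http://www.mysite.com/?store=wallmart&amp;id=123">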
Yes, it will be considered duplicate content by Google, because you have replaced only the footer and header. Under Google's recent algorithms, content should be unique for a website or even a blog. If content is not unique, your website will be penalized by Google.

SEO: secure pages and rel=nofollow

Should one apply rel="nofollow" attribute to site links that are bound for secure/login required pages?
We have a date-based URI link structure where the previous year's news content is free, while the current year, and any year prior to the last, are paid, login-required content.
The net effect is that when doing a search for our company name in Google, what comes up first is Contact, About, Login, etc., standard content that doesn't require a login. That's fine, but ideally we would have our free content, the pages we want to promote, shown first in the search engine results.
Toward this end, the link structure now generates rel="follow" for the free content we want to promote, and rel="nofollow" for all paid content and Contact, About, Login, etc. screens that we want at the bottom of the SEO search result ladder.
I have yet to deploy the new linking scheme for fear of, you know, blowing up the site SEO-wise ;-) It's not in great shape to begin with, despite our decent ranking, but I don't want us to disappear either.
Anyway, words of wisdom appreciated.
Thanks
nofollow
I think Emil Vikström is wrong about nofollow. You can use the rel value nofollow for internal links; neither the microformats spec nor the HTML5 spec says otherwise.
Google even gives such an example:
Crawl prioritization: Search engine robots can't sign in or register as a member on your forum, so there's no reason to invite Googlebot to follow "register here" or "sign in" links. Using nofollow on these links enables Googlebot to crawl other pages you'd prefer to see in Google's index. However, a solid information architecture — intuitive navigation, user- and search-engine-friendly URLs, and so on — is likely to be a far more productive use of resources than focusing on crawl prioritization via nofollowed links.
This applies to your use case, so you could nofollow the links to your login page. Note, however, that if you also meta-noindex those pages, people who search for "YourSiteName login" probably won't get the desired page in their search results.
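A minimal sketch of what that looks like in markup (the /login path is just an example):
<!-- nofollowed link: tells Googlebot not to spend crawl effort on this target -->
<a href="/login" rel="nofollow">Sign in</a>
<!-- optionally, in the <head> of the login page itself, to keep it out of the index entirely -->
<meta name="robots" content="noindex">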
follow
There is no rel value "follow". It's not defined in the HTML5 spec nor in the HTML5 Link Type extensions. It isn't even mentioned in http://microformats.org/wiki/existing-rel-values at all. A link without the rel value nofollow is automatically a "follow link".
You can't override a meta-nofollow for certain links (the two nofollow values even have different semantics).
Your case
I'd use nofollow for all links to restricted/paid content. I wouldn't nofollow the links to the informational pages about the site (About, Contact, Login), because they are useful, people might search specifically for them, and they give information about your site, while the content pages give information about the various topics.
Nofollow is only for external links; it does not apply to links within your own domain. Search engines will try to give the most relevant content for the query asked, and they generally actively avoid taking the website owner's wishes into account. Thus, nofollow will not help you here.
What you really want to do is make the news content the best choice for a search on your company name. A user searching for your company name may do so for two reasons: they want your homepage (the first page), or they specifically want to know more about your company. This means that your homepage, as well as "About", "Contact", etc., are generally what the user is actually looking for, and the search engines will show them at the top of their results pages.
If you don't want this, you must make those pages useless to someone wanting to know more about your company. This may sound really silly. To make your "About" and "Contact" pages useless to someone searching for your company, you would have to remove your company name from those pages, as well as any information about what your company does. Put that info on the news pages instead, and the search engines may start to rank the news higher.
Another option is to not let the search engines index those other pages at all, by disallowing them in a robots.txt file.
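A sketch of that last option, assuming the pages live at paths like /about and /contact (adjust to your real URLs); keep in mind that robots.txt blocks crawling rather than removing pages that are already indexed:
User-agent: *
Disallow: /about
Disallow: /contact
Disallow: /login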

SEO: Allowing the crawler to index all pages when only a few are visible at a time

I'm working on improving the site for SEO purposes and have hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what they are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large, with about 100,000 items in it. Naturally, on any given page only a few items are listed. For example, on the main site homepage there are links to about 5 or 6 items, from some other page there are links to about a dozen different items, etc.
When real users visit the site, they can use the search form to find items by keyword or location, which produces a list matching their search criteria. However, when the Google crawler visits the site, it won't even attempt to put text into the keyword search field and submit the form. Thus, as far as the bot is concerned, after indexing the entire site it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and, if the requester is a bot (as best I can tell), add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load, and I'm not sure how Googlebot would react to this.
Are there any other things I can do? What are the best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and, if the requester is a bot (as best I can tell), add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load, and I'm not sure how Googlebot would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed, having an XML sitemap is an easy way to ensure the search engines are aware of the site's content. Sitemaps are very easy to create and update, too, if your site is database driven. The XML file does not have to be static, so you can produce it dynamically whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out more about XML sitemaps at sitemaps.org.
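With roughly 100,000 items you would exceed the 50,000-URL limit of a single sitemap file, so you would split the URLs across several files and reference them from a sitemap index, roughly like this (file names, IDs, and titles here are made up):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://www.mysite.com/sitemap-items-1.xml</loc></sitemap>
<sitemap><loc>http://www.mysite.com/sitemap-items-2.xml</loc></sitemap>
<sitemap><loc>http://www.mysite.com/sitemap-items-3.xml</loc></sitemap>
</sitemapindex>
Each sitemap-items-N.xml then lists up to 50,000 item URLs:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.mysite.com/item.php/123/example-item-title</loc></url>
...
</urlset>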
If you want to make your content available to search engines and want to benefit from semantic markup (i.e. HTML), you should also make sure all of your content can be reached through hyperlinks (in other words, not only through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank, which used to be a huge part of their ranking algorithm but plays only a small part now. It still has value, though, and links "pass" PageRank to each page they link to, increasing the PageRank of that page. When you have as many pages as you do, that's a lot of potential PageRank to pass around. If you built your site well, you could probably get your home page to a PageRank of 6 from internal linking alone.
Having an HTML sitemap that links to all of your products is a great way to ensure that search engines, and users, can easily find all of your products. It is also recommended that you structure your site so that more important pages are closer to the root of your website (the home page), branching out to sub-pages (categories) and then to specific items. This gives search engines an idea of which pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs, you have duplicate content on your website. Google is on a crusade to increase the quality of its index, and it considers duplicate content to be low quality. Its infamous Panda algorithm is partially out to find and penalize sites with low-quality content. Considering how many products you have, it is only a matter of time before you are penalized for this. Fortunately the solution is easy: you just need to specify a canonical URL for your product pages. I recommend the second format, as it is more search engine friendly.
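A sketch of that fix, assuming the second format is the one you keep (the id and title here are made up): both URL variants of an item's page would carry the same canonical in their <head>:
<link rel="canonical" href="http://www.mysite.com/item.php/123/example-item-title">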
Read my answer to an SEO question at the Pro Webmaster's site for even more information on SEO.
For starters, I would suggest having an XML sitemap. Generate a list of all your pages and submit it to Google via Webmaster Tools. It wouldn't hurt to have a "friendly" sitemap as well, linked to from the front page, which lists all these pages, preferably organized by category.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page, think "wow, awesome!", and give it lots of authority; this authority (some like to call it "link juice") is then passed down to the pages that are linked from it. You ought to build a hierarchy of pages, with the more important ones closer to the top, and make it wide rather than deep.
Also, showing different content to the Google crawler than to a "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry, a little biased toward Google here, but the other engines are similar.

Will limiting dynamic URLs with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering URLs with all possible permutations of additional parameters such as tags, search phrases, versions, dates, etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google Webmaster Tools, Google has spidered only about 150 of the 200 entries in the XML sitemap. It looks as if Google has not yet seen all of the content, years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster Tools "website configuration" --> "url parameter" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have the search results pages include a robots meta tag saying noindex. This will help get them removed from Google's index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
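That tag, placed in the <head> of the search results template (a sketch; don't add it to the article pages), looks like this; "follow" is the default but makes the intent explicit:
<meta name="robots" content="noindex, follow">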
Other measures:
In GWMT, tell Google to ignore all those search parameters. This is just a band-aid, but it may help speed up the recovery.
Don't block the search pages in the robots.txt file yet, as this will prevent the robots from re-crawling and cleanly removing the pages that are already indexed. Wait until the index is clear before doing a full block like that.
Your search system must be based on links (a tags) or GET-based forms rather than POST-based forms; that is why those pages got indexed. Switching to POST-based forms should stop robots from trying to index those pages in the first place. JavaScript or AJAX is another way to do it.
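For that last point, a sketch of the switch (the action and field names are hypothetical); crawlers generally won't submit a POST form, so the result pages stop being discovered as crawlable GET URLs:
<!-- before: GET produces crawlable URLs like /search?q=foo&tag=bar -->
<!-- after: -->
<form action="/search" method="post">
<input type="text" name="q">
<button type="submit">Search</button>
</form>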