Should a sitemap have *every* url - seo

I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic urls via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
That actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ (and yes, I know they're not the only search engine!) about this recommends including URLs in the sitemap that might not be discovered by a crawl; based on that fact, then, if every URL in your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?

If there is more than one way to get to a page, you should pick a main URL for each page that contains the actual content, and put those URLs in the site map. I.e. the site map should contain links to the actual content, not every possible URL to get to the same content.
Also consider putting canonical meta tags in the pages with this main URL, so that spiders can recognise a page even if it's reachable through different dynamical URLs.
Spiders only spend a limited time searching each site, so you should make it easy to find the actual content as soon as possible. A site map can be a great help as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
We have had a pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)

In an SO podcast they talked about limitations on the number of links you could include/submit in a sitemap (around 500 per page with a page limit based on pagerank?) and how you would need to break them over multiple pages.
Given this, do I really need to worry
about expending the effort to produce
a dynamic sitemap index that includes
all these URLs, or should I simply
ensure that all the main static URLs
are in there?
I was under the impression that the sitemap wasn't necessarily about disconnected pages but rather about increasing the crawling of existing pages. In my experience when a site includes a sitemap, minor pages even when prominently linked to are more likely to appear on Google results. Depending on the pagerank/inbound links etc. of your site this may be less of an issue.

Related

SEO: Allowing crawler to index all pages when only few are visible at a time

I'm working on improving the site for the SEO purposes and hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what these are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large - having about 100,000 items in it. Naturally, on any of the pages only a few items are listed. For example, on the main site homepage, there are links to about 5 or 6 items, from some other page there links to about a dozen different items, etc.
When real users visits the site, they can use search form to find item by keyword or location - so there would be a list produced matching their search criteria. However when, for example, a google crawler visits the site, it won't even attempt to put a text into the keyword search field and submit the form. Thus as far as the bot is concern, after indexing the entire site, it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
Any other things I can do? What are best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and if
the requestor is a bot (as best I can say), then add a div to the end
of the most relevant page with links to each individual item. Yes,
this would be a huge page to load - and I'm not sure how google bot
would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed having an XML sitemap is an easy way to ensure the search engines are aware of your site's content. They're very easy to create and update, too, if your site is database driven. The XML file does not have to be static so you can dynamically produce it whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out mroe about XML sitemaps at sitemaps.org.
If you want to make your content available to search engines and want to benefit from semantic markup (i.e. HTML) you should also make sure your all of content can be reached through hyperlinks (in other words not through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank which used to be a huge part of their ranking algorithm but plays only a small part now. But it still has value and links "pass" PageRank to each page they link to increasing the PageRank of that page. When you have as many pages as you do that's a lot of potential PageRank to pass around. If you built your site well you could probably get your home page to a PageRank of 6 just from internal linking alone.
Having an HTML sitemap that somehow links to all of your products is a great way to ensure that search engines, and users, can easily find all of your products. It is also recommended that you structure your site so more important pages are closer to the root of your website (home page) and then as you branch out gets to sub pages (categories) and then to specific items. This gives search engines an idea of what pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs you have duplicate content on your website. Google is on a crusade to increase the quality of their index and they consider duplicate content to be low quality. Their infamous Panda Algorithm is partially out to find and penalize sites with low quality content. Considering how many products you have it is only a matter of time before you are penalized for this. Fortunately the solution is easy. You just need to specify a canonical URL for your product pages. I recommend the second format as it is more search engine friendly.
Read my answer to an SEO question at the Pro Webmaster's site for even more information on SEO.
I would suggest for starters having an xml sitemap. Generate a list of all your pages, and submit this to Google via webmaster tools. It wouldn't hurt having a "friendly" sitemap either - linked to from the front page, which lists all these pages, preferably by category, too.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page and think "wow, awesome!" and give you lots of authority -- this authority (some like to call it link juice" is then passed down to pages that are linked from it. You ought to make a hierarchy of files, more important ones closer to the top and/or making it wide instead of deep.
Also, showing different stuff to the Google crawler than the "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry -- A little bias on Google here - but the other engines are similar.

Do sitemaps need to cover the entire site?

Do sitemaps need to cover the entire site?
If I have a sitemap link to all the most important views on the site, I should expect that all the first-child pages on that site should get crawled without explicitly defining them right?
You may expect all URLs which are linked to by pages on your sitemap to be crawled, still I would advise on including them in the sitemap, as it provides meta-data.
Keep in mind that the position of a page in your sitemap doesn't matter. If you have lots of pages and want to keep things organized for yourself, you can split the map into several files. Otherwise, it's probably wisest to let the crawler figure out which pages are more important using the priority attribute.

will limiting dynamic urls with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering urls with all possible permutations of additional parameters such as tags, search phrases, versions, dates etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster-tools Google spidered only about 150 of the 200 entries in the xml sitemap. It looks as if Google has not yet seen all of the content years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spiders those dynamic urls. In addition I plan to disable some url parameters in the Webmaster-tools "website configuration" --> "url parameter" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. article) can be reached by more then one URL then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in their search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have search results pages include a robots meta tag saying noindex. This will help get them removed from your index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
Other measures:
In GWMT tell Google to ignore all those search parameters. Just a band aid but may help speed up the recovery.
Don't block the search page in the robots.txt file as this will block the robots from crawling and cleanly removing those pages already indexed. Wait till your index is clear before doing a full block like that.
Your search system must be based on links (a tags) or GET based forms and not POST based forms. This is why they got indexed. Switching them to POST based forms should stop robots from trying to index those pages in the first place. JavaScript or AJAX is another way to do it.

Count the number of pages in a site

I'd like to know how many public pages there are in a site, say for example, smashingmagzine.com. Is there are way to count the number of pages?
You can query Google's index using the site operator. e.g:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality but I don't know the syntax off hand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You need to basically crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more links to crawl that are pointing in the same domain. Remember to stay in the site otherwise you'll start crawling external sites.
You can also try parsing the sitemap if they provide one.
One tool that might prove useful if using Java is JSpider or Sphider in PHP.
You'll need to recursively scan the markup of each page, starting with your top level page, looking for any kind of links to other pages, and recursively crawl through them. You'll also need to keep track of what has been scanned as to not get caught in an infinate loop.

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag, but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow", is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity, what is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note our Search results pages and paging links are not bot-friendly, in that they are not re-written and have a ?name=value query string, but from what I've seen spiders no longer just abort when they see the '?' as the results pages ARE getting indexed with decent page rank.
To be honest you are looking at nofollow wrong. Chances are the search spiders are already especially Google, Yahoo, and MSN searching the nofollow pages, because they still have to hit those pages to see if they have a noindex.
The real problem is nofollow doesn't actually mean don't follow, it just means don't pass on my reputation to this link. So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and robot commands on links will not effect performance because they are already hitting your site. To confirm this just look at your HTTP Server Log.
So my vote is that you will not see any problem with removing the robot limits.
I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.
Google bots are pretty intelligent about not traversing an entire database of dynamically-generated pages, as long as the URLs give some hint that they are dynamic (i.e. file extension of .asp or .jsp, etc. and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", then the bots have a harder time determining whether or not it's a static page they are reading or a dynamically generated page. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.