I have a site in five languages with several hundred thousand pages. Every day about 10 to 50 new pages are added, and 95% of these new pages contain news content (articles). How do I create an XML sitemap for a site like this? More specifically:
I was thinking of letting a spider go over the sections that are frequently updated and making a separate sitemap for each of these sections. The same URL could end up in different sitemaps, though. Is that a problem?
Should I create a different sitemap for each language?
How frequently do I ping Google?
Thanks
I was thinking of letting a spider go over the sections that are frequently updated and making a separate sitemap for each of these sections. The same URL could end up in different sitemaps, though. Is that a problem?
Ans. No, having the same URL in multiple sitemaps is not a problem.
Should I create a different sitemap for each language?
Ans. Yes, that will be better and easier to maintain.
How frequently do I ping Google?
Ans. Submit your sitemaps in Google Search Console; they will be crawled by Google automatically.
Use a sitemap index file to reference your multiple sitemaps: https://support.google.com/webmasters/answer/75712?hl=en
Limit each XML sitemap to 50,000 URLs.
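To make that concrete, here is a minimal sketch of a sitemap index that references one sitemap per language, built with Python's standard library. The domain, file names and language codes are assumptions for illustration only; you would submit just the index file in Google Search Console.

```python
# Sketch: build a sitemap index pointing to one sitemap file per language.
# Domain, file names and language codes are hypothetical.
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LANGUAGES = ["en", "de", "fr", "es", "nl"]           # assumed language codes
BASE = "https://www.example.com/sitemaps"            # assumed location of the files

ET.register_namespace("", NS)
index = ET.Element(f"{{{NS}}}sitemapindex")

for lang in LANGUAGES:
    entry = ET.SubElement(index, f"{{{NS}}}sitemap")
    ET.SubElement(entry, f"{{{NS}}}loc").text = f"{BASE}/sitemap-{lang}.xml"
    ET.SubElement(entry, f"{{{NS}}}lastmod").text = date.today().isoformat()

# Each referenced sitemap must stay under 50,000 URLs; split further if needed.
ET.ElementTree(index).write("sitemap-index.xml", encoding="utf-8", xml_declaration=True)
```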
My website has a very large number of pages. I am looking to create an XML sitemap that contains only the most important pages (category pages etc.).
However, when crawling the website with a tool like Xenu (the others have a 500-page limit), I am unable to control which pages get added to the XML sitemap and which ones get excluded.
Essentially, I only want pages that are up to 4 clicks away from my homepage to show up in the XML sitemap.
How should I create an XML sitemap while controlling which pages of my site are added to it (category pages) and which ones are left out (product pages etc.)?
Thanks in advance!
Do not create the XML sitemap by hand. You simply cannot redo it every other day, so its contents will become invalid over time.
Bing, at least, has a very tight tolerance for invalid URLs there:
If we see more than 1% of the URLs in a given sitemap returning errors, we begin to distrust the sitemap and stop visiting it.
Let your CMS create the XML sitemap for you, if possible. If not: that's OK. It's not a problem if your site is missing a sitemap; in the vast majority of cases you won't rank better just because you have one.
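If the CMS cannot do it for you but you still want a sitemap limited to your important pages, a small script run against your own content database avoids the stale-URL problem, because it can be regenerated on every publish. A minimal sketch, assuming a hypothetical pages table with url, page_type and updated_at columns:

```python
# Sketch: regenerate the sitemap from the CMS database so it never goes stale.
# Table and column names (pages, url, page_type, updated_at) are assumptions.
import sqlite3
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)

conn = sqlite3.connect("cms.db")                      # hypothetical CMS database
rows = conn.execute(
    "SELECT url, updated_at FROM pages WHERE page_type = 'category'"
).fetchall()                                          # only the pages you care about

urlset = ET.Element(f"{{{NS}}}urlset")
for url, updated_at in rows:
    entry = ET.SubElement(urlset, f"{{{NS}}}url")
    ET.SubElement(entry, f"{{{NS}}}loc").text = url
    ET.SubElement(entry, f"{{{NS}}}lastmod").text = updated_at

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```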
Do sitemaps need to cover the entire site?
If my sitemap links to all the most important views on the site, should I expect all of the first-child pages on the site to get crawled without explicitly listing them?
You may expect all URLs that are linked to from pages in your sitemap to be crawled; still, I would advise including them in the sitemap, as it lets you provide metadata about them.
Keep in mind that the position of a page in your sitemap doesn't matter. If you have lots of pages and want to keep things organized for yourself, you can split the map into several files. Otherwise, it's probably wisest to help the crawler figure out which pages are more important by using the priority attribute.
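As a small illustration of that last point (the URL and the value are invented), a sitemap entry carries the optional priority hint like this:

```python
# Sketch: a single sitemap entry with the optional <priority> hint (0.0-1.0).
# The URL and the value are illustrative; crawlers treat priority only as a hint.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)

urlset = ET.Element(f"{{{NS}}}urlset")
entry = ET.SubElement(urlset, f"{{{NS}}}url")
ET.SubElement(entry, f"{{{NS}}}loc").text = "https://www.example.com/category/widgets"
ET.SubElement(entry, f"{{{NS}}}priority").text = "0.8"

print(ET.tostring(urlset, encoding="unicode"))
```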
Background
I work for an online media company that hosts a news site with over 75K pages. We currently use Google Sitemap Generator (installed on our server) to build dynamic XML sitemaps for our site. In fact, since we have a ton of content, we use a sitemap of sitemaps. (Google only allows a maximum of 50K URLs per sitemap.)
Problem
The sitemaps are generated every 12 hours and are driven by user behavior. That is, the generator parses the server log file, sees which pages are being fetched the most, and builds the sitemap based on that.
Since we cannot guarantee that NEW pages are being added to the sitemap, is it better to submit a sitemap as an RSS feed? That way, every time one of our editors creates a new page (or article), it is added to the feed and submitted to Google. This brings up the issue of pushing duplicate content to Google, as the sitemap and the RSS feed might contain the same URLs. Will Google penalize us for duplicate content? How do other content-rich or media sites notify Google that they are posting new content?
I understand that Googlebot only indexes pages that it deems important and relevant, but it would be great if it at least crawled any new article that we post.
Any help would be greatly appreciated.
Why not simply have every page in your sitemap? 75K pages isn't a huge number; plenty of sites have several sitemaps totalling millions of pages, and Google will digest them all (although, as you pointed out, Google will only index those it deems important).
One technique would be to split the sitemaps into new and archived content based on the publication date: for example, a single sitemap for all content from the previous 7 days, with the rest of the content split into other sitemap files as appropriate. This may help get your freshest content indexed quickly.
Back to your question about an RSS feed sitemap: don't worry about duplicate content, as it is not an issue when it comes to sitemaps. Duplicate content is only a problem if you publish the same article several times on the site; sitemaps and RSS feeds are only links to the content, not the content itself, so if an RSS feed is the easiest way of reporting your fresh content to Google, go for it.
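A minimal sketch of that new/archived split; the article list, cutoff and file naming are assumptions standing in for your real data source, and all the resulting files would be listed in the sitemap index you already submit:

```python
# Sketch: put the last 7 days of articles into one small "fresh" sitemap and the
# rest into monthly archive files, all of which get listed in the sitemap index.
# The `articles` list is a hypothetical stand-in for your CMS or log data.
from collections import defaultdict
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)

def write_sitemap(filename, urls):
    urlset = ET.Element(f"{{{NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{NS}}}url")
        ET.SubElement(entry, f"{{{NS}}}loc").text = url
    ET.ElementTree(urlset).write(filename, encoding="utf-8", xml_declaration=True)

articles = [  # hypothetical (url, published) pairs
    ("https://www.example.com/news/todays-story", datetime.now()),
    ("https://www.example.com/news/last-months-story", datetime(2011, 3, 1)),
]

cutoff = datetime.now() - timedelta(days=7)
fresh = [url for url, published in articles if published >= cutoff]
archived = defaultdict(list)
for url, published in articles:
    if published < cutoff:
        archived[published.strftime("%Y-%m")].append(url)    # one file per month

write_sitemap("sitemap-fresh.xml", fresh)
for month, urls in archived.items():
    write_sitemap(f"sitemap-archive-{month}.xml", urls)       # keep each under 50K URLs
```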
I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic URLs via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
The actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ about this (and yes, I know they're not the only search engine!) recommends including URLs in the sitemap that might not be discovered by a crawl. On that basis, if every URL in your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?
If there is more than one way to get to a page, you should pick a main URL for each page that contains the actual content, and put those URLs in the site map. I.e. the site map should contain links to the actual content, not every possible URL to get to the same content.
Also consider putting a canonical link tag (rel="canonical") with this main URL in the pages, so that spiders can recognise a page even if it's reachable through different dynamic URLs.
Spiders only spend a limited time searching each site, so you should make it easy to find the actual content as soon as possible. A site map can be a great help as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
We have had pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)
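To make the "one main URL per page" idea concrete, here is a small sketch. The normalisation rule (dropping all query parameters) is an assumption for illustration; the point is that every dynamic variant collapses to the single URL that goes into the sitemap, and each page advertises that same URL via rel="canonical":

```python
# Sketch: collapse dynamic URL variants to one main (canonical) URL before they
# go into the sitemap, and emit the matching rel="canonical" tag for the page.
# Dropping every query parameter is an assumption; adapt the rule to your URLs.
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def canonical_tag(url):
    return f'<link rel="canonical" href="{canonical_url(url)}">'

# Both dynamic variants map to the same sitemap entry and the same canonical tag.
print(canonical_url("https://www.example.com/item/42?sessionid=abc&sort=price"))
print(canonical_url("https://www.example.com/item/42?ref=homepage"))
print(canonical_tag("https://www.example.com/item/42?sessionid=abc"))
```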
In an SO podcast they talked about limitations on the number of links you could include/submit in a sitemap (around 500 per page with a page limit based on pagerank?) and how you would need to break them over multiple pages.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
I was under the impression that the sitemap wasn't necessarily about disconnected pages, but rather about increasing the crawling of existing pages. In my experience, when a site includes a sitemap, minor pages, even when prominently linked to, are more likely to appear in Google results. Depending on the PageRank, inbound links, etc. of your site, this may be less of an issue.
I've submitted sitemap.xml files to Google Webmaster Tools and it says it has all of the pages in total, but under "indexed" it says "--". How long does it take for Google to start indexing? This was a couple of days ago.
A sitemap is a way for webmasters to help search engines easily discover more pages from their websites. A sitemap should be considered an aid, not a duty. Even if you submit a sitemap, there's no guarantee that the URLs listed in it will be read or included in search engine indexes.
Usually it takes from a few hours to a few days to be indexed.
Quotes from a Google source
"We don't guarantee that we'll crawl
or index all of your URLs. For
example, we won't crawl or index image
URLs contained in your Sitemap.
However, we use the data in your
Sitemap to learn about your site's
structure, which will allow us to
improve our crawler schedule and do a
better job crawling your site in the
future. In most cases, webmasters will
benefit from Sitemap submission, and
in no case will you be penalized for
it."
Mod Note: An attribution link was originally here, but the site linked to no longer exists
It usually takes up to two weeks to be indexed. Just give it some time :)
In short: it depends.
If your website is new, Google will have to crawl and index it first. This can take time and depends on many factors (see the Google FAQs on indexing).
If the website is not new, it's possible that you are submitting URLs in the Sitemap file which do not match the URLs that were crawled and indexed. In this case, the indexed URL count is usually not zero, but this could theoretically be the case if the URLs in the Sitemap file are drastically wrong (eg with session-ids).
Finally, if you are submitting a non-web Sitemap file (eg for Google Video or Google News), it's normal for the indexed URL count to be zero: the count only applies for URLs within the normal web-search results.
Without knowing the URL of the Sitemap file it's impossible to say for sure which of the above applies.