Sitemap generation strategy - SEO

I have a huge site, with more than 5 million URLs.
We already have a PageRank of 7/10. The problem is that, because of the 5 million URLs and because we add and remove URLs daily (we add ±900 and remove ±300), Google is not fast enough to index all of them. We have a huge and intense Perl module that generates this sitemap, which is normally split into 6 sitemap files. Google certainly can't keep up with all the URLs, especially because we normally recreate all of those sitemaps daily and resubmit them to Google.
My question is: what would be a better approach? Should I really bother sending 5 million URLs to Google daily even though I know Google won't be able to process them all? Or should I send only the permalinks that won't change and let the Google crawler find the rest, so that at least I have a concise index at Google (today I have fewer than 200 of the 5,000,000 URLs indexed)?

What is the point of having a lot of indexed pages that are removed right away?
Temporary pages are worthless to search engines and their users once they are disposed of. So I would let the search engine crawlers decide whether a page is worth indexing. Just tell them the URLs that will persist... and implement some list pages (if there aren't any yet) that allow your pages to be crawled more easily.
Side note: 6 sitemap files for 5 million URLs? AFAIK, a sitemap file may not contain more than 50,000 URLs.
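To stay under that limit, the usual approach is to split the URL set across many files and reference them from a sitemap index. A minimal sketch in Python (the poster's generator is in Perl, so this is purely illustrative; the file names, the example domain, and the get_persistent_urls() helper are made up):

```python
# Split a large URL list into sitemap files of at most 50,000 URLs each
# and tie them together with a sitemap index file.
from datetime import date
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000

def flush_chunk(chunk, index):
    """Write one sitemap file for up to 50,000 URLs and return its file name."""
    name = f"sitemap-{index:03d}.xml"
    with open(name, "w", encoding="utf-8") as f:
        f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="{SITEMAP_NS}">\n')
        for url in chunk:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")
    return name

def write_sitemaps(urls, base_url="https://www.example.com"):
    chunk, file_names = [], []
    for i, url in enumerate(urls, start=1):
        chunk.append(url)
        if i % MAX_URLS_PER_FILE == 0:
            file_names.append(flush_chunk(chunk, len(file_names)))
            chunk = []
    if chunk:
        file_names.append(flush_chunk(chunk, len(file_names)))

    # Sitemap index referencing every generated file; this is what gets submitted.
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for name in file_names:
            f.write(f"  <sitemap><loc>{base_url}/{name}</loc>"
                    f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
        f.write("</sitemapindex>\n")

# write_sitemaps(get_persistent_urls())  # hypothetical source of stable URLs
```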

When URLs change, make sure you handle them properly with a 301 status (permanent redirect).
Edit (refinement):
Still, you should try to keep your URL patterns stable. You can use 301 redirects, but maintaining a lot of redirect rules is cumbersome.
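For illustration, a minimal sketch of such a redirect map, using Flask purely as a stand-in (the poster's stack is Perl; the paths and targets below are hypothetical):

```python
# A small 301 redirect map for legacy URLs.
from flask import Flask, abort, redirect

app = Flask(__name__)

# Old path -> new path. In practice this would come from a database or a
# generated file, and should stay as small as stable URL patterns allow.
REDIRECTS = {
    "/old/listing/123": "/listing/123-blue-widget",
    "/old/listing/456": "/listing/456-red-widget",
}

@app.route("/old/<path:subpath>")
def legacy(subpath):
    target = REDIRECTS.get(f"/old/{subpath}")
    if target is None:
        abort(404)                     # gone for good: let it drop out of the index
    return redirect(target, code=301)  # permanent redirect, as recommended above
```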

Why don't you just compare each new sitemap to the previous one and only send Google the URLs that have changed?
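A minimal sketch of that diffing idea, assuming yesterday's and today's URL lists are available as plain text files (the file names are made up):

```python
# Diff today's URL set against yesterday's and write a small "delta" sitemap
# containing only the newly added URLs.
from xml.sax.saxutils import escape

def load_urls(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

old_urls = load_urls("urls-yesterday.txt")
new_urls = load_urls("urls-today.txt")

added   = new_urls - old_urls   # submit these in the delta sitemap
removed = old_urls - new_urls   # these should now return 404/410 (or a 301)

with open("sitemap-delta.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in sorted(added):
        f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
    f.write("</urlset>\n")
```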

Related

Deprecated domain in Google index

We have a deprecated domain, www.deprecateddomain.com. The specific situation is that we have a reverse proxy in place that redirects all requests from this domain to the new one, www.newdomain.com.
The problem is that when you type "deprecateddomain.com" into Google search, there is a link to www.deprecateddomain.com in the search results, alongside results for newdomain.com. This means there are still entries for the old domain in Google's index. Our customer doesn't want to see links to the old site.
It was suggested that we create a fake robots.txt with a Disallow: / directive for www.deprecateddomain.com and reverse proxy rules to serve this file from some directory. But after investigating the subject, I have started to doubt that it will help. Will it remove the entries for the old domain from the index?
Why not just submit a request in Search Console to remove www.deprecateddomain.com from the index? In my opinion that might help.
Anyway, I'm a novice on this topic. Could you give me advice on what to do?
Google takes time to remove old/obsolete entries from its rankings, especially for low-traffic or low-value pages. You have no control over this: Google needs to revisit each page to see the redirect you have implemented.
So DO NOT implement a disallow on the old website, because it will make the problem worse. Bots won't be able to crawl those pages and see the redirect you have implemented, so they will stay in the rankings longer.
You must also make sure you implement a proper 301 redirect (i.e. a permanent one, not a temporary one) for all pages of the old website. Otherwise, some pages may stay in the rankings for quite some time.
If some pages are obsolete and should be deleted rather than redirected, return a 404 for them. Google will remove them from its index quickly.
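A minimal sketch for spot-checking that the old domain behaves as described, assuming the requests library is installed (the sample URLs are hypothetical):

```python
# Verify that old-domain URLs return a permanent redirect (301) and that
# deleted pages return 404, without following the redirects.
import requests

checks = [
    ("http://www.deprecateddomain.com/some-page", 301),     # should redirect permanently
    ("http://www.deprecateddomain.com/removed-page", 404),  # deleted content should 404
]

for url, expected in checks:
    resp = requests.get(url, allow_redirects=False, timeout=10)
    location = resp.headers.get("Location", "")
    print(f"{url} -> {resp.status_code} {location}")
    if resp.status_code != expected:
        print(f"  WARNING: expected {expected}")
```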

How to tell search engines to only index pages inside my sitemap?

Recently my site was hacked, and it has since been restored. Thousands of spam URLs were indexed by Google. I do have a Google Webmaster Tools account, and I can update and submit my sitemap. But how do I tell Google to strictly index only the URLs inside my sitemap? I want to prevent any new spam URLs created by hackers from being indexed.
Is there any parameter inside sitemap.xml that I can use to do this?
Your sitemap should only include the legitimate URLs, and Google will crawl and index only them.
If you have removed the old spam URLs and they now return a 404 (Not Found) status, Google will remove them from the index (albeit quite slowly; it can take even 1-2 months).
If you need those URLs removed from the search results sooner, there's a section about it in the Webmaster guide: https://support.google.com/webmasters/answer/1663419?hl=en
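As a sanity check, one could compare whatever Google has indexed against the sitemap and make sure everything outside it returns a 404. A minimal sketch, assuming the sitemap uses the standard namespace and that the indexed URLs have been exported to a text file (the file names are made up):

```python
# Flag indexed URLs that are not listed in sitemap.xml (likely spam URLs).
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(path):
    """Return the set of <loc> values from a standard urlset sitemap."""
    tree = ET.parse(path)
    return {loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)}

legit = sitemap_urls("sitemap.xml")

with open("indexed-urls.txt", encoding="utf-8") as f:  # e.g. exported from search results
    for line in f:
        url = line.strip()
        if url and url not in legit:
            # Candidate spam URL: make sure it returns 404/410, and optionally
            # file a removal request in Webmaster Tools.
            print("not in sitemap:", url)
```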

Should a sitemap have *every* URL

I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic urls via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
The actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ about this (and yes, I know Google isn't the only search engine!) recommends including URLs in the sitemap that might not be discovered by a crawl. Based on that, if every URL in your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?
If there is more than one way to get to a page, you should pick a main URL for each page that contains the actual content, and put those URLs in the sitemap. In other words, the sitemap should contain links to the actual content, not every possible URL that leads to the same content.
Also consider putting canonical meta tags in the pages with this main URL, so that spiders can recognise a page even if it is reachable through different dynamic URLs.
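A minimal sketch of one way to derive such a main URL, assuming the duplicate URLs differ only in tracking parameters and parameter order (the parameter names are made up):

```python
# Normalise dynamic URLs to a single canonical form: strip tracking parameters,
# lower-case the host, sort the query string, and drop any fragment.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def canonicalize(url):
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(sorted(query)), ""))

# Both variants collapse to the same canonical URL, which is what goes into
# the sitemap and the <link rel="canonical"> tag.
print(canonicalize("https://Example.com/item?id=42&utm_source=news"))
print(canonicalize("https://example.com/item?utm_campaign=x&id=42"))
```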
Spiders only spend a limited time searching each site, so you should make it easy for them to find the actual content as soon as possible. A sitemap can be a great help, as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
We have had pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)
In an SO podcast they talked about limitations on the number of links you could include/submit in a sitemap (around 500 per page, with a page limit based on PageRank?) and how you would need to break them up over multiple pages.
"Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?"
I was under the impression that the sitemap wasn't necessarily about disconnected pages, but rather about increasing the crawling of existing pages. In my experience, when a site includes a sitemap, minor pages, even when prominently linked to, are more likely to appear in Google results. Depending on your site's PageRank, inbound links, etc., this may be less of an issue.

Use of sitemaps

I've recently been involved in the redevelopment of a website (a search engine for health professionals: http://www.tripdatabase.com), and one of the goals was to make it more search engine "friendly", not through any black magic, but through better xhtml compliance, more keyword-rich urls, and a comprehensive sitemap (>500k documents).
Unfortunately, shortly after launching the new version of the site in October 2009, we saw site visits (primarily via organic searches from Google) drop substantially to 30% of their former glory, which wasn't the intention :)
We've brought in a number of SEO experts to help, but none have been able to satisfactorily explain the immediate drop in traffic, and we've heard conflicting advice on various aspects, which I'm hoping someone can help us with.
My questions are thus:
do pages present in sitemaps also need to be spiderable from other pages? We had thought the point of a sitemap was specifically to help spiders get to content not already "visible". But now we're getting the advice to make sure every page is also linked to from another page. Which prompts the question... why bother with sitemaps?
some months on, and only 1% of the sitemap (well-formatted, according to webmaster tools) seems to have been spidered - is this usual?
Thanks in advance,
Phil Murphy
The XML sitemap helps the search engine spider index all the web pages of your site.
The sitemap is very useful if you frequently publish many pages, but it does not replace proper internal linking on the site: every document must be linked from another related page.
Your site is very large, so pay attention to the number of URLs published in the sitemap, because there is a limit of 50,000 URLs per XML file.
The full documentation is available at Sitemaps.org
re: do pages present in sitemaps also need to be spiderable from other pages?
Yes, in fact this should be one of the first things you do. Make your website more usable for users before the search engines, and the search engines will love you for it. Heavy internal linking between pages is a necessary first step. Most of the time you can do this with internal sitemap pages or category pages, etc.
re: why bother with sitemaps?
Yes! A sitemap helps you set priorities for certain content on your site (like the homepage) and tell the search engines what to look at more often. Note: do not set all your pages to the highest priority; it confuses Google and doesn't help you.
re: some months on, and only 1% of the sitemap seems to have been spidered - is this usual?
Yes! I have a site with 100k+ pages. Google has never indexed them all in a single month; it takes small chunks of about 20k at a time each month. If you use the priority settings properly, you can tell the spider which pages it should re-index on each visit.
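For reference, a minimal sketch of what such entries look like when generated, with <lastmod>, <changefreq> and <priority> set per page (the URLs, dates and values are made up):

```python
# Write sitemap <url> entries carrying lastmod, changefreq and priority hints,
# so frequently changing pages are suggested for more frequent revisits.
from xml.sax.saxutils import escape

pages = [
    # (url, lastmod, changefreq, priority)
    ("https://www.example.com/",           "2010-05-01", "daily",   "1.0"),
    ("https://www.example.com/category/a", "2010-04-28", "weekly",  "0.8"),
    ("https://www.example.com/page/12345", "2010-03-15", "monthly", "0.5"),
]

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url, lastmod, changefreq, priority in pages:
        f.write("  <url>\n"
                f"    <loc>{escape(url)}</loc>\n"
                f"    <lastmod>{lastmod}</lastmod>\n"
                f"    <changefreq>{changefreq}</changefreq>\n"
                f"    <priority>{priority}</priority>\n"
                "  </url>\n")
    f.write("</urlset>\n")
```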
As Rinzi mentioned, more documentation is available at Sitemaps.org.
Also try to build more backlinks and "trust" (links from quality sources); it may help speed up indexing further :)

Google webmaster tools: Sitemaps not indexing?

I've submitted sitemap.xml files to Google Webmaster Tools, and it says it has all of the pages in total, but under "indexed" it says "--". How long does it take for Google to start indexing? This was a couple of days ago.
A Sitemap is a way for webmasters to help search engines easily discover more pages from their websites. A Sitemap should be considered an aid, not a duty. Even if you submit a Sitemap, there's no guarantee that the URLs listed in it will be read or included in search engine indexes.
Usually it takes from a few hours to a few days to be indexed.
Quotes from a Google source
"We don't guarantee that we'll crawl
or index all of your URLs. For
example, we won't crawl or index image
URLs contained in your Sitemap.
However, we use the data in your
Sitemap to learn about your site's
structure, which will allow us to
improve our crawler schedule and do a
better job crawling your site in the
future. In most cases, webmasters will
benefit from Sitemap submission, and
in no case will you be penalized for
it."
Mod Note: An attribution link was originally here, but the site linked to no longer exists
It usually takes up to two weeks to be indexed. Just give it some time :)
In short: it depends.
If your website is new, Google will have to crawl and index it first. This can take time and depends on many factors (see the Google FAQs on indexing).
If the website is not new, it's possible that you are submitting URLs in the Sitemap file which do not match the URLs that were crawled and indexed. In this case, the indexed URL count is usually not zero, but it could theoretically be zero if the URLs in the Sitemap file are drastically wrong (e.g. with session IDs).
Finally, if you are submitting a non-web Sitemap file (e.g. for Google Video or Google News), it's normal for the indexed URL count to be zero: the count only applies to URLs within the normal web-search results.
Without knowing the URL of the Sitemap file it's impossible to say for sure which of the above applies.