Temporarily blocking the Google crawler: will it prevent future indexing?

Sometimes I need to apply several updates to my site. To keep things clean, I display a maintenance page while I work. The first time I did this, the maintenance page became the main indexed page on Google, so I added the meta tag <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> to prevent this. My question is: if Google comes across this temporary maintenance page, it will not index it, but does that mean it will never index that page again? Or will it index the page again once there is new content?
I would really appreciate it if someone could clear this up for me.

Google has a guideline for dealing with planned maintenance/downtime: "How to deal with planned site downtime". In short, you should return a 503 HTTP result code for the page(s) that are under maintenance or down. Here is example PHP code to use at the top of those page(s):
header('HTTP/1.1 503 Service Temporarily Unavailable');
If you know the exact or approximate time/date when the maintenance/downtime will be complete, you can send an optional Retry-After header (alongside the 503 HTTP result code above):
//when the exact completion time is known.
header('Retry-After: Sat, 08 Oct 2011 18:27:00 GMT');
or
//when the length of the downtime in seconds is known.
header('Retry-After: 86400');
For more information, read Google's article on planned downtime.
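Putting the pieces together, a minimal sketch of a complete maintenance response might look like this (the Retry-After value and the message text are placeholders, not something Google prescribes):
<?php
// Minimal maintenance response: signal temporary downtime to crawlers.
// The Retry-After value below is a placeholder; substitute your own estimate.
header('HTTP/1.1 503 Service Temporarily Unavailable');
header('Retry-After: 86400'); // ask crawlers to come back in about a day
echo 'The site is down for maintenance. Please check back later.';
exit;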

Google's crawler will re-index your site once you remove that meta tag (it constantly updates its index).
If you're really paranoid, I'd suggest checking out Google's Webmaster Tools so that you can directly control the indexing behavior of your site: http://www.google.com/webmasters/
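If the maintenance page is generated by PHP, one way to avoid forgetting to remove the tag is to emit it only while a maintenance flag exists. A sketch, where maintenance.flag is a hypothetical marker file you create before maintenance and delete afterwards:
<?php
// Sketch: print the noindex meta tag only while maintenance mode is on.
// 'maintenance.flag' is an assumed marker file, not part of any standard.
if (file_exists(__DIR__ . '/maintenance.flag')) {
    echo '<meta name="robots" content="noindex, nofollow">';
}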

Related

GoogleBot overloads the server by spidering very frequently

My website has about 500,000 pages. I made a sitemap.xml and listed all the pages in it (I know about the limitation of 50,000 links per file, so I have 10 sitemaps). Anyway, I submitted the sitemaps in Webmaster Tools and everything seems OK (no errors, and I can see the submitted and indexed links). However, I have a problem with frequent spidering: GoogleBot spiders the same page 4 times per day, even though in sitemap.xml I indicate that the page changes yearly.
Here is an example:
<url>
<loc>http://www.domain.com/destitution</loc>
<lastmod>2015-01-01T16:59:23+02:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.1</priority>
</url>
1) How do I tell GoogleBot not to spider so frequently, since it overloads my server?
2) The website has several pages like http://www.domain.com/destitution1, http://www.domain.com/destitution2 ... and I set the canonical URL to http://www.domain.com/destitution. Might that be the reason for the repeated spidering?
You can report this to Google's crawling team; see here:
In general, specific Googlebot crawling problems like this are best handled through Webmaster Tools directly. I'd go through the Site Settings for your main domain, Crawl Rate, and then use the "Report a problem with Googlebot" form there. The submissions through this form go to our Googlebot team, who can work out what (or if anything) needs to be changed on our side. They generally won't be able to reply, and won't be able to process anything other than crawling issues, but they sure know Googlebot and can help tweak what it does.
https://www.seroundtable.com/google-crawl-report-problem-19894.html
The crawling will slow down progressively. Bots are likely revisiting your pages because there are internal links between your pages.
In general, canonicals tend to reduce crawling rates, but at the beginning Google's bots need to crawl both the source and the target page. You will see the benefit later.
Google's bots don't necessarily take the lastmod and changefreq information into account. But if they establish that content is not modified, they will come back less often. It is a matter of time: every URL has a scheduler for revisits.
Bots adapt to the capacity of the server (see the crawling summary I maintain for more details). You can temporarily slow down bots by returning HTTP error code 500 if load is an issue; they will stop and come back later.
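As a sketch of that idea (using the 503 code recommended earlier rather than 500, and a made-up load threshold):
<?php
// Sketch: shed crawler load by answering 503 when the server is busy.
// sys_getloadavg() is Unix-only; the threshold of 10 is an arbitrary example.
$load = sys_getloadavg();
if ($load !== false && $load[0] > 10) {
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    header('Retry-After: 3600'); // ask bots to retry in an hour
    exit;
}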
I don't believe there is a crawling issue with your site. What you see is normal behavior. When several sitemaps are submitted at once, the crawling rates can be temporarily raised.

Main site URL removed from Google despite re-submitting it

I have a site, www.megalim.co.il.
Recently, due to a version upgrade, I discovered that I had a robots.txt file that disallowed all search engines. My Google ranking dropped, and I couldn't find the site's main page anymore.
I changed the robots.txt file to one that allows all, and now Webmaster Tools no longer reports that the site is blocked from Google.
I did this about 5 days ago. I've also fetched as Google and submitted www.megalim.co.il to the index with all related pages.
But still, when I search this: "site:www.megalim.co.il",
I get a bunch of results from my site, but not the main page!
What else should I look for?
Thanks!
Igal
You don't see your main page because of your old robots.txt. Five days is nothing for Google's bots to re-index your whole website.
Just wait a little and you will see your website fully indexed in Google's results.
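For reference, the blocking and the open robots.txt differ by a single character; a sketch of the two forms:
# robots.txt that blocks every crawler from the whole site
User-agent: *
Disallow: /

# robots.txt that allows everything (an empty Disallow)
User-agent: *
Disallow: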
Issue sorted out...
Embarrassing: apparently we (inexplicably) had a "nofollow, noindex" meta tag.
After a day we started reappearing in Google.
Thanks :)

How does Google see daily updated content?

I have a website whose content is updated daily. I have two questions:
How does Google see this content? Do I need SEO in this case?
Does a 404 error page have an influence on ranking in the search engine? (I do not have a static page.)
So that Google can "know" the content is supposed to be updated daily, it may be useful (if you don't do it yet) to implement a sitemap (and update it dynamically if necessary). In this sitemap, you can specify an update period for each page.
This is not a constraint for Google, but it can help adjust how often the indexing robots visit.
If you do, you must be "honest" with Google about update times. If Google realizes that the frequency defined in the sitemap does not correspond to the actual frequency, it can be bad for your rankings.
404 errors (and other HTTP errors) can indirectly have an adverse effect on the ranking of the site. Of course, if the robot cannot access content at a given moment, that content cannot be indexed. Moreover, if too many problems are encountered while web crawlers visit your site, Google will adjust the frequency of visits downward.
You can get some personalized advice and monitor the indexing of your site using Google Webmaster Tools (and, to a lesser extent, Analytics or any other tool that can monitor web crawler visits).
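As an illustration of generating the sitemap dynamically, here is a PHP sketch; the $pages array stands in for whatever query produces your real page list, and the URL in it is made up:
<?php
// Sketch: emit a small sitemap whose <lastmod> values come from your data.
// $pages is a hypothetical stand-in for a database query result.
header('Content-Type: application/xml; charset=utf-8');
$pages = [
    ['loc' => 'http://www.example.com/news', 'lastmod' => date('c'), 'changefreq' => 'daily'],
];
echo '<?xml version="1.0" encoding="UTF-8"?>';
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
foreach ($pages as $p) {
    echo '<url><loc>' . $p['loc'] . '</loc><lastmod>' . $p['lastmod'] . '</lastmod><changefreq>' . $p['changefreq'] . '</changefreq></url>';
}
echo '</urlset>';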
You can see the date and time when Google last visited your page, so you can tell whether Google has picked up your updated content. If your website's content is updated daily, you can ping many search engines with your website and also submit your site to Google.
You can make a sitemap containing only the URLs whose content is updated daily and submit it to Google Webmaster Tools. You can state the date and time a URL was last modified in the <lastmod> tag, hint at how frequently the page is likely to change in the <changefreq> tag, and set a high priority in the <priority> tag for the pages that are modified daily.
If you have 404 (file not found) error pages, put them in one directory and disallow that directory in your robots.txt file. Google will then not crawl those pages, so they will not be indexed and will not influence your SERP ranking.
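A sketch of the robots.txt rule that answer describes, assuming the error pages live under a hypothetical /errors/ directory:
# Keep crawlers out of the (hypothetical) directory holding error pages
User-agent: *
Disallow: /errors/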

About an isolated page on my web site

I produced a page which I have no intention of letting search engines find and crawl.
The advisable solution is robots.txt, but it is not applicable in my situation.
So I isolated this page from my site by removing all links from other pages to it, and I never put its URL on external sites.
Logically, then, it should be impossible for search engines to find this page. And that means that no matter how many outbound links are nested in this page, the PR of the site is safe.
Am I right?
Thank you very much!
Hope this question is programming related!
No, there's still a chance your page can be found by search engine crawlers. For example, it's been speculated that data from the Google Toolbar can be used to alert Googlebot to the presence of a page. And there's still a chance others might link to your page from external sites if the URL becomes known.
Your best bet is to add a robots meta tag to your page, this will prevent it from being indexed, and prevent crawlers from following any links:
<meta name="robots" content="noindex,nofollow" />
If it is on the internet and not restricted, it will be found. You may make it harder to find, but it is still possible a crawler will happen across it.
What is the link so I can check? ;)
If you have outbound links on this "isolated" page then your page will probably show up as a referrer in the logs of the linked-to page. Depending on how much the owners of the linked-to page track their stats, then they may find your page.
I've seen httpd log files turn up in Google searches. This in turn may lead others to find your page, including crawlers and other robots.
The easiest solution might be to password-protect the page.
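If the page runs through PHP, a minimal sketch of that idea using HTTP Basic authentication (the credentials are placeholders):
<?php
// Sketch: require HTTP Basic auth before serving the page, so crawlers
// (which do not send credentials) receive a 401 instead of the content.
// 'user' and 'secret' are placeholder credentials.
if (!isset($_SERVER['PHP_AUTH_USER'])
        || $_SERVER['PHP_AUTH_USER'] !== 'user'
        || $_SERVER['PHP_AUTH_PW'] !== 'secret') {
    header('WWW-Authenticate: Basic realm="Private page"');
    header('HTTP/1.1 401 Unauthorized');
    exit('Authentication required.');
}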

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. The page is currently accessible to spiders because its path is allowed in robots.txt, but the 'nofollow' clause in the meta tag prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow"; is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which in theory would cause spiders to hit pages recursively to infinity. What is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note that our search results pages and paging links are not bot-friendly: they are not rewritten and have a ?name=value query string. But from what I've seen, spiders no longer just abort when they see the '?', as the results pages ARE getting indexed with decent page rank.
To be honest, you are looking at nofollow wrong. Chances are the search spiders, especially Google, Yahoo, and MSN, are already hitting the nofollow pages, because they still have to fetch those pages to see whether they carry a noindex.
The real problem is that nofollow doesn't actually mean "don't follow"; it just means "don't pass on my reputation to this link". So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and the robot commands on links will not affect performance, because the bots are already hitting your site. To confirm this, just look at your HTTP server log.
So my vote is that you will not see any problem with removing the robot limits.
I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.
Google's bots are pretty intelligent about not traversing an entire database of dynamically generated pages, as long as the URLs give some hint that they are dynamic (e.g. a file extension of .asp or .jsp, and numeric IDs as query parameters). If you use rewrite rules to make your URLs "friendly", the bots have a harder time determining whether they are reading a static page or a dynamically generated one. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.