Prevent Google from indexing dynamic error pages (non-404) [closed] - seo

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
There are some non-404 error pages on my website. What is the best way to stop Google from indexing them?
option 1
header("HTTP/1.0 410 Gone");
What if the content is not gone? For example, the article never existed, or a wrong parameter was caught.
option 2
<meta name="robots" content="noindex" />
Does it affect only that one page, or the whole domain?
option 3
Use a 404, which would cause some other problems that I would like to avoid.
robots.txt
This option will not work, since the error depends on the database and is not static.

Best practice is to do a 301 redirect to similar content on your site if content is removed.
To stop Google from crawling certain areas of your site, use robots.txt.
UPDATE: If you send a 200 OK and add the robots meta tag (Option 2 in your question), this should do what you want.
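A minimal sketch of Option 2, written in Python rather than the PHP used in the question. The error page carries its own robots meta tag, so the directive applies only to that one response and not to the whole domain. The helper name `render_error_page` and the markup are assumptions made for illustration.

```python
from html import escape


def render_error_page(message: str) -> str:
    """Return an HTML error page that asks crawlers not to index it.

    Hypothetical helper: the name and the markup are illustrative only.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        # The robots meta tag only affects the page that contains it.
        '  <meta name="robots" content="noindex">\n'
        "  <title>Error</title>\n"
        "</head>\n"
        # Escape the message so user-supplied parameters cannot inject markup.
        f"<body><p>{escape(message)}</p></body>\n"
        "</html>\n"
    )


print(render_error_page("The article does not exist."))
```

Regular pages are rendered without the tag, so they stay indexable.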

One way to keep Googlebot away from something is to use a robots.txt file:
User-agent: googlebot
Disallow: /mypage.html
Disallow: /mp3/
This way you can manually disable single pages or entire directories.
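The rules above can be checked locally before deploying them. This is a small sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed` and the test paths are assumptions, and the parser only approximates how Googlebot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /mypage.html
Disallow: /mp3/
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if the rules let `user_agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/mypage.html", "/mp3/song.mp3", "/index.html"):
    print(path, is_allowed("googlebot", path))
```

Because the group names only `googlebot`, other crawlers are not restricted by these rules.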

Related

I have disallowed everything for 10 days [closed]

Closed 3 years ago.
Due to an update error, I put a robots.txt file into production that was intended for a test server. As a result, production ended up with this robots.txt:
User-Agent: *
Disallow: /
That was 10 days ago, and I now have more than 7000 URLs flagged with either the error "Submitted URL blocked by robots.txt" or the warning "Indexed, though blocked by robots.txt".
Yesterday, of course, I corrected the robots.txt file.
What can I do to speed up the correction by Google or any other search engine?
You could use the robots.txt test feature. https://www.google.com/webmasters/tools/robots-testing-tool
Once the robots.txt test has passed, click the "Submit" button and a popup window should appear. Then click the "Submit" button under option #3:
Ask Google to update
Submit a request to let Google know your robots.txt file has been updated.
Other than that, I think you'll have to wait for Googlebot to crawl the site again.
Best of luck :).

Canonical Link Element for Dynamic Pages ( rel="canonical") [closed]

Closed 8 years ago.
I have a stack system that passes page tokens in the URL. My pages are also dynamically generated, so a single PHP page serves all the content based on its parameters:
index.php?grade=7&page=astronomy&pageno=2&token=foo1
I understand the goal of search indexing to be one link per unique set of data on your website.
Bing has a way to specify specific parameters to ignore.
Google, it seems, uses rel="canonical", but is it possible to use this to tell Google to ignore the token parameter? My URLs (without tokens) can be anything like:
index.php?grade=5&page=astronomy&pageno=2
index.php?grade=6&page=math&pageno=1
index.php?grade=7&page=chemistry&page2=combustion&pageno=4
If there is not a solution for Google... Other possible solutions:
If I provide a site map for each base page, I can supply base URLs, but any crawling of that page's links will create tokens on the resulting pages. Plus, I would have to constantly recreate the site map to cover new pages (e.g. with 25 posts per page, post 26 is on page 2).
One idea I've had is to identify bots on page load (I do this already) and disable all tokens for bots. Since (I'm presuming) bots don't use session data between pages anyway, the back buttons and editing features are useless. Is it feasible (or is it crazy) to write custom code for bots?
Thanks for your thoughts.
You can use the Google Webmaster Tools to tell Google to ignore certain URL parameters.
This is covered on the Google Webmaster Help page.
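For the rel="canonical" route raised in the question, one possible approach is to build the canonical URL by dropping the token parameter and emit it in each page's head. This is a sketch under assumptions: the names `IGNORED_PARAMS`, `canonical_url` and `canonical_link_tag` are hypothetical, `example.com` stands in for the real domain, and only `token` is treated as session-specific.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "token" is the only parameter that does not change the content.
IGNORED_PARAMS = {"token"}


def canonical_url(url: str) -> str:
    """Return `url` without the ignored query parameters (and without a fragment)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Return a <link rel="canonical"> element for the page at `url`."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


print(canonical_link_tag(
    "https://example.com/index.php?grade=7&page=astronomy&pageno=2&token=foo1"
))
```

Every tokenized variant of a page then points at the same token-free URL, which is the "one link per unique set of data" goal described in the question.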

Google search results showing my site even though I've disallowed it in robots.txt [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
My staging site is showing up in search results, even though I've specified that I don't want the site crawled. Here's the contents of my robots.txt file for the staging site:
User-agent: Mozilla/4.0 (compatible; ISYS Web Spider 9)
Disallow:
User-agent: *
Disallow: /
Is there something I'm doing wrong here?
Your robots.txt tells Google not to crawl/index your page's content.
It doesn't tell Google not to add your URL to their search results.
So if your page (which is blocked by robots.txt) is linked somewhere else, and Google finds this link, it checks your robots.txt if it is allowed to crawl. It finds that it is forbidden, but hey, it still has your URL.
Now Google might decide that it would be useful to include this URL in their search index. But as they are not allowed (per your robots.txt) to get the page's metadata/content, they only index it with keywords from your URL itself, and possibly anchor/title text that someone else used to link to your page.
If you don't want your URLs to be indexed by Google, you'd need to use the meta-robots, e.g.:
<meta name="robots" content="noindex">
See Google's documentation: Using meta tags to block access to your site
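To confirm that a staging page really serves the meta tag, the HTML can be scanned for it. The sketch below uses Python's standard `html.parser`; the class name `RobotsMetaParser` and the function `has_noindex` are hypothetical, and it only looks at `<meta name="robots">`, not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(has_noindex('<html><head><meta name="robots" content="noindex"></head></html>'))
```

The tag only helps if crawlers are allowed to fetch the page, so the robots.txt block has to be lifted for the pages that carry it.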
Your robots file looks clean, but remember that Google, Yahoo, Bing, etc. do not need to crawl your site in order to index it.
There is a very good chance the Open Directory Project or a less polite bot of some kind stumbled across it. Once someone else finds your site these days it seems everyone gets their hands on it. Drives me crazy too.
A good rule of thumb when staging is:
1. Always test your robots file for syntax oversights before posting it on your production site. Try robots.txt Checker, Analyze robots.txt, or Robots.txt Analysis - Check whether your site can be accessed by Robots.
2. Password protect your content while staging. Even if it's somewhat bogus, put a login and password at your index's root. It's an extra step for your fans and testers, but well worth it if you want polite OR impolite bots out of your hair.
3. Depending on the project, you may not want to use your actual domain for testing. Even if I have a static IP, sometimes I'll use dnsdynamic or noip.com to stage my password-protected site. So for example, if I want to stage my domain ihatebots.com :) I will simply go to dnsdynamic or noip (they're free, btw) and create a fake domain such as ihatebots.user32.com or somthingtotallyrandom.user32.com, and then assign my IP address to it. This way, even if someone crawls my staging project, my original domain ihatebots.com is still untouched by any kind of search engine result (and so are its records, btw).
Remember there are billions of dollars around the world aimed at finding you 24 hours a day, and that number is ever increasing. It's tough these days. Be creative and always password protect if you can while staging.
Good luck.
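The password protection suggested in point 2 is usually configured in the web server itself. If it has to be done in application code instead, the credential check for HTTP Basic authentication could look like the sketch below. The function name `check_basic_auth` and the credentials are assumptions, and a real staging setup would normally rely on the server's own authentication support.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if an Authorization header carries the expected credentials."""
    if not header_value:
        return False
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and invalid UTF-8.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


header = "Basic " + base64.b64encode(b"tester:secret").decode("ascii")
print(check_basic_auth(header, "tester", "secret"))
```

Requests that fail the check would be answered with a 401 and a `WWW-Authenticate` header, which keeps both polite and impolite bots away from the content.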

Would 401 Error be a good choice? [closed]

Closed 9 years ago.
One of my sites has a lot of restricted pages that are only available to logged-in users; everyone else gets a default "you have to be logged in ... " view.
The problem is that a lot of these pages are listed on Google with the not-logged-in view, and it looks pretty bad when 80% of the pages in the list have the same title and description/preview.
Would it be a good choice to send a 401 Unauthorized header along with my default not-logged-in view? And would this stop Google (and other engines) from indexing these pages?
Thanks!
(and if you have another (better?) solution I would love to hear about it!)
Use robots.txt to tell search engines not to index the not-logged-in pages.
http://www.robotstxt.org/
Example:
User-agent: *
Disallow: /error/notloggedin.html
401 Unauthorized is the response code for requests that require user authentication, so this is exactly the response code you want and have to send. See: Status Code Definitions
EDIT: Your previous suggestion, response code 403, is for requests where authentication makes no difference, e.g. disabled directory browsing.
Here are the status codes Googlebot understands and recommends:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40132
In your case, an HTTP 403 would be the right one.
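The answers above disagree on 401 versus 403. As an illustration of the 401 variant the question asks about, here is a tiny WSGI sketch in Python. The `/members/` prefix and the `myapp.user` environ key are hypothetical stand-ins for the site's real routing and session handling. A 401 response must include a `WWW-Authenticate` header, so the sketch sends one.

```python
def app(environ, start_response):
    """Tiny WSGI app: restricted paths answer 401 to anonymous visitors."""
    path = environ.get("PATH_INFO", "/")
    # Hypothetical: a session layer is assumed to store the user under this key.
    logged_in = environ.get("myapp.user") is not None

    if path.startswith("/members/") and not logged_in:
        body = b"You have to be logged in to view this page."
        start_response("401 Unauthorized", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("WWW-Authenticate", 'Basic realm="members"'),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    body = b"Page content"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The visitor still sees the same not-logged-in text, but the status line tells crawlers that the response is not the page's real content.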

SEO Help with Pages Indexed by Google [closed]

Closed 9 years ago.
I'm working on optimizing my site for Google's search engine, and lately I've noticed that when doing a "site:www.joemajewski.com" query, I get results for pages that shouldn't be indexed at all.
Let's take a look at this page, for example: http://www.joemajewski.com/wow/profile.php?id=3
I created my own CMS, and this is simply a breakdown of user id #3's statistics, which I noticed is indexed by Google, although it shouldn't be. I understand that it takes some time before Google's results reflect accurately on my site's content, but this has been improperly indexed for nearly six months now.
Here are the precautions that I have taken:
My robots.txt file has a line like this:
Disallow: /wow/profile.php*
When running the url through Google Webmaster Tools, it indicates that I did, indeed, correctly create the disallow command. It did state, however, that a page that doesn't get crawled may still get displayed in the search results if it's being linked to. Thus, I took one more precaution.
In the source code I included the following meta data:
<meta name="robots" content="noindex,follow" />
I am assuming that follow means to use the page when calculating PageRank, etc, and the noindex tells Google to not display the page in the search results.
This page, profile.php, is used to take the $_GET['id'] and find the corresponding registered user. It displays a bit of information about that user, but is in no way relevant enough to warrant a display in the search results, so that is why I am trying to stop Google from indexing it.
This is not the only page Google is indexing that I would like removed. I also have a WordPress blog, and there are many category pages, tag pages, and archive pages that I would like removed, and am doing the same procedures to attempt to remove them.
Can someone explain how to get pages removed from Google's search results, and possibly give some criteria that help determine which types of pages I don't want indexed? In terms of my WordPress blog, the only pages that I truly want indexed are my articles. Everything else I have tried to block, with little luck from Google.
Can someone also explain why it's bad to have pages indexed that don't provide any new or relevant content, such as pages for WordPress tags or categories, which are clearly never going to receive traffic from Google?
Thanks!
It would be a better idea to revise your meta robots directives to:
<meta name="robots" content="noindex,noarchive,nosnippet,follow" />
My robots file was blocking access to the page where the meta tag was included. Thus, even though the meta tag told Google to not index my pages, Google never got that far.
Case closed. :P
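The conflict described in this answer can be detected with a quick audit: list every page that carries a noindex tag but is also blocked by robots.txt, since the crawler never gets to read the tag on those pages. This is a sketch using Python's `urllib.robotparser`; the function name `hidden_noindex_pages` and the page inventory are hypothetical. The standard-library parser does not treat `*` in paths as a wildcard, so the trailing `*` from the question's rule is dropped here, which is an assumption about equivalent prefix matching.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_txt, pages, user_agent="Googlebot"):
    """Return paths whose noindex tag is unreachable because robots.txt blocks them.

    `pages` maps a path to True if that page carries a noindex robots meta tag.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path, has_noindex in pages.items()
        if has_noindex and not parser.can_fetch(user_agent, path)
    ]


# Assumption: the question's rule, written without the trailing "*".
robots_txt = """\
User-agent: *
Disallow: /wow/profile.php
"""

# Hypothetical page inventory for illustration.
pages = {
    "/wow/profile.php": True,
    "/blog/tag/php": True,
    "/blog/my-article": False,
}

print(hidden_noindex_pages(robots_txt, pages))
```

Any path the audit reports needs its Disallow rule removed before the noindex tag can take effect.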
If you have blocked and tested the URL in robots.txt, it must work. You don't need to add an additional meta tag to the page.
I am sure it will work; just give Google some time to crawl your website.
To remove URLs, you can use Google Webmaster Tools (I am sure you know that).