Would 401 Error be a good choice?

One of my sites has a lot of restricted pages that are only available to logged-in users; for everyone else they output a default "you have to be logged in ..." view.
The problem is that a lot of these pages are listed on Google with the not-logged-in view, and it looks pretty bad when 80% of the pages in the results have the same title and description/preview.
Would it be a good choice to send a 401 Unauthorized header along with my default not-logged-in view? And would this stop Google (and other engines) from indexing these pages?
Thanks!
(and if you have another (better?) solution I would love to hear about it!)

Use a robots.txt file to tell search engines not to index the not-logged-in pages.
http://www.robotstxt.org/
Ex.
User-agent: *
Disallow: /error/notloggedin.html

401 Unauthorized is the response code for requests that require user authentication, so this is exactly the response code you want and have to send. See Status Code Definitions.
EDIT: Your previous suggestion, response code 403, is for requests where authentication makes no difference, e.g. disabled directory browsing.
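A minimal PHP sketch of this approach (checkLoggedIn() and renderLoginPrompt() are hypothetical placeholders for your own auth check and view):

<?php
// Hypothetical auth check -- replace with your own session/login logic.
if (!checkLoggedIn()) {
    // 401 tells clients (and crawlers) that authentication is required;
    // per the answer below, 403 Forbidden is another candidate.
    header('HTTP/1.1 401 Unauthorized');
    renderLoginPrompt(); // outputs the default "you have to be logged in ..." view
    exit;
}
// ... render the restricted page for logged-in users ...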

Here are the status codes Googlebot understands, with Google's recommendations:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40132
In your case an HTTP 403 would be the right one.

Related

I have disallowed everything for 10 days

Due to an update error, I pushed to production a robots.txt file that was intended for a test server. As a result, production ended up with this robots.txt:
User-Agent: *
Disallow: /
That was 10 days ago, and I now have more than 7000 URLs flagged with either the error "Submitted URL blocked by robots.txt" or the warning "Indexed, though blocked by robots.txt".
Yesterday, of course, I corrected the robots.txt file.
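For reference, a robots.txt that blocks nothing (the safe default once everything should be crawlable again) looks like this:
User-Agent: *
Disallow: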
What can I do to speed up the correction by Google or any other search engine?
You could use the robots.txt test feature. https://www.google.com/webmasters/tools/robots-testing-tool
Once the robots.txt test has passed, click the "Submit" button and a popup window should appear; then click the option #3 "Submit" button again:
Ask Google to update
Submit a request to let Google know your robots.txt file has been updated.
Other than that, I think you'll have to wait for Googlebot to crawl the site again.
Best of luck :).

Proper route for checking resource existence in a RESTful API

What's the best/RESTful way to design an API endpoint for checking the existence of resources?
For example, there is a user database. While a new user tries to sign up, I want to check on the fly whether the email has already been used.
My idea is: POST /user/exists with a payload like {"email": "foo@bar.com"}. The response would be either 200 OK or 409 Conflict.
Is this a proper way?
Thanks!
HEAD is the most efficient for existence checks:
HEAD /users/{username}
Request a user's path, and return a 200 if they exist, or a 404 if they don't.
Mind you, you probably don't want to be exposing endpoints that check email addresses. It opens a security and privacy hole. Usernames that are already publicly displayed around a site, like on reddit, could be ok.
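A minimal PHP sketch of such an endpoint (userExists() is a hypothetical lookup against your user store, and the routing here is deliberately naive):

<?php
// Hypothetical lookup -- replace with a query against your user store.
$username = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH)); // /users/{username}

if ($_SERVER['REQUEST_METHOD'] === 'HEAD') {
    http_response_code(userExists($username) ? 200 : 404);
    exit; // HEAD responses carry no body
}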
I believe the proper way to just check for existence is to use a HEAD verb for whatever resource you would normally get with a GET request.
I recently came across a situation where I wanted to check the existence of a potentially large video file on the server. I didn't want the server to try and start streaming the bytes to any client so I implemented a HEAD response that just returned the headers that the client would receive when doing a GET request for that video.
You can check out the W3 specification here or read this blog post about practical uses of the HEAD verb.
I think this is awesome because you don't have to think about forming your route any differently from a normal RESTful route in order to check for the existence of any resource, whether that's a file or a typical resource like a user.
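A rough PHP sketch of that pattern for a static video file (the path is illustrative):

<?php
// Illustrative path -- substitute the real location of the video.
$path = '/var/media/videos/example.mp4';

if ($_SERVER['REQUEST_METHOD'] === 'HEAD') {
    if (is_file($path)) {
        header('Content-Type: video/mp4');
        header('Content-Length: ' . filesize($path));
        http_response_code(200);
    } else {
        http_response_code(404);
    }
    exit; // same headers a GET would send, but no bytes are streamed
}
// ... a GET request would stream the file from here ...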
GET /users?email=foo@bar.com
This is a basic search query: find me the users which have the email address specified. Respond with an empty collection if no users exist, or respond with the users which match the condition.
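For example (the JSON shape here is only an assumption about what such an API might return):
GET /users?email=foo@bar.com
might answer 200 OK with [] when no user matches, or 200 OK with [{"id": 42, "email": "foo@bar.com"}] when one does.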
I prefer:
HEAD /users/email/foo@bar.com
Explanation: You are trying to find, among all the users, someone who is using the e-mail foo@bar.com. I'm assuming here that the e-mail is not the key of your resource and that you would like some flexibility in your endpoints, because if you need another endpoint to check the availability of other user information (like username, number, etc.), this approach fits very well:
HEAD /users/email/foo@bar.com
HEAD /users/username/foobar
HEAD /users/number/56534324
As the response, you only need to return 200 (exists, so it's not available) or 404 (does not exist, so it's available) as the HTTP status code.
You can also use:
HEAD /emails/foo@bar.com
if HEAD /users/email/foo@bar.com conflicts with an existing REST resource, like a GET /users/email/foo@bar.com with a different business rule. As described in Mozilla's documentation:
The HEAD method asks for a response identical to that of a GET request, but without the response body.
So, having a GET and a HEAD with different rules is not good.
A HEAD /users/foo@bar.com is a good option too if the e-mail is the "key" of the users, because you (probably) have a GET /users/foo@bar.com.

Prevent Google from indexing dynamic error pages (non-404)

There are some non-404 error pages on my website. What is the best way to stop Google from indexing them?
option 1
header("HTTP/1.0 410 Gone");
What if the content is not gone? For example: the article does not exist, or a wrong parameter has been caught.
option 2
<meta name="robots" content="noindex" />
Does it affect only that one page or the whole domain?
option 3
Using 404, which would cause some other problems that I would like to avoid.
robots.txt
This option will not work, since the errors depend on the database and are not static.
Best practice is to do a 301 redirect to similar content on your site if content is removed.
To stop Google indexing certain areas of your site use robots.txt
UPDATE: If you send a 200 OK and add the robots meta tag (option 2 in your question), this should do what you want.
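A minimal PHP sketch of that update (articleExists() is a hypothetical database check):

<?php
// Hypothetical database check -- replace with your own lookup.
if (!articleExists(isset($_GET['id']) ? $_GET['id'] : null)) {
    http_response_code(200); // the page itself is served normally ...
    echo '<!DOCTYPE html><html><head>';
    echo '<meta name="robots" content="noindex">'; // ... but engines are asked not to index it
    echo '</head><body>Sorry, this article does not exist.</body></html>';
    exit;
}
// ... render the real article ...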
One way to prevent Google's bots from indexing something is to use a robots.txt file:
User-agent: googlebot
Disallow: /mypage.html
Disallow: /mp3/
This way you can manually disable single pages or entire directories.

Canonical Link Element for Dynamic Pages (rel="canonical")

I have a stack system that passes page tokens in the URL. My pages are also dynamically created content, so I have one PHP page that accesses the content via parameters.
index.php?grade=7&page=astronomy&pageno=2&token=foo1
I understand the search indexing goal to be: have only one link per unique set of data on your website.
Bing has a way to specify specific parameters to ignore.
Google, it seems, uses rel="canonical", but is it possible to use this to tell Google to ignore the token parameter? My URLs (without tokens) can be anything like:
index.php?grade=5&page=astronomy&pageno=2
index.php?grade=6&page=math&pageno=1
index.php?grade=7&page=chemistry&page2=combustion&pageno=4
If there is not a solution for Google... Other possible solutions:
If I provide a sitemap for each base page, I can supply base URLs, but any crawling of that page's links will create tokens on the resulting pages. Plus I would have to constantly recreate the sitemap to cover new pages (e.g. 25 posts per page, post 26 is on page 2).
One idea I've had is to identify bots on page load (I do this already) and disable all tokens for bots. Since (I'm presuming) bots don't use session data between pages anyway, the back buttons and editing features are useless. Is it feasible (or is it crazy) to write custom code for bots?
Thanks for your thoughts.
You can use the Google Webmaster Tools to tell Google to ignore certain URL parameters.
This is covered on the Google Webmaster Help page.
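The question's rel="canonical" idea can also work, by simply omitting the token from the canonical URL. A rough PHP sketch (example.com is a placeholder, and the token is assumed to live in a token query parameter as in the question):

<?php
// Rebuild the current query string without the session token.
parse_str($_SERVER['QUERY_STRING'], $params);
unset($params['token']);
$canonical = 'https://example.com/index.php?' . http_build_query($params);
// Emit this inside <head> so crawlers treat the token-free URL as canonical.
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '">';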

SEO - What to do when content is taken offline

I'm going to have a site where content remains on the site for a period of 15 days and then gets removed.
I don't know too much about SEO, but my concern is about the SEO implications of having "content" indexed by the search engines, and then one day it suddenly goes and leaves a 404.
What is the best thing I can do to cope with content that comes and goes in the most SEO friendly way possible?
The best way will be to respond with HTTP status code 410 (Gone);
from w3c:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
more about status codes here
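A minimal PHP sketch for the 15-day window described in the question (publishedAt() is a hypothetical lookup of the content's publication time):

<?php
// Hypothetical lookup -- returns a DateTime, or null if the item never existed.
$published = publishedAt(isset($_GET['id']) ? $_GET['id'] : null);

if ($published === null) {
    http_response_code(404); // never existed: plain Not Found
} elseif ($published < new DateTime('-15 days')) {
    http_response_code(410); // existed, but the 15-day window has passed: Gone
} else {
    // ... render the content normally ...
}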
To keep the traffic, it may be an option not to delete the old content but to archive it, so it remains accessible by its old URL but linked from some deeper point in an archive section of your site.
If you really want to delete it then it is totally ok to return with 404 or 410. Spiders understand that the resource is not available anymore.
Most search engines use something called a robots.txt file. You can specify which URLs and paths you want the search engine to ignore. So if all of your content is at www.domain.com/content/* then you can have Google ignore that whole branch of your site.