robots.txt url blocking [duplicate] - seo

I am trying to set up robots.txt for a web page, but the Disallow rule doesn't work when I test it.
I want to block a thank-you page:
http://designs.webelevate.net/wordpress/index.php/contact-thank-page/
using the rule
Disallow: /index.php/contact-thank-page/
Any suggestions?

The line you have right now only blocks that path when it starts at the site root. E.g. it would block the URL below (note: wordpress/ is missing):
http://designs.webelevate.net/index.php/contact-thank-page/
To cover the URL you noted, either add /wordpress to the start of the Disallow path or add a * wildcard like below.
Disallow: */index.php/contact-thank-page/
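Side by side, the two fixes would look like this (a hypothetical file for illustration; you only need one of the two Disallow lines, and the * wildcard is a widely supported extension rather than part of the original robots.txt standard):

```
User-agent: *
# Option 1: spell out the full path from the site root
Disallow: /wordpress/index.php/contact-thank-page/
# Option 2: wildcard prefix (honored by Googlebot and most major crawlers)
Disallow: */index.php/contact-thank-page/
```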

Related

I have disallowed everything for 10 days [closed]

Due to an update error, I deployed to production a robots.txt file that was intended for a test server. As a result, production ended up with this robots.txt:
User-Agent: *
Disallow: /
That was 10 days ago, and I now have more than 7000 URLs flagged with the error "Submitted URL blocked by robots.txt" or the warning "Indexed, though blocked by robots.txt".
Yesterday, of course, I corrected the robots.txt file.
What can I do to speed up the correction by Google or any other search engine?
You could use Google's robots.txt testing tool: https://www.google.com/webmasters/tools/robots-testing-tool
Once the robots.txt test has passed, click the "Submit" button; in the popup window that appears, choose option #3, "Ask Google to update" ("Submit a request to let Google know your robots.txt file has been updated"), and click "Submit" again.
Other than that, I think you'll have to wait for Googlebot to crawl the site again.
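While waiting on Google, you can at least sanity-check the corrected file locally with Python's standard urllib.robotparser. The rules below are the broken and fixed versions from the question, fed in directly rather than fetched from the live site:

```python
# Replay the old (broken) and new (fixed) robots.txt rules locally
# and confirm that the fix actually unblocks crawling.
from urllib.robotparser import RobotFileParser

broken = RobotFileParser()
broken.parse(["User-Agent: *", "Disallow: /"])

fixed = RobotFileParser()
fixed.parse(["User-Agent: *", "Disallow:"])

# The broken file blocks every URL; the fixed file allows them.
print(broken.can_fetch("*", "https://example.com/some-page"))  # False
print(fixed.can_fetch("*", "https://example.com/some-page"))   # True
```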
Best of luck :).

Redirect Blog Posts

I have hundreds of blog posts that need to be redirected due to a site redesign. Currently, the URL for each blog item contains a number followed by the name of the post, for instance www.mysite.com/2199-this-is-the-blog-post-name. I want to redirect all of the posts into a new directory, so that the URL becomes www.mysite.com/new-directory/2199-this-is-the-blog-post-name.
What I want to know is the easiest way to redirect these. Is there a way to redirect any path that starts with a number, for example 2? (I am not worried about non-blog URLs that may start with a 2.) I have tried several RewriteCond/RewriteRule combinations but have yet to find anything that works.
Try the following in your .htaccess file:
RedirectMatch 301 ^/([0-9]+.+)$ /newdir/$1
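You can check the pattern offline before deploying: the regex syntax RedirectMatch uses is the same as what Python's re module understands here. The target directory /newdir/ is taken from the answer above; swap in /new-directory/ if that is your real path.

```python
# Simulate the RedirectMatch rule against a couple of sample paths.
import re

pattern = re.compile(r"^/([0-9]+.+)$")

def redirect_target(path):
    """Return the redirect destination, or None if the rule doesn't match."""
    m = pattern.match(path)
    return "/newdir/" + m.group(1) if m else None

print(redirect_target("/2199-this-is-the-blog-post-name"))
# /newdir/2199-this-is-the-blog-post-name
print(redirect_target("/about-us"))  # None (doesn't start with a digit)
```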

What is the meaning of /*+* in robots.txt file?

I have a question regarding robots.txt file.
Disallow: Blog/*+*
What does that mean?
Under the original robots.txt standard, * has no special meaning in a Disallow value, so a strictly conforming robot would treat Blog/*+* as a literal path prefix. Crawlers that support wildcards (such as Googlebot) would instead read it as: block any URL whose path contains Blog/ followed by anything, then a +, then anything. Also, since the value doesn't start with a slash, there is no telling how individual robots will deal with it.
from : http://www.robotstxt.org/orig.html
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
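The prefix matching the quoted spec describes can be modeled in a few lines (a simplified sketch of the original standard, with no wildcard support), which reproduces the /help vs. /help/ examples above:

```python
# Per the original robots.txt spec: a URL path is disallowed if it
# starts with any non-empty Disallow value (plain prefix match).
def is_disallowed(path, disallow_values):
    return any(path.startswith(value) for value in disallow_values if value)

print(is_disallowed("/help.html", ["/help"]))        # True
print(is_disallowed("/help/index.html", ["/help"]))  # True
print(is_disallowed("/help.html", ["/help/"]))       # False
```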

Prevent Google from indexing dynamic error pages (non-404) [closed]

There are some non-404 error pages on my website. What is the best way to stop Google from indexing them?
Option 1:
header("HTTP/1.0 410 Gone");
But what if the content is not actually gone? For example: the article does not exist, or an invalid parameter was passed.
Option 2:
<meta name="robots" content="noindex" />
Does it affect only one page or the whole domain?
Option 3:
Using 404, which causes some other problems I would like to avoid.
robots.txt
This option will not work, since the errors depend on the database and the pages are not static.
Best practice is to do a 301 redirect to similar content on your site if content is removed.
To stop Google from indexing certain areas of your site, use robots.txt.
UPDATE: If you send a 200 OK and add the robots meta tag (option 2 in your question), this should do what you want.
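Since the error pages are dynamic, option 2 can be generated per response. A minimal sketch (the function name and message are illustrative, not from any framework) that returns 200 OK with the noindex meta tag, plus the equivalent X-Robots-Tag HTTP header as a belt-and-braces measure:

```python
# Build a "soft error" response that search engines should not index:
# 200 OK, a robots noindex meta tag, and the matching X-Robots-Tag header.
def error_response(message):
    body = (
        "<!doctype html><html><head>"
        '<meta name="robots" content="noindex">'
        "</head><body><p>{}</p></body></html>"
    ).format(message)
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        # HTTP-header equivalent of the meta tag, applied per response
        "X-Robots-Tag": "noindex",
    }
    return 200, headers, body

status, headers, body = error_response("This article does not exist.")
print(status)  # 200
```

The meta tag affects only the page it appears on, not the whole domain, so emitting it from the error path alone leaves the rest of the site indexable.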
One way to prevent Google's bots from indexing something is to use a robots.txt file:
User-agent: googlebot
Disallow: /mypage.html
Disallow: /mp3/
This way you can manually disable single pages or entire directories.

Google search results showing my site even though I've disallowed it in robots.txt [closed]

My staging site is showing up in search results, even though I've specified that I don't want the site crawled. Here's the contents of my robots.txt file for the staging site:
User-agent: Mozilla/4.0 (compatible; ISYS Web Spider 9)
Disallow:
User-agent: *
Disallow: /
Is there something I'm doing wrong here?
Your robots.txt tells Google not to crawl your page's content.
It doesn't tell Google not to add your URL to its search results.
So if your page (which is blocked by robots.txt) is linked somewhere else, and Google finds this link, it checks your robots.txt if it is allowed to crawl. It finds that it is forbidden, but hey, it still has your URL.
Now Google might decide that it would be useful to include this URL in their search index. But as they are not allowed (per your robots.txt) to get the page's metadata/content, they only index it with keywords from your URL itself, and possibly anchor/title text that someone else used to link to your page.
If you don't want your URLs to be indexed by Google, you'd need to use the meta-robots, e.g.:
<meta name="robots" content="noindex">
See Google's documentation: Using meta tags to block access to your site
Your robots.txt file looks clean, but remember that Google, Yahoo, Bing, etc. do not need to crawl your site in order to index it.
There is a very good chance the Open Directory Project or a less polite bot of some kind stumbled across it. Once someone else finds your site these days, it seems everyone gets their hands on it. Drives me crazy too.
A good rule of thumb when staging is:
1. Always test your robots.txt file for any syntax oversights before posting it on your production site. Try robots.txt Checker, Analyze robots.txt, or Robots.txt Analysis - Check whether your site can be accessed by Robots.
2. Password-protect your content while staging. Even if it's somewhat bogus, put a login and password at your index's root. It's an extra step for your fans and testers, but well worth it if you want polite --OR-- impolite bots out of your hair.
3. Depending on the project, you may not want to use your actual domain for testing. Even if I have a static IP, sometimes I'll use dnsdynamic or noip.com to stage my password-protected site. So for example, if I want to stage my domain ihatebots.com :) I will simply go to dnsdynamic or noip (they're free, by the way) and create a fake domain such as ihatebots.user32.com or somethingtotallyrandom.user32.com, and then assign my IP address to it. This way, even if someone crawls my staging project, my original domain ihatebots.com stays untouched by any kind of search engine result (and so are its records, by the way).
Remember there are billions of dollars around the world aimed at finding you 24 hours a day, and that number is ever increasing. It's tough these days. Be creative, and always password-protect while staging if you can.
Good luck.