robots.txt blocking crawlers from accessing page [closed] - seo

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I'm trying to find out how to block crawlers from accessing links that look like this:
site.com/something-search.html
I want to block all /something-*
Can someone help me?

User-agent: *
Disallow: /something-
This blocks all URLs whose path starts with /something-. For a robots.txt accessible at http://example.com/robots.txt, the following URLs would be blocked:
http://example.com/something-
http://example.com/something-foo
http://example.com/something-foo.html
http://example.com/something-foo/bar
…
The following URLs would still be allowed:
http://example.com/something
http://example.com/something.html
http://example.com/something/
…
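A quick way to sanity-check such rules is Python's standard urllib.robotparser, which implements exactly this kind of prefix matching (a minimal sketch, using the example.com URLs from above):

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /something-",
]
rp = RobotFileParser()
rp.parse(rules)

# Paths starting with /something- are blocked for every user agent...
print(rp.can_fetch("*", "http://example.com/something-foo"))  # False
# ...while /something without the trailing hyphen is still allowed.
print(rp.can_fetch("*", "http://example.com/something/"))     # True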

In your robots.txt:
User-agent: *
Disallow: /something-(1st link)
.
.
.
Disallow: /something-(last link)
Add an entry for each page that you don't want to be seen. Note that Disallow values are URL paths starting with /, not full URLs including the domain. Although regular expressions are not allowed in robots.txt, some crawlers understand wildcard extensions.

Related

Block 100s of URLs from search engines using robots.txt [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
I have about 100 pages on my website which I don't want to be indexed in Google. Is there any way to block them using robots.txt? It would be very tiresome to edit each page and add a noindex meta tag.
All the URLs I want to block look like this:
www.example.com/index-01.html
www.example.com/index-02.html
www.example.com/index-03.html
www.example.com/index-04.html
.
.
.
.
www.example.com/index-100.html
I'm not sure, but will adding something like the following work?
User-Agent: *
Disallow: /index-*.html
Yes, it will work using a wildcard, at least for the major crawlers that support wildcard matching.
Ref: https://geoffkenyon.com/how-to-use-wildcards-robots-txt
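For what it's worth, here is a minimal Python sketch of how Google's documented wildcard semantics map onto regular expressions (* matches any run of characters, a trailing $ anchors the end; the helper name robots_pattern_to_regex is just an illustration, not a library function):

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$'
    # anchors the rule to the end of the URL, per Google's docs.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/index-*.html")
print(bool(rule.match("/index-01.html")))  # True: blocked
print(bool(rule.match("/index.html")))     # False: lacks the "index-" prefix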

Google webmaster tools and duplicated links [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I have a problem and I'm getting confused by Google Webmaster Tools.
For example, I have:
http://site.com/link/text.html
and I'm using a custom variable (CustomVar) in Google Analytics to track clicks from external sites, for example:
http://site.com/link/text.html?promoid=123
Now in Webmaster Tools I have tons of duplicate links.
In robots.txt I added:
Disallow: *?promoid
but I'm not sure if this is a good idea.
What should I do now: keep using the robots file to disallow promoid, or use rel="canonical"?
Edit: all the links with ?promoid=123 are posted on external sites, not on mine.
This is exactly what canonical URLs are for. A canonical URL tells Google that http://site.com/link/text.html is the main URL to use for that page and that any other URL carrying the same canonical is just a minor variation of it.
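For example, adding this standard link element to the <head> of text.html (using the URL from the question) marks every ?promoid variant as a duplicate of the main URL:

<link rel="canonical" href="http://site.com/link/text.html">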

Best way to prevent Google from indexing a directory [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I've researched many methods on how to prevent Google/other search engines from crawling a specific directory. The two most popular ones I've seen are:
Adding it into the robots.txt file: Disallow: /directory/
Adding a meta tag: <meta name="robots" content="noindex, nofollow">
Which method would work best? I want this directory to remain "invisible" to search engines so it does not affect my site's ranking in any way; I want it to be neutral and "just there."
Robots.txt is the way to go for this.
According to Google, you only use the meta tag if you don't have the rights to create or edit the robots.txt file. Keep the difference in mind, though: robots.txt blocks crawling, while the noindex meta tag blocks indexing but only works if the page can be crawled; a disallowed URL can still appear in results (without a snippet) if other sites link to it.
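A minimal robots.txt for this, assuming the directory really is served at /directory/ as written in the question:

User-agent: *
Disallow: /directory/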

Should the sitemap be disallowed in robots.txt? And robots.txt itself? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
This is a very basic question, but I can't find a direct answer anywhere online. When searching for my website on Google, sitemap.xml and robots.txt are returned as search results (amongst more useful results). To prevent this, should I add the following lines to robots.txt?
Disallow: /sitemap.xml
Disallow: /robots.txt
Won't this stop search engines from accessing the sitemap or robots file?
Also (or instead), should I use Google's URL removal tool?
You won't stop the crawler from indexing robots.txt, because it's a chicken-and-egg situation. However, if you block the sitemap, you stop Google and other search engines from looking directly at it, and you could lose some indexing benefit by denying your sitemap.xml.
Is there a particular reason you don't want users to be able to see the sitemap?
I actually do this, which is specific to just the Google crawler:
User-agent: Googlebot
Allow: /
# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
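Note that, per the sitemaps protocol, the Sitemap directive is independent of the User-agent line, so crawlers other than Googlebot will still pick it up wherever it appears in the file.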

Asterisk in robots.txt [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Wondering if the following will work for Google in robots.txt:
Disallow: /*.action
I need to exclude all URLs ending with .action.
Is this correct?
To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
So, you are close: use Disallow: /*.action$ with a trailing "$". Note that the $ anchor means only URLs that actually end in .action match, so a URL like /page.action?id=1 would no longer be blocked.
Of course, that's merely what Google suggests: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
All bots are different.
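Reusing the hypothetical robots_pattern_to_regex helper sketched earlier, the effect of the trailing "$" looks like this:

rule = robots_pattern_to_regex("/*.action$")
print(bool(rule.match("/report.action")))       # True: blocked
print(bool(rule.match("/report.action?id=1")))  # False: $ anchors the end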
The original robots.txt specification provides no wildcard support; rules can only match the beginning of URL paths.
Google implements non-standard extensions, described in its documentation (look in the "Manually create a robots.txt file" section under "To block files of a specific file type").
I don't think it will work; you would need to move all .action files to a location that you then disallow.