Asterisk in robots.txt [closed] - seo

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I'm wondering whether the following will work for Google in robots.txt:
Disallow: /*.action
I need to exclude all URLs ending with .action.
Is this correct?

To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
So you are close: use Disallow: /*.action$, with a trailing "$".
Of course, that's merely what Google suggests: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
Other bots may handle wildcards differently, or not at all.
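As a rough illustration of what the trailing "$" changes, here is a minimal Python sketch that approximates the wildcard matching Google documents ("*" matches any run of characters, "$" anchors the end of the URL). The function names are hypothetical, and this is not Google's actual implementation.

```python
import re

def rule_to_regex(rule):
    """Translate a Google-style robots.txt path rule into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def rule_matches(rule, path):
    # re.match anchors at the start of the path, like robots.txt prefix matching.
    return rule_to_regex(rule).match(path) is not None

# With the trailing "$", only URLs that end in .action are matched.
print(rule_matches("/*.action$", "/cart/checkout.action"))
print(rule_matches("/*.action$", "/cart/checkout.action?step=2"))
# Without it, the rule also matches URLs that merely contain .action.
print(rule_matches("/*.action", "/cart/checkout.action?step=2"))
```

The example paths are made up for illustration; the point is only that "/*.action" without "$" also catches URLs with a query string after ".action".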

The original robots.txt specification provides no way to include wildcards; a rule matches only the beginning of a URI path.
Google implements non-standard extensions, described in their documentation (look in the Manually create a robots.txt file section under "To block files of a specific file type").

I don't think it will work. You would need to move all the .action files to a location that you then disallow.


robots.txt blocking crawlers from accessing page [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I'm trying to find out how to block crawlers from accessing links like this:
site.com/something-search.html
I want to block all /something-*
Can someone help me?
User-agent: *
Disallow: /something-
This blocks all URLs whose path starts with /something-, for example for a robots.txt accessible from http://example.com/robots.txt:
http://example.com/something-
http://example.com/something-foo
http://example.com/something-foo.html
http://example.com/something-foo/bar
…
The following URLs would still be allowed:
http://example.com/something
http://example.com/something.html
http://example.com/something/
…
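The prefix behaviour described above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper name is_allowed is made up for the example, and the standard-library parser is one implementation among many, so other crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /something-
"""

def is_allowed(url, user_agent="*"):
    """Return True if the rules in ROBOTS_TXT let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

for url in (
    "http://example.com/something-foo.html",  # path starts with /something-
    "http://example.com/something.html",      # path does not start with /something-
):
    print(url, is_allowed(url))
```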
In your robots.txt
User-agent: *
Disallow: /something-(1st link)
.
.
.
Disallow: /something-(last link)
Add an entry for each page that you don't want to be seen. Note that Disallow takes a path, not a hostname.
Though regular expressions are not allowed in robots.txt, some intelligent crawlers can understand them!
Have a look here.

Best way to prevent Google from indexing a directory [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I've researched many methods on how to prevent Google/other search engines from crawling a specific directory. The two most popular ones I've seen are:
Adding it into the robots.txt file: Disallow: /directory/
Adding a meta tag: <meta name="robots" content="noindex, nofollow">
Which method would work the best? I want this directory to remain "invisible" from search engines so it does not affect any of my site's ranking.
In other words, I want this directory to be neutral/invisible and "just there." I don't want it to affect any ranking. Which method would be the best to achieve this?
Robots.txt is the way to go for this.
According to Google, you only use the meta tag if you don't have rights to create/edit the robots.txt file.

CANONICAL - Duplicate page issue [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
Can someone help me with this problem?
Currently Google reports that these two links are duplicates:
http://www.ozkidsactivities.com/n/jules-pony-rides-&-mobile-animal-farm/ozkids-36?activityId=1218
http://www.ozkidsactivities.com/n/jules-pony-rides-and-mobile-animal-farm/ozkids-36?activityId=1218
But we already include the canonical tag:
<link rel="canonical" href="/n/jules-pony-rides-and-mobile-animal-farm/ozkids-36?activityId=1218" />
Is there a problem with the relative path?
Thanks in advance!
red,
Canonical URL tags can reference a relative path (see Google's guidelines here - http://googlewebmastercentral.blogspot.co.uk/2009/02/specify-your-canonical.html). However, I'd suggest that it's better and safer to use the absolute URL (i.e., including the protocol and the full hostname). Many websites are accessible under several hostnames (alternative domains, test/development environments with exposed URLs, etc.), and a relative canonical resolves against whichever hostname the page was fetched from. Referencing the correct absolute URL avoids incorrect canonicalisation if/when search engines discover those other URLs.
It looks like you've already fixed the problem, though, and also solved it another way by redirecting the ampersand URL to the 'and' URL. Good work!
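To see why a relative canonical can point at the wrong host, here is a small Python sketch that resolves the href from the question against two page URLs using the standard urllib.parse.urljoin. The staging hostname is a hypothetical example, not something from the question.

```python
from urllib.parse import urljoin

# The relative href from the canonical tag in the question.
CANONICAL_HREF = "/n/jules-pony-rides-and-mobile-animal-farm/ozkids-36?activityId=1218"

def resolve_canonical(page_url, href=CANONICAL_HREF):
    """Resolve a canonical href the way a relative link is resolved against its page."""
    return urljoin(page_url, href)

live = resolve_canonical(
    "http://www.ozkidsactivities.com/n/jules-pony-rides-&-mobile-animal-farm/ozkids-36?activityId=1218"
)
# staging.example.com is an assumed alternative hostname serving the same page.
staging = resolve_canonical(
    "http://staging.example.com/n/jules-pony-rides-&-mobile-animal-farm/ozkids-36?activityId=1218"
)
print(live)
print(staging)
```

The same tag yields a canonical on whichever host served the page, which is the risk an absolute URL removes.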

Should the sitemap be disallowed in robots.txt? And robots.txt itself? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
This is a very basic question, but I can't find a direct answer anywhere online. When searching for my website on Google, sitemap.xml and robots.txt are returned as search results (amongst more useful results). To prevent this, should I add the following lines to robots.txt?
Disallow: /sitemap.xml
Disallow: /robots.txt
This won't stop search engines accessing the sitemap or robots file?
Also/instead, should I use Google's URL removal tool?
You can't stop the crawler from reading robots.txt, because it's a chicken-and-egg situation: the crawler has to fetch the file to learn what it may not fetch. However, if you aren't otherwise telling Google and other search engines where your sitemap is, you could lose some indexing weight by denying access to sitemap.xml.
Is there a particular reason why you would want to not have users be able to see the sitemap?
I actually do this, which is specific to the Google crawler:
Allow: /
# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
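A quick way to confirm that a file like this both advertises the sitemap and leaves it crawlable is Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later). This is a sketch only: the "User-agent: Googlebot" line is an assumption, since the snippet above does not show which user-agent line precedes "Allow: /".

```python
from urllib.robotparser import RobotFileParser

# "User-agent: Googlebot" is assumed; the answer above omits that line.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
"""

def read_robots(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = read_robots(ROBOTS_TXT)
sitemaps = parser.site_maps()  # list of Sitemap: URLs, or None if there are none
sitemap_allowed = parser.can_fetch("Googlebot", "http://www.mysite.com/sitemap.xml")
print(sitemaps, sitemap_allowed)
```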

Where to put robots.txt file? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Where should I put robots.txt?
domainname.com/robots.txt
or
domainname/public_html/robots.txt
I placed the file in domainname.com/robots.txt, but it doesn't open when I type this into the browser.
Screenshot: http://shup.com/Shup/358900/11056202047-My-Desktop.png
Where the file goes in your filesystem depends on what host you're using, so it's hard for us to give a specific answer about that.
The best description is: put it wherever the index.html (or index.php or whatever) file is that represents your homepage. If that's domainname/public_html/index.html, for example, put it in domainname/public_html/robots.txt.
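Whatever the filesystem location turns out to be, crawlers request the file at the root of the host. A minimal Python sketch of that mapping follows; the function name robots_txt_url is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme and host, root path)."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop any query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://example.com/fldr/page.html?x=1"))
```

If the printed URL does not load in a browser, the file is not in the directory the web server treats as the site root.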
I think the better way to describe it is to have it in the root web folder of your domain, so that it is served at http://example.com/robots.txt. You can also put your sitemap.xml in the root, or refer to it with a Sitemap: http://example.com/fldr/smap.xml line in your robots.txt.
Don't forget: you can use Google Webmaster Tools to check that you haven't restricted anything you didn't mean to (you also get to see queries and links, woohoo!).
Suggestion: I'd consider using <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"> if possible. The page won't show up in Google's index, but you will still earn link juice for the links on the page. A robots.txt directive, by contrast, can leave a bare URL with no description in the SERPs (it ranks because of anchor text), and the page loses all the value of the links pointing to it because it is blocked by robots.txt.
In the root of your web directory (where you put the files that show up on your website)
In this case you should put it in domainname/public_html/robots.txt, as the public_html folder is where your index file will be.