I have been asked whether it is possible to restrict which search engines a site appears in. Here is an example: I have a website and I want it to show up only in the Yahoo, Google, Bing and Ask search engines. Is there any way to do this using meta tags, or by some other means?
thanks..
robots.txt? I think that is what you can use to restrict search engines: http://www.robotstxt.org/
You can control which search engines can access your site by two methods, as follows:
1. Robots Meta tags
2. Robots.txt file
Robots Meta Tags:
This tag tells a web crawler that the page it appears on should not be indexed or shown in search results.
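For example, placing this tag in a page's <head> keeps that page out of search results (a minimal sketch; adjust the directives to your needs):
<meta name="robots" content="noindex" />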
Robots.txt file:
This is a plain text file placed in the root of your website; it can also be submitted via the search engines' webmaster tools.
e.g.:
User-agent: *
Disallow: /
This example tells every crawler not to crawl any page on the site.
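To answer the original question - allowing only Yahoo, Google, Bing and Ask - a sketch of a robots.txt that permits those engines while blocking everything else might look like this. The user-agent names (Googlebot, Slurp, bingbot, Teoma) are the commonly documented crawler names for those engines, but verify them against each engine's own documentation:
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
An empty Disallow means "allow everything" for that crawler; the final group blocks all other bots.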
Related
Is it possible to help search engines by giving them a list of urls to crawl? It might be hard to make the site SEO friendly when using heavy AJAX logic. Let's say that the user chooses a category, then a sub-category and a product. It seems unnecessary to give categories and subcategories urls. But giving only products a url makes sense. When I see the url for the product, I can make the application navigate to that product. So, is it possible to use robots.txt or some other method to direct search engines to the urls I designate?
I am open to other suggestions if this somehow does not make sense.
Yes. What you're describing is called a sitemap -- it's a list of pages on your site which search engines can use to help them crawl it.
There are a couple of ways of formatting a sitemap, but by far the easiest is to list all the URLs in a text file available on your web site -- one per line -- and reference it in robots.txt like so:
Sitemap: http://example.com/sitemap.txt
Here's Google's documentation on the topic: https://support.google.com/webmasters/answer/183668?hl=en
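The sitemap.txt itself is nothing more than a list of fully-qualified URLs, one per line. For instance (these URLs are just placeholders):
http://example.com/
http://example.com/products/widget-1
http://example.com/products/widget-2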
I have a robots.txt file set up as such
User-agent: *
Disallow: /*
For a site that is all unique-URL based. Sort of like https://jsfiddle.net/: when you save a new fiddle it gives it a unique URL. I want all of my unique URLs to be invisible to Google. No indexing.
Google has indexed all of my unique URLs, even though it says "A description for this result is not available because of the site's robots.txt file. - learn more"
But that still sucks because all the URLs are there and clickable - so all the data inside is available. What can I do to 1) get these pages off Google and 2) stop Google from indexing these URLs?
Robots.txt tells search engines not to crawl the page, but it does not stop them from indexing the page, especially if there are links to the page from other sites. If your main goal is to guarantee that these pages never wind up in search results, you should use robots meta tags instead. A robots meta tag with 'noindex' means "Do not index this page at all". Blocking the page in robots.txt means "Do not request this page from the server."
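A minimal example, placed in the <head> of each page you want kept out of the index:
<meta name="robots" content="noindex" />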
After you have added the robots meta tags, you will need to change your robots.txt file to no longer disallow the pages. Otherwise, the robots.txt file would prevent the crawler from loading the pages, which would prevent it from seeing the meta tags. In your case, you can just change the robots.txt file to:
User-agent: *
Disallow:
(or just remove the robots.txt file entirely)
If robots meta tags are not an option for some reason, you can also use the X-Robots-Tag header to accomplish the same thing.
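For example, sending this HTTP response header with the page has the same effect as the meta tag (and also works for non-HTML resources such as PDFs):
X-Robots-Tag: noindex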
I have a wordpress website which has been indexed in search engines.
I have edited Robots.txt to disallow certain directories and webpages from search index.
I only know how to use allow and disallow, but don't know how to use the follow and nofollow in Robots.txt file.
I read somewhere while Googling on this that I can have webpages that won't be indexed in Google but will still be crawled so they can pass PageRank. This can be achieved by disallowing the webpages in Robots.txt and using follow for the webpages.
Please let me know how to use follow and nofollow in Robots.txt file.
Thanks
Sumit
a.) The follow/nofollow and index/noindex rules are not for robots.txt (which sets general site rules) but for an on-page meta robots tag (which sets the rules for that specific page).
More info about Meta-Robots
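For example, to get what the question describes - a page that is not indexed but whose links are still followed - you would put this in that page's <head> (a sketch, not site-specific):
<meta name="robots" content="noindex, follow" />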
b.) Google won't crawl the disallowed pages, but it can still index them in SERPs (using info from inbound links or web directories like DMOZ).
Having said that, there is no PageRank value you can gain from this.
More info about Googlebot's indexing behavior
Google actually does recognize the Noindex: directive inside robots.txt. Here's Matt Cutts talking about it: http://www.mattcutts.com/blog/google-noindex-behavior/
If you put "Disallow" in robots.txt for a page already in Google's index, you will usually find that the page stays in the index, like a ghost, stripped of its keywords. I suppose this is because they know they won't be crawling it, and they don't want the index containing bit-rot. So they replace the page description with "A description for this result is not available because of this site's robots.txt – learn more."
So, the problem remains: How do we remove that link from Google since "Disallow" didn't work? Typically, you'd want to use meta robots noindex on the page in question because Google will actually remove the page from the index if it sees this update, but with that Disallow directive in your robots file, they'll never know about it.
So you could remove that page's Disallow rule from robots.txt and add a meta robots noindex tag to the page's header, but now you've got to wait for Google to go back and look at a page you told them to forget about.
You could create a new link to it from your homepage in hopes that Google will get the hint, or you could avoid the whole thing by just adding that Noindex rule directly to the robots.txt file. In the post above, Matt says that this will result in the removal of the link.
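A sketch of what that would look like - note this Noindex directive was never an official standard, only Google was known to honor it, and the path here is hypothetical:
User-agent: Googlebot
Noindex: /page-to-remove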
No, you can't.
You can set which directories you want to block and for which bots, but you can't set nofollow via robots.txt.
Use the robots meta tag on the pages to set nofollow.
My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering urls with all possible permutations of additional parameters such as tags, search phrases, versions, dates etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster-tools Google spidered only about 150 of the 200 entries in the xml sitemap. It looks as if Google has not yet seen all of the content years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster-tools "website configuration" --> "url parameter" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
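For example, every parameterized variant of an article page would carry the same tag in its <head> (the URL is a placeholder):
<link rel="canonical" href="http://example.com/articles/my-article" />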
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have the search results pages include a robots meta tag saying noindex. This will help get them removed from Google's index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
Other measures:
In GWMT tell Google to ignore all those search parameters. Just a band aid but may help speed up the recovery.
Don't block the search page in the robots.txt file as this will block the robots from crawling and cleanly removing those pages already indexed. Wait till your index is clear before doing a full block like that.
Your search system is probably based on links (a tags) or GET-based forms rather than POST-based forms - this is why those pages got indexed. Switching them to POST-based forms should stop robots from trying to index those pages in the first place (see the sketch below). JavaScript or AJAX is another way to do it.
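To illustrate that last point, here is a sketch of the two form styles (the /search action and q parameter are hypothetical):
<!-- GET form: each search produces a crawlable URL like /search?q=term -->
<form method="get" action="/search">
  <input type="text" name="q" />
</form>

<!-- POST form: crawlers generally won't submit it, so no result URLs are created -->
<form method="post" action="/search">
  <input type="text" name="q" />
</form>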
Suppose I have a "My Messages" link on the page which, on click, displays an alert saying "You must login to access my messages".
Maybe it's better to just not display this link when the user is not logged in, but I want "My Messages" to be visible even if the user is not logged in.
I think this link is user-friendly, but search engines will get redirected to the login page, which I think is... bad for SEO? Or is it fine?
I thought of keeping "My Messages" displayed as normal text (not as a link), then wrapping it with a link tag using javascript/jquery. Is this solution good or bad? Other ideas, please? Thank you.
Try to create a robots.txt file and write:
User-agent: *
Disallow: /mymessages
This will keep SEO bots out of that folder
Use a robots.txt file to tell search engines which pages they should not index.
Using nofollow to block access to a page is erroneous - that is not what nofollow is for. The attribute was designed to allow you to place a link on a page without conferring any weight or endorsement of the link. In other words, it's not a link that search engines should regard as significant for page-ranking algorithms. It does not mean "do not index this page" - just "don't follow this particular link to that page".
Here's what Google have to say about nofollow
...However, the target pages may still appear in our index if other sites link to them without using nofollow or if the URLs are submitted to Google in a Sitemap. Also, it's important to note that other search engines may handle nofollow in slightly different ways.
One way of keeping the URL from affecting your rank is setting the rel attribute of your link:
<a href="/mymessages" rel="nofollow">My Messages</a>
Another option is robots.txt; that way you can disallow the bots from the URL entirely.
You might want to use robots.txt to exclude /mymessages. This will also prevent engines which have already visited /mymessages from visiting it again.
Alternatively, add the following to the top of the /mymessages script:
<meta name="robots" content="noindex" />
If you want to tell search engines not to follow a particular link, then use rel="nofollow".
It is a way to tell search engines and bots not to follow that link.
Google will then not crawl that link and will not transfer PageRank or anchor text across it.
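A minimal example (the href is a placeholder):
<a href="http://example.com/page" rel="nofollow">Link text</a>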