Using noindex and nofollow to avoid duplicate content penalization


Scenario:
I own a website with original content, but to support some categories I use Creative Commons licensed content, which is, of course, duplicate content.
Question:
If I want to avoid penalization for duplicate content, are these statements true?
I should mention the original author to be a fair human being.
I must use meta noindex to keep robots from fetching the content.
I must use a canonical URL to mention the original content and its author.
I don't need to use nofollow meta along with noindex, because it has other purposes.
I don't have to use rel="nofollow" on incoming links inside my site that point to the duplicate content, because it won't be indexed anyways, given the noindex meta tag.
I did my research and that is what I got from it. But I am not sure about this, and I would like to understand it before applying anything at all.
Thank you.

In order to avoid the penalization for duplicate content, you can of course use meta noindex and rel="nofollow". Here is the syntax:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
This tells robots not to index the content of a page, and/or not scan it for links to follow.
There are two important considerations when using the robots <META> tag:
Robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention.
The NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.
Don't confuse this NOFOLLOW with the rel="nofollow" link attribute.
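To check what a page actually declares, here is a minimal Python sketch that reads the directives out of a robots <META> tag using the standard library's html.parser. The class and function names (RobotsMetaParser, robots_directives) are made up for this illustration, and it only shows what a well-behaved robot could read from the page, not what any particular search engine does with it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of lowercase robots directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(sorted(robots_directives(page)))
```

The tag is matched case-insensitively, so the uppercase form shown in the answer and a lowercase <meta name="robots"> are treated the same.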

Related

noindex follow in Robots.txt

I have a wordpress website which has been indexed in search engines.
I have edited Robots.txt to disallow certain directories and webpages from search index.
I only know how to use allow and disallow, but don't know how to use the follow and nofollow in Robots.txt file.
I read somewhere while Googling on this that I can have webpages that won't be indexed in Google but will be crawled for pageranks. This can be achieved by disallowing the webpages in Robots.txt and use follow for the webpages.
Please let me know how to use follow and nofollow in Robots.txt file.
Thanks
Sumit

a.) The follow/nofollow and index/noindex rules do not belong in robots.txt (which sets general site rules) but in an on-page meta robots tag (which sets the rules for that specific page).
More info about Meta-Robots
b.) Google won't crawl the Disallowed pages, but it can still list them in the SERPs (using info from inbound links or website directories like Dmoz).
Having said that, there is no PR value you can gain from this.
More info about Googlebot's indexing behavior

Google actually does recognize the Noindex: directive inside robots.txt. Here's Matt Cutts talking about it: http://www.mattcutts.com/blog/google-noindex-behavior/
If you put "Disallow" in robots.txt for a page already in Google's index, you will usually find that the page stays in the index, like a ghost, stripped of its keywords. I suppose this is because they know they won't be crawling it, and they don't want the index containing bit-rot. So they replace the page description with "A description for this result is not available because of this site's robots.txt – learn more."
So, the problem remains: How do we remove that link from Google since "Disallow" didn't work? Typically, you'd want to use meta robots noindex on the page in question because Google will actually remove the page from the index if it sees this update, but with that Disallow directive in your robots file, they'll never know about it.
So you could remove that page's Disallow rule from robots.txt and add a meta robots noindex tag to the page's header, but now you've got to wait for Google to go back and look at a page you told them to forget about.
You could create a new link to it from your homepage in hopes that Google will get the hint, or you could avoid the whole thing by just adding that Noindex rule directly to the robots.txt file. In the post above, Matt says that this will result in the removal of the link.
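The catch described above (a Disallow rule keeps the crawler from ever seeing the page's noindex tag) can be checked with Python's urllib.robotparser. This is only a sketch: the helper name noindex_can_be_seen, the /old-page path and the example.com host are assumptions for illustration, and the standard-library parser does not model Google's own handling of robots.txt.

```python
from urllib.robotparser import RobotFileParser


def noindex_can_be_seen(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots.txt lets the crawler fetch the page.

    Only a page that may be fetched can have its meta robots noindex read.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one file blocks the page, the other blocks nothing.
blocked_rules = "User-agent: *\nDisallow: /old-page\n"
open_rules = "User-agent: *\nDisallow:\n"
url = "http://www.example.com/old-page"

print(noindex_can_be_seen(blocked_rules, url))
print(noindex_can_be_seen(open_rules, url))
```

With the Disallow rule in place the page cannot be fetched, so a noindex tag added to it would go unread until the rule is removed.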

No, you can't.
You can set which directories you want to block and for which bots, but you can't set nofollow in robots.txt.
Use the robots meta tag on the pages to set nofollow.

Is googlebot indexing links in html comments?

I got a huge number of NOT FOUND links in Google Webmaster Tools. It looks like the links are coming from a section of code in the footer that was put in an HTML comment.
All pages have the NOARCHIVE tag, so it's probably not a cache issue.
Has this happened to anyone?

A quick Google (ironic, eh?) shows that whilst there is no official word on the subject, the general consensus (through anecdotal and experimental evidence) is that Google will process everything, including content in comment tags. This means that it will indeed index your links, even if they're in comment tags. However, it does not use the content as a source for keyword searches, i.e. anything in an HTML comment is not considered to be part of your page's visible content and is therefore not usable as part of search criteria.
HTML comments are designed to simply specify human-readable information about what your layout is doing, for example signifying where a particular include begins in a page outputted by a PHP script. You shouldn't be using HTML comments to remove large chunks of code in your site. I suggest that you remove the content.
If you don't want Google to follow a link, you can add rel="nofollow" to your hyperlink. You can also use robots.txt to specify directories or URL wildcards that you do not want Google to index.
References:
http://en.wikipedia.org/wiki/Nofollow
http://en.wikipedia.org/wiki/Robots.txt
http://www.webmasterworld.com/forum3/4270.htm
http://www.codingforums.com/archive/index.php/t-71686.html
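If you want to find which commented-out links are still sitting in your markup before removing them, a rough Python sketch like the following could help. It uses html.parser's handle_comment hook plus a simple regular expression for href values. The names (CommentLinkFinder, links_in_comments) and the sample footer are invented for this example, and the regex is deliberately simple, so treat it as a starting point rather than a complete HTML link extractor.

```python
import re
from html.parser import HTMLParser

# Simple pattern for quoted href values; good enough for a quick audit.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


class CommentLinkFinder(HTMLParser):
    """Collects href values that appear inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->
        self.links.extend(HREF_PATTERN.findall(data))


def links_in_comments(html):
    """Return the list of hrefs found inside comments, in document order."""
    finder = CommentLinkFinder()
    finder.feed(html)
    return finder.links


footer = """
<div id="footer">
  <a href="/contact">Contact</a>
  <!-- <a href="/old-page">Old page</a> <a href='/retired'>Retired</a> -->
</div>
"""
print(links_in_comments(footer))
```

Only the two links inside the comment are reported; the live /contact link is ignored.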

If you are talking about links in comments between tags, I don't think they are taking effect with Google Bots as stated there and there.
Regards.

How to tell search engines NOT to look at this specific link?

Suppose I have a link in the page My Messages, which on click will display an alert message "You must login to access my messages".
Maybe it's better to just not display this link when the user is not logged in, but I want "My Messages" to be visible even then.
I think this link is user-friendly, but search engines will get redirected to the login page, which I think is bad for SEO. Or is it fine?
I thought of displaying My Messages as normal text (not as a link) and then wrapping it in a link tag using JavaScript/jQuery. Is this solution good or bad? Any other ideas? Thank you.

Try to create a robots.txt file and write:
User-agent: *
Disallow: /mymessages
This will keep SEO bots out of that folder
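You can sanity-check a rule like this before deploying it. The sketch below feeds the same two lines to Python's urllib.robotparser; the example.com host and the is_allowed helper are assumptions for illustration only. Note that the match is a plain prefix match, so everything whose path starts with /mymessages is covered.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /mymessages
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="*"):
    """Return True if a compliant robot may fetch the given path."""
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


for path in ("/mymessages", "/mymessages/inbox", "/about"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

This only tells you what a robot that honours robots.txt would do; it says nothing about whether the URL can still show up in an index.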

Use a robots.txt file to tell search engines which pages they should not index.
Using nofollow to block access to a page is erroneous - this is not what nofollow is for. This attribute was designed to allow you to place a link in a page without conferring any weight or endorsement on the link. In other words, it marks a link that search engines should not regard as significant for page-ranking algorithms. It does not mean "do not index this page" - just "don't follow this particular link to that page".
Here's what Google has to say about nofollow:
...However, the target pages may still appear in our index if other
sites link to them without using nofollow or if the URLs are submitted
to Google in a Sitemap. Also, it's important to note that other search
engines may handle nofollow in slightly different ways.
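To see which links on a page actually carry rel="nofollow", something like this Python sketch can be used. The class name LinkAuditor, the audit_links helper and the sample markup are all made up for the example; it only reports what the markup says and makes no claim about how a given search engine treats each link.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Records each <a href> and whether its rel attribute includes nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if "href" not in attr_map:
            return
        # rel is a space-separated token list and is case-insensitive.
        rel_tokens = attr_map.get("rel", "").lower().split()
        self.links.append((attr_map["href"], "nofollow" in rel_tokens))


def audit_links(html):
    """Return a list of (href, has_nofollow) tuples in document order."""
    auditor = LinkAuditor()
    auditor.feed(html)
    return auditor.links


page = """
<a href="/mymessages" rel="nofollow">My Messages</a>
<a href="/about">About</a>
<a href="http://www.example.com/" rel="external NOFOLLOW">Partner</a>
"""
for href, has_nofollow in audit_links(page):
    print(href, "nofollow" if has_nofollow else "followed")
```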

One way of keeping the URL from affecting your rank is setting the rel attribute of your link:
<a href="/mymessages" rel="nofollow">My Messages</a>
Another option is robots.txt, that way you can disallow the bots from the URL entirely.

You might want to use robots.txt to exclude /mymessages. This will also prevent engines which have already visited /mymessages from visiting it again.
Alternatively, add the following to the top of the /mymessages script:
<meta name="robots" content="noindex" />

If you want to tell search engines not to follow a particular link, use rel="nofollow".
It is a way to tell search engines and bots not to follow this link.
Google will then not crawl that link and will not transfer PageRank or anchor text across it.

Meta tag vs robots.txt

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page?
Are there any issues in using both the meta tags and the robots.txt?
*Eg: <META name="robots" content="index, follow">

There is one significant difference. According to Google, they will still index a page behind a robots.txt Disallow if the page is linked to from another site.
However, they will not if they see a noindex meta tag:
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
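The "response header" mentioned in the quote is X-Robots-Tag. As a rough sketch of how it could be sent from a Python web app, here is a small WSGI middleware that adds "X-Robots-Tag: noindex" to responses under chosen path prefixes. The function names, the /private prefix and the tiny hello_app are assumptions for illustration; a real site would more likely set this header in its web server or framework configuration.

```python
def add_noindex_header(app, noindex_prefixes):
    """WSGI middleware: add "X-Robots-Tag: noindex" under the given path prefixes."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if any(path.startswith(prefix) for prefix in noindex_prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def hello_app(environ, start_response):
    """Hypothetical application used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


app = add_noindex_header(hello_app, ["/private"])


def response_headers(path):
    """Call the app directly with a minimal environ and return its headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response)
    return captured["headers"]


print(response_headers("/private/report"))
print(response_headers("/public"))
```

As with the meta tag, the header is only seen if the crawler is allowed to fetch the URL, so it should not be combined with a robots.txt Disallow for the same path.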

Both are supported by all crawlers which respect webmasters wishes. Not all do, but against them neither technique is sufficient.
You can use robots.txt rules for general things, like disallowing whole sections of your site. If you say Disallow: /family then all URLs starting with /family are not crawled by a compliant crawler.
A meta tag can be used to disallow a single page. Pages disallowed by meta tags do not affect sub pages in the page hierarchy. If you have a meta disallow tag on /work, it does not prevent a crawler from accessing /work/my-publications if there is a link to it on an allowed page.

Robots.txt IMHO.
The Meta tag option tells bots not to index individual files, whereas Robots.txt can be used to restrict access to entire directories.
Sure, use a Meta tag if you have the odd page in indexed folders that you want skipped, but generally, I'd recommend you put most of your non-indexed content in one or more folders and use robots.txt to skip the lot.
No, there isn't a problem in using both - if there is a clash, in general terms, a deny will overrule an allow.

meta is superior.
In order to exclude individual pages from search engine indices, the noindex meta tag is actually superior to robots.txt.

There is a big difference between the meta robots tag and robots.txt.
In robots.txt, we tell crawlers which pages to crawl and which to exclude, but we do not ask them to keep the excluded pages out of the index.
With the meta robots tag, we can ask search engine crawlers not to index the page. The tag to use for this is:
<meta name="robots" content="noindex">
OR
<meta name="robots" content="follow, noindex">
Here name="robots" addresses all crawlers; a specific crawler name such as "googlebot" can be used instead.
In the second meta tag, I have asked the robot to follow the links on the page but not to index the page in the search engine.

Here is what I know about them and where each one applies. Both can be used for blocking content.
The difference between the two is:
The meta robots tag can block a single page with a piece of code pasted into the head of that page. Its content value tells the search engine which behavior we want (index, noindex, follow, nofollow).
With a robots.txt file you can block the whole website.
Here is the example of meta robot:
<meta name="robots" content="index, follow">
<meta name="robots" CONTENT="all">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, nofollow" />
<meta name="robots" content="noindex, nofollow" />
Here is the example of Robots.txt file:
Allowing crawlers to crawl the whole website:
User-agent: *
Disallow:
Disallowing crawlers from crawling the whole website:
User-agent: *
Disallow: /

I would probably use robots.txt over the meta tag. Robots.txt has been around longer, and might be more widely supported (But I am not 100% sure on that).
As for the second part, I think most spiders will take whatever is the most restrictive setting for a page - if there is a disparity between the robots.txt and meta tag.

Robots.txt is good for pages which consume a lot of your crawl budget, like internal search or filters with infinite combinations. If you allow Google to crawl yoursite.com/search=lalalala it will waste your crawl budget.

You want to use 'noindex,follow' in a robots meta tag, rather than robots.txt, because it will allow the link juice to pass through. It is better from a SEO perspective.

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page?
Answer: Both are important to use; they serve different purposes. The robots file is used to include or exclude pages or root files from a spider's crawl, while meta tags describe an individual page: its niche and the content within it.
Are there any issues in using both the meta tags and the robots.txt?
Answer: Both should be implemented on a site so that search engine spiders/crawlers can index or de-index the site's URLs.
Read more here about working of a search engine spiders >>https://www.playbuzz.com/alexhuber10/how-search-and-spider-engines-work

You can use either one, but if your website has a lot of pages, robots.txt is easier and saves time.

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag, but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow", is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity, what is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note our Search results pages and paging links are not bot-friendly, in that they are not re-written and have a ?name=value query string, but from what I've seen spiders no longer just abort when they see the '?' as the results pages ARE getting indexed with decent page rank.

To be honest, you are looking at nofollow wrong. Chances are the search spiders, especially Google, Yahoo, and MSN, are already crawling the nofollow pages, because they still have to hit those pages to see if they have a noindex.
The real problem is that nofollow doesn't actually mean "don't follow"; it just means "don't pass on my reputation to this link". So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and robot commands on links will not affect performance, because they are already hitting your site. To confirm this, just look at your HTTP server log.
So my vote is that you will not see any problem with removing the robot limits.
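For the log check, a quick Python sketch along these lines would count hits per crawler. It assumes the server writes the common "combined" log format, where the user agent is the last quoted field, and the sample lines below are invented for the example rather than taken from a real log. The bot names are substrings to look for in the user agent; adjust the list to whatever crawlers you care about.

```python
import re
from collections import Counter

# In the combined log format the line ends with: "referer" "user-agent"
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')
BOT_NAMES = ("Googlebot", "Slurp", "msnbot", "bingbot")


def count_bot_hits(log_lines):
    """Count requests per known crawler based on the user-agent field."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT_AT_END.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in BOT_NAMES:
            if bot.lower() in user_agent:
                hits[bot] += 1
                break
    return hits


# Made-up sample lines, only to show the expected shape of the input.
sample_log = [
    '66.249.66.1 - - [10/Oct/2008:13:55:36 +0000] "GET /search?page=2 HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2008:13:56:01 +0000] "GET /search?page=1 HTTP/1.1" 200 1999 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
    '66.249.66.1 - - [10/Oct/2008:13:57:12 +0000] "GET /search?page=3 HTTP/1.1" 200 2410 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(count_bot_hits(sample_log))
```

User-agent strings can be spoofed, so this gives an indication of crawler traffic rather than proof of who is hitting the pages.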

I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.

Google bots are pretty intelligent about not traversing an entire database of dynamically-generated pages, as long as the URLs give some hint that they are dynamic (i.e. file extension of .asp or .jsp, etc. and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", then the bots have a harder time determining whether or not it's a static page they are reading or a dynamically generated page. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.
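A sitemap is just an XML file listing URLs, so it is easy to generate. Here is a minimal Python sketch using xml.etree.ElementTree and the sitemaps.org namespace; the build_sitemap name and the example URLs (based on the www.mysite.com host from the question) are assumptions. Keep in mind that a sitemap only suggests pages to crawl. It does not by itself stop bots from following paging links.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages you do want crawled, e.g. the first results page only.
pages = [
    "http://www.mysite.com/",
    "http://www.mysite.com/search?name=value&page=1",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```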
