Is Googlebot indexing links in HTML comments?

I got a huge number of NOT FOUND errors in Google Webmaster Tools. It looks like the links are coming from a section of code in the footer that was placed inside an HTML comment.
All pages have the NOARCHIVE tag, so it's probably not a cache issue.
Has this happened to anyone else?

A quick Google (ironic, eh?) shows that while there is no official word on the subject, the general consensus (based on anecdotal and experimental evidence) is that Google will process everything, including content inside comment tags. This means that it will indeed index your links, even if they're in comments. However, it does not use that content as a source for keyword searches; i.e., anything in an HTML comment is not considered part of your page's visible content and is therefore not usable as part of search criteria.
HTML comments are meant to convey human-readable information about what your layout is doing, for example marking where a particular include begins in a page output by a PHP script. You shouldn't be using HTML comments to disable large chunks of code on your site. I suggest that you remove that content.
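For instance, a legitimate comment is just a short marker like this (the include name is hypothetical):
<!-- begin: footer include (footer.php) -->
<!-- end: footer include -->
Anything more substantial, such as retired markup full of links, is better deleted outright.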
If you don't want Google to follow a link, you can add rel="nofollow" to your hyperlink. You can also use robots.txt to specify directories or URL wildcards that you do not want Google to index.
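As a rough sketch of both approaches (the URL and paths are hypothetical):
<a href="http://www.example.com/old-page.html" rel="nofollow">old page</a>
And in robots.txt at the site root:
User-agent: *
Disallow: /footer-links/
Disallow: /old/*.html
Note that rel="nofollow" is applied per link, while robots.txt rules apply to whole URL patterns.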
References:
http://en.wikipedia.org/wiki/Nofollow
http://en.wikipedia.org/wiki/Robots.txt
http://www.webmasterworld.com/forum3/4270.htm
http://www.codingforums.com/archive/index.php/t-71686.html

If you are talking about links in comments between tags, I don't think Googlebot takes them into account, as stated in a couple of other discussions.
Regards.

Related

SEO - sitemap.xml providing explicit links which are not on the page as anchors

I have a site with a text input. The user types the name of a city, hits Enter, and is taken to that city's page.
My sitemap.xml looks like this:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/rome.html</loc></url>
<url><loc>http://www.example.com/london.html</loc></url>
<url><loc>http://www.example.com/newyork.html</loc></url>
<url><loc>http://www.example.com/paris.html</loc></url>
<url><loc>http://www.example.com/berlin.html</loc></url>
<url><loc>http://www.example.com/toronto.html</loc></url>
<url><loc>http://www.example.com/milan.html</loc></url>
<url><loc>http://www.example.com/edinburgh.html</loc></url>
<url><loc>http://www.example.com/nice.html</loc></url>
<url><loc>http://www.example.com/boston.html</loc></url>
...
</urlset>
My question is:
Will I be penalized (from an SEO point of view) because my links only appear in sitemap.xml instead of in a list of anchors in the HTML page?
Note: the anchor approach was excluded because I have about 5,000 listed cities.
You won't be penalised. Google themselves say the primary purpose of a sitemap is "a way to tell Google about pages on your site we might not otherwise discover."
https://support.google.com/webmasters/answer/156184?hl=en
You are rare in that you are using the sitemap correctly, to help Google find your pages.
Often SEOs just add one for the sake of it, rather than taking the time to use it to identify and fix potential crawling errors.
The only negative aspect for SEO I can think of is that PageRank will not flow between your pages if there is no direct link.
No, you will not be penalized. The sole purpose of sitemaps is to tell search engines where to find your content. That content may or may not be available through hyperlinks on your website.
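As an aside, a simple way to make sure search engines find the sitemap in the first place is to reference it from robots.txt (URL hypothetical):
Sitemap: http://www.example.com/sitemap.xml
The Sitemap: directive is recognized by all the major search engines.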

Editing the head element on an old blog platform on a post-by-post basis. Is this impossible or am I missing something?

Sorry for being a total rookie.
I am trying to help my professor implement this advice:
Either as a courtesy to Forbes or a favor to yourself, you may want to include the rel="canonical" link element on your cross-posts. To do this, on the content you want to take the back seat in search engines, you add a <link rel="canonical" href="…"> element in the head of the page. The URL should be for the content you want to be favored by search engines. Otherwise, search engines see duplicate content, grow confused, and then get upset. You can read more about the canonical tag here: http://www.mattcutts.com/blog/canonical-link-tag/. Have a great day!
The problem is I am having trouble figuring out how to edit the head element on a post-by-post basis. We are currently on a super old blogging platform (Movable Type 3.2 from 2005), so maybe it is not possible. But I'd like to know if that is likely the reason, so I'm not missing out on a workaround.
If anyone could point me in the right direction, I would greatly appreciate it!
Without knowing much about your installation, I'll give a general description, and hopefully it matches what you see and helps.
In Movable Type, each blog has a "Design" section where you can see and edit the templates for the blog. On this page, the templates that are published once are listed under "Index Templates," and the templates published multiple times, once per entry, per category, etc., are listed under "Archive Templates."
There is probably an archive template called "Entry" (it may have been renamed) publishing to a path like category/sub-category/entry-basename.php. This is the main template that publishes each entry. Click on it to open the template editor.
This template could be an entire HTML document, or it might have includes that look like <MTInclude module=""> or <$mt:Include module=""$> (MT supports varying tag styles).
You may find there is an included module that contains the <head> content, or it might just be right in that template. To "follow" the includes and see those templates, there should be links next to the included templates.
Once you find the <head> content, you can add a canonical link tag like this:
<mt:IfArchiveType type="Individual">
<mt:If tag="EntryPermalink">
<link rel="canonical" href="<$mt:EntryPermalink$>" />
</mt:If>
</mt:IfArchiveType>
Depending on your needs, you might want to customize this to output a specific URL structure for other types of content, like category listings. The above will just take care of telling search engines the preferred URL for each entry.
@Charlie: maybe I'm missing something, but your solution basically places a canonical link on each entry pointing to… itself, which is a no-no for search engines (the link should point to another page that's considered the canonical one).
@user2359284: you need a way to define the canonical entry for those posts that need this link. As Shmuel suggested, either reuse an unused field or use a custom-fields plugin. Then you simply add that link in the header of the proper archive template that outputs your notes. Assuming the Entry template includes the same header as other templates, and, say, you're using the Keywords field to store the URL, the following code should work (the mt:IfArchiveType test simply ensures it's output in the proper context, which you don't need if your Entry template has its own code for the header):
<mt:IfArchiveType type="Individual">
<link rel="canonical" href="<$mt:EntryKeywords$>" />
</mt:IfArchiveType>

Use of canonical tag in HTML

I read about canonical tags in HTML, and from what I understood they are used to help search engines recognize which copy is the original content. I have articles on my recently created blog, which I have also posted on certain other popular websites. On those websites I added a link back to my original blog post with the canonical tag. But my blog page is still not visible in search engines (the other websites do show my article). Before I posted to the other websites, my articles were indexed on Google and appeared on the first page, so I assume there is no problem with my SEO in general.
Can someone please suggest a method where my original blog gets higher preference for the content?
You can use cross-domain canonical tags.
So if you have duplicated content on other domains, you can use the canonical tag on those pages, pointing back to the original page on your site.
This is a great way to deal with syndicated content; of course, you would need code-level access on those other websites so you can implement the canonical tag.
More info below
http://googlewebmastercentral.blogspot.com/2009/12/handling-legitimate-cross-domain.html
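As a rough sketch, the syndicated copy on the other domain would carry a link element like this in its head section (URL hypothetical):
<link rel="canonical" href="http://myblog.example.com/original-article.html" />
The tag goes on the duplicate page and points at the original, not the other way around.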
Don't just copy-paste your articles everywhere on the internet; that will not do you any good. After writing a good article, go to other sites and write something different about it, such as what your article covers and how it helps the reader, so that people and websites come to your website to read the article itself. For this you don't need a canonical tag.
If you copy-paste articles to other websites, it will only create duplicate-content issues and harm your SEO efforts.
No, a canonical tag is not required for your blog in this case.
A canonical tag is for when Google sees the same page under different URLs.
The first thing is that submitting your article to different websites will not give you any ranking benefit. If you write good, quality content, post it on only one website; if you post it on several sites, Google will treat it as duplicate content. It's better to share your published blog link on social media, and to do social bookmarking and microblogging. After that, you don't need a canonical tag.
As @moobot said, you can indeed use a cross-domain canonical tag to let Google know about the original source of the content. How exactly are you adding the canonical on the other domains?
The canonical link should be in the head section of the HTML code. If you're adding it yourself somewhere in the body tag, that's not going to do you any good.
Check out this article for some other common mistakes with the canonical tag
http://googlewebmastercentral.blogspot.nl/2013/04/5-common-mistakes-with-relcanonical.html
@metadice mentioned that copying your content all over the web isn't good for your SEO, and I agree completely. If you are doing this for some extra backlinks or something, I would recommend that you stop.
Hope my answer will help someone who has this same question.

Should I put all my links inside my sitemap?

Normally I put only important links in my sitemap; right now there are about 3,985 of them, and Google has indexed 3,501.
But the total number of my links is over 100,000, and each link has an image that I show to my users.
So, should I put all my links, including my images, inside my sitemap?
You are on the right path. Only put important links in your sitemap file. For more information, check out the Google help page on the topic.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156184&from=40318&rd=1
I would also check out the following link, which discusses sitemaps with an answer from a Google employee:
https://webmasters.stackexchange.com/questions/30186/are-there-any-clear-indicators-that-my-sitemap-file-is-beneficial
Put every link you want search engines to crawl and index in your sitemap. That's the whole purpose of XML sitemaps: to tell search engines about your pages, and about your images as well. 100,000 links and images is not a lot at all, so don't worry that Google will ignore your sitemap or be overwhelmed by it.
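For the images, a sitemap entry can reference them with Google's image sitemap extension; a minimal sketch (URLs hypothetical):
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>http://www.example.com/photo-page.html</loc>
<image:image>
<image:loc>http://www.example.com/images/photo.jpg</image:loc>
</image:image>
</url>
</urlset>
Also note that a single sitemap file is capped at 50,000 URLs, so with 100,000+ links you would split them across several files and list those in a sitemap index file.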

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone numbers, etc.) that they want visible on the site, but would rather not be google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done; Google either takes the entire page or none of it.
You could do some sneaky tricks, though, such as putting the part of the page you don't want indexed in an iframe and using robots.txt to ask Google not to crawl that iframe's source page, as sketched below.
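A minimal sketch of that trick, assuming the sensitive snippet is moved to a separate page at /noindex/ticker.html (paths hypothetical):
<!-- main page: the ticker is pulled in via an iframe instead of being inline -->
<iframe src="/noindex/ticker.html" width="400" height="60"></iframe>
And in robots.txt at the site root:
User-agent: *
Disallow: /noindex/
Bear in mind robots.txt only stops crawling; the iframe's URL itself can still show up in results if it is linked from elsewhere.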
In short, NO - unless you use cloaking, which is discouraged by Google.
Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to the section "Excluding Unwanted Text from the Index":
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
I found a useful resource for marking certain duplicate content so that search engines do not index it:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP address using PHP or ASP. Then serve the IP addresses on that list the version of the page you wish to have indexed. In that search-engine-friendly version of the page, use the canonical link tag to tell the search engine which version of the page you prefer.
This way, only the content you wish to be indexed gets indexed, while regular visitors see the full page at the same address. This method will not get you blocked by the search engines and is completely safe.
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and listing the portions you don't want indexed, such as wp-admin, or a particular post or page. Before creating it, check your site's existing robots.txt, for example at www.yoursite.com/robots.txt. A sketch follows below.
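A minimal robots.txt sketch along those lines (paths hypothetical; note this excludes whole URLs, not parts of a page):
User-agent: *
Disallow: /wp-admin/
Disallow: /private-post.html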
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page;
(b) detect the browser used;
(c) if it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta tags for bots, and there's also robots.txt, with which you can restrict access to certain directories.
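For completeness, the page-level meta tag looks like this; it applies to the whole page rather than part of it:
<meta name="robots" content="noindex, nofollow">
Placed in the page's head, it asks compliant bots not to index the page or follow its links.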