Is rel=self the correct rel tag to use for forum permalinks?

I have been building a forum from scratch with my friends just for fun, and we're starting to see bots and scrapers go by. The problem we're having is that you can load a page /post/1 with four replies, and each reply includes a little permalink to itself /reply/1#reply-1. If I am on /post/1 and navigate to /reply/1, I'll end up right back where I started, just with the anchor to the reply. But! Scrapers have no idea this is the case, so they're opening every /post link and then following every /reply link, and it's causing performance issues, so I've been looking around SEO sites to try to fix it.
I've started using rel=canonical on the /reply page, to tell the bots they're all the same, but as far as I can tell that doesn't help me until the bot has already loaded the page, and thus I wind up with tons of traffic. Would it be correct to change my
<a href="/reply/1#reply-1">Permalink</a>
tags to
<a href="/reply/1#reply-1" rel="self">Permalink</a>
since they should be the same content? Or would this be misusing rel="self" and there's another, better rel tag I should be using instead?

The self link type is not defined for HTML (but for Atom), so it can’t be used in HTML5 documents.
The canonical link type is appropriate for your case (if you make sure that it always points to the correct page, in case the thread is paginated), but it doesn’t prevent bots from crawling the URLs.
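For example, the head of /reply/1 could point at the thread page like this (just a sketch; the absolute URL uses a placeholder domain):
<link rel="canonical" href="http://example.com/post/1">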
If you want to prevent crawling, no link type will help (not even the nofollow link type, but it’s not appropriate for your case anyway). You’d have to use robots.txt, e.g.:
User-agent: *
Disallow: /reply/
That said, you might want to consider changing the permalink design. I think it’s not useful (neither for your users nor for bots) to have such an architecture. It’s a good practice to have exactly one URL per document, and if users want to link to a certain post, there is no reason to require a new page load if it’s actually the same document.
So I would either use the "canonical" URL and add a fragment component (/post/1#reply-1, or what might make more sense: /threads/1#post-1), or (if you think it can be useful for your users) I would create a page that only contains the reply (with a link back to the full thread).
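For instance, with the first option each reply's permalink could be plain markup like this (paths borrowed from the suggestion above):
<a href="/threads/1#post-1">Permalink</a>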


Is the "google-site-verification" meta tag required for indexing?

I have been advised by an SEO consultant to add the "google-site-verification" meta tag to every page of my site. This is to make sure that my pages are indexed by Google.
However, I am reluctant to do this for a couple of reasons:
1) My site is already verified using an alternative method of verification - by hosting an HTML verification file on the server.
2) I recall reading an article indicating that this meta tag does not impact crawling or page rank.
I do have some pages that are not indexed.
An example is
http://www.contractsforgeeks.com/TechJobs/Florida/Tampa.aspx
But I am making the assumption that adding this meta tag will not help the page get indexed.
Is there any value in adding the site verification meta tag to each page instead of uploading a single HTML verification file?
For example, what happens if I accidentally delete the verification file from my site (some time after the site has already been verified)? Does it need to be re-verified, or is the verification process a one-time deal? In which case, it may be safer to include it in each page (even though it does not help indexing?)
One method is enough to verify your site. If you choose the HTML file method, you don't need to put the "google-site-verification" meta tag on every page.
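For reference, that meta tag goes in the head of a page and looks roughly like this (the content value is a placeholder; the real token is issued by Google in Webmaster Tools):
<meta name="google-site-verification" content="your-verification-token">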
Moreover, as you assumed, this meta tag doesn't help your site get indexed by Google. It doesn't impact crawling or PageRank.
If you want to see your site indexed, you can submit a sitemap.xml to Google Webmaster Tools and get more links from other sites pointing to yours.
And if you delete the verification HTML file from your site, you'll need to verify your site again; this process is not a one-time deal.
It does not help indexing. It does not help ranking. Its only purpose is to verify that you are the one claiming to be when registering at Google Webmaster Tools.
If you delete the verification, you'd need to verify your domain again. Otherwise it would still be possible to control a domain in GWT even though the owner had changed in the meantime.
If you need to argue against the use of the corresponding meta element, you could point out that it could actually lower your ranking -- of course this would have no real, measurable effect, only in theory! -- because Google prefers faster-loading pages.

How do the Facebook like button and Google +1 button deal with a redirected url? [duplicate]

I understand the og:url meta tag is the canonical URL for the resource in the Open Graph.
What strategies can I use if I wish to support 301 redirecting of the resource, while preserving its place in the Open Graph? I don't want to lose my likes because I've changed the URLs.
Is the best way to do this to store the original url of the content, and refer to that? Are there any other strategies for dealing with this?
To clarify - I have a page:
/page1, with an og:url of http://www.example.com/page1
I now want to move it to
/page2, using a 301 redirect to http://www.example.com/page2
Do I have any options to avoid losing the likes and comments other than setting the og:url meta to /page1?
Short answer, you can't.
Once the object has been created on Facebook's side its URL in Facebook's graph is fixed - the Likes and Comments are associated with that URL and object; you need that URL to be accessible by Facebook's crawler in order to maintain that object in the future. (note that the object becoming inaccessible doesn't necessarily remove it from Facebook, but effectively you'd be starting over)
What I usually recommend here is (with examples http://www.example.com/oldurl and http://www.example.com/newurl):
On /newurl, keep the og:url tag pointing to /oldurl
Add an HTTP 301 redirect from /oldurl to /newurl
Exempt the Facebook crawler from this redirect
Continue to serve the meta tags for the page on http://www.example.com/oldurl if the request comes from the Facebook crawler.
No need to return any actual content to the crawler, just a simple HTML page with the appropriate tags
Thus:
Existing instances of the object on Facebook will, when clicked, bring users to the correct (new) page via your redirect
The Like button on the (new) page will still produce a like of the correct object (but at the old URL)
If you're moving a lot of URLs around or completely rewriting your URL scheme you should use the new URLs for new articles/products/etc, but you'll need to keep the redirect in place if you want to retain likes, comments, etc on the older content.
This includes if you're changing domain.
The only problem here is maintaining the old URL -> new URL mapping somewhere in your code, but it's not technically difficult, just an additional thing to maintain in the future.
BTW, the Facebook crawler UA is currently facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
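A minimal sketch of steps 2-4, assuming a Python/Flask app and the placeholder /oldurl and /newurl paths from the answer above (the user-agent check is a plain substring match; the OG values are placeholders you'd fill from your own data):

from flask import Flask, redirect, request

app = Flask(__name__)

FACEBOOK_UA = "facebookexternalhit"  # token the Facebook crawler sends in its User-Agent

@app.route("/oldurl")
def old_url():
    ua = request.headers.get("User-Agent", "")
    if FACEBOOK_UA in ua.lower():
        # Serve only the Open Graph tags the crawler needs; no real content required.
        # (og:url/og:title values here are placeholders, not real data.)
        return (
            "<html><head>"
            '<meta property="og:url" content="http://www.example.com/oldurl">'
            '<meta property="og:title" content="Example article">'
            '<meta property="og:type" content="article">'
            "</head><body></body></html>"
        )
    # Everyone else gets the permanent redirect to the new location.
    return redirect("http://www.example.com/newurl", code=301)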
I'm having the same problem with my old sites. Domains are changing, admins want to change URLs for SEO, etc.
I came to the conclusion that it's best to have some sort of unique ID in the database just for Facebook, right from the beginning. For articles, for example, I have myurl.com/a/123, where 123 is the ID of the article.
The real URL is myurl.com/category/article-title. The article can then be put in a different category, renamed, etc., with extensive logic for 301 redirects behind it, but the basic Facebook identifier can stay the same forever.
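In markup terms, the page at the real URL simply keeps its og:url pointed at that stable identifier (URLs here are the hypothetical ones from this answer):
<meta property="og:url" content="http://myurl.com/a/123">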
Of course this is viable only when starting with a fresh site or when implementing fb comments for the first time.
Just an idea if you can plan ahead :) Let me know what you think.

Can search engines index pages generated by server side code?

I'm guessing a site like Stack Overflow doesn't keep an HTML file around for every question ever asked. Instead, server-side code creates the page every time a question is clicked on (I think). Is it possible for search engines to index every question on Stack Overflow, or would a page per question need to be kept in the directory so the search engine can crawl it?
Yes. Search engines can index dynamically generated pages no problem. In fact, from the search engine bot's perspective, it can't really even distinguish between a dynamically generated page and a static one.
You might be interested in the Dynamic URLs vs. static URLs post on the Official Google Webmaster Central Blog.
Yes it's perfectly possible - when a link is followed the server returns HTML just like any other web page. The only difference is that the server generated it, rather than a person.
As far as the client (be it a browser or search engine) is concerned, there is no difference between a server-generated page and a static file. They're virtually indistinguishable (depending on how the page is generated, it might be missing Last-Modified headers, etc). As such, yes, search engines can index generated pages without a problem.
That said, there is something to be said for giving them a hint. Using sitemaps, for example, gives a search engine a nice listing of all your pages, so it's less likely to miss them. More importantly, it can summarize last modified times, to focus the search engine's attention on what has changed recently. This isn't mandatory, but it does help - regardless of whether the pages are static HTML or generated.
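A minimal sitemap entry with a last-modified hint might look like this (the URL and date are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/questions/123</loc>
    <lastmod>2012-01-15</lastmod>
  </url>
</urlset>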
Any link that uses a GET can be followed by most crawlers. Anything that requires a POST will generally be ignored.
The mechanism for generating the page is irrelevant.
Yes, as long as it is not restricted by robots.txt or meta tags. A search engine requests the web page just like a normal user; no one has access to the server-side code (unless your site is hacked).
Search engines can see pretty much anything on a given Web page that isn't hidden behind client-side code (i.e., JavaScript).
So, if there's a URL that you can enter into your browser's address bar to get this page, and this page is linked to from somewhere, a search engine will find it and "see" the same content that you do. The fact that the page was generated dynamically by a server is irrelevant to a search engine, since what is sent to a browser upon requesting a URL is still just an HTML file.
In other words, that HTML file doesn't exist in the same form on the server - i.e., it's actually some server-side code that generates HTML, not a static HTML file - but that's not what a search engine is crawling through and indexing; rather, it crawls and indexes document URLs that return exactly what you see in your browser.

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone numbers etc.) who want them visible on the site but would rather they not be google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done: Google either takes the entire page, or none of it.
You could do some sneaky tricks, though, like inserting the part of the page you don't want indexed in an iframe and using robots.txt to ask Google not to crawl that iframe.
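A rough sketch of that trick, assuming the excluded content lives under a hypothetical /noindex/ path. In the page:
<iframe src="/noindex/ticker.html"></iframe>
And in robots.txt:
User-agent: *
Disallow: /noindex/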
In short: NO, unless you use cloaking, which is discouraged by Google.
Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
I found a useful resource for keeping certain duplicate content on a page while not allowing search engines to index that content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP using PHP or ASP. Then serve the IP addresses that fall into that list a version of the page you wish to be indexed. In that search-engine-friendly version of your page, use the canonical link tag to point to the page version that you do not want indexed.
This way, the page will be indexed under its normal address, but only the content you wish to be indexed will actually end up in the index. This method will not get you blocked by the search engines and is completely safe.
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and listing which portions you don't want indexed, such as wp-admin, or a particular post or page. You can do that easily by creating this robots.txt file; before creating it, check your site's existing robots.txt, for example www.yoursite.com/robots.txt.
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.
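For whole pages (as opposed to parts of a page), the standard robots meta tag looks like this:
<meta name="robots" content="noindex, nofollow">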

Efficient way to add Canonical tags

If the value of the href for canonical tags is populated via a JavaScript function, would that affect search engine indexing (as search engines ignore JavaScript)?
I'm not sure I fully understand the question as you worded it. But here's my take:
Canonical tags are used to make sure that Google (et al.) knows that the same page served at different URLs is, in fact, the same page.
This saves Google a lot of processing time, because it will treat those pages as a single page instead of trying to index every one of them. Also, your domain's search engine ranking will probably go up because Google doesn't think you're duplicating content.
For any page that could be duplicated because of parameters, you should include a canonical link to the page you want known as the original. So yes, it would help in your case. Though you cannot put a canonical link on someone else's domain pointing to your domain, so putting it on a partner's page would not have the intended consequences.
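For illustration, a canonical link placed directly in the page's head looks like this (the URL is a placeholder):
<link rel="canonical" href="http://www.example.com/products/widget">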
If you want more information, read up here: Google Webmaster Central: Specify Your Canonical