Will rel=canonical break site: queries?

Our company publishes our software product's documentation with a custom-built content management system that uses a dynamic URL namespace like this:
http://ourproduct.com/documentation/version/pageid
Where "version" is the version number to which the documentation applies, and "pageid" is a unique string which identifies that page in our back-end content management system. For example, if content (e.g. a page about configuration best practices) is unchanged from version 3.0 and 4.0 of our product, it'd be reachable by two different URLs:
http://ourproduct.com/documentation/3.0/configuration-best-practices
http://ourproduct.com/documentation/4.0/configuration-best-practices
This URL scheme allows us to scope Google search results to see only documentation for a particular product version, like this:
configuration site:ourproduct.com/documentation/4.0
But when the user is searching across all versions, we don't want Google to arbitrarily choose one of the URLs to show in results. Instead, we always want the latest version to show up. Hence our planned use of rel=canonical, so we can prescriptively tell Google which URL we want to show up if multiple versions are being searched. (Users who do oddball things like searching two versions but not all of them are a corner case, so we don't care which version(s) show up in that case; the primary use cases we care about are searching one version or searching all versions.)
But what will happen to scoped searches if we do this? If my rel=canonical URL points to version 4.0, but my search is scoped to 3.0, will Google return a result?
Even if you don't know the answer offhand, do you know of a site which uses rel=canonical to redirect across folders in a URL namespace? If so, I could run a few Google searches and figure out the answer.

The rel=canonical link element helps search engines to determine the URL that they should index, so ultimately, by specifying it for a URL, you're telling them to drop the old version and only to index the new version. In practice, it might be that both versions are indexed for a while (depending on how they're discovered and crawled), but in the long run only the canonical will generally remain indexed. In other words, if you do this for your site, over time the site:-query results for the old versions will drop (which probably makes sense).
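For reference, using the URLs from the question, the page under /3.0/ would carry something like this in its head (a sketch; point it at whichever URL you want treated as the preferred version):
<link rel="canonical" href="http://ourproduct.com/documentation/4.0/configuration-best-practices" />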
If you need to have both versions indexed, then I wouldn't use the rel=canonical link element; I'd just link from the old versions to the new versions (e.g. "The current version of this document can be found at X").
Wikia uses rel=canonical link elements fairly extensively; I don't think they use it across folders, but you can still see the results for individual URLs.

Related

Liferay document library

Use case: a user bookmarks a link to a PDF document for downloading or viewing it online.
The URL contains a version number provided by Liferay.
Is it possible to ensure that you always get the latest version of the bookmarked PDF document, even if the URL was bookmarked months ago?
The uploaded PDF documents are versioned by Liferay's document library.
Of course you can remove the version number from the PDF link, but I guess this would lead to the problem that your browser caches the document, and again you can't be sure whether your PDF document is the latest one.
Can anyone drop me a hint?
No, you cannot do this out of the box, so the only solution is to make a hook for the method that fetches the document. In this case I think you should override some of the DLFileEntryLocalServiceUtil methods. These two links will give you enough information:
Override a service - https://dev.liferay.com/develop/tutorials/-/knowledge_base/6-2/overriding-a-portal-service-using-a-hook
DLFileEntryLocalServiceUtil - https://docs.liferay.com/portal/6.2/javadocs/com/liferay/portlet/documentlibrary/service/DLFileEntryLocalServiceUtil.html
Good luck!
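For illustration, here is a rough sketch of the service-wrapper pattern that the "override a service" tutorial describes, assuming Liferay 6.2. The wrapper class and getFileEntry(long) exist in that version, but which method(s) you actually need to override (and how you then resolve the latest file version) depends on how your download URLs are handled, so treat this as a starting point only; the class and package names are placeholders.
// Registered via a <service> entry in docroot/WEB-INF/liferay-hook.xml:
//   <service-type>com.liferay.portlet.documentlibrary.service.DLFileEntryLocalService</service-type>
//   <service-impl>com.example.hook.LatestVersionDLFileEntryLocalService</service-impl>
package com.example.hook;

import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.exception.SystemException;
import com.liferay.portlet.documentlibrary.model.DLFileEntry;
import com.liferay.portlet.documentlibrary.service.DLFileEntryLocalService;
import com.liferay.portlet.documentlibrary.service.DLFileEntryLocalServiceWrapper;

public class LatestVersionDLFileEntryLocalService extends DLFileEntryLocalServiceWrapper {

    public LatestVersionDLFileEntryLocalService(DLFileEntryLocalService dlFileEntryLocalService) {
        super(dlFileEntryLocalService);
    }

    @Override
    public DLFileEntry getFileEntry(long fileEntryId)
        throws PortalException, SystemException {
        // Delegate to the wrapped service; your own logic for ignoring a
        // version baked into the request and serving the latest file
        // version would go here.
        return super.getFileEntry(fileEntryId);
    }
}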

Alfresco Rest API to get the current node version

I'm making a call to GET service/api/version?nodeRef={nodeRef}, but I want to limit the query to just the most recent version. I found this StackOverflow post which suggests the use of &filter={filterQuery?} to filter results, but that looks like it only applies to the people query method. Is there something like this for the version method?
Before answering your specific query, a general point: Alfresco Community is open source, and almost all of Alfresco Enterprise is too, so for many queries around this your best bet is simply to go and check the source code yourself! The REST APIs are held in the projects/remote-api part of the source tree.
Looking in there, you can see that the versions webscript returns all versions for a given node. It's fairly fast to do that, so calling it and taking the most recent one isn't the end of the world.
Otherwise, the slingshot node details API at http://localhost:8080/alfresco/service/slingshot/doclib2/node/{store_type}/{store_id}/{id} (add the parts of the nodeRef into the URL) will quickly return lots of information on a node, including the latest version. In the JSON returned by that, inside the item key is something like
"version": "1.7",
This gives you the latest version of the node.
Some of the listing and search APIs will also include the version, but those are likely to be quite a bit more heavyweight than just getting the version history of a node.
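As a concrete sketch, with a made-up nodeRef of workspace://SpacesStore/abc123, the slingshot call and the relevant slice of its JSON response look roughly like this (all other fields omitted):
GET http://localhost:8080/alfresco/service/slingshot/doclib2/node/workspace/SpacesStore/abc123
{
  "item": {
    "version": "1.7"
  }
}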

Two otherwise identical URLs with parameters in a different order: duplicate content?

My own CMS automatically adds new parameters to links in a page to specify a given language.
It works quite well, but it doesn't always put the variable in the same position, giving me two different links to the same page/language:
www.xxx.yy/index.php?mod=blog&page=3&lang=en
or
www.xxx.yy/index.php?mod=blog&lang=en&page=3
Will search engines be smart enough to detect that both URLs point to the same page? Or will they treat them as two different URLs and therefore flag them as duplicate content?
I will fix this issue anyway, but I've been curious about this for a long time.
Google definitely supports this, as they explicitly mention that example in their webmaster blog:
Like www.example.com/skates.asp?color=black&brand=riedell and www.example.com/skates.asp?brand=riedell&color=black. Having this type of duplicate content on your site can potentially affect your site's performance, but it doesn't cause penalties. From our article on duplicate content:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
For all other duplicate content worries, consider specifying a canonical URL.
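Applied to the URLs from the question, that would mean picking one parameter order as the preferred one and pointing both variants at it, for example (a sketch):
<!-- on both variants of the blog page -->
<link rel="canonical" href="http://www.xxx.yy/index.php?mod=blog&amp;page=3&amp;lang=en" />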

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone numbers etc.) that they want visible on the site but would rather not be Google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done: Google either takes the entire page or none of it.
You could do some sneaky tricks, though, like putting the part of the page you don't want indexed in an iframe and using robots.txt to block Google from crawling that iframe's source.
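A minimal sketch of that trick, assuming the snippet you want hidden lives at a made-up path /private/ticker.html:
<!-- in the main page -->
<iframe src="/private/ticker.html"></iframe>
# in robots.txt
User-agent: *
Disallow: /private/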
In short: NO, unless you use cloaking, which is discouraged by Google.
Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
I found a useful resource for including certain duplicate content on a page while not allowing search engines to index that content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP using PHP or ASP. Then serve the IP addresses that fall into that list a version of the page you wish to have indexed. In that search-engine-friendly version of your page, use the canonical link tag to tell the search engine which page version should be indexed.
This way the page is still indexed under its address, but only the content you wish to be indexed actually gets indexed. This method will not get you blocked by the search engines and is completely safe.
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and listing the portions you don't want indexed, such as wp-admin or a particular post or page. Before creating it, check your site's existing robots.txt, for example at www.yoursite.com/robots.txt.
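For example, a minimal robots.txt along the lines this answer describes (the paths are placeholders for whatever you want to keep out):
User-agent: *
Disallow: /wp-admin/
Disallow: /some-private-page/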
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta tags for bots, and there's also robots.txt, with which you can restrict access to certain directories.
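For completeness, the page-level meta tag referred to here looks like this; note that it applies to the whole page, not to parts of it:
<meta name="robots" content="noindex" />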

URL scheme for a multi-version web app

I'm looking for the best URL schema to use for a web app that has multiple versions, namely several languages and a simplified version for use by mobile phones - both aspects can be combined, so there's an English regular and mobile version, a German regular and mobile version, etc.
Goals (in order of importance):
User-friendliness
Search engine friendliness
Ease of development
Aspects to consider:
What should the URLs look like?
How should the user navigate between versions?
How much logic should there be to automatically decide on a version?
I'll describe my concept so far below, maybe some of you have better ideas.
My current concept:
When a new user arrives, the app decides, based on cookies (see below), the Accept-Language: header and the user agent string (used to identify mobile browsers), which version to show, but it does not reflect this in the URL (no redirects)
It defaults to the non-simplified English version
There are prominently displayed icons (flags, a stylized mobile phone) to choose other versions
When the user explicitly chooses a different version, this is reflected both in a changed URL and a browser cookie
The URL schema is / for the "automatic" version, /en/, /de/, etc. for the language version, /mobile/ for the simplified version, /normal/ for the non-simplified one, and combinations thereof, e.g. /mobile/en/ and /normal/de/
mod_rewrite is used to strip these URL prefixes and convert them to GET parameters for the app to parse (see the sketch after this list)
robots.txt disallows /mobile/ and /normal/
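A rough sketch of those last two pieces, assuming an .htaccess context and showing only the combined prefixes; the GET parameter names (ui, lang, path) are made up and would be whatever the app actually expects:
RewriteEngine On
# /mobile/en/some/page -> index.php?ui=mobile&lang=en&path=some/page
RewriteRule ^(mobile|normal)/([a-z]{2})/(.*)$ index.php?ui=$1&lang=$2&path=$3 [L,QSA]
# /en/some/page -> index.php?lang=en&path=some/page
RewriteRule ^([a-z]{2})/(.*)$ index.php?lang=$1&path=$2 [L,QSA]
And the corresponding robots.txt:
User-agent: *
Disallow: /mobile/
Disallow: /normal/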
Advantages:
The different language versions are all indexed separately by search engines
Cookies help, but are not necessary
There's a good chance that people will see the version that's ideal for them without having to make any choice
The user can always explicitly choose which version he wants (this makes the /normal/ URL necessary)
Each version has a URL which will display exactly that version when passed to others
/mobile/ and /normal/ are ignored by search engines; they would only be duplicate content.
Disadvantages:
Requires heavy use of mod_rewrite, which I find rather cryptic
Users could send their current URL to someone and that person, when visiting it, could end up seeing a different version, which could cause confusion
There is still duplicate content between / and /en/ - I can't disallow / in robots.txt - should I trust the search engines not to penalize me for exact duplicate content on the same domain, or disallow /en/ and accept that people coming to / via a search engine may see a different version than what they found in the search engine?
I suggest subdomains, personally.
I wouldn't include the mobile part at all - use the user agent to determine this, and possibly a cookie in case the user wants to view the full site on their mobile (think of how Flickr and Google do it). But for languages, yes - primary language at http://mydomain.com/, secondary languages at e.g. http://de.mydomain.com/ or http://fr.mydomain.com/
I am unclear why you would want to incorporate into the URL scheme any kind of what you call versioning information, such as an Accept-Language or User-Agent specific designation. The URL scheme should be indicative of the content only. The server should inspect the various request headers to determine how to retrieve and/or format the response.