Google and Mirror Websites

What is the best way to manage a website with one or more mirrors so that:
Google doesn't consider it "duplicate content"
The website is correctly indexed
No inconsistencies or duplicated information are present in Google Analytics
The Google webmaster guidelines in general are respected
NOTE: I'm not sure if I should ask this question here or on ServerFault. It seems to sit somewhere between programming and server administration. Let me know if you think ServerFault is a more appropriate place for this and I'll move it.
Thanks.

The simple, official solution is the canonical link tag; this is the approach Google itself recommends:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
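As a minimal sketch of the idea (Flask and the host name www.example.com are assumptions, not anything from the post): each mirror can serve the page as usual while declaring the preferred URL, either with a <link rel="canonical"> tag in the <head> or with the equivalent HTTP Link header, which Google also accepts.

```python
from flask import Flask, make_response

app = Flask(__name__)
CANONICAL_HOST = "www.example.com"  # assumption: the preferred host name

def render_page(page):
    # Stand-in for whatever actually renders the page body.
    return f"<html><body>Content for {page}</body></html>"

@app.route("/<path:page>")
def serve(page):
    resp = make_response(render_page(page))
    # Point every copy of this page (on any mirror) at the preferred URL.
    resp.headers["Link"] = f'<https://{CANONICAL_HOST}/{page}>; rel="canonical"'
    return resp
```

The same effect is achieved by emitting the tag directly in the page markup; the header form is simply easier to bolt onto mirrored hosts without touching templates.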

If you need to host on multiple servers, using round-robin DNS (or other load-balancing techniques) is a good idea. This lets you use a single host name and generally won't create problems with crawling and indexing on the search-engine side (since the crawlers don't see multiple URLs for the content).
If you need to host using separate host or domain names (for whatever reason), it's best to pick one preferred version and make sure that only that one is indexed. One way to do that is to use rel=canonical link elements on the alternate versions. In general, however, I'd recommend keeping the technical hosting issues (mirrored hosts) out of sight so that multiple host/domain names are never visible to users or search engines (as mentioned in the first part).
If you need to use multiple ccTLDs to host on country-specific domains then I'd strongly recommend making sure that you actually have country-specific content on each site (and not just mirroring one version). More about this is at http://googlewebmastercentral.blogspot.com/2010/03/working-with-multi-regional-websites.html
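For the second scenario above, a common complement to rel=canonical is a permanent redirect from the alternate hosts to the preferred one. A minimal sketch (Flask and the host name www.example.com are assumptions for illustration):

```python
from flask import Flask, redirect, request

app = Flask(__name__)
PREFERRED_HOST = "www.example.com"  # assumption: the one version you want indexed

@app.before_request
def force_preferred_host():
    # Any request arriving on a mirror host name is sent to the preferred host
    # with a 301, so users and crawlers only ever see one set of URLs.
    if request.host != PREFERRED_HOST:
        query = request.query_string.decode()
        target = f"https://{PREFERRED_HOST}{request.path}" + (f"?{query}" if query else "")
        return redirect(target, code=301)
```

This keeps Google Analytics consistent as well, since all tracked pageviews end up on a single host name.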

What happens to old links if I change my custom URL/subdomain?

If I were to use a specific custom domain (e.g. a.example.com) for a while (creating links via the API, via the Marketing tab in the dashboard, etc.) and then change it later on (e.g. to b.example.com), is there defined behaviour for exactly what happens with the old a.example.com links that are out in the wild?
The docs here simply say:
switching can cause significant problems with your existing links
What exactly are those problems? From a technical perspective, if a.example.com is still pointing to custom.bnc.lt, it should still be possible to identify what app that's for and what links it'll resolve to.
Just curious to know if anyone has any experience with this or a definitive answer to whether or not the "old" links will be broken after changing the custom domain and what "significant problems" may be encountered in doing so.
Thanks in advance!
Alex from Branch.io here: if you change your link domain, all your existing links will stop working and give errors.
While you're right that the CNAME record on the old domain is still pointing to custom.bnc.lt and will still forward traffic there, our backend performs a lookup to make sure the domain of the incoming link is affiliated with a known app in the system. If it doesn't find a match, we don't complete the link routing process.
As far as I know, we don't have any plans to support multiple domains for each Branch app configuration in the future. So the advice still holds!
We recommend that you choose one domain or subdomain to use with Branch and stick with it, as switching can cause significant problems with your existing links.
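Purely as an illustration of the lookup described above (this is not Branch's actual code, and all names here are hypothetical): the router only resolves links whose host is currently registered to a known app, so once a.example.com is replaced by b.example.com in the configuration, old a.example.com links fail the lookup even though their traffic still arrives via the CNAME.

```python
REGISTERED_DOMAINS = {          # assumption: one registered link domain per app
    "b.example.com": "my-app",  # a.example.com was removed when the domain changed
}

def route_link(host: str, path: str) -> str | None:
    """Return a deep-link target, or None when the host is unknown."""
    app_id = REGISTERED_DOMAINS.get(host)
    if app_id is None:
        return None  # unknown domain: the "old" links error out here
    return f"deep link for {app_id}: {path}"

print(route_link("a.example.com", "/promo"))  # None -> broken old link
print(route_link("b.example.com", "/promo"))  # resolves normally
```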

Usage of CDN in Hybris solutions

I am exploring the support and usage of CDN in a Hybris solution.
I am a Hybris newbie and am working through the wiki to understand the product better.
I have been unable to find answers when searching for CDN in conjunction with Hybris.
What are the typical CDN providers used in a Hybris solution? Any references would be helpful.
Appreciate any pointers.
PS: This is not a programming question. If this question is considered inappropriate let me know and I will delete this.
Why would there be any preferred CDN providers? You can choose any provider you want but take these questions into consideration:
How is the TTL defined? By cache headers (set by hybris) or manually set on the CDN’s side? The possible cache headers are Cache-Control, Surrogate-Control and Edge-Control. Akamai for example uses Edge-Control but I’m not aware that any other CDN uses this header.
The choice of your CDN will also depend on where the content will be required: do you need to serve it worldwide, or do you want to improve your quality of service in certain areas only by adding POPs?
Does content sometimes need to be invalidated from the cache? If yes, check if there is an API to do so (will require work to make hybris communicate with the API).
The easiest solution would be a basic cache where the TTLs would be defined in the CDN’s configuration.
If you choose to go with cache headers, this is quite a simple solution to set up in hybris: you only have to define a request handler that takes care of them.
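To illustrate the header-driven approach only (hybris itself would use a Java filter or interceptor; the URL prefix and TTL below are assumptions): the application sets the caching headers, and the CDN derives its TTL from them instead of from manual CDN-side configuration.

```python
from flask import Flask, request

app = Flask(__name__)
MEDIA_TTL = 86400  # assumption: cache static media for one day

@app.after_request
def add_cache_headers(resp):
    # assumption: static media is served under /medias/
    if request.path.startswith("/medias/"):
        resp.headers["Cache-Control"] = f"public, max-age={MEDIA_TTL}"
        resp.headers["Surrogate-Control"] = f"max-age={MEDIA_TTL}"
        # An Akamai-specific Edge-Control header could be added here as well.
    return resp
```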
Hybris is usually used for some sort of ecommerce site. I have seen great results by using ImageEngine.io for an image specific CDN for Hybris. ImageEngine optimizes images on the fly which makes your site load faster. Worth a look: http://imageengine.io

Is there an advantage to building my own tinyURL-like function/service over using third party resources?

I've seen the many different ways I can build a function/service to generate short URLs which I can then control via my own domain.
This sounds like a great idea; however, as I look at the advantages, such as being able to control these URLs long term, adjust the end location if needed, and have more tracking over where they wind up, I'm wondering if there is already a service out there that provides this level of control without my needing to build/host/support the solution myself.
The exact features desired are as follows:
Control of where the URL points to AFTER it's generated (the underlying URL needs to change due to legal/regulatory issues)
More robust tracking of where the URL is used, as opposed to just doing a Google search for the tiny URL
The advantage would be that you own the links and are not dependent on a service that may go out of business. Also, if the shortened URL is on your own domain, it carries SEO/PageRank benefits. It can also reduce friction for users clicking the link: when you use another domain for shortening, you are depending on the trust the user has in that organization as well.
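The control the question asks for is easy to see in a minimal self-hosted sketch (Flask, an in-memory dict instead of a database, and the placeholder domain sho.rt are all assumptions): the destination stays editable after the short link is issued, and every click can be logged.

```python
import secrets
from flask import Flask, abort, redirect, request

app = Flask(__name__)
links = {}   # slug -> destination URL (a real service would use a database)
clicks = []  # simple click log: (slug, referrer)

@app.route("/shorten", methods=["POST"])
def shorten():
    slug = secrets.token_urlsafe(4)
    links[slug] = request.form["url"]
    return f"https://sho.rt/{slug}\n"        # sho.rt is a placeholder domain

@app.route("/<slug>")
def follow(slug):
    target = links.get(slug)
    if target is None:
        abort(404)
    clicks.append((slug, request.referrer))  # tracking hook
    return redirect(target, code=302)        # 302 so the target stays editable
```

Using a temporary (302) rather than permanent (301) redirect matters here: it keeps browsers and caches from pinning the old destination, which is exactly what you need when the underlying URL may have to change for legal/regulatory reasons.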
PURLs might be what you want, although they are not necessarily optimised for shortness; typically they look like http://purl.org/net/kin
PURLs (Persistent Uniform Resource Locators) are Web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure. Instead of resolving directly to Web resources, PURLs provide a level of indirection that allows the underlying Web addresses of resources to change over time without negatively affecting systems that depend on them. This capability provides continuity of references to network resources that may migrate from machine to machine for business, social or technical reasons.
purl.org
You have complete control over where they point if you use the centralised server; you can also download their software to run your own.

When should I use or avoid subdomains?

Recently a user told me to avoid subdomains when I can. I remember reading that Google considers a subdomain to be a separate site (is this true?). What else happens when I use a subdomain, and when should or shouldn't I use one?
I heard cookies are not shared between subdomains? I know two images can be downloaded simultaneously from a site. Would I be able to download four if I use sub1.mysite.com and sub2.mysite.com?
What else should I know?
You can share cookies between subdomains, provided you set the right parameters in the cookie. By default, they won't be shared, though.
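The "right parameter" is the cookie's Domain attribute. A minimal sketch (Flask and the domain mysite.com are assumptions): with domain=".mysite.com" the cookie is sent to every subdomain, while a cookie set without it is host-only.

```python
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/login")
def login():
    resp = make_response("logged in")
    # Shared across www.mysite.com, sub1.mysite.com, sub2.mysite.com, ...
    resp.set_cookie("session", "abc123", domain=".mysite.com")
    # Host-only by default: sent back only to the exact host that set it.
    resp.set_cookie("local_pref", "1")
    return resp
```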
Yes, you can get more simultaneous downloads if you serve images from different subdomains. The other side of the scale, however, is that the user spends more time on DNS lookups, so it's not practical to have, say, 25 subdomains to get 50 simultaneous downloads.
Another thing that happens with subdomains is that AJAX requests won't work without some effort (you CAN make them work using document.domain tricks, but it's far from straightforward).
I can't help with the SEO part, although some people discourage having both yoursite.com and www.yoursite.com working and returning the same content, because it "dilutes your PageRank". Not sure how true that is.
You complicate quite a few things: collecting stats, controlling spiders, HTML5 storage, XSS, inter-frame communication, virtual-host setup, third-party ad serving, and interaction with remote APIs like Google Maps.
That's not to say these things can't be solved, just that the rise in complexity adds more work and may not provide suitable benefits to compensate.
I should add that I went down this path once myself for a classifieds site, adding domains like porsche.site.com and ferrari.site.com, hoping to boost rank for those keywords. In the end I did not see a noticeable improvement, and even worse, Google was walking the entire site via each subdomain, meaning that a search for Ferraris might return porsche.site.com/ferraris instead of ferrari.site.com/ferraris. In short, Google considered the subdomains to be duplicates, but it still crawled each one every time it visited.
Again, workarounds existed but I chose simplicity and I don't regret it.
If you use subdomains to store your website's images, JavaScript, stylesheets, etc., then your pages may load quicker. Browsers limit the number of simultaneous connections to each domain name; the more subdomains you use, the more connections can be made at the same time to fetch the page's content.
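A minimal sketch of this kind of asset sharding (the shard host names are assumptions): each asset is mapped to a subdomain deterministically, so the same image always resolves to the same host and stays cacheable, while downloads spread across more connections.

```python
import zlib

SHARDS = ["sub1.mysite.com", "sub2.mysite.com"]  # assumption: two asset subdomains

def asset_url(path: str) -> str:
    # Hash the path so a given asset always lands on the same shard.
    shard = SHARDS[zlib.crc32(path.encode()) % len(SHARDS)]
    return f"https://{shard}{path}"

print(asset_url("/images/logo.png"))
print(asset_url("/css/site.css"))
```

Picking the shard deterministically (rather than at random) matters: random assignment would make the browser re-download the same asset from different hosts and defeat its cache.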
Recently a user told me to avoid subdomains when I can. I remember reading that Google considers a subdomain to be a separate site (is this true?). What else happens when I use a subdomain, and when should or shouldn't I use one?
The last thing I heard about Google optimization is that domains count for more PageRank than subdomains. I also believe that PageRank calculations are per page, not per site (according to the algorithm, etc.). Though the only person who can really tell you is a Google employee.
I heard cookies are not shared between subdomains?
You should be able to use a cookie for all subdomains: www.mysite.com, sub1.mysite.com, and sub2.mysite.com can all share the same cookies if the cookie's Domain attribute is set to mysite.com. A cookie set without a Domain attribute (a host-only cookie) is only sent back to the exact host that set it.
I know two images can be downloaded simultaneously from a site. Would I be able to download four if I use sub1.mysite.com and sub2.mysite.com?
Browsers limit how many simultaneous connections they open to a single hostname (historically two, more in modern browsers), so they only fetch a few images from one domain at a time. Spreading images across sub1.mysite.com and sub2.mysite.com does let the browser download more items in parallel than it would from a single hostname.

Is this a blackhat SEO technique?

I have a site which has been developed completely in Flash. The site owners do not want to shift to a more text/HTML-based site, so I am planning to create an alternative HTML/text-based site that Googlebot will be redirected to (by checking the user agent). My question is: is this officially allowed by Google?
If not, then how come many subscription-based sites display a different set of data to Google than to their users? Is that allowed?
Thank you very much.
I've dealt with this exact scenario for a large ecommerce site and Google essentially ignored the site. Google considers it cloaking and addresses it directly here and says:
Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.
Instead, create an ADA-compliant version of the website so that users with screen readers and vision aids can use your site. As long as there is a link from your home page to your ADA-compliant pages, Google will index them.
The official advice seems to be: offer a visible link to a non-Flash version of the site. Fooling Googlebot is a surefire way to get in trouble. And remember, Google results will link to the matching page! Do not make useless results.
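A minimal sketch of that safe pattern (Flask and the paths /site.swf and /html/ are assumptions): every visitor, crawler or human, gets the same response, with the HTML version linked visibly instead of being served only to a sniffed user agent.

```python
from flask import Flask

app = Flask(__name__)

PAGE = """<html><body>
  <object data="/site.swf" type="application/x-shockwave-flash">
    Flash is not available in your browser.
  </object>
  <p><a href="/html/">View the HTML version of this site</a></p>
</body></html>"""

@app.route("/")
def home():
    # Deliberately no User-Agent check: crawlers and visitors see the same page.
    return PAGE
```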
Google already indexes flash content so my suggestion would be to check how your site is being indexed. Maybe you don't have to do anything.
I don't think showing an alternate version of the site is good from a Google perspective.
If you serve up your page with the exact same address, then you're probably fine. For example, if you show 'http://www.somesite.com/' but direct googlebot to 'http://www.somesite.com/alt.htm', then Google might direct search users to alt.htm. You don't want that, right?
This is called cloaking. I'm not sure what the effects of it are but it is certainly not whitehat. I am pretty sure Google is working on a way to crawl flash now so it might not even be a concern.
I'm assuming you're not really doing a redirect but instead a PHP import or something similar so it shows up as the same page. If you're actually redirecting then it's just going to index the other page like normal.
Some sites offer a different level of content to crawlers -- they LIMIT the content rather than offering alternative or additional content. This is generally done so that unrelated things don't get indexed.