Automatic language translation and SEO

I'm looking into translating a large UK English site into a number of other European languages. I was wondering what free options are out there for automatic translation?
Also, with regard to SEO, how do search engines treat translated copies of web pages under the duplicate content rules?
Thanks

My limited experience in the matter is that the big G treats automatic language translation as duplicate content. It seems that the DC detection algorithms are language-agnostic. However, when I hand-translate into languages that I know, the 'new' pages rate highly. In fact, I would say that translating highly-rated (PR 4 and above) pages leads to better-performing pages (more search engine landings, and more varied terms as well) than even new original-content pages.
I have done no comparisons in this regard to other search engines, as they typically supply less than 10-20% of my traffic anyway.

You'd be better off hiring translators to write language specific versions that aren't simply translated versions of your English copy. You'll end up with better results, too.
The majority of your audience can probably understand English well enough to navigate your site, so I think you might be putting too much thought into this.

If someone finds this question nowadays, know that there are now multiple free and paid translation tools that can be used. As for duplicate content across languages, you can set up markup that tells search engines your content has alternate language versions and which one is the canonical version. Have a look at the Google documentation for examples:
https://developers.google.com/search/docs/specialty/international/localized-versions
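For example, an English page of a hypothetical site could declare its canonical URL and its alternate-language versions like this (the URLs below are placeholders, following the pattern in the linked documentation):

```html
<!-- Hypothetical example: annotations in the <head> of the English page -->
<link rel="canonical" href="https://www.example.com/en/page.html" />
<link rel="alternate" hreflang="en" href="https://www.example.com/en/page.html" />
<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/page.html" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/en/page.html" />
```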

Related

Captcha alternative

In order to implement a CAPTCHA for my login page, I would like to understand how a translation test can be considered secure compared to popular image recognition patterns.
All customers will be bilingual speakers of an orally learnt and used Polynesian language, i.e. one with no formal spelling conventions (hence the translation to English rather than the reverse), so instead of asking them to read distorted letters I would like to ask them to translate a simple sentence into English, to be validated on the PHP server side.
Is this secure/accurate?
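A minimal sketch of the kind of server-side check being proposed (the challenge sentences, accepted answers, and field names below are invented for illustration):

```php
<?php
// Hypothetical whitelist of challenge sentences in the local language,
// each mapped to the English translations we are willing to accept.
$challenges = [
    'c1' => [
        'sentence' => '<sentence in the local language>',
        'answers'  => ['the canoe is on the beach', 'the boat is on the beach'],
    ],
];

// Normalise and validate the submitted translation.
$key    = $_POST['challenge_key'] ?? '';
$answer = strtolower(trim($_POST['answer'] ?? ''));

$passed = isset($challenges[$key])
    && in_array($answer, $challenges[$key]['answers'], true);

if (!$passed) {
    http_response_code(403);
    exit('CAPTCHA failed');
}
// Otherwise continue with the login / signup flow.
```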
The basic reason for stating that this kind of CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") is totally insecure is that, while the OP states that Google Translate doesn't "currently" support the Polynesian language in question, it cannot be excluded that it will do so in the future.
More generally, translation is not a valid CAPTCHA test because of the following considerations:
If you validate by comparing a random sentence against its automated translation from a public translator (e.g. a future version of Google or Bing), an attacker only has to submit the same phrase to the same translation engine.
Using a whitelist of sentences and their translations will eventually be overwhelmed by the improving accuracy of public automated translators.
In other words, modern public machine translators keep perfecting their accuracy. If you assume that a public translator cannot do an accurate job today and challenge the user with a known phrase the translator cannot process, technology will tend to fix that translation eventually, and the challenge sentence will become easy for robots to spot.
That is the main principle behind reCAPTCHA being used as an OCR, only from the opposite side. I suggest you read this paper; briefly, the researchers state that reCAPTCHA is destined to improve its accuracy far beyond automated OCRs because of user input.
Since Google and Bing Translate make wide use of user-submitted data to improve their translation process, they will be subject to human-aided machine learning that eventually breaks the Turing test for that kind of challenge (e.g. reCAPTCHA will read like a human, Translate will translate like a human).
After reading the comments, it seems the only danger I face is a vague future Google Translate one, which is unlikely to eventuate. So I'm going to stick my neck out and say that this is indeed a good security measure which could conceivably be useful to many businesses or organisations that have such a customer base. Thanks for the assist.
A major point in its favor is ease of use for the customers, all of whom so far prefer it to trying to read a CAPTCHA. I put it on a live system, so I had 80+ people use it today.
I presume they all speak English too then? It's unusual to require your users to be bilingual. Even if this is the case today, is it possible that with future growth you might be excluding certain users? What if someone moves into the area who wants to sign up but only speaks English?
Language is a funny imprecise thing. You could take a sentence and probably translate it a number of different ways. Computers deal in precision so you need a question where there can only be one answer.
Also, the whole idea of a CAPTCHA is to make sure it's a real person, but it may not be too hard to write a program that uses Google Translate or something similar. It may not always get it right, but it'd probably get through some of the time.

Microformat's hRecipe vs. Schema's Recipe

I would like to know what the main differences are between Microformat's hRecipe and Schema.org's Recipe and how search engines treat each one.
Besides the differences in code and the fact that the former is open while the latter is proprietary, how do search engines treat each one, and which one is better to implement, both from a long-term perspective and an SEO perspective?
Schema.org with Google, Bing, Yahoo!, and Yandex
Since you asked this question, Microformat's hRecipe has been updated with microformats2 as h-recipe, but otherwise your question remains relevant and is worth answering more than 6 years later.
…how do search engines treat each one…?
Search engine giants, Google, Microsoft (Bing), and Yahoo!, along with Yandex (a popular search engine in Russia and elsewhere globally) collaborated to create Schema.org and the schemas therein.
This collaboration is the biggest differentiator between Schema.org and Microformats; it does and will likely continue to have an impact on how each treats schemas defined by other parties.
You can read about why they created it and how they treat other formats in the Schema.org FAQ.
Specifically, you may be interested in their answers to…
What is the purpose of schema.org?
Why are Google, Bing, Yandex and Yahoo! collaborating?
I have already added markup in some other format (i.e. microformats, RDFa, data-vocabulary.org, etc). Do I need to change anything on my site?
Why microdata? Why not RDFa or microformats?
Why don't you support other vocabularies such as FOAF, SKOS, etc?
…which one is better to implement, both from a long-term perspective and a SEO perspective?
The better schema to implement is the one with the most support; in this case, that appears to be Schema.org's Recipe. While all of the above search engines still support microformats, mentions of it have disappeared from some of Google's official documentation regarding structured data and rich snippets.
Interestingly, Google recommends a newer syntax for structured data called JSON-LD.
JSON-LD: The future of structured data?
From a long-term perspective, you may want to consider adopting the increasingly popular JSON-LD markup syntax with the Schema.org Recipe schema, which even Bing is supporting now (here are examples demonstrating it), despite their documentation making no mention of it.
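For illustration, a minimal JSON-LD rendering of a Schema.org Recipe might look like this (the recipe values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Pancakes",
  "recipeIngredient": ["Flour", "Milk", "Eggs"],
  "recipeInstructions": "Mix the ingredients and fry."
}
</script>
```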
Pinterest's interesting support
The popular content discovery platform Pinterest supports both schemas and even supports the new JSON-LD syntax (though it is not explicitly mentioned in their documentation).
Despite Schema.org's growing popularity and adoption, Pinterest offers seemingly greater support for the h-recipe microformat with their inclusion of e-instructions as a supported class, whereas Schema.org's corresponding recipeInstructions property is not a supported property.
It's unclear if this is intentional or even which schema they actually prefer, but it is worth keeping in mind if you intend to develop specifically for this platform.
hRecipe is based on class attributes, while Schema's Recipe is based on several dedicated attributes. Those are the main differences in the markup; hRecipe is backwards compatible, whereas Recipe is not, because it uses HTML5 microdata attributes.
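For example, the same minimal recipe marked up both ways (the recipe content is a placeholder):

```html
<!-- h-recipe (microformats2): plain class attributes on existing HTML -->
<article class="h-recipe">
  <h1 class="p-name">Pancakes</h1>
  <span class="p-ingredient">Flour</span>
  <div class="e-instructions">Mix the ingredients and fry.</div>
</article>

<!-- Schema.org Recipe as microdata: itemscope/itemtype/itemprop attributes -->
<article itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">Pancakes</h1>
  <span itemprop="recipeIngredient">Flour</span>
  <div itemprop="recipeInstructions">Mix the ingredients and fry.</div>
</article>
```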
The big three search engines say that they'll treat both the same; however, I don't buy that. Google has been pushing their web platform(s) long enough for me to think that they'll be adding extra juice to Recipe, even though I can't prove it. Even if they aren't throwing extra SEO weight at Recipe, you can be sure that they'll work something into SERPs so that if you are using their proprietary markup, you get noticed... more. Take the link element's prefetch and prerender attributes as an example: Google created prerender, and if you use it on your site, voila, it prerenders in SERPs for the user. prefetch does not.
I'm not sure how to differentiate between a long-term perspective and an SEO perspective; I look at them the same. I'm not saying that you can't, just trying to explain more. I have thought this over before from a client's perspective and asked myself these same questions in regards to microformats as a whole vs. Schema. It's basically a judgement call: microformats are a tried and true format; there are millions more sites using microformatted data than there are using Schema's. They aren't going anywhere. And (as noted earlier) they are backwards compatible.
That said, Schema is backed by the big three and, being HTML5-based, shouldn't have portability problems in the future. Also, as previously mentioned, I'm sure all three will be rewarding users (though I have no proof) in their respective search results. One caveat here, though, is how fast everything on the web is moving; just as quickly as Schema popped up, it could conceivably be dropped. I doubt it (though I'm hoping), but it is a possibility.
I can't say which is better to implement, but microformats are certainly much easier to implement; they're class-based and so freaking easy.
It is better to use the Schema.org formats, as they have been accepted as a standard by all of the major search engines (Google, Yahoo, and Bing). Using an alternative microformat may mean that some of the search engines will not recognize that data as being special, and you will lose any possible advantages it offers.

How does Google Know you are Cloaking?

I can't seem to find any information on how Google determines if you are cloaking your content. How, from a technical standpoint, do you think they are determining this? Are they sending in things other than the Googlebot and comparing the results to the Googlebot results? Do they have a team of human beings comparing? Or can they somehow tell that you have checked the user agent and executed a different code path because you saw "googlebot" in the name?
It's in relation to this question on legitimate URL cloaking for SEO. If the textual content is exactly the same, but the rendering is different (1995-style HTML vs. Ajax vs. Flash), is there really a problem with cloaking?
Thanks for your input on this one.
As far as I know, how Google prepares search engine results is secret and constantly changing. Spoofing different user-agents is easy, so they might do that. They also might, in the case of Javascript, actually render partial or entire pages. "Do they have a team of human beings comparing?" This is doubtful. A lot has been written on Google's crawling strategies including this, but if humans are involved, they're only called in for specific cases. I even doubt this: any person-power spent is probably spent by tweaking the crawling engine.
Google looks at your site while presenting user agents other than Googlebot.
See the Google Chrome comic book, page 11, where it describes (in better than layman's terms) how a Google tool can take a schematic of a web page. They could be using this or similar technology for Google search indexing and cloak detection; at least that would be another good use for it.
Google does hire contractors (indirectly, through an outside agency, for very low pay) to manually review documents returned as search results and judge their relevance to the search terms, quality of translations, etc. I highly doubt that this is their only tool for detecting cloaking, but it is one of them.
In reality, many of Google's algos are trivially reversed and are far from rocket science. In the case of so-called "cloaking detection", all of the previous guesses are on the money (apart from, somewhat ironically, John K lol). If you don't believe me, set up some test sites (inputs) and some 'cloaking test cases' (further inputs), submit your sites to uncle Google (processing), and test your non-assumptions via pseudo-advanced human-based cognitive correlationary quantum perceptions (<-- btw, I made that up for entertainment value (and now I'm nesting parentheses to really mess with your mind :)) AKA "checking Google results to see if you are banned yet" (outputs). Loop until enlightenment == True (noob!) lol
A very simple test would be to compare the file size of a webpage that Googlebot saw against the file size of the page scanned by a Google alias that looks like a normal user.
This would detect most suspect candidates for closer examination.
They call your page using tools like curl, and they construct a hash of the page fetched without the Googlebot user agent, then they construct another hash of the page fetched with the Googlebot user agent. The two hashes must be similar; they have algorithms to check the hashes and determine whether it's cloaking or not.
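A rough sketch of that comparison idea (the URL and user-agent strings are illustrative; this is not a claim about Google's actual implementation):

```php
<?php
// Fetch the same page twice with different user agents and compare the results.
function fetchPage(string $url, string $userAgent): string
{
    $context = stream_context_create([
        'http' => ['header' => "User-Agent: $userAgent\r\n"],
    ]);
    return (string) file_get_contents($url, false, $context);
}

$url         = 'https://example.com/page.html';
$asBrowser   = fetchPage($url, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
$asGooglebot = fetchPage($url, 'Googlebot/2.1 (+http://www.google.com/bot.html)');

// Large size differences or different content hashes are a hint of cloaking.
echo 'Size difference: ' . abs(strlen($asBrowser) - strlen($asGooglebot)) . " bytes\n";
echo 'Identical content: '
    . var_export(hash('sha256', $asBrowser) === hash('sha256', $asGooglebot), true) . "\n";
```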

What are the common sense SEO practices that aren't dodgy or crap? [closed]

In SEO there are a few techniques that have been flagged as needing to be avoided at all costs. These are all techniques that used to be perfectly acceptable but are now taboo.
Number 1: Spammy guest blogging: blowing up a page with guest comments is no longer a benefit.
Number 2: Optimized anchors: these have become counterproductive; instead use safe anchors.
Number 3: Low-quality links: often sites will be flooded with hyperlinks that take you to low-quality Q&A sites; don't do this.
Number 4: Keyword-heavy content: try to avoid too many keywords; use longer, well-written sections more liberally.
Number 5: Link-back overuse: backlinks can be a great way to redirect to your site, but oversaturation will make people feel trapped.
Content, Content, CONTENT! Create worthwhile content that other people will want to link to from their sites.
Google has the best tools for webmasters, but remember that they aren't the only search engine around. You should also look into Bing and Yahoo!'s webmaster tool offerings (here are the tools for Bing; here for Yahoo). Both of them also accept sitemap.xml files, so if you're going to make one for Google, then you may as well submit it elsewhere as well.
Google Analytics is very useful for helping you tweak this sort of thing. It makes it easy to see the effect that your changes are having.
Google and Bing both have very useful SEO blogs. Here is Google's. Here is Bing's. Read through them--they have a lot of useful information.
Meta keywords and meta descriptions may or may not be useful these days. I don't see the harm in including them if they are applicable.
If your page might be reached by more than one URL (i.e., www.mysite.com/default.aspx versus mysite.com/default.aspx versus www.mysite.com/), then be aware that that sort of thing sometimes confuses search engines, and they may penalize you for what they perceive as duplicated content. Use the link rel="canonical" element to help avoid this problem (see the example at the end of this answer).
Adjust your site's layout so that the main content comes as early as possible in the HTML source.
Understand and utilize your robots.txt and meta robots tags.
When you register your domain name, go ahead and claim it for as long of a period of time as you can. If your domain name registration is set to expire ten years from now rather than one year from now, search engines will take you more seriously.
As you probably know already, having other reputable sites that link to your site is a good thing (as long as those links are legitimate).
I'm sure there are many more tips as well. Good luck!
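For the canonical URL tip above, a minimal example using the URLs from that tip (the chosen canonical address is just an illustration):

```html
<!-- Placed in the <head> of every URL variant of the page -->
<link rel="canonical" href="https://www.mysite.com/default.aspx" />
```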
In addition to having quality content, content should be added/updated regularly. I believe that Google (and likely others) will have some bias toward the general "freshness" of content on your site.
Also, try to make sure that the content the crawler sees is as close as possible to what the user will see (this can be tricky for localized pages). If you're careless, your site may be blacklisted for "bait-and-switch" tactics.
Don't implement important text-based sections in Flash - Google will probably not see them, and if it does, it'll screw it up.
Google can index Flash. I don't know how well, but it can. :)
A well organized, easy to navigate, hierarchical site.
There are many SEO practices that all work and that people should take into consideration. But fundamentally, I think it's important to remember that Google doesn't necessarily want people to be using SEO. More and more, Google is striving to create a search engine that is capable of ranking websites based on how good the content is, and solely on that. It wants to be able to see what good content is in ways we can't trick. Think about it: at the very beginning of search engines, a site which had the same keyword repeated 200 times on the same webpage was sure to rank for that keyword, just as a site with any number of backlinks, regardless of the quality or PR of the sites they came from, was assured Google popularity. We're past that now, but SEO is still, in a certain way, tricking a search engine into believing that your site has good content, because you buy backlinks, or comments, or such things.
I'm not saying that SEO is a bad practice, far from it. But Google is taking more and more measures to make its search results independent of the regular SEO practices we use today. That is why I can't stress this enough: write good content. Content, content, content. Make it unique, make it new, add it as often as you can. A lot of it. That's what matters. Google will always rank a site if it sees that there is a lot of new content, and even more so if it sees content coming onto the site in other ways, especially through commenting.
Common sense is uncommon. Things that appear obvious to me or you wouldn't be so obvious to someone else.
SEO is the process of effectively creating and promoting valuable content or tools, ensuring either is totally accessible to people and robots (search engine robots).
The SEO process includes and is far from being limited to such uncommon sense principles as:
Improving page load time (through minification, including a trailing slash in URLs, eliminating unnecessary code or db calls, etc.)
Canonicalization and redirection of broken links (organizing information and ensuring people/robots find what they're looking for)
Coherent, semantic use of language (from inclusion and emphasis of targeted keywords where they semantically make sense [and earn a rankings boost from SE's] all the way through semantic permalink architecture)
Mining search data to determine what people are going to be searching for before they do, and preparing awesome tools/content to serve their needs
SEO matters when you want your content to be found/accessed by people -- especially for topics/industries where many players compete for attention.
SEO does not matter if you do not want your content to be found/accessed, and there are times when SEO is inappropriate. Motives for not wanting your content found -- the only instances when SEO doesn't matter -- might vary, and include:
Privacy
When you want to hide content from the general public for some reason, you have no incentive to optimize a site for search engines.
Exclusivity
If you're offering something you don't want the general public to have, you need not necessarily optimize that.
Security
For example, say, you're an SEO looking to improve your domain's page load time, so you serve static content through a cookieless domain. Although the cookieless domain is used to improve the SEO of another domain, the cookieless domain need not be optimized itself for search engines.
Testing In Isolation
Let's say you want to measure how many people link to a site within a year which is completely promoted with AdWords, and through no other medium.
When One's Business Doesn't Rely On The Web For Traffic, Nor Would They Want To
Many local businesses or businesses which rely on point-of-sale or earning their traffic through some other mechanism than digital marketing may not want to even consider optimizing their site for search engines because they've already optimized it for some other system, perhaps like people walking down a street after emptying out of bars or an amusement park.
When Competing Differently In A Saturated Market
Let's say you want to market entirely through social media, or internet cred & reputation here on SE. In such instances, you don't have to worry much about SEO.
Keep it real and build for users, not for robots, and you will reach success!
Thanks!

SEO for product known by different names

If you're selling widgets, we all know that having "Bob's Widgets" in the title and the H1 gives you a better ranking in Google when people search for "widgets".
But what if, as someone explained to me the other day, their product is known by different names in different parts of the world?
In the US, it's called a Widget. In Canada, it's called a Flidget. In Australia, it's called a Zidget. There's really no official name for it, just informal names.
Meta-tags are no problem, but apart from that, what's the best way to cope with that situation? Just make separate pages? You can't have 3 H1s on the page. One H1 which says "Widgets, (aka Flidgets, Zidgets)"?
Or do I just trust that Google is smart enough and some magical taxonomy database groups those three words together as the same thing?
EDIT: This question got downvoted simply because it's about SEO? How bizarre. If you even bother to read the question, you can see I'm not trying to game the system or get away with anything. I have a genuinely interesting question and a valid client need.
Please note also, that I always use semantic HTML, I am well aware of how search engine rankings work, and I'm not trying to get away with anything shady.
If my client was selling beer, I would simply use semantic HTML to put the word "beer" first and foremost. If I was selling beer to French people, I would make another page in French and do the same with "biere". But imagine for a second that beer isn't called "beer" in other English-speaking nations. Imagine it's called "reeb". How do I correctly, semantically code an English-language page when different English-language users will be searching using a different string, but searching for the same thing.
HTML meta-tags were originally created for the purpose of embedding exactly such metadata into a webpage. But because of the SEO industry and the commercialization of the web, meta-tags like 'keywords' are no longer used by major search engines.
With all of the advances in page ranking algorithms and intelligent search robots over the years, there's really not much to do in terms of active 'search engine optimization' for legitimate websites. In today's search environment, all you have to do is optimize your site for your visitors, and it will automatically be optimized for searching.
So you can passively optimize your site's ranking by doing any (or all) of the following:
Use good spelling and writing etiquette (like not writing your entire site in caps or text-message-speak)
Format your pages using proper markup. (Title your document, mark your headings with H1/H2/etc., delimit your paragraphs, and so on and so forth.)
Abide by established web standards and write well-formed code.
Weed out broken links and make sure your site works properly.
Don't use pop-ups, don't cover your site with banner ads, and don't otherwise bombard visitors with advertising.
Don't link to disreputable websites
Simply put, make your site as user-friendly and as accessible as possible. If your site is useful to visitors and provides valuable content, most major search engines like Google or Yahoo! are smart enough to rank it fairly. Your ranking may be modest at first. But if you're genuinely supplying quality content then, as your site becomes better established on the web, other sites will start linking to you, increasing your search ranking.
And if other webpages linking to your site use the various names & nicknames your product is referred to by, then your site will also be associated with those names/keywords (that's how Google Bombing works). Google also tracks synonymous search terms and is even smart enough to recommend related/alternative search terms in some cases.
On the other hand, if you're creating a spam site or the 10 millionth affiliate marketing website with the same exact products and content as the other 9,999,999 sites of the same exact nature, then expect your search engine ranking to be reasonably poor.
It's generally only websites with no original content and that provide no legitimate value to visitors that require active (black hat) SEO techniques to gain a decent ranking--polluting search results in the process. Otherwise, if you're actually building a useful website, then just optimize it for your visitors and let Google/Yahoo! do their job.
The anchor text of your inbound links is a lot more important than the tags you use. So try getting links to your page with both "beer" and "reeb". As long as you'll get enough links with both terms, you'll do well in SERPs, no matter the keywords you use in it.
One option is to localize pages for the different target regions you are interested in.
If you use a local domain, Google will give it priority in default searches from that country. When I hit www.google.com, it redirects me to www.google.com.mx, and any search I do tends to rank results from Mexican domains highly. I actually have to change a couple of options when I don't want that behavior.
I also think Google has an option to map parts of a site to a region, so you can keep a single domain.
Update: Regarding the beer example, you can localize per country (which is what I mention above). Actually it's not that special a need, since British English and US English have their differences.
The talk has been language-agnostic, but consider how .NET handles resources. Let's say the current request is being processed for en-GB, and you look for a resource (e.g. a text, an image, etc.). It will first try to find the resource for the specific culture, en-GB; if it isn't found, it will look under the more general en (and then in the default resource file).
This allows you to selectively localize only what you really need in the more specific resource files. If you only need to localize the resource with the key beerName, you can configure just that for the specific languages and leave the rest alone.
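The same fallback idea, sketched outside .NET in plain PHP (the resource arrays and keys below are made up for illustration):

```php
<?php
// Resources keyed by culture: specific culture, neutral language, and default ('').
$resources = [
    'en-GB' => ['beerName' => 'reeb'],
    'en'    => ['beerName' => 'beer', 'greeting' => 'Hello'],
    ''      => ['beerName' => 'beer', 'greeting' => 'Hello', 'farewell' => 'Goodbye'],
];

// Look up a key for "en-GB", then fall back to "en", then to the default set.
function lookup(array $resources, string $culture, string $key): ?string
{
    $neutral = explode('-', $culture)[0];
    foreach ([$culture, $neutral, ''] as $candidate) {
        if (isset($resources[$candidate][$key])) {
            return $resources[$candidate][$key];
        }
    }
    return null;
}

echo lookup($resources, 'en-GB', 'beerName');  // "reeb" from the specific culture
echo lookup($resources, 'en-GB', 'greeting');  // falls back to the neutral "en"
```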