I'm working on a website that deals with many languages. When a user visits example.com, a little PHP script detects the browser's preferred language (based on the Accept-Language header) and redirects with header('Location: ...') to en.example.com, it.example.com, es.example.com, etc.
Now, this works perfectly, but I found that search engines fail to index the homepage properly. I don't know much about the HTTP protocol, but I realize I'm doing something wrong here. Does anyone have a suggestion on how to fix this?
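For illustration, a minimal sketch of this kind of detection/redirect script (the supported language list and the Accept-Language parsing here are simplified assumptions, not the actual code):
<?php
// Simplified sketch of the redirect described in the question: pick a
// subdomain from the Accept-Language header. Language list is assumed.
$supported = array('en', 'it', 'es');
$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';

$lang = 'en'; // fallback
foreach (explode(',', $header) as $part) {
    // Each entry looks like "it-IT;q=0.8"; keep only the primary subtag.
    $code = strtolower(substr(trim($part), 0, 2));
    if (in_array($code, $supported, true)) {
        $lang = $code;
        break;
    }
}

header('Location: http://' . $lang . '.example.com/');
exit;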
Why did you do this?
With your setup, I will never see your website in German if I visit from a user agent that only accepts it or es.
You should give users the possibility to choose which language they want, and then Google and co. will be more of a friend to you.
I have a website with two languages, which works in this format:
example.com/changelanguage.xx?lang=de
which redirects to the German version,
and calling the same URL again like:
example.com/changelanguage.xx?lang=en
redirects to the English version.
The URL stays example.com after the redirect; only the language changes.
How can I add the hreflang attribute here (for Google indexing)?
It’s a bad practice to use the same URL for different (i.e., translated) content.
Consumers, like search engine bots, would use rel-alternate + hreflang markup to find translations. For this to work, you have to provide a different URL for the translated page.
From the search engine's perspective, using the same URL doesn't work for their users: when they give http://example.com/foobar as a search result, they want to make sure their users get the language the search engine intended (e.g., someone searching for German terms should get the German page). But with your system this isn't guaranteed; the search engine's user might end up with the English version.
Instead, you should represent the language in the URL, e.g. the language code as the first path segment:
http://example.com/en/contact
http://example.com/de/kontakt
(Or use different domains/subdomains, or add a query parameter, etc. If you can make sure that translated pages would never have the same URL slug, you could even omit the language codes.)
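For completeness, a small sketch of the rel-alternate-hreflang markup that would go in the head of each version, generated here with a bit of PHP (the URL list is just the example above; every translation, including the current page, should be listed on every version):
<?php
// Sketch: emit rel="alternate" hreflang links for each translation.
// The URLs follow the /en/ and /de/ example above; adjust to your own.
$translations = array(
    'en' => 'http://example.com/en/contact',
    'de' => 'http://example.com/de/kontakt',
);

foreach ($translations as $lang => $url) {
    echo '<link rel="alternate" hreflang="' . $lang . '" href="'
        . htmlspecialchars($url) . '" />' . "\n";
}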
This is a year late, but https://www.bablic.com/ does exactly this!
Furthermore, it can automatically detect the language set in the user's browser and show the user your website in that language!
Recently I made my site reroute the URL based on the language set in the visitor's browser. So if a Swedish visitor came to the site, he was rerouted to mysite.com/sv, and an English visitor to mysite.com/en.
Soon after I released this, my Google rank just plummeted. So how did I go wrong here? Is there some common practice for auto-redirecting visitors based on their locale that doesn't hurt SEO, or do I need to set some kind of HTTP code for this to be approved by search engines?
The penalty you've acquired is for cloaking.
Short answer: Don't do redirects yourself - instead use hreflang codes and canonical links, then let the person's Google settings decide.
A Swedish person searching on google.com wants the English version, even if their browser is Swedish. Google runs checks using different user agents from different locations to test whether you're serving them the same content you serve everyone else. When this differs, your site gets flagged for attempting to hide its true content, hence 'cloaking'.
More here: https://support.google.com/webmasters/answer/66355?hl=en
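A sketch of what that looks like in practice: each language URL serves its content to every visitor and only declares its alternates plus a canonical for itself, rather than redirecting (the x-default choice and the URLs are assumptions based on the question):
<?php
// Sketch: annotate instead of redirecting. This would run on mysite.com/en/.
$canonical  = 'http://mysite.com/en/';
$alternates = array(
    'en'        => 'http://mysite.com/en/',
    'sv'        => 'http://mysite.com/sv/',
    'x-default' => 'http://mysite.com/en/', // assumed default version
);

echo '<link rel="canonical" href="' . $canonical . '" />' . "\n";
foreach ($alternates as $lang => $url) {
    echo '<link rel="alternate" hreflang="' . $lang . '" href="' . $url . '" />' . "\n";
}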
I'm completely stumped on an SEO issue and could really use some direction from an expert. We built a website recently, http://www.ecovinowines.net, and because it is all about wine, we set up an age verification step that requires the user to click before entering the site. By using cookies, we prevent the user from accessing any page on the site before clicking the age verification link. It's been a couple of months since launching the site, so I thought I'd check out some keywords on Google. I just typed in the name of the website to see what pages would be indexed, and it is only showing the age verification pages. From the googling I've done, apparently nothing behind the age verification is visible to the Google bots because they ignore cookies.
Is there no safe workaround for this? I checked out New Belgium's site, which uses a similar age verification link, and all of its pages seem to be getting indexed. Once you click on one of its links from Google, it redirects you to the age verification page. Are they not using cookies? Or how might they be getting around the cookie/bot issue?
Do a test for Googlebot's user agent and allow access if it matches. You might want to let other search engines through too. Some common bot user agent strings:
Googlebot/2.1 (+http://www.google.com/bot.html)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
ia_archiver
Semi-official response from Google:
This topic comes up periodically for sites (alcohol, porn, etc.) that need to serve an age verification notice on every page. What we recommend in this case is to serve it via JavaScript. That way users can see the age verification any time they try to access your content, but search engines that don't run JavaScript won't see the warning and will instead be able to see your content.
http://groups.google.com/group/Google_Webmaster_Help-Tools/browse_thread/thread/3ce6c0b3cc72ede8/
I think a more modern technique would be to render all the content normally, then obscure it with a Javascript overlay.
I had a quick look at New Belgium and it's not clear what they're doing. Further investigation needed.
Assuming you are using PHP, something like this will do the job. You'd need to add other bots if you want them.
$isBot = strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false;
Then you can toggle your age verification on or off based on that variable.
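For example, a minimal sketch of that toggle, checking a few of the bots listed earlier and the age cookie from the question (the cookie name and gate URL are assumptions):
<?php
// Sketch: skip the age gate for known crawlers; otherwise require the
// verification cookie before showing any content.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$isBot = false;
foreach (array('Googlebot', 'msnbot', 'Yahoo! Slurp', 'ia_archiver') as $bot) {
    if (strpos($ua, $bot) !== false) {
        $isBot = true;
        break;
    }
}

if (!$isBot && empty($_COOKIE['age_verified'])) { // assumed cookie name
    header('Location: /age-verification.php');    // assumed gate URL
    exit;
}
// Otherwise render the page content as normal.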
I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is that these websites are all different (some static, some dynamic: HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will only get you so far; it won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and one of "these are not". Then you might want to go manually check the ones that are not, and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection; if you write your own "check for password-protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt.
I recommend using the Python port of Perl's mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
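A minimal sketch of such a checker in PHP with cURL, using just the two heuristics discussed in this thread, a 401 response and a password field (the filenames are assumptions, and real sites will need more checks than this):
<?php
// Sketch: read a list of sites, fetch each homepage, and sort them into
// "definitely passworded" vs. "unknown" based on two simple heuristics.
$sites = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($sites as $site) {
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    $protected = ($code === 401)                                             // HTTP auth
        || ($body !== false && stripos($body, 'type="password"') !== false); // login form

    file_put_contents($protected ? 'passworded.txt' : 'unknown.txt', $site . "\n", FILE_APPEND);
}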
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form can be found within two links of the home page. Almost all e-commerce sites, web apps, etc. have login forms that are reached by clicking a single link on the home page, but another layer or even two of depth would almost guarantee that you don't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files with the extensions listed in the -A option, so it avoids heavy files. Note that -S prints the server response headers to stderr, which is why they are redirected with 2> above.
The headers end up in output_yoursite.txt, which you can then parse for a 401 status, meaning that part of the site requires authentication, and you can also search the downloaded files per Konrad's recommendation.
Looking for 401 codes won't reliably catch them, as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, a site won't show you anything you need to log in for. OTOH some sites (ones with all static content, for example) manage to pop up a login dialog box for some pages, so looking for password input tags would also miss stuff.
My advice: find a spider program whose source you can get, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file, read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
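In PHP's cURL binding, for example, that callback could be a header function; a small sketch, reusing the 401 heuristic from above:
<?php
// Sketch: use a cURL header callback to watch for a 401 status line.
$needsAuth = false;

$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $headerLine) use (&$needsAuth) {
    if (preg_match('#^HTTP/\S+\s+401#', $headerLine)) {
        $needsAuth = true;
    }
    return strlen($headerLine); // cURL expects the number of bytes handled
});
curl_exec($ch);
curl_close($ch);

echo $needsAuth ? "Requires authentication\n" : "No 401 seen\n";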
We basically have two sites (Java/JSP/Apache web server):
something.ca & something.com
The .ca is canadian content, and the .com is american content.
We need users to be redirected based on their IP address.
We want US users to get the .com site and Canadian users get the .ca site.
What is the best way to do this (at the web server level or otherwise)?
Please elaborate.
In my web surfing experience, most websites - UPS.com for example - ask the user to select their country site rather than trying to figure it out themselves. They remember the selection in a cookie. Much depends on how voluntary your use case requires this redirection to be.
On the implementation side, I'd use a filter that would check the setting and fire a redirect if need be.
I've used GeoIP from MaxMind and it works well. They have a free version, GeoLite Country, that's 99.3% accurate. The Java API is here. I would follow Google's practice of having a link back to the original version if you do the redirect.
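The question is about a Java/JSP stack, but to illustrate the lookup-then-redirect idea, here is a short sketch in PHP with the PECL geoip extension (the equivalent would live in a servlet filter; the URLs are taken from the question):
<?php
// Sketch: look up the country for the client IP and send Canadian visitors
// to the .ca site, everyone else to the .com site.
$country = geoip_country_code_by_name($_SERVER['REMOTE_ADDR']);

$target = ($country === 'CA') ? 'http://something.ca' : 'http://something.com';

// Per the advice above, the page itself should also link back to the other
// version so users (and crawlers) can still reach it.
header('Location: ' . $target . $_SERVER['REQUEST_URI']);
exit;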
Check out GeoDirection. It may handle what you want through JavaScript.
http://www.geobytes.com/GeoDirection.htm
Another option would be to grab the culture from the browser environment settings and map those cultures to countries in your application. Depending on what you are actually trying to do, this may not work for you, as it will not give you the user's physical location, only their preferred culture. So if a Canadian travels to the US, they will still get the Canadian site unless they change their browser settings for some reason.
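A rough sketch of that culture-mapping idea, using the Accept-Language header (the parsing and the mapping to the two sites are simplified assumptions):
<?php
// Sketch: infer the site from the browser's preferred culture, e.g. "en-CA"
// or "fr-CA" goes to the .ca site, everything else to the .com site.
$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
$parts  = explode(',', $header);
$first  = strtolower(trim($parts[0])); // e.g. "en-ca"

$isCanadian = (strpos($first, '-ca') !== false);

header('Location: ' . ($isCanadian ? 'http://something.ca' : 'http://something.com'));
exit;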
There are a lot of IP geolocation APIs out there; I don't know if there's anything good that you don't have to pay for.
Using culture settings is an option, but it doesn't work in some cases. What if you have a German user in the US who likes his dates etc. displayed in the format he's comfortable with? That doesn't change the fact that he's in the US.
I think that's one of the reasons why most companies simply ask the user and then store that information in a cookie (UPS, FedEx, and most major airlines do that). Check out www.lufthansa.com. They actually ask for location and language (to account for countries with more than one official language, like Switzerland).
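A small sketch of that ask-once-and-remember pattern (the cookie name, the chooser parameter, and the one-year lifetime are assumptions):
<?php
// Sketch: let the user pick a country once, remember it in a cookie,
// and only redirect when a stored preference exists.
if (isset($_GET['country'])) {
    // The user picked from a country chooser, e.g. /choose.php?country=ca
    $choice = ($_GET['country'] === 'ca') ? 'ca' : 'us';
    setcookie('preferred_country', $choice, time() + 365 * 24 * 3600, '/');
    header('Location: /');
    exit;
}

if (isset($_COOKIE['preferred_country']) && $_COOKIE['preferred_country'] === 'ca') {
    header('Location: http://something.ca' . $_SERVER['REQUEST_URI']);
    exit;
}
// No stored preference yet: show the country/language chooser instead of guessing.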