Recently I made my site redirect the URL based on the language set in the visitor's browser. So if a Swedish visitor came to the site, they were redirected to mysite.com/sv, and an English visitor to mysite.com/en.
Soon after I released this, my Google rank just plummeted. So how did I go wrong here? Is there some common practice for auto-redirecting visitors based on their locale that doesn't hurt SEO, or do I need to send some kind of HTTP status code for this to be accepted by search engines?
The penalty you've acquired is for cloaking.
Short answer: Don't do redirects yourself - instead use hreflang codes and canonical links, then let the person's Google settings decide.
A Swedish person searching on google.com wants the English version, even if their browser is set to Swedish. Google runs checks where it crawls with different user agents from different locations to test whether you're serving it the same content everyone else sees. When this differs, your site gets flagged for attempting to hide its true content, hence 'cloaking'.
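A minimal sketch of the hreflang/canonical approach (the /sv and /en paths come from your question; the scheme and trailing slashes are assumptions, adjust to your real URLs). Each language version of a page points at itself canonically and lists every alternate in its <head>, for example on the English page:

<link rel="canonical" href="http://mysite.com/en/" />
<link rel="alternate" hreflang="en" href="http://mysite.com/en/" />
<link rel="alternate" hreflang="sv" href="http://mysite.com/sv/" />
<link rel="alternate" hreflang="x-default" href="http://mysite.com/" />

With that in place, Google can serve the right version to the right searcher without you redirecting anyone.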
More here: https://support.google.com/webmasters/answer/66355?hl=en
I just made a website for an alcoholic drink. They need to have the age verification on all pages. It's a single-page website and I use Backbone's routing system. I've created the check with the SESSION object, so I load the intro view (the age verification view) if the SESSION object is unset. This is all working as expected, but the problem is Google's bots: when they try to crawl my pages, the app always loads the intro (age verification) view. Here is a link for the website , but I think it won't be very useful, because I guess that this is more a logical than a technical question...
So... my question is: how do I show the age check only to visitors, and let Google's bots see the actual content of the page? Should I use cookies, or is there a way to achieve this with PHP?
Yes. Something like
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) { // the real UA is longer than just "Googlebot", so check for the substring
    $_SESSION['ageverified'] = true;
    // do more
}
Should work.
See here for all the exact user-agent names and what they crawl.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1061943
I'm working on a website that deals with many languages, and when a user visits example.com, a little PHP script detects the browser's preferred language (based on the Accept-Language header) and, based on that, redirects using header('Location: ...') to en.example.com, it.example.com, es.example.com, etc.
Now, this works perfectly, but I found that search engines fail to index the homepage properly. I don't know much about the HTTP protocol, but I realize I'm doing something wrong here. Does anyone have a suggestion on how to fix this?
Why did you do this?
In your case, I will never see your website in German if I call the site from a user agent that accepts only it or es...
You should give users the possibility to choose which language they want... and then Google and co. will be much more of a friend to your site.
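A hypothetical PHP sketch of that idea (the subdomains come from your question; the markup and the fallback language are assumptions): keep example.com itself as a normal, indexable page in a default language, use Accept-Language only to pre-select a suggestion, and let the user pick explicitly instead of sending a Location: header.

<?php
// Sketch: suggest a language version instead of redirecting to it.
$versions = array('en' => 'en.example.com', 'it' => 'it.example.com', 'es' => 'es.example.com');

$accept = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '';
$suggested = 'en';                        // fall back to the default language
foreach (array_keys($versions) as $lang) {
    if (strpos($accept, $lang) === 0) {   // very rough parsing, good enough for a hint
        $suggested = $lang;
        break;
    }
}

echo '<ul>';
foreach ($versions as $lang => $host) {
    $hint = ($lang === $suggested) ? ' (suggested)' : '';
    echo '<li><a href="http://' . $host . '/">' . $lang . $hint . '</a></li>';
}
echo '</ul>';

That way crawlers and users who land on example.com all see the same indexable page, and nobody is forced onto a language they did not ask for.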
If the homepage of a website shows one set of content when a user is not logged in and different content when the user logs in, would a search engine bot be able to crawl the user-specific content?
If they are not able to crawl it, then I can duplicate the content from another part of the website to make it easily accessible to users who stated their needs at registration time.
My guess is no, but I would rather make sure before I do something stupid.
You cannot assume that crawlers support cookies, but you can identify the crawler and log it in to your site in code. However, this opens the door for any user to pretend to be a crawler in order to get at the data in the logged-in area.
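If you do go that route, the usual safeguard against spoofing is to verify the claimed crawler by reverse and forward DNS rather than trusting the User-Agent header alone (this is what Google documents for recognizing the real Googlebot). A hedged PHP sketch; the session flag name is made up:

<?php
// Sketch: verify a claimed Googlebot with a reverse DNS lookup,
// then a forward lookup that must resolve back to the same IP.
function isVerifiedGooglebot() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'Googlebot') === false) {
        return false;                                    // doesn't even claim to be Googlebot
    }
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                          // reverse lookup
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                    // hostname isn't Google's
    }
    return gethostbyname($host) === $ip;                 // forward lookup must match
}

if (isVerifiedGooglebot()) {
    $_SESSION['is_crawler'] = true;                      // treat the verified crawler as logged in
}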
The bot will be able to see all the content in your document. If the content does not exist in the document, then it will not be seen by the bot. If it exists in the document but is hidden from view, the crawler will be able to pick it up.
Even if this could be done, showing the crawler content that differs from what any user gets on entry is against the terms of most search engines and can cause your site to be banned from the index.
This is why sites like expertsexchange have to provide the answer if you scroll all the way to the bottom, even though they try to make it look like you have to register. (This only works if you enter expertsexchange with a Google referrer, by the way, for this very reason.)
I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Unauthorized". But the problem is that these websites are all different; some are static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you part of the way, but it won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes the results to two files: "these are definitely passworded" and "these are not". You might then want to manually check the ones that are not, and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection; if you write your own "check for password protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than are necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize (the mechanize library), or whatever nice web automation library your preferred language has. Almost all modern languages have a nice library for opening and searching through web pages and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
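As a starting point, here is a very rough sketch of that kind of checker in PHP with cURL (the file names, keywords, and output files are all assumptions; substitute mechanize or whatever library you prefer):

<?php
// Rough sketch: read one URL per line from sites.txt, fetch the homepage,
// and flag sites that answer 401 or whose HTML contains a password field
// or an obvious "log in" string.
$sites = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ));
    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    $protected = ($code === 401)
        || ($html !== false && (
               stripos($html, 'type="password"') !== false
            || stripos($html, 'log in') !== false
            || stripos($html, 'login')  !== false));

    file_put_contents($protected ? 'passworded.txt' : 'open.txt',
                      $url . PHP_EOL, FILE_APPEND);
}

One request per site keeps the load low, and the two output files give you the "definitely passworded" / "needs manual checking" split described above.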
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit httrack's download speed, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
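Once the mirror is on disk, the search itself can be as simple as a recursive grep for password inputs (a sketch; adjust the directory name to wherever httrack put the files):

grep -ril 'type="password"' mirrored_site/

That lists the files containing an HTML password field, which is usually enough to flag the site for the "passworded" pile.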
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files with the extensions listed in the -A option; in this case that helps avoid heavy files.
The headers are written to output_yoursite.txt (note the 2> redirect, since wget logs the -S header output to stderr), which you can then search for 401 status lines, meaning that part of the site requires authentication. You can also parse the downloaded files according to Konrad's recommendation.
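For example, a quick way to pull the 401 responses, together with the URL lines logged just before them, out of that file might be (the exact log layout varies a bit between wget versions):

grep -B 5 '401 Unauthorized' output_yoursite.txt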
Looking for 401 codes won't reliably catch them, as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, a site won't show you anything you need to log in for. On the other hand, some sites (ones with all static content, for example) manage to pop a login dialog box for some pages, so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
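A minimal sketch of that in PHP's cURL binding (the URL is a placeholder; a real run would loop over your list), using a header callback plus the final status code:

<?php
// Sketch: watch the response headers for WWW-Authenticate while fetching,
// then also check the final HTTP status code.
$needsAuth = false;

$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADERFUNCTION,
    function ($curl, $header) use (&$needsAuth) {
        if (stripos($header, 'WWW-Authenticate:') === 0) {
            $needsAuth = true;            // server is asking for credentials
        }
        return strlen($header);           // cURL requires the consumed length back
    });
curl_exec($ch);

if ($needsAuth || curl_getinfo($ch, CURLINFO_HTTP_CODE) === 401) {
    echo "password protected\n";
}
curl_close($ch);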