How to let Googlebot access pages behind a login - authentication

I have searched the net and here too, but am still looking for a solid answer on allowing Googlebot to access pages behind my login.
Is there a secure way to do this?
I have added a login allowance through AdSense, but wish to go further than just permitting pages that contain AdSense content.
I receive reports that 238 pages have access-denied errors.
Would appreciate some help here.
Kind Regards Chris

How about checking the IP address (whether it starts with 66.249.*) and the user agent (Googlebot), and serving the authorized pages if both conditions match?
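For what it's worth, a minimal PHP sketch of that check could look like the following; the function name and the simple prefix test are my own, and both the IP and the user agent can be spoofed, so treat it as a heuristic only.

<?php
// Rough check: does the request look like it comes from Googlebot?
// Both values can be spoofed, so treat this as a heuristic only.
function looks_like_googlebot() {
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    $ip_matches = (strpos($ip, '66.249.') === 0);
    $ua_matches = (stripos($ua, 'Googlebot') !== false);

    return $ip_matches && $ua_matches;
}

if (looks_like_googlebot()) {
    // Serve the full page without requiring a login.
} else {
    // Fall through to the normal login check.
}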

Related

Redirect all traffic to holding page unless logged in using .htaccess

I currently have a landing page setup on my domain.com which already receives traffic.
It will shortly be replaced with an online store. I need to upload this store to my live server in order to get it approved by the Merchant Facility Providers (MFP), and they require it to be accessible from its final live location on domain.com in order to grant approval. I can't have users access this site until it has met approvals.
To accomplish this I wish to redirect all domain.com traffic to domain.com/holding/ except for MFP visitors.
Ideally this would be restricted by IP address, however MFP say they will need to grant a number of external parties access, and so IP address based access will not be acceptable and I should use passwords.
So my question is, how can I automatically redirect all traffic from domain.com to the holding page domain.com/holding/ unless they have logged in using a password at domain.com/login?
Users visiting the domain.com should not be asked for a password.
Will this be possible using just .htaccess/.htpasswd?
If so, can someone suggest how the logic could work?
It's not possible using just an .htaccess file as all visitors would be presented with an HTTP standard authentication dialog if you enabled it on your domain.com site at the doc_root level.
It's hard to say without knowing what scripting language you're using (you've not indicated it in the tags, just Apache), but you could provide one index page that acts both as a landing page for users/potential users and as a login (username/password form) for MFP parties (wherever they may come from).
That way, you fulfil both needs without offending or discriminating in any way against any party.
As #nickhar has pointed out, there appears to be no way of doing this using just .htaccess.
My solution was to use a rewrite rule to redirect all requests from domain.com to domain.com/holding unless a specific cookie was set (checked for using RewriteCond %{HTTP_COOKIE}).
I set this cookie in a php script on domain.com/login, which was password protected using .htaccess/.htpasswd.
This is by no means a particularly secure solution, but is adequate for my purposes of keeping the site hidden from general traffic while the approval process is completed.
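For reference, the cookie-setting side of that approach can be very small. This sketch assumes the login script lives at domain.com/login/ behind .htpasswd; the cookie name and value are just examples that the RewriteCond %{HTTP_COOKIE} check in the main .htaccess would have to match.

<?php
// domain.com/login/index.php - reached only after passing .htpasswd auth.
// Cookie name and value are illustrative; match them in your RewriteCond.
setcookie('site_preview', 'approved', time() + 60 * 60 * 24, '/');

// Send the visitor on to the real site root.
header('Location: /');
exit;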

Preventing any external links from the site

I am using DokuWiki, and as we've tried to secure it as much as possible, the best security for us is to keep its location on our server secret. Therefore we want to make sure no link can be clicked on any page which would reveal the location of our infrastructure. Is there any way to configure this restriction in DokuWiki, or are there known ways to pass URLs through a third party?
Did you try to protect the site with .htaccess and .htpasswd? It is a good solution for keeping others from entering your site.
And if the site is online, you should include a robots.txt to keep crawlers from indexing it:
User-agent: *
Disallow: /
Hope this helps.

How to solve anti-leech in a better way?

As I came across the hot-linking (leeching) problem, I searched the web and found two ways to solve it.
The first is an easier and simpler way, with the code shown below:
RewriteEngine On
Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain.com(/)?.*$ [NC]
RewriteRule .*\.(gif|jpg|jpeg|png|swf)$ [mydomain.com...] [R,NC]
This can only prevent some simple leeching, but can do nothing against a determined person.
The other way is a better way, with a script-and-cookies-based approach. They said: "You set a cookie on an 'authorized' page of your site, and then use a script to serve images only if the correct cookie is present in the image request. Images are kept in a directory accessible only to the script, and not via the Web. So, the script acts as an 'image server' on your site." I understand this principle but don't have any idea how to implement it. Does anyone know how to realize this?
Any help appreciated.
I can't really give any implementation, but only some idea of how it can be achieved:
You will need a "portal" page, where you set the cookie for the user. Any request for resources without a cookie from your site should be redirected here. There may or may not be a login mechanism here, depending on the purpose of your site, but usually you will set the cookie after the user has logged in.
All resource links will point to the same "script" page. The difference is that each resource will have a different identifier (it can be some sort of id, if you maintain a database of id-to-file-path mappings). The identifier must be included in the query string of the URL. The "script" will find the resource on the server based on the identifier (in the case of id-to-file mapping, you will obtain the file path and go retrieve the file).
There will be a "script" page, which can be PHP code, for example. It will check for the cookie, then check the identifier, then load the resource accordingly. You may also want to check the Referer to restrict access a bit more (without checking it, hot linking will work for any logged-in user).
In this implementation, sharing a hot link to a resource will not work for any user who hasn't visited the "portal" page (or hasn't logged in, depending on your web site). It will also not work even for logged-in users if they click the link from somewhere else.
However, scraping your website for resources is simple with both implementations mentioned in your question, since a scraper can freely adjust the HTTP headers.
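To make that concrete, here is a rough PHP sketch of such an "image server" script. The file name image.php, the cookie name, and the id-to-path map are all placeholders.

<?php
// image.php?id=... - serves files from a directory that is not web-accessible.
// The cookie name and the id-to-path map below are only examples.
$allowed = array(
    'logo'   => '/var/www/private-images/logo.png',
    'banner' => '/var/www/private-images/banner.jpg',
);

// 1. The visitor must have the cookie that was set on the "portal" page.
if (!isset($_COOKIE['img_token'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// 2. Optionally also require a Referer from your own domain.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if ($referer !== '' && stripos($referer, 'mydomain.com') === false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// 3. Look up the requested identifier and stream the file.
$id = isset($_GET['id']) ? $_GET['id'] : '';
if (!isset($allowed[$id])) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

header('Content-Type: image/png'); // adjust per file type
readfile($allowed[$id]);

Pages would then embed images as image.php?id=logo rather than linking to the image files directly.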

Using DNS to Redirect Several Domains to One Single Site. Disaster?

When I searched for our web site on Google, I found three sites with the same content showing up. I always thought we were using only one site, www.foo.com, but it turns out we also have www.foo.net and www.foo.info with the same content as www.foo.com.
I know it is extremely bad to have the same content under different URLs. And it seems we have been using three domains for years, and I have not seen any penalty so far. What is going on? Is Google using a new policy like this blog advocates? http://www.seodenver.com/duplicate-content-over-multiple-domains-seo-issues/ Or is it OK to use a DNS redirect? What should I do? Thanks
If you are managing the websites via Google Webmaster Tools, it is possible to specify the "primary domain".
However, the world of search engines doesn't stop with Google, so your best bet is to send a 301 redirect to your primary domain. For example:
www.foo.net should 301 redirect to www.foo.com
www.foo.net/bar should 301 redirect to www.foo.com/bar
and so on.
This will ensure that www.foo.com gets the entire score, rather than (potentially) a third of the score that you might get for link-backs (internal and external).
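If the sites are served by PHP, one way to issue those redirects is a small snippet included at the top of every page. This is only a sketch; the host name and scheme are assumptions, and the same thing can be done with a RewriteRule in .htaccess if you prefer to keep it out of application code.

<?php
// Send a permanent (301) redirect to the primary domain for any other host.
// 'www.foo.com' is the example primary domain from above.
if (strtolower($_SERVER['HTTP_HOST']) !== 'www.foo.com') {
    header('Location: http://www.foo.com' . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}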
Look into canonical links, as documented by Google.
If your site has identical or vastly similar content that's accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.
They explicitly state it will work cross-domain.
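The documented format is a single link element in the head of each duplicate page, pointing at the version you prefer (the URL here is just the example domain from above):

<link rel="canonical" href="http://www.foo.com/bar" />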

Should I get rid of bots visiting my site?

I have been noticing on my trackers that bots are visiting my site A LOT. Should I change or edit my robots.txt or change something? Not sure if that's good; are they indexing, or what?
Should I change or edit my robots.txt or change something?
Depends on the bot. Some bots will dutifully ignore robots.txt.
We had a similar problem 18 months ago with the Google ad bot, because our customer was purchasing Soooo many ads.
Google AD bots will (as documented) ignore wildcard (*) exclusions, but listen to explicit ignores.
Remember, bots that honor robots.txt will just not crawl your site. This is undesirable if you want them to get access to your data for indexing.
A better solution is to throttle or supply static content to the bots.
Not sure if that's good; are they indexing, or what?
They could be indexing/scraping/stealing. All the same really. What I think you want is to throttle their HTTP request processing based on user agents. How to do this depends on your web server and app container.
As suggested in other answers, if the bot is malicious, then you'll need to find the user-agent pattern and send it a 403 Forbidden (sketched below). Or, if the malicious bots dynamically change user-agent strings, you have two further options:
White-list user agents - e.g. create a user-agent filter that only accepts certain user agents. This is very imperfect.
IP banning - block the offending requests by their source IP address. Or, if you're getting DoS'd (denial-of-service attack), then you have bigger problems.
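As a rough PHP illustration of the 403 approach (the user-agent patterns below are just examples; real block lists need constant maintenance):

<?php
// Deny requests whose user agent matches a known-bad pattern.
// The patterns here are examples only.
$bad_agents = array('libwww-perl', 'HTTrack', 'WebZIP');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($bad_agents as $pattern) {
    if (stripos($ua, $pattern) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}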
I really don't think changing the robots.txt is going to help, because only GOOD bots abide by it. All the others ignore it and parse your content as they please. Personally, I use http://www.codeplex.com/urlrewriter to get rid of the undesirable robots by responding with a forbidden message if they are detected.
The spam bots don't care about robots.txt. You can block them with something like mod_security (which is a pretty cool Apache plugin in its own right). Or you could just ignore them.
You might have to use .htaccess to deny some bots that screw with your logs.
See here : http://spamhuntress.com/2006/02/13/another-hungry-java-bot/
I had lots of Java bots crawling my site, adding
SetEnvIfNoCase User-Agent ^Java/1. javabot=yes
SetEnvIfNoCase User-Agent ^Java1. javabot=yes
Deny from env=javabot
made them stop. Now they only get 403 one time and that's it :)
I once worked for a customer who had a number of "price comparison" bots hitting the site all of the time. The problem was that our backend resources were scarce and cost money per transaction.
We tried to fight some of these off for a while, but the bots just kept changing their recognizable characteristics. We ended up with the following strategy:
For each session on the server we determined if the user was at any point clicking too fast. After a given number of repeats, we'd set the "isRobot" flag to true and simply throttle down the response speed within that session by adding sleeps. We did not tell the user in any way, since he'd just start a new session in that case.
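A much simplified PHP sketch of that idea follows; the thresholds, session keys, and delay are invented for the example.

<?php
// Run at the start of every request. Thresholds are arbitrary examples.
session_start();

$now      = microtime(true);
$last     = isset($_SESSION['last_request']) ? $_SESSION['last_request'] : 0;
$too_fast = isset($_SESSION['too_fast']) ? $_SESSION['too_fast'] : 0;

// Count requests that arrive less than half a second after the previous one.
if ($now - $last < 0.5) {
    $too_fast++;
}
$_SESSION['last_request'] = $now;
$_SESSION['too_fast']     = $too_fast;

// After enough rapid-fire requests, silently flag the session as a robot.
if ($too_fast > 20) {
    $_SESSION['isRobot'] = true;
}

// Throttle flagged sessions by delaying the response; the user is never told.
if (!empty($_SESSION['isRobot'])) {
    sleep(2);
}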