Blocking web spiders - Apache

I want to block all web spiders from vacuuming up my web site.
Is there a way?
I only found some Apache rules from 2008 (like this one):
http://perishablepress.com/ultimate-htaccess-blacklist/

Unfortunately, there isn't a way to block ALL scripts from accessing your site: if a human can, nothing prevents anyone from writing a spider that behaves similarly to a person and can therefore view every page.
You can look up techniques to prevent certain robots from accessing your site (this works for the majority of search engines), but if your site stays up long enough and gets some visits, it is likely to end up in some database eventually.
Have a look here.
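As a rough sketch of the "prevent certain robots" idea, the Python below shows the two usual layers: a robots.txt that polite crawlers honour, and a server-side User-Agent check for the ones that don't. The robots.txt content, the BLOCKED_AGENT_MARKERS list and the function name are assumptions made up for illustration. Neither layer stops a spider that sends a browser-like User-Agent, which is the point made above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical substrings of User-Agent headers to refuse outright.
BLOCKED_AGENT_MARKERS = ("httrack", "wget", "webcopier")


def is_blocked_agent(user_agent: str) -> bool:
    """Return True if the User-Agent contains a blocklisted marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BLOCKED_AGENT_MARKERS)


# What a well-behaved crawler would conclude from the robots.txt above.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
polite_crawler_allowed = parser.can_fetch("ExampleBot", "/index.html")
```

In Apache itself the same User-Agent test would normally live in the server configuration; the Python version only shows the decision logic.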

Related

SEO: how can dynamic URL with query strings be searched by search engine bots?

I’m developing an ecommerce web site in ASP.NET using SQL server 2008 database.
Most of my pages are database driven and all the content is gathered from a SQL Server.
Every product page is created dynamically from data coming from the database, hence every product’s page URL has a unique query string, containing a “product_id” variable.
Example: http://www.myecommence.com/products.aspx?product_id=1
I'd like to improve my Search Engine Optimization.
Dealing with a small number of products could be fine but what if I
had more than 1000 products, how could every product be crawled?
How does the google spider/bot know that a product_id with a
hypothetical number of 767 exists?
I’ve been googling this, still I can’t understand how pages that
have absolutely no reference in the site or external sites can be
crawled? If this is possible the spider should know how to read the
website’s database tables, but I guess that this is not the case.
At this point since most of the pages and links are dynamic how
could they be indexed, the same thing applies to “user detail” pages
that are accessed via query string using a “user id=n”?
What I’m asking has probably been discussed already, but some points are still unclear to me.
I would advise using URL rewriting rules (mod_rewrite or its ASP.NET equivalent) to make your URLs search engine friendly.
This is very important for Google, as is a good category structure.
Eg:
domain.com/t-shirts/girls/star-wars-t-shirt/
is far better than
domain.com/products.aspx?product_id=1
Here is some info:
http://msdn.microsoft.com/en-us/library/ms972974.aspx
http://www.wrox.com/WileyCDA/Section/id-305997.html
To answer your questions:
Dealing with a small number of products could be fine but what if I had more than 1000 products, how could every product be crawled?
If you have a good sitemap / menu structure etc, it is likely that Google will crawl all your pages.
How does the google spider/bot know that a product_id with a hypothetical number of 767 exists?
By crawling your site, via your sitemap, via the menu system on the site, etc. However, always remember: Google is not psychic. It cannot find a page unless you tell it about the page or link to it.
I’ve been googling this, still I can’t understand how pages that have absolutely no reference in the site or external sites can be crawled? If this is possible the spider should know how to read the website’s database tables, but I guess that this is not the case.
If you have no reference - you are doing something wrong. Improve your site structure.
At this point since most of the pages and links are dynamic how could they be indexed, the same thing applies to “user detail” pages that are accessed via query string using a “user id=n”?
Nothing wrong with a dynamic URL per se, but again I would recommend implementing search engine friendly URLs via mod_rewrite or similar; see the resources above.
Good luck,
Colin
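To make the friendly-URL idea concrete, here is a small language-neutral sketch in Python of what the rewrite layer has to do: turn a category and product title into a readable path, and map that path back to the internal product_id URL. The PRODUCTS rows and the function names are hypothetical; on an ASP.NET site the equivalent lookup would sit in the URL rewriting module.

```python
import re
from typing import Optional

# Hypothetical catalogue rows: (product_id, category path, product title).
PRODUCTS = [
    (1, "t-shirts/girls", "Star Wars T-Shirt"),
    (767, "t-shirts/boys", "Yoda: Do or Do Not!"),
]


def slugify(title: str) -> str:
    """Lower-case the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def friendly_path(category: str, title: str) -> str:
    """Build a path such as /t-shirts/girls/star-wars-t-shirt/."""
    return f"/{category}/{slugify(title)}/"


# Lookup table the rewrite layer consults to find the real product_id.
PATH_TO_ID = {friendly_path(cat, title): pid for pid, cat, title in PRODUCTS}


def resolve(path: str) -> Optional[str]:
    """Map a friendly path back to the internal query-string URL."""
    product_id = PATH_TO_ID.get(path)
    if product_id is None:
        return None  # unknown path: the site should answer 404
    return f"/products.aspx?product_id={product_id}"
```

Links in menus and category pages would then be generated with friendly_path(), so the crawler only ever sees the readable form.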
Modern systems optimize for SEO by allowing custom or automatically generated URLs that map to your ID-based URL pattern. This URL style lets you put the product title or descriptive keywords in the URL word for word, which carries more weight than an arbitrary ID number.
To ensure all individual pages are indexed, you generally benefit most from submitting or publishing an XML sitemap. More info from Google on generating one here:
https://code.google.com/p/googlesitemapgenerator/
Hope that gets you going in the right direction!
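For a database-driven catalogue, the sitemap can be generated from the same product IDs the pages use. The following is a minimal sketch using only Python's standard library; the base URL and the hard-coded ID list are placeholders (in practice the IDs would come from a database query), and build_sitemap is a made-up name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, product_ids) -> str:
    """Return sitemap XML with one <url><loc> entry per product ID."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for product_id in product_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{base_url}/products.aspx?product_id={product_id}"
    return ET.tostring(urlset, encoding="unicode")


# Placeholder IDs standing in for the result of a database query.
sitemap_xml = build_sitemap("http://www.example.com", [1, 2, 767])
```

This way product 767 is listed explicitly, so the crawler does not have to guess that it exists.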

How do I prevent GoogleBot from finding acquisition URLs?

I have Apache in front of Zope 2 (multiple virtual hosts) using the standard simple rewrite rule.
I am having big issues with Googlebot on some of the old sites I host.
Say I have:
site.example.com/documents/
site.example.com/images/i.jpg
site.example.com/xml/
site.example.com/flash_banner.swf
How do I stop the following from happening?
site.example.com/documents/images/xml/i.jpg
site.example.com/images/xml/i.jpg
site.example.com/images/i.jpg/xml/documents/flash_banner.swf
All of these respond with the correct object from the last folder at the end of the URI. The old sites were not written very well, and in some cases Google is going in and out of hundreds of permutations of folder structures that don’t exist, but always finding large Flash files. So instead of Googlebot hitting the Flash file once, it's dragging it off the site thousands of times. I am in the process of moving the old sites to Django, but I need to put a halt to it in Zope. In the past I have tried ipchains and mod_security, but they are not an option this time around.
Find out which page is giving Google all the variant paths to the same objects. Then fix that page so that it only provides the canonical paths, using the absolute_url(), absolute_url_path(), or virtual_url_path() methods of traversable objects.
You could also use a sitemap.xml or robots.txt to tell Google not to spider the wrong paths, but that's definitely a workaround and not a fix as the above would be.
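One way to hunt for the offending page is to scan the rendered HTML for links that are not in the list of canonical paths. The sketch below is only an illustration of that search, not Zope code: CANONICAL_PATHS is a hand-written stand-in for the real object locations from the question, and the class and function names are invented.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the paths where the objects really live.
CANONICAL_PATHS = {
    "/documents/",
    "/images/i.jpg",
    "/xml/",
    "/flash_banner.swf",
}


class LinkCollector(HTMLParser):
    """Collect every href and src attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def non_canonical_links(html: str) -> list:
    """Return site-local links that do not match a canonical path."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        link
        for link in collector.links
        if link.startswith("/") and link not in CANONICAL_PATHS
    ]
```

Any page for which this returns a non-empty list is a candidate for the template fix described above. Relative links would need resolving against the page URL first, which this sketch leaves out.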

Dynamic url shortening script for text input

We are looking for a script that automatically detects a URL as you type in a text input and shortens it before you press "submit". The shortening service used is http://yourls.org/
Have you tried implementing one yourself? Deploy the shortener to your own web site (it's written in PHP, as far as I can see from a cursory glance at the web site) and provide a simple Ajax endpoint which will dynamically perform a shortening conversion, then implement calls to that from the main page using JavaScript.
You might want to impose a reasonable delay to allow the user to finish typing, to avoid performing lots of unnecessary conversions of bogus URLs (which may require, e.g. writes to a file or database - I haven't looked at how the library referenced does things).
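Besides a delay, the endpoint can refuse to touch URLs that are still being typed. A minimal sketch of that check in Python, assuming a URL counts as finished once whitespace follows it; shorten is a hypothetical callable standing in for whatever call you make to your shortener, and short_prefix is the shortener's own base URL, so that already-shortened links are left alone.

```python
import re

# A URL is treated as finished only once whitespace follows it.
FINISHED_URL = re.compile(r"https?://\S+(?=\s)")


def shorten_finished_urls(text, shorten, short_prefix):
    """Replace finished long URLs in text, leaving already-short ones alone.

    shorten: hypothetical callable that returns a short URL for a long one.
    short_prefix: base URL of the shortener, used to skip its own links.
    """
    def swap(match):
        url = match.group(0)
        return url if url.startswith(short_prefix) else shorten(url)

    return FINISHED_URL.sub(swap, text)
```

A half-typed URL at the very end of the text never matches, so no bogus entries are created for it.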
I'm not sure what you're trying to achieve; if you create new shortened URLs for each substring before the user has finished typing the full URL, you will just fill your database with useless entries.
I don't see how shortening a URL before it's finished makes sense.
If you want to relieve the user from the arduous task of clicking the submit button, then initiate the submit using JavaScript (jQuery, or something). I'm not sure if that's what you want to do.
http://monkeytooth.net/2010/12/htaccess-php-how-to-wordpress-slugs/
This is a simple means of implementing the concept; it's a lot easier than one would think. Querying a DB, or using some other means of matching the slug/ID with the one found in the URL, wouldn't be all that hard either. The linked article doesn't really go into depth about what to do next, but catching the URL and breaking it apart is the essential step in making it work. I have personally used the method on several sites and it works like a charm.

Refresh browser via cron (or not) to a different page on remote request?

I need to display pages in a tutorial fashion. I looked into netsupport, beamyourscreen and other possibilities, but I do not want the viewers to download anything. I cannot use gd / send screenshots due to audio / video instructions embedded in some of the pages.
Basically, I need the ability to "refresh" a user's browser window to a different page via an interface on my end, whether via a form submission, JavaScript or any other type of "controller" that allows me to change the page in the viewer's browser. Perl preferred, but PHP / JavaScript or whatever works and is cross-browser is fine. I set up a simple JavaScript page-forward timer that "works", but page load times and conversation interruptions are a huge factor.
The entire tutorial website will be developed around this ability.
I was looking into curl / cron / wget methods, but found little information.
I have seen forum and chat scripts that basically perform a similar task, but there must be a simple(ish) solution in lieu of hacking up another script to suit my needs.
I do not want others to control the pages either. The site really only needs to be accessible during the tutorial; however, it could remain web accessible as long as user interaction is normal unless the pages are being controlled.
The initial site concept is based on instructing people how to properly introduce new pets into a home. It will be operated by a veterinarian who saved my pet's life. I wanted to give something back.
Is this possible? I would really appreciate simple examples.
You have no option other than to keep polling the server for "instructions" using JavaScript. You can't push anything to the end user's browser, with either curl or wget.
Mainly, you'll have to set up a simple request/response protocol between the browser and the server.
If you want to go deeper, you can use something like cometd/meteord/etc. If not, a hidden iframe that reloads itself and receives pages with JavaScript code for the needed actions can do the trick.
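The request/response protocol can be very small. Here is a sketch of the server-side half in Python, under the assumption that each viewer's poll sends the version number it last acted on and gets back a JSON instruction; the class name, the JSON field names and the page paths are all invented for the example.

```python
import json


class PageController:
    """Holds the page the presenter wants every viewer to show."""

    def __init__(self, start_page: str):
        self.page = start_page
        self.version = 1

    def set_page(self, page: str) -> None:
        """Called from the presenter's interface to move everyone on."""
        self.page = page
        self.version += 1

    def poll(self, seen_version: int) -> str:
        """Answer one viewer poll with a small JSON instruction."""
        if seen_version == self.version:
            return json.dumps({"action": "stay", "version": self.version})
        return json.dumps(
            {"action": "goto", "page": self.page, "version": self.version}
        )
```

The JavaScript on the viewer's page would call the poll endpoint every few seconds and change location only when the action is "goto". Only the presenter's interface should be allowed to call set_page(), which keeps viewers from controlling the pages.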
Another alternative.
Use JavaScript polling and a single-character flat file: a simple flat file holding a single variable. Write it in Perl (it is faster and uses fewer resources than PHP). The page's script polls the flat file, reads the variable, and goes wherever the variable points. The flat file is written to by the controller. Done.
I guess you could also rename an empty flat file and use its name as the controller. I am unsure which is faster: opening and reading a specific file, or hitting the directory and returning the file name. On the controller side, it is opening and writing to a file versus renaming a file. Maybe they cancel each other out in resources and time?
This way the site can act as a normal site. When you want to have remote users see a "presentation" (automatically being shown the site pages at the controller's pace), the controller activates polling and tells the viewers to push a start button. This allows a remote instructor to load pages for the viewers at his leisure.
It is a simple solution that works with nothing really sophisticated going on. No frames are needed either. You just need JavaScript enabled.
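The flat-file variant could look roughly like this on the server. It is a sketch in Python rather than Perl, with made-up function names; the one design choice worth noting is that the controller writes to a temporary file and renames it over the flat file, so a poll never reads a half-written value.

```python
import os
import tempfile


def write_current_page(flatfile: str, page_code: str) -> None:
    """Controller side: replace the flat file's content atomically."""
    directory = os.path.dirname(os.path.abspath(flatfile))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(page_code)
    # mkstemp creates an owner-only file; loosen it so the web server can
    # read it (an assumption about how the site is deployed).
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, flatfile)  # readers never see a partial write


def read_current_page(flatfile: str, default: str = "0") -> str:
    """Polling side: return the code the viewer's script should act on."""
    try:
        with open(flatfile) as f:
            return f.read().strip() or default
    except FileNotFoundError:
        return default
```

Here the default "0" is taken to mean "no presentation running", so the site behaves normally until the controller writes something else.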
Any better suggestions are welcome!
It occurred to me that what you might want to use is HTML Push technology. Check out the wiki; it has several links. I have never used it myself.

How to track all website activity and filter out web robot data

I'm doing some very rudimentary tracking of page views by logging URLs, referral codes, sessions, times, etc., but I'm finding that the log is getting bombarded by robots (Google, Yahoo, etc.). What is an effective way to filter these hits out, or to not log them at all?
I've experimented with robot IP lists etc but this isn't foolproof.
Is there some kind of robots.txt, htaccess, PHP server-side code, javascript or other method(s) that can "trick" robots or ignore non-human interaction?
Just to add: a technique you can employ within your interface is to use JavaScript to wrap the actions that lead to certain user-interaction view/counter increments. As a very rudimentary example, a robot that does not execute JavaScript will not (cannot) follow:
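<a href="javascript:viewItem(4)">Chicken Farms</a>
function viewItem(id)
{
    // Send the browser to the item page, tagged as a genuine user click.
    window.location.href = 'http://www.example.com/items?id=' + id + '&from=userclick';
}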
To make those clicks easier to track, they might yield a request such as
www.example.com/items?id=4&from=userclick
That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
It depends on what you want to achieve.
If you want search bots to stop visiting certain paths/pages you can include them in robots.txt. The majority of well-behaving bots will stop hitting them.
If you want bots to index these paths but you don't want to see them in your reports then you need to implement some filtering logic. E.g. all major bots have a very clear user-agent string (e.g. Googlebot/2.1). You can use these strings to filter these hits out from your reporting.
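The filtering logic can be as simple as a pattern match on the logged user-agent string. A minimal sketch in Python, assuming each logged hit is a dict with a "user_agent" key; the marker list in BOT_PATTERN is illustrative rather than complete, and a robot that sends a browser-like user-agent will still get through.

```python
import re

# Illustrative marker list; extend it with whatever shows up in your logs.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def is_robot(user_agent: str) -> bool:
    """Treat an empty User-Agent or one containing a bot marker as a robot."""
    return not user_agent or bool(BOT_PATTERN.search(user_agent))


def human_hits(hits):
    """Keep only the logged hits whose User-Agent does not look like a robot."""
    return [hit for hit in hits if not is_robot(hit["user_agent"])]
```

The same is_robot() check can be applied before writing to the log instead of at reporting time, if you would rather not store robot hits at all.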
Well-behaved robots each send a specific user-agent, so you can just disregard those requests.
Alternatively, if you use a robots.txt to deny them from visiting, that will work too.
Don't reinvent the wheel!
Most statistics tools today filter out robot requests. You can install AWStats (open source) even if you are on shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding just a script at the end of your pages. Both solutions are very good. That way you only have to log your errors (500, 404 and 403 are enough).