How can I block mp3 crawlers from my website under Apache?

Is there some way to block access from a referrer using a .htaccess file or similar? My bandwidth is being eaten up by people referred from http://www.dizzler.com which is a flash based site that allows you to browse a library of crawled publicly available mp3s.
Edit: Dizzler was still getting in (probably wasn't indicating referrer in all cases) so instead I moved all my mp3s to a new folder, disabled directory browsing, and created a robots.txt file to (hopefully) keep it from being indexed again. Accepted answer changed to reflect futility of my previous attempt :P

That's like saying you want to stop spam-bots from harvesting emails on your publicly visible page - it's very tough to tell the difference between users and bots without forcing your viewers to log in to confirm their identity.
You could use robots.txt to disallow the spiders that actually follow those rules, but that's enforced on their side, not your server's. There's a page that explains how to catch the ones that break the rules and explicitly ban them: Using Apache to stop bad robots [evolt.org]
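For reference, a minimal robots.txt along those lines (a sketch; the /mp3/ folder name is a placeholder for wherever the files actually live):
# Ask all rule-abiding crawlers to stay out of the mp3 folder
User-agent: *
Disallow: /mp3/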
If you want an easy way to stop dizzler in particular, you should be able to pop open the .htaccess file in the mp3 directory and add the following (note that a <Directory> wrapper is not allowed in .htaccess; the file already applies to the directory it lives in):
Order Allow,Deny
Allow from all
Deny from 66.232.150.219

From this site (put this in your .htaccess file):
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://(www\.)?dizzler\.com [NC]
RewriteRule .* - [F]

You could use something like
SetEnvIfNoCase Referer dizzler.com spammer=yes
Order allow,deny
allow from all
deny from env=spammer
Source: http://codex.wordpress.org/Combating_Comment_Spam/Denying_Access

It's not a very elegant solution, but you could block the site's crawler bot, then rename your mp3 files to break the links already on the site.

Related

Block downloading of files, but show on my own site pages

I want to block downloading of images from a directory, but allow them to be displayed on my own blog's pages (on the same domain).
I created the following .htaccess file:
order deny,allow
deny from all
allow from mydomain.ru
It blocks downloading AND blocks showing images on my blog's pages.
What am I missing?
Shared hosting, Ubuntu Linux, Apache. I don't have access to httpd.conf.
allow from mydomain.ru does not do what you expect: it only permits requests coming from the IP address that mydomain.ru resolves to, and denies everyone else. So assuming you are not browsing from that IP, that is why the images are blocked.
I don't know how your images are being served, but you may be able to block requests whose Referer does not match your domain name. The Referer header can easily be forged, so this is by no means foolproof.
If your HTML pages link to the images, something like the following should work (the leading slash is dropped in per-directory .htaccess patterns, the dots are escaped, and empty Referers are allowed through so direct visits still work):
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(.*\.)?mydomain\.ru/ [NC]
RewriteRule ^path/to/directory/ - [F]

Blocking IPs with htaccess and log bloat

I set a 'deny from' in my htaccess to block certain spam bots from parsing my site. While using the code below, I noticed in my log file that I'm getting a lot of 'client denied by server configuration' and it's cluttering up the log files when the bot starts its scan. Any thoughts?
Thanks,
Steve
<Files *>
order allow,deny
allow from all
deny from 123.45.67.8
</Files>
I ended up going with the following (the IP anchored and its dots escaped so the pattern matches only that exact address):
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.8$
RewriteRule (.*) - [F,L]
Take a look at the conditional logging here - I think that will provide everything you need:
http://httpd.apache.org/docs/2.2/logs.html
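As an example, a minimal sketch of conditional logging, assuming you can edit the main server config (CustomLog cannot go in .htaccess) and reusing the IP from your question. This keeps the denied requests out of the access log; error-log entries are a separate matter:
# Tag requests from the offending address, then exclude them from the access log
SetEnvIf Remote_Addr "^123\.45\.67\.8$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog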
Also - if you can identify that the various bots always come from specific IP addresses, you can block them in your hosts.allow/hosts.deny files by IP address, or automatically using something like BlockHosts or possibly mod_evasive; that way Apache never sees the requests and has nothing to log.
-sean
UPDATE:
Are you identifying the IP addresses manually and then adding them to your .htaccess? That sounds painful. If you really want to do it that way, I would suggest you block the IP addresses at the firewall with a drop rule, or as above in hosts.allow/hosts.deny.
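A firewall drop rule of that sort might look like this (a sketch, assuming a Linux host with iptables and the IP from your question):
# Silently drop all packets from the offending address
iptables -A INPUT -s 123.45.67.8 -j DROP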
SPURIOUS BROKEN RECORD UPDATE:
Take a look at BlockHosts; it can block IP addresses based on their 'behavior' and will eliminate the need for you to manually prune them out every day.
You can have the log file sent to a program (i.e. a script). Perhaps implement a script that just emits a periodic summary and writes the rest to a log file?
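Piped logging is configured in the main server config; a sketch, where the summarizer script path is hypothetical:
# Apache starts the program once and feeds it one access-log line per request
CustomLog "|/usr/local/bin/log-summarizer" combined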

Is it necessary to set [DirectoryIndex] while not using index.php?

My site's root access is managed by .htaccess: it redirects various aliases to their own home files, /en/home for English, /de/home for German, et cetera. Previously, I used index.php to route and redirect all that, and hence DirectoryIndex had something like this:
DirectoryIndex /index.php
Now, however, there is no index.php file, so I commented it out:
# DirectoryIndex /index.php
Would it be better to uncomment it and set it to the default /en/home? (With or without .php? I have set up rules so that my pages also work in the browser when no extension is given.)
DirectoryIndex /en/home
In all the above cases, my websites work fine and I don't see ANY change whichever of the three variants above I set. But... "there's gotta be one best, ain't it?"
Thanks!
If you have the rules written in .htaccess, it is best not to repeat them in whatever PHP config and routing functions you are using. Routing through Apache (your .htaccess) is much faster than routing through PHP, though you will not notice the gains without a fairly high volume of traffic.
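As a sketch of what the Apache-only variant can look like, under the assumption that the language home pages live in files like /en/home.php and that you are on Apache 2.4 (where DirectoryIndex accepts the disabled keyword):
# No index file is needed; the root is redirected explicitly
DirectoryIndex disabled
RewriteEngine On
# Send the bare domain to the English home page
RewriteRule ^$ /en/home [R=302,L]
# Serve extensionless URLs from the matching .php file when it exists
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(.+)$ $1.php [L]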

How do I force Apache to simply redirect the user and ignore the directory structure?

Ok, so this problem recently arose and I don't know why it is happening; it's actually two problems in one...
0. My .htaccess file, for reference. (EDITED)
Options -Indexes +FollowSymLinks
RewriteEngine On
RewriteBase /
ErrorDocument 400 /index.php?400
ErrorDocument 401 /index.php?401
ErrorDocument 403 /index.php?403
ErrorDocument 404 /index.php?404
ErrorDocument 410 /index.php?410
ErrorDocument 414 /index.php?414
ErrorDocument 500 /global/500.php
RewriteCond %{HTTP_HOST} !^$ [NC]
RewriteRule .* index.php [L]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(.*\.)?(animuson)\.(biz|com|info|me|net|org|us|ws)/.*$ [NC]
RewriteRule ^.*$ - [F]
1. My 'pictures' folder is following the hard path instead of the redirect.
I have no idea WHY it is doing this. It's really bugging me. The 'pictures' folder is a symbolic link to another place so that I can easily upload files to that folder without having to search through folders and such via my FTP account, but that's the only thing I use it for. However, when I visit http://example.com/pictures my htaccess sees it as accessing that other folder, which is restricted, and throws a 403 error rather than redirecting to index.php and displaying the page like normal.
I figured it has something to do with that specific folder being a symbolic link causing it to act oddly, but I have determined that my rules are not being applied to folders at all. If I visit folders such as 'css' and 'com' which are folders in the web root, it displays a 404 error page and adds the '/' to the end of the URL because it's treating it as a directory. It also does the same 403 error for my 'images' directory which is set up in the same fashion.
So, the question here is: how do I modify my RewriteRule to apply to directories as well? I want everything accessed via the web to be redirected back to index.php while maintaining the full access path in the address bar. Why is it not working? (I'm pretty sure it was working fine before.)
Here's a small chart to show the paths they're following...
example.com/pictures -> pictures/ -> /home/animuson/animuson-pictures -> 403
example.com/com -> com/ -> 404
example.com/test -> index.php
example.com/ -> index.php
example.com/images -> images/ -> /home/animuson/animuson-images -> 403
example.com/css -> css/ -> 404
EDIT: The following information has been added.
Apache is processing the structure of the directory first. It's determining if the path exists based on what was typed into the address bar. If someone types in a folder name that happens to exist, it will redirect the user to the path with the "/" at the end of the URL signifying that it's a directory. For the 'pictures' directory explained above, the user does not have permission to access that folder so it is redirecting them to a 403 Access Denied page rather than simply showing the page that is supposed to be displayed there via the RewriteRule above. My biggest question is why is Apache processing the directory first and how do I make it stop doing that? I would really love an answer to this question.
2. Why is my compression not working? (EDIT: This part is fixed.)
When analyzing my site through a web optimizer, it keeps saying my page isn't using web compression, but I'm almost 100% positive that it was working fine before under the same settings. Can anyone suggest any reasons why it might not be working with this set up or suggest a better way of doing it?
Where is this .htaccess file situated? At the root or in the pictures directory?
1) You're using Options -Indexes, which denies access to directory listings. This is handled by /index.php?403, which in turn redirects to /403. (I confirmed this by manually going to /index.php?403.) I don't see any other rules in the posted .htaccess that should affect this, so this happens because either index.php or some other .htaccess file or server rule makes that redirect.
You might also want to check the UNIX file permissions of the directory in question.
2) According to that optimizer, http://www.websiteoptimization.com/services/analyze/, compression is indeed enabled for HTML, JS and CSS files, as specified in the rules (see the sketch after this list). My bet is that the optimizer is being stupid and does one of these three things:
a) Complaining about images not being compressed. (It's generally a bad idea to compress images, because they're typically already compressed and the extra CPU load usually isn't worth the small net gain. So your rules are OK in this regard.)
b) It might think that DEFLATE doesn't count as compression and wants you to use gzip.
c) It might also be reacting to the externally included StatCounter JS file, which is not compressed. (And there's not much you can do about that.)
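For reference, compression rules of the kind being discussed usually look something like this (a sketch; the asker's actual rules weren't posted):
# Compress text-based responses; leave already-compressed formats such as images alone
AddOutputFilterByType DEFLATE text/html text/css application/javascript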
After a while of deliberating on Apache's IRC channel, I was finally able to figure out the real reason behind this, by a fluke. I just happened to be looking at the directory structure using ls -l and noticed that all of the symbolic links had somehow had their ownership changed to animuson:animuson from the original root:root. I tried to run a simple chown root:root on them and it had no effect, so I deleted them all and recreated them, and the problem went away. I don't really know why the ownership made any difference in this scenario, but the solution worked and everything is okay now. I've also added DirectorySlash Off to my .htaccess file to get rid of the slashes after folders that exist, just to make it look that much nicer.

mod_rewrite to absolute path in .htaccess - turning up 404

I want to map a number of directories in a URL:
www.example.com/manual
www.example.com/login
to directories outside the web root.
My web root is
/www/htdocs/customername/site
the manual I want to redirect to is in
/www/customer/some_other_dir/manual
In mod_alias, this would be equivalent to
Alias /manual /www/customer/some_other_dir/manual
but as I have access only to .htaccess, I can't use Alias, so I have to use mod_rewrite.
What I have got right now after this question is the following:
RewriteRule ^manual(/(.*))?$ /www/htdocs/customername/manual/$2 [L]
this works in the sense that requests are recognized and redirected properly, but I get a 404 that looks like this (note the absolute path):
The requested URL /www/htdocs/customername/manual/resourcename.htm
was not found on this server.
However, I have checked with PHP: echo file_exists(...) and that file definitely exists.
Why would this be? According to the mod_rewrite docs, this is possible, even in a .htaccess file. I understand that when doing mod_rewrite in .htaccess, an automatic prefix is added, but surely not to absolute paths?
It shouldn't be a rights problem either: It's not in the web root, but within the FTP tree to which only one user, the main FTP account, has access.
I can change the web root in the control panel anytime, but I want this to work the way I described.
This is shared hosting, so I have no access to the error logs.
I just checked, this is not a wrongful 301 redirection, just an internal rewrite.
In .htaccess, you cannot rewrite to files outside the wwwroot.
You need to have a symbolic link within the webroot that points to the location of the manual.
Then in your .htaccess you need the line:
Options +SymLinksIfOwnerMatch
or maybe a little more blindly
Options +FollowSymlinks
Then you can
RewriteRule ^manual(/(.*))?$ /www/htdocs/customername/site/manual/$2 [L]
where manual under site is a link to /www/customer/some_other_dir/manual
You create the symlink on the command line with (note the argument order: target first, then the link name):
ln -s /www/customer/some_other_dir/manual /www/htdocs/customername/site/manual
But I imagine you're on shared hosting without shell access, so look into creating symbolic links within cPanel, Webmin, or whatever your admin interface is. There are PHP/CGI scripts that do it as well. Of course, you're still limited to the permissions the host has given you: if they don't allow you to follow symlinks as a policy, you cannot override that from your .htaccess.
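Such a script can be as small as this (a sketch using the paths from the question; upload it, request it once in the browser, then delete it):
<?php
// Create a link inside the web root that points at the external manual directory
symlink('/www/customer/some_other_dir/manual', '/www/htdocs/customername/site/manual');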
AFAIK mod_rewrite works at the 'protocol' level (meaning on-the-wire HTTP), so I suspect you are getting an HTTP 302 with your directory path in the Location header.
So I'm afraid you might be stuck, unless your hosting lets you follow symbolic links; then you can link to that location (assuming you have shell access, or that this is possible using FTP or your control panel) under your current document root.
Edit: The docs actually mention a URL-to-filename phase hook, so now I suspect the directory directives aren't granting enough permissions.
This tells you what you need to know.
The requested URL /www/htdocs/customername/manual/resourcename.htm
was not found on this server.
It interprets RewriteRule ^manual(/(.*))?$ /www/htdocs/customername/manual/$2 [L] to mean: rewrite example.com/manual/ as if it were example.com/www/htdocs/customername/manual/. In per-directory (.htaccess) context, the substitution is treated as a URL path under the document root, not as an absolute filesystem path, which is why the filesystem path shows up in the 404.
Try
RewriteRule ^manual(/(.*))?$ /customername/manual/$2 [L]
instead.