Difficulties with .htaccess and Blocking Specific File Extensions - apache

I have a rather complicated situation where I run a personal blog where every Friday and Sunday, I will post up music on the blog by uploading the mp3s into a folder, where a Flash mp3 player accesses it and plays it for the world.
Recently, some website called Dizzler, which is essentially a spider for mp3 files (like the ones I host on my server!), started letting people play them through its own proprietary player. Now, I normally wouldn't be against other people using my server for their own gain, but this recently got out of hand. In the last week of December, they managed to rack up 100k hits on one song and used up 6GB of bandwidth.
In that last week of December, I edited my .htaccess file to block outside access to the mp3s on my server without taking away my own site's access to them (so "deny from all" isn't an option!), using this code:
RewriteEngine on
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mydomain\.com [NC]
RewriteRule \.(mp3)$ - [NC,F]
Options -Indexes
It worked pretty well with one exception - it broke every Wordpress installation on my server. What I mean is that outside of the index page, if you clicked on an entry in Wordpress, it couldn't be found. My host's solution was to add "RewriteEngine on" to the .htaccess file of every installation and to the one in the web server root.
That was a great fix and all the pages work again - but it is no longer blocking my mp3 files in that folder.
What can I do?
PS. For clarification, the code above is in an .htaccess file in the folder containing the mp3s. Hope that helps!

Huge thanks to Vinko Vrsalovic for all the help, definitely helped point me in the right direction, currently using the following code:
SetEnvIfNoCase Referer www\.dizzler\.com bad_referer
SetEnvIfNoCase Referer ".*(dizzler|beemp3|skreemr).*" BlockedReferer
SetEnvIfNoCase REMOTE_ADDR "(220\.181\.38\.82|202\.108\.23\.172|66\.232\.150\.219)" BlockedAddress
# deny any matches from above and send a 403 denied
<FilesMatch "\.mp3$">
order deny,allow
deny from env=bad_referer
deny from env=BlockedReferer
deny from env=BlockedAddress
</FilesMatch>
Testing it out tonight, will report back tomorrow if it works!

I'm posting this as another answer instead of adding this to my other post because it approaches the problem from a different angle. Here I am assuming that all your mp3s are in the same folder.
The problem you are facing is due to sloppy coding on the part of whoever made the media-player thing that wordpress uses. What happens is that the player runs on the visiting user's machine, and actually downloads the mp3 and plays it locally. The problem arises because the player does not provide any useful headers at all: the useragent is that of your browser, the referrer is blank, etc. As such, it is completely impossible to tell if the request is coming from the player, or from a browser that clicked your link in an audio search engine. Really, the only way to protect your mp3s from being indexed is to change the link as often as possible.
Which is precisely the plan. In a nutshell, here is what we are going to do:
change the path to your mp3s. This stays SECRET.
create a script to proxy for the mp3s, which requires a valid key which changes every hour
change all your uses of the mp3 player to use the mp3 proxy script but with a placeholder key
create a script to proxy for your webserver, which replaces the key placeholder with the actual key
use .htaccess to rewrite all requests to your server to use the webserver proxy script.
The upshot of all of this is that your user experience will not change, but if a crawler crawls your links, they will only be valid until the top of the next hour, at which point requests to that url will result in a snippy message (or even an mp3 of you asking them to please not download your stuff).
Ready? OK, lets go!
Step 1:
First things first, make sure you rename your mp3s folder! This will break all existing links (and failing to do this will mean all the links already crawled remain valid). Secondly, create a robots.txt file to stop Google and other search engines from indexing your mp3s folder.
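For example, a robots.txt in the web root along these lines asks well-behaved crawlers to stay out of the folder (the folder name below is a placeholder for whatever you rename it to):
User-agent: *
Disallow: /new-secret-mp3-folder/
Keep in mind that robots.txt is publicly readable and purely advisory, so the key scheme below is what actually protects the files.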
Now, create a file in your root directory called mp3serve.php with the following contents:
<?php
/* This script checks 'key', and if it's valid, serves the mp3
* A valid key is defined as the md5 of the current date in
* yyyy-mm-dd-hh format concatenated with the string
* "Hello there :)"
*
* The key can be anything so long as we are consistent in this
* and the viewer proxy thing we're going to make.
*/
// edit this variable to reflect your server
$music_folder = "/new/path/to/mp3s/";
// get inputs of 'file' and 'key'
// 'file' should be the filename of the mp3 WITHOUT the extension
$file = $_GET['file'];
$key = $_GET['key'];
// get the current date and hour
$date = date("Y-m-d-H");
// calculate the valid key (note: PHP string concatenation uses '.', not '+')
$valid = md5($date . "Hello there :)");
if ($key == $valid)
{
// if the key is valid, get the song in the path:
print(file_get_contents("$music_folder/$file.mp3"));
}
else
{
// if the key is invalid, print an admonishing message:
print("Please don't try to download my songs, poopface.");
}
?>
What this does is it takes the filename of an MP3 and a key of some kind, and serves the file contents if the key is valid. Note that this script:
makes no checks at all that $file points to what you expect it to, other than the fact that it tries to make sure it will only ever return mp3 files.
does not return valid headers for mp3 files - they'll render as text in a browser. This is easy to fix by sending a Content-Type: audio/mpeg header before printing the file contents... and anyway the wordpress mp3 player doesn't care, so it's all good :)
Step 2:
Now for the slightly tricky part: we have to rewrite the links dynamically. The easiest way to do this is to write a "local proxy", which really is a lot easier than it sounds. What we will do is write a script that fetches what your page would have output and corrects the mp3 links. In my example we will edit all of your articles with mp3s in them, but if you want to get fancy this is not completely necessary.
First, edit all of your articles with mp3-players in them. You could automate this, but unless WP has a "find/replace in all articles" function I would advise against it for the sole reason that you might screw up and destroy your articles. In any case, edit them and replace the mp3 links in the players from
/path/to/mp3s/<filename>.mp3
to
/mp3serve.php?file=<filename>&key=[{mp3_file_key}]
Now, create another php script in your root directory called proxyviewer.php with the following contents:
<?php
/*
* The purpose of this file is to act as a proxy in which we can dynamically
* rewrite the page contents. Specifically, we want to get the page that the
* user WOULD have seen, and replace all instances of our key placeholder
* with the actual correct key
*/
// get the requested path
$request = $_GET['req'];
// get what the source output WOULD have been
// NOTE: depending on your server's config, you -might- have to
// replace 'localhost' with your actual site-name. This will
// however increase page-load times. If localhost doesn't work
// ask your host how to access your site locally. To clarify,
// maybe show him this file.
$source = file_get_contents("http://localhost/$request");
// The reason we need to pass the request through apache (i.e. use the whole
// "http://localhost/" thing) is because we need the PHP to be rendered, and
// I can't think of another way to do that using the original request uri
// calculate the correct key (again concatenating with '.')
$key = md5(date("Y-m-d-H") . "Hello there :)");
// replace all instances of "[{mp3_file_key}]" with the key
$output = str_replace("[{mp3_file_key}]",$key,$source);
//output the source
print($output);
?>
Step 3:
Now for the last part: set up your .htaccess file to redirect all requests from
http://yoursite/some/request/here
to
http://yoursite/proxyviewer.php?req=some/request/here
Unfortunately I'm really not good with .htaccess files so I won't be able to give you the exact code, but I imagine it shouldn't be too hard to do.
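Something along these lines should be close (an untested sketch; it assumes proxyviewer.php and mp3serve.php live in the web root, and goes in the root .htaccess):
RewriteEngine on
# never rewrite requests for the two proxy scripts themselves, or you'll loop forever
RewriteRule ^(proxyviewer|mp3serve)\.php$ - [L]
# hand everything else to the viewer proxy, passing the original path along
RewriteRule ^(.*)$ /proxyviewer.php?req=$1 [L,QSA]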
Congrats, you're done!
Disclaimer:
Please note that the code in here is not production-level code. First of all, I haven't tested any of it - unless there's a typo somewhere it should all work, but I would advise you to look through it carefully before going live with it. I have been fairly careful not to allow any Bad Things to happen, but it doesn't do any serious checking, and it's the wee hours of the morning here so I may have overlooked something.

FilesMatch is the directive you need:
<FilesMatch "\.mp3$">
Order Deny,Allow
Deny from all
# or the address of your player instead of localhost
Allow from localhost
</FilesMatch>

I think my other answer is much better, but this is still worth considering
Reading through some of the answers, I am struck by another idea: have your page log the IP addresses of all visitors to your site within the last two (or however many) hours. Then, create a job that runs every 2 seconds or so and rewrites your .htaccess file to allow access to mp3 files only from the IP addresses in that log.
That way, only users who have been served a page from your website in the last two hours will have access to your music. The vast majority of people finding your mp3s through audio search engines will not have been, so they will be blocked.
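For illustration, the .htaccess the job generates might look something like this (the addresses are placeholders for whatever is in the log at that moment):
<FilesMatch "\.mp3$">
Order Deny,Allow
Deny from all
# regenerated from the visitor log by the scheduled job
Allow from 192.0.2.10
Allow from 192.0.2.54
</FilesMatch>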

Related

Blocking traffic using .htaccess not working [duplicate]

I have several websites that get around 5% of their daily visits from spam referrers. There is one strange thing I noticed about these referrers: they show up in Google Analytics, but I cannot see them in my custom-designed table where I log all visitors to the site, so I think they only manipulate the GA code and never reach the site itself.
If you follow their link, they redirect you to some affiliates link.
I don't know whether they have impact on my SEO/SERP, but I would like to get rid of them. May I do that via htaccess file?
One peculiar aspect is that I get visits from different forum-like subdomains, e.g. forum.topic221122.darodar.com, forum.topic125512.darodar.com, etc., so I would like to block the full darodar.com domain.
Besides darodar.com, there are also econom.co and iloveitaly.co that are bothering my stats. Can I block them all from htaccess?
Most of the spam in Google Analytics never accesses your site, so you can't block it with any server-side solution.
Ghost spam hits GA directly, usually shows up for only a few days, and then disappears; that's why some people think they blocked it from the .htaccess file, but it's just coincidence.
This type of spam is easy to spot, since it uses either a fake hostname or none at all (see the hostname report example below).
The other type, crawlers like semalt, actually access your site and can be blocked from the .htaccess file; however, there are only a few of them.
So in summary, to stop spam in Google Analytics:
Crawlers: server-side solutions or filters in GA
Ghosts: ONLY filters in GA
The only efficient way to prevent being hit by ghost spam is to make an include filter with all your valid hostnames.
First you need to build a REGEX with all the valid hostnames, something like this (you can find them in the Network report):
yoursite\.com|shoppingcart\.com|translateservice\.net
These are just examples; you might have more or fewer hostnames. Once you have the REGEX, create the filter:
Go to the Admin tab in Google Analytics
Select Filters under the View column > New Filter
Filter Type: Custom > Include > Filter Field: Hostname
Filter Pattern: paste the hostname expression you built
For crawlers you will have to create a different filter, building an expression with all the spammers:
spammer1|spammer2|spammer3|spammer4|spammer5
Filter Type: Custom > Exclude > Filter Field: Campaign Source
Filter Pattern: paste the referral expression
Every time you work with filters, it is important to keep an unfiltered view.
If you need detailed steps for these solutions, you can check this complete guide about spam in Google Analytics.
Guide to stop and remove All the spam in Google Analytics
Hope it helps.
[Image: hostname report example]
This blog post suggests that the spam referrers manipulate Google Analytics and never actually visit your site, so blocking them is pointless. Google Analytics offers filtering if you want to mitigate fake site hits.
Yes, you can block them with .htaccess, and actually you should do it.
Your .htaccess file could look like this:
<IfModule mod_setenvif.c>
# Set spammers referral as spambot
SetEnvIfNoCase Referer darodar.com spambot=yes
SetEnvIfNoCase Referer 7makemoneyonline.com spambot=yes
## add as many as you find
Order allow,deny
Allow from all
Deny from env=spambot
</IfModule>
When traffic comes from these sites, it is blocked by this .htaccess, so the HTML is never loaded and therefore the GA script is never fired (for those visits).
They are trying to collect traffic from you: once you see the incoming traffic in Google Analytics and try to find out what the source is, you visit their URL. It is harmless to your site, except that your statistics fill up with junk data.
Google Analytics should prevent this, the same way GMail prevents spam email.
According to this entry, they never visit your site; they fake HTTP requests to GA using your UA code. So it seems pointless to block them using .htaccess or any other server-side method, because they never actually reach your site - they only send fake "visit" data to Google.
We have found that using .htaccess is a good way to stop this spam. I have implemented the solution below on my client's site, and it is working really well so far.
The best way is to stop them with a "contains" match, e.g. for the spammer priceg.com, check for priceg anywhere in the referrer URL.
Many of these sites create subdomains and hit you again, and when they tweak the URL, hard-coded conditions fail:
RewriteCond %{HTTP_REFERER} (priceg) [NC,OR]
RewriteCond %{HTTP_REFERER} (darodar) [NC]
RewriteRule .* - [F]
It is explained in detail here
Apparently, this is done by a spammer communicating directly with Google Analytics using your website's account ID. They effectively tell Google Analytics they visited your page when in fact they never did. They identify themselves to Analytics by means of a URL which THEY WANT YOU TO VISIT. So you see their traffic in Google Analytics and go check them out. They will have an Amazon affiliate account hooked up, so they can, for example, attempt to earn a commission on your Amazon purchases.
So .htaccess did nothing for me when I was fighting this one; you need to create a GA filter which filters out things like .*\.darodar\.com
The real bad effect I have found from this is that it invalidates my website statistics.
You can restrict access using .htaccess, or by filtering ALL robot visits from being tracked by Google Analytics. If that doesn't work, set up Google Analytics filtering. More details on how to do that can be found here: http://www.wiyre.com/google-analytics-darodar-forum-spam-what-is-it/
They are Russian-based but route their spiders through China and the Philippines. Maybe it would be best to block the whole IP address at this point, since they have multiple sub-domains.
Blocking these bots at your web server level makes no sense - the spammers send fake requests straight to the Google Analytics servers. All they need to know is your website's domain name and the Google Analytics ID linked to it.
So you have to mask your Google Analytics ID in your website code. For example, you can do it like this in the Google Analytics JS code:
ga('create', 'UA-X' + 'XXXXX' + 'XX-X', 'auto');
After this change, a spammer's bot would have to execute JS code to parse your Google Analytics ID (and not many bots are able to do that).
https://nobodyonsecurity.com/security/fighting-google-analytics-referrer-spam
.htaccess is not the best way. On my site I use GA itself: the Tracking Info option and then the Referral Exclusion List.
Regards!
Lunametrics posted a nice article to solve this issue using Google Tag Manager:
http://www.lunametrics.com/blog/2014/03/11/goodbye-to-exclude-filters-google-analytics/
I think that the most effective way to avoid ghost spam is to add a custom dimension that lets you know the site was indeed visited, because as we know, ghost spam never visits the site.
ga('set', 'dimension1', "Hey I'm really here!!");
ga('send', 'pageview');
You should simply add these lines to your pages and then add a filter to include only hits where the dimension has the expected value ("Hey I'm really here!!" in this case).
I used these mod_rewrite methods for semalt:
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://(.*\.)?semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*semalt\.com [NC]
RewriteRule .* - [F]
or with the Apache module mod_setenvif:
SetEnvIfNoCase Referer semalt.com spambot=yes
SetEnvIfNoCase REMOTE_ADDR "217\.23\.11\.15" spambot=yes
SetEnvIfNoCase REMOTE_ADDR "217\.23\.7\.144" spambot=yes
Order allow,deny
Allow from all
Deny from env=spambot
I even created an Apache, Nginx & Varnish blacklist plus a Google Analytics segment to prevent referrer spam traffic; you can find it here:
https://github.com/Stevie-Ray/referrer-spam-blocker/
Filter future and historical GA spam of all types with the guide linked below. Hostname filtering is particularly easy.
https://www.ohow.co/ultimate-guide-to-removing-irrelevant-traffic-in-google-analytics/
2019 update
I may have a solution to this problem as I find none of the other solutions to be effective.
Let me address the problems with the existing solutions first:
Add a filter for each referrer spam domain? How many domains will you add? Most of these referrer spam domains exist for some time and then disappear.
Maintain a blacklist of referrer spam domains? This gets even more complicated, as they are basically endless in number. You would have to keep updating the blacklist, and the bigger the blacklist, the more time you need to scan it.
Anything else, such as maintaining a manual .htaccess file, will require manual intervention that will not scale as your site becomes more popular.
Anything automatic, such as using AI to determine patterns in how referrer spam domains appear, will be hit-or-miss.
How do these bots work?
First, it is crucial to understand how these bots work.
At a minimum they use regex patterns such as /UA-\d{6}/ to scrape tracking IDs from pages, which they crawl recursively after starting at a seed website.
I believe I have a solution that offers the following advantages:
No need to maintain whitelists or blacklists
It will easily work against 99% of these bots and can always be modified to take it to 100%
It requires almost NO manual intervention
The idea is NOT to have a literal tracking ID in the script at all.
Here is an example script:
// Google Analytics ID, stored as character codes
var a = [85, 65, 45, 49, 49, 49, 49, 49, 49, 49, 49, 49, 45, 50];
var newScript = document.createElement("script");
newScript.type = "text/javascript";
newScript.setAttribute("async", "true");
newScript.setAttribute("src", "https://www.googletagmanager.com/gtag/js?id=" + a.map(i => String.fromCharCode(i)).join(""));
document.documentElement.firstChild.appendChild(newScript);
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', a.map(i => String.fromCharCode(i)).join(""), { 'send_page_view': false });
// Feature detects Navigation Timing API support.
if (window.performance) {
// Gets the number of milliseconds since page load
// (and rounds the result since the value must be an integer).
var timeSincePageLoad = Math.round(performance.now());
console.log(timeSincePageLoad)
// Sends the timing event to Google Analytics.
gtag('event', 'timing_complete', {
'name': 'load',
'value': timeSincePageLoad,
'event_category': document.title // the original used a template placeholder for the page title
});
}
We take a very simple approach: break a tracking ID of the form 'UA-111111111-2' into an array of character codes.
Then we reconstruct the tracking ID dynamically from the char code array at any point where we need a reference to it.
The approach can be made arbitrarily more complex: turn the ID into an encrypted bunch of numbers, use base 8 or hexadecimal, add a fixed offset or a random offset on each run, or even RSA-encrypt the tracking ID with a private key on the server and decrypt it with a public key. But the basic approach is REALLY fast, as arrays in JS are really fast, and it can easily beat 99% of these bots.

.htaccess redirect hotlinked PDF files, no single rule for where

I've seen a lot of anti-hotlink strategies, but so far none where each file needs a unique redirect.
My employer's site has over 500 PDF files of original artwork for printable papercrafts which she offers for free, monetizing through ads.
What we're trying to prevent is others simply linking to our .pdf files and letting their users access our content without ever seeing our ads. The goal is to catch these external links and redirect them to our .html page which links to that file.
What makes this different from a lot of problems I've read about is that while we want to get the user as close as possible to the file they're seeking, there is no calculable link between the file name of the .pdf requested and the .html page where they should land.
The best idea I've come up with so far, given my knowledge of .htaccess, is to use the best mod_rewrite anti-hotlink strategy I can find to rename /PDF/file.pdf to something like /PDF/file.redirect, then write a separate redirect rule for each one, such as /PDF/fall-leaves.redirect to /seasons.html, and so on.
Is there a better solution to this problem?
Thanks,
John
You can use a RewriteMap instead of a bunch of rules. See the Apache documentation for more details on how that works, but it's basically a lookup table.
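A rough sketch of how that might look (untested; example.com, the map location, and the second map entry are placeholders, and note that RewriteMap itself can only be declared in the server or virtual-host config, not in .htaccess):
# in the virtual host config:
RewriteMap pdf2page txt:/etc/apache2/pdf-redirects.map
# in the .htaccess for the /PDF/ folder:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com [NC]
RewriteRule ^([^/]+)\.pdf$ ${pdf2page:$1|/index.html} [R=302,L]
The map file itself is just a plain-text lookup table, one line per PDF (file name without extension, then the page to land on):
fall-leaves /seasons.html
paper-snowflake /winter.html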

Hash character in URLs (accessing and redirecting in Apache)

It looks as though this question has been asked in part by some others, but I can't find the answer I'm looking for specifically, so I thought I'd pose my particular scenario in case anyone is able to help.
We have an old website (developed externally by a third party) that is due to be retired and replaced by a new site designed in house. For reasons best known to themselves, the developers of the old site used the hash character as part of the URL for the old site (www.mysite.com/#/my-content-stuff). To assist with the transition and help with SEO I need to set up 301 redirects for the top performing URLs from the old site. As I'm now discovering however, I'm not able to set up a simple redirect in the .htaccess file as I believe it takes the hash character to be a comment and ignores the remainder of the line. I've tried escape characters, using %23 instead, wildcard matching, nothing seems to work.
As a workaround, I wondered about simply creating dummy files with the same paths and URLs as the old site had, then simply creating HTML redirects within them to drive traffic to the correct new pages, but it looks as though the server is doing something similar regarding the hash character in the URL, and ignoring anything after it. So, if I create a sub-folder on my new server called '#' and create a file in there called 'test.html', I expected to be able to just go to 'www.myNEWsite.com/#/test.html', but it just takes me to the default root file of my site.
Please can anyone shed any light on how I might get around this? I must admit I'm not that clued up on Apache so I'm having to learn a lot as I go.
Many thanks in advance for any pointers or info anyone can provide.
Cheers,
Rich
A hash character in the URL specifies the anchor, and it's not even sent to your webserver. A redirect is impossible on the server side, and the old developer probably did it using JavaScript. Implement fallback URLs without the hash instead, and have a global JavaScript script detect these URLs and redirect automatically.
Hash fragments cannot be read by the server. They are regarded as locations within the document and are therefore not exposed to the server; the client is the only one who sees them. The best you could do is use a "meta refresh" tag, or alternatively use JavaScript to detect the URL and, if it's one which requires 301 redirection, use "window.location" to move the user to a full URL where mod_rewrite or a PHP page can issue a 301 header.
However, neither approach is SEO-friendly, and they only really solve the issue for users who click an old link on an external site.
<!-- Put in head tag so the page does not wait to load the content-->
<script type="text/javascript">
if(window.location.hash != "") {
var h = window.location.hash.match(/#\/?(.*)/i)[1];
switch(h) {
case "something_old":
window.location = "/something_new.html";
break;
case "something_also_old":
window.location = "/something_also_new.html";
break;
}
}
</script>

Apache redirect when a user's home directory is completely empty

I work for an ISP, and I have a server with thousands of users, each with 10MB of free storage. They get this free storage with every e-mail account they have with us. An example of a user's storage address: http://users.example.com/~username/
One problem I can see is someone scanning the server for user names to see what accounts exist, basically getting a list of all our customers' valid e-mail addresses. This would be very, very bad.
So I want to redirect to our homepage if someone comes across a user's account that is empty (I'd say 90% of them are completely empty). I also do not want to simply set -Indexes and use a custom 403, because the few customers who do use their space want +Indexes.
I know I can always just tell customers to put an .htaccess file in their directory with Options +Indexes if they want a directory listing, but that's a last resort.
How can I make it pretty much impossible to tell what accounts are on the server but not in use at all?
I can't see a way to do this with Apache rules alone - and even if there were one, it would be pretty expensive to scan for files on every incoming request.
I would build a script that puts the appropriate .htaccess file, redirecting to your home page, into every completely empty account.
Maybe run it hourly, and make users aware that if they populate a directory for the first time, it may take up to an hour for their changes to take effect? I think that would be a reasonable time frame.
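For illustration, the .htaccess the script drops into each empty directory could be as small as this (a sketch; substitute your real homepage for the example URL):
# placed into empty ~username directories by the hourly job; removed once the user uploads content
Redirect 302 / http://www.example.com/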

.htaccess - route to selected destination but change browser url

Problem:
I'd like to accept the original request. Say it's /IWantToGoHere/index.php,
but I want the browser to end up showing /GoHere/index.php.
To be clear:
I actually want to send the original request through to the script that was requested; however, I want the URL returned to the user's browser to point to another destination.
Code:
RewriteEngine on
RewriteRule ^(.*)IWantToGoHere\/\.php$ GoHere/index.php [NC,C]
RewriteRule ^GoHere/index.php$ GoHere/index.php [R,NC]
Notes:
I realize the code above doesn't work. I've tried a number of different approaches. I spent umpteen hours yesterday trying every clever solution I could pull out of my limited mod_rewrite knowledge bank. Based on my understanding of mod_rewrite, I don't think it's doable; I understand it's not what the preprocessor was designed to do, at least not from anything I could find on the Apache web site. But I've been told that if I can dream it up, it can be done :) I was wondering if anyone had any ideas on how to get it to work.
Why would you want to do that?:
Because I do. No really, I want the URL returned to the user for further processing.
weez
If I understand the question correctly, to accomplish this you'll need to send a header from /IWantToGoHere/index.php that redirects to /GoHere/index.php once the script is finished executing. That is, if you want Apache to still call IWantToGoHere but end up at GoHere. So at the end of processing in the IWantToGoHere script, add something like this:
header('Location: /GoHere/index.php');
Which will redirect correctly.