Archiving an old PHP website: will any webhost let me totally disable query string support?

I want to archive an old website which was built with PHP. Its URLs are full of .phps and query strings.
I don't want anything to actually change from the perspective of the visitor -- the URLs should remain the same. The only actual difference is that it will no longer be interactive or dynamic.
I ran wget --recursive to spider the site and grab all the static content. So now I have thousands of files such as page.php?param1=a&param2=b. I want to serve them up as they were before, so that means they'll mostly have Content-Type: text/html, and the webserver needs to treat ? and & in the URL as literal ? and & in the files it looks up on disk -- in other words it needs to not support query strings.
And ideally I'd like to host it for free.
My first thought was Netlify, but deployment on Netlify fails if any files have ? in their filename. I'm also concerned that I may not be able to tell it that most of these files are to be served as text/html (and one as application/rss+xml) even though there's no clue about that in their filenames.
I then considered https://surge.sh/, but hit exactly the same problems.
I then tried AWS S3. It's not free but it's pretty close. I got further here: I was able to attach metadata to the files I was uploading so each would have the correct content type, and it doesn't mind the files having ? and & in their filenames. However, its webserver interprets ?... as a query string, and it looks up and serves the file without that suffix. I can't find any way to disable query strings.
Did I miss anything -- is there a way to make any of the above hosts act the way I want them to?
Is there another host which will fit the bill?
If all else fails, I'll find a way to transform all the filenames and all the links between the files. I found how to get wget to transform ? to #, which may be good enough. It would be a shame to go this route, however, since then the URLs are all changing.

I found a solution with Netlify.
I added the wget options --adjust-extension and --restrict-file-names=windows.
The --adjust-extension part adds .html at the end of filenames which were served as HTML but didn't already have that extension, so now we have for example index.php.html. This was the simplest way to get Netlify to serve these files as HTML. It may be possible to skip this and manually specify the content types of these files.
The --restrict-file-names=windows alters filenames in a few ways, the most important of which is that it replaces ? with #. This is needed since Netlify doesn't let us deploy files with ? in the name. It's a bit of a hack; this is not really what this option is meant for.
This gives static files with names like myfile.php#param1=value1&param2=value2.html and myfile.php.html.
I did some cleanup. For example, I needed to adjust a few link and resource paths to be absolute rather than relative due to how Netlify manages presence or lack of trailing slashes.
I wrote a _redirects file to define URL rewriting rules. As the Netlify redirect options documentation shows, we can test for specific query parameters and capture their values. We can use those values in the destinations, and we can specify a 200 code, which makes Netlify handle it as a rewrite rather than a redirection (i.e. the visitor still sees the original URL). An exclamation mark is needed after the 200 code if a "query-string-less" version (such as mypage.php.html) exists, to tell Netlify we are intentionally shadowing.
/mypage.php param1=:param1 param2=:param2 /mypage.php#param1=:param1&param2=:param2.html 200!
/mypage.php param1=:param1 /mypage.php#param1=:param1.html 200!
/mypage.php param2=:param2 /mypage.php#param2=:param2.html 200!
If not all query parameter combinations are actually used in the dumped files, not all of the redirect lines need to be included of course.
There's no need for a final /mypage.php /mypage.php.html 200 line, since Netlify automatically looks for a file with a .html extension added to the requested URL and serves it if found.
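For a large dump, writing these rules by hand gets tedious. Here is a minimal sketch of how the _redirects lines could be generated from the dumped filenames. It assumes the naming shown above (page, then #, then the query, then .html) and relies on Netlify using the first rule that matches, so rules with more parameters are emitted first. The function name redirect_rules is made up for this example, and parameter names containing unusual characters are not handled.

```python
from collections import defaultdict

def redirect_rules(filenames, sep="#"):
    """Build Netlify _redirects lines from wget-style dumped filenames."""
    combos = defaultdict(set)  # page -> set of parameter-name tuples seen
    for name in filenames:
        if sep not in name or not name.endswith(".html"):
            continue  # no query part, e.g. mypage.php.html
        page, query = name[: -len(".html")].split(sep, 1)
        keys = tuple(pair.split("=", 1)[0] for pair in query.split("&"))
        combos[page].add(keys)

    rules = []
    for page in sorted(combos):
        # Most parameters first, because the first matching rule wins.
        for keys in sorted(combos[page], key=lambda k: (-len(k), k)):
            conditions = " ".join(f"{k}=:{k}" for k in keys)
            target = "&".join(f"{k}=:{k}" for k in keys)
            rules.append(f"/{page} {conditions} /{page}{sep}{target}.html 200!")
    return rules

if __name__ == "__main__":
    dumped = [
        "mypage.php#param1=a&param2=b.html",
        "mypage.php#param1=c.html",
        "mypage.php.html",
    ]
    print("\n".join(redirect_rules(dumped)))
```

Only parameter combinations that actually occur in the dump produce a rule, which matches the note above about leaving out unused combinations.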
I wrote a _headers file to set the content type of my RSS file:
/rss.php
  Content-Type: application/rss+xml
I hope this helps somebody.

Related

Fail2Ban ignore 404 of local redirect

Assume a bad actor scripts access to an Apache server to probe for vulnerabilities. With Fail2Ban we can catch some number of 404s and ban the IP. Now assume a single web page has a bad local reference to a CSS, JS, or image file. Repeated hits by the same legitimate site visitor will result in some number of 404s, and possibly an IP ban.
Is there a good way to separate these local requests from remote so that we don't ban the valued visitor?
I know all requests are remote, in that a page gets returned to a browser and the content of the page triggers more requests for assets. The thing is, how do we know the difference between that kind of page load pattern, and a script query for the same resource?
If we do know that a request is coming in based on a link that we just generated, we could do a 302 redirect rather than returning a 404, thus avoiding the banning process.
The HTTP Referer header can be used. If the Referer is the same origin as the requested page, or the same as the local site FQDN, then we should not ban. But that header can be spoofed. So is this a good tool to use?
I'm thinking cookies can be used, or a session nonce, where a request might come in for assets from a page without a current session cookie. But I don't know if something like that is a built-in feature.
The best solution is obviously to make sure that all pages generated on a site include a valid reference back to the site, but we all know that's not possible. Some CMS add version info to files, or they adjust image paths to include an image size based on the client device/size. Any of these generated headers might simply be wrong until we can find and fix the code that creates them. Between the time we deploy something faulty and the time we fix it, I'm concerned about accidentally banning legitimate visitors with Fail2Ban (and other tools) that do not factor in where the request originates.
Is there another solution to this challenge? Thanks!
how do we know the difference between that kind of page load pattern
Normally you can't (at least not without some whitelist or blacklist).
But you do know URI or path segments, file extensions, etc. that would hardly ever be the target of such attack vectors, and you can ignore those.
Some CMS add version info to files, or they adjust image paths to include an image size based on the client device/size.
But you surely know the prefixes that are correct, so a regex that allows certain path segments is possible. For instance, this one:
# regex ignoring site and cms paths:
^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!site/|cms/)\S+ HTTP/[^"]+" 40\d\s\d+
will ignore this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /site/style.css?ver=1.0 HTTP/1.1" 404 469
and match this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 469
Similarly, you can write a regex with a negative lookahead to ignore certain extensions like .css or .js, or arguments like ?ver=1.0.
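Such a regex can be checked offline before it goes into a filter. The sketch below runs the failregex from above against the two sample log lines with Python's re module. The HOST group is only a simple IPv4 stand-in for fail2ban's own <HOST> substitution, and BY_EXTENSION is a hypothetical variant for the extension case, not a tested production filter.

```python
import re

# fail2ban substitutes <HOST> itself; this IPv4-only group is a stand-in for testing.
HOST = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"

# The failregex from the answer: skip anything under /site/ or /cms/.
BY_PREFIX = r'^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!site/|cms/)\S+ HTTP/[^"]+" 40\d\s\d+'

# Hypothetical variant: skip .css/.js requests, with or without a query string.
BY_EXTENSION = (
    r'^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!\S*\.(?:css|js)(?:\?\S*)?\s)\S+'
    r' HTTP/[^"]+" 40\d\s\d+'
)

def matched_host(pattern, line):
    """Return the offending host if the line matches the failregex, else None."""
    m = re.search(pattern.replace("<HOST>", HOST), line)
    return m.group("host") if m else None

ASSET_404 = '192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /site/style.css?ver=1.0 HTTP/1.1" 404 469'
PROBE_404 = '192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 469'

if __name__ == "__main__":
    for name, pattern in (("prefix", BY_PREFIX), ("extension", BY_EXTENSION)):
        print(name, matched_host(pattern, ASSET_404), matched_host(pattern, PROBE_404))
```

On a real installation, fail2ban-regex does the same check against the actual log file and filter.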
Another possibility is to define a special fallback location that logs clearly bogus requests to a separate log file (not the access or error log), as described in wiki :: Best practice. That way fail2ban only considers offenders whose URIs are definitely wrong and did not match any proper location handled by the web server.
Or simply disable logging of 404s in locations known to be valid (paths, prefixes, extensions, whatever).
To reduce false positives, you can first increase maxretry or reduce findtime and observe for a while: offenders with many attempts still get banned, while legitimate users whose "broken" requests cause only a few 404s are ignored. Meanwhile you can collect the full list of "valid" 404 requests of your application, in order to write a more precise regex or filter them out in certain locations.
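To get a feel for how maxretry and findtime interact, here is a rough simulation. It assumes a host is banned once it has at least maxretry matched failures within findtime seconds. The function banned_hosts and the sample events are made up for illustration; this is not fail2ban's actual implementation.

```python
from collections import defaultdict, deque

def banned_hosts(events, maxretry, findtime):
    """events: time-ordered (timestamp_seconds, host) pairs for matched 404s."""
    recent = defaultdict(deque)  # host -> timestamps still inside the window
    banned = set()
    for ts, host in events:
        window = recent[host]
        window.append(ts)
        # Forget failures that are older than findtime.
        while window and ts - window[0] > findtime:
            window.popleft()
        if len(window) >= maxretry:
            banned.add(host)
    return banned

if __name__ == "__main__":
    events = [(0, "scanner"), (0, "visitor"), (10, "scanner"), (20, "scanner"),
              (500, "visitor"), (1000, "visitor")]
    print(banned_hosts(events, maxretry=3, findtime=60))
```

Here the host with three failures in 20 seconds is banned, while the one with three failures spread over 1000 seconds is not.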

List of served files in apache

I am doing some reverse engineering on a website.
We are using a LAMP stack under CentOS 5, without any commercial or open source framework (Symfony, Laravel, etc.). Just plain PHP with an in-house framework.
I wonder if there is any way to know which files in the server have been used to produce a request.
For example, let's say I am requesting http://myserver.com/index.php.
Let's assume that 'index.php' calls other PHP scripts (e.g. to connect to the database and retrieve some info) and also includes a couple of other HTML files, etc.
How can I get the list of those accessed files?
I already tried to enable the server-status directive in apache, and although it is working I can't get what I want (I also passed the 'refresh' parameter)
I also used lsof -c httpd, as suggested in other forums, but it is producing a very big output and I can't find what I'm looking for.
I also read the apache logs, but I am only getting the requests that the server handled.
Some other users suggested to add the PHP directives like 'self', but that means I need to know which files I need to modify to include that directive beforehand (which I don't) and which is precisely what I am trying to find out.
Is it actually possible to trace the internal activity of the server and get those file names and locations?
Regards.
Not that I tried this, but it looks like mod_log_config is the answer to my own question

Enable the use of SSI

I'm using HostGator and can't seem to get SSI to work on my server. I'm using Dreamweaver to build the site, and everything works just fine in the preview. But when I actually upload the pages to my server, any elements that are include files don't appear. Does anyone know how I can enable SSI on my web server?
Your last comment gave me the information I need. The issue is that the file is not in the same directory as the file you're trying to add the footer.inc file to. Try this code:
<!--#include virtual="includes/footer.inc" -->
When using the file= parameter, the file you're including must be in the same directory. If the file you're including is not in the same directory, then you will have to use virtual=. See this page for more information: SSI: The Include Command.
And here, from the source, is pretty much the rule of thumb: Use file= when the included file is within the same directory as the page that wants it. Use virtual= when it isn't.
EDIT: I think I got it now. Copy and paste the above code and it should work for you. Make sure you follow these guidelines: after <!--, there is no space between the last - and #. Additionally, there is a space between the closing " and the first - of -->. These rules must be adhered to. You can view more information here: Server Side Includes Not Working
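If you have many pages, a quick scan can catch directives that break these spacing rules before you upload. This is only a sketch: the pattern is deliberately strict (it also rejects a space after the = sign, which the server may or may not tolerate), and malformed_includes is a made-up helper name.

```python
import re

# Strict form described above: "<!--#" with no space, and a space before "-->".
STRICT = re.compile(r'<!--#include\s+(file|virtual)="([^"]+)"\s+-->')
# Anything that looks like an include directive, however it is spaced.
LOOSE = re.compile(r"<!--\s*#\s*include\b.*?-->", re.DOTALL)

def malformed_includes(html):
    """Return include-like comments that do not match the strict form."""
    return [m.group(0) for m in LOOSE.finditer(html)
            if not STRICT.fullmatch(m.group(0))]

if __name__ == "__main__":
    page = (
        '<!--#include virtual="includes/footer.inc" -->\n'   # fine
        '<!-- #include virtual="includes/footer.inc" -->\n'  # space before #
        '<!--#include virtual="includes/footer.inc"-->\n'    # no space before -->
    )
    for directive in malformed_includes(page):
        print("check:", directive)
```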

method for getting correct system path on windows

I have put together a simple HTTP server using libevent. Resources (folders, in my case) are accessed as
http://serverAddress:port/path/to/resource/
The path to the resource is extracted from the decoded URL. This works fine on Linux, where the request looks something like this:
http://serverAddress:port/home/vickey/folder
but on Windows the request is
http://serverAddress:port/c:/users/vickey/folder
which results in a decoded URL of /c:/users/vickey/folder. It's possible to remove the leading slash manually to correct the problem. However, since I'm using and learning the Boost libraries in my code, I was wondering whether there is some existing implementation for this? I tried using native() and relative_path(). Thanks.
It's definitely possible to do as you're asking, but I would suggest a different approach. How about creating a configuration property for the server, which could be called RESOURCE_BASE_PATH? The resource path received in the URL would be appended to RESOURCE_BASE_PATH to create the complete path.
This is pretty standard for FTP and HTTP servers and the like. On Windows, it could be set to "c:" and on Linux, left blank which would default to "/".
Also remember that on Windows the path separator (\) is different from the one on Unix (/).
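To illustrate the base-path idea, here is a small sketch. It is in Python rather than C++/Boost, purely to show the logic. The function resolve and its parameters are hypothetical. It uses "c:/" rather than "c:" as the Windows base, because a bare drive letter joined with a relative part yields a drive-relative path. It also rejects segments that could escape the base directory, which matters once the URL no longer carries the full system path.

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import unquote

def resolve(base, url_path, windows=False):
    """Join a configured base directory with the path taken from the request URL."""
    parts = [p for p in unquote(url_path).split("/") if p not in ("", ".")]
    for part in parts:
        # Reject anything that could climb out of, or replace, the base directory.
        if part == ".." or (windows and (":" in part or "\\" in part)):
            raise ValueError(f"unsafe path segment: {part!r}")
    flavour = PureWindowsPath if windows else PurePosixPath
    return flavour(base, *parts)

if __name__ == "__main__":
    print(resolve("/", "/home/vickey/folder"))
    print(resolve("c:/", "/users/vickey/folder", windows=True))
```

With this approach the request on Windows becomes http://serverAddress:port/users/vickey/folder, and the drive letter lives only in the server configuration.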

mod_rewrite to serve static cached files if they exist and if the source file hasn't changed

I am working on a project that processes images, saves the processed images in a cache, and outputs the processed image to the client. Let's say that the project is located in /project/, the cache is located in /project/cache/, and the source images are located elsewhere on the server (like in /images/ or /otherproject/images/). I can set up the cache to mirror the path to the source image (e.g. if the source image is /images/image.jpg, the cache for that image could be /project/cache/images/image.jpg), and the requests to the project are roughly /project/path/to/image (e.g. /project/images/image.jpg).
I would like to serve the images from the cache, if they exist, as efficiently as possible. However, I also want to be able to check to see if the source image has changed since the cached image was created. Ideally, this would all be done with mod_rewrite so PHP wouldn't need to be used to do any of the work.
Is this possible? What would the mod_rewrite rules need to be for this to work?
Alternatively, it seems like it would be a fine compromise to have mod_rewrite serve the cached file most of the time but send 1 out of X requests to the PHP script for files that are cached. Is this possible?
You cannot access the file modification timestamp from a RewriteRule, so there is no way around using PHP or another programming language for that task.
On the other hand, this is really simple in PHP, so you should first check whether the PHP solution is good enough in your case. Only if it isn't should you look for alternatives.
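The check itself is just a comparison of two modification times. Here is a sketch of that logic, written in Python instead of PHP only for illustration (the PHP version would use filemtime() the same way). The function name cache_is_fresh and the example paths are assumptions.

```python
import os

def cache_is_fresh(source_path, cache_path):
    """True if the cached file exists and is at least as new as the source."""
    try:
        return os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    except OSError:
        return False  # missing source or cache: treat as stale

if __name__ == "__main__":
    # Hypothetical locations following the layout in the question.
    source = "/images/image.jpg"
    cached = "/project/cache/images/image.jpg"
    print("serve from cache" if cache_is_fresh(source, cached) else "regenerate")
```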
What if you used the client to do some of the work? Say you display an image in the web browser and always use src="/cache/images/foobar.jpg", and add an onerror="this.src='/images/foobar.jpg'". In mod_rewrite, send anything that goes to the /images/ dir to a script that will generate the image in the cache and return it.