For the first time I've had to wrap something I'm working on as a CGI script. I'm having trouble with browsers (both Chrome and Firefox) not recognising the Content-Length header and reporting the size as "unknown" to users.
When I test this with the Linux tool wget, it recognises the size just fine.
When I test this manually through openssl s_client -connect, the precise output from the webserver is as follows:
HTTP/1.1 200 OK
Date: Sun, 30 Jul 2017 20:12:20 GMT
Server: Apache/2.4.25 (Ubuntu) mod_fcgid/2.3.9 OpenSSL/1.0.2g
Content-Disposition: attachment; filename=foo.000000000G-000000001G.foofile.txt;
Content-Length: 501959790
Vary: Accept-Encoding
Content-Type: text/plain;charset=utf-8
Can anyone suggest what is missing / badly formatted?
Cracked it eventually.
This was caused by Apache doing something unexpected: it is compressing the output of the CGI script on the fly (sending it with Content-Encoding: gzip). That changes the size of the body, but Apache cannot know by how much at the point where it sends the headers. The files are about 1/2 GB each, so it can't/doesn't buffer the whole gzipped output before it starts sending and therefore cannot know the final size. This means it has to drop the Content-Length header and switch to Transfer-Encoding: chunked.
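For illustration, a browser that sends Accept-Encoding: gzip then gets back something roughly like this instead (a reconstruction, not a capture; note the missing Content-Length):
HTTP/1.1 200 OK
Content-Type: text/plain;charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Transfer-Encoding: chunked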
One way to fix this is to set Content-Encoding: none in the header, which stops Apache from compressing the content. This does mean that the 1/2 GB files take much longer to send.
Another would be to gzip the content manually in my CGI script and set Content-Encoding: gzip and Content-Length: <gzipped size>. This requires working out the compressed size before sending; see the sketch below.
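A minimal sketch of that second approach, assuming the CGI script is (or can call) a shell script; generate_output is a placeholder for whatever currently produces the data:
#!/bin/sh
# Sketch only: pre-compress to a temporary file so the exact gzipped size
# is known before the headers go out. "generate_output" is a placeholder.
TMP=$(mktemp)
generate_output | gzip -c > "$TMP"
SIZE=$(stat -c%s "$TMP")   # GNU stat, as on Ubuntu

# mod_deflate should leave this response alone because Content-Encoding
# is already set, so the Content-Length stays valid.
printf 'Content-Type: text/plain;charset=utf-8\r\n'
printf 'Content-Encoding: gzip\r\n'
printf 'Content-Length: %s\r\n' "$SIZE"
printf '\r\n'
cat "$TMP"
rm -f "$TMP"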
Related
Some websites provide PDF files for viewing, but I can't download such PDF files with wget.
Opening the website in question in my browser displays the PDF:
https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
But using the following command I only get a PDF file with 0 length.
wget --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
I tried some combinations of saving and loading cookies and setting the referer, but nothing works.
At this point I'm just curious what is happening and why wget is not fetching anything except maybe an empty index.html.
When I looked at the server response, it said the content length was 0.
--2021-04-17 14:59:35-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sat, 17 Apr 2021 13:59:36 GMT
Server: Apache
Set-Cookie: fe_typo_user=477e8a1d2b3dd74bc5b6b408a6d74edd; expires=Mon, 17-May-2021 13:59:36 GMT; Max-Age=2592000; path=/; domain=.lokalmatador.de; httponly; samesite=lax
Upgrade: h2,h2c
Connection: Upgrade, Keep-Alive
Content-Length: Array
Cache-Control: max-age=2592000
Expires: Mon, 17 May 2021 13:59:36 GMT
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Keep-Alive: timeout=5, max=100
Content-Type: application/pdf
Length: 0 [application/pdf]
Remote file exists but does not contain any link -- not retrieving.
So I looked at the manual:
https://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html
And there is an option exactly for this:
‘--ignore-length’
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header—as if it never existed.
Then the wget command started working as expected:
wget --ignore-length -O epaper.pdf https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Here is the output I see with --ignore-length:
--2021-04-17 14:56:19-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [application/pdf]
Saving to: ‘epaper.pdf’
epaper.pdf [ <=> ] 4.39M 1.23MB/s in 3.6s
2021-04-17 14:56:23 (1.21 MB/s) - ‘epaper.pdf’ saved [4601842]
I use the XAMPP control panel to host an Apache server. I'm testing how to serve MHTML files. So far the server only shows me raw text when I visit the file. I looked around for how to make it work, but the solutions I found (for example, adding "AddType message/rfc822 .mhtml .mht" to the httpd.conf file) just make the browser download the file instead of rendering it.
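For reference, this is roughly where that directive goes (a sketch assuming the stock XAMPP httpd.conf; whether the browser then renders message/rfc822 inline or still downloads it is up to the browser):
# In httpd.conf (or an .htaccess file in the directory holding the files)
AddType message/rfc822 .mhtml .mht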
Here's a sample of the initial block of the mhtml file:
From: <Saved by Blink>
Snapshot-Content-Location: https://www.instagram.com/jo0sef/
Subject: =?utf-8?Q?Yousef=20AlSudais=20=D9=8A=D9=88=D8=B3=D9=81=20=D8=A7=D9=84=D8?=
=?utf-8?Q?=B3=D8=AF=D9=8A=D8=B3=20(#jo0sef)=20=E2=80=A2=20Instagram=20pho?=
=?utf-8?Q?tos=20and=20videos?=
Date: Tue, 16 Feb 2021 08:18:55 -0000
MIME-Version: 1.0
Content-Type: multipart/related;
type="text/html";
boundary="----MultipartBoundary--c1Osf7aCebmaZjjAXk0gfl7cuYp300joTDYRFPKyLF----"
------MultipartBoundary--c1Osf7aCebmaZjjAXk0gfl7cuYp300joTDYRFPKyLF----
Content-Type: text/html
Content-ID: <frame-AD05338F6D10E72FA62E6C2E3D66903E#mhtml.blink>
Content-Transfer-Encoding: quoted-printable
Content-Location: https://www.instagram.com/jo0sef/
I can't access a specific file type on my customer's server (production).
Here are the results with cURL:
curl "http://domain.tld/fonts/glyphicons-halflings-regular.eot" -I
HTTP/1.1 200 OK
Date: Tue, 28 Jul 2015 12:06:23 GMT
Server: Apache/2.2.15 (Red Hat)
Last-Modified: Tue, 19 May 2015 15:32:20 GMT
ETag: "14023-4f42-516710421e900"
Accept-Ranges: bytes
Content-Length: 20290
Connection: close
Content-Type: application/vnd.ms-fontobject
The file is here.
But when I try to get the file content:
curl "http://domain.tld/fonts/glyphicons-halflings-regular.eot"
curl: (56) Recv failure: Connection was reset
I can't (yet) access the customer server, so I'm trying to guess what's wrong here.
What I have found so far:
curl "https://domain.tld/fonts/glyphicons-halflings-regular.eot" --insecure
It works over HTTPS, even though there is no valid certificate (which is why I use --insecure); I get the file content.
The customer can get the file if he accesses the file from a local URL.
I can access all other files on the server, even in the fonts directory.
I can't access any .eot file, even in other directories.
So I think it is one of these two problems:
- Apache configuration / .htaccess problem.
- Proxy / reverse proxy problem.
What do you think about it?
What other tests should I do?
What information should I ask the customer for?
Thanks.
Ok, here is the cause:
The customer's firewall blocks .eot file content.
A vulnerability in Embedded Web Fonts Could Allow Remote Code Execution.
http://www.checkpoint.com/defense/advisories/public/2006/cpai-2006-010.html
As the .eot files are used by IE8 and lower, and those browser versions are not required by the customer, I've simply removed all references to .eot files.
Another solution would be to ask the customer's firewall admins to add an exception, as the severity is low.
Please note: this is not a complaint about a shoddy CMS.
I was just toying with Apache Bench and got terrible results with our custom CMS; more exactly, I got:
Requests per second: 0.37 [#/sec] (mean)
When I ran another test with a plain PHP file I got:
Requests per second: 4786.07 [#/sec] (mean)
Another test with a previous version of the CMS:
Requests per second: 6068.66 [#/sec] (mean)
The website(s) are working fine, no problems detected, and Google's Webmaster Tools reports our sites as faster than 80% of pages, which is fine, I think.
The test was:
ab -t 30 -c 10 http://example.com/
Maybe some kind of Apache problem? Bad .htaccess config, or similar?
Update:
I just ran a simple test with sockets and the results are similar: the page loads very, very slowly. If I run my script against another website, everything is fine.
Also, there's a small hint of a chunk-length problem (bad Apache headers, or line endings?).
The site is gzipped, and with verbose logging turned on, I see these lines in the response:
LOG: Response code = 200
LOG: header received:
HTTP/1.1 200 OK
Date: Tue, 04 Oct 2011 13:10:49 GMT
Server: Apache
Set-Cookie: PHPSESSID=ibnfoqir9fee2koirfl5mhm633; path=/
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Cache-Control: post-check=0, pre-check=0
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
2ef6
That always appears at the same place, in the middle of the HTML source, and then <!DOCTYPE HTML> again.
Please, help.
Update #2:
I just checked my HTTP headers with Rex Swain's HTTP Viewer and got these results:
HTTP/1.1·200·OK(CR)(LF)
Date:·Wed,·05·Oct·2011·08:33:51·GMT(CR)(LF)
Server:·Apache(CR)(LF)
Set-Cookie:·PHPSESSID=n88g3qcvv9p6irm1fo0qfse8m2;·path=/(CR)(LF)
Expires:·Sat,·26·Jul·1997·05:00:00·GMT(CR)(LF)
Cache-Control:·no-store,·no-cache,·must-revalidate(CR)(LF)
Pragma:·no-cache(CR)(LF)
Cache-Control:·post-check=0,·pre-check=0(CR)(LF)
Vary:·Accept-Encoding(CR)(LF)
Connection:·close(CR)(LF)
Transfer-Encoding:·chunked(CR)(LF)
Content-Type:·text/html;·charset=UTF-8(CR)(LF)
(CR)(LF)
Do you notice anything unusual?
If it works well with ordinary web browsers (as you mentioned in the comments), then the CMS handles the requests from Apache Benchmark differently.
A quick checklist:
AFAIK Apache Benchmark just sends simple requests without any cookie handling, so try setting -C with a valid cookie (copy the value from a web browser); see the sketch after this list.
Try to send exactly the same headers to the CMS as the web browser sends. Save a dump of a valid request with netcat, HttpFox or a packet sniffer, and set the missing headers with -H.
Profile the CMS on the server while you're sending it a request with Apache Benchmark; maybe you'll find the bottleneck. Two poor man's error_log calls with a timestamp, one in the first and one in the last line of index.php (or the tested script's entry point), can show how fast the PHP script itself is and help you work out the overhead of the Apache HTTP Server and the network.
If you run socket tests and browser tests from different machines, it could be a DNS issue (turn off HostnameLookups in Apache). Try running them from the same machine.
Try ab -k ... or ab -H "Connection: close" ....
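A sketch of the first two points (the cookie and header values are placeholders; copy the real ones from your browser):
ab -t 30 -c 10 \
   -C "PHPSESSID=ibnfoqir9fee2koirfl5mhm633" \
   -H "Accept-Encoding: gzip, deflate" \
   -H "User-Agent: Mozilla/5.0" \
   http://example.com/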
My guess is that the CMS does some costly initialization when it creates the session, and that happens when it processes the first request. Since Apache Benchmark does not send the cookies back to the CMS, it creates a new session for every request, and that is the cause of the slow responses.
A second guess is that the CMS handles the incoming HTTP headers differently, and the headers sent by Apache Benchmark (or the lack of them) trigger some costly/slow processing. That looks more likely given the report from Google's Webmaster Tools.
Apache Benchmark sends HTTP/1.0 requests, for example:
GET / HTTP/1.0
Host: localhost:9100
User-Agent: ApacheBench/2.3
Accept: */*
It looks to me like your server does not send any HTTP header about its Keep-Alive settings, but it assumes the client uses keep-alive even when the client speaks HTTP/1.0. That is not RFC-compliant behaviour:
From RFC 2616, 19.6.2 Compatibility with HTTP/1.0 Persistent Connections:
Some clients and servers might wish to be compatible with some
previous implementations of persistent connections in HTTP/1.0
clients and servers. Persistent connections in HTTP/1.0 are
explicitly negotiated as they are not the default behavior.
By default Apache Benchmark doesn't use keep-alive, so after the response arrives it waits for the socket to be closed. The server closes it after 15 seconds of idle time. Downloading the main page with wget also takes 15 seconds; wget also uses HTTP/1.0 in its request.
I think it's a bug in the PHP code of the CMS, since ab works well on the same server with a plain PHP file. Anyway, you can work around it by using keep-alive connections (-k):
ab -k -t 30 -c 10 http://example.com/
or by explicitly disabling persistent connections:
ab -H "Connection: close" -t 30 -c 10 http://example.com/
but it's still a server-side issue, and your original ab command is right.
Please note that this bug probably affects only HTTP/1.0 clients (like Apache Benchmark and wget); users with regular browsers will not notice it.
I noticed that Chrome cached a video file. I replaced it with another one on the server, and Chrome kept serving the old one from cache (using JW Flash Player 5).
The response headers look like this:
joe@joe-desktop:~$ wget -O - -S --spider http://www.2xfun.de/files_geheimhihi14/20759.mp4
Spider mode enabled. Check if remote file exists.
--2011-05-15 22:40:56-- http://www.2xfun.de/files_geheimhihi14/20759.mp4
Resolving www.2xfun.de... 213.239.214.112
Connecting to www.2xfun.de|213.239.214.112|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sun, 15 May 2011 20:40:56 GMT
Server: Apache
Last-Modified: Sun, 15 May 2011 20:37:59 GMT
ETag: "89b38-3bb227-4a35683b477c0"
Accept-Ranges: bytes
Content-Length: 3912231
Cache-Control: max-age=29030400, public, must-revalidate
Expires: Sun, 15 Apr 2012 20:40:56 GMT
Connection: close
Content-Type: video/mp4
Length: 3912231 (3.7M) [video/mp4]
Remote file exists.
I am using mod_headers and mod_expires in apache2 like this:
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
ExpiresDefault A29030400
Header append Cache-Control "public, must-revalidate"
</FilesMatch>
Did I spell revalidate wrong or something?
Edit:
To make the use case clear: I want the files to be cached, because they are rather big and I want to save bandwidth. But on the other hand I want the files to be re-validated, so the client makes a conditional request to check whether the content has changed (that's what the ETag is for) and only re-fetches it if necessary.
Your problem is that must-revalidate only kicks in once a cache entry is no longer fresh, but you've marked the response as cacheable for 29 million seconds. 'Cache-Control: max-age=0, must-revalidate' may be closer to what you want, if you want to allow caching but require revalidation on each use.
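In the mod_expires/mod_headers style already used above, that could look something like this (a sketch; ExpiresDefault A0 makes mod_expires emit max-age=0, and the appended must-revalidate is merged into the same Cache-Control header):
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
ExpiresDefault A0
Header append Cache-Control "must-revalidate"
</FilesMatch>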