Use wget to download a PDF with no direct link

Some websites provide PDF files for viewing, but I can't download those files with wget.
Opening the URL in question in my browser displays the PDF:
https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
But with the following command I only get a PDF file of length 0:
wget --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
I tried some combinations of saving and loading cookies and setting a referer (roughly as sketched below), but nothing worked.
At this point I'm just curious what is happening and why wget fetches nothing except maybe an empty index.html.
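Roughly, those attempts looked like this (a reconstruction; the exact flags are an assumption, not the commands actually used):
# first visit the site to pick up the session cookie
wget --save-cookies cookies.txt --keep-session-cookies https://www.lokalmatador.de/
# then retry the download, replaying the cookie and sending a referer
wget --load-cookies cookies.txt --referer=https://www.lokalmatador.de/ --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/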

When I looked at the server response, it said the content length was 0:
--2021-04-17 14:59:35-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sat, 17 Apr 2021 13:59:36 GMT
Server: Apache
Set-Cookie: fe_typo_user=477e8a1d2b3dd74bc5b6b408a6d74edd; expires=Mon, 17-May-2021 13:59:36 GMT; Max-Age=2592000; path=/; domain=.lokalmatador.de; httponly; samesite=lax
Upgrade: h2,h2c
Connection: Upgrade, Keep-Alive
Content-Length: Array
Cache-Control: max-age=2592000
Expires: Mon, 17 May 2021 13:59:36 GMT
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Keep-Alive: timeout=5, max=100
Content-Type: application/pdf
Length: 0 [application/pdf]
Remote file exists but does not contain any link -- not retrieving.
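(That header dump looks like wget's spider-mode output; something like the following should reproduce it, though the exact invocation is an assumption:)
wget -S --spider https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/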
So I looked at the manual:
https://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html
And there is an option exactly for this case:
‘--ignore-length’
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header—as if it never existed.
Then the wget command started working as expected:
wget --ignore-length -O epaper.pdf https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Here is the output I see with --ignore-length:
--2021-04-17 14:56:19-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [application/pdf]
Saving to: ‘epaper.pdf’
epaper.pdf [ <=> ] 4.39M 1.23MB/s in 3.6s
2021-04-17 14:56:23 (1.21 MB/s) - ‘epaper.pdf’ saved [4601842]

Related

Browsers not recognising Content-Length header when using apache and cgi

For the first time I've had to wrap something I'm working on as a CGI script. I'm having trouble with browsers (both Chrome and Firefox) not recognising the Content-Length header and reporting the size as "unknown" to users.
When I test this with the Linux tool wget, it recognises the size just fine.
When I test it manually through openssl s_client -connect, the precise output from the web server is as follows:
HTTP/1.1 200 OK
Date: Sun, 30 Jul 2017 20:12:20 GMT
Server: Apache/2.4.25 (Ubuntu) mod_fcgid/2.3.9 OpenSSL/1.0.2g
Content-Disposition: attachment; filename=foo.000000000G-000000001G.foofile.txt;
Content-Length: 501959790
Vary: Accept-Encoding
Content-Type: text/plain;charset=utf-8
Can anyone suggest what is missing / badly formatted?
Cracked it eventually.
This was caused by Apache doing something unexpected. Apache compresses the output of the CGI script on the fly (sending it with Content-Encoding: gzip). This changes the size of the body, but Apache cannot know by how much at the moment it sends the headers. The files are 1/2 GB each, so it can't / doesn't buffer the gzipped content before it starts sending, therefore it cannot know the final size and has to switch to Transfer-Encoding: chunked.
One way to fix this is to set Content-Encoding: none in the header, which stops Apache from compressing the content. This does mean that the 1/2 GB files take much longer to send.
Another might be to gzip the content manually in my CGI script and set Content-Encoding: gzip and Content-Length: <gzipped size>. This requires working out the compressed size before sending; a rough sketch follows below.
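A minimal sketch of that second approach as a bash CGI script (untested; the file path and filename are placeholders, not the real ones):
#!/bin/bash
# Hypothetical CGI sketch: pre-compress the payload so an exact Content-Length can be sent.
FILE=/srv/data/foo.000000000G-000000001G.foofile.txt   # placeholder path
TMP=$(mktemp)
gzip -c "$FILE" > "$TMP"
# CGI headers; the blank line ends the header block and the server adds the status line.
printf 'Content-Type: text/plain;charset=utf-8\r\n'
printf 'Content-Disposition: attachment; filename=%s;\r\n' "$(basename "$FILE")"
printf 'Content-Encoding: gzip\r\n'
printf 'Content-Length: %s\r\n' "$(stat -c %s "$TMP")"
printf '\r\n'
cat "$TMP"
rm -f "$TMP"
Whether Apache then passes that Content-Length through untouched would still need to be verified against the mod_fcgid setup.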

RabbitMQ HTTP API call to aliveness-test returns 404 but other calls work

When using the HTTP API I am trying to make a call to the aliveness-test for monitoring purposes. At the moment I am testing using curl and the following command:
curl -i http://guest:guest@localhost:55672/api/aliveness-test/
And I get the following response:
HTTP/1.1 404 Object Not Found
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
Date: Mon, 05 Nov 2012 17:18:58 GMT
Content-Type: text/html
Content-Length: 193
<HTML><HEAD><TITLE>404 Not Found</TITLE></HEAD><BODY><H1>Not Found</H1>The requested document was not found on this server.<P><HR><ADDRESS>mochiweb+webmachine web server</ADDRESS></BODY></HTML>
When making a request just to list the users or vhosts, the requests returns successfully:
$ curl -I http://guest:guest@localhost:55672/api/users
HTTP/1.1 200 OK
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
Date: Mon, 05 Nov 2012 17:51:44 GMT
Content-Type: application/json
Content-Length: 11210
Cache-Control: no-cache
I'm using the latest stable version (2.8.7) of RabbitMQ and obviously have the management plugin installed for the API to work with the users call (the response is left out due to it containing company data but is just regular JSON as expected).
There isn't much on the internet about this call failing so I am wondering if anyone has seen this before?
Thanks,
Kristian
Turns out that the vhost name, in this case the default vhost '/', is not implicit, even as part of the URL; it has to be appended explicitly. To get this to work I simply changed my request from:
curl -i http://guest:guest@localhost:55672/api/aliveness-test/
to
curl -i http://guest:guest@localhost:55672/api/aliveness-test/%2F
As %2F is the URL-encoded form of '/', the request now queries the vhost named '/' and returns a 200 response that looks like:
{"status":"ok"}

uwsgi breaks headers

I'm using Nginx + uwsgi + python3
Sending a single header via start_response works fine, but when I want to send more than one header, things go wrong.
For example, if I write:
start_response('200 OK', [('Last-Modified', 'Wed, 11 Jan 2012 00:00:00 GMT'), ('Content-Type', 'text/html; charset=windows-1251')])
The headers sent are:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: nginx/1.0.11
Connection: close
Date: Wed, 11 Jan 2012 04:17:22 GMT
Content-Type: text/html; charset=windows-1251
Content-Type: text/html; charset=windows-12
uwsgi sends the same header twice, and what's more, the second copy is truncated.
Which uWSGI and nginx versions? I cannot reproduce your error in either 0.9.8.x or 1.0.x.
You can check the real headers sent by uWSGI by putting it in HTTP mode with --http/--http-socket.
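For example, a rough sketch (assuming the WSGI application lives in a file app.py and uWSGI was built with Python 3 support):
# serve the app directly over HTTP, bypassing nginx, so the raw headers are visible
uwsgi --http-socket 127.0.0.1:8080 --wsgi-file app.py
# in another shell, inspect exactly what uWSGI sends
curl -i http://127.0.0.1:8080/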

Keep assets fresh in the browser and avoid the cache freshness-check request [for a Rails 3.1 app on Heroku]

I have a lot of small images (around 3 KB each) and a lot of CSS and JS files. After the first request they are cached in the browser, but when I reload the page the browser checks the freshness of the cached content (by setting If-Modified-Since etc.) and gets a 304 Not Modified response. Each of these validation requests seriously increases the page load time (say 20 times 300 ms).
How can I stop the browser from making this freshness check against the server? How can I instruct the browser to use the locally cached files/images for a certain time (say 1 hour) without re-validating them against the remote server on every reload within that period?
Sample header details for a small image fetch are below [using Rails 3.1, on Heroku]:
Response Headers
HTTP/1.1 304 Not Modified
Server: nginx/0.7.67
Date: Thu, 10 Nov 2011 17:53:33 GMT
Connection: keep-alive
Via: 1.1 varnish
X-Varnish: 1968827848
Last-Modified: Tue, 08 Nov 2011 07:36:04 GMT
Cache-Control: public, max-age=31536000
Etag: "5bda917d22f8a144c293f3f19723dbc6"
Request Headers
GET /assets/icons/flash_close_button-5bda917d22f8a144c293f3f19723dbc6.png HTTP/1.1
Host: ???.heroku.com
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:6.0.1) Gecko/20100101 Firefox/6.0.1
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://???.heroku.com/
Cookie: ???
If-Modified-Since: Tue, 08 Nov 2011 07:36:04 GMT
If-None-Match: "5bda917d22f8a144c293f3f19723dbc6"
Cache-Control: max-age=0
This line:
Cache-Control: public, max-age=31536000
is telling the browser not to ask for updates for a long time, and to store the files in a publicly accessible cache (which here means public to the local machine, not the general public). Your browser should therefore not really be re-checking those files. Have you tried another browser to verify that this behaviour exists elsewhere?
Having said all of that, considering that your files are coming from the Varnish cache and not your dyno, and are being returned as HTTP 304, 300 ms each for 20 files sounds like a very long time. However, this should be barely perceptible to the user.
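If you want to rule out browser-specific behaviour, one way to see what the cache layer actually sends is to request the asset headers from the command line (a sketch; ??? stands in for the real hostname, as in the question):
# HEAD request for one of the cached assets; check the Cache-Control value in the reply
curl -sI 'http://???.heroku.com/assets/icons/flash_close_button-5bda917d22f8a144c293f3f19723dbc6.png'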

Must-revalidate headers of this request wrong?

I noticed that Chrome cached a video file. I replaced it with another one on the server, and Chrome kept serving the old one from cache (using JW flash player 5).
The headers of the request look like this:
joe@joe-desktop:~$ wget -O - -S --spider http://www.2xfun.de/files_geheimhihi14/20759.mp4
Spider mode enabled. Check if remote file exists.
--2011-05-15 22:40:56-- http://www.2xfun.de/files_geheimhihi14/20759.mp4
Resolving www.2xfun.de... 213.239.214.112
Connecting to www.2xfun.de|213.239.214.112|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sun, 15 May 2011 20:40:56 GMT
Server: Apache
Last-Modified: Sun, 15 May 2011 20:37:59 GMT
ETag: "89b38-3bb227-4a35683b477c0"
Accept-Ranges: bytes
Content-Length: 3912231
Cache-Control: max-age=29030400, public, must-revalidate
Expires: Sun, 15 Apr 2012 20:40:56 GMT
Connection: close
Content-Type: video/mp4
Length: 3912231 (3.7M) [video/mp4]
Remote file exists.
I am using mod_headers and mod_expires in apache2 like this:
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
ExpiresDefault A29030400
Header append Cache-Control "public, must-revalidate"
</FilesMatch>
Did I spell revalidate wrong or something?
Edit:
To make the use case clear: I want the files to be cached, because they are rather big and I want to save bandwidth. But on the other hand I want the files to be re-validated, so that the client checks with the server whether the content has changed (that's what the ETag is for) and only re-fetches it if necessary.
Your problem is that must-revalidate only kicks in once a cache entry is no longer fresh, but you've marked the response as cacheable for 29 million seconds. 'Cache-Control: max-age=0, must-revalidate' may be closer to what you want, if you want to allow caching but require revalidation on each use.
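In Apache terms that could look something like this (an untested sketch; the ExpiresDefault line would also need to be dropped or shortened so mod_expires does not keep re-adding the long max-age):
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
  # allow caching, but require revalidation (via the ETag) before each reuse
  Header set Cache-Control "max-age=0, must-revalidate"
</FilesMatch>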