Python requests NOT decoding gzip file

Python requests NOT decoding gzip file - gzip

I've found a number of questions about how to stop the python-requests package from decoding gzip responses but in my case I would like it to but it doesn't.
The headers from the response are below and I imagine there is something missing that means requests believes that the data is not gzipped - is there a way that I can convince it that the data is gzipped and get it to unzip it?
'content-length': '56185'
'x-amz-id-2': 'M+hA+/y9hrL1t4zuBajryQHmhr3lnaeDqdeE5gXwnOmEwgJI6QWbw/M51M0rx3+r'
'accept-ranges': 'bytes'
'server': 'AmazonS3'
'last-modified': 'Tue, 24 May 2016 00:25:37 GMT'
'x-amz-request-id': 'D2694ABDCA2417E2'
'etag': '"44b77ad22e8edc8f4fab543b3d16b500"'
'date': 'Tue, 24 May 2016 10:18:01 GMT'
'x-amz-server-side-encryption': 'AES256'
'content-type': ''

Related

How to serve a wbn (WebPackage/WebBundle) file from a web server?

does anyone know how to serve a web bundle so that it loads, rather than just downloading as a file?
Some disambiguation: There is a format called WebPackage (not to be confused with webpack), also called a Web Bundle. Files typically have the .wbn suffix. It contains html and js files and can be used to view websites offline. Useful for e.g. archiving websites or making websites that work well with intermittent network access. Download the file once, and you have all the assets you need for at last basic operation of the site.
The standard on how to serve a .wbn file is here:
https://wicg.github.io/webpackage/draft-yasskin-wpack-bundled-exchanges.html
However when I add the required headers in the web server, the .wbn file is just downloaded. If I drag the downloaded file onto my browser (google-chrome), the file is displayed as the website it contains, so unless there is some very subtle bug in there I believe that the format of the bundle is OK.
Here is a sample request:
Request URL: http://localhost/bundle/www-signed.wbn
Request Method: GET
Status Code: 200 OK
Remote Address: [::1]:80
Referrer Policy: strict-origin-when-cross-origin
and the server response:
Accept-Ranges: bytes
Connection: keep-alive
Content-Length: 4300
Content-Type: application/webbundle <-- Required by the standard
Date: Thu, 02 Sep 2021 12:00:24 GMT
ETag: "612ef7cb-10cc"
Last-Modified: Wed, 01 Sep 2021 03:47:23 GMT
Server: nginx/1.18.0 (Ubuntu)
X-Content-Type-Options: nosniff <-- required by the standard
If anyone has this working on a website or knows how to do it, I would love to have a look.

I had the same problem that the wbn file was just downloaded instead of executed.
I had to enable the web bundles feature even though my chrome version is 96+

Use wget to download pdf with no direct link

Some websites provide pdf files for viewing but I can't download such pdf files with wget.
Calling the website in question from my browser views the pdf:
https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
But using the following code I only get a pdf file with 0 lenght.
wget --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
I tried some combinations with saving and loading cookies and referer but nothing works.
At this point I'm just curious what is happening and why wget is not fetching anything except maybe an empty index.html.

When I was looking at server response, it was saying the content length was 0.
--2021-04-17 14:59:35-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sat, 17 Apr 2021 13:59:36 GMT
Server: Apache
Set-Cookie: fe_typo_user=477e8a1d2b3dd74bc5b6b408a6d74edd; expires=Mon, 17-May-2021 13:59:36 GMT; Max-Age=2592000; path=/; domain=.lokalmatador.de; httponly; samesite=lax
Upgrade: h2,h2c
Connection: Upgrade, Keep-Alive
Content-Length: Array
Cache-Control: max-age=2592000
Expires: Mon, 17 May 2021 13:59:36 GMT
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Keep-Alive: timeout=5, max=100
Content-Type: application/pdf
Length: 0 [application/pdf]
Remote file exists but does not contain any link -- not retrieving.
So looked at the manual:
https://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html
And there is a command just exactly for this:
‘--ignore-length’
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header—as if it never existed.
Then the wget command started working as expected:
wget --ignore-length -O epaper.pdf https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Here is output which I'm seeing with the ignore length:
--2021-04-17 14:56:19-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [application/pdf]
Saving to: ‘epaper.pdf’
epaper.pdf [ <=> ] 4.39M 1.23MB/s in 3.6s
2021-04-17 14:56:23 (1.21 MB/s) - ‘epaper.pdf’ saved [4601842]

G-WAN 3.12.26 64-bit add duplicate http header

I use gwan for image generation, so I need to set correct content type, but G-WAN 3.12.26 after some load adds its own header with content type text/html and returns page with 2 http headers.
How to reproduce this:
use setheaders.c servlet from gwan package, start gwan and open this page, lets say http://localhost/?setheaders.c and you will get this (correct response):
HTTP/1.1 200 OK
Date: Sat, 29 Dec 2012 20:37:52 GMT
Last-Modified: Sat, 29 Dec 2012 20:37:52 GMT
Content-type: text/html
Content-Length: 371
Connection: close
<!DOCTYPE HTML><html lang="en"><head><title>Setting response headers</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><link href="imgs/style.css" rel="stylesheet" type="text/css"></head><body style="margin:16px;"><h1>Setting response headers</h1><br>This reply was made with custom HTTP headers, look at the servlet source code.<br></body></html>`
now run apache bench: ab -n 1000 'http://localhost/?setheaders.c' (1000 requests were enough for my system).
DO NOT RESTART GWAN, open http://localhost/?setheaders.c again and this is what you should get (incorrect response, 2 http headers):
HTTP/1.1 200 OK
Server: G-WAN
Date: Sat, 29 Dec 2012 20:43:34 GMT
Last-Modified: Fri, 16 Jan 1970 16:53:33 GMT
ETag: "be86ada7-14b40d-16f"
Vary: Accept-Encoding
Accept-Ranges: bytes
Content-Type: text/html; charset=UTF-8
Content-Length: 367
Content-Encoding: gzip
Connection: close
HTTP/1.1 200 OK
Date: Sat, 29 Dec 2012 20:43:34 GMT
Last-Modified: Sat, 29 Dec 2012 20:43:34 GMT
Content-type: text/html
Content-Length: 371
Connection: close
<!DOCTYPE HTML><html lang="en"><head><title>Setting response headers</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><link href="imgs/style.css" rel="stylesheet" type="text/css"></head><body style="margin:16px;"><h1>Setting response headers</h1><br>This reply was made with custom HTTP headers, look at the servlet source code.<br></body></html>
GWAN returns correct response if gzip and x-gzip are not set as acceptable encoding in request header (Accept-Encoding: gzip, x-gzip).
Is it possible to solve this modifying just servlet? If yes, then how?

Are you setting the MIME type like shown in fractal.c:
// -------------------------------------------------------------------------
// specify a MIME type so we don't have to build custom HTTP headers
// -------------------------------------------------------------------------
char *mime = (char*)get_env(argv, REPLY_MIME_TYPE);
// note that we setup the FILE EXTENTION, not the MIME type:
mime[0] = '.'; mime[1] = 'g'; mime[2] = 'i'; mime[3] = 'f'; mime[4] = 0;
If you do so then there's no way to confuse the automatic headers feature.
Other than that, v3.12 has had many instability problems (file time failures, pthread failures, signals failures, etc.) due to our direct syscalls and GLIBC wrappers - an effort initially intended to make the program run on all versions of Linux.
We have found (thanks to the many reports like yours) that rather than trying to fix those issues one by one (pointlessly fighting GLIBC, a moving target with many different releases each having its own bugs and specificities) a much safer path is to ditch GLIBC.
That's what G-WAN v4 will do, just a few days from now.

uwsgi breaks headers

I'm using Nginx + uwsgi + python3
Sending any header via start_response goes well, but when I want to send more than one header, it becomes mad.
For example, if I write:
start_response('200 OK', [('Last-Modified', 'Wed, 11 Jan 2012 00:00:00 GMT'), ('Content-Type', 'text/html; charset=windows-1251')])
The headers sent are:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: nginx/1.0.11
Connection: close
Date: Wed, 11 Jan 2012 04:17:22 GMT
Content-Type: text/html; charset=windows-1251
Content-Type: text/html; charset=windows-12
uwsgi sends the same header twice and even more the second one is broken.

which uWSGI and nginx version ? In both 0.9.8.x and 1.0.x i cannot reproduce your error.
You can check the real headers sent by uWSGI putting it in http mode with --http/--http-socket

Must-revalidate headers of this request wrong?

I noticed that chrome cached a video file. I replaced it with another one on the server and chrome kept serving the old one from cache (using JW flash player 5)
The headers of the request look like this:
joe#joe-desktop:~$ wget -O - -S --spider http://www.2xfun.de/files_geheimhihi14/20759.mp4
Spider mode enabled. Check if remote file exists.
--2011-05-15 22:40:56-- http://www.2xfun.de/files_geheimhihi14/20759.mp4
Resolving www.2xfun.de... 213.239.214.112
Connecting to www.2xfun.de|213.239.214.112|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sun, 15 May 2011 20:40:56 GMT
Server: Apache
Last-Modified: Sun, 15 May 2011 20:37:59 GMT
ETag: "89b38-3bb227-4a35683b477c0"
Accept-Ranges: bytes
Content-Length: 3912231
Cache-Control: max-age=29030400, public, must-revalidate
Expires: Sun, 15 Apr 2012 20:40:56 GMT
Connection: close
Content-Type: video/mp4
Length: 3912231 (3.7M) [video/mp4]
Remote file exists.
I am using mod_headers and mod_expires in apache2 like this:
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
ExpiresDefault A29030400
Header append Cache-Control "public, must-revalidate"
</FilesMatch>
Did I spell revalidate wrong or something?
edit:
To make the use case clear: I want the files to be cached, because they are rather big and I want to save bandwidth. But on the other hand I want the files to be re-validated. So the client does a HEAD request and checks whether the content has changed (thats what the etag is for), and only re-fetches if necessary.

Your problem is that must-revalidate only kicks in once a cache entry is no longer fresh, but you've marked the response as cacheable for 29 million seconds. 'Cache-Control: max-age=0, must-revalidate' may be closer to what you want, if you want to allow caching but require revalidation on each use.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Python requests NOT decoding gzip file - gzip

Related

How to serve a wbn (WebPackage/WebBundle) file from a web server?

Use wget to download pdf with no direct link

G-WAN 3.12.26 64-bit add duplicate http header

uwsgi breaks headers

Must-revalidate headers of this request wrong?

Categories

Resources