Scrapy+Splash returning wrong headers

When using Splash with Scrapy, the headers returned are those of the Splash server instead of the website Splash renders.
response.headers returns:
{b'Server': [b'TwistedWeb/19.7.0'], b'Date': [b'Sun, 11 Jul 2021 07:31:32 GMT'], b'Content-Type': [b'text/html; charset=utf-8']}
And I'm trying to get the headers of the actual website:
Connection: Keep-Alive
Content-Length: 5
Content-Type: text/html
Date: Sun, 11 Jul 2021 07:05:49 GMT
Keep-Alive: timeout=5, max=100
Server: Apache
X-Cache: HIT
How can I get the headers of the website instead of the Splash server?

I got it to work with this:
splash_lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        html = splash:html(),
        headers = last_response.headers
    }
end
"""
The script returns both the rendered HTML and the headers of the last response in Splash's history, and those headers then show up on response.headers in Scrapy.
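For completeness, here is a minimal spider sketch showing one way to wire the script up (a sketch only: the spider name and target URL are placeholders, and it assumes scrapy-splash is installed and configured). It uses the splash_lua_script defined above; with scrapy-splash's default magic_response=True, the headers key returned by the script should populate response.headers:

import scrapy
from scrapy_splash import SplashRequest

class SiteHeadersSpider(scrapy.Spider):
    name = "site_headers"  # hypothetical spider name

    def start_requests(self):
        # The 'execute' endpoint runs the Lua script on the Splash server
        yield SplashRequest(
            "http://example.com",  # placeholder target URL
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": splash_lua_script},
        )

    def parse(self, response):
        # The 'headers' key returned by the script is mapped onto
        # response.headers, so these are the rendered site's headers
        # rather than the Splash server's TwistedWeb headers.
        self.logger.info("Headers: %s", response.headers)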

Related

Google safe browsing API not returning threat URLs

I'm sending requests to the Google safe browsing API. I believe I'm following their documentation correctly. I've tried regenerating my key.
I'm sending the request below
POST https://safebrowsing.googleapis.com/v4/threatMatches:find?key=AIxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx HTTP/1.1
User-Agent: Fiddler
Host: safebrowsing.googleapis.com
Content-Length: 511
{
  "client": {
    "clientId": "yourcompanyname",
    "clientVersion": "1.5.2"
  },
  "threatInfo": {
    "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
    "platformTypes": ["WINDOWS"],
    "threatEntryTypes": ["URL"],
    "threatEntries": [
      {"url": "http://www.urltocheck1.org/"},
      {"url": "http://malware.testing.google.test"},
      {"url": "http://www.urltocheck2.org/"},
      {"url": "http://www.urltocheck3.com/"}
    ]
  }
}
And I get back an empty response, which is not what I'm expecting given the URLs supplied and their example.
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Date: Wed, 08 Sep 2021 15:05:59 GMT
Server: scaffolding on HTTPServer2
Cache-Control: private
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Accept-Ranges: none
Vary: Accept-Encoding
Content-Length: 3
{}
https://transparencyreport.google.com/safe-browsing/search?url=malware.testing.google.test
https://developers.google.com/safe-browsing/v4/lookup-api
You need to pass a valid API key, and the URLs you check must actually be on a threat list: if a URL is not flagged as malware, the API returns an empty {} response. http://www.urltocheck1.org/ is not flagged, so it produces no match. Try https://testsafebrowsing.appspot.com/s/malware.html with your code, and test with other known malware pages; you can verify any URL against the transparency report linked above.
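For reference, here is a minimal sketch of the same lookup in Python with the requests library (a sketch only: the API key is a placeholder, and the test URL is the known-malware test page mentioned above):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder; use your own key
ENDPOINT = "https://safebrowsing.googleapis.com/v4/threatMatches:find"

payload = {
    "client": {"clientId": "yourcompanyname", "clientVersion": "1.5.2"},
    "threatInfo": {
        "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
        "platformTypes": ["WINDOWS"],
        "threatEntryTypes": ["URL"],
        "threatEntries": [
            # A page Google intentionally serves as malware for testing
            {"url": "https://testsafebrowsing.appspot.com/s/malware.html"},
        ],
    },
}

resp = requests.post(ENDPOINT, params={"key": API_KEY}, json=payload)
resp.raise_for_status()
# An empty JSON object {} means none of the submitted URLs are flagged
print(resp.json().get("matches", "no matches"))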

Why is CSS not always gzipped?

In my Firefox or Chrome, if I check the HTTP headers, the result always includes Content-Encoding: gzip. But I have customers reporting that they see "transfer-encoding: chunked" instead and the responses are not gzipped.
http://www.example.com/public/css/style.min.css
If I or the customer run an online gzip compression check, it confirms gzip is active:
https://checkgzipcompression.com = gzip!
But if I use a checker like http://onlinecurl.com/, I also get the transfer-encoding: chunked.
Request:
GET /style/css.css HTTP/1.1
Host: www.example.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
User-Agent: ...
Accept: */*
Referer: http://www.example.com/
Accept-Encoding: gzip, deflate
Accept-Language: ...
Cookie: ...
Response:
HTTP/1.1 200 OK
Age: 532948
cache-control: public, max-age=604800
Content-Type: text/css
Date: Wed, 28 Jun 2017 12:35:07 GMT
ETag: "5349e8d595dfd21:0"
Last-Modified: Wed, 07 Jun 2017 13:56:17 GMT
Server: Microsoft-IIS/7.5
Vary: X-UA,Accept-Encoding, User-Agent
X-Cache: HIT
X-Cache-Hits: 6327
X-CacheReason: Static-js-css.
X-Powered-By: ASP.NET
X-Served-By: ip-xxx-xxx-xxx-xx.name.xxx
x-stale: true
X-UA-Device: pc
X-Varnish: 993020034 905795837
X-Varnish-beresp-grace: 43200.000
X-Varnish-beresp-status: 200
X-Varnish-beresp-ttl: 604800.000
transfer-encoding: chunked
Connection: keep-alive
Why are some responses not gzipped when they should be? This is my Varnish config (the part relevant to gzip):
if (req.http.Accept-Encoding) {
    if (req.url ~ "\.(jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|flv|swf)$") {
        # No point in compressing these
        remove req.http.Accept-Encoding;
    } elsif (req.http.Accept-Encoding ~ "gzip") {
        set req.http.Accept-Encoding = "gzip";
    } elsif (req.http.Accept-Encoding ~ "deflate") {
        set req.http.Accept-Encoding = "deflate";
    } else {
        # unknown algorithm
        remove req.http.Accept-Encoding;
    }
}

# Enabling GZIP
if (beresp.http.Content-Type ~ "(text/css|application/x-javascript|application/javascript)") {
    set beresp.do_gzip = true;
}

if (beresp.http.Content-Encoding ~ "gzip") {
    if (beresp.http.Content-Length == "0") {
        unset beresp.http.Content-Encoding;
    }
}

set beresp.http.Vary = regsub(beresp.http.Vary, "(?i)^(.*?)X-Forwarded-URI,?(.*)$", "\1\2");
set beresp.http.Vary = regsub(beresp.http.Vary, "(?i)^(.*?)User-Agent,?(.*)$", "\1\2");
set beresp.http.Vary = regsub(beresp.http.Vary, "^(.*?),?$", "X-UA,\1");
set beresp.http.Vary = regsub(beresp.http.Vary, "^(.*?),?$", "\1");
Any ideas? Thank you.
Responses will only be gzipped if the request indicates that it can accept a gzipped response; this is indicated by the Accept-Encoding header in the request. So perhaps your online curl tool is not sending that header, and the same may be true for the clients who are seeing this. Do you really have customers reporting that they are not getting gzipped responses?
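As a quick way to test that theory, here is a small sketch using Python's requests library (the stylesheet URL is the placeholder from the question) that fetches the file once advertising gzip support and once refusing it, then prints the resulting Content-Encoding:

import requests

url = "http://www.example.com/public/css/style.min.css"

# Explicitly advertise gzip support...
r1 = requests.get(url, headers={"Accept-Encoding": "gzip"})
# ...and explicitly ask for an uncompressed (identity) response
r2 = requests.get(url, headers={"Accept-Encoding": "identity"})

print("with gzip:   ", r1.headers.get("Content-Encoding"))  # expect: gzip
print("without gzip:", r2.headers.get("Content-Encoding"))  # expect: None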
Update
Ah, I see what you're doing now. Are you using a recent version of Varnish? There's no need to do all this yourself now. Varnish handles it all natively. All you need to do is set do_gzip to on for the content types where you want it, and Varnish takes care of the rest, including the Accept-Encoding header. See the documentation here.
So just remove all of your gzip/encoding related code except the part directly under # Enabling GZIP:
# Enabling GZIP
if (beresp.http.Content-Type ~ "(text/css|application/x-javascript|application/javascript)") {
    set beresp.do_gzip = true;
}
And that will probably get everything working; it works fine for me that way. The best amount of VCL is as little as possible: Varnish is very good at handling things itself. Don't forget to restart Varnish or otherwise clear the cache for this site after making the change.
In case it's useful, I use the following VCL for this:
if (
    beresp.status == 200
    && beresp.http.content-type ~ "\b((text/(html|plain|css|javascript|xml|xsl))|(application/(javascript|xml|xhtml\+xml)))\b"
) {
    set beresp.do_gzip = true;
}
This checks for more content types that can benefit from compression, including HTML. I don't bother with application/x-javascript as it's ancient and no longer used.
On another note, are you sure you need to be modifying the Vary header in the way that you are doing there?

How to correctly handle multiple Set-Cookie headers in Hyper?

I'm using Hyper to send HTTP requests, but when multiple cookies are included in the response, Hyper combines them into one, which then breaks the parsing.
For example, here's a simple PHP script
<?php
setcookie("hello", "world");
setcookie("foo", "bar");
Response using curl:
$ curl -sLD - http://local.example.com/test.php
HTTP/1.1 200 OK
Date: Sat, 24 Dec 2016 09:24:04 GMT
Server: Apache/2.4.25 (Unix) PHP/7.0.14
X-Powered-By: PHP/7.0.14
Set-Cookie: hello=world
Set-Cookie: foo=bar
Content-Length: 0
Content-Type: text/html; charset=UTF-8
However, for the following Rust code:
let client = Client::new();
let response = client.get("http://local.example.com/test.php")
    .send()
    .unwrap();

println!("{:?}", response);

for header in response.headers.iter() {
    println!("{}: {}", header.name(), header.value_string());
}
...the output will be:
Response { status: Ok, headers: Headers { Date: Sat, 24 Dec 2016 09:31:54 GMT, Server: Apache/2.4.25 (Unix) PHP/7.0.14, X-Powered-By: PHP/7.0.14, Set-Cookie: hello=worldfoo=bar, Content-Length: 0, Content-Type: text/html; charset=UTF-8, }, version: Http11, url: "http://local.example.com/test.php", status_raw: RawStatus(200, "OK"), message: Http11Message { is_proxied: false, method: None, stream: Wrapper { obj: Some(Reading(SizedReader(remaining=0))) } } }
Date: Sat, 24 Dec 2016 09:31:54 GMT
Server: Apache/2.4.25 (Unix) PHP/7.0.14
X-Powered-By: PHP/7.0.14
Set-Cookie: hello=worldfoo=bar
Content-Length: 0
Content-Type: text/html; charset=UTF-8
This seems really weird to me. I used Wireshark to capture the response, and there are two Set-Cookie headers in it. I also checked the Hyper documentation but got no clue...
I noticed Hyper internally uses a VecMap<HeaderName, Item> to store the headers. So does it concatenate them into one? Then how should I split them back into individual cookies afterwards?
I think that Hyper prefers to keep the cookies together in order to make it easier to do some extra stuff with them, like checking a cryptographic signature with CookieJar (cf. this implementation outline).
Another reason might be to keep the API simple. Headers in Hyper are indexed by type and you can only get a single instance of that type with Headers::get.
In Hyper, you'd usually access a header by using a corresponding type. In this case the type is SetCookie. For example:
if let Some(&SetCookie(ref cookies)) = response.headers.get() {
    for cookie in cookies.iter() {
        println!("Got a cookie. Name: {}. Value: {}.", cookie.name, cookie.value);
    }
}
Accessing the raw header value of Set-Cookie makes less sense, because then you'll have to reimplement a proper parsing of quotes and cookie attributes (cf. RFC 6265, 4.1).
P.S. Note that in Hyper 10 the cookie is no longer parsed, because the crate that was used for the parsing triggers the openssl dependency hell.

Download string/file from site with webclient in .net

Can anybody help me download a string from this site? I use this code:
Dim client As New Net.WebClient
Dim str As String = client.DownloadString("http://www.tsetmc.com/tsev2/chart/data/IndexFinancial.aspx?i=32097828799138957&t=ph")
But the result is different. The true data are numbers:
"20081206,9249,9168,9249,9178,8539624,9178;20081207,9178,9130,9178,9130,11752353,9130"
but the result looks like:
"‹ ŠÜT ÿdë’í,«…ohýˆg­}ÿ÷µyÆdöûuuQà”ÄxD¬Ï³K}æ¿Sûù"
You should set the WebClient's encoding before calling DownloadString. Try this code:
Dim client As New Net.WebClient
client.Encoding = Encoding.UTF8
Dim str As String = client.DownloadString("http://goo.gl/JRvlsm")
If you "get" the headers for your link:
Status:200
Raw:
HTTP/1.1 200 OK
Cache-Control: public, max-age=9999
Content-Length: 33183
Content-Type: text/csv; charset=utf-8
Content-Encoding: gzip
Expires: Sat, 23 Jul 2016 02:32:58 GMT
Last-Modified: Fri, 22 Jul 2016 23:46:19 GMT
Vary: *
Set-Cookie: ASP.NET_SessionId=vsxyok45zvtgsbvp4iqxdh45; path=/; HttpOnly
X-Powered-By: ASP.NET
Date: Fri, 22 Jul 2016 23:46:19 GMT
Request:
GET /tsev2/chart/data/IndexFinancial.aspx?i=32097828799138957&t=ph HTTP/1.1
You find that the data is gzip compressed (see the "Content-Encoding:" line). To address that, use this code:
' Requires Imports System.Net, System.IO and System.IO.Compression
Dim myUrl As String = "http://www.tsetmc.com/tsev2/chart/data/IndexFinancial.aspx?i=32097828799138957&t=ph"
Dim result As String

Using client As New WebClient
    client.Headers(HttpRequestHeader.AcceptEncoding) = "gzip"
    Using rs As New GZipStream(client.OpenRead(myUrl), CompressionMode.Decompress)
        result = New StreamReader(rs).ReadToEnd()
    End Using
End Using
The result is uncompressed text, matching the set of numbers you indicated as correct:
20081206,9249,9168,9249,9178,8539624,9178;20081207,9178,9130,9178,9130,11752353,9130;
Here is where I found the info for decompressing gzip (more info there):
Automatically decompress gzip response via WebClient.DownloadData
Note: you may have to add a reference in your project to System.IO.Compression.

jquery.ajax() POST receives empty response with IE10 on Nginx/PHP-FPM but works on Apache

I use a very simple jquery.ajax() call to fetch some HTML snippet from a server:
// Init add lines button
$('body').on('click', '.add-lines', function(e) {
    $.ajax({
        type     : 'POST',
        url      : $(this).attr('href') + '?ajax=1&addlines=1',
        data     : $('#quickorder').serialize(),
        success  : function(data, x, y) {
            $('#directorderform').replaceWith(data);
        },
        dataType : 'html'
    });
    e.preventDefault();
});
On the PHP side I basically echo out an HTML string. The jQuery version is 1.8.3.
The problem is in IE10: while it works fine on Server A, which runs Apache, it fails on Server B, which runs Nginx + PHP-FPM. If I debug the success handler on Server B, I get undefined for data. In the Network tab of the IE developer tools I can see the full response and all headers. It may affect other IE versions, but I could only test IE10 so far.
Here are the two response headers:
Server A, Apache (works):
HTTP/1.1 200 OK
Date: Thu, 25 Apr 2013 13:28:08 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 1268
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Server B, Nginx + PHP-FPM (fails):
HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Thu, 25 Apr 2013 13:41:43 GMT
Content-Type: text/html; charset=utf8
Transfer-Encoding: chunked
Connection: keep-alive
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
The body part looks the same in both cases.
Any idea what could cause this issue?
Please also check the Content-Type header, since Apache and Nginx are sending different values:
Content-Type: text/html; charset=UTF-8
vs.
Content-Type: text/html; charset=utf8
Update your Nginx config by adding this line:
charset UTF-8;
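To confirm the fix from the outside, here is a small sketch (the hostnames are placeholders for the two servers) that prints the charset label each server sends; after the change both should report charset=UTF-8:

import requests

# Placeholders for the Apache and Nginx servers from the question
for host in ("http://server-a.example/", "http://server-b.example/"):
    content_type = requests.get(host).headers.get("Content-Type", "")
    print(host, "->", content_type)  # e.g. "text/html; charset=UTF-8"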