Cloudflare returning 520 due to empty server response from Heroku - ssl

My Rails app which has been working great for years suddenly started returning Cloudflare 520 errors. Specifically, api.exampleapp.com backend calls return the 520 whereas hits to the frontend www.exampleapp.com subdomain are working just fine.
The hard part is that nothing has changed in either my configuration or my code. Cloudflare believes this is happening because the Heroku server is returning an empty response.
> GET / HTTP/1.1
> Host: api.exampleapp.com
> Accept: */*
> Accept-Encoding: deflate, gzip
>
{ [5 bytes data]
* TLSv1.2 (IN), TLS alert, close notify (256):
{ [2 bytes data]
* Empty reply from server
* Connection #0 to host ORIGIN_IP left intact
curl: (52) Empty reply from server
error: exit status 52
On the Heroku end, my logs don't even seem to register the request when I hit any of these urls. I also double-checked my SSL setup (Origin certificate created at Cloudflare installed on Heroku), just in case, and it seems to be correct and is not expired.
The app has been down for a couple of days now, users are complaining, and neither customer care team has responded despite my being a paid customer. My DevOps knowledge is fairly limited.
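To reproduce the curl result above without Cloudflare in the path, a small probe can connect straight to the origin and report what comes back. This is only a sketch: `probe_origin` and `classify_reply` are hypothetical helper names, and certificate verification is switched off on the assumption that the origin presents a Cloudflare Origin certificate, which public CA bundles do not trust.

```python
import socket
import ssl


def classify_reply(raw: bytes) -> str:
    """Label the first bytes read back from the origin."""
    if not raw:
        # Same condition curl reports as "(52) Empty reply from server".
        return "empty reply"
    if raw.startswith(b"HTTP/"):
        # Return just the status line, e.g. "HTTP/1.1 200 OK".
        return raw.split(b"\r\n", 1)[0].decode("latin-1")
    return "non-HTTP data"


def probe_origin(origin_addr: str, host_header: str, port: int = 443,
                 timeout: float = 10.0) -> str:
    """Send one GET directly to the origin and classify the reply."""
    ctx = ssl.create_default_context()
    # Assumption: the origin certificate is not publicly trusted, so
    # verification is disabled for this diagnostic only.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    request = (
        f"GET / HTTP/1.1\r\nHost: {host_header}\r\n"
        "Accept: */*\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    with socket.create_connection((origin_addr, port), timeout=timeout) as sock:
        # server_hostname sets SNI so the origin can pick the right app.
        with ctx.wrap_socket(sock, server_hostname=host_header) as tls:
            tls.sendall(request)
            raw = tls.recv(4096)
    return classify_reply(raw)
```

Calling `probe_origin("ORIGIN_IP", "api.exampleapp.com")` with the real origin address would show whether the empty reply also occurs when Cloudflare is bypassed.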

Welcome to the club: https://community.cloudflare.com/t/sometimes-a-cf-520-error/288733
It seems to be a Cloudflare issue introduced in late July affecting hundreds of sites running very different configurations. It's been almost a month since the issue was first reported, Cloudflare "fixed" it twice, but it's still there. Very frustrating.

Set your web server's log level to info and check whether your application is exceeding some HTTP/2 directive limit while processing the connection.
If this is the case, try to increase the directive size:
# nginx
server {
    ...
    http2_max_field_size 64k;
    http2_max_header_size 64k;
}

start-iap-tunnel unable to connect to a listening port

I'm installing OpenVPN Access Server on a Google Cloud instance. Its webUI listens on port 943 using https. It has a self-signed certificate whose name doesn't match the server's hostname (10.150.0.2). I can't start an SSH tunnel. I'm looking for a way to troubleshoot the connection from the IAP service to my server.
The command I'm running is gcloud compute start-iap-tunnel vpn 943 --local-host-port=localhost:943. I receive the normal "Testing if tunnel connection works" message.
It errs out with ERROR: (gcloud.compute.start-iap-tunnel) While checking if a connection can be made: Error while connecting [4003: 'failed to connect to backend']. (Failed to connect to port 943)
If I add --log-http to the command invocation the relevant information follows (it looks like a normal req/resp cycle with a 200 that I assume is from my client to the IAP service):
Testing if tunnel connection works.
=======================
==== request start ====
uri: https://oauth2.googleapis.com/token
method: POST
== headers start ==
b'content-type': b'application/x-www-form-urlencoded'
b'user-agent': b'google-cloud-sdk gcloud/367.0.0 command/gcloud.compute.start-iap-tunnel invocation-id/db27de82264f47fcb63f6680afaa8327 environment/None environment-version/None interactive/False from-script/False python/3.7.9 term/xterm-256color (Macintosh; Intel Mac OS X 21.2.0)'
== headers end ==
== body start ==
Body redacted: Contains oauth token. Set log_http_redact_token property to false to print the body of this request.
== body end ==
==== request end ====
---- response start ----
status: 200
-- headers start --
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Encoding: gzip
Content-Type: application/json; charset=utf-8
Date: Fri, 24 Dec 2021 02:11:52 GMT
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Pragma: no-cache
Server: scaffolding on HTTPServer2
Transfer-Encoding: chunked
Vary: Origin, X-Origin, Referer
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 0
-- headers end --
-- body start --
Body redacted: Contains oauth token. Set log_http_redact_token property to false to print the body of this response.
-- body end --
total round trip time (request+response): 0.246 secs
---- response end ----
----------------------
ERROR: (gcloud.compute.start-iap-tunnel) While checking if a connection can be made: Error while connecting [4003: 'failed to connect to backend']. (Failed to connect to port 943)
To my knowledge this is the limit of easily accessible troubleshooting for start-iap-tunnel.
Moving on to the local machine, wget can connect to 10.150.0.2:943 before failing on the certificate:
root@viongier:/usr/local/openvpn_as# wget https://10.150.0.2:943
--2021-12-24 02:01:47-- https://10.150.0.2:943/
Connecting to 10.150.0.2:943... connected.
ERROR: The certificate of ‘10.150.0.2’ is not trusted.
ERROR: The certificate of ‘10.150.0.2’ doesn't have a known issuer.
The certificate's owner does not match hostname ‘10.150.0.2’
It seems to me that my client happily connects to the IAP service, which then fails to connect to my server. I would expect to see an IAP error if it was failing because of the cert. The only test I can think of is generating a certificate whose issuer Google accepts (Let's Encrypt, for example).
This message means that the backend does not have a socket open in the listening state. Common reasons are that no service has been started or a firewall is blocking the port.
To allow the Identity Aware Proxy into your VPC, allow traffic from 35.235.240.0/20.
ERROR: (gcloud.compute.start-iap-tunnel) While checking if a
connection can be made: Error while connecting [4003: 'failed to
connect to backend']. (Failed to connect to port 943)
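Both causes named above can be checked with a few lines of Python. This is a minimal sketch, not part of gcloud: `is_iap_source` and `port_is_listening` are hypothetical helper names, and the port check is meant to be run on the VM itself (or from a host inside the VPC).

```python
import ipaddress
import socket

# Source range named in the answer above for IAP traffic.
IAP_RANGE = ipaddress.ip_network("35.235.240.0/20")


def is_iap_source(ip: str) -> bool:
    """True if ip falls inside the IAP source range."""
    return ipaddress.ip_address(ip) in IAP_RANGE


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a plain TCP connect to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

If `port_is_listening("10.150.0.2", 943)` is true on the VM while the tunnel still reports 4003, the firewall rule for the range above is the more likely cause.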
This error means that the certificate provided does not match the address that the connection is made to:
ERROR: The certificate of ‘10.150.0.2’ is not trusted. ERROR: The
certificate of ‘10.150.0.2’ doesn't have a known issuer. The
certificate's owner does not match hostname ‘10.150.0.2’
Some clients, such as wget, support ignoring SSL certificate validation. For wget, see the --no-check-certificate flag.
Once you solve that problem you will run into another set of problems:
Under normal circumstances, you cannot use HTTPS with tunnels. Tunnels are a form of man in the middle. There are tricks that can be employed, none of them secure.
Commercial SSL certificates do not support IP addresses, only public domain names. You would need to either create your own self-signed certificate, which would not be trusted, or skip certificate validation.
The last issue is that HTTPS endpoints require encryption negotiation from the client party. The start-iap-tunnel command does not initiate encryption (TLS negotiation). This command also does not do any form of certificate exchange and that is why you do not see an IAP error about certificates. This command only transfers data between the tunnel endpoints.
In summary, you cannot use HTTPS with TCP / SSH tunnels without deploying tricks and/or disabling features which defeats the purpose of HTTPS.
Allowing IAP traffic through the firewall let my external client connect to the internal port 943 via an IAP tunnel.
Allowing port 943 from 35.235.240.0/20 solved my problem.
More information is available at the GCP IAP docs

Opensips Tls and certificates issues

I am trying to set up certificate verification in OpenSIPS together with the Blink SIP client. I followed this tutorial:
https://github.com/antonraharja/book-opensips-101/blob/master/content/3.2.%20SIP%20TLS%20Secure%20Calling.mediawiki
My config looks like this:
[opensips.cfg]
disable_tls = no
listen = tls:my_ip:5061
tls_verify_server = 0
tls_verify_client = 1
tls_require_client_certificate = 1
#tls_method = TLSv1
tls_method = SSLv23
tls_certificate = "/usr/local/etc/opensips/tls/server/server-cert.pem"
tls_private_key = "/usr/local/etc/opensips/tls/server/server-privkey.pem"
tls_ca_list = "/usr/local/etc/opensips/tls/server/server-calist.pem"
So I generated the rootCA and the server certificate. Then I took server-calist.pem, added server-privkey.pem to it (otherwise the Blink SIP client won't load it) and set it in the client. I also set server-calist.pem as the certificate authority in Blink. But when I try to log in to my server I get:
Feb 4 21:02:42 user /usr/local/sbin/opensips[28065]: DBG:core:tcp_read_req: Using the global ( per process ) buff
Feb 4 21:02:42 user /usr/local/sbin/opensips[28065]: DBG:core:tls_update_fd: New fd is 17
Feb 4 21:02:42 user /usr/local/sbin/opensips[28065]: ERROR:core:tls_accept: New TLS connection from 130.85.9.114:48253 failed to accept: rejected by client
So I assume that the client doesn't accept the server certificate for some reason, although I have the "Verify server" checkbox turned off in my Blink SIP client! I think I have the wrong certificate authority file.
./user/user-cert.pem
./user/user-cert_req.pem
./user/user-privkey.pem
./user/user-calist.pem <- these 4 are for using OpenSIPS as a client, I think
./rootCA/certs/01.pem
./rootCA/private/cakey.pem
./rootCA/cacert.pem
./server/server-privkey.pem
./server/server-calist.pem
./server/server-cert.pem
./server/server-cert_req.pem
./calist.pem
Can anybody help? Did I do something wrong in the config, or did I use the wrong certificate chain? Which certificates exactly should the client use as the client cert and as the CA cert?
All right, I'm still not sure whether it is working, because the authorization behaviour became weird, but after hanging for 5-6 minutes it authorizes successfully, so here is a solution:
Generate rootCA:
opensipsctl tls rootCA
Then edit the server.conf file in your OpenSIPS tls folder and set commonName = xxx.xxx.xxx.xxx, where xxx.xxx.xxx.xxx is your server's IP address. The other variables can be set however you like. Generate the certificates signed by the CA:
opensipsctl tls userCERT server
This will produce 4 files. Download server-calist.pem, server-cert.pem and server-privkey.pem. Open server-privkey.pem, copy its content and paste it into server-cert.pem, before the actual certificate. If you are using Blink, the resulting server-cert.pem goes in preferences->account->advanced, and server-calist.pem goes in preferences->advanced. After that, restart Blink, and after 5-6 minutes your account will be logged in. But I've observed some weird behaviour: if you run another copy of Blink and try to log in to another existing account after you're logged in from the first one with the certificates, you can log in from the other account without providing the certificates. So I don't know, but I think it's working.
P.S. I asked about the certificates on the OpenSIPS mailing list, but I guess they found my question too lame, so I didn't get a response. If you have the same problem and got better results, or an answer from OpenSIPS support, please let me know.
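The manual copy-and-paste step (private key first, then the certificate, in one PEM file) can be scripted. The sketch below is an illustration only: the function names and the output file name `client-bundle.pem` are hypothetical, and it writes a separate file rather than overwriting server-cert.pem as the answer does.

```python
from pathlib import Path


def combine_key_and_cert(key_pem: str, cert_pem: str) -> str:
    """Return one PEM text with the private key placed before the certificate."""
    # Strip stray blank lines so each block ends with exactly one newline.
    return key_pem.strip() + "\n" + cert_pem.strip() + "\n"


def write_client_bundle(tls_dir: str, out_name: str = "client-bundle.pem") -> Path:
    """Build the combined file from the files opensipsctl generated in tls_dir."""
    base = Path(tls_dir)
    key_pem = (base / "server-privkey.pem").read_text()
    cert_pem = (base / "server-cert.pem").read_text()
    out_path = base / out_name  # hypothetical name, not from the tutorial
    out_path.write_text(combine_key_and_cert(key_pem, cert_pem))
    return out_path
```

The combined file would then be the one selected in Blink's account settings, with server-calist.pem still set separately as the certificate authority.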

Gwan report.c statistics

I am testing G-WAN server performance and it's amazing! Here is the output from report.c:
Requests
All: 5,725 (6.06% of Cache misses)
HTTP: 66 (1.15% of all requests)
Errors: 70 (1.22% of all requests)
CSP: 5,650 (98.69% of all requests) Exceptions: 1
Connections
Accepted: 4,717 (1.21 requests per connection)
Closed: 4,372
Timeouts: 682 (14.46%) Accept:682 Read:0 Slow:0 Build:0 Send:0 Close:0
Busy: 345 (Waiting: 334 Reading: 9 Replying: 2 Sending: 0 Pushing: 0 Relaying: 0 Closing: 0)
I found that the "Errors" rate seems to be quite high, and there is an exception on CSP too. Could anyone tell me what "Errors" means and how to avoid them? Thanks!
the "Errors" rate seems to be quite high
That's HTTP errors (wrong requests coming from a client, not found resources, etc. - look at the error.log file for a trace).
The only way to avoid HTTP errors is to prevent clients from connecting to the server.
If you can't live with this "high rate of HTTP errors" of 1.22% of all requests then use a G-WAN connection handler (with the HTTP_ERROR notification) to make G-WAN ignore HTTP errors and close the connection without sending an HTTP error message (just return 0; in the handler) - but that's probably not what most users want.
there is an exception on CSP too
An exception means a 'graceful crash report' was issued for a servlet bug. As you have only 1 crash in 5,650 dynamic requests, it probably happened during servlet development. Look at your error.log and trace files to check what happened.
Note that the "cache misses" statistics are for static contents only (1.15% of all your HTTP requests).
Apparently, not all your clients are responding in a timely fashion: you have timeouts and pending requests.
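The percentages in the report can be re-derived from the raw counters, which helps when reading it. The snippet below is plain arithmetic on the figures quoted in the question; the variable names are mine, not G-WAN's.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of whole represented by part, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


# Counters copied from the report.c output in the question.
requests_all = 5725
http_requests = 66
errors = 70
csp_requests = 5650
accepted = 4717
closed = 4372
timeouts = 682

print("HTTP share:      ", pct(http_requests, requests_all))  # 1.15
print("Error share:     ", pct(errors, requests_all))         # 1.22
print("CSP share:       ", pct(csp_requests, requests_all))   # 98.69
print("Timeout share:   ", pct(timeouts, accepted))           # 14.46
print("Requests/conn:   ", round(requests_all / accepted, 2)) # 1.21
# Connections accepted but not yet closed match the "Busy" line (345).
print("Still open:      ", accepted - closed)
```

The "Busy" breakdown also adds up: 334 waiting + 9 reading + 2 replying = 345, the same as accepted minus closed.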

Apache, mod_ssl "request failed: error reading the headers" for a specific user

Currently we have an Apache 2.2.3 server with mod_ssl 2.2.3 running Django, with users authenticating by using an X.509 certificate.
So far the system is running perfectly except for a single user, who receives a 400 Bad Request error when trying to upload a file. The contents of the ssl_error_log regarding this operation are:
[<date>] [error] [client <client ip>] request failed: error reading the headers, referer: <referrer url>
The contents of the ssl_access_log are:
<client ip> - - [<date>] "POST <target page> HTTP/1.1" 400 321
Also, the user's browser is Firefox as far as I know.
I am completely unable to reproduce this bug and so far none of the other users have experienced it. Could you point out some reasons for this to happen?
I've experienced connectivity where the upstream stalls after X bytes have been sent. X was a pretty low value: enough to request some simple pages, but not enough to handle Ajax requests, much less file uploads. As far as I recall, this connectivity problem occurred only when tethering (from a specific Android phone; I didn't test other phones).
So if the upstream gets interrupted and the upload stalls, it makes sense that Apache would return this error, according to this post: "Apache waits a time equal to the Timeout directive (defaults to 5 minutes if not defined) for a response from the client. It is likely Apache is waiting for the CRLF that indicates the end of the headers, yet it is never received."
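To make the quoted explanation concrete: an HTTP request's header block ends with an empty line, that is, CRLF followed by CRLF. If the connection stalls before those bytes arrive, the server never sees a complete header block. The sketch below only illustrates that condition; it is not Apache's code, and `headers_complete` and `split_request` are hypothetical names.

```python
HEADER_END = b"\r\n\r\n"  # empty line that terminates the header block


def headers_complete(buffer: bytes) -> bool:
    """True once the terminating empty line has been received."""
    return HEADER_END in buffer


def split_request(buffer: bytes):
    """Return (header_block, body_so_far), or None if headers are incomplete."""
    head, sep, body = buffer.partition(HEADER_END)
    if not sep:
        return None  # a server would keep waiting here until its timeout
    return head, body


# A request cut off mid-header, as on a stalled upstream connection.
truncated = b"POST /upload HTTP/1.1\r\nHost: example.com\r\nContent-Le"
complete = b"POST /upload HTTP/1.1\r\nHost: example.com\r\n\r\npayload"

print(headers_complete(truncated))  # False
print(split_request(complete))
```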

Terrible Apache Bench results on Custom CMS

Please note: This is not a complaint about a shoddy CMS.
I was just toying with Apache Bench and got terrible results with our custom CMS. To be exact, I got:
Requests per second: 0.37 [#/sec] (mean)
When I ran another test with a plain PHP file I got:
Requests per second: 4786.07 [#/sec] (mean)
Another test with a previous version of the CMS:
Requests per second: 6068.66 [#/sec] (mean)
The website(s) are working fine with no problems detected, and Google's Webmaster Tools reports our sites as faster than 80% of pages, which is fine, I think.
The test was:
ab -t 30 -c 10 http://example.com/
Maybe some kind of Apache problem? Bad .htaccess config, or similar?
Update:
I just ran a simple test with sockets and the results are similar: the page loads very, very slowly. If I run my script against another website, everything is fine.
Also, there's a small hint about a chunk length problem. (Bad Apache Headers, or line endings?)
The site is gzipped, and with verbose logging turned on, I see these lines in the response:
LOG: Response code = 200
LOG: header received:
HTTP/1.1 200 OK
Date: Tue, 04 Oct 2011 13:10:49 GMT
Server: Apache
Set-Cookie: PHPSESSID=ibnfoqir9fee2koirfl5mhm633; path=/
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Cache-Control: post-check=0, pre-check=0
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
2ef6
Always at the same place, in the middle of the HTML-source, then <!DOCTYPE HTML> again.
Please, help.
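One reading of the stray 2ef6 line, given the Transfer-Encoding: chunked header in the log: it looks like a chunk-size line. In chunked encoding each chunk is preceded by its length in hexadecimal, and 2ef6 is 12022 bytes, so a raw socket script that prints the body without decoding will show such lines in the middle of the HTML. This is an interpretation of the log, not something the thread confirms. A minimal decoder sketch (hypothetical `decode_chunked`, no trailer handling):

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode a complete chunked body; raises ValueError if it is truncated."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Chunk size is hexadecimal; anything after ';' is a chunk extension.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; trailers (if any) are ignored here
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


print(int("2ef6", 16))  # 12022
print(decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))  # b'Wikipedia'
```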
Update #2:
Just checked my HTTP headers with Rex Swain's HTTP Viewer and got these results:
HTTP/1.1·200·OK(CR)(LF)
Date:·Wed,·05·Oct·2011·08:33:51·GMT(CR)(LF)
Server:·Apache(CR)(LF)
Set-Cookie:·PHPSESSID=n88g3qcvv9p6irm1fo0qfse8m2;·path=/(CR)(LF)
Expires:·Sat,·26·Jul·1997·05:00:00·GMT(CR)(LF)
Cache-Control:·no-store,·no-cache,·must-revalidate(CR)(LF)
Pragma:·no-cache(CR)(LF)
Cache-Control:·post-check=0,·pre-check=0(CR)(LF)
Vary:·Accept-Encoding(CR)(LF)
Connection:·close(CR)(LF)
Transfer-Encoding:·chunked(CR)(LF)
Content-Type:·text/html;·charset=UTF-8(CR)(LF)
(CR)(LF)
Do you notice anything unusual?
If it works well with ordinary web browsers (as you mentioned in the comments), the CMS probably handles the requests from Apache Benchmark differently.
A quick checklist:
AFAIK Apache Benchmark just sends simple requests without any cookie handling, so try setting -C with a valid cookie (copy the values from a web browser).
Try to send exactly the same headers to the CMS as the web browser sends. Save a dump of a valid request with netcat, HttpFox or a packet sniffer and set the missing headers with -H.
Profile the CMS on the server while you're sending it a request with Apache Benchmark; you may find the bottleneck. Two poor man's error_log calls with a timestamp, on the first and the last line of index.php (or the tested script's entry point), would show how fast the PHP script is and help to calculate the overhead of the Apache HTTP Server and the network.
If you run the socket tests and browser tests from different machines, it could be a DNS issue (turn off HostnameLookups in Apache). Try to run them from the same machine.
Try ab -k ... or ab -H "Connection: close" ....
My first guess is that the CMS does some costly initialization when it creates the session, which happens when it processes the first request. Since Apache Benchmark does not send cookies back to the CMS, it creates a new session for every request, and that causes the slow answers.
A second guess is that the CMS handles the incoming HTTP headers differently, and the headers sent by Apache Benchmark (or the lack of them) trigger some costly or slow processing. This looks more likely given the Google Webmaster Tools report.
Apache Benchmark sends an HTTP/1.0 request, for example:
GET / HTTP/1.0
Host: localhost:9100
User-Agent: ApacheBench/2.3
Accept: */*
It looks like your server does not send any HTTP header about Keep-Alive settings, yet it assumes the client uses keep-alive even when the client uses HTTP 1.0. That is not RFC-compliant behaviour:
From RFC 2616, 19.6.2 Compatibility with HTTP/1.0 Persistent Connections:
Some clients and servers might wish to be compatible with some
previous implementations of persistent connections in HTTP/1.0
clients and servers. Persistent connections in HTTP/1.0 are
explicitly negotiated as they are not the default behavior.
By default Apache Benchmark doesn't use keep-alive, so after the response arrives it waits for the server to close the socket. The server closes it after 15 seconds of idle time. Downloading the main page with wget also takes 15 seconds. Wget also uses HTTP 1.0 in the request.
I think it's a bug in the PHP code of the CMS, since ab works well on the same server with a plain PHP file. Anyway, you can work around it by using keep-alive connections (-k):
ab -k -t 30 -c 10 http://example.com/
or with explicitly disabling persistent connections:
ab -H "Connection: close" -t 30 -c 10 http://example.com/
but it's still a server-side issue and your original ab command is correct.
Please note that this bug probably affects only HTTP 1.0 clients (like Apache Benchmark, wget) and clients with regular browsers will not notice it.
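The behaviour described above (response arrives quickly, socket stays open for about 15 seconds) can be measured directly by timing the first byte and the close separately for an HTTP/1.0 request. This is a rough sketch, not a replacement for ab; `time_http10_get` and the User-Agent value are made up for the illustration.

```python
import socket
import time


def time_http10_get(host: str, port: int = 80, path: str = "/",
                    timeout: float = 30.0):
    """Return (seconds to first byte, seconds until the server closes)."""
    request = (
        f"GET {path} HTTP/1.0\r\nHost: {host}\r\n"
        "User-Agent: timing-sketch\r\nAccept: */*\r\n\r\n"
    ).encode("ascii")
    start = time.monotonic()
    first_byte = None
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(65536)
            if not data:
                break  # server closed the connection
            if first_byte is None:
                first_byte = time.monotonic() - start
    closed = time.monotonic() - start
    return first_byte, closed
```

If `time_http10_get("example.com")` returns a small first value and a second value near 15, the time is being spent waiting for the close rather than generating the page, which matches the explanation above.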