CouchDB replication is not working properly behind a proxy - apache

Note: Made some updates based on new information. Old ideas have been added as comments below.
Note: Made some updates (again) based on new information. Old ideas have been added as comments below (again).
We are running two instances of CouchDB on separate computers behind Apache reverse proxies. When attempting to replicate between the two instances:
curl -X POST http://user:pass#localhost/couchdb/_replicate -d '{ "source": "db1", "target": "http://user:pass#10.1.100.59/couchdb/db1" }' --header "Content-Type: application/json"
(we started using curl to debug the problem)
we receive an error similar to:
{"error":"case_clause","reason":"{error,\n {{bad_return_value,\n {invalid_json,\n <<\"<!DOCTYPE HTML PUBLIC \\\"-//IETF//DTD HTML 2.0//EN\\\">\\n<html><head>\\n<title>404 Not Found</title>\\n</head><body>\\n<h1>Not Found</h1>\\n<p>The requested URL /couchdb/db1/_local/01e935dcd2193b87af34c9b449ae2e20 was not found on this server.</p>\\n<hr>\\n<address>Apache/2.2.3 (Red Hat) Server at 10.1.100.59 Port 80</address>\\n</body></html>\\n\">>}},\n {child,undefined,\"01e935dcd2193b87af34c9b449ae2e20\",\n {gen_server,start_link,\n [couch_rep,\n [\"01e935dcd2193b87af34c9b449ae2e20\",\n {[{<<\"source\">>,<<\"db1\">>},\n {<<\"target\">>,\n <<\"http://user:pass#10.1.100.59/couchdb/db1\">>}]},\n {user_ctx,<<\"user\">>,\n [<<\"_admin\">>],\n <<\"{couch_httpd_auth, default_authentication_handler}\">>}],\n []]},\n temporary,1,worker,\n [couch_rep]}}}"}
So after further research it appears that apache returns this error without attempting to access CouchDB (according to the log files). To be clear when fed the following URL
/couchdb/db1/_local/01e935dcd2193b87af34c9b449ae2e20
Apache passes the request to CouchDB and returns CouchDB's 404 error. On the other hand when replication occurs the URL actually being passed is
/couchdb/db1/_local%2F01e935dcd2193b87af34c9b449ae2e20
which apache determines is a missing document and returns its own 404 error for without ever passing the request to CouchDB. This at least gives me some new leads but I could still use help if anyone has an answer offhand.

The source CouchDB (localhost) is telling you that the remote URL was invalid. Instead of a CouchDB response, the source is receiving the Apache httpd proxy's file-not-found response.
Unfortunately, you may have some reverse-proxy troubleshooting to do. My first guess is the Host header the source is sending to the target. Perhaps it's different from when you connect directly from a third location?
Finally, I think you probably know this, but the path
/couchdb/db1/_local%2F01e935dcd2193b87af34c9b449ae2e20
Is not a standard CouchDB path. By the time CouchDB sees a request, it should have the /couchdb stripped, so the query is for a document called _local%2f... in the database called db1.
Incidentally, it is very important not to let the proxy modify the paths before they hit couch. In particular, if you send %2f then CouchDB had better receive %2f and if you send / then CouchDB had better receive /.

From official documentation...
Note that HTTPS proxies are in theory supported but do not work in 1.0.1. This is because 1.0.1 ships with ibrowse version 1.5.5. The CouchDB version in trunk (from where 1.1 will be based) ships with ibrowse version 1.6.2. This later ibrowse contains fixes for HTTPS proxies.
Can you see which version of ibrowse is involved? Maybe update that ver?

Another thought I have is with regard to the SSL certs. If you don't have any, and I know you don't :), then technically you're doing SSL wrong. In java we know there are ways around this, but maybe try putting in proper certs since all SSL stuff basically involves certs.

And for my last contribution (today) I would say have you looked through this document which seems highly relevant?
http://wiki.apache.org/couchdb/Apache_As_a_Reverse_Proxy

Related

Python BaseHTTPServer vs Apache and mod_wsgi

I am setting up a very simple HTTP server for the first time, am considering my options, and would appreciate any feedback on the best way to proceed. My goal is pretty simple: I'm not serving any files, I only need to respond to a very specific HTTP POST request that will contain geolocation data, run some Python code, and return the results as JSON. I do need to be able to respond to multiple simultaneous requests. I would like to use HTTPS.
In looking on stackoverflow it seems I can potentially go with BaseHTTPServer and ThreadingMixIn, or Apache and mod_wsgi. I already have Apache installed, but have never configured it. Are there compelling reasons to go the more complicated Apache route (more complicated to me, because I will need to do research on configuring Apache and getting mod_wsgi going but already have a test instance of BaseHTTPServer up and running), or is it equally safe, secure (very important), and performance-oriented to use BaseHTTPServer for something so simple?
BaseHTTPServer is not a production grade server.
If you don't understand how to set up Apache, but want to get something with mod_wsgi running quickly and easily, then you probably want to look at mod_wsgi express.
This gives you a way of installing mod_wsgi using Python 'pip' and also provides you a way of starting up Apache/mod_wsgi with a auto generated Apache and mod_wsgiconfiguration such that you don't even need to know how to configure Apache.
The next version of mod_wsgi express to be released (version 4.3.0, likely released this week), can even set up a HTTPS site for you, with you just needing to have obtained a valid certificate or generated a self signed certificate.
I would suggest if interested you use the mod_wsgi mailing list to ask for more details about using mod_wsgi express for running a HTTPS site.
http://code.google.com/p/modwsgi/wiki/WhereToGetHelp?tm=6#Asking_Your_Questions
You can start playing around though with it for a normal HTTP site by following instructions at:
https://pypi.python.org/pypi/mod_wsgi

Bad request error in Apache2 when accessing http instead of https

I am a noob and I have recently started playing with my apache2 installation and trying to see how things are working. Also this exercise helps me figure out more things about apache2 than just reading some manual online.
But I am unable to figure out what I did now ?
So, here is my question: I enabled default-ssl (and have disabled default, i.e., have closed port 80, so that you can only connect the server with https)> I remember previously (say couple of days back) when i did the same and tried to access my website using http, it was giving me some error in the browser saying the web page could not be found or something. But today, doing the same thing give a nice error page saying one should use https instead of http.
Bad Request
Your browser sent a request that this server could not understand.
Reason: You're speaking plain HTTP to an SSL-enabled server port.
Instead use the HTTPS scheme to access this URL, please.
Hint: https: // 127.0.1.1/
And, I actually like this. But, I am trying to remember what things I might have done in between to activate such nice error page which was previously not shown.
I know I did something and I cannot remember what I did and asking you to figure that out. I feel bit stupid out there. But, it would be great if any Apache Sherlock out there who could help me. BTW, I am using Ubuntu 12.10.
Thanks

Pushing my Mercurial Repository through HTTP with Apache and Windows

So I have managed it. I can clone mercurial-repositories remotely using HTTP to my Windows Server 2003 machine and the ipaddress from that machine. Although I did deactivate IIS6 and am using Apache 2.2.x now. But not all works right now...darn! Here's the thing:
Cloning goes smooth! But when I want to push my changes to the original repository I get the message "cannot lock static http-repository". On the internet I get to read several explanations that Mercurial wasn't designed to push over HTTP connections. Still, on the Mercurial website there's something about configuring an hgrc file.
There's also the possibilty to configure Apache to host via HTTPS (or SSL). For this you have to load the module enabling OpenSSL and generating keys.
Configuring the hgrc file
Just add "push_ssl = false" under the [web] line. But where to put this file when pushing your changes back?! Because I placed it in the root of the server, in the ".hg" directory, nothing works.
Using SSL/HTTPS with Apache
When I try to access 'https://myipaddress' it fails, displaying a dutch message which would mean something like "server taking too long to respond". Trying to push also gives me a dutch error message which means about the same. It can not connect to my server via https although I followed the steps exactly at this blog.
I don't care which of the above solutions will work for me. Turns out none of them work so far. So please, can anyone help me with one of the solutions above? Pick the easiest! Help will be greatly appreciated, not only from me.
Summary
-Windows Server 2003
-Apache 2.2 with OpenSSL
-Mercurial 1.8.2
-I can clone, but not push!
Thank you!
Maarten Baar(s)
It seems like you might have apache configured incorrectly for getting it to do what you want. Based on your question it sounds like you have a path (maybe the root of the server) pointing to the repository you want to serve.
Mercurial comes with a script for this exact purpose, in the latest version it is hgweb.cgi. There are reasonably good instructions for setting it up on the mercurial site. It should allow both cloning and pushing. You will need the push_ssl=false if you will not be configuring https and also an allow_push line which will let certain users, or all (*) push to the repository. But all that should be part of the setup docs.

Can we detect if a site is on CDN?

Is there a way to detect if a site is on a Content Delivery Network and if yes, can we tell which service are they using?
A method that is achievable from the command line is using the 'host' command, with the -a flag set to see the DNS record e.g.
host -a www.visitbritain.com
Returns:
www.visitbritain.com. 0 IN CNAME d18sjq5nyxcof4.cloudfront.net.
Here you can see that the CNAME entry tells us that the site is using cloudfront as the CDN.
Just take a look at the urls of the images (and other media) of the site.
Reverse lookup IP's of the hostnames you see there and you will see who own them.
I built this little tool to identify the CDN used by a site or a domain, feel free to try it.
The URL: http://www.whatsmycdn.com/
You might also be able to tell from the HTTP headers of the media if the URL doesn't give it away. For example, media served by SimpleCDN has Server: SimpleCDN 5.6a4 in its headers.
cdn planet now have their cdn finder tool on github
http://www.cdnplanet.com/blog/better-cdn-finder/ The tool installs on the command line and allows you the feed in host names and check if they use a CDN.
If Website using GCP CDN you simply check it using curl
curl -I <https://site url>
In reponse you can find following headers there available
x-goog-metageneration: 2
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 17393
x-goog-meta-object-id: 11602
x-goog-meta-source-id: 013dea516b21eedfd422a05b96e2c3e4
x-goog-meta-file-hash: cf3690283997e18819b224c6c094f26c
Yes you can find by
host -a www.website.com
Apart from some excellent answers already posted here which include some direct methods which may or may not work for all the websites out there, there is also an indirect way to see if a CDN is there. And especially if its your own website and you want to know if you are getting what you are paying for !
The promise of a CDN is that connections from your users are terminated closer to them so that they get less TCP / TLS connection establishment overhead and static content is cached closet to them so that it loads faster, puts less strain on your origin servers.
To verify this, you can take measurements of site load times across the globe and see if all the users get similar loads times. No you dont have to get a machine everywhere in the world to do that ! Someone has already done that for you
Head to https://prober.tech/ and the URL you wish to test for load times.
Because this site itself is in Cloudflare's CDN, you can put that link itself in the test box and use it as baseline !
More information on using the tool can be found here

Apache is incorrectly converting jsp pages to "text/plain"

I've got a fairly normal setup in which Apache proxies requests to a servlet running inside Tomcat over the AJP protocol.
We've run this setup on Apache 2.0.46/Tomcat 5.0.28 for ages without problems but have recently updated to Apache 2.2.3/Tomcat 5.5.
The problem is that we've noticed that intermittently (maybe one time in 3) Apache will somehow convert the "Content-Type" HTTP header of a page served by the servlet from "text/html" to "text/plain", which results in the browser displaying the HTML source instead of rendering it.
Has anyone seen this sort of behavior before and know what might be the cause? I suspect we're doing something bad in our servlet code that the old version of Tomcat/Apache was more forgiving of.
Update: I have confirmed that it's Apache changing the headers. If I browse directly to Tomcat the problem doesn't occur.
Some webapps do not properly set mime types of content they serve, but still may work properly when served standalone because client applications like browsers are able to interpret the type of the content. But when served behind Apache, these apps will not behave correctly because Apache will provide a default type of text/plain.
A solution is to add a DefaultType None line to your apache virtual host for these web apps:
DefaultType None
http://httpd.apache.org/docs/2.2/mod/core.html#defaulttype
From my blog post:
http://patternbuffer.wordpress.com/2011/11/30/mime-type-issue-with-apache-mod_jk-and-mod_proxy-serving-plain-text/
If you're seeing this problem intermittently, it's almost certain to be something in the servlet code rather than a misconfiguration of Tomcat or httpd. Do you have logging that you can turn on to print the contents of the HTTP headers ?
To isolate the problem a bit further, you could also try bypassing httpd and going direct to the Tomcat URLs for your pages.
I haven't seen this particular behaviour before myself, so sorry I can't be more specific.
By intermittent, do you mean that some pages exhibit this behaviour and others don't, or that there are pages that sometimes exhibit the behaviour and sometimes not?
Can you attach any logging to the AJP layer to log HTTP headers at that level, so you can verify whether it's Apache or Tomcat adding the bogus header?
Are you proxying back to a cluster? Maybe one of the servers is configured wrong.
Ok. I figured it out, it was a bug in the servlet code:
We were doing something like this to write serialized Java objects as the result of HTTP requests:
DeflaterOutputStream dos = new DeflaterOutputStream(response.getOutputStream());
ObjectOutputStream oos = new ObjectOutputStream(dos);
response.setContentType("application/x-java-serialized-object");
oos.writeObject(someObject);
What seemed to be happening was that the DeflaterOutputStream and ObjectOutputStream would get garbage-collected three or four requests later when they were still attached to the response object's output stream and this would cause something to happen on the stream that confused Apache and caused it to rewrite the headers.
I replaced the above with:
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
DeflaterOutputStream dos = new DeflaterOutputStream(byteStream);
oos = new ObjectOutputStream(dos);
response.setContentType("application/x-java-serialized-object");
oos.writeObject(someObject);
oos.flush();
dos.finish();
byteStream.writeTo(response.getOutputStream());
and the problem has gone away.
The following links seem to describe a similar problem:
AJP Flush Packet causing text/plain
ASF Bugzilla – Bug 43478
I was also facing the same issue it got resolved.
If problem in only one folder then there is some servlet that is blocking the request/response and making a customize request/response to tomcat.
Tomcat 7.0.x