How to run a single MediaWiki on many servers?

I couldn't find a tutorial on this; all I found was information on how to run multiple wikis from one server. The issues are things like high-speed shared storage for images between servers and good performance with some sort of centralized caching.
Does anybody know of any guides?

It's really hard to give advice on such vague requirements, but I'll do my best.
Image storage: what are the storage and load requirements? For a Wikipedia-sized cluster, the only solution known to work is OpenStack Swift; however, it's a huge PITA to use, so NFS is probably the sane choice.
Shared caching: memcached.
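
To make the shared-caching idea concrete, here is a minimal cache-aside sketch in Python. It is not MediaWiki code: `TTLCache` is a hypothetical in-process stand-in for a shared cache such as memcached, and `get_page` and `render` are made-up names for illustration. In a real cluster, every web server would talk to the same cache service instead of a local dict.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-process stand-in for a shared cache such as memcached."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # An expired entry behaves like a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)


def get_page(
    cache: TTLCache,
    title: str,
    render: Callable[[str], str],
    ttl_seconds: float = 300.0,
) -> str:
    """Cache-aside: try the cache first, render and store only on a miss."""
    key = f"page:{title}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    html = render(title)  # the expensive step (assumed: parsing, DB queries)
    cache.set(key, html, ttl_seconds)
    return html
```

The point of the pattern is that the expensive render runs once per key per TTL window, no matter which server receives the request, as long as all servers share the same cache.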

Related

Dataset for testing KeyDB

I'm new to the world of in-memory databases, and I've only known about KeyDB for a couple of days. Is the dataset that KeyDB/Redis was tested on publicly available? I would like to do some testing myself. Thanks in advance.

You can use benchmarking software to generate different loads for KeyDB or Redis. KeyDB has a built-in benchmarking tool, 'keydb-benchmark' ('redis-benchmark' for Redis). However, the best benchmarking tools are Memtier by RedisLabs, which is great for testing throughput, and YCSB by Yahoo, which is great for testing latency under different loads. YCSB has several predefined workloads that you can run against many different databases for comparison. Each tool has pretty good documentation on how to use it.
You can take a look here at some Redis benchmark examples. Also see this section of the KeyDB blog outlining considerations for avoiding bottlenecks while benchmarking KeyDB with Memtier. Both documents include some different examples/commands.
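
If you just want to see what "generating a load" means, here is a rough Python sketch of a synthetic workload generator. It is not how Memtier or YCSB work internally; `generate_ops` and its parameters (`key_space`, `read_ratio`, `value_size`) are names invented for this illustration, and the uniform key distribution is an assumption.

```python
import random
from typing import Iterator, Tuple


def generate_ops(
    n_ops: int,
    key_space: int,
    read_ratio: float,
    value_size: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (command, key, value) tuples for a synthetic GET/SET mix.

    Keys are drawn uniformly from `key_space` distinct keys (an assumption;
    real workloads are often skewed). A fixed seed makes the run repeatable.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for _ in range(n_ops):
        key = f"key:{rng.randrange(key_space)}"
        if rng.random() < read_ratio:
            yield ("GET", key, "")
        else:
            value = "".join(rng.choices(alphabet, k=value_size))
            yield ("SET", key, value)


if __name__ == "__main__":
    # Print a few operations from a 90% read / 10% write mix.
    for op in generate_ops(5, key_space=100, read_ratio=0.9, value_size=8):
        print(op)
```

You could feed these tuples to any client library, but for real measurements the dedicated tools above are the better choice, since they handle connection pooling, pipelining, and latency reporting.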

Scaling CakePHP Version 2.3.0

I'm beginning a new project using CakePHP. I like the "auto-magic" features, and I think it's a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day and hundreds of thousands of database writes and reads a day, with about 50,000 to 500,000 users, often with 3000 concurrently using the site. I'm making heavy use of stored procedures to offset this, and I'm accessing several servers, including a load balancer.
I'm wondering about the computational cost of some of the auto-magic and how well Cake copes with session requests making many DB hits. Has anyone had success with Cake running on a single server array setup with this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks with this framework. I'm interested in advice from anyone who has worked with Cake in production. I've researched this, but I would love a second opinion. Thank you for your time.

This is not a problem, but optimization is up to you.
There are different cache methods you can implement: memcache, Redis, full-page caching... All of that is already supported by Cake. What you cache and where is up to you.
For searching, you could try Elasticsearch to speed things up.
There are "before" dispatcher filters to bypass controller instantiation (you might want to do that in special cases; check the asset filter for an example).
Use nginx, not Apache.
Also, I would not start by over-optimizing and over-thinking this before any code is written. Start well and think about caching, but analyse and fix bottlenecks when you actually come across them. Otherwise you'll waste a lot of time on optimization before you have even written anything that works.
Cake itself is very fast. Just to prove the bullshit factor of the fancy benchmarks some frameworks publish, we did one using a dispatcher filter to "optimize" it and even beat Yii, which seems pretty eager to show how fast it is. But benchmarks are pointless, especially in a huge project where so many human-made failures can be introduced.

Concurrent page request comparisons

I have been hoping to find out what different server setups equate to in theory for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is the approximate maximum number of concurrent page requests for the following setups?
apache+php+mysql (1 server)
apache+php+mysql+caching (like memcached or similar; still one server)
apache+php+mysql+caching+dedicated database server (2 servers)
apache+php+mysql+caching+dedicated DB+load balancing (multi webserver/single dbserver)
apache+php+mysql+caching+dedicated DB+load balancing (multi webserver/multi dbserver)
+distributed (Amazon cloud elastic) -- I know this one is "as much as you can afford" but it would be nice to know when to move to it.
I appreciate any constructive criticism. I am just trying to figure out when it's time to move from one implementation to the next, because each comes with its own implementation effort, either programming-wise or setup-wise.

In your question you talk about caching, and this is probably one of the most important factors in a web architecture with respect to performance and capacity.
Memcache is useful, but before that you should ensure your server responses carry proper HTTP cache directives. This does two things: it reduces the number of requests and it speeds up server response times (if you have Apache configured correctly). This can be improved further by using an HTTP accelerator like Varnish and a CDN.
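
As a rough illustration of what "proper HTTP cache directives" can look like, here is a small Python sketch of a response builder that sets `Cache-Control` and `ETag` and answers a matching `If-None-Match` with `304 Not Modified`. The function name `respond`, the 300-second `max_age` default, and the hash-based ETag are assumptions made for the example, not settings recommended by the answer.

```python
import hashlib
from typing import Dict, Optional, Tuple


def respond(
    body: bytes,
    if_none_match: Optional[str],
    max_age: int = 300,
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) with basic HTTP caching headers.

    Simplification: only an exact single-value If-None-Match is handled;
    weak validators and comma-separated lists are ignored.
    """
    # Derive a strong validator from the content itself.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        # Lets browsers and shared caches (Varnish, CDN) reuse the response.
        "Cache-Control": f"public, max-age={max_age}",
        "ETag": etag,
    }
    if if_none_match == etag:
        # The client's copy is still current: send no body.
        return 304, headers, b""
    return 200, headers, body
```

A first request gets a `200` with the body and an `ETag`; a revalidation that sends the same tag back gets an empty `304`, which is where the saving in bandwidth and response time comes from.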
Another factor to consider is whether your system is stateless. Stateless usually means that it doesn't store sessions on the server and look them up on every request. A good systems architecture relies on state as little as possible: the less state, the more horizontally scalable the system. Most people introduce state when confronted with personalisation, i.e. serving different content to different users. In such cases you should first investigate HTML5 session storage (i.e. storing the complete user data in JavaScript on the client, obviously over HTTPS) or, if the data set is smaller, secure JavaScript cookies. That way you can still serve cached resources and then personalise with JavaScript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you are only reading data from the system, then again it should be quite easy to scale horizontally. If there are both reads and writes, it's typically better to keep the read-write datasets in one database and the read-only datasets in another. You can then scale each with the methods that suit it.
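
One common way to act on the read/write separation is to route statements by type. The Python sketch below is only one reading of the advice above: `QueryRouter` and the server names are hypothetical, and it assumes the read-only databases are kept in sync with the primary by some mechanism not shown here.

```python
import itertools
from typing import Iterator, List


class QueryRouter:
    """Send plain SELECTs to read-only databases and everything else to the primary.

    Caveats (assumptions of this sketch): locking reads such as
    SELECT ... FOR UPDATE and reads that must see a just-committed write
    should go to the primary, which this simple check does not detect.
    """

    def __init__(self, primary: str, replicas: List[str]) -> None:
        self.primary = primary
        # Round-robin over the read-only servers; fall back to the primary.
        self._replicas: Iterator[str] = (
            itertools.cycle(replicas) if replicas else itertools.repeat(primary)
        )

    def target_for(self, sql: str) -> str:
        stripped = sql.lstrip()
        first_word = stripped.split(None, 1)[0].upper() if stripped else ""
        if first_word == "SELECT":
            return next(self._replicas)
        return self.primary
```

With a router like this, adding read capacity means adding servers to the read-only list, while write scaling stays a separate problem.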

These setups do not spit out a single answer that you can then compare against each other. The answer depends on far more factors than you have listed.
Even if they did spit out a single answer, it would be just one metric out of dozens. What makes this the most important metric?
Even worse, none of these alternatives is free. Each one carries engineering effort and maintenance overhead, which cannot be analysed without understanding your organisation, your app and your cost/revenue structures.
Options like AWS not only involve development effort but may also "lock you in" to a solution, so you need to be aware of that too.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.
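
For what "measure, measure, measure" can look like in practice, here is a minimal Python sketch using the standard-library `cProfile` and `pstats` modules. The functions `slow_render` and `handle_request` are made-up stand-ins for real application code; the same idea applies to whatever profiler your own stack provides.

```python
import cProfile
import io
import pstats
from typing import Callable


def slow_render(n: int) -> int:
    """Made-up stand-in for an expensive part of a page request."""
    return sum(i * i for i in range(n))


def handle_request() -> int:
    """Made-up stand-in for a full request handler."""
    return slow_render(200_000)


def profile(func: Callable[[], object], top: int = 10) -> str:
    """Run `func` under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile(handle_request))
```

The report shows where the time actually goes, which is the information needed before deciding whether caching, a dedicated database server, or more web servers would help at all.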

ZooKeeper and RabbitMQ/Qpid together - overkill or a good combination?

Greetings,
I'm evaluating some components for a multi-data center distributed system. We're going to be using message queues (via either RabbitMQ or Qpid) so agents can make asynchronous requests to other agents without worrying about addressing, routing, load balancing or retransmission.
In many cases, the agents will be interacting with components that were not designed for highly concurrent access, so locking and cross-agent coordination will be needed to avoid race conditions. Also, we'd like the system to automatically respond to agent or data center failures.
With the above use cases in mind, ZooKeeper seemed like it might be a good fit. But I'm wondering if trying to use both ZK and message queuing is overkill. It seems like what ZooKeeper does could be accomplished by my own cluster manager using AMQP messaging, but that would be hard to get really right. On the other hand, I've seen some examples where ZooKeeper was used to implement message queuing, but I think RabbitMQ/Qpid are a more natural fit for that.
Has anyone out there used a combination like this?
Thanks in advance,
-Chris

Coming into this late, but maybe it will be of some use. The primary consideration should be the performance characteristics of your system. ZooKeeper, like you said, is more than capable of implementing a task distribution system using a distributed queue, but ZK is currently more optimized for reads than for writes (this only comes into play in the range of thousands of operations per second). If your throughput needs are lower than that, using just ZK to implement your system would reduce the number of runtime components and make it simpler. Of course, you should always run your own performance tests before deciding.
Distributed coordination is really hard to get right, so I would definitely recommend using ZooKeeper for that and not rolling your own.

I'm not quite sure what exactly ZooKeeper is, but I guess that using a component from Apache (if it fits your needs well) is preferable to handling things like distributed synchronization and group services on your own. You could of course hire a team of developers especially for that purpose, but that doesn't guarantee you a better implementation.
I guess it would be implemented as a separate component anyway, because doing it any other way could add a lot of complexity and slow down the workflow; so the preference for ZooKeeper or something similar is kind of obvious (to me).
And surely, unless you're in the global optimization phase of your project, I guess it would be better to use RabbitMQ or similar (I would even stress that, because AMQP implementations, especially commercial ones, will be more reliable than anything you'd come up with yourself).
So I would go for both, carefully choosing the appropriate third-party products, but using only as many of them as needed. And that's just my opinion; thanks for reading :)

Katta in production environment

According to the website, Katta is a scalable, failure-tolerant, distributed, indexed data store.
I would like to know if it is ready to be deployed in a production environment. Is anyone already using it who can offer advice? Any pitfalls? Recommendations? Testimonials? Please share.
Any answer would be greatly appreciated.

We have tried using Katta and, for what it's worth, found it very stable and relatively easy to manage (compared to managing plain vanilla Lucene).
The only pitfall I can think of is the lack of real-time updates. When we tested it (about 9-10 months back), an update meant rebuilding the index in a separate process (a Hadoop job or what have you) and replacing the live index. This was a deal-breaker for us.
If you are looking into distributed Lucene, you should really try out ElasticSearch or Solandra.
