Memory requirements when hosting R in the cloud - rapache

What is the smallest server we would need to run OpenCPU if we expect 100,000 hits a month?
I think OpenCPU is an exciting project, but I need to know about memory usage when OpenCPU is deployed, since a cloud hosting service such as Rackspace charges about $40 per month for 1 GB of RAM.
I know that if I load R without doing anything, and without loading any data or packages, it uses almost 700 MB of virtual memory and about 50 MB of resident RAM.
I know that OpenCPU uses rApache, and that rApache uses preforking, but I want to know how this will scale as the number of concurrent users increases. Thanks.
Thanks for the responses
I talked with Jeroen Ooms when visiting LA, and am partly convinced that OpenCPU will work in high-concurrency environments if used correctly, and that he is available to fix issues if they arise. OpenCPU is related to his dissertation, after all! In particular, what I find useful about OpenCPU is its integration with Ubuntu's AppArmor, which can restrict processes from using too much RAM and CPU. I think Apache might also be able to do this, but RAppArmor can do this and much more. Brilliant! If AppArmor were the only advantage, I would just use that and JSON as a backend, but it seems like OpenCPU can also streamline the installation of all this stuff and provides a built-in API system.
Given the cost of web-hosting, I imagine a workable real-time analytics system is the following:
create R statistical models on demand, on a specialized analytical server, as often as needed (e.g. every day or hour using cron)
transfer the results of the models, as native R objects, to a directory on the OpenCPU server using FTP
on the OpenCPU server, go to that directory, grab the R objects representing the statistical models, and then make predictions or run simulations with them. For example, use the 'predict' function to provide estimates based on user-supplied variables (a rough sketch is below).
Does anybody else see this as a viable way to make R a backend for real-time analytics?
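For concreteness, here is a rough R sketch of that workflow; the lm() model, the file path and the transfer mechanism are just placeholders, not anything specific to OpenCPU:

    ## On the analytical server (run from cron, e.g. daily):
    fit <- lm(mpg ~ wt + hp, data = mtcars)      # stand-in for the real model fit
    saveRDS(fit, "/srv/models/mpg_model.rds")    # serialize as a native R object
    ## ... then push /srv/models/mpg_model.rds to the OpenCPU box via ftp/scp/rsync

    ## On the OpenCPU server, inside the function exposed through the API:
    predict_mpg <- function(wt, hp) {
      fit <- readRDS("/srv/models/mpg_model.rds")            # load the pre-built model
      predict(fit, newdata = data.frame(wt = wt, hp = hp))   # score user-supplied values
    }

The heavy model fitting stays on the analytics box; the OpenCPU server only deserializes a small object and calls predict, which keeps per-request memory low.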

Dirk is right: it all depends on the R functions that you are calling; the overhead of the OpenCPU architecture should be quite minimal. OpenCPU itself will run on a small server, but as we all know, some functionality in R requires much more memory/CPU than other functionality.
If you really want to know how many resources are needed just to run OpenCPU, you could do some benchmarking. As you noted, prefork is used to branch sessions off the main process, so in most cases the copy-on-write behavior of forking should make it pretty cheap.
There is also other stuff you can tweak, e.g. preloading frequently used packages.
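If you want to put a rough number on the cost of that, here is a minimal R sketch (MASS and survival are just stand-ins for whatever packages you would actually preload) comparing the memory gc() reports before and after loading them:

    mem_mb <- function() sum(gc()[, 2])   # column 2 of gc()'s output is used memory in Mb

    before <- mem_mb()
    invisible(lapply(c("MASS", "survival"), require, character.only = TRUE))
    after <- mem_mb()

    cat(sprintf("empty session: %.1f Mb, after preloading: %.1f Mb (+%.1f Mb)\n",
                before, after, after - before))

With copy-on-write forking, memory used by packages loaded in the parent process is largely shared across the child processes, which is what makes preloading attractive.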

Related

Virtualize specific environment (CPU, cache, clock)

I have written some code that's supposed to run on a certain hardware setup. I'd like to test it to get some preliminary metrics, but without buying the hardware, since it's very expensive.
At first I naively thought I could set the platform's specifications when creating a virtual machine through a manager such as VMware Workstation, but it seems that isn't possible.
What do you believe would be the best way to emulate a specific environment? RAM, disk space and OS should be fairly easy, but limiting the CPU seems to be the main issue.
I'm trying to simulate the Intel Atom® Processor E3845, so I have requirements for the maximum number of cores, the cache size and, of course, the clock frequency.
The closest I've found so far would be to install VMware ESXi on a piece of hardware and limit the CPU, but I'm unsure whether this is the best way. Further, I've never really worked with this before, which is why I'm unsure whether I can limit the cache and so forth. Simply "down-scaling" the metrics does not feel like a good solution when we are rather dependent on the cache (that is, we've seen issues with certain cache sizes and speeds).
I would love to hear your input if you have any.

How much is 1/8th of a core?

I'm new to cloud computing and, for the life of me, I can't figure out how "much" 1/8th of a core is in practical terms.
I know what kind of CPUs Amazon EC2 uses for m1.small, but let's say (for educational purposes) that it is a single-core 1 GHz CPU.
How is 1/8th of a core calculated? Does it mean my application will run with 128 MB of RAM and 1/8 of the 1 GHz CPU? Or will my application be able to run only a certain number of operations/CPU cycles before I'm charged for an additional app-cell?
What I need is a practical explanation of the phrase. Perhaps using a simple vert.x HTTP server, where each successful connection calculates 2 + 3? Vert.x uses less than 128 MB of RAM.
AFAIK, you don't have a limit on the number of cycles: if your application requires many CPU cycles, it will probably just run slower, since it can only use 1/8 of a core.
Regarding memory, if you are using just one app-cell but your app requires more than 128 MB, it will probably result in an out-of-memory error.
Slicing the server into eighths isn't as mathematical as you might expect. Sharing server resources among multiple tenants allows the CPU to be used more efficiently overall than on a classic server, so even though you pay for only 1/8 of the server, you can actually get more resources, but only when your application actually uses them.
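To put rough numbers on the educational example above (an approximation, not how any particular provider meters it): a single-core 1 GHz CPU executes on the order of 10^9 cycles per second, so an average 1/8 share works out to about 10^9 / 8 = 1.25 * 10^8 cycles per second over the billing interval. As the answers note, the slice is typically enforced as an average by the scheduler rather than cycle by cycle, so an otherwise idle host can let your application burst above that share.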

How to measure the memory usage per active Apache Connection?

I would like to measure the memory consumption of one active Apache connection (= thread) under Ubuntu.
Is there a monitoring tool which is capable of doing this?
If not, does anyone know roughly how much memory an Apache connection needs?
Activate the mod_status module and you'll get a report on the /server-status page; there is a more parseable version at /server-status?auto. If you enable ExtendedStatus On you will get a lot of information on processes and threads.
This is the page used by monitoring tools to track a lot of stats parameters, so you will certainly find the one you need (edit: if it is not memory...). Be careful with the security/access settings of this page; it's a nice tool for checking how your server responds to a DoS :-)
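For reference, a minimal sketch of what enabling this looks like in the Apache configuration (the Require line assumes Apache 2.4; on 2.2 use Order/Allow directives instead, and tighten access as needed):

    ExtendedStatus On

    <Location "/server-status">
        SetHandler server-status
        Require local
    </Location>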
About memory, note that Apache loves memory. How much memory each process uses depends on a lot of things (the number of modules loaded, whether you actually need all the ones you have, the number of VirtualHosts, etc.). But with a stable configuration it does not move a lot (except if you use PHP scripts with a high memory limit...). If you find memory leaks, try limiting the number of requests per process with MaxRequestsPerChild (Apache will kill the child and spawn a new one).
edit: in fact there is not a lot of memory info in server-status. About monitoring tools: any tool using the SNMP MIB-II can track memory usage per process, with average/top/low values for the different children (Cacti, Nagios, Munin, etc.) if you have an snmpd daemon running. Check this excellent Munin example. It does not track each Apache child individually, but it will give you an idea of what you can track with these tools. If you do not need a complete monitoring system such as Nagios or Centreon, with alerts, user management and big networks (and if you do not have a lot of days for reading books), Munin is, IMHO, a nice tool for getting monitoring reports up quite fast.
I'm not sure if there are any tools for doing this, but you could estimate it yourself. Start Apache and check how much memory it uses without any active connections. Then create a large number of connections and check again how much memory it uses.
You could use JMeter to create different workloads.

Services similar to S3/EC2

Does any other provider offer a cloud computing + storage layer like S3/EC2, with free data transfer between the two layers?
I have looked at:
SoftLayer CloudLayer Storage -- no free transfer between CloudLayer storage and CloudLayer computing instances.
Rackspace CloudFiles -- quite a bit of marketing mumbo-jumbo, and something about Cloud Connect; I gave up on the site once the Live Chat CSS popup started following me around.
Does anyone know of any others?
I'm looking to store some large (non-random-access) files on a storage solution for constant re-processing, and process them nearby, without paying transfer costs daily (looking to store in the 500-2000 GB range, re-processing it all daily).
Re-processing requires a (Linux) server with a "decent" (weasel word alert) configuration.
Thanks!
'Cloud computing' is a bit of a myth.
They're all just, essentially, virtual private servers. 'Cloud' instances tend to have the flexibility to be billed by the hour rather than monthly, but they're still just a VPS.
Persistent storage is a useful feature offered by a very limited number of VPS providers, but one that can easily be emulated by having two or more VPSes in the same data centre (Linode is an excellent VPS provider with free local data transfer; sadly they're rather limited in capacity). I don't know of any other VPS/cloud providers who offer their own persistent storage solution.
It is something you can easily achieve yourself. VPS servers tend to be a little restrictive on hard drive capacity if you're looking for 500-2000 GB. Perhaps you could consider a dedicated server and handle storage and processing on the same machine... you can't get data more local than the same machine!
First, the short version: stop looking for “free”.
Now, in more detail: you're looking to consume some somewhat-non-trivial computing, data storage and networking resources. Presumably you've got a good reason for doing this; if you truly have, you'll have the ability to also purchase the resources required for what you want to do. There are a few options on this front, none of which are free:
Buy and host your own hardware.
Buy the hardware and host it in a colocation facility.
Hire the hardware:
    Long-term hire
    Short-term hire
All Amazon is doing is short-term, easy-to-set-up hiring of resources. Their prices are quite keen (if some other option is cheaper, it's because it is missing something significant that Amazon does; maybe it's something you don't need, but that's up to you to figure out). You can host the core of the Amazon API quite easily on whatever resources you've hired (see Eucalyptus), but be aware that going from having the software and the API to having everything work smoothly is a really big step; the more I work with Eucalyptus installations, the more impressed with Amazon I become. And that's despite also being pretty impressed with Eucalyptus itself.
But none of this is free. It takes real resources to provide – e.g., electricity to power the machines and keep them cool and a building to house them in – and ultimately, that's got to be paid for somewhere. To expect otherwise is to believe that others should have to pay for things for you; it's pretty rare that that happens, and the more you need to consume, the rarer it is (especially if the economy isn't doing too good). So stop thinking in terms of how you can get it for nothing (“freeload”) and instead take a good look at what it really costs to provide through various routes and seek to minimize your costs. If you can't afford even that, your #1 problem isn't hosting but funding; fix that first.
Rest assured you're not alone in this matter. This is what lots of other people worldwide have to do to make their projects into reality. Good luck!
GoGrid has external storage with free transfer and access over typical protocols like SMB, NFS, rsync and FTP. The first two allow mounting it as a normal drive.
Note also that many providers will allow you to create cloud servers with 2 TB of instance storage. I certainly can't name all of them, but you can find some at cloudorado.com.

Best Dual HD Setup for Development

I've got a machine I'm going to be using for development, and it has two 7200 RPM 160 GB SATA HDs in it.
The information I've found on the net so far seems a bit conflicted about which things (OS, swap files, programs, solution/source code/other data) I should install on how many partitions on which drives to get the most benefit from this setup.
Some people suggest having a separate partition for the OS and/or swap, some don't bother. Some people say the programs should be on the same physical drive as the OS with the data on the other, some say the other way around. Same with the swap and the OS.
I'm going to be installing Vista 64-bit as my OS and regularly using Visual Studio 2008, VMware Workstation, SQL Server Management Studio, etc. (pretty standard dev tools).
So I'm asking you--how would you do it?
If the drives support RAID configurations in your BIOS, you should do one of the following:
RAID 1 (Mirror) - Since this is a dev machine this will give you the fault tolerance and peace of mind that your code is safe (and the environment since they are such a pain to put together). You get better performance on reads because it can read from both/either drive. You don't get any performance boost on writes though.
RAID 0 - No fault tolerance here, but this is the fastest configuration because you read and write off both drives. Great if you just want as fast as possible performance and you know your code is safe elsewhere (source control) anyway.
Don't worry about multiple partitions or OS/data configs, because on a dev machine you sort of need it all anyway, and you shouldn't be running heavy multi-user databases or anything like that (as you would on a server).
If your BIOS doesn't support RAID configurations, however, then you might consider doing the OS/Data split over the two drives just to balance out their use (but as you mentioned, keep the programs on the system drive because it will help with caching). Up to you where to put the swap file (OS will give you dump files, but the data drive is probably less utilized).
If they're both going through the same disk controller, there's not going to be much difference performance-wise no matter which way you do it. If you're going to be running lots of VMs, I would split one drive for OS and swap / programs and data, then keep all the VMs on the other drive.
Having all the VMs on an independent drive lets you move that drive to another machine seamlessly if the host fails, or if you upgrade.
Mark one drive as your warehouse: put all of your source code, data, assets, etc. on there and back it up regularly. You'll want this to be stable and easy to recover. You can even switch My Documents to live here if you want.
The other drive should contain the OS, drivers, and all applications. This makes it easy and secure to wipe the drive and reinstall the OS every 18-24 months as you tend to have to do with Windows.
If you want to improve performance, some say put the swap on the warehouse drive. This will increase OS performance, but will decrease the life of the drive.
In reality it all depends on your goals. If you need more performance then you even out the activity level. If you need more security then you use RAID and mirror it. My mix provides for easy maintenance with a reasonable level of data security and minimal bit rot problems.
Your most active files will be the registry, page file, and running applications. If you're doing lots of data crunching then those files will be very active as well.
If 160 GB of total capacity will cover your needs (plenty of space for the OS, applications and source code; it just depends on what else you plan to put on it), I would suggest mirroring the drives in a RAID 1, unless you have a server that the data is backed up to, an external hard drive, an online backup solution, or some other means of keeping a copy of the data on more than one physical drive.
If you need to use all of the drive capacity, I would suggest using the first drive for the OS and applications and the second drive for data, purely because, if you change computers at some point, the OS on the first drive doesn't do you much good and most applications would have to be reinstalled, but you could take the entire data drive with you.
As for splitting off the OS, a big downfall of this is not giving the partition enough space; eventually you may need to use partitioning software to steal some space from the other partition on the drive. It never seems to fail: you allocate a certain amount of space for the OS partition, and right after the install you have several gigs of free space so you think you are fine, but as time goes by things build up on that partition and you run out of space.
With that in mind, I still typically use an OS partition, as it is useful when reloading a system: you can format that partition, blowing away the OS but keeping the rest of your data. Ways to keep the space build-up from happening too fast are to change the location of your My Documents folder and to change environment variables such as TEMP and TMP. However, some things just refuse to put their data anywhere besides the system partition. I used to use 10 GB; these days I go for 20 GB.
Giving swap its own partition can be useful for keeping drive fragmentation down while letting your swap file grow and shrink as needed. Again, though, this is a matter of guessing how much swap you need, which will depend a lot on the amount of memory you have and how much you will be running at one time.
For the posters suggesting RAID - it's probably OK at 160GB, but I'd hesitate for anything larger. Soft errors in the drives reduce the overall reliability of the RAID. See these articles for the details:
http://alumnit.ca/~apenwarr/log/?m=200809#08
http://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable/
You can't believe everything you read on the internet, but the reasoning makes sense to me.
Sorry I wasn't actually able to answer your question.
I usually run a box with two drives. One for the OS, swap, typical programs and applications, and one for VMs, "big" apps (e.g., Adobe CS suite, anything that hits the disk a lot on startup, basically).
But I also run a cheap fileserver (just an old machine with a couple of hundred gigs of disk space in RAID 1) that I use to store anything related to my various projects. I find this a much nicer solution than storing everything on my main dev box: it doesn't cost much and gives me somewhere to run a webserver, my personal version control, etc.
Although I admit it really isn't doing much I couldn't do on my machine, I find it's a nice solution, as it helps prevent me from spreading stuff around my workstation's filesystem at random by forcing me to keep all my work in one place where it can be easily backed up, copied elsewhere, etc. I can leave it on all night without huge power bills (it uses <50 W under load) so it can back itself up to a remote site with a little script, and I can connect to it from outside via SSH (so I can always scp anything I need).
But really the most important benefit is that I store nothing of any value on my workstation box (at least nothing that isn't also on the server). That means if it breaks, or if I want to use my laptop, etc. everything is always accessible.
I would put the OS and all the applications on the first disk (1 partition). Then, put the data from the SQL server (and any other overflow data) on the second disk (1 partition). This is how I'd set up a machine without any other details about what you're building. Also make sure you have a backup so you don't lose work. It might even be worth it to mirror the two drives (if you have RAID capability) so you don't lose any progress if/when one of them fails. Also, backup to an external disk daily. The RAID won't save you when you accidentally delete the wrong thing.
In general I'd try to split up things that are going to be doing a lot of I/O (such as if you have autosave in VS going off fairly frequently). Think of it as a sort of I/O multithreading.
I've observed significant speedups by putting my virtual machines on a separate disk. Whenever Windows is doing something stupid in the VM (e.g., indexing yet again), it doesn't thrash my Mac's disk quite so badly.
Another issue is that many tools (Visual Studio comes to mind) break in frustrating ways when bits of them are on the non-primary disk.
Use your second disk for big random things.