Running moses server on Amazon - amazon-s3

I am trying to run a Moses server on an Amazon EC2 EBS-backed instance. The language models and translation models are about 200 GB in total. My idea is to have an instance with Moses installed load the language models and translation models stored on S3, but I do not know how to configure the moses.ini file so that Moses knows the paths of the ttable-file and lmodel-file. If anyone has done this before, any help would be greatly appreciated!
Thank you.

I wouldn't recommend Amazon S3 for this. S3 is meant for efficiently distributing files across the web; if your whole purpose is just to read these files inside a VM, then storing them in S3 is not the right choice. Refer to this answer for more details.
To answer your question: yes, it is possible to mount an S3 bucket as a folder inside your server using S3FS. Here are instructions for Ubuntu and Red Hat.
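For what it's worth, a minimal sketch of the S3FS route on Linux (the bucket name, mount point and credentials below are hypothetical); afterwards the ttable-file and lmodel-file entries in moses.ini would simply point at paths under the mount:

# store the S3 credentials where s3fs expects them (hypothetical keys)
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
# mount the bucket holding the models as a local folder
mkdir -p /mnt/moses-models
s3fs my-moses-models /mnt/moses-models -o passwd_file=~/.passwd-s3fs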
Better approaches, though, would be the following (a rough sketch follows the list):
If you don't have enough space on the disk, install the Moses server on a different partition, format it with Btrfs, and enable transparent compression. Files are then compressed/decompressed automatically as they are written to and read from disk, so you end up using far less space. In many benchmarks transparent compression is also faster, since less data is transferred between the disk and RAM, especially with large files.
You can always attach a secondary EBS volume to your running VM (like a secondary hard disk). Use that for storing the translation/language models (and you can combine it with transparent compression as above too).
Run a separate VM without EBS, using just the normal instance storage, and use it to store the translation models alone. On your Moses server you can then mount them from this separate non-EBS VM using SSHFS.
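A rough sketch of the above on Linux (device names, mount points and hostnames here are hypothetical; check your actual device with lsblk):

# format the extra partition or attached EBS volume with Btrfs and
# mount it with transparent compression enabled
mkfs.btrfs /dev/xvdf
mkdir -p /mnt/models
mount -o compress=lzo /dev/xvdf /mnt/models

# or, for the separate storage VM: mount its model directory over SSHFS
sshfs ubuntu@storage-vm:/srv/models /mnt/models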
Overall, don't use S3; there are other, much better ways.
Edit: Added link

Related

Is it possible to configure the working directory with multiple data folders

I've currently installed Redis on a VM which has two mounted disks. I'd like to use those two mounted disks as working directories for Redis.
So is it possible to configure the Redis working directory dir with multiple folder locations?
Thanks!
NO, you CANNOT do that.
Redis can only hold data that fits into memory. Normally that size is much smaller than the size of a disk, and there's no need to use multiple disks to extend the storage.
In some cases multiple disks might help, e.g. when Redis is dumping the data set to disk while syncing with slaves, or when Redis writes both AOF and RDB files. In those cases there are multiple readers or writers working at the same time, which might cause performance issues (i.e. too many disk seeks).
However, since Redis is focused on in-memory storage, I'm not sure that's a big enough problem to worry about.
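To make it concrete: the working directory is a single path, and both the RDB and AOF files end up under it. A minimal check/change with redis-cli (the path here is hypothetical):

redis-cli CONFIG GET dir                    # shows the single working directory
redis-cli CONFIG SET dir /mnt/disk1/redis   # point it at one of the mounted disks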

Fastest possible way to transfer a large directory over VPN

I have a problem. Every day I have to upload my whole source code (a directory containing several directories and files) to a server over VPN. The source code is around 250 MB. What I do every day is compress it (which reduces its size to around 100 MB), transfer the zipped file over FTP to the server, and finally unzip it there. The transfer takes me around 20 minutes.
I am sure there has to be a better way of doing this. Please suggest either a better compression mechanism or a faster upload method.
If you could set up a version control server it would be great; Mercurial and Git are perfect for this.
The other option is using rsync, which is a synchronizing tool that only uploads the differences between the two versions, avoiding repetitive transmission of data.
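A minimal sketch of the rsync approach over SSH (paths and hostname are hypothetical); the -z flag compresses on the wire, so the separate zip step goes away:

# -a preserves permissions/timestamps, -z compresses in transit,
# --delete removes files on the server that no longer exist locally
rsync -az --delete ./source-code/ user@server:/srv/source-code/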
I'm assuming a UNIX-like environment here, but on Windows the options are pretty much the same.
PS: this question is a better fit for Super User.

Memory requirements when hosting R in the cloud

What is the minimal size server we need to run opencpu, if we expect 100,000 hits a month?
I think OpenCPU is an exciting project, but I need to know about memory usage when OpenCPU is deployed, since a cloud hosting service such as Rackspace charges about $40 per month for 1 GB of RAM.
I know that if I load R without doing anything, and without loading any data or packages into RAM, it uses almost 700 MB of virtual memory and 50 MB of resident memory.
I know that OpenCPU uses rApache, and rApache uses preforking, but I want to know how this will scale as the number of concurrent users increases. Thanks.
Thanks for the responses
I talked with Jeroen Ooms when visiting LA, and am partly convinced that OpenCPU will work in high-concurrency environments if used correctly, and that he is available to fix issues if they arise. OpenCPU is related to his dissertation, after all! In particular, what I find useful about OpenCPU is its integration with Ubuntu's AppArmor, which can restrict processes from using too much RAM and CPU. I think Apache might also be able to do this, but RAppArmor can do this and much more. Brilliant! If AppArmor were the only advantage, I would just use that and JSON as a backend, but it seems like OpenCPU also streamlines the installation of all this stuff and provides a built-in API system.
Given the cost of web hosting, I imagine a workable real-time analytics system would be the following (a rough sketch of the transfer step appears after the list):
create R statistical models on demand, on a specialized analytical server, as often as needed (e.g. every day or hour using cron)
transfer the resulting models, as native R objects, to a directory on the OpenCPU server using FTP
on the OpenCPU server, go to that directory, load the R objects representing the statistical models, and then make predictions or run simulations with them. For example, use the 'predict' function to provide estimates based on user-supplied variables.
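A rough sketch of the transfer step as a cron entry, using scp instead of plain FTP; the script name, paths and host are hypothetical:

# hypothetical crontab entry (all on one line): rebuild the models at 02:00,
# then push the saved R objects to the OpenCPU box
0 2 * * * Rscript /srv/analytics/build_models.R && scp /srv/analytics/models/*.rds opencpu-host:/var/opencpu/models/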
Does anybody else see this as a viable way to make R a backend for real time analytics?
Dirk is right: it all depends on the R functions that you are calling; the overhead of the OpenCPU architecture should be quite minimal. OpenCPU itself will run on a small server, but as we all know, some functionality in R requires much more memory/CPU than others.
If you really want to know how many resources are needed just to run OpenCPU, you could do some benchmarking. As you noted, prefork is used to branch sessions off the main process, so in most cases the copy-on-write behavior of forking should make it pretty cheap.
Also there is some other stuff that you can tweak; e.g. preloading of frequently used packages.

What's a good way to backup (and maybe synchronize) your development machine? [closed]

I make extensive use of source control for anything that relates to a project I'm working on (source, docs etc.) and I've never lost anything that way.
However, I have had two or three crashes (spread over the last 4 years) on my development machine that forced me to reinstall my system and reconfigure my apps (eclipse, vim, Firefox, etc.). For weeks after reinstalling, I was missing one little app or another, some PHP or Python module wasn't there, stuff like that.
While this is not fatal, it's very annoying and sucks up time. Because it seemed so rare, I didn't bother about an actual solution, but meanwhile I've developed a mindset where I just don't want stuff like that happening anymore.
So, what are good backup solutions for a development machine? I've read this very similar question, but that guy really wants something different than me.
What I want is to have spare harddrives on the shelf and reduce my recovery time after a crash to something like an hour or less.
Thinking about this, I figured there might also be a way to use the backup mechanism for keeping two or more dev workstations in sync, so I can continue work at a different PC anytime.
EDIT: I should've mentioned that
I'm running Linux
I want incremental backup, so that it's cheap to do it frequently (once or twice a day)
RAID is good, but I'm on a laptop most of the time, no second hd in there, no E-SATA and I'm not sure about RAIDing to a USB drive: would that actually work?
I've seen sysadmins use rsync, has anybody had any experiences with that?
I would set up the machine how you like it and then image it. Then you can set up rsync (or even SVN) to back up your home dir nightly.
Then when your computer dies, you can reimage, and then redeploy your home dir.
The only problem would be upgraded/new software, but the only way to deal completely with that would be to do complete nightly backups of your drive(s).
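A minimal sketch of the nightly rsync part (the backup destination is hypothetical; it could equally be a remote host reachable over SSH):

# -a keeps permissions/timestamps and only transfers changed files;
# --delete mirrors deletions so the copy matches the home dir exactly
rsync -a --delete ~/ /mnt/backup/home/

# hypothetical crontab entry to run it every night at 01:00
0 1 * * * rsync -a --delete ~/ /mnt/backup/home/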
Thanks, this sounds like a good suggestion. I think it should be possible to also update the image regularly (to pick up software updates/installs), but maybe not that often. E.g. I could boot the image in a VM and perform a global package update or something.
You could create an image of your workstation after you've installed & configured everything. Then when your computer crashes, you can just restore the image.
A (big) downside to this, is that you won't have any updates or changes you've made since you created the image.
Cobian Backup is a reliable backup system for Windows that will perform scheduled backups to an external drive.
You could create a hard drive image. Restoring from a backup image restores everything to the exact state that it was at the time you took the image.
Or you could create an installer that installs just about everything needed.
Since you expressed interest in rsync, here's an article that covers how to make a bootable backup image via rsync for Debian Linux:
http://www.debian-administration.org/articles/575
Rsync is fast and easy for local and network syncing and is by nature incremental.
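If you want multiple restore points rather than a single mirror, rsync's --link-dest gives cheap snapshot-style incremental backups by hard-linking unchanged files against the previous run; a rough sketch with hypothetical paths:

# each run creates a dated snapshot; unchanged files are hard links against
# the previous snapshot, so they cost almost no extra space
rsync -a --delete --link-dest=/backups/latest ~/ /backups/$(date +%F)/
ln -sfn /backups/$(date +%F) /backups/latest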
You can use RAID-1 for that. It's the synchronization kind of solution, though, not the backup kind.
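On Linux (which the poster mentioned), a software RAID-1 can even span an internal disk and a USB disk with mdadm, although the USB link becomes the bottleneck and a disconnect will leave the array running degraded. A rough sketch with hypothetical, empty partitions:

# WARNING: this creates a new array and filesystem, destroying existing data
# on the named partitions -- the device names here are hypothetical
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb1
mkfs.ext3 /dev/md0
mount /dev/md0 /mnt/mirror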
I use RAID mirroring in conjunction with an external hard drive using Vista's system backup utility to backup the entire machine. That way I can easily fix a hard drive failure, but in the event my system becomes corrupted, I can restore from the E-SATA drive (which I only connect for backup).
Full disclosure: I've never had to restore the backup, so it's kind of like the airbag in your car; hopefully it works when you need it, but there's no way to be sure. Also, the backup process is manual (it can be automated) so I'm only as safe as the last backup.
You can use the linux "dd" command line utility to clone a hard drive.
Just boot from a linux cd, clone or restore your drive and reboot.
It works great for Windows/Mac drives too.
This will clone partition 1 of the first hard drive (/dev/sda) to partition 1 of the second drive (/dev/sdb)
dd if=/dev/sda1 of=/dev/sdb1
This will clone partition 1 of the first hard drive to a FILE on the second drive.
dd if=/dev/sda1 of=/media/drive2/backup/2009-02-25.iso
Simply swap the values for if= and of= to restore the drive.
If you boot from the Ubuntu live CD it will automount your USB drives making it easy to perform the backup/restore with external drive(s).
CAUTION: Verify the identity of your drives BEFORE running the above commands. It's easy to overwrite the wrong drive if you're not careful.
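To check which device is which before running dd, listing the drives first is the usual safeguard:

# list all disks and partitions with sizes so you can confirm sda vs sdb
sudo fdisk -l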
I guess this is not exactly what you are looking for, but I just document everything I install and configure on a machine. Google Docs lets me do this from anywhere and keeps the document intact when the machine crashes.
A good step-by-step document usually reduces the recovery time to a day or so.
If you use a Mac, just plug in an external hard drive and Time Machine will do the rest, creating a complete image of your machine on the schedule you set. I restored from a Time Machine image when I swapped out the hard drive in my MacBook Pro and it worked like a charm.
One other option that a couple of guys use at my company is to have their development environment on a large Linux server. They just use their local machines to run an NX client to access the remote desktop (NX is much faster than VNC) - this has the advantages of fast performance, automatic backup of their files on the server, and the fact that they're developing on the same hardware that our customers use.
No matter what solution you use, it is always a good idea to have a secondary backup, too. Secondary backup should be off-site and include your essential work (source code, important docs). In case something happens to your main site (fire at the office, somebody breaks in and steals all your hardware, etc.), you would still be able to recover, eventually.
There are many online backup solutions. You could just get remote storage from a reliable provider (e.g. Amazon S3) and sync your work on a daily basis. The exact tooling depends on the type of access you get, but rsync is probably the tool you would use for that.
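One caveat: rsync itself only talks to rsync or SSH endpoints, so for S3 specifically a dedicated sync client plays the same role. A rough sketch with s3cmd (the bucket name and path are hypothetical):

# uploads only new/changed files, roughly analogous to rsync
s3cmd sync ~/work/ s3://my-offsite-backup/work/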

Best Dual HD Set up for Development

I've got a machine I'm going to be using for development, and it has two 7200 RPM 160 GB SATA HDs in it.
The information I've found on the net so far seems a bit conflicting about which things (OS, swap files, programs, solution/source code/other data) I should install on how many partitions on which drives to get the most benefit from this setup.
Some people suggest having a separate partition for the OS and/or Swap, some don't bother. Some people say the programs should be on the same physical drive as the OS with the data on the other, some the other way around. Same with the Swap and the OS.
I'm going to be installing Vista 64 bit as my OS and regularly using Visual Studio 2008, VMWare Workstation, SQL Server management studio, etc (pretty standard dev tools).
So I'm asking you--how would you do it?
If the drives support RAID configurations in your BIOS, you should do one of the following:
RAID 1 (Mirror) - Since this is a dev machine this will give you the fault tolerance and peace of mind that your code is safe (and the environment since they are such a pain to put together). You get better performance on reads because it can read from both/either drive. You don't get any performance boost on writes though.
RAID 0 - No fault tolerance here, but this is the fastest configuration because you read and write off both drives. Great if you just want as fast as possible performance and you know your code is safe elsewhere (source control) anyway.
Don't worry about multiple partitions or OS/data configs, because on a dev machine you sort of need it all anyway, and you shouldn't be running heavy multi-user databases or anything server-like.
If your BIOS doesn't support RAID configurations, however, then you might consider doing the OS/Data split over the two drives just to balance out their use (but as you mentioned, keep the programs on the system drive because it will help with caching). Up to you where to put the swap file (OS will give you dump files, but the data drive is probably less utilized).
If they're both going through the same disk controller, there's not going to be much difference performance-wise no matter which way you do it. If you're going to be running lots of VMs, I would split one drive for OS and swap / programs and data, then keep all the VMs on the other drive.
Having all the VMs on an independent drive would let you move that drive to another machine seamlessly if the host fails, or if you upgrade.
Mark one drive as being your warehouse, put all of your source code, data, assets, etc. on there and back it up regularly. You'll want this to be stable and easy to recover. You can even switch My Documents to live here if wanted.
The other drive should contain the OS, drivers, and all applications. This makes it easy and secure to wipe the drive and reinstall the OS every 18-24 months as you tend to have to do with Windows.
If you want to improve performance, some say put the swap on the warehouse drive. This will increase OS performance, but will decrease the life of the drive.
In reality it all depends on your goals. If you need more performance then you even out the activity level. If you need more security then you use RAID and mirror it. My mix provides for easy maintenance with a reasonable level of data security and minimal bit rot problems.
Your most active files will be the registry, page file, and running applications. If you're doing lots of data crunching then those files will be very active as well.
If 160 GB of total capacity will cover your needs (plenty of space for the OS, applications, and source code; it just depends on what else you plan to put on it), then I would suggest mirroring the drives in a RAID 1, unless you have a server the data is backed up to, an external hard drive, an online backup solution, or some other means of keeping a copy of the data on more than one physical drive.
If you need to use all of the drive capacity, I would suggest using the first drive for the OS and applications and the second drive for data, purely because if you change computers at some point, the OS on the first drive doesn't do you much good and most applications would have to be reinstalled, but you could take the entire data drive with you.
As for splitting off the OS, a big downfall is not giving that partition enough space, so that eventually you need partitioning software to steal some space from the other partition on the drive. It never seems to fail: you allocate a certain amount of space for the OS partition, right after the install you have several gigs of free space so you think you're fine, but as time goes by things build up on that partition and you run out of space.
With that in mind, I still typically do use an OS partition, as it is useful when reloading a system: you can format that partition, blowing away the OS, but keep the rest of your data. Ways to keep the space from building up too fast are to change the location of your My Documents folder and to change environment variables such as TEMP and TMP. However, some things just refuse to put their data anywhere besides the system partition. I used to use 10 GB; these days I go for 20 GB.
Splitting off your swap space can be useful for keeping drive fragmentation down when you let the swap file grow and shrink as needed. Again, though, the issue is guessing how much swap you need, which depends a lot on the amount of memory you have and how much you will be running at one time.
For the posters suggesting RAID - it's probably OK at 160GB, but I'd hesitate for anything larger. Soft errors in the drives reduce the overall reliability of the RAID. See these articles for the details:
http://alumnit.ca/~apenwarr/log/?m=200809#08
http://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable/
You can't believe everything you read on the internet, but the reasoning makes sense to me.
Sorry I wasn't actually able to answer your question.
I usually run a box with two drives. One for the OS, swap, typical programs and applications, and one for VMs, "big" apps (e.g., Adobe CS suite, anything that hits the disk a lot on startup, basically).
But I also run a cheap fileserver (just an old machine with a coupla hundred gigs of disk space in RAID1), that I use to store anything related to my various projects. I find this is a much nicer solution than storing everything on my main dev box, doesn't cost much, gives me somewhere to run a webserver, my personal version control, etc.
Although I admit, it really isn't doing much I couldn't do on my machine. I find it's a nice solution as it helps prevent me from spreading stuff around my workstation's filesystem at random, by forcing me to keep all my work in one place where it can be easily backed up, copied elsewhere, etc. I can leave it on all night without huge power bills (it uses <50 W under load), so it can back itself up to a remote site with a little script, and I can connect to it from outside via SSH (so I can always SCP anything I need).
But really the most important benefit is that I store nothing of any value on my workstation box (at least nothing that isn't also on the server). That means if it breaks, or if I want to use my laptop, etc. everything is always accessible.
I would put the OS and all the applications on the first disk (1 partition). Then, put the data from the SQL server (and any other overflow data) on the second disk (1 partition). This is how I'd set up a machine without any other details about what you're building. Also make sure you have a backup so you don't lose work. It might even be worth it to mirror the two drives (if you have RAID capability) so you don't lose any progress if/when one of them fails. Also, backup to an external disk daily. The RAID won't save you when you accidentally delete the wrong thing.
In general I'd try to split up things that are going to be doing a lot of I/O (such as when you have VS autosave going off fairly frequently). Think of it as a sort of I/O multithreading.
I've observed significant speedups by putting my virtual machines on a separate disk. Whenever Windows is doing something stupid in the VM (e.g., indexing yet again), it doesn't thrash my Mac's disk quite so badly.
Another issue is that many tools (Visual Studio comes to mind) break in frustrating ways when bits of them are on the non-primary disk.
Use your second disk for big random things.