ArXiv replication brainstorming - pdf

The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.

The full PDF content is in the Amazon cloud.
While there are > 600k papers on arXiv, the total size of the PDFs is < 1/2 TB:
http://arxiv.org/help/bulk_data_s3
T.

arXiv recommends squid in httpd accelerator mode for precisely this purpose. Any particular reason why this is not good enough?

My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.

Related

Is ManifoldCF a good option for Google Drive indexing?

I am using the open-source Apache ManifoldCF project to index documents from Google Drive into my Solr instance. I have often found it quite inconsistent in indexing the data. It also takes a long time for even a small number of documents to show up in Solr. Do you really think it's a good option for indexing Google Drive?
It is currently a bit on the slow side, due to response times and throttling constraints on the Google Drive side. This limit can probably be relieved if you buy additional bandwidth from Google. With the current setup, if you are looking to index a large set of documents in Google Drive, it may not be as quick as you expect.
ManifoldCF is good for crawling through a file system. You can go for Apache Nutch if you are interested in web crawling.
Yes, ManifoldCF does take a lot of time to reflect a small number of documents. It also has very little documentation. However, you can join the mailing list, where you can ask questions of the lead developer, "Karl". He is very helpful and usually answers within a few hours.
P.S.: I worked with ManifoldCF on a project for a span of 10 months.

PDF generation performance

I can't find enough data about PDF generation performance. I'm planning to build a system, and one of its features is generating PDFs: mostly simple ones of about 3-5 pages with only text and tables, and occasionally a logo.
What's bothering me is the requirement to support high user traffic (about 2500 requests per second).
Do you know of any tools (preferably in Java) that are fast and reliable enough to serve that many users as quickly as possible? How long would it take to serve this many people on a single, average machine? I would appreciate any information or experience on this topic.
You almost certainly have to execute some tests with your typical workload on your typical machine. This is probably the only way you can evaluate whether any tools will be able to do what you need.
2500 requests per second is a non-trivial requirement, so you are right to be concerned. If that 2500/sec is a sustained load and each request has to produce the 3-5 page PDF, you simply might not be able to keep up on a "single average machine". It's not only processing power you'll have to consider, but also memory and IO performance.
From experience iText is fast and Docmosis has some built-in facilities to distribute load to other hosts. I've seen both working stably under load. Be careful with memory management when you have that many documents on the fly - if you fall behind you might "blow up" no matter what document engine you use.
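To put rough numbers on the 2500 requests per second figure, here is a back-of-the-envelope sketch. The per-worker rate and the worker count are made-up placeholders, not measurements of iText or Docmosis; you would replace them with results from your own load test.

```python
import math

def machines_needed(requests_per_sec, pdfs_per_worker_per_sec, workers_per_machine):
    """Estimate how many machines a sustained load needs.

    One machine sustains workers_per_machine * pdfs_per_worker_per_sec
    documents per second; round the required count up to whole machines.
    """
    per_machine = workers_per_machine * pdfs_per_worker_per_sec
    return math.ceil(requests_per_sec / per_machine)

# Hypothetical numbers: each worker renders 20 PDFs per second (50 ms each),
# and a machine runs 8 workers in parallel.
print(machines_needed(2500, 20, 8))  # -> 16
```

This ignores memory pressure and IO, which the answer above warns about, so treat the result as a lower bound rather than a sizing guarantee.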

Ideas for a distributed processing project?

I am looking for a project idea in distributed processing on Unix-based systems. I wish to use only the C programming language. I have to finish the project in 4 months, and it's part of my coursework. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you were coming out with all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program which you can send new AIs to and then after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems which can be broken into smaller chunks.
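As an illustration of "broken into smaller chunks", here is a minimal sketch using the prime search idea from the list. It is written in Python for brevity even though the project would be in C, and everything runs in one process; the function names are my own, and in a real system each chunk would be sent to a different host.

```python
def split_range(start, stop, chunks):
    """Split [start, stop) into roughly equal contiguous sub-ranges."""
    step = max(1, -(-(stop - start) // chunks))  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]

def is_prime(n):
    """Trial division; slow but enough to show the idea."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(chunk):
    """The unit of work a single worker would receive."""
    lo, hi = chunk
    return sum(1 for n in range(lo, hi) if is_prime(n))

chunks = split_range(0, 100, 4)
# Each count_primes call is independent, so it could run on a separate machine.
partials = list(map(count_primes, chunks))
total = sum(partials)
print(partials, total)  # -> [9, 6, 6, 4] 25
```

The coordinator only needs to hand out (lo, hi) pairs and add up the answers, which is why this kind of problem distributes so easily.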
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out MapReduce for ideas! I was really motivated by this work, personally.
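For anyone who hasn't seen the MapReduce model, the classic word-count example looks roughly like this. It is a single-process sketch with function names of my own choosing; a real framework would run the map and reduce steps on many machines and do the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["a b a", "B c"]))
```

The map and reduce functions share no state, which is what lets the framework spread them across hosts.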
We used distributed processing here at work, but it's such a broad field..
Yeah.
Why not write a distributed compiler? You could then present an interface for people to compile things on the fly, and each job would be passed to your distributed compile net. Java is probably well-suited, and you'll get to do fun things, like being very mindful of security and so on.
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web,
look into B-Trees.
B-Trees and offspring/variants are the workhorse of the internet.
Google uses them extensively to index the web.
Database indexes/indices are B-Tree offspring/variants.
Every LAMP system uses a database and indexes/indices.
Also, they are used extensively in distributed VLDB (Very Large DataBases)
Perhaps you can improve existing distributed databases (Cassandra and HBase)
These are lofty goals, but for me, this would leave a lasting mark
in the way Web data is processed, indexed and stored.
Write a distributed, fault tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++.
It's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree

How do you handle off-site backups of terabytes of data?

I have terabytes of files and database dumps that I need to backup off-site.
What's the best way to accomplish this?
I'm currently weighing rsync to Amazon EBS or getting an appliance (e.g. Barracuda).
I called a buddy of mine, and he said he uses Bacula to get all the files onto a single disk, then backs that disk up to tape, then sends the tapes off to Iron Mountain.
Still waiting to hear back from other sysadmins I've contacted. Will post results here.
One common solution to offsite backups that is worth considering is performing the backup onsite and then physically transporting the backup elsewhere, either via secure snail mail or with a service designed for that purpose. If bandwidth is an issue, this may be more practical.
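To check whether bandwidth really is the issue, a quick estimate of the upload time helps. This is only a sketch: the 5 TB size, the 100 Mbit/s uplink and the 80% link efficiency are assumptions I picked for illustration, not figures from the question.

```python
def transfer_days(size_tb, uplink_mbps, efficiency=0.8):
    """Days needed to push size_tb terabytes over an uplink_mbps link.

    efficiency is the assumed fraction of the nominal link speed you
    actually get after protocol overhead and contention.
    """
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (uplink_mbps * 1e6 * efficiency)
    return seconds / 86400

# Hypothetical case: 5 TB over a 100 Mbit/s uplink.
print(round(transfer_days(5, 100), 1))
```

If the result is longer than your backup window, shipping physical media starts to look attractive.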
Instead of tapes, I use hard drives that I physically swap out every week. It is less expensive than tape equipment, and easier to plug into another system when necessary.
Back in the late 80s I worked at a place where every Monday we received a box of tapes of various sorts. We would do one set of weekly backups on the tapes in that box and send them off-site. Evidently they had two of these boxes: one that was in our office and another they kept locked up somewhere. Then we got an Exabyte drive, which had a single-tape capacity greater than that whole box of TK-50s, QIC-40s and mag tapes, and it was just simpler to send a single tape home with one of the managers every week.
I'm sure there are still off-site backup systems like that, but I find it easier to keep cycling a couple of 500 GB drives from my home system to my desk at work.
Why not encrypt it and actually upload it to a third-party vendor?
I am thinking of doing this with my data at home but have not found a vendor that will just let me do a dump...They all want to install client side apps...
Admittedly, I have not looked that hard...
We use a couple of solutions. We keep an off-site backup with another company. We also use several portable hard drives and swap them out each day. Neither solution really handles multiple terabytes of data; it's more like gigabytes.
In the future, however, we will probably be looking at going the tape route, or something else that is similarly permanent and storable. Terabytes of data is too much to transfer over the wire. When Blu-ray discs become reasonably priced and commercially viable, it may be a good idea to look into the 400 GB discs that were touted not long ago. Those would be extremely storage-friendly (both in the physical sense and the file size sense), and depending on the longevity stats, may keep for a while, similar to tapes.
I would recommend using a local SAN from a company like EMC that provides compressed, snapshot-based replication to remote facilities. It's an expensive solution, but it works.
http://www.emc.com/products/family/emc-centera-family.htm
Over the weekend, I've heard back from a couple of my sysadmin buddies.
It seems the best practice is to back up all machines to a large central disk, then back that disk up to tape, then send the tapes off-site (all have used Iron Mountain).
Tapes hold 400-800 GB and cost $30-$80 per tape.
A tape changer seems to go for $10k on up.
Not sure how much the off-site shipping costs.
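Plugging those figures into a quick calculation gives a feel for the up-front cost. This is a rough sketch: the 10 TB data size is an assumption (the question only says "terabytes"), and it covers one full backup set with no compression, no rotation and no shipping fees.

```python
import math

def tape_budget(data_tb, tape_capacity_gb, tape_price, changer_price=10_000):
    """Return (number of tapes, total cost) for one full backup set."""
    tapes = math.ceil(data_tb * 1000 / tape_capacity_gb)
    return tapes, tapes * tape_price + changer_price

# Hypothetical 10 TB data set, at the low and high ends of the ranges above.
print(tape_budget(10, 400, 30))  # -> (25, 10750)
print(tape_budget(10, 800, 80))  # -> (13, 11040)
```

Either way the changer dominates the cost; the tapes themselves are a small part of the bill.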
I'm scared of tape. I think it gives a false sense of data security. In my own experience from backing up dozens of terabytes across hundreds of tapes, we discovered that the data recovery rate after a few years fell to about 70%.
To be fair, that was with a now discontinued technology (AIT), but it pretty much put me off tape for life unless it sits on a 1" spool and is reassuringly expensive.
These days, multiple hard drives, multiple locations, and yes, a fallback to Amazon S3 or another cloud provider does no harm (apart from being a tad expensive).
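The case for multiple copies can be made with simple arithmetic. The sketch below treats the 70% figure as the chance that any one copy is readable, and assumes copies fail independently of each other; both are simplifications, since copies stored together tend to fail together.

```python
def loss_probability(per_copy_recovery_rate, copies):
    """Chance that every copy is unreadable, assuming independent failures."""
    return (1 - per_copy_recovery_rate) ** copies

for n in (1, 2, 3):
    print(n, round(loss_probability(0.7, n), 3))
```

Under those assumptions, each extra copy in a separate location cuts the chance of total loss by the same factor.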

Visualization data gathering for learning

I'm just starting to take an interest in visualization and I'd like to know where I can get my hands on some data, preferably real-world, to see what queries and graphics I can draw from it. It's more of a personal exercise to create some pretty-looking representations of that data.
After seeing this I wondered where the data came from and what else could be done with Wikipedia. Is there any way I can obtain data from, say, Wikipedia?
Also, could anyone recommend any good books? I don't trust the user reviews on the Amazon website :-)
You can download the raw Wikipedia data from http://download.wikimedia.org. There are many different views of the data available. The English Wikipedia is by far the largest database, and there isn't a current full dump available, but one is in progress. It will probably take months to finish and be available for download.
The most recent one was 18 GB compressed, which uncompressed to something like 2.5 TB.
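Given that expansion ratio, you probably don't want to decompress the dump to disk first. Here is a minimal sketch of streaming it instead, assuming the dump is a bz2-compressed XML file in which each article starts with a `<page>` tag on its own line; check the actual file you download, since the format and compression are my assumptions here.

```python
import bz2

def count_pages(dump_path):
    """Count <page> elements in a bz2-compressed XML dump.

    The file is decompressed on the fly and read line by line, so memory
    use stays small no matter how large the uncompressed data is.
    """
    pages = 0
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<page>" in line:
                pages += 1
    return pages
```

You would call it as `count_pages("dump.xml.bz2")`, where the file name is a placeholder. The same loop is a natural place to pull out titles or links for whatever you want to visualize.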
A fantastic book is The Visual Display of Quantitative Information by Edward Tufte.