Where can I find papers/optimization concepts about email server/email storage?

I would like to know about the optimization techniques used in email servers/storage. Where can I get this information? I understand that Gmail and Outlook are not open source, but the way they store emails on the server side is a problem that researchers or programmers may well have dealt with already. Has such work been published somewhere? I am not concerned about how email is sent/received, MTAs, etc., only about the way it is stored. Wikipedia talks only about transfer protocols and has nothing related to storage. Please point me to some articles.

E-mail can be stored in databases quite efficiently. Alternatively, you can store it in the file system (i.e. on disk) or in a virtual file system that supports storing metadata. We recently published an article on storing data in different storages.
Outlook uses custom storage, similar to a virtual file system.


How to create a CDN to store and serve images and videos?

We have a requirement to store and retrieve content (audio, video, images) quickly. We are not allowed to use commercial providers like AWS S3 etc.
Any suggestions on how to go about it? The challenges I foresee are:
a) Storage
b) Fast Retrieval
c) Caching
Would Cassandra help with the above?
This is a very typical use case for Cassandra for things like streaming services or media-sharing social apps.
The difference is that the media files are saved in an object store and only the metadata (such as URL of the media file) is stored in Cassandra so you can retrieve information about where the media is stored really quickly.
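The split described above can be sketched roughly as follows. This is only an illustration: a plain dict stands in for the Cassandra metadata table and a local directory stands in for the object store. The class name, field names and content-addressed path layout are assumptions made for the example, not part of any Cassandra API.

```python
import hashlib
import tempfile
from pathlib import Path


class MediaStore:
    """Toy model: media bytes go to an object store, metadata to an index.

    The dict below is a stand-in for a Cassandra table keyed by media id;
    it is NOT the Cassandra driver API. The directory is a stand-in for a
    self-hosted object store.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.metadata: dict[str, dict] = {}

    def put(self, media_id: str, data: bytes, content_type: str) -> dict:
        # Content-addressed path (hypothetical layout): identical files share a blob.
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest[:2] / digest
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        record = {
            "path": str(path),
            "size": len(data),
            "content_type": content_type,
            "sha256": digest,
        }
        # Only this small record would live in the database.
        self.metadata[media_id] = record
        return record

    def get(self, media_id: str) -> bytes:
        # Fast metadata lookup first, then a read from the object store.
        record = self.metadata[media_id]
        return Path(record["path"]).read_bytes()


store = MediaStore(Path(tempfile.mkdtemp()))
store.put("avatar-1", b"\x89PNG...", "image/png")
print(store.metadata["avatar-1"]["size"])
```

The point of the design is that the database row stays tiny (a path or URL plus a few attributes), so lookups stay fast regardless of how large the media files are.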
As a side note, I wanted to warn you that others will likely vote to close your question because it is soliciting opinions vs a specific software issue. Cheers!

How long should it take to return a 200 MB blob from SQL?

I have SQL Server 2008 R2 supporting a SharePoint 2010 environment. Some of the files will be stupidly large (e.g. 200 MB). While Remote Blob Storage will be added to our platform in the future, I want to understand what reasonable performance might look like in getting a 200 MB file to the user.
The network between the SharePoint WFE is only one part. Simply reading the blob from disk and passing it through the SharePoint layer MUST take some time, but I have no idea how to calculate this (or what additional information people need to help out).
Any suggestions?
That's a very complex question and requires knowledge of the environment in which you are working. The network, as you rightly say, is one aspect, but there are many others: traffic congestion, QoS, SQL Server versions, setup, hardware, etc. Then there are issues with how the Web Front Ends are handing off the data, the HTTP pipe to the user, the browser in use, and so on.
Have a look at installing the Developer Dashboard for SharePoint 2010 and you'll be able to see all of the steps in fetching and delivering files and how long each one takes. You'll be quite surprised at how detailed the path is.
SharePoint 2010 Developer Dashboard
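As a rough back-of-the-envelope complement to measuring, the transfer time can be bounded from the throughput of each stage. The sketch below is an idealized model, and every rate in it is an assumed placeholder to be replaced with measured figures: if the stages stream concurrently, the slowest one dominates; if each stage buffers the whole file before passing it on, the times add up.

```python
def transfer_seconds(size_mb: float, stage_mb_per_s: dict[str, float],
                     pipelined: bool = True) -> float:
    """Estimate the time to move a file through several stages.

    pipelined=True  -> stages stream concurrently; the slowest stage dominates.
    pipelined=False -> each stage finishes before the next starts; times add up.
    """
    if pipelined:
        return size_mb / min(stage_mb_per_s.values())
    return sum(size_mb / rate for rate in stage_mb_per_s.values())


# Assumed, illustrative rates in MB/s (not measurements of any real farm).
rates = {
    "sql_disk_read": 100.0,   # sequential read from the database disks
    "sql_to_wfe_lan": 125.0,  # roughly a 1 Gbit/s link
    "wfe_to_client": 12.5,    # roughly a 100 Mbit/s client link
}

print(transfer_seconds(200, rates))                   # best case: 16.0 seconds
print(transfer_seconds(200, rates, pipelined=False))  # fully buffered case
```

Real numbers will sit somewhere between or above these two figures once congestion, protocol overhead and concurrent users are taken into account.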
Regardless of the large size, you should consider activating the BlobCache feature if your large content is currently stored in a document library.
That will put it on your WFEs after first access, deliver it with proper expiration headers and completely cut the load from the SQL Server. Imagine 20 concurrent users accessing a 200 MB file. If it is not in a BlobCache, your farm will have a hard time swallowing that useless load.
The first access will take longer than in your test scenario, where you request the file as a single user, but any further access will be as fast as IIS 7 is able to deliver it, within the network capacity available to your clients.
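The effect described above is that of a read-through cache. The following is a generic, minimal sketch of that idea, not of how SharePoint's BlobCache is actually implemented; the class and the `fetch_from_sql` function are hypothetical names used only for illustration.

```python
from typing import Callable


class ReadThroughCache:
    """Serve repeated reads from memory; hit the backend only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: dict[str, bytes] = {}
        self.backend_reads = 0

    def get(self, key: str) -> bytes:
        if key not in self._cache:
            # First access: the slow path goes all the way to the database.
            self.backend_reads += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def fetch_from_sql(key: str) -> bytes:
    # Hypothetical stand-in for reading the blob from the content database.
    return b"x" * 1024


cache = ReadThroughCache(fetch_from_sql)
for _ in range(20):  # 20 requests for the same document
    cache.get("/docs/big-file.zip")
print(cache.backend_reads)  # 1
```

A real cache also needs eviction, invalidation when the document changes, and protection against many simultaneous misses for the same key; those are left out here.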
Hope it helped.

Html5 local datastore, and sync across devices

I am building a full featured web application. Naturally, you can save when you are in 'offline' mode to the local datastore. I want to be able to sync across devices, so people can work on one machine, save, then get on another machine and load their stuff.
The questions are:
1) Is it a bad idea to store JSON on the server? Why parse the JSON on the server into model objects when it is just going to be passed back to the (other) client(s) as JSON?
2) I'm not sure whether I would want to try a NoSQL technology for this. I am not breaking the JSON down; for now the only relationships in the db would be from a user account to their entries. Other than the user data, the domain model would be a String, which is the JSON. Advice welcome.
In theory, in the future I might want to do some processing on the server or set up more complicated relationships. In other words, right now I would just be saving the JSON, but in the future I might want a more traditional relational system. Would a NoSQL approach get in the way of this?
3) Are there any security concerns with this? JS injection for example? In theory, for this use case, the user doesn't get to enter anything, at least right now.
Thank you in advance.
EDIT - Thanks for the answers. I chose the answer I did because it went into the most detail on the advantages and disadvantages of NoSQL.
It's not a bad idea at all to store JSON on the server, especially if you go with a NoSQL solution like MongoDB or CouchDB. Both use JSON as their native format (MongoDB actually uses BSON, but it's quite similar).
NoSQL approach: assuming CouchDB as the storage engine
Baked-in replication and concurrency handling
Very simple REST API; talk to the database over HTTP.
Store data as JSON natively and not in blobs or text fields
Powerful View/Query engine that will allow you to continue to grow the complexity of your documents
Offline mode. You can talk to CouchDB directly using JavaScript and have the entire app continue to run on the client if the internet isn't available.
Make sure you're parsing the JSON documents with the browser's JSON.parse or a JavaScript library that is safe (json2.js).
I think the reason I'd suggest going with noSQL here, CouchDB in particular, is that it's going to handle all of the hard stuff for you. Replication is going to be a snap to setup. You won't have to worry about concurrency, etc.
That said, I don't know what kind of App you're building. I don't know what your relationship is going to be to the clients and how easy it'll be to get them to put CouchDB on their machines.
CouchDB # Apache
CouchDB the definitive guide
After looking at the app, I don't think CouchDB will be a good client-side option, as you're not going to require folks to install a database engine to play sudoku. That said, I still think it'd be a great server-side option. If you wanted to sync the server CouchDB instance with the client, you could use something like BrowserCouch, which is a JavaScript implementation of CouchDB for local storage.
If most of your processing is going to be done on the client side using JavaScript, I don't see any problem in storing JSON directly on the server.
If you just want to play around with new technologies, you're most welcome to try something different, but for most applications, there isn't a real reason to depart from traditional databases, and SQL makes life simple.
You're safe as long as you use the standard JSON.parse function to parse JSON strings - some browsers (Firefox 3.5 and above, for example) already have a native version, while Crockford's json2.js can replicate this functionality in others.
Just read your post and I have to say I quite like your approach, it heralds the way many web applications will probably work in the future, with both an element of local storage (for disconnected state) and online storage (the master database - to save all customers records in one place and synch to other client devices).
Here are my answers:
1) Storing JSON on server: I'm not sure I would store the objects as JSON. It's possible to do so if your application is quite simple; however, this will hamper efforts to use the data (running reports and emailing them from a batch job, for example). I would prefer to use JSON for TRANSFERRING the information and a SQL database for storing it.
2) NoSQL approach: I think you've answered your own question there. My preferred approach would be to set up a SQL database now (if the extra resources needed are not a problem); that way you'll save yourself the work of setting up a data access layer for NoSQL that you will probably have to remove in the future. SQLite is a good choice if you don't want a fully featured RDBMS.
If writing a schema is too much hassle and you still want to save JSON on the server, then you can hack together a JSON object management system with a single table and some parsing on the server side to return relevant records. Doing this will be easier and require less permission management than saving/deleting files.
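A minimal sketch of the single-table idea, using Python's built-in sqlite3 module. The table name, column names and helper functions are assumptions chosen for illustration; the JSON is stored as an opaque text column keyed by user and entry.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with one table: one row per saved JSON document."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entries (
               user_id   TEXT NOT NULL,
               entry_key TEXT NOT NULL,
               body      TEXT NOT NULL,   -- the JSON document, stored as text
               PRIMARY KEY (user_id, entry_key)
           )"""
    )
    return conn


def save_entry(conn: sqlite3.Connection, user_id: str, key: str, obj) -> None:
    # Parameterized query: user-supplied values never become part of the SQL text.
    conn.execute(
        "INSERT OR REPLACE INTO entries (user_id, entry_key, body) VALUES (?, ?, ?)",
        (user_id, key, json.dumps(obj)),
    )
    conn.commit()


def load_entries(conn: sqlite3.Connection, user_id: str) -> dict:
    rows = conn.execute(
        "SELECT entry_key, body FROM entries WHERE user_id = ?", (user_id,)
    )
    return {key: json.loads(body) for key, body in rows}


conn = open_store()
save_entry(conn, "alice", "game-1", {"level": 3, "moves": [1, 5, 9]})
print(load_entries(conn, "alice"))
```

If more structure is needed later, individual fields can be promoted from the JSON body into real columns without changing how clients talk to the server.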
3) Security: You mentioned there is no user input at the moment:
"for this use case, the user doesn't get to enter anything"
However, at the beginning of the question you also mentioned that the user can
"work on one machine, save, then get on another machine and load their stuff"
If this is the case, then your application will be storing user data. It doesn't matter that you haven't provided a nice GUI for them to do so; you will have to worry about security from more than one standpoint, and JSON.parse or similar tools only solve half the problem (the client side).
Basically, you will also have to check the contents of your POST request on the server to determine whether the data being sent is valid and realistic. The integrity of the JSON object (or any data you are trying to save) will need to be validated on the server (using PHP or another similar language) BEFORE saving to your data store. This is because someone can easily bypass your JavaScript-layer "security" and tamper with the POST request, even if you didn't intend them to do so, and then your application will be sending the evil input out to the client anyway.
If you have the server side of things tidied up, then JSON.parse becomes a bit redundant in terms of preventing JS injection. Still, it's not bad to have the extra layer, especially if you are relying on remote website APIs to get some of your data.
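A server-side check along these lines might look like the sketch below. It is written in Python rather than PHP purely for illustration, and the payload shape (a `name` string plus 81 `cells` holding digits 0-9, as for a sudoku grid) and the size limit are hypothetical assumptions; the real rules depend on what the application actually saves.

```python
import json

MAX_BYTES = 16_384  # assumed upper bound for one saved document


def validate_save(raw: bytes) -> dict:
    """Return a cleaned document, or raise ValueError if the payload is suspect."""
    if len(raw) > MAX_BYTES:
        raise ValueError("payload too large")
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        raise ValueError("payload is not valid UTF-8 JSON") from exc

    # Accept exactly the expected fields, nothing more and nothing less.
    if not isinstance(doc, dict) or set(doc) != {"name", "cells"}:
        raise ValueError("unexpected fields")

    name, cells = doc["name"], doc["cells"]
    if not isinstance(name, str) or not 0 < len(name) <= 64:
        raise ValueError("bad name")
    if not isinstance(cells, list) or len(cells) != 81:
        raise ValueError("expected 81 cells")
    # type(...) is int also rejects booleans, which are ints in Python.
    if not all(type(c) is int and 0 <= c <= 9 for c in cells):
        raise ValueError("cells must be digits 0-9")

    # Rebuild the document from validated parts instead of echoing the input.
    return {"name": name, "cells": cells}
```

Validation like this limits what can be stored; text fields such as `name` still need to be escaped when they are later rendered into HTML on another client.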
Hope this is useful to you.

How do large sites(Google, Facebook, etc) propagate information to all servers in realtime?

I'm looking for some technologies to research. I'm amazed that you can go into [insert large site here]'s interface, update a setting and within seconds it's pushed out so it's live across the board. A good example of this is AdWords. If you go into AdWords and change a campaign, those settings are stored on the server with a unique id. The ad code calls the server with that id and the information (size, colors, etc.) is pulled up instantly to show the ad. How is it that Google can push that out to hundreds of thousands of servers so quickly? What type of db systems are they using?
Google has published research papers for its Google File System (or "BigFiles" as it was once known) and BigTable, both of which are used extensively in their services. Those would probably make good reading, in and of themselves and because they probably cite prior art.
You might want to read how Oracle has built RAC to propagate data across many DBs: http://download.oracle.com/docs/cd/B14117_01/server.101/b10727/ha_strea.htm
I know that Facebook uses peer-to-peer distribution to push updates to its servers.
The first server gets the update, then sends it to some others, which do the same thing, and so on until the update is on all of the servers.
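To see why this kind of fan-out is fast, here is an idealized sketch that counts forwarding rounds. It assumes every informed server forwards to `fanout` servers that have not yet received the update in each round (no duplicates, no failures), which is a simplification and not a description of Facebook's actual system.

```python
def rounds_to_reach(total_servers: int, fanout: int) -> int:
    """Rounds until every server has the update, in an idealized fan-out model."""
    if total_servers < 1 or fanout < 1:
        raise ValueError("total_servers and fanout must be positive")
    informed, rounds = 1, 0  # the first server already has the update
    while informed < total_servers:
        # Each informed server pushes to `fanout` not-yet-informed servers.
        informed += informed * fanout
        rounds += 1
    return rounds


print(rounds_to_reach(100_000, 3))  # 9
print(rounds_to_reach(100_000, 1))  # 17
```

Because the number of informed servers multiplies every round, the round count grows only logarithmically with the size of the fleet.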
I have been looking into similar pieces of information.
Look for "Structured Data".
Specifics: MojoDB, CouchDB.
Look for comparisons on the MojoDB website.
Facebook has made Cassandra (distributed database) open source. I think they and many others use it now.
Also look for Hadoop framework and Map/Reduce, as a matter of interest.

What does LDAP solve?

I've come into contact with LDAP in many projects I've been involved in but, truth be told, I don't really understand it. I thought it was just a person directory, but later I discovered that it can contain any objects in a hierarchical structure.
I installed OpenLDAP on my box and found many tutorials, but they only cover the installation.
What is LDAP? What are the scenarios where LDAP is the right choice? What are the LDAP concepts I should know for working with it? What are the advantages of LDAP? Is it used just because old applications used it? Is there a good doc anywhere on the internet explaining all these questions?
Complementing the answers, I found this link, which contains a quick start guide for an LDAP newbie like me.
What is LDAP? What are the scenarios where LDAP is the right choice?
At its core, LDAP is a protocol for accessing objects that are suitable for storage in a directory. Whether something is "suitable" is an entirely subjective determination that's left up to implementers, but typically this means collections of many objects that each have infrequently (or never) updated data, where each object has an obvious or canonical way to be looked up:
a phone book (look up by name or by phone number)
titles in a library (look up by title, author, etc.)
tenants in a building (look up by floor, suite, name, etc.)
and so on.
Note that LDAP itself is just a protocol and doesn't provide any actual storage -- in much the same way, HTTP doesn't imply anything about whether you're using Apache, Jetty, Tomcat, Mongrel, et al. as a web server. (One problem with LDAP in general is the confusing reuse of names to mean different things. Wikipedia has a good section on this.)
DITs (Directory Information Trees) are a hierarchical description scheme that lends itself to B-tree algorithms very nicely, resulting in tremendous search performance in most cases. Directory servers like OpenDS return indexed searches in microseconds, whereas RDBMS systems are typically much slower for this kind of lookup. Directory servers (often called LDAP servers) trade resources (RAM, CPU) for fast read response. RDBMS systems provide greater functionality in terms of managing the data in question. Need speed with few or zero updates, simplicity, and a small network protocol? Use a directory server. Need data management and mining capabilities, and/or a high rate of change in the database, with relational aspects defined between data? Use an RDBMS (MySQL is your best bet here).
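As a toy illustration of the hierarchy, the sketch below models a DIT as a dict keyed by the parsed distinguished name (DN). It is not how OpenDS or any real directory server stores data, and the DN parser is deliberately naive (it ignores escaped commas); the class name and sample entries are made up for the example.

```python
def parse_dn(dn: str) -> tuple[str, ...]:
    """Split a DN into lowercase RDNs. Naive: does not handle escaped commas."""
    return tuple(part.strip().lower() for part in dn.split(","))


class ToyDirectory:
    """In-memory stand-in for a directory: exact lookup plus subtree search."""

    def __init__(self) -> None:
        self.entries: dict[tuple[str, ...], dict] = {}

    def add(self, dn: str, **attrs: str) -> None:
        self.entries[parse_dn(dn)] = attrs

    def lookup(self, dn: str):
        # Read by full DN: a single hash lookup in this toy model.
        return self.entries.get(parse_dn(dn))

    def subtree(self, base_dn: str) -> list[str]:
        # An entry is under base_dn if its DN ends with the base's RDNs.
        base = parse_dn(base_dn)
        return sorted(
            ",".join(rdns)
            for rdns in self.entries
            if len(rdns) >= len(base) and rdns[len(rdns) - len(base):] == base
        )


d = ToyDirectory()
d.add("uid=alice,ou=people,dc=example,dc=com", mail="alice@example.com")
d.add("uid=bob,ou=people,dc=example,dc=com", mail="bob@example.com")
d.add("cn=printer1,ou=devices,dc=example,dc=com", location="floor 2")
print(d.lookup("uid=alice, ou=People, dc=example, dc=com"))
print(d.subtree("ou=people,dc=example,dc=com"))
```

The read-mostly trade-off shows up even here: lookups are cheap, while anything that reshapes the tree (renaming an `ou`, for instance) would mean rewriting many keys.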
LDAP directories offer very fast reads (often described as O(1), though indexed lookups are closer to O(log n)), in exchange for considerably slower writes. They are ideal for data that's accessed frequently but changed rarely: directories of people, machine names and addresses, and so on (hence the acronym: Lightweight Directory Access Protocol).
LDAP is the right choice where the pain of using a database that isn't relational, in terms of decreased developer familiarity and strange performance characteristics, is less than the gain of blindingly fast read access.
This link will explain LDAP http://blogs.oracle.com/raghuvir/entry/ldap
We use LDAP in our office for company-wide email address lookups. We use it as a single sign-on service for our internal apps as well.
One perspective I like to harp on is LDAP is an app on top of a persistence store and a database is a persistence store. Both can be used to store user information.
LDAP gives you a hierarchy which is harder to do in a database. You can make a hierarchy in a database but it's harder to do things like delegation (these rows belong to you only) or ACLs on rows. So pushing security problems out of the database is easier if you use LDAP for storing user identities. Trying to solve it in the database is weird.
At the same time, LDAP is terrible for reporting against (transform LDAP to a DB for reporting). Storing attributes deep in the tree that need to be searched quickly can be problematic for performance (don't do this, have a DB on the side or try to flatten the query by redesigning your DIT). Storing attributes all over the place in a really deep DIT is just bad LDAP or system design but sometimes it's unavoidable if you're tied to a vendor product or legacy app.
LDAP is just a protocol; the Wikipedia article explains it adequately: http://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol
It's a way to query an underlying organizational structure like Microsoft's Active Directory. You can use LDAP queries to get all kinds of information about users, use it for setting application rights, etc.
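To give a feel for what such a query looks like, the sketch below builds an LDAP search filter string in the textual form defined by RFC 4515, escaping the characters that are special inside filter values. It only builds the string and does not talk to a server; the helper names and the attribute values are illustrative assumptions.

```python
# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape user-supplied text so it cannot change the filter's structure."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def and_filter(**attrs: str) -> str:
    """Build an AND filter matching all given attribute=value pairs."""
    parts = "".join(
        f"({name}={escape_filter_value(value)})" for name, value in attrs.items()
    )
    return f"(&{parts})"


print(and_filter(objectClass="person", mail="alice@example.com"))
# (&(objectClass=person)(mail=alice@example.com))
print(and_filter(cn="Smith (admin)*"))
# (&(cn=Smith \28admin\29\2a))
```

Escaping matters for the same reason parameterized queries matter in SQL: a value such as `*` left unescaped would turn an exact match into a wildcard search.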
I am working part time and a full time student. My curriculum encourages (read requires) many group projects.
I have used OpenLDAP and phpLDAPadmin to control access to my Subversion and Mercurial repos, Trac projects, Hudson, etc. It wasn't easy to install, but the time saved in administration was a godsend.
If you have projects where you will have many groups of people who need to be able to use different resources, it is a good tool.
See this link, which explains LDAP in depth. For example, that documentation includes an explanatory image (source: dirsvcs at www.umich.edu).
LDAP is an access protocol; it only provides an API to the underlying technology for which you are trying to find applications: a directory service. OpenLDAP is one of the open source directory services; Sun has another implementation called OpenDS. Active Directory and Novell NDS are two others commonly seen in the field.
The directory can be used for storing information about any sort of resource, and the relationships between the resources - for example, rights of a user to a directory, a printer, or a network access device.
Is there a good doc anywhere on internet explaining all this questions?
IBM published an excellent Red Book about LDAP. The title is:
Understanding LDAP - Design and Implementation.
It can be downloaded from the previous link.
In one of my old workplaces we used LDAP as our primary user authentication system.
This in turn provided our various systems with information about users: which department they belonged to, where their home directories should be mounted, contact information, and employee management.
Other things, not necessarily controlled by LDAP but which we had set up to work through it, were SQL users, K4, Samba and email account generation.