GemStone-Linux-Apache-Seaside-Smalltalk.. how practical is 4GB? - smalltalk

I am really interested in GLASS, but the 4 GB limit for the free version has me concerned, especially when I consider the price for the next level ($7000 per year).
I know this can be subjective and variable, but can someone describe for me in everyday terms what 4 GB of GLASS will get you? Maybe a business example. 4 GB may get me more storage than I realize.. and I don't have to worry about it.
In my app, some messages have file attachments up to 5 MB in size. Can I conserve the 4 GB of Gemstone space by saving these attachments directly to files on the operating system, instead of inside Gemstone? I'm thinking yes.

I'm aware of one GLASS system that is ~944 MB and has 8.3 million objects, or ~118 bytes per object. At this rate, it can grow to over 36 million objects and stay under 4 GB.
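The estimate above can be re-derived with a few lines of arithmetic. This is only a sketch: it assumes "MB" and "GB" mean binary units (MiB, GiB), which the answer does not state, so the per-object figure comes out close to, not exactly at, the quoted ~118 bytes.

```python
# Figures quoted in the answer; the binary-unit interpretation is an assumption.
repository_bytes = 944 * 1024**2   # ~944 MB repository
object_count = 8_300_000           # 8.3 million objects

bytes_per_object = repository_bytes / object_count

limit_bytes = 4 * 1024**3          # 4 GB limit of the free version
max_objects = limit_bytes / bytes_per_object

print(f"{bytes_per_object:.0f} bytes per object on average")
print(f"{max_objects / 1e6:.1f} million objects fit under the limit")
```

The result scales linearly with average object size, so an application with larger objects (for example, inline attachments) would hit the limit correspondingly sooner.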
As to "attachments", I'd suggest that even in an RDBMS you should consider storing larger, static data in the file system and referencing it from the database. If you are building a web-based application, serving static content (JPG, CSS, etc.) should be done by your web server (e.g., Apache) rather than through the primary application.
By comparison, Oracle and Microsoft SQL Server have no-cost licenses for a 4-GB database.
What do you think would be a good price for the next level?

The 4 GB limit was removed a while ago. The free version is now limited to two cores and 2 GB of RAM.

4 GB is quite a decent-sized database. Not having used GemStone before, I can only speculate as to how efficient it is at storing objects, but having played with a few other similar object databases (MongoDB, db4o), I know that you're going to be able to fit several (5-10) million records before you even get close to that limit. In reality, how many records you can fit depends heavily on the type of data you're storing.
As an example, I was storing ~2 million listings and ~1 million transactions in a MySQL database, and the space used was < 1 GB. You have a small overhead from serializing a whole object, but not that much.
Files can definitely be stored on the file system.

4gb an issue... I guess you think you're building the next ebay!

Nowadays, there is no limit on the size of the repository. See the latest specs for GemStone

If you have multiple simultaneous users with attachments of 5 MB, you need a separate strategy for them anyway, as each takes about a twentieth of a second of bandwidth on a Gbit Ethernet network.
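As a rough check of that bandwidth figure, here is the raw line-rate arithmetic. It assumes decimal units (5 MB = 5,000,000 bytes) and ignores protocol overhead, so real transfers will take somewhat longer than the number it computes.

```python
# Assumptions: decimal megabytes, a fully available 1 Gbit/s link, no overhead.
attachment_bytes = 5 * 1000**2
link_bits_per_second = 1_000_000_000

seconds_per_attachment = attachment_bytes * 8 / link_bits_per_second
attachments_per_second = 1 / seconds_per_attachment

print(f"{seconds_per_attachment:.3f} s of link time per attachment")
print(f"at most {attachments_per_second:.0f} such attachments per second")
```

At raw line rate this comes to a few hundredths of a second per attachment, the same order of magnitude as the estimate above.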

Related

What is the best way to store highly parametrized entities?

OK, let me try to explain this in more detail.
I am developing a diagnostic system for airplanes. Let's imagine that each airplane has 6 to 8 on-board computers, and each computer has more than 200 different parameters. The diagnostic system receives all these parameters in a binary-formatted package; I then convert the data according to formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. New data must be handled every 10 - 20 seconds and persisted again.
We store the data for further analytic processing.
Requirements of storage:
support sharding and replication
fast read: support btree-indexing
NOSQL
fast write
So, I calculated the average disk or RAM usage per plane per day. It is about 10 - 20 MB of data. With an estimated load of 100 airplanes per day, that is up to 2 GB of data per day.
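The daily volume follows directly from the per-plane figures. A small sketch, assuming decimal units and adding a one-year projection that is not in the original question:

```python
# Figures from the question; the yearly projection is an added assumption.
planes_per_day = 100
mb_per_plane_low, mb_per_plane_high = 10, 20

daily_gb_low = planes_per_day * mb_per_plane_low / 1000
daily_gb_high = planes_per_day * mb_per_plane_high / 1000
yearly_gb_high = daily_gb_high * 365

print(f"{daily_gb_low:.0f}-{daily_gb_high:.0f} GB per day")
print(f"up to {yearly_gb_high:.0f} GB per year if nothing is purged")
```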
It seems that storing all the data in RAM (memcached-like stores: Redis, Membase) is not suitable (too expensive). However, I am now looking at MongoDB. Since it can use both RAM and disk, it seems to support all the listed requirements.
Please share your experience and advice.
There is a helpful article on NOSQL DBMS Comparison.
Also you may find information about the ranking and popularity of them, by category.
Given your requirements, it seems Apache Cassandra would be a candidate due to its linear scalability, column indexes, map/reduce, materialized views and powerful built-in caching.

RavenDB : Storage Size Problems

I'm doing some testing with RavenDB to store data based on an iphone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a guid. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records was over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes shouldn't the document be around 100 bytes? And given that the document database has the object schema built in let's say double that to 200 bytes. Given that calculation the database should be about two hundred megs with 1 million records. But it's ten times larger. Can someone help me where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
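The "around 100 bytes" estimate can be checked by packing the same payload as raw binary and comparing it with a JSON rendering. This is only a sketch: the field names (`Id`, `Coordinates`, `Latitude`, `Longitude`) and the coordinate values are hypothetical, and the JSON shown is plain Python output, not what RavenDB actually writes to disk.

```python
import json
import struct
import uuid

# Hypothetical sample: 5 GPS points (10 doubles) keyed by a GUID.
guid = uuid.uuid4()
points = [(40.7128 + i * 0.001, -74.0060 - i * 0.001) for i in range(5)]

# Raw binary: 10 little-endian doubles plus the 16-byte GUID.
flat = [value for point in points for value in point]
raw = struct.pack("<10d", *flat) + guid.bytes
raw_size = len(raw)

# The same data as a JSON document with named fields.
document = {
    "Id": str(guid),
    "Coordinates": [{"Latitude": lat, "Longitude": lon} for lat, lon in points],
}
json_size = len(json.dumps(document).encode("utf-8"))

print(f"raw binary: {raw_size} bytes, JSON text: {json_size} bytes")
```

Repeated field names and the textual encoding of numbers and GUIDs already account for a multiple of the raw size before any metadata or index is added.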
As a general principle, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.

How to store 15 x 100 million 32-byte records for sequential access?

I have 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple - (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (SQL/NoSQL) or files, one for each set of 100 million records. I only have a Core i7, 8 GB of RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then it may be overkill to use a database. If you start to need random lookups, replacing/deleting records or checking whether a new record is a duplicate of an older one, a database engine would make more sense.
For sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer - I would probably go for a custom binary format and access it with the help of memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think that it would perform better than any DB would, even on ordinary hardware; I did such data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, or whether it would justify the effort of developing the binary storage in the first place.
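A minimal sketch of such a hand-crafted binary format follows. It assumes the Date is stored as a 64-bit epoch-seconds integer, which makes each record exactly 32 bytes; the file name and helper names are hypothetical. Appends use ordinary buffered writes and the sequential scan goes through a read-only memory map.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: int64 key, int64 epoch seconds, two float64 values = 32 bytes.
RECORD = struct.Struct("<qqdd")

def append_records(path, records):
    """Append (key, epoch_seconds, a, b) tuples to the end of the file."""
    with open(path, "ab") as f:
        for record in records:
            f.write(RECORD.pack(*record))

def scan_records(path):
    """Yield every record in file order through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            last_start = len(view) - RECORD.size
            for offset in range(0, last_start + 1, RECORD.size):
                yield RECORD.unpack_from(view, offset)

path = os.path.join(tempfile.mkdtemp(), "series_00.bin")
append_records(path, [(1, 1_300_000_000, 1.5, 2.5), (2, 1_300_000_060, 1.6, 2.4)])
append_records(path, [(3, 1_300_000_120, 1.7, 2.3)])
rows = list(scan_records(path))
print(len(rows), "records,", os.path.getsize(path), "bytes on disk")
```

Because records are fixed-size, record number n always lives at byte offset n * 32, so a later need for positional lookups would not require changing the format.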
As soon as a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
By your requirements I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing long value you don't need to store it entirely. Instead, store the difference to the previous value (which is almost always going to be one). Then, use a standard compression algorithm/library to save on data size big time.
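The delta-plus-compression idea can be sketched as follows. The key range is made up for illustration, and zlib stands in for "a standard compression library"; any general-purpose compressor would serve the same purpose.

```python
import itertools
import struct
import zlib

# Hypothetical keys: strictly increasing 64-bit values, step of one.
keys = list(range(1_000_000, 1_010_000))

# Baseline: every key stored in full.
raw = struct.pack(f"<{len(keys)}q", *keys)

# Delta encoding: first key in full, then the difference to the previous key.
deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
delta_bytes = struct.pack(f"<{len(deltas)}q", *deltas)

compressed_raw = zlib.compress(raw)
compressed_delta = zlib.compress(delta_bytes)

# Decoding is a running sum over the stored differences.
decoded = struct.unpack(f"<{len(keys)}q", zlib.decompress(compressed_delta))
restored = list(itertools.accumulate(decoded))

print(len(raw), len(compressed_raw), len(compressed_delta))
```

The delta stream is almost entirely the same 8-byte pattern repeated, which is the best case for a dictionary-based compressor; keys with irregular gaps would compress less well.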
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
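The 48 GB figure is just the record count times the record size, counted in decimal gigabytes and before any index or per-row overhead that a DBMS would add:

```python
# Figures from the question: 15 series of 100 million 32-byte records.
series = 15
records_per_series = 100_000_000
record_bytes = 32

total_records = series * records_per_series
total_bytes = total_records * record_bytes
ram_bytes = 8 * 1000**3  # the asker's 8 GB of RAM, taken as decimal GB

print(f"{total_records:,} records, {total_bytes / 1000**3:.0f} GB of raw data")
print(f"about {total_bytes / ram_bytes:.0f}x the available RAM")
```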
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

When should we store images in database?

I have a productList table with 4 columns. Now I have to store an image for each row, so I have two options:
Store the image in the database.
Save the images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
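The size thresholds in that synopsis can be turned into a small helper. This is an illustrative sketch, not something from the paper: it assumes "256K" and "1M" mean 256 KiB and 1 MiB, and it simply reports the middle band as undecided, since the synopsis says the answer there depends on the read:write ratio and overwrite rate.

```python
# Thresholds from the quoted synopsis; the binary-unit reading is an assumption.
SMALL_LIMIT = 256 * 1024
LARGE_LIMIT = 1024 * 1024

def suggest_storage(size_bytes):
    """Return a rough storage suggestion for one object of the given size."""
    if size_bytes < SMALL_LIMIT:
        return "database"
    if size_bytes > LARGE_LIMIT:
        return "filesystem"
    return "depends on read:write ratio and overwrite rate"

for size in (40 * 1024, 512 * 1024, 5 * 1024 * 1024):
    print(f"{size:>9} bytes -> {suggest_storage(size)}")
```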
It depends -
You can store images in the DB if you know that they won't grow in size very often. This has an advantage when you are deploying your systems or migrating to new servers: you don't have to worry about copying the images separately.
If the number of rows increases very frequently on that system, and the images get bulkier, then it's better to store them on the file system and keep a path in the database for later retrieval. This will also keep you on your toes when migrating servers, since you have to take care of copying the images from the file path separately.
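The second option (file on disk, path in the table) might look like the sketch below. It uses SQLite and a temporary directory purely for illustration; the column names and helper functions are hypothetical. Storing a path relative to a configurable base directory is one way to soften the migration concern mentioned above, since only the base directory setting changes when the files move.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical setup: in-memory database and a temporary image folder.
image_dir = Path(tempfile.mkdtemp())
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE productList (id INTEGER PRIMARY KEY, name TEXT, image_path TEXT)"
)

def add_product(name, image_bytes, filename):
    """Write the image to disk and store only its relative path in the table."""
    (image_dir / filename).write_bytes(image_bytes)
    cursor = db.execute(
        "INSERT INTO productList (name, image_path) VALUES (?, ?)",
        (name, filename),
    )
    db.commit()
    return cursor.lastrowid

def load_image(product_id):
    """Look up the stored path and read the image back from disk."""
    row = db.execute(
        "SELECT image_path FROM productList WHERE id = ?", (product_id,)
    ).fetchone()
    return (image_dir / row[0]).read_bytes()

product_id = add_product("widget", b"\x89PNG fake image bytes", "widget.png")
image = load_image(product_id)
print(len(image), "bytes read back for product", product_id)
```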

database index and memory usage

Suppose I have a table that stores 100 million records of strings of varying sizes, up to 20 characters, in one column. I need to index this column, and I only have a machine with 2 GB of RAM. Is this sufficient to perform such a task? Is MySQL a recommended DB engine for this storage?
Databases are generally designed in a way that allows them to work with more data than you have available RAM. Giving it more working memory will speed things up, but it should be able to build the index and perform searches on it just fine.
If you have 2 GB of main memory, then yes, you should be able to build the index without any problems; virtual memory is a wonderful thing, and the DBMS may well arrange to spill data to disk as it goes.
If you only have 2 GB of disk space, you don't have enough space for the data and the index.
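A back-of-the-envelope estimate shows why 2 GB of disk would be tight. The model here is a deliberate simplification and not how any particular DBMS lays out its files: it assumes one byte per character, an index entry that is a copy of the key plus a row pointer (8 bytes, an assumed value), and no page or row overhead.

```python
def estimate_bytes(rows, avg_key_bytes, pointer_bytes=8):
    """Return (data_bytes, index_bytes) under the simplified model above."""
    data = rows * avg_key_bytes
    index = rows * (avg_key_bytes + pointer_bytes)
    return data, index

rows = 100_000_000
for avg_len in (10, 20):
    data, index = estimate_bytes(rows, avg_len)
    total_gb = (data + index) / 1000**3
    print(f"avg {avg_len} chars: data + index is roughly {total_gb:.1f} GB")
```

Real engines add per-row and per-page overhead on top of this, so the figures are better read as a lower bound.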
To no-one's surprise, it is 2 GB of main memory, not 2 GB of disk (that comment was mainly in jest - but these days, if someone says 256 GB, it is not clear whether they're referring to disk space or main memory; it could be either).
Yes, if the DBMS cannot create the index within that constraint, it is not worthy of being termed a DBMS.
MySQL probably can do the job. It isn't what I'd recommend, but I'm very biassed in this area as a result of being one of the developers of an alternative (commercial) DBMS. We don't have enough information about your budget etc to be able to advise reliably.