Will a BLOB Column always consume the defined size even if less data is inserted? - sql

I need to store images in a DB2 BLOB field. The average image size is about 200 KB, but in rare cases there will be images of 2-4 MB. I don't want to reject these images, so I guess I'd define a BLOB(5M). Is this okay to do, or will this BLOB always consume the 5 MB even if most of it is unused?
What is the common way to choose the BLOB size if it is hard to find an average?

The BLOB will only use as much space as the data actually inserted requires. There is no overhead in defining a large maximum (think of it as a "constraint" rather than a physical allocation).
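For reference, here is a minimal sketch of what that looks like from Python, assuming the ibm_db driver and a hypothetical IMAGES table; the connection string, table name, and file are invented for the example. The BLOB(5M) declaration only caps what can be inserted per row, it does not preallocate 5 MB:

import ibm_db

# Hypothetical connection string; adjust for your environment.
conn = ibm_db.connect(
    "DATABASE=testdb;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;"
    "UID=db2inst1;PWD=secret", "", "")

# BLOB(5M) is an upper bound (a constraint), not a per-row allocation.
ibm_db.exec_immediate(conn, """
    CREATE TABLE images (
        id   INTEGER NOT NULL PRIMARY KEY,
        img  BLOB(5M)
    )""")

with open("photo.jpg", "rb") as f:   # a typical ~200 KB image
    data = f.read()

# Parameter binding details may vary with the driver version.
stmt = ibm_db.prepare(conn, "INSERT INTO images (id, img) VALUES (?, ?)")
ibm_db.execute(stmt, (1, data))      # only len(data) bytes are stored
ibm_db.close(conn)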

Related

For linearized PDF how to determine the length of the cross-reference stream in advance?

When generating a linearized PDF, a cross-reference table should be stored in the very beginning of the file. If it is a cross-reference stream, this means the content of the table will be compressed and the actual size of the cross-reference stream after compression is unpredictable.
So my question is:
How to determine the actual size of this cross-reference stream in advance?
If the actual size of the stream is unpredictable, then writing the object offsets into the stream and writing the stream into the file will shift the actual offsets of the following objects again, won't it? Am I missing something here?
Any hints are appreciated.
How to determine the actual size of this cross-reference stream in advance?
First of all, you don't. At least not exactly, for the reasons you describe yourself.
But an estimate suffices. Just add some bytes to the estimate and later on pad with whitespace. @VadimR pointed out that such padding can regularly be observed in linearized PDFs.
You can either use a rough estimate as in the QPDF source @VadimR referenced, or you can try for a better one.
You could, for example, make use of predictors:
At the time you eventually have to create the cross-reference streams, all PDF objects can already be serialized in the order you need, with the exception of the cross-reference streams and the linearization dictionary (which contains the final size of the PDF and some object offsets). Thus, you already know the differences between consecutive xref entry values for most of the entries.
If you use the Up predictor, you essentially only store those differences, so you already know most of the data to compress. Changes in a few entries won't change the compressed result too much, so this probably gives you a better estimate.
Furthermore, as the first cross-reference stream generally does not contain too many entries, you can also try compressing that stream multiple times for different numbers of reserved bytes.
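As a rough illustration of the estimate, here is a sketch in Python that applies the PNG Up predictor to fixed-width xref rows and deflates them, the way a cross-reference stream with /Predictor 12 is encoded; the entry values and field widths (/W [1 4 2]) are made up for the example:

import zlib

# Hypothetical xref entries: (type, offset, generation), widths /W [1 4 2].
entries = [(1, off, 0) for off in range(15, 500_000, 731)]

def row_bytes(entry):
    t, off, gen = entry
    return t.to_bytes(1, "big") + off.to_bytes(4, "big") + gen.to_bytes(2, "big")

def up_predicted(rows):
    """Apply the PNG Up predictor (filter type 2) to each fixed-width row."""
    out = bytearray()
    prev = bytes(len(rows[0]))
    for row in rows:
        out.append(2)  # PNG Up filter byte
        out.extend((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return bytes(out)

rows = [row_bytes(e) for e in entries]
estimate = len(zlib.compress(up_predicted(rows), 9))

# Reserve the estimate plus a safety margin; pad any gap with whitespace later.
reserved = estimate + 64
print(estimate, reserved)

Because the Up predictor turns mostly regular offsets into small, repetitive deltas, changing a handful of entries afterwards barely moves the compressed size, which is what makes the reserve-and-pad approach workable.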
PS: I have no idea what Adobe uses in its linearization code. And I don't know whether it makes sense to fight over a few bytes here; after all, linearization matters most for big documents, for which a few bytes more or less hardly count.

RavenDB : Storage Size Problems

I'm doing some testing with RavenDB to store data coming from an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records came to over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. By that calculation the database should be about two hundred megs with 1 million records, but it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principle, NoSQL databases aren't optimized for disk space; that's a traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, which saves you some bytes; this implementation detail is hidden from the client. Also, every document has two streams: the main document content and the associated metadata. This is very powerful, but it does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT-backed document store.
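To see how quickly the JSON representation alone adds up, here is a small back-of-the-envelope sketch; the property names, metadata keys, and document key format are illustrative, since the original document shape isn't shown:

import json
import uuid

# Hypothetical document shape: five GPS fixes, stored under a GUID-based key.
doc = {
    "Coordinates": [
        {"Latitude": 40.712775, "Longitude": -74.005973},
        {"Latitude": 40.712901, "Longitude": -74.006112},
        {"Latitude": 40.713044, "Longitude": -74.006258},
        {"Latitude": 40.713190, "Longitude": -74.006401},
        {"Latitude": 40.713337, "Longitude": -74.006549},
    ]
}
metadata = {
    "Raven-Entity-Name": "LocationTraces",
    "Raven-Clr-Type": "MyApp.Models.LocationTrace, MyApp",
}

key = "locationtraces/" + str(uuid.uuid4())
raw_binary = 10 * 8                           # ten doubles as raw bytes
as_stored = len(json.dumps(doc)) + len(json.dumps(metadata)) + len(key)
print(raw_binary, as_stored)                  # ~80 bytes vs several hundred

Property names, JSON punctuation, the document key, and the metadata stream are all paid per document, so a few hundred bytes for ten doubles is in line with what the format costs before the server's own bookkeeping and the indexes are even counted.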
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has this type of overhead. But keep in mind that disk space is cheap today. Relational databases were optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.

Does Core Data impose limits on the length of strings?

I was wondering if there are any limits on the length of strings stored using Core Data in iOS. (other than available RAM or disk space on the device)
I think you're more likely to hit the performance limits on an iOS device before you hit any storage limits in Core Data. You'll also be getting a performance hit from pulling in large chunks of data.
You are better off, both in performance and manageability, breaking up large blocks of text into smaller chunks.
That's from what I remember Marcus Zarra telling me, anyway.
Just to confirm: there are no specific limits in Core Data (not counting memory/disk space limitations). When using Core Data on iOS you are in almost every case using SQLite as the persistent store. Core Data stores String attributes as VARCHAR, and from SQLite's point of view:
SQLite does not enforce the length of a VARCHAR. You can declare a VARCHAR(10) and SQLite will be happy to store a 500-million character string there. And it will keep all 500-million characters intact. Your content is never truncated. SQLite understands the column type of "VARCHAR(N)" to be the same as "TEXT", regardless of the value of N.
...taken from SQLite's FAQ.
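You can verify this directly against SQLite itself with a small sketch using Python's built-in sqlite3 module and an in-memory database:

import sqlite3

con = sqlite3.connect(":memory:")
# Declared as VARCHAR(10), which SQLite treats simply as TEXT affinity.
con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body VARCHAR(10))")

long_text = "x" * 1_000_000          # far longer than the declared 10
con.execute("INSERT INTO notes (body) VALUES (?)", (long_text,))

(stored_len,) = con.execute("SELECT length(body) FROM notes").fetchone()
print(stored_len)                    # 1000000 -- nothing was truncated

Core Data's generated schema uses its own table and column names, but the length behaviour it inherits from SQLite is exactly this.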
It does not have a limit as far as I can tell, unless you assign one in the model file (there is a section for min length and max length).
I don't remember reading any limits in Core Data documentations, but remember that Core Data is just a framework on top of a real database, usually sqlite. I think it's safe to assume that the limits are dictated by the underlying DB.

Database or other method of storing and dynamically accessing HUGE binary objects

I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the files on disk, but create a DB-based index. Most filesystems get cranky or slow when you have more than ~10k files in a single folder. Your application can generate the filename and store the metadata in the DB, then organize the files by the generated name on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins or a sophisticated partitioned, incremental backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.); there is generally better support for these operations in a file system than in a DB.
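A minimal sketch of that layout, assuming fixed-width 1-byte samples and a small SQLite catalog (the table and column names are invented for the example):

import os
import sqlite3
import uuid

SAMPLE_SIZE = 1              # each sample is 8 bits, per the question
DATA_DIR = "recordings"
os.makedirs(DATA_DIR, exist_ok=True)

db = sqlite3.connect("catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS recordings (
                  id          TEXT PRIMARY KEY,  -- generated file name on disk
                  collected   TEXT,
                  sample_rate INTEGER,
                  owner       TEXT,
                  n_samples   INTEGER)""")

def add_recording(samples: bytes, collected, sample_rate, owner) -> str:
    """Write the raw samples to disk and index the metadata in the DB."""
    name = uuid.uuid4().hex
    with open(os.path.join(DATA_DIR, name), "wb") as f:
        f.write(samples)
    db.execute("INSERT INTO recordings VALUES (?, ?, ?, ?, ?)",
               (name, collected, sample_rate, owner, len(samples)))
    db.commit()
    return name

def read_samples(name: str, first: int, last: int) -> bytes:
    """Seek straight to the requested sample range; no DB blob involved."""
    with open(os.path.join(DATA_DIR, name), "rb") as f:
        f.seek(first * SAMPLE_SIZE)
        return f.read((last - first + 1) * SAMPLE_SIZE)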
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, if you know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column and, if needed, an additional metadata table describing the LOB contents so that you have some kind of keyword->(offset, length) relation available for partial loading of LOBs.
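For the Oracle case, a hedged sketch of partial LOB access using DBMS_LOB.SUBSTR from Python; the python-oracledb driver, the connection details, and the recordings/samples table and column names are assumptions, the point is only that the server returns just the requested slice:

import oracledb  # assumption: the python-oracledb driver is in use

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

def read_lob_slice(recording_id, offset, length):
    """Fetch `length` bytes of the BLOB starting at 1-based `offset`.

    DBMS_LOB.SUBSTR(lob, amount, offset) returns only the requested slice;
    it is limited to 32767 bytes per call, so for bigger slices you would
    work with a LOB locator instead.
    """
    cur.execute(
        "SELECT DBMS_LOB.SUBSTR(samples, :amt, :off) "
        "FROM recordings WHERE id = :id",
        amt=length, off=offset, id=recording_id)
    return cur.fetchone()[0]

chunk = read_lob_slice(42, offset=1_000, length=1_000)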
Somewhat echoing another post here: incremental backups (which you might want here) are not really practical with databases (OK, they can be possible, but in my experience they tend to come with a nasty price tag attached).
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples making up 200 GB, then each sample is ~100 KB and each recording is ~2 MB (to have 100,000 recordings per file, that's 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
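A sketch of that indexed-range idea, with SQLite standing in for the RDBMS and a schema invented for the example: store samples keyed by (recording_id, sample_id) and let the primary-key index serve the range.

import sqlite3

db = sqlite3.connect("samples.db")
db.execute("""CREATE TABLE IF NOT EXISTS samples (
                  recording_id INTEGER,
                  sample_id    INTEGER,
                  value        INTEGER,       -- one 8-bit sample
                  PRIMARY KEY (recording_id, sample_id))""")

def load_range(recording_id, first, last):
    """Pull only samples [first, last] into memory via the index."""
    rows = db.execute(
        "SELECT sample_id, value FROM samples "
        "WHERE recording_id = ? AND sample_id BETWEEN ? AND ? "
        "ORDER BY sample_id",
        (recording_id, first, last))
    return bytes(v for _, v in rows)

window = load_range(7, 1000, 2000)   # samples 1,000-2,000 only

One row per sample is the most literal reading; in practice you would likely group samples into fixed-size chunks per row to cut per-row overhead, but the range query keeps the same shape.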
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
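The DDL side of that is roughly the following, a sketch issued via pyodbc; it assumes FILESTREAM is already enabled on the instance, the database has a FILESTREAM filegroup, and the table/column names are invented:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;"
    "DATABASE=Recordings;Trusted_Connection=yes;")
cur = conn.cursor()

# FILESTREAM columns require a ROWGUIDCOL; the varbinary(MAX) data itself
# lives in NTFS under SQL Server's control, not in the data file.
cur.execute("""
    CREATE TABLE dbo.RecordingBlobs (
        Id   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Name NVARCHAR(260)    NOT NULL,
        Data VARBINARY(MAX)   FILESTREAM NULL
    )""")
conn.commit()

with open("recording.bin", "rb") as f:
    cur.execute("INSERT INTO dbo.RecordingBlobs (Name, Data) VALUES (?, ?)",
                "recording.bin", pyodbc.Binary(f.read()))
conn.commit()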
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that maps periodic nominal (decompressed) offsets to real (compressed) offsets. (This is just an optimization; you could "discover" the table of contents lazily as you fetch data.) Then the algorithm would look something like:
Given nominal offset n, look up the greatest real offset m whose nominal offset is less than or equal to n.
Read m-32k into a buffer (32k is the largest allowed back-reference distance in DEFLATE).
Begin DEFLATE algorithm at m. Count decompressed bytes until you get to n.
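Here is a sketch of those steps in Python, using only the standard zlib module; since plain zlib can't restart DEFLATE at an arbitrary offset, it instead saves copies of the decompressor's state at checkpoints during the one-time scan and resumes from the nearest one (chunk and checkpoint sizes are arbitrary choices):

import zlib

CHUNK = 64 * 1024            # compressed bytes fed per step (arbitrary)
CHECKPOINT_EVERY = 1 << 20   # checkpoint roughly every 1 MiB of output

def build_toc(comp: bytes):
    """One-time scan: remember (compressed pos, decompressed pos, state copy)."""
    toc, d, out_pos, next_mark = [], zlib.decompressobj(), 0, 0
    for in_pos in range(0, len(comp), CHUNK):
        if out_pos >= next_mark:
            toc.append((in_pos, out_pos, d.copy()))
            next_mark += CHECKPOINT_EVERY
        out_pos += len(d.decompress(comp[in_pos:in_pos + CHUNK]))
    return toc

def read_range(comp: bytes, toc, offset: int, length: int) -> bytes:
    """Return `length` decompressed bytes starting at decompressed `offset`."""
    # Step 1: greatest checkpoint at or before the requested (nominal) offset.
    in_pos, out_pos, d = max((e for e in toc if e[1] <= offset),
                             key=lambda e: e[1])
    d = d.copy()                     # never mutate the stored checkpoint
    needed = (offset - out_pos) + length
    out = bytearray()
    # Steps 2-3: resume decompression there and count bytes up to the target.
    while len(out) < needed and in_pos < len(comp):
        out += d.decompress(comp[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    start = offset - out_pos
    return bytes(out[start:start + length])

read_range(data, build_toc(data), n, k) then behaves like fetching k bytes at nominal offset n; caching recent results on top of this, as suggested below, is what keeps repeated access fast.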
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.