Options for storing large text blobs in/with an SQL database?

I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)

Your misgivings are valid.
Databases gained the ability to handle large binary and text fields some years ago, and after everybody tried it, we gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on the atomic values. So the code gets difficult and inconsistent.
So most veterans just go with storing them on the filesystem with a pointer in the db.
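For illustration, here's a minimal sketch of that pattern in Python, assuming sqlite3 as the database and a made-up log directory and table name; the same idea applies to any relational database:

    import os
    import sqlite3

    LOG_DIR = "/var/data/logs"  # assumed external storage location

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS entity_log (
                        entity_id INTEGER PRIMARY KEY,
                        log_path  TEXT NOT NULL)""")

    def attach_log(entity_id, log_bytes):
        # Write the blob to the filesystem and keep only a pointer in the DB.
        path = os.path.join(LOG_DIR, f"{entity_id}.log")
        with open(path, "wb") as f:
            f.write(log_bytes)
        conn.execute("INSERT OR REPLACE INTO entity_log VALUES (?, ?)",
                     (entity_id, path))
        conn.commit()

    def read_log_chunk(entity_id, offset, length):
        # Because the data lives in a plain file, you can seek to any part
        # of it without loading the whole thing.
        (path,) = conn.execute("SELECT log_path FROM entity_log WHERE entity_id = ?",
                               (entity_id,)).fetchone()
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

Because the blob lives in an ordinary file, the viewer can seek to and read just the window it needs, which addresses the concern about having to load the full length of the data.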

I know PHP, MySQL, Oracle, and probably others let you work with large database objects as if you had a file pointer, which gets around memory issues.

Related

Storing JSON in database column - NVARCHAR(MAX), FileStream?

I have a table with a few standard fields plus a JSON document in one of the columns. We use SQL Server 2017 to store everything and data is consumed by a C# application using Entity Framework 6. The table itself will possibly have tens of thousands of entries, and these entries (or rather, the JSON column on them) will have to be updated daily. I'm trying to figure out what data type to use for the best performance.
I've read this:
SQL Server 2008 FILESTREAM performance
Currently, the JSON documents come as files ranging from 30-200 KB. There is a possibility of going above the 256 KB barrier, but the probability of going above 1 MB is currently very low. That would point to NVARCHAR. Also, here:
What's best SQL datatype for storing JSON string?
People suggest that storing JSON as NVARCHAR(MAX) is the way to go.
However, two things worry me:
First is fragmentation over time, with so many writes (that's one of the areas where FILESTREAM seems to have an advantage, no matter the column size). I'm not sure how that will affect performance...
Second, I'm not sure whether storing so much text data in the database will slow it down due to the size alone. As far as I understand it, another advantage of FILESTREAM is that the database size cost stays pretty constant no matter the file size on disk, and that helps to maintain performance over time. Or am I wrong?
What would you choose, given my use case?
It is not just a matter of performance, but also a matter of development time, knowledge and maintenance. If everything is already in C# using Entity Framework, why would you use something more complex? My approach would be to use the solution the developers are most comfortable with until a performance bottleneck occurs, and then benchmark.
A comparison between the two solutions with realistic table sizes will give you real insight into whether any adaptations are worthwhile.
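If you do get to the benchmarking stage, a rough timing harness is enough to start with. This is only a sketch, assuming pyodbc, a SQL Server instance on localhost, and a hypothetical JsonDocs(Id, Doc) table; swap in your real schema and run it once against an NVARCHAR(MAX) design and once against a FILESTREAM-backed one:

    import time
    import pyodbc  # assumes the Microsoft ODBC driver is installed

    # Connection string and table/column names are placeholders for illustration.
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;")
    cur = conn.cursor()

    def time_updates(ids, payload):
        # Measure the daily-update pattern: rewrite the JSON column for many rows.
        start = time.perf_counter()
        for row_id in ids:
            cur.execute("UPDATE JsonDocs SET Doc = ? WHERE Id = ?", payload, row_id)
        conn.commit()
        return time.perf_counter() - start

    def time_reads(ids):
        start = time.perf_counter()
        for row_id in ids:
            cur.execute("SELECT Doc FROM JsonDocs WHERE Id = ?", row_id).fetchone()
        return time.perf_counter() - start

    sample_ids = list(range(1, 1001))
    print("update:", time_updates(sample_ids, '{"example": true}'), "s")
    print("read:  ", time_reads(sample_ids), "s")

Load both variants with realistic 30-200 KB documents before timing; the daily-update loop is the workload most likely to expose any fragmentation differences.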

Possible differences between two databases with different sizes but same set of data

I have updated two databases in MSSQL Server 2008R2 using liquibase.
Both of them started from the same database, but one ran through several Liquibase updates incrementally until the final one, while the other went straight to the final update.
I have checked that they have the same schema and the same set of data, but their .mdf file sizes are 10 GB apart.
What areas can I look into (ideally with the SQL commands to check them) to investigate what could account for this 10 GB difference (e.g. indexes? unused empty space? etc.)?
I am not trying to make them the same (so no shrink); I just want to find out which places contribute to this 10 GB size difference. I will even accept answers like using a hex editor to open up the .mdf files and compare byte by byte, but I need to know what I am looking at.
Thank you
The internal structure (physical organization, not logical data) of databases is opaque both by design and due to the real-world scenarios that affect how data is created, updated and accessed.
In most cases there is literally no telling why two logically equivalent databases are different on a physical level. It is some combination of deleted objects, unbalanced pages, disk-based temporary tables, history of garbage collection, and many other potential causes.
In short, you would never expect a physical database to be 1:1 with the logical data it contains.
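One practical first pass is to compare allocated vs. used space per table and index in both databases and diff the results. Here's a sketch in Python with pyodbc; connection details are placeholders, and the allocation-unit join shown is the common approximation (it matches in-row and row-overflow data, so treat LOB-heavy tables with some care):

    import pyodbc  # connection details below are placeholders

    SPACE_QUERY = """
    SELECT t.name AS table_name,
           i.name AS index_name,
           SUM(a.total_pages) * 8 / 1024.0 AS allocated_mb,
           SUM(a.used_pages)  * 8 / 1024.0 AS used_mb
    FROM sys.tables t
    JOIN sys.indexes i          ON i.object_id = t.object_id
    JOIN sys.partitions p       ON p.object_id = i.object_id AND p.index_id = i.index_id
    JOIN sys.allocation_units a ON a.container_id = p.partition_id
    GROUP BY t.name, i.name
    ORDER BY allocated_mb DESC;
    """

    def space_report(server, database):
        conn = pyodbc.connect(f"DRIVER={{ODBC Driver 17 for SQL Server}};"
                              f"SERVER={server};DATABASE={database};Trusted_Connection=yes;")
        return conn.execute(SPACE_QUERY).fetchall()

    # Run against both databases and diff the output to see which tables and
    # indexes account for the 10 GB gap.
    for row in space_report("localhost", "DbA"):
        print(row)

Comparing allocated_mb against used_mb also surfaces internal free space; for the larger indexes, sys.dm_db_index_physical_stats can then report how fragmented they are.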

RavenDB : Storage Size Problems

I'm doing some testing with RavenDB to store data based on an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records took over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. By that calculation the database should be about two hundred megs with 1 million records, but it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principle, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.
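A quick back-of-the-envelope check in Python makes the overhead visible. The document shape and metadata keys below are purely illustrative, not RavenDB's exact wire format:

    import json
    import uuid

    # Hypothetical document shape: 5 GPS fixes plus a GUID-based key.
    doc = {
        "Coordinates": [
            {"Latitude": 51.5074123, "Longitude": -0.1277583} for _ in range(5)
        ]
    }
    metadata = {
        "Raven-Entity-Name": "Tracks",           # illustrative metadata keys
        "Raven-Clr-Type": "MyApp.Track, MyApp",
    }
    key = "tracks/" + str(uuid.uuid4())

    body = json.dumps(doc)
    meta = json.dumps(metadata)

    print(len(key), "bytes of key")
    print(len(body), "bytes of document JSON")   # property names dominate
    print(len(meta), "bytes of metadata JSON")
    print(len(key) + len(body) + len(meta), "bytes before index copies")

The raw numeric payload (10 doubles plus a 16-byte GUID) really is around 100 bytes, but once every value carries its property name and a metadata stream is attached, a few hundred bytes per document is the expected ballpark, before the index store copies anything.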

sql text field vs flat file vs nosql document store

I plan on having a SQL fact table involving a text field which I don't expect to index on (I will only read out the data and very rarely update it). I think this table could get quite large, primarily due to this text field. The rest of the data in my database does make sense to be relational, however I believe I could scale much more easily and cheaply if I instead store pointers to flat files (where each pointer is to a different text file stored in something like S3) instead of using the text field.
An alternative that seems to be gaining popularity is a fully NoSQL document-based solution (e.g. CouchDB, MongoDB, etc.) I am wondering what are the tradeoffs (scalability/reliability/security/performance/ease of implementation/ease of maintenance/cost) between simply using a SQL text field, having a pointer to flat files, or completely rethinking the entire system in the context of a NoSQL document store?
The best approach is to use a relational db for the normal (non-text) data and save the large (text) data "somewhere else" that can handle large data better than a relational database can.
First, let's discuss why it's a bad idea to save large data in a relational database:
row sizes become much longer, so the I/O required to read in disk pages with target rows balloons
backup sizes and, more importantly, backup times enlarge to the point they can cripple DBA tasks and even bring systems offline (then backups are turned off, then the disk fails, oops)
you typically don't need to search the text, so there's no need to have it in the database
relational databases and libraries/drivers typically aren't good at handling unusually large data, and the way of handling it is often vendor-specific, making any solution non-portable
Your choice of "somewhere else" is broad, but includes:
NoSQL data stores like Cassandra, MongoDB, etc.
search indexes like Lucene
File System
Do what's easiest that will work - they are all valid as long as you do your requirements calculations for:
peak write performance
peak read performance
long-term storage volume
Another tip: Don't store anything about the text in the relational database. Instead, name/index the text using the id of the relational database row. That way, if you change your implementation, you don't have to re-jig your data model.
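As a sketch of that tip in Python, assuming boto3 with S3 (mentioned in the question) and a made-up bucket and table; the relational side stores nothing about the text, and the object key is derived from the row id:

    import boto3
    import sqlite3  # stand-in for whatever relational DB holds the fact table

    s3 = boto3.client("s3")
    BUCKET = "my-fact-text"  # hypothetical bucket name

    conn = sqlite3.connect("facts.db")
    conn.execute("CREATE TABLE IF NOT EXISTS fact (id INTEGER PRIMARY KEY, metric REAL)")

    def insert_fact(metric, text):
        # Insert the relational row first, then store the text keyed by the row id.
        cur = conn.execute("INSERT INTO fact (metric) VALUES (?)", (metric,))
        conn.commit()
        row_id = cur.lastrowid
        s3.put_object(Bucket=BUCKET, Key=f"facts/{row_id}.txt",
                      Body=text.encode("utf-8"))
        return row_id

    def load_text(row_id):
        # Nothing about the text lives in the DB; the key is derived from the id.
        obj = s3.get_object(Bucket=BUCKET, Key=f"facts/{row_id}.txt")
        return obj["Body"].read().decode("utf-8")

If you later move from S3 to the file system or a document store, only the two helper functions change; the data model and the rows stay exactly as they are.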

When should we store images in database?

I have a table, productList, with 4 columns. Now I have to store an image for each row, so I have two options:
Store the image in the database.
Save the images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in the DB if you know that they won't increase in size very often. This has its advantage when you are deploying your systems or migrating to new servers: you don't have to worry about copying images separately.
If the number of rows increases very frequently on that system, and the images get bulkier, then it's better to store them on the file system and keep only a path in the database for later retrieval. This will also keep you on your toes when migrating your servers, since you have to take care of copying the images from the file path separately.
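For completeness, a minimal sketch of option 2 in Python, with made-up table and folder names:

    import sqlite3
    import uuid
    from pathlib import Path

    IMAGE_DIR = Path("product_images")  # assumed image folder
    IMAGE_DIR.mkdir(exist_ok=True)

    conn = sqlite3.connect("shop.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS productList (
                        id INTEGER PRIMARY KEY,
                        name TEXT, price REAL, sku TEXT,
                        image_path TEXT)""")

    def save_product_image(product_id, image_bytes):
        # Option 2: write the image to a folder and keep only the path in the table.
        path = IMAGE_DIR / f"{product_id}_{uuid.uuid4().hex}.jpg"
        path.write_bytes(image_bytes)
        conn.execute("UPDATE productList SET image_path = ? WHERE id = ?",
                     (str(path), product_id))
        conn.commit()
        return path

When you migrate servers, the productList rows move with the normal database backup, but the files under product_images have to be copied separately, which is exactly the trade-off described above.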