Why Select SQL queries on tables with blobs are slow, even when the blob is not selected? - sql

SELECT queries on tables with BLOBs are slow, even if I don't include the BLOB column. Can someone explain why, and maybe how to circumvent it? I am using SQL Server 2012, but maybe this is more of a conceptual problem that would be common for other distributions as well.
I found this post: SQL Server: select on a table that contains a blob, which shows the same problem, but the marked answer doesn't explain why is this happening, neither provides a good suggestion on how to solve the problem.

If you are asking for a way to solve the performance drag, there are a number of approaches that you can take. Adding indexes to your table should help massively provided you aren't simply selecting the entire recordset. Creating views over the table may also assist. It's also worth checking the levels of index fragmentation on the table as this can cause poor performance and could be addressed with a regular maintenance job. The suggestion of creating a linked table to store the blob data is also a genuinely good one.
However, if your question is asking why it's happening, this is because of the fundamentals of the way MS SQL Server functions. Essentially your database, and all databases on the server and split into pages, 8kb chunks of data with a 96-byte header. Each page representing what is possible in a single I/O operation. Pages are collected contained and grouped within Exents, 64kb collections of eight contiguous pages. SQL Server therefore uses sixteen Exents per megabyte of data. There are a few differing page types, a data page type for example won't contain what are termed "Large Objects". This include the data types text, image, varbinary(max), xml data, etc... These also are used to store variable length columns which exceed 8kb (and don't forget the 96 byte header).
At the end of each page will be a small amount of free space. Database operations obviously shift these pages around all the time and free space allocations can grow massively in a database dealing with large amounts of I/O and random record access / modification. This is why free space on a database can grow massively. There are tools available within the management suite to allow you to reduce or remove free space and basically this re-organizes pages and exents.
Now, I may be making a leap here but I'm guessing that the blobs you have in your table exceed 8kb. Bear in mind if they exceed 64kb they will not only span multiple pages but indeed span multiple exents. The net result of this will be that a "normal" table read will cause massive amounts of I/O requests. Even if you're not interested in the BLOB data, the server may have to read through the pages and exents to get the other table data. This will only be compounded as more transactions make pages and exents that make up a table to become non-contiguous.
Where "Large Objects" are used, SQL Server writes Row-Overflow values which include a 24bit pointer to where the data is actually stored. If you have several columns on your table which exceed the 8kb page size combined with blobs and impacted by random transactions, you will find that the majority of the work your server is doing is I/O operations to move pages in and out of memory, reading pointers, fetching associated row data, etc, etc... All of which represents serious overhead.

I got a suggestion then, have all the blobs in a separate table with an identity ID, then only save the identity ID in your main table
it could be because - maybe SQL cannot cache the table pages as easily, and you have to go to the disk more often. I'm no expert as to why though.
A lot of people frown at BLOBS/images in databases - In SQL 2012 there is some sort of compromise where you can configure the DB to keep objects in a file structure, not in the actual DB anymore - you might want to look for that

Related

SQL Server select with large varchar columns take time to load

I am trying to run a simple select query and it has column called instructions with varchar(8000) in the select column list. The table has
90,000 records and it took my SQL server management studio console to 10 seconds to return and display the full table data
SELECT id, name, instructions, etc.... FROM TABLE;
however when i remove the instructions from the select list it took only a 1 second to execute and display the result. Can any one please help me to understand the theory behind this
Thanks
Keth
There are some obvious things here that impact the time, and a few more subtle ones around it. The topic of the underlying storage of SQL Server and how it stores / retrieves this data is a book in itself, of which there are many. (I'd personally recommend Kalen Delaney but everyone will have their own preference and I appreciate we should keep away from subjectivity on SO).
90k rows of instructions potentially have to be marshalled across your network connection if you were connected from another machine than the server.
The SSMS console itself, has to display these, which itself takes time.
depending on the size of what you are reading vs your buffer cache and other queries being executed you could be putting pressure on your cache and generating more physical IO load for the server as a whole.
As mentioned in comments, more data is being read, but does this mean more is being read from the disk? This one is far more subtle when looked at in detail.
In terms of the disk IO issue, depending on when the instructions are placed in the row and the settings for the column around inlining of data. It might be that the instructions for the row are stored inline with the row, which means no additional disk IO is actually occurring to read them vs not read them, its more a case of whether SQL Server bothers to decode the value from the page in memory.
The varchar(8000) though might not be inline with the rest of the data, it could be on a row_overflow_data page, sometimes referred to as short large object (SLOB), in which case the instruction field itself stores a pointer where the data is stored, and when you read the instructions it causes SQL Server to have to read another entirely random page (and extent) elsewhere on the disk per row.
Depending how / when instructions are added, you could see a huge level of fragmentation / lack of contiguous extents being allocated for these instructions, although depending on the IO subsystem, this may be immaterial to the problem.
There are a lot of unknowns at this point which makes it harder to give anything definitive - you are in the 'it depends' area of the DB, which would need a lot more specifics and investigation to be able to point at a specific cause, vs the more general (and not entirely complete) list above.
As Tim Biegeleisen mentioned, do not read the instructions unless you need to.

Is it a good idea to create tables dynamically to store user-content?

I'm currently designing an application where users can create/join groups, and then post content within a group. I'm trying to figure out how best to store this content in a RDBMS.
Option 1: Create a single table for all user content. One of the columns in this table will be the groupID, designating which group the content was posted in. Create an index using the groupID, to enable fast searching of content within a specific group. All content reads/writes will hit this single table.
Option 2: Whenever a user creates a new group, we dynamically create a new table. Something like group_content_{groupName}. All content reads/writes will be routed to the group-specific dynamically created table.
Pros for Option 1:
It's easier to search for content across multiple groups, using a single simple query, operating on a single table.
It's easier to construct simple cross-table queries, since the content table is static and well-defined.
It's easier to implement schema changes and changes to indexing/triggers etc, since there's only one table to maintain.
Pros for Option 2:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
What are the general recommendations between the above 2 options, from performance/development/maintenance perspectives?
One of the cardinal sins in computing is optimizing too early. It is the opinion of this DBA of 20+ years that you're overestimating the IO that's going to happen to these groups.. RDBMS's are very good at querying and writing this type of info within a standard set of tables. Worst case, you can partition them later. You'll have a lot more search capability and management ease with 1 set of tables instead of a set per user.
Imagine if the schema needs to change? do you really want to update hundreds or thousands of tables or write some long script to fix a mundane issue? Stick with a single set of tables and ignore sharding. Instead, think "maybe we'll partition the tables someday, if necessary"
It is a no-brainer. (1) is the way to go.
You list these as optimizations for the second method. All of these are misconceptions. See comments below:
All reads and writes will be distributed across numerous tables, thus
avoiding any bottlenecks that can result from a lot of traffic hitting
a single table (though admittedly, all these tables are still in a
single DB)
Reads and writes can just as easily be distributed within a table. The only issue would be write conflicts within a page. That is probably a pretty minor consideration, unless you are dealing with more than dozens of transactions per second.
Because of the next item (partially filled pages), you are actually much better off with a single table and pages that are mostly filled.
Each table will be much smaller in size, allowing for faster lookups,
faster schema-changes, faster indexing, etc
Smaller tables can be a performance disaster. Tables are stored on data pages. Each table is then a partially filled page. What you end up with is:
A lot of wasted space on disk.
A lot of wasted space in your page cache -- space that could be used to store records.
A lot of wasted I/O reading in partially filled pages.
If we want to shard the DB in future, the transition would be easier
if all the data is already "sharded" across different tables.
Postgres supports table partitioning, so you can store different parts of a table in different places. That should be sufficient for your purpose of spreading the I/O load.
Option 1: Performance=Normal Development=Easy Maintenance=Easy
Option 2: Performance=Fast Development=Complex Maintenance=Hard
I suggest to choose the Oprion1 and for the BIG table you can manage the performance with better indexes or cash indexes (for some DB) and the last thing is nothing help make the second Option 2, because development a maintenance time is fatal factor

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use
case the data size far exceeds the size of available memory.
Assuming this is not uncommon, what is meant by "Spark SQL’s
in-memory computational model"? Is Spark SQL recommended only for
cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large
dataset can take a long time. I read this argument against
indexing in in-memory database, but I was not convinced. The example
there discusses a scan of a 10,000,000 records table, but that's not
really big data. Scanning a table with billions of records can cause
simple queries of the "SELECT x WHERE y=z" type take forever instead
of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using it cannot reliably monitor changes and as a consequence cannot maintain indices.
If data source supports indexing it can be indirectly utilized by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
standard indexing techniques require persistent and well defined data distribution but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
high level data layout achieved by proper partitioning combined with columnar storage and compression can provide very efficient distributed access without an overhead of creating, storing and maintaining indices.This is a common pattern used by different in-memory columnar systems.
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today) take different approach and use advanced compression techniques with with random access support.
Of course this raises a question - if you require an efficient random access why not use a system which is design as a database from the beginning. There many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriouusly. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimized those searches. Like I said, use the disk like your in memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

Possible differences between two databases with different sizes but same set of data

I have updated two databases in MSSQL Server 2008R2 using liquibase.
Both of them start with the same database, but one ran through several liquibase updates until the final one incrementally, the other just go straight to the final update.
So I have checked they have the same schema, same set of data, but their .mdf file sizes are 10GB apart.
What areas (best to provide also the SQL command) I can look into to investigate what possibly gives me this 10GB difference (e.g. Index? Unused empty spaces? etc...)
I am not trying to make them the same (so no Shrink), I just want to find out the places that contribute to this 10GB size difference. So I will accept answers like using HEX editor to open up the mdf files and compare byte by byte, but I need to know what am I looking at.
Thank you
The internal structure (physical organization, not logical data) of databases is opaque both by design and due to the real-world scenarios that affect how data is created, updated and accessed.
In most cases there is literally no telling why two logically equivalent databases are different on a physical level. It is some combination of deleted objects, unbalanced pages, disk-based temporary tables, history of garbage collection, and many other potential causes.
In short, you would never expect a physical database to be 1:1 with the logical data it contains.

Database or other method of storing and dynamically accessing HUGE binary objects

I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be store the file on disk, but create a DB-based index. Most filesystems get cranky or slow when you have > 10k files in a folder/directory/etc. Your application can generate the filename and store metadata in the DB, then organize by the generated name on disk. Downsides are file contents may not be directly apparent from the name. However, you can easily backup changed files without specialized DB backup plugins and a sophisticated partitioning, incremental backup scheme. Also, seeks within the file become much simpler operations (skip ahead, rewind, etc.). There is generally better support for these operations in a file system than in a DB.
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle as an example) do allow direct access to parts of LOB data, without loading the full LOB, if you just know the offset and length you want to have. So, you could have a table with some searchable metadata and then the LOB column, and if needed, an additional metadata table containing metadata on the LOB contents so that you'd have some kind of keyword->(offset,length) relation available for partal loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (ok, can be possible, but at least in my experience tend to have a nasty price tag attached).
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples to make up 200 GB, then each sample is ~10 K, and each recording is 200K (to have 100,000 per file, which is 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
I think that Microsoft SQL does what you need with the varbinary(MAX) field type WHEN used in conjnction with filestream storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)