JSONB performance degrades as the number of keys increases

I am testing the performance of the jsonb datatype in PostgreSQL. Each document will have about 1500 keys that are NOT hierarchical; the document is flattened. Here is what the table and a document look like.
create table ztable0
(
id serial primary key,
data jsonb
)
Here is a sample document:
{ "0": 301, "90": 23, "61": 4001, "11": 929} ...
As you can see, the document does not contain hierarchies and all values are integers. However, some will be text in the future.
Rows: 86,000
Columns: 2
Keys in document: 1500+
When searching for a particular value of a key or performing a GROUP BY, the performance is very noticeably slow. This query:
select (data ->> '1')::integer, count(*) from ztable0
group by (data ->> '1')::integer
limit 100
took about 2 seconds to complete. Is there any way to improve the performance of jsonb documents?

This is a known issue in 9.4beta2; please have a look at this blog post, which contains some details and pointers to the mailing list threads.
About the issue.
PostgreSQL uses TOAST to store data values, which means that big values (typically around 2kB and more) are stored in a separate, special kind of table. PostgreSQL also tries to compress the data, using its pglz method (which has been there for ages). By “tries” I mean that before deciding whether to compress the data, the first 1k bytes are probed, and if the results are not satisfactory, i.e. compression gives no benefit on the probed data, the decision is made not to compress.
So, the initial JSONB format stored a table of offsets at the beginning of its value, and for values with a high number of root keys in the JSON this resulted in the first 1kB (and more) being occupied by offsets. This was a series of distinct data, i.e. it was not possible to find two adjacent 4-byte sequences that would be equal. Thus, no compression.
Note that if one skips over the offset table, the rest of the value is perfectly compressible.
So one of the options would be to tell the pglz code explicitly whether compression is applicable and where to probe for it (especially for newly introduced data types), but the existing infrastructure doesn't support this.
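As an aside, you can check whether stored jsonb values actually got compressed by comparing their on-disk size with their textual size; a minimal sketch against the question's table (the column aliases are just illustrative):

SELECT id,
       pg_column_size(data) AS stored_bytes,    -- on-disk size, after any TOAST compression
       octet_length(data::text) AS text_bytes   -- rough uncompressed size, rendered as text
FROM ztable0
LIMIT 5;

If stored_bytes stays close to text_bytes, pglz has declined to compress those values.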
The fix
So the decision was made to change the way data is stored inside the JSONB value, making it more suitable for pglz to compress. Here's a commit message by Tom Lane with the change that implements a new JSONB on-disk format. Despite the format change, lookup of a random element is still O(1).
It took around a month for this to be fixed, though. As far as I can see, 9.4beta3 has already been tagged, so you'll be able to re-test this soon after the official announcement.
Important note: you'll have to do the pg_dump/pg_restore exercise or use the pg_upgrade tool to switch to 9.4beta3, as the fix for the issue you've identified required changes in the way data is stored, so beta3 is not binary compatible with beta2.


PostgreSQL: In-Row vs Out-of-Row for text/varchar

Two-part question:
What is PostgreSQL's behavior for storing text/varchar in-row vs out-of-row? Am I correct in thinking that with default settings, all columns will always be stored in-row until the 2kB size is reached?
Do we have any control over the above behavior? Is there any way I can change the threshold for a specific column/table, or force a specific column to always be stored out-of-row?
I've read through the PostgreSQL TOAST documentation (http://www.postgresql.org/docs/8.3/static/storage-toast.html), but I don't see any option for changing the thresholds (the default seems to be 2kB per row) or for forcing a column to always be stored out-of-row (EXTERNAL only allows it, but doesn't enforce it).
I've found documentation explaining how to do this on SQL Server (https://msdn.microsoft.com/en-us/library/ms173530.aspx), but I don't see anything similar for PostgreSQL.
If anyone's interested in my motivation: I have a table that has a mix of short, consistent columns (IDs, timestamps, etc.), a column that is varchar(200), and a column that is text/varchar(max), which can be extremely long. I currently have both varchars stored in a separate table, just to allow efficient storage/lookups/scanning on the short, consistent columns.
This is a pain, however, because I constantly have to do joins to read all the data. I would really like to store all the above fields in the same table and tell PostgreSQL to always force-store the two varchars out-of-row.
Edited Answer
For the first part of the question: you are correct (see, for instance, this).
For the second part of the question: the standard way of storing columns is to compress variable-length text fields if their size is over 2kB, and possibly store them in a separate area called the “TOAST table”.
You can give a “hint” to the system on how to store a field by using the following command for your columns:
ALTER TABLE YourTable
ALTER COLUMN YourColumn SET STORAGE { PLAIN | EXTENDED | EXTERNAL | MAIN }
From the manual:
SET STORAGE
This form sets the storage mode for a column. This controls whether this column is held inline or in a secondary TOAST table, and whether the data should be compressed or not. PLAIN must be used for fixed-length values such as integer and is inline, uncompressed. MAIN is for inline, compressible data. EXTERNAL is for external, uncompressed data, and EXTENDED is for external, compressed data. EXTENDED is the default for most data types that support non-PLAIN storage. Use of EXTERNAL will make substring operations on very large text and bytea values run faster, at the penalty of increased storage space. Note that SET STORAGE doesn't itself change anything in the table, it just sets the strategy to be pursued during future table updates. See Section 59.2 for more information.
Since the manual is not completely explicit on this point, this is my interpretation: the final decision about how to store the field is in any case left to the system, given the following constraints:
No field can be stored such that the total size of a row is over 8kB.
No field is stored out-of-row if its size is less than TOAST_TUPLE_THRESHOLD.
After satisfying the previous constraints, the system tries to satisfy the SET STORAGE strategy specified by the user. If no storage strategy is specified, each TOAST-able field is automatically declared EXTENDED.
Under these assumptions, the only way to be sure that all the values of a column are stored out-of-row is to recompile the system with a value of TOAST_TUPLE_THRESHOLD less than the minimum size of any value of the column.
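As a concrete illustration of the above, here is how one might apply a strategy and then inspect what each column is set to; a minimal sketch, where the table and column names are hypothetical:

ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

SELECT attname, attstorage   -- 'p' = PLAIN, 'm' = MAIN, 'e' = EXTERNAL, 'x' = EXTENDED
FROM pg_attribute
WHERE attrelid = 'docs'::regclass
  AND attnum > 0;

Keep in mind that, as discussed above, EXTERNAL only permits out-of-row storage; it does not force it for values below the threshold.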

Index in Parquet

I would like to be able to do a fast range query on a Parquet table. The amount of data to be returned is very small compared to the total size but because a full column scan has to be performed it is too slow for my use case.
Using an index would solve this problem, and I read that this was to be added in Parquet 2.0. However, I cannot find any other information on this, so I am guessing that it was not. I do not think there would be any fundamental obstacles preventing the addition of (multi-column) indexes, provided the data were sorted, which in my case it is.
My question is: when will indexes be added to Parquet, and what would be the high-level design for doing so? I think I would already be happy with an index that points out the correct partition.
Kind regards,
Sjoerd.
Update Dec/2018:
Parquet Format version 2.5 added column indexes.
https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
See https://issues.apache.org/jira/browse/PARQUET-1201 for the list of sub-tasks for that new feature.
Notice that this feature has only just been merged into the Parquet format itself; it will take some time for the different backends (Spark, Hive, Impala, etc.) to start supporting it.
This new feature is called Column Indexes. Basically, Parquet has added two new structures to the Parquet layout: the Column Index and the Offset Index.
Below is a more detailed technical explanation of what it solves and how.
Problem Statement
In the current format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader has to process the page header in order to determine whether the page can be skipped based on the statistics. This means the reader has to access all pages in a column, thus likely reading most of the column data from disk.
Goals
Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:
A single-row lookup in a rowgroup based on the sort column of that rowgroup will only read one data page per retrieved column. Range scans on the sort column will only need to read the exact data pages that contain relevant data.
Make other selective scans I/O efficient: if we have a very selective predicate on a non-sorting column, for the other retrieved columns we should only need to access data pages that contain matching rows.
No additional decoding effort for scans without selective predicates, e.g., full-row-group scans. If a reader determines that it does not need to read the index data, it does not incur any overhead.
Index pages for sorted columns use minimal storage by storing only the boundary elements between pages.
Non-Goals
Support for the equivalent of secondary indices, i.e., an index structure sorted on key values over non-sorted data.
Technical Approach
We add two new per-column structures to the row group metadata:
ColumnIndex: this allows navigation to the pages of a column based on column values and is used to locate data pages that contain matching values for a scan predicate
OffsetIndex: this allows navigation by row index and is used to retrieve values for rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped. Hence the OffsetIndexes for each column in a RowGroup are stored together.
The new index structures are stored separately from the RowGroup, near the footer, so that a reader does not have to pay the I/O and deserialization cost of reading them if it is not doing selective scans. The index structures' location and length are stored in ColumnChunk and RowGroup.
Cloudera's Impala team has run some tests on this new feature (not yet available as part of the Apache Impala core product). [Two charts showing their performance improvements are omitted here.] As you can see, some of the queries had a huge improvement in both CPU time and the amount of data that had to be read from disk.
Original answer back from 2016:
struct IndexPageHeader {
/** TODO: **/
}
https://github.com/apache/parquet-format/blob/6e5b78d6d23b9730e19b78dceb9aac6166d528b8/src/main/thrift/parquet.thrift#L505
The Index Page Header is not implemented as of yet.
See source code of Parquet format above.
I don't see it even in Parquet 2.0 currently.
But yes, see the excellent answer from Ryan Blue on Parquet's pseudo-indexing capabilities (bloom filters).
If you're interested in more details, I recommend this great document on how Parquet bloom filters and predicate push-down work:
https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide
and a more technical, implementation-specific document:
https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf
Parquet currently keeps min/max statistics for each data page. A data page is a group of ~1MB of values (after encoding) for a single column; multiple pages are what make up Parquet's column chunks.
Those min/max values are used to filter both column chunks and the pages that make up a chunk. So you should be able to improve your query time by sorting records by the columns you want to filter on, then writing the data into Parquet. That way, you get the most out of the stats filtering.
You can also get more granular filtering with this technique by decreasing the page and row group sizes, though you're then trading away encoding efficiency and I/O efficiency.
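To make that concrete, a minimal Spark SQL sketch, where the table and column names are hypothetical; the point is simply to sort on the filter column before writing:

-- Write a sorted copy so the page/chunk min-max statistics
-- line up with the column you filter on.
CREATE TABLE events_sorted
USING parquet
AS SELECT * FROM events ORDER BY event_time;

Range predicates on event_time can then skip most pages via the stored statistics.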

Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:
Store data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDBMS implementations this is typically very inefficient, since the NULL values usually take up some space. However, I read an article (can't find its link, unfortunately) that claimed PG doesn't use space for NULL values, making it better suited for storing sparse data.
Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDMS solution, but there's more complexity and overhead associated with it.
I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.
Are there any other solutions? Which one should I use?
I'm assuming you're thinking of sparse matrices from a mathematical context:
http://en.wikipedia.org/wiki/Sparse_matrix (the storage techniques described there are for memory storage (fast arithmetic operations), not persistent storage (low disk usage)).
Since one usually operates on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigations.
Setup:
Postgres 8.4
Matrices with 400*400 elements in double precision (8 bytes) --> 1.28MB raw size per matrix
33% non-zero elements --> 427kB effective size per matrix
averaged over ~1000 different randomly populated matrices
Competing methods:
Rely on the automatic server-side compression of columns with SET STORAGE MAIN or EXTENDED.
Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision value is 64 times bigger than one bit. In theory (ignoring overheads), this method should be an improvement as long as <=98% of the elements are non-zero ;-).) Server-side compression is activated. A sketch of this layout follows the test setup below.
Replace the zeros in the matrix with NULLs. (RDBMSs are very effective at storing NULLs.) Server-side compression is activated.
(Indexing of non-zero elements using a second index ARRAY[] is not very promising and therefore not tested.)
Results:
Automatic compression
no extra implementation effort
no reduction in network traffic
minimal compression overhead
persistent storage = 39% of the raw size
Bitmap
acceptable implementation effort
network traffic slightly decreased; dependent on sparsity
persistent storage = 33.9% of the raw size
Replace zeros with NULLs
some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
no change in network traffic
persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and use my test setup with your own sparsity level. But the effect may be lower than you expect.
I suggest always using matrix serialization (e.g. row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared with nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!
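A minimal sketch of that layout (the table and column names are hypothetical):

CREATE TABLE matrix_serialized
(
  id      serial PRIMARY KEY,
  nrows   integer NOT NULL,           -- N
  ncols   integer NOT NULL,           -- M
  matdata double precision[]          -- row-major serialization, length N*M
);

Element (i, j) (zero-based) is then matdata[i * ncols + j + 1], since PostgreSQL arrays are 1-based by default.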
Test setup:
CREATE TABLE _testschema.matrix_dense
(
matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
matdata double precision[]
);
CREATE TABLE _testschema.matrix_sparse_bitmap
(
matdata double precision[],
bitmap bit varying(8000000)
);
Insert the same matrices into all tables. The concrete data depends on the particular table.
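For illustration, here is what an insert into the bitmap table might look like; the values are hypothetical:

INSERT INTO _testschema.matrix_sparse_bitmap (matdata, bitmap)
VALUES (
  ARRAY[3.7, 1.2, 8.9],    -- only the non-zero elements, in row-major order
  B'010010001'             -- a 1 marks the position of each non-zero element
);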
Do not change the data on the server side afterwards, because of unused-but-allocated pages; or else do a VACUUM.
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
A few solutions spring to mind:
1) Separate your features into groups that are usually set together; create a table for each group with a one-to-one foreign key relationship to the main data, and only join on the tables you need when querying.
2) Use the EAV anti-pattern: create a 'feature' table with a foreign key field to your primary table, as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table.
3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables.
4) Create a column in your primary table using XML or VARCHAR, store some structured text format within it representing your data, create indexes over the data with functional indexes, and write functions to update the data (or use the XML functions if you are using that format).
5) Use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated (see the sketch after this list).
6) Live with lots of empty fields.
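For option 5, a minimal hstore sketch, assuming the contrib module is installed; the table and key names are hypothetical:

CREATE EXTENSION IF NOT EXISTS hstore;  -- on pre-9.1 versions, run the contrib SQL script instead

CREATE TABLE samples
(
  id serial PRIMARY KEY,
  features hstore
);

INSERT INTO samples (features) VALUES ('"f1" => "301", "f9" => "23"');

CREATE INDEX samples_features_idx ON samples USING gin (features);

SELECT * FROM samples WHERE features ? 'f1';  -- rows that have key f1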
A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.
However, the system can't deal with millions of columns, period. There is a theoretical max of a bit over a thousand, IIRC, but you really don't want to go that far.
If you really need that many in a single table, you need to go with the EAV method, which is basically what you're saying in (2).
If each entry has only relatively few keys, I suggest you look at the "hstore" contrib module, which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well, and it does support some pretty efficient index-based lookups. Definitely worth looking into.
I know this is an old thread, but MADlib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.

Efficiently storing 7.300.000.000 rows

How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000 inclusive)
date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 365*10))
value_1 (takes on values between 1 and 1.000.000 inclusive)
value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id, i.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate on the read and write patterns. Writes to the table will be done in one batch per day, where the new 2M entries will be added in one go. Reads will be done continuously, with one read every second.
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why
"all queries will be made on a
specific entity_id. I.e. retrieve all
rows describing entity_id = 12345."
You have 2.000.000 entities. Partition based on entity number:
level1= entity/10000
level2= (entity/100)%100
level3= entity%100
Each file of data is then level1/level2/level3/batch_of_data (e.g. entity 1234567 lands in 123/45/67/).
You can then read all of the files in a given part of the directory to return samples for processing.
If someone wants a relational database, then load files for a given entity_id into a database for their use.
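As a sketch of that last step, where the file path, table name, and delimiter are all hypothetical:

-- MySQL: load one entity's flat file into a table for relational querying
LOAD DATA INFILE '/data/123/45/67/batch_of_data'
INTO TABLE entity_rows
FIELDS TERMINATED BY ','
(id, entity_id, date_id, value_1, value_2);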
Edit on day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to assure that stuff really is going to disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
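A minimal MySQL sketch of that idea; the table name, column types, and partition count are hypothetical:

CREATE TABLE measurements
(
  entity_id INT UNSIGNED NOT NULL,       -- 1 .. 2,000,000
  date_id   SMALLINT UNSIGNED NOT NULL,  -- 1 .. 3,650
  value_1   INT UNSIGNED NOT NULL,
  value_2   INT UNSIGNED NOT NULL,
  PRIMARY KEY (entity_id, date_id)       -- must include the partitioning column
) ENGINE=InnoDB
PARTITION BY HASH (entity_id)
PARTITIONS 64;

A query on a single entity_id then only touches one partition.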
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
HTH.,
S
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here
Imagine your data is laid out on disk as an array of 2M fixed-length entries (one per entity), each containing 3650 rows (one per day) of 20 bytes (the row for one entity on one day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 millisecs) and reads 3650 x 20 = about 80K at maybe 100MB/sec ... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 millisecs, so it would take 2M * 8ms = 4.5 hours. If you spread the data across 4 “raid0” disks, it could take 1.125 hours.
However, the places are only 80K apart, which means there are 200 such places within a 16MB block (a typical disk cache size), so it could operate at anything up to 200 times faster (about 1 minute). Reality is somewhere between the two.
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id as well, if you don't really need it since (entity id, date) is unique, and store the 2 values as 3-byte ints. Then your stored row is 6 bytes, and you have 700 updates per 16MB, and therefore faster inserts and a smaller file.
Edit on comparing to flat files.
I notice that the comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system, and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, with a relatively high overhead for open and close and for each file that is open. All of those "relatively"s are relative to the typical use of a database.
Using file system names as an index for an entity id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In programming you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT; you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array lookup", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities rather than 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id, day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second, as expected (on disks with an 8 millisec average seek time). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0, but that isn't making much difference.)
If you re-order the table into entity order:
ALTER TABLE x ORDER BY entity, day;
then the same SELECT takes 198 millisecs (because it is reading the ordered entity data in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There are a few other things the measurements tell you:
1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) all of the index is accessed at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is log. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 secs = 3.6 hrs to do your daily insert of 2M rows. (Divided by whatever effect you can get by partitioning across systems or disks.)
I had a similar problem (although at a much bigger scale: about your yearly usage every day).
Using one big table brought me screeching to a halt; you can pull a few months, but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with a tiny trickle of data every query; oh, and if you want to do mass queries, use flat files.
Your description of the read pattern is not sufficient. You'll need to describe what amounts of data will be retrieved, how often, and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
If you want to handle huge data with millions of rows, you can treat it like a time series database, which logs by time and saves the data to the database. Some of the ways to store such data are InfluxDB and MongoDB.

How much more inefficient are text (blobs) than varchar/nvarchars?

We're doing a lot of large, but straightforward forms for a fairly big project (about 600 users using it throughout the day - that's big for me at least ;-) ).
The forms have a lot of question/answer type sections, so it's natural for some people to type a sentence while others type a novel. How beneficial would it be, really, to put a character limit on some of these fields?
(Please include references or citations, if necessary/possible - Thanks!)
If you have no limitations on the data size, then why worry? This doesn't sound like a mission-critical project, even with 600 users and several thousand records. Use CLOB/BLOB and be done with it. I have doubts as to whether you would see any major gains in limiting sizes and risking data loss. That said, you should lay out such boundaries before implementation.
Usually varchar is best for storing values that you wish to use logically and perform "whole value" comparisons against; text is for unstructured data. If your project stores survey results with unstructured text, use CLOB/BLOB.
Semi-reference: I sometimes work with hundreds of thousands of call center records, where we use a CLOB to store the dialog between employees and customers.
I say, focus on the needs of the users and only worry about database performance issues when/if those issues arise. Ask yourself "will my users benefit if I limit the amount of data they can enter".
I keep a great gapingvoid cartoon on my wall that says "it's not what the software does. it's what the user does".
You don't mention which SQL server you are using.
If you are using MySQL, there are definite speed advantages to using fixed-length fields to keep the table in static format; however, if you have any variable-width fields, the table will switch to dynamic format and you lose the benefit of specifying the length of the field.
http://dev.mysql.com/doc/refman/5.0/en/static-format.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
Microsoft SQL Server has similar performance gains when you use fixed-length columns. With fixed-length columns the server knows exactly what the offset and length of the data in the row are. With variable-length columns the server knows the offset, but has to store the actual length of the data as a preceding 2-byte counter. This has a couple of implications, discussed in this interesting article on performance as a function of disk space and the advantages of variable-length columns.
If you are using SQL Server 2005 or newer, you can take advantage of varchar(max). This column type has the same 2GB storage capacity as BLOBs, but the data is stored in 8K chunks with the table data pages instead of in a separate store. So you get the large-size advantage, only use 8K in your pages at a time, quick access for the DB engine, and the same query semantics that work with other column types work with varchar(max).
In the end, specifying a max length on a variable column mainly lets you constrain the growth of your database. Once you use variable-length columns you lose the advantage of fixed-size rows, and varchar(max) will perform the same as varchar(10) when holding the same amount of data.
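For illustration, a hypothetical SQL Server 2005+ table mixing the two approaches (the table and column names are made up):

CREATE TABLE form_answers
(
  id           INT IDENTITY PRIMARY KEY,
  question_id  INT NOT NULL,
  short_answer VARCHAR(200) NULL,  -- bounded field for one-line answers
  long_answer  VARCHAR(MAX) NULL   -- up to 2GB for the novel writers
);

Both columns support the same query semantics, which is the main convenience over the older TEXT type.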
blob and text / ntext are stored outside of the row context, with only a reference to the object stored in the row, resulting in a smaller row size, which will improve performance on clustered indexes.
However, because text / ntext are not stored with the row data, retrieval takes longer, and these fields cannot be used in any comparison statements.
from: http://www.making-the-web.com/2008/03/24/saving-bytes-efficient-data-storage-mysql-part-1/
There are a few variations of the TEXT and BLOB types which affect size; they are:
Type                      Maximum length    Storage
TINYBLOB, TINYTEXT        255               Length + 1 bytes
BLOB, TEXT                65,535            Length + 2 bytes
MEDIUMBLOB, MEDIUMTEXT    16,777,215        Length + 3 bytes
LONGBLOB, LONGTEXT        4,294,967,295     Length + 4 bytes