Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:
Store the data in a single table with a separate column for every conceivable feature (potentially millions), with a default value of NULL for unused features. This is conceptually very simple, but I know that in most RDBMS implementations it is typically very inefficient, since NULL values usually take up some space. However, I read an article (I can't find the link, unfortunately) claiming that PG doesn't use any storage for NULL values, which would make it better suited to storing sparse data.
Create separate "row" and "column" tables, plus an intermediate table that links them and stores the value for the column at that row. I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.
I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.
Are there any other solutions? Which one should I use?

I'm assuming you're thinking of sparse matrices in the mathematical sense:
http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are meant for in-memory storage (fast arithmetic operations), not persistent storage (low disk usage).)
Since one usually operates on these matrices on the client side rather than on the server side, an SQL ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigations.
Setup:
Postgres 8.4
Matrices with 400*400 elements in double precision (8 bytes) --> 1.28 MB raw size per matrix
33% non-zero elements --> 427 kB effective size per matrix
averaged over ~1000 different randomly populated matrices
Competing methods:
Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.
(Indexing the non-zero elements with a second index ARRAY[] is not very promising and was therefore not tested.)
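The bitmap method from the list above can be sketched on the client side as follows. This is only an illustration under assumptions: the function names are hypothetical, the matrix is taken to be already serialized into a flat list, and the bitmap is built as a string of '0'/'1' characters that could then be sent as a bit varying value. The last helper reproduces the break-even estimate (1 bit per element plus 64 bits per non-zero value versus 64 bits per element), ignoring all overheads.

```python
from typing import List, Tuple


def encode_sparse(values: List[float]) -> Tuple[str, List[float]]:
    """Split a flat matrix into a '0'/'1' bitmap and its non-zero values."""
    bitmap = "".join("1" if v != 0.0 else "0" for v in values)
    nonzero = [v for v in values if v != 0.0]
    return bitmap, nonzero


def decode_sparse(bitmap: str, nonzero: List[float]) -> List[float]:
    """Rebuild the flat matrix: take the next stored value at every '1'."""
    remaining = iter(nonzero)
    return [next(remaining) if bit == "1" else 0.0 for bit in bitmap]


def bitmap_saves_space(total: int, nonzero_count: int) -> bool:
    """Compare sizes in bits, ignoring array and row overheads."""
    with_bitmap = total + 64 * nonzero_count
    dense = 64 * total
    return with_bitmap < dense


dense_matrix = [0.0, 1.5, 0.0, 0.0, 2.25, 0.0]
bitmap, nonzero = encode_sparse(dense_matrix)
restored = decode_sparse(bitmap, nonzero)
```

With these numbers the break-even point sits at 63/64 (about 98.4%) non-zero elements, which matches the rough 98% figure quoted above.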
Results:
Automatic compression
no extra implementation efforts
no reduced network traffic
minimal compression overhead
persistent storage = 39% of the raw size
Bitmap
acceptable implementation effort
network traffic slightly decreased; dependent on sparsity
persistent storage = 33.9% of the raw size
Replace zeros with NULLs
some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
no change in network traffic
persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and run my test setup with your own sparsity level. But the effect may be smaller than you expect.
I suggest always using a serialized matrix (e.g. in row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared with nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]" literals!
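As a rough illustration of the row-major suggestion, the sketch below flattens a nested matrix into one list plus its dimensions and maps a (row, column) pair back to a flat index. The function names are made up for this example, and indices here are 0-based, whereas SQL arrays are 1-based by default.

```python
from typing import List, Tuple


def flatten_row_major(matrix: List[List[float]]) -> Tuple[int, int, List[float]]:
    """Return (n_rows, n_cols, flat values) with rows stored one after another."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    flat = [value for row in matrix for value in row]
    return n_rows, n_cols, flat


def element_at(flat: List[float], n_cols: int, row: int, col: int) -> float:
    """Row-major lookup: element (row, col) lives at index row * n_cols + col."""
    return flat[row * n_cols + col]


def unflatten(n_rows: int, n_cols: int, flat: List[float]) -> List[List[float]]:
    """Cut the flat list back into n_rows rows of n_cols values each."""
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


n, m, flat = flatten_row_major([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

The flat list would go into the double precision[] column and n, m into the two integer columns.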
Tebas
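-- Baseline: stored out of line, but never compressed
CREATE TABLE _testschema.matrix_dense
(
matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;
-- Default storage (EXTENDED): server-side compression applies
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
matdata double precision[]
);
-- Non-zero values plus a bitmap marking their positions
CREATE TABLE _testschema.matrix_sparse_bitmap
(
matdata double precision[],
bitmap bit varying(8000000)
);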
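Insert the same matrices into all tables. The concrete representation of the data depends on the table.
Do not modify the data on the server afterwards, because updates leave unused but still allocated pages that inflate the measured size. Otherwise run VACUUM (VACUUM FULL if the space must actually be returned) before measuring.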
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;

A few solutions spring to mind,
1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, only join on tables you need when querying
2) Use the EAV anti-pattern, create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table
3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables
4) create a column in your primary data using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)
5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated
6) live with lots of empty fields
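Option 2 (EAV) can be sketched as below. To keep the example runnable without a server it uses Python's built-in sqlite3 module rather than PostgreSQL, and the table and column names (item, feature, name, value) are assumptions chosen for illustration; the same two-table layout carries over to PostgreSQL with the usual type adjustments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY);
    -- One row per (item, feature) pair that is actually set;
    -- unused features simply have no row.
    CREATE TABLE feature (
        item_id INTEGER NOT NULL REFERENCES item(id),
        name    TEXT    NOT NULL,
        value   REAL    NOT NULL,
        PRIMARY KEY (item_id, name)
    );
""")
conn.executemany("INSERT INTO item (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO feature (item_id, name, value) VALUES (?, ?, ?)",
    [(1, "f17", 0.5), (1, "f90210", 3.0), (2, "f17", 1.25)],
)


def features_of(item_id: int) -> dict:
    """Return the sparse feature set of one item as a name -> value dict."""
    rows = conn.execute(
        "SELECT name, value FROM feature WHERE item_id = ? ORDER BY name",
        (item_id,),
    )
    return dict(rows.fetchall())
```

The composite primary key doubles as the index for "all features of one item"; a query by feature name across items would want a second index on name.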

A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.
However, the system can't deal with millions of columns, period. There is a theoretical max of a bit over a thousand, IIRC, but you really don't want to go that far.
If you really need that many, in a single table, you need to go the EAV method, which is basically what you're saying in (2).
If each entry has only a relatively small number of keys, I suggest you look at the "hstore" contrib module, which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well. And it does support some pretty efficient index-based lookups. Definitely worth looking into.

I know this is an old thread, but MADlib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.

Related

Postgresql: In-Row vs Out-of-Row for text/varchar

Two part question:
What is PostgreSQL's behavior for storing text/varchars in-row vs out-of-row? Am I correct in thinking that with default settings, all columns will always be stored in-row until the 2kB size is reached?
Do we have any control over the above behavior? Is there any way I can change the threshold for a specific column/table, or force a specific column to always be stored out-of-row?
I've read through the PostgreSQL TOAST documentation (http://www.postgresql.org/docs/8.3/static/storage-toast.html), but I don't see any option for changing the thresholds (the default seems to be 2kB per row) or for forcing a column to always be stored out-of-row (EXTERNAL only allows it, but doesn't enforce it).
I've found documentation explaining how to do this on SQL Server (https://msdn.microsoft.com/en-us/library/ms173530.aspx), but don't see anything similar for PostgreSQL.
If anyone's interested in my motivation, I have a table that has a mix of short-consistent columns (IDs, timestamps etc), a column that is varchar(200), and a column that is text/varchar(max), which can be extremely large in length. I currently have both varchars stored in a separate table, just to allow efficient storage/lookups/scanning on the short-consistent columns.
This is a pain however, because I constantly have to do joins to read all the data. I would really like to store all the above fields in the same table, and tell Postgresql to force-store the 2 VARCHARs out-of-row, always.
Edited Answer
For first part of the question: you are correct (see for instance this).
For the second part of the question: the standard way of storing columns is to compress variable-length text fields if their size is over 2KB and, if necessary, to store them in a separate area called the “TOAST table”.
You can give a “hint” to the system on how to store a field by using the following command for your columns:
ALTER TABLE YourTable
ALTER COLUMN YourColumn SET STORAGE (PLAIN | EXTENDED | EXTERNAL | MAIN)
From the manual:
SET STORAGE
This form sets the storage mode for a column. This controls whether this column is held inline or in a secondary TOAST table, and whether the data should be compressed or not. PLAIN must be used for fixed-length values such as integer and is inline, uncompressed. MAIN is for inline, compressible data. EXTERNAL is for external, uncompressed data, and EXTENDED is for external, compressed data. EXTENDED is the default for most data types that support non-PLAIN storage. Use of EXTERNAL will make substring operations on very large text and bytea values run faster, at the penalty of increased storage space. Note that SET STORAGE doesn't itself change anything in the table, it just sets the strategy to be pursued during future table updates. See Section 59.2 for more information.
Since the manual is not completely explicit on this point, this is my interpretation: the final decision about how to store the field is left in any case to the system, given the following constraints:
No field can be stored such that the total size of a row exceeds 8KB.
No field is stored out-of-row if its size is less than TOAST_TUPLE_THRESHOLD.
After satisfying the previous constraints, the system tries to satisfy the SET STORAGE strategy specified by the user. If no storage strategy is specified, each TOAST-able field is automatically declared EXTENDED.
Under these assumptions, the only way to be sure that all the values of a column are stored out-of-row is to recompile the system with a value of TOAST_TUPLE_THRESHOLD less than the minimum size of any value of the column.

Improve performance of PostgreSQL array queries

I am storing large vectors (1.4 million values) of doubles in a PostgreSQL table. This table's create statement follows.
CREATE TABLE analysis.expression
(
celfile_name character varying NOT NULL,
core double precision[],
extended double precision[],
"full" double precision[],
probeset double precision[],
CONSTRAINT expression_pkey PRIMARY KEY (celfile_name)
)
WITH (
OIDS=FALSE
);
ALTER TABLE analysis.expression ALTER COLUMN core SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN extended SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN "full" SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN probeset SET STORAGE EXTERNAL;
Each entry in this table is written only once and possibly read many times at random indices. PostgreSQL doesn't seem to scale terribly well for lookups as the vector length grows even with STORAGE set to EXTERNAL (O(n)). This makes queries like the following, where we selected many individual values in the array, very, very slow (minutes - hours).
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression LIMIT 1000;
If there are enough individual indices being pulled, it can even be slower than pulling the whole array.
Is there any way to make such queries faster?
Edits
I am using PostgreSQL 9.3.
All the queries I am running are simple SELECTs, possibly with a join, such as:
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression JOIN samples s USING (celfile_name) WHERE s.study = 'x';
In one scenario the results of these queries are fed through prediction models. The prediction probability gets stored back into the DB in another table. In other cases selected items are pulled from the arrays for downstream analysis.
Currently 1.4 million is the longest single array, the others are shorter with the smallest being 22 thousand and the average being ~ 100 thousand items long.
Ideally I would store the array data as a wide table but with 1.4 million entries this isn't feasible, and long tables (i.e. rows with celfile_name, index, value) are much slower than PostgreSQL arrays if we want to pull a full array from the data from the DB. We do this to load our downstream data stores for when we do analysis on the full dataset.
You store your data in a structured data management storage container (i.e. PostgreSQL), but due to the nature of your data (i.e. large but irregularly sized collections of like data) you actually store your data outside of the container. PostgreSQL is not good at retrieving data from irregular (and unpredictable?) large arrays, as you have noticed; the fact that the arrays are stored externally is already testament to the fact that your requirements are not aligned with where PostgreSQL excels. It is very likely that there are much better solutions for storing and reading your arrays than PostgreSQL. The fact that the results from running the arrays through prediction models are stored in tables in a PostgreSQL database hints at a hybrid solution: store your data in some form that allows efficient access in the patterns that you need, then store the results in PostgreSQL for further processing.
Since you do not provide any details on the prediction models, it is impossible to be specific in this answer, but I hope this will help you on your way.
If your prediction models are written in some language for which a PostgreSQL driver is available, then store your data in some format that is suited to that language, do your predictions and write the results to a table in PostgreSQL. This would work for languages like C and C++ with the libpq library and for Java, C#, Python, etc. using a high-level library such as JDBC.
If your prediction model is written in MATLAB, then store your arrays in a MATLAB format and connect to PostgreSQL for the results. If written in R, you can use the R extension for PostgreSQL.
The key here is that you should store the arrays in a form that allows for efficient use in your prediction models. Match your data storage to the prediction models, not the other way around.
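One possible reading of "a form that allows efficient access" is a flat file of fixed-width doubles, where the element at index i sits at byte offset 8 * i and can be read with a single seek. The sketch below is an assumption about what such a store might look like, not something the answer prescribes; it uses an in-memory io.BytesIO in place of a real file so that it runs anywhere, and the helper names are hypothetical. Indices are 0-based here, unlike the 1-based SQL array subscripts in the question.

```python
import io
import struct
from typing import BinaryIO, Iterable, List

DOUBLE = struct.Struct("<d")  # one little-endian 8-byte float


def write_vector(buf: BinaryIO, values: Iterable[float]) -> None:
    """Append every value as a fixed-width 8-byte record."""
    for value in values:
        buf.write(DOUBLE.pack(value))


def read_at(buf: BinaryIO, indices: Iterable[int]) -> List[float]:
    """Fetch individual elements by seeking straight to index * 8 bytes."""
    result = []
    for index in indices:
        buf.seek(index * DOUBLE.size)
        result.append(DOUBLE.unpack(buf.read(DOUBLE.size))[0])
    return result


store = io.BytesIO()  # stand-in for a file opened in binary mode
write_vector(store, (i * 0.5 for i in range(1000)))
picked = read_at(store, [1, 14, 101])
```

Each lookup costs one seek regardless of vector length, which is the access pattern the question's queries need.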

JSONB performance degrades as number of keys increase

I am testing the performance of jsonb datatype in postgresql. Each document will have about 1500 keys that are NOT hierarchical. The document is flattened. Here is what the table and document looks like.
create table ztable0
(
id serial primary key,
data jsonb
)
Here is a sample document:
{ "0": 301, "90": 23, "61": 4001, "11": 929} ...
As you can see the document does not contain hierarchies and all values are integers. However, Some will be text in the future.
Rows: 86,000
Columns: 2
Keys in document: 1500+
When searching for a particular value of a key or performing a group by the performance is very noticeably slow. This query:
select (data ->> '1')::integer, count(*) from ztable0
group by (data ->> '1')::integer
limit 100
took about 2 seconds to complete. Is there any way to improve performance of jsonb documents.
This is a known issue in 9.4beta2; please have a look at this blog post, which contains some details and pointers to the mail threads.
About the issue.
PostgreSQL uses TOAST to store large data values: big values (typically around 2kB and more) are stored in a separate, special kind of table. PostgreSQL also tries to compress the data, using its pglz method (which has been there for ages). “Tries” means that before deciding to compress the data, the first 1k bytes are probed, and if the results are not satisfactory, i.e. compression gives no benefit on the probed data, the decision is made not to compress.
The initial JSONB format stored a table of offsets at the beginning of its value. For values with a high number of root keys in the JSON, this resulted in the first 1kB (and more) being occupied by offsets. This was a series of distinct data, i.e. it was not possible to find two adjacent 4-byte sequences that were equal. Thus no compression.
Note that if one skipped past the offset table, the rest of the value would be perfectly compressible.
So one option would be to tell the pglz code explicitly whether compression is applicable and where to probe for it (especially for newly introduced data types), but the existing infrastructure doesn't support this.
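The effect described above can be approximated in a few lines. Note the assumptions: pglz is not available from the Python standard library, so zlib is used here purely as a stand-in compressor, and both the offset table and the payload are invented for the illustration. The sketch only shows that 1 kB of distinct, increasing 4-byte offsets compresses far worse than 1 kB of repetitive key/value text, which is why probing only the start of the value is misleading.

```python
import random
import struct
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical offset table: 256 strictly increasing 4-byte offsets (1 kB).
offsets, position = [], 0
for _ in range(256):
    position += rng.randint(1, 40)
    offsets.append(position)
offset_table = struct.pack("<256I", *offsets)

# Repetitive payload of the same size, standing in for the rest of the value.
payload = (b'"key": 1234, ' * 100)[:1024]

# Compressed size divided by original size; lower means better compression.
offset_ratio = len(zlib.compress(offset_table)) / len(offset_table)
payload_ratio = len(zlib.compress(payload)) / len(payload)
```

A compressor that judges the whole value by its first kilobyte would see only the poorly compressible part.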
The fix
So the decision was made to change the way data is stored inside the JSONB value, making it more suitable for pglz to compress. Here's a commit message by Tom Lane with the change that implements a new JSONB on-disk format. And despite the format changes, lookup of a random element is still O(1).
It took around a month to be fixed, though. As far as I can see, 9.4beta3 has already been tagged, so you'll be able to re-test this soon, after the official announcement.
Important Note: you'll have to go through a pg_dump/pg_restore exercise or use the pg_upgrade tool to switch to 9.4beta3, as the fix for the issue you've identified required changes in the way data is stored, so beta3 is not binary compatible with beta2.

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On server-side I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database create new increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean less data per table block, so more blocks have to be read from the DB in a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so more index blocks have to be read in an index scan (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
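A back-of-the-envelope sketch of the size argument: how many keys fit into one index block, and how many blocks a million keys need, for an 8-byte key versus the 32-character key from the question. The 8192-byte block size is an assumption, and per-entry and per-page overheads are ignored entirely, so the absolute numbers are illustrative only; the point is the ratio.

```python
BLOCK_SIZE = 8192  # assumed block size in bytes; real engines differ


def keys_per_block(key_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Upper bound on keys per block, ignoring all per-entry overhead."""
    return block_size // key_bytes


def blocks_needed(rows: int, key_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks required to hold `rows` keys (ceiling division)."""
    per_block = keys_per_block(key_bytes, block_size)
    return -(-rows // per_block)


small = blocks_needed(1_000_000, 8)    # e.g. a 64-bit integer key
large = blocks_needed(1_000_000, 32)   # e.g. a 32-character single-byte string
```

Under these assumptions the 32-byte key needs about four times as many blocks, independent of whether the bytes hold digits or characters.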
Note also that random IDs are usually more "clusterable" than sequence numbers, since sequences / auto-increment columns need to be administered centrally (although this can be mitigated with techniques such as the Hi-Lo algorithm, which most current persistence frameworks support).

Is it sensible to store long, unique text strings in OLAP cubes for drillthrough retrieval (especially in SSAS)?

I'm motivated to store some long text strings in an OLAP cube, long on the order of 1,000s or 10,000s of characters -- but I'm wondering if this will lead me astray. (I'm also curious to learn a little more about how OLAP engines handle strings.) The particular use case I have in mind is that I have a unique, pre-existing "record description" for each of my OLAP facts, and I want to put those descriptions in the cube so that I have the option to get them back when I do a DRILLTHROUGH operation. In contrast, I don't need the record descriptions to appear when doing normal pivot table / aggregate type operations. (The descriptions are too long to display sensibly in a pivot table, plus each fact has a unique description, meaning it doesn't make sense to aggregate over descriptions.) My current dataset has around 700,000 facts, though I'm also curious if the answer would change for larger datasets.
My hope was that an OLAP server could do something sensible if I put these long strings in a cube. In the Sql Server / SSAS case in particular, I thought perhaps I'd put them in a dimension marked as ROLAP, to save memory usage, and use a degenerate dimension (aka a "fact dimension", in SSAS terminology), to avoid needless ETL complexities. But I'm curious if this would be regarded as a horrible practice for some reason, or if there are any hidden gotchas.
Update: My example use case is where you have a string associated with each OLAP fact. But it might also be instructive to consider the case where the strings are instead associated with each particular value of a particular dimension. (e.g. Suppose you had a Company dimension and each company had a somewhat lengthy Company Description string.)
Here's what I've been able to uncover about the implications of storing such strings in SSAS, especially SSAS 2008. Where I consider data structures, it's exclusively focused on MOLAP storage, which is what I've been experimenting with.
First, standard MS ETL (extract/transform/load, i.e. data import) tools like Business Intelligence Development Studio may try to prevent you from importing large textfields, especially varchar(max) fields, but there is a workaround, and it's proven effective for me. (For BIDS it involves manually setting the DataSize element in an XML file, potentially to the magic size of 163315555 bytes. Props to Matija Lah for figuring this out.)
Second, as far as I can tell, storing lots of long, unique strings shouldn't wreak havoc on the on-disk data structures used by SSAS. Also, the size of the string data on disk should be of the same order of magnitude as the string data in your data source. Here's some rough info on how SSAS handles strings:
The core OLAP data structures (e.g. for the attributes of a dimension, or for the facts of a measure group) don't directly contain strings; instead they contain offsets into "string store" files (extensions .ksstore, .asstore, .bsstore, or .string.data), which contain the actual string data.
Within a given string store, each string is represented only once. If several rows in your source data tables contain duplicate strings, then at the SSAS/MOLAP level, that will translate into duplicated file offsets, rather than duplicated string values.
If your source string has length n, then the corresponding data structure in the string store has 8-ish bytes of overhead, plus 2*n bytes of character data. (Strings are inherently stored in 2-byte Unicode format in SSAS.)
For some fantastic detail about this stuff, I suggest the book Microsoft SQL Server 2008 Analysis Services Unleashed, in particular chapter 20, "The Physical Data Model".
At least in my experiments, string store files do not seem to be compressed -- at least they're not notably smaller than an uncompressed string store would be.
I've verified experimentally that text data takes the same order of magnitude of bytes whether stored in SSAS MOLAP or in a SQL table. In particular, I did a "select sum(len(myfield)) from mytable" on one of my dimension tables, and then compared the result to the size of the corresponding attribute's files in my SSAS data directory. Size was 172MB in SQL and 304MB in SSAS. (SQL size was 147MB if I summed all unique strings, rather than all strings.) In my case the size difference was mostly explained by character encoding; my source SQL data is stored with one byte per character, whereas SSAS stores all strings with two bytes per character. I found that the .ksstore file totally dominated all the other files associated with this attribute in size, regardless of whether or not I optimized the attribute via AttributeHierarchyOptimizedState=FullyOptimized.
Third, there is a 4GB cap on the size of string store files, which limits the amount of unique text that can be associated with, say, a particular dimension/attribute. In my case I'm less than 10% of the way to the limit, but this might affect some people. (Quick order-of-magnitude calculation for the original post: 1M facts * 10,000 bytes per fact = 10GB-ish worth of text.) If you do hit this limit, you'll apparently hit it at cube "processing" time. Apparently it applies even to ROLAP dimensions. There may be some hacks to work around this. See here. Note that SQL Server 2012 may remove this 4GB limitation.
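The order-of-magnitude check against the cap can be written down using the per-string cost given earlier (roughly 8 bytes of overhead plus 2 bytes per character). This is a sketch under assumptions: the overhead is only "8-ish" in the text, the cap is taken here as 4 GiB, every string is assumed unique, and the function names are made up.

```python
OVERHEAD_BYTES = 8              # "8-ish" bytes per string, per the text
BYTES_PER_CHAR = 2              # strings are stored as 2-byte Unicode
STORE_CAP_BYTES = 4 * 1024**3   # 4GB cap, interpreted as 4 GiB (assumption)


def string_store_bytes(unique_strings: int, avg_chars: int) -> int:
    """Estimated string store size for unique strings of a given average length."""
    return unique_strings * (OVERHEAD_BYTES + BYTES_PER_CHAR * avg_chars)


def fits_in_store(unique_strings: int, avg_chars: int) -> bool:
    """True if the estimate stays at or below the assumed cap."""
    return string_store_bytes(unique_strings, avg_chars) <= STORE_CAP_BYTES


short_case = string_store_bytes(700_000, 1_000)    # descriptions of ~1,000 chars
long_case = string_store_bytes(700_000, 10_000)    # descriptions of ~10,000 chars
```

For the 700,000 facts mentioned in the question, descriptions around 1,000 characters stay well under the cap, while descriptions around 10,000 characters would exceed it several times over.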
Fourth, it seems that if long unique strings create a problem in SSAS, they do so at the level of in-memory representation. One potential problem (that I haven't looked into in detail) is that having these extra strings cached in memory will keep SSAS from keeping other important data structures in memory, and thus degrade performance. Another problem, suggested by the book The Microsoft Data Warehouse Toolkit (though I haven't yet found this claim elsewhere), is that SSAS does some expansive string padding on its in-memory data structures:
"The relational database stores variable length string columns ... However, other parts of the SQL Server toolset will fill these columns out to their full width. Notable, Integration Services and Analysis Services pad string columns with spaces as they are loaded into memory. Both Integration Services and Analysis Services love physical memory, so there's a cost to declaring string columns that are far wider than they need to be."
To conclude, so far storing my long string data in the cube seems convenient, and I haven't uncovered any reasons to expect disaster, so I'm giving it a try. I'll try to provide an update if things don't work out.
You could store the values relationally in a table and then create an integer surrogate key.
Add the integer surrogate to your UDM and create an SSRS Drillthrough action that looks up the text field by the key value:
http://msdn.microsoft.com/en-US/library/ms174526(v=SQL.90).aspx
I would use a degenerate dimension, but hide it via SSAS until requested via a Drillthrough Action.
I can't guide you on the internal storage of strings for the AS engine, but as for storing them in SQL, I would make sure your varchar(MAX) column was at the end of your columns to speed up SQL engines scanning of those rows.
At 700,000 rows, with enough memory and disk I/O, you aren't taxing SQL much.
I haven't worked through all the possibilities described in it and linked to from it yet, but this thread from 2007 is on the same topic and seems pretty relevant:
http://www.sqldev.org/sql-server-analysis-services/discussion-about-how-to-create-a-fact-drillthrough-dimension-the-best-way-34857.shtml
One new possibility raised here is that, rather than treating text stored in the fact table as a degenerate dimension, you could potentially treat it as a text-valued (vs numeric-valued) measure. Initial googling suggests that SSAS might support this but there are some tricks to getting this right, e.g. you probably want to disable aggregation for that measure, you might need to do something non-standard to get the field to appear in a drillthrough, and it might require SSAS enterprise edition.