Efficient storage of a text array column in PostgreSQL

My question is primarily about database storage/space optimization. I have a table in my database with the following columns:
id : PRIMARY KEY INTEGER
array_col : UNIQUE TEXT[]
This table is by far the largest in the database (in terms of storage space) and contains roughly 200 million records. array_col has a few characteristics that make me suspect I am not storing it in a very space-optimal manner:
- The majority of strings are fairly long (on average 25 characters).
- The length of the text array is variable (typically 100+ strings per array).
- Individual strings repeat fairly frequently across records; on average a given string appears in several thousand other records. (The array order also tends to be similar across records.)
id | array_col
1  | […,"20 torque clutch settings",…]
2  | […,"20 torque clutch settings",…]
3  | […,"20 torque clutch settings",…]
…  | …
The above table shows values repeating across records.
I do not want to normalize this table, because treating the text array as an atomic unit is most useful for my application and it also makes querying much simpler. I also care about the ordering of the strings in the array.
I can think of two approaches to this problem:
1. Create a lookup table to avoid repeating strings. The assumption here is that an INT[] is probably more space efficient than a TEXT[].
Table 1

id | array_col
1  | […,47,…]
2  | […,47,…]
3  | […,47,…]
…  | …

Table 2

id | name
…  | …
47 | "20 torque clutch settings"
…  | …
Problem: PostgreSQL, to my knowledge, does not support arrays of foreign keys. I'm also not sure what a trigger or stored procedure for this would look like. Database consistency would probably become more of a concern for me too.
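The closest I can get to approach 1 on my own is a sketch along these lines (completely untested, and the table/function names are made up), using a dictionary table plus a pair of helper functions instead of real array foreign keys:

-- dictionary of distinct strings
CREATE TABLE string_dict (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- main table stores ordered ids instead of the strings themselves
CREATE TABLE big_table (
    id        integer PRIMARY KEY,
    array_col integer[] NOT NULL UNIQUE
);

-- map a text[] to an int[], adding unseen strings to the dictionary
CREATE OR REPLACE FUNCTION encode_array(p_strings text[])
RETURNS integer[] LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO string_dict (name)
    SELECT DISTINCT s FROM unnest(p_strings) AS s
    ON CONFLICT (name) DO NOTHING;

    RETURN (SELECT array_agg(d.id ORDER BY u.ord)
            FROM unnest(p_strings) WITH ORDINALITY AS u(name, ord)
            JOIN string_dict d USING (name));
END;
$$;

-- map an int[] back to a text[] when reading
CREATE OR REPLACE FUNCTION decode_array(p_ids integer[])
RETURNS text[] LANGUAGE sql STABLE AS $$
    SELECT array_agg(d.name ORDER BY u.ord)
    FROM unnest(p_ids) WITH ORDINALITY AS u(id, ord)
    JOIN string_dict d USING (id);
$$;

Inserts would then go through encode_array() and reads through decode_array(), but I don't know whether the bookkeeping (and the loss of real referential integrity) is worth the space saved, which is partly why I'm asking.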
2. ZSON? I have no experience using this extension, but it sounds like it does something similar by building a lookup table of frequently used strings. To my understanding, I would need to convert the array column into some kind of JSON document:
{"array_col":[…,"20 torque clutch settings",…]}
GitHub - postgrespro/zson: ZSON is a PostgreSQL extension for transparent JSONB compression
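If I went the ZSON route, I assume the conversion itself would be something along these lines (untested; with the extension installed, the column type would presumably be zson rather than plain jsonb):

ALTER TABLE big_table ADD COLUMN array_col_json jsonb;

UPDATE big_table
SET array_col_json = jsonb_build_object('array_col', to_jsonb(array_col));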
Any advice on how to approach this problem would be greatly appreciated. Do any of the above choices seem reasonable, or is there a better long-term approach in terms of database design? I'm currently using PostgreSQL 14.

If you really want to optimize for storage space, tell PostgreSQL to TOAST (compress, and possibly move out of line) the large column values as soon as a row exceeds 128 bytes:
ALTER TABLE tab SET (toast_tuple_target = 128);
Of course optimizing for space may not be good for performance.
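On PostgreSQL 14 you can also switch the TOAST compression algorithm for that column to LZ4 (this requires a server built with LZ4 support, and only affects newly written values), and pg_column_size() lets you check how much any of this actually saves:

ALTER TABLE tab ALTER COLUMN array_col SET COMPRESSION lz4;

-- compare the on-disk size of a few values before and after rewriting them
SELECT id, pg_column_size(array_col) FROM tab LIMIT 10;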

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' column as the PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) into memory and then return only the first_name?
(In other words, the page that contains the row with id = 10, if it isn't in memory already.)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I unintentionally asked an XY question. Basically, we have tables with hundreds of millions of rows and 100 columns each, and we receive all sorts of SELECT queries on them. The WHERE clause also varies, but no incoming request needs all columns. Many of the cell values are also NULL.
So I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes, compression will also help to save space and hopefully improve performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY (in your case) needs to be accessed; for a million rows, that is about 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
For MariaDB's ColumnStore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located; after getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient to do in a regular RDBMS, but messier in a columnstore. Columnstore is a niche product for which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.

Does a fat table/more columns affect performance in sql

In the data that I have, there are around 1M rows, each with around 60-70 columns. However, only a few rows (20-30) will have the columns beyond 30 filled, i.e., the table is sparse. Also, the columns beyond 30 are rarely queried.
Does "number of columns" impact performance?
Should I make two tables (one with the first 30 columns, and a second table that is the original table), or should I keep the original structure?
Table schema :-
Table entity_table (
entity_id int,
tag_1 text,
tag_2 text,
.
.
.
tag_30 text, --upto col. 30 table is dense
tag_31 text,
.
.
.
tag_70 text --sparse columns
);
Also, does the type of these columns affect performance? Does Postgres index NULL values, and how can I prevent that?
Does "number of columns" impact performance? Short answer is "Yes, but don't worry about it."
More precisely, it eats space and that space has to go to and from disk, eats cache, etc. all of which costs resources. The exact amount of space depends on the column and is available alongside each data type in the postgres docs for data types: https://www.postgresql.org/docs/14/datatype.html
As Frank Heikens commented, a million rows isn't a lot these days. At 70 columns, 8 bytes per column for a million rows you'd be looking at ~560M which will happily fit in memory on a Raspberry PI so shouldn't be that big of a deal.
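If in doubt, it is easy to measure the real footprint instead of estimating it, e.g.:

-- total size of the table including TOAST and indexes
SELECT pg_size_pretty(pg_total_relation_size('entity_table'));

-- average on-disk size of the values in one of the sparse columns
SELECT avg(pg_column_size(tag_31)) FROM entity_table;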
However, when you get to billions or trillions of rows, all those little bytes really start adding up. Hence you might look at:
- Splitting up the table - although if this results in more joins, you could find that overall performance gets worse, not better
- Using smaller column types (e.g. smallint rather than int)
- Reordering columns to reduce alignment padding - see Calculating and saving space in PostgreSQL and the sketch after this list. However, I wouldn't recommend this as a starting point - design for readability first, then performance
- Columnar storage (https://en.wikipedia.org/wiki/Column-oriented_DBMS), for which there are some Postgres options that I don't have direct experience of but are potentially worth looking at, e.g. https://www.buckenhofer.com/2021/01/postgresql-columnar-extension-cstore_fdw/
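To illustrate the reordering point with a small sketch (made-up column names): PostgreSQL pads columns so that each value starts on its type's alignment boundary, so ordering fixed-width columns widest-first wastes fewer bytes per row.

-- 1 byte for the boolean, then 7 padding bytes so the bigint starts on an
-- 8-byte boundary, then 2 more padding bytes before the 4-byte integer
CREATE TABLE events_padded (
    flag    boolean,
    user_id bigint,
    score   smallint,
    clicks  integer
);

-- same columns ordered widest-first: no padding between columns
CREATE TABLE events_packed (
    user_id bigint,
    clicks  integer,
    score   smallint,
    flag    boolean
);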

Best practice for tables with varying content

Currently I am working on a problem where I have to log data in an Oracle 10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point; the devices share some common information, and the rest is device-specific.
So I could either create arrays for every device-specific column, where the corresponding array field gets populated whenever a device is in use:
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space that way, because the arrays would rarely be fully populated.
The other method I can think of would be to reuse the same ID across a multi-row entry and put as many rows into the table as there are devices in use:
ID TIMESTAMP BOARD DEVICE_ID ERROR_CNT TEMP MORE_DATA
1 437892 1 1 100 25 xxx
1 437892 1 2 50 28 yyy
Now the shared information is stored multiple times in the database, and the data for one data point is scattered across multiple rows.
Another issue is that some columns are only used by a subset of the devices, so there might be even more unused fields. So maybe it would be best to create multiple tables, split the devices into groups according to the information they carry, and log their data in the corresponding tables.
I appreciate any help. Maybe I am just being paranoid about wasted DB space and should not worry about it, and simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;

create or replace type error_count_type is table of number;

-- nested-table column, backed by its own storage table
create table device(id number, error_count error_count_type)
    nested table error_count store as error_count_table;

insert into device values(1, error_count_type(10, 20));
commit;

-- unnest the collection with TABLE(); COLUMN_VALUE is the element value
select sum(column_value) error_count
from device
cross join table(error_count);
Not many people or tools understand creating types, store as, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
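For concreteness, the "simple way" could look roughly like this (illustrative names): one header row per data point for the shared values, and one child row per device that was actually in use, so nothing is stored for unused devices.

CREATE TABLE log_entry (
    log_id    NUMBER PRIMARY KEY,
    log_time  TIMESTAMP NOT NULL,
    board     NUMBER NOT NULL
);

CREATE TABLE device_reading (
    log_id     NUMBER NOT NULL REFERENCES log_entry (log_id),
    device_id  NUMBER NOT NULL,
    error_cnt  NUMBER,
    temp       NUMBER,
    more_data  VARCHAR2(100),
    PRIMARY KEY (log_id, device_id)
);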
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with alter table statements and that might affect queries you have already written.

Many-to-Many Relationship Against a Single, Large Table

I have a geometric diagram that consists of 5,000 cells, each of which is an arbitrary polygon. My application will need to save many such diagrams.
I have determined that I need to use a database to make indexed queries against this map. Loading all the map data is far too inefficient for quick responses to simple queries.
I've added the cell data to the database. It has a fairly simple structure:
CREATE TABLE map_cell (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    ...
    PRIMARY KEY (map_id, cell_index)
)
5,000 rows per map is quite a few, but queries should remain efficient into the millions of rows because the main join indexes can be clustered. If it gets too unwieldy, it can be partitioned on map_id bounds. Despite the large number of rows per map, this table would be quite scalable.
The problem comes with storing the data that describes which cells neighbor each other. The cell-neighbor relationship is a many-to-many relationship against the same table. There are also a very large number of such relationships per map. A normalized table would probably look something like this:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    neighbor_index INT,
    ...
    INDEX IX_neighbors (map_id, cell_index)
)
This table requires a surrogate key that will never be used in a join, ever. Also, this table includes duplicate entries: if cell 0 is a neighbor with cell 1, then cell 1 is always a neighbor of cell 0. I can eliminate these entries, at the cost of some extra index space:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    neighbor1 INT NOT NULL,
    neighbor2 INT NOT NULL,
    ...
    INDEX IX_neighbor1 (map_id, neighbor1),
    INDEX IX_neighbor2 (map_id, neighbor2)
)
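(As a side note, with this layout a lookup of all neighbors of one cell has to check both columns, which is why both indexes are there; for some example map_id/cell values:)

SELECT neighbor2 AS neighbor
FROM map_cell_neighbors
WHERE map_id = 1 AND neighbor1 = 42
UNION ALL
SELECT neighbor1
FROM map_cell_neighbors
WHERE map_id = 1 AND neighbor2 = 42;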
I'm not sure which one would be considered more "normalized", since option 1 includes duplicate entries (including duplicating any properties the relationship has), and option 2 is some pretty weird database design that just doesn't feel normalized. Neither option is very space efficient. For 10 maps, option 1 used 300,000 rows taking up 12M of file space; option 2 was 150,000 rows taking up 8M of file space. On both tables, the indexes take up more space than the data: the data should be about 20 bytes per row, but each row actually takes 40-50 bytes on disk.
The third option wouldn't be normalized at all, but would be incredibly space- and row-efficient. It involves putting a VARBINARY field in map_cell, and storing a binary-packed list of neighbors in the cell table itself. This would take 24-36 bytes per cell, rather than 40-50 bytes per relationship. It would also reduce the overall number of rows, and the queries against the cell table would be very fast due to the clustered primary key. However, performing a join against this data would be impossible. Any recursive queries would have to be done one step at a time. Also, this is just really ugly database design.
Unfortunately, I need my application to scale well and not hit SQL bottlenecks with just 50 maps. Unless I can think of something else, the latter option might be the only one that really works. Before I committed such a vile idea to code, I wanted to make sure I was looking clearly at all the options. There may be another design pattern I'm not thinking of, or maybe the problems I'm foreseeing aren't as bad as they appear. Either way, I wanted to get other people's input before pushing too far into this.
The most complex queries against this data will be path-finding and discovery of paths. These will be recursive queries that start at a specific cell and that travel through neighbors over several iterations and collect/compare properties of these cells. I'm pretty sure I can't do all this in SQL, there will likely be some application code throughout. I'd like to be able to perform queries like this of moderate size, and get results in an acceptable amount of time to feel "responsive" to user, about a second. The overall goal is to keep large table sizes from causing repeated queries or fixed-depth recursive queries from taking several seconds or more.
Not sure which database you are using, but you seem to be re-inventing what a spatially enabled database already supports.
If SQL Server, for example, is an option, you could store your polygons as geometry types, use the built-in spatial indexing, and the OGC-compliant methods such as STContains, STCrosses, STOverlaps, and STTouches.
SQL Server spatial indexes, after decomposing the polygons into various b-tree layers, also use tessellation to index which neighboring cells a given polygon touches at a given layer of the tree index.
There are other mainstream databases that support spatial types as well, including MySQL.
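As a rough sketch of what that could look like in SQL Server (names are illustrative, and the bounding box needs to match your coordinate range):

CREATE TABLE map_cell_geo (
    map_id     INT NOT NULL,
    cell_index INT NOT NULL,
    shape      geometry NOT NULL,
    CONSTRAINT PK_map_cell_geo PRIMARY KEY (map_id, cell_index)
);

-- planar (geometry) spatial indexes require an explicit bounding box
CREATE SPATIAL INDEX SIX_map_cell_geo_shape
    ON map_cell_geo (shape)
    WITH (BOUNDING_BOX = (0, 0, 1000, 1000));

-- neighbors of one cell = polygons on the same map whose boundaries touch it
SELECT n.cell_index
FROM map_cell_geo AS c
JOIN map_cell_geo AS n
    ON n.map_id = c.map_id
   AND n.shape.STTouches(c.shape) = 1
WHERE c.map_id = 1
  AND c.cell_index = 42;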

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can store the text directly in the database. If I instead use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems, however, have an enum type which is meant for cases like yours: in the query you compare the field value against a fixed set of literals, while internally it is stored as an integer.
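For example, in PostgreSQL that might look like this (a sketch with made-up table and column names):

CREATE TYPE game_outcome AS ENUM ('Win', 'Lose', 'Incomplete', 'Forfeit');

CREATE TABLE game_result (
    id      bigserial PRIMARY KEY,
    outcome game_outcome NOT NULL
);

-- the WHERE clause still reads like text, but the value is stored and
-- compared as a small fixed-size internal code, and typos are rejected
SELECT count(*) FROM game_result WHERE outcome = 'Win';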
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly, if your data structures are shorter, they are faster to compare and faster to store and retrieve.
How much faster? 1x, 2x, 1000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is at best as fast as comparing bytes, and at worst much slower.
There is a second huge issue with storing text where you intended to have an enum: what happens when people start storing "Incompete" instead of "Incomplete"?
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.