Does a fat table / more columns affect performance in SQL?

In the data that I have, there are around 1M rows, each with around 60-70 columns. However, only a few rows (20-30) have the columns beyond 30 filled, i.e., the table is sparse. Also, the columns beyond 30 are rarely queried.
Does "number of columns" impact performance?
Should I make two tables, one with the first 30 columns and a second one that is the original table?
Or should I keep the original structure?
Table schema:
CREATE TABLE entity_table (
    entity_id int,
    tag_1     text,
    tag_2     text,
    -- ...
    tag_30    text,  -- up to column 30 the table is dense
    tag_31    text,
    -- ...
    tag_70    text   -- sparse columns
);
Also, does the type of these columns affect performance?
Does Postgres index NULL values, and how can I prevent that?

Does "number of columns" impact performance? Short answer is "Yes, but don't worry about it."
More precisely, extra columns eat space, and that space has to go to and from disk, occupies cache, etc., all of which costs resources. The exact amount of space depends on the column type and is listed alongside each data type in the Postgres documentation: https://www.postgresql.org/docs/14/datatype.html
As Frank Heikens commented, a million rows isn't a lot these days. At 70 columns and 8 bytes per column for a million rows, you'd be looking at ~560 MB, which will happily fit in memory on a Raspberry Pi, so it shouldn't be that big a deal.
However, when you get to billions or trillions of rows, all those little bytes really start adding up. Hence you might look at:
Splitting up the table (see the sketch below) - however, if this results in more joins, you could find the overall performance gets worse, not better
Using smaller column types (e.g. smallint rather than int)
Reordering columns - see Calculating and saving space in PostgreSQL. However, I wouldn't recommend this as a starting point; design for readability first, then performance
Columnar storage (https://en.wikipedia.org/wiki/Column-oriented_DBMS), for which there are some Postgres options that I don't have direct experience of but which are potentially worth looking at, e.g. https://www.buckenhofer.com/2021/01/postgresql-columnar-extension-cstore_fdw/
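If you do try the split mentioned above, a simple vertical split keyed on entity_id is usually enough. A minimal sketch, reusing the column names from the question (the table names are hypothetical):

CREATE TABLE entity_dense (
    entity_id int PRIMARY KEY,
    tag_1     text,
    -- ... tag_2 through tag_29 ...
    tag_30    text
);

-- Only entities that actually have data beyond tag_30 get a row here
CREATE TABLE entity_sparse (
    entity_id int PRIMARY KEY REFERENCES entity_dense (entity_id),
    tag_31    text,
    -- ... tag_32 through tag_69 ...
    tag_70    text
);

-- The rare queries that need the sparse columns join the halves back together
SELECT d.tag_1, s.tag_31
FROM entity_dense d
LEFT JOIN entity_sparse s USING (entity_id)
WHERE d.entity_id = 42;

On the side question about NULLs: Postgres B-tree indexes do store NULL entries by default. If you only ever search a sparse column for non-NULL values, a partial index is one way to keep the NULLs (and most of the table) out of it:

-- Index only the rows where tag_31 actually has a value
CREATE INDEX entity_table_tag_31_idx
    ON entity_table (tag_31)
    WHERE tag_31 IS NOT NULL;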

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words, load the page that contains the row with id = 10, if it isn't in memory already.)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Compression will help too, saving space and hopefully improving performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY in your case needs to be accessed; for a million rows, that is about 3 blocks (one per level). Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's ColumnStore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space for both simple and complex queries.
Your simple query is easy and efficient to do in a regular RDBMS, but messier to do in a columnstore. Columnstore is a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
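If the goal for this particular query is to avoid reading the wide rows at all, the usual row-store tool is a covering index that contains everything the query touches. A sketch in SQL Server syntax (the index name is hypothetical):

-- A narrow nonclustered index that covers the query, so the wide
-- clustered-index row never has to be read
CREATE NONCLUSTERED INDEX IX_EMP_id_first_name
    ON EMP (id)
    INCLUDE (first_name);

SELECT first_name FROM EMP WHERE id = 10;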

How wide can a PostgreSQL table be before I run into issues?

According to the documented limits, Postgres supports up to 1600 columns per table.
https://www.postgresql.org/docs/current/limits.html
I understand that it's bad practice to have so many columns but what are the consequences of approaching this limit?
For example, will a table with 200 columns perform fine in an application? How can you tell when you're approaching too many columns for a given table?
The hard limit is that a table row has to fit inside a single 8kB block.
The "soft limits" you encounter with many columns are
writing the SELECT list becomes more and more annoying (never, ever, use SELECT *)
each UPDATE has to write a large row version, so lots of data churn
extracting the 603rd column from a row requires skipping the previous 602 columns, which is a performance hit
it is plain annoying if the output of \d is 50 pages long
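A quick way to see how close an existing table's rows get to that 8 kB limit is to measure the heap directly; a rough sketch with a hypothetical table name (note that wide values get TOASTed out of line, and only a short pointer counts toward the per-row limit):

-- Rough average bytes per row in the main heap
SELECT pg_size_pretty(pg_relation_size('wide_table'))       AS heap_size,
       pg_relation_size('wide_table') / NULLIF(count(*), 0) AS avg_bytes_per_row
FROM wide_table;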

Efficient storage of a text array column in PostgreSQL

My question is concerned primarily with database storage/space optimization. I have a table in my database that has
the following columns:
id : PRIMARY KEY INTEGER
array_col : UNIQUE TEXT[]
This table is - by far - the largest in the database (in terms of storage space) and contains about 200 million records. The array_col column has a few characteristics which make me suspect that I am not storing it in a very space-optimal manner. They are as follows:
The majority of the strings are fairly long (25 characters on average)
The length of the text array is variable (typically 100+ strings per array)
The individual strings repeat with decent frequency across records; on average, a given string appears in several thousand other records. (The array order tends to be similar across records too.)
id | array_col
1  | […, "20 torque clutch settings", …]
2  | […, "20 torque clutch settings", …]
3  | […, "20 torque clutch settings", …]
…  | …
The above table shows values repeating across records.
I do not want to normalize this table, because treating the text array as an atomic unit is the most useful approach for my application, and it also makes querying much simpler. I also care about the ordering of the strings in the array.
I can think of two approaches to this problem:
Create a lookup table to avoid repeating strings. The assumption here is that an INT[] is probably more space-efficient than a TEXT[].
Table 1
id | array_col
1  | […, 47, …]
2  | […, 47, …]
3  | […, 47, …]
…  | …
Table 2
id | name
…  | …
47 | "20 torque clutch settings"
…  | …
Problem: PostgreSQL, to my knowledge, does not support arrays of foreign keys. I'm also not sure what a trigger or stored procedure for this would look like. Database consistency would probably become more of a concern for me too.
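The workaround I can think of is to skip element-level foreign keys (or enforce them with a trigger) and rebuild the text arrays on read; a rough sketch of what I mean, with made-up names, using unnest ... WITH ORDINALITY so the original array order is preserved:

-- Dictionary of distinct strings
CREATE TABLE strings (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text UNIQUE NOT NULL
);

-- Main table stores arrays of dictionary ids instead of repeated text
CREATE TABLE records (
    id        integer PRIMARY KEY,
    array_col integer[] UNIQUE NOT NULL  -- element references are not enforced
);

-- Rebuild the original text array, preserving element order
SELECT r.id,
       array_agg(s.name ORDER BY u.ord) AS array_col_text
FROM records r
CROSS JOIN LATERAL unnest(r.array_col) WITH ORDINALITY AS u(string_id, ord)
JOIN strings s ON s.id = u.string_id
GROUP BY r.id;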
ZSON? I have no experience using this extension, but it sounds like it does something similar by creating a lookup table of frequently used strings. To my understanding, I would need to convert the array column to some kind of JSON value:
{"array_col": […, "20 torque clutch settings", …]}
GitHub - postgrespro/zson: ZSON is a PostgreSQL extension for transparent JSONB compression
Any advice on how to approach this problem would be greatly appreciated. Do any of the above choices seem reasonable, or is there a better long-term approach in terms of database design? I'm currently using PostgreSQL 14 for this.
If you really want to optimize for storage space, tell PostgreSQL to compress the column whenever it exceeds 128 bytes:
ALTER TABLE tab SET (toast_tuple_target = 128);
Of course optimizing for space may not be good for performance.
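Whether the lower threshold pays off for a given dataset is easy to check by comparing the table's total size (heap plus TOAST table) before and after loading the same data; a sketch, reusing the table name from the answer above:

-- Total on-disk size of the table, its TOAST table and its indexes
SELECT pg_size_pretty(pg_total_relation_size('tab')) AS total_size;

-- Size of the main heap alone, for comparison
SELECT pg_size_pretty(pg_relation_size('tab')) AS heap_size;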

How to store large currency like data in database

My data is similar to currency in many aspects so I will use it for demonstration.
I have 10-15 different groups of data; we can say different currencies, like dollars or euros.
They need to have these columns:
timestamp INT PRIMARY KEY
value INT
Each of them will have more than 1 billion rows, and I will append new rows as time passes.
I will just select them over some intervals and create graphs, probably with multiple currencies in the same graph.
The question is: should I add a group column and store everything in one table, or keep them separate? If they are in the same table, the timestamp will not be unique anymore, and I will probably need more advanced SQL techniques to keep it efficient.
10 - 15 "currencies"? 1 billion rows each? Consider list partitioning in Postgres 11 or later. This way, the timestamp column stays unique per partition. (Although I am not sure why that is a necessity.)
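A minimal sketch of that layout (all names hypothetical):

CREATE TABLE currency_value (
    currency text NOT NULL,
    ts       int  NOT NULL,
    value    int  NOT NULL,
    PRIMARY KEY (currency, ts)   -- the partition key must be part of the PK
) PARTITION BY LIST (currency);

CREATE TABLE currency_value_usd PARTITION OF currency_value FOR VALUES IN ('USD');
CREATE TABLE currency_value_eur PARTITION OF currency_value FOR VALUES IN ('EUR');
-- ... one partition per "currency" ...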
Or simply have 10 - 15 separate tables without storing the "currency" redundantly per row. Size matters with this many rows.
Or, if you typically have multiple values (one for each "currency") for the same timestamp, you might use a single table with 10-15 dedicated "currency" columns. Much smaller overall, as it saves the tuple overhead for each "currency" (28 bytes per row or more). See:
Making sense of Postgres row sizes
The practicality of a single row for multiple "currencies" depends on detailed specs. For example, it might not work so well with many updates on individual values.
You added:
I have read about clustered indexes, which order data physically on disk. I will not insert new rows in the middle of the table.
That seems like a perfect use case for BRIN indexes, which are dramatically smaller than their B-tree relatives. Typically a bit slower, but with your setup maybe even faster. Related:
How do I improve date-based query performance on a large table?
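The BRIN index itself is a one-liner; a sketch against the hypothetical table above:

-- BRIN stores one small summary entry per range of heap blocks, so it stays
-- tiny even at billions of rows, and works well when ts follows insert order
CREATE INDEX currency_value_ts_brin ON currency_value USING brin (ts);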

Large Denormalized Table Optimization

I have a single large denormalized table that mirrors the makeup of a fixed-length flat file that is loaded yearly: 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the WHERE clause of the query that is run most often against this table. Index fragmentation is 0.01. Performance on that query is good, sub-second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not a seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table, what else can I do to improve performance?
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
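For reference, the OFFSET ... FETCH form of that pagination in SQL Server looks like this (column and variable names hypothetical; the ORDER BY should be a deterministic key, e.g. the clustered-index columns):

DECLARE @PageNumber int = 1,
        @PageSize   int = 100;

SELECT *
FROM dbo.FlatFileData
ORDER BY KeyCol1, KeyCol2, KeyCol3
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;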
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
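The first idea is a per-column type change; a hedged sketch of what it looks like (table and column names hypothetical, and it rewrites data, so test on a copy first):

-- Only safe if the column never needs characters outside the database's code page;
-- re-state NULL/NOT NULL so the nullability isn't accidentally changed
ALTER TABLE dbo.FlatFileData
    ALTER COLUMN SomeField varchar(50) NOT NULL;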
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?