I need to save about 500 values in a structured database (SQL, PostgreSQL) or similar. What is the best way to store the data: 500 separate fields, or a single field holding comma-separated values (CSV)?
What would the pros and cons be?
Which would be easier to maintain?
Which would be better for retrieving the data?
A comma-separated string is just about never the right way to store values.
The traditional SQL method would be a junction or association table, with one row per field and per entity (see the sketch after the list below). This multiplies the number of rows, but that is okay, because databases can handle big tables. This approach has several advantages:
Foreign key relationships can be properly defined.
The correct type can be implemented for the object.
Check constraints are more naturally written.
Indexes can be built on the column, improving performance.
Queries do not need to depend on string functions (which might be slow).
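For concreteness, here is a minimal sketch of what such an association table could look like; the table and column names, and the numeric value type, are assumptions for illustration:
CREATE TABLE entities (
    entity_id serial PRIMARY KEY
);
CREATE TABLE entity_values (
    entity_id  integer NOT NULL REFERENCES entities (entity_id),
    field_name text    NOT NULL,
    value      numeric,                     -- a proper type instead of a string
    PRIMARY KEY (entity_id, field_name)     -- one row per field and per entity
);
CREATE INDEX ON entity_values (field_name, value);   -- supports fast lookups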
Postgres also supports two other methods for such data, arrays and JSON-encoding. Under some circumstances one or the other might be appropriate as well. A comma-separated string would almost never be the right choice.
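For comparison, a minimal sketch of those two alternatives, again with made-up names; which one is appropriate depends on how the values will be queried:
CREATE TABLE entity_values_alt (
    entity_id serial PRIMARY KEY,
    vals_arr  numeric[],   -- array column: keeps order, supports @> and ANY()
    vals_json jsonb        -- JSON encoding: allows named values and GIN indexing
);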
Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) which have a one-to-one relationship with a user.
Since SQL databases don't store complex structs in table fields (some allow JSON fields with no fixed format), is it OK to just add such tables, allowing individual (complex) data to be stored for each user? Will it hurt performance by requiring more joins in queries?
And in the distributed-database case, will those (related) tables be stored arbitrarily on different nodes, causing more cross-node requests and reducing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a, b, c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM; it just needs a little extra code.
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
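As a hedged sketch of that idea in CockroachDB (all table, column, and family names here are made up), a wide users table can be split into column families so settings and statistics live in the same rows but can be written independently:
CREATE TABLE users (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name       STRING,
    setting_a  STRING,
    setting_b  STRING,
    stat_count INT,
    FAMILY core     (id, name),
    FAMILY settings (setting_a, setting_b),
    FAMILY stats    (stat_count)
);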
The main downside of using JSON columns is that they're harder to query efficiently: if you want all users with a certain setting, or want to know just one setting for a user, you're going to get at least a minor performance hit if the database has to parse a JSON column to figure that out, or you have to fetch the entire blob and do it in your app. If they're more convenient for other reasons, though, you can work around this by adding inverted indexes on your JSON columns, or expression indexes on the specific values you're interested in. Indexes can have a similar cost to 1:1 joins, but you can mitigate that in CockroachDB by using the STORING keyword to tell the DB to write a copy of all the user columns to the index.
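To make that concrete, here are hedged sketches of those indexing options, assuming a settings JSONB column on users (the column and index names are illustrative):
CREATE INVERTED INDEX users_settings_inv ON users (settings);      -- CockroachDB
-- CREATE INDEX users_settings_gin ON users USING GIN (settings);  -- PostgreSQL equivalent

-- An expression index on one specific setting you query often
-- (PostgreSQL syntax; recent CockroachDB versions support this too):
CREATE INDEX users_theme_idx ON users ((settings->>'theme'));

-- CockroachDB's STORING clause copies extra columns into an index so the
-- lookup doesn't have to join back to the primary row:
CREATE INDEX users_setting_a_idx ON users (setting_a) STORING (name);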
The data I want to store has these characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs its own fields);
Here's how it would look in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Below are the ideas I already had:
Keep it exactly as in the regular table shown above (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought of) better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition with an = comparison, a single GIN index on a jsonb column might be better than many B-tree indexes, because the cost of maintaining all those B-tree indexes would be higher.
The definitive answer depends on the SQL statements that you plan to run on that table.
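As a sketch of that trade-off (table and column names are assumptions), a single GIN index can serve equality searches on any key inside the jsonb column, where the plain-column layout would need one B-tree index per searched column:
CREATE TABLE measurements (
    id       bigserial PRIMARY KEY,
    category text NOT NULL,
    attrs    jsonb
);

-- One GIN index covers = searches on any key inside attrs:
CREATE INDEX measurements_attrs_gin ON measurements USING GIN (attrs jsonb_path_ops);

-- Containment queries like this can use it:
SELECT * FROM measurements WHERE attrs @> '{"color": "red"}';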
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Ease of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility: it is easy to add new columns. But you don't need that. JSON also has a cost in data size and in data-type support (notably, JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
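A minimal sketch of that last point, with invented names: the common columns stay in one table, a category-specific table hangs off it, and a view presents the combination as if it were the single wide table:
CREATE TABLE base_data (
    id       bigint PRIMARY KEY,
    category text NOT NULL
);

CREATE TABLE category_a_data (
    id      bigint PRIMARY KEY REFERENCES base_data (id),
    field_x integer,
    field_y text
);

CREATE VIEW base_data_wide AS
SELECT b.id, b.category, a.field_x, a.field_y
FROM base_data b
LEFT JOIN category_a_data a ON a.id = b.id;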
It depends on how much data you want to store, but as long as it is finite, it shouldn't make a big difference whether it contains a lot of NULLs or not.
Let's say you have a large number of fields, in various tables, which hold integer codes that must be cross-referenced against another table to get the textual representation of those codes - i.e. essentially an enumeration. Each of these codes - which appear in a number of disparate tables - would then have a foreign key against wherever the enumeration values are stored.
There are two main options:
Store all of the enumerations in one big table which defines all enumerations, and then has some column which specifies the enumeration type.
Store each enumeration definition in an isolated, separate table.
Which is the better way to go, especially with regards to performance? The database in question receives a large number of INSERTs and DELETEs and relatively fewer reads.
It depends. Separate tables have a big advantage. You can define foreign key relationships that enforce the type of the column for the referencing tables.
A second advantage is that there might be different data columns for different types. For instance, a countries table might have ISO2 and ISO3 codes and currency. A cities table might have a timezone.
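A hedged sketch of that separate-table layout (names and columns are only illustrative): each enumeration gets its own table with its own extra columns, and referencing tables get a foreign key of the right type:
CREATE TABLE countries (
    country_code char(2) PRIMARY KEY,   -- ISO2
    iso3         char(3) NOT NULL UNIQUE,
    currency     char(3)
);

CREATE TABLE cities (
    city_id  serial PRIMARY KEY,
    name     text NOT NULL,
    timezone text
);

-- A referencing table can now only contain codes of the correct type:
CREATE TABLE offices (
    office_id    serial PRIMARY KEY,
    country_code char(2) REFERENCES countries (country_code),
    city_id      integer REFERENCES cities (city_id)
);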
One occasion where a single table can be handy is for internationalization. For translating values into separate languages, I find it convenient to have them all in one place.
There is also a space advantage for a single table. Tables in SQL are stored on pages -- and many reference tables will be smaller than one page. That leaves a lot of unused space. Storing them in one table "compacts" them, eliminating that space. However, that is rarely a real consideration in the modern world.
In general, though, you would use separate tables unless you had a compelling reason to use a single table.
What are the performance implications in postgres of using an array to store values as compared to creating another table to store the values with a has-many relationship?
I have one table that needs to be able to store anywhere from about 1-100 different string values in either an array column or a separate table. These values will need to be frequently searched for exact matches, so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
These values will need to be frequently searched
Searched how? This is crucial.
Prefix pattern matches only? Infix/suffix pattern matches too? Fuzzy string search / similarity matching? Stemming and normalization for root words, de-pluralization? Synonym search? Is the data character sequences or natural-language text? One language, or multiple different languages?
Hand-waving around "searched" makes any answer that ignores that part pretty much invalid.
so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
Impossible to be strictly sure without proper info on the data you're searching.
Searching text fields is much more flexible, giving you many options you don't have with an array search. It also generally reduces the amount of data that must be read.
In general, I strongly second Clodaldo: Design it right. Optimize later, if you need to.
According to the official PostgreSQL reference documentation (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-SEARCHING), searching for specific elements in a table is expected to perform better than searching in an array:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
The reason for the worse search performance on array elements compared to tables could be that arrays are internally stored as strings, as stated here (https://www.postgresql.org/message-id/op.swbsduk5v14azh%40oren-mazors-computer.local):
the array is actually stored as a string by postgres. a string that happens to have lots of brackets in it.
although I could not corroborate this statement with any official PostgreSQL documentation. I also do not have any evidence that handling well-structured strings is necessarily less performant than handling tables.
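To make the comparison concrete, here is a sketch of both layouts for exact-match lookups, with made-up names; either can be indexed, a GIN index for the array and a plain B-tree for the child table:
-- Array column:
CREATE TABLE records (
    id   bigserial PRIMARY KEY,
    tags text[]
);
CREATE INDEX records_tags_gin ON records USING GIN (tags);
SELECT id FROM records WHERE tags @> ARRAY['exact-value'];

-- Separate table with a has-many relationship:
CREATE TABLE record_tags (
    record_id bigint NOT NULL REFERENCES records (id),
    tag       text   NOT NULL,
    PRIMARY KEY (record_id, tag)
);
CREATE INDEX record_tags_tag_idx ON record_tags (tag);
SELECT record_id FROM record_tags WHERE tag = 'exact-value';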
I have two lists of words and I need to find matches (intersection of the two sets.)
Should I store each list as a string and find matches through string functions (like a regular expression) or store the words in a table, and have SQL find matches by joining?
It is almost impossible to say without more information about the problem. Here are some things to consider:
How many different distinct items do you have?
How many different combinations would be on a typical row?
Do your searches require looking for wildcards?
How long are the individual items?
Specifics on the database engine and hardware you are running on.
I want to emphasize that in almost all situations, you want to store the values in another table. Performance is not necessarily the primary reason. More important are ease of updating and deleting individual values, and the ability to support many more types of queries (such as a list of all available values).
But, we can still think about the performance issues. Storing values in a single string simply requires fetching the page with the record on it, and then applying a function that goes through the string. For simple patterns (such as identifying the presence of a fixed substring), this should go quite fast. There are few things that computers do faster than looping through strings and comparing values (assuming a reasonable implementation).
In the fastest possible join, both tables need to be read in, and the keys need to be matched. This requires additional effort. The situation is even worse, because you really want two additional tables, one for the individual string items and the other for the relationship between the original records and the items.
At this point, you may think "gosh, strings seem like a better idea". This is wrong. One of the big differences is average size. If your items are, on average, longer than, say, 4 characters, then you save space by using a reference table. This saved space immediately translates into improved performance, because there is less I/O. With indexes, the additional tables would be in memory anyway, so the matching would be quite fast.
And, there is the issue of querying. You can use standard SQL functions for queries such as records that have A and B (many string functions are database specific). You can easily find out exactly which items are in the database, and relatively easily find what pairs exist on records. You can keep track of when an item is added to a record, and the first time it appears in the database. Generally, this flexible functionality -- which is just basic SQL functionality -- is what you need when managing this type of data.
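A sketch of that three-table layout and the kind of query it enables, with hypothetical names; the HAVING clause finds records that contain both A and B:
CREATE TABLE items (
    item_id serial PRIMARY KEY,
    value   text NOT NULL UNIQUE
);

CREATE TABLE record_items (
    record_id integer NOT NULL,
    item_id   integer NOT NULL REFERENCES items (item_id),
    PRIMARY KEY (record_id, item_id)
);

-- Records that have both item A and item B:
SELECT ri.record_id
FROM record_items ri
JOIN items i ON i.item_id = ri.item_id
WHERE i.value IN ('A', 'B')
GROUP BY ri.record_id
HAVING COUNT(DISTINCT i.value) = 2;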
Storing the words in a table will be much faster than SQL string-manipulation functions in most circumstances, especially if you can index the words.
I think you're asking if this:
SELECT word FROM table_one WHERE word IN (SELECT word FROM table_two)
is faster than this:
SELECT table_one.word FROM table_one
INNER JOIN table_two ON table_one.word = table_two.word
The first query should be faster, because the second may create a (potentially large) intermediate result (the joined rows).
Note that I assume you have an index on word. Also: if the strings are very long (URLs, for example), this will be very slow, and you should match on a hash instead.
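One way to do the hash matching in PostgreSQL, sketched with the table and column names from the queries above (the md5 choice is just an assumption; any stable hash works): index the hash of the long value and compare hashes first, keeping the full comparison as a collision guard:
CREATE INDEX table_one_word_md5 ON table_one (md5(word));
CREATE INDEX table_two_word_md5 ON table_two (md5(word));

SELECT t1.word
FROM table_one t1
JOIN table_two t2
  ON md5(t1.word) = md5(t2.word)   -- fast, index-assisted comparison
 AND t1.word = t2.word;            -- guards against hash collisions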