From other answers on many columns vs many rows (or tables), it seems columns are more performant for normalized data. What about serialized data?
I'm going to store many in-progress web forms, i.e. not yet validated, just a dump of what the user has entered so far so they can continue in another session. The forms will be serialized as json and stored in a jsonb column. There are currently ten forms, but (many) more will be added in the future.
Is it better to have one column with a user id and a column for each form:
CREATE TABLE "forms" (
"user_id" uuid NOT NULL,
"form_a" jsonb,
"form_b" jsonb,
"form_c" jsonb,
...
)
or many rows with a user uuid, form id, and form json columns:
CREATE TABLE "forms" (
"user_id" uuid NOT NULL,
"form_id" uuid NOT NULL,
"form_json" jsonb NOT NULL
)
I'm sure querying for just one row is faster, but what about updating a column in a row with many jsonb columns? Or adding a new jsonb column to a table with millions of rows? At what point does the balance tip in favor of many rows?
thanks!
If new forms are introduced only during maintenance windows (upgrades), you might get away with using the first method.
If new forms can be introduced during normal operation, that would cause problems:
ALTER TABLE blocks and is blocked by all concurrent data modifying statements, which can be a problem.
You need to be the table owner or a superuser to run ALTER TABLE, but for security reasons it is better if your application user is not the table owner.
Increased data volume for UPDATE is not a consideration, because as the documentation says:
During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.
I think that the second design is cleaner, and the slightly more complicated query will not be noticeably more expensive if you have the right indexes.
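For example, a minimal sketch of the second design with a composite primary key, which also serves as the index for per-user and per-form lookups (the key choice is a suggestion, not from the question):
CREATE TABLE "forms" (
    "user_id"   uuid  NOT NULL,
    "form_id"   uuid  NOT NULL,
    "form_json" jsonb NOT NULL,
    PRIMARY KEY ("user_id", "form_id")
);
A query like SELECT "form_json" FROM "forms" WHERE "user_id" = $1 AND "form_id" = $2 is then a single index lookup.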
Is json suitable for holding values coming from dynamic columns?
I want to store data about documents where the structure is known in advance, e.g. Name, Year, Place, etc. For now, I have defined these fields as columns in a table in MS SQL.
However, I would like to store data about "dynamic" documents, where the user selects fields from a certain pool and inserts values. For now, I store this data as json in one column. For example, one document might be something like this: '{"Name":"John Doe","Place":"Chicago"}'
and another might only be: '{"Name": "John Doe"}'
I wonder how to unify this data. I thought that data from the first type of document (with a known number of columns) should also be stored as json. But I don't know if this is a good approach with large amounts of data, e.g. 100,000 records.
First, 100,000 rows is not a large number of rows. It is doubtful that you are even talking about a gigabyte of data.
Both XML and JSON incur overhead for storing field names in the data. If you have lots of repeated field names, then lots of redundant field names are being stored. And bigger rows slow down queries.
JSON and XML also have challenges in verifying the field names. This can be handled through the application or constraints. That said, they can be quite useful when the attributes are rarely used.
Your sample data simply suggests NULLable columns. You can have name and place. If there is no place then the value is NULL. Given that you have a fixed pool, there is a good chance that this structure is the simplest and most efficient. The only downside is that adding a new field requires adding a column to the table. And that can be an expensive operation.
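A minimal sketch of that NULLable-columns option, using the fields mentioned in the question (table and column names are illustrative):
CREATE TABLE Documents (
    DocumentId INT IDENTITY(1, 1) PRIMARY KEY,
    Name   VARCHAR(255) NOT NULL,
    [Year] INT NULL,             -- NULL when the document has no year
    Place  VARCHAR(255) NULL     -- NULL when the document has no place
);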
An alternative is an EAV model, which is mentioned in the comments. This solves the problem of repeating the names, because you can use ids instead. So, you could have:
create table optionalFields (
    optionalFieldId int identity(1, 1) primary key,
    name varchar(255)
);
create table userOptionalFields (
    userOptionalFieldId int identity(1, 1) primary key,
    userId int references users(userId),
    optionalFieldId int references optionalFields(optionalFieldId),
    value varchar(255)
);
The downside to an EAV model is that it is simplest when all the values are strings, and that can be a little tricky if some of the values are numbers or dates. On the positive side, the database ensures that the fields are valid.
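To read the EAV rows back in column form, a hedged sketch (assuming the Name/Place fields from the question and the two tables above):
SELECT uof.userId,
       MAX(CASE WHEN f.name = 'Name'  THEN uof.value END) AS Name,
       MAX(CASE WHEN f.name = 'Place' THEN uof.value END) AS Place
FROM userOptionalFields uof
JOIN optionalFields f ON f.optionalFieldId = uof.optionalFieldId
GROUP BY uof.userId;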
The choice between the different data models depends on factors such as:
The data type of the values.
The total number of fields.
How often new fields are added.
Whether the fields (for a given user) are updated and if so, if they are updated one-by-one or all-at-once.
How common fields are for a given user.
How familiar you are with XML and JSON.
Whether field names have synonyms (for instance "FullName" for "Name").
Whether field names ever change. For instance, might "Name" suddenly become "FullName"?
And no doubt other issues as well.
Is there a way in Microsoft SQL to reference a specific item of data based on table, column and record?
For example, table A (COL1 INT, COL2 INT) has 2 records (1,2) and (3,4). Can I somehow capture value 4 by reference, rather than as "4"?
The purpose is to allow me to create an audit method that can point to specific value in a (table, column, record) without having to duplicate that value in my audit table (which could be large, therefore bloating my database size).
I am thinking ... just like Object_Id identifies a particular SQL object, so would this reference (some kind of GUID, perhaps?) identify a specific piece of data.
Many thanks in advance.
The answer is No. In MS SQL (and, as far as I know, in other popular databases) there are no such references to specific values.
Moreover, even table rows in MS SQL do not have embedded unique identifiers, unless you take care to create an IDENTITY column.
You can implement such references yourself. For example, create a table with the columns
data_id,
table_name,
row_id,
column_name
and fill it up every time you need a reference. Then you can refer to a piece of data by its data_id.
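A rough sketch of such a reference table (the name data_refs is just illustrative):
CREATE TABLE data_refs (
    data_id     INT IDENTITY(1, 1) PRIMARY KEY, -- the handle your audit table stores
    table_name  SYSNAME NOT NULL,               -- table holding the value
    row_id      INT NOT NULL,                   -- IDENTITY value of the referenced row
    column_name SYSNAME NOT NULL                -- column holding the value
);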
But this is not a good solution:
in most cases, a single entry in this table will consume more space than the referenced data value itself
to get the values you still have to use dynamic SQL
this will only work for tables that have an IDENTITY column, and only if that column has the same name in all tables
and so on
I am writing a program that recovers structured data as individual records from a (damaged) file and collects the results into a sqlite database.
The program is invoked several times with slightly different recovery parameters. That leads to recovering mostly the same, but sometimes different, data from the file.
Now, every time I run my program with different parameters, it is supposed to add only the newly found (different) items to the same database.
That means that I need a fast way to tell whether each recovered record is already present in the DB, in order to add only those that are not in the DB yet.
I understand that for each record I want to add, I could first do a SELECT on all columns to see if there is already a matching record in the DB, and only add the new one if no match is found.
But since I'm adding 10000s of records, doing a SELECT for each of these records feels pretty inefficient (slow) to me.
I wonder if there's a smarter way to handle this? I.e., is there a way I can tell sqlite that I do not want duplicate entries, so that it automatically detects and rejects them? I know about the UNIQUE modifier, but that's not it because it applies to single columns only, doesn't it? I'd need to be able to say that the combination of COL1+COL2+COL3 must be unique. Is there a way to do that?
Note: I never want to update any existing records. I only want to collect a set of different records.
Bonus part - performance
In a classic programming language, I'd use a key-value dictionary where the key is the sum of all of a record's values. Similarly, I could calculate a hash code for each added record and look that hash code up first. If there's no match, then the record is surely not in the DB yet; if there is a match, I'd still have to search the DB for any duplicates. That'd surely be faster already, but I still wonder if sqlite can make this more efficient.
Try:
sqlite> create table foo (
...> a int,
...> b int,
...> unique(a, b)
...> );
sqlite>
sqlite> insert into foo values(1, 2);
sqlite> insert into foo values(2, 1);
sqlite> insert into foo values(1, 2);
Error: columns a, b are not unique
sqlite>
You could use a UNIQUE column constraint, or to declare a multi-column unique constraint you can use UNIQUE (...) ON CONFLICT:
CREATE TABLE name ( id int, col_name1 type, col_name2 type, UNIQUE (col_name1, col_name2) ON CONFLICT IGNORE )
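A small sketch of that ignore-on-conflict behavior (the table and column names are just illustrative):
CREATE TABLE recovered (
    col1 INT,
    col2 INT,
    col3 TEXT,
    UNIQUE (col1, col2, col3) ON CONFLICT IGNORE
);
INSERT INTO recovered VALUES (1, 2, 'x');
INSERT INTO recovered VALUES (1, 2, 'x');  -- duplicate: silently skipped, no error
SELECT COUNT(*) FROM recovered;            -- returns 1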
SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index and so the lookup happens through the created index.
If you do not want to use an SQL approach (as mentioned in other answers), you can do a single select for all your data when the program starts, store the data in a dictionary, and work with the dictionary to decide which records to insert into your DB.
The benefit of this approach is that the single select is much faster than many small selects.
The disadvantage is that it won't work well if you don't have enough memory to hold all your data.
I have a problem finding a way to represent multiple hash tables in a single table.
Say I have 3 tables with the format:
Table1(Table1_PK1,Table1_PK2,Table1_PK3,Table1_Hash)
Table2(Table2_PK1,Table2_PK2,Table2_Hash)
Table3(Table3_Pk1,Table3_PK2,Table3_PK3,Table3_PK4,Table3_PK5,Table3_Hash)
Table1_PK1,Table1_PK2,Table1_PK3... are columns and they might have different datatypes (VARCHAR, INT or DATETIME ...).
My question is whether there is a way to create a single table (with a fixed number of columns) that can represent all of these 3 tables (maybe more in practice).
I am trying to do this for my database tool. Each table is actually a table that contains primary keys and the hash data associated with them.
Since you're apparently building a database tool, not a database, it might make more sense to do this in application code rather than in a database table.
In a different answer, you commented
I am still looking for a dynamic way to do it without knowing how many primary keys a table can have.
A table can have only one primary key. That primary key can consist of more than one column, though. (You already knew this; you were just using the wrong words, which might confuse others.)
A table can also have an arbitrary number of other keys, which will be either declared (as NOT NULL UNIQUE) or "undeclared" (by creating an index that guarantees uniqueness over a set of columns).
You can look all that stuff up at run time in one or both of two ways. (Links go to documentation for PostgreSQL.)
System tables, sometimes called system catalogs
information_schema views
As far as I know, all modern SQL platforms implement at least one of these interfaces. The information_schema views are covered in the SQL standards, but there seems to be some room for interpretation. They don't look quite the same on all platforms.
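As a sketch, here is roughly how you could list a table's primary key columns at run time through the information_schema views (the table name 'table1' is just a placeholder):
SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON  kcu.constraint_name = tc.constraint_name
  AND kcu.table_schema    = tc.table_schema
  AND kcu.table_name      = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name = 'table1'
ORDER BY kcu.ordinal_position;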
Why combine the 3 tables into one? That would be really bad db design. But here's a way to do it:
The one table will have a column for each of the 3 tables' columns you want in the final table. I am making the assumption that TableX_Hash is the same type, so that remains as one unique column:
Table_All_in_One (
Table1_PK1,
Table1_PK2,
Table1_PK3,
# space just for clarity of grouping
Table2_PK1,
Table2_PK2,
Table3_PK1,
Table3_PK2,
Table3_PK3,
Table3_PK4,
Table3_PK5,
TableX_Hash # Assuming all the _Hash'es are the same type+length,
# otherwise, add Table1_Hash, Table2_Hash, Table3_Hash
# This can be your new primary key
)
The Primary Keys (PKx) are required to be non-NULL only in their own tables. For this table, they have to allow nulls. The idea is that each row of this new table will only hold the data for one of the tables. The other columns will be empty for that row. If you want to associate the row of one table with another, you can add that to the same row or add FK_Table1_Hash, FK_Table2_Hash and FK_Table3_Hash columns which will refer to the TableX_Hash value of a record.
PS: I wonder if what you are really looking for is a View and not this really bad all-in-one table.
Edit: Combining them into one "without knowing how many primary keys a table can have", as per your comment:
Store all the _PKs concatenated into one column:
Table_All_in_One (
New_PK,
TableX_Hash,
Table1_PKx, # Concatenated PKs of Table1
Table2_PKx, # Concatenated PKs of Table2, etc.
...,
# OR just one
TableX_PKs, # concatenate all the PK's into one VARCHAR field
# Add a pipe `|` between them optionally.
Table_Num # If using just one, then you'll need to store the table number
)
You will not be able to conveniently pick records based on part of their composite primary key. It will always have to be TableX_PKs = CONCAT_WS('|', Table1_PK1, Table1_PK2, ...). So your only dependency is the number of PKs in the original table.
In order to model a bunch of tables this way you will need 3 tables: an entity (or "factor") table that contains the names of the tables you wish to set up this way; a factor_detail table that contains all the columns and their associated properties for those tables; and a factor_detail_value table for storing things like lookup values for lookup tables.
I'm trying to learn more about this myself because we are using this technique at work. You generate SQL on the fly for any table encoded this way, and store the data in a repository pertinent to the data itself. That way, if a table changes and you need to add a column or change a datatype, you can add a row to the factor_detail table without a database shutdown in production. In most businesses a four-hour shutdown to make a SQL table change can cost thousands of dollars. If you are dealing with insurance, for example, each additional state you sell insurance in has different requirements for being able to sell it, and that results in table changes. We reduced our table count from over 700 tables in this manner, and we can make changes without a database shutdown, avoiding the loss in revenue.
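A rough sketch of that three-table layout (all names and types here are illustrative, not from the original post):
CREATE TABLE factor (
    factor_id   INT PRIMARY KEY,
    table_name  VARCHAR(128) NOT NULL        -- name of the table being modeled
);
CREATE TABLE factor_detail (
    factor_detail_id INT PRIMARY KEY,
    factor_id        INT NOT NULL REFERENCES factor(factor_id),
    column_name      VARCHAR(128) NOT NULL,  -- column of the modeled table
    data_type        VARCHAR(64)  NOT NULL   -- declared type of that column
);
CREATE TABLE factor_detail_value (
    factor_detail_id INT NOT NULL REFERENCES factor_detail(factor_detail_id),
    value            VARCHAR(255) NOT NULL   -- e.g. allowed lookup values
);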
I find this comes up a lot, and I'm not sure the best way to approach it.
The question I have is how to decide between using foreign keys to lookup tables, or using lookup table values directly in the tables that need them, avoiding the lookup table relationship completely.
Points to keep in mind:
With the second method you would need to do mass updates to all records referencing the data if it is changed in the lookup table.
This is focused more towards tables that have a lot of columns referencing many lookup tables. Therefore lots of foreign keys means a lot of joins every time you query the table.
This data would be coming from drop-down lists which would be pulled from the lookup tables. In order to match up data when reloading, the values need to be in the existing list (related to the first point).
Is there a best practice here, or any key points to consider?
You can use a lookup table with a VARCHAR primary key, and your main data table uses a FOREIGN KEY on its column, with cascading updates.
CREATE TABLE ColorLookup (
color VARCHAR(20) PRIMARY KEY
);
CREATE TABLE ItemsWithColors (
...other columns...,
color VARCHAR(20),
FOREIGN KEY (color) REFERENCES ColorLookup(color)
ON UPDATE CASCADE ON DELETE SET NULL
);
This solution has the following advantages:
You can query the color names in the main data table without requiring a join to the lookup table.
Nevertheless, color names are constrained to the set of colors in the lookup table.
You can get a list of unique color names (even if none are currently in use in the main data) by querying the lookup table.
If you change a color in the lookup table, the change automatically cascades to all referencing rows in the main data table.
It's surprising to me that so many other people on this thread seem to have mistaken ideas of what "normalization" is. Using a surrogate key (the ubiquitous "id") has nothing to do with normalization!
Re comment from #MacGruber:
Yes, the size is a factor. In InnoDB for example, every secondary index stores the primary key value of the row(s) where a given index value occurs. So the more secondary indexes you have, the greater the overhead for using a "bulky" data type for the primary key.
Also this affects foreign keys; the foreign key column must be the same data type as the primary key it references. You might have a small lookup table so you think the primary key size in a 50-row table doesn't matter. But that lookup table might be referenced by millions or billions of rows in other tables!
There's no right answer for all cases. Any answer can be correct for different cases. You just learn about the tradeoffs, and try to make an informed decision on a case by case basis.
In cases of simple atomic values, I tend to disagree with the common wisdom on this one, mainly on the complexity front. Consider a table containing hats. You can do the "denormalized" way:
CREATE TABLE Hat (
hat_id INT NOT NULL PRIMARY KEY,
brand VARCHAR(255) NOT NULL,
size INT NOT NULL,
color VARCHAR(30) NOT NULL /* color is a string, like "Red", "Blue" */
)
Or you can normalize it more by making a "color" table:
CREATE TABLE Color (
color_id INT NOT NULL PRIMARY KEY,
color_name VARCHAR(30) NOT NULL
)
CREATE TABLE Hat (
hat_id INT NOT NULL PRIMARY KEY,
brand VARCHAR(255) NOT NULL,
size INT NOT NULL,
color_id INT NOT NULL REFERENCES Color(color_id)
)
The end result of the latter is that you've added some complexity - instead of:
SELECT * FROM Hat
You now have to say:
SELECT * FROM Hat H INNER JOIN Color C ON H.color_id = C.color_id
Is that extra join a huge deal? No - in fact, that's the foundation of the relational design model - normalizing allows you to prevent possible inconsistencies in the data. But every situation like this adds a little bit of complexity, and unless there's a good reason, it's worth asking why you're doing it. I consider possible "good reasons" to include:
Are there other attributes that "hang off of" this attribute? Are you capturing, say, both "color name" and "hex value", such that hex value is always dependent on color name? If so, then you definitely want a separate color table, to prevent situations where one row has ("Red", "#FF0000") and another has ("Red", "#FF3333"). Multiple correlated attributes are the #1 signal that an entity should be normalized.
Will the set of possible values change frequently? Using a normalized lookup table will make future changes to the elements of the set easier, because you're just updating a single row. If it's infrequent, though, don't balk at statements that have to update lots of rows in the main table instead; databases are quite good at that. Do some speed tests if you're not sure.
Will the set of possible values be directly administered by the users? I.e. is there a screen where they can add / remove / reorder the elements in the list? If so, a separate table is a must, obviously.
Will the list of distinct values power some UI element? E.g. is "color" a droplist in the UI? Then you'll be better off having it in its own table, rather than doing a SELECT DISTINCT on the table every time you need to show the droplist.
If none of those apply, I'd be hard pressed to find another (good) reason to normalize. If you just want to make sure that the value is one of a certain (small) set of legal values, you're better off using a CONSTRAINT that says the value must be in a specific list; keeps things simple, and you can always "upgrade" to a separate table later if the need arises.
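For example, a hedged sketch of that constraint-based alternative, reusing the Hat example from above (the color list is illustrative):
CREATE TABLE Hat (
    hat_id INT NOT NULL PRIMARY KEY,
    brand VARCHAR(255) NOT NULL,
    size INT NOT NULL,
    color VARCHAR(30) NOT NULL CHECK (color IN ('Red', 'Blue', 'Green'))
)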
One thing no one has considered is that you would not join to the lookup table if the data in it can change over time and the records joined to are historical. The example is a parts table and an orders table. The vendors may drop parts or change part numbers, but the orders table should always have exactly what was ordered at the time it was ordered. Therefore, it should look up the data to do the record insert but should never join to the lookup table to get information about an existing order. Instead the part number, description, price, etc. should be stored in the orders table. This is especially critical so that price changes do not propagate through historical data and make your financial records inaccurate. In this case, you would also want to avoid using any kind of cascading update as well.
rauhr.myopenid.com wrote:
The way we decided to solve this problem is with 4th normal form.
...
That is not 4th normal form. That is a common mistake called One True Lookup:
http://www.dbazine.com/ofinterest/oi-articles/celko22
4th normal form is:
http://en.wikipedia.org/wiki/Fourth_normal_form
Normalization is pretty universally regarded as part of best practices in databases, and normalization says yeah, you push the data out and refer to it by key.
Since no one else has addressed your second point: When queries become long and difficult to read and write due to all those joins, a view will usually resolve that.
You can even make it a rule to always program against the views, having the view get the lookups.
This makes it possible to optimize the view and make your code resistant to changes in the tables.
In Oracle, you could even convert the view into a materialized view if you ever need to.
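For instance, a minimal sketch of such a view, reusing the Hat/Color tables from earlier in the thread (the view name is illustrative):
CREATE VIEW HatView AS
SELECT H.hat_id, H.brand, H.size, C.color_name
FROM Hat H INNER JOIN Color C ON H.color_id = C.color_id
Application code can then SELECT * FROM HatView and never mention the lookup table, and, as noted above, the view could later be converted to a materialized view in Oracle.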