Is json suitable for holding values coming from dynamic columns?
I want to store data about documents where the structure is known in advance, e.g. Name, Name, Year, Place, etc. For now, I have defined these fields as columns in a table in MS SQL.
However, I would also like to store data about "dynamic" documents, where the user himself selects a field from a certain pool and inserts a value. For now, I store this data as JSON in one column. For example, sometimes it will be something like this: '{"Name":"John Doe","Place":"Chicago"}'
and other times it may be only: '{"Name": "John Doe"}'
I wonder how to unify this data. I thought that data from the first type of documents (with a known number of columns) should also be stored in JSON. But I don't know if this is a good approach with large amounts of data, e.g. 100,000 records.
First, 100,000 rows is not a large number of rows. It is doubtful that you are even talking about a gigabyte of data.
Both XML and JSON incur overhead for storing field names in the data. If you have lots of repeated field names, then lots of redundant field names are being stored. And bigger rows slow down queries.
JSON and XML also have challenges in verifying the field names. This can be handled through the application or constraints. That said, they can be quite useful when the attributes are rarely used.
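As a rough illustration of handling this in the application, here is a minimal Python sketch that checks a JSON document's field names against a fixed pool before it is stored. The pool contents and the function name `validate_document` are assumptions made for this example, not something taken from the question.

```python
import json

# Hypothetical pool of fields the user may pick from (assumed for illustration).
ALLOWED_FIELDS = {"Name", "Year", "Place"}

def validate_document(raw: str) -> dict:
    """Parse a JSON document and reject any field name outside the pool."""
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        raise ValueError("document must be a JSON object")
    unknown = set(doc) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return doc

# A document that only uses fields from the pool passes through unchanged.
ok = validate_document('{"Name": "John Doe", "Place": "Chicago"}')
```

A misspelled key such as "Nmae" would raise ValueError here, which is the kind of check the database itself does not perform on a plain JSON column.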
Your sample data simply suggests NULLable columns. You can have name and place. If there is no place then the value is NULL. Given that you have a fixed pool, there is a good chance that this structure is the simplest and most efficient. The only downside is that adding a new field requires adding a column to the table, and that can be an expensive operation.
An alternative is an EAV model, which is mentioned in the comments. This solves the problem of repeating the names, because you can use ids instead. So, you could have:
create table optionalFields (
    optionalFieldId int identity(1, 1) primary key,
    name varchar(255)
);
create table userOptionalFields (
    userOptionalFieldId int identity(1, 1) primary key,
    userId int references users(userId),
    optionalFieldId int references optionalFields(optionalFieldId),
    value varchar(255)
);
The downside to an EAV model is that it is simplest when all the values are strings, and that can be a little tricky if some of the values are numbers or dates. On the positive side, the database ensures that the fields are valid.
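To make the EAV idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MS SQL. The `users` table layout and the sample rows are assumptions for illustration; the other two tables follow the definitions above, with SQLite types substituted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE users (userId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE optionalFields (
        optionalFieldId INTEGER PRIMARY KEY,
        name TEXT UNIQUE
    );
    CREATE TABLE userOptionalFields (
        userOptionalFieldId INTEGER PRIMARY KEY,
        userId INTEGER REFERENCES users(userId),
        optionalFieldId INTEGER REFERENCES optionalFields(optionalFieldId),
        value TEXT
    );
    INSERT INTO users VALUES (1, 'John Doe');
    INSERT INTO optionalFields VALUES (1, 'Place'), (2, 'Year');
    INSERT INTO userOptionalFields (userId, optionalFieldId, value)
        VALUES (1, 1, 'Chicago');
""")

# Read one user's optional fields back as a name -> value mapping.
rows = conn.execute("""
    SELECT f.name, uf.value
    FROM userOptionalFields uf
    JOIN optionalFields f ON f.optionalFieldId = uf.optionalFieldId
    WHERE uf.userId = ?
""", (1,)).fetchall()
fields = dict(rows)

# A field id that is not in optionalFields is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO userOptionalFields (userId, optionalFieldId, value) "
        "VALUES (1, 99, 'x')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Note that every value is stored as text in this sketch, which is exactly the typing limitation described above.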
The choice between the different data models depends on factors such as:
The data type of the values.
The total number of fields.
How often new fields are added.
Whether the fields (for a given user) are updated and if so, if they are updated one-by-one or all-at-once.
How common fields are for a given user.
How familiar you are with XML and JSON.
Whether field names have synonyms (for instance "FullName" for "Name").
Whether field names ever change. For instance, might "Name" suddenly become "FullName"?
And no doubt other issues as well.
I am going to be creating a very large table (320k+ rows) that I am going to be doing many complicated operations on so performance is very important. Each row will be a reference to a page / entity from an external site that already has unique IDs. In order to keep the data easy to read and for consistency reasons I would rather use those external IDs as my own row IDs, however the problem is that the IDs are in the format of XXX######## where the XXX part is always the same identical string prefix and the second ######## part is a completely unique number. From what I know, using varchar ids is measurably slower performance wise, and only looking at the numerical part will have the same results.
What is the best way to do this? I still want to be able to do queries like WHERE ID = 'XXX########' and have the actual correct ids displayed in result sets rather than trimmed ones. Is there a way to define getters and setters for a column? Or is there a way to create an index that is a function on just the numerical part of the id?
Since your ID column (with format XXX########) is a primary key, there will already be an index on that column. If you wish to create an index based on the "completely unique number" portion of the ID, it is possible to create an expression index in Postgres:
CREATE INDEX pk_substr_idx ON mytable (substring(id,4));
This will create an index on the ######## portion of your column. However, bear in mind that the values stored in the index will be text, not numbers. Therefore, you might not see any real benefit from having this index around (i.e., equality checks with = work as expected, but comparisons with >, <, >=, and <= follow text ordering rather than numeric ordering).
The other drawback of this approach is that for every row you insert, you'll be updating two indexes (the one for the PK, and the one for the substring).
Therefore, if at all possible, I would recommend splitting your ID into separate prefix (the XXX portion) and id_num (the ######## portion) columns. Since you stated that "the XXX part is always the same identical string prefix", you would stand to reap a performance benefit by either 1) splitting the string into two columns or 2) hard-coding the XXX portion into your app (since it's "always the same identical string prefix") and only storing the numeric portion in the database.
Another approach (if you are willing to split the string into separate prefix and id_num columns) is to create a composite index. The table definition would then look something like:
CREATE TABLE mytable (
prefix text,
id_num int,
<other columns>,
PRIMARY KEY (prefix, id_num)
);
This creates a primary key on the two columns, and you would be able to see your queries use the index if you write your application with two columns in mind. Again, you would need to split the ID up into text and number portions. I believe this is the only way to get the best performance out of your queries. Any value that mixes text and numbers will ultimately be stored and interpreted as text.
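As a sketch of the second variant (prefix kept in the application, only the number stored), the following Python example uses sqlite3 as a stand-in for Postgres. The prefix "ABC", the 8-digit width, the `title` column, and both helper functions are assumptions made for illustration.

```python
import sqlite3

PREFIX = "ABC"  # hypothetical stand-in for the constant XXX prefix

def to_id_num(external_id: str) -> int:
    """Strip the constant prefix and return the numeric part."""
    if not external_id.startswith(PREFIX):
        raise ValueError(f"unexpected prefix in {external_id!r}")
    return int(external_id[len(PREFIX):])

def to_external_id(id_num: int) -> str:
    """Rebuild the full ID, zero-padded to 8 digits (assumed width)."""
    return f"{PREFIX}{id_num:08d}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id_num INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "INSERT INTO mytable VALUES (?, ?)",
    (to_id_num("ABC00001234"), "example page"),
)

# Look up by the full external ID; only the integer reaches the database.
row = conn.execute(
    "SELECT id_num, title FROM mytable WHERE id_num = ?",
    (to_id_num("ABC00001234"),),
).fetchone()
full_id = to_external_id(row[0])
```

The application still accepts and displays the full XXX######## form, while the primary key index holds plain integers.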
Disclosure: I work for EnterpriseDB (EDB)
Use an IDENTITY-type column for the primary key and load the external IDs into a separate column.
I want to store a wide array of categorical data in MySQL database tables. Let's say, for instance, that I want to store information on "widgets" and want to categorize attributes in certain ways, e.g. by shape category.
For instance, the widgets could be classified as: round, square, triangular, spherical, etc.
Should these categories be stored within a table to reference them best from an application? Another possibility, I would imagine, would be to add a shape column to widgets that contains a tinyint. That way my application could search shapes by that value and then use a corresponding enum type that maps each shape int to its meaning.
Which would be best? Or is there another solution that I'm not thinking of yet?
Define a category table for each attribute grouping. For example:
WIDGET_SHAPE_TYPE_CODES
WIDGET_SHAPE_TYPE_CODE (primary key)
DESCRIPTION
Then use a foreign key reference in the WIDGETS table:
WIDGETS
WIDGET_ID (primary key)
...
WIDGET_SHAPE_TYPE_CODE (foreign key)
This has the benefit of being portable to other databases, and more obvious relationships which means simpler maintenance.
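A minimal sketch of this layout, using Python's sqlite3 module as a stand-in for MySQL, might look like the following. The lower-case table and column names mirror the outline above; the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys
conn.executescript("""
    CREATE TABLE widget_shape_type_codes (
        widget_shape_type_code INTEGER PRIMARY KEY,
        description TEXT NOT NULL
    );
    CREATE TABLE widgets (
        widget_id INTEGER PRIMARY KEY,
        widget_shape_type_code INTEGER NOT NULL
            REFERENCES widget_shape_type_codes(widget_shape_type_code)
    );
    INSERT INTO widget_shape_type_codes VALUES (1, 'round'), (2, 'square');
    INSERT INTO widgets VALUES (10, 1), (11, 2), (12, 1);
""")

# Find widgets by the human-readable shape name via the category table.
round_widgets = [r[0] for r in conn.execute("""
    SELECT w.widget_id
    FROM widgets w
    JOIN widget_shape_type_codes s USING (widget_shape_type_code)
    WHERE s.description = 'round'
    ORDER BY w.widget_id
""")]
```

Adding a new shape is then a single INSERT into the category table, with no change to the widgets table or to application code.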
What I would do is start with a Widgets table that has a category field that is a numeric type. If you also use the category table the numeric category is a foreign key that relates to a row in the category table. A numeric type is nice and small for better performance.
Optionally, you can add a category table containing a numeric primary key value and a text description. This matches up the numeric value to a human-friendly text value. This table can be used to convert the numbers to text if you just want to run reports directly from the database. The nice thing about having this table is that you don't need to update an executable if you add a new category. I would add such a table to my design.
MySQL's ENUM is handy, but it is stored in the table as a string, so it uses up more space in the table than is really needed. However, it does have the advantage of preventing unrecognized values from being stored. Preventing the storage of invalid numeric values is possible, but not as elegantly as with ENUM. The other problem with ENUM is that, because it is regarded as a string, the database must do more work when you select by the value: instead of comparing a single number, it has to compare multiple characters.
If you really want to, you can have an enumeration in your code that converts the numeric category back into something more application-code friendly, but you are making your code more difficult to maintain by doing this. However, it can have a performance advantage because fewer bytes have to be returned when you run a query. I would try to avoid this because it requires updating the application code every time a category is added to the database. If you really need to squeeze performance out of the database, you could select the whole category table, select the widgets table, and merge them in application code, but that is a rare circumstance, since the DB client almost always has a fast connection to the DB server and a few more bytes over the network are insignificant.
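For reference, an application-side enumeration of the kind described here could be sketched in Python as follows. The class name, the numeric values, and the helper `shape_label` are assumptions for illustration; the values would have to be kept in sync with the database by hand, which is the maintenance cost mentioned above.

```python
from enum import IntEnum

class Shape(IntEnum):
    # Values must mirror the numeric categories stored in the database
    # (the specific numbers here are assumed for illustration).
    ROUND = 1
    SQUARE = 2
    TRIANGULAR = 3
    SPHERICAL = 4

def shape_label(code: int) -> str:
    """Convert a stored numeric category into a readable name."""
    return Shape(code).name.lower()

label = shape_label(3)
```

A code that exists in the database but not in the enum raises ValueError, so a new category needs a new release of the application.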
I think the best way is to use ENUM; for example, there is a predefined ENUM type in MySQL - http://dev.mysql.com/doc/refman/5.0/en/enum.html
I am trying to design a sqlite database that will store notes. Each of these notes will have common fields like title, due date, details, priority, and completed.
In addition though, I would like to add data for more specialized notes like price for shopping list items and author/publisher data for books.
I also want to have a few general purpose fields that users can fill with whatever text data they want.
How can I design my database table in this case?
I could just have a field for each piece of data for every note, but that would waste a lot of fields and I'd like to have other options and suggestions.
There are several standard approaches you could use for solving this situation.
1. You could create separate tables for each kind of note, copying over the common columns in each case. This would be easy, but it would make it difficult to query over all notes.
2. You could create one large table with many columns and some kind of type field that tells you which type of note it is (and therefore which subset of columns to use):
CREATE TABLE NOTE ( ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields, price NUMBER NULL, author VARCHAR(100) NULL, ...more specific fields);
3. You could break your tables up into an inheritance relationship, something like this:
CREATE TABLE NOTE ( ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields);
CREATE TABLE SHOPPINGLISTITEM ( ID int PRIMARY KEY, NOTE_ID int REFERENCES NOTE(ID), price NUMBER, ...more shopping list item fields);
Option 1 would be easy to implement but would involve lots of mostly redundant table definitions.
Option 2 would be easy to create and easy to write queries on but would be space inefficient.
And option 3 would be more space efficient and less redundant but would possibly have slower queries because of all the foreign keys.
This is the typical set of trade-offs for modeling these kinds of relationships in SQL. Any of these solutions could be appropriate for your use case, depending on your performance requirements.
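To show what querying the inheritance-style layout (option 3) involves, here is a minimal sketch using Python's sqlite3 module. The `title` column, the snake_case table names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        note_type INTEGER,
        title TEXT,
        duedate TEXT
    );
    CREATE TABLE shopping_list_item (
        id INTEGER PRIMARY KEY,
        note_id INTEGER REFERENCES note(id),
        price REAL
    );
    -- Hypothetical sample data: one shopping item and one plain note.
    INSERT INTO note VALUES (1, 1, 'Milk', '2024-01-01'), (2, 0, 'Call Bob', NULL);
    INSERT INTO shopping_list_item VALUES (1, 1, 2.5);
""")

# LEFT JOIN keeps plain notes, which simply have no specialized row.
notes = conn.execute("""
    SELECT n.id, n.title, s.price
    FROM note n
    LEFT JOIN shopping_list_item s ON s.note_id = n.id
    ORDER BY n.id
""").fetchall()
```

Each additional note type adds one more join to a query that needs every specialized field, which is where the possible slowdown comes from.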
You could create something like a custom_field table. It gets pretty messy once you start to normalize.
So you have your note table with its common fields.
Now add:
dynamic_note_field
id  label
1   publisher
2   color
3   size

dynamic_note_field_data
id  dynamic_note_field_id  value
1   1                      Penguin
2   1                      Marvel
3   2                      Red
Finally, you can relate instances of your data with the fields they use through
note_dynamic_note_field_data
note_id  dynamic_note_field_data_id
1        1
1        3
2        2
So now we've said: note_id 1 has two additional fields. The first one has a value "Penguin" and represents a publisher. The second one has a value of "Red" and represents a color.
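The lookup described here can be sketched with Python's sqlite3 module, using the three tables and sample rows from above. The column types, the foreign key declarations, and the helper `extra_fields` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dynamic_note_field (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE dynamic_note_field_data (
        id INTEGER PRIMARY KEY,
        dynamic_note_field_id INTEGER REFERENCES dynamic_note_field(id),
        value TEXT
    );
    CREATE TABLE note_dynamic_note_field_data (
        note_id INTEGER,
        dynamic_note_field_data_id INTEGER REFERENCES dynamic_note_field_data(id)
    );
    INSERT INTO dynamic_note_field VALUES (1, 'publisher'), (2, 'color'), (3, 'size');
    INSERT INTO dynamic_note_field_data VALUES
        (1, 1, 'Penguin'), (2, 1, 'Marvel'), (3, 2, 'Red');
    INSERT INTO note_dynamic_note_field_data VALUES (1, 1), (1, 3), (2, 2);
""")

def extra_fields(note_id: int) -> dict:
    """Return the dynamic fields of one note as a label -> value mapping."""
    rows = conn.execute("""
        SELECT f.label, d.value
        FROM note_dynamic_note_field_data nd
        JOIN dynamic_note_field_data d ON d.id = nd.dynamic_note_field_data_id
        JOIN dynamic_note_field f ON f.id = d.dynamic_note_field_id
        WHERE nd.note_id = ?
    """, (note_id,)).fetchall()
    return dict(rows)

note_1 = extra_fields(1)
```

Two joins are needed to get from a note to a labelled value, which is part of the messiness mentioned at the start.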
So what's the point of normalizing it this far?
You're not wasting space adding fields to every item (you relate a note with its additional dynamic fields via the m2m table).
You're not storing redundant labels. (You may continue to store redundant data, however, as the same publisher is likely to appear many times; this aspect is extremely subjective. If you want rich data about your publishers, you typically want to take the step of turning them into their own entity rather than an ad-hoc string. Be careful when making this leap because it adds an extra level of hairiness to the db. Evaluate the use case accordingly.)
The dynamic_note_field acts as your data definition. If you're interested in answering a question such as "what are the additional fields I've created" this lets you do it easily without searching all of your dynamic_note_field_data. Eventually, you might add extra info to this table such as a type field. I like to create this separation off the bat, but that might be a violation of the YAGNI principle in your case.
Disadvantages:
It's not too bad to search for all notes that have a publisher, where that publisher is "Penguin".
What's tricky is something like "Find any note with a value of 'Penguin' in any field". You don't know up front which field you're searching. At this point you're better off with a separate index that's generated alongside your normalized db data, which acts as the point of truth. Again, the nice thing about normalization is that you maintain the data in a very lossless, non-destructive state.
For data you want to store but does not have to be searchable, another option is to serialize it to/from JSON and store it in a TEXT column. This gives you arbitrary structure, but you cannot readily query against those values.
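A minimal sketch of the JSON-in-a-TEXT-column approach, using Python's json and sqlite3 modules, might look like this. The table layout, the column name `extra`, and the sample values are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, title TEXT, extra TEXT)")

# Hypothetical specialized data for a book note.
extra = {"author": "Jane Roe", "publisher": "Penguin"}
conn.execute(
    "INSERT INTO note (title, extra) VALUES (?, ?)",
    ("Some book", json.dumps(extra)),
)

# The database only sees an opaque string; structure comes back after parsing.
stored = conn.execute(
    "SELECT extra FROM note WHERE title = ?", ("Some book",)
).fetchone()[0]
loaded = json.loads(stored)
```

The trade-off is visible in the SELECT: the query can filter on `title`, but not directly on anything inside `extra`.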
Yet another option is to dump SQLite and go with an object database. I seem to recall there are one or two working for Android. I have not tried any of these, however.
Just create a small table which contains the common fields of all your notes.
Then create a table for each class of special notes you have, containing all the extra fields plus a reference to your first table.
For each note you enter, you create a row in your main table (which contains the common fields) and a row in your extra table, which contains the extra fields and a reference to the row in your main table.
Then you will just have to do a join in your query.
With this solution:
1) you have a safe design (you can't access fields that are not part of your note)
2) your db will be optimized
I thought I'd be flexible this time around and let the users decide what contact information they wish to store in their database. In theory it would look like a single row containing, for instance: name, address, zipcode, Category X, Listitems A.
Example
FieldType table defining the datatypes available to a user:
FieldTypeID, FieldTypeName, TableName
1,"Integer","tblContactInt"
2,"String50","tblContactStr50"
...
A user then defines his fields in the FieldDefinition table:
FieldDefinitionID, FieldTypeID, FieldDefinitionName
11,2,"Name"
12,2,"Address"
13,1,"Age"
Finally we store the actual contact data in separate tables depending on its datatype.
Master table, only contains the ContactID
tblContact:
ContactID
21
22
tblContactStr50:
ContactStr50ID,ContactID,FieldDefinitionID,ContactStr50Value
31,21,11,"Person A"
32,21,12,"Address of person A"
33,22,11,"Person B"
tblContactInt:
ContactIntID,ContactID,FieldDefinitionID,ContactIntValue
41,22,13,27
Question: Is it possible to return the content of these tables in two rows like this:
ContactID,Name,Address,Age
21,"Person A","Address of person A",NULL
22,"Person B",NULL,27
I have looked into using COALESCE and temp tables, wondering if this is at all possible. Even if it is, maybe I'm only adding complexity and sacrificing performance for the benefit of flexible data storage and user-defined fields.
What do you think?
I don't think this is a good way to go because:
A simple insert of 1 record for a contact suddenly becomes n inserts. e.g. if you store varchar, nvarchar, int, bit, datetime, smallint and tinyint data for a contact, that's 7 inserts in datatype specific tables, +1 for the main header record
Likewise, a query will automatically reference 7 tables, with 6 JOINs involved just to get the full details
I personally think it's better to go for a less "generic" approach. Keep it simple.
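For completeness, the two-row result asked for in the question can be produced with conditional aggregation. The sketch below uses Python's sqlite3 module as a stand-in for the actual database and loads the sample rows from the question; the column types are assumptions, and the field ids 11, 12, and 13 are hard-coded, so truly user-defined fields would require generating this SQL dynamically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblContact (ContactID INTEGER PRIMARY KEY);
    CREATE TABLE tblContactStr50 (
        ContactStr50ID INTEGER PRIMARY KEY, ContactID INTEGER,
        FieldDefinitionID INTEGER, ContactStr50Value TEXT);
    CREATE TABLE tblContactInt (
        ContactIntID INTEGER PRIMARY KEY, ContactID INTEGER,
        FieldDefinitionID INTEGER, ContactIntValue INTEGER);
    INSERT INTO tblContact VALUES (21), (22);
    INSERT INTO tblContactStr50 VALUES
        (31, 21, 11, 'Person A'),
        (32, 21, 12, 'Address of person A'),
        (33, 22, 11, 'Person B');
    INSERT INTO tblContactInt VALUES (41, 22, 13, 27);
""")

# One output row per contact; each CASE picks out a single field definition.
rows = conn.execute("""
    SELECT c.ContactID,
           MAX(CASE WHEN s.FieldDefinitionID = 11 THEN s.ContactStr50Value END) AS Name,
           MAX(CASE WHEN s.FieldDefinitionID = 12 THEN s.ContactStr50Value END) AS Address,
           MAX(i.ContactIntValue) AS Age
    FROM tblContact c
    LEFT JOIN tblContactStr50 s ON s.ContactID = c.ContactID
    LEFT JOIN tblContactInt i
           ON i.ContactID = c.ContactID AND i.FieldDefinitionID = 13
    GROUP BY c.ContactID
    ORDER BY c.ContactID
""").fetchall()
```

Even for three fields the query touches one table per datatype, which illustrates the join overhead described above.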
Update:
The question is, do you really need a flexible solution like this? For contact data, you always expect to be able to store at least a core set of fields (address line 1-n, first name, surname etc). If you need a way for the user to store custom/user definable data on top of that standard data set, that's a common requirement. Various options include:
XML column in your main Contacts table to store all the user defined data
1 extra table containing key-value pair data, a bit like you originally talked about but to a much lesser degree! This would contain the key of the contact, the custom data item name and the value.
These have been discussed before here on SO so would be worth digging around for that question. Can't seem to find the question I'm remembering after a quick look though!
Found some that discuss the pros/cons of the key-value approach, to save repeating:
Key value pairs in relational database
Key/Value pairs in a database table
I am refactoring an old Oracle 10g schema to try to introduce some normalization. In one of the larger tables, there is a text field that has at most, 10-15 possible values. In my mind, it seems that this field is an example of unnecessary data duplication and should be extracted to a separate table.
After examining the data, I cannot find one relevant piece of information that could be associated with that text value. Basically, if I pulled that value out and put it into its own table, it would be the only field in that table. It exists today as more of a 'flag' field. Should I create a two-column table with a surrogate key, keep it as it is, or do something entirely different? Am I doing more harm than good by trying to minimize data duplication on this field?
You might save some space by extracting the column to a separate table. This is called a lookup table. It can give you a couple of other benefits:
You can declare a foreign key constraint to the lookup table, so you can rely on the column in the main table never having any value other than the 10-15 values you want.
It's easy to query for a concise list of all permitted values, by querying the lookup table. This can be faster than using SELECT DISTINCT on the main table's column. It also returns values that are permitted, but not currently used in the main table.
If you change a value in the lookup table, it automatically applies to all rows in the main table that reference it.
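The second and third benefits can be sketched with Python's sqlite3 module standing in for Oracle. The table names, the surrogate `status_id` key, and the sample values are assumptions for illustration; the question does not say what the 10-15 values are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE status_lookup (
        status_id INTEGER PRIMARY KEY,
        label TEXT UNIQUE NOT NULL
    );
    CREATE TABLE big_table (
        id INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES status_lookup(status_id)
    );
    INSERT INTO status_lookup VALUES (1, 'OPEN'), (2, 'CLOSED'), (3, 'ON HOLD');
    INSERT INTO big_table VALUES (100, 1), (101, 1), (102, 2);
""")

# All permitted values, including 'ON HOLD', which no row uses yet.
permitted = [r[0] for r in conn.execute(
    "SELECT label FROM status_lookup ORDER BY status_id")]

# Renaming a value once in the lookup table changes it for every referencing row.
conn.execute("UPDATE status_lookup SET label = 'ACTIVE' WHERE status_id = 1")
labels = [r[0] for r in conn.execute("""
    SELECT l.label
    FROM big_table b
    JOIN status_lookup l ON l.status_id = b.status_id
    ORDER BY b.id
""")]
```

With a natural key instead of a surrogate one, the same rename would need a cascading update of the main table.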
However, creating a lookup table with one column is not strictly normalization. You're just replacing one value with another. The attribute in the main table either already supports a normal form, or not.
Using surrogate keys (vs. natural keys) also has nothing to do with normalization. A lot of people make this mistake.
However, if you move other attributes into the lookup table, attributes that depend only on the lookup value and therefore would be transitively dependent on the key (violating 3NF) in the main table if you left them there, then that would be normalization.
If you want normalization, break it out.
I think of these types of data in DBs as the equivalent of enums in C,C++,C#. Mostly you put them in the table as documentation.
I often have an ID, Name, Description, and auditing columns for them (e.g. modified by, modified date, create date, create by, active). The description field is rarely used.
Example (some might say there are more than just 2)
Gender
ID Name Audit Columns...
1 Male
2 Female
Then in your contacts you would have a GenderID column which would link to this one.
Of course you don't "need" the table. You could have external documentation somewhere that says 1=Male, 2=Female -- but I think these tables serve to document a system.
If it's really a free-entry text field that's not re-used somewhere else in the database, and there's just a single field without repeated instances, I'd probably go ahead and leave it as it is. If you're determined to break it out I'd create a 'validation' table with a surrogate key and the text value, then put the surrogate key in the base table.
Share and enjoy.
Are these 10-15 values actually meaningful, or are they really just flags? If they're meaningful pieces of text and it seems wasteful to replicate them, then sure create a lookup table. But if they're just arbitrary flag values, then your new table will be nothing more than a mapping from one arbitrary value to another, and not terribly helpful.
A completely separate question is whether all or most of the rows in your big table even have a value for this column. If not, then indeed you have a good opportunity for normalization and can create a separate table linking the primary key from your base table with the flag value.
Edit: One thing. If there's some chance that one of these "flag" values is likely to be wholesale replaced with another value at some point in the future, that would be another good reason to create a table.