How to store array or multiple values in one column - sql

Running Postgres 7.4 (Yeah we are in the midst of upgrading)
I need to store from 1 to 100 selected items into one field in a database. 98% of the time it's just going to be 1 item entered, and 2% of the time (if that) there will be multiple items.
The items are nothing more than text descriptions, (as of now) no more than 30 characters long. They are static values the user selects.
I wanted to know the optimal column data type to store this data. I was thinking BLOB but didn't know if that is overkill. Maybe JSON?
Also, I did think of ENUM, but as of now I can't really do this since we are running Postgres 7.4.
I also wanted to be able to easily identify the item(s) entered so no mappings or referencing tables.

You have a couple of questions here, so I'll address them separately:
I need to store a number of selected items in one field in a database
My general rule is: don't. This is something which all but requires a second table (or third) with a foreign key. Sure, it may seem easier now, but what if the use case comes along where you need to actually query for those items individually? It also means that you have more options for lazy instantiation and you have a more consistent experience across multiple frameworks/languages. Further, you are less likely to have connection timeout issues (3,000 characters is a lot).
You mentioned that you were thinking about using ENUM. Are these values fixed? Do you know them ahead of time? If so this would be my structure:
Base table (what you have now):
| id primary_key sequence
| -- other columns here.
Items table:
| id primary_key sequence
| descript VARCHAR(30) UNIQUE
Map table:
| base_id bigint
| items_id bigint
The map table would have foreign keys so that base_id maps to the base table and items_id maps to the items table.
And if you'd like an easy way to retrieve this from a DB, then create a view which does the joins. You can even create insert and update rules so that you're practically only dealing with one table.
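To make that concrete, here is a rough Postgres sketch of those three tables plus the view; the table and column names and the serial/int types are placeholders rather than anything from your actual schema:

    CREATE TABLE base (
        id serial PRIMARY KEY
        -- other columns here
    );

    CREATE TABLE items (
        id serial PRIMARY KEY,
        descript varchar(30) UNIQUE NOT NULL
    );

    CREATE TABLE base_items_map (
        base_id int NOT NULL REFERENCES base(id),
        items_id int NOT NULL REFERENCES items(id),
        PRIMARY KEY (base_id, items_id)
    );

    -- Convenience view so reads feel like a single table:
    CREATE VIEW base_items_view AS
    SELECT b.id AS base_id, i.descript
    FROM base b
    JOIN base_items_map m ON m.base_id = b.id
    JOIN items i ON i.id = m.items_id;

The insert and update rules mentioned above would then be layered on top of this view.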
What format should I use to store the data?
If you have to do something like this, why not just use a character-delimited string? It will take less processing power than CSV, XML, or JSON, and it will be shorter.
What column type should I use to store the data?
Personally, I would use TEXT. It does not sound like you'd gain much by making this a BLOB, and TEXT, in my experience, is easier to read if you're using some form of IDE.
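If you do end up packing everything into one TEXT column, a minimal sketch of what that looks like (the table name, delimiter, and values are made up for illustration):

    CREATE TABLE selections (
        id serial PRIMARY KEY,
        items text    -- e.g. 'red widget|blue widget'
    );

    INSERT INTO selections (items) VALUES ('red widget|blue widget');

    -- Finding rows containing a given item then relies on pattern matching,
    -- which cannot use an ordinary index:
    SELECT * FROM selections WHERE items LIKE '%blue widget%';

That last query is exactly the kind of thing that becomes painful compared to the normalized layout above.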

Well, there is an array type in recent Postgres versions (not 100% sure about PG 7.4). You can even index them, using a GIN or GiST index. The syntax looks like this:
create table foo (
    bar int[] default '{}'
);

select * from foo where bar && array[1];  -- equivalent to bar && '{1}'::int[]

create index on foo using gin (bar);  -- lets the planner use an index for the query above
But as the prior answer suggests, it will be better to normalize properly.

Related

Index Integer Substring of Varchar ID PostgreSQL

I am going to be creating a very large table (320k+ rows) that I am going to be doing many complicated operations on, so performance is very important. Each row will be a reference to a page / entity from an external site that already has unique IDs. In order to keep the data easy to read and for consistency reasons I would rather use those external IDs as my own row IDs, however the problem is that the IDs are in the format of XXX######## where the XXX part is always the same identical string prefix and the ######## part is a completely unique number. From what I know, using varchar IDs is measurably slower performance-wise, and looking only at the numerical part would identify rows just as well.
What is the best way to do this? I still want to be able to do queries like WHERE ID = 'XXX########' and have the actual correct ids displayed in result sets rather than trimmed ones. Is there a way to define getters and setters for a column? Or is there a way to create an index that is a function on just the numerical part of the id?
Since your ID column (with format XXX########) is a primary key, there will already be an index on that column. If you wish to create an index based on the "completely unique number" portion of the ID, it is possible to create an expression index in Postgres:
CREATE INDEX pk_substr_idx ON mytable (substring(id,4));
This will create an index on the ######## portion of your column. However, bear in mind that the values stored in the index will be text, not numbers. Therefore, you might not see much real benefit from having this index around: equality checks with = will work, but range comparisons with >, <, >= and <= would follow text ordering rather than numeric ordering.
The other drawback of this approach is that for every row you insert, you'll be updating two indexes (the one for the PK, and the one for the substring).
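For reference, a query only gets to use that expression index when it repeats the same expression; something along these lines (the literal suffix value is made up):

    -- Matches the indexed expression, so the planner can consider pk_substr_idx:
    SELECT * FROM mytable WHERE substring(id, 4) = '12345678';

A query written as WHERE id = 'XXX12345678' would instead use the primary key index, which you already get for free.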
Therefore, if at all possible, I would recommend splitting your ID into separate prefix (the XXX portion) and id_num (the ######## portion) columns. Since you stated that "the XXX part is always the same identical string prefix", you would stand to reap a performance benefit by either 1) splitting the string into two columns, or 2) hard-coding the XXX portion in your app (since it's "always the same identical string prefix") and storing only the numeric portion in the database.
Another approach (if you are willing to split the string into separate prefix and id_num columns) is to create a composite index. The table definition would then look something like:
CREATE TABLE mytable (
    prefix text,
    id_num int,
    <other columns>,
    PRIMARY KEY (prefix, id_num)
);
This creates a primary key on the two columns, and you would be able to see your queries use the index if you write your application with two columns in mind. Again, you would need to split the ID up into text and number portions. I believe this is the only way to get the best performance out of your queries. Any value that mixes text and numbers will ultimately be stored and interpreted as text.
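If you still want result sets to show the original XXX######## form, one option is a view that glues the two columns back together. A sketch, assuming the numeric part is always exactly 8 digits (drop the padding if it is not; the view name is arbitrary):

    CREATE VIEW mytable_with_full_id AS
    SELECT prefix || lpad(id_num::text, 8, '0') AS id,
           prefix,
           id_num
    FROM mytable;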
Disclosure: I work for EnterpriseDB (EDB)
Use an IDENTITY type column for the primary key and load the external IDs as a separate column

Table structure for data with many NULLs

I'm currently trying to model a dynamic data object that can have or lack some properties (the property names are known for the current requirement). It is not known whether new properties will be added later on (but it is almost certain). The modeled object is something along the lines of this:
int id PRIMARY KEY NOT NULL;
int owner FOREIGN KEY NOT NULL;
Date date NOT NULL;
Time time NOT NULL;
Map<String,String> properties;
A property can be of any type (int, bool, string, ...).
I'm not sure how I should model this object in an SQL database. There are two ways I can think of to do this, and I would like some input on which will be the better choice in terms of developer work (maintenance), memory consumption and performance. As a side note: properties are almost always NULL (not present).
(1) I would have a big table that has id, owner, date, time and every property as a column, where missing properties for a row are modeled as NULL. e.g.
TABLE_X
id|owner|date|time|prop_1|prop_2|prop_3|...
This table would have a lot of NULL values.
If new properties should be added then I would do an ALTER TABLE and add a new column for every new property.
Here I would run the usual
SELECT * FROM TABLE_X ...
(2) I would have a main table with all NOT NULL data:
TABLE_X
id|owner|date|time
And then have a separate table for every property, like this:
TABLE_X_PROP_N
foreign_key(TABLE_X(id))|value
There would be no NULL values at all. A property either has a value and is in its corresponding table, or it is NULL and then does not appear in its table.
To add new properties I would just add another table.
Here I would do a
SELECT * FROM TABLE_X LEFT JOIN TABLE_X_PROP_1 ON ... LEFT JOIN TABLE_X_PROP_2 ON ...
To repeat the question (so you don't have to scroll up):
Which of the two ways of dealing with the problem is better in terms of maintenance (work for the developer), memory consumption (on disk) and performance (more queries per second)? Maybe you also have a better idea on how to deal with this. Thanks in advance.
If you go with Option 2, I would think you need 3 tables:
TABLE_HEADER
id|owner|date|time
TABLE_PROPERTY
id|name
TABLE_PROPERTYVALUE
id|headerID(FK)|propertyID(FK)|value
Being able to add new properties easily gives you greater flexibility and lets you iterate much faster. The number of properties also has an effect (for example, if you have 500 properties you aren't going to want a table with 500 columns!). The main downside is that it becomes ugly if you need to attach complex business logic using the properties, as it's a more complex structure to navigate, and you can't enforce data integrity such as NOT NULL for particular fields. If you truly want a property bag like you have modeled in your object structure then this maps easily. Like everything, it depends on your circumstances which is most suitable.
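As rough DDL this might look like the following (Postgres-flavoured; names, types and the NOT NULL choices are illustrative, the owner column would reference whatever owner table you actually have, and the date/time column names are tweaked to avoid keyword clashes):

    CREATE TABLE table_header (
        id serial PRIMARY KEY,
        owner int NOT NULL,        -- FK to whatever owner table exists
        note_date date NOT NULL,
        note_time time NOT NULL
    );

    CREATE TABLE table_property (
        id serial PRIMARY KEY,
        name varchar(100) NOT NULL UNIQUE
    );

    CREATE TABLE table_propertyvalue (
        id serial PRIMARY KEY,
        header_id int NOT NULL REFERENCES table_header(id),
        property_id int NOT NULL REFERENCES table_property(id),
        value text
    );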
Solution 2, but why have separate tables for every property? Just put everything in one table:
CREATE TABLE properties (
    table_x_id int REFERENCES TABLE_X(id),
    property_name varchar(100) NOT NULL,
    value text,
    PRIMARY KEY (table_x_id, property_name)
);
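Pulling a single named property back out would then look something like this (the property name 'color' is just an example):

    SELECT x.id, p.value
    FROM TABLE_X x
    LEFT JOIN properties p
           ON p.table_x_id = x.id
          AND p.property_name = 'color';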
Sounds like you're trying to implement an Entity-Attribute-Value (often-viewed-as-an-anti-)pattern here. Are you familiar with it? Here are a few references:
https://softwareengineering.stackexchange.com/questions/93124/eav-is-it-really-bad-in-all-scenarios
http://www.dbforums.com/showthread.php?1619660-OTLT-EAV-design-why-do-people-hate-it
https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
Personally I'm extremely wary of this type of setup in a RDBMS. I tend to think that NoSQL document style databases would be a better fit for these types of dynamic structures, though admittedly I have relatively little real-world experience with NoSQL myself.

What is the best way to store categorical references in SQL tables?

I want to store a wide array of categorical data in MySQL database tables. Let's say, for instance, that I want to store information on "widgets" and want to categorize attributes in certain ways, e.g. a shape category.
For instance, the widgets could be classified as: round, square, triangular, spherical, etc.
Should these categories be stored within a table so they can best be referenced from an application? Another possibility, I would imagine, would be to add a shape column to widgets containing a TINYINT. That way my application could search shapes by that value and use a corresponding enum type that maps each shape int to its meaning.
Which would be best? Or is there another solution that I'm not thinking of yet?
Define a category table for each attribute grouping. IE:
WIDGET_SHAPE_TYPE_CODES
WIDGET_SHAPE_TYPE_CODE (primary key)
DESCRIPTION
Then use a foreign key reference in the WIDGETS table:
WIDGETS
WIDGET_ID (primary key)
...
WIDGET_SHAPE_TYPE_CODE (foreign key)
This has the benefit of being portable to other databases, and more obvious relationships which means simpler maintenance.
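A hedged MySQL sketch of that layout (the column sizes, UNSIGNED, and the InnoDB engine choice are assumptions on my part):

    CREATE TABLE widget_shape_type_codes (
        widget_shape_type_code TINYINT UNSIGNED PRIMARY KEY,
        description VARCHAR(50) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE widgets (
        widget_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        -- other widget columns here
        widget_shape_type_code TINYINT UNSIGNED NOT NULL,
        FOREIGN KEY (widget_shape_type_code)
            REFERENCES widget_shape_type_codes (widget_shape_type_code)
    ) ENGINE=InnoDB;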
What I would do is start with a Widgets table that has a category field that is a numeric type. If you also use the category table the numeric category is a foreign key that relates to a row in the category table. A numeric type is nice and small for better performance.
Optionally you can add a category table containing a numeric primary key and a text description. This matches up the numeric value with a human-friendly text value. This table can be used to convert the numbers to text if you just want to run reports directly from the database. The nice thing about having this table is that you don't need to update an executable if you add a new category. I would add such a table to my design.
MySQL's ENUM is handy, but it is stored in the table as a string, so it uses up more space in the table than is really needed. However it does have the advantage of preventing values that are not recognized from being stored. Preventing the storage of invalid numeric values is possible, but not as elegantly as with ENUM. The other problem with ENUM is that, because it is regarded as a string, the database must do more work when you select by the value: instead of comparing a single number, multiple characters have to be compared.
If you really want to, you can have an enumeration in your code that converts the numeric category back into something more application-code friendly, but you are making your code more difficult to maintain by doing this. However it can have a performance advantage because fewer bytes have to be returned when you run a query. I would try to avoid this because it requires updating the application code every time a category is added to the database. If you really need to squeeze performance out of the database you could select the whole category table, select the widgets table, and merge them in application code, but that is a rare circumstance, since the DB client almost always has a fast connection to the DB server and a few more bytes over the network are insignificant.
I think the best way is to use ENUM; for example, there is a predefined ENUM type in MySQL - http://dev.mysql.com/doc/refman/5.0/en/enum.html
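For example, something along these lines, using the shape values from the question (the table and column names are made up):

    CREATE TABLE widgets (
        widget_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        shape ENUM('round', 'square', 'triangular', 'spherical') NOT NULL
    );

    INSERT INTO widgets (shape) VALUES ('spherical');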

How to add user customized data to database?

I am trying to design an SQLite database that will store notes. Each of these notes will have common fields like title, due date, details, priority, and completed.
In addition though, I would like to add data for more specialized notes like price for shopping list items and author/publisher data for books.
I also want to have a few general purpose fields that users can fill with whatever text data they want.
How can I design my database table in this case?
I could just have a field for each piece of data for every note, but that would waste a lot of fields and I'd like to have other options and suggestions.
There are several standard approaches you could use for solving this situation.
You could create separate tables for each kind of note, copying over the common columns in each case. This would be easy but it would make it difficult to query over all notes.
You could create one large table with many columns and some kind of type field which would let you know which type of note it is (and therefore which subset of columns to use)
CREATE TABLE NOTE (ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields, price NUMERIC NULL, author VARCHAR(100) NULL, ...more specific fields)
You could break your tables up into an inheritance-style relationship, something like this:
CREATE TABLE NOTE (ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields);
CREATE TABLE SHOPPINGLISTITEM (ID int PRIMARY KEY, NOTE_ID int REFERENCES NOTE(ID), price NUMERIC, ...more shopping list item fields);
Option 1 would be easy to implement but would involve lots of mostly redundant table definitions.
Option 2 would be easy to create and easy to write queries on but would be space inefficient
And option 3 would be more space efficient and less redundant but would possibly have slower queries because of all the foreign keys.
This is the typical set of trade-offs for modeling these kinds of relationships in SQL; any of these solutions could be appropriate for your use case depending on your performance requirements.
You could create something like a custom_field table. It gets pretty messy once you start to normalize.
So you have your note table with its common fields.
Now add:
dynamic_note_field
id|label
1|publisher
2|color
3|size
dynamic_note_field_data
id|dynamic_note_field_id|value
1|1|Penguin
2|1|Marvel
3|2|Red
Finally, you can relate instances of your data with the fields they use through
note_dynamic_note_field_data
note_id|dynamic_note_field_data_id
1|1
1|3
2|2
So now we've said: note_id 1 has two additional fields. The first one has a value "Penguin" and represents a publisher. The second one has a value of "Red" and represents a color.
So what's the point of normalizing it this far?
You're not wasting space adding fields to every item (you relate a note with its additional dynamic fields via the m2m table).
You're not storing redundant labels (you may continue to store redundant data, however, as the same publisher is likely to appear many times... this aspect is extremely subjective). If you want rich data about your publishers, you typically want to take the step of turning them into their own entity rather than an ad-hoc string. Be careful when making this leap because it adds an extra level of hairiness to the db. Evaluate the use case accordingly.
The dynamic_note_field acts as your data definition. If you're interested in answering a question such as "what are the additional fields I've created" this lets you do it easily without searching all of your dynamic_note_field_data. Eventually, you might add extra info to this table such as a type field. I like to create this separation off the bat, but that might be a violation of the YAGNI principle in your case.
Disadvantages:
It's not too bad to search for all notes that have a publisher, where that publisher is "Penguin".
What's tricky is something like "Find any note with a value of 'Penguin' in any field". You don't know up front which field you're searching. At that point you're better off with a separate search index that's generated alongside your normalized db data, which remains the point of truth. Again, the nice thing about normalization is that you maintain the data in a very lossless, non-destructive state.
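To make the first case concrete, the publisher = 'Penguin' search against the tables above would be a three-join query, roughly like this (assuming the main note table is called note and has an id column):

    SELECT n.*
    FROM note n
    JOIN note_dynamic_note_field_data link
         ON link.note_id = n.id
    JOIN dynamic_note_field_data d
         ON d.id = link.dynamic_note_field_data_id
    JOIN dynamic_note_field f
         ON f.id = d.dynamic_note_field_id
    WHERE f.label = 'publisher'
      AND d.value = 'Penguin';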
For data you want to store but that does not have to be searchable, another option is to serialize it to/from JSON and store it in a TEXT column. This gives you arbitrary structure, but you cannot readily query against those values.
Yet another option is to dump SQLite and go with an object database. I seem to recall there are one or two working for Android. I have not tried any of these, however.
Just create a small table which contains the common fields of all your notes.
Then create a table for each class of special notes you have, which contains all the extra fields plus a reference to your main table.
For each note you enter, you create a row in your main table (which contains the common fields) and a row in the extra table that contains the extra fields plus a reference to the row in your main table.
Then you will just have to make a join in your query.
With this solution:
1) you have a safe design (you can't access fields that are not part of your note)
2) your db will be optimized
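A small SQLite sketch of that layout, using books as the specialized note type (the column names are illustrative, not from the question):

    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        due_date TEXT,
        details TEXT,
        priority INTEGER,
        completed INTEGER
    );

    CREATE TABLE book_note (
        note_id INTEGER PRIMARY KEY REFERENCES note(id),
        author TEXT,
        publisher TEXT
    );

    -- Reading book notes back takes a single join:
    SELECT n.*, b.author, b.publisher
    FROM note n
    JOIN book_note b ON b.note_id = n.id;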

Too many columns design question

I have a design question.
I have to store approx 100 different attributes in a table which should be searchable also. So each attribute will be stored in its own column. The value of each attribute will always be less than 200, so I decided to use TINYINT as data type for each attribute.
Is it a good idea to create a table which will have approx 100 columns (Each of TINYINT)? What could be wrong in this design?
Or should I classify the attributes into some groups (say 4 groups) and store them in 4 different tables (each with approx 25 columns)?
Or is there another data storage technique I should follow?
Just for example, the table is Table1 and it has columns Column1, Column2 ... Column100, each of TINYINT data type.
Since the size of each row is going to be very small, is it OK to do what I explained above?
I just want to know the advantages/disadvantages of it.
If you think that it is not a good idea to have a table with 100 columns, then please suggest other alternatives.
Please note that I don't want to store the information in composite form (e.g. a few XML columns).
Thanks in advance
Wouldn't a many-to-many setup work here?
Say Table A would have a list of widgets, which your attributes would apply to.
Table B has your types of attributes (color, size, weight, etc), each as a different row (not column)
Table C has foreign keys to the widget id (Table A) and the attribute type (Table B) and then it actually has the attribute value
That way you don't have to change your table structure when you've got a new attribute to add; you simply add a new attribute type row to Table B.
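A rough sketch of those three tables (the names are invented; TINYINT for the values follows the question, though a signed TINYINT tops out at 127, so adjust the type if your engine needs it):

    CREATE TABLE widget (
        widget_id INT PRIMARY KEY
        -- non-attribute columns go here
    );

    CREATE TABLE attribute_type (
        attribute_type_id INT PRIMARY KEY,
        name VARCHAR(50) NOT NULL       -- e.g. 'color', 'size', 'weight'
    );

    CREATE TABLE widget_attribute (
        widget_id INT NOT NULL,
        attribute_type_id INT NOT NULL,
        value TINYINT NOT NULL,
        PRIMARY KEY (widget_id, attribute_type_id),
        FOREIGN KEY (widget_id) REFERENCES widget (widget_id),
        FOREIGN KEY (attribute_type_id) REFERENCES attribute_type (attribute_type_id)
    );

Adding a new kind of attribute is then an INSERT into attribute_type rather than an ALTER TABLE.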
It's OK to have 100 columns. Why not? Just employ code generation to reduce the hand-writing of these columns.
I wouldn't worry much about the number of columns per se (unless you're stuck using some really terrible relational engine, in which case upgrading to a decent one would be my most hearty recommendation -- what engine[s] do you plan/need to support, btw?) but rather about the searchability requirement.
Does the table need to be efficiently searchable by the value of an attribute? If you need 100 indexes on that table, THAT might make insert and update operations slow -- how frequent are such modifications (vs reads to the table and especially searches on attribute values) and how important is their speed to you?
If you do "need it all" there just possibly may be no silver bullet of a "perfect" solution, just compromises among unpleasant alternatives -- more info is needed to weigh them. Are typical rows "sparse", i.e. mostly NULL with just a few of the 100 attributes "active" for any given row (just different subsets for each)? Is there (at least statistically) some correlation among groups of attributes (e.g. most of the time when attribute 12 is worth 93, attribute 41 will be worth 27 or 28 -- that sort of thing)?
Based on your last point, it seems to me that you may have a bad design. What is the nature of these columns? Are you storing information together that shouldn't be together? Are you storing information that should be in related tables?
So really, what we need in order to best help you is to see the nature of the data you have.
What would be in
Column1, Column3, Column10 versus Column4, Column15, Column20, Column25?
I had a table with 250 columns. There's nothing wrong. For some cases, it's how it works.
That's fine unless some of the columns you are defining have a meaning per se as independent entities and can be shared by multiple rows. In that case, it makes sense to normalize that set of columns out into a different table and put a referencing column in the original table (possibly with a foreign key constraint).
I think the correct way is to have a table that looks more like:
CREATE TABLE [dbo].[Settings](
    [key] [varchar](250) NOT NULL,
    [value] tinyint NOT NULL
) ON [PRIMARY]
Put an index on the key column. You can eventually make a page where the user can update the values.
Having done a lot of these in the real world, I don't understand why anyone would advocate having each variable be its own column. You have "approx 100 different attributes" so far; don't you think you are going to want to add to and delete from this list? Every time you do, it is a table change and a production release? You will not be able to build something to hand the maintenance off to a power user. Your reports are going to be hard-coded too? When things take off and you reach the maximum of 1,024 columns, are you going to rework the whole thing?
It is nothing to expand the table above - add Category, LastEditDate, LastEditBy, IsActive, etc., or to create archiving functionality. That is much more awkward to do with the column-based solution.
Performance is not going to be any different with this small amount of data, but relying on a programmer to make and release a change every time the list changes is unworkable.