I have a design question.
I have to store approx 100 different attributes in a table, and each attribute should also be searchable. So each attribute will be stored in its own column. The value of each attribute will always be less than 200, so I decided to use TINYINT as the data type for each attribute.
Is it a good idea to create a table which will have approx 100 columns (each of TINYINT)? What could be wrong with this design?
Or should I classify the attributes into some groups (say 4 groups) and store them in 4 different tables (each with approx 25 columns)?
Or is there some other data storage technique I should follow?
Just for example, the table is Table1 and it has columns Column1, Column2 ... Column100, each of TINYINT data type.
Since the size of each row is going to be very small, is it OK to do what I explained above?
I just want to know the advantages/disadvantages of it.
If you think that it is not a good idea to have a table with 100 columns, then please suggest other alternatives.
Please note that I don't want to store the information in composite form (e.g. a few XML columns).
Thanks in advance
Wouldn't a many-to-many setup work here?
Say Table A would have a list of widgets, which your attributes would apply to.
Table B has your types of attributes (color, size, weight, etc.), each as a different row (not column).
Table C has foreign keys to the widget id (Table A) and the attribute type (Table B), and then it actually has the attribute value.
That way you don't have to change your table structure when you've got a new attribute to add; you simply add a new attribute type row to Table B and new value rows to Table C.
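A minimal sketch of that setup (the concrete table and column names are my own illustration):

-- Table A: the widgets
CREATE TABLE Widget (Id INT PRIMARY KEY, Name VARCHAR(100));
-- Table B: attribute types, one per row
CREATE TABLE AttributeType (Id INT PRIMARY KEY, Name VARCHAR(100)); -- color, size, weight, ...
-- Table C: widget/attribute pairs with the actual value
CREATE TABLE WidgetAttribute (
    WidgetId INT NOT NULL REFERENCES Widget(Id),
    AttributeTypeId INT NOT NULL REFERENCES AttributeType(Id),
    Value TINYINT NOT NULL,
    PRIMARY KEY (WidgetId, AttributeTypeId)
);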
It's OK to have 100 columns. Why not? Just employ code generation to reduce the hand-writing of these columns.
I wouldn't worry much about the number of columns per se (unless you're stuck using some really terrible relational engine, in which case upgrading to a decent one would be my most hearty recommendation -- what engine[s] do you plan/need to support, btw?) but about the searchability thereby.
Does the table need to be efficiently searchable by the value of an attribute? If you need 100 indexes on that table, THAT might make insert and update operations slow -- how frequent are such modifications (vs reads to the table and especially searches on attribute values) and how important is their speed to you?
If you do "need it all" there just possibly may be no silver bullet of a "perfect" solution, just compromises among unpleasant alternatives -- more info is needed to weigh them. Are typical rows "sparse", i.e. mostly NULL with just a few of the 100 attributes "active" for any given row (just different subsets for each)? Is there (at least statistically) some correlation among groups of attributes (e.g. most of the time when attribute 12 is worth 93, attribute 41 will be worth 27 or 28 -- that sort of thing)?
Based on your last comment, it seems to me that you may have a bad design. What is the nature of these columns? Are you storing information together that shouldn't be together? Are you storing information that should be in related tables?
So really, what we need to best help you is to see what the nature of your data is.
What would be in Column1, Column3, Column10 versus Column4, Column15, Column20, Column25?
I had a table with 250 columns and there's nothing wrong with that; for some cases, it's how it works -- unless some of the columns you are defining have a meaning per se as independent entities and can be shared by multiple rows. In that case, it makes sense to normalize that set of columns out into a different table and put a single column in the original table (possibly with a foreign key constraint).
I think the correct way is to have a table that looks more like:
CREATE TABLE [dbo].[Settings](
[key] [varchar](250) NOT NULL,
[value] tinyint NOT NULL
) ON [PRIMARY]
Put an index on the key column. You can eventually make a page where the user can update the values.
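For example (the index name is my own choice):

CREATE INDEX IX_Settings_Key ON [dbo].[Settings]([key]);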
Having done a lot of these in the real world, I don't understand why anyone would advocate having each variable be its own column. You have "approx 100 different attributes" so far; don't you think you are going to want to add to and delete from this list? Every time you do, is it a table change and a production release? You will not be able to build something to hand the maintenance off to a power user. Your reports are going to be hard-coded too? And if things take off and you reach the maximum of 1,024 columns, are you going to rework the whole thing?
It is nothing to expand the table above (add Category, LastEditDate, LastEditBy, IsActive, etc.) or to create archiving functionality. It is much more awkward to do this with the column-based solution.
Performance is not going to be any different with this small amount of data, but to rely on the programmer to make and release a change every time the list changes is unworkable.
Related
I want to create a reference table. The number of rows and columns can grow (and become redundant) over time, and the column names may change over time as well. Let's assume the value in column 'Position' is 'IT Engineer' in row n. For that specific row n, further down the column structure, let's say in column 'Behind Sheds', is a value I need to retrieve. If the number of columns were fixed, there would be no issue, but the columns are added dynamically with T-SQL, which is fairly easy to do, even renaming them. My question is: is it good practice to add columns dynamically for the example above, or is there a better alternative, and what is it?
Thanks.
Maybe it would be better to use something like an additional table with the fields person_id (foreign key to the main table), person_param (e.g. 'Position') and person_value (e.g. 'IT Engineer')?
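A sketch of that additional table (the table name is my own; the main person table is assumed to exist):

CREATE TABLE person_attribute (
    person_id INT NOT NULL,           -- foreign key to the main (person) table
    person_param VARCHAR(100) NOT NULL, -- e.g. 'Position'
    person_value VARCHAR(255),          -- e.g. 'IT Engineer'
    PRIMARY KEY (person_id, person_param)
);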
I believe your issue is exactly the right case for implementing the EAV model.
Basically instead of having 20 columns, you have 3:
Entity
Attribute (otherwise known as column name)
Value (might actually be more values, of different types, and you just populate the right one)
That way, for 10 entities in the database with 20 columns each, you populate 200 rows of data in an EAV table. It performs worse and it's harder to query, but it does allow flexibility, so you can even give each entity a different set of attributes.
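A minimal sketch of such a table (names are mine; per the note above, there is one value column per type and you populate the one matching the attribute's type):

CREATE TABLE eav (
    entity_id INT NOT NULL,
    attribute VARCHAR(100) NOT NULL, -- otherwise known as the column name
    value_int INT NULL,
    value_string VARCHAR(255) NULL,  -- only one value column is populated per row
    PRIMARY KEY (entity_id, attribute)
);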
Let's say you're modeling an entity that has many attributes (2400+), far greater than the physical column limit of a given database engine (e.g. ~1,000 for SQL Server). Knowing nothing about the relative importance of these data points (which ones are hot/used most often) besides the domain/candidate keys, how would you implement it?
A) EAV. (boo... Native relational tools thrown out the window.)
B) Go straight across. The first table has a primary key and 1000 columns, right up to the limit. The next table is 1000, foreign keyed to the first. The last table is the remaining 400, also foreign keyed.
C) Stripe evenly across ceil( n / limit ) tables. Each table has an even number of columns, foreign keying to the first table. 800, 800, 800.
D) Something else...
And why?
Edit: This is more of a philosophical/generic question, not tied to any specific limits or engines.
Edit^2: As many have pointed out, the data was probably not normalized. Per usual, business constraints at the time made deep research an impossibility.
My solution: investigate further. Specifically, establish whether the table is truly normalised (at 2400 columns this seems highly unlikely).
If not, restructure until it is fully normalised (at which point there are likely to be fewer than 1000 columns per table).
If it is already fully normalised, establish (as far as possible) approximate frequencies of population for each attribute. Place the most commonly occurring attributes on the "home" table for the entity, use 2 or 3 additional tables for the less frequently populated attributes. (Try to make frequency of occurrence the criteria for determining which fields should go on which tables.)
Only consider EAV for extremely sparsely populated attributes (preferably, not at all).
Use Sparse Columns for up to 30000 columns. The great advantage over EAV or XML is that you can use Filtered Indexes in conjunction with sparse columns, for very efficient searches over common attributes.
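For instance, a sketch in SQL Server syntax (table and column names are my own; reaching thousands of columns additionally requires declaring the table as a wide table with a column set):

CREATE TABLE Entity (
    Id INT PRIMARY KEY,
    Attr1 INT SPARSE NULL, -- sparse columns take no storage when NULL
    Attr2 INT SPARSE NULL  -- ...and so on
);
-- filtered index for efficient searches on a commonly populated attribute
CREATE INDEX IX_Entity_Attr1 ON Entity(Attr1) WHERE Attr1 IS NOT NULL;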
Without having much knowledge in this area, I think an entity with so many attributes really, really needs a redesign. By that I mean splitting the big thing into smaller parts that are logically connected.
The key item to me is this piece:
Knowing nothing about the relative importance of these data points (which ones are hot/used most often)
If you have an idea of which fields are more important, I would put those more important fields in the "native" table and let an EAV structure handle the rest.
The thing is, without this information you're really shooting blind anyway. Whether you have 2400 fields or just 24, you ought to have some kind of idea about the meaning (and therefore the relative importance, or at least logical groupings) of your data points.
I'd use a one-to-many attribute table with a foreign key to the entity.
E.g.:
entities: id,
attrs: id, entity_id, attr_name, value
ADDED
Or as Butler Lampson would say, "all problems in Computer Science can be solved by another level of indirection"
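With that layout, fetching all attributes for one entity is a single join (written against the column names sketched above; the id value is arbitrary):

SELECT a.attr_name, a.value
FROM entities e
JOIN attrs a ON a.entity_id = e.id
WHERE e.id = 42;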
I would rotate the columns and make them rows. Rather than having a column containing the name of the attribute as a string (nvarchar), you could have it as a foreign key back to a lookup table which contains a list of all the possible attributes.
Rotating it in this way means you:
don't have masses of tables to record the details of just one item
don't have massively wide tables
you store only the info you need, thanks to the rotation (if you don't want to store a particular attribute, then just don't have that row)
I'd look at the data model a lot more carefully. Is it 3rd normal form? Are there groups of attributes that should be logically grouped together into their own tables?
Assuming it is normalized and the entity truly has 2400+ attributes, I wouldn't be so quick to boo an EAV model. IMHO, it's the best, most flexible solution for the situation you've described. It gives you built-in support for sparse data and gives you good searching speed, as the values for any given attribute can be found in a single index.
I would like to use a vertical approach (increase the number of rows) instead of a horizontal one (increase the number of columns).
You can try an approach like:
Table: id, property_name, property_value.
The advantage of this approach is that there is no need to alter or create a table when you introduce a new property/column.
I am trying to design a sqlite database that will store notes. Each of these notes will have common fields like title, due date, details, priority, and completed.
In addition though, I would like to add data for more specialized notes like price for shopping list items and author/publisher data for books.
I also want to have a few general purpose fields that users can fill with whatever text data they want.
How can I design my database table in this case?
I could just have a field for each piece of data for every note, but that would waste a lot of fields and I'd like to have other options and suggestions.
There are several standard approaches you could use for solving this situation.
You could create separate tables for each kind of note, copying over the common columns in each case. This would be easy, but it would make it difficult to query over all notes.
You could create one large table with many columns and some kind of type field which would let you know which type of note it is (and therefore which subset of columns to use)
CREATE TABLE NOTE (ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, /* ...more common fields */ PRICE NUMERIC NULL, AUTHOR VARCHAR(100) NULL /* ...more specific fields */);
you could break your tables up into an inheritance-like relationship, something like this:
CREATE TABLE NOTE (ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime /* ...more common fields */);
CREATE TABLE SHOPPINGLISTITEM (ID int PRIMARY KEY, NOTE_ID int REFERENCES NOTE(ID), PRICE NUMERIC /* ...more shopping list item fields */);
Option 1 would be easy to implement but would involve lots of mostly redundant table definitions.
Option 2 would be easy to create and easy to write queries on but would be space inefficient
And option 3 would be more space efficient and less redundant but would possibly have slower queries because of all the foreign keys.
This is the typical set of trade-offs for modeling these kinds of relationships in SQL; any of these solutions could be appropriate for your use case, depending on your performance requirements.
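For example, reading shopping list notes back under option 3 is one join against the tables sketched above (the NOTE_TYPE value is an assumption):

SELECT n.ID, n.DUEDATE, s.PRICE
FROM NOTE n
JOIN SHOPPINGLISTITEM s ON s.NOTE_ID = n.ID
WHERE n.NOTE_TYPE = 1; -- assuming type 1 means shopping list item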
You could create something like a custom_field table. It gets pretty messy once you start to normalize.
So you have your note table with its common fields.
Now add:
dynamic_note_field
id    label
1     publisher
2     color
3     size

dynamic_note_field_data
id    dynamic_note_field_id    value
1     1                        Penguin
2     1                        Marvel
3     2                        Red
Finally, you can relate instances of your data with the fields they use through
note_dynamic_note_field_data
note_id    dynamic_note_field_data_id
1          1
1          3
2          2
So now we've said: note_id 1 has two additional fields. The first one has a value "Penguin" and represents a publisher. The second one has a value of "Red" and represents a color.
So what's the point of normalizing it this far?
You're not wasting space adding fields to every item (you relate a note with its additional dynamic fields via the m2m table).
You're not storing redundant labels. (You may continue to store redundant data, however, as the same publisher is likely to appear many times. This aspect is extremely subjective: if you want rich data about your publishers, you typically want to turn them into their own entity rather than an ad-hoc string. Be careful when making this leap, because it adds an extra level of hairiness to the db. Evaluate the use case accordingly.)
The dynamic_note_field acts as your data definition. If you're interested in answering a question such as "what are the additional fields I've created" this lets you do it easily without searching all of your dynamic_note_field_data. Eventually, you might add extra info to this table such as a type field. I like to create this separation off the bat, but that might be a violation of the YAGNI principle in your case.
Disadvantages:
It's not too bad to search for all notes that have a publisher where that publisher is "Penguin".
What's tricky is something like "find any note with a value of 'Penguin' in any field". You don't know up front which fields you're searching. At that point you're better off with a separate index that's generated alongside your normalized db data and acts as the point of truth. Again, the nice thing about normalization is that you maintain the data in a very lossless, non-destructive state.
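For the easy case, a query like this sketch (against the tables above) finds every note whose publisher is "Penguin":

SELECT nd.note_id
FROM note_dynamic_note_field_data nd
JOIN dynamic_note_field_data d ON d.id = nd.dynamic_note_field_data_id
JOIN dynamic_note_field f ON f.id = d.dynamic_note_field_id
WHERE f.label = 'publisher' AND d.value = 'Penguin';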
For data you want to store but that does not have to be searchable, another option is to serialize it to/from JSON and store it in a TEXT column. This gives you arbitrary structure, but you cannot readily query against those values.
Yet another option is to dump SQLite and go with an object database. I seem to recall there are one or two working for Android. I have not tried any of these, however.
Just create a small table which contains the common fields of all your notes.
Then create a table for each class of special notes you have, which contains all the extra fields plus a reference to your first table.
For each note you enter, you create a row in your main table (which contains the common fields) and a row in your extra table, which contains the extra fields and a reference to the row in your main table.
Then you will just have to make a join in your query.
With this solution:
1) you have a safe design (you can't access fields that are not part of your note)
2) your db will be optimized
I am designing a system for a client, where he is able to create data forms for the various products he sells himself.
The number of fields he will be using will not be more than 600-700 (worst-case scenario); as it looks, he will probably be in the range of 400-500 (max).
I had 2 methods in mind for creating the database (using meta data):
a) Create a table for each product, which will hold only the fields necessary for that product; this will result in hundreds of tables, but with only the necessary fields for each product
or
b) Use one single table with all available form fields (anywhere from the current 300 to the max 700), resulting in one table that will have MANY fields, of which only about 10% will be used for each product entry (a product should usually not use more than 50-80 fields)
Which solution is best, keeping in mind that table maintenance (creation, updates and changes) will be done using metadata, so I will not need to change the table(s) manually?
Thank you!
/**** UPDATE *****/
Just an update: even after this long time (and a lot of additional experience gathered), I need to mention that not normalizing your database is a terrible idea. What's more, an unnormalized database almost always (in my experience, simply always) indicates a flawed application design as well.
I would have 3 tables:

product
    id
    name
    whatever else you need

field
    id
    field name
    anything else you might need

product_field
    id
    product_id
    field_id
    field value
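A sketch of those three tables in SQL (the "whatever else" parts are left as comments):

CREATE TABLE product (
    id INT PRIMARY KEY,
    name VARCHAR(100)
    -- whatever else you need
);
CREATE TABLE field (
    id INT PRIMARY KEY,
    field_name VARCHAR(100)
    -- anything else you might need
);
CREATE TABLE product_field (
    id INT PRIMARY KEY,
    product_id INT NOT NULL REFERENCES product(id),
    field_id INT NOT NULL REFERENCES field(id),
    field_value VARCHAR(255)
);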
Your key deciding factor is whether normalization is required. Even though you are only adding data using an application, you'll still need to cater for anomalies, e.g. what happens if someone's phone number changes, and they insert multiple rows over the lifetime of the application? Which row contains the correct phone number?
As an example, you may find that you'll have repeating groups in your data, like one person with several phone numbers; rather than have three columns called "Phone1", "Phone2", "Phone3", you'd break that data into its own table.
There are other issues in normalisation, such as transitive or non-key dependencies. These concepts will hopefully lead you to a database table design without modification anomalies, as you should hope for!
pulegium's solution is a good way to go.
You do not want to go with the one-table-for-each-product solution, because the structure of your database should not have to change when you insert or delete a product. Only the rows of one or many tables should be inserted or deleted, not the tables themselves.
While it's possible that it may be necessary, having that many fields for something as simple as a product list sounds to me like you probably have a flawed design.
You need to analyze your potential table structures to ensure that each field contains no more than one piece of information (e.g., "2 hammers, 500 nails" in a single field is bad) and that each piece of information has no more than one field where it belongs (e.g., having phone1, phone2, phone3 fields is bad). Either of these situations indicates that you should move that information out into a separate, related table with a foreign key connecting it back to the original table. As pulegium has demonstrated, this technique can quickly break things down to three tables with only about a dozen fields total.
My friend is building a product to be used by different independent medical units.
The database stores a vast collection of measurements taken at different times, like the temperature, blood pressure, etc...
Let us assume these are held in a table called exams with columns temperature, pressure, etc... (as well as id, patient_id and timestamp). Most of the measurements are stored as floats, but some are of other types (strings, integers...)
While many of these measurements are handled by their product, it needs to allow the different medical units to record and process other custom measurements. A very nifty UI allows the administrator to edit these customs fields, specify their name, type, possible range of values, etc...
He is unsure as to how to store these custom fields.
He is leaning towards a separate table (say a table custom_exam_data with fields like exam_id, custom_field_id, float_value, string_value, ...)
I worry that this will make searching both more difficult to achieve and less efficient.
I am leaning towards modifying the exam table directly (while avoiding conflicts on column names with some scheme like prefixing all custom fields with an underscore or naming them custom_1, ...)
He worries about modifying the database dynamically and having different schemas for each medical unit.
Hopefully some people which more experience can weigh in on this issue.
Notes:
he is using Ruby on Rails, but I think this question is pretty much framework agnostic, except for the fact that he is looking for solutions in SQL databases only.
I simplified the problem a bit, since the custom fields need to be available for more than one table, but I believe this doesn't really impact the direction to take.
(added) A very generic reporting module will need to search, sort, generate stats, etc., on this data, so it is required that this data be stored in columns of the appropriate type.
(added) User inputs will be filtered, for the standard fields as well as for the custom fields. For example, numbers will be checked to fall within a given range (you can't have a temperature of -12 or +444), etc. Thus, conversion to the appropriate SQL type is not a problem.
I've had to deal with this situation many times over the years, and I agree with your initial idea of modifying the DB tables directly, and using dynamic SQL to generate statements.
Creating string UserAttribute or Key/Value columns sounds appealing at first, but it leads to the inner-platform effect where you end up having to re-implement foreign keys, data types, constraints, transactions, validation, sorting, grouping, calculations, et al. inside your RDBMS. You may as well just use flat files and not SQL at all.
SQL Server provides INFORMATION_SCHEMA views for inspecting table schemas at runtime, and DDL statements for creating and modifying them. This route gives you full type checking, constraints, transactions, calculations, and everything else you need already built in; don't reinvent it.
It's strange that so many people come up with ad-hoc solutions for this when there's a well-documented pattern for it:
Entity-Attribute-Value (EAV) Model
Two alternatives are XML and Nested Sets. XML is easier to manage but generally slow. Nested Sets usually require some type of proprietary database extension to do without making a mess, like CLR types in SQL Server 2005+. They violate first-normal form, but are nevertheless the fastest-performing solution.
Microsoft Dynamics CRM achieves this by altering the database design each time a change is made. Nasty, I think.
I would say a better option would be to consider an attribute table. Even though these are often frowned upon, it gives you the flexibility you need, and you can always create views using dynamic SQL to pivot the data out again. Just make sure you always use LEFT JOINs and FKs when creating these views, so that the Query Optimizer can do its job better.
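As a sketch, the dynamic SQL would generate views along these lines (the view name and the custom_field_id values are hypothetical; the table and column names follow the custom_exam_data suggestion in the question):

CREATE VIEW exam_custom_pivot AS
SELECT e.id AS exam_id,
       c1.float_value  AS custom_field_1,
       c2.string_value AS custom_field_2
FROM exams e
LEFT JOIN custom_exam_data c1 ON c1.exam_id = e.id AND c1.custom_field_id = 1
LEFT JOIN custom_exam_data c2 ON c2.exam_id = e.id AND c2.custom_field_id = 2;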
I have seen a use of your friend's idea in a commercial accounting package. The table was split into two, first contained fields solely defined by the system, second contained fields like USER_STRING1, USER_STRING2, USER_FLOAT1 etc. The tables were linked by identity value (when a record is inserted into the main table, a record with same identity is inserted into the second one). Each table that needed user fields was split like that.
Well, whenever I need to store data of an unknown type in a database field, I usually store it as a string, serializing it as needed, and also store the type of the data.
This way, you can have any kind of data, working with any type of database.
I would be inclined to store the measurement in the database as a string (varchar), with another column identifying the measurement type. My reasoning is that it will presumably come from the UI as a string, and casting to any other datatype may introduce corruption before the user input gets stored.
The downside is that when you go to filter result sets by some measurement metric, you will still have to perform a cast, but at least the storage and persistence mechanism is not introducing corruption.
I can't tell you the best way but I can tell you how Drupal achieves a sort of schemaless structure while still using the standard RDBMSs available today.
The general idea is that there's a schema table with a list of fields. Each row really only has two columns, the 'table':String column and the 'column':String column. For each of these columns it actually defines a whole table with just an id and the actual data for that column.
The trick really is that when you are working with the data, it's never more than one join away from the bundle table that lists all the possible columns, so you end up not losing as much speed as you might otherwise think. This will also allow you to expand much further than just a few medical companies, unlike the custom_ prefix you were proposing.
MySQL is very fast at returning row data for short rows with few columns. In this way this scheme ends up fairly quick while allowing you lots of flexibility.
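A rough sketch of that layout (all names here are mine, not Drupal's actual schema):

-- one row per defined field
CREATE TABLE schema_fields (
    field_table  VARCHAR(64), -- the 'table' string column
    field_column VARCHAR(64)  -- the 'column' string column
);
-- and one narrow table per field, holding just an id and that field's data
CREATE TABLE field_data_temperature (
    entity_id INT NOT NULL,
    value FLOAT
);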
As to search, my suggestion would be to index the page content instead of the database content. Use Solr to parse through rendered pages and hold links to the actual page instead of trying to search through the database using clever SQL.
Define two new tables: custom_exam_schema and custom_exam_data.
custom_exam_data has an exam_id column, plus an additional column for every custom attribute.
custom_exam_schema would have a row to describe how to interpret each of the columns of the custom_exam_data table. It would have columns like name, type, minValue, maxValue, etc.
So, for example, to create a custom field to track the number of fingers a person has, you would add ('fingerCount', 'number', 0, 10) to custom_exam_schema and then add a column named fingerCount to the custom_exam_data table.
Someone might say it's bad to change the database schema at run time, but I'd argue that configuring these custom fields is part of set up and won't happen too often. Still, this method lets you handle changes at any time and doesn't risk messing around with your core table schemas.
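A sketch of the two tables and the fingerCount example (the column types are my guesses):

CREATE TABLE custom_exam_schema (
    name     VARCHAR(100) PRIMARY KEY,
    type     VARCHAR(20),
    minValue FLOAT,
    maxValue FLOAT
);
CREATE TABLE custom_exam_data (
    exam_id INT PRIMARY KEY
    -- one column per custom attribute is added here at configuration time
);
-- configuring the fingerCount custom field:
INSERT INTO custom_exam_schema (name, type, minValue, maxValue) VALUES ('fingerCount', 'number', 0, 10);
ALTER TABLE custom_exam_data ADD fingerCount INT;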
Let's say that your friend's database has to store data values from multiple sources, such as demographic values, diagnoses, interventions, physiognomic values, physiologic exam values, hospitalisation values, etc.
He might as well have to define choices. Let's say his database is missing race, and the unit staff need the race of the patient (different races are more or less likely to get certain diseases); they might want to use a drop-down with several choices.
I would propose using another table that holds these choices; or would you just use a "Custom_field_choices" table, which at some point is exactly the same thing with a different name?
Considering that the database:
- needs to be flexible
- must let data from multiple tables be added and customized
- should keep the integrity of its main structure for distribution and uniformity purposes
- must support limits, alarms and warnings on data
- must support units on data (10 kg or 10 pounds?)
- can offer a selection of choices for data
- can carry different rights on data (from simple user to admin)
- might need this data to generate reports without modifying the code (automation)
- might need this data for cross-reference analysis within the system without modifying the code
the custom table would be my solution; modifying each table would end up being too risky.
I would store those custom fields in a table where each record (dataType, dataValue, dataUnit) occupies one row, so there would be a one-to-many relation from one sample to the data. You can also create a table to record all the kinds of custom types you use. For example:
create table DataType
(
    id int primary key,
    name varchar(100) not null unique,
    description text,
    uri varchar(255) -- can be used for an ONTOLOGY
);

create table DataRecord
(
    id int primary key,
    sample_id int not null,  -- reference to the sample
    dataType_id int not null references DataType(id),
    value varchar(100),      -- the value as a string
    unit varchar(50)         -- g, mg/ml, etc.; could also be a link to a table describing the units, just like DataType
);
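For instance, recording a weight for sample 7 would look like this (the id values are arbitrary):

insert into DataType (id, name, description, uri) values (1, 'weight', 'body weight of the patient', null);
insert into DataRecord (id, sample_id, dataType_id, value, unit) values (100, 7, 1, '82.5', 'kg');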