Best Practice - Should I make one table or two for two similar sets of data? - sql

I need a table to store types of tests. I've been provided with two excel spreadsheets, one for microbial tests, one for pathogens. Microbial has 5 columns and Pathogens has 10. The 5 columns are in both tables. So one has 5 extra columns.
Just to give you an idea, the table columns would be something like this:
**Microbial**
Test Method IncubationStage1
**Pathogens**
Test Method IncubationStage1 IncubationStage2 Enrichment
So Is it better to have one table for Microbial and one for Pathogens, or better to have one table for Tests and have both within it? Is it bad to have a Microbial in a table where I know for certain only half the columns will be utilized? Or is it better to keep related items in the same table, and separate them by a column "Type"?
Obviously both will work fine but I'm wondering which is better.

The answer to these sorts of questions is always "it depends."
For my opinion, if you think you'll ever want to aggregate the data by test or by method across pathogenic or microbial types, then certainly you should put the data in the same table with an additional column that differentiates them.
You also could potentially better "normalize" your tables like this:
Table1: ExperimentID_PK ExperimentTypeID_FK Test Method
Table2: MeasurementRecordID_PK ExperimentID_FK Timestamp Other metadata about the record
Table3 MeasurementID_PK MeasurementTypeID_FK MeasurementValue MeasurementRecordID_FK
Table4: MeasurmentTypeId_PK Metadata About Measurement Types
Table5: ExperimentTypeId_PK Metadata About Experiment Types
... where all the leaf data elements point back to their parent data elements through foreign keys, and then you'd join data together in SQL statements, with indexes applied for optimal performance based on the types of queries you wanted to make. Obviously one of your rows in the question would end up appearing as multiple rows across multiple tables in this schema, and only at query time could they conceivably be reunited into individual rows (e.g. bound by MeasurementRecordID).
But there are other patterns too, in No-SQL land normalization can be the enemy. Slicing and dicing data sets turns out to be easier in some domains if it is stored in a more bloated format to make query structures more obvious. So it kind of comes down to thinking through your use cases.

Related

PostgreSQL - What should I put inside a JSON column

The data I want to store data that has this characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs it's own fields);
Here's how it would look like in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Bellow are the ideas I already had:
Do exactly as the tabular table (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought about it) that is better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb might be better than having many B-tree indexes, because the index maintenance costs would be higher.
The definitive answer depends on the SQL statements that you plan to run on that table.
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Each of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility. It is easy to add new columns. But you don't need that. JSON has a cost in data size and data type flexibility (notably JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
Depends on how much data you want to store, but as long as it is finite it shouldn't make a big difference if it contains a lot of null's or not

Merging multiple tables from multiple databases with all rows and columns

I have 30 databases from a survey application that all have a table of results with approximately 100 columns in each. Most of the columns are identical but each survey seems to have a unique column or two added in with no real pattern (these are the added questions and results of the survey). As I am working on the statement to join all of the tables into one large master table the code is getting quite complex. Is there a more efficient way to merge these tables from multiple databases and just select all rows and columns so it will merge if the column exists and create if it encounters a new column?
No, there isn't an automatic way to merge a bunch of similar, but not quite the same, tables into one. At least, not in any database system that I know of.
You could possibly automate something like that with a fairly simple script that relies on your database's information schema (or equivalent).
However, with only 30 tables and only a column or two different in each, I'm not sure it's worth it. A manual approach, with copying and pasting and making minor changes, would probably be faster.
Also, consider whether the "extra" columns that are unique to individual tables need to go into the combined table. The point of making a big single table is to process/analyze all the data together. If something only applies to a single source, this isn't possible.

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default null's mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredible painful to work with from both a processing and programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values becomes challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not idea in all situations, they are appropriate in a situation such as you describe. You are adding attributes indefinitely. You will quickly run over the maximum number of columns allowed in a single table (typically between 1,000 and 4,000 depending on the database). You can also keep track of each value in each column separately -- if they are added at different times, for instance, you can keep a time stamp on when they go in.
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer. And metadata tables (in a separate DB) describing these tables. With these metadata you can generate SQL etcetera. That is definitely superior too having many null columns. Or copied, adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer specific semantic knowledge) and generate reports.

mysql - how many columns is too many?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several table could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex select queries) to disk. This is bad, as disk i/o can be a big bottle-neck. This occurs if you have binary data (text or blob) in the query.
Wider table can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem unless all attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with JSON array stored in it. Obviously, if you don't have a problem with getting all the attributes every time. Although this would entirely defeat the purpose of storing it in an RDBMS and would greatly complicate every database transaction. So its not recommended approach to be followed throughout the database.
Having too many columns in the same table can cause huge problems in the replication as well. You should know that the changes that happen in the master will replicate to the slave.. for example, if you update one field in the table, the whole row will be w

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one to many relationship
e.g. an email table, an address table and a groups table. (i.e. one user can have multiple email addresses, physical addresses and can be a member of many groups)
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well in addition to the fact that grouping tends to be slow especially on large datasets.
Option 3 is easiest to deal with but generates more queries overall. If your data-set is small and you're not planning to scale much beyond your current needs its probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option however that is a middle of the road approach which I use in my job in which we deal with a very similar situation. Instead of getting the related records for each row 1 at a time, use IN() to get all of the related records for your results set. Then loop in your code to match them to the appropriate record for display. If you cache search queries you can cache that second query as well. Its only two queries and only one loop in the code (no parsing, use hashes to relate things by their key)
Personally, assuming my table indexes where up to scratch I'd going with a table join and get all the data out in one go and then process that to end up with a nested data structure. This way you're playing to each systems strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to put the GROUPed BY field into a $holder during the output loop. Compare that field to the $holder each time through and change your output accordingly.