Suppose you have the following tables: Orders, Customers, Events, Lines, and LineAssignments. The only table that I can modify is the LineAssignments table.
Event 1 <---> * Orders
Customer 1 <---> * Orders
Order 1 <---> * LineAssignments
Line 1 <---> * LineAssignments
Different pages display different combinations of info with the line assignments. For example, on some pages I only display the event info with the line assignments, while on other pages I display the order info with them, etc.
Basically, whenever I add a new line assignment, should I also store the EventID, CustomerID, and OrderID, or should I store only the OrderID and then do multiple joins to get the other data? Would it be better to create a view that joins these tables?

I tend to follow the school of thought that data should only be represented once in a database. This means, in your place, I would attempt to get what I need from multiple joins and only store OrderID.
The reason why I would do this is if there's any chance that the data stored in the other tables (the data you copied over to the LineAssignments table) is updated, the copied data would be wrong. I don't see it being super likely that the data in the other tables would change, but in the off-chance that it does... You'd be better off with the joins than potentially incorrect data.
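A minimal sketch of the "store only OrderID and join" approach, wrapped in a view, using Python's built-in sqlite3 module. The table names follow the question; the Name columns, the view name LineAssignmentDetails, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Events    (EventID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orders    (OrderID INTEGER PRIMARY KEY,
                        EventID INTEGER REFERENCES Events(EventID),
                        CustomerID INTEGER REFERENCES Customers(CustomerID));
CREATE TABLE Lines     (LineID INTEGER PRIMARY KEY, Name TEXT);

-- LineAssignments stores only the keys of its direct parents.
CREATE TABLE LineAssignments (
    LineAssignmentID INTEGER PRIMARY KEY,
    OrderID INTEGER REFERENCES Orders(OrderID),
    LineID  INTEGER REFERENCES Lines(LineID));

-- Hypothetical view: hides the joins from pages that need event/customer info.
CREATE VIEW LineAssignmentDetails AS
SELECT la.LineAssignmentID, la.LineID, o.OrderID,
       e.EventID, e.Name AS EventName,
       c.CustomerID, c.Name AS CustomerName
FROM LineAssignments la
JOIN Orders o    ON o.OrderID = la.OrderID
JOIN Events e    ON e.EventID = o.EventID
JOIN Customers c ON c.CustomerID = o.CustomerID;

INSERT INTO Events VALUES (1, 'Spring Fair');
INSERT INTO Customers VALUES (1, 'Acme');
INSERT INTO Orders VALUES (10, 1, 1);
INSERT INTO Lines VALUES (5, 'Line A');
INSERT INTO LineAssignments VALUES (100, 10, 5);
""")

# Changing the event needs no change to LineAssignments; the view follows it.
conn.execute("UPDATE Events SET Name = 'Summer Fair' WHERE EventID = 1")
row = conn.execute(
    "SELECT EventName, CustomerName FROM LineAssignmentDetails "
    "WHERE LineAssignmentID = 100").fetchone()
print(row)
```

Because nothing is copied into LineAssignments, the update to Events cannot leave a stale copy behind, which is the risk described above.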

It is simply a question of performance. Generally, you should stick to 3NF, i.e. no redundancy. While this gives very tight and elegant data structures, it can also lead to heavy performance issues.
This is usually the case if your database is both for productive and historical data, i.e. grows over time.
When issuing the joined queries, your RDBMS will load as much information as possible into memory, usually index information to speed up your query. Now, if your indexes are so big that they don't fit into memory, your RDBMS (no, the OS in fact) will have to swap, which is a performance killer.
The real deal (in my eyes) is to completely separate productive data (open / unpaid orders for example) from historic data. The historic data can and should be optimized for fast retrieval as nothing changes anymore and hard discs are cheap.
Productive data should be nice and tight (3NF). Whenever a piece of information is no longer productive (order is paid, parts are delivered, etc.), it is removed from the productive database and transferred to the historical data.
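As a rough sketch of that hand-over, assuming a hypothetical orders table with a paid flag and a separate orders_history table (both names are invented for this example), the move could be done in a single transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, paid INTEGER);
CREATE TABLE orders_history (order_id INTEGER PRIMARY KEY, customer TEXT);
INSERT INTO orders VALUES (1, 'Acme', 1), (2, 'Bolt', 0), (3, 'Core', 1);
""")

def archive_paid_orders(conn):
    """Copy paid orders to the history table, then delete them from the live table."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO orders_history (order_id, customer) "
            "SELECT order_id, customer FROM orders WHERE paid = 1")
        conn.execute("DELETE FROM orders WHERE paid = 1")

archive_paid_orders(conn)
open_ids = [r[0] for r in conn.execute(
    "SELECT order_id FROM orders ORDER BY order_id")]
archived_ids = [r[0] for r in conn.execute(
    "SELECT order_id FROM orders_history ORDER BY order_id")]
print(open_ids, archived_ids)
```

In a real setup the history side would more likely live in a separate database or warehouse; a single SQLite file is used here only to keep the sketch self-contained.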
Get information on the topic 'data warehouse' in case you're not yet familiar with it and read about the concepts. It's quite easy to understand.


How to decide on a data warehouse schema for clinical dataset?

I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events so that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts, etc., are related to a single admission. I had a few things in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables, for example I would have one fact table for diagnoses, one for procedures, one for lab notes but this just seems like too many fact tables with little to gain. Like I could have a fact table with an entry for each diagnosis per user per admission but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions such as diagnoses, procedures, etc. The issue with this is that most admissions have multiple diagnoses, so I would have to link to a bridge table in most cases, and it would then look like the image below. The drawback is the number of joins the queries require.
The third way that I read about is an entity-attribute-value (EAV) fact table, where each row of the fact table is one fact. For example, one row could be (patientid, admissionid, (attribute) Heart Rate Reading, (value) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but the end user would have to pivot the table after querying because of the nature of EAV, which makes it more complicated for them. It would look like the i2b2 star schema.
The last way I thought of was one entry per event in the fact table, with many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.). Here patientid and admissionid would be filled in all rows, but the rest would depend on the entry, so one entry could have just patientid, admissionid and a single procedure. Because of my lack of experience, I don't know what this would be like to query. I also don't know whether having all these entries with almost every column irrelevant is the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
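To make the pivot step of the EAV option concrete, here is a small sketch using conditional aggregation. It runs against SQLite from Python only so that it is self-contained; the CASE/GROUP BY pattern itself is standard SQL. The table name fact_observation, the attribute names, and the sample values are assumptions for illustration and are not taken from MIMIC-III.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_observation (
    patient_id INTEGER, admission_id INTEGER, attribute TEXT, value REAL);
INSERT INTO fact_observation VALUES
    (1, 100, 'heart_rate', 120),
    (1, 100, 'resp_rate', 18),
    (2, 200, 'heart_rate', 80);
""")

# Conditional aggregation: one output column per attribute the user asks for.
pivot_sql = """
SELECT patient_id, admission_id,
       MAX(CASE WHEN attribute = 'heart_rate' THEN value END) AS heart_rate,
       MAX(CASE WHEN attribute = 'resp_rate'  THEN value END) AS resp_rate
FROM fact_observation
GROUP BY patient_id, admission_id
ORDER BY patient_id
"""
rows = conn.execute(pivot_sql).fetchall()
for r in rows:
    print(r)
```

Every attribute the end user wants as a column has to be spelled out in the query, which is the extra burden mentioned for this option.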
Some tests have concluded that reducing the number of joins can improve BigQuery's performance. In other words, if you denormalize most of your data into a single table, you will probably get better results, since you will not need to perform many joins.
There are some points that you should think about when deciding the data modeling:
Is it easy to write queries in your model?
If you need many joins, it will probably be difficult to write queries against your model. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, the model becomes hard to query again.
Is it easy to join data in your model?
If you have many tables to be joined, your data will be difficult to join. The more denormalized the data is, the easier it is to join.
Easy to update data
If you need to update your data, consider that denormalization can help you. If you reduce the number of tables, you will need to update fewer tables. It's important to note that very complex nested fields will also be difficult to update.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data stays understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than 2 levels of nesting).
Keep in mind that your model will need more disk space when you denormalize it so your billing for storage in BigQuery will probably be higher.
I hope it helps
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
But what you should try and figure out is what is the important and high value data, and what is the less important and low value data. Only model and load into the database the high value data. If you try and build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind. E.G. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present or something similar. Freeform notes are another. Once some cohort has been decided on based on many other factors, you enable a data scientist to process the note with some machine learning but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together in a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations to enable additional aggregations or slicing prior to extracting the insightful data from the data lake.
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language or platform to create these files: Python, C#, Azure Function App, AWS Lambda, whatever. This will depend on your own skills and the resources available.

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. The scraper will pull some data that is static and some data that varies with time. Therefore there will be one table for each type of data (static/variable). I'm trying to decide if there should be a separate table for variable data that is scraped compared to variable data that is pulled through an API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), and so they will both be mapped with the same ProductID and date. So, it seems like I could just combine the schema of both tables to save on the join time during queries for processing the data later. The obvious downside to this is managing whether rows need to be created or updated whenever one of the processes runs (which of the scraper vs API tools create or update rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, to summarize: separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a time stamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean to read a single record from each of your tables, each time using a primary key, which is fast. But still reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way to collect data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard of the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only get inserted for an object that already has static data.
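A minimal sketch of that insert-or-update logic, assuming the two kinds of variable data share one table keyed by (product_id, day). The table name product_daily, the column names, and the sample values are hypothetical. The sketch uses SQLite's ON CONFLICT ... DO UPDATE clause (available from SQLite 3.24); other databases offer comparable upsert or MERGE statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE product_daily (
    product_id INTEGER, day TEXT,
    price REAL,            -- written by the scraper
    page_hits INTEGER,     -- written by the API job
    PRIMARY KEY (product_id, day))
""")

def save_scraped(conn, product_id, day, price):
    """Insert the row if it is missing, otherwise update only the scraped column."""
    with conn:
        conn.execute(
            "INSERT INTO product_daily (product_id, day, price) VALUES (?, ?, ?) "
            "ON CONFLICT(product_id, day) DO UPDATE SET price = excluded.price",
            (product_id, day, price))

def save_api(conn, product_id, day, page_hits):
    """Insert the row if it is missing, otherwise update only the API column."""
    with conn:
        conn.execute(
            "INSERT INTO product_daily (product_id, day, page_hits) VALUES (?, ?, ?) "
            "ON CONFLICT(product_id, day) DO UPDATE SET page_hits = excluded.page_hits",
            (product_id, day, page_hits))

# The order in which the two tools run does not matter.
save_api(conn, 1, "2024-01-01", 42)
save_scraped(conn, 1, "2024-01-01", 250000.0)
rows = conn.execute("SELECT * FROM product_daily").fetchall()
print(rows)
```

Neither tool needs to know whether the other has already run for that product and day; whichever arrives first creates the row.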

Table design and Querying

I have a table design that is represented by this awesome hand drawn image.
Basically, I have an account event, which can be either a Transaction (Payment to or from a third party) or a Transfer (transfer between accounts held by the user).
All common data is held in the event table (Date, CreatedBy, Source Account Id...) and then if it's a transaction, then transaction specific data is held in the Account Transaction table (Third Party, transaction type (Debit, Credit)...). If the event is a transfer, then transfer specific data is in the account_transfer table (Amount, destination account id...).
Note, something I forgot to draw, is that the Event table has an event_type_id. If event_type_id = 1, then it's a transaction. If it's a 2, then it's a Transfer.
Both the transfer and transaction tables are linked to the event table via an event id foreign key.
Note though that a transaction doesn't have an amount, as the transaction can be split into multiple payment lines, so it has a child account_transaction_line. To get the amount of the transaction, you sum its child lines.
Foreign keys are all set up, with an index on primary keys...
My question is about design and querying. If I want to list all events for a specific account, I can either:
Option 1:
from Event,
where event_type = 1 (transaction),
then INNER join to the Transaction table,
and INNER join to the transaction line (to sum the total)...
and then UNION to another selection,
from Event,
where event_type = 2 (transfer),
INNER join to transfer table...
and producing a list of all events.
Option 2:
from Event,
then LEFT join to transaction,
then LEFT join to transaction line,
then LEFT join to transfer ...
and sum up totals (because of the transaction lines).
Which is more efficient? I think option 1 is best, as it avoids the LEFT joins (Scans?)
An Indexed View of option 1?
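For reference, the two query shapes described in the question can be sketched as follows. The schema is reduced to the columns needed for the comparison, the table and column names are guesses based on the description, and SQLite is used only so the example is runnable; it says nothing about which shape SQL Server would execute faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_event (event_id INTEGER PRIMARY KEY,
                            account_id INTEGER, event_type_id INTEGER);
CREATE TABLE account_transaction (transaction_id INTEGER PRIMARY KEY, event_id INTEGER);
CREATE TABLE account_transaction_line (line_id INTEGER PRIMARY KEY,
                                       transaction_id INTEGER, amount REAL);
CREATE TABLE account_transfer (transfer_id INTEGER PRIMARY KEY,
                               event_id INTEGER, amount REAL);
INSERT INTO account_event VALUES (1, 7, 1), (2, 7, 2), (3, 8, 1);
INSERT INTO account_transaction VALUES (10, 1), (11, 3);
INSERT INTO account_transaction_line VALUES (100, 10, 30.0), (101, 10, 12.5), (102, 11, 5.0);
INSERT INTO account_transfer VALUES (20, 2, 99.0);
""")

# Option 1: one INNER JOIN branch per event type, combined with UNION ALL.
option_1 = """
SELECT e.event_id, SUM(l.amount) AS amount
FROM account_event e
JOIN account_transaction t      ON t.event_id = e.event_id
JOIN account_transaction_line l ON l.transaction_id = t.transaction_id
WHERE e.account_id = ? AND e.event_type_id = 1
GROUP BY e.event_id
UNION ALL
SELECT e.event_id, x.amount
FROM account_event e
JOIN account_transfer x ON x.event_id = e.event_id
WHERE e.account_id = ? AND e.event_type_id = 2
ORDER BY 1
"""

# Option 2: LEFT JOIN everything and take whichever amount is present.
option_2 = """
SELECT e.event_id, COALESCE(SUM(l.amount), MAX(x.amount)) AS amount
FROM account_event e
LEFT JOIN account_transaction t      ON t.event_id = e.event_id
LEFT JOIN account_transaction_line l ON l.transaction_id = t.transaction_id
LEFT JOIN account_transfer x         ON x.event_id = e.event_id
WHERE e.account_id = ?
GROUP BY e.event_id
ORDER BY e.event_id
"""

rows_1 = conn.execute(option_1, (7, 7)).fetchall()
rows_2 = conn.execute(option_2, (7,)).fetchall()
print(rows_1, rows_2)
```

Both return the same list of events for the account, so the choice comes down to the execution plans on the real data.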
On performance
For performance analysis in SQL server, there are quite a few factors at play, e.g.
What is the number of queries you are going to run, esp. on the same data? For example, if 80% of your queries are around 20% of your data, then caching may help significantly. (See below the design section on how this can matter)
Are your databases distributed or collocated on the same server? I assume it's a single server system, but if they were distributed, the design and optimization might vary.
Are these queries executed in a background process or on-demand and a user is expecting to get the results quicker?
Without these (and perhaps some other follow up questions once answers to these are provided), it would be unwise to give an answer stating one being preferable over the other.
Having said that, based on my personal experience, your best bet specifically for SQL server is to use query analyzer, which is actually pretty reasonable, as your first stop. After that, you can do some performance analyses to find the optimal solution. Typically, these are done by modeling the query traffic as it would be when the system is under regular load. (FYI: The modeling link is to ASP.NET performance modeling, but various core concepts apply to SQL as well.) You typically put the system under load and then:
Look at how many connections are lost -- this can increase if the queries are expensive.
Performance counters on the server(s) to see how the system is dealing with the load.
Responses from the queries to see if some start failing to provide a valid response, although this is unlikely to happen
FYI: This is based on my personal experience, after having done various types of performance analyses for multiple projects. We expect to do it again for our current project, although this time around we're using AD and Azure tables instead of SQL, and hence the methodology is not specific to SQL server, although the tools, traffic profiles, and what to measure varies.
On design
Introducing event id in the account transaction line:
Although you do not explicitly state so, it seems that the event ID and transaction ID are not going to change after the first entry has been made. If that's the case and you are only interested in getting the totals for a transaction in this query, then another option (which will optimize your queries) would be to add to the account transaction line a foreign key to AccountEvent's primary key (which I think is the event id). In the strictest DB sense, you are de-normalizing the table a bit, but in practice, it often helps with performance.
Computing totals on inserts:
The other approach that I have taken in a past project (just because I was using FoxPro in the previous century and FoxPro tended to be extremely slow at joins) was to keep total amounts in the primary table, the equivalent of your transactions table. This would be quite useful if your reads heavily outweigh your writes, and in the case of SQL, you can issue a transaction to make entries in other tables and update totals simultaneously (hence my question about your query profiles).
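A small sketch of keeping the total on the parent row, assuming a hypothetical total column on the transaction table and a helper add_line that writes the line and the total in the same database transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_transaction (
    transaction_id INTEGER PRIMARY KEY,
    total REAL NOT NULL DEFAULT 0);       -- hypothetical stored total
CREATE TABLE account_transaction_line (
    line_id INTEGER PRIMARY KEY, transaction_id INTEGER, amount REAL);
INSERT INTO account_transaction (transaction_id) VALUES (10);
""")

def add_line(conn, transaction_id, amount):
    """Insert a line and bump the stored total; both succeed or both roll back."""
    with conn:
        conn.execute(
            "INSERT INTO account_transaction_line (transaction_id, amount) "
            "VALUES (?, ?)", (transaction_id, amount))
        conn.execute(
            "UPDATE account_transaction SET total = total + ? "
            "WHERE transaction_id = ?", (amount, transaction_id))

add_line(conn, 10, 30.0)
add_line(conn, 10, 12.5)

stored_total = conn.execute(
    "SELECT total FROM account_transaction WHERE transaction_id = 10").fetchone()[0]
summed_total = conn.execute(
    "SELECT SUM(amount) FROM account_transaction_line "
    "WHERE transaction_id = 10").fetchone()[0]
print(stored_total, summed_total)
```

Reads can then take the total straight from the parent row, while the SUM over the lines remains available as a consistency check. Updates and deletes of lines would need the same treatment, which this sketch leaves out.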
Join transaction & transfers tables:
Keep a value to indicate which is which, and keep the totals there -- similar to previous one but at a different level. This will decrease the joins on query, but still have sum of totals on inserts -- I would prefer the previous over this one.
De-normalize completely:
This is yet another approach that folks have used (esp. in NOSQL space), but it gives me shivers when applying in SQL Server, so I have a personal bias against it but you could very well search it and find about it.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on Stack Overflow has had a similar situation and can recommend either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
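A scaled-down sketch of such a test, using SQLite from Python. The column names a_id and processed_at, the index name, and the sample rows are assumptions; the query-plan command is SQLite's EXPLAIN QUERY PLAN, and other databases have their own equivalents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT);
CREATE TABLE b (a_id INTEGER REFERENCES a(id), data TEXT, processed_at TEXT);
-- Composite index so the latest date per a_id can be found without a full scan.
CREATE INDEX idx_b_a_processed ON b (a_id, processed_at);
INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
INSERT INTO b VALUES (1, 'r1', '2024-01-01'),
                     (1, 'r2', '2024-02-01'),
                     (2, 'r3', '2024-01-15');
""")

latest_sql = """
SELECT a.id, MAX(b.processed_at) AS last_processed
FROM a
LEFT JOIN b ON b.a_id = a.id
GROUP BY a.id
ORDER BY a.id
"""
rows = conn.execute(latest_sql).fetchall()
print(rows)

# Inspect how the engine intends to run the query before deciding to denormalize.
plan = conn.execute("EXPLAIN QUERY PLAN " + latest_sql).fetchall()
for step in plan:
    print(step)
```

Entries in A that were never processed come back with a NULL date because of the LEFT JOIN. With realistic data volumes loaded, the same two statements show whether the join is actually a problem.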
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (cols: project_id, description, etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description, etc.). Whenever there is a new update to the project, we need to find the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and taking MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So, we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table in which each row is a parameter and another table that logs each parameter value historically as a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for each parameter every 1 hr, so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history, so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
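One possible way to keep such a stored latest value in step with the history is a trigger on the history table. This is only a sketch under assumed names (parameter, parameter_value, latest_value, latest_at, keep_latest); the answer above does not say how its update is implemented, and the same effect can be had from application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameter (
    id INTEGER PRIMARY KEY, name TEXT,
    latest_value REAL, latest_at TEXT);      -- denormalized copy of the newest reading
CREATE TABLE parameter_value (
    parameter_id INTEGER, recorded_at TEXT, value REAL);

-- Copy each new reading to the parent row unless a newer one is already stored.
CREATE TRIGGER keep_latest AFTER INSERT ON parameter_value
BEGIN
    UPDATE parameter
    SET latest_value = NEW.value, latest_at = NEW.recorded_at
    WHERE id = NEW.parameter_id
      AND (latest_at IS NULL OR latest_at <= NEW.recorded_at);
END;

INSERT INTO parameter (id, name) VALUES (1, 'temperature');
INSERT INTO parameter_value VALUES (1, '2024-01-01T00:00', 20.5);
INSERT INTO parameter_value VALUES (1, '2024-01-01T01:00', 21.0);
-- A late-arriving older reading must not overwrite the newer stored value.
INSERT INTO parameter_value VALUES (1, '2023-12-31T23:00', 19.0);
""")

latest = conn.execute(
    "SELECT latest_value, latest_at FROM parameter WHERE id = 1").fetchone()
print(latest)
```

Listing all parameters with their current value then needs only the parameter table, at the cost of one extra update per inserted reading.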

How to handle an extremely large database table size?

In this scenario, every sales order is going to have at least 400-500 products associated with it. Every time a sales order is generated, the cost and price of those products will be saved in the SalesOrderProduct table. This will cause the SalesOrderProduct table to become extremely large in a short period of time. What's the best way to handle the size of this table?
Are you sure there is a problem?
If you have millions of rows, no sweat. A SQL database will chew that stuff up.
If you have billions of rows, you might want a key-value store instead of a SQL database. Especially for archival information like past orders which is write-once read-never (and analyze-rarely). If you can't switch from SQL, you can use a clustered database.
But before you do anything, be sure there's an issue - test the performance with a good, realistic workload. See if it'll handle your needs for the near future. Don't solve problems which aren't there.
Final note: for this particular database schema, you can eliminate the SalesOrderProduct table by keeping track of historical costs/prices for products. Then you can use the order date to backfigure the costs/prices of all ordered products, eliminating the need for that join table.
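A sketch of that price lookup, assuming a hypothetical product_price_history table with a valid_from date per price, and assuming the list of products on each order is still recorded somewhere (here in a slim sales_order_item table without price columns). All names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_price_history (
    product_id INTEGER, valid_from TEXT, price REAL,
    PRIMARY KEY (product_id, valid_from));
CREATE TABLE sales_order (order_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE sales_order_item (order_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO product_price_history VALUES (1, '2024-01-01', 10.0), (1, '2024-03-01', 12.0);
INSERT INTO sales_order VALUES (500, '2024-02-15'), (501, '2024-03-10');
INSERT INTO sales_order_item VALUES (500, 1, 2), (501, 1, 2);
""")

# For each ordered product, pick the newest price that was valid on the order date.
price_sql = """
SELECT o.order_id, i.product_id, i.quantity,
       (SELECT h.price
        FROM product_price_history h
        WHERE h.product_id = i.product_id
          AND h.valid_from <= o.order_date
        ORDER BY h.valid_from DESC
        LIMIT 1) AS unit_price
FROM sales_order o
JOIN sales_order_item i ON i.order_id = o.order_id
ORDER BY o.order_id
"""
rows = conn.execute(price_sql).fetchall()
for r in rows:
    print(r)
```

The price history grows only when a price changes, not with every order line, which is where the saving would come from.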