Extracting subset of DataSet by pulling rows, following relationships - serialization

Is there a relatively easy way to extract a relationship-consistent subset of a DataSet? The subset I want is: the rows I'm interested in, all the child and parent rows required by those rows, and nothing else.
I have around a dozen tables, each with two to four relationships to other tables.
I figure I could write code to traverse the data tables and relationships given a day or two, but I'd prefer to re-use existing code and spend that day or two on my product.
Background:
I have a strongly typed DataSet pulled from a database of the components my company sells.
I'm considering another strongly-typed DataSet to store proposed solutions. There'll be one row per item on the Bill of Materials (BOM). Each row will describe the component's configuration.
I don't want to put solution tables in the component DataSet.
When I serialize the solution DataSet via WriteXml, I'd like to persist just enough information about the components. If I'm storing primary keys from the component tables, that shouldn't be too hard.
It occurs to me that persisted solutions could survive expiry of data from the main component DataSet if I also persisted the appropriate rows from that DataSet. I don't want to persist the whole component DataSet, though, hence my question about extracting a useful subset.

The easiest way I can think of is to call DataRow.SetModified() on each row you want, traversing to any child rows you need, and then call DataSet.GetChanges(), which returns a copy of the DataSet containing only the rows you've flagged.
You would also need to call RejectChanges() on the original DataSet after calling GetChanges() so that you can use it again.
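If you do end up writing the traversal yourself, it is fairly small. Here is a minimal sketch in Python rather than ADO.NET, assuming hypothetical tables held as dicts keyed by primary key and a hand-written list of relationships (none of these names come from your schema). It collects the starting rows plus every parent row they reference, transitively; following child relationships would be the same loop run in the other direction.

```python
from collections import deque

# Hypothetical data: table name -> {primary key -> row dict}.
tables = {
    "Component": {1: {"id": 1, "family_id": 10}, 2: {"id": 2, "family_id": 11}},
    "Family": {10: {"id": 10, "vendor_id": 100}, 11: {"id": 11, "vendor_id": 101}},
    "Vendor": {100: {"id": 100}, 101: {"id": 101}},
}

# Hypothetical relationships: (child table, foreign-key column, parent table).
relations = [
    ("Component", "family_id", "Family"),
    ("Family", "vendor_id", "Vendor"),
]


def extract_subset(tables, relations, start):
    """Return the start rows plus all parent rows they depend on.

    `start` is an iterable of (table name, primary key) pairs.
    """
    needed = set()
    queue = deque(start)
    while queue:
        table, key = queue.popleft()
        if (table, key) in needed:
            continue  # already visited, avoids loops in cyclic schemas
        needed.add((table, key))
        row = tables[table][key]
        # Queue the parent row behind every foreign key this row carries.
        for child, fk_column, parent in relations:
            if child == table and row.get(fk_column) is not None:
                queue.append((parent, row[fk_column]))
    subset = {name: {} for name in tables}
    for table, key in needed:
        subset[table][key] = tables[table][key]
    return subset


subset = extract_subset(tables, relations, [("Component", 1)])
```

The visited set is what keeps this safe when relationships form a cycle.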

Related

How to decide on a data warehouse schema for clinical dataset?

I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events such that users can query the dataset with ease, possibly using a star schema. Almost all of the entries such as diagnoses, procedures, notes, charts etc. are related to a single admission. I have a few options in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables, for example I would have one fact table for diagnoses, one for procedures, one for lab notes but this just seems like too many fact tables with little to gain. Like I could have a fact table with an entry for each diagnosis per user per admission but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions like diagnoses, procedures etc. The issue is that most admissions have multiple diagnoses, so I would have to link to a bridge table for most dimensions, and then it would look like the image below. The problem with this is the number of joins the queries require.
The third way that I read about is using something like an entity-attribute-value fact table, where each row is one fact. For example, one row could be something like (patientid - admissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but the end user would have to pivot the table after querying due to the nature of EAV, making it more complicated for them. It would look like the i2b2 star schema.
The last way I thought of was one entry per event in the fact table, with many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.). Here patientid and admissionid would be in all rows, but the rest would depend on the entry, so one entry could have just patientid, admissionid and a single procedure. I don't know what the end result would be like in terms of querying due to my lack of experience. I also don't know whether all these entries with almost every column being irrelevant is the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
Some tests have concluded that reducing the number of joins can improve BigQuery's performance. In other words, if you denormalize most of your data into a single table you will probably get better results, since you won't need to perform many joins.
There are some points that you should think about when deciding the data modeling:
Is it easy to write queries in your model?
If you need many joins, it will probably be difficult to write queries against your model. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, the model will become hard to query again.
Is it easy to join data in your model?
If you have many tables to be joined, your data will be difficult to join. The more denormalized the data is, the easier it is to join.
Easy to update data
If you need to update your data, you should consider that denormalization can help you. If you reduce the number of tables, you will need to update fewer tables. It's important to say that if you create very complex nested fields, they will also be difficult to update.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data remains understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than 2 levels of nesting).
Keep in mind that your model will need more disk space when you denormalize it so your billing for storage in BigQuery will probably be higher.
I hope it helps
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
But what you should try and figure out is what is the important and high value data, and what is the less important and low value data. Only model and load into the database the high value data. If you try and build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind, e.g. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present or something similar. Freeform notes are another. Once some cohort has been decided on based on many other factors, you enable a data scientist to process the note with some machine learning but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations, enabling additional aggregations or slicing before the insightful data is extracted from the data lake.
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, Parquet, CSV, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type, for example:
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files: Python, C#, an Azure Function App, AWS Lambda, whatever. This will depend on your own skills and the resources available.
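As a rough illustration, here is a minimal Python sketch that writes one such file schema as JSON Lines and reads back the rows for a single admission. The field names are the ones listed above; the sample values, the file name, and the helper functions are made up for the example, and a temporary directory stands in for the data lake.

```python
import json
import os
import tempfile

# Field names follow the file schema listed above; the values are made up.
records = [
    {"PatientId": 1, "AdmissionId": 7, "MeasureType": "HeartRate",
     "MeasureValue": "120", "Date": "2020-01-01", "Time": "08:30:00"},
    {"PatientId": 1, "AdmissionId": 7, "MeasureType": "HeartRate",
     "MeasureValue": "95", "Date": "2020-01-01", "Time": "09:30:00"},
    {"PatientId": 2, "AdmissionId": 9, "MeasureType": "HeartRate",
     "MeasureValue": "70", "Date": "2020-01-02", "Time": "10:00:00"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_admission(path, patient_id, admission_id):
    """Read back only the rows that connect to one admission in the fact."""
    with open(path, encoding="utf-8") as handle:
        rows = [json.loads(line) for line in handle]
    return [r for r in rows
            if r["PatientId"] == patient_id and r["AdmissionId"] == admission_id]


# Hypothetical file name; a temporary directory stands in for the data lake.
path = os.path.join(tempfile.mkdtemp(), "measurements.jsonl")
write_jsonl(path, records)
cohort_rows = read_admission(path, patient_id=1, admission_id=7)
```

Because each line is a self-describing object, adding a new field later only affects the files written from then on.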

Is there a term to describe tables where only the last value should be used?

In my database there are tables where each row represents an entity, and there are other tables where the same entity can appear multiple times, but only the latest entry is the valid one.
For example, I have a table customer where each row represents a customer and another table customer_membership_status where the same customer can be referenced multiple times, but only the last record for each customer is supposed to be used. Data is never updated in customer_membership_status, only inserted.
Is there a term to describe this pattern? I'm asking because I would like to quickly and easily explain the intended use of the table to others.
Probably the best term would be CQRS and event sourcing
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts on a single aggregate and maximizes performance and scalability
Event Sourcing pattern
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data. The store acts as the system of record and can be used to materialize the domain objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the business domain, while improving performance, scalability, and responsiveness. It can also provide consistency for transactional data, and maintain full audit trails and history that can enable compensating actions.
EDIT:
After a closer look, you may also want to read about SCD (Slowly Changing Dimension) Type 2.
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
Temporal table. It's a table where a timestamp / version attribute is part of a key. The temporal / version attribute allows you to identify which is the latest row for each customer.
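To show what "latest row for each customer" means in practice, here is a small sketch using Python's built-in sqlite3 module. The table name comes from the question; the column names (customer_id, valid_from, status) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_membership_status (
        customer_id INTEGER NOT NULL,
        valid_from  TEXT    NOT NULL,  -- hypothetical timestamp column
        status      TEXT    NOT NULL,  -- hypothetical payload column
        PRIMARY KEY (customer_id, valid_from)  -- timestamp is part of the key
    );
    -- Rows are only ever inserted, never updated.
    INSERT INTO customer_membership_status VALUES
        (1, '2020-01-01', 'trial'),
        (1, '2020-03-01', 'gold'),
        (2, '2020-02-01', 'silver');
""")

# For each customer, keep only the row with the greatest timestamp.
latest = conn.execute("""
    SELECT s.customer_id, s.status
    FROM customer_membership_status AS s
    WHERE s.valid_from = (
        SELECT MAX(i.valid_from)
        FROM customer_membership_status AS i
        WHERE i.customer_id = s.customer_id
    )
    ORDER BY s.customer_id
""").fetchall()
```

Wrapping that SELECT in a view is one way to hand other people the "current status" without them needing to know the pattern.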

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. The scraper will pull some data that is static and some data that varies with time. Therefore there will be one table for each type of data (static/variable). I'm trying to decide if there should be a separate table for variable data that is scraped compared to variable data that is pulled through an API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), and so they will both be mapped with the same ProductID and date. So, it seems like I could just combine the schema of both tables to save on the join time during queries for processing the data later. The obvious downside to this is managing whether rows need to be created or updated whenever one of the processes runs (which of the scraper vs API tools create or update rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, to summarize: separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a time stamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean to read a single record from each of your tables, each time using a primary key, which is fast. But still reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way of collecting data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard for the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only get inserted for an object that already has static data.
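As a sketch of how simple that insert-or-update logic can be, here is an example using Python's built-in sqlite3 module and SQLite's upsert clause (available from SQLite 3.24; other databases have their own equivalents, such as MERGE). The table name, the choice of columns, and the two helper functions are assumptions for illustration; the key of product id plus date follows the question. Whichever tool runs first creates the row, and the other fills in its own columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_daily (
        product_id INTEGER NOT NULL,
        day        TEXT    NOT NULL,
        price      REAL,     -- written by the scraper
        page_hits  INTEGER,  -- written by the API job
        PRIMARY KEY (product_id, day)
    )
""")


def save_scraped(conn, product_id, day, price):
    """Insert the day's row, or update only the scraper's column."""
    conn.execute("""
        INSERT INTO product_daily (product_id, day, price)
        VALUES (?, ?, ?)
        ON CONFLICT (product_id, day) DO UPDATE SET price = excluded.price
    """, (product_id, day, price))


def save_api(conn, product_id, day, page_hits):
    """Insert the day's row, or update only the API job's column."""
    conn.execute("""
        INSERT INTO product_daily (product_id, day, page_hits)
        VALUES (?, ?, ?)
        ON CONFLICT (product_id, day) DO UPDATE SET page_hits = excluded.page_hits
    """, (product_id, day, page_hits))


# The two tools can run in either order; both end up in the same row.
save_api(conn, 1, "2020-01-01", 42)
save_scraped(conn, 1, "2020-01-01", 250000.0)
rows = conn.execute("SELECT * FROM product_daily").fetchall()
```

Each writer touches only its own columns, so neither tool needs to know whether the other has already run.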

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I have the need to store lists of possible data.
Currently I have an SQL table with the following structure being accessed with php.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The Primary Key here would be boxId, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. It's possible that in the future there will be 5-10 different named data items or more. It's also possible that 1,000 - 10,000 entries will be stored in one week for just one named data item. The table will be searched as well, for example getting place from a specific serialNum, then getting all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each named data. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if I did it this way)
To sum it up: Which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of 1,000 - 10,000 rows per week is not that much data. That turns into 50,000 - 500,000 rows per year, which SQL Server should easily be able to handle. You don't say how many named entities you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.
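A minimal sketch of the named-column layout, using Python's built-in sqlite3 module as a stand-in for whatever database you run. The column names come from the question; the table name, the primary key on (boxID, place), the index, and the sample rows are assumptions, so adjust the key to whatever actually identifies a row in your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE box_item (
        boxID     INTEGER NOT NULL,
        place     TEXT    NOT NULL,
        serialNum TEXT,
        itemNum   TEXT,
        idLock    TEXT,
        PRIMARY KEY (boxID, place)  -- assumption: one row per box and place
    );
    -- Index the column you search by, so lookups avoid a full table scan.
    CREATE INDEX idx_box_item_serial ON box_item (serialNum);
    INSERT INTO box_item VALUES
        (1, 'A1', 'SN-100', 'IT-7', 'LK-1'),
        (1, 'A2', 'SN-200', 'IT-8', 'LK-2');
""")

# One lookup returns the place and all data stored with that serial number.
row = conn.execute("""
    SELECT boxID, place, serialNum, itemNum, idLock
    FROM box_item
    WHERE serialNum = ?
""", ("SN-100",)).fetchone()
```

With the name/data layout the same lookup needs a second query or a self-join to collect the other values for that place.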

Too many MasterDataSets?

I'm writing a program in Visual Studio 2010 which is using an Access Database. Right now it has 6 Master Data Sets.
Each DataSet has a single tabular connection. Would it be better to use ONE MasterDataSet instead of the six, or should I continue to use each of the Master Data Sets?
Below is a copy of my Solution Explorer to indicate what I mean:
EDIT: Even better: If it should be better that I merge down into one, how would I go about starting this?
This depends.
If each dataset is used in a different form, it's better to keep them the way you have them.
If you put them all in a single dataset and initialize that dataset on a form that uses only one of its six tables, you will spend CPU loading the unwanted tables and memory holding them.
If, for example, you use two tables on one screen, it's better to combine them in a single dataset, although memory consumption will be no different than with two datasets holding one table each.
Also, if some of your tables are related, like Employee and Department, and you want that data on a single form, it's better to put the two tables in one dataset so that the relation is ready for display and you don't have to build it in your code.