Named Entity Recognition: Splitting data into test and train sets - training-data

When fitting a named entity recognition model, is it important to make sure that the entities in your training data do not repeat in your testing data? For example, suppose we have a relatively small data set and the goal is to identify person names. Say we have 300 unique person names, but we would like to generalize our extraction to future data that may contain person names outside those 300. When we split the data into training and testing sets, is it important to ensure that none of the 300 unique names appear in both the training set and the testing set?

It is important to have entities in the test set that do not appear in the training set, so you can check that your model generalizes. Usually, though, if you have enough data and enough distinct values, a random split will give you a decent division even without explicitly enforcing it.
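If you do want to enforce it, one way is an entity-disjoint split: split the names first, then keep only sentences whose names fall entirely on one side. A minimal sketch (the sentences, names, and split ratio are made up for illustration):

```python
import random

# Toy data: each sentence mapped to the set of person names it mentions.
sentences = {
    "Alice met Bob in Paris.": {"Alice", "Bob"},
    "Carol called Alice yesterday.": {"Carol", "Alice"},
    "Dave and Erin signed the contract.": {"Dave", "Erin"},
    "Erin flew to Boston.": {"Erin"},
}

def entity_disjoint_split(sentences, test_ratio=0.25, seed=0):
    """Split names first, then sentences, so no name crosses the split."""
    rng = random.Random(seed)
    names = sorted({n for ents in sentences.values() for n in ents})
    rng.shuffle(names)
    test_names = set(names[: max(1, int(len(names) * test_ratio))])
    train, test = [], []
    for sent, ents in sentences.items():
        if ents & test_names:
            if ents <= test_names:
                test.append(sent)
            # Sentences mixing train and test names are dropped
            # to keep the split strictly entity-disjoint.
        else:
            train.append(sent)
    return train, test

train, test = entity_disjoint_split(sentences)
```

Note that dropping mixed sentences shrinks the usable data, which is the price of a strictly entity-disjoint evaluation.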

Related

How to decide on a data warehouse schema for clinical dataset?

I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events so that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts, etc., are related to a single admission. I had a few ideas in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables. For example, I would have one fact table for diagnoses, one for procedures, one for lab notes, but this just seems like too many fact tables with little to gain. I could have a fact table with an entry for each diagnosis per user per admission, but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions like diagnoses, procedures, etc. The issue with this is that for most admissions there are multiple diagnoses, so I would have to link to a bridge table in most fact tables, and then it would look like the image below. The problem here is the joins required for the queries.
The third way I read about is using something like an entity-attribute-value (EAV) fact table, where each row is one fact. For example, one row could be (patientid - admissionid - (Attribute) Heart Rate Reading - (Value) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but due to the nature of EAVs it would require the end user to pivot the table after querying, making things more complicated for them. It would look like the i2b2 star schema.
The last way I thought of was making one entry per event in the fact table, but with many columns to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.), where patientid and admissionid appear in every row but the rest depend on the entry, so one entry could have just patientid, admissionid, and a single procedure. Due to my lack of experience, I don't know what querying this would be like, or whether rows in which almost every column is irrelevant are the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
Some tests have concluded that reducing the number of joins can enhance BigQuery's performance. In other words, if you denormalize most of your data into a single table, you will probably get better results, since you will not need to perform many joins.
There are some points you should think about when deciding on the data model:
Is it easy to write queries in your model?
If your model needs many joins, it will probably be difficult to build queries against. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, you will again start having problems querying the model.
Is it easy to join data in your model?
If you have many tables to join, your data will be difficult to join. The more denormalized the data is, the less joining you need to do.
Is it easy to update data?
If you need to update your data, consider that denormalization can help: with fewer tables, you need to update fewer of them. It's important to note that very complex nested fields are also difficult to update.
Is the data understandable?
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data remains understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than two levels of nesting).
Keep in mind that your model will need more disk space when denormalized, so your BigQuery storage bill will probably be higher.
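As a small illustration of the nested-field approach, here is a sketch of one denormalized admission record in the newline-delimited JSON format BigQuery can load; the field names and values are illustrative, not taken from MIMIC-III:

```python
import json

# One denormalized admission with nested, repeated fields, in the
# newline-delimited JSON format BigQuery can load. Field names are
# illustrative, not MIMIC-III's.
admission = {
    "patient_id": 42,
    "admission_id": 1001,
    "diagnoses": [
        {"code": "I10", "description": "Essential hypertension"},
        {"code": "E11", "description": "Type 2 diabetes"},
    ],
    "procedures": [
        {"code": "3995", "description": "Hemodialysis"},
    ],
}

# One JSON object per line; nesting keeps each admission's diagnoses and
# procedures in the same row, so reading them uses UNNEST instead of a join.
line = json.dumps(admission)
```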
I hope this helps.
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
What you should try to figure out is which data is important and high value, and which is less important and low value. Only model and load the high-value data into the database. If you try to build the perfect data model, you will never deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low-value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind, e.g. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low-value information; in the database you might set a flag if multiple prescriptions are present, or something similar. Freeform notes are another. Once a cohort has been decided on based on many other factors, you can let a data scientist process the notes with some machine learning, but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside a group/junk dimension as a way of maintaining the grain of the fact table and keeping it from exploding with low-value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long-running extracts for specific types of data. You can then design a junk dimension to serve these higher-value measurement determinations, enabling additional aggregation or slicing before extracting the detailed data from the data lake.
File Structure
In your data lake I would have many file schemas for the low-value data. These could be JSON, Parquet, CSV, or whatever you prefer. You would include the data needed to connect it back to the fact, along with the data specific to the file type:
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
The point is that most of this data will never be looked at, but occasionally a few records are high value. You just don't know which ones they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files: Python, C#, Azure Function Apps, AWS Lambda, whatever. This will depend on your own skills and the resources available.
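For instance, a minimal Python sketch of writing such a file as newline-delimited JSON, using the fields listed above (the path and values are made up):

```python
import json
import os
import tempfile

# Low-value measurements written to a data-lake file as newline-delimited
# JSON, using the fields from the schema above. Values are illustrative.
records = [
    {"PatientId": 1, "AdmissionId": 10, "MeasureType": "HeartRate",
     "MeasureValue": 120, "Date": "2019-03-01", "Time": "08:30"},
    {"PatientId": 1, "AdmissionId": 10, "MeasureType": "BloodPressure",
     "MeasureValue": "130/85", "Date": "2019-03-01", "Time": "08:31"},
]

path = os.path.join(tempfile.gettempdir(), "measurements.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

One file per day or per measure type keeps the layout simple, and changing the record shape later only affects new files.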

ReduceByKeyAndWindowByCount in Spark stateful streaming aggregations

I have to inner join two relational tables extracted from Oracle.
Actually, I want to perform a 1-to-1 join to get one row per primary key, with the values from the second table aggregated into a list. So before joining the two tables 1-to-1, I have to reduce all my rows by key to one row, with the values kept in a list.
(Illustration: tables aggregation.)
And here I've met a problem: when to stop aggregating for my key and pass the aggregated entity to the next step. Spark offers solutions for this by providing window intervals and watermarking for late data, so the assumption for keeping data consistent is the time at which the data is received. That is feasible and applicable for infinite datasets, but in my case I know the exact count of aggregations for each key. For example, for customer_id 1000 I know there are exactly 3 products, and after I've aggregated 3 products I know I can stop aggregating and go to the next streaming step in my pipeline. How can this be implemented using Spark streaming? I know there is a reduceByKeyAndWindow operation, but in my case I need something like reduceByKeyAndWindowByCount.
The count will be stored in a static dataset, or simply stored in each row as additional data.
Finally, we decided to switch from streaming to core Spark with batch processing, because we have a finite dataset and that works well for our use case. We came to the conclusion that Spark Streaming was designed for processing continuous datasets (which was actually obvious from its name alone), and that's why it only offers window intervals by time and watermarks to correct network or other delays during transportation. We also found our design with counters ugly and complex; in other words, bad. It was a live example of bad design, and the growing complexity was a marker that we were moving in the wrong direction, trying to use a tool for a purpose it was not designed for.
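For reference, the count-based completion idea the question describes can be sketched in plain Python, outside Spark; the keys, products, and expected counts here are made up:

```python
from collections import defaultdict

# Pure-Python sketch of "reduce by key, emit when the expected count is
# reached" -- the behaviour the question calls reduceByKeyAndWindowByCount.
# In the real pipeline, expected_counts would come from a static dataset.
expected_counts = {1000: 3, 2000: 2}

def aggregate_by_count(stream, expected_counts):
    """Yield (key, values) as soon as all expected items for a key arrived."""
    pending = defaultdict(list)
    for key, value in stream:
        pending[key].append(value)
        if len(pending[key]) == expected_counts[key]:
            yield key, pending.pop(key)

stream = [(1000, "tv"), (2000, "phone"), (1000, "radio"),
          (2000, "case"), (1000, "lamp")]
result = dict(aggregate_by_count(stream, expected_counts))
# result == {2000: ["phone", "case"], 1000: ["tv", "radio", "lamp"]}
```

In Spark terms this is custom per-key state, which is exactly the complexity the answer above decided to avoid by switching to batch processing.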

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. The scraper will pull some data that is static and some data that varies with time, so there will be one table for each type of data (static/variable). I'm trying to decide whether there should be a separate table for variable data that is scraped versus variable data that is pulled through the API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), so both will be keyed by the same ProductID and date. It seems I could just combine the schemas of both tables to save on join time when querying the data for processing later. The obvious downside is managing whether rows need to be created or updated whenever one of the processes runs (i.e., which of the scraper vs. API tools creates or updates rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries this applies to, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, to summarize: separate tables for the sake of general design principles, or one table for the sake of join speed?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product ID, you can keep all information describing a particular ID value in one record. Or do you intend to have a timestamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean to read a single record from each of your tables, each time using a primary key, which is fast. But still reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way of collecting data. On the contrary, it's the purpose of a database to contain data according to its logical structure, without (too much) regard to the technical means of collecting or accessing it.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only get inserted for an object that already has static data.
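That insert-or-update logic can be sketched with SQLite; the table and column names are illustrative, and the real engine and schema may differ:

```python
import sqlite3

# Sketch of the insert-or-update decision when the scraper and the API
# tool each fill their own columns of a shared (ProductID, Date) row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE variable_data (
    ProductID INTEGER, Date TEXT, Price REAL, PageHits INTEGER,
    PRIMARY KEY (ProductID, Date))""")

def upsert(conn, product_id, date, column, value):
    """Update the row for (product_id, date) if it exists, else insert it."""
    cur = conn.execute(
        f"UPDATE variable_data SET {column} = ? WHERE ProductID = ? AND Date = ?",
        (value, product_id, date))
    if cur.rowcount == 0:
        conn.execute(
            f"INSERT INTO variable_data (ProductID, Date, {column}) VALUES (?, ?, ?)",
            (product_id, date, value))

upsert(conn, 1, "2019-03-01", "Price", 199.99)  # scraper runs first: insert
upsert(conn, 1, "2019-03-01", "PageHits", 57)   # API tool runs later: update
row = conn.execute("SELECT Price, PageHits FROM variable_data").fetchone()
# row == (199.99, 57)
```

Whichever tool runs first creates the row; the other simply fills in its own columns, so the order of the daily jobs doesn't matter.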

Match rows between datasets with noisy unstructured data and no unique identifier

Given 2 datasets, one of which is extremely unstructured and noisy, and another which is well structured and complete - does there exist a software tool, API or framework that I can use to match rows in the unstructured dataset to those in the structured dataset given that neither set contains any unique identifier? Possibly with the ability to produce some sort of confidence score that the rows are a match.
That is basically the question, but there's more info below if interested:
I have 2 datasets. Both consist of user information, containing names, addresses, contact details and other miscellaneous information about a user.
Dataset #1 is unformatted, unordered, undelimited, and noisy (missing values, unconventional data; e.g., one row's phone number could be +XX (XX) XXX XXX while the next row's could be XXXXXXXXXX), with the only convenience afforded being that each user record is at least on its own row.
Dataset #2 is spread across roughly 80 separate datasets; although these are not uniform with each other, each is well formatted and its data well delineated, meaning I can likely normalise them into one logical dataset. For the purpose of this question, consider this one set.
Every record in Dataset #1 is contained in Dataset #2. The problem is that there are no unique identifiers, and Dataset #1 is noisy and unstructured.
No answers so far, so I'll contribute the results of the research I've done since asking this. This kind of problem belongs to the field of Record Linkage (http://en.wikipedia.org/wiki/Record_linkage): matching records between two data sets where there is no shared unique ID between the records.
I've found the software FRIL (http://fril.sourceforge.net/) for record linkage, which incorporates a range of distance metrics into its linking algorithms so that confidence and acceptance levels can be set when matching between the left and right sides. For example, q-gram distance can be used to match different formats of storing names, since the order of the q-grams (substrings of the given strings) does not matter; "John Doe" and "Doe, John" will therefore give a high score. Soundex algorithms can be used to overcome common misspellings in the name fields as well, e.g. "John Doe" and "Jon Doe". For other types of data, edit distance or another distance metric can be used.
Based on the results of each distance metric, a confidence level is calculated, and a match is made provided the confidence level is equal to or higher than the acceptance level set.
This field and this software will likely overcome the differences in the format of the stored data; I'll likely have to do additional work structuring Dataset #1 mentioned above before I can feed it into this kind of system.
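The q-gram idea described above can be sketched in a few lines of Python; the names, threshold, and scoring here are illustrative and not FRIL's actual algorithm:

```python
def qgrams(s, q=2):
    """Set of padded q-grams over the words of s, ignoring case/punctuation."""
    cleaned = "".join(c if c.isalnum() else " " for c in s.lower())
    grams = set()
    for word in cleaned.split():
        padded = f" {word} "
        grams.update(padded[i:i + q] for i in range(len(padded) - q + 1))
    return grams

def similarity(a, b, q=2):
    """Jaccard similarity between the q-gram sets of a and b, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

structured = ["John Doe", "Jane Smith", "Bob Brown"]
noisy = ["Doe, John", "jane smyth"]

ACCEPTANCE = 0.5  # made-up acceptance level
matches = {}
for record in noisy:
    best = max(structured, key=lambda s: similarity(record, s))
    score = similarity(record, best)
    if score >= ACCEPTANCE:
        matches[record] = (best, score)
```

Because the q-grams are collected per word into an unordered set, "Doe, John" scores a perfect match against "John Doe", and the misspelled "jane smyth" still clears the threshold against "Jane Smith".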

Extracting subset of DataSet by pulling rows, following relationships

Is there a relatively easy way to extract a relationship-consistent subset of a DataSet? The subset I want is: the rows I'm interested in, all the child and parent rows required by those rows, and nothing else.
I have around a dozen tables, each with two to four relationships to other tables.
I figure I could write code to traverse the data tables and relationships in a day or two, but I'd prefer to reuse existing code and spend that day or two on my product.
Background:
I have a strongly typed DataSet pulled from a database of the components my company sells.
I'm considering another strongly-typed DataSet to store proposed solutions. There'll be one row per item on the Bill of Materials (BOM). Each row will describe the component's configuration.
I don't want to put solution tables in the component DataSet.
When I serialize the solution DataSet via WriteXml, I'd like to persist just enough information about the components. If I'm storing primary keys from the component tables, that shouldn't be too hard.
It occurs to me that persisted solutions could survive expiry of data from the main component DataSet if I also persisted the appropriate rows from that DataSet. I don't want to persist the whole component DataSet, though, hence my question about extracting a useful subset.
The easiest way I can think of is to call DataRow.SetModified() on each row you want (traversing through any child rows you need), then call DataSet.GetChanges(), which will return only the rows you've flagged.
You would also need to call RejectChanges() on the original DataSet after calling GetChanges() if you want to use it again.
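For what it's worth, the traversal itself is small. Here is a pure-Python sketch of collecting a relationship-consistent subset; the tables, keys, and relations are hypothetical, and only parent links are followed (child rows would be handled symmetrically):

```python
# Sketch of extracting a relationship-consistent subset: start from seed
# rows, then pull in every parent row they reference via foreign keys.
tables = {
    "components": {1: {"name": "Widget"}, 2: {"name": "Gadget"}},
    "bom": {10: {"component_id": 1, "qty": 4},
            11: {"component_id": 2, "qty": 1}},
}
# child table -> (foreign-key column, parent table)
relations = {"bom": ("component_id", "components")}

def subset(tables, relations, seeds):
    """seeds: {table: set of row ids}; returns the closure under parent links."""
    keep = {t: set(ids) for t, ids in seeds.items()}
    frontier = [(t, i) for t, ids in seeds.items() for i in ids]
    while frontier:
        table, row_id = frontier.pop()
        fk = relations.get(table)
        if fk:
            col, parent = fk
            pid = tables[table][row_id][col]
            if pid not in keep.setdefault(parent, set()):
                keep[parent].add(pid)
                frontier.append((parent, pid))
    return {t: {i: tables[t][i] for i in ids} for t, ids in keep.items()}

result = subset(tables, relations, {"bom": {10}})
# result keeps BOM row 10 and its parent component 1, nothing else
```

The SetModified()/GetChanges() trick above achieves the same effect without writing this traversal, at the cost of temporarily mutating row state in the original DataSet.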