Match rows between datasets with noisy unstructured data and no unique identifier - sql

Given 2 datasets, one of which is extremely unstructured and noisy, and another which is well structured and complete - does there exist a software tool, API or framework that I can use to match rows in the unstructured dataset to those in the structured dataset, given that neither set contains any unique identifier? Ideally it would also produce some sort of confidence score that the rows are a match.
That is basically the question, but there's more info below if interested:
I have 2 datasets. Both consist of user information, containing names, addresses, contact details and other miscellaneous information about a user.
Dataset #1 is unformatted, unordered, undelimited and noisy (missing values, unconventional data, e.g. one row's phone number could be +XX (XX) XXX XXX while the next row's could be XXXXXXXXXX). The only convenience afforded is that each user record is at least on its own row.
Dataset #2 is spread across roughly 80 separate datasets; although these are not uniform with each other, each is well formatted and its data is well delineated, meaning I can likely normalise them into 1 logical dataset - for the purpose of this question, consider this one set.
Every record in Dataset #1 is contained in Dataset #2 - the problem is that there are no unique identifiers and Dataset #1 is noisy and unstructured.

No answers so far, so I'll contribute the results of research I've done since asking this. The field this kind of problem belongs to is Record Linkage http://en.wikipedia.org/wiki/Record_linkage, that is, matching records between 2 data sets where there is no shared unique id between the records.
I've found this software http://fril.sourceforge.net/ for Record Linkage, which incorporates a range of distance metrics into its linking algorithms so that confidence and acceptance levels can be set when matching between the left and right sides. For example, q-gram distance can be used for matching different formats of storing names, as the order of the "q-grams" (substrings of the given strings) does not matter, so "John Doe" and "Doe, John" will give a high score. Soundex algorithms can be used for overcoming common misspellings in the name fields as well, e.g. "John Doe" and "Jon Doe". For other types of data an edit distance or other distance metric can be used.
Based on the results of each distance metric, a confidence level is calculated, and a match is made provided the confidence level is equal to or higher than the acceptance level set.
This field and software will likely overcome the problem of the differences in the format of the stored data, but I'll likely have to do additional work structuring Dataset #1 mentioned above before I can feed it into this kind of system.
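To illustrate the principle (not FRIL itself), here is a rough SQL sketch of the same idea: normalise a couple of fields, score candidate pairs with simple per-field comparisons, and only accept pairs above a threshold. The table and column names, weights and threshold are all hypothetical, and the functions used (SOUNDEX, REGEXP_REPLACE) vary between database engines; this assumes something like MySQL 8+.

-- Hypothetical tables: noisy_users (Dataset #1, already parsed into columns)
-- and clean_users (Dataset #2).
SELECT *
FROM (
    SELECT
        n.id AS noisy_id,
        c.id AS clean_id,
        -- crude confidence score: weighted sum of per-field agreements
        ( CASE WHEN SOUNDEX(n.last_name)  = SOUNDEX(c.last_name)  THEN 0.4 ELSE 0 END
        + CASE WHEN SOUNDEX(n.first_name) = SOUNDEX(c.first_name) THEN 0.2 ELSE 0 END
        + CASE WHEN REGEXP_REPLACE(n.phone, '[^0-9]', '') <> ''
               AND REGEXP_REPLACE(n.phone, '[^0-9]', '')
                   LIKE CONCAT('%', RIGHT(REGEXP_REPLACE(c.phone, '[^0-9]', ''), 8))
               THEN 0.4 ELSE 0 END
        ) AS confidence
    FROM noisy_users n
    JOIN clean_users c
      ON SOUNDEX(n.last_name) = SOUNDEX(c.last_name)  -- cheap blocking key to limit comparisons
) scored
WHERE confidence >= 0.6   -- acceptance level
ORDER BY noisy_id, confidence DESC;

A dedicated record-linkage tool does this far more robustly (proper string-distance metrics, weighting, review queues), but the confidence/acceptance mechanics are essentially the same.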

Related

Named Entity Recognition: Splitting data into test and train sets

When fitting a named entity recognition model, is it important to make sure that the entities that are in your training data do not repeat in your testing data? For example, suppose we have a relatively small data set and the goal is to identify person names. Now let us say we have 300 unique person names but would like to generalize our extraction to future data that may contain person names not among the 300 unique names we have in our data. Is it important to make sure that, when we split the data into training and testing sets, none of the 300 unique names appear in both the training set and the testing set?
It is important that you have entities not in the training set to check that your model is generalizing, but usually you have enough data and enough distinct values that a random split gives you a decent separation even without explicitly checking for it.

How to decide on a data warehouse schema for a clinical dataset?

I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events such that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts etc., are related to a single admission. I had a few things in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables: for example, I would have one fact table for diagnoses, one for procedures, one for lab notes, but this just seems like too many fact tables with little to gain. I could have a fact table with an entry for each diagnosis per user per admission, but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions like diagnoses, procedures etc. The issue with this is that for most admissions there are multiple diagnoses, so I would have to link to a bridge table in most fact tables, and then it would look like the image below. The issue with this is the required joins for the queries.
The third way that I read about is using something like an entity-attribute-value (EAV) fact table, where each row is one fact. For example, one row could be something like (patientid - admissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but it would require the end user to pivot the table after querying, due to the nature of EAVs, making it more complicated for the end user (a pivot sketch follows after these options). It would look like the i2b2 star schema.
The last way I thought of was doing one entry per event in the fact table but having many columns to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.), in which patientid and admissionid will be populated in all rows but the rest will depend on the entry, so one entry could have just patientid, admissionid and a single procedure. Due to my lack of experience I don't know what the end result of this would be like in terms of querying, or whether rows in which almost every column is irrelevant are the way to go.
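For what it's worth, the pivot mentioned in the EAV option usually ends up as conditional aggregation; here is a minimal sketch, assuming a hypothetical event_facts table with attribute and attr_value columns:

-- EAV-style fact table: one row per recorded fact.
-- Turning attribute/value rows back into columns needs conditional aggregation.
SELECT
  patientid,
  admissionid,
  MAX(CASE WHEN attribute = 'heart_rate'     THEN attr_value END) AS heart_rate,
  MAX(CASE WHEN attribute = 'blood_pressure' THEN attr_value END) AS blood_pressure
FROM event_facts
GROUP BY patientid, admissionid;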
Any help would be greatly appreciated; I'm trying to implement this in BigQuery.
Some tests have concluded that reducing the number of joins can enhance BigQuery's performance. In other words, if you denormalize most of your data into a single table you will probably get better results, since you won't need to perform many joins.
There are some points that you should think about when deciding on the data model:
Is it easy to write queries in your model?
If you need many joins, it will probably be difficult to build queries against your model. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, you will start having problems querying the model again.
Is it easy to join data in your model?
If you have many tables to be joined, your data will be difficult to combine. The more denormalized the data is, the fewer joins you need.
Is it easy to update the data?
If you need to update your data, you should consider that denormalization can help you: if you reduce the number of tables, you will need to update fewer of them. It's important to say that if you create very complex nested fields, they will also be difficult to update.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data stays understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than 2 levels of nesting); a minimal sketch follows below.
Keep in mind that your model will need more disk space when you denormalize it, so your BigQuery storage bill will probably be higher.
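As a rough illustration of nested fields (the dataset, table and column names here are assumptions, not taken from MIMIC-III):

-- Hypothetical BigQuery table: one row per admission, with diagnoses and
-- lab events nested as repeated records instead of separate fact tables.
CREATE TABLE clinical.admissions (
  patientid   INT64,
  admissionid INT64,
  admit_time  TIMESTAMP,
  diagnoses   ARRAY<STRUCT<icd9_code STRING, seq_num INT64>>,
  lab_events  ARRAY<STRUCT<itemid INT64, lab_value STRING, charttime TIMESTAMP>>
);

-- Querying a nested field only needs UNNEST rather than a join to another table.
SELECT admissionid, d.icd9_code
FROM clinical.admissions, UNNEST(diagnoses) AS d
WHERE d.icd9_code LIKE '428%';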
I hope it helps
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
What you should try to figure out is which data is important and high value, and which is less important and low value. Only model and load into the database the high value data. If you try to build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind, e.g. blood pressure normal, heart rate high, etc.
Prescription drugs are another example of low value information; in the database you might just set a flag if multiple prescriptions are present, or something similar. Freeform notes are another. Once a cohort has been decided on based on many other factors, you enable a data scientist to process the notes with some machine learning, but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact table from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations, enabling additional aggregation or slicing before extracting the insightful data from the data lake.
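As a loose sketch (the flag names and bands are made up for illustration), a junk dimension here might just hold the observed combinations of low-cardinality determinations, with the fact carrying a single flag_key instead of a low value row per individual measurement:

-- Hypothetical junk dimension: one row per observed combination of flags.
CREATE TABLE dim_measurement_flags (
  flag_key            INT PRIMARY KEY,
  blood_pressure_band VARCHAR(10),   -- 'low' / 'normal' / 'high'
  heart_rate_band     VARCHAR(10),
  multiple_rx_flag    CHAR(1)        -- 'Y' / 'N'
);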
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files: Python, C#, Azure Functions, AWS Lambda, whatever. This will depend on your own skills and the resources available.
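Since the question targets BigQuery, one way to keep that low value data queryable on demand is an external table over the lake files; the bucket, dataset and column names below are purely illustrative:

-- Hypothetical external table over measurement files kept in Cloud Storage;
-- the star schema itself only stores the summarised determinations.
CREATE EXTERNAL TABLE lake.measurements (
  patientid     INT64,
  admissionid   INT64,
  measure_type  STRING,
  measure_value STRING,
  measured_at   TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris   = ['gs://example-bucket/measurements/*.csv']
);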

Database structure for large word-word co-occurrence frequencies with metadata

I would like to store the frequencies with which words co-occur with each other over a variety of contexts in a large (> 1 Billion tokens) text corpus. I need to store the word pair, the type of co-occurrence (e.g. word1 in the same sentence as word2, word1 in the same paragraph as word2), and some metadata about the text in which the co-occurrence was found, (e.g. year, author, publisher). So a single row might look like:
word1 word2 count decade publisher author context_type
-------------------------------------------------------------
nuclear danger 22 1980s NYT n/a paragraph
The frequencies will be sparse and Poisson distributed. I would then like to be able to query the data by date ranges, or groups of authors (for example), aggregating the counts of the results.
I have little experience with databases and am not sure what to use. Do I need related SQL tables (e.g. with the book metadata in one table and word data in another), or a simple flat NoSQL solution? The vocabulary is about 50,000 words, so if every word-word co-occurrence were observed there would be 2.5 billion rows even without the metadata (I think). But probably most of them won't be observed. This makes me think maybe a graph database is a possible solution. Is there a good cloud solution on AWS or Google maybe?
Google Cloud Bigtable or another NoSQL system could be a reasonable choice for this type of query, since it sounds like a sparse dataset, but as the previous poster noted, this question is more suited for https://softwareengineering.stackexchange.com/.
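If you do go the relational route, a minimal sketch based on the example row above might look like the following (column names and sizes are assumptions; the count column is named freq to avoid clashing with the COUNT keyword, and decade is stored as its starting year):

-- One row per (word pair, context type, metadata) combination; only observed
-- pairs are stored, so sparsity costs nothing.
CREATE TABLE cooccurrence (
  word1        VARCHAR(64)  NOT NULL,
  word2        VARCHAR(64)  NOT NULL,
  context_type VARCHAR(16)  NOT NULL,              -- 'sentence', 'paragraph', ...
  decade       SMALLINT     NOT NULL,               -- e.g. 1980 for "1980s"
  publisher    VARCHAR(64)  NOT NULL DEFAULT 'n/a',
  author       VARCHAR(128) NOT NULL DEFAULT 'n/a',
  freq         BIGINT       NOT NULL,
  PRIMARY KEY (word1, word2, context_type, decade, publisher, author)
);

-- Aggregate counts for a word pair over a range of decades.
SELECT word1, word2, SUM(freq) AS total
FROM cooccurrence
WHERE word1 = 'nuclear' AND word2 = 'danger'
  AND decade BETWEEN 1980 AND 2000
GROUP BY word1, word2;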

What Algorithm to Use for a List Swap with No Duplication

I have a scenario where a number of organizations have joined together to swap names to increase the size of their respective mailing lists. I have a pool of names, each tagged to their source organization. Each org has provided a subset of their full list, and is entitled to an output of up to the number of names they put into the pool, but with these restrictions:
an org should not receive a name already on their full list
each name should only be swapped once (should be allocated to only a single output grouping)
Separate from this pool I have the full lists of each organization (hashed to obfuscate the details, but not particularly relevant to the question), so I have the data necessary to determine which records are available to swap to each list.
My question: is there a grouping/clustering algorithm that would apply in this case, where we want to cluster records based on the organization(s) that are entitled to them, with the aforementioned requirements that each record be distributed to a single group and that each group's size not exceed the number of records originally sourced from that organization?
The data is currently in MySQL tables and I will likely use Node/JS for implementation, but I'm mainly seeking general advice on the algorithm.
Thanks in advance for any advice you might have!

Correlation between the number of rows/columns and database performance

Is there a correlation between the number of rows/columns used and the performance impact within an (MS)SQL database?
A little more background:
We have to store lots of data from measurement devices. These devices ping a string of data back to us around 100 times a day. These strings contain roughly 300 fields. Assume we have 100 devices in operation; that means we get 10,000 records back every day. At our back-end we split these data strings and have to put them into the database. As long as these data strings are fixed, that means we add around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and add a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we're adding 100 * 100 * 300 = 3,000,000 rows each day. Data has to be stored for a year and a month (395 days), so then we're at around 1.2 billion rows, not counting the expected growth.
Is it from a performance perspective smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity Attribute Value, models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are:
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields; you have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you, this argument is null and void and you may be stuck with an EAV.
For the horizontal option you can split tables, putting the frequently-used columns in one table and the less-used in a second; both tables have the same primary key values, so you can link the less-used columns to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to split each day's (or week's or month's) data from the others'.
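A minimal sketch of that split, with made-up table and column names (SQL Server flavoured, since the question mentions MSSQL):

-- Frequently-queried columns live in the core table...
CREATE TABLE reading_core (
  device_id   INT           NOT NULL,
  reading_ts  DATETIME2     NOT NULL,
  temperature DECIMAL(9,3),
  pressure    DECIMAL(9,3),
  CONSTRAINT pk_reading_core PRIMARY KEY (device_id, reading_ts)
);

-- ...while rarely-used fields sit in a second table sharing the same key.
CREATE TABLE reading_extra (
  device_id   INT           NOT NULL,
  reading_ts  DATETIME2     NOT NULL,
  raw_payload NVARCHAR(MAX),
  CONSTRAINT pk_reading_extra PRIMARY KEY (device_id, reading_ts),
  CONSTRAINT fk_reading_extra FOREIGN KEY (device_id, reading_ts)
      REFERENCES reading_core (device_id, reading_ts)
);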
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
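For example (the table and column names here are assumed), a composite index covering the "device x on date range y" pattern might look like:

-- Composite index supporting per-device, per-date-range lookups.
CREATE INDEX ix_readings_device_ts
    ON readings (device_id, entry_ts);

SELECT *
FROM readings
WHERE device_id = 42
  AND entry_ts >= '2024-01-01'
  AND entry_ts <  '2024-01-08';   -- only the matching slice of the table is read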
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is split equally between the types, giving roughly 100 columns per type. So you have one table with 100 rows of about 100 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of about 100 columns, to which maybe several new rows are added each day. Finally, linked to the second table is a table with rows of about 100 columns that possibly grows by the full 10K every day.
This minimizes the growth space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.