Does anyone know of a prediction engine that produces ICD-10 diagnosis codes from unstructured clinical text? Preferably an API that I can work with.
The use case I have in mind is to pull doctors' notes from electronic health record (EHR) systems as inputs and produce ICD-10 diagnosis code options for users. Thanks
Diagnoss just put out an API that "tags" text with ICD-10 codes. It is not predicting diagnoses, and you may need to filter the output depending on what you're looking for (specialty, accuracy, etc.). They've implemented it in a demo. I work for them -- happy to answer specific questions about it separately.
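Purely for illustration, here is a rough Python sketch of the kind of post-filtering described above. The response shape, field names, and threshold are hypothetical placeholders, not the actual Diagnoss API:

# Hypothetical example only -- the response shape and field names below are
# placeholders, not the real API; adapt to whatever the service actually returns.
tagged = [
    {"code": "E11.9", "span": "type 2 diabetes", "confidence": 0.91},
    {"code": "I10", "span": "hypertension", "confidence": 0.88},
    {"code": "Z79.4", "span": "on insulin", "confidence": 0.45},
]

def filter_codes(tags, min_confidence=0.8, chapter_prefixes=None):
    """Keep tags above a confidence threshold and, optionally, only codes
    starting with the given prefixes (a crude specialty filter)."""
    keep = []
    for t in tags:
        if t["confidence"] < min_confidence:
            continue
        if chapter_prefixes and not t["code"].startswith(tuple(chapter_prefixes)):
            continue
        keep.append(t)
    return keep

print(filter_codes(tagged, chapter_prefixes=["E", "I"]))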
I have a dataset with job metrics, and one of my features is industry. It is a categorical feature with 1200 unique values. Before I go on and build a model, I need to figure out how best to encode it, especially given that high cardinality. Does anyone have any tips or guidance as to where I should start?
The picture below shows the top 9 industries. I am thinking of selective encoding -- maybe only one-hot encoding the 15-20 most frequent values -- but I will be thankful for any suggestions. Thanks
I tried looking through several resources, but couldn't find anything promising so far.
[Picture of the 9 most frequent industries: https://i.stack.imgur.com/tDAEk.jpg]
You could one-hot encode everything, and maybe check correlations against the target to see which industry categories may be informative features.
If the data is too large for that, then yes, perhaps selective encoding as you said: conditionally fill everything else as "other" and then proceed with one-hot encoding.
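As a rough pandas sketch of that selective approach (the column name and example values are made up):

import pandas as pd

# Made-up example data with a high-cardinality 'industry' column.
df = pd.DataFrame({"industry": ["Tech", "Health", "Tech", "Mining", "Retail", "Health", "Tech"]})

# Keep only the N most frequent categories and lump the rest into "other".
N = 3
top = df["industry"].value_counts().nlargest(N).index
df["industry_grouped"] = df["industry"].where(df["industry"].isin(top), "other")

# One-hot encode the reduced categorical feature.
dummies = pd.get_dummies(df["industry_grouped"], prefix="industry")
df = pd.concat([df, dummies], axis=1)
print(df)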
I am creating a machine learning model that essentially scores how correct one text is relative to another.
For example: “the cat and a dog” vs. “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunctions etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is relative to text 2, with the “significant” words carrying more weight than the others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two sentences above should be an extremely high match.
What basic algorithm should I use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a correctness score?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words that occur in many documents (say, 90% of your sentences/documents contain the conjunction 'and') get much less emphasis, which essentially gives more weight to the more document-specific phrasing (this is the IDF part).
Ordering does not matter in Term Frequency (TF), as opposed to methods using sliding windows etc.
It is very lightweight compared to representation-oriented methods like the one mentioned above.
Big drawback: depending on the size of your corpus, your data may have too many dimensions (one dimension per unique word). You could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
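As a minimal sketch of this in Python (using scikit-learn's TfidfVectorizer and cosine similarity, which is just one common implementation; the example sentences come from the question above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

# Fit TF-IDF on the corpus; word order is ignored, and in a larger corpus very
# common words ('the', 'and', ...) automatically receive a low IDF weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

# Cosine similarity between the two sentence vectors (1.0 = same bag of words).
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(score)

If you also want to hand-pick "significant" words, one simple option is to multiply their TF-IDF columns by an extra weight before computing the similarity.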
EDIT: Whoops, this question is 8 months old; sorry for the bump. Maybe it will be of use to someone else, though.
I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream as "Snug". My initial thought was to convert each character to a number so the sums would be the same; that way, even if two characters are transposed, I'd be able to bucket these appropriately.
The other case is where one stream comes in as "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these so that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to build these classification buckets so that I can cut down the noise in my primary task: finding which people are truly different people, as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (Wikipedia here). It measures the "edit" distance between two strings -- the number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you much there. One idea is to narrow down which names are "close enough" by requiring that they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like; that is called Soundex. You may be able to use the two together -- assuming your names are predominantly English (Soundex was developed for indexing US census records, so it works best on names common in America).
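In Oracle you would call UTL_MATCH.EDIT_DISTANCE directly; purely to illustrate what that distance measures and how a simple bucketing rule might use it, here is a small self-contained Python sketch (the threshold is an arbitrary example):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Sung", "Snug"))     # 2 (one transposition = two substitutions)
print(levenshtein("Lillly", "Lilly"))  # 1 (one extra letter)

# Arbitrary example rule: treat names within distance 2 as candidate duplicates.
def candidate_duplicates(x: str, y: str, threshold: int = 2) -> bool:
    return levenshtein(x.lower(), y.lower()) <= threshold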
I am working with a pipeline where speech is translated into text. The speech is a sort of technical professional jargon, with occurrences of acronyms as well as phrases that describe amounts and values.
There is a high chance that the stock speech-to-text translator is going to misrepresent the terms and phrases I am interested in. I have a corpus where I can tag those words and phrases.
Is there a way to use spaCy (or maybe some other NLP tool) to apply spell-checking, so that for instance 'pee-age' or 'pee-oh-two' would be corrected to 'pH' and 'pO2', respectively?
Also, for phrases where numbers and amounts are specified, could for example 'one twenty cee-cee' be corrected to '120cc'?
Thanks in advance
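As a rough sketch of the kind of correction layer being described here (the mapping table and function name are purely illustrative, not a spaCy feature), a plain post-processing step over the transcript could look like this:

import re

# Illustrative mappings only -- in practice these would be built from the tagged corpus.
CORRECTIONS = {
    r"\bpee[- ]?age\b": "pH",
    r"\bpee[- ]?oh[- ]?two\b": "pO2",
    r"\bone twenty cee[- ]?cee\b": "120cc",
}

def normalize(transcript: str) -> str:
    """Apply phrase-level corrections to raw speech-to-text output."""
    for pattern, replacement in CORRECTIONS.items():
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript

print(normalize("the pee-age was low so we gave one twenty cee-cee"))
# -> "the pH was low so we gave 120cc"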
Now that SEPA requirements are getting people used to BIC & IBAN, there are legacy systems that cannot cope with this new data. Is there an algorithm or tool available for converting BIC & IBAN back to sort code and account number?
Here is an example [image not shown] from this website.
Wikipedia has a list of IBAN formats by country, so it seems at least possible.
However, there is no single complete algorithm for it; being a software developer, you can derive one from that information. Note that other countries might follow in the future, so you can expect more work (and hopefully not more exceptional cases of sort codes and accounts).
Regarding a tool or library, that's off-topic here on Stack Overflow, but you might want to ask on Software Recommendations. Note that they have different requirements on how to ask questions, so you might want to read their tour first. Don't forget to mention the programming language.
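As a rough Python sketch of what deriving it from that Wikipedia table could look like (only the GB and IE layouts are shown, and the offsets should be double-checked against the published formats):

# Offsets within the (whitespace-stripped) IBAN: (start, length) of the sort code
# and account number. Only GB and IE are shown; other countries have different
# layouts, so check the Wikipedia table before adding them.
LAYOUTS = {
    "GB": {"sort": (8, 6), "account": (14, 8)},
    "IE": {"sort": (8, 6), "account": (14, 8)},
}

def split_iban(iban: str):
    iban = iban.replace(" ", "").upper()
    layout = LAYOUTS[iban[:2]]  # raises KeyError for unsupported countries
    s_start, s_len = layout["sort"]
    a_start, a_len = layout["account"]
    return iban[s_start:s_start + s_len], iban[a_start:a_start + a_len]

print(split_iban("GB04BARC20474473160944"))  # ('204744', '73160944')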
Well, a quick search pointed me at this page: http://www.business.hsbc.co.uk/1/2/international-business/iban-bic.
Looks to me like you can just extract the appropriate substrings, although a bit more searching seems to indicate that the format may vary a bit depending on the country.
Both sort code and account number are present inside a United Kingdom or Ireland IBAN.
You can simply take substrings. PHP example:
// UK IBAN layout: GB + 2 check digits + 4-character bank code + 6-digit sort code + 8-digit account number
$iban = "GB04BARC20474473160944";
$sort = substr($iban, 8, 6);      // characters 9-14: sort code
$account = substr($iban, 14, 8);  // characters 15-22: account number
print "SortCode:" . $sort;
print "AccountNumber:" . $account;
The IBAN Calculator webservice has an API which digs up bank and branch information and so on. It also does check-digit validation on the IBAN and the sort code/account number.
But for simply extracting the sort code and account number, the substring approach is sufficient.