What exactly are the UMLS and SNOMED-CT vocabularies used by cTAKES?

I'm very new to cTAKES and, looking through the docs, I'm curious what exactly the UMLS and SNOMED CT "vocabularies" are. The user installation docs don't really say, and simply applying for the UMLS license and reading the language around the UMLS Metathesaurus doesn't reveal much about the structure of the data being accessed. E.g., is it some online API service? Is it a set of files that comes with the cTAKES download and can only be unlocked with a valid UMLS password that is checked against an online DB?

Info on what the UMLS Metathesaurus and SNOMEDCT are can be found here (https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html) and here (https://www.ncbi.nlm.nih.gov/books/NBK9676/, specifically https://www.ncbi.nlm.nih.gov/books/NBK9684/):
The Metathesaurus is a very large, multi-purpose, and multi-lingual [relational?] vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. Designed for use by system developers...
...The Metathesaurus contains concepts, concept names, and other attributes from more than 100 terminologies, classifications, and thesauri, some in multiple editions.
While I'm not sure exactly how cTAKES implements its use of the UMLS Metathesaurus (anyone who knows, please enlighten us), I assume it is accessing some API for a relational database, based on the UMLS credentials you need to add to the example scripts that come with the cTAKES download (see https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+User+Install+Guide#cTAKES4.0UserInstallGuide-(Recommended)AddUMLSaccessrights).
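For what it's worth, those credentials appear to be handed to the pipeline as Java system properties. A minimal sketch, assuming the property names used by the cTAKES 4.0 example scripts (ctakes.umlsuser / ctakes.umlspw; newer releases use a UMLS API key instead), might look like this:

    // Hypothetical launcher: pass UMLS credentials to cTAKES as system properties.
    // Property names are taken from the cTAKES 4.0 run scripts; verify them
    // against your install before relying on this.
    public class RunPipelineWithUmls {
        public static void main(String[] args) throws Exception {
            System.setProperty("ctakes.umlsuser", System.getenv("UMLS_USER"));
            System.setProperty("ctakes.umlspw", System.getenv("UMLS_PASS"));
            // ...then build and run the default clinical pipeline, e.g. the
            // aggregate analysis engine shipped in the ctakes-clinical-pipeline module.
        }
    }

As far as I understand, at startup the dictionary lookup component validates these credentials against NLM's online authentication service, while the dictionary data itself is a local database shipped with the cTAKES resources, which would match the "local files unlocked by an online check" guess above.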
...You may select from two relational formats: the Rich Release Format (RRF), introduced in 2004, and the Original Release Format (ORF).
(I think) this is what powers the UIMA analysis engines that cTAKES uses to process text:
UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document [...] How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values. (https://www.ibm.com/developerworks/data/downloads/uima/#How-does-it-work)
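To make the AE/CAS idea concrete, here is a toy UIMA annotator (not an actual cTAKES component) that reads the document text out of the CAS; a real cTAKES analysis engine would add typed annotations back into the CAS for downstream engines to consume:

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;

    // A toy Analysis Engine: the framework hands it a CAS (here via the JCas
    // interface) containing the document and any annotations added by earlier AEs.
    public class ToyAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            String text = jcas.getDocumentText();
            // A real cTAKES AE would do dictionary lookups against UMLS terms here
            // and store the resulting typed annotations in the CAS.
            System.out.println("Document length: " + text.length());
        }
    }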

Related

Modular MediaWiki

I wonder if it is possible to configure MediaWiki (or other wiki tools) as a modular predefined wiki. For instance, on a regular wiki page one can freely edit sections, text, everything.
I am looking for a solution that predefines a number of sections (or modules) that can be added to each wiki page. Then users are free to edit inside those sections within their predefined formats.
Hope someone can help, thanks.
As for MediaWiki, there is at least one extension that can work that way: Semantic Forms, usually used together with Semantic MediaWiki (though that is not necessary). With SF, you define one or more templates that receive the data entered in the form, and the form can be divided into sections.
A more lightweight solution might be using one of the many boilerplate extensions available.
Either way, with a wiki you can never force your users to follow a certain scheme. The whole philosophy that makes wikis unique among collaborative tools is that the users, not you, create not only the content but also the structure for the content!
The former Semantic Forms is now called Page Forms (https://www.mediawiki.org/wiki/Extension:Page_Forms). It is not dependent on SMW and can also make use of the Cargo extension (https://www.mediawiki.org/wiki/Extension:Cargo).
I would disagree that wiki users cannot or should not be forced to follow a scheme for some types of information, though by default they do control the categories and namespaces and can create them at will as the data evolves. All this means, though, is that you manage such issues socially rather than with complex permissions structures, i.e. someone undoes your change and says "do it this way instead". So it's a different kind of forcing, but, still, someone has to make sure categories don't proliferate with bad names, capitalization, etc.
The typical use of form data is when it must satisfy some legal or professional requirement (say, logging the reason a change was made for Sarbanes-Oxley, or logging which precedents were consulted when logging legal time), or when it will provide input strictly to some application (like maps). It would not be a good idea to impose that kind of rigor on literally every page of a wiki.

Enterprise search platform vs General purpose search

I have a question about Solr. It is described as an enterprise search platform. Are there enterprise-oriented search platforms and general purpose search platforms? Can't you just use Solr, for example, to build a general purpose search engine? If there is such a distinction, what are the major differences between them?
Enterprise is a vague term tacked on to things to say "Yes, you can totally use this in professional projects, it's super good". It's baloney, in short. When reading the front page of a software product (or any product really), I find it useful to ignore all adjectives and adverbs, which makes that first sentence on the Solr page read: "Solr is the search platform from the Apache Lucene project."
Don't know why I don't get hired to write ad copy.
I think it would be fair to say that Solr is a general purpose search server, sure (depending on what general purpose entails to you, of course). It indexes data, allows you to search it, and provides a lot of tools to do that in the way that best suits your data and users.
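To make that concrete, here is a minimal SolrJ sketch (the core name "articles", the field names, and the local URL are my own assumptions) that indexes a document and queries it back:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrBasics {
        public static void main(String[] args) throws Exception {
            // Assumes a core named "articles" on a local Solr instance and a
            // SolrJ 6/7-style client; adjust for your version and schema.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/articles").build();

            // Index a document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Solr is a search platform");
            solr.add(doc);
            solr.commit();

            // Query it back.
            QueryResponse rsp = solr.query(new SolrQuery("title:search"));
            rsp.getResults().forEach(d -> System.out.println(d.getFieldValue("title")));
            solr.close();
        }
    }

Everything an "enterprise" deployment adds on top (connectors, security trimming, etc.) is built around this same core.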
The term "search" is overloaded with semantics. It is often used to denote/describe an action, a function, or a technology. But more important with respect to the question is the fact that there are two common kinds of "search projects": Web Search projects and Enterprise Search projects.
Web Search is typically about indexing content from one kind of content source (web servers) serving content in HTML format. Most often it is only about public content, and document-level security is not an issue. A typical example of this kind of solution is Google's Web Search, but most full-text site search solutions can also be seen as good examples of this category. For a basic solution, a crawler, an HTML markup removal tool, an indexing library, and some "glue" are sufficient. Apache Nutch, or Apache Solr and Elasticsearch in combination with a web crawler, are good candidates for implementing this kind of solution.
Enterprise Search is typically about integrating content in various formats from multiple content sources. Typical examples of this kind of solution are corporate intranets, but Search Based Applications often also fall into this category. Those solutions typically come with additional requirements such as support for document-level security, advanced linguistics, metadata extraction, data mappings and enrichments, synonyms, etc. The projects are more complex, and a more complex technology stack is needed. While Apache Solr or Elasticsearch can both be used, a lot of the required functionality is not part of the standard download and needs to be developed or integrated as part of the project. But for both Apache Solr and Elasticsearch there are also commercial distributions available that already expand the functionality of the standard download in the direction of Enterprise Search. Other good alternatives are commercial search engines.
I agree with #femtoRgon that Solr:
is a good General Purpose Search Platform
and not an Enterprise Search Platform
but an Enterprise Search Platform can be built with Solr
Solr is a search platform that can be customized either for general purpose search or for Enterprise Search solutions. As Daniel suggested in the previous comments, an Enterprise Search application is used specifically by an enterprise/organization to search the organization's internal data, and in some cases external content as well, but only content related to the organization. Enterprises generally use various systems, either internally developed or supplied by a vendor, and the Enterprise Search application should be able to connect to those internal systems and index their content, including the different file types, the metadata, and, importantly, the security associated with each and every document in those systems.
To conclude, Solr is a search system that can be used to index and search content either as a general purpose search engine or as an Enterprise Search application for an organization.

Microdata - itemid / global identifier conventions for organizations, business or brands markup with schema.org

My question is the following: when marking up an organization, business or brand with microdata and schema.org, should I use its official webpage URL as the global identifier? Is there any better kind of reference I could use (like IMDB for movies or actors)?
I'd like to know if there's any standard, convention or common practice recommended.
It would be better to use some kind of controlled vocabulary (e.g. VIAF) that uniquely identifies the organization in question.
The choice of identifiers is part of the explanation of REST. http://www.infoq.com/articles/rest-introduction
Look closely at the first principle (for convention), though it is framed in the broader terms of resources rather than being specific to organizations/businesses/brands. REST is the thesis that started this trend, and Microformats accordingly makes use of rel="profile" link tags. The concept is expanded further at http://purl.org/ so that, if IMDB, for example, switches to W3 the way W3C did, the impact on the application you are building right now is minimized. The RDFa Dublin Core vocabulary's use of this can be seen in the profile at http://www.w3.org/2011/rdfa-context/rdfa-1.1.html.
(For reference) Applications serving the general public, or open initiatives such as academic support, might be better served by these profiles. However, when operating a site for commercial purposes, building application-specific "custom" profiles (taking into account whatever legal matters you have identified) that perform reliably with PURLs might be advantageous for building a credible reputation.
Finally, WHATWG considers prefixes too advanced and HTML5 to be for newbies only, so support for W3's XHTML xmlns/RDFa prefixes is dropped in microdata. This compels us to reuse long-form URLs for schema.org business/org/brand resources with microdata syntax. The "custom" profile then serves as mere good will when picking up from where tasks are wrapped up; otherwise a greater variety of items might appear in the content than actually intended, owing to mix-ups.
The good news is that Google supports schema.org usage as a vocabulary in RDFa syntax. So, treating RDFa as an already "living" standard that originated as a W3C spec, the way to go is to define PURLs for scope namespaces, profiles exhibiting prefixes, and syntax (official web-page or substitute IRIs) according to the (non-)commercial nature of the application and the target processors. Currently no vocabulary besides schema.org is processed as microdata, and schema.org in RDFa isn't supported by anybody but Google!

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating options to extract person names, locations, tech terms, and categories from text (a lot of articles from the web), which will then be fed into a Lucene/Elasticsearch index. The additional information is then added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of the entity extraction is high enough.
Dbpedia Spotlight, the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger Not open source but free
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? Or know whether training data is required and available?
Are there articles or tutorials where I can get started with entity extraction (NER) for each and every tool?
How can they be integrated with Lucene? (A minimal sketch follows the related links below.)
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
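To make the Lucene side concrete (the field names and the Lucene 5+ APIs here are just my assumptions), the idea is to store whatever the NER/disambiguation step produces as extra fields next to the article text, for example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;
    import java.util.List;

    public class EntityIndexer {
        // Index the article body plus the entities/categories the NER step produced,
        // so a query can later filter on e.g. category:cricket vs category:software.
        public static void index(String body, List<String> persons, String category) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body", body, Field.Store.YES));
                for (String person : persons) {
                    doc.add(new StringField("person", person, Field.Store.YES));
                }
                doc.add(new StringField("category", category, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }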
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, organization, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often use a language-independent API like REST, and I do not know that they directly provide Lucene support, but I hope my answer has been beneficial for the problem you are trying to solve.
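As an illustration of that REST-style integration, here is a rough sketch of calling the public DBpedia Spotlight annotate endpoint from Java. The URL and parameters are assumptions based on the public demo service and have changed over the years, so check the current Spotlight documentation:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SpotlightClient {
        public static void main(String[] args) throws Exception {
            String text = "Wicket is a component-based web framework from Apache.";
            // Assumed endpoint of the public DBpedia Spotlight demo service.
            String url = "https://api.dbpedia-spotlight.org/en/annotate?confidence=0.5&text="
                    + URLEncoder.encode(text, "UTF-8");
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("Accept", "application/json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                // The JSON response lists each surface form with its DBpedia resource URI,
                // which you can then store as metadata in your index.
                in.lines().forEach(System.out::println);
            }
        }
    }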
You can use OpenNLP to extract names of people, places, and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
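For instance, a minimal sketch using the pre-trained person-name model (the file name is the one distributed on the models page above):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.util.Span;

    public class OpenNlpPersonFinder {
        public static void main(String[] args) throws Exception {
            // en-ner-person.bin is one of the pre-trained 1.5 models linked above.
            try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
                NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                        "Pierre Vinken joined the board as a nonexecutive director.");
                // find() returns spans over the token array; spansToStrings joins them back.
                Span[] names = finder.find(tokens);
                for (String name : Span.spansToStrings(names, tokens)) {
                    System.out.println(name);
                }
            }
        }
    }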
Rosoka is a commercial product that provides a computation of "Salience" which measures the importance of the term or entity to the document. Salience is based on the linguistic usage and not the frequency. Using the salience values you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether done by a human or by software, requires enough contextual information to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the users, the former being more specific and the latter having the greater potential ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a product name, etc.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities were not effectively extracted from the documents; most of the entities went undetected.
Just in case you find it useful.
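For reference, a minimal sketch of tagging text with one of the pre-trained classifiers (the classifier path is whatever ships in the release you download):

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class StanfordNerDemo {
        public static void main(String[] args) throws Exception {
            // The 3-class model (PERSON/ORGANIZATION/LOCATION) ships with the
            // Stanford NER download linked above; adjust the path to your release.
            CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                    "classifiers/english.all.3class.distsim.crf.ser.gz");
            String text = "Wicket Wystri Warrick met Princess Leia on the forest moon of Endor.";
            // Prints the text with inline XML tags such as <PERSON>...</PERSON>.
            System.out.println(classifier.classifyWithInlineXML(text));
        }
    }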

folder structure for project documentation

I have seen some questions raised about the folder structure of source code, but I never see questions about the folder structure of project documentation. I googled it and still do not see many articles that talk about it.
Here is one: http://www.projectperfect.com.au/downloads/Info/info_project_folder_structure.pdf
To quote some of its words:
"There are two broad approaches:
Organize by phase so that each top directory is a phase. For example, you might have directories for Feasibility, Business Analysis, Design etc. or whatever your phases are called.
Organize by function so that the top directory level are functions. For example, Risks, Requirements, Scope, Change Control, Development.
Most times a mix of both are used..."
So, any thoughts about this? I believe it is also an important issue!
IMHO, depending on your document management system, the choice of structure for your documents may not be an issue. When looking at the problems project-related documents are trying to solve, you typically come to the conclusion that documents are about communication.
Different documents attempt to communicate different things (or contexts); test plans discuss how testing should be/has been executed, requirements specifications discuss how the business rules should be applied, architecture documents discuss the technical components, and so forth. Each of these documents might need its own unique structure. For example, the structure you choose for your test plans may be vastly different from the structure you need for your architecture documents.
Keeping the communication issue and the document context in mind, I generally come back to these two key aspects.
Searchability – What is the easiest way to find the document I am looking for?
Versioning – How do I know that the document I am looking for is the most recent one?
I feel searchability is the most important thing to remember, because different people call the same document by different names. For example, some people call Business Requirements documents Functional Specifications. Some people call Functional Specifications use case documents. As you cannot always govern the naming convention of documents, I feel that finding the right document is far more important than the folder or place in which it is stored.
So to answer your question, I would simply say that it doesn't really matter which structure you use, just that you should use some form of document management system (SharePoint, Documentum, TRIM, etc.). The benefits are simply too great to work without one :)