cTAKES indication that category > 0 sources are about to be used? - ctakes

In appendix 1 of the UMLS license agreement, there is a listing of all sources within the current version of the UMLS Metathesaurus with an indication of any additional restrictions and notices that apply. Loosely speaking, it seems like you can generally have your way with the Metathesaurus data sources that fall within category-0 of the license, but things get more restrictive at categories above that.
For example (likely a bad example as I am not a lawyer), looking at section 12.2 of the main license section:
LICENSEE is prohibited from using the vocabulary source in operational
applications that create records or information containing data from
the vocabulary source. Use for data creation research or product
development is allowed.
My question then is: (since cTAKES already has my UMLS credentials) is there any way to tell when doing a certain action with cTAKES is going to instruct it use/access data from the Metathesaurus that have a category > 0 (eg. some popup warning or header comment in the binary files)? Thanks
** The reason I'm interested is because: suppose that a certain ctakes process uses a category-2 data source to do something on some input that populates data into some XMI output (I don't know much about ctakes' full implementation, but for sake of arg. lets assume this is true) that gets post-processed and stored as some report for an organization. It would seem that the organization has violated the category-2 restriction inadvertently (since they were never warned about the underlying data being used to generate the outputs). I may be grossly misunderstanding something here, so please let me know if this is the case.

Related

How to quickly analyse the impact of a program change?

Lately I need to do an impact analysis on changing a DB column definition of a widely used table (like PRODUCT, USER, etc). I find it is a very time consuming, boring and difficult task. I would like to ask if there is any known methodology to do so?
The question also apply to changes on application, file system, search engine, etc. At first, I thought this kind of functional relationship should be pre-documented or some how keep tracked, but then I realize that everything can have changes, it would be impossible to do so.
I don't even know what should be tagged to this question, please help.
Sorry for my poor English.
Sure. One can technically at least know what code touches the DB column (reads or writes it), by determining program slices.
Methodology: Find all SQL code elements in your sources. Determine which ones touch the column in question. (Careful: SELECT ALL may touch your column, so you need to know the schema). Determine which variables read or write that column. Follow those variables wherever they go, and determine the code and variables they affect; follow all those variables too. (This amounts to computing a forward slice). Likewise, find the sources of the variables used to fill the column; follow them back to their code and sources, and follow those variables too. (This amounts to computing a backward slice).
All the elements of the slice are potentially affecting/affected by a change. There may be conditions in the slice-selected code that are clearly outside the conditions expected by your new use case, and you can eliminate that code from consideration. Everything else in the slices you may have inspect/modify to make your change.
Now, your change may affect some other code (e.g., a new place to use the DB column, or combine the value from the DB column with some other value). You'll want to inspect up and downstream slices on the code you change too.
You can apply this process for any change you might make to the code base, not just DB columns.
Manually this is not easy to do in a big code base, and it certainly isn't quick. There is some automation to do for C and C++ code, but not much for other languages.
You can get a bad approximation by running test cases that involve you desired variable or action, and inspecting the test coverage. (Your approximation gets better if you run test cases you are sure does NOT cover your desired variable or action, and eliminating all the code it covers).
Eventually this task cannot be automated or reduced to an algorithm, otherwise there would be a tool to preview refactored changes. The better you wrote code in the beginning, the easier the task.
Let me explain how to reach the answer: isolation is the key. Mapping everything to object properties can help you automate your review.
I can give you an example. If you can manage to map your specific case to the below, it will save your life.
The OR/M change pattern
Like Hibernate or Entity Framework...
A change to a database column may be simply previewed by analysing what code uses a certain object's property. Since all DB columns are mapped to object properties, and assuming no code uses pure SQL, you are good to go for your estimations
This is a very simple pattern for change management.
In order to reduce a file system/network or data file issue to the above pattern you need other software patterns implemented. I mean, if you can reduce a complex scenario to a change in your objects' properties, you can leverage your IDE to detect the changes for you, including code that needs a slight modification to compile or needs to be rewritten at all.
If you want to manage a change in a remote service when you initially write your software, wrap that service in an interface. So you will only have to modify its implementation
If you want to manage a possible change in a data file format (e.g. length of field change in positional format, column reordering), write a service that maps that file to object (like using BeanIO parser)
If you want to manage a possible change in file system paths, design your application to use more runtime variables
If you want to manage a possible change in cryptography algorithms, wrap them in services (e.g. HashService, CryptoService, SignService)
If you do the above, your manual requirements review will be easier. Because the overall task is manual, but can be aided with automated tools. You can try to change the name of a class's property and see its side effects in the compiler
Worst case
Obviously if you need to change the name, type and length of a specific column in a database in a software with plain SQL hardcoded and shattered in multiple places around the code, and worse many tables present similar column namings, plus without project documentation (did I write worst case, right?) of a total of 10000+ classes, you have no other way than manually exploring your project, using find tools but not relying on them.
And if you don't have a test plan, which is the document from which you can hope to originate a software test suite, it will be time to make one.
Just adding my 2 cents. I'm assuming you're working in a production environment so there's got to be some form of unit tests, integration tests and system tests already written.
If yes, then a good way to validate your changes is to run all these tests again and create any new tests which might be necessary.
And to state the obvious, do not integrate your code changes into the main production code base without running these tests.
Yet again changes which worked fine in a test environment may not work in a production environment.
Have some form of source code configuration management system like Subversion, GitHub, CVS etc.
This enables you to roll back your changes

HL7 v2X and v3 data modeling

The company I work for has started a new initiative in HL7 where we are trading both v2X and v3 (CDA specifically) messages. I am at the point where I am able to accept, validate and acknowledge the messages we are receiving from our trading partners and have started to create a data model for the backend storage of said messages. After a lot of consideration and research I am at a loss for the best way to approach this in MS SQL Server 2008 R2.
Currently my idea is to essentially load the data into a data warehouse directly from my integration engine (BizTalk) and foregoing a backing, normalized operational database. I have set up the database for v2X messages according to the v2.7 specs as all versions of HL7 v2 are backward compatible (I can store any previous versions in the same database). My initial design has a table for each segment which will tie back to a header table with a guid I am generating and storing at run time. The biggest issue with this approach is the amount of columns in each table and it's something I have no experience with. For instance the PV1 segment has 569 columns in order to accommodate all possible data. In addition to this I need to make all columns varchar and make them big enough to house any possible customization scenario from our vendors. I am planning on using varchar(1024) to achieve this. A lot of these columns (the majority probably) would be NULL so I would use SPARSE columns. This screams bad design to me but fully normalizing these tables would require a ton of work in both BizTalk and SQL server and I'm not sure what I would gain from doing so. I'm trying to be pragmatic since I have a deadline.
If fully normalized, I would essentially have to create stored procs that would have a ton of parameters OR split these messages to the nth degree to do individual loads into the smaller subtables and make sure they all correlate back to the original guid. I would also want to maintain ACID processing which could get tricky and cause a lot of overhead in BizTalk. I suppose a 3rd option would be to use nHapi to create objects out of the messages I could tie into with Entity Framework but nHapi seems like a dead project and I have no experience with Entity Framework as of right now.
I'm basically at a loss and need help from some industry professionals who have experience with HL7 data modeling. Is it worth the extra effort to fully normalize the tables? Will performance on the SQL side be abysmal if I use these denormalized segment tables with hundreds of columns (most of which will be NULL for each row)? I'm not a DBA so I'm trying to understand the pitfalls of each approach. I've also looked at RIMBAA but the HL7 RIM seems like a foreign language to me as an HL7 newbie and translating v2 messages to the RIM would probably take far longer than I have to complete this project. I'm hoping I'm overthinking this and there is a simpler solution staring me in the face. Hopefully this question isn't too open ended.
HL7 is not a "tight" standard inputs and expected outputs vary depending on the system you are talking to. In this case the adding in a broker such as Mirth, Rhaposdy or BizTalk is a very good idea.
What ever solution you employ make sure you can cope with "non standard" input and output as you will soon find things vary. On the HL7 versions 2X and 3 be aware that very few hospitals have the version 3 most still run 2X.
I have been down the road of working with a database that tried to follow the HL7 structure, it can work however it will take time and effort. Given that you have a tight dead line maybe break out the bits of the data you will need to search on and have fields (e.g. PID segment 3 is the patient id would be useful to have) the rest can go in your varchar. Also if you are not indexing on the column you could use varchar(max).
As for your Guids in the database, this can work fine, but be careful not to cluster any indexes using the Guid as this will fragment your data. Do your research here and if in doubt go for identity columns instead.
I'll recommend the entity framework too, excellent ORM, well worth learning.
So my overall advice. Go for a hybrid for now, breaking out what you need. Expect it to evolve over time breaking out the pieces of HL7 into their own areas as needed. Do write a generic HL7 parser (not too difficult I've done it a couple of times) and keep it flexible. But most of all expect the HL7 to vary in structure don't treat the specification as 100% truth you will get variations.
I can only comment on the CDA (and some very limited HL7v2) side of things based on personal experience.
We receive and send CDA documents wrapped in HL7v3 wrappers from external vendors (as well as internal systems -- see below). The wrappers contain the metadata for things like sending/receiving systems/dates and other high-level data. The very limited message metadata is stripped and stored in the message data repository. Inside the wrapper, is the actual CDA, which is then taken and stored as XML datatype in the SQL database.
Using this model we can then search at the metadata level, but also narrow it down based on the CDA using Xpath queries. It makes the database much simpler...I can't even imagine creating columns based on the CDA schema.
As for making clients follow the CDA schema, as a part of the project we've created an implementation guide which clients must follow if they want to have their messages accepted.
Using the implementation guide + schematron + BizTalk and XSD validation, we only accept messages which follow the CDA schema. We then check some data fields using schematron validation and reject if any of those fail. This is relayed to the sender using an HL7v3 message back to them with the specific error message and/or fields that are invalid. This is a point at which a message will be stored in the database.
This is all done in BizTalk/SQL Server. And since the CDA schema is very much pre-defined by the HL7 group, you can make the consumers of this system follow the schema. This is unlike what I've seen with HL7v2 where it seems people just bend the schema as needed.
For the HL7v2 side of things, I'm 99% certain that "we" (as in, "my company") are storing the messages much in the same way. Except since since the HL7v2 schema is so open, we're not validating and just accepting/storing all messages. An HL7v2 parser has been written to parse the HL7v2 using the variations of schemas we know about.
In my project's case, we are sending HL7v2 from our HCIS --> Mirth --> BizTalk which then follows the Implementation guide + CDA Schema along with an XSLT transform to map the HL7v2 to CDA THEN submits it to the OTHER BizTalk CDA Submission service as though it was an external vendor.
That's a ton of reading right now, so please ask questions, as I'd like to talk about it.
In most cases it's a waste of time to try to create a normalized relational data model to persist HL7 V2 or V3 data. I would recommend just storing entire messages or documents as single XML column values. Then query using SQLXML and/or XQuery. All modern relational databases support this now.
Modeling on HL7 can be a pain.
I would do the following;
use the standards described in HL7 for staging tables, that way even if you have varchar(1024) and they are null it does not hurt you
create your actual table to be populated from the staging table as per the standards that you have enforced or will enforce.
This means that you have 500+ columns from the message but only 10 or 50 make sense, you will need to model only your 50. Yes, this has a lopside, tomorrow you want to make more meaning then it will increase from 50 to 75, the historical messages will not have information; which is fine but you will need to factor into the design.
I would under no circumstances attempt to model anything using the HL7 v3 RIM. The reason is that this schema is very generic, deferring much of the metadata to the message itself. Are you familiar with an EAV table? The RIM is like that.
On the other hand, HL7 v2 should be a fairly simple basis for a DB schema. You can create tables around segment types, and columns around field names.
I think the problem of pulling in everything kills the project and you should not do it. Typically, HL7 v2 messages carry a small subset of the whole, so it would be an utter waste to build out the whole thing, and it would be very confusing.
Further, the version of v2 you model would impact your schemas dramatically, with later versions, more and more fields become repeating fields, and your join relationships would change.
I recommend that you put a stake in the sand and start with v2.4 which is pretty easy yet still more complicated than most interfaces actually in use. Focus on a few segments and a few fields. MSH and PID first.
Add an EAV table to capture what may come in that you don't yet have in your tables. You can then look at what comes into this table over time and use it to decide what to build next. Your EAV could look like this MSG_ID, SEGMENT, SET_ID, FIELD_NAME, FIELD VALUE. Just store the unparsed HL7 contents of the field value.

Migrating RMS to RDB

We're approaching the migration of legacy OpenVMS RMS files into relational database (both MS SQL 2012 and Oracle 10g are available).
I wonder if there are:
Tools to retrieve schema of indexed files
Tools to parse indexed files
Tools to deal with custom RMS data formats (zoned decimals etc)
as a bundle/API/Library
Perhaps I should change the approach?
There are several tools available, notably through ODBC vendors (I work for one: Attunity).
1 >> Tools to retrieve schema of indexed files
Please clarify. Looking for just record/column layout and indexes within the files or also relationships between files.
1a) How are the files currently being used? Cobol, Basic, Fortran programs? Datatrieve?
They will be using some data definition method, so you want a tool which can exploit that.
Connx, and Attunity Connect can 'import' CDD definitions, BASIC - MAP files, Cobol Copybooks. Variants are typically covered as well. I have written many a (perl/awk) script to convert special definition to XML.
1b ) Analyze/RMS, or a program with calling RMS XAB's can get available index information. Atunity connect will know how to map those onto the fields from 1a)
1c ) There is no formal, stored, relationship between (indexed) files on OpenVMS. That's all in the program logic. However, some modestly smart Perl/Awk/DCL script can often generate a tablem of likely foreign/primary keys by looking at filed names and datatypes matches.
How many files / layouts / gigabytes are we talking about?
2 >> Tools to parse indexed files
Please clarify? Once the structure is known (question 1), the parsing is done by reading using that structure right? You never ever want to understand the indexed file internals. Just tell RMS to fetch records.
3 >> Tools to deal with custom RMS data formats (zoned decimals etc) as a bundle/API/Library
Again, please clarify. Once the structure is known just use the 'right' tool to read using that structure and surely it will honor the detailed data definitions.
(I know it is quite simple to write one yourself, just thought there would be something in the industry)
Famous last words... 'quite simple'. Entire companies have been build and thrive doing just that for general cases. I admit that for specific cases it can be relatively straightforward, but 'the devil is in the details'.
In the Attunity Connect case we have a UDT (User Defined data Type) to handle the 'odd' cases, often involving DATES. Dates in integers, in strings, as units since xxx are all available out of the box, but for example some have -1 meaning 'some high date' which needs some help to be stored in a DB.
All the databases have some bulk load tool (BCP, SQL$LOADER).
As long as you can deliver data conforming to what those expect (tabular, comma-seperated, quoted-or-not, escapes-or-not) you should be in good shape.
The EGH tool Vselect may be a handy, and high performance, way to bulk read indexed files, filter and format some and spit out sequential files for the DB loaders. It can read RMS indexed file faster than RMS can! (It has its own metadata language though!)
Attunity offers full access and replication services.
They include a CDC (change data capture) to not a only load the data, but to also keep it up to date in near-real-time. That's useful for 'evolution' versus 'revolution'.
Check out Attunity 'Replicate'. Once you have a data dictionary, just point to the tables desired (include, exlude filters), point to a target DB and click to replicate. Of course there are options for (global or per-table) transformations (like an AREA-CODE+EXHANGE+NUMBER to single phone number, or adding a modified date columns ).
Will this be a single big switch conversion, or is there desire to migrate the data and keep the old systems alive for days, months, years perhaps, all along keeping the data in close sync?
Hope this helps some,
Hein van den Heuvel.
OP: Perhaps I should change the approach? Probably.
You might consider finding data migration vendors, some which likely have off-the-shelf solutions, if not as a COTS tool, more likely packaged as a service (I don't think this is a big market).
What this won't help you with is what I think of as much bigger problem with the application code: who is going to change all the code that is making RMS calls, in the corresponding code that makes relational DB calls? How will the entity ("Joe Programmer", or some tool), know where the data migrated to, so that he can write the correct call? What are you doing to do about the fact that the data representation is like to change?
Ideally you'd like an automated migration tool, that will move the data itself (therefore knows that datalayouts and representation changes), and will make the code changes that correspond. You can look for these kind of vendors, too.

What data generators?

I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products - because if there is a well known and useful existing tool, I can write my work off to experience. I am aware of of a couple of SQL Server specific tools, but mine is not database specific.
So, links? And if you have used such a product,
what features did you find it was missing?
Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and
supports weighting. It is XML based (sorry, folks) and lets you say things like:
<pick distribute="20,80" >
<datafile file="femalenames.dat"/>
<datafile file="malenames.dat"/>
<pick/>
to select female names about 20% of the time and male names 80% of the time.
But the purpose of this question is not to describe my product but to get info on other tools.
Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest
That can be a one-liner in R where I use the littler scripting front-end:
# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd#ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),
x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv",
quote=FALSE, row.names=FALSE)'
edd#ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd#ron:~$
You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.
R can also write to DBs, or read parameters from it, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.
I asked a similar question some months ago:
Tools for Generating Mock Data?
I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).
Features I was looking for were:
Generate mock data to fill existing database tables
Quick to generate > 1 million rows
Produce either SQL script format or flat file suitable for importing
Scriptable command-line interface, not a GUI
Not dependent on Microsoft Windows environment
Nice-to-have features:
Extensible/configurable
Open-source, free license
Written in a dynamic language like Perl/PHP/Python
Point it at a database and let it "discover" the metadata
Integrated with testing tools (e.g. DbUnit)
Option to fill directly into the database as it generates data
The answer I accepted as Databene Benerator. Though since asking the question, I admit I haven't used it very much.
I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.

enCapsa -what is it and what is used for?

It may not be a pure programming question but I'm looking for information about enCapsa. Do you know what it is, have you ever used it? I'm reading some papers about it but I can't really see how it works and what it can be used for in an IT company (and this is what i am supposed to find out).
Basically enCapsa is a shared data storage system, focused on providing a way for storing any kind of data (even from etherogeneous data sources such as different designed db tables) and consequently obtain it through a sort of human friendly queries, just like on a search engine. They offer the possibility to upload data from everywhere (it's CSV based) and later download to use wherever you need it.
Usages are many, consider it's a centralized DB accessible through web and they say it meets high security standard.
A usefull way to employ this service is to have data stored in there without the need to keep them synchronized across company computers.