Solutions to Import an IMS Hierarchical ASCII file into SQL/ACCESS - sql

I have a large ASCII dataset (2.7 GB) which I believe is in an IMS hierarchical format. I'm unsure how to access the data to get it into a usable database; I would guess SQL, but am open to other solutions. This is the "Layout" that came with the database, if it's at all helpful...

If you do not have a programming background, you are in big trouble: Excel and MS Access will not help you much.
So the answer is:
Hire some programmers with COBOL / COBOL conversion experience!
UIC-MN-H10-SEGMENT
The COBOL copybook tells you the format of the file. The format of UIC-MN-H10-SEGMENT is:
2-byte segment id (10 ???)
4-byte year
2-byte month
4-byte average injection pressure, etc.
This is a multi-record file.
Tools that you might be able to use:
RecordEditor might be able to display the file (size might be a problem). RecordEditor will also take a bit of getting used to.
COBOL (e.g. GNU Cobol) will need COBOL programmers.
Java / JRecord needs Java programmers.
If it is only a single record type (unlikely), Cobol2Csv could do it.
To give a more meaningful answer, please supply the COBOL copybook in text format and some sample data.

So you are missing some key information here. You would actually want the IMS Database Descriptor (DBD) file in addition to the layout you pasted. The IMS DBD file describes the structure of the database. An IMS database can have many segments (aka tables) in it, which the DBD will describe along with other information such as the size of those segments.
The actual records will be stored in a flat file (probably the 2.7 GB ASCII file you mentioned) in a depth-first format. So let's say you had two segments, A and B, where B is a child of A. Your flat file might look like this: A1,B1,B2,B3,A2,B4,B5, where B1, B2, and B3 are children of A1 and B4 and B5 are children of A2. The reason this matters is that your layout information only provides an overlay for a specific segment structure.
So if your database has more than the one segment UIMNH10, you won't know where in the ASCII file to apply your starting point for the layout.
Now let's make a HUGE assumption here that your database only has one segment, UIMNH10. In that case your ASCII file would look like: A1, A2, A3, A4. That's pretty straightforward, as you would apply your layout over the data repeatedly.
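To picture the depth-first layout, here is a tiny Python sketch that regroups such an unload into parents and children. Everything in it is assumed for illustration only: two-character segment codes at the start of each record, newline-delimited records, and the codes 'A ' and 'B ' themselves; your real codes and record boundaries come from the DBD and copybook.
# Illustration only: regroup a depth-first unload into parent/child records,
# assuming newline-delimited records whose first 2 bytes are a segment code.
def group_segments(path, parent_code="A ", child_code="B "):
    parents = []
    with open(path, "r", encoding="ascii") as f:
        for line in f:
            record = line.rstrip("\n")
            code = record[:2]
            if code == parent_code:
                parents.append({"parent": record, "children": []})
            elif code == child_code and parents:
                parents[-1]["children"].append(record)
    return parents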
Luckily your data structures are pretty straightforward, as it's all character data. You would interpret PIC X(n) as a character string of length n. Similarly, PIC 9(n) would be a numeric character string of length n.
Assuming your sample data starts with: AA201805...
RRC-H10-SEGMENT-ID is 'AA' because it's PIC X(2)
MN-H10-CENTURY is '20' because it's PIC 9(2)
MN-H10-YEAR is '18' because it's PIC 9(2)
MN-H10-MONTH is '05' because it's PIC 9(2)
You would do this until you reach the end of your layout and then start again at the beginning for your next record. This is also making an ASSUMPTION that the layout definition MATCHES the length of your record.
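As a rough Python sketch of applying the layout repeatedly and dumping the result to CSV for import into SQL or Access: the four fields below are just the ones quoted above, the remaining widths must come from your copybook, and the assumption that the file is a stream of fixed-length ASCII records with no line terminators may not match your unload.
import csv

# Assumed layout: only the four fields quoted above; fill in the rest from the copybook.
LAYOUT = [
    ("RRC-H10-SEGMENT-ID", 2),   # PIC X(2)
    ("MN-H10-CENTURY", 2),       # PIC 9(2)
    ("MN-H10-YEAR", 2),          # PIC 9(2)
    ("MN-H10-MONTH", 2),         # PIC 9(2)
    # ... remaining fields from the copybook ...
]
RECORD_LENGTH = sum(width for _, width in LAYOUT)  # assumes the layout covers the whole record

def unload_to_csv(in_path, out_path):
    with open(in_path, "rb") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([name for name, _ in LAYOUT])
        while True:
            record = src.read(RECORD_LENGTH)
            if len(record) < RECORD_LENGTH:
                break                      # stop at end of file / trailing partial record
            row, offset = [], 0
            for _, width in LAYOUT:
                row.append(record[offset:offset + width].decode("ascii").strip())
                offset += width
            writer.writerow(row)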
Your best bet is to work with your IMS database administrator to confirm these assumptions, but once you get an idea of your starting points you should be able to map the data yourself or write a quick program to do it for you. There are some other alternatives as well, but those would assume some back-end setup for things like SQL support to read and dump the data into a CSV file for Excel.

Related

UniData - record count of all files / tables

Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g. using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see its fields:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are actually stored in the file. V- or I-types are calculations.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2,000,000 documents you're kind of stuck with an integer for the document ids; that makes 4 bytes + 4 bytes. The similarity score seems to be between 0.00 and 1.00, so I guess a byte would do by encoding 0.00-1.00 as 0..100.
So your table would be: id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 × 10^6)^2) × 9 / 2 bytes are needed; that's about 17 TB.
Of course, that's if you have just a basic table. Since you don't plan on querying it very often, I guess performance isn't that much of an issue, so you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square where each 'intersection' is a byte representing the relationship between its coordinates. This would "only" require about 3.6 TiB (4 × 10^12 bytes), but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
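A toy sketch of that square-of-bytes idea in Python/NumPy (the file name is made up and N is kept small here; at N = 2,000,000 the file would be the 4 × 10^12 bytes mentioned above):
import numpy as np

N = 1000  # toy size; the real case would be 2_000_000
# One byte per (i, j) pair, stored as an N x N memory-mapped file on disk.
scores = np.memmap("similarity.bin", dtype=np.uint8, mode="w+", shape=(N, N))

def put(i, j, similarity):
    # Encode a 0.00..1.00 score as an integer 0..100 in a single byte.
    scores[i, j] = round(similarity * 100)

def get(i, j):
    return scores[i, j] / 100.0

put(1, 4, 0.99)
print(get(1, 4))   # -> 0.99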
So I'd suggest using a hybrid approach: a table with 2 columns. The first column would hold the 'left' document id (4 bytes); the second column would hold, as a varbinary, the string of values for all documents whose id is above the one in the first column. Since a varbinary only takes the space that it needs, this helps us win back the space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2,000,000 - 1) bytes as value for the 2nd column
record 2 would have a string of (2,000,000 - 2) bytes as value for the 2nd column
record 3 would have a string of (2,000,000 - 3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (including overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Of course, the system is far from optimal. In fact, querying the information will require some patience, as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach is that you can easily add new documents by adding a new byte to the string of EACH record plus 1 extra record at the end. Operations like that will be costly though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, though, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and updating will require some creative use of SUBSTRING() operations, but nothing too complex.
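To make the SUBSTRING() arithmetic concrete, here is a small Python sketch of the lookup; the row/byte layout is the one described above, and the 1-based position is what you would feed to something like SUBSTRING(scores, position, 1) on a hypothetical (doc_id, scores) table:
def locate(i, j):
    # Row 'lo' stores one byte per document lo+1 .. N, in id order, so the pair
    # (i, j) lives at 1-based byte position hi - lo within that row's varbinary.
    lo, hi = min(i, j), max(i, j)   # symmetry: only the lower id's row stores the pair
    return lo, hi - lo

def decode(byte_value):
    return byte_value / 100.0       # bytes hold the score scaled to 0..100

print(locate(1, 4))   # -> (1, 3): the third byte of document 1's varbinary
print(locate(4, 1))   # -> (1, 3) as well, thanks to symmetry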
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save roughly another 250 GB, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: This line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you go searching for all documents that have a relationship value of 0.89 or above, the system will have to scan the entire table, and even with modern disks that IS going to take a while.
Mind you, all of this is the result of half an hour of brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

Questions regarding full-text search of PDFs in SQL Server 2008: is it restricted to embedded text only?

What I'm curious about: let's say I have 100 PDFs, and all of them show the words "happy apple". Let's say that only 20 of these have embedded text containing "happy apple".
When I do a search for "happy apple", will I receive all 100 docs or only 20? I'm unable to find a clear answer to this question.
Flat out impossible to answer without any further information on your search tool and the actual PDFs.
"Happy apple" will be found if the text is 1. not compressed, 2. not encrypted, 3. not weirdly constructed, 4. not re-encoded, or 5. re-encoded but the translation table to Unicode is present and correct.
ad 1: Usually data streams in a PDF are compressed, using one or more algorithms from the standard set (usually LZW or Flate).
ad 2: PDFs may be encrypted with a password, preventing casual inspection. Levels of security range from mid-difficult to theoretically uncrackable with current technology.
ad 3: Single characters may appear on your page in any order. The software used to create the PDF may, at its whim, split up a text string into separate parts or even draw each individual character at any position, and omit all spaces. Only strict sorting on the absolute x and y coordinates of each text fragment may reveal the original text.
ad 4: If a font gets subsetted, a PDF composer may decide to store 'h' as 0, 'a' as 1 and 'p' as 2 (and so on). The correct glyphs are still associated with these codes, but the text "happy apple" may now appear as "0 1 2 2 3 4 1 2 2 5 6" in the text stream. Also, even if it does not subset the font, a PDF composer is free to move characters around anyway.
ad 5: To revert this re-encoding, the PDF may include a ToUnicode table. This associates the character codes back to the original Unicode values; one table per re-encoded font. If the table is missing, there usually is no straightforward way to recreate it.
There is even an ad 6 I did not think of: text may be outlined or appear in bitmaps only.
Only the very simplest PDFs can be searched with a general tool such as command-line grep. For anything else, you need a good PDF decoding tool; the better it is, the more points of this list you can tick off (except, then, #5 and #6).
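If you want to check which of your PDFs have extractable embedded text at all before worrying about the full-text index, something like the following Python sketch (using the pypdf library; the file name is hypothetical) gives a rough answer; PDFs that fail this test won't be found by a text search either:
from pypdf import PdfReader

def has_phrase(path, phrase="happy apple"):
    # Concatenate whatever text pypdf can pull out of each page, then search it.
    text = ""
    for page in PdfReader(path).pages:
        text += (page.extract_text() or "") + "\n"
    return phrase.lower() in text.lower()

print(has_phrase("example.pdf"))   # hypothetical file name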
(Later edit) Oh wait. You obfuscated your actual question enough to entirely throw me off the target, which (I think!) was "does sql-server-2008 search for entire phrases or for individual words?"
Good thing, then, the above still holds. If you cannot search inside your PDFs anyway, the actual question is moot.

Customizing output from a database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting, so the files may have rudimentary tables and spacing. You'd be taking the data from the database, transforming it into a specified format (while doing some basic logic) and saving it as a text file (you could store it in XML as an intermediate step).
If you had to create 10 of these unique files, what would be the ideal approach to creating them? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create another 10 of these files a year down the road?
What do you think is a good approach to this problem: keeping the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-text transformer which can read the output of (a) and the stylesheet from (b) and write out the result (a minimal sketch of steps (b) and (c) follows below).
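A minimal Python sketch of steps (b) and (c), assuming step (a) has already written its rows to a hypothetical query_output.xml and that report.xsl is an XSLT stylesheet with <xsl:output method="text"/> that lays out the desired report:
from lxml import etree

def render_report(xml_path, xsl_path, out_path):
    # Apply the stylesheet (b) to the query output (a) and write the text result (c).
    transform = etree.XSLT(etree.parse(xsl_path))
    result = transform(etree.parse(xml_path))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(result))

render_report("query_output.xml", "report.xsl", "report.txt")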

Adding custom data to GapMinder

Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array of objects in JSON that would be easy to show in moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
Identify the measures to follow at each time step
and so on.
Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data into a format suitable for Gapminder. More specifically, looking in assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data by copying the structure of the XML files in that directory (this is the tricky part): you can specify some metadata in the preamble, and then specify your own data series with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, followed by the minimum and maximum of the series, and the year it refers to (a small helper for generating such lines is sketched after this list).
In my humble opinion, Gapminder is a great app but it definitely needs more work on integration with other datasets. Way better to use Google Motion Chart as you did, or MooGraph (site and doc), which is unfortunately not as great as Gapminder.
@Stefano
The information you provided is very valuable. Is a detailed specification of the XML data files available anywhere?
Anyway, just to enrich your response, I also found that:
overview.xml file
The link between Nations and their IDs is in this file
The structure of the menus for the selection of the indicators is also in the same file (at the bottom) under the section <indicatorCategorization>
The structure of the datafile XML
For each line, the year represents the first year of the series, and then the values follow, one per year, comma-separated.
Thanks,
Max
I ended up using the Google Motion Chart API. I ended up with this.