Generating Record Layouts for EBCDIC Data Files - migration

We are attempting to write a tool in Perl that is expected to parse a fixed-length EBCDIC data file and generate the record layout by looking at the hex value of each byte in the record.
It is assumed that each data file, which is written by a Cobol program whose source code we do not have, can have multiple record layouts. The aim of this tool is to perform data migration (EBCDIC to ASCII) by generating a layout which would then be fed to a converter.
The problem is that there are hundreds of permutations and combinations that may arise with each byte. I thought that comparing the hex value of the corresponding byte in the record below the current one might give us some clue as to what it might be. But even in this case there is no concrete solution that one might arrive at. Decisions need to be taken at every juncture, and each one might affect the end result.
Could someone please let me know of any patterns that I can look for? For example, for all COMP-3s each nibble can possibly represent a value from 0-9, and hence the hex value of the byte might be something like [0-9][0-9]. Essentially, for data migration one need not bother about COMPs and COMP-3s, as their values would not be affected by the migration. But identifying what the DISPLAY fields are is also turning out to be a huge task. Can someone throw out some ideas or point me in a direction that I can explore further?
Any help would be highly appreciated. I am really stuck in a mire here.
Thanks,
Aditya.

There are many enterprise transformation tools that will do exactly what you need. Alternatively, it is easy to parse the ADATA records from the compiled copybooks to get the exact byte positions and representations of every field.
Can I hazard a guess? Do you have nobody skilled in Cobol? It isn't that hard to process Cobol copybooks, certainly not as hard as it is to use a write-only language like Perl.
Do you have SyncSort or DFSORT available? Either will do what you ask with a simple config file...

I guess you have to go with probabilities, and hope the data is varied enough to get a lot out of that.
Any field that contains only the EBCDIC values for alphanumeric characters plus punctuation is likely a character (DISPLAY) field.
Numeric DISPLAY fields will be the easiest, containing just EBCDIC 0-9. Note that if the field is signed, the sign is carried in the zone nibble of the last digit, so that digit shows up as a letter (A-I for +1 to +9, J-R for -1 to -9).
A fairly random distribution of values, often leading with hex 0s, will likely be a binary numeric "COMP" field.
COMP-3 fields store one decimal digit in each hex digit (nibble) of the data. So if all the hex digits happen to be 0-9, that's a strong sign of a COMP-3 field, except for the last hex digit of the field, which will contain a C for positive, D for negative, or F for unsigned.
Some programs use spaces in numeric fields, so if a field contains all sorts of binary values and also hex 40 (spaces), it's probably best to toss the hex 40 out of the mix. It might tell you a group of bytes is one field if they are all spaces together, or all data together.
As for multiple layouts, that's tough. A common convention for records that can have multiple layouts is to have a limited set of values for "what type of data is this" near the front of the record, like significantID, recordType, data. The significantID should increase steadily, while the recordType field will vary between just a few values and recycle.
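
To make those heuristics concrete, here is a rough sketch of the classification logic in Python (the tool itself is in Perl, so treat this purely as pseudocode). The field boundaries, the "printable EBCDIC" set, and the function name guess_field_type are assumptions for illustration, not a complete classifier.

# EBCDIC code points for '0'-'9' are 0xF0-0xF9; 0x40 is the EBCDIC space.
EBCDIC_DIGITS = set(range(0xF0, 0xFA))
# A very rough "printable" set: space, some punctuation, letters and digits.
EBCDIC_TEXT = ({0x40, 0x4B, 0x4D, 0x5D, 0x60, 0x6B, 0x6C, 0x7A}
               | set(range(0x81, 0x8A)) | set(range(0x91, 0x9A)) | set(range(0xA2, 0xAA))
               | set(range(0xC1, 0xCA)) | set(range(0xD1, 0xDA)) | set(range(0xE2, 0xEA))
               | EBCDIC_DIGITS)

def guess_field_type(field: bytes) -> str:
    """Classify one candidate field (a slice of a record) using the heuristics above."""
    nibbles = [n for b in field for n in (b >> 4, b & 0x0F)]
    # COMP-3: every nibble 0-9 except the last one, which is the sign (C/D/F).
    if all(n <= 9 for n in nibbles[:-1]) and nibbles[-1] in (0xC, 0xD, 0xF):
        return "COMP-3 candidate"
    # Zoned (numeric DISPLAY): all EBCDIC digits, last byte possibly sign-overpunched.
    if all(b in EBCDIC_DIGITS for b in field[:-1]) and (
            field[-1] in EBCDIC_DIGITS or field[-1] >> 4 in (0xC, 0xD)):
        return "numeric DISPLAY candidate"
    # Character DISPLAY: everything printable (spaces allowed).
    if all(b in EBCDIC_TEXT for b in field):
        return "character DISPLAY candidate"
    # Otherwise: likely binary (COMP), or the field boundary is wrong.
    return "COMP / binary candidate"

Run that over the same byte range of many records and vote on the result; a single record is rarely conclusive.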

The FileWizard in RecordEditor / JRecord can search for mainframe Cobol fields in a file. The FileWizard results can be stored in an XML file for use in other languages, or you can use the copy function to copy from EBCDIC to either fixed-width ASCII or CSV formats.
There is some out-of-date documentation on the File Wizard.

Related

HANA: Unknown Characters in Database column of datatype BLOB

I need help resolving characters of an unknown type from a database field into a readable format, because I need to overwrite this value at the database level with another valid value (in the exact format the application stores it in) to automate system copy activities.
I have a proprietary application that also allows users to configure it via the frontend. This configuration data gets stored in a table, and the values of a configuration property are stored in a column of type "BLOB". For the value in question, I provide a valid URL in the application frontend (like http://myserver:8080). However, what gets stored in the database is not readable (some square characters). I tried all sorts of HANA conversion functions (HEX, binary), both simple and cascaded (e.g. first to binary, then to varchar), to make it readable. I also tried it the other way around, to make the value that I want to insert appear in the correct format (conversion to BLOB via hex or binary), but this does not work either. I copied the value to the clipboard and compared it to all sorts of character set tables (although I am not sure this can work at all).
My conversion tries look somewhat like this:
SELECT TO_ALPHANUM('') FROM DUMMY;
where the quotes would contain the characters in question. I can't even print them here.
How can one approach this and maybe find out the character set that is used by this application? I would be grateful for some more ideas.
What you have in your BLOB column is a series of bytes. As you mentioned, these bytes have been written by an application that uses an unknown character set.
In order to interpret those bytes correctly, you need to know the character set as this is literally the mapping of bytes to characters or character identifiers (e.g. code points in UTF).
Now, HANA doesn't come with a whole lot of options to work on LOB data in the first place and for C(haracter)LOB data most manipulations implicitly perform a conversion to a string data type.
So, what I would recommend is to write a custom application that is able to read out the BLOB bytes and perform the conversion in that custom app. Once successfully converted into a string, you can store the data in a new NCLOB field that keeps it in UTF-8 encoding.
You will have to know the character set in the first place, though. No way around that.
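
If it helps, here is a minimal sketch (Python, purely illustrative) of the custom-app approach: fetch the raw BLOB bytes with whatever driver you use, then trial-decode them with a few likely character sets and see which one reproduces the URL you typed into the frontend. The candidate encoding list and the function name try_decode are my assumptions, nothing HANA-specific.

CANDIDATE_ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "cp1252", "latin-1", "cp500"]

def try_decode(raw_bytes: bytes, expected_fragment: str = "http://") -> None:
    # Try each candidate character set and flag the ones that look right.
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
        marker = " <-- contains the expected fragment" if expected_fragment in text else ""
        print(f"{enc:10s} -> {text!r}{marker}")

# Example: a UTF-16-LE encoded URL often shows up as "square characters"
# in tools that assume a single-byte encoding.
try_decode("http://myserver:8080".encode("utf-16-le"))

Once you know which encoding reproduces the value, you can encode your replacement URL the same way before writing it back.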
I assume you are on Oracle. You can convert BLOB to CLOB as described here.
http://www.dba-oracle.com/t_convert_blob_to_clob_script.htm
In case of your example try this query:
select UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(<your_blob_value>)) from dual;
Obviously this only works for values below 32767 characters.

Any bad effect if I use TEXT data type to store a number

Is there any bad effect if I use the TEXT data type to store an ID number in a database?
I do something like:
CREATE TABLE GenData ( EmpName TEXT NOT NULL, ID TEXT PRIMARY KEY);
And actually, if I want to store a date value I usually use the TEXT data type. If this is the wrong way, what are its disadvantages?
I am using PostgreSQL.
Storing numbers in a text column is a very bad idea. You lose a lot of advantages when you do that:
you can't prevent storing invalid numbers (e.g. 'foo')
Sorting will not work the way you want ('10' is "smaller" than '2')
it confuses everybody looking at your data model.
I want to store a date value I usually use TEXT
That is another very bad idea, mainly for the same reasons you shouldn't store a number in a text column. In addition to completely wrong dates ('foo'), you can't prevent "invalid" dates either (e.g. February 31st). And then there is the sorting thing, and comparison with > and <, and date arithmetic...
I really don't recommend using text for dates.
Look at all the date/time functions you are missing with text.
If you want to use them, you have to cast, and you will run into problems if any of the stored dates happen to be invalid, because with text there is no validation.
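
The sorting and validation problems are easy to demonstrate outside the database; this quick, purely illustrative Python snippet shows the same behaviour you would get from a text column:

# Text sorting compares character by character, so '10' < '2',
# exactly like an ORDER BY on a text column would.
print(sorted(["10", "2", "9"]))   # ['10', '2', '9']
print(sorted([10, 2, 9]))         # [2, 9, 10]

# A text column happily stores a date that does not exist; a date type refuses.
from datetime import date
bad = "2023-02-31"                # stores fine as text
try:
    date.fromisoformat(bad)       # a real date type rejects it
except ValueError as err:
    print("rejected:", err)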
In addition to what the other answers already provided:
text is also subject to COLLATION and encoding, which may complicate portability and data interchange between platforms. It also slows down sorting operations.
Concerning the storage size of text: an integer occupies 4 bytes (and is subject to padding for data alignment). text or varchar occupies 1 byte plus the actual string, which is 1 byte per ASCII character in UTF-8, or more for special characters. Most likely, text will be much bigger.
It depends on what operations you are going to do on the data.
If you are going to be doing a lot of arithmetic on numeric data, it makes more sense to store it as some kind of numeric data type. Also, if you plan on sorting the data in numerical order, it really helps if the data is stored as a number.
When stored as text, "11" comes ahead of "9" because "1" comes ahead of "9". If this isn't what you want, don't use text.
On the other hand, it often makes sense to store strings of digits, such as zip codes, social security numbers, or phone numbers, as text.

Can Long Integer store letters?

I ran 'Analyze Performance' feature in Access and it had an "idea" to improve performance; Access said I should convert items that are alphanumeric mixes that look like this 12BB1-DF740§ from text data type into long integer (the specific name from the idea). Whether Access is right that this would improve performance is secondary to whether long integer can store letters at all.
[§ About the data - the hyphen in the data provided to me is always present at that location; the letters are always A-F]
From what I can tell, w3schools is indicating that Long will only store numbers
Long - Allows whole numbers between -2,147,483,648 and 2,147,483,647
Am I conflating data types? (Further, when I pull up the design view, it only offers number as a data type; there is no long or long integer)
Can Long Integer store letters?
If my column is already populated, and I convert the data type, will I lose data?
You could store those values by splitting them into 2 Long Integer columns. Then when you need the original text form, concatenate their Hex() values with a dash between.
? Hex(76721) & "-" & Hex(915264)
12BB1-DF740
However I don't see why that would be worth doing. Occasionally a performance analyzer suggestion just doesn't make sense to me; this is such a case.
I've never run into this but it looks like it thinks your strings are hexadecimal numbers.
If you never have letters other than A-F, then you could store them as longs and convert back using the Hex() function, but that seems mighty kludgy and something I'd avoid unless you're really desperate to eke out some performance.
If it is in fact hexadecimal data, and it always has the same format so that the dash could just be added at the same place, then it would be possible to store the data numeric, and convert it into the hexadecimal notation when needed.
Ten hexadecimal digits represent 40 bits of data, so the Long type described on the w3schools page wouldn't do, as it's only 32 bits. You would need a data type that is 64 bits wide, like a double or bigint. (The latter might not be available in Access.)
However, that would only be any real gain if you actually do any processing of the data in the numeric form. Otherwise you would only save a few bytes per record, and you would need extra processing to convert to and from the numeric format.
If your table is already populated, you would have to read out the values, convert them, and store them back in the numeric form.
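
For what it's worth, the round trip itself is trivial; here is a hedged Python sketch of splitting the value into two integers and formatting it back. The split point and the five-digit zero padding are assumptions based on the sample value 12BB1-DF740.

def to_ints(code: str) -> tuple[int, int]:
    # "12BB1-DF740" -> (76721, 915264)
    left, right = code.split("-")
    return int(left, 16), int(right, 16)

def to_code(left: int, right: int) -> str:
    # (76721, 915264) -> "12BB1-DF740"
    return f"{left:05X}-{right:05X}"

print(to_ints("12BB1-DF740"))   # (76721, 915264)
print(to_code(76721, 915264))   # 12BB1-DF740

As noted above, though, the conversion overhead usually outweighs the few bytes saved unless you actually compute with the numeric form.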

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using postgresql with the sqlalchemy orm framework for python.
In PostgreSQL there is no technical difference between varchar and text.
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data, use text.
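
Since the question mentions SQLAlchemy: the choice maps directly onto String(n) versus Text. A small, illustrative sketch (the table and column names are made up):

from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Article(Base):
    __tablename__ = "article"
    id = Column(Integer, primary_key=True)
    slug = Column(String(80), nullable=False)  # length limit wanted -> varchar(80)
    body = Column(Text)                        # no length limit -> text

On PostgreSQL both columns use the same underlying storage; String(80) just adds the length check.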
This sentence is wrong:
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
If you are saving, for example, MD5 hashes, you do know how large the field you're storing is, and your storage becomes more efficient. Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
In brief:
Variable length fields save space, but because each field can have different length, it makes table operations slower
Fixed length fields make table operations fast, although must be large enough for the maximum expected input, so can use more space
Think of an analogy to arrays and linked lists, where arrays are fixed length fields, and linked lists are like varchars. Which is better, arrays or linked lists? Lucky we have both, because they are both useful in different situations, so too here.
In most cases you do know what the max length of a string in a field is. For a first or last name, for example, you don't need more than 255 characters. So by design you choose which type to use; if you always use text, you're wasting resources.
Check this article on PostgresOnline; it also links to two other useful articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very different from VARCHAR because other databases behave very different with these two datatypes.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8k in SQL Server (I think). Most applications don't require nearly that much storage for a single column.

Is the CHAR datatype in SQL obsolete? When do you use it?

The title pretty much frames the question. I have not used CHAR in years. Right now, I am reverse-engineering a database that has CHAR all over it, for primary keys, codes, etc.
How about a CHAR(30) column?
Edit:
So the general opinion seems to be that CHAR is perfectly fine for certain things. I, however, think that you can design a database schema that does not have a need for "these certain things", and thus does not require fixed-length strings. With the bit, uniqueidentifier, varchar, and text types, it seems that in a well-normalized schema you get a certain elegance that you don't get when you use encoded string values. Thinking in fixed lengths, no offense meant, seems to be a relic of the mainframe days (I learned RPG II once myself). I believe it is obsolete, and I have not heard a convincing argument from you claiming otherwise.
I use char(n) for codes and varchar(m) for descriptions. Char(n) seems to result in better performance because the data doesn't need to move around when the size of the contents changes.
Where the nature of the data dictates the length of the field, I use CHAR. Otherwise VARCHAR.
CHARs are still faster to process than VARCHARs in the DBMSs I know well. Their fixed size allows for optimizations that aren't possible with VARCHARs. In addition, the storage requirements are slightly lower for CHARs, since no length has to be stored, assuming most of the rows need to fully, or nearly fully, populate the CHAR column.
This is less of an impact (in terms of percentage) with a CHAR(30) than a CHAR(4).
As to usage, I tend to use CHARs when either:
the fields will generally always be close to or at their maximum length (stock codes, employee IDs, etc); or
the lengths are short (less than 10).
Anywhere else, I use VARCHARs.
I use CHAR when the length of the value is fixed. For example, we might generate a code based on some algorithm that always returns a code with a specific fixed length, let's say 13.
Otherwise, I find VARCHAR better. One more reason to use VARCHAR is that when you get the value back in your application, you don't need to trim it. In the case of CHAR you will get the full length of the column whether the value fills it or not; it gets padded with spaces, so you end up trimming every value, and forgetting to do so leads to errors.
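
The trailing-space behaviour is easy to picture; here is a small, purely illustrative Python sketch of what you typically get back from a CHAR(13) column with a driver that preserves the padding (the exact behaviour varies by DBMS and driver):

DECLARED_WIDTH = 13
stored = "ABC-001".ljust(DECLARED_WIDTH)  # how the value sits in a CHAR(13)

print(repr(stored))                  # 'ABC-001      '
print(stored == "ABC-001")           # False: the padding breaks naive comparisons
print(stored.rstrip() == "ABC-001")  # True after trimming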
For PostgreSQL, the documentation states that char() has no advantage in storage space over varchar(); the only difference is that it's blank-padded to the specified length.
Having said that, I still use char(1) or char(3) for one-character or three-character codes. I think that the clarity due to the type specifying what the column should contain provides value, even if there are no storage or performance advantages. And yes, I typically use check constraints or foreign key constraints as well. Apart from those cases, I generally just stick with text rather than using varchar(). Again, this is informed by the database implementation, which automatically switches from inline to out-of-line storage if the value is large enough, which some other database implementations don't do.
CHAR isn't obsolete; it just should only be used if the length of the field never varies. In the average database, this applies to very few fields, mostly code fields like state abbreviations, which are a standard 2-character field if you use the postal codes. Using CHAR where the field length is variable means there will be a lot of trimming going on, and that is extra, unnecessary work; the database should be refactored.