How do I define the record structure of an EBCDIC file? - apache-spark-sql

I have an EBCDIC file in HDFS. I want to load the data into a Spark DataFrame, process it, and write the results as ORC files. I found an open-source solution, Cobrix, that can read data from EBCDIC files, but the developer must provide a copybook file, which is the schema definition.
A few lines of my EBCDIC file are shown in the attached image.
I want to work out the copybook format for this file. Essentially I want to read three fields: vin (length 17), vin_data (length 3), and vin_val (length 100).

How do I define a copybook file for EBCDIC data?
You don't.
A copybook may be used as a record definition (that is, how the data is stored); it has nothing to do with the encoding of the data stored in that record.
This leaves the question "How do I define the record structure?"
You need the number of fields, their lengths, and their types (likely not only USAGE DISPLAY), and then you define them with names of your choosing. Ideally you get the original record definition from the COBOL program that writes the file, put it into a copybook if it isn't in one yet, and use that.
Your link has samples that show what a copybook looks like. If you struggle with the definition, please edit your question to include the copybook you've defined and we may be able to help.

Based on your comment in the question, and looking at the input file, you could start with this.
       01  VIN-RECORD.
           05  VIN        PIC X(17).
           05  VIN-COUNT  PIC S9(5) COMP-3.
           05  VIN-VALUE  PIC X(100).
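I'm guessing that the second field is COMP-3 because, in all six examples, its last byte ends in a C nibble. In packed decimal the sign is stored in the low-order half of the final byte: C indicates a positive value, D a negative value, and F an unsigned value. PIC S9(5) COMP-3 holds five digits plus the sign in six nibbles, which is the 3 bytes mentioned in the question.
The third field always occupies 100 bytes; its content varies in length and is right-padded with spaces.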
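Before handing the copybook to Cobrix, the guessed layout can be checked against one raw record. The following is a minimal sketch, not part of the original answer: it assumes fixed-length 120-byte records (17 + 3 + 100) and EBCDIC code page 037 (`cp037`), and the helper names `comp3_to_int` and `parse_vin_record` are made up for illustration.

```python
RECORD_LEN = 17 + 3 + 100  # VIN + VIN-COUNT (3 packed bytes) + VIN-VALUE


def comp3_to_int(raw: bytes) -> int:
    """Decode a packed-decimal integer: digit nibbles followed by one sign nibble."""
    nibbles = raw.hex()              # b"\x00\x12\x3c" -> "00123c"
    value = int(nibbles[:-1])        # every nibble except the last is a digit
    return -value if nibbles[-1] == "d" else value


def parse_vin_record(rec: bytes) -> dict:
    """Split one record by the offsets implied by the guessed copybook."""
    if len(rec) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} bytes, got {len(rec)}")
    return {
        "vin": rec[0:17].decode("cp037"),              # assumption: code page 037
        "vin_count": comp3_to_int(rec[17:20]),
        "vin_value": rec[20:120].decode("cp037").rstrip(" "),
    }
```

If the decoded VIN looks wrong or the count is implausible, the field lengths or the code page are probably not what was assumed.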

Related

VB.net Read Cobol File Fields (Pure Binary, EBCDIC, Packed)

I need to read a Cobol file into VB.net. Here is the description of the data types from the documentation:
All Magnetic tape files are recorded in 9-track, 800 BPI mode with odd parity. They are created IBM equipment disk operating system. IBM System 360 Standard.
Binary - Data is coded in pure binary code.
BCD - Data is coded in binary coded decimal format. (Primarily for files created by the IBM 1401 System).
EBCDIC - Data is coded in extended binary coded decimal interchange code. (An IBM developed code.)
Packed - Data is coded in packed decimal format.
File Format:
1-2 Record Count [Numeric] (Binary)
3-4 Filler (Binary)
5-5 Record Type [B or R] (EBCDIC)
6-10 Sales Location Numeric [9 digit number] (Packed)
11-13 Sales Identifier [3 character Alpha] (EBCDIC)
etc
So, I know I should read the entire file into a byte array and that's about the limit of what I know to do...
A) I saw another post on EBCDIC conversion using
System.Text.Encoding.GetEncoding(37)
but it is for an entire file. If I run the whole file through it I see intelligible text, but of course the other fields are junk. I don't know the language to decode a single field properly.
B) I have no idea what to do with PURE Binary format.
C) I don't know how to read Packed, particularly as a single field
I've tried a variety of decoding options for PURE BINARY, but the number I get for the first field is not consistent with the stated length of the rows in the docs.
Packed decimal format:
For s9(5)V9(4) comp-3, 123.45 is represented in byte format as
00 12 34 50 0c
Each digit is represented by 4 bits, there is a 4 bit sign (c) at the end and an assumed decimal after the 3.
Most languages provide a routine for converting byte/bytes into a string i.e. byte x'34' -->> String '34'. So you can:
Convert the bytes to a String representation
Add the decimal point in
Strip off the sign character from the end and add the appropriate sign to the front
There are other ways:
Create a translation array and do an array lookup. (See https://github.com/bmTas/JRecord/blob/master/Source/JRecord_Project/JRecord_Common/src/main/java/net/sf/JRecord/Types/smallBin/TypePackedDecimal9.java for an example)
Process it 4 bits at a time
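The three string-based steps above (hex string, decimal point, sign) can be sketched as follows. This is an illustrative Python version rather than VB.NET, and the function name `unpack_comp3` and its `scale` parameter (the number of digits after the assumed decimal point) are hypothetical.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field using its hex-string representation."""
    text = raw.hex()                          # step 1: bytes -> "001234500c"
    digits, sign = text[:-1], text[-1]
    if sign not in "cdf":
        raise ValueError(f"unexpected sign nibble {sign!r}")
    if scale:                                 # step 2: insert the assumed decimal point
        digits = digits[:-scale] + "." + digits[-scale:]
    value = Decimal(digits)
    return -value if sign == "d" else value   # step 3: apply the sign


# The example from the text: s9(5)V9(4) comp-3 holding 123.45
print(unpack_comp3(bytes.fromhex("001234500c"), scale=4))
```

Using `Decimal` rather than a float keeps the value exact, which usually matters for fields that hold money or identifiers.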
Other fields
The first field (binary) might be a big-endian binary integer or another packed decimal. There is probably a utility built into .NET to convert it.
Convert the character fields from ebcdic to ascii one field at a time
In VBA you do not need to read the whole file in; you can read it record by record. I presume you can do the same in VB.NET.
Useful Utilities
These tools might be useful for testing.
The RecordEditor should be able to display the file. The Layout Wizard should be able to determine the format of the file. Alternatively, use the Cobol copybook below.
The Java program CobolToCsv should be able to convert the file to Csv
       01  tape-record.
           05  record-count      pic s9(3) comp.
           05  filler            pic x(2).
           05  record-type       pic x.
           05  Sales-Location    pic s9(9) comp-3.
           05  Sales-Identifier  pic x(3).
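To show how the copybook maps onto bytes, here is a small Python sketch (not VB.NET) that decodes the first 13 bytes of one record field by field. It assumes the binary field is a big-endian signed halfword and that the text fields use code page 037; `parse_tape_record` is a made-up name.

```python
import struct


def parse_tape_record(rec: bytes) -> dict:
    """Decode bytes 1-13 of a tape record according to the copybook above."""
    (record_count,) = struct.unpack(">h", rec[0:2])    # pic s9(3) comp: 2 bytes, big-endian
    record_type = rec[4:5].decode("cp037")             # pic x, after the 2-byte filler
    packed = rec[5:10].hex()                           # pic s9(9) comp-3: 5 bytes
    sales_location = int(packed[:-1])                  # digit nibbles
    if packed[-1] == "d":                              # sign nibble D means negative
        sales_location = -sales_location
    sales_identifier = rec[10:13].decode("cp037")      # pic x(3)
    return {
        "record_count": record_count,
        "record_type": record_type,
        "sales_location": sales_location,
        "sales_identifier": sales_identifier,
    }
```

Each field is converted on its own slice, which is why running the whole record through a single EBCDIC decoder turns the binary and packed fields into junk.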

GNU Radio text file sink

I'm trying to teach myself basics of GNU Radio and DSP. I created a flowchart in GNU Radio Companion that takes a vector that is the binary representation of a single character (the character "1" as "00110001"), modulates, demodulates, and writes to a file sink.
The scope sink after demodulation looks like the values are returned (see below; appears to be correct pattern of 0s and 1s), but the file sink, although its size is 19 bytes, appears empty, or at least is not returning the correct values (I've looked at it in ASCII and Hex text editors). I assumed the single character transferred would result in 1 byte (or 8 bits) -- not 19 bytes. Changing some of the settings in the Polyphase Sync and adding a Repack Bits block after the binary slicer results in some characters in the output file, but never the right character.
My questions are:
Can GNU Radio take a single character, modulate/demodulate it, and return the same character?
Are there errors in my flowchart?
I'd appreciate any insights or suggestions, thank you.

Convert a file to Binary or Hexadecimal

So I have a file that I need to have in either binary or hex format. Everything that I've been able to find basically says to store the text in a string and convert it to binary or hex from there, but I can't do it this way. The file was written using its own private character set that uses null and system hex codes, so Notepad doesn't know what to do with these characters and replaces them with wrong characters and spaces. This distorts the information, so it won't be correct if I try to convert it to binary/hex.
I really just need to have the binary/hex information stored in a string or text box so I can work with it. I don't really need it to be saved as a file.
Never mind, I finally figured it out. I used a file stream to read the data byte by byte. I didn't understand how to convert this, as the first byte in the array was showing as 80 when I knew the binary data should have been "1010000" (I didn't realize at the time that 80 was the decimal representation of the same value).
Anyway, I used BitConverter.ToString, which put everything together and converted it to hexadecimal format. So I'm all good now.
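For comparison, the same idea in Python: read the file as raw bytes (never as text) and format each byte as hex or binary. This is only a sketch; the function names are hypothetical, and the dash-separated output merely imitates the style that BitConverter.ToString produces.

```python
from pathlib import Path


def read_bytes(path: str) -> bytes:
    """Read the file in binary mode so no character-set translation is applied."""
    return Path(path).read_bytes()


def to_hex(data: bytes) -> str:
    """Format bytes as dash-separated upper-case hex pairs."""
    return data.hex("-").upper()


def to_bits(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in data)


# Decimal 80, hex 50 and binary 01010000 are the same byte.
print(to_hex(bytes([80])), to_bits(bytes([80])))
```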

Fortran 90: How to correctly read an integer among other real

I have written Fortran 90 code to filter the text output of another program and convert it to CSV form. The file contains a table with columns of various types (character, real, integer). There is a column that generally contains decimal values (probability values). But in some rows, where the value should be the decimal "1.000", the value is actually the integer "1".
I use "F5.3" specifier to read this column and I have the same format statement for every row of the table. So, when the code finds "1", it reads ".001", because it does not find a decimal point.
What ways could I use to correctly (and generally) read integers among other decimals?
Could I specify "unformatted" input only for a number of "spaces"?
For input, the data edit descriptor Fw.d is normally used with d set to zero (d cannot be omitted). A nonzero d is used in the rare case where the floating-point data is stored as scaled integers, or where you do some unit conversion from the integer values.
You could try using list-directed input: use a * instead of a format specifier. This would apply to the entire read, not to selected items. Or you could read the lines into a string and test their contents to decide how to read them. If the substring has a decimal point: read (string(M:N), '(F5.3)') value. If it doesn't, use a different format, e.g., read it as F5.0.
P.S. "Unformatted" means reading binary data without conversion: it is a direct copy of the data from the file to the data item. "List-directed" is the Fortran term for reading and converting data without using a format specification.
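The effect of d on input can be made concrete with a small emulation. This Python sketch mimics only the basic rule (an explicit decimal point in the field wins; otherwise the last d digits are taken as the fraction) and ignores exponents and other details of real Fortran input; `read_f_field` is a hypothetical name.

```python
def read_f_field(field: str, d: int) -> float:
    """Simplified imitation of Fortran Fw.d input for one field."""
    text = field.replace(" ", "")      # assumption: blanks in the field are ignored
    if not text:
        return 0.0
    if "." in text:
        return float(text)             # an explicit decimal point overrides d
    return int(text) / 10 ** d         # no point: the last d digits are the fraction


print(read_f_field("1.000", 3))   # explicit point, d is ignored
print(read_f_field("    1", 3))   # scaled by 10**-3, the surprising case
print(read_f_field("    1", 0))   # d = 0 leaves integers unscaled
```

This is why a descriptor with d set to zero, such as F5.0, reads both "1.000" and "1" as one.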
Well, here's something new to me: F90 allows a mix of comma and space delimiters for a simple list-directed read:
read(unit,*)v1,v2,v3,v4
with input
1.222 2 , 3.14 , 4
yields
1.222000 2.000000 3.140000 4.000000

SAS : read in PDF file

I am looking for ways to read in a PDF file with SAS. Apparently this is not basic functionality, and there is very little to be found on the internet. (Not to mention that searching Google with "PDF" in the query mostly returns links to PDF documents about other things.)
The only things I can find are people looking for ways to import data from a PDF into datasets. For me, that is not even necessary. I would like to be able to read the contents of the PDF file into one big character variable. If possible, it would be even better to be able to read in the file's binary data.
Is this possible with SAS and how? (I got it to work in Access VBA, but can't find any similar ways in SAS.)
(In the end, the purpose is to convert this to base64 and put that base64-string into an XML document.)
You probably will not be able to read the entire file into one character variable, since the maximum length of a character variable is 32,767 bytes (about 32 KB). A simple way to read in one line at a time, though, is something like the following:
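%let pdfFileName = Test.pdf;
%let lineSize = 2000;
data base;
format text_line $&lineSize..;
/* TRUNCOVER stops short lines from running into the next record */
infile "&pdfFileName" lrecl=&lineSize truncover;
/* $CHARw. keeps blanks; plain list input ($) would stop at the first blank */
input text_line $char&lineSize..;
run;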
This requires that you have a general idea of the maximum record length ahead of time, but you could write additional code to determine the maximum record size before reading in the file. In this example each line of text is read into one character variable named "text_line". From there, you could use a RETAIN statement or a double trailing @ (@@) in the INPUT statement to process multiple lines at a time. The SAS website has plenty of documentation on how to read and process text from various types of input files.