Converting binary data/numbers into a memorable format - formatting

What are my options for converting arbitrary binary data into a memorable format?
Are there any widely used standards? What is the most efficient way to do this? I.e. to encode the most data in the easiest-to-remember way?

You can use Bitcoin's BIP-39 standard, whereby the data is split into 11-bit groups and each group is mapped to a word from a list of 2048 words (2^11 = 2048). The words are chosen so that similar words are avoided and each word can be unambiguously identified from its first 4 letters.
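As a rough illustration of the 11-bit grouping described above, here is a minimal Python sketch. It is not a BIP-39 implementation: it leaves out the checksum bits that BIP-39 appends, and `WORDS` is a hypothetical placeholder list (`w0000` to `w2047`) standing in for the real 2048-word list. The function names are made up for this example.

```python
def bytes_to_words(data: bytes, wordlist: list) -> list:
    """Map data to words, 11 bits per word, zero-padding the final group."""
    assert len(wordlist) == 2048, "11 bits per word needs exactly 2048 words"
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 11)  # pad to a multiple of 11 bits
    return [wordlist[int(bits[i:i + 11], 2)] for i in range(0, len(bits), 11)]


def words_to_bytes(words: list, wordlist: list, n_bytes: int) -> bytes:
    """Inverse of bytes_to_words; n_bytes says how much padding to drop."""
    index = {w: i for i, w in enumerate(wordlist)}
    bits = "".join(f"{index[w]:011b}" for w in words)
    return int(bits[:n_bytes * 8], 2).to_bytes(n_bytes, "big")


# Hypothetical placeholder list; a real tool would load the published word list.
WORDS = [f"w{i:04d}" for i in range(2048)]

if __name__ == "__main__":
    secret = bytes.fromhex("00ff10")
    phrase = bytes_to_words(secret, WORDS)
    print(phrase)
    print(words_to_bytes(phrase, WORDS, len(secret)).hex())
```

The decoder needs the original length because the last word may carry padding bits.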

Related

Distinct & compact word lists to represent e.g. fingerprints or other binary data for humans?

What word lists / "dictionaries" can you suggest for conveying binary data such as cryptographic fingerprints, hashes etc. ?
Criteria for such a word list are e.g.
Compact, i.e. a long list of rather short words, so you need fewer of those words to transmit data
Distinctive words / simple to distinguish (no homonyms, not even accidentally caused by slight mispronunciation)
Simple to spell
One of the most familiar ones is the PGP word list.
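To make the "compact" criterion concrete, the short Python sketch below counts how many words a value of a given bit length needs for a given list size. The helper name `words_needed` is made up for this example. The list sizes used are 256 (the PGP word list encodes one byte per word) and 2048 (an 11-bit list).

```python
def words_needed(n_bits: int, list_size: int) -> int:
    """Smallest k such that list_size**k can represent every n_bits-bit value."""
    k = 0
    while list_size ** k < 2 ** n_bits:
        k += 1
    return k


if __name__ == "__main__":
    for size in (256, 2048):
        print(size, "words in list ->", words_needed(160, size), "words per 160-bit fingerprint")
```

A longer list needs fewer words per fingerprint, but each word is then harder to keep short and distinctive, which is the trade-off behind the criteria above.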

Is there a database that accepts special characters by default (without converting them)?

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Does a database exist that, from the start, accepts the wide diversity of characters found in modern typefaces? If so, what are the drawbacks to a database that doesn't need to convert as much data in order to store that data?
// Anticipating two answers that I'm not looking for:
I found many answers to how someone could CONVERT (or encode) a special character, like é or a copyright symbol ©, into a database-legal character sequence like &copy; (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know that has to happen on some level, just as the letter z is reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.
Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
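In MySQL you can choose UTF-8 as the encoding for a table, so that it will be able to store Unicode text. Note that MySQL's utf8 character set stores at most 3 bytes per character; utf8mb4 is the one that covers all of Unicode.
For any database that you consider using, you should look for Unicode support to see if it can handle large character sets.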
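As a small illustration of storing such names without any entity conversion, here is a Python sketch. It uses SQLite purely because it ships with Python; it is not one of the databases named above, and the table and column names are made up for this example. The same round-trip check can be run against any candidate database through its own driver.

```python
import sqlite3

names = ["José", "François", "أسامة", "布鲁斯"]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)", [(n,) for n in names])

# Read the rows back in insertion order and compare with the input.
stored = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY rowid")]
print(stored == names)
```

Passing the values as query parameters, rather than building SQL strings by hand, leaves the encoding to the driver.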

what data-type to use to store MAC addresses in an SQL Server database table?

I want to store MAC addresses in one of my database tables, what data-type should I use? Reading articles on google, I have seen Binary(8) mentioned a few times. Is this the correct way?
Also, this does not make sense to me, as MAC addresses are six groups of two hexadecimal digits, wouldn't you use Binary(6)?
I wouldn't use Binary at all.
I would use CHAR(12).
Though this really depends on what you use the data for - if this is for display only, you can simply use the textual representation.
To make binary operations easier, you can store them in Binary(6).
You can use the following built in function to view the Hex readable value of the binary data:
select top 10 master.sys.fn_varbintohexstr(mac) from macaddresses
and to convert the hexadecimal text into binary:
select CONVERT(binary(6), 'AABBCCDDEEFF', 2);
A MAC address is a sequence of 6 bytes (48 bits), usually written as 6 two-digit hexadecimal numbers, so it can be stored efficiently as a number. A 48-bit value fits in a 64-bit BIGINT, so use BIGINT.
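The three representations suggested in these answers (12 hex characters, 6 raw bytes, one integer) convert into each other easily on the application side. The Python sketch below shows one way to do it; the function names are made up for this example.

```python
def mac_to_bytes(mac: str) -> bytes:
    """'AA:BB:CC:DD:EE:FF' (or '-' / '.' separated, or bare hex) -> 6 raw bytes."""
    hex_digits = mac.replace(":", "").replace("-", "").replace(".", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {len(hex_digits)}")
    return bytes.fromhex(hex_digits)


def mac_to_int(mac: str) -> int:
    """MAC address as a 48-bit integer (fits in a signed 64-bit column)."""
    return int.from_bytes(mac_to_bytes(mac), "big")


def int_to_mac(value: int) -> str:
    """48-bit integer back to the usual colon-separated text form."""
    return ":".join(f"{b:02X}" for b in value.to_bytes(6, "big"))


if __name__ == "__main__":
    print(mac_to_bytes("AA:BB:CC:DD:EE:FF").hex())
    print(mac_to_int("00:00:00:00:01:00"))
    print(int_to_mac(256))
```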

Selecting rows where a column contains at least one character not in a whitelist

I'm converting some sensitive data from a low-security encryption to a higher security encryption (specifically, from CFMX_COMPAT to AES with a 256-bit key). I intend to encode my AES-encrypted strings using Hex, and CFMX_COMPAT is extremely likely to use special characters, so finding records that aren't yet converted should be as simple as (pseudocode):
select from table where column has at least one character not in [A-Z0-9]
Is this possible in SQL? If so, how?
I found this documentation, but I had no idea it was possible in a simple LIKE clause. Awesome!
select top 10 foo
from bar
where foo like '%[^A-Z0-9]%'
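If the same check is also needed in application code, for example to verify rows after the conversion, a regular expression expresses the same "at least one character outside the whitelist" test. This is a Python sketch; `needs_conversion` is a made-up name. Python's match is case-sensitive, whereas how the LIKE range treats lowercase letters depends on the column's collation.

```python
import re

# Matches any single character that is NOT an uppercase letter or a digit.
NOT_WHITELISTED = re.compile(r"[^A-Z0-9]")


def needs_conversion(value: str) -> bool:
    """True if value contains at least one character outside [A-Z0-9]."""
    return NOT_WHITELISTED.search(value) is not None


if __name__ == "__main__":
    print(needs_conversion("3FA9C0"))  # hex-encoded, already converted
    print(needs_conversion("q%!x"))    # contains characters outside the whitelist
```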

Generating Record Layouts for EBCDIC Data Files.

We are attempting to write a tool in Perl which is expected to parse a fixed length EBCDIC data file and generate the record layout by looking at the hex value of each byte in the record.
It is assumed that each data file, which is written by a Cobol program whose source code we do not have, can have multiple record layouts. The aim of this tool is to perform data migration (EBCDIC to ASCII) by generating layout which would then be fed to a converter.
The problem is that there are hundreds of permutations and combinations that may arise with each byte. I thought that comparing the hex value of the corresponding byte in the record below the current one might give us some clue as to what this might be. But even in this case there is no concrete solution that one might arrive at. Decisions need to be taken at every juncture which might affect the end result.
Could someone please let me know of any such patterns that I can look for? For example, for all COMP-3s each nibble can represent a value from 0-9, and hence the hex value of the byte might be something like [0-9][0-9]. Essentially, for data migration one need not bother about COMPs and COMP-3s, as their value would not be affected in the migration. But identifying which fields are DISPLAY fields is also turning out to be a huge task. Can someone throw out some ideas or point me in some direction that I can explore further?
Any help would be highly appreciated. I am really stuck in a mire here.
Thanks,
Aditya.
There are many enterprise transformation tools that will do exactly what you need. Alternatively, it is easy to parse the ADATA records from the compiled copybooks to get the exact byte positions and representations of every field.
Can I hazard a guess? Do you have nobody skilled in Cobol? It isn't that hard to process Cobol copybooks, certainly not as hard as it is to use a write only language like Perl.
Do you have syncsort or DFsort available? It will do what you ask with a simple config file...
I guess you have to go with probabilities, and hope the data is varied enough to get a lot out of that.
Any field that only contains EBCDIC values of alpha-numeric plus punctuation is most likely a character (text) DISPLAY field.
Numeric DISPLAY fields will be the easiest, containing just EBCDIC 0-9 (hex F0-F9). Note that if the field is signed, the sign is normally carried in the last digit, which then shows up as a letter: for example, A (hex C1) is +1 and J (hex D1) is -1.
Pretty random distribution of values, leading with hex 0's, will likely be binary numeric "COMP" fields.
COMP-3 fields are one decimal digit in each hex digit of data. So if all the hex digits happen to be 0-9, that's a strong sign of a comp-3 field. Except the last hex digit of the field, which will contain a C for positive, D for negative, and F for unsigned.
Some programs use spaces on numeric fields, so if a field contains all sorts of binary, and also hex 40 (spaces), it's probably best to toss the hex 40 out of the mix. It might tell you a group of bytes is one field if they are all spaces together, or all data together.
As for multiple layouts, that's tough. A common convention for records that can have multiple layouts is to have a limited set of values for "what type of data is this" near the front of the record. Like significantID, recordType, data. So the significantID should increase steadily, while the recordType fields will vary between just a few values and re-cycle.
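The byte-level heuristics above can be sketched in a few lines. The question asks about Perl, but the sketch below is in Python. It guesses a type for one candidate field and unpacks COMP-3 values. All function names are made up for this example, `cp037` is assumed as the EBCDIC code page (the real files may use another one), signed DISPLAY fields are ignored, and the punctuation set is an arbitrary choice. Treat the result as a guess to be confirmed across many records, not as a definitive layout.

```python
def nibbles(field: bytes) -> list:
    """Split each byte into its high and low hex digit."""
    return [n for b in field for n in (b >> 4, b & 0x0F)]


def looks_zoned(field: bytes) -> bool:
    """Unsigned numeric DISPLAY: every byte is EBCDIC 0-9 (hex F0-F9)."""
    return bool(field) and all(0xF0 <= b <= 0xF9 for b in field)


def looks_packed(field: bytes) -> bool:
    """COMP-3: decimal digits in every nibble, sign nibble C/D/F at the end."""
    ns = nibbles(field)
    return bool(ns) and all(n <= 9 for n in ns[:-1]) and ns[-1] in (0xC, 0xD, 0xF)


def looks_text(field: bytes) -> bool:
    """Character data: letters, digits, spaces and a little punctuation."""
    text = field.decode("cp037")  # assumed code page
    return bool(text) and all(
        (ch.isascii() and ch.isalnum()) or ch in " .,-/&" for ch in text
    )


def classify(field: bytes) -> str:
    """Best guess for one candidate field; anything else is treated as binary."""
    if looks_zoned(field):
        return "zoned"
    if looks_packed(field):
        return "packed"
    if looks_text(field):
        return "text"
    return "binary"


def unpack_comp3(field: bytes) -> int:
    """Decode a packed-decimal field; a final nibble of D means negative."""
    ns = nibbles(field)
    value = int("".join(str(n) for n in ns[:-1]))
    return -value if ns[-1] == 0xD else value


if __name__ == "__main__":
    print(classify("HELLO WORLD".encode("cp037")))
    print(classify(b"\xF0\xF4\xF2"))
    print(classify(b"\x12\x3D"), unpack_comp3(b"\x12\x3D"))
```

A single record can easily be misclassified (a short text field may happen to look packed), so in practice the same byte range would be classified across many records and the majority result kept.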
The FileWizard in RecordEditor / JRecord can search for mainframe Cobol fields in a file. The FileWizard results can be stored in an XML file for use in other languages, or you can use the copy function to copy from EBCDIC to either ASCII fixed-width or CSV formats.
There is some out-of-date documentation on the File Wizard.