Detect if Base 64 string is image or text - objective-c

Is there a way to detect if the Base 64 string contained in an NSData instance is an image or a text or any other object?

You can't generally just look at the base 64 string and decide, but you can decode the first few bytes of data, look at the hex codes (you can do this by decoding your base-64 string into a NSData and just NSLog it or examining it in the debugger), and draw some conclusions. For example:
Image files generally start with special byte sequences (e.g. JPEG start with the hex bytes FF D8; PNG generally start with hex bytes 89 50 4E 47 0D 0A 1A 0A (e.g. 89 "PNG" CR LF EOF LF, etc.). Note, there are a dizzying number of different image formats, so this is a non-trivial exercise, but sometimes you can get lucky and it will be self-evident that it's one of these common format when you glance at the first few bytes.
NSKeyedArchiver archives generally start with the string "bplist".
ASCII text consists of codes between 20 and 7F (with linefeeds represented by 0A; carriage return and linefeeds represented by OD 0A; tab characters as 09; etc.). Then, again, if it was a text, it's unlikely they'd be base-64 encoding it.
If it was UTF-8 it would conform to the coding pattern outlined here. For example, you can look at the first few high bits of the first byte that might conceivably represent a UTF-8 character, and conclude (a) how many bytes the character is represented by and (b) what high bits will be turned on those subsequent bytes. You can often quickly look at it and confirm whether the data conforms to this UTF-8 pattern or not (especially easy to do for most western languages)
If the first three characters were EF BB BF, that often indicates a UTF-8 byte order mark.
This is, by no means, an exhaustive list of codes, but just a few that leapt out at me.
To do this programmatically and do so exhaustively would be a non-trivial exercise. But if you're just "eye-balling" a base-64 string and trying to draw some logical inferences, decode it and look at the hex bytes and you can quickly narrow down the possibilities, at the very least. If you're unsure about how to interpret it, update your question with the hex representation of the decoded base-64 string (just the first 16-32 bytes, please), and we might be able to point you in the right direction.

It is impossible to clearly distinguish text string and Base64 image encoding string. The only way - check if your string is valid Base 64 encoding string. If it is - probably it is an image. If not - you can be sure it is a text.
How to check if string is valid Base 64 you can ere How to check whether the string is base64 encoded or not.

Related

VB.net Read Cobol File Fields (Pure Binary, EBCDIC, Packed)

I need to read a Cobol file into VB.net. Here is the description of the data types from the documentation:
All Magnetic tape files are recorded in 9-track, 8OOBPI mode with odd parity. They are created IBM equipment disk operating system. IBM System 360 Standard.
Binary - Data is coded in pure binary code.
BCD - Data is coded in binary coded decimal format. (Primarily
for files created by the IBM 1401 System).
EBCDIC - Data is coded in extended binary coded decimal interchange code. :(An IBM developed code.)
Packed - Data is coded in packed decimal format.
File Format:
1-2 Record Count [Numeric] (Binary)
3-4 Filler (Binary)
5-5 Record Type [B or R] (EBCDIC)
6-10 Sales Location Numeric [9 digit number] (Packed)
11-13 Sales Identifier (3 character Alpha) (EBCDIC]
etc
So, I know I should read the entire file into a byte array and that's about the limit of what I know to do...
A) I saw another post on EBCDIC conversation using
System.Text.Encoding.GetEncoding(37)
but it is for an entire file. If I run the whole file through it I see intelligible text, but of course the other fields are junk. I don't know the language to decode a single field properly.
B) I have no idea what to do with PURE Binary format.
C) I don't know how to read Packed, particularly as a single field
I've tried a variety of decoding options for PURE BINARY, but the number I get for the first field is not consistent with the stated length of the rows in the docs.
Packed decimal format:
For s9(5)V9(4) comp-3, 123.45 is represented in byte format as
00 12 34 50 0c
Each digit is represented by 4 bits, there is a 4 bit sign (c) at the end and an assumed decimal after the 3.
Most languages provide a routine for converting byte/bytes into a string i.e. byte x'34' -->> String '34'. So you can:
Convert the bytes to a String representation
Add the decimal point in
Strip off the sign character from the end and add the appropriate sign to the front
There are other ways:
Create an translation array and do an array lookup. (See https://github.com/bmTas/JRecord/blob/master/Source/JRecord_Project/JRecord_Common/src/main/java/net/sf/JRecord/Types/smallBin/TypePackedDecimal9.java for an example)
Process it 4 bits at a time
Other fields
The first field (binary) might be a big endian binary integer or another packed-decimal. There is probably a utility built in the .net to do this.
Convert the character fields from ebcdic to ascii one field at a time
In VBA you did not need to read the whole file in, you could read it record by record. I would presume you can do the same in vb.net
Useful Utilities
These tools might be useful for testing.
The RecordEditor should be able to display the file. The Layout Wizard should be able determine the format of the file. Alternatively use the Cobol copybook below
The Java program CobolToCsv should be able to convert the file to Csv
01 tape-record.
05 record-count pic s9(3) comp.
05 filler pic x(2).
05 record-type pic x.
05 Sales-Location pic s9(9) comp-3.
05 Sales-Identifier pic x(3).

Can ASCII arrays be manipulated as arrays without converting to String form?

This is a basic question, but I can't find anything on it, since I don't know what to searchย โ€” each of my tries have come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.
What if I used something other than ASCII (like UTF-8, for example)?
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7bit encoding, it does not support multi-byte characters. Any Unicode codepoint above U-007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concat arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
.NET strings are counted sequences of UTF-16 code units (char), one or two of which encode a Unicode codepoint (int Char.ConvertToUtf32 ). Some codepoints are "combining characters", which when applied to a preceding "base character" form a grapheme (which is then rendered by a font into a glyph).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see TextElementEnumerator and StringInfo class.

Hexadecimal numbers vs. hexadecimal enocding (with base64 as well)

Encoding with hexadecimal numbers seems to be different from using hexadecimals to represent numbers. For example, then hex number 0x40 to me should be equal to 64, or BA_{64}, but when I put it through this hex to base64 converter, I get the output: QA== which to me is equal to some number times 64. Why is this?
Also when I check the integer value of the hex string deadbeef I get 3735928559, but when I check it other places I get: 222 173 190 239. Why is this?
Addendum: So I guess it is because it is easier to break the number into bit chunks than treat it as a whole number when encoding? That is pretty confusing to me but I guess I get it.
You may wish to read this:
http://en.wikipedia.org/wiki/Base64
In summary, base64 specifies a specific encoding, which involves using different values for letters than their ASCII encoding.
For the second part, one source is treating the entire string as a 32 bit integer, and the other is dividing it into bytes and giving the value of each byte.

Xcode UTF-8 literals

Suppose I have the MUSICAL SYMBOL G CLEF symbol: ** ๐„ž ** that I wish to have in a string literal in my Objective-C source file.
The OS X Character Viewer says that the CLEF is UTF8 F0 9D 84 9E and Unicode 1D11E(D834+DD1E) in their terms.
After some futzing around, and using the ICU UNICODE Demonstration Page, I did get the following code to work:
NSString *uni=#"\U0001d11e";
NSString *uni2=[[NSString alloc] initWithUTF8String:"\xF0\x9D\x84\x9E"];
NSString *uni3=#"๐„ž";
NSLog(#"unicode: %# and %# and %#",uni, uni2, uni3);
My questions:
Is it possible to streamline the way I am doing UTF-8 literals? That seems kludgy to me.
Is the #"\U0001d11e part UTF-32?
Why does cutting and pasting the CLEF from Character Viewer actually work? I thought Objective-C files had to be UTF-8?
I would prefer the way you did it in uni3, but sadly that is not recommended. Failing that, I would prefer the method in uni to that in uni2. Another option would be [NSString stringWithFormat:#"%C", 0x1d11e].
It is a "universal character name", introduced in C99 (section 6.4.3) and imported into Objective-C as of OS X 10.5. Technically this doesn't have to give you UTF-8 (it's up to the compiler), but in practice UTF-8 is probably what you'll get.
The encoding of the source code file is probably UTF-8, matching what the runtime expects, so everything happens to work. It's also possible the source file is UTF-16 or UTF-32 and the compiler is doing the Right Thing when compiling it. None the less, Apple does not recommend this.
Answers to your questions (same order):
Why choose? Xcode uses C99 in default setup. Refer to the C0X draft specification 6.4.3 on Universal Character Names. See below.
More technically, the #"\U0001d11e is the 32 bit Unicode code point for that character in the ISO 10646 character set.
I would not count on this behavior working. You should absolutely, positively, without question have all the characters in your source file be 7 bit ASCII. For string literals, use an encoding or, preferably, a suitable external resource able to handle binary data.
Universal Character Names (from the WG14/N1256 C0X Draft which CLANG follows fairly well):
Universal Character Names may be used
in identifiers, character constants,
and string literals to designate
characters that are not in the basic
character set.
The universal
character name \Unnnnnnnn designates
the character whose eight-digit short
identifier (as specified by ISO/IEC
10646) is nnnnnnnn) Similarly, the
universal character name \unnnn
designates the character whose
four-digit short identifier is nnnn
(and whose eight-digit short
identifier is 0000nnnn).
Therefor, you can produce your character or string in a natural, mixed way:
char *utf8CStr =
"May all your CLEF's \xF0\x9D\x84\x9E be left like this: \U0001d11e";
NSString *uni4=[[NSString alloc] initWithUTF8String:utf8CStr];
The \Unnnnnnnn form allows you to select any Unicode code point, and this is the same value as "Unicode" field at the bottom left of the Character Viewer. The direct entry of \Unnnnnnnn in the C99 source file is handled appropriately by the compiler. Note that there are only two options: \unnnn which is a 256 character offset to the default code page or \Unnnnnnnn which is the full 32 bit character of any Unicode code point. You need to pad the left with 0's if you are not using all 4 or all 8 digits or \u or \U.
The form of \xF0\x9D\x84\x9E in the same string literal is more interesting. This is inserting the raw UTF-8 encoding of the same character. Once passed to the initWithUTF8String method, but the literal and the encoded literal end up as encoded UTF-8.
It may, arguably, be a violation of 130 of section 5.1.1.2 to use raw bytes in this way. Given that a raw UTF-8 string would be encoded similarly, I think you are OK.
You can write the clef character in your string literal, too:
NSString *uni2=[[NSString alloc] initWithUTF8String:"๐„ž"];
The \U0001d11e matches the unicode code point for the G clef character. The UTF-32 form of a character is the same as its codepoint, so you can think of it as UTF-32 if you want to. Here's a link to the unicode tables for musical symbols.
Your file probably is UTF-8. The G clef is a valid UTF8 character - check out the output from hexdump for your file:
00 4e 53 53 74 72 69 6e 67 20 2a 75 6e 69 33 3d 40 |NSString *uni3=#|
10 22 f0 9d 84 9e 22 3b 0a 20 20 4e 53 4c 6f 67 28 |"....";. NSLog(|
As you can see, the correct UTF-8 representation of that character is in the file right where you'd expect it. It's probably safer to use one of your other methods and try to keep the source file in the ASCII range.
I created some utility classes to convert easily between unicode code points, UTF-8 byte sequences and NSString. You can find the code on Github, maybe it is of some use to someone.

Do certain characters take more bytes than others?

I'm not very experienced with lower level things such as howmany bytes a character is. I tried finding out if one character equals one byte, but without success.
I need to set a delimiter used for socket connections between a server and clients. This delimiter has to be as small (in bytes) as possible, to minimize bandwidth.
The current delimiter is "#". Would getting an other delimiter decrease my bandwidth?
It depends on what character encoding you use to translate between characters and bytes (which are not at all the same thing):
In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated
US-ASCII characters (of whcich # is one) will take only 1 byte in UTF-8, which is the most popular encoding that allows multibyte characters.
It depends on the encoding. In Single-byte character sets such as ANSI and the various ISO8859 character sets it is one byte per character. Some encodings such as UTF8 are variable width where the number of bytes to encode a character depends on the glyph being encoded.
The answer of course is that it depends. If you are in a pure ASCII env, then yes, every char takes 1 byte, but if you are in a Unicode env (all of Windows for example), then chars can range from 1 to 4 bytes in size.
If you choose a char from the ASCII set, then yes your delimter is a small as possible.
No, all characters are 1 byte, unless you're using Unicode or wide characters (for accents and other symbols for example).
A character is 1 byte, or 8 bits, long which gives 256 possible combination to form characters with. 1 byte characters are called ASCII characters. They only use 7 bits (even though 8 are available, but you can't use this 8th bit) to form the standard alphabet and various symbols used when teletypes and typewriters were still common.
You can find an ASCII chart and what numbers correspond to what characters here.