How to identify binary and text files using Smalltalk - smalltalk

I want to verify that a given file in a path is of type text file, i.e. not binary, i.e. readable by a human. I guess reading first characters and check each character with :
isAlphaNumeric
isSpecial
isSeparator
isOctetCharacter ???
but joining all those testing methods with and: [ ... and: [ ... and: [ ] ] ] seems not to be very smalltalkish. Any suggestion for a more elegant way?
(There is a Python version here How to identify binary and text files using Python? which could be useful but syntax and implementation looks like C.)

only heuristics; you can never be really certain...
For ascii, the following may do:
|isPlausibleAscii numChecked|
isPlausibleAscii :=
[:char |
((char codePoint between:32 and:127)
or:[ char isSeparator ])
].
numChecked := text size min: 1024.
isPossiblyText := text from:1 to:numChecked conform: isPlausibleAscii.
For unicode (UTF8 ?) things become more difficult; you could then try to convert. If there is a conversion error, assume binary.
PS: if you don't have from:to:conform:, replace by (copyFrom:to:) conform:
PPS: if you don't have conform: , try allSatisfy:

All text contains more space than you'd expect to see in a binary file, and some encodings (UTF16/32) will contain lots of 0's for common languages.
A smalltalky solution would be to hide the gory details in method on Standard/MultiByte-FileStream, #isProbablyText would probably be a good choice.
It would essentially do the following:
- store current state if you intend to use it later, reset to start (Set Latin1 converter if you use a MultiByteStream)
Iterate over N next characters (where N is an appropriate number)
Encounter a non-printable ascii char? It's probably binary, so return false. (not a special selector, use a map, implement a new method on Character or something)
Increase 2 counters if appropriate, one for space characters, and another for zero characters.
If loop finishes, return whether either of the counters have been read a statistically significant amount
TLDR; Use a method to hide the gory details, otherwise it's pretty much the same.

Related

how to render a dicom file's header unreadable

Kind of a strange question, but I'm doing some testing to handle errors when a dicom file's tags can't be read.
Unfortunately I don't have a damaged dicom available.
Specifically, can anyone advise how to apply some sort of incorrectly encoded text tag or some invalid numeric data tag onto the file, such that it can't be read by python's pydicom package?
you could have a look at the dcmodify tool from the DCMTK. It can be used to insert, modify and delete attributes. I doubt that it is possible to specify invalid attribute values through the command line, but you could surely modify the source code to accomplish that (except you can definitely write attribute values that exceed the maximum length according to the Value Representation).
My approach would be to create a buffer of characters and write binary data to it. Then pass it to the method that writes the value to the attribute.
Examples:
write unicode (UTF-8) sequences which are not a valid unicode character
write ascii characters which are not covered by the characterset specified by (0008,0005) - not sure whether pydicom would run into problems but it would be wrong from the DICOM perspective
write non-numeric characters to attributes with Value Representation "Decimal String" or "Integer String".
formats other than YYYYMMDD for VR "Date"
formats other than HHMMSS.FFFFFF for VR "Time"
other characters than ['0'-'9'], '.' for VR "Unique Identifier"
[edit]: DCMTK, dcmodify: http://dicom.offis.de/dcmtk.php.en

Saving CSV file with degree symbol and ASCII encoded

I have string variable txt. It contains "°" degree symbol. I would like to save string into CSV file ASCII encoded. I use the procedure below But the "°" symbol is converted to "?". Do you have any idea how to save properly degree symbol?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
Try
Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
OutFile.Write(txt)
End Using
Me.Write_Log("Succesfully Exported")
Catch ex As Exception
Me.Write_Log("Write Error during export")
End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a windows application such as notepad.

Returning Unicode Name With Code Point

I know how to return a Unicode character from a code point. That's not what I'm after. What I want to know is how to return the name associated with a particular code point. For example, The code point for 🍀 is 1F340. And its name is FOUR LEAF CLOVER. Is it possible for us to return this name with its code point? I've read about 100 topics involving Unicode. But I haven't see one discussing my question. I hope that's possible.
Thank you for your help.
Have you considered the ICU library? It offers the following C API: http://icu-project.org/apiref/icu4c/uchar_8h.html#aa488f2a373998c7decb0ecd3e3552079
int32_t u_charName(
UChar32 code,
UCharNameChoice nameChoice,
char* buffer,
int32_t bufferLength,
UErrorCode* pErrorCode)
Retrieve the name of a Unicode character.
Depending on nameChoice, the character name written into the buffer is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters like A-Z, 0-9, space, and '-'. Unicode 1.0 names are only retrieved if they are different from the modern names and if the data file contains the data for them. gennames may or may not be called with a command line option to include 1.0 names in unames.dat.
Parameters
code The character (code point) for which to get the name. It must be 0<=code<=0x10ffff.
nameChoice Selector for which name to get.
buffer Destination address for copying the name. The name will always be zero-terminated. If there is no name, then the buffer will be set to the empty string.
bufferLength ==sizeof(buffer)
pErrorCode Pointer to a UErrorCode variable; check for U_SUCCESS() after u_charName() returns.
Returns
The length of the name, or 0 if there is no name for this character. If the bufferLength is less than or equal to the length, then the buffer contains the truncated name and the returned length indicates the full length of the name. The length does not include the zero-termination.
ICU is the right approach, but it's even simpler than Chris said. Foundation includes ICU already, for various text processing functions, including CFStringTransform(). Its transform parameter accepts "any valid ICU transform ID defined in the ICU User Guide for Transforms".
One of ICU's transforms is Any-Name:
Converts between characters and their Unicode names in curly braces. For example:
., ⇆ {FULL STOP}{COMMA}
(The syntax isn't exactly as documented, but it's close enough you can figure it out.)
There's also an Any-Hex transform which can be used for translating to/from the codepoint hex value.

How to split lines in Haskell?

I have made a program which takes a 1000 digit number as input.
It is fixed, so I put this input into the code file itself.
I would obviously be storing it as Integer type, but how do I do it?
I have tried the program by having 1000 digits in the same line. I know this is the worst possible code format! But it works.
How can assign the variable this number, and split its lines. I read somewhere something about eos? Ruby, end of what?
I was thinking that something similar to comments could be used here.
Help will be appreciated.
the basic idea is to make this work:
a=3847981438917489137897491412341234
983745893289572395725258923745897232
instead of something like this:
a=3847981438917489137897491412341234983745893289572395725258923745897232
Haskell doesn't have a way to split (non-String) literals across multiple lines. Since Strings are an exception, we can shoehorn in other literals by parsing a multiline String:
v = read
"32456\
\23857\
\23545" :: Integer
Alternately, you can use list syntax if you think it's prettier:
v = read . concat $
["32456"
,"24357"
,"23476"
] :: Integer
The price you pay for this is that some work will be done (once) at runtime, namely, the parsing (e.g. read).

Why is there a convention of 1-based line numbers but 0-based char numbers?

According to TkDocs:
The "1.0" here represents where to insert the text, and can be read as "line 1, character 0". This refers to the first character of the first line; for historical conventions related to how programmers normally refer to lines and characters, line numbers are 1-based, and character numbers are 0-based.
I hadn't heard of this convention before, and I can't find anything relevant on Google. Can anyone explain this to me please?
I think you're referring to Tk's text widget. The man page says:
Lines are numbered from 1 for consistency with other UNIX programs that use this numbering scheme.
Although, I'm not sure which Unix tools it's talking about.
Update:
As mentioned in the comments, it looks like a lot of unix text manipulation tool starts line numbering at 1. And tcl/tk having a unix origin, it makes sense to be as compatible as possible with the underlying OS environment.
It really is nothing more than convention, but here is a suggestion.
Character positions are generally thought of in the same way as a Java iterator, which is a "pointer" to a position between two characters. Thus the first character is the one after index position 0. Substrings are taken between two inter-character positions, for instance.
Line positions on the other hand are generally thought of more in the way of a .NET enumerator, which is a "pointer" to the item itself, not to a position in between. Thus the first line is the line at position 1.