In Pharo/Smalltalk: How to read a file with a specific encoding? - smalltalk

I am currently reading a file like this:
dir := FileSystem disk workingDirectory.
stream := (dir / 'test.txt' ) readStream.
line := stream nextLine.
This works when the file is UTF-8 encoded, but I could not find out what to do when the file has a different encoding.

For Pharo 7 there's this guide for file streams, which proposes:
('test.txt' asFileReference)
    readStreamEncoded: 'cp-1250'
    do: [ :stream | stream upToEnd ].

The classes ZnCharacterReadStream and ZnCharacterWriteStream provide functionality to work with encoded character streams other than UTF-8 (which is the default). First, the file stream needs to be converted into a binary stream; after that, it can be wrapped by a ZnCharacterReadStream or ZnCharacterWriteStream. Here is a full example for writing and reading a file:
dir := FileSystem disk workingDirectory.
(dir / 'test.txt') writeStreamDo: [ :out |
    encoded := ZnCharacterWriteStream on: (out binary) encoding: 'cp1252'.
    encoded nextPutAll: 'Über?'.
].
content := '?'.
(dir / 'test.txt') readStreamDo: [ :in |
    decoded := ZnCharacterReadStream on: (in binary) encoding: 'cp1252'.
    content := decoded nextLine.
].
content. " -> should evaluate to 'Über?'"
For more details, the book Enterprise Pharo a Web Perspective has a chapter about character encoding.

Related

How to write txt file in smalltalk

I tried this code:
f := 'testfile.txt' asFileReference.
f2 := f writeStream.
f2 nextPutAll: 'hello world'.
f2 close.
f content.
But I get this exception:
FileDoesNotExistException
'testfile.txt' asFileReference
    writeStreamDo: [ :stream | stream << 'Hello, World!' ].
This should work; it is just another way of expressing what you did before, so I suspect a write permission problem or something along those lines.
Just to add to Esteban's response: one surprising behaviour of Pharo is that writeStreamDo: does not truncate an existing file, so if the existing file is longer than the new data, you end up with the new data followed by the tail end of the old data. Fortunately, there is a simple solution: send truncate to the stream first. So a slightly "safer" version is:
'testfile.txt' asFileReference
    writeStreamDo: [ :stream | stream truncate. stream << 'Hello, World!' ].

perl gunzip to buffer and gunzip to file have different byte orders

I'm using Perl v5.22.1, Storable 2.53_01, and IO::Uncompress::Gunzip 2.068.
I want to use Perl to gunzip a Storable file in memory, without using an intermediate file.
I have a variable $zip_file = '/some/storable.gz' that points to this zipped file.
If I gunzip directly to a file, this works fine, and %root is correctly set to the Storable hash.
gunzip($zip_file, '/home/myusername/Programming/unzipped');
my %root = %{retrieve('/home/myusername/Programming/unzipped')};
However if I gunzip into memory like this:
my $file;
gunzip($zip_file, \$file);
my %root = %{thaw($file)};
I get the error
Storable binary image v56.115 more recent than I am (v2.10)
so the Storable's magic number has been butchered: it should never be that high.
However, the strings in the unzipped buffer are still correct; the buffer starts with pst which is the correct Storable header. It only seems to be multi-byte variables like integers which are being broken.
Does this have something to do with byte ordering, such that writing to a file works one way while writing to a file buffer works in another? How can I gunzip to a buffer without it ruining my integers?
That's not related to the unzipping but to using retrieve vs. thaw. They expect different input: thaw expects the output of freeze, while retrieve expects the output of store.
This can be verified with a simple test:
$ perl -MStorable -e 'my $x = {}; store($x,q[file.store])'
$ perl -MStorable=freeze -e 'my $x = {}; print freeze($x)' > file.freeze
On my machine this gives 24 bytes for the file created by store and 20 bytes for the one created by freeze. If I remove the leading 4 bytes from file.store, the file is equivalent to file.freeze, i.e. store just adds a 4-byte header. Thus you might try to uncompress the file in memory, remove the leading 4 bytes, and run thaw on the rest.

Reading double byte files

I was wondering if there was a simple way in Tcl to read a double byte file (or so I think it is called). My problem is that I get files that look fine when opened in notepad (I'm on Win7) but when I read them in Tcl, there are spaces (or rather, null characters) between each and every character.
My current workaround has been to first run a string map to remove all the nulls:
string map {\0 {}} $file
and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding or another way?
I'm not familiar with encodings so I'm not sure what arguments I should use.
fconfigure $input -encoding double
of course fails because double is not a valid encoding. Same with 'doublebyte'.
I'm actually working on big text files (above 2 GB) and doing my 'workaround' on a line by line basis, so I believe that this slows the process down.
EDIT: As pointed out by @mhawke, the file is UTF-16-LE encoded and this apparently is not a supported encoding. Is there an elegant way to circumvent this shortcoming, maybe through a proc? Or would this make things more complex than using string map?
The input files are probably UTF-16 encoded as is common in Windows.
Try:
% fconfigure $input -encoding unicode
You can get a list of encodings using:
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857
I decided to write a little proc to convert the file. I am using a while loop since reading a 3 GB file into a single variable locked the process completely... The comments make it seem pretty long, but it's not that long.
proc itrans {infile outfile} {
    set f [open $infile r]
    # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
    # used -nonewline in puts
    fconfigure $f -translation cr -eof ""
    # Simple switch just to remove the BOM, since the result will be UTF-8
    set bom 0
    set o [open $outfile w]
    while {[gets $f l] != -1} {
        # Convert to binary where the specific characters can be easily identified
        binary scan $l H* l
        # Ignore empty lines
        if {$l == "" || $l == "00"} {continue}
        # If it is the first line, there's the BOM
        if {!$bom} {
            set bom 1
            # Identify and remove the BOM and set what byte should be removed and kept
            if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
                regsub -- "^$m" $l "" l
                if {[string toupper $m] eq "FFFE"} {
                    set re "(..).."
                } elseif {[string toupper $m] eq "FEFF"} {
                    set re "..(..)"
                }
            }
            regsub -all -- $re $l {\1} new
        } else {
            # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
            regsub -all -- {..(..)|00$} $l {\1} new
        }
        puts -nonewline $o [binary format H* $new]
    }
    close $o
    close $f
}
itrans infile.txt outfile.txt
Final warning: this will silently mangle characters that actually use all 16 bits (e.g. the code unit sequence 04 30 will lose the 04 and become 30 instead of the D0 B0 it should be per Table 3-4, while 00 4D will correctly be mapped to 4D), so be sure that you don't mind that, or that your file doesn't contain such characters, before trying out the above.

How to read different file format data and use it for compression

fob = open('this.txt', 'rb')
fob1 = open('that.txt', 'wb')
content = ''
for i in fob:
    content += i
fob1.write(content)
fob.close()
fob1.close()
This code reads a txt file and stores its contents in another txt file. How do I read any other kind of file? It might even be a JPEG file, a PDF file, or some other format. Please help me.
Thanks in advance.
Your code reads a *.txt file line by line (and copies it).
If you want to read a different type of file byte by byte and print its bits, you can do this:
f = open('test.gnu', 'rb')
flag = 1
while flag:
    byte = f.read(1)
    flag = (byte != "")
    if flag:
        # do something with the byte, eg:
        # print its bits:
        print '{0:08b}'.format(ord(byte))
f.close()
Or, if you want to zip and unzip files, you can use the zipfile package:
http://docs.python.org/2/library/zipfile; for example code covering various compression formats, see:
http://pymotw.com/2/compression.html
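For instance, here is a minimal sketch of archiving and extracting a file with zipfile; the archive name, the member file name, and the output directory are placeholders, not something from the question:
import zipfile

# Create an archive and add one file to it (the names are placeholders).
zf = zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED)
zf.write('this.txt')
zf.close()

# Extract everything from the archive into a directory.
zf = zipfile.ZipFile('archive.zip', 'r')
zf.extractall('unzipped')
zf.close()
ZIP_DEFLATED requests compressed storage; without it, members are stored uncompressed (ZIP_STORED is the default).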

Windows scripting to parse a HL7 file

I have a HUGE file with a lot of HL7 segments. It must be split into 1000 (or so) smaller files.
Since it is HL7 data, there is a pattern (logic) to go by: each data chunk starts with "MSH|" and ends where the next segment starts with "MSH|".
The script must be Windows (cmd) based or VBS, as I cannot install any software on that machine.
File structure:
MSH|abc|123|....
s2|sdsd|2323|
...
..
MSH|ns|43|...
...
..
..
MSH|sdfns|4343|...
...
..
asds|sds
MSH|sfns|3|...
...
..
as|ss
The file in the above example must be split into 2 or 3 files. Also, the files come from UNIX, so newlines must remain as they are in the source file.
Any help?
This is a sample script that I used to parse large HL7 files into separate files, with the new file names based on the data file. It uses REBOL, which does not require installation, i.e. the core version does not make any registry entries.
I have a more generalised version that scans an incoming directory, splits the files it finds into single files, and then waits for the next file to arrive.
Rebol [
    file: %split-hl7.r
    author: "Graham Chiu"
    date: 17-Feb-2010
    purpose: {split HL7 messages into single messages}
]
fn: %05112010_0730.dat
outdir: %05112010_0730/
if not exists? outdir [
    make-dir outdir
]
data: read fn
cnt: 0
filename: join copy/part form fn -4 + length? form fn "-"
separator: rejoin [ newline "MSH" ]
parse/all data [
    some [
        [ copy result to separator | copy result to end ]
        (
            write to-file rejoin [ outdir filename cnt ".txt" ] result
            print "Got result"
            ?? result
            cnt: cnt + 1
        )
        1 skip
    ]
]
HL7 has a lot of segment types - I assume you know that every message in your file starts with an MSH segment. So, have you tried parsing the file for the string "(newline)MSH|"? Just keep a running buffer and dump it into a new output file when it gets too big.
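The asker needed plain cmd or VBS, but just to illustrate that running-buffer idea, here is a minimal sketch in Python; the input file name, the output name pattern, and the assumption of 1000 messages per output file are all placeholders:
msgs_per_file = 1000            # assumed batch size
src_name = 'huge.hl7'           # placeholder input file name
out_pattern = 'chunk_%04d.hl7'  # placeholder output file names

def flush(buf, idx):
    # Binary mode keeps the UNIX newlines exactly as they are in the source.
    out = open(out_pattern % idx, 'wb')
    out.writelines(buf)
    out.close()

buf = []
msg_count = 0
file_idx = 0
src = open(src_name, 'rb')
for line in src:
    if line.startswith(b'MSH|'):
        # Start a new output file every msgs_per_file messages.
        if msg_count and msg_count % msgs_per_file == 0:
            flush(buf, file_idx)
            file_idx += 1
            buf = []
        msg_count += 1
    buf.append(line)
src.close()
if buf:
    flush(buf, file_idx)
Reading and writing in binary mode leaves the UNIX line endings untouched, which was one of the requirements.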