Reading double byte files - file-io

I was wondering if there was a simple way in Tcl to read a double byte file (or so I think it is called). My problem is that I get files that look fine when opened in notepad (I'm on Win7) but when I read them in Tcl, there are spaces (or rather, null characters) between each and every character.
My current workaround has been to first run a string map to remove all the nulls:
string map {\0 {}} $file
and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding or another way?
I'm not familiar with encodings so I'm not sure what arguments I should use.
fconfigure $input -encoding double
of course fails because double is not a valid encoding. Same with 'doublebyte'.
I'm actually working on big text files (above 2 GB) and doing my 'workaround' on a line by line basis, so I believe that this slows the process down.
EDIT: As pointed out by @mhawke, the file is UTF-16-LE encoded and this apparently is not a supported encoding. Is there an elegant way to circumvent this shortcoming, maybe through a proc? Or would this make things more complex than using string map?

The input files are probably UTF-16 encoded as is common in Windows.
Try:
% fconfigure $input -encoding unicode
You can get a list of encodings using:
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857

I decided to write a little proc to convert the file. I am using a while loop since reading a 3 GB file into a single variable locked the process completely... The comments make it seem pretty long, but it's not that long.
proc itrans {infile outfile} {
    set f [open $infile r]
    # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
    # used -nonewline in puts
    fconfigure $f -translation cr -eof ""
    # Simple switch just to remove the BOM, since the result will be UTF-8
    set bom 0
    set o [open $outfile w]
    while {[gets $f l] != -1} {
        # Convert to binary where the specific characters can be easily identified
        binary scan $l H* l
        # Ignore empty lines
        if {$l == "" || $l == "00"} {continue}
        # If it is the first line, there's the BOM
        if {!$bom} {
            set bom 1
            # Assumption: without a BOM, treat the file as little-endian and keep
            # the first byte of each pair, so that $re is always defined
            set re "(..).."
            # Identify and remove the BOM and set what byte should be removed and kept
            if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
                regsub -- "^$m" $l "" l
                if {[string toupper $m] eq "FFFE"} {
                    set re "(..).."
                } elseif {[string toupper $m] eq "FEFF"} {
                    set re "..(..)"
                }
            }
            regsub -all -- $re $l {\1} new
        } else {
            # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
            regsub -all -- {..(..)|00$} $l {\1} new
        }
        puts -nonewline $o [binary format H* $new]
    }
    close $o
    close $f
}
itrans infile.txt outfile.txt
Final warning: this silently mangles characters that actually use all 16 bits. For example, the code unit sequence 04 30 will lose the 04 and become 30 instead of becoming D0 B0, as it should per Table 3-4, whereas 00 4D will correctly be mapped to 4D. Make sure your file doesn't contain such characters, or that you don't mind losing them, before trying out the above.
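For comparison, a real transcoding step keeps such characters intact. Here is a minimal sketch in Python rather than Tcl. It assumes the input starts with a BOM (without one, Python's utf-16 codec falls back to the platform's native byte order). The function name and chunk size are illustrative choices of mine, not something from the posts above.

```python
import shutil


def transcode_utf16_to_utf8(src, dst, chunk_chars=1 << 20):
    """Copy a UTF-16 text file to UTF-8 in chunks, without loading it whole."""
    # newline="" on both sides disables newline translation, so CRLF stays CRLF.
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    with open(src, "r", encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        # copyfileobj reads and writes chunk_chars characters at a time.
        shutil.copyfileobj(fin, fout, chunk_chars)
```

With this approach the code unit sequence 04 30 (big-endian) comes out as D0 B0, and 00 4D comes out as 4D.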

Related

Writing lines to a binary file

I'm further playing with Raku's CommaIDE and I wanna print a binary file line by line.
I've tried this, but it doesn't work:
for "G.txt".IO.lines -> $line {
say $_;
}
How shall I fix it ? It's obviously incorrect.
EDIT
This doesn't work either, see the snippet below:
for "G.txt".IO.lines -> $line {
say $line;
}
You're showing us h.raku but Comma is giving you an error regarding c.raku, which is some other file in your Comma project.
It looks like you're working with a text file, not binary. Raku makes a clear distinction here: a text file is treated as text, regardless of encoding. If it's UTF-8, using .lines as you are now should work just fine because that's the default. If it's some other encoding, you can call .lines(:enc<some-other-encoding>). If it's truly binary, then the concept of "lines" really has no meaning, and you want something more like .slurp(:bin), which will give you a Buf[uint8] for working on the byte level.
The question specifically refers to reading a binary file, for which reading line-wise may (or may not) make sense--depending on the file.
Here's code to read a binary file straight from the docs (using class IO::CatHandle):
~$ raku -e '(my $f1 = "foo".IO).spurt: "A\nB\nC\n"; (my $f2 = "foo"); with IO::CatHandle.new: $f2 {.encoding: Nil; .slurp.say;};'
Buf[uint8]:0x<41 0A 42 0A 43 0A>
Compare to reading the file with default encoding (utf8):
~$ raku -e '(my $f1 = "foo".IO).spurt: "A\nB\nC\n"; (my $f2 = "foo"); with IO::CatHandle.new: $f2 {.slurp.say;};'
A
B
C
See:
https://docs.raku.org/routine/encoding
Note: the read method uses class IO::Handle which reads binary by default. So the code is simply:
~$ raku -e '(my $file1 = "foo".IO).spurt: "A\nB\nC\n"; my $file2 = "foo".IO; given $file2.open { .read.say; .close;};'
Buf[uint8]:0x<41 0A 42 0A 43 0A>
See:
https://docs.raku.org/type/IO::Handle#method_read
For further reading, see discussion of Perl5's <> diamond-operator-equivalent in Raku:
https://docs.raku.org/language/5to6-nutshell#while_until
...and some (older) mailing-list discussion of the same:
https://www.nntp.perl.org/group/perl.perl6.users/2018/11/msg6295.html
Finally, the docs refer to writing a mixed utf8/binary file here (useful for further testing):
https://docs.raku.org/routine/encoding#Examples

how to guess file encoding

I have a file (an author list from the Library of Congress) with lines like:
Arteaga, Ana Mar�ia
Corval�an-V�asquez, Oscar E.
(when printed to linux console)
I'd like to read those (either into a pandas dataframe or a set of lines)
df = pd.read_csv(fname, sep='\t', header='infer', lineterminator=None,encoding='latin1') #lineterminator \r\n hits error
or
with open(fname,'r',encoding='ISO-8859-1') as fp:
    lines=fp.readlines()
but neither is quite right, giving me output like
Arteaga, Ana Marâia
(again when printed to console)
when I am pretty sure the actual name here should be María.
Does someone recognize this format?
OK, this seems to be the 'marc-8' format.
yaz-iconv -f marc8 -t utf8 infile.txt > outfile.txt
took care of the conversion to utf8, with the sole hiccup being that yaz killed all the line terminators (both for the \r\n and \n versions of the file).
Those can be returned with something along the lines of
sed 's/\[/\n\[/g' outfile.txt > outfile_utf.txt
(for example in my case where each line starts with a '[' character)
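The 'â' in the Latin-1 output is consistent with MARC-8: in its ANSEL set a diacritic is stored as a combining byte before the letter it modifies, and the acute accent is byte 0xE2, which Latin-1 renders as 'â'. The sketch below illustrates that reordering in Python. It is not a full MARC-8 decoder: the table covers only a handful of diacritics, the function name is made up for this example, and real data should still go through a proper converter such as yaz-iconv.

```python
import unicodedata

# Partial table (illustration only): MARC-8/ANSEL combining bytes
# mapped to the corresponding Unicode combining marks.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis
}


def decode_marc8_subset(raw: bytes) -> str:
    """Decode ASCII plus a few ANSEL diacritics; other bytes become U+FFFD."""
    out = []
    pending = []  # combining marks waiting for their base letter
    for b in raw:
        if b in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[b])
            continue
        out.append(chr(b) if b < 0x80 else "\ufffd")
        # Unicode puts combining marks after the base character.
        out.extend(pending)
        pending.clear()
    # Compose base + mark into a single precomposed character where possible.
    return unicodedata.normalize("NFC", "".join(out))


print(decode_marc8_subset(b"Mar\xe2ia"))
```

The same bytes decoded with `b"Mar\xe2ia".decode("latin1")` give the 'Marâia' seen in the question.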

Copy filename to file with same number

So this is a bit vague to describe so I'll use a picture:
I have around 150 DWG files that have the same content as the SVG's (they're both vector drawing formats converted 1 to 1). I'd like to apply the same filename from the DWG's to the SVG's that start with the same number.
So I end up with:
001_TERMINAL.dwg
001_TERMINAL.svg
002_DIFFUSER.dwg
002_DIFFUSER.svg
etcetera...
I'm using a PC with Windows 10.
How can I implement a solution to my problem?
Thanks!
Assuming it's always 3 digits in the *.svg file names:
set DIR=C:\mydir
@rem Allow repeated setting of !variables! in the FOR loop below
setlocal enabledelayedexpansion
for %%I in (%DIR%\*.dwg) do (
    @rem "~n" to pick out just the filename part of the %%I variable
    set BASENAME=%%~nI
    @rem Substring - batch file style
    set PREFIX=!BASENAME:~0,3!
    echo !PREFIX! ... !BASENAME!
    @rem Give the full path of the source so this works from any current directory
    rename "%DIR%\!PREFIX!.svg" "!BASENAME!.svg"
)
Note this will need to be in a batch file for the %%I to work.
The main complication there is using variables in a multi-line FOR loop.
For these you have to use the delayed expansion option, to enable the variable to be expanded each time round, rather than when the line is parsed. This means you have to use !variable! instead of the more normal %variable% in a batch file.
Because you are on Windows, PowerShell is a great candidate to solve this.
For the script below, the length of the numeric part in front of the underscore character doesn't matter, as long as there is an underscore in the .dwg filename, as visible in your question.
Just replace 'c:\folder' below with the path where your files are stored.
$folderPath = "c:\folder"
$files = Get-ChildItem ([System.IO.Path]::Combine($folderPath, "?*_*.dwg"))
for ($i=0; $i -lt $files.Count; $i++)
{
    $file = $files[$i]
    $dwgFileName = $file.BaseName
    $index = $dwgFileName.IndexOf("_")
    $numberPart = $dwgFileName.Substring(0, $index)
    $svgFilePath = [System.IO.Path]::Combine($folderPath, "$numberPart.svg")
    if ([System.IO.File]::Exists($svgFilePath))
    {
        Rename-Item -Path $svgFilePath -NewName "$dwgFileName.svg"
    }
}
Using bash:
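#!/bin/bash
for f in *.dwg; do
    # Split the name on "_" so that arr[0] holds the numeric prefix
    IFS='_' read -r -a arr <<< "$f"
    # Quote the expansions so names containing spaces survive
    mv "${arr[0]}.svg" "${f%.*}.svg"
done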
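The same idea can be sketched in Python, which runs unchanged on Windows. This is only an illustration under the assumptions visible in the question: every .dwg name has an underscore after the number, and the matching .svg is named with the number alone. The function name is hypothetical.

```python
from pathlib import Path


def rename_svgs_to_match_dwgs(folder):
    """Rename NNN.svg to NNN_NAME.svg for every NNN_NAME.dwg in folder."""
    folder = Path(folder)
    renamed = []
    for dwg in sorted(folder.glob("*_*.dwg")):
        # Everything before the first underscore is the number part.
        prefix = dwg.stem.split("_", 1)[0]
        svg = folder / f"{prefix}.svg"
        target = folder / f"{dwg.stem}.svg"
        # Skip missing sources and never overwrite an existing file.
        if svg.exists() and not target.exists():
            svg.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling `rename_svgs_to_match_dwgs(r"c:\folder")` returns the list of new .svg names, which makes it easy to check what was changed.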

perl gunzip to buffer and gunzip to file have different byte orders

I'm using Perl v5.22.1, Storable 2.53_01, and IO::Uncompress::Gunzip 2.068.
I want to use Perl to gunzip a Storable file in memory, without using an intermediate file.
I have a variable $zip_file = '/some/storable.gz' that points to this zipped file.
If I gunzip directly to a file, this works fine, and %root is correctly set to the Storable hash.
gunzip($zip_file, '/home/myusername/Programming/unzipped');
my %root = %{retrieve('/home/myusername/Programming/unzipped')};
However if I gunzip into memory like this:
my $file;
gunzip($zip_file, \$file);
my %root = %{thaw($file)};
I get the error
Storable binary image v56.115 more recent than I am (v2.10)
so the Storable's magic number has been butchered: it should never be that high.
However, the strings in the unzipped buffer are still correct; the buffer starts with pst which is the correct Storable header. It only seems to be multi-byte variables like integers which are being broken.
Does this have something to do with byte ordering, such that writing to a file works one way while writing to a file buffer works in another? How can I gunzip to a buffer without it ruining my integers?
That's not related to unzip but to using retrieve vs. thaw. They expect different input, i.e. thaw expects the output from freeze while retrieve expects the output from store.
This can be verified with a simple test:
$ perl -MStorable -e 'my $x = {}; store($x,q[file.store])'
$ perl -MStorable=freeze -e 'my $x = {}; print freeze($x)' > file.freeze
On my machine this gives 24 bytes for the file created by store and 20 bytes for freeze. If I remove the leading 4 bytes from file.store the file is equivalent to file.freeze, i.e. store just added a 4 byte header. Thus you might try to uncompress the file in memory, remove the leading 4 bytes and run thaw on the rest.

How to read text files transferred as binary

My code copies files from ftp (using text transfer mode) to local disk and then tries to process them.
All files contain only text, and values are separated by newlines. Sometimes files are moved to this ftp using binary transfer mode, and it looks like this messes up the line endings.
Using a hex editor, I compared the line endings depending on the transfer mode used to send the files to the ftp:
using text mode: line endings are 0D 0A
using binary mode: line endings are 0D 0D 0A
Is it possible to modify my code so it could read files in both cases?
Code from a job that illustrates my problem and shows how I'm reading the file:
(here I use the same file, which contains 14 rows of data)
int i;
container con;
container files = ["c:\\temp\\axa_keio\\ascii.txt", "c:\\temp\\axa_keio\\binary.txt"];
boolean purchLineFirstRow;
IO inFile;
;
for(i=1; i<=conlen(files); i++)
{
    inFile = new AsciiIO(conpeek(files,i), "R");
    inFile.inFieldDelimiter('\n');
    con = inFile.read();
    info(int2str(conlen(con)));
}
Files come from a Unix system to a Windows system.
Not sure, but maybe the question could be: "Which inFieldDelimiter values should I use to read both Unix and Windows line ends?"
Use inRecordDelimiter:
inFile.inRecordDelimiter('\n');
instead of:
inFile.inFieldDelimiter('\n');
There may still be a dangling CR on the last field; you may wish to remove it:
strRem(conpeek(con, conlen(con)), '\r')
See also: http://en.wikipedia.org/wiki/Line_endings
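The approach above (split on LF, then strip any leftover CRs) handles both the 0D 0A and the 0D 0D 0A endings. A minimal sketch of that logic in Python rather than X++, assuming the content is plain ASCII as the file name in the question suggests; the function name is hypothetical:

```python
def split_records(data: bytes) -> list:
    """Split raw file bytes into records, tolerating LF, CR LF and CR CR LF."""
    # Split on LF only, then drop however many CRs are left at the end.
    records = [piece.rstrip(b"\r") for piece in data.split(b"\n")]
    # A file that ends with a line terminator yields one empty trailing piece.
    if records and records[-1] == b"":
        records.pop()
    # Assumption: the files are plain ASCII text.
    return [record.decode("ascii") for record in records]
```

Read the file in binary mode, for example `split_records(open(path, "rb").read())`, so that no newline translation happens before the split.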