Mecab outputs strange characters when automatically splitting large files - mecab

How can I prevent MeCab from outputting strange characters before and after EOS when a file exceeds the input buffer size, without increasing the input buffer size?
When MeCab is run on files that exceed the input buffer size, it automatically splits the output. This is usually fine, except that the unrecognizable characters below appear before and after EOS.
� � � � 補助記号-一般
��\uFFFD (character code)
Is there a setting that prevents MeCab from outputting these strange characters? I need the file splitting to ensure morphemes are grouped properly. Going through the entire file and removing them manually isn't a good option, especially when I have tens of thousands of lines in the MeCab output (due to lots of files).
I installed MeCab via Homebrew with UniDic.

Related

How to read a binary file with TCL

So I have a function I'm using to read data from a file. It works fine if the file is plain text, but when I try to read a binary file, like a png, it returns a different text (diff confirms that). I opened a hex editor to see what was wrong and found out it is putting some c2 bytes along with the file (I don't know if the position is random or if there are other bytes except this c2 one).
This is my function. I just want it to read and save to a variable.
proc read_file {path} {
    set channel [open $path r]
    fconfigure $channel -translation binary
    set return_string "[read $channel]"
    close $channel
    return "$return_string"
}
To actually print, I'm doing this:
puts -nonewline [read_file file.png]
When you open a file, it defaults to being in text mode. In text mode (which is really a combination of options) the IO layer translates characters from whatever encoding they are in into Tcl's internal encoding, and does the reverse operation on output. The default encoding scheme is platform specific, but in your case it sounds like it is UTF-8. (Tcl uses a complex internal system of encodings; it doesn't expose those to the outside world.)
By contrast, when you put the channel into binary mode, the bytes on the outside are directly mapped to characters in the range 0-255 (and vice versa on output). You get a perfect copy, provided you put both input and output channels in binary mode. (There are other optimisations for binary mode, but they don't matter here.)
When you only put one of the channels in binary mode, you get what looks like corruption. It isn't random though. In particular, when the input is binary but the output is UTF-8, input bytes in the range 128-255 get converted into multiple output bytes, where the first of those bytes is in the sort of range you observed. There are other combinations that mess things up; the whole range of problems is collectively known as mojibake.
tl;dr Don't mix up binary and text data unless you're very careful. The results of getting it wrong are "surprising".
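The effect described above can be reproduced outside Tcl. The following is a minimal Python sketch, not the Tcl implementation: it assumes that binary-mode input behaves like the Latin-1 codec (one byte per character in the range 0-255) and that the output channel encodes as UTF-8. The sample bytes are the first four bytes of a PNG file.

```python
# Binary-mode read: each input byte becomes one character in the range 0-255.
# Decoding as Latin-1 models exactly that mapping.
raw = bytes([0x89, 0x50, 0x4E, 0x47])  # b"\x89PNG"
as_chars = raw.decode("latin-1")

# Output through a UTF-8 text channel: characters >= 128 become two bytes,
# and the first of them is 0xC2 or 0xC3.
mangled = as_chars.encode("utf-8")

# Output through a binary channel: the bytes round-trip unchanged.
intact = as_chars.encode("latin-1")

print(raw.hex(" "))
print(mangled.hex(" "))
print(intact == raw)
```

The extra lead byte in the second line of output corresponds to the stray c2 bytes seen in the hex editor; putting the output channel into binary mode as well avoids it.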

Custom Speech: "normalized text is empty"

I uploaded a wav and text file to the custom speech portal. I got the following error: “Error: normalized text is empty.”
The text file is UTF-8 BOM, and is similar in format to a file that did work.
How can I troubleshoot this?
There can be several reasons for a normalized text to be empty, e.g. if there are words of Latin and non-Latin characters in a sentence (depending on the locale). Also, words that are repeated multiple times in a row may cause this. Can you share which locale you're using to import the data? If you could share the text we can find the reason. Otherwise you could try to reduce the input text (no need to cut the audio for this) to find out what causes the normalization to discard the sentence.
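To narrow down which lines might be discarded, the two causes mentioned above can be checked locally before uploading. This is only a rough sketch: the helper names and the threshold of three repetitions are assumptions made for illustration, and the service's actual normalization rules are not known here.

```python
import re
import unicodedata


def has_mixed_scripts(line):
    """True if the line contains both Latin and non-Latin letters."""
    latin = non_latin = False
    for ch in line:
        if ch.isalpha():
            if unicodedata.name(ch, "").startswith("LATIN"):
                latin = True
            else:
                non_latin = True
    return latin and non_latin


def has_repeated_word(line, times=3):
    """True if the same word occurs `times` or more times in a row."""
    # `times` is an arbitrary threshold chosen for this sketch.
    pattern = r"\b(\w+)(?:\s+\1\b){%d,}" % (times - 1)
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


def suspicious_lines(lines):
    """Return the 1-based numbers of lines that trip either check."""
    return [
        number
        for number, line in enumerate(lines, 1)
        if has_mixed_scripts(line) or has_repeated_word(line)
    ]


print(suspicious_lines(["plain sentence", "hello 世界", "go go go"]))
```

Lines flagged this way are candidates to remove first when reducing the input text.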

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
This is done using the following command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but running the resulting file gives an incorrect syntax error.
After two hours of investigation, it turns out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info:
The batch file is encoded as UTF-8.
The input files are encoded as UTF-8-BOM.
The output file is encoded as UTF-8-BOM.
I'm not sure whether it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to get rid of this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Doing so also changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is a good thing.
The encoding can be changed using Notepad++.
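If re-encoding every input file by hand is not practical, the merge step itself can drop the byte order marks. Concatenating the files byte for byte leaves the BOM of every file after the first in the middle of the result, which is where the stray U+FEFF comes from. The following is a minimal Python sketch of an alternative merge step; `merge_sql` is a hypothetical helper and not part of any existing build tooling.

```python
from pathlib import Path


def merge_sql(input_dir, output_file):
    """Concatenate all *.sql files in input_dir, dropping any leading BOM."""
    parts = []
    for path in sorted(Path(input_dir).glob("*.sql")):
        # The "utf-8-sig" codec decodes UTF-8 and removes a leading U+FEFF.
        text = path.read_text(encoding="utf-8-sig")
        # Make sure consecutive scripts do not end up on the same line.
        parts.append(text if text.endswith("\n") else text + "\n")
    # Plain "utf-8" writes the result without a BOM.
    Path(output_file).write_text("".join(parts), encoding="utf-8")
```

The output file should live outside `input_dir` (or use a different extension) so that it is not picked up as an input on the next run.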

Losslessly Compress PDF Generated from PostScript

I am generating multiple EPS files, which contain several PostScript drawing commands that are not necessarily encoded efficiently. The first update in the answer to this question describes similar inefficiencies.
Each of my EPS files is around 18 MB, and the resulting PDF files are around 3 MB. I am generating the PDF files using epstopdf, which enables some sort of compression by default.
Are there any suggestions for how to further reduce the resulting PDF file sizes without changing the quality (e.g. without rasterizing the vector graphics)?
I tried reducing the precision of the coordinates from 8 digits past the decimal to 3. This reduced the EPS file sizes to about 14 MB, but, counter-intuitively, the PDF file sizes slightly increased.
Update 1: The EPS files contain several occurrences of the sample code below for different coordinates and colors.
newpath
1 setlinejoin
1 setlinecap
<<
/BBox [322 384.0417 615.0087 651.9958]
/Domain [322 384.0417 615.0087 651.9958]
/ShadingType 6
/ColorSpace [/DeviceRGB]
/DataSource
[
0
350.00000000 651.99583594
336.00000000 645.75890880
336.00000000 645.75890880
322.00000000 639.52198166
339.17140372 627.26533984
339.17140372 627.26533984
356.34280743 615.00869803
370.19224806 621.16169097
370.19224806 621.16169097
384.04168868 627.31468392
367.02084434 639.65525993
367.02084434 639.65525993
0.23047 0.29688 0.75
0.23047 0.29688 0.75
0.41081 0.54141 0.93366
0.41112 0.54178 0.93388
]
>>
gsave
322 615.0087 62.04169 36.98714 rectclip
shfill
grestore
Update 2: I have been able to reduce the PDF file sizes by about 15% by using pdftocairo, followed by gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf.
PostScript is a programming language and PDF is not, so often you can actually create a smaller PostScript program than the resulting PDF file.
The 'inefficiencies' you mention in your EPS program, and the precision of the input numbers, are completely irrelevant to the size of the PDF file. The operators in PDF do not have the same names as the operators in PostScript, so a 'moveto' in PostScript does not simply get transliterated into a 'moveto' in the resulting PDF file. The precision of numbers in the output PDF file is not tied to the precision of the numbers in the input.
In addition, PostScript interpreters often use fixed-precision arithmetic (Ghostscript, for example, uses 24:8), so (e.g.) 1.5 on the input may not be produced as 1.5 on the output; it may instead become 1.49999999.
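To illustrate why extra input digits need not survive, here is a small Python sketch of fixed-point rounding. It assumes that "24:8" means 24 integer bits and 8 fractional bits; it is a generic illustration and not Ghostscript's actual code path. The second sample value is one of the coordinates from the question.

```python
FRACTION_BITS = 8  # assumption: "24:8" = 24 integer bits, 8 fractional bits
SCALE = 1 << FRACTION_BITS  # 256 steps per unit


def to_fixed(x):
    """Round a real number to the nearest fixed-point step."""
    return round(x * SCALE)


def from_fixed(n):
    """Convert a fixed-point integer back to a float."""
    return n / SCALE


for value in (0.1, 651.99583594):
    # Every value is snapped to a multiple of 1/256, so digits beyond
    # that resolution in the input are lost.
    print(value, "->", from_fixed(to_fixed(value)))
```

Under this assumption, writing coordinates with 3 or 8 decimal places makes little difference once they have passed through the interpreter.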
So the result of this, basically, is that nobody can tell why your PDF files are as large as they are without seeing them. Personally, I would suggest that a 6:1 reduction in size is pretty reasonable. If you post a representative example somewhere, it's possible someone could look at it and offer some suggestions, but without seeing the content it's not really possible to tell.
NB rendering the content would most likely increase the size of the PDF file, unless you render at a really low resolution.
EDIT
The supplied example is simply a shading dictionary; the PDF file will contain almost exactly the same data for that particular construct. It's already about as compact as you could expect, and I very much doubt this is the sort of thing occupying 18 MB of source; that would be an enormous number of shadings. There is no realistic way to make that smaller, and rendering it to a bitmap (even at very low resolution) would actually make it larger.
It's entirely possible the EPS contains things like a bitmap preview, which will, of course, be removed when creating a PDF. It may also (depending on the creating application) contain the original document, stored as comments, which will also be removed when creating a PDF file. Without seeing the original EPS it's not really possible to suggest much.
I'm afraid posting little bits of the file isn't going to help really.

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string containing non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info(getSomeStringFromModel(m)); // log4j, correct output
RDFWriter w = m.getWriter("RDF/XML"); // default enc: utf-8
w.setProperty("showXmlDeclaration", "true"); // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write(m, out, "http://someurl.org/base/");
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add utf-8 nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates OK, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu g√©n√©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: if each of the troublesome characters like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
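The length check can be tried in isolation. This short Python sketch (not Jena code) assumes the misread is UTF-8 bytes decoded as ISO-8859-1, as guessed above, and shows each é turning into two characters.

```python
original = "g\u00e9n\u00e9ralement"  # "généralement", 12 characters

# Assumption for illustration: UTF-8 bytes wrongly decoded as ISO-8859-1.
misread = original.encode("utf-8").decode("iso-8859-1")

print(len(original), len(misread))
print(misread)
```

If the in-memory string has the longer length, the damage happened while reading the source data, before the writer was ever involved.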
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 hex sequence 0xc3a9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
The output file. Again, use a hex editor to check that the byte sequence is 0xc3a9.
This will tell you exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a9.
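The three inspections above can also be scripted instead of done by eye in a hex editor. The following Python sketch is only an illustration: `diagnose` is a hypothetical helper, and the two failure patterns assume the UTF-8 bytes were misread as ISO-8859-1 or Mac Roman and then written out again as UTF-8.

```python
E_ACUTE = "\u00e9"  # é
UTF8 = E_ACUTE.encode("utf-8")  # the expected bytes c3 a9

# Byte patterns to look for; the misread cases are assumptions.
PATTERNS = {
    "correct UTF-8": UTF8,
    "misread as ISO-8859-1": UTF8.decode("iso-8859-1").encode("utf-8"),
    "misread as Mac Roman": UTF8.decode("mac_roman").encode("utf-8"),
}


def diagnose(data):
    """Return the label of every known pattern found in the byte string."""
    return [label for label, pattern in PATTERNS.items() if pattern in data]


for label, pattern in PATTERNS.items():
    print(label, pattern.hex(" "))
```

Calling `diagnose` on the raw bytes of the data source, of the in-memory string, and of the output file shows at which stage the sequence stops being c3 a9.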
The best way to address this would be to package up the smallest piece of your code that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.