Need to prevent entry of "U+3000" full-width Japanese spaces in Word doc

Short version:
So I have a Word doc with very specific formatting guidelines, into which Japanese users have to input text. Despite being specifically told not to, instead of using "tab" to create indents, they will very often use full-width Japanese spaces (U+3000). I want to somehow prevent the entry of this character to avoid having to reformat.
Long version:
We send out Japanese/English-language script templates for Japanese users to input their own dialogue into. They will often ignore the formatting of the script, using hard returns and full-width spaces to hack-format the doc (very common practice among Japanese Word users). This leads to unnecessary time spent on re-formatting. As I see it I have three options:
1. Prevent the entry of undesired characters, either by blocking the character or by showing a dialog box every time it is used.
2. Automatically show a dialog box when the document is saved, reminding users to make sure no undesired characters were used.
3. Create a macro to auto-replace undesired characters on my end.
Any suggestions? Help is very much appreciated.
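As a rough illustration of the third option (cleaning up on the receiving end), here is a minimal Python sketch that works on plain text exported from the document. It is not a Word macro, and the function names are made up for this example. It assumes that a run of U+3000 at the start of a line was meant as a single indent, which may not match every template.

```python
import re

IDEOGRAPHIC_SPACE = "\u3000"  # full-width Japanese space


def count_fullwidth_spaces(text):
    """Return how many U+3000 characters the text contains."""
    return text.count(IDEOGRAPHIC_SPACE)


def replace_leading_fullwidth_spaces(text):
    """Replace each run of U+3000 at the start of a line with one tab.

    Assumption: a leading run of full-width spaces was meant as one indent.
    Full-width spaces inside a line are left alone.
    """
    return re.sub(r"(?m)^\u3000+", "\t", text)


sample = "\u3000\u3000台詞\nA\u3000B"
print(count_fullwidth_spaces(sample))
print(repr(replace_leading_fullwidth_spaces(sample)))
```

The count could also drive a warning (the second option), since a non-zero result means the template rules were not followed.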

Missing presentation forms (glyphs) of some Arabic characters in Unicode

I am working on code that generates PDFs containing Arabic text. For each character, I choose the correct glyph from the presentation forms to display the text correctly. This works fine, but Unicode doesn't contain presentation forms for all Arabic characters.
For example \u067D ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS ٽ. There is no presentation form of this character even though the character has medial form, as can be seen in this string: لٽط
What is the reason that presentation forms of this and other characters are missing?
Is the character not used in practice?
Can the simple ARABIC LETTER TEH, which contains only one dot above and has presentation forms, be used instead?
Or is it necessary to somehow build this character (e.g. by using \uFBB6 THREE DOTS ABOVE character)?

The Arabic presentation forms should never be used for writing text. They exist only because they were needed for compatibility with older standards long ago. As such, there aren’t presentation forms for all Arabic letters in Unicode, only those necessary for this specific purpose. Many letters were also added long after the presentation forms ceased being relevant altogether. See the FAQ on Arabic for more information.
Arabic text should always be entered and stored using the regular letters (from the blocks Arabic, Arabic Supplement, and Arabic Extended-A). These letters will then automatically assume the correct shape depending on where they are situated in the word (initial, medial, or final) as can be seen in the example string you provided.
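If existing text already contains presentation forms, one way to get back to the regular letters is Unicode compatibility normalization. The sketch below uses Python's standard `unicodedata` module. The helper name `to_base_letters` is made up for this example, and NFKC also rewrites other compatibility characters (ligatures, full-width forms), so it is not a drop-in fix for every text.

```python
import unicodedata


def to_base_letters(text):
    """Map compatibility characters, including Arabic presentation forms,
    back to their regular letters using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


teh_medial_form = "\uFE98"  # ARABIC LETTER TEH MEDIAL FORM
teh = "\u062A"              # ARABIC LETTER TEH

# The presentation form folds back to the regular letter.
print(to_base_letters(teh_medial_form) == teh)

# U+067D is already a regular letter and has no decomposition of its own.
print(repr(unicodedata.decomposition("\u067D")))
```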
Using the character U+FBB6 ﮶ ARABIC SYMBOL THREE DOTS ABOVE would not be appropriate in this context because it is not a combining mark. It isn’t used to build new characters, but to talk about the symbol itself in isolation. From the code chart for Arabic Presentation Forms-A:
These are spacing symbols representing Arabic letter diacritics
considered in isolation, as for example as in discussions about the
Arabic script.
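The difference between a spacing symbol and a combining mark can be checked from the Unicode character database. This is a small sketch using Python's `unicodedata`. The helper name is made up for this example, and ARABIC SHADDA (U+0651) is used only as a familiar combining mark for comparison.

```python
import unicodedata


def is_combining_mark(ch):
    """True if the character's general category is a mark (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


three_dots_above = "\uFBB6"  # ARABIC SYMBOL THREE DOTS ABOVE (spacing symbol)
shadda = "\u0651"            # ARABIC SHADDA (a real combining mark)

print(is_combining_mark(three_dots_above))
print(is_combining_mark(shadda))
```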
If the software you are using does not handle Arabic letter joining correctly, then there simply is no Unicode-defined way to enter the medial form of ٽ in your document. You will either have to switch to another framework entirely, or (as a last resort) encode the contextual forms you need as private-use characters in a new font, but I strongly recommend against that solution.

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to mobi format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
I managed, with a regular expression, to correctly indent the lines that were indented using white spaces. I did so by transforming every two white spaces into two non-breaking spaces.
Unfortunately, some of the code blocks are indented using the tab character, so the regular expression does not work in these cases.
I came to realize that during the conversion from PDF to mobi there is an intermediate step in which the PDF is converted to HTML, and that is where the tab information is lost, because no special tag is generated to carry this information.
So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces (\s\s). This way the regular expression I mentioned before will work for all the code blocks and the code will be indented properly.
But I have no idea which software has this functionality of substituting PDF elements.

I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x09 in ASCII) has no special significance in PDF, and in particular it does not move the current point, it simply draws a glyph. As a result, if you do (A\tB) what you will see when the PDF is rendered is 'AB'. Or 'A*B' where the * is some other character you didn't expect (often a square).
So you would probably actually have to convert current point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you, MuPDF is one, I think pdftk is another.
Then you will need to locate the position where you want to insert space, this could be challenging, as the font may be re-encoded to something other than ASCII so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset. This means that a space may not be 0x20 and indeed the font may not even contain a space glyph. And of course you need to remove the operations to reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the position of every element in the file. If you've edited the file, it's likely that you will have altered the length of one or more elements, which will change the offset of any following elements, so you will need to recalculate those and update the xref table.
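To make the offset problem concrete, here is a deliberately naive Python sketch on a toy byte string. It only illustrates why the recorded positions go stale after an edit. The function name is made up, and a real tool would need a proper PDF parser, because a pattern scan like this can be fooled by stream contents and does not handle compressed object streams.

```python
import re


def object_offsets(pdf_bytes):
    """Return {object number: byte offset} for every 'N G obj' header
    that starts a line. Naive scan, for illustration only."""
    offsets = {}
    for match in re.finditer(rb"(?m)^(\d+) (\d+) obj\b", pdf_bytes):
        offsets[int(match.group(1))] = match.start()
    return offsets


# A toy fragment, not a complete or valid PDF file.
original = b"%PDF-1.4\n1 0 obj\n(AB)\nendobj\n2 0 obj\n(C)\nendobj\n"

# Inserting two spaces into the first string lengthens object 1 ...
edited = original.replace(b"(AB)", b"(A  B)")

# ... so every object after it starts two bytes later than before.
print(object_offsets(original))
print(object_offsets(edited))
```

Any offset stored in the xref table for object 2 would now point two bytes too early, which is why the table has to be rebuilt after such an edit.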
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.

MS Access VBA code editor character encoding and copy/paste

What is the actual encoding used in Access' VBA editor? I have been searching for a concrete answer for quite a while but with no luck.
I thought it was UTF-8 but I'm not very certain.
My main issue is that when writing a query in VBA I sometimes need to test it in Access' query editor. When copy-pasting, however, I lose my native characters (Greek in my case) as they turn to gibberish.
I have tried pasting in a text editor and saving it as different encodings but I can never recover the original characters.
Thanks in advance.
Edit
Let me explain this a bit further:
As you can see I can write my Greek characters in the VBA editor normally:
However, copying the first line in Access' query editor, I get the following:
Same goes for a simple text editor:
So I am inclined to think that the problem lies inside the clipboard, due to the encoding used for the Greek characters. I guess they are not Unicode, as I indeed have to make the change in the System Locale for non-Unicode characters. So how are these characters saved/copied? In what encoding?
Answer
Actually, this problem was solved by switching the keyboard input language to Greek (EL) when copying the actual test string.
I am still not sure however, as to why that happens. If anyone can provide some insight into this, I would love to hear it.
Thanks again

The VBA editor does not support Unicode characters, either for input or display. Instead, it uses the older Windows technology called "code pages" to provide support for non-ASCII characters.
So, the character encoding in the VBA editor corresponds to the code page that is used by the Windows system locale as specified in the "Regional and Language Options" control panel. For example, with my system locale set to "Greek (Greece)"
I can enter Greek characters into my VBA code
However, if I switch my Windows system locale back to "English (United States)"
and re-open my VBA project, the Greek characters have changed to the corresponding characters in the new code page
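The effect described above can be reproduced outside the VBA editor. In this small Python sketch, the same bytes are read once with the Greek Windows code page (cp1253) and once with the Western European one (cp1252). The sample string is arbitrary, and the choice of cp1252 for an "English (United States)" system locale is an assumption about a typical setup.

```python
# Text as it would be stored under a Greek system locale (Windows-1253).
greek_text = "αβγ"
stored_bytes = greek_text.encode("cp1253")

# The same bytes read back under the Greek code page and under the
# Western European code page (Windows-1252).
as_greek = stored_bytes.decode("cp1253")
as_western = stored_bytes.decode("cp1252")

print(stored_bytes)
print(as_greek)
print(as_western)
```

The bytes never change. Only the code page used to interpret them does, which is why the characters appear to "turn into" other characters.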

If "Control Panel" -> "Regional and Language Options" -> "System Locale" is set correctly but you still run into this problem sometimes, note that your keyboard layout must be switched to the non-English language while you are copying.
This applies to all non-Unicode-aware applications, not only VBA.
Credit goes to #parakmiakos

Details are in this thread: http://www.pcreview.co.uk/forums/use-greek-characters-visual-basic-editor-t2097705.html
It looks like the fix is to make sure your OS locale is set properly and to choose a suitable font inside the VBA editor.

I had a similar problem with Cyrillic characters. Part of the problem is solved when the system locale is set correctly.
However, the VBA editor still does not recognize Cyrillic characters when it has to interpret them from inside itself.
For example, it cannot display the characters from the command:
MsgBox "Здравей"
but if the sheet name is in Cyrillic characters it displays them fine:
MsgBox ActiveSheet.Name
Finally, it turned out that these kinds of problems were solved when I changed to the 32-bit version of MS Office.

Can't type spaces in right aligned text field in pdf form

I am creating a fillable PDF form using Adobe Acrobat.
When I choose right alignment for a text field, I can't type spaces between the words when I fill in the form (the spaces don't appear).
However, when I enable the "Multi-line" option or the "Scroll long text" option, I am able to type spaces! In addition, if I finish typing my words and then go back to insert a space, the space appears!
I am using Adobe Acrobat 9 and my form is generated from a Word document.

I had the same problem--it's inconsistent and a bit buggy. Until they fix it, I recommend just enabling the "Scroll long text" option.

With wxWidgets, is the wxTextCtrl one-size-fits-all?

When working with GUIs of different kinds, I am used to the distinction between a text field (or text entry box) and a text box. That is, there is one type of object for the multi-line, word-processor-style text box and another type of object for a single-line, quite often non-scrollable text field / text entry box. Does wxTextCtrl serve both purposes? I know it does the text box, but is it also the correct choice for the text field / text entry box?
EDIT
There are actually 2 types of text boxes for multi-line entry as pointed out in the answers. What really interests me are widgets specific for single line entry versus widgets specific for multi-line entry.

wxTextCtrl serves for both single and multiline entry. It is quite powerful but not exactly 'word processor style'. Closer to that would be wxRichTextCtrl.
wxComboBox uses wxTextEntry ( as does wxTextCtrl in single-line mode ). Although wxTextEntry is not offered as a control itself - it does not inherit from wxControl - if you like it so much you might be able to build something using it. But it seems like a lot of trouble for benefits that I do not see.

wxTextCtrl is a single line text control (what is called "entry" in other frameworks) by default. If you specify wxTE_MULTILINE style when creating it (this style can't be changed later), it becomes -- surprise -- a multi-line control, i.e. what is called "area" in other places.
