Special unicode question mark characters in database table - sql

First of all, thanks to anyone who reads this and responds.
I'm having a problem where I have a site (primarily in English) with many translations for different languages. I have a database which stores these translations. Unfortunately, one of the languages seems to be populated with question mark characters between each regular character. Because of this, any text which contains these characters won't show up in IE.
Are there any SQL statements that will seek these characters out and remove them? There's a find/replace option, but I can't seem to find a rule that applies.
Thanks for any help you can give.
As an example, this is how text shows in a table:
�i�O�N� �k�i�t� �d�e� �s�u�p�p�o�r�t� �V�é�l�o� - which stops it showing IE.
Removing these as below will show it in IE:
iON kit de support Vélo
Any idea how I go about this?
Thanks :)

Your translation database contains mangled data that has come from misinterpreting UTF-16-encoded input as ISO-8859-1 (or the closely related Windows code page 1252; you can't tell the difference from the example data).
You could attempt to undo the damage by extracting the data, encoding it back to what is hopefully the original set of bytes, and re-decoding it, then inserting it back into the database. For example in PHP:
$mangled = "i\0O\0N\0 \0k\0i\0t\0 \0d\0e\0 \0s\0u\0p\0p\0o\0r\0t\0 \0V\0\xE9\0l\0o\0";
$fixed = iconv('utf-16le', 'utf-8', $mangled);
# "iON kit de support V\xC3\xA9lo"
but it would be best to go back to the original input data and re-import it properly.
Just removing zero bytes from a UTF-16-encoded byte string (str_replace("\0", '', $mangled)) isn't really a fix. It would work for the ASCII characters (U+0000–U+007F), but you would end up with ISO-8859-1 bytes for characters U+0080–U+00FF (more usually you would want UTF-8), and any characters outside that range would remain unreadable nonsense.
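To make the difference between re-decoding and stripping concrete, here is a minimal Python sketch of the same idea. The helper names mangle and repair are made up for illustration, and the sample text is my own (chosen to contain "œ", which lies outside U+0000–U+00FF). It assumes the wrong decoding was ISO-8859-1; if it was really Windows-1252, the round trip may not be lossless for every byte.

```python
def mangle(text: str) -> str:
    """Simulate the damage: UTF-16LE bytes wrongly decoded as ISO-8859-1."""
    return text.encode("utf-16-le").decode("latin-1")


def repair(text: str) -> str:
    """Undo the damage: recover the original bytes, then decode them as UTF-16LE."""
    return text.encode("latin-1").decode("utf-16-le")


# Hypothetical sample: "œ" (U+0153) needs a non-zero high byte in UTF-16.
original = "c\u0153ur V\u00e9lo"
mangled = mangle(original)

repaired = repair(mangled)              # proper round trip
stripped = mangled.replace("\x00", "")  # the "just remove the zero bytes" shortcut

print(repaired == original)   # True: the round trip restores the text
print(stripped == original)   # False: "œ" has turned into "S" plus a control character
```

Stripping only appears to work while every character fits in a single byte; the round trip does not have that limitation.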

How to determine Thousands Separator using Format in VBA

I would like to determine the thousands separator in use when running VBA code on a target machine, without resorting to system built-in functions such as (Separator = Application.ThousandsSeparator).
I am using the following simple code using 'Format':
ThousandSeparator = Mid(Format(1000, "#,#"), 2, 1)
The above seems to work fine, and I would like to confirm whether this is a safe method of doing it without resorting to system calls.
I would expect the result to be a single-character string such as , or . or ' or a space, as applicable to the locale on the machine.
Please note that I only want to use a language statement such as Format or similar (no system calls). Also, this relates to the thousands separator, not the decimal separator. The article "Using VBA to detect which decimal sign the computer is using" does not help or answer my question.
Thanks in advance.
The strict answer to whether it is safe to use Format to get the thousands separator is No.
E.g. on Windows, it is possible to enter up to three characters into the Thousands Separator field in the regional settings in the control panel.
Suppose you enter asd and click OK.
If you now call Format(1000, "#,#") it will give you 1a000. That is only the first letter of your thousands separator. You have failed to retrieve it correctly.
Reading the registry:
? CreateObject("WScript.Shell").RegRead("HKCU\Control Panel\International\sThousand")
you get back asd in full.
To be fair, the Excel international properties do not seem to be of much help either. Application.International(xlThousandsSeparator) in this situation will return the separator originally defined in your computer's locale, not the value you've overridden it to.
That said, the practical answer is Yes, because it would appear (and if you happen to know for sure, please post an answer here) that there is no culture with a multi-character thousands separator (even in China, where scary things like 1億2345万6789 or 1億2345萬6789 exist, each separator happens to be represented with just one UTF-16 character), and you are probably happy to ignore the people who decided to play with their locale settings in that fashion.

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the items scraped with Scrapy seems to contain no content in the HTML. In the MySQL database, it does have content, including a non-regular - (dash) that is slightly longer. It could be a dash symbol from Chinese input, or something similar. I am copying it below, though I'm not sure whether it will keep its original form. The web link is here, and this non-regular dash is in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
As further evidence, the CSV file exported from MySQL converts this weird dash to ?€?. Most likely this weird symbol causes the non-display problem.
I want to either delete this weird symbol or replace it with a , or a regular dash. Where can that be done? During Scrapy? Or in MySQL? Sorry, this is not a specific coding question. I need some guidance before working out any code for this problem.
The long dash is called an em dash (see fileformat - EM dash).
The reason you are seeing it is likely due to the chosen encoding.
Try setting a different encoding or replacing the EM dash with the , sign as you mentioned in your question.
In PHP you can do so with the following code:
# chr(151) is the em dash byte (0x97) in Windows-1252; the en dash is chr(150)
$output = str_replace(chr(151), ',', $input);
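Since the scraping side here is Python, the replacement could also be done on the decoded string before it is stored. The following is a rough sketch, not a Scrapy API: normalize_dashes is a made-up helper name, and treating both the en dash (U+2013) and the em dash (U+2014) as candidates is an assumption, because the two are hard to tell apart by eye.

```python
# Map the long dashes to a plain ASCII hyphen. Which dash the page really uses
# is an assumption, so both are covered.
DASH_TABLE = str.maketrans({
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})


def normalize_dashes(text: str) -> str:
    """Return text with en and em dashes replaced by a regular hyphen."""
    return text.translate(DASH_TABLE)


title = "Hospitalist \u2013 Chattanooga"
print(normalize_dashes(title))
```

This only hides the symptom. If the dash turns into ?€? on export, the character set used for the MySQL connection or the export is probably worth checking as well.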

SQL Strip the Font Format(Colour or other)

I have a problem stripping out the formatting in a notes table.
Here is an example:
";\red31\green73\blue125;
\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3
\ltrpar\f0\fs17
"
How do I get rid of that stuff? I want to play it safe and not simply replace everything after '\'.
Many thanks,
Rick
You're making it quite difficult for yourself by not replacing '\'.
If you look at http://other9.tripod.com/Refs/easy-rtf.html you will see that there are many different RTF codes and that the codes have no fixed length.
It is also not like HTML, where there must be a "closing" tag, which makes it even more difficult.
The only thing I can think of is to record all possible RTF codes (or use an RTF parser library) so that you can recognize whether a \ starts an RTF code or not.
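To give a feel for what such recognition involves, here is a small Python sketch rather than SQL. It relies on the general shape of an RTF control word (a backslash, letters, an optional number, and one optional space as delimiter) instead of a full list of codes. The name strip_control_words is made up, and this is not a real RTF parser: it ignores groups, colour and font tables, and escaped characters, and it will eat any ordinary text that happens to look like \word.

```python
import re

# Backslash, letters, optional signed number, then at most one delimiter space.
CONTROL_WORD = re.compile(r"\\[A-Za-z]+-?\d* ?")


def strip_control_words(rtf_fragment: str) -> str:
    """Remove things that look like RTF control words from a fragment."""
    return CONTROL_WORD.sub("", rtf_fragment).strip()


sample = r"\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3"
print(strip_control_words(sample))  # USEFUL TEXT BODY
```

For the colour-table line in the question (";\red31\green73\blue125;") this would leave stray semicolons behind, which is one reason a proper parser library is the safer route.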

Asc(Chr(254)) returns 116 in .Net 1.1 when language is Hungarian

I set the culture to Hungarian language, and Chr() seems to be broken.
System.Threading.Thread.CurrentThread.CurrentCulture = "hu-US"
System.Threading.Thread.CurrentThread.CurrentUICulture = "hu-US"
Chr(254)
This returns "ţ" when it should be "þ"
However, Asc("ţ") returns 116.
This: Asc(Chr(254)) returns 116.
Why would Asc() and Chr() be different?
I checked and the 'wide' functions do work correctly: ascw(chrw(254)) = 254
Chr(254) interprets the argument in a system dependent way, by looking at the System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage property. See the MSDN article about Chr. You can check whether that value is what you expect. "hu-US" (the hungarian locale as used in the US) might do something strange there.
As a side-note, Asc() has no promise about the used codepage in its current documentation (it was there until 3.0).
Generally I would stick to the Unicode variants (those ending in W) if at all possible, or use the Encoding class to specify the conversions explicitly.
My best guess is that your Windows tries to represent Chr(254)="ţ" as a combined letter, where the first letter is Chr(116)="t" and the second ("¸" or something like that) cannot be returned because Chr() only returns one letter.
Unicode text should not be handled character-by-character.
It sounds like you need to set the code page for the current thread -- the current culture shouldn't have any effect on Asc and Chr.
Both the Chr docs and the Asc docs have this line:
The returned character depends on the code page for the current thread, which is contained in the ANSICodePage property of the TextInfo class. TextInfo.ANSICodePage can be obtained by specifying System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage.
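The code-page dependence is easy to see outside .NET as well. The sketch below uses Python's codecs purely as an illustration and makes two assumptions: that the Hungarian ANSI code page is Windows-1250, and that the other code page involved is Windows-1252. Python, unlike Windows, does no best-fit mapping, so the 116 result can only be hinted at here.

```python
byte_value = 254

# The same byte means different characters under different ANSI code pages.
hungarian = bytes([byte_value]).decode("cp1250")  # U+0163, t with cedilla
western = bytes([byte_value]).decode("cp1252")    # U+00FE, thorn

print(hex(ord(hungarian)), hex(ord(western)))


def fits(char: str, codepage: str) -> bool:
    """Report whether char has an exact byte in the given code page."""
    try:
        char.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# U+0163 exists in cp1250 but not in cp1252. A conversion that falls back to
# the closest-looking letter would produce a plain "t", and ord("t") is 116.
print(fits(hungarian, "cp1250"), fits(hungarian, "cp1252"), ord("t"))
```

If Chr and Asc end up using different code pages, a round trip through both cannot be expected to return the original number.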
I have seen several problems in VBA on the Mac where characters over 127 and some control characters are not treated properly.
This includes paragraph marks (especially in text copied from the internet or scanned), "¥", and "Ω".
They cannot always be searched for, cannot be used in file names - though they could in the past, and when tested, come up as another ascii number. I have had to write algorithms to change these when files open, as they often look like they are the right character, but then crash some of my macros when they act strangely. The character will look and act right when I save the file, but may be changed when it is reopened.
I will eventually try to switch to unicode, but I am not sure if that will help this issue.
This may not be the issue that you are observing, but I would not rule out isolated problems with certain characters like this. I have sent notes to MS about this in the past but have received no joy.
If you cannot find another solution and the character looks correct when you type it in, then I recommend using a macro snippet like the one below, which I run when updating tables. You of course have to setup theRange as the area you are looking at. A whole file can take a while.
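' Walk the range one character at a time and retype any impostor character
For aChar = 1 To theRange.Characters.Count
    theRange.Characters(aChar).Select
    ' Reports code 95 but is not a real underscore: replace it with the intended symbol
    If Asc(Selection.Text) = 95 And Selection.Text <> "_" Then Selection.TypeText "Ω"
Next aChar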

Rails ActiveRecord: Inserting text containing unprintable/weird characters

I am inserting some text scraped from the web into my database. Some of the fields in the string have unprintable/weird characters. For example,
if text is "C__O__?__P__L__E__T__E",
then the text in the database is stored only as "C__O__"
I know about h(), strip_tags(), sanitize, etc., but I do not want to sanitize this SQL. ActiveRecord logs the SQL correctly, and when run in phpMySQL, the query is executed correctly. Something happens between the SQL query being generated and it being executed.
Help is much appreciated.
Just replace the question mark in the string with a bind placeholder and pass a string containing the question mark as its value. I haven't found any other way either:
["C__O__?__P__L__E__T__E", '?']
works perfectly.
Can you escape the question mark using "\?"?
Hmm. Using CGI escape, I found out that the character coming into the system is not what I expected it to be. It is not a question mark (%3F) but a different byte (%D5) that only displays as a question mark.
C__%D5__M__P__L__%80___T__%80__
C__%3F__M__P__L__%3F___T__%3F__
Eventually I gsubbed out the non-printable characters before saving.
gsub(/[^[:print:]]/, '')
Only after removing the invalid characters in my string, was I able to save the item properly.
None of the other solutions worked, partially because the problem was not understood clearly upfront.
I know this is way late, but I ran into the same problem when we were trying to process a file as UTF-8 that actually used the ISO-8859-1 character encoding. I suspect you had a similar issue in your scraping where you assumed the wrong encoding and it ended up causing things to fail.
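For what it's worth, the wrong-encoding theory is easy to check on the raw bytes. Here is a minimal Python sketch (Python rather than Ruby, purely for illustration). The sample bytes are modelled on the escaped output shown earlier in the thread, decode_scraped is a made-up helper name, and ISO-8859-1 as the fallback is an assumption about the source page.

```python
# Bytes modelled on the CGI-escaped output above (%D5 and %80).
raw = b"C__\xd5__M__P__L__\x80__T__\x80__"


def decode_scraped(data: bytes, declared: str = "utf-8", fallback: str = "latin-1") -> str:
    """Try the declared encoding first; fall back if the bytes are not valid in it."""
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        # 0xD5 followed by "_" is not a valid UTF-8 sequence, so we land here.
        return data.decode(fallback)


text = decode_scraped(raw)
print(ascii(text))  # ascii() keeps the unprintable characters visible
```

Decoding with the right encoding keeps the characters instead of dropping them, so filtering with [:print:] becomes a deliberate choice rather than a workaround.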