Replacing wrong codification letters with SQL - sql

I have a database with data from the internet, but some pages have the wrong encoding, so letters like ã become Ã£ and ç becomes Ã§.
What are the possibilities to fix this? I'm using PostgreSQL.
I can use replace, but do I need a replace for each case? I was thinking about translate, but I see that it maps only one character to another. Is it possible to translate two characters into one? Something like: TRANSLATE(text,'Ã£|Ã§','ã|ç').

This particular problem looks like you have UTF-8 encoded data being interpreted as a single-byte character set ("ç" becoming "Ã§" suggests iso-8859-1).
You can fix these up individually with a long chain of replace(...) calls. Or you can use PostgreSQL's own character-conversion facilities:
select convert_from(convert_to('Â£20 - garÃ§on', 'iso-8859-1'), 'utf-8')
In order, this:
1. Converts the string back to binary using the iso-8859-1 codec (which will just change Unicode code points back to bytes, assuming all the code points are under 256)
2. Reinterprets that binary output as UTF-8, so sequences such as {0xc2, 0xa3} are translated to '£'
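The same two-step round trip can be sketched outside the database. This Python snippet is only an illustration, not part of the original answer; the helper name fix_mojibake is made up here, and it assumes every character of the damaged string fits in ISO-8859-1.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as ISO-8859-1."""
    # Step 1: turn the code points back into the original bytes.
    raw = text.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding they were written in.
    return raw.decode("utf-8")


# Build a damaged sample by making the same mistake on purpose.
damaged = "£20 - garçon".encode("utf-8").decode("iso-8859-1")
repaired = fix_mojibake(damaged)
print(repaired)
```

A string that contains characters outside ISO-8859-1 makes step 1 raise UnicodeEncodeError, which is a useful signal that the text was damaged in some other way.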

You can fix some of the characters by replacing them, but not all. By decoding the data using the wrong encoding you have already removed some information, and that is impossible to get back.
You should find out what the correct encoding is for those pages, and use that when decoding the data.
Some pages have the encoding in the response header, e.g.
Content-Type: text/html; charset=utf8
Some pages have the encoding in the HTML head, e.g.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
If the information is not in the header you would first have to decode the page (or at least a part of it) using the ASCII encoding (which is not a problem as the meta tag contains no special characters), find out the encoding, then decode the page using the correct encoding.
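As a rough sketch of that lookup order (response header first, then the meta tag, then a fallback), something like the following could work in Python. The function names, the 2048-byte window and the UTF-8 fallback are assumptions made for this example; real pages may need a proper HTML parser.

```python
import re
from typing import Optional

CHARSET_RE = r"charset\s*=\s*[\"']?([\w.:-]+)"


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    match = re.search(CHARSET_RE, content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes) -> Optional[str]:
    """Look for a charset declaration in the first part of the raw HTML."""
    # Decode only the start of the page as ASCII; other bytes are replaced.
    head = body[:2048].decode("ascii", errors="replace")
    match = re.search(r"<meta[^>]*" + CHARSET_RE, head, re.IGNORECASE)
    return match.group(1).lower() if match else None


def decode_page(body: bytes, content_type: str = "", fallback: str = "utf-8") -> str:
    """Decode a page using the header, then the meta tag, then the fallback."""
    encoding = charset_from_header(content_type) or charset_from_meta(body) or fallback
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:  # the page declared a charset Python does not know
        return body.decode(fallback, errors="replace")
```

The header is checked first here because it is available before any of the body has been read.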

PostgreSQL has a string replacement function:
replace(string text, from text, to text): Replace all occurrences in string of substring from with substring to
Example:
replace ('abcdefabcdef', 'cd', 'XX') ==> abXXefabXXef
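If only a handful of sequences are affected, the one-replace-per-case idea can be driven from a small table. The Python sketch below is illustrative only: the REPAIRS table and the repair function are made-up names, and the table is built by reproducing the UTF-8-read-as-ISO-8859-1 mistake for three sample characters.

```python
# Build the "wrong -> right" pairs by reproducing the decoding mistake.
REPAIRS = {ch.encode("utf-8").decode("iso-8859-1"): ch for ch in "ãç£"}


def repair(text: str) -> str:
    """Apply one replace per known damaged sequence."""
    for wrong, right in REPAIRS.items():
        text = text.replace(wrong, right)
    return text


sample = "garçon".encode("utf-8").decode("iso-8859-1")
print(repair(sample))
```

Each damaged sequence is two characters long, which is why a character-for-character function such as translate cannot do this job.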

Related

Defining the contents of a URL

I have a web site which contains forms for my customers to download. They are constantly telling me that the %20 listed in the url when they are looking at a form means there is a 20% discount on the items listed on the form. The following url is what is displayed in one example. Can you explain to me what the %20 means in this url? http://www.schumachersuniforms.com/form/Atonement%20PreK.pdf
Percent Encoding
A URL cannot contain certain characters. The SPACE character is one of those forbidden characters.
Your PDF document is apparently named with a SPACE in the middle, Atonement PreK.pdf.
Percent Encoding, also known as URL Encoding, is a way to replace the offending characters with a sequence of other characters. That sequence begins with a PERCENT SIGN character. A hexadecimal number of the character’s code point follows.
The decimal code point for SPACE is 32, the hex is 20. So the string %20 substitutes for the SPACE.
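To see the substitution in action, here is a small Python illustration using the standard library's urllib.parse module (the file name is the one from the question; nothing else is assumed about the site).

```python
from urllib.parse import quote, unquote

name = "Atonement PreK.pdf"

encoded = quote(name)        # the space becomes %20
decoded = unquote(encoded)   # and %20 becomes a space again
space_hex = f"{ord(' '):02X}"  # code point 32 written in hexadecimal

print(encoded)    # Atonement%20PreK.pdf
print(decoded)    # Atonement PreK.pdf
print(space_hex)  # 20
```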
No way around this:
If you really don't want the %20, then avoid naming your PDF document with space characters. Example: AtonementPreK.pdf.
Or use a more sophisticated web scheme for handling the URL triggering a download other than directly referencing the file name.
Do not confuse URL encoding with HTML (and XML) character entity references.

NSJSONSerialization parsing special characters

I am parsing some data using NSJSONSerialization. After parsing, I get strings like &auml ; and %#339; which I think has something to do with encoding. But NSJSONSerialization doesn't ask which encoding to use; I guess it detects it by itself. So my question is: how can I get proper strings instead of these weird &auml ; and %#339;?
NSJSONSerialization assumes the encoding is one of the Unicode encodings. Make sure the data you pass to it is in UTF-8 (or UTF-16). ä is C3 A4 in UTF-8 or 00 E4 in UTF-16 (big-endian).
Note that the default encoding for HTTP if none is specified is ISO-8859-1, so it may be that you are passing ISO-8859-1 data instead of UTF-8.
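The byte values are easy to check, and the effect of handing ISO-8859-1 bytes to a parser that expects Unicode can be reproduced with any JSON library. The sketch below uses Python's json module purely as a stand-in for NSJSONSerialization; the payload is invented for the example.

```python
import json

utf8_bytes = "ä".encode("utf-8").hex()       # c3a4
utf16_bytes = "ä".encode("utf-16-be").hex()  # 00e4
latin1_bytes = "ä".encode("iso-8859-1").hex()  # e4

payload = '{"name": "ä"}'

# UTF-8 bytes parse without any extra information.
utf8_result = json.loads(payload.encode("utf-8"))

# ISO-8859-1 bytes are not valid UTF-8, so the parser rejects them.
try:
    json.loads(payload.encode("iso-8859-1"))
    latin1_accepted = True
except UnicodeDecodeError:
    latin1_accepted = False

# Decoding with the real encoding first, then parsing the text, works.
recovered = json.loads(payload.encode("iso-8859-1").decode("iso-8859-1"))
```

The same idea applies on the Objective-C side: convert the received bytes to a Unicode encoding before passing them to the parser.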
In options try NSJSONReadingMutableLeaves; it should return NSMutableString. For more, take a look at the docs.

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any stringform data just by looking at the data. This is not my question.
My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g.: UTF-8)? This would be something roughly analogous to <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in HTML.
Thank you very much in advance,
Blz
A quick look at the PDF specification seems to suggest that you can have different encodings inside a PDF file. Have a look at page 86. So a PDF library with some kind of low-level access should be able to provide you with the encoding used for a string. But if you just want the text and don't care about the internal encodings used, I would suggest letting the library take care of conversions for you.
PDF uses "named" characters, in the sense that a character is a name and not a numeric code. Character "a" has name "a", character "2" has name "two" and the euro sign has name "euro", to give a few examples. PDF defines a few "standard" "base" encodings (named "WinAnsiEncoding", "MacRomanEncoding" and a few more, can't remember exactly), an encoding being a one-to-one correspondence between character names and byte values (yes, only 0 to 255). The exact, normative values for these predefined encodings are in the PDF specification. All these encodings use the ASCII values for the US-ASCII characters, but they differ in higher byte values.
A PDF file may define new encodings by taking a "base" encoding (say, WinAnsiEncoding) and redefining a few bytes, so a PDF author may, for example, define a new encoding named "MySuperbEncoding" as WinAnsiEncoding but with byte value 65 changed to mean character "ntilde" (this definition goes inside the PDF file), and then specify that some strings in the file use encoding "MySuperbEncoding". In this case, a string containing byte values 65-66-67 would mean characters "ñBC" and not "ABC". And note that I mean characters, nothing to do with glyphs or fonts. Different strings within the PDF file may use different encodings (this provides a way of using more than 256 characters in the PDF file, even though every string is defined as a byte sequence, and one byte always corresponds to one character).
So, the answer to your question is: characters within a PDF file can well be encoded internally in an ad-hoc encoding made on the spot for that specific PDF file. PDF parsers should make the appropriate substitutions when necessary. I do not know PDFMiner but I'm surprised that it (being a PDF parser) gives incorrect values, as the specification is very clear on how this must be interpreted. It IS possible to get all the necessary information from the PDF file, but, as Mattias said, it might be a large project and I think a program named PDFMiner should do exactly this kind of job.

how to check the string is UNICODE vb.net

Is there any way to check if the string is UNICODE using VB.net?
Best Regards
inchikka
You need to read the file using the Encoding that the file is written in.
It appears to be a non-Unicode file that you are trying to read as Unicode, or possibly a different Unicode encoding than the default UTF-8 (could be UTF-16 for example).
StreamWriter has several constructors that take an Encoding as a parameter.
You can do it by validating each character in the string against the 128 characters in the ASCII table. If the character is not found there then it might be a Unicode character.
Is that what you mean?
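The per-character check against the ASCII range is short in any language. It is sketched here in Python rather than VB.net, only to stay consistent with the other examples on this page; the function names are made up.

```python
def non_ascii_chars(text: str) -> list:
    """Return the characters whose code point lies outside the 128 ASCII values."""
    return [ch for ch in text if ord(ch) > 127]


def is_ascii_only(text: str) -> bool:
    """True when every character is found in the ASCII table."""
    return not non_ascii_chars(text)


print(is_ascii_only("plain text"))  # True
print(non_ascii_chars("garçon"))    # ['ç']
```

Note that this only tells you whether a string contains characters beyond ASCII; it says nothing about which encoding a file on disk was written in.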

When should space be encoded to plus (+) or %20? [duplicate]

Sometimes the spaces get URL encoded to the + sign, and some other times to %20. What is the difference and why should this happen?
+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.
%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
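Python's standard library happens to offer both behaviours side by side, which makes the difference easy to see. This is just an illustration of the two conventions using urllib.parse; the sample string is invented.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "foo bar+baz"

percent_style = quote(text, safe="")  # space -> %20, plus -> %2B
form_style = quote_plus(text)         # space -> +,   plus -> %2B

print(percent_style)  # foo%20bar%2Bbaz
print(form_style)     # foo+bar%2Bbaz

# Decoding: only the form-style decoder treats "+" as a space.
print(unquote("foo+bar"))       # foo+bar
print(unquote_plus("foo+bar"))  # foo bar
```

Both encoders turn a literal plus into %2B, so the output of either can be decoded safely by a form-style decoder.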
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
So, the answers here are all a bit incomplete. The use of a '%20' to encode a space in URLs is explicitly defined in RFC 3986, which defines how a URI is built. There is no mention in this specification of using a '+' for encoding spaces - if you go solely by this specification, a space must be encoded as '%20'.
The mention of using '+' for encoding spaces comes from the various incarnations of the HTML specification - specifically in the section describing content type 'application/x-www-form-urlencoded'. This is used for posting form data.
Now, the HTML 2.0 specification (RFC 1866) explicitly said, in section 8.2.2, that the query part of a GET request's URL string should be encoded as 'application/x-www-form-urlencoded'. This, in theory, suggests that it's legal to use a '+' in the URL in the query string (after the '?').
But... does it really? Remember, HTML is itself a content specification, and URLs with query strings can be used with content other than HTML. Further, while the later versions of the HTML spec continue to define '+' as legal in 'application/x-www-form-urlencoded' content, they completely omit the part saying that GET request query strings are defined as that type. There is, in fact, no mention whatsoever about the query string encoding in anything after the HTML 2.0 specification.
Which leaves us with the question - is it valid? Certainly there's a lot of legacy code which supports '+' in query strings, and a lot of code which generates it as well. So odds are good you won't break if you use '+'. (And, in fact, I did all the research on this recently because I discovered a major site which failed to accept '%20' in a GET query as a space. They actually failed to decode any percent encoded character. So the service you're using may be relevant as well.)
But from a pure reading of the specifications, without the language from the HTML 2.0 specification carried over into later versions, URLs are covered entirely by RFC 3986, which means spaces ought to be converted to '%20'. And definitely that should be the case if you are requesting anything other than an HTML document.
http://www.example.com/some/path/to/resource?param1=value1
The part before the question mark must use % encoding (so %20 for space), after the question mark you can use either %20 or + for a space. If you need an actual + after the question mark use %2B.
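The split at the question mark can be demonstrated by decoding each part with the rule that applies to it. The sketch below uses Python's urllib.parse on the foo+bar example URL that appears earlier in this thread; it is an illustration of the rule, not a recommendation of a particular library.

```python
from urllib.parse import parse_qs, unquote, urlsplit

url = "http://www.example.com/path/foo+bar/path?query+name=query+value"
parts = urlsplit(url)

# Path: only percent-escapes are decoded, so "+" stays a literal plus.
path = unquote(parts.path)

# Query: form-style decoding, so "+" means a space and %2B means a plus.
query = parse_qs(parts.query)
plus_in_query = parse_qs("sum=1%2B1")

print(path)           # /path/foo+bar/path
print(query)          # {'query name': ['query value']}
print(plus_in_query)  # {'sum': ['1+1']}
```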
For compatibility reasons, it's better to always encode spaces as "%20", not as "+".
It was RFC 1866 (HTML 2.0 specification), which specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs. (see paragraph 8.2.1. subparagraph 1.). This way of encoding form data is also given in later HTML specifications, look for relevant paragraphs about application/x-www-form-urlencoded.
Here is an example of a URL string where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses, according to RFC 1866. In other cases, spaces should be encoded to %20. But since it's hard to determine the context, it's the best practice to never encode spaces as "+".
I would recommend percent-encoding all characters except the "unreserved" ones defined in RFC 3986, section 2.3.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
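A minimal sketch of that recommendation, written out by hand so the rule is visible: every byte of the UTF-8 form is escaped unless it is one of the unreserved characters. The function name is made up for this example, and encoding the text as UTF-8 first is an assumption (it is the usual choice, but the grammar line above does not state it).

```python
import string

# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str) -> str:
    """Percent-encode everything except the RFC 3986 unreserved characters."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")  # two uppercase hex digits per byte
    return "".join(out)


print(percent_encode("foo bar+baz"))  # foo%20bar%2Bbaz
print(percent_encode("A-z_0.9~"))     # A-z_0.9~
print(percent_encode("ç"))            # %C3%A7
```

In practice a library routine is preferable; the hand-written loop is here only to show which characters survive unescaped.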
The only situation when you may want to encode spaces as "+" (one byte) rather than "%20" (three bytes) is when you know for sure how to interpret the context, and when the size of the query string is of the essence.
What's the difference? See the other answers.
When should we use + instead of %20? Use + if, for some reason, you want to make the URL query string (?.....) or hash fragment (#....) more readable. Example: You can actually read this:
https://www.google.se/#q=google+doesn%27t+encode+:+and+uses+%2B+instead+of+spaces
(%2B = +)
But the following is a lot harder to read (at least to me):
https://www.google.se/#q=google%20doesn%27t%20oops%20:%20%20this%20text%20%2B%20is%20different%20spaces
I would think + is unlikely to break anything, since Google uses + (see the 1st link above) and they've probably thought about this. I'm going to use + myself just because readable + Google thinks it's OK.