Convert text value in SQL from ISO-8859-1 - sql

I have some data in SQL that is incorrectly formatted:
"Daniel Bødtker" is stored in what looks like ISO 8859-1 format: "=?iso-8859-1?Q?Daniel_B=F8dtker?="
Does anyone have a fix function ready to share?
Thanks!
Daniel

That's not ISO8859-1 format. That's RFC 2047 format, used for transmitting international characters in email headers, which technically only support 7-bit ASCII.
These values have almost certainly been extracted from email headers by a process which does not understand RFC2047.
Format is:
"=?" + character set name + "?" + encoding identifier + "?" + encoded data + "?="
The encoding identifier is either Q or B. Q means "Quoted-Printable" and B means "Base64".
Note that an email header can have multiple such sequences.
Therefore your solution needs to search for these sequences, and handle them on a case-by-case basis.
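As a rough illustration of the search-and-decode step outside the database, here is a minimal Python sketch. It assumes the values can be exported and cleaned by a script; the helper name decode_rfc2047 is made up for this example, while decode_header comes from the standard library's email package and handles both Q and B sequences.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode every =?charset?Q|B?data?= sequence found in value."""
    pieces = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded words come back as bytes plus the declared charset;
            # unencoded fragments next to them come back as bytes with no charset.
            pieces.append(data.decode(charset or "ascii", errors="replace"))
        else:
            # A value without any encoded word is returned unchanged as str.
            pieces.append(data)
    return "".join(pieces)


print(decode_rfc2047("=?iso-8859-1?Q?Daniel_B=F8dtker?="))  # Daniel Bødtker
```

The cleaned values could then be written back with an ordinary UPDATE.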

Related

How to search in SQL Server for text that has special characters?

I have a SQL Server table with a column of type TEXT that would store candidate resumes in different format. RTF is the most common one but often we get resume data from a 3rd party converter which stores the resume as special characters (maybe Unicode or I don't know what they are).
How do I search my table to find all the rows that have these special characters? For example the rows with id = 4,6,7, 9 etc. all are the records with special characters.
What format are these special characters called? Unicode??
Assuming that by "special" characters you mean anything outside the set of printable ASCII and certain common whitespace characters, you can try the following:
DECLARE @SpecialPattern VARCHAR(100) =
    '%[^'
    + CHAR(9) + CHAR(10) + CHAR(13) -- tab, LF, CR
    + CHAR(32) + '-' + CHAR(126) -- Range from space to last printable ASCII
    + ']%'
SELECT
    RESUME_TEXT,
    CAST(LEFT(CAST(RESUME_TEXT AS VARCHAR(MAX)), 20) AS VARBINARY(MAX)) -- Borrowed from userMT's comment
FROM RESUME
WHERE RESUME_TEXT LIKE @SpecialPattern COLLATE Latin1_General_Bin -- Use exact compare
You may get some false hits against some perfectly valid extended characters such as accented vowels, curly quotes, or em and en dashes that may exist in the text.
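For checking a few exported values by hand, the same character class can be sketched in Python. This is only an illustration of the pattern (tab, LF, CR, and the range from space to tilde are allowed, everything else is reported); the function name find_special is hypothetical.

```python
import re

# Anything that is NOT tab, LF, CR, or printable ASCII (space through tilde).
SPECIAL = re.compile(r"[^\t\n\r -~]")


def find_special(text: str) -> list:
    """Return (position, code point in hex) for each character outside the allowed set."""
    return [(m.start(), hex(ord(m.group()))) for m in SPECIAL.finditer(text)]


print(find_special("caf\u00e9"))  # [(3, '0xe9')]
```

Seeing the code points of the hits makes it easier to tell harmless accented letters apart from binary debris.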
My first thought is that the weird characters might be a UTF-8 BOM (hex EF, BB, BF), but the display didn't seem to match how I would expect SQL Server to render them. The inverse dot isn't present at all in the default Windows code page (1252).
We need at least some hex data (at least the first few bytes) to help further. Often, common binary file types have a recognizable signature in the first 3-5 bytes.

How to avoid plus sign to create a line feed in a rdlc textbox

I need to print an encrypted string as is in a rdlc report. My problem is if the string contain a plus sign it creates a new line in the Textbox. How to avoid this?
Encryption produces output that is binary and contains many bytes that have no displayable representation.
Because of this, if encrypted data needs to be displayed, it is generally either Base64 (best for computers) or hexadecimal (best for people) encoded.
It seems that you may have Base64-encoded encrypted data, which is generally composed of the upper and lowercase letters, the 10 digits, "+", "/" and "=". You cannot delete these characters and expect to recover the encrypted data.
If these characters present a problem, they can often be escaped in some manner, or another encoding can be chosen, such as hexadecimal or an alternate Base64 character set (see Base64). If you choose an alternate Base64 character set, interoperability will most likely be impaired.
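To make the three options concrete, here is a small Python sketch. The two bytes are a stand-in for real ciphertext, chosen only because they produce "+" and "/" in standard Base64; nothing here is specific to rdlc reports.

```python
import base64

ciphertext = b"\xfb\xff"  # stand-in for real encrypted bytes

standard = base64.b64encode(ciphertext).decode("ascii")          # uses "+" and "/"
url_safe = base64.urlsafe_b64encode(ciphertext).decode("ascii")  # uses "-" and "_" instead
hexadecimal = ciphertext.hex()                                   # digits and a-f only

print(standard, url_safe, hexadecimal)  # +/8= -_8= fbff

# Each form decodes back to the original bytes with the matching decoder.
assert base64.b64decode(standard) == ciphertext
assert base64.urlsafe_b64decode(url_safe) == ciphertext
assert bytes.fromhex(hexadecimal) == ciphertext
```

Whichever form is printed, the reader of the report has to apply the matching decoder, which is the interoperability cost mentioned above.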
Note: More information would produce a better answer.
I had to replace the "+" with "÷".
Users don't notice it since the PDF is just a visual representation of the CFDI, and I haven't had any issues with it.

Inserting UTF-32 characters

I'm testing UTF-32 characters (specifically emojis) with SQL Server (2008 R2, 10.5) and at this stage I'm checking if the server supports the given code
For this case I'm using the :rose with the following query
SELECT '' + nchar(0x1F339) + 'test'
which returns back in Management Studio with (NULL).
What format do I need to encode the character in so that it does not return NULL in SQL Server?
SQL Server 2008 R2 treats NCHAR/NVARCHAR data as UCS-2, which is (almost) the same as UTF-16 but always exactly 2 bytes per character. NCHAR() therefore only accepts values from 0 to 0xFFFF there and returns NULL for 0x1F339.
An idea, if I may. You can store the data in a BINARY or VARBINARY data field, which doesn't care about encoding. You can then use a mapping table or external script to parse the binary into a text field, replacing 0x1F339 with :rose: or your own custom format, for example.
Since the character lies outside the 16-bit range, it has to be written as two UTF-16 code units (a surrogate pair):
-- Returns: 🌹test
SELECT '' + nchar(0xD83C) + nchar(0xDF39) + 'test'
You can find this code under "UTF-16 Hex (C Syntax)" title, following your link.
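If no lookup table is at hand, the two values can be computed from the code point. The following Python sketch uses the standard UTF-16 surrogate arithmetic; the function name surrogate_pair is made up for this example.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F339)
print(hex(high), hex(low))  # 0xd83c 0xdf39
```

The two results are the arguments passed to the two nchar() calls in the query above.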
Also I have to recommend this article, because it was very helpful during investigation: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)
Couple of options for those who are looking for answers:
SQL Server technically does not have character escape sequences, but
you can still create characters using either byte sequences or Code
Points using the CHAR() and NCHAR() functions. We are only concerned
with Unicode here, so we will only be using NCHAR().
All versions:
NCHAR(0 - 65535) for BMP Code Points (using an int/decimal value)
NCHAR(0x0 - 0xFFFF) for BMP Code Points (using a binary/hex value)
NCHAR(0 - 65535) + NCHAR(0 - 65535) for a Surrogate Pair / Two UTF-16
Code Units
NCHAR(0x0 - 0xFFFF) + NCHAR(0x0 - 0xFFFF) for a Surrogate Pair / Two
UTF-16 Code Units
CONVERT(NVARCHAR(size), 0xHHHH) for one or more characters in UTF-16
Little Endian (“HHHH” is 1 or more sets of 4 hex digits)
Starting in SQL Server 2012:
If database’s default collation supports Supplementary Characters
(collation name ends in _SC, or starting in SQL Server 2017 name
contains 140 but does not end in _BIN*, or starting in SQL Server
2019 name ends in _UTF8 but does not contain _BIN2), then NCHAR() can
be given Supplementary Character Code Points:
decimal value can go up to 1114111
hex value can go up to 0x10FFFF
Starting in SQL Server 2019:
“_UTF8” collations enable CHAR and VARCHAR data to use the UTF-8
encoding:
CONVERT(VARCHAR(size), 0xHH) for one or more characters in UTF-8 (“HH”
is 1 or more sets of 2 hex digits)
NOTE: The CHAR() function does not work for this purpose. It can only
produce a single byte, and UTF-8 is only a single byte for values 0 –
127 / 0x00 – 0x7F.
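The hex literals for the two CONVERT() forms can be derived with a few lines of Python. This is a sketch for producing the byte sequences only; the helper name hex_literals is hypothetical, and whether the resulting literal displays correctly still depends on the SQL Server version and collation rules quoted above.

```python
def hex_literals(code_point: int) -> tuple:
    """Return (UTF-16 Little Endian hex, UTF-8 hex) for one code point."""
    character = chr(code_point)
    utf16_le = character.encode("utf-16-le").hex().upper()  # for CONVERT(NVARCHAR(size), 0x...)
    utf8 = character.encode("utf-8").hex().upper()          # for CONVERT(VARCHAR(size), 0x...) with a _UTF8 collation
    return utf16_le, utf8


print(hex_literals(0x1F339))  # ('3CD839DF', 'F09F8CB9')
```

Note that the UTF-16 form is byte-swapped relative to the surrogate values 0xD83C and 0xDF39 because it is Little Endian.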

What character encoding should I use for a HTTP header?

I'm using a "fun" HTML special-character (✰)(see http://html5boilerplate.com/ for more info) for a Server HTTP-header and am wondering if it is "allowed" per spec.
Using the Network tab in the dev tools in Chrome on Windows XP Pro SP3, I see the ✰ just fine.
In IE8 the ✰ is not rendered correctly.
The w3.org HTML validator does not render it correctly (displays "â°" instead).
Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)
Is the disparity caused by bugs in the different parsers/browsers/engines/(whatever-they-are-called)?
Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?
In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.
HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
Previously, RFC 2616 from 1999 defined this:
Words of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
and RFC 2047 is the MIME encoding, so it'd be:
=?UTF-8?Q?=E2=9C=B0?=
but I don't think that many (if any) clients support it.
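For experimenting with this, Python's standard email package can produce and read such encoded words. This is a sketch only: the library decides on its own whether to use the Q or B form, so the exact string may differ from the one shown above, and nothing here implies that HTTP clients will decode it. The helper name encode_header_value is made up for the example.

```python
from email.header import Header, decode_header


def encode_header_value(text: str) -> str:
    """Wrap non-ASCII text in an RFC 2047 encoded word (UTF-8 charset)."""
    return Header(text, "utf-8").encode()


encoded = encode_header_value("\u2730")  # the star character from the question
print(encoded)                           # pure ASCII, starts with =?utf-8?

# Decoding gives back the raw UTF-8 bytes and the declared charset.
data, charset = decode_header(encoded)[0]
print(data.decode(charset))
```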
Please read comments first, this answer likely draws wrong conclusions from the right sources, needs edit.
You can use any printable ASCII chars, and no special chars like ✰ (Which is not ASCII)
Tip: you can encode anything in JSON.
Edit: may not be obvious at first, the character encoding defined in the header only applies for the response body, not for the header itself. (As it would cause a chicken-&-egg problem.)
I'd like to sum up all the relevant definitions as per the spec linked by Penchant.
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
So, we are after field-value.
LWS = [CRLF] 1*( SP | HT )
CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
Let's simplify it to this:
field-value = <any field-content or Space or Tab>
Now we are after field-content.
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs,
but including LWS>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is the most general and includes all the rest -so forget about the rest-.
Here is the US-ASCII charset (= ASCII)
As you can see, all printable ASCII chars are allowed.
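A conservative check along these lines could look like the Python sketch below. It deliberately accepts only printable ASCII plus space and horizontal tab, which is stricter than the TEXT rule quoted above (that rule also admits octets above 127); the function name is hypothetical.

```python
def is_safe_header_value(value: str) -> bool:
    """True if value contains only printable ASCII, spaces, and horizontal tabs."""
    return all(ch == "\t" or 32 <= ord(ch) <= 126 for ch in value)


print(is_safe_header_value("text/html; charset=UTF-8"))  # True
print(is_safe_header_value("\u2730"))                    # False, not ASCII
print(is_safe_header_value("a\r\nb"))                    # False, CR and LF are control characters
```

Rejecting CR and LF outright also avoids dealing with folded (multi-line) values.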

When should space be encoded to plus (+) or %20? [duplicate]

Sometimes the spaces get URL encoded to the + sign, and some other times to %20. What is the difference and why should this happen?
+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.
%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
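The same distinction shows up in Python's standard library, which may make the rule easier to see. In this sketch, quote and unquote follow the generic percent-encoding rules, while quote_plus, unquote_plus, and parse_qs follow the form-encoding convention where + means a space.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus

raw = "foo bar+baz"

percent_encoded = quote(raw, safe="")  # 'foo%20bar%2Bbaz', safe anywhere in a URL
form_encoded = quote_plus(raw)         # 'foo+bar%2Bbaz', form/query convention only

# Decoding a path segment leaves "+" alone; decoding form data turns it into a space.
path_segment = unquote("foo+bar")      # 'foo+bar'
form_value = unquote_plus("foo+bar")   # 'foo bar'

# parse_qs applies the form convention to a whole query string.
params = parse_qs("query+name=query+value")  # {'query name': ['query value']}

print(percent_encoded, form_encoded, path_segment, form_value, params)
```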
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
So, the answers here are all a bit incomplete. The use of a '%20' to encode a space in URLs is explicitly defined in RFC 3986, which defines how a URI is built. There is no mention in this specification of using a '+' for encoding spaces - if you go solely by this specification, a space must be encoded as '%20'.
The mention of using '+' for encoding spaces comes from the various incarnations of the HTML specification - specifically in the section describing content type 'application/x-www-form-urlencoded'. This is used for posting form data.
Now, the HTML 2.0 specification (RFC 1866) explicitly said, in section 8.2.2, that the query part of a GET request's URL string should be encoded as 'application/x-www-form-urlencoded'. This, in theory, suggests that it's legal to use a '+' in the URL in the query string (after the '?').
But... does it really? Remember, HTML is itself a content specification, and URLs with query strings can be used with content other than HTML. Further, while the later versions of the HTML spec continue to define '+' as legal in 'application/x-www-form-urlencoded' content, they completely omit the part saying that GET request query strings are defined as that type. There is, in fact, no mention whatsoever about the query string encoding in anything after the HTML 2.0 specification.
Which leaves us with the question - is it valid? Certainly there's a lot of legacy code which supports '+' in query strings, and a lot of code which generates it as well. So odds are good you won't break if you use '+'. (And, in fact, I did all the research on this recently because I discovered a major site which failed to accept '%20' in a GET query as a space. They actually failed to decode any percent encoded character. So the service you're using may be relevant as well.)
But from a pure reading of the specifications, without the language from the HTML 2.0 specification carried over into later versions, URLs are covered entirely by RFC 3986, which means spaces ought to be converted to '%20'. And definitely that should be the case if you are requesting anything other than an HTML document.
http://www.example.com/some/path/to/resource?param1=value1
The part before the question mark must use % encoding (so %20 for space), after the question mark you can use either %20 or + for a space. If you need an actual + after the question mark use %2B.
For compatibility reasons, it's better to always encode spaces as "%20", not as "+".
It was RFC 1866 (HTML 2.0 specification), which specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs. (see paragraph 8.2.1. subparagraph 1.). This way of encoding form data is also given in later HTML specifications, look for relevant paragraphs about application/x-www-form-urlencoded.
Here is an example of a URL string where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses, according to RFC 1866. In other cases, spaces should be encoded to %20. But since it's hard to determine the context, it's the best practice to never encode spaces as "+".
I would recommend percent-encoding all characters except the "unreserved" set defined in RFC 3986, section 2.3.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
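As one way to apply that rule, Python's urllib.parse.quote never encodes letters, digits, or the characters "-", ".", "_", "~" (on Python 3.7 and later), so passing safe="" encodes everything outside the unreserved set. The wrapper name below is made up for this sketch.

```python
from urllib.parse import quote


def encode_all_but_unreserved(text: str) -> str:
    """Percent-encode every character outside RFC 3986 'unreserved' (UTF-8 bytes)."""
    return quote(text, safe="")


print(encode_all_but_unreserved("a b~c/d+e"))  # a%20b~c%2Fd%2Be
```

Spaces come out as %20 and a literal plus as %2B, so the result does not depend on whether the receiver applies the form-encoding convention.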
The only situation when you may want to encode spaces as "+" (one byte) rather than "%20" (three bytes) is when you know for sure how to interpret the context, and when the size of the query string is of the essence.
What's the difference? See the other answers.
When should we use + instead of %20? Use + if, for some reason, you want to make the URL query string (?.....) or hash fragment (#....) more readable. Example: You can actually read this:
https://www.google.se/#q=google+doesn%27t+encode+:+and+uses+%2B+instead+of+spaces
(%2B = +)
But the following is a lot harder to read (at least to me):
https://www.google.se/#q=google%20doesn%27t%20oops%20:%20%20this%20text%20%2B%20is%20different%20spaces
I would think + is unlikely to break anything, since Google uses + (see the 1st link above) and they've probably thought about this. I'm going to use + myself just because readable + Google thinks it's OK.