Hsqldb - how to remove the padding on char fields - hsqldb

I'm finding that Char fields are being padded.
Is there any way to stop this happening.
I've tried using the property
SET PROPERTY "sql.enforce_strict_size" FALSE
but doesn't seem to help.

Indeed, the MySQL docs specify that "When CHAR values are retrieved, trailing spaces are removed." This is odd, as other databases seem to always keep the padding (i can confirm that for Oracle). The SQL-92 standard indicates that right-padded spaces are part of the char, for example in the definition of the CAST function on p. 148. When source (SV=source value) and target (TV=target value, LTD=length of target datatype), then:
ii) If the length in characters of SV is larger than LTD, then
TV is the first LTD characters of SV. If any of the re-
maining characters of SV are non-<space> characters, then a
completion condition is raised: warning-string data, right
truncation.
iii) If the length in characters M of SV is smaller than LTD,
then TV is SV extended on the right by LTD-M <space>s.
Maybe that's just another one of MySQL's many oddities and gotchas.
And to answer your question: if you don't want the trailing spaces, you should use VARCHAR instead.

I thought 'char' by definition are space padded to fill the field. They are considered fixed lenght and will be space padded to be fixed length.
The data type 'varchar' is defined as variable char where they are not space padded to fill the field.
I could be wrong though since I normally work on SQL Server.

Related

REGEXP_REPLACE explanation

Hi may i know what does the below query means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-#A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function function call is doing:
It is a function call to analyse an input string 'number' with a regex (2nd argument) and replace any parts of the string which match a specific string. As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square parenthesis, indicating a set of characters but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-#A-Z''[''-a-z{-~]`, but does not match anything I can think of, it is important to know what string is being examined/what the context is for the program in order to identify what the regex might be attempting to do
It seems like it is meant to replaces all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seams like characters enclosed in ' should be escaped (The second ' is to escape the ' in the string itself.) At least this would make some sense. (But for none of the DBMS I found having a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL), I found something in the docs, that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!)
Now if you take an ASCII table you'll see, that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to #, etc.. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.

Returning Unicode Name With Code Point

I know how to return a Unicode character from a code point. That's not what I'm after. What I want to know is how to return the name associated with a particular code point. For example, The code point for 🍀 is 1F340. And its name is FOUR LEAF CLOVER. Is it possible for us to return this name with its code point? I've read about 100 topics involving Unicode. But I haven't see one discussing my question. I hope that's possible.
Thank you for your help.
Have you considered the ICU library? It offers the following C API: http://icu-project.org/apiref/icu4c/uchar_8h.html#aa488f2a373998c7decb0ecd3e3552079
int32_t u_charName(
UChar32 code,
UCharNameChoice nameChoice,
char* buffer,
int32_t bufferLength,
UErrorCode* pErrorCode)
Retrieve the name of a Unicode character.
Depending on nameChoice, the character name written into the buffer is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters like A-Z, 0-9, space, and '-'. Unicode 1.0 names are only retrieved if they are different from the modern names and if the data file contains the data for them. gennames may or may not be called with a command line option to include 1.0 names in unames.dat.
Parameters
code The character (code point) for which to get the name. It must be 0<=code<=0x10ffff.
nameChoice Selector for which name to get.
buffer Destination address for copying the name. The name will always be zero-terminated. If there is no name, then the buffer will be set to the empty string.
bufferLength ==sizeof(buffer)
pErrorCode Pointer to a UErrorCode variable; check for U_SUCCESS() after u_charName() returns.
Returns
The length of the name, or 0 if there is no name for this character. If the bufferLength is less than or equal to the length, then the buffer contains the truncated name and the returned length indicates the full length of the name. The length does not include the zero-termination.
ICU is the right approach, but it's even simpler than Chris said. Foundation includes ICU already, for various text processing functions, including CFStringTransform(). Its transform parameter accepts "any valid ICU transform ID defined in the ICU User Guide for Transforms".
One of ICU's transforms is Any-Name:
Converts between characters and their Unicode names in curly braces. For example:
., ⇆ {FULL STOP}{COMMA}
(The syntax isn't exactly as documented, but it's close enough you can figure it out.)
There's also an Any-Hex transform which can be used for translating to/from the codepoint hex value.

Which Unicode characters are "composing" characters (whose sole purpose is to add accent, tilda)?

This is related to
What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?
This is how I plan to do this:
Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into
KD form.
Basically it'll turn most variation such as superscript into the normal number. Also it decompose tilda and accent into 2 characters.
Next step would be to remove all characters whose sole purpose is tildaing or accenting character.
How do I know which characters are like that? Which characters are just "composing characters"
How do I find such characters? After I find those, how do I get rid of it? Should I scan character by character and remove all such "combining characters?"
For example:
Character from 300 to 362 can be gotten rid off.
Then what?
Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.
You may be able to do it more efficiently using regex, eg Regex.Replace(str, "\p{M}", "").

RegEx to find % symbols in a string that don't form the start of a legal two-digit escape sequence?

I would like a regular expression to find the %s in the source string that don't form the start of a valid two-hex-digit escaped character (defined as a % followed by exactly two hexadecimal digits, upper or lower case) that can be used to replace only these % symbols with %25.
(The motivation is to make the best guess attempt to create legally escaped strings from strings of various origins that may be legally percent escaped and may not, and may even be a mixture of the two, without damaging the data intent if the original string was already correctly encoded, e.g. by blanket re-encoding).
Here's an example input string.
He%20has%20a%2050%%20chance%20of%20living%2C%20but%20there%27s%20only%20a%2025%%20chance%20of%20that.
This doesn't conform to any encoding standard because it is a mix of valid escaped characters eg. %20 and two loose percentage symbols. I'd like to convert those %s to %25s.
My progress so far is to identify a regex %[0-9a-z]{2} that finds the % symbols that are legal but I can't work out how to modify it to find the ones that aren't legal.
%(?![0-9a-fA-F]{2})
Should do the trick. Use a look-ahead to find a % NOT followed by a valid two-digit hexadecimal value then replace the found % symbol with your %25 replacement.
(Hopefully this works with (presumably) NSRegularExpression, or whatever you're using)
%(?![a-fA-F0-9]{2})
That's a percent followed by a negative lookahead for two hex digits.

unwanted leading blank space on oracle number format

I need to pad numbers with leading zeros (total 8 digits) for display. I'm using oracle.
select to_char(1011,'00000000') OPE_NO from dual;
select length(to_char(1011,'00000000')) OPE_NO from dual;
Instead of '00001011' I get ' 00001011'.
Why do I get an extra leading blank space? What is the correct number formatting string to accomplish this?
P.S. I realise I can just use trim(), but I want to understand number formatting better.
#Eddie: I already read the documentation. And yet I still don't understand how to get rid of the leading whitespace.
#David: So does that mean there's no way but to use trim()?
Use FM (Fill Mode), e.g.
select to_char(1011,'FM00000000') OPE_NO from dual;
From that same documentation mentioned by EddieAwad:
Negative return values automatically
contain a leading negative sign and
positive values automatically contain
a leading space unless the format
model contains the MI, S, or PR format
element.
EDIT: The right way is to use the FM modifier, as answered by Steve Bosman. Read the section about Format Model Modifiers for more info.