In the PostgreSQL manual it says that citext is simply a module that implements a TEXT data type with a call to LOWER():
The citext module provides a case-insensitive character string type,
citext. Essentially, it internally calls lower when comparing values.
Otherwise, it behaves almost exactly like text.
On the other hand, at the end of the documentation it says:
citext is not as efficient as text because the operator functions and
the B-tree comparison functions must make copies of the data and
convert it to lower case for comparisons. It is, however, slightly
more efficient than using lower to get case-insensitive matching.
So I'm confused: if it uses LOWER(), how can it be "slightly more efficient than using lower"?
It doesn't call the SQL function lower. As the documentation says, it essentially internally calls lower.
The calls happen within the C functions which implement the citext comparison operations. And rather than actually calling lower, they go directly to the underlying str_tolower() routine. You can see this for yourself in the source code, most of which is relatively easy to follow in this case.
So what you're saving, more or less, is the overhead of two SQL function calls per comparison. Which is not insignificant, compared with the cost of the comparison itself, but you'd probably never notice either of them next to the other costs in a typical query.
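As a minimal sketch of the two approaches (table and column names are made up, and this assumes the citext extension is installed):

-- citext: the lower-casing happens inside the type's C comparison functions
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE accounts (email citext);
SELECT * FROM accounts WHERE email = 'Alice@Example.com';

-- plain text: each comparison goes through two calls to the SQL function lower()
CREATE TABLE accounts_plain (email text);
SELECT * FROM accounts_plain WHERE lower(email) = lower('Alice@Example.com');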
Folks
I am in the process of moving a decade old back-end from DB2 9.5 to Oracle 19c.
I frequently see in SQL queries and view definitions bizarre timestamp(nullif('','')) constructs used instead of a plain null.
What is the point of doing so? Why would anyone in their right mind want to do so?
Disclaimer: my SQL skills are fairly mediocre. I might well miss something obvious.
It appears to create a NULL value with a TIMESTAMP data type.
The TIMESTAMP DB2 documentation states:
TIMESTAMP scalar function
The TIMESTAMP function returns a timestamp from a value or a pair of values.
TIMESTAMP(expression1, [expression2])
expression1 and expression2
The rules for the arguments depend on whether expression2 is specified and the data type of expression2.
If only one argument is specified it must be an expression that returns a value of one of the following built-in data types: a DATE, a TIMESTAMP, or a character string that is not a CLOB.
If you try to pass an untyped NULL to the TIMESTAMP function:
TIMESTAMP(NULL)
Then you get the error:
The invocation of routine "TIMESTAMP" is ambiguous. The argument in position "1" does not have a best fit.
To invoke the function, you need to pass it a DATE, a TIMESTAMP, or a non-CLOB string, which means that you need to coerce the NULL to have one of those types.
This could be:
TIMESTAMP(CAST(NULL AS VARCHAR(14)))
TIMESTAMP(NULLIF('',''))
Using NULLIF is more confusing but, if I have to make an excuse for using it, it is slightly less to type than casting a NULL to a string.
The equivalent in Oracle would be:
CAST(NULL AS TIMESTAMP)
This also works in DB2 (and is even less to type).
It is not clear why - in any SQL dialect, no matter how old - one would use an argument like nullif('',''). Regardless of the result, that is a constant that can be calculated once and for all, and given as argument to timestamp(). Very likely, it should be null in any dialect and any version. So that should be the same as timestamp(null). The code you found suggests that whoever wrote it didn't know what they were doing.
One might need to write something like that - rather than a plain null - to get a null of a specific data type. Even though "theoretical" SQL says null does not have a data type, you may need a typed null, for example in a view, to define the data type of the column produced by such an expression.
In Oracle you can use the cast() function, as MT0 demonstrated already - that is by far the most common and most elegant equivalent.
If you want something much closer in spirit to what you saw in that old code, to_timestamp(null) will have the same effect. There is no reason, though, to write anything more complicated than a plain null as the argument - nothing along the lines of that nullif() call.
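To make the point about views concrete, here is a rough sketch (object names are invented); the typed null fixes the data type of the derived column, and this form works in both DB2 and Oracle:

CREATE VIEW order_audit AS
SELECT order_id,
       CAST(NULL AS TIMESTAMP) AS archived_at   -- the column gets the TIMESTAMP type
FROM orders;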
Also, what's the VB.NET function that will map all those different characters into their most standard form?
For example, ToLower would map A and a to the same character, right?
I need the same function for these characters
german
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and later when I insert Χίος MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change the character case. It covers only cases where the characters are basically equivalent - look the same1, mean the same thing. For example,
Χιοσ != Χίος.
The two sigma characters are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
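A minimal MySQL-flavoured sketch of that layout (all names are invented): the key is a generated integer and the Unicode text is just an attribute.

CREATE TABLE place (
    place_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- meaningless surrogate key
    name VARCHAR(100) CHARACTER SET utf8mb4 NOT NULL            -- the Unicode string, demoted to an attribute
);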
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
1Almost the same in case of compatibility normalization as opposed to canonical normalization.
I use this function to strip all non-numeric characters from a field before writing to a MySQL DB:
function remove_non_numeric($inputtext) {
    return preg_replace("/[^0-9]/", "", $inputtext);
}
Does this effectively escape the input data to prevent SQL Injection? I could wrap this function in mysql_real_escape_string, but thought that might be redundant.
Assumption is the mother of all bleep when it comes to sql injection. Wrap it in mysql_real_escape_string anyway.
It does not escape the data, but it is indeed an example of an OWASP recommended approach.
By removing all but numerics from the input you are effectively protecting against SQL-Injection by implementing a White list. There is no amount of paranoia that can make the resulting string (in this specific case) into an effective SQL Injection payload.
However, code ages, changes, and is misunderstood as it's inherited by new developers. So the bottom line, the correct advice, the be-all and end-all, is to actively protect against SQL injection with one or more of the following 3 steps. In this order. Every. Single. Time.
Use a safe database API. (prepared statements or parametrized queries for example)
Use db specific escaping or escaping routines (mysql_real_escape_string falls into this category).
White list the domain of acceptable input values. (Your proposed numeric solution falls into this category)
mysql_real_escape_string is not the answer for all anti-SQL-injection. It's not even the most robust method, but it works. Stripping all but numerics is whitelisting the safe values and is also a sound idea - however, neither is as good as using a safe API.
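For completeness, a small sketch of option 1 using MySQL's server-side prepared-statement syntax (the table and value are hypothetical; in practice you would use the same mechanism through your driver, e.g. mysqli or PDO prepared statements):

PREPARE find_order FROM 'SELECT * FROM orders WHERE account_no = ?';
SET @acct = '12345';              -- user input is bound as a parameter, never concatenated into the SQL text
EXECUTE find_order USING @acct;
DEALLOCATE PREPARE find_order;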
I am trying to get the index of the Nth occurrence of a character inside a varchar(MAX). Right now I am cutting off the front of the varchar so I only have to check for the first occurrence, but with extremely large varchars the process is getting way too slow. I would like to parse through the varchar(MAX), keep track of what index I am at, and get the next character I want AFTER that index. If I could do this without having to constantly cut off the front of a large varchar, I think it would increase performance a good amount. Thanks.
EDIT:
Right now, to parse a large amount of text, I use CHARINDEX to get the index of the char, THEN I SUBSTRING the text to remove everything up to that first index. Now I call CHARINDEX again (which effectively retrieves the 2nd index of that character inside the text). However, this SUBSTRINGing is very taxing on the system and I want to avoid it.
EDIT: Ah, sorry, my title was very misleading; it should be more straightforward now.
The built-in string functions are rather limited in T-SQL. You need to combine SUBSTRING() and CHARINDEX() to do this, but these are not fast operations.
You can also use a tally table.
But I think the best way is to use a CLR procedure for a task like yours. It will be more efficient and faster.
One way to increase performance when doing procedural work (as opposed to set-based work) is to use CLR Integration with SQL Server. I got this tip from Chris Lively in this answer when asking a question about performance. Although I haven't tested it myself, the documentation around CLR integration with SQL server suggests that the CLR is much more performant than T-SQL for procedural operations. For set-based operations, though, use T-SQL.
I think the CHARINDEX or SUBSTRING functions will solve your problem.
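One thing worth noting, assuming SQL Server: CHARINDEX accepts an optional third argument, the position to start searching from, so you can walk the string without SUBSTRINGing it each time. A rough sketch with made-up variables:

DECLARE @text varchar(max) = 'a,b,c,d,e';
DECLARE @pos bigint = 0;

SET @pos = CHARINDEX(',', @text, @pos + 1);   -- first occurrence
SET @pos = CHARINDEX(',', @text, @pos + 1);   -- second occurrence, no SUBSTRING needed
SELECT @pos AS second_comma_position;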
The title pretty much frames the question. I have not used CHAR in years. Right now, I am reverse-engineering a database that has CHAR all over it, for primary keys, codes, etc.
How about a CHAR(30) column?
Edit:
So the general opinion seems to be that CHAR is perfectly fine for certain things. I, however, think that you can design a database schema that does not have a need for "these certain things", thus not requiring fixed-length strings. With the bit, uniqueidentifier, varchar, and text types, it seems that in a well-normalized schema you get a certain elegance that you don't get when you use encoded string values. Thinking in fixed lengths, no offense meant, seems to be a relic of the mainframe days (I learned RPG II once myself). I believe it is obsolete, and I did not hear a convincing argument from you claiming otherwise.
I use char(n) for codes and varchar(m) for descriptions. Char(n) seems to result in better performance because data doesn't need to move around when the size of the contents changes.
Where the nature of the data dictates the length of the field, I use CHAR. Otherwise VARCHAR.
CHARs are still faster to process than VARCHARs in the DBMS I know well. Their fixed size allows for optimizations that aren't possible with VARCHARs. In addition, the storage requirements are slightly less for CHARs since no length has to be stored, assuming most of the rows need to fully, or near-fully, populate the CHAR column.
This is less of an impact (in terms of percentage) with a CHAR(30) than a CHAR(4).
As to usage, I tend to use CHARs when either:
the fields will generally always be close to or at their maximum length (stock codes, employee IDs, etc); or
the lengths are short (less than 10).
Anywhere else, I use VARCHARs.
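A small sketch of that split in practice (the table and columns are made up):

CREATE TABLE product (
    stock_code  char(8)      NOT NULL,   -- always exactly 8 characters
    state_code  char(2)      NOT NULL,   -- short, fixed-length code
    description varchar(200) NOT NULL    -- genuinely variable length
);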
I use CHAR when the length of the value is fixed. For example, we are generating a code based on some algorithm which returns a code with a specific fixed length, let's say 13.
Otherwise, I find VARCHAR better. One more reason to use VARCHAR is that when you get the value back in your application you don't need to trim it. In the case of CHAR you will get the full length of the column whether the value fills it fully or not. It gets padded with spaces, so you end up trimming every value, and forgetting to do so leads to errors.
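To illustrate the padding point (SQL Server syntax, made-up table):

CREATE TABLE promo (code char(13));
INSERT INTO promo VALUES ('ABC');
SELECT '[' + code + ']' FROM promo;           -- '[ABC          ]' : padded out to 13
SELECT '[' + RTRIM(code) + ']' FROM promo;    -- '[ABC]'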
For PostgreSQL, the documentation states that char() has no advantage in storage space over varchar(); the only difference is that it's blank-padded to the specified length.
Having said that, I still use char(1) or char(3) for one-character or three-character codes. I think that the clarity due to the type specifying what the column should contain provides value, even if there are no storage or performance advantages. And yes, I typically use check constraints or foreign key constraints as well. Apart from those cases, I generally just stick with text rather than using varchar(). Again, this is informed by the database implementation, which automatically switches from inline to out-of-line storage if the value is large enough, which some other database implementations don't do.
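For example, a PostgreSQL-flavoured sketch of that pattern (names are invented):

CREATE TABLE shipment (
    shipment_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    currency    char(3) NOT NULL CHECK (currency ~ '^[A-Z]{3}$'),  -- e.g. 'USD'
    notes       text
);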
Char isn't obsolete; it just should only be used if the length of the field never varies. In the average database, this would be very few fields, mostly some kind of code field like state abbreviations, which are a standard 2-character field if you use the postal codes. Using Char where the field length is variable means that there will be a lot of trimming going on; that is extra, unnecessary work, and the database should be refactored.