Handling dynamic (user-supplied) column names - SQL

When writing applications that manage data, it is often useful to allow the end user to create or remove classes of data that are best represented as columns. For example, I'm working on a dictionary-building application; a user might decide they want to add, say, an "alternate spelling" field to the data, which could very easily be represented as another column.
Usually, I just name the column based on whatever the user called it ("alternate_spelling" in this case); however, a user-defined string that isn't explicitly sanitized as a database identifier bothers me. Since column names can't be bound like values, I'm trying to figure out how to sanitize the column names.
So my question is: what should I be doing? Can I get away with just quoting things? There are lots of questions asking how to bind column names in SQL, and many responses saying one should never need to, but they never explain the correct approach to handling variable columns. I'm working in Python specifically, but I think this question is more general.

It depends on which database you are using...
According to the PostgreSQL documentation:
"SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable"
(Also keep in mind the maximum length allowed for the name.)
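If you are talking to PostgreSQL from Python, one option beyond validating names against these rules is to let the driver quote the identifier for you. Here is a minimal sketch, assuming psycopg2 is available; the connection string, table name and column name are just placeholders:

# Sketch: build DDL with a user-supplied column name via psycopg2's
# identifier quoting instead of pasting the raw string into the SQL.
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=dictionary")      # placeholder DSN
user_column = "alternate_spelling"                # user-supplied name

stmt = sql.SQL("ALTER TABLE {table} ADD COLUMN {col} text").format(
    table=sql.Identifier("words"),                # placeholder table name
    col=sql.Identifier(user_column),              # double-quoted by the driver
)
with conn, conn.cursor() as cur:
    cur.execute(stmt)

This sidesteps most questions about which characters are allowed, because a quoted identifier may contain almost anything; the length limit still applies.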

I was looking for something like this. I still wouldn't trust it with user-supplied names - I'd look those up from the database catalog instead - but I think it is robust enough to check data provided by your own backend.
That is, just because something comes from your internal data tables or YAML config files doesn't 100% mean that an attacker couldn't have compromised those sources, so why not add another layer of checking right before composing SQL queries?
This is written for PostgreSQL but should mostly work elsewhere. No, it doesn't cover ALL characters that are possible in column and table names, only those used in my databases.
import re

class SecurityException(Exception):
    """Concerns security."""

class UnsafeSqlException(SecurityException):
    """SQL fragment looks unsafe."""

def is_safe_sql_name(sql: str, error_on_empty: bool = False, raise_on_false: bool = True) -> bool:
    """Check that something looks like an SQL object name."""
    # Letters, digits and underscores only, must start with a letter, 255 chars max.
    patre = re.compile("^[a-z][a-z0-9_]{0,254}$", re.IGNORECASE)
    if not isinstance(sql, str):
        raise TypeError(f"sql should be a string {sql=}")
    if not sql:
        if error_on_empty:
            raise ValueError(f"empty sql {sql=}")
        return False
    res = bool(patre.match(sql))
    if not res and raise_on_false:
        raise UnsafeSqlException(f"{sql=}")
    return res
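As a usage sketch (the words table and the ALTER TABLE statement below are invented for illustration), the check is meant to sit immediately before the point where the name is spliced into a query:

# Hypothetical usage: validate a user-supplied column name before
# composing DDL; raises UnsafeSqlException if the name looks unsafe.
user_column = "alternate_spelling"
is_safe_sql_name(user_column)
ddl = f'ALTER TABLE words ADD COLUMN "{user_column}" text'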

Related

Creating a table in NexusDB with German umlauts?

I'm trying to import a CREATE TABLE statement in NexusDB.
The table name contains some German umlauts, and so do some of the field names, but I receive an error saying there are invalid characters in my statement (obviously the umlauts...).
My question is now: can somebody give a solution or any ideas to solve my problem?
It's not so easy to just change the umlauts into equivalent spellings like ä -> ae or ö -> oe, since our application has fixed table names that every customer currently uses.
It is not a good idea to use characters outside what is normally permitted in the SQL standard. This will bite you not only in NexusDB, but in many other databases as well. Take special note that there is a good chance you will also run into problems when you want to access data via ODBC etc, as other environments may also have similar standard restrictions. My strong recommendation would be to avoid use of characters outside the SQL naming standard for tables, no matter which database is used.
However... having said all that, given that NexusDB is one of the most flexible database systems for the programmer (it comes with full source), there is already a solution. If you add an "extendedliterals" define to your database server project, then a larger array of characters is considered valid. For the exact change this enables, see the nxcValidIdentChars constant in the nxllConst.pas unit. The constant may also be changed if required.

Table or column name cannot start with numeric?

I tried to create a table named 15909434_user with the syntax below:
CREATE TABLE 15909434_user ( ... )
It produced an error, of course. Then, after a bit of research with Google, I found a good article here that describes the following:
When you create an object in PostgreSQL, you give that object a name. Every table has a name, every column has a name, and so on. PostgreSQL uses a single data type to define all object names: the name type.
A value of type name is a string of 63 or fewer characters. A name must start with a letter or an underscore; the rest of the string can contain letters, digits, and underscores.
...
If you find that you need to create an object that does not meet these rules, you can enclose the name in double quotes. Wrapping a name in quotes creates a quoted identifier. For example, you could create a table whose name is "3.14159"—the double quotes are required, but are not actually a part of the name (that is, they are not stored and do not count against the 63-character limit). ...
Okay, now I know how to solve this by using this syntax (putting double quotes around the table name):
CREATE TABLE "15909434_user" ( ... )
You can create a table or column name such as "15909434_user" and also user_15909434, but you cannot create a table or column name that begins with a digit without using double quotes.
So now I am curious about the reason behind that (other than it being a convention). Why is this convention applied? Is it to avoid some syntax limitation, or is there another reason?
Thanks in advance for your attention!
It comes from the original SQL standards, which through several layers of indirection eventually define an identifier start block. That can be one of several things, but primarily it is "a simple latin letter". Other characters can be used as well; if you want to see all the details, go to http://en.wikipedia.org/wiki/SQL-92 and follow the links to the actual standard (page 85).
Having non-numeric identifier introducers makes writing a parser to decode SQL for execution easier and quicker, but a quoted form is fine too.
Edit: Why is it easier for the parser?
The problem for a parser is more in the SELECT-list clause than the FROM clause. The select-list is the list of expressions that are selected from the tables, and this is very flexible, allowing simple column names and numeric expressions. Consider the following:
SELECT 2e2 + 3.4 FROM ...
If table and column names could start with digits, is 2e2 a column name or a valid number (e-notation is typically permitted in numeric literals)? And is 3.4 the column "4" of table "3", or the numeric value 3.4?
Having the rule that identifiers start with simple Latin letters (and some other specific things) means that a parser that sees 2e2 can quickly discern that it will be a numeric expression; the same goes for 3.4.
While it would be possible to devise a scheme to allow numeric leading characters, this might lead to even more obscure rules (opinion), so this rule is a nice solution. If you allowed digits first, then it would always need quoting, which is arguably not as 'clean'.
Disclaimer: I've simplified the above slightly, ignoring correlation names to keep it short. I'm not totally familiar with Postgres, but I have double-checked the above answer against the Oracle RDB documentation and the SQL spec.
I'd imagine it's to do with the grammar.
SELECT 24*DAY_NUMBER as X from MY_TABLE
is fine, but would be ambiguous if 24 were allowed as a column name.
Adding quotes means you're explicitly referring to an identifier not a constant. So in order to use it, you'd always have to escape it anyway.
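Both points are easy to reproduce; here is an illustrative sketch using Python's built-in sqlite3, which follows the same identifier convention (the table name is the one from the question):

# Sketch: an unquoted identifier may not start with a digit, a quoted one may.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "15909434_user" (id INTEGER)')        # quoted: accepted
conn.execute('INSERT INTO "15909434_user" VALUES (1)')
print(conn.execute('SELECT * FROM "15909434_user"').fetchall())  # [(1,)]

try:
    conn.execute("SELECT * FROM 15909434_user")                  # unquoted: rejected
except sqlite3.OperationalError as exc:
    print("parser error:", exc)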

What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

Also, what's the VB.net function that will map all those different characters onto their most standard form?
For example, ToLower would map A and a to the same character, right?
I need the same function for these characters
german
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and later, when I insert Χίος, MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those variant characters onto a more stable form.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization never changes character case. It covers only cases where the characters are basically equivalent: they look the same [1] and mean the same thing. For example,
Χιοσ != Χίος
because the two sigma characters are considered non-equivalent, while the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) followed by the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
[1] Almost the same, in the case of compatibility normalization as opposed to canonical normalization.
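The same behaviour can be demonstrated outside .NET; here is an illustrative sketch in Python, where unicodedata.normalize plays the role of String.Normalize:

# Sketch: canonical normalization (NFC) unifies composed/decomposed accents,
# but it never merges genuinely different letters such as σ and ς, or ß and s.
import unicodedata

decomposed = "\u03b9\u0313"    # plain iota followed by the combining accent
composed = "\u1f30"            # precomposed accented iota
print(unicodedata.normalize("NFC", decomposed) == composed)        # True

print(unicodedata.normalize("NFC", "Χιοσ") ==
      unicodedata.normalize("NFC", "Χίος"))                        # False
print(unicodedata.normalize("NFC", "ß") == "s")                    # False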

How can you query a SQL database for malicious or suspicious data?

Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
The updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE character since you want to allow dashes (-); inside the brackets an unescaped dash would be read as a range separator.
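If you want to sweep every text column rather than one field at a time, you can generate that check from the catalog. A hedged sketch in Python with pyodbc (Python is used only because it is the language elsewhere on this page; the DSN and table name are placeholders), pulling column names from INFORMATION_SCHEMA so that no user input is interpolated:

# Sketch: run the suspicious-character scan over every text column of a table.
import pyodbc

conn = pyodbc.connect("DSN=MyAppDb")   # placeholder connection string
table = "MyTable"                      # placeholder table name

cur = conn.cursor()
cur.execute(
    "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_NAME = ? AND DATA_TYPE IN "
    "('char','nchar','varchar','nvarchar','text','ntext')",
    table,
)
text_columns = [row.COLUMN_NAME for row in cur.fetchall()]

for col in text_columns:
    # The column names come from the catalog, not from user input,
    # so bracketing them here is reasonable; the LIKE pattern is the one above.
    cur.execute(
        f"SELECT * FROM [{table}] WHERE [{col}] LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'"
    )
    for row in cur.fetchall():
        print(col, row)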
For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
Reason being there are just simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you too far.

Preg_replace solution for prepared statements

I have a command class that abstracts almost all database-specific functions (we have exactly the same application running on MSSQL 2005 (using ODBC and the native mssql library), MySQL, and Oracle). But sometimes we have problems with our prepare method, which, when executed, replaces all placeholders with their respective values. The problem is that I am using the following:
if (is_array($Parameter['Value']))
{
    $Statement = str_ireplace(':'.$Name, implode(', ', $this->Adapter->QuoteValue($Parameter['Value'])), $Statement);
}
else
{
    $Statement = str_ireplace(':'.$Name, $this->Adapter->QuoteValue($Parameter['Value']), $Statement);
}
The problem arises when we have two or more similar parameter names, for example session_browser and session_browser_version... The first one will partially replace the second one.
Of course we learned to work around it by specifying the parameters in a specific order, but now that I have some "free" time I want to make it better, so I am thinking of switching to preg_replace... and I am not good with regular expressions; can anyone help with a regex to replace a string like ':parameter_name'?
Best Regards,
Bruno B B Magalhaes
You should use the \b metacharacter to match the word boundary, so you don't accidentally match a short parameter name within a longer parameter name.
Also, you don't need to special-case arrays if you coerce a scalar Value to an array of one entry:
preg_replace("/:$Name\b/",
implode(",", $this->Adapter->QuoteValue( (array) $Parameter['Value'] )),
$Statement);
Note, however, that this can make false positive matches when an identifier or a string literal contains a pattern that looks like a parameter placeholder:
SELECT * FROM ":Name";
SELECT * FROM Table WHERE column = ':Name';
This gets even more complicated when quoted identifiers and string literals can contain escaped quotes.
SELECT * FROM Table WHERE column = 'word\':Name';
You might want to reconsider interpolating variables into SQL strings during prepare, because you're defeating any benefits of prepared statements with respect to security or performance.
I understand why you're doing what you're doing, because not all RDBMS back-ends support named parameters, and SQL parameters also can't be used for lists of values in an IN() predicate. But you're creating an awfully leaky abstraction.
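For the IN() limitation specifically, one common compromise that keeps real bound parameters is to generate the right number of placeholders and still bind the values. A sketch in Python (sqlite3 is used purely for illustration; the table and column names are made up):

# Sketch: expand a list into one placeholder per element and bind the values,
# instead of interpolating quoted literals into the SQL text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (id INTEGER, session_browser TEXT)")
conn.executemany("INSERT INTO session VALUES (?, ?)",
                 [(1, "firefox"), (2, "chrome"), (3, "safari")])

browsers = ["firefox", "safari"]                    # the IN() list
placeholders = ", ".join("?" for _ in browsers)     # "?, ?"
query = f"SELECT id FROM session WHERE session_browser IN ({placeholders})"
print(conn.execute(query, browsers).fetchall())     # [(1,), (3,)]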