How to store sensitive information in SQL Server 2008? - sql

I need to store some sensitive information in a table in SQL Server 2008.
The data is a string and I do not want it to be in human readable format to anyone accessing the database.
By sensitive information I mean a database of dirty/foul words. I need to make sure that they are not floating around in plain text in tables and SQL files.
At the same time, I should be able to perform operations like "=" and "like" on the strings.
So far I can think of two options; will these work or what is a better option?
Store string (varchar) as binary data (BLOB)
Store in some encrypted format, like we usually do with passwords.

A third option, which may be most appropriate, is to simply not store these values in the particular database at all. I would argue that it is probably more appropriate to store them elsewhere, since you're probably not going to JOIN against the table of sensitive words.
Otherwise, you probably want to use Conrad Frix's suggestion of SQL Server's built-in encryption support.
The reason I say this is that you say both = and LIKE must work across your data. When you hash a string using a hash algorithm such as SHA or MD5, the results won't obey human-language LIKE semantics.
If exact equality (=) is sufficient (i.e. you don't really need to be able to do LIKE queries), you can use a cryptographic function to secure the text. But keep in mind that a one-way hash function would prevent you from getting the list of strings back "un-hashed". If you need to do that, you need an encryption algorithm where decryption is possible, such as AES.

If you use rot13, then you can still use = and LIKE. This also applies to any storage method other than an SQL database, if preventing casual/accidental views (including search engine indexing, if the list is public) is that important.
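A minimal sketch of why rot13 keeps both operations working, written in Python rather than T-SQL. The word list and helper names are placeholders, not anything from the question. Because rot13 substitutes letter for letter, a prefix of the plain word encodes to a prefix of the stored value, and LIKE wildcards such as % and _ are left untouched. Note that rot13 is obfuscation only, not encryption.

```python
import codecs


def rot13(text: str) -> str:
    """Obfuscate (or de-obfuscate) text; rot13 is its own inverse."""
    return codecs.encode(text, "rot_13")


# Hypothetical stored column: only the obfuscated form is kept.
stored = [rot13(word) for word in ["darn", "darned", "heck"]]


def equals(word: str) -> bool:
    """Equivalent of: WHERE word = @encoded."""
    return rot13(word) in stored


def starts_with(prefix: str) -> list:
    """Equivalent of: WHERE word LIKE @encoded_prefix + '%'."""
    encoded_prefix = rot13(prefix)
    return [rot13(s) for s in stored if s.startswith(encoded_prefix)]


print(stored)
print(equals("heck"))
print(starts_with("darn"))
```

The same idea carries over to SQL: encode the search term in the application, then compare it against the encoded column.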

Related

Masking/Hashing data

As a SQL DBA, I need to export data that contains personal/sensitive information such as the national identification number (NiN). This field is a unique 10-digit number, and our company's policy does not allow such data to be exported. Is there any way I can generate a new field from the NiN with a different value but the same length? I need this value to be consistent across all tables so that we can use the new field to JOIN data instead of using the NiN.
I am thinking of the HashBytes function, but its output does not have the same length as the NiN (10 digits).
The data set is huge, so it's important to avoid collisions. What's the best way to do this?
Thanks
First, I would change the format of the produced value to be different from the internal version. That will make it much simpler to see right away if there is an issue.
Second, you can use a hashing algorithm such as SHA-256, which is very unlikely to produce collisions. That might be good enough.
Third, you need to think through the security requirements better. My preferred solution is to have a look-up table that matches internal numbers to external values. Then, this table is used for all exports and imports to translate between the two. A suggestion here would be to use newid() to generate the value and to use GUIDs for external data.
However, this may not be sufficient for your requirements. Why? The same number has the same value over time. So, although you might be able to hide the internal value and even forget it, a given external value still matches a single number, which ties external records together.
The solution to this is something called "salt" in the hashing function. This allows the external value to change over time, while still mapping to the same internal number.
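One way to sketch the salt idea is with a keyed hash. The example below uses HMAC-SHA256 from Python's standard library, with the salt as the key; this is an assumption about how the salt would be applied, not something the answer specifies. The function name and salt values are hypothetical. The output is 64 hex characters rather than 10 digits; truncating it to 10 digits would make collisions far more likely.

```python
import hashlib
import hmac


def pseudonymize(nin: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of the NiN as a hex string."""
    return hmac.new(salt, nin.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical salts: one secret per export batch, kept out of the export.
salt_january = b"secret-salt-for-january-export"
salt_february = b"secret-salt-for-february-export"

nin = "1234567890"

# Within one export the value is stable, so it can be used as a JOIN key.
january_a = pseudonymize(nin, salt_january)
january_b = pseudonymize(nin, salt_january)

# A different salt yields a different external value for the same NiN.
february = pseudonymize(nin, salt_february)

print(january_a == january_b)
print(january_a == february)
```

Whoever holds the salt can recompute the mapping for a known NiN, so the salt needs the same protection as the NiN itself.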

Why can't database engines use an auto column datatype?

When defining a table in a database, we set column types such as int, varchar, etc.
Why can't the type be set to auto?
The database would recognize the type from the input and set it itself, much like PHP handles variables. Also, YouTube wouldn't have had to crash their stats counter.
In some instances, you can (e.g. SQL Server's sql_variant or Oracle's ANYDATA).
Whether that's a good idea is another matter...
First of all, not having an explicit schema (in the database) doesn't mean you can avoid an implicit schema (implied by what your client application expects). For example, just because you have a string in the database doesn't mean your application will be able to work with it if it was originally implemented to expect a number. This is really an argument about dynamic vs. static typing, and static typing is generally acknowledged as superior for non-trivial programs.
Furthermore, these "variant" types often have significant overhead and are problematic from the indexing and sargability perspective. For example, is 2 smaller than 10? Yes if these are numbers, no if they are strings!
So if you know what you expect from the database (which you do in 99% of cases), tell that to the database, so it can type-check and optimize it for you!
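The "is 2 smaller than 10?" point is easy to reproduce outside a database. This small Python sketch (the sample values are arbitrary) shows the same data ordering differently depending on whether it is typed as numbers or as strings, which is the ambiguity an untyped column would leave the engine to guess at.

```python
# The same three values, once typed as integers and once as strings.
numbers = [10, 2, 33]
strings = ["10", "2", "33"]

# Numeric comparison: 2 is smaller than 10.
print(2 < 10)

# String comparison is character by character: "2" sorts after "1".
print("2" < "10")

# Sorting (and therefore index order) differs between the two typings.
sorted_numbers = sorted(numbers)
sorted_strings = sorted(strings)
print(sorted_numbers)
print(sorted_strings)
```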

Get original sql query in postgres extension in C

I am creating a Postgres extension in C (C++). It is a new data type that behaves like text but is encrypted by an HSM device. My problem is using more than one key to protect the data. My idea is to get the original SQL query and process it to choose which key to use, but I don't know how to do that, or whether it is even possible.
My goal is to change some existing text fields in the database to encrypted ones. That's why I can't provide a key number to my type directly: the type must be seen by external apps as text.
Normally there is a userID field, and a single query always uses that ID to get or set encrypted data. I want to choose the key based on that field. The HSM can hold billions of keys, which means every user can have their own key. It's not a problem if I need to parse the query string myself; I am more than capable of doing that. Performance is not an issue either: the HSM is so slow that I can encode or decode only a couple of fields per second.
In most parts of the planner and executor the current (sub)query is available in a passed PlannerInfo struct, usually:
PlannerInfo *root
This has a parse member containing the Query object.
Earlier in the system, in the rewriter, it's passed as Query *root directly.
In both cases, if there's evaluation of a nested subquery going on, you get the subquery. There's no easy way to access the parent Query node.
The query tree isn't always available deeper in execution paths, such as in expression evaluation. You're not supposed to be referring to it there; expressions are self contained, and don't need to refer to the rest of the query.
So you're going to have a problem doing what you want. Frankly, that's because it's a pretty bad design, from the sound of it. What you should consider instead is:
Using a function to encode/decode the type to/from cleartext, allowing you to pass parameters; or possibly
Using the typmod of the type to store the desired information (but be aware that the typmod is not preserved across casts, subqueries, etc).
There's also the debug_query_string global, but really don't use that. It's unparsed query text so it won't help you anyway. If you (ab)use this in your code, I will cry. I'm only telling you it exists so I can tell you not to use it.
By far and away your best option is going to be to use a function-based interface for this.

Database many columns vs one string column processed at runtime

The obvious answer is that having one column as a string is pointless because you cannot perform any decent database queries on it. But say that you wanted to store user information, for example, such as their privacy preferences for their username, name and email (etc....).
Would having this as a single column named, for example, "settings" be better performance-wise, since these variables are not used with any other models?
This variable would be something like "{username => true, name => true, email => false}" and can be processed at runtime.
The string that you're creating would consume more space than storing the separate columns, so the table rows would be larger, and caching on the database server would be less efficient. I don't see a potential performance benefit at all.
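As a rough illustration of the size argument, the Python sketch below compares a serialized settings string with three one-byte flags. JSON is used as a stand-in for the question's format, and the one-byte-per-flag figure is an assumption; actual boolean storage varies by database engine.

```python
import json
import struct

# The privacy settings from the question, as a mapping.
settings = {"username": True, "name": True, "email": False}

# Option 1: one string column holding the serialized settings.
as_string = json.dumps(settings).encode("utf-8")

# Option 2: three separate boolean columns, assumed to take one byte each.
as_columns = struct.pack("???", settings["username"], settings["name"], settings["email"])

print(len(as_string), "bytes as a string")
print(len(as_columns), "bytes as three flags")

# The string also has to be parsed before any single setting can be read.
parsed = json.loads(as_string)
print(parsed["email"])
```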
I think that the only situations in which this might be acceptable database practice are:
Where you only ever treat the string that represents many data elements as a single item of information: for example, if you wanted to store a copy of a JSON or XML message that you have received or transmitted.
Where the data item represents a set of data elements that you could not know in advance, and you do not have the resources to use something like a key-value store for some reason.
There are a lot of factors to consider in performance. If the connection or your database server is slow, then you might be correct in doing what you suggest (to improve performance).
However, keep in mind that you still have to parse the string.
In most professional projects this would be considered horrible practice. Doing it the way you suggest makes the code very unmaintainable. It opens the door to using the same field for a variety of things, like favorite color and such.
If you are tinkering in your basement, do this, but know it is horrible if you want to make a living writing code.

Are input sanitization and parameterized queries mutually exclusive?

I'm working on updating some legacy code that does not properly handle user input. The code does a minimal amount of sanitization, but does not cover all known threats.
Our newer code uses parameterized queries. As I understand it, the queries are precompiled, and the input is treated simply as data which cannot be executed. In that case, sanitization is not necessary. Is that right?
To put it another way, if I parameterize the queries in this legacy code, is it OK to eliminate the sanitization that it currently does? Or am I missing some additional benefit of sanitization on top of parameterization?
It's true that SQL query parameters are a good defense against SQL injection. Embedded quotes or other special characters can't make mischief.
But some components of SQL queries can't be parameterized. E.g. table names, column names, SQL keywords.
$sql = "SELECT * FROM MyTable ORDER BY {$columnname} {$ASC_or_DESC}";
So there are some examples of dynamic content you may need to validate before interpolating into an SQL query. Whitelisting values is also a good technique.
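A minimal whitelisting sketch for the ORDER BY case above, in Python rather than PHP. The allowed column names and the function name are hypothetical. The point is that the interpolated pieces can only ever be values the code itself lists, never raw user input.

```python
# Hypothetical set of columns the application allows sorting by.
ALLOWED_COLUMNS = {"name", "created_at", "price"}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}


def build_order_by_query(column: str, direction: str) -> str:
    """Build the query only from whitelisted identifiers and keywords."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"column not allowed: {column!r}")
    direction = direction.upper()
    if direction not in ALLOWED_DIRECTIONS:
        raise ValueError(f"direction not allowed: {direction!r}")
    # Safe to interpolate: both values are known members of fixed sets.
    return f"SELECT * FROM MyTable ORDER BY {column} {direction}"


print(build_order_by_query("price", "desc"))

try:
    build_order_by_query("price; DROP TABLE MyTable", "ASC")
except ValueError as error:
    print("rejected:", error)
```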
Also you could have values that are permitted by the data type of a column but would be nonsensical. For these cases, it's often easier to use application code to validate than to try to validate in SQL constraints.
Suppose you store a credit card number. There are valid patterns for credit card numbers, and libraries to recognize a valid one from an invalid one.
Or how about when a user defines her password? You may want to ensure sufficient password strength, or validate that the user entered the same string in two password-entry fields.
Or if they order a quantity of merchandise, you may need to store the quantity as an integer but you'd want to make sure it's greater than zero and perhaps if it's greater than 1000 you'd want to double-check with the user that they entered it correctly.
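The quantity example might look like this in application code. This is only a sketch: the function name is hypothetical, and the 1000 threshold is the one mentioned above. It rejects values a plain integer column would accept, and flags large ones for confirmation instead of rejecting them.

```python
from typing import Tuple


def check_quantity(raw: str) -> Tuple[int, bool]:
    """Return (quantity, needs_confirmation); raise ValueError if invalid."""
    quantity = int(raw)  # raises ValueError for non-numeric input
    if quantity <= 0:
        raise ValueError("quantity must be greater than zero")
    # Large orders are allowed, but worth double-checking with the user.
    needs_confirmation = quantity > 1000
    return quantity, needs_confirmation


print(check_quantity("3"))
print(check_quantity("5000"))

try:
    check_quantity("-2")
except ValueError as error:
    print("rejected:", error)
```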
Parameterized queries will help prevent SQL injection, but they won't do diddly against cross-site scripting. You need other measures, like HTML encoding or HTML detection/validation, to prevent that. If all you care about is SQL injection, parameterized queries are probably sufficient.
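As a small illustration of HTML encoding, here is a sketch using `html.escape` from Python's standard library. The comment text is made up. The value can be stored safely through a parameterized query, yet it still needs encoding when it is written into a page.

```python
import html

# A value that is harmless to the database but not to a browser.
comment = '<script>alert("x")</script>'

# Encode at output time so the browser renders it as text, not markup.
encoded = html.escape(comment)
print(encoded)
```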
There are many different reasons to sanitize and validate, including preventing cross-site scripting, and simply wanting the correct content for a field (no names in phone numbers). Parameterized queries eliminate the need to manually sanitize or escape against SQL injection.
You are correct, SQL parameters are not executable code so you don't need to worry about that.
However, you should still do a bit of validation. For example, if you expect a varchar(10) and the user inputs something longer than that, you will end up with an exception.
In short, no. Input sanitization and the use of parameterized queries are not mutually exclusive; they are independent: you can use neither, either one alone, or both. They prevent different types of attacks. Using both is the best course.
It is important to note, as a minor point, that sometimes it is useful to write stored procedures which contain dynamic SQL. In this case, the fact that the inputs are parameterized is no automatic defense against SQL injection. This may seem a fairly obvious point, but I often run into people who think that because their inputs are parameterized they can just stop worrying about SQL Injection.
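The dynamic SQL point can be sketched with Python's built-in sqlite3 module standing in for a stored procedure. SQLite has no stored procedures, so the two functions here are hypothetical analogues. Both receive the name as an ordinary parameter, but the first builds its SQL by concatenation internally, which is what dynamic SQL inside a procedure amounts to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_dynamic(name: str) -> list:
    """Analogue of a procedure that concatenates its parameter into SQL."""
    sql = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(sql)]


def find_user_bound(name: str) -> list:
    """Analogue of a procedure that keeps the parameter as bound data."""
    sql = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(sql, (name,))]


malicious = "' OR '1'='1"

# Concatenation lets the input rewrite the WHERE clause: every row matches.
print(find_user_dynamic(malicious))

# Binding treats the input as a literal name: nothing matches.
print(find_user_bound(malicious))
```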