Encrypted database query - SQL

I've just found out about Stack Overflow and I'm checking whether there are ideas for a constraint I'm facing with some friends on a project, though this is more of a theoretical question to which I've been trying to find an answer for some time.
I'm not much given into cryptography but if I'm not clear enough I'll try to edit/comment to clarify any questions.
Trying to be brief, the environment is something like this:
An application where the front end has access to the encryption/decryption keys and the back end is just used for storage and queries.
There is a database in which a couple of fields cannot be accessed in the clear, for example "address", which is text/varchar as usual.
You don't have access to the key for decrypting the information, and all information arrives at the database already encrypted.
The main problem is this: how do you consistently make queries on the database? It's impossible to do stuff like "where address like '%F§YU/´~#JKSks23%'". (If anyone has an answer for this, feel free to shoot.)
But is it OK to do where address='±!NNsj3~^º-:'? Or would it also completely eat up the database?
Another constraint that might apply is that the front end doesn't have much processing power available, so encrypting/decrypting information already pushes it to its limits. (Saying this just to avoid replies like "export a join of tables to the front end and query it there".)
Could someone point me in a direction to keep thinking about it?
Well, thanks for such fast replies at 4 AM; for a first-time usage I'm really impressed with this community. (Or maybe it's just the different time zone.)
Just feeding some information:
The main problem is all around partial matching, as a mandatory requirement in most databases is to allow partial matches. The main constraint is actually that the database owner is not allowed to look inside the database for information. During the last 10 minutes I've come up with a possible solution, which again raises possible database problems, and which I'll add here:
Possible solution to allow semi partial matching:
The password + a couple of public fields of the user are actually the key for encrypting. For authentication the idea is to encrypt a static value and compare it with the value stored in the database.
Create a new set of tables where information is stored in a parsed way, meaning something like: "4th Street" would become 2 encrypted rows (one for '4th', another for 'Street'). This would already allow semi-partial matching, as a search could be performed on the separate tables (a rough sketch of what I mean follows).
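Something like this is what I have in mind; the HMAC below is just a stand-in for whatever deterministic keyed transform the front end would really apply, and every name is invented for illustration:
import hashlib
import hmac

def search_tokens(value, key):
    # Split a field into words and derive a deterministic, keyed token per word.
    # A keyed hash like this only supports equality lookups, never decryption.
    return [hmac.new(key, word.lower().encode(), hashlib.sha256).hexdigest()
            for word in value.split()]

key = b"front-end-only-secret"             # never leaves the front end
tokens = search_tokens("4th Street", key)  # -> two tokens, one row each
# INSERT INTO address_tokens (record_id, token) VALUES (...) for each token.
# Searching for 'Street' then becomes an exact match on its token:
needle = search_tokens("Street", key)[0]
# SELECT record_id FROM address_tokens WHERE token = '<needle>'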
New question:
Would this probably eat up the database server again, or does anyone think it is a viable solution for the partial matching problem?
Post Scriptum: I've unaccepted the answer from Cade Roux just to allow for further discussion and especially a possible answer to the new question.

You can do it the way you describe - effectively querying the hash, say - but there are not many systems with that requirement, because at that point the security requirements are interfering with other requirements for the system to be usable - i.e. no partial matches, since the encryption rules that out. It's the same problem with compression. Years ago, in a very small environment, I had to compress the data before putting it in the data format. Of course, those fields could not easily be searched.
In a more typical application, ultimately, the keys are going to be available to someone in the chain - probably the web server.
For end-user traffic, SSL protects that pipe. Some network switches can protect it between the web server and the database, and storing encrypted data in the database is fine, but you're not going to query on encrypted data like that.
And once the data is displayed, it's out there on the machine, so any general-purpose computing device can be circumvented at that point, and it's the perimeter defenses outside of your application which really come into play.

Why not encrypt the disk holding the database tables, encrypt the database connections, and let the database operate normally?
[I don't really understand the context/constraints that require this level of paranoia]
EDIT: "law constraints" eh? I hope you're not involved in anything illegal, I'd hate to be an inadvertent accessory... ;-)
If the - ahem - legal constraints force this solution, then that's all there is to be done - no LIKE matches, and slow responses if the client machines can't handle it.

A few months ago I came across the same problem: the whole database (except for indexes) was encrypted, and the problem of partial matches came up.
I searched the Internet looking for a solution, but it seems that there's not much to do about this other than a "workaround".
The solution I've finally adopted is:
Create a temporary table with the decrypted data of the field against which the query is being performed, plus another field that is the primary key of the table (obviously, this field doesn't have to be decrypted, as it is plain text).
Perform the partial match against that temporary table and retrieve the identifiers.
Query the real table for those identifiers and return the result.
Drop the temporary table.
I am aware that this adds non-trivial overhead, but I haven't found another way to perform this task when it is mandatory that the database be fully encrypted.
Depending on the particular case, you may be able to limit the number of rows that are inserted into the temporary table without losing data from the result (only consider those rows which belong to the user that is performing the query, etc.).
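In rough Python/DB-API terms the workaround looks like this (cursor is an open DB-API cursor, table and column names are made up, decrypt() stands for whatever routine the key-holding component uses, and %s is the driver's placeholder style):
def decrypt(ciphertext):
    ...  # placeholder for the real decryption routine

cursor.execute("CREATE TEMPORARY TABLE tmp_search (id INTEGER, address_plain TEXT)")

# Decrypt the searchable field once, keyed by the plain-text primary key.
cursor.execute("SELECT id, address_encrypted FROM customers")
rows = [(row_id, decrypt(ciphertext)) for row_id, ciphertext in cursor.fetchall()]
cursor.executemany("INSERT INTO tmp_search (id, address_plain) VALUES (%s, %s)", rows)

# Partial match against the temporary copy, then fetch the real rows by id.
cursor.execute("SELECT id FROM tmp_search WHERE address_plain LIKE %s", ("%Street%",))
ids = [r[0] for r in cursor.fetchall()]
# ... SELECT the real (encrypted) rows FROM customers WHERE id IN (ids) ...

cursor.execute("DROP TABLE tmp_search")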

You want to use MD5 hashing. Basically, it takes your string and turns it into a hash that cannot be reversed. You can then use it to validate against things later. For example:
$salt = "123-=asd";
$address = "3412 g ave";
$sql = "INSERT INTO addresses (address) VALUES ('" . md5($salt . $address) . "')";
mysql_query($sql);
Then, to validate an address in the future:
$salt = "123-=asd";
$address = "3412 g ave";
$sql = "SELECT address FROM addresses WHERE address = '" . md5($salt . $address) . "'";
$res = mysql_query($sql);
if (mysql_fetch_row($res)) {
    // exists
} else {
    // does not
}
Now it is hashed on the database side, so nobody can read the original value from the database. However, if they find the salt (which is in your source code), it becomes easier for them to brute-force the original values.
http://en.wikipedia.org/wiki/MD5

If you need to store sensitive data that you want to query later, I'd recommend storing it in plain text and restricting access to those tables as much as you can.
If you can't do that, and you don't want overhead in the front end, you can make a component in the back end, running on a server, that processes the encrypted data.
Making queries against encrypted data? If you're using a good encryption algorithm, I can't imagine how to do that.

Related

How can data be synchronized between processes in SQL?

I'm wondering something perhaps extremely stupid, but I can't seem to find an answer (which is not a good sign, usually).
Assume we have an SQL server (MySQL, PostgreSQL; this question even applies to SQLite3, though there's no server) and several clients connected to it. I've seen countless queries that, in my opinion, might be hard to sync.
So let's assume we have a table (usage statistics, say) with a row per day.
statistic (
day,
num_requests
)
(I avoid mentioning data types, since it's not the point, but the number of requests should be a number of some sort.)
So when a new web request is sent, the web server will ask this table's current statistic and increase the number of requests. No biggie right?
cursor.execute("""
SELECT num_requests FROM statistic
WHERE ...
""")
number = cursor.fetchone()[0]   # read the current value into the application
number += 1
cursor.execute("""
UPDATE statistic SET num_requests=?
WHERE ...
""", (number, ))                # write the incremented value back
But what happens if two requests are handled somewhat simultaneously, perhaps on several clients? Different processes? They each ask for today's current statistic (just a read operation, non-blocking), they get the number of requests from this row (this step doesn't involve the server), and then they increment it by 1. At this point, if both requests are running somewhat simultaneously, they have each incremented the same number once and they send an UPDATE request with their number.
In the end, the number of requests for today's statistic has increased by one, although there were two requests. I know there are mechanisms to ensure proper data synchronization, but I fail to see how they could address the situation in this case. A read is usually non-blocking as far as I know. A write can be blocking, but since the other process's read has already happened, blocking the second write doesn't prevent the lost update. And I don't see any way to express that logically.
In other words, this seems like the point where we would lock the row in most programming languages, and say "from that point onward, you can neither read it nor write it, I'm working on it". The first request would execute its read (lock), increment, and write, and then unlock. The second request would have to wait patiently for the lock to be released. I don't see that mechanism in SQL. Is it transparent and not even necessary? And if so, how does it work? Or have we lived our entire lives with problems like this?
Thanks!
cursor.execute("""
UPDATE statistic SET num_requests = num_requests + 1
WHERE ...
""")
# The increment happens atomically inside the database, so two simultaneous
# requests each add 1 and neither update is lost - there is no read-modify-write
# in the application at all.

Preventing Data Fraud and Deletion in SQL Database

We have an application that is installed on premises for many clients. We are trying to collect information that will be sent to us at a point in the future. We want to ensure that we can detect if any of our data is modified and if any data was deleted.
To prevent data from being modified, we currently hash table rows and send the hashes with the data. However, we are struggling to detect whether data has been deleted. For example, if we insert 10 records in a table and hash each row, the user won't be able to modify a record without us detecting it, but if they drop all the records then we can't distinguish this from the initial installation.
Constraints:
Clients will have admin roles to DB
Application and DB will be behind a DMZ and won't be able to connect external services
Clients will be able to profile any SQL commands and replicate any initial setup we do (to clarify: the clients can also drop/recreate tables)
Although clients can drop data and tables, there are some sets of data and tables that, if dropped or deleted, would be obvious to us during audits because they should always be accumulating data, and missing or truncated data would stand out. We want to be able to detect deletion and fraud in the remaining tables.
We're working under the assumption that clients will not be able to reverse-engineer our code base or hash/encrypt data themselves
Clients will send us all data collected every month and the system will be audited by us once a year.
Also consider that the client can take backups of the DB or snapshots of a VM in a 'good' state and then roll back to that 'good' state if they want to destroy data. We don't want to do any detection of VM snapshot or DB backup rollbacks directly.
So far the only solution we have is encrypting the install date (which could be modified) and the instance name. Then, every minute, we 'increment' the encrypted data. When we add data to the system, we hash the data row and stick the hash in the encrypted data, then continue to 'increment' it. When the monthly data is sent, we'd be able to see whether they have been deleting data and rolling the DB back to just after installation, because the encrypted value wouldn't have any increments or would have extra hashes that don't belong to any data.
Thanks
Have you looked into Event Sourcing? It could possibly be used with write-once media as secondary storage, if performance is good enough that way. That would then guarantee transaction integrity even against DB or OS admins. I'm not sure whether it's feasible to do Event Sourcing with real write-once media and still keep reasonable performance.
Let's say we have an md5() or similar function in your code, and you want to keep control over modification of the "id" fields of the table "table1". You can do something like:
accumulatedIds = "secretkey-only-in-your-program";
for every record "record" in the table "table1"
accumulatedIds = accumulatedIds + "." + record.id;
update hash_control set hash = md5(accumulatedIds) where table = "table1";
Run this after every authorized change to the information in the table "table1". Nobody could make changes outside of this system without being noticed.
If somebody changes some of the id's, you will notice that because the hash wouldn't be the same.
If somebody wants to recreate your table, unless he recreates exactly the same information, he wouldn't be capable of producing the hash again, because he doesn't know the "secretkey-only-in-your-program".
If somebody deletes a record, that can be discovered also, because the "accumulatedIds" wouldn't match. The same applies if somebody adds a record.
The user can delete the record in the hash_control table, but he can't reconstruct the hash information properly without the "secretkey...", so you will notice that also.
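For concreteness, roughly the same thing in Python (cursor is a DB-API cursor; the hash_control table and column names are just for illustration):
import hashlib

def table_fingerprint(ids, secret="secretkey-only-in-your-program"):
    # Recompute the accumulated hash for one table from its record ids.
    accumulated = secret
    for record_id in ids:
        accumulated += "." + str(record_id)
    return hashlib.md5(accumulated.encode()).hexdigest()

# After every authorized change:
cursor.execute("SELECT id FROM table1 ORDER BY id")
fingerprint = table_fingerprint(row[0] for row in cursor.fetchall())
cursor.execute("UPDATE hash_control SET hash = %s WHERE table_name = %s",
               (fingerprint, "table1"))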
What am I missing??

Super column vs serialization vs 2 lookups in Cassandra

We have:
users, each of which has events, each of which has several properties (time, type etc.). Our basic use case is to fetch all events of a given user in a given time-span.
We've been considering the following alternatives in Cassandra for the Events column-family. All alternatives share: key=user_id (UUID), column_name = event_time
column_value = serialized object of event properties. Will need to read/write all the properties every time (not a problem), but might also be difficult to debug (can't use Cassandra command-line client easily)
column is actually a super column, sub-columns are separate properties. Means reading all events(?) every time (possible, though sub-optimal). Any other cons?
column_value is a row-key to another CF, where the event properties are stored. Means maintaining two tables -> complicates calls + reads/writes are slower(?).
Anything we're missing? Any standard best-practice here?
Alternative 1: Why go to Cassandra if you are going to store a serialized object? MongoDB or a similar product would perform better on this task if I get it right (I've never actually tried a document-based NoSQL store, so correct me if I'm wrong on this one). Anyway, I tried this alternative once in MySQL 6 years ago and it is still painful to maintain today.
Alternative 2: Sorry, I haven't had to play with super columns yet. I would use this only if I frequently had to show a lot of information about many users (i.e. much more than just their username and a few qualifiers) and their respective events in one query. It could also make a query based on a given time span a bit tricky if there are conditions on the user itself too, since a user's row is likely to have event columns that fit in the span and other columns that don't.
Alternative 3: This would definitely be my choice in most cases. You are not likely to write events and create a user in the same transaction, so there is no worry about consistency. Use the username itself as a standard event column (don't forget to index it) so your calls will be pretty fast. More on this type of data model at http://www.datastax.com/docs/0.8/ddl/index.
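Roughly, the two-family layout could look like this (only a sketch with invented names, using CQL and the DataStax Python driver rather than the Thrift interface the question implies):
import datetime
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# Family 1: the per-user timeline, one entry per event.
session.execute("""CREATE TABLE IF NOT EXISTS user_events (
    user_id uuid, event_time timestamp, event_id uuid,
    PRIMARY KEY (user_id, event_time))""")
# Family 2: the event properties, keyed by the event id stored above.
session.execute("""CREATE TABLE IF NOT EXISTS events (
    event_id uuid PRIMARY KEY, type text, payload text)""")

user_id = uuid.uuid4()
start, end = datetime.datetime(2011, 1, 1), datetime.datetime(2011, 2, 1)

# Read 1: the event ids for the user in the time span; read 2: the event rows.
ids = session.execute(
    "SELECT event_id FROM user_events WHERE user_id = %s AND event_time >= %s AND event_time < %s",
    (user_id, start, end))
events = [session.execute("SELECT * FROM events WHERE event_id = %s", (row.event_id,)).one()
          for row in ids]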
Yes, it's a two-call read, but they are two different families of data anyway.
As for best practices, the field is kinda new; I'm not sure there are any widely approved ones yet.

MySQL select from db efficiency and time-consumption

Which one of those will take the least amount of time?
1.
$q = 'SELECT * FROM table;';
$result = mysql_query($q);
while ($row = mysql_fetch_assoc($result))
{
    $userName = $row['userName'];
    $userPassword = $row['userPassword'];
    if ($enteredN == $userName && $enteredP == $userPassword)
        return true;
}
return false;
2.
$q = "SELECT * FROM table WHERE userName='" . $enteredN . "' AND userPassword='" . $enteredP . "';";
...
The second one is guaranteed to be significantly faster on just about any database management system. You may not notice a difference if you have just a few rows in the table, but when you get to have thousands of rows, it will become quite obvious.
As a general rule, you should let the database management system handle filtering, grouping, and sorting that you need; database management systems are designed to do those things and are generally highly optimized.
If this is a frequently used query, make sure you use an index on the unique username field.
As soulmerge brings up, you do need to be careful with SQL injection; see the PHP Manual for information on how to protect against it. While you're at it, you should seriously consider storing password hashes, not the passwords themselves; the MySQL Manual has information on various functions that can help you do this.
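The same two ideas, sketched in Python/DB-API style (a salted SHA-256 is just one illustration of a password hash; cursor, table and column names follow the question):
import hashlib

def password_hash(password, salt):
    # Store and compare only the salted hash, never the plain-text password.
    return hashlib.sha256((salt + password).encode()).hexdigest()

entered_name, entered_password = "alice", "secret"   # what the user typed

# Parameterized query: the driver escapes the values, so no SQL injection.
cursor.execute(
    "SELECT 1 FROM table WHERE userName = %s AND userPassword = %s",
    (entered_name, password_hash(entered_password, "per-user-salt")))
authenticated = cursor.fetchone() is not None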
Number 2 is bound to be much, much faster (unless the table only contains a few rows).
It is not only because SQL servers are very efficient at filtering (and they can use indexes etc., which the loop in #1 doesn't have access to), but also because #1 may cause a very significant amount of data to be transferred from the server to the PHP application.
Furthermore, solution #1 would cause all the users' login credentials to transit on the wire, exposing them to possible snooping somewhere along the line... Do note that solution #2 also incurs a potential security risk if the channel between the SQL server and the application is not secure; this risk is somewhat lessened because 100% of the account details are not transmitted with every single login attempt.
Beyond this risk (which speaks to internal security), there is also the very real risk of SQL injection from outside attackers (as pointed out by others in this post). This, however, can be addressed by escaping the single-quote characters contained within end-user-supplied strings.

Design Question - Put hundreds of Yes/No switches in columns, rows, or other?

We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic using a #define value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating what approach to store switches in our new relational (MS-SQL) database:
Put each switch in its own field
Pros: fast and easy read/write/access - 1 row per client
Cons: seems kludgey, need to change schema every time we add a switch
Create a row per switch per client
Pros: unlimited switches, no schema changes necessary w/ new switches
Cons: slightly more arduous to pull data, lose intellisense w/o extra work
Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use wrong 'switch bank' field for comparison
I'm leaning towards #1 ... any thoughts?
It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third the space of solution number two, but the real advantage is over solutions one and three: indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can answer that efficiently with a single index on the tag table, but there's no practical way to do it with the other solutions.
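Concretely, the layout and the kind of query it enables might look like this (names invented; cursor is a DB-API cursor):
# Tag layout: clients, tags, and the pairs saying which clients have which tags.
cursor.execute("CREATE TABLE client (client_id INT PRIMARY KEY, name VARCHAR(100))")
cursor.execute("CREATE TABLE tag (tag_id INT PRIMARY KEY, name VARCHAR(100))")
cursor.execute("""CREATE TABLE client_tag (client_id INT, tag_id INT,
                  PRIMARY KEY (client_id, tag_id))""")

# Clients that have tags 7, 45 and 130 set but neither 86 nor 14:
cursor.execute("""
    SELECT ct.client_id
    FROM client_tag ct
    WHERE ct.tag_id IN (7, 45, 130)
      AND NOT EXISTS (SELECT 1 FROM client_tag x
                      WHERE x.client_id = ct.client_id AND x.tag_id IN (86, 14))
    GROUP BY ct.client_id
    HAVING COUNT(DISTINCT ct.tag_id) = 3
""")
matching_clients = [row[0] for row in cursor.fetchall()]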
You could think about using database views to give you the best of each solution.
For example, store the data as one row per switch, but use a view that pivots the switches (rows) into columns where that is more convenient.
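A rough sketch of such a pivoting view (invented names; MS-SQL-style CASE expressions, one per switch you want exposed as a column):
cursor.execute("""
    CREATE VIEW client_switches AS
    SELECT client_id,
           MAX(CASE WHEN switch_id = 1 THEN 1 ELSE 0 END) AS show_salary_on_checks,
           MAX(CASE WHEN switch_id = 2 THEN 1 ELSE 0 END) AS some_other_switch
    FROM client_switch
    GROUP BY client_id
""")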
I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and would probably still have a generic table with one switch per row, for ones that don't fit into the groups.
Remember too, if you change the method, you may have a lot of application code to change that relies on the existing method of storing the data. Whether you should change the method may depend on exactly how hard it will be and how many hours it will take to change everything associated with it. I agree with Markus' solution, but you do need to consider exactly how hard the refactoring is going to be and whether your project can afford the time.
The refactoring book I've been reading would suggest that you maintain both for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then on a set date you drop the original (and the triggers) from the database. This lets you use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed.
It requires discipline, however, as it is easy not to get rid of the legacy code and columns because everything is working and you are afraid to touch it. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything at once.
I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4- Store in dictionary of name value pairs. Serialize to database.
I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client S1 S2 S3 S4 S5 ...
A X X
B X X X
But there are usually methods for doing that in a database-specific way.