Get the original SQL query in a Postgres extension in C

I am creating a Postgres extension in C (C++). It is a new data type that behaves like text but is encrypted by an HSM device. My problem is using more than one key to protect the data. My idea is to get the original SQL query and process it to choose which key I should use, but I don't know how to do that, or whether it is even possible.
My goal is to change some existing text fields in the database to encrypted ones, which is why I can't pass a key number to my type directly. The type must be seen by the external app as text.
Normally there is a userID field, and a single query always uses that id to get or set encrypted data. Based on that field I want to choose the key. The HSM can hold billions of keys, which means every user can have their own key. It's not a problem if I need to parse the string myself; I am more than capable of doing that. Performance is not an issue either: the HSM is so slow that I can encrypt or decrypt only a couple of fields per second.

In most parts of the planner and executor the current (sub)query is available in a passed PlannerInfo struct, usually:
PlannerInfo *root
This has a parse member containing the Query object.
Earlier in the system, in the rewriter, it's passed as Query *root directly.
In both cases, if there's evaluation of a nested subquery going on, you get the subquery. There's no easy way to access the parent Query node.
The query tree isn't always available deeper in execution paths, such as in expression evaluation. You're not supposed to be referring to it there; expressions are self-contained and don't need to refer to the rest of the query.
So you're going to have a problem doing what you want. Frankly, that's because it's a pretty bad design by the sounds of it. What you should consider instead is:
Using a function to encode/decode the type to/from cleartext, allowing you to pass parameters; or possibly
Using the typmod of the type to store the desired information (but be aware that the typmod is not preserved across casts, subqueries, etc).
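For illustration, a typmod-based declaration might look like this (a minimal sketch; the enctext type and its key-number typmod are hypothetical and would need typmod_in/typmod_out support in the extension):
-- The typmod (here 42) carries the key number for the column
CREATE TABLE customer (
    id     integer PRIMARY KEY,
    secret enctext(42)  -- encrypt/decrypt this column with HSM key 42
);
But as noted above, the typmod is lost across casts and subqueries, so this only works reliably for plain column references.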
There's also the debug_query_string global, but really, don't use that. It's the unparsed query text, so it won't help you much anyway. If you (ab)use this in your code, I will cry. I'm only telling you it exists so I can tell you not to use it.
Far and away your best option is going to be a function-based interface for this.
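To make the function-based option concrete, here is a minimal SQL sketch (the hsm_encrypt/hsm_decrypt names are hypothetical; the bodies would be C functions in the extension that talk to the HSM):
-- The key number is an explicit parameter, so no query parsing is needed
CREATE FUNCTION hsm_encrypt(cleartext text, key_no integer) RETURNS bytea
    AS 'MODULE_PATHNAME', 'hsm_encrypt' LANGUAGE C STRICT;
CREATE FUNCTION hsm_decrypt(ciphertext bytea, key_no integer) RETURNS text
    AS 'MODULE_PATHNAME', 'hsm_decrypt' LANGUAGE C STRICT;
-- The application derives the key number from the userID at the call site
SELECT hsm_decrypt(secret, user_id) FROM customer WHERE user_id = 42;
UPDATE customer SET secret = hsm_encrypt('top secret', user_id) WHERE user_id = 42;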

Related

Masking/Hashing data

As a SQL DBA, I need to export data that contains some personal/sensitive information, such as the national identification number (NIN). This field is a 10-digit unique number, and per our company's policy it is not allowed to export such data. Is there any way I can generate a new field out of the NIN with a different value but the same length? I need this value to be consistent across all tables so that we can use the new field to JOIN data instead of using the NIN.
I am thinking of the HashBytes function, but it generates output of a different length than the 10 digits I need.
The data is huge, so it's important to avoid collisions. What's the best way to do this?
Thanks
First, I would change the format of the produced value to be different from the internal version. That will make it much simpler to see right away if there is an issue.
Second, you can use a hashing algorithm such as SHA-256, which is quite unlikely to produce collisions. That might be good enough.
Third, you need to think through the security requirements better. My preferred solution is to have a look-up table that matches internal numbers to external values. Then, this table is used for all exports and imports to translate between the two. A suggestion here would be to use newid() to generate the value and to use GUIDs for external data.
However, this may not be sufficient for your requirements. Why? The same number has the same value over time. So, although you might be able to hide the internal value and even forget it, a given external value still matches a single number -- tying external records together.
The solution to this is something called "salt" in the hashing function. This allows the external value to change over time, while still mapping to the same internal number.
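A sketch of the look-up-table approach in T-SQL (table and column names are made up for illustration):
-- Map each internal NIN to a stable external GUID
CREATE TABLE NinExternalMap (
    Nin        CHAR(10) NOT NULL PRIMARY KEY,
    ExternalId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
);
-- Register any NINs that don't have an external value yet
INSERT INTO NinExternalMap (Nin)
SELECT DISTINCT p.Nin
FROM Persons AS p
LEFT JOIN NinExternalMap AS m ON m.Nin = p.Nin
WHERE m.Nin IS NULL;
-- Exports join through the map and never expose the NIN itself
SELECT m.ExternalId, p.SomeColumn
FROM Persons AS p
JOIN NinExternalMap AS m ON m.Nin = p.Nin;
If you prefer the hashing route, HASHBYTES('SHA2_256', @salt + Nin) produces a salted hash (SQL Server 2012 or later for SHA2_256), though its output is 32 bytes rather than 10 digits.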

EAV vs Serialized Object vs SQL with Xpath?

I'm trying to implement a badge system; the badges are based on user metadata, which is subject to change.
This metadata is variable and is set on the fly.
Example of metadata :
commentCount
hasCompletedProfile
isActiveMember
etc. Later, I might want to add a hasGravatar metadata field; for this reason, I can't easily design and normalize a table up front.
This data, while an important part of the application, is not 'sensitive': almost all of the metadata could be re-computed, which means data integrity is not a hard constraint.
Currently, I know of three options, even though I haven't used any of them:
EAV
Serialized Objects
XML Field (I read somewhere that it is possible to store XML in a column, and use XPATH or something to query data)
All of these options appear to have pros and cons, but since I've never experimented with any of them, I don't really know which one to choose.
Do you have any feedback or advice?
I'm currently working with Zend Framework & Doctrine 2 with a MySQL server
XML and Serialized Objects are both very similar in that you would likely use one column to store this arbitrary data. That quickly becomes very messy and difficult to work with in SQL WHERE clauses (though some DBMSs have XPath support).
EAV, on the other hand, gives you a separate row for every Key => Value pair, which you can easily extract with a JOIN or subquery. The major downfall is that it can be a performance hit if you have a lot of data in there. Another drawback is that, to keep things simple, you would store all keys/values as text in the db. You could create an EAV table for every type, but that isn't practically needed in most languages, since what you fetch comes out as a string or can be converted anyway. For simply storing user configuration/properties, EAV should be perfectly fine.
So you might have a table user_metadata with 4 fields:
metadata_id INTEGER
user_id INTEGER
key CHAR
value CHAR
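In MySQL, a possible definition of that table might be (the unique index is my own assumption, chosen to guarantee one value per key per user and to speed up the lookups below):
CREATE TABLE user_metadata (
    metadata_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    `key`       VARCHAR(64) NOT NULL,   -- backticks: KEY is a reserved word
    value       VARCHAR(255) NULL,
    UNIQUE KEY uk_user_key (user_id, `key`)
);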
You could then fetch this data all at once for a user:
SELECT * FROM user_metadata WHERE user_id = $user_id
Or you could fetch individual metadata along with your user data:
SELECT user.*, meta_gravatar.value AS hasGravatar
FROM user
LEFT JOIN user_metadata AS meta_gravatar
ON meta_gravatar.user_id = user.user_id AND meta_gravatar.key = 'hasGravatar'
WHERE user.user_id = $user_id
EAV: It is complicated and slow, and it is an example of how not to use an SQL database. You cannot have an index on individual properties in EAV, and you need some nontrivial logic to get the data from the database into business-logic objects. Your SQL queries also become difficult to optimize.
Serialized objects: Serialization often depends on the language or platform. There is no way to index or search on a property, but it is a simple way to store data of undefined structure.
XML field: Using a standardized representation is better than serialization. Also, there may be support for such data structures in your SQL server.
JSON field: The same as an XML field; however, JSON supports primitive data types (int, bool, null) and is faster and easier to parse and serialize than XML. Some SQL servers provide some support for it as well.
All three serialization approaches share the same disadvantage: no indexes on the properties. In most applications this is acceptable, because the data is not processed by the database anyway; it is simply a blob to the application. The good thing is that this blob does not complicate the database schema and operations.
There is one more way to implement such an EAV alternative: a plain old SQL table. If a new property requires a change in the application code anyway, you can add the SQL column at the same time. If you have a user interface and application logic for defining properties at run-time, you can teach your application to issue ALTER TABLE queries and simply add or remove columns as needed. In the end, this will be much easier and more effective than implementing EAV, as long as you have a good query builder.
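A sketch of that last approach (the table and column are illustrative):
-- Adding the hasGravatar property later is just a schema change
ALTER TABLE user_properties ADD COLUMN hasGravatar TINYINT(1) NOT NULL DEFAULT 0;
-- The new property is immediately indexable and queryable like any other column
CREATE INDEX idx_hasGravatar ON user_properties (hasGravatar);
SELECT user_id FROM user_properties WHERE hasGravatar = 1;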

Most efficient method for persisting complex types with variable schemas in SQL

What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking of creating the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //DocumentId + ValueXPath together function as the PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
From these tables, when performing queries I would do a series of self-joins on the Values table. When I want to get the entire object by DocumentId, I would have a generic script that creates a view mimicking a denormalized data table of the complex type.
What I want to know
I believe there are better ways to accomplish what I am trying to do, but I am a little too ignorant of the relative performance benefits of different SQL techniques. Specifically, I don't know the performance cost of:
1 - comparing the value of a text field versus that of a varchar field
2 - different kinds of joins versus nested queries
3 - getting a view versus an XML document from the SQL db
4 - other things that would affect my query which I don't even know about, but am experienced enough to know exist
I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.
For example, here's what I am currently planning to do.
I have a C# class Address, which looks like:
public class Address {
    public string Line1 { get; set; }
    public string Line2 { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zip { get; set; }
}
An instance is constructed with new Address { Line1 = "17 Mulberry Street", Line2 = "Apt C", City = "New York", State = "NY", Zip = "10001" }.
Its XML representation would look like:
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db schema from above, I would have a single record in the Schemas table with an XSD definition of the Address XML schema. The instance would have a record in the Documents table with its own uniqueidentifier (the PK) and a SchemaId pointing to the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
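To illustrate the self-joins mentioned above, a search for every document whose Address is in New York, NY could look like this (a sketch; Values is bracketed because it collides with a reserved word, and the casts are needed because the text data type does not support the = operator):
SELECT v1.DocumentId
FROM [Values] AS v1
JOIN [Values] AS v2 ON v2.DocumentId = v1.DocumentId
WHERE v1.ValueXPath = '/Address/City'  AND CAST(v1.Value AS varchar(250)) = 'New York'
  AND v2.ValueXPath = '/Address/State' AND CAST(v2.Value AS varchar(250)) = 'NY';
Every additional predicate costs another self-join, which is what makes this design expensive as queries grow.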
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also banging my head against complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice open-source frameworks out there, like NServiceBus, which save lots of time and spare you technical challenges. It all depends on how far you want to take these concepts and what you're willing/able to spend time on. You can even start with just the basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow, what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform joins, XML parsing, and XPath evaluation just to get to a value. That doesn't strike me as the most efficient thing to do.
So, my advice:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with Visual Studio's point-and-click interface. Or just google for someone who already has.
I need to be able to create variable schemas on the fly without changing anything about the database access layer.
You are re-implementing the RDBMS within an RDBMS. The DB can already do this - that is what DDL statements like CREATE TABLE and CREATE SCHEMA are for.
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer: if you don't have the full requirements immediately, I would store the data in an XML column and query it with XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem - CREATE XML INDEX in SQL Server 2008, for example.
However, for frequent queries you can use triggers or materialized views to create copies of the relevant data in table format, so that more intensive reports can be sped up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to let users create their own reports, that's a bigger mountain to climb.
I guess what I am saying is: are you sure you need to do this, and can't XML just do the job?
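A sketch of the XML-column approach in SQL Server (the table is hypothetical; the value()/exist() methods and CREATE PRIMARY XML INDEX are real SQL Server features):
CREATE TABLE Documents (
    DocumentId UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    Body       XML NOT NULL
);
CREATE PRIMARY XML INDEX PXML_Documents_Body ON Documents (Body);
-- Query inside the XML with XQuery methods
SELECT DocumentId,
       Body.value('(/Address/City)[1]', 'varchar(100)') AS City
FROM Documents
WHERE Body.exist('/Address[State = "NY"]') = 1;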
In part it will depend on your DB engine. You're using SQL Server, aren't you?
Answering your topics:
1 - Comparing the value of a text field versus a varchar field: if you're comparing two db fields, varchar fields are the smarter choice. Nvarchar(max) stores data in Unicode at 2*l+2 bytes, where "l" is the length. For performance, you will need to consider how large the tables will grow when choosing the best way to index (or not index) your table fields.
2 - Sometimes nested queries are easy to create and execute, and can even reduce query time. But depending on the complexity, it may be better to use some kind of join instead. The best way is to try both. Execute each query two or more times, since the DB engine "compiles" a query on first execution and subsequent runs are considerably faster; measure the times for different parameters and choose the best option.
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3 - There's not much information in this question, but if you can get the XML document directly from the table, that would be a good idea instead of a view. Again, it will depend on the view and the document.
4 - Other issues concern the total number of records expected in your table and the indexing of the columns, for which you need to consider sorting, joining, filtering, PKs and FKs. Each situation can demand a different approach. My suggestion is to invest some time reading about your database engine and how queries work, and relating that to your system.
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast - much faster than varchar if you have to use wildcards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
One refinement I'd recommend is to look at the domain model and see if you can split the single "value" column into typed columns; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.
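A sketch of that typed-value refinement applied to the Values table (the extra columns are my assumption):
CREATE TABLE [Values] (
    DocumentId   UNIQUEIDENTIFIER NOT NULL,
    ValueXPath   VARCHAR(250) NOT NULL,
    TextValue    VARCHAR(MAX) NULL,
    IntegerValue INT NULL,
    MoneyValue   MONEY NULL,
    DateValue    DATETIME NULL,
    PRIMARY KEY (DocumentId, ValueXPath)
);
-- Range queries like "house numbers 1 to 100" now hit a properly typed column
SELECT DocumentId
FROM [Values]
WHERE ValueXPath = '/Address/HouseNumber'
  AND IntegerValue BETWEEN 1 AND 100;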

How to store sensitive information in SQL Server 2008?

I need to store some sensitive information in a table in SQL Server 2008.
The data is a string and I do not want it to be in human readable format to anyone accessing the database.
What I mean by sensitive information is, a database of dirty/foul words. I need to make sure that they are not floating around in tables and SQL files.
At the same time, I should be able to perform operations like "=" and "like" on the strings.
So far I can think of two options; will these work, or is there a better option?
Store string (varchar) as binary data (BLOB)
Store in some encrypted format, like we usually do with passwords.
A third option, which may be most appropriate, is to simply not store these values in the particular database at all. I would argue that it is probably more appropriate to store them elsewhere, since you're probably not going to JOIN against the table of sensitive words.
Otherwise, you probably want to use Conrad Frix's suggestion of SQL Server's built-in encryption support.
The reason I say this is because you say both = and LIKE must work across your data. When you hash a string using a hash algo such as SHA/MD5/etc., the results won't obey human language LIKE semantics.
If exact equality (=) is sufficient (i.e. you don't really need to be able to do LIKE queries), you can use a cryptographic function to secure the text. But keep in mind that a one-way hash function would prohibit you from getting a list of strings "un-hashed" - if you need to do that, you need to use an encryption algo where decryption is possible, such as AES.
If you use rot13, you can still use both = and LIKE (as long as you rot13 the search pattern too). This also works with any storage method other than an SQL database, if preventing casual/accidental viewing (including search-engine indexing, if the list is public) is all that matters.
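If exact equality is enough, a minimal T-SQL sketch might look like this (the table is illustrative; note that HASHBYTES in SQL Server 2008 supports algorithms only up to SHA1):
CREATE TABLE BadWords (
    WordHash VARBINARY(20) NOT NULL PRIMARY KEY  -- SHA1 digests are 20 bytes
);
-- Store only the hash, never the cleartext
INSERT INTO BadWords (WordHash) VALUES (HASHBYTES('SHA1', N'example'));
-- Equality becomes a hash comparison; LIKE is not possible on hashed values
DECLARE @candidate NVARCHAR(100) = N'example';
SELECT 1 FROM BadWords WHERE WordHash = HASHBYTES('SHA1', @candidate);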

Why are SQL-Server UDFs so limited?

From the MSDN docs for create function:
User-defined functions cannot be used to perform actions that modify the database state.
My question is simply - why?
Yes, a UDF that modifies data may have potentially unwanted side-effects.
Yes, there is overhead involved if a UDF is called thousands of times.
But that is the whole point of design and testing - to ensure that such issues are ironed out before deployment. So why do DB vendors insist on imposing these artificial limitations on developers? What is the point of a language construct that can essentially only be used as a wrapper for select statements?
The reason for this question is as follows: I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID. (Yes, this sounds long-winded and possibly crazy, but when you're sending data to another dev company who believes their design was handed down by God and cannot be improved upon, it's easier just to smile and nod and do what they ask).
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
It's usually best to think of the available tools as a spectrum, from Views, through UDFs, out to Stored Procedures. At the one end (Views) you have a lot of restrictions, but this means the optimizer can actually "see through" the code and make intelligent choices. At the other end (Stored Procedures), you've got lots of flexibility, but because you have such freedom, you lose some abilities (e.g. because you can return multiple result sets from a stored proc, you lose the ability to "compose" it as part of a larger query).
UDFs sit in the middle ground - you can do more than in a view (multiple statements, for example), but you don't have the full flexibility of a stored proc. By giving up that freedom, you allow the output to be composed as part of a larger query. By having no side effects, you guarantee that, for example, it doesn't matter in which order the UDF is applied to rows; if side effects were allowed, the optimizer might have to provide an ordering guarantee.
I understand your issue, I think, but taking this from your comment:
I want to do something like select my_udf(my_variable) from my_table, where my_udf either selects or creates the value it returns
So you want a SELECT that (potentially) modifies data. Can you look at that sentence on its own and tell me it reads perfectly OK? I certainly can't.
Reading your description of what you actually need to do:
I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID.
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
From that last sentence, it sounds like you have to process many rows at once, so how about a single INSERT that inserts GUIDs for the IDs that don't already have them, followed by a single SELECT that returns all the GUIDs that (now) exist?
Sometimes if you cannot implement the solution you came up with, it may be an indication that your solution is not optimal.
Using a statement like this
INSERT INTO IntGuids(IntValue, GuidValue)
SELECT MyIntValues.IntValue, NEWID()
FROM MyIntValues
LEFT OUTER JOIN IntGuids ON MyIntValues.IntValue = IntGuids.IntValue
WHERE IntGuids.IntValue IS NULL
creates all the GUIDs you need in one statement. There is no need to SELECT+INSERT for every single value.
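The single SELECT suggested above could then be (same hypothetical tables):
SELECT m.IntValue, g.GuidValue
FROM MyIntValues AS m
JOIN IntGuids AS g ON g.IntValue = m.IntValue;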