I am creating an extension to Postgres in C (C++). It is a new data type that behaves like text but is encrypted by an HSM device. The problem is using more than one key to protect the data. My idea is to get the original SQL query and process it to choose which key to use, but I don't know how to do that, or whether it is even possible.
My goal is to convert some existing text fields in the database to encrypted ones, which is why I can't pass a key number to my type directly. The type must be seen by the external app as text.
Normally there is a userID field, and a single query always uses that ID to get or set encrypted data. Based on that field I want to choose the key. The HSM can hold billions of keys, which means every user can have their own key. It's not a problem if I need to parse the string myself; I am more than capable of doing that. Performance is not an issue either: the HSM is so slow that I can encrypt or decrypt only a couple of fields per second.
In most parts of the planner and executor the current (sub)query is available in a passed PlannerInfo struct, usually:
PlannerInfo *root
This has a parse member containing the Query object.
Earlier in the system, in the rewriter, it's passed as Query *root directly.
In both cases, if there's evaluation of a nested subquery going on, you get the subquery. There's no easy way to access the parent Query node.
The query tree isn't always available deeper in execution paths, such as in expression evaluation. You're not supposed to be referring to it there; expressions are self contained, and don't need to refer to the rest of the query.
So you're going to have a problem doing what you want. Frankly, that's because it's a pretty bad design, from the sounds of it. What you should consider instead is:
Using a function to encode/decode the type to/from cleartext, allowing you to pass parameters; or possibly
Using the typmod of the type to store the desired information (but be aware that the typmod is not preserved across casts, subqueries, etc).
There's also the debug_query_string global, but really don't use that. It's unparsed query text so it won't help you anyway. If you (ab)use this in your code, I will cry. I'm only telling you it exists so I can tell you not to use it.
Far and away, your best option is going to be a function-based interface for this.
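For illustration only, here is a minimal sketch of what such an interface could look like. Every name in it (enctext, hsm_encrypt, hsm_decrypt, the accounts table) is hypothetical, not an existing API; the point is just that the key ID travels as an explicit argument instead of being guessed from the query text.

-- SQL-level wrappers around the extension's C functions
CREATE FUNCTION hsm_encrypt(cleartext text, key_id bigint) RETURNS enctext
    AS 'MODULE_PATHNAME', 'hsm_encrypt' LANGUAGE C STRICT;

CREATE FUNCTION hsm_decrypt(ciphertext enctext, key_id bigint) RETURNS text
    AS 'MODULE_PATHNAME', 'hsm_decrypt' LANGUAGE C STRICT;

-- the application then keys off userID explicitly:
UPDATE accounts SET secret = hsm_encrypt('some cleartext', user_id) WHERE user_id = 42;
SELECT hsm_decrypt(secret, user_id) FROM accounts WHERE user_id = 42;

The column itself stays typed as your encrypted type; only the access paths change, so the external app still sees the decrypted result as text.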
I was arguing with my friend against his suggestion to store price, value, and other similar information in varchar.
My points were based on:
Calculations will become difficult as we need to cast back and forth.
Integrity of the data will be lost.
Poor performance of Indexes
Sorting and aggregate functions will also need casting
etc. etc.
But he was saying that at his previous employment everybody stored such values in varchar, because it supposedly made the communication between the DB and the app very effective. (I still can't accept this.)
Are there really any advantages to storing such values in varchar?
Note: I'm not talking about columns like PhoneNo, IDs, ZIP Code, SSN, etc. I know varchar is best suited for those. These columns are value-based and will certainly be involved in calculations one way or another.
None at all.
Try casting values back and forth and see how much data you lose.
-- store decimals in a varchar column, then round-trip them through float
-- (table variables are declared with @; a #name would be a temp table)
DECLARE @foo TABLE (bar varchar(30))
INSERT @foo VALUES (11.2222222222)
INSERT @foo VALUES (22.3333333333)
INSERT @foo VALUES (33.1111111111)
SELECT CAST(CAST(bar AS float) AS varchar(30)) FROM @foo
I would also mention that his current employment does things differently... he isn't at his previous employment any more....
I think a big part of the reason to use the APPROPRIATE (in this case decimal) data type is to prevent invalid data. There's nothing to stop someone entering "The King" as a price in a varchar field.
I can see no advantages, and a whole heap of very severe disadvantages - the most pressing of which is performance (particularly when sorting).
Consider if you want to get a list of the N most expensive products, and you are storing your price as a VARCHAR. Here are some sample values (sorted in descending order)
SELECT Price FROM Table ORDER BY Price DESC
Price
-----
90
600
50
1000
Whoops! The sort order is, well, wrong! (Alphanumeric sorting, rather than value sorting.)
If we want the sort to come out properly, we either need to pad values with zeroes at the start, or convert each value to a number before we sort. But if we have to run a conversion on every row, SQL Server has no way of using statistics to predict the results, which in turn means extremely poor performance - probably a table scan.
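To make that concrete, the workaround looks something like this (a sketch; the cast has to run for every row, so any index on Price is useless for the sort):

SELECT Price FROM Table ORDER BY CAST(Price AS decimal(10,2)) DESC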
As Kragen notes, sorts will not necessarily come out in the right order.
Compares won't necessarily work either. If a field is defined as, say, decimal(8,2) and I give it the value "37.20", and later I write "select ... where price=37.2", the result will be true. But if I store a varchar 37.20 and compare it to 37.2, it will not be equal. Similarly if one or the other has leading zeros.
You could solve these problems by having the application ensure that you always store the numbers with a fixed number of decimal places, padded with leading zeros. Oh, and make sure you have a consistent convention for storing minus signs. But then every place in the app that writes to this field must follow exactly the same rules. We could do this, of course, but why? The database engine will do it for us if we just declare the field numeric. Sure, I COULD mow my lawn with a pair of scissors, but why would I want to?
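A quick sketch of that comparison behaviour (decimal comparison is by value; varchar comparison is character by character):

SELECT CASE WHEN CAST('37.20' AS decimal(8,2)) = 37.2 THEN 'equal' ELSE 'not equal' END  -- equal
SELECT CASE WHEN '37.20' = '37.2' THEN 'equal' ELSE 'not equal' END                      -- not equal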
I don't understand what your friend is saying the advantage is supposed to be. Easier communication between app and database? How? Maybe he was using some unconventional language or database interface that couldn't read numeric values from the DB. I've never had an issue with this. Actually, just saying that gets me wondering whether that isn't what happened: that at his previous company they were using some language or tool that couldn't read decimals from the database because of an implementation problem, the only way they could get it to work was to declare all the numbers as varchar, and now he walks away thinking that's a generally good idea.
OK. One-word answer: don't.
You are right that correct data types have an impact on performance (the SQL optimizer works differently for INT vs. VARCHAR), data consistency, integrity, etc.
If all we needed was VARCHAR, I don't think we would ever have invented the other types.
SQL is not dynamically typed. Static typing makes optimization better, index pages smaller, and query operators more efficient.
It is not the source's problem that the consumer needs all its input as strings; it is up to the consumer to do type checking when consuming the data. A DB should always have correct types.
(Forget about choosing between INT and VARCHAR; I would say you should even think about whether you need INT or TINYINT.) These considerations make a lot of difference.
Data is best stored in fields whose types match between the two systems involved - in this case, your .NET objects and MS SQL Server. You are correct about the loss of data integrity and the need to cast/convert values into usable forms. As for types such as phone number, ZIP code, SSN and so on: they too would benefit from dedicated data types. The main reason these are stored in VARCHAR/NVARCHAR is the number of different formats, not all of which are needed in every system. But if you have a type that is commonly used and you want to constrain it, you can build custom User-Defined Types to store that data in SQL Server. (Even more fun are CLR-defined types; see the example on Code Project.)
The only advantage I can see with using any sort of variable-sized string-ish format would be if the field would have to accommodate an unknown amount of additional information. For example, "49.95#1/39.95#5/29.95#20/14.95#100,match=true/24.95#100" to indicate that this particular product has price points at 1, 5, 20, and 100 units, and the best 100-unit price is only available when all items are identical. Using strings to store such things is icky, but if the number of price-points is open-ended, using a variable-sized field might be better than having to create another table with one row per product/price-point combination. If you do go that route, it may be good to use XML serialization for the data, rather than an ad-hoc thing as shown above. An ad-hoc approach might allow faster parsing in some cases, but if things really are open-ended it could become a real pain to maintain.
Addendum: If you want to be able to do any type of sorting or searching based on price, you'll need to have separate columns for that. If you want to allow users to e.g. find the ten cheapest items at 100-piece mix/match quantity, and the database holds 10,000 possible items, the only way to satisfy the query with varchar-stored data would be to read all 10,000 items and evaluate what the best price would be given the restrictions. If users can only query based upon a small number of price/restriction combinations, it may be helpful to have a column for each one to allow direct queries.
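For what it's worth, the normalized shape that makes such queries cheap might look like this (a sketch with hypothetical names; it ignores the match=true restriction, which would need an extra column):

CREATE TABLE PricePoint (
    ProductID INT,
    Quantity  INT,
    Price     DECIMAL(10,2),
    PRIMARY KEY (ProductID, Quantity)
)

-- ten cheapest items at the 100-piece price point:
SELECT TOP 10 ProductID, Price
FROM PricePoint
WHERE Quantity = 100
ORDER BY Price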
I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages to putting a TEXT column in a table of its own. What are the potential problems of keeping it with the rest of the fields? And the potential problems of keeping it in a separate table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Does "premature optimization is evil" apply here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed?)
Besides, are there any good links on this topic? (Perhaps stackoverflow questions&answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a normalised simple schema, then when something proves too slow add a complexity only where/if needed.
As others have pointed out the quote you mentioned is more applicable to query results than the schema definition, in any case your choice of storage engine would affect the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to habitually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field (like, way over 8,192 bytes) will cause excessive paging and/or file I/O during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row ID or index (which would then be metadata, since it doesn't actually contain the data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved anyway, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load every time. However, if you control the code 100%, then for simplicity leave the field on the table and only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
Which indeed tells you that in MySQL you are discouraged from keeping TEXT (and BLOB, as written elsewhere) data in frequently searched tables.
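A sketch of the split the manual describes (table and column names are made up):

CREATE TABLE article (
    id      INT PRIMARY KEY,
    title   VARCHAR(200),
    created DATETIME
)

CREATE TABLE article_body (
    article_id INT PRIMARY KEY,
    body       TEXT,
    FOREIGN KEY (article_id) REFERENCES article (id)
)

-- frequent searches touch only the compact table:
SELECT id, title FROM article ORDER BY created DESC LIMIT 20

-- the TEXT column is joined in only when actually needed:
SELECT a.title, b.body
FROM article a JOIN article_body b ON b.article_id = a.id
WHERE a.id = 42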
What is the point (if any) in having a table in a database with only one row?
Note: I'm not talking about the possibility of having only one row in a table, but when a developer deliberately makes a table that is intended to always have exactly one row.
Edit:
The sales tax example is a good one.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.
I've seen something like this when a developer was asked to create a configuration table to store name-value pairs of data that needs to persist without being changed often. He ended up creating a one-row table with a column for each configuration variable. I wouldn't say it's a good idea, but I can certainly see why the developer did it given his instructions. Needless to say it didn't pass review.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one row; I assume I'm missing something.
This doesn't sound like good design, unless there are some important details you don't know about. If there are three pieces of information that have the same constraints, the same use and the same structure, they should be stored in the same table, 99% of the time. That's a big part of what tables are for fundamentally.
For some things you only need one row - typically system configuration data. For example, "current sales tax rate". This might change in the future and so shouldn't be hardcoded, but you'll typically only ever need one at any given time. This kind of data needs to be in the database so that queries can use it in computations.
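A sketch of how such a one-row table gets used in computations (all names here are made up):

CREATE TABLE SalesTax (Rate DECIMAL(5,4) NOT NULL)
INSERT INTO SalesTax VALUES (0.0825)

-- any query can then use the current rate directly:
SELECT o.Total * (1 + t.Rate) AS TotalWithTax
FROM Orders o CROSS JOIN SalesTax t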
It's not necessarily a bad idea.
What if you had some global state (say, a boolean) that you wanted to store somewhere? And you wanted your stored procedures to easily access this state?
You could create a table with a primary key whose value range was limited to exactly one value.
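A sketch of that trick (hypothetical names): the CHECK constraint pins the key to a single possible value, so a second INSERT necessarily violates the primary key.

CREATE TABLE GlobalState (
    Id      CHAR(1) NOT NULL DEFAULT 'X' PRIMARY KEY CHECK (Id = 'X'),
    Enabled BIT NOT NULL
)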
A single row is like a singleton class: its purpose is to control or manage some other process.
A single-row table could act as a critical section or as a deterministic automaton (a kind of dispatcher based on row values).
A single row is useful in a COMPANY_DESCRIPTION table, to obtain consistent data about that company. Useful for company letters and addressing.
A single row is useful for holding an actual value such as VAT, or a date or time, and so on.
It can sometimes be useful to emulate features the database system doesn't provide. I'm thinking of sequences in MySQL, for instance.
If your database is your application, then it probably makes sense for storing configuration data that might be required by stored procedures implementing business logic.
If you have an application that could use the file system to store information, then I don't think there is an advantage to using the database over an XML or flat file, except maybe that most developers are now far more well versed in using SQL to store and retrieve data than accessing the file system.
What is the point (if any) in having a table in a database with only one row?
A relational database stores things as relations: tuples of data satisfying some relation.
Like, this one: "a VAT of this many percent is in effect in my country now".
If only one tuple satisfies this relation, then yes, it will be the only one in the table.
SQL cannot store variables, but it can store a set consisting of one element: that is a one-row table.
Also, SQL is a set-based language, and for some operations you need a fake set of only one row, e.g. to select a constant expression.
You cannot just SELECT out of nothing in Oracle, you need a FROM clause.
Oracle has a pseudotable, dual, which contains only one row and only one column.
Once, a long time ago, it used to have two rows (hence the name dual), but it lost its second row somewhere on its way to version 7.
MySQL has this pseudotable too, but MySQL is able to do selects without FROM clause. Still, it's useful when you need an empty rowset: SELECT 1 FROM dual WHERE NULL
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.
It may be a kind of "have it all or lose it all" scenario, where all three certificates are needed at once:
SELECT *
FROM ssl1
CROSS JOIN ssl2
CROSS JOIN ssl3
If any of the certificates is missing, the whole query returns nothing.
A table with a single row can be used to store application level settings that are shared across all database users. 'Maximum Allowed Users' for example.
Funny... I asked myself the same question. If you just want to store some simple value and your ONLY method of storage is an SQL server, that's pretty much what you have to do. If I have to do this, I usually end up creating a table with several columns and one row. I've seen a couple commercial products do this as well.
We have used a single-row table in the past (not often). In our case, this table was used to store system-wide configuration values that were updatable via a web interface. We could have gone the route of a simple name/value table, but the end client preferred a single row. I personally would have preferred the name/value design, but it really comes down to preference, especially if this table will never have any sort of relationship with another table.
I really cannot figure out why this would be the best solution. It seems more efficient to just have some kind of config file containing the data that would otherwise be in the table's one row. Connecting to the database and querying that one row is more costly. However, if this is configuration for the database logic itself, then it makes a bit more sense, depending on the type of database you are using.
I use the totally awesome rails-settings plugin for this http://github.com/Squeegy/rails-settings/tree/master
It's really easy to set up and provides a nice syntax:
Settings.admin_password = 'supersecret'
Settings.date_format = '%m %d, %Y'
Settings.cocktails = ['Martini', 'Screwdriver', 'White Russian']
Settings.foo = 123
Want a list of all the settings?
Settings.all # returns {'admin_password' => 'super_secret', 'date_format' => '%m %d, %Y'}
Set defaults for certain settings of your app. This will cause the defined settings to return the specified value even if they are not in the database. Make a new file in config/initializers/settings.rb with the following:
Settings.defaults[:some_setting] = 'footastic'
A use for this might be to store the current version of the database.
If one were storing database versions for schema changes it would need to reside within the database itself.
I currently analyse the schema and update accordingly but am thinking of moving to versioning. Unless someone has a better idea.
I use VB.NET and SQL Express.
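For what it's worth, a minimal version of such a table might look like this (a sketch):

CREATE TABLE SchemaVersion (Version INT NOT NULL)
INSERT INTO SchemaVersion VALUES (1)

-- each migration script can then guard itself; this updates nothing
-- (and so signals a problem) if the database is not at the expected version:
UPDATE SchemaVersion SET Version = 2 WHERE Version = 1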
Unless there are insert constraints on the table and a timestamp for versioning, this sounds like a bad idea.
There was a table set up like this in a project I inherited. It was for configuration data, and the reason that was given was that it made for very simple queries:
SELECT WidgetSize FROM ConfigTable
SELECT FooLength FROM ConfigTable
Okay fine. We converted to a generalized configuration table:
ID Name IntValue StringValue TextValue
This has served our purposes well.
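For reference, a sketch of what such a generalized table can look like (the exact definition in our project may have differed):

CREATE TABLE ConfigTable (
    ID          INT PRIMARY KEY,
    Name        VARCHAR(50) NOT NULL UNIQUE,
    IntValue    INT NULL,
    StringValue VARCHAR(255) NULL,
    TextValue   TEXT NULL
)

SELECT IntValue FROM ConfigTable WHERE Name = 'WidgetSize'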
CREATE TABLE VERSION (VERSION_STRING VARCHAR2(20 BYTE))
?
I used a single datum in a SQLite database as a counter in a dynamic web page. That's the simplest way I can think of to make it thread-safe (or process-safe to be precise). But I am not sure whether it's a good idea.
I think the best way to deal with these scenarios is, rather than using a database at all, to use a configuration file (usually XML) or make your own configuration file that is read during application start-up. It only takes a few minutes to write the code to read the file.
The advantage here is that there is no chance of accidentally adding additional values for the same XML variable, and it's great for testing because you don't need to write a lot of code to test different inputs - just change the text value and re-run the application.
Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMS's use principles that were designed in early 80's, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBER's as sets of centesimal digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, I used non-indexed values stored as VARCHAR2s, cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special case, though: it was used for bulk-loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing the data and putting it into the appropriate tables based on key name.
All values were loaded into the temporary VARCHAR2 column as-is and then converted into NUMBERs and DATETIMEs to be put into their proper columns.
Explicit data types are huge for efficiency and storage. If they are implicit, they have to be figured out, and that incurs speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that explicit types also use less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why we specify data types for table columns, and why the "engine" doesn't automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you, you specify it when you create the database - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things such as data integrity (ensuring that no value 'A' can be entered in the column, ensuring that no value 1.5 can be entered in the column), such as consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), ...
Types take care of all those sorts of things for you.
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
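For instance (a sketch against a hypothetical order_items table; an aggregate like this only works reliably because quantity is numeric):

SELECT SUM(quantity), AVG(quantity), MAX(quantity) FROM order_items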
Not all databases work this way. SQLite was mentioned earlier, but a much older set of databases also does this, multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type it is. Searches are still (relatively) fast, it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in the 5 months after go-live, every byte counts (in our system).
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use aggregate functions in reports (like sum() or avg()). Until then, you are fine with a text datatype :)
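A quick illustration (MySQL-style syntax, where a true comparison returns 1):

SELECT '200' < '3';   -- 1: lexical string comparison
SELECT 200 < 3;       -- 0: numeric comparison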
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though since I've never seen it implemented, I don't know what it would be like to use. I can see that it would reduce the chance of errors when using values whose implementations happen to be the same but whose conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
Constraints are perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you can manipulate it correctly. There are two ways we can store a date: as a date type, or as a string like "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893", or similar. Data types constrain that and define a canonical form for a date.
Furthermore, a data type has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do so - and you should!).
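For example (a sketch; the exact behaviour depends on the engine and, for MySQL, on the configured sql_mode):

CREATE TABLE t (d DATE);
INSERT INTO t VALUES ('1983-02-30');
-- a strict engine/mode rejects this INSERT outright; historically, lax MySQL
-- modes silently stored '0000-00-00' instead, which is arguably worse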
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so they can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some kind of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can jump straight to the byte offset given by the combined sizes of columns 1 through 4, and read sizeof(column5) bytes from there. Imagine how much quicker that is on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
A database is all about physical storage, and data types define it!