Storing flags in SQL column, and indexing them - sql

I need to store a set of flags related to an entity in the database. Flags might not be the best word, because these are not binary on/off values but rather a to-be-defined set of codes.
Normally you would store each piece of information (each flag value) in a distinct column, but I'm exploring alternatives to the one-column-per-attribute layout to prevent a dramatic increase in column mappings. Since each flag applies to each attribute of an entity, an entity that intrinsically requires a large number of columns would see its total column count double to 2n.
Eventually, these codes can be mapped to a positional string.
I'm thinking about something like: 02A not being interpreted as dec 42 but rather as:
Flag 0 in position 1 (or zero if you prefer...)
Flag 2 in position 2
Flag A in position 3
Data formatted this way can easily be processed by high-level programming languages; PL/SQL is out of the scope of the question, and all these values are supposed to be processed by Java.
Now the real problem
One of my specs is to optimize searching. I have been required to find a way (say, an efficient way) to seek for entities that show a certain flag (or a special 0 flag) in a given position.
Normally, in SQL, given the RDBMS-specific substring function, you would
SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};
This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary (function-based) indexes on a substring expression.
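On Oracle, such a function-based index might look like this (a sketch; the index name is illustrative, and the indexed expression must match the one used in the WHERE clause):
CREATE INDEX ENTITIES_FLAG_POS3_IDX ON ENTITIES (SUBSTR(FLAGS, 3, 1));
SELECT * FROM ENTITIES WHERE SUBSTR(FLAGS, 3, 1) = 'A';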
However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.
Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?

If performance is an issue, I would go for a different model here.
Say, a table that stores the entities and a 1->N relation to another table (say, a flags table: entId (FK), flag, position), and this flags table would have an index on flag and position.
The issue here would be getting these flags back into a single column, which can be done in Java or even in the database (though it would be difficult to write a cross-platform query for that).
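A minimal sketch of that model (all names and types are illustrative; the query finds entities showing flag 'A' in position 3):
CREATE TABLE ENTITY_FLAGS (
  ENT_ID BIGINT  NOT NULL,  -- FK to the entities table
  POS    INT     NOT NULL,  -- which attribute/position the flag refers to
  FLAG   CHAR(1) NOT NULL   -- the code itself
);
CREATE INDEX IDX_ENTITY_FLAGS_POS_FLAG ON ENTITY_FLAGS (POS, FLAG);

-- entities that show flag 'A' in position 3
SELECT E.*
FROM ENTITIES E
JOIN ENTITY_FLAGS F ON F.ENT_ID = E.ID
WHERE F.POS = 3 AND F.FLAG = 'A';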

If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.
Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.
If performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bit columns).
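As a rough sketch of what "reasonable data structures" means here (table, column names, and types are made up for illustration):
CREATE TABLE ENTITY (
  ID        BIGINT   NOT NULL PRIMARY KEY,
  FLAG_A    SMALLINT NULL,  -- one small-integer code per "flag"
  FLAG_B    SMALLINT NULL,
  IS_ACTIVE SMALLINT NULL   -- 0/1 where a real boolean type is unavailable
);
CREATE INDEX IDX_ENTITY_FLAG_A ON ENTITY (FLAG_A);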
It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.

I have improved my design and performed a benchmark and found an interesting result.
I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...
Then in version 1
I added a column VALIDATION VARCHAR(40) NULL DEFAULT NULL with an index on it.
Instead of positional flags, the new column contains an unordered set of codes each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon : symbol.
Example column values look like
NULL
'A01:A03:A10:'
'A05:'
Typical queries are:
SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL
Search for entities that are valid/invalid (NULL = no problem)
SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:%';
Selects entities with a specific problem
Then in version 2
I added a column VALID TINYINT NOT NULL with an index which is 0=invalid, 1=valid (Hibernate maps a Boolean to a TINYINT in MySQL).
I added a lookup table
CREATE TABLE ENTITY_VALIDATION (
  ID BIGINT NOT NULL PRIMARY KEY,
  PERSON_ID BIGINT NOT NULL, --REFERENCES PERSONS(ID) --Omitted for performance
  ERROR CHAR(3) NOT NULL
)
With index on both PERSON_ID and ERROR. This represents the 1:N relationship
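The indexes mentioned above amount to something like this (a sketch; index names are arbitrary):
CREATE INDEX IDX_ENTITY_VALIDATION_PERSON ON ENTITY_VALIDATION (PERSON_ID);
CREATE INDEX IDX_ENTITY_VALIDATION_ERROR  ON ENTITY_VALIDATION (ERROR);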
Queries:
SELECT * FROM ENTITIES WHERE VALID = {0|1}
Select invalid/valid entities
SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';
Selects entities with a given problem
Then I benchmarked
the COUNT(*) of each query via JUnit+JDBC, i.e. the same queries you see above with * replaced by COUNT(*).
I ran several benchmarks, with the entity table containing 100k, 250k, 500k, 750k and 1M entities and a mean entity:flag ratio of 1:3 (on average, 3 errors per entity).
The result: while valid/invalid entity lookups perform about the same in both versions, MySQL turns out to be faster with the LIKE operator than with the JOIN, even though both approaches are indexed.
Of course, this was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance across different DBMSes.

Related

"Canonical" approach for mapping custom queries to hierarchical entities with user-defined key/value pairs

In just about every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:
There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).
Storing the data, roughly matching the first two bullets above, is quite straightforward:
The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
The user-defined attributes are stored according to the entity-attribute-value pattern.
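A minimal sketch of that storage layout (table and column names are illustrative):
CREATE TABLE NODE (
  ID        BIGINT NOT NULL PRIMARY KEY,
  PARENT_ID BIGINT NULL   -- NULL for root nodes, otherwise the ID of the parent node
  -- ... plus the entity's own static columns ...
);
CREATE TABLE NODE_ATTRIBUTE (
  NODE_ID    BIGINT       NOT NULL,  -- FK to NODE(ID)
  ATTR_NAME  VARCHAR(100) NOT NULL,  -- user-defined attribute name
  ATTR_VALUE VARCHAR(255) NULL,      -- "dumb" value, stored as text
  PRIMARY KEY (NODE_ID, ATTR_NAME)   -- only one value per attribute per node
);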
While there are numerous resources that suggest to use a different approach especially in the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given. Also, hardly ever could I rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends without requiring two significantly different product versions.
Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The naive approach needs one join to the attribute-value table per attribute referenced in the boolean expression. The number of joins can be reduced somewhat by determining the maximum number of distinct attributes that are ANDed together in any branch of the custom boolean expression, which may save joins, but makes the resulting queries, and the code used to generate them, even less intelligible and maintainable. For instance,
a = 5 or (b = 8 and c = 9)
could do with 2 joins to the attribute-value table.
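Using the NODE/NODE_ATTRIBUTE sketch above, one possible two-join formulation of that filter (one inner join, one outer join; purely illustrative) is:
SELECT DISTINCT N.ID
FROM NODE N
JOIN NODE_ATTRIBUTE A1
  ON A1.NODE_ID = N.ID
LEFT JOIN NODE_ATTRIBUTE A2
  ON A2.NODE_ID = N.ID AND A2.ATTR_NAME = 'c' AND A2.ATTR_VALUE = '9'
WHERE (A1.ATTR_NAME = 'a' AND A1.ATTR_VALUE = '5')
   OR (A1.ATTR_NAME = 'b' AND A1.ATTR_VALUE = '8' AND A2.NODE_ID IS NOT NULL);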
I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?
Careful not to fall prey to the inner platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple select statements for queries. Store each Tuple Type (distinct set of attributes) as a table.
With regards to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the functional values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separate from storage.
Hierarchies can be represented as you describe, but a simple "Parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on Oracle.
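Where the back ends support it, a recursive common table expression is a fairly portable alternative to those vendor-specific features; a sketch against the parent-column model (the RECURSIVE keyword is required on PostgreSQL and MySQL 8 and omitted on SQL Server and Oracle):
WITH RECURSIVE SUBTREE (ID) AS (
  SELECT ID FROM NODE WHERE ID = 42   -- illustrative root node
  UNION ALL
  SELECT N.ID FROM NODE N JOIN SUBTREE S ON N.PARENT_ID = S.ID
)
SELECT * FROM SUBTREE;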
Avoiding EAV stores allows you to:
Use indexes and statistics where needed
Keep efficient storage (ints stored as ints, money stored as money)
Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)
If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.
tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.

Is using JOINs to avoid numerical IDs a bad thing? [duplicate]

Yesterday I was looking at queries like this:
SELECT <some fields>
FROM Thing
WHERE thing_type_id = 4
... and couldn't help but think this wasn't very "readable". What's '4'? What does it mean? I did the same thing in coding languages before, but now I would use constants for this, turning the 4 into THING_TYPE_AVAILABLE or some such name. No arcane number with no meaning anymore!
I asked about this on here and got answers as to how to achieve this in SQL.
I'm mostly partial to using JOINS with existing type tables where you have an ID and a Code, with other solutions possibly of use when there are no such tables (not every database is perfect...)
SELECT thing_id
FROM Thing
JOIN ThingType USING (thing_type_id)
WHERE thing_type_code IN ('OPENED', 'ONHOLD')
So I started using this on a query or two and my colleagues were soon upon me: "hey, you have literal codes in the query!" "Um, you know, we usually go with pks for that".
While I can understand that this method is not the usual method (hey, it wasn't for me either until now), is it really so bad?
What are the pros and cons of doing things this way? My main goal was readability, but I'm worried about performance and would like to confirm whether the idea is sound or not.
EDIT: Note that I'm not talking about PL/SQL but straight-up queries, the kind that usually starts with a SELECT.
EDIT 2:
To further clarify my situation with fake (but structurally similar) examples, here are the tables I have:
Thing
------------------------------------------
thing_id | <attributes...> | thing_type_id
       1 |       ...       |       3
       4 |       ...       |       7
       5 |       ...       |       3
ThingType
--------------------------------------------------
thing_type_id | thing_type_code | <attributes...>
            3 | 'TYPE_C'        | ...
            5 | 'TYPE_E'        | ...
            7 | 'TYPE_G'        | ...
thing_type_code is just as unique as thing_type_id. It is currently also used as a display string, which is a mistake in my opinion, but would be easily fixable by adding a thing_type_label field duplicating thing_type_code for now, and changeable at any time later on if needed.
Supposedly, filtering with thing_type_code = 'TYPE_C', I'm sure to get that one line which happens to be thing_type_id = 3. Joins can (and quite probably should) still be done with the numerical IDs.
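For what it's worth, that uniqueness can be made explicit (and indexed) cheaply; a sketch:
ALTER TABLE ThingType ADD CONSTRAINT uq_thing_type_code UNIQUE (thing_type_code);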
Primary key values should not be coded as literals in queries.
The reasons are:
Relational theory says that PKs should not convey any meaning. Not even a specific identity. They should be strictly row identifiers and not relied upon to be a specific value
Due to operational reasons, PKs are often different in different environments (like dev, qa and prod), even for "lookup" tables
For these reasons, coding literal IDs in queries is brittle.
Coding data literals like 'OPENED' and 'ONHOLD' is GOOD practice, because these values are going to be consistent across all servers and environments. If they do change, changing queries to be in sync will be part of the change script.
I assume that the question is about the two versions of the query -- one with the numeric comparison and the other with the join and string comparison.
Your colleagues are correct that the form with where thing_type_id in (list of ids) will perform better than the join. The difference in performance, however, might be quite minor if thing_type_id is not indexed. The query will already require a full table scan on the original table.
In most other respects, your version with the join is better. In particular, it makes the intent of the query clearer and overall makes the query more maintainable. For a small reference table, the performance hit may not be noticeable. In fact, in some databases, this form could be faster. This would occur when the in is evaluated as a series of or expressions; if the list is long, it might be faster to do an index lookup.
There is one downside to the join approach. If the values in the columns change, then the code also needs to be changed. I wouldn't be surprised if your colleague who suggests using primary keys has had this experience. S/he is working on an application and builds it using joins. Great. Lots of code. All clear. All maintainable. Then every week, the users decide to change the definitions of the codes. That can make almost any sane person prefer primary keys over using the reference table.
See Mark's comment. I assume you are OK, but I can give my two cents on the matter.
If that value is only used within the scope of one query, I like to write it this more readable way:
declare @HOLD int = 4
SELECT <some fields>
FROM Thing
WHERE thing_type_id = @HOLD
If those values are used many times in many places (queries, SPs, views, etc.),
I create a domain table.
create table ThingType (id int not null primary key, description varchar(50))
GO
insert into ThingType values (4,'HOLD'),(5, 'ONHOLD')
GO
That way I can reuse those types in my selects as an enumeration:
declare @TYPE int
set @TYPE = (select id from ThingType where description = 'HOLD')
SELECT <some fields>
FROM Thing
WHERE thing_type_id = @TYPE
That way I keep meaning and performance (and can also enforce referential integrity over the domain values).
I can also just use an enumeration at the application level and pass numeric values to the queries. A quick glance at that enumeration will tell me what the number means.
In SQL queries you will definitely introduce a performance hit for JOINs (effectively multiple queries are taking place inside the SQL server). The question is whether the performance hit is significant enough to offset the benefits.
If it's just a readability thing then you may prefer to go for better performance and avoid the JOINs, but I would suggest you take into account potential integrity problems (e.g. what happens if the typed value of 4 in your example is changed by another process further down the line - the entire application may fail).
If the values will NEVER change then use PKs - this is a decision for you as the developer - there is no rule. One option may be best for one query and not for another.
In case of PL/SQL it makes sense to define constants in your package, e.g.
DECLARE
  C_OPENED CONSTANT NUMBER := 3;
  C_ONHOLD CONSTANT NUMBER := 4;
BEGIN
  SELECT <some fields>
  INTO ...
  FROM Thing
  WHERE thing_type_id IN (C_OPENED, C_ONHOLD);
END;
Sometimes it is useful to create a global package (without a body) where all commonly used constants are defined. If a literal changes, you only have to modify the constant definition in a single place.
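Such a constants-only package spec might look like this (package and constant names are illustrative):
CREATE OR REPLACE PACKAGE thing_constants AS
  C_OPENED CONSTANT NUMBER := 3;
  C_ONHOLD CONSTANT NUMBER := 4;
END thing_constants;
/
Inside PL/SQL code the constants are then referenced as thing_constants.C_OPENED, just like the local constants above.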

When would combining columns into a single, delimited column be better in a RDB schema?

Consider for example the case where you have two pieces of data, where one value is rarely used without the other. As one example, here is a table holding user authentication data:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_password STRING,
auth_password_salt STRING
)
I think that the password is meaningless without the salt, and the other way around. I also have the option of representing the data this way:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_secret STRING
)
And in auth_secret, store strings such as D5SDfsuuAedW:unguessable42
In general, are there any situations where combining columns into one, delimited column would be a better choice?
Even if it is never a "better choice" overall, are there any costs (performance, space, anything) to having more columns vs fewer columns (for the same data)? My motivation is better understanding and to be able to more competently argue against it when someone suggests this sort of thing.
--edited I changed the example... original example as follows:
CREATE TABLE points
(
id INT PRIMARY KEY,
x_coordinate INT,
y_coordinate INT,
z_coordinate INT
)
vs
CREATE TABLE points
(
id INT PRIMARY KEY,
position STRING
)
In position, storing strings such as 7:3:15
You do that when there is no chance of needing to join, query, report or aggregate the data.
In other words - never. It is bad database design.
First Normal Form (1NF) requires atomic (indivisible) attribute values - it is the most basic requirement.
The only possible answer to this question is never. Never, ever, store delimited data in a column. It defeats the entire point of columns, which are there to delimit your data, and makes it inordinately difficult to do anything that a database has been designed to do. It's a violation of normalisation so huge that you'll spend hours on Stack Overflow trying to correct it in a month's time.
Never do this.
However, "never say never".
In certain, extremely limited, circumstances it's okay. Never assume it's okay but it can be.
A good example is Stack Overflow's own Posts table, which stores the tags in a delimited format for quick reading. The tags a question has are read from the database far more often than they are edited. The tags are stored in a separate table, PostTags, and then denormalised to Posts when they are updated.
In short, even though you can denormalise your data in this way, don't. Try everything possible to avoid it. If you come across a situation where you've been optimizing for days and the only way to get something quicker is to denormalize, then it's okay. Just ensure that you are only ever going to read data from that column and you have a secondary process in place to ensure that it is kept up-to-date. If the update of the denormalised data fails, roll everything back to ensure that your data is consistent.
You left out a significant option: create an appropriate user-defined data type. (PostgreSQL has long had an intrinsic data type for 2-space.)
PostgreSQL
Oracle
SQL Server
DB2
These implementations differ quite a lot.
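As one concrete illustration, PostgreSQL's built-in 2-D point type can be used directly (a sketch):
CREATE TABLE points (
  id       INT PRIMARY KEY,
  position POINT          -- PostgreSQL's built-in 2-D point type
);
INSERT INTO points VALUES (1, '(7,3)');
SELECT id FROM points WHERE position[0] = 7;  -- [0] is the x coordinate, [1] the y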
But you might not have the luxury of using one of those platforms. You might have to use MySQL, for example, which doesn't support user-defined data types.
Relational theory says that data types can be arbitrarily complex; they can have internal structure. The most common data type that has internal structure is the type "date". Relational theory specifies what the dbms is supposed to do with data types like that. The dbms must either
ignore the internal structure entirely, or
provide functions to manipulate the parts.
In the case of dates, every SQL dbms provides functions to manipulate the parts.
You can make a good argument for a single column that stores 3-space coordinates like "7:3:15" in MySQL. To keep in line with relational theory, you'd want the dbms to ignore the structure, and return only the single value "7:3:15"; manipulation of parts is left to application code.
One problem with implementing something like that in MySQL is that MySQL doesn't enforce CHECK constraints. So it's a lot harder to prevent values like "wibble:frog:foo" from finding their way into the database.
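On a database that does enforce CHECK constraints, the format could at least be guarded along these lines (PostgreSQL syntax, as a sketch):
CREATE TABLE points (
  id       INT  PRIMARY KEY,
  position TEXT NOT NULL,
  CONSTRAINT chk_position_format
    CHECK (position ~ '^[0-9]+:[0-9]+:[0-9]+$')  -- e.g. '7:3:15'; rejects 'wibble:frog:foo'
);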

Strategy for Storing Multiple Nullable Booleans in SQL

I have an object (happens to be C#) with about 20 properties that are nullable booleans. There will be perhaps a few million such objects persisted to a SQL database (currently SQL Server 2008 R2, but MySQL may need to be supported in the future). The instances themselves are relatively large because they contain about a paragraph of text as well as some other unrelated properties.
For a given object instance, most of the properties will be null most of the time.
When users search for instances of such objects, they will select perhaps 1-3 of the nullable boolean properties and search for instances where at least one of those 1-3 properties is non-null (OR search).
My first thought is to persist the object to a single table with nullable BIT columns representing the nullable boolean properties. However, this strategy will require one index per BIT column to avoid performing a table scan when searching. Further, each index would not be particularly selective since there are only three possible values per index.
Is there a better way to approach this problem?
For performance reasons, I would suggest that you split the table into two tables.
Put the primary key and the bit fields used for searching in one table, and the additional data (such as the paragraph of text) in another table keyed by the same primary key. Use the first table for the WHERE conditions, joining in the second to get the data you want. Something like:
select p.*
from BitFields bf
join Paragraph p
  on bf.bfid = p.bfid
where <conditions on the bit fields>
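A sketch of the two tables that query assumes (SQL Server flavored; the column names are made up):
CREATE TABLE BitFields (
  bfid   BIGINT NOT NULL PRIMARY KEY,
  flag01 BIT NULL,   -- one nullable bit per boolean property
  flag02 BIT NULL,
  flag03 BIT NULL
  -- ... roughly 20 such columns
);
CREATE TABLE Paragraph (
  bfid BIGINT NOT NULL PRIMARY KEY,  -- same key, 1:1 with BitFields
  body VARCHAR(MAX) NULL             -- the paragraph of text and the other wide columns
);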
With a bunch of binary/ternary fields, I wouldn't think that indexes would help much, so the query engine will resort to a full table scan. If you put the bit fields in one table, you can keep that table in memory and achieve good performance.
The alternative is to store the fields as name value pairs. If you really have lots of such fields (say many hundreds or thousands) and only a few are used in a given row (say a dozen or so), then an entity-attribute-value (EAV) structure might work better. This is a table with three important columns:
Entity id (what I call bfid above).
Attribute id (the particular attribute)
Value (true or false)
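A sketch of that EAV layout and the OR-style search (names are illustrative; the attribute ids stand in for the ~20 boolean properties):
CREATE TABLE EntityFlag (
  bfid         BIGINT NOT NULL,  -- entity id
  attribute_id INT    NOT NULL,  -- which boolean property this row describes
  flag_value   BIT    NOT NULL,  -- true/false; the absence of a row means "null"
  PRIMARY KEY (bfid, attribute_id)
);
CREATE INDEX ix_EntityFlag_attribute ON EntityFlag (attribute_id, bfid);

-- "at least one of properties 3, 7 or 12 is non-null" (the OR search)
SELECT DISTINCT bfid
FROM EntityFlag
WHERE attribute_id IN (3, 7, 12);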

select * vs select column

If I just need 2/3 columns and I query SELECT * instead of providing those columns in select query, is there any performance degradation regarding more/less I/O or memory?
The network overhead might be present if I do select * without a need.
But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?
If it always pulls a tuple then I/O overhead is the same.
At the same time, there might be some memory consumption for stripping the requested columns out of the tuple, if it pulls a whole tuple.
So if that's the case, select someColumn would have more memory overhead than select *.
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time
if you need only 2/3 of the columns, you're selecting 1/3 too much data which needs to be retrieved from disk and sent across the network
if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)
in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.
Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.
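To illustrate the covering-index point above with a SQL Server sketch (all object names are hypothetical):
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
  ON dbo.Orders (CustomerId)
  INCLUDE (OrderDate, Total);

-- this query can be answered entirely from the index pages, without touching the table
SELECT CustomerId, OrderDate, Total
FROM dbo.Orders
WHERE CustomerId = 42;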
It always pulls a tuple (except in cases where the table has been vertically segmented, i.e. broken up into column pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons (below), you should always select specifically those columns you want, by name.
It always pulls a tuple because (in every vendor's RDBMS I am familiar with) the underlying on-disk storage structure for everything, including table data, is based on defined I/O pages (in SQL Server, for example, each page is 8 kilobytes). And every I/O read or write is by page, i.e. every write or read is a complete page of data.
Because of this underlying structural constraint, a consequence is that each row of data in a database must always be on one and only one page. It cannot span multiple pages of data (except for special things like blobs, where the actual blob data is stored in separate page chunks and the table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases (for special types of data, or certain optimizations for special circumstances).
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual blob data, or whatever) must still be stored on a single I/O page...
EXCEPTION. The only place where Select * is OK is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2 t2
              Where t2.column = t1.colA)
EDIT: To address @Mike Sherer's comment: yes, it is true, both technically (with a bit of definition for your special case) and aesthetically. First, even when the set of columns requested is a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons: all I/O must be done in pages, and index data is stored in I/O pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
And the statement is true aesthetically because the point is that the database fetches data based on what is stored in the I/O page, not on what you ask for, and this is true whether you are accessing the base table's I/O pages or an index's I/O pages.
For other reasons not to use Select *, see Why is SELECT * considered harmful? :
You should always select only the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on the client side by index, then having those indexes become incorrect when a new column is added to the table.
Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.
On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.
This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.
Needless to say I didn't SELECT that column unless I really needed it. Having that data floating around - especially when I selected multiple rows - was just a hassle.
However, I will admit that I otherwise usually query for all the columns in a table.
During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * or SELECT a, b, c... Why? Because that's where the information on the structure and layout of the table on the system is.
It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.
Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.
Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.
The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.
DBs rely heavily on page caching for their work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.
However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.
If you have:
CREATE TABLE customer (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(150) NOT NULL,
city VARCHAR(30),
state VARCHAR(30),
zip VARCHAR(10));
CREATE INDEX k1_customer ON customer(id, name);
Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that you DB will pull this data from the index, rather than from the tables.
Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.
Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.
This is a hand wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.
In the end, SELECT * is useful for dynamic queries you have to type by hand, I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.
I think there is no exact answer to your question, because you are weighing performance against ease of maintaining your apps. Selecting explicit columns performs better than select *, but if you are developing an object-oriented system, you will want to work with object.properties, and you may need any property in any part of the app; if you don't use select * to populate all the properties, you end up writing extra methods to fetch particular properties for special situations. Your app can have good performance using select *, and in some cases you will need to select specific columns to improve it. That way you get the better of both worlds: ease of writing and maintaining the app, and performance where you need it.
The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this (while I was still writing my answer - grr - hence the SQL below references the other question).
You should always use SELECT attribute, attribute.... NOT SELECT *
It's primarily for performance issues.
SELECT name FROM users WHERE name='John';
Is not a very useful example. Consider instead:
SELECT telephone FROM users WHERE name='John';
If there's an index on (name, telephone) then the query can be resolved without having to look up the relevant values from the table - there is a covering index.
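That covering index would be something like this (index name illustrative):
CREATE INDEX idx_users_name_telephone ON users (name, telephone);

-- resolved entirely from the index; no lookup into the base table is needed
SELECT telephone FROM users WHERE name = 'John';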
Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet...
using SELECT * will pull all this information back into the DBMS buffers (forcing other useful information out of the cache). Then it will all be sent to the client, using up time on the network and memory on the client for data which is redundant.
It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.
There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.
The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.
On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)
As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.
But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.
Reference taken from this article:
The case against SELECT *:
When you use SELECT * you are selecting more columns from the database, and some of those columns might not be used by your application.
This creates extra cost and load on the database system, and more data travels across the network.
The case for SELECT *:
If you have special requirements and a dynamic environment where adding or deleting a column is handled automatically by the application code, then in this special case you don't need to change the application and database code, and the change automatically takes effect in the production environment. In this case you can use SELECT *.
Just to add a nuance to the discussion which I don't see here: In terms of I/O, if you're using a database with column-oriented storage you can do A LOT less I/O if you only query for certain columns. As we move to SSDs the benefits may be a bit smaller vs. row-oriented storage but there's a) only reading the blocks that contain columns you care about b) compression, which generally greatly reduces the size of the data on disk and therefore the volume of data read from disk.
If you're not familiar with column-oriented storage, one implementation for Postgres comes from Citus Data, another is Greenplum, another Paraccel, another (loosely speaking) is Amazon Redshift. For MySQL there's Infobright, the now-nigh-defunct InfiniDB. Other commercial offerings include Vertica from HP, Sybase IQ, Teradata...
select * from table1 INTERSECT select * from table2
is equivalent to
select distinct t1 from table1 where exists (select t2 from table2 where table1.t1 = table2.t2)