I have a table with 755 columns that currently holds around 2 million records, and it will grow. Many procedures that join it with other tables are running slowly. It's hard to split or normalize the table now, as everything is already built and the customer is not ready to spend much on it. Is there any way to make queries against that table faster? Please advise.
Will a columnstore index help?
How little are they prepared to spend?
It may be possible to split this table into multiple 1 to 1 joined tables (vertical partitioning), then use a view to present it as one single blob to existing code.
With some luck you may get join elimination happening frequently enough to make it worthwhile.
The view will probably require INSTEAD OF triggers to fully replicate the existing logic. INSTEAD OF triggers have a number of restrictions, e.g. no support for the OUTPUT clause, which can prove too hard to overcome depending on your specific setup.
You can give the view the same name as the existing table, which eliminates the need to fix code everywhere.
IMO this is the simplest you can do short of a full DB re-factoring exercise.
See: http://aboutsqlserver.com/2010/09/15/vertical-partitioning-as-the-way-to-reduce-io/ and https://logicalread.com/sql-server-optimizer-may-eliminate-foreign-key-joins-mc11/#.WXgEzlERW6I
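A minimal sketch of the split-plus-view idea, run through Python's built-in sqlite3 module so it is easy to try. The table and column names (wide_hot, wide_cold, notes, legacy_code) are made up for illustration, and SQLite stands in for SQL Server here, so it only shows the shape of the schema and says nothing about whether join elimination kicks in on your server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical split of a wide table: frequently used columns in one
    -- table, rarely used columns in another, sharing the same primary key.
    CREATE TABLE wide_hot  (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE wide_cold (id INTEGER PRIMARY KEY REFERENCES wide_hot(id),
                            notes TEXT, legacy_code TEXT);

    -- The view takes the name the original table had, so existing
    -- SELECT statements keep working unchanged.
    CREATE VIEW wide AS
    SELECT h.id, h.name, h.status, c.notes, c.legacy_code
    FROM wide_hot AS h
    LEFT JOIN wide_cold AS c ON c.id = h.id;

    INSERT INTO wide_hot  VALUES (1, 'alpha', 'open'), (2, 'beta', 'closed');
    INSERT INTO wide_cold VALUES (1, 'first note', 'A1'), (2, NULL, 'B2');
""")

# A query that only touches "hot" columns, written against the old name.
hot_rows = conn.execute("SELECT id, name, status FROM wide ORDER BY id").fetchall()

# A query that needs columns from both halves still sees one logical row.
full_row = conn.execute("SELECT * FROM wide WHERE id = 1").fetchone()

print(hot_rows)
print(full_row)
```

Reads work as-is; writes through the view are where the INSTEAD OF triggers mentioned above would come in, and those are not shown here.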
755 columns is a lot. Try indexing the columns that are used most often in WHERE clauses; this might speed up the queries.
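To check whether an index on a filtered column actually gets used, compare the query plan before and after creating it. This is a small sketch using Python's sqlite3 module; big_table, customer_id and the index name are invented, and SQL Server has its own plan tooling, so treat it as an illustration of the workflow only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO big_table (customer_id, status, payload) VALUES (?, ?, ?)",
    [(i % 50, "open" if i % 3 else "closed", "x" * 20) for i in range(1000)],
)


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


query = "SELECT id, status FROM big_table WHERE customer_id = ?"

plan_before = query_plan(conn, query, (7,))   # full scan, no index yet
conn.execute("CREATE INDEX ix_big_table_customer ON big_table (customer_id)")
plan_after = query_plan(conn, query, (7,))    # should now mention the index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, so the useful signal is simply whether the index name shows up.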
That is fine, don't worry about it. The number of columns a table has is not important in SQL Server (but note that I said 'has'). The main problems are the number of rows and how many columns you select in your queries. Here are a few points to check first.
Do not use SELECT *; replace it wherever it is used.
In joins, do not join the table directly; filter it first in an inner SELECT. (Just try it. I know nothing about your table, so these are general rules.)
Try to reduce the number of rows, for example by moving old records to a history table. Whether this technique fits depends on the needs of your organization.
Try column indexes and similar features.
And of course, remove dynamic SELECTs from your queries.
I hope one of these works.
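The history-table suggestion could look roughly like this. It is a sketch in Python's sqlite3 with an invented orders table and an invented cutoff date; on SQL Server you would write the same INSERT ... SELECT plus DELETE in T-SQL, probably in batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders         (id INTEGER PRIMARY KEY, created TEXT, amount REAL);
    CREATE TABLE orders_history (id INTEGER PRIMARY KEY, created TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2015-03-01', 10.0),
        (2, '2016-07-15', 20.0),
        (3, '2017-06-30', 30.0);
""")


def archive_before(conn, cutoff):
    """Move rows older than `cutoff` (ISO date string) into the history table.

    Both statements run in one transaction, so a failure leaves the
    live table untouched.
    """
    with conn:
        conn.execute(
            "INSERT INTO orders_history (id, created, amount) "
            "SELECT id, created, amount FROM orders WHERE created < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM orders WHERE created < ?", (cutoff,))
    return cur.rowcount


moved = archive_before(conn, "2017-01-01")
live_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
history_count = conn.execute("SELECT COUNT(*) FROM orders_history").fetchone()[0]
print(moved, live_count, history_count)
```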
I encounter this a lot when writing SQL. I have two tables that are meant to be in a one-to-one relationship with each other, and I wish I could easily assert that fact in my query. For example, the simplified query:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOIN Location ON Person.LocationID = Location.ID
When I read this query I think to myself, well what if the Location table fails to enforce uniqueness on its ID column? Suddenly you could have the same Person multiple times in your resultset. Sure, I can go look at the schema to assure myself it's unique so everything will be okay, but why shouldn't I simply be able to put it right here in my query, a la:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOINONE Location ON Person.LocationID = Location.ID
Not only would a keyword like this (made up "JOINONE") make it 100% clear to a human reading this query that we are guaranteed to get exactly one row for each Person record, but it lets the db engine optimize its execution plan because it knows there won't be more than one match each, even if the foreign key relationship isn't defined in the schema.
Another advantage of this would be that the db engine could enforce it, so if the data actually did have more than one match, an error could be thrown. This happens for subqueries already, e.g.:
SELECT Person.ID, Person.Name
, (
SELECT Location.Address1
FROM Location
WHERE Location.ID = Person.LocationID
) AS Address1
FROM Person
This is nice and spiffy, 100% clear to the human reader, neatly optimizable, and enforced by the db engine. In fact I often end up doing things this way for all those reasons. The problem is, besides the distracting syntax, you can only select one field this way. (What if I want City, State, and Zip too?) How nice it would be if you could flow this table right along with the rest of your JOINs and select any fields from it you wish in your SELECT clause just like all the rest of your tables.
I couldn't find any other question like this around StackOverflow, though I did find lots of repeats of a close question: people wanting to choose a single record. Close but really quite a different kind of goal, and less meaningful in my opinion.
I'm posting this question to see if there's some mechanism already in the SQL language that I'm missing, or an efficient workaround anyone has come up with. The concept of a one-to-one vs. one-to-many relationship is so fundamental to relational database design, I'm just so surprised at the absence of this language element.
SQL is two languages in one. Constraints, including uniqueness constraints, are set using the data definition language (DDL) in SQL. This is a layer above the data manipulation language (DML), where SELECT statements live, and it's understood that statements issued in the DDL might invalidate statements in the DML.
There's no way for a query to prevent someone from executing an ALTER TABLE command and changing the name of a field that the query refers to between query runs.
And there isn't much more of a way for a query to be written defensively against uncertain constraints; it's OK if you need to ask someone for information outside of the database environment to address this. The information may also be available within the environment; in most engines, you get to it by querying the data dictionary. This is the INFORMATION_SCHEMA in MySQL, for instance.
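As a rough illustration of "ask the data dictionary", here are two checks sketched with Python's sqlite3: one looks for a declared unique index on the join column, the other looks for actual duplicate keys in the data. SQLite's PRAGMA calls stand in for INFORMATION_SCHEMA here, and the LocationLoose table is invented to show the failing case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Location      (ID INTEGER NOT NULL UNIQUE, Address1 TEXT);
    CREATE TABLE LocationLoose (ID INTEGER NOT NULL,        Address1 TEXT);
    INSERT INTO Location      VALUES (1, '1 Main St'), (2, '2 Oak Ave');
    INSERT INTO LocationLoose VALUES (1, '1 Main St'), (1, '1 Main Street'),
                                     (2, '2 Oak Ave');
""")


def has_unique_index(conn, table, column):
    """True if the schema declares a single-column unique index on `column`.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    indexes = conn.execute(f'PRAGMA index_list("{table}")').fetchall()
    for index in indexes:
        name, is_unique = index[1], index[2]
        if not is_unique:
            continue
        cols = [r[2] for r in conn.execute(f'PRAGMA index_info("{name}")').fetchall()]
        if cols == [column]:
            return True
    return False


def duplicate_keys(conn, table, column):
    """Return key values that occur more than once in the actual data."""
    sql = (f'SELECT "{column}" FROM "{table}" '
           f'GROUP BY "{column}" HAVING COUNT(*) > 1')
    return [row[0] for row in conn.execute(sql).fetchall()]


print(has_unique_index(conn, "Location", "ID"), duplicate_keys(conn, "Location", "ID"))
print(has_unique_index(conn, "LocationLoose", "ID"), duplicate_keys(conn, "LocationLoose", "ID"))
```

Neither check lives inside the query itself, which is the point of the answer above: the guarantee comes from the schema (or a test), not from the SELECT.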
I need to build a schema structure to support a table of contents (so the level of sections / sub-sections could change for each book or document I add)... one of my first thoughts was that I could use a recursive table to handle that. I want to make sure that my structure is normalized, so I was trying to stay away from denormalising the table of contents data into a single table (and then having to add columns when there are more sub-sections).
It doesn't seem right to build a recursive table and could be kind of ugly to populate.
Just wanted to get some thoughts on some alternate solutions or if a recursive table is ok.
Thanks,
S
It helps that SQL Server 2008 has both the recursive WITH clause and hierarchyid to make working with hierarchical data easier - I was pointing out to someone yesterday that MySQL doesn't have either, making things difficult...
The most important thing is to review your data - if you can normalize it to be within a single table, great. But don't shoehorn it in to fit a single table setup - if it needs more tables, then design it that way. The data & usage will show you the correct way to model things.
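For what it's worth, a self-referencing table plus a recursive WITH query is not much code. The sketch below uses Python's sqlite3, which also supports recursive common table expressions; the section table, its columns and the zero-padded path trick for ordering are my own assumptions, not something from the question, and it does not use hierarchyid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical self-referencing table: each section points at its parent.
    CREATE TABLE section (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES section(id),
        position  INTEGER NOT NULL,   -- order among siblings
        title     TEXT NOT NULL
    );
    INSERT INTO section VALUES
        (1, NULL, 1, 'Introduction'),
        (2, NULL, 2, 'Design'),
        (3, 2,    1, 'Schema'),
        (4, 2,    2, 'Queries'),
        (5, 3,    1, 'Keys');
""")

toc = conn.execute("""
    WITH RECURSIVE toc(id, title, depth, path) AS (
        -- Anchor: top-level sections.
        SELECT id, title, 0, printf('%03d', position)
        FROM section
        WHERE parent_id IS NULL
        UNION ALL
        -- Recursive step: children of rows already in the result.
        SELECT s.id, s.title, toc.depth + 1,
               toc.path || '.' || printf('%03d', s.position)
        FROM section AS s
        JOIN toc ON s.parent_id = toc.id
    )
    SELECT depth, title FROM toc ORDER BY path
""").fetchall()

for depth, title in toc:
    print("  " * depth + title)
```

Adding a deeper level of sub-sections only means inserting rows with the right parent_id; no new columns are needed.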
When in doubt, keep it simple. Where you've a collection of similar items, e.g. employees then a table that references itself makes sense. Whilst here you can argue (quite rightly) that each item within the table is a 'section' of some form or another, unless you're comfortable with modelling the data as sections and handling the different types of sections through relationships to these entities, I would avoid the complexity of a self-referencing table and stick with a normalized approach.
I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
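The failure mode is easy to reproduce. This sketch uses Python's sqlite3 with two invented versions of a person table to stand in for "before" and "after" a column was added in the middle; it is not the ADO code from that application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows support access by index and by name
conn.executescript("""
    -- v1 is the table as the code was written against it;
    -- v2 has a new column inserted before `name`.
    CREATE TABLE person_v1 (id INTEGER, name TEXT);
    CREATE TABLE person_v2 (id INTEGER, title TEXT, name TEXT);
    INSERT INTO person_v1 VALUES (1, 'Ada');
    INSERT INTO person_v2 VALUES (1, 'Dr', 'Ada');
""")

old_row = conn.execute("SELECT * FROM person_v1").fetchone()
new_row = conn.execute("SELECT * FROM person_v2").fetchone()

# Positional access silently starts returning a different column.
by_index = (old_row[1], new_row[1])

# Access by name keeps returning the same column.
by_name = (old_row["name"], new_row["name"])

print(by_index)
print(by_name)
```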
I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit because it will ensure that if additional columns are added to the base table it will raise an error so it cannot be forgotten to deal with this in the audit trigger and/or audit table structure.
(Like dotjoe) I am also happy using it in derived tables and common table expressions. Though I habitually do it the other way round.
WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle SELECT * ought to be fine in a view as well, since it is the final SELECT from the view where it ought to be avoided. In SQL Server, however, this can cause problems: it stores column metadata for views, which is not automatically updated when the underlying tables change, and the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.
There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)
I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;
None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).
I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).
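A quick way to convince yourself that the select list inside EXISTS does not change the result: run both forms side by side. The sketch uses Python's sqlite3, and since the join condition above is elided, the some_table_id column and the correlation on it are placeholders of mine. It only compares results and makes no claim about the plans on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE some_table    (id INTEGER PRIMARY KEY, some_field TEXT);
    CREATE TABLE related_table (id INTEGER PRIMARY KEY,
                                some_table_id INTEGER, payload TEXT);
    INSERT INTO some_table    VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO related_table VALUES (10, 1, 'x'), (11, 1, 'y'), (12, 3, 'z');
""")

# Same query twice; only the select list inside EXISTS differs.
TEMPLATE = """
    SELECT some_field
    FROM some_table AS s
    WHERE EXISTS (SELECT {cols}
                  FROM related_table AS r
                  WHERE r.some_table_id = s.id)
    ORDER BY some_field
"""

with_star = conn.execute(TEMPLATE.format(cols="*")).fetchall()
with_one = conn.execute(TEMPLATE.format(cols="1")).fetchall()

print(with_star)
print(with_one)
```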
In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.
You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.
And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.
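You can see the duplicated join column by looking at the result set's column names. A small sketch with Python's sqlite3 and invented customer/orders tables; the column naming shown is what SQLite reports in practice for an un-aliased SELECT *, and other drivers may label the columns differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO orders   VALUES (100, 1, 9.5);
""")

cur = conn.execute("""
    SELECT *
    FROM orders
    INNER JOIN customer ON orders.customer_id = customer.customer_id
""")

# cursor.description holds one entry per result column; item 0 is the name.
star_columns = [d[0] for d in cur.description]
row = cur.fetchone()

print(star_columns)  # the join column shows up once per table
print(row)
```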
As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.
When creating an application that deals with the database, like phpmyadmin, and you are on a page that displays a full table, using SELECT * can be justified, I guess.
About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.
I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous since the purpose of the tables were simply to support various views in the application.
How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?
It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * for 3 cases (some mentioned in other answers):
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that dealt with generic tables without needing to know what they mean (e.g. a dumper, or differ).
Yes, but only in situations where the intention is to actually get all the columns from a table not because you want all the columns that a table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.
I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
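In most client libraries the column names come back with the result set, so this is only a few lines. A sketch using Python's sqlite3, where cursor.description carries the names; the mystery table is made up, and the table name is interpolated directly, so it must come from a trusted source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mystery (code TEXT, qty INTEGER, note TEXT);
    INSERT INTO mystery VALUES ('A', 3, 'first'), ('B', 5, NULL);
""")


def fetch_with_headers(conn, table):
    """Run SELECT * on `table` and return (column names, rows)."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    headers = [d[0] for d in cur.description]  # names discovered at run time
    return headers, cur.fetchall()


headers, rows = fetch_with_headers(conn, "mystery")
print(headers)
for r in rows:
    print(dict(zip(headers, r)))
```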
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.
Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem SELECT * is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc. ... then I agree that SELECT * is a bad idea.
EDIT: Oh and when the source table is a stored procedure for a trigger or view, "SELECT *" should be fine because you're managing the resultset through other means (the view's definition or the stored proc's resultset).
Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit.. following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things like tried to access columns(i), which could break in other situations anyway :)
I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.
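Here is the same drop-and-restage pattern as a runnable sketch in Python's sqlite3. SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT * is used as the nearest equivalent, and remote_aTable / staging_aTable are local stand-ins for the remote and staging tables.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# which keeps the DDL below free of implicit transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE remote_aTable (id INTEGER, label TEXT)")
conn.execute("INSERT INTO remote_aTable VALUES (1, 'one'), (2, 'two')")


def restage(conn):
    """Drop the staging copy and rebuild it from whatever the source has now."""
    conn.execute("DROP TABLE IF EXISTS staging_aTable")
    conn.execute("CREATE TABLE staging_aTable AS SELECT * FROM remote_aTable")


def column_names(conn, table):
    """Return the column names of `table` (trusted table names only)."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    names = [d[0] for d in cur.description]
    cur.close()  # release the statement before any later DDL
    return names


restage(conn)
first_load = column_names(conn, "staging_aTable")

# The third-party schema changes; the staging job itself needs no edit.
conn.execute("ALTER TABLE remote_aTable ADD COLUMN added_later TEXT")
restage(conn)
second_load = column_names(conn, "staging_aTable")

print(first_load)
print(second_load)
```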
If you want to find all the columns and want order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
This shows all the relevant information about your fields. It helps you prevent problems with types, and you know all the column names for sure. The command is very quick, because you only ask for the structure of the table. From the results you take all the names and build a string like this:
"select " + fieldNames[0] + ", " + fieldNames[1] + ", " + fieldNames[2] + " from mytable" (2)
If you don't want to run two separate MySQL commands because a MySQL command is expensive, you can include (1) and (2) into a stored procedure which will have the results as an OUT parameter, that way you will just call a stored procedure and every command and data generation will happen at the database server.
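Steps (1) and (2) can be sketched like this in Python's sqlite3. SQLite has no SHOW COLUMNS, so PRAGMA table_info plays that role here; the mytable columns are invented, and identifiers are interpolated directly, so only trusted table names should be passed in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    INSERT INTO mytable VALUES (1, 'widget', 2.5);
""")


def explicit_select(conn, table):
    """Build 'SELECT col1, col2, ... FROM table' from the table's metadata."""
    # Step (1): ask for the structure only; row[1] is the column name.
    info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    names = [row[1] for row in info]
    if not names:
        raise ValueError(f"no such table: {table}")
    # Step (2): join the names into an explicit, ordered column list.
    column_list = ", ".join(f'"{n}"' for n in names)
    return f'SELECT {column_list} FROM "{table}"'


sql = explicit_select(conn, "mytable")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```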
How do you select all fields of two joined tables, without having conflicts with the common field?
Suppose I have two tables, Products and Services. I would like to make a query like this:
SELECT Products.*, Services.*
FROM Products
INNER JOIN Services ON Products.IdService = Services.IdService
The problem with this query is that IdService will appear twice and lead to a bunch of problems.
The alternative I found so far is to discriminate every field from Products except the IdService one. But this way I'll have to update the query every time I add a new field to Products.
Is there a better way to do this?
What are the most common SQL anti-patterns?
You've hit anti-pattern #1.
The better way is to provide a fieldlist. One way to get a quick field list is to
sp_help tablename
And if you want to create a view from this query - using select * gets you in more trouble. SQL Server captures the column list at the time the view is created. If you edit the underlying tables and don't recreate the view - you're signing up for trouble (I had a production fire of this nature - view was against tables in a different database though).
You should NEVER have SELECT * in production code (well, almost never, but the times where it is justified can be easily counted).
As far as I am aware you'll have to avoid SELECT *, but this isn't really a problem.
SELECT * is usually regarded as a problem waiting to happen for the reason you quote as an advantage! Usually extra results columns appearing for queries when the database has been modified will cause problems.
Does your dialect of SQL support COMPOSE? COMPOSE gets rid of the extra copy of the column that's used on an equijoin, like the one in your example.
As others have said, SELECT * is bad news, especially if other fields are added to the tables you are querying. You should select the exact fields you want from the tables, and you can use an alias for fields with the same name or just use table.columnName.
Do not use *. Use something like this:
SELECT P.field1 AS 'Field from P'
, P.field2
, S.field1 AS 'Field from S'
, S.field4
FROM Products P
INNER JOIN
Services S
ON P.IdService = S.IdService
That would be correct: list the fields you want (in SQL Server you can drag them over from the object browser, so you don't have to type them all). Incidentally, if there are fields your specific query does not need, do not list them. Listing them creates extra work for the server and uses up extra network resources, and it can be one of the causes of poor performance when it is done throughout your system and such wasteful queries are run thousands of times a day.
As to it being a maintenance problem, you only need to add the fields if the part of the application that uses your query would be affected by them. If you don't know what effect the new field would have or where you need to add it, you shouldn't be adding the field. Adding new fields unexpectedly through the use of select * can cause maintenance problems as well. Creating performance problems to avoid doing maintenance (maintenance you may never even need to do, as column changes should be rare; if they aren't, you need to look at your design) is pretty short-sighted.
The best way is to specify the exact fields that you want from the query. You shouldn't use * anyway.
It is convenient to use * to get all fields, but it doesn't produce robust code. Any change in the table will change the result that is returned from the query, and that is not always desirable.
You should return only the data that you really want from the query, specified in the exact order you want it. That way the result looks exactly the same even if you add fields to the table or change the order of the fields in the table.
It's a little more work to specify the exact output, but in the long run it usually pays off. When you make a change, only what you actually change is affected; you don't get cascading effects that break code you didn't even know was affected.
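To make the "result looks exactly the same" point concrete, here is a sketch in Python's sqlite3 using the Products and Services tables from the question. ProductName, ServiceName and the added Discontinued column are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Services (IdService INTEGER PRIMARY KEY, ServiceName TEXT);
    CREATE TABLE Products (IdProduct INTEGER PRIMARY KEY, ProductName TEXT,
                           IdService INTEGER REFERENCES Services(IdService));
    INSERT INTO Services VALUES (1, 'Delivery');
    INSERT INTO Products VALUES (10, 'Sofa', 1);
""")

JOIN = "FROM Products AS P INNER JOIN Services AS S ON P.IdService = S.IdService"
EXPLICIT = "SELECT P.IdProduct, P.ProductName, S.IdService, S.ServiceName " + JOIN
STAR = "SELECT P.*, S.* " + JOIN


def column_names(sql):
    """Return the result column names a query produces."""
    cur = conn.execute(sql)
    names = [d[0] for d in cur.description]
    cur.close()  # release the statement before the ALTER TABLE below
    return names


explicit_before, star_before = column_names(EXPLICIT), column_names(STAR)

# Someone adds a field to Products later on.
conn.execute("ALTER TABLE Products ADD COLUMN Discontinued INTEGER DEFAULT 0")

explicit_after, star_after = column_names(EXPLICIT), column_names(STAR)

print(explicit_before == explicit_after)   # explicit list: shape unchanged
print(len(star_before), len(star_after))   # star: one more column appears
```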