What happens if all data is added to the same table? - SQL

I was wondering what happens if all the data is added to a single table, without using any other tables or foreign keys. What effect does this have on performance if the data is very large?
This is purely out of curiosity. Please help me find an answer to this.

So, you could put a large amount of data into a single table as well, but if you miss proper indexing and partitioning (if required), then fetching the data will be very slow.
If the table contains binary columns, it will be even worse.
As a best practice you should follow normalization; if that is not possible, you can use a single table as well, the only thing being that you have to prepare yourself for the performance cost.
Proper indexing and partitioning will help in such cases.
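For illustration only, here is a rough T-SQL sketch of what indexing and partitioning a single wide table could look like. The table and column names are invented, not taken from the question:
-- Hypothetical wide single table; names are placeholders.
-- A nonclustered index on the columns most often used for filtering:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);   -- covers a common query shape

-- Optional partitioning of a very large table by a date column
-- (the function and scheme must exist before a table or index can use them):
CREATE PARTITION FUNCTION pfOrdersByYear (datetime2)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');
CREATE PARTITION SCHEME psOrdersByYear
    AS PARTITION pfOrdersByYear ALL TO ([PRIMARY]);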

The answer strongly depends on the amount of data (what is 'huge'? 1 GB? 1 TB? 1 PB? ...) and the kind of data you have (unstructured? images? movies? ...).
Obviously, your data would not be handled efficiently -- except that you have only one table, if that is something one would want to call efficiency.
To see what would happen, you can basically test it yourself. You might use the AdventureWorks or WideWorldImporters database and then do something like this with a SELECT INTO:
SELECT * INTO TABLETARGET FROM TABLESOURCE
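If you want to put rough numbers on it, one hedged way (assuming a SQL Server sample database; the source table and filter below are just placeholders) is to build the single table with SELECT INTO and then compare I/O and timing statistics:
-- Copy sample data into one wide table, then measure a query against it.
SELECT *
INTO   dbo.BigSingleTable
FROM   Sales.SalesOrderDetail;       -- placeholder source table

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM   dbo.BigSingleTable
WHERE  ProductID = 707;              -- placeholder filter

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;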

Related

SQL Server Performance Tuning of large table

I have a table with 755 columns, currently holding around 2 million records, and it will grow. There are many procedures accessing it with joins to other tables, and they are running slowly. It's hard to split/normalize it now because everything is already built and the customer is not ready to spend much on it. Is there any way to make query access to that table faster? Please advise.
Will a columnstore index help?
How little are they prepared to spend?
It may be possible to split this table into multiple 1-to-1 joined tables (vertical partitioning), then use a view to present it as a single unit to existing code.
With some luck you may get join elimination happening frequently enough to make it worthwhile.
The view will probably require INSTEAD OF triggers to fully replicate existing logic. INSTEAD OF triggers have a number of restrictions, e.g. no support for the OUTPUT clause, which can prove too hard to overcome depending on your specific setup.
You can name your view the same as the existing table, which will eliminate the need to fix code everywhere.
IMO this is the simplest thing you can do short of a full DB re-factoring exercise (a rough sketch follows the links below).
See: http://aboutsqlserver.com/2010/09/15/vertical-partitioning-as-the-way-to-reduce-io/ and https://logicalread.com/sql-server-optimizer-may-eliminate-foreign-key-joins-mc11/#.WXgEzlERW6I
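A minimal sketch of the idea, with made-up table and column names (the real split would follow the usage patterns of the 755 columns):
-- Vertical split: frequently used columns vs. rarely used ones.
CREATE TABLE dbo.Wide_Core  (Id int PRIMARY KEY, Col1 int, Col2 varchar(50));
CREATE TABLE dbo.Wide_Extra (Id int PRIMARY KEY REFERENCES dbo.Wide_Core(Id),
                             Col300 varchar(max));
GO
-- View named like the original table so existing code keeps working
-- (the original table would have been renamed or dropped after migration).
CREATE VIEW dbo.WideTable
AS
SELECT c.Id, c.Col1, c.Col2, e.Col300
FROM   dbo.Wide_Core  AS c
JOIN   dbo.Wide_Extra AS e ON e.Id = c.Id;
GO
-- Writes through the view need INSTEAD OF triggers; insert sketch only.
CREATE TRIGGER dbo.trg_WideTable_Insert ON dbo.WideTable
INSTEAD OF INSERT
AS
BEGIN
    INSERT dbo.Wide_Core  (Id, Col1, Col2) SELECT Id, Col1, Col2 FROM inserted;
    INSERT dbo.Wide_Extra (Id, Col300)     SELECT Id, Col300     FROM inserted;
END;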
755 columns, that's a lot. You should try to index the columns that are used most in WHERE clauses; this might speed up the process.
It is fine, don't worry about it; actually, how many columns you have is not that important in SQL Server (but be careful, I said 'have'). The main problem is the row count and how many columns you select in queries. There are a few points you can check first.
Do not use the * selector; change it wherever it is used.
In joins, do not use the table directly; you can filter it first as an inner select (see the sketch after this answer). (Just try it, I have no idea about your table so I am only giving the general rules.)
Try to diminish the row count, for example by using a history table for old records. This technique depends on the needs of your organization.
Try to use columnstore indexes and similar features.
And of course remove dynamic selects from your queries.
I hope one of them works.
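As a rough illustration of the 'filter first, then join' tip above (all names here are invented, and the optimizer will often produce the same plan for both forms, so measure it):
-- Joining the wide table directly:
SELECT w.Id, w.Col1, o.OrderDate
FROM   dbo.WideTable AS w
JOIN   dbo.Orders    AS o ON o.WideId = w.Id
WHERE  w.Status = 'A';

-- Filtering it first in a derived table, then joining:
SELECT w.Id, w.Col1, o.OrderDate
FROM  (SELECT Id, Col1
       FROM   dbo.WideTable
       WHERE  Status = 'A') AS w
JOIN   dbo.Orders AS o ON o.WideId = w.Id;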

SQL select * vs. selecting specific columns [duplicate]

This question already has answers here:
Why is SELECT * considered harmful?
I was wondering which is best practice. Let's say I have a table with 10+ columns and I want to select data from it.
I've heard that SELECT * is better, since selecting specific columns makes the database search for those columns before selecting, while selecting all just grabs everything. On the other hand, what if the table has a lot of columns in it?
Is that true?
Thanks
It is best practice to explicitly name the columns you want to select.
As Mitch just said, the performance isn't different. I have even heard that looking up the actual column names when using * is slower.
But the advantage is that when your table changes, your SELECT does not have to change if you name your columns.
I think these two questions here and here have satisfactory answers.
* is not better; the fact that it is actually slower is one reason SELECT * is not good. In addition to this, according to OMG Ponies, SELECT * is an anti-pattern. See the linked questions for details.
Selecting specific columns is better, as it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
It also requires fewer changes, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Definitely not. Try doing a SELECT * from a table which has millions of rows and tens of columns.
The performance with SELECT * will be worse.
It depends on what you're about to do with the result. Selecting unnecessary data is not a good practice either. You wouldn't create a bunch of variables with values you would never use. So selecting many columns you don't need is not a good idea either.
It depends.
Selecting all columns can make a query slower because of the need to read all columns from disk; if there are a lot of string columns (which are not in an index), it can have a huge impact on query (I/O) performance. And from my practice, you rarely need all columns.
On the other hand, for a small database with a few users and good enough hardware, it's much easier to just select all columns, especially if the schema changes often.
However, I would always recommend explicitly selecting columns to make sure it doesn't hurt performance.

Using temp table for sorting data in SQL Server

Recently I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a more verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt), and there is no other benefit.
For example, let's say there is a Users table. The table might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
Is there any benefit to writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise -- I haven't seen that one in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw that was about 3 years ago. We made it 3 times as fast by not trying to be 'smart' with a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly -- it depends on the amount of data, but an index seek would return the data in order already (as the index is ordered by its content).
3: No. Obviously. Query plan optimization is statement by statement. By cutting the execution into 2 statements, the query optimizer CANNOT merge the second statement into the first.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join -- not in that degenerate case (degenerate in a technical sense, i.e. very simplistic). But if you need to join MANY, MANY tables it may be better to go with an interim step.
If the field you want to ORDER BY is not indexed, you could put everything into a temp table, index it, and then do the ordering, and it might be faster. You would have to test to make sure.
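A sketch of that 'index the temp table' variant, reusing the Users/Name names from the question (whether the extra index actually helps the final ORDER BY depends on the plan, so test it):
SELECT *
INTO   #Users_G
FROM   Users
WHERE  NAME LIKE 'G%';

CREATE INDEX IX_UsersG_Name ON #Users_G (Name);

SELECT * FROM #Users_G
ORDER BY Name;

DROP TABLE #Users_G;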
There is never any benefit of the second approach that I can think of.
It means if the data is available pre-ordered SQL Server can't take advantage of this and adds an unnecessary blocking operator and additional sort to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be suboptimal. In that case I would resolve it in a different way, either by improving statistics or by using hints/query rewrites to avoid the undesired plan.
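For example (a hedged sketch only; which hint, if any, is appropriate depends entirely on the actual plan problem):
-- Recompile the statement so the plan reflects current statistics/parameters:
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
OPTION (RECOMPILE);

-- Or force a seek on the table (requires a usable index; use sparingly):
SELECT * FROM Users WITH (FORCESEEK)
WHERE NAME LIKE 'G%'
ORDER BY Name;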

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of databases.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen, however, this still seems like a bad idea, and SELECT * should never be used, for a lot more technical reasons than I ever thought about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The Real Reason SELECT Queries are Bad: Index Coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
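A small sketch of the index-coverage point, with made-up names; the narrow column list can be answered from the index alone, while SELECT * cannot:
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
    ON dbo.Orders (CustomerId, OrderDate);

-- Served entirely from the index (covering):
SELECT CustomerId, OrderDate
FROM   dbo.Orders
WHERE  CustomerId = 42;

-- Forces key lookups or a scan, because * includes columns not in the index:
SELECT *
FROM   dbo.Orders
WHERE  CustomerId = 42;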
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs, including eliminating the use of a potential covering index, causing data-page loads (and server cache thrashing), and incurring row/page/table locks that might otherwise be avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, because of the way you specified the question, this all comes down to the issue of resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates, and other similar things that get added by DBAs systematically for compliance issues. Another source of underhanded changes is denormalization for performance elsewhere in the system, or user-defined fields.
You should only select the columns that you need. Even if you need all columns, it's still better to list the column names so that SQL Server does not have to query the system tables for the columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect, too, and it might not know how to process them.
Apart from this, if the table has a binary column then the query will be much slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
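If scripting the whole schema is inconvenient, a hedged alternative on SQL Server is to search the module definitions directly (simple LIKE matching, so expect some false positives; the column name below is a placeholder):
SELECT OBJECT_SCHEMA_NAME(object_id) AS SchemaName,
       OBJECT_NAME(object_id)        AS ObjectName
FROM   sys.sql_modules
WHERE  definition LIKE '%SomeColumnName%';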
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
Definitely define the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need: if you think about it one time, SQL doesn't have to work out "wtf is *" every time you query. On top of that, someone may later add columns to the table that you do not actually need in your query, and in that case you'll be better off having specified all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
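The view problem described above is easy to reproduce; a minimal sketch with invented names (SQL Server):
CREATE TABLE dbo.Products (Id int, Name varchar(50));
GO
CREATE VIEW dbo.vProducts AS SELECT * FROM dbo.Products;
GO
ALTER TABLE dbo.Products ADD Price decimal(10,2);
GO
-- The view still returns only Id and Name until its metadata is refreshed:
SELECT * FROM dbo.vProducts;
EXEC sys.sp_refreshview 'dbo.vProducts';
SELECT * FROM dbo.vProducts;   -- now includes Price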
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... the query may get awful). Then measure. Compare with the query that lists the column names explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
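As a hedged sketch of the parameterised alternatives (SQL Server syntax; the names reuse the placeholders above):
-- A stored procedure with a parameter:
CREATE PROCEDURE dbo.GetByPrefix
    @Prefix varchar(50)
AS
BEGIN
    SELECT Field1, Field2
    FROM   SomeTable
    WHERE  Field1 LIKE @Prefix + '%';
END;
GO
-- Or an ad-hoc parameterised statement instead of string concatenation:
EXEC sys.sp_executesql
     N'SELECT Field1, Field2 FROM SomeTable WHERE Field1 LIKE @p + ''%''',
     N'@p varchar(50)',
     @p = 'G';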
It is NOT faster to use explicit field names versus * if, and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance-wise, I have seen comments that both are equal, but from a usability aspect there are some pluses and minuses.
When you use a SELECT * in a query and someone alters the table and adds new fields which are not needed by the previous query, that is unnecessary overhead. And what if the newly added field is a blob or an image field? Your query response time is going to be really slow then.
On the other hand, if you use a SELECT col1, col2, .. and the table gets altered and new fields are added which are needed in the result set, you always need to edit your select query after the table alteration.
But I suggest always using SELECT col1, col2, ... in your queries, and altering the query if the table gets altered later.
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using SELECT *, it takes an average of 2869 ms.
When I execute the query using SELECT with the 54 named columns, it takes an average of 1513 ms.
The total number of rows returned is 13,949.
There is no doubt that selecting column names means faster performance than SELECT *.
SELECT is equally efficient (in terms of speed) whether you use * or name the columns.
The difference is about memory, not speed. When you select several columns, SQL Server must allocate memory space to serve the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the execution plan, which in turn depends heavily on your WHERE clause and the number of JOINs, OUTER JOINs, etc.
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use SELECT * or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a SELECT * because almost all my data is retrieved from SQL Server views that pre-combine needed values from multiple tables into a single, easy-to-access view.
I then do want all the columns from the view, which won't change when new fields are added to the underlying tables. This has the added benefit of allowing me to change where the data comes from. FieldA in the view may at one time be calculated, and then I may change it to be static. Either way the view supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQL Server is hard or even time-consuming. You just drag them over from the object browser (you can get them all in one go by dragging from the word Columns). To put a permanent performance hit on your system (because this can reduce the use of indexes, and because sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see, for instance), just to save less than a minute of development time, is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

performance of single table vs. joined dual structure

This is not a question about using another tool. It is not a question about using a different data structure. It is a question about WHY I see what I see -- please read to the end before answering. Thank you.
THE STORY
I have one single table which has one constraint: records are not deleted. Instead a record is marked as not active (there is a field for that), and in that case all fields (except for the identifiers and this isActive field) are considered irrelevant.
More on identifiers -- there are two fields:
id -- int, primary key, clustered
name -- unique, varchar, external index
How the update is done, for example (I use C#/Linq/MSSQL2005): I fetch the record based on name, then change the required fields and commit the changes, so the update is executed (the UPDATE uses id, not name).
However, there is a problem with storage. So why not break this table into a dual structure -- a "header" table (id, name, isActive) and a data table (id, rest of the fields)? In case of a problem with storage we can delete all records from the data table for real (for isActive = false).
edit (by Shimmy): header+data are not retrieved by LINQ with join. Data records are loaded on demand (and this always occurs because of the code).
comment (by poster): AFAIR there is no join, so this is irrelevant. Data for the headers was loaded manually. See below.
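For reference, a sketch of the two layouts being compared; the table and payload column names are illustrative only:
-- Mono (single) table:
CREATE TABLE Items (
    id       int          IDENTITY PRIMARY KEY CLUSTERED,
    name     varchar(100) NOT NULL UNIQUE,   -- the external (nonclustered) index
    isActive bit          NOT NULL,
    data1    varchar(200),
    data2    varchar(200)
);

-- Dual structure: header + data, joined 1:1 on id.
CREATE TABLE ItemHeader (
    id       int          IDENTITY PRIMARY KEY CLUSTERED,
    name     varchar(100) NOT NULL UNIQUE,
    isActive bit          NOT NULL
);
CREATE TABLE ItemData (
    id       int PRIMARY KEY CLUSTERED REFERENCES ItemHeader(id),
    data1    varchar(200),
    data2    varchar(200)
);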
Performance -- (my) Theory
Now, what about performance? Which one will be faster? Let's say you have 10,000 records in each table (single, header, data) and you update them one by one (across all 3 tables) -- the isActive field and some field from the "data" fields.
My calculation was/is:
mono table -- search using the external index, then jump into the structure, fetch all the data, update using the primary key.
dual tables -- search using the external index, jump into the header table, fetch all the data, search using the primary key on the data table (no jumping here, it is the clustered index), fetch all the data, update both tables using primary keys.
So, for me mono structure should be faster, because in dual case I have the same operations plus some extras.
The results
No matter what I do -- update, select, insert -- the dual structure is either slightly better (in speed) or up to 30% faster. And now I am puzzled: I would understand it if I were inserting/updating/selecting only header records, but in every case the data records are used as well.
The question -- why/how dual structure can be faster?
I think this all boils down to how much data is being fetched, inserted, and updated.
SELECT case - in the dual-table configuration you're fetching less data. Database runtime is heavily dominated by I/O time, so having the "header" fields replicated on each row in the single-table configuration means you have to read that same data over and over again. In the dual-table configuration you only read the header data once.
INSERT case - similar to the above, but related to writing the data instead of reading it.
UPDATE case - your code updates the "isActive" field, which if I've read it correctly is part of the "header" fields. In the single-table configuration you're forcing many rows to be updated for each "isActive" change. In the dual-table configuration you're only updating a single header row for each "isActive" change.
I think this is a case of premature optimization. I get the feeling that you understood that according to data normalization rules the dual-table configuration was "better" - but because the single-table case seemed like it would provide better performance you wanted to go with that design. Thankfully you took the time to test what would happen and found that observed performance did not match your expectations. GOOD JOB! I wish more people would take the time to test things out like this. I think the lesson to learn here is that data normalization is a Good Thing.
Remember, the best time to optimize something is NEVER! The second-best time to optimize things is when you have an observed performance problem. The worst time to optimize is during analysis.
I hope this helps.
Assumption: Sql Server for database.
SQL Server tends to perform better on narrow tables than on wide ones, while this might not be true for something such as a mainframe.
This really points to normalizing until you decide NOT to for performance reasons, and in this case the assumption that de-normalized tables would be more efficient is incorrect. Normalized structures can be managed better within the available resources than de-normalized ones in this environment. I suspect (with no citable basis for this) that the resources (hardware, multiprocessors, threading, etc.) make the normalized structure faster because more work gets done at the same time.
Have you looked at the two query plans? That often gives it away.
As for speculation, the size of a row in a table affects how fast you can scan it. Smaller rows mean more rows fit into a data page. The brunt of a query is usually in the I/O time, so using two smaller tables greatly reduces the amount of data you have to sift through in the indexes.
Also, the locks are more granular -- one update can write to table1 while another is still finishing up table2.