Good resources for learning database optimization [closed] - sql

I am good at the database (SQL) programming part, but I want to move into database optimization: where and when to use indexes, how to decide which of two queries is better, and how to optimize a database in general. Can you point me to some good resources or books on this?

Inside Microsoft SQL Server 2005: Query Tuning and Optimization,
Inside Microsoft SQL Server 2005: T-SQL Querying, and
Inside Microsoft SQL Server 2005: The Storage Engine
have very deep and thorough explanations of optimizing SQL Server queries.

SQL Server Query Performance Tuning Distilled, Second Edition

I've recently been focusing on this for my company, and I've learned some interesting things, specifically about query optimization.
I've run SQL Profiler for a half hour at a time and logged queries that required 1000 reads or more (then later ones that required 50 CPU or more).
I originally focused on individual queries with the highest reads and CPU. However, having written the logs to a database, I was able to query aggregate results to see which queries required the most aggregate reads and CPU. Targeting these actually helped a lot more than only targeting the most expensive queries.
The most expensive query might be run once a day, so it's good to optimize that. However, if the 10th most expensive query is run 100 times an hour, it's much more helpful to optimize that first.
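If you go this route, the aggregation itself is straightforward once the trace is in a table. Here's a minimal sketch, assuming the trace was saved to a table named dbo.TraceLog (the table name is illustrative; the TextData, Reads, and CPU columns are what Profiler writes when a trace is saved to a table):
-- Top queries by total reads across the whole trace, not per execution.
-- TextData is ntext, so cast it before grouping.
SELECT TOP 20
       CAST(TextData AS NVARCHAR(4000)) AS QueryText,
       COUNT(*)   AS Executions,
       SUM(Reads) AS TotalReads,
       SUM(CPU)   AS TotalCPU
FROM dbo.TraceLog
GROUP BY CAST(TextData AS NVARCHAR(4000))
ORDER BY TotalReads DESC;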
Here's a summary of what I've learned so far, which can help you get started in identifying queries for optimization:
A Beginner's Guide to Database Query Optimization
Highly Inefficient Linq Queries that Break Database Indexing
An Obscure Performance Pitfall for Test Accounts and Improperly Indexed Database Tables

Here are some tips for database/query optimization.
Apply functions to parameters, not columns
One of the most common mistakes seen when looking at database queries is the improper use of functions against table columns. Whenever we need to apply a function to a column and validate the result against a value, it's worth checking whether there is a reverse function we can apply to the comparison value instead. That way, the database engine can use an index on the column, and there is no need to define a function-based index.
Against a 60-row table with no indexes whatsoever, the following query
SELECT ticker.SYMBOL,
       ticker.TSTAMP,
       ticker.PRICE
FROM ticker
WHERE TO_CHAR(ticker.TSTAMP, 'YYYY-MM-DD') = '2011-04-01'
executes in 0.006s, whereas the "reverse" query
SELECT ticker.SYMBOL,
       ticker.TSTAMP,
       ticker.PRICE
FROM ticker
WHERE ticker.TSTAMP = TO_DATE('2011-04-01', 'YYYY-MM-DD')
executes in 0.004s. Even on this tiny, unindexed table the reverse form wins; on a large table with an index on TSTAMP, only the second form can use that index, so the gap grows dramatically.
EXISTS clause instead of IN (subquery)
Another pattern often seen in database development is choosing the easiest, most convenient solution; for this tip, we'll look at finding an element in a list. The easiest and most convenient solution is the IN operator.
SELECT symbol, tstamp, price
FROM ticker
WHERE price IN (3,4,5);
--or
SELECT symbol, tstamp, price
FROM ticker
WHERE price IN (SELECT price FROM threshold WHERE action = 'Buy');
This approach is fine when we have a small, manageable list. When the list becomes very large, or when it is dynamic (generated from parameters known only at runtime), this approach tends to become quite costly for the database. The alternative is the EXISTS operator, as shown in the snippet below:
SELECT symbol, tstamp, price
FROM ticker t
WHERE EXISTS (SELECT 1 FROM threshold m WHERE t.price = m.price AND m.action = 'Buy');
This approach can be faster because once the engine finds a match, it stops looking: the condition has been proved true. With IN, the engine may collect all the results of the subquery before further processing. (Many modern optimizers treat the two forms equivalently, so measure on your own system.)

Related

Should I name tables based on date & time of creation, and use EXEC() and a variable to dynamically refer to these tables? [closed]

TL;DR: My current company creates a new table for every time period, such as sales_yyyymmdd, and uses EXEC() to dynamically refer to the table names, making the entire query red (it's all one string literal) and hard to read. What changes can I suggest to improve both readability and performance?
Some background: I'm a data analyst (not a DBA), so my SQL knowledge may be limited. I recently moved to a new company which uses MS SQL Server as its database management system.
The issues: The DAs here share a similar style of writing SQL scripts, which includes:
Naming tables based on their time of creation, e.g. each day's sales records are saved into a new table for that day, such as sales_yyyymmdd. This means there is a huge number of tables like this. Note that the DAs have their own database to tinker with, so they are allowed to create any number of tables there.
Writing queries enclosed in EXEC() that dynamically refer to table names based on some variable @date. As a result, their entire scripts become one red string literal, which is difficult for me to read.
They also claim that enclosing queries in EXEC(), in their own words, makes the scripts run to completion when stored as scheduled jobs, because when they write them the "normal way", these jobs sometimes stop midway.
My questions:
Regarding naming and creating new tables for every time period: I suppose this is obviously bad practice, at least in terms of management, due to the sheer number of tables. I suggested merging them and adding a created_date column, but the DAs here argued that both ways take up the same amount of disk space, so why bother with such a radical change. How do I explain this to them?
Regarding the EXEC() command: my issue with this way of writing queries is that it's hard to maintain and share with other people. My quick fix for now (if issue 1 remains) is to use a single EXEC() command to copy the needed tables into temp tables, then select from those temp tables instead. If new data needs to be merged, I first insert it into temp tables, manipulate it there, and finally merge it into the official table. Would this method affect performance at all (since there is an extra step involving temp tables)? And is there any better way that helps with both readability and performance?
I don't have experience scheduling jobs myself, as my previous company had a dedicated data engineering team that took my SQL scripts and automated them on a server. My googling has not yielded any results yet. Is it true that using EXEC() keeps jobs from being interrupted? If not, what is the actual issue here?
I know the post is long, and I'm not a native speaker. I hope I've explained my questions clearly enough, and I appreciate any help/answers.
Thanks everyone, and stay safe!
While I understand the reasons for creating a table for each day, I do not think this is the correct solution.
Modern databases do a very good job of partitioning data, and SQL Server has this feature too. In fact, such use cases are exactly the reason partitioning was created in the first place. For me that would be the way to go (a minimal sketch follows the list), as:
it's not a WTF solution (your description is easily understandable, but the current setup is still a WTF)
partitioning allows for optimizing partition-restricted queries, particularly time-restricted queries
it is still possible to execute a non-partition-based query, while the solution you showed would require a union, or multiple unions
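For a feel of what that looks like, here is a minimal sketch of day-based partitioning in SQL Server; the object names and the boundary values are illustrative only:
-- Map each InsertedDate to a date range; one partition per boundary value.
CREATE PARTITION FUNCTION pfSalesByDay (date)
AS RANGE RIGHT FOR VALUES ('20200301', '20200302', '20200303');

-- Place every partition on the same filegroup, for simplicity.
CREATE PARTITION SCHEME psSalesByDay
AS PARTITION pfSalesByDay ALL TO ([PRIMARY]);

CREATE TABLE Sales
(
    InsertedDate date NOT NULL,
    -- ... the other sales columns ...
    Amount money NULL
) ON psSalesByDay (InsertedDate);
Queries that filter on InsertedDate then touch only the relevant partitions, which is exactly the benefit the per-day tables were trying to get by hand.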
As everybody mentioned in the comments, you can have a single Sales table with an extra column holding the date the data was inserted.
Create a Sales table to hold all the sales data:
CREATE TABLE Sales
(
    col1 datatype,
    col2 datatype,
    ...
    InsertedDate date -- the date the sales data corresponds to
)
Insert all the existing tables' data into the above table:
INSERT INTO Sales
SELECT *, '20200301' AS InsertedDate FROM Sales_20200301
UNION ALL
SELECT *, '20200302' AS InsertedDate FROM Sales_20200302
UNION ALL
-- ... one SELECT per daily table ...
SELECT *, '20200331' AS InsertedDate FROM Sales_20200331
Now you can replace the EXEC() queries that use the variable @date with direct queries, and you can easily read the script without everything being red.
DECLARE @date DATE = '20200301'
SELECT col1, col2, ...
FROM Sales
WHERE InsertedDate = @date
Note:
If the data is huge, you can think about partitioning it based on the InsertedDate column.
The purpose of a database is not to create tables; it is to use tables. To be honest, this is a nuance that is sometimes hard to explain to DBAs.
First, understand where they are coming from. They want to protect data integrity. They want to be sure that the database is available and that people can use the data they need. They may have been around when the database was designed, and the only envisioned usage was per day. This also makes the data safe when the schema changes (i.e. new columns are added).
Obviously, things have changed. If you were to design the database from scratch, you would probably have a single partitioned table; the partitioning would be by day.
What can you do? You have some options, depending on what you are able to do and what the DBAs need. The most important thing is to communicate the importance of this issue. You are trying to do analysis. You know SQL. Before you can even start on a problem, you have to wrestle with the data model, thinking about EXEC() calls, date ranges, and a whole host of issues that have nothing to do with the problems you need to solve.
This affects your productivity, and it affects the utility of the database. Both are things someone should care about.
There are some potential solutions:
1. You can copy all the data into a single table each day, perhaps as a separate job. This is reasonable if the tables are small.
2. You can copy only the latest data into a single table.
3. You can create a view that combines the data into a single view.
4. The DBAs could do any of the above.
I obviously don't know the structure of the existing code or how busy the DBAs are. However, (4) does not seem particularly cumbersome, regardless of which of the first three solutions is chosen.
If you have no available space for a view or copy of the data, I would write SQL-generation code (sketched at the end of this answer) that constructs a query like this:
select * from sales_20200101 union all
select * from sales_20200102 union all
. . .
This will be a long string. I would then just start my queries with:
with sales as (
<long string here>
)
<whatever code here>;
Of course, it would be better to have a view (at least) that has all the sales you want.
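For completeness, here is a sketch of how such a string could be generated from the catalog views, assuming the daily tables all match the sales_yyyymmdd pattern (the variable name and the LIKE filter are illustrative):
DECLARE @sql nvarchar(max) = N'';

-- Concatenate one SELECT per daily table, joined by UNION ALL.
SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' union all ' END
    + N'select * from ' + QUOTENAME(name)
FROM sys.tables
WHERE name LIKE 'sales[_]________'  -- sales_ followed by 8 characters
ORDER BY name;

SELECT @sql;  -- paste the result into the "with sales as ( ... )" wrapper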

Alternatives to CTEs in Microsoft Access

I work with MS Access 2010 on a daily basis, and I want to know whether there are alternatives to a Common Table Expression as used in SQL Server, and how they affect performance.
For example, is it better to create a subquery, or is it better to call a query from another query? The two are essentially very similar.
Example:
SELECT A.field1, A.Date
FROM (SELECT * FROM TableB WHERE TableB.Date = Restriction) AS A
or
SELECT A.field1, A.Date
FROM SavedQueryB AS A
SavedQueryB:
SELECT * FROM TableB WHERE TableB.Date = Restriction
I feel that having multiple queries makes it easier to debug and manage, but does it affect performance when the dataset is very large?
Also, I've seen some videos about implementing queries through VBA, but I'm not very comfortable doing it that way yet.
Essentially: which is more efficient, and which is better practice? Any suggestions or recommendations on best practices?
I am mostly self-taught, through videos and books, and have some programming background (VB.NET).
For the simplest queries, such as those in your question, I doubt you would see a significant performance difference between a subquery and a "stacked query" (one which uses another saved query as its data source) approach. And perhaps the db engine would even use the same query plan for both. (If you're interested, you can use SHOWPLAN to examine the query plans.)
The primary performance driver for those two examples will be whether the db engine can use indexed retrieval to fetch the rows which satisfy the WHERE restriction. If TableB.Date is not indexed, the query will require a full table scan. That would suck badly with a very large dataset, and the performance impact from a full scan should far overshadow any difference between a subquery and a stacked query.
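If the column isn't indexed yet, adding the index is a one-liner; a minimal sketch using the names from the example above (Jet/Access accepts this DDL directly):
-- Let the engine use indexed retrieval for the WHERE restriction on Date.
CREATE INDEX idxTableBDate ON TableB ([Date]);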
The situation could be different with a complex subquery, as Allen Browne explains:
Complex subqueries on tables with many records can be slow to run.
Among the potential fixes for subquery performance problems, he suggests:
Use stacked queries instead of subqueries. Create a separate saved query for JET to execute first, and use it as an input "table" for your main query. This pre-processing is usually (but not always) faster than a subquery. Likewise, try performing aggregation in one query, and then create another query that operates on the aggregated results. This post-processing can be orders of magnitude faster than a query that tries to do everything in a single query with subqueries.
I think your best answer will come from testing more complex real world examples. The queries in your question are so simple that conclusions drawn from them will likely not apply to those real world queries.
This is really context-dependent, as a host of factors will decide the most efficient outcome, including data types, joined tables, indexes, and more. In essence, for simple queries like the posted SELECT statements, the two forms are equivalent, but Jet/ACE (the underlying engine of MS Access) may still choose a different plan according to the structural needs of the query. Possibly, calling an external query adds a step to the execution plan, but subqueries can also be executed as self-contained tables and then linked to the main tables.
Recall SQL's general order of operations, which differs from the typed order, as each step produces a virtual table (see SQL Server):
FROM clause --VT1
ON clause --VT2
WHERE clause --VT3
GROUP BY clause --VT4
HAVING clause --VT5
SELECT clause --VT6
ORDER BY clause --VT7
What can be said is that for stored query objects, MS Access analyzes and caches the optimized "best plan" version. This is a common argument for using stored queries over VBA string queries, since the latter are not optimized before execution. Further, Access's query object is similar to other RDBMSs' view object (though Jet/ACE does have VIEW and PROCEDURE objects). A regular discussion in the SQL world involves your very question of efficiency and best practice: views vs. subqueries. Usually the answer comes back "it depends", so experiment on a needs basis.
And here, CTEs are considered "inline views", denoted by the WITH clause (not yet supported in Jet/ACE). SQL programmers may use CTEs for readability and maintainability, since you avoid referencing the same statement multiple times in the body of the statement. All in all, use what fits your coding rituals and project requirements, then adjust as needed.
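To make the comparison concrete, this is roughly what the SQL Server form would look like for the example in the question (a sketch; Jet/ACE would reject the WITH clause, which is exactly where the saved/stacked query steps in):
-- A CTE: an inline, named "view" local to this one statement.
-- Restriction is the placeholder criterion from the question.
WITH FilteredB AS (
    SELECT * FROM TableB WHERE TableB.[Date] = Restriction
)
SELECT A.field1, A.[Date]
FROM FilteredB AS A;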
Resources
MS Access 2007 Query Performance Tips - with a note on subquery performance
Intermediate Microsoft Jet SQL - with a note on subqueries (just learned about ANY/SOME/ALL)
Microsoft Jet 3.5 Performance White Paper - with a note on query plans
I cannot recall where, but there are discussions out there about this topic of nested queries and subqueries. Basically, they all suggest saving queries and then referencing the saved query.
From personal experience, nested queries are extremely difficult to troubleshoot or modify later. Also, if they get too deep, I have experienced a performance hit.
Allen Browne has several tips and tricks listed out here
The one place I use nested queries a lot is in the criteria of action queries. That way I do not have any joins and can avoid some of the "cannot perform this operation" issues.
Finally, on using query strings in VBA: I have found it much easier to build parameter queries and then, in VBA, set a variable to the QueryDef and add in the parameters, rather than building up a query string in VBA. So much easier to troubleshoot and modify later.
Just my two cents.

Why might a join on a subquery be slow? What could be done to make it faster? (SQL) [closed]

This is a data science interview question.
My understanding of subqueries is that, especially with correlated subqueries where the subquery depends on the outer query, the correlated subquery requires one or more values to be passed to it by the outer query before it can be resolved. This means the subquery must be processed multiple times, once for each row of the outer query.
In particular, if the inner and outer queries return M and N rows respectively, the total run time could be O(M*N).
So in general, that would be my answer for why running a subquery could be slow, but am I missing anything else that pertains to joining on a subquery? Also, I'm not really sure what could be done to make it faster.
I would of course appreciate any tips or help.
Thanks!
I think your answer is correct: subqueries are slow if they are correlated. Uncorrelated subqueries are only evaluated a single time.
What can be done to speed things up: correlated subqueries can often be rewritten as joins, and join queries can be executed much faster!
If you use a good RDBMS, the optimizer is often able to rewrite a correlated subquery into a join query (though not in all cases). However, if you use a simple RDBMS, there is either no optimizer at all or the optimizer is not very advanced (i.e., it cannot unnest subqueries into join queries). In those cases, you need to rewrite the query by hand.
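As an illustration of such a hand rewrite, here is a sketch using a hypothetical orders table; both statements return the orders that exceed their customer's average:
-- Correlated form: the inner query runs conceptually once per outer row.
SELECT o.id, o.amount
FROM orders o
WHERE o.amount > (SELECT AVG(i.amount)
                  FROM orders i
                  WHERE i.customer_id = o.customer_id);

-- Join form: the average is computed once per customer, then joined.
SELECT o.id, o.amount
FROM orders o
JOIN (SELECT customer_id, AVG(amount) AS avg_amount
      FROM orders
      GROUP BY customer_id) a
  ON a.customer_id = o.customer_id
WHERE o.amount > a.avg_amount;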
Wow - what an open-ended question! I'm not sure how far outside the box they want you to think, but here are some possible reasons:
Criteria too broad
The criteria for your query may be too broad; there may be extra clauses you could add that would reduce the sheer amount of data the RDBMS has to process.
Lack of indexes
If there aren't any indexes on the pertinent columns, the RDBMS may have to resort to full table scans which could be slow.
Stale stats
If statistics haven't been updated for a while, the RDBMS may not have the full picture of the skew of the data which can affect the execution time massively.
Physical arrangement of database
If the indexes and tables are on the same physical drive(s), this can create IO contention.
Parallelism
The RDBMS may not be set up correctly for parallelism, meaning that it may not be making the best use of the available hardware.
Scheduling
The time when the query is run can affect the execution time. Would the query be better run out of hours?
Data changes
Data changes can affect the skew of the data and, in rare cases, create Cartesian products. On large databases there should be full traceability of data, at row level at least, to track down data issues.
Locking
Related to high levels of use is the issue of locking. If you require clean reads, there may be contention on the required data which could slow down the query.
Misleading execution plans
You may have pulled the execution plans, but these don't always tell the full story. Cost is a function of CPU and IO, but your system may be more bound on one than the other. Some RDBMSs have settings that can force the optimiser to skew the cost towards one side or the other to produce better plans.
Static data not being cached
If you have some static data you're recalculating each time, this will affect the cost. Such data should be stored in an indexed or temporary table to reduce the amount of work the RDBMS needs to do.
Query simply too complex
Whilst the query may read perfectly well to you, if you can break it up into chunks using temporary tables or the like, it could perform significantly better (see the sketch below).
I'm going to stop there as I could easily spend the rest of the day adding to this, but hopefully this gives you a flavour.
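To illustrate that last point, here is a sketch of chunking with a temporary table (SQL Server syntax; the table and column names are illustrative):
-- Materialize an intermediate aggregate once, instead of recomputing it
-- inside one large, complex statement.
SELECT customer_id, SUM(amount) AS total
INTO #totals
FROM orders
GROUP BY customer_id;

SELECT c.name, t.total
FROM customers c
JOIN #totals t ON t.customer_id = c.customer_id
WHERE t.total > 1000;

DROP TABLE #totals;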

SQL cost compare INNER JOIN vs SELECT twice [closed]

There are two tables with a common field id. What I want to do is select all attributes for a specific id, and I'm wondering which way is more efficient.
1. Using INNER JOIN, so a single SELECT * operation is done.
2. Selecting from the smaller table first; if the id exists, then selecting from the larger table.
In most databases, you want to do the join:
select *
from bigtable b join
     smalltable s
     on b.id = s.id
where b.id = @id;
SQL engines have an optimizer to determine the best execution plan for a query. As mentioned in the comment, having an index would often speed this up.
By selecting from one table and then the other, you are forcing a particular execution plan.
In general, you should trust the SQL engine to produce the best execution plan. In some cases, it may be better to do one and then the other, but generally that is not true.
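For comparison, the two-step approach from option 2 would look roughly like this (a sketch; it hard-codes one particular execution strategy instead of letting the optimizer choose):
-- Probe the small table first; touch the big table only if the id exists.
-- @id is the same parameter as in the join query above.
IF EXISTS (SELECT 1 FROM smalltable WHERE id = @id)
    SELECT *
    FROM bigtable
    WHERE id = @id;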
This will vary based on the circumstances. You can't make a generic statement saying one will always be better.
To compare, you can look at execution plans, or simply run both and compare execution times.
For example, if it's rare to find matching data in the second table, then over time it might be better to run the single query, etc.
I suggest you take the second way.
It is good practice to keep the main/primary info in an index table, and put the extra/detail info in another, bigger table.
Dividing the info into two parts (main/primary vs. extra/detail) helps because, most of the time, we only need the first part; this can save the cost of a large query, a large data transfer, and network bandwidth.

Sql server: internal workings [closed]

Some of these might make little sense, but:
1. Is SQL code interpreted or compiled (and to what)?
2. What are joins translated into - I mean, into some loops or what?
3. Is algorithm complexity analysis applicable to a query? For example, is it possible to write a really bad select - exponential in time in the number of rows selected? And if so, how do you analyze queries?
Well ... quite general questions, so some very general answers
1) Is SQL code interpreted or compiled (and to what)?
SQL code is compiled into execution plans.
2) What are joins translated into - I mean into some loops or what?
It depends on the join and the tables you're joining (as far as I know). SQL Server has several join primitives (nested loop join, merge join, hash join); depending on the objects involved in your SQL code, the query optimizer tries to choose the best option.
3) Is algorithm complexity analysis applicable to a query - is it possible to write a really bad select, exponential in the number of rows? And how do you analyze queries?
Not really sure what you mean by that, but there are cases where you can do really bad things, for example using
SELECT TOP 1 col FROM [Table] ORDER BY col DESC
on a table without an index on col to find the largest value of col, instead of
SELECT MAX(col) FROM [Table]
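On the question of how to analyze queries: SQL Server can show you the plan and the runtime costs directly. A minimal sketch using standard SET options ([Table] is the placeholder from above):
-- Show the estimated plan without executing the batch.
SET SHOWPLAN_TEXT ON;
GO
SELECT MAX(col) FROM [Table];
GO
SET SHOWPLAN_TEXT OFF;
GO
-- Execute the query and report logical reads and CPU time.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT MAX(col) FROM [Table];
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;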
You should get your hands on some or all of the books from the SQL Server internals series. They are really excellent and cover many things in great detail.
You'd get a lot of these answers by reading one of Itzik Ben-Gan's books. He covers the topics you mention in some detail.
http://tsql.solidq.com/books/index.htm