After writing SQL statements in MySQL, how do I measure their speed / performance?

I saw something from an "execution plan" article:
10 rows fetched in 0.0003s (0.7344s)
(the link: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ )
How come there are two durations shown? And what if I don't have a large data set yet? For example, if I have only 20, 50, or even just 100 records, can I really measure how two different SQL statements compare in terms of speed in a real-life situation? In other words, do I need at least hundreds of thousands of records, or even a million, to accurately compare the performance of those two different SQL statements?

For your first question:
X row(s) fetched in Y s (Z s)
X = number of rows (of course);
Y = time it took the MySQL server to execute the query (parse, retrieve, send);
Z = time the resultset spent in transit from the server to the client;
(Source: http://forums.mysql.com/read.php?108,51989,210628#msg-210628)
For the second question, you will never ever know how the query performs unless you test with a realistic number of records. Here is a good example of how to benchmark correctly: http://www.mysqlperformanceblog.com/2010/04/21/mysql-5-5-4-in-tpcc-like-workload/
That blog in general as well as the book "High Performance MySQL" is a goldmine.

The best way to test and compare performance of operations is often (if not always!) to work with a realistic set of data.
If you plan on having millions of rows when your application is in production, then you should test with millions of rows right now, and not only a dozen!
A couple of tips (with a short example after the list):
While benchmarking, use select SQL_NO_CACHE ..., instead of select ...
This will prevent MySQL from using its query cache (which would otherwise make the first query take a normal amount of time, and every re-execution of it a lot faster)
Learn how to use EXPLAIN, and understand its output
Read the Chapter 7. Optimization section of the manual ;-)
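For instance, a minimal benchmarking session along these lines might look as follows (the table and column names are made up for illustration):
-- Bypass the query cache so every run measures real execution time
SELECT SQL_NO_CACHE * FROM orders WHERE customer_id = 42;
-- Inspect the plan: which index is used, how many rows are examined
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;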

Generally when there are 2 times shown, one is CPU time and one is wall-clock time. I cannot recall which is which, but it appears that the first is the CPU time and the second is elapsed time.

Related

What's the curve for a simple select query?

This is a conceptual question.
Hypothetically, when I do select * from table_name on a table that has 1 million records, it takes about 3 seconds.
Similarly, when I select 10 million records, the time taken is about 30 seconds. But I am told the time is not linearly proportional to the number of records selected. After a certain number, the time required to select records increases exponentially?
Please help me understand how this works.
There are things that can make one query take longer than another, even for simple selects with no where clauses or joins.
First, the time to return the query depends on how busy the network is at the time the query is run. It could also depend on whether there are any locks on the data or how much memory is available.
It also depends on how wide the tables are and, in general, how many bytes an individual record takes. For instance, I would expect a 10-million-record table with only two int columns to return much faster than a million-record table with 50 columns, especially if some of those are large columns such as documents stored as database objects, or fields with too much text to fit into an ordinary varchar or nvarchar field (in SQL Server these would be nvarchar(max) or text, for instance). I would expect this because there is simply less total data to return, even though there are more records.
As you start adding where clauses and joins, of course, there are many more things that affect the performance of an individual query. If you query databases, you should read a good book on performance tuning for your particular database. There are many things you can do without realizing it that cause queries to run more slowly than they need to. You should learn the techniques that produce the queries most likely to be performant.
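As a rough illustration of the row-width point above (hypothetical tables, SQL Server syntax):
-- Narrow table: two ints, roughly 8 bytes of data per row
CREATE TABLE narrow_t (a INT, b INT);
-- Wide table: fewer rows, but each row can carry kilobytes of text
CREATE TABLE wide_t (id INT, title NVARCHAR(200), body NVARCHAR(MAX));
-- SELECT * FROM narrow_t with 10 million rows moves on the order of 80 MB of data;
-- SELECT * FROM wide_t with 1 million rows can easily move gigabytes.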
I think this is different for each database server. Try to monitor the performance while you fire your queries (what happens to the memory and CPU?).
Eventually every hardware component has a bottleneck. If you come close to that point, the server might 'suffocate'.

View Clustered Index Seek over 0.5 million rows takes 7 minutes

Take a look at this execution plan: http://sdrv.ms/1agLg7K
It’s not estimated, it’s actual. From an actual execution that took roughly 30 minutes.
Select the second statement (takes 47.8% of the total execution time – roughly 15 minutes).
Look at the top operation in that statement – View Clustered Index Seek over _Security_Tuple4.
The operation costs 51.2% of the statement – roughly 7 minutes.
The view contains about 0.5M rows (for reference, log2(0.5M) ~= 19 – a mere 19 steps assuming the index tree has a branching factor of two; the real fan-out is probably much higher, so even fewer steps).
The result of that operator is zero rows (doesn’t match the estimate, but never mind that for now).
Actual executions – zero.
So the question is: how the bleep could that take seven minutes?! (and of course, how do I fix it?)
EDIT: Some clarification on what I'm asking here.
I am not interested in general performance-related advice, such as "look at indexes", "look at sizes", "parameter sniffing", "different execution plans for different data", etc.
I know all that already, I can do all that kind of analysis myself.
What I really need is to know what could cause that one particular clustered index seek to be so slow, and then what could I do to speed it up.
Not the whole query.
Not any part of the query.
Just that one particular index seek.
END EDIT
Also note how the second and third most expensive operations are seeks over _Security_Tuple3 and _Security_Tuple2 respectively, and they only take 7.5% and 3.7% of time. Meanwhile, _Security_Tuple3 contains roughly 2.8M rows, which is six times that of _Security_Tuple4.
Also, some background:
This is the only database from this project that misbehaves.
There are a couple dozen other databases with the same schema, and none of them exhibits this problem.
The first time this problem was discovered, it turned out that the indexes were 99% fragmented.
Rebuilding the indexes did speed it up, but not significantly: the whole query took 45 minutes before rebuild and 30 minutes after.
While playing with the database, I have noticed that simple queries like “select count(*) from _Security_Tuple4” take several minutes. WTF?!
However, they only took several minutes on the first run, and after that they were instant.
The problem is not connected to a particular server, nor to a particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
First I'd like to point out a little misconception here: although the delete statement is said to take nearly 48% of the entire execution, this does not have to mean it takes 48% of the time needed. In fact, the 51% assigned inside that part of the query plan most definitely should NOT be interpreted as taking 'half of the time' of the entire operation: those percentages are based on estimated costs, not on measured durations.
Anyway, going by your remark that it takes a couple of minutes to do a COUNT(*) of the table 'the first time', I'm inclined to say that you have an IO issue related to said table/view. Personally I don't like materialized views very much, so I have no real experience with them and how they behave internally, but normally I would suggest that fragmentation is taking its toll on the underlying storage system. The reason it works fast the second time is that it's much faster to access the pages from the cache than it was to fetch them from disk, especially when they are scattered all over the place. (Are there any (max) fields in the view?)
Anyway, to find out what is taking so long, I'd suggest you take this code out of the trigger it's currently in, 'fake' the inserted and deleted tables, and then try running the queries again, adding timestamps and/or using a program like SQL Sentry Plan Explorer to see how long each part REALLY takes (it has a duration column when you run a script from within the program).
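A minimal sketch of that approach, assuming the code lives in a trigger on a hypothetical dbo.MyTable:
-- Emulate the trigger's virtual inserted/deleted tables outside the trigger
SELECT * INTO #inserted FROM dbo.MyTable WHERE 1 = 0;  -- copies structure only
SELECT * INTO #deleted FROM dbo.MyTable WHERE 1 = 0;
-- ...populate #inserted / #deleted with representative rows here...
DECLARE @t0 DATETIME2 = SYSUTCDATETIME();
-- ...run the trigger body's statements against #inserted / #deleted here...
SELECT DATEDIFF(MILLISECOND, @t0, SYSUTCDATETIME()) AS elapsed_ms;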
It might well be that you're looking at the wrong part; experience shows that cost and actual execution times are not always as related as we'd like to think.
Observations include:
Is this the biggest of these databases that you are working with? If so, size matters to the optimizer. It will make quite a different plan for large datasets versus smaller data sets.
The estimated rows and the actual rows are quite divergent. This is most apparent in the fourth query, "delete c from #alternativeRoutes....", where _Security_Tuple5 is estimated to return 16 rows but actually produced 235,904 rows. For that many rows, an Index Scan could be more performant than Index Seeks. Are the statistics on the table up to date, or do they need to be updated? (See the sketch after these observations.)
The "select count(*) from _Security_Tuple4" takes several minutes, the first time. The second time is instant. This is because the data is all now cached in memory (until it ages out) and the second query is fast.
Because the problem moves with the database, the statistics, any missing indexes, et cetera are in the database itself. I would also suggest checking that the indexes match those of the other databases using the same schema.
This is not a full analysis, but it gives you some things to look at.
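If stale statistics or fragmentation turn out to be involved, the usual remedies look like this (a sketch only; the object names are taken from the question, and the exact commands depend on your schema):
-- Refresh statistics so the optimizer's estimates match the actual row counts
UPDATE STATISTICS dbo._Security_Tuple5 WITH FULLSCAN;
-- Rebuild the fragmented indexes mentioned in the question
ALTER INDEX ALL ON dbo._Security_Tuple4 REBUILD;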
Fyodor,
First:
The problem is not connected to a particular server, nor to a particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
I presume that: a) you run this query in an isolated environment, and b) the data is not under mutation.
Is this correct?
Second: post here your CREATE INDEX script. Do you have a funny FILLFACTOR? SORT_IN_TEMPDB?
Third: which type is your ParentId, ObjectId? int, smallint, uniqueidentifier, varchar?
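For reference, the kind of CREATE INDEX script being asked about might look like this (purely hypothetical; the real definition may differ):
-- Hypothetical index definition showing the options in question
CREATE UNIQUE CLUSTERED INDEX IX_Security_Tuple4
ON dbo._Security_Tuple4 (ParentId, ObjectId)
WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON);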

How long should a query that returns 5 million records take?

I realise the answer should probably be 'as little time as possible' but I'm trying to learn how to optimise databases and I have no idea what an acceptable time is for my hardware.
For a start, I'm using my local machine with a copy of SQL Server 2008 Express. I have a dual-core processor, 2GB of RAM and a 64-bit OS (if that makes a difference). I'm only using a simple table with about 6 varchar fields.
At first I queried the data without any indexing. This took a ridiculously long time, so I cancelled it and added a clustered index (using the PK) to the table. This cut the time down to 1 minute 14 seconds. I have no idea if this is the best I can get or whether I can still cut it down even further.
Am I limited by my hardware or is there anything else I can do to my table/database/queries to get results faster?
FYI I'm only using a standard SELECT * FROM <Table> to retrieve my results.
EDIT: Just to clarify, I'm only doing this for testing purposes. I don't NEED to pull out all the data, I'm just using that as a consistent test to see if I can cut down the query times.
I suppose what I'm asking is: Is there anything I can do to speed up the performance of my queries other than a) upgrading hardware and b) adding indexes (assuming the schema is already good)?
I think you are asking the wrong question.
First of all - why do you need so many records at one time on the local machine? What do you want to do with them? I'm asking because I think you want to transfer this data somewhere, so you should be measuring how long it takes to transfer the data.
Some advice:
Your application should not select 5 million records at a time. Try to split your query and get the data in smaller sets (see the paging sketch below).
UPDATE:
Because you are doing this for testing, I suggest that you:
Remove * from your query - it takes SQL Server some time to resolve the full column list.
Put your data in temporary storage - try using a VIEW or a temporary table for this.
Use plan caching on your server to improve performance.
But even if you're just testing, I still don't understand why you would need such tests if your application will never use such a query. Testing just for the sake of testing is a bad use of time.
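A minimal sketch of fetching the data in smaller sets via keyset paging (the table, column and variable names are hypothetical; assumes Id is an indexed primary key):
-- Fetch the next batch of 10,000 rows after the last one already seen
DECLARE @lastSeenId INT = 0;  -- 0 before the first batch
SELECT TOP (10000) Id, Field1, Field2
FROM dbo.MyTable
WHERE Id > @lastSeenId
ORDER BY Id;
-- Repeat with @lastSeenId set to the largest Id returned by the previous batch.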
Look at the query execution plan. If your query is doing a table scan, it will obviously take a long time. The query execution plan can help you decide what kind of indexing you would need on the table. Also, creating table partitions can help sometimes in cases where the data is partitioned by a condition (usually date and time).
I did 5.5 million in 20 seconds. That's taking over 100k schedules with different frequencies and forecasting them for the next 25 years. It was just max-scenario testing, but it shows the kind of speed you can achieve in a scheduling system, as an example.
The best optimized approach depends on the indexing strategy you choose. As in many of the answers above, I too would say that partitioning the table can sometimes help. And it's not best practice to query all of a billion records in a single time frame; you will get much better results if you query partially, in iterations. You may check this link for the minimum requirements for SQL Server 2008: Minimum H/W and S/W Requirements for Sql server 2008
When fetching 5 million rows you are almost 100% going to spill to tempdb. You should try to optimize your tempdb by adding additional files; if you have multiple drives on separate disks, you should split the table data into different ndf files located on separate disks. Partitioning won't help when querying all the data on the disk.
You can also use the MAXDOP query hint to force parallelism; this will increase CPU utilization. Ensure that the columns contain as few nulls as possible, and rebuild your indexes and stats.
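For what it's worth, the MAXDOP hint mentioned above looks like this (the table name is hypothetical):
-- Let the query use up to 4 parallel threads (SQL Server)
SELECT * FROM dbo.MyTable OPTION (MAXDOP 4);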

How much is performance improved when using LIMIT in a SQL statement?

Let's suppose I have a table in my database with 1.000.000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if I have that table with 1000 records and just do:
SELECT * FROM [Table]
?
I'm not looking for if it will take exactly the same time. I just want to know if the first one will take much more time to execute than the second one.
I said 1.000.000 records, but it could be 20.000.000. That was just an example.
Edit:
Of course, on the same table, the query using LIMIT should execute faster than the one without it, but that's not what I'm asking...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database, with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5.000.000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM and LINQ to SQL to make the queries.
I need to perform basically three kinds of non-simple queries, and I was thinking about showing the user a limited number of records (just like lots of websites do). If the user wants to see more records, the option he/she has is to restrict the search further.
So, the question came up because I was deciding between doing this (limiting to X records per query) and storing only X results in the database (the most recent ones), which would require doing some deletions in the database; but I was just thinking...
So, that table could contain 5.000.000 records or more, and what I don't want is to show the user 1000 or so and still have the query be as slow as if it were returning all 5.000.000 rows.
TAKE 1000 from a table of 1000000 records will be roughly 1000000/1000 (= 1000) times faster, because it only needs to look at (and return) 1000 of the 1000000 records. Since it does less, it is naturally faster.
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of the two cases below becomes true (illustrated after this answer):
The ORDER BY clause follows an index - the above statement is still true.
The ORDER BY clause cannot use any index - then it will be only marginally faster than without the TAKE, because:
it has to inspect ALL records and sort by the ORDER BY columns
it then delivers only a subset (the TAKE count)
so it is not faster in the first step, but the second step involves less IO/network than returning ALL records
If you TAKE 1000 records from a table of 1000 records, it will be equivalent (with little significant differences) to TAKE 1000 records from 1 billion, as long as you are following the case of (1) no order by, or (2) order by against an index
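A hedged illustration of the two ORDER BY cases (MySQL syntax; the table and column names are made up):
-- Fast: an index on indexed_col lets the engine stop after the first 1000 rows
SELECT * FROM t ORDER BY indexed_col LIMIT 1000;
-- Slow path: every row must be read and sorted before the first 1000 can be returned
SELECT * FROM t ORDER BY unindexed_col LIMIT 1000;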
Assuming both tables are equivalent in terms of index, row-sizing and other structures. Also assuming that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run at a similar speed, since the query engine will be going through the records in exactly the same order -- basically a table scan -- for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then also similar speed.
However, if Y >> X (meaning Y has many, many more rows than X), then the LIMIT version MAY be slower. Not because of the query plan -- again, it should be the same -- but simply because the internal structure of the data layout may have several more levels.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say. 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with many rows over several hundred, transporting them would take a long time, and they couldn't scroll through it in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarily for the exact purpose of creating chunks of a list for paging, as you describe (see the example below).
It's way counterproductive to start thinking about schema design and record purging until you've used up this and a bunch of other strategies. In this case don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.
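A minimal paging sketch with LIMIT and OFFSET (MySQL; the table and column names are hypothetical):
-- Page 3 with 20 rows per page: skip the first 40 rows, return the next 20
SELECT id, title FROM posts ORDER BY id LIMIT 20 OFFSET 40;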

Does the speed of the query depend on the number of rows in the table?

Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
There are many factors affecting the speed of a query, one of which is the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed - see the example after this list)
server load
data caching - once you've executed a query, the data is added to the data cache, so subsequent reruns will be much quicker because the data comes from memory rather than disk, until the point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contributors to poor performance!), e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (with SQL Server 2005 onwards, Enterprise Edition there is built-in support). This is to split the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
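To illustrate the indexing point with the query from the question (a sketch; the index name is made up):
-- An index on x lets the engine seek to matching rows instead of scanning every row
CREATE INDEX ix_table1_x ON table1 (x);
select * from table1 r where r.x = 5;  -- can now use ix_table1_x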
Yes, and it can be very significant.
If there are 100 million rows, SQL Server has to go through each of them and see if it matches.
That takes a lot more time than if there were only 10 rows.
You probably want an index on the x column, in which case SQL Server can check the index rather than going through all the rows - which can be significantly faster, as SQL Server might not even need to check all the values in the index.
On the other hand, if there are 100 million rows matching x = 5, it will still be slower than if only 10 rows match.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
Not the rows per se (up to a point, of course), but the amount of data (columns) is what can make a query slow. The data also needs to be transferred from the backend to the frontend.
The answer is yes, but it is not the only factor.
If you apply appropriate optimizations and tuning, the performance drop will be negligible.
The main performance factors:
Indexing (clustered or non-clustered)
Data caching
Table partitioning
Execution plan caching
Data distribution
Hardware specs
There are some other factors, but these are the main ones to consider.
Even how you designed your schema has an effect on performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear, or O(N), for the example you provided) and exponential for more complex queries. There are database-specific manuals filled with tricks to help you avoid the worst case, but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database, you will get O(log(N)) performance for a simple lookup; if the system has effective query caching, you might get an O(1) response. Here is a good introductory article: High Scalability: SQL and computational complexity
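One way to see which case you are in is to check the plan directly (MySQL syntax, reusing the query from the question; the index name is made up):
-- Without an index, EXPLAIN shows "type: ALL" - a full table scan, O(N)
EXPLAIN SELECT * FROM table1 WHERE x = 5;
-- After adding an index, the plan shows "type: ref" - an index lookup, O(log N)
CREATE INDEX ix_x ON table1 (x);
EXPLAIN SELECT * FROM table1 WHERE x = 5;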