What does the TotalStorageWaitTime counter in an Impala query profile mean?

The same query statement shows unstable performance in Impala: sometimes it takes about 2 seconds, and sometimes about 10 seconds. When execution is slow, the TotalStorageWaitTime and ScannerThreadsTotalWallClockTime counters are both relatively large. What is the relationship between these two counters, and where is the bottleneck of the slow query?
[Screenshot: part of the query profile]

Related

Can I speed up a large JDBC query with Futures?

I have a Postgres table with millions of rows. One of the columns is a timestamp and I need to query rows with time greater than x and less than y. Since I am getting hundreds of thousands of rows back, the query is taking a long time even though I indexed it.
My plan is to use a list of Futures, each making a query over a small time interval concurrently, and then aggregate the results afterwards.
Should I expect the large speedup I'm hoping for? Is there a better approach?
Depends on where you're bottlenecked. If it's on network I/O it won't help significantly.
If it's disk I/O, it might help a bit, since it'll effectively parallelize the query - but only if your DB server benefits significantly from I/O parallelization. It's possible that you won't see much benefit, so before you bother reworking your code, test it by manually dispatching a bunch of queries simultaneously and timing them.
Make sure you're doing forward index scans though - see EXPLAIN ANALYZE; you'll get less benefit if you're doing backward index scans.
TL;DR: Benchmark and see.
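For what it's worth, here is a minimal sketch of the Futures approach under discussion, to make the suggested benchmark easy to run. The table name (events), the columns (id, ts), the JDBC URL, and the partition count of 8 are all placeholder assumptions, not taken from the question:

import java.sql.*;
import java.time.Instant;
import java.util.*;
import java.util.concurrent.*;

public class PartitionedQuery {
    // Fetch one sub-interval on its own connection, since a single JDBC
    // connection executes statements serially.
    static List<Long> fetchRange(String url, Instant from, Instant to) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id FROM events WHERE ts >= ? AND ts < ?")) {
            ps.setTimestamp(1, Timestamp.from(from));
            ps.setTimestamp(2, Timestamp.from(to));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/mydb"; // placeholder
        Instant x = Instant.parse("2024-01-01T00:00:00Z"); // lower bound
        Instant y = Instant.parse("2024-01-02T00:00:00Z"); // upper bound
        int partitions = 8; // tune by benchmarking, as the answer suggests

        long step = (y.toEpochMilli() - x.toEpochMilli()) / partitions;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            Instant from = Instant.ofEpochMilli(x.toEpochMilli() + i * step);
            Instant to = (i == partitions - 1)
                    ? y : Instant.ofEpochMilli(x.toEpochMilli() + (i + 1) * step);
            futures.add(pool.submit(() -> fetchRange(url, from, to)));
        }
        List<Long> all = new ArrayList<>();
        for (Future<List<Long>> f : futures) all.addAll(f.get()); // aggregate results
        pool.shutdown();
        System.out.println("fetched " + all.size() + " rows");
    }
}

Whether eight concurrent range scans actually beat one big scan is exactly what timing this against the single-query version will tell you.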

BigQuery execution time and scaling

I created a test dataset of roughly 450GB in BigQuery and I am getting an execution time of ~9 seconds to query the largest table (10bn rows) when running from the WebUI. I just wanted to check if this is a 'normal' expected result and whether it would get worse at larger sizes (i.e. 100bn+ rows) or if the queries become more complex. I am aware of table partitioning etc., but I just want to get a sense of what is 'normal' expected speed before getting into optimization, since the above seems like a 'smallish' size for what BQ is meant for.
The above result is achieved on a simple query like this:
select ColumnA from DataSet.Table order by ColumnB desc limit 100
So the result returned to the client is very small. ColumnA is structured as UUIDs represented in String format and ColumnB is integer.
It's almost impossible to say whether this is "normal" or not. BigQuery is a multi-tenant architecture/infrastructure, which means we all share the same resources (i.e. compute power) in the cluster when running queries. Query times are therefore never deterministic in BigQuery, i.e. they can vary depending on the number of concurrent queries from other users at any given time. That said, you can get reserved slots for a flat-rate price, although you'd need to be spending quite a lot of money to justify that.
You can improve execution times by removing compute/shuffle/memory-intensive steps like ORDER BY. Obviously, the complexity of the query will also have an impact on query times.
On some of our projects we can smash through 3TB-5TB with a relatively complex query in about 15s-20s. Sometimes it's quicker, sometimes it's slower. We also run queries over much smaller datasets that can take the same amount of time. This is because of what I wrote at the beginning: BigQuery query times are not deterministic.
Finally, BigQuery will cache results, so if you issue the same query multiple times over the same dataset it will be returned from the cache i.e. much quicker!
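As a side note on the caching point: when benchmarking you can bypass the results cache per query. A minimal sketch, assuming the google-cloud-bigquery Java client library and the query from the question:

import com.google.cloud.bigquery.*;

public class NoCacheQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration config = QueryJobConfiguration
                .newBuilder("SELECT ColumnA FROM DataSet.Table ORDER BY ColumnB DESC LIMIT 100")
                .setUseQueryCache(false) // force a fresh execution instead of a cached result
                .build();
        for (FieldValueList row : bigquery.query(config).iterateAll()) {
            System.out.println(row.get(0).getStringValue());
        }
    }
}

With setUseQueryCache(true) (the default), repeating the identical query returns the cached result, which is the much quicker path described above.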

query slow once an hour or so, takes less than 100 ms 99% of the time

I am troubleshooting a slow query. It runs in less than 100 ms 99% of the time, but once an hour or two (no obvious pattern, I guess) it goes bad, does 6 million reads, and takes 11 seconds! I looked at the query plan and it does do a clustered index scan. I noticed the usecounts column of the sys.dm_exec_cached_plans dynamic management view keeps increasing every time the query executes, so I am thinking it's the same plan; I'm just wondering why at some point it goes out of whack. Any pointers will be helpful. I haven't tried anything yet, as it runs pretty fast most of the time.
First, something could easily be blocking the query and making it run slow. Or there could be other things happening on the server at the same time that are consuming most of its resources.
Next, the parameters of the query might be bad for the saved execution plan.
Or the statistics might be out of date.
Or, if the query is an action query rather than a select, the particular parameters may be causing a problem in a trigger that makes it take longer.
Or the query might be returning significantly more results at times. If you run it at 10:00 and return 10 results, and then an import puts more records in the table that meet the query conditions, at 10:30 you might return a million results, which would clearly be slower.
One thing I like to do in such circumstances is set up logging so that the exact query is recorded along with the time of execution. Then you can see what the slow-running query actually was; if you have variables, they might be different from run to run.
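To make that logging idea concrete, here is a hedged sketch in Java/JDBC. The 100 ms threshold and the helper name are illustrative, not from the original post; the point is recording the exact SQL text, the parameter values, and the elapsed time on every run so slow executions can be compared with fast ones:

import java.sql.*;
import java.time.Instant;
import java.util.Arrays;

public class QueryLogger {
    static final long SLOW_THRESHOLD_MS = 100; // illustrative cutoff

    // Executes the query, counts the rows, and logs any slow run together
    // with the exact statement and parameter values used.
    static int runLogged(Connection conn, String sql, Object... params)
            throws SQLException {
        long start = System.nanoTime();
        int rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < params.length; i++) ps.setObject(i + 1, params[i]);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) rows++;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > SLOW_THRESHOLD_MS) {
            System.err.printf("%s SLOW %d ms, %d rows: %s params=%s%n",
                    Instant.now(), elapsedMs, rows, sql, Arrays.toString(params));
        }
        return rows;
    }
}

Logging the row count as well makes the "an import added a million matching rows" case from above immediately visible.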

After writing SQL statements in MySQL, how to measure the speed / performance of them?

I saw something from an "execution plan" article:
10 rows fetched in 0.0003s (0.7344s)
(link: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/)
How come there are two durations shown? And what if I don't have a large data set yet? For example, if I have only 20, 50, or even just 100 records, I can't really measure how two different SQL statements compare in terms of speed in a real-life situation, can I? In other words, does there need to be at least hundreds of thousands of records, or even a million, to accurately compare the performance of two different SQL statements?
For your first question:
X row(s) fetched in Y s (Z s)
X = number of rows (of course);
Y = time it took the MySQL server to execute the query (parse, retrieve, send);
Z = time the resultset spent in transit from the server to the client;
(Source: http://forums.mysql.com/read.php?108,51989,210628#msg-210628)
For the second question, you will never ever know how the query performs unless you test with a realistic number of records. Here is a good example of how to benchmark correctly: http://www.mysqlperformanceblog.com/2010/04/21/mysql-5-5-4-in-tpcc-like-workload/
That blog in general as well as the book "High Performance MySQL" is a goldmine.
The best way to test and compare the performance of operations is often (if not always!) to work with a realistic set of data.
If you plan on having millions of rows when your application is in production, then you should test with millions of rows right now, not just a dozen!
A couple of tips :
While benchmarking, use select SQL_NO_CACHE ..., instead of select ...
This will prevent MySQL from using its query cache (which would make the first query take a normal amount of time, and re-executing it several times a lot faster)
Learn how to use EXPLAIN, and understand its output
Read the Chapter 7. Optimization section of the manual ;-)
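Putting the SQL_NO_CACHE tip into a repeatable harness, here is a small sketch (the JDBC URL, credentials, and tables are placeholders); running the same statement several times shows whether the cold first execution differs from warm ones:

import java.sql.*;

public class MiniBench {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass")) {
            // SQL_NO_CACHE keeps MySQL's query cache out of the measurement.
            String sql = "SELECT SQL_NO_CACHE COUNT(*) FROM t1 " +
                         "LEFT JOIN t2 ON t2.id = t1.id WHERE t2.id IS NULL";
            for (int run = 1; run <= 5; run++) {
                long start = System.nanoTime();
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {
                    rs.next(); // force the server to produce the result
                }
                System.out.printf("run %d: %.1f ms%n", run,
                        (System.nanoTime() - start) / 1e6);
            }
        }
    }
}

The same harness run against two candidate statements, on a realistically sized dataset, gives the side-by-side comparison the question is after.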
Generally when there are 2 times shown, one is CPU time and one is wall-clock time. I cannot recall which is which, but it appears that the first is the CPU time and the second is elapsed time.

Long running Jobs Performance Tips

I have been working with SQL Server for a while and have used a lot of performance techniques to fine-tune many queries. Most of these queries were expected to execute within a few seconds, or maybe minutes.
I am now working with a job which loads around 100K records and runs for around 10 hours.
What are the things I need to consider while writing or tuning such query? (e.g. memory, log size, other things)
Make sure you have good indexes defined on the columns you are querying on.
Ultimately, the best thing to do is to actually measure and find the source of your bottlenecks. Figure out which queries in a stored procedure or what operations in your code take the longest, and focus on slimming those down, first.
I am actually working on a similar problem right now, on a job that performs complex business logic in Java for a large number of database records. I've found that the key is to process records in batches, and make as much of the logic as possible operate on a batch instead of operating on a single record. This minimizes roundtrips to the database, and causes certain queries to be much more efficient than when I run them for one record at a time. Limiting the batch size prevents the server from running out of memory when working on the Java side. Since I am using Hibernate, I also call session.clear() after every batch, to prevent the session from keeping copies of objects I no longer need from previous batches.
Also, an RDBMS is optimized for working with large sets of data; use normal set-based SQL operations whenever possible. Avoid things like cursors and a lot of procedural programming; as other people have said, make sure you have your indexes set up correctly.
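A minimal sketch of the batching pattern described above, assuming Hibernate; the entity, the HQL query, the batch size of 500, and the process() hook are all placeholders:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import java.util.List;

public class BatchJob {
    static final int BATCH_SIZE = 500; // assumed; tune against memory use

    static void run(SessionFactory factory) {
        try (Session session = factory.openSession()) {
            Transaction tx = session.beginTransaction();
            int offset = 0;
            List<MyRecord> batch;
            do {
                // Pull one batch at a time instead of loading every record.
                batch = session.createQuery("from MyRecord", MyRecord.class)
                        .setFirstResult(offset)
                        .setMaxResults(BATCH_SIZE)
                        .getResultList();
                for (MyRecord r : batch) {
                    process(r); // the per-record business logic (placeholder)
                }
                session.flush();
                session.clear(); // drop references to entities from this batch
                offset += batch.size();
            } while (!batch.isEmpty());
            tx.commit();
        }
    }

    static void process(MyRecord r) { /* ... */ }
}

class MyRecord { } // placeholder entity; real mapping annotations omitted

The flush()/clear() pair after each batch is what keeps the session from accumulating copies of objects from previous batches, as described above.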
It's impossible to say without looking at the query. Just because you have indexes doesn't mean they are being used; you'll have to look at the execution plan and see. It might show that they aren't useful to the plan at all.
You can start by looking at the estimated execution plan. If the job actually completes, you can wait for the actual execution plan. Look at parameter sniffing. Also, I had an extremely odd case on SQL Server 2005 where
SELECT * FROM l LEFT JOIN r ON r.ID = l.ID WHERE r.ID IS NULL
would not complete, yet
SELECT * FROM l WHERE l.ID NOT IN (SELECT r.ID FROM r)
worked fine - but only for particular tables. Problem was never resolved.
Make sure your statistics are up to date.
If possible, post your query here so there is something to look at. I recall a query someone built with joins to 12 different tables, dealing with around 4 million or so records, that took around a day to run. I was able to tune it to run within 30 minutes by eliminating the unnecessary joins. Where possible, try to reduce the datasets you are joining before returning your results. Use plenty of temp tables, views, etc. if you need to.
In cases of large datasets with conditions, try to pre-apply your conditions through a view before your joins to reduce the number of records (see the sketch below).
100k joining 100k is a lot bigger than 2k joining 3k.
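To illustrate the reduce-before-you-join advice, here is a hedged sketch using JDBC against SQL Server; the table names, columns, and the date filter are invented for the example:

import java.sql.*;

public class PreFilterJoin {
    static void run(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // Pre-apply the selective condition into a temp table first...
            st.execute("SELECT customer_id, amount INTO #recent_orders " +
                       "FROM orders WHERE order_date >= '2024-01-01'");
            // ...then join the small temp table instead of the full table.
            try (ResultSet rs = st.executeQuery(
                    "SELECT c.name, o.amount FROM #recent_orders o " +
                    "JOIN customers c ON c.id = o.customer_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
                }
            }
        }
    }
}

The join now touches only the pre-filtered rows, which is the 2k-joining-3k case rather than the 100k-joining-100k one.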