How to time and benchmark Bigquery Standard SQL scripts?

How to time and benchmark Bigquery Standard SQL scripts? - sql

I have different versions of the Bigquery Standard scripts i.e. I follow the Standard API here. I am trying to find out Matlab-style timeit or warmed-up timing measures to benchmark different scripts:
Version A: very readabled code with modular code
WITH a AS
(
SELECT * FROM SOURCE
),
a_ AS
(
SELECT ... FROM a
)
SELECT * FROM a_
Version B: very unreadable code with subqueries but claimed to be efficient
SELECT * FROM (SELECT * FROM (SELECT * FROM SOURCE))
How can I time and benchmark different Standard SQL queries in Google Bigquery?
Possible perspectives to address
Do I need to warm up BQ Standard SQL queries like in Matlab?
What are the common performance differences between the version A and B? Any Pros and Cons? How can you demonstrate that in Bigquery?
Any documentation or recommendations available for the two different approaches (overusing subqueries vs modular coding)?

I summarires the comments and I recite on the speedup opportunities from the perspective of costs.
A demonstrates CTE (Common Table Expressions) while B demonstrates construction with subqueries like sets. The user Used_By_Already commentted that the difference should not matter in the performance so not convey speedup opportunities.
"No. That is the inverse of what I tried to convey. You can take almost any subquery and make it a cte, or vice versa take the type of ctes in your question and make it a subquery. The net effect it most likely to be zero or unmeasurable."
What matters is the prefiltering, more here as forwarded by Mikhail Berlyant in a comment. So you can make your queries more efficient by prefiltering data early.
As a side note, more early filtering and hence less computation may not affect on charging, more here, because charging is based on the input data, not on the amount of computation. So query optimisation for speedup opportunity makes little sense in Bigquery in terms of monetary costs, even though you may save some time.

Related

Alternatives to CTEs in Microsoft Access

I work with MS Access 2010 on a daily basis and I want to know if there are alternatives to a Common Table Expression as it is used in SQL Server, and how it affects performance?
For example, is it better to create a subquery or is it better to call a query from another query, which essentially is very similar..
Example:
SELECT A.field1,A.Date
FROM (SELECT * FROM TABLE B WHERE B.Date = Restriction )
or
SELECT A.field1,A.Date
FROM SavedQueryB
SavedQueryB:
SELECT * FROM TABLE B WHERE B.Date = Restriction
I feel having multiple queries makes it easier to debug and manage, but does it affect performance when the data-set is very large?
Also, I've seen some videos about implementing the queries thru VBA, however I'm not very comfortable doing it that way yet.
Essentially. What is more efficient or a better practice? Any suggestion or recommendations on better practices?
I am mostly self taught, through videos, and books, and some programming background (VB.NET)

For the simplest queries, such as those in your question, I doubt you would see a significant performance difference between a subquery and a "stacked query" (one which uses another saved query as its data source) approach. And perhaps the db engine would even use the same query plan for both. (If you're interested, you can use SHOWPLAN to examine the query plans.)
The primary performance driver for those 2 examples will be whether the db engine can use indexed retrieval to fetch the rows which satisfy the WHERE restriction. If TABLE.Date is not indexed, the query will require a full table scan. That would suck badly with a very large dataset, and the performance impact from a full scan should far overshadow any difference between a subquery and stacked query.
The situation could be different with a complex subquery as Allen Browne explains:
Complex subqueries on tables with many records can be slow to run.
Among the potential fixes for subquery performance problems, he suggests ...
Use stacked queries instead of subqueries. Create a separate saved
query for JET to execute first, and use it as an input "table" for
your main query. This pre-processing is usually (but not always)
faster than a subquery. Likewise, try performing aggregation in one
query, and then create another query that operates on the aggregated
results. This post-processing can be orders of magnitude faster than a
query that tries to do everything in a single query with subqueries.
I think your best answer will come from testing more complex real world examples. The queries in your question are so simple that conclusions drawn from them will likely not apply to those real world queries.

This is really context-dependent as a host of scenarios will decide the most efficient outcome including data types, join tables, indexes, and more. In essence and for simple queries like posted SELECT statmeents, the two queries are equivalent but the Jet/ACE's (underlying engine of MS Access) query optimizer may decide a different plan again according to structural needs of the query. Possibly, calling an external query adds a step in execution plan but then subqueries can be executed as self-contained tables then linked to main tables.
Recall SQL's general Order of Operations which differs from typed order as each step involves a virtual table (see SQL Server):
FROM clause --VT1
ON clause --VT2
WHERE clause --VT3
GROUP BY clause --VT4
HAVING clause --VT5
SELECT clause --VT6
ORDER BY clause --VT7
What can be said is for stored query objects, MS Access analyzes and caches the optimized "best plan" version. This is often the argument to use stored queries over VBA string queries which the latter was not optimized before execution. Even further, Access' query object is similar to other RDMS' view object (though Jet/ACE does have the VIEW and PROCEDURE objects). A regular discussion in the SQL world involves your very question of efficiency and best practices: views vs subqueries and usually the answer returns "it depends". So, experiment on a needs basis.
And here CTEs are considered "inline-views" denoted by WITH clause (not yet supported in JET/ACE). SQL programmers may use CTEs for readibility and maintainability as you avoid referencing same statement multiple times in body of statement. All in all, use what fits your coding rituals and project requirements, then adjust as needed.
Resources
MS Access 2007 Query Performance
Tips - with note on subquery performance
Intermediate Microsoft Jet
SQL
- with note on subqueries (just learned about the ANY/SOME/ALL)
Microsoft Jet 3.5 Performance White
Paper
- with note on query plans

I cannot recall where but there are discussion out there about this topic of nested or sub-queries. Basically they all suggest saved queries and then referenced the saved query.
From a personal experience, nested queries are extremely difficult to troubleshoot or modify later. Also, if they get too deep I have experienced a performance hit.
Allen Browne has several tips and tricks listed out here
The one place I use nest queries a lot are in the criteria of action queries. This way I do not have any joins and can limit some of the "cannot perform this operation" issue.
Finally, using query strings in VBA. I have found it be much easier to build parameter queries and then in VBA set a variable to the QueryDef and add in the parameters rather than build up a query string in VBA. So much easier to troubleshoot and modify later.
Just my two cents.

Are there any specialized databases for aggregate queries?

Are there any specialized databases - rdbms, nosql, key-value, or anything else - that are optimised for running fast aggregate queries or map-reduces like this over very large data sets:
select date, count(*)
from Sales
where [various combinations of filters]
group by date
So far I've run benchmarks on MongoDB and SQL Server, but I'm wondering if there's a more specialized solution, preferably one that can scale data horizontally.

In my experience, the real issue has less to do with aggregate query performance, which I find good in all major databases I've tried, than it has to do with the way queries are written.
I've lost count of the number of times I've seen enormous report queries with huge amounts of joins and inline subquery aggregates all over the place.
Off the top of my head, the typical steps to make these things faster are:
Use window functions where available and applicable (i.e. the over () operator). There's absolutely no point in refetching data multiple times.
Use common table expressions (with queries) where available and applicable (i.e. sets that you know will be reasonably small).
Use temporary tables for large intermediary results, and create indexes on them (and analyze them) before using them.
Work on small result sets by filtering rows earlier when possible: select id, aggregate from (aggregate on id) where id in (?) group by id can made much faster by rewriting it as select id, aggregate from (aggregate on id where id in (?)) group by id.
Use union/except/intersect all rather than union/except/intersect where applicable. This removes pointless sorting of result sets.
As a bonus the first three steps all tend to make the report queries more readable and thus more maintainable.

Oracle, DB2 Warehouse edition, and to a lesser degree SQLServer enterprise are all very good on these aggregate queries -- of course these are expensive solutions and it depends very much on your budget and business case whether its worth it.

Pretty much any OLAP database, this is exactly the type of thing they're designed for.

OLAP data cubes are designed for this. You denormalize data into forms that they can compute on quickly. The denormalization and pre computation steps can take time, so these databases are typically built only for reporting and are separate from the real time transactional data.

For certain kinds of data (large volumes, time series) kx.com provides probably the best solution: kdb+. If it looks like your kind of data, give it a try. Note: they don't use SQL, but rather a more general, more powerful, and more crazy set-theoretical language.

Measuring the complexity of SQL statements

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?
It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?
[Edit/Note]
While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.
[Edit/Note 2]
Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.

Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of complex the arithmetic is).
The "control flow" in a SQL query is best related to "and" and "or" operators in query.
The "computational complexity" is best related to operators such as SUM or implicit JOINS.
Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
What the SQL optimizer does to queries I think is absolutely irrelevant. The purpose of complexity measures is to characterize how hard is to for a person to understand the query, not how how efficiently it can be evaluated.
Similarly, what the DDL says or whether views are involved or not shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of machinery inside a used-abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractractions.
Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, thus sort them all, and focus your testing attention on the most complicated ones.

I'm not sure the retrieval of the query plans will answer the question: the query plans hide a part of the complexity about the computation performed on the data before it is returned (or used in a filter); the query plans require a significative database to be relevant. In fact, complexity, and length of execution are somewhat oppposite; something like "Good, Fast, Cheap - Pick any two".
Ultimately it's about the chances of making a mistake, or not understanding the code I've written?
Something like:
number of tables times (1
+1 per join expression (+1 per outer join?)
+1 per predicate after WHERE or HAVING
+1 per GROUP BY expression
+1 per UNION or INTERSECT
+1 per function call
+1 per CASE expression
)

Please feel free to try my script that gives an overview of the stored procedure size, the number of object dependencies and the number of parameters -
Calculate TSQL Stored Procedure Complexity

SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement, it will be a crude description of the steps the engine will use to execute your query.

Well I don't know of any tool that did such a thing, but it seems to me that what would make a query more complicated would be measured by:
the number of joins
the number of where conditions
the number of functions
the number of subqueries
the number of casts to differnt datatypes
the number of case statements
the number of loops or cursors
the number of steps in a transaction
However, while it is true that the more comlex queries might appear to be the ones with the most possible defects, I find that the simple ones are very likely to contain defects as they are more likely to be written by someone who doesn't understand the data model and thus they may appear to work correctly, but in fact return the wrong data. So I'm not sure such a metric wouild tell you much.

In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.
Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).
Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.

In programming languages we have several methods to compute the time complexity or space complexity.
Similarly we could compare with sql as well like in a procedure the no of lines you have with loops similar to a programming language but unlike only input usually in programming language in sql it would along with input will totally depend on the data in the table/view etc to operate plus the overhead complexity of the query itself.
Like a simple row by row query
Select * from table ;
// This will totally depend on no of
records say n hence O(n)
Select max(input) from table;
// here max would be an extra
overhead added to each
Therefore t*O(n) where t is max
Evaluation time

Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:
Apply a simple lexer on the query (like ones used for syntax coloring in text editors or here on SO) to split the query in tokens and give each token a class:
SQL keywords
SQL function names
string literals with character escapes
string literals without character escape
string literals which are dates or date+time
numeric literals
comma
parenthesis
SQL comments (--, /* ... */)
quoted user words
non quoted user words: everything else
Give a score to each token, using different weights for each class (and differents weights for SQL keywords).
Add the scores of each token.
Done.
This should work quite well as for example counting sub queries is like counting the number of SELECT and FROM keywords.
By using this algorithm with different weight tables you can even measure the complexity in different dimensions. For example to have nuanced comparison between queries. Or to score higher the queries which use keywords or functions specific to an SQL engine (ex: GROUP_CONCAT on MySQL).
The algorithm can also be tweaked to take in account the case of SQL keywords: increase complexity if they are not consistently upper case. Or to account for indent (carriage return, position of keywords on a line)
Note: I have been inspired by #redcalx answer that suggested applying a standard formatter and counting lines of code. My solution is simpler however as it doesn't to build a full AST (abstract syntax tree).

Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL:
https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity

Well if you're are using SQL Server I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).
Here is a link that goes over some of the things you should look at in the execution plan.

Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.
SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).

A good question. The problem is that for a SQL query like:
SELECT * FROM foo;
the complexity may depend on what "foo" is and on the database implementation. For a function like:
int f( int n ) {
if ( n == 42 ) {
return 0;
}
else {
return n;
}
}
there is no such dependency.
However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.

It's reasonably enough to consider complexity as what it would be if you coded the query yourself.
If the table has N rows then,
A simple SELECT would be O(N)
A ORDER BY is O(NlogN)
A JOIN is O(N*M)
A DROP TABLE is O(1)
A SELECT DISTINCT is O(N^2)
A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )

Can I trust Execution plans?

I have these Queries:
With CTE(comno) as
(select distinct comno=ErpEnterpriseId from company)
select id=Row_number() over(order by comno),comno from cte
select comno=ErpEnterpriseId,RowNo=Row_number() over (order by erpEnterpriseId) from company group by ErpEnterpriseId
SELECT erpEnterpriseId, ROW_NUMBER() OVER(ORDER BY erpEnterpriseId) AS RowNo
FROM
(
SELECT DISTINCT erpEnterpriseId
FROM Company
) x
All three of them returns identical cost and actual execution plans..why and how so ?

It's all down to the query optimizer - that will by trying to optimize the query you enter into the most efficient execution plan (i.e several different queries could be optimized down to the SAME statement that is estimated to be most efficient).
The main thing you should do when trying to optimise a query and find which one performs the best, is to just try them and compare performance. Run an SQL profiler trace to see what the duration/reads is for each version. I usually run each version of a query 3 times to get an average to compare. Each time, clearing the execution plan and data cache down to prevent skewed results.
It's worth having a read of this MSDN article on the optimizer.

Simple, the optimizer is probably turning all your statements into the same statement.

Just like in English, in which there are many ways to say the same thing, all three of those queries are asking for the same data. The SQL Engine (the query optimizer) knows that and is smart enough to know what you are asking.
Even more appropriately, the engine has information that you don't (or likely don't know) - how the data is organized and indexed. It uses this information to make it's own decision about what the BEST way to get the data is, and that's what it is doing.
Although there are ways to override the optimizer, unless you really know what you are doing, you will probably only hurt performance. So your best option is to write the queries in whatever way make most sense to you (or other humans) for readability and maintainability.

Working around UDF Performance Issues - Manual caching

My system does some pretty heavy processing, and I've been attacking the performance in order to give me the ability to run more test runs in shorter times.
I have quite a few cases where a UDF has to get called on say, 5 million rows (and I pretty much thought there was no way around it).
Well, it turns out, there is a way to work around it and it gives huge performance improvements when UDFs are called over a set of distinct parameters somewhat smaller than the total set of rows.
Consider a UDF that takes a set of inputs and returns a result based on complex logic, but for the set of inputs over 5m rows, there are only 100,000 distinct inputs, say, and so it will only produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete - the fundamental point with this technique is that you can simply determine if the trick will work by running the SELECT DISTINCT).
I found that by doing something like this:
INSERT INTO PreCalcs
SELECT param1
,param2
,dbo.udf_result(param1, param2) AS result
FROM (
SELECT DISTINCT param1, param2 FROM big_table
)
When PreCalcs is suitably indexed, the combination of that with:
SELECT big_table.param1
,big_table.param2
,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
ON PreCalcs.param1 = big_table.param1
AND PreCalcs.param2 = big_table.param2
You get a HUGE boost in performance. Apparently, just because something is deterministic, it doesn't mean SQL Server is caching the past calls and re-using them, as one might think.
The only thing you have to watch out for is where NULL are allowed, then you need to fix up your joins carefully:
SELECT big_table.param1
,big_table.param2
,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
ON (
PreCalcs.param1 = big_table.param1
OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL
)
AND (
PreCalcs.param2 = big_table.param2
OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL
)
Hope this helps and any similar tricks with UDFs, or refactoring queries for performance are welcome.
I guess the question is, why is manual caching like this necessary - isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?

Yes, the optimizer will not manually memoize UDFs for you. Your trick is very nice in the cases where you can collapse the output set down in this way.
Another technique that can improve performance if your UDF's parameters are indices into other tables, and the UDF selects values from those tables to calculate the scalar result, is to rewrite your scalar UDF as a table-valued UDF that selects the result value over all your potential parameters.
I've used this approach when the tables we based the UDF query on were subject to a lot of inserts and updates, the involved query was relatively complex, and the number of rows the original UDF had to be applied to were large. You can achieve some great improvement in performance in this case, as the table-values UDF only needs to be run once and can run as an optimized set-oriented query.

How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?
By using the PreCalcs table, you are simply running the udf over 100k rows rather that 5 million rows, before expanding back out again.
No optimiser in existence would be able to divine this useful information.
The scalar udf is a black box.
For a more practical solution, I'd use a computed, persisted columns that does the udf call.
So it's available in all queries can be indexed/included.
This suits OLTP more, maybe... I query a table to get trading cash and positions in real time in many different ways so this approach suits me to avoid the udf math overhead every time.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas