BigQuery - how to think about query optimisation when coming from a SQL Server background

BigQuery - how to think about query optimisation when coming from a SQL Server background - sql

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've done benchmarking after introducing a cluster for a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query cacheing. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception, however with SQL Server/SQL Lite/Informix I'm used to seeing immediate significant improvement, consistently. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it should work in a similar type of way), or could it just be that I've somehow been 'unlucky' with the optimisations.
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINS eliminate as many records as possible, use 'where' clauses that focus on indexed columns, etc. etc.
I feel I haven't got the control to optimise like in a RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".

It is absolutely a paradigm shift in how you think. You're right: you don't have hardly any control in execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery—often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can impact (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas our team HAS seem greatly improve performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
Use partitioning/clustering on large datasets when possible. I know you mention this, just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clutering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning if the criteria has been satisfied (so it doesn't help as much on low-cardinality values)
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter our more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use the CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like somethings in the works for that in BQ world since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

Related

Are sql tuning ways always same for different DB engine？

I used Oracle for the half past year and learned some tricks of sql tuning,but now our DB is moving to greenplum and the project manager suggest us to change some of the codes that writted in Oracle sql for their efficiency or grammar.
I am curious that Are sql tuning ways same for different DB engine,like oracle,postgresql,mysql and so on？if yes or not,why?Any suggestion are welcomed!
some like:
in or exists
count(*) or count(column)
use index or not
use exact column instead of select *

For the most part the syntax that is used will remain the same, there may be small differences from one engine to another and you may run into different terms to achieve some of the more specific output or do more complex tasks. In order to achieve parity you will need to learn those new terms.
As far as tuning, this will vary from system to system. Specifically going from Oracle to Greenplum you are looking at moving from a database where efficiency in a query if often driven by dropping an index on the data. Where Greenplum is a parallel execution system where efficiency is gained by effectively distributing the data across multiple systems and querying them in parallel. In Greenplum indexing is an additional layer that usually does not add benefit, just additional overhead.
Even within a single system using changing the storage engine type can result in different ways to optimize a query. In practice queries are often moved to a new platform and work, but are far from optimal as they don't take advantage of optimizations of that platform. I would strongly suggest getting an understanding of the new platform and you should not go in assuming a query that is optimized for one platform is the optimal way to run it in another.

Getting specifics in why they differ requires someone to be an expert in bother to be able to compare both. I don't claim to know much of greenplum.
The basic principles which I would expect all developers to learn over time dont really change. But there are "quirks" of individual engines which make specific differences. From your question I would personally anticipate 1 and 4 to remain the same.
Indexing is something which does vary. For example the ability to use two indexes was not (is not?) Ubiquitous. I wouldn't like to guess which DBMS can / can't count columns from the second field in a composite index. And the way indexes are maintained is very different from one DBMS to the next.
From my own experience I've also seen differences caused by:
Different capabilities in the data access path. As an example, one optimisation is for a DBMS to create a bit map of rows (matching and not matching) the combine multiple bitmaps to select rows. A DBMS with this feature can use multiple indexes in a single query. One without it can't.
Availability of hints / lack of hints. Not all DBMS support them. I know they are very common in Oracle.
Different locking strategies. This is a big one and can really affect update and insert queries.
In some cases DBMS have very specific capabilities for certain types of data such as geographic data or searchable free text (natural language). In these cases the way of working with the data is entirely different from one DBMS to the next.

are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. however, many of these posts are very old - 5 to 10 years! it seems, from many of these posts, that the performance concerns may have actually been quite small, even in their time, and as to the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.
and so: are there concerns with SELECT * that are still relevant today?
greater context: I've started working at a place with a lot of old MS code (ASP scripts, and the like), and I've been helping to modernize a lot of it, however: most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. ALSO: my co-workers are a little older than I am, and have some concerns that - to me - seem "older". ("nullable fields are slow! avoid them!") but again: in this particular field, they definitely have more experience than I do.
for this reason, I'd also like to ask: whether SELECT * with modern ORMs is or isn't safe and sane to do today, are there recent online resources which indicate such?
thanks! :)

I will not touch maintainability in this answer, only performance part.
Performance in this context has little to do with ORMs.
It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.
It is still a bad idea to select columns that you don't need.
It doesn't really matter from the performance point of view whether the query looks like:
SELECT * FROM Table
or all columns are listed there explicitly, like:
SELECT Col1, Col2, Col3 FROM Table
If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.
Why selecting unnecessary columns is a bad idea:
extra bytes to read from disk
extra bytes to transfer over the network
extra bytes to parse on the client
But, the most important reason is that optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index, but if you request more columns, it would do extra lookups or use some other index, or just scan the whole table. The final impact can vary from negligible to seconds vs hours of run time. The larger and more complicated the database, the more likely you see the noticeable difference.
There is a detailed article on this topic Myth: Select * is bad on the Use the index, Luke web-site.
Now that we have established a common understanding of why selecting
everything is bad for performance, you may ask why it is listed as a
myth? It's because many people think the star is the bad thing.
Further they believe they are not committing this crime because their
ORM lists all columns by name anyway. In fact, the crime is to select
all columns without thinking about it—and most ORMs readily commit
this crime on behalf of their users.
I'll add answers to your comments here.
I have no idea how to approach an ORM that doesn't give me an option which fields to select. I personally would try not to use it. In general, ORM adds a layer of abstraction that leaks badly. https://en.wikipedia.org/wiki/Leaky_abstraction
It means that you still need to know how to write SQL code and how DBMS runs this code, but also need to know how ORM works and generates this code. If you choose not to know what's going on behind ORM you'll have unexplainable performance problems when your system grows beyond trivial.
You said that at your previous job you used ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and the nature of the system allowed to hide performance questions behind the cache (it is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, usually it will be reasonably fast in any case. It begins to matter only when you cross the certain threshold. After which suddenly everything becomes slow and it is hard to fix it.
It is common for a business/project manager to ignore the possible future problems which may never happen. Business always has more pressing urgent issues to deal with. If business/system grows enough when performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or if the system happens to be really critical to the business, just fail and give a chance to another company to overtake it.
Answering your question "whether to use ORMs in applications where performance is a large concern". Of course you can use ORM. But, you may find it more difficult than not using it. With ORM and performance in mind you have to inspect manually the SQL code that ORM generates and make sure that it is a good code from performance point of view. So, you still need to know SQL and specific DBMS that you use very well and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?
You may think that this situation with ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it is not. Modern C++ compiler will indeed in most cases generate code that is better than what you can write manually in assembler. But, compiler knows processor very well and the nature of the optimisation task is much simpler than what you have in the database. ORM has no idea about the volume of your data, it knows nothing about your data distribution.
The simple classic example of top-n-per-group can be done in two ways and the best method depends on the data distribution that only the developer knows. If performance is important, even when you write SQL code by hand you have to know how DBMS works and interprets this SQL code and lay out your code in such a way that DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). DBMS has some statistics and its optimiser tries to use it, but it is often not enough.
And now on top of this you add another layer of ORM abstraction.
Having said all this, "performance" is a vague term. All these concerns become important after a certain threshold. Since modern hardware is pretty good, this threshold had been pushed rather far to allow a lot of projects to ignore all these concerns.
Example. An optimal query over a table with million rows returns in 10 milliseconds. A non-optimal query returns in 1 second. 100 times slower. Would the end-user notice? Maybe, but likely not critical. Grow the table to billion rows or instead of one user have 1000 concurrent users. 1 second vs 100 seconds. The end-user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as data grows, because various caches would become less and less useful.

From a SQL-Server-Performance-Point-of-view, you should NEVER EVER use select *, because this means to sqlserver to read the complete row from disk or ram. Even if you need all fields, i would suggest to not do select *, because you do not know, who is appending any data to the table that your application does NOT need. For Details see answer of #sandip-patel
From a DBA-perspective: If you give exactly those columnnames you need the dbadmin can better analyse and optimize his databases.
From a ORM-Point-Of-View with changing column-names i would suggest to NOT use select *. You WANT to know, if the table changes. How do you want to give a guarantee for your application to run and give correct results if you do not get errors if the underlying tables change??
Personal Opinion: I really do not work with ORM in Applications needing to perform well...

This question is out some time now, and noone seems to be able to find, what Ben is looking for...
I think this is, because the answer is "it depends".
There just NOT IS THE ONE answer to this.
Examples
As i pointed out before, if a database is not yours, and it may be altered often, you cannot guarantee performance, because with select * the amount of data per row may explode
If you write an application using ITS OWN database, noone alters your DB (hopefully) and you need your columns, so whats wrong with select *
If you build some kind of lazy loading with "main properties" beeing loaded instantly and others beeing loaded later (of same entity), you cannot go with select * because you get all
If you use select * other developers will every time think about "did he think about select *" as they will try to optimize. So you should add enough comments...
If you build 3-Tier-Application building large caches in the middle-Tier and performance is a theme beeing done by cache, you may use select *
Expanding 3Tier: If you have many many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle-Tier with every byte beeing wasted (as someone pointed out in the comments before)
If you build a small app for 3 users and some thousands of records, the budget may not give time to optimize speed/db-layout/something
Speak to your dba... HE will advice you WHICH statement has to be changed/optimized/stripped down/...
I could go on. There just is not ONE answer. It just depends on to many factors.

It is generally a better idea to select the column names explicitly. Should a table receive an extra column it would be loaded with a select * call, where the extra column is not needed.
This can have several implications:
More network traffic
More I/O (got to read more data from disk)
Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)
Possibly even more CPU (a covering index cannot be used so data needs sorting)
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2 Where column = t1.colA)
More Details -1
More Details -2
More Details -3

Maintainability point.
If you do a "Select * from Table"
Then I alter the Table and add a column.
Your old code will likely crash as it now has an additional column in it.
This creates a night mare for future revisions because you have to identify all the locations for the select *.
The speed differences is so minimal I would not be concerned about it. There is a speed difference in using Varchar vs Char, Char is faster. But the speed difference is so minimal it is just about not worth talking about.
Select *'s biggest issue is with changes (additions) to the table structure.
Maintainability nightmare. Sign of a Junior programmer, and poor project code. That being said I still use select * but intend to remove it before I go to production with my code.

SQL versus noSQL (speed)

When people are comparing SQL and noSQL, and concluding the upsides and downsides of each one, what I never hear anyone talking about is the speed.
Isn't performing SQL queries generally faster than performing noSQL queries?
I mean, for me this would be a really obvious conclusion, because you should always be able to find something faster if you know the structure of your database than if you don't.
But people never seem to mention this, so I want to know if my conclusion is right or wrong.

People who tend to use noSQL use it specifically because it fits their use cases. Being divorced from normal RDBMS table relationships and constraints, as well as ACID-ity of data, it's very easy to make it run a lot faster.
Consider Twitter, which uses NoSQL because a user only does very limited things on site, or one exactly - tweet. And concurrency can be considered non-existent since (1) nobody else can modify your tweet and (2) you won't normally be simultaneously tweeting from multiple devices.

The definition of noSQL systems is a very broad one -- a database that doesn't use SQL / is not a RDBMS.
Therefore, the answer to your question is, in short: "it depends".
Some noSQL systems are basically just persistent key/value storages (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will (or at least should be) faster that an RDBMS, because it only needs to have a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB).
These databases have no predefined data structure.
Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.

As Einstein would say, speed is relative.
If you need to store a master/detail simple application (like a shopping cart), you would need to do several Insert statements in your SQL application, also you will get a Data set of information when you do a query to get the purchase, if you're using NoSQL, and you're using it well, then you would have all the data for a single order in one simple "record" (document if you use the terms of NoSQL databases like djondb).
So, I really think that the performance of an application can be measured by the number of things it need to do to achieve a single requirement, if you need to do several Inserts to store an order and you only need one simple Insert in a database like djondb then the performance will be 10x faster in the NoSQL world, just because you're using 10 times less calls to the database layer, that's it.
To illustrate my point let me link an example I wrote sometime ago about the differences between NoSQL and SQL data models approach: https://web.archive.org/web/20160510045647/http://djondb.com/blog/nosql-masterdetail-sample/, I know it's a self reference, but basically I wrote it to address this question which I found it's the most challenging question a RDBMS guy could have and it's always a good way to explain why NoSQL is so different from SQL world, and why it will achieve better performance anytime, not because we use "nasa" technology, it's because NoSQL will let the developer do less... and get more, and less code = greater performance.

The answer is: it depends. Generally speaking, the objective of NoSQL DATABASES (no "queries") is scalability. RDBMS usually have some hard limits at some point (I'm talking about millons and millons of rows) where you could not scale any more by traditional means (Replication, clustering, partitioning), and you need something more because your needs keep growing. Or even if you manage to scale, the overall setup is quite complicated. Or you can scale reads, but not writes.
And the queries depends on the particular implementation of your server, the type of query you are doing, the columns in the table, etc... remember that queries are just one part of the RDBMS.

query time of relational database like SQL for 1000 person data is 2000 ms and graph database like neo4j
is 2ms .if you crate more node 1000000 speed stable 2 ms

Over use of Oracle With clause?

I'm writing many reporting queries for my current employer utilizing Oracle's WITH clause to allow myself to create simple steps, each of which is a data-oriented transformation, that build upon each other to perform a complex task.
It was brought to my attention today that overuse of the WITH clause could have negative side effects on the Oracle server's resources.
Can anyone explain why over use of the Oracle WITH clause may cause a server to crash? Or point me to some articles where I can research appropriate use cases? I started using the WITH clause heavily to add structure to my code and make it easier to understand. I hope with some informative responses here I can continue to use it efficiently.
If an example query would be helpful I'll try to post one later today.
Thanks!

Based on this: http://www.dba-oracle.com/t_with_clause.htm it looks like this is a way to avoid using temporary tables. However, as others will note, this may actually mean heavier, more expensive queries that will put an additional drain on the database server.
It may not 'crash'. That's a bit dramatic. More likely it will just be slower, use more memory, etc. How that affects your company will depend on the amount of data, amount of processors, amount of processing (either using with or not)

is there such thing as a query being too big?

I don't have too much experience with SQL. Most of the queries I have written have been very small. Whenever I see a very large query, I always kinda assume it needs to be optimized. But is this true? or is there situations where really large queries are just whats needed?
BTW when I say large queries I mean queries that exceed 1000+ chars

Yes, any statement, method, or even query can be "too big".
The problem, is actually defining what too big really is.
If you can't sit down and figure out what the query does in a relatively short amount of time, it's probably best to break it up into smaller chunks.
I always like to look at things from a maintenance standpoint. If the query is hard to understand now, what if you have to debug something in it?
Just because you see a large query, doesn't mean it needs to be changed or optimized, but if it's too complicated for its own good, then you might want to consider refactoring.

Just as in other languages, you can't determine the efficiency of a query based on a character count. Also, 1000 characters isn't what I could call "large", especially when you use good table/column names, aliases that make sense, etc.
If you're not comfortable enough with SQL to be able to "eye ball" the design merits of particular query, run it through a profiler and examine the execution plan. That'll give you a good idea of problems, if any, the code in question will suffer from.
My rule of thumb is this: write the best, tightest, simplest code you can, and optimize where needed - ie, where you see a performance bottleneck or where (as frequently happens) you slap yourself in the head and say "D'OH!" at about three in the morning on vacation.
Summary:Code well, and optimize where needed.
As Robert said, if you can't easily tell what the query is doing, it probably needs to be simplified.

If you are used to writing simple stuff, you may not realize how complex getting information for a complex report might be. Yes, queries can get long and complicated and still perform well for what they are being asked to do. Often the techniques that are used to performance tune something may make the code look more complicated to those less familar with advanced querying techniques. What counts is how long it takes to execute and whether it returns the correct data, not how many characters it has.
When I see a complex query, my first thought is does it return what the developer really intended to return (you'd be surprised at how often the answer to that is no) and then I look to see if it could be performance tuned. Yes there are many badly written long queries out there, but there are also as many or more that do what they are intended to do about as fast as it can be done without a major database redesign or faster hardware.

I'd suggest that it's not the characters that should measure the size/complexity of the query.
I'd boil it down to:
what's the goal of the query?
does it used set-based logic?
does it re-use any components?
does it JOIN improperly/poorly?
what are the performance implications?
maintainability concerns - is it written so that another developer can grok its intentions?

Where I work we've created stored procedures that exceed 1000 characters. I can't really say it was NECESSARY but sometimes haste wins out over efficiency (most notably when a quick fix is necessary for a client).
Having said that ... if given the time I would attempt to optimize a query as small/efficient as it can get without it being overly confusing. I've used nested stored procedures to make things a little more clear and/or functions as well.

The number of characters does not mean that a query needs to be optimized - it is what you see within those characters that does.
Things like subqueries on top of subqueries is something I would review. I'd review JOINs as well, but it shouldn't take long comparing to the ERD to know if there's an unnecessary JOIN - the first thing I'd look at would be what tables are joined but not used in the output, which is fine if the tables are link/corrollary/etc tables.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas