Querying tables with repeated fields: Performance considerations - google-bigquery

Assume we have a table with two dimensions (DimA & DimB) and one repeated field (Facts).
Now also consider the "flattened" version of it:
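The table layouts themselves aren't reproduced here; a sketch of what the two versions might look like as BigQuery DDL (field types, and the extra fact fields B and C, are assumed) is:

-- Nested layout (tableA): one row per (dimA, dimB); facts is a repeated struct
create table dataset.tableA (
  dimA int64,
  dimB int64,
  facts array<struct<A float64, B float64, C float64>>
);

-- Flattened layout (tableB): one row per element of facts
create table dataset.tableB (
  dimA int64,
  dimB int64,
  A float64,
  B float64,
  C float64
);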
We should expect queries directed toward the first table to be generally cheaper since we are scanning fewer rows. However, in my experience it looks like it is cheaper to run queries against the flattened version of the table.
For example, it turns out that
select sum(A) from tableA, unnest(facts) where dimA = 1001
is more expensive than
select sum(A) from tableB where dimA = 1001
This is (at least to me) counterintuitive, and I was wondering why this is happening.

The second query is cheaper than the first because the second query selects only one column, whereas the first query needs to unnest the repeated field and so reads three columns. (Related, but not quite the same question: https://stackoverflow.com/a/53266945/3054766)
Because BigQuery stores data in a columnar (column-based) format, selecting fewer columns is sure to be cheaper. In your case, it's a trade-off between saving storage cost (the first option) and query cost/performance (the second option). Considering the ease of querying, I think the second option is much cleaner.
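If you want to verify this on your own data, the bytes processed by each form can be compared in the job history after running both queries; a sketch (assuming the US region and standard SQL):

select query, total_bytes_processed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
where creation_time > timestamp_sub(current_timestamp(), interval 1 hour)
order by creation_time desc;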

Related

Computational Efficiency - I/O

If I have two queries, why does the second query seem to be more computationally efficient (just in terms of I/O) than the first?
The first query only returns eight fields, runs in 1.1 sec and processes 115.6 MB. The second, however, returns over a million records, but runs in just 3.4 sec and only accesses 8.2 MB.
I am really trying to understand writing queries more efficiently as I am beginning to use substantially larger pools of data. Thanks!
SELECT *
FROM `table1`
LIMIT 10;

SELECT id
FROM `table1`;
BigQuery is basically a columnar database (this is not exactly true, but it is a useful approximation). That is, it stores each column separately. So accessing one column only requires finding and reading that one column. Accessing multiple columns requires finding all those columns and reading them -- even if you only want one value.
This is not only a function of performance. The number of columns also determines billing. For users of other databases, it can be really surprising when:
select t.*
from t
limit 10;
ends up costing $10 or $100 because t is really big and wide. But:
select count(id)
from t;
costs almost nothing at all.
As another note: when you refer to a table multiple times in a query, you only pay for access once. So self-joins are not more expensive than selecting directly from the table.
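For example, this hypothetical self-join is billed for the id column of t only once, even though t appears twice in the query:

select count(*)
from t a
join t b
  on a.id = b.id;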

Optimising a sequence of SQL calculations - Nested or Separate Queries?

Short Intro:
When it is required to have a dozen nested calculating queries, is it more optimal to
A) Perform each operation separately (saving into a table for each result and then reading that table for the next query)
B) Have a large set of nested selects
Full Description:
I am trying to calculate some advanced forecasts from a series of input tables in SQL.
I am building around a dozen 'modules', each separated into its own schema, and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module are dumped into the same output table once completed.
Queries range from 7k-200k rows.
A single schema's/module's tables might look like this:
Input Table 1
Input Table 2
Input Table 3
Input Table 4
Calculation Query 1 Result Table
Calculation Query 2 Result Table
Calculation Query 3 Result Table
Calculation Query 4 Result Table
Calculation Query 5 Result Table
Calculation Query 6 Result Table
Final Output
Each calculation query uses the results of the previous one (for the most part). The final output is the result of the final calculation query. Calculations are not very complex: a partitioned max, basic formulas (+, -, *, /), or SUM, et cetera. Normally there are only 1-3 of these per calculation step, and always on the same column.
The main reason this is split into multiple calculation queries (instead of one super-formula) is that each calculation joins the outputs in a different way and uses different input tables; also, some are based on previous row results (such as a partitioned max or LAG).
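As a sketch of what one such step might look like (using the column names listed under Additional Info below; the actual formulas are not shown in the question):

select ForecastID, [Group], [Year], [Type], [Sub-Type],
       max([Value]) over (partition by [Group], [Type]) as MaxValueInGroup,
       lag([Value]) over (partition by [Group], [Type] order by [Year]) as PreviousYearValue
from CalcStep1Result;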
My requirements are as follows:
A procedure that calculates final output from step 1 and merges into Final Output.
A procedure that calculates up to the selected calculation query and merges into its respective results table (and stops). Consider this the 'overriding final'.
I DON'T need to store the calculation results of intermediate queries - only the final output, or the 'overriding final' if selected.
My Problem:
I am trying to optimise the entire process - at this point it looks like it will take around 10-15 seconds. I want it to be 1 second - however I appreciate this is probably not possible.
What I have tried:
Firstly, I created a single procedure for each calculation query that merged the results into its respective output table. Using this method, each calculation query must read from the database and then merge into its output.
I tried temp tables; however, I don't see why this would be optimal because I already have existing tables for the calculation steps - which are indexed with the next step in mind.
I then made an assumption that it would be faster to simply nest all the queries into one super-procedure or maybe even have a sequence of Table-Functions.
My Question:
However, I ran into a thought that I could not find an answer for, which is the following:
Inserting results into a table on every calculation step might slow the process (especially as they are indexed with 2-4 columns); but at least the data will be indexed for the next step.
Nesting selects would save the effort of inserting data, but those results wouldn't be indexed? Right? Or wrong?
Are select results intelligently indexed? Given my scenario, what advice would you give on how to approach this? Maybe I am missing something really simple.
Additional Info:
Most of my larger query results (150-200K) have 4 columns that need to be indexed.
All of my tables only have one column that needs calculating - the rest are indexed.
For Example:
ForecastID, Group, Year, Type, Sub-Type, Value
So I have to index Group, Year, Type and Sub-Type to Join multiple input tables and then calculate on the Value column.
I am telling you this in case having index-heavy tables influences your advice - I won't ask for help on optimizing indexes here due to the overwhelming quantity of advice already available, and because it's a different question!
Query optimization is often more art than science; there are few hard and fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, time to hit the high points.
Indexes' effects on loading tables - Indexes have a similar performance impact on inserts as triggers. Unless you have a filtered index, each insert will have to update every index on the table, so at three indexes you are looking at quadrupling the number of updates per insert. At one read per insert, and with a small table size of 200k rows (very doable as a table scan), three indexes probably put you outside the butter zone for the cost vs. benefit of having those indexes on your work tables.
Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk, it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server, you should be OK performance-wise with nesting queries, so long as you're only doing one result set at a time. Once again this varies based on your setup; if you are strapped for RAM, drop them into a temp table.
Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. Conceptually, the first step in a join is a full Cartesian product between the tables, which is then filtered based on the ON and WHERE clauses. The join process is heavily optimized in all RDBMSs, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes; however, when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again, the butter zone between subsets and full table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.
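A sketch of the pre-filtering idea from the Joins point above (SQL Server temp tables, with the column names from the question; the filter and formula are made up for illustration):

-- Pull only the rows each side actually needs, then join the two smaller sets
select [Group], [Year], [Type], [Sub-Type], [Value]
into #step_input
from dbo.InputTable1
where [Year] = 2020;          -- hypothetical filter

select [Group], [Year], [Type], [Sub-Type], [Value]
into #prev_step
from dbo.CalcStep1Result
where [Year] = 2020;

select a.[Group], a.[Year], a.[Type], a.[Sub-Type],
       a.[Value] + b.[Value] as [Value]   -- hypothetical calculation
from #step_input a
join #prev_step b
  on  a.[Group] = b.[Group]
  and a.[Year]  = b.[Year]
  and a.[Type]  = b.[Type]
  and a.[Sub-Type] = b.[Sub-Type];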
It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A, persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
If you want to use approach B, or mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
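A minimal sketch of what a CTE chain feeding the final merge might look like (the step logic, table names and merge keys are illustrative only):

with step1 as (
    select [Group], [Year], [Type], [Sub-Type], sum([Value]) as [Value]
    from dbo.InputTable1
    group by [Group], [Year], [Type], [Sub-Type]
),
step2 as (
    select s.[Group], s.[Year], s.[Type], s.[Sub-Type],
           s.[Value] * i.[Value] as [Value]      -- hypothetical second calculation
    from step1 s
    join dbo.InputTable2 i
      on  i.[Group] = s.[Group]
      and i.[Year]  = s.[Year]
)
merge dbo.FinalOutput as target
using step2 as source
  on  target.[Group] = source.[Group]
  and target.[Year]  = source.[Year]
  and target.[Type]  = source.[Type]
  and target.[Sub-Type] = source.[Sub-Type]
when matched then
  update set target.[Value] = source.[Value]
when not matched then
  insert ([Group], [Year], [Type], [Sub-Type], [Value])
  values (source.[Group], source.[Year], source.[Type], source.[Sub-Type], source.[Value]);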

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
  select foo.a, bar.b, baz.c
  from foo, bar, baz
  -- updated for clarity's sake
  where foo.a = bar.b
  and bar.b = baz.c
)
group by a, b, c
vice
create table results as
select foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;

create index results_spanning on results(a, b, c);

select * from results group by a, b, c;
So, in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second approach allows me to create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so it can utilize the index.
What is the complexity difference between these two approaches, i.e. how do they scale, and which is preferable in the case of large quantities of data? Also, the main issue is the performance of the overall select, so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably with a fast refresh if the tables are large) either on commit or on demand. There are some restrictions on how you can create on-commit, fast-refreshable views, and they will add slightly to your commit-time processing, but they will always give the same result as running the base query. On-demand MVs will become stale as the underlying data changes, until they are refreshed. You'll need to determine whether this is acceptable or not.
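A minimal sketch of the on-demand variant against the tables in the question (the fast, on-commit variant would additionally need materialized view logs on foo, bar and baz):

create materialized view results_mv
  build immediate
  refresh complete on demand
  as
  select foo.a, bar.b, baz.c
  from foo, bar, baz
  where foo.a = bar.b
    and bar.b = baz.c;

-- later, after the base tables change:
exec dbms_mview.refresh('RESULTS_MV');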
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for doing GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you are doing some form of Cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Is using COUNT(*) or SELECT * a good idea?

I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but I wasn't able to dig up any further information about it.
I can imagine that the database then uses all columns for the action, which can be an impressive performance loss, but I'm not sure about that. Does somebody have further information about the topic?
1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrite your query to a less costly alternative (if such a plan is available).
Given the following table:
table(
pk not null
,color not null
,nullable null
,unique(pk)
,index(color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.
1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Blu-ray movies (yes, the movies are stored as blobs in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies WHERE id = ? in the getMovies(movie_id) method. I will refrain from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticeable impact.
One last point on performance: when all the referenced columns (selected, filtered) in your query exist in an index (called a covering index), the database need not touch the table at all. The query can be fully resolved by scanning the index only. By selecting all columns you remove this option from the optimizer.
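A sketch using the movies table from above (the index name is assumed):

create index movies_id_name_ix on movies (id, name);

-- Can be resolved entirely from the index; the blob column is never touched:
select name from movies where id = 1;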
Another thing about SELECT *, which is far more serious than performance, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
Any of the tables' columns are rearranged, for example to T1(id, name)
T1 gets an additional not-null column
T2 gets another column
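By contrast, the same statement written with explicit column lists keeps working (or fails loudly) when columns are added or rearranged:

insert into t1 (name, id)
select name, id
from t2;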
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns is faster than selecting more columns. A positive side-effect of explicit selects is that it gives greater freedom to the optimizer.
COUNT(*) is different from COUNT(column1)!
COUNT(*) returns the number of records and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non-null.
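A hypothetical illustration, with three rows of which exactly one has a NULL column1:

select count(*), count(column1)
from t;
-- count(*)       -> 3
-- count(column1) -> 2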
For SELECT, it is different. SELECT * will of course request more data.
When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
Using select * means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintenance problem; if you add fields to a table, your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.
Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.
It's absolutely true that "*" means "all columns". And you're right that if you have a table with an incredible number of columns (say 100+), these kinds of queries can be bad in terms of efficiency.
I believe that the best solution is to create database views that filter, in advance, the number of records involved in the count operation, so the performance impact isn't a big problem, because views can be cached.
On the other hand, it seems that the "*" operator should be avoided when returning records, and it's much better to select only the fields you really need for the business at hand.
Using SELECT * can have a performance hit. Applications which use the SELECT * syntax when they actually only need a handful of columns are transferring more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.
How inefficient it becomes depends on the size of the database; the simplest way to describe it would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
MySQL knows exactly what columns it is looking for, but when you do
SELECT * FROM table1
MySQL does not know which columns you want; it knows you want all of them, but not their names, so it has to perform extra work to analyse the table and discover the columns, which uses additional resources.
In case of COUNT(*) it depends on database and its version. For example in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in case of COUNT(*) is to measure it.
Using SELECT * is a really bad idea. * means read all columns, which can be a heavy I/O and network operation (especially for various types of CHAR columns). Moreover, you rather rarely need all columns.

100+ tables JOIN

I was wondering if anyone has ever had a chance to measure how 100 joined tables would perform.
Each table would have an ID column with a primary index, and all tables are related 1:1.
It is a common problem within many data entry applications where we need to collect 1000+ data points. One solution would be to have one big table with 1000+ columns and the alternative would be to split them into multiple tables and join them when it is necessary.
So perhaps the more realistic question would be: how would 30 tables (30 columns each) behave in a multi-table join?
500K-1M rows should be the expected size of the tables.
Cheers
As a rule of thumb, any more than 25 joins might be a performance problem. I try to keep joins below 10-15. But it depends on the database activity, the number of concurrent users, and the ratio of reads to writes.
Suggest you look at indexed views.
With any well-tuned database, 'good' indexes for the query workload are the key.
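A sketch of the indexed-view idea in SQL Server syntax, pre-joining two of the 1:1 part-tables (table and column names are assumed):

create view dbo.v_DataPoints
with schemabinding
as
select a.ID, a.Point001, a.Point002, b.Point031, b.Point032
from dbo.DataPart1 a
join dbo.DataPart2 b on b.ID = a.ID;
go

create unique clustered index ix_v_DataPoints on dbo.v_DataPoints (ID);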
They'd most likely perform terribly, unless you had a very small number of rows per table.
Go for a wider table, but normalize it properly. My guess is that if you normalize your data properly, you will have a slightly more sane design.
What you describe is similar to the implementation of a column-oriented database (wikipedia). The data is stored in "column major" format, which slows down adding each row but is much faster for querying when a WHERE clause restricts the returned rowset.
Why is it that you would rather split the rows? Is it that you measure the data elements for each row at different times? Or is it that the query result of a row would be very large?
Since first posting this, you answered me below that your reason for desiring a split of the table is that you usually only work with a subset of the data.
In that case, splitting the table can help your performance (amount of runtime consumed by the query) some amount. This may be an important factor in your wanting to work with less data -- in the case where your database engine runs slowly with large rows.
If performance is not a concern, then rather than splitting the data and using SQL JOINs, it might serve you to keep the wide table and explicitly list the columns you wish to retrieve in each query. For example, if you only wish to retrieve width, height, and length for a row, you could use:
SELECT width, height, length FROM datatable; rather than SELECT * FROM datatable; and accomplish the same improvement of getting less data returned. The SQL statements used would probably be shorter than the alternative join statements we were considering.
There's no way to better organise the tables? For example a "DataPointTypes" and "DataPointValues" table?
For example (and I don't know your particular circumstances) if all of your tables are like "WebsiteDataPoints (WebsitePage, Day, Visits)" "StoreDataPoints (Branch, Week, Sales)" etc. you could instead have
DataPointSources(Name)
(with data: Website,Store)
DataPointTypes(SourceId, ColumnName)
(with data: (Website, WebsitePage), (Website, Day), (Store, Branch), (Store, Sales) etc.)
DataPointEntry(Id, Timestamp)
DataPointValues (EntryId, TypeId, Value (as varchar probably))
(with data: (1, Website-WebsitePage, 'pages.php'), (2, Store-Branch, 'MainStore'), (1, Website-Day, '12/03/1980'), (2, Store-Sales '35') etc.)
In this way each table becomes a source, each column becomes a type, each row becomes an entry, and each cell becomes a value.
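As a sketch, reassembling one entry's values from that layout might look like this (generic SQL; it assumes DataPointTypes has an Id key and DataPointValues carries a TypeId, as the sample data implies):

select e.Id,
       max(case when t.ColumnName = 'WebsitePage' then v.Value end) as WebsitePage,
       max(case when t.ColumnName = 'Day'         then v.Value end) as Day,
       max(case when t.ColumnName = 'Visits'      then v.Value end) as Visits
from DataPointEntry e
join DataPointValues v on v.EntryId = e.Id
join DataPointTypes  t on t.Id = v.TypeId
group by e.Id;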