BigQuery table properties affecting query performance - google-bigquery

This question relates to the performance of querying data in BigQuery.
Are there any particular table or column settings that affect query performance, or are all columns in a table effectively treated equally by BigQuery, so that the order of columns, or any definitions applied to a column, would not impact data fetching in any distinguishable way?
Thanks!

All columns are treated equally by BigQuery. The order doesn't matter at all. There aren't any settings that affect performance, but if you import your table in a very large number of small pieces (e.g. more than a few thousand) it can cause some slowdown.

Related

Star Schema Query - do you have to include all dimensions(joins) in the query?

Hi, I have a question regarding star schema queries in an MS SQL data warehouse.
I have a fact table and 8 dimensions, and I am confused: to get the metrics from the fact table, do we have to join all the dimensions to it, even if I am not pulling data from them? Is this required to get the right metrics?
My fact table is huge, which is why I am asking about performance and the right way to query it.
Thanks!
No, you do not have to join all 8 dimensions. You only need to join the dimensions that contain data you need for analyzing the metrics in the fact table. Also, to increase performance, make sure to include only the columns from each dimension table that are needed for the analysis; including every column from the dimensions you join will decrease performance.
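As a rough illustration (all table and column names here are invented), a query that only needs metrics broken down by date and product can join just those two dimensions and ignore the other six:

    -- Hypothetical star schema: FactSales plus 8 dimensions, of which only
    -- DimDate and DimProduct are needed for this particular question.
    SELECT
        d.CalendarYear,
        p.ProductCategory,
        SUM(f.SalesAmount) AS TotalSales
    FROM FactSales AS f
    JOIN DimDate    AS d ON d.DateKey    = f.DateKey
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    GROUP BY d.CalendarYear, p.ProductCategory;

The other six dimensions are simply left out; the metrics stay correct because each fact row joins to exactly one row in each dimension it references.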
It is not necessary to include all the dimensions. Indeed, when exploring fact tables, it is very important to be able to select only some dimensions to join and drop the others. Performance issues must not be an excuse to give up on this capability.
There are a number of techniques for solving performance issues, depending on the database you are using. Some common ones:
aggregate tables: this is one of the best ways to solve performance issues. If you have a huge fact table, you can create an aggregated version of it using only the most frequently queried columns, so that it is much smaller (see the sketch after this list). Users (or the ad hoc query application) then have to use the aggregate table instead of the original fact table whenever possible. The good news is that most databases know how to manage aggregate tables automatically (materialized views, for example): queries that initially target the original fact table are transparently redirected to the aggregate table whenever possible.
indexing: bitmap indexing, for example, can be an efficient way to increase performance on a star schema fact table.
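A minimal sketch of a hand-built aggregate table, continuing the invented FactSales example above (SQL Server's automatic variant is an indexed view, Oracle's is a materialized view):

    -- Pre-compute the most frequently queried grain (day x product).
    -- SELECT ... INTO is SQL Server syntax; other engines use CREATE TABLE AS.
    SELECT
        f.DateKey,
        f.ProductKey,
        SUM(f.SalesAmount) AS SalesAmount,
        COUNT(*)           AS SalesCount
    INTO AggSalesDailyProduct
    FROM FactSales AS f
    GROUP BY f.DateKey, f.ProductKey;

    -- Queries at (or above) that grain can read the much smaller aggregate.
    SELECT DateKey, SUM(SalesAmount) AS TotalSales
    FROM AggSalesDailyProduct
    GROUP BY DateKey;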

Query Performance over a BigQuery table with nullable field

We need to upload data from our logs to Google BigQuery and we have two subsets of the log data that will not overlap when queried.
Subset number one has a field "vendor_id" which will be used a lot in WHERE clauses.
Subset number two consists of the log entries that do not have "vendor_id".
We could make a single table with a nullable "vendor_id" field, or make two different tables, one for each subset. Is there any difference in the performance of these approaches?
Regards
Leo
There will be little (if any) difference in query performance between the two options you mention. That said, the cost of queries is proportional to the amount of data read, so if you have two separate tables it will likely be less expensive, since each query will read a smaller amount of data.
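To make the cost point concrete, here is a hedged sketch with invented table and column names (BigQuery standard SQL; the query is billed for the bytes in the columns it reads, across all rows of the tables it references):

    -- Option 1: one combined table with a nullable vendor_id.
    -- The referenced columns are scanned across all log rows, including
    -- the subset that never has a vendor_id.
    SELECT vendor_id, COUNT(*) AS events
    FROM my_dataset.all_log_entries
    WHERE vendor_id = '12345'
    GROUP BY vendor_id;

    -- Option 2: two tables, one per subset.
    -- The same query reads only the smaller vendor-specific table,
    -- so it scans (and bills) fewer bytes.
    SELECT vendor_id, COUNT(*) AS events
    FROM my_dataset.vendor_log_entries
    WHERE vendor_id = '12345'
    GROUP BY vendor_id;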

Normalizing an extremely big table

I face the following issue. I have an extremely big table that I inherited from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have the "text" type, but some of them actually represent other types (for example, integer or datetime), so these text values have to be converted to the appropriate types before they can be used
the table has more than 100 million rows, and the space used by the table will soon reach 1 terabyte
the table does not have any indices
the table does not have any partitioning implemented.
As you may guess, it is impossible to run any reasonable query against this table. Right now people only insert new records into the table, but nobody uses it. So I need to restructure it. I plan to create a new structure and refill it with the data from the old table. Obviously, I will implement partitioning, but it is not the only thing to be done.
One of the most important features of the table is that the fields that are purely textual (i.e. they don't have to be converted to another type) usually have frequently repeated values, so the actual variety of values in a given column is in the range of 5-30 distinct values. This suggests normalization: for every such textual column I will create an additional table with the list of all the distinct values that may appear in that column, create a (tinyint) primary key in that additional table, and then use an appropriate foreign key in the original table instead of keeping the text values there. I will then put an index on this foreign key column. The number of columns to be processed this way is about 100.
This raises the following questions:
Would this normalization really increase the speed of queries imposing conditions on some of those 100 fields? If we set aside the space needed to store those columns, would there be any increase in performance due to substituting the initial text columns with tinyint columns? If I do not do any normalization and simply put an index on those initial text columns, will the performance be the same as for an index on the planned tinyint column?
If I do the described normalization, then building a view showing the text values will require joining my main table with some 100 additional tables. A positive point is that those joins will all be on "primary key" = "foreign key" pairs, but it is still quite a large number of tables to join. So the question is: will the performance of queries against this view be no worse than the performance of queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables:
Joining 100 tables
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries; currently, having a single row containing 300 text columns is massive, and is almost certainly past the 8,060-byte limit for storing the row on a single data page... so it is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also moving columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As for the overhead added by performing JOINs to get the normalized data: if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add unacceptable overhead, you can then selectively de-normalize the data as necessary.
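A minimal sketch of the lookup-table pattern described above, with invented names (the real table would have ~100 such columns; only one is shown):

    -- Reference table holding the handful of distinct text values.
    CREATE TABLE dbo.ShipmentStatus (
        ShipmentStatusId TINYINT     NOT NULL PRIMARY KEY,
        StatusName       VARCHAR(50) NOT NULL UNIQUE
    );

    -- The big table stores only the 1-byte key instead of the repeated text.
    CREATE TABLE dbo.BigFact (
        BigFactId        BIGINT  NOT NULL PRIMARY KEY,
        ShipmentStatusId TINYINT NOT NULL
            REFERENCES dbo.ShipmentStatus (ShipmentStatusId)
        -- ... the remaining columns, converted to proper types ...
    );

    -- Index the foreign key so filters on the status value are cheap.
    CREATE INDEX IX_BigFact_ShipmentStatusId
        ON dbo.BigFact (ShipmentStatusId);

Queries that filter on the text value then join the small reference table (which easily stays in memory) and seek on the integer index instead of comparing long strings.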
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill other types. You only have 300 columns but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that belong together... detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing), but be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system currently has to do a full table scan over very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. Converting the columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would also allow the data to be made correct. If querying on a column, you're currently comparing text against the text in the field. With just indexing this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory, and then used to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan can identify it and the query can be amended to reduce the likelihood of this.
Another point to note is that it sounds like, if the database were properly structured, it might be possible to cache it in memory, because the amount of data would be greatly reduced. If this is the case, then performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
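A rough sketch of that "quick and dirty" split, with invented names; the frequently filtered columns stay in the main table, and the wide, rarely-used ones move to a 1:1 companion table sharing the same key:

    -- Hot columns that drive filtering and joining.
    CREATE TABLE dbo.EventCore (
        EventId   BIGINT   NOT NULL PRIMARY KEY,
        EventDate DATETIME NOT NULL,
        VendorId  INT      NOT NULL
    );

    -- Wide, rarely-needed detail columns, keyed 1:1 to the core table.
    CREATE TABLE dbo.EventDetail (
        EventId     BIGINT       NOT NULL PRIMARY KEY
            REFERENCES dbo.EventCore (EventId),
        RawPayload  VARCHAR(MAX) NULL,
        FreeComment VARCHAR(MAX) NULL
    );

    -- Narrow the rows on the small table first, then look up the detail.
    SELECT c.EventId, d.RawPayload
    FROM dbo.EventCore AS c
    JOIN dbo.EventDetail AS d ON d.EventId = c.EventId
    WHERE c.EventDate >= '2013-01-01';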
Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

Query performance on record type vs flatten table in BigQuery

I have a table with "orders", and "order lines" that come as JSON, and it is simple to store them as JSON in BigQuery. I can run a process to flatten the file to rows, but it is a burden, and it makes the BigQuery table bigger.
What would be the best-performing structure for BigQuery, assuming I have queries that sum products and sales across the order lines?
And what is the best practice for the number of "records" (or "order lines") in a record column? Can it contain thousands, or is it intended for just a few? Assume I would query it the way I would query documents in a MongoDB document-based database.
This will help me plan the right architecture.
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, and in general it can return query results as fast as it would if those records were flattened. In fact, in some cases (depending on your data and the types of queries you are running), using already-nested records can actually allow you to avoid subqueries that tack on an extra step.
Short answer: don't worry about flattening; keep your data in the nested structure, and query performance will generally be the same either way.
However, as to your second question: your record limit will be determined by how much data you can store in a single record. Currently BigQuery's per-row maximum is 100 MB. You can have many, many repeated fields in a single record, but they need to fit into this limit.
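For example (invented schema, BigQuery standard SQL), an orders table whose order lines are a repeated RECORD column can be aggregated directly, without flattening it first:

    -- order_lines is a REPEATED RECORD with product_id, quantity, unit_price.
    SELECT
        line.product_id,
        SUM(line.quantity * line.unit_price) AS total_sales
    FROM my_dataset.orders AS o,
         UNNEST(o.order_lines) AS line
    GROUP BY line.product_id;

Because BigQuery is columnar, only the nested fields referenced in the query are read, regardless of how many line items each order carries (up to the per-row limit above).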

What is table partitioning?

In which cases should we use table partitioning?
An example may help.
We collected data on a daily basis from a set of 124 grocery stores. Each day's data was completely distinct from every other day's. We partitioned the data on the date. This allowed us to have faster searches, because Oracle can use partitioned indexes and quickly eliminate all of the non-relevant days.
This also allows for much easier backup operations because you can work in just the new partitions.
Also, after 5 years of data we needed to get rid of an entire day's data at a time. You can "drop" or eliminate an entire partition instead of deleting rows, so getting rid of old data was a snap.
So... They are good for large sets of data and very good for improving performance in some cases.
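A hedged sketch of that pattern in Oracle, with invented names and only two daily partitions shown:

    -- Range-partition the sales data by day.
    CREATE TABLE store_sales (
        sale_date DATE         NOT NULL,
        store_id  NUMBER(4)    NOT NULL,
        amount    NUMBER(10,2) NOT NULL
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION p20130101 VALUES LESS THAN (DATE '2013-01-02'),
        PARTITION p20130102 VALUES LESS THAN (DATE '2013-01-03')
    );

    -- A query on one day touches only that day's partition (partition pruning).
    SELECT SUM(amount)
    FROM store_sales
    WHERE sale_date = DATE '2013-01-01';

    -- Ageing out an old day is a metadata operation, not a mass DELETE.
    ALTER TABLE store_sales DROP PARTITION p20130101;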
Partitioning enables tables, indexes, or index-organized tables to be subdivided into smaller, manageable pieces; each of these small pieces is called a "partition".
For more info: Partitioning in Oracle. What? Why? When? Who? Where? How?
Use it when you want to break a table down into smaller tables (based on some logical breakdown) to improve performance. The user can then refer to the data by the single table name or address the individual partitions within it.
Table partitioning is a technique adopted by some database management systems to deal with large databases. Instead of a single table storage location, they split your table into several files for quicker queries.
If you have a table which will store large amounts of data (I mean REALLY large amounts, like millions of records), table partitioning will be a good option.
i) It is the subdivision of a single table into multiple segments, called partitions, each of which holds a subset of values.
ii) You should use it when you have read the documentation, run some test cases, fully understood the advantages and disadvantages of it, and found that it will be of benefit.
You are not going to get a complete answer on a forum. Go and read the documentation and come back when you have a manageable problem.