I know that the pricing for scanning data in BigQuery queries is determined by which columns are accessed.
How does this play out with nested and repeated data?
Is all the data for each record scanned? Or does it depend on which nodes are scanned? Is each column in the nested data treated as a separate column?
Thanks for any clarifications on this.
This is a great question - short answer is yes, BigQuery only processes the nested nodes you specify. A useful characteristic of BigQuery's columnar data storage is that each nested field can be stored as a separate column. If you don't explicitly SELECT them, the additional child fields in your nested record are not processed.
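To make the idea concrete, here is a minimal Python sketch of a toy columnar layout. It is not BigQuery's actual storage format; the function names, the sample records, and the use of string length as a stand-in for bytes are all assumptions made for illustration. It only shows why reading one nested leaf can be far cheaper than reading the whole record.

```python
from typing import Any


def flatten_to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Store every leaf field, nested or repeated, under its own dotted path."""
    columns: dict[str, list[Any]] = {}

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            # Nested record: descend, extending the dotted path.
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            # Repeated field: every element lands in the same column.
            for item in node:
                walk(item, path)
        else:
            columns.setdefault(path, []).append(node)

    for record in records:
        walk(record, "")
    return columns


def bytes_scanned(columns: dict[str, list[Any]], selected: list[str]) -> int:
    """Approximate bytes read when only the selected columns are touched.

    String length is a crude stand-in for storage size (an assumption).
    """
    return sum(len(str(value)) for path in selected for value in columns[path])


# Hypothetical records; "notes" is deliberately large.
records = [
    {"id": 1,
     "customer": {"name": "Ann", "notes": "x" * 1000},
     "children": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"id": 2,
     "customer": {"name": "Bob", "notes": "y" * 1000},
     "children": [{"sku": "C3", "qty": 5}]},
]

columns = flatten_to_columns(records)
only_qty = bytes_scanned(columns, ["children.qty"])
everything = bytes_scanned(columns, list(columns))
print(only_qty, everything)
```

In this toy model, selecting only `children.qty` never touches the large `customer.notes` column, which mirrors the behaviour described in the answer.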
From Cloud Bigtable schema design documentation:
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
In my use case, I can either group all the data into a single column family (the current access pattern retrieves all fields), or group it into, say, three logical column families and always specify those column families when querying. Is there any performance difference between these two designs? Which design is recommended?
In your case, there isn't a performance advantage either way. I would use the 3 logical column families so that you have cleaner code.
We need to upload data from our logs to Google BigQuery and we have two subsets of the log data that will not overlap when queried.
Subset number one has a field "vendor_id" which will be used a lot in WHERE clauses.
Subset number two consists of the log entries that do not have "vendor_id".
We could make a single table with a nullable "vendor_id" field, or make two different tables, one for each subset. Is there any difference in the performance of these approaches?
Regards
Leo
There will be little (if any) difference in query performance between the two options you mention. That said, the cost of queries is proportional to the amount of data read, so if you have two separate tables it will likely be less expensive, since each query will read a smaller amount of data.
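As a rough illustration of the cost argument, the following Python sketch compares the bytes read for one query against a combined table and against a split table. The row counts and bytes-per-row figure are hypothetical, and the model assumes that a WHERE filter does not reduce the bytes read from the referenced columns (that is, a plain full-column scan).

```python
def bytes_read(row_count: int, bytes_per_row: int) -> int:
    """Bytes read for a full scan of the referenced columns over every row."""
    return row_count * bytes_per_row


# Hypothetical sizes, chosen only for illustration.
ROWS_WITH_VENDOR = 6_000_000
ROWS_WITHOUT_VENDOR = 4_000_000
BYTES_PER_ROW = 58  # assumed combined width of the columns the query references

# One table: the query scans the referenced columns for every log entry,
# including the entries whose vendor_id is NULL.
single_table = bytes_read(ROWS_WITH_VENDOR + ROWS_WITHOUT_VENDOR, BYTES_PER_ROW)

# Two tables: the query only scans the table that holds vendor_id rows.
split_table = bytes_read(ROWS_WITH_VENDOR, BYTES_PER_ROW)

print(single_table, split_table)
```

Under these assumptions the split layout reads fewer bytes simply because the queried table has fewer rows.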
I have a table with "orders", and "order lines" that come as JSON, and it is simple to store it as JSON in BigQuery. I can run a process to flatten the file to rows, but it is a burden, and it makes the BigQuery table bigger.
What structure would give the best performance in BigQuery, assuming my queries sum products and sales in order lines?
And what is the best practice for the number of "records" (or "order lines") in a record column? Can it contain thousands, or is it meant for only a few? Assume I would query it as I would a document-based database such as MongoDB.
This will help me plan the right architecture.
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, and in general it can return query results as fast as it would if those records were flattened. In fact, in some cases (depending on your data and the types of queries you are running), using already-nested records can let you avoid subqueries that tack on an extra step.
Short answer: Don't worry about flattening, keep your data in the nested structure, the query performance will generally be the same either way.
However, as to your second question: Your record limit will be determined by how much data you can store in a single record. Currently BigQuery's per row maximum is 100MB. You can have many, many repeated fields in a single record, but they need to fit into this limit.
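One way to sanity-check whether an order with many lines stays under the limit is to estimate its serialized size before loading. The sketch below is an assumption-laden approximation: it uses the UTF-8 JSON length as a proxy for row size (BigQuery's own accounting may differ), interprets 100MB as 100 * 1024 * 1024 bytes, and uses hypothetical field names.

```python
import json

# Per-row maximum quoted in the answer; the exact byte count is an assumption.
MAX_ROW_BYTES = 100 * 1024 * 1024


def build_order(order_id: int, line_count: int) -> dict:
    """Build a hypothetical order with a repeated 'order_lines' field."""
    return {
        "order_id": order_id,
        "order_lines": [
            {"product": f"sku-{i}", "quantity": 1, "price": 9.99}
            for i in range(line_count)
        ],
    }


def estimated_row_bytes(order: dict) -> int:
    """Approximate the row size as the UTF-8 length of its JSON encoding."""
    return len(json.dumps(order).encode("utf-8"))


def fits_in_row(order: dict) -> bool:
    """True if the estimated size stays within the assumed per-row limit."""
    return estimated_row_bytes(order) <= MAX_ROW_BYTES


order = build_order(order_id=1, line_count=5000)
print(estimated_row_bytes(order), fits_in_row(order))
```

With small order lines like these, thousands of repeated entries remain far below the limit; the limit only starts to matter when individual lines are large or the count is very high.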
This question relates to the performance of querying data in BigQuery.
Are there any particular table or column settings that would affect query performance? Or are all columns in a table effectively treated equally by BigQuery, so that neither the order of the columns nor any definition applied to a column has a noticeable impact on data fetching?
Thanks!
All columns are treated equally by BigQuery. The order doesn't matter at all. There aren't any settings that affect performance, but if you import your table in a very large number of small pieces (e.g. more than a few thousand) it can cause some slowdown.
In which cases should we use table partitioning?
An example may help.
We collected data on a daily basis from a set of 124 grocery stores. Each day's data was completely distinct from every other day's. We partitioned the data on the date. This allowed us to have faster searches because Oracle can use partitioned indexes and quickly eliminate all of the non-relevant days.
This also allows for much easier backup operations because you can work in just the new partitions.
Also, after 5 years of data we needed to get rid of an entire day's data. You can "drop" or eliminate an entire partition at a time instead of deleting rows. So getting rid of old data was a snap.
So... They are good for large sets of data and very good for improving performance in some cases.
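The daily-partition example above can be sketched in a few lines of Python. This is only a toy model of the idea (one bucket of rows per day), not how Oracle implements partitions; the class name, method names, and sample dates are hypothetical.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy table that keeps one list of rows per partition key (a date)."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)

    def insert(self, day: date, row: dict) -> None:
        self.partitions[day].append(row)

    def query_day(self, day: date) -> tuple[list[dict], int]:
        """Return (rows, rows_examined), reading only the matching partition."""
        rows = list(self.partitions.get(day, []))
        return rows, len(rows)

    def total_rows(self) -> int:
        return sum(len(rows) for rows in self.partitions.values())

    def drop_partition(self, day: date) -> int:
        """Discard a whole day at once; returns the number of rows removed."""
        return len(self.partitions.pop(day, []))


# Three hypothetical days of data from 124 stores.
table = PartitionedTable()
for day_number in range(1, 4):
    for store_id in range(124):
        table.insert(date(2020, 1, day_number), {"store_id": store_id, "sales": 100.0})

rows, examined = table.query_day(date(2020, 1, 2))
print(examined, table.total_rows())

# Removing an old day is a single operation rather than a row-by-row delete.
dropped = table.drop_partition(date(2020, 1, 1))
print(dropped, table.total_rows())
```

A query for one day examines only that day's rows and ignores the other partitions, and dropping a day removes its bucket in one step.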
Partitioning enables tables, indexes, and index-organized tables to be subdivided into smaller, more manageable pieces. Each of these pieces is called a "partition".
For more info: Partitioning in Oracle. What? Why? When? Who? Where? How?
Use it when you want to break a table down into smaller tables (based on some logical breakdown) to improve performance. The user can then refer either to the table as a whole by one name or to the individual partitions within it.
Table partitioning is a technique adopted by some database management systems to deal with large databases. Instead of keeping a table in a single storage location, they split it across several files for quicker queries.
If you have a table which will store large amounts of data (I mean REALLY large amounts, like millions of records), table partitioning will be a good option.
i) It is the subdivision of a single table into multiple segments, called partitions, each of which holds a subset of values.
ii) You should use it when you have read the documentation, run some test cases, fully understood the advantages and disadvantages of it, and found that it will be of benefit.
You are not going to get a complete answer on a forum. Go and read the documentation and come back when you have a manageable problem.