BigQuery Clustered Tables: How to Create Multiple Clusters - google-bigquery

My BigQuery table is commonly queried with different combinations of "where" conditions across 1 or more common columns, say across columns A, B, C (not in order). Hence, I would like to add individual clusters for columns A, B, and C respectively.
How can I create multiple clusters for BigQuery tables? (Similar to how multiple indexes can be created on a traditional rdbms table)

Multiple clustering is allowed (but it is hierarchical,you cluster by specific field and then it is subclustered on the following, etc).
At the same time, clustering is only allowed for partitioned tables.
You can find the corresponding documentation here

Upon viewing some comments and pages it appears that there are no ways to have multiple independent clusters (vs how multiple indexes can be created on a traditional rdbms) on a single bigquery table.
This is because clusters pretty much just sort the data blocks of that table as per docs:
When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query that contains a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Hence, it appears that there is no way of applying multiple sorting logic for each independent cluster on the same set of data, and so what I require appears to be impossible as of now.

Related

What are the intended uses of a columnar database?

I have a question about columnar databases such as Sybase. I understand columnar databases are very fast when your operation is restricted to single coulmns and doesn’t go across columns, i.e no row based filtering?
But most queries are a combination of both, filter some rows, then aggregate some columns.
So really, where do columnar databases shine?
Columnar databases can definitely access data across different columns. By storing columns separately, they offer a few advantages not available in row-based storage:
They only need to read the columns accessed in a particular query.
They make it easy to add new columns to a table.
They allow an individual column to be compressed, using a compression algorithm optimal for that column.
They sometimes provide built-in indexing for each column.
All of these can be used to speed SELECT queries.
One big issue with columnar databases is inserting a new row (or deleting a row), because this requires touching all the columns. That makes ensuring ACID properties . . . trickier.
There are definitely some SELECT queries that may not perform as well in a columnar database as in other databases. But they do surprisingly well at increasing the performance of many queries.

How to use time partitioned tables with template tables and beyond 4000 limit for BigQuery?

For streaming inserts, I want to use a template table (with user id suffix) which is itself a Partitioned table. This way I can make my tables smaller than just using Partitioned Tables and hence make my queries more cost-effective. Also my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:-
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding my above proposed architecture is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case or will the last partition keep storing any subsequent streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.

Bigtable design and querying with respect to number of column families

From Cloud Bigtable schema design documentation:
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
In my use case, I can group all the data into one single column family (currently the access pattern is retrieve all fields), or group them to say 3 logical column families and specify these column families all the time while querying. Is there any performance difference between these two designs? Which design is recommended?
In your case, there isn't a performance advantage either way. I would use the 3 logical column families so that you have cleaner code.

Sort field in hive

I have table about 20-25 million records, I have to put in another table based on some condition and also sorted. Example
Create table X AS
select * from Y
where item <> 'ABC'
Order By id;
I know that Order by Uses single reducer to guarantee total order in output.
I need optimize way to do sorting for above query.
SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support clustered indexes (which essentially define partitions) and sorting within the partitions. The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.

Why does Redshift not need materialized views or indexes?

In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This intern allows RedShift to identify which blocks to read from the other columns
as of dec 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific or a small group of values. On the contrary, OLAP systems retrieve a large set of values and performs aggregation on the large set of values. Indexes would not be a right fit for OLAP systems. Instead it uses a secondary structure called zone maps with sort keys.
The indexes operate on B trees. The 'life without a btree' section in the below blog explains with examples how an index based out of btree affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression codings, data distribution, compression, query compilations, optimization etc. provides the power to Redshift to be faster.
Implementing the above factors, reduces IO operations on Redshift and eventually providing better performance. To implement an efficient solution, it requires a great deal of knowledge on the above sections and as well as the on the queries that you would run on Amazon Redshift.
for eg.
Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys.
If your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate).
If you select orderid as your sort key but if your queries are based on shipdate, Redshift will be operating efficiently.
If you have a composite sortkey on (orderid, shipdate) and if your query only on ship date, Redshift will not be operating efficiently.
If you have an interleaved soft key on (orderid, shipdate) and if your query
Redshift does not support materialized views but it easily allows you to create (temporary/permant) tables by running select queries on existing tables. It eventually duplicates data but at the required format to be executed for queries (similar to materialized view) The below blog gives your some information on the above approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
Redshift does fare well with other systems like Hive, Impala, Spark, BQ etc. during one of our recent benchmark frameworks
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes are "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And, materialization incurs a lot of overhead for maintaining the data in a high transaction environment. If you don't have a high transaction environment, then you can increment temporary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name