As the title states, is there any sort order for the data read using the read streams constructed with the Storage Read API? Is there any ordering with respect to partitions and clustering keys? As I understand it, partitions are colocated, and if clustering is used, the data in a partition is stored in clustered blocks.
For the 1st Question
The Storage Read API operates on storage directly. Thus you really can't make any assumptions about the order in which you will receive the data when using the Storage Read API.
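For illustration, here is a minimal sketch with the Python google-cloud-bigquery-storage client (the project, dataset, and table names are placeholders). Each stream returns an arbitrary shard of the table, so any ordering you need has to be applied after reading:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Placeholder project/dataset/table names.
session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",
        data_format=types.DataFormat.AVRO,
    ),
    max_stream_count=4,  # rows are sharded across parallel streams
)

# Neither the rows within a stream nor the streams themselves follow
# partition or clustering-key order.
for stream in session.streams:
    reader = client.read_rows(stream.name)  # AVRO decoding requires fastavro
    for row in reader.rows(session):
        pass  # process the row
```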
For the 2nd Question
In a clustered table, the data gets automatically organized whenever new data is added to the table or to a specific partition. From the partitioned table doc and the clustered table doc:
Partitioned table: A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
Cluster table: When you create a clustered table in BigQuery the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to collocate related data. When data is written to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. The order of clustered columns determines the sort order of the data. When new data is added to a table or a specific partition, BigQuery performs automatic re-clustering in the background to restore the sort property of the table or partition.
When you use CLUSTER BY with some columns, it gets applied to the whole table. If the table is a partitioned table, then it will be applied within each partition.
You can follow this codelab for a better understanding. From the lab:
Consider this stackoverflow.question_2018 table as an example. Let's assume it has 3 columns:
1. Creation_date
2. Title
3. Tags
If we create a new partitioned table from the main table with creation_date as the date partition, then per the partitioning logic it will have a partition for every creation date.
Now if we create the table with creation_date as the partition and apply CLUSTER BY on the tags column, then clustering will be applied within each of the partitions. Even when we add new data to this table, BigQuery will take care of reorganizing the data.
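Here is a minimal sketch of that setup with the Python google-cloud-bigquery client (the table ID is hypothetical; the schema mirrors the three columns above):

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my_project.my_dataset.questions_2018_clustered",  # hypothetical table ID
    schema=[
        bigquery.SchemaField("creation_date", "DATE"),
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("tags", "STRING"),
    ],
)
# One partition per creation date...
table.time_partitioning = bigquery.TimePartitioning(field="creation_date")
# ...and, within each partition, rows organized into blocks by tags.
table.clustering_fields = ["tags"]
client.create_table(table)
```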
Hope this helps you to understand.
Related
I'm moving to tables partitioned by a timestamp column whose value is in milliseconds. Now I want to generate clusters by hour, which will depend on the same timestamp column I used for partitioning.
I want to use the same column for partitioning and clustering, but I'm not sure if that works to generate hourly clusters.
I was planning on adding a new column which contains only the hour of every record, and then using this column to create my clustered table, but I want to better understand what will happen if I use the same timestamp column that I used for partitioning.
When specifying a TIMESTAMP column as the partition, the data is saved on disk by partition, which allows efficient access to each partition.
Now, BigQuery also allows you to define up to 4 columns which will be used as clustering fields.
If I get it correctly, the partition is like a PK and the cluster fields are like indexes.
So does this mean that the cluster fields have nothing to do with how records are saved on disk?
If I get it correctly the partition is like PK
This is not correct. A partition is not used to identify a row in the table; rather, it enables BigQuery to store each partition's data in a different segment, so when you scan a table by partition you ONLY scan the specified partitions and thus reduce your scanning cost.
cluster fields are like indexes
This is correct: cluster fields are used as pointers to records in the table and enable quick, minimal-cost access to data regardless of the partition. This means that using cluster fields you can query a table across partitions with minimal cost.
I like @Felipe's image from his Medium post, which gives a nice visualization of how data is stored.
Note: Partitioning happens at the time of the insert, while clustering happens as a background job performed by BigQuery.
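To tie this back to the question: one way to get hour-level locality is the asker's own plan of a derived hour column used as the clustering field. A minimal sketch with the Python client, where every name is a placeholder and the hour column would be populated at insert time (e.g. with EXTRACT(HOUR FROM event_ts)):

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my_project.my_dataset.events",  # placeholder table ID
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("event_hour", "INT64"),  # derived from event_ts
        bigquery.SchemaField("payload", "STRING"),
    ],
)
# Daily partitions from the timestamp column...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
# ...clustered by the derived hour within each partition.
table.clustering_fields = ["event_hour"]
client.create_table(table)
```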
I have a quite huge existing partitioned table in BigQuery. I want to make the table clustered, at least for the new partitions.
From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it is said that we are able to create a clustered table when loading data, and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].
The load finished successfully; however, it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query that particular partition it always scans all rows).
Is there a good way to make clustering field for an existing partitioned table?
Any comment, suggestion, or answer is well appreciated.
Thanks a lot,
Yosua
BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Reference
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
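For example, the tables.update / tables.patch route through the Python client could look like this (a minimal sketch; the table ID and column name are placeholders, and update_table issues a PATCH under the hood):

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # placeholder table ID

# Set the clustering spec and patch the table; newly written data
# (and background re-clustering) will follow the new spec.
table.clustering_fields = ["event_type"]
client.update_table(table, ["clustering_fields"])
```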
This answer is no longer valid / correct
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
You can only specify clustering columns when a table is created
So, obviously, you cannot expect an existing non-clustered table, and especially just its new partitions, to become clustered.
The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge
What I missed from the above answers was a real example, so here it goes:
bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table
Where tool, qualifier, and user_id are the three columns I want the table to be clustered by (in that order), and the table is my_dataset.my_table.
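To verify the change took effect, you can read the clustering spec back from the table metadata, e.g. with the Python client (a minimal sketch):

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_dataset.my_table")
print(table.clustering_fields)  # expected: ['tool', 'qualifier', 'user_id']
```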
When creating a partitioned table using bq mk --time_partitioning_type=DAY, are the partitions created based on the load time of the data, not a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on data load time, not based on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
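A minimal sketch of such a load with the Python client; the source URI is a placeholder, and WRITE_TRUNCATE replaces only the decorated partition, not the whole table:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # or supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_uri(
    "gs://my_bucket/data/20160810/*.csv",  # placeholder source URI
    "mydataset.mytable1$20160810",  # "$20160810" targets that one partition
    job_config=job_config,
)
job.result()
```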
Per my understanding, partitioning by column is something that we should expect to be supported at some point - but not now.
Good news: BigQuery currently supports 2 types of data partitioning, including partitioning by column. Please check here.
I like this feature: an individual operation can commit data into up to 2,000 distinct partitions.
I'm learning table partitioning.
When I read this page, it said that
The TransactionHistoryArchive table must have the same design schema as the TransactionHistory table. There must also be an empty partition to receive the new data. In this case, TransactionHistoryArchive is a partitioned table that consists of just two partitions.
And in the following picture, we can see that TransactionHistory has 12 partitions, but TransactionHistoryArchive has just 2 partitions.
Illustration http://i.msdn.microsoft.com/dynimg/IC38652.gif
How is that possible? Please help me understand it.
As long as two individual partitions have an identical schema and the same boundary values, you can switch them. They don't need to have the same partition scheme or function.
This is because SQL Server ensures that the binary data of those partitions on disk is compatible. That's the magic of partitioning and why you can move arbitrary amounts of data as a quick metadata-only operation.