Make existing bigquery table clustered - google-bigquery

I have quite a large existing partitioned table in BigQuery. I want to make the table clustered, at least for new partitions.
The documentation (https://cloud.google.com/bigquery/docs/creating-clustered-tables) says you can create a clustered table when you load data, so I tried loading a new partition with clustering fields set: job_config.clustering_fields = ["event_type"].
The load finished successfully, but the new partition does not seem to be clustered (I am not sure how to check whether it is clustered or not, but queries against that particular partition always scan all rows).
Is there a good way to add clustering fields to an existing partitioned table?
Any comment, suggestion, or answer is well appreciated.
Thanks a lot,
Yosua

BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Reference
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
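As a concrete illustration of the API route, here is a minimal sketch of the partial Table resource you would send to the tables.patch method to change the clustering columns. The helper name and the `event_type` column are placeholders, not part of any official client:

```python
# Build the partial Table resource that tables.patch expects when
# changing a table's clustering specification; only the fields being
# modified need to appear in a PATCH body.
def clustering_patch_body(fields):
    return {"clustering": {"fields": list(fields)}}

# Sent as: PATCH .../projects/{project}/datasets/{dataset}/tables/{table}
body = clustering_patch_body(["event_type"])
```

With the Python client library the same change is typically an assignment to `table.clustering_fields` followed by `client.update_table(table, ["clustering_fields"])`.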

This answer is no longer valid / correct
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
You can only specify clustering columns when a table is created
So you cannot expect an existing non-clustered table, let alone just its new partitions, to become clustered.
The "workaround" is to create a new, properly partitioned and clustered table and load data into it from Google Cloud Storage (GCS). You can first export the data from the original table into GCS, so the whole process is free of charge.
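The export-and-reload workaround can be sketched as two bq invocations, built here as strings. The bucket, table, and field names are placeholders, and the exact flags should be double-checked against the bq command-line reference:

```python
# Sketch of the workaround: export the original table to GCS, then
# load it into a table that is partitioned and clustered from the start.
# All names below are illustrative placeholders.
def workaround_commands(src_table, dst_table, bucket, partition_field, cluster_fields):
    gcs_path = f"gs://{bucket}/export-*.avro"
    return [
        # exporting table data to GCS is free of charge
        f"bq extract --destination_format=AVRO {src_table} {gcs_path}",
        # loading from GCS is also free; clustering is applied at load time
        f"bq load --source_format=AVRO "
        f"--time_partitioning_field={partition_field} "
        f"--clustering_fields={','.join(cluster_fields)} "
        f"{dst_table} {gcs_path}",
    ]

cmds = workaround_commands("mydataset.old_table", "mydataset.new_table",
                           "my-bucket", "event_date", ["event_type"])
```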

What I missed from the above answers was a real example, so here it goes:
bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table
Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Related

Create new index on specific partitions postgresql 14

I am using postgresql 14.
I have a table that is range-partitioned by day. The table's retention is rather small: it holds 14 days' worth of data, and partitions older than 14 days are dropped.
I would like to introduce a new index, and I was wondering whether it is possible to create the index only on new partitions and not on old ones, so I can avoid indexing the existing data on the older partitions, since that data will be deleted anyway.
My question: is this worth doing? If so, do I have to create the index at the table level once every partition in the table has the new index?
If not, would the best way be to create the index concurrently?
This is currently just an idea; I do not have much experience with such operations on partitioned tables.
Yes, you can create the index only on new partitions. When you create a new partition, just create an index on it, either manually or via a script. Once all existing partitions have the index, you can create the index on the parent table very quickly, as that only records some metadata and doesn't build any indexes. From that point on, new partitions automatically get the index created on them.
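The flow above can be sketched by generating the DDL for each daily partition followed by the parent-level statement. The table name, column, and partition naming scheme are hypothetical:

```python
from datetime import date, timedelta

# Generate CREATE INDEX statements for a run of daily partitions, then
# the parent-level CREATE INDEX. In PostgreSQL, creating the index on
# the parent attaches matching per-partition indexes instead of
# rebuilding them, which is why that final step is fast.
def partition_index_ddl(parent, column, start, days):
    stmts = []
    for i in range(days):
        d = start + timedelta(days=i)
        partition = f"{parent}_{d:%Y%m%d}"  # assumed naming convention
        stmts.append(f"CREATE INDEX ON {partition} ({column});")
    # run last, once every partition already has a matching index
    stmts.append(f"CREATE INDEX ON {parent} ({column});")
    return stmts

ddl = partition_index_ddl("events", "user_id", date(2023, 1, 1), 3)
```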
Whether it is worth doing, I don't think anyone can answer that for you. If each partition would take most of a day to index, maybe that would be worth avoiding. But in that case, can you even tolerate having the index in the first place?

Sort order of BigQueryStorage Read API

As the title states, is there any sort order for the data read via the read streams constructed with the Storage Read API? In particular, is there any ordering with respect to partitions and clustering keys? As I understand it, partitions are colocated, and if clustering is used, the data within a partition is stored in clustered blocks.
For the first question:
The Storage Read API operates directly on storage, so you cannot make any assumptions about the order in which you will receive the data.
For the second question:
In a clustered table, the data is automatically organized whenever new data is added to the table or to a specific partition. From the partitioned table doc and the clustered table doc:
Partition table: A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
Cluster table: When you create a clustered table in BigQuery the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to collocate related data. When data is written to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. The order of clustered columns determines the sort order of the data. When new data is added to a table or a specific partition, BigQuery performs automatic re-clustering in the background to restore the sort property of the table or partition.
When you cluster by some columns, the clustering is applied to the whole table. If the table is partitioned, it is applied within each partition.
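A toy model can make the block organization concrete. This is an illustration of the pruning idea, not BigQuery's actual storage format: rows sorted on the clustering key are grouped into blocks that record min/max key metadata, and a point filter only needs to open blocks whose range can contain the value.

```python
# Toy model of clustered-block pruning (not BigQuery internals).
def build_blocks(rows, key, block_size=3):
    """Sort rows on the clustering key and split them into blocks,
    keeping min/max key metadata for each block."""
    rows = sorted(rows, key=key)
    blocks = []
    for i in range(0, len(rows), block_size):
        chunk = rows[i:i + block_size]
        blocks.append({"min": key(chunk[0]), "max": key(chunk[-1]), "rows": chunk})
    return blocks

def scan(blocks, key, value):
    """Return matching rows and the number of blocks actually opened."""
    hits, opened = [], 0
    for b in blocks:
        if b["min"] <= value <= b["max"]:  # metadata-only pruning check
            opened += 1
            hits.extend(r for r in b["rows"] if key(r) == value)
    return hits, opened

events = [("click", 1), ("click", 2), ("purchase", 3),
          ("view", 4), ("view", 5), ("view", 6)]
blocks = build_blocks(events, key=lambda r: r[0], block_size=2)
hits, opened = scan(blocks, lambda r: r[0], "purchase")
# only 1 of the 3 blocks has to be opened for this filter
```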
You can follow this codelab for a better understanding. From the lab:
Consider the stackoverflow.question_2018 table as an example, and assume it has three columns:
1. Creation_date  2. Title  3. Tags
If we create a new table from the main table partitioned by creation_date, then per the partitioning logic it will have one partition for every creation date.
If we then also apply cluster by on the tags column, clustering will be applied within each of the partitions. Even as we add new data to this table, BigQuery will take care of reorganizing the data.
Hope this helps you to understand.

how to tell if a bigquery table is partially sorted

We are clustering some of our BigQuery tables and would like to know when we need to do some table maintenance, based on how sorted the clustered column is. Is there any way of telling how degraded the clustered sort is? The online documentation suggests that we may need to re-cluster over time.
Are you deleting or updating data?
Are you streaming data in?
My rules of thumb:
Add data to a clustered table once a day, when the day is over.
For updates and deletes use MERGE; as the docs say, it will force a re-clustering.
BigQuery clustered tables are in beta, and I would expect BigQuery to make smarter decisions by itself as the feature grows into maturity.

SQL Server Table Partitioning, what is happening behind the scenes?

I'm working with table partitioning on an extremely large fact table in a warehouse. I have executed the script a few different ways, with and without non-clustered indexes. With the indexes it appears to dramatically expand the log file, while without the non-clustered indexes it doesn't expand the log file as much but takes more time to run due to the rebuilding of the indexes.
What I am looking for is any links or information as to what is happening behind the scene specifically to the log file when you split a table partition.
I think it isn't too hard to theorize what is going on (to a certain extent). Behind the scenes each partition is given a different HoBT, which in plain language means each partition is in effect sitting in its own hidden table.
So theorizing the splitting of a partition (assuming data is moving) would involve:
inserting the data into the new table
removing data from the old table
The non-clustered index can be figured out, but the theorizing will change depending on whether there is a clustered index or not. It also matters whether the index is partition-aligned or not.
Given a bit more information on the table (clustered or heap), we could theorize this further.
If the partition function is used by a partitioned table and SPLIT results in partitions where both will contain data, SQL Server will move the data to the new partition. This data movement will cause transaction log growth due to inserts and deletes.
This is from an article by Microsoft on Partitioned Table and Index Strategies
So it looks like it is doing a delete from the old partition and an insert into the new partition, which could explain the growth in the transaction log.
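The delete-plus-insert behavior can be illustrated with a toy simulation. This is not SQL Server internals, just the accounting the quote describes: every row that moves across the new boundary is logged once as a delete and once as an insert.

```python
# Toy illustration of why SPLIT grows the transaction log when both
# resulting partitions end up holding data. Rows at or above the new
# boundary move to the new partition; each moved row costs one logged
# delete plus one logged insert.
def split_partition(rows, boundary):
    """Return (staying_rows, moved_rows, logged_record_count)."""
    staying = [r for r in rows if r < boundary]
    moved = [r for r in rows if r >= boundary]
    logged = 2 * len(moved)  # one delete + one insert per moved row
    return staying, moved, logged

staying, moved, logged = split_partition([1, 5, 10, 15, 20], 10)
# three rows move, so six log records are generated
```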

Is a CLUSTER INDEX desirable when loading a sorted loadfile into a new table?

INFORMIX-SE:
My users periodically run an SQL script [REORG.SQL] which unloads all rows from a table in sorted order to two separate files (actives and inactives), drops the table, re-creates the table, loads the sorted loadfiles back into it, creates a cluster index on the same column I sorted my unload files by, creates other supporting indexes and updates its statistics.
(See REORG.SQL script at: SE: 'bcheck -y' anomaly)
(Also see: customer.pk_name joining transactions.fk_name vs. customer.pk_id [serial] joining transactions.fk_id [integer] for reason why cluster index is by name and not pk_id[serial]=fk_id[int])
With my REORG.SQL script, I've been having index file consistency problems so I suspected the CLUSTER INDEX had something to do with it and created the index with no clustering and the problems went away!
Now my question is: If I manage to load all my transaction data, sorted by the customers full name into a newly created table, is it really necessary for me to create a CLUSTER INDEX when in fact the rows are already sorted in the same order that the clustering would accomplish?.. I know that a clustered index starts loosing its clustering as new rows are added, so what's the advantage of creating a cluster index?.. does the query optimizer take advantage of clustering vs. a non-clustered index when the rows are essentially in the same clustered order?.. Has anyone encountered IDX/DAT file problems when clustering a table?.. Perhaps my SQL script has something wrong with it? (PLEASE REVIEW MY SQL SCRIPT CODE TO SEE IF I'm DOING SOMETHING WRONG?)
The script unloads the active and inactive transactions to two different files, with each file sorted by customer name. It then loads them back into the table, active transactions first, followed by inactive transactions. A clustered index is then created on customer name. The problem is that the database now has to go back and re-order the physical rows by customer name when building the clustered index: although each of the unload files is separately ordered by customer name, when the two are put together the result is not ordered by customer name, causing more work for the database.
Unless the separate files for active and inactive transactions are needed elsewhere, you might try just dumping all the transactions to a single file, ordered by customer name, and then re-loading the table from that single file. At that point the data in the table would already be ordered by customer name, and the clustered index creation wouldn't have to do the re-ordering of the data.
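Since the two unload files are each already sorted by customer name, the single ordered file can be produced with a streaming merge rather than a full re-sort. A small sketch, where the (name, data) tuples stand in for unloaded rows:

```python
import heapq

# Merge two individually name-sorted unload files into one file that
# is sorted by name overall, so the subsequent load preserves the
# clustered order. Row contents here are illustrative stand-ins.
def merge_unloads(active_rows, inactive_rows, key=lambda r: r[0]):
    return list(heapq.merge(active_rows, inactive_rows, key=key))

active = [("Adams", "A1"), ("Young", "A2")]
inactive = [("Baker", "I1"), ("Zimmer", "I2")]
merged = merge_unloads(active, inactive)
```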
As to whether the clustered index is really needed: a clustered index can be of value if you query by that column, as it should help reduce the number of I/Os needed to fetch the data. Usually clustered indexes are created on columns that increase monotonically, so perhaps TRX_NUM would serve well as the column for the clustered index.
Share and enjoy.