My data can’t be date partitioned, how do I use clustering? - google-bigquery

Currently I'm using the following query:
SELECT
  ID,
  Key
FROM
  mydataset.mytable
WHERE ID = 100077113
  AND Key = '06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 distinct values
If I know the Key, the search for the ID could be done over only ~10,000 rows, which would be much faster and process much less data.
How can I use the new clustering capabilities in BigQuery to effectively partition on the field Key?

(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b

You can have one field of type DATE with a NULL value, so you will be able to partition by that field, and since the table is then partitioned you will be able to enjoy clustering.

You need to recreate your table with an additional DATE column, with all rows having NULL values, and set the partitioning to that date column. This way your table is partitioned.
Once that's done, you add clustering based on the columns you identified in your query. Clustering will improve processing time, and query costs will be reduced.
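A minimal sketch of what the two answers above describe, using the names from the question (mytable_clustered and fake_date are placeholder names introduced here):
-- recreate the table with an always-NULL DATE column, partition on it, cluster by Key
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY Key AS
SELECT
  *,
  CAST(NULL AS DATE) AS fake_date
FROM
  `mydataset.mytable`;
Every row lands in the NULL partition, but the table now qualifies for clustering, so the WHERE Key = '06019' filter from the question should scan far less data.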

You can now partition a table on an integer column, so this might be a good solution; remember, though, that there is a limit of 4,000 partitions per table. Since you have ~10,000 keys, I suggest creating a sort of group_key that bundles IDs together (sketched below), or perhaps you have another integer column you can leverage with a cardinality under 4,000.
Recently BigQuery introduced support for clustering tables even when they are not partitioned, so you can simply cluster on your integer field and not use partitioning at all. That solution will not be the most effective for data-scan optimisation, though.
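A hedged sketch of the group_key idea, assuming Key can be cast to INT64 (it looks numeric in the question); group_key, the table name, and the 4,000-bucket choice are illustrative, not part of the answer:
-- fold the ~10,000 keys into fewer than 4,000 integer-range partitions
CREATE TABLE `mydataset.mytable_grouped`
PARTITION BY RANGE_BUCKET(group_key, GENERATE_ARRAY(0, 4000, 1))
CLUSTER BY Key AS
SELECT
  *,
  MOD(CAST(Key AS INT64), 4000) AS group_key
FROM
  `mydataset.mytable`;

-- queries must repeat the group_key filter so the matching partition can be pruned
SELECT ID, Key
FROM `mydataset.mytable_grouped`
WHERE group_key = MOD(CAST('06019' AS INT64), 4000)
  AND Key = '06019'
  AND ID = 100077113;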

Related

BigQuery clustering : hit clustering with multiple keys

We have a table which is partitioned on time (512 MB per partition), and it also has a cluster key on customer_id and time.
Up to now we had these queries, which are working well:
SELECT column FROM TABLE WHERE customer_id = 'key' and time > '2021-11-10'
SELECT column FROM TABLE WHERE customer_id IN ('key1', 'key2') and time > '2021-11-10'
Today we are trying these queries:
SELECT column FROM TABLE WHERE customer_id IN (SELECT customer_id FROM customers) AND time > '2021-11-10'
We see that this query does not use clustering, resulting in a lot more data being read out of BigQuery. I then found this article, explaining that complex filtering does not work with clustering: https://cloud.google.com/bigquery/docs/querying-clustered-tables#do_not_use_clustered_columns_in_complex_filter_expressions
Is there a solution to define a list of IDs outside the query and inject it into the query? (Right now we need to generate the list of IDs in code.)
Thanks in advance,
Regards
Concluding the discussion from the comments:
Partitioning and clustering are used to improve query performance and control the costs of querying heavy data.
Partitioning segments the data into partitions, and clustering organises the data based on specified columns, i.e. the clustered columns.
As mentioned in this documentation, in order for clustering to work efficiently, your table/partition should be approximately 1 GB or greater.
In your case, if you're trying clustering on 512 MB of data, there won't be any significant difference in query performance. You should prefer clustering over partitioning if partitioning results in a small amount of data per partition (approximately less than 1 GB).
Refer to this documentation for more information.
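As for injecting the list of IDs from outside the query: one workaround sometimes used (my own sketch, not from the answer above, so verify with a dry run that it actually prunes) is to materialize the IDs into the query text as literals with BigQuery scripting and EXECUTE IMMEDIATE, so the filter on the clustered column becomes a simple IN list rather than a subquery. TABLE and column are the placeholder names from the question:
-- build a literal IN list, then run the real query with it
DECLARE id_list STRING;
SET id_list = (
  SELECT STRING_AGG(FORMAT("'%s'", customer_id), ', ')
  FROM customers
);
EXECUTE IMMEDIATE FORMAT("""
  SELECT column
  FROM TABLE
  WHERE customer_id IN (%s) AND time > '2021-11-10'
""", id_list);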

BigQuery partitioned table per integer of size n

I have a BigQuery table with userId column.
userId is external id from mysql (auto increment field).
The most common query will be to filter by this id:
SELECT * FROM tableName WHERE userId=123
I want to partition the table by userId for better performance and pricing, but Google requires me to specify an end number, and I don't have an end; the end is n. Today it's 1,000, tomorrow it can be 4,000,000.
What is the technique to achieve that?
BigQuery has a limit of 4,000 partitions per table. Use clustering instead. Alternatively, you can partition by MOD(userId, 4000), as sketched below.
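A hedged sketch of that MOD approach with integer-range partitioning (userId_bucket and the table names are placeholders introduced here):
-- every userId maps to one of 4,000 stable buckets, no matter how large userId grows
CREATE TABLE `mydataset.tableName_partitioned`
PARTITION BY RANGE_BUCKET(userId_bucket, GENERATE_ARRAY(0, 4000, 1))
CLUSTER BY userId AS
SELECT
  *,
  MOD(userId, 4000) AS userId_bucket
FROM
  `mydataset.tableName`;

-- the common lookup then needs the bucket filter too, so the partition can be pruned
SELECT *
FROM `mydataset.tableName_partitioned`
WHERE userId_bucket = MOD(123, 4000)
  AND userId = 123;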

Mariadb Scans all partitions in timestamp column

I have a table Partitioned by:
HASH(timestamp DIV 43200 )
When I perform this query
SELECT max(id)
FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807
it scans all partitions, even though both numbers 1581708508 and 1581708807, and all the numbers in between, fall in the same partition. How can I make it scan only that partition?
You have discovered one of the reasons why PARTITION BY HASH is useless.
In your situation, the Optimizer sees the range (BETWEEN) and says "punt, I'll just scan all the partitions".
That is, "partition pruning" does not work when the WHERE clause involves a range and you are using PARTITION BY HASH. PARTITION BY RANGE, on the other hand, may be able to prune. But... What's the advantage? It does not make the query any faster.
I have found only four uses for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint . It sounds like your application does not fit any of those cases.
That particular query would best be done without partitioning. Instead have a non-partitioned table with this 'composite' index:
INDEX(timestamp, id)
It must scan all the rows in the timestamp range to discover the MAX(id), but with this index it is scanning only the 2-column index and not touching any rows outside the timestamp range.
Hence it will be as fast as possible. Even if PARTITION BY HASH were smart enough to do the desired pruning, it would not run any faster.
In particular, when you ask for a range on the partition key, such as with WHERE timestamp BETWEEN 1581708508 AND 1581708807, the execution looks in all partitions for the desired rows. This is one of the major failings of HASH. Even if it could realize that only one partition is needed, it would be no faster than simply using the index I suggest.
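A minimal sketch of that suggestion (the index name is illustrative):
-- add the covering composite index on the non-partitioned table
ALTER TABLE messages ADD INDEX idx_ts_id (timestamp, id);

EXPLAIN
SELECT MAX(id) FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807;
-- expect a range scan on idx_ts_id with Extra: Using where; Using index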
You can determine that individual partition by using modular arithmetic:
MOD(<expression used as the argument of the HASH function>, <number of partitions>)
assuming you have 2 partitions
CREATE TABLE messages(ID int, timestamp int)
PARTITION BY HASH( timestamp DIV 43200 )
PARTITIONS 2;
look up partition names by
SELECT CONCAT( 'p',MOD(timestamp DIV 43200,2)) AS partition_name, timestamp
FROM messages;
and determine the related partition name for the value 1581708508 of the timestamp column (assume it is p1). Then use
SELECT MAX(id)
FROM messages PARTITION(p1)
to get all the records in partition p1 only, without needing a WHERE condition such as
WHERE timestamp BETWEEN 1581708508 AND 1581708807
By the way, all partitions can be listed through
SELECT *
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_name='messages'

Google BigQuery clustered table not reducing query size when running query with WHERE clause on clustered field

I have a Google BigQuery table of 500,000 rows that I have set up to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that has over 500 million rows).
I have a duplicate of the table that is not partitioned and not clustered.
I run the following query on both tables:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "email"
There are only 2,400 rows where EventCategory is "email". When I run the query on the non-clustered table and then on the clustered table, the reported bytes processed are essentially the same (result screenshots omitted).
Here is the schema of both the non-clustered and the clustered table:
Date TIMESTAMP NULLABLE
UserId STRING NULLABLE
EventCategory STRING NULLABLE
EventAction STRING NULLABLE
EventLabel STRING NULLABLE
EventValue STRING NULLABLE
There is basically no difference between the two queries in how much data they look through, and I can't figure out why. I have confirmed that the clustered table is partitioned and clustered: the BigQuery UI table details say so, and running a query that filters by Date greatly reduces the amount of data searched and shows a much smaller estimated query size.
Any help here would be greatly appreciated!
UPDATE:
If I change the query to:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "ad"
I get the following result (screenshot omitted):
There are 53,640 rows where EventCategory is "ad", and it looks like clustering did result in less table data being scanned, albeit not much less (529.2 MB compared to 586 MB).
So it looks like clustering is working but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL and even saving the table data to a JSON in GCS and then importing it into a new partitioned and clustered table but it hasn't changed anything.
Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then groups by cluster within those date groups? If so, I think that would probably explain it but it would render clustering not very useful.
If you have less than 100MB of data per day, clustering won't do much for you - you'll probably get one <=100MB cluster of data for each day.
You haven't mentioned how many days of data you have (# of partitions, as Mikhail asked), but since the total data scanned is 500MB, I'll guess that you have at least 5 days of data, and less than 100MB per day.
Hence the results you are getting seem to be the expected results.
See an example of this at work here:
How can I improve the amount of data queried with a partitioned+clustered table?
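To check the per-day volume guess, something like this should work (INFORMATION_SCHEMA.PARTITIONS is a standard BigQuery view; the dataset and table names below are placeholders):
-- list each partition's row count and size to see whether any day exceeds ~100 MB
SELECT partition_id, total_rows, total_logical_bytes
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table_name'
ORDER BY partition_id;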
The reason clustering wasn't helping very much was specific to the table data. The table held event-based data that was partitioned by day and then clustered by EventCategory (data is clustered within each day's partition). Since every day had a large number of rows for each EventCategory type, querying the entire table for a specific EventCategory still had to search every single partition, and almost every partition was bound to contain some data with that EventCategory, meaning almost every cluster had to be searched too.
The data are partitioned by day, and inside each partition they are clustered.
Clustering works best when you load whole partitions (days) at once, or when you export a partition (day) to Google Cloud Storage (which should be free) and import it again into another table. When we tried loading something like 4 GB of JSON files this way, the difference was something like 60/10.
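A hedged SQL-only alternative to the export/import step (mydataset.events is a placeholder; the columns follow the schema in the question): rewriting the table in one statement re-sorts the rows within each partition by the cluster key:
-- CTAS rewrite: each day's partition comes out freshly sorted by EventCategory
CREATE OR REPLACE TABLE `mydataset.events_reclustered`
PARTITION BY DATE(Date)
CLUSTER BY EventCategory AS
SELECT * FROM `mydataset.events`;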

Return First Row For Given Value in a Column - BigQuery

I have a very large table with a column that holds a custom ID of string type for each row; this ID is guaranteed to be unique in the table. For each ID, there are 50 properties in that table.
My main task is to get those 50 properties in the row for a given ID.
When I run a normal query like the one below, it takes 5 sec to scan only 1 million rows.
SELECT * FROM `mytable` WHERE id='123'
As per my understanding, BigQuery does a parallel search for a match after partitioning the rows into different clusters, and I believe that for a given ID value it will check all the rows in all the clusters, so that even after a match is found in one cluster, the other clusters keep looking for further matches.
But since the values in the ID column are unique, can we somehow "break" the jobs running on the other clusters as soon as a match is found in one cluster, and return the row?
I hope this would speed up the query run time.
Also, this table will grow really large in the future, so if this can be done it will be really helpful for my purpose.
Any suggestions are welcome.
You can use the recently introduced Clustered Tables.
This will allow you to bring down cost and improve performance.
Please note: currently clustering is supported for partitioned tables only - but support for clustering non-partitioned tables is under development
If your table is partitioned, you can just cluster it by id and you are done.
If not, you can introduce a 'fake' date field and partition by it, so that clustering becomes available for that table.
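A sketch of that fake-date workaround, mirroring the accepted answer at the top of this page (mydataset.mytable and fake_date are placeholders):
-- fake date for partitioning, clustering on the unique string id
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY id AS
SELECT *, DATE('1980-01-01') AS fake_date
FROM `mydataset.mytable`;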
In the meantime, if you are just interested in one row for a given id, try the query below:
SELECT * FROM mytable WHERE id='123' LIMIT 1