Return First Row For Given Value in a Column - BigQuery

I have a very large table with a column that holds a custom ID (string type) for each row. Each ID has 50 properties in that table, and the ID is guaranteed to be unique in the table.
My main task is to get those 50 properties in the row for a given ID.
When I run a normal query like the one below, it takes 5 sec to scan only 1 million rows.
SELECT * FROM `mytable` WHERE id='123'
As per my understanding, BigQuery partitions the rows into different clusters and searches them in parallel for matches, and for a given ID value it will check all the rows in all clusters. So even after a match is found in one cluster, the other clusters will keep searching for further matches.
But since the values in the ID column are unique here, can we somehow "break" the jobs running on the other clusters as soon as a match is found in one cluster, and return the row?
I hope this will speed up the query run time.
Also, this table will grow really large in the future, so if this can be done it would be very helpful for my purposes.
Any suggestions are welcome.

You can use the recently introduced Clustered Tables
This will allow you to bring down cost and improve performance
Please note: currently clustering is supported for partitioned tables only - but support for clustering non-partitioned tables is under development
If your table is partitioned you can just cluster it by id - and you are done
If not - you can introduce a 'fake' date field and partition by it, so that clustering becomes available for that table
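For example, a minimal sketch of that workaround using CREATE TABLE ... AS SELECT - the dataset/table names and the fake_date column name below are just placeholders for your own names:
CREATE TABLE mydataset.mytable_clustered
PARTITION BY fake_date
CLUSTER BY id
AS
SELECT t.*, DATE '2000-01-01' AS fake_date
FROM mydataset.mytable t;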
Meanwhile, if you are just interested in one row for a given id - try the query below
SELECT * FROM mytable WHERE id='123' LIMIT 1

Related

BigQuery clustering : hit clustering with multiple keys

We have a table which is partitioned on time (512MB per partition) and also has a cluster key on customer_id and time.
Up to now we had these queries, which are working well:
SELECT column FROM TABLE WHERE customer_id = 'key' and time > '2021-11-10'
SELECT column FROM TABLE WHERE customer_id IN ('key1', 'key2') and time > '2021-11-10'
Today we are trying these queries:
SELECT column FROM TABLE WHERE customer_id IN (SELECT customer_id FROM customers) AND time > '2021-11-10'
We see that this query does not use the clustering, resulting in a lot more data being read from BigQuery. Then I found this article, explaining that complex filtering does not work with clustering: https://cloud.google.com/bigquery/docs/querying-clustered-tables#do_not_use_clustered_columns_in_complex_filter_expressions
Is there a solution to define a list of IDs outside the query and inject it into the query? (Right now we have to generate the list of IDs in code.)
Thx in Advance,
Regards
Concluding the discussion from the comments:
Partitioning and Clustering are used to improve query performance and control the cost of querying large amounts of data.
Partitioning segments the data into partitions, and clustering organises the data based on specified columns, i.e. the clustered columns.
As mentioned in this documentation, in order for clustering to work efficiently, your table/partition should be approximately 1GB or greater.
In your case, trying clustering on 512MB of data won't make any significant difference in query performance. You should prefer clustering over partitioning if partitioning would result in a small amount of data per partition (approximately less than 1GB).
Refer to this documentation for more information.
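For the original ask of injecting the list of IDs as literals, one possible sketch uses BigQuery scripting to build the IN list as a string and run the query dynamically. The TABLE, column, customer_id and customers names are taken from the question, and whether the injected literal list actually triggers cluster pruning on your data should be verified:
DECLARE id_list STRING;
SET id_list = (
  SELECT STRING_AGG(FORMAT("'%s'", customer_id), ', ')
  FROM customers
);
EXECUTE IMMEDIATE FORMAT("""
  SELECT column FROM TABLE
  WHERE customer_id IN (%s) AND time > '2021-11-10'
""", id_list);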

Need suggestion on splitting table in bigquery based on non-date column along with date partition

We have a date-partitioned table with 5 years of data (daily incremental loads), running into millions and millions of records. To improve performance, we are thinking of splitting the table based on a non-date field (id), as all queries will include a WHERE clause on that column (id). Each split table would also be date-partitioned so that we can query a smaller dataset with a date range. We will not be using wildcard tables: since we will know the id, we plan to append it to the table name and run the query against that specific table. We need to know whether that would be a good option to pursue to improve performance and reduce query cost.
[Update]: We went ahead and split the tables based on the id column (tablename_id), made each table date-partitioned, and clustered it on 4 other columns (the maximum supported) that are commonly used in queries. With that we were able to get better performance and also reduced the data accessed for each query. Based on our testing it looks like a good option to pursue, as long as wildcard querying of the tables is avoided, at least until BigQuery supports partitioning based on non-date/non-datetime columns.
We split the tables based on the id column, creating multiple tables. Each of the split tables is partitioned on the date column. Apart from that, each is a clustered table on 4 other columns as needed. Below is the performance on a sample dataset. The old table (UserInfo) has more than 500,000 rows. The stats we captured compare, for a given date range and id, the old (non-split/combined) table and the split table (split based on the ID) in terms of the amount of data processed and the time taken for the same query.
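For reference, a rough sketch of how one of the split tables could be created with DDL - the table names, the id value, the event_date column (assumed to be a DATE column), and the four clustering columns are placeholders:
CREATE TABLE mydataset.userinfo_123
PARTITION BY event_date
CLUSTER BY col_a, col_b, col_c, col_d
AS
SELECT *
FROM mydataset.userinfo
WHERE id = '123';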
This is not possible. BigQuery doesn't support partitioning on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.

Google BigQuery clustered table not reducing query size when running query with WHERE clause on clustered field

I have a Google BigQuery table of 500,000 rows that I have set up to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that is over 500 million rows).
I have a duplicate of the table that is not partitioned and not clustered.
I run the following query on both tables:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "email"
There are only 2400 rows where EventCategory is "email". When I run the query on the non-clustered table I get the following:
When I run the query on the clustered table I get the following:
Here is the schema of both the non clustered and the clustered table:
Date TIMESTAMP NULLABLE
UserId STRING NULLABLE
EventCategory STRING NULLABLE
EventAction STRING NULLABLE
EventLabel STRING NULLABLE
EventValue STRING NULLABLE
There is basically no difference between the two queries in how much data they scan, and I can't figure out why. I have confirmed that the clustered table is partitioned and clustered: the table details in the BigQuery UI say so, and running a query that filters by Date greatly reduces the size of the data searched and shows a much smaller estimated query size.
Any help here would be greatly appreciated!
UPDATE:
If I change the query to:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "ad"
I get the following result:
There are 53640 rows where EventCategory is "ad", and it looks like clustering did result in less table data being scanned, albeit not much less (529.2MB compared to 586MB).
So it looks like clustering is working but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL and even saving the table data to a JSON in GCS and then importing it into a new partitioned and clustered table but it hasn't changed anything.
Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then groups by cluster within those date groups? If so, I think that would probably explain it but it would render clustering not very useful.
If you have less than 100MB of data per day, clustering won't do much for you - you'll probably get one <=100MB cluster of data for each day.
You haven't mentioned how many days of data you have (# of partitions, as Mikhail asked), but since the total data scanned is 500MB, I'll guess that you have at least 5 days of data, and less than 100MB per day.
Hence the results you are getting seem to be the expected results.
See an example of this at work here:
How can I improve the amount of data queried with a partitioned+clustered table?
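In practice, that means queries against a table like this benefit mainly when the Date partition filter is combined with the clustered EventCategory filter, for example (the date range below is purely illustrative):
SELECT *
FROM `table_name`
WHERE Date >= TIMESTAMP "2021-01-01"
  AND Date < TIMESTAMP "2021-01-08"
  AND EventCategory = "email"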
The reason clustering wasn't helping very much was specific to the table data. The table holds event-based data that is partitioned by day and then clustered by EventCategory (data is clustered within each day's partition). Since every day has a large number of rows for each EventCategory type, querying the entire table for a specific EventCategory still has to search every single partition; almost every partition will have some data with that EventCategory, so almost every cluster has to be searched too.
The data are partitioned by day, and within each partition they are clustered.
Clustering works best when you load whole partitions (days) at once, or export a partition (day) to Google Cloud Storage (which should be free) and import it again into another table. When we tried loading something like 4GB of JSON in one go, the difference was something like 60/10.
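As a rough illustration of that export/re-import approach with the bq command-line tool - the bucket, dataset, and table names are placeholders, and the exact flags (and the use of partition decorators with a clustered destination table) should be checked against the bq documentation:
bq extract --destination_format=NEWLINE_DELIMITED_JSON 'mydataset.events$20211110' gs://my-bucket/events_20211110_*.json
bq load --source_format=NEWLINE_DELIMITED_JSON 'mydataset.events_reclustered$20211110' gs://my-bucket/events_20211110_*.json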

how to speed up a clustered index scan while selecting all fields on range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know SELECT * is an evil performance killer.
I have a situation where I need to select a range of rows, or all the rows, with all the columns included.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. Checking the execution plan, it still uses a clustered index scan.
Obviously 2 seconds may not be that bad; however, when this table has to be joined with other tables in a batch, execution takes over 15 minutes (there are no duplicate records in the final result, as the counts match). The join criteria are pretty simple and yield no duplication.
Excluding this table alone, the batch execution completes in under a second.
Is there a way to optimize this, given that I will have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance: one SELECT specifying all columns and one using *. There is no evidence (nor should there be) of any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using; they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server, etc.) or to a different method.
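For instance, a quick way to check that assumption on your own instance - a sketch using the Books table from the question; list the remaining columns explicitly in the second query:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT * FROM Books;
SELECT BookId, Name, /* ...the remaining columns... */ PublishedYear FROM Books;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;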

Checking a large set of columns for null-ness

On my current project (a redesign), I'm tasked with checking whether or not a series of soon-to-be-deleted columns have data, so we can decide if and how we should migrate them into new and improved tables / columns. This task is - per se - not the problem, merely the background.
The problem is, there are about 30 columns to check, out of a total of 150. The table is fairly large, so I fear that a chained select * from table where x is not null or y is not null or ... will be a bit slow.
Is there a better, or more elegant way to check multiple columns for null-ness?
Am I better advised to just check the columns independently, or in smaller groups, and not bother with an optimal solution?
It's just one table. It will get read record by record (a full table scan) and the criteria checked. This is not slow. No sorting, no joining, no sub-selects or intermediate results. This can't be slow. Don't worry.
BTW: shouldn't that be select * from table where x is not null OR y is not null ...?
You want to find all records that contain data in any of the columns, right?
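One compact way to do that in a single pass is to count the non-null values per column, since COUNT(column) ignores NULLs - the column and table names below are placeholders for the ~30 columns being checked:
SELECT
  COUNT(*) AS total_rows,
  COUNT(x) AS x_non_null,
  COUNT(y) AS y_non_null,
  COUNT(z) AS z_non_null
FROM my_table;
Any column whose count comes back as 0 has no data to migrate.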