How to query for data in streaming buffer ONLY in BigQuery? - google-bigquery

We have a table partitioned by day in BigQuery, which is updated by streaming inserts.
The doc says that: "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column"
But if I query select count(*) from table where _PARTITIONTIME is NULL, it always returns 0, even though bq show tells me that there are a lot of rows in the streaming buffer.
Does this mean that the pseudo-column is not present at all for rows in the streaming buffer? In any case, how can I query for the data ONLY in the streaming buffer without it becoming a full table scan?
Thanks in advance

Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
SELECT
fields
FROM
`dataset.partitioned_table_name`
WHERE
_PARTITIONTIME IS NULL
https://cloud.google.com/bigquery/docs/partitioned-tables#copying_to_partitioned_tables
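If the IS NULL filter returns 0 for you, an alternative worth trying is the special __UNPARTITIONED__ partition decorator, which (per the streaming docs) temporarily holds streamed rows until they are extracted to their partitions. A sketch, assuming legacy SQL syntax, since the decorator form below is a legacy SQL construct:
#legacySQL
SELECT COUNT(*)
FROM [dataset.partitioned_table_name$__UNPARTITIONED__]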

When you stream data to BQ there is usually a "warming-up" period: the time it takes for the streamed data to become available for operations such as querying, copying and exporting.
The doc states at the end that after a period of up to 90 minutes the pseudo-column _PARTITIONTIME receives a non-null value, which means your streamed data is fully ready for any operation you want to run on it (being able to run queries usually takes only a few seconds).
That means you don't query partitioned tables looking for rows where this field is NULL; instead, you query like so:
SELECT
fields
FROM
`dataset.partitioned_table_name`
WHERE
_PARTITIONTIME = TIMESTAMP('2017-01-20')
In this example, you would be querying only the data streamed into the Jan 20 partition (which avoids a full table scan).
You can also select a range of dates; just change the WHERE clause to:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-20') AND TIMESTAMP('2017-01-22')
Since BETWEEN is inclusive on both ends, this would query three daily partitions (Jan 20 through Jan 22) in your table.
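And if you also want the rows that are still in the streaming buffer (which, per the doc quoted earlier, carry a NULL _PARTITIONTIME), you can combine the two conditions; a sketch on the same table:
SELECT
fields
FROM
`dataset.partitioned_table_name`
WHERE
_PARTITIONTIME = TIMESTAMP('2017-01-20')
OR _PARTITIONTIME IS NULL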

Related

BigQuery: Data processed when running Select on Custom Partitioning Field

I have a table that is partitioned by day using a TIMESTAMP field my_partition_field from the schema (and not the ingestion-time _PARTITIONTIME).
When I execute the following query :
SELECT my_partition_field FROM MY_TABLE;
BigQuery tells me that "This query will process XX MB when run". The amount of data processed is the same as if the field was not the partitioning field.
However, if I have the same table partitioned by ingestion time and I run the following query :
SELECT _PARTITIONTIME FROM MY_TABLE_2;
BigQuery tells me that "This query will process 0 B when run."
Why is there a difference in the data processed (and billed :)) between these two cases?
When you create a partitioned table in BigQuery, your charges are based on how much data is stored in the partitions and on the queries you run against the data [1]. Many partitioned-table operations are free, and reading _PARTITIONTIME is one of them [2]: it is a pseudo-column served from table metadata rather than a stored column, so selecting it processes 0 bytes. my_partition_field, by contrast, is a regular column of your schema; even though it is the partitioning column, reading it scans the column's data like any other field. Apart from that, there is no difference in how the two tables are processed; only the partition semantics differ: in a time-unit column-partitioned table the partition is based on a TIMESTAMP, DATE or DATETIME column in the table, while ingestion-time tables are partitioned based on the timestamp at which BigQuery ingests the data.
[1] https://cloud.google.com/bigquery/docs/partitioned-tables#pricing
[2] https://cloud.google.com/bigquery/pricing#free
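As a practical consequence, on an ingestion-time partitioned table you can get per-partition row counts without scanning any stored column, since both COUNT(*) and _PARTITIONTIME are answered from metadata. A sketch on the second table from the question (the estimate should show 0 B, but verify against your own table):
SELECT
_PARTITIONTIME AS pt,
COUNT(*) AS rows_per_day
FROM
`MY_TABLE_2`
GROUP BY pt
ORDER BY pt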

Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing on the data and saves it into a new BigQuery table. This is a batch process performed on a weekly basis through a cron. Entries keep being added to the source table, so whenever I start the ETL process I want it to process only the new rows that have been added since the last time the ETL job was launched.
In order to achieve this, I have thought about querying my sink table for the most recent timestamp it contains. Then, as the data source, I will run another query against the source table, filtering for the entries with a timestamp greater than the one I have just retrieved. Both my source and sink tables are time-partitioned.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel it is not the most efficient way of querying it. Does this query take advantage of the partitioning of my table? Is there a better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even fewer bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()
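One caveat with the date(ts_field) = current_date() filter: if no rows have arrived today, the query returns NULL even though older rows exist. A slightly more defensive sketch trades a few more scanned partitions for robustness (the 7-day window is an arbitrary assumption about how stale the table can get):
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) >= date_sub(current_date(), INTERVAL 7 DAY)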

BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

We are trying to build (or better say, rebuild) our DWH in the cloud based on BigQuery. We decided to use tables partitioned by a date field (like a created_date field) for our raw data instead of ingestion-time partitions, because with this feature we can load data easily and then query it with GROUP BY on the partition date column, build datamarts, and so on. We supposed that this partition method would increase query speed and reduce cost (versus non-partitioned tables it does), BUT we've discovered that when you query the table with a WHERE on the partition field (like select count(*) from table where created_date=current_date), it costs money.
Queries on our old-style ingestion-time partitioned tables with a WHERE on _PARTITIONTIME were FREE! (like select count(*) from table where _PARTITIONTIME=current_date)
For example:
1) select value1 from table1 where _PARTITIONTIME = current_date
2) select value1 from table1 where created_date = current_date
3) select count(*) from table1 where _PARTITIONTIME = current_date
The second query costs more, because it scans 2 columns. It's logical, but not fair((( The 3rd query is absolutely free, btw!
This is a very sad situation, because there is NO WARNING about this 'side effect' in the documentation. This feature is designed to make DB developers' lives easier (I guess), and it is positioned as a best-practice feature, highly recommended by Google. But nobody said that it would also cost you additional money!
So the question is: can we somehow query date-field partitioned tables using the partition key for free? Is there any other pseudocolumn or method of filtering by the partition key available if you use date/timestamp field based partitioning?
(ps: you guys from Google should add some pseudocolumn for the date/timestamp partition method if it does not exist).
Thnx!
So the question is can we somehow query date-field partitioned tables using partition key for free?
The answer is No, querying the partition will not be free.
Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
If you want partitioning by date, this can only be achieved using ingestion-time partitioning with the _PARTITIONTIME pseudocolumn or using the values of a selected DATE/TIMESTAMP column. Currently there is no alternative option available. Keep in mind that one of the main goals of partitioning is to reduce the amount of data being scanned, mainly by reducing the number of rows that are scanned.
You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist
I understand that you would like to have a pseudocolumn for the column-partitioned method, but could you please elaborate a bit more, in your original post, on what values you would like to see in this pseudocolumn?
Edit: A feature request has been opened on your behalf. You can follow it here
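In the meantime, one partial workaround, if what you need for free is per-partition row counts rather than arbitrary queries: partition metadata can be read without scanning the table itself. A sketch, assuming the INFORMATION_SCHEMA.PARTITIONS view is available in your region and that table1 lives in a dataset named mydataset (both names are placeholders):
SELECT partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table1'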

Need suggestion on splitting table in bigquery based on non-date column along with date partition

We have a date-partitioned table with 5 years of data (with daily incremental loads), running into millions and millions of records. To improve performance, we are thinking of splitting the table based on a non-date field (id), since all queries will include a WHERE clause on that column (id), and also partitioning each of the split tables by date, so that we can query a smaller dataset with a date range. We will not be using wildcard tables, since we will know the id; we plan to append it to the table name and run the query against that specific table. We need to know whether that would be a good option to pursue to improve performance and reduce query cost.
[Update]: We went ahead and split the tables based on the id column (tablename_id), made each table date-partitioned, and clustered it on 4 other columns (the max supported) which are commonly used in queries. With that we were able to get better performance and also reduce the data accessed by each query. Based on our testing it looks like a good option to pursue, as long as wildcard querying of tables is avoided, at least until BigQuery supports partitioning based on non-date/non-datetime columns.
We split the tables based on the id column, creating multiple tables. Each of the split tables is partitioned on the date column. Apart from that, we made them clustered tables on 4 other columns as needed; a minimal DDL sketch of this shape follows below. We measured performance on a sample dataset where the old table (UserInfo) has more than 500,000 rows. The stats we captured compare, for a given date range and id, the old (non-split/combined) table and the split table (split based on the ID) in terms of the amount of data processed and the time taken for the same query.
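For reference, a minimal DDL sketch of the shape described above (table, partition and clustering column names are placeholders, not our real schema):
CREATE TABLE `myproject.mydataset.userinfo_123`
PARTITION BY DATE(created_date)
CLUSTER BY col_a, col_b, col_c, col_d
AS
SELECT *
FROM `myproject.mydataset.userinfo`
WHERE id = 123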
This is not possible. BigQuery doesn't support partitioning on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.

BigQuery Cluster field usage/value not clear

I created a table with a clustered field, but I don't see any savings or any performance improvement. This is what I have done:
I created a destination table with 3 columns: projectId, tableId and schema
using this SQL:
SELECT projectId, tableId, schema
FROM `project.dataset.tables`
WHERE _partitionTime >= '2018-12-27 00:00:00'
Partition Field: Default partitionTime
Cluster Field: projectId, tableId
The original cost of this SQL is $2.82.
Now, when selecting from the new table, I expect:
To get lower cost
To get better performance
I'm using this SQL
SELECT * FROM `project.table.testCluster`
WHERE projectId = 'xxx' and tableId = 'yyy'
AND _PARTITIONTIME >= TIMESTAMP("2018-12-30") LIMIT 1000
From my benchmark and from the BigQuery console execution report, I see neither.
Any ideas why?
BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks. When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query.
This allows BigQuery to only scan the relevant blocks — a process referred to as block pruning.
One small catch here: BigQuery provides an estimate of how much data each query will process before running it. Without clustering, that estimate is exact. With clustering, the estimate is an upper bound, and the query may end up processing less, or the same amount. It depends on the structure of the clustered column: the more unique values there are in the clustered column, the lower the optimization.
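Given that, two things may be distorting your benchmark. First, SELECT * reads every column, and on-demand pricing is based on the columns actually read, so the cluster filter is not the only thing driving cost; second, for a clustered table the up-front estimate will not reflect block pruning, so the number to compare is the bytes billed reported after the job completes. A hedged rewrite of your test query that selects only the three columns the destination table actually has:
SELECT projectId, tableId, schema
FROM `project.table.testCluster`
WHERE projectId = 'xxx' AND tableId = 'yyy'
AND _PARTITIONTIME >= TIMESTAMP("2018-12-30")
LIMIT 1000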