When I try to use dynamic table partitions in a query in the BigQuery web UI (as documented, e.g., here), i.e.
SELECT * FROM [dataset.table$0-of-3]
I get the following error:
Error: Cannot read partition information from a table that is not partitioned: project:dataset.table$0-of-3
When I try a table that was partitioned with the new date partitioning (bq mk --time_partitioning_type=DAY ...), I do not get an error but instead:
Query returned zero records.
Also, I can't find the documentation on this feature anymore. Has it been deprecated?
I don't have enough reputation to comment on Mikhail's answer -- so adding an answer here.
At least for now, the dynamic table partitions described in the book were deprecated in favor of table partitioning as described in the latest BigQuery documentation.
We hope to provide richer flavors of partitioning in the future, but they may not necessarily be available as table decorators.
This ($0-of-3) feature was never implemented - hopefully it will be at some point.
The ONLY partitioning decorator that was recently implemented is for date-partitioned tables. See more at Partitioned Tables and timePartitioning.type.
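For reference, the date-partition decorator works like this in legacy SQL (the table name and date are illustrative; `$YYYYMMDD` selects a single day's partition):

```sql
-- Query only the 2017-01-01 partition of a day-partitioned table
-- (table name is hypothetical).
SELECT * FROM [mydataset.mytable$20170101]
```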
Related
I have a table with around 800k rows (which I didn't think was a lot). It is created from a series of other tables. I am then joining this table with another table of about 5M rows (using the Python client), but it appears to be taking forever. In the NoSQL and SQL world I would create an index. In BQ, is the equivalent a partition, or can I create an index?
I'm using python and the following to create a table
query = """
CREATE OR REPLACE TABLE `{table_name}` AS
WITH get_all_affiliate AS (
""".format(table_name=table_name)
and
query += """
) SELECT * FROM get_all_affiliate
"""
and then
response = client.query(query).result()
How can I easily CAST and also perform some indexing/partition on one field that is a string, but can be recast as an Integer?
As @Samuel mentioned in the comments, partitioning can be used to optimize a query in BigQuery. However, it does not help much when two tables need to be joined, since the JOIN still has to read and combine rows from both tables, which works against the pruning that partitioning provides. For more information, you may refer to this documentation.
You can use the following to cast a string column to an integer:
CAST(string_column_A AS INT64) AS temporary_column_A
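Putting this together, a minimal sketch (all table and column names below are hypothetical) that casts the string column and partitions and clusters the new table on it could look like:

```sql
-- Hypothetical names. Casts string_column_A to INT64 and uses it for
-- integer-range partitioning plus clustering, so joins and filters on
-- that column can prune data.
CREATE OR REPLACE TABLE `mydataset.my_table_partitioned`
PARTITION BY RANGE_BUCKET(column_a_int, GENERATE_ARRAY(0, 10000000, 10000))
CLUSTER BY column_a_int AS
SELECT
  CAST(string_column_A AS INT64) AS column_a_int,
  * EXCEPT (string_column_A)
FROM `mydataset.my_source_table`;
```

The `GENERATE_ARRAY(start, end, interval)` bounds are assumptions you would adjust to the actual range of your values.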
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.
As far as I understand Hive keeps track of schema for all partitions, handling schema evolution.
Is there any way to get schema for particular partition? For example, if I want to compare schema for some old partition with the latest one.
The SHOW TABLE EXTENDED command does give you a bunch of information about the partition columns and their types; you could probably use that.
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTable/PartitionExtended
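For example (table and partition names here are hypothetical, assuming a table partitioned by dt):

```sql
-- Show metadata for one specific partition:
SHOW TABLE EXTENDED LIKE 'sales' PARTITION(dt='2020-01-01');

-- DESCRIBE on a specific partition prints that partition's schema,
-- which you can diff against a newer partition's output:
DESCRIBE FORMATTED sales PARTITION(dt='2020-01-01');
DESCRIBE FORMATTED sales PARTITION(dt='2020-06-01');
```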
(I am guessing based on How do I query the streaming buffer in BigQuery if the _PARTITIONTIME field isn't available with Standard SQL that my question has no simple solution, so I will "enhance" it)
I stream my data into a BigQuery partitioned and clustered table using a timestamp field (not an ingestion-time partition).
I want a view that always looks at the last hour of data: what is already in the table plus what is still in the streaming buffer.
Since this is not an ingestion-time partitioned table, there is no _PARTITIONTIME/_PARTITIONDATE pseudo column, so I can't use it to get the buffer data.
The only way I've found is by using legacy SQL: SELECT * FROM [dataset.streaming_data$__UNPARTITIONED__]
This is not good enough for me, since even if I save this as a view, I can't refer to a legacy SQL view from a standard SQL query.
Any idea how I can achieve this ?
Another idea I am thinking of: BigQuery can query an external data source (using EXTERNAL_QUERY) with standard SQL.
A solution might be some "temporary" table in a separate database (such as PostgreSQL on Cloud SQL) that holds only the last hour of data and has no streaming-buffer mechanism.
I think this is a bad solution, but I guess it might work...
What do you think ?
Thanks to @Felipe Hoffa I just found out I need to do nothing :-)
Buffered data is already available to any SQL query, as long as the WHERE clause on the partitioning column includes it...
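In other words, a standard-SQL view filtering on the timestamp column (names here are hypothetical) will also return rows still in the streaming buffer:

```sql
-- Assumes `ts` is the TIMESTAMP column the table is partitioned on
-- (table and column names are illustrative). Rows still in the
-- streaming buffer match the filter too.
CREATE OR REPLACE VIEW `mydataset.last_hour` AS
SELECT *
FROM `mydataset.streaming_data`
WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```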
I've created a partitioned and clustered BigQuery table for the time period of the year 2019, up to today. I can't seem to find if it is possible to update such a table (since I would need to add data for each new day). Is it possible to do it and if so, then how?
I've tried searching stackoverflow and BigQuery documentation for the answer. No results there on my part.
You could use the UPDATE statement to update this data. Your partitioned table will maintain its properties across all operations that modify it, such as DML and DDL statements, load jobs, and copy jobs. For more information, you could check this document.
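For the day-to-day case of adding each new day's data, a plain INSERT (or a load job) is enough: BigQuery routes each row to the partition matching its partitioning column. A sketch with hypothetical table and column names:

```sql
-- Append today's rows; each row lands in the partition matching its
-- event_date value, not the day the query runs (names are illustrative).
INSERT INTO `mydataset.events` (event_date, user_id, value)
SELECT event_date, user_id, value
FROM `mydataset.staging_events`
WHERE event_date = CURRENT_DATE();
```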
Hope it helps.
I've looked at previous questions, but the links given to GCP were outdated, so I would like to learn the best way to do the conversion while inserting into the correct partition (meaning not the day I inserted the records, but according to the "date" column).
Could someone point me in the right direction, specifically for Legacy SQL.
From the docs: "Currently, legacy SQL is not supported for querying partitioned tables or for writing query results to partitioned tables".
So, in this case, because legacy SQL can't write to partitioned tables, which seems to be a major blocker with no workaround, you would have to use standard SQL or Dataflow, as detailed in the answers to the question provided by Graham.
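For instance, in standard SQL the rows land in the partition matching their own "date" value rather than the load date (a sketch with hypothetical names):

```sql
-- Standard SQL: each row goes into the partition for its own `date`
-- column, regardless of when the query runs (names are illustrative).
INSERT INTO `mydataset.partitioned_table` (date, payload)
SELECT CAST(date AS DATE) AS date, payload
FROM `mydataset.legacy_source`;
```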