As far as I understand, Hive keeps track of the schema for all partitions, handling schema evolution.
Is there any way to get the schema for a particular partition? For example, I want to compare the schema of some old partition with the latest one.
The SHOW TABLE EXTENDED command does give you a bunch of information about the partition's columns and their types, so you could probably use that.
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTable/PartitionExtended
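For example (database, table, and partition names here are hypothetical), a single partition can be inspected like this:

SHOW TABLE EXTENDED IN mydb LIKE 'sales' PARTITION(dt='2020-01-01');

The output should include a columns: entry listing each column name and type for that partition, which you can diff against the same entry for the latest partition.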
I'm trying to understand the rationale behind the partition-specific schema managed for Hive/Glue tables. I couldn't find any documentation specifically talking about this, but during my search I found a couple of Hive JIRAs (attached in the references) which hint at its purpose. From what I gathered, the partition schema is a snapshot of the table schema at the time the partition is registered, and it allows Hive to support schema evolution without invalidating existing table partitions and the underlying data. It also enables Hive to support different partition-level and table-level file formats, giving clients more flexibility.
The exact purpose is still not clear to me, so requesting the experts to comment on the following set of questions:
What is the rationale behind maintaining partition specific schema?
How does Hive/Glue behave in case there is a discrepancy between the partition and table schemas? Do the resolution criteria consider, or depend on, the underlying data file format?
What are the repercussions of not maintaining partition specific schema in table metadata?
Experimentation and observations:
I ran an experiment on my end in which I tested a few count, count-with-partition-filter, and schema-description queries against a Glue table without an explicit schema definition in the partition properties (the underlying data files are written in Parquet), using Spark shell, Hive CLI, and Athena. The results retrieved were consistent with the ones computed from the original table.
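The queries were roughly of this shape (table and partition column names are made up):

SELECT COUNT(*) FROM my_glue_table;
SELECT COUNT(*) FROM my_glue_table WHERE dt = '2021-01-01';
DESCRIBE my_glue_table;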
References:
https://issues.apache.org/jira/browse/HIVE-6131
https://issues.apache.org/jira/browse/HIVE-6835
https://issues.apache.org/jira/browse/HIVE-8839
Thanks!
(I am guessing, based on "How do I query the streaming buffer in BigQuery if the _PARTITIONTIME field isn't available with Standard SQL", that my question has no simple solution, so I will "enhance" it.)
I stream my data into a BigQuery partitioned and clustered table, partitioned on a timestamp field (not ingestion-time partitioning).
I want a view that always looks at the last hour of data: what is already in the table, plus what is still in the buffer.
Since this table is not ingestion-time partitioned, there is no _PARTITIONTIME/_PARTITIONDATE pseudo column, so I can't use it to get the buffer data.
The only way I've found is by using legacy SQL: SELECT * FROM [dataset.streaming_data$__UNPARTITIONED__]
This is not good enough for me, since even if I save this as a view, I can't refer to a legacy SQL view from a standard SQL query.
Any idea how I can achieve this?
Another idea I am thinking of: BigQuery can query an external data source (using EXTERNAL_QUERY) with standard SQL.
A solution might be some "temporary" table on a separate database (such as PostgreSQL on Cloud SQL) which would only hold one hour of data and wouldn't have BigQuery's buffer mechanism.
I think this is a bad solution, but I guess it might work...
What do you think?
Thanks to @Felipe Hoffa I just found out I need to do nothing :-)
Buffered data is already available to any SQL query, as long as the WHERE clause includes it...
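For example, a minimal sketch of such a view, assuming the table's timestamp column is called event_ts (a hypothetical name):

CREATE OR REPLACE VIEW `mydataset.streaming_data_last_hour` AS
SELECT *
FROM `mydataset.streaming_data`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

Rows still sitting in the streaming buffer are returned as well, since the filter is on the data itself rather than on a partition pseudo column.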
I've looked at previous questions, but the GCP links given there were outdated, so I would like to learn the best way to do the conversion while inserting into the correct partition (meaning not the day I inserted the records, but according to the "date" column).
Could someone point me in the right direction, specifically for Legacy SQL?
From the docs: "Currently, legacy SQL is not supported for querying partitioned tables or for writing query results to partitioned tables".
So, in this case, because legacy SQL can't write to partitioned tables, which seems to be a major blocker with no workaround, you would have to use Standard SQL or Dataflow, as detailed in the answers to the question Graham provided.
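As a rough Standard SQL sketch (project, dataset, and table names are hypothetical, and the "date" column is assumed to be of type DATE), you can create the partitioned table and backfill it in one statement; each row then lands in the partition matching its date value rather than the day you ran the load:

CREATE TABLE `myproject.mydataset.events_partitioned`
PARTITION BY `date`
AS
SELECT * FROM `myproject.mydataset.events_unpartitioned`;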
I am attempting to fix the schema of a BigQuery table in which the type of a field is wrong (but the field contains no data). I would like to copy the data from the old schema to the new one using the UI (SELECT * EXCEPT(bad_column) FROM ...).
The problem is that:
if I SELECT into a table, BigQuery drops the REQUIRED mode from the output columns and therefore rejects the insert.
Exporting via JSON loses information on dates.
Is there a better solution than creating a new table with all columns being nullable/repeated or manually transforming all of the data?
Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that they contain no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
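As a sketch with made-up names, one way to do the append is with DML against a pre-created destination table whose schema already has the REQUIRED fields:

-- mydataset.fixed_table was created beforehand with the corrected schema
INSERT INTO `mydataset.fixed_table`
SELECT * EXCEPT(bad_column)
FROM `mydataset.old_table`;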
Outdated:
You have several options:
Change your field modes to NULLABLE. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use EXCEPT, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
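For example (dataset and table names are illustrative), results for a single day can be appended to that day's partition by using the partition decorator as the query destination:

bq query --use_legacy_sql=false \
  --destination_table='mydataset.mytable$20180601' \
  --append_table \
  'SELECT * FROM `mydataset.source_table` WHERE event_date = "2018-06-01"'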
BigQuery now supports table clones. A table clone is a lightweight, writable copy of another table.
Copy tables from query in Bigquery
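A minimal DDL sketch (dataset and table names are placeholders):

CREATE TABLE `mydataset.table_copy`
CLONE `mydataset.original_table`;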
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual: it doesn't get inserted when it occurs, but eventually, when it's provided to the server. At times this can be days or even months after the event. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each data batch into its respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
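A minimal sketch with the bq CLI (bucket, file, and dataset names are made up; yourtable is assumed to already exist as a day-partitioned table):

# load a batch whose rows belong to 2016-07-18 directly into that partition
bq load --source_format=NEWLINE_DELIMITED_JSON \
  'mydataset.yourtable$20160718' \
  gs://mybucket/batch_2016_07_18.json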