How to Add a Column to BigQuery External Table - google-bigquery

I have a bunch of data stored in partitioned ORC files in Google Cloud Storage. My bucket looks something like the following:
my_bucket
- folder_of_orc_files
- - partition1=abc
- - - file1.orc
- - - file2.orc
I have an external table defined in BigQuery that points to the data above that was created like this:
CREATE OR REPLACE EXTERNAL TABLE my_dataset.my_external_table
WITH PARTITION COLUMNS (
  partition1 STRING
)
OPTIONS (
  uris = ['gs://my_bucket/folder_of_orc_files/*'],
  format = 'ORC',
  hive_partition_uri_prefix = 'gs://my_bucket/folder_of_orc_files'
);
Those files currently have columns "Column A", "Column B" and "Column C".
Now I need to add "Column D". So I add a file3.orc that contains "Column D". In reality, of course, I have a metric ton of files and would rather not have to recreate all of the old ones. I need a way to have the external table see "Column D" with NULLS for the old entries and with the proper values from the new files.
Out of the box, the external table does not see "Column D". So I dropped the table and re-created it. It still only has columns A, B and C and completely ignores "Column D". With ORC files you cannot specify a schema yourself; it is read automatically. The ALTER TABLE command does not work for external tables.
The only thing I have found (short of reloading all of the data from scratch from my Spark jobs) is that I could move all of my data into temporary tables and then re-write them out, which is again quite a large task (and expensive) when you have a huge amount of data. Does anyone know of any other way to achieve this? Thanks!

When you ask BigQuery to autodetect a schema, it simply reads a sample of the data (a sample of lines for CSV or JSON). For a binary format such as ORC, I guess it samples a few files, or just the first one it finds.
In any case, once the schema has been defined, it is never updated automatically. If your file format changes, you have to update the schema manually.
Indeed, you can't do it with an ALTER TABLE statement, but you can do it in the UI or with the bq CLI (or the API/client libraries if you prefer). You have the documentation here.
So, not impossible, but not automatic!
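For example, here is a minimal sketch of the client-library route mentioned above, using the Python client (the new column name ColumnD and its STRING type are assumptions; adjust them to match the field in your new ORC files):
from google.cloud import bigquery

client = bigquery.Client()

# Fetch the current schema of the external table from the question.
table = client.get_table("my_dataset.my_external_table")

# Append the new column as NULLABLE so that rows coming from the old ORC
# files simply show NULL for it.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("ColumnD", "STRING", mode="NULLABLE"))

table.schema = new_schema
client.update_table(table, ["schema"])  # patch only the schema field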

Related

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently has no partitions? If so, please also include the syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants of this exist for SQL Server, or for adding a partition to an already-partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS that loads the data onto its own disks or into memory, Athena works by scanning data in S3. This is how you enjoy the scale and low cost of the service.
What this means is that your data has to sit in different folders in a meaningful structure, such as year=2019, year=2020, and that the data for each year is all, and only, in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
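To make the CTAS suggestion concrete, here is a hedged sketch of running it from Python with boto3 (the bucket, database, and non-partition column names are made up for the example; note that in an Athena CTAS the partition columns must come last in the SELECT list):
import boto3

athena = boto3.client("athena")

# Hypothetical CTAS that copies table1 into a new Parquet table partitioned
# by ourDateStringCol (the column from the question above).
ctas = """
CREATE TABLE table1_partitioned
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/table1_partitioned/',
    partitioned_by = ARRAY['ourDateStringCol']
) AS
SELECT col_a, col_b, ourDateStringCol
FROM table1
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)
The new table is partitioned from the start, so queries that filter on ourDateStringCol only scan the matching folders.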

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with the schema auto-detected from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use schema auto-detection, BigQuery creates the schema based on only around the first 100 values.
For example, in my case there is a column, say X, whose values are mostly integers, but some values are strings, so bq load fails with a schema mismatch. In such a scenario the data type needs to be changed to STRING.
So what I could do is manually create a new table by generating the schema on my own. Or I could set max_bad_records to something like 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
* Use --max_rows_per_request = 1 in your script.
* Use one line that best suits your case, with the optimized field types.
This will create the table with the correct schema from that single line, and from there you can load the rest of the data.
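As an alternative sketch of the same idea using the Python client library instead of the bq CLI, you can create the table by loading with an explicit schema in which the ambiguous column is declared as STRING (the file format, column names, and URI below are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source file.
table_id = "my_project.my_dataset.my_table"
uri = "gs://my_bucket/my_file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # Instead of autodetect, force the ambiguous column X to STRING.
    schema=[
        bigquery.SchemaField("X", "STRING"),
        bigquery.SchemaField("Y", "INTEGER"),
    ],
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish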

How to rename AWS Athena columns with parquet file source?

I have data loaded in my S3 bucket folder as multiple parquet files.
After loading them into Athena I can query the data successfully.
What are the ways to rename the Athena table columns for a parquet file source and still be able to see the data under the renamed column after querying?
Note: I checked the edit-schema option; the column gets renamed, but after querying you will not see any data under that column.
There is, as far as I know, no way to create a table whose column names differ from what they are called in the files. The table can have fewer or extra columns, but only the names that are the same as in the files will be queryable.
You can, however, create a view with other names, for example:
CREATE OR REPLACE VIEW a_view AS
SELECT
a AS b,
b AS c
FROM the_table

create BigQuery external tables partitioned by one/multiple columns

I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or multiple partitions on top of the files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we handle Hive now is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
    ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions on non-external tables.
I have a few questions:
* Is there support for creating external tables with partition(s)? Can you please point me in the right direction?
* Is loading the data into BigQuery preferred to having it stored in GCS as Avro files?
* If yes, how would we deal with schema evolution?
Thank you very much in advance.
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
See also a relevant blog post about lazy data loading.
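Since the recommendation above is to load the Avro files rather than query them in place, here is a hedged sketch of that load using the Python client (the question uses the Java client, which has equivalent builders; the bucket, dataset, and table names are placeholders). The schema_update_options line covers the "only ever adding columns" case mentioned above:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source files.
table_id = "my_project.my_dataset.my_table"
uri = "gs://my_bucket/avro_files/*.avro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # If newer Avro files add columns, allow the load to extend the schema.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish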

Trying to copy data from Impala Parquet table to a non-parquet table

I am moving data around within Impala (not my design) and I have lost some data. I need to copy the data from the parquet tables back to their original non-parquet tables. Originally, the developers did this with a simple one-liner in a script. Since I don't know anything about databases, and especially not about Impala, I was hoping you could help me out. This is the one-liner used to convert to a parquet table, which I need to reverse:
impala-shell -i <ipaddr>
USE db;
INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your target table does not already exist, or use a table name that does not already exist, so that you do not accidentally overwrite other data.
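If you would rather drive this from a script than type it into impala-shell, here is a hedged sketch using the impyla package (the host, database, and table names are placeholders):
from impala.dbapi import connect

# Placeholder connection details for your cluster.
conn = connect(host="impala-host", port=21050)
cur = conn.cursor()

cur.execute("USE db")
# Recreate the non-parquet table from the parquet copy, per the answer above.
cur.execute("CREATE TABLE restored_text_table AS SELECT * FROM parquet_table")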