Autodetect BigQuery schema within Dataflow?

Is it possible to use the equivalent of --autodetect in DataFlow?
i.e. can we load data into a BQ table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect?
(potentially related question)

If you are using protocol buffers as objects in your PCollections (which should perform very well on the Dataflow back-end), you might be able to use a util I wrote in the past. It parses the protocol buffer's schema into a BigQuery schema at runtime, based on inspection of the protocol buffer descriptor.
I quickly uploaded it to GitHub; it's a work in progress, but you might be able to use it or be inspired to write something similar using Java reflection (I might do that myself at some point).
You can use the util as follows:
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite).withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
Here the create disposition creates the table with the specified schema if it does not already exist, and ProtobufClass is the class generated from your Protobuf schema by the proto compiler.
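If you would rather roll your own than depend on the util, the core of it is just walking the proto descriptor and emitting a TableFieldSchema per field. Below is a simplified sketch of that idea (flat messages only; nested and repeated message fields would need recursion into RECORD types), not the actual util:
// Simplified sketch: map a protobuf descriptor to a BigQuery TableSchema.
// Uses com.google.api.services.bigquery.model.{TableSchema, TableFieldSchema},
// com.google.protobuf.Descriptors and java.util.{List, ArrayList}.
static TableSchema makeTableSchema(Descriptors.Descriptor descriptor) {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (Descriptors.FieldDescriptor fd : descriptor.getFields()) {
        String bqType;
        switch (fd.getJavaType()) {
            case INT:
            case LONG:    bqType = "INTEGER"; break;
            case FLOAT:
            case DOUBLE:  bqType = "FLOAT";   break;
            case BOOLEAN: bqType = "BOOLEAN"; break;
            default:      bqType = "STRING";  break;  // strings, enums, bytes; nested messages need RECORD handling
        }
        fields.add(new TableFieldSchema()
            .setName(fd.getName())
            .setType(bqType)
            .setMode(fd.isRepeated() ? "REPEATED" : "NULLABLE"));
    }
    return new TableSchema().setFields(fields);
}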

I'm not sure about reading from BQ, but for writes I think something like this will work with the latest Java SDK.
.apply("WriteBigQuery", BigQueryIO.Write
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.to(outputTableName));
Note: BigQuery Table must be of the form: <project_name>:<dataset_name>.<table_name>.
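On the current Beam Java SDK the same idea looks roughly like the sketch below; no withSchema() call is needed because CREATE_NEVER assumes the destination table already exists (the table name and the rows variable are placeholders):
// Beam 2.x sketch: write existing TableRows without supplying a schema.
rows.apply("WriteBigQuery", BigQueryIO.writeTableRows()
    .to("<project_name>:<dataset_name>.<table_name>")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
where rows is a PCollection<TableRow> produced earlier in the pipeline.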

Related

Airflow GCSToBigQueryOperator is reordering my columns

I have the following operators in my DAG. They receive data from my MySQL database, upload it to GCS, and then import it into BigQuery. It runs great! With one small issue...
I can see that, in between the create and import tasks, the target table is created in BigQuery with the schema specified in the schema argument, with the correct column ordering. But as soon as the import task runs, the schema of the table changes and the columns are reordered into a seemingly arbitrary order. Why does this happen, and is there a way to get BigQuery to stop doing this? I see there are schema_update_options available on the operator, but the documentation is quite poor...
create = BigQueryCreateEmptyTableOperator(
    task_id="create",
    bigquery_conn_id='google_cloud',
    project_id="<myproject>",
    dataset_id=target_dataset,
    table_id=table_name,
    schema_fields=schema
)
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
import_to_bq = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table="<myproject>.target_dataset.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
)
create >> upload >> import_to_bq
The re-ordering happens because you did not define schema_fields in your GCSToBigQueryOperator, which triggered BigQuery schema auto-detection, wherein
BigQuery makes a best-effort attempt to automatically infer the schema from the source data.
In your case, to make sure your columns keep the ordering you want, you must define schema_fields in your GCSToBigQueryOperator.
You can also omit BigQueryCreateEmptyTableOperator entirely, since GCSToBigQueryOperator can create BigQuery tables and define their schemas itself.
Please see updated code based on your posted question:
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
create_and_import = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table="<myproject>.target_dataset.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
    schema_fields=schema
)
upload >> create_and_import
You may refer to this GCSToBigQueryOperator Documentation for more details.

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use schema auto-detection from the file, BigQuery infers the schema from only some ~100 initial values.
For example, in my case there is a column, say X, whose values are mostly of Integer type, but some values are of String type, so bq load fails with a schema mismatch. In such a scenario we need to change the column's data type to STRING.
What I could do is manually create a new table by generating the schema on my own. Or I could set the max_bad_records value to some 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS into BigQuery; if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
Since you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script.
Use one line that is the best fit for your case, with the optimized field types.
This creates the table with the correct schema from that single line, and from there you can load the rest of the data.
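Outside of Airflow, the same two-step idea can be sketched with the BigQuery Java client; the bucket, file, and table names below are placeholders, and it assumes the first file holds only the single representative line:
// Sketch: seed the table from one well-typed line, then load the rest.
// Uses com.google.cloud:google-cloud-bigquery (com.google.cloud.bigquery.*).
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("my_dataset", "my_table");

// 1. Load the single representative line with autodetect so the table is created with the right types.
LoadJobConfiguration seed = LoadJobConfiguration
    .newBuilder(tableId, "gs://my-bucket/seed_row.json", FormatOptions.json())
    .setAutodetect(true)
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
    .build();
bigquery.create(JobInfo.of(seed)).waitFor();

// 2. Append the rest of the data; the existing table's schema is used, no autodetect needed.
LoadJobConfiguration rest = LoadJobConfiguration
    .newBuilder(tableId, "gs://my-bucket/full_export.json", FormatOptions.json())
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
    .build();
bigquery.create(JobInfo.of(rest)).waitFor();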

Create BigQuery external tables partitioned by one/multiple columns

I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or multiple partitions on top of the files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we handle Hive now is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
    ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is however support for partitions for non external tables.
I have a few questions:
Is there support for creating external tables with partitions? If so, can you please point me in the right direction?
Is loading the data into BigQuery preferred to keeping it stored in GCS as Avro files?
If yes, how would we deal with schema evolution?
thank you very much in advance
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
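For example, something like the following sketch restricts a query on the external table to a given hive-style "partition" directory by filtering on the pseudo-column (the table name and the dt=... path pattern are placeholders, and it assumes a recent version of the Java client):
// Sketch: filter an external table by source file path via the _FILE_NAME pseudo-column.
String sql = "SELECT * FROM `my_dataset.my_external_table` "
    + "WHERE _FILE_NAME LIKE '%/dt=2019-01-01/%'";
TableResult result = bigQuery.query(QueryJobConfiguration.newBuilder(sql).build());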
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
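Loading those Avro files with the same Java client you are already using is a small change; a rough sketch with a placeholder GCS URI (Avro is self-describing, so no schema needs to be passed):
// Sketch: load Avro files from GCS into a native BigQuery table.
LoadJobConfiguration loadConfig = LoadJobConfiguration
    .newBuilder(TableId.of(dbName, tableName), "gs://my-bucket/path/*.avro", FormatOptions.avro())
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
    .build();
bigQuery.create(JobInfo.of(loadConfig)).waitFor();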
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
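If you do go the add-columns-only route, adding a column through the Java client looks roughly like the following sketch (the column name is a placeholder; new columns must be NULLABLE or REPEATED):
// Sketch: append a NULLABLE column to an existing table's schema.
Table existing = bigQuery.getTable(TableId.of(dbName, tableName));
Schema oldSchema = existing.getDefinition().getSchema();
List<Field> fields = new ArrayList<>(oldSchema.getFields());
fields.add(Field.newBuilder("new_column", LegacySQLTypeName.STRING)
    .setMode(Field.Mode.NULLABLE)
    .build());
existing.toBuilder()
    .setDefinition(StandardTableDefinition.of(Schema.of(fields)))
    .build()
    .update();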
See also a relevant blog post about lazy data loading.

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of sub-partitions named dt from a Hive table. For example: my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1, and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I've got in return from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use the HiveJDBC4 jar.
Later I tried using the compatible jar according to the CDH release; that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them... which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, the column names retrieved from the ResultSetMetaData had the table names prepended with a period (.) separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file, for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to FLOAT in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table, although I suggest using BigQuery's TableCopy command instead.
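Roughly, that Beam copy approach could look like the sketch below (table names are placeholders, and the destination is assumed to already exist with a matching schema):
// Sketch: copy rows between tables with Beam, relying on the destination table's existing schema.
PCollection<TableRow> rows = p.apply("ReadSource",
    BigQueryIO.readTableRows().from("<project>:<dataset>.source_table"));
rows.apply("WriteCopy", BigQueryIO.writeTableRows()
    .to("<project>:<dataset>.destination_table")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));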
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.