How to create BigQuery table with required fields from DataFlow with string schema definition? - google-bigquery

I am using DataFlow's WriteToBigQuery with CREATE_IF_NEEDED, and thus have to specify the schema.
I define the schema at the beginning of my code (outside the actual pipeline), but since I need the --save_main_session flag, I get the same error as here, which explains that the schema cannot be passed along with the pipeline because a BigQuery schema definition is not pickleable.
The solution mentioned on that page (disabling the --save_main_session flag) is not an option for me, so the remaining way to specify the schema is through a string.
However, I need to set some fields to REQUIRED. Is there a way to do this with the string schema definition?

As you can see from bigquery.py, the conversion from a string schema to a TableSchema is quite straightforward and does indeed set the mode to NULLABLE. Perhaps you can create the TableSchema with REQUIRED fields yourself, based on that code snippet.
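For example, here is a minimal sketch of building such a TableSchema programmatically (the field names are hypothetical); constructing it inside a helper function, rather than at module level, also keeps it out of the pickled main session:

from apache_beam.io.gcp.internal.clients import bigquery as beam_bq

def build_table_schema():
    # Mirror what the string-schema parser does, but set the mode explicitly.
    schema = beam_bq.TableSchema()
    for name, field_type, mode in [
        ('id', 'INTEGER', 'REQUIRED'),       # hypothetical fields
        ('name', 'STRING', 'REQUIRED'),
        ('comment', 'STRING', 'NULLABLE'),
    ]:
        field = beam_bq.TableFieldSchema()
        field.name = name
        field.type = field_type
        field.mode = mode
        schema.fields.append(field)
    return schema

You would then pass schema=build_table_schema() to WriteToBigQuery together with create_disposition=BigQueryDisposition.CREATE_IF_NEEDED.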

Related

Is there a way to avoid the data type conversion from STRING to STRUCT<string STRING, text STRING, provided STRING> for Datastore imports to BigQuery?

We are automatically loading Datastore Backups to BigQuery for further analysis overwriting the table every day.
When a Datastore Kind with at least one Entity containing long text is imported into BigQuery, that field is automatically converted to a STRUCT<string STRING, text STRING, provided STRING> instead of a STRING field like all the other text/string fields. This then changes the schema of the BigQuery table and makes any further processing or analysis really hard, as queries need to be adapted to account for it. We cannot control the length of text on the Datastore side, so we need to find a way to at least stabilize the schema on the BigQuery side.
Any idea on how to deal with this elegantly?
Any way this conversion can be avoided so the schema of the BigQuery table does not change?
Setting a schema for a load job from a Datastore export is not possible in BigQuery, which means the schema will always be inferred from the data. If you try to load it through the UI, for example, you will see a message saying:
Source file defines the schema
This link describes how the type conversion between Datastore and BigQuery works.
Try using a view as the final table, or create a scheduled query that reads your table once it is loaded and saves the results into another table with the right schema.
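As a rough sketch of the view approach with the google-cloud-bigquery Python client (project, dataset, table, and field names are hypothetical, and it assumes the long-text field consistently arrives as the STRUCT):

from google.cloud import bigquery

client = bigquery.Client()

# "kind_raw" is the table loaded from the Datastore backup (hypothetical names).
view = bigquery.Table("my-project.my_dataset.kind_stable")
view.view_query = """
    SELECT
      id,
      -- Collapse the STRUCT<string STRING, text STRING, provided STRING> back
      -- to a single STRING so downstream queries always see the same schema.
      COALESCE(description.text, description.string) AS description
    FROM `my-project.my_dataset.kind_raw`
"""
client.create_table(view, exists_ok=True)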

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another within the same Azure SQL DB. So far, I have created a lookup pipeline and passed the parameters to the ForEach loop and copy activity. But my sink dataset is not taking the parameter value I have given under the "table option" field; instead it is taking the dummy table I chose when creating the sink dataset. Can someone tell me how I can pass a dynamic table name to the sink dataset?
I have given concat('dest_schema.STG_',#{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded, variables, parameters, or expressions, so very flexible.
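For example (just one possible shape, assuming separate schema and table-name parameters on the Dataset), the sink's table-name parameter could be set inside the ForEach with the expression @concat('STG_', item().table_name), and its schema parameter to a plain value such as dest_schema.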
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery, with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that with schema auto-detection, BigQuery creates the schema based on roughly the first 100 values.
For example, in my case there is a column, say X, whose values are mostly of INTEGER type, but some values are of STRING type, so bq load fails with a schema mismatch. In such a scenario the data type needs to be changed to STRING.
What I could do is manually create a new table by generating the schema on my own, or I could set max_bad_records to something like 50, but neither seems like a good solution. An ideal solution would be:
Try to load the file from GCS to BigQuery, if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and one line, and from there you can load the rest of the data.
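A rough sketch of that workaround with the google-cloud-bigquery Python client (bucket, file, and table names are hypothetical, and it assumes a CSV source):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical

# 1. Load one representative line first; pick a line whose X value is
#    non-numeric so autodetect infers STRING for that column.
seed_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://my-bucket/one_representative_line.csv", table_id,
    job_config=seed_config,
).result()

# 2. Append the full file; the table's existing schema now applies, so the
#    mixed values in X load as STRING instead of failing the job.
append_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(
    "gs://my-bucket/full_file.csv", table_id,
    job_config=append_config,
).result()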

change field datatype in Cosmos DB

I have a field in Cosmos DB which is mapped as a number, but it should be a string. I'd like to alter the schema in place without reloading the data; is this possible with a query, in the same way it can be achieved in SQL?
ALTER TABLE EVENTS
MODIFY COLUMN eventAmount varchar;
I have consulted the docs, but they only reference simple SQL commands.
DocumentDB is schemaless. There is no structure defined outside the documents themselves, so each document has its own schema. If you want documents to follow a certain structure, you must enforce that yourself in your application logic.
So this means you cannot "alter the schema" of a collection to change data types.
What you can and should do is fix the documents you consider to have the wrong schema by updating them: query the documents where eventAmount is stored as a number and save each document with the value stored as the corresponding string instead.
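A minimal sketch of that fix with the azure-cosmos Python SDK (account, database, and container names are hypothetical):

from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="my-key")
container = client.get_database_client("my-database").get_container_client("events")

# Find documents where eventAmount is stored as a number...
query = "SELECT * FROM c WHERE IS_NUMBER(c.eventAmount)"
for doc in container.query_items(query=query, enable_cross_partition_query=True):
    # ...and rewrite each one with the value stored as a string instead.
    doc["eventAmount"] = str(doc["eventAmount"])
    container.upsert_item(doc)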

Appending data to a table created from an Avro file in BigQuery

Every morning, an automatic job creates a new table from an Avro file. In the afternoon, I need to append some data to this table from a query.
When trying to do so, I get the following error:
Error: Invalid schema update. Field chn has changed mode from REQUIRED to NULLABLE
I noticed that I can change the property of the field chn from REQUIRED to NULLABLE in the BigQuery web UI and then it works fine, but I would have to do it manually every day, which is not what I am looking for.
Is there a way to "cast" the field as REQUIRED during the append query?
Or, during the first import from the Avro file, is there a way to force the field to be NULLABLE and not REQUIRED?
Thanks!
The feature that allows relaxing a field as part of a query or a load job will be available in production shortly. I will update this answer when it goes live (likely within a week).
Update: 08/25/2016
You can supply schemaUpdateOptions in load or query job configuration.
Multiple options can be provided.
It allows the schema of the destination table to be updated as a side effect of the load or query job. Schema update options are supported in two cases:
* When writeDisposition is WRITE_APPEND
* When writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators
For non-partitioned tables, WRITE_TRUNCATE will always overwrite the schema.
The following values are supported:
* ALLOW_FIELD_ADDITION: allow adding a nullable field to the schema
* ALLOW_FIELD_RELAXATION: allow relaxing a required field in the original schema to nullable
NOTE: This doesn't currently work with schema auto-detection. We plan to support that soon.
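For example, a minimal sketch of the afternoon append with the google-cloud-bigquery Python client (table and query names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination: the table created from the Avro file each morning.
destination = bigquery.TableReference.from_string("my-project.my_dataset.avro_table")

job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Relax REQUIRED fields (such as chn) in the destination to NULLABLE
    # as a side effect of the append, instead of failing the job.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION],
)
client.query(
    "SELECT * FROM `my-project.my_dataset.afternoon_source`",
    job_config=job_config,
).result()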