Is there a way to avoid the data type conversion from STRING to STRUCT<string STRING, text STRING, provided STRING> for Datastore imports to BigQuery? - google-bigquery

We automatically load Datastore backups into BigQuery for further analysis, overwriting the table every day.
When a Datastore kind with at least one entity containing a long text value is imported into BigQuery, that field is automatically converted to a STRUCT<string STRING, text STRING, provided STRING> instead of a STRING field like all the other text/string fields. This changes the schema of the BigQuery table and makes any further processing or analysis really hard, as queries need to be adapted to account for it. We cannot control the length of text on the Datastore side, so we need to find a way to at least stabilize the schema on the BigQuery side.
Any idea on how to deal with this elegantly?
Any way this conversion can be avoided so the schema of the BigQuery table does not change?

Setting a schema on a load job from a Datastore export is not possible in BigQuery, which means the schema will always be inferred from the data. If you try to load it through the UI, for example, you will see a message saying:
Source file defines the schema
In this link you can find how the type conversion works between Datastore and BigQuery.
Try using a view as the final table, or create a scheduled query that reads your table once it is loaded and saves the results in another table with the right schema.
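As a rough illustration of the view approach, the sketch below uses the BigQuery Java client to (re)create a view that coalesces the STRUCT subfields back into a single STRING column. The dataset, table, and field names (analytics, kind_raw, kind_stable, description) are hypothetical, and it assumes the long-text field was imported as the STRUCT described above.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class CreateStableView {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical names: dataset "analytics", raw load table "kind_raw",
        // long-text field "description" imported as STRUCT<string, text, provided>.
        String ddl =
            "CREATE OR REPLACE VIEW `analytics.kind_stable` AS "
          + "SELECT * REPLACE(COALESCE(description.text, description.string) AS description) "
          + "FROM `analytics.kind_raw`";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(ddl)
            .setUseLegacySql(false) // DDL requires standard SQL
            .build();
        bigquery.query(config);
    }
}

A scheduled query would work the same way, writing the result of the SELECT to a destination table instead of defining a view.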

Related

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this requires a manual step to cast the data into the correct types in order to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know, through FILE_FORMAT, the structure of the file it is going to load data from.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
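For reference, this is roughly what a typed load looks like when driven from code: the target table's column definitions fix the data types, and FILE_FORMAT tells COPY how to parse the staged files. The sketch below uses the Snowflake JDBC driver with hypothetical account, credential, stage, and table names.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class CopyFromStage {
    public static void main(String[] args) throws Exception {
        // Hypothetical account and credentials; adjust to your environment.
        Properties props = new Properties();
        props.put("user", "MY_USER");
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement stmt = conn.createStatement()) {
            // The target table's columns fix the data types; FILE_FORMAT tells
            // COPY how to parse the staged CSV files (hypothetical stage name).
            stmt.execute(
                "COPY INTO my_table "
              + "FROM @my_s3_stage "
              + "FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"' SKIP_HEADER = 1)");
        }
    }
}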

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery with a schema that is auto-generated from the file in GCS. I'm using Apache Airflow to do this. The problem I'm having is that when I use schema auto-detection, BigQuery creates the schema based on roughly the first 100 values.
For example, in my case there is a column, say X; the values in X are mostly of integer type, but some values are of string type, so bq load will fail with a schema mismatch. In such a scenario we need to change the data type to STRING.
So what I could do is manually create a new table by generating the schema on my own, or I could set the max_bad_records value to something like 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS to BigQuery, if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
* Use --max_rows_per_request = 1 in your script.
* Use one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and one row, and from there you can load the rest of the data.
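An equivalent way to pin the schema programmatically is to create the destination table with the desired column types first and then append the bulk load, so autodetect never gets a say. Below is a minimal sketch with the BigQuery Java client, using hypothetical dataset, table, column, and bucket names.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateThenLoad {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "my_table"); // hypothetical names

        // Step 1: fix the schema up front, declaring the problematic column X as STRING
        // (this plays the role of the single hand-picked row in the workaround above).
        Schema schema = Schema.of(
            Field.of("X", StandardSQLTypeName.STRING),
            Field.of("Y", StandardSQLTypeName.INT64));
        bigquery.create(TableInfo.of(table, StandardTableDefinition.of(schema)));

        // Step 2: load the full data set, appending to the now-fixed schema
        // instead of relying on autodetect.
        CsvOptions csv = CsvOptions.newBuilder().setSkipLeadingRows(1).build();
        LoadJobConfiguration load = LoadJobConfiguration
            .newBuilder(table, "gs://my-bucket/data/*.csv", csv) // hypothetical URI
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();
        bigquery.create(JobInfo.of(load)).waitFor();
    }
}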

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning schemes in order to test table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's TableCopy command instead.
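If you do go the Beam route, a minimal sketch of that read-then-write path looks roughly like this (table names are hypothetical, and the partitioned destination table is assumed to exist already):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyTablePipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Hypothetical table names; the destination table (with its partitioning)
        // must already exist, hence CREATE_NEVER.
        p.apply("ReadSource", BigQueryIO.readTableRows()
                .from("my-project:my_dataset.source_table"))
         .apply("WriteCopy", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.partitioned_copy")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
    }
}

With CREATE_NEVER, no schema needs to be supplied to the sink; the existing destination table's schema is used.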
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.

Talend ETL tool

I am developing a migration tool and using Talend ETL tool (Free edition).
Challenges faced:
Is it possible to create a Talend job that uses a dynamic schema every time it runs, i.e. no hard-coded mappings in the tMap component?
I want the user to provide an input CSV/Excel file, and the job should create the mappings on the basis of that input file. Is this possible in Talend?
Any other free ETL tool would also be helpful, as would any sample job.
Yes, this can be done in Talend, but if you do not wish to use a tMap then your table and file must match exactly. We have implemented this for stage tables whose columns are all of type varchar. That works when you are loading raw data into a stage table and your validation is done after the load, prior to loading the stage data into a data warehouse.
Here is a summary of our method:
* The file names contain the table name, so the process starts with a tFileList and parses the table name out of the file name.
* Use tMSSQLColumnList to obtain each column's name, type, and length for the table (one way is to store it as an inline table in tFixedFlowInput).
* Run this through a tSetDynamicSchema to produce your dynamic schema for that table.
* Use a file input that references the dynamic schema.
* Load that into a tMSSQLOutput, again referencing the dynamic schema.
One more note on data types: it may work with data types other than varchar, but our stage tables only have varchar and datetime. We had issues with datetime, so we filtered out those column types with a tMap.
Keep in mind, this is a summary to point you in the right direction, not a precise tutorial. But with this info in your hands, it can save you many hours of work while building your solution.

Keep Column types from Java ResultSet in CSV export

I'm currently building a tool that pulls data directly from a database (because SPSS Modeler is too slow) and stores it in a Java ResultSet first.
I then try to export the data into a CSV (or similar) file while keeping as many of the column types as possible.
Currently I'm using opencsv, but it casts decimals and many other types to String. When I load the file back into SPSS Modeler, I get only integers and strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data with its column types (like IBM InfoSphere Data Architect can do), so I can load it directly back into SPSS Modeler without changing it back manually there?
Thank you!
Retrieving the Metadata from the DB Information Schema
If the data is currently stored in a database, you can retrieve the column types from the information schema. All you need to do is retrieve this information after you have queried the table and store it so that you can reuse it later.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

// connect to DB as usual
Statement stmt = conn.createStatement();
// Create your query.
// Note that you can use a dummy query here:
// you only need the metadata of the result, regardless of the actual query.
ResultSet rse = stmt.executeQuery("Select A,B FROM table WHERE ..");
// get the ResultSetMetaData
ResultSetMetaData rsmd = rse.getMetaData();
// Get the database-specific type names
rsmd.getColumnTypeName(1); // database-specific type name for column 1 (e.g. VARCHAR)
rsmd.getColumnTypeName(2); // database-specific type name for column 2 (e.g. DATETIME)
// ... and so on for the remaining columns
// Get the generic JDBC types, see http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
rsmd.getColumnType(1); // generic type for column 1 (e.g. 12 = VARCHAR)
rsmd.getColumnType(2); // generic type for column 2
Processing
You could store this information in a CSV schema file and apply it during the transformation process.
I recommend that you use Super CSV, which is available here.
This library provides so-called cell processors, which allow you to define the types of the columns.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.
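For illustration, here is a minimal sketch of writing typed values with Super CSV, assuming three hypothetical columns (an integer id, a decimal amount, and a timestamp); each column gets its own cell processor, so the formatting, and therefore the type information, stays predictable when the file is read back.

import java.io.FileWriter;
import java.util.Arrays;
import java.util.Date;

import org.supercsv.cellprocessor.FmtDate;
import org.supercsv.cellprocessor.FmtNumber;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class TypedCsvExport {
    public static void main(String[] args) throws Exception {
        // Hypothetical columns: an id (integer), an amount (decimal), and a timestamp.
        // One processor per column controls how each value is written.
        CellProcessor[] processors = new CellProcessor[] {
            new NotNull(),                      // id, written as-is
            new FmtNumber("0.00"),              // amount, keep two decimal places
            new FmtDate("yyyy-MM-dd HH:mm:ss")  // timestamp, fixed date format
        };

        try (ICsvListWriter writer = new CsvListWriter(
                new FileWriter("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            writer.writeHeader("ID", "AMOUNT", "CREATED_AT");
            writer.write(Arrays.asList(42, 19.5, new Date()), processors);
        }
    }
}

On the way back in, the matching ParseInt / ParseDouble / ParseDate processors can restore the original types instead of leaving everything as String.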