Infer Avro schema for BigQuery table load

I'm using the Java API, trying to load data from Avro files into BigQuery.
When creating external tables, BigQuery automatically detects the schema from the .avro files.
Is there a way to specify a schema/data file in GCS when creating a regular BigQuery table for data to be loaded into?
Thank you in advance.

You could manually create the schema definition with configuration.load.schema; however, the documentation says that:
When you load Avro, Parquet, ORC, Cloud Firestore export data, or Cloud Datastore export data, BigQuery infers the schema from the source data.
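For reference, here is a minimal sketch of what the manual route looks like with the Java client library, where LoadJobConfiguration.Builder.setSchema is the counterpart of the REST API's configuration.load.schema field (the dataset, table, and bucket names are hypothetical):

```java
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;

public class ManualSchemaSketch {
  public static void main(String[] args) {
    // Hand-written schema definition, the Java-side equivalent of
    // filling in configuration.load.schema in the REST API.
    Schema schema =
        Schema.of(
            Field.of("id", StandardSQLTypeName.INT64),
            Field.of("name", StandardSQLTypeName.STRING));
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "my_table"), "gs://my-bucket/data/*.avro")
            .setFormatOptions(FormatOptions.avro())
            .setSchema(schema) // usually unnecessary for Avro, per the quote above
            .build();
    System.out.println(config);
  }
}
```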

It seems the problem was that the table already existed and I did not specify CreateDisposition.CREATE_IF_NEEDED.
You do not need to specify the schema at all, just like for external tables.
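A minimal sketch of the load with the Java client library along those lines (dataset, table, and bucket names are hypothetical): no schema is set, and CREATE_IF_NEEDED lets BigQuery create the table and infer the schema from the Avro files.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class AvroLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "my_table"), "gs://my-bucket/data/*.avro")
            .setFormatOptions(FormatOptions.avro())
            // No setSchema(...) call: the schema is inferred from the Avro files.
            .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
            .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job.getStatus().getError() != null) {
      throw new RuntimeException(job.getStatus().getError().toString());
    }
    System.out.println("Load completed.");
  }
}
```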

Related

Best approach loading a text file (.txt) to bigquery table

Does anyone have a practical idea of the best possible approach to upload a text file to a BigQuery table? I have a few zipped text files I need to download from a remote SFTP server and load into a BigQuery table. Should I download them to Google Cloud Storage and load them from there into BigQuery for faster speed? The text files are about 5 GB each and will grow further.
Thank you.
The first thing to consider, if you are loading files from your local data source, is that there are limitations for that, according to the documentation:
Loading data from a local data source is subject to the following limitations:
Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
When using the classic BigQuery web UI, files loaded from a local data source must be 10 MB or less and must contain fewer than 16,000 rows.
Besides that, the link provided above includes instructions on how to upload your data with the Console or the CLI.
Nevertheless, by using Cloud Storage you can take advantage of long-term storage: you are not charged for loading data into BigQuery, only for storing the data in Cloud Storage. You can read more about it here.
Finally, I would like you to consider two concepts: external and native tables in BigQuery.
Native tables: tables backed by native BigQuery storage.
External tables: tables backed by storage external to BigQuery. For more information, see Querying External Data Sources.
In other words, with native tables you import the full data into BigQuery, so data analysis tends to be faster. Meanwhile, external tables do not store data in BigQuery; instead they reference the data from an external source.
The cost of storing data in BigQuery is higher than in Cloud Storage, and querying external tables is slower than querying native tables, especially if the files are significantly large. On the other hand, since external tables are pointers to files, you do not have to wait for the data to load.
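If you take the Cloud Storage route, the load step could look roughly like this with the Java client library. This is a sketch only: the names are hypothetical, the files are assumed to be tab-delimited text, and gzip-compressed input is also accepted by load jobs.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class TextLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "sftp_import"), "gs://my-bucket/exports/*.txt")
            // Text files are loaded as CSV with a custom delimiter.
            .setFormatOptions(CsvOptions.newBuilder().setFieldDelimiter("\t").build())
            .setAutodetect(true) // let BigQuery infer the schema
            .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job.getStatus().getError() != null) {
      throw new RuntimeException(job.getStatus().getError().toString());
    }
    System.out.println("Load completed.");
  }
}
```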

What is the structure of an SQL dump file?

I am using Firestore as my main database, but I would like to export its data to SQL format. In order to do that, I know I'll need to create a script to create/format the dump file. What is the standard way to structure the file contents? Is it XML? What are the required fields? Unfortunately, I cannot find the answer to this.
Additional Info:
I will be exporting data from Firestore and importing it to Google Cloud SQL.
EDIT 1:
I'm using Postgres.
If you're looking for the easiest way to get your data from Cloud Firestore in a more query-friendly format, have a look at the new Firebase Extension that automatically exports specific collections from Firestore to BigQuery.
BigQuery is still a NoSQL database, but one that has built-in support for structured querying through a SQL dialect.

Do I need to specify entire schema to append data into an existing Google BigQuery table?

I have an existing Google BigQuery table with about 30 fields. I would like to start automating the addition of data to this table on a regular basis. I have installed the command line tools and they are working correctly.
I'm confused about the proper process for appending data to a table. Do I need to specify the entire schema for the table every time I want to append data? It feels strange to be recreating the schema in an Avro file. The schema already exists on the table.
Can someone please clarify how to do this?
Do I need to specify the entire schema for the table every time
No, you don't need to, as described in the official BigQuery documentation:
Schema auto-detection is not used with Avro files, Parquet files, ORC files, Cloud Firestore export files, or Cloud Datastore export files. When you load these files into BigQuery, the table schema is automatically retrieved from the self-describing source data.
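As a hedged illustration with the Java client library (the names are hypothetical), an append job only needs the target table, the source URI, and a write disposition; the schema comes from the self-describing Avro files:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class AppendAvroSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "existing_table"), "gs://my-bucket/new-rows/*.avro")
            .setFormatOptions(FormatOptions.avro())
            // Append to the existing table; the schema is not re-specified.
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    System.out.println(job.getStatus().getError() == null ? "Appended." : "Failed.");
  }
}
```

With the command-line tools, the equivalent should be a plain `bq load --source_format=AVRO my_dataset.existing_table "gs://my-bucket/new-rows/*.avro"` with no schema argument.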

BigQuery - load a datasource in Google big query

I have a MySQL DB in AWS. Can I use that database as a data source in BigQuery?
Currently I am going with a CSV upload to a Google Cloud Storage bucket and loading it into BigQuery from there.
I would like to keep it synchronized by pointing BigQuery at the data source itself rather than loading it every time.
You can create a permanent external table in BigQuery that is connected to Cloud Storage. Then BQ is just the interface while the data resides in GCS. It can be connected to a single CSV file, and you are free to update/overwrite that file. But I am not sure whether you can link BQ to a directory full of CSV files or even a tree of directories.
Anyway, have a look here: https://cloud.google.com/bigquery/external-data-cloud-storage
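For completeness, here is a sketch of creating such a permanent external table with the Java client library (all names are hypothetical). Note that BigQuery does accept a single `*` wildcard in a GCS URI, which covers the directory-of-CSVs case:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class ExternalCsvTableSketch {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // One '*' wildcard is allowed, so every CSV under this prefix is picked up.
    ExternalTableDefinition definition =
        ExternalTableDefinition.newBuilder(
                "gs://my-bucket/mysql-export/*.csv", FormatOptions.csv())
            .setAutodetect(true) // infer the schema from the CSV files
            .build();
    bigquery.create(TableInfo.of(TableId.of("my_dataset", "mysql_mirror"), definition));
    // Overwriting the CSV files in GCS changes what subsequent queries see.
  }
}
```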

No schema auto-detect when querying External Tables in Bigquery and new data arrive

This is the current situation:
I've created an external table in BigQuery against JSON files in Cloud Storage.
I'm testing how it works with regard to schema auto-detection.
When I created the table, there were two JSON files with different schemas, and BigQuery handled it well.
When I load a new file with a new schema (adding a new attribute to a record field), BigQuery recognizes the new record, but the new field doesn't appear. So schema auto-detection doesn't work as I expected.
How can I get schema auto-detection when new files arrive in my Cloud Storage folder?
Any help?
Culprit: AFAIK, schema auto-detection happens when you create a table; the schema is not updated as you add new files.
Possible solution:
Re-create the tables when new files arrive.
Straightforward implementation:
Add a Pub/Sub notification on GCS for newly arriving files, and have a Google Cloud Function, triggered by it, that re-creates the table, as in the sketch below.
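A sketch of such a function with the Java Functions Framework and the BigQuery client library (the table and dataset names, the bucket layout, and the minimal GcsEvent payload class are all assumptions):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;

public class RecreateExternalTable implements BackgroundFunction<RecreateExternalTable.GcsEvent> {

  // Minimal payload for the GCS "object finalized" event.
  public static class GcsEvent {
    public String bucket;
    public String name;
  }

  @Override
  public void accept(GcsEvent event, Context context) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of("my_dataset", "json_external");
    // Drop and re-create the table so auto-detection runs over all current files.
    bigquery.delete(tableId);
    ExternalTableDefinition definition =
        ExternalTableDefinition.newBuilder(
                "gs://" + event.bucket + "/json/*.json", FormatOptions.json())
            .setAutodetect(true)
            .build();
    bigquery.create(TableInfo.of(tableId, definition));
  }
}
```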