Add filename as column on import to BigQuery? - google-bigquery

This is a question about importing data files from Google Cloud Storage to BigQuery.
I have a number of JSON files that follow a strict naming convention to include some key data not included in the JSON data itself.
For example:
xxx_US_20170101.json.gz
xxx_GB_20170101.json.gz
xxx_DE_20170101.json.gz
Which is client_country_date.json.gz At the moment, I have some convoluted processes in a Ruby app that reads the files, appends the additional data and then writes it back to a file that is then imported into a single daily table for the client in BigQuery.
I am wondering if it is possible to grab and parse the filename as part of the import to BigQuery? I could then drop the convoluted Ruby processes which occasionally fail on larger files.

You could define an external table pointing to your files:
Note that the table type is "external table", and that it points to multiple files with the * glob.
Now you can query for all data in these files, and query for the meta-column _FILE_NAME:
#standardSQL
SELECT *, _FILE_NAME filename
FROM `project.dataset.table`
You can now save these results to a new native table.

Related

Remove root from avro upload to bigquery using nodejs

Just starting out with bigquery and trying to find the best way to upload db tables to bq. We've been converting the table content to avro using the avsc library because from all the docs it seems that avro is the fastest way to load it but it adds a root element to the schema so that all the columns are root.name, root.time etc. I saw there was another post about this from 2016 here and the solution is to use a temporary table and strip out the root like this,
bq query --external_table_definition=foo::AVRO=gs://your_bucket/path/file.avro* --destination_table your_dataset.your_table "SELECT root.* FROM foo"
but the nodejs library only has instructions to accomplish permanent tables, not temporary ones. Even if I wanted to create a permanent table, I can't because due to the "root" it places all columns in one row and the amount of data exceeds the amount allowed in a single row. How can I load the data to bigquery?
You can create an external table using the bq js library [1] you need to set the options object appropriately [2]
[1] https://github.com/googleapis/nodejs-bigquery/blob/master/samples/createTable.js
[2] https://stackoverflow.com/a/42916251/5873699

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly each file name contains (date + specific parameter)
The files contain the schema (id + name) columns and we need to auto load/ingest these files into a bigquery table so that the final table have 4 columns (id,name,date,specific parameter)
We have tried dataflow templates but we couldn't get the date and specific parameter from the file name to the dataflow
And we tried cloud function (we can get the date and specific parameter value from file name) but couldn't add it in columns while ingestion
Any suggestions?
Disclaimer: I have authored an article for this kind of problem using Cloud Workflows. When you want to extract parts of filename, to use as table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file then, we will further use parts from the filename to use in BigQuery’s generated table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery

BigQuery fails on parsing dates in M/D/YYYY format from CSV file

Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?
According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use
the dash (-) separator and the date must be in the following
format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format. Use SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
Use that view for your analysis
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.

BigQuery - Transfers automation from Google Cloud Storage - Overwrite table

Here's the case:
Our client daily uploads CSVs (overwritten) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as DataSource in DataStudio
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we've:
Declared the tables in BigQuery with "Overwrite table" write preference option
Configured the daily Transfers vía UI (BigQuery > Transfers) to automatically upload the CSVs from Google Cloud one hour after the files are uploaded to Google Cloud, as stated by the limitations.
The automated transfer/load is by default in "WRITE_APPEND", thus the tables are appended instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table content just before the automatic importation process starts:
DELETE FROM project.tableName WHERE true
And then, new data will be imported to a void table, therefore default "WRITE_APPEND" doesn't affect us.
1) One way to do this is to use DDL to CREATE and REPLACE your table before running the query which imports the data.
This is an example of how to create a table
#standardSQL
CREATE TABLE mydataset.top_words
OPTIONS(
description="Top ten words per Shakespeare corpus"
) AS
SELECT
corpus,
ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM bigquery-public-data.samples.shakespeare
GROUP BY corpus;
Now that it's created you can import your data.
2) Another way is to use BigQuery schedule Queries
3) If you write Python you can find an even better solution here

how to load multiple CSV files into Multiple Tables

I have Multiple CSV files in Folder
Example :
Member.CSv
Leader.CSv
I need to load them in to Data base tables .
I have worked on it using ForEachLoop Container ,Data FlowTask, Excel Source and OLEDB Destination
we can do if by using Expressions and Precedence Constraints but how can I do using Script task if I have more than 10 files ..I got Stuck with this one
We have a similar issue, our solution is a mixture of the suggestions above.
We have a number of files types sent from our client on a daily basis.
These have a specific filename pattern (e.g. SalesTransaction20160218.csv, Product20160218.csv)
Each of these file types have a staging "landing" table of the structure you expect
We then have a .net script task that takes the filename pattern and loads that data into a landing table.
There are also various checks that are done within the csv parser - matching number of columns, some basic data validation, before loading into the landing table
We are not good enough .net programmers to be able to dynamically parse an unknown file structure, create SQL table and then load the data in. I expect it is feasible, after all, that is what the SSIS Import/Export Wizard does (with some manual intervention)
As an alternative to this (the process is quite delicate), we are experimenting with a HDFS data landing area, then it allows us to use analytic tools like R to parse the data within HDFS. After that utilising PIG to load the data into SQL.