BigQuery dbt_external_tables External Data Configuration - google-bigquery

I need some help using the dbt_external_tables package.
I realized that some lines in the CSV I have in GCS contain line breaks, and this is causing issues when trying to query the table created by the macro.
When doing this configuration of the external table manually, the BigQuery UI has two options:
Allow jagged rows (CSV)
Allow quoted newlines (CSV)
I usually set both options to true, and sometimes that solves the issues.
I don't know how to do this using dbt_external_tables.
This is important, as I am receiving errors like this when trying to query the table created by dbt: "Error while reading table: kpi-process.file_csv.History, error message: CSV table references column position 9, but line starting at position:10956 contains only 7 columns."

The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here. In your case, it sounds like you want to turn on allow_jagged_rows and allow_quoted_newlines, so you can specify them like so:
version: 2

sources:
  - name: my_external_source
    tables:
      - name: my_external_table
        external:
          location: 'gs://bucket/path/*'
          options:
            format: csv
            allow_jagged_rows: true
            allow_quoted_newlines: true
And dbt will template a DDL statement accordingly:
create or replace external table my_external_source.my_external_table
options (
  format = 'csv',
  allow_jagged_rows = true,
  allow_quoted_newlines = true,
  uris = ['gs://bucket/path/*']
)
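For reference, the same two options can also be set outside of dbt with the BigQuery Python client, which can help confirm what the flags do. A minimal sketch, assuming the google-cloud-bigquery library and placeholder project/dataset/table names (none of these names come from the original question):

from google.cloud import bigquery

client = bigquery.Client()

# External CSV configuration mirroring the dbt options above.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://bucket/path/*"]
external_config.autodetect = True                      # assumption: let BigQuery infer the schema
external_config.options.allow_jagged_rows = True       # tolerate rows with missing trailing columns
external_config.options.allow_quoted_newlines = True   # tolerate newlines inside quoted fields

# Placeholder table reference; substitute your own project, dataset and table.
table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)

This is only to illustrate the two flags; with dbt_external_tables the YAML options above are all you need.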

Related

How to persist column descriptions in BigQuery tables

I have created models in my dbt (data build tool) project where I have specified column descriptions. My dbt_project.yml file is shown below:
models:
  sakila_dbt_project:
    # Applies to all files under models/example/
    +persist_docs:
      relation: true
      columns: true
    events:
      materialized: table
      +schema: examples
I have added +persist_docs as described by dbt as the fix to make column descriptions appear, but still no descriptions appear in the BigQuery table.
My models/events/events.yml looks like this:
version: 2

models:
  - name: events
    description: This table contains clickstream events from the marketing website
    columns:
      - name: event_id
        description: This is a unique identifier for the event
        tests:
          - unique
          - not_null
      - name: user-id
        quote: true
        description: The user who performed the event
        tests:
          - not_null
What am I missing?
P.S. I'm using dbt version 0.21.0
Looks consistent with the required format as shown in the docs:
dbt_project.yml
models:
..<resource-path>:
....+persist_docs:
......relation: true
......columns: true
models/schema.yml
version: 2
models:
..- name: dim_customers
....description: One record per customer
....columns:
......- name: customer_id
........description: Primary key
Maybe spacing? I converted the spaces to periods in the examples above because the number of spaces is unforgivingly specific for yml files.
I've started using the VS Code YAML formatter because of how often I run into spacing issues on these keys in both schema.yml and dbt_project.yml.
Otherwise, this isn't for a source or external table, right? Those are the only two artifacts that persist_docs is unsupported for:
Sources: unsupported (persist_docs -> sources tab)
External tables: unsupported (can't find it in the docs again, but I read it today in the docs or a GitHub issue)
Apache Spark: also unsupported (irrelevant here) - Apache Spark profile
Also, if you're going to be working with persist_docs a lot, check out this macro example persist_docs_op that Jeremy left for a run-operation to update your persisted docs in case that's all you changed!
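If it helps to confirm what actually landed in BigQuery after running dbt, here is a minimal sketch using the BigQuery Python client; the table id is a placeholder for whatever dataset/table dbt materialized:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table id; substitute the dataset/table that dbt built.
table = client.get_table("my-project.examples.events")

print("relation description:", table.description)
for field in table.schema:
    # Column descriptions persisted by dbt show up on the schema fields.
    print(field.name, "->", field.description)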

Spark (Databricks) unmanaged table from SQL not processing headers

I'm trying to create an unmanaged table in Spark (Databricks) from a CSV file using the SQL API, but the first row is not being used as headers.
Image 2 shows that the first row is handled correctly when using the DataFrame API to create an unmanaged table; the DataFrame was loaded from the same CSV file.
However, Image 1 shows that creating an unmanaged table from a CSV data source in SQL does not process the first row as headers. Am I leaving out some "headers" option?
And if so, how would that be coded?
You just need to provide OPTIONS as specified in the documentation.
In that options block you can list key/value pairs that match the options of the Spark CSV reader. For example, options ('header' = 'true', 'sep' = ',') will make Spark treat the first line as column headers rather than data, and set the separator to a comma. You can also add 'inferSchema' = true to the options, in which case you can omit the column declarations entirely and Spark will infer them for you (fine for small datasets, but not for big ones):
create table test.test using csv
options ('header' = 'true', 'sep' = ',', 'inferSchema' = true)
location '/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv'
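For comparison, here is a minimal PySpark sketch of the DataFrame API route the question mentions, writing an unmanaged (external) table by giving it an explicit path; the output path and table name are placeholders, not from the original post:

# Runs in a Databricks notebook where `spark` is already defined.
df = (
    spark.read
    .option("header", True)       # use the first line as column names
    .option("sep", ",")
    .option("inferSchema", True)  # fine for small files
    .csv("/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv")
)

(
    df.write
    .option("path", "/mnt/external/affairs")  # placeholder path; an explicit path keeps the table unmanaged
    .saveAsTable("test.test_from_df")
)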

Alias not recognized in main script

In the following code I've loaded data from two Excel documents that have the exact same column names, and have therefore given the column in one of the tables an alias.
My problem occurs when I try to add a not match() condition at the end of the script.
// New table
NewTable:
LOAD
    [namn] as namnNy
FROM
    [pglistaNy.xlsx]
    (ooxml, embedded labels);

// Old table
OldTable:
LOAD
    [namn]
FROM
    [pglistaOld.xlsx]
    (ooxml, embedded labels)
Where not match(namn, namnNy);
I get an error telling me that it does not recognize the namnNy alias. Why is that, and what is a better solution or method?
The match() function will not work in your case: you are trying to match against a field (namnNy) that was loaded into a different table. You should use the Exists() function instead (full documentation is on Qlik's help page).
So your script will be:
// New table
NewTable:
LOAD
    [namn] as namnNy
FROM
    [pglistaNy.xlsx]
    (ooxml, embedded labels);

// Old table - keep only rows whose namn value was not already loaded into namnNy
OldTable:
LOAD
    [namn]
FROM
    [pglistaOld.xlsx]
    (ooxml, embedded labels)
Where not Exists(namnNy, namn);
Example qvw file here

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have this table, for which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
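For completeness, a small hypothetical usage sketch of the helper above, building the view body for one table (the dataset and table names are just the ones from the example; the surrounding tables.insert call is unchanged):

# Hypothetical usage: build an explicit SELECT list instead of SELECT *.
selectors = get_leaf_column_selectors("2014_05_17", "AccountDeletionRequest")
view_query = "SELECT\n%s\nFROM [2014_05_17.AccountDeletionRequest]" % selectors

# view_query then goes into the 'query' field of the 'view' section
# of the tables.insert body shown earlier.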
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. names without a '.' in them).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.

Import xx.sql file to execute by using Ebean

Is there any way to execute SQL statements read directly from SQL files (xx.sql) in Ebean?
For example, if I had a SQL file containing several SQL statements (with values already written in the file), is there any way to execute that file using Ebean?
You have at least two options out of the box:
Play evolutions are meant for updating the DB schema, so you can also use them for inserting initial data (as long as it is flat and does not contain relations to objects that have not been created yet). A sample evolution for MySQL:
# --- !Ups
INSERT INTO your_table (some_field) VALUES ('New value');

# --- !Downs
DELETE FROM your_table WHERE some_field = 'New value';
Use the Global object and insert initial data in the common Ebean way:
public void onStart(Application app) {
    if (YourModel.find.findRowCount() == 0) {
        YourModel newItem = new YourModel();
        newItem.someField = "New value";
        newItem.save();

        YourModel newItem2 = new YourModel();
        // etc....
    }
}
For the second approach, you can check how the Zentask sample reads a YAML file holding initial data from its Global object (the sample file is placed in the conf directory).
Edit:
Take a closer look at initial-data.yml: there are also relations between tasks and projects, so they have fixed id values. You need to do the same in your YAML:
projects:
    - !!models.Project
        id: 1
        name: Play 2.0
        folder: Play framework

tasks:
    - !!models.Task
        title: Fix the documentation
        done: false
        folder: Todo
        project: !!models.Project
            id: 1