Configuring raw and analytics databases with dbt

I have been reading dbt's How we configure Snowflake guide, which explains the rationale behind having a raw database and an analytics database: raw data is loaded into raw (e.g. by Fivetran), and analytics is where dbt saves transformed data/views for data analysts/scientists.
However, I can't seem to find any guides on how to actually set this up. The profiles.yml file needs to point to where the raw data is so that dbt can begin transforming, yet this file also seems to dictate the database and schema into which transformed data/views are saved.
Where in dbt's many .yml files do I specify globally where to save transformed data?

Set up your profiles.yml, which does NOT live in the project itself but in the ~/.dbt folder on your machine, so that it points to your target database/schema, i.e. where dbt should write transformed data. For development, this looks like the profiles.yml block below; for production, you configure the equivalent target in dbt Cloud. Then you set up your sources as usual (see the src.yml block below). There is no universal sources option, just a target database/schema.
Profiles.yml Docs and Snowflake Profile Docs
# profiles.yml
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <snowflake_server>
      user: my_user
      password: my_password
      role: my_role
      database: analytics
      warehouse: dev_wh
      schema: dbt_<myname>
      threads: 1
      client_session_keep_alive: False
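If you run production from dbt Core rather than dbt Cloud, one option is a second output under the same outputs: key. A minimal sketch, where the prod_wh warehouse and the prod schema name are assumptions rather than something from the guide:
    prod:
      type: snowflake
      account: <snowflake_server>
      user: my_user
      password: my_password
      role: my_role
      database: analytics
      warehouse: prod_wh    # assumed name
      schema: analytics     # assumed: prod models land here, not in dbt_<myname>
      threads: 4
      client_session_keep_alive: False
Then dbt run --target prod writes to that schema instead of your dev one.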
# dbt_project.yml
name: 'my_dbt_models'
version: '1.0.0'
config-version: 2
profile: 'my_profile'
...
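Beyond the profile-wide target, you can also override the database/schema per folder of models in dbt_project.yml with +database/+schema. A sketch, where the staging and marts folder names are assumptions for illustration:
# dbt_project.yml (continued)
models:
  my_dbt_models:
    staging:
      +schema: staging      # by default dbt appends this to the target schema,
                            # e.g. dbt_<myname>_staging in dev
    marts:
      +database: analytics  # a database can also be pinned per folder
      +schema: marts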
Sources Docs
# src.yml
version: 2
sources:
  - name: jaffle_shop
    database: raw
    tables:
      - name: orders
In the model:
raw.jaffle_shop.orders becomes {{ source('jaffle_shop', 'orders') }}
Note that dbt assumes the source name is the schema by default; however, you can name the source whatever you want and set the schema explicitly if you want to give the source a special name.
For example:
sources:
  - name: my_special_name
    database: raw
    schema: jaffle_shop
    tables:
      - name: orders
In the model:
raw.jaffle_shop.orders becomes {{ source('my_special_name', 'orders') }}
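Putting it together, a model like the sketch below reads from the raw database via source() and, because of the profiles.yml above, gets written to the analytics database (the file name and column list are assumptions):
-- models/staging/stg_orders.sql (sketch)
select
    id as order_id,
    status
from {{ source('my_special_name', 'orders') }}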
I hope all that made sense.

Related

How to specify model schema when referencing another dbt project as a package? (dbt multi-repo setup)

We're using a dbt multi-repo setup with different projects for different business areas. We have several projects, something like this:
dbt_dwh
dbt_project1
dbt_project2
The dbt_dwh project contains models which we plan to reference in projects 1 and 2 (we have ~10 projects that would reference the dbt_dwh project) by way of installing git packages. Ideally, we'd like to be able to just reference the models in the dbt_dwh project (e.g.
SELECT * FROM {{ ref('dbt_dwh', 'model_1') }}). However, each of our projects sits in its own database schema, and this causes an issue upon dbt run because dbt uses the target schema from dbt_project_x, where these objects don't exist. I've included example set-up info below, for clarity.
packages.yml file for dbt_project1:
packages:
  - git: https://git/repo/url/here/dbt_dwh.git
    revision: master
profiles.yml for dbt_dwh:
dbt_dwh:
  target: dwh_dev
  outputs:
    dwh_dev:
      <config rows here>
    dwh_prod:
      <config rows here>
profiles.yml for dbt_project1:
dbt_project1:
  target: project1_dev
  outputs:
    project1_dev:
      <config rows here>
    project1_prod:
      <config rows here>
sf_orders.sql in dbt_dwh:
{{
    config(
        materialized = 'table',
        alias = 'sf_orders'
    )
}}
SELECT * FROM {{ source('salesforce', 'orders') }} WHERE uid IS NOT NULL
revenue_model1.sql in dbt_project1:
{{
    config(
        materialized = 'table',
        alias = 'revenue_model1'
    )
}}
SELECT * FROM {{ ref('dbt_dwh', 'sf_orders') }}
My expectation here was that dbt would examine the sf_orders model and see that the default schema for the project it sits in (dbt_dwh) is dwh_dev, so it would construct the object reference as dwh_dev.sf_orders.
However, if you use command dbt run -m revenue_model1 then the default dbt behaviour is to assume all models are located in the default target for dbt_project1, so you get something like:
11:05:03 1 of 1 START sql table model project1_dev.revenue_model1 .................... [RUN]
11:05:04 1 of 1 ERROR creating sql table model project1_dev.revenue_model1 ........... [ERROR in 0.89s]
11:05:05
11:05:05 Completed with 1 error and 0 warnings:
11:05:05
11:05:05 Runtime Error in model revenue_model1 (folder\directory\revenue_model1.sql)
11:05:05 404 Not found: Table database_name.project1_dev.sf_orders was not found
I've got several questions here:
How do you force dbt to use a specific schema at runtime when using the dbt ref function?
Is it possible to force dbt to use the default parameters/settings for models inside the dbt_dwh project when this Git repo is installed as a package in another project?
Some points to note:
All objects & schemas listed above sit in the same database
I know that many people recommend mono-repo set-up to avoid exactly this type of scenario, but switching to a mono-repo structure is not feasible right now, as we are already fully invested in multi-repo setup
Although it would be feasible to create sources.yml files in each of the dbt projects to reference the output objects of the dbt_dwh project, this feels like duplication of effort and could result in different versions of the same sources.yml file across projects
I appreciate it is possible to hard-code the output schema in the dbt config block, but this removes our ability to test in dev environment/schema for dbt_dwh project
I managed to find a solution, so I'll answer my own question in case anybody else runs up against the same issue. Unfortunately this is not documented anywhere that I can find; however, a throw-away comment in the dbt Slack workspace sparked an idea that allowed me to find the/a solution (I'll post the message if I manage to find it again, to give credit where it's due).
The fix is actually very simple: you just need to add the project being imported to the models: block of your dbt_project.yml file and specify the schema there. For our use case this is fine, as we only have one schema we use.
dbt_project.yml for dbt_project1:
models:
  dbt_project1:
    <configs here>
  dbt_dwh:
    +schema: [schema you want these models to run into]
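One thing to be aware of: by default dbt concatenates a custom schema onto the target schema, producing <target_schema>_<custom_schema>. If you want the custom schema used verbatim, the standard pattern from dbt's docs is to override the generate_schema_name macro in the consuming project; the sketch below is that generic override, not anything specific to this setup:
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}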
The advantages with this approach are:
When you generate/serve dbt docs it allows you to see the upstream lineage from the upstream project
If there are any upstream dependencies in your upstream project you can run this using dbt run -m +model_name (this can be super handy)
If you don't want this behaviour then you can use dbt run -m +model_name --exclude dbt_dwh (for example) to prevent models in your upstream project from running.
I haven't yet figured out if it is possible to use the default parameters/settings for models inside the upstream project (in this case dbt_dwh) but I will edit this answer if I find a way.

DBT - [WARNING]: Did not find matching node for patch

I keep getting the error below when I use dbt run - I can't find anything on why this error occurs or how to fix it within the dbt documentation.
[WARNING]: Did not find matching node for patch with name 'vGenericView' in the 'models' section of file 'models\generic_schema\schema.sql'
Did you by chance recently upgrade to dbt 1.0.0? If so, this means that you have a model, vGenericView, defined in a schema.yml but you don't have a vGenericView.sql model file to which it corresponds.
If all views and tables defined in your schema files are 1-to-1 with model files, then try running dbt clean, followed by dbt test or dbt run.
Not sure what happened to my project, but I ran into frustration looking for missing and/or misspelled files when the problem was just leftovers from compiled files that hadn't been cleaned out; I had previously moved views around to different schemas and renamed others.
So the mistake is here in the naming:
The model name in the models.yml file should for example be: employees
And the sql file should be named: employees.sql
So your models.yml will look like:
version: 2
models:
  - name: employees
    description: "View of employees"
And there must be a model with the file name employees.sql.
One case when this will happen is if you have the same data source defined in two different schema.yml files (or whatever you call them).

How do I use a yaml selector?

I am experimenting with yaml selectors, so far without success. My selectors.yml:
selectors:
  - name: daily
    description: selects models for daily run
    definition:
      exclude:
        - union:
            - "tag:pp_backfill"
            - "tag:inc_transactions_raw_data"
            - "tag:hourly"
And when I try to use it I get an error:
$ dbt ls --selector daily
* Deprecation Warning: dbt v0.17.0 introduces a new config format for the
dbt_project.yml file. Support for the existing version 1 format will be removed
in a future release of dbt. The following packages are currently configured with
config version 1:
- honey_dbt
- dbt_utils
- audit_helper
- codegen
For upgrading instructions, consult the documentation:
https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-0-17-0
* Deprecation Warning: The "adapter_macro" macro has been deprecated. Instead,
use the `adapter.dispatch` method to find a macro and call the result.
adapter_macro was called for: dbt_utils.intersect
Encountered an error:
Runtime Error
Could not find selector named daily, expected one of []
I tried this in both 0.18.1 and 0.19.0, with and without config-version: 2. Any thoughts?
I think the blocker here might be that you are not initially selecting anything from which to then exclude specific models using the tag method. Here is the solution in my project, and then an adaptation that might work in your case.
Context
I'm running dbt version 0.19.0 on dbt Cloud. Both of the selectors below compiled and ran successfully with dbt run --selector daily.
Jaffle Shop Example
stg_customers is tagged with dont_run_me and stg_orders is tagged with also_dont_run_me
selectors.yml is the following, at the root of the dbt project:
selectors:
  - name: daily
    description: selects models for daily run
    definition:
      union:
        - method: path
          value: models
        - exclude:
            - method: tag
              value: dont_run_me
            - method: tag
              value: also_dont_run_me
The logic here is that I first select all the models and then exclude the union of the models that have tags dont_run_me and also_dont_run_me.
dbt run --selector daily ended up running everything in my project except stg_customers and stg_orders
Specific Case
If you are trying to select all models except for ones that are tagged as pp_backfill, inc_transactions_raw_data, and hourly, I think the following will do the trick:
selectors:
  - name: daily
    description: selects models for daily run
    definition:
      union:
        - method: path
          value: models
        - exclude:
            - union:
                - method: tag
                  value: pp_backfill
                - method: tag
                  value: inc_transactions_raw_data
                - method: tag
                  value: hourly
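To sanity-check what a selector resolves to before running it, you can list the matched nodes first (these are standard dbt commands, nothing project-specific):
$ dbt ls --selector daily    # preview which models the selector matches
$ dbt run --selector daily   # then run them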

What are drone.io 0.8.5 plugin/gcr secrets' acceptable values?

I'm having trouble pushing to gcr with the following:
gcr:
  image: plugins/gcr
  registry: us.gcr.io
  repo: dev-221608/api
  tags:
    - ${DRONE_BRANCH}
    - ${DRONE_COMMIT_SHA}
    - ${DRONE_BUILD_NUMBER}
  dockerfile: src/main/docker/Dockerfile
  secrets: [GOOGLE_CREDENTIALS]
  when:
    branch: [prod]
...where GOOGLE_CREDENTIALS works, but if the secret is named, say, GOOGLE_CREDENTIALS_DEV, it will not be properly picked up. GCR_JSON_KEY works fine. I recall reading legacy documentation that spelled out the acceptable variable names, of which GOOGLE_CREDENTIALS and GCR_JSON_KEY were listed among other variants, but as of version 1 they've done some updates omitting that info.
So, question is, is the plugin capable of accepting whatever variable name or is it expecting specific variable names and if so what are they?
The Drone GCR plugin accepts the credentials in a secret named PLUGIN_JSON_KEY, GCR_JSON_KEY, GOOGLE_CREDENTIALS, or TOKEN (see the plugin source code).
If you stored the credentials in drone as GOOGLE_CREDENTIALS_DEV then you can rename it in the .drone.yml file like this:
...
secrets:
  - source: GOOGLE_CREDENTIALS_DEV
    target: GOOGLE_CREDENTIALS
...
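In context, the rename slots into the gcr step from the question like this (a sketch; only the secrets block differs from the original):
gcr:
  image: plugins/gcr
  registry: us.gcr.io
  repo: dev-221608/api
  dockerfile: src/main/docker/Dockerfile
  secrets:
    - source: GOOGLE_CREDENTIALS_DEV   # name as stored in drone
      target: GOOGLE_CREDENTIALS       # name the plugin expects
  when:
    branch: [prod]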

What is "hellostepfunc1" in the serverless documenation for setup AWS stepfunctions?

In this documentation from the serverless website - How to manage your AWS Step Functions with Serverless and GitHub - serverless-step-functions - we can find the word hellostepfunc1: in the serverless.yml file. I don't understand what it is, and I can't find any reference to it, even after the State Machine was created in AWS.
If I delete it I get the following error:
Cannot use 'in' operator to search for 'role' in myStateMachine
But if I change its name to someName, for example, there is no error and the State Machine works fine.
I could assume it is only an identifier, but I'm not sure.
Where can I find reference to it?
This is quite specific to the library you are using: it names the state machine being created based upon whether the name: field is provided under hellostepfunc1: or not.
Have a look at the test cases here and here to understand better.
In short, a serverless.yml like
stateMachines:
  hellostepfunc1:
    definition:
      Comment: 'comment 1'
      .....
yields a state machine named hellostepfunc1StepFunctionsStateMachine, because no name was specified.
Whereas for a serverless.yml like
stateMachines:
  hellostepfunc1:
    name: 'alpha'
    definition:
      Comment: 'comment 1'
      .....
the name of the state machine is alpha, because a name was specified.