How to specify model schema when referencing another dbt project as a package? (dbt multi-repo setup)

How to specify model schema when referencing another dbt project as a package? (dbt multi-repo setup) - dbt

We're using a dbt multi-repo setup with different projects for different business areas. We have several projects, something like this:
dbt_dwh
dbt_project1
dbt_project2
The dbt_dwh project contains models which we plan to reference in projects 1 and 2 (we have ~10 projects that would reference the dbt_dwh project) by way of installing git packages. Ideally, we'd like to be able to just reference the models in the dbt_dwh project (e.g.
SELECT * FROM {{ ref('dbt_dwh', 'model_1') }}). However, each of our projects sits in it's own database schema and this causes issue upon dbt run because dbt uses the target schema from dbt_project_x, where these objects don't exist. I've included example set-up info below, for clarity.
packages.yml file for dbt_project1:
packages:
- git: https://git/repo/url/here/dbt_dwh.git
revision: master
profiles.yml for dbt_dwh:
dbt_dwh:
target: dwh_dev
outputs:
dwh_dev:
<config rows here>
dwh_prod:
<config rows here>
profiles.yml for dbt_project1:
dbt_project1:
target: project1_dev
outputs:
project1_dev:
<config rows here>
project1_prod:
<config rows here>
sf_orders.sql in dbt_dwh:
{{
config(
materialized = 'table',
alias = 'sf_orders'
)
}}
SELECT * FROM {{ source('salesforce', 'orders') }} WHERE uid IS NOT NULL
revenue_model1.sql in dbt_project1:
{{
config(
materialized = 'table',
alias = 'revenue_model1'
)
}}
SELECT * FROM {{ ref('dbt_dwh', 'sf_orders') }}
My expectation here was that dbt would examine the sf_orders model and see that the default schema for the project it sits in (dbt_dwh) is dwh_dev, so it would construct the object reference as dwh_dev.sf_orders.
However, if you use command dbt run -m revenue_model1 then the default dbt behaviour is to assume all models are located in the default target for dbt_project1, so you get something like:
11:05:03 1 of 1 START sql table model project1_dev.revenue_model1 .................... [RUN]
11:05:04 1 of 1 ERROR creating sql table model project1_dev.revenue_model1 ........... [ERROR in 0.89s]
11:05:05
11:05:05 Completed with 1 error and 0 warnings:
11:05:05
11:05:05 Runtime Error in model revenue_model1 (folder\directory\revenue_model1.sql)
11:05:05 404 Not found: Table database_name.project1_dev.sf_orders was not found
I've got several questions here:
How do you force dbt to use a specific schema on runtime when using dbt ref function?
Is it possible to force dbt to use the default parameters/settings for models inside the dbt_dwh project when this Git repo is installed as a package in another project?
Some points to note:
All objects & schemas listed above sit in the same database
I know that many people recommend mono-repo set-up to avoid exactly this type of scenario, but switching to a mono-repo structure is not feasible right now, as we are already fully invested in multi-repo setup
Although it would be feasible to create source.yml files in each of the dbt projects to reference the output objects of the dbt_dwh project, this feels like duplication of effort and could result in different versions of the same sources.yml file across projects
I appreciate it is possible to hard-code the output schema in the dbt config block, but this removes our ability to test in dev environment/schema for dbt_dwh project

I managed to find a solution so I'll answer my own question in case anybody else runs up against the same issue. Unfortunately this is not documented anywhere that I can find, however, a throw-away comment in the dbt Slack workspace sparked an idea that allowed me to find the/a solution (I'll post the message if I manage to find it again, to give credit where it's due).
To fix this is actually very simple, you just need to add the project being imported to your profiles.yml file and specify the schema. For our use case this is fine as we only have 1 schema we use.
profiles.yml for dbt_project1:
models:
db_project_1:
outputs:
project1_dev:
<configs here>
project1_prod:
<configs here>
dbt_dwh:
+schema: [[schema you want these models to run into]]
<configs here>
The advantages with this approach are:
When you generate/serve dbt docs it allows you to see the upstream lineage from the upstream project
If there are any upstream dependencies in your upstream project you can run this using dbt run -m +model_name (this can be super handy)
If you don't want this behaviour then you can use dbt run -m +model_name --exclude dbt_dwh (for example) to prevent models in your upstream project from running.
I haven't yet figured out if it is possible to use the default parameters/settings for models inside the upstream project (in this case dbt_dwh) but I will edit this answer if I find a way.

Related

How to pass environment variables in gitlab dynamically?

I am working on database deployment using gitlab CICD. Now there are two databases e.g. ABC and XYZ. One team is working on DB ABC and we are working on DB XYZ. Now the logic is same but if we need to pass DB name according to the team in gitlab pipeline, Whats the process fotr that ? for example if team 1 is working they will select DB ABC and all changes will be reflected on ABC and same for the other. I have already set up variables in gitlab-ci.yml but the task is manual as one team has to overwrite name of DB of other team and when it merges to master it chanhges the variable name everytime which is hard to manage .
variables:
DB_NAME_dev: DEMO_DB
DB_NAME_qa: DEMO_DB
DB_NAME_prod: DEMO_DB
Now if team 2 wants to work on their pipeline they have to change the value of DB_NAME_dev to their database which is a manual task. Is there a smart way to select DB name and the pipeline runs only for that database rather than manually editing the DB name ?

How do you pass variables in GitLab?
An alternative is to use Gitlab Variables. Go to your project page, Settings tab -> CI/CD, find Variables and click on the Expand button. Here you can define variable names and values, which will be automatically passed into the gitlab pipelines, and are available as environment variables there.

You can also use the git branch method. Let's say the 'ABC' and 'XYZ' team pushes their code to specific branches (eg. branch starting with 'abc' or 'xyz'). For those, you need to export variables in before_script with only parameter.
Create branch-specific jobs in your CI file:
abc-dev-job:
before_script:
- export DB_NAME_dev: $DEMO_DB_abc
- export DB_NAME_qa: $DEMO_DB_abc
- export DB_NAME_prod: $DEMO_DB_abc
only:
- /^abc/.*$/#gitlab-org/gitlab
xyz-dev-job:
before_script:
- export DB_NAME_dev: $DEMO_DB_xyz
- export DB_NAME_qa: $DEMO_DB_xyz
- export DB_NAME_prod: $DEMO_DB_xyz
only:
- /^xyz/.*$/#gitlab-org/gitlab
This pipeline will only run when Team 'XYZ' or 'ABC' pushes their code to their team-specific branches which might start with the prefix xyz or abc (eg. xyz-dev, xyz/dev, abc-dev, etc.)
And it will use variables accordingly.
Note: you need to define variables in CI/CD settings.
Thank you!

Spring boot flyway migrations

I am using flyway for migrations in my Spring boot application. I have around 5 migration scripts with names in the below fashion:
V1__initialmigrations.sql
V2__alter_message_table.sql
When the migrations run and I see the data in 'flyway_schema_history' table, the data looks good for all migration scripts except the very first one for which under the 'script' column, the value is '<< Flyway Baseline >>' rather than the name of the script unlike other rows. Also, the 'installed_by' column has the value 'null' for this very row while other have the user name that I have in my Spring boot yml file. Also, the 'checksum' is null as well.
The only flyway related properties in the spring env yml file are :
spring:
flyway:
baseline-on-migrate: true
enabled: true
I am not sure if this is the right behavior. Any inputs would be appreciated.

You only use the baseline-on-migrate property if you've created a new baseline after the initial application. Flyway ignores all script with a version below what is set in baseline-version (default 1).
For example in applications i've worked on we had years of migration script backlog. we replaced all of them with a single file with current db structure and enabled baseline-on-migrate with baseline-version: X.1 where our new baseline script was VX_0_0.
See also official documentation on baseline: https://flywaydb.org/documentation/concepts/baselinemigrations

DBT - [WARNING]: Did not find matching node for patch

I keep getting the error below when I use dbt run - I can't find anything on why this error occurs or how to fix it within the dbt documentation.
[WARNING]: Did not find matching node for patch with name 'vGenericView' in the 'models' section of file 'models\generic_schema\schema.sql'

did you by chance recently upgrade to dbt 1.0.0? If so, this means that you have a model, vGenericView defined in a schema.yml but you don't have a vGenericView.sql model file to which it corresponds.

If all views and tables defined in schema are 1 to 1 with model files then try to run dbt clean and test or run afterward.
Not sure what happened to my project, but ran into frustration looking for missing and/or misspelled files when it was just leftovers from different compiled files not cleaned out. Previously moved views around to different schemas and renamed others.

So the mistake is here in the naming:
The model name in the models.yml file should for example be: employees
And the sql file should be named: employees.sql
So your models.yml will look like:
version: 2
models:
- name: employees
description: "View of employees"
And there must be a model with file name: employees.sql

One case when this will happen is if you have the same data source defined in two different schema.yml file (or whatever you call it)

DBT filtering for (None) when running on incremental model

I'm trying to configure a DBT model as materialized='incremental', which is failing as DBT seems to be wrapping my model with a check on (None) or (None) is null which causes the model to throw a SQL exception against the target (Bigquery). The (None) checks don't seem to get added for non-incremental models, or when running with --full-refresh which just re-create the table.
According to the docs, incremental models are supposed to be wrapped as follows:
merge into {{ destination_table }} DEST
using ({{ model_sql }}) SRC
...
However what I'm seeing is:
merge into {{ destination_table }} DEST
using ( select * from( {{ model_sql }} ) where (None) or (None) is null) SRC
...
It's not clear to me where the (None) check are coming from, what it's actually trying to achieve by wrapping the query, and what (if any) model config would need to be set to correct this.
My model's config is set as {{ config(materialized='incremental', alias='some_name') }}, and I've tried also setting unique_key just in case with no luck.
I'm running the model with dbt run --profiles-dir dbt_profiles --models ${MODEL} --target development, and can confirm the compiled model is fine and the (None) checks get added for the model run.
I'm running dbt 0.11.1 (old repo version).
Any help would be most appreciated!

Managed to resolve this by looking into the DBT codebase on github for my target version - incremental macro 0.11
Seems like in 0.11 DBT expects a sql_where config flag to be set, which is used to select which records you want to use for the incremental load (pre-cursor to is_incremental() macro).
In my case, as I just want to load all rows in each incremental run and tag with the load timestamp, Setting sql_where='TRUE' generates valid sequel and doesn't filter my results (ie. WHERE TRUE OR TRUE IS NULL)

have you had an incremental model configured beforehand with 0.11.1? I'm pretty sure you need to use {{ this }} but maybe that didn't exist in version 0.11.1. docs on this

boost build - sources with the same name

src
|--Manager.cpp
|--Specializations
| |--Manager.cpp
Building this Boost.Build tries to create
/bin/...
|--Manager.o
|--Manager.o
but fails. How to resolve this automatically? I read FAQ item, but I don't like the solution, as I have to fix things manually when I have a same class name, but different namespace. Would it be possible to make Boost.Build automatically prefix object file names with directory?
/bin/...
|--Manager.o
|--Specializations.Manager.o
Or duplicate the source directory tree?
/bin/...
|--Manager.o
|--Specializations
| |--Manager.o

This behavior has been changed a long time ago and should just work. Boost.Build now mimics the source structure, i.e. you should get both bin/Manager.o and bin/Specializations/Manager.o.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to specify model schema when referencing another dbt project as a package? (dbt multi-repo setup) - dbt

Related

How to pass environment variables in gitlab dynamically?

Spring boot flyway migrations

DBT - [WARNING]: Did not find matching node for patch

DBT filtering for (None) when running on incremental model

boost build - sources with the same name

Categories

Resources