dbt: how can I run ad_reporting model (only with google_ads source) from fivetran transformation? - dbt

I have a dbt project in a Bitbucket repo, which I connected to a Fivetran transformation.
My deployment.yml file contains:
jobs:
  - name: daily
    targetName: dev
    schedule: 0 12 * * * # Define when this job should run, using cron format. This example will run every day at 12:00pm (according to your warehouse timezone).
    steps:
      - name: run models # Give each step in your job a name. This will enable you to track the steps in the logs.
        command: dbt run
My dbt_project.yml file is:
name: 'myproject'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'fivetran'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_modules"

vars:
  ad_reporting__pinterest_enabled: False
  ad_reporting__microsoft_ads_enabled: False
  ad_reporting__linkedin_ads_enabled: False
  ad_reporting__google_ads_enabled: True
  ad_reporting__twitter_ads_enabled: False
  ad_reporting__facebook_ads_enabled: False
  ad_reporting__snapchat_ads_enabled: False
  ad_reporting__tiktok_ads_enabled: False
  api_source: google_ads ## adwords by default and is case sensitive!
  google_ads_schema: google_ads
  google_ads_database: fivetran
models:
  # disable all models except google_ads
  linkedin:
    enabled: False
  linkedin_source:
    enabled: False
  twitter_ads:
    enabled: False
  twitter_ads_source:
    enabled: False
  snapchat_ads:
    enabled: False
  snapchat_ads_source:
    enabled: False
  pinterest:
    enabled: False
  pinterest_source:
    enabled: False
  facebook_ads:
    enabled: False
  facebook_ads_source:
    enabled: False
  microsoft_ads:
    enabled: False
  microsoft_ads_source:
    enabled: False
  tiktok_ads:
    enabled: False
  tiktok_ads_source:
    enabled: False
  google_ads:
    enabled: True
  google_ads_source:
    enabled: True
My packages.yml file is:
packages:
  - package: fivetran/ad_reporting
    version: 0.7.0
Bottom line:
I have a dbt project that eventually needs to run from a Fivetran transformation.
That means I cannot push the dbt_packages folder; instead I have the packages.yml file, which installs the needed packages via the dbt deps command.
After the packages are installed, dbt run is executed, and since packages.yml contains the ad_reporting package, the run command causes the ad_reporting models to run.
And since we disabled all sources except google_ads in dbt_project.yml, only google_ads will be triggered from ad_reporting.
Now all I want is to run the ad_reporting models with only the google_ads source.
This option is built in and should work.
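Roughly, what I expect to happen boils down to these two commands (a sketch of what I run locally; I don't control exactly how Fivetran wraps them):
dbt deps                        # install the packages listed in packages.yml (ad_reporting and its dependencies)
dbt run --select ad_reporting   # run only the ad_reporting package models; the scheduled Fivetran job just runs dbt run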
However, when I run this command LOCALLY:
dbt run --select ad_reporting
I get this error:
Compilation Error
dbt found two resources with the name "google_ads__url_ad_adapter". Since these resources have
the same name,
dbt will be unable to find the correct resource when ref("google_ads__url_ad_adapter") is
used. To fix this,
change the name of one of these resources:
- model.google_ads.google_ads__url_ad_adapter (models\url_adwords\google_ads__url_ad_adapter.sql)
- model.google_ads.google_ads__url_ad_adapter (models\url_google_ads\google_ads__url_ad_adapter.sql)
And when I manually renamed this file:
dbt_packages\google_ads\models\url_google_ads\google_ads__url_ad_adapter.sql
from google_ads__url_ad_adapter.sql to google_ads1__url_ad_adapter.sql
(just to avoid duplicate file names, as I read in the dbt documentation that file names should be unique even if they are in different folders),
everything worked just fine.
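(For reference, that manual rename is equivalent to something like the following, shown here with forward slashes; I actually renamed the file by hand:)
mv dbt_packages/google_ads/models/url_google_ads/google_ads__url_ad_adapter.sql \
   dbt_packages/google_ads/models/url_google_ads/google_ads1__url_ad_adapter.sql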
But, as I said before, I need this project to run from a Fivetran transformation, not locally.
And when I push this project to its repo, I don't push the dbt_packages folder, since a dbt project should be at most 30 MB in size.
Then, according to the packages.yml file, the dbt deps command is executed, and after that the project can run. BUT, as I showed, I had to change the file name MANUALLY, and since I can't push the dbt_packages folder, dbt deps re-downloads the files, and as you saw, there is a bug: two files come out of the installation with the same name.
That's why, when the Fivetran transformation tries to run the dbt run command, I get this error again:
Compilation Error
dbt found two resources with the name "google_ads__url_ad_adapter". Since these resources have
the same name,
dbt will be unable to find the correct resource when ref("google_ads__url_ad_adapter") is
used. To fix this,
change the name of one of these resources:
- model.google_ads.google_ads__url_ad_adapter (models/url_google_ads/google_ads__url_ad_adapter.sql)
- model.google_ads.google_ads__url_ad_adapter (models/url_adwords/google_ads__url_ad_adapter.sql)
What can I do to make ad_reporting run from a Fivetran transformation without this compilation error? And how is it possible that dbt produces these duplicate file names, after stating in the documentation that file names should be unique?

I found a solution.
As I said, the problem was the unnecessary file dbt_packages\google_ads\models\url_google_ads\google_ads__url_ad_adapter.sql.
So in the deployment.yml file I added a step that deletes the file.
Now the deployment.yml file looks like this (I added the step called 'delete unnecessary file'):
jobs:
  - name: daily
    targetName: dev
    schedule: 0 12 * * * # Define when this job should run, using cron format. This example will run every day at 12:00pm (according to your warehouse timezone).
    steps:
      - name: delete unnecessary file
        command: dbt clean
      - name: run models
        command: dbt run # Enter the dbt command that should run in this step. This example will run all your models.
I also made use of the fact that the dbt clean command removes the paths declared under clean-targets in dbt_project.yml, so I added the problematic folder path there (I added the last line):
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_modules"
  - "dbt_packages/google_ads/models/url_adwords"
Now, after pushing the project, when the Fivetran transformation runs it using the dbt_project.yml file, the first step deletes the duplicate file and then the dbt run command runs just fine.
Problem solved :)

For the ad_reporting package to run, it needs at least 2 data sources.
For more info on setting up ad_reporting, look at this answer from the Fivetran team:
https://github.com/fivetran/dbt_ad_reporting/issues/48

Related

dbt: could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS

I'm trying to connect dbt to BigQuery in VS Code. For that I extracted a BigQuery keyfile JSON that I put into the root directory of my dbt project.
I then created a profiles.yml file that looks as follows:
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: civil-parsec-350114
      dataset: dbt_dataset
      threads: 1
      keyfile: bigquery.json
Database Error
Runtime Error
dbt encountered an error while trying to read your profiles.yml file.
Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
When I use an empty profiles.yml file I get the same error, so I'm not even sure whether that file is being loaded at all. How can I best debug this? What could be the problem?
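For reference, this is the shape of a BigQuery service-account profile I'm aiming for, with the keyfile given as an absolute path (whether the relative path is the actual problem here is just a guess; the project and dataset names are the ones from above):
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: civil-parsec-350114
      dataset: dbt_dataset
      threads: 1
      keyfile: /absolute/path/to/bigquery.json  # absolute path to the service-account key
# note: dbt reads profiles.yml from ~/.dbt/ by default, unless --profiles-dir (or DBT_PROFILES_DIR) points elsewhere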

Gitlab CI sequence of instructions causing circular dependency

I have a CICD configuration that looks something like this:
.rule_template: &rule_configuration
  rules:
    - changes:
        - file/dev/script1.txt
      variables:
        DESTINATION_HOST: somehost1
        RUNNER_TAG: somerunner1
    - changes:
        - file/test/script1.txt
      variables:
        DESTINATION_HOST: somehost2
        RUNNER_TAG: somerunner2

default:
  tags:
    - scripts

stages:
  - lint

deploy scripts 1/6:
  <<: *rule_configuration
  tags:
    - $RUNNER_TAG
  stage: lint
  script: |
    echo "Add linting here!"
....
In short, which runner to choose depends on which file was changed, hence the runner tag has to be decided conditionally. However, these jobs never execute and the value never gets assigned, as I always get:
This job is stuck because you don't have any active runners online or available with any of these tags assigned to them: $RUNNER_TAG
I believe it is because the rules block isn't evaluated, and hence the $RUNNER_TAG variable is not resolved to its actual value, at the point when the job/workflow is initialized and a runner is searched for.
If my suspicion is correct, then it's probably a circular dependency: job initialization requires $RUNNER_TAG, but the resolution of $RUNNER_TAG requires job initialization.
If the above is correct, what is the right way to handle it, and at what stage can I conditionally decide and assign $RUNNER_TAG its value so it doesn't hinder job/workflow initialization?
gitlab-runner --version
Version: 14.7.0
Git revision: 98daeee0
Git branch: 14-7-stable
GO version: go1.17.5
Built: 2022-01-19T17:11:48+0000
OS/Arch: linux/amd64
I think you are overcomplicating what you need to do.
Instead of trying to abstract the tag and dynamically create some variable, simply make each job responsible for registering itself within a pipeline run based on whether a particular file path changed.
It might feel like code duplication, but it actually keeps your CI a lot simpler and easier to understand.
Job1:
  Run when file 1 changes
  Tag: sometag1
Job2:
  Run when some other file changes
  Tag: sometag2
Job3:
  Run when a third, different file changes
  Tag: sometag3
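A sketch of what that could look like, reusing the file paths and tags from your example (the job names are just placeholders):
deploy dev scripts:
  stage: lint
  tags:
    - somerunner1            # static tag, no variable needed
  rules:
    - changes:
        - file/dev/script1.txt
  variables:
    DESTINATION_HOST: somehost1
  script: |
    echo "Add linting here!"

deploy test scripts:
  stage: lint
  tags:
    - somerunner2
  rules:
    - changes:
        - file/test/script1.txt
  variables:
    DESTINATION_HOST: somehost2
  script: |
    echo "Add linting here!"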

GitLab CI - forced start of job during manual start of another job

I have a dependency problem. My pipeline first fetches the dependencies required by the jobs, and finally runs a cleanup stage that removes them all. The problem is that I have one manually launched stage that also needs these dependencies, but by then they are already cleared.
Question: can I somehow run the stage that provides the dependencies by starting the manual stage? Is there any other way I can solve this problem?
The normal behaviour of GitLab CI is to clone the git repository for each job, because jobs can run on different runners and thus need to be independent.
The automatic clone can be disabled by adding:
job-with-no-git-clone:
  variables:
    GIT_STRATEGY: none
If you need to use files/directories created in a previous stage in a job, you must add them as GitLab artifacts:
stages:
  - one
  - two

job-with-git-clone:
  stage: one
  script:
    # this script creates something in the folder data
    # (which means $CI_PROJECT_DIR/data)
    - do_something()
  artifacts:
    paths:
      - data/

job2-with-git-clone:
  stage: two
  script:
    # here you can use the files created in data

job2-with-no-git-clone:
  stage: two
  variables:
    GIT_STRATEGY: none
  script:
    # here you can use the files created in data

In what order is the serverless file evaluated?

I have tried to find out in what order the statements of the serverless file are evaluated (maybe it is more common to say that 'variables are resolved').
I haven't been able to find any information about this and to some extent it makes working with serverless feel like a guessing game for me.
As an example, the latest surprise I got was when I tried to run:
$ sls deploy
serverless.yaml
useDotenv: true

provider:
  stage: ${env:stage}
  region: ${env:region}
.env
region=us-west-1
stage=dev
I got an error message stating that env is not available at the time when stage is resolved. This was surprising to me since I have been able to use env to resolve other variables in the provider section, and there is nothing in the syntax to indicate that stage is resolved earlier.
In what order is the serverless file evaluated?
In effect you've created a circular dependency. Stage is special because it is needed to identify which .env file to load. ${env:stage} is being resolved from ${stage}.env, but Serverless needs to know what ${stage} is in order to find ${stage}.env etc.
This is why it's evaluated first.
Stage (and region, actually) are both optional CLI parameters. In your serverless.yml file what you're setting is a default, with the CLI parameter overriding it where different.
Example:
provider:
  stage: staging
  region: ca-central-1
Running serverless deploy --stage prod --region us-west-2 will result in prod and us-west-2 being used for stage and region (respectively) for that deployment.
I'd suggest removing any variable interpolation for stage and instead setting a default, and overriding via CLI when needed.
Then dotenv will know which environment file to use, and complete the rest of the template.
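A sketch of that setup, assuming region should still come from the .env file once the stage default is known (the default values here are placeholders):
useDotenv: true

provider:
  stage: dev               # plain default; override with: serverless deploy --stage prod
  region: ${env:region}    # resolved from .env after dotenv has loaded it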

Bamboo Spec YAML and location of shared artifacts

In the context of using Gradle to drive the build, testing, and further jobs/stages on a Bamboo server (version 7.2.1), I've configured the environment variable GRADLE_USER_HOME to save the downloaded Gradle binary to a project-local path, with the intent of sharing it with downstream jobs/stages.
But unfortunately Bamboo ignores the "source" or location folder of the artifact. Excerpt from our bamboo.yaml:
Build Java application artifact:
  tasks:
    - script:
        scripts:
          - "export GRADLE_USER_HOME=${bamboo.build.working.directory}/GradleUserHome"
          - ./gradlew --no-daemon assemble
          - "echo GRADLE USER HOME content; ls -al $GRADLE_USER_HOME/; echo '---'" # DEBUG
  artifacts:
    - name: "Gradle Wrapper installation"
      location: GradleUserHome
      pattern: '**/*.*'
      required: true
      shared: true
The debugging output of the echo command shows the expected content.
But the next downstream job shows that the content of the artifact "Gradle Wrapper installation" is installed relative to the project's workspace, and not in the sub-folder ./GradleUserHome as denoted by the location key (just as if that location config item were simply ignored for downstream jobs/stages).
Any ideas how to fix this?
Thanks
PS: Next downstream job exhibits in its log messages something like the following:
Preparing artifact 'Gradle Wrapper installation' for use at /var/atlassian/bamboo-agent02-home/xml-data/build-dir/[...] (location: )
Take notice of empty location!