Can I default every dbt command to cautious indirect selection? - dbt

I have a dbt project with several singular tests that ref() several models, many of them testing whether a downstream model matches some expected data in the upstream. Whenever I build only another downstream model that uses the same upstream models, dbt will try to execute these tests against an out-of-date model. Sample visualization:
Model "a" is an upstream view that only makes simply transformations on a source table
Model "x" is a downstream reporting table that uses ref("a")
Model "y" is another downstream reporting table that also uses ref("a")
There is a test "t1" making sure every a.some_id exists in x.some_key
Now if I run dbt build -s +y, "t1" will be picked up and executed; however, "x" is out of date compared to "a" (new data has been pushed into the source table since "x" was last built), so the test will fail
If I run dbt build -s +y --indirect-selection=cautious the problem will not happen, since "t1" will not be picked up in the graph.
I want every single dbt command in my project to use --indirect-selection=cautious by default. Looking at the documentation, I've been unable to find any environment variable or YML key in dbt_project.yml that I could use to change this behavior. Setting a new default selector also doesn't help, because it is overridden by the usage of the -s flag. Setting some form of shell alias does work, but it only affects me and not the other developers.
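For reference, this is roughly what that selector attempt looks like - a minimal sketch of a selectors.yml, where the selector name and the fqn value are only illustrative, and which (as noted above) only applies when no -s/--select flag is passed on the command line:

selectors:
  - name: default_cautious
    description: "Select everything, but only run tests whose parents are all selected"
    default: true
    definition:
      method: fqn
      value: "*"
      indirect_selection: cautious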
Can I make every dbt build in my project use cautious selection by default, unless the flag --indirect-selection=eager is given?

Related

ADF: Using ForEach and Execute Pipeline with Pipeline Folder

I have a folder of pipelines, and I want to execute the pipelines inside the folder using a single pipeline. There will be times when there will be another pipeline added to the folder, so creating a pipeline filled with Execute Pipelines is not an option (well, it is the current method, but it's not very "automate-y" and adding another Execute Pipeline whenever a new pipeline is added is, as you can imagine, a pain). I thought of the ForEach Activity, but I don't know what the approach is.
I have not tried this approach, but I think we can use the ADF REST API to get the details of all the pipelines that need to be executed. Since the response is JSON, you can write it back to a temp blob and filter it down to what you need.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/list-by-factory?tabs=HTTP
You can use the Create Run API to trigger the pipeline.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/create-run?tabs=HTTP
As Joel called out, if different pipelines have different numbers of parameters, it will be a little messy to maintain.
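If it helps, here is a rough, untested sketch of those two calls from the command line; the subscription, resource group, factory and pipeline names are placeholders, and you would need a valid bearer token for the management endpoint:

# List every pipeline in the factory (the response should include each pipeline's folder, which you can filter on)
curl -H "Authorization: Bearer $TOKEN" \
  "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<factoryName>/pipelines?api-version=2018-06-01"

# After filtering the JSON down to the pipelines in the folder you care about, trigger each one
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{}' \
  "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<factoryName>/pipelines/<pipelineName>/createRun?api-version=2018-06-01"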
Folders are really just organizational structures for the code assets that describe pipelines (same for Datasets and Data Flows), they have no real substance or purpose inside the executing environment. This is why pipeline names have to be globally unique rather than unique to their containing folder.
Another problem you are going to face is that the "Execute Pipeline" activity is not very dynamic. The pipeline name has to be known at design time, and while parameter values are dynamic, the parameter names are not. For these reasons, you can't have a ForEach loop that dynamically executes child pipelines.
If I were tackling this problem, it would be through an external pipeline management system that you would have to build yourself. This is not trivial, and in your case would have additional challenges because of the folder level focus.

Why is DBT running a model that is not being targeted explicitly in the DBT run statement?

I have a DBT project that consists mostly of models for views over Snowflake external tables. Every view model is triggered concurrently with a separate dbt run statement:
dbt run --models model_for_view_1
I have one other model in the dbt project which materializes to a table that uses these views. I trigger this model in a separate DAG in Airflow using the same dbt run statement as above. It uses no ref or source statement that connects it to the views.
I noticed recently that this table model is getting built by DBT whenever I build the view models. I thought it was because DBT was making an inference that this was a referenced model, but after some experimentation, in which I even set the table model's SQL to something like SELECT 1+1 as column1, it was still getting built. I have placed it in a different folder in the dbt project, renamed the file, etc. No joy. I have no idea why running the other models is causing this unrelated model to be built. The only connection to the view models is that they share the same schema in the database. What is triggering this model to be built?
Selection syntax can be finicky, because there are many ways to select the models. From the docs:
The --select flag accepts one or more arguments. Each argument can be one of:
a package name
a model name
a fully-qualified path to a directory of models
a selection method (path:, tag:, config:, test_type:, test_name:)
(note that --models was renamed --select in v0.21, but --models has the same behavior)
So my guess is that your model_for_view_1 name isn't unique, and is either shared with your project (acting as a package in this case) or the directory that it is in.
So if your project looks like:
models
|- some_name
   |- some_name.sql      # the view
   |- another_name.sql   # the table
dbt run --models some_name will run the code in both some_name.sql and another_name.sql, since it is selecting the directory called some_name.
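If that is what's happening, one way to check (without renaming anything) is to select by path instead of by name - a sketch, assuming the layout above:

dbt run --models path:models/some_name/some_name.sql

If that builds only the view while dbt run --models some_name builds both, the name/directory collision is the culprit, and renaming either the folder or the model should resolve it.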
Would you be able to share a bit more context here?
Which version of dbt is your project on?
Would it be possible to share how the models look (while removing any sensitive information)?
It is rather difficult to tell what is triggering this unexpected behaviour without this information.

Setting "Flags" for a run in Atlassian Bamboo

When looking at the list of recent builds for a Plan in Bamboo, there is a column in the summary called "Flags". It contains values like "Custom build" (if you started the build manually using "Run customized") and "Custom revision" (if you ran the build against something other than the head of master).
Is it possible to add a step to my Plan that potentially puts my own value in this field for the current build? In particular, I want to have logic that does this based on a build variable, basically saying "if this variable equals this value, then add this custom tag to this build", and have the tag show up in that column of the summary, letting me see whether each build used that specific value for the build variable.
I believe what you may be looking for will require two plugins created by Atlassian:
Conditional Tasks for Bamboo. This will allow you to trigger tasks based on variables.
Inject Variables. This will allow you to set certain variables in your tasks which can even be picked up by the other plugin.

Is there a way to make Gitlab CI run only when I commit an actual file?

New to Gitlab CI/CD.
What is the proper construct to use in my .gitlab-ci.yml file to ensure that my validation job runs only when a "real" checkin happens?
What I mean is, I observe that the moment I create a merge request, say—which of course creates a new branch—the CI/CD process runs. That is, the branch creation itself, despite the fact that no files have changed, causes the .gitlab-ci.yml file to be processed and pipelines to be kicked off.
Ideally I'd only want this sort of thing to happen when there is actually a change to a file, or a file addition, etc.—in common-sense terms, I don't want CI/CD running on silly operations that don't actually really change the state of the software under development.
I'm passably familiar with except and only, but these don't seem to be able to limit things the way I want. Am I missing a fundamental category or recipe?
I'm afraid what you ask is not possible within Gitlab CI.
There could be a way to use the CI_COMMIT_SHA predefined variable since that will be the same in your new branch compared to your source branch.
Still, the pipeline will run before it can determine or compare SHAs in a custom script or condition.
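As a rough illustration of that idea (a sketch only - the job name and validation script are placeholders, and the default branch is used here as a stand-in for the source branch; as noted, the pipeline still starts, the job just short-circuits when the branch tip is identical to it):

validate:
  script:
    - git fetch origin "$CI_DEFAULT_BRANCH"
    - |
      if [ "$CI_COMMIT_SHA" = "$(git rev-parse FETCH_HEAD)" ]; then
        echo "Branch tip matches $CI_DEFAULT_BRANCH - nothing new to validate"
        exit 0
      fi
    - ./run_validation.sh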
GitLab runs pipelines for branches or tags, not commits. Pushing to a repo triggers a pipeline, and branching is in fact pushing a change to the repo.

TeamCity: Managing deployment dependencies for acceptance tests?

I'm trying to configure a set of build configurations in TeamCity 6 and am trying to model a specific requirement in the cleanest possible manner enabled by TeamCity.
I have a set of acceptance tests (around 4-8 suites of tests grouped by the functional area of the system they pertain to) that I wish to run in parallel (I'll model them as build configurations so they can be distributed across a set of agents).
From my initial research, it seems that having an AcceptanceTests meta-build config that pulls in the set of individual acceptance test configs via snapshot dependencies should do the trick. Then all I have to do is say that my Commit build config should trigger AcceptanceTests and they'll all get pulled in. So, let's say I also have AcceptanceSuiteA, AcceptanceSuiteB and AcceptanceSuiteC.
So far, so good (I know I could also turn it around the other way and cause the Commit config to trigger AcceptanceSuiteA, AcceptanceSuiteB and AcceptanceSuiteC - problem there is I need to manually aggregate the results to determine the overall success of the acceptance tests as a whole).
The complicating bit is that while AcceptanceSuiteC just needs some Commit artifacts and can then live on its own, AcceptanceSuiteA and AcceptanceSuiteB need to:
DeploySite (let's say it takes 2 minutes and I can't afford to spin up a completely isolated one just for this run)
Run tests against the deployed site
The problem is that I need to be able to ensure that:
the website only gets configured once
The website does not get clobbered while the two suites are running
If I set up DeploySite as a build config and have AcceptanceSuiteA and AcceptanceSuiteB pull it in as a snapshot dependency, AFAICT:
a subsequent or parallel run of AcceptanceSuiteB could trigger another DeploySite which would clobber the deployment that AcceptanceSuiteA and/or AcceptanceSuiteB are in the middle of using.
While I can say "Limit the number of simultaneously running builds" to force only one to happen at a time, I need it to be not just one at a time, but also not while the dependent pieces are still running.
Is there a way in TeamCity to model such a hierarchy?
EDIT: Ideas:-
A crap solution is that DeploySite could set an 'in use' flag marker and then have the AcceptanceTests config clear that flag [after AcceptanceSuiteA and AcceptanceSuiteB have completed]. The problem then becomes one of having the next DeploySite down the pipeline wait until said gate has been opened again (doing a blocking wait within the build doesn't feel right - I want it to be flagged as 'not yet started' rather than looking like it's taking a long time to do something). However, this sort of "stuff a flag over here and have this bit check it" approach is the sort of mutable state / flakiness smell I'm trying to get away from.
EDIT 2: if I could programmatically alter the agent configuration, I could set Agent Requirements to require InUse=false and then set the flag when a deploy starts and clear it after the tests have run
Seems you should go look on the JetBrains DevNet and the YouTrack tracker first (and remember to use the magic word clobber in your search).
Then you install groovy-plug and use the StartBuildPrecondition facility:
To use the feature, add system.locks.readLock. or system.locks.writeLock. property to the build configuration.
The build with writeLock will only start when there are no builds running with read or write locks of the same name.
The build with readLock will only start when there are no builds running with write lock of the same name.
Use the locks therein to manage the fact that the dependent configs 'read' and the DeploySite config 'writes' the shared item.
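Concretely, per the quoted description, the intent would be something like the following (AcceptanceSite is just an example lock name, and - per the EDIT below - I haven't confirmed the exact parameter placement or format):

DeploySite (the writer):                             system.locks.writeLock.AcceptanceSite
AcceptanceSuiteA / AcceptanceSuiteB (the readers):   system.locks.readLock.AcceptanceSite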
(This is not a full productised solution hence the tracker item remains open)
EDIT: And I still don't know whether the lock should be under Build Parameters | System Properties, and what the exact name format should be - is it locks.writeLock.MYLOCKNAME (i.e., showing up in config with the reference syntax %system.locks.writeLock.MYLOCKNAME%)?
Other puzzlers are: how does one manage giving read access to builds triggered by the completion of a writeLock task - does the lock get dropped until the next one picks it up (which would allow another writer in), or is it necessary to have something queue up the parent and child dependency at the same time?