Why is dbt running a model that is not explicitly targeted in the dbt run statement?

I have a dbt project that consists mostly of models for views over Snowflake external tables. Each view model is triggered concurrently with a separate dbt run statement:
dbt run --models model_for_view_1
I have one other model in the dbt project, which materializes to a table that uses these views. I trigger this model in a separate DAG in Airflow using the same dbt run statement as above. It uses no ref or source statement that connects it to the views.
I noticed recently that this table model gets built by dbt whenever I build the view models. I thought it was because dbt was inferring that it was a referenced model, but after some experimentation, in which I even set the table model's SQL to something like SELECT 1+1 AS column1, it was still getting built. I have placed it in a different folder in the dbt project, renamed the file, etc. No joy. I have no idea why running the other models causes this unrelated model to be built. The only connection to the view models is that they share the same schema in the database. What is triggering this model to be built?

Selection syntax can be finicky, because there are many ways to select the models. From the docs:
The --select flag accepts one or more arguments. Each argument can be one of:
a package name
a model name
a fully-qualified path to a directory of models
a selection method (path:, tag:, config:, test_type:, test_name:)
(note that --models was renamed --select in v0.21, but --models has the same behavior)
So my guess is that your model_for_view_1 name isn't unique, and is shared either with your project (which acts as a package in this context) or with the directory it sits in.
So if your project looks like:
models
|- some_name
   |- some_name.sql      # the view
   |- another_name.sql   # the table
dbt run --models some_name will run the code in both some_name.sql and another_name.sql, since it is selecting the directory called some_name.
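To check what a selector will actually match before running anything, dbt ls accepts the same selection flags, and the path: method lets you pin the selection to a single file (the path below is a placeholder for wherever the view model actually lives):
dbt ls --models model_for_view_1
dbt run --models path:models/some_name/some_name.sql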

Would you be able to share a bit more context here?
Which version of dbt is your project on?
Would it be possible to share how the models look (while removing any sensitive information)?
It is rather difficult to tell what is triggering this unexpected behaviour without this information.

Related

Can I default every dbt command to cautious indirect selection?

I have a dbt project with several singular tests that ref() several models, many of them testing if a downstream model matches some expected data in the upstream. Whenever I build only another downstream model that uses the same upstream models, dbt will try to execute the tests with an out-of-date model. Sample visualization:
Model "a" is an upstream view that only makes simply transformations on a source table
Model "x" is a downstream reporting table that uses ref("a")
Model "y" is another downstream reporting table that also uses ref("a")
There is a test "t1" making sure every a.some_id exists in x.some_key
Now if I run dbt build -s +y, "t1" will be picked up and executed; however, "x" is out of date compared to "a", since new data has been pushed into the source table, so the test will fail
If I run dbt build -s +y --indirect-selection=cautious the problem will not happen, since "t1" will not be picked up in the graph.
I want every single dbt command in my project to use --indirect-selection=cautious by default. Looking at the documentation, I've been unable to find any sort of environment variable or YML key in dbt_project that I could use to change this behavior. Setting a new default selector also doesn't help, because it is overridden by the usage of the -s flag. Setting some form of alias does work, but it only affects me and not other developers.
Can I make every dbt build in my project use cautious selection by default, unless the flag --indirect-selection=eager is given?
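For context, a singular test like "t1" might look roughly like this (the file name and column names are illustrative, not taken from the real project). Because it refs both "a" and "x", eager indirect selection includes it whenever either parent is selected, which is exactly why it runs on dbt build -s +y:
-- tests/assert_every_a_id_in_x.sql (hypothetical singular test)
-- fails if any a.some_id is missing from x.some_key
select a.some_id
from {{ ref('a') }} as a
left join {{ ref('x') }} as x on a.some_id = x.some_key
where x.some_key is null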

ADF: Using ForEach and Execute Pipeline with Pipeline Folder

I have a folder of pipelines, and I want to execute the pipelines inside the folder using a single pipeline. There will be times when there will be another pipeline added to the folder, so creating a pipeline filled with Execute Pipelines is not an option (well, it is the current method, but it's not very "automate-y" and adding another Execute Pipeline whenever a new pipeline is added is, as you can imagine, a pain). I thought of the ForEach Activity, but I don't know what the approach is.
I have not tried this approach, but I think you can use the ADF REST API to get the details of all the pipelines that need to be executed. Since the response is JSON, you can write it back to a temp blob and filter it down to what you need.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/list-by-factory?tabs=HTTP
You can then use the Create Run API to trigger each pipeline.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/create-run?tabs=HTTP
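Roughly, the two calls look like this (placeholders in braces, api-version per the docs above); the list response should also include each pipeline's folder, which you could filter on to keep only the pipelines in your folder:
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelines?api-version=2018-06-01
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelines/{pipelineName}/createRun?api-version=2018-06-01
The POST body (JSON) carries any pipeline parameters for that run.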
As Joel called out, if different pipelines have different parameter counts, it will be a little messy to maintain.
Folders are really just organizational structures for the code assets that describe pipelines (the same goes for Datasets and Data Flows); they have no real substance or purpose inside the executing environment. This is why pipeline names have to be globally unique rather than unique to their containing folder.
Another problem you are going to face is that the Execute Pipeline activity is not very dynamic. The pipeline name has to be known at design time, and while parameter values are dynamic, the parameter names are not. For these reasons, you can't have a ForEach loop that dynamically executes child pipelines.
If I were tackling this problem, it would be through an external pipeline management system that you would have to build yourself. This is not trivial, and in your case would have additional challenges because of the folder level focus.

U-SQL: How to merge two U-SQL files with the same import statement

I want to deploy multiple table creation scripts as one ADLA job to save on cost. I am using packages to get a set of defined partition keys for all tables. When I try to deploy the merged script, it complains that the import statement is declared multiple times and fails.
While I can still deploy the scripts one by one, I wanted to see if we can merge them for faster deployment.
Thanks
Amit
I am not sure I completely get your scenario. If you want to deploy a single object by itself, then that file needs to include all the dependencies (e.g., your package). If you want to deploy several objects, you should include the dependencies only once.
You probably should set up something that generates your script from the underlying "fragments". One fragment would be the reference to the package, the other fragments would be the creation of one object. And your deployment system would concatenate the files as needed.
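For example, a minimal sketch of that concatenation step (the fragment file names here are hypothetical):
# header.usql holds the shared IMPORT PACKAGE / REFERENCE statements, exactly once
cat header.usql create_table_1.usql create_table_2.usql > deploy_all.usql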

Create SQL schema from Django models or migrations

I created different models with Django 1.8.
Now, to give other people a quick overview, I would like to create an SQL schema from the models or from the migration files (even just from the initial migration file).
Does anyone know how to do this?
You can squash all migrations for a moment, or delete them for a sec and generate a new initial one, and then run the sqlmigrate command:
https://docs.djangoproject.com/en/1.11/ref/django-admin/#django-admin-sqlmigrate
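For example, assuming your app is named myapp and the initial migration is 0001 (both placeholders):
python manage.py sqlmigrate myapp 0001
This prints the SQL that the migration would execute against the database configured in your settings.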
If you just want to show the database structure to others, I would rather recommend using the graph_models command from django_extensions:
http://django-extensions.readthedocs.io/en/latest/graph_models.html
For example, typing
python manage.py graph_models -a -g models.png
creates a graph with the individual models as nodes and their relations as arcs (assuming you have Graphviz installed). You can also create a dot-file and render it however you like.
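For instance, something along these lines (the dot command assumes Graphviz is installed and on your PATH):
python manage.py graph_models -a > models.dot
dot -Tpng models.dot -o models.png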

Do I use Snapshot file, migration file or data annotations in my EF Core to update database?

I'm trying to understand the different migration paths we can choose when developing an ASP.NET Core 1.0 application with EF Core. When I created my first Core application, I noticed it generated an ApplicationDbContextModelSnapshot class that uses a ModelBuilder to build the model.
Then I read that if I need to add a table to the database, I need to create the new model and run the command line to generate the migration file and update the database. Ok, I get it up to this point.
But when I do that, I notice that the ApplicationDbContextModelSnapshot class gets updated too.
1) Does that mean I cannot modify this ApplicationDbContextModelSnapshot class since it looks like it gets regenerated each time?
2) Should I use data annotations to build my model, or should I use the Fluent API, which tells me to build my model in the ApplicationDbContext class? Huh, another file that builds the model?
I'm seeing three different ways of working with the database here: the snapshot class, data annotations, and the Fluent API. I'm confused because today I made a mistake in my last migration file, so I deleted the file, dropped the database, and reran the database update.
But by doing that I got errors similar to:
The index 'IX_Transaction_GiftCardId' is dependent on column 'GiftCardId'.
ALTER TABLE ALTER COLUMN GiftCardId failed because one or more objects access this column.
So naturally I was wondering if I had to modify the ApplicationDbContextModelSnapshot class.
What is the path I should be taking when it comes to migrations or database updates? These three paths are confusing me.
I have run into this issue before when I create migrations, make model changes, create new migrations, and try to update the database. The root cause is that keys are being changed while the relationships that depend on them are not dropped and re-added, or do not exist.
You have two options:
Easy Method
The easiest way is also the most destructive way and only possible in a dev environment.
Delete all migrations, drop the database, create new migrations and run 'update-database'.
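With the Package Manager Console, that is roughly the following (after deleting the files in the Migrations folder by hand; the dotnet ef CLI has equivalent commands, and exact command availability depends on your tooling version):
Drop-Database
Add-Migration InitialCreate
Update-Database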
Hard/Safest Method
This is the most time-consuming method. I recommend doing this in a local integration branch first, pushing it to a remote integration branch, and then to production.
Open the migration file, e.g. 20160914173357_MyNewMigration.cs.
Drop all indexes in order
Drop/Add/Edit table schemas
Add all indexes back.
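Inside the migration's Up method, those steps might look roughly like this for the index from the error above (a sketch only; the column type and names are illustrative):
migrationBuilder.DropIndex(name: "IX_Transaction_GiftCardId", table: "Transaction");
migrationBuilder.AlterColumn<int>(name: "GiftCardId", table: "Transaction", nullable: false);
migrationBuilder.CreateIndex(name: "IX_Transaction_GiftCardId", table: "Transaction", column: "GiftCardId");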
For either method, just be sure to test and test again.
Do not modify ApplicationDbContextModelSnapshot. It is a design-time artifact, and should only be modified in the case of a merge conflict.
To update the model, always use data annotations or the fluent API.
For more information on the EF Migrations workflow, see Code First Migrations. It's for EF6, but most of the information is still relevant.