Is there a way to configure the dbt lineage graph to start with a given set of settings?

I'm using dbt, and as you can imagine, my lineage DAG is fairly large by now.
I have it structured with tags, roughly aligned to bronze, silver, and gold.
However, each time I open the DAG I typically have to go to the tags filter and unselect a few to get to the "working" version of my DAG.
Ideally I'd like the dbt lineage DAG to default to that pared-down set of tag selections, but I don't see how to configure this.
(I'm using dbt Cloud.)
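For context, tagging by medallion layer is typically configured in dbt_project.yml, roughly as in the sketch below (the project and folder names here are assumptions, not taken from the original project):

models:
  my_project:
    staging:
      +tags: ["bronze"]
    intermediate:
      +tags: ["silver"]
    marts:
      +tags: ["gold"]

The tag filter in the lineage graph then works off these values.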


Can I default every dbt command to cautious indirect selection?

I have a dbt project with several singular tests that ref() several models, many of them testing whether a downstream model matches some expected data in an upstream one. Whenever I build only another downstream model that uses the same upstream models, dbt will try to execute those tests against an out-of-date model. Sample visualization:
Model "a" is an upstream view that only makes simple transformations on a source table
Model "x" is a downstream reporting table that uses ref("a")
Model "y" is another downstream reporting table that also uses ref("a")
There is a test "t1" making sure every a.some_id exists in x.some_key
Now if I run dbt build -s +y, "t1" will be picked up and executed; however, "x" is out of date compared to "a" since new data has been pushed into the source table, so the test will fail
If I run dbt build -s +y --indirect-selection=cautious the problem will not happen, since "t1" will not be picked up in the graph.
I want every single dbt command in my project to use --indirect-selection=cautious by default. Looking at the documentation, I've been unable to find any sort of environment variable or YML key in dbt_project.yml that I could use to change this behavior. Setting a new default selector also doesn't help, because it is overridden by the usage of the -s flag. Setting some form of alias does work, but it only affects me and not other developers.
Can I make every dbt build in my project use cautious selection by default, unless the flag --indirect-selection=eager is given?
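For reference, the kind of singular test described above might look roughly like this (a sketch only; the file name is invented, and the columns some_id/some_key are taken from the description):

-- tests/t1_a_ids_exist_in_x.sql (hypothetical file name)
-- fails if any a.some_id is missing from x.some_key
select a.some_id
from {{ ref('a') }} as a
left join {{ ref('x') }} as x
  on a.some_id = x.some_key
where x.some_key is null

Because the test refs both "a" and "x", eager indirect selection attaches it to any selection that includes either parent, which is why dbt build -s +y picks it up even though "x" is not being rebuilt.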

Log tables for scheduled jobs in dbt

I am super new to dbt and wanted to see if there is a way to log the success/failure of scheduled dbt jobs in a table. I am currently using BigQuery as my data warehouse.
Any help would be greatly appreciated. Thanks!
According to dbt's documentation, events are logged to the logs/dbt.log file in your project folder, as well as to stdout in your terminal. As you're dealing with scheduled jobs, the file-based option is the more appropriate one.
You can pass the --debug and --log-format json arguments to your dbt jobs to get structured log messages, which would look like:
dbt --debug --log-format json run
Along with that, you can easily parse the success or failure of your jobs by looking, when available, at the node_info complex field and its node_status subfield. This will give you the results of your jobs when they eventually finish. The node_name subfield will give you the corresponding dbt model.
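As an illustration, a small script along the lines of the sketch below could pull per-node results out of the structured log; the exact JSON layout of node_info varies between dbt versions, so treat the field paths as assumptions to check against your own dbt.log:

import json

results = {}
with open("logs/dbt.log") as f:  # produced with --log-format json
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        # node_info may be nested under "data" or sit at the top level
        data = record.get("data") or {}
        node_info = data.get("node_info") or record.get("node_info") or {}
        if node_info.get("node_status"):
            results[node_info.get("node_name")] = node_info["node_status"]

for name, status in results.items():
    print(name, status)

From there, loading the name/status pairs into a BigQuery table (for example with the google-cloud-bigquery client) would give you the job history you're after.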
For more information, look at the structured logging section of the documentation. For a very detailed view of the jobs' metadata, look at the dbt code that generates it.

ELK stack configuration

I don't have much experience with the ELK stack; I basically only know the basics:
Something (e.g. Filebeat) gets data and sends it to Logstash
Logstash processes it and sends it to Elasticsearch
Kibana uses Elasticsearch to visualise data
(I hope that's correct.)
I need to create an ELK system where data from three different projects is passed, stored and visualised.
Project no. 1 uses MongoDB, and I need to get all the information from one table into Kibana
Project no. 2 also uses MongoDB, and I need to get all the information from one table into Kibana
Project no. 3 uses MySQL, and I need to get a few tables from that database into Kibana
All three of these projects are on the same server.
The thing is, for projects 1 and 2 I need the data flow to be constant (i.e. if a user registers, I can see that in Kibana).
But for project no. 3 I only need the data when I need to generate a report (this project functions as a BI tool of sorts).
So my question is: how does one go about creating an ELK architecture that takes the inputs from these 3 sources and combines them into one ELK project?
My best guess is:
Project No1 -> filebeat -> logstash
Project No2 -> filebeat -> logstash
Project No3 -> logstash
(Logstash here being a single instance that then feeds into Elasticsearch)
Would this be a realistic approach?
I also stumbled upon Redis, and from the looks of it, it can combine all the data sources into one and then feed the output to Logstash.
What would be the better approach?
Finally, I mentioned Filebeat, but from what I understand it basically reads data from a log file. Would that mean that I would have to re-write all my database entries into a log file in order to feed them into Logstash, or can Logstash tap into the DB without an intermediary?
I tried looking for all of this online, but for some reason the internet is a bit scarce on ELK stack beginner questions.
Thanks
Filebeat is used for shipping logs to Logstash; you can't use it for reading items from a DB. But you can read from a DB using Logstash's input plugins.
From what you're describing, you'll need a Logstash instance with 3 pipelines (one per project).
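For instance, the three pipelines could be declared in Logstash's pipelines.yml roughly like this (a sketch; the ids and config paths are invented):

- pipeline.id: project1_mongo
  path.config: "/etc/logstash/conf.d/project1.conf"
- pipeline.id: project2_mongo
  path.config: "/etc/logstash/conf.d/project2.conf"
- pipeline.id: project3_mysql
  path.config: "/etc/logstash/conf.d/project3.conf"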
For project 3 you can use the Logstash JDBC input plugin to connect to your MySQL DB and read new/updated rows based on some "last_updated" column.
The JDBC input plugin has a cron-like schedule configuration value that allows you to set it up to run periodically and read updated rows with an SQL query that you define in the configuration.
For projects 1 and 2 you can also use the JDBC input plugin with MongoDB.
There is also an open-source implementation of a MongoDB input plugin on GitHub. You can check this post on how to use it here.
(see the full list of input plugins here)
If that works for you and you manage to set it up, then the rest will be about the same for all three configurations.
i.e. using filter plugins to modify data, and the Elasticsearch output plugin to push data to an Elasticsearch index.
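To make that concrete, the MySQL pipeline might look roughly like the sketch below (illustrative only; the driver path, connection details, table, tracking column and index name are all assumptions):

input {
  jdbc {
    jdbc_driver_library => "/opt/jdbc/mysql-connector-j.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/project3_db"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # cron-like schedule: poll every 15 minutes
    schedule => "*/15 * * * *"
    # only fetch rows changed since the last run
    statement => "SELECT * FROM report_table WHERE last_updated > :sql_last_value"
    use_column_value => true
    tracking_column => "last_updated"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "project3-reports"
  }
}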

Create sql schema from django models or migrations

I created different models with Django 1.8.
Now, to let other people quickly understand the project, I would like to create an SQL schema from the models or from the migration files (even just from the initial migration file).
Does anyone know how to do this?
You can squash all migrations for a moment, or delete them temporarily, generate a new initial one, and then run the sqlmigrate command:
https://docs.djangoproject.com/en/1.11/ref/django-admin/#django-admin-sqlmigrate
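For example, assuming the app is called myapp (a placeholder name) and the freshly generated migration is 0001_initial, the SQL can be printed with:
python manage.py sqlmigrate myapp 0001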
If you just want to show the database structure to others, I would rather recommend using the graph_models command from django_extensions:
http://django-extensions.readthedocs.io/en/latest/graph_models.html
For example, typing
python manage.py graph_models -a -g -o models.png
creates a graph with the individual models as nodes and their relations as arcs (assuming you have Graphviz installed). You can also create a dot file and render it however you like.
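For instance, to go the dot-file route (output file names are arbitrary):
python manage.py graph_models -a > models.dot
dot -Tpng models.dot -o models.png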

Call a pipeline from a pipeline in Amazon Data Pipeline

My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Any of the integrations offered by the ETL tool we have improved on with our own Python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.
My problem is thus: imagine we have two datasets to load - products and sales. Each of these datasets requires a number of steps to load (get source data, call a Python script to transform, load to Redshift). However, products need to be loaded before sales runs, as we need product cost, etc. to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well-suited to this type of workflow. Appreciate the help.
I think I can relate to this use case. Anyhow, Data Pipeline does not do this kind of dependency management on its own. It can, however, be simulated using file preconditions.
In this approach, your child pipelines may depend on a file being present (as a precondition) before starting. A master pipeline would create trigger files based on some logic executed in its activities. A child pipeline may in turn create other trigger files that start a subsequent pipeline downstream.
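As an illustration, the dependency could be expressed in the child pipeline's definition along these lines (a sketch of a pipeline-definition fragment; the bucket, key, ids and command are invented):

{
  "objects": [
    {
      "id": "ProductsDonePrecondition",
      "type": "S3KeyExists",
      "s3Key": "s3://my-etl-bucket/triggers/products_done"
    },
    {
      "id": "LoadSalesActivity",
      "type": "ShellCommandActivity",
      "command": "python /home/ec2-user/load_sales.py",
      "precondition": { "ref": "ProductsDonePrecondition" },
      "runsOn": { "ref": "MyEc2Resource" }
    }
  ]
}

The master pipeline's final activity would then write s3://my-etl-bucket/triggers/products_done (for example with aws s3 cp) once products have loaded successfully, which releases the sales pipeline.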
Another solution is to use the Simple Workflow product. That has the features you are looking for, but it would need custom coding using the Flow SDK.
This is a basic use case of Data Pipeline and should definitely be possible. You can use the graphical pipeline editor for creating this pipeline. Breaking down the problem:
There are two datasets:
Product
Sales
Steps to load these datasets:
Get source data: Say from S3. For this, use S3DataNode
Call a Python script to transform: use ShellCommandActivity with staging. Data Pipeline does data staging implicitly for S3DataNodes attached to a ShellCommandActivity. You can access them via the special environment variables provided: Details
Load output to Redshift: Use RedshiftDatabase
You will need to add the above components for each of the datasets you need to work with (product and sales in this case). For easy management, you can run these on an EC2 instance.
Condition: 'product' needs to be loaded before 'sales' runs
Add a dependsOn relationship: add this field on the ShellCommandActivity of Sales so that it refers to the ShellCommandActivity of Product. See the dependsOn field in the documentation. It says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start.'
Tip: in most cases, you would not want your next day's execution to start while the previous day's execution is still active (i.e. RUNNING). To avoid such a scenario, use the 'maxActiveInstances' field and set it to '1'.
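Putting the dependency together, the relevant fragment of the pipeline definition might look like the sketch below (ids, commands and the EC2 resource are placeholders):

{
  "objects": [
    {
      "id": "LoadProducts",
      "type": "ShellCommandActivity",
      "command": "python /home/ec2-user/transform_products.py",
      "runsOn": { "ref": "MyEc2Resource" },
      "maxActiveInstances": "1"
    },
    {
      "id": "LoadSales",
      "type": "ShellCommandActivity",
      "command": "python /home/ec2-user/transform_sales.py",
      "runsOn": { "ref": "MyEc2Resource" },
      "dependsOn": { "ref": "LoadProducts" },
      "maxActiveInstances": "1"
    }
  ]
}

With dependsOn in place, LoadSales will not start until LoadProducts reaches the FINISHED state, which gives you the product-before-sales ordering within a single pipeline.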