I am super new to dbt and wanted to see if there is a way to log the success/failure of scheduled dbt jobs in a table. I am currently using BigQuery as my data warehouse.
Any help would be greatly appreciated. Thanks!
According to dbt's documentation, events are logged to the logs/dbt.log file in your project folder, as well as to stdout in your terminal. Since you're dealing with scheduled jobs, the file-based option is the more appropriate one.
You can pass the --debug and --log-format json arguments to your dbt jobs to get structured log messages. The invocation would look like:
dbt --debug --log-format json run
Along with that, you can easily parse the success or failure of your jobs by looking, when available, at the node_info complex field and its node_status subfield. This will give you the results of your jobs once they eventually finish. The node_name subfield will give you the corresponding dbt model.
For more information, look at the structured logging section of the documentation. For a very detailed view of the jobs' metadata, look at the dbt code that generates it.
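If you eventually want those results in a BigQuery table (as in the original question), a minimal sketch of parsing the JSON log and streaming the rows in could look like the one below. It assumes the log sits at logs/dbt.log, that the google-cloud-bigquery client library is installed and authenticated, and that the destination table my_project.monitoring.dbt_run_results (a hypothetical name) already exists:

import json

from google.cloud import bigquery

LOG_PATH = "logs/dbt.log"
TABLE_ID = "my_project.monitoring.dbt_run_results"  # hypothetical table, must already exist

statuses = {}  # last status seen per node
with open(LOG_PATH) as fh:
    for line in fh:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        # node_info nesting differs between dbt versions; check both locations.
        node_info = event.get("data", {}).get("node_info") or event.get("node_info") or {}
        name = node_info.get("node_name")
        status = node_info.get("node_status")
        if name and status:
            statuses[name] = {
                "node_name": name,
                "node_status": status,
                "logged_at": event.get("info", {}).get("ts") or event.get("ts"),
            }

rows = list(statuses.values())
if rows:
    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, rows)  # streaming insert
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

The exact nesting of node_info and the timestamp field varies between dbt versions, so inspect a few lines of your own logs/dbt.log before relying on the field paths above.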
Related
Hello everyone, I would like to ask for your assistance: is there a way I could create an HTML/GUI report upon running the "dbt test" command? Your response is highly appreciated.
This is the result I got upon running dbt test
There are some third-party libraries that will generate reports like the one you're looking for. One you can try is Elementary Data (https://www.elementary-data.com/). When you run dbt tasks, it creates a schema with all of the stats on model execution timings, test results, etc. It then has a command that generates an HTML page from those results, publishing some simple graphs and aggregated test results.
Currently, the GitLab pipelines I execute show the last commit message as their description. But when I need to search for a specific pipeline it becomes time consuming. Can we pass some title to a pipeline before running it? If so, can you please share a sample example?
So I have typical run-of-the-mill logs from Nginx and Tomcat servers, which are just single-line text files in a typical log format. I have changed the Tomcat access logs to output pipe-delimited fields so I can easily process them using some Unix scripts. I'd like to get rid of my Unix scripts and move to using CloudWatch to process my logs in a similar manner; however, I found out that CloudWatch really doesn't understand anything beyond timestamp, message, and logstream by default.
It will add fields using JSON, but JSON is verbose when it comes to log files. I'd like to just let it process a CSV file, which seems like an obvious alternative to JSON. I'm willing to change my log format to meet a requirement like that, but I can't find any information about how I could do that.
Is my only option to translate my logs into JSON in order to add fields to CloudWatch? I am aware of the parse command, but I find it cumbersome to reconstitute my fields every time I want to build a query, especially since these will mostly be access logs with numerous fields. I have the AWS CloudWatch Logs agent set up on my systems and I'm currently sending these logs to CloudWatch.
The closest thing there is to handling space-delimited log files is to use Metric Filters. Or at least that's how the authors of CloudWatch designed it.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
The best examples of this are here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CountOccurrencesExample.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/ExtractBytesExample.html
Not sure if this is going to work for what I'm trying to do with logs, but it's a start. And it's the closest thing to a proper answer. If you want it done right, you gotta do it yo'self.
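For what it's worth, a space-delimited metric filter like the ones in those examples can also be created programmatically. Here's a rough sketch using boto3, where the log group name, metric namespace, and field names are assumptions based on a typical access-log layout:

import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/my/nginx/access",  # hypothetical log group
    filterName="Extract4xxBytes",
    # Space-delimited pattern: name each field, then filter on one of them.
    filterPattern='[ip, user, username, timestamp, request, status_code=4*, bytes]',
    metricTransformations=[
        {
            "metricName": "BytesServedOn4xx",
            "metricNamespace": "MyApp/AccessLogs",
            "metricValue": "$bytes",  # publish the value of the 'bytes' field
            "defaultValue": 0,
        }
    ],
)

Note that metric filters publish metrics rather than adding queryable fields to the log events themselves, so Logs Insights queries would still rely on parse.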
I am currently evaluating Flyway as a deployment option for our company. We run our database deployments on an Oracle database and currently spool the output from a SQL*Plus session for logging purposes. We use this to verify feedback such as whether objects were created successfully, whether packages, functions, etc. compiled without errors, the number of records entered, and so forth.
Is there similar logging functionality in Flyway? Currently the only logging we have found is in the server logs. We can tell from these logs that a script has completed successfully or has triggered an ORA error, but we are curious whether this is the extent of the database logging options or not.
Thank you,
We used the command-line method of running Flyway and turned on debug output (-X). Along with a lot of other output, it also logs more information about the SQL migrations run (e.g. the content of repeatable migrations) and the number of records affected. This is not perfect, but it helped us a lot in capturing more information about what was applied.
See https://flywaydb.org/documentation/commandline/; the option is not documented under each individual command because it applies to Flyway itself.
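If you want to keep that output the way you currently keep SQL*Plus spool files, one option is simply to capture it yourself when invoking the command line. A rough sketch (the invocation, the log file naming, and the ORA- check are illustrative assumptions, not Flyway features):

import subprocess
from datetime import datetime

# Run the migration with debug output and spool everything to a timestamped file.
log_path = f"flyway_migrate_{datetime.now():%Y%m%d_%H%M%S}.log"

result = subprocess.run(
    ["flyway", "-X", "migrate"],
    capture_output=True,
    text=True,
)

with open(log_path, "w") as fh:
    fh.write(result.stdout)
    fh.write(result.stderr)

# Flag any Oracle errors that surfaced in the output, similar to reviewing a spool file.
ora_errors = [line for line in result.stdout.splitlines() if "ORA-" in line]
if result.returncode != 0 or ora_errors:
    print(f"Migration problems detected, see {log_path}")
    for line in ora_errors:
        print(line)

The point is simply that the -X output is plain text you can archive and scan much like a spool file.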
My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Any integrations offered by the ETL tool we have since improved with our own Python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.
My problem is thus: imagine we have two datasets to load - products and sales. Each of these datasets requires a number of steps to load (get source data, call a python script to transform, load to Redshift). However, product needs to be loaded before sales runs, as we need product cost, etc to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well-suited to this type of workflow. Appreciate the help
I think I can relate to this use case. Anyhow, Data Pipeline does not do this kind of dependency management on its own. It can, however, be simulated using file preconditions.
With this approach, your child pipelines depend on a file being present (as a precondition) before starting. A master pipeline would create trigger files based on some logic executed in its activities, and a child pipeline may create other trigger files that start subsequent pipelines downstream.
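As a rough illustration of that approach, the first activity of the child (sales) pipeline could be gated on a trigger file like this. All object ids, the S3 key convention, and the EC2 resource reference are assumptions, and the schedule plus the rest of the definition are omitted; this uses the pipelineObjects format that boto3's put_pipeline_definition accepts:

# Sketch: objects you might add to the child (sales) pipeline's definition so that
# it waits for a trigger file written by the master/product pipeline.
child_objects = [
    {
        "id": "ProductDonePrecondition",
        "name": "ProductDonePrecondition",
        "fields": [
            {"key": "type", "stringValue": "S3KeyExists"},
            # Trigger file the upstream (product) pipeline writes when it finishes.
            {"key": "s3Key", "stringValue": "s3://my-bucket/triggers/product_done.trg"},
        ],
    },
    {
        "id": "LoadSalesActivity",
        "name": "LoadSalesActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "python load_sales.py"},
            {"key": "runsOn", "refValue": "MyEc2Resource"},
            # The activity will not start until the trigger file exists.
            {"key": "precondition", "refValue": "ProductDonePrecondition"},
        ],
    },
]

These objects would then be uploaded as part of the child pipeline's full definition (via put_pipeline_definition or the console's pipeline editor).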
Another solution is to use the Simple Workflow product. That has the features you are looking for, but it would need custom coding using the Flow SDK.
This is a basic use case of Data Pipeline and should definitely be possible. You can use its graphical pipeline editor to create this pipeline. Breaking down the problem:
There are two datasets:
Product
Sales
Steps to load these datasets:
Get source data: Say from S3. For this, use S3DataNode
Call a Python script to transform: use ShellCommandActivity with staging. Data Pipeline does data staging implicitly for S3DataNodes attached to a ShellCommandActivity, and you can access the staged data through the special environment variables it provides: Details
Load output to Redshift: use a RedshiftDatabase connection object (together with a RedshiftDataNode and RedshiftCopyActivity to perform the actual copy)
You will need to add the above components for each of the datasets you need to work with (product and sales in this case). For easy management, you can run these on an EC2 instance.
Condition: 'product' needs to be loaded before 'sales' runs
Add a dependsOn relationship: add this field to the ShellCommandActivity for Sales so that it refers to the ShellCommandActivity for Product (see the sketch after the tip below). The dependsOn field in the documentation says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start'.
Tip: In most cases, you would not want the next day's execution to start while the previous day's execution is still active, i.e. RUNNING. To avoid such a scenario, use the 'maxActiveInstances' field and set it to '1'.
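Putting dependsOn and maxActiveInstances together, a minimal sketch of just the two activities could look like the following. It is expressed in the pipelineObjects format that boto3's put_pipeline_definition accepts; the ids, commands, EC2 resource, and pipeline id are assumptions, and the schedule, S3 data nodes, and Redshift objects from the steps above are left out for brevity:

import boto3

# Sketch: 'LoadSales' will not start until 'LoadProduct' reaches FINISHED,
# and maxActiveInstances=1 stops a new day's run from overlapping the previous one.
activities = [
    {
        "id": "LoadProduct",
        "name": "LoadProduct",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "python transform_product.py"},
            {"key": "runsOn", "refValue": "MyEc2Resource"},
            {"key": "maxActiveInstances", "stringValue": "1"},
        ],
    },
    {
        "id": "LoadSales",
        "name": "LoadSales",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "python transform_sales.py"},
            {"key": "runsOn", "refValue": "MyEc2Resource"},
            # Sales waits for Product to reach FINISHED.
            {"key": "dependsOn", "refValue": "LoadProduct"},
            {"key": "maxActiveInstances", "stringValue": "1"},
        ],
    },
]

boto3.client("datapipeline").put_pipeline_definition(
    pipelineId="df-XXXXXXXXXXXX",  # placeholder pipeline id
    pipelineObjects=activities,
)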