I am new to dbt and SQL in general. I am building an incremental model in dbt that runs daily. I have a table on Snowflake that looks like this, for example:
| Nat_K1 | Nat_k2 | Product | Supplier | Component | Metadata |
|--------|--------|---------|----------|-----------|----------|
| 6231 | ~~ | Toy Car | ToysRus | Base | {Hash:2bbd4cb604298556a40f16c50218b226, Load_time: 01:01:2023} |
| 6231 | ~~ | Toy Car | ToysRus | Wheels | {Hash:fd827e6fe9105024dfc5e58d21cde9bd, Load_time: 01:01:2023} |
| 6231 | ~~ | Toy Car | ToysRus | Remote | {Hash:6dfddb68e3219fa66af182f19b1c2ddf, Load_time: 01:01:2023} |
I have a view that gets data from my source. The incremental model is a CTAS using this view. I want to make sure no duplicates are added. How do I make dbt check the hash values already in the table before inserting? The hash is computed from Product, Supplier and Component and is stored in the Metadata variant column.
For example, if I do dbt run the next day, this row would be added, which I want to avoid:
| Nat_K1 | Nat_k2 | Product | Supplier | Component | Metadata |
|--------|--------|---------|----------|-----------|----------|
| 6231 | ~~ | Toy Car | ToysRus | Base | {Hash:2bbd4cb604298556a40f16c50218b226, Load_time: 02:01:2023} |
I am unsure of the best approach to handle this.
Incremental models will automatically update records based on the unique_key. However, if you want to keep the timestamp, you have to get a little more creative.
Option 1: Join with self
{{
config({
"materialized": 'incremental',
"unique_key": 'HASH_FIELD',
})
}}
SELECT
S.HASH_FIELD,
S.NAT_K1,
S.NAT_K2,
S.PRODUCT,
S.SUPPLIER,
S.COMPONENT,
{% if is_incremental() %}
IFNULL(T.LOAD_TIME, CURRENT_TIMESTAMP()) LOAD_TIME
{% else %}
CURRENT_TIMESTAMP() LOAD_TIME
{% endif %}
FROM {{ ref('source') }} S
{% if is_incremental() %}
LEFT JOIN {{ this }} T ON S.HASH_FIELD = T.HASH_FIELD
{% endif %}
Option 2: Use not exists
{{
config({
"materialized": 'incremental',
"unique_key": 'HASH_FIELD',
})
}}
SELECT
S.HASH_FIELD,
S.NAT_K1,
S.NAT_K2,
S.PRODUCT,
S.SUPPLIER,
S.COMPONENT,
CURRENT_TIMESTAMP() LOAD_TIME
FROM {{ ref('source') }} S
{% if is_incremental() %}
WHERE NOT EXISTS (SELECT 1 FROM {{ this }} WHERE HASH_FIELD = S.HASH_FIELD)
{% endif %}
Use option 1 if you need to change other fields that may have updated but you don't want to change the load time. Use option 2 if you want to only load new records.
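To see what the NOT EXISTS filter in option 2 does to a repeated daily load, here is a minimal sketch in Python using the standard-library sqlite3 module instead of Snowflake and dbt. The table names target and staging, the shortened hash values and the load_new_rows helper are all assumptions made for the illustration; they are not part of dbt.

```python
import sqlite3

def load_new_rows(conn, batch, load_time):
    """Insert only the batch rows whose hash is not in target yet; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    # Hypothetical staging table standing in for the source view.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS staging (hash_field TEXT, component TEXT)")
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)
    # Same anti-join idea as option 2: skip hashes that already exist.
    conn.execute(
        """
        INSERT INTO target (hash_field, component, load_time)
        SELECT s.hash_field, s.component, ?
        FROM staging s
        WHERE NOT EXISTS (
            SELECT 1 FROM target t WHERE t.hash_field = s.hash_field
        )
        """,
        (load_time,),
    )
    after = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    return after - before

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (hash_field TEXT, component TEXT, load_time TEXT)")

# Shortened, made-up hashes; the second day repeats the 'Base' row.
day1 = [("2bbd", "Base"), ("fd82", "Wheels")]
day2 = [("2bbd", "Base"), ("6dfd", "Remote")]

inserted_day1 = load_new_rows(conn, day1, "2023-01-01")
inserted_day2 = load_new_rows(conn, day2, "2023-01-02")
rows = conn.execute(
    "SELECT hash_field, load_time FROM target ORDER BY hash_field"
).fetchall()
print(inserted_day1, inserted_day2, rows)
```

The repeated 'Base' row keeps its original load time because the second run never touches it; only the new hash is inserted.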
EDIT: I assume the timestamp you're looking to use is not available in the source table. However, if it is, then you just want to use the common strategy for incremental models which is to look for the last timestamp of the load in the source table. Something like this:
SELECT ...
FROM {{ ref('source') }} S
{% if is_incremental() %}
WHERE S.LOAD_TIME > (SELECT MAX(LOAD_TIME) FROM {{ this }})
{% endif %}
I have created a page that should be accessible only to customers who have purchased a certain product. My problem is that I am not sure how to retrieve the product IDs from a customer's orders.
For example, customer A purchased products A and B,
so customer A should be able to access pages A and B.
I tried this, but it retrieves the order IDs, not the product IDs of the orders made by the customer:
{% for order in customer.orders %}
{{ order.id }}
{% endfor %}
The order object in Shopify exposes the purchased products through its line_items. You should do something like this:
{% for order in customer.orders %}
{% for line_item in order.line_items %}
{{ line_item.product_id }}
{% endfor %}
{% endfor %}
You can find information about line_item object in the official documentation: https://help.shopify.com/themes/liquid/objects/line_item
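The nested loop above amounts to flattening orders into a set of product IDs and then checking membership. A rough Python model of that logic follows; the dictionaries only mimic the shape of the Liquid objects, and the IDs and the purchased_product_ids helper are invented for the illustration.

```python
def purchased_product_ids(orders):
    """Collect the product IDs from every line item of every order."""
    return {
        item["product_id"]
        for order in orders
        for item in order["line_items"]
    }

# Made-up data shaped like customer.orders / order.line_items.
orders = [
    {"id": 1001, "line_items": [{"product_id": 11}, {"product_id": 22}]},
    {"id": 1002, "line_items": [{"product_id": 22}]},
]

owned = purchased_product_ids(orders)
can_view_page_a = 11 in owned  # page A is gated on product 11 in this sketch
print(owned, can_view_page_a)
```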
In an attempt to teach myself how to program, I'm making a web app (Flask, SQLAlchemy, Jinja) to display all the books I've ever ordered from Amazon.
In the "barest bones" possible way, I'm trying to learn how to replicate http://pinboard.in—that's my paragon. I have no idea how his site runs so fast: I can load 160 bookmark entries—all with associated tags—in, I dunno, 500 ms? ... which is why I know I am doing something wrong, as discussed below.
In any case, I created a many-to-many relationship between my books Class and my tag Class such that a user can (1) click on a book and see all its tags, as well as (2) click on a tag and see all associated books. Here is my table architecture:
Entity relationship diagram
Here is the code for the relationship between the two Classes:
assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)
class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))
    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))
    def __init__(self, title, auth, comment, date_read, era, url, notable):
        self.title = title
        self.auth = auth
        self.comment = comment
        self.date_read = date_read
        self.era = era
        self.url = url
        self.notable = notable
class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))
the problem
If I iterate through the books table only (~400 rows), the query runs and renders to the browser in lightning speed. No problem there.
{% for i in book_query %}
<li>
{{i.notable}}{{i.notable}}
{{i.title}}, {{i.auth}}
{{i.era}} {{i.date_read}}
{% if i.comment %}
<p>{{i.comment}}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
If, however, I want to show any and all tags associated with a book, I change the code by nesting a for loop as follows:
{% for i in book_query %}
<li>
{{i.notable}}{{i.notable}}
{{i.title}}, {{i.auth}}
{{i.era}}
{% for ii in i.tagged %}
{{ii.tag_name}}
{% endfor %}
{{i.date_read}}
{% if i.comment %}
<p>{{i.comment}}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
The query slows down significantly (takes about 20 seconds). My understanding is that this is happening because for every row in the book table, my code is iterating through the entire assoc table (i.e., "full table scan").
discussion (or, "what i think is happening")
Obviously, I am a complete noob—I've been programming for ~3 months. It's motivating just to get things working, but I realize I have large gaps in my knowledge base that I am trying to fill as I go along.
Right off the bat, I can appreciate that it's incredibly inefficient that, with each new book, the code is iterating through the entire association table (if that's indeed what's happening, which I believe it is). I think I need to cluster(?) or sort(?) the assoc table in such a way that once I retrieve all tags for book with book_id == 1, I never again "check" the rows with book_id == 1 in the assoc table.
In other words, what I think is happening is this (in computerspeak):
Oh, he wants to know how the book with book_id == 1 in books table has been tagged
Okay, let me go to the assoc table
Row #1 ... Is book_id in assoc table equal to 1?
Okay, it is; then what is the tag_id for Row #1? ... [then computer goes to tag table to get tag_name, and returns it to the browser]
Row #2 ... is book_id in assoc table equal to 1?
Oh, no, it isn't ... okay, go to Row #3
Hmmmm, because my programmer is stupid and didn't make this table sorted or indexed in some way, I'm going to have to go through the entire assoc table looking for book_id == 1 when perhaps there aren't any more ...
Then, once we get to book_id == 2 in the books table the computer gets really mad:
Okay, he wants to know all the tags that go with book_id == 2
Okay, let me go to the assoc table
Row #1 ... wait a second ... didn't I check this one already?? Holy sh#t, I have to do this all over again??
Dammit ... okay ... Row #1 ... is book_id == 2? (I know it isn't! But I have to check anyway because my programmer is a dum-dum ...)
questions
So the question is, can I (1) sort(?) or cluster(?) the assoc table in some way that ensures more "intelligent" traversal through the assoc table, or, as a friend of mine suggested, do I (2) "learn to write good SQL queries"? (Note, I've never learned SQL since I've been handling everything with SQLAlchemy ... damn Alchemists ... enshrouding their magics in secret and whatnot.)
final words
Thanks for any input. If you have any suggestions to help me improve how I ask questions on stackoverflow (this is my first post!) please let me know.
Most of the answer is in the question.
In the first example, a single SQL query is executed when you iterate through the books table. In the second example, an additional query against assoc is executed for every Book, so the page issues about 400 SQL queries, and the round trips add up. You can view them in your app debug log if you set the SQLALCHEMY_ECHO config parameter:
app.config['SQLALCHEMY_ECHO'] = True
Or you can install Flask-DebugToolbar and look at these queries in web interface.
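The difference between the two access patterns can be reproduced without Flask or SQLAlchemy. The sketch below uses Python's standard-library sqlite3 module and its trace callback to count the statements each pattern runs. The schema mirrors the books, tags and assoc tables from the question, while the sample titles and tag names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Ulysses');
    INSERT INTO tags  VALUES (1, 'scifi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement from here on

# Pattern 1: one query for the books, then one more query per book.
lazy = {}
for book_id, title in conn.execute("SELECT book_id, title FROM books").fetchall():
    tag_rows = conn.execute(
        "SELECT t.tag_name FROM assoc a JOIN tags t ON t.tag_id = a.tag_id "
        "WHERE a.book_id = ? ORDER BY t.tag_name",
        (book_id,),
    ).fetchall()
    lazy[title] = [name for (name,) in tag_rows]
n_plus_one_count = len(statements)

statements.clear()

# Pattern 2: a single joined query, grouped by book in Python.
joined = {}
for title, tag_name in conn.execute(
    "SELECT b.title, t.tag_name FROM books b "
    "LEFT JOIN assoc a ON a.book_id = b.book_id "
    "LEFT JOIN tags t ON t.tag_id = a.tag_id "
    "ORDER BY b.book_id, t.tag_name"
):
    tag_list = joined.setdefault(title, [])
    if tag_name is not None:  # LEFT JOIN yields NULL for a book without tags
        tag_list.append(tag_name)
join_count = len(statements)

print(n_plus_one_count, join_count, joined == lazy)
```

With three books the first pattern runs four statements and the second runs one; with about 400 books the gap grows to roughly 401 versus 1.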
The best approach to handle this problem is to learn SQL basics; you will need them anyway when your applications grow larger. Try to write a more optimized query in pure SQL. For your case it may look like this:
SELECT books.*, tags.tag_name FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
Then try to rewrite it in SQLAlchemy code and then group by book before passing to HTML renderer:
# Single query to get all books and their tags
# (Book.tagged is used because SQLAlchemy 2.0 no longer accepts the string form join('tagged'))
query = db.session.query(Book, Tag.tag_name).join(Book.tagged)
# Dictionary of data to be passed to renderer
books = {}
for book, tag_name in query:
    book_data = books.setdefault(book.book_id, {'book': book, 'tags': []})
    book_data['tags'].append(tag_name)
# Rendering HTML
return render_template('yourtemplate.html', books=books)
Template code will look like this:
{% for book in books.values() %}
<li>
{{ book.book.notable }}{{ book.book.notable }}
{{ book.book.title }}, {{ book.book.auth }}
{{ book.book.era }}
{% for tag in book.tags %}
{{ tag }}
{% endfor %}
{{ book.book.date_read }}
{% if book.book.comment %}
<p>{{ book.book.comment }}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
Another approach
If your database is PostgreSQL, you can write a query like this:
SELECT books.title, books.auth (...), array_agg(tags.tag_name) as book_tags FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
GROUP BY books.title, books.auth (...)
In this case you get each book's data with its tags already aggregated into an array. SQLAlchemy lets you build the same query:
from sqlalchemy import func
books = db.session.query(Book, func.array_agg(Tag.tag_name)).\
    join(Book.tagged).group_by(Book).all()
return render_template('yourtemplate.html', books=books)
And template has the following structure:
{% for book, tags in books %}
<li>
{{ book.notable }}{{ book.notable }}
{{ book.title }}, {{ book.auth }}
{{ book.era }}
{% for tag in tags %}
{{ tag }}
{% endfor %}
{{ book.date_read }}
{% if book.comment %}
<p>{{ book.comment }}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
The following implementation, adapted from @Sergey-Shubin, was a workable solution to this question:
classes & table association declaration
assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)
class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))
    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))
class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))
def construct_dict(query):
    # Each result row is a (Book, Tag) pair, one row per book/tag combination,
    # so rows are grouped by book_id to collect all tags of a book in one entry.
    books_dict = {}
    for book, tag in query:
        book_data = books_dict.setdefault(book.book_id, {'bookkey': book, 'tagkey': []})
        if tag is not None:  # the outer join yields None for a book without tags
            book_data['tagkey'].append(tag)
    return books_dict
route, sql-alchemy query
@app.route('/query')
def query():
    query = db.session.query(Book, Tag).outerjoin(Book.tagged)  # query to get all books and their tags
    books_dict = construct_dict(query)
    return render_template("query.html", query=query, books_dict=books_dict)
If your query has a lot of books, fetching the tags for each book one by one in a separate SQL statement will kill your response time in network I/O.
One way to optimize that, if you know you always will need the tags for this query, is to hint SQLAlchemy to fetch all the dependent tags in one query either via join or subquery.
I don't see your query, but my guess is a subquery load would work best for your use case:
from sqlalchemy.orm import subqueryload
session.query(Book).options(subqueryload(Book.tagged)).filter(...).all()
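The idea behind this kind of eager loading is to replace the per-book lookups with one extra query that fetches the tags for all books at once. Below is a rough sketch of that idea with the standard-library sqlite3 module. It is not the SQL that SQLAlchemy actually emits, and the sample titles and tag names are invented.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Ulysses');
    INSERT INTO tags  VALUES (1, 'scifi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")

# Query 1: the books themselves.
books = conn.execute("SELECT book_id, title FROM books ORDER BY book_id").fetchall()

# Query 2: the tags for all of those books in one round trip.
book_ids = [book_id for book_id, _ in books]
placeholders = ", ".join("?" for _ in book_ids)
tag_rows = conn.execute(
    "SELECT a.book_id, t.tag_name FROM assoc a "
    "JOIN tags t ON t.tag_id = a.tag_id "
    f"WHERE a.book_id IN ({placeholders}) ORDER BY t.tag_name",
    book_ids,
).fetchall()

# Attach each tag to its book in memory.
tags_by_book = defaultdict(list)
for book_id, tag_name in tag_rows:
    tags_by_book[book_id].append(tag_name)

result = [(title, tags_by_book[book_id]) for book_id, title in books]
print(result)
```

The number of statements stays at two regardless of how many books the first query returns.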
I am a novice when it comes to coding.
I was wondering if you could help me write some code. When a customer creates an account on the website, I want them to add their date of birth. That date of birth then needs to be added automatically to the invoices that I have set up to be auto-generated.
How do I create one of these for the date of birth?
{{ shipping_address.city }} (this is the code on the invoice; it finds the customer's address and inserts it into the invoice). How can I do the same for the date of birth? Do I have to create a snippet?
The Shopify help tells how to do this. In fact birthday is one of their examples.
As far as getting that on the invoice you would follow the information about cart attributes.
However the trick is hooking these up so that the customer only has to enter their info once. This is easy enough with a scripted app but if you want to just collect this on the order and then use it for future orders then use a cart attribute only.
Then force the customer to register/login before you let them go to the checkout and use the following to pull the date of birth from the most recent order that has one:
<label>What is your Date of Birth?</label>
{% assign dob = '' %}
{% for pastOrder in customer.orders %}
{% for attr in pastOrder.attributes %}
<p>{{ attr | first }}: {{ attr | last }}</p>
{% assign attrName = attr | first %}
{% if attrName == 'Date of Birth' %}{% assign dob = attr | last %}{% endif %}
{% endfor %}
{% endfor %}
<input type="text" name="attributes[Date of Birth]" value="{{ dob }}">
I have two models, A and B, and one A instance may be referenced by several B instances:
class A(Model):
    name = CharField(...)
class B(Model):
    name = CharField(...)
    a = ForeignKey(A, related_name='all_B')
In the view for model A, I want to show how many B objects reference each A.
For now I do this:
args={'a_all': A.objects.all()}
...
{% for a in a_all %}
{{a.name}} : {{ a.all_B.count }} <br>
{% endfor %}
But this runs one SQL query per A object, which is a problem when the tables hold many rows.
So I want to fetch all the counts in a single query.
select_related can't be used in this case, because it works only for one-to-one and many-to-one relations, not for one-to-many.
The only thing that comes to mind is to add a counter field to A:
class A(Model):
    name = CharField(...)
    b_count = PositiveIntegerField(...)
And update it whenever the relation changes. But that means a lot of work to catch every change when many views add, delete or rewrite the "a" field of the B model.
Try this:
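from django.db.models import Count
a_all = A.objects.annotate(b_count=Count('all_B'))
This adds a b_count attribute to every A object. Because the foreign key declares related_name='all_B', the reverse lookup passed to Count is 'all_B' rather than 'b'.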
Then in your template you can do something like
{% for a in a_all %}
{{a.name}} : {{ a.b_count }} <br>
{% endfor %}
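The annotation boils down to a single LEFT JOIN with GROUP BY and COUNT. Here is a sketch of roughly equivalent SQL, run through Python's standard-library sqlite3 module. The table and column names a, b and a_id and the sample rows are assumptions for the illustration; the SQL Django generates for your models will differ in naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT, a_id INTEGER REFERENCES a(id));
    INSERT INTO a VALUES (1, 'first'), (2, 'second'), (3, 'empty');
    INSERT INTO b VALUES (1, 'b1', 1), (2, 'b2', 1), (3, 'b3', 2);
""")

# One query returns every A row together with the number of B rows pointing at it.
# COUNT(b.id) ignores the NULLs produced by the LEFT JOIN, so unreferenced rows get 0.
counts = conn.execute("""
    SELECT a.name, COUNT(b.id) AS b_count
    FROM a
    LEFT JOIN b ON b.a_id = a.id
    GROUP BY a.id, a.name
    ORDER BY a.id
""").fetchall()
print(counts)
```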
try:
A.objects.annotate(b_count=Count('all_B'))
Then each instance of A exposes the count as a_instance.b_count.