I want to build a dbt model that returns the minimum timestamp for each user_id.
My source table is a versioning table with several billion rows. Hence, I want to build it as an incremental model.
In the source data, version_id is unique and each user_id can have several records. Here is what it looks like:
with t1 (version_id, user_id, updated_at, is_activated) as (
select * from values
(1, 'a', '2022-12-01'::timestamp, true)
, (2, 'b', '2022-12-02'::timestamp, true)
, (3, 'b', '2022-12-03'::timestamp, false)
-- , (4, 'a', '2022-12-03'::timestamp, true)
)
select * from t1
When a user_id gets another row (e.g. the commented-out 4th row) with a newer timestamp, I don't want dbt to update the old row, since the first row already holds the minimum timestamp.
How can I ensure the new rows do not overwrite the old ones that hold the minimum timestamp per user_id?
{{
config(
materialized='incremental',
unique_key='version_id'
)
}}
WITH filtered AS (
    SELECT
        version_id
        , user_id
        , updated_at
    FROM {{ ref('versions') }}
    WHERE is_activated = TRUE
    {% if is_incremental() %}
        AND updated_at > (SELECT MAX(this.updated_at) FROM {{ this }} AS this)
    {% endif %}
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at) = 1
)
SELECT * FROM filtered
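For reference, one way to keep the stored minimum per user untouched is to key the model on user_id and, in the incremental branch, only bring in user_ids that are not yet in the target, so existing rows are never merged over. This is only a sketch, and it assumes late-arriving rows never carry an earlier updated_at than the one already stored:

{{
    config(
        materialized='incremental',
        unique_key='user_id'
    )
}}

SELECT
    version_id
    , user_id
    , updated_at
FROM {{ ref('versions') }}
WHERE is_activated = TRUE
{% if is_incremental() %}
    -- skip users already present so their stored minimum is never overwritten
    AND user_id NOT IN (SELECT user_id FROM {{ this }})
{% endif %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at) = 1

With that filter only first-time users are inserted, so the merge on user_id can never replace an earlier minimum; on a multi-billion-row source, an anti-join may scale better than NOT IN.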
Related
I have multiple tables that I am inserting into, and I would like one of the tables to be partitioned after inserting into it so that I can determine the most updated IDs and label their ACTIVE status. My SQL for one of the tables that already contains my data looks as follows:
CREATE TABLE IF NOT EXISTS MY_TABLE
(
LINK_ID BINARY NOT NULL,
LOAD TIMESTAMP NOT NULL,
SOURCE STRING NOT NULL,
SOURCE_DATE TIMESTAMP NOT NULL,
ORDER BIGINT NOT NULL,
ID BINARY NOT NULL,
ATTRIBUTE_ID BINARY NOT NULL
);
INSERT ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY
)
SELECT *
FROM TEST_TABLE;
Currently, I am inserting records into the table whenever any of the columns change. I have extended the table to include an ACTIVE column and defaulted it to TRUE for the records already in the table.
ALTER TABLE MY_TABLE ADD COLUMN ACTIVE BOOLEAN DEFAULT TRUE;
When a new record is inserted that indicates a change in one of the column values, I want the ACTIVE value for that new record to be TRUE within its ID group, while the ACTIVE value for the other records in the ID group is changed to FALSE (the previous records would no longer be considered active, while the most recent record, indicated by the highest ORDER value, would be active/TRUE).
At first, I tried doing:
INSERT ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY,
ACTIVE
)
SELECT *, OFFSET_NUMBER = MAX(OFFSET_NUMBER) OVER (PARTITION BY O_KEY) AS ACTIVE
FROM TEST_TABLE;
However, this does not seem to change the records for each ID group that already exist in the table to false when a new record is inserted that is considered to be the most recent active record. Is there a way I can run this select statement below after all the new records are inserted, but still have it be in the same statement that contains the insertion process?
SELECT *, ORDER = MAX(ORDER) OVER (PARTITION BY ID) AS ACTIVE
FROM MY_TABLE
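For reference, a conditional multi-table INSERT can only add rows; it cannot flip the flag on rows that are already in the table, so this typically ends up as a second statement run right after the insert. Below is only a sketch of such a follow-up MERGE, assuming (ID, "ORDER") uniquely identifies a row and quoting ORDER because it is a reserved word:

-- Recompute which row is the latest per ID, then flip the flags in one pass.
MERGE INTO MY_TABLE t
USING (
    SELECT
        ID,
        "ORDER",
        "ORDER" = MAX("ORDER") OVER (PARTITION BY ID) AS SHOULD_BE_ACTIVE
    FROM MY_TABLE
) s
    ON t.ID = s.ID AND t."ORDER" = s."ORDER"
WHEN MATCHED THEN
    UPDATE SET ACTIVE = s.SHOULD_BE_ACTIVE;

An alternative is to not store ACTIVE at all and instead derive it in a view with the same MAX("ORDER") OVER (PARTITION BY ID) expression.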
I cannot figure out how to update data on conflict when the column site_id can be either NULL or NOT NULL. As it stands, the insert either creates a new row instead of merging into one, or I get an error that the constraints do not match the ON CONFLICT specification.
The query has to update the data whether site_id is NULL or NOT NULL.
I tried two queries, which really should be one; with only one of them I get multiple inserted rows instead of a single merged row.
This is what I tried
I added the unique indexes
ALTER TABLE candidates DROP CONSTRAINT candidates_day_study_id_site_id_status_unique;
CREATE UNIQUE INDEX candidates_4col_uni_idx ON candidates ("day", study_id, site_id, status)
WHERE site_id IS NOT NULL OR site_id IS NULL;
CREATE UNIQUE INDEX candidates_3col_uni_idx ON candidates ("day", study_id, status)
WHERE
site_id IS NULL;
Then my queries
INSERT INTO candidates ("day", study_id, site_id, status, total, CURRENT)
VALUES('2020-01-01T00:00:00.000Z', 'ABC', NULL, 'INCOMPLETE', 4, 4) ON CONFLICT ("day", study_id, site_id, status)
WHERE
site_id IS NOT NULL
OR site_id IS NULL DO
UPDATE
SET
total = EXCLUDED.total,
CURRENT = EXCLUDED.current;
INSERT INTO candidates ("day", study_id, status, total, current)
VALUES('2020-01-01T00:00:00.000Z', 'ABCDE', 'INCOMPLETE', 5, 5) ON CONFLICT ("day", study_id, status)
where site_id is null do
UPDATE
SET
total = EXCLUDED.total,
current = EXCLUDED.current;
I have spent hours trying to find a solution with my indexes, but I either end up with the error or with multiple rows instead of an updated one.
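For reference, one pattern that avoids juggling two partial indexes is a single unique expression index that folds NULL into a sentinel, with an ON CONFLICT target that repeats the same expression. This is only a sketch, assuming site_id is an integer and that -1 can never be a real site_id:

-- One index covers both the NULL and NOT NULL cases.
CREATE UNIQUE INDEX candidates_upsert_uni_idx
    ON candidates ("day", study_id, COALESCE(site_id, -1), status);

-- A single upsert works for rows with or without a site_id.
INSERT INTO candidates ("day", study_id, site_id, status, total, current)
VALUES ('2020-01-01T00:00:00.000Z', 'ABC', NULL, 'INCOMPLETE', 4, 4)
ON CONFLICT ("day", study_id, COALESCE(site_id, -1), status)
DO UPDATE SET
    total = EXCLUDED.total,
    current = EXCLUDED.current;

With this, both the NULL and NOT NULL inserts can go through the same statement. On PostgreSQL 15+, a unique index declared with NULLS NOT DISTINCT removes the need for the sentinel entirely.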
I am new to dbt and I am trying to write a data model that inserts its result into an existing table on Snowflake. The code below, without the config, runs smoothly on Snowflake, but it doesn't work in dbt. It seems INSERT statements are not supported in dbt?
{{ config(materialized = 'view') }}
INSERT INTO "v1" ("ID", "Value","Set",ROWNUM )
with LD as(
select "ID",
"Value",
"Set",
ROW_NUMBER() OVER ( PARTITION BY "ID" order by "Set" desc ) as rownum
from "Archive"."Prty" l
where l."Prty" = 'Log' AND "ID"= 111
),
LD2 as (
select "ID",
"Value",
"Set",
ROWNUM
from LD where ROWNUM = 1
)
SELECT * FROM LD2
You could use an incremental model to achieve this all in the same table. You can also use the qualify clause to remove the need for the second CTE. I am assuming ID should be unique, but refer to the dbt documentation on incremental models and modify this if that is not the case.
{{
config(
materialized='incremental',
unique_key='ID'
)
}}
select
"ID",
"Value",
"Set"
from "Archive"."Prty" as l
where l."Prty" = 'Log' and "ID" = 111
qualify row_number() over (partition by "ID" order by "Set" desc) = 1
This will insert any rows that satisfy the above query and have an ID not already in the table. Otherwise, it will update any rows where the ID already exists.
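For context, on Snowflake dbt's default incremental strategy compiles a model like this into a merge on the unique_key. Roughly (a simplified sketch, not the exact statement dbt generates; the target and staging relation names are assumed):

MERGE INTO analytics.my_model AS target
USING (
    -- the model's SELECT above, staged by dbt in a temporary relation
    SELECT "ID", "Value", "Set" FROM analytics.my_model__dbt_tmp
) AS source
    ON target."ID" = source."ID"
WHEN MATCHED THEN UPDATE SET
    "Value" = source."Value",
    "Set" = source."Set"
WHEN NOT MATCHED THEN INSERT ("ID", "Value", "Set")
    VALUES (source."ID", source."Value", source."Set");

So no hand-written INSERT is needed; dbt owns the DML for the table it materializes.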
I would like to filter an aggregated array depending on all values associated with an id. The values are strings and can be of three types: all-x:y, x:y and empty (here x and y are arbitrary substrings of values).
I have a few conditions:
If an id has x:y then the result should contain x:y.
If an id always has all-x:y then the resulting aggregation should have all-x:y
If an id sometimes has all-x:y then the resulting aggregation should have x:y
For example with the following
WITH
my_table(id, my_values) AS (
VALUES
(1, ['all-a','all-b']),
(2, ['all-c','b']),
(3, ['a','b','c']),
(1, ['all-a']),
(2, []),
(3, ['all-c']),
),
The result should be:
(1, ['all-a','b']),
(2, ['c','b']),
(3, ['a','b','c']),
I have worked multiple hours on this but it seems like it's not feasible.
I came up with the following, but it feels like it cannot work because I cannot check for the presence of all-x in all arrays, which is what would go in <<IN ALL>>:
SELECT
id,
SET_UNION(
CASE
WHEN SPLIT_PART(my_table.values,'-',1) = 'all' THEN
CASE
WHEN <<my_table.values IN ALL>> THEN my_table.values
ELSE REPLACE(my_table.values,'all-')
END
ELSE my_table.values
END
) AS values
FROM my_table
GROUP BY 1
I would need to check that all array values for a specific id contain all-x, and that is where I'm struggling to find a solution.
After a few hours of searching how to do this, I am starting to believe that it is not feasible.
Any help is appreciated. Thank you for reading.
This should do what you want:
WITH my_table(id, my_values) AS (
VALUES
(1, array['all-a','all-b']),
(2, array['all-c','b']),
(3, array['a','b','c']),
(1, array['all-a']),
(2, array[]),
(3, array['all-c'])
),
with_group_counts AS (
SELECT *, count(*) OVER (PARTITION BY id) group_count -- to see if the number of all-X occurrences match the number of rows for a given id
FROM my_table
),
normalized AS (
SELECT
id,
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'), -- if its an all-X value and every original row for the given id contains it ...
value,
if(starts_with(value, 'all-'), substr(value, 5), value)) AS extracted
FROM with_group_counts CROSS JOIN UNNEST(with_group_counts.my_values) t(value)
)
SELECT id, array_agg(DISTINCT extracted)
FROM normalized
GROUP BY id
The trick is to compute the number of total rows for each id in the original table via the count(*) OVER (PARTITION BY id) expression in the with_group_counts subquery. We can then use that value to determine whether a given value should be treated as an all-x or the x should be extracted. That's handled by the following expression:
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'),
value,
if(starts_with(value, 'all-'), substr(value, 5), value))
For more information about window functions in Presto, check out the documentation, which also covers UNNEST.
I have multiple rows with a parent ID that associates related rows. I want to select the email address where Status = 'active' for each parent ID, and if there are multiple rows with that condition, I want to pick the most recently modified (CreateDate).
Basically I have two+ records, parent ID 111. The first record has m#x.com with a status of 'active', and the second record has m#y.com with a status of 'unsubscribed'. How do I select just ID 111 with m#x.com?
How would I go about this?
Table Data:
ID ParentID Email Status CreateDate
1000919 1000919 xxx#gmail.com bounced 2/5/18
1017005 1000919 yyy#gmail.com active 1/6/18
1002868 1002868 sss#gmail.com active 12/31/17
1002868 1002868 www#gmail.com active 12/31/17
1002982 1002982 uuu#gmail.com held 2/7/18
1002982 1002982 iii#gmail.com held 2/7/18
1002990 1002990 ooo#gmail.com active 10/26/18
1003255 1003255 ppp#gmail.com active 2/7/18
Expected Result:
ParentID Email Status CreateDate
1000919 yyy#gmail.com active 1/6/18
1002868 sss#gmail.com active 12/31/17
I hope this is what you need:
SELECT * FROM table
WHERE parent_id IN
(SELECT id FROM users WHERE status = 'active')
ORDER BY createdate DESC LIMIT 1;
Ordering by createdate in descending order allows you to select only the last n rows, where n is set by LIMIT.
This is really hard with no primary key and duplicate rows. You have not defined the answer for when a parentid has two rows on the same date with different emails. CreateDate could be a datetime field, which would likely be unique; without that we must use >=. This will do it, though:
SELECT DISTINCT a.* FROM [Temp] a JOIN [Temp] b
    ON a.parentid = b.parentid
    AND a.createdate >= b.createdate
    AND a.status = 'active' AND b.status = 'active'
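A window-function alternative avoids the self-join and lets you make the tiebreaker explicit. This is just a sketch against the [Temp] table above, using ID as the tiebreaker only because nothing else in the sample data distinguishes rows with the same CreateDate:

SELECT ParentID, Email, Status, CreateDate
FROM (
    SELECT *,
           -- rank active rows per parent, newest first
           ROW_NUMBER() OVER (PARTITION BY ParentID
                              ORDER BY CreateDate DESC, ID) AS rn
    FROM [Temp]
    WHERE Status = 'active'
) ranked
WHERE rn = 1;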
OK, so based on the information in the question:
A self-reference field exists on the table.
The status has to be active.
If two or more records exist in the table for the same parent, take the latest.
I added a fourth condition: if all of the above match, take the highest email.
Table to create:
create table #tmp
(id int identity,
name varchar(50),
email varchar(50),
status varchar(20),
add_date datetime,
mod_date datetime,
account_id int)
Populating the table:
insert into #tmp
(name,email,status, add_date, mod_date, account_id)
values ('Cesar', 'Cesar#hotmail.com', 'Active', '20180101', '20180103', 1),
('manuel', 'manuel#hotmail.com', 'Active', '20180103', '20180103', 1),
('feliz', 'feliz#hotmail.com', 'Inactive', '20180103', '20180105', 1),
('lucien', 'lucien#hotmail.com', 'Active', '20180105', '20180105', 2),
('norman', 'norman#hotmail.com', 'Active', '20180110', '20180110', 2),
('tom', 'tom#hotmail.com', 'Active', '20180110', '20180115', 3),
('peter', 'peter#hotmail.com', 'inactive', '20180101', '20180110', 3),
('john', 'john#hotmail.com', 'Active', '20180101', '20180105', 3)
Visualization:
select *
from #tmp as a
where status = 'Active' and
exists (select
account_id
from #tmp as b
where
b.status = a.status
group by
account_id
having
MAX(b.mod_date) = a.mod_date and
a.email = MAX(b.email))
Using EXISTS here is faster than filtering with a plain subquery, since otherwise the table would be pulled back in full.