Result-set inconsistency between Hive and Hive LLAP

We are using two Hive 3.1.x clusters on HDI 4.0, one running LLAP and the other plain Hive (no LLAP).
We created a managed table on both clusters with a row count of 272409.
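The counts below were produced with a validation query of roughly this shape (table and column names are assumptions, not our actual DDL):

SELECT
  order_created_date,
  COUNT(*)                 AS col_count,
  COUNT(DISTINCT order_id) AS col_distinct_count,  -- order_id is an assumed key column
  MIN(last_modified_date)  AS min_lmd,             -- "lmd" = last-modified date, assumed name
  MAX(last_modified_date)  AS max_lmd
FROM orders_managed                                -- assumed managed (ACID) table name
WHERE order_created_date = '20200615'
GROUP BY order_created_date;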
Before merge on both clusters
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-26 23:42:17.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
Based on the delta, we'd perform a merge operation (which updates 17 rows).
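The merge is roughly of this shape (table, column, and key names are assumptions):

MERGE INTO orders_managed AS t                 -- assumed ACID target table
USING orders_delta AS s                        -- assumed staging table holding the delta
  ON t.order_id = s.order_id                   -- assumed merge key
WHEN MATCHED THEN UPDATE SET
  last_modified_date = s.last_modified_date
WHEN NOT MATCHED THEN INSERT
  VALUES (s.order_id, s.last_modified_date, s.order_created_date);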
After merging on the hive-llap cluster (before compaction)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272392 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
After merging on the hive-llap cluster (after compaction)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
After merging on the plain Hive cluster (without compacting deltas)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
This is the inconsistency we observed: after the merge (and before compaction), the LLAP cluster reports a col_distinct_count of 272392, while the plain Hive cluster reports 272409.
However, after compacting the table on the Hive LLAP cluster, the inconsistency disappears and both clusters return the same result.
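For reference, a rough sketch of how a major compaction can be triggered manually on the affected partition (table and partition names are assumptions):

ALTER TABLE orders_managed PARTITION (order_created_date='20200615') COMPACT 'major';
SHOW COMPACTIONS;   -- wait until the compaction state shows 'succeeded'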
We suspected either a caching or an LLAP issue, so we restarted the HiveServer2 process, which clears the cache, but the issue persisted.
We also created a dummy table with the same schema on the plain Hive cluster and pointed its location at the LLAP table's location; that table returns the expected result.
We even queried the table from Spark using the **Qubole spark-acid reader** (a direct reader for Hive managed tables), which also returns the expected result.
This is very strange and peculiar; can someone help out here?

We also faced a similar issue on an HDInsight Hive LLAP cluster. Setting hive.llap.io.enabled to false resolved it.
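For reference, a minimal sketch of applying that workaround at the session level (it can also be set cluster-wide in the Hive configuration, e.g. through Ambari on HDInsight):

-- Disable the LLAP IO layer (the in-memory columnar cache) for this session
SET hive.llap.io.enabled=false;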

Qubole does not support Hive LLAP yet. (However, we (at Qubole) are evaluating whether to support this in the future)

Apache Ignite: SQL query returns empty result on non-baseline node

I have set up a 3 node Apache Ignite cluster and noticed the following unexpected behavior:
(Tested with Ignite 2.10 and 2.13, Azul Java 11.0.13 on RHEL 8)
We have a relational table "RELATIONAL_META". It's created by our software vendor's product, which uses Ignite to exchange configuration data. The table is backed by this cache, which gets replicated to all nodes:
[cacheName=SQL_PUBLIC_RELATIONAL_META, cacheId=-252123144, grpName=null, grpId=-252123144, prim=512, mapped=512, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, affCls=RendezvousAffinityFunction]
Seen behavior:
I did a failure test, simulating a disk failure on one of the Ignite nodes. The "failed" node restarts with an empty disk and joins the topology as expected. While the node is not yet part of the baseline, either because auto-adjust is disabled or because auto-adjust has not yet completed, the restarted node returns empty results via the JDBC connection:
0: jdbc:ignite:thin://b2bivmign2/> select * from RELATIONAL_META;
+------------+--------------+------+-------+---------+
| CLUSTER_ID | CLUSTER_TYPE | NAME | VALUE | DETAILS |
+------------+--------------+------+-------+---------+
+------------+--------------+------+-------+---------+
No rows selected (0.018 seconds)
It's interesting that it knows the structure of the table, but not the contained data.
The table actually contains data, as I can see when I query against one of the other cluster nodes:
0: jdbc:ignite:thin://b2bivmign1/> select * from RELATIONAL_META;
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
| CLUSTER_ID | CLUSTER_TYPE | NAME | VALUE | DETAILS |
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
| cluster_configuration_1 | writer | change index | 1653 | 2023-01-24 10:25:27 |
| cluster_configuration_1 | writer | last run changes | 0 | Updated at 2023-01-29 11:08:48. |
| cluster_configuration_1 | writer | require full sync | false | Flag set to false on 2022-06-11 09:46:45 |
| cluster_configuration_1 | writer | schema version | 1.4 | Updated at 2022-06-11 09:46:25. Previous version was 1.3 |
| cluster_processing_1 | reader | STOP synchronization | false | Resume synchronization - the processing has the same version as the config - 2.6-UP2022-05 [2023-01-29 11:00:50] |
| cluster_processing_1 | reader | change index | 1653 | 2023-01-29 10:20:39 |
| cluster_processing_1 | reader | conflicts | 0 | Reset due to full sync at 2022-06-11 09:50:12 |
| cluster_processing_1 | reader | require full sync | false | Cleared the flag after full reader sync at 2022-06-11 09:50:12 |
| cluster_processing_2 | reader | STOP synchronization | false | Resume synchronization - the processing has the same version as the config - 2.6-UP2022-05 [2023-01-29 11:00:43] |
| cluster_processing_2 | reader | change index | 1653 | 2023-01-29 10:24:06 |
| cluster_processing_2 | reader | conflicts | 0 | Reset due to full sync at 2022-06-11 09:52:19 |
| cluster_processing_2 | reader | require full sync | false | Cleared the flag after full reader sync at 2022-06-11 09:52:19 |
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
12 rows selected (0.043 seconds)
Expected behavior:
While a node is not part of the baseline, it is by definition not persisting data. So when I run a query against it, I would expect it to fetch the partitions it does not hold itself from the other nodes of the cluster. Instead it just shows an empty result, even showing the correct structure of the table, just without any rows. This has caused inconsistent behavior in the product we're actually running, which uses Ignite as a configuration store, because the nodes suddenly see different results depending on which node they have opened their JDBC connection to. We are using a JDBC connection string that contains all the Ignite server nodes, so it fails over when one goes down, but of course that does not prevent the issue described here.
Is this "works as designed"? Is there any way to prevent such issues? It seems problematic to use Apache Ignite as a configuration store for an application with many nodes when it behaves like this.
Regards,
Sven
Update:
After restarting one of the nodes with an empty disk, it joins as a node with a new ID. I think that is expected behavior. We have baseline auto-adjust enabled, so the new node ID should join the baseline and the old one should leave it. This works, but until it completes, the node returns empty results to SQL queries.
Cluster state: active
Current topology version: 95
Baseline auto adjustment enabled: softTimeout=60000
Baseline auto-adjust is in progress
Current topology version: 95 (Coordinator: ConsistentId=cdf43fef-deb8-4732-907f-6264bd55de6f, Address=b2bivmign3.fritz.box/192.168.0.151, Order=11)
Baseline nodes:
ConsistentId=3ffe3798-9a63-4dc7-b7df-502ad9efc76c, Address=b2bivmign1.fritz.box/192.168.0.149, State=ONLINE, Order=64
ConsistentId=40a8ae8c-5f21-4f47-8f67-2b68f396dbb9, State=OFFLINE
ConsistentId=cdf43fef-deb8-4732-907f-6264bd55de6f, Address=b2bivmign3.fritz.box/192.168.0.151, State=ONLINE, Order=11
--------------------------------------------------------------------------------
Number of baseline nodes: 3
Other nodes:
ConsistentId=080fc170-1f74-44e5-8ac2-62b94e3258d9, Order=95
Number of other nodes: 1
Update 2:
This is the JDBC URL the application uses:
#distributed.jdbc.url - run configure to modify this property
distributed.jdbc.url=jdbc:ignite:thin://b2bivmign1.fritz.box:10800..10820,b2bivmign2.fritz.box:10800..10820,b2bivmign3.fritz.box:10800..10820
#distributed.jdbc.driver - run configure to modify this property
distributed.jdbc.driver=org.apache.ignite.IgniteJdbcThinDriver
We have seen it connecting via JDBC to a node that was not part of the baseline and therefore receiving empty results. I wonder why a node that is not part of the baseline returns any results without fetching the data from the baseline nodes?
Update 3:
Whether this happens seems to depend on the table/cache attributes; I cannot yet reproduce it with a table I create on my own, only with the table created by the product we use.
This is the cache of the table that I can reproduce the issue with:
[cacheName=SQL_PUBLIC_RELATIONAL_META, cacheId=-252123144, grpName=null, grpId=-252123144, prim=512, mapped=512, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, affCls=RendezvousAffinityFunction]
I have created 2 tables my own for testing:
CREATE TABLE Test (
Key CHAR(10),
Value CHAR(10),
PRIMARY KEY (Key)
) WITH "BACKUPS=2";
CREATE TABLE Test2 (
Key CHAR(10),
Value CHAR(10),
PRIMARY KEY (Key)
) WITH "BACKUPS=2,atomicity=ATOMIC";
I then shut down one of the Ignite nodes, in this case b2bivmign3, and remove the ignite data folders, then start it again. It starts as a new node that is not part of the baseline, and I disabled auto-adjust to just keep that situation. I then connect to b2bivmign3 with the SQL CLI and query the tables:
0: jdbc:ignite:thin://b2bivmign3/> select * from Test;
+------+-------+
| KEY | VALUE |
+------+-------+
| Sven | Demo |
+------+-------+
1 row selected (0.202 seconds)
0: jdbc:ignite:thin://b2bivmign3/> select * from Test2;
+------+-------+
| KEY | VALUE |
+------+-------+
| Sven | Demo |
+------+-------+
1 row selected (0.029 seconds)
0: jdbc:ignite:thin://b2bivmign3/> select * from RELATIONAL_META;
+------------+--------------+------+-------+---------+
| CLUSTER_ID | CLUSTER_TYPE | NAME | VALUE | DETAILS |
+------------+--------------+------+-------+---------+
+------------+--------------+------+-------+---------+
No rows selected (0.043 seconds)
The same when I connect to one of the other Ignite nodes:
0: jdbc:ignite:thin://b2bivmign2/> select * from Test;
+------+-------+
| KEY | VALUE |
+------+-------+
| Sven | Demo |
+------+-------+
1 row selected (0.074 seconds)
0: jdbc:ignite:thin://b2bivmign2/> select * from Test2;
+------+-------+
| KEY | VALUE |
+------+-------+
| Sven | Demo |
+------+-------+
1 row selected (0.023 seconds)
0: jdbc:ignite:thin://b2bivmign2/> select * from RELATIONAL_META;
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
| CLUSTER_ID | CLUSTER_TYPE | NAME | VALUE | DETAILS |
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
| cluster_configuration_1 | writer | change index | 1653 | 2023-01-24 10:25:27 |
| cluster_configuration_1 | writer | last run changes | 0 | Updated at 2023-01-29 11:08:48. |
| cluster_configuration_1 | writer | require full sync | false | Flag set to false on 2022-06-11 09:46:45 |
| cluster_configuration_1 | writer | schema version | 1.4 | Updated at 2022-06-11 09:46:25. Previous version was 1.3 |
| cluster_processing_1 | reader | STOP synchronization | false | Resume synchronization - the processing has the same version as the config - 2.6-UP2022-05 [2023-01-29 11:00:50] |
| cluster_processing_1 | reader | change index | 1653 | 2023-01-29 10:20:39 |
| cluster_processing_1 | reader | conflicts | 0 | Reset due to full sync at 2022-06-11 09:50:12 |
| cluster_processing_1 | reader | require full sync | false | Cleared the flag after full reader sync at 2022-06-11 09:50:12 |
| cluster_processing_2 | reader | STOP synchronization | false | Resume synchronization - the processing has the same version as the config - 2.6-UP2022-05 [2023-01-29 11:00:43] |
| cluster_processing_2 | reader | change index | 1653 | 2023-01-29 10:24:06 |
| cluster_processing_2 | reader | conflicts | 0 | Reset due to full sync at 2022-06-11 09:52:19 |
| cluster_processing_2 | reader | require full sync | false | Cleared the flag after full reader sync at 2022-06-11 09:52:19 |
+-------------------------+--------------+----------------------+-------+------------------------------------------------------------------------------------------------------------------+
12 rows selected (0.032 seconds)
I will test more tomorrow to find out which attribute of the table/cache triggers this issue.
Update 4:
I can reproduce this with a table that is set to mode=REPLICATED instead of PARTITIONED.
CREATE TABLE Test (
Key CHAR(10),
Value CHAR(10),
PRIMARY KEY (Key)
) WITH "BACKUPS=2";
[cacheName=SQL_PUBLIC_TEST, cacheId=-2066189417, grpName=null, grpId=-2066189417, prim=1024, mapped=1024, mode=PARTITIONED, atomicity=ATOMIC, backups=2, affCls=RendezvousAffinityFunction]
CREATE TABLE Test2 (
Key CHAR(10),
Value CHAR(10),
PRIMARY KEY (Key)
) WITH "BACKUPS=2,TEMPLATE=REPLICATED";
[cacheName=SQL_PUBLIC_TEST2, cacheId=372637563, grpName=null, grpId=372637563, prim=512, mapped=512, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, affCls=RendezvousAffinityFunction]
0: jdbc:ignite:thin://b2bivmign2/> select * from TEST;
+------+-------+
| KEY | VALUE |
+------+-------+
| Sven | Demo |
+------+-------+
1 row selected (0.06 seconds)
0: jdbc:ignite:thin://b2bivmign2/> select * from TEST2;
+-----+-------+
| KEY | VALUE |
+-----+-------+
+-----+-------+
No rows selected (0.014 seconds)
Testing with Visor:
It makes no difference where I run Visor, same results.
We see both caches for the tables have 1 entry:
+-----------------------------------------+-------------+-------+---------------------------------+-----------------------------------+-----------+-----------+-----------+-----------+
| SQL_PUBLIC_TEST(#c9) | PARTITIONED | 3 | 1 (0 / 1) | min: 0 (0 / 0) | min: 0 | min: 0 | min: 0 | min: 0 |
| | | | | avg: 0.33 (0.00 / 0.33) | avg: 0.00 | avg: 0.00 | avg: 0.00 | avg: 0.00 |
| | | | | max: 1 (0 / 1) | max: 0 | max: 0 | max: 0 | max: 0 |
+-----------------------------------------+-------------+-------+---------------------------------+-----------------------------------+-----------+-----------+-----------+-----------+
| SQL_PUBLIC_TEST2(#c10) | REPLICATED | 3 | 1 (0 / 1) | min: 0 (0 / 0) | min: 0 | min: 0 | min: 0 | min: 0 |
| | | | | avg: 0.33 (0.00 / 0.33) | avg: 0.00 | avg: 0.00 | avg: 0.00 | avg: 0.00 |
| | | | | max: 1 (0 / 1) | max: 0 | max: 0 | max: 0 | max: 0 |
+-----------------------------------------+-------------+-------+---------------------------------+-----------------------------------+-----------+-----------+-----------+-----------+
One is empty when I scan it, the other has one row as expected:
visor> cache -scan -c=#c9
Entries in cache: SQL_PUBLIC_TEST
+================================================================================================================================================+
| Key Class | Key | Value Class | Value |
+================================================================================================================================================+
| java.lang.String | Sven | o.a.i.i.binary.BinaryObjectImpl | SQL_PUBLIC_TEST_466f2363_47ed_4fba_be80_e33740804b97 [hash=-900301401, VALUE=Demo] |
+------------------------------------------------------------------------------------------------------------------------------------------------+
visor> cache -scan -c=#c10
Cache: SQL_PUBLIC_TEST2 is empty
visor>
Update 5:
I have reduced the configuration file to this:
https://pastebin.com/dL9Jja8Z
I did not manage to reproduce this with persistence turned off, as I cannot keep a node out of the baseline then; it always joins immediately. So this problem may only be reproducible with persistence enabled.
I go to each of the 3 nodes, remove the Ignite data to start from scratch, and start the service:
[root@b2bivmign1,2,3 apache-ignite]# rm -rf db/ diagnostic/ snapshots/
[root@b2bivmign1,2,3 apache-ignite]# systemctl start apache-ignite@b2bi-config.xml.service
I open visor, check the topology that all nodes have joined, then activate the cluster.
https://pastebin.com/v0ghckBZ
visor> top -activate
visor> quit
I connect with sqlline and create my tables:
https://pastebin.com/Q7KbjN2a
I go to one of the servers, stop the service and delete the data, then start the service again:
[root@b2bivmign2 apache-ignite]# systemctl stop apache-ignite@b2bi-config.xml.service
[root@b2bivmign2 apache-ignite]# rm -rf db/ diagnostic/ snapshots/
[root@b2bivmign2 apache-ignite]# systemctl start apache-ignite@b2bi-config.xml.service
Baseline looks like this:
https://pastebin.com/CeUGYLE7
I connect with sqlline to that node, and the issue reproduces:
https://pastebin.com/z4TMKYQq
This was reproduced on:
openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment Zulu11.62+17-CA (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.62+17-CA (build 11.0.18+10-LTS, mixed mode)
RPM: apache-ignite-2.14.0-1.noarch
Rocky Linux release 8.7 (Green Obsidian)

How to manage relationships between a main table and a variable number of secondary tables in Postgresql

I am trying to create a postgresql database to store the performance specifications of wind turbines and their characteristics.
The way I have structured this in my head is the following:
A main table with a unique id for each turbine model as well as basic information about them (rotor size, max power, height, manufacturer, model id, design date, etc.)
example structure of the "main" table holding all of the main turbine characteristics:
+---------------+------------+--------+-----------+------+
| turbine_model | rotor_size | height | max_power | etc. |
+---------------+------------+--------+-----------+------+
| model_x1      | 200        | 120    | 15        | etc. |
| model_b7      | 250        | 145    | 18        | etc. |
+---------------+------------+--------+-----------+------+
A lookup table for each turbine model storing how much power it produces at a given wind speed, with one column for wind speed and another column for power output. There would be as many of these tables as there are rows in the main table.
example table "model_x1":
wind_speed
power_output
1
0.5
2
1.5
3
2.0
4
2.7
5
3.2
6
3.9
7
4.9
8
7.0
9
10.0
However, I am struggling to implement this, because I cannot find a way to build relationships between each row of the "main" table and its corresponding lookup table. I am starting to think this approach is not suited to a relational database.
How would you design a database to solve this problem?
A relational database is perfect for this, but you will want to learn a little bit about normalization to design the layout of the tables.
Basically, you'll want to add a third column to your power-output reference table so that each model just adds more rows (grow long, not wide).
Here is an example of what I mean. I've taken it a step further to show that you might also want a reference for other metrics in addition to wind speed (rpm in this case).
PowerOutput Reference Table
+----------+--------+------------+-------------+
| model_id | metric | metric_val | poweroutput |
+----------+--------+------------+-------------+
| model_x1 | wind | 1 | 0.5 |
| model_x1 | wind | 2 | 1.5 |
| model_x1 | wind | 3 | 3 |
| ... | ... | ... | ... |
| model_x1 | rpm | 1250 | 1.5 |
| model_x1 | rpm | 1350 | 2.5 |
| model_x1 | rpm | 1450 | 3.5 |
| ... | ... | ... | ... |
| model_bg | wind | 1 | 0.7 |
| model_bg | wind | 2 | 0.9 |
| model_bg | wind | 3 | 1.2 |
| ... | ... | ... | ... |
| model_bg | rpm | 1250 | 1 |
| model_bg | rpm | 1350 | 1.5 |
| model_bg | rpm | 1450 | 2 |
+----------+--------+------------+-------------+
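A minimal DDL sketch of that layout, assuming PostgreSQL and purely illustrative table and column names (not taken from the post), with a foreign key tying each reference row back to the main turbine table:

-- Main table: one row per turbine model (columns assumed from the question)
CREATE TABLE turbine_model (
    model_id    text PRIMARY KEY,
    rotor_size  numeric,
    height      numeric,
    max_power   numeric
);

-- Long-format reference table: one row per (model, metric, metric value)
CREATE TABLE power_output_reference (
    model_id    text    NOT NULL REFERENCES turbine_model (model_id),
    metric      text    NOT NULL,   -- e.g. 'wind' or 'rpm'
    metric_val  numeric NOT NULL,
    poweroutput numeric NOT NULL,
    PRIMARY KEY (model_id, metric, metric_val)
);

-- Example lookup: the wind power curve for one model
SELECT metric_val AS wind_speed, poweroutput
FROM power_output_reference
WHERE model_id = 'model_x1'
  AND metric = 'wind'
ORDER BY metric_val;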

How can I get a list of any UDFs referenced by a query job?

I know that from the INFORMATION_SCHEMA views it's possible to find the tables a query job referenced, but I can't figure out how to get the list of UDFs that the same kind of job referenced.
Is there a way to get a relationship of Job <-> UDFs like this?
+========+=====================+
| Job Id | UDFs |
+--------+---------------------+
|job1 | myproject.udf.UDF_a |
| | myproject.udf.UDF_b |
+--------+---------------------+
|job2 | myproject.udf.UDF_b |
| | myproject.udf.UDF_c |
+--------+---------------------+
|job3 | myproject.udf.UDF_c |
+========+=====================+
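As far as I know the INFORMATION_SCHEMA job views expose referenced tables but not referenced routines, so one possible workaround, sketched under assumptions (the UDFs live in a single known dataset such as myproject.udf, and the region/project qualifiers below are placeholders), is to match each job's SQL text against the routine names in that dataset's ROUTINES view:

-- Sketch only: plain text matching, so it can return false positives
-- (e.g. routine names appearing in comments or string literals).
SELECT
  j.job_id,
  CONCAT(r.routine_catalog, '.', r.routine_schema, '.', r.routine_name) AS udf
FROM `myproject`.`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS j
JOIN `myproject.udf`.INFORMATION_SCHEMA.ROUTINES AS r
  ON STRPOS(j.query, CONCAT(r.routine_schema, '.', r.routine_name)) > 0
WHERE j.job_type = 'QUERY'
ORDER BY j.job_id;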

Druid generate missing records

I have a data table in Druid which has missing rows, and I want to fill them in by generating the missing timestamps and carrying forward the preceding row's value.
This is the table in Druid:
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
I want to add the missing minutes either by materializing them on the Druid storage side or by generating them directly in a Druid query, without going through another module.
The final result I want would look like this:
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:43:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:45:00.000Z | 1360 |
| 2022-05-05T08:46:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
Thank you in advance!
A Druid timeseries query will produce a densely populated timeline at a given time granularity, like the per-minute one you want. But its current functionality either skips empty time buckets or assigns them a value of zero.
Other gap-filling functions like LVCF (last value carried forward), which is what you describe, would be a great enhancement. You can join the Apache Druid community and create an issue that describes this request. That's a great way to start a conversation about requirements and how it might be achieved.
And/or you could add the functionality yourself and submit a PR. We're always looking for more members in the Apache Druid community.

DBT Snapshots with not unique records in the source

I'm interested to know if someone here has come across a situation where the source is not always unique when dealing with snapshots in dbt.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the dbt solution runs, my source could have more than one row per unique id, because the data has changed more than once since the last run.
Ideally, I'd like to update the respective dbt_valid_to columns of the snapshot table with the earliest updated_at from the source, and then add the new records to the snapshot table, making the row with the latest updated_at the current one.
I know how to achieve this using window functions, but I'm not sure how to handle such a situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | null                |
Source Table
| **id** | **some_attribute** | **updated_at**      |                                   |
|--------|--------------------|---------------------|-----------------------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | already been loaded to snapshot   |
| 123    | ZABC               | 2021-06-30 00:00:00 | already been loaded to snapshot   |
| 123    | ZZAB               | 2021-11-21 00:10:00 |                                   |
| 123    | FXAB               | 2021-11-21 15:11:00 |                                   |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123    | ZZAB               | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123    | FXAB               | 2021-11-21 15:11:00 | null                |
Standard snapshots operate under the assumption that the source table being snapshotted is changed in place without storing history. That's the opposite of the behaviour we have here (the source table we are snapshotting is essentially an append-only log of events), which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code here where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that applies the window function you describe; then you snapshot that view.
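A rough sketch of such a staging model, assuming the column names from the example tables above and a hypothetical source reference; it keeps only the latest row per id so that the snapshot's timestamp strategy sees at most one row per unique key per run:

-- models/staging/stg_source_latest.sql (model and source names are hypothetical)
with ranked as (
    select
        id,
        some_attribute,
        updated_at,
        row_number() over (
            partition by id
            order by updated_at desc
        ) as rn
    from {{ source('data_lake', 'source_table') }}  -- assumed source reference
)

select
    id,
    some_attribute,
    updated_at
from ranked
where rn = 1

The trade-off is that intermediate versions arriving between two runs are still not captured, which is why the incremental-model approach above may fit better if you need every version.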
However, I do see potential for a new snapshot strategy that handles append-only sources. Perhaps you'd like to peruse the dbt snapshot docs and the strategies source code to see if you'd like to contribute a new one!