Snowflake - Identify duplicate rows and flag them using an UPDATE statement

I want to identify duplicate rows in a table and add an error code to them. In every case I want to keep one row and mark all the others as duplicates. Unlike SQL Server, Snowflake doesn't support combining a CTE with an UPDATE statement in one query. So how do I go about implementing this?
Table creation code:
DROP TABLE IF EXISTS DUP_CODE_TEST;
CREATE TABLE DUP_CODE_TEST
AS (
SELECT '1' AS PARENT,'OWN' AS REL, '11' AS CHILD, 'ROW1' AS X, NULL AS ERR_CD
UNION ALL
SELECT '1', 'OWN' AS REL, '11' , 'ROW2' , NULL
UNION ALL
SELECT '1', 'OWN' AS REL, '11' , 'ROW3' , NULL
);
Source Table:
+--------+-----+-------+------+--------+
| PARENT | REL | CHILD | X | ERR_CD |
+--------+-----+-------+------+--------+
| 1 | OWN | 11 | ROW1 | NULL |
| 1 | OWN | 11 | ROW2 | NULL |
| 1 | OWN | 11 | ROW3 | NULL |
+--------+-----+-------+------+--------+
In SQL Server I would do this:
WITH CTE_UPD
AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY PARENT,REL,CHILD ORDER BY X ) RN FROM DUP_CODE_TEST
)
UPDATE CTE_UPD
SET ERR_CD = 'AR-DUP'
WHERE RN > 1
and the expected output is
+--------+-----+-------+------+--------+
| PARENT | REL | CHILD | X | ERR_CD |
+--------+-----+-------+------+--------+
| 1 | OWN | 11 | ROW1 | NULL |
| 1 | OWN | 11 | ROW2 | AR-DUP |
| 1 | OWN | 11 | ROW3 | AR-DUP |
+--------+-----+-------+------+--------+

You can do something similar, assuming that X is unique within each group:
UPDATE DUP_CODE_TEST t
SET ERR_CD = 'AR-DUP'
FROM (SELECT PARENT, REL, CHILD, MIN(X) as MIN_X
FROM DUP_CODE_TEST
GROUP BY PARENT, REL, CHILD
) tt
WHERE t.PARENT = tt.PARENT AND t.REL = tt.REL AND
t.CHILD = tt.CHILD AND t.X > tt.MIN_X;
That is, Snowflake does support joining to another table (or subquery) in an UPDATE. The subquery summarizes the table to get the minimum X for each group, and the update then flags every row whose X is larger than that minimum.
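The "keep the row with the smallest X, flag the rest" idea can be tried out locally. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Snowflake, and it expresses the comparison as a correlated subquery instead of UPDATE ... FROM so that it runs on older SQLite builds. The variable name flagged is made up for the example.

```python
import sqlite3

# SQLite stands in for Snowflake here; only the flagging logic is illustrated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DUP_CODE_TEST (PARENT TEXT, REL TEXT, CHILD TEXT, X TEXT, ERR_CD TEXT);
    INSERT INTO DUP_CODE_TEST VALUES
        ('1', 'OWN', '11', 'ROW1', NULL),
        ('1', 'OWN', '11', 'ROW2', NULL),
        ('1', 'OWN', '11', 'ROW3', NULL);
""")

# Flag every row whose X is greater than the smallest X in its
# (PARENT, REL, CHILD) group, so exactly one row per group stays unflagged.
conn.execute("""
    UPDATE DUP_CODE_TEST
    SET ERR_CD = 'AR-DUP'
    WHERE X > (SELECT MIN(tt.X)
               FROM DUP_CODE_TEST tt
               WHERE tt.PARENT = DUP_CODE_TEST.PARENT
                 AND tt.REL = DUP_CODE_TEST.REL
                 AND tt.CHILD = DUP_CODE_TEST.CHILD)
""")

flagged = conn.execute(
    "SELECT X, ERR_CD FROM DUP_CODE_TEST ORDER BY X"
).fetchall()
print(flagged)
```

If X is not unique within a group, rows tied for the minimum all stay unflagged, which is why the uniqueness assumption matters.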

Related

Grouping data using PostgreSQL based on 2 fields

I have a problem with grouping data in PostgreSQL. Let's say that I have a table called my_table:
some_id | description | other_id
---------|-----------------|-----------
1 | description-1 | a
1 | description-2 | b
2 | description-3 | a
2 | description-4 | a
3 | description-5 | a
3 | description-6 | b
3 | description-7 | b
4 | description-8 | a
4 | description-9 | a
4 | description-10 | a
...
I would like to group my data by some_id and then separate the groups whose rows all share the same other_id from the groups that contain different other_id values.
I am expecting 2 queries: one that returns the groups with the same other_id and one that returns the groups with different other_id values.
Expected result:
some_id | description | other_id
---------|-----------------|-----------
2 | description-3 | a
2 | description-4 | a
4 | description-8 | a
4 | description-9 | a
4 | description-10 | a
AND
some_id | description | other_id
---------|-----------------|-----------
1 | description-1 | a
1 | description-2 | b
3 | description-5 | a
3 | description-6 | b
3 | description-7 | b
I am open to suggestions using either Sequelize or a raw query.
Thank you.
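One approach, using MIN and MAX as analytic functions:
-- all other_id values within a some_id are the same
WITH cte AS (
SELECT *, MIN(other_id) OVER (PARTITION BY some_id) min_other_id,
MAX(other_id) OVER (PARTITION BY some_id) max_other_id
FROM my_table
)
SELECT some_id, description, other_id
FROM cte
WHERE min_other_id = max_other_id;
-- not all other_id values within a some_id are the same
-- (a CTE is scoped to a single statement, so it is repeated here)
WITH cte AS (
SELECT *, MIN(other_id) OVER (PARTITION BY some_id) min_other_id,
MAX(other_id) OVER (PARTITION BY some_id) max_other_id
FROM my_table
)
SELECT some_id, description, other_id
FROM cte
WHERE min_other_id <> max_other_id;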
You can also do this using exists and not exists:
-- all same
select t.*
from my_table t
where not exists (select 1
from my_table t2
where t2.some_id = t.some_id and t2.other_id <> t.other_id
);
-- any different
select t.*
from my_table t
where exists (select 1
from my_table t2
where t2.some_id = t.some_id and t2.other_id <> t.other_id
);
Note that this ignores NULL values. If you want them treated as a "different" value, then use IS DISTINCT FROM rather than <>.
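To see the EXISTS / NOT EXISTS pair at work on the question's sample rows, here is a minimal sketch. It runs on Python's built-in sqlite3 module and not on PostgreSQL, which is an assumption made purely for convenience; the variable names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (some_id INTEGER, description TEXT, other_id TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "description-1", "a"), (1, "description-2", "b"),
        (2, "description-3", "a"), (2, "description-4", "a"),
        (3, "description-5", "a"), (3, "description-6", "b"),
        (3, "description-7", "b"),
        (4, "description-8", "a"), (4, "description-9", "a"),
        (4, "description-10", "a"),
    ],
)

# Rows whose some_id group contains no row with a different other_id.
all_same = conn.execute("""
    SELECT t.some_id, t.description, t.other_id
    FROM my_table t
    WHERE NOT EXISTS (SELECT 1
                      FROM my_table t2
                      WHERE t2.some_id = t.some_id
                        AND t2.other_id <> t.other_id)
""").fetchall()

# Rows whose some_id group contains at least one differing other_id.
any_different = conn.execute("""
    SELECT t.some_id, t.description, t.other_id
    FROM my_table t
    WHERE EXISTS (SELECT 1
                  FROM my_table t2
                  WHERE t2.some_id = t.some_id
                    AND t2.other_id <> t.other_id)
""").fetchall()

same_ids = sorted({row[0] for row in all_same})
different_ids = sorted({row[0] for row in any_different})
print(same_ids, different_ids)
```

Every row lands in exactly one of the two result sets, so the two queries together partition the table.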

How do I do an Oracle SQL update from select in a specific order?

I have a table with old values (some null) and new values for various attributes, all inserted at different add times throughout the months. I'm trying to update a second table whose records have business month-end dates. Right now, these records only contain the most recent new values for all month-end dates. The goal is to create historical data by updating the previous month-end values with the old values from the first table.
I am a beginner and was able to come up with a query that updates one object when there is a single entry in the first table. Now I am trying to expand the query to cover multiple objects, possibly with multiple old values within the same month. I tried to use "order by" (since I need to apply the updates for a month in ascending order so that it ends up with the latest value) but read that it doesn't work in update statements without a subquery. So I tried my hand at a more complicated query, without success. I am getting the following error: single-row subquery returns more than one row. Thanks!
TableA:
| ID | TYPE | OLD_VALUE | NEW_VALUE | ADD_TIME|
-----------------------------------------------
| 1 | A | 2 | 3 | 1/11/2019 8:00:00am |
| 1 | B | 3 | 4 | 12/10/2018 8:00:00am|
| 1 | B | 4 | 5 | 12/11/2018 8:00:00am|
| 2 | A | 5 | 1 | 12/5/2018 08:00:00am|
| 2 | A | 1 | 2 | 12/5/2019 09:00:00am|
| 2 | A | 2 | 3 | 12/5/2019 10:00:00am|
| 2 | B | 1 | 2 | 12/5/2019 10:00:00am|
TableB
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1 | 1/31/19 | 3 | 5 |
| 1 | 12/31/18 | 3 | 5 |
| 1 | 11/30/18 | 3 | 5 |
| 2 | 12/31/18 | 3 | 2 |
| 2 | 11/30/18 | 3 | 2 |
Desired Output for TableB
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1 | 1/31/19 | 3 | 5 |
| 1 | 12/31/18 | 2 | 5 |
| 1 | 11/30/18 | 2 | 3 |
| 2 | 12/31/18 | 3 | 2 |
| 2 | 11/30/18 | 5 | 2 |
My Query for Type A (Which I plan to adapt for Type B and execute as well for the desired output)
update TableB B
set b.type_a =
(
with aa as
(
select id, nvl(old_value, new_value) typea, add_time
from TableA
where type = 'A'
order by id, add_time asc
)
select typea
from aa
where aa.id = b.id
and b.month_end <= aa.add_time
)
where exists
(
with aa as
(
select id, nvl(old_value, new_value) typea, add_time
from TableA
where type = 'A'
order by id, add_time asc
)
select typea
from aa
where aa.id = b.id
and b.month_end <= aa.add_time
)
Kudos for giving example input data and desired output. I found your question a bit confusing, so let me rephrase it as "Provide the last type A value from table A that is in the same month as the month end."
By matching on type and date of entry, we can get your answer. The "ROWNUM = 1" limits the result set to a single entry in case more than one row has the same add_time. This SQL is still a mess; maybe someone else can come up with a better one.
UPDATE tableb b
SET b.type_a =
(SELECT old_value
FROM tablea a
-- match on id as well, so each object only sees its own changes
WHERE a.id = b.id
AND LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
AND a.TYPE = 'A'
AND a.add_time =
(SELECT MAX( a2.add_time )
FROM tablea a2
WHERE a2.id = b.id AND a2.TYPE = 'A' AND LAST_DAY( TRUNC( a2.add_time ) ) = b.month_end)
AND ROWNUM = 1)
WHERE EXISTS
(SELECT old_value
FROM tablea a
WHERE a.id = b.id AND LAST_DAY( TRUNC( a.add_time ) ) = b.month_end AND a.TYPE = 'A');

SQL select distinct when one column in and another column greater than

Consider the following dataset:
+---------------------+
| ID | NAME | VALUE |
+---------------------+
| 1 | a | 0.2 |
| 1 | b | 8 |
| 1 | c | 3.5 |
| 1 | d | 2.2 |
| 2 | b | 4 |
| 2 | c | 0.5 |
| 2 | d | 6 |
| 3 | a | 2 |
| 3 | b | 4 |
| 3 | c | 3.6 |
| 3 | d | 0.2 |
+---------------------+
I'm trying to develop a SQL select statement that returns the top or distinct ID where NAME 'a' and 'b' both exist and both of the corresponding VALUEs are >= 1. Thus, the desired output would be:
+---------------------+
| ID | NAME | VALUE |
+---------------------+
| 3 | a | 2 |
+---------------------+
Appreciate any assistance anyone can provide.
You can try using the MIN window function with a few conditions:
SELECT * FROM (
SELECT *,
MIN(CASE WHEN NAME = 'a' THEN [value] end) OVER(PARTITION BY ID) aVal,
MIN(CASE WHEN NAME = 'b' THEN [value] end) OVER(PARTITION BY ID) bVal
FROM T
) t1
WHERE aVal >= 1 and bVal >= 1 and aVal = [Value]
This seems like a group by and having query:
select id
from t
where name in ('a', 'b')
group by id
having count(*) = 2 and
min(value) >= 1;
No subqueries or joins are necessary.
The where clause filters the data to only look at the "a" and "b" records. The count(*) = 2 checks that both exist. If you can have duplicates, then use count(distinct name) = 2.
Then, you want the minimum value to be at least 1, so that is the final condition.
I am not sure why your desired results have the "a" row, but if you really want it, you can change the select to:
select id, 'a' as name,
max(case when name = 'a' then value end) as value
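As a quick check of the group by / having approach against the question's dataset, here is a small sketch with the GROUP BY id written out. It runs on Python's built-in sqlite3 module, which is an assumption for the sake of a self-contained example, and uses count(distinct name) so that duplicate names would not be miscounted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, value REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, "a", 0.2), (1, "b", 8), (1, "c", 3.5), (1, "d", 2.2),
    (2, "b", 4), (2, "c", 0.5), (2, "d", 6),
    (3, "a", 2), (3, "b", 4), (3, "c", 3.6), (3, "d", 0.2),
])

# Only the 'a' and 'b' rows are considered; a group qualifies when both
# names are present and the smaller of the two values is at least 1.
ids = conn.execute("""
    SELECT id
    FROM t
    WHERE name IN ('a', 'b')
    GROUP BY id
    HAVING COUNT(DISTINCT name) = 2
       AND MIN(value) >= 1
""").fetchall()

# Variant that also returns the 'a' row's value for each qualifying id.
a_rows = conn.execute("""
    SELECT id, 'a' AS name,
           MAX(CASE WHEN name = 'a' THEN value END) AS value
    FROM t
    WHERE name IN ('a', 'b')
    GROUP BY id
    HAVING COUNT(DISTINCT name) = 2
       AND MIN(value) >= 1
""").fetchall()

print(ids, a_rows)
```

ID 1 is rejected because its "a" value is 0.2, and ID 2 is rejected because it has no "a" row at all.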
You can use IN and a subquery:
select top 1 * from t
where t.id in
(
select id from t
where name in ('a','b')
group by id
having sum(case when value >= 1 then 1 else 0 end) >= 2
)
order by id

Set-based way to calculate family ranges in SQL?

I have a table that contains parents and 0 or more children for each parent, with a flag indicating which records are parents. All of the members of a given family have the same parent id, and the parent always has the lowest id in a given family. Also, each child has a value associated with it. (Specifically, this is a database of emails and attachments, where each parent is an email and the children are the attachments.)
I have two fields I need to calculate:
Range = {lowest id in family} - {highest id in family} [populated for all members]
Value-list = {delimited list of the values of each child, in id order} [only for parent]
So, given this:
Id | Parent| HasChildren| Value | Range | Value-list
----------------------------------------|-----------
1 | 1 | 1 | | |
2 | 1 | 0 | a | |
3 | 1 | 0 | b | |
4 | 4 | 1 | | |
5 | 4 | 0 | c | |
6 | 6 | 0 | | |
I would like to end up with this:
Id | Parent| HasChildren| Value | Range | Value-list
----------------------------------------|-----------
1 | 1 | 1 | | 1-3 | a;b
2 | 1 | 0 | a | 1-3 |
3 | 1 | 0 | b | 1-3 |
4 | 4 | 1 | | 4-5 | c
5 | 4 | 0 | c | 4-5 |
6 | 6 | 0 | | 6-6 |
How can I do this efficiently? Ideally, I'd like to do this with just set-based logic, without cursors, or even stored procedures. Temporary tables are fine.
I'm working in T-SQL, if that makes a difference, though I'd be curious to see platform agnostic answers.
The following SQLFiddle solution should do the job for you; however, as @Allan mentioned, you might want to revise your database structure.
Using CTEs:
Note: my query uses table1 as the name of your table.
with cte as(
select parent
,ValueList= stuff(( select ';' +isnull(t2.Value, '')
from table1 t2
where t1.parent=t2.parent
order by t2.id -- id order; the parent row (empty value) comes first
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)'), 1, 2, '') -- strips the parent's empty entry and its separator
from table1 t1
group by parent
),
cte2 as (select parent
, min(id) as firstID
, max(id) as LastID
from table1
group by parent)
select *
-- cast so that + concatenates strings (assumes id is an integer column)
,(select cast(FirstID as varchar(20)) from cte2 t2 where t2.parent=t1.parent)+'-'+(select cast(LastID as varchar(20)) from cte2 t2 where t2.parent=t1.parent) as [Range]
,(select ValueList from cte t2 where t1.parent=t2.parent and t1.[haschildren]='1') as [Value-list]
from table1 t1
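For a rough, platform-neutral illustration of the same set-based idea (per-family MIN/MAX for the range, string aggregation for the value list), here is a sketch on Python's built-in sqlite3 module. It is not T-SQL: group_concat replaces the FOR XML PATH trick, and the column names has_children, id_range and value_list are made up for the example. The value order relies on SQLite feeding rows to group_concat in the order produced by the inner ORDER BY, which is how SQLite behaves in practice; newer SQLite versions also accept an ORDER BY inside the aggregate itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE table1 (
        id INTEGER PRIMARY KEY,
        parent INTEGER,
        has_children INTEGER,
        value TEXT
    )
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", [
    (1, 1, 1, None), (2, 1, 0, "a"), (3, 1, 0, "b"),
    (4, 4, 1, None), (5, 4, 0, "c"),
    (6, 6, 0, None),
])

rows = conn.execute("""
    SELECT t.id,
           -- lowest and highest id of the family, joined with a dash
           f.first_id || '-' || f.last_id AS id_range,
           -- delimited child values, only on rows flagged as parents
           CASE WHEN t.has_children = 1 THEN
               (SELECT group_concat(c.value, ';')
                FROM (SELECT value
                      FROM table1
                      WHERE parent = t.parent AND value IS NOT NULL
                      ORDER BY id) c)
           END AS value_list
    FROM table1 t
    JOIN (SELECT parent, MIN(id) AS first_id, MAX(id) AS last_id
          FROM table1
          GROUP BY parent) f ON f.parent = t.parent
    ORDER BY t.id
""").fetchall()

for row in rows:
    print(row)
```

The range comes from a single grouped subquery joined back to every row, so no cursor or per-row procedure is involved.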

Select query where record count = 2 and column contains either two values

Example 1
+--------------------------+
| IDENT | CURRENT | SOURCE |
+--------------------------+
| 12345 | 12345 | A |
| 23456 | 12345 | B |
| 34567 | 12345 | C |
+--------------------------+
Example 2
+--------------------------+
| IDENT | CURRENT | SOURCE |
+--------------------------+
| 12345 | 55555 | A |
| 23456 | 55555 | B |
+--------------------------+
I'm trying to write a select query that will show all records where the CURRENT count = 2 and SOURCE contains both A and B (NOT C).
Example 1 should not show up, as there are 3 entries for the CURRENT and one of them is linked to SOURCE C.
Example 2 is what I want the query to find: the CURRENT has two records and is only linked to SOURCE 'A' and 'B'.
Currently, if I run something similar to "where SOURCE = A or SOURCE = B", the results include records that just have a SOURCE of A, or A+C.
NOTES: IDENT is always a unique value. CURRENT links multiple IDENTs from different SOURCEs.
We're clearly missing more information. Let's take example data (thanks gloomy for the initial fiddle).
| ID | CURRENT | SOURCE |
|----|---------|--------|
| 1 | 111 | A |
| 2 | 111 | B |
| 3 | 111 | C |
| 4 | 222 | A |
| 5 | 222 | B |
| 6 | 333 | A |
| 7 | 333 | C |
| 8 | 444 | B |
| 9 | 444 | C |
| 10 | 555 | B |
| 11 | 666 | A |
| 12 | 666 | A |
| 13 | 666 | B |
| 14 | 777 | A |
| 15 | 777 | A |
I assume you only need this as the result:
| ID | CURRENT | SOURCE |
|----|---------|--------|
| 4 | 222 | A |
| 5 | 222 | B |
This query will work with any number of sources and produce the expected output:
SELECT * FROM test
WHERE CURRENT IN (
SELECT CURRENT FROM test
WHERE CURRENT NOT IN (
SELECT CURRENT FROM test
WHERE SOURCE NOT IN ('A', 'B')
)
GROUP BY CURRENT
HAVING count(SOURCE) = 2 AND count(DISTINCT SOURCE) = 2
)
If SOURCE values are guaranteed to be unique per CURRENT:
SELECT CURRENT
FROM atable
GROUP BY CURRENT
HAVING COUNT(SOURCE) = 2
AND COUNT(CASE WHEN SOURCE IN ('A', 'B') THEN SOURCE END) = 2
;
If SOURCE values aren't unique per CURRENT but CURRENTs with duplicate entries of 'A' or 'B' are allowed:
SELECT CURRENT
FROM atable
GROUP BY CURRENT
HAVING COUNT(DISTINCT SOURCE) = 2
AND COUNT(DISTINCT CASE WHEN SOURCE IN ('A', 'B') THEN SOURCE END) = 2
;
If SOURCE values aren't unique and groups with duplicate SOURCE entries aren't allowed:
SELECT CURRENT
FROM atable
GROUP BY CURRENT
HAVING COUNT(SOURCE) = 2
AND COUNT(DISTINCT SOURCE) = 2
AND COUNT(DISTINCT CASE WHEN SOURCE IN ('A', 'B') THEN SOURCE END) = 2
;
Every query returns only distinct CURRENT values matching the requirements. Use the query as a derived dataset and join it back to your table to get the details.
All the above options assume that either SOURCE is a NOT NULL column or that NULLs can just be disregarded.
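The strictest of these HAVING variants, joined back to the table to get the detail rows, can be checked against the 15-row example data from the earlier answer. The sketch below uses Python's built-in sqlite3 module as an assumed stand-in for the asker's database, and quotes the CURRENT column name because CURRENT is a keyword in several SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test (id INTEGER, "CURRENT" INTEGER, source TEXT)')
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", [
    (1, 111, "A"), (2, 111, "B"), (3, 111, "C"),
    (4, 222, "A"), (5, 222, "B"),
    (6, 333, "A"), (7, 333, "C"),
    (8, 444, "B"), (9, 444, "C"),
    (10, 555, "B"),
    (11, 666, "A"), (12, 666, "A"), (13, 666, "B"),
    (14, 777, "A"), (15, 777, "A"),
])

# The derived table keeps only CURRENT values with exactly two rows,
# two distinct sources, and both of those sources being 'A' or 'B'.
rows = conn.execute("""
    SELECT t.id, t."CURRENT", t.source
    FROM test t
    JOIN (SELECT "CURRENT"
          FROM test
          GROUP BY "CURRENT"
          HAVING COUNT(source) = 2
             AND COUNT(DISTINCT source) = 2
             AND COUNT(DISTINCT CASE WHEN source IN ('A', 'B')
                                     THEN source END) = 2
         ) g ON g."CURRENT" = t."CURRENT"
    ORDER BY t.id
""").fetchall()

print(rows)
```

Group 666 (A, A, B) fails the row count, 777 (A, A) fails the distinct count, and 333 and 444 fail the A/B check, leaving only 222.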
Records where current count = 2:
SELECT CURRENT
FROM table
GROUP BY CURRENT
HAVING COUNT(*) = 2
Records where C is in SOURCE values:
SELECT CURRENT
FROM table
WHERE SOURCE = 'C'
Global query:
SELECT t.*
FROM TABLE t
WHERE t.CURRENT IN (
SELECT CURRENT
FROM table
GROUP BY CURRENT
HAVING COUNT(*) = 2
) AND t.CURRENT NOT IN (
SELECT CURRENT
FROM table
WHERE SOURCE = 'C'
)
http://sqlfiddle.com/#!2/69be9/8/0
select * from test where current in (
select test_a.current
from
(select *
from test
where source = 'A') as test_a
join (select *
from test
where source = 'B') as test_b
on test_b.current = test_a.current
where test_a.current not in
(select current from test where source='C')
)
SELECT *
FROM TABLE mainTbl,
(SELECT CURRENT
FROM TABLE
WHERE source IN ('A', 'B')
GROUP BY CURRENT
HAVING COUNT(1) = 2
) selectedSet
WHERE mainTbl.current = selectedSet.current
AND mainTbl.source IN ('A', 'B');