How to write SQL query to find duplicates in tables - sql

I'm currently trying to write a SQL query that finds any conflicts where any rows have the same x and y values.
Here are the tables I'm currently working with:
CREATE TABLE Slot (
sid INT,
wall varchar(200),
x FLOAT,
y FLOAT,
PRIMARY KEY (sid)
)
CREATE TABLE Route (
rid INT,
name varchar(200),
circuit varchar(200),
PRIMARY KEY (rid)
)
CREATE TABLE Placement (
rid INT FOREIGN KEY REFERENCES Route(rid),
hid INT FOREIGN KEY REFERENCES Hold(hid),
sid INT FOREIGN KEY REFERENCES Slot(sid)
)
So I'm trying to find any Slots that are on the same wall and have identical x and y values. In addition to this, I want them to all be the same Route circuit.
I don't know if I should be trying to use the third table of "Placement", as I'm pretty new to this and got confused when trying to join them because they don't have any shared columns.
Here is what I currently have
SELECT
DISTINCT
S.sid
FROM
Slot as S,
Route as R
WHERE
R.circuit = 'Beginner'
GROUP BY
S.x,
S.y,
S.wall
HAVING
COUNT(*) > 1
But this throws an error because 'S.sid' has to appear in the GROUP BY clause or in an aggregate function, and I don't want to group by it.
Here are the INSERT statements I was using as example data for what I have so far.
INSERT INTO Slot (sid, wall, x, y) VALUES (2345, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (4534, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (2456, 'west', 1, 7)
So here it would return the sid's 2345 and 4534 because they're both on the South wall and have the same x and y values.

Some things you need to be aware of:
The ancient style of JOIN, where you write SELECT ... FROM x, y WHERE x.a = y.b, should not be used. I wish modern RDBMSs would block queries using it (outside of any compatibility mode).
Always use explicit JOIN clauses, for the sake of readability and maintainability (performance shouldn't differ, but explicit JOINs make it far, far easier to investigate performance issues should they occur).
Don't use the float or real types for representing exact quantities (by "exact" I don't mean integral: you can have exact fractional quantities); prefer the decimal type instead. This is because performing equality checks on float and real values in T-SQL is a pain, and that includes evaluating JOIN criteria.
Always include the schema name (the default is dbo.) when referencing tables in queries, as it avoids issues relating to ambiguity.
Including the schema name is also required when referencing UDFs, UDTs, and other more modern SQL Server features.
Your many-to-many linking table dbo.Placement also allows duplicates because it doesn't have a PK defined.
Don't use short, cryptic column names like rid, hid and sid. Software should be self-documenting. I would rename those columns to RouteId, HoldId, and SlotId respectively.
Don't fall for the mistake of naming a column just Id. Column names should not need the name of their parent table to be understandable (this is because queries can/will/do expose your data, often with their original column names, in contexts without their original table names, such as in CTEs, derived-table queries, VIEWs, etc).
It's subjective, but I believe the table-names should be plural, not singular (after-all, a table holds multiple rows - I'd only give a table a singular name if that table will only ever hold a single row).
The worst argument I've heard so far advocating for singular instead of plural is because (apparently) some ORMs and code-gen tools lack the ability to convert a plural noun to a singular noun. Yeesh. That hasn't been true for 20+ years now.
First, to avoid problems caused by using float types in JOIN conditions I'll change your dbo.Slot table to use decimal:
CREATE TABLE dbo.Slot2 (
sid int NOT NULL,
wall varchar(200) NOT NULL,
x decimal(19,6) NOT NULL, -- 6 decimal places should be enough.
y decimal(19,6) NOT NULL,
CONSTRAINT PK_Slot PRIMARY KEY ( sid )
-- ,CONSTRAINT UK_SlotValues UNIQUE ( wall, x, y ) -- Uncomment this to prevent duplicate values in future.
);
INSERT INTO dbo.Slot2 ( sid, wall, x, y )
SELECT
sid,
wall,
CONVERT( decimal(19,6), x ) AS x2,
CONVERT( decimal(19,6), y ) AS y2
FROM
dbo.Slot;
DROP TABLE dbo.Slot;
EXEC sp_rename 'dbo.Slot2', 'Slot';
With that taken care of, let's now get the duplicate values in the set of slots (i.e. find the sets of identical wall, x, y values without the rest of the columns):
SELECT
wall,
x,
y
FROM
dbo.Slot
GROUP BY
wall,
x,
y
HAVING
COUNT(*) >= 2
Then we do an INNER JOIN between the original dbo.Slot table and this set of duplicate values, as well as adding a ROW_NUMBER value to make it easier to choose a single row to keep if the other duplicates are removed:
WITH duplicateValues AS (
SELECT
wall,
x,
y
FROM
dbo.Slot
GROUP BY
wall,
x,
y
HAVING
COUNT(*) >= 2
)
SELECT
ROW_NUMBER() OVER ( PARTITION BY s.wall, s.x, s.y ORDER BY s.sid ) AS n,
s.*
FROM
dbo.Slot AS s
INNER JOIN duplicateValues AS d ON
s.wall = d.wall
AND
s.x = d.x
AND
s.y = d.y
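If you then decide to remove the duplicates and keep only one row per ( wall, x, y ) group, a minimal sketch of the clean-up (my assumption: you keep the row with the lowest sid) is to DELETE straight through the numbering CTE:
-- Sketch only: deletes every row except the lowest sid in each (wall, x, y) group.
WITH numbered AS (
SELECT
ROW_NUMBER() OVER ( PARTITION BY wall, x, y ORDER BY sid ) AS n
FROM
dbo.Slot
)
DELETE FROM numbered
WHERE n > 1;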
In your post you mentioned wanting to also consider the Placement table; however, we need further details, because your post doesn't explain how the Placement table should work.
Regardless, your Placement table should still have a PK. I'm assuming that the Placement table's HoldId column is not a key column, so it should look like this:
CREATE TABLE dbo.Placement (
RouteId int NOT NULL,
SlotId int NOT NULL,
HoldId int NOT NULL,
CONSTRAINT PK_Placement PRIMARY KEY ( RouteId, SlotId ),
CONSTRAINT FK_Placement_Route FOREIGN KEY ( RouteId ) REFERENCES dbo.Route ( rid ),
CONSTRAINT FK_Placement_Slot FOREIGN KEY ( SlotId ) REFERENCES dbo.Slot ( sid ),
CONSTRAINT FK_Placement_Hold FOREIGN KEY ( HoldId ) REFERENCES dbo.Hold ( hid )
);
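Purely as a sketch of how the circuit filter from your original query could come back in (my assumption: a Slot belongs to a Route's circuit via a Placement row, using the revised column names above):
-- Sketch only: positions occupied by two or more slots within a single circuit.
SELECT
s.wall,
s.x,
s.y,
COUNT(DISTINCT s.sid) AS slotsAtPosition
FROM
dbo.Slot AS s
INNER JOIN dbo.Placement AS p ON p.SlotId = s.sid
INNER JOIN dbo.Route AS r ON r.rid = p.RouteId
WHERE
r.circuit = 'Beginner'
GROUP BY
s.wall,
s.x,
s.y
HAVING
COUNT(DISTINCT s.sid) >= 2;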

Related

SQL Server Indexing and Composite Keys

Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE @firstBatchId <= @maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
NewId, -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN @firstBatchId AND @lastBatchId
AND map.RecordType = @someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE Id BETWEEN @firstBatchId AND @lastBatchId
AND RecordType = @someRecordType
SET @firstBatchId += 4999
SET @lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values will a column likely have before we consider indexing it?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite key gain any advantage since both columns don't appear together in the same WHERE or same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.
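For illustration only, one shape such an index could take is a filtered, covering index keyed on the WHERE columns with the JOIN column carried in INCLUDE (a sketch; the column choices are assumptions based on the batch query above, and whether it beats scanning the clustered-key range depends on the data):
-- Sketch only: key columns serve the WHERE clause, INCLUDE covers the JOIN and output.
CREATE NONCLUSTERED INDEX ix_IdMappings_RecordType_Id
ON IdMappings (RecordType, Id)
INCLUDE (OldId, NewId)
WHERE Processed = 0;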

How to make sure only one column is not null in postgresql table

I'm trying to setup a table and add some constraints to it. I was planning on using partial indexes to add constraints to create some composite keys, but ran into the problem of handling NULL values. We have a situation where we want to make sure that in a table only one of two columns is populated for a given row, and that the populated value is unique. I'm trying to figure out how to do this, but I'm having a tough time. Perhaps something like this:
CREATE INDEX foo_idx_a ON foo (colA) WHERE colB is NULL
CREATE INDEX foo_idx_b ON foo (colB) WHERE colA is NULL
Would this work? Additionally, is there a good way to expand this to a larger number of columns?
Another way to write this constraint is to use the num_nonnulls() function:
create table table_name
(
a integer,
b integer,
check ( num_nonnulls(a,b) = 1)
);
This is especially useful if you have more columns:
create table table_name
(
a integer,
b integer,
c integer,
d integer,
check ( num_nonnulls(a,b,c,d) = 1)
);
You can use the following check:
create table table_name
(
a integer,
b integer,
check ((a is null) != (b is null))
);
If there are more columns, you can use the trick with casting boolean to integer:
create table table_name
(
a integer,
b integer,
...
n integer,
check ((a is not null)::integer + (b is not null)::integer + ... + (n is not null)::integer = 1)
);
In this example only one column can be not null (it simply counts not null columns), but you can make it any number.
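As a quick illustration of how either two-column version above behaves (a sketch):
-- Succeeds: exactly one column is non-null.
insert into table_name (a, b) values (1, null);
-- Rejected by the check: both columns are non-null.
insert into table_name (a, b) values (1, 2);
-- Rejected by the check: both columns are null.
insert into table_name (a, b) values (null, null);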
One can do this with an insert/update trigger or with checks, but needing to do so suggests the problem could be modeled better. Constraints exist to give you certainty about your data so that you don't have to constantly check whether the data is valid; with a design where one column or the other is populated, you end up repeating those checks in your queries.
This is better solved with table inheritance and views.
Let's say you have (American) clients. Some are businesses and some are individuals. Everyone needs a Taxpayer Identification Number, which can be one of several things, such as either a Social Security Number or an Employer Identification Number.
create table generic_clients (
id bigserial primary key,
name text not null
);
create table individual_clients (
ssn numeric(9) not null
) inherits(generic_clients);
create table business_clients (
ein numeric(9) not null
) inherits(generic_clients);
SSN and EIN are both Taxpayer Identification Numbers and you can make a view which will treat both the same.
create view clients as
select id, name, ssn as tin from individual_clients
union
select id, name, ein as tin from business_clients;
Now you can query clients.tin or if you specifically want businesses you query business_clients.ein and for individuals individual_clients.ssn. And you can see how the inherited tables can be expanded to accommodate more divergent information between types of clients.
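As a quick sketch of how this behaves (the names and numbers below are made up):
-- One individual and one business client; both appear through the view.
insert into individual_clients (name, ssn) values ('Jane Doe', 123456789);
insert into business_clients (name, ein) values ('Acme Corp', 987654321);
select id, name, tin from clients; -- both rows, with a single tin column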

Distinct top 10 from multiple tables

I have these two tables in SQLite
CREATE TABLE "freq" (
`id` INTEGER,
`translation_id` INTEGER,
`freq` INTEGER DEFAULT NULL,
`year` INTEGER,
PRIMARY KEY(`id`),
FOREIGN KEY(`translation_id`) REFERENCES `translation`(`id`) DEFERRABLE INITIALLY DEFERRED
)
CREATE TABLE translation (
id INTEGER PRIMARY KEY,
w_id INTEGER,
word TEXT DEFAULT NULL,
located_in TEXT,
UNIQUE (word, language)
ON CONFLICT ABORT
)
Based on the values from these tables I want to create a third one which contains the top 10 words for every translation.located_in for every freq.year. This could look like this:
CREATE TABLE top_ten_words_by_country (
id INTEGER PRIMARY KEY,
located_in TEXT,
year INTEGER,
`translation_id` INTEGER,
freq INTEGER,
FOREIGN KEY(`translation_id`) REFERENCES `translation`(`id`) DEFERRABLE INITIALLY DEFERRED
)
That's what I tried (for one country and one year) so far:
SELECT * FROM freq f, translation t
WHERE t.located_in = 'Germany' ORDER BY f.freq DESC
which has these problems:
it doesn't add up multiple words from translation which have the same w_id (which means they are a translation from each other)
it only works for one year and one country
it takes veeeeery long (I know joins are expensive, so it's not that important to speed this up)
it contains duplicate translation.word
So can anyone provide me a way to do what I want?
The speed is the least important thing here for me.
Look, you have a Cartesian product (there's no join condition between your tables).
Besides, you have to use a GROUP BY clause.
And you can create a view instead of a table.
Change your query to:
SELECT sum(f.freq) total_freq
, t.w_id
, t.located_in
, f.year
FROM freq f
, translation t
WHERE f.translation_id = t.id
group by t.w_id
, t.located_in
, f.year
ORDER BY total_freq DESC
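If your SQLite is 3.25 or newer, a sketch of the per-country, per-year top 10 can layer a window function over that aggregate (column names as in your tables; agg, ranked, rn and total_freq are just illustrative aliases):
-- Sketch only: rank summed frequencies per (located_in, year) and keep the top 10.
SELECT located_in, year, w_id, total_freq
FROM (
SELECT located_in, year, w_id, total_freq,
ROW_NUMBER() OVER (
PARTITION BY located_in, year
ORDER BY total_freq DESC
) AS rn
FROM (
SELECT t.located_in, f.year, t.w_id, SUM(f.freq) AS total_freq
FROM freq f
JOIN translation t ON f.translation_id = t.id
GROUP BY t.located_in, f.year, t.w_id
) AS agg
) AS ranked
WHERE rn <= 10
ORDER BY located_in, year, rn;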

How to combine particular rows in a pl/pgsql function that returns set of a view row type?

I have a view, and I have a function that returns records from this view.
Here is the view definition:
CREATE VIEW ctags(id, name, descr, freq) AS
SELECT tags.conc_id, expressions.name, concepts.descr, tags.freq
FROM tags, concepts, expressions
WHERE concepts.id = tags.conc_id
AND expressions.id = concepts.expr_id;
The column id comes from the table tags, which references another table, concepts, which in turn references the table expressions.
Here are the table definitions:
CREATE TABLE expressions(
id serial PRIMARY KEY,
name text,
is_dropped bool DEFAULT FALSE,
rank float(53) DEFAULT 0,
state text DEFAULT 'never edited',
UNIQUE(name)
);
CREATE TABLE concepts(
id serial PRIMARY KEY,
expr_id int NOT NULL,
descr text NOT NULL,
source_id int,
equiv_p_id int,
equiv_r_id int,
equiv_len int,
weight int,
is_dropped bool DEFAULT FALSE,
FOREIGN KEY(expr_id) REFERENCES expressions,
FOREIGN KEY(source_id),
FOREIGN KEY(equiv_p_id) REFERENCES concepts,
FOREIGN KEY(equiv_r_id) REFERENCES concepts,
UNIQUE(id,equiv_p_id),
UNIQUE(id,equiv_r_id)
);
CREATE TABLE tags(
conc_id int NOT NULL,
freq int NOT NULL default 0,
UNIQUE(conc_id, freq)
);
The table expressions is also referenced from my view (ctags).
I want my function to combine rows of my view that have equal values in the column name and that refer to rows of the table concepts with equal values of the column equiv_r_id, so that these rows are combined only once: the combined row has one of the ids (it doesn't matter which), its descr is a concatenation of the descr values of the rows being combined, and its freq is the sum of their freq values. I have no idea how to do it, any help would be appreciated.
Basically, what you describe looks like this:
CREATE FUNCTION f_test()
RETURNS TABLE(min_id int, name text, all_descr text, sum_freq int) AS
$x$
SELECT min(t.conc_id) -- AS min_id
,e.name
,string_agg(c.descr, ', ') -- AS all_descr
,sum(t.freq)::int -- AS sum_freq
FROM tags t
JOIN concepts c ON c.id = t.conc_id
JOIN expressions e ON e.id = c.expr_id
-- WHERE e.name IS DISTINCT FROM ...
GROUP BY e.name;
$x$
LANGUAGE sql;
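Once created, the function is queried like a table source (a quick usage sketch):
SELECT * FROM f_test(); -- one row per distinct name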
Major points:
I ignored the view ctags altogether as it is not needed.
You could also write this as a VIEW so far; the function wrapper is not necessary.
You need PostgreSQL 9.0+ for string_agg(). Otherwise, you have to substitute it with
array_to_string(array_agg(c.descr), ', ')
The only unclear part is this:
and that refer to rows of the table concepts with equal values of the column equiv_r_id so that these rows are combined only once
What column exactly refers to what column in the table concepts?
concepts.equiv_r_id equals what exactly?
If you can clarify that part, I might be able to incorporate it into the solution.

Type II dimension joins

I have the following lookup table in OLTP
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
TransactionStateName VarChar (100)
)
When this comes into my OLAP, I change the structure as follows:
CREATE TABLE TransactionState
(
TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.
I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
Should I go with bachelorette #2 or bachelorette #3?
By this phrase:
With the combination of StartDateTime and EndDateTime, they would be unique.
do you mean that they never overlap, or that they satisfy a database UNIQUE constraint?
If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".
If the latter, then just use a fake identity.
Databases in general do not have an efficient algorithm for this query, unless you do arcane tricks with SPATIAL data:
SELECT *
FROM TransactionState
WHERE @value BETWEEN StartDateTime AND EndDateTime
That's why you'll have to use this condition in a JOIN:
SELECT *
FROM factTable
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= factDateTime
ORDER BY
StartDateTime DESC
) ts
This will deprive the optimizer of the possibility of using a HASH JOIN, which in many cases is most efficient for such queries.
See this article for more details on this approach:
Converting currencies
Rewriting the query so that it can use a HASH JOIN resulted in a 600% performance gain, though this is only possible if your datetimes have an accuracy of a day or coarser (otherwise the hash table will grow very large).
Since the time component is stripped from your StartDateTime and EndDateTime, you can create a CTE like this:
WITH cal AS
(
SELECT CAST('2009-01-01' AS DATE) AS cdate
UNION ALL
SELECT DATEADD(day, 1, cdate)
FROM cal
WHERE cdate <= '2009-03-01'
),
state AS
(
SELECT cdate, ts.*
FROM cal
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= cdate
ORDER BY
StartDateTime DESC
) ts
WHERE ts.EndDateTime >= cdate
)
SELECT *
FROM factTable
JOIN state
ON cdate = CAST(factDate AS DATE)
If your date ranges span more than 100 dates, adjust the MAXRECURSION option on the statement that uses the CTE.
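For example, the hint is appended to the final SELECT of the full WITH ... statement above (a sketch; MAXRECURSION 0 lifts the limit entirely):
SELECT *
FROM factTable
JOIN state
ON cdate = CAST(factDate AS DATE)
OPTION (MAXRECURSION 0);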
Please be aware that IDENTITY(1,1) is a declaration for auto-generating values in that column. This is different from PRIMARY KEY, which is a declaration that makes a column into a primary key (and, by default, a clustered index). These two declarations mean different things, and there are performance implications if you don't say PRIMARY KEY.
You could also use SSIS to load the DW. In the slowly changing dimension (SCD) transformation, you can set how to treat each attribute. If a historical attribute is selected, the type 2 SCD is applied to the whole row, and the transformation takes care of details. You also get to configure if you prefer start_date, end_date or a current/expired column.
The thing to differentiate here is the difference between the primary key and the business (natural) key. The primary key uniquely identifies a row in the table. The business key uniquely identifies a business object/entity, and it can be repeated in a dimension table. Each time an SCD 2 change is applied, a new row is inserted with a new primary key but the same business key; the old row is then marked as expired, while the new one is marked as current, or the start date and end date fields are populated appropriately.
The DW should not expose primary keys, so incoming data from OLTP contains business keys, while assignment of primary keys is under control of the DW; IDENTITY int is good for PKs in dimension tables.
The cool thing is that SCD transformation in SSIS takes care of this.