Fast query to do normalization on SQL data

I have some data that I want to normalize. Specifically I'm normalizing it so I can process the portions getting normalized without having to worry about duplicates. What I'm doing is:
INSERT INTO new_table (a, b, c)
SELECT DISTINCT a,b,c
FROM old_table;
UPDATE old_table
SET abc_id = new_table.id
FROM new_table
WHERE new_table.a = old_table.a
AND new_table.b = old_table.b
AND new_table.c = old_table.c;
First off, it seems as if there should be a better way of doing this: the process of finding the distinct rows could itself produce the list of members that belong to each one. Second, and more important, the INSERT takes a couple and the UPDATE takes FOREVER (I don't have an actual figure for how long yet because it's still running). I'm using PostgreSQL. Is there a better way of doing this, perhaps all in one query?

This is my other answer, extended to three columns:
-- Some test data
CREATE TABLE the_table
( id SERIAL NOT NULL PRIMARY KEY
, name varchar
, a INTEGER
, b varchar
, c varchar
);
INSERT INTO the_table(name, a,b,c) VALUES
( 'Chimpanzee' , 1, 'mammals', 'apes' )
,( 'Urang Utang' , 1, 'mammals', 'apes' )
,( 'Homo Sapiens' , 1, 'mammals', 'apes' )
,( 'Mouse' , 2, 'mammals', 'rodents' )
,( 'Rat' , 2, 'mammals', 'rodents' )
,( 'Cat' , 3, 'mammals', 'felix' )
,( 'Dog' , 3, 'mammals', 'canae' )
;
-- [empty] table to contain the "squeezed out" domain {a,b,c}
CREATE TABLE abc_table
( id SERIAL NOT NULL PRIMARY KEY
, a INTEGER
, b varchar
, c varchar
, UNIQUE (a,b,c)
);
-- The original table needs a "link" to the new table
ALTER TABLE the_table
ADD column abc_id INTEGER -- NOT NULL
REFERENCES abc_table(id)
;
-- FK constraints are helped a lot by a supportive index.
CREATE INDEX abc_table_fk ON the_table (abc_id);
-- Chained query to:
-- * populate the domain table
-- * initialize the FK column in the original table
WITH ins AS (
INSERT INTO abc_table(a,b,c)
SELECT DISTINCT a,b,c
FROM the_table -- no alias here: an alias named "a" would shadow column a
RETURNING *
)
UPDATE the_table ani
SET abc_id = ins.id
FROM ins
WHERE ins.a = ani.a
AND ins.b = ani.b
AND ins.c = ani.c
;
-- Now that we have the FK pointing to the new table,
-- we can drop the redundant columns.
ALTER TABLE the_table DROP COLUMN a, DROP COLUMN b, DROP COLUMN c;
SELECT * FROM the_table;
SELECT * FROM abc_table;
-- show it to the world
SELECT a.*
, c.a, c.b, c.c
FROM the_table a
JOIN abc_table c ON c.id = a.abc_id
;
Results:
CREATE TABLE
INSERT 0 7
CREATE TABLE
ALTER TABLE
CREATE INDEX
UPDATE 7
ALTER TABLE
id | name | abc_id
----+--------------+--------
1 | Chimpanzee | 4
2 | Urang Utang | 4
3 | Homo Sapiens | 4
4 | Mouse | 3
5 | Rat | 3
6 | Cat | 1
7 | Dog | 2
(7 rows)
id | a | b | c
----+---+---------+---------
1 | 3 | mammals | felix
2 | 3 | mammals | canae
3 | 2 | mammals | rodents
4 | 1 | mammals | apes
(4 rows)
id | name | abc_id | a | b | c
----+--------------+--------+---+---------+---------
1 | Chimpanzee | 4 | 1 | mammals | apes
2 | Urang Utang | 4 | 1 | mammals | apes
3 | Homo Sapiens | 4 | 1 | mammals | apes
4 | Mouse | 3 | 2 | mammals | rodents
5 | Rat | 3 | 2 | mammals | rodents
6 | Cat | 1 | 3 | mammals | felix
7 | Dog | 2 | 3 | mammals | canae
(7 rows)
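For readers who want to check the logic without a PostgreSQL server, here is a minimal sketch of the same two steps (populate the domain table, then fill the link column) against an in-memory SQLite database driven from Python. It is not the chained WITH ... RETURNING query above: it swaps in a plain INSERT ... SELECT DISTINCT followed by a correlated-subquery UPDATE, and the names `normalize` and `animals` are made up for the example.

```python
import sqlite3

def normalize(conn):
    """Squeeze the distinct (a, b, c) triples out of the_table into abc_table."""
    conn.executescript("""
        CREATE TABLE abc_table (
            id INTEGER PRIMARY KEY,
            a INTEGER, b TEXT, c TEXT,
            UNIQUE (a, b, c)
        );
        -- Step 1: one row per distinct triple.
        INSERT INTO abc_table (a, b, c)
        SELECT DISTINCT a, b, c FROM the_table;
        -- Step 2: link every original row to its triple.
        ALTER TABLE the_table ADD COLUMN abc_id INTEGER REFERENCES abc_table(id);
        UPDATE the_table
        SET abc_id = (SELECT t.id FROM abc_table t
                      WHERE t.a = the_table.a
                        AND t.b = the_table.b
                        AND t.c = the_table.c);
    """)

# Same test data as in the answer above.
animals = [
    ("Chimpanzee", 1, "mammals", "apes"),
    ("Urang Utang", 1, "mammals", "apes"),
    ("Homo Sapiens", 1, "mammals", "apes"),
    ("Mouse", 2, "mammals", "rodents"),
    ("Rat", 2, "mammals", "rodents"),
    ("Cat", 3, "mammals", "felix"),
    ("Dog", 3, "mammals", "canae"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table "
    "(id INTEGER PRIMARY KEY, name TEXT, a INTEGER, b TEXT, c TEXT)"
)
conn.executemany("INSERT INTO the_table (name, a, b, c) VALUES (?, ?, ?, ?)", animals)
normalize(conn)

# Joining back through abc_id should reproduce the original rows.
joined = conn.execute(
    "SELECT ani.name, abc.a, abc.b, abc.c "
    "FROM the_table ani JOIN abc_table abc ON abc.id = ani.abc_id "
    "ORDER BY ani.id"
).fetchall()
for row in joined:
    print(row)
```

One thing to keep in mind for both this sketch and the original query: `=` never matches NULL, so a row with a NULL in a, b or c would be left with a NULL abc_id. In PostgreSQL, IS NOT DISTINCT FROM is the NULL-safe comparison.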

Came up with a way to do this on my own:
BEGIN;
CREATE TEMPORARY TABLE new_table_temp (
LIKE new_table INCLUDING DEFAULTS, -- plain LIKE does not copy column defaults, so id would stay NULL
old_ids integer[]
)
ON COMMIT DROP;
INSERT INTO new_table_temp (a, b, c, old_ids)
SELECT a, b, c, array_agg(id) AS old_ids
FROM old_table
GROUP BY a, b, c;
INSERT INTO new_table (id, a, b, c)
SELECT id, a, b, c
FROM new_table_temp;
UPDATE old_table
SET abc_id = new_table_temp.id
FROM new_table_temp
WHERE old_table.id = ANY(new_table_temp.old_ids);
COMMIT;
This at least is what I was looking for. I'll update this as to whether it worked quickly. The EXPLAIN seems to form a sensible plan, so I'm hopeful.

Related

Select unique combinations (unique on both sides)

EDIT: added a link to Fiddle for a more comprehensive sample (actual dataset)
I wonder if the below is possible in SQL, in BigQuery in particular, and in one SELECT statement.
Consider following input:
Key | Value
-----|-------
a | 2
a | 3
b | 2
b | 3
b | 5
c | 2
c | 5
c | 7
Logic: select the lowest value "available" for each key. Available meaning not yet assigned/used. See below.
Key | Value | Rule
-----|-------|--------------------------------------------
a | 2 | keep
a | 3 | ignore because key "a" has a value already
b | 2 | ignore because value "2" was already used
b | 3 | keep
b | 5 | ignore because key "b" has a value already
c | 2 | ignore because value "2" was already used
c | 5 | keep
c | 7 | ignore because key "c" has a value already
Hence expected outcome:
Key | Value
-----|-------
a | 2
b | 3
c | 5
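The rule table above reads like a greedy pass over the rows in (key, value) order. As a minimal sketch of that reading, written in Python rather than SQL and only meant to pin down the expected semantics (the function name is hypothetical, and it assumes keys are processed in ascending order):

```python
def assign_lowest_available(rows):
    """Greedy pass: each key takes the lowest value no earlier key has taken."""
    used_values = set()
    assigned = {}
    for key, value in sorted(rows):
        if key in assigned:
            continue  # key already has a value
        if value in used_values:
            continue  # value was already used by an earlier key
        assigned[key] = value
        used_values.add(value)
    return assigned

rows = [
    ("a", 2), ("a", 3),
    ("b", 2), ("b", 3), ("b", 5),
    ("c", 2), ("c", 5), ("c", 7),
]
result = assign_lowest_available(rows)
print(result)
```

Because every decision depends on what earlier keys already took, the rule is inherently sequential, which is presumably why it is awkward to express in a single set-based SELECT.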
Here is the SQL to create the dummy table:
with t as ( select
'a' key, 2 value UNION ALL select 'a', 3
UNION ALL select 'b', 2 UNION ALL select 'b', 3 UNION ALL select 'b', 5
UNION ALL select 'c', 2 UNION ALL select 'c', 5 UNION ALL select 'c', 7
)
select * from t
EDIT: here another dataset
Not sure what combination of FULL JOIN, DISTINCT, ARRAY or WINDOW functions I can use.
Any guidance is appreciated.
EDIT: This is an incorrect answer that worked with the original example dataset, but has issues (as seen with comprehensive sample). I'm leaving it here for now to maintain comment history.
I don't have a specific BigQuery answer, but here is one SQL solution using a Common Table Expression and recursion.
WITH MyCTE AS
(
/* ANCHOR SUBQUERY */
SELECT MyKey, MyValue
FROM MyTable t
WHERE t.MyKey = (SELECT MIN(MyKey) FROM MyTable)
UNION ALL
/* RECURSIVE SUBQUERY */
SELECT t.MyKey, t.MyValue
FROM MyTable t
INNER JOIN MyCTE c
ON c.MyKey < t.MyKey
AND c.MyValue < t.MyValue
)
SELECT MyKey, MIN(MyValue)
FROM MyCTE
GROUP BY MyKey
;
Results:
Key | Value
-----|-------
a | 2
b | 3
c | 5

How to copy rows into a new one-to-many relationship

I'm trying to copy a set of data in a one-to-many relationship to create a new set of the same data in a new, but unrelated, one-to-many relationship. Let's call them groups and items. Groups have a 1-* relation with items: one group has many items.
I've tried to create a CTE to do this, however I can't get the items inserted (in y) as the newly inserted groups don't have any items associated with them yet. I think I need to be able to access old. and new. like you would in a trigger, but I can't work out how to do this.
I think I could solve this by introducing a previous parent id into the templateitem table, or maybe a temp table with the data required to enable me to join on that, but I was wondering if it is possible to solve it this way?
SQL Fiddle keeps breaking on me, so I've put the code here as well:
DROP TABLE IF EXISTS meta.templateitem;
DROP TABLE IF EXISTS meta.templategroup;
CREATE TABLE meta.templategroup (
templategroup_id serial PRIMARY KEY,
groupname text,
roworder int
);
CREATE TABLE meta.templateitem (
templateitem_id serial PRIMARY KEY,
itemname text,
templategroup_id INTEGER NOT NULL REFERENCES meta.templategroup(templategroup_id)
);
INSERT INTO meta.templategroup (groupname, roworder) values ('Group1', 1), ('Group2', 2);
INSERT INTO meta.templateitem (itemname, templategroup_id) values ('Item1A',1), ('Item1B',1), ('Item2A',2);
WITH
x AS (
INSERT INTO meta.templategroup (groupname, roworder)
SELECT distinct groupname || '_v1' FROM meta.templategroup where templategroup_id in (1,2)
RETURNING groupname, templategroup_id, roworder
),
y AS (
Insert INTO meta.templateitem (itemname, templategroup_id)
Select itemname, x.templategroup_id
From meta.templateitem i
INNER JOIN x on x.templategroup_id = i.templategroup_id
RETURNING *
)
SELECT * FROM y;
Use an auxiliary column templategroup.old_id:
ALTER TABLE meta.templategroup ADD old_id int;
WITH x AS (
INSERT INTO meta.templategroup (groupname, roworder, old_id)
SELECT DISTINCT groupname || '_v1', roworder, templategroup_id
FROM meta.templategroup
WHERE templategroup_id IN (1,2)
RETURNING templategroup_id, old_id
),
y AS (
INSERT INTO meta.templateitem (itemname, templategroup_id)
SELECT itemname, x.templategroup_id
FROM meta.templateitem i
INNER JOIN x ON x.old_id = i.templategroup_id
RETURNING *
)
SELECT * FROM y;
templateitem_id | itemname | templategroup_id
-----------------+----------+------------------
4 | Item1A | 3
5 | Item1B | 3
6 | Item2A | 4
(3 rows)
It's impossible to do that in a single plain SQL query without an additional column. You have to store the old ids somewhere. As an alternative you can use PL/pgSQL and an anonymous code block:
Before:
select *
from meta.templategroup
join meta.templateitem using (templategroup_id);
templategroup_id | groupname | roworder | templateitem_id | itemname
------------------+-----------+----------+-----------------+----------
1 | Group1 | 1 | 1 | Item1A
1 | Group1 | 1 | 2 | Item1B
2 | Group2 | 2 | 3 | Item2A
(3 rows)
Insert:
do $$
declare
grp record;
begin
for grp in
select distinct groupname || '_v1' groupname, roworder, templategroup_id
from meta.templategroup
where templategroup_id in (1,2)
loop
with insert_group as (
insert into meta.templategroup (groupname, roworder)
values (grp.groupname, grp.roworder)
returning templategroup_id
)
insert into meta.templateitem (itemname, templategroup_id)
select itemname || '_v1', g.templategroup_id
from meta.templateitem i
join insert_group g on grp.templategroup_id = i.templategroup_id;
end loop;
end $$;
After:
select *
from meta.templategroup
join meta.templateitem using (templategroup_id);
templategroup_id | groupname | roworder | templateitem_id | itemname
------------------+-----------+----------+-----------------+-----------
1 | Group1 | 1 | 1 | Item1A
1 | Group1 | 1 | 2 | Item1B
2 | Group2 | 2 | 3 | Item2A
3 | Group1_v1 | 1 | 4 | Item1A_v1
3 | Group1_v1 | 1 | 5 | Item1B_v1
4 | Group2_v1 | 2 | 6 | Item2A_v1
(6 rows)
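The core idea in both answers is to keep a mapping from old group id to new group id while copying. As a rough sketch of that idea outside PostgreSQL, here it is against an in-memory SQLite database from Python, with the mapping held in a dict instead of an old_id column. The function name `copy_groups` is hypothetical, the `meta.` schema prefix is dropped because SQLite has no such schema, and item names are copied unchanged.

```python
import sqlite3

def copy_groups(conn, group_ids, suffix="_v1"):
    """Copy each group and its items; return {old_group_id: new_group_id}."""
    id_map = {}
    for old_id in group_ids:
        row = conn.execute(
            "SELECT groupname, roworder FROM templategroup "
            "WHERE templategroup_id = ?",
            (old_id,),
        ).fetchone()
        if row is None:
            continue  # nothing to copy for an unknown id
        cur = conn.execute(
            "INSERT INTO templategroup (groupname, roworder) VALUES (?, ?)",
            (row[0] + suffix, row[1]),
        )
        new_id = cur.lastrowid  # id generated for the copied group
        # Copy the items of the old group, pointing them at the new group.
        conn.execute(
            "INSERT INTO templateitem (itemname, templategroup_id) "
            "SELECT itemname, ? FROM templateitem WHERE templategroup_id = ?",
            (new_id, old_id),
        )
        id_map[old_id] = new_id
    return id_map

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE templategroup (
        templategroup_id INTEGER PRIMARY KEY,
        groupname TEXT,
        roworder INTEGER
    );
    CREATE TABLE templateitem (
        templateitem_id INTEGER PRIMARY KEY,
        itemname TEXT,
        templategroup_id INTEGER NOT NULL
            REFERENCES templategroup(templategroup_id)
    );
    INSERT INTO templategroup (groupname, roworder)
    VALUES ('Group1', 1), ('Group2', 2);
    INSERT INTO templateitem (itemname, templategroup_id)
    VALUES ('Item1A', 1), ('Item1B', 1), ('Item2A', 2);
""")

id_map = copy_groups(conn, [1, 2])
copied_items = conn.execute(
    "SELECT templateitem_id, itemname, templategroup_id "
    "FROM templateitem WHERE templateitem_id > 3 ORDER BY templateitem_id"
).fetchall()
print(id_map)
print(copied_items)
```

This is the row-by-row shape of the DO block above; the old_id column variant stays set-based and is likely the better fit when many groups are copied at once.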

Select records not in another table with additional criteria

I am working on an ACCESS DB.
I have 1 table (tblData) with 1 column ( DataId) and 3 entries:
tblData (A)
+--------+
| DataId |
+--------+
| 1 |
| 2 |
| 3 |
+--------+
Another table (tblSelections) contains 3 columns (id, dataid, userid) and has 3 entries:
tblSelections (B)
+----+--------+---------+
| id | dataid | userid |
+----+--------+---------+
| 1 | 1 | 5 |
| 2 | 2 | 5 |
| 3 | 3 | 2 |
+----+--------+---------+
How can I select the records from table A (tblData) which are not in tbl B (tblSelections) for a certain 'userid'?
For 'userid' 5 the query must return 'DataId' 3 from table A as dataid 1 & 2 are already present in table B for userid 5.
For 'userid' 2 the query must return 'DataId' 1 & 2 from table A as dataid 3 is already present in table B for userid 2.
For 'userid' 1 the query must return 'DataId' 1, 2 & 3 from table A as no records are present in table B for userid 1
Use NOT EXISTS or NOT IN for queries like yours:
SELECT *
FROM tblData
WHERE DataId NOT IN
(
SELECT dataid
FROM tblSelections
WHERE userid = 5
);
SELECT *
FROM tblData
WHERE NOT EXISTS
(
SELECT *
FROM tblSelections
WHERE tblSelections.dataid = tblData.DataId AND tblSelections.userid = 5
);
You can use an outer join to select all records, then put a condition in the where clause that a non-nullable column in b is null. This will give you all records in a that do not have a matching row in b according to the join conditions.
This query assumes that you have a parameter or variable named @userid that represents the user ID to search against.
select
a.*
from tblData a
left join tblSelections b on b.dataid = a.dataid and b.userid = @userid
where b.id is null
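To check the three cases from the question, here is a small sketch that runs the NOT EXISTS form against the sample data. It uses an in-memory SQLite database from Python rather than Access, so the `?` placeholder and the helper name `unselected` are assumptions of the example, not Access syntax.

```python
import sqlite3

ANTI_JOIN = """
    SELECT d.DataId
    FROM tblData d
    WHERE NOT EXISTS (
        SELECT 1
        FROM tblSelections s
        WHERE s.dataid = d.DataId
          AND s.userid = ?
    )
    ORDER BY d.DataId
"""

def unselected(conn, userid):
    """Return the DataIds that have no selection row for this user."""
    return [row[0] for row in conn.execute(ANTI_JOIN, (userid,))]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblData (DataId INTEGER PRIMARY KEY);
    CREATE TABLE tblSelections (id INTEGER PRIMARY KEY, dataid INTEGER, userid INTEGER);
    INSERT INTO tblData (DataId) VALUES (1), (2), (3);
    INSERT INTO tblSelections (id, dataid, userid) VALUES (1, 1, 5), (2, 2, 5), (3, 3, 2);
""")

for userid in (5, 2, 1):
    print(userid, unselected(conn, userid))
```

One reason to prefer NOT EXISTS or the outer join over NOT IN: if the subquery can ever return a NULL dataid, NOT IN yields no rows at all.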

Returning a row if and only if a sibling row doesn't exist

I'm having an Idiot Day today. I'm sure this is relatively simple, but my brain just isn't giving me the answer.
I have a table whose rows are types of object. Looks something like this:
id name foo bar house_id
1 Cat 12 4 1
2 Cat 9 4 2
3 Dog 8 23 1
4 Bird 9 54 1
5 Bird 78 2 2
6 Bird 29 32 3
This isn't how I'd choose to implement it, but it's what I'm working with. Objects (cats, dogs and birds, in real life they're actual business things) have been added to the table on an ad-hoc basis. When house_id 1 needs cats in it, a record for cats gets put in. When house_id 3 gets dogs, a record gets put in for dogs.
I now need to update this table so every type of object (Cat, Dog, Bird) has a record for a given house_id. I want to do this by inserting the result from a select query that returns a single record for each type, with the earliest values for 'foo' and 'bar' from a row of that type, if and only if there is no existent record for that type with the given house_id.
So for the above example data, where the given house_id = 3, the select query would return the following:
name foo bar house_id
Cat 12 4 3
Dog 8 23 3
which I can then insert straight into the table.
Basically, return the first row of each distinct name if there are no rows of that name with a given house_id.
Suggestions welcome. DB engine is postgres if that helps.
SET search_path= 'tmp';
DROP TABLE dogcat CASCADE;
CREATE TABLE dogcat
( id serial NOT NULL
, zname varchar
, foo INTEGER
, bar INTEGER
, house_id INTEGER NOT NULL
, PRIMARY KEY (zname,house_id)
);
INSERT INTO dogcat(zname,foo,bar,house_id) VALUES
('Cat',12,4,1)
,('Cat',9,4,2)
,('Dog',8,23,1)
,('Bird',9,54,1)
,('Bird',78,2,2)
,('Bird',29,32,3)
;
-- Cartesian product of the {name,house_id} domains
WITH cart AS (
WITH beast AS (
SELECT distinct zname AS zname
FROM dogcat
)
, house AS (
SELECT distinct house_id AS house_id
FROM dogcat
)
SELECT beast.zname AS zname
,house.house_id AS house_id
FROM beast , house
)
INSERT INTO dogcat(zname,house_id, foo,bar)
SELECT ca.zname, ca.house_id
,fb.foo, fb.bar
FROM cart ca
-- find the animal with the lowest id
JOIN dogcat fb ON fb.zname = ca.zname AND NOT EXISTS
( SELECT * FROM dogcat nx
WHERE nx.zname = fb.zname
AND nx.id < fb.id
)
WHERE NOT EXISTS (
SELECT * FROM dogcat dc
WHERE dc.zname = ca.zname
AND dc.house_id = ca.house_id
)
;
SELECT * FROM dogcat;
Result:
SET
DROP TABLE
NOTICE: CREATE TABLE will create implicit sequence "dogcat_id_seq" for serial column "dogcat.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "dogcat_pkey" for table "dogcat"
CREATE TABLE
INSERT 0 6
INSERT 0 3
id | zname | foo | bar | house_id
----+-------+-----+-----+----------
1 | Cat | 12 | 4 | 1
2 | Cat | 9 | 4 | 2
3 | Dog | 8 | 23 | 1
4 | Bird | 9 | 54 | 1
5 | Bird | 78 | 2 | 2
6 | Bird | 29 | 32 | 3
7 | Cat | 12 | 4 | 3
8 | Dog | 8 | 23 | 2
9 | Dog | 8 | 23 | 3
(9 rows)
As is usually the case, I struggle with a question all morning, post it to Stack Overflow and figure it out myself within the next half hour:
select name, foo, bar, 3
from table
where id in
(
select min(id) from table where name not in
(
select name from table where house_id = 3
)
group by name
);
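As a quick check of that query against the sample rows from the question, here is a sketch using an in-memory SQLite database from Python. The table is called `objects` here because `table` is a reserved word, and the named parameter `:house` replaces the hard-coded 3; both are assumptions of the example.

```python
import sqlite3

MISSING_TYPES = """
    SELECT name, foo, bar, :house AS house_id
    FROM objects
    WHERE id IN (
        -- earliest row of every type that has no row for this house
        SELECT MIN(id)
        FROM objects
        WHERE name NOT IN (SELECT name FROM objects WHERE house_id = :house)
        GROUP BY name
    )
    ORDER BY name
"""

def missing_types(conn, house_id):
    """Rows that would have to be inserted so the house has every type."""
    return conn.execute(MISSING_TYPES, {"house": house_id}).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects "
    "(id INTEGER PRIMARY KEY, name TEXT, foo INTEGER, bar INTEGER, house_id INTEGER)"
)
conn.executemany(
    "INSERT INTO objects (id, name, foo, bar, house_id) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Cat", 12, 4, 1),
        (2, "Cat", 9, 4, 2),
        (3, "Dog", 8, 23, 1),
        (4, "Bird", 9, 54, 1),
        (5, "Bird", 78, 2, 2),
        (6, "Bird", 29, 32, 3),
    ],
)

for row in missing_types(conn, 3):
    print(row)
```

For house_id 3 this gives the Cat and Dog rows from the question; for house_id 1, which already has all three types, it returns nothing.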

Distinct Values Ignoring Column Order

I have a table similar to:-
+----+---+---+
| Id | A | B |
+----+---+---+
| 1 | 1 | 2 |
+----+---+---+
| 2 | 2 | 1 |
+----+---+---+
| 3 | 3 | 4 |
+----+---+---+
| 4 | 0 | 5 |
+----+---+---+
| 5 | 5 | 0 |
+----+---+---+
I want to remove all duplicate pairs of values, regardless of which column contains which value, e.g. after whatever the query might be I want to see:-
+----+---+---+
| Id | A | B |
+----+---+---+
| 1 | 1 | 2 |
+----+---+---+
| 3 | 3 | 4 |
+----+---+---+
| 4 | 0 | 5 |
+----+---+---+
I'd like to find a solution in Microsoft SQL Server (has to work in <= 2005, though I'd be interested in any solutions which rely upon >= 2008 features regardless).
In addition, note that A and B are going to be in the range 1-100 (but that's not guaranteed forever. They are surrogate seeded integer foreign keys, however the foreign table might grow to a couple hundred rows max).
I'm wondering whether I'm missing some obvious solution here. The ones which have occurred to me all seem rather overwrought, though I do think they'd probably work, e.g.:-
Have a subquery return a bitfield with each bit corresponding to one of the ids and use this value to remove duplicates.
Somehow, pivot, remove duplicates, then unpivot. Likely to be tricky.
Thanks in advance!
Test data and sample below.
Basically, we do a self join with an OR condition, so two rows match when either a=a and b=b, or a=b and b=a.
The WHERE in the subquery gives you the max for each pair to eliminate.
I think this should work for triplicates as well (note I added a 6th row).
DECLARE @t table(id int, a int, b int)
INSERT INTO @t
VALUES
(1,1,2),
(2,2,1),
(3,3,4),
(4,0,5),
(5,5,0),
(6,5,0)
SELECT *
FROM @t
WHERE id NOT IN (
SELECT a.id
FROM @t a
INNER JOIN @t b
ON (a.a=b.a
AND a.b=b.b)
OR
(a.b=b.a
AND a.a = b.b)
WHERE a.id > b.id)
Try:
select min(Id) Id, A, B
from (select Id, A, B from DuplicatesTable where A <= B
union all
select Id, B A, A B from DuplicatesTable where A > B) v
group by A, B
order by 1
Not 100% tested and I'm sure it can be tidied up but it produces your required result:
DECLARE @T TABLE (id INT IDENTITY(1,1), A INT, B INT)
INSERT INTO @T
VALUES (1,2), (2,1), (3,4), (0,5), (5,0);
SELECT *
FROM @T
WHERE id IN (SELECT DISTINCT MIN(id)
FROM (SELECT id, a, b
FROM @T
UNION ALL
SELECT id, b, a
FROM @T) z
GROUP BY a, b)
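A variant of the same idea, sketched here for comparison: normalize each pair to (low, high) with CASE expressions and keep the lowest id per normalized pair. The sketch runs against an in-memory SQLite database from Python, so it is untested on SQL Server, and the table name `pairs` is made up; it uses only CASE, MIN and GROUP BY, so it would be expected to carry over.

```python
import sqlite3

DEDUP = """
    SELECT id, a, b
    FROM pairs
    WHERE id IN (
        SELECT MIN(id)
        FROM pairs
        -- (1,2) and (2,1) both normalize to low = 1, high = 2
        GROUP BY CASE WHEN a <= b THEN a ELSE b END,
                 CASE WHEN a <= b THEN b ELSE a END
    )
    ORDER BY id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO pairs (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 2, 1), (3, 3, 4), (4, 0, 5), (5, 5, 0), (6, 5, 0)],
)

kept = conn.execute(DEDUP).fetchall()
for row in kept:
    print(row)
```

Unlike the UNION ALL version, this keeps the original column order of the surviving row and reads the table only twice.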