Many to many cleanup

Many to many cleanup - sql

I have a many to many table that has become a bit of a mess and I'm looking to update the schema to prevent this happening in future. To do that first I need to clean up the existing data to use as the source of a merge script. Not knowing quite how to term this kind of query hitting Google hasn't really helped.
Broken down simply this is what I'm looking at and what I'm aiming to extract.
DECLARE #TestTable
TABLE ([IdA] INT, [IdB] INT);
DECLARE #Expected
TABLE ([KeyId] INT, [MemberId] INT);
INSERT INTO #TestTable
VALUES
(1, 2),
(2, 1),
(1, 3),
(4, 1),
(12, 4),
(5, 6),
(6, 7),
(8, 9),
(11, 10)
INSERT INTO #Expected
VALUES
(1, 1),
(1, 2),
(1, 3),
(1, 4),
(1, 12),
(5, 5),
(5, 6),
(5, 7),
(8, 8),
(8, 9),
(10, 10),
(10, 11)
Lowest Id of a group should be considered the key
Number of members of a group will always be less than 20
Ordering isn't important so long as the resultant dataset is correct
Many thanks.

Here's one way of achieving what you are trying to do:
Here's a SQLFiddle
--Let's copy the rows in as they are to begin with
INSERT INTO
#Expected
SELECT
[IdA]
,[IdB]
FROM
#TestTable
--Lets switch the column values where the higher ID's are in column B
UPDATE #Expected
SET [KeyId] = [MemberId], [MemberId] = [KeyId]
WHERE [KeyId] > [MemberId]
--Now lets set the KEY IDs
UPDATE
E
SET
E.[KeyId] = E1.[KeyId]
FROM
#Expected E
INNER JOIN
#Expected E1
ON
E1.[MemberId] = E.[KeyId]
--and you wanted mappings for the keys to themselves in your result set, so let's add those
INSERT INTO
#Expected
SELECT
[KeyId]
,[KeyId]
FROM
#Expected E
WHERE NOT EXISTS
(SELECT * FROM #Expected WHERE [KeyId] = E.[KeyId] AND [MemberId] = [KeyID])
SELECT DISTINCT
*
FROM
#Expected
Note that I'm selecting DISTINCT at the end because I haven't removed duplicates from the result set.

Related

Forward fill since (possibly non existent) date in BigQuery

I have data from two different sources. On one hand I have user data from our app. This has a primary key of ID and UTC date. There are only rows for UTC dates when are users uses the app. On the other hand I have advertisement campaign attribition data for the users (which can be multiple advertisment campaigns per user). This table has a primary key of ID and campaign and a metric containing a advertisment attribution timestamp. I want to combine the two data sources such that I can compute if a campaign is generating more revenue than it costs among other campaign statistics.
App data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0)])
advertisement campaign attribition data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, Attribution_Timestamp Timestamp, campaign_name STRING>>
[(1, TIMESTAMP('2021-01-01 09:54:31'), "A"),
(1, TIMESTAMP('2021-01-09 22:32:51'), "B"),
(2, TIMESTAMP('2021-01-03 19:12:11'), "A")])
The end result I would like to get is:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64, campaign_name STRING>>
[(1, DATE('2021-01-01'), 0, "A"),
(1, DATE('2021-01-05'), 5, "A"),
(1, DATE('2021-01-10'), 0, "B"),
(2, DATE('2021-01-03'), 10, "A"),
(2, DATE('2021-01-08'), 0, "A"),
(2, DATE('2021-01-09'), 0, "A")])
This can be achieved by somehow joining the campaign attribution data to the app data and then forward filling.
The problem I have is that the advertisment attribution timestamp can have a mismatch with the UTC dates in the app data table. This means I cannot use a left join as it will not assign campaign_name B to ID 1. Does anyone know an elegant way to solve this problem?

Found a solution! Here is what I did (and a little bit more sample data):
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")])
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), "Organic") as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = (SELECT MIN(t3.utc_date) FROM app_data t3 WHERE t2.adid=t3.adid AND t2.utc_date <= t3.utc_date)
EDIT
It doesn't work when I select a real table in app_data. It says: Unsupported subquery with table in join predicate.
EDIT 2
Found a way to solve the problem where you cannot use subqueries in joins (apparently it is possible for tables which are not selected from an existing table...) This is the way it works in any case:
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*,
(
SELECT
MIN(t2.utc_date)
FROM app_data t2
WHERE t1.adid=t2.adid
AND t1.utc_date <= t2.utc_date
) as attribution_join_date -- is the closest next date for this adid in app_data to the attribution date. This ensures the join lateron works.
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")]) t1
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 'Organic') as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = t2.attribution_join_date

Select duplicate persons with duplicate memberships

SQL Fiddle with schema and my intial attempt.
CREATE TABLE person
([firstname] varchar(10), [surname] varchar(10), [dob] date, [personid] int);
INSERT INTO person
([firstname], [surname], [dob] ,[personid])
VALUES
('Alice', 'AA', '1/1/1990', 1),
('Alice', 'AA', '1/1/1990', 2),
('Bob' , 'BB', '1/1/1990', 3),
('Carol', 'CC', '1/1/1990', 4),
('Alice', 'AA', '1/1/1990', 5),
('Kate' , 'KK', '1/1/1990', 6),
('Kate' , 'KK', '1/1/1990', 7)
;
CREATE TABLE person_membership
([personid] int, [personstatus] varchar(1), [memberid] int);
INSERT INTO person_membership
([personid], [personstatus], [memberid])
VALUES
(1, 'A', 10),
(2, 'A', 20),
(3, 'A', 30),
(3, 'A', 40),
(4, 'A', 50),
(4, 'A', 60),
(5, 'T', 70),
(6, 'A', 80),
(7, 'A', 90);
CREATE TABLE membership
([membershipid] int, [memstatus] varchar(1));
INSERT INTO membership
([membershipid], [memstatus])
VALUES
(10, 'A'),
(20, 'A'),
(30, 'A'),
(40, 'A'),
(50, 'T'),
(60, 'A'),
(70, 'A'),
(80, 'A'),
(90, 'T');
There are three tables (as per the fiddle above). Person table contains duplicates, same people entered more than once, for the purpose of this exercise we assume that a combination of the first name, surname and DoB is enough to uniquely identify a person.
I am trying to build a query which will show duplicates of people (first name+surname+Dob) with two or more active entries in the Person table (person_membership.person_status=A) AND two or more active memberships (membership.mestatus=A).
Using the example from SQL Fiddle, the result of the query should be just Alice (two active person IDs, two active membership IDs).
I think I'm making progress with the following effort but it looks rather cumbersome and I need to remove Katie from the final result - she doesn't have a duplicate membership.
SELECT q.firstname, q.surname, q.dob, p1.personid, m.membershipid
FROM
(SELECT
p.firstname,p.surname,p.dob, count(*) as cnt
FROM
person p
GROUP BY
p.firstname,p.surname,p.dob
HAVING COUNT(1) > 1) as q
INNER JOIN person p1 ON q.firstname=p1.firstname AND q.surname=p1.surname AND q.dob=p1.dob
INNER JOIN person_membership pm ON p1.personid=pm.personid
INNER JOIN membership m ON pm.memberid = m.membershipid
WHERE pm.personstatus = 'A' AND m.memstatus = 'A'

Since you are using SQL Server windows function will be handy for this scenario. The following will give you the expected output.
SELECT firstname,surname,dob,personid,memberid
from(
SELECT firstname,surname,dob,p.personid,memberid
,Rank() over(partition by p.firstname,p.surname,p.dob order by p.personid) rnasc
,Rank() over(partition by p.firstname,p.surname,p.dob order by p.personid desc) rndesc
FROM [StagingGRG].[dbo].[person] p
INNER JOIN person_membership pm ON p.personid=pm.personid
INNER JOIN membership m ON pm.memberid = m.membershipid
where personstatus='A' and memstatus='A')a
where a.rnasc+rndesc>2

You have to add Group by and Having clause to return duplicate items only-
SELECT
person.firstname,person.surname,person.dob
FROM
person, person_membership, membership
WHERE
person.personid=person_membership.personid AND person_membership.memberid = membership.membershipid
AND
person_membership.personstatus = 'A' AND membership.memstatus = 'A'
GROUP BY
person.firstname,person.surname,person.dob
HAVING COUNT(1) > 1

Select Id where column takes all values in

Please help me with an SQL query. Here go test tables with data:
CREATE TABLE "Cats"
(
"CatId" SERIAL PRIMARY KEY,
"Name" character varying NOT NULL
);
CREATE TABLE "Measures"
(
"MeasureId" SERIAL PRIMARY KEY,
"CatId" integer NOT NULL REFERENCES "Cats",
"Weight" double precision NOT NULL,
"MeasureDay" integer NOT NULL
);
INSERT INTO "Cats" ("Name") VALUES
('A'), ('B'), ('C')
;
INSERT INTO "Measures" ("CatId", "Weight", "MeasureDay") VALUES
(1, 5.0, 1),
(1, 5.3, 2),
(1, 6.1, 5),
(2, 3.2, 1),
(2, 3.5, 2),
(2, 3.8, 3),
(2, 4.0, 4),
(2, 4.0, 5),
(3, 6.6, 1),
(3, 6.9, 2),
(3, 7.0, 3),
(3, 6.9, 4)
;
How do I select those CatId that have measures for ALL 5 days (MeasureDay takes all values in (1, 2, 3, 4, 5)) ?
On this test data, the query should return 2 since only Cat with CatId = 2 has measures for all days (1, 2, 3, 4, 5).
I assume that I should use GROUP BY "CatId" and HAVING clauses, but what kind of query should be inside HAVING?

try like this using group by
select CatId
from Measures
where MeasureDay in (1, 2, 3, 4, 5)
group by CatId
having count(distinct MeasureDay) = 5;

You can use aggregation and a having clause:
select m.CatId
from measures m
group by m.CatId
having count(distinct measureDay) = 5;

Subqueries With Multiple Tables

Good day, need your help on my Vehicle Inspection Database. You can see below the structure, you can see it also in here http://sqlfiddle.com/#!3/4ab7e . What I need from this is to extract The Number of Vehicles With Atleast One (1) Defect or Violation Per PROJECT. In the schema below The Total for Project 4 = two (2) vehicles AND Project 9 = 1 vehicle.
Columns Needed are [Project_Name],[Vehicle_Type],[yy],[mm],[Total]
-- Vehicle Inspection Database --
-- Vehicle_Type Table
CREATE TABLE VehicleType
([VehicleTypeId] int,
[Type] varchar (36));
INSERT INTO VehicleType ([VehicleTypeId],[Type])
VALUES (1, 'Light Vehicle'),
(2, 'Tanker'),
(3, 'Goods');
-- Car Table
CREATE TABLE Vehicle
([VehicleID] varchar(36),
[PlateNo] varchar(36),
[VehicleTypeId] int,
[Project] int);
INSERT INTO Vehicle ([VehicleID], [PlateNo],[VehicleTypeId], [Project])
VALUES('A57D4151-BD49-4B44-AF10-000F1C298E05', '8112AG', 1, 4),
('C7095628-AE88-4DD0-A4FD-00363EAB767F', '60070 AD2', 2, 9),
('E714CCD7-E56C-46A8-89D5-003CA5BF6094', '68823 AD1', 3, 9);
-- Event Table
CREATE TABLE Event
([EventID] int,
[VehicleID] varchar(36),
[EventTime] smalldatetime,
[TicketStatus] varchar (10)) ;
INSERT INTO Event([EventID], [VehicleID], [EventTime], TicketStatus)
VALUES (1, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130701', 'Open'),
(2, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130702', 'Close'),
(3, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130703', 'Close'),
(4, 'C7095628-AE88-4DD0-A4FD-00363EAB767F','20130705', 'Open'),
(5, 'C7095628-AE88-4DD0-A4FD-00363EAB767F','20130710', 'Open');
-- Event_Defects Table
CREATE TABLE EventDefects
([EventDefectsID] int,
[EventID] int,
[Status] varchar(15),
[DefectID] int) ;
INSERT INTO EventDefects ([EventDefectsID], [EventID], [Status], [DefectID])
VALUES
-- 1st Inspection for PlateNo. 8112AG
(1, 1, 'YES', 1),
(2, 1, 'NO', 2),
(3, 1, 'YES',3),
(4, 1, 'N/A', 4),
(5, 1, 'N/A', 5),
-- 2nd Inspection for PlateNo. 8112AG
(6, 2, 'NO', 1),
(7, 2, 'NO', 2),
(8, 2, 'NO', 3),
(9, 2, 'N/A', 4),
(10,2, 'N/A', 5),
-- 3rd Inspection for PlateNo. 8112AG
(11, 3, 'NO', 1),
(12, 3, 'NO', 2),
(13, 3, 'NO', 3),
(14, 3, 'NO', 4),
(15, 3, 'NO', 5),
-- 1st Inspection for PlateNo. 60070 AD2
(16, 3, 'NO', 1),
(17, 3, 'NO', 2),
(18, 3, 'NO', 3),
(19, 3, 'N/A', 4),
(20, 3, 'N/A', 5);
-- Defects Table
CREATE TABLE Defects
([DefectID] int,
[DefectsName] varchar (36),
[DefectClassID] int) ;
INSERT INTO Defects ([DefectID], [DefectsName], [DefectClassID])
VALUES (1, 'TYRE', 1),
(2, 'BRAKING SYSTEM', 1),
(3, 'MIRRORS AND WINDSCREEN', 2),
(4, 'OVER SPEEDING', 3),
(5, 'NOT WEARING SEATBELTS', 3);
-- Defect_Class Table
CREATE TABLE DefectClass
([Description] varchar (15),
[DefectClassID] int) ;
INSERT INTO DefectClass ([DefectClassID], [Description])
VALUES (1, 'CATEGORY A'),
(2, 'CATEGORY B'),
(3, 'CATEGORY C');

Do all the joins as inner joins... and it'll eliminate records that are empty.
Can you check if this works for you?
SELECT Vehicle.VehicleID from Vehicle
INNER JOIN Event ON Vehicle.VehicleID = Event.VehicleID
INNER JOIN EventDefects ON EventDefects.EventID = Event.EventID
INNER JOIN Defects ON EventDefects.DefectID = Defects.DefectID
GROUP BY Vehicle.VehicleID;

How many times are the results of this common table expression evaluated?

I am trying to work out a bug we've found during our last iteration of testing. It involves a query which uses a common table expression. The main theme of the query is that it simulates a 'first' aggregate operation (get the first row for this grouping).
The problem is that the query seems to choose rows completely arbitrarily in some circumstances - multiple rows from the same group get returned, some groups simply get eliminated altogether. However, it always picks the correct number of rows.
I have created a minimal example to post here. There are clients and addresses, and a table which defines the relationships between them. This is a much simplified version of the actual query I'm looking at, but I believe it should have the same characteristics, and it is a good example to use to explain what I think is going wrong.
CREATE TABLE [Client] (ClientID int, Name varchar(20))
CREATE TABLE [Address] (AddressID int, Street varchar(20))
CREATE TABLE [ClientAddress] (ClientID int, AddressID int)
INSERT [Client] VALUES (1, 'Adam')
INSERT [Client] VALUES (2, 'Brian')
INSERT [Client] VALUES (3, 'Charles')
INSERT [Client] VALUES (4, 'Dean')
INSERT [Client] VALUES (5, 'Edward')
INSERT [Client] VALUES (6, 'Frank')
INSERT [Client] VALUES (7, 'Gene')
INSERT [Client] VALUES (8, 'Harry')
INSERT [Address] VALUES (1, 'Acorn Street')
INSERT [Address] VALUES (2, 'Birch Road')
INSERT [Address] VALUES (3, 'Cork Avenue')
INSERT [Address] VALUES (4, 'Derby Grove')
INSERT [Address] VALUES (5, 'Evergreen Drive')
INSERT [Address] VALUES (6, 'Fern Close')
INSERT [ClientAddress] VALUES (1, 1)
INSERT [ClientAddress] VALUES (1, 3)
INSERT [ClientAddress] VALUES (2, 2)
INSERT [ClientAddress] VALUES (2, 4)
INSERT [ClientAddress] VALUES (2, 6)
INSERT [ClientAddress] VALUES (3, 3)
INSERT [ClientAddress] VALUES (3, 5)
INSERT [ClientAddress] VALUES (3, 1)
INSERT [ClientAddress] VALUES (4, 4)
INSERT [ClientAddress] VALUES (4, 6)
INSERT [ClientAddress] VALUES (5, 1)
INSERT [ClientAddress] VALUES (6, 3)
INSERT [ClientAddress] VALUES (7, 2)
INSERT [ClientAddress] VALUES (8, 4)
INSERT [ClientAddress] VALUES (5, 6)
INSERT [ClientAddress] VALUES (6, 3)
INSERT [ClientAddress] VALUES (7, 5)
INSERT [ClientAddress] VALUES (8, 1)
INSERT [ClientAddress] VALUES (5, 4)
INSERT [ClientAddress] VALUES (6, 6)
;WITH [Stuff] ([ClientID], [Name], [Street], [RowNo]) AS
(
SELECT
[C].[ClientID],
[C].[Name],
[A].[Street],
ROW_NUMBER() OVER (ORDER BY [A].[AddressID]) AS [RowNo]
FROM
[Client] [C] INNER JOIN
[ClientAddress] [CA] ON
[C].[ClientID] = [CA].[ClientID] INNER JOIN
[Address] [A] ON
[CA].[AddressID] = [A].[AddressID]
)
SELECT
[CTE].[ClientID],
[CTE].[Name],
[CTE].[Street],
[CTE].[RowNo]
FROM
[Stuff] [CTE]
WHERE
[CTE].[RowNo] IN (SELECT MIN([CTE2].[RowNo]) FROM [Stuff] [CTE2] GROUP BY [CTE2].[ClientID])
ORDER BY
[CTE].[Name] ASC,
[CTE].[Street] ASC
DROP TABLE [ClientAddress]
DROP TABLE [Address]
DROP TABLE [Client]
The query is designed to get all clients, and their first address (the address with the lowest ID). This appears to me that it should work.
I have a theory about why it sometimes will not work. The statement that follows the CTE refers to the CTE in two places. If the CTE is non-deterministic, and it gets run more than once, the result of the CTE may be different in the two places it's referenced.
In my example, the CTE's RowNo column uses ROW_NUMBER() with an order by clause that will potentially result in different orderings when run multiple times (we're ordering by address, the clients can be in any order depending on how the query is executed).
Because of this is it possible that CTE and CTE2 can contain different results? Or is the CTE only executed once and do I need to look elsewhere for the problem?

It is not guaranteed in any way.
SQL Server is free to evaluate CTE each time it's accessed or cache the results, depending on the plan.
You may want to read this article:
Generating XML in subqueries
If your CTE is not deterministic, you will have to store its result in a temporary table or a table variable and use it instead of the CTE.
PostgreSQL, on the other hand, always evaluates CTEs only once, caching their results.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Many to many cleanup - sql

Related

Forward fill since (possibly non existent) date in BigQuery

Select duplicate persons with duplicate memberships

Select Id where column takes all values in

Subqueries With Multiple Tables

How many times are the results of this common table expression evaluated?

Categories

Resources