duplicate elimination with time - sql

For my Smart home Stuff I create a Database, but I make a mistake during programming: the application posting stuff into the Database twice. I want to delete all rows, which contain duplicates. With duplicate I mean a tuples what is identically in the data to the last one from the same type. I mark the duplicates in this Example with "<<" please pay also attention to the last 3 rows. I want to keep the first new Data so I want to delete all the Duplicate after them. I still hope you can help me to solve my Problem.
SmartHome=# select * from sensordata order by time desc Limit 21;
type | data | time
------+------+----------------------------
8 | 2459 | 2019-08-09 23:10:39.530087 <<
8 | 2459 | 2019-08-09 23:10:39.356908
8 | 2445 | 2019-08-09 23:05:39.933269 <<
8 | 2445 | 2019-08-09 23:05:39.789173
10 | 6105 | 2019-08-09 22:50:50.40792 <<
10 | 6105 | 2019-08-09 22:50:50.096132
8 | 2459 | 2019-08-09 22:50:41.429681 <<
8 | 2459 | 2019-08-09 22:50:41.357483
8 | 2474 | 2019-08-09 22:45:42.13396 <<
8 | 2474 | 2019-08-09 22:45:41.813046
10 | 6221 | 2019-08-09 22:40:51.107709 <<
10 | 6221 | 2019-08-09 22:40:51.076903
10 | 6105 | 2019-08-09 22:35:51.737255 <<
10 | 6105 | 2019-08-09 22:35:51.544886
10 | 6221 | 2019-08-09 22:30:52.493895 <<
10 | 6221 | 2019-08-09 22:30:51.795203
8 | 2459 | 2019-08-09 22:30:43.193447 <<
8 | 2459 | 2019-08-09 22:30:43.045599
10 | 6105 | 2019-08-09 22:25:52.571793 << Duplicate like them above
10 | 6105 | 2019-08-09 22:25:52.442844 << Also a Duplicate with much more
10 | 6105 | 2019-08-09 22:20:51.356846 time between the rows
(21 rows)
SmartHome=# \d sensordata
Table "public.sensordata"
Column | Type | Modifiers
--------+-----------------------------+------------------------
type | integer | not null
data | character varying(20) | not null
time | timestamp without time zone | not null default now()
Indexes:
"smarthome_idx" UNIQUE, btree (type, "time")
Foreign-key constraints:
"sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)
If i run
with a as (Select *, row_number() over(partition by type,data order by time) from sensordata) select * from a where row_number=1 order by time desc;
the output is:
10 | 17316 | 2019-08-09 09:43:46.938507 | 1
10 | 18276 | 2019-08-09 09:38:47.129788 | 1
10 | 18176 | 2019-08-09 09:33:47.889064 | 1
10 | 17107 | 2019-08-08 10:36:11.383106 | 1
10 | 17921 | 2019-08-08 09:56:15.889191 | 1
10 | 17533 | 2019-08-03 09:30:11.047639 | 1
thats not what i mean :/ (ßorry dont know how to mark the stuff as code block in the comment therfore this way

There are many possible ways to do this. The fastest is often a correlated subquery but I can never remember the syntax so I normally go for window functions, specifically row_number().
If you run
Select *, row_number() over(partition by type,data order by date) from sensor data
That should give a version of your table where all the rows you want to keep have a number 1 and the duplicates are numbered 2,3,4... Use that same field in a delete query and you'll be sorted.
EDIT: I understand now you only want to remove duplicates that occur sequentially within the same type. This can also be achieved using row_number and join.This query should give you only the data you want.
WITH s as (SELECT *,row_number() over(partition by type order by date) as rnum from sensordata)
SELECT a.*
FROM s a
JOIN s b
ON a.rnum=b.rnum+1 AND a.type=b.type
WHERE NOT a.data=b.data
This might need a slight tweak to avoid missing the very first entry if that's important.

You can identify the duplicates using lag():
select t.*
from (select t.*,
lag(time) over (partition by type, data order by time) as prev_time
from sensordata t
) t
where prev_time > time - interval '10 minute';
This assumes an interval of 10 minutes for a duplicate. You don't specify in the question.
Then, you can delete these using a join. It would be better if you had a serial primary key, so you knew that data was unique. But it seems reasonable to assume that type/data/time will be unique.
So:
delete t
from (select t.*,
lag(time) over (partition by type, data order by time) as prev_time
from sensordata t
) tt
where tt.prev_time > t.time - interval '10 minute' and
tt.type = t.type and
tt.data = t.data and
tt.time = t.time;
Note that if you do have exact duplicates in the table, then you might end up duplicating all the rows for a given combination, including the original one.

if you want to select or delete records based on the existence of other records, EXISTS() is your friend.
First:
define exactly what you consider a set of duplicates
decide which one you want to keep, and which ones to delete
Assuming that you want to delete the most recent records, but only if the data did not change between the one to keep and the one to delete.
select the ones to delete:
SELECT *
FROM sensordata d
WHERE EXISTS( -- if there exists a record with...
SELECT * FROM sensordata x
WHERE x.ztype = d.ztype -- the same type
AND x.zdata = d.zdata -- the same data
AND x.ztime <d.ztime -- but:older
AND NOT EXISTS( -- and NO record exists with...
SELECT * FROM sensordata nx
WHERE nx.ztype = d.ztype -- the same type
AND nx.zdata <> d.zdata -- but DIFFERENT data
AND nx.ztime <d.ztime -- between the one to delete
AND nx.ztime >x.ztime -- and the oldest one
)
);
If that looks OK, you can use exactly the same condition in a DELETE statement.
DELETE FROM sensordata d
WHERE EXISTS( -- if there exists a record with...
SELECT * FROM sensordata x
WHERE x.ztype = d.ztype -- the same type
AND x.zdata = d.zdata -- the same data
AND x.ztime <d.ztime -- but:older
AND NOT EXISTS( -- and NO record exists with...
SELECT * FROM sensordata nx
WHERE nx.ztype = d.ztype -- the same type
AND nx.zdata <> d.zdata -- but DIFFERENT data
AND nx.ztime <d.ztime -- between the one to delete
AND nx.ztime >x.ztime -- and the oldest one
)
);

Related

Best Way to Join One Column on Columns From Two Other Tables

I have a schema like the following in Oracle
Section:
+--------+----------+
| sec_ID | group_ID |
+--------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
+--------+----------+
Section_to_Item:
+--------+---------+
| sec_ID | item_ID |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
+--------+---------+
Item:
+---------+------+
| item_ID | data |
+---------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+---------+------+
Item_Version:
+---------+----------+--------+
| item_ID | start_ID | end_ID |
+---------+----------+--------+
| 1 | 1 | |
| 2 | 1 | 3 |
| 3 | 2 | |
| 4 | 1 | 2 |
+---------+----------+--------+
Section_to_Item has FK into Section and Item on the *_ID columns.
Item_version is indexed on item_ID but has no FK to Item.item_ID (ran out of space in the snapshot group).
I have code that receives a list of version IDs and I want to get all items in sections in a given group that are valid for at least one of the versions passed in. If an item has no end_ID, it's valid for anything starting with start_ID. If it has an end_id, it's valid for anything up until (not including) end_ID.
What I currently have is:
SELECT Items.data
FROM Section, Section_to_Items, Item, Item_Version
WHERE Section.group_ID = 1
AND Section_to_Item.sec_ID = Section.sec_ID
AND Item.item_ID = Section_to_Item.item_ID
AND Item.item_ID = Item_Version.item_ID
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_version.version)
)
Note that the UNION ALL statement is dynamically generated from the list of passed in versions.
This query currently does a cartesian join and is very slow.
For some reason, if I change the query to join
AND Item_Version.item_ID = Section_to_Item.item_ID
which is not a FK, the query does not do the cartesian join and is much faster.
A) Can anyone explain why this is?
B) Is this the right way to be joining this sequence of tables (I feel weird about joining Item.item_ID to two different tables)
C) Is this the right way to get versions between start_ID and end_ID?
Edit
Same query with inner join syntax:
SELECT Items.data
FROM Item
INNER JOIN Section_to_Items ON Section_to_Items.item_ID = Item.item_ID
INNER JOIN Section ON Section.sec_ID = Section_to_Items.sec_ID
INNER JOIN Item_Version ON Item_Version.item_ID = Item_.item_ID
WHERE Section.group_ID = 1
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_version.version)
)
Note that in this case the performance difference comes from joining on Item_Version first and then joining Section_to_Item on Item_Version.item_ID.
In terms of table size, Section_to_Item, Item, and Item_Version should be similar (1000s) while Section should be small.
Edit
I just found out that apparently, the schema has no FKs. The FKs specified in the schema configuration files are ignored. They're just there for documentation. So there's no difference between joining on a FK column or not. That being said, by changing the joins into a cascade of SELECT INs, I'm able to avoid joining the entire Item table twice. I don't love the resulting query, and I don't really understand the difference, but the stats indicate it's much less work (changes the A-Rows returned from the inner most scan on Section from 656,000 to 488 (it used to be 656k starts returning 1 row, now it's 488 starts returning 1 row)).
Edit
It turned out to be stale statistics - the two queries were equivalent the whole time but with the incomplete statistics, the DB happened to notice the correct plan only in the second instance. After updating statistics, both queries generated the same plan.
I'm not sure if this is the best idea but this seems to avoid the cartesian join:
select data
from Item
where item_ID in (
select item_ID
from Item_Version
where item_ID in (
select item_ID
from Section_to_Item
where sec_ID in (
select sec_ID
from Section
where group_ID = 1
)
)
and exists (
select 1
from (
select 2 as version
from dual
union all
select 3 as version
from dual
) versions
where versions.version >= start_ID
and (end_ID is null or versions.version <)
)
)

Insert into multiple tables

A brief explanation on the relevant domain part:
A Category is composed of four data:
Gender (Male/Female)
Age Division (Mighty Mite to Master)
Belt Color (White to Black)
Weight Division (Rooster to Heavy)
So, Male Adult Black Rooster forms one category. Some combinations may not exist, such as mighty mite black belt.
An Athlete fights Athletes of the same Category, and if he classifies, he fights Athletes of different Weight Divisions (but of the same Gender, Age and Belt).
To the modeling. I have a Category table, already populated with all combinations that exists in the domain.
CREATE TABLE Category (
[Id] [int] IDENTITY(1,1) NOT NULL,
[AgeDivision_Id] [int] NULL,
[Gender] [int] NULL,
[BeltColor] [int] NULL,
[WeightDivision] [int] NULL
)
A CategorySet and a CategorySet_Category, which forms a many to many relationship with Category.
CREATE TABLE CategorySet (
[Id] [int] IDENTITY(1,1) NOT NULL,
[Championship_Id] [int] NOT NULL,
)
CREATE TABLE CategorySet_Category (
[CategorySet_Id] [int] NOT NULL,
[Category_Id] [int] NOT NULL
)
Given the following result set:
| Options_Id | Championship_Id | AgeDivision_Id | BeltColor | Gender | WeightDivision |
|------------|-----------------|----------------|-----------|--------|----------------|
1. | 2963 | 422 | 15 | 7 | 0 | 0 |
2. | 2963 | 422 | 15 | 7 | 0 | 1 |
3. | 2963 | 422 | 15 | 7 | 0 | 2 |
4. | 2963 | 422 | 15 | 7 | 0 | 3 |
5. | 2964 | 422 | 15 | 8 | 0 | 0 |
6. | 2964 | 422 | 15 | 8 | 0 | 1 |
7. | 2964 | 422 | 15 | 8 | 0 | 2 |
8. | 2964 | 422 | 15 | 8 | 0 | 3 |
Because athletes may fight two CategorySets, I need CategorySet and CategorySet_Category to be populated in two different ways (it can be two queries):
One Category_Set for each row, with one CategorySet_Category pointing to the corresponding Category.
One Category_Set that groups all WeightDivisions in one CategorySet in the same AgeDivision_Id, BeltColor, Gender. In this example, only BeltColor varies.
So the final result would have a total of 10 CategorySet rows:
| Id | Championship_Id |
|----|-----------------|
| 1 | 422 |
| 2 | 422 |
| 3 | 422 |
| 4 | 422 |
| 5 | 422 |
| 6 | 422 |
| 7 | 422 |
| 8 | 422 |
| 9 | 422 | /* groups different Weight Division for BeltColor 7 */
| 10 | 422 | /* groups different Weight Division for BeltColor 8 */
And CategorySet_Category would have 16 rows:
| CategorySet_Id | Category_Id |
|----------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 1 | /* groups different Weight Division for BeltColor 7 */
| 9 | 2 | /* groups different Weight Division for BeltColor 7 */
| 9 | 3 | /* groups different Weight Division for BeltColor 7 */
| 9 | 4 | /* groups different Weight Division for BeltColor 7 */
| 10 | 5 | /* groups different Weight Division for BeltColor 8 */
| 10 | 6 | /* groups different Weight Division for BeltColor 8 */
| 10 | 7 | /* groups different Weight Division for BeltColor 8 */
| 10 | 8 | /* groups different Weight Division for BeltColor 8 */
I have no idea how to insert into CategorySet, grab it's generated Id, then use it to insert into CategorySet_Category
I hope I've made my intentions clear.
I've also created a SQLFiddle.
Edit 1: I commented in Jacek's answer that this would run only once, but this is false. It will run a couple of times a week. I have the option to run as SQL Command from C# or a stored procedure. Performance is not crucial.
Edit 2: Jacek suggested using SCOPE_IDENTITY to return the Id. Problem is, SCOPE_IDENTITY returns only the last inserted Id, and I insert more than one row in CategorySet.
Edit 3: Answer to #FutbolFan who asked how the FakeResultSet is retrieved.
It is a table CategoriesOption (Id, Price_Id, MaxAthletesByTeam)
And tables CategoriesOptionBeltColor, CategoriesOptionAgeDivision, CategoriesOptionWeightDivison, CategoriesOptionGender. Those four tables are basically the same (Id, CategoriesOption_Id, Value).
The query look like this:
SELECT * FROM CategoriesOption co
LEFT JOIN CategoriesOptionAgeDivision ON
CategoriesOptionAgeDivision.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionBeltColor ON
CategoriesOptionBeltColor.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionGender ON
CategoriesOptionGender.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionWeightDivision ON
CategoriesOptionWeightDivision.CategoriesOption_Id = co.Id
The solution described here will work correctly in multi-user environment and when destination tables CategorySet and CategorySet_Category are not empty.
I used schema and sample data from your SQL Fiddle.
First part is straight-forward
(ab)use MERGE with OUTPUT clause.
MERGE can INSERT, UPDATE and DELETE rows. In our case we need only to INSERT. 1=0 is always false, so the NOT MATCHED BY TARGET part is always executed. In general, there could be other branches, see docs. WHEN MATCHED is usually used to UPDATE; WHEN NOT MATCHED BY SOURCE is usually used to DELETE, but we don't need them here.
This convoluted form of MERGE is equivalent to simple INSERT, but unlike simple INSERT its OUTPUT clause allows to refer to the columns that we need.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,Category.Id AS Category_Id
FROM
FakeResultSet
INNER JOIN Category ON
Category.AgeDivision_Id = FakeResultSet.AgeDivision_Id AND
Category.Gender = FakeResultSet.Gender AND
Category.BeltColor = FakeResultSet.BeltColor AND
Category.WeightDivision = FakeResultSet.WeightDivision
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT inserted.id AS CategorySet_Id, Src.Category_Id
INTO CategorySet_Category (CategorySet_Id, Category_Id)
;
FakeResultSet is joined with Category to get Category.id for each row of FakeResultSet. It is assumed that Category has unique combinations of AgeDivision_Id, Gender, BeltColor, WeightDivision.
In OUTPUT clause we need columns from both source and destination tables. The OUTPUT clause in simple INSERT statement doesn't provide them, so we use MERGE here that does.
The MERGE query above would insert 8 rows into CategorySet and insert 8 rows into CategorySet_Category using generated IDs.
Second part
needs temporary table. I'll use a table variable to store generated IDs.
DECLARE #T TABLE (
CategorySet_Id int
,AgeDivision_Id int
,Gender int
,BeltColor int);
We need to remember the generated CategorySet_Id together with the combination of AgeDivision_Id, Gender, BeltColor that caused it.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
FROM
FakeResultSet
GROUP BY
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT
inserted.id AS CategorySet_Id
,Src.AgeDivision_Id
,Src.Gender
,Src.BeltColor
INTO #T(CategorySet_Id, AgeDivision_Id, Gender, BeltColor)
;
The MERGE above would group FakeResultSet as needed and insert 2 rows into CategorySet and 2 rows into #T.
Then join #T with Category to get Category.IDs:
INSERT INTO CategorySet_Category (CategorySet_Id, Category_Id)
SELECT
TT.CategorySet_Id
,Category.Id AS Category_Id
FROM
#T AS TT
INNER JOIN Category ON
Category.AgeDivision_Id = TT.AgeDivision_Id AND
Category.Gender = TT.Gender AND
Category.BeltColor = TT.BeltColor
;
This will insert 8 rows into CategorySet_Category.
Here is not the full answer, but direction which you can use to solve this:
1st query:
select row_number() over(order by t, Id) as n, Championship_Id
from (
select distinct 0 as t, b.Id, a.Championship_Id
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, Championship_Id
from FakeResultSet
) as q
2nd query:
select q2.CategorySet_Id, c.Id as Category_Id from (
select row_number() over(order by t, Id) as CategorySet_Id, Id, BeltColor
from (
select distinct 0 as t, b.Id, null as BeltColor
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, BeltColor
from FakeResultSet
) as q
) as q2
inner join
Category as c
on
(q2.BeltColor is null and q2.Id=c.Id)
OR
(q2.BeltColor = c.BeltColor)
of course this will work only for empty CategorySet and CategorySet_Category tables, but you can use select coalese(max(Id), 0) from CategorySet to get current number and add it to row_number, thus you will get real ID which will be inserted into CategorySet row for second query
What I do when I run into these situations is to create one or many temporary tables with row_number() over clauses giving me identities on the temporary tables. Then I check for the existence of each record in the actual tables, and if they exist update the temporary table with the actual record ids. Finally I run a while exists loop on the temporary table records missing the actual id and insert them one at a time, after the insert I update the temporary table record with the actual ids. This lets you work through all the data in a controlled manner.
##IDENTITY is your friend to the 2nd part of question
https://msdn.microsoft.com/en-us/library/ms187342.aspx
and
Best way to get identity of inserted row?
Some API (drivers) returns int from update() function, i.e. ID if it is "insert". What API/environment do You use?
I don't understand 1st problem. You should not insert identity column.
Below query will give final result For CategorySet rows:
SELECT
ROW_NUMBER () OVER (PARTITION BY Championship_Id ORDER BY Championship_Id) RNK,
Championship_Id
FROM
(
SELECT
Championship_Id
,BeltColor
FROM #FakeResultSet
UNION ALL
SELECT
Championship_Id,BeltColor
FROM #FakeResultSet
GROUP BY Championship_Id,BeltColor
)BASE

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille#gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Calculate time difference between rows

I currently have a database in the following format
ID | DateTime | PID | TIU
1 | 2013-11-18 00:15:00 | 1551 | 1005
2 | 2013-11-18 00:16:03 | 1551 | 1885
3 | 2013-11-18 00:16:30 | 9110 | 75527
4 | 2013-11-18 00:22:01 | 1022 | 75
5 | 2013-11-18 00:22:09 | 1019 | 1311
6 | 2013-11-18 00:23:52 | 1022 | 89
7 | 2013-11-18 00:24:19 | 1300 | 44433
8 | 2013-11-18 00:38:57 | 9445 | 2010
I have a scenario where I need to identify where there are gaps in processes more than 5 minutes by using the DateTime column.
An example of what I am trying to achieve is:
ID | DateTime | PID | TIU
3 | 2013-11-18 00:16:30 | 9110 | 75527
4 | 2013-11-18 00:22:01 | 1022 | 75
7 | 2013-11-18 00:24:50 | 1300 | 44433
8 | 2013-11-18 00:38:57 | 9445 | 2010
ID3 is the last row before a 6 minute 1 second gap, ID4 is the next row after it.
ID7 is the last row before a 14 minute 7 second gap, ID8 is the next record available.
I am trying to do this in SQL, however if needs be I can do this in C# to process instead.
I have tried a number of inner joins, however the table is over 3 million rows so performance suffers greatly.
This is a CTE solution but, as has been indicated, this may not always perform well - because we're having to compute functions against the DateTime column, most indexes will be useless:
declare #t table (ID int not null,[DateTime] datetime not null,
PID int not null,TIU int not null)
insert into #t(ID,[DateTime],PID,TIU) values
(1,'2013-11-18 00:15:00',1551,1005 ),
(2,'2013-11-18 00:16:03',1551,1885 ),
(3,'2013-11-18 00:16:30',9110,75527 ),
(4,'2013-11-18 00:22:01',1022,75 ),
(5,'2013-11-18 00:22:09',1019,1311 ),
(6,'2013-11-18 00:23:52',1022,89 ),
(7,'2013-11-18 00:24:19',1300,44433 ),
(8,'2013-11-18 00:38:57',9445,2010 )
;With Islands as (
select ID as MinID,[DateTime],ID as RecID from #t t1
where not exists
(select * from #t t2
where t2.ID < t1.ID and --Or by date, if needed
--Use 300 seconds to avoid most transition issues
DATEDIFF(second,t2.[DateTime],t1.[DateTime]) < 300
)
union all
select i.MinID,t2.[DateTime],t2.ID
from Islands i
inner join
#t t2
on
i.RecID < t2.ID and
DATEDIFF(second,i.[DateTime],t2.[DateTime]) < 300
), Ends as (
select MinID,MAX(RecID) as MaxID from Islands group by MinID
)
select * from #t t
where exists(select * from Ends e where e.MinID = t.ID or e.MaxID = t.ID)
This also returns a row for ID 1, since that row has no preceding row within 5 minutes of it - but that should be easy enough to exclude in the final select, if needed.
I've assumed we can use ID as a proxy for increasing dates - that if for two rows, the ID is higher in the second row, then the DateTime will also be later.
Islands is a recursive CTE. The top half (the anchor) just selects rows which do not have any preceding row within 5 minutes of themselves. We select the ID twice for those rows and also keep the DateTime around.
In the recursive portion, we try to find a new row from the table that can be "added on" to an existing Islands row - based on this new row being no more than 5 minutes later than the current end-point of the island.
Once the recursion is complete, we then exclude the intermediate rows that the CTE produces. E.g. for the "4" island, it generated the following rows:
4,00:22:01,4
4,00:22:09,5
4,00:23:52,6
4,00:24:19,7
And all that we care about is that final row where we've identified an "island" of time from ID 4 to ID 7 - that's what the second CTE (Ends) is finding for us.

Row Rank in a MySQL View

I need to create a view that automatically adds virtual row number in the result. the graph here is totally random all that I want to achieve is the last column to be created dynamically.
> +--------+------------+-----+
> | id | variety | num |
> +--------+------------+-----+
> | 234 | fuji | 1 |
> | 4356 | gala | 2 |
> | 343245 | limbertwig | 3 |
> | 224 | bing | 4 |
> | 4545 | chelan | 5 |
> | 3455 | navel | 6 |
> | 4534345| valencia | 7 |
> | 3451 | bartlett | 8 |
> | 3452 | bradford | 9 |
> +--------+------------+-----+
Query:
SELECT id,
variety,
SOMEFUNCTIONTHATWOULDGENERATETHIS() AS num
FROM mytable
Use:
SELECT t.id,
t.variety,
(SELECT COUNT(*) FROM TABLE WHERE id < t.id) +1 AS NUM
FROM TABLE t
It's not an ideal manner of doing this, because the query for the num value will execute for every row returned. A better idea would be to create a NUMBERS table, with a single column containing a number starting at one that increments to an outrageously large number, and then join & reference the NUMBERS table in a manner similar to the variable example that follows.
MySQL Ranking, or Lack Thereof
You can define a variable in order to get psuedo row number functionality, because MySQL doesn't have any ranking functions:
SELECT t.id,
t.variety,
#rownum := #rownum + 1 AS num
FROM TABLE t,
(SELECT #rownum := 0) r
The SELECT #rownum := 0 defines the variable, and sets it to zero.
The r is a subquery/table alias, because you'll get an error in MySQL if you don't define an alias for a subquery, even if you don't use it.
Can't Use A Variable in a MySQL View
If you do, you'll get the 1351 error, because you can't use a variable in a view due to design. The bug/feature behavior is documented here.
Oracle has a rowid pseudo-column. In MySQL, you might have to go ugly:
SELECT id,
variety,
1 + (SELECT COUNT(*) FROM tbl WHERE t.id < id) as num
FROM tbl
This query is off the top of my head and untested, so take it with a grain of salt. Also, it assumes that you want to number the rows according to some sort criteria (id in this case), rather than the arbitrary numbering shown in the question.