Most efficient way of selecting the changes between timestamped snapshots - SQL

I have a table that holds data about items that existed at a certain time - regular snapshots taken.
Simple example:
Timestamp ID
1 A
1 B
2 A
2 B
2 C
3 A
3 D
4 D
4 E
In this case, item C gets created sometime between snapshots 1 and 2; sometime between snapshots 2 and 3, B and C disappear and D gets created; and so on.
The table is reasonably large (millions of records) and for each timestamp there are about 50 records.
What's the most efficient way of selecting the item IDs for items that disappear between two consecutive timestamps?
So for the above example ...
Between 1 and 2: NULL
Between 2 and 3: B, C
Between 3 and 4: A
If it doesn't make the query inefficient, can it be extended to automatically use the latest (i.e. MAX) timestamp and the previous one?

Another way to view this is that you want to find records that exist in timestamp #1 that do not exist in timestamp #2. The easiest way?
SELECT t1.Timestamp, t1.id
FROM records AS t1
WHERE NOT EXISTS (SELECT 1
                  FROM records AS t2
                  WHERE t2.id = t1.id
                    AND t2.Timestamp = t1.Timestamp + 1)
Of course, I'm exploiting here the fact that your example timestamps are integers, when in reality I imagine they are genuine timestamps. But it turns out the integers work so well for this particular purpose, they'd be really handy to have around. So, perhaps we should make a numbered list of all available timestamps. The easiest way to get that?
CREATE TEMPORARY TABLE timestamp_map (
    timestamp_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    timestamp_value DATETIME
);
INSERT INTO timestamp_map (timestamp_value)
SELECT DISTINCT timestamp FROM records ORDER BY timestamp;
(You could also maintain such a table permanently by use of triggers.)
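A minimal sketch of that trigger approach, assuming timestamp_map is made a permanent table with a UNIQUE key on timestamp_value (both assumptions, not part of the original):
-- Sketch: keeps timestamp_map current as snapshot rows arrive.
-- Assumes: ALTER TABLE timestamp_map ADD UNIQUE (timestamp_value);
DELIMITER //
CREATE TRIGGER records_after_insert
AFTER INSERT ON records
FOR EACH ROW
BEGIN
  -- INSERT IGNORE skips timestamps already mapped (relies on the UNIQUE key)
  INSERT IGNORE INTO timestamp_map (timestamp_value) VALUES (NEW.Timestamp);
END//
DELIMITER ;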
It's a bit out there, but I've gotten similar techniques to work very efficiently in the past for data like what you describe, when lots of back-and-forth subqueries and NOT EXISTS proved too slow.
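Putting the pieces together, the disappearance check can then use the dense numbering (a sketch, assuming the mapping table above; the guard excludes the latest snapshot, whose items have no successor to compare against):
SELECT tm.timestamp_value, t1.id
FROM records AS t1
JOIN timestamp_map AS tm ON tm.timestamp_value = t1.Timestamp
WHERE tm.timestamp_id < (SELECT MAX(timestamp_id) FROM timestamp_map)
  AND NOT EXISTS (SELECT 1
                  FROM records AS t2
                  JOIN timestamp_map AS tm2 ON tm2.timestamp_value = t2.Timestamp
                  WHERE t2.id = t1.id
                    AND tm2.timestamp_id = tm.timestamp_id + 1);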

Update:
See this entry in my blog for performance details:
MySQL: difference between sets
SELECT  ts,
        (
        SELECT  GROUP_CONCAT(id)
        FROM    mytable mi
        WHERE   mi.ts =
                (
                SELECT  MAX(ts)
                FROM    mytable mp
                WHERE   mp.ts = mo.pts
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    mytable mn
                WHERE   mn.ts = mo.ts
                        AND mn.id = mi.id
                )
        )
FROM    (
        SELECT  @r AS pts,
                @r := ts AS ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    mytable
                ORDER BY ts
                ) moo
        ) mo
To select only the last change:
SELECT  ts,
        (
        SELECT  GROUP_CONCAT(id)
        FROM    mytable mi
        WHERE   mi.ts =
                (
                SELECT  MAX(ts)
                FROM    mytable mp
                WHERE   mp.ts < mo.ts
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    mytable mn
                WHERE   mn.ts = mo.ts
                        AND mn.id = mi.id
                )
        )
FROM    (
        SELECT  MAX(ts) AS ts
        FROM    mytable
        ) mo
For this to be efficient, you need to have a composite index on mytable (ts, id) (in this order).
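For example (the index name is arbitrary):
CREATE INDEX ix_mytable_ts_id ON mytable (ts, id);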

Related

Finding the id's which include multiple criteria in long format

Suppose I have a table like this,
id  tagId
1   1
1   2
1   5
2   1
2   5
3   2
3   4
3   5
3   8
I want to select ids where tagId includes both 2 and 5. For this fake data set, it should return 1 and 3.
I tried,
select id from [dbo].[mytable] where tagId IN(2,5)
But that matches rows whose tagId is either 2 or 5, not ids that have both. I also did not want to keep my table in wide format, since tagId is dynamic and can reach any number of columns. I also considered filtering with two different queries and (somehow) intersecting the results; however, since I may search for more than two values inside tagId in real life, that sounds inefficient.
I am sure this is a common problem in tag searching. What do you suggest? Changing the table format?
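For reference, the sample data above can be reproduced with the following (the table name is the one from the question; column types are assumptions):
CREATE TABLE dbo.mytable (id int NOT NULL, tagId int NOT NULL);
INSERT dbo.mytable (id, tagId) VALUES
    (1, 1), (1, 2), (1, 5),
    (2, 1), (2, 5),
    (3, 2), (3, 4), (3, 5), (3, 8);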
One option is to count the number of distinct tagIds (from the ones you're looking for) each id has:
SELECT id
FROM [dbo].[mytable]
WHERE tagId IN (2,5)
GROUP BY id
HAVING COUNT(DISTINCT tagId) = 2
This is actually a Relational Division With Remainder question.
First, you have to place your input into proper table format. I suggest you use a Table Valued Parameter if executing from client code. You can also use a temp table or table variable.
DECLARE @ids TABLE (tagId int PRIMARY KEY);
INSERT @ids VALUES (2), (5);
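If you do go the TVP route from client code, the shape might look like this (the type and procedure names are illustrative assumptions; the query body matches the counting solution below):
CREATE TYPE dbo.TagIdList AS TABLE (tagId int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.FindIdsWithAllTags
    @ids dbo.TagIdList READONLY
AS
SELECT mt.id
FROM dbo.mytable mt
JOIN @ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);
GO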
There are a number of different solutions to this type of question.
Classic double-negative EXISTS
SELECT DISTINCT mt.id
FROM mytable mt
WHERE NOT EXISTS (SELECT 1
                  FROM @ids i
                  WHERE NOT EXISTS (SELECT 1
                                    FROM mytable mt2
                                    WHERE mt2.id = mt.id
                                      AND mt2.tagId = i.tagId)
                 );
This is not usually efficient, though.
Comparing to the total number of IDs to match
SELECT mt.id
FROM mytable mt
JOIN @ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);
This is much more efficient. You can also do this using a window function, it may be more or less efficient, YMMV.
SELECT mt.id
FROM mytable mt
JOIN (
    SELECT *,
           total = COUNT(*) OVER ()
    FROM @ids
) i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = MIN(i.total);
Another solution involves cross-joining everything and checking how many matches there are using conditional aggregation
SELECT mt.id
FROM (
    SELECT
        mt.id,
        mt.tagId,
        matches = SUM(CASE WHEN i.tagId = mt.tagId THEN 1 END),
        total = COUNT(*)
    FROM mytable mt
    CROSS JOIN @ids i
    GROUP BY
        mt.id,
        mt.tagId
) mt
GROUP BY mt.id
HAVING SUM(matches) = MIN(total)
   AND MIN(matches) >= 0;
db<>fiddle
There are other solutions also, see High Performance Relational Division in SQL Server

Firebird select from table distinct one field

The question I asked yesterday was simplified, but now I realize that I have to report the whole story.
I have to extract data from 4 different tables in a Firebird 2.5 database, and the following query works:
SELECT
PRODUZIONE_T.CODPRODUZIONE,
PRODUZIONE_T.NUMEROCOMMESSA as numeroco,
ANGCLIENTIFORNITORI.RAGIONESOCIALE1,
PRODUZIONE_T.DATACONSEGNA,
PRODUZIONE_T.REVISIONE,
ANGUTENTI.NOMINATIVO,
ORDINI_T.DATA
FROM PRODUZIONE_T
LEFT OUTER JOIN ORDINI_T ON PRODUZIONE_T.CODORDINE=ORDINI_T.CODORDINE
INNER JOIN ANGCLIENTIFORNITORI ON ANGCLIENTIFORNITORI.CODCLIFOR=ORDINI_T.CODCLIFOR
LEFT OUTER JOIN ANGUTENTI ON ANGUTENTI.IDUTENTE = PRODUZIONE_T.RESPONSABILEUC
ORDER BY right(numeroco,2) DESC, left(numeroco,3) desc
rows 1 to 500;
However, the query returns duplicate rows (or more) due to the REVISIONE column.
How do I select only the rows of a single NUMEROCOMMESSA with the maximum REVISIONE value?
This should work (column names per the sample structure below):
select d.cod, d.n_order, d.s_date, d.revision
FROM TAB1 d
JOIN
(
  select n_order, MAX(revision) as revision
  FROM TAB1
  Group By n_order
) m on m.n_order = d.n_order and m.revision = d.revision
Here you go - http://sqlfiddle.com/#!6/ce7cf/4
Sample data (as you set it in your original question):
create table TAB1 (
cod integer primary key,
n_order varchar(10) not null,
s_date date not null,
revision integer not null );
alter table tab1 add constraint UQ1 unique (n_order,revision);
insert into TAB1 values ( 1, '001/18', '2018-02-01', 0 );
insert into TAB1 values ( 2, '002/18', '2018-01-31', 0 );
insert into TAB1 values ( 3, '002/18', '2018-01-30', 1 );
The query:
select *
from tab1 d
join ( select n_ORDER, MAX(REVISION) as REVISION
FROM TAB1
Group By n_ORDER ) m
on m.n_ORDER = d.n_ORDER and m.REVISION = d.REVISION
Suggestions:
Google and read the classic book: "Understanding SQL" by Martin Gruber
Read Firebird SQL reference: https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25.html
Here is yet one more solution using Windowed Functions introduced in Firebird 3 - http://sqlfiddle.com/#!6/ce7cf/13
I do not have Firebird 3 at hand, so I cannot actually check whether there is some sudden incompatibility; try it at home. :-D
SELECT * FROM
(
  SELECT
    TAB1.*,
    ROW_NUMBER() OVER (
      PARTITION BY n_order
      ORDER BY revision DESC
    ) AS rn
  FROM TAB1
) d
WHERE rn = 1
(rn is used as the alias because RANK is a reserved word in Firebird 3.)
Read documentation
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
https://www.firebirdsql.org/file/documentation/release_notes/html/en/3_0/rnfb30-dml-windowfuncs.html
Which of the three solutions (including Gordon's) would be faster depends upon the specific database: the real data, the existing indexes, and the selectivity of those indexes.
While window functions allow a join-less query, I am not sure it would be faster on real data: the engine may ignore an index on the (order, revision) tuple and do a full scan before the rank = 1 condition is applied, while the first solution would most probably use such an index to get the maximums without actually reading every row in the table.
The Firebird-support mailing list suggested a way to do it with a single query; the trick is using both window functions and a CTE (common table expression): http://sqlfiddle.com/#!18/ce7cf/2
WITH TMP AS (
SELECT
*,
MAX(revision) OVER (
PARTITION BY n_order
) as max_REV
FROM TAB1
)
SELECT * FROM TMP
WHERE revision = max_REV
If you want the max revision number in Firebird:
select t.*
from tab1 t
where t.revision = (select max(t2.revision) from tab1 t2 where t2.n_order = t.n_order);
For performance, you want an index on tab1(n_order, revision). With such an index, performance should be competitive with any other approach.
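In Firebird that could be, for example (the index name is arbitrary; note that the UQ1 unique constraint in the sample above already provides an equivalent index):
CREATE INDEX IX_TAB1_ORDER_REV ON TAB1 (N_ORDER, REVISION);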

Oracle SQL - Efficient Join N Rows Per Match

The basic idea is to join two tables, let's call them MYTABLE1 and MYTABLE2, on a field JOINKEY. There will be lots of matches per JOINKEY (one row from MYTABLE1 corresponding to many rows in MYTABLE2; for the purposes of testing, MYTABLE1 has 50 rows), but we want to select only up to N matches per JOINKEY value. I have seen lots of inefficient solutions, for example:
select t1.*, t2.*
from MYTABLE1 t1 inner join
(select MYTABLE2.*,
row_number() over (partition by MYTABLE2.JOINKEY order by 1) as seqnum
from MYTABLE2) t2
on t1.joinkey = t2.joinkey and seqnum <= 2;
which takes over 5 minutes for me to run and returns less than 100 results, whereas something like
select t1.*, t2.*
from MYTABLE1 t1 inner join MYTABLE2 t2 on t1.JOINKEY = t2.JOINKEY
where rownum <= 100;
returns 100 results in ~60 milliseconds.
(To be sure of the validity of these results, I selected a different test table and performed the second query above on a specific single JOINKEY until I got a result set with less than 100 results, meaning it did in fact search through all of MYTABLE2. The total query time was 30 milliseconds. Afterwards, I started the original query, but this time getting 50 joins per row of MYTABLE1, which again took over 5 minutes to complete.)
How can I approach this in a not-so-terribly-inefficient manner?
It seems so simple: all we need to do is go through the rows of MYTABLE1, matching the JOINKEY field to that of rows of MYTABLE2, and move on to the next row of MYTABLE1 once we have matched the desired number for that row.
In the worst case scenario for my second example, we should have to spend 30ms searching through the full TABLE2 per row of TABLE1, of which there are 50 rows, for a total execution time of 1.5 seconds.
I wouldn't call the approach below efficient by any means, and it cheats a little and has some clunkiness, but it comes in under the 1500 ms limit you provided, so I'll add it as something to consider.
This example cheats in that it compiles a TYPE so that the result of an anonymous function can be queried with TABLE().
The approach just iteratively probes MYTABLE2 with each JOINKEY from MYTABLE1 using an anonymous subquery-factoring-clause function and accumulates the results as it goes.
I don't know the real structure of the tables involved, so this example pretends MYTABLE2 has one additional CHAR attribute called OTHER_DATA that is the target of the SELECT.
First, setup the test tables:
CREATE TABLE MYTABLE1 (
JOINKEY NUMBER NOT NULL
);
CREATE TABLE MYTABLE2 (
JOINKEY NUMBER NOT NULL,
OTHER_DATA CHAR(1) NOT NULL
);
CREATE INDEX MYTABLE2_I
ON MYTABLE2 (JOINKEY);
Then add the test data. 50 rows to MYTABLE1 and 100M rows to MYTABLE2:
INSERT INTO MYTABLE1
SELECT LEVEL
FROM DUAL
CONNECT BY LEVEL < 51;
BEGIN
<<COMMIT_LOOP>>
FOR OUTER_POINTER IN 1..4000 LOOP
<<DATA_LOOP>>
FOR POINTER IN 1..10 LOOP
INSERT INTO MYTABLE2
SELECT
JOINKEY, OTHER_DATA
FROM
(SELECT LEVEL AS JOINKEY FROM DUAL CONNECT BY LEVEL < 51)
CROSS JOIN
(SELECT CHR(64 + LEVEL) AS OTHER_DATA FROM DUAL CONNECT BY LEVEL < 51);
END LOOP DATA_LOOP;
COMMIT;
END LOOP COMMIT_LOOP;
END;
/
Then gather stats...
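One way to do that is with standard DBMS_STATS calls:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'MYTABLE1');
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'MYTABLE2');
END;
/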
Verify the table counts:
SELECT COUNT(*) FROM MYTABLE1;
50
SELECT COUNT(*) FROM MYTABLE2;
100000000
Then create a TYPE that includes the desired data:
CREATE OR REPLACE TYPE JOINKEY_OTHER_DATA IS OBJECT (JOINKEY1 NUMBER, OTHER_DATA CHAR(1));
/
CREATE OR REPLACE TYPE JOINKEY_OTHER_DATA_LIST IS TABLE OF JOINKEY_OTHER_DATA;
/
And then run a query that uses an anonymous subquery-factoring-block function that imposes a rowcount per JOINKEY to be returned. In this first example, it fetches two MYTABLE2 rows per JOINKEY:
SELECT SYSTIMESTAMP FROM DUAL;
WITH FUNCTION FETCH_N_ROWS
(P_MATCHES_LIMIT IN NUMBER)
RETURN JOINKEY_OTHER_DATA_LIST
AS
V_JOINKEY_OTHER_DATAS JOINKEY_OTHER_DATA_LIST;
BEGIN
V_JOINKEY_OTHER_DATAS := JOINKEY_OTHER_DATA_LIST();
FOR JOINKEY_POINTER IN (SELECT MYTABLE1.JOINKEY
FROM MYTABLE1)
LOOP
DECLARE
V_MYTABLE2_JOINKEYS JOINKEY_OTHER_DATA_LIST;
BEGIN
SELECT JOINKEY_OTHER_DATA(MYTABLE2.JOINKEY, MYTABLE2.OTHER_DATA)
BULK COLLECT INTO V_MYTABLE2_JOINKEYS
FROM MYTABLE2 WHERE MYTABLE2.JOINKEY = JOINKEY_POINTER.JOINKEY
FETCH FIRST P_MATCHES_LIMIT ROWS ONLY;
V_JOINKEY_OTHER_DATAS := V_JOINKEY_OTHER_DATAS MULTISET UNION ALL V_MYTABLE2_JOINKEYS;
END;
END LOOP;
RETURN V_JOINKEY_OTHER_DATAS;
END;
SELECT *
FROM TABLE (FETCH_N_ROWS(2))
/
SELECT SYSTIMESTAMP FROM DUAL;
Results in:
SYSTIMESTAMP
18-APR-17 03.32.10.623056000 PM -06:00
JOINKEY1 OTHER_DATA
1 A
1 B
2 A
2 B
3 A
3 B
...
49 A
49 B
50 A
50 B
100 rows selected.
SYSTIMESTAMP
18-APR-17 03.32.11.014554000 PM -06:00
By changing the number passed to FETCH_N_ROWS, different data volumes can be fetched with fairly consistent performance.
...
SELECT * FROM TABLE (FETCH_N_ROWS(13));
Returns:
...
50 K
50 L
50 M
650 rows selected.
You cannot compare the two queries. The second query is simply returning whatever rows come first. The first has to go through all the data to return anything. A more apt comparison would use:
select . . .
from (select t1.*, t2.*
from MYTABLE1 t1 inner join
MYTABLE2 t2
on t1.JOINKEY = t2.JOINKEY
order by t1.JOINKEY
) t1
where rownum <= 100;
This has to read all the data before returning anything, so it is more analogous to using row_number().
But, start with this query:
select t1.*, t2.*
from MYTABLE1 t1 inner join
     (select t2.*,
             row_number() over (partition by t2.JOINKEY order by 1) as seqnum
      from MYTABLE2 t2
     ) t2
     on t1.joinkey = t2.joinkey and seqnum <= 2;
For this query, you want an index on MYTABLE2(JOINKEY). If the ORDER BY has another key, that should be in the query as well.
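For example, if the rows per key should be chosen by some ordering column, a covering index might look like this (CREATED_AT is a hypothetical column, not part of the schema above):
-- CREATED_AT is hypothetical; substitute whatever column drives the ORDER BY
CREATE INDEX MYTABLE2_JK_ORDER_I ON MYTABLE2 (JOINKEY, CREATED_AT);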

create a table of duplicated rows of another table using the select statement

I have a table with one column containing different integers.
For each integer in the table, I would like to duplicate it as many times as its number of digits.
For example:
12345 (5 digits):
1. 12345
2. 12345
3. 12345
4. 12345
5. 12345
I thought of doing it using with recursive t (...) as (...), but I didn't manage, since I don't really understand how it works and what is happening "behind the scenes".
I don't want to use plain INSERTs, because I want it to be scalable and automatic for as many integers as needed in the table.
Any thoughts and an explanation would be great.
The easiest way is to join to a table with numbers from 1 to n in it.
SELECT n, x
FROM yourtable
JOIN
(
SELECT day_of_calendar AS n
FROM sys_calendar.CALENDAR
WHERE n BETWEEN 1 AND 12 -- maximum number of digits
) AS dt
ON n <= CHAR_LENGTH(TRIM(ABS(x)))
In my example I abused TD's built-in calendar, but that's not a good choice: the optimizer doesn't know how many rows will be returned, and as the plan must be a Product Join, it might decide to do something stupid. So better use a numbers table...
Create a numbers table that will contain the integers from 1 to the maximum number of digits that the numbers in your table will have (I went with 6):
create table numbers(num int)
insert numbers
select 1 union select 2 union select 3 union select 4 union select 5 union select 6
You already have your table (but here's what I was using to test):
create table your_table(num int)
insert your_table
select 12345 union select 678
Here's the query to get your results:
select ROW_NUMBER() over (partition by b.num order by b.num) as row_num,
       b.num,
       LEN(cast(b.num as char)) as num_digits
into #temp
from your_table b
cross join numbers n
select t.num
from #temp t
where t.row_num <= t.num_digits
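The #temp staging step isn't strictly necessary, by the way; the same filter can be applied in a single query (a sketch against the same sample tables):
select b.num
from your_table b
join numbers n
  on n.num <= LEN(cast(b.num as char));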
I found a nice way to perform this action. Here goes:
with recursive t (num, num_as_char, char_n)
as
(
  -- anchor member: each number as a string, plus its first character
  select num
        ,cast (num as varchar (100)) as num_as_char
        ,substr (num_as_char, 1, 1)
  from numbers
  union all
  -- recursive member: strip the leading character on each pass,
  -- producing one more row per remaining character
  select num
        ,substr (t.num_as_char, 2) as num_as_char2
        ,substr (num_as_char2, 1, 1)
  from t
  where char_length (num_as_char2) > 0
)
select *
from t
order by num, char_length (num_as_char) desc

How to do this data transformation

This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each activity sequence in a group, I want to find pairs of actions where Action1.Serial = Action2.Serial + k, and count how many times each pair happens.
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given that the input table contains millions of entries?
This should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
  -- GroupId is carried through so pairs never straddle two groups
  select
    GroupId, Serial, 'a', "Action"
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions as B
), Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
select
  a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various @k values on stable data, this may work better in the long run:
declare @k int = 1;
select
  GroupId, Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select
  GroupId, Serial-@k, 'b', "Action"
from Actions as B;
go
create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go
with Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
select
  a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
  ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;
The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood, and when the query has to scan the entire table, it actually ends up being many times slower than if you processed small chunks that collectively cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
    a1.Action AS FirstAction,
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM
    Activities a1 JOIN Activities a2
      ON (a1.groupid = a2.groupid
          AND a2.Serial = a1.Serial + @k)
WHERE
    a1.groupid = 1
GROUP BY
    a1.Action,
    a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
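If not, something like this would help (the index name is illustrative; the sample Actions table above already gets an equivalent index from its primary key):
CREATE INDEX IX_Activities_GroupId_Serial ON Activities (GroupId, Serial);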