Joining multiple tables containing historical data - sql

I have multiple tables containing historical data, so there is no one-to-one relation between ids.
I have to join on the id and on the timestamps indicating when the data was active. TO_TIMESTMP can be null if the data is still active or if it was never set for old data.
My main table after some grouping outputs something like this:
TABLE_A
AID USER_ID AMOUNT FROM_TIMESTMP TO_TIMESTMP
1 1 2 11/21/2012 00:00:00 12/04/2012 11:59:00
1 2 3 11/24/2012 12:00:00 null
2 1 2 11/21/2012 01:00:00 null
Then I have another table that I use to link further:
TABLE_B
AID CID FROM_TIMESTMP TO_TIMESTMP HIST_ID
1 3 11/01/2012 00:00:00 null 1
1 3 11/21/2012 00:00:00 12/04/2012 11:59:00 2
1 3 11/24/2012 12:00:00 null 3
2 4 11/21/2012 00:59:59 null 4
And my third table looks something like this:
TABLE_C
CID VALUE FROM_TIMESTMP TO_TIMESTMP HIST_ID
3 A 11/01/2012 00:00:00 null 1
3 B 11/21/2012 00:00:00 11/24/2012 11:59:00 2
3 C 11/24/2012 12:00:00 null 3
4 D 11/21/2012 01:00:01 null 4
My expected output, if I want to combine Table A with VALUE from Table C through Table B, is:
AID USER_ID AMOUNT FROM_TIMESTMP TO_TIMESTMP VALUE
1 1 2 11/21/2012 00:00:00 12/04/2012 11:59:00 B
1 2 3 11/24/2012 12:00:00 null C
2 1 2 11/21/2012 01:00:00 null D
There are indexes on everything except AMOUNT in Table A and VALUE in Table C, and I use the following SQL to pull out the data.
SELECT a.AID, a.USER_ID, a.AMOUNT, a.FROM_TIMESTMP, a.TO_TIMESTMP, c.VALUE from
(SELECT AID, USER_ID, SUM(AMOUNT) AS AMOUNT, FROM_TIMESTMP, TO_TIMESTMP from TABLE_A GROUP BY AID, USER_ID, FROM_TIMESTMP, TO_TIMESTMP) a
inner join TABLE_B b on b.HIST_ID in (select max(HIST_ID) from TABLE_B
where AID = a.AID and FROM_TIMESTMP <= a.FROM_TIMESTMP+1/2880 and (TO_TIMESTMP>= a.FROM_TIMESTMP or TO_TIMESTMP is null))
inner join TABLE_C c on c.HIST_ID in (select max(HIST_ID) from TABLE_C
where CID = b.CID and FROM_TIMESTMP <= a.FROM_TIMESTMP+1/2880 and (TO_TIMESTMP>= a.FROM_TIMESTMP or TO_TIMESTMP is null));
Due to some inconsistencies in when data is saved, I have added a 30-second grace period when comparing starting timestamps, in case they were created around the same time. Is there a way to improve how I do this?
I select the row with MAX(HIST_ID) so that cases like AID=1 and USER_ID=2 in TABLE_A only get the newest row that matches the id/timestamp from the other tables.
In my real data I inner join 4 tables like this (instead of just 2), and it works well on my local test data (pulling just over 42000 rows in 11 seconds when asking for all data).
But when I run it on the test environment, where the data volume is closer to production, it runs too slowly, even when I limit the first table to about 6000 rows by requiring FROM_TIMESTMP to be between two dates.
Is there a way to improve the performance by joining the tables another way?

One simple change that avoids the repeated max() subqueries is:
select a.aid,a.user_id,a.amount,a.from_timestmp,a.to_timestmp,a.value
from (select a.aid,a.user_id,a.amount,a.from_timestmp,a.to_timestmp,c.value,
row_number() over (partition by a.aid,a.user_id order by b.hist_id desc, c.hist_id desc) rn
from (select aid,user_id,sum(amount) amount,from_timestmp,to_timestmp
from table_a
group by aid,user_id,from_timestmp,to_timestmp) a
inner join table_b b
on b.aid = a.aid
and b.from_timestmp <= a.from_timestmp + (1 / 2880)
and ( b.to_timestmp >= a.from_timestmp or b.to_timestmp is null)
inner join table_c c
on c.cid = b.cid
and c.from_timestmp <= a.from_timestmp + (1 / 2880)
and ( c.to_timestmp >= a.from_timestmp or c.to_timestmp is null)) a
where rn = 1
order by a.aid, a.user_id;
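To check the row_number() rewrite against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The column names from_ts/to_ts, the ISO-formatted timestamp strings and the datetime(..., '+30 seconds') grace period are assumptions standing in for FROM_TIMESTMP, TO_TIMESTMP and + 1/2880; window functions need SQLite 3.25 or newer. TABLE_A is loaded already grouped, so the SUM step is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (aid INT, user_id INT, amount INT, from_ts TEXT, to_ts TEXT);
CREATE TABLE table_b (aid INT, cid INT, from_ts TEXT, to_ts TEXT, hist_id INT);
CREATE TABLE table_c (cid INT, value TEXT, from_ts TEXT, to_ts TEXT, hist_id INT);
INSERT INTO table_a VALUES
  (1, 1, 2, '2012-11-21 00:00:00', '2012-12-04 11:59:00'),
  (1, 2, 3, '2012-11-24 12:00:00', NULL),
  (2, 1, 2, '2012-11-21 01:00:00', NULL);
INSERT INTO table_b VALUES
  (1, 3, '2012-11-01 00:00:00', NULL, 1),
  (1, 3, '2012-11-21 00:00:00', '2012-12-04 11:59:00', 2),
  (1, 3, '2012-11-24 12:00:00', NULL, 3),
  (2, 4, '2012-11-21 00:59:59', NULL, 4);
INSERT INTO table_c VALUES
  (3, 'A', '2012-11-01 00:00:00', NULL, 1),
  (3, 'B', '2012-11-21 00:00:00', '2012-11-24 11:59:00', 2),
  (3, 'C', '2012-11-24 12:00:00', NULL, 3),
  (4, 'D', '2012-11-21 01:00:01', NULL, 4);
""")

# Join every candidate history row once, then keep the newest per (aid, user_id).
sql = """
SELECT aid, user_id, amount, value
FROM (
  SELECT a.aid, a.user_id, a.amount, c.value,
         ROW_NUMBER() OVER (PARTITION BY a.aid, a.user_id
                            ORDER BY b.hist_id DESC, c.hist_id DESC) AS rn
  FROM table_a a
  JOIN table_b b
    ON b.aid = a.aid
   AND b.from_ts <= datetime(a.from_ts, '+30 seconds')
   AND (b.to_ts >= a.from_ts OR b.to_ts IS NULL)
  JOIN table_c c
    ON c.cid = b.cid
   AND c.from_ts <= datetime(a.from_ts, '+30 seconds')
   AND (c.to_ts >= a.from_ts OR c.to_ts IS NULL)
)
WHERE rn = 1
ORDER BY aid, user_id
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

On this sample the three rows come back with VALUE B, C and D, matching the expected output in the question. Whether it is actually faster on production-sized data depends on the plan the optimizer picks, so it is worth comparing execution plans before switching.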

There could be many reasons why your query runs faster in one environment and slower in another. Most probably the optimizer has chosen two different plans and one of them runs faster, likely because the statistics are slightly different.
You can certainly optimize your query to use your indexes, but I think your main problem lies with the data and/or the data model. With bad data you'll run into this kind of problem again and again.
It's pretty common to archive data in the same table; it can be useful for representing transient data that needs to be queried historically. However, having archived data should not make you forget the essential rules of database design.
In your case it seems you have three related tables: they would be linked in your entity-relationship model. However, somewhere during the design process they lost this link, so now you can't reliably identify which row is related to which.
I suggest the following:
If two tables are related in your ER model, add a foreign key. This will ensure that you can always join them if you need to. Foreign keys only add a small cost to DML operations (and only for INSERT, DELETE and updates to the key columns). If your data is inserted once and queried many times, the performance impact is negligible.
In your case, if (AID, FROM_TIMESTMP) is your primary key in TABLE_A, then have the same columns in TABLE_B reference TABLE_A's primary key columns. You may need FROM_TIMESTMP_A and FROM_TIMESTMP_C if A and C (which seem unrelated) have distinct update schemes.
If you don't follow this logic, you will have to build your queries differently. If A, B and C are each historically archived yet not fully referenced, you will only be able to answer questions with a single point-in-time reference, such as "What was the status of the DB at time TS?":
SELECT *
FROM TABLE_A a
JOIN TABLE_B b ON a.aid = b.aid
JOIN TABLE_C c ON c.cid = b.cid
WHERE a.from_timestmp <= :TS
AND nvl(a.to_timestmp, DATE '9999-12-31') > :TS
AND b.from_timestmp <= :TS
AND nvl(b.to_timestmp, DATE '9999-12-31') > :TS
AND c.from_timestmp <= :TS
AND nvl(c.to_timestmp, DATE '9999-12-31') > :TS
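As a rough illustration of the point-in-time query, here is a sketch run against the question's sample rows with Python's sqlite3 module. COALESCE stands in for Oracle's nvl, the from_ts/to_ts column names and ISO timestamp strings are assumptions, and the chosen instant (2012-11-22 00:00:00) is only an example value for :TS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (aid INT, user_id INT, amount INT, from_ts TEXT, to_ts TEXT);
CREATE TABLE table_b (aid INT, cid INT, from_ts TEXT, to_ts TEXT, hist_id INT);
CREATE TABLE table_c (cid INT, value TEXT, from_ts TEXT, to_ts TEXT, hist_id INT);
INSERT INTO table_a VALUES
  (1, 1, 2, '2012-11-21 00:00:00', '2012-12-04 11:59:00'),
  (1, 2, 3, '2012-11-24 12:00:00', NULL),
  (2, 1, 2, '2012-11-21 01:00:00', NULL);
INSERT INTO table_b VALUES
  (1, 3, '2012-11-01 00:00:00', NULL, 1),
  (1, 3, '2012-11-21 00:00:00', '2012-12-04 11:59:00', 2),
  (1, 3, '2012-11-24 12:00:00', NULL, 3),
  (2, 4, '2012-11-21 00:59:59', NULL, 4);
INSERT INTO table_c VALUES
  (3, 'A', '2012-11-01 00:00:00', NULL, 1),
  (3, 'B', '2012-11-21 00:00:00', '2012-11-24 11:59:00', 2),
  (3, 'C', '2012-11-24 12:00:00', NULL, 3),
  (4, 'D', '2012-11-21 01:00:01', NULL, 4);
""")

# Every table is filtered to the versions that were valid at the same instant :ts.
sql = """
SELECT a.aid, a.user_id, b.hist_id, c.hist_id, c.value
FROM table_a a
JOIN table_b b ON a.aid = b.aid
JOIN table_c c ON c.cid = b.cid
WHERE a.from_ts <= :ts AND COALESCE(a.to_ts, '9999-12-31') > :ts
  AND b.from_ts <= :ts AND COALESCE(b.to_ts, '9999-12-31') > :ts
  AND c.from_ts <= :ts AND COALESCE(c.to_ts, '9999-12-31') > :ts
ORDER BY a.aid, b.hist_id, c.hist_id
"""
rows = conn.execute(sql, {"ts": "2012-11-22 00:00:00"}).fetchall()
for row in rows:
    print(row)
```

With this sample, AID 1 comes back four times (values A, B, A, B) because the open-ended HIST_ID 1 rows in TABLE_B and TABLE_C overlap the later versions. That is the kind of ambiguity the data-model remarks above are about: as long as validity intervals overlap, no join condition alone can tell which version is the right one.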

How can I replace the LAST() function in MS Access with proper ordering on a rather large table?

I have an MS Access database with the two tables, Asset and Transaction. The schema looks like this:
Table ASSET
Key Date1 AType FieldB FieldC ...
A 2023.01.01 T1
B 2022.01.01 T1
C 2023.01.01 T2
.
.
TABLE TRANSACTION
Date2 Key TType1 TType2 TType3 FieldOfInterest ...
2022.05.31 A 1 1 1 10
2022.08.31 A 1 1 1 40
2022.08.31 A 1 2 1 41
2022.09.31 A 1 1 1 30
2022.07.31 A 1 1 1 30
2022.06.31 A 1 1 1 20
2022.10.31 A 1 1 1 45
2022.12.31 A 2 1 1 50
2022.11.31 A 1 2 1 47
2022.05.23 B 2 1 1 30
2022.05.01 B 1 1 1 10
2022.05.12 B 1 2 1 20
.
.
.
The ASSET table has a PK (Key).
The TRANSACTION table has a composite key (Key, Date2, TType1, TType2, TType3).
Given the above tables let's see an example:
Input1 = 2022.04.01
Input2 = 2022.08.31
Desired result:
Key FieldOfInterest
A 41
because if the Transactions in scope were ordered by Date2, TType1, TType2, TType3, all ascending, then the record with FieldOfInterest = 41 would be the last one.
Note that Asset B is not in scope because Asset.Date1 < Input1, and neither is Asset C because AType != T1. Ultimately I am interested in the SUM(FieldOfInterest) of the last transactions of all Assets that are in scope as determined by the input variables.
The following query has so far provided the right results, but after upgrading to a newer MS Access version the LAST() operation no longer reliably returns the row that was added to the Transaction table last.
I have several input values, but the most important ones are two dates; let's call them InputDate1 and InputDate2.
This is how it worked so far:
SELECT Asset.AType, Last(FieldOfInterest) AS CurrentValue ,Asset.Key
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= InputDate2 And Asset.Date1 >= InputDate1
GROUP BY Asset.Key, Asset.AType
HAVING Asset.AType='T1'
It is known that grouped records are not guaranteed to be in any order. Obviously it is a mistake to rely on the GROUP BY operation always keeping the original table order, but let's ignore this for now.
I have been struggling to come up with the right way to do the following:
join the Asset and Transaction tables on Asset.Key = Transaction.Key
filter by Asset.Date1 >= InputDate1 AND Transaction.Date2 <= InputDate2
then I need to select one record for all Transaction.Key where Date2 and TType1 and TType2 and TType3 has the highest value. (this represents the actual last record for given Key)
As far as I know there is no way to order records within a GROUP BY clause, which is unfortunate.
I have tried ranking, but the Transaction table is large (800k rows) and the performance was very slow; I need something faster. The following is an example of three saved queries that I wrote and chained together, but the performance is very disappointing, probably due to the ranking step.
-- Saved query step_1
SELECT Asset.*, Transaction.*
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= 44926
AND Asset.Date1 >= 44562
AND Asset.aType = 'T1'
-- Saved query step_2
SELECT tr.FieldOfInterest, (SELECT Count(*) FROM
(SELECT tr2.Transaction.Key, tr2.Date2, tr2.Transaction.tType1, tr2.tType2, tr2.tType3 FROM step_1 AS tr2) AS tr1
WHERE (tr1.Date2 > tr.Date2 OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 > tr.Transaction.tType1) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 > tr.tType2) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 = tr.tType2 AND tr1.tType3 > tr.tType3))
AND tr1.Key = tr.Transaction.Key)+1 AS Rank
FROM step_1 AS tr
-- Saved query step_3
SELECT SUM(FieldOfInterest) FROM step_2
WHERE Rank = 1
I hope I am being clear enough so that I can get some useful recommendations. I've been stuck with this for weeks now and really don't know what to do about it. I am open for any suggestions.
Reading the following specification:
"then I need to select one record for all Transaction.Key where Date2 and TType1 and TType2 and TType3 has the highest value. (this represents the actual last record for given Key)"
consider a simple aggregation in step 2 to retrieve the max values, then in step 3 join all fields back to the first query.
Step 1 (rewritten to avoid name collisions and too many columns)
SELECT a.[Key] AS Asset_Key, a.Date1, a.AType,
t.[Key] AS Transaction_Key, t.Date2,
t.TType1, t.TType2, t.TType3, t.FieldOfInterest
FROM Transaction t
INNER JOIN Asset a ON t.[Key] = a.[Key]
WHERE t.Date2 <= 44926
AND a.Date1 >= 44562
AND a.AType = 'T1'
Step 2
SELECT Transaction_Key,
MAX(Date2) AS Max_Date2,
MAX(TType1) AS Max_TType1,
MAX(TType2) AS Max_TType2,
MAX(TType3) AS Max_TType3
FROM step_1
GROUP BY Transaction_Key
Step 3
SELECT s1.*
FROM step_1 s1
INNER JOIN step_2 s2
ON s1.Transaction_Key = s2.Transaction_Key
AND s1.Date2 = s2.Max_Date2
AND s1.TType1 = s2.Max_TType1
AND s1.TType2 = s2.Max_TType2
AND s1.TType3 = s2.Max_TType3
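One caveat worth checking before relying on this: taking MAX of each column independently only matches a row when a single row happens to hold all four maxima. The sketch below uses Python's sqlite3 module (not Access) and the Asset A rows from the question to compare that approach with a NOT EXISTS anti-join, which is a different technique from the answer above and picks the lexicographically last row per key. The table and column names (trans, asset_key, foi) are made up for the example, the Asset join is left out, and the dates are compared as text exactly as written in the question. Row-value comparison needs SQLite 3.15 or newer; in Access it would have to be expanded into an OR chain like the one in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trans (date2 TEXT, asset_key TEXT,
                ttype1 INT, ttype2 INT, ttype3 INT, foi INT)""")
conn.executemany("INSERT INTO trans VALUES (?,?,?,?,?,?)", [
    ("2022.05.31", "A", 1, 1, 1, 10),
    ("2022.08.31", "A", 1, 1, 1, 40),
    ("2022.08.31", "A", 1, 2, 1, 41),
    ("2022.09.31", "A", 1, 1, 1, 30),
    ("2022.07.31", "A", 1, 1, 1, 30),
    ("2022.06.31", "A", 1, 1, 1, 20),
    ("2022.10.31", "A", 1, 1, 1, 45),
    ("2022.12.31", "A", 2, 1, 1, 50),
    ("2022.11.31", "A", 1, 2, 1, 47),
])

def last_per_key(cutoff):
    """Keep a row only if no later row (in lexicographic order) exists for its key."""
    sql = """
    SELECT t.asset_key, t.foi
    FROM trans t
    WHERE t.date2 <= :cutoff
      AND NOT EXISTS (
        SELECT 1 FROM trans t2
        WHERE t2.asset_key = t.asset_key
          AND t2.date2 <= :cutoff
          AND (t2.date2, t2.ttype1, t2.ttype2, t2.ttype3)
            > (t.date2, t.ttype1, t.ttype2, t.ttype3))
    """
    return conn.execute(sql, {"cutoff": cutoff}).fetchall()

def per_column_max(cutoff):
    """Steps 2 and 3 from the answer: independent MAX per column, then join back."""
    sql = """
    SELECT t.asset_key, t.foi
    FROM trans t
    JOIN (SELECT asset_key, MAX(date2) AS d, MAX(ttype1) AS m1,
                 MAX(ttype2) AS m2, MAX(ttype3) AS m3
          FROM trans WHERE date2 <= :cutoff GROUP BY asset_key) m
      ON m.asset_key = t.asset_key AND m.d = t.date2
     AND m.m1 = t.ttype1 AND m.m2 = t.ttype2 AND m.m3 = t.ttype3
    WHERE t.date2 <= :cutoff
    """
    return conn.execute(sql, {"cutoff": cutoff}).fetchall()

print(last_per_key("2022.08.31"), per_column_max("2022.08.31"))
print(last_per_key("2022.12.31"), per_column_max("2022.12.31"))
```

With the 2022.08.31 cutoff both approaches return 41. With the 2022.12.31 cutoff the anti-join returns 50, while the per-column maxima (TType1 = 2 from one row, TType2 = 2 from another) match no single row, so the key drops out of the result. An index on (Key, Date2, TType1, TType2, TType3), which the composite key already provides, is what the anti-join would rely on for speed.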

SQL: select rows from a certain table based on conditions in this and another table

I have two tables that share IDs in a PostgreSQL database.
I would like to select certain rows from table A, based on condition Y (in table A) AND on condition Z in a different table (B).
For example:
Table A Table B
ID | type ID | date
0 E 1 01.01.2022
1 F 2 01.01.2022
2 E 3 01.01.2010
3 F
IDs MUST be unique - the same ID can appear only once in each table, and if the same ID is in both tables it means that both are referring to the same object.
Using an SQL query, I would like to find all cases where:
1 - the same ID exists in both tables
2 - type is F
3 - date is after 31.12.2021
And again, only rows from table A will be returned.
So the only returned row should be: 1 F
It is a bit hard to understand what problem you are actually facing, as this is very basic SQL.
Use EXISTS:
select *
from a
where type = 'F'
and exists (select null from b where b.id = a.id and dt >= date '2022-01-01');
Or IN:
select *
from a
where type = 'F'
and id in (select id from b where dt >= date '2022-01-01');
Or, as the IDs are unique in both tables, join:
select a.*
from a
join b on b.id = a.id
where a.type = 'F'
and b.dt >= date '2022-01-01';
My favorite here is the IN clause, because you want to select data from table A where conditions are met. So no join needed, just a where clause, and IN is easier to read than EXISTS.
SELECT *
FROM A
WHERE type='F'
AND id IN (
SELECT id
FROM B
WHERE DATE>='2022-01-01' -- '2022' imo should be enough, need to check
);
I don't think joining is necessary.
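For anyone who wants to confirm that the EXISTS, IN and JOIN variants agree, here is a small sketch using Python's sqlite3 module instead of PostgreSQL. The date column is called dt as in the first answer, and dates are stored as ISO text, which is an assumption made so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, dt TEXT);
INSERT INTO a VALUES (0, 'E'), (1, 'F'), (2, 'E'), (3, 'F');
INSERT INTO b VALUES (1, '2022-01-01'), (2, '2022-01-01'), (3, '2010-01-01');
""")

queries = {
    "exists": """SELECT * FROM a
                 WHERE type = 'F'
                   AND EXISTS (SELECT 1 FROM b
                               WHERE b.id = a.id AND b.dt >= '2022-01-01')""",
    "in":     """SELECT * FROM a
                 WHERE type = 'F'
                   AND id IN (SELECT id FROM b WHERE dt >= '2022-01-01')""",
    "join":   """SELECT a.* FROM a
                 JOIN b ON b.id = a.id
                 WHERE a.type = 'F' AND b.dt >= '2022-01-01'""",
}

# Run each variant and collect its rows; all three should agree.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

All three return the single row (1, 'F') on the sample data. The join form only stays equivalent because the IDs are unique in B; with duplicates it would repeat rows from A.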

Inserting in a table from a two table join using Trigger for filter

I am working on an Oracle DB and have, for example, these two tables:
TABLE A :
ID Phone WEEK MODEL
1 10 14-18 XYX
2 32 15-18 XXZ
3 40 15-18 XYX
4 19 16-18 ZZT
5 10 14-18 XYX
TABLE B :
ID MODEL TRAFIC
1 XYX 2G/3G
2 XCA 2G/3G/4G
3 ZZT 2G/3G/4G
4 ABC 2G only
5 XYZ 2G/3G
6 XXZ 2G only
TABLE C RESULTS of JOIN :
ID Phone WEEK MODEL TRAFIC
1 10 14-18 XYX 2G/3G
2 32 15-18 XXZ 2G only
3 40 15-18 XYX 2G/3G
4 19 16-18 ZZT 2G/3G/4G
Now I want to insert the rows from the join of Table A and Table B into Table C where (A.Phone != C.Phone and A.WEEK != C.WEEK).
Here's the SQL script for the insert; initially the result table C is empty:
INSERT INTO C(PHONE, MODEL, TRAFIC, WEEK)
SELECT DISTINCT PHONE, WEEK, MODEL,TRAFIC
FROM(SELECT WEEK, A.PHONE,A.MODEL,B.TRAFIC
FROM A
LEFT JOIN B ON B.model = A.model)
GROUP BY PHONE, WEEK;
I want to use a trigger while inserting the values; it should first check whether the phone has already been inserted in the same week.
Thanks.
You can try the following code, which does not use a trigger but checks whether the phone/week combination already exists:
INSERT INTO C(PHONE, WEEK, MODEL, TRAFIC)
SELECT DISTINCT A.PHONE, A.WEEK, A.MODEL, B.TRAFIC
FROM A
LEFT JOIN B ON B.model = A.model
-- skip phone/week combinations that are already in C
WHERE NOT EXISTS (SELECT 1
FROM C
WHERE C.PHONE = A.PHONE
AND C.WEEK = A.WEEK);
As far as I know, a trigger cannot silently skip a row; it can only reject the insert by raising an error. If you really need a trigger, an alternative is an AFTER INSERT trigger that deletes the rows from table C that were just inserted but shouldn't have been (using another table to store the rows that need to be deleted).
An example of that alternative can be found here:
https://community.oracle.com/thread/484449?start=15&tstart=0
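To see the NOT EXISTS guard at work, here is a minimal sketch using Python's sqlite3 module in place of Oracle, loaded with the sample rows from the question. It runs the same insert twice; the second run should add nothing because every phone/week combination is already in C. Table C is created without the ID column, which is a simplification for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INT, phone INT, week TEXT, model TEXT);
CREATE TABLE b (id INT, model TEXT, trafic TEXT);
CREATE TABLE c (phone INT, week TEXT, model TEXT, trafic TEXT);
INSERT INTO a VALUES
  (1, 10, '14-18', 'XYX'), (2, 32, '15-18', 'XXZ'), (3, 40, '15-18', 'XYX'),
  (4, 19, '16-18', 'ZZT'), (5, 10, '14-18', 'XYX');
INSERT INTO b VALUES
  (1, 'XYX', '2G/3G'), (2, 'XCA', '2G/3G/4G'), (3, 'ZZT', '2G/3G/4G'),
  (4, 'ABC', '2G only'), (5, 'XYZ', '2G/3G'), (6, 'XXZ', '2G only');
""")

insert_sql = """
INSERT INTO c (phone, week, model, trafic)
SELECT DISTINCT a.phone, a.week, a.model, b.trafic
FROM a
LEFT JOIN b ON b.model = a.model
WHERE NOT EXISTS (SELECT 1 FROM c
                  WHERE c.phone = a.phone AND c.week = a.week)
"""

def count_c():
    return conn.execute("SELECT COUNT(*) FROM c").fetchone()[0]

conn.execute(insert_sql)
after_first = count_c()   # duplicate phone 10 / week 14-18 collapses to one row
conn.execute(insert_sql)
after_second = count_c()  # nothing new: every phone/week pair is already present

contents = conn.execute("SELECT * FROM c ORDER BY phone").fetchall()
print(after_first, after_second, contents)
```

This only guards against rows that are already committed and visible when the statement runs. If two sessions insert at the same time, a unique constraint on (PHONE, WEEK) would be the safer backstop.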

sql query to fetch the records from a two tables

Table A
Id pin status etc
1 11 FAILED
2 22
3 44
4 55 FAILED
Table B
id PIN msg counter
1 11 xyz 1
4 55 wsc 10
Table data: I have 2 tables, table A (status, id, pin as columns) and table B (counter, id as columns).
I need a SQL query to select the records where A.status = failed and B.counter < 10.
In the final result I need all records with A.status = failed and B.counter < 10, and also fresh records that won't be present in table B.
But fresh records won't be present in table B, so the B.counter < 10 condition is not satisfied.
How do I handle this situation?
If I'm interpreting this correctly, you want to join tables A and B on the PIN column and you want records with A.status = failed and B.counter < 10. But you also want records from table A that have no matching PIN in table B?
This may not be the most efficient query but it will get the job done, if I interpreted your requirements correctly.
SELECT A.*
FROM A
JOIN B ON A.pin = B.pin
WHERE A.status = 'FAILED'
AND B.counter < 10
UNION
SELECT A.*
FROM A
WHERE A.status = 'FAILED'
AND A.pin NOT IN (SELECT pin FROM B);
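Here is a quick way to try this with Python's sqlite3 module, alongside a LEFT JOIN formulation that should give the same rows in one pass. The LEFT JOIN version is an alternative added here, not part of the answer above. The row with pin 66 is a hypothetical "fresh" record added for illustration only, since the question's sample has no FAILED row that is missing from B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INT, pin INT, status TEXT);
CREATE TABLE b (id INT, pin INT, msg TEXT, counter INT);
INSERT INTO a VALUES (1, 11, 'FAILED'), (2, 22, NULL), (3, 44, NULL),
                     (4, 55, 'FAILED'),
                     (5, 66, 'FAILED');  -- hypothetical fresh record, not in b
INSERT INTO b VALUES (1, 11, 'xyz', 1), (4, 55, 'wsc', 10);
""")

union_sql = """
SELECT a.* FROM a JOIN b ON a.pin = b.pin
WHERE a.status = 'FAILED' AND b.counter < 10
UNION
SELECT a.* FROM a
WHERE a.status = 'FAILED' AND a.pin NOT IN (SELECT pin FROM b)
"""

# Alternative: keep rows from a with no match in b (b.pin IS NULL)
# or with a matching counter below 10.
left_join_sql = """
SELECT a.* FROM a
LEFT JOIN b ON b.pin = a.pin
WHERE a.status = 'FAILED' AND (b.pin IS NULL OR b.counter < 10)
"""

union_rows = sorted(conn.execute(union_sql).fetchall())
left_join_rows = sorted(conn.execute(left_join_sql).fetchall())
print(union_rows)
print(left_join_rows)
```

Both queries return pins 11 and 66: pin 55 is excluded because its counter is 10, and pins 22 and 44 are excluded because their status is empty. If fresh records are supposed to be included regardless of status, the status filter would need to move, which is a requirement question rather than a SQL one.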

Count number of repeats in SQL

I tried to solve a problem but without success.
I have two lists of numbers
{1,2,3,4}
{5,6,7,8,9}
And I have a table
ID Number
1 1
1 2
1 7
1 2
1 6
2 8
2 7
2 3
2 9
Now I need to count how many times a number from the second list comes after a number from the first list, but each ID should be counted only once.
In the example table above the result should be 2:
three matched pairs, but because we have only two different IDs the result is 2 instead of 3.
Pairs:
1 2
1 7
1 2
1 6
2 3
2 9
Note: I work with MSSQL.
Edit: There is one more column, Date, which determines the order.
Edit 2 - Solution
I wrote this query:
SELECT * FROM table t
left JOIN table tt ON tt.ID = t.ID
AND tt.Date > t.Date
AND t.Number IN (1,2,3,4)
AND tt.Number IN (6,7,8,9)
After this I planned to group by ID and use only one match for each ID, but execution takes a lot of time.
Here is a query that would do it:
select a.id, min(a.number) as a, min(b.number) as b
from mytable a
inner join mytable b
on a.id = b.id
and a.date < b.date
and b.number in (5,6,7,8,9)
where a.number in (1,2,3,4)
group by a.id
Output is:
id a b
1 1 6
2 3 9
So the two pairs are output each on one line, with the value a belonging to the first group of numbers, and the value of column b to the second group.
Comments on attempt (edit 2 to question)
Later you added a query attempt to your question. Some comments about that attempt:
You don't need a left join because you really want to have a match for both values. inner join has in general better performance, so use that.
The condition t.Number IN (1,2,3,4) does not belong in the on clause. In combination with a left join the result will include t records that violate this condition. It should be put in the where clause.
Your concern about performance may be warranted, but can be resolved by adding a useful index on your table, i.e. on (id, number, date) or (id, date, number)
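The query and the suggested index can be tried out with the following sketch, which uses Python's sqlite3 module rather than MSSQL. The ordering column is named dt and is filled with the row position from the question's table; that is an assumption, since the real Date values were not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INT, number INT, dt INT)")

# Rows in the order shown in the question; dt is just the 1-based position.
sample = [(1, 1), (1, 2), (1, 7), (1, 2), (1, 6), (2, 8), (2, 7), (2, 3), (2, 9)]
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(id_, number, pos) for pos, (id_, number) in enumerate(sample, start=1)],
)

# One of the two index layouts suggested above.
conn.execute("CREATE INDEX ix_mytable_id_number_dt ON mytable (id, number, dt)")

pairs = conn.execute("""
    SELECT a.id, MIN(a.number) AS a, MIN(b.number) AS b
    FROM mytable a
    INNER JOIN mytable b
            ON a.id = b.id
           AND a.dt < b.dt
           AND b.number IN (5, 6, 7, 8, 9)
    WHERE a.number IN (1, 2, 3, 4)
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

matching_ids = len(pairs)  # the count the question asks for
print(pairs, matching_ids)
```

This gives one row per ID that has at least one qualifying pair, so the number of rows is the requested count (2 for the sample). If only the count is needed, wrapping the query in SELECT COUNT(*) or counting DISTINCT a.id would avoid computing the MIN columns.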