Tricky SQL statement over 3 tables

I have 3 different transaction tables, which look very similar, but have slight differences. This comes from the fact that there are 3 different transaction types; depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
As an example:
t1:
date,user,amount
t2:
date,user,who,amount
t3:
date,user,what,amount
Now I need a query that is going to get me all transactions in each table for the same user, something like
select * from t1,t2,t3 where user='me';
(which of course doesn't work).
I am studying JOIN statements but haven't got around the right way to do this. Thanks.
EDIT: Actually I then need all of the columns from every table, not just the ones that are the same.
EDIT #2: Yeah, having a transaction_type doesn't break 3NF, of course - so maybe my design is utterly wrong. Here is what really happens (it's an alternative currency system):
- Transactions are between users, like mutual credit. So units get swapped between users.
- Inventarizations are physical stuff brought into the system; a user gets units for this.
- Consumations are physical stuff consumed; a user has to pay units for this.
|---------|----------------------|---------------------|---------------------|
| type    | transactions         | inventarizations    | consumations        |
|---------|----------------------|---------------------|---------------------|
| columns | date                 | date                | date                |
|         | creditor (FK user)   | creditor (FK user)  |                     |
|         | debitor (FK user)    |                     | debitor (FK user)   |
|         | service (FK service) |                     |                     |
|         |                      | asset (FK asset)    | asset (FK asset)    |
|         | amount               | amount              | amount              |
|         |                      |                     | price               |
|---------|----------------------|---------------------|---------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. It's outside the scope here to explain why, but these are the fields.) So the question changes to "Can/should this be in one table or in multiple tables (as I have it for now)?"
I need the previously described SQL statement to display running balances.
(Should this now become a new question altogether or is that OK to EDIT?).
EDIT #3: As EDIT #2 actually transforms this to a new question, I also decided to post a new question. (I hope this is ok?).

You can supply defaults as constants in the SELECT statements for columns where you have no data, so:
SELECT Date, User, Amount, 'NotApplicable' as Who, 'NotApplicable' as What from t1 where user = 'me'
UNION
SELECT Date, User, Amount, Who, 'NotApplicable' from t2 where user = 'me'
UNION
SELECT Date, User, Amount, 'NotApplicable', What from t3 where user = 'me'
This assumes that Who and What are string-type columns. You could use NULL as well, but some kind of placeholder is needed.
I think that placing your additional information in a separate table and keeping all transactions in a single table will work better for you though, unless there is some other detail I've missed.

I think the meat of your question is here:
depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
I'm no 3NF expert, but I would approach your schema a little differently (which might clear up your SQL a bit).
It looks like your data elements are as follows: date, user, amount, who, and what. With that in mind, a more normalized schema might look something like this:
User
----
id, user info (username, etc)
Who
---
id, who info
What
----
id, what info
Transaction
-----------
id, date, amount, user_id, who_id, what_id
Your foreign key constraint verbiage will vary based on database implementation, but this is a little clearer (and extendable).
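A minimal sketch of that schema in generic SQL, just to make it concrete (the column names and types are my assumptions, not from the question, and I've pluralized User/Transaction to avoid reserved words):
CREATE TABLE Users (
    id       INTEGER PRIMARY KEY,
    username VARCHAR(100)
);
CREATE TABLE Who (
    id       INTEGER PRIMARY KEY,
    who_info VARCHAR(100)
);
CREATE TABLE What (
    id        INTEGER PRIMARY KEY,
    what_info VARCHAR(100)
);
CREATE TABLE Transactions (
    id      INTEGER PRIMARY KEY,
    date    DATE NOT NULL,
    amount  DECIMAL(12,2) NOT NULL,
    user_id INTEGER NOT NULL REFERENCES Users(id),
    who_id  INTEGER REFERENCES What(id),  -- NULL when the transaction has no "who"
    what_id INTEGER REFERENCES What(id)   -- NULL when the transaction has no "what"
);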

You should consider an STI "architecture" (single table inheritance), i.e. put all of the different columns into one table, distinguished by a type column, and index that column.
In addition, you may want to add indexes to the other columns you are selecting on.
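A rough sketch of what that single-table layout could look like for the columns in the question's EDIT #2 (the table and index names, and the exact types, are illustrative assumptions):
-- Single table for all three transaction kinds; type-specific columns are nullable.
CREATE TABLE transactions (
    id               INTEGER PRIMARY KEY,
    transaction_type VARCHAR(20) NOT NULL,  -- 'transaction', 'inventarization', 'consumation'
    date             DATE NOT NULL,
    creditor_id      INTEGER,               -- FK to user; NULL for consumations
    debitor_id       INTEGER,               -- FK to user; NULL for inventarizations
    service_id       INTEGER,               -- FK to service; transactions only
    asset_id         INTEGER,               -- FK to asset; inventarizations and consumations
    amount           DECIMAL(12,2) NOT NULL,
    price            DECIMAL(12,2)          -- consumations only
);

-- Index the columns you filter on most often.
CREATE INDEX idx_transactions_type     ON transactions (transaction_type);
CREATE INDEX idx_transactions_creditor ON transactions (creditor_id);
CREATE INDEX idx_transactions_debitor  ON transactions (debitor_id);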

What is the result schema going to look like? If you only want the minimal columns that are in all 3 tables, then it's easy: you would just UNION the results:
SELECT Date, User, Amount from t1 where user = 'me'
UNION
SELECT Date, User, Amount from t2 where user = 'me'
UNION
SELECT Date, User, Amount from t3 where user = 'me'

Or you could 'SubClass' them
Create Table Transaction
(
TransactionId Integer Primary Key Not Null,
TransactionDateTime dateTime Not Null,
TransactionType Integer Not Null,
-- Other columns all transactions share
)
Create Table Type1Transactions
(
TransactionId Integer Primary Key Not Null,
-- Type 1 specific columns
)
ALTER TABLE Type1Transactions WITH CHECK ADD CONSTRAINT
[FK_Type1Transaction_Transaction] FOREIGN KEY([TransactionId])
REFERENCES [Transaction] ([TransactionId])
Repeat for other types of transactions...

What about simply leaving the unnecessary columns null and adding a TransactionType column? This would result in a simple SELECT statement.
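As a rough sketch, assuming a single transactions table with a TransactionType column and the question's original date/user/amount/who/what columns (user may need quoting in some databases), the query becomes a plain filter on one table:
SELECT date, user, amount, who, what, TransactionType
FROM transactions
WHERE user = 'me'
ORDER BY date;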

select *
from (
    select user from t1
    union
    select user from t2
    union
    select user from t3
) u
left outer join t1 on u.user = t1.user
left outer join t2 on u.user = t2.user
left outer join t3 on u.user = t3.user

Related

How to design a SQL table where a field has many descriptions

I would like to create a product table. This product has unique part numbers. However, each part number has various number of previous part numbers, and various number of machines where the part can be used.
For example the description for part no: AA1007
Previous part no's: AA1001, AA1002, AA1004, AA1005,...
Machine brand: Bosch, Indesit, Samsung, HotPoint, Sharp,...
Machine Brand Models: Bosch A1, Bosch A2, Bosch A3, Indesit A1, Indesit A2,....
I would like to create a table for this, but I am not sure how to proceed. What I have come up with so far is to create separate tables for Previous Part No, Machine Brand, and Machine Brand Models.
Question: what is the proper way to design these tables?
There are of course various ways to design the tables. A very basic way would be:
You could create tables like those below. I added the columns ValidFrom and ValidTill to identify the time during which a part was active/in use.
It depends on your data whether the date datatype is enough, or whether you need datetime to be more exact.
CREATE TABLE Parts
(
ID bigint NOT NULL
,PartNo varchar(100)
,PartName varchar(100)
,ValidFrom date
,ValidTill date
)
CREATE TABLE Brands
(
ID bigint NOT NULL
,Brand varchar(100)
)
CREATE TABLE Models
(
ID bigint NOT NULL
,BrandsID bigint NOT NULL
,ModelName varchar(100)
)
CREATE TABLE ModelParts
(
ModelsID bigint NOT NULL
,PartID bigint NOT NULL
)
Fill your data like:
INSERT INTO Parts VALUES
(1,'AA1007', 'Screw HyperFuturistic', '2017-08-09', '9999-12-31'),
(1,'AA1001', 'Screw Iron', '1800-01-01', '1918-06-30'),
(1,'AA1002', 'Screw Steel', '1918-07-01', '1945-05-08'),
(1,'AA1004', 'Screw Titanium', '1945-05-09', '1983-10-05'),
(1,'AA1005', 'Screw Futurium', '1983-10-06', '2017-08-08')
INSERT INTO Brands VALUES
(1,'Bosch'),
(2,'Indesit'),
(3,'Samsung'),
(4,'HotPoint'),
(5,'Sharp')
INSERT INTO Models VALUES
(1,1,'A1'),
(2,1,'A2'),
(3,1,'A3'),
(4,2,'A1'),
(5,2,'A2')
INSERT INTO ModelParts VALUES
(1,1)
To select all parts valid on a certain date (in this case 2013-03-03) for the "Bosch A1":
DECLARE @ReportingDate date = '2013-03-03'
SELECT B.Brand
,M.ModelName
,P.PartNo
,P.PartName
,P.ValidFrom
,P.ValidTill
FROM Brands B
INNER JOIN Models M
ON M.BrandsID = B.ID
INNER JOIN ModelParts MP
ON MP.ModelsID = M.ID
INNER JOIN Parts P
ON P.ID = MP.PartID
WHERE B.Brand = 'Bosch'
AND M.ModelName = 'A1'
AND P.ValidFrom <= @ReportingDate
AND P.ValidTill >= @ReportingDate
Of course there are several ways to historize data.
ValidFrom and ValidTill (ValidTo) is one of my favourites, as you can easily run historical reports.
Unfortunately, you have to handle the historization yourself: when inserting a new row - for example for your screw - you have to "close" the old record by setting its ValidTill column before inserting the new one. Furthermore, you have to develop logic to handle deletes...
Well, that's quite a large topic. You will find tons of information on the web.
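A minimal sketch of that close-then-insert step against the Parts table above (the new part number 'AA1008', its name, and the @Today variable are made-up examples; '9999-12-31' marks the currently open row):
DECLARE @Today date = '2020-01-01'

-- Close the currently valid row for part 1...
UPDATE Parts
SET ValidTill = DATEADD(day, -1, @Today)
WHERE ID = 1
  AND ValidTill = '9999-12-31'

-- ...then insert the new version, open-ended until it is superseded.
INSERT INTO Parts (ID, PartNo, PartName, ValidFrom, ValidTill)
VALUES (1, 'AA1008', 'Screw Next-Gen', @Today, '9999-12-31')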
For the part number table, you can consider the following suggestion:
id | part_no | time_created
1 | AA1007 | 2017-08-08
1 | AA1001 | 2017-07-01
1 | AA1002 | 2017-06-10
1 | AA1004 | 2017-03-15
1 | AA1005 | 2017-01-30
In other words, you can add a datetime column that versions each part number. Note that I added a primary key id column here, which is invariant over time and keeps track of each part even though the part number may change.
For time-independent queries, you would join this table using the id column. However, the part number might also serve as a foreign key. Off the top of my head, if you were generating an invoice for a previous date, you might look up the appropriate part number at that time, and then join out to one or more tables using that part number.
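For instance, a hedged sketch of that lookup (I'm calling the versioning table PartNumbers and using an @InvoiceDate variable; both names are assumptions):
DECLARE @InvoiceDate date = '2017-05-01'

-- The part number that was current for part id 1 at the invoice date.
SELECT TOP (1) part_no
FROM PartNumbers
WHERE id = 1
  AND time_created <= @InvoiceDate
ORDER BY time_created DESC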
For the other tables you mentioned, I do not see a similar requirement.

Glueing together relational data rows

I have a sparse table structured like:
id | name | phone | account
There is no primary key or index
There are also null values. What I want is to "glue" data from different rows together, e.g.:
Given
id | name | phone | account
1 null '339-33-27' 4
null 'John' '339-33-27' 4
I want to end up with
id | name | phone | account |
1 'John' '339-33-27' 4
However, I don't know which values are missing in the table.
What is the general way to approach this kind of problem? Do I need to use only joins, or might recursive functions be needed?
Update: Provided more clear example
id to account is many-to-many
account to name is many-to-many
phone to name is one-to-one
The database is basically raw transactional data
What I want is to get all the rows for which I already have / could find an account
If I understand you correctly, then this might work. What you need is a self join:
select t2.id, t1.name, t1.phone, t1.account
from table1 t1
join table1 t2 on t1.account = t2.account and t1.phone = t2.phone
where t1.name is not null
However, this particular query relies on an assumption from your example data. My assumption is that if name is not null, id will be null, and the id can be found by looking at the phone number and account. If this assumption is not true, then we may need more sample data to solve your problem.
Depending on the data, you might need left joins, or to swap things around so that t1 supplies the id rather than the name and the where condition is that id is not null. It's hard to tell with such a small sample size.
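A hedged alternative, if each (phone, account) pair really does identify one logical row, is to collapse the fragments with GROUP BY instead of a self join (MAX ignores NULLs, so it picks up whichever row actually holds the value):
select max(t.id)   as id,
       max(t.name) as name,
       t.phone,
       t.account
from table1 t
group by t.phone, t.account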

Get data from one table using id and date ranges stored in another table

The app I am writing is for telemetry units that get rented out to customers, and I am trying to query logged data for a particular customer, without a customer_id_fk column in the Log table.
I have the following tables:
Customer table:
id | name | ...
Unit table:
id | name | ...
RentOut table:
id | start_date | end_date | unit_id_fk | customer_id_fk
Log table:
id | unit_id_fk | datetime | data1 | data2
The reason a customer_id_fk column is not included in the Log table is so that a mistake in the RentOut table is easily rectified, without the need to alter data in the Log table (though maybe there is a better way to do this?).
It would be more efficient to include the customer in the log table. But, without that, you can get what you want with a bunch of joins:
select l.*, c.*
from Log l left outer join
     RentOut ro
     on l.unit_id_fk = ro.unit_id_fk and
        l.datetime between ro.start_date and ro.end_date left outer join
     Customer c
     on ro.customer_id_fk = c.id
where c.<whatever> = <foobar>
If the dates in the RentOut table are really dates (with no times), and the log records have a datetime, then you might have to do more date arithmetic for the join to work properly. For instance, you might need to say the "start" really is after noon and the "end" is really before noon, for a given log record.
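As a sketch of that date handling (assuming Oracle-style date arithmetic where adding 1 to a date adds one day, an inclusive end_date, and a :customer_id placeholder bind variable):
select l.*, c.*
from Log l
join RentOut ro
  on l.unit_id_fk = ro.unit_id_fk
 and l.datetime >= ro.start_date
 and l.datetime <  ro.end_date + 1   -- any time on the end date still counts
join Customer c
  on ro.customer_id_fk = c.id
where c.id = :customer_id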

Which field should I use with Oracle Partition By clause to improve performance

I have an update statement that works fine but takes a very long time to complete.
I'm updating roughly 150 rows in one table with some tens of thousands of rows exposed through a view. It's been suggested that I use the Partition By clause to speed up the process.
I'm not too familiar with the Partition By clause, but I've been looking around and I think maybe I need to use a field that has a numeric value that can be compared against.
Is this correct? Or can I partition the larger table with something else?
If that is the case, I'm struggling with what in the larger table can be used. The table is composed as follows.
ID has a type of NUMBER and creates the unique id for a particular item.
Start_Date has a date type and indicates the start when the ID is valid.
End_Date has a date type and indicates the end time when the ID ceases to be valid.
ID_Type1 is NVARCHAR2(30) and indicates what type of Identifier we are using.
ID_Type2 is NVARCHAR2(30) and indicates what sub_type of Identifier we are using.
Identifier is NVARCHAR2(30) and any one ID can be mapped to one or more Identifiers.
So for example - View_ID
ID | Start_Date | End_Date | ID_Type1| ID_Type2 | Identifier
1 | 2012-01-01 | NULL | Primary | Tertiary | xyz1
1 | 2012-01-01 | NULL | Second | Alpha | abc2
2 | 2012-01-01 | 2012-01-31 | Primary | Tertiary | ghv2
2 | 2012-02-01 | NULL | Second | Alpha | mno4
Would it be possible to Partition By the ID field of this view as long as there is a clause that the id is valid by date?
The update statement is quite basic, although it selects against one of several possible identifiers and ID_Type1 values.
UPDATE Temp_Table t set ID =
(SELECT DISTINCT ID FROM View_ID v
WHERE inDate BETWEEN Start_Date and End_Date
AND v.Identifier = (NVL(t.ID1, NVL(t.ID2, t.ID3)))
AND v.ID_Type1 in ('Primary','Secondary'));
Thanks in advance for any advice on any aspect of my question.
*** Additional Info ***
After investigating and following Gordon's advice, I changed the update to three updates. This reduced the overall update time by 75%, going from just over a minute to just over 20 seconds. That's a big improvement, but I'd like to reduce the process even more if possible.
Does anyone think that Partition By clause would help even further? If so what would be the correct method for putting this clause into an update statement. I'm honestly not sure if I understand how this clause operates.
If the UPDATE using a SELECT statement only allows for 1 value to be selected does this exclude something like the following from working?
UPDATE Temp_Table t SET t.ID =
(SELECT DISTINCT ID,
Row_Number() OVER (PARTITION BY ID_Type1) AS PT1
FROM View_ID v
WHERE inDate BETWEEN v.Start_Date and v.End_Date
AND v.Identifier = t.ID1
AND PT1.Row_Number = 1 )
*** Solution ***
I combined advice from both Responders below to dramatically improve performance. From Gordon I removed the NVL from my UPDATE and changed it to three separate updates. (I'd prefer to combine them into a case but my trials were still slow.)
From Eggi, I looked at working with some kind of materialized view that I can actually index myself, and settled on a WITH clause.
UPDATE Temp_Table t set ID =
(WITH IDs AS (SELECT /*+ materialize */ DISTINCT ID, Identifier FROM View_ID v
              WHERE inDate BETWEEN Start_Date and End_Date
              AND v.Identifier = ID1)
 SELECT g.ID FROM IDs g
 WHERE g.Identifier = t.ID1);
Thanks again.
It is very hard to imagine how window/analytic functions would help with this update. I do highly recommend that you learn them, but not for this purpose.
Perhaps the suggestion was for partitioning the tablespace used for the table. Note that this is very different from the "partition by" clause, which usually refers to window/analytic functions. Tablespace partitioning might help performance. However, here is something else you can try.
I think your problem is the join between the temp table and the view. Presumably, you are creating the temporary table. You should add a new column, say UsedId, with the definition:
coalesce(t.ID1, t.ID2, t.ID3) as UsedId
The "WHERE" clause in the update would then be:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier = t.UsedId AND
v.ID_Type1 in ('Primary', 'Secondary')
I suspect that the performance problem is the use of NVL in the join, which interferes with optimization strategies.
In response to your comment... your original query would have the same problem as this version. Perhaps the logic you want is:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier in (t.ID1, t.ID2, t.ID3) AND
v.ID_Type1 in ('Primary', 'Secondary')
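Putting that together, a hedged sketch of the full UPDATE (shown with the UsedId column described above; swap in the IN-list predicate if you prefer that form, and inDate is kept exactly as in the question):
UPDATE Temp_Table t
SET t.ID = (SELECT DISTINCT v.ID
            FROM View_ID v
            WHERE inDate BETWEEN v.Start_Date AND v.End_Date
              AND v.Identifier = t.UsedId
              AND v.ID_Type1 IN ('Primary', 'Secondary'));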
The best option for partitioning seems to be the start date, because it seems to always have a value and you also get it as input parameter in your query.
If you have not already done so, I would add a bitmap index on ID_Type1.

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table were the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "fewer DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4, with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or before deleting) the original record. That can easily be done with a row-level trigger. Retrieving all information is, in my opinion, not a frequent operation, and I don't see anything wrong with the extra join/union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
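A minimal Oracle-flavoured sketch of that trigger idea (assuming student_archive has the same columns as students plus an archived_at column; the names are illustrative):
CREATE OR REPLACE TRIGGER trg_students_archive
BEFORE UPDATE OR DELETE ON students
FOR EACH ROW
BEGIN
  -- Copy the old values away before they are changed or removed.
  INSERT INTO student_archive (id, name, dob, archived_at)
  VALUES (:OLD.id, :OLD.name, :OLD.dob, SYSDATE);
END;
/
With that in place, option 4 collapses back to a single UPDATE from the application's point of view, and the students table keeps its single-row-per-id primary key.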