How to handle null columns in a Relational Database Design - sql

I mostly work with non-relational databases, but I need to switch gears and use a relational database, because the application I'm building will run complex queries that need joins between tables.
Before creating the database itself I had to think about the architecture, so I've set up a UML diagram for the database design.
This is how the TransactionDepositBreakdown table may look:
id  amount  date        reference_number  batch_id  payment_processor_id  mid_id  main_dep_id
1   100     2020-10-11  900               null      1                     100     2
2   101     2020-10-11  900               null      1                     100     2
3   102     2020-10-11  900               null      1                     100     1
4   103     2020-10-11  350               null      1                     100     1
5   104     2020-10-11  350               null      1                     100     3
6   105     2020-10-11  600               null      1                     100     4
7   106     2020-10-11  null              1000      2                     201     null
8   107     2020-10-11  null              1001      2                     201     null
9   108     2020-10-11  null              1002      2                     201     null
10  109     2020-10-11  null              1003      2                     201     null
A reference_number can be assigned to multiple transaction deposit breakdowns
A batch_id is assigned to only one transaction deposit breakdown
There is a use case where a TransactionDepositBreakdown may have a reference number or a batch id, depending on the payment processor type (type 1 - reference number, type 2 - batch id). I'm not sure how to handle this case, but I'm thinking about the following options:
Add two tables TransactionDepositBatch and TransactionDepositReference which will have the transaction_deposit_id as a foreign key, batch_id on the first table and reference_number on the latter one:
Keep the reference_number and batch_id columns in the TransactionDepositBreakdown table and have at all times one of them null depending on the payment processor type.
Note: There might be a need of adding another column to the TransactionDepositBreakdown table, such as card_type, which will have a value assigned only when the payment processor type is 1.
Is the first option the correct way to handle this, by also taking into consideration the above note?
Also, any recommendations regarding the UML that I've built would be really useful.

These one-of relationships are difficult to model in relational databases. Different databases have different capabilities, so some may have extensions that can be applied to this problem (such as Postgres's support of table inheritance).
Your situation is rather simple, given just two options. Under those circumstances, I would keep both columns in the TransactionDepositBreakdown table (the second option as listed above), for one simple reason: it easily allows you to design the data model with declared foreign key relationships. The downside is that you need space for both foreign keys, even if one of them is going to be NULL.
You can also enforce that one or the other is set, but not both using a check constraint:
constraint chk_TransactionDepositBreakdown_reference_or_batch
check ((reference_number is null and batch_id is not null) or
       (reference_number is not null and batch_id is null));
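As a rough illustration of how a check that requires exactly one of the two columns behaves, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the trimmed-down column list, the shortened constraint name and the `try_insert` helper are all assumptions made for the demo, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TransactionDepositBreakdown (
        id               INTEGER PRIMARY KEY,
        amount           INTEGER NOT NULL,
        reference_number INTEGER,
        batch_id         INTEGER,
        -- exactly one of the two columns must be populated
        CONSTRAINT chk_reference_or_batch CHECK (
            (reference_number IS NULL AND batch_id IS NOT NULL) OR
            (reference_number IS NOT NULL AND batch_id IS NULL)
        )
    )
""")


def try_insert(reference_number, batch_id):
    """Hypothetical helper: True if the row is accepted, False if the check rejects it."""
    try:
        conn.execute(
            "INSERT INTO TransactionDepositBreakdown "
            "(amount, reference_number, batch_id) VALUES (100, ?, ?)",
            (reference_number, batch_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "reference only": try_insert(900, None),
    "batch only": try_insert(None, 1000),
    "both set": try_insert(900, 1000),
    "neither set": try_insert(None, None),
}
print(results)
```

A constraint written as `reference_number is null or batch_id is null` would be looser: it rejects rows with both values set but still accepts rows where both are NULL.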


How to join a fact table to a dimension table which has a duplicate key value while avoiding duplications in fact table?

How do I join a fact table to a dimension table with a duplicate key value, while at the same time avoiding duplication in the fact table that would result from the join?
I thought of using the activation date as the next unique value, but they share a month in common.
I thought of creating a snowflake schema which connects dimension table in question (marketing campaigns) to product dimension which in turn connects to the fact table with no issues.
edit:
I am designing a data warehouse that should answer how effective marketing campaigns are, based on purchase data.
Purchase data which will be the core of my fact table looks like this:
product_id timestamp sales_price user_id
1 5/9/2015 120 124
2 6/9/2015 150 129
the product lookup table looks like this:
id product_name model production_cost
6 ring 2019 300
5 headband 2018 200
the marketing campaigns look up table looks like this:
startdate enddate type amount_spent currency product_id
1/1/2019 7/1/2019 print 100,000 USD 6
6/1/2019 1/1/2020 socialmedia 10,000,000 USD 6
6/1/2019 1/1/2020 socialmedia 10,000,000 USD 3
The issue is that the marketing table has duplicate product id value of 6. So, when I use it as my natural key to create a surrogate primary key for that dimension table and pull that surrogate key to the fact table as a foreign key it's going to cause duplications for anything with product_id of 6 (as it's not unique). How do I connect marketing campaigns data to fact table, whilst keeping the data integrity intact -- that is no duplications?
I thought about combining start/end date with product_id to create a composite primary key, but they share/overlap a month (6/1/2019 to 7/1/2019)
I also thought about connecting the purchases (fact table) to product lookup and then product to marketing campaigns (a snowflake schema) to avoid the duplication.
I suggest you take the time to read the details of dimensional database design.
If you mean dimensional design, there is no such thing as a lookup table there; there is either a Slowly Changing Dimension (SCD), or just a Dimension. Your product lookup table could be a product lookup dimension. Your Dimension table looks imperfect, too: it does contain the element of time, but not correctly. You need, usually in this order:
a completely arbitrary integer as a surrogate, primary, key - often populated by a sequence or defined as IDENTITY
a business identifier - that could be the SKU for a product, first part of a business unique identifier
the valid-from-date, second part of a business unique identifier
the valid-to-date, '9999-12-31' for the current row, or equal to the valid-from-date of its successor
Type 1 attributes, those that don't change over time
Type 2 attributes, those that change over time and need a new row every time they change
There can be more columns: the Boolean current-indicator, and an inserted timestamp and an updated timestamp.
The fact table is populated from the source transactions, after the dimension table. For each transaction row, you join with the SCD table on the business identifier (SKU in our case), which must be equal, and on the transaction's timestamp, which must be greater than or equal to the valid-from-date and less than the valid-to-date. You pick the surrogate key of the row found in the SCD to populate the fact table's foreign key.
This is an exemplary, minimal, customer SCD table, without the insert/change timestamps and without the current-indicator:
c_key  c_id  c_from_dt   c_to_dt     c_fname           c_lname      c_loy_lvl  c_org_id
66459  1     2022-01-25  9999-12-31  Arthur            Dent         1          1
34168  2     2022-01-25  9999-12-31  Ford              Prefect      2          2
2284   3     2021-12-25  9999-12-31  Zaphod            Beeblebrox   3          3
84768  4     2021-12-25  9999-12-31  Tricia            McMillan     4          4
80080  5     2022-01-25  9999-12-31  Gag               Halfrunt     5          5
57458  6     2022-01-25  9999-12-31  Prostetnic Vogon  Jeltz        6          6
1076   7     2022-01-25  9999-12-31  Lionel            Prosser      1          0
9782   8     2021-12-25  9999-12-31  Benji             Mouse        2          1
42655  9     2021-12-25  9999-12-31  Frankie           Mouse        3          2
57348  10    2021-09-25  2021-10-25  Wonko             The Sane     1          3
22279  10    2021-10-25  2021-11-25  Wonko             The Sane     2          3
3675   10    2021-11-25  2021-12-25  Wonko             The Sane     3          3
95534  10    2021-12-25  2022-01-25  Wonko             The Sane     4          3
69529  10    2022-01-25  9999-12-31  Wonko             The Sane     5          3
34845  11    2022-01-25  9999-12-31  Eccentrica        Gallumbitis  6          4
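To make the fact-table lookup described above more concrete, here is a small sketch using Python's sqlite3 module and a few of the customer rows from the table. The `src_transaction` table, its columns and its rows are invented for illustration, and SQLite itself is only an assumption; the join condition is the part that carries over to other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- a subset of the customer SCD rows shown above
    CREATE TABLE customer_scd (
        c_key INTEGER PRIMARY KEY, c_id INTEGER,
        c_from_dt TEXT, c_to_dt TEXT, c_loy_lvl INTEGER
    );
    INSERT INTO customer_scd VALUES
        (57348, 10, '2021-09-25', '2021-10-25', 1),
        (22279, 10, '2021-10-25', '2021-11-25', 2),
        (69529, 10, '2022-01-25', '9999-12-31', 5),
        (66459,  1, '2022-01-25', '9999-12-31', 1);

    -- hypothetical source transactions, keyed by the business identifier
    CREATE TABLE src_transaction (
        txn_id INTEGER, c_id INTEGER, txn_ts TEXT, amount REAL
    );
    INSERT INTO src_transaction VALUES
        (1, 10, '2021-10-01', 20.0),
        (2, 10, '2021-10-25', 35.0),
        (3, 10, '2022-03-01', 50.0);
""")

# Equality on the business identifier, half-open range on the validity dates:
# valid-from <= timestamp < valid-to, so each transaction matches one version.
rows = conn.execute("""
    SELECT t.txn_id, d.c_key, t.amount
    FROM src_transaction AS t
    JOIN customer_scd AS d
      ON d.c_id = t.c_id
     AND t.txn_ts >= d.c_from_dt
     AND t.txn_ts <  d.c_to_dt
    ORDER BY t.txn_id
""").fetchall()
print(rows)
```

Because the valid-to-date of one version equals the valid-from-date of its successor, the half-open comparison keeps a transaction dated exactly on a boundary from matching two versions, which is what avoids duplicated fact rows.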

Apply a discount to order if user already ordered something else

I have a table with users, a table with levels, a table for submitted orders and processed orders.
Here's what the submitted orders looks like:
OrderId  UserId  Level_Name        Discounted_Price  Order_Date               Price
1        1       OLE Core          0                 2020-11-01 00:00:00.000  19.99
2        1       Xandadu           1                 2020-11-01 00:00:00.000  0
3        2       Xandadu           0                 2020-12-05 00:00:00.000  5
4        1       Eldorado          1                 2021-01-31 00:00:00.000  9
5        2       Eldorado          0                 2021-02-20 00:00:00.000  10
6        2       Birmingham Blues  NULL              2021-07-10 00:00:00.000  NULL
What I am trying to do:
UserId 2 has an order for Birmingham Blues; they have already ordered Eldorado and so qualify for a discount on their Birmingham Blues order. Is there a way to check the entire table for this situation and, if it exists, update the discounted price to 1 and change the price to, let's say, 10 for the Birmingham Blues order?
EDIT: I have researched the use of cursors, which I'm sure will do the job but they seem complicated and was hoping a simpler solution would be possible. A lot of threads seem to also avoid using cursors. I also looked at this question: T-SQL: Deleting all duplicate rows but keeping one and was thinking I could potentially use the answer to that in some way.
Based on your description and further comments, the following should hopefully meet your requirements - updating the row for the specified User where the values are currently NULL and the user has a qualifying existing order:
update s set
    s.Discounted_Price = 1,
    s.Price = 10
from submitted_orders s
where s.UserId = 2
  and s.Level_Name = 'Birmingham Blues'
  and s.Discounted_Price is null
  and s.Price is null
  and exists (
      select * from submitted_orders so
      where so.UserId = s.UserId
        and so.Level_Name = 'Eldorado'
        and so.Order_Date < s.Order_Date  -- the qualifying order must be earlier
  );
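Since the question asks about checking the entire table, the same idea can be sketched without the filter on a single user. The example below is only an illustration under assumptions: it uses SQLite through Python's sqlite3 module rather than T-SQL (so the update is written with a correlated subquery instead of `update ... from`), the dates are shortened to ISO strings, and order 7 for user 3 is a made-up row added to show a non-qualifying case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submitted_orders (
        OrderId INTEGER PRIMARY KEY, UserId INTEGER, Level_Name TEXT,
        Discounted_Price INTEGER, Order_Date TEXT, Price REAL
    );
    INSERT INTO submitted_orders VALUES
        (1, 1, 'OLE Core',         0,    '2020-11-01', 19.99),
        (2, 1, 'Xandadu',          1,    '2020-11-01', 0),
        (3, 2, 'Xandadu',          0,    '2020-12-05', 5),
        (4, 1, 'Eldorado',         1,    '2021-01-31', 9),
        (5, 2, 'Eldorado',         0,    '2021-02-20', 10),
        (6, 2, 'Birmingham Blues', NULL, '2021-07-10', NULL),
        (7, 3, 'Birmingham Blues', NULL, '2021-08-01', NULL);  -- hypothetical row
""")

# Discount every pending Birmingham Blues order whose user has an
# earlier Eldorado order; no cursor or per-user loop is needed.
conn.execute("""
    UPDATE submitted_orders
    SET Discounted_Price = 1, Price = 10
    WHERE Level_Name = 'Birmingham Blues'
      AND Discounted_Price IS NULL
      AND Price IS NULL
      AND EXISTS (
          SELECT 1 FROM submitted_orders AS so
          WHERE so.UserId = submitted_orders.UserId
            AND so.Level_Name = 'Eldorado'
            AND so.Order_Date < submitted_orders.Order_Date
      )
""")

updated = conn.execute("""
    SELECT OrderId, Discounted_Price, Price
    FROM submitted_orders
    WHERE Level_Name = 'Birmingham Blues'
    ORDER BY OrderId
""").fetchall()
print(updated)
```

User 2's order picks up the discount, while the invented user 3 has no Eldorado order and is left untouched.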

Generate rows where none exist

I'm a little stumped on how to generate rows when none exist for specified conditions. Apologies for the formatting since I don't know how to write tables in SO posts, but let's say I have data that looks like this:
TimePeriodID  CityspanSiteKey  Mean_Name           Mean
2             123              Social Environment  4
2             123              Youth with Adults   3.666666746
2             123              Youth with Peers    3.5
4             123              Social Environment  2.75
4             123              Youth with Adults   2.555555582
4             123              Youth with Peers    3.5
There are a few other Mean_Name values which I would like to include in every single time period ID, but just a Mean value of NULL, like the following:
TimePeriodID  CityspanSiteKey  Mean_Name                                               Mean
2             123              Social Environment                                      4
2             123              Youth with Adults                                       3.666666746
2             123              Youth with Peers                                        3.5
2             123              Staff Build Relationships and Support Individual Youth  NULL
2             123              Staff Positively Guide Behavior                         NULL
4             123              Social Environment                                      2.75
4             123              Youth with Adults                                       2.555555582
4             123              Youth with Peers                                        3.5
4             123              Staff Build Relationships and Support Individual Youth  NULL
4             123              Staff Positively Guide Behavior                         NULL
5             123              Social Environment                                      2.75
5             123              Youth with Adults                                       2.555555582
5             123              Youth with Peers                                        3.5
5             123              Staff Build Relationships and Support Individual Youth  NULL
5             123              Staff Positively Guide Behavior                         NULL
6             123              Social Environment                                      NULL
6             123              Youth with Adults                                       NULL
6             123              Youth with Peers                                        NULL
6             123              Staff Build Relationships and Support Individual Youth  NULL
6             123              Staff Positively Guide Behavior                         NULL
What's the best way to go about doing this? I don't think CASEing will be of much use since these records don't exist.
You seem to want a cross join and then left join. Not all values are in your original data, so you might as well construct them:
select ti.timeperiod, c.CityspanSiteKey, m.mean_name, t.mean
from (values (2), (4), (5), (6)
     ) ti(timeperiod) cross join
     (values (123)
     ) c(CityspanSiteKey) cross join
     (values ('Social Environment'), ('Youth with Adults'), ('Youth with Peers'), ('Staff Build Relationships and Support Individual Youth'), ('Staff Positively Guide Behavior')
     ) m(mean_name) left join
     t
     on t.TimePeriodID = ti.timeperiod and
        t.CityspanSiteKey = c.CityspanSiteKey and
        t.mean_name = m.mean_name;
You can use subqueries or existing tables instead of the values() clause.
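A small sketch of that variation, with the lists coming from tables and a subquery rather than values(): it uses SQLite through Python's sqlite3 module, and the `time_periods` and `mean_names` lookup tables (and their trimmed contents) are hypothetical, introduced only to show the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- a few rows of the original data; "t" stands for the source table
    CREATE TABLE t (
        TimePeriodID INTEGER, CityspanSiteKey INTEGER, Mean_Name TEXT, Mean REAL
    );
    INSERT INTO t VALUES
        (2, 123, 'Social Environment', 4),
        (2, 123, 'Youth with Adults', 3.666666746),
        (4, 123, 'Social Environment', 2.75);

    -- hypothetical lookup tables listing every period and every measure
    CREATE TABLE time_periods (TimePeriodID INTEGER);
    INSERT INTO time_periods VALUES (2), (4), (6);
    CREATE TABLE mean_names (Mean_Name TEXT);
    INSERT INTO mean_names VALUES
        ('Social Environment'), ('Youth with Adults'),
        ('Staff Positively Guide Behavior');
""")

# Cross join builds every (period, site, measure) combination;
# the left join then fills in Mean where a matching row exists.
rows = conn.execute("""
    SELECT p.TimePeriodID, c.CityspanSiteKey, m.Mean_Name, t.Mean
    FROM time_periods AS p
    CROSS JOIN (SELECT DISTINCT CityspanSiteKey FROM t) AS c
    CROSS JOIN mean_names AS m
    LEFT JOIN t
      ON t.TimePeriodID = p.TimePeriodID
     AND t.CityspanSiteKey = c.CityspanSiteKey
     AND t.Mean_Name = m.Mean_Name
    ORDER BY p.TimePeriodID, m.Mean_Name
""").fetchall()

for row in rows:
    print(row)
```

Combinations with no matching source row, such as every measure in period 6, come back with a NULL (Python `None`) Mean.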

SQL: Display joined data on a day to day basis anchored on a start date

Perhaps my title is misleading, but I am not sure how else to phrase this. I have two tables, tblL and tblDumpER. They are joined based on the field SubjectNumber. This is a one (tblL) to many (tblDumpER) relationship.
I need to write a query that will give me, for all my subjects, a value from tblDumpER associated with a date in tblL. This is to say:
SELECT tblL.SubjectNumber, tblDumpER.ER_Q1
FROM tblL
LEFT JOIN tblDumpER ON tblL.SubjectNumber=tblDumpER.SubjectNumber
WHERE tblL.RandDate=tblDumpER.ER_DATE And tblDumpER.ER_Q1 Is Not Null
This is straightforward enough. My problem is the value RandDate from tblL is different for every subject. However, it needs to be displayed as Day1 so I can have tblDumpER.ER_Q1 as Day1 for every subject. Then I need RandDate+1 As Day2, etc until I hit either null or Day84. The 'dumb' solution is to write 84 queries. This is obviously not practical. Any advice would be greatly appreciated!
I appreciate the responses so far but I don't think that I'm explaining this correctly so here is some example data:
SubjectNumber RandDate
1001 1/1/2013
1002 1/8/2013
1003 1/15/2013
SubjectNumber ER_DATE ER_Q1
1001 1/1/2013 5
1001 1/2/2013 6
1001 1/3/2013 2
1002 1/8/2013 1
1002 1/9/2013 10
1002 1/10/2013 8
1003 1/15/2013 7
1003 1/16/2013 4
1003 1/17/2013 3
Desired outcome:
(Where Day1=RandDate, Day2=RandDate+1, Day3=RandDate+2)
SubjectNumber Day1_ER_Q1 Day2_ER_Q1 Day3_ER_Q1
1001 5 6 2
1002 1 10 8
1003 7 4 3
This data is then going to be plotted on a graph with Day# on the X-axis and ER_Q1 on the Y-axis
I would do this in two steps:
Create a query that gets the MIN date for each SubjectNumber
Join this query to your existing query, so you can perform a DATEDIFF calculation on the MIN date and the date of the current record.
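A minimal sketch of those two steps, under several assumptions: it runs on SQLite through Python's sqlite3 module rather than Access, the dates are stored as ISO strings so they can be subtracted with `julianday`, and since tblL already holds one RandDate per subject, that column stands in for the computed MIN date. The `DayNumber` column name is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblL (SubjectNumber INTEGER, RandDate TEXT);
    INSERT INTO tblL VALUES (1001, '2013-01-01'), (1002, '2013-01-08');

    CREATE TABLE tblDumpER (SubjectNumber INTEGER, ER_DATE TEXT, ER_Q1 INTEGER);
    INSERT INTO tblDumpER VALUES
        (1001, '2013-01-01', 5), (1001, '2013-01-02', 6), (1001, '2013-01-03', 2),
        (1002, '2013-01-08', 1), (1002, '2013-01-09', 10), (1002, '2013-01-10', 8);
""")

# Day 1 is the anchor date itself, so the day number is the date difference + 1.
rows = conn.execute("""
    SELECT l.SubjectNumber,
           CAST(julianday(e.ER_DATE) - julianday(l.RandDate) AS INTEGER) + 1
               AS DayNumber,
           e.ER_Q1
    FROM tblL AS l
    JOIN tblDumpER AS e ON e.SubjectNumber = l.SubjectNumber
    WHERE e.ER_Q1 IS NOT NULL
      AND e.ER_DATE >= l.RandDate
    ORDER BY l.SubjectNumber, DayNumber
""").fetchall()

for row in rows:
    print(row)
```

This returns one row per subject and day rather than 84 columns, which is usually the more convenient shape for plotting Day# against ER_Q1; a crosstab or pivot on `DayNumber` could produce the wide layout if it is really needed.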
I'm not entirely sure of what it is that you need, but perhaps a calendar table would be of help. Just create a local table that contains all of the days of the year in it, then use that table to JOIN your dates up?

How to generate a column with a series of numbers based on a min and max value

I have a table structured as so:
fake_id start end misc_data
------------------------------------------------------
1 101 105 ab
1 101 105 cd
1 101 105 ef
2 117 123 gh
2 117 123 ij
2 117 123 kl
2 117 123 mn
3 51 53 op
3 51 53 qr
Notice that the fake_id field is not really a primary key, but is repeated a number of times equal to the number of distinct odd numbers in the range specified by start and end. The real id for each record is one of the odd numbers in that range. I need to write a query that returns fake_id, misc_data, and another column that contains those odd numbers to produce a real id, as follows:
fake_id real_id misc_data
------------------------------------------
1 101 ab
1 103 cd
1 105 ef
2 117 gh
2 119 ij
2 121 kl
2 123 mn
3 51 op
3 53 qr
As far as I know, there is no guarantee that there will be no gaps in the sequence (for example, there might be no records for range 21-31). How do I tell the query (or procedure, but query is preferable) that for each record with a particular fake_id, it should return the next odd number between start and end?
Also, is there a way to make the values for misc_data belong to a particular real_id? Using the second table as an example, how could I tell the query that "ab" belongs to real_id 101 instead of 103?
Thanks in advance.
Guessing here that you plan to sort on misc_data:
SELECT "fake_id",
       ((ROW_NUMBER() OVER (PARTITION BY "start"
                            ORDER BY "misc_data") - 1) * 2) + "start" AS "real_id",
       "misc_data"
FROM t
ORDER BY "misc_data";
http://www.sqlfiddle.com/#!4/ae23c/23
Apologies for not answering sooner or to the individual comments. @John Dewey, I believe when I tried your script it did not correctly keep the gaps between the start-end series, but I was motivated to learn more about the PARTITION keyword and I think I am more enlightened now.
Since this was for an ETL task, I ended up writing code to generate the real IDs in a loop on the extract (I guess it would also count as a transform) side.
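The actual ETL code is not shown, so the following is only a guess at what such a loop might look like, written in Python. The `assign_real_ids` function, the ordering by misc_data within each fake_id, and the assumption that start is always odd are all illustrative choices rather than details from the thread.

```python
from itertools import groupby
from operator import itemgetter

# (fake_id, start, end, misc_data) rows from the question's first table
rows = [
    (1, 101, 105, "ab"), (1, 101, 105, "cd"), (1, 101, 105, "ef"),
    (2, 117, 123, "gh"), (2, 117, 123, "ij"),
    (2, 117, 123, "kl"), (2, 117, 123, "mn"),
    (3, 51, 53, "op"), (3, 51, 53, "qr"),
]


def assign_real_ids(rows):
    """Pair each row of a fake_id group with the next odd number in start..end."""
    result = []
    # Sort by fake_id, then misc_data, so the pairing is deterministic.
    ordered = sorted(rows, key=itemgetter(0, 3))
    for fake_id, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        start, end = group[0][1], group[0][2]
        # Assumes start is odd, as in the sample data; step 2 keeps it odd.
        odd_ids = range(start, end + 1, 2)
        for real_id, row in zip(odd_ids, group):
            result.append((fake_id, real_id, row[3]))
    return result


real_rows = assign_real_ids(rows)
for row in real_rows:
    print(row)
```

Because each group restarts from its own start value, gaps between ranges are preserved. Which misc_data value lands on which real_id depends entirely on the chosen sort order, since the source data carries no other link between them.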