Is using Null to represent a Value Bad Practice? - sql

If I use null as a representation of everything in a database table is that bad practice ?
i.e.
I have the tables: myTable(ID) and myRelatedTable(ID,myTableID)
myRelatedTable.myTableID is a FK of myTable.ID
What I want to accomplish is: if myRelatedTable.myTableID is null then my business logic will interpret that as being linked to all myTable rows.
The reason I want to do this is because I have an uknown amount of rows that could be inserted into myTable after the myRelatedTable row is created and some rows in the myRelatedTable need to reference all existing rows in myTable.

I think you might agree that it would be bad to use the number 3 to represent a value other an 3.
By the same reasoning it is therefore a bad idea to use NULL to represent anything other than the absence of a value.
If you disagree and twist NULL to some other purpose, the maintenance programmers that come after you will not be grateful.

Not a good idea, because then you cannot use the "related to all entries" fact in SQL queries at all. At some point, you'll probably want/need to do this.

Ideally there should be no nulls at all. There should be another table to represent the relation.
If you are going to assign special meanings however NULL should only ever mean "not assigned" - ie no relationship exists, use negative numbers, ie -1 if you want to trigger some business layer trickery. It should be obvious to any developers that come across this in the future that -1 is an extraordinary value that should not be treated as normal.

I don't think NULL is the best way to do it but you might use a separate tinyInt column to indicate that the row in MyRelatedTable is related to everything in MyTable, e.g. MyRelatedTable.RelatedAll. That would make it more explicit for other that have to maintain it. Then you could do some sort of Union query e.g.
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M INNER JOIN MyRelated Table R ON R.myTableId = M.Id
UNION
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M, MyRelatedTable R
WHERE R.RelatedAll = 1

Yes, for the simple reason that NULL represents no value. Not a special value; not a blank value, but nothing.
If the foreign key is just a simple integer, and it's generated automatically, then you could use 0 to represent the "magic" value.

What you posted, namely that a NULL in a foreign key asserts a relationship with all the rows in the referenced table, is very non standard. Off the top of my head, I think it's fraught with dangers.
What most people who use NULLs in FKs mean by it is that it asserts a relationship to NONE of the rows in the referenced table. This is common in the case of optional relationships, ones that can occur zero times.
Example: We have an HR database, with a table called "EMPLOYEES". We have two columns, called "EmpID" and "SupervisorID". (Many people call the first column simply "ID"). Every employee in the table has an entry under SupervisorID with the sole exception of the CEO of the company. THe CEO has a NULL in the SupervisorID column, meaning that the CEO has no supervisor. The CEO is accountable to the BOD, but that isn't represented in SupervisorID.
What you might mean by a relationship with ALL the rows in the refernced table is this: There's a POSSIBLE relationship between the row in question and ANY ONE of the rows in the reference table. When you start to get into the questions of the facts that are true in the real world but unknown to the database you open a whole big can of worms.

Related

How to understand this query?

SELECT DISTINCT
...
...
...
FROM Reviews Rev
INNER JOIN Reviews SubRev ON Subrev.W_ID=Rev.ID
WHERE Rev.Status='Approved'
This is a small part of a long query that I've been trying to understand for a day now. What is happening with the join? Reviews table appears to be joined with itself, under different aliases. Why is this done? What does it achieve? Also, ID field of the Reviews table is null for the entries that are nevertheless selected and returned. This is correct, but I don't understand how that can happen if the W_ID field is not null.
It allows you to join one row from the table to a different row in the table.
I've both seen this done, and used it myself, in cases where you maybe have a relationship between those rows.
Real-world examples:
An old version of a record and a newer version
Some sort of hierarchical relationship (e.g. if the table contains records of people, you can record that someone is a parent of someone else). There are probably plenty of other possible use cases, too.
SQL allows you to create a foreign key which relates between two different columns in the same table.

Is it better to have int joins instead of string columns?

Let's say I have a User which has a status and the user's status can be 'active', 'suspended' or 'inactive'.
Now, when creating the database, I was wondering... would it be better to have a column with the string value (with an enum type, or rule applied) so it's easier to both query and know the current user status or are joins better and I should join in a UserStatuses table which contains the possible user statuses?
Assuming, of course statuses can not be created by the application user.
Edit: Some clarification
I would NOT use string joins, it would be a int join to UserStatuses PK
My primary concern is performance wise
The possible status ARE STATIC and will NEVER change
On most systems it makes little or no difference to performance. Personally I'd use a short string for clarity and join that to a table with more detail as you suggest.
create table intLookup
(
pk integer primary key,
value varchar(20) not null
)
insert into intLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table stringLookup
(
pk varchar(4) primary key,
value varchar(20) not null
)
insert into stringLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table masterData
(
stuff varchar(50),
fkInt integer references intLookup(pk),
fkString varchar(4)references stringLookup(pk)
)
create index i on masterData(fkInt)
create index s on masterData(fkString)
insert into masterData
(stuff, fkInt, fkString)
select COLUMN_NAME, (ORDINAL_POSITION %4)+1,(ORDINAL_POSITION %4)+1 from INFORMATION_SCHEMA.COLUMNS
go 1000
This results in 300K rows.
select
*
from masterData m inner join intLookup i on m.fkInt=i.pk
select
*
from masterData m inner join stringLookup s on m.fkString=s.pk
On my system (SQL Server)
- the query plans, I/O and CPU are identical
- execution times are identical.
- The lookup table is read and processed once (in either query)
There is NO difference using an int or a string.
I think, as a whole, everyone has hit on important components of the answer to your question. However, they all have good points which should be taken together, rather than separately.
As logixologist mentioned, a healthy amount of Normalization is generally considered to increase performance. However, in contrast to logixologist, I think your situation is the perfect time for normalization. Your problem seems to be one of normalization. In this case, using a numeric key as Santhosh suggested which then leads back to a code table containing the decodes for the statuses will result in less data being stored per record. This difference wouldn't show in a small Access database, but it would likely show in a table with millions of records, each with a status.
As David Aldridge suggested, you might find that normalizing this particular data point will result in a more controlled end-user experience. Normalizing the status field will also allow you to edit the status flag at a later date in one location and have that change perpetuated throughout the database. If your boss is like mine, then you might have to change the Status of Inactive to Closed (and then back again next week!), which would be more work if the status field was not normalized. By normalizing, it's also easier to enforce referential integrity. If a status key is not in the Status code table, then it can't be added to your main table.
If you're concerned about the performance when querying in the future, then there are some different things to consider. To pull back status, if it's normalized, you'll be adding a join to your query. That join will probably not hurt you in any sized recordset but I believe it will help in larger recordsets by limiting the amount of raw text that must be handled. If your primary concern is performance when querying the data, here's a great resource on how to optimize queries: http://www.sql-server-performance.com/2007/t-sql-where/ and I think you'll find that a lot of the rules discussed here will also apply to any inclusion criteria you enforce in the join itself.
Hope this helps!
Christopher
The whole idea behind normalization is to keep the data from repeating (well at least one of the concepts).
In this case there is only 1 status a user at one time (I assume) can have so their is no reason to put it in its own table. You would simply complicate things. The only reason you would have a seperate table is if for some reason these statuses were not static. Meaning next month you may add "Sort of Active" and "Maybe Inactive". This would mean changing code to make up for that if you didnt put them in their own table. You could create a maintenace page where users could add statuses and then that would require you to create a seperate table.
An issue to consider is whether these status values have attributes of their own.
For example, perhaps you would want to have a default sort order that is different from the alphabetical order of the status text. You might also want to treat two of the statuses in a particular way that you do not treat the other, and that could be an attribute.
If you have a need for that, or suspect a future need for that, then move the status text to a different table and use an integer key value for them.
I would suggest using Integer values like 0, 1, 2. If this is fixed. When interpreting the results in Reports we can change these status back to strings.

SQL join basic questions

When I have to select a number of fields from different tables:
do I always need to join tables?
which tables do I need to join?
which fields do I have to use for the join/s?
do the joins effects reflect on fields specified in select clause or on where conditions?
Thanks in advance.
Think about joins as a way of creating a new table (just for the purposes of running the query) with data from several different sources. Absent a specific example to work with, let's imagine we have a database of cars which includes these two tables:
CREATE TABLE car (plate_number CHAR(8),
state_code CHAR(2),
make VARCHAR(128),
model VARCHAR(128),);
CREATE TABLE state (state_code CHAR(2),
state_name VARCHAR(128));
If you wanted, say, to get a list of the license plates of all the Hondas in the database, that information is already contained in the car table. You can simply SELECT * FROM car WHERE make='Honda';
Similarly, if you wanted a list of all the states beginning with "A" you can SELECT * FROM state WHERE state_name LIKE 'A%';
In either case, since you already have a table with the information you want, there's no need for a join.
You may even want a list of cars with Colorado plates, but if you already know that "CO" is the state code for Colorado you can SELECT * FROM car WHERE state_code='CO'; Once again, the information you need is all in one place, so there is no need for a join.
But suppose you want a list of Hondas including the name of the state where they're registered. That information is not already contained within a table in your database. You will need to "create" one via a join:
car INNER JOIN state ON (car.state_code = state.state_code)
Note that I've said absolutely nothing about what we're SELECTing. That's a separate question entirely. We also haven't applied a WHERE clause limiting what rows are included in the results. That too is a separate question. The only thing we're addressing with the join is getting data from two tables together. We now, in effect, have a new table called car INNER JOIN state with each row from car joined to each row in state that has the same state_code.
Now, from this new "table" we can apply some conditions and select some specific fields:
SELECT plate_number, make, model, state_name
FROM car
INNER JOIN state ON (car.state_code = state.state_code)
WHERE make = 'Honda'
So, to answer your questions more directly, do you always need to join tables? Yes, if you intend to select data from both of them. You cannot select fields from car that are not in the car table. You must first join in the other tables you need.
Which tables do you need to join? Whichever tables contain the data you're interested in querying.
Which fields do you have to use? Whichever fields are relevant. In this case, the relationship between cars and states is through the state_code field in both table. I could just as easily have written
car INNER JOIN state ON (state.state_code = car.plate_number)
This would, for each car, show any states whose abbreviations happen to match the car's license plate number. This is, of course, nonsensical and likely to find no results, but as far as your database is concerned it's perfectly valid. Only you know that state_code is what's relevant.
And does the join affect SELECTed fields or WHERE conditions? Not really. You can still select whatever fields you want and you can still limit the results to whichever rows you want. There are two caveats.
First, if you have the same column name in both tables (e.g., state_code) you cannot select it without clarifying which table you want it from. In this case I might write SELECT car.state_code ...
Second, when you're using an INNER JOIN (or on many database engines just a JOIN), only rows where your join conditions are met will be returned. So in my nonsensical example of looking for a state code that matches a car's license plate, there probably won't be any states that match. No rows will be returned. So while you can still use the WHERE clause however you'd like, if you have an INNER JOIN your results may already be limited by that condition.
Very broad question, i would suggest doing some reading on it first but in summary:
1. joins can make life much easier and queries faster, in a nut shell try to
2. the ones with the data you are looking for
3. a field that is in both tables and generally is unique in at least one
4. yes, essentially you are createing one larger table with joins. if there are two fields with the same name, you will need to reference them by table name.columnname
do I always need to join tables?
No - you could perform multiple selects if you wished
which tables do I need to join?
Any that you want the data from (and need to be related to each other)
which fields do I have to use for the
join/s?
Any that are the same in any tables within the join (usually primary key)
do the joins effects reflect on fields specified in select clause or on where conditions?
No, however outerjoins can cause problems
(1) what else but tables would you want to join in mySQL?
(2) those from which you want to correlate and retrieve fields (=data)
(3) best use indexed fields (unique identifiers) to join as this is fast. e.g. join
retrieve user-email and all the users comments in a 2 table db
(with tables: tableA=user_settings, tableB=comments) and both having the column uid to indetify the user by
select * from user_settings as uset join comments as c on uset.uid = c.uid where uset.email = "test#stackoverflow.com";
(4) both...

trying to determine unique identifier for database table

I have a database table with many columns and there is no specified primary key. There isn't a list of super keys either. Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure our whether a subset of keys can make a unique identifier for my table?
For example, a table may have 4 columns first name, last name, address and zip and the data I see is:
John, Smith, 1 main st, 00001
Mary, Smith, 1 main st, 00001
Mary, Smith, 2 sub st, 00002
In this case, I'll need first, last and zip as my unique key.
John, Smith, 1 main st, 00001
John, Smith, 1 main st, 00001
In this case, there is no unique key.
Please don't comment on my table construction and/or normalization of databases, I'm just trying to find a practical answer. Thanks.
This is my question: Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure our whether a subset of keys can make a unique identifier for my table?
Looking for a subset of unique values in this case seems so specific to the particular data set. What if you arrive at a subset today and find you can't insert a new row tomorrow?
Use an artificial key, like an auto-incrementing integer.
In short: no, there's no way to do this in T-SQL really.
My advice: just add a ID INT IDENTITY PRIMARY KEY column to the table. It's guaranteed to be unique, it will be filled automagically when you create it, it's fast and easy, no messy "is this really unique or are there any combinations of rows that violate the uniqueness" questions......
Just do it - it's the easiest way to go!!
You cannot find if a combination "can" make a primary key. You can find if one WILL make a good primary key for an existing set of data.
To find if a set of fields is candidate or not, you can count the distinct of those fields (using group-by with rollup) and compare that with count (*)
There is a much faster method.
Enterprise dbms have had it for many years but MS SQL Server 2005 (useable in 2008) and later provided the HashBytes() function. Convert the columns to CHAR() (VARCHAR on MS), concatenate them; then hash them; then compare the hashes. You can compare the two tables in a single SELECT command. IIRC max 8000 characters per row.
(If you use this answer, please undo and redo your Answer choice.)
if you are comparing two databases, then you can see if any duplicate rows exist in the source db with structures like this:
select a,b,c,d
from mytable
having count(*) > 1
group by a,b,c,d
include all columns.
then use all columns as the 'row key' to see if it exists in the target system
there are update anomalies in this schema:
you cannot a person without knowing his address
better approach is to separate to three tables, one for persons and one for PersonAddress
> perons: id,firstname, lastname
> address: id,address:
> personaddress: personid, addressid
You cannot find if a combination "can" make a primary key.
I actually disagree with this, I think it is possible to write a query that will SELECT all possible permutations of columns from the table and combine each permutation into a single unique value (the simplest, crudest way is to CAST them all to VARCHAR and connect them with a spacer character - a better way would be some kind of hash function).
With a single pass you would then have set of columns like P1, P12, P123, P2, P23, P3 etc (in case of three columns). Then you can do a query with COUNT(*) vs COUNT(DISTINCT) for each permutation column and you will see which permutations are unique.
Using dynamic SQL you could probably make it so that it would work on any table, although I don't know about the column limit for SQL Server.

How to Delete all data from a table which contain self referencing foreign key

I have a table which has employee relationship defined within itself.
i.e.
EmpID Name SeniorId
-----------------------
1 A NULL
2 B 1
3 C 1
4 D 3
and so on...
Where Senior ID is a foreign key whose primary key table is same with refrence column EmpId
I want to clear all rows from this table without removing any constraint. How can i do this?
Deletion need to be performed like this
4, 3 , 2 , 1
How can I do this
EDIT:
Jhonny's Answer is working for me but which of the answers are more efficient.
I don't know if I am missing something, but maybe you can try this.
UPDATE employee SET SeniorID = NULL
DELETE FROM employee
If the table is very large (cardinality of millions), and there is no need to log the DELETE transactions, dropping the constraint and TRUNCATEing and recreating constraints is by far the most efficient way. Also, if there are foreign keys in other tables (and in this particular table design it would seem to be so), those rows will all have to be deleted first in all cases, as well.
Normalization says nothing about recursive/hierarchical/tree relationships, so I believe that is a red herring in your reply to DVK's suggestion to split this into its own table - it certainly is viable to make a vertical partition of this table already and also to consider whether you can take advantage of that to get any of the other benefits I list below. As DVK alludes to, in this particular design, I have often seen a separate link table to record self-relationships and other kinds of relationships. This has numerous benefits:
have many to many up AND down instead of many-to-one (uncommon, but potentially useful)
track different types of direct relationships - manager, mentor, assistant, payroll approver, expense approver, technical report-to - with rows in the relationship and relationship type tables instead of new columns in the employee table
track changing hierarchies in a temporally consistent way (including terminated employee hierarchy history) by including active indicators and effective dates on the relationship rows - this is only fully possible when normalizing the relationship into its own table
no NULLs in the SeniorID (actually on either ID) - this is a distinct advantage in avoiding bad logic, but NULLs will usually appear in views when you have to left join to the relationship table anyway
a better dedicated indexing strategy - as opposed to adding SeniorID to selected indexes you already have on Employee (especially as the number of relationship types grows)
And of course, the more information you relate to this relationship, the more strongly is indicated that the relationship itself merits a table (i.e. it is a "relation" in the true sense of the word as used in relational databases - related data is stored in a relation or table - related to a primary key), and thus a normal form for relationships might strongly indicate that the relationship table be created instead of a simple foreign key relationship in the employee table.
Benefits also include its straightforward delete scenario:
DELETE FROM EmployeeRelationships;
DELETE FROM Employee;
You'll note a striking equivalence to the accepted answer here on SO, since, in your case, employees with no senior relationship have a NULL - so in that answer the poster set all to NULL first to eliminate relationships and then remove the employees.
There is a possibly appropriate usage of TRUNCATE depending upon constraints (EmpployeeRelationships is typically able to be TRUNCATEd since its primary key is usually a composite and not a foreign key in any other table).
Try this
DELETE FROM employee;
Inside a loop, run a command that deletes all rows with an unreferenced EmpID until there are zero rows left. There are a variety of ways to write that inner DELETE command:
DELETE FROM employee WHERE EmpID NOT IN (SELECT SeniorID FROM employee)
DELETE FROM employee e1 WHERE NOT EXISTS
(SELECT * FROM employee e2 WHERE e2.SeniorID = e.EmpID
and probably a third one using a JOIN, but I'm not familiar with the SQL Server syntax for that.
One solution is to normalize this by splitting out "senior" relationship into a separate table. For the sake of generality, make that second table "empID1|empID2|relationship_type".
Barring that, you need to do this in a loop. One way is to do it:
declare #count int
select #count=count(1) from table
while (#count > 0)
BEGIN
delete employee WHERE NOT EXISTS
(select 1 from employee 'e_senior'
where employee.EmpID=e_senior.SeniorID)
select #count=count(1) from table
END