Database Design: replace a boolean column with a timestamp column? - sql

Earlier I created tables this way:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
is_finished number(1) default 0 not null,
date_finished date
);
Column is_finished indicates whether the workflow has finished. Column date_finished records when the workflow finished.
Then I had the idea "I don't need is_finished, as I can just say: where date_finished is not null", and I designed the table without the is_finished column:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
date_finished date
);
(We use Oracle 10)
Is it a good or bad idea? I've heard you cannot have an index on a column with NULL values, so where date_finished is not null will be very slow on big tables.

Is it a good or bad idea?
Good idea.
You've eliminated the space taken by a redundant column; the DATE column serves double duty: you know the work was finished, and when.
I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
That's incorrect. You can index a nullable column; Oracle simply doesn't store index entries whose key values are all NULL, so a "where date_finished is not null" predicate can still use the index.
You can create a function-based index to get around the NULL values not being indexed, but most DBAs I've encountered really don't like them, so be prepared for a fight.
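For illustration, a function-based index along those lines might look roughly like this (a minimal sketch; the index name and the sentinel date are made up, and a query must repeat the exact same expression to use it):
create index ix_workflow_date_fbi on workflow (nvl(date_finished, date '9999-12-31'));
-- finding unfinished workflows via the FBI: the predicate must match the indexed expression
select id, name
from workflow
where nvl(date_finished, date '9999-12-31') = date '9999-12-31';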

There is a right way to index NULL values, and it doesn't use an FBI. Oracle will index NULL values; what it will not do is create an index entry when every column in the key is NULL. So you could eliminate the is_finished column and create the index with a constant as the second column, like this:
CREATE INDEX ix_workflow_date_finished ON workflow (date_finished, 1);
Then, if you check the explain plan on this query:
SELECT count(*) FROM workflow WHERE date_finished is null;
You might see the index being used (if the optimizer is happy).
Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:
The record has finished. is_finished
The record finished on a particular date. date_finished
Maybe you need to keep these separate, maybe you don't. When I think about eliminating the is_finished column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in the date_finished column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.
My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.
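If you do keep both columns, a check constraint can stop them from contradicting each other. A minimal sketch (the constraint name is made up): it allows a finished record with an unknown date, but forbids a date on an unfinished record.
alter table workflow add constraint chk_workflow_finished
check (is_finished = 1 or date_finished is null);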
Rule of Representation: Fold knowledge
into data so program logic can be
stupid and robust.
-Eric S. Raymond

To all those who said the column is a waste of space:
Double Duty isn't a good thing in a database. Your primary goal should be clarity. Lots of systems, tools, people will use your data. If you disguise values by burying meaning inside of other columns you're BEGGING for another system or user to get it wrong.
And anyone who thinks it saves space is utterly wrong.
You'll need two indexes on that date column... one will be Function Based as OMG suggests. It will look like this:
NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY'))
So to find unfinished jobs you'll have to make sure to write the WHERE clause correctly.
It will look like this:
WHERE
NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY')) = TO_DATE('01-JAN-9999', 'DD-MON-YYYY')
Yep. That's so clear. It's completely better than
WHERE
IS_Unfinished = 'YES'
The reason you'll want to have a second index on the same column is for EVERY OTHER query on that date... you won't want to use that index for finding jobs by date.
So let's see what you've accomplished with OMG's suggestion et al.
You've used more space, you've obfuscated the meaning of the data, you've made errors more likely... WINNER!
Sometimes it seems programmers are still living in the 70's, when a MB of hard drive space was a down payment on a house.
You can be space efficient about this without giving up a lot of clarity. Make Is_unfinished either 'Y' or NULL... IF you will only use that column to find 'work to do'. This will keep that index compact. It will only be as big as the rows which are unfinished (this way you exploit the unindexed NULLs instead of being hurt by them). You add a little bit of space to your table, but overall it's less than the FBI. You need 1 byte for the column, and you'll only index the unfinished rows, so that's a small fraction of the jobs and probably stays pretty constant. The FBI will need 7 bytes for EVERY ROW, whether you're trying to find it or not. That index will keep pace with the size of the table, not just the size of the unfinished jobs.
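A rough sketch of that approach (Oracle; the index name is made up, and NULLs pass the inline check automatically, so only 'Y' and NULL are allowed):
alter table workflow add (is_unfinished char(1) check (is_unfinished = 'Y'));
create index ix_workflow_unfinished on workflow (is_unfinished);
-- only the unfinished rows have a non-null key, so the index stays tiny
select id, name from workflow where is_unfinished = 'Y';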
Reply to the comment by OMG
In his/her comment he/she states that to find unfinished jobs you'd just use
WHERE date_finished IS NULL
But in his answer he says
You can create a function based index in order to get around the NULL values not being indexed
If you follow the link he points you toward, it uses NVL to replace null values with some other arbitrary value, so I'm not sure what else there is to explain.

Is it a good or bad idea? I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
Oracle does index nullable fields, but does not index NULL values
This means that you can create an index on a nullable field, but the records holding NULL in this field won't make it into the index.
This, in turn, means that if you leave date_finished NULL for unfinished workflows, the index will be smaller, as the NULL values won't be stored in it.
So queries involving equality or range searches on date_finished will in fact perform better.
The downside of this solution, of course, is that the queries involving the NULL values of date_finished will have to revert to full table scan.
You can work around this by creating two indexes:
CREATE INDEX ix_date_finished ON mytable (date_finished);
CREATE INDEX ix_date_finished_null ON mytable (DECODE(date_finished, NULL, 1));
and use this query to find unfinished work:
SELECT *
FROM mytable
WHERE DECODE(date_finished, NULL, 1) = 1
This will behave like a partitioned index: the completed workflows will be indexed by the first index, the incomplete ones by the second.
If you don't need to search for completed or incomplete workflows, you can always drop the corresponding index.

In terms of table design, I think it's good that you removed the is_finished column, as you said it isn't necessary (it's redundant). There's no need to store extra data if it isn't needed; it just wastes space. In terms of performance, I don't see the NULL values being a problem: they are simply ignored by the index.

I would use NULLs, since indexes work (as already mentioned in other answers) for all queries apart from "WHERE date_finished IS NULL" (so it depends on whether you need that query). I definitely wouldn't use outliers like year 9999 as suggested by the answer:
you could also use a "dummy" value (such as 31 December 9999) as the date_finished value for unfinished workflows
Outliers like year 9999 affect performance, because (from http://richardfoote.wordpress.com/2007/12/13/outlier-values-an-enemy-of-the-index/):
The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value)
If you use a value like 9999, then the DB will think the range of values stored in the column is e.g. 2008-9999 rather than the actual 2008-2010; so any range query (e.g. "between 2008 and 2009") will appear to cover a tiny % of the range of possible values, versus actually covering about half of it. The optimizer uses this statistic to decide: if the percentage of the possible value range covered is high, probably a lot of rows will match, and then a full table scan will be faster than an index scan. It can't make this decision correctly if there are outliers in the data.
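A rough worked example of that effect (all figures assumed): if the real data spans 2008-2010, the optimizer estimates that "between 2008 and 2009" covers (2009 - 2008) / (2010 - 2008) = 50% of the possible range, correctly expects many rows, and may sensibly pick a full scan. With a 9999 dummy value in the column, the same predicate appears to cover only (2009 - 2008) / (9999 - 2008), roughly 0.01% of the range, so the optimizer badly underestimates the row count and can pick an index range scan where a full scan would have been cheaper.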

Good idea to remove the derivable-value column, as others have said.
One more thought: by removing the column, you avoid paradoxical conditions that you would otherwise need to code around, such as what happens when is_finished = 'No' and date_finished = yesterday... etc.

To resolve the indexed / non-indexed dilemma, wouldn't it be easier to simply split this into two tables and JOIN them, like this:
-- PostgreSQL
CREATE TABLE workflow(
id SERIAL PRIMARY KEY
, name VARCHAR(100) NOT NULL
);
CREATE TABLE workflow_finished(
id INT NOT NULL PRIMARY KEY REFERENCES workflow
, date_finished date NOT NULL
);
Thus, if a record exists in workflow_finished, this workflow's completed, else it isn't. It seems to me this is rather simple.
When querying for unfinished workflows, the query becomes:
-- Only unfinished workflow items
SELECT workflow.id
FROM workflow
WHERE NOT EXISTS(
SELECT 1
FROM workflow_finished
WHERE workflow_finished.id = workflow.id);
Maybe you want the original query? With a flag and the date? Query like this then:
-- All items, with the flag and date
SELECT
workflow.id
, CASE
WHEN workflow_finished.id IS NULL THEN 'f'
ELSE 't'
END AS is_finished
, workflow_finished.date_finished
FROM
workflow
LEFT JOIN workflow_finished USING(id);
For consumers of the data, views can and should be created for their needs.
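For example, a view exposing the original flag-and-date shape might look like this (a sketch; the view name is made up):
-- PostgreSQL
CREATE VIEW workflow_with_status AS
SELECT
workflow.id
, workflow.name
, workflow_finished.id IS NOT NULL AS is_finished
, workflow_finished.date_finished
FROM
workflow
LEFT JOIN workflow_finished USING(id);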

As an alternative to a function-based index, you could also use a "dummy" value (such as 31 December 9999, or alternatively one day before the earliest expected date_finished value) as the date_finished value for unfinished workflows.
EDIT: Alternative dummy date value, following comments.

I prefer the single-column solution.
However, in the databases I use most often, NULLs are included in indexes, so the common case of searching for open workflows would be fast, whereas in your case it will be slower. Because searching for open workflows is likely to be one of the most common things you do, you may need the redundant column simply to support that search.
Test the performance to see if the cleaner single-column solution is fast enough, then fall back to the less elegant solution if necessary.

Related

Cursor select on 2nd Primary key on db2 sql field for embedded positional value - Unable to determine most efficient for longterm design

I have a SQL DB2 table where the first two fields are the primary keys (not including the third field, which is a date/time stamp). The table was designed by another team with the intent of making it generic. I was brought into the project after the key value for the second field was coded for when it was inserted into the table.
This leads me to this: we now have to do a cursor select with a WHERE clause that includes the first primary key, and then, for the second primary key, it must match only when a specific value appears at position 21 for 8 bytes. (We will always know what that value will be for the second field.) The second field is a generic 70-byte alphanumeric field.
My question is: should we use a LIKE wildcard for the WHERE clause condition on the second primary field, or instead a SUBSTR, since we know the position of the value? I ask because I have done an EXPLAIN, yet I do not see a difference between the two (neither does my database analyst). And this is for a few million records in a 1300-byte-long table. However, my concern is that the volume of data in the table will grow on various systems, so performance may become an issue. Right now it is hard to measure the difference between LIKE and SUBSTR, but I would like to do my due diligence and code this for long-term performance. And if there is a third option, please let me know.
The CPU difference between a SUBSTR and a LIKE will be minimal. What might be more important is the filter factor estimate in the explain. You might want to use a statistical view over the table to help the optimizer here. In that case the SUBSTR operator will be better as you would include that as a column in the stats view.
Alternatively, a generated column based on the SUBSTR would also provide the better stats, and you could include the column in an index too.
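A rough sketch of the generated-column route (DB2 for LUW syntax; the table and column names here are placeholders, since the real ones aren't given, and a populated table may need extra SET INTEGRITY processing after the ALTER):
-- placeholders: big_table, key1 (first primary key field), generic_key (the 70-byte field)
ALTER TABLE big_table
ADD COLUMN embedded_key CHAR(8)
GENERATED ALWAYS AS (SUBSTR(generic_key, 21, 8));
CREATE INDEX ix_big_table_key1_embedded ON big_table (key1, embedded_key);
-- the cursor's WHERE clause can then filter on key1 and embedded_key directly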

SQL Server table optimal indexing

I have a very specific question, this is part of a job interview test.
I have this table:
CREATE TABLE Teszt
(
Id INT NOT NULL
, Name NVARCHAR(100)
, [Description] NVARCHAR(MAX)
, Value DECIMAL(20,4)
, IsEnabled BIT
)
And these selects:
SELECT Name
FROM Teszt
WHERE Id = 10
SELECT Id, Value
FROM Teszt
WHERE IsEnabled = 1
SELECT [Description]
FROM Teszt
WHERE Name LIKE '%alma%'
SELECT [Description]
FROM Teszt
WHERE Value > 1000 AND IsEnabled = 1
SELECT Id, Name
FROM Teszt
WHERE IsEnabled = 1
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided, so my answer will contain the general pro/contra arguments for indexes, but I'm not sure regarding the above queries.
My thoughts on optimizing these specific queries with indexes:
Id should probably have an index; it looks like the primary key and it is part of a WHERE clause;
Creating one on the Value column would also be good, as it's part of a WHERE clause here;
Now it gets murky for me. For the Name column, based on just the above queries, I probably shouldn't create one, as it is only used with a leading-wildcard LIKE, which defeats the purpose of an index, right?
I tried to read everything on indexing a bit column (the IsEnabled column in the table), but I can't say it's any clearer to me, as the arguments range wildly. Should I create an index on it? Should it be filtered? Should it be a separate index, or part of one with the other columns?
Again, this is all theoretical, so no info on the size or the usage of the table.
Thanks in advance for any answer!
Regards,
Tom
An index on a bit column is generally not recommended. The following discussion applies not only to bit columns but to any low-cardinality value. In English, "low-cardinality" means the column takes on only a handful of values.
The reason is simple. A bit column takes on three values (if you include NULL). That means that a typical selection on the column would return about a third of the rows. A third of the rows means that you would (typically) be accessing every data page. If so, you might as well do a full table scan.
So, let's ask the question explicitly: When is an index on a bit column useful or appropriate?
First, the above argument does not work if you are always looking for IsEnabled = 1 and, say, 0.001% of the rows are enabled. This is a highly selective query and an index could help. Note: The index would not help on IsEnabled = 0 in this scenario.
Second, the above argument argues in favor of a clustered index on the bit value. If the values are clustered, then even a 30% selectivity means that you are only reading 30% of the rows. The downside is that updating the value means moving the record from one data page to another (a somewhat expensive operation).
Third, a bit column can constructively be part of a larger index. This is especially true of a clustered index with the bit being first. For instance, for the fourth query, one could argue that a clustered index on (IsEnabled, Value) would be the optimal index (Description comes along for free, since the leaf of a clustered index is the row itself; an NVARCHAR(MAX) column can't be an index key column anyway).
To be honest, though, I don't like playing around with clustered indexes. I prefer that the primary key be the clustered index. I admit that performance gains can be impressive for a narrow set of queries -- and if this is your use case, then use them (and accessing enabled rows might be a good reason for using them). However, the clustered index is something that you get to use only once, and primary keys are the best generic option to optimize joins.
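For the highly selective IsEnabled = 1 case in the first point above, a filtered index is one more option; a minimal sketch (SQL Server syntax, index name made up):
-- only rows with IsEnabled = 1 are stored in this index, so it stays small
CREATE NONCLUSTERED INDEX ix_Teszt_Enabled
ON dbo.Teszt (Value)
INCLUDE (Id, Name)
WHERE IsEnabled = 1;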
You can read the details about how to create an index in this article: https://msdn.microsoft.com/en-us/library/ms188783.aspx
As you said, there are pros and cons to using indexes.
Pros: SELECT queries will be faster.
Cons: INSERTs (and other writes) will be slower.
Conclusion: add indexes if your table sees relatively few INSERTs and mostly SELECT operations.
On which columns should I consider adding an index? This is really a very good question. Though I am not a DB expert, here are my views:
Add index on your primary key column
Add index on your join column [inner/outer/left]
Short answer: on Id and IsEnabled
(despite the controversy about indexing a BIT field; and Id should be the primary key)
Generally, to optimize performance, indexes should be on fields that appear in WHERE or JOIN clauses. Under the hood, to make the selection the db server looks for a usable index, and if none is found it has to scan the table (or build a temporary structure in memory), which takes time, hence the performance degradation.
As Bhuwan noted, indexes are "bad" for INSERTs (keep that in mind for the whole picture when designing a database), but there are only SELECTs in the example provided.
Hope you passed the test :)
-Nick
tldr: I will probably delete this later, so no need!
My answer to this job interview question: "It depends." ... and then I would probably spend too much of the interview talking about how terrible the question is.
The problem is that this is simply a bad question for a "job interview test". I have been poking at this for two hours now, and the longer I spend the more annoyed I get.
With absolutely no information on the content of the table, we cannot guarantee that this table is even in first normal form or better, so we cannot even assume that the only non-nullable column, Id, is a valid primary key.
With no idea about the content of the table, we do not even know if it needs indexes. If it has only a few rows, then the entire page will sit in memory and whichever operations you are running against it will be fast enough.
With no cardinality information we do not know if a value > 1000 is common or uncommon. All or none of the values could be greater than 1000, but we do not know.
With no cardinality information we do not know if IsEnabled = 1 would mean 99% of rows, or even 0% of rows.
I would say you are on the right track as far as your thought process for evaluating indexing, but the trick is that you are drawing from your experiences with indexes you needed on tables before this table. Applying assumptions based on general previous experience is fine, but you should always test them. In this case, blindly applying general practices could be a mistake.
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided
If I try to approach this from another position (nothing else matters except the performance of these five queries), I would apply these indexes:
create index ixf_Name on dbo.Teszt(Name)
    include (Id)
    where id = 10;
create index ixf_Value_Enabled on dbo.Teszt(Value)
    include (Id)
    where IsEnabled = 1;
create index ixf_Value_gt1k_Enabled on dbo.Teszt(Id)
    include (description, value, IsEnabled)
    where Value > 1000 and IsEnabled = 1;
create index ixf_Name_Enabled on dbo.Teszt(Id)
    include (Name, IsEnabled)
    where IsEnabled = 1;
create index ixf_Name_notNull on dbo.Teszt(Name)
    include (Description)
    where Name is not null;
Also, the decimal(20,4) annoys me because this is the least amount of data you can store in the 13 bytes of space it takes up. decimal(28,4) has the same storage size and if it could have been decimal(19,4) then it would have been only 9 bytes. Granted this is a silly thing to be annoyed about, especially considering the table is going to be wide anyway, but I thought I would point it out anyway.

Optimize SQL Query on SQLite3 by using indexes

I'm trying to optimize a SQL query by creating indexes to get the best performance.
Table definition
CREATE TABLE Mots (
numero INTEGER NOT NULL,
fk_dictionnaires integer(5) NOT NULL,
mot varchar(50) NOT NULL,
ponderation integer(20) NOT NULL,
drapeau varchar(1) NOT NULL,
CONSTRAINT pk_mots PRIMARY KEY(numero),
CONSTRAINT uk_dico_mot_mots UNIQUE(fk_dictionnaires, mot),
CONSTRAINT fk_mots_dictionnaires FOREIGN KEY(fk_dictionnaires) REFERENCES Dictionnaires(numero)
);
Indexes definition
CREATE INDEX idx_dictionnaires ON mots(fk_dictionnaires DESC);
CREATE INDEX idx_mots_ponderation ON mots(ponderation);
CREATE UNIQUE INDEX idx_mots_unique ON mots(fk_dictionnaires, mot);
SQL Query :
SELECT numero, mot, ponderation, drapeau
FROM mots
WHERE mot LIKE 'ar%'
AND fk_dictionnaires=1
AND LENGTH(mot)>=4
ORDER BY ponderation DESC
LIMIT 5;
Query Plan
0|0|0|SEARCH TABLE mots USING INDEX idx_dictionnaires (fk_dictionnaires=?) (~2 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
The defined indexes don't seem to be used, and the query takes (according to the .timer):
CPU Time: user 0.078001 sys 0.015600
However, when I remove fk_dictionnaires=1, my indexes are used correctly and the timings are around 0.000000-0.01XXXXXX sec:
0|0|0|SCAN TABLE mots USING INDEX idx_mots_ponderation (~250000 rows)
I found some similar questions on Stack Overflow, but no answer helped me:
Removing a Temporary B Tree Sort from a SQLite Query
Similar issue
How can I improve performance by using indexes and/or by changing the SQL query?
Thanks in advance.
SQLite seems to think that the idx_dictionnaires index is very selective and concludes that if it scans using idx_dictionnaires, it will only have to examine a couple of rows. However, the performance results you quote suggest that it must be examining far more than a couple of rows. First, why don't you try ANALYZE mots, so SQLite will have up-to-date information on the cardinality of each available index?
Here is something else which might help, from the SQLite documentation:
Terms of the WHERE clause can be manually disqualified for use with indices by prepending a unary + operator to the column name. The unary + is a no-op and will not slow down the evaluation of the test specified by the term. But it will prevent the term from constraining an index. So, in the example above, if the query were rewritten as:
SELECT z FROM ex2 WHERE +x=5 AND y=6;
The + operator on the x column will prevent that term from constraining an index. This would force the use of the ex2i2 index.
Note that the unary + operator also removes type affinity from an expression, and in some cases this can cause subtle changes in the meaning of an expression. In the example above, if column x has TEXT affinity then the comparison "x=5" will be done as text. But the + operator removes the affinity. So the comparison "+x=5" will compare the text in column x with the numeric value 5 and will always be false.
If ANALYZE mots isn't enough to help SQLite choose the best index to use, you can use this feature to force it to use the index you want.
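Applied to your query, that would look like this (a sketch; the unary + keeps the fk_dictionnaires term from constraining idx_dictionnaires):
SELECT numero, mot, ponderation, drapeau
FROM mots
WHERE mot LIKE 'ar%'
AND +fk_dictionnaires = 1
AND LENGTH(mot) >= 4
ORDER BY ponderation DESC
LIMIT 5;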
You could also try compound indexes -- it looks like you already defined one on fk_dictionnaires,mot, but SQLite isn't using it. For the "fast" query, SQLite seemed to prefer using the index on ponderation, to avoid sorting the rows at the end of the query. If you add an index on fk_dictionnaires,ponderation DESC, and SQLite actually uses it, it could pick out the rows which match fk_dictionnaires=1 without a table scan and avoid sorting at the end.
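A sketch of that compound index (the index name is made up); running ANALYZE again afterwards gives the planner fresh statistics to choose with:
CREATE INDEX idx_mots_dico_ponderation ON mots(fk_dictionnaires, ponderation DESC);
ANALYZE mots;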
POSTSCRIPT: The compound index I suggested above "fixed" the OP's performance problem, but he also asked how and why it works. @AGeiser, I'll use a brief illustration to try to help you understand DB indexes intuitively:
Imagine you need to find all the people in your town whose surnames start with "A". You have a directory of all the names, but they are in random order. What do you do? You have no choice but to read through the whole directory, and pick out the ones which start with "A". Sounds like a lot of work, right? (This is like a DB table with no indexes.)
But what if somebody gives you a phone book, with all the names in alphabetical order? Now you can just find the first and last entries which start with "A" (using something like a binary search), and take all the entries in that range. You don't have to even look at all the other names in the book. This will be way faster. (This is like a DB table with an index; in this case, call it an index on last_name,first_name.)
Now what if you want all the people whose names start with "A", but in the case that 2 people have the same name, you want them to be ordered by postal code? Even if you get the needed names quickly using the "phone book" (ie. index on last_name,first_name), you will still have to sort them all manually... so it starts sounding like a lot of work again. What could make this job really easy?
It would take another "phone book" -- but one in which the entries are ordered first by name, and then by postal code. With a "phone book" like that, you could quickly select the range of entries which you need, and you wouldn't even need to sort them -- they would already be in the desired order. (This is an index on last_name,first_name,postal_code.)
I think this illustration should make it clear how indexes can help SELECT queries, not just by reducing the number of rows which must be examined, but also by (potentially) eliminating the need for a separate "sort" phase after the needed rows are found. Hopefully it also makes it clear that a compound index on a,b is completely different from one on b,a. I could go on giving more "phone book" examples, but this answer would become so long that it would be more like a blog post. To build your intuition on which indexes are likely to benefit a query, I recommend the book "SQL Antipatterns" (especially chapter 13, "Index Shotgun").

Index on column with only 2 distinct values

I am wondering about the performance of this index:
I have an "Invalid" varchar(1) column that has 2 values: NULL or 'Y'
I have an index on (invalid), as well as (invalid, last_validated)
Last_validated is a datetime (this is used for an unrelated SELECT query)
I am flagging a small number of rows in the table (1-5%) as 'to be deleted'.
This is so that when I run
DELETE FROM items WHERE invalid='Y'
it does not perform a full table scan for the invalid items.
A problem seems to be that the actual DELETE is now quite slow, possibly because all the index entries have to be maintained as the rows are deleted.
Would a bitmap index provide better performance for this? or perhaps no index at all?
The index should be used, but the DELETE can still take some time.
Have a look at the execution plan of the DELETE:
EXPLAIN PLAN FOR
DELETE FROM items WHERE invalid='Y';
SELECT * FROM TABLE( dbms_xplan.display );
You could try using a Bitmap Index, but I doubt that it will have much impact on performance.
Using NULL as a value is not a good idea. The query
SELECT something FROM items WHERE invalid IS NULL
would not be able to use your index, since the index only contains non-null values.
As Peter suggested, it's important to first verify that the index is being used for the DELETE. Bitmap indexes impose much coarser locking for DML, which could hurt overall performance.
Additional considerations:
are there unindexed foreign key references to this table from other tables?
are there triggers on this table that are performing other DML?
Two thoughts on this...
Using NULL to express the opposite of 'Y' is possibly not a good idea. NULL means 'I don't know what this value is' or 'there is no meaningful answer to the question'. You should really use 'N' as the opposite of 'Y'. This would also eliminate the problem of searching for valid items: Oracle does not store all-NULL keys in an index, so a WHERE invalid IS NULL predicate cannot use the index on that column, whereas with 'Y'/'N' every row is indexed.
You may want to consider adding a CHECK CONSTRAINT on such a column to ensure that only legal values are entered.
Neither of these changes necessarily has any impact on DELETE performance however.
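A sketch of such a constraint (the constraint name is made up; note that NULL still passes a CHECK constraint, so if you switch to 'Y'/'N' you would also make the column NOT NULL):
ALTER TABLE items ADD CONSTRAINT chk_items_invalid
CHECK (invalid IN ('Y', 'N'));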
I recommend:
check how many records you expect the DELETE to affect (i.e. maybe there are more than you expect)
if the number of rows that should be deleted is relatively small, check that the index on invalid is actually being used by the DELETE
get a trace on the session to see what it is doing - it might be reading more blocks from disk than expected, or it might be waiting (e.g. record locking or latch contention); a minimal tracing sketch follows below
Don't bother dropping or creating indexes until you have an idea of what actually is going on. You could make all kinds of changes, see an improvement (but not know why it improved), then months down the track the problem reoccurs or is even worse.
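A minimal sketch of the tracing step (Oracle; this traces the current session, and the resulting trace file is typically formatted with tkprof):
ALTER SESSION SET sql_trace = TRUE;
DELETE FROM items WHERE invalid = 'Y';
ROLLBACK; -- or COMMIT, if this is not just a test run
ALTER SESSION SET sql_trace = FALSE;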
Drop the index on (invalid) and try both the SELECT and the DELETE. You already have an index on (invalid, last_validated), so you should not need the index on invalid alone. Also, approximately how many rows are there in this table?

Indexing nulls for fast searching on DB2

It's my understanding that nulls are not indexable in DB2, so assuming we have a huge table (Sales) with a date column (sold_on) which is normally a date, but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application that we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type? Indexing the indicator column?
Where did you get the impression that DB2 doesn't index NULLs? I can't find anything in the documentation or articles supporting the claim. And I just ran a query on a large table using an IS NULL restriction involving an indexed column containing a small fraction of NULLs; in this case, DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time on a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: Your data may be composed in a way where DB2 thinks that using an index will not be quicker. Or the database's statistics aren't up-to-date for the involved table(s).
I'm no DB2 expert, but if 10% of your values are null, I don't think an index on that column alone will ever help your query. 10% is too many to bother using an index for -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use your index.
Think about how many records are on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page will contain 0 records that are null is (90%)^20, or 12%. Those aren't good odds -- you're going to need 88% of your pages to be fetched anyway, so using the index isn't very helpful.
If, however, your select clause only included a few columns (and not *) -- say just salesid, you could probably get it to use an index on (sold_on,salesid), as the read of the data page wouldn't be needed -- all the data would be in the index.
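A sketch of that idea (the index name is made up): with a covering index, a narrower SELECT can be answered from the index alone, without touching the data pages.
CREATE INDEX ix_sales_soldon_salesid ON Sales (sold_on, salesid);
SELECT salesid
FROM Sales
WHERE sold_on IS NULL;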
The rule of thumb is that an index is useful for values in up to 15% of the records... so an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and set it to true whenever the sold_on date gets set (this could be done in a trigger).
That's not the nicest solution, but it might be what you need.
Troels is correct; even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows added will most likely have a newer SOLD_ON date.