How to design SQL tables when column data arrives in multiple types/margins of error? - sql

I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.
Depending on how you look at it, this is either a variance in type or margin of error.
This is a subprime situation, but I can't afford to recover or discard any data.
What's the optimal (eg. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?
*i.e. Entity-Attribute-Value is out.

You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.

if it comes in as just a year make it default to 01 for month and date, YYYY-01-01
This way you can still use a date/datetime datatype and don't have to worry about invalid dates

Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.

I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.
It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.

Either fix it, then store it (OK, not an option)
Or store it broken with a fixed computed columns
Something like this
CREATE TABLE ...
...
Broken varchar(20),
Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)
This also allows you to detect good from bad source data

If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.

I'd put three columns in the table:
The provided value (YYYY-MM-DD or YYYY)
A date column, Date or DateTime data type, which is nullable
A year column, as an integer or char(4) depending upon your needs.
I'd always populate the year column, populate the date column only when the provided value is a date.
And, because you've kept the provided value, you can always re-process down the road if needs change.

An alternative solution would be to that of a date mask (like in IP). Store the date in a regular datetime field, and insert an additional field of type smallint or something, where you could indicate which is present (could go even binary here):
If you have YYYY-MM-DD, you would have 3 bits of data, which will have the values 1 if data is present and 0 if not.
Example:
Date Mask
2009-12-05 7 (111)
2009-12-01 6 (110, only year and month are know, and day is set to default 1)
2009-01-20 5 (101, for some strange reason, only the year and the date is known. January has 31 days, so it will never generate an error)
Which solution is better depends on what you will do with it.
This is better when you want to select those with full dates, which are between a certain period (less to write). Also this way it's easier to compare any dates which have masks like 7,6,4. It may also take up less memory (date + smallint may be smaller than int+int+int, and only if datetime uses 64 bit, and smallint uses up as much as int, it will be the same).

I was going to suggest the same solution as #ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.
You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.

+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.
My solution:
a date column used only for complete dates
an int column used for years
a constraint to ensure integrity between the two
a trigger to populate the year if only date is supplied
Advantages:
can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
insert either year or date or both (provided they agree)
no fear of disagreement between columns
self explanatory, intuitive
I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.
Example code for Sqlite 3:
create table events
(
rowid integer primary key,
event_year integer,
event_date date,
check (event_year = cast(strftime("%Y", event_date) as integer))
);
create trigger year_trigger after insert on events
begin
update events set event_year = cast(strftime("%Y", event_date) as integer)
where rowid = new.rowid and event_date is not null;
end;
-- various methods to insert
insert into events (event_year, event_date) values (2008, "2008-02-23");
insert into events (event_year) values (2009);
insert into events (event_date) values ("2010-01-19");
-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime("%m", event_date) = "01";

Related

Unexpected result with ORDER BY

I have the following query:
SELECT
D.[Year] AS [Year]
, D.[Month] AS [Month]
, CASE
WHEN f.Dept IN ('XSD') THEN 'Marketing'
ELSE f.Dept
END AS DeptS
, COUNT(DISTINCT f.OrderNo) AS CountOrders
FROM Sales.LocalOrders AS l WITH
INNER JOIN Sales.FiscalOrders AS f
ON l.ORDER_NUMBER = f.OrderNo
INNER JOIN Dimensions.Date_Dim AS D
ON CAST(D.[Date] AS DATE) = CAST(f.OrderDate AS DATE)
WHERE YEAR(f.OrderDate) = 2019
AND f.Dept IN ('XSD', 'PPM', 'XPP')
GROUP BY
D.[Year]
, D.[Month]
, f.Dept
ORDER BY
D.[Year] ASC
, D.[Month] ASC
I get the following result the ORDER BY isn't giving the right result with Month column as we can see it is not ordered:
Year Month Depts CountOrders
2019 1 XSD 200
2019 10 PPM 290
2019 10 XPP 150
2019 2 XSD 200
2019 3 XPP 300
The expected output:
Year Month Depts CountOrders
2019 1 XSD 200
2019 2 XSD 200
2019 3 XPP 300
2019 10 PPM 290
2019 10 XPP 150
Your query
It is ordered by month, as your D.[Month] is treated like a text string in the ORDER BY clause.
You could do one of two things to fix this:
Use a two-digit month number (e.g. 01... 12)
Use a data type for the ORDER BY clause that will be recognized as representing a month
A quick fix
You can correct this in your code by quickly changing the ORDER BY clause to analyze those columns as though they are numbers, which is done by converting ("casting") them to an integer data type like this:
ORDER BY
CAST(D.[Year] AS INT) ASC
,CAST(D.[Month] AS INT) ASC
This will correct your unexpected query results, but does not address the root cause, which is your underlying data (more on that below).
Your underlying data
The root cause of your issue is how your underlying data is stored and/or surfaced.
Your Month seems to be appearing as a default data type (VarChar), rather than something more specifically suited to a month or date.
If you administer or have access to or control over the database, it is a good idea to consider correcting this.
In considering this, be mindful of potential context and change management issues, including:
Is this underlying data, or just a representation of upstream data that is elsewhere? (e.g. something that is refreshed periodically using a process that you do not control, or a view that is redefined periodically)
What other queries or processes rely on how this data is currently stored or surfaced (including data types), that may break if you mess with it?
Might there be validation issues if correcting it? (such as from the way zero, null, non-numeric or non-date data is stored, even if invalid)
What change management practices should be followed in your environment?
Is the data source under high transactional load?
Is it a production dataset?
Are other reporting processes dependent on it?
None of these issues are a good excuse to leave something set up incorrectly forever, which will likely compound the issue and introduce others. However, that is only part of the story.
The appropriate approach (correct it, or leave it) will depend on your situation. In a perfect textbook world, you'd correct it. In your world, you will have to decide.
A better way?
The above solution is a bit of a quick and nasty way to force your query to work.
The fact that the solution CASTs late in the query syntax, after the results have been selected and filtered, hints that is not the most elegant way to achieve this.
Ideally you can convert data types as early as possible in the process:
If done in underlying data, not the query, this is the ultimate but may not suit the situation (see below)
If done in the query, try to do it earlier.
In your case, your GROUP BY and ORDER BY are both using columns that look to be redundant data from the original query results, that is, you are getting a DATE and a MONTH and a YEAR. Ideally you would just get a DATE and then use the MONTH or YEAR from that date. Your issue is your dates are not actually dates (see "underlying data" above), which:
In the case of DATE, is converted in your INNER JOIN line ON CAST(D.[Date] AS DATE) = CAST(f.OrderDate AS DATE) (likely to minimise issues with the join)
In the case of D.[year] and D.[month], are not converted (which is why we still need to convert them further down, in ORDER BY)
You could consider ignoring D.[month] and use the MONTH DATEPART computed from DATE, which would avoid the need to use CAST in the ORDER BY clause.
In your instance, this approach is a middle ground. The quick fix is included at the top of this answer, and the best fix is to correct the underlying data. This last section considers optimizing the quick fix, but does not correct the underlying issue. It is only mentioned for awareness and to avoid promoting the use of CAST in an ORDER BY clause as the most legitimate way of addressing your issue with good clean query syntax.
There are also potential performance tradeoffs between how many columns you select that you don't need (e.g. all of the ones in D?), whether to compute the month from the date or a seperate month column, whether to cast to date before filtering, etc. These are beyond the scope of this solution.
So:
The immediate solution: use the quick fix
The optimal solution: after it's working, consider the underlying data (in your situation)
The real problem is your object Dimensions.Date_Dim here. As you are simply ordering on the value of D.[Year] and D.[Month] without manipulating the values at all, this means the object is severely flawed; you are storing numerical data as a varchar. varchar, and numerical data types are completely different. For example 2 is less than 10 but '2' is greater than '10'; because '2' is greater than '1', so therefore it must also be greater than '10'.
The real solution, therefore, is fixing your object. Assuming that both Month and Year are incorrectly stored as a varchar, don't have any non-integer values (another and different flaw if so), and not a computed column then you could just do:
ALTER TABLE Dimensions.Date_Dim ALTER COLUMN [Year] int NOT NULL;
ALTER TABLE Dimensions.Date_Dim ALTER COLUMN [Month] int NOT NULL;
You could, however, also make the columns a PERSISTED computed column, which might well be easier, in my opinion, as DATEPART already returns a strongly typed int value.
ALTER TABLE dbo.Date_Dim DROP COLUMN [Month];
ALTER TABLE dbo.Date_Dim ADD [Month] AS DATEPART(MONTH,[Date]) PERSISTED;
Of course, for both solutions, you'll need to (first) DROP and (afterwards) reCREATE any indexes and constraints on the columns.
As long as your "Month" is always 1-12, you can use
SELECT ..., TRY_CAST(D.[Month] AS INT) AS [Month],...
ORDER BY TRY_CAST(D.[Month] AS INT)
The simplest solution is:
ORDER BY MIN(D.DATE)
or:
ORDER BY MIN(f.ORDER_DATE)
Fiddling with the year and month columns is totally unnecessary when you have a date column that is available.
A very common issue when you store numerical data as a varchar/nvarchar.
Try to cast Year and Month to INT.
ORDER BY
CAST(D.[Year] AS INT) ASC
,CAST(D.[Month] AS INT) ASC
If you try using the <, > and BETWEEN operators, you will get some really "weird" results.

Copying only certain values from one row into another table

I am trying to copy data from one table to another table, which works fine, but I only want to copy certain data from one the of the columns.
Insert Into Period (Invoice_No, Period_Date)
Select Invoice_Seq_No, Inv_Comment
From Invoices
Where INV_Comment LIKE '%November 2015';
The Inv_Comment column contains free-form comments and the date in different formats, e.g. "paid on November 2015 or "paid on Aug" or "July 2015". What I am trying to do is to copy only the "November 2015" part of the comment into the new table.
The above code only copies the entire data of the Inv_Comment field and I only want to copy the date. The date part can be in one of three formats: MON YYYY, DD.MM.YYYY or only the month i.e. MON
How can I extract only the date part I am interested in?
For your very simple example query you can use the substr() function, using the length of your fixed value to count back from the end of the string, as that document describes:
If position is negative, then Oracle counts backward from the end of char.
So you can do:
select invoice_seq_no, substr(inv_comment, -length('November 2015'))
from invoices
where inv_comment like '%November 2015';
But it's clear from the comments that you really want to find all dates, in various formats, and not always at the end of the free-form text. One option is to search the text repeatedly for all the possible formats and values, starting with the most specific (e.g. DD.MM.YYYY) and then going down to least specific
(e.g. just MON). You could insert just the sequence numbers into your table start with, and then repeatedly update the rows that do not yet have values set:
insert into period (invoice_no) select invoice_seq_no from invoices;
update period p
set period_date = (
select case when instr(i.inv_comment, '15.09.2015') > 0 then
substr(i.inv_comment, instr(i.inv_comment, '15.09.2015'), length('15.09.2015'))
end
from invoices i
where i.invoice_seq_no = p.invoice_no
)
where period_date is null;
then repeat the update with another date, or a more generic November 2015 pattern, etc. But specifying every possible date isn't going to be feasible, so you could regular expressions. There are probably better patterns for this but as an example:
update period p
set period_date = (
select regexp_substr(i.inv_comment, '[[0-3][0-9][-./][0-1][0-9][-./][12]?[901]?[0-9]{2}')
from invoices i
where i.invoice_seq_no = p.invoice_no
)
where period_date is null;
which matches (or attempts to match) anything looking like DD.MM.YYYY, followed by maybe:
update period p
set period_date = (
select regexp_substr(i.inv_comment,
'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|'
|| 'Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)([[:space:]]+[12]?[901]?[0-9]{2})?')
from invoices i
where i.invoice_seq_no = p.invoice_no
)
where period_date is null;
which matches any short or long month name. You may have mixed case though - aug, Aug, AUG - so you might want to use the match parameter to make it case-insensitive. This isn't supposed to be a complete solution though, and you may need further formats. There are some ideas on other questions.
You may really want actual dates, which means breaking down a bit more, and then assuming missing years - perhaps taking the year from another column (order date?) if it isn't available in the comments, though that gets a bit messy around year-end. But you can essentially do the same thing, just passing each extracted value through to_date() with a format mask matching the search expression you're using.
There will always be mistakes, typos, odd formatting etc., so even if this approach identified most patterns, you'll probably end up with some that are left blank, and will need to be set manually by a human looking at the comments; and some that are just wrong. But this is why dates shouldn't be stored as strings at all - having them mixed in with other text is just making things even worse.
Here you're dealing with strings containing disparate date information. Several string operations may be needed.

Is it possible to return part of a field from the last row entered into a table

I am proposing to have a table (the design isn't settled on yet and can be altered dependent upon the views expressed in reply to this question) that will have a primary key of type int (using auto increment) and a field (ReturnPeriod of type Nchar) that will contain data in the form of '06 2013' (representing in this instance June 2013).
I would simply like to return 06 or whatever happens to be in the last record entered in the table. This table will never grow by more than 4 records per annum (so it will never be that big). It also has a column indicating the date that the last entry was created.
That column seems to my mind at least to be the most suitable candidate for getting the last record, so essentially I'd like to know if sql has a inbuilt function for comparing the date the query is run to the nearest match in a column, and to return the first two characters of a field.
So far I have:
Select Mid(ReturnPeriod,1,2) from Returns
Where DateReturnEntered = <and this is where I'm stuck>
What I'm looking for is a where clause that would get me the last entered record using the date the query is run as its reference point(DateRetunEntered of type Date contains the date a record was entered).
Of course there may be an even easier way to guarantee that one has the last record in which case I'm open to suggestions.
Thanks
I think you should store ReturnPeriod as a datetime for example not 06 2013 as a VARCHAR but 01.06.2013 as a DATETIME (first day of 06.2013).
In this case, if I've got your question right, you can use GETDATE() to get current time:
SELECT TOP 1 MONTH(ReturnPeriod)
FROM Returns
WHERE DateReturnEntered<=GETDATE()
ORDER BY DateReturnEntered DESC
If you store ReturnPeriod as a varchar then
SELECT TOP 1 LEFT(ReturnPeriod,2)
FROM Returns
WHERE DateReturnEntered<=GETDATE()
ORDER BY DateReturnEntered DESC
I would store your ReturnPeriod as a date datatype, using a nominal 1st of the month, e.g. 1 Jun 2013, if you don't have the actual date.
This will allow direct comparison against your entered date, with trivial formatting of the return value if required.
Your query would then find the latest date prior to your date entered.
SELECT MONTH(MAX(ReturnPeriod)) AS ReturnMonth
FROM Returns
WHERE ReturnPeriod <= #DateReturnEntered

How can I query just the month and day of a DATE column?

I have a date of birth DATE column in a customer table with ~13 million rows. I would like to query this table to find all customers who were born on a certain month and day of that month, but any year.
Can I do this by casting the date into a char and doing a subscript query on the cast, or should I create an aditional char column, update it to hold just the month and day, or create three new integer columns to hold month, day and year, respectively?
This will be a very frequently used query criteria...
EDIT:... and the table has ~13 million rows.
Can you please provide an example of your best solution?
If it will be frequently used, consider a 'functional index'. Searching on that term at the Informix 11.70 InfoCentre produces a number of relevant hits.
You can use:
WHERE MONTH(date_col) = 12 AND DAY(date_col) = 25;
You can also play games such as:
WHERE MONTH(date_col) * 100 + DAY(date_col) = 1225;
This might be more suitable for a functional index, but isn't as clear for everyday use. You could easily write a stored procedure too:
Note that in the absence of a functional index, invoking functions on a column in the criterion means that an index is unlikely to be used.
CREATE FUNCTION mmdd(date_val DATE DEFAULT TODAY) RETURNING SMALLINT AS mmdd;
RETURN MONTH(date_val) * 100 + DAY(date_val);
END FUNCTION;
And use it as:
WHERE mmdd(date_col) = 1225;
Depending on how frequently you do this and how fast it needs to run you might think about splitting the date column into day, month and year columns. This would make search faster but cause all sorts of other problems when you want to retrieve a whole date (and also problems in validating that it is a date) - not a great idea.
Assuming speed isn't a probem I would do something like:
select *
FROM Table
WHERE Month(*DateOfBirthColumn*) = *SomeMonth* AND DAY(*DateOfBirthColumn*) = *SomeDay*
I don't have informix in front of me at the moment but I think the syntax is right.

adding months to a date SQL

I am trying to add months to an existing date in SQL. The new column displayed will have a followup column instead of a days column. Im getting an error in the select statement.can u help?
Create table auctions(
item varchar2(50),
datebought date,
datesold date,
days number
);
Insert into auctions values (‘Radio’,’12-MAY-2001’,’21-MAY-2001’,9);
Select item,datebought,datesold,ADD MONTHS(datesold,3)”followup” from auctions;
Your usage of the add_months() function is incorrect. It's not two words, it's just one (with an underscore)
add_months(datesold, 1)
note the underscore _ between ADD and MONTHS. It's function call not an operator.
Alternatively you could use:
datesold + INTERVAL '1' month
Although it's worth noting that the arithmetics with intervals is limited (if not broken) because it simply "increments" the month value of the date value. That can lead to invalid dates (e.g. from January to February). Although this is documented behaviour (see below links) I consider this a bug (the SQL standard requires those operations to "Arithmetic obey the natural rules associated with dates and times and yield valid datetime or interval results according to the Gregorian calendar")
See the manual for details:
http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions011.htm#i76717
http://docs.oracle.com/cd/E11882_01/server.112/e26088/sql_elements001.htm#i48042
Another thing:
I am trying to add months to an existing date in SQL.
Then why are you using an INSERT statement? To change the data of existing rows you should use UPDATE. So it seems what you are really after is something like this:
update auctions
set datesold = add_months(datesold, 1)
where item = 'Radio';
Your SQL has typographical quotation marks, not standard ones. E.g. ’ is not the same as '. Instead of delimiting a string value, those quotes become part of the value, at least for the particular SQL I have here to test with.
If this doesn't fix your problem, try posting the error you're getting in your question. Magical debugging isn't possible.
This can be used to add months to a date in SQL:
select DATEADD(mm,1,getdate())
This might be a useful link.