Does the precision of a datetime field change when used with a bigint field to form a composite key or unique index?

I need to use a composite key or unique index comprised of a bigint data type and a datetime data type. However, I've noticed that the seconds element of the datetime has been rounded to the nearest minute and causes a duplicate key violation when trying to import a dataset.
The dataset is essentially transactional data, so our ID (stored in the bigint field) will be repeated, hence the need for inclusion of the datetime field in the unique key.
To give an example: the following two rows cause a 'duplicate key row' error:
ID field (bigint) | ActionDate (datetime)
------------------|------------------------
1050000284002     | 2016-01-08 15:51:24.000
1050000284002     | 2016-01-08 15:50:35.000
The values are clearly different (and are stored correctly in the database) but the error shows:
The duplicate key value is (1050000284002, Jan 8 2016 3:51PM).
(It's worth adding that I initially created a composite key and have since replaced it with a unique index; the error outlined above was generated with the index in place.)
My questions are:
Is my datetime field being rounded because I'm using an integer in the key/index?
Is there another reason that I lose the accuracy of the time component of the datetime field?
How can I rectify the issue so that the example wouldn't result in a key violation?

If you are using an index of the form (index on columns A,B) and not a formulaic one (index on columns A + B), then no, the datatype of one column will have no effect on the contents of the other.
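To illustrate the distinction, a rough sketch (the table and column names here are made up):
-- a plain composite index: each column keeps its own type and full precision
create unique index UX_Facts_ID_ActionDate on dbo.Facts (ID, ActionDate);
-- a "formulaic" index would instead be built over a single combined expression,
-- e.g. a hypothetical computed column mixing the two values together
alter table dbo.Facts add IdPlusDay as (ID + datediff(day, 0, ActionDate));
create index IX_Facts_IdPlusDay on dbo.Facts (IdPlusDay);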
Based on your description, I'd check the following:
The actual datatype of the datetime column. Is it datetime? (datetime is only accurate to increments of .000, .003 or .007 seconds, roughly 1/300 of a second, though that isn't the issue here.)
The actual definition of the index. Is it defined the way you think it is? Perhaps it is indexing on an expression like CAST(DateTimeColumn AS date)?
The actual data being stored. Is whatever is loading the data into the table perhaps truncating the seconds?
Further suggestions based on your edit:
If the data you are importing clearly contains unique datetime values, yet SQL Server is not seeing them as unique, then something's up with the data import process.
Try loading your data into the table without the index in place. Does it load? Does it match your source data to the millisecond? Now, with the data loaded, create the index (primary key, unique constraint, whatever). Does it fail? Where is the duplicate data coming from? In short, experiment with the data and the loading process and see what falls out, for example along the lines of the sketch below.
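A rough sketch of that kind of check (the table and column names here are assumed from your description):
-- with no unique index in place, look for rows that are true duplicates
select ID, ActionDate, count(*) as cnt
from dbo.Transactions
group by ID, ActionDate
having count(*) > 1;
-- compare with duplicates at minute precision, which is what a truncating or
-- rounding import would produce (smalldatetime rounds to the nearest minute)
select ID, cast(ActionDate as smalldatetime) as ActionMinute, count(*) as cnt
from dbo.Transactions
group by ID, cast(ActionDate as smalldatetime)
having count(*) > 1;
-- if the first query returns nothing, the unique index should create cleanly
create unique index UX_Transactions_ID_ActionDate
    on dbo.Transactions (ID, ActionDate);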

When I do this:
declare @i bigint
set @i = 25
select @i + getdate()
select cast(@i + getdate() as int)
I get:
2016-10-17 09:47:05.753
42658
Notice the first result is the date plus 25 days, so there is no drop-off in the integer part of the date (otherwise it would be midnight on that date...).
The second select drops everything after the decimal place....


Efficient storage pattern for millions of values of different types

I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. The plan is to use Postgres, but the question applies equally to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall will now, through a background job system, have statistics calculated. For example, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)
These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure which storage pattern for this type of calculated data will be the most efficient. I'm not looking to fully normalize everything in the database, though. So far, I have come up with the following options:
Option 1 – long format in one column
I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value will be serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)
I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics according to phone call and the statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient in terms of handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
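(To make this concrete, here is a rough sketch of the kind of SQL-level calculation I mean; the table name statistics is assumed, and statistic_value is assumed to hold a bare JSON number such as 12.5:)
select avg((statistic_value::jsonb #>> '{}')::float8) as avg_call_duration
from statistics
where statistic_name = 'call_duration'
  and statistic_version = 1;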
Option 2 – statistics as attributes of main object
I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:
id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly
...
This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics, or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easy, since the names are baked into the schema.
Option 3 – statistics as different columns
This would probably be the most complex. I am storing only a reference to the statistic type, and the column will be looked up according to that:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string etc.)
statistic_version (integer)
created_at (datetime)
This would mean that the table is going to be very sparse, as only one of the statistic_value_ columns would be populated. Could that lead to performance issues?
Option 4 – normalized form
Trying to normalize option 3, I would create two tables:
statistics
id (PK)
version
created_at
statistic_mapping
phone_call_id (FK)
statistic_id (FK)
statistic_type_mapping
statistic_id (FK)
type (string, indicates bool, string etc.)
statistic_values_boolean
statistic_id (FK)
value (bool)
…
But this isn't going anywhere, since I can't dynamically join to another table name, can I? Or should I just join to all statistic_values_* tables based on the statistic ID? My application would then have to make sure that no duplicate entries exist.
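(For illustration, the non-dynamic version I have in mind would look roughly like this, with statistic_values_float and statistic_values_string assumed as analogues of the boolean table:)
select s.id,
       s.version,
       m.phone_call_id,
       t.type,
       b.value as boolean_value,
       f.value as float_value,
       v.value as string_value
from statistics s
join statistic_mapping m on m.statistic_id = s.id
join statistic_type_mapping t on t.statistic_id = s.id
left join statistic_values_boolean b on b.statistic_id = s.id
left join statistic_values_float f on f.statistic_id = s.id
left join statistic_values_string v on v.statistic_id = s.id;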
To summarize: given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when statistic types may be added or changed, several versions exist at the same time, and querying of the values should be reasonably efficient?
IMO you can use the following simple database structure to solve your problem.
Statistics type dictionary
A very simple table - just name and description of the stat. type:
create table stat_types (
type text not null constraint stat_types_pkey primary key,
description text
);
(You can replace it with enum if you have a finite number of elements)
Stat table for every type of objects in the project
It contains an FK to the object, an FK to the stat. type (or just an enum) and, importantly, a jsonb field with arbitrary stat. data related to its type. For example, such a table for phone calls:
create table phone_calls_statistics (
phone_call_id uuid not null references phone_calls,
stat_type text not null references stat_types,
data jsonb,
constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that the phone_calls table has a uuid PK:
create table phone_calls (
id uuid not null constraint phone_calls_pkey primary key
-- ...
);
The data field has a different structure which depends on its stat. type. Example for call duration:
{
"call_duration": 120.0
}
or for dropouts:
{
"dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
avg((pcs.data ->> 'call_duration')::float) as avg,
min((pcs.data ->> 'call_duration')::float) as min,
max((pcs.data ->> 'call_duration')::float) as max
from
phone_calls_statistics pcs
where
pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hung ups:
select
sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
phone_calls_statistics pcs
where
pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, has good performance potential and perfect scalability. The main table has a simple index; all queries will perform inside it. You can always extend the number of stat. types and their calculations.
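If a particular aggregate becomes hot, you could also add a partial expression index for it (an optional refinement, not part of the fiddle below):
create index phone_calls_statistics_call_duration_idx
    on phone_calls_statistics (((data ->> 'call_duration')::float))
    where stat_type = 'CALL_DURATION';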
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0

Calculate date and time key in fact table using existing date time field

I have a date time field in a fact table in the format MM/DD/YY HH:MM:SS (e.g. 2/24/2009 11:18:47 AM), and I have separate date and time dimension tables. What I would like to ask is how I can create a date key and a time key in the fact table from the date time field, so that I can join to the date and time dimensions.
There are a lot of references for creating separate date and time dimensions and their benefits, but I could not find how to create the date and time keys in the fact table using an existing date time field.
I have also heard that having a date time field in the fact table has certain benefits. If so, what would you recommend: should I have all three (date key, time key and date time field) in the fact table? The date key and time key are a must for me, and I am concerned about fact table size if I also keep the date time field in the fact table.
Thank you all for any help you can give.
What you need to do (if I understand correctly) is to create two fields in your fact table: kDate and kTime.
We would always suggest making the primary keys of DimDate and DimTime meaningful (this is a special case; normally a Dim table's primary key doesn't carry any meaning). So, e.g., in DimDate we would have kDate as the primary key with values formatted as YYYYMMDD, so that ordering by kDate puts rows in date order. Then have the DimTime table with a kTime primary key in the form HHMM or HHMMSS (depending on the resolution you need).
It is best to keep the actual date time field on the fact table as well, as it allows SQL to use its built-in date/time functions to do subsetting. But if you extend your Dim tables with useful extra columns, DimDate (add DayOfWeek, IsHoliday, DayNumber, MonthNumber, YearNumber, etc.) and DimTime (HourNumber, MinuteNumber, IsWorkingTime), then you can perform very interesting queries very simply.
So, to answer your question, "how to create date and time keys in the fact table using an existing date time field?": as you load the data into the fact table, use the built-in date/time functions to derive the separate date key and time key fields, for example as in the sketch below.
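A minimal T-SQL sketch of that derivation (the staging table and column names are just placeholders):
select
    cast(convert(char(8), s.EventDateTime, 112) as int) as kDate,  -- YYYYMMDD
    datepart(hour, s.EventDateTime) * 10000
      + datepart(minute, s.EventDateTime) * 100
      + datepart(second, s.EventDateTime) as kTime,                -- HHMMSS
    s.EventDateTime                                                -- optionally kept as well
from staging.FactLoad as s;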
Whether this approach produces a lot of data depends very much on how many rows you expect in your fact table, but it is the easiest to work with from a data warehouse point of view.
best of luck!

DateTime as part of PK in FACT Table for warehouses

I know it's generally not a good idea to have a DateTime column as your PK; however, for my situation I believe it makes more sense than having a surrogate key in the fact table.
The reasons are...
The data inserted into the fact table is always sequential, i.e. I would never insert a date time value that is older than the last value already in the fact table.
The date time field is not the only column of the PK (it is a composite PK); the PK is the date time column together with the dimensions' surrogate keys (the FKs).
The way I'll be querying the data is nearly always based on time.
A surrogate key on the Fact table would tell me nothing about the row. Each row is already unique and to find that particular fact I would always filter on the Date time first and the values in the dimensions.
There is no separate datetime dimension table. No requirement now or in the foreseeable future to have named points in time etc.
Side notes - time will be in UTC and using SQL 2008 R2.
What I'm asking is: given the situation, what are the disadvantages of doing this? Will I come up against unforeseen issues?
Is this actually a good thing to be doing when querying the data back later?
I'd like to know people's viewpoints on a DateTime field as the first column of a composite PK.
It's almost an essential feature of any data warehouse that date/time is a component of a key in most tables. There's nothing "wrong" with that.
A surrogate key generally shouldn't be the only key of a table, so perhaps your question is really "Should I create a surrogate key on my table as well?". My suggestion is that if you don't have a reason to create a surrogate key then don't. The time to create a surrogate is when you find that you need it.
Most fact tables have composite keys, and a date-time, or more often a DateKey and TimeKey, is part of it. Actually, it's quite common.
The dimDate and dimTime are simply used to avoid having "funny" date-time functions in the WHERE clause of a query. For example
-- sales on
-- weekends for previous 28 weeks
--
select sum(f.SaleAmount)
from factSale as f
join dimDate as d on d.DateKey = f.DateKey
where d.IsWeekend = 'yes'
and d.WeeksAgo between 1 and 28 ;
So here I can have indexes on IsWeekend and WeeksAgo (DateKey too). If these were replaced by date-time functions on the fact table's own datetime column, this would cause row-by-row processing; for contrast, see the sketch below.
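A rough sketch of the "funny functions" version the dimensions let you avoid (the fact column name SaleDateTime is assumed); wrapping the column in expressions like these defeats any index on it:
select sum(f.SaleAmount)
from factSale as f
where datepart(weekday, f.SaleDateTime) in (1, 7)                  -- weekend, with the default DATEFIRST
  and datediff(week, f.SaleDateTime, getdate()) between 1 and 28;  -- previous 28 weeks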

MySQL. Working with Integer Data Interval

I've just started using SQL, so I have no idea how to work with non-standard data types.
I'm working with MySQL...
Say, there are 2 tables: Stats and Common. The Common table looks like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
Stats_id ??????????????????????,
UNIQUE(Mutation, Deletion) );
Instead of the ? symbols there must be some type that references the Stats table (Stats.id).
The problem is that this type must make it possible to save data in a format such as 1..30 (an interval between 1 and 30). My idea was to use such a type to shorten the Common table.
Is it possible to do this, or are there any other ideas?
Assuming that Stats.id is an INTEGER (if not, change the below items as appropriate):
first_stats_id INTEGER NOT NULL REFERENCES Stats(id)
last_stats_id INTEGER NOT NULL REFERENCES Stats(id)
Given that your table contains two VARCHAR fields and a unique index over them, having an additional integer field is the least of your concerns as far as memory usage goes (seriously, one integer field represents a mere 1 GB of memory for 262 million rows).
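Putting it together, a rough sketch of the Common table (assuming Stats.id is INT):
CREATE TABLE Common (
  Mutation VARCHAR(10) NOT NULL,
  Deletion VARCHAR(10) NOT NULL,
  first_stats_id INT NOT NULL,
  last_stats_id INT NOT NULL,
  UNIQUE (Mutation, Deletion),
  FOREIGN KEY (first_stats_id) REFERENCES Stats(id),
  FOREIGN KEY (last_stats_id) REFERENCES Stats(id)
);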

Linked List in SQL

What's the best way to store a linked list in a MySQL database so that inserts are simple (i.e. you don't have to re-index a bunch of stuff every time) and such that the list can easily be pulled out in order?
Use Adrian's solution, but instead of incrementing by 1, increment by 10 or even 100. Then an inserted row's position can be calculated as the midpoint of the two positions you're inserting between, without having to update everything below the insertion. Pick a gap large enough to handle your average number of insertions; if it's too small, you'll have to fall back to updating all rows with a higher position during an insertion.
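For example, a rough sketch using the column names from the position-based answer below:
-- rows were inserted with positions 100, 200, 300, ...
-- to insert between the rows at 100 and 200, take the midpoint instead of renumbering:
insert into linked_list (my_value, position) values ("new value", 150);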
Create a table with two self-referencing columns, PreviousID and NextID. If the item is the first thing in the list, PreviousID will be null; if it is the last, NextID will be null. The SQL will look something like this:
create table tblDummy
(
PKColumn int not null primary key,
PreviousID int null,
DataColumn1 varchar(50) not null,
DataColumn2 varchar(50) not null,
DataColumn3 varchar(50) not null,
DataColumn4 varchar(50) not null,
DataColumn5 varchar(50) not null,
DataColumn6 varchar(50) not null,
DataColumn7 varchar(50) not null,
NextID int null
)
Store an integer column in your table called 'position'. Record a 0 for the first item in your list, a 1 for the second item, etc. Index that column in your database, and when you want to pull your values out, sort by that column.
alter table linked_list add column position integer not null default 0;
alter table linked_list add index position_index (position);
select * from linked_list order by position;
To insert a value at index 3, modify the positions of rows 3 and above, and then insert:
update linked_list set position = position + 1 where position >= 3;
insert into linked_list (my_value, position) values ("new value", 3);
A linked list can be stored using recursive pointers in the table. This is very much the same way hierarchies are stored in SQL, using the recursive association pattern.
You can learn more about it here (Wayback Machine link).
I hope this helps.
The simplest option would be creating a table with a row per list item, a column for the item position, and columns for other data in the item. Then you can use ORDER BY on the position column to retrieve in the desired order.
create table linked_list
( list_id integer not null
, position integer not null
, data varchar(100) not null
);
alter table linked_list add primary key ( list_id, position );
To manipulate the list just update the position and then insert/delete records as needed. So to insert an item into list 1 at index 3:
begin transaction;
update linked_list set position = position + 1 where position >= 3 and list_id = 1;
insert into linked_list (list_id, position, data)
values (1, 3, "some data");
commit;
Since operations on the list can require multiple commands (eg an insert will require an INSERT and an UPDATE), ensure you always perform the commands within a transaction.
A variation of this simple option is to have position incrementing by some factor for each item, say 100, so that when you perform an INSERT you don't always need to renumber the position of the following elements. However, this requires a little more effort to work out when to increment the following elements, so you lose simplicity but gain performance if you will have many inserts.
Depending on your requirements other options might appeal, such as:
If you want to perform lots of manipulations on the list and not many retrievals, you may prefer to have an ID column pointing to the next item in the list instead of using a position column. You then need iterative logic when reading the list in order to get the items back in order; this can be implemented relatively easily in a stored proc (or with a recursive CTE, as sketched after this list).
If you have many lists, a quick way to serialise and deserialise your list to text/binary, and you only ever want to store and retrieve the entire list, then store the entire list as a single value in a single column. Probably not what you're asking for here though.
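For the next-pointer variation, a recursive CTE (MySQL 8.0+; SQL Server uses the same idea without the recursive keyword) can walk the chain in order. A minimal sketch, assuming columns id, next_id and data, where the head row is the one no other row points to:
with recursive ordered_list as (
    select l.id, l.data, l.next_id, 1 as seq
    from linked_list l
    where not exists (select 1 from linked_list p where p.next_id = l.id)  -- head of the list
    union all
    select l.id, l.data, l.next_id, o.seq + 1
    from linked_list l
    join ordered_list o on l.id = o.next_id
)
select id, data from ordered_list order by seq;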
This is something I've been trying to figure out for a while myself. The best way I've found so far is to create a single table for the linked list using the following format (this is pseudo code):
LinkedList(
key1,
information,
key2
)
key1 is the starting point. key2 is a foreign key linking to the key1 of the next row in the same table. So your rows will link together something like this:
row 1
key1 = 0,
information = 'hello'
key2 = 1
key1 is the primary key of row 1. key2 is a foreign key leading to the key1 of row 2.
row 2
key1 = 1,
information = 'wassup'
key2 = null
key2 of row 2 is set to null because it doesn't point to anything.
When you first insert a row into the table, you'll need to make sure key2 is set to null or you'll get an error. After you insert the second row, you can go back and set key2 of the first row to the primary key of the second row.
This makes it the best method for entering many entries at a time: enter them all, then go back and set the foreign keys accordingly (or build a GUI that does that for you).
Here's some actual code I've prepared (all actual code worked on MSSQL. You may want to do some research for the version of SQL you are using!):
createtable.sql
create table linkedlist00 (
key1 int primary key not null identity(1,1),
info varchar(10),
key2 int
)
register_foreign_key.sql
alter table dbo.linkedlist00
add foreign key (key2) references dbo.linkedlist00(key1)
*I put them into two separate files, because it has to be done in two steps. MSSQL won't let you do it in one step, because the table doesn't exist yet for the foreign key to reference.
A linked list is especially powerful for one-to-many relationships. Ever wanted to make an array of foreign keys? Well, this is one way to do it! You can make a primary table that points to the first row in the linked-list table, and then, instead of the "information" field, you can use a foreign key to the desired information table.
Example:
Let's say you have a Bureaucracy that keeps forms.
Let's say they have a table called file cabinet
FileCabinet(
Cabinet ID (pk)
Files ID (fk)
)
Each row contains a primary key for the cabinet and a foreign key for the files. These files could be tax forms, health insurance papers, field trip permission slips, etc.
Files(
Files ID (pk)
File ID (fk)
Next File ID (fk)
)
this serves as a container for the Files
File(
File ID (pk)
Information on the file
)
this is the specific file
There may be better ways to do this and there are, depending on your specific needs. The example just illustrates possible usage.
There are a few approaches I can think of right off, each with differing levels of complexity and flexibility. I'm assuming your goal is to preserve an order in retrieval, rather than requiring storage as an actual linked list.
The simplest method would be to assign an ordinal value to each record in the table (e.g. 1, 2, 3, ...). Then, when you retrieve the records, specify an order-by on the ordinal column to get them back in order.
This approach also allows you to retrieve the records without regard to membership in a list, but allows for membership in only one list, and may require an additional "list id" column to indicate to which list the record belongs.
A slightly more elaborate, but also more flexible, approach would be to store information about membership in a list or lists in a separate table. The table would need three columns: the list id, the ordinal value, and a foreign key pointer to the data record. Under this approach, the underlying data knows nothing about its membership in lists and can easily be included in multiple lists.
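A rough sketch of such a membership table (the data_records table and all names here are illustrative):
create table list_membership (
    list_id integer not null,
    ordinal integer not null,
    record_id integer not null references data_records (id),
    primary key (list_id, ordinal)
);
select r.*
from list_membership m
join data_records r on r.id = m.record_id
where m.list_id = 1
order by m.ordinal;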
This post is old, but I'm still going to give my $.02. Updating every record in a table or record set to solve ordering sounds crazy, and so does the amount of indexing, but it sounds like most have accepted it.
The solution I came up with to reduce updates and indexing is to create two tables (in most use cases you don't sort all records in just one table anyway): table A holds the records of the list being sorted, and table B groups them and holds a record of the order as a string. The order string represents an array that can be used to order the selected records either on the web server or in the browser layer of a web application.
Create Table A (
Id int primary key identity(1,1),
Data varchar(10) not null,
B_Id int
)
Create Table B (
Id int primary key identity(1,1),
GroupName varchar(10) not null,
[Order] varchar(max) null
)
The format of the order string should be the id, the position, and some separator to split() your string by. In the case of jQuery UI, the .sortable('serialize') function outputs a POST-friendly order string for you that includes the id and position of each record in the list.
The real magic is the way you choose to reorder the selected list using the saved ordering string; this will depend on the application you are building. Here is an example, again using jQuery, of reordering the list of items: http://ovisdevelopment.com/oramincite/?p=155
https://dba.stackexchange.com/questions/46238/linked-list-in-sql-and-trees suggests a trick of using floating-point position column for fast inserts and ordering.
It also mentions SQL Server's specialized hierarchyid feature.
I think it's much simpler to add a created column of datetime type and a position column of int. Now you can have duplicate positions; in the select statement, use order by position, created desc and your list will be fetched in order.
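A minimal sketch of that approach (column and index names assumed, building on the earlier linked_list example):
alter table linked_list add column created datetime not null default current_timestamp;
alter table linked_list add index position_created_index (position, created);
select * from linked_list order by position, created desc;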
Increment the SERIAL 'index' by 100, but manually add intermediate values with an 'index' equal to (Prev + Next) / 2. If you ever exhaust a gap of 100, renumber the indexes back to multiples of 100.
This should maintain sequence with primary index.
A list can be stored by having a column contain the offset (list index position); an insert in the middle then means incrementing the offsets of all rows above the new item and then doing the insert.
You could implement it like a double-ended queue (deque) to support fast push/pop/delete (if the ordinal is known) and retrieval. You would have two data structures: one with the actual data and another with the number of elements ever added over the history of the key. Tradeoff: this method would be slower for any insert into the middle of the linked list, O(n).
create table queue (
primary_key int auto_increment primary key,
queue_key int not null,
ordinal int not null,
data varchar(100)
)
You would have an index on (queue_key, ordinal).
You would also have another table which stores the number of rows EVER added to the queue...
create table queue_addcount (
primary_key int primary key,
add_count int not null
)
When pushing a new item to either end of the queue (left or right) you would always increment the add_count.
If you push to the back you could set the ordinal...
ordinal = add_count + 1
If you push to the front you could set the ordinal...
ordinal = -(add_count + 1)
and in either case then update:
add_count = add_count + 1
This way you can delete anywhere in the queue/list and it would still return in order and you could also continue to push new items maintaining the order.
You could optionally rewrite the ordinal to avoid overflow if a lot of deletes have occurred.
You could also have an index on the ordinal to support fast ordered retrieval of the list.
If you want to support inserts into the middle, you would need to find the ordinal at which the item needs to be inserted and insert with that ordinal, then increment every ordinal after that insertion point by one. Also increment the add_count as usual. If the ordinal is negative you could instead decrement all of the earlier ordinals to do fewer updates. This would be O(n).
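For completeness, a rough sketch of a push to the back under this scheme (MySQL syntax; it assumes queue_addcount.primary_key identifies the queue and matches queue.queue_key, here queue 1):
start transaction;
-- bump the lifetime add counter for queue 1
update queue_addcount set add_count = add_count + 1 where primary_key = 1;
-- push to the back: ordinal = the new add_count (a push to the front would use its negation)
insert into queue (queue_key, ordinal, data)
select 1, add_count, 'new tail item'
from queue_addcount
where primary_key = 1;
commit;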