Efficient storage pattern for millions of values of different types - sql

I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. It is planned to use Postgres, but the question equally applies to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall will now, through a background job system, have statistics calculated. For example, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)
These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure which storage pattern for this type of calculated data will be the most efficient. I'm not looking to fully normalize everything in the database, though. So far, I have come up with the following options:
Option 1 – long format in one column
I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value will be serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)
I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics according to phone call and the statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient when handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
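For concreteness, a sketch of this layout in Postgres, assuming the value column is jsonb rather than plain text; scalar values could then still be extracted and aggregated at the SQL level:
create table statistics (
    statistic_id      bigserial primary key,
    phone_call_id     bigint  not null,   -- references phone_calls(id)
    statistic_name    text    not null,
    statistic_value   jsonb   not null,   -- scalar, array or object, depending on the statistic
    statistic_version integer not null,
    created_at        timestamptz not null default now()
);

-- scalar values can be aggregated by extracting them as text and casting
select avg((statistic_value #>> '{}')::float) as avg_duration
from statistics
where statistic_name = 'call_duration';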
Option 2 – statistics as attributes of main object
I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:
id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly
...
This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics, or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easy, since the names are baked into the schema.
Option 3 – statistics as different columns
This would probably be the most complex. I store only a reference to the statistic type, and the value column to read is looked up according to that:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string etc.)
statistic_version (integer)
created_at (datetime)
This would mean that the table is going to be very sparse, as only one of the statistic_value_ columns would be populated. Could that lead to performance issues?
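For concreteness, a sketch of this option in Postgres; the CHECK constraint (using num_nonnulls, available since Postgres 9.6) is my addition to ensure exactly one value column is populated:
create table statistics (
    statistic_id            bigserial primary key,
    phone_call_id           bigint  not null,   -- references phone_calls(id)
    statistic_name          text    not null,
    statistic_value_bool    boolean,
    statistic_value_string  text,
    statistic_value_float   double precision,
    statistic_value_complex jsonb,
    statistic_value_type    text    not null,   -- 'bool', 'string', 'float', 'complex'
    statistic_version       integer not null,
    created_at              timestamptz not null default now(),
    check (num_nonnulls(statistic_value_bool,
                        statistic_value_string,
                        statistic_value_float,
                        statistic_value_complex) = 1)
);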
Option 4 – normalized form
Trying to normalize option 3, I would create two tables:
statistics
id (PK)
version
created_at
statistic_mapping
phone_call_id (FK)
statistic_id (FK)
statistic_type_mapping
statistic_id (FK)
type (string, indicates bool, string etc.)
statistic_values_boolean
statistic_id (FK)
value (bool)
…
But this isn't going anywhere since I can't dynamically join to another table name, can I? Or should I then just join to all statistic_values_* tables based on the statistic ID anyway? My application would then have to make sure that no duplicate entries exist.
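A sketch of how that "join to all statistic_values_* tables" query might look; the float and string value tables are extrapolated from the "…" above:
select m.phone_call_id,
       s.id as statistic_id,
       s.version,
       tm.type,
       coalesce(vb.value::text, vf.value::text, vs.value) as value_as_text
from statistics s
join statistic_mapping m               on m.statistic_id  = s.id
join statistic_type_mapping tm         on tm.statistic_id = s.id
left join statistic_values_boolean vb  on vb.statistic_id = s.id
left join statistic_values_float   vf  on vf.statistic_id = s.id
left join statistic_values_string  vs  on vs.statistic_id = s.id;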
To summarize, given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when the requirement is that statistic types may be added or changed, and that several versions exist at the same time, and that querying of the values should be somewhat efficient?

IMO you can use the following simple database structure to solve your problem.
Statistics type dictionary
A very simple table - just name and description of the stat. type:
create table stat_types (
type text not null constraint stat_types_pkey primary key,
description text
);
(You can replace it with an enum if you have a finite number of statistic types.)
A stats table for every type of object in the project
It contains an FK to the object, an FK to the stat type (or just an enum) and, importantly, a jsonb field with arbitrary stat data whose shape depends on the type. For example, such a table for phone calls:
create table phone_calls_statistics (
phone_call_id uuid not null references phone_calls,
stat_type text not null references stat_types,
data jsonb,
constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that the phone_calls table has a uuid PK:
create table phone_calls (
id uuid not null constraint phone_calls_pkey primary key
-- ...
);
The structure of the data field depends on the stat type. Example for call duration:
{
"call_duration": 120.0
}
or for dropouts:
{
"dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
avg((pcs.data ->> 'call_duration')::float) as avg,
min((pcs.data ->> 'call_duration')::float) as min,
max((pcs.data ->> 'call_duration')::float) as max
from
phone_calls_statistics pcs
where
pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hang-ups:
select
sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
phone_calls_statistics pcs
where
pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, with good performance potential and scalability. The main table has a single composite index (its primary key), and all the queries above filter on it. You can always extend the number of stat types and their calculations.
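If one particular statistic is queried heavily, an expression index on the extracted value could be added as well (my addition; it is not required by the design):
create index phone_calls_statistics_call_duration_idx
    on phone_calls_statistics (((data ->> 'call_duration')::float))
    where stat_type = 'CALL_DURATION';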
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years it will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is only analytics / ad-hoc querying / OLAP, then you can use a star schema, which is well suited for this type of analytics. But beware: OLAP databases are de-normalized and generally not suitable for operational transaction storage / OLTP, in case you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, keeping them very compact even when they hold many records. Compact tables are very fast to read (less I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have a surrogate key, which is a single-column primary key. A single-column primary key is easier to JOIN on than a multi-column one and also easier to index.
There are no NULLs in the fact tables' foreign keys. This makes JOIN operations straightforward, i.e. you always JOIN the fact table to all of its dimension tables. If you need a NULL case, you add it as a special row in your dimension table. For example: if a company is not listed on the stock market, and one of the things you track is stock price, then you enter 0 or NULL for the stock price fact (depending on how you want SUM(), AVG(), etc. to behave later) and add a special row called 'Private company' to your StockSymbols dimension table, using its key as the foreign key in the fact table.
Almost all filtering is done through the dimension tables, which are much smaller than the fact tables. This requires a Date dimension to be able to do date-based queries.
If you can stay in a pure Star schema, then all your JOINs are single-hop (i.e. no join between two tables through another table).
All of this makes JOIN operations very fast, simple and straightforward. That's why the Star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (for example SSAS, SQL Server Analysis Services), which pre-processes the data to make it fast to query, but it involves more learning than a pure star schema and is overkill in your case.
For your example
In a Star schema (a minimal DDL sketch follows this list):
Companies will be a dimension table
You will need a Month dimension table. It's a simplified version of a Date dimension, just for month info. An example of a Date dimension is here:
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (the 15 things you mention) goes into fact tables. The facts must be numeric (because ideally all non-numeric values are kept in dimension tables). This means moving the non-numeric part of a fact into a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then you will need a Currency dimension, storing only the amount in the fact table plus a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct values in a dimension table and add a foreign key to that dimension table inside your fact table (this is called a factless fact table). The only exception is when the cardinality of the dimension and the fact table is very similar; then you can store the non-numeric fact value inside the fact table directly, as there is no benefit in having a dimension table (in fact, a disadvantage).
Also, facts can be grouped by their granularity. For example, you could have a company_monthly_summary fact table and keep more than one fact in that table (all joining to the Company dimension and the Month dimension). How you group fact tables is up to you, but if their granularities are not the same, they should not be grouped, as that will produce sparse fact tables that are harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add indexes for your Dimension tables' most used columns
Add a numeric surrogate key to each dimension. It is usually an auto-increment number, but that's up to you. One exception people prefer for the surrogate key of the Date dimension is the format YYYYMMDD (as an integer). This makes the WHERE clause easier: instead of filtering on the Date column (a DATETIME value), which requires a lookup to find the surrogate keys, you provide the surrogate keys directly because you know the format. Depending on your business domain, you may have other similar surrogate key patterns worth considering. But be aware that if the business domain changes, you will have to update all fact records; a simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be the actual month number (1 for January).
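To make that concrete, here is a minimal DDL sketch of such a star schema; the two facts (revenue, employees) and all names are illustrative assumptions, not taken from the question:
create table dim_company (
    company_key  int not null primary key,   -- surrogate key
    company_name varchar(255) not null
);

create table dim_month (
    month_key int not null primary key,      -- e.g. YYYYMM as an integer
    year      int not null,
    month     int not null
);

create table fact_company_monthly (
    company_key int not null,
    month_key   int not null,
    revenue     decimal(18,2) not null,      -- one column per numeric fact
    employees   int not null,
    primary key (company_key, month_key),
    foreign key (company_key) references dim_company (company_key),
    foreign key (month_key)   references dim_month (month_key)
);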
That being said, 1 million rows in 5 years is easy to query even without a star-schema design (with proper indexing and database maintenance). But if this is part of a larger analytics system, then go with the star schema.
The simplest way: create a table with the company name, the info you need to store, and a column for the year-month.
Ex:
CREATE TABLE tablename (
id int(11) NOT NULL AUTO_INCREMENT,
companyname varchar(255) ,
info1 int(11) NOT NULL,
info2 datetime ,
info3 varchar(255) ,
info4 bool ,
yearmonth datetime,
PRIMARY KEY (id) );
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

SQL a table for each metadata of other tables

Hi, I have various time series, each having a unique timeseries ID. Given an ID, a series looks something like this (obviously with different dates and data, respectively):
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
These time series have different "types". For instance, suppose that some time series are of "WindData" type, some of "SolarData" type and some of "GasData" type. Given a timeseries ID, this will belong to some type. For instance:
IDs 1, 2, 3 could belong to SolarData
IDs 4,5 could belong to Wind Data
ID 6 could belong to GasData.
Time series of the same type (for instance 1, 2, 3) share the same fields of metadata (but not the same values!). For instance, WindData could have the fields:
WindTurbineNumber, WindFarmName, Country
while the SolarData could have fields:
SiteName, SolarPanelType
and the GasData could have:
PipelineNumber, CountryOfOrigin, CountryOfDestination
Now, the issue is that as time grows I could have many many more types. Therefore, I want a way of generalizing this data-metadata structure. How? My idea would be to have:
A table that given a timeseries id it tells me the type of that series (i.e. given 1, it tells SolarData)
A table that given the type, it would give me the column names (and optionally their types)
a table that given the id, it would return the data.
What database structure would I need?
I cannot figure out how I would create a table (or multiple tables) that could tell me, given a series ID, which metadata fields it needs.
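So far I can sketch the first and third tables roughly like this (names and types are placeholders); it is the second one, mapping a type to the metadata fields it needs, that I am stuck on:
create table series_types (
    series_id   int not null primary key,
    series_type varchar(50) not null   -- 'WindData', 'SolarData', 'GasData', ...
);

create table series_data (
    series_id int not null,
    obs_date  date not null,           -- the "datetime" column from the example above
    data      decimal(18,4) not null,
    primary key (series_id, obs_date)
);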
I believe you're not going to find a relational database structure that will really suit your needs here.
Relational databases are designed with a "schema on write" philosophy. We decide what the data we will be getting in the future will look like, then we design a storage structure with that data schema, and then insert data into that schema. Under the right circumstances, this works well, as evidenced by fifty or so years of Boyce-Codd-esque database structures.
It sounds, though, like you want to store your data as you receive it, whatever that shape may be, and then apply a "schema on read" philosophy, extracting the useful bits later, in the form the query requires. That's going to require a NoSQL or NewSQL solution. You could consider any number of appliances to accomplish that, from Hadoop and its related structures like HBase (but not Hive) to CouchDB or Apache Cassandra.
The general idea goes as below. You need a series-kind table, a "father" series table, and some child series tables.
create table dbo.Seriekind
(
Id int not null primary key
,Description varchar(50) not null
,ListOfColumns varchar(500) not null
)
create table dbo.Series
(
Id int not null identity primary key
,TimeStamp datetime not null
,SerieKindId int not null
)
create table dbo.SolarData
(
Id int not null primary key identity
,SerieId int not null
,SiteName varchar(100)        -- metadata column types are assumed for illustration
,SolarPanelType varchar(50)
)
create table dbo.WindData
(
Id int not null primary key identity
,SerieId int not null
,WindTurbineNumber int
,WindFarmName varchar(100)
,Country varchar(50)
)
create table dbo.GasData
(
Id int not null primary key identity
,SerieId int not null
,PipelineNumber int
,CountryOfOrigin varchar(50)
,CountryOfDestination varchar(50)
)
One "disvantage" you do needs a new table for any new kind of data. FK are trivial.
Edit
As Eric explained, a SQL structure is not that flexible. It's great for describing data relations and is really efficient at storing and fetching large chunks of data, not to mention its capabilities in some kinds of processing.
A better solution may be a hybrid one: storing the metadata in a flexible format like JSON inside the Series table, or even using a NoSQL solution or a mix of SQL and NoSQL.
The main thing here is how many kinds of series you need and how often a new one can come in. A dozen: SQL; a thousand: NoSQL.
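For instance, a sketch of that hybrid variant in the same style as the tables above; the table name, the Metadata column and the sample JSON are illustrative assumptions (SQL Server would store the JSON in nvarchar(max) and query it with JSON_VALUE, while Postgres/MySQL offer native JSON column types):
create table dbo.SeriesWithMetadata
(
Id int identity(1,1) not null primary key
,TimeStamp datetime not null
,SerieKindId int not null
,Metadata nvarchar(max) null -- e.g. '{"SiteName": "Plant A", "SolarPanelType": "mono"}'
)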

Database Table Design Issues

I am new to DB Design and I've recently inherited the responsibility of adding some new attributes to an existing design.
Below is a sample of the current table in question:
Submission Table:
ID (int)
Subject (text)
Processed (bit)
SubmissionDate (datetime)
Submitted (bit)
...
The new requirements are:
A Submission can be marked as valid or invalid
A Reason must be provided when a Submission is marked as invalid. (So a submission may have an InvalidReason)
Submissions can be associated with one another such that: Multiple valid Submissions can be set as "replacements" for an invalid Submission.
So I've currently taken the easy solution and simply added new attributes directly to the Submission Table so that it looks like this:
NEW Submission Table:
ID (int)
Subject (text)
Processed (bit)
SubmissionDate (datetime)
Submitted (bit)
...
IsValid (bit)
InvalidReason (text)
ReplacedSubmissionID (int)
Everything works fine this way, but it just seems a little strange:
Having InvalidReason as a column that will be NULL for majority of submissions.
Having ReplacedSubmissionID as a column that will be NULL for majority of submissions.
If I understand normalization right, InvalidReason might be transitively dependent on the IsValid bit.
It just seems like somehow some of these attributes should be extracted to a separate table, but I don't see how to create that design with these requirements.
Is this single table design okay? Anyone have better alternative ideas?
Whether or not you should have a single table design really depends on
1) How you will be querying the data
2) How much data would end up being potentially NULL in the resulting table.
In your case it's probably OK, but again it depends on #1. If you will be querying separately to get information on invalid submissions, you may want to create a separate table that references the Id of invalid submissions and the reason:
New table: InvalidSubmissionInfo
Id (int) (of invalid submissions; will have FK constraint on Submission table)
InvalidReason (string)
Additionally, if you will be querying for replaced submissions separately, you may want to have a table just for those (a DDL sketch of both tables follows below):
New table: ReplacementSubmissions
Id (int) (of the replacement submissions; will have FK constraint on Submission table)
ReplacedSubmissionId (int) (of what got replaced; will have FK constraint on Submission table)
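Expressed as DDL, those two tables might look like this (a sketch; column types follow the Submission table in the question):
create table InvalidSubmissionInfo (
    Id int not null primary key
        references Submission (ID),   -- the invalid submission
    InvalidReason nvarchar(max) not null
);

create table ReplacementSubmissions (
    Id int not null primary key
        references Submission (ID),   -- the replacement submission
    ReplacedSubmissionId int not null
        references Submission (ID)    -- the invalid submission it replaces
);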
To get the rest of the info you will still have to join with the Submissions table.
All this to say you do not need to separate this out into multiple tables. Having a NULL value only takes up 1 bit of memory, which isn't bad. And if you need to query and return an entire Submission record each time, it makes more sense to condense this info into one table.
Single table design looks good to me and it should work in your case.
If you do not like NULLs, you can default InvalidReason to an empty string and ReplacedSubmissionID to 0. Default values are always preferable in database design.
Having an empty string or default value will make your data look cleaner.
Please remember if you add default values, you might need to change queries to get proper results.
For example:-
Getting submissions which have not been replaced:
Select * from tblSubmission where ReplacedSubmissionID = 0
Don't fear joins. Looking for ways to place everything in a single table is at best a complete waste of time, at worst results in a convoluted, unmaintainable mess.
You are correct about InvalidReason and IsValid. However, you missed SubmissionDate and Submitted.
Whenever modeling an entity that will be processed in some way and go through consecutive state changes, these states really should be placed in a separate table. Any information concerning the state change -- date, reason for the change, authorization, etc. -- has a functional dependency on the state rather than the entity as a whole; therefore an attempt to make the state information part of the entity tuple will fail 2NF.
The problem this causes is shown in your very question. You already incorporated Submitted and SubmissionDate into the tuple. Now you have another state you want to add. If you had normalized the submission data, you could have simply added another state and gone on.
create table StateDefs(
ID int auto_increment primary key,
Name varchar( 16 ) not null, -- 'Submitted', 'Processed', 'Rejected', etc.
... -- any other data concerning states
);
create table Submissions(
ID int auto_increment primary key,
Subject varchar( 128 ) not null,
... -- other data
);
create table SubmissionStates(
SubID int not null references Submissions( ID ),
State int not null references StateDefs( ID ),
StateDate date not null, -- "When" is a reserved word in most dialects, so the column is named StateDate here
Description varchar( 128 )
);
This shows that a state consists of a date and an open text field to place any other information. That may suit your needs. If different states require different data, you may have to (gasp) create other state tables. Whatever your needs require.
You could insert the first state of a submission into the table and update that record at state changes. But you lose the history of state changes and that is useful information. So each state change would call for a new record each time. Reading the history of a submission would then be easy. Reading the current state would be more difficult.
But not too difficult:
select ss.*
from SubmissionStates ss
where ss.SubID = :SubID
and ss.StateDate = (
    select Max( StateDate )
    from SubmissionStates
    where SubID = ss.SubID
    and StateDate <= current_date );
This finds the current row, that is, the row with the most recent date. To find the state that was in effect on a particular date, change current_date to something like :AsOf and place the date of interest in that variable. Storing the current date in that variable returns the current state, so you can use the same query to find current or past data.

MySQL. Working with Integer Data Interval

I've just started using SQL, so I have no idea how to work with non-standard data types.
I'm working with MySQL...
Say, there are 2 tables: Stats and Common. The Common table looks like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
Stats_id ??????????????????????,
UNIQUE(Mutation, Deletion) );
In place of the ? symbols there must be some type that references the Stats table (Stats.id).
The problem is, this type must make it possible to save data in a format like 1..30 (an interval between 1 and 30). My idea was to use such a type to shorten the Common table's length.
Is it possible to do this, or are there any different ideas?
Assuming that Stats.id is an INTEGER (if not, change the below items as appropriate):
first_stats_id INTEGER NOT NULL REFERENCES Stats(id)
last_stats_id INTEGER NOT NULL REFERENCES Stats(id)
Given that your table contains two VARCHAR fields and a unique index over them, having an additional integer field is the least of your concerns as far as memory usage goes (seriously, one integer field represents a mere 1GB of memory for 262 million rows).
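Put together, the Common table from the question might then look like this (a sketch; note that MySQL generally requires explicit FOREIGN KEY clauses rather than inline REFERENCES):
CREATE TABLE Common (
    Mutation VARCHAR(10) NOT NULL,
    Deletion VARCHAR(10) NOT NULL,
    first_stats_id INTEGER NOT NULL,
    last_stats_id INTEGER NOT NULL,
    UNIQUE (Mutation, Deletion),
    FOREIGN KEY (first_stats_id) REFERENCES Stats (id),
    FOREIGN KEY (last_stats_id) REFERENCES Stats (id)
);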

SQL static data / lookup lists IDENTIFIER

In regard to static data table design: having static data in tables like those shown below,
Currencies (Code, Name). Row example: USD, United States Dollar
Countries (Code, Name). Row example: DE, Germany
XXXObjectType (Code, Name, ... additional attributes)
...
does it make sense to have another (INTEGER) column as a Primary Key so that all Foreign Key references would use it?
Possible solutions:
Use additional INTEGER as PK and FK
Use Code (usually CHAR(N), where N is small) as PK and FK
Use Code only if it is less than a certain size... What size?
Other _______
What would be your suggestion? Why?
I have usually used INT IDENTITY columns, but very often the short code is good enough to show to the user in the UI, in which case the query would need one JOIN fewer.
An INT IDENTITY is absolutely not needed here. Use the 2- or 3-character mnemonics instead. If you have an entity that has no small, unique property, then consider using a synthetic key. But currency codes and country codes aren't the place to do it.
I once worked on a system where someone actually had a table of years, and each year had a YearID. And, true to form, 2001 was year 3, and 2000 was year 4. It made everything else in the system so much harder to understand and query for, and it was for nothing.
If you use an ID INT or a CHAR, referential integrity is preserved in both cases.
An INT is 4 bytes long, so it's equal in size to a CHAR(4); if you use CHAR(x) where x < 4, your CHAR key will be shorter than an INT one; if you use CHAR(x) where x > 4, it will be larger. For short keys it doesn't usually make sense to use VARCHAR, as the latter has a 2-byte overhead. Anyway, when talking about tables with, say, 500 records, the total overhead of a CHAR(5) over an INT key would be just 500 bytes, a negligible amount in a database where some tables could have millions of records.
Considering that countries and currencies (for example) are limited in number (a few hundred, at most), you gain nothing real by using an ID INT instead of a CHAR(4); moreover, a CHAR(4) key can be easier for the end user to remember, and can ease your life when you have to debug/test your SQL and/or data.
Therefore, though I usually use an ID INT key for most of my tables, in several circumstances I choose to have a PK/FK made of CHARs: countries, languages, currencies are amongst those cases.
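For illustration, a minimal sketch of the natural-key variant; the Payments table and its columns are just an invented example of a referencing table:
create table Currencies (
    Code char(3)     not null primary key,   -- e.g. 'USD'
    Name varchar(64) not null
);

create table Payments (
    Id           int identity(1,1) not null primary key,
    Amount       decimal(18,2) not null,
    CurrencyCode char(3) not null references Currencies (Code)
);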