MySQL. Working with Integer Data Interval

I've just started using SQL, so I have no idea how to work with non-standard data types.
I'm working with MySQL...
Say, there are 2 tables: Stats and Common. The Common table looks like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
Stats_id ??????????????????????,
UNIQUE(Mutation, Deletion) );
Instead of the ? symbols there must be some type that references the Stats table (Stats.id).
The problem is that this type must make it possible to save data in a format like 1..30 (an interval between 1 and 30); my idea was that such a type would shorten the Common table's length.
Is it possible to do this, or are there any other ideas?

Assuming that Stats.id is an INTEGER (if not, change the below items as appropriate):
first_stats_id INTEGER NOT NULL REFERENCES Stats(id)
last_stats_id INTEGER NOT NULL REFERENCES Stats(id)
Given that your table contains two VARCHAR fields and a unique index over them, having an additional integer field is the least of your concerns as far as memory usage goes (seriously, one integer field represents a mere 1 GB of memory for about 262 million rows).
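Putting it together, a minimal sketch of the Common table (assuming Stats.id is an INTEGER primary key, and that the 1..30 interval is stored as its two endpoints) might look like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
first_stats_id INTEGER NOT NULL,
last_stats_id INTEGER NOT NULL,
UNIQUE (Mutation, Deletion),
FOREIGN KEY (first_stats_id) REFERENCES Stats(id),
FOREIGN KEY (last_stats_id) REFERENCES Stats(id)
);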

Related

PostgreSQL: Using BIT VARYING column for bitmask operations

Having used bitmasks for years as a C programmer I'm attempting to do something similar in Postgres and it's not working as I expected. So here is a table definition with 2 columns:
CREATE TABLE dummy
(
countrymask BIT VARYING(255) NOT NULL, -- Yes it's a pretty wide bitmask
countryname CHARACTER VARYING NOT NULL
);
So, some data in the "dummy" table would be:
Now, what is the SQL to return Albania, Armenia and Belarus with one select using the mask (i.e. '100010001')?
I thought it would be something like this:
SELECT * FROM DUMMY WHERE (countrymask & (b'100010001')) <> 0;
But I get a type mismatch, which I'd love some assistance with.
But also, is this going to work when the typecasting is sorted out?
You would have to use bit strings of the same length throughout, that is bit(255), and store all the leading zeros.
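For example, with the mask shortened to bit(9) for readability (the same idea applies to bit(255), just with the leading zeros stored), a sketch of the corrected query would be:
SELECT countryname
FROM dummy
WHERE (countrymask & B'100010001') <> B'000000000';
Both operands of & are now bit strings of the same length, and the result is compared against a bit string rather than an integer.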
This would be simpler if you could use integers and do
WHERE countrymask & 273 <> 0
but there are no integer types with 255 bits supporting the & operator.
Anyway, such a query could never use an index, which is no problem with a tiny table like dummy, but it could be a problem if you want to scan a bigger table.
In a way, that data model violates the first normal form, because it stores several country codes in a single datum. I think that you would be happier with a classical relational model: have a country table that has a numerical primary key filled with a sequence, and use a mapping table to associate rows in another table with several countries.
An alternative would be to store the countries for a row as an array of country identifiers (bigint[]). Then you can use the “overlaps” operator && to scan the table for rows that have any of the countries in a given array. Such an operation can be made fast with a GIN index.
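A minimal sketch of that alternative (table and column names here are placeholders, not taken from your schema):
CREATE TABLE country_membership (
row_id bigint PRIMARY KEY,
country_ids bigint[] NOT NULL -- ids from a separate country table
);
CREATE INDEX country_membership_gin ON country_membership USING gin (country_ids);
-- rows associated with any of the countries 1, 5 or 9
SELECT * FROM country_membership WHERE country_ids && ARRAY[1, 5, 9]::bigint[];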

SQL table with incompatible columns (only 1 must be used at a time)

Context:
Let's consider that I have a database with a table "house". I also have tables "tiledRoof" and "thatchedRoof".
Aim:
All my houses must have only 1 roof at a time. It can be a tiled one or a thatched one, but not both. Even if it doesn't make a lot of sense, imagine that we might change the roof of our houses many times.
My solution:
I can figure out 2 solutions to link houses to roofs:
Solution 1 : Delete/create roofs every time :
The database would look like this (more or less pseudo SQL code):
house{
tiledRoof_id int DEFAULT NULL FOREIGN KEY REFERENCES tiledRoof(id)
thatchedRoof_id int DEFAULT NULL FOREIGN KEY REFERENCES thatchedRoof(id)
// Other columns ...
}
tiledRoof{
id
// Other columns ...
}
thatchedRoof{
id
// Other columns ...
}
So, I make "tiledRoof_id" and "thatchedRoof_id" nullable. Then, if I want to link a house with a tiled roof, I do an upsert into the table "tiledRoof". If a row has been created, I update "tiledRoof_id" to match the created id. Then, if my house was linked to a thatched roof, I delete a row in "thatchedRoof" and set "thatchedRoof_id" to NULL (I guess I can do that automatically by implementing the onDelete of my foreign key constraint).
Down sides :
Deleting a row and later creating a similar one might not be very clever. If I change my roof 50 times, I will create 50 rows and delete 49 of them...
More queries to run than with the second solution.
Solution 2 : Add "enabler columns" :
The database would look like this (more or less pseudo SQL code):
house{
tiledRoof_id int DEFAULT(...) FOREIGN KEY REFERENCES tiledRoof(id)
thatchedRoof_id int DEFAULT(...) FOREIGN KEY REFERENCES thatchedRoof(id)
tiledRoof_enabled boolean DEFAULT True
thatchedRoof_enabled boolean DEFAULT False
// Other columns ...
}
tiledRoof{
id
// Other columns ...
}
thatchedRoof{
id
// Other columns ...
}
I fill both "tiledRoof_id" and "thatchedRoof_id" with foreign ids that link each of my houses to a tiled roof AND to a thatched roof.
To make my house not actually have both roofs, I enable only one of them. To do so, I add 2 additional columns, "tiledRoof_enabled" and "thatchedRoof_enabled", that define which roof is enabled.
Alternatively, I could use a single integer column to record the enabled roof (1 would mean the tiled one is enabled and 2 would mean the thatched one).
Difficulty :
To make that solution work, it would require an implementation of the default values of "tiledRoof_id" and "thatchedRoof_id" that might not be possible: the default would have to insert a new row into the corresponding roof table and use the resulting row id as the value.
If that cannot be done, I have to start by running queries to create my roofs and then create my house.
Question:
What is the best way to reach my purpose? One of the solutions that I proposed? An other one? If it's the second one of my propositions, I would be grateful if you could explain to me if my difficulty can be resolved and how.
Note:
I'm working with sqlite3 (mentioned just in case of syntax differences)
It sounds like you want a slowly changing dimension. Given only two types, I would suggest:
create table house_roofs (
house_id int references houses(house_id),
thatched_roof_id int references thatched_roofs(thatched_roof_id),
tiled_roof_id int references tiled_roofs(tiled_roof_id),
version_eff_dt datetime not null,
version_end_dt datetime,
check (thatched_roof_id is null or tiled_roof_id is null) -- only one at a time
);
This allows you to have properly declared foreign key relationships.
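Fetching the current roof of a house then comes down to filtering on the open-ended version row, for example (a sketch, assuming the convention that the current version has version_end_dt left NULL):
select coalesce(thatched_roof_id, tiled_roof_id) as roof_id,
case when thatched_roof_id is not null then 'thatched' else 'tiled' end as roof_type
from house_roofs
where house_id = 42 -- hypothetical house id
and version_end_dt is null;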
Are you sure you need to normalize the roof type? Why not simply add a boolean for each of the roof types in your house table? SQLite doesn't actually have a boolean type, so you could use an integer 0 or 1.
Note: You would still want to have the tables thatchedRoof and tiledRoof if there are details about each of those types that are generic for all roofs of that type.
If the tables thatchedRoof and tiledRoof contain details that are specific to each individual house, then this strategy may not work too well.
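If the simple boolean approach fits your case, a rough SQLite sketch (column names made up) could enforce the one-roof rule directly with a CHECK constraint:
create table house (
id integer primary key,
has_tiled_roof integer not null check (has_tiled_roof in (0, 1)),
has_thatched_roof integer not null check (has_thatched_roof in (0, 1)),
-- other columns ...
check (has_tiled_roof + has_thatched_roof = 1) -- exactly one roof at a time
);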

Efficient storage pattern for millions of values of different types

I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. The plan is to use Postgres, but the question applies equally to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall will now, through a background job system, have statistics calculated. For example, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)
These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure which storage pattern for this type of calculated data will be the most efficient. I'm not looking to fully normalize everything in the database, though. So far, I have come up with the following options:
Option 1 – long format in one column
I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value will be serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)
I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics according to phone call and the statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient in terms of handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
Option 2 – statistics as attributes of main object
I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:
id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly
...
This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics, or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easy, since the names are baked in as column names.
Option 3 – statistics as different columns
This would probably be the most complex. I am storing only a reference to the statistic type, and the column will be looked up according to that:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string etc.)
statistic_version (integer)
created_at (datetime)
This would mean that the table is going to be very sparse, as only one of the statistic_value_ columns would be populated. Could that lead to performance issues?
Option 4 – normalized form
Trying to normalize option 3, I would create two tables:
statistics
id (PK)
version
created_at
statistic_mapping
phone_call_id (FK)
statistic_id (FK)
statistic_type_mapping
statistic_id (FK)
type (string, indicates bool, string etc.)
statistic_values_boolean
statistic_id (FK)
value (bool)
…
But this isn't going anywhere, since I can't dynamically join to another table name, can I? Or should I just join to all statistic_values_* tables based on the statistic ID anyway? My application would then have to make sure that no duplicate entries exist.
To summarize, given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when the requirement is that statistic types may be added or changed, and that several versions exist at the same time, and that querying of the values should be somewhat efficient?
IMO you can use the following simple database structure to solve your problem.
Statistics type dictionary
A very simple table - just name and description of the stat. type:
create table stat_types (
type text not null constraint stat_types_pkey primary key,
description text
);
(You can replace it with enum if you have a finite number of elements)
Stat table for every type of objects in the project
It contains an FK to the object, an FK to the stat. type (or just an enum) and, importantly, a jsonb field with arbitrary stat. data whose structure depends on the type. For example, such a table for phone calls:
create table phone_calls_statistics (
phone_call_id uuid not null references phone_calls,
stat_type text not null references stat_types,
data jsonb,
constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that table phone_calls has uuid type of its PK:
create table phone_calls (
id uuid not null constraint phone_calls_pkey primary key
-- ...
);
The data field has a different structure which depends on its stat. type. Example for call duration:
{
"call_duration": 120.0
}
or for dropouts:
{
"dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
avg((pcs.data ->> 'call_duration')::float) as avg,
min((pcs.data ->> 'call_duration')::float) as min,
max((pcs.data ->> 'call_duration')::float) as max
from
phone_calls_statistics pcs
where
pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hung ups:
select
sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
phone_calls_statistics pcs
where
pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, with good performance potential and scalability. The main table has a simple primary key index, and all of the queries above run against that one table. You can always extend the set of stat. types and their calculations.
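If the per-type aggregates shown above turn out to be hot spots, one possible addition (a sketch, Postgres-specific) is a partial expression index, so that for example the CALL_DURATION aggregates can be answered from a much smaller index:
create index phone_calls_statistics_call_duration_idx
on phone_calls_statistics (((data ->> 'call_duration')::float))
where stat_type = 'CALL_DURATION';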
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0

Does the precision of datetime field change when used with bigint field to form a composite key or Unique Index?

I need to use a composite key or unique index comprised of a bigint data type and a datetime data type. However, I've noticed that the seconds element of the datetime has been rounded to the nearest minute and causes a duplicate key violation when trying to import a dataset.
The dataset is essentially transactional data, so our ID (stored in the bigint field) will be repeated, hence the need for inclusion of the datetime field in the unique key.
To give an example: the following two rows cause a 'duplicate key row' error:
ID field (bigint) | ActionDate (datetime)
--------------------- |--------------------------
1050000284002 | 2016-01-08 15:51:24.000
1050000284002 | 2016-01-08 15:50:35.000
The values are clearly different (and are stored correctly in the database) but the error shows:
The duplicate key value is (1050000284002, Jan 8 2016 3:51PM).
(It's worth adding that I initially created a composite key and have since replaced it with a unique index; the error outlined above was generated with the index in place.)
My questions are:
Is my datetime field being rounded because I'm using an integer in the key/index?
Is there another reason that I lose the accuracy of the time component of the datetime field?
How can I rectify the issue so that the example wouldn't result in a key violation?
If you are using an index of the form (index on columns A,B) and not a formulaic one (index on columns A + B), then no, the datatype of one column will have no effect on the contents of the other.
Based on your description, I'd check the following:
The actual data type of the datetime column. Is it datetime? (datetime values are rounded to the nearest 1/300th of a second, about 3 ms, though that's not the issue here.)
The actual definition of the index. Is it defined the way you think it is? Perhaps it is indexing on date(DateTimeColumn)? (See the sketch after this list for one way to check these two points.)
The actual data being stored. Is whatever is loading the data into the table perhaps truncating the seconds?
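A quick way to check the first two points in SQL Server (a sketch; substitute your actual table name, which I don't know, for dbo.YourTable):
-- actual data type of the ActionDate column
select c.name, t.name as data_type
from sys.columns c
join sys.types t on t.user_type_id = c.user_type_id
where c.object_id = object_id('dbo.YourTable')
and c.name = 'ActionDate';
-- columns that make up each index on the table
select i.name as index_name, col_name(ic.object_id, ic.column_id) as column_name
from sys.indexes i
join sys.index_columns ic on ic.object_id = i.object_id and ic.index_id = i.index_id
where i.object_id = object_id('dbo.YourTable');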
Further suggestions based on your edit:
If the data you are importing clearly contains unique datetime values, yet SQL Server is not treating them as unique, then something's up with the data import process.
Try loading your data into the table without the index in place. Does it load? Does it match your source data to the millisecond? Now, with the data loaded, create the index (primary key, unique constraint, whatever). Does this fail? Where's the duplicate data coming from? In short, mess around with the data and loading processes and see what falls out.
When I do this:
declare @i bigint
set @i = 25
select @i + getdate()
select cast(@i + getdate() as int)
I get:
2016-10-17 09:47:05.753
42658
Notice the first result is the date plus 25 days, so there is no drop-off in the integer part of the date (otherwise it would be midnight on that date...).
The second select drops everything after the decimal point...

Multiple Wildcard Counts in Same Query

One of my job functions is being responsible for mining and marketing on a large newsletter subscription database. Each one of my newsletters has four columns (newsletter_status, newsletter_datejoined, newsletter_dateunsub, and newsletter_unsubmid).
In addition to these columns, I also have a master unsub column that our customer service dept. can update to accommodate irate subscribers who wish to be removed from all our mailings, and another column, called emailaddress_status, that gets updated if a hard bounce (or a set number of soft bounces) occurs.
When I pull a count for current valid subscribers for one list I use the following syntax:
select count (*) from subscriber_db
WHERE (emailaddress_status = 'VALID' OR emailaddress_status IS NULL)
AND newsletter_status = 'Y'
and unsub = 'N' and newsletter_datejoined >= '2013-01-01';
What I'd like to have is one query that looks at all columns matching %_status, applies the aforementioned criteria, and orders the results by current count size.
I'd like for it to look like this:
etc.
I've searched around the web for months looking for something similar, but other than running them in a terminal and exporting the results I've not been able to successfully get them all into one query.
I'm running PostgreSQL 9.2.3.
A proper test case would be each aggregate total matching the counts I get when running the individual queries.
Here's my obfuscated table definition for ordinal placement, column_type, char_limit, and is_nullable.
Your schema is absolutely horrifying:
24 ***_status text YES
25 ***_status text YES
26 ***_status text YES
27 ***_status text YES
28 ***_status text YES
29 ***_status text YES
where I presume the masked *** is something like the name of a publication/newsletter/etc.
You need to read about data normalization or you're going to have a problem that keeps on growing until you hit PostgreSQL's row-size limit.
Since each item of interest is in a different column the only way to solve this with your existing schema is to write dynamic SQL using PL/PgSQL's EXECUTE format(...) USING .... You might consider this as an interim option only, but it's a bit like using a pile driver to jam the square peg into the round hole because a hammer wasn't big enough.
There are no column name wildcards in SQL, like *_status or %_status. Columns are a fixed component of the row, with different types and meanings. Whenever you find yourself wishing for something like this it's a sign that your design needs to be re-thought.
I'm not going to write an example since (a) this is an email marketing company and (b) the "obfuscated" schema is completely unusable for any kind of testing without lots of manual work re-writing it. (In future, please provide CREATE TABLE and INSERT statements for your dummy data, or better yet, a link to http://sqlfiddle.com/.) You'll find lots of examples of dynamic SQL in PL/PgSQL - and warnings about how to avoid the resulting SQL injection risks by proper use of format - with a quick search of Stack Overflow. I've written a bunch in the past.
Please, for your sanity and the sanity of whoever else needs to work on this system, normalize your schema.
You can create a view over the normalized tables to present the old structure, giving you time to adapt your applications. With a bit more work you can even define a DO INSTEAD view trigger (newer Pg versions) or RULE (older Pg versions) to make the view updateable and insertable, so your app can't even tell that anything has changed - though this comes at a performance cost so it's better to adapt the app if possible.
Start with something like this:
CREATE TABLE subscriber (
id serial primary key,
email_address text not null,
-- please read http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
-- for why I merged "fname" and "lname" into one field:
realname text,
-- Store birth month/year as a "date" with a "CHECK" constraint forcing it to be the 1st day
-- of the month. Much easier to work with.
birthmonth date,
CONSTRAINT birthmonth_must_be_day_1 CHECK ( extract(day from birthmonth) = 1),
postcode text,
-- Congratulations! You made "gender" a "text" field to start with, you avoided
-- one of the most common mistakes in schema design, the boolean/binary gender
-- field!
gender text,
-- What's MSO? Should have a COMMENT ON...
mso text,
source text,
-- Maintain these with a trigger. If you want modified to update when any child record
-- changes you can do that with triggers on subscription and reducedfreq_subscription.
created_on timestamp not null default current_timestamp,
last_modified timestamp not null,
-- Use the native PostgreSQL UUID type, after running CREATE EXTENSION "uuid-ossp";
uuid uuid not null,
uuid2 uuid not null,
brand text
-- etc etc
);
CREATE TABLE reducedfreq_subscription (
id serial primary key,
subscriber_id integer not null references subscriber(id),
-- Suspect this was just a boolean stored as text in your schema, in which case
-- delete it.
reducedfreqsub text,
reducedfreqpref text,
-- plural, might be a comma list? Should be in sub-table ("join table")
-- if so, but without sample data can only guess.
reducedfreqtopics text,
-- date can be NOT NULL since the row won't exist unless they joined
reducedfreq_datejoined date not null,
reducedfreq_dateunsub date
);
CREATE TABLE subscription (
id serial primary key,
subscriber_id integer not null references subscriber(id),
sub_name text not null,
status text not null,
datejoined date not null,
dateunsub date
);
CREATE TABLE subscriber_activity (
subscriber_id integer not null references subscriber(id),
last_click timestamptz,
last_open timestamptz,
last_hardbounce timestamptz,
last_softbounce timestamptz,
last_successful_mailing timestamptz
);
To call it merely "horrifying" shows a great deal of tact and kindness on your part. Thank You. :) I inherited this schema only recently (which was originally created by the folks at StrongMail).
I have a full relational DB re-architecture project on my roadmap this year - the sample normalization is very much in line with what I'd been working on. Very interesting insight on realname; I hadn't really thought about that. I suppose the only reason StrongMail had it broken out was for first-name email personalization.
MSO is multiple systems operator (cable company). We're a large lifestyle media company, and the newsletters we produce are on food, travel, homes and gardening.
I'm creating a Fiddle for this - I'm new here so going forward I'll be more mindful of what you guys need to be able to help. Thank you!