Data Modeling - how to handle two, dependent "status" columns? - sql

I've run across something that's bugging me just enough that I wanted to come here and seek out some "best practice" advice from you guys (et gals).
I have a table in my model, let's call it prospect. Two separate external systems can provide an update for rows in this table, but only as a "status" of that record in those respective systems.
I need to store those statuses locally. The initial idea, of course, is just to make two nullable foreign keys, something like this:
+-----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+----------------+
| prospect_id | int(11) | NO | PRI | NULL | auto_increment |
| ext_status_1_id | int(11) | YES | | NULL | |
| ext_status_2_id | int(11) | YES | | NULL | |
+-----------------+--------------+------+-----+---------+----------------+
In this example there would be, of course, two tables that hold id/value pairs for statuses.
Here's the catch - ext_status_2_id will always be NULL unless ext_status_1_id is 1 (this is just how the business rules work).
Have I modeled this correctly? I just have this nagging voice in the back of my brain telling me that "not every row in prospect will need an ext_status_2_id so this might not be right".
If it matters, this is MySQL 5.0.45 and I'm using InnoDB

Since there is a built-in dependency of Status2 on Status1, why not have a single status field on the prospect table and make Status2 a property of the Status1 table? This is heavily normalized, but structuring the data this way makes the dependency of Status2 on Status1 explicit.

This is probably fine. But since you'll always only use 1 of the 2, you could model it as:
ext_status_type (either 1 or 2) and ext_status for the actual id.
I would probably do the same as you did, because it might be easier to build indexes around this and both numbers appear to have a true different meaning.
If there will be more statuses (3,4,5,6) I would consider the first approach in my answer.
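For reference, the two-column layout from the question, together with the stated business rule, might be sketched as follows. This is a minimal illustration using Python's sqlite3 module rather than MySQL. The lookup table names ext_status_1 and ext_status_2 and the status values in them are made up, and the CHECK constraint is only one possible way to express the rule. Note that MySQL 5.0 parses CHECK clauses but does not enforce them, so there the rule would have to live in a trigger or in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE ext_status_1 (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE ext_status_2 (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE prospect (
    prospect_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    ext_status_1_id INTEGER REFERENCES ext_status_1(id),
    ext_status_2_id INTEGER REFERENCES ext_status_2(id),
    -- Business rule: status 2 may only be set when status 1 is 1.
    -- COALESCE keeps the check from silently passing on a NULL status 1.
    CHECK (ext_status_2_id IS NULL OR COALESCE(ext_status_1_id, 0) = 1)
);
-- Hypothetical status values, for illustration only.
INSERT INTO ext_status_1 VALUES (1, 'accepted'), (2, 'rejected');
INSERT INTO ext_status_2 VALUES (1, 'in review');
""")


def try_insert(status_1, status_2):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO prospect (ext_status_1_id, ext_status_2_id) "
                "VALUES (?, ?)",
                (status_1, status_2),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(1, 1))     # both statuses set, allowed by the rule
print(try_insert(2, None))  # status 2 left NULL
print(try_insert(2, 1))     # status 2 set while status 1 is not 1
```

Keeping both columns nullable matches the original design; the constraint only documents and enforces the dependency between them.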

What are the possible values of ext_status_1? Will ext_status_2 have a value only if status_1 = 1? What does status_1 = 2 mean? I agree partially with Nissan Fan. Is there, however, a direct dependency between status_1 and status_2? Is there a functional dependency of the form status_1 -> status_2? If there is no such dependency, then keeping status_1 and status_2 in a separate table does not solve your problem.

Related

Is comparing two tables faster by importing them into a sql database or by using jdbc?

Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can be hundreds of millions, even a billion lines.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL, there are no other uses.
You should definitely start with the database option. Especially if the databases are connected with a database link, you can easily set up the transfer of the data.
Such a comparison often leads to a full outer join of the two sources, and experience tells us that DIY joins are notoriously less performant than the native database implementation (where you can, for example, use a parallel option).
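As a rough illustration of the database option, the comparison from the question can be expressed with joins once both tables sit in the same database. The sketch below uses Python's sqlite3 module with an in-memory database as a stand-in for the common SQL environment. It emulates the full outer join with two left joins plus an inner join, and it uses SQLite's typeof() to catch the 25.0 versus 25 datatype difference; other databases would need their own way of comparing types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns declared without a type keep the storage class they were given,
# so 25.0 stays REAL and 25 stays INTEGER.
conn.executescript("""
CREATE TABLE table1 (name TEXT PRIMARY KEY, age);
CREATE TABLE table2 (name TEXT PRIMARY KEY, age);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [("Alice", 25.0), ("Bob", 49), ("Jim", 45), ("Cal", 52)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [("Bob", 49), ("Cal", 42), ("Alice", 25)])


def missing_from(present, absent):
    """Rows whose key exists in `present` but has no match in `absent`."""
    sql = f"""SELECT p.name, p.age
              FROM {present} p
              LEFT JOIN {absent} a ON a.name = p.name
              WHERE a.name IS NULL
              ORDER BY p.name"""
    return conn.execute(sql).fetchall()


# Rows present on both sides whose value or datatype differs.
mismatching = conn.execute("""
    SELECT t1.name, t1.age, t2.age
    FROM table1 t1
    JOIN table2 t2 ON t2.name = t1.name
    WHERE t1.age IS NOT t2.age OR typeof(t1.age) <> typeof(t2.age)
    ORDER BY t1.name
""").fetchall()

missing_in_table1 = missing_from("table2", "table1")
missing_in_table2 = missing_from("table1", "table2")

print(missing_in_table1)
print(missing_in_table2)
print(mismatching)
```

With real tables the join key would be the full primary key and the WHERE clause would compare every non-key column.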
That said, you may try to implement a more sophisticated algorithm that can do the comparison without transferring the whole table.
One example is based on Merkle trees: you first scan both sources in place to recognise which parts are identical (and can be ignored), then transfer and compare only the parts that differ.
So if you expect the tables to be nearly identical and they have keys that allow some hierarchy, such an approach could work out better than a brute-force full compare.
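The idea of comparing digests instead of rows can be sketched in a simplified, single-level form. This is not a full Merkle tree: each side only groups its rows into buckets by a hash of the primary key and computes one digest per bucket, so that only the small digest maps need to cross the network. The function names and the bucket count are illustrative assumptions.

```python
import hashlib


def bucket_of(key, n_buckets):
    """Assign a primary key to a bucket using a hash that is stable across hosts."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def bucket_digests(rows, n_buckets=4):
    """Compute one SHA-256 digest per bucket over its rows, in key order."""
    hashers = {}
    for key, value in sorted(rows, key=lambda row: row[0]):
        hasher = hashers.setdefault(bucket_of(key, n_buckets), hashlib.sha256())
        # repr() keeps 25.0 and 25 distinct, so datatype changes are detected.
        hasher.update(repr((key, value)).encode("utf-8"))
    return {bucket: h.hexdigest() for bucket, h in hashers.items()}


def differing_buckets(digests_a, digests_b):
    """Buckets whose digests differ or that exist on one side only."""
    all_buckets = digests_a.keys() | digests_b.keys()
    return sorted(b for b in all_buckets if digests_a.get(b) != digests_b.get(b))


table1 = [("Alice", 25.0), ("Bob", 49), ("Jim", 45), ("Cal", 52)]
table2 = [("Bob", 49), ("Cal", 42), ("Alice", 25)]

# Each datacenter computes its own digests; only these are exchanged.
diff = differing_buckets(bucket_digests(table1), bucket_digests(table2))
print(diff)  # only rows in these buckets need to be transferred and compared
```

A real Merkle tree would add further levels, hashing the bucket digests together so that a single root comparison can confirm that two large tables are identical.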
The faster solution is to load both tables into variables (memory) in your programming language and then compare them with your favorite algorithm.
Copying them to a new table first more than doubles the time spent on read/write operations to disk, especially the writes.

sqlite variable and unknown number of entries in column

I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
just enter a single, comma-separated string and separate at run-time. Seems hacky, and couldn't search by salad_ingredients!
have another table, for each salad, such as "apple_ingredients", which could have a varying number of rows for each ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name, and there is an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, this seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
id
salad_name
salad_type
salad_cost

table ingredients
id
name
and
table salad_ingredients
id
id_salad
id_ingredients
where id_salad is the corresponding id from salads
and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (SELECT) and filter (WHERE) all the values you need.
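The three tables described above could be created and queried roughly as follows. This is a minimal sketch using Python's sqlite3 module. The table and column names follow the answer, while the helper functions add_salad and salads_with are hypothetical names introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salads (
    id INTEGER PRIMARY KEY,
    salad_name TEXT, salad_type TEXT, salad_cost TEXT);
CREATE TABLE ingredients (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE);
CREATE TABLE salad_ingredients (
    id INTEGER PRIMARY KEY,
    id_salad INTEGER REFERENCES salads(id),
    id_ingredients INTEGER REFERENCES ingredients(id));
""")


def add_salad(name, kind, cost, ingredient_names):
    """Insert a salad plus one link row per ingredient, however many there are."""
    salad_id = conn.execute(
        "INSERT INTO salads (salad_name, salad_type, salad_cost) VALUES (?, ?, ?)",
        (name, kind, cost),
    ).lastrowid
    for ingredient in ingredient_names:
        # Reuse the ingredient row if it already exists.
        conn.execute("INSERT OR IGNORE INTO ingredients (name) VALUES (?)",
                     (ingredient,))
        ingredient_id = conn.execute(
            "SELECT id FROM ingredients WHERE name = ?", (ingredient,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO salad_ingredients (id_salad, id_ingredients) VALUES (?, ?)",
            (salad_id, ingredient_id),
        )


def salads_with(ingredient):
    """Names of all salads that contain the given ingredient."""
    rows = conn.execute("""
        SELECT s.salad_name
        FROM salads s
        JOIN salad_ingredients si ON si.id_salad = s.id
        JOIN ingredients i ON i.id = si.id_ingredients
        WHERE i.name = ?
        ORDER BY s.salad_name""", (ingredient,))
    return [row[0] for row in rows]


add_salad("apple", "fruity", "cheap", ["apple"])
add_salad("unlikely", "meaty", "expensive", ["sausages", "chorizo"])
add_salad("normal", "standard", "mid", ["leaves", "cucumber", "tomatoes"])

print(salads_with("chorizo"))
```

Because each ingredient is one row in salad_ingredients, a salad can have any number of ingredients and searching by ingredient is a plain join.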

How to flatten a one-to-many relationship

While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables that look like this:
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table. Since there is a limited number of event names, I am proposing that we denormalize this structure into something that looks like
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
The DATE_ID columns refer to entries inside the time dimension table.
First question: Is this a good idea? What are the alternatives to this scheme?
Second question: How do I implement this using Talend Open Studio? I figured out a way in which I moved the data for each event name into its own temporary table along with cust_id using the tMap component, and later linked them together using another tMap. Is there another way to do this in Talend?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea; this is called an accumulating snapshot fact table. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (I don't know the tool), but it would be quite easy to implement in SQL using a CASE expression or a PIVOT statement.
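The CASE-based approach mentioned here might look roughly like the following. It is a sketch using Python's sqlite3 module with the sample rows from the question, and it returns the raw event dates. The lookup of each date in the time dimension table to obtain a DATE_ID is left out, since that table's layout is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (id INTEGER PRIMARY KEY, cust_name TEXT, cust_email TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY, cust_id INTEGER,
                     event_name TEXT, event_date TEXT);
INSERT INTO master VALUES (1, 'FOO', 'FOO_BAR#EXAMPLE.COM');
INSERT INTO events VALUES
    (1, 1, 'ACC_APPLIED', '2014-01-01'),
    (2, 1, 'ACC_OPENED',  '2014-01-02'),
    (3, 1, 'ACC_CLOSED',  '2014-01-02');
""")

# One output row per customer; each CASE picks out the date of one event name
# and MAX() collapses the per-event rows into a single value per column.
flattened = conn.execute("""
    SELECT m.id, m.cust_name, m.cust_email,
           MAX(CASE WHEN e.event_name = 'ACC_APPLIED' THEN e.event_date END),
           MAX(CASE WHEN e.event_name = 'ACC_OPENED'  THEN e.event_date END),
           MAX(CASE WHEN e.event_name = 'ACC_CLOSED'  THEN e.event_date END)
    FROM master m
    LEFT JOIN events e ON e.cust_id = m.id
    GROUP BY m.id, m.cust_name, m.cust_email
""").fetchall()

for row in flattened:
    print(row)
```

If a customer can have the same event more than once, MAX() keeps only the latest date, so this only fits the case where each event occurs at most once per customer.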
Regarding only your first question, it's certainly a good idea -- unless there is any possibility of the same person applying for, opening and closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse, so denormalizing will certainly be a good choice in this case. The following article fits almost perfectly to clear the air over such scenarios:
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/

Storing a COUNT of values in a table

I have a table with data along the (massively simplified) lines of:
User | Value
-----|------
UsrA | 100
UsrA | 102
UsrB | 100
UsrA | 100
UsrB | 101
and, for reasons far too obscure to go into, I need to store the COUNT of each value in a table for future retrieval - ending up with something like
User | Value100Count | Value101Count | Value102Count
-----|---------------|---------------|--------------
UsrA | 2 | 0 | 1
UsrB | 1 | 1 | 0
However, there could be up to 255 different Values - meaning potentially 255 different ValueXCount columns. I know this is a horrible way to do things, but is there an easy way to get the data into a format that can be easily INSERTed into the destination table? Is there a better way to store the COUNT of values per user (unfortunately I do need to store this information; grabbing it from the source table each time isn't an option)?
The whole thing isn't very pretty, but you know that. Rather than your table with 255 columns, I'd consider setting up another table with:
User | Value | CountOfValue
and setting a primary key over User and Value.
You could then insert the counts for given user/value combos into the CountOfValue field.
As I said, the design is horrible and it feels like you would be better off starting from scratch, normalizing and doing counts live.
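Filling such a table can be done with a single INSERT ... SELECT over a GROUP BY. The sketch below uses Python's sqlite3 module and the sample data from the question. The table names source and value_counts and the helper refresh_counts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (user TEXT, value INTEGER);
CREATE TABLE value_counts (
    user TEXT,
    value INTEGER,
    count_of_value INTEGER NOT NULL,
    PRIMARY KEY (user, value));
""")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("UsrA", 100), ("UsrA", 102), ("UsrB", 100), ("UsrA", 100), ("UsrB", 101)],
)


def refresh_counts():
    """Rebuild the stored counts from the source table in one transaction."""
    with conn:
        conn.execute("DELETE FROM value_counts")
        conn.execute("""
            INSERT INTO value_counts (user, value, count_of_value)
            SELECT user, value, COUNT(*)
            FROM source
            GROUP BY user, value""")


refresh_counts()
counts = conn.execute(
    "SELECT user, value, count_of_value FROM value_counts ORDER BY user, value"
).fetchall()
print(counts)
```

Combinations that never occur (such as UsrB with value 102) simply have no row, so a missing row has to be read as a count of 0.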
Check out indexed views. You can maintain the table automatically, with integrity and as a bonus it can get used in queries that already do count(*) on that data.

MySQL Database Design with Internationalization

I'm going to start work on a medium-sized application, and I'm planning its DB design.
One thing that I'm not sure about is this.
I will have many tables which will need internationalization, such as: "membership_options, gender_options, language_options etc"
Each of these tables will share common i18n fields, like:
"title, alternative_title, short_description, description"
In your opinion which is the best way to do it?
Have an i18n table with the same fields for each of the tables that will need them?
or do something like:
Membership table
----------------
id | created_at
1 | 22.03.2001
2 | 22.03.2001
Gender table
--------------
id | created_at
1 | 14.08.2002
2 | 14.08.2002
General translation table
-------------------------
record_id | table_name | string_name | alternative_title | .... | id_language
1 | membership | regular | null | .... | 1 (english)
1 | membership | normale | null | .... | 2 (italian)
1 | gender | man | null | .... | 1 (english)
1 | gender | uomo | null | .... | 2 (italian)
This would avoid me repeating something like:
membership_translation table
-----------------------------
membership_id | name | alternative_title | id_lang
1 regular null 1
1 normale null 2
gender_translation table
-----------------------------
gender_id | name | alternative_title | id_lang
1 man null 1
1 uomo null 2
and so on, so I would probably reduce the number of DB tables, but I'm not sure about performance. I'm not much of a DB designer, so please let me know.
The most common way I've seen this done is with two tables, membership and membership_ml, with one storing the base values and the ml table storing the localized strings. This is similar to your second option. Most of the systems I see like this are made that way because they weren't designed with internationalization in mind from the get go, so the extra _ml tables were "tacked on" later.
What I think is a better option is similar to your first option, but a little bit different. You would have a central table for storing all the translations, but instead of putting the table name and field name in there, you would use tokens and a central "Content" table to store all the translations. That way you can enforce some kind of RI between the tokens in the base table and the translations in the Content table if you want as well.
I actually asked a question about this very thing a while back, so you can have a look at that for some more info (rather than repasting the schema examples here).
I also think the best solution is to keep translations in a separate table. OpenCart, which is open source, uses this approach, and you can take a look at how it deals with the problem. Another source of information is "http://www.gsdesign.ro/blog/multilanguage-database-design-approach/", especially the comments section.
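Keeping translations in a separate table per entity, as in the membership_translation layout from the question, could be queried along these lines. This is a sketch using Python's sqlite3 module. The helper membership_name and its fallback to a default language are assumptions added for illustration and are not part of any answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE membership (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE membership_translation (
    membership_id INTEGER REFERENCES membership(id),
    id_lang INTEGER,
    name TEXT,
    alternative_title TEXT,
    PRIMARY KEY (membership_id, id_lang));
INSERT INTO membership VALUES (1, '2001-03-22');
INSERT INTO membership_translation VALUES
    (1, 1, 'regular', NULL),   -- 1 = english
    (1, 2, 'normale', NULL);   -- 2 = italian
""")


def membership_name(membership_id, id_lang, fallback_lang=1):
    """Return the name in the requested language, else in the fallback language."""
    row = conn.execute("""
        SELECT name
        FROM membership_translation
        WHERE membership_id = ? AND id_lang IN (?, ?)
        ORDER BY id_lang = ? DESC   -- prefer the requested language
        LIMIT 1""",
        (membership_id, id_lang, fallback_lang, id_lang),
    ).fetchone()
    return row[0] if row else None


print(membership_name(1, 2))
print(membership_name(1, 3))
```

The composite primary key on (membership_id, id_lang) guarantees at most one translation per record and language, which a single generic translation table keyed by table name cannot enforce with a foreign key.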