Background:
I'm working on a project that does not allow me to share the data, but I'll do my best to give you some visualisation below. Before going further: I know (some) SQL, and I have done basic relationship work before, but that data was clean and simple, and for some reason I just can't figure out a solution here.
Problem (?)
I'm trying to define a relationship between two tables from two different sources that each work with different identifiers. I do, however, have a mapping table from one of those sources, but again the identifiers do not align. Let me try to explain visually:
| TABLE 1 (cies) | | TABLE 2 (forms) |
| ------------ | | ------------- |
| id(PK) | | id(PK) |
| 4_digit_code | | 16_digit_code |
| ...more fields | | ...more fields |
The second source provided me a mapping table they use internally:
| MAPPING TABLE |
| ------------- |
| id(PK) |
| 4_digit_code | (= to the one in TABLE 1)
| 16_digit_code | (= to the one in TABLE 2)
My first thought was to create a script and just merge the info in the mapping table in TABLE 1 like so:
| TABLE 1 | | TABLE 2 |
| ------------ | | ------------- |
| id(PK) | | id(PK) |
| 16_digit_code | ==== | 16_digit_code |
| 4_digit_code |
The issue here is the 16_digit_code is not unique so I believe this does not work. Now comes something I have no experience with so I am just thinking out loud here:
Can I keep (?) the mapping table and reference it each time to get from one table to the other? On the other hand, shouldn't all values in a mapping table be unique for it to work? The reason there are non-unique values is that (some) very old numbers end up getting recycled.
For example, get me all forms from the company with id 1:
| TABLE 1 | | MAPPING TABLE | | TABLE 2 |
| ------------ | | ------------- | | ------------- |
| id(PK) | | id(PK) | | id(PK) |
| 16_digit_code | | 16_digit_code | ==== | 16_digit_code |
| 4_digit_code | ==== | 4_digit_code | | ...more fields |
I would not know how to approach the above efficiently. I really don't know whether what I am saying makes sense, whether I am missing something, or whether I am making this way too complex.
Solution?
I'd love it if someone could point me in the right direction. And if you have the solution I'd love to know the reasoning, not just the solution as I'd love to learn from this for the future obviously.
Edit/Clarification:
Just for completeness' sake, the mapping combination (4-digit + 16-digit code) is unique, although, as I said earlier, one 16-digit code can be linked to multiple 4-digit codes.
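The lookup described above ("all forms from the company with id 1") can be sketched as a two-step join that goes through the mapping table instead of merging it into TABLE 1. This is only a minimal sketch: SQLite (via Python's sqlite3) stands in for the real database, the column names code4 and code16 are hypothetical stand-ins for 4_digit_code and 16_digit_code, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cies    (id INTEGER PRIMARY KEY, code4  TEXT);
CREATE TABLE forms   (id INTEGER PRIMARY KEY, code16 TEXT);
CREATE TABLE mapping (
    id     INTEGER PRIMARY KEY,
    code4  TEXT NOT NULL,
    code16 TEXT NOT NULL,
    UNIQUE (code4, code16)   -- the pair is unique; each code on its own is not
);
-- Invented sample data
INSERT INTO cies  VALUES (1, '0001'), (2, '0002');
INSERT INTO forms VALUES (10, 'A000000000000001'), (11, 'A000000000000002');
INSERT INTO mapping (code4, code16) VALUES
    ('0001', 'A000000000000001'),
    ('0002', 'A000000000000001'),   -- recycled 16-digit code shared by two companies
    ('0002', 'A000000000000002');
""")

# cies -> mapping on the 4-digit code, mapping -> forms on the 16-digit code
forms_for_company_1 = conn.execute("""
    SELECT f.id
    FROM cies c
    JOIN mapping m ON m.code4  = c.code4
    JOIN forms   f ON f.code16 = m.code16
    WHERE c.id = ?
    ORDER BY f.id
""", (1,)).fetchall()
print(forms_for_company_1)
```

Because only the (code4, code16) pair has to be unique, a recycled 16-digit code simply shows up under more than one company; whether that is the desired behaviour depends on how recycled numbers should be treated.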
I have a table that has a user_id and a new record for each return reason for that user, as shown here:
| user_id | return_reason |
|--------- |-------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
| 4 | changed mind |
What I would like to do is generate a foreign key for each combination of values that are applicable in a new table and apply that key to the user_id in a new table. Effectively creating a many to many relationship. The result would look like so:
Dimension Table ->
| reason_id | return_reason |
|----------- |--------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
Fact Table ->
| user_id | reason_id |
|--------- |----------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
My thought process is to iterate through the table with a cursor, but this seems like a standard problem, so there is probably a more efficient way of doing it. Is there a specific name for this type of problem? I also thought about pivoting and unpivoting, but that didn't seem too clean either. Any help or reference to articles on how to process this is appreciated.
The problem concerns data normalization and relational integrity. Your concept doesn't really make sense - Dimension table shows two different reasons with same ID and Fact table loses a record. Conventional schema for this many-to-many relationship would be three tables like:
Users table (info about users and UserID is unique)
Reasons table (info about reasons and ReasonID is unique)
UserReasons junction table (associates users with reasons - your existing table). Assuming a user could associate with the same reason multiple times, you probably also need ReturnDate and OrderID_FK fields in UserReasons.
So, you need to replace the reason description in the first table (UserReasons) with a ReasonID. Add a long integer field ReasonID_FK to that table to hold the ReasonID key.
To build Reasons table based on current data, use DISTINCT:
SELECT DISTINCT return_reason INTO Reasons FROM UserReasons
In new table, rename return_reason field to ReasonDescription and add an autonumber field ReasonID.
Now run UPDATE action to populate ReasonID_FK field in UserReasons.
UPDATE UserReasons INNER JOIN Reasons ON UserReasons.return_reason = Reasons.ReasonDescription SET UserReasons.ReasonID_FK = Reasons.ReasonID
When all looks good, delete return_reason field.
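The steps above can be tried end to end on the sample data from the question. This is a rough sketch rather than the exact statements from the answer: it assumes SQLite (via Python's sqlite3) instead of Access, so SELECT ... INTO becomes INSERT ... SELECT, the autonumber field becomes an AUTOINCREMENT key, and the UPDATE with a join is written as a correlated subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The existing table, with the new (still empty) ReasonID_FK column added
CREATE TABLE UserReasons (user_id INTEGER, return_reason TEXT, ReasonID_FK INTEGER);
INSERT INTO UserReasons (user_id, return_reason) VALUES
    (1, 'broken'), (2, 'changed mind'), (2, 'overpriced'),
    (3, 'changed mind'), (4, 'changed mind');

-- Build the Reasons table from the distinct descriptions
CREATE TABLE Reasons (
    ReasonID          INTEGER PRIMARY KEY AUTOINCREMENT,
    ReasonDescription TEXT UNIQUE
);
INSERT INTO Reasons (ReasonDescription)
    SELECT DISTINCT return_reason FROM UserReasons;

-- Populate the foreign key by matching on the description
UPDATE UserReasons
SET ReasonID_FK = (SELECT ReasonID FROM Reasons
                   WHERE ReasonDescription = UserReasons.return_reason);
""")

# The original pairs can be rebuilt through the key alone,
# which is the check to make before dropping return_reason.
pairs = conn.execute("""
    SELECT ur.user_id, r.ReasonDescription
    FROM UserReasons ur
    JOIN Reasons r ON r.ReasonID = ur.ReasonID_FK
    ORDER BY ur.user_id, r.ReasonDescription
""").fetchall()
print(pairs)
```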
I have a jobs table that stores information such as title, department, and salary. I'm wanting the user to be able to create a job using a form that has fields for the aforementioned information, as well as a field for the job category. category would be something like retail, or IT, for example.
I don't have any issues with the actual coding itself, but rather with the best way to design the database to store the information. So my question is this: should I create a separate table categories that stores each job category, along with an ID, so that the tables would look something like this
categories jobs
+----+---------------+ +----+---------------+-------------+--------+-------------+
| id | category | | id | title | department | salary | category_id |
+----+---------------+ +----+---------------+-------------+--------+-------------+
| 1 | Retail | | 1 | Retail | department1 | 10000 | 2 |
+----+---------------+ +----+---------------+-------------+--------+-------------+
| 2 | IT | | 2 | IT | department2 | 12000 | 1 |
+----+---------------+ +----+---------------+-------------+--------+-------------+
where category_id is a foreign key linking to the categories table,
or should I do something like this, where all the information is stored in a single table:
jobs
+----+---------------+-------------+--------+-------------+
| id | title | department | salary | category |
+----+---------------+-------------+--------+-------------+
| 1 | Retail | department1 | 10000 | IT |
+----+---------------+-------------+--------+-------------+
| 2 | IT | department2 | 12000 | Retail |
+----+---------------+-------------+--------+-------------+
Which is the better option? They both seem to achieve the same result, but what are the pros and cons of doing it either way, and which way would be the more preferred way of doing it?
In general, you want to store "entities" in separate tables. In this case, category is a separate entity from jobs.
Why do you want to do this?
There is only one row per category, so you don't have to worry about duplication -- and errors.
There may be additional information that you want to store, such as the creation date, abbreviation, who created it, and so on.
Properly declared foreign key constraints ensure that only valid categories are stored.
Categories may be shared across different tables, and a separate reference table ensures that the values are consistent.
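The point about foreign key constraints can be seen in a small experiment. This is a sketch under assumptions: SQLite (via Python's sqlite3) stands in for whatever database the application uses, and the job titles Cashier and Developer are made up for illustration. Note that SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK constraints unless enabled
conn.executescript("""
CREATE TABLE categories (
    id       INTEGER PRIMARY KEY,
    category TEXT NOT NULL UNIQUE          -- one row per category, no duplicates
);
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY,
    title       TEXT,
    department  TEXT,
    salary      INTEGER,
    category_id INTEGER NOT NULL REFERENCES categories(id)
);
INSERT INTO categories VALUES (1, 'Retail'), (2, 'IT');
INSERT INTO jobs VALUES (1, 'Cashier', 'department1', 10000, 1);
""")

# A job pointing at a category that does not exist is rejected by the constraint.
try:
    conn.execute("INSERT INTO jobs VALUES (2, 'Developer', 'department2', 12000, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("invalid category rejected:", rejected)
```

With the single-table design, a misspelled category string would be stored without complaint, which is the duplication-and-errors risk mentioned above.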
I have been working to build a more abstract schema. Where there had been several tables modeling remarkably similar relationships, I want to model just the "essence". Due to the environment I am working with (Drupal 7), I can't change the nature of the issue: a relationship of the same essential type could reference one of two different tables for the object in one role. Let's bring in an example to clarify (this is not my actual problem domain, but a similar problem). Here are the requirements:
First, if you are unfamiliar with Drupal, here's the gist: Users in one table, every other entity in a single second table (gross generalization, but enough).
Let's say we want to model the "works for" relationship, and let's take as given that "companies" are of type "entity" and "supervisor" is of type "user" (and by "type" I mean that's the table in the database where their tuples reside). Here are the simplified requirements:
A user can work for a company
A company can work for a company
These "works for" relationships should be in the same table.
I have two ideas, and both don't exactly sit well with my current disposition toward schema quality, and this is where I would like some insight.
One foreign-key column paired with a 'type' column
Two foreign-key columns, always at most one utilized (ick!)
In case you are a visual thinker, here are the two options representing the fact that users 123 and 632, as well as entity 123 all work for entity 435:
Option 1
+---------------+-------------+---------------+-------------+
| employment_id | employee_id | employee_type | employer_id |
+---------------+-------------+---------------+-------------+
| 1 | 123 | user | 435 |
+---------------+-------------+---------------+-------------+
| 2 | 123 | entity | 435 |
+---------------+-------------+---------------+-------------+
| 3 | 632 | user | 435 |
+---------------+-------------+---------------+-------------+
Option 2
+---------------+------------------+--------------------+-------------+
| employment_id | employee_user_id | employee_entity_id | employer_id |
+---------------+------------------+--------------------+-------------+
| 1 | 123 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
| 2 | <NULL> | 123 | 435 |
+---------------+------------------+--------------------+-------------+
| 3 | 632 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
Thoughts on option 1: I like that the employee_id column has a concrete role, but I despise that it has an ambiguous target. Option 2 has an ambiguous role (which column is the employee?), but has a concrete target for any given FK, so I can think of it this way:
+-----------+-----------+----------+
| | ROLE |
| | ambiguous | concrete |
+-----------+-----------+----------+
| T | | |
| A ambig. | | 1 |
| R | | |
| G -------+-----------+----------+
| E | | |
| T concr. | 2 | ? |
| | | |
+-----------+-----------+----------+
Option two has very pragmatic benefits for my project, but I do not feel comfortable with so many nulls (you might not even call it 1NF!)
So here's the crux of my question for SO: How can option 1 be improved, or else what knowledge gap might I have that leaves me unsettled? While I can't bring to mind a specific rule which it violates, the design clearly is not in keeping with the intentions of normalization (requiring two columns to uniquely identify a relationship is not doing me any favors for safeguarding against anomalies).
I do understand that the ideal solution would be to redesign the users entity to be the same as what I have been calling "entity" here, but please consider that beside the point/circumstantial (or at least let's draw the pragmatic line right exactly there for this question).
Again, the essential question: What, in terms of normalization, is wrong with schema option 1, and how might you model this relationship given the constraint of not refactoring "user" into "entity"?
note: For this, I am more interested in theoretical purity than a pragmatic solution
The solutions you present contravene 4th normal form, as @podiluska says. If the design is recast into the form below, the difficulty is removed and the result is in 5NF (and even 6NF?).
Adopt one of the patterns for sub/super types. This uses the relation definitions set out below, plus the super/subtype constraint. This constraint is that each tuple in the super type relation must correspond exactly to one sub type tuple. In other words, the subtypes must form a disjoint, covering set over the supertype.
I suspect the performance of this in a real situation might require some heavy tuning:
Table: Employment
+---------------+-------------+
| employee_id | employer_id |
+---------------+-------------+
| 1 | 435 |
+---------------+-------------+
| 2 | 435 |
+---------------+-------------+
| 3 | 435 |
+---------------+-------------+
Table: Employee (SuperType)
+---------------+
| employee_id |
+---------------+
| 1 |
+---------------+
| 2 |
+---------------+
| 3 |
+---------------+
Table: User employee (SubType)
+---------------+-------------+
| employee_id | user_id |
+---------------+-------------+
| 1 | 123 |
+---------------+-------------+
| 3 | 632 |
+---------------+-------------+
Table: Entity employee (SubType)
+---------------+-------------+
| employee_id | entity_id |
+---------------+-------------+
| 2 | 123 |
+---------------+-------------+
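The four relations above can be written down and queried as follows. This is a minimal sketch, assuming SQLite (via Python's sqlite3) and lower-case table names chosen here for illustration; the second query is one possible way to check the disjoint, covering constraint after the fact, not a declarative enforcement of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee        (employee_id INTEGER PRIMARY KEY);            -- supertype
CREATE TABLE user_employee   (employee_id INTEGER PRIMARY KEY REFERENCES employee,
                              user_id     INTEGER NOT NULL);               -- subtype
CREATE TABLE entity_employee (employee_id INTEGER PRIMARY KEY REFERENCES employee,
                              entity_id   INTEGER NOT NULL);               -- subtype
CREATE TABLE employment      (employee_id INTEGER REFERENCES employee,
                              employer_id INTEGER,
                              PRIMARY KEY (employee_id, employer_id));
INSERT INTO employee        VALUES (1), (2), (3);
INSERT INTO user_employee   VALUES (1, 123), (3, 632);
INSERT INTO entity_employee VALUES (2, 123);
INSERT INTO employment      VALUES (1, 435), (2, 435), (3, 435);
""")

# Resolve everyone working for 435 back to the table and id they came from.
workers = conn.execute("""
    SELECT e.employee_id,
           CASE WHEN u.employee_id IS NOT NULL THEN 'user' ELSE 'entity' END AS kind,
           COALESCE(u.user_id, n.entity_id) AS source_id
    FROM employment e
    LEFT JOIN user_employee   u ON u.employee_id = e.employee_id
    LEFT JOIN entity_employee n ON n.employee_id = e.employee_id
    WHERE e.employer_id = 435
    ORDER BY e.employee_id
""").fetchall()

# Supertype rows that do not have exactly one subtype row (should be empty).
violations = conn.execute("""
    SELECT s.employee_id
    FROM employee s
    WHERE (SELECT COUNT(*) FROM user_employee   u WHERE u.employee_id = s.employee_id)
        + (SELECT COUNT(*) FROM entity_employee n WHERE n.employee_id = s.employee_id) <> 1
""").fetchall()
print(workers, violations)
```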
What is wrong with option 1 (and option 2) is that it involves a multivalued dependency and, as such, breaches 4th normal form. However, within the constraints you have given, there's not a lot you can do about that.
If you could replace the worksfor table with a view, then you could keep user-company and company-company relations separate.
Of your two choices, Option 2 has the advantage that it may be easier to enforce the referential integrity, depending on your platform.
One potential, if icky, pragmatic solution within your current constraints could be to give companies positive IDs and users negative IDs, which eliminates the empty column of option 2 and turns the type column of option 1 into an implication, but I feel dirty even suggesting it.
Similarly, if you don't need to know what type the entity is as long as you can determine it via joining, then using GUIDs as IDs would eliminate the need for the type column.
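The view idea mentioned in this answer could look roughly like the following. It is a sketch only: SQLite (via Python's sqlite3) is assumed, and the table names user_works_for and entity_works_for, as well as the view name works_for, are hypothetical. Each base table can then carry an ordinary foreign key, while the view presents the option 1 shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two separate relations, each with an unambiguous target
CREATE TABLE user_works_for   (user_id   INTEGER, employer_id INTEGER);
CREATE TABLE entity_works_for (entity_id INTEGER, employer_id INTEGER);
INSERT INTO user_works_for   VALUES (123, 435), (632, 435);
INSERT INTO entity_works_for VALUES (123, 435);

-- One combined "works for" relation, exposed as a view
CREATE VIEW works_for AS
    SELECT user_id AS employee_id, 'user' AS employee_type, employer_id
    FROM user_works_for
    UNION ALL
    SELECT entity_id, 'entity', employer_id
    FROM entity_works_for;
""")

rows = conn.execute("""
    SELECT employee_id, employee_type, employer_id
    FROM works_for
    ORDER BY employee_type, employee_id
""").fetchall()
print(rows)
```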
I consider myself fairly competent in understanding and manipulating C-ish languages; it's not a problem for me to come up with an algorithm and implement it in any C-ish language.
I have tremendous difficulty writing SQL (in my specific case, MySQL) queries. For very simple queries, it isn't a problem, but for complex queries, I become frustrated not knowing where to start. Reading the MySQL documentation is difficult, mainly because the syntax description and explanation isn't organized very well.
For example, the SELECT documentation is all over the map: it starts out with what looks like pseudo-BNF, but then (since terms in the syntax description, like select_expr, aren't clickable) it quickly devolves into a frustrating exercise of trying to piece the syntax together yourself with a number of browser windows open.
Enough whining.
I'd like to know how people, step by step, begin constructing a complex MySQL query. Here is a specific example. I have three tables below. I want to SELECT a set of rows with the following characteristics:
From the userInfo and userProgram tables, I want to select the userName, isApproved, and modifiedTimestamp fields and UNION them into one set. From this set I want to ORDER by modifiedTimestamp taking the MAX(modifiedTimestamp) for every user (i.e. there should be only one row with a unique userName and the timestamp associated with that username should be as high as possible).
From the user table, I want to match the firstName and lastName that is associated with the userName so that it looks something like this:
+-----------+----------+----------+-------------------+
| firstName | lastName | userName | modifiedTimestamp |
+-----------+----------+----------+-------------------+
| JJ | Prof | jjprofUs | 1289914725 |
| User | 2 | user2 | 1289914722 |
| User | 1 | user1 | 1289914716 |
| User | 3 | user3 | 1289914713 |
| User | 4 | user4 | 1289914712 |
| User | 5 | user5 | 1289914711 |
+-----------+----------+----------+-------------------+
The closest I've got is a query that looks like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo
WHERE user.userName=userInfo.userName)
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram
WHERE user.userName=userProgram.userName)
ORDER BY modifiedTimestamp DESC;
I feel like I'm pretty close but I don't know where to go from here or even if I'm thinking about this in the right way.
> user
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| firstName | varchar(255) | NO | | NULL | |
| lastName | varchar(255) | NO | | NULL | |
| email | varchar(255) | NO | UNI | NULL | |
| avatar | varchar(255) | YES | | '' | |
| password | varchar(255) | NO | | NULL | |
| passwordHint | text | YES | | NULL | |
| access | int(11) | NO | | 1 | |
| lastLoginTimestamp | int(11) | NO | | -1 | |
| isActive | tinyint(4) | NO | | 1 | |
+--------------------+--------------+------+-----+---------+-------+
> userInfo
+-------------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+------------+------+-----+---------+-------+
| userName | char(8) | NO | MUL | NULL | |
| isApproved | tinyint(4) | NO | | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| field | char(255) | YES | | NULL | |
| value | text | YES | | NULL | |
+-------------------+------------+------+-----+---------+-------+
> userProgram
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| isApproved | tinyint(4) | NO | PRI | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| name | varchar(255) | YES | | NULL | |
| address1 | varchar(255) | YES | | NULL | |
| address2 | varchar(255) | YES | | NULL | |
| city | varchar(50) | YES | | NULL | |
| state | char(2) | YES | MUL | NULL | |
| zip | char(10) | YES | | NULL | |
| phone | varchar(25) | YES | | NULL | |
| fax | varchar(25) | YES | | NULL | |
| ehsChildren | int(11) | YES | | NULL | |
| hsChildren | int(11) | YES | | NULL | |
| siteCount | int(11) | YES | | NULL | |
| staffCount | int(11) | YES | | NULL | |
| grantee | varchar(255) | YES | | NULL | |
| programType | varchar(255) | YES | | NULL | |
| additional | text | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
From what I understand of your question, you seem to need a correlated subquery, which would look like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo ui1
WHERE user.userName=ui1.userName
AND modifiedTimestamp=(select max(modifiedTimestamp) from userInfo ui2 where ui1.userName=ui2.userName))
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram up1
WHERE user.userName=up1.userName
AND modifiedTimestamp=(select max(modifiedTimestamp) from userProgram up2 where up1.userName=up2.userName))
ORDER BY modifiedTimestamp DESC;
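To see the correlated condition in isolation, here is a small sketch against a cut-down userInfo table. It assumes SQLite (via Python's sqlite3) rather than MySQL, and the three sample rows are invented; the inner query is re-evaluated for each outer row, using that row's userName.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userInfo (userName TEXT, modifiedTimestamp INTEGER, field TEXT);
INSERT INTO userInfo VALUES
    ('user1', 100, 'a'),
    ('user1', 300, 'b'),
    ('user2', 200, 'c');
""")

# Keep only the rows whose timestamp equals the newest timestamp for the same user.
latest = conn.execute("""
    SELECT ui1.userName, ui1.modifiedTimestamp, ui1.field
    FROM userInfo ui1
    WHERE ui1.modifiedTimestamp = (SELECT MAX(ui2.modifiedTimestamp)
                                   FROM userInfo ui2
                                   WHERE ui2.userName = ui1.userName)
    ORDER BY ui1.userName
""").fetchall()
print(latest)
```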
So, how do I proceed to get to this result? The key is: express clearly the information you want to retrieve, without taking mental shortcuts.
Step 1: Choose the fields I need in the different tables of my database. That's what is between SELECT and FROM. Seems obvious, but it becomes less obvious when it comes to aggregate functions like sums or counts. In that case, you have to say, for example "I need the count of lines in userInfo for each firstName". See below in GROUP BY.
Step 2: Knowing the field you need, write the joins between the different corresponding tables. That's an easy one...
Step 3: Express your conditions. It can be easy, like if you want data from user for userName="RZEZDFGBH", or more complicated, like in your case: the way to formulate it so you can get the thing done, if you want only the most recent modifiedtimestamp, is "so that the modifiedtimestamp is equal to the most recent modifiedtimestamp" (that's where you can easily take a mental shortcut and miss the point)
Step 4: If you have aggregates, it's time to set the GROUP BY clause. For example, if you count all lines in userInfo for each firstName, you would write "GROUP BY firstName":
SELECT firstName,count(*) FROM userInfo GROUP BY firstName
This gives you the number of entries in the table for each different firstName.
Step 5: HAVING conditions. These are conditions on the aggregates. In the previous example, if you wanted only the data for the firstName having more than 5 lines in the table, you could write SELECT firstName,count(*) FROM userInfo GROUP BY firstName HAVING count(*)>5
Step 6: Sort with ORDER BY. Pretty easy...
That's only a short summary. There is much, much more to discover, but it would be too long to write an entire SQL course here... Hope it helps, though!
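Steps 4 and 5 can be tried on a toy table. This is a sketch under assumptions: SQLite (via Python's sqlite3) instead of MySQL, invented rows, a threshold of 1 instead of 5 to keep the data small, and grouping by userName, since that is a column the userInfo table in the question actually has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userInfo (userName TEXT, field TEXT);
INSERT INTO userInfo VALUES
    ('user1', 'a'), ('user1', 'b'), ('user1', 'c'),
    ('user2', 'a'),
    ('user3', 'a'), ('user3', 'b');
""")

# Step 4: one output row per userName, with the number of lines for that user.
counts = conn.execute("""
    SELECT userName, COUNT(*) FROM userInfo
    GROUP BY userName
    ORDER BY userName
""").fetchall()

# Step 5: HAVING filters on the aggregate, after the grouping has happened.
busy = conn.execute("""
    SELECT userName, COUNT(*) FROM userInfo
    GROUP BY userName
    HAVING COUNT(*) > 1
    ORDER BY userName
""").fetchall()
print(counts, busy)
```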
As f00 says, it's simple(r) if you think of the data in terms of sets.
One of the issues with the question as it stands is that the expected output doesn't match the stated requirements - the description mentions the isApproved column, but this doesn't appear anywhere in either the query or the expected output.
What this illustrates is that the first step in writing a query is to have a clear idea of what you want to achieve. The bigger issue with the question as it stands is that this is not clearly described - instead, it moves from a sample table of expected output (which would be more helpful if we had corresponding samples of expected input data) straight into a description of how you intend to achieve it.
As I understand it, what you want to see is a list of users (by username, with their associated first and last names), together with the last time any associated record was modified on either the userInfo or userProgram tables.
(It isn't clear whether you want to see users who have no associated activity on either of these other tables - your supplied query implies not, otherwise the joins would be outer joins.)
So, you want a list of users (by username, with their associated first and last names):
SELECT firstName, lastName, userName
FROM user
together with a list of times that records were last modified:
SELECT userName, MAX(modifiedTimestamp)
...
on either the userInfo or userProgram tables:
...
FROM
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery -- <- this is an alias
...
by userName:
...
group by userName
These two sets of data need to be linked by their userName - so the final query becomes:
SELECT user.firstName, user.lastName, user.userName,
MAX(subquery.modifiedTimestamp) last_modifiedTimestamp
FROM user
JOIN
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery
ON user.userName = subquery.userName
GROUP BY user.userName
In most versions of SQL, this query would return an error as user.firstName and user.lastName are not included in the GROUP BY clause, nor are they summarised.
MySQL allows this syntax - in other SQLs, since those fields are functionally dependent on userName, adding a MAX in front of each field or adding them to the grouping would achieve the same result.
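For reference, here is the portable variant of the final query, with firstName and lastName added to the grouping, run against a few invented rows. It is a sketch that assumes SQLite (via Python's sqlite3) rather than MySQL; the tables are cut down to the columns the query uses, and the short aliases u and s and the final ORDER BY are additions made here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user"      (userName TEXT PRIMARY KEY, firstName TEXT, lastName TEXT);
CREATE TABLE userInfo    (userName TEXT, modifiedTimestamp INTEGER);
CREATE TABLE userProgram (userName TEXT, modifiedTimestamp INTEGER);
-- Invented sample data; user3 has no activity in either table
INSERT INTO "user" VALUES ('user1', 'User', '1'), ('user2', 'User', '2'), ('user3', 'User', '3');
INSERT INTO userInfo    VALUES ('user1', 1289914716), ('user2', 1289914700);
INSERT INTO userProgram VALUES ('user1', 1289914710), ('user2', 1289914722);
""")

rows = conn.execute("""
    SELECT u.firstName, u.lastName, u.userName,
           MAX(s.modifiedTimestamp) AS last_modifiedTimestamp
    FROM "user" u
    JOIN (SELECT userName, modifiedTimestamp FROM userInfo
          UNION ALL
          SELECT userName, modifiedTimestamp FROM userProgram) s
      ON u.userName = s.userName
    GROUP BY u.userName, u.firstName, u.lastName
    ORDER BY last_modifiedTimestamp DESC
""").fetchall()
print(rows)
```

Because the join is an inner join, user3 does not appear in the result; a LEFT JOIN would keep such users with a NULL timestamp.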
A couple of additional points:
UNION and UNION ALL are not identical - the former removes duplicates while the latter does not; this makes the former more processor-intensive.
Since duplicates will be removed by the grouping, it is better to use UNION ALL.
Many people will write this query as user joined to userInfo UNIONed ALL with user joined to userProgram - this is because many SQL engines can optimise this type of query more effectively.
At this point, this represents premature optimisation.
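The difference between UNION and UNION ALL is easy to check directly. A minimal sketch, assuming SQLite (via Python's sqlite3) and two identical literal rows invented for the purpose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION removes the duplicate row; UNION ALL keeps both copies.
union_rows = conn.execute("""
    SELECT 'user1' AS userName, 100 AS modifiedTimestamp
    UNION
    SELECT 'user1', 100
""").fetchall()

union_all_rows = conn.execute("""
    SELECT 'user1' AS userName, 100 AS modifiedTimestamp
    UNION ALL
    SELECT 'user1', 100
""").fetchall()

print(len(union_rows), len(union_all_rows))
```

Since MAX over two identical timestamps is the same as MAX over one, the extra copy kept by UNION ALL does not change the grouped result.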
There's a lot of good stuff here. Thanks to everyone who contributed. This is a quick summary of the things I found helpful as well as some additional thoughts in connecting building functions to building queries. I wish I could give everyone SO merit badges/points but I think that there can only be one (answer) so I'm picking Traroth based upon point total and personal helpfulness.
A function can be understood as three parts: input, process, output. A query can be understood similarly. Most queries look something like this:
SELECT stuff FROM data WHERE data is like something
The SELECT portion is the output. There are some capabilities for formatting the output here (i.e. using AS)
The FROM portion is the input. The input should be seen as a pool of data; you will want to make this as specific as possible, using a variety of joins and subqueries that are appropriate.
The WHERE portion is like the process, but there's a lot of overlap with the FROM portion. Both the FROM and WHERE portions can reduce the pool of data appropriately using a variety of conditions to filter out unwanted data (or to only include desired data). The WHERE portion can also help format the output.
Here's how I broke down the steps:
Start with thinking about what your output looks like. This stuff goes into the SELECT portion.
Next, you want to define the set of data that you wish to work on. Traroth notes: "Knowing the field you need, write the joins between the different corresponding tables. That's an easy one..." It depends on what you mean by 'easy'. If you are new to writing queries, you will probably just default to writing inner joins (like I did). This is not always the best way to go. http://en.wikipedia.org/wiki/Join_(SQL) is a great resource to understanding the different kinds of joins possible.
As a part of the previous step, think about smaller parts of that data set and build up to the complete data set you are interested in. In writing a function, you can write subfunctions to help express your process in a clearer manner. Similarly, you can write subqueries. A huge tip from Mark Bannister is creating a subquery AND USING AN ALIAS. You will have to reconfigure your output to use this alias, but this is pretty key.
Last, you can use various methods to pare down your data set, removing data you're not interested in
One way to think about the data you are operating on is a giant 2-D matrix: JOINs make larger the horizontal aspect, UNIONs make larger the vertical aspect. All the other filters are designed to make this matrix smaller to be appropriate for your output. I don't know if there is a "functional" analogy to JOIN, but UNION is just adding the output of two functions together.
I realize, though, there are lots of ways that building a query IS NOT like writing a function. For example, you can build and pare down your data set in both the FROM and WHERE areas. What was key for me was understanding joins and finding out how to create subqueries using aliases.
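The 2-D matrix picture can be checked by counting rows and columns. This is a small sketch, assuming SQLite (via Python's sqlite3); the tables a and b and the helper shape are hypothetical names used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, x TEXT);
CREATE TABLE b (id INTEGER, y TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
""")

def shape(sql):
    """Return (number of rows, number of columns) of a query result."""
    cur = conn.execute(sql)
    rows = cur.fetchall()
    return len(rows), len(cur.description)

# JOIN: the result gets wider (columns of a plus columns of b).
join_shape = shape("SELECT * FROM a JOIN b ON a.id = b.id")

# UNION ALL: the result gets taller (rows of a plus rows of b).
union_shape = shape("SELECT id, x FROM a UNION ALL SELECT id, y FROM b")

print(join_shape, union_shape)
```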
just learn to think in terms of sets - then it's simple :P
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
You can't construct sql without understanding the data in the tables and the logical result required. There's no background given for what data the tables might look like and mean and the description of the results you're trying to gather doesn't make sense to me so I'm not going to venture a guess.
On the latter point... it's rare that you'd want a union of timestamp values from multiple sources. Generally speaking, when results like that are gathered it's for some sort of auditing/tracing. However, when you're discarding all information about the source of the timestamp and just computing a maximum you have... well what exactly?
Anyways, one or more examples of data and desired output and maybe something about the application and the whys is a must to make yourself clear.
To the extent I'll make any prediction about the shape of your eventual statement, (assuming your task will still be to get a single maximum timestamp per user) it's that it will look something like this:
select u.firstname, u.lastname, user_max_time.userName, user_max_time.max_time
from user u,
( select (sometable).userName, max((sometable).(timestamp column)) as max_time
from (data of interest)
group by (sometable).userName) user_max_time
where u.userName = user_max_time.userName
order by max_time desc;
Your task here would then be to replace the ()s inside the user_max_time subselect with something that makes sense and maps to your requirements. In terms of a general approach to complex SQL, the major suggestion is to build the query from the innermost subselects back out (testing along the way to make sure performance is ok and you don't need intermediate tables).
Anyways, if you're having trouble, and can come back with examples, would be happy to help.
Cheers,
Ben