What is a structured way to build a MySQL query?

I consider myself fairly competent in understanding and manipulating C-ish languages; it's not a problem for me to come up with an algorithm and implement it in any C-ish language.
I have tremendous difficulty writing SQL (in my specific case, MySQL) queries. For very simple queries, it isn't a problem, but for complex queries, I become frustrated not knowing where to start. Reading the MySQL documentation is difficult, mainly because the syntax descriptions and explanations aren't organized very well.
For example, the SELECT documentation is all over the map: it starts out with what looks like pseudo-BNF, but then (since the terms in the syntax description, like select_expr, aren't clickable) it quickly devolves into a frustrating exercise of piecing the syntax together yourself with a number of browser windows open.
Enough whining.
I'd like to know how people, step by step, begin constructing a complex MySQL query. Here is a specific example. I have three tables below. I want to SELECT a set of rows with the following characteristics:
From the userInfo and userProgram tables, I want to select the userName, isApproved, and modifiedTimestamp fields and UNION them into one set. From this set I want to ORDER by modifiedTimestamp taking the MAX(modifiedTimestamp) for every user (i.e. there should be only one row with a unique userName and the timestamp associated with that username should be as high as possible).
From the user table, I want to match the firstName and lastName that is associated with the userName so that it looks something like this:
+-----------+----------+----------+-------------------+
| firstName | lastName | userName | modifiedTimestamp |
+-----------+----------+----------+-------------------+
| JJ | Prof | jjprofUs | 1289914725 |
| User | 2 | user2 | 1289914722 |
| User | 1 | user1 | 1289914716 |
| User | 3 | user3 | 1289914713 |
| User | 4 | user4 | 1289914712 |
| User | 5 | user5 | 1289914711 |
+-----------+----------+----------+-------------------+
The closest I've got is a query that looks like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo
WHERE user.userName=userInfo.userName)
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram
WHERE user.userName=userProgram.userName)
ORDER BY modifiedTimestamp DESC;
I feel like I'm pretty close but I don't know where to go from here or even if I'm thinking about this in the right way.
> user
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| firstName | varchar(255) | NO | | NULL | |
| lastName | varchar(255) | NO | | NULL | |
| email | varchar(255) | NO | UNI | NULL | |
| avatar | varchar(255) | YES | | '' | |
| password | varchar(255) | NO | | NULL | |
| passwordHint | text | YES | | NULL | |
| access | int(11) | NO | | 1 | |
| lastLoginTimestamp | int(11) | NO | | -1 | |
| isActive | tinyint(4) | NO | | 1 | |
+--------------------+--------------+------+-----+---------+-------+
> userInfo
+-------------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+------------+------+-----+---------+-------+
| userName | char(8) | NO | MUL | NULL | |
| isApproved | tinyint(4) | NO | | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| field | char(255) | YES | | NULL | |
| value | text | YES | | NULL | |
+-------------------+------------+------+-----+---------+-------+
> userProgram
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| isApproved | tinyint(4) | NO | PRI | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| name | varchar(255) | YES | | NULL | |
| address1 | varchar(255) | YES | | NULL | |
| address2 | varchar(255) | YES | | NULL | |
| city | varchar(50) | YES | | NULL | |
| state | char(2) | YES | MUL | NULL | |
| zip | char(10) | YES | | NULL | |
| phone | varchar(25) | YES | | NULL | |
| fax | varchar(25) | YES | | NULL | |
| ehsChildren | int(11) | YES | | NULL | |
| hsChildren | int(11) | YES | | NULL | |
| siteCount | int(11) | YES | | NULL | |
| staffCount | int(11) | YES | | NULL | |
| grantee | varchar(255) | YES | | NULL | |
| programType | varchar(255) | YES | | NULL | |
| additional | text | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+

From what I understand of your question, you seem to need a correlated subquery, which would look like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo ui1
WHERE user.userName=ui1.userName
AND ui1.modifiedTimestamp=(SELECT MAX(ui2.modifiedTimestamp) FROM userInfo ui2 WHERE ui1.userName=ui2.userName))
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram up1
WHERE user.userName=up1.userName
AND up1.modifiedTimestamp=(SELECT MAX(up2.modifiedTimestamp) FROM userProgram up2 WHERE up1.userName=up2.userName))
ORDER BY modifiedTimestamp DESC;
So, how do I proceed to get to this result? The key is to express clearly the information you want to retrieve, without taking mental shortcuts.
Step 1: Choose the fields you need from the different tables of your database. That's what goes between SELECT and FROM. It seems obvious, but it becomes less obvious when it comes to aggregate functions like sums or counts. In that case, you have to say, for example, "I need the count of rows in userInfo for each userName". See GROUP BY below.
Step 2: Knowing the field you need, write the joins between the different corresponding tables. That's an easy one...
Step 3: Express your conditions. It can be easy, as when you want data from user for userName="RZEZDFGBH", or more complicated, as in your case: if you want only the most recent modifiedTimestamp, the way to formulate it so you can get the thing done is "so that the modifiedTimestamp is equal to the most recent modifiedTimestamp" (that's where you can easily take a mental shortcut and miss the point).
Step 4: If you have aggregates, it's time to write the GROUP BY clause. For example, if you count all rows in userInfo for each userName, you would write "GROUP BY userName":
SELECT userName, COUNT(*) FROM userInfo GROUP BY userName
This gives you the number of entries in the table for each distinct userName.
Step 5: HAVING conditions. These are conditions on the aggregates. In the previous example, if you wanted only the userNames having more than 5 rows in the table, you could write SELECT userName, COUNT(*) FROM userInfo GROUP BY userName HAVING COUNT(*)>5
Step 6: Sort with ORDER BY. Pretty easy...
That's only a short summary. There is much, much more to discover, but it would be too long to write an entire SQL course here... Hope it helps, though!
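As a rough, runnable sketch of steps 4 and 5, the following uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained). It groups by userName, since that is a column the posted userInfo table actually has; the sample rows are invented.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userInfo (userName TEXT, modifiedTimestamp INTEGER)")
conn.executemany(
    "INSERT INTO userInfo VALUES (?, ?)",
    [("user1", 10), ("user1", 20), ("user1", 30), ("user2", 15)],
)

# Step 4: one output row per userName, with the aggregate computed per group.
counts = conn.execute(
    "SELECT userName, COUNT(*) FROM userInfo "
    "GROUP BY userName ORDER BY userName"
).fetchall()

# Step 5: HAVING filters on the aggregate, after the grouping has happened.
busy = conn.execute(
    "SELECT userName, COUNT(*) FROM userInfo "
    "GROUP BY userName HAVING COUNT(*) > 2"
).fetchall()

print(counts)
print(busy)
```

WHERE would filter individual rows before grouping; HAVING is the place for conditions that mention COUNT, MAX and friends.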

As f00 says, it's simple(r) if you think of the data in terms of sets.
One of the issues with the question as it stands is that the expected output doesn't match the stated requirements - the description mentions the isApproved column, but this doesn't appear anywhere in either the query or the expected output.
What this illustrates is that the first step in writing a query is to have a clear idea of what you want to achieve. The bigger issue with the question as it stands is that this is not clearly described - instead, it moves from a sample table of expected output (which would be more helpful if we had corresponding samples of expected input data) straight into a description of how you intend to achieve it.
As I understand it, what you want to see is a list of users (by username, with their associated first and last names), together with the last time any associated record was modified on either the userInfo or userProgram tables.
(It isn't clear whether you want to see users who have no associated activity on either of these other tables - your supplied query implies not, otherwise the joins would be outer joins.)
So, you want a list of users (by username, with their associated first and last names):
SELECT firstName, lastName, userName
FROM user
together with a list of times that records were last modified:
SELECT userName, MAX(modifiedTimestamp)
...
on either the userInfo or userProgram tables:
...
FROM
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery -- <- this is an alias
...
by userName:
...
group by userName
These two sets of data need to be linked by their userName - so the final query becomes:
SELECT user.firstName, user.lastName, user.userName,
MAX(subquery.modifiedTimestamp) last_modifiedTimestamp
FROM user
JOIN
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery
ON user.userName = subquery.userName
GROUP BY user.userName
In most versions of SQL, this query would return an error as user.firstName and user.lastName are not included in the GROUP BY clause, nor are they summarised.
MySQL allows this syntax - in other SQL dialects, since those fields are functionally dependent on userName, adding a MAX in front of each field or adding them to the grouping would achieve the same result.
A couple of additional points:
UNION and UNION ALL are not identical - the former removes duplicates while the latter does not; this makes the former more processor-intensive.
Since duplicates will be removed by the grouping, it is better to use UNION ALL.
Many people will write this query as user joined to userInfo UNIONed ALL with user joined to userProgram - this is because many SQL engines can optimise this type of query more effectively.
At this point, this represents premature optimisation.
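To see the final query at work, here is a small sketch using Python's sqlite3 module in place of MySQL (an assumption for the sake of a self-contained example). The tables are cut down to the columns the query touches, the rows are invented, and firstName and lastName are added to the GROUP BY, which is the portable variant mentioned above.

```python
import sqlite3

# SQLite stands in for MySQL; tables are trimmed and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (userName TEXT PRIMARY KEY, firstName TEXT, lastName TEXT);
CREATE TABLE userInfo (userName TEXT, modifiedTimestamp INTEGER);
CREATE TABLE userProgram (userName TEXT, modifiedTimestamp INTEGER);
INSERT INTO user VALUES ('user1', 'User', '1'), ('user2', 'User', '2');
INSERT INTO userInfo VALUES ('user1', 100), ('user1', 300), ('user2', 200);
INSERT INTO userProgram VALUES ('user1', 250), ('user2', 400);
""")

latest = conn.execute("""
SELECT user.firstName, user.lastName, user.userName,
       MAX(subquery.modifiedTimestamp) AS last_modifiedTimestamp
FROM user
JOIN (SELECT userName, modifiedTimestamp FROM userInfo
      UNION ALL
      SELECT userName, modifiedTimestamp FROM userProgram) subquery
  ON user.userName = subquery.userName
GROUP BY user.userName, user.firstName, user.lastName
ORDER BY last_modifiedTimestamp DESC
""").fetchall()

for row in latest:
    print(row)
```

Each user appears once, carrying the largest timestamp found in either table.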

There's a lot of good stuff here. Thanks to everyone who contributed. This is a quick summary of the things I found helpful as well as some additional thoughts in connecting building functions to building queries. I wish I could give everyone SO merit badges/points but I think that there can only be one (answer) so I'm picking Traroth based upon point total and personal helpfulness.
A function can be understood as three parts: input, process, output. A query can be understood similarly. Most queries look something like this:
SELECT stuff FROM data WHERE data is like something
The SELECT portion is the output. There are some capabilities for formatting the output here (e.g. using AS).
The FROM portion is the input. The input should be seen as a pool of data; you will want to make this as specific as possible, using a variety of joins and subqueries that are appropriate.
The WHERE portion is like the process, but there's a lot of overlap with the FROM portion. Both the FROM and WHERE portions can reduce the pool of data appropriately using a variety of conditions to filter out unwanted data (or to include only desired data). The WHERE portion can also help format the output.
Here's how I broke down the steps:
Start with thinking about what your output looks like. This stuff goes into the SELECT portion.
Next, you want to define the set of data that you wish to work on. Traroth notes: "Knowing the field you need, write the joins between the different corresponding tables. That's an easy one..." It depends on what you mean by 'easy'. If you are new to writing queries, you will probably just default to writing inner joins (like I did). This is not always the best way to go. http://en.wikipedia.org/wiki/Join_(SQL) is a great resource for understanding the different kinds of joins possible.
As a part of the previous step, think about smaller parts of that data set and build up to the complete data set you are interested in. In writing a function, you can write subfunctions to help express your process in a clearer manner. Similarly, you can write subqueries. A huge tip from Mark Bannister: create a subquery AND USE AN ALIAS. You will have to reconfigure your output to use this alias, but this is pretty key.
Last, you can use various methods to pare down your data set, removing data you're not interested in.
One way to think about the data you are operating on is as a giant 2-D matrix: JOINs enlarge it horizontally, UNIONs enlarge it vertically. All the other filters are designed to shrink this matrix down to what is appropriate for your output. I don't know if there is a "functional" analogy to JOIN, but UNION is just adding the output of two functions together.
I realize, though, that there are lots of ways in which building a query IS NOT like writing a function. For example, you can build and pare down your data set in both the FROM and WHERE areas. What was key for me was understanding joins and finding out how to create subqueries using aliases.
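The matrix picture can be checked on a toy example. This is only a sketch: the tables a and b and their contents are hypothetical, and Python's sqlite3 module is used in place of MySQL so that it runs on its own.

```python
import sqlite3

# Two tiny, hypothetical tables; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, x TEXT);
CREATE TABLE b (id INTEGER, y TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
""")

# A JOIN adds columns: each row of a is extended with the matching row of b.
joined = conn.execute(
    "SELECT a.id, a.x, b.y FROM a JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()

# A UNION ALL adds rows: the rows of b are stacked under the rows of a.
stacked = conn.execute(
    "SELECT id, x FROM a UNION ALL SELECT id, y FROM b"
).fetchall()

print(len(joined), "rows x", len(joined[0]), "columns")    # the join
print(len(stacked), "rows x", len(stacked[0]), "columns")  # the union
```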

just learn to think in terms of sets - then it's simple :P
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html

You can't construct SQL without understanding the data in the tables and the logical result required. There's no background given for what the data in the tables might look like and mean, and the description of the results you're trying to gather doesn't make sense to me, so I'm not going to venture a guess.
On the latter point: it's rare that you'd want a union of timestamp values from multiple sources. Generally speaking, when results like that are gathered, it's for some sort of auditing/tracing. However, when you're discarding all information about the source of the timestamp and just computing a maximum, you have... well, what exactly?
Anyway, one or more examples of data and desired output, and maybe something about the application and the whys, are a must to make yourself clear.
To the extent I'll make any prediction about the shape of your eventual statement (assuming your task will still be to get a single maximum timestamp per user), it's that it will look something like this:
select u.firstName, u.lastName, user_max_time.userName, user_max_time.max_time
from user u,
( select (sometable).userName, max((sometable).(timestamp column)) as max_time
from (data of interest)
group by (sometable).userName) user_max_time
where u.userName = user_max_time.userName
order by max_time desc;
Your task here would then be to replace the ()s inside the user_max_time subselect with something that makes sense and maps to your requirements. In terms of a general approach to complex SQL, the major suggestion is to build the query from the innermost subselects back out (testing along the way to make sure performance is OK and you don't need intermediate tables).
Anyways, if you're having trouble, and can come back with examples, would be happy to help.
Cheers,
Ben

Related

Access text count in query design

I am new to Access and am trying to develop a query that will allow me to count the number of occurrences of one word in each field from a table with 15 fields.
The table simply stores test results for employees. There is one table that stores the employee identification - id, name, etc.
The second table has 15 fields - A1 through A15 with the words correct or incorrect in each field. I need the total number of incorrect occurrences for each field, not for the entire table.
Is there an answer through Query Design, or is code required?
The solution, whether Query Design, or code, would be greatly appreciated!
Firstly, one of the reasons that you are struggling to obtain the desired result for what should be a relatively straightforward request is that your data does not follow database normalisation rules; consequently, you are working against the natural operation of an RDBMS when querying your data.
From your description, I assume that the fields A1 through A15 are answers to questions on a test.
By representing these as separate fields within your database, aside from the inherent difficulty in querying the resulting data (as you have discovered), if ever you wanted to add or remove a question to/from the test, you would be forced to restructure your entire database!
Instead, I would suggest structuring your table in the following way:
Results
+------------+------------+-----------+
| EmployeeID | QuestionID | Result |
+------------+------------+-----------+
| 1 | 1 | correct |
| 1 | 2 | incorrect |
| ... | ... | ... |
| 1 | 15 | correct |
| 2 | 1 | correct |
| 2 | 2 | correct |
| ... | ... | ... |
+------------+------------+-----------+
This table would be a junction table (a.k.a. linking / cross-reference table) in your database, supporting a many-to-many relationship between the tables Employees & Questions, which might look like the following:
Employees
+--------+-----------+-----------+------------+------------+-----+
| Emp_ID | Emp_FName | Emp_LName | Emp_DOB | Emp_Gender | ... |
+--------+-----------+-----------+------------+------------+-----+
| 1 | Joe | Bloggs | 01/01/1969 | M | ... |
| ... | ... | ... | ... | ... | ... |
+--------+-----------+-----------+------------+------------+-----+
Questions
+-------+------------------------------------------------------------+--------+
| Qu_ID | Qu_Desc | Qu_Ans |
+-------+------------------------------------------------------------+--------+
| 1 | What is the meaning of life, the universe, and everything? | 42 |
| ... | ... | ... |
+-------+------------------------------------------------------------+--------+
With this structure, if ever you wish to add or remove a question from the test, you can simply add or remove a record from the table without needing to restructure your database or rewrite any of the queries, forms, or reports which depend upon the existing structure.
Furthermore, since the result of an answer is likely to be a binary correct or incorrect, then this would be better (and far more efficiently) represented using a Boolean True/False data type, e.g.:
Results
+------------+------------+--------+
| EmployeeID | QuestionID | Result |
+------------+------------+--------+
| 1 | 1 | True |
| 1 | 2 | False |
| ... | ... | ... |
| 1 | 15 | True |
| 2 | 1 | True |
| 2 | 2 | True |
| ... | ... | ... |
+------------+------------+--------+
Not only does this consume less memory in your database, but this may be indexed far more efficiently (yielding faster queries), and removes all ambiguity and potential for error surrounding typos & case sensitivity.
With this new structure, if you wanted to see the number of correct answers for each employee, the query can be something as simple as:
select results.employeeid, count(*)
from results
where results.result = true
group by results.employeeid
Alternatively, if you wanted to view the number of employees answering each question correctly (for example, to understand which questions most employees got wrong), you might use something like:
select results.questionid, count(*)
from results
where results.result = true
group by results.questionid
The above are obviously very basic example queries, and you would likely want to join the Results table to an Employees table and a Questions table to obtain richer information about the results.
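Since the original question asked for the number of incorrect answers per field, here is a sketch of that count against the normalised Results table. It uses Python's sqlite3 module rather than Access (an assumption made so the example runs on its own), stores the Boolean as 1/0, and uses invented rows.

```python
import sqlite3

# SQLite stands in for Access; Result holds 1 for correct and 0 for incorrect.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Results (EmployeeID INTEGER, QuestionID INTEGER, Result INTEGER)"
)
conn.executemany(
    "INSERT INTO Results VALUES (?, ?, ?)",
    [
        (1, 1, 1), (1, 2, 0),
        (2, 1, 1), (2, 2, 0),
        (3, 1, 0), (3, 2, 1),
    ],
)

# Number of incorrect answers for each question (one row per question).
incorrect_per_question = conn.execute(
    "SELECT QuestionID, COUNT(*) FROM Results "
    "WHERE Result = 0 GROUP BY QuestionID ORDER BY QuestionID"
).fetchall()

print(incorrect_per_question)
```

Adding a sixteenth question only adds rows to Results; the query itself stays the same.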
Contrast the above with your current database structure -
Per your original question:
The second table has 15 fields - A1 through A15 with the words correct or incorrect in each field. I need the total number of incorrect occurrences for each field, not for the entire table.
Assuming that you want to view the number of incorrect answers by employee, you are forced to use an incredibly messy query such as the following:
select
employeeid,
iif(A1='incorrect',1,0)+
iif(A2='incorrect',1,0)+
iif(A3='incorrect',1,0)+
iif(A4='incorrect',1,0)+
iif(A5='incorrect',1,0)+
iif(A6='incorrect',1,0)+
iif(A7='incorrect',1,0)+
iif(A8='incorrect',1,0)+
iif(A9='incorrect',1,0)+
iif(A10='incorrect',1,0)+
iif(A11='incorrect',1,0)+
iif(A12='incorrect',1,0)+
iif(A13='incorrect',1,0)+
iif(A14='incorrect',1,0)+
iif(A15='incorrect',1,0) as IncorrectAnswers
from
YourTable
Here, notice that the answer numbers are also hard-coded into the query, meaning that if you decide to add a new question or remove an existing question, not only would you need to restructure your entire database, but queries such as the above would also need to be rewritten.

De-conflicting data strategies when associating two tables

I have some tables that look like this:
+------------+------------+------------+----------------+----------+
| Locations | HotelsA | HotelsB | HotelsB-People | People |
+------------+------------+------------+----------------+----------+
| LocationID | HotelAID | HotelBID | PersonID | PersonID |
| Address | HotelAName | HotelBName | HotelBID | Name |
| | LocationID | | | |
+------------+------------+------------+----------------+----------+
Currently, if I want to know what the address is of the hotel someone is staying at there is no way to make that association without manually looking through the names of HotelsA for something that looks similar enough to the name of HotelsB.
I would like to remove HotelBName and replace it with a foreign key to HotelAID (in this example it would actually make more sense to change HotelsB-People to HotelsAPeople, but there are additional columns that I have omitted for simplicity that prevent that solution from being viable in my particular case). The end result would look like this:
+------------+------------+-------------+----------------+----------+
| Locations | HotelsA | HotelsB | HotelsB-People | People |
+------------+------------+-------------+----------------+----------+
| LocationID | HotelAID | HotelBID | PersonID | PersonID |
| Address | HotelAName | FK_HotelAID | HotelBID | Name |
| | LocationID | | | |
+------------+------------+-------------+----------------+----------+
HotelAName and HotelBName are likely very similar, but inconsistently so. You could have "Springfield Marriott" in one and "Marriott, Springfield" in the other, but there's no consistency (no guarantee anything is spelled correctly either).
Are there any strategies for how this could be done as well as considerations for how to make the applications that utilize this data continue to work during the time it takes to fix all of the data?
Thank you.
I would just add the FK_HotelAID column to the HotelsB table. Assigning the correct id to that column will largely be a manual process, although you could try joining HotelAName to HotelBName to at least cover the ids for names that have a perfect match. Your applications should continue to work while you do this. After you've assigned all the ids in HotelsB you can define the foreign key and then delete the HotelBName column. Of course, any references that the applications make to HotelBName will need to be modified.
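The exact-match pass could be sketched as follows. Python's sqlite3 module is used here instead of the poster's unnamed database (an assumption), and the hotel rows, including the "Shelbyville Inn" names, are invented for illustration. Rows left with a NULL FK_HotelAID are the ones that still need manual review.

```python
import sqlite3

# SQLite stands in for the real database; all hotel rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HotelsA (HotelAID INTEGER PRIMARY KEY, HotelAName TEXT);
CREATE TABLE HotelsB (HotelBID INTEGER PRIMARY KEY, HotelBName TEXT,
                      FK_HotelAID INTEGER);
INSERT INTO HotelsA VALUES (1, 'Springfield Marriott'), (2, 'Shelbyville Inn');
INSERT INTO HotelsB (HotelBID, HotelBName) VALUES
    (10, 'Springfield Marriott'), (11, 'Inn, Shelbyville');
""")

# Fill in the new column wherever the two names match exactly.
conn.execute("""
UPDATE HotelsB
SET FK_HotelAID = (SELECT HotelAID FROM HotelsA
                   WHERE HotelAName = HotelsB.HotelBName)
WHERE FK_HotelAID IS NULL
""")

matched = conn.execute(
    "SELECT HotelBID, FK_HotelAID FROM HotelsB WHERE FK_HotelAID IS NOT NULL"
).fetchall()
# Whatever is still NULL has to be matched by hand.
unmatched = conn.execute(
    "SELECT HotelBID, HotelBName FROM HotelsB WHERE FK_HotelAID IS NULL"
).fetchall()

print(matched)
print(unmatched)
```

Because HotelBName is left in place until every row has an id, existing applications can keep reading it during the clean-up.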

What type of data structure should I use for mimicking a file-system?

The title might be worded strangely, but it's probably because I don't even know if I'm asking the right question.
So essentially what I'm trying to build is a "breadcrumbish" categorization system (like a file directory) where each node has a parent (except for root) and each node can contain either data or another node. This will be used for organizing email addresses in a database. I have a system right now where you can create a "group" and add email addresses to that group, but it would be very nice to add an organizational system to it.
This (in my head) is in a tree format, but I don't know what tree.
The issue I'm having is building it using MySQL. It's easy to traverse trees that are in memory, but in a database, it's a bit trickier.
Image of tree: http://j.imagehost.org/0917/asdf.png
SELECT * FROM Businesses:
Tim's Hardware Store, 7-11, Kwik-E-Mart, Cub Foods, Bob's Grocery Store, CONGLOM-O
SELECT * FROM Grocery Stores:
Cub Foods, Bob's Grocery Store, CONGLOM-O
SELECT * FROM Big Grocery Stores:
CONGLOM-O
SELECT * FROM Churches:
St. Peter's Church, St. John's Church
I think this should be enough information so I can accurately describe what my goal is.
Well, there are a few patterns you could use. Which one is right depends on your needs.
Do you need to select a node and all its children? If so, then a Nested Set Model may be better for you. The table would look like this:
| Name | Left | Right |
| Emails | 1 | 12 |
| Business | 2 | 7 |
| Tim's | 3 | 4 |
| 7-11 | 5 | 6 |
| Churches | 8 | 11 |
| St. Pete | 9 | 10 |
So then, to find anything below a node, just do (Left and Right are reserved words in MySQL, so they are quoted with backticks):
SELECT name FROM nodes WHERE `Left` > *yourleftnode* AND `Right` < *yourrightnode*
To find everything above the node:
SELECT name FROM nodes WHERE `Left` < *yourleftnode* AND `Right` > *yourrightnode*
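A small sketch of both lookups, using the sample rows from the table above. Python's sqlite3 module stands in for MySQL, and the columns are renamed lft and rgt (my choice, not part of the answer) to sidestep the reserved words; the helper function names are hypothetical too.

```python
import sqlite3

# SQLite stands in for MySQL; lft/rgt replace the reserved words Left/Right.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("Emails", 1, 12), ("Business", 2, 7), ("Tim's", 3, 4),
        ("7-11", 5, 6), ("Churches", 8, 11), ("St. Pete", 9, 10),
    ],
)

def bounds(name):
    """Look up the (lft, rgt) pair of a node by name."""
    return conn.execute(
        "SELECT lft, rgt FROM nodes WHERE name = ?", (name,)
    ).fetchone()

def descendants(name):
    """Everything strictly inside the node's interval lies below it."""
    lft, rgt = bounds(name)
    rows = conn.execute(
        "SELECT name FROM nodes WHERE lft > ? AND rgt < ? ORDER BY lft",
        (lft, rgt),
    )
    return [r[0] for r in rows]

def ancestors(name):
    """Everything whose interval encloses the node's interval lies above it."""
    lft, rgt = bounds(name)
    rows = conn.execute(
        "SELECT name FROM nodes WHERE lft < ? AND rgt > ? ORDER BY lft",
        (lft, rgt),
    )
    return [r[0] for r in rows]

print(descendants("Business"))
print(ancestors("Tim's"))
```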
If you only want to query for a specific level, you could use an Adjacency List Model:
| Id | Name | Parent_Id |
| 1 | Email | null |
| 2 | Business | 1 |
| 3 | Tim's | 2 |
To find everything on the same level, just do:
SELECT name FROM nodes WHERE parent_id = *yourparentnode*
Of course, there's nothing stopping you from taking a hybrid approach, which lets you use whichever column suits the query at hand:
| Id | Name | Parent_Id | Left | Right | Path |
| 1 | Email | null | 1 | 6 | / |
| 2 | Business | 1 | 2 | 5 | /Email/ |
| 3 | Tim's | 2 | 3 | 4 | /Email/Business/ |
Really, it's just a matter of your needs...
The easiest way to do it would be something like this:
Group
- GroupID (PK)
- ParentGroupID
- GroupName
People
- PersonID (PK)
- EmailAddress
- FirstName
- LastName
GroupMembership
- GroupID (PK)
- PersonID (PK)
That should establish a structure where you can have groups that have parent groups and people that can be members of groups (or multiple groups). If a person can only be a member of one group, then get rid of the GroupMembership table and just put a GroupID on the People table.
Complex queries against this structure can get difficult though. There are other less intuitive ways to model this that make querying easier (but often make updates more difficult). If the number of groups is small, the easiest way to handle queries against this is often to load the whole tree of Groups into memory, cache it, and use that to build your queries.
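The "load the whole tree into memory" idea might look roughly like this. It is a sketch in Python: the group rows are invented (loosely following the question's example), and in practice they would come from a single SELECT on the Group table; the function name is hypothetical.

```python
from collections import defaultdict

# (GroupID, ParentGroupID, GroupName) rows, as one SELECT on Group might
# return them. The data is invented for illustration.
groups = [
    (1, None, "Emails"),
    (2, 1, "Businesses"),
    (3, 2, "Grocery Stores"),
    (4, 3, "Big Grocery Stores"),
    (5, 1, "Churches"),
]

# Index the children of every group once, so lookups need no further queries.
children = defaultdict(list)
for group_id, parent_id, _name in groups:
    children[parent_id].append(group_id)

def subtree_ids(root_id):
    """Return root_id followed by all of its descendants, depth-first."""
    ids = [root_id]
    for child_id in children.get(root_id, []):
        ids.extend(subtree_ids(child_id))
    return ids

business_ids = subtree_ids(2)
print(business_ids)
```

The resulting id list can then be bound into a `GroupID IN (...)` condition on GroupMembership to fetch every person under a given group.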
As always when I see questions about modeling trees and hierarchies, my suggestion is that you get hold of a copy of Joe Celko's book on the subject. He presents various ways to model them in an RDBMS, some of which are fairly imaginative, and he gives the pros and cons for each pattern.
Create an object Group which has a name, many email addresses, and a parent, which can be null.

How do you merge rows from 2 SQL tables without duplicating rows?

I guess this query is a little basic and I should know more about SQL, but I haven't done much with joins yet, which I guess is the solution here.
What I have is a table of people and a table of job roles they hold. A person can have multiple jobs and I wish to have one set of results with a row per person containing their details and their job roles.
Two example tables (people and job_roles) are below so you can understand the question more easily.
People
id | name | email_address | phone_number
1 | paul | paul@example.com | 123456
2 | bob | bob@example.com | 567891
3 | bart | bart@example.com | 987561
job_roles
id | person_id | job_title | department
1 | 1 | secretary | hr
2 | 1 | assistant | media
3 | 2 | manager | IT
4 | 3 | finance clerk | finance
4 | 3 | manager | IT
so that I can output each person and their roles like such
Name: paul
Email Address: paul@example.com
Phone: 123456
Job Roles:
Secretary for HR department
Assistant for media department
_______
Name: bob
Email address: bob@example.com
Phone: 567891
Job roles:
Manager for IT department
So how would I get each person's information (from the people table) along with their job details (from the job_roles table) to output like the example above? I guess it would be some way of merging their jobs and their relevant departments into a jobs column that can be split up for output, but maybe there is a better way. What would the SQL look like?
Thanks
Paul
PS it would be a MySQL database if that makes any difference
It looks like a straightforward join:
SELECT p.*, r.*
FROM People AS p INNER JOIN job_roles AS r ON p.id = r.person_id
ORDER BY p.name;
The remainder of the work is formatting; that's best done by a report package.
Thanks for the quick response, that seems a good start but you get multiple rows per person like (you have to imagine this is a table as you don't seem to be able to format in comments):
id | Name | email_address | phone_number | job_role | department
1 | paul | paul@example.com | 123456 | secretary | HR
1 | paul | paul@example.com | 123456 | assistant | media
2 | bob | bob@example.com | 567891 | manager | IT
I would like one row per person ideally with all their job roles in it if that's possible?
It depends on your DBMS, but most available ones do not support RVAs - relation-valued attributes. What you'd like is to have the job role and department part of the result like a table associated with the user:
+----+------+------------------+--------------+------------------------+
| id | Name | email_address | phone_number | dept_role |
+----+------+------------------+--------------+------------------------+
| | | | | +--------------------+ |
| | | | | | job_role | dept | |
| 1 | paul | paul@example.com | 123456 | | secretary | HR | |
| | | | | | assistant | media | |
| | | | | +--------------------+ |
+----+------+------------------+--------------+------------------------+
| | | | | +--------------------+ |
| | | | | | job_role | dept | |
| 2 | bob | bob@example.com | 567891 | | manager | IT | |
| | | | | +--------------------+ |
+----+------+------------------+--------------+------------------------+
This accurately represents the information you want, but is not usually an option.
So, what happens next depends on your report generation tool. Using the one I'm most familiar with, (Informix ACE, part of Informix SQL, available from IBM for use with the Informix DBMSs), you would simply ensure that the data is sorted and then print the name, email address and phone number in the 'BEFORE GROUP OF id' section of the report, and in the 'ON EVERY ROW' section you would process (print) just the role and department information.
It is often a good idea to separate the report formatting from the data retrieval operations; this is an example of where it is necessary unless your DBMS has unusual features to help with the formatting of selected data.
Oh dear, that sounds very complicated and not something I could run easily on a MySQL database in a PHP page?
The RVA stuff - you're right, that is not for MySQL and PHP.
On the other hand, there are millions of reports (meaning results from queries that are formatted for presentation to a user) that do roughly this. The technical term for them is 'Control-Break Report', but the basic idea is not hard.
You keep a record of the 'id' number you last processed - you can initialize that to -1 or 0.
When the current record has a different id number from the previous number, then you have a new user and you need to start a new set of output lines for the new user and print the name, email address and phone number (and change the last processed id number). When the current record has the same id number, then all you do is process the job role and department information (not the name, email address and phone number). The 'break' occurs when the id number changes. With a single level of control-break, it is not hard; if you have 4 or 5 levels, you have to do more work, and that's why there are reporting packages to handle it.
So, it is not hard - it just requires a little care.
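A single-level control break can be sketched in a few lines. The example is in Python rather than PHP (an assumption made to keep it short and runnable), the rows mirror the question's sample data, and they are assumed to arrive already sorted by person id; the function name is hypothetical.

```python
# Rows as the joined query would return them, sorted by person id:
# (id, name, job_title, department). Values mirror the question's sample data.
rows = [
    (1, "paul", "secretary", "hr"),
    (1, "paul", "assistant", "media"),
    (2, "bob", "manager", "IT"),
]

def control_break_report(rows):
    """Emit the person header once per id, then one line per job role."""
    lines = []
    last_id = None  # id of the person handled on the previous row
    for person_id, name, job_title, department in rows:
        if person_id != last_id:
            # The 'break': a new person starts, so print the header once.
            lines.append(f"Name: {name}")
            lines.append("Job Roles:")
            last_id = person_id
        # Printed on every row, whether or not a break occurred.
        lines.append(f"  {job_title} for {department} department")
    return lines

report = control_break_report(rows)
print("\n".join(report))
```

The same loop carries over to PHP almost line for line: keep the last id in a variable and compare it on each fetched row.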
RE:
I was hoping SQL could do something
clever and join the rows together
nicely so I had essentially a jobs
column with that persons jobs in it.
You can get fairly close with
SELECT p.id, p.name, p.email_address, p.phone_number,
group_concat(concat(job_title, ' for ', department, ' department') SEPARATOR '\n') AS JobRoles
FROM People AS p
INNER JOIN job_roles AS r ON p.id = r.person_id
GROUP BY p.id, p.name, p.email_address, p.phone_number
ORDER BY p.name;
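To try the idea without a MySQL server, the following sketch runs a similar aggregation in SQLite through Python's sqlite3 module. This is an approximation, not the MySQL query itself: SQLite writes the separator as a second argument to group_concat instead of the SEPARATOR keyword, and uses || for string concatenation. The table contents are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT,
                         email_address TEXT, phone_number TEXT);
    CREATE TABLE job_roles (person_id INTEGER, job_title TEXT, department TEXT);
    INSERT INTO People VALUES (1, 'Ann', 'ann@example.com', '555-0100'),
                              (2, 'Bob', 'bob@example.com', '555-0101');
    INSERT INTO job_roles VALUES (1, 'Analyst', 'Finance'),
                                 (1, 'Trainer', 'HR'),
                                 (2, 'Developer', 'IT');
""")

# One output row per person; all of that person's roles are joined by newlines.
query = """
    SELECT p.id, p.name,
           group_concat(r.job_title || ' for ' || r.department || ' department',
                        char(10)) AS JobRoles
    FROM People AS p
    INNER JOIN job_roles AS r ON p.id = r.person_id
    GROUP BY p.id, p.name
    ORDER BY p.name
"""

# The order of items inside each concatenated string is not guaranteed,
# so the roles are sorted after splitting.
result = {name: sorted(roles.split("\n")) for _id, name, roles in conn.execute(query)}
print(result)
```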
Doing it the way you want would mean the result set could need an unbounded number of columns, which would be very messy. For example, you could left join the jobs table 10 times and get job1, job2, ... job10.
I would do a single join, then use PHP to check whether the person's id is the same from one row to the next.
One way might be to left outer join the tables and then load them up into an array using
$people_array =array();
while($row1=mysql_fetch_assoc($extract1)){
$people_array[] = $row1;
}
and then loop through using
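// Assumes rows are sorted by person and every person has exactly
// $number_of_roles rows (at least 1).
for ($x = 0; $x < count($people_array); )
{
    // Print the person's details once.
    echo $people_array[$x]['id'];
    echo $people_array[$x]['name'];
    echo $people_array[$x]['email_address'];
    echo $people_array[$x]['phone_number'];
    // Then print one role per row; $x advances here, not in the outer loop.
    for ($y = 0; $y < $number_of_roles && $x < count($people_array); $y++)
    {
        echo $people_array[$x]['job_title'];
        echo $people_array[$x]['department'];
        $x++;
    }
}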
You might have to play with the query and the loops a bit, but it should do generally what you want. For it to work as above, every person would have to have the same number of roles, but you may be able to fill in the blanks in your table.

Retrieve comma delimited data from a field

I've created a form in PHP that collects basic information. I have a list box that allows multiple items to be selected (e.g. Housing, rent, food, water). If multiple items are selected, they are stored in a field called Needs, separated by commas.
I have created a report ordered by each person's needs. The people who have only one need are sorted correctly, but the people who have multiple needs are sorted exactly as the string passed to the database (e.g. housing, rent, food, water), which is not what I want.
Is there a way to separate the multiple values in this field using SQL to count each need instance/occurrence as 1 so that there are no comma delimitations shown in the results?
Your database is not in the first normal form. A non-normalized database will be very problematic to use and to query, as you are actually experiencing.
In general, you should be using at least the following structure. It can still be normalized further, but I hope this gets you going in the right direction:
CREATE TABLE users (
user_id int,
name varchar(100)
);
CREATE TABLE users_needs (
need varchar(100),
user_id int
);
Then you should store the data as follows:
-- TABLE: users
+---------+-------+
| user_id | name |
+---------+-------+
| 1 | joe |
| 2 | peter |
| 3 | steve |
| 4 | clint |
+---------+-------+
-- TABLE: users_needs
+---------+----------+
| need | user_id |
+---------+----------+
| housing | 1 |
| water | 1 |
| food | 1 |
| housing | 2 |
| rent | 2 |
| water | 2 |
| housing | 3 |
+---------+----------+
Note how the users_needs table is defining the relationship between one user and one or many needs (or none at all, as for user number 4.)
To normalise your database further, you should also use another table called needs, as follows:
-- TABLE: needs
+---------+---------+
| need_id | name |
+---------+---------+
| 1 | housing |
| 2 | water |
| 3 | food |
| 4 | rent |
+---------+---------+
Then the users_needs table should just refer to a candidate key of the needs table instead of repeating the text.
-- TABLE: users_needs (instead of the previous one)
+---------+----------+
| need_id | user_id |
+---------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 2 | 2 |
| 1 | 3 |
+---------+----------+
You may also be interested in checking out the following Wikipedia article for further reading about repeating values inside columns:
Wikipedia: First normal form - Repeating groups within columns
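If the data already exists in comma-delimited form, the split into the normalised tables could be done once in application code. The following is a minimal sketch in Python with SQLite, not a definitive migration: the legacy_rows list stands in for the existing table (its shape is an assumption), and the PRIMARY KEY and UNIQUE constraints are additions that the original table definitions do not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE needs (need_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE users_needs (need_id INTEGER, user_id INTEGER);
""")

# Hypothetical legacy rows: (user_id, name, comma-delimited Needs field).
legacy_rows = [
    (1, "joe", "housing, water, food"),
    (2, "peter", "housing,rent,water"),
    (3, "steve", "housing"),
    (4, "clint", ""),
]

for user_id, name, needs_field in legacy_rows:
    conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
    for need in (part.strip() for part in needs_field.split(",")):
        if not need:
            continue  # an empty field means the user has no needs
        # Add the need to the lookup table only if it is not there yet.
        conn.execute("INSERT OR IGNORE INTO needs (name) VALUES (?)", (need,))
        need_id = conn.execute(
            "SELECT need_id FROM needs WHERE name = ?", (need,)
        ).fetchone()[0]
        conn.execute("INSERT INTO users_needs VALUES (?, ?)", (need_id, user_id))

needs_table = conn.execute("SELECT need_id, name FROM needs ORDER BY need_id").fetchall()
pairs = conn.execute(
    "SELECT need_id, user_id FROM users_needs ORDER BY user_id, need_id"
).fetchall()
print(needs_table)
print(pairs)
```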
UPDATE:
To fully answer your question: if you follow the above guidelines, sorting, counting and aggregating the data should become straightforward.
To sort the result-set by needs, you would be able to do the following:
SELECT users.name, users_needs.need
FROM users
INNER JOIN users_needs ON (users_needs.user_id = users.user_id)
ORDER BY users_needs.need;
You would also be able to count how many needs each user has selected, for example:
SELECT users.name, COUNT(users_needs.need) AS number_of_needs
FROM users
LEFT JOIN users_needs ON (users_needs.user_id = users.user_id)
GROUP BY users.user_id, users.name
ORDER BY number_of_needs;
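As a quick check of the sorting and counting queries, here is a sketch that loads the sample rows from the first users and users_needs layout (the one with the need text column) into an in-memory SQLite database via Python. It assumes the joins target users_needs, since that is the table holding user_id, and it adds users.name as a secondary sort key purely so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INT, name VARCHAR(100));
    CREATE TABLE users_needs (need VARCHAR(100), user_id INT);
    INSERT INTO users VALUES (1, 'joe'), (2, 'peter'), (3, 'steve'), (4, 'clint');
    INSERT INTO users_needs VALUES
        ('housing', 1), ('water', 1), ('food', 1),
        ('housing', 2), ('rent', 2), ('water', 2),
        ('housing', 3);
""")

# One row per (user, need) pair, sorted by need.
by_need = conn.execute("""
    SELECT users.name, users_needs.need
    FROM users
    INNER JOIN users_needs ON users_needs.user_id = users.user_id
    ORDER BY users_needs.need, users.name
""").fetchall()

# LEFT JOIN keeps users with no needs; COUNT(column) ignores the NULLs it produces.
counts = conn.execute("""
    SELECT users.name, COUNT(users_needs.need) AS number_of_needs
    FROM users
    LEFT JOIN users_needs ON users_needs.user_id = users.user_id
    GROUP BY users.user_id, users.name
    ORDER BY number_of_needs, users.name
""").fetchall()

print(by_need)
print(counts)
```

Note that clint appears in the count with 0 needs only because of the LEFT JOIN; an INNER JOIN would drop him from the result.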
I'm a little confused by the goal. Is this a UI problem or are you just having trouble determining who has multiple needs?
The number of needs is the difference:
Len([Needs]) - Len(Replace([Needs],',','')) + 1
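The length-difference formula above can be sketched in Python as follows. The function name is made up for the example, and the guard for an empty field is an addition: the bare formula would report 1 for a field with no needs at all.

```python
def count_needs(needs_field):
    """Count comma-separated items by comparing lengths with and without commas."""
    if not needs_field.strip():
        return 0  # the bare formula would give 1 for an empty field
    # Each comma removed shortens the string by one; items = commas + 1.
    return len(needs_field) - len(needs_field.replace(",", "")) + 1


print(count_needs("housing, rent, food, water"))
print(count_needs("housing"))
```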
Can you provide more information about the Sort you're trying to accomplish?
UPDATE:
I think these Oracle-based posts may have what you're looking for: post and post. The only difference is that you would probably be better off using the method I list above to find the number of comma-delimited pieces rather than doing the translate(...) that the author suggests. Hope this helps - it's Oracle-based, but I don't see .