Check if a value exists in the child-parent tree - sql

I'm creating a simple directory listing page where you can specify what kind of thing you want to list in the directory e.g. a person or a company.
Each user has an UserTypeID and there is a dbo.UserType lookup table. The dbo.UserType lookup table is like this:
UserTypeID | UserTypeParentID | Name
1 NULL Person
2 NULL Company
3 2 IT
4 3 Accounting Software
In the dbo.Users table we have records like this:
UserID | UserTypeID | Name
1 1 Jenny Smith
2 1 Malcolm Brown
3 2 Wall Mart
4 3 Microsoft
5 4 Sage
My SQL (so far) is very simple: (excuse the pseudo-code style)
DECLARE #UserTypeID int
SELECT
*
FROM
dbo.Users u
INNER JOIN
dbo.UserType ut
WHERE
ut.UserTypeID = #UserTypeID
The problem is here is that when people want to search for companies they will enter in '2' as the UserTypeID. But both Microsoft and Sage won't show up because their UserTypeIDs are 3 and 4 respectively. But its the final UserTypeParentID which tells me that they're both Companies.
How could I rewrite the SQL to ask it to return to return records where the UserTypeID = #UserTypeID or where its final UserTypeParentID is also equal to #UserTypeID. Or am I going about this the wrong way?

Schema Change
I would suggest you to break it down this schema a little bit more, to make your queries and life simpler, with this current schema you will end up writing a recursive query every time you want to get simplest data from your Users table, and trust me you dont want to do this to yourself.
I would break down this schema of these tables as follow:
dbo.Users
UserID | UserName
1 | Jenny
2 | Microsoft
3 | Sage
dbo.UserTypes_Type
TypeID | TypeName
1 | Person
2 | IT
3 | Compnay
4 | Accounting Software
dbo.UserTypes
UserID | TypeID
1 | 1
2 | 2
2 | 3
3 | 2
3 | 3
3 | 4

You say that you are "creating" this - excellent because you have the opportunity to reconsider your whole approach.
Dealing with hierarchical data in a relational database is problematic because it is not designed for it - the model you choose to represent it will have a huge impact on the performance and ease of construction of your queries.
You have opted for an Adjacently List model which is great for inserts (and deletes) but a bugger for selects because the query has to effectively reconstruct the hierarchy path. By the way an Adjacency List is the model almost everyone goes for on their first attempt.
Everything is a trade off so you should decide what queries will be most common - selects (and updates) or inserts (and deletes). See this question for starters. Also, since SQL Server 2008, there is a native HeirachyID datatype (see this) which may be of assistance.
Of course, you could store your data in an XML file (in SQL Server or not) which is designed for hierarchical data.

Related

Specify Association Depth in Nhibernate Criteria

Okay, lets say I have A Many-To-Many, Student/Class objects. With the following mapping
HasManyToMany(m => m.Classes)
.Table("StudentClasses")
.LazyLoad()
.Cascade.None();
and query:
return _session.CreateCriteria<Student>()
.Add(Restrictions.Eq("Id", studentId))
.SetCacheable(false)
.UniqueResult<Student>();
My Tables
ID | Student ID | Class StudentId | ClassId
============ =========== ===================
1 | John 1 | Algebra 1 | 1
2 | Sue 2 | Biology 1 | 2
3 | Frank 3 | Speech 2 | 2
4 | Jim 4 | Athletics 2 | 5
5 | Frankenstein 5 | History 5 | 5
What I want is John, with Algebra and Biology This comes, But Biology comes with John and Sue, she in turn gets me Bio and History, which populates sue and Frankenstein. You don't want to get Frankenstein (or anything beyond biology). Also at any point you can start cycling back around in circles.
How do I specify not to populate that deeply? SetMaxRows obviously isn't want I'm looking for, as It cuts down on the total number of rows, I only want to hydrate the first level. I'm particularly confused because I thought LazyLoading would force me to specify exactly what I did want to hydrate
Lazy loading allows you to not have to specify what you want to hydrate. The collections will be populated "lazily" at the time you access them. Your above query will only fetch a single row: John. When you access john.Classes, NHibernate will execute another query behind the scenes to fetch John's classes. Then when you access the biology.Students property, NHibernate will execute another query to populate that collection... and so on.
You definitely do want lazy-loading on this relationship (which is the default, by the way). If you turned lazy loading off, then this would do exactly what you're saying you want to avoid - NHibernate would know that it's not allowed to lazily load these collections, so it would start working to load them as soon as you loaded a student, which could potentially end up loading all of the data in these three tables, and in a very inefficient manner with lots of round-trips to the database.
If you know that all you want is "John, with Algebra and Biology", then you should add these lines to your query:
.SetFetchMode("Classes", FetchMode.Eager)
.SetResultTransformer(Transformers.DistinctRootEntity)
The SetFetchMode tells NHibernate to use left outer joins to pre-populate the student.Classes collection. The DistintRootEntity tells NHibernate to not return the duplicate students that will be generated by the left outer join.

How to query huge MySQL databases?

I have 2 tables, a purchases table and a users table. Records in the purchases table looks like this:
purchase_id | product_ids | customer_id
---------------------------------------
1 | (99)(34)(2) | 3
2 | (45)(3)(74) | 75
Users table looks like this:
user_id | email | password
----------------------------------------
3 | joeShmoe#gmail.com | password
75 | nolaHue#aol.com | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is, what will happen when tens of thousands of records are inserted into the purchases table. I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email | password | purchases
------------------------------------------------------
1 | joeShmoe#gmail.com | password | (99)(34)(2)
2 | nolaHue#aol.com | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea, will it help better performance or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations, for example how does amazon query it's database for user's purchase history since they have millions of customers. How come there queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier. In that design I am separating the product ids of each purchase using parenthesis and using this as the delimiter to tell the ids apart when extracting them via PHP.
Instead should I be storing each product id separately in the "purchases" table so it looks like this?:
purchase_id | product_ids | customer_id
---------------------------------------
1 | 99 | 3
1 | 34 | 3
1 | 2 | 3
2 | 45 | 75
2 | 3 | 75
2 | 74 | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
You should consider building these into your design before you violate relational database fundamentals.
Your database already looks messy, since you are storing multiple product_ids in a single field, instead of creating an "association" table like this.
_____product_purchases____
purchase_id | product_id |
--------------------------
1 | 99 |
1 | 34 |
1 | 2 |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE purchases.customer_id = $user_id
But this also gives you more possibilities, like finding out how many product #99 were bought, getting a list of all customers that purchased product #34 etc.
And of course don't forget about indexes, that will make all of this much faster.
By doing this with your schema, you will break the entity-relationship of your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve your query performances, mostly by storing data in the RAM.
For example - run the query once, store it in the Memcache, if the user refresh the page, you get the data from Memcache, not from MySQL, which avoids querying your database a second time.
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy weensy machine with limited ram and harddrive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely extensible, and don't let the fact that it's free convince you of otherwise. Keeping the two tables separate is probably best, not only because it keeps the db more normal, but having more indices will speed queries. A 10,000 record database is relatively small in deference to multi-hundred-million record health record databases.
As far as Amazon and Google, they hire hundreds of developers to write specialized query languages for their specific application needs... not something developers like us have the resources to fund.

MySQL Database Design with Internationalization

I'm going to start work on a medium sized application, and i'm planning it's db design.
One thing that I'm not sure about is this.
I will have many tables which will need internationalization, such as: "membership_options, gender_options, language_options etc"
Each of these tables will share common i18n fields, like:
"title, alternative_title, short_description, description"
In your opinion which is the best way to do it?
Have an i18n table with the same fields for each of the tables that will need them?
or do something like:
Membership table Gender table
---------------- --------------
id | created_at id | created_at
1 - 22.03.2001 1 - 14.08.2002
2 - 22.03.2001 2 - 14.08.2002
General translation table
-------------------------
record_id | table_name | string_name | alternative_title| .... |id_language
1 - membership regular null 1 (english)
1 - membership normale null 2 (italian)
1 - gender man null 1(english)
1 -gender uomo null 2(italian)
This would avoid me repeating something like:
membership_translation table
-----------------------------
membership_id | name | alternative_title | id_lang
1 regular null 1
1 normale null 2
gender_translation table
-----------------------------
gender_id | name | alternative_title | id_lang
1 man null 1
1 uomo null 2
and so on, so i would probably reduce the number of db tables, but i'm not sure about performance.I'm not much of a DB designer, so please let me know.
The most common way I've seen this done is with two tables, membership and membership_ml, with one storing the base values and the ml table storing the localized strings. This is similar to your second option. Most of the systems I see like this are made that way because they weren't designed with internationalization in mind from the get go, so the extra _ml tables were "tacked on" later.
What I think is a better option is similar to your first option, but a little bit different. You would have a central table for storing all the translations, but instead of putting the table name and field name in there, you would use tokens and a central "Content" table to store all the translations. That way you can enforce some kind of RI between the tokens in the base table and the translations in the Content table if you want as well.
I actually asked a question about this very thing a while back, so you can have a look at that for some more info (rather than repasting the schema examples here).
I also think the best solution is to keep translations on different table. This approach use Open Cart which is open source and you can take a look the way it deals with the problem. Another source of information is here "http://www.gsdesign.ro/blog/multilanguage-database-design-approach/" especially on the comments sections

Creating new table from data of other tables

I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields,
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Give a person's "Name", I would like to return a table of "Weight" measurements for each children. Ie: if John has 3 children names Alice, Bob and Carol, then I want a table like this
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
d.value
from data d,
persons child,
persons parent,
attributes a
where parent.name='John'
and child.parent_id = parent.person_id
and d.attribute_id = a.attribute_id
and a.attribute_type = "Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to select the attributes to be between a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?
1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
If you can manage to do it in SQL it still won't be enough info to create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can probably just use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least.
2) You can join on attributes for each unique attribute:
SELECT [...] FROM data AS d
JOIN persons AS p ON d.person_id = p.person_id
JOIN attributes AS weight ON p.attribute_id = weight.attribute_id
HAVING weight.attribute_type = 'Weight'
JOIN attributes AS height ON p.attribute_id = height.attribute_id
HAVING height.attribute_type = 'Height'
[...]
(The way you're joining in the original query is just shorthand for [INNER] JOIN .. ON, same thing except you'll need the HAVING clause in there)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.
when you do a select, you need to specify the exact columns you want. In other words you can't return the Nth child's name. Ie this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same amount of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then in a report remap it to how you want it displayed

cloning hierarchical data

let's assume i have a self referencing hierarchical table build the classical way like this one:
CREATE TABLE test
(name text,id serial primary key,parent_id integer
references test);
insert into test (name,id,parent_id) values
('root1',1,NULL),('root2',2,NULL),('root1sub1',3,1),('root1sub2',4,1),('root
2sub1',5,2),('root2sub2',6,2);
testdb=# select * from test;
name | id | parent_id
-----------+----+-----------
root1 | 1 |
root2 | 2 |
root1sub1 | 3 | 1
root1sub2 | 4 | 1
root2sub1 | 5 | 2
root2sub2 | 6 | 2
What i need now is a function (preferrably in plain sql) that would take the id of a test record and
clone all attached records (including the given one). The cloned records need to have new ids of course. The desired result
would like this for example:
Select * from cloningfunction(2);
name | id | parent_id
-----------+----+-----------
root2 | 7 |
root2sub1 | 8 | 7
root2sub2 | 9 | 7
Any pointers? Im using PostgreSQL 8.3.
Pulling this result in recursively is tricky (although possible). However, it's typically not very efficient and there is a much better way to solve this problem.
Basically, you augment the table with an extra column which traces the tree to the top - I'll call it the "Upchain". It's just a long string that looks something like this:
name | id | parent_id | upchain
root1 | 1 | NULL | 1:
root2 | 2 | NULL | 2:
root1sub1 | 3 | 1 | 1:3:
root1sub2 | 4 | 1 | 1:4:
root2sub1 | 5 | 2 | 2:5:
root2sub2 | 6 | 2 | 2:6:
root1sub1sub1 | 7 | 3 | 1:3:7:
It's very easy to keep this field updated by using a trigger on the table. (Apologies for terminology but I have always done this with SQL Server). Every time you add or delete a record, or update the parent_id field, you just need to update the upchain field on that part of the tree. That's a trivial job because you just take the upchain of the parent record and append the id of the current record. All child records are easily identified using LIKE to check for records with the starting string in their upchain.
What you're doing effectively is trading a bit of extra write activity for a big saving when you come to read the data.
When you want to select a complete branch in the tree it's trivial. Suppose you want the branch under node 1. Node 1 has an upchain '1:' so you know that any node in the branch of the tree under that node must have an upchain starting '1:...'. So you just do this:
SELECT *
FROM table
WHERE upchain LIKE '1:%'
This is extremely fast (index the upchain field of course). As a bonus it also makes a lot of activities extremely simple, such as finding partial trees, level within the tree, etc.
I've used this in applications that track large employee reporting hierarchies but you can use it for pretty much any tree structure (parts breakdown, etc.)
Notes (for anyone who's interested):
I haven't given a step-by-step of the SQL code but once you get the principle, it's pretty simple to implement. I'm not a great programmer so I'm speaking from experience.
If you already have data in the table you need to do a one time update to get the upchains synchronised initially. Again, this isn't difficult as the code is very similar to the UPDATE code in the triggers.
This technique is also a good way to identify circular references which can otherwise be tricky to spot.
The Joe Celko's method which is similar to the njreed's answer but is more generic can be found here:
Nested-Set Model of Trees (at the middle of the article)
Nested-Set Model of Trees, part 2
Trees in SQL -- Part III
#Maximilian: You are right, we forgot your actual requirement. How about a recursive stored procedure? I am not sure if this is possible in PostgreSQL, but here is a working SQL Server version:
CREATE PROCEDURE CloneNode
#to_clone_id int, #parent_id int
AS
SET NOCOUNT ON
DECLARE #new_node_id int, #child_id int
INSERT INTO test (name, parent_id)
SELECT name, #parent_id FROM test WHERE id = #to_clone_id
SET #new_node_id = ##IDENTITY
DECLARE #children_cursor CURSOR
SET #children_cursor = CURSOR FOR
SELECT id FROM test WHERE parent_id = #to_clone_id
OPEN #children_cursor
FETCH NEXT FROM #children_cursor INTO #child_id
WHILE ##FETCH_STATUS = 0
BEGIN
EXECUTE CloneNode #child_id, #new_node_id
FETCH NEXT FROM #children_cursor INTO #child_id
END
CLOSE #children_cursor
DEALLOCATE #children_cursor
Your example is accomplished by EXECUTE CloneNode 2, null (the second parameter is the new parent node).
This sounds like an exercise from "SQL For Smarties" by Joe Celko...
I don't have my copy handy, but I think it's a book that'll help you quite a bit if this is the kind of problems you need to solve.