What the best way to handle categories, sub-categories - hierachical data? - sql

Duplicate:
SQL - how to store and navigate hierarchies
If I have a database where the client requires categories, sub-categories, sub-sub-categories and so on, what's the best way to do that? If they only needed three, and always knew they'd need three I could just create three tables cat, subcat, subsubcat, or the like. But what if they want further depth? I don't like the three tables but it's the only way I know how to do it.
I have seen the "sql adjacency list" but didn't know if that was the only way possible. I was hoping for input so that the client can have any level of categories and subcategories. I believe this means hierarchical data.
EDIT: Was hoping for the sql to get the list back out if possible
Thank you.

table categories: id, title, parent_category_id
id | title | parent_category_id
----+-------+-------------------
1 | food | NULL
2 | pizza | 1
3 | wines | NULL
4 | red | 3
5 | white | 3
6 | bread | 1
I usually do a select * and assemble the tree algorithmically in the application layer.

You might have a look at Joe Celko's book, or this previous question.

creating a table with a relation to itself is the best way for doing the same. its easy and flexible to the extent you want it to be without any limitation. I dont think i need to repeat the structure that you should put since that has already been suggested in the 1st answer.

I have worked with a number of methods, but still stick to the plain "id, parent_id" intra-table relationship, where root items have parent_id=0. If you need to query the items in a tree a lot, especially when you only need 'branches', or all underlying elements of one node, you could use a second table: "id, path_id, level" holding a reference to each node in the upward path of each node. This might look like a lot of data, but it drastically improves the branch-lookups when used, and is quite manageable to render in triggers.

Not a recommended method, but I have seen people use dot-notation on the data.
Food.Pizza or Wines.Red.Cabernet
You end up doing lots of Like or midstring queries which don't use indices terribly well. And you end up parsing things alot.

Related

Doing multiple queries in Postgresql - conditional loop

Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a windows machine. (I am the only one who has to use windows here... :( )
Now I manage to setup (ok, IT department did that for me) and populate a Postgres database (this I did on my own), with about 400 mb of data. Which perhaps is not so much for most of you heavy Ppostgre users, but I was more used to sqlite database for personal use which rarely exceeded 2mb ever.
Anyway, sorry for being so chatty - now the queries from that database work
nicely. I use ruby to do queries actually.
The entries in the Postgres database are interconnected, in as far as they are like
"pointers" - they have one field that points to another field.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so the end of all these queries is 1. So, from any other number, we can reach 1 in the end as the last result.
I am using ruby to make as many individual queries to the database until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and data, and then performing the SQL query via -c. This probably is not ideal, it takes a little time to do these logins and queries, and ideally I would have to login only once, perform ALL queries in Postgres, then exit with a result (all these entries as result).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in ruby but I do not know if this is available in postgresql at all.
I would need to make the query, in literal english, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly not effective.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
id | parent
----+---------------+
1 | 1 |
2 | 131567 |
6 | 335928 |
7 | 6 |
9 | 1 |
10 | 135621 |
11 | 9 |
I hope that works, I tried to narrow it down solely on example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Unless you give some example table definitions, what you're asking for vaguely reminds of a tree structure which could be manipulated with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html .

Creating a workable Redis store with several filters

I am working on a system to display information about real estate. It runs in angular with the data stored as a json file on the server, which is updated once a day.
I have filters on number of bedrooms, bathrooms, price and a free text field for the address. It's all very snappy, but the problem is the load time of the app. This is why I am looking at Redis. Trouble is, I just can't get my head round how to get data with several different filters running.
Let's say I have some data like this: (missing off lots of fields for simplicity)
id beds price
0 3 270000
1 2 130000
2 4 420000
etc...
I am thinking I could set up three sets, one to hold the whole dataset, one to create an index on bedrooms and another for price:
beds id
2 1
3 0
4 2
and the same for price:
price id
130000 1
270000 0
420000 2
Then I was thinking I could use SINTER to return the overlapping sets.
Let's say I looking for a house with more than 2 bedrooms that is less than 300000.
From the bedrooms set I get IDs 0,2 for beds > 2.
From the prices set I get IDs 0,1 for price < 300000
So the common id is 0, which I would then lookup in the main dataset.
It all sounds good in theory, but being a Redis newbie, I have no clue how to go about achieving it!
Any advice would be gratefully received!
You're on the right track; sets + sorted sets is the right answer.
Two sources for all of the information that you could ever want:
Chapter 7 of my book, Redis in Action - http://bitly.com/redis-in-action
My Python/Redis object mapper - https://github.com/josiahcarlson/rom (it uses ideas directly from chapter 7 of my book to implement sql-like indices)
Both of those resources use Python as the programming language, though chapter 7 has been translated into Java: https://github.com/josiahcarlson/redis-in-action/ (go to the java path to see the code).
... That said, a normal relational database (especially one with built-in Geo handling like Postgres) should handle this data with ease. Have you considered a relational database?

help with tree-like structure

I've got some financial data to store and manipulate. Let's say I have 2 divisions, with offices in 2 cities, in 2 currencies, and 4 bank accounts. (It's actually more complex than that.) I want to show a list like this:
Electronics
Chicago
Dollars
Account 2 -> transactions in acct2 in $ in chicago/electronics
Euros
Account 1 -> transactions in acct1 in E in chicago/electronics
Account 3 -> etc.
Account 4
Brussles
Dollars
Account 1
Euros
Account 3
Account 4
Dessert Toppings
Chicago
Dollars
Account 1
Account 4
Euros
Account 2
Account 4
Brussles
Dollars
Account 2
Euros
Account 3
Account 4
So at each level except the top, the category can appear in multiple places. I've been reading around about the various methods, but none of the examples seem to address my particular use case, where nodes can appear in more than one place in the hierarchy. (Maybe there's a different name for this than "tree" or "hierarchy".)
I guess my hierarchy is actually something like Division > City > Currency with 'Electronics' and 'Euros' merely instances of each level, but I'm not quite sure how that helps or hurts.
A few notes: this is for a demo site, so the dataset won't be large -- ease of set-up and maintenance is more important than query efficiency. (I'm actually considering just building a data object by hand, though I'd much rather do it the right way.) Also, FWIW, we're working in php with an ms access back-end, so any libraries out there that make this easy in that environment would be helpful. (I've found a couple of implementations of the nested set pattern already.)
Are you sure you want to use a hierarchical design for this? To me, the hierarchy seems more a consequence of the desired output format than something intrinsic to your data structure.
And what if you have to display the data in a different order, like City > Currency > Division? Wouldn't that be very cumbersome?
You could use a plain structure instead, with a table for Branches, one for Cities, one for Currencies, and then then one Account table with Branch_ID, City_ID, and Currency_ID as foreign keys.
I'm not sure what database platform you're using. But if you're using MS SQL Server, then you should check out recursive queries using common table expressions (CTEs). They're easy to use and are designed for exactly the type of situation you've illustrated (a bill of materials, for instance). Check out this website for more detail: http://www.mssqltips.com/tip.asp?tip=1520
Good luck!

Should these be 3 SQL tables or one?

This is a new question which arose out of this question
Due to answers, the nature of the question changed, so I think posting a new one is ok(?).
You can see my original DB design below. I have 3 tables, and now I need a query to get all the records for a specific user for running_balances calculations.
Transactions are between users, like mutual credit. So units get swapped between users.
Inventarizations are physical stuff brought into the system; a user gets units for this.
Consumations are physical stuff consumed; a user has to pay units for this.
|--------------------------------------------------------------------------|
| type | transactions | inventarizations | consumations |
|--------------------------------------------------------------------------|
| columns | date | date | date |
| | creditor(FK user) | creditor(FK user) | |
| | debitor(FK user) | | debitor(FK user) |
| | service(FK service)| | |
| | | asset(FK asset) | asset(FK asset) |
| | amount | amount | amount |
| | | | price |
|--------------------------------------------------------------------------|
(Note that 'amount' is in different units;these are the entries and calculations are made on those amounts. Outside the scope to explain why, but these are the fields).
The question is: "Can/should this be in one table or be multiple tables (as I have it for now)?" I like the 3 tables solution because it makes semantically more sense. But then I need such a complicated select statement (with possibly negative performance results) for the running_balances. The original question in the link above asked for this statement, here I am asking if the db design is appropriate (apologies four double posting, hope it's ok).
This same question arises when you try to implement a general ledger system for single entry bookkeeping. What you have called "transactions" corresponds to "transfers", like from savings to checking. What you have called "inventarizations" corresponds to "income", like depositing a paycheck. What you have called "consumations" corresponds to "expenses", like when you pay the electric bill. The only difference is that in bookkeeping, everything has been reduced to dollar (or other currency) value. So you don't have to worry about identifying assets, because one dollar is as good as another.
So the question arises whether you need to have separate columns for "debit amount" and "credit amount" or alternatively, whether you can just have one column for "amount", and enter a positive number for debits and a negative amount for credits. Essentially the same question arises if you are implementing double entry bookkeeping rather than single entry bookkeeping.
In terms of internal arithmetic and internal data handling, things are far simpler when you adopt the single column approach. For example, to test whether a given transaction is in balance, all you have to do ask whether sum (amount) is equal to zero.
The complications arise when people require the traditional bookeeping format for data entry forms, on screen retrievals, and published reports. The traditional format requires two separate columns, marked "Debit" and "Credit", which contain only positive numbers or blank, with the constraint that every item must have an entry in either debit or credit but not both, and the other column must be left blank. These transformations require a certain amount of programming between the external format and the internal format.
It's really a matter of choice. Is it better to retain the traditional bookkeeping format of side by side debit and credit coulmns, or is it better to move forward to a format that uses negative numbers in a meaningful way? There are some circumstances that favor each of these design choices.
In your case, it's going to depend on how you intend to use the data. I would build prototypes with each of the two designs, and then start working on the fundamental CRUD processing for each. Whichever one works out easier in your environment is the one to choose.
You said that amount will be different units, then i think you should keep each table for itself.
I personally hate a DB design that has "different rules" for filling a table based on the type of entity that stored in a row. It just gets messy, and its hard to keep your constraints alive propperly on a table like that.
Just create an indexed view that will answer your balance questions to keep your queries "simple"
There's no definitive answer to this and answers will largely be down to the database design methodologies adopted by the answerer.
My advice would be to trial both ways and see which one has the best compromise between querying, performance and maintenance/usability.
You could always set up a view that returns all 3 tables as one table for querying and has a type field for the type of process a row relates to.

Generate breadcrumbs of categories stored in MySQL

In MySQL, I store categories this way:
categories:
- category_id
- category_name
- parent_category_id
What would be the most efficient way to generate the trail / breadcrumb for a given category_id?
For example
breadcrumbs(category_id):
General > Sub 1 > Sub 2
There could be in theories unlimited levels.
I'm using php.
UPDATE:
I saw this article (http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/) about the Nested Set Model.
It looks interesting, but how would you ago about dynamically managing categories?
It looks easier on paper, like when you know ahead of times the categories, but not when the user can create/delete/edit categories on the fly ...
What do you think?
I like to use the Materialized Path method, since it essentially contains your breadcrumb trail, and makes it easy to do things like select all descendants of a node without using recursive queries.
Materialized Path model
The idea with the Materialized path model is to link each node in the hierarchy with its position in the tree. This is done with a concatenated list of all the nodes ancestors. This list is usually stored in a delimited string. Note the “Linage” field below.
CAT_ID NAME CAT_PARENT Lineage
1 Home .
2 product 1 .1
3 CD’s 2 .1.2
4 LP’s 2 .1.2
5 Artists 1 .1
6 Genre 5 .1. 5
7 R&B 6 .1. 5.6
8 Rock 6 .1. 5.6
9 About Us 1 .1
Traversing the table
Select lpad('-',length(t1.lineage))||t1.name listing
From category t1, category t2
Where t1.lineage like t2.lineage ||'%'
And t2.name = 'Home';
Order by t1.lineage;
Listing
Home
-product
–CD’s
–LP’s
-Artists
–Genre
—R&B
—Rock
-About Us
Generate it (however you like) from a traditional parent model and cache it. It's too expensive to be generating it on the fly and the changes to the hierarchy are usually several orders of magnitude less frequent than other changes ever are. I wouldn't bother with the nested sets model since the hierarchy will be changing and then you have to go fooling around with the lefts and rights. (Note that the article only included recipes for adding and deleting - not re-parenting - which is very simple in the parent model).
The beauty of nested sets is that you can easily add/remove nodes from the graph with just a few simple SQL statements. It's really not all that expensive, and can be coded pretty quickly.
If you happen to be using PHP (or even if you don't), you can look at this code to see a fairly straight-forward implementation of adding nodes to a nested set model (archive.org backup). Removing (or even moving) is similarly straightforward.