Should these be 3 SQL tables or one?

This is a new question that arose out of a previous question.
Due to the answers there, the nature of the question changed, so I think posting a new one is OK(?).
You can see my original DB design below. I have 3 tables, and now I need a query to get all the records for a specific user for running_balances calculations.
Transactions are between users, like mutual credit. So units get swapped between users.
Inventarizations are physical stuff brought into the system; a user gets units for this.
Consumations are physical stuff consumed; a user has to pay units for this.
|---------+----------------------+----------------------+----------------------|
| type    | transactions         | inventarizations     | consumations         |
|---------+----------------------+----------------------+----------------------|
| columns | date                 | date                 | date                 |
|         | creditor (FK user)   | creditor (FK user)   |                      |
|         | debitor (FK user)    |                      | debitor (FK user)    |
|         | service (FK service) |                      |                      |
|         |                      | asset (FK asset)     | asset (FK asset)     |
|         | amount               | amount               | amount               |
|         |                      |                      | price                |
|---------+----------------------+----------------------+----------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. It's outside the scope to explain why, but these are the fields.)
The question is: "Can/should this be in one table, or in multiple tables (as I have it now)?" I like the 3-table solution because it makes more sense semantically. But then I need a rather complicated select statement (with possibly negative performance implications) for the running_balances. The original question in the link above asked for that statement; here I am asking whether the DB design is appropriate (apologies for the double posting, hope it's OK).

This same question arises when you try to implement a general ledger system for single entry bookkeeping. What you have called "transactions" corresponds to "transfers", like from savings to checking. What you have called "inventarizations" corresponds to "income", like depositing a paycheck. What you have called "consumations" corresponds to "expenses", like when you pay the electric bill. The only difference is that in bookkeeping, everything has been reduced to dollar (or other currency) value. So you don't have to worry about identifying assets, because one dollar is as good as another.
So the question arises whether you need to have separate columns for "debit amount" and "credit amount" or alternatively, whether you can just have one column for "amount", and enter a positive number for debits and a negative amount for credits. Essentially the same question arises if you are implementing double entry bookkeeping rather than single entry bookkeeping.
In terms of internal arithmetic and internal data handling, things are far simpler when you adopt the single-column approach. For example, to test whether a given transaction is in balance, all you have to do is ask whether sum(amount) is equal to zero.
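The single-column balance test described above can be sketched in a few lines. This is a minimal illustration, not the poster's schema: the `entry` table, its column names, and the sample amounts are all invented for the demo.

```python
# Single-column approach: debits positive, credits negative; a transaction
# balances iff its amounts sum to zero. Table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (txn_id INTEGER, account TEXT, amount INTEGER)")

# One balanced transaction (id 1) and one out-of-balance transaction (id 2).
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(1, "checking", -500), (1, "savings", 500),
     (2, "checking", -300), (2, "electric", 250)],
)

def is_balanced(txn_id):
    # With one signed amount column, the balance test is just SUM() = 0.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM entry WHERE txn_id = ?", (txn_id,)
    ).fetchone()
    return total == 0

print(is_balanced(1))  # True
print(is_balanced(2))  # False
```

The two-column Debit/Credit presentation can then be produced at the reporting layer from the sign of `amount`, which is the transformation the next paragraph describes.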
The complications arise when people require the traditional bookkeeping format for data entry forms, on-screen retrievals, and published reports. The traditional format requires two separate columns, marked "Debit" and "Credit", which contain only positive numbers or blanks, with the constraint that every item must have an entry in either debit or credit but not both, and the other column must be left blank. These transformations require a certain amount of programming between the external format and the internal format.
It's really a matter of choice. Is it better to retain the traditional bookkeeping format of side-by-side debit and credit columns, or is it better to move forward to a format that uses negative numbers in a meaningful way? There are circumstances that favor each of these design choices.
In your case, it's going to depend on how you intend to use the data. I would build prototypes with each of the two designs, and then start working on the fundamental CRUD processing for each. Whichever one works out easier in your environment is the one to choose.

You said that the amounts will be in different units, so I think you should keep each table for itself.
I personally hate a DB design that has "different rules" for filling a table based on the type of entity stored in a row. It just gets messy, and it's hard to keep your constraints enforced properly on a table like that.
Just create an indexed view that answers your balance questions to keep your queries "simple".

There's no definitive answer to this; answers will largely come down to the database design methodologies adopted by the answerer.
My advice would be to trial both ways and see which one has the best compromise between querying, performance and maintenance/usability.
You could always set up a view that returns all 3 tables as one table for querying and has a type field for the type of process a row relates to.
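The view-with-a-type-field idea can be sketched like this. The column lists follow the table in the question, but the schemas are simplified and the view name `ledger` is invented for the demo.

```python
# UNION ALL the three tables into one view with a "type" discriminator,
# so running-balance queries only ever touch one relation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions     (date TEXT, creditor INT, debitor INT, amount REAL);
CREATE TABLE inventarizations (date TEXT, creditor INT, amount REAL);
CREATE TABLE consumations     (date TEXT, debitor INT, amount REAL, price REAL);

CREATE VIEW ledger AS
    SELECT 'transaction'     AS type, date, creditor, debitor, amount FROM transactions
    UNION ALL
    SELECT 'inventarization' AS type, date, creditor, NULL,    amount FROM inventarizations
    UNION ALL
    SELECT 'consumation'     AS type, date, NULL,     debitor, amount FROM consumations;
""")

conn.execute("INSERT INTO transactions     VALUES ('2023-01-01', 1, 2, 10)")
conn.execute("INSERT INTO inventarizations VALUES ('2023-01-02', 1, 5)")
conn.execute("INSERT INTO consumations     VALUES ('2023-01-03', 2, 3, 1.5)")

# All movements touching user 1, in date order - exactly the input a
# running-balance calculation needs.
rows = conn.execute(
    "SELECT * FROM ledger WHERE creditor = 1 OR debitor = 1 ORDER BY date"
).fetchall()
print(rows)
```

Each base table keeps its own semantics and constraints; only the reporting query pays the (modest) cost of the UNION ALL.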

Related

Mondrian Schema, separation of data and presentation

I've been attempting to build a Mondrian schema specifically to use with Pentaho 5.0 (I'm not sure the version matters much here.) One problem I seem to repeatedly come up against, is how to control the presentation of data vs the data itself. Let me offer an example to illustrate.
Imagine a cube such as: (D for Dimension, H for hierarchy, L for level)
D: time
H: default
L: year
L: month
L: day
D: currency
H: default
L: name
L: code
If we think about the members of time.year, I'm sure we'd all agree they would be ..., 2008, 2009, 2010, 2011, 2012, 2013, .... So let's just move on to time.month. Here things get interesting. Do we represent time.month as numbers or words? Why not have both?
Mondrian provides a way to specify the name of the member as well as the "caption" of the member, which provides a different value for presentation than the member's name. Great! However if I provide a caption, then in Pentaho, you ONLY see the caption. Never the original member name. How can I let my user choose whichever is more appropriate?
The month level (as well as the day level, and any hierarchy with multiple levels) is another source of confusion. If the months are represented as one of 12 values (numbers or words make no difference here), then the actual member values are time.[2012].[1], time.[2012].[2], ..., time.[2012].[12], time.[2013].[1], .... So for June (month 6), there are many members such as ..., time.[2009].[6], time.[2010].[6], time.[2011].[6], ....

So if the list of members is presented and it only contains the month portion of the member name, then we see 1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,.... You can't differentiate between equal months. "Just include the year column as well," you say. Yes, that makes sense, but there are other places where Pentaho doesn't provide that option, such as in the filtering dialog.

I had the idea of including the year in the caption of the member, so instead of just 6 or June, you would see 2012 June. Unfortunately, this is also less than ideal. If each level of the hierarchy is present (and suppose we followed this pattern for day as well), then you have each row looking like 2012 | 2012 June | 2012 June 13 | your_measure. This is, of course, silly. But it could arise easily when drilling into a report in Pentaho.
Our second dimension has similar problems. Imagine the data set of world currency types. There's the 3-letter ISO standard currency code, and an official currency name. These two values are 1:1 and fully dependent on each other. Each one is a unique key. There is no actual hierarchical relationship between the two. I see them simply as 2 different representations of the same piece of data. The biggest obstacle here is that if they are not in the same hierarchy then Pentaho freely allows you to place them on opposite axes. This makes ridiculous looking reports like:
United States Dollar | Canadian Dollar | Euro | ...
USD | 12345 | - | - |
CAD | - | 12345 | - |
EUR | - | - | 1234 |
...
The codes are excellent when you desire conciseness. However maybe you're dealing with a specific situation involving several uncommon currencies and you don't want to make the report reader have to look up the meaning of the more obscure codes. I explored the use of <Property> elements but Pentaho again lacks flexibility in that you MUST display the member column to also display a property value. If the name was a property of the code member, there is no way to display only the currency name in a report without also including the code, which is redundant.
Ultimately, I'm hoping there's some mechanism to control the presentation of data, or some technique in the schema design that results in a sensible, coherent experience for the end user doing analysis in Pentaho.
This is pretty common.
Regarding properties - it is annoying how you can't re-order them in Analyzer; they seem to be sort of second-class citizens. And as they are still not supported or displayed in Saiku, it frequently means they can't be used anyway. However, in your example that is indeed an intended, correct use of a property - a good description, in fact!
So there is one solution, but it's not super clean. You can define different hierarchies depending on the user preferences, then use role-based security to hide one or the other of the hierarchies from the end user.
A variation on that theme I've done is to have admin-level, senior-level and beginner-level access to the same cubes, where you see different levels and hierarchies depending on permissions.
I haven't had time to go into Mondrian 4 much, which may improve things here as everything is now simply an attribute - but I'm not 100% sure.
Finally, I would definitely raise this with support (it sounds like you have a support contract) and see what improvements can be made. Post the JIRA here and I'll definitely upvote it!

Database model to describe IT environment [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm looking at writing a Django app to help document fairly small IT environments. I'm getting stuck at how best to model the data as the number of attributes per device can vary, even between devices of the same type. For example, a SAN will have 1 or more arrays, and 1 or more volumes. The arrays will then have an attribute of Name, RAID Level, Size, Number of disks, and the volumes will have attributes of Size and Name. Different SANs will have a different number of arrays and volumes.
Same goes for servers, each server could have a different number of disks/partitions, all of which will have attributes of Size, Used space, etc, and this will vary between servers.
Another device type may be a switch, which won't have arrays or volumes, but will have a number of network ports, some of which may be gigabit, others 10/100, others 10Gigabit, etc.
Further, I would like the ability to add device types in the future without changing the model. A new device type may be a phone system, which will have its own unique attributes which may vary between different phone systems.
I've looked into EAV database designs but it seems to get very complicated very quickly, and I'm unclear on whether it's the best way to go about this. I was thinking something along the lines of the model as shown in the picture.
http://i.stack.imgur.com/ZMnNl.jpg
A bonus would be the ability to create 'snapshots' of environments at a particular time, making it possible to view changes to the environment over time. Adding a date column to the attributes table may be a way to solve this.
For the record, this app won't need to scale very much (at most 1000 devices), so massive scalability isn't a big concern.
Since your attributes are per model instance and are different for each instance, I would suggest going with a completely free schema:
from django.db import models

class ITEntity(models.Model):
    name = models.CharField(max_length=200)

class ITAttribute(models.Model):
    name = models.CharField(max_length=200)
    value = models.CharField(max_length=200)
    entity = models.ForeignKey(ITEntity, related_name="attrs", on_delete=models.CASCADE)
This is a very simple model and you can do the rest, like templates (i.e. switch template, router template, etc.), in your app code - it's much more straightforward than using a complicated model like EAV (I do like EAV, but this does not seem like the use case for it).
Adding history is also simple - just add a timestamp to ITAttribute. When changing an attribute, create a new one instead. Then, when fetching an attribute, pick the one with the latest timestamp. That way you can always have a point-in-time view of your environment.
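The timestamped-history idea can be sketched outside Django too. This uses plain sqlite3 rather than the ORM, with an invented `it_attribute` table and sample data, but the query is the same "latest row at or before the point in time" lookup.

```python
# Never update an attribute in place: insert a new timestamped row, and
# read the latest row at or before the moment you care about.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE it_attribute (
    entity TEXT, name TEXT, value TEXT, ts TEXT)""")

conn.executemany("INSERT INTO it_attribute VALUES (?, ?, ?, ?)", [
    ("san01", "raid_level", "RAID5",  "2014-01-01"),
    ("san01", "raid_level", "RAID10", "2014-06-01"),  # changed; old row kept
])

def value_at(entity, name, as_of):
    # Point-in-time view: latest value whose timestamp is <= as_of.
    row = conn.execute(
        """SELECT value FROM it_attribute
           WHERE entity = ? AND name = ? AND ts <= ?
           ORDER BY ts DESC LIMIT 1""",
        (entity, name, as_of),
    ).fetchone()
    return row[0] if row else None

print(value_at("san01", "raid_level", "2014-03-01"))  # RAID5
print(value_at("san01", "raid_level", "2014-07-01"))  # RAID10
```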
If you are more comfortable with something along the lines of the image you posted, below is a slightly modified version (sorry I can't upload an image, don't have enough rep).
+-------------+
| Device Type |
|-------------|
| type        |
+-------------+
       ^
       |
+---------------+      +--------------------+      +-----------+
| Device        |----<| DeviceAttributeMap |>----| Attribute |
|---------------|      |--------------------|      |-----------|
| name          |      | Device             |      | name      |
| DeviceType    |      | Attribute          |      +-----------+
| parent_device |      | value              |
| Site          |      +--------------------+
+---------------+
       |
       v
+-------------+
| Site        |
|-------------|
| location    |
+-------------+
I added a linker table, DeviceAttributeMap, so you have more control over an Attribute catalog, allowing queries for devices with the same Attribute but differing values. I also added a field in the Device model named parent_device, intended as a self-referential foreign key capturing the relationship between a device and its parent device. You'll likely want to make this field optional; to make the foreign key parent_device optional in Django, set the field's null and blank attributes to True.
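The "same Attribute, differing values" query the linker table enables can be sketched like this. Plain sqlite3 is used instead of Django, and the table names, sample devices, and the `raid_level` attribute are all invented for the demo.

```python
# Attribute catalog + DeviceAttributeMap linker: finding all devices that
# share an attribute, with their per-device values, is a plain join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE device_attribute_map (
    device_id    INT REFERENCES device(id),
    attribute_id INT REFERENCES attribute(id),
    value TEXT);

INSERT INTO device    VALUES (1, 'san01'), (2, 'san02');
INSERT INTO attribute VALUES (1, 'raid_level');
INSERT INTO device_attribute_map VALUES (1, 1, 'RAID5'), (2, 1, 'RAID10');
""")

# Devices sharing the 'raid_level' attribute, with their differing values.
rows = conn.execute("""
    SELECT d.name, m.value
    FROM device d
    JOIN device_attribute_map m ON m.device_id = d.id
    JOIN attribute a ON a.id = m.attribute_id
    WHERE a.name = 'raid_level'
    ORDER BY d.name
""").fetchall()
print(rows)  # [('san01', 'RAID5'), ('san02', 'RAID10')]
```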
You could try a document based NoSQL database, like MongoDB. Each document can represent a device with as many different fields as you like.

help with tree-like structure

I've got some financial data to store and manipulate. Let's say I have 2 divisions, with offices in 2 cities, in 2 currencies, and 4 bank accounts. (It's actually more complex than that.) I want to show a list like this:
Electronics
    Chicago
        Dollars
            Account 2 -> transactions in acct2 in $ in chicago/electronics
        Euros
            Account 1 -> transactions in acct1 in E in chicago/electronics
            Account 3 -> etc.
            Account 4
    Brussels
        Dollars
            Account 1
        Euros
            Account 3
            Account 4
Dessert Toppings
    Chicago
        Dollars
            Account 1
            Account 4
        Euros
            Account 2
            Account 4
    Brussels
        Dollars
            Account 2
        Euros
            Account 3
            Account 4
So at each level except the top, the category can appear in multiple places. I've been reading around about the various methods, but none of the examples seem to address my particular use case, where nodes can appear in more than one place in the hierarchy. (Maybe there's a different name for this than "tree" or "hierarchy".)
I guess my hierarchy is actually something like Division > City > Currency with 'Electronics' and 'Euros' merely instances of each level, but I'm not quite sure how that helps or hurts.
A few notes: this is for a demo site, so the dataset won't be large -- ease of set-up and maintenance is more important than query efficiency. (I'm actually considering just building a data object by hand, though I'd much rather do it the right way.) Also, FWIW, we're working in php with an ms access back-end, so any libraries out there that make this easy in that environment would be helpful. (I've found a couple of implementations of the nested set pattern already.)
Are you sure you want to use a hierarchical design for this? To me, the hierarchy seems more a consequence of the desired output format than something intrinsic to your data structure.
And what if you have to display the data in a different order, like City > Currency > Division? Wouldn't that be very cumbersome?
You could use a plain structure instead, with a table for Branches, one for Cities, one for Currencies, and then one Account table with Branch_ID, City_ID, and Currency_ID as foreign keys.
I'm not sure what database platform you're using. But if you're using MS SQL Server, then you should check out recursive queries using common table expressions (CTEs). They're easy to use and are designed for exactly the type of situation you've illustrated (a bill of materials, for instance). Check out this website for more detail: http://www.mssqltips.com/tip.asp?tip=1520
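The CTE mentioned there is SQL Server syntax, but SQLite supports the same `WITH RECURSIVE` form, so here is a hedged sketch of walking an adjacency list with one (the `node` table and its sample rows are invented for the demo):

```python
# Recursive CTE over an adjacency list: all descendants of a node, with depth.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, parent_id INT);
INSERT INTO node VALUES (1, 'Electronics', NULL),
                        (2, 'Chicago', 1),
                        (3, 'Euros', 2);
""")

rows = conn.execute("""
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE id = 1
        UNION ALL
        SELECT n.id, n.name, s.depth + 1
        FROM node n JOIN subtree s ON n.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth
""").fetchall()
print(rows)  # [('Electronics', 0), ('Chicago', 1), ('Euros', 2)]
```

MS Access has no equivalent, so with that back-end the recursion would have to happen in the PHP layer instead.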
Good luck!

Should I initialize my AUTO_INCREMENT id column to 2^32+1 instead of 0?

I'm designing a new system to store short text messages [sic].
I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.
Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.
Should I "UPDATE message SET id=2^32+1;" before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?
If you want to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs to go, and the value won't fit in a 32-bit integer, signed or otherwise.
Of course, documenting the value's range and providing guidance to your API or data consumers is the only truly correct solution. Someone's always going to try to stick a long into a char and wonder why it doesn't work.
What if you provided a set of test suites or a test service that used messages in the "high but still valid" range and persuade your service users to use it to validate their code is proper? Starting at an arbitrary value for defensive reasons is a little weird to me; providing sanity tests rubs me right.
Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.
If everyone alive on the planet sent one message per second every second non-stop, your counter wouldn't wrap until the year 2050 using 64 bit integers.
Probably just starting at 1 would be sufficient.
(But if you did start at the lower bound, it would extend into the start of 2092.)
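The 2050/2092 figures check out on the back of an envelope, assuming roughly 7 billion people and the question's era (around 2008) as the starting year:

```python
# Everyone on Earth sends one message per second, non-stop.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
population = 7_000_000_000
msgs_per_year = population * SECONDS_PER_YEAR

years_signed = (2**63) / msgs_per_year  # IDs from 0 up to 2**63 - 1
years_full   = (2**64) / msgs_per_year  # starting at the signed lower bound

print(2008 + years_signed)  # ~2049.7, i.e. around 2050
print(2008 + years_full)    # ~2091.5, i.e. around 2092
```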
Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.
If you want to ensure that messages have an order, implement the ordering like a linked list:
---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway
(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)
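Reading that linked-list ordering back is a head-find plus a pointer walk. This sketch uses shortened placeholder IDs in a plain dict rather than real UUIDs or YAML:

```python
# Messages keyed by opaque ID, ordered by following `next` from the head.
messages = {
    "id-1": {"msg": "Hello, world",                    "next": "id-2", "prev": None},
    "id-2": {"msg": "This is my second message EVER!", "next": "id-3", "prev": "id-1"},
    "id-3": {"msg": "OH HAI",                          "next": None,   "prev": "id-2"},
}

def in_order(messages):
    # The head is the unique message with no predecessor.
    current = next(mid for mid, m in messages.items() if m["prev"] is None)
    while current is not None:
        yield messages[current]["msg"]
        current = messages[current]["next"]

print(list(in_order(messages)))
```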
One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using Guids as the primary key. I say do one better. Use a 64bit Int for your table ID/Primary Key, but also use a GUID, or something similar, as your publicly exposed ID.
An example Message table:
Name | Data Type
-------------------------------------
Id | BigInt - Primary Key
Code | Guid
Message | Text
DateCreated | DateTime
Then your data looks like:
Id | Code                                 | Message | DateCreated
---+--------------------------------------+---------+---------------------------
1 | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
2 | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
3 | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
4 | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
5 | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00
Where Code is what you would expose to the public whether it be a URL, Service, CSV, Xml, etc.
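The two-ID pattern can be sketched like this; sqlite3 and the function names stand in for whatever database and data layer you actually use:

```python
# Internal AUTO_INCREMENT-style key never leaves the database; a random
# UUID "Code" is the only identifier exposed through the API.
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE message (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal only
    code TEXT UNIQUE,                        -- public identifier
    body TEXT)""")

def create_message(body):
    code = str(uuid.uuid4())
    conn.execute("INSERT INTO message (code, body) VALUES (?, ?)", (code, body))
    return code  # callers only ever see the UUID

def get_message(code):
    row = conn.execute("SELECT body FROM message WHERE code = ?", (code,)).fetchone()
    return row[0] if row else None

public_id = create_message("hello")
print(get_message(public_id))  # hello
```

Joins and clustering stay on the compact integer key, while the public identifier carries no information about row counts or insertion order.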
Don't want to be the next Twitter, eh? lol
If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.
They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.
Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).

What's the best way to handle categories, sub-categories - hierarchical data?

Duplicate:
SQL - how to store and navigate hierarchies
If I have a database where the client requires categories, sub-categories, sub-sub-categories and so on, what's the best way to do that? If they only needed three, and always knew they'd need three I could just create three tables cat, subcat, subsubcat, or the like. But what if they want further depth? I don't like the three tables but it's the only way I know how to do it.
I have seen the "sql adjacency list" but didn't know if that was the only way possible. I was hoping for input so that the client can have any level of categories and subcategories. I believe this means hierarchical data.
EDIT: Was hoping for the sql to get the list back out if possible
Thank you.
table categories: id, title, parent_category_id
id | title | parent_category_id
----+-------+-------------------
1 | food | NULL
2 | pizza | 1
3 | wines | NULL
4 | red | 3
5 | white | 3
6 | bread | 1
I usually do a select * and assemble the tree algorithmically in the application layer.
You might have a look at Joe Celko's book, or this previous question.
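The "select * and assemble the tree in the application layer" step can be sketched in one pass over the adjacency rows (the rows below copy the sample table above):

```python
# Build a nested structure from (id, title, parent_category_id) rows.
rows = [
    (1, "food", None), (2, "pizza", 1), (3, "wines", None),
    (4, "red", 3), (5, "white", 3), (6, "bread", 1),
]

# First create every node, then wire each one to its parent (or to roots).
nodes = {rid: {"title": title, "children": []} for rid, title, _ in rows}
roots = []
for rid, _, parent in rows:
    (nodes[parent]["children"] if parent else roots).append(nodes[rid])

print([r["title"] for r in roots])                # ['food', 'wines']
print([c["title"] for c in nodes[1]["children"]])  # ['pizza', 'bread']
```

Because the node dict is built before any wiring, the input rows can arrive in any order.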
Creating a table with a relation to itself is the best way to do this. It's easy and flexible to the extent you want it to be, without any limitation. I don't think I need to repeat the structure you should use, since that has already been suggested in the first answer.
I have worked with a number of methods, but I still stick to the plain "id, parent_id" intra-table relationship, where root items have parent_id = 0. If you need to query the items in a tree a lot, especially when you only need 'branches' or all underlying elements of one node, you could use a second table: "id, path_id, level", holding a reference to each node in the upward path of each node. This might look like a lot of data, but it drastically improves branch lookups, and it is quite manageable to maintain with triggers.
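That second table (often called a closure table) can be sketched like this; the table names and the three-node sample tree are invented for the demo:

```python
# One row per (node, ancestor) pair, so "everything under node X" is a
# single indexed lookup instead of a recursive walk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, parent_id INT);
-- (id, path_id, level): path_id references each ancestor of id.
CREATE TABLE item_path (id INT, path_id INT, level INT);

INSERT INTO item VALUES (1, 'root', 0), (2, 'child', 1), (3, 'grandchild', 2);
INSERT INTO item_path VALUES
    (2, 1, 1),            -- child's ancestors: root
    (3, 2, 1), (3, 1, 2); -- grandchild's ancestors: child, root
""")

# All underlying elements of node 1, no recursion needed.
rows = conn.execute("""
    SELECT i.name FROM item i
    JOIN item_path p ON p.id = i.id
    WHERE p.path_id = 1
    ORDER BY i.id
""").fetchall()
print(rows)  # [('child',), ('grandchild',)]
```

The triggers mentioned above would keep `item_path` in sync on insert, move, and delete.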
Not a recommended method, but I have seen people use dot-notation on the data.
Food.Pizza or Wines.Red.Cabernet
You end up doing lots of LIKE or mid-string queries, which don't use indexes terribly well. And you end up parsing things a lot.
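For completeness, the dot-notation (materialized path) approach and the LIKE queries it forces look like this; the `category` table and its rows are invented from the examples above:

```python
# Materialized path: each row stores its full dotted path, and subtree
# queries become string-prefix matches.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (path TEXT PRIMARY KEY);
INSERT INTO category VALUES
    ('Food'), ('Food.Pizza'), ('Wines'),
    ('Wines.Red'), ('Wines.Red.Cabernet'), ('Wines.White');
""")

# Everything under 'Wines.Red' is a prefix match on the path string.
rows = conn.execute(
    "SELECT path FROM category WHERE path LIKE 'Wines.Red.%'"
).fetchall()
print(rows)  # [('Wines.Red.Cabernet',)]
```

A prefix LIKE with no leading wildcard can sometimes use an index, but finding a node's parent or depth still means splitting the string in application code, which is the parsing cost noted above.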