I've been attempting to build a Mondrian schema specifically to use with Pentaho 5.0 (I'm not sure the version matters much here.) One problem I seem to repeatedly come up against, is how to control the presentation of data vs the data itself. Let me offer an example to illustrate.
Imagine a cube such as: (D for Dimension, H for hierarchy, L for level)
D: time
H: default
L: year
L: month
L: day
D: currency
H: default
L: name
L: code
If we think about the members of time.year, I'm sure we'd all agree they would be ..., 2008, 2009, 2010, 2011, 2012, 2013, .... So let's just move on to time.month. Here things get interesting. Do we represent time.month as numbers or words? Why not have both?
Mondrian provides a way to specify the name of the member as well as the "caption" of the member, which provides a different value for presentation than the member's name. Great! However if I provide a caption, then in Pentaho, you ONLY see the caption. Never the original member name. How can I let my user choose whichever is more appropriate?
The month level (as well as the day level, and any hierarchy with multiple levels,) cause another source of confusion. If the months are represented as one of 12 values, (numbers or words make no difference here,) then the actual member values are time.[2012].[1], time.[2012].[2], ..., time.[2012][12], time.[2013].[1], .... So for June (month 6), there are many members such as ..., time.[2009].[6], time.[2010].[6], time.[2011].[6], .... So if the list of members is presented and it only contains the month portion of the member name, then we see 1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,.... You can't differentiate between equal months. "Just include the year column as well," you say. Yes that makes sense, but there are other places that Pentaho doesn't provide the option to do that, such as in the filtering dialog. I had the idea of including the year in the caption of the member so instead of just 6 or June, you would see 2012 June. Unfortunately, this is also less than ideal. If each level of the hierarchy is present, (and suppose we followed this pattern for day as well), then you have each row looking like 2012 | 2012 June | 2012 June 13 | your_measure. This is of course, silly. But this could arise easily when drilling into a report in Pentaho.
Our second dimension has similar problems. Imagine the data set of world currency types. There's the 3-letter ISO standard currency code, and an official currency name. These two values are 1:1 and fully dependent on each other. Each one is a unique key. There is no actual hierarchical relationship between the two. I see them simply as 2 different representations of the same piece of data. The biggest obstacle here is that if they are not in the same hierarchy then Pentaho freely allows you to place them on opposite axes. This makes ridiculous looking reports like:
United States Dollar | Canadian Dollar | Euro | ...
USD | 12345 | - | - |
CAD | - | 12345 | - |
EUR | - | - | 1234 |
...
The codes are excellent when you desire conciseness. However maybe you're dealing with a specific situation involving several uncommon currencies and you don't want to make the report reader have to look up the meaning of the more obscure codes. I explored the use of <Property> elements but Pentaho again lacks flexibility in that you MUST display the member column to also display a property value. If the name was a property of the code member, there is no way to display only the currency name in a report without also including the code, which is redundant.
Ultimately, I'm hoping there's some mechanism to control the presentation of data, or some technique in the schema design that results in a sensible, coherent experience for the end user doing analysis in Pentaho.
This is pretty common.
Regarding properties - It is annoying how you can't re-order them in analyzer they seem to be sort of 2nd class citizens. And as they are still not supported or displayed in Saiku it frequently means they can't be used anyway. However in your example that is indeed an intended correct use of a property - a good description in fact!
So there is one solution, but it's not super clean. You can define different hierarchies depending on the user preferences, then use role based security to hide one or other of the hierarchies from the end user.
A variation of a theme I've done with this is to have admin level, senior level and beginner level access to the same cubes where you see different levels and hierarchies depending on permissions.
I havent had time to go into Mondrian4 much which may improve things here as everything is now simply an attribute - but I'm not 100% sure.
Finally I would definately raise this with support (sounds like you have a support contract) and see what improvements can be made. Post the jira here i'll definately upvote it!
Related
This is a fundamental application design question I’ve struggled with and flip-flopped on for years. We have a legacy webapp that doesn't really have a solid ORM, if that tidbit might influence your answer. To abstract my question let’s say we have a class Car, and a corresponding table in our database named car. Car has a few properties: color, weight, year, maxspeed These properties directly correspond to columns in the db table.
In our application, we define the car as “classic old” if year is < 1960 and color = black. And in many places within our app knowing whether the car is "classic old" is extremely important (maybe we’re running a very illogical insurance agency which gives steep discounts and other perks to cars which are “classic old”).
All over our application, we do things like:
--list all classic old cars
--give the current user a discount if their car is classic old
--list all classic old cars with max speed > 100 miles per hour
--email the current user if their car is classic old and weights more than 1000 pounds
What is the best way to go about this? We have a legacy application that does this in some places:
getOldClassicCars()
select * where year < 1960 and color = black
and in other places:
cararray = getAllCars();
for each car in cararray
if car.year < 1960 and car.color = black
oldcararray = car.add()
The point being that this very important, fundamental piece of our application – is the car classic old – is “hardcoded” as year < 1960 and color = black in many places. Sometimes in SQL, sometimes in application code, etc. Obviously that is not good, but as we’ve refactored things I’m not sure we’re refactoring things the best way we can.
Well, you are stuck with the fundamental problem that
you cant run your code on the database
you want to be able to use the database's selection functionality on this criteria.
you want the calculation of "classic old" to be defined in a single place (preferably code)
Lets enumerate the solutions
1: Put the calculation in a sproc and always use the sproc to retrieve cars.
The problem here is if you create a new car in code, its class status is undefined, so you haven't really solved the 'not in two places' problem.
2: Get the DB to run your calc via an assembly. for example you can get mssql to run functions from a .net assembly which you can also use in your code base to perform the same calculation.
Problem, its hard work. Plus essentially its still in two places, you have to keep the db up to date and ensure that the table is accessed correctly
3: Persist the calculated value on the DB, but perform the calc in the code
Problem, if the calculation changes the DB values will be incorrect and need updating.
3 seems to be the best option, as we will know when the calculation changes and be able to take some action to resolve the situation.
However, it might be best, given the fundamental nature of this calculation, to make that 'out of dateness' implicit in the way we structure the code.
Instead of simply persisting car.IsClassic we could add a CarStatusReport object with a datetime property. We then generate a CarStatusReport(2017) which evaluates all the cars at that point in time and saves that data in a separate table.
Our business logic is then no longer, "Is this car a classic?" but "What does the latest CarStatusReport say the status of this car is?"
You Business Logic will then reside in a single CarStatusReportGenerator service and any other logic accessing the IsClassic calculation, will be forced to acknowledge the ephemeral nature of the stored info.
No optimal solution here. But, one good point will be to move all the business logic into the one place. If you can't (when you make methods or functions calculating some property, for example isOld()) then hide all those inconsistencies under the hood, so implementation users (conceptually) will never notice DRY violation from outside.
Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in location for which they qualify and might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds different across locations. The qualifications of entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario describe above for over 400 youth, by hand, in excel. It took me about a week. I’d like to use PROC SQL imbedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtain the ability to bootstrap iterations in order to examine effect sizes across distributions, b) save time, and c) prevent further brain damage from banging my head again desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical roll as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics, location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset (was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently place into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability left to be contended with in the scenario.
The second dataset containing the treatment facility beds, with each row corresponding to an available bed in one of the treatment location; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but locations will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all option are exhausted and foster care (i.e., alternative services). Is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /Calls the SQL procedure*/;
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specificies the tables(s) (data sets) to be queried*/
where /*Subjests the data based on a condition*/
group by /*Classifies the data into groups based on the specified
column(s)*/
order by /*Sorts the resulting rows observations) by the specified
column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skiped to the end so I might of missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the most best results for the most kids. (I think a lot of that text was to convince us how important it is to help out). You don't actually give us anything we can really use to help since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions because it is a friday and I've never done a stream of consciousness answer to a stream of consciousness question before. I will suggest you don't formulate your solution just in sql, but instead use a higher level program and engage is a process like the one described below -- because this a DB questions I've noted the locations where the DB might be involved.
Generate a list kids (this can be in a table -- called NEEDY-KID)
Have a list of locations to assign (this can also be a table LOCATION)
Run your matching for best fit from KID to location -- at this point don't worry about assign more than one kid to a location -- there can be duplicates (put this in table called KID2LOC using a query)
Check KID2LOC for locations assigned twice -- use some method to remove the duplicate ones so each loc is only assigned once. (remove from the KID2LOC using a query)
Prune the LOCATION list to remove assigned locations (once again -- a query)
If kids exist without a location go to 3 with new pruned location list.
Done.
I have no idea if I wrote that correctly. I want to start learning higher end data mining techniques and I'm currently using SQL server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
-Maintenance
- Management
- Supervisory
- Manager
- Executive
- Vendors
- Secure
- Per Diem
- Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted, hierarchies.
Each ID card is tagged to a level so in the Maintenance example, - Per Diem:Vendors:Maintenance:Root. Others may be just tagged to Vendors, Some to general Maintenance itself (No one has root, thank god).
So lets say I have 20 ID Cards selected, these are available personnel I can task to a job but since they have different area's of security I want to find a commonalities they can all work on together as a 20 person group or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have 2 people working on -Per Diem work, because that is their lowest common security similarity, there is also card holder #20 who has rights to the predecessor group (Vendors), that 1 and 3 share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (Although examples always welcome), more to point me in the right direction on what I should be studying, what I'm trying to do is called, etc. I know CTE's are a way to go but that seems like only a tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to layout the hierarchy tree and then assign each card ID to the path implied by its security level clearance. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (e.g. each node of the hierarchy tree holds a) its own name (as unique identification) b) pointers to other nodes c) a list of card IDs assigned to its "name".)
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards until the tree's leafs, all along collecting the card IDs from the node containers as they are encountered.
Suppose that we have access tree:
A
+-B
+-C
D
+-E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B,C,E only make sense as tags, there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses Networkx but the same thing can be achieved with a multitude of ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the node containers (in Networkx, nodes can be any valid Python object so I am going to go with a very simple list)
G.node["B"]=[1,2,3]
G.node["C"]=[4,8]
G.node["E"]=[10,12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards either via Depth First Search (DFS) or Breadth First Search (BFS) and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes depending on visiting order, directly.
#dfs_preorder_nodes returns a generator, this is an efficient way of iterating very large collections in Python but I am casting it to a "list" here, so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G,"A")); #Start from node "A" and DFS downwards
cardIDs = []
#I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
if G.node[aNodeID]:
cardIDs.extend(G.node[aNodeID])
In the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment but relational databases do not handle graph type data with great ease. You might get away easily for something like a simple tree (like what you have here for example), but the hassle of supporting this probably justifies undertaking this process outside of the database (e.g, use the database just for retrieving the data and carry out the graph type data processing in a different environment. Doing a DFS on SQL is the sort of hassle I am referring to above.)
Hope this helps.
I've got some financial data to store and manipulate. Let's say I have 2 divisions, with offices in 2 cities, in 2 currencies, and 4 bank accounts. (It's actually more complex than that.) I want to show a list like this:
Electronics
Chicago
Dollars
Account 2 -> transactions in acct2 in $ in chicago/electronics
Euros
Account 1 -> transactions in acct1 in E in chicago/electronics
Account 3 -> etc.
Account 4
Brussles
Dollars
Account 1
Euros
Account 3
Account 4
Dessert Toppings
Chicago
Dollars
Account 1
Account 4
Euros
Account 2
Account 4
Brussles
Dollars
Account 2
Euros
Account 3
Account 4
So at each level except the top, the category can appear in multiple places. I've been reading around about the various methods, but none of the examples seem to address my particular use case, where nodes can appear in more than one place in the hierarchy. (Maybe there's a different name for this than "tree" or "hierarchy".)
I guess my hierarchy is actually something like Division > City > Currency with 'Electronics' and 'Euros' merely instances of each level, but I'm not quite sure how that helps or hurts.
A few notes: this is for a demo site, so the dataset won't be large -- ease of set-up and maintenance is more important than query efficiency. (I'm actually considering just building a data object by hand, though I'd much rather do it the right way.) Also, FWIW, we're working in php with an ms access back-end, so any libraries out there that make this easy in that environment would be helpful. (I've found a couple of implementations of the nested set pattern already.)
Are you sure you want to use a hierarchical design for this? To me, the hierarchy seems more a consequence of the desired output format than something intrinsic to your data structure.
And what if you have to display the data in a different order, like City > Currency > Division? Wouldn't that be very cumbersome?
You could use a plain structure instead, with a table for Branches, one for Cities, one for Currencies, and then then one Account table with Branch_ID, City_ID, and Currency_ID as foreign keys.
I'm not sure what database platform you're using. But if you're using MS SQL Server, then you should check out recursive queries using common table expressions (CTEs). They're easy to use and are designed for exactly the type of situation you've illustrated (a bill of materials, for instance). Check out this website for more detail: http://www.mssqltips.com/tip.asp?tip=1520
Good luck!
This is a new question which arose out of this question
Due to answers, the nature of the question changed, so I think posting a new one is ok(?).
You can see my original DB design below. I have 3 tables, and now I need a query to get all the records for a specific user for running_balances calculations.
Transactions are between users, like mutual credit. So units get swapped between users.
Inventarizations are physical stuff brought into the system; a user gets units for this.
Consumations are physical stuff consumed; a user has to pay units for this.
|--------------------------------------------------------------------------|
| type | transactions | inventarizations | consumations |
|--------------------------------------------------------------------------|
| columns | date | date | date |
| | creditor(FK user) | creditor(FK user) | |
| | debitor(FK user) | | debitor(FK user) |
| | service(FK service)| | |
| | | asset(FK asset) | asset(FK asset) |
| | amount | amount | amount |
| | | | price |
|--------------------------------------------------------------------------|
(Note that 'amount' is in different units;these are the entries and calculations are made on those amounts. Outside the scope to explain why, but these are the fields).
The question is: "Can/should this be in one table or be multiple tables (as I have it for now)?" I like the 3 tables solution because it makes semantically more sense. But then I need such a complicated select statement (with possibly negative performance results) for the running_balances. The original question in the link above asked for this statement, here I am asking if the db design is appropriate (apologies four double posting, hope it's ok).
This same question arises when you try to implement a general ledger system for single entry bookkeeping. What you have called "transactions" corresponds to "transfers", like from savings to checking. What you have called "inventarizations" corresponds to "income", like depositing a paycheck. What you have called "consumations" corresponds to "expenses", like when you pay the electric bill. The only difference is that in bookkeeping, everything has been reduced to dollar (or other currency) value. So you don't have to worry about identifying assets, because one dollar is as good as another.
So the question arises whether you need to have separate columns for "debit amount" and "credit amount" or alternatively, whether you can just have one column for "amount", and enter a positive number for debits and a negative amount for credits. Essentially the same question arises if you are implementing double entry bookkeeping rather than single entry bookkeeping.
In terms of internal arithmetic and internal data handling, things are far simpler when you adopt the single column approach. For example, to test whether a given transaction is in balance, all you have to do ask whether sum (amount) is equal to zero.
The complications arise when people require the traditional bookeeping format for data entry forms, on screen retrievals, and published reports. The traditional format requires two separate columns, marked "Debit" and "Credit", which contain only positive numbers or blank, with the constraint that every item must have an entry in either debit or credit but not both, and the other column must be left blank. These transformations require a certain amount of programming between the external format and the internal format.
It's really a matter of choice. Is it better to retain the traditional bookkeeping format of side by side debit and credit coulmns, or is it better to move forward to a format that uses negative numbers in a meaningful way? There are some circumstances that favor each of these design choices.
In your case, it's going to depend on how you intend to use the data. I would build prototypes with each of the two designs, and then start working on the fundamental CRUD processing for each. Whichever one works out easier in your environment is the one to choose.
You said that amount will be different units, then i think you should keep each table for itself.
I personally hate a DB design that has "different rules" for filling a table based on the type of entity that stored in a row. It just gets messy, and its hard to keep your constraints alive propperly on a table like that.
Just create an indexed view that will answer your balance questions to keep your queries "simple"
There's no definitive answer to this and answers will largely be down to the database design methodologies adopted by the answerer.
My advice would be to trial both ways and see which one has the best compromise between querying, performance and maintenance/usability.
You could always set up a view that returns all 3 tables as one table for querying and has a type field for the type of process a row relates to.