Recursive Hierarchy Ranking - sql

I have no idea if I wrote that correctly. I want to start learning higher end data mining techniques and I'm currently using SQL server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
-Maintenance
- Management
- Supervisory
- Manager
- Executive
- Vendors
- Secure
- Per Diem
- Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted, hierarchies.
Each ID card is tagged to a level so in the Maintenance example, - Per Diem:Vendors:Maintenance:Root. Others may be just tagged to Vendors, Some to general Maintenance itself (No one has root, thank god).
So lets say I have 20 ID Cards selected, these are available personnel I can task to a job but since they have different area's of security I want to find a commonalities they can all work on together as a 20 person group or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have 2 people working on -Per Diem work, because that is their lowest common security similarity, there is also card holder #20 who has rights to the predecessor group (Vendors), that 1 and 3 share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (Although examples always welcome), more to point me in the right direction on what I should be studying, what I'm trying to do is called, etc. I know CTE's are a way to go but that seems like only a tool in a much bigger process that needs to be done.
Thank you all in advance

Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to layout the hierarchy tree and then assign each card ID to the path implied by its security level clearance. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (e.g. each node of the hierarchy tree holds a) its own name (as unique identification) b) pointers to other nodes c) a list of card IDs assigned to its "name".)
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards until the tree's leafs, all along collecting the card IDs from the node containers as they are encountered.
Suppose that we have access tree:
A
+-B
+-C
D
+-E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B,C,E only make sense as tags, there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses Networkx but the same thing can be achieved with a multitude of ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the node containers (in Networkx, nodes can be any valid Python object so I am going to go with a very simple list)
G.node["B"]=[1,2,3]
G.node["C"]=[4,8]
G.node["E"]=[10,12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards either via Depth First Search (DFS) or Breadth First Search (BFS) and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes depending on visiting order, directly.
#dfs_preorder_nodes returns a generator, this is an efficient way of iterating very large collections in Python but I am casting it to a "list" here, so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G,"A")); #Start from node "A" and DFS downwards
cardIDs = []
#I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
if G.node[aNodeID]:
cardIDs.extend(G.node[aNodeID])
In the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment but relational databases do not handle graph type data with great ease. You might get away easily for something like a simple tree (like what you have here for example), but the hassle of supporting this probably justifies undertaking this process outside of the database (e.g, use the database just for retrieving the data and carry out the graph type data processing in a different environment. Doing a DFS on SQL is the sort of hassle I am referring to above.)
Hope this helps.

Related

Data Science: Using Inferential Statistics to label train dataset

Lack of High Schools in remote areas is a problem for students in developing country. Students in some locations are better than that in other. So, I have to find those locations. Now, the main problem is defining "BETTER". I have made some rules that will define the profile of a location.
Right now, I am concerned with the good students.
So, what I have done is-
1. Used some inferential statistics to and made some rules to come up with the conclusion that Location A,B,C,etc are the most potential locations where you can put the high schools because according to my rules these locations contain quality students.
I did all of the things above to label the data because I required to define "BETTER" and label the data so that I can now use machine learning algorithm to learn the factors which makes a location a potential location so that if I give a data point from test data to the model, it will instantly tell if the location is better or not.
Overview of the method:
For each location, I have these 4 information:
total_students_staying_for_high_school_education(A),
total_students_leaving_for_high_school_education_in_another_place(B),
mean_grade_point_of_students_of_type_B,
ratio (calculated as B/A),
For the location whose ratio > 1
I applied the chi-squared significance test to come up with a statistic which would tell me if students are leaving that place in significant amount than staying. I used ANOVA and then Tukey test to compare means_grade points and then find combinations of pairs of locations whose means vary and whose is greater than the others.
I then wrote a python program with a custom comparator that first compares if mean_grade of those points vary and returns the one with greater mean. If the means don't vary, the comparator return the location with the one whose chi-squared value is greater.
This is how, the whole process comes up with few suggestions of location and I call those location "BETTER".
What I am concerned about is-
1. How do I verify if my rules are valid? Or do I even need to verify it?
2. Most importantly, is mingling statistics with machine learning as described above an appropriate approach?Is there any major leakage in the method?Can anyone suggest a more general method?

Can proc sql embedded in sas macros dynamically merge to data-sets, simulating residential treatment placement decisions for trouble youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in location for which they qualify and might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds different across locations. The qualifications of entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario describe above for over 400 youth, by hand, in excel. It took me about a week. I’d like to use PROC SQL imbedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtain the ability to bootstrap iterations in order to examine effect sizes across distributions, b) save time, and c) prevent further brain damage from banging my head again desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical roll as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics, location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset (was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently place into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability left to be contended with in the scenario.
The second dataset containing the treatment facility beds, with each row corresponding to an available bed in one of the treatment location; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but locations will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all option are exhausted and foster care (i.e., alternative services). Is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /Calls the SQL procedure*/;
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specificies the tables(s) (data sets) to be queried*/
where /*Subjests the data based on a condition*/
group by /*Classifies the data into groups based on the specified
column(s)*/
order by /*Sorts the resulting rows observations) by the specified
column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skiped to the end so I might of missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the most best results for the most kids. (I think a lot of that text was to convince us how important it is to help out). You don't actually give us anything we can really use to help since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions because it is a friday and I've never done a stream of consciousness answer to a stream of consciousness question before. I will suggest you don't formulate your solution just in sql, but instead use a higher level program and engage is a process like the one described below -- because this a DB questions I've noted the locations where the DB might be involved.
Generate a list kids (this can be in a table -- called NEEDY-KID)
Have a list of locations to assign (this can also be a table LOCATION)
Run your matching for best fit from KID to location -- at this point don't worry about assign more than one kid to a location -- there can be duplicates (put this in table called KID2LOC using a query)
Check KID2LOC for locations assigned twice -- use some method to remove the duplicate ones so each loc is only assigned once. (remove from the KID2LOC using a query)
Prune the LOCATION list to remove assigned locations (once again -- a query)
If kids exist without a location go to 3 with new pruned location list.
Done.

Multithreaded grouping algorithm

I have a collection of circles, each of which may or may not intersect one or more other circles in the collection. I want to group these circles such that each "group" contains all circles such that every member of the group intersects at least one other member of the group, and such that no member of any group intersects any member of any other group. I have come up with the following VB.NET/pseudocode algorithm to solve this problem on a single thread:
Dim groups As New List(Of List(Of Circle))
For Each circleToClassify In allCircles
Dim added As Boolean
For Each group In groups
For Each circle In group
If circleToClassify.Intersects(circle) Then
group.Add(circleToClassify)
added = True
Exit For
End If
Next
If added Then
Exit For
End If
Next
If Not added Then
Dim newGroup As New List(Of Circle)
newGroup.Add(circleToClassify)
groups.Add(newGroup)
End If
Next
Return groups
Or in English
Take each item from the collection of circles
Check if it intersects with any member of any existing group (Bear in mind a "group" may only contain a single circle)
If the circle does intersect in the aforementioned manner add it to the appropriate group
Otherwise create a new group with this circle as its only member
Go to step 1.
What I want to be able to do is perform this task using an arbitrary number of threads. However, I haven't got very far at all as all solutions I've come up with so far will just end up executing serially due to locking.
Can anyone provide any tips on what I want to be thinking about to achieve this multithreading?
TLDR
The best multithreaded solutions avoid sharing or perform read-only sharing. (And hence don't need locks.)
Consider partitioning your work so that threads don't share result data, and then merging each thread's results.
Note that when you strip away the detail of detecting whether groups of circles intersect, you are really dealing with a connected components graph theory problem. There's plenty of useful material on this subject online. And in fact you may find it much easier and sufficiently fast to simply apply a breadth first search algorithm to find connected components.
Detail
When doing multi-threaded development, first prize is to implement the threads in such a way as to minimise the number of locks. In the most trivial case: if they don't share any data, they don't need locks at all. However, if you can guarantee that the shared data won't be modified while the threads are running: then you don't need locks in this case either.
In your question, there's no need for your input list of circles to be modified. The problem you have is that you're building up a shared list of circle groups. Basically you're sharing your result space and need locks to ensure the integrity of the results.
One technique in this situation is to "partition and merge". As a trivial example consider finding the maximum of a large list of numbers. The naive (and ideal single-threaded solution) is to:
keep a single "current maximum" found;
compare each element to this value;
and update the "current maximum" if it's higher.
The problem for multithreading occurs in updating of the shared result. One solution is to:
partition the list for each of p threads;
find the maximum within each partition;
once all threads finish their work, the final result is trivially obtained by finding the maximum of the p partitioned maximums.
The trade-off against a single-threaded solution involves weighing up the ease with which the workload can be partitioned and the per-thread results merged versus the often much simpler single-threaded approach.
Applying partition and merge to circle clusters
As a side note: Observe that your question is essentially a graph theory question such that: Each circle is a node; where if any 2 circles intersect, there's an undirected edge between them; and you're trying to determine the connected components of the graph.
Obviously this provides an area that you can research for more ideas/information. But more importantly it makes easier to analyse the problem with simple boolean assessment of whether 2 circles intersect.
Also note the potential performance improvements by first pre-processing your circles into a suitable graph structure.
Assume you have 8 circles (A-H) where 1's in the table below indicate the 2 circles intersect.
ABCDEFGH
A11000110
B11000000
C00100000
D00010101
E00001110
F10011100
G10001010
H00010001
One partitioning idea involves determining what's connected by only considering a subset of circles and all their immediate connections.
ABCDEFGH
A11000110 p1 [AB]
B11000000
---------
C00100000 p2 [CD]
D00010101
---------
E00001110 p3 [EF]
F10011100
---------
G10001010 p4 [GH]
H00010001
NB Even though threads are sharing data (e.g. 2 threads may consider the intersection between circles A and F concurrently), the share is read-only and doesn't require a lock.
Assume 4 partitions (and 4 threads) of [AB][CD][EF][GH]. Connected components per partition would be broken down as follows:
[AB]: ABFG
[CD]: C DFH
[EF]: ADEFG
[GH]: AEG DH
You now have a list of potentially overlapping connected components. Merging involves iterating the list to find overlaps. If found, take union of the 2 sets is a new connected component. This will finally produce: ABFGDHE and C
Some optimisation techniques to consider:
The bottom left of the matrix mirrors the top-right. So you should be able to avoid duplicating processing of the inverse connections.
The merging of partitions can itself be partitioned and merged.
In fact in the extreme case you could start out partitioning a single circle per partition.
Connected(A) = ABFG
Connected(B) = B
Connected(AB) = ABFG
Connected(C) = C
Connected(D) = DFH
Connected(CD) = C,DFH
Connected(ABCD) = ABFGDH,C
Connected(E) = EFG
Connected(F) = F
Connected(EF) = EFG
Connected(G) = G
Connected(H) = H
Connected(GH) = G,H
Connected(EFGH) = EFG,H
Connected(ABCDEFGH) = ABFGDHE,C
Very NB You need to ensure appropriate selection of data structures and algorithms or suffer extremely poor performance. E.g. A naive intersection implementation might require O(n^2) operations to determine if two intermediate connected components intersect and totally destroy your goal that lead to all this additional complexity.
One approach is to divide the image into blocks, run the algorithm for each block independently, on different threads (i.e. considering only the circles whose center is in that block), and afterwards join the groups from different blocks that have intersecting circles.
Another approach is to formulate the problem using a graph, where the nodes represent circles, and an edge exists between two nodes if the corresponding circles are intersecting. We need to find the connected components of this graph. This disregards the geometric aspects of the problem, however, there are general algorithms which may be useful (e.g. you could consider the last slides from this link).

Additional PlanningEntity in CloudBalancing - bounded-space situation

I successfully amended the nice CloudBalancing example to include the fact that I may only have a limited number of computers open at any given time (thanx optaplanner team - easy to do). I believe this is referred to as a bounded-space problem. It works dandy.
The processes come in groupwise, say 20 processes in a given order per group. I would like to amend the example to have optaplanner also change the order of these groups (not the processes within one group). I have therefore added a class ProcessGroup in the domain with a member List<Process>, the instances of ProcessGroup being stored in a List<ProcessGroup>. The desired optimisation would shuffle the members of this List, causing the instances of ProcessGroup to be placed at different indices of the List List<ProcessGroup>. The index of ProcessGroup should be ProcessGroup.index.
The documentation states that "if in doubt, the planning entity is the many side of the many-to-one relationsship." This would mean that ProcessGroup is the planning entity, the member index being a planning variable, getting assigned to (hopefully) different integers. After every new assignment of indices, I would have to resort the list List<ProcessGroup in ascending order of ProcessGroup.index. This seems very odd and cumbersome. Any better ideas?
Thank you in advance!
Philip.
The current design has a few disadvantages:
It requires 2 (genuine) entity classes (each with 1 planning variable): probably increases search space (= longer to solve, more difficult to find a good or even feasible solution) + it increases configuration complexity. Don't use multiple genuine entity classes if you can avoid it reasonably.
That Integer variable of GroupProcess need to be all different and somehow sequential. That smelled like a chained planning variable (see docs about chained variables and Vehicle Routing example), in which case the entire problem could be represented as a simple VRP with just 1 variable, but does that really apply here?
Train of thought: there's something off in this model:
ProcessGroup has in Integer variable: What does that Integer represent? Shouldn't that Integer variable be on Process instead? Are you ordering Processes or ProcessGroups? If it should be on Process instead, then both Process's variables can be replaced by a chained variable (like VRP) which will be far more efficient.
ProcessGroup has a list of Processes, but that a problem property: which means it doesn't change during planning. I suspect that's correct for your use case, but do assert it.
If none of the reasoning above applies (which would surprise me) than the original model might be valid nonetheless :)

Oracle APEX Tree Hierarchy Limitations/Issues

I am using Oracle APEX 4.2.2 and have constructed a Tree region based off a view.
Now when I take this query (see below) and run this query say in Oracle SQL Developer - all is fine but when I place this same query within the page in Oracle APEX based off a Tree region - all saves correctly but when I run this query, no records/tree is displayed at all.
Now the underlying view can change in record size but for the example I am talking about here, I have just over 6000 records that I need to build a Oracle Tree hierarchy from.
One thing I have noticed is that if I reduce the record size to say 500 rows, the tree displays perfectly.
Questions:
1) Now is there a limitation that I am not aware of as I really need to get this going based on whether there are 500 records or 6000 records?
2) Is 6000 rows too many for a tree hierarchy representation?
3) Could it possibly be because that Oracle APEX 4.2.2 is now using js for building trees and there causing issues due to the quantity of data?
4) Is there a means of reducing the depth of the tree records so that I can still at least display something to the user?
My query is something like:
SELECT case when connect_by_isleaf = 1 then 0
when level = 1 then 1
else -1
end as status,
level,
c as title,
null as icon,
c as value,
null as tooltip,
null as link
FROM t
start with p IS NULL
CONNECT BY NOCYCLE PRIOR c = p;
Also I've noticed that if I try and run the query in SQL Workshop, it doesn't work there either unless I reduce the record size down to say 500 records.
I asked about using IE because the 'too large tree' issue especially plays up in IE. I've seen this issue pass by and asked about a couple of times already. The conclusion was simply that there isn't much to be done about it and generally the browser(s) don't cope too well with a tree with such a large dataset. Usually the issue isn't there or is minimal in ff or chrome though, and ie is mostly not playing ball, and my guess is that this has to do with memory and dom manipulation.
1) Now is there a limitation that I am not aware of as I really need
to get this going based on whether there are 500 records or 6000
records?
No limitation.
2) Is 6000 rows too many for a tree hierarchy representation?
Probably, yes.
3) Could it possibly be because that Oracle APEX 4.2.2 is now using js
for building trees and there causing issues due to the quantity of
data?
Trees are being built with jstree since 4.0 (don't know about 3.2). Apex puts out a global variable in the tree region which holds all the data. The initialization of the widget will then create the complete ul-li list structure. Part of the issue might be that there are so many nodes to begin with, and then how this is ran through jstree, and the huge amount of dom manipulation occuring. I'm not sure if this would go better with the newer release of jstree (apex version is 0.9.9 while 1.x has been released for a while now).
4) Is there a means of reducing the depth of the tree records so that
I can still atleast display something to the user?
If you want to limit the depth you can limit the query by using level in the where clause. eg
WHERE level <= 3
Alternative options will probably be non-apex solutions. Dynamic trees, ajax for the tree nodes, another plugin,... I haven't really explored those as I haven't had to deal with such a big tree yet.
I experienced, that the number of displayable tree nodes depends also on the text lenghts in you tree (e.g. nodes and tooltips). The shorter the texts, the more nodes your tree can display. However, it makes a difference of maybe 50 nodes, so it won't solve your problem, as it didn't solve mine.
My mediocre educated guess is, that this ul-li is limited in size.
I built in a drop-down prefilter, so the user has to narrow down what she/he wants to have displayed.