I have a database for an investment firm:
B (broker)
O (office of broker)
I (investor)
S (stock)
Q (quantity of stock owned by investor)
D (dividend paid by stock)
Functional dependencies
S ⟶ D
I ⟶ B
IS ⟶ Q
B ⟶ O
I need to find a minimal key for the relation scheme R = BOSQID and prove that it is one.
I have no idea how to solve this problem.
Can you give me any ideas?
Jay, the way I understand this is the following. You need to find the minimal set of fields that determines all the fields of BOSQID. There is a standard algorithm (computing attribute closures) for doing this analysis properly, but the exercise seems simple enough not to need it.
Take B -> O. As B determines O, we can keep B and remove O from the key. Current possible key fields: BSQID.
Take I -> B. As I determines B, we can keep I and remove B from the key. Notice that, by transitivity, I also determines O. Current possible key fields: SQID.
Take S -> D. As S determines D, we can keep S and remove D from the key. Current possible key fields: SQI.
Take IS -> Q. As IS determines Q, we can keep I and S and remove Q from the key. Current possible key fields: IS.
As there are no functional dependencies left, we can't go on, so the result is IS. There are more complex examples in which this simple technique won't help you (it will drive you crazy), which is why I recommend you look up the algorithm.
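If you ever do need the general method, it is the attribute-closure algorithm: a set of attributes K is a key exactly when its closure under the FDs is all of R, and it is minimal when no proper subset has that property. A small Python sketch (the function name and data layout are my own):

def closure(attrs, fds):
    # All attributes determined by `attrs` under the FDs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs   # apply the dependency
                changed = True
    return result

fds = [({"S"}, {"D"}), ({"I"}, {"B"}), ({"I", "S"}, {"Q"}), ({"B"}, {"O"})]
print(closure({"I", "S"}, fds))  # all of BOSQID, so IS is a key
print(closure({"I"}, fds), closure({"S"}, fds))  # proper subsets fall short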
I am programming a task-switching experiment with 3 tasks. The aim of the experiment is to investigate sequential effects: triplets in which a task repeats after a switch (e.g. ABA or CAC) will be compared with triplets in which the task always switches (e.g. CBA or BAC).
To this end, it is important that the 3 tasks never repeat and that each block contains (roughly) the same number of repeat and switch sequences.
Each block has 108 trials, resulting in 106 triplets (the first two trials cannot be classified as repeat or switch, of course).
I have tried to find a solution with several programs (PsychoPy, Conan, Excel), but I haven't found one, and I have no clue how to do it.
Any help would be much appreciated.
There are only six possible orders in which the task doesn't repeat:
A B C
A C B
B A C
B C A
C A B
C B A
And six where it does:
A B A
A C A
B A B
B C B
C A C
C B C
So to get to 108 trials, you just need to present each of those orders nine times. This might, however, conflict with your requirement that the tasks don't repeat (that phrasing is ambiguous, and you should be more specific about what the constraint means).
Also, phrases like "there are (roughly) the same number of repeat and switch sequences" aren't great when defining an experimental design. Strive for as much precision as possible.
Having said all that, I'm not sure this is an actual programming question yet. You'll need to say exactly what the issue is with implementing it. The programs you mention have wildly different purposes (PsychoPy is for implementing experiments; Excel is not; I don't know what Conan is).
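If the sticking point is generating the trial sequence itself, one approach I might try is rejection sampling: draw a random no-immediate-repeat sequence, count repeat versus switch triplets, and keep the sequence only if the counts are close. A Python sketch; the function name and tolerance are my own choices:

import random

TASKS = "ABC"

def make_block(n_trials=108, tol=2, max_tries=100000):
    # Draw no-repeat sequences until the repeat/switch triplet counts
    # are within `tol` of each other.
    for _ in range(max_tries):
        seq = [random.choice(TASKS)]
        while len(seq) < n_trials:
            seq.append(random.choice([t for t in TASKS if t != seq[-1]]))
        # With no immediate repeats, a triplet is a 'repeat' exactly
        # when its first and third trials match (ABA-style).
        repeats = sum(seq[i] == seq[i + 2] for i in range(n_trials - 2))
        switches = (n_trials - 2) - repeats
        if abs(repeats - switches) <= tol:
            return "".join(seq), repeats, switches
    raise RuntimeError("no balanced block found; loosen tol")

print(make_block())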
Good afternoon and happy Friday, folks
I'm trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as "not recidivating" within 3 years of entering treatment. Equations predicting recidivism have been generated for each location and applied to each individual in the scenario (based on youth characteristics like risk, age, LOS, etc.). Each youth has a predicted success rate for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping, qualifications.
Let's take a made-up example. Johnny (ID #5, below) is a 15-year-old boy with drug charges. He could have predicted success rates of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 or older, so Johnny would not qualify for treatment there. Location B is Johnny's next best option. Let us assume that Johnny is qualified for location B, but that all of location B's beds are filled; so we must now look to location D, as it is now Johnny's "best available" option at 75%.
The score so far: we are matching youth to available beds in the locations for which they qualify and where they would enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and that number differs across locations. The qualifications for entry into the treatment facilities differ, yet overlap (e.g., 12-17-year-olds vs. 14-20-year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario described above for over 400 youth, by hand, in Excel. It took me about a week. I'd like to use PROC SQL embedded in a SAS macro to automate these placement scenarios, with the ultimate goals of a) gaining the ability to bootstrap iterations in order to examine effect sizes across distributions, b) saving time, and c) preventing further brain damage from banging my head against desk and wall in frustration while doing this by hand. While I have never had the necessity (nay, the privilege) of using SQL in my typical role as a researcher, I believe that the time has now come, and I'm excited about it! Honestly. I believe it has the capacity I'm looking for. Unfortunately, it is beating the devil out of me!
Here's what I've got cookin' so far: I want to create and automate the placement simulation through some clever use of merging/joining/switching, or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics and location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently placed into treatment). Note that I will be "cleaning" the youth dataset prior to merging, such that rank-column cells will only be populated for programs for which the respective youth qualifies. This should take the "does the youth even qualify for the program" problem out of the equation.
However, it still leaves the issue of availability to be contended with in the scenario.
The second dataset contains the treatment facility beds, with each row corresponding to an available bed in one of the treatment locations; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but each location appears in several rows.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, so the merge/join/switch/thing should take place
on youth.Rank1 = TF.Location,
and if no bed with youth.Rank1 = TF.Location remains, then
merge on youth.Rank2 = TF.Location,
and if no bed with youth.Rank2 = TF.Location remains, then merge on
youth.Rank3 = TF.Location, etc.
Put plainly: merge on rank 1 unless the rank-1 location is no longer available, then merge on rank 2, unless the rank-2 location is no longer available, and so on down the line, until all options are exhausted and foster care (i.e., alternative services) is the only option.
I've had no success getting this to work. I haven't even been successful getting a UNION to work. About the only successful thing I've done in SQL so far is create a view of a single dataset. It's pretty sad. I've been following this guidance, but I get hung up around the WHERE clause:
proc sql;           /* Calls the SQL procedure */
create table x as   /* Tells SAS to create a table called x */
select              /* Specifies the column(s) to be selected */
from                /* Specifies the table(s) (data sets) to be queried */
where               /* Subsets the data based on a condition */
group by            /* Classifies the data into groups based on the specified column(s) */
order by            /* Sorts the resulting rows (observations) by the specified column(s) */
;
quit;               /* Ends the PROC SQL procedure */
Frankly, I'm stuck and I could use some advice. The greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skipped to the end, so I might have missed something) does not lend itself to SQL, because each step can affect the results of the next one. However, you want to get the best results for the most kids. (I think a lot of that text was there to convince us how important it is to help out.) You don't actually give us anything we can really use, since you don't give any details of your data model, your data, or expected results; there is really no way to answer this question. But I don't care -- I'm going to go forward with some suggestions anyway, because it is a Friday and I've never given a stream-of-consciousness answer to a stream-of-consciousness question before. I suggest you don't formulate your solution in SQL alone, but instead use a higher-level program and engage in a process like the one described below -- because this is a DB question, I've noted the places where the DB might be involved.
1. Generate a list of kids (this can be in a table called NEEDY-KID).
2. Have a list of locations to assign (this can also be a table, LOCATION).
3. Run your matching for the best fit from kid to location -- at this point, don't worry about assigning more than one kid to a location; there can be duplicates (put this in a table called KID2LOC using a query).
4. Check KID2LOC for locations assigned twice -- use some method to remove the duplicates so each location is only assigned once (remove them from KID2LOC using a query).
5. Prune the LOCATION list to remove the assigned locations (once again, a query).
6. If kids exist without a location, go to step 3 with the newly pruned location list.
7. Done.
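To make that loop concrete, here is a rough Python sketch (all names and data shapes are made up, and I've generalized "one kid per location" to a bed count per location, since the question says each location has several beds):

def place(kids, beds, rate):
    # kids: ids in (random) arrival order
    # beds: dict location -> number of open beds
    # rate: dict (kid, loc) -> predicted success rate, present only
    #       when the kid qualifies for that location
    placed, pool = {}, list(kids)
    while pool:
        # step 3: every unplaced kid names their best still-open location
        wants = {}
        for kid in pool:
            opts = [(r, loc) for (k, loc), r in rate.items()
                    if k == kid and beds.get(loc, 0) > 0]
            wants[kid] = max(opts)[1] if opts else None
        # steps 4-5: grant what we can; full locations reject the overflow
        deferred = []
        for kid in pool:
            loc = wants[kid]
            if loc is None:
                placed[kid] = "alternative services"  # nothing left anywhere
            elif beds[loc] > 0:
                placed[kid] = loc
                beds[loc] -= 1
            else:
                deferred.append(kid)  # step 6: retry with the pruned list
        pool = deferred
    return placed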
How can I optimize a brute-force method for finding the optimal solution to the UFL (uncapacitated facility location) problem? My solution is very slow.
If you already know the UFL problem, you can skip the following description.
There is a graph G. We can divide G's vertices into 2 subsets, C and F.
C is the subset of clients and F the subset of facilities.
Every client has a distance to every facility; that is, d_ij is the distance from client i to facility j.
Every facility j has a cost f_j to open.
Every client i needs c_i objects (from some facility).
Every client i must be served by exactly one facility j, at a price of d_ij * c_i.
We want to minimize the overall cost (of serving all clients and opening the necessary facilities).
My solution is as simple as possible: test every way of assigning clients to facilities. This is very bad: with, for example, 10 clients and 5 facilities, there are already 5^10 possibilities.
How can I optimize this? I thought about some preprocessing, but I got confused because of the f_j terms and still haven't come up with anything.
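One standard speed-up (the usual textbook trick, not something from the question): once you fix which facilities are open, each client independently picks its cheapest open facility, so you only need to enumerate the 2^|F| open sets rather than the |F|^|C| assignments -- 2^5 = 32 instead of 5^10 ≈ 9.8 million in your example. A Python sketch (names are mine):

from itertools import combinations

def ufl_exact(open_cost, dist, demand):
    # open_cost[j]: cost f_j of opening facility j
    # dist[i][j]:   distance d_ij from client i to facility j
    # demand[i]:    objects c_i needed by client i
    n_fac, best = len(open_cost), float("inf")
    for k in range(1, n_fac + 1):
        for opened in combinations(range(n_fac), k):
            # fixed opening costs plus each client's cheapest open facility
            cost = sum(open_cost[j] for j in opened)
            cost += sum(min(dist[i][j] for j in opened) * demand[i]
                        for i in range(len(demand)))
            best = min(best, cost)
    return best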
I have a data structure consisting of a set of objects which are arranged into a multiply-linked list which is also (isomorphically) a valid DAG. It could be viewed as one single multiply-linked list, or as a series of n doubly-linked lists which may share members. (This is the same data structure from Algorithm for quickly obtaining a partial ordering over multiple linked lists, for those of you following my questions.)
I am looking for a general technique, in no specific SQL dialect, for expressing this multiply-linked list/DAG in SQL, such that it's easy to take a given node and obtain:
The previous and next links in the DAG, given a topological ordering of the DAG
The previous and next links in each doubly-linked list to which this node belongs
Using the example data from that other question:
first = [a, b, d, f, h, i];
second = [a, b, c, f, g, i];
third = [a, e, f, g, h, i];
I'd want to be able to, given node f, obtain [(c|d|e), g] from the overall DAG's topology, and also {first: [d, h], second: [c, g], third: [e, g]} from the individual lists' orderings.
Here's the fun part: n, the number of doubly-linked lists, is not fixed and may grow at any time. I'd rather not redo the schema each time that happens.
All of the algorithms I've come up with so far either (a) stuff a big pickle into the DB and pull it out in order to calculate orderings, or (b) require that the lists be explicitly enumerated as recursive relations in the DB.
I'll go with an option in (b) if I can't find something better but I'm hoping that there's something magical out there to make this easier.
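For what it's worth, here is a minimal sketch of one "(b)"-style layout I would try first: a single link table keyed by list name, so adding a new list adds rows rather than changing the schema. Python with sqlite3, and all table/column names are made up:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE link (list TEXT, prev TEXT, next TEXT)")
for name, seq in {"first": "abdfhi", "second": "abcfgi",
                  "third": "aefghi"}.items():
    db.executemany("INSERT INTO link VALUES (?, ?, ?)",
                   [(name, a, b) for a, b in zip(seq, seq[1:])])

# per-list neighbours of f: expect d/h (first), c/g (second), e/g (third)
print(db.execute("SELECT list, prev FROM link WHERE next = 'f'").fetchall())
print(db.execute("SELECT list, next FROM link WHERE prev = 'f'").fetchall())
# the DAG-level predecessors/successors of f are the unions of those columns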
Preamble:
This is a question-and-answer forum, not a "let's sit down, group-think for a bit, and solve the whole problem" forum.
I think what you want to investigate is a technique called "modified preorder tree traversal". A mouthful, I know, but it allows storing hierarchical data in a flat database as individual entries. Sadly, you do have to do some rewriting on inserts, but the selects can be done in a single query, so it's best for "many views, few changes" situations like a website. Luckily, you rarely have to rewrite the whole dataset (only the parts you changed and those hierarchically after them).
I remember a good article on the basics (from a couple of years ago) but can't find the bookmark at the moment, so start with a Google search.
EDIT/UPDATE:
link: http://www.sitepoint.com/hierarchical-data-database/
No matter what, from dealing with this issue extensively: you will have to choose where to put the brunt of the work, on viewing or on changing. Depending on the size of the "master" tree, you may (like me) decide to break the tree up into parts and use a tree of trees, limiting the update cost.
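For reference, the numbering behind that technique is simple: walk the tree in preorder and give each node a (left, right) pair; "all descendants of n" then becomes a single range query. A toy Python sketch of the numbering (names are mine):

def nested_sets(tree, root):
    # Assign (left, right) preorder bounds; `tree` maps node -> children.
    bounds, n = {}, 0
    def walk(node):
        nonlocal n
        n += 1
        left = n
        for child in tree.get(node, ()):
            walk(child)
        n += 1
        bounds[node] = (left, n)
    walk(root)
    return bounds

b = nested_sets({"a": ["b", "e"], "b": ["c", "d"]}, "a")
print(b)  # e.g. b -> (2, 7), c -> (3, 4), d -> (5, 6)
# in SQL, "descendants of b" is then: WHERE left > 2 AND right < 7
print([x for x, (l, r) in b.items() if b["b"][0] < l and r < b["b"][1]])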
Is there a way to compute the difference between two sorted sets (zset) or do I have to use simple sets for this?
Problem:
Set F contains a list of sorted IDs (sorted set, full list).
Set K contains a list of IDs (simple set, a subset of F).
I want to retrieve every entry in F, in order, that is not in K.
Is this possible using Redis alone, or do I have to do the computation in the application? If the latter, what is the best way?
EDIT: SDIFF does not suit this purpose as it doesn't allow sorted sets.
Make a copy of F as a simple set. Let's call it G. Now perform the SDIFF.
Or...
Make a copy of F as a sorted set. Let's call it G. Iterate through K and remove each element from G.
SDIFF really should work on sorted sets, regular sets, or combinations. But, at this time, it does not.
Also, if F is very large, you may see a performance hit when you make a copy of it. In that case, maintain a set G in your Redis DB that is updated whenever K is updated. That is, F and G are initially equal; as you add elements to K, remove them from G.
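A sketch of the first option in Python with redis-py (key names F and K are as above; G is a scratch key I made up; note SDIFF returns an unordered set, so F's ordering is re-imposed at the end):

import redis

r = redis.Redis(decode_responses=True)
members = r.zrange("F", 0, -1)   # F's ids, already in score order
if members:                      # SADD needs at least one member
    r.sadd("G", *members)        # plain-set copy of F
missing = r.sdiff("G", "K")      # ids in F but not in K (unordered)
ordered = [m for m in members if m in missing]
r.delete("G")                    # drop the scratch copy
print(ordered)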