Can PROC SQL embedded in SAS macros dynamically merge two datasets, simulating residential treatment placement decisions for troubled youth? - sql

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, length of stay [LOS], etc.). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID #5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older, so Johnny would not qualify for treatment there. For Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location B’s beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: we are matching youth to available beds in locations for which they qualify and where they might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds differs across locations. The qualifications for entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs. 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario described above for over 400 youth, by hand, in Excel. It took me about a week. I’d like to use PROC SQL embedded in a SAS macro to automate these placement scenarios with the ultimate goals of a) obtaining the ability to bootstrap iterations in order to examine effect sizes across distributions, b) saving time, and c) preventing further brain damage from banging my head against desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical role as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics and location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently placed into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of bed availability to be contended with in the scenario.
The second dataset contains the treatment facility beds, with each row corresponding to an available bed in one of the treatment locations; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but each location populates several rows.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1 = distinct TF.Location,
and if youth.Rank1 ≠ TF.Location, then
merge on youth.Rank2 = TF.Location,
and if youth.Rank2 ≠ TF.Location, then
merge on youth.Rank3 = TF.Location, etc.
Put plainly: “Merge on Rank1 unless the Rank1 location is no longer available, then merge on Rank2, unless the Rank2 location is no longer available, and on down the line, until all options are exhausted and foster care (i.e., alternative services) is the only option.”
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql;           /*Calls the SQL procedure*/
create table x as   /*Tells SAS to create a table called x*/
select              /*Specifies the column(s) to be selected*/
from                /*Specifies the table(s) (data sets) to be queried*/
where               /*Subsets the data based on a condition*/
group by            /*Classifies the data into groups based on the specified column(s)*/
order by            /*Sorts the resulting rows (observations) by the specified column(s)*/
;
quit;               /*Ends the PROC SQL procedure*/
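As far as I can tell, a filled-in version of that skeleton (using a hypothetical YOUTH dataset and made-up column names, just to show the shape) would read something like:
proc sql;                       /*Calls the SQL procedure*/
   create table x as            /*Creates a table called x*/
   select youth_id, age, rank1  /*Columns to keep*/
   from youth                   /*Dataset (table) being queried*/
   where age ge 14              /*Subsets on a condition*/
   order by rank1;              /*Sorts the resulting rows*/
quit;                           /*Ends the PROC SQL step*/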
Frankly, I’m stuck and I could use some advice. The greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P

The process you describe (and to be honest I skipped to the end so I might have missed something) does not lend itself to SQL, because each step could affect the results of the next one. However, you want to get the best results for the most kids. (I think a lot of that text was to convince us how important it is to help out.) You don't actually give us anything we can really use to help, since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions, because it is a Friday and I've never done a stream-of-consciousness answer to a stream-of-consciousness question before. I suggest you don't formulate your solution just in SQL, but instead use a higher-level program and engage in a process like the one described below -- because this is a DB question, I've noted the steps where the DB might be involved.
1. Generate a list of kids (this can be in a table -- call it NEEDY_KID).
2. Have a list of locations to assign (this can also be a table, LOCATION).
3. Run your matching for best fit from kid to location -- at this point don't worry about assigning more than one kid to a location; there can be duplicates (put this in a table called KID2LOC using a query -- see the sketch after this list).
4. Check KID2LOC for locations assigned twice -- use some method to remove the duplicates so each location is only assigned once (remove them from KID2LOC using a query).
5. Prune the LOCATION list to remove assigned locations (once again -- a query).
6. If kids exist without a location, go back to step 3 with the newly pruned location list.
Done.
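To make steps 3 and 5 concrete, here is a minimal PROC SQL sketch of a single pass -- the kind of thing you would wrap in a SAS macro %DO loop. Every table and column name (NEEDY_KID, LOCATION, KID2LOC, YOUTH_ID, RANK1-RANK4, BEST_FIT) is a hypothetical placeholder, and the duplicate-removal rule of step 4 is left to whatever tie-breaking you choose:
/* One pass of the loop above -- illustrative names only.                  */
/* NEEDY_KID : one row per unassigned youth; RANK1-RANK4 hold the names    */
/*             of qualified locations in descending order of success.      */
/* LOCATION  : one row per location that still has an open bed.            */
proc sql;
   /* Step 3: each youth's best fit among the locations still available */
   create table kid2loc as
   select k.youth_id,
          coalesce(l1.location, l2.location, l3.location, l4.location)
             as best_fit
   from needy_kid as k
        left join location as l1 on k.rank1 = l1.location
        left join location as l2 on k.rank2 = l2.location
        left join location as l3 on k.rank3 = l3.location
        left join location as l4 on k.rank4 = l4.location;

   /* Step 5: prune locations that have just been filled (run this after */
   /* step 4 has removed duplicate assignments)                          */
   delete from location
   where location in (select best_fit from kid2loc
                      where best_fit is not null);
quit;
Because COALESCE takes the first non-missing value, each youth falls through to the highest-ranked location that is still on the LOCATION list, which is exactly the "Rank1 unless unavailable, then Rank2, ..." cascade the question asks for.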

Related

Data Science: Using Inferential Statistics to label train dataset

Lack of high schools in remote areas is a problem for students in developing countries. Students in some locations are better than those in others. So, I have to find those locations. Now, the main problem is defining "BETTER". I have made some rules that will define the profile of a location.
Right now, I am concerned with the good students.
So, what I have done is:
1. Used some inferential statistics and made some rules to come up with the conclusion that locations A, B, C, etc. are the most promising locations where you can put the high schools, because according to my rules these locations contain quality students.
I did all of the things above to label the data, because I needed to define "BETTER" and label the data so that I can now use a machine learning algorithm to learn the factors which make a location a potential location, so that if I give the model a data point from the test data, it will instantly tell whether the location is better or not.
Overview of the method:
For each location, I have these 4 pieces of information:
total_students_staying_for_high_school_education(A),
total_students_leaving_for_high_school_education_in_another_place(B),
mean_grade_point_of_students_of_type_B,
ratio (calculated as B/A),
For the locations whose ratio > 1,
I applied the chi-squared significance test to come up with a statistic which would tell me whether students are leaving that place in significantly greater numbers than staying. I used ANOVA and then a Tukey test to compare mean grade points, and then found pairs of locations whose means differ and which of the two is greater.
I then wrote a Python program with a custom comparator that first checks whether the mean grades of two locations differ and returns the one with the greater mean. If the means don't differ, the comparator returns the location whose chi-squared value is greater.
This is how the whole process comes up with a few suggested locations, and I call those locations "BETTER".
What I am concerned about is:
1. How do I verify that my rules are valid? Or do I even need to verify them?
2. Most importantly, is mingling statistics with machine learning as described above an appropriate approach? Is there any major leakage in the method? Can anyone suggest a more general method?

Need to format table from BigQuery for specific words in the body of text

I'm working with Google BigQuery to scrape the reddit comments database. I'll start with the query I'm working on:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
LOWER(body)
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE
body CONTAINS 'acid'
OR body CONTAINS 'ecstasy'
OR body CONTAINS 'fire'
OR body CONTAINS 'heroin'
LIMIT 10;
I need to scrape the reddit database for a list of about 30 drug-related words (I shortened the list above for brevity).
I'm having trouble with two things:
I want to be able to correctly query the DB, but a lot of the results that are returned do not meet the criteria, i.e. do not contain any of the matching words.
I want to be able to create a column which displays the specific word that was matched... so if it matched the word 'drug', that word would appear in a 'word_matched' column, along with the body, author, date, etc.
I've tried regular expressions as well for matching the words, but that doesn't seem to be helping either:
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
Any and all help will be greatly appreciated. Thanks all!
Below I address both points of the question:
1. Have in the output only matching words, and not those which are part of another/different word. This is easy to accomplish using the REGEXP_MATCH function.
2. Have a column which consists of all matching words. (I think it makes more sense to have all matching words vs. just one as asked in the question.)
SELECT
  [date],
  subreddit,
  comment_author,
  upvotes,
  GROUP_CONCAT(word) AS matches,
  body
FROM (
  SELECT
    [date],
    subreddit,
    comment_author,
    upvotes,
    body,
    word
  FROM (
    SELECT
      DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
      subreddit,
      author AS comment_author,
      ups AS upvotes,
      LOWER(body) AS body
    FROM
      [fh-bigquery:reddit_comments.2015_01]
    WHERE REGEXP_MATCH(body, r'\b(drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
  ) x
  CROSS JOIN (
    SELECT SPLIT(list,'|') AS word FROM
      (SELECT 'drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
  ) y
  HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000
The above solution provides the list of matching words on a best-effort basis, so please note:
If the matches column consists of one word, it is for sure an exactly matched word.
But if this column consists of a few words, at least one of them is an exact match; the others may not be exact matches.
I think for a lengthy body it is still valuable to have those, at least as a hint of what to look for. For example:
drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze 1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to $1 this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between $1.20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay $50 unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
Note: If you need a list of only exact matches, it is still relatively easy to do with BigQuery User-Defined Functions.
I suggest debugging this using REGEXP_EXTRACT. I tried running your query, and it kept finding things like "meth" in "something", which might be what you're seeing. You probably want to check for word boundaries around the match, since some of the words you are searching for can be contained in normal, non-drug-related words.
Something like the following should help in debugging:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
REGEXP_EXTRACT(body, '(drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)') AS match,
LOWER(body)
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
LIMIT 10;

selecting "similar" groups - where to start with probabilities?

Let's say I have a table with 10,000 rows (representing 10,000 persons) and the following columns:
id qualification gender age income
When I select all persons having a certain qualification (say "plumber") I get 100 rows with a certain gender, age and income distribution.
What I now want to do is select some kind of test group to check whether the income is influenced by qualification or by the distribution of the other attributes.
That means (and now I come to my question) I want to get another set of 100 rows, having the same gender and age distribution (but a different qualification value). These 100 rows should of course be chosen at random.
My primary problem is that I don't know how to write an SQL command that would take care of the distributions (which of course could and maybe should be seen as probabilities in this context) when I select random lines.
Thank you in advance!
You seem to be trying to solve something that is tightly related to this extremely thorny problem.
The wiki page depicts a number of approaches for detecting correlations in a database, complete with references to prior pg-hacker discussions (here's another), a variety of (rejected) patch proposals, and scientific papers that discusses the topic.
If it sounds too thorny, I'd second Catcall's pl/r suggestion. Or another applicable pl, for that matter.
As an aside, you might find pg-kmeans useful too:
http://pgxn.org/dist/kmeans/doc/kmeans.html
As well as PostStat (never tried it myself):
http://poststat.projects.postgresql.org/
Might be better on stats.stackexchange.com.
Selecting random rows is easy; matching the distribution is hard.
You could write a stored procedure that
repeatedly selects 100 rows at random,
calculates the statistics,
and returns when it finds 100 rows that fit.
But that seems a lot like kicking dead whales down the beach. And, depending on your data, it might never return.
Before you spend much time trying to do this in SQL, consider spending a little time to see how hard (or how easy) this is to do with statistical software, like R.
Later
Just discovered that there's a package called pl/R.
PL/R is a loadable procedural language that enables you to write
PostgreSQL functions and triggers in the R programming language. PL/R
offers most (if not all) of the capabilities a function writer has in
the R language.
Google postgresql +statistics +r +pl for additional links to papers and tutorials.
SELECT * from Table1 order by random() limit 100;
random() is valid for PostgreSQL. For MySQL you can use RAND() instead of random().
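If exact frequency matching on the stratifying columns is good enough, a rough PostgreSQL sketch (assuming a table called persons with the columns from the question, and an age column coarse enough to serve as a stratum directly) is to count the plumbers per gender/age cell and draw the same number of random non-plumbers from each cell:
-- Sketch only: assumes persons(id, qualification, gender, age, income).
WITH target AS (            -- gender/age distribution of the reference group
    SELECT gender, age, count(*) AS n
    FROM persons
    WHERE qualification = 'plumber'
    GROUP BY gender, age
),
candidates AS (             -- shuffle the comparison pool within each stratum
    SELECT p.*,
           row_number() OVER (PARTITION BY gender, age ORDER BY random()) AS rn
    FROM persons AS p
    WHERE qualification <> 'plumber'
)
SELECT c.id, c.qualification, c.gender, c.age, c.income
FROM candidates AS c
JOIN target AS t USING (gender, age)
WHERE c.rn <= t.n;          -- same gender/age counts as the plumber group
If a stratum has fewer non-plumbers than plumbers you will come up short there, and with a fine-grained age you would bucket it first; anything beyond this kind of frequency matching really is easier in R, as suggested above.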

How to slice a factless fact table on time dimension? (SSAS)

Starting with the customary "please excuse me as this is my first post and I'm a relative beginner" disclaimer, I have the following question...
I work for a not-for-profit campaigning organisation. I've set up an SSAS solution to measure campaigning actions (e.g. emailing the prime minister) taken by a set of campaigners (customers); the main fact table has a count of actions as its measure, and is sliceable by, say, time and geography...
... but I also want another factless fact table that can show a count of how many campaigners are in which mailing segment... so I think what I need to do is basically dump a copy of my campaigner dimension (which is slowly changing, for people moving geography etc.) into its own factless fact table... columns being FK_campaigner, segment_id, start_date, end_date. But then how do I link that into the time dimension, as it doesn't have an FK_time (merely a start and end date)... I guess what I want to do is relate the factless table to the time table on a "when PK_time > start_date and < end_date" and then slice for me... but HOW? And is this possible, or do I have to go down the route of loading one fact for each day that someone was in a segment?
Many thanks to anybody who can point me in the right direction, either structurally (is the broad approach wrong?) or, even better, in the practicalities of actually doing it in SSAS.
AJ
If you just want to analyse this data for a single point in time (e.g. "show me what my numbers looked like at point x"), then you could have the time dimension be the "effective date".
This would be semi-additive, and you would not be able to aggregate the data across time.
However, if what you are interested in is analyzing the transition between time periods, then there is a "many-to-many" solution that would allow this:
Many to Many Revolution white paper
The whitepaper provides several models; the ones relevant to your scenario would be either "Cross Time" or "Transition Matrix".
Good luck
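If you do end up going down the "one fact per day" route mentioned in the question, you don't necessarily have to materialise it in your ETL: a named query (or database view) in the data source view can expand each start/end row against the date dimension. A rough sketch, with illustrative table and column names (FactSegment, DimDate, date_key):
-- Expands each segment-membership row into one row per day it was active,
-- so the factless fact can join the date dimension on an ordinary key.
SELECT f.FK_campaigner,
       f.segment_id,
       d.date_key
FROM   FactSegment AS f
JOIN   DimDate     AS d
  ON   d.date_key BETWEEN f.start_date AND f.end_date;
Be aware this multiplies the row count by the length of each membership, so for long-lived segments the many-to-many models in the whitepaper scale better.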

how to determine if a record in every source represents the same person

I have several sources of tables with personal data, like this:
SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...
SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...
SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...
So, assuming that records with ID 1, from sources 1 and 2, are the same person, my problem is how to determine whether a record in every source represents the same person. Additionally, of course, not every record exists in all sources. The names are written mainly in Spanish.
In this case, the exact matching needs to be relaxed, because we assume the data sources have not been rigorously checked against the country's official identification bureau. We also need to assume typos are common, because of the nature of the processes used to collect the data. What is more, the number of records is around 2 or 3 million in every source...
Our team had thought of something like this: first, force exact matching on selected fields like ID number and names, to learn how hard the problem is. Second, relax the matching criteria and count how many more records can be matched, but here is where the problem arises: how do we relax the matching criteria without generating too much noise, and without restricting too much?
What tool would be most effective for handling this? For example, do you know of a specific extension in some database engine that supports this kind of matching?
Do you know of clever algorithms like Soundex to handle this approximate matching, but for Spanish text?
Any help would be appreciated!
Thanks.
The crux of the problem is to compute one or more measures of distance between each pair of entries and then consider them to be the same when one of the distances is less than a certain acceptable threshold. The key is to set up the analysis and then vary the acceptable distance until you reach what you consider to be the best trade-off between false positives and false negatives.
One distance measurement could be phonetic. Another you might consider is the Levenshtein (edit) distance between the entries, which would attempt to measure typos.
If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you'll have too few. Make it too restrictive and you'll have too many.
If you know roughly how many entries a person should have, then you can use that as the metric to see when you are getting close. Or you can divide the total number of records by the average number of records per person and get a rough number of persons that you're shooting for.
If you don't have any numbers to use, then you're left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it's guess and check.
I hope that helps.
This sounds like a Customer Data Integration problem. Search on that term and you might find some more information. Also, have a poke around inside The Data Warehousing Institute, and you might find some answers there as well.
Edit: In addition, here's an article that might interest you on Spanish phonetic matching.
I've had to do something similar before and what I did was use a double metaphone phonetic search on the names.
Before I compared the names, though, I tried to normalize away any name/nickname differences by looking the name up in a nickname table I created. (I populated the table with census data I found online.) So people called Bob became Robert, Alex became Alexander, Bill became William, etc.
Edit: Double Metaphone was specifically designed to be better than Soundex and work in languages other than English.
SSIS: try using the Fuzzy Lookup transformation.
Just to add some details to help solve this issue, I found these modules for PostgreSQL 8.3:
Fuzzy String Match
Trigrams
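As a rough illustration of the candidate-pair step using those modules (the source1/source2 tables and the distance thresholds are placeholders you would tune, as discussed in the other answers):
-- Sketch using the fuzzystrmatch module's levenshtein() function to pull
-- candidate pairs whose first and last names are within a small edit distance.
SELECT s1.id AS id_source1,
       s2.id AS id_source2,
       levenshtein(lower(s1.first_name), lower(s2.first_name)) AS fn_dist,
       levenshtein(lower(s1.last_name),  lower(s2.last_name))  AS ln_dist
FROM   source1 AS s1
JOIN   source2 AS s2
  ON   levenshtein(lower(s1.last_name),  lower(s2.last_name))  <= 2
 AND   levenshtein(lower(s1.first_name), lower(s2.first_name)) <= 2;
With 2-3 million rows per source, a raw pairwise comparison like this is too slow, so in practice you would block first (for example on a trigram similarity index from the Trigrams module, or on an exact match of some other field) and only compute edit distances within each block.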
You might try to canonicalise the names by comparing them with a dictionary.
This would allow you to spot some common typos and correct them.
Sounds to me like you have a record linkage problem. You can use the references in the link.