Interview: Determine how many x in y? - puzzle

Today I had an interview with a gentleman who asked me to determine how many veterinarians are in the city of Atlanta. The interview was for an entry-level development position.
Assumptions: 1,000,000 people in Atlanta, 500,000 pets in Atlanta. The actual data is irrelevant.
Other than that there were no specifics. He asked me to find this data using only a whiteboard. There was no code required; it was simply a question to determine how well I could "reason" the problem. He said there was no right or wrong answer, and that I should work from the ground up.
After several answers, one of which was ~1,000 veterinarians in Atlanta, he told me he was going to ask other questions and I got the impression I had missed the point entirely.
I tried to work from the assumption that each vet could maybe see five animals a day, in a total of 24 working days per month.
Using those assumptions, I finally calculated (24 * 5) * 12 = 1,440 pets/year, and with 500,000 pets that would come to 500,000 / 1,440 ~= 348 veterinarians.
What steps could I have taken to approach this problem differently, in case I run into this sort of problems in future interviews?

I agree with your approach. The average pet sees a veterinarian so many times a year. The average veterinarian sees so many pets per week. Crunch those numbers and you have your answer.
Just guessing off the top of my head, I would say the average pet sees a veterinarian twice each year. So that's 1,000,000 visits. I'd say the average vet works 48 weeks a year, sees about a pet every 40 minutes, and works 30 hours per working week. That's about 2,160 vists per vet.
1,000,000 / 2,160 ~= 462.
My answer is close enough to yours, given that the numbers are all guesses.

The point of the question, I think, is to clearly define each assumption you have to make in order to produce an estimate. Your assumptions can be wildly inaccurate; in practice, they usually aren't too bad.
Interesting aside...there's a fun board game called Guesstimation built entirely around this kind of estimation problem.

How many pets are the types of pets that need to see veterinarians? How many vets see pets instead of large animals?
The point of this question isn't necessarily a Fermi question: It's to see how you handle ambiguous requirements that could significantly affect your answer.

Related

VRP with parallel relations in Optaplanner

We are trying to solve a VRP with Optaplanner where it is important that two (or more) customers are served at the same time.
This means, for example, that if customer #1 is supplied at 10 o'clock, then customer #2 must also be supplied at 10 o'clock.
It is not allowed to deliver to one customer and leave the other unscheduled.
Such constellations occur with approx 50% of all customers out of a total number of 1000 customers.
It is not sufficient to apply the "delay till last pattern".
All other conditions remain the same as in the VRP example.
How can we proceed in order to solve this problem with Optaplanner?
Are there any examples of such constellations?
In the docs, take a look at chapter Design Patterns, and specifically the auto delay until last pattern.

Can proc sql embedded in sas macros dynamically merge to data-sets, simulating residential treatment placement decisions for trouble youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in location for which they qualify and might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds different across locations. The qualifications of entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario describe above for over 400 youth, by hand, in excel. It took me about a week. I’d like to use PROC SQL imbedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtain the ability to bootstrap iterations in order to examine effect sizes across distributions, b) save time, and c) prevent further brain damage from banging my head again desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical roll as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics, location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset (was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently place into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability left to be contended with in the scenario.
The second dataset containing the treatment facility beds, with each row corresponding to an available bed in one of the treatment location; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but locations will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all option are exhausted and foster care (i.e., alternative services). Is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /Calls the SQL procedure*/;
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specificies the tables(s) (data sets) to be queried*/
where /*Subjests the data based on a condition*/
group by /*Classifies the data into groups based on the specified
column(s)*/
order by /*Sorts the resulting rows observations) by the specified
column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skiped to the end so I might of missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the most best results for the most kids. (I think a lot of that text was to convince us how important it is to help out). You don't actually give us anything we can really use to help since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions because it is a friday and I've never done a stream of consciousness answer to a stream of consciousness question before. I will suggest you don't formulate your solution just in sql, but instead use a higher level program and engage is a process like the one described below -- because this a DB questions I've noted the locations where the DB might be involved.
Generate a list kids (this can be in a table -- called NEEDY-KID)
Have a list of locations to assign (this can also be a table LOCATION)
Run your matching for best fit from KID to location -- at this point don't worry about assign more than one kid to a location -- there can be duplicates (put this in table called KID2LOC using a query)
Check KID2LOC for locations assigned twice -- use some method to remove the duplicate ones so each loc is only assigned once. (remove from the KID2LOC using a query)
Prune the LOCATION list to remove assigned locations (once again -- a query)
If kids exist without a location go to 3 with new pruned location list.
Done.

Need to format table from BigQuery for specific words in the body of text

I'm working with Google BigQuery to scrape the reddit comments database. I'll start with the query I'm working on:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
LOWER(body)
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE
body CONTAINS 'acid'
OR body CONTAINS 'ecstasy'
OR body CONTAINS 'fire'
OR body CONTAINS 'heroin'
LIMIT 10;
I need to scrape the reddit database for a list of about 30 drug-related word (I limited it to 3 for brevity).
I'm having trouble with two things:
I want to be able to correctly query the DB, but a lot of the results that are returned do not meet the criteria a.k.a. do not contain any of the matching words.
I want to be able to create a column which displays the specific word which was matched....so if it matched the word 'drug', that word would appear in a 'word_matched' column, along with the body, author, date, etc.
I've tried regular expressions as well for matching the words, but that doesn't seem to be helping either:
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
Any and all help will be greatly appreciated. Thanks all!
Below adressing both points of the question
1. Have in output only matching words and not those which are part of another/different word. This is easy to accomplish using REGEXP_MATCH function
2. Have column wich consists of all matching words. (i think it makes more sense to have all matching words vs. just one as it is asked in question.
SELECT
[date],
subreddit,
comment_author,
upvotes,
GROUP_CONCAT(word) AS matches,
body
FROM (
SELECT
[date],
subreddit,
comment_author,
upvotes,
body,
word
FROM (
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
subreddit,
author AS comment_author,
ups AS upvotes,
LOWER(body) AS body
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE REGEXP_MATCH(body, r'\b(drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
) x
CROSS JOIN (
SELECT SPLIT(list,'|') AS word FROM
(SELECT 'drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
) y
HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000
Above solution provides list of matching words on best-effort basis, so please note:
If column matches consists one word - it is for sure exact matched word
But if this columns consist of few words - still one of those is exact match, but others can be not exact match.
I think for lengthy body - it still valuable to have those at least as a hint to what to look for. For example as in
drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze 1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to $1 this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between $1.20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay $50 unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
Note: If you need list of only exact matches, it is still relatively easy to do with BigQuery User-Defined Functions
I suggest debugging this using REGEXP_EXTRACT. I tried running your query, and it kept finding things like "meth" in "something", which might be what you're seeing. You probably want to check for word boundaries around the match, since some of your words you are searching for can be contained in several normal, non-drug-related words.
Something like the following should help in debugging:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
REGEXP_EXTRACT(body, '(drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)') AS match,
LOWER(body),
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
LIMIT 10;

SDK2 query for counting: which is more efficient?

I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (How many defects escaped QA in 90 days, 180 days, and then the same metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one so that I just get a count for each. Or I could make one query that encompass them all (all defects that escaped QA in 180 days) and then count up the difference.
I'm figuring worst case, the number of defects that escaped QA in the last six months will generally be less than 100, certainly less 500 worst case.
Which would you do-- four queryies with one result each, or one single query that on average might return 50, perhaps worst case 500?
And I guess the key question is-- where are the inflections points? Perhaps I have more metrics tomorrow (who knows, 8?) and a different average defect counts. Is there a rule of thumb I could use to help choose which approach?
Well I would probably make the series of four queries and use the result count. If you are expecting 500 defects that will end up being three queries each with 200 defects anyways.
The solution where you do each individual query and use the total result count would be safe with even a very large amount of defects. Plus I usually find it to be a bad plan to think that I know the data sets that an App will be dealing with. Most of my Apps end up living much longer and being used on larger datasets than I intended.
The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...

selecting "similar" groups - where to start with probabilities?

Let's say I have a table with 10.000 lines (representing 10.000 persons) and the following columns:
id qualification gender age income
When I select all persons having a certain qualification (say "plumber") I get 100 lines, having a certain gender, age and income distribution.
What I now want to do is select some kind of test group to check if the income is influenced by qualification or by the distribution of the other attributes.
That means (and now I come to my question) I want to get another set of 100 lines, having the same gender and age distribution (but a different qualification value). These 100 lines should of course been chosen by random.
My primary problem is that I don't know how to write an SQL command that would take care of the distributions (which of course could and maybe should be seen as probabilities in this context) when I select random lines.
Thank you in advance!
You seem to be trying to solve something that is tightly related to this extremely thorny problem.
The wiki page depicts a number of approaches for detecting correlations in a database, complete with references to prior pg-hacker discussions (here's another), a variety of (rejected) patch proposals, and scientific papers that discusses the topic.
If it sounds too thorny, I'd second Catcall's pl/r suggestion. Or another applicable pl, for that matter.
As an aside, you might find pg-kmeans useful too:
http://pgxn.org/dist/kmeans/doc/kmeans.html
As well as PostStat (never tried it myself):
http://poststat.projects.postgresql.org/
Might be better on stats.stackexchange.com.
Selecting random rows is easy; matching the distribution is hard.
You could write a stored procedure that
repeatedly selects 100 rows at random,
calculates the statistics,
and returns when it finds 100 rows that fit.
But that seems a lot like kicking dead whales down the beach. And, depending on your data, it might never return.
Before you spend much time trying to do this in SQL, consider spending a little time to see how hard (or how easy) this is to do with statistical software, like R.
Later
Just discovered that there's a package called pl/R.
PL/R is a loadable procedural language that enables you to write
PostgreSQL functions and triggers in the R programming language. PL/R
offers most (if not all) of the capabilities a function writer has in
the R language.
Google postgresql +statistics +r +pl for additional links to papers and tutorials.
SELECT * from Table1 order by random() limit 100;
random() is valid for PostgreSql. For MySql you can use RAND() instead of Random()