How to choose ranges to analyse in a given population? - data-science

I want to analyse the weight of an elderly population. What is the right way to choose the age ranges to divide this population into? I.e., how do I choose between ranges like 61-70, 71-80, etc. and 61-65, 66-70, etc.?
Is there a right way at all, or is it up to the researcher, taking into account what is being analysed? I know that for an elderly population, ranges like 61-80 and 81-100 would be too broad, and one-year ranges too narrow. But can it be objectively determined whether 5-year or 10-year ranges are better?
I would be really thankful for your help and perspective!
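For what it's worth, there are rule-of-thumb formulas (Sturges, Freedman-Diaconis, Scott) that derive a bin width from the data itself, and numpy exposes them, so the choice need not be entirely arbitrary. A minimal sketch; the synthetic ages array is just a stand-in for real data:
import numpy as np

# Synthetic ages between 61 and 100, just to have something to bin
rng = np.random.default_rng(0)
ages = rng.integers(61, 101, size=500)

# Let numpy suggest bin edges with common rules of thumb
edges_fd = np.histogram_bin_edges(ages, bins="fd")            # Freedman-Diaconis
edges_sturges = np.histogram_bin_edges(ages, bins="sturges")  # Sturges
print(edges_fd)
print(edges_sturges)

# Or impose the fixed 5-year and 10-year groupings from the question and compare counts
counts_5y, edges_5y = np.histogram(ages, bins=np.arange(61, 106, 5))
counts_10y, edges_10y = np.histogram(ages, bins=np.arange(61, 111, 10))
print(counts_5y)
print(counts_10y)
Whether 5-year or 10-year groups are "better" then largely comes down to whether each group still contains enough people for the weight statistics you want to report.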

Related

Is there an algorithm for near optimal partition of the Travelling salesman problem, creating routes that need the same time to complete?

I have a problem to solve. I need to visit 7962 places with a vehicle. The vehicle travels at 10 km/h, and each time I visit a place I stay there for 1 minute. I want to divide those 7962 places into subsets that will each take up to 8 hours. So let's say 200 places take 8 hours: I visit them, then come back the next day to visit maybe another 250 places (a 200-place subset will require more distance travelled). For the distance I only care about Euclidean distances; there is no need to take the road network into account.
A map of the 7962 places
What I have done so far is use the k-means clustering algorithm to get good-enough subsets, then the Lin-Kernighan heuristic (the Concorde program) to find the tour distance, and then compute the times. But my resulting times range from 4 hours to 12 hours. Any idea how to make it better, or code that does this whole task in one go? Propose anything, but I am not a programmer; I just use Python sometimes.
Set of coordinates :
http://www.filedropper.com/wholesetofcoordinates
Coordinate subsets (40 clusters produced with the k-means algorithm):
http://www.filedropper.com/kmeans40clusters
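For what it's worth, here is a rough sketch of the pipeline described in the question: cluster the coordinates with k-means, then estimate each cluster's day length as tour distance at 10 km/h plus 1 minute per stop. The file name, the cluster count, and the greedy nearest-neighbour tour standing in for Concorde are all assumptions for illustration:
import numpy as np
from sklearn.cluster import KMeans

SPEED_KMH = 10.0      # vehicle speed
STOP_MIN = 1.0        # minutes spent at each place
SHIFT_H = 8.0         # daily time budget

def nn_tour_length(points):
    # Greedy nearest-neighbour tour length in km -- a crude stand-in for Concorde
    unvisited = list(range(1, len(points)))
    current, total = 0, 0.0
    while unvisited:
        dists = np.linalg.norm(points[unvisited] - points[current], axis=1)
        i = int(np.argmin(dists))
        total += dists[i]
        current = unvisited.pop(i)
    return total

coords = np.loadtxt("coordinates.txt")   # assumed file: one "x y" pair (in km) per line
k = 40                                   # number of daily subsets to try
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)

for c in range(k):
    pts = coords[labels == c]
    hours = nn_tour_length(pts) / SPEED_KMH + len(pts) * STOP_MIN / 60.0
    flag = "  <-- over budget" if hours > SHIFT_H else ""
    print(f"cluster {c}: {len(pts)} places, about {hours:.1f} h{flag}")
Because plain k-means balances nothing but geometry, some clusters will inevitably overshoot the 8-hour budget; one simple next step is to re-split any over-budget cluster (or move its outermost points to a neighbouring one) and re-estimate, which tends to narrow the 4-12 hour spread.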

Understanding Stratified sampling in numpy

I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income value. He offers the following piece of code to create an income category attribute.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
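A quick worked example of what that mapping does, with a handful of illustrative median-income values between 0 and 15 (written without inplace=True, which gives the same result but also avoids the chained-assignment warning on newer pandas):
import numpy as np
import pandas as pd

housing = pd.DataFrame({"median_income": [0.5, 1.4, 2.9, 3.1, 5.0, 8.7, 15.0]})

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

print(housing)
# roughly:
#    median_income  income_cat
# 0            0.5         1.0
# 1            1.4         1.0
# 2            2.9         2.0
# 3            3.1         3.0
# 4            5.0         4.0
# 5            8.7         5.0
# 6           15.0         5.0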
What I don't understand is
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure whether this is the right Stack Overflow category to post this question in, so if I've made a mistake by doing so, please let me know what the appropriate forum might be.
Thank you!
You are the best person to decide how to analyze this, based on your data set, but I can help you understand stratified sampling so that you have an idea of what it does.
Stratified sampling: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). If you just take a simple random sample from the data set, there is a chance that the sample will not cover all the categories, which is bad when you train on the data. To avoid that scenario, we use stratified sampling: the sample is drawn from each category (stratum) separately, in proportion to its share of the whole data set, so that no category is missed or misrepresented.
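As a concrete sketch, scikit-learn's StratifiedShuffleSplit is one common way to do this; here the income_cat column is built as in the question, and the synthetic housing DataFrame and the 20% test size are just assumptions for illustration:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Stand-in for the real data: a DataFrame with an income_cat column as in the question
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5).clip(upper=5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# Each income category appears in the test set in (almost exactly) the same
# proportion as in the full data set -- that is the point of stratifying
print(housing["income_cat"].value_counts(normalize=True).sort_index())
print(strat_test_set["income_cat"].value_counts(normalize=True).sort_index())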
Please let me know if you still have any questions, I would be very happy to help you.

How to determine a baseline value in mdx

Not really sure of the proper terminology that I am asking for, so bear with me. I have an MDX cube. One of the measures in the cube is called [Duration]. I have a dimension called Commands and another dimension called Locations. Basically, the duration measure tells me how long it took to run a specific command...aggregated across a bunch of dimensions.
I need to be able to compare the [duration] measure of all Locations against one specific location (Location 1). Location 1 is basically the best case scenario. The [duration] at all other locations should almost never be better than the duration of Location 1...but the question I need to answer is how much worse is location 2, 3, 4, etc. compared to location 1.
I need it to be an apples to apples comparison, so if the user brings in one of the many other dimensions that are available, I need this comparison to properly reflect all the slices against Location 1 and the current location. Does anyone have any ideas? I can do pretty much whatever I want to make this work, so any useful ideas are welcome.
Thanks
If you need to check the duration measure against one location, you may use the following calculated measure:
[Measures].[Duration] / ([Location].[Location].&[Reference location],[Measures].[Duration])

Interview: Determine how many x in y?

Today I had an interview with a gentleman who asked me to determine how many veterinarians are in the city of Atlanta. The interview was for an entry-level development position.
Assumptions: 1,000,000 people in Atlanta, 500,000 pets in Atlanta. The actual data is irrelevant.
Other than that there were no specifics. He asked me to find this data using only a whiteboard. There was no code required; it was simply a question to determine how well I could "reason" the problem. He said there was no right or wrong answer, and that I should work from the ground up.
After several answers, one of which was ~1,000 veterinarians in Atlanta, he told me he was going to ask other questions and I got the impression I had missed the point entirely.
I tried to work from the assumption that each vet could maybe see five animals a day, over 24 working days per month.
Using those assumptions, I finally calculated (24 * 5) * 12 = 1,440 pets/year, and with 500,000 pets that would come to 500,000 / 1,440 ~= 347 veterinarians.
What steps could I have taken to approach this problem differently, in case I run into this sort of problem in future interviews?
I agree with your approach. The average pet sees a veterinarian so many times a year. The average veterinarian sees so many pets per week. Crunch those numbers and you have your answer.
Just guessing off the top of my head, I would say the average pet sees a veterinarian twice each year. So that's 1,000,000 visits. I'd say the average vet works 48 weeks a year, sees about one pet every 40 minutes, and works 30 hours a week. That's about 2,160 visits per vet.
1,000,000 / 2,160 ~= 463.
My answer is close enough to yours, given that the numbers are all guesses.
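For reference, both estimates written out as a quick back-of-the-envelope script (every input is one of the guesses above):
# Estimate 1 (the question's): 5 pets/day over 24 working days/month, all year
visits_per_vet_1 = 5 * 24 * 12            # = 1,440 visits per vet per year
vets_1 = 500_000 / visits_per_vet_1       # one visit per pet per year -> ~347 vets

# Estimate 2 (this answer's): each pet visits twice a year; a vet works
# 48 weeks of 30 hours and sees one pet every 40 minutes
visits_needed = 500_000 * 2               # 1,000,000 visits per year
visits_per_vet_2 = 48 * 30 * 60 / 40      # = 2,160 visits per vet per year
vets_2 = visits_needed / visits_per_vet_2 # ~463 vets

print(round(vets_1), round(vets_2))       # 347 463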
The point of the question, I think, is to clearly define each assumption you have to make in order to produce an estimate. Your assumptions can be wildly inaccurate; in practice, they usually aren't too bad.
Interesting aside...there's a fun board game called Guesstimation built entirely around this kind of estimation problem.
How many of the pets are the kinds of pets that need to see a veterinarian? How many of the vets see pets rather than large animals?
The point of this question isn't necessarily the Fermi estimate itself: it's to see how you handle ambiguous requirements that could significantly affect your answer.

SQL anagram efficiency and logic? [closed]

I have an SQL database with about 200,000 words. I need a query with which I can solve a kind of anagram problem. The difference is that I need all the possible words that could be made with the input characters. For example, if you input ofdg, it should output the words do, go, and dog. Can you estimate the amount of time a query like this would take? How can I make it faster and more efficient? Also, in general, how long does it take SQL to scan a 200,000-row table?
To solve this problem, the first thing you need to do is reduce every word to what Scrabble players call an alphagram. That is, all the letters in the word but in alphabetical order. So do, go and dog make do, go and dgo. Of course, any given alphagram may correspond to more than one word, so, for example, alphagram dgo corresponds to both the words dog and god.
The next thing you need to do is construct a table whose key is (alphagram, sequence number) and which has a single non-key attribute, word.
Word lists tend to be static. For example, the two Scrabble word lists in the English-speaking world change about every 5 years or so. So you construct this lookup table beforehand. Building it is O(n) and it is a sunk cost: you do it once and store it, so it is not counted against the cost of the query. You have to do this beforehand; it makes absolutely no sense to build such an index on the fly every time a query comes in.
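A minimal sketch of that one-off table build, using SQLite from Python (the file names, table name and schema are assumptions; any RDBMS works the same way):
import sqlite3

def alphagram(word):
    # All the letters of the word, in alphabetical order
    return "".join(sorted(word.lower()))

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS alphagrams (alphagram TEXT, word TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_alpha ON alphagrams (alphagram)")

with open("wordlist.txt") as f:            # assumed: one word per line
    rows = [(alphagram(w.strip()), w.strip()) for w in f if w.strip()]

conn.executemany("INSERT INTO alphagrams VALUES (?, ?)", rows)
conn.commit()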
You may be wondering "What is all this about Scrabble?" The answer is that your figure of 200,000 words falls neatly between the two approved tournament word lists in the English-speaking world. The US National Scrabble Association's Official Tournament and Club Word List (2006) contains 178,691 words, and the international list, maintained by the World English Scrabble Players' Association, contains 246,691.
When you get a query, you reduce the supplied letters to a set of alphagrams. Input odfg makes alphagrams do fo go df dg fg dfo dgo fgo dfg dfgo (which is quite a programming problem in pure SQL, so I have to assume there is a PHP or Python or JavaScript front-end that will do that for you). Then you do the lookup in the database. The cost of each query should be approximately O(log2 n), in other words pretty damn immediate. That sort of query is what relational databases are good at.
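Sketch of the query side, continuing with the hypothetical alphagrams table above: generate every subset of the input letters in the front-end, turn each into an alphagram, and look them all up in one indexed query:
import sqlite3
from itertools import combinations

def query_alphagrams(letters, min_len=2):
    letters = sorted(letters.lower())
    # Every subset of length >= min_len; sorting the input first keeps each one an alphagram
    return {"".join(c) for n in range(min_len, len(letters) + 1)
                       for c in combinations(letters, n)}

conn = sqlite3.connect("words.db")
alphas = sorted(query_alphagrams("odfg"))          # ['df', 'dfg', ..., 'go']
placeholders = ",".join("?" * len(alphas))
sql = f"SELECT word FROM alphagrams WHERE alphagram IN ({placeholders})"
words = [row[0] for row in conn.execute(sql, alphas)]
print(sorted(words))   # with a full word list: do, od, of, go, dog, god, fog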
BTW, your example output is poor. Alphagram dfgo with what Scrabble players call 'build' (all possible subsets) makes do od of go dog god fog.
(I hate to have to do this rigmarole, but Hasbro's lawyers are touchy, so: Scrabble is a registered trademark owned in the USA by Hasbro, Inc.; in Canada by Hasbro Canada Corporation; and throughout the rest of the world by J. W. Spear & Sons, a Mattel Company.)
Well, the number of possible orderings of the letters in a word of length n is n!. You have a few more candidates because you also want the shorter words, but that does not change the general O(n!) relationship much. So a simple algorithm that tries all the combinations and looks them up in the database will have that as its complexity.
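For illustration, here is roughly what that brute-force candidate generation looks like; each candidate string would then be looked up in the database:
from itertools import permutations

def brute_force_candidates(letters, min_len=2):
    # Every ordering of every subset of the letters: O(n!) candidates overall
    seen = set()
    for n in range(min_len, len(letters) + 1):
        for p in permutations(letters, n):
            seen.add("".join(p))
    return seen

candidates = brute_force_candidates("odfg")
print(len(candidates))   # 60 candidate strings to look up, for just 4 letters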
Making the algorithm more efficient essentially means reducing the search space, for which there are a few options.
How long it takes to look up a 200,000-row table depends on what kind of data is stored in there, in what format, and what indexes you have on that table.