Hypothesis testing for dependent proportions / repeated-measures proportions

I need statistical advice:
I have two groups, e.g. young adults (N=57) and old adults (N=61), and 4 sights in a city. The task for all participants was to visit one, two, three, or all four of the sights.
The result is a frequency table of how many participants in each group visited each of the four sights (shown here as percentages):
        Sight 1   Sight 2   Sight 3   Sight 4
Old     80%       45%       33%       45%
Young   78%       60%       30%       42%
I would like to test 1) whether the sights were visited in similar proportions across the two participant groups, and 2) whether the proportion of participants visiting each individual sight was higher or lower in the old group.
Looking at the table, I would expect the pattern of visits to be similar across the groups, i.e. Sight 1 was visited by both groups more often than Sights 2-4. I would also expect the proportion of young people who visited Sight 2 to be higher than the proportion of old people.
How would I test this?
I thought about a chi-square test, but since participants can visit more than one sight, the observations are not independent, so that test does not seem appropriate. I would also like to avoid the multiple testing problem as much as possible.
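For reference, the naive per-sight approach I am trying to avoid would look something like the sketch below (assuming statsmodels is available). It runs four separate two-proportion tests, so it ignores the within-participant dependence and runs straight into the multiple testing problem; the counts are just the reported percentages rounded to whole participants.

```python
# Naive baseline (for illustration only): one two-proportion z-test per sight.
# Ignores within-participant dependence and multiple testing; counts are
# reconstructed by rounding the reported percentages.
from statsmodels.stats.proportion import proportions_ztest

n_old, n_young = 61, 57
pct_old = [0.80, 0.45, 0.33, 0.45]
pct_young = [0.78, 0.60, 0.30, 0.42]

for i, (po, py) in enumerate(zip(pct_old, pct_young), start=1):
    counts = [round(po * n_old), round(py * n_young)]
    nobs = [n_old, n_young]
    stat, pval = proportions_ztest(counts, nobs)
    print(f"Sight {i}: z = {stat:.2f}, p = {pval:.3f}")
```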
Thanks

Optimizing large values in Redis for ranking with sorted sets

We are using Redis for mobile game rankings with 5 million users. We use sorted sets for the rankings, and some of them have grown very large (more than a few hundred MB), so our cluster is unevenly balanced because of those keys. Is there a good way to optimize this? Right now we are thinking about splitting the key into sub-ranges of the score (1~100, 101~200, 201~300, etc.), but that adds a lot of complexity because we have different rankings for different purposes, each with a different score range. It would also be pretty hard to maintain and configure. Could you please provide some guidance? Thanks in advance.
First, try to optimize the member structure and see whether there is any room to compress the values. Say your sorted set used to be
"my_sorted_set":
{"userId":"1","profileId":"123","name":"John"} -- 111.0 (score),
{"userId":"2","profileId":"1234","name":"Peter"} -- 122.0 (score)
Change it to {"u":"1","p":"123","n":"John"} -- 111.0 (score).
Every byte saved counts at this amount of data.
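With redis-py that might look like the sketch below (the key and field names are just the ones from the example above):

```python
# Sketch with redis-py: store compact JSON members in the sorted set.
# Key and field names follow the example above.
import json
import redis

r = redis.Redis()

def add_player(user_id: str, profile_id: str, name: str, score: float) -> None:
    # Short field names ("u", "p", "n") shave a few bytes off every member.
    member = json.dumps({"u": user_id, "p": profile_id, "n": name},
                        separators=(",", ":"))
    r.zadd("my_sorted_set", {member: score})

add_player("1", "123", "John", 111.0)
add_player("2", "1234", "Peter", 122.0)
```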
Second, split your sorted set. Instead of splitting by score range, you can split by ranking range.
This assumes the ranking doesn't need to be exact for most players. For example, I care whether I rank 2nd or 3rd in the game, but I won't care or even notice if I'm mid-range and my ranking drops from 145645th to 145649th (you might as well show 145600th so it looks the same).
With this in mind, say you expect 10 million players. You would have 100 Redis zsets, each holding about 100K players, and these zsets should be spread evenly across your cluster.
zset_001 : the top 100K players.
zset_002 : the players ranked 100K~200K.
and so on.
You'll also have a hash that stores which zset each player belongs to. For example, for Player A with id 1234 you'd store player_zset_hash : {"1234": "2"} to mark that this player belongs to zset_002.
To check Player A's ranking, you just add 100K to A's rank within zset_002, so he might rank 100K + 50 = 100050th in the game. Player C, who is in zset_004 and ranks 101st there, has an overall ranking of 3*100K + 101 = 300101st.
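A minimal sketch of that lookup with redis-py (the zset_NNN / player_zset_hash key scheme is the one described above, and the zset members are assumed to be plain player ids):

```python
# Overall rank = (bucket number - 1) * bucket size + rank inside the bucket.
# Assumes the zset_NNN / player_zset_hash key scheme and plain player-id members.
import redis

r = redis.Redis(decode_responses=True)
BUCKET_SIZE = 100_000

def overall_rank(player_id: str) -> int:
    bucket = int(r.hget("player_zset_hash", player_id))   # e.g. "2" -> zset_002
    zset_name = f"zset_{bucket:03d}"
    # ZREVRANK is 0-based with the highest score first, so add 1 for a 1-based rank.
    rank_in_bucket = r.zrevrank(zset_name, player_id) + 1
    return (bucket - 1) * BUCKET_SIZE + rank_in_bucket
```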
Now assume Player A makes a big score increase: you update his score and check his place in zset_002. Say he now ranks first in zset_002! *Compare his score with the score of the lowest-ranked member of zset_001 (say Player B); if A's score is higher than B's, swap their positions: put A in zset_001 and B in zset_002, and maintain player_zset_hash accordingly.
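That promotion step could be sketched like this (same assumed key scheme and plain player-id members; a real version would need to make the swap atomic, e.g. with a transaction or Lua script):

```python
# Promote a player into the next-higher bucket when his score beats the lowest
# score there. Assumes the zset_NNN / player_zset_hash scheme and player-id members.
import redis

def maybe_promote(r: redis.Redis, player_id: str, bucket: int) -> None:
    if bucket <= 1:
        return                                      # already in the top bucket
    upper = f"zset_{bucket - 1:03d}"
    lower = f"zset_{bucket:03d}"
    player_score = r.zscore(lower, player_id)
    tail = r.zrange(upper, 0, 0, withscores=True)   # lowest-scored member of the upper bucket
    if not tail or player_score is None:
        return
    tail_id, tail_score = tail[0]
    if player_score > tail_score:
        # Swap the two players between buckets and update the hash.
        r.zrem(lower, player_id)
        r.zadd(upper, {player_id: player_score})
        r.zrem(upper, tail_id)
        r.zadd(lower, {tail_id: tail_score})
        r.hset("player_zset_hash", player_id, str(bucket - 1))
        r.hset("player_zset_hash", tail_id, str(bucket))
```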
When you add new players (assuming each player starts with a score of zero, scores can only be non-negative, and you never remove players from the ranking), you put them into the last zset, say zset_100. Every night, check the size of the last zset; if it holds more than 100K players, trim it and move the extra ones into a new zset. (And of course, record somewhere the names of all the zsets you have.)
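A rough sketch of that nightly trim, under the same assumptions:

```python
# If the last bucket has grown past 100K players, move its lowest-scored extras
# into a brand-new bucket. Same assumed key scheme and player-id members as above.
import redis

r = redis.Redis(decode_responses=True)
BUCKET_SIZE = 100_000

def trim_last_bucket(last_bucket: int) -> int:
    """Return the index of the last bucket after trimming."""
    name = f"zset_{last_bucket:03d}"
    overflow = r.zcard(name) - BUCKET_SIZE
    if overflow <= 0:
        return last_bucket
    new_name = f"zset_{last_bucket + 1:03d}"
    # The lowest-scored members (ascending ranks 0 .. overflow-1) move down.
    extras = r.zrange(name, 0, overflow - 1, withscores=True)
    r.zadd(new_name, {member: score for member, score in extras})
    r.zremrangebyrank(name, 0, overflow - 1)
    for member, _ in extras:
        r.hset("player_zset_hash", member, str(last_bucket + 1))
    return last_bucket + 1
```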
One thing to note: zset_001 might be read very frequently, so your Redis cluster's read load now becomes unevenly spread. Consider having each application server fetch the top ranking once every five seconds, cache it locally, and serve it to visitors from that cache, instead of hitting Redis every time a visitor needs to check the ranking.
***You might omit the swap step (*) as well and instead rebalance all the zsets once every night, since ZRANK is an O(log N) operation, not O(N). After all, you can probably tolerate some players' rankings being inaccurate for one day; say a player was 200005th, is now supposed to be 199987th, yet you still show him as 200001st. If there are certain game operations that change players' scores significantly, incorporate the swap step into your code for those cases: say, if Player A hits the jackpot, compare his score against the other zsets right after his score changes.

Dealing with a very large TSP-esque problem (80,000+ items)

The problem is as follows:
An input text file will contain N lines, each representing a book. Each book has G genres associated with it, with 1 <= G <= 100.
You want to order those books in terms of interest factor. Consecutive books that are too similar are boring, while books that are too different can confuse visitors. Thus the score for each adjacent pair of books is MIN(genres unique to book i, genres unique to book i+1, genres common to both books), and the aim is to maximise the total score over the whole ordering.
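For concreteness, treating each book's genres as a set, the pair score is:

```python
# Pairwise score: MIN(genres unique to a, genres unique to b, genres common to both),
# with genres represented as sets of genre ids.
def pair_score(genres_a: set, genres_b: set) -> int:
    return min(len(genres_a - genres_b), len(genres_b - genres_a), len(genres_a & genres_b))

# Example: {1, 2, 3} vs {2, 3, 4} -> 1 unique genre each, 2 in common -> score 1.
print(pair_score({1, 2, 3}, {2, 3, 4}))  # -> 1
```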
There are several input files; the largest contains 80,000 books.
I am new to optimisation (this is my first time working on a problem like this), and the first thing I thought of was to simply create an N x N 2-D array with the score between each pair of books (nodes) and, at each step, greedily choose the best possible book to add to the "path".
However, due to the size of the problem, I was unable to create such a large array. So instead, I broke the problem into 8 sections of 10,000 books each and was able to go on with that idea. Unfortunately, it takes quite a long time to finish running (more than an hour), although the score was quite decent.
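(For reference, a simplified sketch of that greedy idea, not my exact code; it computes pair scores on the fly rather than storing the full matrix, but is still O(N^2) score evaluations overall.)

```python
# Greedy nearest-neighbour ordering that computes pair scores lazily instead of
# materialising an N x N matrix. Still O(N^2) score evaluations in total.
def pair_score(a: set, b: set) -> int:
    return min(len(a - b), len(b - a), len(a & b))

def greedy_order(genres: list[set]) -> list[int]:
    n = len(genres)
    remaining = set(range(1, n))
    order = [0]                      # arbitrarily start from book 0
    while remaining:
        last = order[-1]
        # Pick the unused book that scores best against the current end of the path.
        best = max(remaining, key=lambda j: pair_score(genres[last], genres[j]))
        order.append(best)
        remaining.remove(best)
    return order
```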
I also tried simulated annealing, where you randomly swap two books. I used an initial temperature of 100,000 and a cooling rate of 0.0005. That also took a long time and gave a worse score than the other approach.
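(Again a simplified sketch, not my exact code; for clarity it re-scores the whole ordering after every swap, which is far too slow at 80,000 books, where only the neighbourhoods of the two swapped positions would need re-scoring. The temperature and cooling rate are the values quoted above.)

```python
# Swap-based simulated annealing over a book ordering. Re-scores the whole tour
# per move for clarity; a faster version would only re-score around the two swaps.
import math
import random

def pair_score(a: set, b: set) -> int:
    return min(len(a - b), len(b - a), len(a & b))

def tour_score(order: list[int], genres: list[set]) -> int:
    return sum(pair_score(genres[order[i]], genres[order[i + 1]])
               for i in range(len(order) - 1))

def anneal(genres: list[set], temp: float = 100_000.0, cooling: float = 0.0005) -> list[int]:
    order = list(range(len(genres)))
    current = tour_score(order, genres)
    while temp > 1:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        candidate = tour_score(order, genres)
        # Accept improvements always, worsenings with a temperature-dependent probability.
        if candidate >= current or random.random() < math.exp((candidate - current) / temp):
            current = candidate
        else:
            order[i], order[j] = order[j], order[i]   # undo the swap
        temp *= 1 - cooling
    return order
```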
I was wondering if there is a way to improve those approaches (in terms of time taken) or if there was a better approach/algorithm for solving this problem?
Thanks in advance!

Getting the optimal number of employees for a month (rostering)

Is it possible to get the optimal number of employees in a month for a given number of shifts?
I'll explain myself a little further, taking nurse rostering as an example.
Imagine that we don't know how many nurses to plan for in a given month with a fixed number of shifts. Also imagine that each new nurse added to the plan decreases your score, and that each nurse has a limited number of normal hours and a limited number of extra hours; extra hours decrease the score more than normal ones.
So the problem consists of finding the optimal number of nurses needed and their schedules. I've come up with two possible solutions:
Fix the number of nurses clearly above the number needed and treat the problem as an overconstrained one, so some nurses will simply not be assigned to any shift.
Launch multiple instances of the same problem in parallel, each with an incrementally larger number of nurses. The drawback is that you have to estimate beforehand an approximate range of nurse counts below and above the number actually needed.
Both solutions are a bit inefficient; is there a better approach to tackle this problem?
I call option 2 doing simulations. Typically in simulations you don't just play with the number of employees, but also with the #ConstraintWeights etc. It's useful for strategic "what if" decisions (What if we ... hire more people? ... focus more on service quality? ... focus more on financial gain?).
If you really just need to minimize the number of employees, and you can clearly weight that against all the other hard and soft constraints (probably as a weight in between both levels, similar to overconstrained planning), then option 1 is good enough - and less CPU-costly.
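For illustration only, option 2 can be sketched as a sweep over candidate nurse counts, where solve_roster is a hypothetical placeholder for whatever solver you call, assumed to return a (schedule, score) pair with the per-nurse and extra-hour penalties already included in the score:

```python
# Illustrative sketch of option 2: solve the same rostering problem for a range of
# nurse counts in parallel and keep the best-scoring result. solve_roster() is a
# hypothetical placeholder; higher score is assumed to be better.
from concurrent.futures import ProcessPoolExecutor

def solve_roster(n_nurses: int):
    """Hypothetical placeholder: run your actual solver with a fixed nurse count
    and return (schedule, score), with nurse and extra-hour penalties included."""
    raise NotImplementedError

def best_roster(min_nurses: int, max_nurses: int):
    counts = list(range(min_nurses, max_nurses + 1))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(solve_roster, counts))      # [(schedule, score), ...]
    best_count, (schedule, score) = max(zip(counts, results), key=lambda t: t[1][1])
    return best_count, schedule, score
```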

Approximation to Large Linear Program

I have a simple LP with linear constraints. There are many decision variables, roughly 24 million. I have been using lpSolve in R to play with small samples, but this solver isn't scaling well. Are there ways to get an approximate solution to the LP?
Edit:
The problem is a scheduling problem. There are 1 million people who need to be scheduled into one of 24 hours, hence 24 million decision variables. There is a reward $R_{ij}$ for scheduling person $i$ into hour $j$. The constraints are that each person needs to be scheduled into some hour, but each hour only has a finite number of appointment slots $c_j$: maximize $\sum_{i,j} R_{ij} x_{ij}$ subject to $\sum_j x_{ij} = 1$ for each person $i$, $\sum_i x_{ij} \le c_j$ for each hour $j$, and $x_{ij} \ge 0$.
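On a toy instance that formulation looks like the sketch below (scipy.optimize.linprog with made-up numbers; linprog minimizes, so the rewards are negated):

```python
# Toy instance of the assignment LP (3 people, 2 hours) with scipy.optimize.linprog.
# Rewards are made up; linprog minimizes, so we negate R to maximize total reward.
# Decision variables x[i, j] are flattened row-wise.
import numpy as np
from scipy.optimize import linprog

R = np.array([[5.0, 1.0],
              [2.0, 4.0],
              [3.0, 3.0]])          # reward for person i in hour j
c_slots = np.array([2, 2])          # capacity c_j of each hour
n_people, n_hours = R.shape

# Each person goes into exactly one hour: sum_j x[i, j] = 1.
A_eq = np.kron(np.eye(n_people), np.ones((1, n_hours)))
b_eq = np.ones(n_people)
# Each hour holds at most c_j people: sum_i x[i, j] <= c_j.
A_ub = np.kron(np.ones((1, n_people)), np.eye(n_hours))
b_ub = c_slots

res = linprog(-R.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
print(res.x.reshape(n_people, n_hours))   # for this structure, vertex solutions are integral
```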
One good way to approach LPs/IPs with a massive number of variables and constraints is to look for ways to group the decision variables in some logical way. Since you have only given a sketch of your problem, here's a solution idea.
Approach 1 : Group people into smaller batches
Instead of 1M people, think of them as 100 units of 10K people each. So now you only have 2400 (24 x 100) variables. This will get you part of the way there, and note that this won't be the optimal solution, but a good approximation. You can of course make 1000 batches of 1000 people and get a more fine-grained solution. You get the idea.
Approach 2: Grouping into cohorts, based on the Costs
Take a look at your R_ij's. Presumably you don't have a million different costs. There will typically be only a few unique cost values. The idea is to group many people with the same cost structure into one 'cohort'. Now you solve a much smaller problem - which cohorts go into which hour.
Again, once you get the idea you can make it very tractable.
Update Based on OP's comment:
By its very nature, making these groups is an approximation technique. There is no guarantee that the optimal solution will be obtained. However, the whole idea of careful grouping (by looking at cohorts with identical or very similar cost structures) is to get solutions as close to the optimal as possible, with far less computational effort.
I should have also added that when scaling (grouping is just one way to scale-down the problem size), the other constants should also be scaled. That is, c_j should also be in the same units (10K).
If persons A,B,C cannot be fit into time slot j, then the model will squeeze in as many of those as possible in the lowest cost time slot, and move the others to other slots where the cost is slightly higher, but they can be accommodated.
Hope that helps you going in the right direction.
Assuming you have a lot of duplicate people, you are now using way too many variables.
Suppose you only have 1000 different kinds of people, and that some of these occur 2000 times whilst others occur 500 times.
Then you just have to optimize the fraction of people of each kind that you allocate to each hour. (Note that you do have to adjust the objective function and the constraints a bit, scaling by the group size of 2000 or 500.)
The good news is that this should give you the optimal solution with just a 'few' variables, but depending on your problem you will probably need to round the results to get whole people as an outcome.
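A minimal sketch of that aggregated LP (again scipy.optimize.linprog with made-up numbers): the variables become the number of people of each kind assigned to each hour, so the group sizes show up in the person constraints and the capacities stay in units of people.

```python
# Aggregated LP: y[k, j] = number of people of kind k scheduled into hour j.
# Made-up numbers: 3 kinds of people, 2 hours.
import numpy as np
from scipy.optimize import linprog

R = np.array([[5.0, 1.0],
              [2.0, 4.0],
              [3.0, 3.0]])                  # reward per person of kind k in hour j
group_size = np.array([2000, 500, 1500])    # how many duplicates of each kind
c_slots = np.array([2500, 2500])            # capacity of each hour, in people
n_kinds, n_hours = R.shape

# Everyone of each kind must be scheduled: sum_j y[k, j] = group_size[k].
A_eq = np.kron(np.eye(n_kinds), np.ones((1, n_hours)))
b_eq = group_size
# Hour capacities: sum_k y[k, j] <= c_j.
A_ub = np.kron(np.ones((1, n_kinds)), np.eye(n_hours))
b_ub = c_slots

res = linprog(-R.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
print(res.x.reshape(n_kinds, n_hours))      # may need rounding to whole people
```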

Find Similar Rows in Database

I am trying to design my app to find database entries which are similar.
Let's for example take the table car (everything in one table to keep the example simple):
CarID | Car Name | Brand | Year | Top Speed | Performance | Displacement | Price
1     | Z3       | BMW   | 1990 | 250       | 5.4         | 123          | 23456
2     | 3er      | BMW   | 2000 | 256       | 5.4         | 123          | 23000
3     | Mustang  | Ford  | 2000 | 190       | 9.8         | 120          | 23000
Now I want to do queries like this:
"Search for Cars similar to Z3 (all brands)" (ignore "Car Name")
Similar in this context means that the row where the most columns are exactly the same is the most similar.
In this example it would be the "3er BMW", since two columns (Performance and Displacement) are the same.
Can you give me hints on how to design database queries / an application like that? The application is going to be really big, with a lot of entries.
I would also really appreciate useful links or books. (It's no problem for me to investigate further if I know where to search or what to read.)
You could try to give each record a 'score' depending on its fields.
You could weight a column's score depending on how important that property is for the comparison (for instance, top speed could be more important than brand).
You'll end up with a score for each record, and you will be able to find similar records by comparing scores and finding the records that are within +/- 5% (for example) of the record you're looking at.
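A rough sketch of the scoring idea in application code, using the question's exact-column-match definition together with per-column weights (the weights are arbitrary illustrative values):

```python
# Weighted count of exactly-matching columns between a reference row and the others.
# The weights are arbitrary illustrative values.
WEIGHTS = {"Brand": 1.0, "Year": 1.0, "Top Speed": 2.0,
           "Performance": 2.0, "Displacement": 1.5, "Price": 1.0}

def similarity(reference: dict, candidate: dict) -> float:
    return sum(w for col, w in WEIGHTS.items()
               if reference.get(col) == candidate.get(col))

cars = [
    {"CarID": 1, "Car Name": "Z3", "Brand": "BMW", "Year": 1990,
     "Top Speed": 250, "Performance": 5.4, "Displacement": 123, "Price": 23456},
    {"CarID": 2, "Car Name": "3er", "Brand": "BMW", "Year": 2000,
     "Top Speed": 256, "Performance": 5.4, "Displacement": 123, "Price": 23000},
    {"CarID": 3, "Car Name": "Mustang", "Brand": "Ford", "Year": 2000,
     "Top Speed": 190, "Performance": 9.8, "Displacement": 120, "Price": 23000},
]

z3 = cars[0]
ranked = sorted((c for c in cars if c is not z3),
                key=lambda c: similarity(z3, c), reverse=True)
print([(c["Car Name"], similarity(z3, c)) for c in ranked])
```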
The field of finding relationships and similarities in data is called data mining. In your case you could already try clustering and classifying your data in order to see what different groups show up.
I think this book is a good start for an introduction to data mining. Hope this helps.
To solve your problem, you can use a clustering algorithm. First you need to define a similarity metric, then you need to compute the similarity between your input tuple (the Z3) and the rest of the database. You can speed up the process using algorithms such as k-means. Please take a look at this question, where you will find a discussion of a problem similar to yours: Finding groups of similar strings in a large set of strings.
This link is very helpful as well: http://matpalm.com/resemblance/.
Regarding the implementation: if you have a lot of tuples (and more than a few machines), you can use http://mahout.apache.org/. It is a machine learning framework based on Hadoop. You will need a lot of computation power, because clustering algorithms are expensive.
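As a small single-machine illustration of the clustering idea before jumping to Mahout, here is a sketch using scikit-learn's k-means on the numeric columns of the example table (the number of clusters and the scaling are arbitrary choices):

```python
# Cluster the cars on their numeric columns with k-means; cars sharing a cluster
# label count as "similar" under this metric. Cluster count and scaling are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows: Z3, 3er, Mustang; columns: Year, Top Speed, Performance, Displacement, Price.
X = np.array([
    [1990, 250, 5.4, 123, 23456],
    [2000, 256, 5.4, 123, 23000],
    [2000, 190, 9.8, 120, 23000],
])

X_scaled = StandardScaler().fit_transform(X)    # put the columns on a comparable scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)   # e.g. the two BMWs share one cluster, the Mustang gets the other
```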
Have a look at one of the existing search engines like Lucene. They implement a lot of things like that.
This paper might also be useful: Supporting developers with natural language queries
Not really an answer to your question, but since you say you have a lot of entries, you should consider normalizing your car table: move Brand to a separate table and "Car Name"/model to a separate table. This will reduce the amount of data to compare during lookups.