The problem is as follows:
An input text file contains N lines, each representing a book. Each book has G genres associated with it, with 1 <= G <= 100.
You want to order those books in terms of interest factor. Books that are too similar back to back are boring, while books that are too different can confuse visitors. Thus the score for each adjacent pair of books is MIN(genres unique to book i, genres unique to book i+1, genres common to both books).
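For example, with each book's genres held in a set, the pair score is just this (a minimal sketch; the names are mine):

def pair_score(genres_i: set, genres_j: set) -> int:
    """Score for placing book j immediately after book i."""
    unique_i = len(genres_i - genres_j)   # genres only book i has
    unique_j = len(genres_j - genres_i)   # genres only book j has
    common = len(genres_i & genres_j)     # genres shared by both books
    return min(unique_i, unique_j, common)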
There are several input files; the largest one has 80,000 books.
I am new to optimisation (this is my first time working on a problem like this), and the first thing I thought of was to simply create an NxN 2-D array with the scores between every pair of books (nodes) and, at each step, greedily choose the best possible book to add to the "path".
However, due to the size of the problem, I was unable to create such a large array. So instead, I broke the problem into 8 sections of 10,000 books each and was able to go on with that idea. Unfortunately, it takes a long time to finish running (more than an hour), although the score was quite decent.
I also tried simulated annealing, where two books are swapped at random. I used a starting temperature of 100,000 and a cooling rate of 0.0005. That also took a long time and gave a worse score than the greedy approach.
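For reference, the annealing loop looked roughly like this (a sketch from memory; total_score sums pair_score over consecutive books in the current order):

import math
import random

def total_score(order, genres):
    """Sum of pair scores over consecutive books in the given order."""
    return sum(pair_score(genres[order[k]], genres[order[k + 1]])
               for k in range(len(order) - 1))

def anneal(order, genres, temp=100_000.0, cooling=0.0005):
    """Swap two random books; accept worse orders with a probability
    that shrinks as the temperature cools (maximising the total score)."""
    current = total_score(order, genres)
    while temp > 1.0:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        candidate = total_score(order, genres)
        delta = candidate - current
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = candidate                           # accept the swap
        else:
            order[i], order[j] = order[j], order[i]       # undo the swap
        temp *= 1.0 - cooling
    return order, current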
I was wondering if there is a way to improve those approaches (in terms of time taken), or if there is a better approach/algorithm for solving this problem?
Thanks in advance!
TL;DR version: Is there a way to cope with optimisation problems where there exists a large number of optimal solutions (solutions that achieve the best objective value)? That is, finding an optimal solution is pretty quick (but highly dependent on the size of the problem, obviously), but many such solutions exist, so the solver runs endlessly trying to find a better one (endlessly because it does keep finding other feasible solutions, but with an objective value equal to the current best).
Not TL;DR version:
For a university project, I need to implement a scheduler that should output the schedule for every university programme per year of study. I'm provided with some data, and for the purposes of this question I will stick to a general but not so rare example.
In many sections, you have mandatory courses and optional courses. Sometimes those optional courses are divided into modules, and the student needs to choose one of these modules. Often they have to select two modules, but some combinations arise more often than others. Clearly, if you count the number of courses (mandatory + optional) without taking the subdivision into modules into account, you end up with more courses than time slots to schedule them in. My model is quite simple. I have constraints stating that every course should be scheduled to one and only one time slot (a period of 2 hours) and that a professor should not give two courses at the same time. Those are hard constraints. The thing is, in a perfect world, I should also add hard constraints stating that a student cannot have two courses at the same time. But because I don't have enough data and every combination of modules is possible, there is no point in creating one student per combination (mandatory + module 1 + module 2) and applying the hard constraints to each of these students, since it is basically identical to having one student (mandatory + all optionals) and trying to fit the hard constraints - which will fail.
This is why I decided to move those hard constraints into an optimisation objective. I simply define my objective function as minimising, for each student, the number of courses he/she takes that are scheduled simultaneously.
If I run this simple model with only one student (22 courses) and 20 time slots, I should get an objective value of 4 (since 2 time slots each hold 2 courses). But, using Gurobi, the relaxed objective is 0 (since you can place fractions of courses inside a time slot). Therefore, when the solver does reach a solution of cost 4, it cannot prove optimality directly. The real trouble is that, for this simple case, there exists a huge number of optimal solutions (maybe on the order of 22!). Therefore, to prove optimality, it goes through all the other solutions (which share the same objective), desperately trying to find one with a smaller gap between the relaxed objective (0) and the current one (4). Obviously, such a solution doesn't exist...
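For concreteness, the formulation I have in mind looks roughly like this (a minimal gurobipy sketch; the professor assignment is made up, and the clash count here is "courses beyond the first in each slot", so the exact numbers may not match the 4 quoted above):

import gurobipy as gp
from gurobipy import GRB

COURSES = range(22)                      # one student, 22 courses
SLOTS = range(20)                        # 20 two-hour time slots
PROF_OF = {c: c // 2 for c in COURSES}   # made-up professor assignment

m = gp.Model("timetable")
x = m.addVars(COURSES, SLOTS, vtype=GRB.BINARY, name="x")
clash = m.addVars(SLOTS, lb=0.0, name="clash")

# hard: every course is scheduled to exactly one time slot
m.addConstrs((x.sum(c, "*") == 1 for c in COURSES), name="assign")

# hard: a professor does not give two courses at the same time
for t in SLOTS:
    for p in set(PROF_OF.values()):
        m.addConstr(gp.quicksum(x[c, t] for c in COURSES if PROF_OF[c] == p) <= 1)

# soft: count the student's simultaneous courses (beyond the first) in each slot
m.addConstrs((clash[t] >= gp.quicksum(x[c, t] for c in COURSES) - 1 for t in SLOTS))

m.setObjective(clash.sum(), GRB.MINIMIZE)
m.optimize()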
Do you have any idea how I could tackle this problem? I thought of analysing the existing database and trying to figure out which combinations of modules are very likely to happen, so that I can put back the hard constraints, but that seems hazardous (maybe I will select a combination that leads to a conflict, therefore finding no solution at all, or omit a valid combination). The current workaround I use is a time threshold to stop the optimisation...
I have a simple LP with linear constraints. There are many decision variables, roughly 24 million. I have been using lpSolve in R to play with small samples, but this solver isn't scaling well. Are there ways to get an approximate solution to the LP?
Edit:
The problem is a scheduling problem. There are 1 million people who need to be scheduled into one of 24 hours, hence 24 million decision variables. There is a reward $R_{ij}$ for scheduling person $i$ into hour $j$. The constraints are that each person needs to be scheduled into some hour, and that each hour $j$ only has a finite number of appointment slots $c_j$.
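In other words, with $x_{ij}$ denoting the (fractional) assignment of person $i$ to hour $j$, the LP I have in mind is: maximise $\sum_{i}\sum_{j} R_{ij} x_{ij}$ subject to $\sum_{j} x_{ij} = 1$ for every person $i$, $\sum_{i} x_{ij} \le c_j$ for every hour $j$, and $x_{ij} \ge 0$.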
One good way to approach LPs/IPs with a massive number of variables and constraints is to look for ways to group the decision variables in some logical way. Since you have only given a sketch of your problem, here are a couple of solution ideas.
Approach 1: Group people into smaller batches
Instead of 1M people, think of them as 100 units of 10K people each. So now you only have 2400 (24 x 100) variables. This will get you part of the way there, and note that this won't be the optimal solution, but a good approximation. You can of course make 1000 batches of 1000 people and get a more fine-grained solution. You get the idea.
Approach 2: Grouping into cohorts, based on the costs
Take a look at your R_ij's. Presumably you don't have a million different costs. There will typically be only a few unique cost values. The idea is to group many people with the same cost structure into one 'cohort'. Now you solve a much smaller problem - which cohorts go into which hour.
Again, once you get the idea you can make it very tractable.
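For illustration, a rough sketch of the cohort idea in Python (the file names are mine, and I'm assuming a cohort is defined by an identical reward row; R holds the million-by-24 reward matrix and cap the 24 hourly capacities):

import numpy as np
from scipy.optimize import linprog

R = np.load("rewards.npy")                      # assumed shape: (1_000_000, 24)
cap = np.load("capacity.npy")                   # assumed shape: (24,)

# collapse people with identical reward rows into cohorts
rows, inverse, counts = np.unique(R, axis=0, return_inverse=True, return_counts=True)
K = len(rows)                                   # number of cohorts, typically far below 1M

# smaller LP over y[k, j] = how many people of cohort k go into hour j
c = -rows.flatten()                             # linprog minimises, so negate the rewards
A_eq = np.kron(np.eye(K), np.ones((1, 24)))     # sum_j y[k, j] == counts[k]
A_ub = np.kron(np.ones((1, K)), np.eye(24))     # sum_k y[k, j] <= cap[j]
res = linprog(c, A_ub=A_ub, b_ub=cap, A_eq=A_eq, b_eq=counts,
              bounds=(0, None), method="highs")
y = res.x.reshape(K, 24)                        # cohort-level schedule; round and map back via 'inverse'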
Update Based on OP's comment:
By its very nature, making these groups is an approximation technique. There is no guarantee that the optimal solution will be obtained. However, the whole idea of careful grouping (by looking at cohorts with identical or very similar cost structures) is to get solutions as close to the optimal as possible, with far less computational effort.
I should have also added that when scaling down the problem (grouping is just one way to reduce the problem size), the other constants should also be scaled. That is, c_j should also be expressed in the same units (batches of 10K).
If persons A,B,C cannot be fit into time slot j, then the model will squeeze in as many of those as possible in the lowest cost time slot, and move the others to other slots where the cost is slightly higher, but they can be accommodated.
Hope that helps get you going in the right direction.
Assuming you have a lot of duplicate people, you are now using way too many variables.
Suppose you only have 1,000 different kinds of people, and that some of these occur 2,000 times whilst others occur 500 times.
Then you just have to optimize the fraction of people of each kind that you allocate to each hour. (Note that you do have to adjust the objective function and constraints a bit by including the 2,000 or 500 as a constant.)
The good news is that this should give you the optimal solution with just a 'few' variables, but depending on your problem you will probably need to round the results to get whole people as an outcome.
I'm thinking about moving from one index to day-based indexes (multi-index) in our Elasticsearch cluster, which holds a huge number of records.
The actual question is: how will this affect the performance of indexing, searching and mapping in the ES cluster?
Does it take more time to search through one huge index than through hundreds of big indices?
It will take less time to search through one large index than through hundreds of smaller ones.
Breaking an index up in this fashion could help performance if you will primarily be searching only one of the broken-out indexes. In your case, if you will most often need to search for records added on a particular day, then this might help you performance-wise. If you will mostly be searching across the entire range of indexes, you would generally be better off searching the single monolithic index.
Finally, we have implemented ES multi-indexing in our company. For our application we chose a monthly-indices strategy, so we create a new index every month.
Of course, as advised by @femtoRgon, searching through all the smaller indices takes a little longer, but the overall speed of the application has increased because of its query logic.
So, my advice to everybody who wants to move from one index to multiple indices: research your application's needs and select appropriate slices of the whole index (if it's really needed).
As an example, I can share some results of analysing our application that helped us make the decision to use monthly indices (a small routing sketch follows the list):
90-95% of our queries are only for the last 3 months
we have about 4 big groups of queries: today, last week, last month and last 3 months (of course, we could create weekly or daily indices, but they would be too small, since we don't have enough documents for that)
we can explain to customers why they need to wait when they make an "unusual" query across the whole period (all indices)
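For illustration, this is roughly how the application picks the target indices per query group (a sketch with the Python Elasticsearch client, using the older body= style; the "events-YYYY.MM" naming pattern and the match_all query are placeholders for our own conventions):

from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch()

def monthly_indices(months_back):
    """Index names for the current month plus the previous months_back months."""
    year, month = date.today().year, date.today().month
    names = []
    for _ in range(months_back + 1):
        names.append("events-%04d.%02d" % (year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return names

# a "last 3 months" query only hits a handful of monthly indices instead of everything
result = es.search(index=",".join(monthly_indices(3)),
                   body={"query": {"match_all": {}}})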
I am trying to design my app to find database entries which are similar.
Let's for example take the table car (everything in one table to keep the example simple):
CarID | Car Name | Brand | Year | Top Speed | Performance | Displacement | Price
1     | Z3       | BMW   | 1990 | 250       | 5.4         | 123          | 23456
2     | 3er      | BMW   | 2000 | 256       | 5.4         | 123          | 23000
3     | Mustang  | Ford  | 2000 | 190       | 9.8         | 120          | 23000
Now I want to do queries like this:
"Search for Cars similar to Z3 (all brands)" (ignore "Car Name")
Similar in this context means that the row where the most columns are exactly the same is the most similar.
In this example it would be the "3er BMW", since 2 columns (Performance and Displacement) are the same.
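For instance, treating each car as a dict of its column values, the similarity is just the number of matching columns (a rough Python sketch; Brand is left out of this particular comparison because the query above is across all brands):

def similarity(car_a, car_b, ignore):
    """Count columns whose values are exactly the same."""
    return sum(1 for col, val in car_a.items()
               if col not in ignore and car_b.get(col) == val)

z3 = {"CarID": 1, "Car Name": "Z3", "Brand": "BMW", "Year": 1990,
      "Top Speed": 250, "Performance": 5.4, "Displacement": 123, "Price": 23456}
bmw_3er = {"CarID": 2, "Car Name": "3er", "Brand": "BMW", "Year": 2000,
           "Top Speed": 256, "Performance": 5.4, "Displacement": 123, "Price": 23000}

print(similarity(z3, bmw_3er, ignore={"CarID", "Car Name", "Brand"}))   # -> 2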
Can you give me hints on how to design database queries/applications like that? The application is going to be really big, with a lot of entries.
Also, I would really appreciate useful links or books. (It's no problem for me to investigate further if I know where to search or what to read.)
You could try to give each record a 'score' depending on its fields.
You could weight a column's score depending on how important the property is for the comparison (for instance, top speed could be more important than brand).
You'll end up with a score for each record, and you will be able to find similar records by comparing scores and picking the records that are within +/- 5% (for example) of the record you're looking at.
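A rough sketch of that idea (the weights, the 5% band and the restriction to numeric columns are placeholder choices; in practice you would normalise each column first so that no single column dominates the score):

# hypothetical per-column weights: higher = more important for the comparison
WEIGHTS = {"Top Speed": 3.0, "Performance": 2.0, "Displacement": 2.0,
           "Year": 1.0, "Price": 0.5}

def record_score(car):
    """Weighted sum over the numeric columns of one record."""
    return sum(weight * car[col] for col, weight in WEIGHTS.items())

def similar_to(target, candidates, tolerance=0.05):
    """Records whose score lies within +/- tolerance of the target's score."""
    t = record_score(target)
    return [c for c in candidates if abs(record_score(c) - t) <= tolerance * abs(t)]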
The method of finding relationships and similarities in data is called data mining. In your case, you could already try clustering to classify your data and see which distinct groups show up.
I think this book is a good start for an introduction to data mining. Hope this helps.
To solve your problem, you have to use a clustering algorithm. First, you need to define a similarity metric, then you need to compute the similarity between your input tuples (all Z3s) and the rest of the database. You can speed up the process using algorithms such as k-means. Please take a look at this question, where you will find a discussion of a problem similar to yours - Finding groups of similar strings in a large set of strings.
This link is very helpful as well: http://matpalm.com/resemblance/.
Regarding the implementation: if you have a lot of tuples (and more than several machines) you can use http://mahout.apache.org/. It is a machine learning framework based on Hadoop. You will need a lot of computation power, because clustering algorithms are complex.
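On a single machine, the clustering step could look roughly like this with scikit-learn (a hedged sketch; Mahout would be the distributed equivalent, and the choice of numeric columns is illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# numeric columns of the car table: Year, Top Speed, Performance, Displacement, Price
cars = np.array([[1990, 250, 5.4, 123, 23456],
                 [2000, 256, 5.4, 123, 23000],
                 [2000, 190, 9.8, 120, 23000]])

X = StandardScaler().fit_transform(cars)            # scale so that price does not dominate
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)                                       # cars in the same cluster are candidates for "similar"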
Have a look at one of the existing search engines like Lucene. They implement a lot of things like that.
This paper might also be useful: Supporting developers with natural language queries
Not really an answer to your question, but since you say you have a lot of entries, you should consider normalizing your car table: move Brand to a separate table and "Car Name"/model to a separate table. This will reduce the amount of data to compare during the lookups.
I am trying to solve something complex (or at least it looks complex to me).
I have the following entities:
PLAYER (a few of them, with names like "John", "Peter", etc.). Each has a unique ID. For simplicity, let's say it's their name.
GAME (a few of them, say named "Hide and Seek", "Jump and Run", etc.). Same here - each has a unique ID. For simplicity, let it be its name for now.
SCORE (it's numeric).
So, how it works.
Each PLAYER can play in multiple GAMES. He gets some SCORE in every GAME.
I need to build rating tables -- and not just one!
Table #1: most played GAMES
Table #2: best PLAYERS in all games (say the total SCORE in every GAME).
Table #3: best PLAYERS per GAME (by SCORE in particularly that GAME).
I could build something straightforward right away, but that will not work.
I will have more than 10,000 players, and 15 games, which will grow for sure.
A player's score in a game can be as low as 0 and as high as 1,000,000 (not sure if higher is possible at this moment). So I really need some relative data.
Any suggestions?
I am planning to do it with SQL, but may be just using it for key-value storage; anything -- any ideas are welcome.
Thank you!
I would say two things.
First, my answer to your question. Second, what I think you should do instead.
1. Answer:
SQL: it's easy to develop and test, and good enough for production for some time.
A table for Players, with an INT or some other unique value as the key, not strings. (I know you said it's just a sample, but go for "long word" ints; that ought to give you enough unique IDs.)
The same goes for Game. Now, the thing that keeps the high scores together would be a relation between the two.
Score (relation table):
[Player ID][Game_ID][Score]
Where Score is a numeric value... I don't know the max score of each of your games, so you figure out what type is big enough.
Now, this should be quite easy to implement for a start. Get that to work. But don't make every call directly to the database.
Make a 3-tier architecture: a data layer, a business layer and then the "game" layer.
So every game calls the business layer with its own game ID, like:
PlayerSaveScore(int gameID, int playerID, int score)
The business layer then checks that the parameters are of the correct size and are valid IDs, and perhaps validates that this player has actually been in a session within the past 5 minutes, etc.
After validation, the business layer calls the data layer to update the table, where the data layer first checks whether the record exists. If not, it inserts it.
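A minimal sketch of what that business-layer call could look like in Python (the datalayer helpers here are hypothetical placeholders for whatever your data layer exposes):

import datalayer   # hypothetical module wrapping the Score table

MAX_SCORE = 1_000_000     # assumed upper bound, taken from the question

def player_save_score(game_id, player_id, score):
    """Business layer: validate the parameters, then hand off to the data layer."""
    if not (0 <= score <= MAX_SCORE):
        raise ValueError("score out of range")
    if not datalayer.game_exists(game_id) or not datalayer.player_exists(player_id):
        raise ValueError("unknown game or player")
    if not datalayer.had_recent_session(player_id, minutes=5):    # optional sanity check
        raise PermissionError("no recent session for this player")
    # data layer: update the row if it exists, insert it otherwise
    datalayer.upsert_score(player_id, game_id, score)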
Tier design
Once you are "online" (live) and the games become popular, then you can start to "upgrade", but you are still able to get going now with a future-scalable solution. Just remember that EVERY game MUST call the business object/layer, never the database directly!
I've been through the same thought process oh so many times, but I kept getting stuck in a simple loop called preparation, and that has almost never gotten me to a realistic solution that's up and running fast.
So get 100,000 players first! Then start worrying when it grows beyond that.
2. Part two... how to scale... a suggestion:
So here is my reason for all the trouble of building the business layer/web services...
And best of all, your speed problems can be solved nicely now.
You can implement a "cache" quite simply.
You make an extra table. If you only have 15 games you don't need a table per game, but you decide. That table ONLY keeps the TOP 100 of each game. Each time a player posts a new score, you do a select on this "top 100" table and check whether the posted value gets into the list. If it does, you handle that by updating the top-100 table; and for extra speed,
build the extract of the top 100 as a static data list, e.g. XML or similar static data. Depending on your platform, you pick the right "static format" for you.
You could even improve speed further: just keep the smallest value needed to get into the top 100 of each game. That would be one record per game.
Then match the player's score against the game's "lowest score in the top 100"... if it's above, then you have some "caching/indexing" to do and THEN you call the "giant sort" :o)
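A small sketch of that threshold check (the cached floor values and the two helper calls are illustrative placeholders, not an existing API):

# lowest score currently in each game's top 100, kept in memory (illustrative cache)
per_game_floor = {}

def maybe_update_top100(game_id, player_id, score):
    """Only touch the top-100 table (the "giant sort") when the score can actually enter it."""
    floor = per_game_floor.get(game_id, 0)
    if score <= floor:
        return False                                         # cheap rejection, nothing to update
    update_top100_table(game_id, player_id, score)           # hypothetical data-layer call
    per_game_floor[game_id] = lowest_top100_score(game_id)   # hypothetical: refresh the cached floor
    return True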
Get the point? I know it's a very long answer, but I wanted to post you a "complete" solution.
Hope you will mark this as your answer :o)
I don't see why this can't be solved with one score table and simple SQL queries:
(Untested pseudo-SQL)
CREATE TABLE scores (
  player_id INTEGER,
  game_id INTEGER,
  score INTEGER
);
most played games: SELECT game_id, COUNT(*) AS c FROM scores GROUP BY game_id ORDER BY c DESC
best players: SELECT player_id, SUM(score) AS s FROM scores GROUP BY player_id ORDER BY s DESC
best player in a given game: SELECT * FROM scores WHERE game_id=$given_game AND score=(SELECT MAX(score) FROM scores WHERE game_id=$given_game) LIMIT 1
If you need to get a list of the best players across all games simultaneously, you can extend that last query a little (which can probably be optimised with a join, but it's too early for me to think that through right now).
The number of rows you're talking about is tiny in database terms. If you cache the query results as well (e.g. via something like memcached, or within your RoR application) then you'll barely touch the database at all for this.