HABTM finds with "AND" joins, NOT "OR" - sql

I have two models, associated with a HABTM (actually using has_many :through on both ends, along with a join table). I need to retrieve all ModelAs that are associated with BOTH of two ModelBs. I do NOT want all ModelAs for ModelB_1 concatenated with all ModelAs for ModelB_2. I literally want all ModelAs that are associated with BOTH ModelB_1 and ModelB_2. It is not limited to only 2 ModelBs; it may be up to 50 ModelBs, so this must scale.
I can describe the problem using a variety of analogies that I think describe it better than the previous paragraph:
* Find all books that were written by all 3 authors together.
* Find all movies that had the following 4 actors in them.
* Find all blog posts that belong to BOTH the Rails and Ruby categories.
* Find all users that had all 5 of the following tags: funny, thirsty, smart, thoughtful, and quick. (silly example!)
* Find all people that have worked in both San Francisco AND San Jose AND New York AND Paris in their lifetimes.
I've thought of a variety of ways to accomplish this, but they're grossly inefficient and very frowned upon.
Taking an analogy above, say the last one, you could query for all the people in each city, then find the items that exist across all of the resulting arrays. That's a minimum of 5 queries, all of that data transferred back to the app, and then the app has to intensively compare all 5 arrays to each other (loops galore!). That's nasty, right?
Another possible solution would be to chain the finds on top of each other, which would essentially do the same as above, but won't eliminate the multiple queries and processing. Also, how would you build the chain dynamically if you had user-submitted checkboxes or values that could be as high as 50 options? Seems dirty. You'd need a loop. And again, that would lengthen the search duration.
Obviously, if possible, we'd like to have the database perform this for us, so, people have suggested to me that I simply put multiple conditions in. Unfortunately, you can only do an OR with HABTM typically.
Another solution I've run across is to use a search engine like Sphinx or UltraSphinx. For my particular situation, I feel this is overkill, and I'd rather avoid it. I still feel there should be a solution that lets a user craft a query for an arbitrary number of ModelBs and find all matching ModelAs.
How would you solve this problem?

You may do this:
1) Build a query from your ModelA, joining ModelB (through the join model) and filtering to the ModelBs that have one of the values you are looking for, i.e. putting them in OR (where ModelB = 'ModelB_1' OR ModelB = 'ModelB_2'). With this query the result set will have multiple ModelA rows, exactly one row for each ModelB condition satisfied.
2) Add a GROUP BY condition to the query on the ModelA columns you need (even all of them if you wish). The count(*) for each group is then equal to the number of ModelB conditions satisfied.
3) Add a HAVING condition selecting only the groups whose count(*) is equal to the number of ModelB conditions you need to have satisfied.
example:
model_bs_to_find = [100, 200]
ModelA.all(:joins      => {:model_a_to_b => :model_bs},
           :group      => "model_as.id",
           :select     => "model_as.*",
           :conditions => ["model_bs.id IN (?)", model_bs_to_find],
           :having     => "count(*) = #{model_bs_to_find.size}")
N.B. the group and select parameters specified in that way will work in MySQL; the standard SQL way would be to put the whole list of model_as columns in both the group and select parameters.
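For reference, a rough sketch of the equivalent raw SQL, wrapped in find_by_sql; the join table and foreign key names here are assumptions based on the example above:

ModelA.find_by_sql([<<-SQL, model_bs_to_find, model_bs_to_find.size])
  SELECT model_as.*
  FROM model_as
  INNER JOIN model_a_to_bs ON model_a_to_bs.model_a_id = model_as.id
  INNER JOIN model_bs ON model_bs.id = model_a_to_bs.model_b_id
  WHERE model_bs.id IN (?)
  GROUP BY model_as.id
  HAVING COUNT(*) = ?
SQL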

Related

Order results by weighted variables?

I have a listing of ~10,000 apps and I'd like to order them by certain columns, but I want to give certain columns more "weight" than others.
For instance, each app has overall_ratings and current_ratings. The number of overall_ratings is worth, say, 1.5, but the number of current_ratings would be worth, say, 2, since current_ratings show the app is active and currently popular.
Right now there are probably 4-6 of these variables I want to take into account.
So, how can I pull that off? In the query itself? After the fact using just Ruby (remember, there are over 10,000 rows that would need to be processed here)? Something else?
This is a Rails 3.2 app.
Sorting 10,000 objects in plain Ruby doesn't seem like a good idea, especially if you just want the first 10 or so.
You can try to put your math formula in the query (using the order method from Active Record).
However, my favourite approach would be to create a float attribute to store the score and update that value with a before_save method.
I would read about dirty attributes so you only perform this scoring when some of your criteria are updated.
You may also create a rake task that re-scores your current objects.
This way you would keep the scoring functionality in Ruby (you could test it easily) and you could add an index to your float attribute so database queries have better performance.
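A minimal sketch of that approach, assuming a float score column plus hypothetical overall_ratings_count and current_ratings_count columns, with the weights from the question:

class App < ActiveRecord::Base
  before_save :compute_score, :if => :rating_counts_changed?

  private

  # Weights come from the question; the column names are assumptions.
  def compute_score
    self.score = 1.5 * overall_ratings_count + 2.0 * current_ratings_count
  end

  # Dirty tracking: only re-score when the inputs actually changed.
  def rating_counts_changed?
    overall_ratings_count_changed? || current_ratings_count_changed?
  end
end

You can then order cheaply with App.order("score DESC"), backed by an index on score.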
One attempt would be to let the DB do this work for you, with a query something like the following (I cannot really test it for lack of a db schema):
ActiveRecord::Base.connection.execute("SELECT a.*,
  (1.5 * (SELECT COUNT(*) FROM overall_ratings WHERE app_id = a.id) +
     2 * (SELECT COUNT(*) FROM current_ratings WHERE app_id = a.id)) AS rating
  FROM apps a
  HAVING rating > 3
  ORDER BY rating DESC")
The idea is to count the overall and current ratings for each specific app id with the subqueries, weight them as desired, and sum.

Designing a solution to retrieve and classify content based on given attributes

This is a design problem I am facing. Let's say I have a cars website. Cars have the following attributes with different possible values.
Color: red, green, blue
Size: small, big
Based on those attributes I want to classify between cars for young people, cars for middle aged people and cars for elder people, with the following criteria:
Cars_young: red or green
Cars_middle_age: blue and big
Cars_elder: blue and small
I'll call this classification the target.
I have a table cars with columns: id, color and size.
I need to be able to:
a) when retrieving a car by id, tell its target (if it's young, middle age or elder people)
b) be able to query the database to know how many views had cars belonging to each target
Also, as a developer, I must implement it in a way that those criteria are easily changed.
Which is the best way to implement it? Is there a design pattern for it? I can explain some possible solutions I thought about but don't really like:
1) create a new column in the database table called target, so it's easy to make both a) and b).
Drawbacks: each time the criteria change I have to update the target column for all cars, and I also have to change the insertNewCar() function.
2) Implement it in the 'Cars' class.
Drawback: each time the criteria change I have to change the query in b) as well as the code in getCarById in a).
3) Use TRIGGERS in SQL, but I would like to avoid this solution if possible
I would like to have this criteria definition somewhere in the code where it can be changed easily, and hopefully also used by the Cars class. I'm thinking about some singleton or global target object that can be injected into some Cars methods.
Can anyone explain a nice solution, or point me to documentation or a post that faces this problem, or a design pattern that solves it?
On first sight the specification pattern might meet your expectations. Wikipedia gives a nice explanation of how it works; a small teaser below:
OverDueSpecification OverDue = new OverDueSpecification();
NoticeSentSpecification NoticeSent = new NoticeSentSpecification();
InCollectionSpecification InCollection = new InCollectionSpecification();
ISpecification SendToCollection = OverDue.And(NoticeSent).And(InCollection.Not());

InvoiceCollection = Service.GetInvoices();

foreach (Invoice currentInvoice in InvoiceCollection) {
    if (SendToCollection.IsSatisfiedBy(currentInvoice)) {
        currentInvoice.SendToCollection();
    }
}
You could consider combining the specification pattern with observers.
Also, a few other ideas:
* extension of the specification pattern to SQL generation, WHERE clauses in particular
* storing the criteria configuration in the database
* criteria versioning: storing the version of the rules used to assign a category, combined with the category itself
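For illustration, a minimal Ruby sketch of the pattern applied to the cars example (all names are illustrative; the point is that the criteria live in one easily changed place):

class Spec
  def initialize(&predicate)
    @predicate = predicate
  end

  def satisfied_by?(car)
    @predicate.call(car)
  end

  def and(other)
    Spec.new { |car| satisfied_by?(car) && other.satisfied_by?(car) }
  end

  def or(other)
    Spec.new { |car| satisfied_by?(car) || other.satisfied_by?(car) }
  end
end

red   = Spec.new { |car| car.color == 'red' }
green = Spec.new { |car| car.color == 'green' }
blue  = Spec.new { |car| car.color == 'blue' }
big   = Spec.new { |car| car.size == 'big' }
small = Spec.new { |car| car.size == 'small' }

# The criteria from the question, defined in a single place:
TARGETS = {
  :young      => red.or(green),
  :middle_age => blue.and(big),
  :elder      => blue.and(small)
}

# Answers a): given a car, find its target.
def target_for(car)
  match = TARGETS.find { |name, spec| spec.satisfied_by?(car) }
  match && match.first
end

For b), counting views per target would still happen in application code, or the specifications could be extended to emit WHERE clauses, as mentioned above.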

Ruby Rails Complex SQL with aggregate function and DayOfWeek

Rails 2.3.4
I have searched Google and have not found an answer to my dilemma.
For this discussion, I have two models: Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am concerned a bit about the performance of this, as we would like to load this to the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a Mysql::Result or Mysql2::Result, an enumerable that you can call each or to_a on to view your results.
As for caching, I would recommend using memcached, but any other rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
  # Your SQL query and results go here
end
This would put your results into memcached for one day under the key "user/#{user.id}/averages". For example, if you were the user with id 10, your averages would be in memcached under 'user/10/averages', and the next time you performed this query (within the same day) the cached version would be used instead of actually hitting the database.
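Putting the two together, a sketch (user.id is cast to an integer to keep the interpolated SQL safe):

averages = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
  ActiveRecord::Base.connection.select_all(
    "SELECT DAYOFWEEK(sent_at) AS day, AVG(value) AS average " +
    "FROM entries WHERE user_id = #{user.id.to_i} " +
    "GROUP BY DAYOFWEEK(sent_at)")
end

select_all returns an array of plain hashes, which serializes cleanly into memcached.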
Untested, but something like this should work:
@user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read-only. Rails can't reliably determine which columns can and can't be modified. In this case you probably wouldn't try to modify the selected columns, but you can't modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using joins instead:
User.where(:id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense to eschew the find_by_* methods in favor of writing a method in the Entry model that just runs your query with select_all and skips the association mapping, as sketched below.
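A minimal sketch of that model method (the method name is illustrative):

class Entry < ActiveRecord::Base
  # Returns an array of hashes with "day" and "average" keys,
  # skipping ActiveRecord object instantiation entirely.
  def self.averages_by_day_of_week(user_id)
    connection.select_all(
      "SELECT DAYOFWEEK(sent_at) AS day, AVG(value) AS average " +
      "FROM entries WHERE user_id = #{user_id.to_i} " +
      "GROUP BY DAYOFWEEK(sent_at)")
  end
end

This pairs naturally with the caching approach shown earlier.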

figuring out which field to look for a value in with SQL and perl

I'm not too good with SQL and I know there's probably a much more efficient way to accomplish what I'm doing here, so any help would be much appreciated. Thanks in advance for your input!
I'm writing a short program for the local high school. At this school, juniors and seniors who have driver's licenses and cars can opt to drive to school rather than ride the bus. Each driver is assigned exactly one space, and their DLN is used as the primary key of the drivers table. Makes, models, and colors of cars are stored in a separate cars table, related to the drivers table by the license plate number field.
My idea is to have a single search box on the main GUI of the program where the school secretary can type in who/what she's looking for and pull up a list of results. Thing is, she could be typing a license plate number, a car color, make, or model, some driver's name, some student driver's DLN, or a space number. As the programmer, I don't know exactly what she's looking for, so a couple of options come to mind for making certain I check everywhere for a match:
1) perform a couple of
SELECT * FROM [tablename]
SQL statements, one per table, and cram the results into arrays in my program, then search across the arrays one element at a time with a regex, looking for a pattern similar to the search term; if I find a match, add the entire record it appeared in to a results array to display on screen at the end of the search.
2) take whatever she's looking for into the program as a scalar and prepare multiple SELECT statements around it, such as
SELECT * FROM DRIVERS WHERE DLN = $Search_Variable
SELECT * FROM DRIVERS WHERE First_Name = $Search_Variable
SELECT * FROM CARS WHERE LICENSE = $Search_Variable
and so on for each attribute of each table, sticking the results into a results array to show on screen when the search is done.
Is there a cleaner way to go about this lookup without having to make her specify exactly what she's looking for? Possibly some kind of SQL statement I've never seen before?
Seems like the right application for the Sphinx full-text search engine. There's the Sphinx::Search module on CPAN, which can be used as a Perl client for Sphinx.
First of all, you should not use SELECT * and you should definitely use bind values.
Second, the easiest way to figure out what the user is searching for is to ask the user. Have a set of checkboxes likes so:
Search among: [ ] Names
[ ] License Plate Numbers
[ ] Driver's License Numbers
Alternatively, you can note that names do not contain any digits, while license plate numbers and driver's license numbers almost always do. There are other heuristics you can apply to partially deduce what the user was trying to search for.
If you do an OK job of presenting the results, this might work out.
Finally, try to figure out what search possibilities are offered by the database you are using and leverage them so that most of the searching happens before the user interface touches the data.
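If you do want a single query, one option is to join the two tables and search every candidate column with bind values. A sketch, with column names assumed from the description above:

SELECT d.dln, d.first_name, d.last_name, d.space,
       c.license, c.make, c.model, c.color
  FROM drivers d
  LEFT JOIN cars c ON c.license = d.license
 WHERE d.dln = ?
    OR d.first_name LIKE ?
    OR d.last_name LIKE ?
    OR d.space = ?
    OR c.license = ?
    OR c.make LIKE ?
    OR c.model LIKE ?
    OR c.color LIKE ?

In DBI you would bind the same search term to each placeholder (wrapped in % wildcards for the LIKE placeholders).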

how to determine if a record in every source, represents the same person

I have several sources of tables with personal data, like this:
SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...
SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...
SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...
So, assuming that the records with ID 1 from sources 1 and 2 are the same person, my problem is how to determine whether records across the sources represent the same person. Additionally, surely not every record exists in all sources. The names are written mainly in Spanish.
In this case, exact matching needs to be relaxed, because we assume the data sources have not been rigorously checked against the country's official identification bureau. We also have to assume typos are common, given the nature of the processes used to collect the data. What is more, there are around 2 or 3 million records in every source...
Our team has thought of something like this: first, force exact matching on selected fields like ID NUMBER and NAMES to see how hard the problem is. Second, relax the matching criteria and count how many more records can be matched, but here is where the problem arises: how do we relax the matching criteria without generating too much noise or restricting too much?
What tool would be most effective for handling this? For example, do you know of a specific extension in some database engine to support this kind of matching?
Do you know of clever algorithms like Soundex for this kind of approximate matching, but for Spanish text?
Any help would be appreciated!
Thanks.
The crux of the problem is to compute one or more measures of distance between each pair of entries and then consider them to be the same when one of the distances is less than a certain acceptable threshold. The key is to set up the analysis and then vary the acceptable distance until you reach what you consider to be the best trade-off between false positives and false negatives.
One distance measurement could be phonetic. Another you might consider is the Levenshtein or edit distance between the entries, which would attempt to measure typos.
If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you'll have too few. Make it too restrictive and you'll have too many.
If you know roughly how many entries a person should have, then you can use that as a metric to see when you are getting close. Or you can divide the total number of records by the average number of records per person to get a rough number of persons to shoot for.
If you don't have any numbers to use, then you're left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it's guess and check.
I hope that helps.
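As an illustration of the edit-distance measure, a minimal Levenshtein implementation in plain Ruby; the sample records above ('jhon'/'jon', 'gates'/'gate') each come out at distance 1:

def levenshtein(a, b)
  # rows[i][j] holds the edit distance between a[0, i] and b[0, j].
  rows = Array.new(a.length + 1) { |i| [i] }
  (0..b.length).each { |j| rows[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1,        # deletion
                    rows[i][j - 1] + 1,        # insertion
                    rows[i - 1][j - 1] + cost  # substitution
                   ].min
    end
  end
  rows[a.length][b.length]
end

levenshtein('jhon', 'jon')   # => 1
levenshtein('gates', 'gate') # => 1

Pairs whose distance falls below a tunable threshold become candidate matches.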
This sounds like a Customer Data Integration problem. Search on that term and you might find some more information. Also, have a poke around inside The Data Warehousing Institute, and you might find some answers there as well.
Edit: In addition, here's an article that might interest you on Spanish phonetic matching.
I've had to do something similar before and what I did was use a double metaphone phonetic search on the names.
Before I compared the names, though, I tried to normalize away any name/nickname differences by looking the name up in a nickname table I created. (I populated the table with census data I found online.) So people called Bob became Robert, Alex became Alexander, Bill became William, etc.
Edit: Double Metaphone was specifically designed to be better than Soundex and work in languages other than English.
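A minimal sketch of that normalization step (the table contents are just the examples from this answer; a real table would be populated from census data):

NICKNAMES = {
  'bob'  => 'robert',
  'alex' => 'alexander',
  'bill' => 'william'
}

def canonical_first_name(name)
  key = name.strip.downcase
  NICKNAMES.fetch(key, key)
end

canonical_first_name('Bob') # => "robert"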
For SSIS, try using the Fuzzy Lookup transformation.
Just to add some detail to help solve this issue, I found these modules for PostgreSQL 8.3:
* Fuzzy String Match
* Trigrams
You might try to canonicalise the names by comparing them with a dictionary.
This would allow you to spot some common typos and correct them.
Sounds to me like you have a record linkage problem; the references under that term are a good starting point.