I'm trying to create search functionality in a site, and I want the user to be able to search for multiple words, performing substring matching against criteria which exist in various models.
For the sake of this example, let's say I have the following models:
Employee
Company
Municipality
County
A county has multiple municipalities, which have multiple companies, which have multiple employees.
I want the search to be able to search against a combination of Employee.firstname, Employee.lastname, Company.name, Municipality.name and County.name, and I want the end result to be Employee instances.
For example a search for the string "joe tulsa" should return all Employees where both words can be found somewhere in the properties I named in the previous sentence. I'll get some false positives, but at least I should get every employee named "Joe" in Tulsa county.
I've tried a couple of approaches, but I'm not sure I'm going down the right path. I'm looking for a nice RoR-ish way of doing this, and I'm hoping someone with more RoR wisdom can help outline a proper solution.
What I have tried:
I'm not very experienced with this kind of search, but outside RoR I'd manually create an SQL statement joining all the tables together, build WHERE clauses for each separate search word covering the different tables (perhaps using a query builder), then execute the query, loop through the results, instantiate Employee objects manually, and add them to an array.
To solve this in RoR, I have:
1) Dabbled with named scopes in what in my project corresponds to the Employee model, but I got stuck when I needed to join in tables two or more "steps" away (Municipality and County).
2) Created a view (called "search_view") joining all the tables together, to simplify the query. I then thought I'd use Employee.find_by_sql() on this view, which would yield me those nice Employee objects. I figured I'd use a builder to create the SQL, and Arel seemed to be the thing to use, so I tried something like:
view = Arel::Table.new(:search_view)
But the resulting Arel::Table does not contain any columns, so it's not usable for building my query. At this point I'm stuck, as I don't know how to get a working query builder.
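For what it's worth, the "two steps away" join can be expressed with ActiveRecord's nested joins hash, and the per-word LIKE conditions can be built in plain Ruby. A minimal sketch, where the association and column names are assumptions based on the question:

```ruby
# Sketch only: model, association, and column names are guesses
# (Employee belongs_to Company, Company belongs_to Municipality,
# Municipality belongs_to County).
#
# The "two steps away" join can be written as a nested hash:
#
#   Employee.joins(:company => { :municipality => :county })
#           .where(*conditions_for("joe tulsa"))
#
# Building one (col LIKE ? OR col LIKE ? ...) group per search word:
COLUMNS = %w[employees.firstname employees.lastname companies.name
             municipalities.name counties.name]

def conditions_for(query)
  words  = query.split
  clause = words.map { "(" + COLUMNS.map { |c| "#{c} LIKE ?" }.join(" OR ") + ")" }
                .join(" AND ")
  values = words.flat_map { |w| ["%#{w}%"] * COLUMNS.size }
  [clause, *values]
end
```

Each word gets its own AND-ed group of OR-ed LIKE tests, which matches the "every word must appear somewhere" behavior described above, false positives included.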
I strongly recommend using a proper search engine for something like this; it will make your life a lot easier. I had a similar problem and thought, "Boy, I bet setting up something like Sphinx means I have to read thousands of manuals and tutorials first." Well, that's not the case.
Thinking Sphinx is a Rails gem that makes it very easy to integrate Sphinx; I recommend it. You don't need much experience at all to get started:
http://freelancing-god.github.com/ts/en/
I haven't tried other search engines, but I'm very satisfied with Sphinx. I managed to set up a relatively complex real-time search in less than a day.
Related
I have some voting data for an issue that I want to create some reports on.
I want to display the voting results for each issue broken down by the following criteria:
Age
Sex
Income
Education
Race
Different issues could be Abortion, Gun Control, etc.
How would I use Redis to store this voting data and then display reports on them? Here's one report I'm trying to create.
Here's what the report looks like when I want to view the voting data by Age
https://docs.google.com/spreadsheets/d/1N-C4pNN_fwb1kNGQck44TIrIAEn-jPZEpEsW6qQ8lh8/edit?usp=sharing
I want to create similar reports, but they could also be by age and sex, or age and income, or income and education, etc.
Hope you understand what I'm trying to create. I want to let the end user select different criteria on the website and generate this dynamic report on the fly, as fast as possible, which is why I don't want to use MySQL for this. I know Redis can be used to solve this, but I'm just not sure how to get started.
Thanks in advance for any pointers you can provide for me to get started.
Really, this is a problem most easily solved with a traditional RDBMS, like PostgreSQL/MySQL.
However, there are a few ways you could do this in Redis.
One way would be to simply store attributes for each vote in a hash.
redis.hmset "vote:123", "age", 26, "abortion", "yes", "gun_control", "undecided" #, ...
You would also want a Redis SET (e.g. "all_votes") containing all the vote ids, so you don't have to use redis.keys to search for votes.
The next step is making other sets. If you want to be able to look up by age range quickly, you will probably need to build a SET for each age range (e.g. "vote_indexes:age:18-22"), populating it with the ids of all votes falling in that range. Every time you add or remove a vote, you will need to add or remove its id in the all_votes SET, in its corresponding age-range SET, and in any other index SETs you build. If this sounds a lot like database indexes, it is exactly like that, except that you have to maintain them yourself; that is quite a bit of extra code you wouldn't have to write with an RDBMS.
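A minimal sketch of that maintenance, assuming the redis-rb gem; the age ranges and the bucketing helper below are hypothetical, and only the pure-Ruby helper is shown runnable:

```ruby
# The age ranges here are made up; use whatever buckets your reports need.
AGE_RANGES = [18..22, 23..29, 30..39, 40..54, 55..120]

def age_bucket(age)
  range = AGE_RANGES.find { |r| r.cover?(age) }
  range ? "#{range.first}-#{range.last}" : "unknown"
end

# With the redis-rb gem, maintaining the indexes then looks like:
#
#   redis.sadd "all_votes", vote_id
#   redis.sadd "vote_indexes:age:#{age_bucket(26)}", vote_id   # "vote_indexes:age:23-29"
#   redis.sadd "vote_indexes:abortion:yes", vote_id
#
# and removing a vote mirrors this with redis.srem on the same keys.
```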
Now that you have your index sets, you can perform intersections of those sets to do some querying.
redis.sinter("vote_indexes:age:18-22", "vote_indexes:abortion:yes").count
# => 20
Instead of maintaining your own hand-built indexes, you could simply iterate through every vote and build the report as you go, hopefully in one pass. Doing that from your application would be pretty slow, since every vote has to be fetched from Redis. The most performant option would likely be a Lua script running inside Redis: your script would be passed the filter parameters, iterate through all the votes, do the filtering, and return the matching results or even the final report.
That of course means you'll have to learn Lua. It's a nice little language and not difficult to pick up, but it is still more work than using a language you probably already know: SQL.
I love Redis, but I'm not sure you need it here. An ad hoc reporting system is something SQL was literally made for. Don't worry about performance issues until you have them; you'll be surprised how far SQL can get you. If you do hit performance problems, Redis is an amazing way to cache your SQL results and give your RDBMS a break.
I need a special operator that's maybe a bit better than LIKE to query for "similar" values.
THE SCENARIO:
I have a table of students, and I have a table of lessons. The table of lessons was imported from other software, so the StudentID column is null on the imported rows. We therefore need the user to manually select the appropriate student row for each lesson, so the StudentID column can be populated and the tables properly synced. Both tables contain first and last names, but many of them are very likely to be misspelled.
After importing the rows, I would like to present the user with the names from the student rows where the names are "top five most similar" to the values stored in each lesson row. In fact I'd like to present them in descending order from most-to-least similar.
A query using the LIKE operator doesn't quite cut it, because it requires that specific text exist within the column, and it doesn't return a "similarity score".
It is my understanding (from non-technical articles) that the US Post Office has this issue very well handled... People misspell names and street names all the time, but their algorithm for "find similar" is very effective.
I know the algorithm could vary from one solution to the next. For example, I've read that some algorithms consider phonetics, others consider the count of vowels and consonants, while others consider that "T" sounds like "P" when spoken over the phone.
I COULD load every record into my app code and write my own algorithm in C#, VB.NET or whatever, but there are lots of problems with that, including performance. I'd rather accomplish this within the query, so I'm looking for alternatives.
I'm using SQL Server Express but I'm sure the solution applies to other database platforms.
SQL Server supports the SOUNDEX() function, but it works only for similar-sounding names, and not too well at that, at least for non-English text. You could write your own function in C# or VB.NET, implementing any algorithm that suits your needs, and import it into SQL Server as a CLR scalar function.
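To make "write your own function" concrete, here is one common similarity measure, the Levenshtein edit distance, sketched in Ruby; the same dynamic-programming logic ports directly to a C#/VB.NET CLR scalar function:

```ruby
# Levenshtein edit distance: the number of single-character insertions,
# deletions, and substitutions needed to turn string a into string b.
# A lower distance means more similar, so sort candidates ascending.
def levenshtein(a, b)
  prev = (0..b.size).to_a                 # distances for the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                            # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1,           # insertion
               prev[j] + 1,               # deletion
               prev[j - 1] + cost].min    # substitution (or match)
    end
    prev = curr
  end
  prev[b.size]
end
```

Sorting candidate names ascending by distance and taking the first five gives the "top five most similar" list. If you'd rather stay in pure T-SQL, SQL Server's DIFFERENCE() function offers a cruder SOUNDEX-based score from 0 to 4.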
SQL Server's FREETEXT might work for you:
http://msdn.microsoft.com/en-us/library/ms176078.aspx
It searches against a thesaurus, although I'm not sure how well it does with names.
It's very easy to implement, however.
I want to build a search on my website that is similar to Facebook.
For example, entering a search phrase and returning results from multiple tables.
Specifically, I have two tables on my website: Account and Posts. I want to make a search that returns results from both tables based on a search phrase.
I am confused about how to do this.
Can someone point me in the right direction, please?
Thank you,
Brian
You can search on a joined field by using joins to bring in the second table. For example:
Account.where(:your_attribute => search_term).joins(:posts).where("posts.some_attribute = ?", search_term_2)
Or if you're searching from the opposite direction:
Post.where(:some_attribute => search_term).joins(:account).where("accounts.your_attribute = ?", search_term_2)
If you want to do an OR across the two tables, you can; just modify the query a little:
Post.joins(:account).where("posts.attribute = ? or accounts.attribute = ?", search_term, search_term)
If I'm reading your post correctly, what you want is a single search box that searches multiple tables and returns multiple types of data; on Facebook, it returns people, apps, pages, etc. By doing a join, you are going to return Posts with their associated Accounts. However, this won't return Accounts that don't have any Posts, and even if you did an outer join on the table, it wouldn't scale if you wanted to search additional models.
Your simplest solution, without introducing more software into the mix, is to create a database view that maps the data to a structure where it's more easily queryable. In Rails/Ruby, you would query that view like you would a normal database table.
A more complex solution would be to use a full text index such as Apache Solr and use a gem like acts_as_solr_reloaded to query the full text index. At the end of the day, this would be a more robust and scalable solution.
Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now, as an example project, I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I'm thinking about an SQL query like this:
Let's assume "world war 2" has keyword id 3.
select keywordId, count(keywordId) as total
from keywordRelations
where movieId in (select movieId
                  from keywordRelations
                  join movies using (movieId)
                  where keywordId = 3)
group by keywordId
order by total desc
which basically selects all movies which also have the keyword world-war-2, then looks up the keywords those films have as well, and selects the ones which occur the most.
I think with these keywords I can select the movies which match best and build a nice tag cloud of similar movies and related keywords.
I think this should work, but it's very, very inefficient.
And it's also only one level of relation.
There must be a better way to do this, but how??
I basically have a collection of entities. They could be different kinds of entities (movies, actors, subjects, plot-keywords, etc.).
I also have relations between them.
It must somehow be possible to efficiently calculate "semantic distance" for entities.
I also would like to implement more levels of relation.
But I am totally stuck. I have tried different approaches, but everything ends up in algorithms that take ages to run, with runtime growing exponentially.
Are there any database systems available that are optimized for this?
Can someone point me in the right direction?
You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the semantic web Kool-Aid.
From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search implemented using SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions; for each subsequent word, only those words are shown that appear in the articles where the previously entered words appeared. So it's similar to the task you described.
To keep things simple, let's assume we have only three tables. I suppose you have a different schema, and even the details may differ, but my explanation is just to give the idea.
Words
[Id, Word]
This table contains the words (keywords).
Index
[Id, WordId, ArticleId]
This table (also indexed by WordId) lists the articles in which each term appeared.
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table lists the range of Index.Id values for any given article (obviously also indexed by ArticleId). It requires that for any new or updated article the Index table contain entries covering a known from-to range; I suppose this can be achieved in any RDBMS with a little help from the autoincrement feature.
So for any given string of words you:
1) Intersect the articles in which all the previous words appeared; this narrows the search: SELECT ArticleId FROM Index WHERE WordId = ... INTERSECT ...
2) For that list of articles, get the ranges of records from the ArticleRanges table.
3) For those ranges, query the WordId lists from Index, grouping the results to get a count, and finally sorting by it.
Although I listed them as separate actions, the final query can be one big SQL statement built from the parsed query string.
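The three steps can be modeled in a few lines of pure Ruby, with hypothetical in-memory hashes standing in for the Index and ArticleRanges tables:

```ruby
# index: word_id => list of ids of articles containing that word
# article_words: article_id => list of word ids appearing in that article
# (made-up in-memory stand-ins for the Index / ArticleRanges tables)
def related_keywords(query_word_ids, index, article_words)
  # step 1: intersect the article lists of all the query words
  articles = query_word_ids.map { |w| index.fetch(w, []) }.reduce(:&)
  # steps 2-3: count every other word in those articles, most frequent first
  counts = Hash.new(0)
  articles.each { |a| article_words[a].each { |w| counts[w] += 1 } }
  counts.reject { |w, _| query_word_ids.include?(w) }
        .sort_by { |_, c| -c }
        .map { |w, _| w }
end
```

The SQL version does the same thing with an INTERSECT feeding a GROUP BY / ORDER BY; the point of the range table is that step 3 becomes a cheap range scan instead of another join.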
A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the parameter-substitution API to avoid SQL injection, since the query elements depend on what the user decided to include. I can't see how else to build such a query other than by string appending.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
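That first rule (syntax from code, values as parameters) can be sketched in a few lines of Ruby; the filter names and column mappings below are hypothetical:

```ruby
# Whitelist: user-facing filter names map to column syntax that is
# written by our code, never taken from user input.
ALLOWED_FILTERS = {
  "tag"  => "tags.name",
  "user" => "users.display_name",
}

def build_where(filters)
  clauses, params = [], []
  filters.each do |name, value|
    column = ALLOWED_FILTERS.fetch(name) do
      raise ArgumentError, "unknown filter: #{name}"  # syntax never comes from input
    end
    clauses << "#{column} = ?"
    params  << value                                  # values become bound parameters
  end
  [clauses.join(" AND "), params]
end
```

The resulting pair can be handed to whatever parameterized-query API your platform provides; the string being concatenated only ever contains fragments your own code wrote.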
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
A criterion in which a column value must belong to a set of values of arbitrary cardinality does not need to be dynamic. Consider using either the instr function or a special filtering table that you join against. This approach extends easily to multiple columns as long as the number of columns is known; filtering on users and tags could easily be handled this way.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task; it is mostly about escaping SQL string delimiters in the values being filtered for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
If you were using Python to access your database, I would suggest the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk to the database directly in SQL.
From the example link:
# Model definition
class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name

# Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex: you pass around a query object, and you can add filters / sort elements to it. When you are finally ready to use the query, Django creates a SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django analyses your models and creates an automatic admin site from them.
Well, the options have to map to something.
Concatenating a SQL query string isn't a problem as long as you still use parameters for the option values.