Guidance on creating a basic search function in Rails3 - ruby-on-rails-3

Still pretty new to Rails and hoping to develop a function on a site enabling a search to be performed in the manner detailed below:
User inputs a search term / phrase (a string of words, but unlikely to be more than 5 or 6)
The string is chopped into its constituent words
Entries in a single model whose description (a single field in the model) contains those words are output
Having looked at previous questions on this site, I am aware that there are a number of add-ons which are commonly used for search queries; however, are these needed in such a simple situation?
I was thinking that I could use an SQL command with a number of ANDs to perform this task?
Currently the model is stored in SQLite3, but it is probably going to grow to about 100,000 rows (just 10 fields, though) in the near future; is this likely to cause problems?
Finally, is there an easy way to pull out the words of a string automatically, for any length of string up to a certain limit that is unlikely to be exceeded?
Thanks in advance for your time and patience

You can easily pull the words from a string with Ruby: 'alice bob charlie'.split(/\s+/) will give you an array of the words.
Then you can string those words together into an SQL query to find the appropriate records. I don't know about the performance of this solution, though... you should definitely test it out to see if there are any performance issues.
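For example, a rough sketch of how that could look in a Rails 3 model, assuming a Product model with a description column (both names are just placeholders for your own model and field):
class Product < ActiveRecord::Base
  # Split the search phrase into words and AND a LIKE condition together for each one.
  def self.search(phrase)
    words = phrase.to_s.split(/\s+/).reject(&:empty?)
    words.inject(scoped) do |relation, word|
      relation.where("description LIKE ?", "%#{word}%")
    end
  end
end
Calling Product.search("red cotton shirt") would then return only the records whose description contains all three words.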

Related

Does redshift store failed queries?

I wish to analyze the queries executed on a certain Redshift warehouse (not mine).
In order to do so I'm using a query with a join on stl_querytext and stl_query.
My question is: how come I'm also getting illegal queries (i.e. queries with invalid SQL syntax)?
When I tried it on my local Redshift I didn't see those. Also, I couldn't find relevant documentation.
Is this a configuration issue? And in case I'm supposed to get those queries, is there a way to know which ones are the illegal ones?
Thanks,
Nir.
So stl_querytext breaks long queries into parts identified by sequence number. I hope you are reconstructing the parts into the original query as a first step. This can be done with the listagg() function as long as the resulting query doesn't exceed the maximum size of a text field (about 320 parts).
Now this is not enough to get valid SQL back in all cases, because you need to treat the combining of parts differently depending on whether that section of the query falls inside or outside a text string in the query (i.e. whether white space is needed between parts or not).
I've done this exact process a bunch, so it is doable. I don't have a perfect process for the whitespace question, but I get close. Maybe others know the expression to get an exact recreation of the query from stl_querytext.
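If it helps, here is a rough client-side sketch of the same reconstruction step in Ruby, assuming you have already pulled the query, sequence and text columns out of stl_querytext (the whitespace handling at part boundaries is deliberately left as the open question discussed above):
# rows is an array of hashes like { query: 12345, sequence: 0, text: "select ..." }
def reconstruct_queries(rows)
  rows.group_by { |r| r[:query] }.map do |query_id, parts|
    sql = parts.sort_by { |p| p[:sequence] }.map { |p| p[:text] }.join
    [query_id, sql]
  end.to_h
end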

Contains() function falters with strings of numbers?

For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overly complex to the point of, say, Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Now, I'm only passing in a string; no numerical datatypes are being passed in, only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated the number 0042600006, which belongs to one and only one email subject. However, when using our query we are getting results for 0042600001, 0042600002, etc. The query is as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.
I asked this same question recently. Please see the insightful answer someone left me here.
Essentially, what this user says to do is to turn off the noise words (Microsoft has included the integers 0-9 as noise words in Full Text Search). Hope you can use this awesome tool with integers as I now am!
Try adding LANGUAGE 1033 as an additional parameter; that worked in my solution.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', LANGUAGE 1033)
try using
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term: essentially, phrase A matches phrase B if A is contained in B. Right now, they are stored in a DB (Postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly, even after trying all the basic optimization tricks (indexing, etc.) and the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing sub-string matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast compared to more naive methods.
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others. For example, it seems strange that the DB would be built in such a way in the first place: you're having to do this work now because the DB is not in an appropriate format, so it may make more sense to 'fix' the DB, or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within), you would break down the other phrase in the same way and count the number of entries common between them.
It has the nice side effect of still matching if you were to use (for example) "see phrases to testing": the individual words would match, but because the order is different the pairs wouldn't. Since it takes phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, which makes it good for use as a 'score' in matching. (See the sketch below.)
As I say, that kind of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
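As a rough Ruby sketch of that single-word / word-pair breakdown and the overlap score (the method names here are just placeholders):
# Break a phrase into its single words plus each pair of consecutive words.
def entries(phrase)
  words = phrase.downcase.strip.split(/\s+/)
  pairs = words.each_cons(2).map { |a, b| "#{a} #{b}" }
  words + pairs
end

# Crude similarity score: how many singles/pairs the two phrases share.
def overlap_score(a, b)
  (entries(a) & entries(b)).size
end

overlap_score("testing phrases to see", "see phrases to testing")
# the single words all match, but most of the pairs don't, so the score is lower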
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean by the 'WHERE' statements above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as any subsequent test will never produce a match.
Of course, this requires a bit more programming than just one SQL statement, and it depends on the way "phrase match" works exactly.
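Purely as an illustration of the sort-by-length pruning (not the cursor/temporary-table machinery), a toy Ruby sketch might look like this; it is still quadratic, and it treats "phrase match" as simple containment, which may not be exactly what you need:
# Sort phrases by length so each phrase is only ever tested against phrases
# at least as long as itself; a shorter string can never contain a longer one.
def phrase_matches(phrases)
  sorted = phrases.sort_by(&:length)
  matches = []
  sorted.each_with_index do |a, i|
    sorted[(i + 1)..-1].each do |b|
      matches << [a, b] if b.include?(a) # word-boundary checks omitted for brevity
    end
  end
  matches
end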
DTS or SSIS may be your friend here as well.

Searching with words one character long (MySQL)

I have a table Books in my MySQL database which has the columns Title (varchar(255)) and Edition (varchar(20)). Example values for these are "Introduction to Microeconomics" and "4".
I want to let users search for Books based on Title and Edition. So, for example they could enter "Microeconomics 4" and it would get the proper result. My question is how I should set this up on the database side.
I've been told that FULLTEXT search is generally a good way to do things like this. However, because the edition is sometimes just a single character ("4"), full text search would have to be set up to look at individual characters (ft_min_word_len = 1). This, I've heard, is very inefficient.
So, how should I set up searches of this database?
UPDATE: I'm aware that CONCAT/LIKE could be used here. My question is whether it would be too slow. My Books database has hundreds of thousands of books, and a lot of users are going to be searching it.
Here are the steps for a solution:
1) Read the search string from the user.
2) Split the string into parts on the spaces (" ") between the words.
3) Use the following query to get the result:
SELECT * FROM books WHERE Title LIKE '%part[0]%' AND Edition LIKE '%part[1]%';
Here part[0] and part[1] are the separate words split out of the given string.
The PHP code for the above could be:
<?php
$string_array = explode(" ", $string); // $string is the value we are searching for
$title   = mysql_real_escape_string($string_array[0]);
$edition = mysql_real_escape_string($string_array[1]);
$select_query = "SELECT * FROM books WHERE Title LIKE '%" . $title . "%' AND Edition LIKE '%" . $edition . "%'";
$result = mysql_query($select_query); // loop over it with mysql_fetch_array($result) to read each matching row
?>
For the title, this could be extended to join all the parts except the last one (and use the last part as the edition), which handles a case like "Introduction to Microeconomics 4".
For your application, where you're interested in just title and edition, I suspect that using a FULLTEXT index with MATCH/AGAINST and reducing ft_min_word_len to 1 would not have that much impact performance-wise (if your data were more verbose, or user-written content, then I might hesitate).
The easiest way to check is to change the value, REPAIR the table to account for the new ft_min_word_len and rebuild the index, and do some simple benchmarking.
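For illustration, such a boolean-mode full-text query might look roughly like this from ActiveRecord, assuming a hypothetical Book model, a FULLTEXT index covering Title and Edition, and ft_min_word_len lowered as discussed:
# Require every word of the user's input to appear in Title or Edition.
terms = params[:q].to_s.split(/\s+/).map { |w| "+#{w}" }.join(" ")
@books = Book.where("MATCH(Title, Edition) AGAINST (? IN BOOLEAN MODE)", terms)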
Having said that, for your application, I might consider looking into Sphinx. It's definitely going to be orders of magnitude faster, and your content is relatively static, so the delay between re-indexing runs (Sphinx's main drawback, IMO) isn't an issue. Plus, with careful usage of the wordforms and exceptions, you could map things like 4/four/fourth/IV all to the same token for improved searching.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a term query and a wildcard query.
I searched around the net, and the resources I found suggest increasing BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. To what should I raise it? How can I be sure that this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem...
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
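As a rough, Lucene-agnostic sketch of that derived-field idea in Ruby (the field names are just the ones from this example; wiring it into your actual indexer is up to you):
# At index time, store a first-letter field alongside the full code, so the
# query paintCodeFirstLetter:a replaces the prefix query paintCode:a*.
def paint_code_fields(paint_code)
  { "paintCode" => paint_code, "paintCodeFirstLetter" => paint_code[0, 1] }
end

paint_code_fields("a4c2d3")
# => { "paintCode" => "a4c2d3", "paintCodeFirstLetter" => "a" }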
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal clause count, preventing the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
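A rough, library-agnostic Ruby sketch of that automated breakdown (the field-naming scheme follows the example above; how you feed the resulting fields to your indexer is up to you):
# Build the extra prefix fields for a value, one per length up to max_len.
def prefix_fields(name, value, max_len = 3)
  fields = { name => value }
  (1..[max_len, value.length].min).each do |n|
    fields["#{name}#{n}n"] = value[0, n]
  end
  fields
end

prefix_fields("paintCode", "a4c2d3")
# => { "paintCode"   => "a4c2d3",
#      "paintCode1n" => "a",
#      "paintCode2n" => "a4",
#      "paintCode3n" => "a4c" }

# At query time: exact match on paintCode<n>n for terms of up to max_len characters,
# and a prefix query on paintCode only for longer terms.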
Some issues may arise if you have multiple tokens in each field. You can find more details in the article linked above.