Do I need to implement full text search in this case? alternatives? - sql

I have two columns in a PostgreSQL table: first_name and last_name.
On the front end, I have an input that lets users search for people. It is an auto-complete field that calls a web service to search for people by first and/or last name.
Currently, I build the query like this (using my query builder):
$searches = preg_split('/\s+/', $search);
if (!empty($search)) {
$orX = $query->expr()->orX();
$i = 0;
foreach ($searches as $value) {
$orX->add($query->expr()->eq('c.firstName', ':name'.$i));
$orX->add($query->expr()->eq('c.lastName', ':name'.$i));
$query->setParameter('name'.$i, $value);
$i++;
}
$query->andWhere($orX);
}
But this query is not precise enough: it uses OR for every word, so if I search for "Rasmus Lerdorf" it also returns "Rasmus Adams" and "Adel Lerdorf". It works only if I enter a single word ("Rasmus", for example), in which case it returns all people with "Rasmus" as first_name or last_name.
I read about MATCH AGAINST, but I am using PostgreSQL. I have also heard that PostgreSQL's full text search feature is the equivalent of MATCH AGAINST, but I wonder whether implementing full text search would be overkill for this purpose (especially since the two columns together never contain more than 4 words).
Any advice would be appreciated. Thanks.

You don't need full text search.
Just combine the different search terms with AND instead of OR, keeping the OR between the two columns for each term:
$i = 0;
foreach ($searches as $value) {
$orX = $query->expr()->orX();
$orX->add($query->expr()->eq('c.firstName', ':name'.$i));
$orX->add($query->expr()->eq('c.lastName', ':name'.$i));
$query->setParameter('name'.$i, $value);
$i++;
$query->andWhere($orX);
}
I would also suggest using LIKE instead of an equality comparison (add '%' to the start and end of the user's search term), and probably also making everything case-insensitive by applying $query->expr()->lower() where appropriate.
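The combined condition (every word must match the first or the last name, case-insensitively, as a substring) can be sketched outside the query builder as follows. This is a minimal illustration in Python with an in-memory SQLite table, not PostgreSQL or the query builder from the question; the table name `people`, the helpers `build_name_filter` and `search_people`, and the sample rows are assumptions made for the example.

```python
import re
import sqlite3


def build_name_filter(search):
    """Return (where_sql, params): one OR group per word, groups joined by AND."""
    words = [w for w in re.split(r"\s+", search.strip()) if w]
    groups, params = [], []
    for word in words:
        # Each word may match either column; LOWER() makes the match case-insensitive.
        groups.append("(LOWER(first_name) LIKE ? OR LOWER(last_name) LIKE ?)")
        pattern = "%" + word.lower() + "%"
        params.extend([pattern, pattern])
    # With no words at all, fall back to a condition that matches every row.
    return " AND ".join(groups) or "1=1", params


def search_people(conn, search):
    where_sql, params = build_name_filter(search)
    sql = "SELECT first_name, last_name FROM people WHERE " + where_sql + " ORDER BY id"
    return conn.execute(sql, params).fetchall()


# Sample data taken from the names mentioned in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    [("Rasmus", "Lerdorf"), ("Rasmus", "Adams"), ("Adel", "Lerdorf")],
)
```

Because the words are bound as parameters, the user's input never becomes part of the SQL text. The sketch does not escape `%` or `_` typed by the user, which a real implementation would probably want to handle.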

Related

Using Wildcard Sql for searching a word in a TextField

To make it clearer, I have this field:
Columntobesearch
aword1 bword1
aword2 bword2
aword3 bword4
Now I want to search it using the SQL wildcard, so I did this:
%searchbox%
I placed wildcards on both ends of my search term, but it only matches the first word in the field.
When I search for 'aword', all of the rows are shown, but when I search for 'bword', nothing is shown. Please help.
Here is my full code:
$Input=Input::all();
$makethis=Input::flash();
$soptions=Input::get('soptions');
$searchbox=Input::get('searchbox');
$items = Gamefarm::where('roost_hen', '=',Input::get('sex'))
->where($soptions, 'LIKE','%' . $searchbox . '%')
->paginate(12);
If you use MySQL, you can try this:
<?php
$q = Input::get('searchbox');
$results = DB::table('table')
->whereRaw("MATCH(columntobesearch) AGAINST(? IN BOOLEAN MODE)",
array($q)
)->get();
Of course, you need to prepare your table for full text search in your migration file with
DB::statement('ALTER TABLE table ADD FULLTEXT search(columntobesearch)');
That said, this is neither the most scalable nor the most efficient way to do full text search.
For scalable and reliable full text search, I strongly recommend looking at Elasticsearch and using a Laravel package that integrates it.

SQL query to bring all results regardless of punctuation with JSF

So I have a database of articles, and the user should be able to enter a keyword and have the search find any articles containing that word.
For example, if someone searches for the word Alzheimer's, I want it to return articles with the word spelled either way, with or without the apostrophe:
Alzheimer's
Alzheimers
Both should be returned. At the moment it searches for the exact spelling and won't return results that differ in punctuation.
What I have at the moment for the query is:
private static final String QUERY_FIND_BY_SEARCH_TEXT = "SELECT o FROM EmailArticle o where UPPER(o.headline) LIKE :headline OR UPPER(o.implication) LIKE :implication OR UPPER(o.summary) LIKE :summary";
And the user's input is called 'searchText' which comes from the input box.
public static List<EmailArticle> findAllEmailArticlesByHeadlineOrSummaryOrImplication(String searchText) {
Query query = entityManager().createQuery(QUERY_FIND_BY_SEARCH_TEXT, EmailArticle.class);
String searchTextUpperCase = "%" + searchText.toUpperCase() + "%";
query.setParameter("headline", searchTextUpperCase);
query.setParameter("implication", searchTextUpperCase);
query.setParameter("summary", searchTextUpperCase);
List<EmailArticle> emailArticles = query.getResultList();
return emailArticles;
}
So I would like to bring back all results for Alzheimer's regardless of whether there is an apostrophe or not. I think I have given enough information, but if you need more just say. I'm not really sure where to go with it or how to do it. Is it possible to just replace/remove all punctuation, or just apostrophes, from a user search?
In my view, you should change your query.
You should alter your table and add a FULLTEXT index on your columns (headline, implication, summary).
You should also use MATCH ... AGAINST rather than LIKE and, most importantly, read about the SOUNDEX() function, which is very elegant.
All I can give you is a native query example:
SELECT o.*
FROM email_article o
WHERE MATCH(o.headline, o.implication, o.summary) AGAINST('your-text')
OR SOUNDEX(o.headline) LIKE SOUNDEX('your-text')
OR SOUNDEX(o.implication) LIKE SOUNDEX('your-text')
OR SOUNDEX(o.summary) LIKE SOUNDEX('your-text');
It won't give you results like Google search, but it works to some extent. Let me know what you think.
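On the question of simply removing apostrophes: one option is to normalise the stored text and the search term in the same way before comparing them. The following is a minimal sketch in Python with SQLite, not the JPA code from the question; the table `email_article`, the helper names, and the sample rows are assumptions. It strips only the ASCII and the typographic apostrophe; other punctuation would need the same treatment.

```python
import sqlite3

APOSTROPHES = ("'", "\u2019")  # ASCII and typographic apostrophe


def strip_apostrophes(text):
    for ch in APOSTROPHES:
        text = text.replace(ch, "")
    return text


def find_articles(conn, search_text):
    pattern = "%" + strip_apostrophes(search_text).upper() + "%"
    # Apply the same stripping to the column with nested REPLACE() calls.
    sql = (
        "SELECT headline FROM email_article "
        "WHERE UPPER(REPLACE(REPLACE(headline, '''', ''), ?, '')) LIKE ? "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, ("\u2019", pattern))]


# Hypothetical sample rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_article (id INTEGER PRIMARY KEY, headline TEXT)")
conn.executemany(
    "INSERT INTO email_article (headline) VALUES (?)",
    [
        ("Alzheimer's research update",),
        ("Living with Alzheimers",),
        ("Alzheimer\u2019s care guide",),
        ("Diabetes news",),
    ],
)
```

Wrapping the column in functions usually prevents an ordinary index from being used, so on a large table it may be preferable to store a pre-normalised copy of the text and search that instead.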

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain that can extract information such as institute, location, course, etc. from free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institute, courses and location.
Make shingles of the free text, search each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases; for example, B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can handle these kinds of things. I have heard about LingPipe and GATE but don't know how effective they are.
You definitely need GATE. GATE has two main, most frequently used features (among many others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is the annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or similar. In this table I'd keep the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because words like B.tech can be easily split into 2 different words (i.e. B and tech).
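The idea of mapping several written forms to one keyword can be sketched as a small normalisation step plus a lookup table. This is an illustrative Python sketch, not Lucene or GATE code; the `CANONICAL` table, its entries, and the function names are hypothetical.

```python
import re

# Hypothetical lookup table: normalised key -> canonical course name.
CANONICAL = {
    "btech": "B.Tech",
    "mtech": "M.Tech",
    "mba": "MBA",
}


def normalise(token):
    """Lower-case and drop dots, hyphens and spaces so variants collapse to one key."""
    return re.sub(r"[.\-\s]", "", token.lower())


def find_courses(text):
    """Return canonical course names found in free text, in order of appearance."""
    found = []
    # Candidate tokens: letters optionally joined by '.' or '-', e.g. "b.tech", "b-tech".
    for match in re.finditer(r"[A-Za-z]+(?:[.\-][A-Za-z]+)*", text):
        key = normalise(match.group())
        if key in CANONICAL:
            found.append(CANONICAL[key])
    return found
```

The tokenising pattern deliberately keeps "b.tech" and "b-tech" together as one token, which is the same concern raised above about analyzers splitting such words. A form written with a space ("B tech") would still be split into two tokens and missed by this sketch.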
You may want to check UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))))
;
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
Chunkers.regexp("Token", "\\w+"),
Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
new GraphExpChunker("Address",
seq(
opt(streetAddress),
opt(Postoffice),
City,
StateLike,
Postcode,
Country
)
).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
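For intuition about what the fuzzy operator measures, here is a plain Python sketch of the Levenshtein distance. It is not Lucene's implementation, only an illustration of the metric.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]
```

Under this metric "btech" is one edit away from both "b.tech" and "b-tech", so a fuzzy query could in principle bridge them. Whether it does in practice depends on the analyzer keeping "b.tech" as a single term in the first place.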

SQL Select Like Keywords in Any Order

I am building a Search function for a shopping cart site, which queries a SQL Server database. When the user enters "Hula Hoops" in the search box, I want results for all records containing both "Hula" and "Hoop", in any order. Furthermore, I need to search multiple columns (i.e. ProductName, Description, ShortName, MaufacturerName, etc.)
All of these product names should be returned, when searching for "Hula hoop":
Hula hoop
Hoop Hula
The Hoopity of xxhula sticks
(Bonus points if these can be ordered by relevance!)
It sounds like you're really looking for full-text search, especially since you want to weight the words.
In order to use LIKE, you'll have to use multiple expressions (one per word, per column), which means dynamic SQL. I don't know which language you're using, so I can't provide an example, but you'll have to produce a statement that's like this:
For "Hula Hoops":
where (ProductName like '%hula%' or ProductName like '%hoops%')
and (Description like '%hula%' or Description like '%hoops%')
and (ShortName like '%hula%' or ShortName like '%hoops%')
etc.
Unfortunately, that's really the only way to do it. Using Full Text Search would allow you to reduce your criteria to one per column, but you'll still have to specify the columns explicitly.
Since you're using SQL Server, I'm going to hazard a guess that this is a C# question. You'd have to do something like this (assuming you're constructing the SqlCommand or DbCommand object yourself; if you're using an ORM, all bets are off and you probably wouldn't be asking this anyway):
SqlCommand command = new SqlCommand();
int paramCount = 0;
string searchTerms = "Hula Hoops";
string commandPrefix = @"select *
from Products";
StringBuilder whereBuilder = new StringBuilder();
foreach(string term in searchTerms.Split(' '))
{
if(whereBuilder.Length == 0)
{
whereBuilder.Append(" where ");
}
else
{
whereBuilder.Append(" and ");
}
paramCount++;
SqlParameter param = new SqlParameter(string.Format("param{0}",paramCount), "%" + term + "%");
command.Parameters.Add(param);
whereBuilder.AppendFormat("(ProductName like @param{0} or Description like @param{0} or ShortName like @param{0})",paramCount);
}
command.CommandText = commandPrefix + whereBuilder.ToString();
SQL Server Full Text Search should help you out. You will basically create indexes on the columns you want to search. In the WHERE clause of your query you use the CONTAINS predicate and pass it your search input.
You might want to check out SOLR too - if you're going to be doing this type of searching. Super cool.
http://lucene.apache.org/solr/

Building SEO-friendly URLs for accented characters

We are making our site an SEO-friendly site by following the pattern below:
http://OurWebsite.com/MyArticle/Math/Spain/Glaño
As you can see, Glaño contains an accented character that search engines may not like. On the other hand we cannot build up the last URL!
Any suggestions for adapting our current URL generation code to handle Spanish or French entries, or do we need to change our approach?
Try these functions:
function Slug($string, $slug = '-', $extra = null)
{
return strtolower(trim(preg_replace('~[^0-9a-z' . preg_quote($extra, '~') . ']+~i', $slug, Unaccent($string)), $slug));
}
function Unaccent($string)
{
return html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8');
}
And use it like this:
echo Slug('Iñtërnâtiônàlizætiøn of Glaño'); // internationalizaetion-of-glano
You can embed the Unaccent() code into the Slug() function if you wish to have only one function.
Perhaps replace accented characters with the closest matching non-accented Latin character.
Unless "Glano" means something very rude, this is probably your best bet.
If you search Google for "Glaño", it returns pages with "Glano" in them anyway, so the SEO shouldn't be harmed.
To translate the characters from accented to unaccented, you could use this function (this is in PHP, but hopefully you'd be able to use it as a starting point for other languages):
function normalize ($string) {
$table = array(
'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
);
return strtr($string, $table);
}
(Author credit goes to allixsenos at gmail http://php.net/manual/en/function.strtr.php)
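The same idea can be expressed with Unicode decomposition instead of a hand-written table. The following is a sketch in Python rather than PHP; the function name `slugify` is an assumption. Letters that have no decomposition, such as `æ` or `ø`, are dropped by this approach rather than transliterated, so a table like the one above is still needed for them.

```python
import re
import unicodedata


def slugify(text, separator="-"):
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the combining marks.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one separator.
    slug = re.sub(r"[^0-9a-zA-Z]+", separator, ascii_text)
    return slug.strip(separator).lower()
```

Applied to the last URL segment from the question, `slugify("Glaño")` gives `glano`.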
I agree that unless "Glano" means something very rude, this is probably your best bet. I would add that if you care about SEO, you should consider not having too many folders in the URL. Yours has one root, three sub-folders and then the file, which may hurt more than the special character.