Fulltext search and highlighting with PHP and MySQL?

MySQL can take care of fulltext search quite well,
but it doesn't highlight the keywords that are searched,
how can I do this most efficiently?

The solutions posted here require retrieving the entire document in order to search, replace, and highlight text. If the document is large, and many are, this is inefficient. It would be better if MySQL full-text search returned the text offsets directly, as SQLite does, so that an indexed substring operator could be used; that would be significantly more efficient.

Run your SQL query and then do a preg_replace on the result, replacing each keyword with the same keyword wrapped in a span:
$hilitedText = preg_replace('/keyword/', '<span class="hilite">keyword</span>', $row['columnName']);
Then define the hilite class in your CSS to format the highlighted keywords however you want them to appear.
If you have multiple keywords, put the patterns in one array and their replacements in a second array in the same order, and pass those arrays as the first two arguments of the function.

Get the result set from mysql. Do a search and replace for each search word, replacing each word with whatever you're doing for highlighting, e.g., <span class='highlight'>word</span>
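Here is a minimal sketch of that search-and-replace step, written in Python rather than PHP for illustration. The function name highlight and the CSS class hilite are assumptions, not part of any library. It escapes the keywords so that regex metacharacters in user input are treated literally, matches whole words case-insensitively, and HTML-escapes the rest of the text so the result can be embedded in a page.

```python
import html
import re

def highlight(text, keywords, css_class="hilite"):
    """Wrap whole-word, case-insensitive keyword matches in a <span>."""
    # Longest keywords first so "cats" is preferred over "cat".
    terms = sorted((k for k in keywords if k), key=len, reverse=True)
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    out = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then wrap the match itself.
        out.append(html.escape(text[pos:match.start()]))
        out.append('<span class="%s">%s</span>'
                   % (css_class, html.escape(match.group(0))))
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)

print(highlight("MySQL can take care of fulltext search", ["search", "mysql"]))
```

Matching against the raw text and escaping each piece afterwards avoids accidentally highlighting inside an entity such as &amp;.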


Replace all occurrences of a substring in a database text field

I have a database that has around 10k records and some of them contain HTML characters which I would like to replace.
For example I can find all occurrences:
the original string example:
this is the cool mega string that contains &#47
How can I replace all &#47 with /?
The end result should be:
this is the cool mega string that contains /
If you want to replace a specific string with another string or a transformation of that string, you can use the replace function in PostgreSQL. For instance, to replace all occurrences of "cat" with "dog" in the column "myfield", you would do:
UPDATE tablename
SET myfield = replace(myfield, 'cat', 'dog');
You could add a WHERE clause or any other logic as you see fit.
Alternatively, if you are trying to convert HTML entities, ASCII characters, or between various encoding schemes, PostgreSQL has functions for that as well; see the PostgreSQL String Functions documentation.
The answer given by @davesnitty will work, but you need to think very carefully about whether the text pattern you're replacing could appear embedded in a longer pattern you don't want to modify. Otherwise you'll find someone's nooking a fire, and that's just weird.
If possible, use a suitable dedicated tool for what you're un-escaping. Got URL-encoded text? Use a URL decoder. Got XML entities? Process them through an XSLT stylesheet in text output mode, and so on. These are usually safer for your data than hacking it with find-and-replace, because find-and-replace often has unfortunate side effects if not applied very carefully, as noted above.
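As an illustration of the dedicated-tool idea, here is a small Python sketch using only the standard library. The two wrapper function names are hypothetical; html.unescape and urllib.parse.unquote are the standard decoders. It assumes the rows are fetched, decoded in application code, and written back, which is one possible workflow rather than the only one.

```python
import html
from urllib.parse import unquote

def decode_html_entities(text):
    # html.unescape understands named and numeric character references.
    return html.unescape(text)

def decode_url_encoding(text):
    # unquote turns percent-escapes such as %2F back into characters.
    return unquote(text)

print(decode_html_entities("this is the cool mega string that contains &#47"))
print(decode_url_encoding("path%2Fto%2Ffile"))
```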
It's possible you may want to use a regular expression. They are not a universal solution to all problems but are really handy for some jobs.
If you want to unconditionally replace all instances of "&#47" with "/", you don't need a regexp.
If you want to replace "&#47" but not "&#471", you might need a regexp, because you can do things like match only whole words, match various patterns, specify min/max runs of digits, etc.
In the PostgreSQL string functions and operators documentation you'll find the regexp_replace function, which will let you apply a regexp during an UPDATE statement.
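To make the "&#47" versus "&#471" distinction concrete, here is a sketch in Python. The pattern and the function name replace_entity_47 are illustrative assumptions. A negative lookahead rejects the match when another digit follows. If you port the idea to regexp_replace in PostgreSQL, keep in mind that it only replaces the first match unless you pass the 'g' flag.

```python
import re

# "&#47" not followed by another digit, with an optional trailing semicolon.
ENTITY_47 = re.compile(r"&#47(?!\d);?")

def replace_entity_47(text):
    return ENTITY_47.sub("/", text)

print(replace_entity_47("a &#47 b &#471 c &#47;"))
```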
To be able to say much more I'd need to know what your real data is and what you're really trying to do.
If you don't have PostgreSQL, you can export the whole database to an SQL file, replace the string in a text editor, drop the database on your host, and re-import the edited dump.
PS: be careful

How to SQL compare columns when one has accented chars?

I have two SQLite tables that I would like to join on a name column. This column contains accented characters, so I am wondering how I can compare them for the join. I would like the accents dropped for the comparison to work.
You can influence the comparison of characters (such as ignoring case or ignoring accents) by using a collation. SQLite has only a few built-in collations, although you can add your own.
SQLite, Data types, Collating Sequences
SQLite, Define new Collating Sequence
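A rough sketch of a custom collation, using Python's sqlite3 module for illustration (on Android you would need a different route to register it). The collation name NOACCENT, the helper names, and the table layout are assumptions. The helper decomposes each string with Unicode NFD and drops the combining marks before comparing.

```python
import sqlite3
import unicodedata

def strip_accents(s):
    # Decompose characters, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def noaccent_collation(a, b):
    # A collation callback returns a negative, zero, or positive number.
    ka, kb = strip_accents(a).lower(), strip_accents(b).lower()
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("NOACCENT", noaccent_collation)
conn.execute("CREATE TABLE t1 (name TEXT)")
conn.execute("CREATE TABLE t2 (name TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [("José",), ("Ana",)])
conn.executemany("INSERT INTO t2 VALUES (?)", [("Jose",), ("Bob",)])

rows = conn.execute(
    "SELECT t1.name, t2.name FROM t1 "
    "JOIN t2 ON t1.name = t2.name COLLATE NOACCENT"
).fetchall()
print(rows)
```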
Given that it seems doubtful if Android supports UDFs and computed columns, here's another approach:
Add another column to your table, normalizedName
When your app writes out rows to your table, it normalizes name itself, removing accents and performing other changes. It saves the result in normalizedName.
You use normalizedName in your join.
As the normalization function is now in java, you should have few restrictions in coding it. Several examples for removing accents in java are given here.
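The normalized-column steps above could look roughly like this. The sketch is in Python for brevity, although the answer has Java on Android in mind. The table names customers and orders and the helper names are hypothetical; only the normalizedName column comes from the answer.

```python
import sqlite3
import unicodedata

def normalize_name(name):
    # Drop accents and lowercase so equal-looking names compare equal.
    decomposed = unicodedata.normalize("NFD", name)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.lower()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, normalizedName TEXT)")
conn.execute("CREATE TABLE orders (customer TEXT, normalizedName TEXT, item TEXT)")

def insert_customer(name):
    # The application computes the normalized form at write time.
    conn.execute("INSERT INTO customers VALUES (?, ?)",
                 (name, normalize_name(name)))

def insert_order(customer, item):
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (customer, normalize_name(customer), item))

insert_customer("André")
insert_order("Andre", "book")

rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON c.normalizedName = o.normalizedName"
).fetchall()
print(rows)
```

Because the join compares plain stored values, normalizedName can be indexed like any other column.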
There is an easy solution, but not very elegant.
Use the REPLACE function to remove the accents. Example:
SELECT YOUR_COLUMN FROM YOUR_TABLE WHERE replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace( lower(YOUR_COLUMN), 'á','a'), 'ã','a'), 'â','a'), 'é','e'), 'ê','e'), 'í','i'),
'ó','o') ,'õ','o') ,'ô','o'),'ú','u'), 'ç','c') LIKE 'SEARCH_KEY%'
Here SEARCH_KEY is the keyword that you want to find in the column.
As mdma says, a possible solution would be a User-Defined-Function (UDF). There is a document here describing how to create such a function for SQLite in PHP. You could write a function called DROPACCENTS() which drops all the accents in the string. Then, you could join your column with the following code:
SELECT * FROM table1
LEFT JOIN table2
ON DROPACCENTS(table1.column1) = DROPACCENTS(table2.column1)
This is much like how you would use the UPPER() function (UCASE() in MySQL) to perform a case-insensitive join.
Since you cannot use PHP on Android, you would have to find another way to create the UDF. Although it has been said that creating a UDF is not possible on Android, there is another Stack Overflow article claiming that a content provider could do the trick. The latter sounds slightly complicated, but promising.
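For reference, this is roughly what registering such a UDF looks like with Python's sqlite3 module. It is a sketch only and does not address the Android restriction discussed above. DROPACCENTS is the name proposed in the answer; the Python helper drop_accents and the sample data are assumptions.

```python
import sqlite3
import unicodedata

def drop_accents(value):
    # SQL NULL arrives as None; pass it through unchanged.
    if value is None:
        return None
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("DROPACCENTS", 1, drop_accents)

conn.execute("CREATE TABLE table1 (column1 TEXT)")
conn.execute("CREATE TABLE table2 (column1 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)", [("café",), ("thé",)])
conn.execute("INSERT INTO table2 VALUES (?)", ("cafe",))

rows = conn.execute(
    "SELECT table1.column1, table2.column1 FROM table1 "
    "LEFT JOIN table2 "
    "ON DROPACCENTS(table1.column1) = DROPACCENTS(table2.column1) "
    "ORDER BY table1.rowid"
).fetchall()
print(rows)
```

A join on a function call like this cannot use an ordinary column index, so it may be slow on large tables.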
Store a special "neutral" column without accented characters and compare / search only this column.

Mysql query: it's not fetching first result

I have below values in my database.
been Lorem Ipsum and scrambled ever
Here is my query:
MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE) AS score
FROM wiki_businessads
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
It's not fetching the first result. Why? It should, because that row contains the word "text". I have disabled stopword filtering.
This one does not work either:
MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE) AS score
FROM wiki_businessads
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
The full text search only matches words and word prefixes. Because your data in the database does not contain word boundaries (spaces) the words are not indexed, so they are not found.
Some possible choices you could make are:
Fix your data so that it contains spaces between words.
Use LIKE '%text%' instead of a full text search.
Use an external full-text search engine.
I will expand on each of these in turn.
Fix your data so that it contains spaces between words.
Your data seems to have been corrupted somehow. It looks like words or sentences but with all the spaces removed. Do you know how that happened? Was it intentional? Perhaps there is a bug elsewhere in the system. Try to fix that. Find out where the data came from and see if it can be reimported correctly.
If the original source doesn't contain spaces, perhaps you could use some natural language toolkit to guess where the spaces should be and insert them. There most likely already exist libraries that can do this, although I don't happen to know any. A Google search might find something.
Use LIKE '%text%' instead of a full text search.
A workaround is to use LIKE '%text%' instead but note that this will be much slower as it will not be able to use the index. However it will give the correct result.
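A small sketch of the LIKE workaround with a bound parameter, using Python and an in-memory SQLite database as a stand-in for MySQL. The table name ads, the sample rows, and the helper like_pattern are assumptions. The helper escapes % and _ in the user's term so they are matched literally.

```python
import sqlite3

def like_pattern(term):
    # Escape LIKE wildcards in the term, then wrap it in % for a substring match.
    escaped = (term.replace("\\", "\\\\")
                   .replace("%", "\\%")
                   .replace("_", "\\_"))
    return "%" + escaped + "%"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, sHeadline TEXT)")
conn.executemany(
    "INSERT INTO ads (sHeadline) VALUES (?)",
    [("LoremIpsumtextscrambled",), ("no match here",)],
)

rows = conn.execute(
    "SELECT id FROM ads WHERE sHeadline LIKE ? ESCAPE '\\'",
    (like_pattern("text"),),
).fetchall()
print(rows)
```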
Use an external full-text search engine.
You could also look at Lucene or Sphinx. For example I know that Sphinx supports finding text using *text*. Here is an extract from the documentation which explains how to enable infix searching, which is what you need.
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows to implement wildcard searching by 'start*', '*end', and '*middle*' wildcards (refer to enable_star option for details on wildcard syntax). When minimum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.
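To see where those infixes come from, here is a tiny Python sketch that enumerates the proper substrings of a keyword at or above a minimum length. It only illustrates the idea in the quoted documentation; it is not how Sphinx itself is implemented, and the function name infixes is made up.

```python
def infixes(word, min_infix_len):
    """List substrings of word shorter than the word itself,
    with length >= min_infix_len, shortest first."""
    result = []
    for length in range(min_infix_len, len(word)):
        for start in range(len(word) - length + 1):
            result.append(word[start:start + length])
    return result

print(infixes("test", 2))
```

The count grows roughly quadratically with word length, which is why the index becomes much larger.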

How to format keywords in SQL Server Full Text Search

I have a sql function that accepts keywords and returns a full text search table.
How do I format the keyword string when it contains multiple keywords? Do I need to split the string and insert "AND"? (I am passing the keywords to the method through LINQ to SQL.)
Also, how do I best protect myself from SQL injection here? Are the default ASP.NET filters sufficient?
I would use "AND" and asterisks on each word. The asterisk will help the matching be a bit wider since I believe it is best to return too many rather than too few. For example, a search for "Georgia Peach" would use the keyword string '"Georgia*" AND "Peach*"' (the double quotes around each word are important).
And I believe the ASP.NET Filters are sufficient. Plus, since you are using parameterized queries (which LINQ to SQL does), you are pretty safe.
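A sketch of how the keyword string could be built, shown in Python for illustration (the same logic is easy to write in C#). The function name build_contains_query is hypothetical. It keeps only word characters from the user's input, so stray quotes cannot break the search expression, and the result should still be passed to the database as a parameter.

```python
import re

def build_contains_query(user_input):
    # Extract runs of word characters; quotes and punctuation are dropped.
    words = re.findall(r"\w+", user_input)
    # Quote each word, add a prefix wildcard, and combine with AND.
    return " AND ".join('"%s*"' % w for w in words)

print(build_contains_query("Georgia Peach"))
```

An empty result means the input contained no usable words, so the caller should skip the full text search in that case.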

What is causing the LIKE statement to disregard html-tags, words after commas, or end in periods?

I'm working on a search module that searches in text columns that contain HTML code. The queries are constructed like: WHERE htmlcolumn LIKE '% searchterm %';
By default the module searches with a space at both ends of the search term. With a wildcard at the beginning and/or the end of the search term, the corresponding space is removed (*searchterm -> LIKE '%searchterm %'). I've also added the possibility to exclude results containing certain words (-searchterm -> NOT LIKE '% searchterm %'). So far so good.
The problem is that words preceded by an HTML tag are not found (<br/>searchterm is not found when searching on LIKE '% searchterm..), and neither are words that come after a comma or end with a period, etc.
What I would like to do is search for words that are not preceded or followed by the characters A-Z and a-z. Any other character is OK.
Any ideas how i should achieve this? Thanks!
Look into MySQL's full-text search; it might be able to use non-letter characters as delimiters. It will also be much, much faster than a %term% search, since that requires a full table scan.
You could use a regular expression: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Generally speaking, it is better to use full text search facilities, but if you really want a small SQL, here it is:
SELECT * FROM `t` WHERE `htmlcolumn` REGEXP '[[:<:]]term[[:>:]]'
It returns all records that contain the word 'term', whether it is surrounded by spaces, punctuation, special characters, etc.
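The rule from the question (no letter A-Z or a-z directly before or after the term) can be tried out in application code like this. It is a Python sketch and the function name contains_word is an assumption. Also note that the [[:<:]] and [[:>:]] markers belong to the older MySQL regex library; as far as I know, MySQL 8.0.4 and later use ICU, where \b is the word-boundary marker.

```python
import re

def contains_word(text, term):
    # The term must not be preceded or followed by an ASCII letter.
    pattern = r"(?<![A-Za-z])" + re.escape(term) + r"(?![A-Za-z])"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains_word("<br/>searchterm, and more", "searchterm"))
print(contains_word("researchterms", "searchterm"))
```

This still matches text inside the tags themselves (searching for "br" would hit <br/>), so stripping the markup first is worth considering.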
I don't think SQL's "LIKE" operator alone is the right tool for the job you are trying to do. Consider using Lucene, or something like it. I was able to integrate Lucene.NET into my application in a couple of days. You'll spend more time than that trying to salvage your current approach.
If you have no choice but to make your current approach work, then consider storing the text in two columns in your database. The first column is for the original text, with markup, punctuation, etc. The second column holds the text after preprocessing: just words, no punctuation, normalized so that it is easier to search with your "LIKE" approach.
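One way the second, preprocessed column could be produced is sketched below in Python. The function name normalize_for_like and the exact rules (strip tags, decode entities, keep only ASCII letters and digits, lowercase, pad with spaces) are assumptions; adjust them to your data.

```python
import html
import re

def normalize_for_like(raw_html):
    # Replace tags with a space so adjacent words do not run together.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    # Decode entities such as &amp; before removing punctuation.
    text = html.unescape(text)
    # Collapse every run of non-alphanumeric characters into one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    # Pad with spaces so LIKE '% term %' also matches the first and last word.
    return " " + " ".join(text.split()) + " "

print(repr(normalize_for_like("<br/>Searchterm, foo.")))
```

With the padding in place, the existing '% searchterm %' patterns work unchanged against the normalized column.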