What is the best way to search an Oracle CLOB column? - sql

I need to search a CLOB column and am looking for the best way to do this, I've seen variants online of using the DBMS_LOB package as well as using something called Oracle Text. Can someone provide a quick example of how to do this?

Oracle Text indexing is the way go. You can use either CONTEXT or CTXRULE index. CONTEXT can be used on unstructured document where CTXRULE is more helpful on structured documents.
This link will provide more info the index types & syntax.
The most important factor you need to consider is LEXER & STOPLIST.
You can also read the posts on asktom.oracle.com
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:5533095920114

What is in your CLOB and what are you searching for ?
Oracle Text is good if you are searching for words or phrases (which is probably what you have in a CLOB). Sometimes you'll store something 'strange' in a CLOB, like XML or the return value of a web-service call and that might be a different kettle of fish.

I needed to do this just recently and came up with the following solution (uses Spring JDBC)
String sql = "select * from clobtest where dbms_lob.instr(myclob, ? , 1, 1) > 0";
return (String) getSimpleJdbcTemplate().getJdbcOperations().queryForObject(sql, new RowMapper<Object>() {
public String mapRow(ResultSet rs, int rowNum) throws SQLException {
String clobText = lobHandler.getClobAsString(rs, "myclob");
return clobText;
}
}, searchText);
Seems to work pretty well, but I'm going to do some performance testing to see how well it works under load.

Related

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As i mentioned in comment to the question, the best way to check if input string is valid: contains the same count of numbers between brackets, is to use client side programm, but not clear SQL.
Implementation:
List<string> s = new List<string>(){
"(0),(5),(12)", "(0,11),(122,33),(4,51)",
"(0,121,12),(31,4,5),(26,227,38)","(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a=>new
{
orig = a,
newst = a.Split(new string[]{"),(", "(", ")"},
StringSplitOptions.RemoveEmptyEntries)
})
.Select(a=>new
{
orig = a.orig,
isValid = (a.newst
.Sum(b=>b.Split(new char[]{','},
StringSplitOptions.RemoveEmptyEntries).Count()) %
a.newst.Count()) ==0
});
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: The second Select statement gets the modulo of sum of comma instances and the count of items in string array returned by Split function. If the result isn't equal to zero, it means that input string is invalid.
I strongly believe there's a simplest way to achieve that, but - at this moment - i don't know how ;)
:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
It isn't able to solve all of your string problems, just as it cannot be used to check that the opening and closing of brackets (like "((())()(()(())))") is invalid. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work to it with SQL but built in functionality is lacking and the solution would end up being very hacky.
I'd recommend trying to handle this situation differently as large scale string computation isn't the best way to go if your database is to gradually fill up.
A combination of client and serverside validation can be used to help prevent bad data (like the ones with more numbers) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database.
Good luck!

Parse a search string (into NHibernate Criterias )

I would like to implement an advanced search for my project.
The search right now uses all the strings the user enters and makes one big disjunction with criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets()
I have got a hard time parsing the string - and building criterias from the string. I have found this Stackoverflow question, but it didn't really help (he didn't make it clear what he wanted).
I found another article, but this supports much more and spits out sql statements.
Another thing I've heard mention a lot is Lucene - but I'm not sure if this really would help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas ?
Guess you haven't heard about Nhibernate Search till now
Nhibernate Search uses lucene underneath and gives u all the options of using AND, OR, grammar.
All you have to do is attribute your entities for indexing and Nhibernate will index it at a predefined location.
Next time you can search this index with the power that lucene exposes and then get your domain level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor()))) {
QueryParser qp = new QueryParser("id", new StopAnalyzer());
IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
IList result = NHQuery.List();
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Function Search (string queryString)
{
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser luceneParser = new QueryParser("name", analyzer);
Query luceneQuery = luceneParser.Parse(queryString);
string[] words = luceneQuery.ToString().Split(' ');
foreach (string word in words)
{
//Parsing the lucene.net string
}
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.

JSON filter (like SQL SELECT)

I have a json file/stream, i like to be able to make select SQL style
so here is the file
the file contain all the data i have, I'll like to be able to show, let said :
all the : odeu_nom and odeu_desc that is : categorie=Feuilles
if you can do that with PHP and json (eval) fine... tell me how...
on the other part in sql i will do : SELECT * from $json where categorie=Feuilles
p.s. i have found : jsonpath that is a xpath for json... maybe another option ?
p.s. #2... with some research, i have found anoter option, the json is the same as a array, maybe I can filter the array and just return the one i need ?... how do i do that ?
It makes more sense to try and stick with XPath-style selectors (like jsonpath), rather than using SQL, even if you are more familiar with SQL.
The advantage of the "path" is that it is more readily expressive of the hierarchical structure implicit to XML/JSON, as opposed to SQL which requires using various joins to help it "get out of its rectangular/tabular prison".
Although I never used jsonpath, by reading its summary page, I believe that the following should produce all the odeu_nom for objects which catagorie is 'Feuilles' (given the json input referred in the question).
$.Liste_des_odeurs[?(#.categorie = 'Feuilles'].odeu_nom
which correspond to the following XPath
/Liste_des_odeurs[categorie='Feuilles']/odeu_nom
Et voila...
BTW, 'Jazz is not dead, it just smells funny' (F Zappa)

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be use to force a term to be included but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene. Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has a extensive query language as described here that describes everything you want except for + being the default but that's something you can simple handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise to adhere to the default Lucene syntax) and then you can write the transformations from your own syntax to the Lucene syntax.
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.

How to make the Lucene QueryParser more forgiving?

I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.
I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.
By default, Lucene is very picky about query syntax. For example, I just got the following error:
[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
]
Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239
What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.
It seems that there are a few possible, and complementary, strategies:
"Clean" the query prior to sending it to the QueryProcessor
Handle exceptions gracefully
Show an intelligent error message to the user
Perhaps execute a simpler query, leaving off the erroneous bit
I don't really have any great ideas about how to do any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?
Yo can make Lucene ignore the special characters by sanitizing the query with something like
query = QueryParser.Escape(query)
If you do not want your users to ever use advanced syntax in their queries, you can do this always.
If you want your users to use advanced syntax but you also want to be more forgiving with the mistakes you should only sanitize after a ParseException has occured.
Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.
Query safe_query_parser(QueryParser qp, String raw_query)
throws ParseException
{
Query q;
try {
q = qp.parse(raw_query);
} catch(ParseException e) {
q = null;
}
if(q==null)
{
String cooked;
// consider changing this "" to " "
cooked = raw_query.replaceAll("[^\w\s]","");
q = qp.parse(cooked);
}
return q;
}
This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.
You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters will make it through your analyzer
FYI... Here is the code I am using for .NET
private Query GetSafeQuery(QueryParser qp, String query)
{
Query q;
try
{
q = qp.Parse(query);
}
catch(Lucene.Net.QueryParsers.ParseException e)
{
q = null;
}
if(q==null)
{
string cooked;
cooked = Regex.Replace(query, #"[^\w\.#-]", " ");
q = qp.Parse(cooked);
}
return q;
}
I'm in the same situation as you.
Here's what I do. I do catch the exception, but only so that I can make the error look prettier. I don't change the text.
I also provide a link to an explanation of the Lucene syntax which I have simplified a little bit:
http://ifdefined.com/btnet/lucene_syntax.html
I do not know much about Lucene.net. For general Lucene, I highly recommend the book Lucene in Action. For the question at hand, it depends on your users. There are strong reasons, such as ease of use, security and performance, to limit your users' queries. The book shows ways to parse the queries using a custom parser instead of QueryParser. I second Jay's idea about the BooleanQuery, although you can build stronger queries using a custom parser.
If you don't need all Lucene features, you might go better by writing your own query parser. It's not as complicated as it might seem in the first place.