Using Hibernate Search (Lucene), I Need to Be Able to Search a Code With or Without Dashes

This is really the same as it would be for a social security #.
If I have a code with this format:
WHO-S-09-0003
I want to be able to do:
query = qb.keyword().onFields("key").matching("WHOS090003").createQuery();
I tried using a WhitespaceAnalyzer.

StandardAnalyzer and WhitespaceAnalyzer both have the same problem. They will index 'WHO-S-09-0003' as is, which means that a search will only work if the search term contains the hyphens.
One solution to your problem would be to implement your own TokenFilter which detects the format of your codes and removes the hyphens during indexing. You can use AnalyzerDef to build a chain of token filters and an overall custom analyzer. Of course you will have to use the same analyzer when searching, but the Hibernate Search query DSL will take care of that.
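As a rough illustration of the normalization such a filter would apply, here is a minimal plain-Java sketch. It does not use the Lucene or Hibernate Search APIs; `CodeNormalizer` and `normalize` are hypothetical names, and the only assumption is that the same transformation runs at index time and at query time.

```java
// Hypothetical helper, not part of Lucene or Hibernate Search.
// It shows the transformation a custom TokenFilter could apply to each token.
final class CodeNormalizer {
    private CodeNormalizer() {}

    // Removes every hyphen so that "WHO-S-09-0003" and "WHOS090003"
    // end up as the same indexed term.
    static String normalize(String code) {
        return code.replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("WHO-S-09-0003")); // WHOS090003
        System.out.println(normalize("WHOS090003"));    // WHOS090003
    }
}
```

Because both spellings normalize to the same string, a keyword query matches whether or not the user typed the dashes.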

Actually, you can implement your own method like this one:
private String specialCharacters(String keyword) {
    // Characters to escape; extend the list as needed.
    String[] specialChars = {"-", "!", "?"};
    for (int i = 0; i < specialChars.length; i++) {
        if (keyword.indexOf(specialChars[i]) > -1) {
            // "\\" in Java source is a single backslash at runtime.
            keyword = keyword.replace(specialChars[i], "\\" + specialChars[i]);
        }
    }
    return keyword;
}
As you know, the Lucene query syntax has special characters, so if you want to escape one of them you should insert a backslash before it (written as a double backslash in a Java string literal).
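A single-pass variant of the same idea is sketched below. The character list is an assumption taken from the classic Lucene query syntax (`+ - & | ! ( ) { } [ ] ^ " ~ * ? : \`); `QueryEscaper` is a hypothetical name. Escaping in one pass avoids re-escaping backslashes that an earlier replacement has already inserted.

```java
// Hypothetical helper; the set of special characters is assumed from the
// classic Lucene query syntax and may differ between Lucene versions.
final class QueryEscaper {
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    private QueryEscaper() {}

    // Prefixes every special character with one backslash, in a single pass.
    static String escape(String keyword) {
        StringBuilder sb = new StringBuilder(keyword.length() + 8);
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("WHO-S-09-0003")); // WHO\-S\-09\-0003
    }
}
```

Note that escaping only makes the hyphen literal in the query; it does not by itself make a search without hyphens match a term indexed with hyphens.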

Related

Why is my resource pack saying "Unable to parse pack manifest with stack: * Line 9, Column 5 Missing '}' or object member name" [duplicate]

When manually generating a JSON object or array, it's often easier to leave a trailing comma on the last item in the object or array. For example, code to output from an array of strings might look like (in a C++ like pseudocode):
s.append("[");
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i);
}
s.append("]");
giving you a string like
[0,1,2,3,4,5,]
Is this allowed?
Unfortunately the JSON specification does not allow a trailing comma. There are a few browsers that will allow it, but generally you need to worry about all browsers.
In general I try to turn the problem around, and add the comma before the actual value, so you end up with code that looks like this:
s.append("[");
for (i = 0; i < 5; ++i) {
if (i) s.append(","); // add the comma only if this isn't the first entry
s.appendF("\"%d\"", i);
}
s.append("]");
That extra one line of code in your for loop is hardly expensive...
Another alternative I've used when outputting a structure to JSON from a dictionary of some form is to always append a comma after each entry (as you are doing above) and then add a dummy entry at the end that has no trailing comma (but that is just lazy ;->).
Doesn't work well with an array unfortunately.
No. The JSON spec, as maintained at http://json.org, does not allow trailing commas. From what I've seen, some parsers may silently allow them when reading a JSON string, while others will throw errors. For interoperability, you shouldn't include it.
The code above could be restructured, either to remove the trailing comma when adding the array terminator or to add the comma before items, skipping that for the first one.
Simple, cheap, easy to read, and always works regardless of the specs.
$delimiter = '';
for .... {
print $delimiter.$whatever
$delimiter = ',';
}
The redundant assignment to $delimiter is a very small price to pay.
Also works just as well if there is no explicit loop but separate code fragments.
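The same delimiter trick can be sketched in Java. `JsonArrayWriter` is a hypothetical name, and the example assumes integer values so that no string escaping is needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the "empty first delimiter" technique.
final class JsonArrayWriter {
    private JsonArrayWriter() {}

    // Writes the values as a JSON array with no trailing comma.
    static String toJsonArray(List<Integer> values) {
        StringBuilder sb = new StringBuilder("[");
        String delimiter = "";          // nothing before the first item
        for (int value : values) {
            sb.append(delimiter).append(value);
            delimiter = ",";            // every later item gets a leading comma
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJsonArray(Arrays.asList(0, 1, 2, 3, 4, 5))); // [0,1,2,3,4,5]
    }
}
```

An empty list yields `[]`, so no special case is needed for zero items.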
Trailing commas are allowed in JavaScript, but don't work in IE. Douglas Crockford's versionless JSON spec didn't allow them, and because it was versionless this wasn't supposed to change. The ES5 JSON spec allowed them as an extension, but Crockford's RFC 4627 didn't, and ES5 reverted to disallowing them. Firefox followed suit. Internet Explorer is why we can't have nice things.
As it's been already said, JSON spec (based on ECMAScript 3) doesn't allow trailing comma. ES >= 5 allows it, so you can actually use that notation in pure JS. It's been argued about, and some parsers did support it (http://bolinfest.com/essays/json.html, http://whereswalden.com/2010/09/08/spidermonkey-json-change-trailing-commas-no-longer-accepted/), but it's the spec fact (as shown on http://json.org/) that it shouldn't work in JSON. That thing said...
... I'm wondering why no one pointed out that you can actually split the loop at the 0th iteration and use a leading comma instead of a trailing one. This gets rid of the comparison code smell and any actual performance overhead in the loop, resulting in code that's actually shorter, simpler and faster (due to no branching/conditionals in the loop) than the other solutions proposed.
E.g. (in a C-style pseudocode similar to OP's proposed code):
s.append("[");
// MAX == 5 here. if it's constant, you can inline it below and get rid of the comparison
if ( MAX > 0 ) {
s.appendF("\"%d\"", 0); // 0-th iteration
for( int i = 1; i < MAX; ++i ) {
s.appendF(",\"%d\"", i); // i-th iteration
}
}
s.append("]");
PHP coders may want to check out implode(). This takes an array and joins its elements using a string.
From the docs...
$array = array('lastname', 'email', 'phone');
echo implode(",", $array); // lastname,email,phone
Interestingly, both C & C++ (and I think C#, but I'm not sure) specifically allow the trailing comma, for exactly the reason given: it makes programmatically generating lists much easier. Not sure why JavaScript didn't follow their lead.
Rather than engage in a debating club, I would adhere to the principle of Defensive Programming by combining both simple techniques in order to simplify interfacing with others:
As a developer of an app that receives json data, I'd be relaxed and allow the trailing comma.
When developing an app that writes json, I'd be strict and use one of the clever techniques of the other answers to only add commas between items and avoid the trailing comma.
There are bigger problems to be solved...
Use JSON5. Don't use JSON.
Objects and arrays can have trailing commas
Object keys can be unquoted if they're valid identifiers
Strings can be single-quoted
Strings can be split across multiple lines
Numbers can be hexadecimal (base 16)
Numbers can begin or end with a (leading or trailing) decimal point.
Numbers can include Infinity and -Infinity.
Numbers can begin with an explicit plus (+) sign.
Both inline (single-line) and block (multi-line) comments are allowed.
http://json5.org/
https://github.com/aseemk/json5
No. The "railroad diagrams" in https://json.org are an exact translation of the spec and make it clear that a , always comes before a value, never directly before ] or }.
There is a possible way to avoid an if-branch in the loop.
s.append("[ "); // there is a space after the left bracket
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i); // always add comma
}
s.back() = ']'; // modify last comma (or the space) to right bracket
According to the Class JSONArray specification:
An extra , (comma) may appear just before the closing bracket.
The null value will be inserted when there is , (comma) elision.
So, as I understand it, it should be allowed to write:
[0,1,2,3,4,5,]
But it could happen that some parsers will return 7 as the item count (like IE8, as Daniel Earwicker pointed out) instead of the expected 6.
Edited:
I found this JSON Validator that validates a JSON string against RFC 4627 (The application/json media type for JavaScript Object Notation) and against the JavaScript language specification. Actually here an array with a trailing comma is considered valid just for JavaScript and not for the RFC 4627 specification.
However, the RFC 4627 specification states that:
2.3. Arrays
An array structure is represented as square brackets surrounding zero
or more values (or elements). Elements are separated by commas.
array = begin-array [ value *( value-separator value ) ] end-array
To me this is again an interpretation problem. If you write that Elements are separated by commas (without stating something about special cases, like the last element), it could be understood in both ways.
P.S. RFC 4627 isn't a standard (as explicitly stated), and has already been obsoleted by RFC 7159 (which is a proposed standard).
It is not recommended, but you can still do something like this to parse it.
jsonStr = '[0,1,2,3,4,5,]';
let data;
eval('data = ' + jsonStr);
console.log(data)
With Relaxed JSON, you can have trailing commas, or just leave the commas out. They are optional.
There is no reason at all commas need to be present to parse a JSON-like document.
Take a look at the Relaxed JSON spec and you will see how 'noisy' the original JSON spec is. Way too many commas and quotes...
http://www.relaxedjson.org
You can also try out your example using this online RJSON parser and see it get parsed correctly.
http://www.relaxedjson.org/docs/converter.html?source=%5B0%2C1%2C2%2C3%2C4%2C5%2C%5D
As stated it is not allowed. But in JavaScript this is:
var a = Array()
for(let i=1; i<=5; i++) {
a.push(i)
}
var s = "[" + a.join(",") + "]"
(works fine in Firefox, Chrome, Edge, IE11, and without the let in IE9, 8, 7, 5)
From my past experience, I found that different browsers deal with trailing commas in JSON differently.
Both Firefox and Chrome handle it just fine. But IE (all versions) seems to break. I mean really break and stop reading the rest of the script.
Keeping that in mind, and also the fact that it's always nice to write compliant code, I suggest spending the extra effort of making sure that there's no trailing comma.
:)
I keep a current count and compare it to a total count. If the current count is less than the total count, I display the comma.
May not work if you don't have a total count prior to executing the JSON generation.
Then again, if you're using PHP 5.2.0 or better, you can just format your response using the built-in JSON API.
Since a for-loop is used to iterate over an array, or similar iterable data structure, we can use the length of the array as shown,
awk -v header="FirstName,LastName,DOB" '
BEGIN {
FS = ",";
print("[");
columns = split(header, column_names, ",");
}
{ print(" {");
for (i = 1; i < columns; i++) {
printf(" \"%s\":\"%s\",\n", column_names[i], $(i));
}
printf(" \"%s\":\"%s\"\n", column_names[i], $(i));
print(" }");
}
END { print("]"); } ' datafile.txt
With datafile.txt containing,
Angela,Baker,2010-05-23
Betty,Crockett,1990-12-07
David,Done,2003-10-31
String l = "[" + List<int>.generate(5, (i) => i + 1).join(",") + "]";
Using a trailing comma is not allowed for json. A solution I like, which you could do if you're not writing for an external recipient but for your own project, is to just strip (or replace by whitespace) the trailing comma on the receiving end before feeding it to the json parser. I do this for the trailing comma in the outermost json object. The convenient thing is then if you add an object at the end, you don't have to add a comma to the now second last object. This also makes for cleaner diffs if your config file is in a version control system, since it will only show the lines of the stuff you actually added.
char* str = readFile("myConfig.json");
char* chr = strrchr(str, '}') - 1;
int i = 0;
while( chr[i] == ' ' || chr[i] == '\n' ){
i--;
}
if( chr[i] == ',' ) chr[i] = ' ';
JsonParser parser;
parser.parse(str);
I usually loop over the array and attach a comma after every entry in the string. After the loop I delete the last comma again.
Maybe not the best way, but less expensive than checking every time if it's the last object in the loop I guess.

Java - Index a String (Substring)

I have this string:
201057&channelTitle=null_JS
I want to be able to cut out the '201057' and make it a new variable. But I don't always know how long the digits will be, so can I somehow use the '&' as a reference?
myDigits.substring(0, position of &)?
Thanks
Sure, you can split the string along the &.
String s = "201057&channelTitle=null_JS";
String[] parts = s.split("&");
String newVar = parts[0];
The expected result here is
parts[0] = "201057";
parts[1] = "channelTitle=null_JS";
In production code you should of course check the length of the parts array, in case no "&" was present.
Several programming languages also support the useful inverse operation
String s2 = String.join("&", parts); // should have the same value as s
Before Java 8 this was not part of the Java standard library (Apache Commons Lang offers StringUtils.join for that case); since Java 8, String.join does it.
Always read the API first. There is an indexOf method in String that will return you the first index of the character/String you gave it.
You can use myDigits.substring(0, myDigits.indexOf('&'));
However, if you want to get all of the arguments in the query separately, then you should use mvw's answer.
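Since indexOf returns -1 when the character is absent, and substring(0, -1) throws, a small guard is worth adding. The following is a minimal sketch; `PrefixExtractor` and `beforeAmpersand` are hypothetical names, and returning the whole string when no '&' is found is an assumption about the desired behavior.

```java
// Hypothetical helper combining indexOf and substring with a guard.
final class PrefixExtractor {
    private PrefixExtractor() {}

    // Returns everything before the first '&', or the whole string
    // when there is no '&' at all.
    static String beforeAmpersand(String s) {
        int pos = s.indexOf('&');
        return pos >= 0 ? s.substring(0, pos) : s;
    }

    public static void main(String[] args) {
        System.out.println(beforeAmpersand("201057&channelTitle=null_JS")); // 201057
    }
}
```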

Strange behavior of Lucene SpanishAnalyzer class with accented words

I'm using the SpanishAnalyzer class in Lucene 3.4. When I want to parse accented words, I'm having a strange result. If I parse, for example, these two words: "comunicación" and "comunicacion", the stems I'm getting are "comun" and "comunicacion". If I instead parse "maratón" and "maraton", I'm getting the same stem for both words ("maraton").
So, at least in my opinion, it's very strange that the same word, "comunicación", gives different results depending on whether it is accented or not. If I search the word "comunicacion", I should get the same result regardless of whether it's accented or not.
The code I'm using is the following:
SpanishAnalyzer sa = new SpanishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", sa);
String str = "comunicación";
String str2 = "comunicacion";
System.out.println("first: " + parser.parse(str)); //stem = comun
System.out.println("second: " + parser.parse(str2)); //stem = comunicacion
The solution I've found to be able to get every single word that shares the stem of "comunicacion", accented or not, is to take off accents in a first step, and then parse it with the Analyzer, but I don't know if it's the right way.
Please, can anyone help me?
Did you check what tokenizer & tokenfilters SpanishAnalyzer uses? There is something called ASCIIFoldingFilter. Try placing it before the StemFilter. It will remove the accents
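The effect of folding accents before stemming can be approximated in plain Java. This sketch uses java.text.Normalizer as a stand-in, not Lucene's ASCIIFoldingFilter itself, and `AccentStripper` is a hypothetical name; it only illustrates why both spellings reach the stemmer as the same input once accents are removed.

```java
import java.text.Normalizer;

// Hypothetical helper; a plain-Java stand-in for accent folding,
// not the Lucene ASCIIFoldingFilter.
final class AccentStripper {
    private AccentStripper() {}

    // Decomposes accented letters (NFD) and drops the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "\u00f3" is the accented o in "comunicación".
        System.out.println(stripAccents("comunicaci\u00f3n")); // comunicacion
        System.out.println(stripAccents("comunicacion"));      // comunicacion
    }
}
```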

Hyphens in Lucene

I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final") in the index. How is this supposed to match if the user searches for "semifinal", in one word?
Edit: I'm just playing around with the StandardTokenizer class actually, maybe that is why? Am I missing a filter?
Thanks!
(Edit)
My code looks like this:
StandardAnalyzer sa = new StandardAnalyzer();
TokenStream ts = sa.TokenStream("field", new StringReader("semi-final"));
while (ts.IncrementToken())
{
string t = ts.ToString();
Console.WriteLine("Token: " + t);
}
This is the explanation of the tokenizer in the Lucene documentation:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
This explains why it would be splitting your word.
This is probably the hardest thing to correct: human error. If an individual types in semifinal, this is theoretically not the same as searching for semi-final. So if you have numerous words that could be typed in different ways, for example:
St-Constant
Saint Constant
Saint-Constant
you're stuck with the task of having to verify both st and saint, as well as the hyphenated and non-hyphenated forms. Your token lists would be huge, and each word would need to be compared to see if it matched.
I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have a lot of words you wish to support, store and test all the possibilities, or have a loop that starts at the first letter and moves through each letter, splitting the string in two to form two words and testing along the way to see if it matches. But again, who's to say you only have two words? If you are verifying more than two words, you have the problem of splitting the word into multiple sections, for example:
saint-jean-sur-richelieu
If I come up with anything else I will let you know.
I would recommend you use the WordDelimiterFilter from Solr (you can use it in just your Lucene application as a TokenFilter added to your analyzer, just go get the java file for this filter from Solr and add it to your application).
This filter is designed to handle cases just like this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
If you're looking for a port of the WordDelimiterFilter then I advise a google of WordDelimiter.cs, I found such a port here:
http://osdir.com/ml/attachments/txt9jqypXvbSE.txt
I then created a very basic WordDelimiterAnalyzer:
public class WordDelimiterAnalyzer: Analyzer
{
#region Overrides of Analyzer
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
TokenStream result = new WhitespaceTokenizer(reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
result = new WordDelimiterFilter(result, 1, 1, 1, 1, 0);
return result;
}
#endregion
}
I said it was basic :)
If anyone has an implementation I would be keen to see it!
You can write your own tokenizer which will produce for words with hyphen all possible combinations of tokens like that:
semifinal
semi
final
You will need to set proper token offsets to tell lucene that semi and semifinal actually start at the same place in document.
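The token expansion described above can be sketched in plain Java. This is not a Lucene Tokenizer: `HyphenVariants` is a hypothetical name, and the sketch ignores the position and offset bookkeeping that a real token stream would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing which tokens to emit for a hyphenated word.
// A real Lucene tokenizer would also set positions and offsets.
final class HyphenVariants {
    private HyphenVariants() {}

    // Returns the joined form first (if there was a hyphen), then each part.
    static List<String> variants(String word) {
        List<String> parts = new ArrayList<>();
        for (String part : word.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        List<String> tokens = new ArrayList<>();
        if (parts.size() > 1) {
            tokens.add(String.join("", parts)); // e.g. "semifinal"
        }
        tokens.addAll(parts);                   // e.g. "semi", "final"
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(variants("semi-final")); // [semifinal, semi, final]
    }
}
```

With all three tokens indexed, a search for "semifinal", "semi" or "final" can each find the document.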
The rule (for the classic analyzer) is written in JFlex:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
| {HAS_DIGIT} {P} {ALPHANUM}
| {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
| {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
// punctuation
P = ("_"|"-"|"/"|"."|",")

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene. Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, as described here, that covers everything you want except for + being the default, but that's something you can simply handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax) and then you can write the transformations from your own syntax to the Lucene syntax.
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.