Using Lucene's highlighting, getting too much highlighted, is there a workaround for this? - lucene

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field, and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate because so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
#Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
#Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index I can see that the index contains the separate terms I expect, including the synonym values. For example the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show all the search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!

I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work but once I had it all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
#Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}

Related

Lucene 6 Payloads

I am trying to work with payloads in Lucene 6 but I am having troubles. The idea is to index payloads and use them in a CustomScoreQuery to check if the payload of a query term matches the payload for the document term.
Here is my payload filter:
#Override
public final boolean incrementToken() throws IOException {
if (!this.input.incrementToken()) {
return false;
}
// get the current token
final char[] token = Arrays.copyOfRange(this.termAtt.buffer(), 0, this.termAtt.length());
String stoken = String.valueOf(token);
String[] parts = stoken.split(Constants.PAYLOAD_DELIMITER);
if (parts.length > 1 && parts.length == 2){
termAtt.setLength(parts[0].length());
// the rest is the payload
BytesRef br = new BytesRef(parts[1]);
System.out.println(br);
payloadAtt.setPayload(br);
}else if (parts.length > 1){
// skip
}else{
// no payload here
payloadAtt.setPayload(null);
}
return true;
}
It seems to be adding the payload, however when I try to access the payload in CustomScoreQuery it just keeps returning null.
public float determineBoost(int doc) throws IOException{
float boost = 1f;
LeafReader reader = this.context.reader();
System.out.println("Has payloads:" + reader.getFieldInfos().hasPayloads());
// loop through each location of the term and boost if location matches the payload
if (reader != null){
PostingsEnum posting = reader.postings(new Term(this.field, term.getTerm()), PostingsEnum.POSITIONS);
System.out.println("Term: " + term.getTerm());
if (posting != null){
// move to the document currently looking at
posting.advance(doc);
int count = 0;
while (count < posting.freq()){
BytesRef load = posting.getPayload();
System.out.println(posting);
System.out.println(posting.getClass());
System.out.println(posting.attributes());
System.out.println("Load: " + load);
// if the location matches in the term location than boos the term by the boost factor
try {
if(load != null && term.containLocation(new Payload(load))){
boost = boost * this.boost;
}
} catch (PayloadException e) {
// do not care too much, the payload is unrecognized
// this is not going to change the boost factor
}
posting.nextPosition();
count += 1;
}
}
}
return boost;
}
For my two tests it keeps stating the load is null. Any suggestions or help?

How to force Mobile Vision for Android to read full lines of text

I have implemented Google's Mobile Vision for Android by following a tutorial. I am trying to build an app that will scan a receipt and find the numeric total. However, as I scan different receipts that are printed in different formats, the API will detect TextBlocks in what seems to be an arbitrary way. For example, in one receipt, if several words of text are separated by single spaces, then they are grouped into a single TextBlock. However, if two words of text are separated by lots of spaces, then they are separated as independent TextBlocks, even though they appear on the same "line". What I am trying to do is force the API to recognize each entire line of the receipt as a single entity. Is this possible?
public ArrayList<T> getAllGraphicsInRow(float rawY) {
synchronized (mLock) {
ArrayList<T> row = new ArrayList<>();
// Get the position of this View so the raw location can be offset relative to the view.
int[] location = new int[2];
this.getLocationOnScreen(location);
for (T graphic : mGraphics) {
float rawX = this.getWidth();
for (int i=0; i<rawX; i+=10){
if (graphic.contains(i - location[0], rawY - location[1])) {
if(!row.contains(graphic)) {
row.add(graphic);
}
}
}
}
return row;
}
}
This should be in the GraphicOverlay.java file and essentially fetches all the graphics in that row.
public static boolean almostEqual(double a, double b, double eps){
return Math.abs(a-b)<(eps);
}
public static boolean pointAlmostEqual(Point a, Point b){
return almostEqual(a.y,b.y,10);
}
public static boolean cornerPointAlmostEqual(Point[] rect1, Point[] rect2){
boolean almostEqual=true;
for (int i=0; i<rect1.length;i++){
if (!pointAlmostEqual(rect1[i],rect2[i])){
almostEqual=false;
}
}
return almostEqual;
}
private boolean onTap(float rawX, float rawY) {
String priceRegex = "(\\d+[,.]\\d\\d)";
ArrayList<OcrGraphic> graphics = mGraphicOverlay.getAllGraphicsInRow(rawY);
OcrGraphic currentGraphics = mGraphicOverlay.getGraphicAtLocation(rawX,rawY);
if (graphics !=null && currentGraphics!=null) {
List<? extends Text> currentComponents = currentGraphics.getTextBlock().getComponents();
final Pattern pattern = Pattern.compile(priceRegex);
final Pattern pattern1 = Pattern.compile(priceRegex);
TextBlock text = null;
Log.i("text results", "This many in the row: " + Integer.toString(graphics.size()));
ArrayList<Text> combinedComponents = new ArrayList<>();
for (OcrGraphic graphic : graphics) {
if (!graphic.equals(currentGraphics)) {
text = graphic.getTextBlock();
Log.i("text results", text.getValue());
combinedComponents.addAll(text.getComponents());
}
}
for (Text currentText : currentComponents) { // goes through components in the row
final Matcher matcher = pattern.matcher(currentText.getValue()); // looks for
Point[] currentPoint = currentText.getCornerPoints();
for (Text otherCurrentText : combinedComponents) {//Looks for other components that are in the same row
final Matcher otherMatcher = pattern1.matcher(otherCurrentText.getValue()); // looks for
Point[] innerCurrentPoint = otherCurrentText.getCornerPoints();
if (cornerPointAlmostEqual(currentPoint, innerCurrentPoint)) {
if (matcher.find()) { // if you click on the price
Log.i("oh yes", "Item: " + otherCurrentText.getValue());
Log.i("oh yes", "Value: " + matcher.group(1));
itemList.add(otherCurrentText.getValue());
priceList.add(Float.valueOf(matcher.group(1)));
}
if (otherMatcher.find()) { // if you click on the item
Log.i("oh yes", "Item: " + currentText.getValue());
Log.i("oh yes", "Value: " + otherMatcher.group(1));
itemList.add(currentText.getValue());
priceList.add(Float.valueOf(otherMatcher.group(1)));
}
Toast toast = Toast.makeText(this, " Text Captured!" , Toast.LENGTH_SHORT);
toast.show();
}
}
}
return true;
}
return false;
}
This should be in OcrCaptureActivity.java and it breaks up the TextBlock into lines and finds the blocks in the same row as the line and checks if the components are all prices, and prints all value accordingly.
The eps value in almostEqual is the tolerance for how tall it checks for graphics in the row.

Capitalise first letter of each word in string + lowercase all other letters [duplicate]

Is there a function built into Java that capitalizes the first character of each word in a String, and does not affect the others?
Examples:
jon skeet -> Jon Skeet
miles o'Brien -> Miles O'Brien (B remains capital, this rules out Title Case)
old mcdonald -> Old Mcdonald*
*(Old McDonald would be find too, but I don't expect it to be THAT smart.)
A quick look at the Java String Documentation reveals only toUpperCase() and toLowerCase(), which of course do not provide the desired behavior. Naturally, Google results are dominated by those two functions. It seems like a wheel that must have been invented already, so it couldn't hurt to ask so I can use it in the future.
WordUtils.capitalize(str) (from apache commons-text)
(Note: if you need "fOO BAr" to become "Foo Bar", then use capitalizeFully(..) instead)
If you're only worried about the first letter of the first word being capitalized:
private String capitalize(final String line) {
return Character.toUpperCase(line.charAt(0)) + line.substring(1);
}
The following method converts all the letters into upper/lower case, depending on their position near a space or other special chars.
public static String capitalizeString(String string) {
char[] chars = string.toLowerCase().toCharArray();
boolean found = false;
for (int i = 0; i < chars.length; i++) {
if (!found && Character.isLetter(chars[i])) {
chars[i] = Character.toUpperCase(chars[i]);
found = true;
} else if (Character.isWhitespace(chars[i]) || chars[i]=='.' || chars[i]=='\'') { // You can add other chars here
found = false;
}
}
return String.valueOf(chars);
}
Try this very simple way
example givenString="ram is good boy"
public static String toTitleCase(String givenString) {
String[] arr = givenString.split(" ");
StringBuffer sb = new StringBuffer();
for (int i = 0; i < arr.length; i++) {
sb.append(Character.toUpperCase(arr[i].charAt(0)))
.append(arr[i].substring(1)).append(" ");
}
return sb.toString().trim();
}
Output will be: Ram Is Good Boy
I made a solution in Java 8 that is IMHO more readable.
public String firstLetterCapitalWithSingleSpace(final String words) {
return Stream.of(words.trim().split("\\s"))
.filter(word -> word.length() > 0)
.map(word -> word.substring(0, 1).toUpperCase() + word.substring(1))
.collect(Collectors.joining(" "));
}
The Gist for this solution can be found here: https://gist.github.com/Hylke1982/166a792313c5e2df9d31
String toBeCapped = "i want this sentence capitalized";
String[] tokens = toBeCapped.split("\\s");
toBeCapped = "";
for(int i = 0; i < tokens.length; i++){
char capLetter = Character.toUpperCase(tokens[i].charAt(0));
toBeCapped += " " + capLetter + tokens[i].substring(1);
}
toBeCapped = toBeCapped.trim();
I've written a small Class to capitalize all the words in a String.
Optional multiple delimiters, each one with its behavior (capitalize before, after, or both, to handle cases like O'Brian);
Optional Locale;
Don't breaks with Surrogate Pairs.
LIVE DEMO
Output:
====================================
SIMPLE USAGE
====================================
Source: cApItAlIzE this string after WHITE SPACES
Output: Capitalize This String After White Spaces
====================================
SINGLE CUSTOM-DELIMITER USAGE
====================================
Source: capitalize this string ONLY before'and''after'''APEX
Output: Capitalize this string only beforE'AnD''AfteR'''Apex
====================================
MULTIPLE CUSTOM-DELIMITER USAGE
====================================
Source: capitalize this string AFTER SPACES, BEFORE'APEX, and #AFTER AND BEFORE# NUMBER SIGN (#)
Output: Capitalize This String After Spaces, BeforE'apex, And #After And BeforE# Number Sign (#)
====================================
SIMPLE USAGE WITH CUSTOM LOCALE
====================================
Source: Uniforming the first and last vowels (different kind of 'i's) of the Turkish word D[İ]YARBAK[I]R (DİYARBAKIR)
Output: Uniforming The First And Last Vowels (different Kind Of 'i's) Of The Turkish Word D[i]yarbak[i]r (diyarbakir)
====================================
SIMPLE USAGE WITH A SURROGATE PAIR
====================================
Source: ab 𐐂c de à
Output: Ab 𐐪c De À
Note: first letter will always be capitalized (edit the source if you don't want that).
Please share your comments and help me to found bugs or to improve the code...
Code:
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
public class WordsCapitalizer {
public static String capitalizeEveryWord(String source) {
return capitalizeEveryWord(source,null,null);
}
public static String capitalizeEveryWord(String source, Locale locale) {
return capitalizeEveryWord(source,null,locale);
}
public static String capitalizeEveryWord(String source, List<Delimiter> delimiters, Locale locale) {
char[] chars;
if (delimiters == null || delimiters.size() == 0)
delimiters = getDefaultDelimiters();
// If Locale specified, i18n toLowerCase is executed, to handle specific behaviors (eg. Turkish dotted and dotless 'i')
if (locale!=null)
chars = source.toLowerCase(locale).toCharArray();
else
chars = source.toLowerCase().toCharArray();
// First charachter ALWAYS capitalized, if it is a Letter.
if (chars.length>0 && Character.isLetter(chars[0]) && !isSurrogate(chars[0])){
chars[0] = Character.toUpperCase(chars[0]);
}
for (int i = 0; i < chars.length; i++) {
if (!isSurrogate(chars[i]) && !Character.isLetter(chars[i])) {
// Current char is not a Letter; gonna check if it is a delimitrer.
for (Delimiter delimiter : delimiters){
if (delimiter.getDelimiter()==chars[i]){
// Delimiter found, applying rules...
if (delimiter.capitalizeBefore() && i>0
&& Character.isLetter(chars[i-1]) && !isSurrogate(chars[i-1]))
{ // previous character is a Letter and I have to capitalize it
chars[i-1] = Character.toUpperCase(chars[i-1]);
}
if (delimiter.capitalizeAfter() && i<chars.length-1
&& Character.isLetter(chars[i+1]) && !isSurrogate(chars[i+1]))
{ // next character is a Letter and I have to capitalize it
chars[i+1] = Character.toUpperCase(chars[i+1]);
}
break;
}
}
}
}
return String.valueOf(chars);
}
private static boolean isSurrogate(char chr){
// Check if the current character is part of an UTF-16 Surrogate Pair.
// Note: not validating the pair, just used to bypass (any found part of) it.
return (Character.isHighSurrogate(chr) || Character.isLowSurrogate(chr));
}
private static List<Delimiter> getDefaultDelimiters(){
// If no delimiter specified, "Capitalize after space" rule is set by default.
List<Delimiter> delimiters = new ArrayList<Delimiter>();
delimiters.add(new Delimiter(Behavior.CAPITALIZE_AFTER_MARKER, ' '));
return delimiters;
}
public static class Delimiter {
private Behavior behavior;
private char delimiter;
public Delimiter(Behavior behavior, char delimiter) {
super();
this.behavior = behavior;
this.delimiter = delimiter;
}
public boolean capitalizeBefore(){
return (behavior.equals(Behavior.CAPITALIZE_BEFORE_MARKER)
|| behavior.equals(Behavior.CAPITALIZE_BEFORE_AND_AFTER_MARKER));
}
public boolean capitalizeAfter(){
return (behavior.equals(Behavior.CAPITALIZE_AFTER_MARKER)
|| behavior.equals(Behavior.CAPITALIZE_BEFORE_AND_AFTER_MARKER));
}
public char getDelimiter() {
return delimiter;
}
}
public static enum Behavior {
CAPITALIZE_AFTER_MARKER(0),
CAPITALIZE_BEFORE_MARKER(1),
CAPITALIZE_BEFORE_AND_AFTER_MARKER(2);
private int value;
private Behavior(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
Using org.apache.commons.lang.StringUtils makes it very simple.
capitalizeStr = StringUtils.capitalize(str);
From Java 9+
you can use String::replaceAll like this :
public static void upperCaseAllFirstCharacter(String text) {
String regex = "\\b(.)(.*?)\\b";
String result = Pattern.compile(regex).matcher(text).replaceAll(
matche -> matche.group(1).toUpperCase() + matche.group(2)
);
System.out.println(result);
}
Example :
upperCaseAllFirstCharacter("hello this is Just a test");
Outputs
Hello This Is Just A Test
With this simple code:
String example="hello";
example=example.substring(0,1).toUpperCase()+example.substring(1, example.length());
System.out.println(example);
Result: Hello
I'm using the following function. I think it is faster in performance.
public static String capitalize(String text){
String c = (text != null)? text.trim() : "";
String[] words = c.split(" ");
String result = "";
for(String w : words){
result += (w.length() > 1? w.substring(0, 1).toUpperCase(Locale.US) + w.substring(1, w.length()).toLowerCase(Locale.US) : w) + " ";
}
return result.trim();
}
Use the Split method to split your string into words, then use the built in string functions to capitalize each word, then append together.
Pseudo-code (ish)
string = "the sentence you want to apply caps to";
words = string.split(" ")
string = ""
for(String w: words)
//This line is an easy way to capitalize a word
word = word.toUpperCase().replace(word.substring(1), word.substring(1).toLowerCase())
string += word
In the end string looks something like
"The Sentence You Want To Apply Caps To"
This might be useful if you need to capitalize titles. It capitalizes each substring delimited by " ", except for specified strings such as "a" or "the". I haven't ran it yet because it's late, should be fine though. Uses Apache Commons StringUtils.join() at one point. You can substitute it with a simple loop if you wish.
private static String capitalize(String string) {
if (string == null) return null;
String[] wordArray = string.split(" "); // Split string to analyze word by word.
int i = 0;
lowercase:
for (String word : wordArray) {
if (word != wordArray[0]) { // First word always in capital
String [] lowercaseWords = {"a", "an", "as", "and", "although", "at", "because", "but", "by", "for", "in", "nor", "of", "on", "or", "so", "the", "to", "up", "yet"};
for (String word2 : lowercaseWords) {
if (word.equals(word2)) {
wordArray[i] = word;
i++;
continue lowercase;
}
}
}
char[] characterArray = word.toCharArray();
characterArray[0] = Character.toTitleCase(characterArray[0]);
wordArray[i] = new String(characterArray);
i++;
}
return StringUtils.join(wordArray, " "); // Re-join string
}
public static String toTitleCase(String word){
return Character.toUpperCase(word.charAt(0)) + word.substring(1);
}
public static void main(String[] args){
String phrase = "this is to be title cased";
String[] splitPhrase = phrase.split(" ");
String result = "";
for(String word: splitPhrase){
result += toTitleCase(word) + " ";
}
System.out.println(result.trim());
}
1. Java 8 Streams
public static String capitalizeAll(String str) {
if (str == null || str.isEmpty()) {
return str;
}
return Arrays.stream(str.split("\\s+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
}
Examples:
System.out.println(capitalizeAll("jon skeet")); // Jon Skeet
System.out.println(capitalizeAll("miles o'Brien")); // Miles O'Brien
System.out.println(capitalizeAll("old mcdonald")); // Old Mcdonald
System.out.println(capitalizeAll(null)); // null
For foo bAR to Foo Bar, replace the map() method with the following:
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1).toLowerCase())
2. String.replaceAll() (Java 9+)
ublic static String capitalizeAll(String str) {
if (str == null || str.isEmpty()) {
return str;
}
return Pattern.compile("\\b(.)(.*?)\\b")
.matcher(str)
.replaceAll(match -> match.group(1).toUpperCase() + match.group(2));
}
Examples:
System.out.println(capitalizeAll("12 ways to learn java")); // 12 Ways To Learn Java
System.out.println(capitalizeAll("i am atta")); // I Am Atta
System.out.println(capitalizeAll(null)); // null
3. Apache Commons Text
System.out.println(WordUtils.capitalize("love is everywhere")); // Love Is Everywhere
System.out.println(WordUtils.capitalize("sky, sky, blue sky!")); // Sky, Sky, Blue Sky!
System.out.println(WordUtils.capitalize(null)); // null
For titlecase:
System.out.println(WordUtils.capitalizeFully("fOO bAR")); // Foo Bar
System.out.println(WordUtils.capitalizeFully("sKy is BLUE!")); // Sky Is Blue!
For details, checkout this tutorial.
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Enter the sentence : ");
try
{
String str = br.readLine();
char[] str1 = new char[str.length()];
for(int i=0; i<str.length(); i++)
{
str1[i] = Character.toLowerCase(str.charAt(i));
}
str1[0] = Character.toUpperCase(str1[0]);
for(int i=0;i<str.length();i++)
{
if(str1[i] == ' ')
{
str1[i+1] = Character.toUpperCase(str1[i+1]);
}
System.out.print(str1[i]);
}
}
catch(Exception e)
{
System.err.println("Error: " + e.getMessage());
}
I decided to add one more solution for capitalizing words in a string:
words are defined here as adjacent letter-or-digit characters;
surrogate pairs are provided as well;
the code has been optimized for performance; and
it is still compact.
Function:
public static String capitalize(String string) {
final int sl = string.length();
final StringBuilder sb = new StringBuilder(sl);
boolean lod = false;
for(int s = 0; s < sl; s++) {
final int cp = string.codePointAt(s);
sb.appendCodePoint(lod ? Character.toLowerCase(cp) : Character.toUpperCase(cp));
lod = Character.isLetterOrDigit(cp);
if(!Character.isBmpCodePoint(cp)) s++;
}
return sb.toString();
}
Example call:
System.out.println(capitalize("An à la carte StRiNg. Surrogate pairs: 𐐪𐐪."));
Result:
An À La Carte String. Surrogate Pairs: 𐐂𐐪.
Use:
String text = "jon skeet, miles o'brien, old mcdonald";
Pattern pattern = Pattern.compile("\\b([a-z])([\\w]*)");
Matcher matcher = pattern.matcher(text);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, matcher.group(1).toUpperCase() + matcher.group(2));
}
String capitalized = matcher.appendTail(buffer).toString();
System.out.println(capitalized);
There are many way to convert the first letter of the first word being capitalized. I have an idea. It's very simple:
public String capitalize(String str){
/* The first thing we do is remove whitespace from string */
String c = str.replaceAll("\\s+", " ");
String s = c.trim();
String l = "";
for(int i = 0; i < s.length(); i++){
if(i == 0){ /* Uppercase the first letter in strings */
l += s.toUpperCase().charAt(i);
i++; /* To i = i + 1 because we don't need to add
value i = 0 into string l */
}
l += s.charAt(i);
if(s.charAt(i) == 32){ /* If we meet whitespace (32 in ASCII Code is whitespace) */
l += s.toUpperCase().charAt(i+1); /* Uppercase the letter after whitespace */
i++; /* Yo i = i + 1 because we don't need to add
value whitespace into string l */
}
}
return l;
}
package com.test;
/**
* #author Prasanth Pillai
* #date 01-Feb-2012
* #description : Below is the test class details
*
* inputs a String from a user. Expect the String to contain spaces and alphanumeric characters only.
* capitalizes all first letters of the words in the given String.
* preserves all other characters (including spaces) in the String.
* displays the result to the user.
*
* Approach : I have followed a simple approach. However there are many string utilities available
* for the same purpose. Example : WordUtils.capitalize(str) (from apache commons-lang)
*
*/
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class Test {
public static void main(String[] args) throws IOException{
System.out.println("Input String :\n");
InputStreamReader converter = new InputStreamReader(System.in);
BufferedReader in = new BufferedReader(converter);
String inputString = in.readLine();
int length = inputString.length();
StringBuffer newStr = new StringBuffer(0);
int i = 0;
int k = 0;
/* This is a simple approach
* step 1: scan through the input string
* step 2: capitalize the first letter of each word in string
* The integer k, is used as a value to determine whether the
* letter is the first letter in each word in the string.
*/
while( i < length){
if (Character.isLetter(inputString.charAt(i))){
if ( k == 0){
newStr = newStr.append(Character.toUpperCase(inputString.charAt(i)));
k = 2;
}//this else loop is to avoid repeatation of the first letter in output string
else {
newStr = newStr.append(inputString.charAt(i));
}
} // for the letters which are not first letter, simply append to the output string.
else {
newStr = newStr.append(inputString.charAt(i));
k=0;
}
i+=1;
}
System.out.println("new String ->"+newStr);
}
}
Here is a simple function
public static String capEachWord(String source){
String result = "";
String[] splitString = source.split(" ");
for(String target : splitString){
result += Character.toUpperCase(target.charAt(0))
+ target.substring(1) + " ";
}
return result.trim();
}
This is just another way of doing it:
private String capitalize(String line)
{
StringTokenizer token =new StringTokenizer(line);
String CapLine="";
while(token.hasMoreTokens())
{
String tok = token.nextToken().toString();
CapLine += Character.toUpperCase(tok.charAt(0))+ tok.substring(1)+" ";
}
return CapLine.substring(0,CapLine.length()-1);
}
Reusable method for intiCap:
public class YarlagaddaSireeshTest{
public static void main(String[] args) {
String FinalStringIs = "";
String testNames = "sireesh yarlagadda test";
String[] name = testNames.split("\\s");
for(String nameIs :name){
FinalStringIs += getIntiCapString(nameIs) + ",";
}
System.out.println("Final Result "+ FinalStringIs);
}
public static String getIntiCapString(String param) {
if(param != null && param.length()>0){
char[] charArray = param.toCharArray();
charArray[0] = Character.toUpperCase(charArray[0]);
return new String(charArray);
}
else {
return "";
}
}
}
Here is my solution.
I ran across this problem tonight and decided to search it. I found an answer by Neelam Singh that was almost there, so I decided to fix the issue (broke on empty strings) and caused a system crash.
The method you are looking for is named capString(String s) below.
It turns "It's only 5am here" into "It's Only 5am Here".
The code is pretty well commented, so enjoy.
package com.lincolnwdaniel.interactivestory.model;
public class StringS {
/**
* #param s is a string of any length, ideally only one word
* #return a capitalized string.
* only the first letter of the string is made to uppercase
*/
public static String capSingleWord(String s) {
if(s.isEmpty() || s.length()<2) {
return Character.toUpperCase(s.charAt(0))+"";
}
else {
return Character.toUpperCase(s.charAt(0)) + s.substring(1);
}
}
/**
*
* #param s is a string of any length
* #return a title cased string.
* All first letter of each word is made to uppercase
*/
public static String capString(String s) {
// Check if the string is empty, if it is, return it immediately
if(s.isEmpty()){
return s;
}
// Split string on space and create array of words
String[] arr = s.split(" ");
// Create a string buffer to hold the new capitalized string
StringBuffer sb = new StringBuffer();
// Check if the array is empty (would be caused by the passage of s as an empty string [i.g "" or " "],
// If it is, return the original string immediately
if( arr.length < 1 ){
return s;
}
for (int i = 0; i < arr.length; i++) {
sb.append(Character.toUpperCase(arr[i].charAt(0)))
.append(arr[i].substring(1)).append(" ");
}
return sb.toString().trim();
}
}
Here we go for perfect first char capitalization of word
public static void main(String[] args) {
String input ="my name is ranjan";
String[] inputArr = input.split(" ");
for(String word : inputArr) {
System.out.println(word.substring(0, 1).toUpperCase()+word.substring(1,word.length()));
}
}
}
//Output : My Name Is Ranjan
For those of you using Velocity in your MVC, you can use the capitalizeFirstLetter() method from the StringUtils class.
String s="hi dude i want apple";
s = s.replaceAll("\\s+"," ");
String[] split = s.split(" ");
s="";
for (int i = 0; i < split.length; i++) {
split[i]=Character.toUpperCase(split[i].charAt(0))+split[i].substring(1);
s+=split[i]+" ";
System.out.println(split[i]);
}
System.out.println(s);
package corejava.string.intern;
import java.io.DataInputStream;
import java.util.ArrayList;
/*
* wap to accept only 3 sentences and convert first character of each word into upper case
*/
public class Accept3Lines_FirstCharUppercase {
static String line;
static String words[];
static ArrayList<String> list=new ArrayList<String>();
/**
* #param args
*/
public static void main(String[] args) throws java.lang.Exception{
DataInputStream read=new DataInputStream(System.in);
System.out.println("Enter only three sentences");
int i=0;
while((line=read.readLine())!=null){
method(line); //main logic of the code
if((i++)==2){
break;
}
}
display();
System.out.println("\n End of the program");
}
/*
* this will display all the elements in an array
*/
public static void display(){
for(String display:list){
System.out.println(display);
}
}
/*
* this divide the line of string into words
* and first char of the each word is converted to upper case
* and to an array list
*/
public static void method(String lineParam){
words=line.split("\\s");
for(String s:words){
String result=s.substring(0,1).toUpperCase()+s.substring(1);
list.add(result);
}
}
}
If you prefer Guava...
String myString = ...;
String capWords = Joiner.on(' ').join(Iterables.transform(Splitter.on(' ').omitEmptyStrings().split(myString), new Function<String, String>() {
public String apply(String input) {
return Character.toUpperCase(input.charAt(0)) + input.substring(1);
}
}));
String toUpperCaseFirstLetterOnly(String str) {
String[] words = str.split(" ");
StringBuilder ret = new StringBuilder();
for(int i = 0; i < words.length; i++) {
ret.append(Character.toUpperCase(words[i].charAt(0)));
ret.append(words[i].substring(1));
if(i < words.length - 1) {
ret.append(' ');
}
}
return ret.toString();
}

Lucene TFIDF does not return 1 for exactly same query with certain document

I implemented a program to rank documents based on its TFIDF similarity score given a user input.
Following is the program:
public class Ranking{
private static int maxHits = 10;
private static Connection connect = null;
private static PreparedStatement preparedStatement = null;
private static ResultSet resultSet = null;
public static void main(String[] args) throws Exception {
System.out.println("Enter your paper title: ");
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String paperTitle = null;
paperTitle = br.readLine();
Class.forName("com.mysql.jdbc.Driver");
connect = DriverManager.getConnection("jdbc:mysql://localhost/arnetminer?"
+ "user=root&password=1234");
preparedStatement = connect.prepareStatement
("SELECT stoppedstemmedtitle from arnetminer.new_bigdataset "
+ "where title="+"'"+paperTitle+"';");
resultSet = preparedStatement.executeQuery();
resultSet.next();
String stoppedstemmedtitle = resultSet.getString(1);
String querystr = args.length > 0 ? args[0] :stoppedstemmedtitle;
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
Query q = new QueryParser(Version.LUCENE_42, "stoppedstemmedtitle", analyzer).parse(querystr);
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("E:/Lucene/new_bigdataset_index")));
IndexSearcher searcher = new IndexSearcher(reader);
VSMSimilarity vsmSimiliarty = new VSMSimilarity();
searcher.setSimilarity(vsmSimiliarty);
TopDocs hits = searcher.search(q, maxHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");
int counter = 0;
for (int n = 0; n < scoreDocs.length; ++n) {
ScoreDoc sd = scoreDocs[n];
System.out.println(scoreDocs[n]);
float score = sd.score;
int docId = sd.doc;
Document d = searcher.doc(docId);
String fileName = d.get("title");
String year = d.get("pub_year");
String paperkey = d.get("paperkey");
System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
++counter;
}
writer.close();
}
}
And
public class VSMSimilarity extends DefaultSimilarity{
// Weighting codes
public boolean doBasic = true; // Basic tf-idf
public boolean doSublinear = false; // Sublinear tf-idf
public boolean doBoolean = false; // Boolean
//Scoring codes
public boolean doCosine = true;
public boolean doOverlap = false;
// term frequency in document = measure of how often a term appears in the document
public float tf(int freq) {
return super.tf(freq);
}
// inverse document frequency = measure of how often the term appears across the index
public float idf(int docFreq, int numDocs) {
// The default behaviour of Lucene is 1 + log (numDocs/(docFreq+1)), which is what we want (default VSM model)
return super.idf(docFreq, numDocs);
}
// normalization factor so that queries can be compared
public float queryNorm(float sumOfSquaredWeights){
return super.queryNorm(sumOfSquaredWeights);
}
// number of terms in the query that were found in the document
public float coord(int overlap, int maxOverlap) {
// else: can't get here
return super.coord(overlap, maxOverlap);
}
// Note: this happens an index time, which we don't take advantage of (too many indices!)
public float computeNorm(String fieldName, FieldInvertState state){
// else: can't get here
return super.computeNorm(state);
}
}
However, it does not return value 1 for exact documents that has 100% similarity with the input.
If i put user input as follows:Logic Based Knowledge Representation
The output I got and the TFIDF score are (5.165 for document that has 100% similarity with the input):
3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663
Is this a normal thing or is there something wrong with my tfidf implementation?
Thank you very much!
First of all - Lucene already have TF-IDF similarity - org.apache.lucene.search.similarities.TFIDFSimilarity
Second one -
tf–idf, short for term frequency–inverse document frequency, is a
numerical statistic that is intended to reflect how important a word
is to a document in a collection or corpus
I've marked word, so this tf-idf stuff is applicable only for one word query, but when query have mutliple words tf-idf will be done like this:
One of the simplest ranking functions is computed by summing the
tf–idf for each query term
So, this is the reason, why tf-idf could return you a score more than 1

implement feedback in lucene

Can someone give me hints to apply pseudo-feedback in lucene. I can not find much help on google. I am using Similarity classes.
Is there any class in lucene which I can extend to implement feedback ?
thanks.
Assuming you are referring to this relevance feedback method, Once you have the TopDocs for your original query, iterate through how ever many (let's say we'll want the top 25 terms for the top 25 docs of the original query) records you desire, and call IndexReader.getTermVectors(int), which will grab the information you need. Iterate through each. while storing the term frequencies in a hash map would be the implementation the immediately occurs to me.
Something like:
//Get the original results
TopDocs docs = indexsearcher.search(query,25);
HashMap<String,ScorePair> map = new HashMap<String,ScorePair>();
for (int i = 0; i < docs.scoreDocs.length; i++) {
//Iterate fields for each result
FieldsEnum fields = indexreader.getTermVectors(docs.scoreDocs[i].doc).iterator();
String fieldname;
while (fieldname = fields.next()) {
//For each field, iterate it's terms
TermsEnum terms = fields.terms().iterator();
while (terms.next()) {
//and store it
putTermInMap(fieldname, terms.term(), terms.docFreq(), map);
}
}
}
List<ScorePair> byScore = new ArrayList<ScorePair>(map.values());
Collections.sort(byScore);
BooleanQuery bq = new BooleanQuery();
//Perhaps we want to give the original query a bit of a boost
query.setBoost(5);
bq.add(query,BooleanClause.Occur.SHOULD);
for (int i = 0; i < 25; i++) {
//Add all our found terms to the final query
ScorePair pair = byScore.get(i);
bq.add(new TermQuery(new Term(pair.field,pair.term)),BooleanClause.Occur.SHOULD);
}
}
//Say, we want to score based on tf/idf
void putTermInMap(String field, String term, int freq, Map<String,ScorePair> map) {
String key = field + ":" + term;
if (map.containsKey(key))
map.get(key).increment();
else
map.put(key,new ScorePair(freq,field,term));
}
private class ScorePair implements Comparable{
int count = 0;
double idf;
String field;
String term;
ScorePair(int docfreq, String field, String term) {
count++;
//Standard Lucene idf calculation. This is calculated once per field:term
idf = (1 + Math.log(indexreader.numDocs()/((double)docfreq + 1))) ^ 2;
this.field = field;
this.term = term;
}
void increment() { count++; }
double score() {
return Math.sqrt(count) * idf;
}
//Standard Lucene TF/IDF calculation, if I'm not mistaken about it.
int compareTo(ScorePair pair) {
if (this.score() < pair.score()) return -1;
else return 1;
}
}
(I make no claim that this is functional code, in it's current state.)