Is there a way to search for frequent phrases with Lucene?
I'm searching successfully for frequent words:
TermStats[] ts = HighFreqTerms.getHighFreqTerms(reader, 20, fieldName, comparator);
but this returns single words, and I'm looking for a way to search for frequent two-word (or any number of words) combinations.
To clarify, I'm not looking for the top two words I know of (for example fast and car) but for the most frequent two-word combinations. So if my text is "this is a fast car and this is also a fast car" I'll get as a result that "fast car" and "this is" are the top two-word combinations.
I looked at the discussion here but it offers a solution with Solr and I'm looking for something with Lucene, and in any case the relevant link is broken.
EDIT: following femtoRgon's comment, here's some code from my Analyzer. Is this where the ShingleFilter should be added? It doesn't seem to work, as my output looks like this:
ed d
d
d p
p
p pl
pl
pl le
What I need is for the output to include pairs of full words.
Here's my createComponents method:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 2, 2);
ShingleFilter sf = new ShingleFilter(source, 2, 2);
TokenStreamComponents tsc = new TokenStreamComponents(source, sf);
return tsc;
}
EDIT2: I changed the NGramTokenizer to StandardTokenizer following femtoRgon's comment and now I'm getting full words, but I don't need the single words, just the pairs.
This is the code:
Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
ShingleFilter sf = new ShingleFilter(source, 2, 2);
Note the 2, 2 which according to the documentation should generate a minimum shingle size of 2 and a maximum shingle size of 2. But in fact it generates this output:
and
and other
other
other airborne
airborne
airborne particles
So how do I get rid of the single words and get this output?
and other
other airborne
airborne particles
Here's my full Analyzer class that does the job. Note that the createComponents method is where the ShingleFilter is declared, following femtoRgon's excellent comments on my question. Just put in your own string, specify minWords and maxWords, and run it.
public class RMAnalyzer extends Analyzer {
public static String someString = "some string";
private int minWords = 2;
private int maxWords = 2;
public static void main(String[] args) {
RMAnalyzer rma = new RMAnalyzer(2, 2);
rma.findFrequentTerms();
rma.close();
}
public RMAnalyzer(int minWords, int maxWords) {
this.minWords = minWords;
this.maxWords = maxWords;
}
public void findFrequentTerms() {
StringReader sr = new StringReader(someString);
try {
TokenStream tokenStream = tokenStream("title", sr);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
String term = charTermAttribute.toString();
System.out.println(term);
}
} catch(Exception e) {
e.printStackTrace();
}
}
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
ShingleFilter sf = new ShingleFilter(source, minWords, maxWords);
sf.setOutputUnigrams(false); // do not emit single-word (unigram) tokens in the output.
sf.setOutputUnigramsIfNoShingles(true); // if there are too few tokens to form a shingle, emit the unigrams anyway.
TokenStreamComponents tsc = new TokenStreamComponents(source, sf);
return tsc;
}
}
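To actually get the most frequent pairs (rather than just printing the shingles), the shingled field can be fed back into HighFreqTerms from the original question. The sketch below is untested glue code under a few assumptions of mine: the field name "title", the RAMDirectory, and the choice of HighFreqTerms.TotalTermFreqComparator (rank shingles by total occurrences rather than by document frequency) are not part of the answer above.
public static void main(String[] args) throws Exception {
    RMAnalyzer analyzer = new RMAnalyzer(2, 2);
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_47, analyzer));
    // index the text whose frequent word pairs we want
    Document doc = new Document();
    doc.add(new TextField("title", "this is a fast car and this is also a fast car", Field.Store.NO));
    writer.addDocument(doc);
    writer.close();
    // the shingles are now ordinary terms, so HighFreqTerms can rank them directly
    IndexReader reader = DirectoryReader.open(dir);
    TermStats[] ts = HighFreqTerms.getHighFreqTerms(reader, 20, "title",
            new HighFreqTerms.TotalTermFreqComparator());
    for (TermStats stat : ts) {
        System.out.println(stat.termtext.utf8ToString() + " " + stat.totalTermFreq);
    }
    reader.close();
}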
Related
I have three fields in my document
Title
Content
Modified Date
So when I search a term it's giving by results sorted by score
Now I would like to further sort the results with same score based upon on modifiedDate i.e. showing recent documents on top with the same score.
I tried sort by score, modified date but it's not working. Anyone can point me to the right direction?
This can be done simply by defining a Sort:
Sort sort = new Sort(
SortField.FIELD_SCORE,
new SortField("myDateField", SortField.Type.STRING));
indexSearcher.search(myQuery, numHits, sort);
Two possible gotchas here:
You should make sure your date is indexed in a searchable, and sortable, form. Generally, the best way to accomplish this is to convert it using DateTools.
The field used for sorting must be indexed, and should not be analyzed (a StringField, for instance). Up to you whether it is stored.
So adding the date field might look something like:
Field dateField = new StringField(
"myDateField",
DateTools.dateToString(myDateInstance, DateTools.Resolution.MINUTE),
Field.Store.YES);
document.add(dateField);
Note: You can also index dates as a numeric field using Date.getTime(). I prefer the DateTools string approach, as it provides some nicer tools for handling them, particularly with regard to precision, but either way can work.
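If you go the numeric route instead, a minimal sketch might look like the following. The field name myDateLong is made up here, and on newer Lucene versions sorting generally needs a doc values field alongside the indexed/stored one, so adjust to your version:
long millis = myDateInstance.getTime();
document.add(new LongField("myDateLong", millis, Field.Store.YES));   // indexed and stored value
document.add(new NumericDocValuesField("myDateLong", millis));        // used for sorting on recent Lucene versions
// at search time: sort by score first, then by the long value, newest first
Sort sort = new Sort(
    SortField.FIELD_SCORE,
    new SortField("myDateLong", SortField.Type.LONG, true));          // true = descending (most recent first)
indexSearcher.search(myQuery, numHits, sort);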
You can use a custom collector to solve this problem. It will sort results by score, then by timestamp. In this collector you retrieve the timestamp value for the secondary sort. See the class below:
public class CustomCollector extends TopDocsCollector<ScoreDocWithTime> {
ScoreDocWithTime pqTop;
public CustomCollector(int numHits) {
super(new HitQueueWithTime(numHits, true));
// HitQueueWithTime implements getSentinelObject to return a ScoreDocWithTime, so we
// know that at this point top() is already initialized.
pqTop = pq.top();
}
@Override
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
final NumericDocValues modifiedDate =
DocValues.getNumeric(context.reader(), "modifiedDate");
return new LeafCollector() {
Scorer scorer;
@Override
public void setScorer(Scorer scorer) throws IOException {
this.scorer = scorer;
}
@Override
public void collect(int doc) throws IOException {
float score = scorer.score();
// This collector cannot handle these scores:
assert score != Float.NEGATIVE_INFINITY;
assert !Float.isNaN(score);
totalHits++;
long timestamp = modifiedDate.get(doc);
if (score < pqTop.score
    || (score == pqTop.score && timestamp <= pqTop.timestamp)) {
// Docs arrive in increasing doc-id order, so a hit that does not beat pqTop
// on score, and then on timestamp, cannot compete: with equal score and equal
// timestamp it loses the doc-id tiebreak to the entry already in the queue.
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop.timestamp = timestamp;
pqTop = pq.updateTop();
}
};
}
@Override
public boolean needsScores() {
return true;
}
}
Also, to do the secondary sort, you need to add an additional field to ScoreDoc:
public class ScoreDocWithTime extends ScoreDoc {
public long timestamp;
public ScoreDocWithTime(long timestamp, int doc, float score) {
super(doc, score);
this.timestamp = timestamp;
}
public ScoreDocWithTime(long timestamp, int doc, float score, int shardIndex) {
super(doc, score, shardIndex);
this.timestamp = timestamp;
}
}
and create a custom priority queue to support this
public class HitQueueWithTime extends PriorityQueue<ScoreDocWithTime> {
public HitQueueWithTime(int numHits, boolean b) {
super(numHits, b);
}
@Override
protected ScoreDocWithTime getSentinelObject() {
return new ScoreDocWithTime(0, Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
}
@Override
protected boolean lessThan(ScoreDocWithTime hitA, ScoreDocWithTime hitB) {
if (hitA.score == hitB.score)
return (hitA.timestamp == hitB.timestamp) ?
hitA.doc > hitB.doc :
hitA.timestamp < hitB.timestamp;
else
return hitA.score < hitB.score;
}
}
After this you can get search results sorted as you need. See the example below:
public class SearchTest {
public static void main(String[] args) throws IOException {
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
Directory directory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
addDoc(indexWriter, "w1", 1000);
addDoc(indexWriter, "w1", 3000);
addDoc(indexWriter, "w1", 500);
addDoc(indexWriter, "w1 w2", 1000);
addDoc(indexWriter, "w1 w2", 3000);
addDoc(indexWriter, "w1 w2", 2000);
addDoc(indexWriter, "w1 w2", 5000);
final IndexReader indexReader = DirectoryReader.open(indexWriter, false);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("desc", "w1")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("desc", "w2")), BooleanClause.Occur.SHOULD);
CustomCollector results = new CustomCollector(100);
indexSearcher.search(query, results);
TopDocs search = results.topDocs();
for (ScoreDoc sd : search.scoreDocs) {
Document document = indexReader.document(sd.doc);
System.out.println(document.getField("desc").stringValue() + " " + ((ScoreDocWithTime) sd).timestamp);
}
}
private static void addDoc(IndexWriter indexWriter, String desc, long modifiedDate) throws IOException {
Document doc = new Document();
doc.add(new TextField("desc", desc, Field.Store.YES));
doc.add(new LongField("modifiedDate", modifiedDate, Field.Store.YES));
doc.add(new NumericDocValuesField("modifiedDate", modifiedDate));
indexWriter.addDocument(doc);
}
}
The program will output the following results:
w1 w2 5000
w1 w2 3000
w1 w2 2000
w1 w2 1000
w1 3000
w1 1000
w1 500
P.S. This solution is for Lucene 5.1.
My documents structure is:
[text:TextField,date:LongField]
I am looking for a 'statistics' query on my documents, based on a precision level on the dateTime field. This means counting documents grouped by the LongField date, ignoring some of the least significant digits of the date.
For a given precision, I am looking for how many documents match each distinct value at that precision.
Assuming the precision 'year' is grouping by "date/10000"
With the following data:
{text:"text1",dateTime:(some timestamp where year is 2015 like 20150000)}
{text:"text2",dateTime:(some timestamp where year is 2010 like 20109878)}
{text:"text3",dateTime:(some timestamp where year is 2015 like 20150024)}
{text:"text14,dateTime:(some timestamp where year is 1997 like 19970987)}
The result should be:
[{bracket:1997, count:1}
{bracket:2010, count:1}
{bracket:2015, count:2}]
While NumericRangeQuery allows creating one (or a few) ranges, is it possible for Lucene to generate the ranges based on a precision step?
I can handle this by creating a new field for each precision level that I need, but maybe this kind of thing already exists.
It's a kind of faceted search where the facet is the time. The use case should be:
- give me document count for each millennium,
- then give me document count for each century (inside a millennium),
- then give me document count for each year (inside a century),
- then give me document count for each day (inside a year).
When 0 documents exist inside a bucket, that bucket should not appear in the results.
Regards
A Collector can do this without any tricks, here is the working code:
public class GroupByTest1 {
private RAMDirectory directory;
private IndexSearcher searcher;
private IndexReader reader;
private Analyzer analyzer;
private class Data {
String text;
Long dateTime;
private Data(String text, Long dateTime) {
this.text = text;
this.dateTime = dateTime;
}
}
@Before
public void setUp() throws Exception {
directory = new RAMDirectory();
analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));
Data datas[] = {
new Data("A", 2012L),
new Data("B", 2012L),
new Data("C", 2012L),
new Data("D", 2013L),
};
Document doc = new Document();
for (Data data : datas) {
doc.clear();
doc.add(new TextField("text", data.text, Field.Store.YES));
doc.add(new LongField("dateTime", data.dateTime, Field.Store.YES));
writer.addDocument(doc);
}
writer.close();
reader = DirectoryReader.open(directory);
searcher = new IndexSearcher(reader);
}
@Test
public void test1() throws Exception {
final Map<Integer, Long> map = new HashMap<>();
Collector collector = new SimpleCollector() {
int base = 0;
@Override
public void collect(int doc) throws IOException {
String year = reader.document(doc + base).get("dateTime");
if (!map.containsKey(Integer.valueOf(year))) {
map.put(Integer.valueOf(year), 1L);
} else {
long l = map.get(Integer.valueOf(year));
map.put(Integer.valueOf(year), ++l);
}
}
@Override
public boolean needsScores() {
return false;
}
@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
base = context.docBase;
}
};
searcher.search(new MatchAllDocsQuery(), collector);
for (Integer integer : map.keySet()) {
System.out.print("year = " + integer);
System.out.println(" count = " + map.get(integer));
}
}
}
The output I get is the following:
year = 2012 count = 3
year = 2013 count = 1
This may run slowly depending on how many records you have, since it loads every single stored document to find out its year and groups based on that. There is also the grouping module, which you may want to look into.
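If loading stored documents turns out to be too slow, one alternative (just a sketch, and it assumes you also index the date with new NumericDocValuesField("dateTime", data.dateTime) at write time, as the collector in the earlier answer does for modifiedDate) is to read the value from doc values inside the collector and bucket it by your precision divisor:
final long precisionDivisor = 10000L;   // e.g. "year" precision for timestamps like 20150024
final Map<Long, Long> counts = new HashMap<>();
Collector collector = new SimpleCollector() {
    NumericDocValues dateTimes;
    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        // per-segment doc values lookup instead of reader.document(...)
        dateTimes = DocValues.getNumeric(context.reader(), "dateTime");
    }
    @Override
    public void collect(int doc) throws IOException {
        long bucket = dateTimes.get(doc) / precisionDivisor;  // no stored-field access needed
        Long current = counts.get(bucket);
        counts.put(bucket, current == null ? 1L : current + 1);
    }
    @Override
    public boolean needsScores() {
        return false;
    }
};
searcher.search(new MatchAllDocsQuery(), collector);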
I have a pdf. The pdf contains a table. The table contains many cells (>100). I know the exact position (x,y) and dimension (w,h) of every cell of the table.
I need to extract text from cells using itextsharp. Using PdfReaderContentParser + FilteredTextRenderListener (using a code like this http://itextpdf.com/examples/iia.php?id=279 ) I can extract text but I need to run the whole procedure for each cell. My pdf has many cells and the program takes too much time to run. Is there a way to extract text from a list of "rectangles"? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea from PdfBox (you can define as many regions as you need and then get the text using .getTextForRegion("region-name")).
This option is not immediately included in the iTextSharp distribution but it is easy to realize. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism was introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page, each carrying the associated baseline segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:
public static interface TextChunkFilter
{
/**
* @param textChunk the chunk to check
* @return true if the chunk should be allowed
*/
public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
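A sketch of such a filter in Java could look like the class below. It assumes TextChunk exposes its baseline endpoints via getStartLocation()/getEndLocation(), takes the cell as a com.itextpdf.text.Rectangle, and accepts a chunk when both endpoints fall inside that rectangle; the class name and the containment rule are my own choices, so adjust them to your table layout:
public class CellTextChunkFilter implements LocationTextExtractionStrategy.TextChunkFilter {
    private final Rectangle cell;   // the cell area, in the same coordinate space as the page content
    public CellTextChunkFilter(Rectangle cell) {
        this.cell = cell;
    }
    public boolean accept(LocationTextExtractionStrategy.TextChunk textChunk) {
        return contains(textChunk.getStartLocation()) && contains(textChunk.getEndLocation());
    }
    private boolean contains(Vector point) {
        float x = point.get(Vector.I1);
        float y = point.get(Vector.I2);
        return x >= cell.getLeft() && x <= cell.getRight()
            && y >= cell.getBottom() && y <= cell.getTop();
    }
}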
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.
PPS: In a comment sschuberth asked
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how do you then specify the page to extract from?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(pageNumber, strategy, additionalContentOperators);
List<String> result = new ArrayList<>();
for (TextChunkFilter chunkFilter : chunkFilters)
{
result.add(strategy.getResultantText(chunkFilter));
}
return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old-fashioned for loop.)
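For example, something along these lines (same strategy and chunkFilters variables as above; StreamSupport is needed because chunkFilters is an Iterable rather than a Collection):
List<String> result = StreamSupport.stream(chunkFilters.spliterator(), false)
        .map(strategy::getResultantText)
        .collect(Collectors.toList());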
Here's my take on how to extract text from a table-like structure in a PDF using itextsharp. It returns a collection of rows and each row contains a collection of interpreted columns. This may work for you on the premise that there is a gap between one column and the next which is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
var table = tableExtractionStrategy.GetTable();
}
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
public float NextCharacterThreshold { get; set; } = 1;
public int NextLineLookAheadDepth { get; set; } = 500;
public bool AccomodateWordWrapping { get; set; } = true;
private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
string text = renderInfo.GetText();
Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Chunks.Add(new TableTextChunk(rectangle, text));
}
public List<List<string>> GetTable()
{
List<List<string>> lines = new List<List<string>>();
List<string> currentLine = new List<string>();
float? previousBottom = null;
float? previousRight = null;
StringBuilder currentString = new StringBuilder();
// iterate through all chunks and evaluate
for (int i = 0; i < Chunks.Count; i++)
{
TableTextChunk chunk = Chunks[i];
// determine if we are processing the same row based on defined space between subsequent chunks
if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
else
{
// if we are processing a new line let's check to see if this could be word wrapping behavior
bool isNewLine = true;
if (AccomodateWordWrapping)
{
int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
if (previousBottom.HasValue)
for (int j = i; j < readAheadDepth; j++)
{
if (previousBottom == Chunks[j].Rectangle.Bottom)
{
isNewLine = false;
break;
}
}
}
// if the text was not word wrapped let's treat this as a new table row
if (isNewLine)
{
if (currentString.Length > 0)
currentLine.Add(currentString.ToString());
currentString.Clear();
previousBottom = chunk.Rectangle.Bottom;
previousRight = chunk.Rectangle.Right;
currentString.Append(chunk.Text);
if (currentLine.Count > 0)
lines.Add(currentLine);
currentLine = new List<string>();
}
else
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
}
}
return lines;
}
private struct TableTextChunk
{
public Rectangle Rectangle;
public string Text;
public TableTextChunk(Rectangle rect, string text)
{
Rectangle = rect;
Text = text;
}
public override string ToString()
{
return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
}
}
}
I want to add new fields to my Lucene-based search engine site, however I want to be able to intercept queries and modify them before I pass them on to the Searcher.
For example each document has the field userid so you can search for documents authored by a particular user by their ID, e.g. foo bar userid:123, but I want to add the ability to search by username.
I'd like to add a field user:RonaldMcDonald to queries (not to documents), but I want to be able to intercept that term and replace it with an equivalent userid:123 term (my own code would be responsible for converting "RonaldMcDonald" to "123").
Here's the simple code I'm using right now:
Int32 get = (pageIndex + 1) * pageSize;
Query query;
try {
query = _queryParser.Parse( queryText );
} catch(ParseException pex) {
log.Add("Could not parse query.");
throw new SearchException( "Could not parse query text.", pex );
}
log.Add("Parsed query.");
TopDocs result = _searcher.Search( query, get );
I've had a look at the Query class, but I can't see any way to retrieve, remove, or insert terms.
You can subclass the QueryParser and override NewTermQuery.
QP qp = new QP("user", new SimpleAnalyzer());
var s = qp.Parse("user:RonaldMcDonald data:[aaa TO bbb]");
Where s will be userid:123 data:[aaa TO bbb]
public class QP : QueryParser
{
Dictionary<string, string> _dict =
new Dictionary<string, string>(new MyComparer()) {{"RonaldMcDonald","123"} };
public QP(string field, Analyzer analyzer) : base(field, analyzer)
{
}
protected override Query NewTermQuery(Term term)
{
if (term.Field() == "user")
{
//Do your username -> userid mapping
return new TermQuery(new Term("userid", _dict[term.Text()]));
}
return base.NewTermQuery(term);
}
//Case insensitive comparer
class MyComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return String.Compare(x, y, true, CultureInfo.InvariantCulture)==0;
}
public int GetHashCode(string obj)
{
return obj.ToLower(CultureInfo.InvariantCulture).GetHashCode();
}
}
}
When I query for "elegant" in Solr I get results for "elegance" too.
I used these filters for the index analyzer:
WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory
and for the query analyzer:
WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
I want to know which filter is affecting my search result.
EnglishPorterFilterFactory
That's the short answer ;)
A little more information:
English Porter refers to the English Porter stemming algorithm, and according to the stemmer (which is a heuristic word-root builder) both elegant and elegance have the same stem.
You can verify this online, e.g. here. Basically you will see "elegant" and "elegance" both reduced to the same stem: eleg.
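If you'd rather check this locally, here is a small sketch against the Snowball EnglishStemmer that ships with Lucene/Solr (the same org.tartarus.snowball.ext.EnglishStemmer wired in by the factory shown further down); it simply prints the stem for each word:
import org.tartarus.snowball.ext.EnglishStemmer;
public class StemCheck {
    public static void main(String[] args) {
        EnglishStemmer stemmer = new EnglishStemmer();
        for (String word : new String[] {"elegant", "elegance"}) {
            stemmer.setCurrent(word);
            stemmer.stem();
            // both words should come out as "eleg", matching the behavior described above
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}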
From Solr source:
public void inform(ResourceLoader loader) {
String wordFiles = args.get(PROTECTED_TOKENS);
if (wordFiles != null) {
try {
This is exactly where the protwords file comes into play:
File protectedWordFiles = new File(wordFiles);
if (protectedWordFiles.exists()) {
List<String> wlist = loader.getLines(wordFiles);
//This cast is safe in Lucene
protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
} else {
List<String> files = StrUtils
.splitFileNames(wordFiles);
for (String file : files) {
List<String> wlist = loader.getLines(file
.trim());
if (protectedWords == null)
protectedWords = new CharArraySet(wlist,
false);
else
protectedWords.addAll(wlist);
}
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
That's the part which affects the stemming. There you see the invocation of the Snowball library:
public EnglishPorterFilter create(TokenStream input) {
return new EnglishPorterFilter(input, protectedWords);
}
}
/**
* English Porter2 filter that doesn't use reflection to
* adapt lucene to the snowball stemmer code.
*/
@Deprecated
class EnglishPorterFilter extends SnowballPorterFilter {
public EnglishPorterFilter(TokenStream source,
CharArraySet protWords) {
super (source, new org.tartarus.snowball.ext.EnglishStemmer(),
protWords);
}
}
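As a side note, if you need specific words to escape stemming entirely (so that, say, elegant and elegance remain distinct terms), the protected-words file read by the inform() code above is the hook for that. A rough sketch of the usual Solr wiring, assuming the conventional protwords.txt file name:
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
with protwords.txt containing one word per line:
elegant
elegance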