How can I use Lucene's PriorityQueue when I don't know the max size at create time?

I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Every time Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.
I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.
However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.
I realize I'm probably missing something simple here...

PriorityQueue is not SortedList or SortedDictionary.
It is a kind of sorting implementation that returns the top M results (your PriorityQueue's size) out of N elements. You can add as many items as you want with InsertWithOverflow, but it will only ever hold the top M elements.
Suppose your search resulted in 1,000,000 hits. Would you return all of them to the user?
A better way would be to return the top 10 elements to the user (using a PriorityQueue of size 10), and if the user requests the next 10 results, you can run a new search with a PriorityQueue of size 20 and return the next 10 elements, and so on.
This is the trick most search engines, like Google, use.
Every time Commit gets called, I can add the result to an internal PriorityQueue.
I cannot understand the relationship between Commit and search, so I will append a sample usage of PriorityQueue:
public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
    public CustomQueue(int maxSize) : base()
    {
        Initialize(maxSize);
    }

    public override bool LessThan(Document a, Document b)
    {
        // Compare a & b on whatever defines your sort order,
        // e.g. a string field named "field1":
        return string.CompareOrdinal(
            a.GetField("field1").StringValue,
            b.GetField("field1").StringValue) < 0;
    }
}
public class MyCollector : Lucene.Net.Search.Collector
{
    CustomQueue _queue = null;
    IndexReader _currentReader;

    public MyCollector(int maxSize)
    {
        _queue = new CustomQueue(maxSize);
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }

    public override void Collect(int doc)
    {
        _queue.InsertWithOverflow(_currentReader.Document(doc));
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        _currentReader = reader;
    }

    public override void SetScorer(Scorer scorer)
    {
    }
}
searcher.Search(query, new MyCollector(10)); // First page.
searcher.Search(query, new MyCollector(20)); // 2nd page.
searcher.Search(query, new MyCollector(30)); // 3rd page.
EDIT for @nokturnal
public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
    where TComp : IComparable<TComp>
{
    Func<TObj, TComp> _KeySelector;

    public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
    {
        _KeySelector = keySelector;
        Initialize(size);
    }

    public override bool LessThan(TObj a, TObj b)
    {
        return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
    }

    public IEnumerable<TObj> Items
    {
        get
        {
            int size = Size();
            for (int i = 0; i < size; i++)
                yield return Pop();
        }
    }
}
var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
    // Pop() returns the least element first, so this enumerates
    // the retained documents from smallest to largest key.
}

The reason Lucene's PriorityQueue is size-limited is that it uses a fixed-size implementation that is very fast.
Think about the reasonable maximum number of results to return at a time and use that number; the "waste" when the results are few is a small price for the performance it buys.
On the other hand, if you have such a huge number of results that you cannot hold them, how are you going to serve/display them? Keep in mind that this is for "top" hits, so as you iterate through the results you will be hitting less and less relevant ones anyway.
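To make the fixed-size behavior concrete, here is a minimal sketch (assuming the Lucene.Net 3.x PriorityQueue<T> API and the MyPriorityQueue helper shown above):
// A fixed-capacity queue of 3 keeps only the top 3 of however many values you insert.
var queue = new MyPriorityQueue<int, int>(3, x => x);
foreach (var n in new[] { 5, 1, 9, 7, 3 })
{
    // Once full, InsertWithOverflow evicts and returns the smallest element
    // (or returns the rejected one), so memory stays fixed no matter how many hits arrive.
    var overflow = queue.InsertWithOverflow(n);
}
// queue.Items now yields 5, 7, 9: the top 3 of the 5 inserted values.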


How to set variable inside of a Coroutine after yielding a webrequest

Okay, I will try to explain this to the best of my ability. I have searched all day for a solution to this issue but can't seem to find it. I have a list of scriptable objects that I am using as custom property holders to create game objects from. One of those properties is a Texture2D that I turn into a sprite. Therefore, I am using UnityWebRequest in a coroutine and have to yield the response. After I get the response, I try to set the variable. However, even using lambdas, if I yield return the response before the result, the variable is not set. So every time I check the variable after the coroutine, it comes back null. If someone could enlighten me as to what I am missing here, that would be great!
Here is the Scriptable Object Class I am using.
[CreateAssetMenu(fileName = "new movie", menuName = "movie")]
public class MovieTemplate : ScriptableObject
{
    public string Title;
    public string Description;
    public string ImgURL;
    public string mainURL;
    public string secondaryURL;
    public Sprite thumbnail;
}
Here is the call to the Coroutine
foreach (var item in nodes)
{
    templates.Add(GetMovieData(item));
}

foreach (MovieTemplate movie in templates)
{
    StartCoroutine(GetMovieImage(movie.ImgURL, result =>
    {
        movie.thumbnail = result;
    }));
}
Here is the Coroutine itself
IEnumerator GetMovieImage(string url, System.Action<Sprite> result)
{
    using (UnityWebRequest web = UnityWebRequestTexture.GetTexture(url))
    {
        yield return web.SendWebRequest();
        var img = DownloadHandlerTexture.GetContent(web);
        result(Sprite.Create(img, new Rect(0, 0, img.width, img.height), Vector2.zero));
    }
}
From what you describe, it still seems that the texture is somehow disposed of as soon as the routine finishes. My guess would be that this happens due to the using block.
I would store the original texture reference:
[CreateAssetMenu(fileName = "new movie", menuName = "movie")]
public class MovieTemplate : ScriptableObject
{
    public string Title;
    public string Description;
    public string ImgURL;
    public string mainURL;
    public string secondaryURL;
    public Sprite thumbnail;
    public Texture texture;

    public void SetSprite(Texture newTexture)
    {
        if (texture) Destroy(texture);
        texture = newTexture;
        var tex = (Texture2D)texture;
        thumbnail = Sprite.Create(tex, new Rect(0, 0, tex.width, tex.height), Vector2.zero);
    }
}
So you can keep track of the texture itself as well, keep it from being collected by the GC, but also destroy it when it is not needed anymore. Usually a Texture2D is removed by the GC as soon as there is no reference to it anymore, but a Texture2D created by UnityWebRequest might behave differently.
Then, in the web request, return the texture and don't use using:
IEnumerator GetMovieImage(string url, System.Action<Texture> result)
{
    UnityWebRequest web = UnityWebRequestTexture.GetTexture(url);
    yield return web.SendWebRequest();

    if (string.IsNullOrEmpty(web.error))
    {
        result?.Invoke(DownloadHandlerTexture.GetContent(web));
    }
    else
    {
        Debug.LogErrorFormat(this, "Download error: {0} - {1}", web.responseCode, web.error);
    }
}
and finally use it like this:
for (int i = 0; i < templates.Count; i++)
{
    int index = i; // capture a copy: the closure must not share the loop variable
    StartCoroutine(
        GetMovieImage(
            templates[index].ImgURL,
            result =>
            {
                templates[index].SetSprite(result);
            })
    );
}
The problem is with this section of your code:
foreach (MovieTemplate movie in templates)
{
    StartCoroutine(GetMovieImage(movie.ImgURL, result =>
    {
        movie.thumbnail = result; // wrong movie object
    }));
}
Here you will lose the reference to the movie object (it is overwritten by the foreach) before the result of the callback arrives.
Change it to something like this:
for (int i = 0; i < templates.Count; i++)
{
    int index = i; // capture a copy: the closure must not share the loop variable
    StartCoroutine(GetMovieImage(templates[index].ImgURL, result =>
    {
        templates[index].thumbnail = result;
    }));
}

Incremental score calculation bug?

I've been dealing with a score corruption error for a few days with no apparent cause. The error appears only in FULL_ASSERT mode, and it is not related to the constraints defined in the Drools file.
Following is the error:
2014-07-02 14:51:49,037 [SwingWorker-pool-1-thread-4] TRACE Move index (0), score (-4/-2450/-240/-170), accepted (false) for move (EMP4#START => EMP2).
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Score corruption: the workingScore (-3/-1890/-640/-170) is not the uncorruptedScore (-3/-1890/-640/-250) after completedAction (EMP3#EMP4 => EMP4):
The corrupted scoreDirector has 1 ConstraintMatch(s) which are in excess (and should not be there):
com.abcdl.be.solver/MinimizeTotalTime/level3/[org.drools.core.reteoo.InitialFactImpl#4dde85f0]=-170
The corrupted scoreDirector has 1 ConstraintMatch(s) which are missing:
com.abcdl.be.solver/MinimizeTotalTime/level3/[org.drools.core.reteoo.InitialFactImpl#4dde85f0]=-250
Check your score constraints.
The error appears every time after several steps are completed, for no apparent reason.
I'm developing software to schedule several tasks under time and resource constraints.
The whole process is represented by a directed tree diagram in which the nodes of the graph represent the tasks and the edges the dependencies between the tasks.
To do this, the planner changes the parent node of each node until it finds the best solution.
The node is the planning entity and its parent the planning variable:
@PlanningEntity(difficultyComparatorClass = NodeDifficultyComparator.class)
public class Node extends ProcessChain {

    private Node parent; // Planning variable: changes during planning, between score calculations.
    private String delay; // Used to display the delay for nodes of type "And"
    private int id; // Used as an identifier for each node. Different nodes cannot have the same id

    public Node(String name, String type, int time, int resources, String md, int id)
    {
        super(name, "", time, resources, type, md);
        this.id = id;
    }

    public Node()
    {
        super();
        this.delay = "";
    }

    public String getDelay() {
        return delay;
    }

    public void setDelay(String delay) {
        this.delay = delay;
    }

    @PlanningVariable(valueRangeProviderRefs = {"parentRange"}, strengthComparatorClass = ParentStrengthComparator.class, nullable = false)
    public Node getParent() {
        return parent;
    }

    public void setParent(Node parent) {
        this.parent = parent;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    /*public String toString()
    {
        if(this.type.equals("AND"))
            return delay;
        if(!this.md.isEmpty())
            return Tools.excerpt(name+" : "+this.md);
        return Tools.excerpt(name);
    }*/

    public String toString()
    {
        if(parent != null)
            return Tools.excerpt(name) + "#" + parent;
        else
            return Tools.excerpt(name);
    }

    public boolean equals(Object o) {
        if (this == o) {
            return true;
        } else if (o instanceof Node) {
            Node other = (Node) o;
            return new EqualsBuilder()
                    .append(name, other.name)
                    .append(id, other.id)
                    .isEquals();
        } else {
            return false;
        }
    }

    public int hashCode() {
        return new HashCodeBuilder()
                .append(name)
                .append(id)
                .toHashCode();
    }

    // ************************************************************************
    // Complex methods
    // ************************************************************************

    public int getStartTime()
    {
        try {
            return Graph.getInstance().getNode2times().get(this).getFirst();
        }
        catch(NullPointerException e)
        {
            System.out.println("getStartTime() is null for " + this);
        }
        return 10;
    }

    public int getEndTime()
    {
        try {
            return Graph.getInstance().getNode2times().get(this).getSecond();
        }
        catch(NullPointerException e)
        {
            System.out.println("getEndTime() is null for " + this);
        }
        return 10;
    }

    @ValueRangeProvider(id = "parentRange")
    public Collection<Node> getPossibleParents()
    {
        Collection<Node> nodes = new ArrayList<Node>(Graph.getInstance().getNodes());
        nodes.remove(this); // We remove this node from the list
        if(Graph.getInstance().getParentsCount(this) > 0)
            nodes.remove(Graph.getInstance().getParents(this)); // We remove its parents from the list
        if(Graph.getInstance().getChildCount(this) > 0)
            nodes.remove(Graph.getInstance().getChildren(this)); // We remove its children from the list
        if(!nodes.contains(Graph.getInstance().getNt()))
            nodes.add(Graph.getInstance().getNt());
        return nodes;
    }

    /**
     * The normal methods {@link #equals(Object)} and {@link #hashCode()} cannot be used because the rule engine already
     * requires them (for performance in their original state).
     * @see #solutionHashCode()
     */
    public boolean solutionEquals(Object o) {
        if (this == o) {
            return true;
        } else if (o instanceof Node) {
            Node other = (Node) o;
            return new EqualsBuilder()
                    .append(name, other.name)
                    .append(id, other.id)
                    .isEquals();
        } else {
            return false;
        }
    }

    /**
     * The normal methods {@link #equals(Object)} and {@link #hashCode()} cannot be used because the rule engine already
     * requires them (for performance in their original state).
     * @see #solutionEquals(Object)
     */
    public int solutionHashCode() {
        return new HashCodeBuilder()
                .append(name)
                .append(id)
                .toHashCode();
    }
}
Each move must update the graph by removing the previous edge and adding the new edge from the node to its parent, so I'm using a custom change move:
public class ParentChangeMove implements Move {

    private Node node;
    private Node parent;
    private Graph g = Graph.getInstance();

    public ParentChangeMove(Node node, Node parent) {
        this.node = node;
        this.parent = parent;
    }

    public boolean isMoveDoable(ScoreDirector scoreDirector) {
        List<Dependency> dep = new ArrayList<Dependency>(g.getDependencies());
        dep.add(new Dependency(parent.getName(), node.getName()));
        return !ObjectUtils.equals(node.getParent(), parent) && !g.detectCycles(dep) && !g.getParents(node).contains(parent);
    }

    public Move createUndoMove(ScoreDirector scoreDirector) {
        return new ParentChangeMove(node, node.getParent());
    }

    public void doMove(ScoreDirector scoreDirector) {
        scoreDirector.beforeVariableChanged(node, "parent"); // before changes are made
        // The previous edge is removed from the graph
        if (node.getParent() != null)
        {
            Dependency d = new Dependency(node.getParent().getName(), node.getName());
            g.removeEdge(g.getDep2link().get(d));
            g.getDependencies().remove(d);
            g.getDep2link().remove(d);
        }
        node.setParent(parent); // the move
        // The new edge is added to the graph (parent ==> node)
        Link link = new Link();
        Dependency d = new Dependency(parent.getName(), node.getName());
        g.addEdge(link, parent, node);
        g.getDependencies().add(d);
        g.getDep2link().put(d, link);
        g.setStepTimes();
        scoreDirector.afterVariableChanged(node, "parent"); // after changes are made
    }

    public Collection<? extends Object> getPlanningEntities() {
        return Collections.singletonList(node);
    }

    public Collection<? extends Object> getPlanningValues() {
        return Collections.singletonList(parent);
    }

    public boolean equals(Object o) {
        if (this == o) {
            return true;
        } else if (o instanceof ParentChangeMove) {
            ParentChangeMove other = (ParentChangeMove) o;
            return new EqualsBuilder()
                    .append(node, other.node)
                    .append(parent, other.parent)
                    .isEquals();
        } else {
            return false;
        }
    }

    public int hashCode() {
        return new HashCodeBuilder()
                .append(node)
                .append(parent)
                .toHashCode();
    }

    public String toString() {
        return node + " => " + parent;
    }
}
The graph defines multiple methods that are used by the constraints to calculate the score of each solution, like the following:
rule "MinimizeTotalTime" // Minimize the total process time
when
eval(true)
then
scoreHolder.addSoftConstraintMatch(kcontext, 1, -Graph.getInstance().totalTime());
end
In the other environment modes, the error does not appear, but the best score calculated is not equal to the actual score.
I don't have any clue as to where the problem could come from. Note that I have already checked all my equals and hashCode methods.
EDIT: Following ge0ffrey's proposition, I used a collect CE in the "MinimizeTotalTime" rule to check whether the error comes back:
rule "MinimizeTotalTime" // Minimize the total process time
when
ArrayList() from collect(Node())
then
scoreHolder.addSoftConstraintMatch(kcontext, 0, -Graph.getInstance().totalTime());
end
At this point, no error appears and everything seems OK. But when I use "terminate early", I get the following error:
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Score corruption: the solution's score (-9133) is not the uncorruptedScore (-9765).
Also, I have a rule that doesn't use any method from the Graph class and seems to respect incremental score calculation, but it returns another score corruption error.
The purpose of the rule is to make sure that we don't use more resources than are available:
rule "addMarks" //insert a Mark each time a task starts or ends
when
Node($startTime : getStartTime(), $endTime : getEndTime())
then
insertLogical(new Mark($startTime));
insertLogical(new Mark($endTime));
end
rule "resourcesLimit" // At any time, The number of resources used must not exceed the total number of resources available
when
Mark($startTime: time)
Mark(time > $startTime, $endTime : time)
not Mark(time > $startTime, time < $endTime)
$total : Number(intValue > Global.getInstance().getAvailableResources() ) from
accumulate(Node(getEndTime() >=$endTime, getStartTime()<= $startTime, $res : resources), sum($res))
then
scoreHolder.addHardConstraintMatch(kcontext, 0, (Global.getInstance().getAvailableResources() - $total.intValue()) * ($endTime - $startTime) );
end
Following is the error:
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Score corruption: the workingScore (-193595) is not the uncorruptedScore (-193574) after completedAction (DWL_CM_XX_101#DWL_PA_XX_180 => DWL_PA_XX_180):
The corrupted scoreDirector has 4 ConstraintMatch(s) which are in excess (and should not be there):
com.abcdl.be.solver/resourcesLimit/level0/[43.0, 2012, 1891]=-2783
com.abcdl.be.solver/resourcesLimit/level0/[45.0, 1870, 1805]=-1625
com.abcdl.be.solver/resourcesLimit/level0/[46.0, 1805, 1774]=-806
com.abcdl.be.solver/resourcesLimit/level0/[45.0, 1774, 1762]=-300
The corrupted scoreDirector has 3 ConstraintMatch(s) which are missing:
com.abcdl.be.solver/resourcesLimit/level0/[43.0, 2012, 1901]=-2553
com.abcdl.be.solver/resourcesLimit/level0/[45.0, 1870, 1762]=-2700
com.abcdl.be.solver/resourcesLimit/level0/[44.0, 1901, 1891]=-240
Check your score constraints.
A score rule that has an LHS of just eval(true) is inherently broken. Either that constraint is always broken, for the exact same weight, and there really is no reason to evaluate it; or it is sometimes broken (or always broken but with different weights), and then the rule needs to refire accordingly.
Problem: the return value of Graph.getInstance().totalTime() changes as the planning variables change value. But Drools just looks at the LHS as planning variables change, and it sees that nothing in the LHS has changed, so there's no need to re-evaluate that score rule when the planning variables change. Note: this is called incremental score calculation (see the docs), which is a huge performance speedup.
Subproblem: the method Graph.getInstance().totalTime() is inherently not incremental.
Fix: translate that totalTime() function into a DRL rule based on Node selections. You'll probably need to use accumulate. If that's too hard (because it's a complex calculation of the critical path or so), try it anyway (for incremental score calculation's sake), or try an LHS that does a collect over all Nodes (which is like eval(true), but it will be refired every time).
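For illustration, here is a minimal sketch of the accumulate approach. It assumes totalTime() boils down to the makespan, i.e. the largest end time over all nodes; if your definition differs, adapt the accumulate accordingly:
rule "MinimizeTotalTime" // Minimize the total process time, incrementally
when
    // Re-fires whenever any Node's end time changes, keeping the score incremental.
    accumulate( Node( $end : getEndTime() );
                $makespan : max($end) )
then
    scoreHolder.addSoftConstraintMatch(kcontext, 1, -((Number) $makespan).intValue());
end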

Text extraction from table cells

I have a PDF containing a table with many cells (more than 100). I know the exact position (x, y) and dimensions (w, h) of every cell of the table.
I need to extract text from the cells using iTextSharp. Using PdfReaderContentParser + FilteredTextRenderListener (using code like this: http://itextpdf.com/examples/iia.php?id=279) I can extract text, but I need to run the whole procedure for each cell. My PDF has many cells, and the program takes too long to run. Is there a way to extract text from a list of rectangles? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea in PDFBox (you can define as many regions as you need and then get the text using .getTextForRegion("region-name")).
This option is not immediately included in the iTextSharp distribution, but it is easy to implement. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism was introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page, each containing the associated baseline segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface, which is not too difficult, as it only defines one method:
public static interface TextChunkFilter
{
    /**
     * @param textChunk the chunk to check
     * @return true if the chunk should be allowed
     */
    public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell, you can of course also create one instance whose parameters, i.e. the cell coordinates, can be changed between getResultantText calls.)
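Such a filter might look like the following sketch (iText 5.x Java; the choice to test only the chunk's start location against the cell rectangle is an assumption, and you may prefer to require both endpoints inside the cell):
// Sketch: accept chunks whose start location lies inside the given cell rectangle.
class CellFilter implements LocationTextExtractionStrategy.TextChunkFilter
{
    private final Rectangle cell; // the cell's known coordinates

    public CellFilter(Rectangle cell)
    {
        this.cell = cell;
    }

    @Override
    public boolean accept(LocationTextExtractionStrategy.TextChunk textChunk)
    {
        Vector start = textChunk.getStartLocation();
        return start.get(Vector.I1) >= cell.getLeft() && start.get(Vector.I1) <= cell.getRight()
            && start.get(Vector.I2) >= cell.getBottom() && start.get(Vector.I2) <= cell.getTop();
    }
}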
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though: only one small interface and one method to add to the strategy.
PPS: In a comment, sschuberth asked:
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how do you then specify the page to extract from?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    parser.processContent(pageNumber, strategy, additionalContentOperators);
    List<String> result = new ArrayList<>();
    for (TextChunkFilter chunkFilter : chunkFilters)
    {
        result.add(strategy.getResultantText(chunkFilter));
    }
    return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old-fashioned for loop.)
Here's my take on how to extract text from a table-like structure in a PDF using iTextSharp. It returns a collection of rows, and each row contains a collection of interpreted columns. This may work for you on the premise that the gap between one column and the next is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
        string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
        var table = tableExtractionStrategy.GetTable();
    }
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
    public float NextCharacterThreshold { get; set; } = 1;
    public int NextLineLookAheadDepth { get; set; } = 500;
    public bool AccomodateWordWrapping { get; set; } = true;

    private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);
        string text = renderInfo.GetText();
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
        Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
        Chunks.Add(new TableTextChunk(rectangle, text));
    }

    public List<List<string>> GetTable()
    {
        List<List<string>> lines = new List<List<string>>();
        List<string> currentLine = new List<string>();
        float? previousBottom = null;
        float? previousRight = null;
        StringBuilder currentString = new StringBuilder();

        // iterate through all chunks and evaluate
        for (int i = 0; i < Chunks.Count; i++)
        {
            TableTextChunk chunk = Chunks[i];

            // determine if we are processing the same row based on defined space between subsequent chunks
            if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
            {
                if (chunk.Rectangle.Left - previousRight > 1)
                {
                    currentLine.Add(currentString.ToString());
                    currentString.Clear();
                }
                currentString.Append(chunk.Text);
                previousRight = chunk.Rectangle.Right;
            }
            else
            {
                // if we are processing a new line let's check to see if this could be word wrapping behavior
                bool isNewLine = true;
                if (AccomodateWordWrapping)
                {
                    int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
                    if (previousBottom.HasValue)
                        for (int j = i; j < readAheadDepth; j++)
                        {
                            if (previousBottom == Chunks[j].Rectangle.Bottom)
                            {
                                isNewLine = false;
                                break;
                            }
                        }
                }

                // if the text was not word wrapped let's treat this as a new table row
                if (isNewLine)
                {
                    if (currentString.Length > 0)
                        currentLine.Add(currentString.ToString());
                    currentString.Clear();
                    previousBottom = chunk.Rectangle.Bottom;
                    previousRight = chunk.Rectangle.Right;
                    currentString.Append(chunk.Text);
                    if (currentLine.Count > 0)
                        lines.Add(currentLine);
                    currentLine = new List<string>();
                }
                else
                {
                    if (chunk.Rectangle.Left - previousRight > 1)
                    {
                        currentLine.Add(currentString.ToString());
                        currentString.Clear();
                    }
                    currentString.Append(chunk.Text);
                    previousRight = chunk.Rectangle.Right;
                }
            }
        }
        return lines;
    }

    private struct TableTextChunk
    {
        public Rectangle Rectangle;
        public string Text;

        public TableTextChunk(Rectangle rect, string text)
        {
            Rectangle = rect;
            Text = text;
        }

        public override string ToString()
        {
            return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
        }
    }
}

Lucene payload scoring

I want to figure out how payload scoring works in Lucene. Since I don't understand where PayloadFunction fits in, I think I don't really understand how it works. I tried googling for it but couldn't find much, apart from advice to go through the source. Well, it would be nice if someone could explain it here; otherwise, source code it is :)
There are three parts to it. First of all, you should generate payloads during analysis. This can be done using PayloadAttribute. You just need to add this attribute to the terms you want during analysis.
class MyFilter extends TokenFilter {

    private PayloadAttribute attr;

    public MyFilter(TokenStream input) {
        super(input); // TokenFilter has no no-arg constructor; it wraps an input stream
        attr = addAttribute(PayloadAttribute.class);
    }

    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            Payload p = new Payload(PayloadHelper.encodeFloat(42));
            attr.setPayload(p);
            return true;
        } else {
            attr.setPayload(null);
            return false;
        }
    }
}
Then, during searching, you should use the special query class PayloadTermQuery. This class behaves like SpanTermQuery but also keeps track of payloads in the index. Using a custom Similarity implementation, you can score each payload occurrence in a document.
public class MySimilarity extends DefaultSimilarity {

    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        if (payload != null) {
            return PayloadHelper.decodeFloat(payload, offset);
        } else {
            return 1.0f;
        }
    }
}
Finally, using a PayloadFunction you can aggregate the payload scores over a document to produce the final document score, as sketched below.
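A minimal sketch of wiring the three parts together (Lucene 3.x API; the field name, term, and reader are placeholders):
// Search with payload-aware scoring.
PayloadTermQuery query = new PayloadTermQuery(
        new Term("body", "lucene"),
        new AveragePayloadFunction()); // aggregates scorePayload() over all occurrences
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MySimilarity()); // so scorePayload() above is consulted
TopDocs topDocs = searcher.search(query, 10);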

Nunit Assertion for an empty intersection between collection

I've looked all around and can't quite figure this one out, and my multitude of trial-and-error attempts have all been useless.
I have a list of user names (we'll call it the 'original list') that one object returns.
I have a list of user names (we'll call it the 'filtration list') that another object returns.
I am testing a method that returns all of the items from the original list that are not in the filtration list.
Ideally, what I want is something like:
Assert.That(returnedList, Has.No.Members.In(filtrationList));
So far, the only thing I can do is iterate over the filtrationList and do:
Assert.That(returnedList, Has.None.EqualTo(filteredUser));
With NUnit you can create any custom constraint.
If you want to check two collections for intersection, you can create something like this:
public class Intersects : CollectionConstraint
{
    private IEnumerable _collection2;

    public Intersects(IEnumerable collection2)
        : base(collection2)
    {
        _collection2 = collection2;
    }

    public static Intersects With(IEnumerable arg)
    {
        return new Intersects(arg);
    }

    protected override bool doMatch(IEnumerable collection)
    {
        foreach (object value in collection)
        {
            foreach (object value2 in _collection2)
                if (value.Equals(value2))
                    return true;
        }
        return false;
    }

    public override void WriteDescriptionTo(MessageWriter writer)
    {
        // You could write something more meaningful here, such as the items
        // that should not be in the verified collection.
        writer.Write("intersecting collections");
    }
}
Usage is pretty simple:
string[] returnedList = new string[] { "Martin", "Kent", "Jack" };
List<string> filteredUsers = new List<string>();
filteredUsers.Add("Jack");
filteredUsers.Add("Bob");

Assert.That(returnedList, Intersects.With(filteredUsers));
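Note that the assertion above passes when the collections do intersect; for the question's requirement (no common members) you would negate the constraint. A sketch, assuming NUnit 2.x's NotConstraint (a plain LINQ check works as well):
// Assert that the two collections share no members.
Assert.That(returnedList, new NotConstraint(Intersects.With(filteredUsers)));

// Or, without a custom constraint (requires using System.Linq):
Assert.That(returnedList.Intersect(filteredUsers), Is.Empty);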