postings nextPosition returns null, freq returns 0, payload() returns null - lucene

I built a minimal index with one document using LuceneTestCase. My goal is to write a number into the payload for each position of each term; those numbers will be used in a custom scoring formula implemented in a custom Query/Scorer.
I used SimpleTextCodec and verified that freq, positions and payloads were actually written to the index.
But when I read from the PostingsEnum, freq() returns 0, getPayload() returns null, and nextPosition() throws an exception:
java.lang.AssertionError: got line=field model
at __randomizedtesting.SeedInfo.seed([D334C9D1B5C155E3:2AAE4BE5481F4C8F]:0)
at org.apache.lucene.codecs.simpletext.SimpleTextFieldsReader$SimpleTextPostingsEnum.nextPosition(SimpleTextFieldsReader.java:455)
Here is how I'm reading the postings in the custom Query:
for (String field: fieldScores.keySet()) {
final Terms fieldTerms = reader.terms(field);
if (fieldTerms == null) {
continue;
}
if (!fieldTerms.hasPositions())
throw new IllegalStateException("Index does not contain positions");
if (!fieldTerms.hasPayloads())
throw new IllegalStateException("Index does not contain payloads");
final TermsEnum te = fieldTerms.iterator();
for (int j = 0; j < terms.length; j++) {
final Term t = terms[j];
if (t.field().equals(field) && te.seekExact(t.bytes())) {
PostingsEnum postingsEnum = te.postings(null, PostingsEnum.ALL);
int pos = postingsEnum.nextPosition();
BytesRef payload = postingsEnum.getPayload();
// assert payload.bytesEquals(new BytesRef(new byte[]{1}));
// TODO: use payload in scoring formula
fldScorers.add(new ConstTermScorer(this, t,
fieldScores.get(field) * termScores.get(t.text()),
postingsEnum));
}
}
}

I've found the reason. nextPosition(), freq() and getPayload() return 0 (or null) because the postingsEnum iterator has just been created and is not yet positioned on a document: postingsEnum.nextDoc() was never called, so postingsEnum.docID() is still -1. A silly mistake on my side, but it might be better if nextPosition(), freq() and getPayload() checked postingsEnum.docID() and failed with a clear error.

Itext7 cleanup method throws error - Index was out of range

I am getting the below error while trying to redact pdf document using itext7
I am calling pdfCleanupTool.cleanup() method for redaction and sometimes I am getting the below error from the cleanup method:
Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index
Any help appreciated.
Thanks!
Updates:
Error Log:
There is a bug in the iText 7 PdfTextArray class which generates stack traces like yours. As you don't share your PDF, though, I cannot be sure whether that's the bug bothering you currently.
The Bug
The bug is easy to provoke, in Java like this:
PdfTextArray textArray = new PdfTextArray();
textArray.add(1);
textArray.add(-1);
textArray.add(1);
(CancelingAdjustments test testCancelingAdjustments)
and similarly in C#.
This essentially may be what happens in the OP's case; redaction involves removal of text pieces from such text arrays and replacement by equivalent numeric adjustments, so such situations may be more probable during redaction than in general.
The Cause
When adding multiple numbers to a PdfTextArray, it attempts to combine them to a single number, and if that single number is zero, remove it altogether:
public boolean add(float number) {
// adding zero doesn't modify the TextArray at all
if (number != 0) {
if (!Float.isNaN(lastNumber)) {
lastNumber = number + lastNumber;
if (lastNumber != 0) {
set(size() - 1, new PdfNumber(lastNumber));
} else {
remove(size() - 1);
}
} else {
lastNumber = number;
super.add(new PdfNumber(lastNumber));
}
lastString = null;
return true;
}
return false;
}
(PdfTextArray method add)
But this code forgets to reset the lastNumber variable to "not a number" after removal due to cancelation. Thus, this bug can be fixed like this:
public boolean add(float number) {
// adding zero doesn't modify the TextArray at all
if (number != 0) {
if (!Float.isNaN(lastNumber)) {
lastNumber = number + lastNumber;
if (lastNumber != 0) {
set(size() - 1, new PdfNumber(lastNumber));
} else {
remove(size() - 1);
lastNumber = Float.NaN;
}
} else {
lastNumber = number;
super.add(new PdfNumber(lastNumber));
}
lastString = null;
return true;
}
return false;
}
(One could improve this some more by testing whether there is some string at the now last position of the array and initialize lastString accordingly.)
The iText/.Net code is very similar here.
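To see the effect of the missing reset in isolation, here is a minimal Java sketch that models only the numeric bookkeeping of the add method with a plain list. The class AdjustmentList and its resetAfterCancel flag are hypothetical stand-ins for illustration and are not part of the iText API; strings and lastString are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PdfTextArray that stores only numeric adjustments. */
final class AdjustmentList {
    private final List<Float> items = new ArrayList<>();
    private float lastNumber = Float.NaN;   // NaN means "the last item is not a number"
    private final boolean resetAfterCancel; // true = apply the fix described above

    AdjustmentList(boolean resetAfterCancel) {
        this.resetAfterCancel = resetAfterCancel;
    }

    boolean add(float number) {
        if (number == 0) {
            return false; // adding zero does not modify the list
        }
        if (!Float.isNaN(lastNumber)) {
            lastNumber = number + lastNumber;
            if (lastNumber != 0) {
                items.set(items.size() - 1, lastNumber);
            } else {
                items.remove(items.size() - 1);
                if (resetAfterCancel) {
                    lastNumber = Float.NaN; // forget the canceled number
                }
            }
        } else {
            lastNumber = number;
            items.add(lastNumber);
        }
        return true;
    }

    List<Float> items() {
        return items;
    }

    public static void main(String[] args) {
        AdjustmentList fixed = new AdjustmentList(true);
        fixed.add(1);
        fixed.add(-1);
        fixed.add(1);
        System.out.println(fixed.items()); // [1.0]

        AdjustmentList buggy = new AdjustmentList(false);
        buggy.add(1);
        buggy.add(-1);
        try {
            buggy.add(1); // stale lastNumber leads to set(-1, ...)
        } catch (IndexOutOfBoundsException e) {
            System.out.println("buggy variant failed: index out of range");
        }
    }
}
```

Without the reset, the third add still sees lastNumber as a number (0) and tries to overwrite the last element of a list that is now empty, which matches the "index out of range" symptom from the question.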

Flutter how to turn Lists encoded as Strings for a SQFL database back to Lists concisely?

I fear I'm trying to reinvent the wheel here. I'm putting Objects into my SQFL database:
https://pub.dev/packages/sqflite
Some of the object fields are Lists of ints, others are Lists of Strings. I'm encoding these as plain Strings to store in a TEXT field in my SQFL database.
At some point I'm going to have to turn them back. I couldn't find anything on Google, which is surprising because this must be a very common occurrence with SQFL.
I've started coding the 'decoding', but it's rookie Dart. Is there anything performant around that I ought to use?
Code included to prove I'm not totally lazy, no need to look, edge cases make it fail.
List<int> listOfInts = new List<int>();
String testStringOfInts = "[1,2,4]";
List<String> intermediateStep2 = testStringOfInts.split(',');
int numListElements = intermediateStep2.length;
print("intermediateStep2: $intermediateStep2, numListElements: $numListElements");
for (int i = 0; i < numListElements; i++) {
if (i == 0) {
listOfInts.add(int.parse(intermediateStep2[i].substring(1)));
continue;
}
else if ((i) == (numListElements - 1)) {
print('final element: ${intermediateStep2[i]}');
listOfInts.add(int.parse(intermediateStep2[i].substring(0, intermediateStep2[i].length - 1)));
continue;
}
else listOfInts.add(int.parse(intermediateStep2[i]));
}
print('Output: $listOfInts');
/* DECODING LISTS OF STRINGS */
String testString = "['element1','element2','element23']";
List<String> intermediateStep = testString.split("'");
List<String> output = new List<String>();
for (int i = 0; i < intermediateStep.length; i++) {
if (i % 2 == 0) {
continue;
} else {
print('adding a value to output: ${intermediateStep[i]}');
//print('value is a: ${(intermediateStep[i]).runtimeType}');
output.add(intermediateStep[i]);
}
}
print('Output: $output');
}
For the integers you could do the parsing like this:
void main() {
print(parseStringAsIntList("[1,2,4]")); // [1, 2, 4]
}
List<int> parseStringAsIntList(String stringOfInts) => stringOfInts
.substring(1, stringOfInts.length - 1)
.split(',')
.map(int.parse)
.toList();
I need more information about how the Strings are saved in corner cases, for example if they contain , and/or ', since that changes how the parsing should be done. If both characters are valid in the strings (especially ,), I recommend changing the storage format to JSON instead, which makes encoding and decoding much easier and avoids the risk of characters that break the parsing.
But a rather naive solution can be written like this if we know that no String contains a , character:
void main() {
print(parseStringAsStringList("['element1','element2','element23']"));
// [element1, element2, element23]
}
List<String> parseStringAsStringList(String stringOfStrings) => stringOfStrings
.substring(1, stringOfStrings.length - 1)
.split(',')
.map((string) => string.substring(1, string.length - 1))
.toList();

How can I customize values for genes of chromosome?

I am trying to solve a job assignment problem using GeneticSharp. It assigns gates to trucks, and not all gates are suitable for every truck.
Each chromosome is required to have gene values taken from a certain array of double values that corresponds to the gene index (each gene index is equal to a truck number). So I'm trying to pick a value randomly from that array and assign it to the gene in the FloatingPointChromosome class, but this gives me the error 'Object reference not set to an instance of an object. allowedStands was null'.
Could you please advise me how to solve it?
public FloatingPointChromosome(double[] minValue, double[] maxValue, int[] totalBits, int[] fractionDigits, double[] geneValues, double[][] allowedStands)
: base(totalBits.Sum())
{
m_minValue = minValue;
m_maxValue = maxValue;
m_totalBits = totalBits;
m_fractionDigits = fractionDigits;
// If values are not supplied, create random values
if (geneValues == null)
{
geneValues = new double[minValue.Length];
//var rnd = RandomizationProvider.Current;
var rnd = new Random();
for (int i = 0; i < geneValues.Length; i++)
{
int a = rnd.Next(allowedStands[i].Length);
geneValues[i] = allowedStands[i][a];
//I make here that it randomly selects from allowed gates array
}
}
m_originalValueStringRepresentation = String.Join(
"",
BinaryStringRepresentation.ToRepresentation(
geneValues,
totalBits,
fractionDigits));
CreateGenes();
}
I think that for truck and gate assignment it is better to create your own chromosome. Take a look at TspChromosome to get an idea:
public TspChromosome(int numberOfCities) : base(numberOfCities)
{
m_numberOfCities = numberOfCities;
var citiesIndexes = RandomizationProvider.Current.GetUniqueInts(numberOfCities, 0, numberOfCities);
for (int i = 0; i < numberOfCities; i++)
{
ReplaceGene(i, new Gene(citiesIndexes[i]));
}
}
Using the same approach, your city indexes become your gate indexes.
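As a language-neutral illustration of that idea, here is a small Java sketch (not GeneticSharp code) that builds one gene per truck by drawing from that truck's list of allowed gates. The names GateGenes, randomGenes and allowedGates are hypothetical, and gates are assumed to be identified by plain integers.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: one gene per truck, each drawn from that truck's allowed gates. */
final class GateGenes {
    static int[] randomGenes(int[][] allowedGates, Random rnd) {
        // Fail early with a clear message instead of a null reference deep inside the loop.
        if (allowedGates == null) {
            throw new IllegalArgumentException("allowedGates must not be null");
        }
        int[] genes = new int[allowedGates.length];
        for (int truck = 0; truck < genes.length; truck++) {
            int[] options = allowedGates[truck];
            if (options == null || options.length == 0) {
                throw new IllegalArgumentException("truck " + truck + " has no allowed gate");
            }
            // Pick one of the gates this truck is allowed to use.
            genes[truck] = options[rnd.nextInt(options.length)];
        }
        return genes;
    }

    public static void main(String[] args) {
        int[][] allowedGates = { {1, 2}, {3}, {2, 4, 5} };
        int[] genes = randomGenes(allowedGates, new Random());
        System.out.println(Arrays.toString(genes)); // one allowed gate per truck
    }
}
```

The explicit null check only makes the failure from the question easier to spot; it does not explain why the array arrived as null in the first place. A custom chromosome would presumably also need its mutation step to draw from the same per-truck list, so that offspring stay valid.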

How to set the line space between two chunks in itextsharp

I am creating a PDF using iTextSharp for a reporting tool. Everything is working fine, except that the space between two chunks is slightly greater than what I want. I searched StackOverflow and learned about SetLeading(fixed, multiplied), but it is not available on Chunk.
The reason I need it on Chunk is that I have multiple chunks which I add to a paragraph, and I then add everything to the Document in one go.
public static void createPDF(Paragraph para)
{
string imagepath = "12.pdf";
Document doc = new Document();
try
{
Paragraph p = para;
Rectangle[] COLUMNS = {
new Rectangle(36, 36, 290, 806),
new Rectangle(305, 36, 559, 806)
};
//This is what i have tried
// p.SetLeading(0.4f,0.8f);
p.SpacingBefore = 0.0f;
p.SpacingAfter = 0.1f;
PdfReader inputPdf = new PdfReader(@"");
PdfWriter writer2 = PdfWriter.GetInstance(doc, new FileStream(imagepath, FileMode.Create));
doc.Open();
PdfContentByte canvas = writer2.DirectContent;
for (int ij = 1; ij <= 3; ij++)
{
doc.SetPageSize(inputPdf.GetPageSizeWithRotation(ij));
doc.NewPage();
PdfImportedPage page = writer2.GetImportedPage(inputPdf, ij);
int rotation = inputPdf.GetPageRotation(ij);
if (rotation == 90 || rotation == 270)
{
canvas.AddTemplate(page, 0, -1f, 1f, 0, 0, inputPdf.GetPageSizeWithRotation(ij).Height);
}
else
{
canvas.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
doc.NewPage();
ColumnText ct = new ColumnText(canvas);
int side_of_the_page = 0;
ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
int paragraphs = 0;
int i = 0;
while (paragraphs < p.Count-1)
{
string TEXT = p[i].ToString();
ct.AddElement(p[i]);
while (ColumnText.HasMoreText(ct.Go()))
{
if (side_of_the_page == 0)
{
side_of_the_page = 1;
canvas.MoveTo(297.5f, 36);
canvas.LineTo(297.5f, 806);
canvas.Stroke();
}
else
{
side_of_the_page = 0;
doc.NewPage();
}
ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
}
i++;
paragraphs++;
}
doc.Close();
}
catch {
}
}
Please read chapter 2 of my book. The Chunk object is called the atomic building block among iText's high-level objects. By design, you cannot define a leading on the level of a Chunk.
I quote from page 23:
A Chunk isn't aware of the space that is needed between two lines.
The leading is defined at the level of a Phrase (and, of course, its superclasses, such as Paragraph). If you want to change the spacing between Chunk objects, you need to wrap Chunks in Phrases or Paragraphs (as you already indicate) and define the leading for those phrases or paragraphs.
Note that the documentation also states:
In normal circumstances you'll use Chunk objects to compose other text objects, such as Phrases and Paragraphs. Typically, you won't add Chunk objects directly to a Document.
Which special circumstance do you have that requires making an exception to this rule?
Extra remarks
You are importing an existing PDF in a way that throws away all existing interactivity. This is suboptimal.
You first compose a paragraph p, you set the leading for p, then you decompose p throwing away the leading you've defined and then you complain that there's no leading.
This is what you are doing wrong:
while (paragraphs < p.Count-1)
{
ct.AddElement(p[i]);
...
}
The object p knows its leading; the separate components of this object (p[0], p[1],...), don't know anything about the leading.
Hence you should do something like this:
ColumnText ct = new ColumnText(canvas);
int side_of_the_page = 0;
ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
ct.AddElement(p);
while (ColumnText.HasMoreText(ct.Go()))
{
if (side_of_the_page == 0)
{
side_of_the_page = 1;
canvas.MoveTo(297.5f, 36);
canvas.LineTo(297.5f, 806);
canvas.Stroke();
}
else
{
side_of_the_page = 0;
doc.NewPage();
}
ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
}
As you have defined the leading at the level of the p object, you must add the p object as an element to the ColumnText.
Regarding the wrong way you're copying the original document: the AddLongTable example shows how to do it correctly. You get a PdfReader object for the existing document. You create a PdfStamper to create a new document. You get the number of pages in the existing document, and then you use insertPage() as many times as needed to add extra content.

What is the fastest way to compare two byte arrays?

I am trying to compare two long byte arrays in VB.NET and have run into a snag. Comparing two 50 megabyte files takes almost two minutes, so I'm clearly doing something wrong. I'm on an x64 machine with tons of memory, so there are no issues there. Here is the code that I'm using at the moment and would like to change.
_Bytes and item.Bytes are the two different arrays to compare and are already the same length.
For Each B In item.Bytes
If B <> _Bytes(I) Then
Mismatch = True
Exit For
End If
I += 1
Next
I need to be able to compare, as fast as possible, files that are potentially hundreds of megabytes and possibly even a gigabyte or two. Any suggestions or algorithms that would be able to do this faster?
Item.bytes is an object taken from the database/filesystem that is returned to compare, because its byte length matches the item that the user wants to add. By comparing the two arrays I can then determine if the user has added something new to the DB and if not then I can just map them to the other file and not waste hard disk drive space.
[Update]
I converted the arrays to local variables of Byte() and then did the same comparison, same code and it ran in like one second (I have to benchmark it still and compare it to others), but if you do the same thing with local variables and use a generic array it becomes massively slower. I’m not sure why, but it raises a lot more questions for me about the use of arrays.
What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!
There will be plenty of ways to micro-optimise this in terms of looking at longs at a time, potentially using unsafe code etc - but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.
I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)
EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:
public static bool Equals(byte[] first, byte[] second)
{
if (first == second)
{
return true;
}
if (first == null || second == null)
{
return false;
}
if (first.Length != second.Length)
{
return false;
}
for (int i=0; i < first.Length; i++)
{
if (first[i] != second[i])
{
return false;
}
}
return true;
}
EDIT: And here's the VB:
Public Shared Function ArraysEqual(ByVal first As Byte(), _
ByVal second As Byte()) As Boolean
If (first Is second) Then
Return True
End If
If (first Is Nothing OrElse second Is Nothing) Then
Return False
End If
If (first.Length <> second.Length) Then
Return False
End If
For i as Integer = 0 To first.Length - 1
If (first(i) <> second(i)) Then
Return False
End If
Next i
Return True
End Function
The fastest way to compare two byte arrays of equal size is to use interop. Run the following code on a console application:
using System;
using System.Runtime.InteropServices;
using System.Security;
namespace CompareByteArray
{
class Program
{
static void Main(string[] args)
{
const int SIZE = 100000;
const int TEST_COUNT = 100;
byte[] arrayA = new byte[SIZE];
byte[] arrayB = new byte[SIZE];
for (int i = 0; i < SIZE; i++)
{
arrayA[i] = 0x22;
arrayB[i] = 0x22;
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Safe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Safe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Unsafe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Unsafe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Pure(arrayA, arrayB, SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Pure: {0}", after - before);
}
return;
}
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint="memcmp", ExactSpelling=true)]
[SuppressUnmanagedCodeSecurity]
static extern int memcmp_1(byte[] b1, byte[] b2, UIntPtr count);
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "memcmp", ExactSpelling = true)]
[SuppressUnmanagedCodeSecurity]
static extern unsafe int memcmp_2(byte* b1, byte* b2, UIntPtr count);
public static int MemCmp_Safe(byte[] a, byte[] b, UIntPtr count)
{
return memcmp_1(a, b, count);
}
public unsafe static int MemCmp_Unsafe(byte[] a, byte[] b, UIntPtr count)
{
fixed(byte* p_a = a)
{
fixed (byte* p_b = b)
{
return memcmp_2(p_a, p_b, count);
}
}
}
public static int MemCmp_Pure(byte[] a, byte[] b, int count)
{
int result = 0;
for (int i = 0; i < count && result == 0; i += 1)
{
result = a[i] - b[i];
}
return result;
}
}
}
If you don't need to know which byte differs, compare 64-bit integers, which checks 8 bytes at once. You can still find the differing byte afterwards, once you've narrowed it down to a block of 8.
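A minimal sketch of the 8-bytes-at-a-time idea, written in Java for illustration rather than in VB.NET. WordCompare and firstMismatch are hypothetical names; the sketch assumes both arrays have the same length, as in the question, and makes no performance claim.

```java
import java.nio.ByteBuffer;

final class WordCompare {
    /** Returns the index of the first differing byte, or -1 if the arrays are equal. */
    static int firstMismatch(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        ByteBuffer bufA = ByteBuffer.wrap(a);
        ByteBuffer bufB = ByteBuffer.wrap(b);
        int wordEnd = a.length - (a.length % 8); // last offset covered by whole 8-byte words
        int i = 0;
        // Compare 8 bytes per step until a differing word is found.
        for (; i < wordEnd; i += 8) {
            if (bufA.getLong(i) != bufB.getLong(i)) {
                break;
            }
        }
        // Scan byte by byte: either inside the differing word or over the short tail.
        for (; i < a.length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] a = new byte[20];
        byte[] b = new byte[20];
        System.out.println(firstMismatch(a, b)); // -1
        b[13] = 7;
        System.out.println(firstMismatch(a, b)); // 13
    }
}
```

On Java 9 and later, Arrays.mismatch(a, b) provides the same result out of the box, so the hand-written loop is mainly useful for showing how the narrowing to a block of 8 works.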
Use BinaryReader:
saveTime = binReader.ReadInt32()
Or for arrays of ints:
Dim count As Integer = binReader.Read(testArray, 0, 3)
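Better approach... If you only need to know whether the two differ, generate a hash of each byte array and compare the hashes. Note that computing a hash still reads every byte, so this only saves time when the hash of the stored item is computed once and kept alongside it; a new item is then hashed once and compared against the stored hashes. MD5 should work fine for detecting accidental differences and is pretty efficient.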
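The hash idea can be sketched in Java as follows, under the assumption that the digest of each stored item is computed once and saved next to it. HashCompare, md5 and matchesStored are hypothetical names; MD5 is used because it is the algorithm named above, and any other MessageDigest algorithm could be substituted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashCompare {
    /** Computes the MD5 digest (16 bytes) of the given data. */
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Compares a previously stored digest with the digest of a new candidate. */
    static boolean matchesStored(byte[] storedDigest, byte[] candidate) {
        return MessageDigest.isEqual(storedDigest, md5(candidate));
    }

    public static void main(String[] args) {
        byte[] stored = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] storedDigest = md5(stored); // computed once, e.g. when the item is saved

        byte[] upload = "same content".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchesStored(storedDigest, upload)); // true

        byte[] other = "other content".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchesStored(storedDigest, other)); // false
    }
}
```

Since different inputs can in principle produce the same digest, a matching hash is best treated as "probably identical" and confirmed with a byte-by-byte comparison before the two files are mapped to each other.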
I see two things that might help:
First, rather than always accessing the second array as item.Bytes, use a local variable to point directly at the array. That is, before starting the loop, do something like this:
array2 = item.Bytes
That will save the overhead of dereferencing from the object each time you want a byte. That could be expensive in Visual Basic, especially if there's a Getter method on that property.
Also, use a "definite loop" instead of "for each". You already know the length of the arrays, so just code the loop using that value. This will avoid the overhead of treating the array as a collection. The loop would look something like this:
For i = 0 To max - 1
If array1(i) <> array2(i) Then
Exit For
End If
Next
Not strictly related to the comparison algorithm:
Are you sure your bottleneck is not related to the memory available and the time used to load the byte arrays? Loading two 2 GB byte arrays just to compare them could bring most machines to their knees. If the program design allows, try using streams to read smaller chunks instead.
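A rough Java sketch of the chunked approach, for illustration only: both inputs are read through streams in fixed-size blocks, so memory use stays at two buffers regardless of file size. StreamCompare, sameContent and the chunk size are hypothetical choices, not something taken from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamCompare {
    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    /** Returns true if both streams deliver exactly the same bytes. */
    static boolean sameContent(InputStream first, InputStream second, int chunkSize)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] a = new byte[chunkSize];
        byte[] b = new byte[chunkSize];
        while (true) {
            int readA = fill(first, a);
            int readB = fill(second, b);
            if (readA != readB) {
                return false; // one stream is shorter than the other
            }
            for (int i = 0; i < readA; i++) {
                if (a[i] != b[i]) {
                    return false;
                }
            }
            if (readA < chunkSize) {
                return true; // both streams are exhausted
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] x = {1, 2, 3, 4, 5};
        byte[] y = {1, 2, 3, 4, 6};
        System.out.println(sameContent(
                new ByteArrayInputStream(x), new ByteArrayInputStream(x), 2)); // true
        System.out.println(sameContent(
                new ByteArrayInputStream(x), new ByteArrayInputStream(y), 2)); // false
    }
}
```

With files, the two ByteArrayInputStream instances would be replaced by buffered file streams; the comparison stops at the first differing block, so files that differ early are rejected quickly.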