Here's my code:
var sb = new StringBuilder();
var st = new SimpleTextExtractionStrategy();
string raw;
using (var r = new iTextSharp.text.pdf.PdfReader(path)) {
    for (int pn = 1; pn <= r.NumberOfPages; pn++) {
        raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
        sb.Append(raw);
    }
}
This works for almost all PDFs I've run across... until today:
http://www7.dleg.state.mi.us/orr/Files/AdminCode/356_10334_AdminCode.pdf
For this PDF (and others like it on the same site), the extracted text for page 1 is correct, but the text for page 2 contains pages 1 and 2, page 3 contains pages 1-3, etc. So my StringBuilder ends up with the text from pages 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, etc.
Using the default Location-based strategy has the same issue (and won't work for these particular PDFs anyway).
I recently upgraded from a much older version of iTextSharp (5.1-ish?) and didn't experience this issue before (I believe I've parsed some of these files before without issue). I poked through the source and didn't see anything obvious.
I thought I could work around this by asking for only the last page, but this doesn't work -- I get only the last page. If I hard-code the loop to get pages 2..4, I get 2, 2, 3, 2, 3, 4. So the issue may be some sort of data that PdfReader is maintaining between calls to GetTextFromPage.
Change your code to something like this:
var sb = new StringBuilder();
string raw;
using (var r = new iTextSharp.text.pdf.PdfReader(path)) {
    for (int pn = 1; pn <= r.NumberOfPages; pn++) {
        var st = new SimpleTextExtractionStrategy();
        raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
        sb.Append(raw);
    }
}
Update based on mkl's comment: a strategy remembers all page content it has been confronted with, so you have to use a fresh strategy for each page if you want an extraction with nothing already buffered.
I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore())
When I use it with:
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
but when I try to search with a multi-term query, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
However, the phrase Company name does exist in the source document.txt file.
How do I properly use FuzzyQuery in this case to do a fuzzy search for multi-word phrases?
UPDATED
Based on the provided solution, I tested it on the following text:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
but when I slightly corrupt the search query (for example, from BlueCross to BlueCros):
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem here is that you're using TextField, which is a tokenized field. E.g. your text "Company name is working on something" is effectively split on spaces (and other delimiters), so even though the document contains the text Company name, at indexing time it becomes the separate terms Company, name, is, etc.
Because of this, a term query for Company name won't be able to find what you're looking for. The trick that will help you looks like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach, especially if your load is big and you plan to search for ten-term-long company names. Be aware that such queries are potentially heavy to execute.
The problem with BlueCros is different. You're indexing with StandardAnalyzer, which lowercases terms, so BlueCross in the content field becomes bluecross.
The fuzzy (edit) distance between BlueCros and bluecross is 3, which is why you do not get a match.
A simple fix is to convert the term in the query to lowercase, with something like .toLowerCase().
In general, prefer to apply the same analysis at query time as at index time (i.e., when constructing the query).
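To make that concrete, here is a rough sketch of the lowercasing fix, written against the Lucene.Net 4.8 API since most of this thread is C# (the Lucene.Net API mirrors the Java one used above). The field name "content" and the terms come from the question; the helper name is just illustrative:
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

static SpanNearQuery BuildFuzzyPhrase(string field, params string[] words)
{
    var clauses = new SpanQuery[words.Length];
    for (int i = 0; i < words.Length; i++)
    {
        // StandardAnalyzer lowercased the terms at index time, so lowercase
        // the query terms too; otherwise case differences inflate the edit distance.
        var term = new Term(field, words[i].ToLowerInvariant());
        clauses[i] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(term, 2));
    }
    return new SpanNearQuery(clauses, 0, true); // slop 0, terms in order
}
With this, BuildFuzzyPhrase("content", "BlueCros", "BlueShield") matches the indexed terms bluecross / blueshield, because "bluecros" is within edit distance 1 of "bluecross".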
For Lucene.Net, another approach can look like this:
private string _IndexPath = @"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;

_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name"; // your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0.7f; // between 0 and 1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
    // "big red fox" to [big, red, fox]
    var keywordSplit = keyword.Split();
    _MultiPhraseQuery = new MultiPhraseQuery();
    FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
    Term[] _Term = new Term[keywordSplit.Length];
    for (int i = 0; i < keywordSplit.Length; i++)
    {
        _FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]), fuzzy);
        _Term[i] = _FuzzyTermEnum[i].Term;
        if (_Term[i] == null)
        {
            // No fuzzy match found; fall back to the literal term
            _MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
        }
        else
        {
            _MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
        }
    }
    var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
    foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
    {
        // your code here
    }
}
I am using PDFsharp and have one issue only: I cannot rename form fields. We have a field called 'x', and after an operation is applied to field 'x', it needs to be renamed to field 'y'.
I have seen tons of documentation on how to do this using iTextSharp. Unfortunately, my firm cannot use it, so I am looking for a solution using PDFsharp.
Any ideas?
This can give you an idea of how to perform the field renaming:
var uniqueIndex = Guid.NewGuid();
var fields = pdfDocument.AcroForm.Fields;
var fieldNames = fields.Names;
for (int idx = 0; idx < fieldNames.Length; ++idx)
{
    var fieldName = fieldNames[idx];
    var field = fields[fieldName];
    field.Elements.SetName($"/{fieldName}", $"{fieldName}_{uniqueIndex}");
}
I was able to rename a form field via PDFsharp as follows:
public void RenameAcroField(PdfAcroField field, string newFieldName)
{
    field.Elements.SetString("/T", newFieldName);
}
A little bit tricky, but it worked for my case. Hope it helps.
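A hypothetical end-to-end usage for the 'x'-to-'y' rename from the question might look like the following; PdfReader.Open, AcroForm.Fields, and Save are standard PDFsharp calls, while the file names are made up:
using PdfSharp.Pdf;
using PdfSharp.Pdf.AcroForms;
using PdfSharp.Pdf.IO;

// Open the document for modification, rename field "x" to "y", and save.
var document = PdfReader.Open("form.pdf", PdfDocumentOpenMode.Modify);
PdfAcroField field = document.AcroForm.Fields["x"];
RenameAcroField(field, "y"); // the helper from the answer above
document.Save("form-renamed.pdf");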
VB.NET version for PDFsharp 1.50.5147
Dim i = 0
While i < pdfDoc.AcroForm.Fields.Count
    pdfDoc.AcroForm.Fields(i).Elements.SetString("/T", "formField" & i)
    i += 1
End While
I have a simple scenario where I extract pages from a PDF document (or split the document in two parts, if you will) and merge the parts back to a new document, with an option to add new pages in between.
However, in one particular case the resulting document differs from the original in that a couple of pages (in this case pages 4 and 5) look distorted compared to the source document.
How can I circumvent the distortion of the pages? The reproduction code below has been tested with iTextSharp versions 5.5.0.0 and 5.5.6.0 (latest at the moment).
You can find the input file I used here.
void Main()
{
    var pathPrefix = @"C:\temp"; // TODO change
    var inputDocPath = @"input.pdf";

    var part1 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 1, 2);
    var outputPath1 = Path.Combine(pathPrefix, "part1.pdf");
    File.WriteAllBytes(outputPath1, part1);

    var part2 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 3);
    var outputPath2 = Path.Combine(pathPrefix, "part2.pdf");
    File.WriteAllBytes(outputPath2, part2);

    var merged = Merge(new[] {
        outputPath1,
        outputPath2
    });
    var mergedPath = Path.Combine(pathPrefix, "output.pdf");
    File.WriteAllBytes(mergedPath, merged);
}
//Page sizes:
// input: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
// output: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
public static byte[] Merge(string[] documentPaths)
{
    byte[] mergedDocument;
    using (MemoryStream memoryStream = new MemoryStream())
    using (Document document = new Document())
    {
        PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
        document.Open();
        foreach (var docPath in documentPaths)
        {
            PdfReader reader = new PdfReader(docPath);
            try
            {
                reader.ConsolidateNamedDestinations();
                var numberOfPages = reader.NumberOfPages;
                for (int page = 1; page <= numberOfPages; page++)
                {
                    PdfImportedPage pdfImportedPage = pdfSmartCopy.GetImportedPage(reader, page);
                    pdfSmartCopy.AddPage(pdfImportedPage);
                }
            }
            finally
            {
                reader.Close();
            }
        }
        document.Close();
        mergedDocument = memoryStream.ToArray();
    }
    return mergedDocument;
}
public static byte[] ExtractPages(string pdfDocument, int startPage, int? endPage = null)
{
    var reader = new PdfReader(pdfDocument);
    var numberOfPages = reader.NumberOfPages;
    var endPageResolved = endPage.HasValue ? endPage.Value : numberOfPages;
    if (startPage > numberOfPages || endPageResolved > numberOfPages)
        string.Format("Error: page indices ({0}, {1}) out of bounds. Document has {2} pages.",
            startPage, endPageResolved, numberOfPages).Dump();
    byte[] outputDocument;
    using (var doc = new Document()) // NOTE use reader.GetPageSizeWithRotation(startPage) ?
    using (var msOut = new MemoryStream())
    {
        var pdfCopyProvider = new PdfCopy(doc, msOut);
        doc.Open();
        for (var i = startPage; i <= endPageResolved; i++)
        {
            var page = pdfCopyProvider.GetImportedPage(reader, i);
            pdfCopyProvider.AddPage(page);
        }
        doc.Close();
        reader.Close();
        outputDocument = msOut.ToArray();
    }
    return outputDocument;
}
I could reproduce the issue using your code and your test file with iTextSharp 5.5.6. Actually, though, the images are not merely distorted; they have been replaced by other ones! Inspecting the resulting PDF internally, one observes:
Originally, pages 3 through 5 each had their own Resource dictionary, containing entries different from one another's.
After the split, as pages 1 through 3 of part2.pdf, they still had different Resource dictionaries.
In the final merged result, though, pages 3 through 5 all refer to the same Resource dictionary object, a copy of the resources of the original page 3!
(As page 3 contains images with the same names as the images on pages 4 and 5, this results in page 3's images being shown on pages 4 and 5.)
Somehow PdfSmartCopy seems to outsmart itself here; using PdfCopy instead creates the expected result.
I assume PdfSmartCopy falsely considers those source dictionaries identical, probably due to a hash collision without an actual equality check.
It might be of interest to note that an equivalent test using Java and iText, SmartMerging.java, does not show the same issue; its result is as expected.
Thus, this looks like an issue in the iTextSharp port or in .NET in general.
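If you want to keep the Merge method from the question, the workaround is essentially a one-line change. The sketch below is that same method with PdfCopy swapped in for PdfSmartCopy, trading the deduplication PdfSmartCopy would do for correct per-page resources:
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static byte[] MergeWithPdfCopy(string[] documentPaths)
{
    using (MemoryStream memoryStream = new MemoryStream())
    using (Document document = new Document())
    {
        var pdfCopy = new PdfCopy(document, memoryStream); // was: PdfSmartCopy
        document.Open();
        foreach (var docPath in documentPaths)
        {
            var reader = new PdfReader(docPath);
            try
            {
                reader.ConsolidateNamedDestinations();
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // PdfCopy copies each page's resources as-is, so pages 4
                    // and 5 keep their own images instead of page 3's.
                    pdfCopy.AddPage(pdfCopy.GetImportedPage(reader, page));
                }
            }
            finally
            {
                reader.Close();
            }
        }
        document.Close();
        return memoryStream.ToArray();
    }
}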
I apologise in advance if there is already an answer to this problem; if so please just link it (I have looked, btw! I just didn't find anything relating to my specific example) :)
I have a text (.txt) file which contains data in the form 1.10.100.0.200 where 1, 10, 100, 0 and 200 are numbers storing the map terrain layout of a game. This file has multiple lines of 1.10.100.0.200 where each line represents an item of terrain in the map.
Here is what I would like to know:
How do I find out how many lines there are, so I know how many items of terrain to create when I read the map file?
What method should I use to get each of 1, 10, 100, 0 and 200?
E.g. when translating the file into map terrain at runtime, I might use terrainitem1.Location = New Point(x, y) or terrainitem1.Size = New Size(p, q), where x, y, p and q are integers or doubles relating to the terrain's location or size. Where would I then find x, y, etc. among 1, 10, 100, 0 and 200, if, say, x equals 1, y equals 10, and so on?
I am sorry if this isn't clear, please just ask me and I'll try to explain.
N.B. I am using VB.NET WinForms
There is no way to know how many lines a file has without opening the file and reading its contents.
You didn't indicate how far you've got on this. Do you know how to open a file?
Here's some basic code to do what you want. (Sorry, this is C# but the idea is the same in VB.)
string line;
using (TextReader reader = File.OpenText(@"C:\filename.txt"))
{
    // Read each line from the file (until null is returned)
    while ((line = reader.ReadLine()) != null)
    {
        // Get each number in the line (as a string)
        string[] values = line.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
        // Convert each number to an integer
        int id = int.Parse(values[0]);
        int height = int.Parse(values[1]);
        int width = int.Parse(values[2]);
        int x = int.Parse(values[3]);
        int y = int.Parse(values[4]);
    }
}
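And if you want the line count up front (to know how many terrain items to create before you start parsing), File.ReadAllLines reads the whole file into an array whose length is exactly that count; the path below is just a placeholder:
using System;
using System.IO;

// Read every line into an array; the array length is the number of
// terrain items, and each element can then be split as shown above.
string[] lines = File.ReadAllLines(@"C:\filename.txt");
Console.WriteLine("Terrain items: " + lines.Length);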
I have two PDFs. One is the main PDF, and the other contains an image that I need to insert into the first. After inserting that image, I also need to concatenate the remainder of the second PDF onto the result.
The solution was to superimpose the PDF page containing the image onto the main PDF, then concatenate the rest. "design_section" is the PDF with the image in it. This code does the job:
PdfReader confirmation_section = new PdfReader(SOURCE);
PdfReader design_section = new PdfReader(SOURCE2);
PdfStamper stamper = new PdfStamper(confirmation_section, new FileOutputStream(RESULT));
PdfImportedPage page = stamper.getImportedPage(design_section, 1);
int c = confirmation_section.getNumberOfPages();
PdfContentByte background;
for (int i = 1; i <= c; i++) {
    background = stamper.getUnderContent(i);
    if (i == c)
        background.addTemplate(page, 0, 0);
}
int d = design_section.getNumberOfPages();
if (d > 1) {
    for (int f = 2; f <= d; f++) {
        stamper.insertPage(c + f, confirmation_section.getPageSize(1));
        page = stamper.getImportedPage(design_section, f);
        stamper.getOverContent(c + f - 1).addTemplate(page, 0, 0);
        System.out.println("here we are in the loop c + f is: " + (c + f));
    }
}
stamper.close();
A pointed suggestion for iText: how about renaming "addTemplate()" to "addPage()"??? iText is the most cryptic lib I have used, and that includes regexes.
Thanks for the follow-up. I did read that many, many times ))) well, honestly, at least 6 times. I know it is just an excerpt, and I am sure there is more valuable information in the book, but that said, I did not find what I was looking for. Where in that text does it discuss, compare, and differentiate PdfCopy, PdfStamper, and PdfReader/Writer in the context of, for example, adding pages from one PDF to another?