I have a simple scenario where I extract pages from a PDF document (or split the document in two parts, if you will) and merge the parts back to a new document, with an option to add new pages in between.
However, in one particular case the resulting document differs from the original: a couple of pages (here, pages 4 and 5) look distorted compared to the source document.
How can I circumvent the distortion of the pages? The reproduction code below has been tested with iTextSharp versions 5.5.0.0 and 5.5.6.0 (latest at the moment).
You can find the input file I used here.
void Main()
{
var pathPrefix = @"C:\temp"; // TODO change
var inputDocPath = @"input.pdf";
var part1 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 1, 2);
var outputPath1 = Path.Combine(pathPrefix, "part1.pdf");
File.WriteAllBytes(outputPath1, part1);
var part2 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 3);
var outputPath2 = Path.Combine(pathPrefix, "part2.pdf");
File.WriteAllBytes(outputPath2, part2);
var merged = Merge(new[] {
outputPath1,
outputPath2
});
var mergedPath = Path.Combine(pathPrefix, "output.pdf");
File.WriteAllBytes(mergedPath, merged);
}
//Page sizes:
// input: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
// output: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
public static byte[] Merge(string[] documentPaths)
{
byte[] mergedDocument;
using (MemoryStream memoryStream = new MemoryStream())
using (Document document = new Document())
{
PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
document.Open();
foreach (var docPath in documentPaths)
{
PdfReader reader = new PdfReader(docPath);
try
{
reader.ConsolidateNamedDestinations();
var numberOfPages = reader.NumberOfPages;
for (int page = 0; page < numberOfPages;)
{
PdfImportedPage pdfImportedPage = pdfSmartCopy.GetImportedPage(reader, ++page);
pdfSmartCopy.AddPage(pdfImportedPage);
}
}
finally
{
reader.Close();
}
}
document.Close();
mergedDocument = memoryStream.ToArray();
}
return mergedDocument;
}
public static byte[] ExtractPages(string pdfDocument, int startPage, int? endPage = null)
{
var reader = new PdfReader(pdfDocument);
var numberOfPages = reader.NumberOfPages;
var endPageResolved = endPage.HasValue ? endPage.Value : numberOfPages;
if (startPage > numberOfPages || endPageResolved > numberOfPages)
string.Format("Error: page indices ({0}, {1}) out of bounds. Document has {2} pages.",
startPage, endPageResolved, numberOfPages).Dump();
byte[] outputDocument;
using (var doc = new Document()) // NOTE use reader.GetPageSizeWithRotation(startPage) ?
using (var msOut = new MemoryStream())
{
var pdfCopyProvider = new PdfCopy(doc, msOut);
doc.Open();
for (var i = startPage; i <= endPageResolved; i++)
{
var page = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(page);
}
doc.Close();
reader.Close();
outputDocument = msOut.ToArray();
}
return outputDocument;
}
I could reproduce the issue using your code and your test file with iTextSharp 5.5.6. Actually, though, the images are not merely distorted, they have been replaced by other ones! Inspecting the result PDF internally, one observes:
Originally, pages 3 through 5 each had their own Resources dictionary, with entries that differed from page to page.
After the split, as pages 1 through 3 of part2.pdf, they still had different Resources dictionaries.
In the final merged result, though, pages 3 through 5 all refer to the same Resources dictionary object, a copy of the resources of the original page 3!
(As page 3 contains images with the same names as the images on pages 4 and 5, this results in page 3 images being shown on pages 4 and 5.)
Somehow PdfSmartCopy seems to outsmart itself here; using PdfCopy instead creates the expected result.
I assume PdfSmartCopy falsely considers those source dictionaries identical, probably because of a hash collision without an actual equality check.
It might be of interest to note that an equivalent test using Java and iText, SmartMerging.java, does not show the same issue; its result is as expected.
Thus, this looks like an issue of the iTextSharp port or .NET in general.
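To make the suspected failure mode concrete, here is a minimal Java sketch of deduplication that trusts a hash alone, next to a variant that confirms candidates with equals(). It is not iText or iTextSharp code: the class HashOnlyDedup, the deliberately weak weakHash function, and the use of a plain Map as a stand-in for a page's resource dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashOnlyDedup {
    // Deliberately weak hash (an assumption for this demo): it looks only at
    // the resource names, not at the data stored under those names.
    static int weakHash(Map<String, String> resources) {
        return resources.keySet().hashCode();
    }

    // Buggy deduplication: dictionaries with the same hash are treated as identical.
    static List<Map<String, String>> copyTrustingHash(List<Map<String, String>> pages) {
        Map<Integer, Map<String, String>> seen = new HashMap<>();
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> resources : pages) {
            out.add(seen.computeIfAbsent(weakHash(resources), h -> resources));
        }
        return out;
    }

    // Safer variant: the hash only selects candidates, equals() confirms the match.
    static List<Map<String, String>> copyCheckingEquality(List<Map<String, String>> pages) {
        Map<Integer, List<Map<String, String>>> seen = new HashMap<>();
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> resources : pages) {
            List<Map<String, String>> bucket =
                    seen.computeIfAbsent(weakHash(resources), h -> new ArrayList<>());
            Map<String, String> match = null;
            for (Map<String, String> candidate : bucket) {
                if (candidate.equals(resources)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                bucket.add(resources);
                match = resources;
            }
            out.add(match);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three pages that use the same image name for different images.
        List<Map<String, String>> pages = List.of(
                Map.of("Im0", "image of page 3"),
                Map.of("Im0", "image of page 4"),
                Map.of("Im0", "image of page 5"));
        System.out.println(copyTrustingHash(pages));      // every page ends up with page 3's image
        System.out.println(copyCheckingEquality(pages));  // each page keeps its own image
    }
}
```

With the hash-only variant, all three pages share the first dictionary, which mirrors the observed symptom of page 3 images showing up on pages 4 and 5. Whether this is what actually happens inside PdfSmartCopy is not confirmed here.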
Using Apache PDFBox version 2.0.2 for text replacement (with the code below) produces output in which some characters are not displayed, mostly capital letters. For example, with the replacement "ABCDEFGHIJKLMNOPQRSTUVWXYZ", the output appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround for handling these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
PDPageTree pages = document.getDocumentCatalog().getPages();
for (PDPage page : pages) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operand, the string to display, so update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
string = StringUtils.replaceOnce(string, searchString, replacement);
cosString.setValue(string.getBytes());
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
out.close();
}
return document;
}
Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the original file uses a font subset that is missing the characters G, N, Q, V and Y.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream that draws the text you want, with a new font, at the correct place.
P.S. the current PDFBox version is 2.0.7, not 2.0.2.
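The split-word problem from the quoted migration note can be illustrated without PDFBox. The sketch below models the operand array of the quoted TJ instruction as a plain Java list of strings and kerning numbers; the class SplitTextDemo and its method names are hypothetical and exist only for this illustration.

```java
import java.util.List;

public class SplitTextDemo {
    // Mimics a per-operand search: looks for the term inside each string fragment separately.
    static boolean foundInSingleFragment(List<Object> tjOperands, String term) {
        for (Object operand : tjOperands) {
            if (operand instanceof String && ((String) operand).contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Joins the string fragments and ignores the numeric kerning adjustments.
    static String joinFragments(List<Object> tjOperands) {
        StringBuilder text = new StringBuilder();
        for (Object operand : tjOperands) {
            if (operand instanceof String) {
                text.append((String) operand);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Operands of: [ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
        List<Object> tjOperands = List.of("Do", -29, "c", -1, "umen", 30, "tation");
        System.out.println(foundInSingleFragment(tjOperands, "Documentation")); // false
        System.out.println(joinFragments(tjOperands));                          // Documentation
    }
}
```

A replacement that works fragment by fragment, as the code in the question does, therefore never sees the word as a whole. Joining the fragments only helps with finding the text; it does not address the font subset and ligature issues described above.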
I need help generating a PDF with a list of images, each with text describing the image underneath it.
I tried the code below, but the image and the text end up beside each other. Any help is appreciated. Thanks.
........
PdfPTable table = new PdfPTable(1);
table.setHorizontalAlignment(Element.ALIGN_CENTER);
table.setSplitRows(true);
table.setWidthPercentage(90f);
Paragraph paragraph = new Paragraph();
for (int counter = 0; counter < empSize; counter++) {
String imgPath = ... ".png");
Image img = Image.getInstance(imgPath);
img.scaleAbsolute(110f, 95f);
Paragraph textParagraph = new Paragraph("Test" + counter);
textParagraph.setLeading(Math.max(img.getScaledHeight(), img.getScaledHeight()));
textParagraph.setAlignment(Element.ALIGN_CENTER);
Phrase imageTextCollectionPhase = new Phrase();
Phrase ph = new Phrase();
ph.add(new Chunk(img, 0, 0, true));
ph.add(textParagraph);
imageTextCollectionPhase.add(ph);
paragraph.add(imageTextCollectionPhase);
}
PdfPCell cell = new PdfPCell(paragraph);
table.addCell(cell);
doc.add(table);
I assume that you want each image to appear with its descriptive text directly below it.
In your case, you are adding all the content (all the images and all the text) to a single cell. You should add them to separate cells as is done in the MultipleImagesInTable example:
public void createPdf(String dest) throws IOException, DocumentException {
Image img1 = Image.getInstance(IMG1);
Image img2 = Image.getInstance(IMG2);
Image img3 = Image.getInstance(IMG3);
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
PdfPTable table = new PdfPTable(1);
table.setWidthPercentage(20);
table.addCell(img1);
table.addCell("Brazil");
table.addCell(img2);
table.addCell("Dog");
table.addCell(img3);
table.addCell("Fox");
document.add(table);
document.close();
}
You can easily change this proof of concept so that a loop is used. Just make sure you put the addCell() methods inside the loop instead of outside the loop.
You can also explicitly create a PdfPCell and combine the text and the image in the same cell like this:
PdfPCell cell = new PdfPCell();
cell.addElement(img1);
cell.addElement(new Paragraph("Brazil"));
table.addCell(cell);
I have two PDFs. One is the main PDF and the other has an image that I need to insert into the first. Also in the second PDF, after inserting that image, I need to concatenate the remainder of the second PDF.
The solution was to superimpose the PDF page with the image onto the main PDF. Then concatenate the rest of it. "design_section" is the PDF with the image in it. This code will do:
PdfReader confirmation_section = new PdfReader(SOURCE);
PdfReader design_section = new PdfReader(SOURCE2);
PdfStamper stamper = new PdfStamper(confirmation_section, new FileOutputStream(RESULT));
PdfImportedPage page = stamper.getImportedPage(design_section, 1);
int c = confirmation_section.getNumberOfPages();
PdfContentByte background;
for (int i = 1; i <= c; i++) {
background = stamper.getUnderContent(i);
if(i == c)
background.addTemplate(page, 0, 0);
}
int d = design_section.getNumberOfPages();
if(d > 1) {
for(int f = 2; f <= d; f++) {
stamper.insertPage(c + f, confirmation_section.getPageSize(1));
page = stamper.getImportedPage(design_section, f);
stamper.getOverContent(c + f - 1).addTemplate(page, 0, 0);
System.out.println("here we are in the loop c + f is: " + (c + f));
}
}
stamper.close();
Pointed suggestion for iText: how about renaming "addTemplate()" to "addPage()"? iText is the most cryptic library I have used, and that includes regexp.
Thanks for the follow-up. I did read that many, many times; honestly, at least 6 times. I know it is just an excerpt and I am sure that there is more valuable information in the book, but with that said, I did not find what I was looking for. Where in that text does it discuss, compare and differentiate PdfCopy, PdfStamper and PdfReader/PdfWriter in the context of, for example, adding pages from one PDF to another?
Just for the sake of learning, I created an index from one file and wanted to search it. I am using Lucene version 4.4. I believe the indexing part is correct.
tempFileName is the name of the file that contains the tokens; the file contains the following words:
"odd plus odd is even ## even plus even is even ## odd plus even is odd ##"
However, when I provide a query it returns nothing. I can't see what the problem is. Any help is greatly appreciated.
Indexing part :
public void startIndexingDocument(String indexPath) throws IOException {
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_44);
SimpleFSDirectory directory = new SimpleFSDirectory(new File(indexPath));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44,
analyzer);
IndexWriter writer = new IndexWriter(directory, config);
indexDocs(writer);
writer.close();
}
private void indexDocs(IndexWriter w) throws IOException {
Document doc = new Document();
File file = new File(tempFileName);
BufferedReader br = new BufferedReader(new FileReader(tempFileName));
Field field = new StringField(fieldName, br.readLine().toString(),
Field.Store.YES);
doc.add(field);
w.addDocument(doc);
}
Searching part :
public void readFromIndex(String indexPath) throws IOException,
ParseException {
Analyzer anal = new WhitespaceAnalyzer(Version.LUCENE_44);
QueryParser parser = new QueryParser(Version.LUCENE_44, fieldName, anal);
Query query = parser.parse("odd");
IndexReader reader = IndexReader.open(NIOFSDirectory.open(new File(
indexPath)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// display
System.out.println("fieldName =" + fieldName);
System.out.println("Found : " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get(fieldName));
}
reader.close();
}
The problem is that you are using a StringField. StringField indexes the entire input as a single token. Good for atomic strings, like keywords, identifiers, stuff like that. Not good for full text searching.
Use a TextField.
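The difference can be simulated without Lucene. The sketch below is not Lucene code: TokenizationDemo and its two methods are hypothetical stand-ins that only mimic which terms end up in the index when the whole value is kept as one token versus when it is split on whitespace, as the WhitespaceAnalyzer from the question would do for an analyzed field.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class TokenizationDemo {
    // Stand-in for a non-analyzed field: the whole value becomes one term.
    static Set<String> singleTokenTerms(String value) {
        return Set.of(value);
    }

    // Stand-in for an analyzed field with whitespace tokenization: one term per word.
    static Set<String> whitespaceTerms(String value) {
        return new LinkedHashSet<>(Arrays.asList(value.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        String content = "odd plus odd is even ## even plus even is even ## odd plus even is odd ##";
        // A term query for "odd" only matches if "odd" is one of the indexed terms.
        System.out.println(singleTokenTerms(content).contains("odd")); // false
        System.out.println(whitespaceTerms(content).contains("odd"));  // true
    }
}
```

In the question's indexDocs method, the corresponding change would be to create the field as new TextField(fieldName, br.readLine(), Field.Store.YES) instead of a StringField.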
A StringField is indexed as a single token. I verified this with a simple test.
For example, suppose you have a crawler file whose content is a single string.
e.g. file name: data03.scd, contents: parktaeha
If you search with the query string "parktaeha",
you get a search result:
field name : acet, queryString parktaeha
======== start search!! ========== q=acet:parktaeha Found 1 hits. result array length :1 search result=> parktaeha
======== end search!! ==========
See the code below. It is test code.
while((target = in.readLine()) != null){
System.out.println("target:"+target);
doc.add(new TextField("acet",target ,Field.Store.YES)); // use TextField
// TEST : doc.add(new StringField("acet", target.toString(),Field.Store.YES));
}
I would like to write a simple program that finds the three most frequently recurring lines of text in a txt file and then saves each of those lines to another text file (which in turn will be read into AutoCAD's variable system).
Setting aside the AutoCAD part, which I can manage: how do I, in VB.NET, save the three most recurring lines of text, each to its own text file? See the example below:
Text file to be read reads as follows:
APG
BTR
VTS
VTS
VTS
VTS
BTR
BTR
APG
PNG
The VB.NET program would then save the text VTS to mostused.txt, BTR to 2ndmostused.txt, and APG to 3rdmostused.txt.
How can this be best achieved?
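Since I'm a C# developer, I'll use C# (this needs using directives for System, System.Collections.Generic, System.IO and System.Linq):
var dict = new Dictionary<string, int>();
using (var sr = new StreamReader(file))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // get the words, skipping empty entries caused by repeated spaces
        var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            if (!dict.ContainsKey(word)) dict.Add(word, 0);
            dict[word]++; // count them
        }
    }
}
// sort by count, most frequent first, and keep only the top three
var query = (from d in dict orderby d.Value descending select d).Take(3);
int counter = 1;
foreach (var pair in query)
{
    // writes file1.txt, file2.txt and file3.txt; adjust the names as needed
    using (var sw = new StreamWriter("file" + counter + ".txt"))
        sw.Write(pair.Key);
    counter++;
}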