Given a PDF (with multiple pages), a page (as a stream), and an index, how can I replace the page at the target index with my stream and save the PDF?
public Stream ReplacePDFPageAtIndex(Stream pdfStream, int index, Stream replacePageStream)
{
    List<Stream> pdfPages = // fetch pdf pages
    pdfPages.RemoveAt(index);
    pdfPages.Insert(index, replacePageStream);
    Stream newPdf = // create new pdf using the edited list
    return newPdf;
}
I was wondering whether there is any open-source solution, or whether I can roll my own with ease, given that I just need to split the PDF into pages and replace one of them.
I have also checked iTextSharp, but I do not see the functionality that I require.
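For what it is worth, this can be done with iTextSharp 5.x itself, using PdfReader and PdfCopy to copy every page except the one being replaced. A rough sketch (the method name, variable names, and the zero-based index are my assumptions; not tested against your exact streams):

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public Stream ReplacePdfPageAtIndex(Stream pdfStream, int index, Stream replacePageStream)
{
    var original = new PdfReader(pdfStream);
    var replacement = new PdfReader(replacePageStream);

    var output = new MemoryStream();
    var document = new Document();
    var copy = new PdfCopy(document, output);
    document.Open();

    // Page numbers in iTextSharp are one-based; the index parameter is zero-based
    for (int pageNumber = 1; pageNumber <= original.NumberOfPages; pageNumber++)
    {
        if (pageNumber == index + 1)
            copy.AddPage(copy.GetImportedPage(replacement, 1)); // swap in the replacement page
        else
            copy.AddPage(copy.GetImportedPage(original, pageNumber));
    }

    document.Close();   // also finalizes the PdfCopy output
    original.Close();
    replacement.Close();

    // MemoryStream.ToArray() still works after the stream has been closed
    return new MemoryStream(output.ToArray());
}

Note that PdfCopy rebuilds the document from the copied pages, so document-level extras such as bookmarks or form fields are not necessarily carried over.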
Looking for a way to add a user-uploaded PDF file to a Word document generated in the browser.
The user fills out a questionnaire and has the option to upload a document. At the end, the user clicks on a button and a Word document is generated with the gathered data (using docxtemplater).
The JSON data used for the doc generation is not stored anywhere after these steps.
Is there
a) a library that allows the uploaded file (which will be a PDF) to be added to/embedded in the generated Word document in the front end or
b) a way to display the file in the Word document by referencing the data URL in the template?
Example JSON (after generating the document):
"attachmentUpload": [
{
"name": "Example.pdf",
"type: "application/pdf",
"content": "data:application/pdf;base64,[.......]"
}
]
All I have found so far are libraries to merge two or more PDF files.
Any pointers would be highly appreciated!
Thanks!
I have an existing source PDF document; I copy selected pages from it and generate a destination PDF with the selected pages. Every page in the source document is scanned at a different resolution, so the result varies in size:
generated document with 4 pages => 175 KB
generated document with 4 pages => 923 KB (I suppose this is because of the higher scan resolution of each page in the source document)
What would be the best practice to compress these pages?
Is there any code sample for compressing / reducing the size of the final PDF, which consists of pages copied from a source document at different resolutions?
Kindest regards
If you are just adding scans to a PDF document, it makes sense for the size of the resulting document to go up if you're using high-resolution images.
Keep in mind that iText is a PDF library, not an image-manipulation library.
You could, of course, use plain old Java to compress the images:
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.*;
import javax.imageio.stream.*;

public static void writeJPG(BufferedImage bufferedImage, OutputStream outputStream, float quality) throws IOException
{
    // Grab the first available JPEG writer
    Iterator<ImageWriter> iterator = ImageIO.getImageWritersByFormatName("jpg");
    ImageWriter imageWriter = iterator.next();

    // Ask for explicit lossy compression at the given quality (0.0f - 1.0f)
    ImageWriteParam imageWriteParam = imageWriter.getDefaultWriteParam();
    imageWriteParam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
    imageWriteParam.setCompressionQuality(quality);

    // Write the re-encoded JPEG to the caller's stream
    ImageOutputStream imageOutputStream = new MemoryCacheImageOutputStream(outputStream);
    imageWriter.setOutput(imageOutputStream);
    IIOImage iioimage = new IIOImage(bufferedImage, null, null);
    imageWriter.write(null, iioimage, imageWriteParam);
    imageOutputStream.flush();
}
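As a rough usage sketch (file names are placeholders, and iText 7's layout API is assumed since WriterProperties is used below), the helper above could be called before handing the image to iText:

// Placeholder file names; re-encode the scan at 50% JPEG quality first
BufferedImage scan = ImageIO.read(new File("scan.png"));
ByteArrayOutputStream jpegBytes = new ByteArrayOutputStream();
writeJPG(scan, jpegBytes, 0.5f);

// Wrap the compressed bytes in an iText 7 Image and add it to a document
Document document = new Document(new PdfDocument(new PdfWriter("compressed.pdf")));
document.add(new Image(ImageDataFactory.create(jpegBytes.toByteArray())));
document.close();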
But really, putting scanned images into a PDF makes life so much more difficult. Imagine the people who have to handle that document after you: they open it, see text, try to select it, and nothing happens.
Additionally, you might change the WriterProperties when creating your PdfWriter instance:
PdfWriter writer = new PdfWriter(dest,
new WriterProperties().setFullCompressionMode(true));
Full compression mode will compress certain objects into an object stream, and it will also compress the cross-reference table of the PDF. Since most of the objects in your document will be images (which are already compressed), compressing objects won't have much effect, but if you have a large number of pages, compressing the cross-reference table may result in smaller PDF files.
I create a PDF document with the EVO PDF library from an HTML page using the code below:
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
byte[] outPdfBuffer = htmlToPdfConverter.ConvertUrl(url);
Response.AddHeader("Content-Type", "application/pdf");
Response.AddHeader("Content-Disposition", String.Format("attachment; filename=Merge_HTML_with_Existing_PDF.pdf; size={0}", outPdfBuffer.Length.ToString()));
Response.BinaryWrite(outPdfBuffer);
Response.End();
This produces a PDF document, but I have another PDF document that I would like to use as a cover page in the final PDF.
One possibility I was thinking about was to create the PDF document and then merge my cover page PDF with the PDF produced by the converter, but this looks like an inefficient solution. Saving the PDF and loading it back for the merge seems to introduce unnecessary overhead. I would like to merge the cover page while the PDF document produced by the converter is still in memory.
The following line, added to your code right after you create the HTML to PDF converter object, should do the trick:
// Set the PDF file to be inserted before conversion result
htmlToPdfConverter.PdfDocumentOptions.AddStartDocument("CoverPage.pdf");
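In context, the relevant part of the code from the question would look roughly like this (only the AddStartDocument call is new):

HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();

// Insert the existing cover page PDF before the conversion result,
// so no separate save-and-merge step is needed
htmlToPdfConverter.PdfDocumentOptions.AddStartDocument("CoverPage.pdf");

byte[] outPdfBuffer = htmlToPdfConverter.ConvertUrl(url);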
I am reading a PDF which has editable fields; the fields can be edited by opening it in Adobe Reader. I am using the PDFBox API to generate an output PDF with data filled in for the editable fields of the input PDF. The output PDF can be opened using Adobe Reader and I am able to see the field values, but I am unable to edit those fields directly from Adobe Reader.
There is also a JIRA ticket for this issue, and it is unresolved according to this link:
https://issues.apache.org/jira/browse/PDFBOX-1121
Can anybody please tell me if this got resolved? Also, if possible, please answer the following questions related to my question:
Is there any protection policy or access permission that I need to explicitly set in order to edit the output PDF from Adobe Reader?
Every time I open the PDF that was written using the PDFBox API, I get this message prompt:
" The document has been changed since it was created and use of extended features is no longer available...."
I am using the PDFBox 1.8.6 jar and Adobe Reader 11.0.8. I would really appreciate it if anybody could help me with this issue.
Code snippet added to aid responders in debugging:
String outputFileNameWithPath = "C:\\myfolder\\testop.pdf";
PDDocument pdf = null;
pdf = PDDocument.load( outputFileNameWithPath );
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
//The map pdfValues is a collection of the data that I need to set in the PDF
//I am unable to go into the details of my data source
// The key in the data map corresponds to the PDField's expanded name and data
// corresponds to the data that I am trying to set.
Iterator<Entry<String, String>> iter=pdfValues.entrySet().iterator();
String name=null;
String value=null;
PDField field=null;
//Iterate over all data and see if the PDF has a matching field.
while (iter.hasNext()) {
    Map.Entry<String, String> currentEntry = iter.next();
    name = currentEntry.getKey();
    value = currentEntry.getValue();
    if (name != null) {
        name = CommonUtils.fromSchemaNameToPdfName(name);
        field = acroForm.getField(name);
    }
    if (field != null && value != null)
    {
        field.setValue(value); // setting the value once the field is found
    }
}
// Set access permissions / encryption here before saving
pdf.save(outputFileNameWithPath);
Thanks.
The document has been changed since it was created and use of extended features is no longer available....
This indicates that the original form has been Reader-enabled, i.e. an integrated usage-rights digital signature has been applied to the document using a private key held by Adobe, which tells Adobe Reader to make some extra functionality available to the user viewing that form.
If you don't want to break that signature during form fill-ins with PDFBox, you need to make sure that you
don't do any changes but form fill-ins and
save the changes as incremental update.
If you provided your form fill-in code and your source PDF, this could be analyzed in more detail.
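For the second point, here is a minimal sketch with PDFBox 1.8.x (file paths and the field name are placeholders; depending on the exact 1.8.x release you may additionally have to mark the changed objects as updated):

// Load the Reader-enabled form and fill it in as usual
FileInputStream original = new FileInputStream("C:\\myfolder\\source.pdf");
PDDocument pdf = PDDocument.load("C:\\myfolder\\source.pdf");

PDAcroForm acroForm = pdf.getDocumentCatalog().getAcroForm();
PDField field = acroForm.getField("SomeFieldName"); // placeholder field name
if (field != null) {
    field.setValue("Some value");
}

// saveIncremental appends the changes after the original file content instead
// of rewriting the whole document, so the usage-rights signature stays intact.
FileOutputStream out = new FileOutputStream("C:\\myfolder\\filled.pdf");
pdf.saveIncremental(original, out);
pdf.close();
out.close();
original.close();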
How can I count specific words within a PDF file that is locked?
I am talking about annual reports here. You can search within the file, but you can't copy out of it (for whatever reason; it doesn't make sense to me).
After googling forever, I still haven't found a solution.
If your file contains text (and not just scanned images), and the embedded fonts contain information about the mapping from glyphs to characters, then you should be able to extract text from the file using any PDF library that provides text extraction capabilities.
Copying of text is usually forbidden by setting usage rights. Many PDF libraries ignore these settings and allow text extraction from locked PDFs.
Depending on the library, you might extract the whole text and split it into words yourself, or extract the text as a collection of words (if the library can split text into words for you).
Here is sample code for the Docotic.Pdf library that shows how to build a dictionary of the words found in a PDF document and how many times each one is used.
public static Dictionary<string, int> countWords(string file)
{
    Dictionary<string, int> wordCounts = new Dictionary<string, int>();
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            PdfCollection<PdfTextData> words = page.GetWords();
            foreach (PdfTextData word in words)
            {
                // TryGetValue leaves count at 0 when the word has not been seen yet
                int count = 0;
                wordCounts.TryGetValue(word.Text, out count);
                wordCounts[word.Text] = count + 1;
            }
        }
    }
    return wordCounts;
}
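Hypothetical usage, with a made-up file name and search word:

Dictionary<string, int> counts = countWords("annual-report.pdf");

int occurrences;
counts.TryGetValue("revenue", out occurrences);
Console.WriteLine("'revenue' occurs {0} time(s)", occurrences);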
Disclaimer: I work for the vendor of Docotic.Pdf.