Convert pdf to pdf/a using iText library - pdf

I want to export document to PdfAConformanceLevel.PDF_A_1B conformance, but when I do document.close, I get error below, resulting pdf is not usable.
I use following itext versions:
<artifactId>itextpdf</artifactId>
<version>5.5.9</version>
<artifactId>itext-pdfa</artifactId>
<version>5.5.9</version>
stack trace:
com.itextpdf.text.pdf.PdfAConformanceException: Real number is out of range.
at com.itextpdf.text.pdf.internal.PdfA1Checker.checkPdfObject(PdfA1Checker.java:259)
at com.itextpdf.text.pdf.internal.PdfAChecker.checkPdfAConformance(PdfAChecker.java:208)
at com.itextpdf.text.pdf.internal.PdfAConformanceImp.checkPdfIsoConformance(PdfAConformanceImp.java:71)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3480)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3476)
at com.itextpdf.text.pdf.PdfObject.toPdf(PdfObject.java:174)
at com.itextpdf.text.pdf.PdfArray.toPdf(PdfArray.java:175)
at com.itextpdf.text.pdf.PdfDictionary.toPdf(PdfDictionary.java:149)
at com.itextpdf.text.pdf.PdfStream.superToPdf(PdfStream.java:278)
at com.itextpdf.text.pdf.PRStream.toPdf(PRStream.java:239)
at com.itextpdf.text.pdf.PdfIndirectObject.writeTo(PdfIndirectObject.java:158)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.write(PdfWriter.java:420)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:398)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:377)
at com.itextpdf.text.pdf.PdfWriter.addToBody(PdfWriter.java:872)
at com.itextpdf.text.pdf.PdfReaderInstance.writeAllVisited(PdfReaderInstance.java:161)
at com.itextpdf.text.pdf.PdfReaderInstance.writeAllPages(PdfReaderInstance.java:177)
at com.itextpdf.text.pdf.PdfWriter.addSharedObjectsToBody(PdfWriter.java:1380)
at com.itextpdf.text.pdf.PdfWriter.close(PdfWriter.java:1264)
at com.itextpdf.text.pdf.PdfAWriter.close(PdfAWriter.java:337)
at com.itextpdf.text.pdf.PdfDocument.close(PdfDocument.java:889)
at com.itextpdf.text.Document.close(Document.java:416)
at si.telekom.erender.ERenderImpl.mergeContentOfItems(ERenderImpl.java:2911)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.sun.xml.ws.api.server.MethodUtil.invoke(MethodUtil.java:83)
at com.sun.xml.ws.api.server.InstanceResolver$1.invoke(InstanceResolver.java:250)
at com.sun.xml.ws.server.InvokerTube$2.invoke(InvokerTube.java:149)
at com.sun.xml.ws.server.sei.SEIInvokerTube.processRequest(SEIInvokerTube.java:88)
at com.sun.xml.ws.api.pipe.Fiber.__doRun(Fiber.java:1136)
at com.sun.xml.ws.api.pipe.Fiber._doRun(Fiber.java:1050)
at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:1019)
at com.sun.xml.ws.api.pipe.Fiber.runSync(Fiber.java:877)
at com.sun.xml.ws.server.WSEndpointImpl$2.process(WSEndpointImpl.java:419)
at com.sun.xml.ws.transport.http.HttpAdapter$HttpToolkit.handle(HttpAdapter.java:868)
at com.sun.xml.ws.transport.http.HttpAdapter.handle(HttpAdapter.java:422)
at com.sun.xml.ws.transport.http.servlet.ServletAdapter.invokeAsync(ServletAdapter.java:225)
at com.sun.xml.ws.transport.http.servlet.WSServletDelegate.doGet(WSServletDelegate.java:161)
at com.sun.xml.ws.transport.http.servlet.WSServletDelegate.doPost(WSServletDelegate.java:197)
at com.sun.xml.ws.transport.http.servlet.WSServlet.doPost(WSServlet.java:81)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:647)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:502)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I am producing PDF with following code:
public byte[] mergeContentOfItems(List<MergeItem> items) throws ErenderException {
MessageContext mc = wsCtx.getMessageContext();
HttpServletRequest req = (HttpServletRequest) mc.get(MessageContext.SERVLET_REQUEST);
getLogger().info("Webservice method 'mergeContentOfItems' called from IP:" + req.getRemoteAddr());
if (items.size() < 1) {
String errDescription = "No barcodes specified!";
throw new ErenderException(errDescription, new ErenderExceptionBean("201", errDescription),
new Throwable(errDescription));
}
com.itextpdf.text.Document document = new com.itextpdf.text.Document();
ByteArrayOutputStream baOs = new ByteArrayOutputStream();
PdfWriter writer = null;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
try {
// Create a writer for the outputstream
writer = PdfAWriter.getInstance(document, baOs, PdfAConformanceLevel.PDF_A_1B);
writer.setPdfVersion(PdfWriter.PDF_VERSION_1_4);
writer.createXmpMetadata();
//writer = PdfWriter.getInstance(document, baOs);
document.open();
ICC_Profile icc = ICC_Profile
.getInstance(Thread.currentThread().getContextClassLoader().getResourceAsStream("srgb.profile"));
writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
for (int i = 0; i < items.size(); i++) {
String pdfFileName = null;
File urlTempFile = null;
if (items.get(i).getBarcode() != null) {
Template tmpl = TemplatesSynchronizer.getTemplateByBarcode(items.get(i).getBarcode());
String fileName = tmpl.getName();
pdfFileName = fileName.substring(0, fileName.indexOf(".")) + ".pdf";
getLogger().info("\tworking on:" + items.get(i) + " fileName:" + pdfFileName);
if (!new File(pdfFileName).exists()) {
String msg = String.format("Datoteka %s ne obstaja", pdfFileName);
throw new ErenderException("Error", new ErenderExceptionBean("109", msg, new Exception(msg)));
}
} else if (items.get(i).getUrl() != null) {
urlTempFile = File.createTempFile("myTemp", "pdf");
FileUtils.copyURLToFile(new URL(items.get(i).getUrl()), urlTempFile);
}
if (pdfFileName != null || urlTempFile != null) {
PdfReader pdfReader = null;
if (pdfFileName != null)
pdfReader = new PdfReader(pdfFileName);
else if (urlTempFile != null)
pdfReader = new PdfReader(urlTempFile.getAbsolutePath());
if (pdfReader != null) {
// Create Readers for the pdfs.
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
int pageOfCurrentReaderPDF = 0;
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
document.newPage();
pageOfCurrentReaderPDF++;
PdfImportedPage page = writer.getImportedPage(pdfReader, pageOfCurrentReaderPDF);
document.setPageSize(pdfReader.getPageSizeWithRotation(pageOfCurrentReaderPDF));
document.newPage();
cb.addTemplate(page, 0, 0);
}
}
if (urlTempFile != null)
urlTempFile.delete();
}
}
} catch (Throwable ex) {
StringWriter errorStringWriter = new StringWriter();
PrintWriter pw = new PrintWriter(errorStringWriter);
ex.printStackTrace(pw);
Logger.getLogger(this.getClass()).error(errorStringWriter.getBuffer().toString());
throw new ErenderException("Error", new ErenderExceptionBean("109", "Napaka v merge metodi.",ex), ex);
} finally {
if (document != null && document.isOpen())
try {
document.close();
} catch (Exception ex) {
StringWriter errorStringWriter = new StringWriter();
PrintWriter pw = new PrintWriter(errorStringWriter);
ex.printStackTrace(pw);
Logger.getLogger(this.getClass()).error(errorStringWriter.getBuffer().toString());
getLogger().error("Unable to close document.\n" + errorStringWriter);
}
if (writer != null && writer.isCloseStream()) {
try {
writer.flush();
writer.close();
} catch (Exception ex) {
getLogger().error("Unable to flush or close writer");
}
}
try {
baOs.flush();
baOs.close();
} catch (Exception ex) {
getLogger().error("Unable to close baOs in mergeContent method.");
}
}
getLogger().info("Webservice method 'mergeContent' called from IP:" + req.getRemoteAddr() + " ended. " + totalPages
+ " merged.");
return baOs.toByteArray();
}
Since I don't get error on other files this seems to be input files specific - here is one file to reproduce error:
I am trying to convert this input pdf file:
http://filebin.ca/2hR2xO1SNlzh/09062009073008005.pdf

First this: iText doesn't convert ordinary PDF documents to PDF/A documents. We have customers who use iText to do this, but their code is much more elaborate than yours.
The reason why iText doesn't convert ordinary PDF documents to PDF/A should be evident: an ordinary PDF might not have all the necessary features that are needed in a PDF/A. You might have a PDF of which the fonts aren't embedded. In that case, someone needs to provide the appropriate font program. iText doesn't ship with any font program, hence the software using iText has to provide this.
In your code, you just copy content streams without checking any possible issues that make the end result non-compliant with PDF/A. You should be very careful with the resulting PDFs. They will show the blue bar that the file claims to be PDF/A, but that doesn't mean that the file will validate as a PDF when you pass it through a validator.
Now for your problem. You want to convert an ordinary PDF to PDF/A-1. PDF/A-1 is based on PDF 1.4 dating from 2001. This means that you can't use any of the new features that were introduced after 2001. In PDF 1.4, there was a limitation with respect to object number. Object numbers in PDF couldn't exceed 32,767. This limitation was removed from PDF in PDF 1.5.
My guess is that the problem you describe is caused by your attempt to create a PDF 1.4 with more objects than is allowed in PDF 1.4. There could be two reasons:
Your original PDF is PDF 1.5 or later,
Your manipulations of the PDF require more than the maximum available number of objects.
This could be fixed by generating PDF/A-2 instead of PDF/A-1, but I'm pretty sure that you'll soon hit other limitations (e.g. missing fonts and other issues that are caused by creating a file that claims to be a PDF but that isn't). PdfAWriter will throw exceptions when you try doing things that are blatantly wrong, but there's no guarantee that some more subtle PDF/A requirements are being missed.

Related

Converting XDP to PDF using Live Cycle replaces question marks(?) with space at multiple places. How can that be fixed?

I have been trying to convert XDP to PDF using Adobe Live Cycle. Most of my forms turn up good. But while converting some of them, I fine ??? replaced in place of blank spaces at certain places. Any suggestions to how can I rectify that?
Below is the code snippet that I am using:
public byte[] generatePDF(TextDocument xdpDocument) {
try {
Assert.notNull(xdpDocument, "XDP Document must be passed.");
Assert.hasLength(xdpDocument.getContent(), "XDPDocument content cannot be null");
// Create a ServiceClientFactory object
ServiceClientFactory myFactory = ServiceClientFactory.createInstance(createConnectionProperties());
// Create an OutputClient object
FormsServiceClient formsClient = new FormsServiceClient(myFactory);
formsClient.resetCache();
String text = xdpDocument.getContent();
String charSet = xdpDocument.getCharsetName();
if (charSet == null || charSet.trim().length() == 0) {
charSet = StandardCharsets.UTF_8.name();
}
byte[] bytes = text.getBytes();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
Document inTemplate = new Document(byteArrayInputStream);
// Set PDF run-time options
// Set rendering run-time options
RenderOptionsSpec renderOptionsSpec = new RenderOptionsSpec();
renderOptionsSpec.setLinearizedPDF(true);
renderOptionsSpec.setAcrobatVersion(AcrobatVersion.Acrobat_9);
PDFFormRenderSpec pdfFormRenderSpec = new PDFFormRenderSpec();
pdfFormRenderSpec.setGenerateServerAppearance(true);
pdfFormRenderSpec.setCharset("UTF8");
FormsResult formOut = formsClient.renderPDFForm2(inTemplate, null, pdfFormRenderSpec, null, null);
Document xfaPdfOutput = formOut.getOutputContent();
//If the input file is already static PDF, then below method will throw an exception - handle it
OutputClient outClient = new OutputClient(myFactory);
outClient.resetCache();
Document staticPdfOutput = outClient.transformPDF(xfaPdfOutput, TransformationFormat.PDF, null, null, null);
byte[] data = StreamIO.toBytes(staticPdfOutput.getInputStream());
return data;
} catch(IllegalArgumentException ex) {
logger.error("Input validation failed for generatePDF request " + ex.getMessage());
throw new EformsException(ErrorExceptionCode.INPUT_REQUIRED + " - " + ex.getMessage(), ErrorExceptionCode.INPUT_REQUIRED);
} catch (Exception e) {
logger.error("Exception occurred in Adobe Services while generating PDF from xdpDocument..", e);
throw new EformsException(ErrorExceptionCode.PDF_XDP_CONVERSION_EXCEPTION, e);
}
}
I suggest trying 2 things:
Check the font. Switch to something very common like Arial / Times New Roman and see if the characters are still lost
Check the character encoding. It might not be a simple question mark character you are using and if so the character encoding will be important. The easiest way is to make sure your question mark is ascii char 63 (decimal).
I hope that helps.

Lucene - Highlighter throwing exception when using * on search

I'm using Lucene 4.6.1 and Highlighter 4.6.0. Since indexing is working properly, I'm just gonna show my search code:
... code to get all the fields' name/values, numDocs, etc.
...
// Create Query and search
try {
TopScoreDocCollector collector = TopScoreDocCollector.create(numDocs, true);
Query q = MultiFieldQueryParser.parse(Version.LUCENE_40, searchTerms, fields, analyzer);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Highlighter highlighter = new Highlighter(new QueryScorer(q));
highlighter.setTextFragmenter(new SimpleFragmenter(40));
int maxNumFragmentsRequired = 2;
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
for(int j=0; j<fields.length; j++) {
if(d.get(fields[j]) != null) {
String fieldText = d.get(fields[j]).trim();
TokenStream tokenStream = analyzer.tokenStream(fields[j], new StringReader(fieldText));
// Create String without the highlighted term
String unhighlighted = (i + 1) + ". "+fields[j]+ " "+ d.get(fields[j]).trim() + "<br>";
// Create the highlighted term
String highlighted = highlighter.getBestFragments(tokenStream, fieldText, maxNumFragmentsRequired, "...");
// If the highlighted term really exists
if(!highlighted.equals(""))
unhighlighted = (i + 1) + ". "+fields[j]+ " "+ highlighted + "<br>";
response += unhighlighted;
}
}
}
} catch (Exception e) {
System.out.println("Error searching " + searchTerm + " : " + e.getMessage());
}
System.out.println(response);
}
For example: into my index I got many Documents named "Process 001", "Process 002", "Process 003" and so on. If I try to search by: Process, I can retrieve all the Process (this is working perfect!). The problem happens when I try to search by: proc*, or: pr*, or something like that... The errors are here:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/queries/CommonTermsQuery
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:149)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:99)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:474)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:197)
at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:156)
at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:460)
at freedom.lucene.service.LuceneTestApplication.search(LuceneTestApplication.java:406)
at freedom.lucene.service.LuceneTestApplication.main(LuceneTestApplication.java:75)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.queries.CommonTermsQuery
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 10 more
The exception occurs on this line:
String highlighted = highlighter.getBestFragments(tokenStream, fieldText, maxNumFragmentsRequired, "...");
If I remove the Highlighter code, the search works properly with *
Add lucene-queries-4.6.1.jar to your classpath.
CommonTermsQuery is not included in the lucene-core jar.
Add lucene-memory-XX.jar to your External Libraries

Split a "tagged" PDF document into multiple documents, keeping the tagging

In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two pdfCopy objects (one for the blank pages document, one for the pages with content document) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Anyone has any ideas on how to solve this ? Any hints ?
Also: the project currently refers version 5.1.0.0 of the itext libraries. In 5.4.4.0 the third parameter to GetImportedPage does not seem to be there anymore.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
iTextSharp.text.pdf.PdfImportedPage page = null;
int n = sourcePdfReader.NumberOfPages;
targetPdf.Open();
blankPdf.Open();
blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
blankPdf.NewPage();
for (int i = 1; i <= n; i++)
{
byte[] pageBytes = sourcePdfReader.GetPageContent(i);
string pageText = "";
iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
while (token.NextToken())
{
if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
{
pageText += token.StringValue;
}
}
if (pageText.Length >= 15)
{
page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
targetPdfWriter.AddPage(page);
}
else
{
page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
blankPdfWriter.AddPage(page);
blankPageCount++;
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK imagemagick won't work for this. May itext do this? I never used and it's
pretty complex so would need some hints.
EDIT:
Marked Answer led me to solution. For completeness here my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i < n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additonal page to Document...");
if (removeBlankPages && !isBlankPage(reader, i)){
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF in single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple loop index and page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplacee')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.

Why am i getting this error java.lang.ClassNotFoundException?

Why am i getting this error message?
Testcase: createIndexBatch_usingTheTestArchive(src.LireHandlerTest): Caused an ERROR
at/lux/imageanalysis/ColorLayoutImpl
java.lang.NoClassDefFoundError: at/lux/imageanalysis/ColorLayoutImpl
at net.semanticmetadata.lire.impl.SimpleDocumentBuilder.createDocument(Unknown Source)
at net.semanticmetadata.lire.AbstractDocumentBuilder.createDocument(Unknown Source)
at backend.storage.LireHandler.createIndexBatch(LireHandler.java:49)
at backend.storage.LireHandler.createIndexBatch(LireHandler.java:57)
at src.LireHandlerTest.createIndexBatch_usingTheTestArchive(LireHandlerTest.java:56)
Caused by: java.lang.ClassNotFoundException: at.lux.imageanalysis.ColorLayoutImpl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
I am trying to create a Lucene index using documents that are created using Lire. When i am getting to the point where it tries to create a document using the document builder it gives me this error message.
input paramaters first run:
filepath: "test_archive" (this is a image archive)
whereToStoreIndex: "test_index" (Where i am going to store the index)
createNewIndex: true
Since the method is recursive (see the if statement where it checks if is a directory) it will call itself multiple times but all the recursive calls uses createNewIndex = false.
Here is the code:
public static boolean createIndexBatch(String filepath, String whereToStoreIndex, boolean createNewIndex){
DocumentBuilder docBldr = DocumentBuilderFactory.getDefaultDocumentBuilder();
try{
IndexWriter indexWriter = new IndexWriter(
FSDirectory.open(new File(whereToStoreIndex)),
new SimpleAnalyzer(),
createNewIndex,
IndexWriter.MaxFieldLength.UNLIMITED
);
File[] files = FileHandler.getFilesFromDirectory(filepath);
for(File f:files){
if(f.isFile()){
//Hopper over Thumbs.db filene...
if(!f.getName().equals("Thumbs.db")){
//Creating the document that is going to be stored in the index
String name = f.getName();
String path = f.getPath();
FileInputStream stream = new FileInputStream(f);
Document doc = docBldr.createDocument(stream, f.getName());
//Add document to the index
indexWriter.addDocument(doc);
}
}else if(f.isDirectory()){
indexWriter.optimize();
indexWriter.close();
LireHandler.createIndexBatch(f.getPath(), whereToStoreIndex, false);
}
}
indexWriter.close();
}catch(IOException e){
System.out.print("IOException in createIndexBatch:\n"+e.getMessage()+"\n");
return false;
}
return true;
}
It sounds like you are missing a java library.
I think you need to download Caliph & Emir, that may be used indirectly by Lire.
http://www.semanticmetadata.net/download/