iText - Cleaning Up Text in Rectangle without cleaning full row - pdf

I'm trying to clean up text inside a rectangle in a PDF document using iText.
The following is the piece of code I'm using:
PdfReader pdfReader = null;
PdfStamper stamper = null;
try
{
int pageNo = 1;
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
pdfReader = new PdfReader("Test1.pdf");
stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));
Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
}
catch (Exception e)
{
e.printStackTrace();
}
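finally
{
try {
// stamper is still null if opening the reader or stamper failed
if (stamper != null) stamper.close();
}
catch (Exception e) {
e.printStackTrace();
}
if (pdfReader != null) pdfReader.close();
}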
After executing this code, the entire line of text is cleaned up instead of only the text inside the given rectangle.
To explain things better, I have attached PDF documents:
input PDF
output PDF
In the input PDF, I have highlighted the text to show the rectangle I'm specifying for cleaning up.
In the output PDF you can clearly see the grey rectangle, but if you look closely you will notice that the whole line of text was cleaned up.
Any help will be appreciated.

The files input.pdf and output.pdf the OP originally presented did not allow the issue to be reproduced; instead, they did not seem to match at all. Thus, there was an original answer essentially demonstrating that the issue could not be reproduced.
The second set of files, Test1.pdf and Test2.pdf, did allow the issue to be reproduced, though, giving rise to the updated answer...
Updated answer referring to the OP's second set of sample files
There indeed is an issue in the current (up to 5.5.8) iText clean-up code: in the case of tagged files, some methods of PdfContentByte used here introduced extra instructions into the content stream, which actually damaged it and relocated some text in the eyes of PDF viewers that ignored the damage.
In more detail:
PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately, in the case of tagged files these methods check whether the canvas under construction currently is in a text object and, if not, start a text object. This check depends on a local flag set by beginText, but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence, damaging the stream and relocating the following text.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
canvas.setCharacterSpacing(0);
canvas.setWordSpacing(0);
...
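The mechanism can be illustrated with a toy model. The sketch below is not iText code: ToyCanvas and all of its methods are hypothetical names that merely mimic the behaviour described above, namely a flag that only beginText() sets and a setter that opens a text object in tagged mode when that flag is unset.

```java
// Toy model of the side effect described above. This is NOT iText code:
// ToyCanvas and its methods are hypothetical names used for illustration only.
final class ToyCanvas {
    private final StringBuilder content = new StringBuilder();
    private final boolean tagged;
    private boolean inText = false; // set only by beginText()

    ToyCanvas(boolean tagged) {
        this.tagged = tagged;
    }

    void beginText() {
        inText = true;
        content.append("BT\n");
    }

    // Writes an instruction verbatim and does not touch the inText flag,
    // like a clean-up step that copies operators from the source stream.
    void appendRaw(String instruction) {
        content.append(instruction).append('\n');
    }

    // Mimics the tagged-mode convenience: open a text object if none seems open.
    void setCharacterSpacing(int value) {
        if (tagged && !inText) {
            beginText();
            content.append("1 0 0 1 0 0 Tm\n");
        }
        content.append(value).append(" Tc\n");
    }

    String content() {
        return content.toString();
    }

    // Builds a tiny stream whose BT was copied verbatim, then resets the spacing
    // either through the setter or through a hand-written instruction.
    static String demo(boolean useSetter) {
        ToyCanvas canvas = new ToyCanvas(true);
        canvas.appendRaw("BT");
        if (useSetter) {
            canvas.setCharacterSpacing(0);
        } else {
            canvas.appendRaw("0 Tc");
        }
        canvas.appendRaw("[...] TJ");
        canvas.appendRaw("ET");
        return canvas.content();
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // contains a second BT and a Tm
        System.out.println(demo(false)); // contains only the copied BT
    }
}
```

In the toy model the setter path emits a second BT plus an identity Tm, while the hand-written path leaves the copied text object intact.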
PdfCleanUpContentOperator.writeTextChunks should instead use hand-crafted Tc and Tw instructions so as not to trigger this side effect.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tc);
}
if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tw);
}
canvas.getInternalBuffer().append((byte) '[');
With this change in place the OP's new sample file "Test1.pdf" is properly redacted by the sample code:
@Test
public void testRedactJavishsTest1() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("Test1.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
Original answer referring to the OP's original sample files
I just tried to reproduce your issue using this test method:
@Test
public void testRedactJavishsText() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("input.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 200.7);
linkBounds.add(1, (float) 547.3);
linkBounds.add(2, (float) 263.3);
linkBounds.add(3, (float) 558.4);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
For your source PDF, the result I got was not the one you posted (the screenshots of the source, of my result, and of your result are not reproduced here).
I even re-tested using the iText version 5.5.5 you mention in a comment and also 5.5.4, but in all cases I got the correct result.
Thus, I cannot reproduce your issue.
I had a closer look at your output.pdf. It is a bit peculiar, in particular it does not contain certain blocks typical for PDFs created or manipulated by current iText versions. Furthermore the content streams look extremely different.
Thus, I assume that after iText redacted your file, some other tool post-processed it and, in doing so, damaged it.
In particular the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
[...] TJ
and like this in the version I received directly from iText:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw
[...] TJ
but the corresponding lines in your output.pdf look like this
BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc
[...] TJ
Here the instructions in your output.pdf are
- invalid, as inside a text object BT ... ET there may be no other text object, but you have two BT operations following each other without an ET in between;
- effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.
And indeed, if you look at the bottom of your output.pdf page, you'll see the relocated text there.
So if my assumption is correct that some other program post-processes the iText result, you should repair that post-processor.
If there is no such post-processor, you seem not to have the officially published iText version but something altogether different.
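A quick way to spot this kind of damage is to scan the operator sequence for a BT that appears before the previous text object was closed with ET. The sketch below assumes the content stream has already been decoded to text and that splitting on whitespace is good enough; it ignores string contents, comments and inline images, so it is a rough diagnostic rather than a parser. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Rough diagnostic, not a PDF parser: works on whitespace-separated tokens only.
final class TextObjectNestingCheck {

    // Returns the 0-based token index of every BT found while a text object is already open.
    static List<Integer> findNestedBeginText(String decodedContent) {
        List<Integer> offending = new ArrayList<>();
        boolean inText = false;
        String[] tokens = decodedContent.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("BT")) {
                if (inText) {
                    offending.add(i); // BT inside BT ... ET
                }
                inText = true;
            } else if (tokens[i].equals("ET")) {
                inText = false;
            }
        }
        return offending;
    }

    public static void main(String[] args) {
        String damaged = "BT 1 0 0 1 113.3 548.5 Tm 0 Tc BT 1 0 0 1 0 0 Tm 0 Tc [...] TJ ET";
        System.out.println(findNestedBeginText(damaged));
    }
}
```

Applied to the damaged sequence quoted above, the check flags the second BT; applied to the sequence from input.pdf, it reports nothing.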

Related

Copying pages from one PDF to another gives an error on save

I'm trying to take pages from one PDF, scale them down, and put them side-by-side in another PDF. To do this I make an intermediate PDF that has all of the pages from the source scaled down to the size I need to place them side-by-side. Then I go through the scaled PDF and copy the pages two at a time to the final PDF. My thinking is that I'm done with the scaled PDF so I can close it, but when I do that I get an error trying to save the final PDF that says
COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
I'm not sure why the intermediate doc should matter when I try to save the final doc. It could be that I'm doing something wrong in the copying of pages? Here's the code I use for that:
private PDDocument sideBySide(PaperSize paperSize, PaperSize pageSize) throws IOException {
PDRectangle targetPaperSize = getRect(paperSize);
PDRectangle targetPageSize = getRect(pageSize);
PDDocument scaledDoc = scaleDoc(pageSize, doc);
PDDocument outputDoc = new PDDocument();
final double theta = Math.PI / 2;
for (int offset = 0; offset < scaledDoc.getNumberOfPages() - 1; offset+=2) {
PDPage twoUp = new PDPage(targetPaperSize);
twoUp.setRotation(90);
twoUp.setResources(new PDResources());
outputDoc.addPage(twoUp);
PDPage leftPage = scaledDoc.getPage(offset);
PDPage rightPage = scaledDoc.getPage(offset + 1);
PDFormXObject leftObject = importAsXObject(outputDoc, leftPage);
twoUp.getResources().add(leftObject);
PDFormXObject rightObject = importAsXObject(outputDoc, rightPage);
twoUp.getResources().add(rightObject);
PDPageContentStream content = new PDPageContentStream(outputDoc, twoUp);
AffineTransform leftTrans = AffineTransform.getRotateInstance(theta);
leftTrans.concatenate(AffineTransform.getTranslateInstance(0, -targetPageSize.getHeight()));
AffineTransform rightTrans = AffineTransform.getRotateInstance(theta);
rightTrans.concatenate(AffineTransform.getTranslateInstance(targetPageSize.getWidth(), -targetPageSize.getHeight()));
leftObject.setMatrix(leftTrans);
rightObject.setMatrix(rightTrans);
content.drawForm(leftObject);
content.drawForm(rightObject);
content.close();
}
scaledDoc.close();
return outputDoc;
}

When using iText to generate a PDF, if I need to switch fonts many times the file size becomes too large

I have a section of my PDF in which I need to use one font for its unicode symbol and the rest of the paragraph should be a different font. (It is something like "1. a 2. b 3. c" where "1." is the unicode symbol/font and "a" is another font) I have followed the method Bruno describes here: iText 7: How to build a paragraph mixing different fonts? and it works fine to generate the PDF. The issue is that the file size of the PDF goes from around 20MB to around 100MB compared to using only one font and one Text element. This section is used repeatedly in the document thousands of times. I am wondering if there is a way to reduce the impact of switching fonts or to reduce the file size of the entire document in some way.
Style creation pseudocode:
Style style1 = new Style();
Style style2 = new Style();
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile1), PdfEncodings.IDENTITY_H, true);
style1.setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile2), "", false);
style2.setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Writing text/paragraph pseudocode:
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph();
for (int i = 0; i < 25; i++) { // up to 25 iterations
Text unicodeText = new Text(unicodeSymbol + " ").addStyle(style1);
paragraph.add(unicodeText);
Text codeText = new Text(plainText + " ").addStyle(style2);
paragraph.add(codeText);
}
div.add(paragraph);
This writing of text/paragraph is done thousands of times and makes up most of the document. Basically the document consists of thousands of "buildings" that have corresponding codes and the codes have categories. I need to have the index for the category as the unicode symbol and then all of the corresponding codes within the paragraph for the building.
Here is reproducible code:
float offSet = 50;
Integer leading = 10;
DateFormat format = new SimpleDateFormat("yyyy_MM_dd_kkmmss");
String formattedDate = format.format(new Date());
String path = "/tmp/testing_pdf_"+formattedDate + ".pdf";
File targetPdfFile = new File(path);
PdfWriter writer = new PdfWriter(path, new WriterProperties().addXmpMetadata());
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.LETTER;
Document document = new Document(pdf, pageSize);
document.setMargins(offSet, offSet, offSet, offSet);
byte[] font1file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Garamond-Premier-Pro-Regular.ttf"));
byte[] font2file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Quivira.otf"));
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(font1file), "", true);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(font2file), PdfEncodings.IDENTITY_H, true);
Style style1 = new Style().setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Style style2 = new Style().setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
float columnGap = 5;
float columnWidth = (pageSize.getWidth() - offSet * 2 - columnGap * 2) / 3;
float columnHeight = pageSize.getHeight() - offSet * 2;
Rectangle[] columns = {
new Rectangle(offSet, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth + columnGap, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth * 2 + columnGap * 2, offSet, columnWidth, columnHeight)};
document.setRenderer(new ColumnDocumentRenderer(document, columns));
for (int j = 0; j < 5000; j++) {
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph().setFixedLeading(leading);
// StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 26; i++) {
paragraph.add(new Text("\u3255 ").addStyle(style2));
paragraph.add(new Text("test ").addStyle(style1));
// stringBuilder.append("\u3255 ").append(" test ");
}
// paragraph.add(stringBuilder.toString()).addStyle(style2);
div.add(paragraph);
document.add(div);
}
document.close();
In creating the reproducible code I have found that this is related to the document being tagged. If you remove the line that marks it as tagged, the file size shrinks greatly.
You can also reduce the file size by using the commented out string builder with one font instead of two. (Comment out the two "paragraph.add"s in the for-loop) This mirrors the issue I have in my code.
The problem is not the fonts themselves. The issue comes from the fact that you are creating a tagged PDF. Tagged documents contain a lot of PDF objects, which take up a lot of space in the file.
I wasn't able to reproduce your 20MB vs 100MB results. On my machine the resulting file size is ~44MB both with one font and with two fonts, as long as two Text elements are used.
To compress the file when creating large tagged documents, you should use full compression mode, which compresses all PDF objects, not only streams.
To activate full compression mode, create a PdfWriter instance with WriterProperties:
PdfWriter writer = new PdfWriter(outFileName,
new WriterProperties().setFullCompressionMode(true));
This setting reduced the file size for me from >40MB to ~5MB.
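Why compressing whole objects pays off for tagged documents can be illustrated without iText. A tagged file holds a very large number of small, nearly identical dictionaries. The sketch below uses only java.util.zip; the sample dictionary text is invented and only loosely resembles what a real tagged PDF contains, and the class name is hypothetical. It compares the raw size of many such objects with their size after deflating them together, which is roughly the effect of packing objects into compressed object streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: invented sample data, not iText's actual output.
final class ObjectCompressionDemo {

    // Returns the number of bytes produced by deflating the given data.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Builds many small, repetitive objects resembling structure elements.
    static byte[] fakeStructureObjects(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(i + 1).append(" 0 obj\n<</Type/StructElem/S/Span/P ")
              .append(i).append(" 0 R/K ").append(i % 50).append(">>\nendobj\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = fakeStructureObjects(10_000);
        System.out.println("uncompressed: " + raw.length + " bytes");
        System.out.println("deflated:     " + deflatedSize(raw) + " bytes");
    }
}
```

The exact ratio depends on the data, so the numbers printed here say nothing about a specific document; they only show the direction of the effect.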
Please note that you are using iText 7.0.x while 7.1.x line has already been released and is now the main line of iText, so I recommend that you update to the latest version.

PDFs generated with iTextSharp generated watermark giving error

We are applying a watermark using iTextSharp to PDF documents before passing them to client. On some machines (all using v.11 of PDF viewer), the following error is being displayed.
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF Document to correct the problem.
The watermarking code is as follows:
protected static byte[] GetStampedDocument(byte[] content, string mark, string heading)
{
PdfReader reader = new PdfReader(content);
using (MemoryStream stream = new MemoryStream())
{
PdfStamper pdfStamper = new PdfStamper(reader, stream);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
iTextSharp.text.Rectangle pageSize = reader.GetPageSizeWithRotation(i);
PdfContentByte pdfPageContents = pdfStamper.GetOverContent(i);
pdfPageContents.BeginText();
PdfGState gstate = new PdfGState();
gstate.FillOpacity = 0.2f;
gstate.StrokeOpacity = 0.3f;
pdfPageContents.SaveState();
pdfPageContents.SetGState(gstate);
BaseFont baseFont = BaseFont.CreateFont(BaseFont.HELVETICA_BOLD, Encoding.ASCII.EncodingName, false);
pdfPageContents.SetFontAndSize(baseFont, 46);
pdfPageContents.SetRGBColorFill(32, 32, 32);
pdfPageContents.ShowTextAligned(PdfContentByte.ALIGN_CENTER, mark, pageSize.Width / 2, pageSize.Height / 2, 66);
if (heading != null && heading.Length > 0)
{
pdfPageContents.SetFontAndSize(baseFont, 12);
pdfPageContents.SetRGBColorFill(32, 32, 32);
pdfPageContents.ShowTextAligned(PdfContentByte.ALIGN_LEFT, heading, 5, pageSize.Height - 15, 0);
}
pdfPageContents.EndText();
pdfPageContents.RestoreState();
}
pdfStamper.FormFlattening = true;
pdfStamper.FreeTextFlattening = true;
pdfStamper.Close();
return stream.ToArray();
}
}
I cannot recreate this on any machine I have tried so there is an environmental element to this as well I expect.
Any ideas?
You save the graphics state inside a text object:
pdfPageContents.BeginText();
[...]
pdfPageContents.SaveState();
[...]
pdfPageContents.EndText();
pdfPageContents.RestoreState();
This is not allowed: according to Figure 9 ("Graphics objects") in ISO 32000-2, special graphics state operators (such as those saving or restoring the graphics state) may not be used inside a text object.
To prevent this invalid syntax, move pdfPageContents.SaveState() before pdfPageContents.BeginText(). This also makes the nesting of saving/restoring the state and beginning/ending the text object more natural.
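To check an existing content stream for this mistake, one can look for the special graphics state operators (q, Q and cm) between BT and ET. The following is a minimal sketch in Java with hypothetical names; it assumes the stream is already decoded to text and tokenizes naively on whitespace, so string operands containing spaces can confuse it. The sample streams in main are invented approximations of what the two call orders produce.

```java
import java.util.ArrayList;
import java.util.List;

// Naive whitespace tokenizer; a rough diagnostic, not a PDF parser.
final class SpecialStateInTextCheck {

    // Returns every q, Q or cm operator that occurs inside a BT ... ET text object.
    static List<String> findSpecialStateOperatorsInText(String decodedContent) {
        List<String> findings = new ArrayList<>();
        boolean inText = false;
        for (String token : decodedContent.trim().split("\\s+")) {
            if (token.equals("BT")) {
                inText = true;
            } else if (token.equals("ET")) {
                inText = false;
            } else if (inText && (token.equals("q") || token.equals("Q") || token.equals("cm"))) {
                findings.add(token);
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // BeginText before SaveState: the q lands inside the text object.
        System.out.println(findSpecialStateOperatorsInText("BT q /GS1 gs /F1 46 Tf (mark) Tj ET Q"));
        // SaveState before BeginText: nothing to report.
        System.out.println(findSpecialStateOperatorsInText("q BT /GS1 gs /F1 46 Tf (mark) Tj ET Q"));
    }
}
```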

Using pdfbox - how to get the font from a COSName?

How to get the font from a COSName?
The solution I'm looking for looks somehow like this:
COSDictionary dict = new COSDictionary();
dict.add(fontname, something); // fontname COSName from below code
PDFontFactory.createFont(dict);
If you need more background, I added the whole story below:
I am trying to replace some strings in a PDF. This succeeds (as long as all the text is stored in one token). In order to keep the formatting I would like to re-center the text. As far as I understand, I can do this by getting the widths of the old string and the new one, doing some trivial calculation, and setting the new position.
I found some inspiration on Stack Overflow for the replacing, https://stackoverflow.com/a/36404377 (yes, it has some issues, but it works for my simple PDFs), and for the centering, How to center a text using PDFBox. Unfortunately this example uses a font constant.
So using the first link's code I get one handler for the operator 'TJ' and one for 'Tj'.
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
java.util.List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof Operator)
{
Operator op = (Operator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
// Tj takes one operand, the string to display, so let's
// update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
String replaced = prh.getReplacement(string);
if (!string.equals(replaced))
{ // if changes are there, replace the content
previous.setValue(replaced.getBytes());
float xpos = getPosX(tokens, j);
//if (true) // center the text
if (6 * xpos > page.getMediaBox().getWidth()) // check if text starts right from 1/xth page width
{
float fontsize = getFontSize(tokens, j);
COSName fontname = getFontName(tokens, j);
// TODO
PDFont font = ?getFont?(fontname);
// TODO
float widthnew = getStringWidth(replaced, font, fontsize);
setPosX(tokens, j, page.getMediaBox().getWidth() / 2F - (widthnew / 2F));
}
replaceCount++;
}
}
Considering the code between the TODO tags, I will get the required values from the token list. (Yes, this code is awful, but for now it lets me concentrate on the main issue.)
Having the string, the size and the font, I should be able to call the getWidth(..) method from the sample code.
Unfortunately I run into trouble creating a font from the COSName variable.
PDFont doesn't provide a method to create a font by name.
PDFontFactory looks fine, but it requires a COSDictionary. This is the point where I gave up and am asking for your help.
The names are associated with font objects in the page resources.
Assuming you use PDFBox 2.0.x and that page is a PDPage instance, you can resolve the name fontname using:
PDFont font = page.getResources().getFont(fontname);
But the warning from the comments to the questions you reference remains: This approach will work only for very simple PDFs and might even damage other ones.
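Once the font is resolved, the re-centering described in the question is simple arithmetic. The sketch below is a standalone helper with hypothetical names. It assumes the string width is given in 1/1000 text-space units, which is how PDFBox's PDFont.getStringWidth reports it, and that no additional scaling from the text matrix or the current transformation matrix applies.

```java
// Standalone arithmetic helper; class and method names are hypothetical.
final class CenterText {

    // Converts a width in 1/1000 text-space units to user-space units for a font size.
    static float widthInUserSpace(float stringWidthInGlyphUnits, float fontSize) {
        return stringWidthInGlyphUnits / 1000f * fontSize;
    }

    // Returns the x position at which the text must start to be centered on the page.
    static float centeredX(float pageWidth, float stringWidthInGlyphUnits, float fontSize) {
        return (pageWidth - widthInUserSpace(stringWidthInGlyphUnits, fontSize)) / 2f;
    }

    public static void main(String[] args) {
        // A string 5000 glyph units wide at 12 pt is 60 units wide in user space.
        System.out.println(centeredX(600f, 5000f, 12f));
    }
}
```

With PDFBox this could be called as centeredX(page.getMediaBox().getWidth(), font.getStringWidth(replaced), fontsize), using the font obtained from the page resources.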
try {
//Loading an existing document
File file = new File("UKRSICH_Mo6i-Spikyer_z1560-FAV.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
System.out.println(pageResources.getFontNames() );
for (COSName key : pageResources.getFontNames())
{
PDFont font = pageResources.getFont(key);
System.out.println("Font: " + font.getName());
}
document.close();
} catch (IOException e) {
e.printStackTrace();
}

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK ImageMagick won't work for this. Can iText do this? I have never used it and it's pretty complex, so I would need some hints.
EDIT:
The marked answer led me to the solution. For completeness, here is my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++) { // page numbers are 1-based, so include page n
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found open Document. Adding additional page to Document...");
// skip the page only if blank-page removal is requested and the page is blank
if (!removeBlankPages || !isBlankPage(reader, i)) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action, 2nd Edition; its code samples are online and searchable by keyword.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF into single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple the loop index from the page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
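The decision logic can be kept separate from the PDF handling. The following sketch, with hypothetical class and method names, assumes the text of each page has already been extracted into a list (for example with PdfTextExtractor as shown above) and returns, for each record, the 1-based page numbers that belong to it. Pages before the first header match are skipped, which is an assumption of this sketch and may not suit every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Pure grouping logic; no PDF library involved.
final class RecordGrouping {

    // pageTexts.get(0) holds the text of page 1; the result lists 1-based page numbers per record.
    static List<List<Integer>> groupPages(List<String> pageTexts, Pattern header) {
        List<List<Integer>> records = new ArrayList<>();
        List<Integer> current = null; // stays null until the first header is found
        for (int i = 0; i < pageTexts.size(); i++) {
            if (header.matcher(pageTexts.get(i)).find()) {
                current = new ArrayList<>(); // a new record starts on this page
                records.add(current);
            }
            if (current != null) {
                current.add(i + 1);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("RECORD A", "more", "RECORD B", "RECORD C", "tail");
        System.out.println(groupPages(texts, Pattern.compile("RECORD")));
    }
}
```

Each inner list can then drive one PdfCopy instance: open a document, add the imported pages with those numbers, and close it.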
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print(input1.getDocumentInfo())
print(input1.getNumPages())
text = input1.getPage(0).extractText()
print(text.encode("windows-1250", errors='backslashreplace'))
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.