I have a large PDF that has been combined from multiple documents.
How can I split the PDF back into multiple documents with a keyword delimiter?
As well as Adobe Reader you will need Adobe Acrobat.
Add the following script using the Action Wizard:
Paste in the following script and modify for your needs. See //comments for help on customisation.
/* Extract Pages into Documents by Keyword */
// Iterates over all pages, looks for a given string, and starts a new
// file at every page on which that string is found.
var pageArray = [];
var pageArrayEnd = [];
var stringToSearchFor = app.response("This Action Script splits the document by a keyword on each X number of pages, please enter the keyword:");
for (var p = 0; p < this.numPages; p++) {
// iterate over all words
for (var n = 0; n < this.getPageNumWords(p); n++) {
// DEBUGGING HELP: uncomment the app.alert line below to see each word.
// To match several consecutive words, change the test that follows, e.g.
//   if ((this.getPageNthWord(p, n) == stringToSearchFor) && (this.getPageNthWord(p, n + 1) == stringToSearchForTWO)) {...
// Also add a prompt for the second search word and iterate one word less:
//   for (var n = 0; n < this.getPageNumWords(p) - 1; n++) ...
//app.alert("Word is " + this.getPageNthWord(p, n));
if (this.getPageNthWord(p, n) == stringToSearchFor) {
//app.alert("Found word on page " + p + " word number " + n, 3);
if (pageArray.length > 0) {
pageArrayEnd.push(p - 1);
}
pageArray.push(p);
break;
}
}
}
pageArrayEnd.push(this.numPages - 1);
//app.alert("Number of sub documents " + pageArray.length, 3);
if (pageArray.length > 0) {
// extract all pages that contain the string into a new document
for (var n = 0; n < pageArray.length; n++) {
var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
//app.alert("New Doc using pages " + pageArray[n] + " to " + pageArrayEnd[n], 3);
d.insertPages( {
nPage: d.numPages-1,
cPath: this.path,
nStart: pageArray[n],
nEnd: pageArrayEnd[n],
} );
// remove the first page
d.deletePages(0);
d.saveAs({ cPath: this.path.replace(".pdf","") + n + ".pdf" });
d.closeDoc(true);
}
}
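The range bookkeeping in the script above (pageArray and pageArrayEnd) can be checked outside Acrobat. The following is only a sketch in plain JavaScript; computeRanges is a hypothetical helper name, not part of the Acrobat API, and it assumes the keyword pages are already sorted in ascending order.

```javascript
// Hypothetical helper (not an Acrobat API): given the 0-based page numbers on
// which the keyword was found and the total page count, return the inclusive
// [start, end] page range of each sub-document.
function computeRanges(keywordPages, numPages) {
  var ranges = [];
  for (var i = 0; i < keywordPages.length; i++) {
    var start = keywordPages[i];
    // A sub-document ends on the page before the next keyword page,
    // or on the last page of the source document.
    var end = (i + 1 < keywordPages.length) ? keywordPages[i + 1] - 1 : numPages - 1;
    ranges.push([start, end]);
  }
  return ranges;
}

// Keyword found on pages 0, 3 and 5 of an 8-page document.
var ranges = computeRanges([0, 3, 5], 8);
```

Note that, as in the script, any pages before the first keyword page do not end up in any output file.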
Please have a look at this guide on how to split a PDF into multiple files:
// Used to register all DLL assemblies.
WorkRegistry.Reset();
String inputFilePath = Program.RootPath + "\\" + "1.pdf";
String outputFileName = "Output";
int[] splitIndex = new int[3] { 1, 3, 5 }; // Valid value for each index: 1 to (Page Count - 1).
// Create output PDF file path list
List<String> outputFilePaths = new List<String>();
for (int i = 0; i <= splitIndex.Length; i++)
{
outputFilePaths.Add(Program.RootPath + "\\" + outputFileName + "_" + i.ToString() + ".pdf");
}
// Split input PDF file to 4 files:
// File 0: page 0.
// File 1: page 1 ~ 2.
// File 2: page 3 ~ 4.
// File 3: page 5 ~ the last page.
PDFDocument.SplitDocument(inputFilePath, splitIndex, outputFilePaths.ToArray());
Using Apache PDFBox version 2.0.2 for a text replacement (with the code below) produces output in which some characters are not displayed, mostly capital letters. For example, after replacing with "ABCDEFGHIJKLMNOPQRSTUVWXYZ" the text appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround for handling these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
PDPageTree pages = document.getDocumentCatalog().getPages();
for (PDPage page : pages) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operand, the string to display, so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
string = StringUtils.replaceOnce(string, searchString, replacement);
cosString.setValue(string.getBytes());
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
out.close();
}
return document;
}
Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the initial file uses a font subset that is missing some of the characters you need, such as G, N, Q and V.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.
P.S. the current PDFBox version is 2.0.7, not 2.0.2.
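To make the word-splitting problem from the quoted migration note concrete, here is a small sketch in plain JavaScript (not PDFBox code). tjFragments is a hypothetical helper, and the parsing is deliberately simplified: it assumes no escaped parentheses and no hex strings in the operand.

```javascript
// Simplified sketch: collect the literal strings of a TJ array operand.
// Assumption: strings contain no escaped parentheses and no hex strings.
function tjFragments(operand) {
  var fragments = [];
  var re = /\(([^()]*)\)/g;
  var m;
  while ((m = re.exec(operand)) !== null) {
    fragments.push(m[1]);
  }
  return fragments;
}

var operand = "[ (Do) -29 (c) -1 (umen) 30 (tation) ]";
var fragments = tjFragments(operand);

// Searching each string on its own never sees the whole word ...
var foundInFragment = fragments.some(function (f) {
  return f.indexOf("Documentation") >= 0;
});

// ... it only appears once the fragments are joined.
var joined = fragments.join("");
```

A per-string replace, like the one in the question, therefore cannot find a word that the producer split for kerning, even before font subsetting comes into play.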
I have a PDF document that I created with iText in Java.
The document contains data that is added via PdfPTable objects.
The problem is that when I have more data than fits on one PDF page, the data is rendered on the next page, leaving empty space on the first page (see the 'Problem' side of the image).
I would like these empty spaces to be filled with PdfPCell objects, as on the 'Solution' side (each PdfPCell contains another PdfPTable, and the data in that PdfPTable must not be continued on the next page of the PDF when it does not fit).
This is a small example in code:
PdfPTable outerTable = new PdfPTable(1);
outerTable.setHorizontalAlignment(Element.ALIGN_LEFT);
outerTable.setWidthPercentage(100);
int i = 0;
while (i < 5)
{
i++;
PdfPTable innerTable = new PdfPTable(new float[] {0.25f, 0.25f, 0.25f, 0.25f});
innerTable.setHorizontalAlignment(Element.ALIGN_LEFT);
innerTable.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(innerTable);
cell.setPadding(0);
innerTable.addCell(new Phrase("test Data"));
innerTable.addCell(new Phrase("test Data"));
innerTable.addCell(new Phrase("test Data"));
innerTable.addCell(new Phrase("test Data"));
outerTable.addCell(cell);
}
document.add(outerTable);
document.close();
Please take a look at the DropTablePart example. In this example, I add 4 tables with 19 rows to a ColumnText object. As soon as a table doesn't fit the page, I drop the remaining content of the ColumnText object (which will automatically drop the rest of the table) and I start a new page where a new table will start.
Dropping the content of the ColumnText object can be done in two different ways:
Either:
ct = new ColumnText(writer.getDirectContent());
Or:
ct.setText(null);
The result looks like this:
As you can see, rows 10-18 are dropped from inner table 3.
This is the full code:
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
Rectangle column = new Rectangle(36, 36, 559, 806);
ColumnText ct = new ColumnText(writer.getDirectContent());
ct.setSimpleColumn(column);
for (int i = 0; i < 4; ) {
PdfPTable table = new PdfPTable(new float[]{0.25f, 0.25f, 0.25f, 0.25f});
table.setHorizontalAlignment(Element.ALIGN_LEFT);
table.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(new Phrase("inner table " + (++i)));
cell.setColspan(4);
table.addCell(cell);
for (int j = 0; j < 18; j++) {
table.addCell(new Phrase("test Data " + (j + 1) + ".1"));
table.addCell(new Phrase("test Data " + (j + 1) + ".1"));
table.addCell(new Phrase("test Data " + (j + 1) + ".1"));
table.addCell(new Phrase("test Data " + (j + 1) + ".1"));
}
ct.addElement(table);
if (ColumnText.hasMoreText(ct.go())) {
document.newPage();
ct = new ColumnText(writer.getDirectContent());
ct.setSimpleColumn(column);
}
}
document.close();
}
I didn't use nested tables, because nesting tables is generally a bad idea. It has a negative impact on the performance of your application, and it usually results in code that is hard to maintain (the programmers who inherit your application will thank you for not using nested tables).
I am trying to delete a paragraph from a .docx document that I generated using Apache POI XWPF. I can do it easily with a .doc Word document using HWPF, as below:
for (String paraCount : plcHoldrPargrafDletdLst) {
Paragraph ph = doc.getRange().getParagraph(Integer.parseInt(paraCount));
System.out.println("Deleted Paragraph Start & End: " + ph.getStartOffset() +" & " + ph.getEndOffset());
System.out.println("Deleted Paragraph Test: " + ph.text());
ph.delete();
}
I tried to do the same with
doc.removeBodyElement(Integer.parseInt(paraCount));
Unfortunately, this does not give the result I want: the paragraph is still present in the resulting document.
Any suggestions on how to accomplish similar functionality in XWPF?
OK, this question is a bit old and an answer might not be needed anymore, but I just found a different solution from the suggested one.
I hope the following code helps somebody with the same issue:
...
FileInputStream fis = new FileInputStream(fileName);
XWPFDocument doc = new XWPFDocument(fis);
fis.close();
// Find a paragraph with todelete text inside
XWPFParagraph toDelete = doc.getParagraphs().stream()
.filter(p -> StringUtils.equalsIgnoreCase("todelete", p.getParagraphText()))
.findFirst().orElse(null);
if (toDelete != null) {
doc.removeBodyElement(doc.getPosOfParagraph(toDelete));
OutputStream fos = new FileOutputStream(fileName);
doc.write(fos);
fos.close();
}
It seems that you really cannot remove paragraphs from a .docx file.
What you should be able to do is remove the content of the paragraphs, the so-called runs. You could try this:
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (XWPFParagraph paragraph : paragraphs)
{
// Iterate backwards: removing a run shifts the indices of the runs after it
for (int i = paragraph.getRuns().size() - 1; i >= 0; i--)
{
paragraph.removeRun(i);
}
}
You can also specify which run of which paragraph should be removed, e.g.
paragraphs.get(23).removeRun(17);
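When removing items by index inside a loop, the direction of the loop matters: counting upward skips every other item, because each removal shifts the remaining items down by one. The following language-neutral sketch uses plain JavaScript arrays as a stand-in for a paragraph's run list; removeForward and removeBackward are hypothetical names used only for this illustration.

```javascript
// Counting upward: after each removal the next item slides into index i,
// and the loop then moves on to i + 1, so that item is never removed.
function removeForward(items) {
  var list = items.slice(); // work on a copy
  for (var i = 0; i < list.length; i++) {
    list.splice(i, 1);
  }
  return list;
}

// Counting downward: removals only affect indices that were already visited.
function removeBackward(items) {
  var list = items.slice(); // work on a copy
  for (var i = list.length - 1; i >= 0; i--) {
    list.splice(i, 1);
  }
  return list;
}

var leftoverForward = removeForward(["a", "b", "c", "d"]);
var leftoverBackward = removeBackward(["a", "b", "c", "d"]);
```

The same reasoning applies to any index-based removal loop, whatever the library.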
// Remove all existing runs
removeRun(para, 0);
public static void removeRun(XWPFParagraph para, int depth)
{
if(depth > 10)
{
return;
}
int numberOfRuns = para.getRuns().size();
// Remove all existing runs
for(int i = 0; i < numberOfRuns; i++)
{
try
{
para.removeRun(numberOfRuns - i - 1);
}
catch(Exception e)
{
//e.printStackTrace();
}
}
if(para.getRuns().size() > 0)
{
removeRun(para, ++depth);
}
}
I like Apache POI, and for the most part it's great; however, I have found the documentation a little scattered, to say the least.
Finding a way to delete a paragraph turned out to be quite a nightmare; I got the following exception when I tried to remove a paragraph:
java.util.ConcurrentModificationException
As mentioned in Ugo Delle Donne's example, I solved this by first recording the paragraphs that I wanted to delete, and then calling the document's removeBodyElement method.
e.g.
List<XWPFParagraph> record = new ArrayList<XWPFParagraph>();
for (XWPFParagraph p : doc.getParagraphs()) {
// Build the text of this paragraph from its runs (reset for every paragraph)
StringBuilder text = new StringBuilder();
for (XWPFRun r : p.getRuns()) {
// I saw so many examples using r.getText(pos), don't use that
text.append(r.text());
}
// Find some unique text in the paragraph
if (text.toString().contains("SOME-UNIQUE-TEXT")) {
// Save the paragraph once, to delete it later
record.add(p);
}
}
// Now delete the paragraph and anything within it.
for(int i=0; i< record.size(); i++)
{
// Remove the Paragraph and everything within it
doc.removeBodyElement(doc.getPosOfParagraph( record.get(i) ));
}
// Shaaazam, I hope this helps !
I believe your question was answered in this question.
When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));
When I try to access the len-variables at the end of the script I get this error: "Cannot iterate twice! If you want to iterate more that once, add _CACHE explicitely."
How can I fix that?
def src_str = query_string
def src_arr = src_str.split(' ')
def trg_arr = doc['my_index'].values
trg_arr_sorted = [:]
trg_arr.each {
_index['my_index'].get(it, _POSITIONS).each { pos ->
trg_arr_sorted[pos.position] = it
}
}
src_len = src_arr.length
def trg_len = trg_arr_sorted.size()
int[][] matrix = new int[src_len + 1][trg_len + 1]
(src_len + 1).times { matrix[it][0] = it }
(trg_len + 1).times { matrix[0][it] = it }
(1..src_len).each { i ->
(1..trg_len).each { j ->
matrix[i][j] = [matrix[i-1][j] + 1, matrix[i][j-1] + 1,
src_arr[i-1] == trg_arr_sorted[j-1] ? matrix[i-1][j-1] : matrix[i-1][j-1] + 1].min()
}
}
return 100 - (100 * matrix[src_len][trg_len] / max(src_len, trg_len)) // over here !!!
The code calculates a score using the Levenshtein distance computed over words. It works perfectly except for the scoring in the last line.
Okay, the problem is solved.
I had to request both the cache and the positions explicitly:
_index['lang'].get(it, _POSITIONS | _CACHE)
The error wasn't in the last line, as I had thought. I had changed the script to debug it, but Elasticsearch doesn't reload the new script instantly.
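For anyone who wants to check the scoring logic outside Elasticsearch, here is a sketch of the same idea in plain JavaScript: a Levenshtein distance computed over words, turned into a 0 to 100 score. wordLevenshtein and similarityScore are hypothetical names, and splitting on single spaces is an assumption made for this example.

```javascript
// Levenshtein distance where the units are whole words, not characters.
function wordLevenshtein(a, b) {
  var m = a.length, n = b.length;
  var d = [];
  var i, j;
  for (i = 0; i <= m; i++) { d.push([i]); }   // first column: i deletions
  for (j = 1; j <= n; j++) { d[0][j] = j; }   // first row: j insertions
  for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
      var cost = (a[i - 1] === b[j - 1]) ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // delete a word
        d[i][j - 1] + 1,        // insert a word
        d[i - 1][j - 1] + cost  // keep or substitute a word
      );
    }
  }
  return d[m][n];
}

// Same formula as the last line of the script in the question.
function similarityScore(query, target) {
  var a = query.split(" ");
  var b = target.split(" ");
  return 100 - (100 * wordLevenshtein(a, b) / Math.max(a.length, b.length));
}

var score = similarityScore("the quick fox", "the fox");
```

Here one word has to be deleted out of three, so the score is 100 minus one third of 100.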
I am developing a mobile application that uses WebSQL as its local database. I am now building a search feature in which I want to escape "_" when the user searches the records. I tried the MS SQL approach of wrapping it in square brackets, "[_]".
Below is my code example:
if ($.trim($('#txtPolicy').val()).length > 0) {
policy = $.trim($('#txtPolicy').val());
if (policy.indexOf("_") >= 0)
policy = policy.replace(/_/g, "[_]");
query += " (";
var arrploicy = policy.split(',');
for (var j = 0; j < arrploicy.length; j++) {
query += " policy like ? or ";
arr[i] = "%" + arrploicy[j] + "%";
++i;
}
query = query.substring(0, query.length - 3);
query += ") ";
}
I have a record whose data is 1234_456789, but the query does not return any records, probably because "[_]" is being treated as a literal string.
You can use a parameterized query, which does not require you to escape the value. It is more secure too.
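One caveat: binding parameters protects against SQL injection, but "_" and "%" bound into a LIKE pattern are still wildcards. SQLite, which backs WebSQL, does not support the "[_]" bracket syntax, but it does support an ESCAPE clause on LIKE. A minimal sketch, assuming backslash as the escape character; escapeLike and buildPolicyClause are hypothetical helper names, and the policy column comes from the question.

```javascript
// Prefix LIKE wildcards (and the escape character itself) with a backslash
// so that they match literally.
function escapeLike(term) {
  return term.replace(/[\\%_]/g, "\\$&");
}

// Build "(policy LIKE ? ESCAPE '\' OR ...)" plus the matching parameter list
// from a comma-separated search string.
function buildPolicyClause(input) {
  var terms = input.split(",");
  var conditions = [];
  var params = [];
  for (var j = 0; j < terms.length; j++) {
    conditions.push("policy LIKE ? ESCAPE '\\'");
    params.push("%" + escapeLike(terms[j].trim()) + "%");
  }
  return { sql: "(" + conditions.join(" OR ") + ")", params: params };
}

var clause = buildPolicyClause("1234_456789");
```

The resulting clause.sql can be appended to the query string and clause.params to the argument array passed to executeSql, in place of the "[_]" replacement in the question.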