I scanned old documents into multi-page PDFs (typically 50 pages each). Each PDF page encapsulates several pages of the original document. I would like to preprocess the PDFs so that their pages match those of the original documents.
Because the original documents do not all have the same format, this necessarily implies manual stages, such as selecting the pages of the original document (red rectangles in the image below).
Many tools can do that, but given the number of PDFs, I would like something as convenient as possible; in particular, the red rectangles should always have the same size.
So the workflow would be:
Go to page 1 of PDF
Choose rectangle size once and for all
Move rectangle to page 1 of original document, extract
Move rectangle to page 2 of original document, extract
Go to page 2 of PDF
Move rectangle to page 1 of original document, extract
...
This is a typical page of a PDF; the red rectangles correspond to the pages of the original documents that I would like to extract.
Question Do you know any tool (Linux or Windows), ideally free, that would be relevant for what I am trying to do?
Note: this is related, but in my case the parts I want to extract are not always at the same position (otherwise it could easily be done with pdfcrop and a little script).
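For the fixed-position case mentioned in the note, the "little script" around pdfcrop could look like the sketch below. This is an assumption-laden illustration: it relies on pdfcrop's `--bbox` option (coordinates in PostScript points, applied to every page) and on pdfcrop being on the PATH.

```python
import subprocess

def pdfcrop_cmd(infile, outfile, bbox):
    """Build a pdfcrop call for one fixed rectangle.

    bbox is (left, bottom, right, top) in PostScript points; pdfcrop's
    --bbox option applies the same box to every page of the input.
    """
    left, bottom, right, top = bbox
    return ["pdfcrop", "--bbox", f"{left} {bottom} {right} {top}", infile, outfile]

# To actually run it (assumes pdfcrop from TeX Live is installed):
# subprocess.run(pdfcrop_cmd("scan.pdf", "page001.pdf", (0, 0, 1620, 2070)), check=True)
```

This only works when the rectangle position is the same on every page, which is exactly what does not hold here; hence the interactive approach below.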
I finally decided to write a piece of Mathematica code, reproduced here in case it can help someone one day:
(* import *)
fileName = "mypdf.pdf";
pdf = Import[fileName];
(* desired output size *)
rectangleSize = {1620, 2070}
i = 1;
(* update function: crop and save when clicked *)
update[a_, b_] := Block[{},
Export[fileName <> IntegerString[i, 10, 3] <> ".pdf",
ImageTake[
pdf[[i]], {a, a + rectangleSize[[2]]}, {b,
b + rectangleSize[[1]]}]];
i += 2;]
(* a function to deal with pages that should not be cropped, if any *)
relaxCropSize := Block[{},
buffer = rectangleSize;
rectangleSize = ImageDimensions[pdf[[i]]];]
(* the "interface" *)
Manipulate[
ImageResize[ImageTake[pdf[[i]], {a, a + rectangleSize[[2]]}, {b,
b + rectangleSize[[1]]}], 200], {a, 1,
ImageDimensions[pdf[[i]]][[2]] - rectangleSize[[2]]}, {b, 1,
ImageDimensions[pdf[[i]]][[1]] - rectangleSize[[1]]},
Row[{Spacer[100],
Button["Save and next page", update[a, b], Method -> "Queued",
ImageSize -> 100]}],
Row[{Spacer[100],
Button["Relax Crop Size", relaxCropSize, Method -> "Queued",
ImageSize -> 100]}]]
Slide a and b to adjust the rectangle, then click "Save and next page" to save the result and go to the next page.
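For reference, the row/column arithmetic in the ImageTake calls is easy to express outside Mathematica. A small Python sketch of the same bounds logic (rows first, columns second, matching Mathematica's convention), including the clamping that the Manipulate slider ranges provide:

```python
def crop_bounds(page_size, rect_size, a, b):
    """Row and column ranges for a fixed-size crop at offset (a, b).

    page_size and rect_size are (width, height); a is the vertical offset
    (rows), b the horizontal one (columns), as in the Manipulate above.
    """
    page_w, page_h = page_size
    rect_w, rect_h = rect_size
    # clamp the offsets so the rectangle always stays on the page
    a = max(1, min(a, page_h - rect_h))
    b = max(1, min(b, page_w - rect_w))
    return (a, a + rect_h), (b, b + rect_w)
```

With the 1620 x 2070 rectangle above, `crop_bounds((3240, 4140), (1620, 2070), 1, 1)` gives the top-left crop of a double-size page.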
I am generating a PDF document using jsPDF. Is there a way to store metadata in the PDF document that will force Acrobat to open it in 100% view mode (Actual Size) vs sized to fit?
In other words, does the PDF specification allow specifying this in the document itself?
This is definitely possible, because a PDF document can contain information on how it should open.
You might create such a document in Acrobat and then find the opening information, and/or you might have a look at the Portable Document Format Reference, which is part of the Acrobat SDK, downloadable from the Adobe website.
However, I don't know whether you can insert that structure into the PDF with your tool.
I figured it out: in the Catalog section of the PDF document, there is an OpenAction entry where we can specify how the viewer should show the file, among other things.
I changed this
putCatalog = function () {
out('/Type /Catalog');
out('/Pages 1 0 R');
// #TODO: Add zoom and layout modes
out('/OpenAction [3 0 R /FitH null]');
out('/PageLayout /OneColumn');
events.publish('putCatalog');
},
to this
putCatalog = function () {
out('/Type /Catalog');
out('/Pages 1 0 R');
// #TODO: Add zoom and layout modes
out('/OpenAction [3 0 R /XYZ null null 1]'); // /XYZ destination with zoom factor 1 = 100% instead of fit to width
out('/PageLayout /OneColumn');
events.publish('putCatalog');
},
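For context, the PDF specification defines an explicit destination as an array of the form `[page /XYZ left top zoom]`, where `null` leaves a coordinate unchanged and the zoom is a factor (1 = 100%). A tiny stdlib-Python helper that builds such a catalog line (the `3 0 R` page reference is taken from the jsPDF snippet above and is specific to that document):

```python
def open_action(page_ref="3 0 R", zoom_percent=100):
    """Build a /OpenAction catalog entry with an explicit /XYZ destination.

    null for left/top keeps the viewer's default position; zoom is
    expressed as a factor, so 100% becomes 1.
    """
    return f"/OpenAction [{page_ref} /XYZ null null {zoom_percent / 100:g}]"
```

Whether a given generator lets you inject this line depends on the tool; in jsPDF it means patching putCatalog as shown above.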
MuPDF is a good open-source PDF reader. It meets almost all of my requirements, except that it can't read PDFs right-to-left. Does anybody have any idea about this?
Actually, my understanding of RTL is that I can navigate to the last page first, then go from there. But my question is: how am I supposed to know that a PDF is an RTL one?
MuPDF does not support RTL. You need to start the book from the last page and make minor changes to the MuPDFPageAdapter getView method as follows:
public View getView(int pos, View convertView, ViewGroup parent) {
final int position;
if(mDirection == DIRECTION_RTL)
position = mCore.countPages() - pos - 1;
else
position = pos;
/** getView remaining code **/
}
Basically this reverses the page order: index 0 becomes pageCount - 1 and the last page becomes 0.
You can set the current page to the book's last page as follows:
mDocView.setDisplayedViewIndex(mCore.countPages() - 1);
I've uploaded a working sample here:
https://github.com/mardawi/MuPDF-Android-RTL
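The index mapping described above is compact enough to state language-neutrally. A Python sketch of the same logic, where the `rtl` flag plays the role of `mDirection == DIRECTION_RTL` in the Java code:

```python
def page_index(pos, page_count, rtl):
    """Map an adapter position to a page index; RTL reverses the order."""
    return page_count - pos - 1 if rtl else pos
```

So with a 10-page document, position 0 shows page 9 in RTL mode and page 0 otherwise.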
There isn't any way to know the reading direction of a PDF file. It seems like what you want is not right-to-left but back-to-front, or bottom-to-top.
I am having issues rewriting one of the default logo scripts in GIMP (using Script-Fu, based on Scheme). For one thing, the alpha layer is not shown in the layer browser after the image is shown. I am rewriting the Create Neon Logo script (neon-logo.scm) and I want it to do the following before it displays the new image:
add an alpha channel
change black(background color) to transparent via colortoalpha
return the generated image as an object to be used in another Python script (using for loops to generate 49 images)
I have tried modifying the following code to the default script:
(gimp-image-undo-disable img)
(apply-neon-logo-effect img tube-layer size bg-color glow-color shadow) ; generates the neon logo
(set! end-layer (car (gimp-image-flatten img)))                         ; flattens the image
(gimp-layer-add-alpha end-layer)                                        ; adds alpha as the last layer of img (img = the image)
(plug-in-colortoalpha img 0 (255 255 255))                              ; uses color to alpha - NOT WORKING
(gimp-image-undo-enable img)                                            ; re-enables undo
(gimp-display-new img)                                                  ; displays the new image
For number 3, my Python code is this:
for str1 in list1:
    for color1 in list3:
        img = pdb.script_fu_neon_logo(str1, 50, "Swis721 BdOul BT", (0, 0, 0), color1, 0)
But img is a "NoneType" object. I would like it so that instead of displaying the generated image in a new window, the script just returns the generated image for use in my Python script.
Can anyone help?
Maybe to keep everything more manageable and readable, you should translate the original script into Python - that way you will have no surprises on otherwise trivial things such as variable assignment, picking elements from sequences, and so on.
1 and 2) Your calls to flatten the image and add an alpha channel (not an "alpha layer", as you write) are apparently correct - but you are calling color-to-alpha to make white (255 255 255) transparent, not black. Try changing that to (0 0 0) - if that does not work, make each of the calls individually, either in the Script-Fu console or the Python console, and check what is wrong.
3) Script-Fu can't return values to the caller (as can be seen from the register call having no "return value type" parameter). That means that Scheme scripts in GIMP can only render things themselves and cannot be used to compose more complex chains.
That leaves you with two options: port the original script to Python-fu (and then just register it to return a PF-IMAGE) - or hack around the call like this, in Python:
create a set of all images open, call your script-fu, then check which of the currently open images is not in the set of previously open images - that will be your new image.
The tricky part is that there is no unique identifier for an image as seen from Python-fu - so you'd have to compose a value like (name, number_of_layers, size) to use in those comparison sets, and even that might not suffice - or you could juggle with "parasites" (arbitrary data that can be attached to an image). As you can see, having the original script-fu rewritten in Python is preferable, since all the work is done by PDB calls, and these translate 1:1.
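The before/after comparison can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes Python-fu image objects expose `.name`, `.layers`, `.width` and `.height`, and it uses the (name, layer count, size) makeshift key suggested above, with all its caveats.

```python
def image_key(img):
    # makeshift identity for a GIMP image; assumes the Python-fu image
    # object exposes these attributes
    return (img.name, len(img.layers), img.width, img.height)

def new_images(before_keys, images_now):
    """Return the images whose key was not present before the script-fu call."""
    return [img for img in images_now if image_key(img) not in before_keys]

# Inside GIMP's Python-fu console the pattern would be (not runnable here):
# before = {image_key(i) for i in gimp.image_list()}
# pdb.script_fu_neon_logo(str1, 50, "Swis721 BdOul BT", (0, 0, 0), color1, 0)
# created = new_images(before, gimp.image_list())
```

As noted, two images with the same name, layer count and size would collide, which is why porting the script to Python-fu remains the cleaner option.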
In MediaWiki, we would like to display tables of contents (from multiple pages) on another page. We know that this can be done automatically; e.g., if we include pages 1, 2 & 3 like this:
{{:Page 1}}
{{:Page 2}}
{{:Page 3}}
on page X, then page X displays a combined TOC for pages 1, 2 & 3.
But we want a table on page X which shows each TOC in a separate cell. Is there any way to include each TOC individually?
I have tried using <noinclude></noinclude> tags around the text on pages 1, 2 & 3 and then forcing a table of contents outside (using __TOC__) but that only creates a TOC on page X (using the contents of page X).
You can't. The table of contents is generated dynamically on each page, for all the sections that appear on the current page.
When you include the sections (or at least the section headings) of the other pages, they will show up in the TOC of page X. Including the __TOC__ magic word only makes MediaWiki generate the TOC for page X.
Four solutions:
Include the section (headings) of pages 1, 2 and 3. They will show up in the toc of page X even when contained in a <div style="display:none;"> - a really ugly way.
Copy the TOC tables manually to page X. You can view their HTML by looking in the generated HTML source of pages 1, 2 and 3 with your browser.
Write an extension that allows transclusion of TOCs from other pages. It might introduce a new parser function {{toc:<pagename>}} and be able to call the TOC-generating function in the context of another page.
Include only the section headings as a list. In the pages 1, 2 and 3 you will need to write
== <onlyinclude><includeonly>##</includeonly> Heading Number One </onlyinclude> ==
=== <onlyinclude><includeonly>###</includeonly> Part One of Heading Number One </onlyinclude> ===
...
which you will be able to include in the table at Page X with
{{:Page 1}}
It should show up as a numbered list, like the TOC.
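If pages 1, 2 and 3 have many headings, wrapping each one by hand gets tedious; a small Python sketch that generates the markup shown above (level 2 = `==` and `##`, level 3 = `===` and `###`, and so on):

```python
def transcludable_heading(level, text):
    """Wiki heading whose transclusion on another page renders as a numbered-list item."""
    eq = "=" * level
    hashes = "#" * level
    return f"{eq} <onlyinclude><includeonly>{hashes}</includeonly> {text} </onlyinclude> {eq}"
```

Running it over a list of (level, title) pairs produces the wikitext to paste into each source page.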
Is there any tool to find the X-Y location of text content in a PDF file?
The Docotic.Pdf library can do it. See the C# sample below:
using (PdfDocument doc = new PdfDocument("your_pdf.pdf"))
{
foreach (PdfTextData textData in doc.Pages[0].Canvas.GetTextData())
Console.WriteLine(textData.Position + " " + textData.Text);
}
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object.
If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
TET, the Text Extraction Toolkit from the PDFlib family of products, can do that. TET has a command-line interface, and it's the most powerful of all text-extraction tools I'm aware of. (It can even handle ligatures...)
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
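For a feel of where those coordinates come from: text position is encoded by text-positioning operators (Td, Tm) in the page's content stream. A toy stdlib-Python sketch that pulls `x y Td (text) Tj` runs out of an uncompressed content stream; real PDFs are usually compressed and use escaped or hex strings, which is why tools like the ones above are needed in practice:

```python
import re

def text_positions(content_stream):
    """Very rough: find 'x y Td (text) Tj' runs in an uncompressed content stream.

    Returns (x, y, text) tuples in PDF user-space coordinates (origin at the
    bottom-left of the page). Ignores Tm matrices, escapes, and hex strings.
    """
    pattern = re.compile(r"([\d.+-]+)\s+([\d.+-]+)\s+Td\s*\((.*?)\)\s*Tj", re.S)
    return [(float(x), float(y), s) for x, y, s in pattern.findall(content_stream)]
```

For example, a stream containing `BT 72 700 Td (Hello) Tj ET` places "Hello" 72 points from the left edge and 700 points up from the bottom.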