Automatically remove all PDF content outside a crop area

For a deck of lecture slides, I have extracted several vector illustrations from a PDF file. I did this by highlighting the relevant area in Preview.app, copying it, and creating a new file from the clipboard.
The figures look just fine, even though I noticed that the files are a little large. When I open them in Illustrator, I can see what's described in the screenshot – that all of the page content is still there, it's just hidden because it lies outside the crop area.
Now I could simply remove everything except the relevant figures in Illustrator, but I would much rather automate the process, since I have a large number of figures.
How can I automate this process such that everything outside the crop area is discarded and everything inside it is preserved as a vector image?

You can use a redaction tool to remove the content:
Go to https://doxiview.cib.de/showcase/index.html?locale=default
Choose the redact tool.
Upload your PDF.
On the right, choose Select Area and set the redaction fill color to white.
Mark all the content you want to remove.
Click Apply.
Download the PDF.
Afterwards you can crop the PDF, and the removed content will no longer be in the file.

There's no need to rasterize. Just crop the pages then use Acrobat DC to "Sanitize" the document. That will completely remove any non-visible parts of the file.

In Acrobat Pro, go to Preflight and select the setting below.
Then click Edit on the right.
You should be able to create Adobe droplets with this preflight setting for automation.

How is hidden text stored in OCR-enhanced PDF files

EDIT 26.03.2018: Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important for me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem to be possible with Adobe Acrobat...).
For that, I've been looking at different solutions for OCRing my PDF files. I found three candidates that seem to do what they should (more or less). All three have their pros and cons, and they seem to take different approaches to storing the data in the PDF file. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat can open in one step (no preloading of any background layer), and after running a preflight script I can see the text that is stored hidden:
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem to work with the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to reproduce, these files seem to have a background text layer, which contains the OCRed text and lies beneath the image that is finally shown to the user. Unfortunately, this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
also does some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which draws nothing).
The difference between the latter two results is the choice of font:
Acrobat uses regular standard 14 fonts, for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs of this font do not appear as normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste. They are used only in the invisible rendering mode anyway.
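To make the "text rendering mode 3" point concrete, here is a minimal Python sketch that looks for the Tr operator in an already decoded content stream. The two sample streams are made up for illustration and are not taken from the files linked in the question, and the regular expression is a deliberate simplification: it does not parse PDF strings, so a literal "3 Tr" inside a text string would be miscounted.

```python
import re
from typing import List

# Matches the PDF "Tr" operator (set text rendering mode), e.g. "3 Tr".
# Simplification: this is a plain regex scan, not a real content stream parser.
TR_OPERATOR = re.compile(r"(?<![\w.])(\d)\s+Tr(?![A-Za-z])")


def text_render_modes(content_stream: str) -> List[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(match.group(1)) for match in TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: str) -> bool:
    """True if the stream switches to mode 3 (neither fill nor stroke)."""
    return 3 in text_render_modes(content_stream)


# Hypothetical stream in the Acrobat/Tesseract style: image first, then invisible text.
image_then_hidden_text = (
    "q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 720 Td (Hello) Tj ET"
)
# Hypothetical stream in the ABBYY style: visible text first, image drawn on top.
text_then_image = (
    "BT /F1 12 Tf 72 720 Td (Hello) Tj ET q 595 0 0 842 0 0 cm /Im0 Do Q"
)

print(has_invisible_text(image_then_hidden_text))
print(has_invisible_text(text_then_image))
```

In the second stream the text is hidden only because the image is painted over it afterwards, which is why a check for mode 3 alone does not find it.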

Scale text to fit in a text box in Illustrator using scripts (JavaScript)?

I have been trying to figure out how to get text to shrink to fit into its text box by scaling down the font size. I want to be able to do this for multiple text boxes at once. I don't have any code yet. I know a little JavaScript, but I'm not 100% sure how to do it for Illustrator.
You can use the code snippet in my LinkedIn article: Dealing with Overset Text.
The Illustrator scripting API gives one some control over the paragraphs, lines, words, characters and arbitrary text ranges of an Illustrator text frame. One upgrade my script could use though is to incorporate text-on-a-path - maybe someday soon I'll fix it up and update my article.
As for using the snippet, just run it however you choose to run scripts (put it into the app's Scripts folder or use File > Scripts > Other Scripts). When you run it, any overset text boxes that are area text will have their font shrunk until they are no longer overset. You can use this same snippet with Illustrator variable data to ensure a batch process will not have oversets.
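The general idea of shrinking until nothing is overset can be sketched independently of Illustrator. The following Python is not the script from the article and does not use the Illustrator scripting API (which is JavaScript); `is_overset` is a hypothetical callback standing in for whatever overset test the real script performs, and the width model in `make_width_check` is a crude assumption for the demo only.

```python
from typing import Callable


def shrink_to_fit(
    font_size: float,
    is_overset: Callable[[float], bool],
    step: float = 0.5,
    min_size: float = 1.0,
) -> float:
    """Lower the font size in fixed steps until is_overset(size) is False.

    Stops at min_size even if the text still does not fit.
    """
    size = font_size
    while size > min_size and is_overset(size):
        size = max(min_size, size - step)
    return size


def make_width_check(text: str, box_width: float) -> Callable[[float], bool]:
    """Hypothetical overset test: one line, each character is 0.5 * size wide."""
    return lambda size: len(text) * size * 0.5 > box_width


check = make_width_check("Overset example", 120.0)
fitted = shrink_to_fit(24.0, check)
print(fitted)
```

For many text boxes, the same function would simply be called once per box with that box's own overset check.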

Extract PDF coordinates using mouse click

I want to extract coordinates from a PDF document with a mouse click. I have gone through some posts, but since I'm new to this, I can't understand them properly. Also, can this be done if I render the PDF file in a web page?
You can add JavaScript to a PDF document, although you only get access to a limited subset of the language.
If you only need the coordinates once (for instance when doing the layout of the document), you can simply open it in Adobe's viewer and activate the rulers/grid option to see where your mouse pointer is currently located.
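If the page is rendered as an image (for example in a web page), a click position in pixels can be converted to PDF coordinates with a little arithmetic. This is only a sketch under stated assumptions: the page is rendered unrotated at a uniform resolution, the click is measured from the top-left corner of the rendered page, and the page uses the default PDF user space (units of 1/72 inch, origin at the bottom-left corner). The function name and the example numbers are made up for illustration.

```python
from typing import Tuple


def click_to_pdf_point(
    x_px: float, y_px: float, page_height_pt: float, dpi: float = 96.0
) -> Tuple[float, float]:
    """Convert a click on a rendered page (pixels, top-left origin)
    to default PDF user space (points, bottom-left origin)."""
    scale = 72.0 / dpi              # points per pixel
    x_pt = x_px * scale             # x grows to the right in both systems
    y_pt = page_height_pt - y_px * scale  # flip the y axis
    return (x_pt, y_pt)


# Hypothetical example: an A4 page (842 pt high) rendered at 144 dpi,
# clicked 144 px from the left edge and 288 px from the top edge.
point = click_to_pdf_point(144, 288, page_height_pt=842, dpi=144)
print(point)
```

A rotated page or a crop box that does not start at the origin would need an extra offset or rotation step, which is left out here.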

False dots around circles in PDF export of LibreOffice Draw

When I draw a small circle in LibreOffice Draw and export it to PDF, I get some extra dots around the circle, especially at its upper left and lower right outer corners.
See example PDF here: https://dl.dropbox.com/u/233922/example-dots-circle.pdf
or as a Screenshot here:
Do you have any idea how I can get rid of this?
This is an old bug that has not been fixed yet. I can reproduce it under Linux and Windows. My version: LibreOffice 4.1.0.
Create a new file in LO Impress or LO Draw.
Draw an ellipse (or a rounded rectangle, a smiley, etc.).
Set the line width to e.g. 5 mm (so the dots are easier to see).
Export as PDF.
I propose two workarounds:
1) Export to MS PowerPoint and export to PDF from there :/
2) Print to PDF (using e.g. cups-pdf).
ad 1) You must have MS PowerPoint, and your graphics may look bad.
ad 2) I use cups-pdf and the PDF looks very good, but:
Text is stored as bitmap graphics (small rectangles)! You cannot extract text without using OCR.
You must use a paper format from the list (A4, A0, Letter, etc.). If you use a non-standard paper format, you must pick a bigger format and you get white bars in the PDF. However, you can use pdfcrop to remove the white bars.
The PDF is always oriented horizontally. If you print in vertical orientation, you can rotate the PDF using the pdf270 command-line tool.
In Adobe Reader (version 11 at least), go to "Preferences" => "Page Display" and uncheck "Enhance thin lines".
LibreOffice seems to add dots of size 0 with practically no visibility. When "Enhance thin lines" is checked, Adobe Reader makes these dots visible.
Best wishes,
Patrick
Similar to dzwiedziu-nkg's answer (https://stackoverflow.com/users/1797782/dzwiedziu-nkg), I need a multi-step process to fix this issue.
Steps:
Open the file in a pdf viewer (Document Viewer for me in Ubuntu.)
Print the pdf to a file (also a pdf) from the viewer. I assume this also uses cups-pdf, as it modifies the image size. (I don't mind, because I use the next step to eliminate all margins anyways.)
Use pdfcrop to remove all the extra space around the actual content's bounding box. If you just give pdfcrop one argument, it doesn't overwrite the old file, so use the same argument twice:
$ pdfcrop monkey.pdf monkey.pdf
Another "workaround" that worked for me:
Go without outline. You can set the line style in Draw to "none" and just work with flat solid objects.
PS: I see these dots also in Draw, not just in the exported pdf.
A simple workaround is to "patch" the dot in LibreOffice Draw using a white object, say, a square with a white area and a white outline. Note that you cannot see the dot in Draw. So you first generate the PDF from the original drawing, see where the dot appears in the PDF, go back to Draw, and add a white patch where it is required.
Searching for a workaround myself, I've found this awk script called odg2epsfix that will fix the exported EPS to not contain those ghost dots anymore.
I stumbled upon it in this launchpad bug entry.
Fixed in LibreOffice pre-export.
Steps:
Right click on the circle in LibreOffice and select "Line"
On the "Line" page, set "Corner Style" to "-none-"
Save document and Export as PDF.
The dot is gone without disabling "Enhance thin lines". Mine still shows in the preview but doesn't print.
The bug is still present in LO 6.0. But if you set "Cap style" to "flat" in the "Line" tab of the "Graphic Styles", the dots disappear from the screen and from the exported pdf.

How to get the path coordinates of a shape for use with image-maps?

I am creating an image map using ImageMapster from here.
I have created a photoshop image with several images that I have cut out from the original photographs. Each image is on a separate layer.
Now, I need to get the path coordinates of each object, and I don't want to hover over every corner and manually write down each coordinate.
Is there an automated way to get this path?
Maybe there is some application or web service to which I can send my image and get the path in return?
I have tried exporting each layer separately, importing them into Illustrator and vectorizing the shape (it keeps the shape in its original position), but I can't figure out how to get the coordinate path as text. I can export it to SVG, but that isn't the same simple code needed for the CSS image map.
Ah! After googling image-map, much thanks to Sven for the idea (he got my +1), I found this thread here on Stack Overflow.
So here is my process.
Prepare the image in Photoshop with each object on a separate layer with a transparent background (this will make it easy for you when you do the tracing).
Save your photoshop file.
Open the Photoshop file in Illustrator using File...Open (works in CS4 and CS5) and make sure to allow the option to import Photoshop's layers as separate objects. After you open the file, make sure NOT to move any of the objects around - you need them to be in the exact same place as they were in the photoshop file so they can superimpose each other when rendered to the imagemap.
Use Live Trace with custom settings. Use the black & white mode with the threshold all the way up (255). This will produce a black silhouette of the shape. (You can also use "ignore white".) Push the Trace button. If you have many layers, you can save this new tracing pattern as a preset. I called mine Silhouette. Now I just click on a layer and choose Silhouette from the tracing button's dropdown menu.
Expand the shape and make sure it consists of only a single flat shape:
you can use the blob brush in illustrator to blacken over any unwanted white areas
no groups
no compound shapes (or it won't work) - which means you can't create cutouts.
You can tell the shapes are right when you click on them - you should be able to see the path itself with no "other" shapes involved (perhaps the blob brush additions) - just a single path. An easy method is this:
select the shape
ungroup if necessary
release compound path
unite (shape mode merges all shapes into one)
Don't crop your image - you want your shape to be in the same place in the image's area as in your original photoshop image.
Don't join all the shapes together, either.
The shapes should all be individual whole shapes, all in their original locations, each on a separate layer.
Now, open Illustrator's Attributes panel, and make sure to "show options".
Select your shape and in the "Attributes" panel, switch the "Image Map" combo box from None to Polygon. Make sure to add a url (it doesn't matter what you put; you can change it later - I just put "#" and the name of the shape so I can tell which one it belongs to in the image map code)
Do this for each of the objects.
Now, in the File menu, go to "Save for Web and Devices". Skip all the settings here and just push "Save".
In the "Save As" dialogue box (the title of the window is "Save Optimized As"), use "Save As type:" and select HTML Only (*.html) if you just want the code, or HTML and Images if you want the silhouettes, too (they will appear in a folder called "images"), and note your save location.
Now go open that HTML file in Notepad!
Voila! All the shapes will be rendered for you as a pre-made image map, with the point paths and even the HTML code. Here is what it looks like when you open the HTML file you just created in Notepad. For this demo, I chose a particularly complicated image, one which you would never want to estimate by hand, nor have to do twice!
Don't forget to place the actual image file somewhere in your site's images folder. You can save the PSD file for later, add more "stuff" if you want, and repeat the process.
I was able to create the image map this way for my Photoshop picture in just a couple of minutes. After you do it once, it gets easier the next time.
This had been bugging me for so long, and I don't have Illustrator to be able to use the solution proposed by BGM, so I created my own Photoshop addon.
You can get it here: https://creative.adobe.com/addons/products/2389
It writes all your paths' points' coordinates to a text file.
Should work for CS6 and above.
The way I use it is I create a marquee, right click -> make work path, rename my path, [repeat], then just export coords via my addon.
If anyone's interested in the scripts behind it, you can have a look here: http://pastebin.com/8ugcAV3j
In case you make any improvements, please post them here so that other people may use them as well.
Hope this helps someone.
EDIT: added link to source script (was only in comments before)
I used this to find the coordinates of the outline of a shape to make image hotspots for links in Dreamweaver. If you have something else in mind, then you'll have to ignore some of it. This works on a single layer, so you may want to make a flattened copy first, but I don't see why it wouldn't work on a multi-layered image.
Use the wand to select the area you want. This will be different for different images.
Right click and hit Make Work Path. Use a suitable tolerance, which is found by trial and error. I just use the most sensitive setting.
Do this for all areas in all of your images, creating a separate path for each.
Click Edit, then Export Paths to Illustrator, and save the file in a sensible place.
Open the saved file in Word. Ignore the bumf at the top and use Replace to remove ALL LETTERS. Don't worry about the paragraph characters.
Note that all of the work paths are exported in the same file, separated by a blank line, so they must be copied and pasted separately to be used for each hotspot.
After inserting your image, start making a map in Dreamweaver with a couple of coordinates, then simply replace these with the information from the Illustrator file for each of the map areas to be produced.
I'm adding an updated answer, which I had to find since Adobe has eliminated HTML output in many instances. I work mostly with Photoshop (CS4), and this is a perfect solution:
1) Download the script from: https://github.com/andyhawkes/ps-paths-to-imagemap
2) Open your image in Photoshop and select the shape with the magic wand.
3) Right click and select 'Make Work Path' (the smaller the tolerance in px, the more accurate the path).
4) Go to File -> Scripts -> Browse... and select the script from the first step.
That's it! The script will open your text editor with the coordinates.
Something like this may be useful;
http://code.google.com/p/imagemap/
Copy your image into position, then plot.
Creating an image map is really simple.
First we need to look at the syntax of the code.
Let's create a div. If we want to position it at the right side of our page, we can begin by writing
<div align="right">
After that, we add the image that we are going to map.
<img src="" alt="" width="" height="" usemap="#nameofmap" />
Now we have to define the map structure. First, let's assume that you want a rectangular portion of the image to act as a hyperlink.
<map name="nameofmap">
<area href="wherever I wanna take that.com" alt="" title=""
shape=rect coords="A,B,C,D"></map>
Now we close the div.
</div>
If the shape is circular, we use the syntax
shape=circle coords="x,y,radius"
If the shape is polygonal, we use
shape=poly coords="a,b,c,d,e,f,g,h"
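Once you have the coordinates, the three area syntaxes above are easy to generate from a list of points instead of typing them by hand. Here is a small Python sketch; the function names, hrefs and coordinate values are made up for illustration.

```python
from html import escape
from typing import Iterable, Tuple


def area_tag(shape: str, coords: Iterable[int], href: str, alt: str = "") -> str:
    """Build one <area> element; the coords are joined with commas."""
    coord_text = ",".join(str(c) for c in coords)
    return (
        f'<area shape="{shape}" coords="{coord_text}" '
        f'href="{escape(href, quote=True)}" alt="{escape(alt, quote=True)}">'
    )


def poly_area(points: Iterable[Tuple[int, int]], href: str, alt: str = "") -> str:
    """Flatten (x, y) pairs into the x1,y1,x2,y2,... form that shape=poly expects."""
    flat = [value for point in points for value in point]
    return area_tag("poly", flat, href, alt)


# Hypothetical example values.
rect = area_tag("rect", [0, 0, 100, 50], "#box", "Box")
circle = area_tag("circle", [60, 60, 25], "#dot", "Dot")
triangle = poly_area([(10, 90), (50, 10), (90, 90)], "#triangle", "Triangle")

print(rect)
print(circle)
print(triangle)
```

The resulting strings go between the opening and closing map tags shown earlier.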
Now comes the big part: how to find the image map coords.
Very simple. Go to
http://www.image-maps.com
Browse to your image file, click "Start Mapping your image", then proceed, and on the next page click "Import Old mapping Code" on the right. Then you get the coords.
After that, you can use Firebug to change the coords according to your specifications, because image-maps only hyperlinks the whole image, so use Firebug to adjust the coords to your requirements.
Have fun.