I receive postage labels from a supplier as single page pdf documents. The labels would fit on an A5 sheet but they are presented as a portrait within an A4 page, also in portrait orientation. I would like to be able to print two of these labels per A4 page to cut down on waste.
This can be achieved by rotating the page content without rotating the page itself, or by resizing the page by swapping the height and width about the content. I am aware that both of these can result in content being lost, which isn't a problem for my use case. Ideally I'd like a command line application that works on both Linux and Windows machines. Unfortunately, web searches for "rotate" or "resize" pdf point to the many applications that just rotate or resize pdf pages along with the content, which isn't what I want.
Similar questions:
With PdfBox: identical use case, see my comments on PdfBox below.
With iText: almost identical use case, I explicitly don't want any resizing of the content. See my comments on iText below as well.
Things I have investigated or tried:
pdftk - too basic
ImageMagick - the original image contains transparency and the extent argument results in a visible loss of quality
pdfjam - also requires installing LaTeX and pdfpages. Ideally I'd like something that works on both Windows and Linux.
iText7 - the documentation isn't great. It looks like it was completely re-written in the last few years, and the NuGet feed makes it clear that the previous version, iTextSharp, is EOL. Consequently most of the examples one finds online (including on this site) are out of date. iText7 doesn't let you resize a page. I got as far as saving a document with a new page that was the right size but struggled to copy the content over. I think I could get what I wanted from this but it would take a long time and I'm trying to do something simple.
PdfBox - I've already tried one .NET library without success. Looking at the comments to the question I've linked above, this one seems to also have a version issue. I'm trying to do something really simple here, I will try this one if I exhaust all other avenues
Gimp - does what I want but I have to fire up the application, point and click quite a few times to rescale the image canvas, set the background and export
Screenshot the label from a pdf reader at 100% size and paste into a Word/LibreOffice doc. Sadly this is the most reliable method I have at the moment
I have example labels but they contain the name and address of people I've sent things to, I'd rather not upload them.
Try the command line tool cpdf from here: https://community.coherentpdf.com
cpdf -rotate-contents <angle> in.pdf -o out.pdf
to rotate contents without rotating the page. or...
cpdf -mediabox "100 100 600 500" in.pdf -o out.pdf
(and -cropbox and so on) to change page dimensions without altering content. Chapter 3 of the manual is of relevance.
You can also prepare the file by removing any page rotation whilst counter-rotating the content to leave the visual appearance unchanged:
cpdf -upright in.pdf -o out.pdf
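For anyone curious what "rotating the contents without rotating the page" amounts to inside the file: in PDF terms the page boxes stay as they are and a transformation matrix (the `cm` operator) is applied to the page content. Below is a minimal Python sketch of that arithmetic. It only computes the matrix; it does not touch a PDF, and it is not a description of how cpdf works internally. The function names and the choice of rotating about the page centre are assumptions for illustration, and A4 is taken as 595 x 842 points.

```python
import math

def rotation_about(cx, cy, degrees):
    """Return (a, b, c, d, e, f) for a PDF `cm` matrix that rotates content
    counter-clockwise by `degrees` about the point (cx, cy)."""
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # The translation (e, f) is chosen so that (cx, cy) maps to itself.
    e = cx - cos_t * cx + sin_t * cy
    f = cy - sin_t * cx - cos_t * cy
    return (cos_t, sin_t, -sin_t, cos_t, e, f)

def apply(m, x, y):
    """Map the point (x, y) through the matrix, as a PDF viewer would."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def cm_operator(m):
    """Format the matrix as it would appear in a content stream."""
    return " ".join(f"{v:.4f}" for v in m) + " cm"

A4_W, A4_H = 595.0, 842.0  # assumed A4 size in points
m = rotation_about(A4_W / 2, A4_H / 2, 90)
print(cm_operator(m))
print(apply(m, 0.0, A4_H))  # where the top-left corner of the page ends up
```

With a 90 degree rotation about the centre of a portrait A4 page, the old top-left corner gets a negative x coordinate, so it falls outside the media box. That is the content loss the question mentions, and it is harmless when the label only occupies part of the page.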
// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important for me that the files remain usable for a very long time, and if possible the OCR should be applied automatically to new files (which doesn't really seem to be possible with Adobe Acrobat...).
To that end I've been looking at different ways to OCR my PDF files. I found three candidates which seem to do what they should (more or less). All three have their pros and cons, and each seems to take a different approach to storing the data in the PDF file. Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But as far as I was able to reproduce, these files seem to have a background text layer which contains the OCRed text and lies underneath the image that is finally shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and of Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
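To make the two layering strategies described above a bit more concrete, here is a rough Python sketch that looks at an already decoded page content stream and reports which strategy it resembles. The two sample streams are hand-written, heavily simplified stand-ins and are not taken from the actual sample files. The whitespace tokenizer is naive (it would trip over strings containing spaces), and the sketch assumes `Do` always paints the scanned image, which is not generally true since `Do` also paints form XObjects.

```python
def describe_text_layer(content_stream):
    """Classify how text relates to the image in a decoded content stream."""
    tokens = content_stream.split()  # naive tokenizer, fine for these samples
    # "3 Tr" selects text rendering mode 3 (neither fill nor stroke).
    invisible = any(
        tok == "Tr" and i > 0 and tokens[i - 1] == "3"
        for i, tok in enumerate(tokens)
    )
    first_text = next((i for i, t in enumerate(tokens) if t in ("Tj", "TJ")), None)
    first_image = next((i for i, t in enumerate(tokens) if t == "Do"), None)
    if first_text is None:
        return "no text"
    if invisible:
        return "invisible text (rendering mode 3)"
    if first_image is not None and first_text < first_image:
        return "visible text covered by image"
    return "visible text"

# Hypothetical, simplified streams for illustration only.
text_then_image = "BT /F1 12 Tf 72 700 Td (Hello) Tj ET q 595 0 0 842 0 0 cm /Im0 Do Q"
image_then_hidden = "q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"

print(describe_text_layer(text_then_image))
print(describe_text_layer(image_then_hidden))
```

To try this on real files you would first need a PDF library to extract and decompress the page content streams; that part is deliberately left out here.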
I want to create a visualization of a matrix for some academic work. I decided to go about this by having the pixels in the image correspond to the values in the matrix. I created the nice small png that follows:
When properly scaled up, you get a very reasonable image:
This is a screenshot from within Inkscape. However, when I export this as a pdf, both evince and chrome do a terrible job of upscaling what should be very trivial, and instead I get something that looks like:
The pdf itself seems to scale appropriately well for printing, but unfortunately I do a lot of my editing without printing, and this looks unacceptable. I did find this incredibly old thread about people seeming to have a similar issue with chrome's pdf viewer, and the "solution" was to just upscale the raster graphics. This is a solution, but is terribly inefficient.
Is anyone aware of a way to change the pdf so that it gets upscaled appropriately? Maybe a config change in evince or chrome that will render these properly? Even a nice way to go from a raster image to a vector image might be suitable?
The comments aggregated into an answer...
An image dictionary in a PDF has an (optional) boolean entry Interpolate. It is specified as a flag indicating whether image interpolation shall be performed by a conforming reader.
The program used by the OP to create the PDF, Inkscape, seems to have explicitly set this flag to true. Editing the PDF to unset this flag creates a file which looks as desired by the OP.
(Another solution, proposed in this Inkscape forum thread eventually found by the OP, is to save the PDF with high-resolution bitmaps embedded: File -> Inkscape Preferences -> Bitmaps -> Resolution for Create Bitmap Copy, and set it to 6000 dpi.)
The fact that interpolation looks different in different viewers and on different output media is by design. The PDF specification states on interpolation:
A conforming Reader may choose to not implement this feature of PDF, or may use any specific implementation of interpolation that it wishes.
A different way to get around this problem (especially as some PDF viewers have the tendency to not really live up to the specification and e.g. interpolate ignoring that flag) would be to use vector graphics here, drawing the bitmap pixels as rectangles. The result should be optimal.
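As a rough illustration of the "pixels as rectangles" idea, here is a small Python sketch that writes a matrix out as an SVG made of one filled rectangle per entry; the SVG can then be opened in Inkscape and saved as a PDF. It assumes the matrix values are already scaled to the range 0 to 1 and maps them to grey levels. The function name, the cell size and the grey mapping are all choices made for this example, not anything prescribed by Inkscape or the PDF specification.

```python
def matrix_to_svg(matrix, cell=10):
    """Render a matrix (rows of values in [0, 1]) as an SVG of grey squares."""
    rows, cols = len(matrix), len(matrix[0])
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{cols * cell}" height="{rows * cell}" '
        f'shape-rendering="crispEdges">'  # hint to avoid anti-aliased seams
    ]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            # Clamp to [0, 1], then map to a 0..255 grey level.
            level = round(255 * min(max(value, 0.0), 1.0))
            parts.append(
                f'<rect x="{c * cell}" y="{r * cell}" '
                f'width="{cell}" height="{cell}" '
                f'fill="rgb({level},{level},{level})"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)

example = [[0.0, 1.0],
           [1.0, 0.0]]
svg = matrix_to_svg(example)
with open("matrix.svg", "w") as fh:
    fh.write(svg)
```

Since every cell is a vector shape, there is no image dictionary and hence no Interpolate flag for a viewer to honour or ignore. The price is file size, which grows with the number of matrix entries.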
Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)
You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.
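To give an idea of what such cataloguing code involves, here is a sketch in Python (rather than with PDFBox) that scans an already decoded content stream for the device colour operators `rg`/`RG`, `k`/`K` and `g`/`G` and lists the distinct colours it sees. The sample stream is hand-written, the tokenizer is a naive whitespace split, and colour spaces set via `cs`/`scn`, images, shadings and form XObjects are ignored entirely, so treat it as a starting point only. Note that PDF colour components run from 0 to 1.

```python
# Operator -> (label, number of operands). Lower case sets the fill colour,
# upper case the stroke colour.
COLOUR_OPS = {
    "rg": ("RGB", 3), "RG": ("RGB", 3),
    "k": ("CMYK", 4), "K": ("CMYK", 4),
    "g": ("Gray", 1), "G": ("Gray", 1),
}

def list_colours(content_stream):
    """Return the distinct device colours set in a decoded content stream."""
    tokens = content_stream.split()  # naive: breaks on strings with spaces
    found = []
    for i, tok in enumerate(tokens):
        if tok not in COLOUR_OPS:
            continue
        label, n = COLOUR_OPS[tok]
        if i < n:
            continue  # not enough operands before the operator
        try:
            values = [float(v) for v in tokens[i - n:i]]
        except ValueError:
            continue  # operands are not numbers, so not a colour operator
        entry = f"{label}({','.join(format(v, 'g') for v in values)})"
        if entry not in found:
            found.append(entry)
    return found

# Hypothetical sample stream for illustration only.
sample = "1 0 0 rg 10 10 100 50 re f 0 0 0 1 k 20 20 30 30 re f 1 0 0 RG 0 0 m 10 10 l S"
for colour in list_colours(sample):
    print(colour)
```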
Most PDF tools have access to this information internally but no API to expose it. You could take any tool and add one.
Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.
We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,"QuickPDFCS2eb0f578",Separation,"HKS 52 E",DeviceCMYK,0.95,0,0.55,0
Resource,"QuickPDFCSb7b05308",Separation,"Black",DeviceCMYK,0,0,0,1
Resource,"QuickPDFCSd9f10810",Separation,"Pantone 117 C",DeviceCMYK,0,0.18,1,0.15
Resource,"QuickPDFCS9314518c",Separation,"All",DeviceCMYK,0,1,0,0.5
Resource,"QuickPDFCS333d463d",Separation,"noplate",DeviceCMYK,1,0,0,0
Resource,"QuickPDFCSb41cafc4",Separation,"noprint",DeviceCMYK,0,1,0,0
Resource,"Cs10",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,"Cs10",Separation,"P1495",DeviceCMYK,0,0.31,0.69,0
XObject,"R29",Image,,DeviceRGB,-1,-1,-1,-1
Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help)
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator
If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be bigger or different to (though obviously broadly similar to) the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).
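The last step, going from a rendered bitmap to a list of colours, is simple once you have the pixels. Here is a minimal Python sketch that assumes the page has already been rendered by some library into a flat list of (r, g, b) tuples; the rendering itself and the tiny sample "bitmap" below are outside the scope of the sketch and purely illustrative.

```python
from collections import Counter

def colour_counts(pixels):
    """Count how often each (r, g, b) pixel value occurs, most common first."""
    counts = Counter(pixels)
    return [(f"RGB({r},{g},{b})", n) for (r, g, b), n in counts.most_common()]

# Stand-in for a rendered page: three red pixels and one black pixel.
rendered = [(255, 0, 0), (255, 0, 0), (0, 0, 0), (255, 0, 0)]
for name, n in colour_counts(rendered):
    print(name, n)
```

On a real render, anti-aliasing and image interpolation will add many near-duplicate colours, so in practice you would probably want to group similar values before reporting them.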
We are using Blackberries to display PDF reports. Here are background details on the problem:
The PDF reports are created using JasperReports.
Report format can be changed.
Different report formats are available (as per the feature set of JasperReports).
The PDF reports are on a website, too, so retaining a single source is ideal.
The page setup is in Landscape.
Here are the issues we have encountered:
Users cannot see a full line of text on the Blackberry.
The size of the PDF and UI makes reading difficult, at best.
The menu option to convert the PDF to text loses too much formatting to be useful.
The text is blurry (and too small).
Here are solutions we have thought about:
Create a second report (not ideal) in text or HTML format.
Simplify the original report format (not really an option, given the amount of data).
What other options are there for making a report available on the Blackberry, given the constraints of JasperReports, such that the report:
Is legible?
Is formatted for readability?
Displays quickly?
Essentially, we'd like to make sure there are no simple solutions we have overlooked for displaying legible PDFs on Blackberries.
We convert TIFFs to PDF for one of our applications, and have had mixed results with BlackBerry PDF viewers. These were our results.
Working
The following PDF readers worked for our purposes:
RepliGo Reader v1.1.1.1 - $19.95
Works fine.
DataViz Documents To Go Premium Edition v1.003.001 - $49.99
Works and includes a word wrap option to get the current zoom level to fit the available screen width, by moving text onto subsequent lines. Might fit your needs.
Non-Working
The following PDF readers did not work for our purposes:
BeamReader v1.0.8 - $17.99
BeamSuite v3.0.2 - $49.99
These couldn't open our PDF files ("Unsupported document format"). In addition they did not register as a PDF content handler, required for our application.
MasterDoc - $19.95
eOffice - $29.95
These also did not register as a PDF content handler. We had a range of problems with these, including installation issues, and not being able to open any PDFs at all.
Try BeamReader http://www.slgmobile.com/beamreader.html
I hear it's the best at reading PDFs for BlackBerry
How about outputting the file to an RTF or an image file (JPG/GIF), and then viewing them in your web browser?
If that doesn't work well on the native browser, I would focus on viewing the file via some other web browser - for example, Opera Mini. I know for images it's easier to navigate "big" images in Opera Mini than the native browser.
If your blackberries are on a BES server, couldn't you display the reports as HTML on your corporate intranet? - Then you could email a link to the blackberry and simply browse the report.
You can convert the PDF to an image via xpdf and then show the image. xpdf is the best PDF renderer.