Need help/answers on PDF color seperation - pdf

Using the following process:
A PDF is been created by PDFCreator, when a user prints something to the virtual printer
The PDF gets further processed with integrated VBScript handler and passed over to JAVA which does some processing with the PDF content
In the middle of the process an external application is called with the PDF that adds black text and graphics to the PDF
The PDFs are collected and once a week handed over to a print shop that uses a plate for each CMYK
The problem is: the print shop needs a color seperated CMYK PDF, but the added black text & graphics from the external app should be the only content on the K plane (because we want to make a special print effect). All other content which has been printed via PDFCreator should be on CMY plates only, so black must be emulated with those colors.
At the moment we are manually braking the process before calling the external application and seperate the colors via Adobe Creator Pro, but that is no future option because the whole process should work automated.
So basicly I need a way to convert the CMYK PDFCreator PDF to a CMY version only so the external app can throw in as many black K content as needed.
Is the PDF conversion the right direction I'm heading to? Is there any way w/ ghostscript how this can be done? I read the gs documentation but got nowhere as I only saw RGB to CMYK conversion but no CMYK to CMY with empty B...

I believe that PDFCreator is simply a wrapper around ghostscript, so you may have some joy on the ghostscript mailing lists. It seems that gs does support some printers that just ouput CMY so this functionality is likely to be available in there.
Wouldn't you be better off using a new separation called Black? Can't the print shop handle that?

Related

How can I convert a PDF to CMYK with different kinds of black?

I have a PDF that I created in Inkscape. I would like to convert it to CMYK for a printing company, but I have one issue - I want to convert some blacks to "rich black" (20/20/20/100) and others to "flat black" (0/0/0/100). I have 60 cards and would like to automate this rather than having to do it manually for each file. How can I do this?
Specifically, the card has text and design elements on it that shouldn't be rich black because they're too small and detailed, but the printing company suggests that since the border is solid black that part should be rich black. I've managed to convert my PDF to CMYK by following the instructions here but all the blacks seem to be rich.
Well firstly you haven't said what criteris you want to use to differentiate between your types of black, how do you propose to instruct the application which objects drawn in 'black' should be converted to pure K, and which ones should be a mixture of CMYK ?
You don't say what the input colour space is either.
Ghostscript's pdfwrite device does not distinguish between object type, you can (with effort) use a customised ICC profile to create a pure K output from a given input space and colour value but there's no way to say "I want to use this profile for this bit of the file, but this other profile for this other bit of the file".
Also, this isn't (as written) a programming question.

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking for different solutions how to OCR my PDF Files. I found three candidates which seems to be doing what they should do... (more or less). But all three variants have their pro&cons... But there seem to be different approaches how to store data in PDF Files.... for all three Variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But far as I was able to reproduce these Files seems to have a Background-Text-Layer, which contains the OCRed Text, which is the underlying layer for the Image that is shown to the user at the end. Unfortunately this seems to be loaded separately and this is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font GlyphLessFont it embeds a font program for into the result file. When rendered the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.

Issue with ghostscript rendering PPT into PDF

I've been tinkering with Ghostscript with a port monitor(on a HP PCL 6 Universal driver) to convert print job into PDF. I've tested with a few applications such as Words, Excel, Adobe Reader, Microsoft Edge etc and they are all working properly.
However upon testing Microsoft Powerpoint 2016, it seems like there are some graphics that are unable to be rendered properly through Ghostscript.
Actual Slide Below
Output From Ghostscript in PDF Below
I've tested this even with some other PDF generators such as BioPDF,CutePDF as well as AdobePDF and they would all result in the same output as above.
Just wondering has anyone tried and have faced similar issues before? if so could someone point me in the right direction??
What you are doing isn't a single step PowerPoint to PDF and Ghostscript is not rendering the PowerPoint. In fact if you are creating a PDF file Ghostscript isn't (ideally) rendering anything.
What's actually happening is that you are asking PowerPoint to print to a canvas, which is then passed to the PostScript printer driver. That produces PostScript which is sent to the Port. Your (and others) Port Monitor then sends the PostScript to the 'Distiller' (in your case Ghostscript and the pdfwrite device). The Distiller reformats the vector drawing commands into a PDF format and builds a PDF file from them. It doesn't render (turn into a bitmap image) anything unless forced to.
Obviously there are several places along that road where the problem could creep in. Given that you say that the Adobe product (the others you mention al use Ghostscript) has the same problem, I think its safe to assume that the problem isn't Ghostscript.
This also means that you aren't using the driver you think you are. Adobe can't handle PCL as an input medium as far as I'm aware, and nor can Ghostscript. GhostPCL will handle PCL as an input, but that's not what you say you are using.
Of course you haven't linked to an example file to demonstrate the problem, nor supplied an example command line, so this is all supposition.
Now if, somehow, you are using a PCL6 device, then the problem is most likely due to the presence of rasterOps in the output. Rasterops are part of the PCL imaging model which do not exist in PDF and are a form of transparency. There are three ways to handle such content for a PDF output device; firstly render the whole page content to an image, secondly ignore the rasterOps objects, thirdly treat the rasterOps as opaque.
GhostPCL and the pdfwrite device take the third option. So, its just conceivable that your original content has some transparent objects which are being handled as rasterOps by the PCL printer driver, and then rendered as opaque by GhostPCL and the pdfwrite device.
If that's somehow the case then the solution is simple; don't use a PCL printer driver, use the PostScript one.
If you post a link to a (simple, eg single page) example of what you are sending to Ghostscript, and a command line, then I can look at it. Please don't send me the PowerPoint, I can't use it and even if I could, my print setup would not match yours. I need the data being sent to Ghostscript.
[EDIT after looking at files]
Don't mean to sound like I'm lecturing, the problem is people find these result on Google searches and then try to apply them based on a poor understanding of what's happening. So I find it best to be really clear in my answers about what's going on. It saves questions later :-)
The first thing I see is that the PCL is indeed PCL, and if you try running that through Ghostscript it throws horrible errors and exits. So presumably you aren't doing that.
The PostScript file contains nothing except huge images, rendered (presumably at 600 dpi) contains 2 pages, the two pages look like your images above. Which is why the PostScript is better than 20 times larger than the PCL file.
But.... If I open the .ppt file with OpenOffice (4.0.0 is what I have to hand) I see exactly the same thing. I don't, I'm afraid, have a copy of Microsoft PowerPoint, but from what I see here there are two conclusions;
firstly that the PDF I get looks pretty much like the PowerPoint when viewed with OpenOffice at least. So there's something 'interesting' about your PowerPoint.
secondly, even if that's not what you expect, its what's in the PostScript program. That means that either PowerPoint rendered the slide to a bitmap or the Windows printing system/HP driver did.
Now, if I run the PCL through GhostPCL instead of Ghostscript (rendering, not producing a PDF) then the result is more like what I think you are expecting. However, when sent to a PDF file the result is horrible. Which strongly suggests to me that there's some form of transparency involved, PostScript doesn't support transparency at all, and PCL does it through rasterOPs.
I'm afraid that this means that the problem lies either in PowerPoint, the Windows print system or the PostScript printer driver you are using. Since the PCL is at least close to what you expect, I suspect that this means PowerPoint is doing the right thing, and its the printer driver messing up. It appears you are using the Windows PostScript printer driver.
So there's no way you can 'fix' this for files like this, at least not with Ghostscript. You would need to 'fix' the Windows PostScript printer driver, or possibly the Windows print system. You could try reporting a bug to Microsoft, presumably these files print incorrectly when sent to physical PostScript printers too.

Convert PDF to PS in Ghostscript and preserve CMYK split

I have an RGB PDF which I've preflighted in Adobe Acrobat pro to a PDF x1a compliant PDF in US Web Coated SWOP v2.
The PDF now has 4 plates (C/M/Y/K)
C plate is empty
M plate has 100% of a red image
Y plate has 100% of the same red image
K plate has 100% of black text on page (text is not on any other plate)
I'm now trying to convert that PDF into a PS using ghostscript
I've tried:
gs -dNOPAUSE -dBATCH -sDEVICE=ps2write -sProcessColorModel=DeviceCMYK -sOutputFile=output.ps input.pdf
But then when I distill this PS back to a PDF the text is on all the plates and not just the K plate.
I've used this online tool:
http://pdf.my-addr.com/free-online-pdf-to-ps-convert.php
To also do the conversion and the distilled version of the PS generated by that preserves the plate breakdown. They are also using Ghostscript to create the PS.
So I'm assuming there is some setting I am missing.
Does anyone know?
Update 1
Trying in pdftops too and again it is taking my K plate and spreading it across all CMYK plates.
What secret magic are they doing on that web site to preserve plates?!
Update 2
Only main difference I can see is I'm using
%%Creator: GPL Ghostscript 905 (pswrite)
and that website is using
%%Creator: GPL Ghostscript 871 (pswrite)
Could it be a version thing, or are they doing something I'm not?
Ghostscript 9 and above use much better colour management than previous versions, but you have to get the ICC profiles correct. I'd guess you are using the default profiles and I think the first thing I'd suggest is that you use the current version of Ghostscript which is 9.07, I think there were a few changes made to the default profiles.
Its also possible that the PDF file now has an input RGB profile associated with it, which Ghostscript is now using whereas previously it didn't. I'd need to see the file to be able to tell better what is going on, but I have a sneaky suspicion that your 'pre-flight' conversion is causing the problem. What happens if you use the original PDF file ?
I very much doubt if the PDF file actually contains CMYK colour components, I would imagine all that has happened is that different profiles have been inserted into the file that control the conversion from RGB to CIE and from CIE to CMYK.
In passing, don't use pswrite. Its a terrible low-level output that converts much of the content into images. It produces large PostScript that processes very slowly and doesn't scale well (ie if the printer is a different resolution). Use the ps2write device instead.
By the way, since you've already used Acrobat, why don't you just use 'Save As' PostScript from there ?

PDF Box file not compatible with Fax machine

I am using apache pdf box api (java)for generating pdf's from pre-defined template(Created from Open office). Now when i try sending that pdf to fax machine(Rightfax server), its printing with Dots in it. Also if I change fonts nothing is printed except some dots in it.
Kindly advice whether there is any settings to be done in template creation to be compatible with fax format.
Regards,
Niteesh
My guess is that the PDF document was produced in color or greyscale. Fax is strictly black & white. The dots that you're seeing are a result of the dithering process, when the color or greyscale document is converted to black & white.