Irregular shapes Zone in PDF - pdf

Is there a way to crop polygon (irregular shapes) zones in PDF using GhostSCript.

No, you cannot use Ghostscript to do that.
But yes, principally you can do that in PDFs. The concept then is not called "trim" or "crop" but "clip". Clipping is a basic concept of the PostScript and the PDF graphic models.
There are related "operators" for clipping any graphic object in either page description language. Search for "PLRM.pdf" and for "PDF3200_2008.pdf" on the Adobe websites. They contain the official specifications for the two formats.
Then, for both languages/formats lookup the operators for "clipping path":
W and W* for PDF
clippath, clip and other keywords containing 'clip' for PostScript
If you want to clip irregular shapes from complete pages of existing PDFs, you have to create an additional PDF page which constructs an opaque area that has the irregular "hole" clipped into it. Then overlay the new page to the old one.

Related

How to detect Font Family in PDF?

I have a pdf which contains many fonts and What is the best way to check whether it contains font that belongs to Arial font family ?
Is this possible in any language?
I couldn't find any library or language that could do this.
So, I tried by converting pdf to image using ImageMagick and segmented all alphabets present in the image(pdf).And then I tried to compare all segmented alphabets to segmented images of arial font family alphabets which worked fine.
I created all datasets using MS Word.But arial font family looks different in different editors.By 'looks different', I mean the segmented image of same alphabet has different pixel values in different editors.And also alphabet of 10pt size has different dimensions in different editors.So, this method doesn't work.
Any Suggestions on how to do this? May be using svg file or ps file
I also came to learn that, In pdf's alphabets are rendered using Bezier curves, where each bezier curve is drawn using some control points and nodes.
Are these control points, same for all alphabets that belong to one font family? If Yes, how to extract control points of alphabets in pdf as these can be used to detect font family.
There can be three types of text in your document:
text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.
The answer to your question depends on the type of text you're confronted with:
There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although, it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you'd need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds with an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding mapping characters to glyphs. In the page content we use characters, based on these characters, glyphs will be selected in the font.
You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.
Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?
Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.
Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a Font Descriptor where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font descriptor will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.
Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.
Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.
NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.
I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.

Blurry results on EPS/PDF import [duplicate]

I want to create a visualization of a matrix for some academic work. I decided to go about this by having the pixels in the image correspond to the values in the matrix. I created the nice small png that follows:
When properly scaled up, you get a very reasonable image:
This is a screenshot from within inkscape. However, when export this as a pdf, both evince and chrome do a terrible job at upscaling what should be very trivial, and instead I get something that looks like:
The pdf itself seems to scale appropriately well for printing, but unfortunately I do a lot of my editing without printing, and this looks unacceptable. I did find this incredibly old thread about people seeming to have a similar issue with chrome's pdf viewer, and the "solution" was to just upscale the raster graphics. This is a solution, but is terribly inefficient.
Is anyone aware of a way to change the pdf so that it gets upscaled appropriately? Maybe a config change in evince or chrome that will render these properly? Even a nice way to go from a raster image to a vector image might be suitable?
The comments aggregated into an answer...
An image dictionary in a PDF has an (optional) boolean entry Interpolate. It is specified as a flag indicating whether image interpolation shall be performed by a conforming reader.
The program used by the OP to create the PDF, Inkscape, seems to have explicitly set this flag to true. Editing the PDF to unset this flag creates a file which looks as desired by the OP.
(This also is a solution proposed in this Inkscape forum thread eventually found by the OP, which is to save the PDF with high-resolution bitmaps embedded. File -> Inkscape Preferences -> Bitmaps -> Resolution for Create Bitmap Copy, and set it to 6000 dpi)
The fact that interpolation looks different in different viewers and different output media, is by design. The PDF specification states on interpolation:
A conforming Reader may choose to not implement this feature of PDF, or may use any specific implementation of interpolation that it wishes.
A different way to get around this problem (especially as some PDF viewers have the tendency to not really live up to the specification and e.g. interpolate ignoring that flag) would be to use vector graphics here, drawing the bitmap pixels as rectangles. The result should be optimal.

Pdf real cropping

I need to crop a pdf document using the linux shell and then extract the text just in that cropped pdf.
My idea was to crop a pdf using pdfcrop linux tool and then use a txt2pdf text extractor tool to extract the text just in the cropped area, but i've realized that i'm thinking on images, and when i try to do this the result is the same than doing it over the original, not cropped, pdf.
I guess it's a layer problem. As the pdf format works with layers, if i don't "crop" all the layers, the result is gonna include all the information from all the layers, which i don't want.
I would appreciate so much if someone has any idea of how i could do a real "all layers cropping" in a pdf. If its possible or if i should start thinking on another solution.
TY
Its not layers, its the fact that cropping a PDF usually involves simply setting the CropBox, which doesn't alter the actual contents of the PDF (other than the CropBox) at all. Most text extraction code will ignore the CropBox and extract all the text....
You could, with some effort, use Ghostscript to produce a genuinely cropped PDF (though note that partially cropped glyphs will still be included) and then extract the text from that. But that's pretty ugly.
Alternatively Ghostscript and MuPDF can both extract text with co-ordinate information, which may be enough for your needs.

What is the difference between .psd (photoshop) and .xcf (gimp) file types?

What are the technical specifications/capabilities of each file format?
Does one type handle certain types of graphics better than the other?
XCF supports saving each layer, the current selection, channels, transparency, paths and guides. However, unlike the native file format for Adobe Photoshop, PSD, the undo history is not saved in an XCF file.
The .PSD (Photoshop Document), Photoshop's native format, stores an image with support for most imaging options available in Photoshop. These include layers with masks, color spaces, ICC profiles, transparency, text, alpha channels and spot colors, clipping paths, and duotone settings

PDF Colo(u)r Analysis (without Acrobat itself ?)

Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)
You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.
Most PDF tools have access to this information but no api to access it. You could take any tool and add it in
Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.
We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,\"QuickPDFCS2eb0f578\",Separation,\"HKS 52 E\",DeviceCMYK,0.95,0,0.55,0
Resource,\"QuickPDFCSb7b05308\",Separation,\"Black\",DeviceCMYK,0,0,0,1
Resource,\"QuickPDFCSd9f10810\",Separation,\"Pantone 117 C\",DeviceCMYK,0,0.18,1,0.15
Resource,\"QuickPDFCS9314518c\",Separation,\"All\",DeviceCMYK,0,1,0,0.5
Resource,\"QuickPDFCS333d463d\",Separation,\"noplate\",DeviceCMYK,1,0,0,0
Resource,\"QuickPDFCSb41cafc4\",Separation,\"noprint\",DeviceCMYK,0,1,0,0
Resource,\"Cs10\",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",Separation,\"P1495\",DeviceCMYK,0,0.31,0.69,0
XObject,\"R29\",Image,,DeviceRGB,-1,-1,-1,-1
Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help)
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator
If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be bigger or different to (though obviously broadly similar to) the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).