I'm currently trying to extract an image from a pdf file using iTextSharp.
The pdf is made from a scanner: it has a single page that contains one big image.
When looking at the file I find the following:
<< /Type /XObject /Subtype /Image /Name /Obj3 /Width 2480 /Height 3507 /ColorSpace /DeviceGray /BlackIs1 true /BitsPerComponent 1 /Length 5 0 R /Filter /CCITTFaxDecode /DecodeParms << /K -1 /Columns 2480 >> >> stream
I can access that using iTextSharp and I try to save it using the following code:
Dim aFromImageStream = New MemoryStream()
aFromImageStream.Write(bytes, 0, bytes.Length)
Dim anImage = System.Drawing.Bitmap.FromStream(aFromImageStream, True, True)
anImage.Save("c:\test.tiff", System.Drawing.Imaging.ImageFormat.Tiff)
But that doesn't work: I get one big black TIFF file with different shades of gray at the top.
Does anyone know how I can decode these CCITTFaxDecode images?
Please see the answers to the question "Extracting image from PDF with /CCITTFaxDecode filter".
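One common approach to this kind of stream is not to decode it at all, but to prepend a TIFF header to the raw (still compressed) bytes so that any TIFF reader can do the decoding. Below is a minimal sketch of that idea in Python. The function name `wrap_ccitt_in_tiff` is hypothetical, and the `photometric` default is an assumption: which value gives the right polarity depends on /BlackIs1 and any /Decode array, so flip it if the output comes out inverted. /K -1 corresponds to CCITT Group 4, which is TIFF compression value 4.

```python
import struct

def wrap_ccitt_in_tiff(data, width, height, compression=4, photometric=0):
    """Prepend a minimal single-strip TIFF header to raw CCITT fax data."""
    # Each IFD entry is (tag, type, count, value); type 3 = SHORT, 4 = LONG.
    # Everything is little-endian, so a SHORT packed into the 4-byte value
    # field still occupies its first two bytes, as TIFF requires.
    num_entries = 8
    header_size = 8 + 2 + 12 * num_entries + 4  # header + count + entries + next-IFD offset
    entries = [
        (256, 4, 1, width),        # ImageWidth (/Width, same as /Columns here)
        (257, 4, 1, height),       # ImageLength (/Height)
        (258, 3, 1, 1),            # BitsPerSample (/BitsPerComponent 1)
        (259, 3, 1, compression),  # Compression: 4 = CCITT Group 4 (/K -1)
        (262, 3, 1, photometric),  # 0 = WhiteIsZero, 1 = BlackIsZero (assumption, see above)
        (273, 4, 1, header_size),  # StripOffsets: data starts right after the header
        (278, 4, 1, height),       # RowsPerStrip: the whole image is one strip
        (279, 4, 1, len(data)),    # StripByteCounts
    ]
    out = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic 42, offset of first IFD
    out += struct.pack("<H", len(entries))
    for entry in entries:
        out += struct.pack("<HHII", *entry)
    out += struct.pack("<I", 0)  # no further IFDs
    return out + data
```

For the image in the question this would be called as `wrap_ccitt_in_tiff(raw, 2480, 3507)`, where `raw` stands for the undecoded stream bytes (in iTextSharp, something like `PdfReader.GetStreamBytesRaw` rather than the decoded bytes).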
Can anyone suggest how to extend the page size on one side using PostScript? I have to put a mark on a large batch of documents. I use PostScript for this (it is the technology closest to PDF, and speed is critical for the task).
I am able to put a mark on the document itself, but I have to add a blank field to the right of each page and this is the problem.
This is what is going on (the mark [copy] is added above the pdf)
This is what has to be (page size extended with blank field on the right):
This is the content of my mark.ps file
<<
  /EndPage
  {
    2 eq { pop false }
    {
      gsave
      7.0 setlinewidth
      70 780 newpath moveto
      70 900 lineto
      120 900 lineto
      120 780 lineto
      66.5 780 lineto
      stroke
      closepath
      /Helvetica findfont 18 scalefont setfont
      newpath
      100 792.5 moveto
      90 rotate
      (COPY) true charpath fill
      1 setlinewidth stroke
      0 setgray
      grestore
      true
    } ifelse
  } bind
>> setpagedevice
This is how it is applied to pdf
gs \
-q \
-dBATCH \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=output.pdf \
-f mark.ps \
input.pdf
What I tried:
Changing the page size with << /PageSize [595 1642] >> setpagedevice - didn't work
Using the -g option in gs, like this: -g595x1642 - also didn't work
If anyone has relevant suggestions, please share!
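Two things may be worth checking here. First, /PageSize is [width height], so [595 1642] makes the page taller rather than wider. Second, a PDF input normally sets its own media size per page, which can override a size requested earlier. One option to try, untested against these files, is to fix the media size on the command line with Ghostscript's -dDEVICEWIDTHPOINTS, -dDEVICEHEIGHTPOINTS and -dFIXEDMEDIA switches. The sketch below only builds such a command line; the helper name `widened_gs_args`, the 595x842 (A4) original size and the 200 pt extra width are assumptions for illustration.

```python
def widened_gs_args(width_pt, height_pt, extra_right_pt,
                    mark="mark.ps", src="input.pdf", dst="output.pdf"):
    """Build a gs argument list that requests a page wider by extra_right_pt."""
    new_width = width_pt + extra_right_pt  # extend to the right; height is unchanged
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-dDEVICEWIDTHPOINTS={new_width}",
        f"-dDEVICEHEIGHTPOINTS={height_pt}",
        "-dFIXEDMEDIA",  # keep the requested size even if the input asks for another
        f"-sOutputFile={dst}",
        "-f", mark, src,
    ]

# Hypothetical example: an A4 page (595x842 pt) with 200 pt added on the right.
print(" ".join(widened_gs_args(595, 842, 200)))
```

The resulting list can be passed to `subprocess.run` as-is; whether the page content stays anchored at the left edge should be verified on a sample document.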
I am trying to get the correct Unicode values for documents in which the CMap is erroneous. I am thinking of getting the Unicode values from the font file. Is there any way to do that? As I understand it, the font file's 'cmap' table maps character codes to glyph IDs (GIDs), and 'glyf' maps GIDs to the information required to draw the glyph. Is there any such map from which I can get Unicode from a character code?
You can find the document here: https://drive.google.com/file/d/1ve1n4yeHzutTspIVwVRu6jT1vY2RQvvs/view?usp=drivesdk . In the given PDF, font G1 has the following CMap:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
In the above CMap I can see only a code space range. What about code points? How can I get Unicode for the CIDs?
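To make the observation concrete: a ToUnicode CMap normally carries its mappings in beginbfchar/endbfchar and beginbfrange/endbfrange sections, and the CMap above has neither. The following is a minimal Python sketch of a parser for those two section types (the function name `parse_tounicode` is hypothetical, and the array form of bfrange is deliberately not handled). Run on the CMap above it returns an empty mapping, which is consistent with what the question describes.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def _utf16(hex_digits):
    # Destination values in a ToUnicode CMap are UTF-16BE code units.
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = _utf16(dst)
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        # Only the simple "<lo> <hi> <dst>" form; "<lo> <hi> [ ... ]" is not handled.
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            base = _utf16(dst)
            for offset in range(int(hi, 16) - int(lo, 16) + 1):
                # Consecutive codes map to consecutive values of the last character.
                mapping[int(lo, 16) + offset] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping

# The CMap from the question has only a codespace range, so nothing is mapped.
question_cmap = "1 begincodespacerange\n<0000><FFFF>\nendcodespacerange\n"
print(parse_tounicode(question_cmap))  # {}
```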
I am creating a PDF file from a TIFF image using ImageMagick and Ghostscript.
My source TIFF is 16 bits per channel with no alpha (a 48-bit image) with an attached ICC profile (AdobeRGB), and I want to maintain this in the final PDF.
convert input.tif[0] -density 600 -alpha Off -size 5809x9408 -depth 16 intermediate.ps
This takes my input TIFF image (just the main image, not the thumbnail, by using [0]) and creates a .ps file from the bitmap.
When I look at the size of the PostScript file, it's roughly the same size (3-4 MB larger than the 328MB tiff) as the source TIFF, but I can't tell if the image data in the .ps is 8 or 16 bit per channel.
Then, when I use Ghostscript to convert this to a PDF, I'm getting 8 bits per channel in the PDF.
gs -dPDFA=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sDefaultRGBProfile=AdobeRGB1998.icc -dOverrideICC -sOutputFile=output.pdf -r600 -P PDFA_def.ps -f custom.joboptions intermediate.ps
If I use pdfimages to inspect the PDF, it shows me 8 bit per channel.
pdfimages -list output.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    5809  9408  rgb     3   8  image  no        10  0   600   600 74.1M  47%
The contents of my PDFA_def.ps has been modified from the default Ghostscript install to specify AdobeRGB (1998) as the colour profile:
%!
% This is a sample prefix file for creating a PDF/A document.
% Feel free to modify entries marked with "Customize".
% This assumes an ICC profile to reside in the file (ISO Coated sb.icc),
% unless the user modifies the corresponding line below.
% Define entries in the document Info dictionary :
/ICCProfile (AdobeRGB1998.icc) % Customise
def
[ /Title (Title) % Customise
/DOCINFO pdfmark
% Define an ICC profile :
[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA}
<<
/N currentpagedevice /ProcessColorModel known {
currentpagedevice /ProcessColorModel get dup /DeviceGray eq
{pop 1} {
/DeviceRGB eq
{3}{4} ifelse
} ifelse
} {
(ERROR, unable to determine ProcessColorModel) == flush
} ifelse
>> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark
% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
/Type /OutputIntent % Must be so (the standard requires).
/S /GTS_PDFA1 % Must be so (the standard requires).
/DestOutputProfile {icc_PDFA} % Must be so (see above).
/OutputConditionIdentifier (sRGB) % Customize
>> /PUT pdfmark
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ]>> /PUT pdfmark
I've also got a custom.joboptions file that I created in Acrobat Distiller and then have modified for PDF/A compliance - I have tried to force 16-bit images in this file too, but I'm still getting 8-bit images in the PDF.
I don't know how many of these options Ghostscript respects and how many it ignores however. If I don't use this custom.joboptions file when making the PDF, the images are downsampled to a very low resolution.
<<
/ASCII85EncodePages false
/AllowTransparency false
/AutoPositionEPSFiles true
/AutoRotatePages /All
/Binding /Left
/CalGrayProfile (Dot Gain 20%)
/CalRGBProfile (sRGB IEC61966-2.1)
/CalCMYKProfile (U.S. Web Coated \050SWOP\051 v2)
/sRGBProfile (sRGB IEC61966-2.1)
/CannotEmbedFontPolicy /Error
/CompatibilityLevel 1.4
/CompressObjects /Off
/CompressPages true
/ConvertImagesToIndexed true
/PassThroughJPEGImages true
/CreateJobTicket false
/DefaultRenderingIntent /Default
/DetectBlends true
/DetectCurves 0.0000
/ColorConversionStrategy /LeaveColorUnchanged
/DoThumbnails false
/EmbedAllFonts true
/EmbedOpenType false
/ParseICCProfilesInComments true
/EmbedJobOptions false
/DSCReportingLevel 0
/EmitDSCWarnings false
/EndPage -1
/ImageMemory 1048576
/LockDistillerParams true
/MaxSubsetPct 100
/Optimize false
/OPM 1
/ParseDSCComments true
/ParseDSCCommentsForDocInfo true
/PreserveCopyPage true
/PreserveDICMYKValues true
/PreserveEPSInfo true
/PreserveFlatness true
/PreserveHalftoneInfo false
/PreserveOPIComments false
/PreserveOverprintSettings false
/StartPage 1
/SubsetFonts false
/TransferFunctionInfo /Apply
/UCRandBGInfo /Remove
/UsePrologue false
/ColorSettingsFile (None)
/AlwaysEmbed [ true
]
/NeverEmbed [ true
]
/AntiAliasColorImages false
/CropColorImages true
/ColorImageMinResolution 600
/ColorImageMinResolutionPolicy /OK
/DownsampleColorImages false
/ColorImageDownsampleType /Average
/ColorImageResolution 600
/ColorImageDepth -1
/ColorImageMinDownsampleDepth 16
/ColorImageDownsampleThreshold 1.50000
/EncodeColorImages true
/ColorImageFilter /FlateEncode
/AutoFilterColorImages false
/ColorImageAutoFilterStrategy /JPEG
/ColorACSImageDict <<
/QFactor 0.15
/HSamples [1 1 1 1] /VSamples [1 1 1 1]
>>
/ColorImageDict <<
/QFactor 0.15
/HSamples [1 1 1 1] /VSamples [1 1 1 1]
>>
/JPEG2000ColorACSImageDict <<
/TileWidth 256
/TileHeight 256
/Quality 30
>>
/JPEG2000ColorImageDict <<
/TileWidth 256
/TileHeight 256
/Quality 30
>>
/AntiAliasGrayImages false
/CropGrayImages true
/GrayImageMinResolution 300
/GrayImageMinResolutionPolicy /OK
/DownsampleGrayImages false
/GrayImageDownsampleType /Average
/GrayImageResolution 600
/GrayImageDepth -1
/GrayImageMinDownsampleDepth 2
/GrayImageDownsampleThreshold 1.50000
/EncodeGrayImages true
/GrayImageFilter /FlateEncode
/AutoFilterGrayImages false
/GrayImageAutoFilterStrategy /JPEG
/GrayACSImageDict <<
/QFactor 0.15
/HSamples [1 1 1 1] /VSamples [1 1 1 1]
>>
/GrayImageDict <<
/QFactor 0.15
/HSamples [1 1 1 1] /VSamples [1 1 1 1]
>>
/JPEG2000GrayACSImageDict <<
/TileWidth 256
/TileHeight 256
/Quality 30
>>
/JPEG2000GrayImageDict <<
/TileWidth 256
/TileHeight 256
/Quality 30
>>
/AntiAliasMonoImages false
/CropMonoImages true
/MonoImageMinResolution 1200
/MonoImageMinResolutionPolicy /OK
/DownsampleMonoImages false
/MonoImageDownsampleType /Average
/MonoImageResolution 2400
/MonoImageDepth -1
/MonoImageDownsampleThreshold 1.50000
/EncodeMonoImages true
/MonoImageFilter /CCITTFaxEncode
/MonoImageDict <<
/K -1
>>
/AllowPSXObjects false
/CheckCompliance [
/PDFA1B:2005
]
/PDFX1aCheck false
/PDFX3Check false
/PDFXCompliantPDFOnly true
/PDFXNoTrimBoxError false
/PDFXTrimBoxToMediaBoxOffset [
0.00000
0.00000
0.00000
0.00000
]
/PDFXSetBleedBoxToMediaBox true
/PDFXBleedBoxToTrimBoxOffset [
0.00000
0.00000
0.00000
0.00000
]
/PDFXOutputIntentProfile (Adobe RGB \0501998\051)
/PDFXOutputConditionIdentifier ()
/PDFXOutputCondition ()
/PDFXRegistryName ()
/PDFXTrapped /False
/CreateJDFFile false
>> setdistillerparams
<<
/HWResolution [600 600]
/PageSize [697.080 1128.960]
>> setpagedevice
PostScript can't handle 16 bits per component; it only handles 1, 2, 4, 8 and 12.
PDF doesn't support 12 BPC, only 1, 2, 4, 8 and 16.
So there isn't any way to get a PDF file with more than 12 BPC of real data if you use PostScript as an intermediate format. Even if the PDF file says it's 16 BPC, the actual data will be limited to 12 (16 BPC original -> 12 BPC PostScript -> 16 BPC PDF).
Further to that, you say that you are creating a PDF/A file, and it's PDF/A-1. If you read the PDF/A-1 specification you will see that it's limited to PDF 1.4; checking the PDF Reference Manual, we find that 16 BPC images were introduced in PDF 1.5.
So even if pdfwrite were able to upscale the 12 BPC image to a 16 BPC image (with padding), it's not allowed to do so if you want to create a PDF/A-1 file, because the specification doesn't permit it. So I'm afraid you can't do what you want: you can't create a legal PDF/A-1 file with 16 BPC images using any tool.
Regarding downsampling, the default for colour image downsampling is 'false', so if you don't enable it (DownsampleColorImages=true) then the pdfwrite device won't downsample the images.
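The bit-depth reasoning above can be written out as a small Python sketch. It only models the depth lists quoted in this answer (PostScript: 1, 2, 4, 8, 12; PDF: 1, 2, 4, 8, 16, with 16 available from PDF 1.5); the function names are hypothetical, and how pdfwrite or ImageMagick actually pick a depth is not modelled.

```python
POSTSCRIPT_BPC = (1, 2, 4, 8, 12)
PDF_BPC = (1, 2, 4, 8, 16)

def effective_bpc_via_postscript(source_bpc):
    """Largest PostScript depth that does not exceed the source depth."""
    return max(d for d in POSTSCRIPT_BPC if d <= source_bpc)

def pdf_container_bpc(data_bpc, pdf_version):
    """Smallest allowed PDF depth that holds data_bpc, else the deepest allowed one.

    pdf_version is a (major, minor) tuple; 16 BPC needs PDF 1.5 or later.
    """
    allowed = [d for d in PDF_BPC if d < 16 or pdf_version >= (1, 5)]
    fitting = [d for d in allowed if d >= data_bpc]
    return min(fitting) if fitting else max(allowed)

via_ps = effective_bpc_via_postscript(16)
print(via_ps)                             # 12: at most 12 real bits survive PostScript
print(pdf_container_bpc(via_ps, (1, 5)))  # 16: PDF 1.5+ can pad 12 bits into 16
print(pdf_container_bpc(via_ps, (1, 4)))  # 8: PDF 1.4 (and so PDF/A-1) tops out at 8
```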
In v0.8.0, doing a
process :convert => "pdf"
on a JPG results in the JPG being stored as-is inside the PDF, i.e. the PDF is just a wrapper around the JPG.
From v0.9.0 on, the same operation results in a much smaller, lower-resolution resampled image being stored in the PDF.
The same version of MiniMagick is being used throughout.
The following two blocks are clipped from the resulting PDFs: the first block was generated with CarrierWave 0.9.0, the second with 0.8.0. That was the only change in the code / Gemfile. The file being converted to PDF is a 600 DPI image. It appears that CarrierWave 0.9.0 uses a DPI of 72 by default when converting images to PDF...
/Type /XObject
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Filter [ /RunLengthDecode ]
/Height 522
/Length 317714
/Name /Im0
/SMask 8 0 R
/Subtype /Image
/Width 378
/Type /XObject
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Filter [ /DCTDecode ]
/Height 4350
/Length 649502
/Name /Im0
/Subtype /Image
/Width 3150
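The 72 DPI guess can be checked with simple arithmetic on the two blocks above: scaling the 3150x4350 image from 600 DPI to 72 DPI gives exactly the 378x522 size seen in the 0.9.0 output. A short Python check (the helper name `resampled_size` is just for illustration):

```python
def resampled_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel size of an image after resampling from source_dpi to target_dpi."""
    return (round(width_px * target_dpi / source_dpi),
            round(height_px * target_dpi / source_dpi))

# 0.8.0 block: /Width 3150 /Height 4350 at 600 DPI.
print(resampled_size(3150, 4350, 600, 72))  # (378, 522), matching the 0.9.0 block
```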
This is a workaround: replace
process :convert => :pdf
with
process :convert_to_pdf
and add
def convert_to_pdf # copied from the carrierwave code base
  # @format = :pdf # this line causes the problem... why?
  manipulate! do |img|
    img.format(:pdf)
    img
  end
end
In other words, create your own convert function that DOES NOT set @format = :pdf, which is what causes the problem. I believe @format = :pdf is only needed if you are chaining manipulations together.
It's super confusing for me, since pdfmark and PostScript do the same job, but in practice the coding style is quite different.
I know how to draw a line with two circles at its ends using the moveto, lineto and arc commands in the PostScript language. However, apparently I have to move to pdfmark because of hyperlinks; the pdfmark manual is very hard to understand, and there is no other reference (book or online tutorial).
So I would appreciate it if someone could generate such a thing (as my figure shows) with a little description.
Here's the simplest version possible. This creates a clickable area in the bottom left corner of the PDF that goes to a URL.
[/Rect [ 0 0 200 200 ] % Define the clickable rectangle
/Action % Define an action
<<
/Subtype /URI % Define the action's subtype as a hyperlink
/URI (http://www.example.com/) % Set the URL
>>
/Subtype /Link % Set the type of this PDFmark to a link
/ANN pdfmark % Add the annotation
By default a border will be drawn so you might want to clear that out:
[/Rect [ 0 0 200 200 ] % Define the clickable rectangle
/Action % Define an action
<<
/Subtype /URI % Define the action's subtype as a hyperlink
/URI (http://www.example.com/) % Set the URL
>>
/Border [0 0 0] % Remove the border
/Subtype /Link % Set the type of this PDFmark to a link
/ANN pdfmark % Add the annotation
This only creates a clickable area, however. You then need to draw some text to click on:
/Helvetica findfont 16 scalefont setfont % Set the font to Helvetica 16pt
5 100 moveto % Set the drawing location
(http://www.example.com/) show % Show some text
Lastly, pdfmark isn't technically defined within the PostScript standard, so it is recommended that, if you're not using Adobe's Distiller, you define something to handle it. This code will basically just ignore pdfmark if the interpreter doesn't recognize it:
/pdfmark where
{pop}
{
/globaldict where
{ pop globaldict }
{ userdict }
ifelse
/pdfmark /cleartomark load put
}
ifelse
And here's a full working PostScript program:
%!PS-Adobe-1.0
/pdfmark where
{pop}
{
/globaldict where
{ pop globaldict }
{ userdict }
ifelse
/pdfmark /cleartomark load put
}
ifelse
[/Rect [ 0 0 200 200 ] % Define the clickable rectangle
/Action % Define an action
<<
/Subtype /URI % Define the action's subtype as a hyperlink
/URI (http://www.example.com/) % Set the URL
>>
/Border [0 0 0] % Remove the border
/Subtype /Link % Set the type of this PDFmark to a link
/ANN pdfmark % Add the annotation
/Helvetica findfont 16 scalefont setfont % Set the font to Helvetica 16pt
5 100 moveto % Set the drawing location
(http://www.example.com/) show % Show some text
showpage
EDIT
Also, check out this manual for more in-depth instructions on pdfmark
EDIT 2
I should also point out that I've spaced things out for instructional purposes. In most cases you'll see the /Action written on a single line, such as:
/Action << /Subtype /URI /URI (http://www.example.com/) >>
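Coming back to the original request (a line with a circle at each end that is also a hyperlink), the pieces above can be combined. The sketch below is a small Python generator that writes such a PostScript program as text. The helper names `ps_string` and `line_with_circles_link`, the coordinates and the choice of the bounding box as the clickable /Rect are all assumptions for illustration; the fallback definition is a shortened userdict-only variant of the one shown earlier.

```python
def ps_string(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"

def line_with_circles_link(x1, y1, x2, y2, radius, url):
    """Return PostScript that draws a line with a circle at each end plus a link."""
    # The clickable area is the bounding box of the whole drawing.
    left, right = min(x1, x2) - radius, max(x1, x2) + radius
    bottom, top = min(y1, y2) - radius, max(y1, y2) + radius
    return "\n".join([
        "%!PS",
        # Ignore pdfmark on interpreters that do not define it.
        "/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse",
        f"newpath {x1} {y1} moveto {x2} {y2} lineto stroke",
        # newpath before each arc avoids a stray segment from the current point.
        f"newpath {x1} {y1} {radius} 0 360 arc stroke",
        f"newpath {x2} {y2} {radius} 0 360 arc stroke",
        f"[/Rect [ {left} {bottom} {right} {top} ]",
        f" /Action << /Subtype /URI /URI {ps_string(url)} >>",
        " /Border [0 0 0]",
        " /Subtype /Link",
        " /ANN pdfmark",
        "showpage",
    ])

print(line_with_circles_link(100, 100, 300, 100, 10, "http://www.example.com/"))
```

The line here runs from circle centre to circle centre; shorten it by the radius at each end if the circles should stay empty.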