I am doing a massive set of file conversions, and several of them happen to be ".dat" files. When I open them I see that the first line is "%!PS-Adobe". Here's an example:
%!PS-Adobe
^M%c$in
^M/c$in {72.0 mul} def
^M%DEFINE MARGINS
^M/C$LMAR .2 c$in def %LEFT MARGIN
^M/C$RMAR 8.4 c$in def %RIGHT MARGIN
^M/C$TMAR 10.8 c$in def %TOP MARGIN
^M/C$BMAR .2 c$in def %BOTTOM MARGIN
^M/C$CF /Courier def %saves /Courier as C$CF
etc...
Am I correct in assuming that these are indeed Adobe PostScript files and ** NOT ** PDFs?
How hard is it to convert these to PDF? I was thinking command-line Perl or ImageMagick or something, but right now I'm a little stumped about what's been handed to me.
Thank you SO Much...
Janie
You can convert these to PDF using Ghostscript and the pdfwrite device, or for simplicity the ps2pdf script supplied with Ghostscript.
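For example, something along these lines (a minimal sketch; the file names are placeholders, and the .dat extension doesn't matter because Ghostscript looks at the content, not the name):
ps2pdf input.dat output.pdf
# or, calling Ghostscript directly:
gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=output.pdf input.dat
# batch-convert a whole directory of them:
for f in *.dat; do ps2pdf "$f" "${f%.dat}.pdf"; done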
Yup, that's PostScript. A PDF would start with "%PDF". If the text "^M" is literally there like that, then it was created on a Mac and got mangled being copied to or edited on other platforms. (Maybe that happened when the sample was pasted into the S.O. edit box?) It defines some variables with dollar signs in their names, which makes it look funny.
%!PS-Adobe is the signature for conforming PostScript files. (Non-conforming PostScript can get by with %!.) The signature for PDF files is %PDF-.
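If you want to sanity-check the whole batch quickly (assuming a Unix-like shell is available; the file utility recognises both signatures):
file *.dat
# or just look at the first line of each file:
head -n 1 *.dat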
Related
I am trying to insert a PDF file into my LaTeX document. It is properly cropped (using the "Inkscape trick") and should be easy to insert without a problem.
LaTeX, however, introduces a huge white space in the document, which forces an extra "blank" page just to fit the PDF.
This is the code creating the problem.
\end{minipage}
... % Other code
\pagebreak
\subsection[Diagramm]{\uline{Diagramm}}
\vspace{-15pt}
\begin{figure}[H]
\centering
\includegraphics[height=0.98\textheight]{Abbildungen/ASMD_Diagramm_v2_X.pdf}
\caption{ASMD-Diagramm}
\label{Fig ASMD-Diagramm}
\end{figure}
... % Other code
\pagebreak
Now I have another PDF which is inserted properly and exactly as expected.
I have tried the following already:
Recrop the PDF
Using the trim option of \includegraphics
Resizing
Using other file formats (none were satisfactory)
To be honest, I don't know much about LaTeX and am more of a beginner-to-intermediate user than someone who really knows this stuff. But this has bothered me for quite a long time already. Does anyone have any idea how to fix this or what I am doing wrong?
Thank you!
Solved the problem. As discussed with samcarter, the space around the PDF file in the document seems to have been too small, so LaTeX couldn't accommodate the caption, header, etc.
By trial and error, I just changed the size from:
...
\includegraphics[height=0.98\textheight]{Abbildungen/ASMD_Diagramm_v2_X.pdf}
...
to
...
\includegraphics[height=0.95\textheight]{Abbildungen/ASMD_Diagramm_v2_X.pdf}
...
and now it fits quite well on the page.
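A possibly more robust variant (a sketch using the standard graphicx package; I haven't tested it beyond my own document) is to cap both dimensions and let keepaspectratio pick whichever limit is hit first, leaving some slack for the caption:
\begin{figure}[H]
    \centering
    % keepaspectratio scales the image to the tighter of the two limits, without distortion;
    % 0.9\textheight leaves room for the caption below the image
    \includegraphics[width=\textwidth,height=0.9\textheight,keepaspectratio]{Abbildungen/ASMD_Diagramm_v2_X.pdf}
    \caption{ASMD-Diagramm}
    \label{Fig ASMD-Diagramm}
\end{figure}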
After processing with Ghostscript, I sometimes see whitespace breaking up words, visible with pdftotext or in a PDF viewer when searching or selecting text. Possibly unrelated, but the anomalies seem to correspond with kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf gives:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source: PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is that, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file, what you actually get is a series of character codes and positions.
It's up to the text extraction code to try to make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacing in PDF. You could do something like:
(Te) Tj
10 0 Td
(st) Tj
Or:
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second as "Test".
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
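For example (a minimal sketch; the output file name is arbitrary):
gs -sDEVICE=txtwrite -o extracted.txt output.pdf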
As with your other question, it would be best to supply examples when asking these kinds of questions, because without them it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.
I am generating multiple EPS files, which contain several PostScript drawing commands that are not necessarily encoded efficiently. The first update in the answer to this question describes similar inefficiencies.
Each of my EPS files is around 18 MB, and the resulting PDF files are around 3 MB. I am generating the PDF files using epstopdf, which enables some sort of compression by default.
Are there any suggestions for how to further reduce the resulting PDF file sizes without changing the quality (e.g. rasterizing the vector graphics)?
I tried reducing the precision of the coordinates from 8 digits past the decimal to 3. This reduced the EPS file sizes to about 14 MB, but, counter-intuitively, the PDF file sizes slightly increased.
Update 1: The EPS files contain several occurrences of the sample code below for different coordinates and colors.
newpath
1 setlinejoin
1 setlinecap
<<
/BBox [322 384.0417 615.0087 651.9958]
/Domain [322 384.0417 615.0087 651.9958]
/ShadingType 6
/ColorSpace [/DeviceRGB]
/DataSource
[
0
350.00000000 651.99583594
336.00000000 645.75890880
336.00000000 645.75890880
322.00000000 639.52198166
339.17140372 627.26533984
339.17140372 627.26533984
356.34280743 615.00869803
370.19224806 621.16169097
370.19224806 621.16169097
384.04168868 627.31468392
367.02084434 639.65525993
367.02084434 639.65525993
0.23047 0.29688 0.75
0.23047 0.29688 0.75
0.41081 0.54141 0.93366
0.41112 0.54178 0.93388
]
>>
gsave
322 615.0087 62.04169 36.98714 rectclip
shfill
grestore
Update 2: I have been able to reduce the PDF file sizes by about 15% by using pdftocairo, followed by:
gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf
PostScript is a programming language and PDF is not, so often you can actually create a smaller PostScript program than the resulting PDF file.
The 'inefficiencies' you mention in your EPS program, and the precision of the input numbers, are completely irrelevant to the size of the PDF file. The operators in PDF do not have the same names as the operators in PostScript, so a 'moveto' in PostScript does not simply get transliterated into a 'moveto' in the resulting PDF file. The precision of numbers in the output PDF file is not tied to the precision of the numbers in the input.
In addition, PostScript interpreters often use fixed-precision arithmetic (Ghostscript, for example, uses 24:8), so e.g. 1.5 on the input may not be produced as 1.5 on the output; it may instead become 1.49999999.
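To illustrate what 8 fractional bits means: the smallest representable step is 1/256 ≈ 0.0039, so a coordinate of, say, 0.305 is stored as the nearest multiple, 78/256 = 0.3046875, and that rounded value is what is available when the output is written.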
So the upshot, basically, is that nobody can tell why your PDF files are as large as they are without seeing them. Personally, I would suggest that a 6:1 reduction in size is pretty reasonable. If you post a representative example somewhere, it's possible someone could look at it and might be able to offer some suggestions, but without seeing the content it's not really possible to tell.
NB: rendering the content would most likely increase the size of the PDF file, unless you render at a really low resolution.
EDIT
The supplied example is simply a shading dictionary; the PDF file will contain almost exactly the same data for that particular construct. It's already about as compact as you could expect, and I very much doubt that this is the sort of thing occupying 18 MB of source; that would be an enormous number of shadings. There is no realistic way to make that smaller, and rendering it to a bitmap (even at very low resolution) would actually make it larger.
It's entirely possible the EPS contains things like a bitmap preview, which will, of course, be removed when creating a PDF. It may also (depending on the creating application) contain the original document, stored as comments, which will also be removed when creating a PDF file. Without seeing the original EPS it's not really possible to suggest much.
I'm afraid posting little bits of the file isn't going to help really.
How can I convert a single color from a PDF document into another color, for example convert all instances of #ff0000 (red) to #ffffff (white)?
I've seen a number of ghostscript commands doing something similar (using setcolor, setcolortransfer), but I can't find a solution for this exact problem.
For example, the following will create an image-negative of the input PDF:
gs -o output.pdf -sDEVICE=pdfwrite -c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" -f input.pdf
I'd like to move past this to a higher level of control, focusing on a single color being replaced with a different color (not its negative).
Essentially, you can't (or at least not using Ghostscript).
Firstly, you seem to be assuming that the colours will be specified in RGB, when in fact they could be specified in CMYK, ICC, CalRGB or Lab. You also need to consider Indexed colour spaces.
Secondly, Ghostscript does not 'edit' PDF files; when you send a PDF file as input to Ghostscript, it is fully interpreted into graphics primitives and the primitives are processed.
When the output is PDF, the primitives are reassembled into a new PDF file. The goal of this process is that the visual appearance of the new PDF file should match the original. It is NOT the same PDF file; its internals will likely be completely different.
Finally, how do you plan to handle images? Will you process those byte by byte to massage the colours? Or do you plan to ignore them? The same goes for shadings, where the colours aren't even present in the PDF file directly but are generated from functions.
Without knowing why you want to do this, I can't even offer a different approach other than: decompress the PDF file, read it, and replace the colours manually.
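If you do go the manual route, a rough sketch (assuming the qpdf tools are installed, and that the colours appear as literal RGB fill/stroke operators; images and shadings are untouched, as noted above) might look like:
qpdf --qdf --object-streams=disable input.pdf editable.pdf
# rg sets the fill colour, RG the stroke colour; rewrite literal red to white
sed -i -e 's/1 0 0 rg/1 1 1 rg/g' -e 's/1 0 0 RG/1 1 1 RG/g' editable.pdf
# fix-qdf (shipped with qpdf) repairs stream lengths and offsets after hand-editing
fix-qdf editable.pdf > output.pdf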
We have to construct a PostScript file that contains Arabic text as well as English text.
Ghostscript displays the Arabic text correctly, but after converting to PDF the Arabic letters do not show.
The PS file contains the following:
/TraditionalArabic findfont dup length dict copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
<< /MediaColor (Blue)/MediaWeight 75 /MediaType () /xx {2.803464567 mul} def /xx {2.83464567 mul} def /PageSize [240 xx 345 xx]>> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show
showpage
To run Ghostscript, I run it from the command line so it picks up all the Windows fonts:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to a PDF file, I run the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
When converting to PDF, the Arabic characters are not shown correctly, but appear as meaningless squares...
If I use an Adobe tool to convert to PDF, the resulting PDF is the same, except that eacute (\005), if included in the PS file, does show after conversion; whereas when I convert with the previous command line, none of the characters added via the Encoding are shown correctly.
Any help with that?
Thanks to KenS's hints I was able to solve my problem. The encoding used wrong character names like kafinitialarabic (by "wrong" I mean that PDF could not understand them); everything that ended with "arabic" was wrong. The Traditional Arabic font does not have those names for its characters. To find out what names it really understands, I converted the TTF font to AFM and PFA using the following command, which turns the TrueType font into a Type 42 font that will be understood once embedded in the PostScript file at conversion to PDF:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated PFA file for the characters I wanted to add; instead of kafinitialarabic there was kafinitial, for kafmedialarabic there was kafmedial, and so on...
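Another way to list the glyph names a font actually contains is to ask Ghostscript directly (a sketch; it assumes the font is found on the FONTPATH under that name):
gswin64c.exe -dNODISPLAY -q -sFONTPATH=%windir%/fonts -c "/TraditionalArabic findfont /CharStrings get {pop ==} forall quit"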
Adding those names to the Encoding works fine now, but instead of adding all those characters to the dictionary, I want to find a way to use the font as we normally do with setfont in PostScript, if that is possible...
As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in the word and on whether the previous or next characters are forced into isolated or final forms. Unfortunately, I had to embed quite a lot of intrinsic knowledge about the font in use and the glyph names it has (typos included) into the program.
If that's of interest, I've stuck it on GitHub, but it's very raw and initial.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a traditional arabic font, with quite a few idiosyncrasies.