View source on a PDF rendered inside a browser?

I have a report in Cognos. The output is rendered as a PDF inside the browser itself. Now the images are not showing up in the pdf. They show up fine in HTML. If they were not showing up in HTML, I would do a View Source, check the image URL, and go from there. But when a PDF is rendered inside a browser, is there a way to do some kind of 'View Source'?

As already recommended in the comments, use a PDF browser like RUPS (based on iText) or any other one. Select the desired page, open its /Contents value, select the stream, and you'll see something like this:
/T1_0 1 Tf
0.0004 Tc -0.0002 Tw 13.98 0 0 13.98 189.87 476.67 Tm
(Praise for the First Edition)Tj
/T1_1 1 Tf
0.056 Tw 9.99 0 0 9.99 108.18 437.34 Tm
[(Each aspect is explained with numer)19(ous ex)]TJ
where text is to be displayed. Commands ending with Tf select the font for the text, those ending with Tc or Tw select the character or word spacing, those ending with Tm manipulate the text matrix and so position, rotate, stretch, etc. the text to be printed, and those ending with Tj or TJ actually print text.
Or you'll see something like this:
533.352005 0 0 668.2319946 -1.2660065 -1.0559998 cm
/Im0 Do
where some XObject is to be displayed. Commands ending with cm manipulate the current transformation matrix (again for positioning, rotating, stretching, etc.), and those ending with Do paint an XObject.
What a given XObject is can be seen in the /XObject value in the /Resources of the page. If the entry for the name used with Do (here /Im0) has a /Subtype of /Image, the XObject is an image.
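To check where such an image would land on the page, the six numbers before cm can be applied to the unit square, because PDF paints an image XObject into the unit square of the current coordinate system. The following is a minimal sketch and not part of the original answer: image_bbox is a hypothetical helper name, and it assumes no other cm operator is in effect, so the numbers map directly to page coordinates.

```python
def image_bbox(a, b, c, d, e, f):
    """Bounding box of the unit square after applying the matrix [a b c d e f].

    A point (x, y) is mapped to (a*x + c*y + e, b*x + d*y + f).
    """
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    mapped = [(a * x + c * y + e, b * x + d * y + f) for x, y in corners]
    xs, ys = zip(*mapped)
    return min(xs), min(ys), max(xs), max(ys)


# Operands taken from the cm excerpt above.
x0, y0, x1, y1 = image_bbox(533.352005, 0, 0, 668.2319946, -1.2660065, -1.0559998)
print(f"image covers x {x0:.2f}..{x1:.2f}, y {y0:.2f}..{y1:.2f}")
```

If the resulting box lies outside the page's /MediaBox, or has zero width or height, the image is present in the file but cannot be seen.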
Therefore in your case
Now the images are not showing up in the pdf.
you should inspect the page in the same manner and search for something like the excerpts above. If you don't find an XObject referenced (and also don't find the command sequence BI … key-value pairs … ID … image data … EI in a content stream, which defines an inline image), there is no image on that PDF page. Otherwise there is an image which for some reason does not show up.
There can actually be a number of other commands, too, and also other kinds of XObjects. For more details, have a look at the PDF specification ISO 32000-1:2008 (made available by Adobe), especially chapters 8 and 9.
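Once a content stream has been copied out of the PDF browser, the search described above can be sketched in a few lines. This is only an illustration under assumptions: image_candidates is a hypothetical helper, the stream is assumed to be already decompressed and available as text, and text strings that happen to contain "Do" or "BI" are not filtered out.

```python
import re

# An XObject is painted by "/Name Do"; the name ends at whitespace or a delimiter.
DO_PATTERN = re.compile(r"/([^\s/\[\]<>(){}%]+)\s+Do(?=\s|$)")
# An inline image starts with the BI operator as a separate token.
BI_PATTERN = re.compile(r"(?:^|\s)BI(?=\s)")


def image_candidates(content):
    """Return (names painted with Do, number of inline images) for one stream."""
    return DO_PATTERN.findall(content), len(BI_PATTERN.findall(content))


stream = "533.352005 0 0 668.2319946 -1.2660065 -1.0559998 cm\n/Im0 Do\n"
print(image_candidates(stream))
```

Each name found still has to be looked up in the page's /Resources, because a name painted with Do may refer to an XObject that is not an image.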

... or search the web for the exact problem
http://www-01.ibm.com/support/docview.wss?uid=swg21339267
Although it doesn't explain why it works for HTML but not for PDF, most search results indicate that it is a web server security problem, and enabling anonymous authentication on your images folder might fix it.

ways to separate passages in pdf using gap?

I have some PDFs with 2-3 passages on every page. Every passage is separated by a line gap, but when reading with PyMuPDF I cannot see any machine-readable separator between the passages. Is there another way, or another library, that can do this?
code:
import fitz  # PyMuPDF
doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)  # zero-based page number
text = page.get_text('text')
print(text)
There is no gap as such. For the moment, as it is much easier, let's look closer at the rendering in your linked viewer.
So let's replicate what is inside the real PDF (which has no web-side HTML <p> markers):
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
As you can see, there is "no gap": just left-aligned, non-justified (ragged) text. Each piece of text needs a style, such as a font name, and an explicit position to hold it in place on a page that is devoid of line feeds and true carriage returns. (Occasionally there are some backspaces or vertical/horizontal moves, but they are generally meaningless as line-printer text.) Even tabs, indents, and some spacing characters are normally discarded in a PDF printout.
If you want gaps or line-wrap you need to add them.
A good alternative is to export the text with the -layout option of pdftotext from poppler or xpdf. Here the output goes to - (the console); you can pipe it, or replace the - with a path/name.txt. Many other options are available, such as -nopgbrk.
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
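Adding the gaps yourself can also be done from the line positions that PyMuPDF reports. The following is a rough sketch and not a tested solution for IT_past.pdf: it assumes a single-column page where the visual gap shows up as extra vertical distance between line bounding boxes. The helper names and the gap_factor threshold are made up for illustration.

```python
from statistics import median


def lines_from_page(page):
    """Collect (y0, y1, text) for every text line of a PyMuPDF page."""
    lines = []
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            _, y0, _, y1 = line["bbox"]
            text = "".join(span["text"] for span in line["spans"])
            lines.append((y0, y1, text))
    return lines


def split_passages(lines, gap_factor=0.5):
    """Start a new passage when the gap to the previous line is unusually large."""
    lines = sorted(lines)
    if not lines:
        return []
    height = median(y1 - y0 for y0, y1, _ in lines)
    passages = [[lines[0][2]]]
    for (_, prev_y1, _), (y0, _, text) in zip(lines, lines[1:]):
        if y0 - prev_y1 > gap_factor * height:
            passages.append([])
        passages[-1].append(text)
    return ["\n".join(passage) for passage in passages]


# Synthetic coordinates: two lines, a larger gap, then two more lines.
demo = [(100, 110, "a"), (112, 122, "b"), (140, 150, "c"), (152, 162, "d")]
print(split_passages(demo))
```

With a real document, split_passages(lines_from_page(doc.load_page(0))) would be the call to try, and gap_factor would likely need tuning for the fonts and leading in use.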

gnuplot: Hypertext or tags in PDF?

Is there any way to add hypertext or tags into PDFs via gnuplot?
According to the manual (gnuplot 5.4.0) it's not possible:
Some terminals (wxt, qt, svg, canvas, win) allow you to attach
hypertext to specific points on the graph or elsewhere on the canvas.
When the mouse hovers over the anchor point, a pop-up box containing
the text is displayed. Terminals that do not support hypertext will
display nothing.
Actually, there are 3 desires:
1. Add hypertext to the PDF: when hovering with the mouse over a point, the text appears, as in the terminals above (also known as a "tool tip" or "bubble help").
2. Add hyperlinks to the PDF: when clicked, they redirect to a URL, e.g. www.gnuplot.info, or if possible to another local file (with an absolute or relative path).
3. Add some tags or labels which could be used later (when this PDF is included in a LaTeX document) to link to a different chapter, section, or figure.
This is probably more a question for tex.stackexchange.
Of course, you can include a gnuplot graph (PNG, PDF) in a LaTeX document and then probably define areas on the graph for links etc. within LaTeX. However, every time the graph changes you would have to redefine all the positions in LaTeX again. That's why I would like to do it automatically in gnuplot.
Maybe other plotting packages can do this, e.g. pgfplots or tikz or others, but since I feel comfortable with gnuplot I wanted to avoid using yet another package and check whether there might nevertheless be a way with gnuplot.
I'm aware that this is beyond gnuplot's focus of plotting, but maybe somebody knows about a workaround with gnuplot?
Code:
### hypertext in PDF???
reset session
set term pdfcairo size 29.7cm, 21.0cm font ",20" # A4 landscape
set output "Test.pdf"
$Data <<EOD
"Go here" 0.5 0.8
"Go there" 0.5 0.2
"Go left" 0.2 0.5
"Go right" 0.8 0.5
"www.gnuplot.info" 0.5 0.5
EOD
do for [i=1:|$Data|] {
set label i word($Data[i],1) at screen word($Data[i],2), screen word($Data[i],3) hypertext point pt 6 ps 10
}
plot cos(x)
set output
### end of code
Result: (PNG screenshot of PDF just for illustration, of course there will be no hypertext).
If using LaTeX (with set term cairolatex or epslatex or tikz in non-standalone mode) is a valid solution, I would outsource the creation of hyperlinks and hypertexts to hyperref and put the corresponding LaTeX code as a title or label in the plot:
set label '\href{http://www.gnuplot.info}{click me!}' ...
plot ... title 'see \autoref{section:methods} and read Ref.\autocite{bib:John_Smith_2002}'
Use single quotes; with double quotes, all special characters (such as the backslashes) would have to be escaped.
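If the labels come from a data table, as in the question, the set label lines could be generated by a small script instead of being typed by hand. This is only a sketch: href_label is a hypothetical helper, it assumes the LaTeX document loads hyperref, and it does not escape LaTeX special characters in the text or URL.

```python
def href_label(tag, text, url, x, y):
    """Build a gnuplot `set label` command whose text is a hyperref \\href."""
    # gnuplot escapes a single quote inside a single-quoted string by doubling it.
    latex = f"\\href{{{url}}}{{{text}}}".replace("'", "''")
    return f"set label {tag} '{latex}' at screen {x}, screen {y}"


links = [("click me!", "http://www.gnuplot.info", 0.5, 0.5)]
for tag, (text, url, x, y) in enumerate(links, start=1):
    print(href_label(tag, text, url, x, y))
```

The printed lines can be pasted into the gnuplot script, or written to a file and pulled in with load.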

How can I replace a single PDF page using Imagemagick?

I have a multi-page PDF document, and I want a version of the document with one of the pages in the middle replaced with a scanned-in copy, as I needed a physical signature in the document. How can I do this with ImageMagick? I'm aware that ImageMagick may not necessarily be the best tool for the job. However, the resulting PDF does not need to be high quality or a high-fidelity copy, so it should be sufficient for my needs.
As a specific example, I have a 9 page my-file.pdf, and I want to create a copy of the PDF with the 8th page replaced with page-8.png. It looks like I should be able to achieve this goal with the convert tool, though it's not immediately obvious what the syntax would be. How can I achieve this goal?
If I merely wanted to append the new page to the end of the file, I know I can do the following:
convert my-file.pdf page-8.png output-file.pdf
However, this ends up with the original pages 1-9, then the new page 8. What I actually want is to replace the original page 8 with the new page 8. My desired output is:
[original pages 1 - 7],[new page 8],[original page 9]
A specific page or range of pages can be specified using the bracket syntax with zero-based indexing. For instance, [8] will refer to the ninth page, and [0-6] to the first seven pages. Using this, a duplicate of the PDF with the 8th page replaced can be achieved as follows:
convert my-file.pdf[0-6] page-8.png my-file.pdf[8] output-file.pdf
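The index arithmetic is easy to get wrong by one, so here is a minimal sketch that builds the same argument list for any page. The function names are made up for this example, and the page count has to be supplied by the caller.

```python
def page_range(pdf, first, last):
    """ImageMagick page selector for zero-based pages first..last of pdf."""
    return f"{pdf}[{first}]" if first == last else f"{pdf}[{first}-{last}]"


def replace_page_args(pdf, image, page, page_count, output):
    """Arguments for `convert` that swap 1-based `page` of `pdf` for `image`."""
    if not 1 <= page <= page_count:
        raise ValueError("page must be between 1 and page_count")
    args = ["convert"]
    if page > 1:  # pages before the replaced one
        args.append(page_range(pdf, 0, page - 2))
    args.append(image)
    if page < page_count:  # pages after the replaced one
        args.append(page_range(pdf, page, page_count - 1))
    args.append(output)
    return args


print(" ".join(replace_page_args("my-file.pdf", "page-8.png", 8, 9, "output-file.pdf")))
```

Passing the list to subprocess.run, rather than a joined string to a shell, keeps the shell from interpreting the square brackets.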
As indicated in the question, as well as in a comment, ImageMagick is not a great tool for this task, as it rasterizes the pages of the PDF. An alternative solution that avoids this would be to use pdftk to replace the page. While the inserted page will still be an image, the other pages will not be rasterized.
First, save the scanned-in page as a PDF using an application capable of this. Then, use the pdftk cat operation to combine the PDFs:
pdftk A=my-file.pdf B=page-8.pdf cat A1-7 B A9 output output-file.pdf

How to get bounding boxes of elements in EPS files

I need to check if an EPS/PDF file contains any vector elements.
First I convert the PDF to EPS and remove all text elements and images from the file like this:
pdftocairo -f $page_number -l $page_number -eps $input - | sed '/BT/,/ET/ d' | sed '/^8 dict dup begin$/,/^Q$/ c Q' > $output
But how can I then check if any elements are written to the canvas?
What do you mean, exactly, by 'vector elements'? Anything except an actual bitmap image? Why do you care? Perhaps if you explained what you want to achieve it would be easier to help you.
Note that the approach you are using is by no means guaranteed to work; there can easily be 'elements' in the file which won't be removed by your rather basic approach to finding images.
You could use Ghostscript: render the file to a bitmap and specify -dFILTERTEXT and -dFILTERIMAGE. Then examine the pixels of the bitmap to see if any are non-white. If they are, then there was vector content in the file. You could probably use something like ImageMagick to count the colours and see if there is more than one.
Or render the file to a bitmap twice, once normally and once with -dFILTERVECTOR. Compare the two bitmaps (an MD5 hash of each would be sufficient). If there are no differences then there was no vector content.
Any PDF that has vector elements will use at least one of the path painting operators. According to chapter 8 of the PDF standard, those are:
S, s, f, F, f*, B, B*, b, b*, n
Of course, since PDF files can be complex, you'll also need it in a standard form. You can do that using the qpdf program's QDF format. (apt install qpdf if you don't have it).
qpdf --qdf schedule.pdf - | grep -E -m1 -q '\b[SsfFBbn]\*?$' && echo Yup
That'll print "Yup" if the file schedule.pdf has vector graphics in it.
Note: I think this will do the job for you, but it is not foolproof. It's possible to get false negatives if your PDF is loading vectors from an external file, embedding raw PostScript, or doing some other trickery. And, of course, it can have false positives (for example, a file that draws a completely transparent, 0pt dot in white ink on a white background).
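The same check can be done per token instead of per line, which avoids matching operator letters inside text strings. The following is a sketch under assumptions, not a PDF parser: painting_operators is a hypothetical helper, it expects one decompressed content stream as text with operators separated by whitespace, it does not handle nested parentheses in strings or inline image data, and it leaves out n because that operator ends a path without painting it.

```python
import re

# Path-painting operators from the list above, minus "n".
PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}
LITERAL_STRING = re.compile(r"\((?:\\.|[^\\()])*\)", re.S)
HEX_STRING = re.compile(r"<[0-9A-Fa-f\s]*>")


def painting_operators(content):
    """Return the path-painting operators found outside of string objects."""
    stripped = HEX_STRING.sub(" ", LITERAL_STRING.sub(" ", content))
    return [token for token in stripped.split() if token in PAINT_OPS]


print(painting_operators("0 0 m 10 10 l S"))
print(painting_operators("BT /F1 12 Tf (S f B) Tj ET"))
```

An empty result for every content stream of a page suggests that the page draws no paths, subject to the caveats in the note above.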
Other answers have addressed identifying the drawing operators in a plain text stream. For the other question,
But how can I then check if any elements are written to the canvas?
For this, the elements need to be part of a content stream that is referred to in the /Contents member of the Page object.
If you read in all of the pdf objects, there will be a tree connecting all the content streams to the Root object declared in the trailer.
Trailer: /Root is a reference to the Document Catalog object
Document Catalog: /Pages is a reference to the root Pages node, whose /Kids array holds Page objects or further Pages nodes
Page: /Contents is a Content Stream object, or an array of references to Content Stream objects, that draws the elements of the page
It is possible for there to be stray content stream objects that are not referenced in the Document tree. By traversing the Pages tree you could collect any and all actual content and then feed that result to one of the solutions from the other answers.
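The traversal can be illustrated on a toy model in which PDF dictionaries are plain Python dicts and indirect references are strings such as "4 0 R". This is an assumption made purely for the example; a real file has to be read with a PDF library first, and content_refs is a hypothetical name.

```python
def content_refs(node):
    """Walk a page-tree node depth-first; return content streams in page order."""
    if node.get("Type") == "Page":
        contents = node.get("Contents", [])
        # /Contents may be a single stream or an array of streams.
        return list(contents) if isinstance(contents, list) else [contents]
    refs = []
    for kid in node.get("Kids", []):
        refs.extend(content_refs(kid))
    return refs


# Toy document: stream "7 0 R" exists in the file but no page refers to it.
trailer = {
    "Root": {
        "Type": "Catalog",
        "Pages": {
            "Type": "Pages",
            "Kids": [
                {"Type": "Page", "Contents": "4 0 R"},
                {"Type": "Pages", "Kids": [
                    {"Type": "Page", "Contents": ["5 0 R", "6 0 R"]},
                ]},
            ],
        },
    }
}
all_streams = {"4 0 R", "5 0 R", "6 0 R", "7 0 R"}

used = content_refs(trailer["Root"]["Pages"])
print("drawn:", used)
print("stray:", sorted(all_streams - set(used)))
```

Only the streams in the first list would need to be scanned for drawing operators; the stray one never reaches the canvas.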

What does an /ActualText of FEFF0009 mean in a PDF?

I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
.. This seems incorrect as it's really a space.
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered on the position of the tab, but when you copy this text and paste somewhere else you get a tab character. I suppose the 'space' is only inserted to have something to copy.
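The reading of FEFF0009 as a byte order mark followed by a tab can be checked directly. This is a small sketch with a hypothetical helper name; the fallback branch uses Latin-1 only as a stand-in for PDFDocEncoding, which differs for some byte values.

```python
def decode_pdf_hex_string(hex_body):
    """Decode the digits between < and > of a PDF hex string used as text."""
    digits = "".join(hex_body.split())
    if len(digits) % 2:  # a missing final digit is taken to be 0
        digits += "0"
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):  # byte order mark: UTF-16, big-endian
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # stand-in for PDFDocEncoding


actual_text = decode_pdf_hex_string("FEFF0009")
print(repr(actual_text))
```

The decoded value is the single character U+0009, which is what a viewer hands over on copy and paste instead of the space drawn by Tj.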