''AN ERROR EXISTS ON THIS PAGE. ACROBAT MAY NOT DISPLAY PAGE CORRECTLY. PLEASE CONTACT THE PERSON WHO HAS CREATED THE PDF DOCUMENT TO CORRECT THE PROBLEM.''
This error appears when I try to open the PDF in Adobe Reader, but in the Chrome browser it works fine.
Sampe_PDF
Adobe Acrobat Reader complains because there is indeed trash in the content streams of the second and third pages of your document. There are multiple sections like this:
0 Tc
1 0 0 sc
? ? m
? ? ? ? ? ? c
? ? ? ? ? ? c
? ? ? ? ? ? c
? ? ? ? ? ? c
f
0 0 0 sc
BT
1 1 1 sc
/F1 10 Tf
? ? Td
(12) Tj
ET
All these question marks would have to be numbers to make this a valid PDF content stream. There even is one similar section with numbers filled in:
1 0 0 sc
18.5 443.384 m
18.5 447.25 21.63401 450.384 25.5 450.384 c
29.36599 450.384 32.5 447.25 32.5 443.384 c
32.5 439.51801 29.36599 436.384 25.5 436.384 c
21.63401 436.384 18.5 439.51801 18.5 443.384 c
f
0 0 0 sc
BT
1 1 1 sc
/F2 10 Tf
20 440.384 Td
(12) Tj
ET
I would assume that the producer either used the '?' characters as placeholders in a template and simply forgot to fill them in (in which case what you have is probably a template, not a PDF meant for distribution or viewing), or added these sections with NaN (not-a-number) values, which were then serialized as question marks.
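To locate such sections without reading the stream by eye, something like the following Python sketch could help. It assumes the content stream has already been decompressed to text and that each operator sits on its own line, as in the excerpts above; the function name and the operator list are illustrative only, not part of any library.

```python
import re

# Operators from the excerpts above that take only numeric operands.
NUMERIC_OPERATORS = {"m", "c", "l", "Td", "TD", "Tc", "Tw", "sc", "re"}
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)$")


def find_bad_operands(stream_text):
    """Return (line_number, line) pairs whose operands are not numbers."""
    bad = []
    for number, line in enumerate(stream_text.splitlines(), start=1):
        tokens = line.split()
        # Only check lines that end in one of the numeric-only operators.
        if not tokens or tokens[-1] not in NUMERIC_OPERATORS:
            continue
        if not all(NUMBER.match(token) for token in tokens[:-1]):
            bad.append((number, line))
    return bad


sample = """1 0 0 sc
? ? m
? ? ? ? ? ? c
f
18.5 443.384 m
(12) Tj
"""
bad_lines = find_bad_operands(sample)
print(bad_lines)
```

A real content stream may put several operators on one line, so a proper tokenizer would be needed for general use.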
I have to write multilingual text to a PDF using C++. I have the Unicode values as well as the glyph IDs, with their advances and displacements, for the input string.
But I need to know how to position a dependent glyph relative to its independent base glyph.
Suppose I have advance and displacement values from FreeType / HarfBuzz: how should I put these values into the PDF content stream along with the glyph IDs?
I have tried the output values of FreeType and HarfBuzz. The individual glyphs print properly, but the dependent glyphs are still not positioned properly relative to their base glyphs, even though I used the advance and displacement values from their outputs.
I just need the logic for using these output values in the content stream to produce a properly readable word or letter.
Example:
Text = Tamil letter + Hindi letter.
I need to print this output: [image: proper output]
But currently I am only able to print this: [image: improper output]
Tamil combined letter:
வ = U+0BB5 TAMIL LETTER VA = base glyph
ா = U+0BBE TAMIL VOWEL SIGN AA = dependent glyph
HarfBuzz run:
hb-shape.exe -O json -u u+0bb5,u+0bbe --no-glyph-names "C:\\Windows\\Fonts\\Nirmala.ttf"
gid output:
[{"g":2953,"cl":0,"dx":0,"dy":0,"ax":2111,"ay":0},{"g":2959,"cl":0,"dx":0,"dy":0,"ax":1453,"ay":0}]
Hindi combined letter:
म = U+092E DEVANAGARI LETTER MA = base glyph
ि = U+093F DEVANAGARI VOWEL SIGN I = dependent glyph
HarfBuzz run:
hb-shape.exe -O json -u u+092e,u+093f --no-glyph-names "C:\\Windows\\Fonts\\Nirmala.ttf"
gid output:
[{"g":302,"cl":0,"dx":0,"dy":0,"ax":532,"ay":0},{"g":273,"cl":0,"dx":0,"dy":0,"ax":1379,"ay":0}]
Substituting these output values into the formula from the PDF documentation [image: PDF doc formula],
assuming unity for all variables except width and advance,
and obtaining the width values using FreeType, I computed the following.
Glyph Advance values for four glyphs in order:
tx = 1769
tx = 1132
tx = 1586
tx = 1448
If I provide these values in the content stream in the order as
<glyph id 1> tx 1 <glyph id 2> tx 2 <glyph id 3> tx 3 <glyph id 4> tx 4
Content stream:
/OC /oc2 BDC q BT /FXF1 1 Tf 70.866142 0.000000 0.000000 70.866142 28.346457 141.732285 Tm[<0B89>-1769<0B8F>-1132<0111>-1586<012E>-1448]TJ ET Q EMC
The PDF documentation says that a positive adjustment value moves the text to the left.
Is it the other way around?
Or should the difference between the advances be used instead?
Additional PDF objects:
Font descriptor object, base font object, font object.
I have tried using only the advance values, and also only the computed values.
The only problem is the horizontal and vertical spacing within combined glyphs, which also affects the spacing between subsequent glyphs.
None of these attempts renders the glyphs legibly, at least not in a generalised, programmatic manner.
From reading @mkl's answers in various Stack Overflow posts, I suspect that each glyph needs its own transformation matrix or Td. But is it really that complex?
I would have thought it could be rendered easily.
If an individual transformation matrix or Td is needed, how do I compute the values to supply for them?
Any help and guidance is welcome and much appreciated.
Thank you.
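For what it is worth, here is how I read the PDF specification's glyph displacement formula, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. A viewer already advances by the glyph width w0 declared in the PDF font dictionary, so the numbers in a TJ array are corrections on top of that width, not the advances themselves. A positive number moves the next glyph to the left. The Python sketch below (Python only for brevity; the arithmetic carries over to C++) turns HarfBuzz-style records into a TJ array. The names tj_array and pdf_widths and the unitsPerEm value of 2048 are assumptions for illustration, as is the premise that glyph IDs are used directly as two-byte character codes, as in the <0B89> codes of the question. Vertical offsets (dy) cannot be expressed in TJ and are not handled here.

```python
def format_number(value):
    """Format a TJ adjustment with at most three decimals."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def tj_array(glyphs, pdf_widths, units_per_em):
    """Build a TJ operand from HarfBuzz-style glyph records.

    glyphs:     dicts with "g" (glyph id), "ax" (advance), "dx" (x offset),
                all in font units.
    pdf_widths: glyph id -> width declared in the PDF font (1/1000 units).
    """
    scale = 1000.0 / units_per_em
    parts = []
    pending = 0.0  # adjustment still to be emitted before the next glyph
    for glyph in glyphs:
        dx = glyph["dx"] * scale
        pending -= dx  # a negative TJ number shifts the glyph right by dx
        if abs(pending) > 0.0005:
            parts.append(format_number(pending))
        parts.append("<%04X>" % glyph["g"])
        # The viewer now advances by the declared width. Correct that to
        # the HarfBuzz advance and undo the temporary dx shift.
        pending = pdf_widths[glyph["g"]] - glyph["ax"] * scale + dx
    if abs(pending) > 0.0005:
        parts.append(format_number(pending))
    return "[" + "".join(parts) + "] TJ"


# Tamil example from the question; 2048 units per em is an assumption.
tamil = [
    {"g": 2953, "dx": 0, "ax": 2111},
    {"g": 2959, "dx": 0, "ax": 1453},
]
widths = {2953: 2111 * 1000 / 2048, 2959: 1453 * 1000 / 2048}
tamil_tj = tj_array(tamil, widths, 2048)
print(tamil_tj)
```

If the widths declared in the font dictionary match the shaped advances, no correction is needed at all and the array contains only the glyph codes. Writing the full advance as a negative number, as in the content stream above, would add the advance a second time.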
It helps to work out the PDF as plain text, which you can "compile" simply by saving it in Notepad.
Here I am altering a batch.cmd (work in progress :-) to test that my compiler handles the changes as text, but you can edit the raw PDF in an editor too. Beware: after cut and paste, a value or two may need changing. I also don't know yet how to easily reference non-Latin fonts (the next hurdle after images, which are almost done), so I used the "Symbol" font to illustrate the positioning modifications.
Note: for specific queries @mkl is the expert. I simply program by examples that function, not by the book.
%PDF-1.0
%µ¶µ¶
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/MediaBox [0 0 594 792]/Resources<</Font<< /F1 4 0 R /F2 5 0 R>>>>/Contents 6 0 R>>endobj
4 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj
5 0 obj<</Type/Font/Subtype/Type1/BaseFont/Symbol>>endobj
%Comment the following /Length 0999 is a dummy value it should be altered to equal decimal stream length, but most readers will ignore or work around invalid
6 0 obj<</Length 1326>>
stream
q
BT /F1 20 Tf 072 740 Td (20 units (default units usually = pts) high Headline) Tj ET
BT /F1 16 Tf 036 700 Td (All text is "Body" text. (no heads or tails)) Tj ET
BT /F1 10 Tf 004 780 Td (Text can be any order see "Body" text above. (Printed by Filename="C:\Users\K\Downloads\Programming\CMDaPDF\MAKE2PDF.cmd") spot the escape errors) Tj ET
BT /F1 12 Tf 036 675 Td (Here # 12 units high you must include just enough text for parts of a line. PDF has no page feeds no wrapping,) Tj 0 -20 Td (nor \\new line feed, no ¶aragraphs) Tj 86 -15 Td (nor carriage \r\\return. \n\r ) Tj 100 5 Td ( It is not \007\010\011\012\\tabular, each page is one row of multiple pages,) Tj 50 -15 Td (each page is one text column wide .[ ×] no yes check) Tj 0 -10 Td (each row is one text column wide .[x] no is yes) Tj 0 -10 Td (each row is one text column wide . · bullet point OK) Tj ET
BT +0.50 Tc -1.4 Tw 999 TL /F1 1 Tf 15 001 10. 30 200.000 440.000 Tm [(Jane A)600(usten)] TJ ET
BT +0.50 Tc 0.00 Tw 000 TL /F2 1 Tf 15 000 000 15 200.000 430.000 Tm [(Ja)-1000(ne Austen)] TJ ET
BT -1.20 Tc 0.00 Tw 999 TL /F2 1 Tf 15 000 000 15 200.000 420.000 Tm [(J)-1200(a)800(ne Austen)] TJ ET
BT +0.00 Tc 0.00 Tw 000 TL /F2 1 Tf 15 000 000 15 200.000 410.000 Tm [(Jane A)100(us)-500(ten)] TJ ET
Q
endstream
xref
0 7
0000000000 65535 f
0000000019 00000 n
0000000065 00000 n
0000000117 00000 n
0000000242 00000 n
0000000306 00000 n
0000000527 00000 n
trailer<</Size 7/Root 1 0 R>>
startxref
1903
%%EOF
I wrote a Google Apps Script that uses OCR to get the text from a PDF file on Google Drive via getFileById(). The problem is that the file is an Adobe PDF form, and the script reads only the text that is not inside the form fields that were filled in. Does anyone have any suggestions on how I can get the field values as well?
The file that I used as an example: http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf (please fill in the file and save it to your Drive to test it).
The PDF on my Drive: [image: PDF file]
This is the result: [image: values shown when code is executed]
Here is my code:
function extractTextFromPDF() {
  // ID of the PDF file on Google Drive.
  var fileId = '[File ID here]';
  var blob = DriveApp.getFileById(fileId).getBlob();
  var resource = {
    title: blob.getName(),
    mimeType: blob.getContentType()
  };
  // Requires the Advanced Drive Service (Drive API v2) to be enabled.
  // Uploading with ocr: true converts the PDF into a Google Doc.
  var file = Drive.Files.insert(resource, blob, { ocr: true, ocrLanguage: 'en' });
  // ,supportsAllDrives: true
  // Extract the text from the converted document.
  var doc = DocumentApp.openById(file.id);
  var text = doc.getBody().getText();
  Logger.log(text);
  return text;
}
The FDF data does not need to be stored in the order you see on the page; often it is in the order in which the fields were added. The easiest way to see the /V (value) entries for the /T (text field name) entries is via the FDF, which a user can save and send without the source PDF. This has the added advantage that a check box is sent as /V /Yes or /V /Off (= not checked), which is notoriously difficult to read from a binary PDF. However, you still need to know which text on the page was Language 1, since the author did not tag it as Deutsch!
What intrigues me most (without looking closer) is that the count of attached objects only went up by 12, yet there are more potentially added answers than that, from a field :-) of 17.
/Lang(en-GB)
/AcroForm<</Fields[
5 0 R 7 0 R 8 0 R 9 0 R 10 0 R
11 0 R 12 0 R 13 0 R 14 0 R 16 0 R
17 0 R 18 0 R 19 0 R 20 0 R 21 0 R
22 0 R 23 0 R
]/DR 37 0 R/NeedAppearances true>>
Anyway, if you don't want to do it the easy way, you will need to program something to read those object chains. Personally, I find it takes each user only a few seconds to export the form data or have it sent in the background by email; however, most modern users don't use the historic mailto:, so you have to get them to send the form by web mail for you to press the extract button.
So here is object 5, and we can see it is for your Given Name:
5 0 obj
<</Type/Annot/Subtype/Widget/F 4
/Rect[165.7 453.7 315.7 467.9]
/FT/Tx
/P 1 0 R
/T(Given Name Text Box)
/TU<FEFF004600690072007300740020006E0061006D0065>
/V <FEFF>
/DV <FEFF>
/MaxLen 40
/DR<</Font 6 0 R>>
/DA(0 0 0 rg /F3 11 Tf)
/AP<<
/N 38 0 R
>>
>>
endobj
So the /TU entry 004600690072007300740020006E0061006D0065 converts to "First name", and we can see the answer when the user sends back the form as a PDF:
5 0 obj
<</AP<</N 38 0 R >>/DA(0 0 0 rg /F3 11 Tf)/DR<</Font 6 0 R >>/DV<FEFF>/F 4/FT/Tx/MaxLen 40/P 1 0 R /Rect[ 165.7 453.7 315.7 467.9]/Subtype/Widget/T(Given Name Text Box)
/TU<FEFF004600690072007300740020006E0061006D0065>/Type/Annot/V(Brunno)>>
endobj
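If you do go the programming route, the decoding step above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the object is available as uncompressed text, as pasted above, and the values contain no escaped or nested parentheses. The helper names are made up for this example, and Latin-1 is used as a rough stand-in for PDFDocEncoding.

```python
import re

FIELD_NAME = re.compile(r"/T\s*\(([^)]*)\)")
FIELD_VALUE = re.compile(r"/V\s*\(([^)]*)\)")


def decode_pdf_hex_string(hex_text):
    """Decode the hex digits of a PDF hex string such as <FEFF0046...>."""
    raw = bytes.fromhex(hex_text)
    if raw.startswith(b"\xfe\xff"):
        # A leading FEFF byte order mark means UTF-16BE text follows.
        return raw[2:].decode("utf-16-be")
    # Rough approximation: treat everything else as Latin-1.
    return raw.decode("latin-1")


def read_text_field(object_text):
    """Return (field name, value) from the text of one widget object."""
    name = FIELD_NAME.search(object_text)
    value = FIELD_VALUE.search(object_text)
    return (
        name.group(1) if name else None,
        value.group(1) if value else None,
    )


returned_object = (
    "<</AP<</N 38 0 R >>/DA(0 0 0 rg /F3 11 Tf)/DR<</Font 6 0 R >>"
    "/DV<FEFF>/F 4/FT/Tx/MaxLen 40/P 1 0 R /Rect[ 165.7 453.7 315.7 467.9]"
    "/Subtype/Widget/T(Given Name Text Box)"
    "/TU<FEFF004600690072007300740020006E0061006D0065>/Type/Annot/V(Brunno)>>"
)
tooltip = decode_pdf_hex_string("FEFF004600690072007300740020006E0061006D0065")
field = read_text_field(returned_object)
print(tooltip, field)
```

Objects stored inside compressed object streams would have to be decompressed first, which this sketch does not attempt.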
I'm trying to write a justified paragraph in a PDF with multiple formats (sizes, italics, bold, colors). To achieve this I thought I could use the graphics state stack to avoid repeating text operators, but the behavior of the graphics state stack seems to depend on the PDF reader. Do I have to repeat the text operators every time I want to change the text format, or is there a better way to achieve what I need?
I have the following PDF stream to test the graphics state stack:
BT
1 0 0 1 56.69 785.19 Tm
0 -12.1 Td
/F1 11 Tf
1.79 Tw
(rzo motáúe issjstózñ x vasreqyxñ ómfzzííh nohé hábúgíoújé úyz túit k wf ñxaóúgsri rcémwewá)Tj
0 -16.5 Td
5.1 Tw
(óaxhkd óáfythra)Tj
q
/F2bi 10 Tf
0.835 0.283 0.833 rg
4.1718 Tw
(olvéd cjtwymelgv stzr cc uxnugtqúic)Tj
q
/F3b 15 Tf
0.491 0.895 0.74 rg
15 Tw
(q hwvúñóál íu vpfíxht)Tj
0 -16.5 Td
(qfyébávrx vkámday)Tj
Q
Q
(cúprnfr úhwñ rá wdwñ óyxáumvpn nmrdó)Tj ET
In the Ubuntu PDF reader, the q operator doesn't affect the Td operator.
In the Chrome PDF reader, the q operator does affect the Td operator.
The save and restore graphics state operators are not allowed in text objects, i.e. q and Q between BT and ET are invalid.
Thus, your content stream is invalid, and the behavior of PDF viewers attempting to display it is undefined.
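As a quick self-check before handing a stream to a viewer, a small Python sketch like the one below can flag q and Q inside text objects. It is a simplified illustration, not a full PDF tokenizer: it skips literal strings without nested parentheses and ignores hex strings, inline images and comments. The function name is made up for this example.

```python
import re

# Either a literal string (no nested parentheses) or a run of other characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")


def find_state_ops_in_text(stream_text):
    """Return (token_index, operator) for each q or Q between BT and ET."""
    in_text = False
    problems = []
    for index, match in enumerate(TOKEN.finditer(stream_text)):
        token = match.group()
        if token == "BT":
            in_text = True
        elif token == "ET":
            in_text = False
        elif in_text and token in ("q", "Q"):
            problems.append((index, token))
    return problems


sample = "BT /F1 11 Tf (a q b)Tj q /F2 10 Tf (c)Tj Q ET q Q"
problems = find_state_ops_in_text(sample)
print(problems)
```

Inside a text object, font, color and word spacing simply stay in effect until they are set again, so switching back to an earlier format means repeating the corresponding Tf, rg and Tw operators.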
I'm having real difficulty computing the (tx, ty) position of text objects from a parsed PDF stream.
I have the following stream code:
BT
0.75 0.68 0.67 0.902 k
/GS0 gs
/TT0 1 Tf
-0.018 Tc 7.56 0 0 7.56 77.1871 528.3107 Tm
(Text line 1)Tj
-0.019 Tc 0 -1.917 TD
(Text line 2)Tj
-0.017 Tc 0 -2.917 TD
(HEADER)Tj
ET
q
43.167 489.881 7.56 7.56 re
W n
BT
/TT0 1 Tf
0 Tc 7.56 0 0 7.56 43.1671 491.7707 Tm
(INDEX)Tj
ET
When I open this PDF in a PDF reader, the HEADER and INDEX objects appear exactly next to each other, as if they were on the same line.
However, when I calculate HEADER's ty value from the previous Tm (528.3107), I get 491.7657, which is 0.0050 lower than INDEX's ty (491.7707). In other parts of the file, the more lines a text paragraph has, the greater this difference becomes.
Basically, what I do is multiply Tm's scaling factor (7.56) by the ty deltas of the TD operators. Obviously I am doing something wrong, but there is little documentation on the net for dummies like myself...
So my question is: how do other PDF readers interpret HEADER's and INDEX's ty values as equal, so that they print them at the same height?
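For reference, the calculation described above can be written out as a small Python sketch that follows the text line matrix through Tm and the two TD operators. The helper names are illustrative only. It reproduces the 491.7657 figure, which suggests the arithmetic itself is sound. One plausible explanation for the gap, which is an assumption and not confirmed by anything in the file, is that the TD operands were rounded to three decimals when the PDF was written: each rounded operand can be off by up to 0.0005, which the 7.56 scale turns into almost 0.004 per line.

```python
def multiply(a, b):
    """Multiply two PDF matrices given as [a b c d e f] (row-vector form)."""
    return [
        a[0] * b[0] + a[1] * b[2],
        a[0] * b[1] + a[1] * b[3],
        a[2] * b[0] + a[3] * b[2],
        a[2] * b[1] + a[3] * b[3],
        a[4] * b[0] + a[5] * b[2] + b[4],
        a[4] * b[1] + a[5] * b[3] + b[5],
    ]


def translate(tx, ty):
    """Matrix for a Td/TD move by (tx, ty) in text space."""
    return [1, 0, 0, 1, tx, ty]


# 7.56 0 0 7.56 77.1871 528.3107 Tm
line_matrix = [7.56, 0, 0, 7.56, 77.1871, 528.3107]
# 0 -1.917 TD, then 0 -2.917 TD: each move is applied in front of the
# current text line matrix.
for tx, ty in [(0, -1.917), (0, -2.917)]:
    line_matrix = multiply(translate(tx, ty), line_matrix)

header_y = line_matrix[5]
index_y = 491.7707
worst_case_rounding = 2 * 0.0005 * 7.56
print(header_y, index_y - header_y, worst_case_rounding)
```

A difference of about 0.005 units is far below anything visible on screen or in print, so a viewer does not need to treat the two values as equal for the lines to look aligned.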
I am on Ubuntu.
I have a PDF file whose pages are divided into a grid. Each block of the grid contains the name/age/DOB/photo of a candidate. Some records have a watermark "disqualified".
I need to scrape this PDF, with the disqualified candidates in a separate list.
Using pyPdf I was able to get the individual records, but they also include the watermarked candidates.
How can I detect the watermark? If I can get the coordinates of the watermark, how do I match them with the candidate?
I am open to solutions other than Python and pyPdf.
(Actually this is not an answer but merely an analysis too big for a comment.)
I don't know pyPdf (or any Python PDF classes) myself, but here is how the watermark is created for a sample entry; based on this, anyone who knows pyPdf well enough may more easily advise.
The Roundup
Depending on how pyPdf (or other Python PDF classes) allows access to the page content, there are two basic approaches:
If the class returns information on content (text and images) in the order in which it appears in the page content stream: the watermark image XObject is referenced right before the data of the entry. Thus, any entry preceded by the drawing of an image XObject is marked.
If, on the other hand, the information is not given in the order of the page content stream, coordinate comparison must be used, which in itself is quite straightforward. In that case it may be of interest that the images are inserted with a [0.1 0 0 0.1 0 0] transformation matrix in effect, while the text is drawn with an identity transformation matrix.
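To illustrate the coordinate comparison, here is a rough Python sketch using the matrices from the sample entry's content stream (entry # 200). It does not use pyPdf; the helper names are made up, and picking the origin of the candidate's name as the point to test is just one possible choice.

```python
def multiply(a, b):
    """Multiply two PDF matrices given as [a b c d e f] (row-vector form)."""
    return [
        a[0] * b[0] + a[1] * b[2],
        a[0] * b[1] + a[1] * b[3],
        a[2] * b[0] + a[3] * b[2],
        a[2] * b[1] + a[3] * b[3],
        a[4] * b[0] + a[5] * b[2] + b[4],
        a[4] * b[1] + a[5] * b[3] + b[5],
    ]


def transform(point, m):
    """Map a point through a PDF matrix."""
    x, y = point
    return (x * m[0] + y * m[2] + m[4], x * m[1] + y * m[3] + m[5])


def image_bbox(ctm):
    """An image XObject fills the unit square of its coordinate system."""
    corners = [transform(c, ctm) for c in [(0, 0), (1, 0), (0, 1), (1, 1)]]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def contains(bbox, point):
    return bbox[0] <= point[0] <= bbox[2] and bbox[1] <= point[1] <= bbox[3]


page_scale = [0.1, 0, 0, 0.1, 0, 0]            # q 0.1 0 0 0.1 0 0 cm
image_cm = [1045, 0, 0, 495, 462.5, 6510.5]    # cm right before /R17 Do
watermark_box = image_bbox(multiply(image_cm, page_scale))

# Text is drawn after "10 0 0 10 0 0 cm", which cancels the 0.1 scale.
text_ctm = multiply([10, 0, 0, 10, 0, 0], page_scale)
# Name line: 1 0 0 1 145.1 708.5 Tm followed by -90.35 -14.95 Td.
name_origin = transform((145.1 - 90.35, 708.5 - 14.95), text_ctm)
is_marked = contains(watermark_box, name_origin)
print(watermark_box, name_origin, is_marked)
```

With these numbers the watermark covers roughly x 46 to 151 and y 651 to 701 in page units, and the name of the entry starts inside that box.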
The Details
This is entry # 200; the other watermarked entry is constructed similarly:
Watermarking is done by means of an image xobject. There is but one image xobject defined for the page used by both watermarked entries:
4 0 obj
<</Type/Page/MediaBox [0 0 595 841]
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /ImageC /ImageI /Text]
/ColorSpace 18 0 R
/ExtGState 19 0 R
/XObject 20 0 R
/Font 21 0 R
>>
/Contents 5 0 R
>>
endobj
20 0 obj
<</R17
17 0 R>>
endobj
17 0 obj
<</Subtype/Image
/ColorSpace 16 0 R
/Width 128
/Height 88
/BitsPerComponent 8
/Filter/FlateDecode/Length 463>>stream
[...]
endstream
endobj
In the content stream this xobject /R17 is inserted right before the data of the entry itself is drawn:
q 0.1 0 0 0.1 0 0 cm
[...]
q 1045 0 0 495 462.5 6510.5 cm
/R17 Do
Q
q
10 0 0 10 0 0 cm BT
0.000487366 Tc
/R10 8 Tf
1 0 0 1 86 650.75 Tm
(Sex : Male)Tj
0.000304794 Tc
-64 0 Td
(Age : 43)Tj
-0.000140686 Tc
-1 11.05 Td
(House No :)Tj
-0.00002085 Tc
1 31.95 Td
(Name :)Tj
0.00008575 Tc
/R12 7.15 Tf
25.5 17.8 Td
( 200 )Tj
ET
Q
1547.5 6475 485 535.5 re
S
q
10 0 0 10 0 0 cm BT
-0.000403137 Tc
/R14 8 Tf
1 0 0 1 145.1 708.5 Tm
(XVX0001081)Tj
0.000421651 Tc
/R14 7.05 Tf
-90.35 -14.95 Td
(Ramesh Kumar)Tj
0.000373332 Tc
/R10 7.05 Tf
-33 -12.75 Td
(Father's )Tj
0.000193787 Tc
7.3 TL
(Name)'
0.00037774 Tc
/R14 7.05 Tf
40.25 1.8 Td
(Ram Singh)Tj
0 Tc
2.5 -11.85 Td
(37)Tj
0.00137196 Tc
/R12 7.15 Tf
-5.25 13.35 Td
(:)Tj