PDFBox: error when extracting font color from a PDF

Hi everybody, sorry for my English; I'm not a native speaker.
My question is this: I am trying to use the example code that was posted on this site (How to get font color using pdfbox). The author says the code was tested, but when I run it, it shows me this error:
jul 17, 2013 1:05:28 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
jul 17, 2013 1:05:29 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
DeviceGray
org.apache.pdfbox.pdmodel.graphics.color.PDColorState@481958
0.0
The PDF I was extracting from contains 3 letters (RGB), painted as follows:
R: painted in red
G: painted in green
B: painted in black
Can somebody explain why this error occurs, or tell me how I can extract the text color from a PDF?
Thanks in advance for any comments.

Those log lines are of level INFO only; they are not errors:
jul 17, 2013 1:05:28 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: BDC
jul 17, 2013 1:05:29 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: EMC
All they say is that certain operators (BDC, EMC) were encountered in the page content for which no processor was registered. Since those operators only matter when analyzing marked content, they can be ignored for your task.
Thereafter you have output from the code you referred to:
DeviceGray
org.apache.pdfbox.pdmodel.graphics.color.PDColorState@481958
0.0
At least the first and the last line match that code: a color in the DeviceGray color space with gray value 0 was encountered, most likely your black B. (Could it be that you added an additional output in between, e.g. of graphicState.getStrokingColor()?)
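To see why a DeviceGray value of 0.0 points at the black B: a gray level g corresponds to equal red, green and blue components (g, g, g), with 0.0 being black and 1.0 white. A minimal Python sketch of that mapping follows; the function name is made up for illustration and is not part of PDFBox.

```python
def device_gray_to_rgb(gray):
    """Map a DeviceGray level (0.0 = black, 1.0 = white) to an RGB triple."""
    if not 0.0 <= gray <= 1.0:
        raise ValueError("gray level must be within [0, 1]")
    # Gray is simply the same intensity on all three channels.
    return (gray, gray, gray)


print(device_gray_to_rgb(0.0))  # (0.0, 0.0, 0.0), i.e. black
```

The red R and green G would not show up as DeviceGray at all; they would be expected in a color space such as DeviceRGB.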
Thus, no error, all working fine.

Related

Selenium IDE doesn't match text, despite using correct wildcard

I'm using Selenium IDE in Chrome to test a web site. When the test runs successfully, the site produces the text " Success Saving Scenario!" Selenium IDE finds this text, but I can't find the right value to match that text.
Here's my setting:
Command: Assert Text
Target: css=li > span
Value: Success Saving Scenario
Each time I run this test, the IDE records a failure with the message:
assertText on css=li > span with value Success Saving Scenario Failed:
12:23:02
Actual value "Thu, 03 Feb 2022 17:23:02 GMT - Success Saving Scenario!" did not match "Success Saving Scenario"
I checked the page, and sure enough the text displays Thu, 03 Feb 2022 17:23:02 GMT - Success Saving Scenario!
Why does that not match *Success Saving Scenario*? I thought the asterisks would act as wildcards that match any characters.
I've tried these values as well with no success:
glob: Success Saving Scenario
regexp: Success Saving Scenario
(just an asterisk by itself)
Any ideas?
I would use 'assert element present' for this case. Pick another locator from the Target dropdown, one that uses 'contains', and trim the timestamp out of the contains text as needed. Leave the Value field empty.
Sample
Command: assert element present | Target: xpath=//span[contains(.,'Success Saving Scenario')] | Value: empty
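The difference between the failing assertion and the XPath contains() approach can be sketched in plain Python. This is only an illustration of exact versus substring versus glob-style matching on the string from the failure message; the helper names are hypothetical, and fnmatchcase shows Python's own glob semantics, not necessarily what a given Selenium IDE version does with asterisks.

```python
from fnmatch import fnmatchcase

ACTUAL = "Thu, 03 Feb 2022 17:23:02 GMT - Success Saving Scenario!"
EXPECTED = "Success Saving Scenario"


def exact_match(actual, expected):
    # What a literal text assertion does: the whole string must be equal.
    return actual == expected


def contains(actual, expected):
    # What XPath contains(., '...') checks: the text appears somewhere.
    return expected in actual


def glob_match(actual, pattern):
    # Python's glob matching, where * stands for any run of characters.
    return fnmatchcase(actual, pattern)


print(exact_match(ACTUAL, EXPECTED))             # False
print(contains(ACTUAL, EXPECTED))                # True
print(glob_match(ACTUAL, "*" + EXPECTED + "*"))  # True
```

The timestamp prefix and the trailing exclamation mark are what make the exact comparison fail, which is why a contains-style locator is the more robust choice here.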

Circuits v3.2 Manager.py **events.effects** Issue

Has anyone noticed errors or event.effect errors when using Version 3.2 of Circuits? I've received the message:
Apr 15 03:48:05 **** resilient-circuits[11462]: event.effects -= 1
Apr 15 03:48:05 **** resilient-circuits[11462]: *AttributeError: 'ThreatServiceLookupEvent' object has no attribute 'effects'*
which seems to be coming from circuits, specifically line 738 of the Manager.py script.
[Manager.py]: <https://github.com/circuits/circuits/blob/59b2a7be553fa788d06e7575b39a5eb2ec96f884/circuits/core/manager.py#L738>

how to extract camera info and images from ros bag?

I have a ROS bag with the following information:
path:        zed.bag
version:     2.0
duration:    3:55s (235s)
start:       Nov 12 2014 04:28:20.90 (1415737700.90)
end:         Nov 12 2014 04:32:16.65 (1415737936.65)
size:        668.3 MB
messages:    54083
compression: none [848/848 chunks]
types:       sensor_msgs/CameraInfo      [c9a58c1b0b154e0e6da7578cb991d214]
             sensor_msgs/CompressedImage [8f7a12909da2c9d3332d540a0977563f]
             tf2_msgs/TFMessage          [94810edda583a504dfda3829e70d7eec]
topics:      /stereo_camera/left/camera_info_throttle             3741 msgs : sensor_msgs/CameraInfo
             /stereo_camera/left/image_raw_throttle/compressed    3753 msgs : sensor_msgs/CompressedImage
             /stereo_camera/right/camera_info_throttle            3741 msgs : sensor_msgs/CameraInfo
             /stereo_camera/right/image_raw_throttle/compressed   3745 msgs : sensor_msgs/CompressedImage
             /tf                                                 39103 msgs : tf2_msgs/TFMessage (2 connections)
I can extract the images by following
http://wiki.ros.org/rosbag/Tutorials/Exporting%20image%20and%20video%20data
but I run into problems when I try to get the camera info. Does anyone know how to solve this?
You can solve this by echoing the text-based information into a file with rostopic:
rostopic echo -b zed.bag /stereo_camera/left/camera_info_throttle > data.txt
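Once data.txt exists, the camera parameters can be pulled out of it with a few lines of Python. The sketch below assumes the usual rostopic echo layout (one YAML-like message per block, blocks separated by a line containing ---, top-level fields such as height, width, D, K, R and P starting in column 0). The function name and the sample text are made up for illustration; the sample values are not taken from zed.bag.

```python
import ast


def parse_camera_info_dump(text):
    """Split `rostopic echo` output into messages and keep a few top-level fields."""
    messages = []
    for block in text.split("\n---"):
        fields = {}
        for line in block.splitlines():
            # Nested fields (header, roi) are indented; only read column-0 keys.
            if not line or line[0].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            value = value.strip()
            if key in ("height", "width"):
                fields[key] = int(value)
            elif key in ("D", "K", "R", "P"):
                # Arrays are printed as bracketed number lists.
                fields[key] = list(ast.literal_eval(value))
        if fields:
            messages.append(fields)
    return messages


# Hypothetical sample in the assumed layout (values invented for illustration).
sample = """header: 
  seq: 1
  frame_id: "left_camera"
height: 720
width: 1280
distortion_model: "plumb_bob"
D: [0.0, 0.0, 0.0, 0.0, 0.0]
K: [700.0, 0.0, 640.0, 0.0, 700.0, 360.0, 0.0, 0.0, 1.0]
---
"""

for msg in parse_camera_info_dump(sample):
    print(msg["width"], msg["height"], msg["K"])
```

With the real file, the same function would be called on open("data.txt").read(). For a throttled camera_info topic the intrinsics are usually identical in every message, so reading the first block is often enough.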

Split and merge pdf files using PDFBOX produces large file

I have a large print file in PDF format that contains 5544 pages and is about 36 MB in size. The file was created by MS Word 2010 and contains only text and a logo on each letter/document.
I split it into 5544 files and merge them back into 2770 letters, based on keywords. Each letter is approx. 140-145 KB.
When I merge all the letters into a new PDF print file, still containing 5544 pages, the file size has grown to 396 MB.
All text extraction, splitting and merging is performed with calls to the Apache PDFBox command-line tools from PHP, but the result is the same when run from a console.
Any idea how to reduce the file size of the letters and the final print file?
It seems like PDFBox has just appended each letter to the final print file instead of creating a new PDF document.
It is only in the testing phase that all the documents are merged into the final print file; some of the documents will be sent by email.
I have also tried SAMBox (a fork of PDFBox) but with nearly the same result:
pdfinfo Original.pdf
Title: Printfile
Author: Claus Hjort Bube
Creator: Microsoft® Word 2010
Producer: Microsoft® Word 2010
CreationDate: Fri May 19 12:16:34 2017 CEST
ModDate: Fri May 19 12:16:34 2017 CEST
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 36092281 bytes
Optimized: no
PDF version: 1.5
pdfinfo PDFBox.pdf
Title: Printfile
Author: Claus Hjort Bube
Creator: Microsoft® Word 2010
Producer: Microsoft® Word 2010
CreationDate: Fri May 19 12:16:34 2017 CEST
ModDate: Fri May 19 12:16:34 2017 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 396622354 bytes
Optimized: no
PDF version: 1.4
pdfinfo SAMBox.pdf
Creator: Sejda Console 3.2.17
Producer: SAMBox 1.1.8 (www.sejda.org)
ModDate: Tue Jul 11 23:34:33 2017 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 378779436 bytes
Optimized: no
PDF version: 1.7
That may sound sad, but it is correct. When splitting, each file gets a copy of the resources it needs (e.g. fonts and the company logo graphic). When the files are merged back, PDFBox does not know that these resources may be identical across the whole document, so they end up duplicated many times.
The only solution I see for you would be to use the PDFBox Java API to create the mailing files and the final print file in one step, i.e. without creating single files that are then merged back.
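The numbers in the question are consistent with this explanation. As a rough check (a back-of-the-envelope sketch using only the figures quoted above, with 1 KB taken as 1000 bytes), dividing the merged file size by the number of letters gives back the reported per-letter size, which suggests the merged file is essentially the sum of the letters, each carrying its own copy of the resources:

```python
LETTERS = 2770
ORIGINAL_BYTES = 36092281   # pdfinfo Original.pdf
MERGED_BYTES = 396622354    # pdfinfo PDFBox.pdf


def average_letter_size(merged_bytes, letters):
    """Average number of bytes each letter contributes to the merged file."""
    return merged_bytes / letters


avg = average_letter_size(MERGED_BYTES, LETTERS)
growth = MERGED_BYTES / ORIGINAL_BYTES

print(f"average bytes per letter in the merged file: {avg:.0f}")
print(f"growth factor over the original: {growth:.1f}x")
```

The average lands inside the 140-145 KB range given for a single letter, and the merged file is roughly eleven times the size of the original.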

How can I convert a PDF from Google Docs to images? [or: GoogleDocs' PDF export is horrible!]

I exported a document from Google Docs as PDF (just simple pages and one of the pre-defined themes) and, as I usually do, used ImageMagick's convert to turn the pages into images, but it failed (even with the latest version) and showed no errors.
GhostScript also failed.
Other tools such as pdfinfo, mutool or qpdf don't report any error, yet it still fails even if rebuild or clean commands are applied.
Only pdfimages complains and gives me Syntax Error: Missing or invalid Coords in shading dictionary
Ok, I tried to reproduce some bugs, using Google Slides.
However, my bugs are different from yours. Read on for some details...
Google Docs does indeed create a horrible PDF syntax today. I say 'today', because I gave up with Google Docs years ago. The reason: it was always very unstable for me in the past. GoogleDocs' developers seem to change the code they activate for users all the time, and debugging the created PDFs for me was always a moving target.
When I exported the slideshow I had created to PDF and then ran the tools you mentioned on it, I got 4 different results within 20 minutes!
In one case, Mac OS X's Preview.app was unable to render anything else but 3 white pages, while Adobe's Acrobat Pro rendered it (without error message) somehow garbled and different from the GoogleDocs web preview.
In another case, Acrobat Pro showed 3 white pages, while Preview.app rendered it in a garbled way!
Unfortunately, I didn't save the different versions for closer inspection. The latest PDF I analysed, however, gave the following details.
Ghostscript:
pdfkungfoo@mbp:> gs -o PDFExportBug-%03d.jpg -sDEVICE=jpeg PDFExportBug.pdf
GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 3.
Page 1
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 2
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 3
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
ImageMagick:
convert creates white-only images from the PDF pages.
(That's no wonder, because it does not process PDFs directly but employs Ghostscript as its delegate to convert the PDF to a raster format first, which is then familiar ground for ImageMagick to continue processing. You can see details of this process by adding -verbose to your ImageMagick command line.)
qpdf
Using qpdf --check yields this result:
pdfkungfoo@mbp:> qpdf --check PDFExportBug.pdf
qpdf --check PDFExportBug.pdf
checking GoogleSlidesPDFExportBug.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
PDFExportBug.pdf (file position 9269):
unknown token while reading object (0.0000-11728996)
pdfimages:
Unlike what you discovered, my error message was this:
pdfkungfoo@mbp:> pdfimages -list PDFExportBug.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Warning (9276): Badly formatted number
Syntax Warning (9292): Badly formatted number
Syntax Warning (9592): Badly formatted number
Syntax Warning (9608): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (9908): Badly formatted number
Syntax Warning (9924): Badly formatted number
Syntax Warning (8212): Badly formatted number
Syntax Warning (8212): Badly formatted number
When I check with a text editor the file-offsets of 9276, 9292, ... 8212 for numbers, I indeed do find the following lines in the PDF code:
Line 412: 0.0000-11728996
Line 413: 0.0000-11728996
Line 466: 0.0000-11728996
Line 467: 0.0000-11728996
Line 522: 0.0000-11728996
Line 523: 0.0000-11728996
PDF code in text editor:
Looking at the context of these lines, one sees the following:
32
0
obj
<<
/ShadingType
2
/ColorSpace
/DeviceRGB
/Function
<<
/FunctionType
2
/Domain
[
0
1
]
/Range
[
0
1
0
1
0
1
]
/C0
[
0.5882353
0.05882353
0.05882353
]
/C1
[
0.78431374
0.1254902
0.03529412
]
/N
1
>>
/Coords
[
0.000000000000053689468
0.0000
-11728996
0.0000
-11728996
26.832815
]
/Extend
[
true
true
]
>>
endobj
That's true! GoogleDocs gave me a PDF that created a newline after each single token!
PDF code, if Google had formatted it less horribly:
These lines are part of a code snippet that should probably be formatted like this, if the Google PDF export wasn't as horrible as it in fact is:
32 0 obj
<<
  /ShadingType 2
  /ColorSpace /DeviceRGB
  /Function << /FunctionType 2
               /Domain [ 0 1 ]
               /Range [ 0 1 0 1 0 1 ]
               /C0 [ 0.5882353 0.05882353 0.05882353 ]
               /C1 [ 0.78431374 0.1254902 0.03529412 ]
               /N 1
            >>
  /Coords [ 0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815 ]
  /Extend [ true true ]
>>
endobj
PDF code compared to the PDF specification:
So GoogleDoc's PDF uses /ShadingType 2 (for axial shading). This Shading Type requires a 'shading dictionary' with an entry for the /Coords key that should have as value an array of 4 numbers [x0 y0 x1 y1]. These numbers would specify the starting and ending coordinates of the axis (expressed in the shading’s target coordinate space).
However, instead of a /Coords array of 4 numbers it uses one of 6 numbers: [0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815].
But Coords arrays with 6 numbers are to be used by /ShadingType 3 (radial shading).
The 6 numbers [x0 y0 r0 x1 y1 r1] then represent, according to ISO 32000:
"[...] the centres and radii of the starting and ending circles, expressed in the shading’s target coordinate space. The radii r0 and r1 shall both be greater than or equal to 0. If one radius is 0, the corresponding circle shall be treated as a point; if both are 0, nothing shall be painted."
15 minutes later, I exported the PDF again, but now I got these lines:
/Coords
[
0.000000000000053689468
0.0000-11728996
0.0000-11728996
26.832815
]
As you'll notice, now indeed the /Coords array has 4 entries -- but 0.0000-11728996 isn't a valid number!
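Why 0.0000-11728996 is rejected can be shown with a small token check. The sketch below assumes the PDF numeric syntax as I understand it from ISO 32000: an optional leading sign, then decimal digits with at most one period, and no exponent notation. The function name is made up for illustration.

```python
import re

# Optional sign, then digits with an optional period (leading, embedded or trailing).
PDF_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def is_pdf_number(token):
    """Return True if the whole token is a well-formed PDF integer or real."""
    return PDF_NUMBER.fullmatch(token) is not None


for token in ["0.000000000000053689468", "0.0000-11728996", "-11728996", "26.832815"]:
    print(token, is_pdf_number(token))
```

A sign is only allowed at the very start of a number, so the minus in the middle of 0.0000-11728996 makes the token malformed. That matches the "Badly formatted number" warnings from pdfimages and the "unknown token" message from qpdf.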
In any case, the particular numbers in my objects 32, 33 and 34 do look funny somehow:
Either they are meant to be 6 numbers:
[0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815]
Then they can only be meant for a /ShadingType 3 (radial shading)
But they are noted in the context of /ShadingType 2 (axial shading)
Or they are meant to be 4 numbers:
[0.000000000000053689468 0.0000-11728996 0.0000-11728996 26.832815]
Then 0.0000-11728996 is not a valid number.
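The 6-number reading can be checked mechanically against the rules quoted above. This is a minimal sketch, not a PDF validator: the function name and messages are made up for illustration, and it only covers shading types 2 and 3.

```python
# Number of /Coords entries per shading type:
# type 2 (axial):  [x0 y0 x1 y1]
# type 3 (radial): [x0 y0 r0 x1 y1 r1]
EXPECTED_COORDS = {2: 4, 3: 6}


def check_coords(shading_type, coords):
    """Return 'ok' or a short description of what is wrong with /Coords."""
    expected = EXPECTED_COORDS.get(shading_type)
    if expected is None:
        return f"shading type {shading_type}: not covered by this sketch"
    if len(coords) != expected:
        return (f"shading type {shading_type}: expected {expected} numbers "
                f"in /Coords, found {len(coords)}")
    if shading_type == 3 and (coords[2] < 0 or coords[5] < 0):
        return "shading type 3: radii r0 and r1 must be >= 0"
    return "ok"


six = [0.000000000000053689468, 0.0, -11728996, 0.0, -11728996, 26.832815]
print(check_coords(2, six))
print(check_coords(3, six))
```

Read as an axial shading, the array has two numbers too many. Read as a radial shading, the count fits, but r0 would be -11728996, which conflicts with the requirement that both radii be greater than or equal to 0. So neither reading of these six numbers is clean as exported.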
Fix
So the fix could be to...
...either change /ShadingType 2 to /ShadingType 3 and keep the array of 6 numbers
...or keep /ShadingType 2 and throw away 2 of the 6 numbers to keep only 4 (but which?)
I decided (arbitrarily, by chance) to try with ShadingType 2 first and delete these two numbers: -11728996 0.0000.
I was lucky: the PDF now lets convert process the PDF pages into JPEGs (which means the Ghostscript command called by convert was also working correctly).
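On the level of the array itself, the edit described above amounts to dropping the third and fourth entries. The sketch below only works on a list of tokens and does not rewrite a PDF file; the function name is made up, and which pair to drop was, as said, an arbitrary choice.

```python
def drop_middle_pair(tokens):
    """Remove entries 3 and 4 of a 6-entry /Coords array, leaving 4 entries."""
    if len(tokens) != 6:
        raise ValueError("expected exactly 6 entries")
    # Keep x0, y0 and the last two entries; tokens stay strings so that
    # the original number formatting is preserved.
    return tokens[:2] + tokens[4:]


coords = ["0.000000000000053689468", "0.0000", "-11728996",
          "0.0000", "-11728996", "26.832815"]
print("[ " + " ".join(drop_middle_pair(coords)) + " ]")
```

The result is a 4-number array as /ShadingType 2 expects. Whether those four values describe the gradient Google intended is a separate question; the edit only makes the object syntactically acceptable.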
Good luck with your continued use of GoogleDocs for creating PDFs...
...but don't count me in!
Update
Here is a link to a GoogleDoc currently exhibiting one of the bug variants explained above:
To see the bug, save it as a PDF. Then open it in a text editor.
Should the doc from this link stop exporting buggy PDFs and stop exhibiting the details I've described above, then Google has applied a fix... (until they break it again?!?)