Extracting embedded PNG byte streams from PDF

I am programming in Python, but if some tool/library exists in another language that would help me considerably, I am open to suggestions.
I have a large collection of pdf pages that live in a database, and I am trying to automate the collection of those pages to build some image recognition models with them.
These "pdfs" are actually just PNG images encased with a PDF wrapper (presumably so they can be read by PDF readers like Adobe Acrobat). I need the pdfs in image format to feed into the image recognition model pipeline. I am assuming they are PNG images, because when I save the images from the browser (i.e., right click and save image as), the resulting file is a PNG file.
After reading this question from 2010, and checking out this blog post from 2007, I've concluded that there must be a way to just extract the PNG byte array from the PDF instead of re-converting the PDF into a new image. Oddly though, I couldn't find the PNG file header with
#Python 3.6
header = bytes([137, 80, 78, 71, 13, 10, 26, 10])
#the resulting header looks like this: b'\x89PNG\r\n\x1a\n'
file.find(header)
Does that mean that the embedded image is not in fact a PNG image?
If there is no easy way to extract the embedded image byte array, what tool might I use to automate the conversion of each PDF file to some image format (preferably JPEG, PNG, or TIFF)?
Edit: I know tools like ImageMagick exist for format conversions, but I'd really rather go the extraction route for the sake of learning more about these file formats.

pip install pdf2image
pip install pillow
pip install numpy
pip install opencv-python
Then,
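import cv2
import numpy as np
from pdf2image import convert_from_path  # pdf2image needs poppler installed

# Render the first page of the PDF as a PIL image, then view it as a NumPy array
page = convert_from_path('path to the pdf file')[0]
img = np.asarray(page)
# PIL delivers RGB but OpenCV expects BGR, so swap the channels before saving
cv2.imwrite('path to save the image with the file extension', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))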
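As to why the search for the PNG signature comes up empty: PDF has no PNG stream filter, so a PNG is not stored inside the PDF as a file. Its pixel data usually ends up in an image XObject compressed with /FlateDecode (the same deflate compression PNG uses), whereas JPEGs are stored with /DCTDecode. Below is a rough, stdlib-only sketch of pulling such streams out. It is regex-based rather than a real PDF parser, it assumes an unencrypted PDF, and the name iter_flate_images is made up for this example.

```python
import re
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# One indirect object with a stream: "obj << dict >> stream <data> endstream".
# The (?!endobj) guard keeps the dictionary match inside a single object.
STREAM_OBJECT = re.compile(
    rb"obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n(.*?)endstream",
    re.DOTALL,
)


def iter_flate_images(pdf_bytes):
    """Yield (dictionary_bytes, decompressed_samples) for Flate-compressed images.

    Hypothetical helper: a sketch, not a substitute for a PDF library.
    """
    for match in STREAM_OBJECT.finditer(pdf_bytes):
        dictionary, raw = match.groups()
        is_image = re.search(rb"/Subtype\s*/Image", dictionary)
        if is_image and b"/FlateDecode" in dictionary:
            # decompressobj tolerates the end-of-line bytes left before "endstream"
            yield dictionary, zlib.decompressobj().decompress(raw)
```

Feed it the bytes of one of your files (open it in 'rb' mode) and inspect each dictionary for /Width, /Height, /ColorSpace and /BitsPerComponent. The decompressed bytes are raw samples (possibly with per-row predictor bytes if /DecodeParms sets a predictor), not a PNG file, so you still have to wrap them in an image container yourself.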

Related

Generate jpeg-YCbCr tiles in geotiff file with jfif format instead pure jpeg format

Currently, my app creates tiled GeoTIFF files using the following options:
PROFILE=GeoTIFF
TILED=YES
BLOCKXSIZE=xxx
BLOCKYSIZE=xxx
COMPRESS=JPEG
PHOTOMETRIC=YCBCR
JPEG_QUALITY=xx
However, some apps that use my served tiles do not work due to "invalid" JFIF format.
How can I force gdal to ensure JFIF format in GeoTiff tiles?
See my own answer in https://gis.stackexchange.com/questions/426732/generate-jpeg-ycbcr-tiles-in-geotiff-file-with-jfif-format-instead-pure-jpeg-for/428023#428023.
Basically, the solution involves modifying the GDAL code.
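For anyone who wants to check what a consuming app is actually receiving: a JFIF stream is a JPEG whose SOI marker is immediately followed by an APP0 segment carrying the identifier "JFIF\0". The snippet below is a small sketch of that test in plain Python. The function name has_jfif_header is made up for this example, and it only inspects the first bytes of a tile's JPEG data that you have already extracted by some other means.

```python
def has_jfif_header(jpeg_bytes: bytes) -> bool:
    """Return True if the data starts with SOI followed by a JFIF APP0 segment."""
    return (
        len(jpeg_bytes) >= 11
        and jpeg_bytes[0:2] == b"\xff\xd8"      # SOI marker
        and jpeg_bytes[2:4] == b"\xff\xe0"      # APP0 marker
        and jpeg_bytes[6:11] == b"JFIF\x00"     # identifier after the 2-byte length
    )
```

If this returns False for a tile, the stream may still be a valid JPEG, just not one with the JFIF APP0 segment that stricter readers look for.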

Saving plotly plots in one pdf file

I am trying to save plotly plots generated within a for loop into one pdf file, but here it says we need to pay for it.
Are there any updates on this feature? Do we really need to pay to save as pdf?
For anyone else still looking for a quick answer 2 years later:
It is possible to export static plotly figures as pdf. Produce a plotly figure, say fig. This has the .write_image method to export to several formats. Simply do:
fig.write_image("your_image.pdf")
NOTE: you may need to install kaleido (plotly uses it to convert to static images).
pip install -U kaleido
Then:
fig.write_image("your_image.pdf", engine="kaleido")
Credits and references: plotly explaining it, kaleido
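Since the question is about figures produced in a loop, here is a minimal sketch of calling write_image once per figure. The helper name export_figures_as_pdf and the file naming scheme are my own assumptions. Note that this produces one PDF per figure; combining them into a single file would need a separate PDF merging tool, which is not shown here.

```python
from pathlib import Path


def export_figures_as_pdf(figures, out_dir, stem="plot"):
    """Write each figure to its own PDF and return the paths in order.

    Hypothetical helper: `figures` is any iterable of objects that have a
    write_image(path) method, such as plotly figures.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, fig in enumerate(figures, start=1):
        path = out_dir / f"{stem}_{index:03d}.pdf"
        # write_image infers the PDF format from the file extension
        fig.write_image(str(path))
        paths.append(path)
    return paths
```

Called with a list of plotly figures and a folder name, it leaves plot_001.pdf, plot_002.pdf and so on in that folder.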
I believe you do need to pay in order to save as a pdf:
py.plotly.image.save_as(fig, filename='file.pdf')
PlotlyRequestError: Hi there! Accounts on the Community Plan can only download PNG and JPEG (raster) images.
To download publication-quality vector images (SVG, PDF, and EPS), please upgrade your account.
UPGRADE HERE: https://plot.ly/products/cloud
However, you can save them as a JPEG or PNG:
py.plotly.image.save_as(fig, filename='file.png')

Convert three.js to Adobe 3D-pdf?

Is there a way to convert a three.js scene to U3D and then to a 3D pdf?
I want to export an assembly I have in three.js to a 3D pdf based on its tree structure (children of children).
Is it possible?
Something opposite to Convert Adobe 3D-pdf to WebGL?
We have lately done some experiments with ObjExporter (amended to also export the container colors). By the same method you can also export STL files, but for those we have not found a way to transfer the colors.
The obj/stl files are then transformed to U3D files with MeshLab.
These U3D files can then be transformed with MiKTeX/LaTeX into 3D pdfs (see here). The system can also run on a server with a batch script.

Export SVG elements to PDF?

I have a visualization generated by d3 (a javascript visualization library similar to Protovis or Raphael, which draws stuff using SVG elements). The vis is interactive, so the user can interact with and edit it. Once the user is satisfied with his/her visualization, I would like the user to be able to export this visualization as a PDF. I've tried several HTML to PDF libraries and they don't work with SVG elements.
It is okay if the solution is either client side or server side. I'm using PHP server side but Python or Java implementations might also work.
Browser support: Ideally it would support all modern browsers, but minimally I'd like to support latest versions of both Firefox and webkit browsers.
I do not know of any strong PDF libraries on the client side.
A quick possible way would be to send the svg content to a server, and use something like batik for java to turn the svg to pdf and then send the response to the client again.
Here is a related SO question for the conversion.
There's also wkhtmltopdf, which can render anything WebKit can as a PDF. If you want to render a combination of SVG and HTML, or want to have some JavaScript run before the PDF snapshot is taken, it's great for that.
http://code.google.com/p/wkhtmltopdf/
PhantomJS can also rasterize a url/html to PDF. It uses the same backend (QtWebKit) as wkhtmltopdf.
I did not try d3, but I achieved the effect you are looking for like this in Python 3.6:
# Pdf library
from reportlab.pdfgen import canvas
from reportlab.graphics import renderPDF, renderPM
# Svg library
import svgwrite
# Svg to reportlab
from svglib.svglib import svg2rlg, SvgRenderer
# Xml parser
from lxml import etree
# Create the svg
dwg = svgwrite.Drawing('test.svg', profile='tiny')
dwg.add(dwg.line((0, 0), (10, 10), stroke=svgwrite.rgb(10, 10, 16, '%')))
dwg.add(dwg.text('Test', insert=(0, 0.2)))
# Create canvas for pdf
c = canvas.Canvas("output.pdf")
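# Parse the xml of the svg, passing in the lenient parser configured here
parser = etree.XMLParser(remove_comments=True, recover=True)
root = etree.fromstring(dwg.tostring(), parser)
# Render the svg itself (recent svglib versions require the path argument)
svgRenderer = SvgRenderer(path='test.svg')
drawing = svgRenderer.render(root)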
# Now render the drawing in the pdf
renderPDF.draw(drawing , c, 10, 10)
# End page and save pdf file
c.showPage()
c.save()
# Or render to a separate png
renderPM.drawToFile(drawing, "file.png", fmt="PNG")
ReportLab is an open source pdf library, and svglib is a library that converts SVGs to ReportLab Drawings. Rendering an SVG directly from its xml is not supported out of the box, which is why I use the SvgRenderer.

Magick++ - Reading JPEG2000 images

I'm trying to read JPEG2000 images in Magick++ (the C++ API of ImageMagick). To read an image I use the following code:
Image img("path/to/my/image.jp2");
But when I try to do this, ImageMagick throws an exception and doesn't load the image.
I extract the images from PDF files. Could it be that they differ in some way from normal JPEG2000 images? To extract the images, I read the streams of Image objects that have a JPXDecode filter and save them to a file.
Hope someone can help me!
ImageMagick uses a package called JasPer to handle JPEG2000 files. According to the Wikipedia page on OpenJPEG, JasPer does not completely support the JPEG2000 specification. I have several extracted JPEG2000 files that open fine in QuickTime but fail to decode with ImageMagick.
I have had better results using OpenJPEG to decode the JPEG2000 files. The interface is less flexible, but it will convert to PNG and BMP.
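Before blaming the decoder, it may be worth checking what kind of JPEG2000 data was actually extracted. A JP2 file starts with a 12-byte signature box, while a raw codestream starts with the SOC marker followed by SIZ (FF 4F FF 51). If a raw codestream was saved under a .jp2 name, some readers may pick the wrong code path. The sketch below only sniffs the first bytes; it is written in Python for brevity, and the function name sniff_jpeg2000 is made up for this example.

```python
JP2_SIGNATURE_BOX = b"\x00\x00\x00\x0cjP  \r\n\x87\n"  # 12-byte JP2 signature box
J2K_CODESTREAM_START = b"\xff\x4f\xff\x51"             # SOC marker followed by SIZ


def sniff_jpeg2000(data: bytes) -> str:
    """Classify extracted bytes as a 'jp2' file, a raw 'j2k' codestream, or 'unknown'."""
    if data.startswith(JP2_SIGNATURE_BOX):
        return "jp2"
    if data.startswith(J2K_CODESTREAM_START):
        return "j2k"
    return "unknown"
```

An "unknown" result suggests the stream was not extracted cleanly (for example, it still carries another filter or has leading bytes that do not belong to the image).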