Analyze format of pdf-file and extract text and images [closed] - pdf

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need to extract the "articles" from this magazine which has both text and images. The image content has to be placed separately, the text extracted (as far as possible) and placed separately.
How do I go about doing this? Is there a commercial service or API that does this already? The input to the program or service will just be the file.
Example input: http://edition.pagesuite-professional.co.uk/pdfspool/rQBvRbttuPUWUoJlU6dBVSRnIlE=.pdf
(The actual file would be a normal PDF file, not a secured one.)

The Docotic.Pdf library can extract images and text from PDF files for you.
Here are a couple of samples for your task:
Extract text from PDFs
Extract images from a PDF
Extracted images can be saved as JPEG or TIFF files. You can extract text from each page or from the whole document, and you can also extract text chunks with their coordinates.
Disclaimer: I work for Bit Miracle, vendor of the library.

Try this one:
http://asp.syncfusion.com/sfaspnetsamplebrowser/9.1.0.20/Web/Pdf.Web/samples/4.0/Importing/TextExtraction/CS/Default.aspx?args=7
The same component also has an image-extraction feature.
Give it a try.

If you can afford a commercial option, Amyuni PDF Creator lets you enumerate all the components inside the PDF file (text, images, etc.). You can extract them as independent objects and create new PDF files from them.

You can use Aspose.Pdf.Kit to extract text and images separately from a PDF file. The API is quite simple. You can also find samples, tutorials and support on the Aspose website.
Note: I work as a Developer Evangelist at Aspose.

Related

Extracting data from a pdf, turning into a QR code and inserting the pdf and QR codes into a word.doc [closed]

Closed 3 years ago.
In a nutshell: I want to take an existing PDF, read just the tool numbers, and add a barcode for each tool number to the PDF or Word document (Word can convert PDFs).
I need some ideas for getting data out of a PDF which is a printout of an Access database.
We fill out a few things on a form in Access, pull up the document, and print it. The database is not available for me to experiment with, so I wanted to print to a PDF, read the "TOOL NUMBERS" using Tabula or something similar, and export them to Excel. I would then turn them into either Code 39 Extended barcodes or QR codes, import the original PDF into Word, insert the barcode under each tool number, and print. Crazy as it sounds, this is the only workaround I can come up with.
I have already written the Excel column of tool numbers out as QR codes (PNG files, "toolnumber.png"). Alternatively, is there a way for me to find the MDB file and extract the data from that? The column in that data file should be "ToolNumber".
Since you ultimately want the output in Excel, there's no need to involve Word or PDFs in the process at all - simply query the Access DB directly from Excel and format the output as required.
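However the tool numbers reach Excel, they still have to be turned into something a barcode font can render. The following is a minimal sketch, assuming a standard (not extended) Code 39 font that expects the text wrapped in the `*` start/stop character. The helper name `to_code39_text` is hypothetical, and the extended (full ASCII) variant mentioned in the question is not covered.

```python
# Characters allowed in standard Code 39 (43 symbols; "*" is reserved
# as the start/stop character and must not appear in the data).
CODE39_CHARS = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%")


def to_code39_text(tool_number: str) -> str:
    """Return the string to type in a Code 39 font for one tool number.

    Hypothetical helper: it upper-cases the input (standard Code 39 has
    no lower-case letters) and rejects anything the symbology cannot encode.
    """
    text = tool_number.strip().upper()
    invalid = sorted(set(text) - CODE39_CHARS)
    if not text or invalid:
        raise ValueError(
            f"cannot encode {tool_number!r} as standard Code 39: {invalid}"
        )
    return f"*{text}*"
```

Applied to a column of tool numbers, the result can be placed in the cell under each number and formatted with the barcode font.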

How to convert PDF file to KML file [closed]

Closed 5 years ago.
Is there a way to convert a map PDF file to a KML file? How can I convert it, or are there any guidelines for doing so?
Apache PDFBox is an open source Java library that can parse PDF documents and extract content. The API includes methods to extract text, metadata, and embedded files from PDF files, as well as to create PDF files from scratch. Apache PDFBox also includes several command-line utilities. One command-line tool called pdfbox-app has options to extract all text or images from PDF files.
There is also the Apache Tika library, which focuses solely on text extraction from a variety of file formats including PDF.
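Extracting content is only half of the conversion; the other half is writing KML. A minimal sketch of that second step follows, assuming the place names and coordinates have already been recovered from the PDF by some other means (the extraction itself is not shown, and a map PDF may not contain usable coordinates at all). The function name `build_kml` and the sample point are hypothetical.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(points):
    """Build a KML document from (name, longitude, latitude) tuples.

    KML stores coordinates as "longitude,latitude,altitude".
    """
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lon, lat in points:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    body = ET.tostring(kml, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical sample point, not taken from any real map.
    print(build_kml([("Depot", -0.1276, 51.5072)]))
```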

Autodesk-forge can we open PDF files on viewer [closed]

Closed 3 years ago.
We are trying to view the following file types in the Autodesk Forge Viewer in our application:
DWF
DWFX
DWG
DXF
NWD
RVT
NWC
PDF
RCP
GBXML
IFC
According to the Autodesk documentation (https://developer.autodesk.com/en/docs/model-derivative/v2/overview/supported-translations/), these file formats are supported by the Viewer.
However, on https://viewer.autodesk.com/ some file types, such as PDF and RCP, produce a format error.
So my questions are:
Which file formats does the Viewer support?
Can we open PDF files in the Viewer, or does it require a specific kind of PDF file?
The file types that the Viewer supports are listed in the link you provided from the Model Derivative API, so PDF and RCP are supported. The A360 Viewer supports only a limited number of file types; the list is available on that website.
As you can see, those two types are not mentioned there, but that doesn't mean they are not supported by the Forge platform. You will have to use the Model Derivative API to translate those file types and the Viewer API to visualize them.
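The translation step starts with a job request to the Model Derivative API. The sketch below only builds the request body and sends nothing over the network. The payload shape (a base64-encoded URN plus an `svf` output format) reflects my understanding of the v2 job request and should be checked against the current documentation; the function name and the object ID are hypothetical.

```python
import base64
import json


def build_translation_job(object_id: str) -> dict:
    """Build the JSON body for a Model Derivative translation job.

    Assumption: the API expects the object ID as URL-safe base64
    without "=" padding, and "svf" as the viewer output format.
    """
    urn = (
        base64.urlsafe_b64encode(object_id.encode("utf-8"))
        .decode("ascii")
        .rstrip("=")
    )
    return {
        "input": {"urn": urn},
        "output": {"formats": [{"type": "svf", "views": ["2d", "3d"]}]},
    }


if __name__ == "__main__":
    # Hypothetical object ID of an uploaded PDF.
    job = build_translation_job("urn:adsk.objects:os.object:my-bucket/plan.pdf")
    print(json.dumps(job, indent=2))
```

Once the job completes, the same encoded URN is what the Viewer would be asked to load.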

Is it possible to embed 3D data into XSL-FO? [closed]

Closed 6 years ago.
Setup
XSL processor: Saxon
FO processor / PDF renderer: Antenna House Formatter V6.2
Is it possible to embed a 3D PDF, XVL or 3DU via a FO transformation / PDF rendering into the current publication?
The source data would have several XML, XVL (whatever) 3D data nodes that have to be processed into the generated PDF.
Thanks in advance.
You can embed a 3D PDF using AH Formatter V6.2 or V6.3. Use fo:external-graphic to refer to the PDF just as you would for any other external image.
In the AH Formatter GUI, you can select to embed 3D annotations in the 'PDF Option Setting' dialog box (see https://www.antennahouse.com/product/ahf60/docs/ahf-gui.html#others-page). On the AHFCmd (or run.sh on Linux/Unix) command line, you may need to specify -p3da (see https://www.antennahouse.com/product/ahf60/docs/ahf-xslcmd.html#keyIDAR1YD) and/or enable 3D annotations in the Option Setting File (see https://www.antennahouse.com/product/ahf60/docs/ahf-optset.html#keyIDAVUFU).

Add comments in pdf [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions must demonstrate a minimal understanding of the problem being solved. Tell us what you've tried to do, why it didn't work, and how it should work. See also: Stack Overflow question checklist
Closed 9 years ago.
I want to add some text to a PDF document from LaTeX. The text is not supposed to be visible in the actual PDF; I want it to be more like a comment in code, so that I can load the "code" in a program and read the comments. Is this possible?
Kind regards
I don't know LaTeX well enough to comment on that part of your question, but there are a number of different ways information can be stored inside PDF files that would satisfy your question.
Images in PDF files are typically objects (Image XObjects to be exact) - these have a dictionary where additional information could be stored next to the image data.
PDF supports the concept of object metadata where XMP metadata can be embedded in a PDF file for a specific object. This would be a second way to embed additional non-visible information in the PDF file (and a better one).
And perhaps best of all, if you can generate this from LaTeX, is the fact that PDF allows object properties, which use marked content operators in the page stream to delineate a number of objects and then allow information to be associated with that marked content.
All of those should be easy to find in the PDF specification on the Adobe website; what would remain would be to figure out what ways you have in LaTeX to generate any of this and what you'd have to do to read them in your program :-)
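For the reading side, XMP metadata is often the easiest of these to get at, because an XMP packet is wrapped in recognizable `<?xpacket ...?>` markers. A rough sketch, assuming the metadata stream is stored uncompressed and unencrypted (otherwise a real PDF library is needed); the function name `find_xmp_packets` is hypothetical.

```python
import re

# An XMP packet sits between an "xpacket begin" and an "xpacket end"
# processing instruction. This only works on uncompressed streams.
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packets(pdf_bytes: bytes) -> list:
    """Return the XML body of every XMP packet found in the raw bytes."""
    return [
        match.group(1).decode("utf-8", errors="replace").strip()
        for match in XMP_PACKET.finditer(pdf_bytes)
    ]
```

Given the raw bytes of a file (for example from `open(path, "rb").read()`), each returned string can then be parsed as XML to pull out the hidden text.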
There are two different ways:
You can either comment out single lines by adding a % in front of them:
% This text will be a comment
Or you can comment out larger sections by doing this:
\usepackage{comment} % goes in the preamble
\begin{comment}
This text will be commented out.
\end{comment}
Hope this helps!