Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Does anyone know how to compare two PDF files using Adobe Acrobat from the command line?
I want to do this from the command line because we need to compare hundreds of files every day through automated Windows tasks.
Any kind of help will be greatly appreciated. I do not want to limit myself to Acrobat for the comparison if something else is available.
How about i-net PDFC - it does a full content comparison - text, images, lines, header/footer-detection and so on. You can use it either on command line or with a GUI (2.0, currently in public beta-phase).
The command-line tool already has the option to compare folders with PDFs against each other (or the extreme way: use the API ;))
Disclaimer: Yep, I work for the company who made this - so feedback highly appreciated.
Check out comparepdf:
comparepdf is a command line tool for comparing two PDF files. By default it compares their texts but it can also compare them visually (e.g., to detect changes in diagrams, images, fonts, and layout). It should prove useful for automated testing.
It is Open Source (GPL) and there are Windows binaries available.
Also:
If you want a GUI application that shows the detailed differences between PDFs use DiffPDF instead.
What you want simply cannot be done with Adobe Acrobat through the command line. However, you could do it with the help of some commandline utilities which you could unite into a shell or batch script.
1. Quick visual check for page image differences
One ingredient of this would be ImageMagick's convert command, which you can test like this for two 1-page PDF files which have page contents similar to each other's:
convert -label '%f' -density '100' first.pdf second.pdf -scale '100%' miff:- \
| montage - -geometry +0+0 -tile 1x1 -background white miff:- \
| animate -delay '50' -dispose background -loop 0 -
This will open a window that switches between the two files with a delay of 50 centiseconds (half a second), so visual differences are easy to spot.
2. Script to generate PDF output visualizing differences between PDF files
I'm doing the same thing using a shell script on Linux that wraps
ImageMagick's compare command
the pdftk utility
Ghostscript (optionally)
(It would be rather easy to port this to a .bat Batch file for DOS/Windows.)
You can read details about this approach in this answer.
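For the "hundreds of files every day" case, the wrapping script could be sketched in Python roughly as follows. This is a minimal sketch, not the script from the linked answer: the folder layout (an old and a new folder with matching file names), the function names and the `diff-` output naming are all assumptions made for illustration. It only builds the ImageMagick `compare` command lines; running them is left commented out.

```python
from pathlib import Path


def pair_pdfs(old_dir, new_dir):
    """Return (old, new) path pairs for PDF names present in both folders."""
    old = {p.name: p for p in Path(old_dir).glob("*.pdf")}
    new = {p.name: p for p in Path(new_dir).glob("*.pdf")}
    return [(old[name], new[name]) for name in sorted(old.keys() & new.keys())]


def compare_command(old_pdf, new_pdf, out_dir):
    """Build an ImageMagick `compare` call that writes a difference PDF.

    The output name `diff-<stem>.pdf` is a hypothetical convention.
    """
    target = Path(out_dir) / f"diff-{Path(old_pdf).stem}.pdf"
    return ["compare", "-density", "100", str(old_pdf), str(new_pdf), str(target)]


if __name__ == "__main__":
    # "old" and "new" are placeholder folder names.
    for old_pdf, new_pdf in pair_pdfs("old", "new"):
        cmd = compare_command(old_pdf, new_pdf, ".")
        print(" ".join(cmd))
        # import subprocess; subprocess.run(cmd)  # uncomment to actually run
```

Files that exist in only one folder are silently skipped here; a real job would probably want to report them.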
Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 4 months ago.
I have spent quite some time trying to find a reasonable answer to my question by myself, but I ran into a dead end and hope you guys can help me.
Issue:
For the purpose of business reporting, I have created some Jupyter notebooks which include multiple pandas tables and seaborn / matplotlib plots as code cell output, with occasional markdown cells in between to provide explanations. Now I want these reports to be in a business-ready format to share with stakeholders. By business-ready I mean the following requirements:
The report does not include code
Output file format: PDF
The report includes a title page with title, additional information (e.g. date of analysis) and a table of contents
Tables are in an appealing visual format that makes the information easy to take in
The report is well structured
... and I am not able to get all these requirements together.
So far, I prefer to work with VS Code and use the browser-based Jupyter notebook only if necessary (VS Code unfortunately lacks some functionality).
What I have tried:
(1) This was a no-brainer: I just add --no-input to the nbconvert command in the Anaconda shell, and whatever I do regarding the next points, it excludes the code
(2) There are two ways I could find so far, which influence all subsequent steps/requirements
Way 1 ("html detour"): I convert the .ipynb to html and print it as PDF (this is a 2-step process, thus I see it as a detour)
Way 2 ("latex conversion"): I convert it to a PDF via nbconvert --to pdf and it uses latex in the background to create a pdf
(3) ...and here start the issues:
html detour: I can get a TOC via the nbextension extension for Jupyter notebooks, and with it I can either use the H1 header level as the title or include an extra markdown cell and increase the font size with an HTML tag so that it looks appealing. Additional information is added manually in extra code cells. However, the TOC only works in the browser version of Jupyter, which means writing the analysis in VS Code, going to the browser to add the TOC, converting it in the shell, opening the HTML and printing it as PDF...
latex conversion: I can set up a LaTeX template, passed to the nbconvert command, that includes a TOC by design. However, it either picks up the filename as the title automatically or uses a title I can set in the metadata of the notebook, which I can only edit from the browser. Further, the date of conversion is automatically added below the title as well, which might not be the date of the analysis in case I have to reconvert it because someone wants a minor change. So I cannot turn the automatic title and date off (at least I couldn't find an option so far), and I have multiple steps here as well.
(4) This one makes eventually the difference in the usability of the report
html detour: The format in the HTML file itself is the quite appealing format you usually get from tables when calling display() on a table in Jupyter (which is used anyway if you just call a variable in Jupyter without print()) or when you build a table in a markdown cell. The table has a bold header and every other row has a grey background. Using the pandas .style accessor, I can format the table in the HTML file very nicely, with a red font color for negative values only or percentage bars as cell background. However, I lose all this formatting when I print the PDF. Then it's just a bold header, a bold line separating header and body, and the rows. Further, all cell output tables are left-aligned in the HTML (I refer to the table itself, not its content) while the markdown tables are centered, which looks strange or rather - and this is the issue - unprofessional. The benefit, however, is that these tables are, within a certain range, auto-adjusted to the letter size format if the table would be wider than a letter page.
latex conversion: By design, the tables are not converted. I have to use pandas.set_option('display.latex.repr', True) to convert all subsequent pandas table output, or add .to_latex() to every single pandas table. This has several downsides. With it, all tables are displayed as the code that would be required to build the table in LaTeX, and while doing the analysis this is often harder to interpret... especially if you want to find errors. Adding it only when the analysis is done just creates unnecessary iterations. Further, I want to use the last report as a template for the next one and would have to delete the command, do my stuff and add it again. Wider tables that don't fit the letter size are just cut off, regardless of how much wider they are than the page, and I would have to check every table (the last report had 20+) to see whether everything is included... and headers become longer if they include explanatory information. And finally, the LaTeX table format does look professional, but scientifically professional rather than business professional, and in my experience it can discourage some readers.
(5) So, since everything is made from cells and converted automatically, you get some strange output with headers on the end of one page and text and tables and plots on the next ...or pages with just a plot and so on...
html detour: It's hard to describe the general issues I have. If you have ever printed a website, you have probably got some weird bulk of text that looks unstructured, with occasional half-empty pages where they should not be. That's what you get when printing the HTML file of a Jupyter notebook. It would help if I could force a page break, and you can find several ways of adding page breaks in the cell or in the cell metadata, but they do not work because the HTML is created with a high-level setting that prohibits page breaks. So I could only go into the HTML code and add page breaks manually, which is manual effort I would like to avoid.
latex conversion: Well, \pagebreak works.
So, due to the issues above, I currently lean towards the html detour, but it does not produce an appealing report. I have tried several LaTeX templates but was usually dissatisfied with the output, since the .to_latex command makes it tedious and the report ends up looking like a scientific paper and not like a business report. The thing is, while this looks like a high standard, all these requirements are fulfilled by R Markdown notebooks basically out of the box, with slight additions to the YAML header at the top of the file. But I cannot use them for the report I want to create.
So, after this long intro (and I thank everybody for taking the time to read it), my question is how do I get appealing reports from a jupyter notebook?
Thanks!!!!!
Honestly, I'm in the same boat as you. It seems quite challenging to generate publication-ready PDF Reports natively from JupyterLab / Jupyter using nbconvert and friends.
Solution (that I'm using): What I can recommend is a different tool that will help you make amazing PDF reports. It's using RStudio's Rmarkdown (completely free) and the new ability to use Python from RStudio. I'm going to be teaching this in my R/Python Teams Course (course waitlist is up).
Report Example
Here's how I'm doing it in my course:
Step 1 - Install Rstudio IDE 1.4+ & R 4.0+
Head over to Rstudio and install their IDE. You'll also need to install R.
Step 2 - Create a Project
Step 3 - Set Python Environment of your Project
Go to Tools > Project Options. Select the Python Interpreter.
Step 4 - Begin Coding Markdown and Python
Use "Python Code Chunks".
Step 5 - Knit to PDF
Note that this requires some form of LaTeX. You can install it easily with this package: tinytex.
Step 6 - Check out your PDF Report
Looks pretty slick.
Try it out and see if it works for you.
I'd go like this from terminal (this is to convert to Word, but also PDF is available, just change your last output to .pdf):
jupyter nbconvert --to html notebook.ipynb --TemplateExporter.exclude_input=True && pandoc notebook.html -s -o results.docx --resource-path=img --toc
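If this two-step conversion has to run for many notebooks, the same two commands can be assembled in Python. This is only a sketch of the one-liner above: the function name `report_commands` is made up, the flags are copied from the command as given, and it assumes nbconvert writes the HTML file next to the notebook under the same base name.

```python
import shlex
from pathlib import Path


def report_commands(notebook, output):
    """Return the nbconvert and pandoc command lines as argument lists."""
    # Assumption: nbconvert names the HTML file after the notebook.
    html = str(Path(notebook).with_suffix(".html"))
    nbconvert = [
        "jupyter", "nbconvert", "--to", "html", notebook,
        "--TemplateExporter.exclude_input=True",  # hide the code cells
    ]
    pandoc = [
        "pandoc", html, "-s", "-o", output,
        "--resource-path=img", "--toc",
    ]
    return [nbconvert, pandoc]


if __name__ == "__main__":
    for cmd in report_commands("notebook.ipynb", "results.docx"):
        print(shlex.join(cmd))
        # import subprocess; subprocess.run(cmd, check=True)  # to execute
```

Passing the commands as lists rather than one shell string avoids quoting problems when notebook names contain spaces.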
Apart from installation and other setup, several aspects make using nbconvert for file conversion quite tedious.
Has anyone tried the Jupyter Executable Notebook or R Markdown methods? They are useful, but the extra time and effort they cost makes them less feasible.
What I found very useful is that many websites serve this purpose; they are quick, easy and hassle-free.
I use this IPYNB TO PDF; there are others as well.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 7 years ago.
I have encountered a bug where using gradients sometimes randomly corrupts the illustrator file. When I open it I see the bug popup (yeah, I use Polish-localized version of Illustrator CC).
The bug report states something along:
Can't open the illustration. The illustration contains an invalid operation argument.
Offending Operator: Bd
Content:
%AI5_EndGradient
%AI5_BeginGradient:
I am using Windows 8.1. How can I recover my file?
You can follow the steps outlined on that Adobe page. Even though it states it's only for Illustrator CS2-CS5, it works for CC as well.
The file location for Windows 8 is:
C:\Users\[ username ]\AppData\Roaming\Adobe\Adobe Illustrator 17 Settings\[localisation code]\[version]\
localisation code for Poland will be pl_PL, but don't worry about that, there will most probably be just one folder in the Adobe Illustrator 17 Settings.
version either x64 or x86, choose the one you are using
The file you are looking for also has a localised name. For Polish it's Preferencje programu Adobe Illustrator; for other localisations it will be some translation of Adobe Illustrator Preferences.
in section:
/aiFileFormat {
/PDFCompatibility 1
/enableATEReadRecovery 0
/enableContentRecovery 0
/enableATEWriteRecovery 0
/clipboardPSLevel 3
}
Set the enableContentRecovery flag to 1: /enableContentRecovery 1
Then follow the "Starting Document Recovery" section from the link.
When you have the _[your filename].ai file you need a huge text file editor so that you can remove offending operators. I have used 010 editor which has a 30 days trial.
Open the file and search (ctrl+F) for the offending content, it's a markup language so you have to remove whole sections between %AI5_BeginGradient: [your gradient name] and %AI5_EndGradient.
Remove one gradient.
Save the file in your text editor.
Try to open it in Illustrator. (Notice that the error message changes; if it doesn't, look for the exact same name and remove its section.)
Rinse and repeat until it opens.
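If the file is too big to edit comfortably by hand, the "remove one gradient" step can be sketched in Python. This is a rough sketch under assumptions: it treats the file as raw bytes, assumes the gradient name appears on the same line as %AI5_BeginGradient:, and removes everything up to and including the next %AI5_EndGradient line. The function name and the sample data are made up for illustration; keep a backup of the file before trying anything like this.

```python
import re


def remove_gradient(data: bytes, name: bytes) -> bytes:
    """Drop the first gradient section whose begin line mentions `name`."""
    pattern = re.compile(
        rb"%AI5_BeginGradient:[^\r\n]*" + re.escape(name) + rb"[^\r\n]*"
        rb".*?%AI5_EndGradient[^\r\n]*(?:\r\n|\r|\n)?",
        re.DOTALL,  # let the lazy .*? run across line breaks
    )
    return pattern.sub(b"", data, count=1)


if __name__ == "__main__":
    # Tiny made-up sample; a real file would be read with
    # pathlib.Path("...").read_bytes() and written back with write_bytes().
    sample = (
        b"head\n"
        b"%AI5_BeginGradient: (Sky)\nBd\n%AI5_EndGradient\n"
        b"%AI5_BeginGradient: (Sun)\nBd\n%AI5_EndGradient\n"
        b"tail\n"
    )
    print(remove_gradient(sample, b"Sky").decode())
```

Working on bytes rather than text sidesteps encoding questions, since the file may contain non-text data.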
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Surely, I am the 100th user who is asking this but after I have searched through similar topics here and on other websites I still cannot find what I need.
I would like to have a simple command line tool for GNU/Linux which converts .doc(x) files to .pdf, BUT the output should look the same as the original.
LibreOffice doesn't seem like a good choice for this because it does not convert well in some cases. I have found a website freepdfconvert.com which does the job very well, but I cannot upload any sensitive files since it is a big risk. I don't say they would do anything bad with them but it is how it is.
If I can't find any good tool maybe I will have to write one myself.
Unfortunately there are no Linux-based converters that guarantee a 1-to-1 conversion from Word (doc/docx) to PDF. This is because Word, a Microsoft product, uses a proprietary format that changes slightly with every release. As it was not traditionally a publicly documented format and Microsoft does not port Word/Office to Linux (nor ever will), you must rely on reverse-engineered third-party tools for the older format (doc) and on proper interpretation of the Office Open XML format by third-party developers.
We found the best open source solution is LibreOffice (which was forked from OpenOffice.org, which itself was called Star Office before it was open sourced). It is much more actively developed than AbiWord, as another answer suggested.
The usage from the command line is simple and well documented with plenty of examples:
soffice --headless --convert-to pdf filename.doc
On newer versions you can also use libreoffice instead of soffice.
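To convert a whole folder, the command above can be generated per file. The following is a minimal sketch, not part of the original answer: the function names are hypothetical, and the `--outdir` option is an addition that was not in the command shown above.

```python
from pathlib import Path


def word_files(folder):
    """List .doc and .docx files in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".doc", ".docx"}
    )


def soffice_command(doc, out_dir="."):
    """Build the headless LibreOffice call for one document."""
    return [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir),  # addition: where the PDF is written
        str(doc),
    ]


if __name__ == "__main__":
    for doc in word_files("."):
        cmd = soffice_command(doc, "pdf")
        print(" ".join(cmd))
        # import subprocess; subprocess.run(cmd, check=True)  # to execute
```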
There is also Pandoc.
Pandoc, mainly known for its Markdown-capable processing goodness (for outputting HTML, LaTeX, PDF, EPUB and what-not) in recent months has gained a rather well-working capability to process DOCX input files.
(NOTE: Pandoc only works for DOCX, not for DOC files.)
For its PDF output to work, it requires a working LaTeX installation (with either or all of pdflatex, lualatex and xelatex included). In this case the following simple command should work:
pandoc -o output.pdf -f docx input.docx
Note however, that the output layout and font styles now will not look at all similar to what it would look if you exported the DOCX from Word to PDF. It will be using the styles of a default LaTeX document.
You can influence the output style of the LaTeX-generated PDF by using a custom template file like this...
pandoc \
-o output.pdf \
-f docx \
--template=my-latex-template.tmplt \
input.docx
...but this is a feature more for Pandoc/LaTeX experts to use than for beginners.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
The community reviewed whether to reopen this question 4 months ago and left it closed:
Original close reason(s) were not resolved
Can anyone recommend a library/API for extracting the text and images from a PDF?
We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.
We would like that data to be output in xml or json format. We're currently looking at PdfTextStream which seems pretty good, but would like to hear other peoples experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
I was given a 400 page pdf file with a table of data that I had to import - luckily no images. Ghostscript worked for me:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc, and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
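The clean-up step can be very small. Here is a sketch of the blank-line stripping in Python; the function name and the optional `drop` parameter for repeated page headers are illustrative and not taken from the app described above.

```python
def strip_blank_lines(text, drop=()):
    """Return the non-blank lines of `text`, minus any exact matches in `drop`.

    Leading spaces are kept because the extracted text uses them to
    preserve the column layout; only trailing whitespace is removed.
    """
    unwanted = set(drop)
    lines = []
    for line in text.splitlines():
        cleaned = line.rstrip()
        if cleaned and cleaned not in unwanted:
            lines.append(cleaned)
    return lines


if __name__ == "__main__":
    # output.txt is the file written by the Ghostscript command above.
    with open("output.txt", encoding="utf-8", errors="replace") as handle:
        for record in strip_blank_lines(handle.read()):
            print(record)
```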
An efficient command line tool, open source, free of any fee, available on both linux & windows : simply named pdftotext. This tool is a part of the xpdf library.
http://en.wikipedia.org/wiki/Pdftotext
As of today I know: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.
PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".
TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.
pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both these are free (as in beer) to use for private, non-commercial purposes.
And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) do spit out garbage only.
I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good commandline. Some of my "problematic" PDF test files the tool handled to my full satisfaction.
This thing will from now on be my recommendation for every sophisticated and challenging PDF text extraction requirements.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...
Give it a try.
For Python, there are PDFMiner and PyPDF2. For more information on these, see Python module for converting PDF to text.
Here is my suggestion.
If you want to extract text from PDF, you could import the pdf file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this using the Drive API. It is free* and robust. Take a look at:
https://developers.google.com/drive/v2/reference/files/insert https://developers.google.com/drive/v2/reference/files/get
Because it is a REST API, it is compatible with ALL programming languages. The links I posted above have working examples for many languages including Java, .NET, Python, PHP, Ruby, and others.
I hope it helps.
PdfTextStream (which you said you have been looking at) is now free for single threaded applications. In my opinion its quality is much better than other libraries (esp. for things like funky embedded fonts, etc).
It is available in Java and C#.
Alternatively, you should have a look at Apache PDFBox, open source.
One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:
gs \
-q \
-dNODISPLAY \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
-f ps2ascii.ps \
"${input}" \
-dQUIET \
-c quit
I used dSIMPLE instead of dCOMPLEX because the latter outputs 1 character per line.
Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.
Docotic.Pdf can be used to extract images from PDFs, too.
Disclaimer: I work for Bit Miracle.
As the question is specifically about alternative tools for getting data from a PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from a PDF as XML along with the positioning data (x, y) and font information:
Text in the source PDF:
Products | Units | Price
Output XML:
<row>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
</column>
</row>
P.S.: additionally it also breaks the text into a table based structure.
Disclosure: I work for ByteScout
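Once the text comes with coordinates like this, picking out a pre-known region is a simple filter. The sketch below uses only Python's standard XML parser on the sample output shown above; it is not part of the SDK, and the function name and the region numbers are made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<row>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
</column>
</row>
"""


def texts_in_region(xml_text, left, top, right, bottom):
    """Return the text of every <text> element whose box lies inside the region."""
    hits = []
    for node in ET.fromstring(xml_text).iter("text"):
        x, y = float(node.get("x")), float(node.get("y"))
        width, height = float(node.get("width")), float(node.get("height"))
        if left <= x and top <= y and x + width <= right and y + height <= bottom:
            hits.append(node.text)
    return hits


if __name__ == "__main__":
    # Hypothetical region covering the two right-hand columns.
    print(texts_in_region(SAMPLE, 400, 100, 540, 150))
```

Requiring the whole box to lie inside the region is one possible choice; testing only the top-left corner would be more forgiving for text that overhangs the region slightly.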
The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v.8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):
gswin32c.exe ^
-q ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dSAFER ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-dCOMPLEX ^
-f ps2ascii.ps ^
-dFirstPage=3 ^
-dLastPage=7 ^
input.pdf ^
-dQUIET ^
-c quit
This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional infos mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace the -dCOMPLEX part by -dSIMPLE.
I know that this topic is quite old, but the need is still alive. I read many documents, forum posts and scripts, and built a new, more advanced one which supports compressed and uncompressed PDFs:
https://gist.github.com/smalot/6183152
In some cases, command line is forbidden for security reasons.
So a native PHP class can fit many needs.
Hope it helps everyone.
For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):
pdfimages: Extract and Save Images From A Portable Document Format ( PDF ) File
Apache pdfbox has this feature - the text part is described in:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html
for an example implementation see
https://github.com/WolfgangFahl/pdfindexer
the testcase TestPdfIndexer.testExtracting shows how it works
QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.
http://www.quickpdflibrary.com/ - They have a 30 day trial.
On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to the "Adobe Reader.app", and all I do is drop a pdf-file on the alias, which makes it the active document in Adobe Reader, and then from the File-menu, I choose "Save as Text...", give it a name and where to save it, click "Save", and I'm done.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I've tried using LaTeX and DocBook for documenting programming tools, to get PDF output. What I've found is that these tools are excellent in some ways - easily versioned, and generating very usable PDF manuals. But there is a serious flaw. Code-snippets cannot simply be cut-and-pasted out of the PDF.
With DocBook, the problem is the loss of whitespace - mostly for indentation, but any repeated spaces seem to get stripped out. So, once you paste the snippet into a text editor, you'll need to clean up the indentation and vertical alignment. Not too much hassle for two or three lines, but it quickly gets annoying.
With LaTeX - well, it's a mess. The following was taken from a PDF generated using the LaTeX in MikTeX 2.8.
node myclas s
f f i e l d f i e l d 0 1 : i n t ;
f i e l d f i e l d 0 2 : ” char ” ;
g;
The intended example is...
node myclass
{
field field01 : int;
field field02 : "char*";
};
Other than the fact LaTeX plays with the quotes, the intended form is what you see in Adobe Reader - but not much like what you get from a cut-and-paste. Don't ask me what's going on with the spaces, or why the braces turned into letters, or what happened to the asterisk - I don't know!
Mostly, I've noticed these things playing with ways of keeping my own personal notes, and just went back to other ways. Some notes are in HTML or plain text, so I can version them. Others are in an old Journal program I've used for years. But I've written a tool that I may want to release soon - and I'll want to include a usable PDF manual, which will need to include examples.
So - is there a way of creating PDF documentation where the code snippets can be easily cut-and-pasted? Preferably a way that allows me to keep "sources" in versioned text files.
EDIT
Any solution must be portable. I will need to use it on Linux and on Windows XP.
EDIT
It looks like this may be impossible.
I've tried printing from Notepad++ to the Adobe Acrobat Pro 7 printer driver. The resulting document looked fine, but cutting and pasting gave the same missing whitespace problems as occur with DocBook.
I tried using the touchup text tool in Acrobat Pro to add leading spaces. These are preserved when you save and reload - but when you select text normally in acrobat, they aren't included. You can only cut-and-paste including those spaces using the touchup text tool, so far as I can tell, which is obviously not included in reader.
In other words, this looks like a fundamental limitation - not of the PDF format itself so much as the tools that work with it. There appears to be a general assumption at work here that whitespace is insignificant - which for my purposes obviously isn't true.
EDIT
One solution may be a "text field". I can add these fairly easily using Acrobat Pro, can set a fixed width font, enter multiple lines of text and make the field read only. In Acrobat Pro 7, the text in the field then isn't selectable - but in Reader 9 it is selectable and everything is preserved when you cut and paste.
The question is - can text fields be generated directly using some kind of markup language that is usable to create complete manuals?
I'd suggest enscript. I use it for producing archives and documentations.
Also, you can merge multiple source files that were converted to PostScript with enscript into a single PDF.
If your code is kept in external files, one way would be to attach the original file(s) as PDF attachments. This could be done with Docbook, LaTeX, DITA, and a few others.
For example, if you are using this method to include code in Docbook, you can write some code to your XSL customization layer for adding the external file as an attachment to the PDF. As far as I know, this is portable (although I haven't personally tried to open PDF files with attachments in Evince, Okular, Xpdf, etc to see what happens).
If you are processing the Docbook files using even FOP, you should still be able to write something into your customization layer to attach files. See the section on PDF attachments. You could even output a link to the attachment below the codeblock in the PDF if you want to make it more discoverable to people.
A similar solution should be possible using LaTeX with the attachfile package.