Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 4 months ago.
Improve this question
I have took quite some time to get a reasonable answer to my inquiry by myself but ran into a dead end and hope you guys can help me.
Issue:
For the purpose of business reporting, I have created some juypter notebooks which include multiple pandas tables and seaborn / matplotlib plots as code cell output with some occasional markdown cells in between to provide explanations. Now, I want these reports to be in a business-ready format to share them with stakeholders. With business-ready I intend the following requirements:
The report does not include code
Output file format: PDF
The report includes a title page with title, additional information (e.g. date of analysis) and a table of contents
Tables are in a appealing visual format that allows easy reception of information
The report is well structured
... and I am not able to get all these requirements together.
So far, I prefer to work with vscode and use the browser based juypter notebook if necessary (which unfortunately lacks some functionalities).
What I have tried:
(1) this was a no brainer, I just --no-input to the nbconvert command in the anaconda shell and whatever I do regarding the next points, it excludes the code
(2) There are two ways I could find so far, which influence all subsequent steps/requirements
Way 1 ("html detour"): I convert the .ipynb to html and print it as PDF (this is a 2-step process, thus I see it as a detour)
Way 2 ("latex conversion"): I convert it to a PDF via nbconvert --to pdf and it uses latex in the background to create a pdf
(3) ...and here start the issues:
html detour: I can get a toc via the nbextension extension for jupyter notebooks and with it, I can use either the H1 header level as title or include an extra markdown cell and increase the font size with an html command such that it looks appealing. Additional information are added manually in extra code cells. However, the toc only works in the browser version of jupyter, which results in writing the analysis in vscode, going to the browser to add the toc, converting it in the shell, open the html and print it as pdf...
latex conversion: I can set up a latex template, which is included in the nbconvert command that includes a toc by design. However, it either picks up the filename as title automatically or a title I can set in the metadata of the notebook, which I can only edit from the browser. Further, the date of conversion is added below the title automatically as well, which might note be the date of the analysis in case I have to reconvert it because someone wants a minor change or something. Thus, I cannot turn auto title and date off (at least I couldn't find an option so far) and I have multiple steps as well.
(4) This one makes eventually the difference in the usability of the report
html detour: The format in the html file itself is the quite appealing format you usually get from tables using display() command on a table in jupyter (which is used anyway if you just call a variable in juyper without print()) or if you build a table in a markdown cell. The table has a bold header and every other row has a grey background. Using pandas .style method, I can format the table in the html file very nicely with red fon color for negative values only or percentage bars as cell background. However, I loose all these formats when I print the PDF. Then its just a bold header, a bold line splitting header and body and the rows. Further, all cell output tables are left aligned in the html (and I refer to the table itself, not its content) and the markdown tables are centered, which looks strange or rather - and this is the issue - unprofessional. The benefit, however, is that these tables are somewhat auto-adjusted to a letter size format in a certain range if the table would be wider than a letter page.
latex conversion: By design, the tables are not converted. I have to use pandas.set_option(display.large_repr, True) to convert all subsequent pandas table output or add .to_latex()to every single pandas table. This has several downfalls. Using this, all tables are displayed as the code that would be required to build a table in latex and while doing the analysis, this is often harder to interpret... especially if you want to find errors. Adding it only when the analysis is done, creates just unnecessary iterations. Further, I want to use the last report as template for the next and would have to delete the command, do my stuff and add it again. Wider tables taht don't fit the letter size are just cut of regardless of how much wider they are compared to the page size and I would have to check every table (last report were 20+) whether everything is included. ...and headers become longer if they include explanatory information. And finally, the latex table format eventually looks professional, but more scientifically professional and not business professional and can discourage one or another reader in my experience.
(5) So, since everything is made from cells and converted automatically, you get some strange output with headers on the end of one page and text and tables and plots on the next ...or pages with just a plot and so on...
html detour Its hard to describe the general issues I have. If you have ever printed a website, you have probably got some weird text bulk that looks unstructured with occasional half white pages where they should not be. Thats what you get, when printing the html file of a jupyter. It would help, if I could include a forced pagebreak and you can find several versions of adding pagebreaks in the cell or metadata of cells but they do not work since the html is created with a high level setting prohibiting a pagebreak. Thus, I could only go in the html code and add page breaks manually. Manuel effort I would like to avoid.
latex conversion:Well, \pagebreakworks.
So, due to the issues above, I currently tend towards the html detour but it does not make it look like an appealing report. I have tried several latex templates but was usually dissatisfied with the output since the .to_latex command makes it tedious and the report eventually looks like a scientific paper and not like a business report. The thing is, while this looks like a high standard, all these requirements are fulfilled by R-mardkown notebooks basically out of the box with slight additions to the yaml command in the top of the file. But I cannot use them for the report I want to create.
So, after this long intro (and I thank everybody for taking the time to read it), my question is how do I get appealing reports from a jupyter notebook?
Thanks!!!!!
Honestly, I'm in the same boat as you. It seems quite challenging to generate publication-ready PDF Reports natively from JupyterLab / Jupyter using nbconvert and friends.
Solution (that I'm using): What I can recommend is a different tool that will help you make amazing PDF reports. It's using RStudio's Rmarkdown (completely free) and the new ability to use Python from RStudio. I'm going to be teaching this in my R/Python Teams Course (course waitlist is up).
Report Example
Here's how I'm doing it in my course:
Step 1 - Install Rstudio IDE 1.4+ & R 4.0+
Head over to Rstudio and install their IDE. You'll also need to install R.
Step 2 - Create a Project
Step 3 - Set Python Environment of your Project
Go to Tools > Project Options. Select the Python Interpreter.
Step 4 - Begin Coding Markdown and Python
Use "Python Code Chunks".
Step 5 - Knit to PDF
Note that this requires some form of LatTex. You can install easily with this package: tinytex.
Step 6 - Check out your PDF Report
Looks pretty slick.
Try it out and see if it works for you.
I'd go like this from terminal (this is to convert to Word, but also PDF is available, just change your last output to .pdf):
jupyter nbconvert --to html notebook.ipynb --TemplateExporter.exclude_input=True && pandoc notebook.html -s -o results.docx --resource-path=img --toc
Apart from installation and other pieces there are several aspects which make usage of nbconvert for files conversion quite a tedious task .
Anyone tried out the Jupyter Executable Notebook or R markdown methods ( they are useful but there is an extra cost of time and efforts which makes it less feasible )
What i found to be very useful is there are many websites serving this purpose it quick, easy and hassle free .
I use this IPYNB TO PDF , there are others as well .
Related
I am trying to make a dynamic PDF generator as an .NET Core API. I want to take an existing PDF, or .docx file, and edit it so it replaces the current name (John Doe) with something that can be replaced like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters used in the PDF can be used to replace. So if I only use A, B, C as characters, it will make D into Times New Roman (or default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for specifying what you've already found clearly. It helps a lot providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft to do the docx->odt conversion. It's not good as you know. Use LibreOffice itself to do this step. The rest of your process remains the same.
For some documents, Libre Office does doc->odt much better. So, you can instead work with DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean don't expect perfect conversion of arbitrary complex documents. Starting from a (even complex) working baseline is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.
I am writing a quick and dirty report using pandoc and markdown.
I need to generate a PDF or a DOCX with minimum hassle, I don't care much about which (best would be both, of course). Also, I am somewhat constrained regarding the figures and tables -- they have been generated a priori with another program and I would rather be able to insert them as they are then to convert them to suit pandoc's needs.
However, the main constraint is that I don't want to edit the resulting document manually, be that LaTeX or DOCX. I want to do all editing in markdown.
Here is the problem:
In DOCX, the tables are displayed fine: they have the width of the document. However, the figures are much too wide. I can either convert the images to a lower resolution (which doesn't look nice), or manually resize the images in Word (which is out of question).
In PDF, the generated figures are fine (more or less), however another two problems appear:
The tables are too wide, because there are no line breaks, and
LaTeX being LaTeX, the order of figures and tables are "reorganized", that is, they are not consecutive.
Thus, none of the documents generated are usable for my purposes.
All I wanted to do is to slap together some results and generate a file that I can send to another scientist.
Question: what is the best solution to generate a quick and dirty report in pandoc with minimum effort and at least all results visible?
Update: Upgrading pandoc to 1.4 or later solves the issue -- the figures have now correct sizes in docx documents.
Control over image size
Currently you cannot control that feature directly from Markdown. For LaTeX/PDF output, this is automatically handled by LaTeX/pdflatex itself.
In recent months there have been some discussions going on in the Pandoc developer + user community about how to best implement it and create an easy-to-use syntax, for example
![Image Caption](./path/to/image.jpg "Image Comment"){width="60%", height="150px"}
(Warning: Example only, made up on the spot + extracted from thin air by myself -- can't remember the latest state of the discussion...) This is designed to then transfer to all the supported output formats which can contain images, not just to LaTeX/PDF.
So something along these lines is planned to be a major new feature for the next major release of Pandoc, and will start to be working better in ODT/DOCX output as well.
Control over table/cell widths and line breaks within cells
How exactly do you specify your tables in Markdown syntax?
Are you aware that Pandoc supports several variations like gid_tables, pipe_tables, simple_tables and multiline_tables?
You should look into using pandoc --from=markdown+multiline_tables ... as your command and write the critical tables as multiline_tables in your Markdown.
Read all about the details via man pandoc_markdown...
Multiline tables give you a limited control over the width of individual columns in the output, just by widening or narrowing the column widths in the markdown source itself.
Order of figures and tables when outputting LaTeX/PDF
Pandoc supports the insertion of raw_tex lines and environments into the Markdown source file. When it encounters such lines, it transmits them un-changed into its LaTeX output. (But it will be ignored for all other outputs.)
So you can insert lines like
\newpage{}
into the Markdown to enforce a page break. This already gives you some limited control over keeping the order of mis-behaving figures or tables. (After all, you said you look for a "quick and dirty" method, not a sophisticated typeset document...)
Of course, if you know LaTeX more and better, you can also use stuff like
/FloatBarrier inside your Markdown.
Going down that road (mixing LaTeX code into Markdown) gives you a few disadvantages:
The Markdown will not look as pretty any more.
The Markdown will not work fully with other output formats (should you need them).
But the advantage still are:
You will be writing and modifying the document text much faster in Markdown than authoring it in LaTeX.
You have some additional control over the final look of your PDF:
order of tables + figures
look + width of tables + figures (because, you can of course insert a complete LaTeX 'figure' or 'table' environment).
I have to extract text from invoices and bills pdf files
The files layouts can get complex, though its mostly filled with tables.
I've read a few dozens articles already about the pdf format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
Also downloaded a few tools like the python's pdfminer and some java tools, some even have rule based layout extraction, like LA-PDBtext these are all great libraries, leaving you the final step.
Adobe also has an online service called exportPdf but it can't be customized
Bottom line, I understand that in order to extract text from structured pdf files and convert it to XML for example, there should be some level of manual work.
I also found From Data Extractor, a non free tool with the ability to set extraction rules that claims to do the job, though its hard to find a proper manual and it runs only on windows.
I thought I may even try a to convert those files to images and try tesseract-ocr but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience give me a hint.
I've done a lot of PDF extraction and I can confirm as you've already discovered that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order, it could be draw 'word' at coord 20 then draw 'hello' at coord 10. Most PDF creators don't do this but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc) the more likely the text is going to be harder to get out. And actually, once a designer starts messing with fonts too much some programs will sometimes actually output words one character at a time, changing the font just slightly each time.
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear if you are looking for the development tool to automate the data extraction from bills and invoices or just for the one time tool (utility) that can be used by the non-developer?
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on tesseract oct engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I've tried using LaTeX and DocBook for documenting programming tools, to get PDF output. What I've found is that these tools are excellent in some ways - easily versioned, and generating very usable PDF manuals. But there is a serious flaw. Code-snippets cannot simply be cut-and-pasted out of the PDF.
With DocBook, the problem is the loss of whitespace - mostly for indentation, but any repeated spaces seem to get stripped out. So, once you paste the snippet into a text editor, you'll need to clean up the indentation and vertical alignment. Not too much hassle for two or three lines, but it quickly gets annoying.
With LaTeX - well, it's a mess. The following was taken from a PDF generated using the LaTeX in MikTeX 2.8.
node myclas s
f f i e l d f i e l d 0 1 : i n t ;
f i e l d f i e l d 0 2 : ” char ” ;
g;
The intended example is...
node myclass
{
field field01 : int;
field field02 : "char*";
};
Other than the fact LaTeX plays with the quotes, the intended form is what you see in Adobe Reader - but not much like what you get from a cut-and-paste. Don't ask me what's going on with the spaces, or why the braces turned into letters, or what happened to the asterisk - I don't know!
Mostly, I've noticed these things playing with ways of keeping my own personal notes, and just went back to other ways. Some notes are in HTML or plain text, so I can version them. Others are in an old Journal program I've used for years. But I've written a tool that I may want to release soon - and I'll want to include a usable PDF manual, which will need to include examples.
So - is there a way of creating PDF documentation where the code snippets can be easily cut-and-pasted? Preferably a way that allows me to keep "sources" in versioned text files.
EDIT
Any solution must be portable. I will need to use it on Linux and on Windows XP.
EDIT
It looks like this may be impossible.
I've tried printing from Notepad++ to the Adobe Acrobat Pro 7 printer driver. The resulting document looked fine, but cutting and pasting gave the same missing whitespace problems as occur with DocBook.
I tried using the touchup text tool in Acrobat Pro to add leading spaces. These are preserved when you save and reload - but when you select text normally in acrobat, they aren't included. You can only cut-and-paste including those spaces using the touchup text tool, so far as I can tell, which is obviously not included in reader.
In other words, this looks like a fundamental limitation - not of the PDF format itself so much as the tools that work with it. There appears to be a general assumption at work here that whitespace is insignificant - which for my purposes obviously isn't true.
EDIT
One solution may be a "text field". I can add these fairly easily using Acrobat Pro, can set a fixed width font, enter multiple lines of text and make the field read only. In Acrobat Pro 7, the text in the field then isn't selectable - but in Reader 9 it is selectable and everything is preserved when you cut and paste.
The question is - can text fields be generated directly using some kind of markup language that is usable to create complete manuals?
I'd suggest enscript. I use it for producing archives and documentations.
Also, you can merge multiple source codes ps'ed with enscript into another pdf.
If your code is kept in external files, one way would be to attach the original file(s) as PDF attachments. This could be done with Docbook, LaTeX, DITA, and a few others.
For example, if you are using this method to include code in Docbook, you can write some code to your XSL customization layer for adding the external file as an attachment to the PDF. As far as I know, this is portable (although I haven't personally tried to open PDF files with attachments in Evince, Okular, Xpdf, etc to see what happens).
If you are processing the Docbook files using even FOP, you should still be able to write something into your customization layer to attach files. See the section on PDF attachments. You could even output a link to the attachment below the codeblock in the PDF if you want to make it more discoverable to people.
A similar solution should be possible using LaTeX with the attachfile package.
I have a PDF file which has the marklist of certain exam.
I am particularly interested in the first list, but which unfortunately has 2112 entries. And they aren't properly formatted. I need to sort all these entries (based on marks in last 2 columns- sum of marks in Aptitude and Computer), to know what my rank is.
I tried to copy in in MS Word and Excel, but if you try it, you can see it won't help. After pasting it in a plain text file, I tried to format it using regular expressions (in Notepad++), wrote a code in C to properly separate each field by '\t' (so that later I can properly copy them in an Excel sheet), but the inconsistency made me fail (some entries are spawned multiple lines, the "names" do not have fixed no. of fields).
Can someone come up with any idea that will make it possible to copy the first list in PDF to a spreadsheet in tabular form exactly as the original file?
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult
For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting point '1.' above! -- see these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Well I sort of managed it. I first copied it to a plain text file, deleted all letters from it leaving only the serial number and corresponding marks, separated by spaces or tabs. Then using "import" in an OpenOffice Spreadsheet, told it the delimiters are spaces and tabs (combine them if necessary) and bingo! I got my rank.
But I would still like to know if it's possible to copy the whole table as it is. So keeping this question open.
I once was tasked with building a parser which would extract data from a pdf with tabular and non-tabular data in a number of different encodings and with a mix a rtl and ltr text. That project took quite the effort but with a simple English table you should be able to dissect the pdf in no time. Look for the PDF specs on adobe.com and if it is that desperate start digging in.
Also you'll first need to use pdftk.exe to uncompress the file.
A shortcut that me be of aid:
http://www.adobe.com/devnet/pdf/pdf_reference.html
This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx