Creating math formulas in converted PDF files (where the input is a website) - pdf

I am using MathJax in my website to create math formulas and it's working great.
I need a way to output those formulas into PDF documents generated by my server.
I'm currently using a Windows server and backend is PHP.
I'm using TCPDF to create my PDF files, but I can't find any way to get my math formulas into those PDFs, because the math formulas are stored in my database in TeX format.
Is there any way to convert them to math formulas before I insert them into my document?
At first I tried to use the SVG output format from MathJax and somehow extract that output, save it in my database and use that with TCPDF, but apparently it is not good enough because the SVG output from MathJax isn't only SVG.
I have looked for online tools to convert my TeX formulas into images, but I didn't find any site that provides an API for that. So I looked for a command line tool, but it seems like most (or all of them) are for Linux systems only.
I tried this one, tex2png, but it didn't work.
BTW, I don't have LaTeX installed. Do I need to install it in order to use tex2png?

Related

How to create business ready reports from jupyter notebooks? [closed]

I have spent quite some time trying to find a reasonable answer to my inquiry by myself but ran into a dead end, and I hope you can help me.
Issue:
For the purpose of business reporting, I have created some Jupyter notebooks which include multiple pandas tables and seaborn / matplotlib plots as code cell output, with some occasional markdown cells in between to provide explanations. Now, I want these reports to be in a business-ready format to share them with stakeholders. By business-ready I mean the following requirements:
(1) The report does not include code
(2) Output file format: PDF
(3) The report includes a title page with title, additional information (e.g. date of analysis) and a table of contents
(4) Tables are in an appealing visual format that makes the information easy to take in
(5) The report is well structured
... and I am not able to get all these requirements together.
So far, I prefer to work with vscode and use the browser-based jupyter notebook if necessary (which unfortunately lacks some functionalities).
What I have tried:
(1) This was a no-brainer: I just add --no-input to the nbconvert command in the anaconda shell, and whatever I do regarding the next points, it excludes the code.
(2) There are two ways I could find so far, which influence all subsequent steps/requirements
Way 1 ("html detour"): I convert the .ipynb to html and print it as PDF (this is a 2-step process, thus I see it as a detour)
Way 2 ("latex conversion"): I convert it to a PDF via nbconvert --to pdf and it uses latex in the background to create a pdf
(3) ...and here start the issues:
html detour: I can get a toc via the nbextension extension for jupyter notebooks and with it, I can use either the H1 header level as title or include an extra markdown cell and increase the font size with an html command such that it looks appealing. Additional information is added manually in extra code cells. However, the toc only works in the browser version of jupyter, which results in writing the analysis in vscode, going to the browser to add the toc, converting it in the shell, opening the html and printing it as pdf...
latex conversion: I can set up a latex template, which is included in the nbconvert command and includes a toc by design. However, it either picks up the filename as title automatically or a title I can set in the metadata of the notebook, which I can only edit from the browser. Further, the date of conversion is added below the title automatically as well, which might not be the date of the analysis in case I have to reconvert it because someone wants a minor change or something. Thus, I cannot turn auto title and date off (at least I couldn't find an option so far) and I have multiple steps as well.
(4) This one ultimately makes the difference in the usability of the report
html detour: The format in the html file itself is the quite appealing format you usually get from tables using the display() command on a table in jupyter (which is used anyway if you just call a variable in jupyter without print()) or if you build a table in a markdown cell. The table has a bold header and every other row has a grey background. Using the pandas .style accessor, I can format the table in the html file very nicely, with red font color for negative values only or percentage bars as cell background. However, I lose all these formats when I print the PDF. Then it's just a bold header, a bold line splitting header and body, and the rows. Further, all cell output tables are left aligned in the html (and I refer to the table itself, not its content) and the markdown tables are centered, which looks strange or rather - and this is the issue - unprofessional. The benefit, however, is that these tables are somewhat auto-adjusted to a letter size format in a certain range if the table would be wider than a letter page.
latex conversion: By design, the tables are not converted. I have to use pandas.set_option('display.latex.repr', True) to convert all subsequent pandas table output or add .to_latex() to every single pandas table. This has several downsides. Using this, all tables are displayed as the code that would be required to build a table in latex and while doing the analysis, this is often harder to interpret... especially if you want to find errors. Adding it only when the analysis is done just creates unnecessary iterations. Further, I want to use the last report as a template for the next and would have to delete the command, do my stuff and add it again. Wider tables that don't fit the letter size are just cut off regardless of how much wider they are compared to the page size, and I would have to check every table (the last report had 20+) to see whether everything is included. ...and headers become longer if they include explanatory information. And finally, the latex table format eventually looks professional, but more scientifically professional and not business professional, and can discourage one or another reader in my experience.
(5) So, since everything is made from cells and converted automatically, you get some strange output with headers at the end of one page and text and tables and plots on the next ...or pages with just a plot and so on...
html detour: It's hard to describe the general issues I have. If you have ever printed a website, you have probably got some weird text bulk that looks unstructured, with occasional half-white pages where they should not be. That's what you get when printing the html file of a jupyter notebook. It would help if I could include a forced pagebreak, and you can find several versions of adding pagebreaks in the cell or metadata of cells, but they do not work since the html is created with a high-level setting prohibiting a pagebreak. Thus, I could only go into the html code and add page breaks manually, which is manual effort I would like to avoid.
latex conversion: Well, \pagebreak works.
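One way to avoid hand-editing the exported html for page breaks might be a small post-processing step. The following is only a minimal sketch using the Python standard library: the function name add_print_page_breaks is made up for illustration, and it assumes that a print-only CSS rule marked !important is enough to override whatever the exported stylesheet sets, which has not been verified against any particular nbconvert template.

```python
import re

# Print-only CSS rule asking the browser to start a new page before each
# selected heading. The "!important" is an assumption: it is meant to win
# over rules in the exported stylesheet, but that depends on the template.
PRINT_CSS_TEMPLATE = (
    "<style>@media print {{ {selector} "
    "{{ page-break-before: always !important; }} }}</style>"
)


def add_print_page_breaks(html: str, tags=("h1",)) -> str:
    """Return `html` with a print-only page-break rule injected into <head>."""
    style = PRINT_CSS_TEMPLATE.format(selector=", ".join(tags))
    # Insert the rule just before the first closing </head> tag.
    patched, count = re.subn(
        r"</head>",
        lambda match: style + match.group(0),
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # No <head> section found: fall back to prepending the rule.
        patched = style + html
    return patched
```

Reading the exported file, passing its text through add_print_page_breaks and writing it back would then replace the manual step. Note that the very first heading also gets a break before it, so a title page may need a different tag than the section headings.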
So, due to the issues above, I currently tend towards the html detour, but it does not make it look like an appealing report. I have tried several latex templates but was usually dissatisfied with the output, since the .to_latex command makes it tedious and the report eventually looks like a scientific paper and not like a business report. The thing is, while this looks like a high standard, all these requirements are fulfilled by R Markdown notebooks basically out of the box with slight additions to the YAML header at the top of the file. But I cannot use them for the report I want to create.
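Regarding the title that can only be edited from the browser in (3): an .ipynb file is plain JSON, so the notebook metadata can also be changed from a script. This is a minimal sketch, assuming the latex template reads the title from a top-level metadata key called "title" (the key name is an assumption here, as is the helper name set_notebook_title).

```python
import json
from pathlib import Path


def set_notebook_title(path: str, title: str) -> None:
    """Write `title` into the top-level metadata of the .ipynb file at `path`."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    # "title" as the metadata key is an assumption; check your template.
    notebook.setdefault("metadata", {})["title"] = title
    nb_path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

Calling set_notebook_title("report.ipynb", "Quarterly Sales Report") before running nbconvert would keep the whole workflow in the shell. The file name and title in that call are placeholders.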
So, after this long intro (and I thank everybody for taking the time to read it), my question is how do I get appealing reports from a jupyter notebook?
Thanks!!!!!
Honestly, I'm in the same boat as you. It seems quite challenging to generate publication-ready PDF Reports natively from JupyterLab / Jupyter using nbconvert and friends.
Solution (that I'm using): What I can recommend is a different tool that will help you make amazing PDF reports. It's using RStudio's Rmarkdown (completely free) and the new ability to use Python from RStudio. I'm going to be teaching this in my R/Python Teams Course (course waitlist is up).
Report Example
Here's how I'm doing it in my course:
Step 1 - Install RStudio IDE 1.4+ & R 4.0+
Head over to RStudio and install their IDE. You'll also need to install R.
Step 2 - Create a Project
Step 3 - Set Python Environment of your Project
Go to Tools > Project Options. Select the Python Interpreter.
Step 4 - Begin Coding Markdown and Python
Use "Python Code Chunks".
Step 5 - Knit to PDF
Note that this requires some form of LaTeX. You can install it easily with this package: tinytex.
Step 6 - Check out your PDF Report
Looks pretty slick.
Try it out and see if it works for you.
I'd go like this from terminal (this is to convert to Word, but also PDF is available, just change your last output to .pdf):
jupyter nbconvert --to html notebook.ipynb --TemplateExporter.exclude_input=True && pandoc notebook.html -s -o results.docx --resource-path=img --toc
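If the two commands above should run from a script rather than be typed by hand, they could be wrapped roughly as follows. This is a sketch only: the function names build_commands and convert are made up, the flags are exactly those from the one-liner above, and it assumes nbconvert writes the html file next to the notebook with the same base name.

```python
import subprocess
from pathlib import Path


def build_commands(notebook: str, output: str):
    """Return the nbconvert and pandoc argument lists for one notebook."""
    # Assumption: nbconvert writes <notebook name>.html next to the notebook.
    html = str(Path(notebook).with_suffix(".html"))
    nbconvert = [
        "jupyter", "nbconvert", "--to", "html", notebook,
        "--TemplateExporter.exclude_input=True",
    ]
    pandoc = ["pandoc", html, "-s", "-o", output, "--resource-path=img", "--toc"]
    return nbconvert, pandoc


def convert(notebook: str, output: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(notebook, output):
        subprocess.run(command, check=True)
```

convert("notebook.ipynb", "results.docx") mirrors the shell line; check=True plays the role of && by raising if the first step fails.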
Apart from installation and other pieces, there are several aspects which make using nbconvert for file conversion quite a tedious task.
Has anyone tried the Jupyter Executable Notebook or R Markdown methods? (They are useful, but the extra cost in time and effort makes them less feasible.)
What I found to be very useful is that there are many websites serving this purpose; they are quick, easy and hassle-free.
I use this IPYNB TO PDF; there are others as well.

ReadTheDocs generates PDFs without my HTML tables

We are converting a sizeable document for hosting on ReadTheDocs. We weren't happy with the simple presentation enabled by Markdown table syntax, so we coded our tables as HTML. Very nice in the HTML viewer (e.g., the end of http://manual.cytoscape.org/en/latest/Command_Line_Arguments.html).
In the PDF version generated by ReadTheDocs, each of our tables is completely missing (see page 9 on https://media.readthedocs.org/pdf/cytoscape-working-copy/latest/cytoscape-working-copy.pdf).
Have we made a mistake by coding tables as HTML? Could we have taken a different route and gotten nice tables in both HTML and PDF?
Any advice would be helpful ...
Thanks!
I have not used ReadTheDocs myself, but from reading their Getting Started guide, I assume you are using Sphinx? While Markdown supports embedding raw HTML, Sphinx does not support converting it to other formats.
You should consider moving to reStructuredText (Sphinx's native markup format), as it is much more advanced than Markdown. It can even be extended with custom directives and roles, should you need this. But be sure to first check whether reStructuredText tables offer the flexibility you require. Pandoc can convert your Markdown files to reStructuredText.
I see you are using a table to document command line options. reStructuredText supports documenting command line options using option lists. In theory, you could change how option lists are represented in the output document, but this might not be easy to accomplish, especially for PDF output using LaTeX (shameless plug: using rinohtype for PDF output should make this much easier in the future).

Suggestions on extracting text from uploaded documents

I currently have a number of documents uploaded to my website on a daily basis (.doc, .docx, .odt, pdf) and these docs are stored in a sql database (mediumblob).
Currently I open the docs from the database and cut and paste a text version into a field in the database for a quick reference and search function.
I'm looking to automate this "cut & paste" process - formatting isn't a real concern just as long as I can extract the text - and was hoping that some people may be able to suggest a good route to go down?
I've tried manipulating the content of the blob field using regex but it is not really working.
I've been looking at Apache POI with a view to extracting the text at the point of upload, but I can't help thinking that this may be a bit of an overkill given my relatively simple needs.
Given the various document formats I encounter and the current storing of the content in a blob field would Apache POI be the best solution to use in this instance or can anybody suggest an alternative?
Help and suggestions greatly appreciated.
Chris
Apache POI will only work for the Microsoft Office formats (.xls, .docx, .msg etc). For these formats, it provides classes for working with the files (always read, for many write support too), as well as text extractors.
For a general text extraction framework, you should look at Apache Tika. Tika uses POI internally to handle the Microsoft formats, and uses a number of other libraries to handle different formats. Tika will, for example, handle both PDF and ODF/ODT, which are the other two file formats you mentioned in the question.
There are some quick start tutorials and examples on the Apache Tika website that I'd suggest you have a look through. It's very quick to get started with, and you should be able to easily change your code to send the document through Tika during upload to get a plain text version, or even XHTML if that's more helpful to you.
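As a rough illustration of the upload-time extraction described above, the Tika command line application can be called from a script. This sketch assumes Java is on the PATH and that the tika-app jar has been downloaded; the jar path and both function names are placeholders, not part of Tika itself. A document held in a blob field would first have to be written to a temporary file.

```python
import subprocess


def tika_text_command(jar_path: str, document_path: str):
    """Build the command that asks the Tika app for plain text output."""
    # "--text" selects plain text instead of the default XHTML output.
    return ["java", "-jar", jar_path, "--text", document_path]


def extract_text(jar_path: str, document_path: str) -> str:
    """Run Tika on one document and return whatever text it prints."""
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        check=True,
        capture_output=True,
    )
    # The output encoding is assumed to be UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be stored in the search field in place of the manual cut and paste.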

generating pdf files with php

After some work with PHPExcel, I finally got it to generate sheets of 3000 cells in ~5 seconds by using a big array.
With the same data, I'll need to generate some pdf files. I've tried to do it with PHPExcel, but it is not a good choice. Generating a pdf file with PHPExcel took a lot of time and a lot of resources.
I've tried to generate a pdf file with the html2pdf php library. A file containing a table with 3000 cells took 20 seconds to generate.
I can't find a good solution to this. Do you know any good library? Do you know any good practices for generating pdf files faster, with a low load on the server side?
You can use the FPDF library to generate PDF files in a fast manner and you can use the Write HTML tables add-on to achieve what you want (see example at the bottom of the page).
PhpExcel uses TCPDF to generate PDF, the same as HTML2PDF with PHP5:
HTML2PDF is a HTML to PDF converter written in PHP4 (use FPDF), and PHP5 (use TCPDF).
I think that when generating a PDF, PhpExcel first generates XLS, then converts it to HTML, then again converts it to PDF. Not very efficient.
That is why using HTML2PDF directly cuts the time to 20 seconds.
--
To cut waiting time even more, maybe you could try another library, like dompdf, and keep skipping PhpExcel when what you need is a PDF.
If your table doesn't have formulas, you can generate all the content in an array, and pass it to some function to generate an XLS with PhpExcel, and to another to generate a PDF.
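The idea of building the data once and handing the same array to separate output functions can be sketched as below. This is an illustration in Python rather than PHP, and it swaps in CSV and an HTML table as stand-ins for the XLS and PDF writers, since those depend on the library you choose; both function names are made up.

```python
import csv
import io
from html import escape


def rows_to_csv(rows) -> str:
    """Render the shared rows as CSV text (stand-in for the spreadsheet writer)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def rows_to_html_table(rows) -> str:
    """Render the same rows as an HTML table (stand-in for the PDF input)."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

Both functions take the same list of rows, so the expensive step of collecting the data runs once and neither output has to be converted from the other.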

create one pdf from multiple ppt files

Does anyone know how I can create one pdf file from multiple ppt files?
Either a script or a computer program would do, but an existing program would be best.
I searched the web for something like this but I didn't get any results.
If you want to convert the PPT/PPTX files to PDF and then join those converted PDF files into a single PDF using either .NET or Java, you may try Aspose.Slides and Aspose.Pdf.Kit components.
Aspose.Slides allows you to convert the PPT/PPTX files to PDF and Aspose.Pdf.Kit allows you to join the PDF files into a single PDF. Please see if this solution can work for your scenario.
Disclosure: I work as developer evangelist at Aspose.