extracting data from pdf and make a list of lists

I need some help extracting and manipulating data from a PDF.
The PDF in question: https://www.england.nhs.uk/wp-content/uploads/2018/04/national-tables-5-mgml-v3.pdf
[Screenshot: national dose band table]
What I want is to create a list of lists from the values in columns 1 and 3, like this one: oxalirange = ([5.75, 6.24], [6.25, 6.74], [6.75, 7.24],...
I know how to extract the PDF as an Excel table via Camelot and pandas. So far I have been compiling the list manually, so what I'd like to know is how to automate that with Python and pandas (or any other Python library).
I am happy to be pointed to the most relevant website so I can find the info myself.
Thanks in advance.

You can use the xlrd library in Python to read an Excel file. Here is a link to its documentation; note, however, that it is limited to .xls files (the old Excel format):
https://xlrd.readthedocs.io/en/latest/
Here is a list of alternative Excel-related libraries:
https://www.python-excel.org/
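
Whichever library reads the table, the step the question asks about (pairing columns 1 and 3 into a list of lists) might look roughly like the sketch below. It is only an illustration under assumptions: the rows are taken to be lists of strings (for example from `df.values.tolist()` on a table extracted with Camelot), the two wanted columns are assumed to sit at positions 0 and 2, and the header labels and the middle column in `sample_rows` are made-up placeholders. `column_pairs` is a hypothetical helper name, not part of pandas or Camelot.

```python
def column_pairs(rows, low_col=0, high_col=2):
    """Return [[low, high], ...] for every row whose two cells parse as numbers."""
    pairs = []
    for row in rows:
        try:
            pairs.append([float(row[low_col]), float(row[high_col])])
        except (ValueError, TypeError, IndexError):
            # Header rows, blank cells and short rows are skipped.
            continue
    return pairs


# Placeholder rows shaped like a table extracted as strings. The outer values
# come from the question; the header text and middle column are invented.
sample_rows = [
    ["Range from", "Band", "Range to"],
    ["5.75", "6", "6.24"],
    ["6.25", "6.5", "6.74"],
    ["6.75", "7", "7.24"],
]

oxalirange = column_pairs(sample_rows)
print(oxalirange)
```

Skipping rows that fail to parse keeps headers and page furniture out of the result without having to know in advance which rows they occupy.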

Related

Select three rows of text in PDF using VBA

I am trying to use VBA to select three rows of data in a PDF file and copy them to the clipboard. I have tried third-party libraries but I still can't seem to find a simple solution. I can use the cursor to select the data and copy it, so I just want to automate this step with VBA.
I have looked high and low for an answer to this and I feel like it might be really simple and I'm just missing it. I assume I could just use the "highliteList" method in the acrobat library to select the rows, but I don't know how to specify where to begin the selection. There is a header on each page, so I just want to say something like:
For Each header In pdf.pages
NextLine.SelectRow
NextLine.SelectRow
Next header
Selection.CopyToClipboard
Is this possible? I know those methods probably don't exist; I was only using them as an example. Does anyone have experience with doing this? Thanks in advance for any help.

I found a solution, for all those interested. I used the Bytescout PDF Extractor library to convert the file to .xls format. Then I just parsed out what I needed in Excel, since Excel is easy to work with via VB.NET.

Creating math formulas in convert PDF files (where input is website)

I am using MathJax in my website to create math formulas and it's working great.
I need a way to output those formulas into PDF documents generated by my server.
I'm currently using a Windows server and backend is PHP.
I'm using TCPDF to create my PDF files, but I can't find any way to get my math formulas into those PDFs, because the formulas are stored in my database in TeX format.
Is there any way to convert them to rendered math formulas before I insert them into my document?
At first I tried to use the SVG output format from MathJax and somehow extract that output, save it in my database and use that with TCPDF, but apparently it is not good enough because the SVG output from MathJax isn't only SVG.
I have looked for online tools to convert my TeX formulas into images, but I didn't find any site that provides an API for that. So I looked for a command line tool, but it seems like most (or all of them) are for Linux systems only.
I tried this one, tex2png, but it didn't work.
BTW, I don't have LaTeX installed. Do I need to install it in order to use tex2png?

Import PDF Fields into Database

I'm trying to import fields from a fillable PDF into a SQL database.
I can't seem to find an answer online. These are the closest I found:
What's the best way to import/read data from pdf files?
Insert a PDF file into Core Data?
http://www.utteraccess.com/forum/Import-Fillable-Pfd-Data-t1971535.html
So I'm wondering: does anyone know how to extract data from a fillable PDF into a database (or into Excel, from which it can be imported into a database)?
Thanks

Data from fillable PDFs can be exported into an .FDF file, which is a text file. pdftk is a command-line utility that will allow you to extract the data programmatically. You will then need to write a custom parser to pull the data out of the .FDF file.
It won't be a lot of fun, but it should be do-able.
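
As a rough idea of what such a parser could look like, here is a minimal sketch in Python. It assumes the FDF was produced with something like `pdftk form.pdf generate_fdf output data.fdf` and that each field appears as a small dictionary with a `/T (name)` and a `/V (value)` entry holding plain parenthesised strings. The sample text, the `form_fields` table and the `parse_fdf_fields` name are all made up for illustration; real files can also contain name values such as `/V /Yes`, nested fields, or `<` and `>` inside values, none of which this handles.

```python
import re
import sqlite3

# A field dictionary with no nested dictionaries inside it.
FIELD_RE = re.compile(r"<<([^<>]*)>>", re.S)
# Parenthesised strings, allowing backslash-escaped characters.
NAME_RE = re.compile(r"/T\s*\(((?:\\.|[^\\)])*)\)")
VALUE_RE = re.compile(r"/V\s*\(((?:\\.|[^\\)])*)\)")


def parse_fdf_fields(fdf_text):
    """Return {field_name: value} for simple text fields in FDF text."""
    fields = {}
    for block in FIELD_RE.findall(fdf_text):
        name = NAME_RE.search(block)
        if name is None:
            continue  # not a field dictionary (e.g. the trailer)
        value = VALUE_RE.search(block)
        fields[name.group(1)] = value.group(1) if value else ""
    return fields


# Made-up sample in the general shape of an FDF file.
sample_fdf = """%FDF-1.2
1 0 obj
<<
/FDF
<<
/Fields [
<<
/V (Jane Doe)
/T (name)
>>
<<
/V (42)
/T (age)
>>]
>>
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
"""

fields = parse_fdf_fields(sample_fdf)

# Load the parsed fields into a database (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE form_fields (name TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO form_fields VALUES (?, ?)", fields.items())
conn.commit()
```

Storing one row per field keeps the table independent of any particular form layout; a fixed-column table would work too if every PDF uses the same fields.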

You can use pdftk. I used it and it works like a charm. It takes a lot of coding, though. You can get back to me if you need any help.

writing text to a pdf file

I have several PDF files (about 20), and every month or so I need to change some fields with new data. This is a very time-consuming task, and I would like to know if there is an easy way to do it via some sort of application where users can change the values of the variables that have to be stored in the different PDF files. This would be an enormous time saver. Thanks for any help.

There are lots of solutions for this. If you are willing to write some code, things can get really interesting.
A simple solution would be to create a template PDF file with placeholder fields (like #{name}, #{age}, etc.). When a new PDF needs to be created with new values, you can simply use iText to edit the PDF and replace the placeholders with the actual values.
You can also use JasperReports for this, but it would be overkill for just 20-odd documents.
If you are interested in a sample program, I'd be happy to provide you with one.
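
The placeholder idea can be illustrated independently of any PDF library. The sketch below covers only the text substitution step, on a plain string; reading the text out of a PDF and writing it back would still need a PDF library. The `#{name}` syntax is the one suggested above, while `fill_placeholders` and the sample values are hypothetical.

```python
import re

# Matches placeholders of the form #{name}.
PLACEHOLDER_RE = re.compile(r"#\{(\w+)\}")


def fill_placeholders(template_text, values):
    """Replace each #{key} with values[key]; unknown keys are left untouched."""
    def replace(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    return PLACEHOLDER_RE.sub(replace, template_text)


filled = fill_placeholders(
    "Name: #{name}, age: #{age}, city: #{city}",
    {"name": "Ada", "age": 36},
)
print(filled)
```

Leaving unknown placeholders in place makes missing values easy to spot in the generated document rather than silently producing blanks.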

If you have form fields in the PDF file then you may use Aspose.Pdf (.NET or Java version) to fill data into those fields programmatically. You can either fill the fields using individual values or import the data from the XML/FDF/XFDF files etc. You can take a template PDF and save the output PDF files with different values. Please see if this might help in your scenario.
Disclosure: I work as developer evangelist at Aspose.

spreadsheet program that supports reading HDF5 files

Is there any spreadsheet program that supports reading HDF5 files?

Have you already tried HDFView?
Its tabular view is quite similar to a spreadsheet application. You can also save the data to a text file and then open it with a more standard spreadsheet application if you prefer:
http://www.hdfgroup.org/hdf-java-html/hdfview/UsersGuide/ug05spreadsheet.html#ug05save

You can download HDFview here:
http://www.hdfgroup.org/hdf-java-html/hdfview/index.html
