I'm looking for a way to convert a PDF document into multiple ICS files that staff can use to add their fortnightly roster to their smartphone calendars or to Outlook on their desktops. The information needed to create the files would be pulled from the PDF by searching each column for selected initials, then reading the data from the same row as the initials. Is there a particular order the data needs to appear in within the ICS file so that it imports into a smartphone calendar?
You can look into PDF APIs for details on handling a PDF programmatically.
Here are some online converters that could help. They convert a PDF into a Word document:
http://www.pdftoword.com/success.aspx
http://www.pdfescape.com/account/?expired
However, reconstructing structured data from a PDF is not trivial, because a program has to deduce the semantics from the layout. Most programs can therefore only recover scattered data from a PDF.
I've done this with Perl and the Windows Adobe PDF viewer: select all the text in the PDF, then copy and paste it into a text file. As the previous answer said, you have to write Perl (or any other text-processing language) to pick out the data according to the layout of the PDF you have. You can then print it from Perl as CSV, iCal, or whatever format you want. I've shared my code on GitHub. I'm not sure if you know Git, but send me a private message if you want me to send the Perl code outside of Git.
The PDFs I've converted are here:
http://recplexonline.com/sports/hockey/old-geezers-hockey-35
The GitHub repository with my Perl code and the input files I used is here:
https://github.com/jdeltoft/PdfParse
It's pretty ugly Perl, sorry for that. But it works. I'll try to clean it up soon.
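On the original question about ordering in the ICS file: as far as I know, RFC 5545 does not make the order of properties inside an event significant. What matters is the nesting (a VCALENDAR wrapping one or more VEVENT blocks), the required properties (VERSION and PRODID on the calendar, UID and DTSTAMP on each event), and CRLF line endings. Here is a minimal Python sketch of that layout. The function names, the PRODID value, and the tuple layout of a shift are my own assumptions, and it does not fold long lines.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape the characters that are special inside an iCalendar TEXT value."""
    return (value.replace("\\", "\\\\").replace(";", "\\;")
                 .replace(",", "\\,").replace("\n", "\\n"))


def utc_stamp(dt):
    """Format a timezone-aware datetime in the UTC form, e.g. 20240101T090000Z."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def make_ics(shifts, stamp=None):
    """Build one calendar from (uid, summary, start, end) tuples.

    start and end must be timezone-aware datetimes.
    """
    stamp = stamp or datetime.now(timezone.utc)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//Example//Roster Export//EN",  # hypothetical product id
    ]
    for uid, summary, start, end in shifts:
        lines += [
            "BEGIN:VEVENT",
            f"UID:{uid}",
            f"DTSTAMP:{utc_stamp(stamp)}",
            f"DTSTART:{utc_stamp(start)}",
            f"DTEND:{utc_stamp(end)}",
            f"SUMMARY:{escape_text(summary)}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines end with CRLF.
    return "\r\n".join(lines) + "\r\n"
```

To get one file per staff member, group the parsed roster rows by initials and call make_ics once per group. When writing the result, open the file with newline="" so Python does not translate the CRLF line endings.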
Related
I have 2 sets of PDFs like below
The columns change in every PDF. I tried Document Understanding, Read PDF Text, Read PDF Text with OCR, and screen scraping, but none of them works properly.
Using Read PDF Text, I got the output below.
I need to get the table including the blank (empty) cells. How can I get this?
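One way to keep the empty cells, outside of UiPath, is to slice each text line at fixed column offsets instead of splitting on whitespace. The Python sketch below assumes the extracted text preserves the horizontal layout (for example, text produced by a layout-preserving extractor), that header labels are separated by at least two spaces, and that cell values start at or after their header's offset. The function names are mine.

```python
import re


def column_starts(header_line):
    """Infer column start offsets from a header whose labels are 2+ spaces apart."""
    # A label is a run of words separated by single spaces.
    return [m.start() for m in re.finditer(r"\S+(?: \S+)*", header_line)]


def split_fixed_width(line, starts):
    """Slice a line at the given offsets; a blank slice becomes an empty cell."""
    cells = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else None
        cells.append(line[start:end].strip())
    return cells
```

Because the columns change from one PDF to the next, the offsets are recomputed from each document's own header line. Right-aligned values that begin before their header label would need wider boundaries than this sketch infers.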
There are tools & APIs available that offer swift PDF data extraction to Excel. Some of them are integrated with UiPath which opens a wider range of activities that can be performed.
UiPath integrations can be easily set up to extract data automatically. They can also merge with other integrations for a more sophisticated task.
Here’s a tutorial explaining how to extract PDF data using the UiPath integration. You can extract a whole PDF document, selected pages, or even a specific area of the PDF file. The output can be written to Excel or CSV, with further configuration options available.
Find a more detailed explanation below.
How to Convert PDF to CSV with UiPath - PDF.co
Learn how you can convert PDF to CSV with UiPath and PDF.co extension. Check out the detailed tutorial and video demos.
https://pdf.co/pdf-to-csv-with-uipath
I need help creating a solution for PDF text replacement. We hired a programmer who tried to achieve our objective with a Python app but failed.
Our project hopes to automate these steps:
Get a folder with PDF documents.
In each document, find a particular piece of text (usually in the upper third of the page). The text can appear on multiple pages.
Hide or erase that text and replace it with different information pulled from Excel or a SQL database. Make sure that every occurrence of the text is replaced.
Next, rename the document based on the assigned doc name (pulled from Excel or SQL).
Produce a report of all documents that were processed, showing when a new version was or was not created.
Keep the original document saved for review and comparison.
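The file-handling part of these steps (backup, rename, report) can be sketched independently of the actual PDF editing. In the Python sketch below, replace_text is a hypothetical callable you would supply, for example built on a PDF library's redaction feature; it is expected to write the new file and return the number of replacements, or write nothing and return 0. The jobs mapping stands in for the rows pulled from Excel or SQL. All names are assumptions for illustration.

```python
import csv
import shutil
from datetime import datetime
from pathlib import Path


def process_folder(src_dir, out_dir, backup_dir, jobs, replace_text, report_path):
    """Process every PDF in src_dir.

    jobs maps an original filename to (old_text, new_text, new_name).
    replace_text(src_path, dst_path, old_text, new_text) -> replacement count.
    """
    src, out, backup = Path(src_dir), Path(out_dir), Path(backup_dir)
    out.mkdir(parents=True, exist_ok=True)
    backup.mkdir(parents=True, exist_ok=True)
    rows = []
    for pdf in sorted(src.glob("*.pdf")):
        new_name, status = "", "skipped: no job for this file"
        job = jobs.get(pdf.name)
        if job is not None:
            old_text, new_text, new_name = job
            # Keep an untouched copy of the original for review and comparison.
            shutil.copy2(pdf, backup / pdf.name)
            try:
                count = replace_text(pdf, out / new_name, old_text, new_text)
                status = (f"created ({count} replaced)" if count
                          else "not created: text not found")
            except Exception as exc:  # record the failure and carry on
                status = f"not created: {exc}"
        stamp = datetime.now().isoformat(timespec="seconds")
        rows.append([pdf.name, new_name, status, stamp])
    # Write the processing report as CSV so it opens directly in Excel.
    with open(report_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["original", "new_name", "status", "processed_at"])
        writer.writerows(rows)
    return rows
```

Keeping the PDF editing behind one callable means the replacement logic can be tested on a handful of documents before the whole folder is run.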
I am happy to provide the original code from the developers if necessary, but it did not work...
Thank you for your help, community!
I have a Notes app that was designed for the browser, not the client. It allowed upload of files into the documents, so nearly all the documents have files. The files are stored in the NSF as $FILE and displayed in the documents as links.
I am using Adobe Acrobat Pro to create PDFs from the documents and need to include the file attachments within the PDFs. However, the PDFs just include links to the files, not the attachments. Can I write an agent to run against the documents to get those files and embed them within the documents? When I view those documents through the client, I see all of the HTML etc., and then at the bottom of the document the file attachments appear. When I view these same documents in the browser, the file attachments do not appear. If I could merely ensure that they are there, then when running the PDF generator in Acrobat Pro, they would be included in the PDFs and executable.
I am really stuck here, with no other way to 'archive' this Notes database with all the data intact.
Thanks in advance for any insights!!
Ginni
There is a commercial product from Swing Software that does this. I hear that it's quite good, but I've never used it. Let me explain why...
The way I usually end up doing this is just quick-and-dirty. I write an agent to export the files, using the document UNID as part of the filename. The same agent exports all the data fields from the document into a CSV file, and I add a column with the filename of the extracted attachment. In your case, I would add two columns -- one for the extracted attachment(s), and one for the generated PDF. The CSV serves as an index for the exported data. It can be imported into something more friendly, or just left as-is and brought up in Excel, depending on the customer's usage requirements and available systems. I've recommended Swing Software's product and offered to explore other ideas for developing code (e.g., using wkhtmltopdf for Domino web apps to capture a WYSIWYG rendering based on an HTML crawl) for PDF rendering of Notes documents for a couple of clients, but none of them have justified the cost that would be involved in buying licenses and/or writing the code. Quick and dirty always seems to win, even when there are retention and eDiscovery considerations taken into account.
I am implementing a program that works like Google Cloud Print. It is a virtual printer using the PostScript class driver, as the picture shows (I added the 64 suffix). The Chinese labels translate to English as:
Helpfile, ConfigurationFile, DataFile, DriverFile, Dependency.
I use RedMon to capture the standard input and Ghostscript to convert it to PDF. At the same time, I get the job information from the printer queue. With the PDF and the job information, I can send them to my server, and my server can then print the document. I invoke Ghostscript as the picture shows.
When I use WPS (a Chinese application similar to Microsoft Word) to print a docx document, the job information in the job queue is correct. For example, when I print test.docx and select three copies, collate, and color, I get the right result from the job queue. Things get weird when it comes to Microsoft Word. When I use Microsoft Word to print a docx and read the job information from the queue, the number of copies is always one, no matter how many copies the user specifies. The converted PDF also contains only one copy. This means I have no way to achieve my goal (get the PDF and job information such as copies, then send them to my server). Does anyone know how I can get the right number of copies, or at least make it behave like the Microsoft Print to PDF printer (as the PS below describes)? My written English is not good. Thanks!
PS: I have also tested Microsoft Print to PDF. If I select three copies in Word, the number of copies in the job queue is always one. However, the destination PDF file contains three copies (if the docx is one page, the destination PDF is three pages).
It seems like you've asked two questions here, and only one of them relates to Ghostscript. Your first question seems to be about what the Windows print subsystem displays when you print a job, and I can't help you with that. In fact, I doubt anyone other than the developers of the applications (WPS and Word) can tell you why they drive the print subsystem differently.
Your second question seems to be 'why do I only get one copy of the PDF file'. The first question I have to ask, then, is what you expect. Do you expect one PDF file with three copies of the content, or three PDF files each containing one copy of the content?
There are two possible ways to get multiple copies: either send the content three times, or (in the PostScript program) set /#Copies to the number of required copies. I can't tell which approach each application is using, because you have not supplied the PostScript program in either case.
If your problem is that you are getting three copies from WPS and one copy from Word, then my guess would be that WPS is sending the content three times, and Word is sending it once but setting the number of copies to 3. The pdfwrite device in Ghostscript ignores /#Copies and only produces one copy of the content in the output.
You can't change that.
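If the copy count is set inside the PostScript program, one possible workaround is to read it from the captured PostScript before handing the job to Ghostscript, and send that number to the server alongside the PDF. The Python sketch below is a heuristic, not a PostScript interpreter: it assumes the count appears literally as a `/#copies N def` definition or as a `/NumCopies N` page device entry, which may not match what a given driver actually emits. The function name is mine.

```python
import re

# Literal forms the copy count may take in a PostScript job (assumed, see above).
COPIES_PATTERNS = [
    re.compile(rb"/#copies\s+(\d+)\s+def", re.IGNORECASE),
    re.compile(rb"/NumCopies\s+(\d+)"),
]


def requested_copies(ps_bytes, default=1):
    """Return the first copy count found in the PostScript data, else default."""
    for pattern in COPIES_PATTERNS:
        match = pattern.search(ps_bytes)
        if match:
            return int(match.group(1))
    return default
```

It is worth saving the raw PostScript from Word and from WPS once and inspecting both files by hand to confirm which form, if either, is present.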
My customer currently stores his documents, which are single-page automotive forfeits, in a single MS Word document. This method of course generates a huge file that is slow to open, not to mention slow to search.
After a user compiles a document, he may need to print it to sign it by hand. The document is then scanned back in and stored in PDF format. The document may be printed again to be signed a second time by a manager. The doubly signed document is scanned again and saved, overwriting the singly signed one.
The user wants to be able to search the document using a couple of search keys (the doc number and a sort of a SSN). That is the reason they are using a single file, to be able to search in the file using Word's search feature.
I have to propose an IT solution. I was thinking about giving them a software tool that:
reads a pdf form/template; the template rarely changes
shows the template on the screen and allows the user to input his variable fields in the form
some of the fields must be defined as searchable
the user saves only the form fields, not the whole pdf.
the software is able to rebuild a document by coupling the template with the fields. I have to find a way to tie the template to the saved fields, so that the template can change (versioning) without breaking the old documents
the tool allows searching across multiple documents, using the defined search fields
the tool allows printing the document so it can be signed by hand; this is the hard part. Once the document is signed it cannot be changed anymore, but if the document is simply scanned and coupled with the form/fields PDF, then I'll lose the benefit of storing only the data, decoupled from the template. Should I only scan the signature and attach it to the document as an image?
What do you suggest to use?
Adobe XML Forms?
Adobe Forms Data Format?
An already existing software?
Other?
For the existing documents, I want to allow the customer to import his huge MS Word file into the new system.
Thanks.
Sounds like you want a PDF form template that submits data to a database that can be searched.
OTOH, if you just save the PDFs, Acrobat Pro can generate an index file from a directory, and that index can be searched (from Reader?). Yep, you can run searches on an index from Reader, but you can only build one with Acrobat.
I prefer AcroForms to LiveCycle forms myself. There's a lot more software out there that works with 'em. If you go with LiveCycle, you're almost completely locked into Adobe. And Adobe server software is EXPENSIVE.
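The "form data in a searchable database" idea could look roughly like the Python sketch below, using the standard sqlite3 module. Each saved document stores only its field values plus the version of the template it was filled in against, so an old document can always be rebuilt with the template it was created from. The table layout, column names, and function names are assumptions for illustration, not a recommendation of a specific schema.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the document store with indexed search keys."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               template_version TEXT NOT NULL,
               doc_number TEXT NOT NULL,
               person_id TEXT NOT NULL,
               fields_json TEXT NOT NULL)"""
    )
    db.execute("CREATE INDEX IF NOT EXISTS idx_doc ON documents(doc_number)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_person ON documents(person_id)")
    return db


def save_document(db, template_version, doc_number, person_id, fields):
    """Store the field values only; the template itself is kept elsewhere."""
    cur = db.execute(
        "INSERT INTO documents (template_version, doc_number, person_id, fields_json)"
        " VALUES (?, ?, ?, ?)",
        (template_version, doc_number, person_id, json.dumps(fields)),
    )
    db.commit()
    return cur.lastrowid


def find_documents(db, doc_number=None, person_id=None):
    """Search by either key; returns (id, template_version, fields) tuples."""
    clauses, params = [], []
    if doc_number is not None:
        clauses.append("doc_number = ?")
        params.append(doc_number)
    if person_id is not None:
        clauses.append("person_id = ?")
        params.append(person_id)
    where = " AND ".join(clauses) or "1=1"
    rows = db.execute(
        f"SELECT id, template_version, fields_json FROM documents WHERE {where}",
        params,
    )
    return [(i, v, json.loads(f)) for i, v, f in rows]
```

A scanned signed copy could be stored as a file whose path sits in an extra column, which keeps the searchable data separate from the signed image.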