How to Diff PDF/XFA forms in GitLab

To whom it may concern:
My boss and I are on GitLab, and we are having problems trying to diff dynamic PDF files.
Normally, for code files like C# class .cs files, it's easy to double-click and have GitLab highlight the changes made between two different versions.
However, we also create dynamic XFA/PDF files in Adobe LiveCycle, and it can be difficult to tell what has changed, especially if the commit messages are vague. We know people have suggested taking screenshots of the PDF for each version, but you can't diff text or formatting changes on image files.
We tried the program DiffPDF found here:
http://www.qtrac.eu/diffpdf.html
But we found out that it does not work with XFA/dynamic forms.
Does anyone have any suggestions on any possible programs that can diff the actual content on PDFs in GitLab?
Thank you for your time and future advice.

I will agree that it is hard to see what has been changed in a PDF, but if you look at the XDP file instead, you will be able to see what code has been changed, if that is an option for you.
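Since the XDP is plain XML, committing the .xdp source next to the rendered PDF means GitLab's normal text diff will already show what changed between versions. Locally you could also register a git textconv driver so that git diff prints the extracted text of the PDF itself; a rough sketch, assuming pdftotext is installed (this only affects your local git diff, not GitLab's web UI, and pdftotext recovers very little from a dynamic XFA form, which is why the XDP diff is the part that really helps):
# .gitattributes, committed with the repo:
#   *.xdp diff           (force text diffs for the XDP source)
#   *.pdf diff=pdftext   (use a custom diff driver for PDFs)
# local config: convert the PDF to text before diffing;
# the trailing "-" makes pdftotext write to stdout, and git appends the file path, which becomes $0
git config diff.pdftext.textconv 'sh -c "pdftotext -layout \"$0\" -"'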

Related

Replace words/phrases in existing PDF or docx with other words

I am trying to make a dynamic PDF generator as a .NET Core API. I want to take an existing PDF, or .docx file, and edit it so that the current name (John Doe) is replaced with a replaceable token like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this in a Docker environment, so I can easily execute external commands, and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: only characters that already appear in the PDF can be used in the replacement text, because the embedded fonts are subsetted. So if I only use A, B, C as characters, a D in the replacement falls back to Times New Roman (or the default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
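For reference, the rendering step in #1 and #3 is just headless Chrome; the command is roughly the following (file names are placeholders):
google-chrome --headless --disable-gpu --print-to-pdf=letter.pdf letter.html
So producing the PDF itself is not the hard part; keeping an editable source that non-developers can change is.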
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for clearly specifying what you've already tried; it helps a lot in providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft Word to do the docx -> odt conversion; as you know, it's not good at it. Use LibreOffice itself to do this step (see the sketch below). The rest of your process remains the same.
For some documents, LibreOffice does doc -> odt much better, so you can instead work with the DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role and allow your business/marketing teams to have more direct input, simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean you shouldn't expect perfect conversion of arbitrarily complex documents. Starting from a working (even complex) baseline is great.
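For reference, a minimal sketch of the whole #2 pipeline run headlessly, with LibreOffice doing the docx -> odt step (file names are placeholders; it assumes soffice is on the PATH and that the placeholder text survives the conversion as a single run inside content.xml):
# let LibreOffice, not Word, do the docx -> odt conversion
soffice --headless --convert-to odt template.docx
# an .odt is a zip archive; pull out the document body
unzip -o template.odt content.xml
# naive text replacement in the XML
sed -i 's/#NAME_PLACEHOLDER/John Doe/g' content.xml
# put the edited body back into the archive
zip template.odt content.xml
# render the final PDF
soffice --headless --convert-to pdf template.odt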
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.

Workflow / best practices for XLIFF

I am using a command line tool (ng-xi18n) to extract the i18n strings from an Angular 2 app I wrote. The output of this command is a messages.xlf file. Coming from a .po background, and not being familiar with .xlf, I assumed that this file is the equivalent of the .pot file (correct me if I am wrong).
I then assumed that if I want to translate my app, I had to cp messages.xlf messages.de.xlf to have a copy (messages.de.xlf) of the template file (messages.xlf) where I can translate each message into German (hence the .de.xlf).
After translating some dummy texts and running the app, I saw that it worked as expected, so I stopped translating and continued developing the app. After some time, I added more i18n strings and eventually figured that I had to update my template. And this is where things became hard to maintain. I updated the template messages.xlf file, and quickly began wondering how I could get the new strings into my already translated messages.de.xlf file without losing my progress.
When I was developing with .po files, this was no problem thanks to good tools like poEdit, but I didn't find anything comparable for .xlf. After trying some tools, I thought the best choice would be Lokalize, but I didn't find a way to merge the template file into already translated (but outdated) files there either.
Up to now, this was rather an essay than a question, so here's a quick summary:
Is the workflow of dealing with .xlf files really comparable to .po as I initially thought (described above), or is it completely different?
How am I supposed to update my already translated files?
What are the best practices dealing with .xlf files?
What are proof of concept tools to work with .xlf?
Sidenotes:
The Lokalize handbook was not helpful at all. I see a lot of functions that sound promising, like:
"File" > "Update file from template". I did not find anything in the handbook to explain this function. If I click on this, nothing happens.
"Sync" > "Open file for sync/merge". This seems to be a function to merge two similar files (by multiple translators) rather than a tool to update the translation file from a template. Even though there is a tooltip in Lokalize's primary sync tab, notifying me about "x unmatched entries", I just couldn't find anything to append those unmatched entries to my .de.xlf file.
[Update] Turns out, I had similar issues as in this question. After downgrading my version of Lokalize to the suggested one, many issues (including the ones mentioned in the question) disappeared. However, now the "Update file from template" option is greyed out, and I don't know why.
I also tried OmegaT, which does not work at all on my platform (Ubuntu 16.04).
[Update] Virtaal works great for merging new strings from a template, but the UI in general is very poorly designed...
Googling did not help, as every hit seems to be related to XCode or something.
Thanks for any help in advance, I really appreciate it
I wrote a small npm command line tool called xliffmerge.
In principle it does the same thing that Roland Oldengarm does with his gulp tasks, described in his blog article.
It is free and you can have a look at it at https://github.com/martinroob/ngx-i18nsupport#readme
The best workflow automation solution I have seen described so far is from Roland Oldengarm's blog entry "Angular 2: Automated i18n workflow using gulp". To summarize, in a few dozen lines of Gulp code he created the tooling to handle some of the challenges you faced. Specifically it runs ng-xi18n to extract the messages; creates an English translation with sources copied to targets; updates existing translations by adding new trans-units, keeping existing ones, and removing missing ones; and then exposes all xlf files as TypeScript string constants. These last strings can then be imported to supply the bootstrapModule with its translation provider options.
Caveat: I have not used this exact solution (and code) myself, but I was able to expose generated xlf as TypeScript strings and use them in an app in a manner similar to what he described. As for maintaining translations, I have leveraged IntelliJ IDEA (WebStorm) file comparison features and Counterparts Lite (for Mac) for that. My own efforts are still in early stages but are working end to end for an application that is in active development.
Official Angular docs are now updated for Internationalization (i18n) at https://angular.io/docs/ts/latest/cookbook/i18n.html including a section specifically for creating a translation source file with the ng-xi18n tool.

Replace content in a PDF file

The system I'm working with receives PDF documents, and inside those documents there are two clickable images. The click events just trigger an HTTP URL. The thing is that I need to update those two URLs when I receive the document.
So my question is: is it possible to find the events, change the URLs, and then save the file again? Those two images can be anywhere in the document, so I can't look in a specific location.
Edit: I forgot to say that I'm coding in C# so it needs to be a .NET library.
Yes, it's possible.
It's hard to describe how it can be achieved without knowing how the PDFs are constructed (there are a few ways to create the described behavior) and which tools you are going to use.
I just want to tell you how I solved this problem, or rather where I found the solution. I used the code in this thread, and it worked like a charm.

Rule based PDF text extraction for various bills and invoices

I have to extract text from invoice and bill PDF files.
The file layouts can get complex, though they are mostly filled with tables.
I've read a few dozen articles already about the PDF format, how easy it is for our brains to grasp it, and how hard it is for a machine to understand its structure.
I have also downloaded a few tools, like Python's pdfminer and some Java tools; some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
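For example, pdfminer's bundled pdf2txt.py will give you the raw text, or an XML dump of the layout, to write rules against; something like this (file names are just examples):
pdf2txt.py -t xml -o invoice.xml invoice.pdf
But turning that output into clean per-field values is exactly the final step that is left to you.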
Adobe also has an online service called ExportPDF, but it can't be customized.
Bottom line, I understand that in order to extract text from structured PDF files and convert it to XML, for example, there has to be some level of manual work.
I also found Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try converting those files to images and running tesseract-ocr, but I decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience gives me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is going to be to get out. And actually, once a designer starts messing with fonts too much, some programs will sometimes output words one character at a time, changing the font just slightly each time.
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear whether you are looking for a development tool to automate the data extraction from bills and invoices, or just a one-time utility that can be used by a non-developer.
Anyway, here are some specialized tools, including the engines they use:
Tabula (open source, designed specifically to extract data from tables in PDFs. It can export shell scripts for batch processing, runs as a localhost web service, and is powered by the JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDFs and images, based on the Tesseract OCR engine)
Bytescout PDF Viewer (freeware closed-source .NET utility that detects and extracts tables, including scanned invoices, powered by the PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.

Is there a way to access VBA help files from the command line

I'm going to have to write a number of VBA modules for a project I'm working on, and I would prefer to use SciTe over the built-in editor in Office.
SciTe allows you to redirect the effect of hitting F1 to an arbitrary command, with the selected text as an argument. Is there any way of using this functionality to search the relevant .chm files?
I'm guessing not, given that the help for VBA is spread across multiple files, but I'm hoping someone can prove me wrong...
I'm especially interested if anyone can suggest a way to find out which CHM file a particular library's help resides in, just from the fully qualified name of the function.
Another approach is to use the HTML Help command line program HH.EXE to either show specific pages, or to decompile a particular CHM into HTML files.
Go to the folder mentioned by Lunatik in a command window and enter this command:
hh -decompile html vbaac10.chm
(the "ac" in vbaac10 is for Access; use xl for Excel, wd for Word, etc.)
This will create an "html" folder below it and fill it with most of the files that went into creating the CHM file. The resulting HTML files can be opened directly in your browser, although they won't find their related style sheets or scripts which are addressed by their locations in CHM files. The style sheets and scripts do get extracted though so you can work with them too.
Also take a look at the XML files in the 1033 folder like VB_ACTOC.XML - this is the Table of Contents for the Access VBA help. It contains topic nodes with labels and urls for each item in the help file:
<topic>
<label>CheckBox Object</label>
<url>mk:#MSITStore:vbaac10.chm::/html/acobjCheckBox.htm</url>
</topic>
The mk:etc... url can be put on the HH command line to open that topic in a regular HTML Help window. Also, it shows the source CHM filename, and the relative path of the file when decompiled.
hh mk:#MSITStore:vbaac10.chm::/html/acobjCheckBox.htm
Working from these files, you could put together a script to find/grep files by keyword and show them in a browser, or you could reengineer the files into some sort of database or other lookup capability to work with SciTe's command based help system.
Some sites with more info about using HH.EXE:
HTMLHelp command-line
tips on using the HH command line and links to other sites
KeyHH 1.1
an alternate/supplemental program to HH.EXE for working with CHM files
The main files are held (for Office 2003, anyway) in Program Files\OFFICE11\1033, but accessing pages within them could be a bit tricky, as Microsoft have gradually had to rein in the ability to delve into CHM files over the years due to security concerns.
This page (download) has some good info on what might still be possible as far as linking to specific pages inside a CHM
Having said that, I don't think this file is the default help shown to most users nowadays, but it's close enough, missing only the Office 2007 pimping most of the time. The online help seems to be set as default unless you specifically disable it during the Office install. The URLs are, I think, not very SEO friendly so couldn't be guessed. I suppose you could borrow a sneaky trick from scammers and craft URLs that point to the top link on Google, thusly: Range.
EDIT: Google cache link?
Inspired heavily by Lunatik's answer, adding:
command.help.$(file.patterns.vb)=http://www.google.co.uk/search?hl=en&newwindow=1&q=site%3Amsdn.microsoft.com+%222003+VBA%22+$(CurrentWord)
command.help.subsystem.$(file.patterns.vb)=2
to my vb.properties file gives me a reasonable workaround: it loads a Google search results page with search criteria of:
site:msdn.microsoft.com "2003 VBA" $(CurrentWord)
Obviously there's no guarantee of it taking me to a helpful page, but then the inline help in the VBA editor isn't all that reliable on that one either...
Can anyone who knows SciTe better suggest a more elegant solution?